
Mnesia internals

I've been studying Mnesia internals lately, and I figured I might as well start publishing my scattered notes. I will keep updating this post with more details.

Disclaimer: I am not an OTP dev, and not a mnesia dev. Everything posted here is derived from reading the code.

Pictures contain clickable links.

1 Transaction

The transaction fun runs in the same process that calls mnesia:transaction. Most of the commit coordination work is also done in the caller process. This helps mnesia scale, since most of the work is not funneled through a single server process.
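One way to check the first claim is to compare self() inside and outside the fun. This is a minimal sketch: the module name same_process_demo is made up for illustration, and it assumes mnesia can be started on the local node with the default RAM-only schema.

```erlang
-module(same_process_demo).
-export([run/0]).

%% Returns true if the transaction fun ran in the calling process.
run() ->
    %% Assumption: a RAM-only schema on the local node is good enough here.
    ok = mnesia:start(),
    Caller = self(),
    %% The fun does no table access; it only reports which process runs it.
    {atomic, Pid} = mnesia:transaction(fun() -> self() end),
    Pid =:= Caller.
```

Calling same_process_demo:run() from the shell should return true if the fun really is executed by the caller.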

[Figure: transaction overview diagram]

1.1 Data structures

1.1.1 What's inside mnesia_activity_state?

mnesia_activity_state is an entry in the process dictionary of the transaction process.

mnesia:transaction(fun() -> mnesia:write({bar, 1, 2}), mnesia:write({foo, 3, 3}) end).
....

{mnesia,
 #tid{counter = 13, pid = <0.125.0>},
 #tidstore{store = #Ref<0.3426409509.1642725377.222071>,
           up_stores = [],
           level = 1
           }}

counter is the value of the Lamport clock (maintained by the mnesia_tm process) at the beginning of the transaction.
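The tuple can be read straight from the process dictionary while the transaction is running. The sketch below does that; the module name activity_state_demo is hypothetical, and the exact shape of the tuple is an internal detail that may differ between OTP releases.

```erlang
-module(activity_state_demo).
-export([peek/0]).

%% Returns the value stored under mnesia_activity_state during a transaction.
peek() ->
    %% Assumption: a RAM-only schema on the local node is good enough here.
    ok = mnesia:start(),
    {atomic, State} =
        mnesia:transaction(fun() -> get(mnesia_activity_state) end),
    %% Note: by the time this returns, the transaction is over, so the
    %% store referenced inside the tuple may no longer exist.
    State.
```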

1.1.2 What's inside the tidstore table?

mnesia:transaction(fun() -> mnesia:write({bar, 1, 2}), mnesia:write({foo, 3, 3}) end).
....


[{{bar,1},{bar,1,2},write},
 {{foo,3},{foo,3,3}, write},
 {{locks,foo,3}, write},
 {{locks,bar,1}, write},
 {nodes, 'foo@me-emq'},
 {nodes, 'bar@me-emq'}]
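A dump like the one above can be obtained by listing the store from inside the transaction fun. This is a sketch under a few assumptions: the module name tidstore_demo is made up, the table bar is a throwaway local ram_copies table with the default attributes, and the store is picked out of the internal #tidstore{} record by position, which may break in other OTP releases.

```erlang
-module(tidstore_demo).
-export([dump/0]).

%% Returns the contents of the transaction store after a single write.
dump() ->
    ok = mnesia:start(),
    %% Assumption: default options give a ram_copies table on this node
    %% with attributes [key, val], which fits {bar, Key, Val} records.
    case mnesia:create_table(bar, []) of
        {atomic, ok} -> ok;
        {aborted, {already_exists, bar}} -> ok
    end,
    {atomic, Items} =
        mnesia:transaction(
          fun() ->
                  ok = mnesia:write({bar, 1, 2}),
                  {mnesia, _Tid, Ts} = get(mnesia_activity_state),
                  %% #tidstore{} is internal; element 2 is the store field.
                  Store = element(2, Ts),
                  ets:tab2list(Store)
          end),
    Items.
```

On a single node the list should contain the write itself, the lock entry and a nodes entry for the local node.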

1.2 TODO Transactional reads and writes

[Figure: transactional reads and writes diagram]

1.3 Commit process in detail

The commit procedure also runs mostly in the caller process, which acts as the coordinator.

[Figure: commit process diagram]

1.3.1 Arrange

The arrange function is pretty convoluted. Thankfully, it only uses local data from the transaction store and the schema. It returns a tuple like the following:

mnesia:transaction(fun() -> mnesia:write({foo, 1, 2}), mnesia:write({foo, 3, 3}) end).
....

{2,
 #prep{protocol = sym_trans,
       records = [#commit{node = 'bar@localhost',
                          decision = presume_commit,
                          ram_copies = [{{foo,1},{foo,1,2},write},
                                        {{foo,3},{foo,3,3},write}],
                          disc_copies = [],disc_only_copies = [],ext = [],
                          schema_ops = []},
                  #commit{node = 'foo@localhost',decision = presume_commit,
                          ram_copies = [{{foo,1},{foo,1,2},write},
                                        {{foo,3},{foo,3,3},write}],
                          disc_copies = [],disc_only_copies = [],ext = [],
                          schema_ops = []}],
       prev_tab = foo,
       prev_types = [{'bar@localhost',ram_copies},
                     {'foo@localhost',ram_copies}],
       prev_snmp = [],
       types = [{'bar@localhost',ram_copies},
                {'foo@localhost',ram_copies}],
       majority = [],
       sync = false}}

The first element is the number of write/delete ops in the transaction. This number is used to determine whether the transaction is read-only or read-write.
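The counting rule can be modelled on the store contents shown in section 1.1.2. The code below is a simplified sketch and not the code from mnesia_tm: the module arrange_sketch and both functions are hypothetical, and the set of ops that count as writes (write, delete, delete_object) is an assumption.

```erlang
-module(arrange_sketch).
-export([count_ops/1, classify/1]).

%% Counts the store entries that look like {{Tab, Key}, Value, Op}.
%% Lock entries and {nodes, Node} entries are 2-tuples and are skipped
%% by the generator pattern.
count_ops(StoreItems) ->
    length([Op || {{_Tab, _Key}, _Val, Op} <- StoreItems,
                  lists:member(Op, [write, delete, delete_object])]).

%% A transaction with no write/delete ops is treated as read-only.
classify(StoreItems) ->
    case count_ops(StoreItems) of
        0 -> read_only;
        _ -> read_write
    end.
```

Fed with the tidstore dump from section 1.1.2, count_ops/1 yields 2, so that transaction is classified as read_write.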

1.3.2 What is stored in the mnesia_tm's state?

mnesia:transaction(fun() -> mnesia:write({foo, 1, 2}), mnesia:write({foo, 3, 3}) end).
....

#state{
    coordinators = {0,nil},
    participants = %% Note: this field is a gb_tree ({Size, {Key, Value, Smaller, Bigger}}), so ignore the outer tuples
        {1,
         {#tid{counter = 32,pid = <11304.125.0>},
          #participant{
              tid = #tid{counter = 32,pid = <11304.125.0>},
              pid = nopid,
              commit =
                  #commit{
                      node = 'bar@localhost',decision = presume_commit,
                      ram_copies =
                          [{{foo,1},{foo,1,2},write},{{foo,3},{foo,3,3},write}],
                      disc_copies = [],disc_only_copies = [],ext = [],
                      schema_ops = []},
              disc_nodes = [],
              ram_nodes = ['foo@localhost','bar@localhost'],
              protocol = sym_trans},
          nil,nil}},
    supervisor = <0.99.0>,blocked_tabs = [],dirty_queue = [],
    fixed_tabs = []
  }

1.3.3 What's stored in the mnesia log?

The contents of the #commit{} record for the current node are written to the mnesia log:

mnesia:transaction(fun() -> mnesia:write({foo, 1, 2}), mnesia:write({foo, 3, 3}) end).
....

#commit{node = 'bar@localhost',
        decision = presume_commit,
        ram_copies = [{{foo,1},{foo,1,2},write},
                      {{foo,3},{foo,3,3},write}],
        disc_copies = [],
        disc_only_copies = [],
        ext = [],
        schema_ops = []
       }

Note that the commit is only logged on the node that initiated the transaction and on the participating disc nodes.

2 TODO Locker

3 TODO Schema

4 TODO Transaction aborts and restarts

5 Dirty writes

[Figure: dirty write diagram]

5.1 Data structures

5.1.1 What's inside #prep record?

mnesia:dirty_write({foo, 1, 1}).
....

#prep{protocol = async_dirty,
      records = [#commit{node = 'bar@localhost',
                         decision = presume_commit,
                         ram_copies = [{{foo,1},{foo,1,1},write}],
                         disc_copies = [],disc_only_copies = [],ext = [],
                         schema_ops = []},
                 #commit{node = 'foo@localhost',decision = presume_commit,
                         ram_copies = [{{foo,1},{foo,1,1},write}],
                         disc_copies = [],disc_only_copies = [],ext = [],
                         schema_ops = []}],
      prev_tab = foo,
      prev_types = [{'bar@localhost',ram_copies},
                    {'foo@localhost',ram_copies}],
      prev_snmp = [],
      types = [{'bar@localhost',ram_copies},
               {'foo@localhost',ram_copies}],
      majority = [],sync = false}

5.1.2 What is sent to the remote node?

mnesia:dirty_write({foo, 1, 1}).
....

{<11304.91.0>,
 {async_dirty,{dirty,<11304.91.0>},
              #commit{node = 'foo@me-emq',
                      decision = presume_commit,
                      ram_copies = [{{foo,1},{foo,1,1},write}],
                      disc_copies = [],
                      disc_only_copies = [],
                      ext = [],
                      schema_ops = []},
              foo}}

6 Ext copies

There is an undocumented feature that allows implementing custom mnesia backends. It's called ext_copies (I guess). Let's look at how it can be used.

From mnesia_tm:

do_commit(Tid, Bin, DumperMode) when is_binary(Bin) ->
    do_commit(Tid, binary_to_term(Bin), DumperMode);
do_commit(Tid, C, DumperMode) ->
    ...
    R2 = do_update(Tid, ram_copies, C#commit.ram_copies, R),
    R3 = do_update(Tid, disc_copies, C#commit.disc_copies, R2),
    R4 = do_update(Tid, disc_only_copies, C#commit.disc_only_copies, R3),
    R5 = do_update_ext(Tid, C#commit.ext, R4),
    ...
    .

...

do_update_ext(_Tid, [], OldRes) -> OldRes;
do_update_ext(Tid, Ext, OldRes) ->
    case lists:keyfind(ext_copies, 1, Ext) of
      false -> OldRes;
      {_, Ops} ->
        Do = fun({{ext, _, _} = Storage, Op}, R) ->
                 do_update(Tid, Storage, [Op], R)
             end,
        lists:foldl(Do, OldRes, Ops)
    end.

mnesia_lib.erl contains the following functions:

db_put(ram_copies, Tab, Val) -> ?ets_insert(Tab, Val), ok;
db_put(disc_copies, Tab, Val) -> ?ets_insert(Tab, Val), ok;
db_put(disc_only_copies, Tab, Val) -> dets:insert(Tab, Val);
db_put({ext, Alias, Mod}, Tab, Val) ->
    Mod:insert(Alias, Tab, Val).


db_erase(Tab, Key) ->
    db_erase(val({Tab, storage_type}), Tab, Key).
db_erase(ram_copies, Tab, Key) -> ?ets_delete(Tab, Key), ok;
db_erase(disc_copies, Tab, Key) -> ?ets_delete(Tab, Key), ok;
db_erase(disc_only_copies, Tab, Key) -> dets:delete(Tab, Key);
db_erase({ext, Alias, Mod}, Tab, Key) ->
    Mod:delete(Alias, Tab, Key),
    ok.

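So mnesia expects the ext field of the commit record to contain an {ext_copies, Ops} entry, where Ops is a list of {{ext, Alias, Module}, Op} tuples. Each op is then dispatched to Module, e.g. db_put calls Module:insert(Alias, Tab, Val) and db_erase calls Module:delete(Alias, Tab, Key).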
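To make the callback shapes concrete, here is a toy module that implements only the two calls seen in the mnesia_lib excerpt, insert/3 and delete/3, on top of ETS. It is a sketch and not a working backend: the module name toy_backend and the ensure_table/1 helper are made up, and a real backend most likely has to provide many more callbacks than these two.

```erlang
-module(toy_backend).
-export([ensure_table/1, insert/3, delete/3]).

%% Hypothetical helper (not part of any mnesia API): creates the named
%% ETS table that stands in for the external storage.
ensure_table(Tab) ->
    case ets:whereis(Tab) of
        undefined ->
            %% keypos 2: records look like {Tab, Key, ...}.
            ets:new(Tab, [named_table, public, set, {keypos, 2}]);
        _Tid ->
            Tab
    end.

%% Same shape as the Mod:insert(Alias, Tab, Val) call in db_put.
insert(_Alias, Tab, Val) ->
    true = ets:insert(Tab, Val),
    ok.

%% Same shape as the Mod:delete(Alias, Tab, Key) call in db_erase.
delete(_Alias, Tab, Key) ->
    true = ets:delete(Tab, Key),
    ok.
```

The Alias argument is ignored here; presumably a real backend would use it to tell apart several storage flavours served by the same module.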

7 TODO Scalability

As should be evident from the commit diagram above, transaction latency is expected to grow with the number of nodes in the cluster. Indeed, we observed this effect in tests that used netem for network emulation.

Date: 2021-04-26