Mnesia internals
I've been studying Mnesia internals lately, and I figured I might as well start publishing my scattered notes. I will keep updating this post with more details.
Disclaimer: I am not an OTP dev, and not a mnesia dev. Everything posted here is derived from reading the code.
Pictures contain clickable links.
1 Transaction
The transaction fun runs in the same process that calls the mnesia:transaction function.
Most of the commit coordination work is also done in the caller process.
This helps mnesia scale, since the work is spread across the calling processes.
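A quick way to check the claim about the caller process is to return self() from the transaction fun and compare it with the caller's pid. This is a minimal sketch; the module and function names (tx_caller, same_process/0) are made up for illustration, and it assumes mnesia can be started on the local node.

```erlang
-module(tx_caller).
-export([same_process/0]).

%% Returns true if the transaction fun ran in the calling process.
%% Hypothetical helper, not part of mnesia.
same_process() ->
    ok = mnesia:start(),
    Caller = self(),
    %% A read-only transaction is enough: we only need the pid of
    %% whatever process evaluates the fun.
    {atomic, Runner} = mnesia:transaction(fun() -> self() end),
    Runner =:= Caller.
```

Calling tx_caller:same_process() from the shell should return true if the fun indeed runs in the shell process.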
1.1 Data structures
1.1.1 What's inside mnesia_activity_state?
mnesia_activity_state is a process dictionary variable in the transaction process.

    mnesia:transaction(fun() ->
                           mnesia:write({bar, 1, 2}),
                           mnesia:write({foo, 3, 3})
                       end).
    ....
    {mnesia,
     #tid{counter = 13, pid = <0.125.0>},
     #tidstore{store = #Ref<0.3426409509.1642725377.222071>,
               up_stores = [],
               level = 1}}

The counter field is the value of the Lamport clock (kept by the mnesia_tm process) at the beginning of the transaction.
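Since this is an ordinary process dictionary entry, it can be read with get/1 from inside the transaction fun. The sketch below does just that; the module name activity_state_peek is hypothetical, and the tuple layout is an internal detail taken from the dump above, so it may differ between OTP releases.

```erlang
-module(activity_state_peek).
-export([peek/0]).

%% Returns {Before, Inside}: the value of mnesia_activity_state in the
%% calling process before the transaction and while the fun is running.
%% Hypothetical helper for poking at mnesia internals.
peek() ->
    ok = mnesia:start(),
    Before = get(mnesia_activity_state),
    {atomic, Inside} =
        mnesia:transaction(fun() -> get(mnesia_activity_state) end),
    {Before, Inside}.
```

Outside of a transaction the first element should be undefined, and the second should be a {mnesia, Tid, TidStore} tuple like the one shown above (records print as plain tuples unless the shell knows their definitions).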
1.1.2 What's inside the tidstore table?
    mnesia:transaction(fun() ->
                           mnesia:write({bar, 1, 2}),
                           mnesia:write({foo, 3, 3})
                       end).
    ....
    [{{bar,1},{bar,1,2},write},
     {{foo,3},{foo,3,3},write},
     {{locks,foo,3},write},
     {{locks,bar,1},write},
     {nodes,'foo@me-emq'},
     {nodes,'bar@me-emq'}]
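A dump like the one above can be obtained by reading the store table from inside the transaction fun. This is a rough sketch under a few assumptions: the store is an ets table readable from the transaction process, the #tidstore{} field order matches the dump in 1.1.1, and a table named foo with the default [key, val] attributes can be created on the local node. The module name tidstore_peek is made up.

```erlang
-module(tidstore_peek).
-export([run/0]).

%% Dumps the transaction store from inside a running transaction.
%% Relies on mnesia internals that may change between OTP releases.
run() ->
    ok = mnesia:start(),
    %% Default attributes are [key, val], so records look like {foo, K, V}.
    %% The result is ignored so that an already existing table is tolerated.
    _ = mnesia:create_table(foo, []),
    mnesia:transaction(
      fun() ->
              ok = mnesia:write({foo, 3, 3}),
              %% #tidstore{store, up_stores, level} matched as a plain tuple.
              {mnesia, _Tid, {tidstore, Store, _UpStores, _Level}} =
                  get(mnesia_activity_state),
              ets:tab2list(Store)
      end).
```

The call returns {atomic, Entries}, where Entries should contain the write op and the lock entries for the record written inside the fun.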
1.2 TODO Transactional reads and writes
1.3 Commit process in detail
The commit procedure also mostly happens in the caller process, which acts as the coordinator.
1.3.1 Arrange
The arrange function is pretty convoluted. Thankfully, it only uses the local data from the transaction store and the schema. It creates a tuple of the following type:

    mnesia:transaction(fun() ->
                           mnesia:write({foo, 1, 2}),
                           mnesia:write({foo, 3, 3})
                       end).
    ....
    {2,
     #prep{protocol = sym_trans,
           records = [#commit{node = 'bar@localhost',
                              decision = presume_commit,
                              ram_copies = [{{foo,1},{foo,1,2},write},
                                            {{foo,3},{foo,3,3},write}],
                              disc_copies = [], disc_only_copies = [], ext = [],
                              schema_ops = []},
                      #commit{node = 'foo@localhost',
                              decision = presume_commit,
                              ram_copies = [{{foo,1},{foo,1,2},write},
                                            {{foo,3},{foo,3,3},write}],
                              disc_copies = [], disc_only_copies = [], ext = [],
                              schema_ops = []}],
           prev_tab = foo,
           prev_types = [{'bar@localhost',ram_copies},
                         {'foo@localhost',ram_copies}],
           prev_snmp = [],
           types = [{'bar@localhost',ram_copies},
                    {'foo@localhost',ram_copies}],
           majority = [],
           sync = false}}
The first element is the number of write/delete ops in the transaction. This number is used to determine whether the transaction is r/o or r/w.
1.3.2 What is stored in mnesia_tm's state?

    mnesia:transaction(fun() ->
                           mnesia:write({foo, 1, 2}),
                           mnesia:write({foo, 3, 3})
                       end).
    ....
    #state{
       coordinators = {0,nil},
       %% Note: this field is a `gb_tree'. So don't mind stuff in the outer tuple
       participants =
           {1,
            {#tid{counter = 32, pid = <11304.125.0>},
             #participant{
                tid = #tid{counter = 32, pid = <11304.125.0>},
                pid = nopid,
                commit = #commit{
                            node = 'bar@localhost', decision = presume_commit,
                            ram_copies = [{{foo,1},{foo,1,2},write},
                                          {{foo,3},{foo,3,3},write}],
                            disc_copies = [], disc_only_copies = [], ext = [],
                            schema_ops = []},
                disc_nodes = [],
                ram_nodes = ['foo@localhost','bar@localhost'],
                protocol = sym_trans},
             nil, nil}},
       supervisor = <0.99.0>, blocked_tabs = [], dirty_queue = [],
       fixed_tabs = []
      }
1.3.3 What's stored in the mnesia log?
The contents of the #commit{} record for the current node are written to the mnesia log:

    mnesia:transaction(fun() ->
                           mnesia:write({foo, 1, 2}),
                           mnesia:write({foo, 3, 3})
                       end).
    ....
    #commit{node = 'bar@localhost',
            decision = presume_commit,
            ram_copies = [{{foo,1},{foo,1,2},write},
                          {{foo,3},{foo,3,3},write}],
            disc_copies = [],
            disc_only_copies = [],
            ext = [],
            schema_ops = []
           }
Note that the commit is only logged on the node that initiated the transaction and the participant disk nodes.
2 TODO Locker
3 TODO Schema
4 TODO Transaction aborts and restarts
5 Dirty writes
5.1 Data structures
5.1.1 What's inside the #prep record?

    mnesia:dirty_write({foo, 1, 1}).
    ....
    #prep{protocol = async_dirty,
          records = [#commit{node = 'bar@localhost',
                             decision = presume_commit,
                             ram_copies = [{{foo,1},{foo,1,1},write}],
                             disc_copies = [], disc_only_copies = [], ext = [],
                             schema_ops = []},
                     #commit{node = 'foo@localhost',
                             decision = presume_commit,
                             ram_copies = [{{foo,1},{foo,1,1},write}],
                             disc_copies = [], disc_only_copies = [], ext = [],
                             schema_ops = []}],
          prev_tab = foo,
          prev_types = [{'bar@localhost',ram_copies},
                        {'foo@localhost',ram_copies}],
          prev_snmp = [],
          types = [{'bar@localhost',ram_copies},
                   {'foo@localhost',ram_copies}],
          majority = [], sync = false}
5.1.2 What is sent to the remote node?
    mnesia:dirty_write({foo, 1, 1}).
    ....
    {<11304.91.0>,
     {async_dirty, {dirty,<11304.91.0>},
      #commit{node = 'foo@me-emq',
              decision = presume_commit,
              ram_copies = [{{foo,1},{foo,1,1},write}],
              disc_copies = [],
              disc_only_copies = [],
              ext = [],
              schema_ops = []},
      foo}}
6 Ext copies
There is an undocumented feature that allows implementing custom mnesia backends.
It's called ext_copies (I guess).
Let's look at how it can be used.
From mnesia_tm:

    do_commit(Tid, Bin, DumperMode) when is_binary(Bin) ->
        do_commit(Tid, binary_to_term(Bin), DumperMode);
    do_commit(Tid, C, DumperMode) ->
        ...
        R2 = do_update(Tid, ram_copies, C#commit.ram_copies, R),
        R3 = do_update(Tid, disc_copies, C#commit.disc_copies, R2),
        R4 = do_update(Tid, disc_only_copies, C#commit.disc_only_copies, R3),
        R5 = do_update_ext(Tid, C#commit.ext, R4),
        ...
        .
    ...
    do_update_ext(_Tid, [], OldRes) -> OldRes;
    do_update_ext(Tid, Ext, OldRes) ->
        case lists:keyfind(ext_copies, 1, Ext) of
            false -> OldRes;
            {_, Ops} ->
                Do = fun({{ext, _, _} = Storage, Op}, R) ->
                             do_update(Tid, Storage, [Op], R)
                     end,
                lists:foldl(Do, OldRes, Ops)
        end.
And mnesia_lib.erl has the following functions inside:

    db_put(ram_copies, Tab, Val) -> ?ets_insert(Tab, Val), ok;
    db_put(disc_copies, Tab, Val) -> ?ets_insert(Tab, Val), ok;
    db_put(disc_only_copies, Tab, Val) -> dets:insert(Tab, Val);
    db_put({ext, Alias, Mod}, Tab, Val) -> Mod:insert(Alias, Tab, Val).

    db_erase(Tab, Key) ->
        db_erase(val({Tab, storage_type}), Tab, Key).
    db_erase(ram_copies, Tab, Key) -> ?ets_delete(Tab, Key), ok;
    db_erase(disc_copies, Tab, Key) -> ?ets_delete(Tab, Key), ok;
    db_erase(disc_only_copies, Tab, Key) -> dets:delete(Tab, Key);
    db_erase({ext, Alias, Mod}, Tab, Key) -> Mod:delete(Alias, Tab, Key), ok.

So mnesia expects the ext field of the commit record to contain an {ext_copies, Ops} entry, where Ops is a list of {{ext, Alias, Module}, Op} tuples.
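To make the calling convention concrete, here is a toy module that provides only the two callbacks visible in the mnesia_lib excerpt: insert/3 and delete/3. It is an illustration, not a working backend: the name toy_ext and the helpers new/0 and dump/0 are hypothetical, the data lives in a single ets table, and a real backend module would have to provide many more callbacks than these two.

```erlang
-module(toy_ext).
-export([new/0, insert/3, delete/3, dump/0]).

%% Creates the single ets table that backs every {Alias, Tab} pair.
%% Hypothetical setup helper; mnesia does not call it.
new() ->
    ?MODULE = ets:new(?MODULE, [named_table, public, set]),
    ok.

%% Same shape as the call in db_put/3: Mod:insert(Alias, Tab, Val).
%% Assumes the key is the second element of the record tuple.
insert(Alias, Tab, Val) ->
    Key = element(2, Val),
    true = ets:insert(?MODULE, {{Alias, Tab, Key}, Val}),
    ok.

%% Same shape as the call in db_erase/3: Mod:delete(Alias, Tab, Key).
delete(Alias, Tab, Key) ->
    true = ets:delete(?MODULE, {Alias, Tab, Key}),
    ok.

%% Returns everything stored so far, sorted; for inspection only.
dump() ->
    lists:sort(ets:tab2list(?MODULE)).
```

With such a module, a storage type of {ext, my_alias, toy_ext} would make db_put/3 end up in toy_ext:insert(my_alias, Tab, Val); my_alias is just a placeholder alias.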
7 TODO Scalability
As should be evident from the above diagram, transaction latency is expected to grow when the number of nodes in the cluster grows. Indeed, we observed this effect in the test with the help of netem.
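For anyone who wants to reproduce this kind of measurement, a rough sketch of a latency probe is shown here. It is not the test setup mentioned above: the module name tx_latency is made up, it assumes a table foo with the default [key, val] attributes can be created (or already exists with that shape), and it only times transactions from the local node with timer:tc/1.

```erlang
-module(tx_latency).
-export([measure/1]).

%% Runs N small write transactions and returns the mean latency in
%% microseconds. Hypothetical helper for rough comparisons only.
measure(N) when is_integer(N), N > 0 ->
    ok = mnesia:start(),
    %% The result is ignored so that an already existing table is tolerated.
    _ = mnesia:create_table(foo, []),
    Write = fun() -> mnesia:write({foo, 1, 1}) end,
    Samples =
        [begin
             {Micros, {atomic, ok}} =
                 timer:tc(fun() -> mnesia:transaction(Write) end),
             Micros
         end || _ <- lists:seq(1, N)],
    lists:sum(Samples) / N.
```

Running the same probe against clusters of different sizes, with the table replicated to every node, would give a first impression of how commit latency changes with the number of participants.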