Introduction to Snabbkaffe
This is an HTML version of snabbkaffe presentation slides.
1 Background and motivation
1.1 What is snabbkaffe
- Snabbkaffe is a library that allows to test concurrent and distributed systems
- It does so by moving focus from states to effects
- Developers find bugs by looking at the logs, snabbkaffe does the same and automates the process
- Advanced modes of testing: fault and scheduling injection
- Efficiency: run test scenario once, verify multiple properties
1.2 Name
Snabbkaffe
means ``instant coffee'' in Swedish- There is no hidden meaning, the name was chosen randomly
1.3 Nondeterminism
- Definitions
- Deterministic functions always produce the same output for the same input
- Nondeterministic functions can produce different outputs for the same input
- Sources of nondeterminism
- Process schedulings
- Fault-tolerance (failovers)
- Network (packet loss and reordering)
- Load balancing, worker pools
Nondeterminism is an inherent property of concurrent systems
1.4 Why change the paradigm of testing?
- Difficulty levels
- Easy: Pure functions
- Normal: Stateful systems
- Hard: Nondeterministic systems
- Nightmare: Nondeterministic fault-tolerant distributed systems
- Traditional approach
- Traditional approach is sufficient for testing pure functions. There are a variety of excellent tools for this, like PropER
- It can do pretty well with stateful systems (stateful PropER)
- Where it starts to crack: typical Erlang application assembled together
1.5 Challenges of testing Erlang applications in situ
Erlang partitions the state of the system between processes
- It eliminates some concurrency bugs and isolates failures
- But it makes the system harder to analyze
- More often than not it's impossible to inspect state of the process (
sys:get_state
) - Often it's useless
- Tests that rely on the state usually need to wait for the system to stabilize, it can be slow, and they can't analyze the system in motion
- Asynchronous data processing patterns (e.g. job queues) are incredibly annoying to test using traditional methods
1.6 Possible solution to this problem?
Move away from states and embrace events and their temporal relationships
2 Instrumenting the code
2.1 Tracing macros
- Replace your regular logs with structured logs (the industry is moving towards structured logs anyway)
- Include
"snabbkaffe/include/trace.hrl"
header to your module - Use
?tp
or?tp_span
macros for logging (tp
stands for Trace Point)- In the release build these macros will become regular
logger
messages - In the test build these macros will emit trace events
- In the release build these macros will become regular
2.2 Examples
%% Disappears in the release build: ?tp(rlog_replica_import_trans, #{ seqno => SeqNo , txid => TXID , transaction => Transaction })
%% Becomes a notice message ?tp(notice, "Remote process died", #{ reason => Reason , my_state => State })
%% Emits two events: ?tp_span(trans_write, #{record => Record, txid => get_txid()}, mnesia:write(Record)).
2.3 Distributed tracing
Forward traces from the remote node:
%% on the ct_master: snabbkaffe:forward_trace(Node)
All features (incl. fault and scheduling injections) will work
2.4 Why not using erlang tracing?
An often asked question: why not using dbg:tracer
or the like?
It was a conscious design choice:
- Traces obtained this way are tightly bound to the structure of the code. Refactoring the code or adding a new function argument would break the tests
- Debug traces are excessive, it would be hard to see the forest through the trees
- Log messages are typically placed in ``interesting'' places
- Interesting for humans = interesting for tests
3 Testing
3.1 Stages of the test
Every testcase is split in two stages:
3.2 check_trace macro
my_test() -> ?check_trace(begin %% Run stage: launch_system(), do_stuff(), wait_for_completion(), get_result() end, fun(Result, Trace) -> %% Check stage: ?assertMatch(ok, Result), check_history(Trace), check_history2(Trace), ... end).
3.3 Waiting for events
It is possible to block run stage until a certain event occurs:
?block_until(#{ ?snk_kind := message_acked , message_id := Id } when Id > 42, Timeout, BackInTime)
This macro will return immediately as soon as the event happens, so it's more efficient and less prone to flakiness than a sleep.
3.4 Starting an asynchronous action
?block_until
is fine, but it requires timeout tuning.
What if the same event happened in the past?
?wait_async_action( send_async_request(Req) , #{ ?snk_kind := request_handled , request := Req } [, Timeout] )
This macro doesn't have to look into the past.
3.5 Integrating with PropER
It's not a problem to combine stateless PropER tests with snabbkaffe.
?check_trace
macro can be used inside proper's ?FORALL
macro.
Snabbkaffe provides some convenience macros that simplify trace-property-based testing:
trace_prop_test(Config) -> Prop = ?forall_trace( X, list(), begin %% Run stage: do_stuff(X) end, fun(Result, Trace) -> %% Check stage true end), ?run_prop(Config, Prop).
3.6 Misc. macros
There are a few macros that are not directly related to trace-based testing, but often needed:
- ?panic
?panic("Unexpected result", #{result => A})
Throw an error
- ?retry
?retry(_RetryPeriod = 100, _MaxRetries = 10, some_action_that_can_fail())
Run some code periodically until it succeeds (use as the last resort, when
?block_until
won't work) - ?give_or_take
?give_or_take(_Expected = 20, _Deviation = 2, Value)
Check that an integer value is within range
4 Basic functions for working with traces
4.1 Structure of the trace event
[#{ ?snk_kind => foo , ?snk_meta => #{ domain => [foo, bar] , node => 'foo@localhost' , time => <monotonic time> , pid => <0.343.0> , group_leader => <0.67.0> , custom_field1 => baz , custom_field2 => foobar } , field1 => 42 , field2 => foo }, ... ]
4.2 Structure of the span trace
[#{ ?snk_kind => foo , ?snk_meta => #{ ... } , ?snk_span => start , field1 => 42 , field2 => foo }, #{ ?snk_kind => foo , ?snk_meta => #{ ... } , ?snk_span => {complete, ReturnValue} , field1 => 42 , field2 => foo } ]
4.3 Filtering the trace
4.4 Extracting fields
[1, 2] = ?projection(id, ?of_kind(handle_message, Trace))
[{1, "hello"}, {2, "world"}] = ?projection( [id, message] , ?of_kind(handle_message, Trace) )
5 Event correlations
5.1 ?strict_causality macro
?strict_causality( #{?snk_kind := msg_received, id := _Id} , #{?snk_kind := msg_processed, id := _Id} , Trace )
Return value:
true
if some matching events were foundfalse
if no events were found- exception if causality is violated
5.2 ?strict_causality with guard
Suppose we're testing a "base64 server":
?strict_causality( #{req := _Req} , #{resp := _Resp} , _Resp =:= base64:encode(_Req) , Trace )
5.3 ?causality macro
Overall, it's the same as ?strict_causality
, except it doesn't require each ``cause'' to have an ``effect''
?causality( #{?snk_kind := msg_received, id := _Id} , #{?snk_kind := msg_processed, id := _Id} [, Guard] , Trace )
5.4 Pitfalls
There is a potential problem with ``causality'' macros:
- Format of the event is changed in the code
- Match expressions in causality stop matching the events
- Tests still pass, even though they didn't find any events
Solution:
Always wrap ?causality
and ?strict_causality
in ?assert
, unless there are legit situations when no events can be produced in the test
5.5 ?find_pairs macro
?find_pairs(Strict, MatchCause, MatchEffect [, Guard] , Trace)
- It returns a list of
{pair, Cause, Effect}
or{singleton, Cause}
- When
Strict
istrue
this macro also checks that effects don't occur before causes (much like?causality
)
5.6 Splitting traces
Often it is useful to split traces to parts before and after some event (for example, restart)
There are a variety of macros for this:
6 Nemesis
6.1 Are your supervisors tested?
- Taking fault-tolerance seriously is one of the selling points of Erlang. Organizing processes in supervisor trees is used widely, but not often tested
- Tuning supervisor trees is an art
- Snabbkaffe wants to turn it into a chore
- It does so by injecting deliberate faults into the system
6.2 Fault injection
Any tracepoint can be used to inject errors into the system
?inject_crash( #{?snk_meta := #{domain := [ekka, rlog|_]}} , snabbkaffe_nemesis:random_crash(0.1) )
- First argument: event matching expression
- Second argument: fault scenario
6.3 Fault scenarios
- Always crash:
?inject_crash(..., snabbkaffe_nemesis:always_crash())
- Crash
N
times:
?inject_crash(..., snabbkaffe_nemesis:recover_after(10))
- Crash randomly with probability
P
:
?inject_crash(..., snabbkaffe_nemesis:random_crash(0.1))
- Crash periodically:
snabbkaffe_nemesis:periodic_crash( _Period = 10, _DutyCycle = 0.5, _Phase = math:pi())
6.4 Manipulating process schedulings
- Sometimes it is necessary to test a certain process scheduling
- Imagine working on a bugfix using TDD-style
- For example, normally effect
foo
occurs after effectbar
, and everything works. But in rare cases the opposite happens, and the bug occurs - Snabbkaffe can manipulate process schedulings to some extend:
%% run stage... ?force_ordering(#{?snk_kind := bar}, #{?snk_kind := foo})
(It also supports guards)
7 Using snabbkaffe as a benchmark library
7.1 Disclaimer
- This usecase is a secondary, and quite rudimentary. Don't expect much
- It lacks many features of proper benchmarking libraries, such as
- Warmup
- Advanced statistical analysis, e.g. outlier detection
7.2 Reporting metrics
%% Report a single scalar datapoint: snabbkaffe:push_stat(scalar_metric_name, Value), %% Report a single datapoint with X coord: snabbkaffe:push_stat(metric_name, X, Value), %% Report multiple scalar datapoints: snabbkaffe:push_stats(metric_name, [Val1, Val2, ...]), %% Report multiple datapoints with X coord: snabbkaffe:push_stats(metric_name, X, [Val1, Val2, ...])
7.3 Using spans to analyze performance
push_stats
function also works together with ?find_pairs
macro:
Pairs = ?find_pairs( #{ ?snk_span := start , ?snk_kind := foo , pid := _Pid } , #{ ?snk_span := {complete, _} , ?snk_kind := foo , pid := _Pid } ), snabbkaffe:push_stats(metric_name, [X,] Pairs)
7.4 Analyzing the results
Use the following function in the end of check stage:
snabbkaffe:analyze_statistics()
It will print the results in the console:
Mean scalar_metric_name: 10.0 Statisitics of metric_name 100.479087 ^ * | * | * | | * | * 0 +----------------------------------------> 0 1100
8 Conclusion
8.1 Useful links
https://github.com/kafka4beam/snabbkaffe
The library is being actively developed.
Suggestions, bug reports and patches are welcome!