Introduction to Snabbkaffe

This is an HTML version of snabbkaffe presentation slides.

1 Background and motivation

1.1 What is snabbkaffe

Snabbkaffe is a library that allows to test concurrent and distributed systems
It does so by moving focus from states to effects
Developers find bugs by looking at the logs, snabbkaffe does the same and automates the process
Advanced modes of testing: fault and scheduling injection
Efficiency: run test scenario once, verify multiple properties

1.2 Name

Snabbkaffe means ``instant coffee'' in Swedish
There is no hidden meaning, the name was chosen randomly

1.3 Nondeterminism

Definitions
- Deterministic functions always produce the same output for the same input
- Nondeterministic functions can produce different outputs for the same input
Sources of nondeterminism
- Process schedulings
- Fault-tolerance (failovers)
- Network (packet loss and reordering)
- Load balancing, worker pools
Nondeterminism is an inherent property of concurrent systems

1.4 Why change the paradigm of testing?

Difficulty levels
1. Easy: Pure functions
2. Normal: Stateful systems
3. Hard: Nondeterministic systems
4. Nightmare: Nondeterministic fault-tolerant distributed systems
Traditional approach
- Traditional approach is sufficient for testing pure functions. There are a variety of excellent tools for this, like PropER
- It can do pretty well with stateful systems (stateful PropER)
- Where it starts to crack: typical Erlang application assembled together

1.5 Challenges of testing Erlang applications in situ

Erlang partitions the state of the system between processes

It eliminates some concurrency bugs and isolates failures
But it makes the system harder to analyze
More often than not it's impossible to inspect state of the process (sys:get_state)
Often it's useless
Tests that rely on the state usually need to wait for the system to stabilize, it can be slow, and they can't analyze the system in motion
Asynchronous data processing patterns (e.g. job queues) are incredibly annoying to test using traditional methods

1.6 Possible solution to this problem?

Move away from states and embrace events and their temporal relationships

2 Instrumenting the code

2.1 Tracing macros

Replace your regular logs with structured logs (the industry is moving towards structured logs anyway)
Include "snabbkaffe/include/trace.hrl" header to your module
Use ?tp or ?tp_span macros for logging (tp stands for Trace Point)
- In the release build these macros will become regular logger messages
- In the test build these macros will emit trace events

2.2 Examples

%% Disappears in the release build:
?tp(rlog_replica_import_trans,
    #{ seqno       => SeqNo
     , txid        => TXID
     , transaction => Transaction
     })

%% Becomes a notice message
?tp(notice, "Remote process died",
    #{ reason   => Reason
     , my_state => State
     })

%% Emits two events:
?tp_span(trans_write, #{record => Record, txid => get_txid()},
         mnesia:write(Record)).

2.3 Distributed tracing

Forward traces from the remote node:

%% on the ct_master:
snabbkaffe:forward_trace(Node)

All features (incl. fault and scheduling injections) will work

2.4 Why not using erlang tracing?

An often asked question: why not using dbg:tracer or the like?

It was a conscious design choice:

Traces obtained this way are tightly bound to the structure of the code. Refactoring the code or adding a new function argument would break the tests
Debug traces are excessive, it would be hard to see the forest through the trees
Log messages are typically placed in ``interesting'' places
- Interesting for humans = interesting for tests

3 Testing

3.1 Stages of the test

Every testcase is split in two stages:

Run stage

where the program runs and emits the event trace
Check stage

where the trace is collected as a list of Erlang maps and validated against the spec(s)

3.2 check_trace macro

my_test() ->
  ?check_trace(begin
                 %% Run stage:
                 launch_system(),
                 do_stuff(),
                 wait_for_completion(),
                 get_result()
               end,
               fun(Result, Trace) ->
                  %% Check stage:
                  ?assertMatch(ok, Result),
                  check_history(Trace),
                  check_history2(Trace),
                  ...
               end).

3.3 Waiting for events

It is possible to block run stage until a certain event occurs:

?block_until(#{ ?snk_kind  := message_acked
              , message_id := Id
              } when Id > 42,
             Timeout, BackInTime)

This macro will return immediately as soon as the event happens, so it's more efficient and less prone to flakiness than a sleep.

3.4 Starting an asynchronous action

?block_until is fine, but it requires timeout tuning. What if the same event happened in the past?

?wait_async_action( send_async_request(Req)
                  , #{ ?snk_kind := request_handled
                     , request := Req
                     }
                 [, Timeout]
                  )

This macro doesn't have to look into the past.

3.5 Integrating with PropER

It's not a problem to combine stateless PropER tests with snabbkaffe. ?check_trace macro can be used inside proper's ?FORALL macro.

Snabbkaffe provides some convenience macros that simplify trace-property-based testing:

trace_prop_test(Config) ->
  Prop = ?forall_trace(
            X, list(),
            begin
              %% Run stage:
              do_stuff(X)
            end,
            fun(Result, Trace) ->
                %% Check stage
                true
            end),
  ?run_prop(Config, Prop).

3.6 Misc. macros

There are a few macros that are not directly related to trace-based testing, but often needed:

?panic

?panic("Unexpected result", #{result => A})

Throw an error

?retry
```
?retry(_RetryPeriod = 100, _MaxRetries = 10,
       some_action_that_can_fail())
```
Run some code periodically until it succeeds (use as the last resort, when ?block_until won't work)

?give_or_take

?give_or_take(_Expected = 20, _Deviation = 2, Value)

Check that an integer value is within range

4 Basic functions for working with traces

4.1 Structure of the trace event

[#{ ?snk_kind => foo
  , ?snk_meta => #{ domain => [foo, bar]
                  , node => 'foo@localhost'
                  , time => <monotonic time>
                  , pid => <0.343.0>
                  , group_leader => <0.67.0>
                  , custom_field1 => baz
                  , custom_field2 => foobar
                  }
  , field1 => 42
  , field2 => foo
  },
  ...
]

4.2 Structure of the span trace

[#{ ?snk_kind => foo
  , ?snk_meta => #{ ... }
  , ?snk_span => start
  , field1 => 42
  , field2 => foo
  },
 #{ ?snk_kind => foo
  , ?snk_meta => #{ ... }
  , ?snk_span => {complete, ReturnValue}
  , field1 => 42
  , field2 => foo
  }
]

4.3 Filtering the trace

?of_kind macro

FooEvents = ?of_kind(foo, Trace)

FooOrBarEvents = ?of_kind([foo, bar], Trace)

?of_domain macro

EMQXEvents = ?of_domain([emqx|_], Trace)

?of_node macro

NodeFooEvents = ?of_node('foo@localhost', Trace)

More complex filtering

Of course, lists:filter and list comprehensions work too

4.4 Extracting fields

[1, 2] = ?projection(id, ?of_kind(handle_message, Trace))

[{1, "hello"}, {2, "world"}] =
     ?projection( [id, message]
                , ?of_kind(handle_message, Trace)
                )

5 Event correlations

5.1 ?strict_causality macro

?strict_causality( #{?snk_kind := msg_received, id := _Id}
                 , #{?snk_kind := msg_processed, id := _Id}
                 , Trace
                 )

Return value:

true if some matching events were found
false if no events were found
exception if causality is violated

5.2 ?strict_causality with guard

Suppose we're testing a "base64 server":

?strict_causality( #{req := _Req}
                 , #{resp := _Resp}
                 , _Resp =:= base64:encode(_Req)
                 , Trace
                 )

5.3 ?causality macro

Overall, it's the same as ?strict_causality, except it doesn't require each ``cause'' to have an ``effect''

?causality( #{?snk_kind := msg_received, id := _Id}
          , #{?snk_kind := msg_processed, id := _Id}
         [, Guard]
          , Trace
          )

5.4 Pitfalls

There is a potential problem with ``causality'' macros:

Format of the event is changed in the code
Match expressions in causality stop matching the events
Tests still pass, even though they didn't find any events

Solution:

Always wrap ?causality and ?strict_causality in ?assert, unless there are legit situations when no events can be produced in the test

5.5 ?find_pairs macro

?find_pairs(Strict, MatchCause, MatchEffect [, Guard] , Trace)

It returns a list of {pair, Cause, Effect} or {singleton, Cause}
When Strict is true this macro also checks that effects don't occur before causes (much like ?causality)

5.6 Splitting traces

Often it is useful to split traces to parts before and after some event (for example, restart)

There are a variety of macros for this:

split_at

?split_trace_at(bar, [foo, bar, baz, bar]).
{[foo], [bar, baz, bar]}.

splitr

?splitr_trace(foo, [1, 2, foo, 1, 2, foo])
[[1, 2], [foo, 1, 2], [foo]].

splitl

?splitl_trace(foo, [1, 2, foo, 1, 2, foo]).
[[1, 2, foo], [1, 2, foo]].

6 Nemesis

6.1 Are your supervisors tested?

Taking fault-tolerance seriously is one of the selling points of Erlang. Organizing processes in supervisor trees is used widely, but not often tested
Tuning supervisor trees is an art
Snabbkaffe wants to turn it into a chore
It does so by injecting deliberate faults into the system

6.2 Fault injection

Any tracepoint can be used to inject errors into the system

?inject_crash( #{?snk_meta := #{domain := [ekka, rlog|_]}}
             , snabbkaffe_nemesis:random_crash(0.1)
             )

First argument: event matching expression
Second argument: fault scenario

6.3 Fault scenarios

Always crash:

?inject_crash(..., snabbkaffe_nemesis:always_crash())

Crash N times:

?inject_crash(..., snabbkaffe_nemesis:recover_after(10))

Crash randomly with probability P:

?inject_crash(..., snabbkaffe_nemesis:random_crash(0.1))

Crash periodically:

snabbkaffe_nemesis:periodic_crash(
  _Period = 10, _DutyCycle = 0.5, _Phase = math:pi())

6.4 Manipulating process schedulings

Sometimes it is necessary to test a certain process scheduling
Imagine working on a bugfix using TDD-style
For example, normally effect foo occurs after effect bar, and everything works. But in rare cases the opposite happens, and the bug occurs
Snabbkaffe can manipulate process schedulings to some extend:

%% run stage...
?force_ordering(#{?snk_kind := bar}, #{?snk_kind := foo})

(It also supports guards)

7 Using snabbkaffe as a benchmark library

7.1 Disclaimer

This usecase is a secondary, and quite rudimentary. Don't expect much
It lacks many features of proper benchmarking libraries, such as
- Warmup
- Advanced statistical analysis, e.g. outlier detection

7.2 Reporting metrics

%% Report a single scalar datapoint:
snabbkaffe:push_stat(scalar_metric_name, Value),
%% Report a single datapoint with X coord:
snabbkaffe:push_stat(metric_name, X, Value),
%% Report multiple scalar datapoints:
snabbkaffe:push_stats(metric_name, [Val1, Val2, ...]),
%% Report multiple datapoints with X coord:
snabbkaffe:push_stats(metric_name, X, [Val1, Val2, ...])

7.3 Using spans to analyze performance

push_stats function also works together with ?find_pairs macro:

Pairs = ?find_pairs( #{ ?snk_span := start
                      , ?snk_kind := foo
                      , pid       := _Pid
                      }
                   , #{ ?snk_span := {complete, _}
                      , ?snk_kind := foo
                      , pid       := _Pid
                      }
                   ),
snabbkaffe:push_stats(metric_name, [X,] Pairs)

7.4 Analyzing the results

Use the following function in the end of check stage:

snabbkaffe:analyze_statistics()

It will print the results in the console:

Mean scalar_metric_name: 10.0

Statisitics of metric_name
100.479087 ^                                       *
           |                         *
           |                   *
           |
           |            *
           |      *
         0 +---------------------------------------->
           0                                     1100

8 Conclusion

8.1 Useful links

https://github.com/kafka4beam/snabbkaffe

The library is being actively developed.

Suggestions, bug reports and patches are welcome!

Snabbkaffe in action
Feel free to use as an example (clickable links):
- brod
- kflow
- ekka
- EMQ X