UP | HOME

Introduction to Snabbkaffe

This is an HTML version of snabbkaffe presentation slides.

1 Background and motivation

1.1 What is snabbkaffe

  • Snabbkaffe is a library that allows to test concurrent and distributed systems
  • It does so by moving focus from states to effects
  • Developers find bugs by looking at the logs, snabbkaffe does the same and automates the process
  • Advanced modes of testing: fault and scheduling injection
  • Efficiency: run test scenario once, verify multiple properties

1.2 Name

  • Snabbkaffe means ``instant coffee'' in Swedish
  • There is no hidden meaning, the name was chosen randomly

1.3 Nondeterminism

  1. Definitions
    • Deterministic functions always produce the same output for the same input
    • Nondeterministic functions can produce different outputs for the same input
  2. Sources of nondeterminism
    • Process schedulings
    • Fault-tolerance (failovers)
    • Network (packet loss and reordering)
    • Load balancing, worker pools

    Nondeterminism is an inherent property of concurrent systems

1.4 Why change the paradigm of testing?

  1. Difficulty levels
    1. Easy: Pure functions
    2. Normal: Stateful systems
    3. Hard: Nondeterministic systems
    4. Nightmare: Nondeterministic fault-tolerant distributed systems
  2. Traditional approach
    • Traditional approach is sufficient for testing pure functions. There are a variety of excellent tools for this, like PropER
    • It can do pretty well with stateful systems (stateful PropER)
    • Where it starts to crack: typical Erlang application assembled together

1.5 Challenges of testing Erlang applications in situ

Erlang partitions the state of the system between processes

  • It eliminates some concurrency bugs and isolates failures
  • But it makes the system harder to analyze
  • More often than not it's impossible to inspect state of the process (sys:get_state)
  • Often it's useless
  • Tests that rely on the state usually need to wait for the system to stabilize, it can be slow, and they can't analyze the system in motion
  • Asynchronous data processing patterns (e.g. job queues) are incredibly annoying to test using traditional methods

1.6 Possible solution to this problem?

Move away from states and embrace events and their temporal relationships

2 Instrumenting the code

2.1 Tracing macros

  1. Replace your regular logs with structured logs (the industry is moving towards structured logs anyway)
  2. Include "snabbkaffe/include/trace.hrl" header to your module
  3. Use ?tp or ?tp_span macros for logging (tp stands for Trace Point)
    • In the release build these macros will become regular logger messages
    • In the test build these macros will emit trace events

2.2 Examples

%% Disappears in the release build:
?tp(rlog_replica_import_trans,
    #{ seqno       => SeqNo
     , txid        => TXID
     , transaction => Transaction
     })
%% Becomes a notice message
?tp(notice, "Remote process died",
    #{ reason   => Reason
     , my_state => State
     })
%% Emits two events:
?tp_span(trans_write, #{record => Record, txid => get_txid()},
         mnesia:write(Record)).

2.3 Distributed tracing

Forward traces from the remote node:

%% on the ct_master:
snabbkaffe:forward_trace(Node)

All features (incl. fault and scheduling injections) will work

2.4 Why not using erlang tracing?

An often asked question: why not using dbg:tracer or the like?

It was a conscious design choice:

  • Traces obtained this way are tightly bound to the structure of the code. Refactoring the code or adding a new function argument would break the tests
  • Debug traces are excessive, it would be hard to see the forest through the trees
  • Log messages are typically placed in ``interesting'' places
    • Interesting for humans = interesting for tests

3 Testing

3.1 Stages of the test

Every testcase is split in two stages:

  1. Run stage

    where the program runs and emits the event trace

  2. Check stage

    where the trace is collected as a list of Erlang maps and validated against the spec(s)

3.2 check_trace macro

my_test() ->
  ?check_trace(begin
                 %% Run stage:
                 launch_system(),
                 do_stuff(),
                 wait_for_completion(),
                 get_result()
               end,
               fun(Result, Trace) ->
                  %% Check stage:
                  ?assertMatch(ok, Result),
                  check_history(Trace),
                  check_history2(Trace),
                  ...
               end).

3.3 Waiting for events

It is possible to block run stage until a certain event occurs:

?block_until(#{ ?snk_kind  := message_acked
              , message_id := Id
              } when Id > 42,
             Timeout, BackInTime)

This macro will return immediately as soon as the event happens, so it's more efficient and less prone to flakiness than a sleep.

3.4 Starting an asynchronous action

?block_until is fine, but it requires timeout tuning. What if the same event happened in the past?

?wait_async_action( send_async_request(Req)
                  , #{ ?snk_kind := request_handled
                     , request := Req
                     }
                 [, Timeout]
                  )

This macro doesn't have to look into the past.

3.5 Integrating with PropER

It's not a problem to combine stateless PropER tests with snabbkaffe. ?check_trace macro can be used inside proper's ?FORALL macro.

Snabbkaffe provides some convenience macros that simplify trace-property-based testing:

trace_prop_test(Config) ->
  Prop = ?forall_trace(
            X, list(),
            begin
              %% Run stage:
              do_stuff(X)
            end,
            fun(Result, Trace) ->
                %% Check stage
                true
            end),
  ?run_prop(Config, Prop).

3.6 Misc. macros

There are a few macros that are not directly related to trace-based testing, but often needed:

  1. ?panic
    ?panic("Unexpected result", #{result => A})
    

    Throw an error

  2. ?retry
    ?retry(_RetryPeriod = 100, _MaxRetries = 10,
           some_action_that_can_fail())
    

    Run some code periodically until it succeeds (use as the last resort, when ?block_until won't work)

  3. ?give_or_take
    ?give_or_take(_Expected = 20, _Deviation = 2, Value)
    

    Check that an integer value is within range

4 Basic functions for working with traces

4.1 Structure of the trace event

[#{ ?snk_kind => foo
  , ?snk_meta => #{ domain => [foo, bar]
                  , node => 'foo@localhost'
                  , time => <monotonic time>
                  , pid => <0.343.0>
                  , group_leader => <0.67.0>
                  , custom_field1 => baz
                  , custom_field2 => foobar
                  }
  , field1 => 42
  , field2 => foo
  },
  ...
]

4.2 Structure of the span trace

[#{ ?snk_kind => foo
  , ?snk_meta => #{ ... }
  , ?snk_span => start
  , field1 => 42
  , field2 => foo
  },
 #{ ?snk_kind => foo
  , ?snk_meta => #{ ... }
  , ?snk_span => {complete, ReturnValue}
  , field1 => 42
  , field2 => foo
  }
]

4.3 Filtering the trace

  1. ?of_kind macro
    FooEvents = ?of_kind(foo, Trace)
    
    FooOrBarEvents = ?of_kind([foo, bar], Trace)
    
  2. ?of_domain macro
    EMQXEvents = ?of_domain([emqx|_], Trace)
    
  3. ?of_node macro
    NodeFooEvents = ?of_node('foo@localhost', Trace)
    
  4. More complex filtering

    Of course, lists:filter and list comprehensions work too

4.4 Extracting fields

[1, 2] = ?projection(id, ?of_kind(handle_message, Trace))
[{1, "hello"}, {2, "world"}] =
     ?projection( [id, message]
                , ?of_kind(handle_message, Trace)
                )

5 Event correlations

5.1 ?strict_causality macro

?strict_causality( #{?snk_kind := msg_received, id := _Id}
                 , #{?snk_kind := msg_processed, id := _Id}
                 , Trace
                 )

Return value:

  • true if some matching events were found
  • false if no events were found
  • exception if causality is violated

5.2 ?strict_causality with guard

Suppose we're testing a "base64 server":

?strict_causality( #{req := _Req}
                 , #{resp := _Resp}
                 , _Resp =:= base64:encode(_Req)
                 , Trace
                 )

5.3 ?causality macro

Overall, it's the same as ?strict_causality, except it doesn't require each ``cause'' to have an ``effect''

?causality( #{?snk_kind := msg_received, id := _Id}
          , #{?snk_kind := msg_processed, id := _Id}
         [, Guard]
          , Trace
          )

5.4 Pitfalls

There is a potential problem with ``causality'' macros:

  1. Format of the event is changed in the code
  2. Match expressions in causality stop matching the events
  3. Tests still pass, even though they didn't find any events

Solution:

Always wrap ?causality and ?strict_causality in ?assert, unless there are legit situations when no events can be produced in the test

5.5 ?find_pairs macro

?find_pairs(Strict, MatchCause, MatchEffect [, Guard] , Trace)
  1. It returns a list of {pair, Cause, Effect} or {singleton, Cause}
  2. When Strict is true this macro also checks that effects don't occur before causes (much like ?causality)

5.6 Splitting traces

Often it is useful to split traces to parts before and after some event (for example, restart)

There are a variety of macros for this:

  1. split_at
    ?split_trace_at(bar, [foo, bar, baz, bar]).
    {[foo], [bar, baz, bar]}.
    
  2. splitr
    ?splitr_trace(foo, [1, 2, foo, 1, 2, foo])
    [[1, 2], [foo, 1, 2], [foo]].
    
  3. splitl
    ?splitl_trace(foo, [1, 2, foo, 1, 2, foo]).
    [[1, 2, foo], [1, 2, foo]].
    

6 Nemesis

6.1 Are your supervisors tested?

  • Taking fault-tolerance seriously is one of the selling points of Erlang. Organizing processes in supervisor trees is used widely, but not often tested
  • Tuning supervisor trees is an art
  • Snabbkaffe wants to turn it into a chore
  • It does so by injecting deliberate faults into the system

6.2 Fault injection

Any tracepoint can be used to inject errors into the system

?inject_crash( #{?snk_meta := #{domain := [ekka, rlog|_]}}
             , snabbkaffe_nemesis:random_crash(0.1)
             )

  • First argument: event matching expression
  • Second argument: fault scenario

6.3 Fault scenarios

  1. Always crash:
    ?inject_crash(..., snabbkaffe_nemesis:always_crash())
    
  2. Crash N times:
    ?inject_crash(..., snabbkaffe_nemesis:recover_after(10))
    
  3. Crash randomly with probability P:
    ?inject_crash(..., snabbkaffe_nemesis:random_crash(0.1))
    
  4. Crash periodically:
    snabbkaffe_nemesis:periodic_crash(
      _Period = 10, _DutyCycle = 0.5, _Phase = math:pi())
    

6.4 Manipulating process schedulings

  • Sometimes it is necessary to test a certain process scheduling
  • Imagine working on a bugfix using TDD-style
  • For example, normally effect foo occurs after effect bar, and everything works. But in rare cases the opposite happens, and the bug occurs
  • Snabbkaffe can manipulate process schedulings to some extend:
%% run stage...
?force_ordering(#{?snk_kind := bar}, #{?snk_kind := foo})

(It also supports guards)

7 Using snabbkaffe as a benchmark library

7.1 Disclaimer

  • This usecase is a secondary, and quite rudimentary. Don't expect much
  • It lacks many features of proper benchmarking libraries, such as
    • Warmup
    • Advanced statistical analysis, e.g. outlier detection

7.2 Reporting metrics

%% Report a single scalar datapoint:
snabbkaffe:push_stat(scalar_metric_name, Value),
%% Report a single datapoint with X coord:
snabbkaffe:push_stat(metric_name, X, Value),
%% Report multiple scalar datapoints:
snabbkaffe:push_stats(metric_name, [Val1, Val2, ...]),
%% Report multiple datapoints with X coord:
snabbkaffe:push_stats(metric_name, X, [Val1, Val2, ...])

7.3 Using spans to analyze performance

push_stats function also works together with ?find_pairs macro:

Pairs = ?find_pairs( #{ ?snk_span := start
                      , ?snk_kind := foo
                      , pid       := _Pid
                      }
                   , #{ ?snk_span := {complete, _}
                      , ?snk_kind := foo
                      , pid       := _Pid
                      }
                   ),
snabbkaffe:push_stats(metric_name, [X,] Pairs)

7.4 Analyzing the results

Use the following function in the end of check stage:

snabbkaffe:analyze_statistics()

It will print the results in the console:

Mean scalar_metric_name: 10.0

Statisitics of metric_name
100.479087 ^                                       *
           |                         *
           |                   *
           |
           |            *
           |      *
         0 +---------------------------------------->
           0                                     1100

8 Conclusion

8.1 Useful links

https://github.com/kafka4beam/snabbkaffe

The library is being actively developed.

Suggestions, bug reports and patches are welcome!

  1. Snabbkaffe in action

    Feel free to use as an example (clickable links):

Date: 2020-12-01