
Delimited Continuations vs Lwt for Threads

By Anil Madhavapeddy - 2011-06-18


MirageOS is a fully event-driven system, with no support for conventional preemptive threads. Instead, programs are woken by events such as incoming network packets, and event callbacks execute until they themselves need to block (due to I/O or timers) or complete their task.

Event-driven systems are simple to implement, scalable to lots of network clients, and very hip due to frameworks like node.js. However, programming event callbacks directly leads to the control logic being scattered across many small functions, and so we need some abstractions to hide the interruption of control flow when we register for an event and wait for it to trigger.

OCaml has the excellent Lwt threading library that utilises a monadic approach to solving this. Consider this simplified signature:

  val return : 'a -> 'a Lwt.t 
  val bind : 'a Lwt.t -> ('a -> 'b Lwt.t) -> 'b Lwt.t
  val run : 'a Lwt.t -> 'a

Threads have the type 'a Lwt.t, which means that the thread will have a result of type 'a when it finishes. The return function is the simplest way to construct such a thread from an OCaml value.
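
For example, return 42 produces a thread of type int Lwt.t that has already completed with the value 42:

  let t : int Lwt.t = return 42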

If we then wish to use the value of a thread, we must compose a function that will be called in the future when the thread completes. This is what the bind function above is for. For example, assume we have a function that will let us sleep for some time:

  val sleep: int -> unit Lwt.t

We can now use the bind function to do something after the sleep is complete:

  let x = sleep 5 in
  let y = bind x (fun () -> return (print_endline "awake!")) in
  run y

x has the type unit Lwt.t, and the closure passed to bind will eventually be called with unit when the sleep finishes. Note that we also need a function to actually begin evaluating an Lwt thread, which is the run function.
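
Lwt also exposes bind as the infix operator (>>=), available after open Lwt, which lets chains of blocking calls read top to bottom. Using the simplified sleep and run from above, the same example could be sketched as:

  let () =
    run (sleep 5 >>= fun () -> return (print_endline "awake!"))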

Concerns

MirageOS currently uses Lwt extensively, and we have been very happy with using it to build a network stack. However, I was surprised by the amount of debate at the 2011 OCaml Users Group meeting suggesting that Lwt is not to everyone's taste. There are a few issues:

  • The monadic style means that existing code will not just work. Any code that might block must be adapted to use return and bind, which makes integrating third-party code problematic.

  • More seriously, any potential blocking point requires the allocation of a closure. This allocation is very cheap in OCaml, but it is still not free. Jun Furuse noted during the development of his Planck parser that combinator-based systems are slower.

Lwt addresses the first problem via a comprehensive syntax extension (pa_lwt) which provides Lwt equivalents for many common operations. For instance, the sleep example above can be written as:

  lwt () = sleep 5 in
  return (print_endline "awake!")

The lwt keyword indicates that the result of the expression should be passed through bind, and this makes it possible to write code that looks more OCaml-like (the body of the binding must still evaluate to a thread, hence the return). There are also other keywords like for_lwt and match_lwt that similarly help with common control flow constructs.
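
As a rough sketch of how these read (assuming open Lwt, the sleep function from earlier, and a hypothetical read_line : unit -> string Lwt.t), here is a loop that combines for_lwt and match_lwt:

  let echo_lines n =
    for_lwt i = 1 to n do
      match_lwt read_line () with
      | ""   -> return ()                       (* skip blank lines *)
      | line -> return (print_endline line)
    done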

Fibers

After the meeting, I got to thinking about alternatives to Lwt in MirageOS. One exciting option is the delimcc library, which implements delimited continuations for OCaml. These can be used to implement restartable exceptions: a program can raise an exception that captures the rest of the computation, and later invoke that captured continuation to resume execution as if the exception had never been raised. Delimcc can be combined with Lwt very elegantly, and Jake Donham did just this with the Lwt_fiber library. His post also has a detailed explanation of how delimcc works.
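
To give a flavour of the underlying mechanism, here is a minimal sketch using delimcc directly (assuming its new_prompt, push_prompt and shift functions): shift captures the rest of the computation up to the enclosing prompt as a function k, and calling k resumes that computation as if nothing had been interrupted.

  let () =
    let p = Delimcc.new_prompt () in
    let r =
      Delimcc.push_prompt p (fun () ->
        (* shift captures the surrounding context "1 + _" as the function k *)
        1 + Delimcc.shift p (fun k -> k (k 10)))
    in
    Printf.printf "%d\n" r   (* k 10 = 11, k 11 = 12, so this prints 12 *)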

The interface for fibers is also simple:

  val start: (unit -> 'a) -> 'a Lwt.t
  val await : 'a Lwt.t -> 'a

A fiber can be launched with start, and during its execution it can block on another thread with await. When it does block, a restartable exception saves the program stack back to the point where start was called, and the fiber is resumed when the thread it blocked on completes.
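
Rewriting the earlier sleep example as a fiber shows the appeal: the blocking call is written in direct style, with no closure passed to bind. A sketch, reusing the sleep function from earlier:

  let t : unit Lwt.t =
    Lwt_fiber.start (fun () ->
      Lwt_fiber.await (sleep 5);   (* suspends this fiber, not the whole program *)
      print_endline "awake!")

The resulting t is an ordinary Lwt thread, so it can be composed with bind or run like any other.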

Benchmarks

I put together a few microbenchmarks to try out the performance of Lwt threads versus fibers. The fiber test looks like this:

  module Fiber = struct
    let basic fn yields =
      for i = 1 to 15000 do
        for x = 1 to yields do
          Lwt_fiber.await (fn ())
        done
      done

    let run fn yields =
      Lwt_fiber.start (fun () -> basic fn yields)
  end

We invoke the run function with two arguments: a function that produces a thread to block on, and the number of times we should yield serially (so we can confirm that an increasing number of yields scales linearly). The Lwt version is pretty similar:

  module LWT = struct
    let basic fn yields =
      for_lwt i = 1 to 15000 do
        for_lwt x = 1 to yields do
          fn ()
        done
      done
  
    let run = basic
  end

We do not need to do anything special to launch a thread since we are already in the Lwt main loop, and the syntax extension makes the for loops look like the Fiber example above.
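
For concreteness, a hypothetical driver for these two modules might look like the sketch below; the real harness also does the timing, and Lwt_main.run is the actual entry point corresponding to the simplified run shown at the start:

  let () =
    let fn () = Lwt.pause () in
    Lwt_main.run (Fiber.run fn 10);   (* fiber version *)
    Lwt_main.run (LWT.run fn 10)      (* plain Lwt version *)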

The choice of blocking function is important. The first test uses a fast blocking function, Lwt.return (), which completes immediately:

The x-axis on the above graph represents the number of yields in each loop. Both Lwt_fiber and pure Lwt optimise the case where a thread returns immediately, and so this graph simply tells us that the fast path is working (which is nice!). The next test replaces the blocking function with two alternatives that force the thread to yield:

There are two blocking functions used in the graph above, both sketched below:

  • the "slow" version is Lwt_unix.sleep 0.0 which forces the registration of a timeout.
  • the "medium" version is Lwt.pause () which causes the thread to pause and drop into the thread scheduler. In the case of Lwt_fiber, this causes an exception to be raised so we can benchmark the cost of using a delimited continuation.

Interestingly, using a fiber is slower than normal Lwt here, even though our callstack is not very deep. I would have hoped that fibers would be significantly cheaper with a small callstack, as the amount of backtracking should be quite low. Let's confirm that fibers do in fact slow down as the size of the callstack increases, via this test:

  module Fiber = struct
    let recurse fn depth =
      let rec sum n = 
        Lwt_fiber.await (fn ());
        match n with
        |0 -> 0
        |n -> n + (sum (n-1)) 
      in
      for i = 1 to 15000 do
        ignore(sum depth)
      done

    let run fn depth = 
      Lwt_fiber.start (fun () -> recurse fn depth)
  end

The recurse function is deliberately not tail-recursive, so that the callstack grows with the depth parameter. The Lwt equivalent is slightly more clunky, as we have to rewrite the loop to use bind and return:

  module LWT = struct
    let recurse fn depth =
      let rec sum n =
        lwt () = fn () in
        match n with
        |0 -> return 0
        |n ->
          lwt n' = sum (n-1) in 
          return (n + n')
      in
      for_lwt i = 1 to 15000 do
        lwt res = sum depth in
        return ()
      done

    let run = recurse
  end

We then run the experiment using the slow Lwt_unix.sleep 0.0 function, and get this graph:

The above graph shows the recursive Lwt_fiber getting slower as the recursion depth increases, with normal Lwt staying linear. The graph also overlays the non-recursing versions as a guideline (*-basic-slow).

Thoughts

This first benchmark was a little surprising for me:

  • I would have expected delimcc to be ahead of Lwt when dealing with functions with a small call-depth and a small amount of blocking (i.e. the traffic pattern that loaded network servers see). However, the cost of taking a restartable exception seems quite high.
  • The fiber tests still use the Lwt machinery to manage the callback mechanism (i.e. a select loop and the timer priority queue). It may be possible to create a really light-weight version just for delimcc, but the Lwt UNIX backend is already pretty lean and mean, and uses libev to interface with the OS.
  • The problem of having to rewrite code to be Lwt-like still exists unfortunately, but it is getting better as the pa_lwt syntax extension matures and is integrated into my favourite editor (thanks Raphael!).
  • Finally, by far the biggest benefit of Lwt is that it can be compiled straight into Javascript using the js_of_ocaml compiler, opening up the possibility of cool browser visualisations and tickets to cool node.js parties that I don't normally get invited to.

I need to stress that these benchmarks are very micro, and do not take into account other things like memory allocation. The standalone code for the tests is online at Github, and I would be delighted to hear any feedback.

Retesting recursion [18th Jun 2011]

Jake Donham comments:

I speculated in my post that fibers might be faster if the copy/restore were amortized over a large stack. I wonder if you would repeat the experiment with versions where you call fn only in the base case of sum, instead of at every call. I think you're getting N^2 behavior here because you're copying and restoring the stack on each iteration.

When writing the test, I figured that calling the thread-waiting function more often wouldn't alter the result (careless). So I modified the test suite to have a recurse test that only waits a single time at the end of a long call stack (see below), as well as the original N^2 version (now called recurse2).

  module Fiber = struct
    let recurse fn depth =
      let rec sum n = 
        match n with
        |0 -> Lwt_fiber.await (fn ()); 0
        |n -> n + (sum (n-1)) 
      in
      for i = 1 to 15000 do
        ignore(sum depth)
      done

    let run fn depth = 
      Lwt_fiber.start (fun () -> recurse fn depth)
  end

The N^2 version below of course looks the same as the previously run tests, with delimcc getting much worse as it yields more often:

However, when we run the recurse test with a single yield at the end of the long callstack, the situation reverses itself and now delimcc is faster. Note that this test ran with more iterations than the recurse2 test to make the results scale, and so the absolute time taken cannot be compared.

The reason Lwt is slower here becomes clearer when we examine what the code looks like after it has been passed through the pa_lwt syntax extension. Before expansion, the code looks like:

  let recurse fn depth =
    let rec sum n =
      match n with
      | 0 -> 
          fn () >> return 0
      | n ->
          lwt n' = sum (n-1) in 
          return (n + n') in

and after pa_lwt macro-expands it:

  let recurse fn depth =
    let rec sum n =
      match n with
      | 0 ->
          Lwt.bind (fn ()) (fun _ -> return 0)
      | n ->
          let __pa_lwt_0 = sum (n - 1)
          in Lwt.bind __pa_lwt_0 (fun n' -> return (n + n')) in

Every iteration of the recursive loop requires the allocation of a closure (the Lwt.bind call). In the delimcc case, the function operates as a normal recursive function that uses the stack, until the very end when it needs to save the stack in one pass.

Overall, I'm now convinced that the performance difference is insignificant for the purposes of choosing one thread system over the other for MirageOS. Instead, the question of code interoperability is more important: Lwt-enabled protocol code will work unmodified in Javascript, while delimcc makes it easier to migrate existing direct-style code over.

Interestingly, Javascript 1.7 introduces a yield operator, which has been shown to have comparable expressive power to the shift-reset delimcc operators. Perhaps convergence isn't too far away after all...
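
To illustrate that correspondence, here is a small sketch (again assuming delimcc's new_prompt, push_prompt and shift) that encodes a yield-style generator: each call to yield hands a value to the consumer and then resumes the generator body via the captured continuation.

  let generate (consume : int -> unit) =
    let p = Delimcc.new_prompt () in
    Delimcc.push_prompt p (fun () ->
      let yield v = Delimcc.shift p (fun k -> consume v; k ()) in
      yield 1; yield 2; yield 3)

Calling generate (Printf.printf "%d\n") prints 1, 2 and 3 in order, with the generator body suspended and resumed around each yield.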