Callbacks, synchronous and asynchronous

by havoc

Here are two guidelines for designing APIs that use callbacks, to add to my inadvertent collection of posts about minor API design points. I’ve run into the “sync vs. async” callback issue many times in different places; it’s a real issue that burns both API designers and API users.

Most recently, this came up for me while working on Hammersmith, a callback-based Scala API for MongoDB. I think it’s a somewhat new consideration for a lot of people writing JVM code, because traditionally the JVM uses blocking APIs and threads. For me, it’s a familiar consideration from writing client-side code based on an event loop.

Definitions

  • A synchronous callback is invoked before a function returns, that is, while the API receiving the callback remains on the stack. An example might be: list.foreach(callback); when foreach() returns, you would expect that the callback had been invoked on each element.
  • An asynchronous or deferred callback is invoked after a function returns, or at least on another thread’s stack. Mechanisms for deferral include threads and main loops (other names include event loops, dispatchers, executors). Asynchronous callbacks are popular with IO-related APIs, such as socket.connect(callback); you would expect that when connect() returns, the callback may not have been called, since it’s waiting for the connection to complete.
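To make the two definitions concrete, here is a minimal sketch. The connect() helper is a hypothetical stand-in for socket.connect(callback), built on a plain java.util.concurrent executor rather than a real socket, and the "connected" latch only simulates the connection completing.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable.ListBuffer

object SyncVsAsync {
  // Synchronous: foreach invokes the callback on the caller's stack,
  // so every element has been visited by the time foreach returns.
  def syncDemo(): List[Int] = {
    val seen = ListBuffer.empty[Int]
    List(1, 2, 3).foreach(x => seen += x)
    seen.toList
  }

  // Asynchronous: connect() returns before the callback has run.
  def asyncDemo(): (Boolean, Boolean) = {
    val executor  = Executors.newSingleThreadExecutor()
    val connected = new CountDownLatch(1) // simulates "the connection completed"
    val finished  = new CountDownLatch(1)
    val called    = new AtomicBoolean(false)

    // Hypothetical stand-in for socket.connect(callback): the callback
    // runs later, on the executor's thread, once "connected" is released.
    def connect(callback: () => Unit): Unit =
      executor.execute(() => { connected.await(); callback() })

    connect(() => { called.set(true); finished.countDown() })
    val calledWhenConnectReturned = called.get() // still false here
    connected.countDown()                        // let the "connection" complete
    finished.await(5, TimeUnit.SECONDS)
    executor.shutdown()
    (calledWhenConnectReturned, called.get())
  }

  def main(args: Array[String]): Unit = {
    println(syncDemo())  // List(1, 2, 3)
    println(asyncDemo()) // (false,true)
  }
}
```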

Guidelines

Two rules that I use, based on past experience:

  • A given callback should be either always sync or always async, as a documented part of the API contract.
  • An async callback should be invoked by a main loop or central dispatch mechanism directly, i.e. there should not be unnecessary frames on the callback-invoking thread’s stack, especially if those frames might hold locks.

How are sync and async callbacks different?

Sync and async callbacks raise different issues for both the app developer and the library implementation.

Synchronous callbacks:

  • Are invoked in the original thread, so do not create thread-safety concerns by themselves.
  • In languages like C/C++, may access data stored on the stack such as local variables.
  • In any language, may access data tied to the current thread, such as thread-local variables. For example, many Java web frameworks create thread-local variables for the current transaction or request.
  • May be able to assume that certain application state is unchanged, for example assume that objects exist, timers have not fired, IO has not occurred, or whatever state the structure of a program involves.

Asynchronous callbacks:

  • May be invoked on another thread (for thread-based deferral mechanisms), so apps must synchronize any resources the callback accesses.
  • Cannot touch anything tied to the original stack or thread, such as local variables or thread-local data.
  • If the original thread held locks, the callback will be invoked outside them.
  • Must assume that other threads or events could have modified the application’s state.
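The thread-local point is easy to see in a small sketch. The currentRequest variable below is a hypothetical example of the kind of per-request thread-local a web framework might keep, and invokeOnOtherThread blocks for the result only so the demo can report what the callback saw.

```scala
import java.util.concurrent.{Callable, Executors}

object ThreadLocalDemo {
  // Hypothetical per-request context stored in a thread-local.
  val currentRequest = new ThreadLocal[String]

  // Sync invocation: the callback runs on the caller's thread.
  def invokeSync[A](callback: () => A): A = callback()

  // Deferred invocation on a different thread. Waiting for the result
  // here is only for the demo; a real async API would not block.
  def invokeOnOtherThread[A](callback: () => A): A = {
    val pool = Executors.newSingleThreadExecutor()
    try pool.submit(new Callable[A] { def call(): A = callback() }).get()
    finally pool.shutdown()
  }

  def demo(): (Option[String], Option[String]) = {
    currentRequest.set("request-42")
    try {
      val callback = () => Option(currentRequest.get())
      (invokeSync(callback), invokeOnOtherThread(callback))
    } finally currentRequest.remove()
  }

  def main(args: Array[String]): Unit =
    println(demo()) // (Some(request-42),None)
}
```

The same callback sees the request when run synchronously and sees nothing when run on another thread, which is exactly the kind of difference an application has to plan for.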

Neither type of callback is “better”; both have uses. Consider:

list.foreach(callback)

in most cases, you’d be pretty surprised if that callback were deferred and did nothing on the current thread!

But:

socket.connect(callback)

would be totally pointless if it never deferred the callback; why have a callback at all?

These two cases show why a given callback should be defined as either sync or async; they are not interchangeable, and don’t have the same purpose.

Choose sync or async, but not both

It is often possible to invoke a callback immediately in some situations (say, data is already available) while the callback needs to be deferred in others (the socket isn’t ready yet). The tempting thing is to invoke the callback synchronously when possible, and otherwise defer it. That is not a good idea.

Because sync and async callbacks have different rules, they create different bugs. It’s very typical that the test suite only triggers the callback asynchronously, but then some less-common case in production runs it synchronously and breaks. (Or vice versa.)

Requiring application developers to plan for and test both sync and async cases is just too hard, and it’s simple to solve in the library: If the callback must be deferred in any situation, always defer it.
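As an illustration of "always defer", here is a sketch of a hypothetical Lookup class with a cache. The class, its fetch method, and the slowFetch parameter are invented for this example. The cached path could call the callback immediately, but it goes through the dispatcher anyway so the callback has the same contract on a hit as on a miss.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch, Executor, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

object AlwaysDefer {
  // Hypothetical lookup API; "dispatcher" plays the role of the main loop.
  final class Lookup(dispatcher: Executor, slowFetch: String => String) {
    private val cache = new ConcurrentHashMap[String, String]()

    def fetch(key: String)(callback: String => Unit): Unit = {
      val cached = cache.get(key)
      if (cached != null) {
        // The value is ready, but we still defer rather than calling
        // callback(cached) on the caller's stack.
        dispatcher.execute(() => callback(cached))
      } else {
        dispatcher.execute(() => {
          val value = slowFetch(key)
          cache.put(key, value)
          callback(value)
        })
      }
    }
  }

  // Reports whether the callback ran on the calling thread,
  // first for a cache miss and then for a cache hit.
  def demo(): (Boolean, Boolean) = {
    val dispatcher = Executors.newSingleThreadExecutor()
    val lookup = new Lookup(dispatcher, key => key.toUpperCase)
    val caller = Thread.currentThread()

    def ranOnCaller(key: String): Boolean = {
      val done = new CountDownLatch(1)
      val onCaller = new AtomicBoolean(false)
      lookup.fetch(key) { _ =>
        onCaller.set(Thread.currentThread() eq caller)
        done.countDown()
      }
      done.await(5, TimeUnit.SECONDS)
      onCaller.get()
    }

    try {
      val miss = ranOnCaller("a")
      val hit = ranOnCaller("a")
      (miss, hit)
    } finally dispatcher.shutdown()
  }

  def main(args: Array[String]): Unit =
    println(demo()) // (false,false)
}
```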

Example case: GIO

There’s a great concrete example of this issue in the documentation for GSimpleAsyncResult in the GIO library; scroll down to the Description section and look at the example about baking a cake asynchronously. (GSimpleAsyncResult is equivalent to what some frameworks call a future or promise.) The library provides two methods: complete_in_idle(), which defers callback invocation to an “idle handler” (just an immediately-dispatched one-shot main loop event), and plain complete(), which invokes the callback synchronously. The documentation suggests using complete_in_idle() unless you know you’re already in a deferred callback with no locks held (i.e. if you’re just chaining from one deferred callback to another, there’s no need to defer again).

GSimpleAsyncResult is used in turn to implement IO APIs such as g_file_read_async(), and developers can assume the callbacks used in those APIs are deferred.

GIO works this way and documents it at length because the developers building it had been burned before.

Synchronized resources should defer all callbacks they invoke

Really, the rule is that a library should drop all its locks before invoking an application callback. But the simplest way to drop all locks is to make the callback async, thereby deferring it until the stack unwinds back to the main loop, or running it on another thread’s stack.

This is important because applications can’t be expected to avoid touching your API inside the callback. If you hold locks and the app touches your API while you do, the app will deadlock. (Or if you use recursive locks, you’ll have a scary correctness problem instead.)

Rather than deferring the callback to a main loop or thread, the synchronized resource could try to drop all its locks. That can be very painful, though, because the lock might be well up in the stack: each method on the stack has to return the callback, passing it all the way back up to the outermost lock holder, which then drops the lock and invokes the callback. Ugh.
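A minimal sketch of the deferral approach, assuming a hypothetical Account class guarded by a non-reentrant lock (a Semaphore with one permit). Calling onDone() directly inside deposit() would deadlock as soon as the callback touched the account again. Handing it to a dispatcher means it runs after the lock is released. This only works if the dispatcher really defers; an executor that runs tasks inline on the calling thread would bring the deadlock back.

```scala
import java.util.concurrent.{CountDownLatch, Executor, Executors, Semaphore, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object DropLocksByDeferring {
  // Hypothetical synchronized resource.
  final class Account(dispatcher: Executor) {
    private val lock = new Semaphore(1) // non-reentrant lock
    private var total = 0

    def balance: Int = {
      lock.acquire()
      try total finally lock.release()
    }

    def deposit(amount: Int)(onDone: () => Unit): Unit = {
      lock.acquire()
      try {
        total += amount
        // Not onDone() here: the lock is still held. Defer instead,
        // so the callback runs on the dispatcher with no lock held.
        dispatcher.execute(() => onDone())
      } finally lock.release()
    }
  }

  // The callback re-enters the API (account.balance) without deadlocking.
  def demo(): Int = {
    val dispatcher = Executors.newSingleThreadExecutor()
    val account = new Account(dispatcher)
    val seen = new AtomicInteger(-1)
    val done = new CountDownLatch(1)
    account.deposit(10) { () =>
      seen.set(account.balance)
      done.countDown()
    }
    done.await(5, TimeUnit.SECONDS)
    dispatcher.shutdown()
    seen.get()
  }

  def main(args: Array[String]): Unit =
    println(demo()) // 10
}
```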

Example case: Hammersmith without Akka

In Hammersmith as originally written, the following pseudocode would deadlock:

connection.query({ cursor => /* iterate cursor here, touching connection again */ })

Iterating the cursor will go back through the MongoDB connection. The query callback was invoked from code in the connection object… which held the connection lock. Not going to work, but this is natural and convenient code for an application developer to write. If the library doesn’t defer the callback, the app developer has to defer it themselves. Most app developers will get this wrong at first, and once they catch on and fix it, their code will be cluttered by some deferral mechanism.

Hammersmith inherited this problem from Netty, which it uses for its connections; Netty does not try to defer callbacks (I can understand the decision since there isn’t an obvious default/standard/normal/efficient way to defer callbacks in Java).

My first fix for this was to add a thread pool just to run app callbacks. Unfortunately, the recommended thread pool classes that come with Netty don’t solve the deadlock problem, so I had to fix that. (Any thread pool that solves deadlock problems has to have an unbounded size and no resource limits…)
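One way to see why a bounded pool is a problem, sketched with the standard java.util.concurrent executors rather than Netty's own classes: if a callback running on the pool waits for another task on the same pool, a fixed-size pool can run out of threads and stall, while an unbounded pool just adds a thread. The nestedCompletes helper is made up for this illustration.

```scala
import java.util.concurrent.{Callable, Executors, ExecutorService, TimeUnit, TimeoutException}

object PoolSizing {
  // An outer callback schedules an inner task on the same pool and waits
  // for it. Returns true if the inner task ran within the timeout.
  def nestedCompletes(pool: ExecutorService): Boolean = {
    val outer = pool.submit(new Callable[Boolean] {
      def call(): Boolean = {
        val inner = pool.submit(new Callable[Boolean] {
          def call(): Boolean = true
        })
        try inner.get(500, TimeUnit.MILLISECONDS)
        catch { case _: TimeoutException => false } // no free thread for inner
      }
    })
    try outer.get() finally pool.shutdownNow()
  }

  def demo(): (Boolean, Boolean) = {
    val bounded = nestedCompletes(Executors.newFixedThreadPool(1))
    val unbounded = nestedCompletes(Executors.newCachedThreadPool())
    (bounded, unbounded)
  }

  def main(args: Array[String]): Unit =
    println(demo()) // (false,true)
}
```

With a single-thread pool the outer task occupies the only thread, so the inner task never starts; the timeout stands in for what would otherwise be a deadlock.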

In the end it works, but imagine what happens if callback-based APIs become popular and every jar you use with a callback in its API has to have its own thread pool. Kind of sucks. That’s probably why Netty punts on the issue. Too hard to make policy decisions about this in a low-level networking library.

Example case: Akka actors

Partly to find a better solution, next I ported Hammersmith to the Akka framework. Akka implements the Actor model. Actors are based on messages rather than callbacks, and in general messages must be deferred. In fact, Akka goes out of its way to force you to use an ActorRef to communicate with an actor, where all messages to the actor ref go through a dispatcher (event loop). Say you have two actors communicating; they will “call back” to each other using the ! or “send message” method:

actorOne ! Request("Hello")
// then in actorOne
sender ! Reply("World")

These messages are dispatched through the event loop. I was expecting my deadlock problems to be over in this model, but I found a little gotcha — the same issue all over again, invoking application callbacks with a lock held. This time it was the lock on an actor while the actor is processing a message.

Akka actors can receive messages from either another actor or from a Future, and Akka wraps the sender in an object called Channel. The ! method is in the interface to Channel. Sending to an actor with ! will always defer the message to the dispatcher, but sending to a future will not; as a result, the ! method on Channel does not define sync vs. async in its API contract.

This becomes an issue because part of the “point” of the actor model is that an actor runs in only one thread at a time; actors are locked while they’re handling a message and can’t be re-entered to handle a second message. Thus, making a synchronous call out from an actor is dangerous; there’s a lock held on the actor, and if the synchronous call tries to use the actor again inside the callback, it will deadlock.
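To illustrate why deferred sends are safe where synchronous calls are not, here is a toy mailbox. It is not Akka's implementation or API; the Mailbox class and its ! method are assumptions made up for this sketch, using a single-threaded executor as the dispatcher. A message sent from inside the handler is queued and only handled after the current handler returns, so the "actor" is never re-entered.

```scala
import java.util.concurrent.{CountDownLatch, Executor, Executors, TimeUnit}
import scala.collection.mutable.ListBuffer

object MiniMailbox {
  // Toy actor: every send goes through the dispatcher, never a direct call.
  final class Mailbox(dispatcher: Executor)(handler: (Mailbox, String) => Unit) {
    def !(message: String): Unit =
      dispatcher.execute(() => handler(this, message))
  }

  def demo(): List[String] = {
    val dispatcher = Executors.newSingleThreadExecutor()
    val log = ListBuffer.empty[String] // written only on the dispatcher thread
    val done = new CountDownLatch(1)

    val actor = new Mailbox(dispatcher)((self, msg) => {
      log += s"start $msg"
      if (msg == "ping") self ! "pong" // queued, not handled re-entrantly
      log += s"end $msg"
      if (msg == "pong") done.countDown()
    })

    actor ! "ping"
    done.await(5, TimeUnit.SECONDS) // the latch makes the log safe to read here
    dispatcher.shutdown()
    log.toList
  }

  def main(args: Array[String]): Unit =
    println(demo()) // List(start ping, end ping, start pong, end pong)
}
```

If ! called the handler directly instead of going through the dispatcher, "start pong" would appear before "end ping", which is the re-entrancy the actor's lock is there to prevent.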

I wrapped MongoDB connections in an actor, and immediately had exactly the same deadlock I’d had with Netty, where a callback from a query would try to touch the connection again to iterate a cursor. The query callback came from invoking the ! method on a future. The ! method on Channel breaks my first guideline (it doesn’t define sync vs. async in the API contract), but I was expecting it to be always async; as a result, I accidentally broke my second guideline and invoked a callback with a lock held.

If it were me, I would probably put deferral in the API contract for Channel.! to fix this; however, as Akka is currently written, if you’re implementing an actor that sends replies, and the application’s handler for your reply may want to call back and use the actor again, you must manually defer sending the reply. I stumbled on this approach, though there may be better ones:

private def asyncSend(channel: AkkaChannel[Any], message: Any) = {
    Future(channel ! message, self.timeout)(self.dispatcher)
}

An unfortunate aspect of this solution is that it double-defers replies to actors, in order to defer replies to futures once.

The good news about Akka is that at least it has this solution: there’s a dispatcher to use, whereas with plain Netty I had to use a dedicated thread pool.

Akka gives an answer to “how do I defer callbacks,” but it does require special-casing futures in this way to be sure they’re really deferred.

(UPDATE: The Akka team is already working on this; here’s the ticket.)

Conclusion

While I found one little gotcha in Akka, the situation is much worse on the JVM without Akka because there isn’t a dispatcher to use.

Callback-based APIs really work best if you have an event loop, because it’s so important to be able to defer callback invocation.

That’s why callbacks work pretty well in client-side JavaScript and in node.js, and in UI toolkits such as GTK+. But if you start coding a callback-based API on the JVM, there’s no default answer for this critical building block. You’ll have to go pick some sort of event loop library (Akka works great), or reinvent the equivalent, or use a bloated thread-pools-everywhere approach.

Since callback-based APIs are so trendy these days… if you’re going to write one, I’d think about this topic up front.