Task Dispatch and Nonblocking IO in Scala

by havoc

TL;DR

Modern application development platforms are addressing the related issues of globally-coordinated task dispatch and nonblocking IO.

Here’s my definition of the problem, an argument for why it matters, and some suggestions for specific standard library features to add to Scala in particular.

The same ideas apply to any application development platform, though. It’s rapidly becoming mandatory for a competitive platform to offer an answer here.

Definitions

Let’s define a blocking task to be anything that ties up a thread or process but does not use CPU cycles. The most common ways to block are on IO channels and locks.

A busy loop is not a blocking operation in this sense; it takes up a thread, but it’s using the CPU, not “wasting” the thread.

By “task” I mean any piece of executable code. A task is blocking if it spends part of its time waiting, or nonblocking if it needs the CPU the whole time.

Dispatch just means scheduling the task on a thread and executing it.

Dispatch for nonblocking tasks, in an ideal world

For nonblocking tasks (which are CPU-bound), the goal is to use 100% of all CPU cores. There are two ways to lose:

  • Fail to use all the cores (not enough threads or processes).
  • Too many threads for the number of cores (inefficient and wastes memory).

The ideal solution is a fixed thread or process pool with a number of threads related to the number of cores. This fixed pool must be global to the app and used for all nonblocking tasks. If you have five libraries in your app and they each create a thread per CPU core, you’re losing, even though each library’s approach makes sense in isolation.

When the fixed number of threads are all in use, tasks should be queued up for later dispatch.

Dispatch for blocking tasks, in an ideal world

Blocking tasks pose some competing concerns; the trouble with blocking tasks is that these concerns are hard to balance.

  • Memory: each blocking task ties up a thread, which adds overhead (the thread) to the task. A super-tiny http parser gets you nowhere if you accompany each one with a thread.
  • Deadlocks: blocking tasks are often waiting for another blocking task. Limiting the number of threads can easily create deadlocks.
  • Tasks outstanding: with IO, it is desirable to send lots of requests at once or have lots of sockets open at once. (With CPU-bound tasks, the opposite is true.)

The ideal solution (if you must block) is an “as huge as memory allows” thread/process pool.

If you run blocking tasks on a bounded pool, you could have deadlocks, and you would not maximize tasks outstanding. Still, as memory pressure arrives, it would be better to start making some tasks wait than it would be to exhaust memory. Apps inevitably become pathological when memory is exhausted (either you have swap and performance goes to hell, or you don’t have swap and an out-of-memory exception breaks the app). But as long as memory is available, it’s better to add threads to the pool than it is to queue tasks.

An automatic solution to this problem might require special help from the JVM and/or the OS. You’d want to have an idea about whether it’s reasonable to create another thread, in light of each thread’s memory usage, the amount of memory free, and whether you can GC to recover memory.

In practice, you have to do some amount of manual tuning and configuration to get a thread pool setup that works in practice for a particular deployment. Maybe setting a large, but fixed, thread pool size that happens to keep your app using about the right amount of memory.

Different tasks, different pools

It’s broken to dispatch nonblocking tasks to an unbounded (or large) pool, and broken to dispatch blocking tasks to a small bounded pool. I can’t think of a nice way to handle both kinds of task with the same pool.

Possible conclusion: a dispatch API should permit the application to treat the two differently.

Physical resource coordination requires global state

We’ve all been taught to avoid global variables, global state, and singletons. They cause a lot of trouble, for sure.

Assuming your app runs on a single machine, you have N CPUs on your computer — period. You can’t create a new “CPUs context” with your own CPUs every time you need more CPUs. You have N megabytes of RAM — period. You can’t create a new “RAM context” with its own RAM.

These resources are shared among all threads and processes. Thus you need global, coordinated task dispatch.

Nonblocking IO creates another global resource

With nonblocking IO APIs, such as poll(), you can have multiple IO operations outstanding, using only one thread or process.

However, to use poll() or equivalent you have a new problem: every IO operation (on the same thread) must be coordinated so that the file descriptors end up in a single call to poll(). The system for coordinating this is called an “event loop” or “main loop.”

In an API such as libev or GMainLoop, applications can say “I need to wake up in 3 seconds” or “I need to know about activity on this socket handle,” and the library aggregates all such requests into a single invocation of poll(). The single poll() puts the thread to sleep until one of the requests is ready.

Nonblocking IO requires a globally-coordinated “managed poll()” — also known as an event loop. Otherwise you’re back to needing threads.

How Java sucks at this

In brief:

  1. No global task dispatcher to coordinate CPU and memory usage.
  2. The APIs are mostly blocking.
  3. The nonblocking APIs in nio have limited utility because there’s no global event loop.

1. No global dispatcher

Java has all sorts of nice executors allowing you to dispatch tasks in many different ways.

But for average apps doing average things, we need two global singleton executors, not a zillion ways to create our own.

An average app needs the executor for nonblocking CPU-bound tasks, so that executor can coordinate CPU-limited tasks. And it needs the executor for blocking tasks, so that executor can coordinate memory-limited tasks.

In the JVM ecosystem, you start using a library for X and a library for Y, and each one starts up some tasks. Because there’s no global executor, each one creates its own. All those per-library executors are probably great by themselves, but running them together sucks. You may never create a thread by hand, but when you run your app there are 100 threads from a dozen different libraries.

2. Blocking APIs

With the standard Java APIs, many things are hard or impossible to do without tying up a thread waiting on IO or waiting on a lock. If you want to open a URL, there’s URL.openStream() right there in the standard library, but if you want to open a URL without blocking you’ll end up with a far more involved external dependency (such as AsyncHttpClient).

Just to kick you while you’re down, many of the external dependencies you might use for nonblocking IO will create at least one dedicated thread, if not a whole thread pool. You’ll need to figure out how to configure it.

3. No event loop

Low-level nonblocking APIs in the spirit of poll() are not enough. Even if every library or code module uses poll() to multiplex IO channels, each library or code module needs its own thread in which to invoke poll().

In Java, a facility as simple as Timer has to spawn its own threads. On platforms with an event loop, such as node.js, or browsers, or many UI toolkits, you tell the platform how long to wait, and it ensures that a single, central poll() has the right timeout to wake up and notify you. Timer needs a thread (or two) because there’s no centralized event loop.

The impact in practice

If you just use Java and Scala APIs naively in the most convenient way, you end up with a whole lot of threads. Then you have to start tracking down thread pools inside of all your libraries, sharing them when possible, and tuning their settings to match the size of your hardware and your actual production load. Or just buy a single piece of hardware more powerful than you’ll ever need and allow the code to be inefficient (not a rare strategy).

I recently wrote a demo app called Web Words, and even though it’s not complex, it shows off this problem well. Separately, the libraries it uses (such as Akka, AsyncHttpClient, Scala parallel collections, RabbitMQ) are well-behaved. Together, there are too many threads, resulting in far more memory usage than should be required, and inefficient parallelization of the CPU-bound work.

This is a whole category of developer tedium that should not exist. It’s an accident of broken, legacy platform design.

The node.js solution

node.js has a simple solution: don’t have any APIs that block. Implement all nonblocking APIs on top of a singleton, standard event loop. Run one process per CPU. Done.

Dispatch of blocking tasks is inherently hard, so node.js makes it impossible to implement a blocking task and avoids the problem.

This would fail without the global singleton event loop. If node.js provided poll() instead of an event loop, poll() would be a blocking API, and any task using it would take over the node.js process.

People often say that “no threads” is the secret to node.js; my take is that first the global singleton event loop enables all APIs to be nonblocking; second the lack of blocking APIs removes the need for threads. The global singleton event loop is the “root cause” which unlocks the big picture.

(The snarky among you may be thinking, “if you like node.js so much, why don’t you marry it?”; node.js is great, but I love a lot of things about Scala too. Type safety goes without saying, and I’ll also take actors over callbacks, and lots more. Comparing one aspect of the two platforms here.)

Event loop as a third-party library: not a solution

You can get event loop libraries for lots of languages. This is in no way equivalent to a standard, default global singleton event loop.

First: the whole point of an event loop is to share a single call to poll() among all parts of the application. If the event loop library doesn’t give you a singleton, then every code module has to create its own loop with its own poll().

Second: if the event loop singleton is not in the standard library, then the platform’s standard library can’t use it. Which means the standard library either has no IO facilities, or it has only broken, blocking IO facilities.

Solving it in Scala

This could be solved on the Java level also, and maybe people are planning to do so — I hope so.

In the meantime, if you’ve read this far, you can probably guess what I’d propose for Scala.

Blocking tasks are inherently somewhat intractable, but they are also a legacy fact of life on the JVM. My suggested philosophy: design APIs assuming nonblocking tasks, but give people tools to manage blocking tasks as best we can.

The critical missing pieces (the first three here) should not be a lot of work in a strictly technical sense; it’s more a matter of getting the right people interested in understanding the problem and powering through the logistics of landing a patch.

1. Two global singleton thread pools

The Scala standard library should manage one thread pool intended for CPU-bound nonblocking tasks, and one intended for blocking tasks.

  • The simplest API would let you submit a function to be executed in one of the pools.
  • Another useful API would create an ExecutorService proxy that used the pools to obtain threads. ExecutorService.shutdown() and ExecutorService.awaitTermination() would have to work correctly: wait for tasks submitted through the proxy to complete, for example. But shutting down the proxy should not interfere with the underlying global thread pool. The proxy would be provided to legacy Java APIs that allow you set a custom ExecutorService.

Built-in Scala features such as actors and parallel collections should make use of these thread pools, of course.

2. A global default event loop instance

The goal is to allow various code modules to add/remove channels to be watched for IO events, and to set the timeout on poll() (or equivalent) to the earliest timeout requested by any code module.

The event loop can be very simple; remember, it’s just a way to build up a poll() invocation that “takes requests” from multiple code modules. More complexity can be assembled on top.

3. A standard Future trait

The standard Future in Scala doesn’t quite have what people need, so there’s a proliferation of third-party solutions, most of them similar in spirit. There’s even a wrapper around all the flavors of Future, called sff4s.

(Note: all of these Future are nonblocking, while Java’s Future requires you to block on it eventually.)

A standard Future is essential because it’s the “interoperability point” for nonblocking APIs. Without a good Future in the standard library, there can’t be good nonblocking APIs in the standard library itself. And without a standard Future, third-party nonblocking APIs don’t play nicely together.

4. Bonus Points and Optional Elaborations

await

In my opinion, the C# await operator with async keyword is the Right Thing. Microsoft understands that great async support has become a required language feature. At my last company, C. Scott Ananian built a similar continuation-passing-style feature on JavaScript generators and it worked great.

Scala has the shift/reset primitives for continuation-passing style, but it isn’t clear to me whether they could be used to build something like await, and if they could, there’s nothing in the standard library yet that does so.

await depends on a standard Future because it’s an operator on Future. In Scala, await would probably not be a language feature, it would be a library function that operated on Future. (Assuming the shift/reset language feature is sufficient to implement that.)

Nonblocking streams

Many, many Java APIs let you pass in or obtain an InputStream or an OutputStream. Like all blocking APIs, these are problematic. But there isn’t a great alternative to use when designing an API; you’d have to invent something custom. The alternative should exist, and be standard. The standard library should have a nonblocking version of URL.openStream(), too.

Actors

If you code exclusively with actors and write only nonblocking code inside your actors, there’s never a need to mess with dispatchers, futures, or event loops. In that sense, actors solve all the problems I’m describing here. However: to code exclusively with actors, you need libraries that implement actor-based alternatives to all blocking APIs. And those libraries in turn need to be implemented somehow. It can’t be actors all the way down. There’s also the matter of using existing, non-actor APIs.

Ideally, Akka, and associated actor-based APIs, would build on a standard task dispatch and event loop facility. Akka already builds on regular Java executors; it doesn’t need anything super-fancy.

An issue with actors today is that they’re allowed to block. My idea: there should be a way to mark a blocking actor as such. It would then be dispatched to the unbounded executor intended for blocking tasks. By default, Akka could dispatch to the bounded executor intended for nonblocking tasks. People who want to block in their actors could either mark them blocking so Akka knows, or (brokenly) configure their nonblocking task executor to be unbounded. If you do everything right (no blocking), it would work by default. Remember the theory “design for nonblocking by default, give people tools to damage-control blocking.”

JDBC and similar

Peter Hausel pointed out to me recently that JDBC only comes in blocking form. Now there’s a fun problem… There are quite a few libraries with similar issues, along with people trying to solve them such as Brendan McAdams’s Hammersmith effort for MongoDB.

What’s the big deal?

Developers on platforms with no good solution here are used to the status quo. When I first came to server-side development from client-side UI toolkit development, the lack of an event loop blew my mind. I’m sure I thought something impolitic like How can anybody use anything so freaking broken??? In practice of course it turns out to be a manageable issue (obviously lots of stuff works great running on the JVM), but it could be better.

In my opinion, with node.js, C# adding async and await to the language, and so on, good answers in this area are now mandatory. People are going to know there’s a better way and complain when they get stuck using a platform that works the old way.

A high-quality application development platform has to address global task dispatch coordination, have an event loop, and offer nice nonblocking APIs.

Related posts…

Some old posts on related topics.