Task Dispatch and Nonblocking IO in Scala
by havoc
TL;DR
Modern application development platforms are addressing the related issues of globally-coordinated task dispatch and nonblocking IO.
Here’s my definition of the problem, an argument for why it matters, and some suggestions for specific standard library features to add to Scala in particular.
The same ideas apply to any application development platform, though. It’s rapidly becoming mandatory for a competitive platform to offer an answer here.
Definitions
Let’s define a blocking task to be anything that ties up a thread or process but does not use CPU cycles. The most common ways to block are on IO channels and locks.
A busy loop is not a blocking operation in this sense; it takes up a thread, but it’s using the CPU, not “wasting” the thread.
By “task” I mean any piece of executable code. A task is blocking if it spends part of its time waiting, or nonblocking if it needs the CPU the whole time.
Dispatch just means scheduling the task on a thread and executing it.
Dispatch for nonblocking tasks, in an ideal world
For nonblocking tasks (which are CPU-bound), the goal is to use 100% of all CPU cores. There are two ways to lose:
- Fail to use all the cores (not enough threads or processes).
- Too many threads for the number of cores (inefficient and wastes memory).
The ideal solution is a fixed thread or process pool with a number of threads related to the number of cores. This fixed pool must be global to the app and used for all nonblocking tasks. If you have five libraries in your app and they each create a thread per CPU core, you’re losing, even though each library’s approach makes sense in isolation.
When the fixed number of threads are all in use, tasks should be queued up for later dispatch.
Dispatch for blocking tasks, in an ideal world
Blocking tasks pose some competing concerns; the trouble with blocking tasks is that these concerns are hard to balance.
- Memory: each blocking task ties up a thread, which adds overhead (the thread) to the task. A super-tiny http parser gets you nowhere if you accompany each one with a thread.
- Deadlocks: blocking tasks are often waiting for another blocking task. Limiting the number of threads can easily create deadlocks.
- Tasks outstanding: with IO, it is desirable to send lots of requests at once or have lots of sockets open at once. (With CPU-bound tasks, the opposite is true.)
The ideal solution (if you must block) is an “as huge as memory allows” thread/process pool.
If you run blocking tasks on a bounded pool, you could have deadlocks, and you would not maximize tasks outstanding. Still, as memory pressure arrives, it would be better to start making some tasks wait than it would be to exhaust memory. Apps inevitably become pathological when memory is exhausted (either you have swap and performance goes to hell, or you don’t have swap and an out-of-memory exception breaks the app). But as long as memory is available, it’s better to add threads to the pool than it is to queue tasks.
An automatic solution to this problem might require special help from the JVM and/or the OS. You’d want to have an idea about whether it’s reasonable to create another thread, in light of each thread’s memory usage, the amount of memory free, and whether you can GC to recover memory.
In practice, you have to do some amount of manual tuning and configuration to get a thread pool setup that works in practice for a particular deployment. Maybe setting a large, but fixed, thread pool size that happens to keep your app using about the right amount of memory.
Different tasks, different pools
It’s broken to dispatch nonblocking tasks to an unbounded (or large) pool, and broken to dispatch blocking tasks to a small bounded pool. I can’t think of a nice way to handle both kinds of task with the same pool.
Possible conclusion: a dispatch API should permit the application to treat the two differently.
Physical resource coordination requires global state
We’ve all been taught to avoid global variables, global state, and singletons. They cause a lot of trouble, for sure.
Assuming your app runs on a single machine, you have N CPUs on your computer – period. You can’t create a new “CPUs context” with your own CPUs every time you need more CPUs. You have N megabytes of RAM – period. You can’t create a new “RAM context” with its own RAM.
These resources are shared among all threads and processes. Thus you need global, coordinated task dispatch.
Nonblocking IO creates another global resource
With nonblocking IO APIs, such as poll(), you can have multiple IO operations outstanding, using only one thread or process.
However, to use poll()
or equivalent you have a new problem: every IO operation (on the same thread) must be coordinated so that the file descriptors end up in a single call to poll(). The system for coordinating this is called an “event loop” or “main loop.”
In an API such as libev or GMainLoop, applications can say “I need to wake up in 3 seconds” or “I need to know about activity on this socket handle,” and the library aggregates all such requests into a single invocation of poll()
. The single poll()
puts the thread to sleep until one of the requests is ready.
Nonblocking IO requires a globally-coordinated “managed poll()” — also known as an event loop. Otherwise you’re back to needing threads.
How Java sucks at this
In brief:
- No global task dispatcher to coordinate CPU and memory usage.
- The APIs are mostly blocking.
- The nonblocking APIs in nio have limited utility because there’s no global event loop.
1. No global dispatcher
Java has all sorts of nice executors allowing you to dispatch tasks in many different ways.
But for average apps doing average things, we need two global singleton executors, not a zillion ways to create our own.
An average app needs the executor for nonblocking CPU-bound tasks, so that executor can coordinate CPU-limited tasks. And it needs the executor for blocking tasks, so that executor can coordinate memory-limited tasks.
In the JVM ecosystem, you start using a library for X and a library for Y, and each one starts up some tasks. Because there’s no global executor, each one creates its own. All those per-library executors are probably great by themselves, but running them together sucks. You may never create a thread by hand, but when you run your app there are 100 threads from a dozen different libraries.
2. Blocking APIs
With the standard Java APIs, many things are hard or impossible to do without tying up a thread waiting on IO or waiting on a lock. If you want to open a URL, there’s URL.openStream()
right there in the standard library, but if you want to open a URL without blocking you’ll end up with a far more involved external dependency (such as AsyncHttpClient).
Just to kick you while you’re down, many of the external dependencies you might use for nonblocking IO will create at least one dedicated thread, if not a whole thread pool. You’ll need to figure out how to configure it.
3. No event loop
Low-level nonblocking APIs in the spirit of poll()
are not enough. Even if every library or code module uses poll()
to multiplex IO channels, each library or code module needs its own thread in which to invoke poll()
.
In Java, a facility as simple as Timer has to spawn its own threads. On platforms with an event loop, such as node.js, or browsers, or many UI toolkits, you tell the platform how long to wait, and it ensures that a single, central poll()
has the right timeout to wake up and notify you. Timer needs a thread (or two) because there’s no centralized event loop.
The impact in practice
If you just use Java and Scala APIs naively in the most convenient way, you end up with a whole lot of threads. Then you have to start tracking down thread pools inside of all your libraries, sharing them when possible, and tuning their settings to match the size of your hardware and your actual production load. Or just buy a single piece of hardware more powerful than you’ll ever need and allow the code to be inefficient (not a rare strategy).
I recently wrote a demo app called Web Words, and even though it’s not complex, it shows off this problem well. Separately, the libraries it uses (such as Akka, AsyncHttpClient, Scala parallel collections, RabbitMQ) are well-behaved. Together, there are too many threads, resulting in far more memory usage than should be required, and inefficient parallelization of the CPU-bound work.
This is a whole category of developer tedium that should not exist. It’s an accident of broken, legacy platform design.
The node.js solution
node.js has a simple solution: don’t have any APIs that block. Implement all nonblocking APIs on top of a singleton, standard event loop. Run one process per CPU. Done.
Dispatch of blocking tasks is inherently hard, so node.js makes it impossible to implement a blocking task and avoids the problem.
This would fail without the global singleton event loop. If node.js provided poll()
instead of an event loop, poll()
would be a blocking API, and any task using it would take over the node.js process.
People often say that “no threads” is the secret to node.js; my take is that first the global singleton event loop enables all APIs to be nonblocking; second the lack of blocking APIs removes the need for threads. The global singleton event loop is the “root cause” which unlocks the big picture.
(The snarky among you may be thinking, “if you like node.js so much, why don’t you marry it?”; node.js is great, but I love a lot of things about Scala too. Type safety goes without saying, and I’ll also take actors over callbacks, and lots more. Comparing one aspect of the two platforms here.)
Event loop as a third-party library: not a solution
You can get event loop libraries for lots of languages. This is in no way equivalent to a standard, default global singleton event loop.
First: the whole point of an event loop is to share a single call to poll()
among all parts of the application. If the event loop library doesn’t give you a singleton, then every code module has to create its own loop with its own poll()
.
Second: if the event loop singleton is not in the standard library, then the platform’s standard library can’t use it. Which means the standard library either has no IO facilities, or it has only broken, blocking IO facilities.
Solving it in Scala
This could be solved on the Java level also, and maybe people are planning to do so — I hope so.
In the meantime, if you’ve read this far, you can probably guess what I’d propose for Scala.
Blocking tasks are inherently somewhat intractable, but they are also a legacy fact of life on the JVM. My suggested philosophy: design APIs assuming nonblocking tasks, but give people tools to manage blocking tasks as best we can.
The critical missing pieces (the first three here) should not be a lot of work in a strictly technical sense; it’s more a matter of getting the right people interested in understanding the problem and powering through the logistics of landing a patch.
1. Two global singleton thread pools
The Scala standard library should manage one thread pool intended for CPU-bound nonblocking tasks, and one intended for blocking tasks.
- The simplest API would let you submit a function to be executed in one of the pools.
- Another useful API would create an
ExecutorService
proxy that used the pools to obtain threads.ExecutorService.shutdown()
andExecutorService.awaitTermination()
would have to work correctly: wait for tasks submitted through the proxy to complete, for example. But shutting down the proxy should not interfere with the underlying global thread pool. The proxy would be provided to legacy Java APIs that allow you set a custom ExecutorService.
Built-in Scala features such as actors and parallel collections should make use of these thread pools, of course.
2. A global default event loop instance
The goal is to allow various code modules to add/remove channels to be watched for IO events, and to set the timeout on poll()
(or equivalent) to the earliest timeout requested by any code module.
The event loop can be very simple; remember, it’s just a way to build up a poll()
invocation that “takes requests” from multiple code modules. More complexity can be assembled on top.
3. A standard Future trait
The standard Future in Scala doesn’t quite have what people need, so there’s a proliferation of third-party solutions, most of them similar in spirit. There’s even a wrapper
around all the flavors of Future
, called sff4s.
(Note: all of these Future
are nonblocking, while Java’s Future requires you to block on it eventually.)
A standard Future
is essential because it’s the “interoperability point” for nonblocking APIs. Without a good Future
in the standard library, there can’t be good nonblocking APIs in the standard library itself. And without a standard Future
, third-party nonblocking APIs don’t play nicely together.
4. Bonus Points and Optional Elaborations
await
In my opinion, the C# await operator with async
keyword is the Right Thing. Microsoft understands that great async support has become a required language feature. At my last company, C. Scott Ananian built a similar continuation-passing-style feature on JavaScript generators and it worked great.
Scala has the shift/reset primitives for continuation-passing style, but it isn’t clear to me whether they could be used to build something like await
, and if they could, there’s nothing in the standard library yet that does so.
await
depends on a standard Future
because it’s an operator on Future
. In Scala, await
would probably not be a language feature, it would be a library function that operated on Future
. (Assuming the shift/reset language feature is sufficient to implement that.)
Nonblocking streams
Many, many Java APIs let you pass in or obtain an InputStream
or an OutputStream
. Like all blocking APIs, these are problematic. But there isn’t a great alternative to use when designing an API; you’d have to invent something custom. The alternative should exist, and be standard. The standard library should have a nonblocking version of URL.openStream()
, too.
Actors
If you code exclusively with actors and write only nonblocking code inside your actors, there’s never a need to mess with dispatchers, futures, or event loops. In that sense, actors solve all the problems I’m describing here. However: to code exclusively with actors, you need libraries that implement actor-based alternatives to all blocking APIs. And those libraries in turn need to be implemented somehow. It can’t be actors all the way down. There’s also the matter of using existing, non-actor APIs.
Ideally, Akka, and associated actor-based APIs, would build on a standard task dispatch and event loop facility. Akka already builds on regular Java executors; it doesn’t need anything super-fancy.
An issue with actors today is that they’re allowed to block. My idea: there should be a way to mark a blocking actor as such. It would then be dispatched to the unbounded executor intended for blocking tasks. By default, Akka could dispatch to the bounded executor intended for nonblocking tasks. People who want to block in their actors could either mark them blocking so Akka knows, or (brokenly) configure their nonblocking task executor to be unbounded. If you do everything right (no blocking), it would work by default. Remember the theory “design for nonblocking by default, give people tools to damage-control blocking.”
JDBC and similar
Peter Hausel pointed out to me recently that JDBC only comes in blocking form. Now there’s a fun problem… There are quite a few libraries with similar issues, along with people trying to solve them such as Brendan McAdams‘s Hammersmith effort for MongoDB.
What’s the big deal?
Developers on platforms with no good solution here are used to the status quo. When I first came to server-side development from client-side UI toolkit development, the lack of an event loop blew my mind. I’m sure I thought something impolitic like How can anybody use anything so freaking broken??? In practice of course it turns out to be a manageable issue (obviously lots of stuff works great running on the JVM), but it could be better.
In my opinion, with node.js, C# adding async
and await
to the language, and so on, good answers in this area are now mandatory. People are going to know there’s a better way and complain when they get stuck using a platform that works the old way.
A high-quality application development platform has to address global task dispatch coordination, have an event loop, and offer nice nonblocking APIs.
Related posts…
Some old posts on related topics.
- Guidelines for sync and async callbacks in nonblocking APIs.
- I wrote a node.js-like code experiment that shared one event loop among many threads and JavaScript global objects, there’s some discussion of using JS yield like C# await in that post.
- I claimed offhand above that out-of-memory exceptions always break apps, here’s why I think that.
Have you looked into vert.x? https://github.com/purplefox/vert.x
I realize you have a substantial investment in Scala, so this sort of response isn’t likely to be persuasive; nevertheless…
Every time I read a blog post like this, I find myself puzzled that people still agonize over this stuff. Just use Go[1]. Problem solved.
With Go you’ve got “dispatch of blocking tasks” built in (maybe it’s “inherently hard”, but it’s also a solved problem), which means you can write straightforward blocking I/O and lock-obtaining code, with no messy callbacks *or* actors. (In practice channels are usually better than locks, but for our purposes here they’re the same — they both block.) Looking at node.js or Scala (or even Erlang) now gives me those same impolitic thoughts.
[1]: http://golang.org/
It does look very interesting, I hadn’t read about goroutines. I’ll have to dig in more. Curious what it does behind the scenes.
“Just use Go” is maybe oversimplified – there are plenty of factors when choosing a language other than this one topic!
hmm, I realized maybe the “inherently hard” wasn’t clear. What I’m talking about there is that if you block on the OS level (your thread is stopped by the kernel until a system call returns) then you inherently have a thread per IO system call. Dealing with that is the inherently hard because N outstanding IO requests = N threads.
I’m guessing Go makes it _look_ like you’re doing blocking IO, but really it’s doing nonblocking IO for you and making it nice and blocking-looking… or if that’s not it, I’m very interested in what’s going on!
Yes, Go, just like Erlang “makes it look like it’s blocking” but it’s not …
You might want to read up on how the GHC Haskell compiler/runtime tackles this, especially since the 7.x series (in the new so-called ‘I/O Manager’).
Some links:
http://blog.johantibell.com/2010/09/final-version-of-our-ghc-io-manager.html
http://blog.johantibell.com/2010/01/scalable-timeout-support-for-ghcs-io.html
GHC has support for ‘forkIO’ for quite a long time, which does *exactly* what you want in your ideal world =).
Nice article. I don’t work in the JVM world, but the points apply fairly generally.
Great article. Here’s my take on implementing await in Scala using shift/reset: https://gist.github.com/1392983
Interesting article, but you’ve missed out an important NIO-based design option that we went out of our way to support during JSR-51.
Rather than having a single global selector driving a single global event queue, you can have multiple, per-CPU selectors driving corresponding per-CPU event queues, each in a separate thread.
The point that is often missed is that each thread can select on the same listening socket without any additional synchronization. Mutual exclusion is only required when the listening socket is ready and accept is actually called on it.
This allows a server to operate per-CPU event queues over per-CPU pools of active sockets with only minimal interaction between the queues: around calls to accept on the shared listening socket where, crucially, the listening socket is known to be ready for a call on accept.
This kind of design is inherently load-balanced between CPUs at the per-connection level, because the more activity any given per-CPU event-loop has, the less frequently it will tend to select, making it more likely that less busy per-CPU event-loops will pick up any connection backlog.
Obviously a fully fleshed out design needs to dot a lot of i’s a cross a lot of t’s, and it’s almost inevitable that more interaction would be required than the minimum I’ve described. Nevertheless, a design along these lines is the way to go for best performance on commodity hardware. It has all of the structural benefits of a Node.js type solution, but within a single process. The general template for this style comes from Richard Stevens, “Prethreaded Sever, per-Thread accept” model in his Unix Network Programming, Vol. 1.
The reason that this style isn’t adopted more often is that to derive it’s full benefits you have to code in a uniformly event-driven style. In the absence of tools to support this the programmer has to manually transform their entire application into continuation passing style which is, frankly, a horrible thing to be forced to do (given that I’m partly responsible for NIO, people who’ve suffered this pain will doubtless be glad to hear that I was the first to suffer it, while working with early NIO prototypes).
And this is where Scala has something important to offer: compiler-level support for continuations. They’re not perfect, but they can make it very much easier to write event driven applications. I really wish I had the time to explore this space properly … it’s this kind of problem that drew me to Scala in the first place.
There was a talk at ScalaDays about async IO:
http://days2011.scala-lang.org/node/138/288
[…] check out this cool post about Task Dispatch and Nonblocking IO in Scala @havocp Feel free to drop me a mail or message @markglh on twitter with any Scala news! Share […]
I have been experimenting with non-blocking IO using Akka for many of the same reasons you stated, and support for it will probably be available in Akka 2.0 (with a disclaimer about the API being experimental and subject to change). At the moment the API is based on a simple iteratee implementation.
Some examples of what I have built with it:
Redis client https://github.com/derekjw/fyrie-redis
HTTP server https://github.com/derekjw/akka-http
With Haskell’s type system it is not too late to add something like this. You could build a new monad, say NIO, which only has non-blocking IO. You could have this single event loop inside this monad, and the compiler could verify you never did anything blocking. Seems like a nice project, actually.
So basically… Grand Central Dispatch? http://arstechnica.com/apple/reviews/2009/08/mac-os-x-10-6.ars/12
I haven’t done any OS X programming but it sounds like they’re solving this problem not just global to a process but global to the OS? nice!
Fortunately for server side Java I guess people tend to run one JVM per server so coordinating multiple processes might not be needed. Still, given a thing like GCD a global dispatcher could theoretically use it …
[…] it. Recently, I’ve noticed increased discussion around implementation choices as best fit for blocking vs non blocking tasks which is both interesting AND useful. Personally, I’m intrgiued to see how efforts like ql.io […]