
Libtorrent experience - the poor state of async disk IO - willvarfar
http://www.rasterbar.com/libtorrent_blog/2012/10/asynchronous-disk-io/
======
arielweisberg
My experience with Linux and buffered IO (ext4) from multiple threads has been
very positive. The only beef I have is that you can't prevent the data you
write/read from polluting the page cache without resorting to madvise, which
isn't available from Java. I don't usually care about the contents of the page
cache, so it isn't a showstopper.
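(For what it's worth, the hint described here is reachable from C as posix_fadvise(2), the file-offset cousin of madvise, and Python exposes it as os.posix_fadvise; Java indeed has no standard equivalent. A minimal Linux/POSIX sketch of dropping freshly written bulk data from the page cache:)

```python
import os, tempfile

# Sketch (Linux/POSIX): after writing bulk data, tell the kernel the
# pages won't be reused so they can be dropped from the page cache
# instead of evicting hotter data. posix_fadvise is the file-offset
# analogue of madvise; Python exposes it as os.posix_fadvise (3.3+).
fd, path = tempfile.mkstemp()
try:
    written = os.write(fd, b"x" * (1 << 20))   # 1 MiB of bulk data
    os.fsync(fd)                               # pages must be clean before DONTNEED helps
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)
    os.remove(path)
```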

You can do hundreds of thousands of random reads a second from a single thread
submitting tasks to a thread pool on an in-memory data set. You can do tens of
thousands of reads on a larger-than-memory data set with an SSD, and I was
able to get the advertised number of 4K IOPS out of the SSD (Crucial m4) with
an Intel i5 desktop CPU.
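(A minimal sketch of that pattern, assuming a POSIX pread: one thread submits random 4 KiB reads to a pool, and positional reads mean workers never contend on a shared file offset:)

```python
import os, random, tempfile
from concurrent.futures import ThreadPoolExecutor

# Sketch: a single submitter hands random 4 KiB reads to a thread pool.
# os.pread reads at an explicit offset, so no worker needs to seek and
# no file position is shared between threads.
BLOCK = 4096
NBLOCKS = 256

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(BLOCK * NBLOCKS))       # 1 MiB test file
    path = f.name

fd = os.open(path, os.O_RDONLY)

def read_block(i):
    return os.pread(fd, BLOCK, i * BLOCK)      # positional read, no seek

with ThreadPoolExecutor(max_workers=8) as pool:
    offsets = [random.randrange(NBLOCKS) for _ in range(1000)]
    results = list(pool.map(read_block, offsets))

os.close(fd)
os.remove(path)
```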

I frequently have to multiplex data as it becomes available into a single file
(to keep the IO sequential for the disk and filesystem). I always use a thread
per file, and I got up to 250 megabytes/sec on a 4-disk RAID-0. I don't
currently have a use case needing more sequential write throughput than that,
so I haven't tried attaching more disks, and SSDs weren't as fast or common at
the time.

My reading of buffered IO in Linux is that it translates to a combination of
page cache interactions and async IO under the hood, so technically we are
always using async IO.

~~~
wmf
Yeah, I would be more interested in hearing why libtorrent feels the need to
implement their own disk cache. I'm sure they can do better, but by how much?

~~~
jeffffff
I actually very much doubt that they can do better. Writing any sort of large
LRU cache on a machine with swap turned on is a bad idea, because a lot of
your cache will get swapped out and then swapped back in unnecessarily when
you try to use that memory for something else. mlock can be used to mitigate
this effect, but the default mlock limit is so low that it's useless in
practice. Another thing to consider is that using mmap gives you a big
advantage over writing your own disk cache, because the OS can take advantage
of the paging hardware in the processor, which you can't do from userspace.
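(A small illustration of that point: with mmap, the kernel's page cache *is* the cache. Pages are faulted in by the MMU on first touch and can be reclaimed under memory pressure with no userspace bookkeeping. A Python sketch, assuming Linux where mmap.madvise is available:)

```python
import mmap, os, tempfile

# Sketch: map a file read-only and let the page cache back it. The
# first access faults the page in via the MMU; madvise(MADV_DONTNEED)
# hints the kernel may drop the pages, and a later access transparently
# faults them back in from disk -- no userspace cache code at all.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"piece-data" * 1000)
    path = f.name

fd = os.open(path, os.O_RDONLY)
m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
first = m[:10]                    # page fault brings data in
m.madvise(mmap.MADV_DONTNEED)     # kernel is free to drop the pages
again = m[:10]                    # faulted back in if they were dropped
m.close()
os.close(fd)
os.remove(path)
```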

~~~
arvidn
one reason the bittorrent layer can make better decisions about what to flush
is that it always needs to hash pieces that are downloaded. When blocks are
downloaded out of order, it doesn't make sense to flush unhashed blocks before
ones the hash cursor has already passed. If an unhashed block is flushed, it
will have to be read back from disk when the piece completes, which is very
expensive.
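(A hypothetical sketch of that flush policy -- the function name and shape are invented for illustration, not libtorrent's actual code. Blocks below the piece's hash cursor are already hashed and safe to flush; flushing a block at or past the cursor would force a read-back when the piece is hashed:)

```python
# Hypothetical sketch: order dirty block indices so already-hashed
# blocks (index < hash_cursor) are flushed first, keeping unhashed
# blocks in cache until the hash cursor passes them.
def flush_order(dirty_blocks, hash_cursor):
    """Return dirty block indices, hashed blocks first."""
    hashed = sorted(b for b in dirty_blocks if b < hash_cursor)
    unhashed = sorted(b for b in dirty_blocks if b >= hash_cursor)
    return hashed + unhashed

# Blocks 0-2 were hashed (cursor at 3); 5 and 7 arrived out of order.
order = flush_order({5, 0, 2, 7, 1}, hash_cursor=3)
```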

------
evmar
(Copy'n'paste of reddit comment:)

While it's true that the Windows API seems to be the best thought through, I
was surprised to learn that the implementation may randomly fall back to
synchronous IO in unpredictable ways, which (depending on the app, but likely
for something that's attempting to juggle a lot of work like a bittorrent
implementation) means you need a thread pool anyway.

<http://neugierig.org/software/blog/2011/12/nonblocking-disk-io.html>

~~~
mey
Interesting, I've personally experienced this bug multiple times and always
been surprised when it happens since Chrome is normally extremely responsive
UI wise even when under load.

~~~
derleth
My experience is the opposite: When Chrome starts to hit the disk, it freezes,
especially on startup.

------
dclusin
Another good read about increasing disk IO throughput and reducing latency is
the Mechanical Sympathy blog post on the single writer principle:
<http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html>

We ran into issues where lots of threads attempting disk writes were causing
latency problems. We were able to get around this by having a single thread
dedicated to disk IO.
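(A minimal sketch of that single-writer arrangement: producers enqueue buffers and exactly one dedicated thread touches the file, so writes stay sequential and uncontended:)

```python
import os, queue, tempfile, threading

# Sketch of the single-writer principle: many producers enqueue buffers,
# and one dedicated thread performs every disk write, so the on-disk
# access pattern stays sequential and there is no write contention.
fd, path = tempfile.mkstemp()
os.close(fd)
q = queue.Queue()

def writer():
    with open(path, "wb") as f:
        while True:
            buf = q.get()
            if buf is None:            # sentinel: shut down the writer
                break
            f.write(buf)

t = threading.Thread(target=writer)
t.start()
for i in range(10):                    # any number of producers could do this
    q.put(b"chunk-%d;" % i)
q.put(None)
t.join()
size = os.path.getsize(path)
os.remove(path)
```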

------
wizard_2
I believe even NodeJS's libuv came to the same unfortunate conclusion for
non-Windows hosts.

<https://github.com/joyent/libuv>
<http://nikhilm.github.com/uvbook/filesystem.html>

~~~
Tobu
When node switched to libuv it didn't degrade IO performance on unix hosts;
IO in threads has been competitive with evented IO. I don't think async IO
was even considered, but let me know if you dig up anything.

~~~
tedsuo
yep, node has always used a threadpool for filesystem calls, even before
libuv.

------
tytso
The problem is a chicken-and-egg one. Very few programs (including most
enterprise databases) use AIO. Why? For portability reasons; there are other
ways of doing things (i.e., using thread pools) that allow a database
developer to get all or most of the benefits of AIO, at least before the days
of super fast storage. (Now that we have really fast PCIe-attached flash,
which is fast enough that scheduler overhead starts becoming a real problem,
this may no longer be true.)

As a result, there is little incentive to improve AIO (on all systems, but
especially on Linux --- a lot of the Direct I/O work was done to make the
enterprise database vendors happy). And since AIO isn't good enough, very few
people want to use it; since making it better is difficult, few people are
interested in working to solve the problem; and the cycle repeats.

------
mattgreenrocks
Nice write-up. I suspect the poor implementation of async I/O reflects how
rarely it is actually used in practice. Signal handling definitely feels like
the wrong design here, especially for a library author.

I'm also not surprised that Windows fared better here; with IOCP they had a
chance to redo async I/O completely.

~~~
wmf
Or the disuse of async disk I/O is due to the difficulty of its proper
implementation.

~~~
shrughes
It's also the extremely limited use case. There are only so many I/O requests
you can have outstanding to a disk at a time while still getting a performance
speed-up. Having to use a thread pool ends up not being such a problem -- you
don't really hurt your performance benchmarks. On the other hand, a system
that talks over a network to thousands or millions of clients benefits greatly
from avoiding 4-8KB of stack overhead per connection.

~~~
throwaway54-762
4-8kB? Maybe physical memory overhead, if your code isn't too deep. But
userspace thread stacks are anywhere from 128kB (FreeBSD) to 8MB (Linux) of
virtual memory overhead.

~~~
shrughes
Stacks can be made to be 4KB or 8KB if you want them to be.

~~~
throwaway54-762
Depends on what libc (or other) routines you call... going over the end of the
stack is no fun. Lots of code seems to be written to rely on deep stacks in
userspace.

------
j_s
Alan McGovern chose a compromise for MonoTorrent, using async IO but
processing all the results in a single thread.

The Evolution of MonoTorrent - FOSDEM 2010

[http://www.youtube.com/watch?v=TbhKpeqIy8o&t=10m10s](http://www.youtube.com/watch?v=TbhKpeqIy8o&t=10m10s)

Simplified Threading API

<http://monotorrent.blogspot.com/2008/10/monotorrent-050-good-bad-and-seriously.html>

------
VMG
the line breaks make it very difficult to read for me - here's a copy of the
text: <https://gist.github.com/3960408>

------
gwern
> The aio branch has several performance improvements apart from allowing
> multiple disk operations outstanding at any given time. For instance:

This sounds like a bad idea. If the improvements aren't tied to async, why
weigh them down with the async albatross instead of merging them into the
mainline?

~~~
klodolph
No problem, that's why we have cherry-pick.

------
freyrs3
libeio makes some strides in this direction.

<http://software.schmorp.de/pkg/libeio.html>

~~~
willvarfar
by using a thread pool, right?

------
cmccabe
This is a good write-up in general, but it fails to mention that under glibc,
POSIX AIO is implemented with a thread pool anyway. Only native
(non-portable) Linux AIO is implemented by the kernel.

In general, unless you're doing something super high-performance, you should
not bother with AIO. It's one of those "if you have to ask, you don't need to
know" situations.

