
Designing the Scylla Userspace Disk I/O Scheduler (Part 1) - mattiemass
http://www.scylladb.com/2016/04/14/io-scheduler-1/
======
mtanski
First. Seastar & Scylla are really impressive work. Props Avi & team.

Doing disk IO well from userspace is hard. There are obvious topics about
durability that have been covered on HN for years. Getting good performance
out of modern drives is one of those things that doesn't get covered enough.

Take a prosumer drive like the Samsung 950 Pro (M.2 form factor). It can do
1GB/s to 2GB/s of streaming transfer and anywhere from 100k to 300k IOPS. All
for about $180.

The system (kernel) interfaces and filesystems haven't really kept up. The
only async interface is via libaio and the io_submit syscall. If you've ever
worked with it you know the limitations: it only works with O_DIRECT, has all
sorts of requirements on your ops and very few guarantees. Whole classes of
filesystems will just block on submit. XFS probably does the best here (if you
have a recent kernel).

Once you go down this rabbit hole you're implementing your own page caching
and replacement algorithms. And finally you get to the point where you need to
worry about scheduling your IO, because if you push too many ops down to
the kernel your response times become unpredictable (see:
[https://lwn.net/Articles/682582/](https://lwn.net/Articles/682582/) [paid
till next week]).

Anyways, fascinating work & a fascinating write-up. Much nicer than another
rehash of yet another async framework that only handles small async network
requests.

~~~
hendzen
Avi has an earlier post [0] where he shows that (recent) XFS is the _only_
filesystem that actually executes io_submit asynchronously.

[http://www.scylladb.com/2016/02/09/qualifying-filesystems/](http://www.scylladb.com/2016/02/09/qualifying-filesystems/)

~~~
glommer
Actually, what Avi has demonstrated in this article is that XFS is the only
filesystem that executes it _mostly_ asynchronously.

Before we got started with the implementation of the I/O Scheduler (which we
eventually wanted anyway for prioritization), I saw await times as reported by
iostat as bad as 7s (truth be told, those weren't the best disks on the
planet).

That was basically XFS sleeping during io_submit due to the problem with
allocation groups that I briefly mentioned in this article.

If you limit the number of requests the filesystem is consuming, the problem
goes away to the point that we shifted our attention to other areas. But XFS
still has a couple of places where it will resort to synchronous behavior.

No Linux filesystem can execute io_submit completely asynchronously.

~~~
mtanski
I'm going to guess that your next step to chase further speed is going direct
to NVMe, just leveraging the block layer (per-core multiqueue) and skipping
the filesystem entirely.

~~~
seastarer
You're going to be right on that guess. It will take some time though.

------
henrikschroder
"However, since finding the right point through this method is both error-
prone and time-consuming (diskplorer can take ages to collect all points).
Scylla (and Seastar) now ships with scylla_io_setup (a wrapper around
Seastar’s iotune) tool, that helps users find out what the recommended
threshold is and configure the I/O scheduler properly."

It's a side note in the article, but a _fantastic_ idea. I wish every major
infrastructure component came with something like this, because the sad state
today is that given a piece of tech, everyone has to tune each installation
themselves, and there are a bazillion blog posts about each, all of them
containing conflicting information. And every time you move to a new setup
somewhere you have to remember all of that crap. Again.

~~~
glommer
There are other things that the I/O setup script will do as well, like making
sure your filesystem can handle async I/O properly and is fully patched, etc.
It is indeed designed to be run every time anything major changes in your
deployment (like if you deploy to other machines, add more disks, etc.).

------
dmarti
Part 2 of the article is at: [http://www.scylladb.com/2016/04/29/io-
scheduler-2/](http://www.scylladb.com/2016/04/29/io-scheduler-2/)

------
lpgauth
Curious, is anyone using ScyllaDB in production? Would love to get rid of
Cassandra in our stack.

~~~
ShrpErinaceidae
Can I ask why you'd love to get rid of it? (Not intended as a challenge to the
idea of getting rid of it, just curious why)

~~~
lpgauth
Mostly performance issues. Tuning the JVM is painful, there's no QoS,
compaction affects everything, no request timeouts... the list is long. So
far what I've seen from ScyllaDB is very promising.

