

Why Virtual Machines suck when you run them from BTRFS files system - sagarun
http://lists.fedoraproject.org/pipermail/devel/2011-July/154251.html

======
aliguori
This rambles a bit. Here's the summary:

btrfs is currently optimized for normal applications that do open("foo",
O_RDWR). With this mode, integrity semantics are quite loose in POSIX.

Because VMs emulate physical hardware with strong integrity semantics, they
usually do either open("foo", O_DIRECT) or open("foo", O_SYNC).

btrfs sucks for O_SYNC. It's not just VMs, databases also tend to make heavy
use of O_SYNC.

~~~
gaius
Ironic that BTRFS is sponsored by Oracle then!

~~~
aliguori
A lot of filesystems don't optimize O_SYNC heavily until it becomes necessary.
ext4 had really bad O_SYNC performance until pretty recently, FWIW.

Given where BTRFS is right now development wise, it's not at all surprising
O_SYNC hasn't been optimized yet.

------
masklinn
So BTRFS is very efficient for big sequential reads (which you generally don't
care about much, because they're pretty fast in any case) and dies when
subjected to small random reads (which are the bane of platters in the first
place)... isn't that dumb for a general-purpose FS?

~~~
rbanffy
What I got from that is that BtrFS sucks at doing lots of small synchronous
writes, a workload that's relatively unique to VMs, while being a major
improvement over ext4 in just about everything else (feature set and
performance). In fact, the workload is so unusual it never popped up in the
tests they run regularly on every patch.

------
ScottBurson
Huh, that's funny. I've been running a VM out of a btrfs partition for months
and haven't seen these problems. It's not blindingly fast, but (a) the
partition is encrypted, and (b) the VM is running Windows with antivirus
software, so there are a couple of things other than btrfs slowing down the
write path. But I certainly haven't seen freezes such as those described in
this post.

------
rbanffy
I'm glad there are about a dozen different file systems that don't suck with
VM work and quite relieved BtrFS developers are actively working on improving
the case that hurts VM performance.

Having said that, I'd love to know whether there are automated tests within
the kernel that verify the integrity/correctness/performance of things like
filesystem drivers. Something like that could prevent surprising performance
regressions like this one and provide a better mapping between what you want
to do and how you should do it.

~~~
simcop2387
The closest thing I'm aware of for doing that currently is the Phoronix Test
Suite, or whatever they're calling it now. It's most certainly not complete,
but it's the only thing I'm aware of that can do any kind of regression
testing like that. In fact, it's been worked on recently to make it really
easy to do testing with git bisect on the kernel.

~~~
rbanffy
Going further in the thread they mention xfstests and that they run it against
every patch.

[http://xfs.org/index.php/Getting_the_latest_source_code#XFS_...](http://xfs.org/index.php/Getting_the_latest_source_code#XFS_tests)

and

[http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.gi...](http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.git;a=summary)

------
vivekl
I think the problem also depends on what kind of virtual disk you end up
using. Let me elaborate:

I don't think the problem is purely buffered vs. unbuffered IO. The guest
operating system will have performed some block coalescing anyway so the block
requests will often NOT be 4K chunks but should have slightly larger
granularity. However, if you use a COW based virtual disk layout like QCOW2
which I guess is standard in KVM, you may see additional scattered IO.

I think it is weird to be using COW virtual disk layouts on a file system that
natively supports COW as is the case with BTRFS. I would be curious to see
what the performance of raw sparse files on BTRFS is vs. qcow2 etc.

------
otterley
It sounds as though the same issues that make it perform suboptimally on VM
hypervisors would also make it perform suboptimally for OLTP databases -- in
both, the I/O patterns generally involve high numbers of small writes.

------
jhefter
This problem is similar to (and exacerbated by) the IO bottlenecks VMs
experience when using traditional hard drive disks, due to high levels of
random IO operations. For this reason, many new virtual setups are using solid
state drives, which have no seek time. This keeps the high level of random IO
operations from significantly impacting performance.

~~~
masklinn
> For this reason, many new virtual setups are using solid state drives, which
> have no seek time. This keeps the high level of random IO operations from
> significantly impacting performance.

Except for btrfs, where it would make the whole thing even less efficient
(because now the only cost is the waiting around for threads, not even the
random seek on your platters).

And as a result, I disagree with your "and exacerbated by". BTRFS's problem
becomes worse on SSDs (qualitatively) because the random read itself is almost
free, and _all_ of the cost is in the context switching done by the FS,
instead of only 80~90% of that cost.

------
ch0wn
Thanks for posting this. This is really important information when setting up
a new host to run VMs.

------
funkah
The quoted text is painful to read. I don't know why so many mailing list
pages have to look like this. At the very least could the linebreaks be taken
out?

~~~
sagarun
You could have clicked the "Previous message" link and viewed the original
message without quotes:
[http://lists.fedoraproject.org/pipermail/devel/2011-July/154...](http://lists.fedoraproject.org/pipermail/devel/2011-July/154250.html)

~~~
vdm
I'm with funkah. Mailing list archive pages haven't evolved in a decade.

------
rbanffy
Observing the dynamics of the list, I have to ask: who is JB and why is he/she
so worried about VM performance under BtrFS?

Fedora is not a Linux you recommend to someone who doesn't know what they are
doing, and if you know VM performance sucks with BtrFS, then please, by all
means, add another partition and use ext4 (or ext3, or ext2, or XFS, or
anything you think may offer you better performance).

