Linux w/ reiserfs 3.x vs a SUN SAN 7410 running solaris and ZFS (badcheese.com)
13 points by scumola on Oct 20, 2009 | hide | past | favorite | 24 comments


This is the meat:

Reiserfs does an "ls" in a directory with 6000 files in it in about 3-5 seconds. The SUN SAN does it in about 1-2 minutes. Serious problems here.

That has to be a bug, not merely slow filesystem behavior. Even doing a seek for every block (4k on ZFS?) at 10ms for a minute comes to 4kb of data per directory entry. That's ridiculous. Something is broken; probably an interaction between subsystems (hardware cache, software cache, ZFS filesystem, network filesystem, SAN configuration, etc...).


ZFS uses an extensible hash table for its directories, while most other filesystems use a B-tree. I'm no ZFS expert, but perhaps this method degrades for larger directories?

I'm not trying to be a troll, I'm trying to fix the performance problem.


    mkdir /tmp/foo
    cd /tmp/foo
    for ((i=0; i<6000; i++)); do touch "$i"; done
    time ls >/dev/null

    real    0m0.012s
    user    0m0.012s
    sys     0m0.000s
Perhaps try ext3?


Those are obviously in cache.
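For a fair comparison, the earlier test can be rerun cold. A minimal sketch, assuming Linux (the `drop_caches` write needs root; without it the line is skipped and the run stays warm-cache):

```shell
#!/bin/bash
# Recreate the test directory from the earlier comment.
mkdir -p /tmp/foo
cd /tmp/foo
for ((i = 0; i < 6000; i++)); do touch "$i"; done

# Flush dirty pages, then drop the page cache, dentries and inodes so the
# next ls has to read directory metadata from disk again.
sync
echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true

time ls > /dev/null
```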


I have run into this problem myself. It is actually a problem with the Solaris ls and related utilities, and not with ZFS. I solved it by compiling some relevant GNU versions of utilities like find, which have some important performance improvements on this front.

As for working with many small files, in my experience ZFS has been far better than xfs (I do not use ReiserFS due to previous stability issues I experienced with it). One particular example was a user who had over a million files in one directory. This caused all our backup software to fail. After moving to ZFS, I could send and receive these files between servers without problem. I could list the files easily enough as well, after installing the aforementioned GNU versions with Blastwave.

I believe there is a GNU version of ls in Blastwave as well that Steve could try.
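For what it's worth, pulling the GNU file utilities from Blastwave looked roughly like this; the CSW package names are from memory and may differ, so treat this as a sketch rather than exact commands:

```shell
# Blastwave's installer; CSWfindutils/CSWcoreutils are assumed package names.
pkg-get -i CSWfindutils
pkg-get -i CSWcoreutils
# Blastwave installs under /opt/csw, with GNU ls typically named gls, e.g.:
/opt/csw/bin/gls /path/to/bigdir > /dev/null
```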


Oh wait, this is a SAN. I assume then that the ls is over the network and not done on a Solaris box? If that is the case, then it is very much some sort of non-typical problem. I guess there are also other factors here. Were the SATA drives connected via a network as well? Does the SAN run NFS or something else?

At any rate, it certainly doesn't take minutes to list 6000 files. My example above was actually on 4 million files. Extrapolating from 6000 to 4 million, my listing should have taken 11 hours. It may have with the default utilities, but it took nowhere near that long when I replaced the utilities with GNU versions.

For reference, this was directly on the Solaris machine which had the ZFS pool attached via SATA.


If a GNU version of ls is required to take advantage of Sun's years-of-effort filesystem in a fairly reasonable situation for a high performance filesystem, Sun has a serious problem.


You'll just be adding more long stats and pushing more metadata into cache if you add another directory level. You should probably look into a Haystack-like approach where you have a single file blob (or several large blobs) and then maintain an index into the offsets.
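A minimal sketch of the blob-plus-index idea (file names here are hypothetical): small files are appended to one flat blob while a text index records name, byte offset and length, so a read needs one open() plus one seek instead of a per-file metadata walk.

```shell
#!/bin/bash
# Sample input files (placeholders for the many small objects).
mkdir -p data
printf 'alpha' > data/a
printf 'bravo!' > data/b

blob=store.blob
index=store.idx
: > "$blob"
: > "$index"

# Append each file to the blob, recording "name offset length" in the index.
for f in data/*; do
    off=$(wc -c < "$blob")
    len=$(wc -c < "$f")
    printf '%s %s %s\n' "$f" "$off" "$len" >> "$index"
    cat "$f" >> "$blob"
done

# Retrieve data/b by seeking straight to its recorded offset.
read -r _ off len <<< "$(grep '^data/b ' "$index")"
dd if="$blob" bs=1 skip="$off" count="$len" 2>/dev/null
```

A real system would keep the index in memory (or mmap it) and use pread() rather than dd, but the shape is the same.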


Isn't that feature (managing storage so you don't have to) precisely the reason for using a "filesystem" in the first place? Your suggestion is isomorphic to telling the poster to chuck the filesystem and just write to the block device directly. It's almost certainly not a sane real-world option.


This guy is already at 366M objects and growing... halving or quartering the number of file I/Os in a large system like this has a real, observable and proven benefit. It's definitely not for the meek, but adding another directory level (and an additional metadata lookup) is not the way to go. When you get to more than a billion objects, all requiring 5+ metadata entries that need to get walked on every request, you might see it differently.


No, that's silly. You're not changing the number of I/O operations at all. You're simply moving the location in code where they are done from the kernel's filesystem to the application's userspace. There's no reason to expect either to be faster by anything other than a constant factor.


Isn't what both of you are describing a filesystem in itself?


Is a "file system" the best design for this?

Not sure of a better general solution, but approaches like BigTable/GFS and SimpleDB come to mind.

Conversely if ZFS is supposed to scale to petabyte loads, what configuration do they expect that data to be in?


BigTable is for structured data. I think the best open-source software for this problem is Haystack from Facebook. It solves the problem by creating a really big file (1 GB in size); in that case, you don't have to bother with the filesystem.


Haystack is not open-source software.


Doing 'ls' in a big directory with a cold cache is a pathological case for ZFS. Finding and opening a single file should be a lot faster than 'ls', and frequently used metadata will be cached by ZFS's ARC over time.

I guess instant, unlimited snapshots don't come free. But you also have the option of storing metadata cache on separate storage (such as SSDs), a feature which many other filesystems don't offer.
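The SSD option being described is ZFS's L2ARC cache device. Attaching one is a one-liner; the pool and device names below are placeholders, so this is a sketch rather than a recipe:

```shell
# 'tank' and 'c1t5d0' are hypothetical names. 'zpool add ... cache' attaches
# an L2ARC device that absorbs frequently read data and metadata, so cold
# reads hit flash instead of spinning disks.
zpool add tank cache c1t5d0
zpool status tank
```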


OK, I just created a directory with 6000 * 8kb files, and did 'ls' from another computer over NFS with a cold cache, and it completed in:

    real    0m1.96s
    user    0m1.12s
    sys     0m0.00s

Sounds like Steve was having some other problem unrelated to ZFS.

EDIT: I also just found a directory I had with 60,000 files created randomly over time (i.e. fragmented), and ls took 3.5 seconds locally (didn't try it over NFS). This is looking more and more like a troll post :-)


When you created the files they would still be cached by ZFS, so it's going to skip reading them from the disks.

It took me 9.9 seconds to get a directory listing for 65336 files over NFS, after creating them over NFS on another system.

That's still nowhere near as bad as the author states, but I bet I had those files in my cache on the file server.


I managed to get to about 30 seconds with 65,000 files on a local ext3 file system. The file names were all of the same length and had a ~100-character-long identical prefix. I re-mounted the file system before the ls to eliminate caching.


Perhaps he did "ls -l", which requires a stat() for each file. But even in the worst-case scenario, that's 10 ms for each round trip; way more than it should be. Something sounds broken with his setup.
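The back-of-envelope arithmetic behind this: one synchronous stat() round trip per entry, at a worst-case 10 ms each, already lands in the reported 1-2 minute range:

```shell
# 6000 files x 10 ms per stat() round trip, converted to seconds.
echo "$(( 6000 * 10 / 1000 )) seconds"
# prints: 60 seconds
```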


Your newly-created files are still in cache.


From a performance perspective, it does not matter if every file on your filesystem sits in a single directory. The only things directories provide are namespaces for files to reside in, and convenience.


That's true only for modern filesystems. For decades, the standard directory storage in unix filesystems was a plain unordered list. Finding a single file in a large directory required reading half the directory on average. Double the size of the directory and you double the time it takes to seek through it. It's only recently that most systems are shipping with filesystems that are O(1) or O(logN) in the directory size.


This depends on the file system in question and how it organizes its data on disk. Some file systems have a max limit on how many files can be in a directory; others have a max limit on files in an entire volume, etc. Granted, today's systems are a bit more robust, but they can still be targeted towards a variety of usage scenarios.



