

Linux w/ reiserfs 3.x vs a SUN SAN 7410 running solaris and ZFS - scumola
http://badcheese.com/?q=node/114

======
ajross
This is the meat:

 _Reiserfs does an "ls" in a directory with 6000 files in it in about 3-5
seconds. The SUN SAN does it in about 1-2 minutes. Serious problems here._

That has to be a bug, not merely slow filesystem behavior. Even assuming a
full 10ms seek for every block read (4k on ZFS?), a minute of seeking is only
~6000 seeks, which works out to one seek and 4kb of data per directory entry.
That's ridiculous. Something is broken; probably an interaction between
subsystems (hardware cache, software cache, ZFS filesystem, network
filesystem, SAN configuration, etc...).

~~~
scumola
ZFS uses an extensible hash table for its directories, while most other
filesystems use a b-tree. I'm no ZFS expert, but perhaps this method degrades
for larger directories?

I'm not trying to be a troll, I'm trying to fix the performance problem.

~~~
moe

        mkdir /tmp/foo
        cd /tmp/foo
        for ((i=0; i<6000; i++)); do touch "$i"; done
        time ls >/dev/null
    
        real    0m0.012s
        user    0m0.012s
        sys     0m0.000s
    

Perhaps try ext3?
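
If you do, check that the htree directory index is actually enabled; without
dir_index, ext3 directories are a linear scan too. A rough sketch with
e2fsprogs (/dev/sdXN is a placeholder, and the filesystem should be unmounted
for the fsck step):

    # check whether the directory index feature is on
    tune2fs -l /dev/sdXN | grep -i dir_index

    # enable it, then rebuild/optimize existing directories
    tune2fs -O dir_index /dev/sdXN
    e2fsck -fD /dev/sdXN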

~~~
scumola
Those are obviously in cache.
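
To get a cold-cache number on Linux, drop the page/dentry/inode caches before
re-running the timing (needs root, and /proc/sys/vm/drop_caches is
Linux-specific):

    sync                                  # flush dirty data to disk first
    echo 3 > /proc/sys/vm/drop_caches     # drop pagecache, dentries and inodes
    time ls /tmp/foo > /dev/null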

------
tensor
I have run into this problem myself. It is actually a problem with the Solaris
ls and related utilities, not with ZFS. I solved it by compiling GNU versions
of the relevant utilities, like find, which have some important performance
improvements on this front.

As for working with many small files, in my experience ZFS has been far better
than xfs (I don't use ReiserFS due to stability issues I ran into with it
previously). One particular example was a user who had over a million files in
one directory, which caused all our backup software to fail. After moving to
ZFS, I could send and receive these files between servers without problems. I
could list the files easily enough as well, after installing the
aforementioned GNU versions from Blastwave.

I believe there is a GNU version of ls in Blastwave as well that Steve could
try.

~~~
tensor
Oh wait, this is a SAN. I assume then that the ls is over the network and not
done on a Solaris box? If that is the case, then this is very much an atypical
problem, and there are other factors in play. Were the SATA drives connected
via a network as well? Does the SAN run NFS or something else?

At any rate, it certainly doesn't take minutes to list 6000 files. My example
above was actually on 4 million files; extrapolating linearly from 6000 files
per minute, my listing should have taken 11 hours. It may have with the
default utilities, but it took nowhere near that long once I replaced them
with the GNU versions.

For reference, this was directly on the Solaris machine which had the ZFS pool
attached via SATA.

------
termie
You'll just be adding more long stat chains and pushing more metadata into
cache if you add another directory level. You should probably look into a
Haystack-like approach, where you keep a single large file blob (or several
large blobs) and maintain your own index of offsets into it.
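
A minimal sketch of the mechanics in shell (GNU stat/dd assumed; the paths and
the grep-based index are made up for illustration, and a real system would do
this in-process with a proper index):

    blob=/data/haystack.blob
    idx=/data/haystack.idx

    put() {   # append an object, record "name offset length" in the index
        local off len
        off=$(stat -c %s "$blob" 2>/dev/null || echo 0)
        cat "$1" >> "$blob"
        len=$(stat -c %s "$1")
        echo "$1 $off $len" >> "$idx"
    }

    get() {   # stream an object to stdout: one index hit, one ranged read
        local off len
        read -r _ off len < <(grep -m1 "^$1 " "$idx")
        dd if="$blob" iflag=skip_bytes,count_bytes skip="$off" count="$len" 2>/dev/null
    }

The win is that a get costs one index lookup plus one read at a known offset,
instead of a walk through several directories' and inodes' worth of metadata.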

~~~
ajross
Isn't that feature (managing storage so you don't have to) precisely the
reason for using a "filesystem" in the first place? Your suggestion is
isomorphic to telling the poster to chuck the filesystem and just write to the
block device directly. It's almost certainly not a sane real-world option.

~~~
termie
This guy is already at 366M objects and growing... halving or quartering the
number of file I/Os in a large system like this has real, observable and
proven benefit. It's definitely not for the meek, but adding another directory
level (and another metadata lookup) is not the way to go. When you get to more
than a billion objects, each requiring 5+ metadata entries that need to get
walked on every request, you might see it differently.

~~~
ajross
No, that's silly. You're not changing the number of I/O operations at all.
You're simply moving the place where they are done from the kernel's
filesystem to the application's userspace. There's no reason to expect either
to be faster by anything other than a constant factor.

------
mey
Is a "file system" the best design for this?

Not sure of a better general solution, but approaches like BigTable/GFS and
SimpleDB come to mind.

Conversely, if ZFS is supposed to scale to petabyte loads, what configuration
do they expect that data to be in?

~~~
liuliu
BigTable is for structured data. I think the best open-source software for
this problem is Haystack from Facebook. It solves the problem by creating
really big files (1GB each); in that case, you don't have to bother with the
filesystem.

~~~
pvg
Haystack is not open-source software.

------
Andys
Doing 'ls' in a big directory with a cold cache is a pathological case for
ZFS. Finding and opening a single file should be a lot faster than 'ls', and
frequently used metadata will be cached by ZFS's ARC over time.

I guess instant, unlimited snapshots don't come free. But you also have the
option of storing the metadata cache on separate storage (such as SSDs), a
feature which many other filesystems don't offer.
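
For example, roughly (pool name 'tank' and device c0t5d0 are placeholders; see
the zpool/zfs man pages):

    # add an SSD as an L2ARC cache device for the pool
    zpool add tank cache c0t5d0

    # optionally restrict that cache to metadata only
    zfs set secondarycache=metadata tank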

~~~
Andys
OK, I just created a directory with 6000 * 8kb files, and did 'ls' from
another computer over NFS with a cold cache, and it completed in:

    real    0m 1.96s
    user    0m 1.12s
    sys     0m 0.00s

Sounds like Steve was having some other problem unrelated to ZFS.

EDIT: I also just found a directory I had with 60,000 files created randomly
over time (i.e. fragmented), and ls took 3.5 seconds locally (didn't try it
over NFS). This is looking more and more like a troll post :-)

~~~
Periodic
When you created the files they would still be cached by ZFS, so it's going to
skip reading them from the disks.

Took me 9.9 seconds to get a directory listing for 65336 files over NFS after
creating them over NFS on another system.

That's still nowhere near as bad as the author states, but I bet I still had
those files cached on the file server.
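
One way to rule the server-side cache out (pool name 'tank' is a placeholder;
don't do this while the pool is in use): export and re-import the pool, which
discards its cached data, then time the listing again:

    zpool export tank
    zpool import tank
    time ls /tank/bigdir > /dev/null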

~~~
slice
I managed to get to about 30 seconds with 65,000 files on a local ext3 file
system. The file names were all of the same length, with a ~100 character
identical prefix. I re-mounted the file system before the ls to eliminate
caching.

------
tmountain
From a performance perspective, it does not matter whether every file on your
filesystem sits in a single directory. The only things directories provide are
namespaces for files to reside in, and convenience.

~~~
ajross
That's true only for modern filesystems. For decades, the standard directory
storage in unix filesystems was a plain unordered list. Finding a single file
in a large directory required reading half the directory on average; double
the size of the directory and you double the time it takes to find anything in
it. It's only recently that most systems have shipped with filesystems whose
lookups are O(1) or O(log N) in the directory size.
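
You can watch the scaling yourself with something like this (a sketch: the
cache drop is Linux-specific and needs root, and the sizes are arbitrary):

    for n in 1000 10000 100000; do
        d=/tmp/dirscale-$n
        mkdir -p "$d"
        (cd "$d" && for ((i=0; i<n; i++)); do touch "f$i"; done)
        sync && echo 3 > /proc/sys/vm/drop_caches   # Linux, root only
        # time a single lookup; on an O(N) directory format this grows with n
        time stat "$d/f$((n-1))" > /dev/null
    done

On a filesystem with hashed or tree directories, the lookup time stays roughly
flat as n grows.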

