Reiserfs does an "ls" in a directory with 6000 files in it in about 3-5 seconds. The SUN SAN does it in about 1-2 minutes. Serious problems here.
That has to be a bug, not merely slow filesystem behavior. Even doing a seek for every block (4 KB on ZFS?) at 10 ms each, a full minute of nothing but seeking works out to one whole 4 KB block of directory data per entry. That's ridiculous. Something is broken; probably an interaction between subsystems (hardware cache, software cache, ZFS filesystem, network filesystem, SAN configuration, etc.).
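To spell the arithmetic out (the 10 ms seek time and 4 KB block size are assumptions, as above), a quick Python check:

    seek_time = 0.010              # assumed seconds per seek
    wall_clock = 60                # one minute spent listing
    entries = 6000                 # files in the directory
    seeks = wall_clock / seek_time             # 6000 seeks fit in that minute
    print(seeks / entries)                     # 1.0 -> one whole 4 KB block per entry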
ZFS uses an extensible hash table for its directories, while most other filesystems use a B-tree. I'm no ZFS expert, but perhaps this method degrades for larger directories?
I'm not trying to be a troll, I'm trying to fix the performance problem.
I have run into this problem myself. It is actually a problem with the Solaris ls and related utilities, and not with ZFS. I solved it by compiling some relevant GNU versions of utilities like find, which have some important performance improvements on this front.
As for working with many small files, in my experience ZFS has been far better than XFS (I don't use ReiserFS due to previous stability issues I experienced with it). One particular example was a user who had over a million files in one directory, which caused all our backup software to fail. After moving to ZFS, I could send and receive these files between servers without problem. I could list the files easily enough as well, after installing the aforementioned GNU versions from Blastwave.
I believe there is a GNU version of ls in Blastwave as well that Steve could try.
Oh wait, this is a SAN. I assume then that the ls is done over the network and not on a Solaris box? If that is the case, then this is very much an atypical problem. I guess there are also other factors here. Were the SATA drives connected via a network as well? Does the SAN run NFS or something else?
At any rate, it certainly doesn't take minutes to list 6000 files. My example above was actually with 4 million files. Extrapolating linearly from 6000 files in a minute to 4 million, my listing should have taken about 11 hours. It may have with the default utilities, but it took nowhere near that long once I replaced them with the GNU versions.
For reference, this was directly on the Solaris machine which had the ZFS pool attached via SATA.
If a GNU version of ls is required to take advantage of the filesystem Sun has put years of effort into, in what is a fairly reasonable workload for a high-performance filesystem, then Sun has a serious problem.
You'll just be adding more slow stat() calls and pushing more metadata into the cache if you add another directory level. You should probably look into a Haystack-like approach, where you have a single blob file (or several large blobs) and maintain an index of offsets into it.
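A minimal sketch of what I mean, in Python (the BlobStore class and file names are made up for illustration; this is not Facebook's actual Haystack code): every object is appended to one large blob file, and an index maps each key to an (offset, length) pair, so a read is one seek plus one read instead of a directory walk and an inode lookup.

    import os

    class BlobStore:
        """Toy Haystack-style store: one big append-only blob file plus an
        in-memory index of key -> (offset, length)."""

        def __init__(self, path):
            self.path = path
            self.index = {}                     # key -> (offset, length)
            self.blob = open(path, "ab")        # append-only data file

        def put(self, key, data):
            self.blob.seek(0, os.SEEK_END)      # appends always land at the end
            offset = self.blob.tell()
            self.blob.write(data)
            self.blob.flush()
            self.index[key] = (offset, len(data))

        def get(self, key):
            offset, length = self.index[key]
            with open(self.path, "rb") as f:    # one seek + one read, no per-object metadata walk
                f.seek(offset)
                return f.read(length)

    store = BlobStore("objects.blob")
    store.put("photo-123", b"...image bytes...")
    print(store.get("photo-123"))

In a real deployment you would presumably also persist the index (plus checksums and tombstones) so it can be rebuilt after a crash, and roll over to a new blob once the current one grows large.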
Isn't that feature (managing storage so you don't have to) precisely the reason for using a "filesystem" in the first place? Your suggestion is isomorphic to telling the poster to chuck the filesystem and just write to the block device directly. It's almost certainly not a sane real-world option.
This guy is already at 366M objects and growing... halving or quartering the number of file I/Os in a system this large has a real, observable, and proven benefit. It's definitely not for the meek, but adding another directory level (and with it an additional metadata lookup) is not the way to go. When you get to more than a billion objects, each requiring 5+ metadata entries that need to get walked on every request, you might see it differently.
No, that's silly. You're not changing the number of I/O operations at all. You're simply moving the location in code where they are done from the kernel's filesystem to the application's userspace. There's no reason to expect either to be faster by anything other than a constant factor.
BigTable is for structured data. I think the best open-source software for this problem is Haystack from Facebook. It solves the problem by creating really big files (1 GB in size); in that case, you don't have to bother with the filesystem.
Doing 'ls' in a big directory with a cold cache is a pathological case for ZFS. Finding and opening a single file should be a lot faster than 'ls', and frequently used metadata will be cached by ZFS's ARC over time.
I guess instant, unlimited snapshots don't come free. But you also have the option of storing the metadata cache on separate storage (such as SSDs), a feature which many other filesystems don't offer.
OK, I just created a directory with 6000 * 8kb files, and did 'ls' from another computer over NFS with a cold cache, and it completed in:
real 0m 1.96s
user 0m 1.12s
sys 0m 0.00s
Sounds like Steve was having some other problem unrelated to ZFS.
EDIT: I also just found a directory with 60,000 files randomly created over time (i.e. fragmented), and ls took 3.5 seconds locally (I didn't try it over NFS). This is looking more and more like a troll post :-)
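For anyone who wants to repeat the test, here is a rough sketch of the setup in Python (directory and file names are arbitrary; note that, unlike the run above, this doesn't go over NFS or drop the cache first):

    import os, time

    d = "zfs_test_dir"                          # arbitrary test directory
    os.makedirs(d, exist_ok=True)
    for i in range(6000):                       # 6000 files of 8 KB each
        with open(os.path.join(d, "file_%05d" % i), "wb") as f:
            f.write(b"\0" * 8192)

    start = time.time()
    names = os.listdir(d)                       # the equivalent of a plain 'ls'
    print(len(names), "entries listed in", round(time.time() - start, 3), "s")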
I managed to get to about 30 seconds with 65,000 files on a local ext3 file system. The file names were all of the same length and had a ~100-character-long identical prefix. I re-mounted the file system before the ls to eliminate caching.
Perhaps he did "ls -l", which requires a stat() for each file. But even in the worst-case scenario that's 10 ms for each round trip, which is way more than it should be: something sounds broken with his setup.
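To make the distinction concrete: a plain listing only reads directory entries, while 'ls -l' adds a stat() per file, which is what turns into a round trip on a network filesystem. A small Python comparison (the directory path is just a placeholder):

    import os, time

    d = "."                                     # placeholder: point this at a large directory

    t0 = time.time()
    names = os.listdir(d)                       # ~ plain 'ls': directory entries only
    t1 = time.time()
    sizes = [os.lstat(os.path.join(d, n)).st_size for n in names]   # ~ 'ls -l': one stat() per file
    t2 = time.time()

    print("readdir only :", round(t1 - t0, 4), "s")
    print("readdir+stat :", round(t2 - t1, 4), "s for", len(names), "stat() calls")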
From a performance perspective, it does not matter if every file on your filesystem sits in a single directory. The only things directories provide are namespaces for files to reside in, and convenience.
That's true only for modern filesystems. For decades, the standard directory storage in unix filesystems was a plain unordered list. Finding a single file in a large directory required reading half the directory on average. Double the size of the directory and you double the time it takes to seek through it. It's only recently that most systems are shipping with filesystems that are O(1) or O(logN) in the directory size.
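A toy illustration of that scaling, away from any real filesystem: looking up a name in an unordered list is a linear scan, while a hashed (or tree-indexed) directory is a keyed lookup.

    import time

    names = ["file_%07d" % i for i in range(1000000)]   # a big, flat "directory"
    target = names[-1]                                   # worst case for a linear scan

    t0 = time.time()
    found = target in names                 # unordered list: O(N) scan
    t1 = time.time()

    index = set(names)                      # hashed directory: O(1) lookup
    t2 = time.time()
    found = target in index
    t3 = time.time()

    print("linear scan :", round(t1 - t0, 5), "s")
    print("hash lookup :", round(t3 - t2, 5), "s")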
This depends on the file system in question and how it organizes its data on disk. Some file systems have a maximum limit on how many files can be in a directory, others have a maximum limit on files in an entire volume, etc. Granted, today's systems are a bit more robust, but they can still be targeted towards a variety of usage scenarios.