
You can list a directory containing 8 million files! But not with ls.. - bcx
http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/
======
scott_s
Excellent writeup. Computer systems are discoverable. That attitude, along
with some of the basic tools (such as strace, ltrace, man pages, debuggers and
a compiler) and a willingness to dive into system source code go a long way.
If your tracing leads you to a library call and you don't know what's going on
inside, find the source code. If it's the kernel, load up lxr
(<http://lxr.linux.no/>).

~~~
unwind
Very interesting. I spotted two minor problems with the posted code.

Doing this: #define BUF_SIZE 1024 * 1024 * 5

to define a numerical constant is a bit scary, since depending on how the
symbol is used, the expansion can break due to operator precedence. It's better
to enclose the value in parentheses. Personally I would probably write the
right-hand side as (5 << 20).
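
A quick illustration of the precedence trap (hypothetical names, just a
sketch; compile and run it to see the difference):

    #include <stdio.h>

    #define BUF_SIZE_BAD  1024 * 1024 * 5      /* unparenthesized */
    #define BUF_SIZE_GOOD (1024 * 1024 * 5)    /* parenthesized */

    int main(void)
    {
        long total = 100 * 1024 * 1024;  /* pretend we have 100MB to read */

        /* Expands to total / 1024 * 1024 * 5, which the compiler parses as
           ((total / 1024) * 1024) * 5, nowhere near the intended 20. */
        printf("bad:  %ld\n", total / BUF_SIZE_BAD);

        /* Expands to total / (1024 * 1024 * 5), i.e. 20 chunks as intended. */
        printf("good: %ld\n", total / BUF_SIZE_GOOD);
        return 0;
    }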

The first time the modification to skip entries with inode 0 is mentioned, the
code is wrong:

 _I did this by adding if (dp->d_ino == 0) printf(...);_

This should use !=, not == (it's correct the second time, but this adds
confusion).

~~~
unwind
On second thought (too late to edit), it's quite likely that I would not even
bother with a define for this, but instead just set the size directly in the
code, e.g. char buffer[5 << 20], then rely on sizeof buffer in the call where
the buffer's size is needed. I prefer sizeof whenever possible.
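
A minimal sketch of that pattern (hypothetical fill_buffer helper, not code
from the article):

    #include <unistd.h>

    static char buffer[5 << 20];   /* 5MB; the size is stated exactly once */

    ssize_t fill_buffer(int fd)
    {
        /* sizeof buffer tracks the declaration automatically, so there is
           no separate constant to keep in sync with it. */
        return read(fd, buffer, sizeof buffer);
    }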

~~~
scott_s
Using (5 << 20) instead of (5 * 1024 * 1024) is premature optimization. Modern
C compilers will take the expression (5 * 1024 * 1024) and turn it into the
combination of shifts and additions that are appropriate for your
architecture.

And I definitely prefer defined constants for two reasons. One, it's likely
you'll have to declare multiple such buffers in different places. Two, if I
want to tune the parameter, I'd rather do it at the top of a source file with
other such defined constants than hunting for the declaration in the code. I
do agree that sizeof() is preferable when it's an option.

~~~
__rkaup__
I don't think he meant it as an optimization. It's easy to understand as 5 *
2^20, and quicker to type.

~~~
scott_s
I much prefer 5 * 1024 * 1024. I have to stop and do some reasoning with 5 <<
20.

------
tzs
I suspect the author is incorrect in his claim that reading in 32k chunks is
responsible for the slowness. Due to read ahead and buffering, Unix-like
systems tend to do reasonably well on small reads. Yes, big reads are better,
but small reads are not unreasonably slow.

To test this, he should try "ls | cat". On large directories that often runs
many orders of magnitude faster than "ls". This is because, I believe, ls by
default on most people's systems wants to display information such as file
modes or type via coloring or otherwise decorating the file names, and getting
that information requires looking at the inode for each file.

It's all those inode lookups that slow things down, as the inodes are likely
to be scattered all over the place. When you do "ls | cat", ls notices the
output is not a terminal, so it turns off the fancy stuff and just lists the
file names. The names can be determined entirely from the directory, and so
performance is much better.
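
The check that lets ls notice this is isatty(); a minimal sketch of the idea
(not ls's actual source):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* When stdout is a pipe (e.g. "ls | cat"), a program can skip the
           colour/column decoration that would force a stat() per entry. */
        if (isatty(STDOUT_FILENO))
            printf("terminal: decorate output, stat() each file\n");
        else
            printf("pipe/file: plain names only, no per-file stat()\n");
        return 0;
    }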

~~~
bcx
The original impetus for the post was Python's os.listdir(), which as far as I
know doesn't stat(). ls just made the blog post more interesting :-)

I was surprised that the 32K reads were taking so long. It's possible, since it
was on a virtualized disk ("in the cloud"), that something else was slowing
down disk IO (like Xen).

But I can assure you that a larger read buffer performed much better in this
given scenario.

I'd welcome more tests though.

~~~
nitrogen
This is just a hypothesis based on very little actual knowledge, but perhaps a
very long scheduling interval is responsible for the slowness with smaller
reads? Consider this scenario: the virtualization hypervisor is on a
reasonably loaded system, and decides to block the virtual machine for every
single read. Since the physical system has several other VMs on it, whenever
the VM in question loses its time slice it has to wait a long time to get
another one. Thus, even if the 32K read itself happens quickly, the act of
reading alone causes a delay of _n_ milliseconds. If you increase the read
size, your VM still gets scheduled 1000/ _n_ times per second, but each time
it gets scheduled it reads 5MB instead of 32K.
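
For a rough sense of scale, taking a hypothetical n = 10 ms: 100 scheduled
reads per second at 32K each is only about 3MB/s, while 100 reads per second
at 5MB each is roughly 500MB/s.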

------
brndnhy
Really surprised by all the high-fiving and positive excitement going on about
this article.

Putting eight million files in one directory level aside, the whole basis for
this event - using the filesystem as a storage layer for a k/v 'database' - is
just twisted.

Happy not to be working with devs like this.

~~~
scott_s
Systems people tend to do the simplest reasonable thing that works, then
forget about it until it doesn't work anymore. This means that sometimes
you'll make design choices that look ugly, but really, they don't matter.

See this talk by Jonathan Blow, the designer and programmer behind Braid:
<http://news.ycombinator.com/item?id=2689999>

------
drhodes
getdents stands for "get directory entries", in case anyone else was
wondering.

~~~
shabble
Hark back to the days when 8 character function names were a luxury!

And remember to put all your variable declarations at the top of each block,
so the compiler can handle it all in a single pass :)

------
Luyt
The author has a great tip for kernel/filesystem developers:

 _"Perhaps the buffer should be dynamically set based on the size of the
directory entry file"_

This would eliminate the readdir() bottleneck.

~~~
jharsman
readdir() is implemented in libc, not the kernel. The kernel interface is
getdents() and that has a configurable buffer size.
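
For the curious, the raw interface looks roughly like this: a sketch in the
spirit of the getdents(2) man page example (struct layout per that man page;
BUF_SIZE chosen here, not taken from the article):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define BUF_SIZE (5 * 1024 * 1024)   /* the buffer size is up to you */

    struct linux_dirent {
        unsigned long  d_ino;     /* inode number */
        unsigned long  d_off;     /* offset to next entry */
        unsigned short d_reclen;  /* length of this record */
        char           d_name[];  /* filename (null-terminated) */
    };

    int main(int argc, char *argv[])
    {
        static char buf[BUF_SIZE];
        int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
        if (fd < 0) { perror("open"); return 1; }

        for (;;) {
            long nread = syscall(SYS_getdents, fd, buf, sizeof buf);
            if (nread <= 0)
                break;                    /* 0 = end of directory, <0 = error */
            for (long pos = 0; pos < nread; ) {
                struct linux_dirent *d = (struct linux_dirent *)(buf + pos);
                if (d->d_ino != 0)        /* inode 0 means a deleted/unused entry */
                    puts(d->d_name);
                pos += d->d_reclen;
            }
        }
        close(fd);
        return 0;
    }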

------
carbonica
>>> "Don’t be afraid to compile code and modify it"

I was a bit thrown by this advice. Are there folks out there that are afraid
to compile code and modify it?

~~~
uxp
Throw a novice Ubuntu user into a FreeBSD system and tell them to install a
port, and 9 times out of 10 they'll freak out once they see GCC output on the
screen.

Nothing against Ubuntu (RedHat, SuSE, Debian, Arch, et al.), but source
compilation is something they have all been letting their users avoid for a
long time. The target audience is different.

~~~
qntm
Speaking as a novice Ubuntu user, I don't even know what "install a port"
means. Do you mean "open a port"?

~~~
bmurphy
Nope, he does not mean open a port. Ports is the name of the packaging system
FreeBSD uses, and a port is the equivalent (more or less) of an RPM or deb
source package.

<http://www.freebsd.org/ports/>

~~~
qntm
Doesn't that become awfully confusing?

~~~
uxp
Only out of context.

If your server had a network connection problem and you needed to open up a
port, we're already talking about network ports, not software, so you would
try to diagnose your network.

If you asked me how you could get the old game `rogue` on your system, I would
tell you to go install the freebsd-games port, which has nothing to do with
networking, and so it clearly means that you need to look in your ports tree.

------
rarrrrrr
The easy way to do this is:

    find . -maxdepth 1 -mindepth 1

Those arguments remove the need for find to stat each directory entry.
Regardless, this is a nice walkthrough of low-level details that are often
overlooked.

~~~
moe
Actually 'find' will also stat each entry no matter what.

Many of the standard tools that most people would intuitively expect to be
rather optimized (find, rsync, gzip) are embarrassingly inefficient under the
hood and turn belly up when confronted with data of any significant size.

That probably stems from the fact that most of the development on these tools
took place during a time when 1GB hard drives were "huge" and SMP was "high
end".

~~~
tedunangst
The only issue I'm aware of with gzip is actually in zlib, where it stored
32-bit byte counters, but those were strictly optional and it works fine with
data that overflowed them. The zlib window size may be only 32k, but bzip2
doesn't do _that_ much better with 900k and a better algorithm, so I wouldn't
consider it embarrassingly inefficient.

~~~
moe
I was referring to the lack of SMP support in gzip (see
<http://www.zlib.net/pigz/>).

------
justincormack
The man pages really do try to put you off using the syscalls. I only used
getdents for the first time when I was writing a syscall interface for Lua
<https://github.com/justincormack/ljsyscall> - not tried it on 8m files but
you can set the buffer size, so it should work.

The native aio interface has advantages over the posix wrapper too, and some
of the other interfaces. On the other hand the futex interface is not nice to
use, as it requires architecture specific assembly which seems undocumented...

------
malkia
In a similar fashion, something that reads directories as files on Windows:
<https://gist.github.com/1148070> - it lets me get some things done faster
(it's very crude, rough code, written for exploration rather than real usage).

Also, to get at the NTFS streams (metadata):

<http://msdn.microsoft.com/en-us/library/aa364226(v=vs.85).aspx>

------
ms4720
Just curious, did you try 'ls -f' to get unsorted output? I think a lot of your
time may have been spent getting stuff sorted for output.

~~~
bcx
I actually didn't try ls -f, but I did try find . and os.listdir, and both hung
with very similar straces.

~~~
mitchty
I hit this with a log trimming script I wrote in perl a few months back. We
had one directory with about 4 million files.

Just in case you're curious, I let it run and it took about 25 hours to get
all of the directory entries. One with about 3 million files took 12 hours, so
I'm not sure how long it would have taken if you'd let ls run of its own
accord.

Only 200k files afterwards though. :D

I think I'll poke around with this after I get a coffee; I remember stracing
the interpreter and getting annoyed at its 32k-or-bust behavior. But since I
didn't have a time limit, I didn't much care about runtime. Thanks for the
write-up!

------
ck2
Why aren't files stored in subdirectories based on the first character of the
filename, to break up the volume?

~~~
archangel_one
They were never meant to have anywhere near that number of files in the first
place. The article mentions that the cleaning up of old files failed -
presumably they didn't bother with subdirectories because under normal
circumstances the single directory approach was working fine.

------
cookiecaper
My impulse would be to modify libc's readdir() to use a larger buffer size
instead of using my own C program. Would that be much stupider for some reason
(besides packaging/dependencies)? Do libc clients expect 32k buffers and die
if they receive something else?

------
kylek
I ran into this problem before. One of our engineers had a script that kept
some temporary data in his home directory, which went crazy at one point and
generated millions of files (no clue why). Anyway, this was HELL on our backup
solution, which ended up failing several nights in a row on that area. Luckily,
since the files were largely useless, this left our options open. I think the
kicker was the 'rm' command not working (same issue as listing? This was a
Solaris 8 system, I think). I believe we ended up moving all of the other data
off of that filesystem and newfs'ing it.

------
davvid
Having a fan-out directory instead of putting all files at the root would have
helped to avoid this problem altogether.

Here's an example. The .git/objects/ directory can grow to have up to 256 sub-
directories. If an object hashes to 0a1b2c3d then it gets written to
objects/0a/1b2c3d. Lookups are still fast, and navigating can be done without
resorting to writing an 'ls' replacement.
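
A sketch of that layout in C (hypothetical fanout_path helper, hash shortened
for readability):

    #include <stdio.h>

    /* Build a git-style fan-out path: the first two hex characters of the
       hash name a subdirectory, the rest becomes the filename. */
    static void fanout_path(const char *hash, char *out, size_t outlen)
    {
        snprintf(out, outlen, "objects/%.2s/%s", hash, hash + 2);
    }

    int main(void)
    {
        char path[64];
        fanout_path("0a1b2c3d", path, sizeof path);
        puts(path);    /* prints objects/0a/1b2c3d */
        return 0;
    }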

------
jasonmoo
I feel compelled to point out that Perl can do this rather easily. But I
really did like the extra low-level detail on how a directory is listed.

<http://blogs.perl.org/users/randal_l_schwartz/2011/03/perl-to-the-rescue-case-study-of-deleting-a-large-directory.html>

------
bcx
Should be fixed completely and fast now. SPW was a new blog, so the cache
config was set up wrong :-)

------
pornel
Wouldn't the reason be that `ls` tries to sort and stat the results?

Perhaps

    ls -f1  # -f disables sorting, -1 prints one filename per line

would be fast enough?

------
mml
What about 'echo *' ? That's usually my last resort when dealing with
malfunctioning ls.

~~~
joblessjunkie
The shell will attempt to expand '*' into arguments before it spawns the
'echo' process.

An argument list that long exceeds the kernel's ARG_MAX limit, so this will
fail rather quickly.

~~~
smackfu
Yep, this is usually the first point of failure for commands acting on
somewhat large directories, and it's what drives people to use find with -exec.

------
nbashaw
Sorry our blog is down now - we're trying to get it back up ASAP!

~~~
mjpizz
Looks like it's fine now, but worst case you can access the cached version:

[http://webcache.googleusercontent.com/search?q=cache:KBsyzf3...](http://webcache.googleusercontent.com/search?q=cache:KBsyzf3o_gYJ:www.olark.com/spw/+list+a+directory+containing+8million&hl=en&client=safari&gl=us&strip=1)

~~~
bpierre
Better cached version:
[http://webcache.googleusercontent.com/search?q=cache:KBsyzf3...](http://webcache.googleusercontent.com/search?q=cache:KBsyzf3o_gYJ:www.olark.com/spw/2011/08/&hl=en&client=safari&gl=us&strip=0)

~~~
bcx
No need for the cache anymore. Word to the wise: make sure supercache is
actually caching when copying WordPress installs :-)

~~~
sofal
The cache is the only way it will display properly on my Android phone.

------
JimmyMCN
Wouldn't ReiserFS work better for this?

~~~
mmccaskill
It murders your disk though...

~~~
mmccaskill
Too soon?

------
brndnhy
Excellent case for not giving root.

~~~
ars
What does root have to do with this?

~~~
brndnhy
The article illustrates numerous ways of doing things incorrectly.

The premise of the article is a bad precedent for stable environments: let's
bend the OS so that it plays nicely with what's clearly misuse and
misunderstanding of filesystems.

The only way eight million files should ever end up in a single directory
level is by accident, and that's not the case in the scenario outlined here.

~~~
bcx
Typically there are on the order of 2000 files in a directory. It was a
mistake.

