
Fast Directory Listing on Linux - romka2
https://github.com/romkatv/gitstatus/blob/master/docs/listdir.md
======
ridiculous_fish
A lovely way to sort strings, especially strings that may have long shared
prefixes, is a 3-way partition quicksort. This allows you to avoid walking the
same prefix over and over the way that memcmp does.

Pick your pivot element and partition the strings into <, =, and > based on
the first character only. Note this differs from classic quicksort in that
we're maintaining three regions, not two. Now recurse on all three regions,
except that in the = region you compare the second character only, and so on.

It's probably a loss for filenames, which tend to be short, but for long
strings this is a very easy-to-implement speedup. There are many more
optimizations possible with this technique, e.g. counting-sort-style
bucketing.
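
A minimal sketch of the technique (multikey quicksort, in the spirit of
Bentley & Sedgewick); the names and pivot choice are illustrative, and a
production version would switch to insertion sort for tiny ranges:

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    // Sorts strs[lo, hi), comparing characters from position d onward;
    // every string in this range is known to share its first d characters.
    void Sort3Way(std::vector<std::string>& strs, std::size_t lo,
                  std::size_t hi, std::size_t d) {
      if (hi - lo < 2) return;
      auto CharAt = [d](const std::string& s) -> int {
        return d < s.size() ? (unsigned char)s[d] : -1;  // -1: past the end
      };
      const int pivot = CharAt(strs[lo + (hi - lo) / 2]);
      std::size_t lt = lo, i = lo, gt = hi;  // [lo,lt) <, [lt,i) ==, [gt,hi) >
      while (i < gt) {
        int c = CharAt(strs[i]);
        if (c < pivot) std::swap(strs[lt++], strs[i++]);
        else if (c > pivot) std::swap(strs[i], strs[--gt]);
        else ++i;
      }
      Sort3Way(strs, lo, lt, d);                      // <  : same position
      if (pivot >= 0) Sort3Way(strs, lt, gt, d + 1);  // == : next position
      Sort3Way(strs, gt, hi, d);                      // >  : same position
    }

Calling Sort3Way(v, 0, v.size(), 0) sorts v. The == region never re-reads the
first d characters, which is exactly the repeated prefix walking that
memcmp-based comparisons pay for.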

~~~
agumonkey
what about symbol trees?

/a/b/c/d/e => {a,b,c,d,e} symbol table + [a,b,c,d,e]

and you accumulate into a tree/trie
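
If I'm reading this right, a rough sketch of the interning + trie idea (all
names illustrative) might look like:

    #include <map>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct PathTrie {
      std::unordered_map<std::string, int> symbols;   // component -> ID
      struct Node { std::map<int, Node> children; };  // keyed by ID
      Node root;

      int Intern(const std::string& s) {
        return symbols.emplace(s, (int)symbols.size()).first->second;
      }

      // Accumulates a pre-split path such as {"a", "b", "c", "d", "e"}.
      void Insert(const std::vector<std::string>& components) {
        Node* n = &root;
        for (const auto& c : components) n = &n->children[Intern(c)];
      }
    };

One caveat: IDs are assigned in arrival order, so walking children by ID does
not yield lexicographic order; an ordered symbol table would be needed for
that.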

~~~
saagarjha
Tries/trees have pointers which can be slow, so I’d measure before using
something like this.

------
CoolGuySteve
You can get faster than memcmp by rolling your own SSE/AVX compare.

memcmp typically spends a few branches and instruction cache lines just
verifying that it's running on aligned memory and checking how aligned (i.e.
8-byte stride vs 64-byte stride). All of that can be skipped with a for-loop
of intrinsics if you, the programmer, know alignment characteristics the
compiler cannot infer.
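
For example, a sketch of such a loop under caller-guaranteed alignment (SSE2;
illustrative, and worth benchmarking against your libc's memcmp):

    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstddef>

    // Requires: a and b are 16-byte aligned and n is a multiple of 16 --
    // preconditions the caller guarantees, so no runtime alignment probing.
    bool Equal16(const void* a, const void* b, std::size_t n) {
      const __m128i* pa = static_cast<const __m128i*>(a);
      const __m128i* pb = static_cast<const __m128i*>(b);
      for (std::size_t i = 0; i < n / 16; ++i) {
        __m128i eq = _mm_cmpeq_epi8(_mm_load_si128(pa + i),
                                    _mm_load_si128(pb + i));
        if (_mm_movemask_epi8(eq) != 0xFFFF) return false;  // a byte differs
      }
      return true;
    }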

~~~
mistrial9
don't you want the compiler to do this for you, with ARCH flags? a bit of
foresight and a little checking seems like enough.. excessive ASM is a mistake
these days

    CXXFLAGS += -march=sandybridge -mtune=sandybridge

~~~
saagarjha
The point is that the compiler does not have enough information to do this
automatically.

~~~
mistrial9
.. meant to say: _after_ rewriting the code to avoid this or that extra
action, _then_ let the compiler produce the optimized code.. _not_ saying
"just re-run the old code with CFLAGS thus" ..

------
kazinator
A really important optimization when scanning directories is to avoid calling
_stat_ on each object just to get its type. On Linux, _struct dirent_ has a
_d_type_ field (not specified by POSIX and not supported by all filesystems)
which forwards the type of the object from the inode to the directory entry.
When portable programs that use _stat_ to get basic type information (like "is
this a directory or a regular file") are converted to use _d_type_ , the
speedup is dramatic.
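
A minimal sketch of the pattern, with a stat fallback for filesystems that
leave d_type unset:

    #include <dirent.h>
    #include <sys/stat.h>
    #include <cstdio>

    void CountDirs(const char* path) {
      DIR* d = opendir(path);
      if (!d) return;
      int dirs = 0;
      while (struct dirent* e = readdir(d)) {
        if (e->d_type == DT_DIR) {
          ++dirs;
        } else if (e->d_type == DT_UNKNOWN) {  // filesystem didn't fill it in
          struct stat st;
          if (fstatat(dirfd(d), e->d_name, &st, 0) == 0 && S_ISDIR(st.st_mode))
            ++dirs;
        }
      }
      std::printf("%d directories\n", dirs);
      closedir(d);
    }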

~~~
romka2
Yep, this is mentioned in the doc.

> [...] every element in entries has d_type at offset -1. This can be useful
> to the callers that need to distinguish between regular files and
> directories (gitstatusd, in fact, needs this). Note how ListDir() implements
> this feature at zero cost, as a lucky accident of dirent64_t memory layout.
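
In other words, each returned pointer points at d_name inside a raw dirent64
record, and dirent64 lays out d_type as the byte immediately preceding d_name.
A hypothetical caller (names assumed, not the actual gitstatusd code):

    #include <dirent.h>  // DT_DIR
    #include <vector>

    void Classify(const std::vector<char*>& entries) {
      for (const char* name : entries) {
        // d_type lives in the byte right before the name, at no extra cost.
        unsigned char type = static_cast<unsigned char>(name[-1]);
        if (type == DT_DIR) {
          // `name` is a subdirectory
        }
      }
    }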

------
tele_ski
I think this is an awesome article and a really cool account of how the
author incrementally improved their code for performance. But is the API as
easy to use for a programmer? I find that a complex API into a function makes
it much harder to understand, even if it's your own code months later. I also
feel like the parent_fd parameter is sort of glossed over: it's just expected
to be there in the final versions of the function, and how to pass it in is
left as an exercise.

~~~
romka2
The real implementation of ListDir accepts the descriptor of the directory it
needs to list (not parent_fd plus dirname like it's done in the article) and
doesn't close it. This is fairly straightforward. You still have to pass Arena
as an extra parameter though, which adds inconvenience, and d_type is still at
-1 offset -- a rather unusual thing for an API.

The biggest downside from the API perspective is that directory listing and
sorting are bundled in a single function. The insight of v5 in the article is
that this bundling allows us to achieve higher performance than what we can
get if we have a separate API for listing which we can compose with sorting.

So it's a far cry from the cleanest API you can imagine. Levels of
abstraction often have to give way when maximum performance is the goal.

~~~
tele_ski
Since this is C++, what about using a thread_local Arena for the storage? Then
you could simplify the API a bit and not re-allocate too!

I also agree: the parent_fd parameter is trivial, but it isn't as easy to use
as no parameter at all.
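
A sketch of that suggestion, assuming a hypothetical Arena type whose Clear()
keeps its backing memory, and a hypothetical listing routine:

    #include <vector>

    // Arena and ListDirInto are stand-ins for the real implementation's API.
    // Caveat: the returned pointers live in the arena, so a second call from
    // the same thread invalidates the previous result.
    std::vector<char*> ListDir(int dir_fd) {
      thread_local Arena arena;            // one reusable arena per thread
      arena.Clear();                       // reset without freeing memory
      return ListDirInto(dir_fd, &arena);  // entries point into the arena
    }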

------
mankeysee
>As an added bonus, those casts will fend off the occasional frontend
developer who accidentally wanders into the codebase.

Beautiful lol

------
rasz

> Every directory has entries "." and "..", which we aren't interested in. We
> filter them out with a helper function Dots().

    bool Dots(const char* s) {
      return s[0] == '.' && (!s[1] || (s[1] == '.' && !s[2]));
    }

why would you iterate over every single file entry? both the . and .. entries
will land at one end of the sorted list anyway.

~~~
romka2
"." and ".." are filtered out before sorting. Sorting is O(N^2) in the common
case (when there are fewer than 65 files in a directory), so it pays off to
reduce the number of entries by 2 before we get there.

~~~
rasz
n log n? and aren't the . and .. entries always at the beginning of the list?
you could just skip the first 2 elements (or stop comparing once . and .. have
been filtered out). What about scandir?

~~~
jwilk
> n log n?

From the article: _Digging into the source code of std::sort() we can see that
it uses Insertion Sort for short collections. Our 32-element vector falls
under the threshold. Insertion Sort makes O(N^2) comparisons_

> arent . .. entries always at the beginning of the list?

No.

------
the8472
If you're not hitting the caches then there's a lot more to gain by optimizing
IO patterns, either by traversing multiple directories in parallel (to fill
SSD command queues) or by performing readaheads on the directories (to be
friendly to HDD elevators). Sadly the latter is somewhere between difficult
and impossible.

~~~
romka2
gitstatusd calls ListDir in parallel from multiple threads. At least with a
fast SSD it's CPU bound. I don't have an HDD to test on.

~~~
the8472
have you dropped disk caches before each bench iteration?

~~~
romka2
No, I did the opposite. I made sure the disk caches are warm before each
benchmark. Since all versions of ListDir are identical in terms of IO demands,
warming up caches is an effective way to reduce benchmark variability and to
make performance differences of different code versions easier to detect
without changing their order on the performance ladder.

~~~
the8472
That's reasonable when optimizing the average case, but not for the worst
case.

~~~
romka2
I would agree with this sentence if "optimizing" were replaced with "making
marketing claims". Optimization is the process of finding the fastest
implementation. In the case of ListDir each successive version is faster than
the last on a benchmark with warm IO caches, therefore it'll be faster on the
same benchmark with cold IO caches (this is not a general claim; I'm talking
specifically about ListDir). Benchmarking with warm IO caches is easier
(invalidating caches is generally very difficult) and yields more reliable and
more actionable results, hence it's better to benchmark with hot caches. It
has nothing to do with the average vs worst case.

------
aasasd
In my experience, another important component of fast directory listings is a
cron job that does “ls -R” every half hour. Because you can't control the
eagerness of fs metadata caches even if your fileserver has plenty of free
mem.

~~~
lathiat
You can actually tune some VFS parameters to make it less likely to evict the
VFS caches (vfs_cache_pressure), however, it's not well documented which
settings work the best other than to set it to '1' for 'mostly don't' which
mostly works well only if you have way more RAM than you need and will tend to
go quite badly if that is not the case. I think the only way to determine that
well would be to benchmark it for your specific use case. But it may help you
over a recurring "ls -R"

------
ezoe
> returning vector<string> is an unaffordable convenience.

Not anymore since C++11.

Reusing vector storage is good, but you could also take the parameter by
value and move into it, rather than taking a reference.

Although, in this case, the reference is probably easier to write.

~~~
milemi
I think he was concerned about the cost of allocating separate strings that go
into the vector.

~~~
romka2
Yep, every string is potentially an allocation (unless it's short and
std::string implements Small String Optimization) plus O(log N) allocations by
the vector itself.

C++11 didn't make returning the vector in this function faster because it's
written in a way to take advantage of RVO. It did make growing the vector
faster though -- individual strings now get moved instead of copied.
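
For comparison, a sketch of the storage-reuse shape being discussed (plain
readdir, illustrative; not the article's ListDir):

    #include <dirent.h>
    #include <string>
    #include <vector>

    // clear() keeps the vector's capacity, so repeated calls stop paying for
    // vector growth; each string may still allocate unless it fits in the
    // Small String Optimization buffer.
    void ListDirReusing(const char* path, std::vector<std::string>* out) {
      out->clear();
      if (DIR* d = opendir(path)) {
        while (struct dirent* e = readdir(d)) out->emplace_back(e->d_name);
        closedir(d);
      }
    }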

------
zeroimpl
I wonder how this compares to filesystem traversal APIs like fts/ftw?

~~~
romka2
fts and ftw are glibc wrappers over the glibc wrappers that I had to bypass
for better performance. They are made for convenience, not for speed.

~~~
kazinator
Also: last time I looked at _fts_ in the context of glibc, it did not support
64-bit file offset builds (-D_FILE_OFFSET_BITS=64).

I was looking at _fts_ because there exists a BSD licensed implementation of
_nftw_ in terms of _fts_ ; I was researching the possibility of creating a
semantically extended/enriched version of _nftw_ , without coding it entirely
from scratch. So I plonked that implementation into my program and, lo and
behold, got an error message from glibc's _fts_ header file about not
supporting 64-bit file offsets.
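
A minimal reproducer of the conflict described above, assuming one of the
glibc versions in question (later glibc gained an LFS-capable fts):

    // On affected glibc versions, <fts.h> itself contains a hard #error
    // when 64-bit file offsets are requested.
    #define _FILE_OFFSET_BITS 64
    #include <fts.h>  // older glibc: #error mentioning _FILE_OFFSET_BITS

    int main() { return 0; }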

------
ausjke
What about using nftw directly instead of opendir/readdir?

~~~
romka2
nftw is a wrapper over opendir/readdir, not the other way around.

------
thwd
But why? Why create a tool that does `git status` 10x faster?

~~~
romka2
The job of gitstatusd is to put Git status in your shell prompt. See the
screenshot at the top of
[https://github.com/romkatv/gitstatus](https://github.com/romkatv/gitstatus).
This is an amazingly convenient tool.

When you are working on chromium, on every command you type gitstatusd needs
to list the contents of 25,000 directories. Low-level optimizations like the
ones described here are what make gitstatusd 10 times faster than `git
status`, which in turn makes the prompt responsive when it would otherwise be
sluggish.

~~~
Lorkki
Isn't inotify meant for exactly this kind of thing, though? You are already
introducing a hard dependency on Linux syscalls in "v4" of the optimisations,
so it would seem advantageous to make use of it and avoid full directory
traversals most of the time.
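
A minimal sketch of that approach: one watch per directory, then incremental
events instead of full re-listing (error handling and event parsing
abbreviated; the watch limits come up in the replies below):

    #include <sys/inotify.h>
    #include <unistd.h>

    int WatchDir(const char* path) {
      int fd = inotify_init1(IN_NONBLOCK);
      if (fd < 0) return -1;
      // A large repo needs one of these watches per directory.
      int wd = inotify_add_watch(fd, path,
                                 IN_CREATE | IN_DELETE | IN_MOVED_FROM |
                                 IN_MOVED_TO | IN_MODIFY);
      if (wd < 0) { close(fd); return -1; }
      return fd;  // poll/read this fd for struct inotify_event records
    }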

~~~
majewsky
inotify requires you to hold open fds for all the dirs and files you're
watching iirc, so in repos that large, you'll be crushed by the
file-descriptor-per-process limit.

~~~
avar
It doesn't; you might need to tweak its limits via /proc/sys/fs/inotify/max_*
on such repositories, but those limits are not the same as the much lower open
FD limits (as in ulimit -n ...).

~~~
romka2
I've posted some details explaining why gitstatusd doesn't use inotify in
[https://github.com/romkatv/gitstatus/commit/050aaaa04b652e15...](https://github.com/romkatv/gitstatus/commit/050aaaa04b652e15ab4c28b669f1ed753f9d0519).
The short version is that the default max_user_watches limit is much too low
to be useful for gitstatusd and I cannot ask or expect users to change their
system settings.

------
JoshTriplett
> As an added bonus, those casts will fend off the occasional frontend
> developer who accidentally wanders into the codebase.

[https://blog.aurynn.com/2015/12/16-contempt-culture](https://blog.aurynn.com/2015/12/16-contempt-culture)

Code that intentionally celebrates its unapproachability is not a badge of
honor or pride, even as a joke. It might occasionally be an unavoidable
necessity, in which case it needs enough documentation that the next person
who has to deal with it has the best chance possible.

~~~
oralty
I mean, I think that was a bit of a joke. It's ironic that you rail against
this dude for "code that's hard to understand" in reply to a long-form prose
article about understanding the code and how it got there..

We'll probably have to agree to disagree.. but that blog post doesn't really
resonate with me... especially the end where it talks about keeping it to your
"own language." I've never had patience for identity politics, especially in
the case of a god-damn programming language.

~~~
JoshTriplett
> It's ironic that you rail against this dude for "code that's hard to
> understand"

I'm sure that it _was_ a joke, and that doesn't make it better. I'm not
suggesting that this code _was_ hard to understand, and I enjoyed the article.
I'm calling out this _particular_ point, and suggesting that even as a joke,
mocking "frontend developers" is very much from the same territory that
promotes "Real Programming" and mocks people for their choice of programming
language or technology.

~~~
oralty
I mean, it is pretty true that most frontend developers do not have the
inclination to go treading into a C++ code base that makes syscalls without a
wrapper. Even taken _uncharitably_ seriously, that is _all_ the comment was
insinuating. Everything else is added by you, the reader.

And anyway I'm fine with exclusively frontend devs staying out of systems
stuff. Anyone that wants to read OS kernel and compiler source code has tons
of choices. Most I've talked to find it dry and boring. They don't belong
there... that's fine.

