
When setting an environment variable gives you a 40x speedup - CaliforniaKarl
https://news.sherlock.stanford.edu/posts/when-setting-an-environment-variable-gives-you-a-40-x-speedup
======
mrb
Story time.

This took place around 2009. Back then I was working for Rapid7, on their
network vulnerability scanner Nexpose. The codebase was mostly Java and was
relatively large at 1M+ lines of code. We had many unit tests. Running the
entire test suite took up to 40 minutes on Windows and 20 minutes on Linux
(Windows was always about twice as slow at everything: building, the product's
startup time, etc.). The company had grown quickly to at least 30-50 software
engineers. The problem was that every time one of them ran a build on their
local machine (which happened multiple times a day), it had to run the test
suite, wasting up to 40 minutes of that person's time. 40 minutes × dozens of
engineers = a lot of wasted time across the company.

I loved solving performance issues, so one day I remember arriving at the
office and making it my mission to investigate whether there was an easy way
to speed up the test suite. Our build system was based on Ant, and our tests
used the JUnit framework. After a little time profiling Ant with basic tools
(top, strace, ltrace) and taking a few Java stack traces, I realized that most
of the wasted time was not spent actually running the individual tests:
instead, JVM instances kept being started and destroyed between tests. Our Ant
build file was running the JUnit tests with fork=yes, which was required for a
reason I don't recall at the moment. This forks a JVM to run the tests. Then a
little googling led me to this:

[https://ant.apache.org/manual/Tasks/junit.html](https://ant.apache.org/manual/Tasks/junit.html)

While reading this documentation, I stumbled upon a parameter I didn't know:
forkmode. What does it do?

" _forkmode: Controls how many JVMs get created if you want to fork some
tests. Possible values are perTest (the default), perBatch and once. once
creates only a single JVM for all tests while perTest creates a new JVM for
each TestCase class. perBatch creates a JVM for each nested <batchtest> and
one collecting all nested <test>s._"

Our Ant build file did not set forkmode, which meant we were forking a new JVM
for every single test class!

I immediately tried forkmode=perBatch and... the test suite ran 10× faster! 40
minutes down to 4 minutes on Windows, and 20 minutes down to 2 minutes on
Linux. I told my boss right away, but he was incredulous. He asked me to check
with our most experienced Java developer. I showed him my 1-line patch
speeding up the test suite 10×, and he said, "I guess you are right, we can
commit that." By lunchtime the fix was committed and everyone loved me :)

~~~
quickthrower2
That’s the time to ask for a raise!

~~~
user5994461
You're gonna be very disappointed in the workplace if you expect a raise every
time you're just doing your job.

~~~
quickthrower2
If he was just doing his job the implication is that no one else was doing
theirs at the same company.

~~~
user5994461
Don't you think it's far-fetched to say that every single developer at the
company is useless because they didn't look into the Ant build settings?

~~~
quickthrower2
I'm playing devil's advocate, but the point is he used initiative. He wasn't
asked to improve the build; he was curious and found a solution.

Is that worth a raise? Probably not on its own, but I was saying it's a good
time to ask for one if you secretly think you should get one, as you have put
a good foot forward.

On a meta level, a company that doesn't taskmaster or track everyone's time
usage will get this kind of result from time to time: curious professionals
just making things better. Continuous improvement.

------
CaliforniaKarl
This was written by my coworker, Kilian Cavalotti. We have two main file
stores: $SCRATCH is a Lustre cluster, which is extremely performant but
doesn't do well with lots of inodes, and $HOME is a multi-node Isilon, which
can handle all the inodes but is not as performant. And we have users who
sometimes like to put many files in a single directory.

You can find more information about us at
[https://srcc.stanford.edu](https://srcc.stanford.edu)

And you can find more information about Sherlock at
[https://www.sherlock.stanford.edu](https://www.sherlock.stanford.edu)

~~~
netsharc
Tell your colleague I think emojis make the article look like it was written
by a 12 year old.

IMO. I guess I'm just an old fart.

~~~
yani
It is how most millennials write nowadays

~~~
falsedan
It’s how people trying to write like millennials write

Postfix emoji, eurgh

------
vagab0nd
> Having thousands of files in a single directory is usually not very file
> system-friendly, and definitely not recommended.

I wonder: is this specific to using "ls", or does it hold in general?

I recently worked on a project where I needed to store many small files on
ext4, although these files are not read/written by humans. I came across
suggestions to group files into subdirectories rather than putting them all in
one directory. Is there evidence that it's actually worth it on a modern
filesystem?

EDIT: by "many" I mean 10+ million files.

~~~
mcbits
For future reference, next time you need to store millions of related small
files, it might be worth checking if SQLite is an option instead of using the
filesystem:
[https://www.sqlite.org/fasterthanfs.html](https://www.sqlite.org/fasterthanfs.html)

Performance aside, I wish NodeJS had a way to pack all its little turds into
one SQLite file, out of sight. (Admittedly I hadn't thought to look for such a
thing until now... Preliminary results are negatory.)
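
For illustration, the kind of thing I mean is just one flat key-blob table; a
minimal Python sketch (file and table names made up):

    import sqlite3

    db = sqlite3.connect("files.db")
    db.execute("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB)")

    def put(name, data):
        with db:  # implicit transaction, committed on success
            db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (name, data))

    def get(name):
        row = db.execute("SELECT data FROM files WHERE name = ?", (name,)).fetchone()
        return row[0] if row else None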

~~~
civility
You should benchmark SQLite before recommending it like this (for future
reference). I think it's a very elegant library, and I admire the development
philosophy that went into it. However, every time I've tried to use it for
something that needs to be fast it ended up being a painful mistake.

~~~
striking
Could you show us your methodology for your benchmark? We don't know under
which circumstances you experienced degraded performance.

I've seen a lot of people just missing an index or something, and their DB
then ran fine, so that's why I ask.

~~~
civility
I don't work for that company any more, and I couldn't have legally shown the
code even if I did.

It's not an index thing though. Try inserting 10 million rows into a simple
table. No foreign constraints or anything. Some of our "documents" had 300
million rows. Handle the primary key however you want (let the DB do it, or
generate your own). Use whatever SQLite API calls you want (precompiled
statements, whatever). In a case like this, adding other indexes can only slow
it down.

There are a few options to disable integrity and other safety constraints.
They help speed things up a little, but it's all painfully slow in comparison
to writing a simple CSV file.
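
(By "options" I mean things along these lines; a rough Python sketch of the
usual fast-insert recipe, not a tuned benchmark:)

    import sqlite3

    db = sqlite3.connect("bench.db")
    # trade durability for speed; fine for a benchmark, risky otherwise
    db.execute("PRAGMA journal_mode = OFF")
    db.execute("PRAGMA synchronous = OFF")
    db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
    with db:  # one big transaction instead of one per insert
        db.executemany("INSERT INTO t (val) VALUES (?)",
                       (("row %d" % i,) for i in range(10_000_000)))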

The same is true on reading. You can parse an entire CSV file more quickly
than "SELECT *" from the table.

I've tried several times to use SQLite as a document file format. It performs
worse than CSV and JSON, and much worse than a plain array of binary records.
The devil is in the details, and assuming each file is not a single
row/record, I wouldn't be surprised if 10+ million files in a structured
directory tree performed better too.

~~~
mcbits
Well, I certainly agree about benchmarking versus other options, especially if
I/O proves to be a bottleneck. The other option above was not a single CSV or
JSON-structured file, but millions of small files stored on the filesystem
(with its requisite indexes, permissions, access logs, etc). And the
comparison is not with the file contents being splayed out into a relational
structure in SQLite, but just one flat key-blob table for storage and
retrieval. It's possible that a multi-gigabyte CSV file would be faster still,
depending on actual access patterns and how much effort you want to devote to
this bespoke CSV database.

~~~
civility
> The other option above was [...] millions of small files stored on the
> filesystem

I don't have any direct comparisons of SQLite to this approach, but other
projects I've worked on did have a simple directory tree organizing
medium-sized chunks of data by date over 20+ years. We had one "DB" where
everything was stored by the degree of latitude and longitude it fell into,
and another where things were stored by date. Both were gigabytes in size at a
time when that was considered a large data set, and it was very fast to open
only what you needed.

Depending on the problem, this can be a very good solution. It was trivially
accessible from any programming language (including ones that didn't exist
when the DB was started), and it was trivial to migrate to new operating
systems (some of this started out on VMS).

I like SQLite quite a bit, but it's not always the best solution to storing
tabular/relational data.

~~~
mcbits
How many files are you talking about? That's the relevant variable, not the
amount of data or whether it's text, tabular or relational data, images,
audio, etc. E.g. you can copy a single 10 GB file to another drive faster than
1 million files totaling 1 GB, all due to filesystem overhead. If there's a
filesystem where that's not true, I'm interested. :)

~~~
civility
> How many files are you talking about?

The numbers won't seem significant by today's standards. In the one case
(stored by date), maybe 150 thousand files, each a few megabytes.

> you can copy a single 10 GB file to another drive faster than 1 million
> files totaling 1 GB

True, but I'll bet you can create (or read) a thousand files with a thousand
records each faster than you can insert (or select) a million records into (or
from) a SQLite table.

~~~
mcbits
150,000 is solidly in the realm where SQLite tested faster than the
filesystems in the link above, although their files were only a few kilobytes.
It's almost certainly different (worse) for multi-megabyte files. But what
I've been trying to convey is that the number of records in the SQLite table
will be identical to the number of files. If you need to parse the file
contents, you'd parse the BLOB just the same. The difference is in how you
interact with the disk.

Is reading X-thousand files containing a thousand records each (or one
thumbnail, or one HTML dump, or one JavaScript function, or whatever) faster
than SELECTing exactly the same number of BLOBs containing exactly the same
data? It's worth considering and testing once the number of files starts
affecting performance or even just becomes a pain to deal with. If it turns
out that storing many files is still a better fit for a particular
application, that's cool too. Nothing is a panacea.

------
codezero
I started reading this thinking it would be about something LC_ALL/locale
related, but I'm not too surprised to see another env variable throwing a
wrench in the works.

~~~
nisa
For context, you're referring to the fact that

    LC_ALL=C grep string hugefile

is often orders of magnitude faster (it also works for other shell
utilities). This is due to Unicode handling, which is either inherently
complex because of the nature of Unicode or just performs badly in glibc;
probably it's the complexity.
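
An easy way to see the difference on your own data (a sketch):

    # compare wall-clock times with and without the locale shortcut
    time grep pattern hugefile
    time LC_ALL=C grep pattern hugefile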

~~~
jschwartzi
Seems analogous to XML parsing in Perl, where I realized a 10x speedup in
parsing by switching from a general-purpose XML parser to a set of regular
expressions.

~~~
jasonjayr
[https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/)

If you read nothing else, make sure you read the first answer there.

~~~
adrianN
If you want to extract just the values of a single tag or similarly trivial
things, or the XML is a very restricted subset, regexes are just fine for
parsing.

~~~
brokensegue
it's really not

~~~
hopscotch
If you want to extract strings from XML that arrives in a well-known format
that meets certain expectations that may not be rigorous, then regexes can be
fine.

------
HocusLocus
Fave speedup story: in the '80s a good friend had written a custom accounting
system in CBASIC, and he came to me regarding one client, a boat charter
service management business that maintained a set of 'virtual double-entry
books' for every one of its 200+ member yachts. The monthly process was taking
4 hours to complete. Could I improve on that?

Well, in a couple of hours I discovered that his routines assembling the main
batch file did repeated lookups in other files without caching any results,
and because it was MS-DOS, the OS didn't cache many sectors either. So each of
the hundreds of thousands of preparatory operations was waiting for the hard
disk platter to come around again. Yes, even hard disk platter speed was a
significant factor in those days.

So I added a 15-element array that cached lookup results in memory. From 4
hours to 15 minutes, Thank You Very Much.
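
In today's terms the trick was nothing more than a tiny memo table; a Python
sketch (expensive_disk_lookup is a hypothetical stand-in for the original
CBASIC routines):

    # tiny fixed-size cache for repeated disk lookups (a sketch, not the original)
    cache = {}

    def cached_lookup(key):
        if key not in cache:
            if len(cache) >= 15:                  # mimic the 15-element array
                cache.pop(next(iter(cache)))      # evict the oldest entry
            cache[key] = expensive_disk_lookup(key)  # stand-in for the file lookup
        return cache[key]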

~~~
agumonkey
May I contribute a hybrid optimization? Some workers were tasked with removing
duplicates from Excel files by hand. They'd have to create a filtered view for
each entry, and with Excel being slow, it took hours. One woman had a
3,000-line spreadsheet and was about to lose it. I wrote 8 lines of VBA to
count duplicates, just enough so she could filter anything with a count > 1
and delete things on her own, turning a multi-day job into a two-click
operation. Probably the happiest lines of code of my entire life.

------
_bxg1
"he mentioned his laptop was 1,000x faster than Sherlock to list this
directory’s contents"

"that’s a 40x speedup right there"

So... what about... the rest?

~~~
kilianc
Good point! But sometimes a large-scale parallel filesystem can't beat a local
SSD on certain access patterns.

Also, we didn't see any timings for that 1,000x claim, so let's call it an
estimate. :)

~~~
emmelaich
It's a great article, thanks.

But... do people actually use `ls` to list so many files? I mean, it'd scroll
off the terminal anyway.

~~~
chris_wot
I admit to being guilty of the odd

    for i in $(ls)
    do
      Something dumb
    done

~~~
zaphirplane
Out of interest, why not xargs or find's exec?

~~~
freedomben
Not GP, but I often find myself writing multi-line statements (which is
awkward with find), or using syntax that find struggles with.
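
A loop that still handles odd filenames might look like this (a sketch;
something_dumb is a placeholder):

    # null-delimited names survive spaces and newlines, unlike $(ls)
    find . -maxdepth 1 -type f -print0 |
    while IFS= read -r -d '' f; do
        something_dumb "$f"
    done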

------
whalesalad
The irony here is that instead of just unsetting the alias that proxies `ls`
to `ls --color=auto`, the solution was to leave the coloring in place and hack
around it with the override env var that defeats that behavior.

i.e., `var foo = 1 + 1 - 1` instead of `var foo = 1`

Wouldn't it have been more straightforward to open one's .bashrc/.bash_profile
and remove the alias?

~~~
marvy
Sure, but they WANTED colors. They just wanted them to be fast.

~~~
kilianc
Precisely. We want colors. We're refined people.

------
trollied
I was once flown out to look at a significant performance problem. The
customer could not do a day's worth of work in a day. We'd told them to buy a
fully loaded Sun E15k, and it couldn't keep up.

I instrumented everything (recompiled everything so that the binaries spat out
gprof data, plus iostat output, vmstat output, and Sybase monitoring), and it
quickly became obvious that the DB-driven config was set up in a way we hadn't
anticipated. The fix was easy (just use a btree in a couple of places instead
of a linear array walk), but then we had a huge problem.

With the fix in place, the bottleneck dropped to 15% CPU utilisation. The
system had been sized based on our advice. Whoops. What happened next was
political, and I won't go into it. I'm deliberately vague about the industry
and customer because of the 7-figure cost involved!
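
To give a flavor of the class of fix (a generic Python sketch, not the actual
code):

    import bisect

    # before: O(n) linear walk per lookup
    def find_linear(pairs, key):
        for k, v in pairs:
            if k == key:
                return v
        return None

    # after: O(log n) lookup over sorted keys (standing in for the btree)
    def find_sorted(keys, values, key):
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return values[i]
        return None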

~~~
serpi
So before diagnosing the problem you made the customer spend huge amounts of
money based on a guess. Not the least bit surprised.

------
rawrmaan
I really enjoyed the writing style. This was fun to read. Thanks!

~~~
chapium
I feel the opposite: I'd rather they get to the point in the first paragraph,
then follow it up with casual examples of how they arrived at the solution.

~~~
kilianc
True, that could have used a TL;DR, agreed.

~~~
CaliforniaKarl
But then people would miss the rainbows!

------
jf
The tl;dr here is that "ls" can be much faster if you disable colorizing files
based on their file capabilities, setuid/setgid bits, or executable flag:

    LS_COLORS='ex=00:su=00:sg=00:ca=00:'
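
To apply it (the keys are ex=executable, su=setuid, sg=setgid, ca=capability):

    # one-off test
    LS_COLORS='ex=00:su=00:sg=00:ca=00:' ls
    # or make it stick, e.g. in ~/.bashrc
    export LS_COLORS='ex=00:su=00:sg=00:ca=00:'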

~~~
neilv
And if you don't want angry fruit salad colors in your cool hipster
retro-Matrix semitransparent tiling-window desktop theme:

    unset LS_COLORS
    alias l='/bin/ls -a'
    alias ll='/bin/ls -alF'

~~~
addicted
I'd argue that for the vast majority of manual uses of ls, the second or so
saved by listing directory contents without colors will be dwarfed by the
additional time it takes the human to parse the output without the context
those colors provide.

~~~
neilv
I'd agree with you: judicious and consistent use of color should give better
HCI performance than the colorless method, for the majority of people.

But it's very visual and aesthetic, and maybe it's best considered a user
preference, which is why I gave a tongue-in-cheek rationale when I mentioned
the option.

~~~
stevenhuang
Good points. And for those wondering, "HCI" means Human Computer Interaction
(had to check).

~~~
neilv
Thanks. HCI (and human factors engineering) were the original areas of study
concerned with these questions. UX, which is more popular at the moment, seems
to have more emphasis on visual appeal and marketing psychology, rather than
on effectiveness for the user's goals. You could see "dark patterns" as an
extreme of this shift in intent.

------
franciscop
I thought this was about setting NODE_ENV=production, and judging by the other
comments, others had different env variables in mind too. It seems there are
many ways to make a system an order of magnitude faster with a little deep
work/knowledge.

------
gnufx
It may be worth mentioning that this sort of thing isn't just a problem for
whoever is running it: hammering the metadata server(s) can have a bad effect
on other users too. It's necessary to have monitoring that can identify the
jobs responsible and take action. The Lustre system here was clobbered for
several days because the admins had no means of doing that, despite all the
facilities provided for and by Lustre.

------
tyingq
I suspect something like "ls | cat" would speed it up as well. ls probably
tests whether stdout is a tty before bothering with stat() and the color
logic.
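
One way to check (a sketch; on newer systems the stat calls may show up as
statx instead of lstat):

    # compare syscall counts with and without color
    strace -c ls --color=always > /dev/null
    strace -c ls --color=never  > /dev/null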

~~~
kilianc
True, but then you would lose all the coloring. By targeting only the
attributes that generate additional syscalls, you can keep the vast majority
of the colors and still get a nice speed bump. Win-win.

------
fsniper
I was expecting some new flag for a bigger getdents buffer size.

[http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html](http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html)

------
natmaka
An often-neglected related trick is the "dir_index" feature of ext3/ext4.
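
For example (a sketch; the device name is hypothetical, and the filesystem
should be unmounted before running e2fsck):

    # check whether dir_index is already enabled
    tune2fs -l /dev/sdX1 | grep -i dir_index
    # enable it, then rebuild the hash indexes of existing directories
    tune2fs -O dir_index /dev/sdX1
    e2fsck -D /dev/sdX1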

------
yann63
The author removed the coloring, because by default ls lists files with
colors; he uses an environment variable to do that.

Wouldn't it be simpler to use \ls instead of ls? That way you get the "basic
version of the executable", not some alias. I always use \command instead of
command in my shell scripts, because you never know which aliases a user has
set up.
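
For instance, either of these bypasses an alias (a quick sketch):

    \ls          # skips alias expansion for this one invocation
    command ls   # ditto, and also skips shell functions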

~~~
Pawamoy
But aliases are not supposed to be available in subprocesses/scripts unless
you use shopt -s expand_aliases. And even if they were, you cannot ask all
your users to rewrite their scripts to use \ls instead of ls.

------
cespare
The conclusion seems unsatisfactory. Why is lstat so much slower on their
system than on the laptop?

~~~
mbuna33
Compute clusters generally use distributed file systems accessed over the
network...

~~~
CaliforniaKarl
Exactly that. We have 2000+ users, and 1000+ compute nodes. Each compute node
does have a certain amount of local SSD (`$L_SCRATCH`) for job-local storage,
but for everything else we have to use some sort of network-accessed file
system. For us, that is NFSv4 for longer-lasting data (`$HOME` and
`$GROUP_HOME`) to the Isilon over Ethernet, and short-term data (`$SCRATCH`
and `$GROUP_SCRATCH`) to Lustre over Infiniband.

(Yes, Infiniband! Hello, Mellanox!)

------
chapium
tl;dr version

LS_COLORS='ex=00:su=00:sg=00:ca=00:'

------
ngcc_hk
Wow. I missed this for decades. So much of life wasted on colour.

------
drb91
Why are people so lazy at naming?
[https://en.m.wikipedia.org/wiki/Sherlock_(software)](https://en.m.wikipedia.org/wiki/Sherlock_\(software\))

What ever happened to meaningful names that described the particular, not the
meme?

~~~
brokensegue
like GNU?

~~~
drb91
I refuse to defend any gnu or unix naming conventions.

