This took place around 2009. Back then I was working for Rapid7, on their network vulnerability scanner Nexpose. The codebase was mostly Java and relatively large, at 1M+ lines of code. We had many unit tests. Running the entire test suite took up to 40 minutes on Windows and 20 minutes on Linux (Windows was always about twice as slow at everything: building, the product's startup time, etc.). The company had grown quickly to at least 30-50 software engineers. The problem was that every time one of them ran a build on their local machine (which happened multiple times a day), it would run the test suite and waste up to 40 minutes of that person's time. 40 minutes × dozens of engineers = a lot of inefficiency in the company.
I loved solving performance issues, so I remember arriving at the office one day and making it my mission to investigate whether there was an easy way to speed up the test suite. Our build system was based on Ant, and our tests used the JUnit framework. After a little time profiling Ant with basic tools (top, strace, ltrace), and taking a few Java stack traces, I realized that most of the wasted time was not spent actually running the individual tests: instead, a JVM instance kept being started and destroyed between each test. Our Ant build file was running the JUnit task with fork=yes, which was required for a reason I don't recall at the moment; this forks a JVM for running the tests. Then a little googling led me to this:
While reading this documentation, I stumbled upon a parameter unknown to me: forkmode. What does it do?
"forkmode: Controls how many JVMs get created if you want to fork some tests. Possible values are perTest (the default), perBatch and once. once creates only a single JVM for all tests while perTest creates a new JVM for each TestCase class. perBatch creates a JVM for each nested <batchtest> and one collecting all nested <test>s."
Our Ant build file did not set forkmode, so it meant we were forking a new JVM for every test!
I immediately tried forkmode=perBatch and... the test suite ran 10× faster! 40 minutes down to 4 minutes on Windows, and 2 minutes instead of 20 on Linux. I told my boss right away, but he didn't believe me. He asked that I check with our most experienced Java developer. I showed him my 1-line patch speeding up the test suite 10× and he said "I guess you are right, we can commit that." By lunch time the fix was committed and everyone loved me :)
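(For the curious, the one-line change would have been something along these lines in the Ant build file; the surrounding junit task here is a made-up sketch, not the actual Nexpose build file:)

    <!-- Sketch only: classpath, report dir and fileset names are invented for illustration. -->
    <junit fork="yes" forkmode="perBatch" printsummary="on" haltonfailure="yes">
        <classpath refid="test.classpath"/>
        <batchtest todir="${reports.dir}">
            <fileset dir="${test.src.dir}" includes="**/*Test.java"/>
        </batchtest>
    </junit>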
Certainly untrue for those who enjoyed their 40 minute breaks
My first task was to automate the build process for all their iOS and macOS front-end software, as well as the Linux-based back-end support systems. They didn’t want their Engineers to have their laptops taken over for thirty minutes to an hour or more, every time they did a build.
So, I delivered a Jenkins CI/CD system for them. And although I’m not a developer, I was able to find and fix the problems in their code base that kept it from building in a server environment instead of on their laptops. I also added a lot of static code analysis tools, and they developed a lot more regression suite tests, once the builds were automated on the server.
And then I found out I had automated myself out of a job, and my contract ended. I was a DevOps Engineer, with a heavy emphasis on the Ops side of the house, and because I wasn’t a Developer, they didn’t have any more work for me.
After finding out that my contract would be ending, the first thing I did was to call my wife and tell her the good news — I would be moving back to Austin, and I would be there in time for Christmas.
As a contractor, I had a high hourly pay rate, which meant I was just barely able to afford to get a 1BR apartment in the Cupertino/San Jose area. But if they had offered me a full-time position, they would have had to offer to pay me at least $250k-$350k per year, for me to be able to afford to keep that same apartment. And that doesn’t include what they would have had to pay me to be able to afford to have anything remotely like our house here in Austin.
I really enjoyed most aspects of working there, and it was a really good experience, but I do feel like I really dodged a bullet there. Or maybe it was a grenade.
Huh? Rent is expensive in Cupertino, but not THAT expensive. You can live quite comfortably with half that.
Is that worth a raise? Probably not on its own, but I was saying it's a good time to ask for one if you secretly think you should get one, as you have a good case to put forward.
On a meta level, a company that doesn't taskmaster or track everyone's time usage will get this kind of result from time to time. A curious professional just making things better. Continuous improvement.
You can find more information about us at https://srcc.stanford.edu
And you can find more information about Sherlock at https://www.sherlock.stanford.edu
On an NFS share mounted on my PC, your LS_COLORS tweak actually degrades performance. Without modification, it takes 0.5 seconds to list a directory with 14448 files (on a slow consumer-grade HDD), and it only gets slower after setting the variable.
However, listing a local (SSD) directory with 15k files takes just about 0.12 seconds, and gets faster after setting the variable, down to 0.06 seconds.
For all tests I drop the caches before testing, e.g.:
echo 3 > /proc/sys/vm/drop_caches; time ls --color=always /mnt/nfs/many | wc -l
Run "strace -o logfile ls --color=always" and diff the two logfiles.
# du -sh logfile_nfs*
Without modifying the variable: https://bin.disroot.org/?149fa91c08b27312#0w9O6BAWNEEC4SUXEb...
With setting the variable: https://bin.disroot.org/?149fa91c08b27312#0w9O6BAWNEEC4SUXEb...
But for completeness, I wrote a few loops which should answer your question:
With exported Variable: https://bin.disroot.org/?c405d7aa74b50ee0#KverJYJEzNhct7CmdW...
With exported Variable: https://bin.disroot.org/?d3ac83f1dff9e767#WVODjGnvL1QbOJdWPp...
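(In case those paste links rot: a loop along the lines below, repeating the cache-drop-and-time measurement from above once with the variable exported and once without, should reproduce the comparison. It's only a guess at what the pasted loops contain.)

    # Rough reconstruction, not the pasted loops: run the same measurement
    # several times with caches dropped, then repeat with LS_COLORS exported.
    for i in 1 2 3 4 5; do
        echo 3 > /proc/sys/vm/drop_caches
        time ls --color=always /mnt/nfs/many | wc -l
    done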
I also tried dropping the cache on the NFS server but that didn't seem to have a major effect on the performance (probably because reading a 15k file index from a local disk doesn't take that long after all).
I helped a researcher debug a Lustre performance issue a while ago. Each job was nothing special: read a few files (maybe a few GB total), do some (serial, no MPI or such) calculations taking maybe 10 minutes or so, produce output files, again a few GB. No problem, except that when the person ran several hundred of them in parallel as an array job, the throughput per job dropped to a small fraction of normal. Turned out that all the jobs were using the same working directory. Slightly tweaking the workflow to use per-job directories fixed it.
Disclaimer: I work for Dell.
Disclaimer: I work on Isilon
Edit: By "file storage", I'm talking about storage mounted using protocols like SMB and NFS.
IMO. I guess I'm just an old fart.
Just because it has colours doesn't mean it's not a serious article.
IMO. I guess I'm not an ageist.
Postfix emoji, eurgh
I wonder, is this specific to the situation when I use "ls", or is it in general?
I recently worked on a project where I needed to store many small files on ext4, although these files are not read/written by humans. I came across suggestions to group files into subdirectories rather than put them all in one directory. Is there evidence that it's actually worth it on a modern filesystem?
EDIT: by "many" I mean 10+ million files.
Performance aside, I wish NodeJS had a way to pack all its little turds into one SQLite file, out of sight. (Admittedly I hadn't thought to look for such a thing until now... Preliminary results are negatory.)
I've seen a lot of people just missing an index or something, and their DB then ran fine, so that's why I ask.
It's not an index thing though. Try inserting 10 million rows into a simple table. No foreign constraints or anything. Some of our "documents" had 300 million rows. Handle the primary key however you want (let the DB do it, or generate your own). Use whatever SQLite API calls you want (precompiled statements, whatever). In a case like this, adding other indexes can only slow it down.
There are a few options to disable integrity and other safety constraints. They help speed things up a little, but it's all painfully slow in comparison to writing a simple CSV file.
The same is true on reading. You can parse an entire CSV file more quickly than "SELECT *" from the table.
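(To make that concrete: a rough sketch of the kind of relaxed-safety bulk load being described, done here through the sqlite3 command-line shell rather than any particular API. The table and CSV file are made up; the commenter's real schema isn't shown.)

    # Sketch: bulk-load a header-less CSV with the safety PRAGMAs relaxed.
    # "docs.csv" and the "docs" table are invented for illustration.
    sqlite3 bulk.db <<'EOF'
    PRAGMA journal_mode = OFF;
    PRAGMA synchronous = OFF;
    CREATE TABLE docs (id INTEGER PRIMARY KEY, payload TEXT);
    .mode csv
    .import docs.csv docs
    EOF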
I've tried several times to use SQLite as a document file format. It performs worse than CSV and JSON. It performs much worse than just an array of binary records. The devil is in the details, and assuming each file is not just a single row/record, I wouldn't be surprised if 10+ million files in a structured directory tree performed better too.
I don't have any direct comparisons of SQLite to this approach, but other projects I've worked on did have a simple directory tree organizing medium size chunks of data by date over 20+ years. We had one "DB" where everything was stored by the degree of latitude and longitude it fell into, and another where things were stored by date. Both were gigabytes in size at a time when that was considered a large data set, and it was very fast to open only what you needed.
Depending on the problem, this can be a very good solution. It was trivially accessible from any programming language (including ones that didn't exist when the DB was started), and it was trivial to migrate to new operating systems (some of this started out on VMS).
I like SQLite quite a bit, but it's not always the best solution to storing tabular/relational data.
The numbers won't seem significant by today's standards. In the one case (stored by date), maybe 150 thousand files, each a few megabytes.
> you can copy a single 10 GB file to another drive faster than 1 million files totaling 1 GB
True, but I'll bet you can create (or read) a thousand files with a thousand records each faster than you can insert (or select) a million records into (or from) a SQLite table.
File systems aren't the end of the story, however. There's always the need to do backups or copy files for some other reason. Some tool somewhere will find the need to readdir and will not cope well with it.
So it is still a good idea to partition files into directories when you have many of them, and millions of files are still many.
Many of our users use MATLAB. Yes, they should probably use Python, or Octave, but they're using MATLAB.
We use SLURM as our job scheduler. It supports "job arrays", which allow a user to submit a large number of jobs (up to 1,000 for us) with a single "batch script": a shell script that gets run on a compute node, in a cgroup with the requested amount of memory and CPU. A job array simply runs this script in many such cgroups, spread out across nodes; the batch script can check environment variables to see its "array index", which the user uses to split up the work.
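(For anyone who hasn't seen one, a minimal array batch script looks roughly like this; the input/output layout and the ./process program are made up:)

    #!/bin/bash
    #SBATCH --job-name=example-array
    #SBATCH --array=1-1000
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=4G
    # Slurm runs 1,000 copies of this script; each one reads its own index
    # from SLURM_ARRAY_TASK_ID and works on its own slice of the input.
    INPUT="inputs/chunk_${SLURM_ARRAY_TASK_ID}.dat"
    ./process "$INPUT" > "outputs/result_${SLURM_ARRAY_TASK_ID}.out"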
Once people found this, they took advantage of our site-wide MATLAB license, launching hundreds or thousands of MATLAB instances at around the same time.
MATLAB likes to create a 'prefdir' inside `~/.matlab/RXXXXy` ("R2017b", for example). Thousands of MATLABs, all making directories, inside the same common directory, on network-mounted storage.
Our Isilon cluster was floored.
We found the environment variable which controlled where the prefdir is placed. We changed it to a path in $SCRATCH (the user's scratch space), which is on Lustre.
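(The comment doesn't name the variable, but MATLAB documents MATLAB_PREFDIR for this, so the redirection presumably looked something like the sketch below; the exact path layout is a guess.)

    # Sketch: point MATLAB's prefdir at scratch space instead of $HOME.
    # MATLAB_PREFDIR is the documented variable; the subdirectory name is made up.
    export MATLAB_PREFDIR="$SCRATCH/.matlab_prefdir"
    mkdir -p "$MATLAB_PREFDIR"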
Our Lustre environment held up for a while, but was similarly floored.
In all of these cases, the issue appears to be many clients trying to take a short lock on the `~/.matlab/RXXXXy` directory, to make their subdirectory.
"It worked fine for one, or for a hundred, so it must be OK?"
I suspect it's also still a performance issue with many popular filesystems. See https://lwn.net/Articles/606995/ for more info.
If there is even a remote chance you will use standard Unix tools on your files as well, I'd go for the subdirectories.
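(The usual trick, if you do go that way, is to fan files out into subdirectories keyed on a short prefix of a hash of the name; a sketch, with the two-level layout being an arbitrary choice:)

    # Sketch: store "somefile.dat" under data/<xx>/<yy>/ where xxyy are the
    # first four hex digits of a hash of the name. Depth and width are arbitrary.
    name="somefile.dat"
    h=$(printf '%s' "$name" | sha1sum | cut -c1-4)
    dir="data/${h:0:2}/${h:2:2}"
    mkdir -p "$dir"
    mv "$name" "$dir/"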
LC_ALL=C grep string hugefile
If you read nothing else, make sure you read the first answer there.
We're literally getting the exact same XML message back, but with different text in each field. The framework we were using insisted on parsing each response separately even though the responses were basically the same, but with different values in the exact same XML tags.
In my opinion it's the same kind of optimization as massaging your data inputs to span cache lines in your MCU to improve performance. By requiring the data to be well-specified in its format you can ignore all of the clues that XML provides about the format of the data, because you can assume the format in advance. So all the stuff like schemas and validation become completely unnecessary because you already know what valid data looks like.
Well, in a couple of hours I discovered that his routines assembling the main batch file did repeated lookups in other files without caching any of the results, and because it was MS-DOS the OS didn't cache many sectors either. So each of the hundreds of thousands of preparatory operations was waiting for the hard disk platter to come around again. Yes, even hard disk platter speed was a significant factor in those days.
So I added a 15-element array that cached lookup results in memory. From 4 hours to 15 minutes, Thank You Very Much.
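(The original was MS-DOS-era code, not shell, but the general idea of caching lookup results is the same everywhere; a bash sketch, with the data file and key format invented:)

    # Memoize an expensive file lookup so repeated keys never hit the disk again.
    declare -A cache
    lookup() {
        local key="$1"
        if [[ -z "${cache[$key]+set}" ]]; then
            cache[$key]=$(grep -m1 "^$key," big_table.csv)
        fi
        printf '%s\n' "${cache[$key]}"
    }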
Had a vendor we were partnered with for a government contract. The company I worked for usually did all its own development of Java-based web applications for the state, but on this one it was decided to partner with someone else, as they already had a solution in the space. We'd just run it for them.
It was terrible. The UI/UX was fine; it actually solved the problem for the state, and was pretty intuitive. On the software side, though, it just flat out did not scale, for a number of reasons (including the same one the OP had: they loved to dump everything in a single directory).
My particular favourite reason was the way they chose to produce reports. They'd take the business requirement, sit there with a visual database tool, and piece together a SQL statement to produce the report, no matter how complicated that query got. Then they'd drop that in their code and voila, effortless reporting! There were queries in their code base that would end up nesting some 40+ selects deep, e.g. SELECT foo FROM bar WHERE id = (SELECT id FROM monkey WHERE id = (SELECT ... and so on down the line.
The MySQL query optimiser, at the time, didn't handle those queries very well, and it could end up taking 45 minutes or so to produce the final report, and it was showing signs of potentially exponential execution time growth. This was around the time MariaDB was starting to take off, and it turned out this issue was one of the first they'd fixed in the optimiser (MySQL followed suit with a completely overhauled query optimiser they'd been working on for a while), and so I switched over to MariaDB instead of MySQL. Free speed boost! Now it only took 10 minutes to produce the report! Everyone loved me... but I still wasn't happy. The final reports didn't look that complicated to me.
I picked up the most egregious example, and rewrote it in perl (the only language I wrote in at the time). Along the way I identified that if they wanted to keep it in a single query they could easily cut it down to something significantly simpler, but also if I just made it four or five queries and wrote some _very_ simple logic around it in perl, I could get the report out within seconds, while drastically reducing load on the database server. I gave the developers the proof of concept and simplified SQL query, but neither got used.
In the end the company that produced the software was facing bankruptcy, they couldn't support what they'd sold the state, at the price they'd sold it for. They weren't ready for that scale, didn't know how to code to that scale, and didn't really have a grasp of the costs of supporting at that scale. For various reasons, my employer ended up taking over the code base from them, with legal agreements that we'd only use it for this purpose, never sell it anywhere else etc. etc. The code base was _awful_. I thought the things I could see (queries) were bad. The code base was worse. Thankfully that was someone else's problem to work on :D
"that’s a 40x speedup right there"
So... what about... the rest?
Also, we didn't see any timings for that 1,000x claim, so let's say it's an estimate. :)
But .. do people actually use `ls` to list so many files? I mean, it'd scroll off the terminal anyway.
for i in $(ls)
for i in *txt
Use find | while read instead, or xargs. That's going to be easier on the eyes and actually work.
for i in *; do
EDIT: Actually, one issue with the above is that if there aren't any matching files then you get a literal "*". But I'd rather deal with that than break when filenames have embedded whitespace.
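(Putting those pieces together, the patterns that survive both empty matches and odd filenames look roughly like this:)

    # Glob form: with nullglob set, the loop body simply doesn't run when
    # nothing matches, instead of seeing a literal "*.txt".
    shopt -s nullglob
    for f in *.txt; do
        printf 'processing %s\n' "$f"
    done

    # find form: -print0 plus read -d '' copes with spaces/newlines in names.
    find . -maxdepth 1 -type f -name '*.txt' -print0 |
    while IFS= read -r -d '' f; do
        printf 'processing %s\n' "$f"
    done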
i.e., `var foo = 1 + 1 - 1` instead of `var foo = 1`
Wouldn't it have been more straightforward to open one's .bashrc/.bash_profile and remove the alias?
I instrumented everything (recompiled everything so that the binaries spat out gprof data, iostat output, vmstat output, sybase monitoring) and it quickly became obvious that the DB driven config was configured in a way we’d not anticipated. The fix was easy (just use a btree in a couple of places instead of a linear array walk), but then we had a huge problem.
With the fix, the CPU bottleneck dropped to 15% CPU utilisation. The system had been sized based on our advice. Whoops. What happened next was political and I won't go into it. I'm deliberately vague on the industry and customer because of the 7-figure cost involved!
alias l='/bin/ls -a'
alias ll='/bin/ls -alF'
But it's very visual and aesthetic, and maybe it's best considered a user preference, which is why I gave the tongue-in-cheek rationale when I mentioned the option.
Quoting from the article, "It turns out that when the LS_COLORS environment variable is not defined, or when just one of its <type>=color: elements is not there, it defaults to its embedded database and uses colors anyway."
In addition to that code, the shell init scripts (both system-wide, and user-specific default ones that are copied in) might do things like checking whether a variable is set, and setting it, if not (or simply overriding, regardless).
Wouldn't it be simpler to use \ls instead of ls? This way you get the "basic version of the executable", not some alias. I always use \command instead of command in my shell scripts, because you never know which aliases a user set up.
(Yes, Infiniband! Hello, Mellanox!)
What ever happened to meaningful names that described the particular, not the meme?