When setting an environment variable gives you a 40x speedup (stanford.edu)
505 points by CaliforniaKarl 61 days ago | 131 comments



Story time.

This took place around 2009. Back then I was working for Rapid7, on their network vulnerability scanner Nexpose. The codebase was mostly Java and was relatively large at 1M+ lines of code. We had many unit tests. Running the entire test suite took up to 40 minutes on Windows, and 20 minutes on Linux (Windows was always about twice as slow at everything: building, the product's startup time, etc.) The company had grown quickly to at least 30-50 software engineers. The problem was that every time one of them ran a build on his or her local machine (which happened multiple times a day) it would have to run the test suite and waste up to 40 minutes of this person's time. 40 minutes × dozens of engineers = lots of inefficiencies in the company.

I loved solving performance issues, so one day I remember arriving at the office and making it my mission to investigate if there was an easy way to speed up the test suite. Our build system was based on Ant, and our tests used the JUnit framework. After a little time profiling Ant with basic tools (top, strace, ltrace), and taking a few Java stack traces, I realized that most of the wasted time was not spent running the individual tests, but starting and destroying a JVM between each test. Our Ant build file was running the JUnit test with fork=yes, which was required for a reason I don't recall at the moment. This forks the JVM for running the tests. Then a little googling led me to this:

https://ant.apache.org/manual/Tasks/junit.html

While reading this documentation, I stumbled upon a parameter unknown to me: forkmode. What does it do?

"forkmode: Controls how many JVMs get created if you want to fork some tests. Possible values are perTest (the default), perBatch and once. once creates only a single JVM for all tests while perTest creates a new JVM for each TestCase class. perBatch creates a JVM for each nested <batchtest> and one collecting all nested <test>s."

Our Ant build file did not set forkmode, so it meant we were forking a new JVM for every test!
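
To get a feel for why one process per test is expensive, here is a rough sketch in Python (not Ant or Java) that compares launching a fresh interpreter per task with launching a single interpreter for all tasks. The function names and the no-op workload are invented for illustration, and real numbers depend on the machine and the runtime.

```python
import subprocess
import sys
import time


def run_forked_per_task(n):
    """Start a fresh interpreter for every task (analogous to forkmode=perTest)."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


def run_forked_once(n):
    """Start one interpreter that runs all n tasks (analogous to forkmode=once)."""
    start = time.perf_counter()
    code = f"for _ in range({n}): pass"
    subprocess.run([sys.executable, "-c", code], check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 20
    print(f"per task: {run_forked_per_task(n):.2f}s")
    print(f"once:     {run_forked_once(n):.2f}s")
```

With trivial tasks, nearly all of the per-task time is process startup, which is the overhead the forkmode setting controls.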

I immediately tried forkmode=perBatch and... the test suite ran 10× faster! 40 minutes down to 4 minutes on Windows. And Linux ran it in 2 minutes instead of 20 minutes. I told my boss right away but he didn't believe me. He asked that I check with our most-experienced Java developer. I showed him my 1-line patch speeding up the test suite 10× and he said "I guess you are right, we can commit that." By lunch time the fix was committed and everyone loved me :)


> and everyone loved me :)

Certainly untrue for those who enjoyed their 40 minute breaks


Should be motivation enough to implement fuzzy testing.


Several years ago, I did a short contract for Apple Retail Software Engineering. They build all the software and systems that are used to communicate between Retail HQ and the stores, handle all training, etc.... Apple employees know that they’re not allowed to even acknowledge the existence of this software, unless you’re in a secure location, like the store back-of-house. More than a few times, I would go into a store for personal reasons, and as part of the regular chit-chat they would ask where I worked, and I would tell them. They would smile at me and maybe a little wink, but that was always as far as it went.

My first task was to automate the build process for all their iOS and macOS front-end software, as well as the Linux-based back-end support systems. They didn’t want their Engineers to have their laptops taken over for thirty minutes to an hour or more, every time they did a build.

So, I delivered a Jenkins CI/CD system for them. And although I’m not a developer, I was able to find and fix the problems in their code base that kept it from building in a server environment instead of on their laptops. I also added a lot of static code analysis tools, and they developed a lot more regression suite tests, once the builds were automated on the server.

And then I found out I had automated myself out of a job, and my contract ended. I was a DevOps Engineer, with a heavy emphasis on the Ops side of the house, and because I wasn’t a Developer, they didn’t have any more work for me.

After finding out that my contract would be ending, the first thing I did was to call my wife and tell her the good news — I would be moving back to Austin, and I would be there in time for Christmas.

As a contractor, I had a high hourly pay rate, which meant I was just barely able to afford to get a 1BR apartment in the Cupertino/San Jose area. But if they had offered me a full-time position, they would have had to offer to pay me at least $250k-$350k per year, for me to be able to afford to keep that same apartment. And that doesn’t include what they would have had to pay me to be able to afford to have anything remotely like our house here in Austin.

I really enjoyed most aspects of working there, and it was a really good experience, but I do feel like I really dodged a bullet there. Or maybe it was a grenade.


> But if they had offered me a full-time position, they would have had to offer to pay me at least $250k-$350k per year, for me to be able to afford to keep that same apartment.

Huh? Rent is expensive in Cupertino, but not THAT expensive. You can live quite comfortably with half that.


That’s the time to ask for a raise!


Ha! I did something like that for a company. Their core app was slow and not scaling, and they were failing SLAs. I was doing sysadmin work then and was not on the dev team, but I dove into the code base, found the bug, fixed it, and increased speed by 3000%. They had tons of DB clients inserting one row at a time into the DB. I batched them. The company went nuts; everyone was happy since this had been haunting the business for months. The VP came into the office and gave me a high five. The CEO sent me a thank-you email. Then the economy shrank and pay raises were frozen. I got $0 after my performance review, which was excellent, a few months later. My boss at that time dipped into his pocket and gave me a $100 gift card to Best Buy. I bought a VCR; a VCR then was $120. :-D
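
For readers who have not seen the one-row-at-a-time pattern, here is a minimal sketch of the difference, using Python's built-in sqlite3 as a stand-in. The actual database, schema, and client code in the story are not known; the `events` table and the function names are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    return conn


def insert_one_by_one(conn, rows):
    """One INSERT and one commit per row: the slow pattern."""
    for row in rows:
        conn.execute("INSERT INTO events (id, payload) VALUES (?, ?)", row)
        conn.commit()


def insert_batched(conn, rows):
    """All rows in a single transaction via executemany."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT INTO events (id, payload) VALUES (?, ?)", rows)


if __name__ == "__main__":
    rows = [(i, f"row {i}") for i in range(1000)]
    for insert in (insert_one_by_one, insert_batched):
        conn = make_db()
        insert(conn, rows)
        count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
        print(insert.__name__, count)
```

Both functions store the same rows; the batched version pays the per-transaction cost once instead of once per row, which matters far more against a networked or on-disk database than in this in-memory example.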


I’ve always wondered, does this happen in US companies? Or is it only in very small shops? I’ve worked in a variety of big European firms, and salaries and promotions are discussed once a year, at review time. Even when changing jobs within the same firm, the salary will only be aligned at the next cycle


Except if you have some leverage like an offer from another company; then sometimes you can get your current employer to almost match the offer. Once I improved some workflow greatly because I took extra time for it (it was about some Lua debugging, I don't really remember the circumstances), and I was told that this is part of my job, so no raise. Now I don't expect raises for things like that, but I only invest extra time if an issue annoys me, and even then I open a ticket first. Long build times are annoying.


You're gonna be very disappointed in the workplace if you expect a raise every time you're just doing your job.


If he was just doing his job the implication is that no one else was doing theirs at the same company.


And what about all the times someone looks into something like this but finds no way to improve things -- is that doing their job or wasting time?


Don't you think it's far fetched to say that every single developer at the company is useless because they didn't look into ant build settings?


I’m playing devil’s advocate, but the point is he used initiative. He wasn’t asked to improve the build, he was curious and found a solution.

Is that worth a raise? Probably not on its own, but I was saying it’s a good time to ask for one if you are secretly thinking you should get one, as you have put a good foot forward.

On a meta level a company that doesn’t taskmaster or track everyone’s time usage will get this kind of result from time to time. Curious professional just making things better. Continuous improvement.


Unless you're the CEO.


Legit reason. He’s saving hours of engineers’ time, every day. Hope they gave him something in some other form.


Didn't ask for a raise right away. In general the company liked me a lot; so a few months earlier I had asked for and already obtained a +19% raise, and during perf review a few months later I got another +7% raise.


Nice employer.


“But that is your job”


This was written by my coworker, Kilian Cavalotti. We have two main file stores: $SCRATCH is a Lustre cluster, which is extremely performant but doesn't do well with lots of inodes. And $HOME is a multi-node Isilon, which can handle all the inodes, but is not as performant. And we have users who sometimes like to put many files in single directories.

You can find more information about us at https://srcc.stanford.edu

And you can find more information about Sherlock at https://www.sherlock.stanford.edu


Thanks for the additional information, but sometimes I wonder whether 15k+ files count as 'many' nowadays. I mean, an off-the-shelf Canon DSLR produces 20k files per directory (10k JPG + 10k raw). Nevertheless, I did test your approach on my PC:

On an NFS share mounted on my PC, your LS_COLORS tweak actually degrades performance. Without modification, it takes 0.5 seconds to list a directory (on a slow consumer-grade HDD) with 14448 files, and after setting

  export LS_COLORS='ex=00:su=00:sg=00:ca=00:'
it takes 4.4 seconds.

However, listing a local (SSD) directory with 15k files takes just about 0.12 seconds, and setting the variable makes it faster: 0.06 seconds.

For all tests I drop the caches before testing, e.g.:

  echo 3 > /proc/sys/vm/drop_caches; time ls --color=always /mnt/nfs/many | wc -l
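
What is being measured here is roughly "names only" versus "names plus one lstat per entry". The following Python sketch is not what ls does internally; it only illustrates the two access patterns on a temporary directory, and the function names are made up for this example.

```python
import os
import tempfile
import time


def list_names(path):
    """Names only: roughly what a listing needs when there is nothing to colorize."""
    return os.listdir(path)


def list_names_with_lstat(path):
    """Names plus one lstat() per entry, as needed to choose a color per file type."""
    return [(name, os.lstat(os.path.join(path, name)).st_mode)
            for name in os.listdir(path)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for i in range(15000):
            open(os.path.join(d, f"f{i}"), "w").close()
        for fn in (list_names, list_names_with_lstat):
            start = time.perf_counter()
            n = len(fn(d))
            print(f"{fn.__name__}: {n} entries in {time.perf_counter() - start:.3f}s")
```

On a warm local cache the gap may be small; on a network filesystem each lstat may turn into a round trip to the server.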


"On a NFS mounted on my PC your LS_COLORS tweak actually degrades the performance"

Run "strace -o logfile ls --color=always" and diff the two logfiles.


Well, this looks rather strange. As expected, setting the variable causes large changes, as it removes a lot of lstat calls and also removes the color codes from the writes. So there seems to be no obvious explanation for why it becomes slower when doing less...

  # du -sh logfile_nfs*
  1.7M    logfile_nfs
  668K    logfile_nfs_exported
However, the strace -c output looks kinda different:

Without modifying the variable: https://bin.disroot.org/?149fa91c08b27312#0w9O6BAWNEEC4SUXEb...

With setting the variable: https://bin.disroot.org/?149fa91c08b27312#0w9O6BAWNEEC4SUXEb...


You pasted the same link for both cases. Can't tell what's going on.


Oh sorry, my fault. The second link should be

https://bin.disroot.org/?56f3dcb618240df0#x3StRGZZgSzM0UJ0BK...


Plus/minus what? Note that caching on the server can be relevant, and you can't control that from the client. That's one thing that requires care for performance measurements on networked filesystems.


Sounds as if you don't trust my statement. Sure, there is some variance, but when one is measured in seconds and the other in tenths of a second, it should be clear that it is not a coincidence...

But for completeness, I wrote a few loops which should answer your question:

=NFS=

Normal: https://bin.disroot.org/?8967a0f34b26e512#aTYmbyESfeuqAXS802...

With exported Variable: https://bin.disroot.org/?c405d7aa74b50ee0#KverJYJEzNhct7CmdW...

-----

=Local=

Normal: https://bin.disroot.org/?b94e1d1f58e3edb7#tKszD/tjvwBwepJLun...

With exported Variable: https://bin.disroot.org/?d3ac83f1dff9e767#WVODjGnvL1QbOJdWPp...

I also tried dropping the cache on the NFS server but that didn't seem to have a major effect on the performance (probably because reading a 15k file index from a local disk doesn't take that long after all).


I don't understand the "lots of inodes" comment, but any filesystem limit on inodes isn't relevant to this, which is just due to RPCs to the metadata server(s). (ls -l is actually worse than coloured plain ls because the size data are on the OSS, not the MDS.) The canonical advice for general performance is to keep directories reasonably small on Lustre due to possible lock contention, but I don't know the circumstances for which that's actually relevant. [Metadata operations, such as large builds and tars, are typically slower on our Isilon than on the Lustre filesystem, which has no serious tuning as far as I know.]


> The canonical advice for general performance is to keep directories reasonably small on Lustre due to possible lock contention, but I don't know the circumstances for which that's actually relevant.

I helped a researcher debug a Lustre performance issue a while ago. Each job was nothing special: read a few files (maybe a few GB total), do some (serial, no MPI or such) calculations taking maybe 10 min or so, produce output files, again a few GB. No problem, except when the person ran several hundred of them in parallel as an array job, the throughput per job dropped to a small fraction of normal. Turned out that all the jobs were using the same working directory. Slightly tweaking the workflow to have per-job directories fixed it.
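
A per-job directory can be as simple as the sketch below. It assumes a SLURM-style array job, where the scheduler sets SLURM_ARRAY_TASK_ID for each element; the function name and the `job_` prefix are hypothetical, and the original workflow's details are not known.

```python
import os
import tempfile


def per_job_workdir(base, env=None):
    """Create and return a working directory unique to this array element."""
    env = os.environ if env is None else env
    # SLURM sets SLURM_ARRAY_TASK_ID for array jobs; fall back to the PID otherwise.
    task_id = env.get("SLURM_ARRAY_TASK_ID", f"pid{os.getpid()}")
    workdir = os.path.join(base, f"job_{task_id}")
    os.makedirs(workdir, exist_ok=True)
    return workdir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        print(per_job_workdir(base, {"SLURM_ARRAY_TASK_ID": "17"}))
```

The one-time mkdir still touches the shared parent, so this mainly helps when the ongoing file traffic inside the working directory is the problem.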


what is an Isilon? thanks



And more on the product in particular: https://en.wikipedia.org/wiki/OneFS_distributed_file_system

Disclaimer: I work for Dell.


Hyperscale Network Attached Storage!

Disclaimer: I work on Isilon


That's actually a disclosure, not a disclaimer.


Why was iSCSI support deprecated? I recently developed an interest in iSCSI, which makes me curious; not an Isilon user, though.


My guess would be that Isilon/OneFS is designed for file storage, instead of block storage, so having a block storage layer on top (like iSCSI) seems a little ungainly. Instead I expect Dell/EMC would prefer you investigate the VNX platform!

Edit: By "file storage", I'm talking about storage mounted using protocols like SMB and NFS.


Clustered storage.


Tell your colleague I think emojis make the article look like it was written by a 12 year old.

IMO. I guess I'm just an old fart.


I think they add a bit of emotion to the text written.

It's not because it has colours that it's not a serious article.


Even if it were, maybe you should have more respect for 12 year olds.

IMO. I guess I'm not an ageist.


It is how most millennials write nowadays


It’s how people trying to write like millennials write

Postfix emoji, eurgh


> Having thousands of files in a single directory is usually not very file system-friendly, and definitely not recommended.

I wonder, is this specific to the situation when I use "ls", or is it in general?

I recently worked on a project where I needed to store many small files on ext4, although these files are not read or written by humans. I came across suggestions to group files into subdirectories and not put them all in one directory. Is there evidence that it's actually worth it on a modern filesystem?

EDIT: by "many" I mean 10+ million files.


For future reference, next time you need to store millions of related small files, it might be worth checking if SQLite is an option instead of using the filesystem: https://www.sqlite.org/fasterthanfs.html

Performance aside, I wish NodeJS had a way to pack all its little turds into one SQLite file, out of sight. (Admittedly I hadn't thought to look for such a thing until now... Preliminary results are negatory.)


You should benchmark SQLite before recommending it like this (for future reference). I think it's a very elegant library, and I admire the development philosophy that went into it. However, every time I've tried to use it for something that needs to be fast it ended up being a painful mistake.


Could you show us your methodology for your benchmark? We don't know under which circumstances you experienced degraded performance.

I've seen a lot of people just missing an index or something, and their DB then ran fine, so that's why I ask.


I don't work for that company any more, and I couldn't have legally shown the code even if I did.

It's not an index thing though. Try inserting 10 million rows into a simple table. No foreign constraints or anything. Some of our "documents" had 300 million rows. Handle the primary key however you want (let the DB do it, or generate your own). Use whatever SQLite API calls you want (precompiled statements, whatever). In a case like this, adding other indexes can only slow it down.

There are a few options to disable integrity and other safety constraints. They help speed things up a little, but it's all painfully slow in comparison to writing a simple CSV file.

The same is true on reading. You can parse an entire CSV file more quickly than "SELECT *" from the table.

I've tried several times to use SQLite as a document file format. It performs worse than CSV and JSON. It performs much worse than just an array of binary records. The devil is in the details, and assuming each file is not a single row/record, I wouldn't be surprised if 10+ million files in a structured directory tree performs better too.


Well, I certainly agree about benchmarking versus other options, especially if I/O proves to be a bottleneck. The other option above was not a single CSV or JSON-structured file, but millions of small files stored on the filesystem (with its requisite indexes, permissions, access logs, etc). And the comparison is not with the file contents being splayed out into a relational structure in SQLite, but just one flat key-blob table for storage and retrieval. It's possible that a multi-gigabyte CSV file would be faster still, depending on actual access patterns and how much effort you want to devote to this bespoke CSV database.
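
To make the "flat key-blob table" concrete, here is a minimal sketch using Python's sqlite3 module. The table and function names are hypothetical; it is only meant to show the shape of the approach, not a tuned implementation.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with one row per former file."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def put(conn, name, data):
    """Store bytes under a name, replacing any previous content."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)", (name, data)
        )


def get(conn, name):
    """Return the stored bytes, or None if the name is unknown."""
    row = conn.execute("SELECT data FROM files WHERE name = ?", (name,)).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    store = open_store()
    put(store, "thumbs/0001.json", b'{"x": 1}')
    print(get(store, "thumbs/0001.json"))
```

Whether this beats a directory tree for a given workload is exactly the kind of thing that needs benchmarking, as the thread says.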


> The other option above was [...] millions of small files stored on the filesystem

I don't have any direct comparisons of SQLite to this approach, but other projects I've worked on did have a simple directory tree organizing medium size chunks of data by date over 20+ years. We had one "DB" where everything was stored by the degree of latitude and longitude it fell into, and another where things were stored by date. Both were gigabytes in size at a time when that was considered a large data set, and it was very fast to open only what you needed.

Depending on the problem, this can be a very good solution. It was trivially accessible from any programming language (including ones that didn't exist when the DB was started), and it was trivial to migrate to new operating systems (some of this started out on VMS).

I like SQLite quite a bit, but it's not always the best solution to storing tabular/relational data.


How many files are you talking about? That's the relevant variable, not the amount of data or whether it's text, tabular or relational data, images, audio, etc. E.g. you can copy a single 10 GB file to another drive faster than 1 million files totaling 1 GB, all due to filesystem overhead. If there's a filesystem where that's not true, I'm interested. :)


> How many files are you talking about?

The numbers won't seem significant by today's standards. In the one case (stored by date), maybe 150 thousand files, each a few megabytes.

> you can copy a single 10 GB file to another drive faster than 1 million files totaling 1 GB

True, but I'll bet you can create (or read) a thousand files with a thousand records each faster than you can insert (or select) a million records into (or from) a SQLite table.


150,000 is solidly in the realm where SQLite tested faster than the filesystems in the link above, although their files were only a few kilobytes. It's almost certainly different (worse) for multi-megabyte files. But what I've been trying to convey is that the number of records in the SQLite table will be identical to the number of files. If you need to parse the file contents, you'd parse the BLOB just the same. The difference is in how you interact with the disk.

Is reading X-thousand files containing a thousand records each (or one thumbnail, or one HTML dump, or one JavaScript function, or whatever) faster than SELECTing exactly the same number of BLOBs containing exactly the same data? It's worth considering and testing once the number of files starts affecting performance or even just becomes a pain to deal with. If it turns out that storing many files is still a better fit for a particular application, that's cool too. Nothing is a panacea.


What is a modern filesystem and what are its performance characteristics? If you don't want to evaluate these things ahead of time, erring on the side of using subdirectories is safer — it's unlikely to be much worse even on filesystems with fast huge directory traversal and lookup, and it's a lot better on conventional filesystems.


Modern file systems with indexed metadata have improved that situation considerably. There is always a limit, however. If your application stores a million files today, maybe it will grow to a billion?

File systems aren't the end of the story however. There's always the need to do backups or copy files for some other reason. Some tool somewhere will find the need to readdir and will not cope well with it.

So it is still a good idea to partition files into directories when you have many of them, and millions of files are still many.
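
One common way to do that partitioning is to derive the subdirectories from a hash of the file name. A minimal sketch, with hypothetical function names and an arbitrary choice of two levels of two hex characters (65,536 leaf directories, so roughly 150 files each for 10 million files):

```python
import hashlib
import os
import tempfile


def sharded_path(root, name, levels=2, width=2):
    """Map a file name to root/ab/cd/name using a hash of the name."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, name)


def write_sharded(root, name, data):
    """Write data to the sharded location, creating directories as needed."""
    path = sharded_path(root, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(os.path.relpath(write_sharded(root, "report-42.json", b"{}"), root))
```

Because the path is a pure function of the name, lookups need no index; the trade-off is that listing files in name order requires walking the whole tree.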


Here's another little "story from the trenches".

Many of our users use MATLAB. Yes, they should probably use Python, or Octave, but they're using MATLAB.

We use SLURM as our job scheduler. It supports "job arrays", which allows a user to submit a large number of jobs (up to 1,000) with a single "batch script", which is a shell script that gets run on a compute node, in a cgroup with the requested amount of memory and CPU. A job array simply runs this on many (for us, up to 1,000) cgroups, spread out across nodes; the batch script is able to check environment variables to see its "array index", which the user uses to split up work.

Once people found this, they took advantage of our site-wide MATLAB license, launching hundreds or thousands of MATLAB instances at around the same time.

MATLAB likes to create a 'prefdir' inside `~/.matlab/RXXXXy` ("R2017b", for example). Thousands of MATLABs, all making directories, inside the same common directory, on network-mounted storage.

Our Isilon cluster was floored.

We found the environment variable which controlled where the prefdir is placed. We changed it to a path in $SCRATCH (the user's scratch space), which is on Lustre.

Our Lustre environment held up for a while, but was similarly floored.

In all of these cases, the issue appears to be many clients trying to take a short lock on the `~/.matlab/RXXXXy` directory, to make their subdirectory.

"It worked fine for one, or for a hundred, so it must be OK?"


I'm surprised that the number of concurrent array job elements isn't more limited, partly for that reason. I'd also expect the resource manager to make a per-job (element) TMPDIR on local disk and to get jobs to use that. Where array elements are accessing the same data, it's typically good to unroll the array loop and stage data to local storage for multiple operations on them. I wrote support for that with BLAST jobs in mind, whose mmap was particularly problematic.


At the very least, it would be more polite to anyone that might need to support, troubleshoot, remove the files with shell tools after removing your software, etc.

I suspect it's also still a performance issue with many popular filesystems. See https://lwn.net/Articles/606995/ for more info.


That depends on how you access them. If you only ever iterate over all files, readdir(3) in a tight loop is as fast as it can get; but if you need lots of access to specific files, you will waste a lot of time reading directory entries just to discard them...

If there is even a remote chance you will use standard Unix tools on your files as well, I'd go for the subdirectories.


I started reading this thinking it would talk about LC_ALL/locale related, but not too surprised to see another env variable throwing a wrench in the works.


I used export LC_COLLATE=C export LANG=C to gain several orders of magnitude sorting through 6TB of data by 2 GUID-sized fields. Default collation is terribly slow compared to ascii only.


for context you refer to the fact that

    LC_ALL=C grep string hugefile 
is often orders of magnitude faster (this also works for other shell utilities). This is due to Unicode handling, which is either inherently complex or just performs badly in glibc; probably it's the complexity.
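
When driving such tools from a script, the locale can be overridden for just the child process. A small sketch in Python, assuming a POSIX `sort` is on the PATH; the function name is made up, and no speedup figures are claimed here.

```python
import os
import subprocess


def sort_bytewise(lines):
    """Sort lines with the external sort(1) under LC_ALL=C (plain byte order)."""
    env = dict(os.environ, LC_ALL="C")  # override only for the child process
    result = subprocess.run(
        ["sort"],
        input="".join(line + "\n" for line in lines),
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # In byte order, uppercase ASCII sorts before lowercase.
    print(sort_bytewise(["b", "a", "B"]))
```

Note that the C locale changes the resulting order as well as the speed, so it is only a drop-in replacement when byte order is acceptable.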


Seems analogous to xml parsing in Perl, where I realized a 10x speedup in parsing by switching from a general-purpose XML parser to a set of regular expressions.


https://stackoverflow.com/questions/1732348/regex-match-open...

If you read nothing else, make sure you read the first answer there.


If you want to extract just the values of a single tag or similarly trivial things or the XML is a very restricted set, regexes are just fine for parsing.


You’re not parsing XML at that point. You’re parsing a text based data format that looks superficially like XML. This can work if you strictly control the encoder but it’s not generally interoperable.


Why does general interoperability matter in the context of two devices that you control both interoperating? I mean, if I had written the software on the other end I would've used a different interchange format altogether. But when you constrain the problem down from your software connecting to any and all systems on the internet to your software connecting to a specific system for the purpose of exchanging well-defined messages, you don't need a general-purpose parser with validation. And you end up spending a lot of time validating and analyzing the same message over and over again even though you can guarantee that it's the same message.


Not if your XML contains CDATA sections.


it's really not


If you want to extract strings from XML that arrives in a well-known format that meets certain expectations that may not be rigorous, then regexes can be fine.


Agreed. It's almost as odorous as xmldoc.toString.getBytes--which is definitely a thing some folks do--okay, welcome to encoding problems and busted documents.


Using regex on serialized XML has the potential to go off the rails thanks to the wiiiide variety of ways to express character entities, import docs, use namespaces, escape or not, etc.; consider an event-y/SAX parser for performance if you might run afoul of such issues.


Unless you know exactly the format of the data you're getting back. So in my use case it's not arbitrary XML but rather it's constrained to a particular message with specific fields that are named in a specific way. So it's really a text response that looks like XML rather than being some arbitrary XML message that differs after each response.

We're literally getting the exact same XML message back, but with different text in each field. The framework we were using insisted on parsing each response separately even though the responses were basically the same, but with different values in the exact same XML tags.
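
As an illustration of that kind of constrained extraction, here is a sketch in Python. The message layout and the tag names `status` and `balance` are invented for the example; it deliberately ignores CDATA, entities, attributes, and namespaces, so it only works when the sender never emits them.

```python
import re

# Hypothetical fixed-format response; the tag names are invented for illustration.
FIELD_RE = re.compile(r"<(status|balance)>([^<]*)</\1>")


def extract_fields(message):
    """Pull the known fields out of a message whose layout never changes."""
    return dict(FIELD_RE.findall(message))


if __name__ == "__main__":
    msg = "<reply><status>OK</status><balance>12.50</balance></reply>"
    print(extract_fields(msg))
```

The pattern is compiled once and reused for every response, which is where the saving over re-parsing each identical document would come from.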


If you control precisely the inputs, then it could certainly be fine... until it's not because some other thing changed. Please consider it a brittle hack.


Calling it a "brittle hack" implies that I don't have any control over the other endpoint. In reality my employer controls the endpoint I'm hitting. So it's not a brittle hack in our organization even though it would be a brittle hack if we were, say, hitting a Twitter API.

In my opinion it's the same kind of optimization as massaging your data inputs to span cache lines in your MCU to improve performance. By requiring the data to be well-specified in its format you can ignore all of the clues that XML provides about the format of the data, because you can assume the format in advance. So all the stuff like schemas and validation become completely unnecessary because you already know what valid data looks like.


Yup, been bitten by this a number of times. You can't keep it set to C for daily use these days however. Modern things like Python have a bird.


in my experience `LC_CTYPE=en_US.UTF-8` is sufficient for python. not sure if that slows down grep or not.


Fave speedup story, in the 80s a good friend had written a custom accounting system in CBASIC and he came to me in regards to one client, a boat charter service management business that maintained a set of 'virtual double-entry books' for every one of its 200+ member yachts. The monthly process was taking 4 hours to complete. Could I improve on that?

Well in a couple hours I discovered that his routines assembling the main batch file did repeated lookups in other files without caching any operation results, and because it was MS-DOS the OS didn't cache many sectors either. So the hundreds-of-thousands of preparatory operations were waiting for the hard disk platter to come around again, for each. Yes, even hard disk platter speed was a significant factor in those days.

So I added a 15 element array that cached lookup operations in memory. From 4 hours to 15 minutes, Thank You Very Much.
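
The same idea in modern terms is a small bounded cache in front of a slow lookup. This Python sketch is only an analogy to the CBASIC fix: `slow_lookup` stands in for the disk read, and the least-recently-used eviction policy is an assumption, since the story does not say how the original array was managed.

```python
from functools import lru_cache

calls = 0


def slow_lookup(account_id):
    """Stand-in for a disk read; counts how often it is really executed."""
    global calls
    calls += 1
    return f"account-{account_id}"


@lru_cache(maxsize=15)  # small, fixed-size cache like the 15-element array
def cached_lookup(account_id):
    return slow_lookup(account_id)


if __name__ == "__main__":
    for account_id in [1, 2, 1, 1, 3, 2] * 1000:
        cached_lookup(account_id)
    print(calls)  # 3: only the first lookup of each id reaches the slow path
```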


May I contribute a hybrid optimization? Some workers were tasked with removing duplicates in Excel files by hand. They had to create a filtered view for each entry, and with Excel being slow it took hours. One woman had a 3000-line spreadsheet and was about to lose it. I wrote 8 lines of VBA to count duplicates, just enough that she could filter anything with a count > 1 and then delete things on her own, turning a multi-day job into a two-click operation. Probably the happiest lines of code of my entire life.
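
The counting step itself is tiny. A sketch of the same idea in Python rather than VBA (the original macro is not shown in the thread, so this is only an equivalent, with a made-up function name):

```python
from collections import Counter


def duplicate_counts(values):
    """Return how often each value occurs, keeping only values seen more than once."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


if __name__ == "__main__":
    rows = ["alice", "bob", "alice", "carol", "bob", "alice"]
    print(duplicate_counts(rows))  # {'alice': 3, 'bob': 2}
```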


My favourite speed up never got used :(

Had a vendor we got partnered with for a government contract. The company I worked for usually did all its own development of java based web applications for the state, but on this one it was decided to partner with someone else as they already had a solution in the space. We'd just run it for them.

It was terrible. The UI/UX was fine. It did actually solve the problem for the state, and was pretty intuitive. On the software side, though, it just flat out did not scale for a number of reasons (including the same one OP had: they loved to dump everything in a single directory).

My particular favourite reason was the way they chose to produce reports. They'd take the business requirement, sit there with a visual database tool, and piece together a SQL statement to produce the report, no matter how complicated that query got. Then they'd drop that in their code and voila, effortless reporting! There were queries in their code base that would end up nesting some 40+ selects deep, e.g. SELECT foo FROM bar WHERE id = (SELECT id FROM monkey WHERE id = (SELECT ... and so on down the line.

The MySQL query optimiser, at the time, didn't handle those queries very well, and it could end up taking 45 minutes or so to produce the final report, and it was showing signs of potentially exponential execution time growth. This was around the time MariaDB was starting to take off, and it turned out this issue was one of the first they'd fixed in the optimiser (MySQL followed suit with a completely overhauled query optimiser they'd been working on for a while), and so I switched over to MariaDB instead of MySQL. Free speed boost! Now it only took 10 minutes to produce the report! Everyone loved me... but I still wasn't happy. The final reports didn't look that complicated to me.

I picked up the most egregious example, and rewrote it in perl (the only language I wrote in at the time). Along the way I identified that if they wanted to keep it in a single query they could easily cut it down to something significantly simpler, but also if I just made it four or five queries and wrote some _very_ simple logic around it in perl, I could get the report out within seconds, while drastically reducing load on the database server. I gave the developers the proof of concept and simplified SQL query, but neither got used.

In the end, the company that produced the software was facing bankruptcy: they couldn't support what they'd sold the state at the price they'd sold it for. They weren't ready for that scale, didn't know how to code to that scale, and didn't really have a grasp of the costs of supporting at that scale. For various reasons, my employer ended up taking over the code base from them, with legal agreements that we'd only use it for this purpose, never sell it anywhere else, etc. The code base was _awful_. I thought the things I could see (queries) were bad. The code base was worse. Thankfully that was someone else's problem to work on :D


"he mentioned his laptop was 1,000x faster than Sherlock to list this directory’s contents"

"that’s a 40x speedup right there"

So... what about... the rest?


Good point! But sometimes a large-scale parallel filesystem can't beat a local SSD on certain patterns.

Also, we didn't see any timings for that 1,000x claim, so let's say it's an estimate. :)


It's a great article, thanks.

But .. do people actually use `ls` to list so many files? I mean, it'd scroll off the terminal anyway.


So, I need to call out one thing. For the people mentioning `ls | grep` and the like, in auto mode `ls` disables colors when it sees it is not running in a terminal. So using it in a shell script (or similar construction) doesn’t have this issue. Otherwise your loop variable would be full of weird ANSI escape color sequences!
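A minimal sketch of the same kind of check, as a script might do it. The function name `stdout_kind` is made up for illustration; `[ -t 1 ]` is the standard test for "file descriptor 1 is a terminal", which is presumably close to what `ls` does internally for `--color=auto`.

```shell
# Report whether standard output is attached to a terminal.
# stdout_kind is a hypothetical helper name, not part of any tool.
stdout_kind() {
  if [ -t 1 ]; then
    echo "terminal"
  else
    echo "pipe-or-file"
  fi
}
```

Run directly in an interactive shell it should print `terminal`; inside `$(...)` or piped through `cat`, stdout is a pipe, so it prints `pipe-or-file`.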


Yup, some people do. Maybe they don’t expect so many files. Maybe they are gonna scroll around. But they do.


I admit to being guilty of the odd

  for i in $(ls)
  do
    Something dumb
  done


It clearly depends on what good that subshell running ls does for you, but the trivial form would be:

  for i in *txt
  do
    ...
  done
The reason to stuff every file name through ls (keep in mind that it's always your shell doing glob expansions, ls is not involved) would be to sort them by creation date or some other processing that ls can do. But as soon as there are many files involved, or those sorting arguments get non-trivial, it will fall apart completely.

Use find | while read instead, or xargs. That's going to be easier on the eyes and actually work.
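For what it's worth, a rough sketch of the `find | while read` form. The function name `txt_sizes` and the choice of printing byte counts are just for illustration. It copes with spaces in names, but not with newlines in names; for that you would want something like `-print0` with `xargs -0`, where available.

```shell
# Print "<size-in-bytes> <name>" for every *.txt file under a directory.
# txt_sizes is a hypothetical helper; it recurses into subdirectories.
txt_sizes() {
  find "${1:-.}" -type f -name '*.txt' | while IFS= read -r f; do
    # IFS= and -r keep leading spaces and backslashes in the name intact.
    size=$(wc -c < "$f" | tr -d ' ')
    printf '%s %s\n' "$size" "${f##*/}"
  done
}
```

Output order follows whatever order `find` returns, so pipe through `sort` if the order matters.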


  for i in *; do
    ...;
  done
works even better as it can handle files with whitespace. I usually disable file globbing but enable it specifically for code like this.

EDIT: Actually, one issue with the above is that if there aren't any matching files then you get a literal "*". But I'd rather deal with that than break when filenames have embedded whitespace.
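One way to deal with the literal "*" in plain POSIX sh is to skip anything that doesn't actually exist. This is only a sketch; `count_entries` is a made-up name. (In bash, `shopt -s nullglob` is another option.)

```shell
# Count the non-hidden entries in a directory using a glob.
# count_entries is a hypothetical helper name.
count_entries() {
  n=0
  for f in "$1"/*; do
    # If nothing matched, $f is the unexpanded pattern and does not exist.
    # The -L check keeps dangling symlinks, for which -e is false.
    [ -e "$f" ] || [ -L "$f" ] || continue
    n=$((n + 1))
  done
  printf '%s\n' "$n"
}
```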


Out of interest, why not xargs or find's -exec?


Not GP, but I often find myself writing multiple statements (which is awkward with find), or using syntax that find struggles with.


It is like you read my script.sh over my shoulder as I was writing it. :)


I sometimes ls | grep.


I assume they were speaking colloquially.


The irony here is that instead of just unsetting the alias that is proxying `ls` to `ls --color=auto` the solution was to leave the coloring in place but try and hack the override env var for defeating that behavior.

ie, `var foo = 1 + 1 - 1` instead of `var foo = 1`

Wouldn't it have been more straightforward to open one's .bashrc/.bash_profile and remove the alias?


Sure, but they WANTED colors. They just wanted them to be fast.


Precisely. We want colors. We're refined people.


The article mentioned that they knew this, but wanted to continue supporting as much color as possible.


The ideal solution would be to have something that disables coloring if there's more than X files. Someone with more linux-fu can probably come up with such a macro.
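A rough sketch of such a wrapper, assuming GNU `ls` (for `--color` and `-f`). The names `color_mode` and `lsx` and the default threshold of 10000 are invented here. Note that it reads the directory twice; the first pass with `ls -f` is unsorted and uncolored, so it should avoid the per-file stat calls, but that is an assumption worth checking with strace on your system.

```shell
# Print "never" if the directory has more than LIMIT entries, else "auto".
# color_mode and lsx are hypothetical helper names.
color_mode() {
  dir=${1:-.}
  limit=${2:-10000}
  # ls -f lists entries unsorted (including . and ..), one per line in a pipe.
  n=$(( $(command ls -f -- "$dir" | wc -l) ))
  if [ "$n" -gt "$limit" ]; then
    echo never
  else
    echo auto
  fi
}

# List one directory, dropping colors when it is very large.
lsx() {
  dir=${1:-.}
  command ls --color="$(color_mode "$dir")" -- "$dir"
}
```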


How is that more ideal than having coloring even on more than X files but still being fast?


I was once flown out to look at a significant performance problem. The customer could not do a day's worth of work in a day. We’d told them to buy a fully loaded Sun E15k, and it couldn’t keep up.

I instrumented everything (recompiled everything so that the binaries spat out gprof data, iostat output, vmstat output, sybase monitoring) and it quickly became obvious that the DB driven config was configured in a way we’d not anticipated. The fix was easy (just use a btree in a couple of places instead of a linear array walk), but then we had a huge problem.

With the fix, the CPU bottleneck became a 15% CPU utilisation. The system had been sized based upon our advice. Whoops. What happened next was political and I won’t go into it. Deliberately vague on the industry and customer because of the 7 figure cost involved!


So before diagnosing the problem you made the customer spend huge amounts of money based on a guess. Not the least bit surprised.


I really enjoyed the writing style. This was fun to read. Thanks!


I feel the opposite, I'd rather they get to the point in the first paragraph then follow it up with casual examples of how they arrived at the solution.


True, that could have used a TL;DR, agreed.


But then people would miss the rainbows!


The tl;dr here is that "ls" can be much faster if you disable colorizing files based on their file capabilities, setuid/setgid bits, or executable flag.

    LS_COLORS='ex=00:su=00:sg=00:ca=00:'


Does that leave any colorizing? I guess dirent still has d_type information, which should allow coloring based on type (directory, block device, symlink, socket, etc). (This information is also contained in stat's st_mode.)


Yes, it leaves all of the default coloring, except for the bits that require additional syscalls, that is: executable permission, setuid/setgid bits, and file capabilities.
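If you already have a customized LS_COLORS, a sketch like the following appends the four overrides instead of replacing the whole variable. The function name `fast_ls_colors` is made up, and it relies on the assumption that GNU ls lets a later entry for the same key override an earlier one; check that against your coreutils version.

```shell
# Print the current LS_COLORS with the four "expensive" types neutralized.
# fast_ls_colors is a hypothetical helper name.
fast_ls_colors() {
  overrides='ex=00:su=00:sg=00:ca=00:'
  if [ -n "${LS_COLORS:-}" ]; then
    # Strip one trailing colon so the result has no empty element.
    printf '%s:%s\n' "${LS_COLORS%:}" "$overrides"
  else
    printf '%s\n' "$overrides"
  fi
}
```

Usage would be something like `LS_COLORS=$(fast_ls_colors); export LS_COLORS` in a shell init file.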


And if you don't want angry fruit salad colors in your cool hipster retro-Matrix semitransparent tiling-window desktop theme:

  unset LS_COLORS
  alias l='/bin/ls -a'
  alias ll='/bin/ls -alF'


I’d argue that for the vast majority of manual usages of ls, the maybe second or so saved listing directory contents by removing colors will be dwarfed by the additional time it takes the human to parse its contents, without the additional context provided by those colors.


I'd agree with you: judicious and consistent use of color should give better HCI performance than the colorless method, for the majority of people.

But it's very visual and aesthetic, and maybe it's best considered a user preference, which is why I gave a tongue-in-cheek rationale when I mentioned the option.


Good points. And for those wondering, "HCI" means Human Computer Interaction (had to check).


Thanks. HCI (and human factors engineering) were the original areas of study concerned with these questions. UX, which is more popular at the moment, seems to have more emphasis on visual appeal and marketing psychology, rather than on effectiveness for the user's goals. You could see "dark patterns" as an extreme of this shift in intent.


> unset LS_COLORS

Quoting from the article, "It turns out that when the LS_COLORS environment variable is not defined, or when just one of its <type>=color: elements is not there, it defaults to its embedded database and uses colors anyway."


From looking at a version of the likely C source (`coreutils` source, as packaged by Debian), it looks like they're aggressively trying to use color. In addition to `LS_COLORS`, they also look at `COLORTERM`. There are also C variables to chase through the source; `print_with_color` seems the most likely one to focus on.

In addition to that code, the shell init scripts (both system-wide, and user-specific default ones that are copied in) might do things like checking whether a variable is set, and setting it, if not (or simply overriding, regardless).


Literally undoing the improvement to make it non-colorful?


I thought this was about setting NODE_ENV=production, and from reading the other comments, others know different ways. It seems there are many ways to make a system an order of magnitude faster with a small change backed by deep work/knowledge.


It may be worth mentioning that this sort of thing isn't just a problem for whoever is running it, because it may have a bad effect on other users if you hammer the metadata server(s). It's necessary to have monitoring which can identify the jobs responsible and take action. The Lustre system here was clobbered for several days because the admins had no means of doing that, despite all the facilities provided for and by Lustre.


I suspect something like "ls | cat" would speed it up as well. Probably tests if stdout is a tty before bothering with stat() and color logic.


True, but then you would lose all the coloring. By targeting only the attributes that generate additional syscalls, you can keep the vast majority of the colors and get a nice speed bump. Win-win.


I was expecting some new flag about getdents bigger buffer size.

http://be-n.com/spw/you-can-list-a-million-files-in-a-direct...


An often-neglected related trick is the "dir_index" feature of ext3/ext4


The author removed the coloring, because the default setup is for ls to list files with colors. He uses an environment variable to do that.

Wouldn't it be simpler to use \ls instead of ls? This way you get the "basic version of the executable", not some alias. I always use \command instead of command in my shell scripts, because you never know which aliases a user set up.


But aliases are not supposed to be available in subprocesses/scripts, unless you use shopt -s expand_aliases. Even if they were, you cannot ask all your users to rewrite their scripts to use \ls instead of ls.
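For scripts that want to be defensive anyway, `command ls` is a possible alternative to `\ls`: the backslash only suppresses alias expansion, while `command` also skips shell functions. A small sketch, where the `ls` function is a deliberately fake stand-in for whatever wrapper a user might have defined, and `real_ls` is a made-up name:

```shell
# Stand-in for a user's wrapper: a shell function that shadows ls.
ls() {
  echo "wrapped ls called"
}

# command bypasses shell functions (and aliases) and runs the real utility.
real_ls() {
  command ls "$@"
}
```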


The conclusion seems unsatisfactory. Why is lstat so much slower on their system than on the laptop?


Compute clusters generally use distributed file systems, over the network...


Exactly that. We have 2000+ users, and 1000+ compute nodes. Each compute node does have a certain amount of local SSD (`$L_SCRATCH`) for job-local storage, but for everything else we have to use some sort of network-accessed file system. For us, that is NFSv4 for longer-lasting data (`$HOME` and `$GROUP_HOME`) to the Isilon over Ethernet, and short-term data (`$SCRATCH` and `$GROUP_SCRATCH`) to Lustre over Infiniband.

(Yes, Infiniband! Hello, Mellanox!)


tl;dr version

LS_COLORS='ex=00:su=00:sg=00:ca=00:'


Wow. I missed this for decades. So much of life wasted on colour.


Why are people so lazy at naming? https://en.m.wikipedia.org/wiki/Sherlock_(software)

What ever happened to meaningful names that described the particular, not the meme?


like GNU?


I refuse to defend any gnu or unix naming conventions.



