Surprisingly Slow (gregoryszorc.com)
559 points by dochtman on April 8, 2021 | 148 comments

> Historically, the Windows Command Prompt and the built-in Terminal.app on macOS were very slow at handling tons of output.

A very old trick I remember on Windows, is to minimize command prompts if a lot of output was expected and would otherwise slow down the process. I don't know if it turned write operations into a no-op, or bypassed some slow GDI functions, but it had an extremely noticeable difference in performance.

IIRC the font rendering in Windows was surprisingly slow and even somewhat unsafe. In one case, displaying lots of different webfonts in MSIE broke the whole text rendering stack, and all text across the entire system disappeared. I wouldn't be surprised if this is a root cause of slow command prompts.

This sounds a lot like GDI resource exhaustion. It looks like the limits on handle IDs and the GDI heap size are still in place even in Windows 10.

Well that explains the crazy slow performance on an old windows forms app I had written back when I was a junior developer. I'll try to see if I can reach anyone from my first employer and make them disable the debug output. Would be an interesting contribution, considering that I left more than 10 years ago :)

Old old Windows Forms had its own text rendering (and still has), but most controls now have a second code path that uses the system's text renderer, which got updated with better shaping, more scripts, etc., while GDI+ basically never got any updates. You can see that when there's a call to Application.SetCompatibleTextRenderingDefault(false) in the code somewhere; then it's using GDI instead of GDI+.

I strongly don’t think that throughput is what terminal emulators should optimise for: basically no one cares about how quickly you can get a load of text you will ignore. Instead, kill the command and modify it to reduce output.

I think the right thing to optimise for is input latency.

I strongly disagree! It shouldn’t be that hard to make this fast, and it’s a very common source of slowdown in builds, so why not just make it fast?

As to why you shouldn’t just silence all that noisy spam output, you never know when you might need it to diagnose a weird build error.

Sure, it would be great if the build system always produced nice concise output with exactly the error messages you need, but that’s not always realistic. A big bunch of spam that you can dig into if you need to is almost as good -- as long as it doesn’t slow you down.

Edit to add: I guess another approach is to redirect spammy output to a file instead of the terminal. But then I have to know in advance which commands are going to be noisy, and it introduces another workflow (maybe the error message you need is right at the end of the output, but now it’s tucked away in a file instead of right there in your terminal).

Just make the terminal fast at handling large output. No reason that should make it harder to have good input latency too.

One thing to consider is that ssh is much more likely to be throttling than the terminal emulator, so having a really fast terminal emulator won’t really fix problems with programs that output too much. I’m also not saying that throughput should be totally ignored, just that it shouldn’t be the metric used to benchmark and optimise one’s terminal emulator.

Input latency is the problem for many terminals when a command accidentally generates megabytes of output text - the console locks up and won't respond to input until it has rendered and scrolled through every single line of output.

Surprisingly few terminals, when suddenly faced with a million new lines of text, decide "hey, let's just draw the last screen of output, I don't need to draw each and every line and then move the existing text upwards by a row"

There's a good reason terminals can't just skip ahead. Even if we ignore ANSI/vt100/etc. escape sequences, "plain text" is not so plain.

Some applications output a few lines, then many MBs of "\rcurrent progress: xx.xxx%" that constantly overwrites itself.

Rendering is just not linear.

tmux will skip drawing text if it gets enough of it; it still maintains state, but it waits to render it.
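A minimal sketch of that "keep the state, defer the drawing" idea, assuming a toy terminal that understands only newline and carriage return (no escape sequences at all). `ToyTerminal`, `feed` and `visible_lines` are made-up names for illustration, not how tmux is actually implemented:

```python
from collections import deque


class ToyTerminal:
    """Toy model: handles only newline and carriage return, keeps `rows` lines."""

    def __init__(self, rows: int = 3) -> None:
        self.lines = deque([""], maxlen=rows)  # only one screen of history
        self.col = 0                            # cursor column on the last line

    def feed(self, text: str) -> None:
        # Every character updates the state; nothing is drawn here.
        for ch in text:
            if ch == "\n":
                self.lines.append("")
                self.col = 0
            elif ch == "\r":
                self.col = 0  # later characters overwrite the current line
            else:
                line = self.lines[-1]
                self.lines[-1] = line[: self.col] + ch + line[self.col + 1:]
                self.col += 1

    def visible_lines(self) -> list:
        # Rendering happens once, after the burst of output has been consumed.
        return list(self.lines)


term = ToyTerminal(rows=3)
term.feed("start\n")
for pct in range(101):
    term.feed(f"\rprogress: {pct:3d}%")
term.feed("\ndone")
print(term.visible_lines())
```

All 101 progress updates are processed, so the final state is right, but only the last screenful would ever need to be painted.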

>I strongly don’t think that throughput is what terminal emulators should optimise for: basically no one cares about how quickly you can get a load of text you will ignore. Instead

They would care if they knew it would also block their program doing the output...

Ssh also has this problem to a greater extent and is very commonly used

One can enable compression by default for all ssh connections by adding

  Compression yes
to ~/.ssh/config

It can pipe many megabytes of text output per second on relatively slow channels (`ssh -vv` prints the total data transfer when you close the connection), as long as the data isn't already encrypted or otherwise incompressible (which is pretty rare for text output).

Scrolling very large files in vim becomes much faster, for example.
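A rough way to see why text output benefits so much, using Python's `zlib` as a stand-in for the compression on the wire. The log lines are made up, and random bytes play the role of encrypted or already-compressed data:

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


# Made-up build-log style text: repetitive, like much terminal output.
text = "".join(
    f"[{i:05d}] compiling src/module_{i % 40}.c ... ok\n" for i in range(5000)
).encode()

# Random bytes stand in for encrypted or already-compressed data.
noise = os.urandom(len(text))

print(f"text:  {compression_ratio(text):.2f}")
print(f"noise: {compression_ratio(noise):.2f}")
```

The text shrinks to a fraction of its size, while the random data stays about as large as it was.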

This might be useful for some: https://mosh.org/

It is great for unstable / slow links and pressing ctrl+c works immediately which is nice.

Maybe not something to optimize for, but still something to optimize when you can do it without sacrificing speed or functionality in other areas.

> bypassed some slow GDI functions

I think it is mainly just the scroll operation. You can visually see the process speeding up as you reduce the height of the console window.

Typically I reduce it to a couple of lines, then it goes several times faster yet I can keep an eye on it.

As I recall, command line scrolling output was also faster if you minimized the window. You just needed an alert of some sort when the thing was done, like a bell or audio file.

There was a period of time where I earned brownie points by flushing all of the defunct debugging output from old bugs that nobody removed, typically for a 2x improvement in app performance. All because of screen scroll bottlenecks.

> CPUs have somewhat plateaued in their single core performance in the past decade

In fact for many cases single core performance has dropped at a given relative price-point. Look at renting inexpensive (not bleeding edge) bare-metal servers: the brand new boxes often have a little less single-core performance than units a few years old, but have two, three, or four times the number of cores at a similar inflation/other adjusted cost.

For most server workloads, at least where there is more than near-zero concurrency, adding more cores is far more effective than trying to make a single core go faster (up to a point - there are diminishing returns when shoving more and more cores into one machine, even for embarrassingly parallel workloads, due to other bottlenecks, unless using specialist kit for your task).

It can be more power efficient too despite all the extra silicon - one of the reasons for the slight drop (rather than a plateau) in single core oomph is that a small drop in core speed (or a reduction in the complexity via pipeline depth and other features) can create a significant reduction in power consumption. Once you take into account modern CPUs being able to properly idle unused cores (so they aren't consuming more than a trickle of energy unless actively doing something) it becomes a bit of a no-brainer in many data-centre environments. There are exceptions to every rule of course - the power dynamic flips if you are running every core at full capacity most or all of the time (i.e. crypto mining).

Yes, yes, this is why I haven't bothered to upgrade my 2500K, although it is actually time now, since games apparently learnt how to use more than 1 core. I always went to some benchmark every year and saw single-core performance barely moving upwards.

I replaced my 2500K a couple years ago using the cheapest AMD components I could find. The main improvements were mostly in the chipset/motherboard:

- The PCIe 2.0 lanes on the old CPU were throttling my NVMe drive to 1GB/sec transfer rates.

- USB3 compatibility and USB power delivery were vastly more reliable. My old 2500K ASUS motherboard couldn't power a Lenovo VR headset for example, and plugging too many things into my USB hub would cause device dropouts.

- Some improvement in either DDR4 memory bandwidth or latency fixed occasional loading stalls I'd see in games when transitioning to new areas. Even with the same GPU, before the upgrade games would lock up for about half a second sometimes and then go back to running in 60fps.

2500k is borderline of the sweet spot, but something like a 4790k, 5775c or 6700k can hold up 7 years later.

That said, the very latest processors (AMD 5000 series, M1 apple silicon) are starting to make real gains in single threaded speed

Similar here. My main home machine's CPU held out for years more than previous ones had. I didn't do much by way of heavy dev/test/DB work in that period[†] so games were the only big processing it did[‡], and they only used a couple of cores properly or were bottlenecked at the GPU.

I upgraded early last year because I was doing a bunch of video transcoding, something where going from 4 to 16 cores really helps, and had finally started to notice it bogging down more than a little elsewhere. There was a per-core performance bump too in this case, that R7 2700/x was excellent value for money at the time. Also there was a goodly increase in memory bandwidth with the new kit, to keep those cores' caches full of things to be getting on with, but again that wasn't a massive bottleneck for my other uses up to that point.

[†] which I had previously, but personal dev has dropped off significantly since developing out-door habits (when you properly get into running it can be very time-consuming!) and day-job work is usually done via VPN+RDC when I do it at home.

[‡] and even then I wasn't spending time & money on the bleeding edge, though I did upgrade to a 1060/6GB for those that were demanding more than my old GPU could give)

How do you people survive? I have so much stuff running on my computer (Intel 6900K) right now that I can see some lag when selecting text on HN.

>"Closing File Handles on Windows

Many years ago I was profiling Mercurial to help improve the working directory checkout speed on Windows, as users were observing that checkout times on Windows were much slower than on Linux, even on the same machine.

I thought I could chalk this up to NTFS versus Linux filesystems or general kernel/OS level efficiency differences. What I actually learned was much more surprising.

When I started profiling Mercurial on Windows, I observed that most I/O APIs were completing in a few dozen microseconds, maybe a single millisecond or two every now and then. Windows/NTFS performance seemed great!

Except for CloseHandle(). These calls were often taking 1-10+ milliseconds to complete. It seemed odd to me that file writes - even sustained file writes that were sufficient to blow past any write buffering capacity - were fast but closes slow. It was even more perplexing that CloseHandle() was slow even if you were using completion ports (i.e. async I/O). This behavior for completion ports was counter to what the MSDN documentation said should happen (the function should return immediately and its status can be retrieved later).

While I didn't realize it at the time, the cause for this was/is Windows Defender. Windows Defender (and other anti-virus / scanning software) typically work on Windows by installing what's called a filesystem filter driver. This is a kernel driver that essentially hooks itself into the kernel and receives callbacks on I/O and filesystem events. It turns out the close file callback triggers scanning of written data. And this scanning appears to occur synchronously, blocking CloseHandle() from returning. This adds milliseconds of overhead."
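For anyone who wants to check this on their own machine, here is a rough sketch that times writes and closes separately over a batch of small files. It assumes nothing about which scanner is installed; `time_write_and_close` is a made-up helper, and the numbers will vary a lot between systems:

```python
import os
import tempfile
import time


def time_write_and_close(count: int = 200, size: int = 4096):
    """Return total seconds spent in write() and in close() over `count` files."""
    payload = b"x" * size
    # O_BINARY only exists on Windows; fall back to 0 elsewhere.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    write_s = close_s = 0.0
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(count):
            fd = os.open(os.path.join(tmp, f"f{i}.bin"), flags)
            t0 = time.perf_counter()
            os.write(fd, payload)
            t1 = time.perf_counter()
            os.close(fd)  # the step the quoted text found to be slow on Windows
            t2 = time.perf_counter()
            write_s += t1 - t0
            close_s += t2 - t1
    return write_s, close_s


write_s, close_s = time_write_and_close()
print(f"write: {write_s * 1e3:.1f} ms, close: {close_s * 1e3:.1f} ms")
```

If the close total dwarfs the write total, on-close scanning is a plausible suspect, though this measurement alone does not prove it.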

PDS: Observation: In an OS, if I/O (or more generally, API calls) are initially written to run and return quickly -- this doesn't mean that they won't degrade (for whatever reason), as the OS expands and/or underlying hardware changes, over time...

For any OS writer, present or future, a key aspect of OS development is writing I/O (and API) performance tests, running them regularly, and immediately halting development to understand/fix the root cause -- if and when performance anomalies are detected... in large software systems, in large codebases, it's usually much harder to gain back performance several versions after performance has been lost (i.e., Browsers), than to be disciplined, constantly test performance, and halt development (and understand/fix the root cause) the instant any performance anomaly is detected...

Related, If you copy a file via the OS's copy function the system knows the file was scanned and you get fast copies. If you copy the file by opening a new destination file for write, opening the source file for read, and copying bytes, then of course you trigger the virus scanner.

So for example I was using a build system and part of my build needed to copy ~5000 files of assets to the "out" folder. It was taking 5 seconds on other OSes and 2 minutes on Windows. Turned out the build system was copying using the "make a new file and copy bytes" approach instead of calling their language's library copy function, which, at least on Windows, calls the OS copyfile function. I filed a bug and submitted a PR. Unfortunately while they acknowledged the issue they did not take the PR nor fix it on their side. My guess is they don't really care about devs that use Windows.

Note that python's copyfile does this wrong on MacOS. It also uses the open, read bytes, write bytes to new file method instead of calling into the OS. While it doesn't have the virus scanning issue (yet) it does mean files aren't actually "copied" so metadata is lost.

> Note that python's copyfile does this wrong on MacOS. It also uses the open, read bytes, write bytes to new file method instead of calling into the OS.

It doesn't, since 3.8. It tries fcopyfile() and only if it fails, does the read/write dance.

See: https://github.com/python/cpython/blob/master/Lib/shutil.py#...

I tested in 3.8, didn't seem to work


That's a different thing; the copied data includes only the file data itself, not metadata. From the documentation:

- on shutil.copyfile:

> Copy the contents (no metadata) of the file named src to a file named dst and return dst in the most efficient way possible. src and dst are path-like objects or path names given as strings.

- on "Platform-dependent efficient copy operations":

> On macOS fcopyfile is used to copy the file content (not metadata).

On top of the shutil module:

> Warning

> Even the higher-level file copying functions (shutil.copy(), shutil.copy2()) cannot copy all file metadata.

> On POSIX platforms, this means that file owner and group are lost as well as ACLs. On Mac OS, the resource fork and other metadata are not used. This means that resources will be lost and file type and creator codes will not be correct. On Windows, file owners, ACLs and alternate data streams are not copied.
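The contents-versus-metadata distinction in those quotes can be seen with a small experiment. This sketch uses the modification time as the one piece of metadata to check; the file names and the timestamp are arbitrary:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.txt")
    with open(src, "w") as f:
        f.write("asset data\n")
    old = 1_000_000_000  # arbitrary timestamp from 2001
    os.utime(src, (old, old))

    # copyfile copies contents only; copy2 also tries to copy stat metadata.
    plain = shutil.copyfile(src, os.path.join(tmp, "plain.txt"))
    full = shutil.copy2(src, os.path.join(tmp, "full.txt"))

    plain_mtime = int(os.stat(plain).st_mtime)
    full_mtime = int(os.stat(full).st_mtime)

print("copyfile kept mtime:", plain_mtime == old)
print("copy2 kept mtime:   ", full_mtime == old)
```

As the warning quoted above says, even `copy2` does not carry over everything (owners, ACLs, resource forks), so this only demonstrates the timestamp part.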

> "For any OS writer, present or future, a key aspect of OS development is writing I/O (and API) performance tests, running them regularly, and immediately halting development to understand/fix the root cause -- if and when performance anomalies are detected... in large software systems, in large codebases, it's usually much harder to gain back performance several versions after performance has been lost (i.e., Browsers), than to be disciplined, constantly test performance, and halt development (and understand/fix the root cause) the instant any performance anomaly is detected..."

Yes, this! And not just OS writers, but authors of any kind of software. Performance is like a living thing; vigilance is required.

I've had the displeasure of using machines with McAfee software that installed a filesystem driver. It made the machine completely unusable for development and I'm shocked Microsoft thought making that the default configuration was reasonable.

Copying or moving a folder that contained a .git folder resulted in a very large number of small files being created. To this day, I'm not sure if it was the antivirus, the backup software, or Windows' built-in indexing, but the computer would become unusable for about 5 minutes whenever I would move folders around. It was always Windows Explorer and System Interrupts taking up a huge amount of CPU, and holy cow was it annoying.

Even worse than that, moving a lot of small files in WSL reliably BSODs my work machine due to some sort of interaction with the mandated antivirus on-access scanning, making WSL totally unusable for development on that machine.

Good talk about debugging i/o in rustup:


Perhaps disable Windows Defender for the database (or whatever) folder

I'll throw in "hidden network dependencies / name resolution"; it's amazing how things break nowadays when there's no net.

For years I thought sudo just had to take seconds to startup. Then one day I stumbled across the fact that this is caused by a missing entry in /etc/hosts. I still don't understand why this is necessary.
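A quick way to test for that kind of hidden lookup, assuming the delay comes from resolving the machine's own hostname (which is what a missing /etc/hosts entry would affect). `time_local_hostname_lookup` is a made-up helper:

```python
import socket
import time


def time_local_hostname_lookup():
    """Return (hostname, seconds, resolved address or None)."""
    name = socket.gethostname()
    start = time.perf_counter()
    try:
        addr = socket.gethostbyname(name)
    except OSError:
        addr = None  # lookup failed, which is itself a hint
    return name, time.perf_counter() - start, addr


name, seconds, addr = time_local_hostname_lookup()
print(f"{name!r} -> {addr} in {seconds * 1e3:.1f} ms")
```

If this takes seconds rather than a millisecond or so, the resolver is probably going out to the network for a name that could have been answered locally.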


Number one rule of distributed systems, "the network is not reliable"

Also the number one rule of Comcast.

I have actually used this to test remote pair-programming tools, as remote pairing really hinges on latency.

If we have a pesky test suite with failing latency tests https://github.com/auchenberg/volkswagen

Second rule, trust no one.

I'd add SaaS dependencies as well, whether it be slowness or downtime

This is a solved problem tho, timeout / retry / circuit breakers / fallback etc.

See - https://github.com/resilience4j/resilience4j
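For readers not on the JVM, here is a minimal sketch of the circuit breaker plus fallback pattern in Python. It is not resilience4j's API; `CircuitBreaker`, `CircuitOpen` and `flaky_service` are names invented for this example:

```python
import time


class CircuitOpen(Exception):
    """Raised when calls are skipped because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback=None):
        # While open, skip the dependency entirely until the cool-down passes.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                if fallback is not None:
                    return fallback()
                raise CircuitOpen("dependency marked as down")
            self.opened_at = None  # cool-down over: allow a trial call
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            if fallback is not None:
                return fallback()
            raise
        self.failures = 0
        return result


calls = []


def flaky_service():
    calls.append(1)
    raise TimeoutError("SaaS dependency timed out")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
results = [breaker.call(flaky_service, fallback=lambda: "cached") for _ in range(5)]
print(results, len(calls))
```

After two failures the breaker opens and the remaining calls never touch the service. The breaker only bounds how much a failure costs; it cannot produce a real answer unless a meaningful fallback exists.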

I wouldn't call it solved if a downstream SaaS is down and my build still times out despite the aforementioned resiliency.

The python overhead is something I've noticed as well in a system that runs a lot of python scripts. Especially with a few more modules imported, the interpreter and module loading overhead can be quite significant for short running scripts.

Numpy was particularly slow during imports, but I didn't see an easy way to fix this apart from removing it entirely. My impression was that it does a significant amount of work on module loading, without a way around it.
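To put a number on that overhead, one option is to time a fresh interpreter with and without the imports. This sketch uses a few standard-library modules as stand-ins, since numpy may not be installed; `startup_seconds` is a made-up helper:

```python
import subprocess
import sys
import time


def startup_seconds(code: str = "pass", runs: int = 5) -> float:
    """Best-of-`runs` wall time to start an interpreter and run `code`."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", code], check=True)
        best = min(best, time.perf_counter() - start)
    return best


bare = startup_seconds("pass")
with_imports = startup_seconds("import json, argparse, decimal")
print(f"bare: {bare * 1e3:.0f} ms, with imports: {with_imports * 1e3:.0f} ms")
```

For a per-module breakdown, running a script with `python -X importtime` prints how long each import took.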

I think the other side of "surprisingly slow" is that computers are generally very fast, and the things we tend to think of as the "real" work can often be faster than this kind of stuff that we don't think about that much.

I see this a lot with Ansible. It's not particularly slow, but running it places a bigger burden on laptop CPU and fans than I'd imagined.

Noticed the same too. It is likely that we are both impacted by the very aggressive default of 1ms for `internal_poll_interval`: https://github.com/ceph/ceph-cm-ansible/pull/308

Huh, polling with a sleep() rather than a proper event wait seems like a bad code smell there...

How so? This is almost certainly what the internals of a packaged event wait would look like.

It could use a Queue of some kind rather than just pushing onto a deque between threads [1]? Then it would idly wait for the results_thread_main to push results.

I guess that might be what Mitogen does (along with other improvements), a faster Ansible strategy [2].

[1] https://github.com/ansible/ansible/blob/becf9416736dc911d341...

[2] https://github.com/mitogen-hq/mitogen/blob/master/docs/ansib...
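A minimal sketch of the blocking-queue alternative, not taken from Ansible or Mitogen: the consumer sleeps inside `Queue.get()` until the producer hands over a result, so there is no poll interval to tune. `results_thread` and `collect` are invented names:

```python
import queue
import threading

results = queue.Queue()
DONE = object()  # sentinel marking the end of the stream


def results_thread(n: int) -> None:
    """Stand-in for a worker thread that produces task results."""
    for i in range(n):
        results.put(f"result-{i}")
    results.put(DONE)


def collect() -> list:
    collected = []
    while True:
        item = results.get()  # blocks without spinning until an item arrives
        if item is DONE:
            return collected
        collected.append(item)


worker = threading.Thread(target=results_thread, args=(3,))
worker.start()
collected = collect()
worker.join()
print(collected)
```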

Thanks for the heads up, i'll give that a go :)

Windows' slow thread spawn time is incredibly noticeable when you use Magit in Emacs.

It runs a bunch of separate git commands to populate a detailed buffer. It's instantaneous on MacOS, but I have to sit and stare on Windows

Do you mean *process* spawn time?

From the article:

> On Windows, assume a new process will take 10-30ms to spawn. On Linux, new processes (often via fork() + exec()) will take single digit milliseconds to spawn, if that.

> However, thread creation on Windows is very fast (~dozens of microseconds).

Yes, they clearly mean process spawn time:

> It runs a bunch of separate git commands
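The thread-versus-process gap is easy to measure locally. This is a rough sketch, not a careful benchmark: the process figure includes Python interpreter startup, so it overstates bare process creation, and all the helper names are made up:

```python
import subprocess
import sys
import threading
import time


def average_seconds(action, runs: int = 20) -> float:
    """Average wall time of calling `action` `runs` times."""
    start = time.perf_counter()
    for _ in range(runs):
        action()
    return (time.perf_counter() - start) / runs


def spawn_thread() -> None:
    t = threading.Thread(target=lambda: None)
    t.start()
    t.join()


def spawn_process() -> None:
    # Includes interpreter startup, so this overstates bare process creation.
    subprocess.run([sys.executable, "-c", "pass"], check=True)


thread_s = average_seconds(spawn_thread)
process_s = average_seconds(spawn_process, runs=5)
print(f"thread: {thread_s * 1e6:.0f} us, process: {process_s * 1e3:.1f} ms")
```

A tool like Magit that shells out to git a dozen times per refresh pays the process cost each time.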

One of many reasons why I prefer to run Emacs under WSL1 when on Windows. WSL1 has faster process start times.

But then with git, there are other challenges. It took me a while to make Magit usable on our codebase (that for various reasons needs to be on the Windows side of the filesystem) - the main culprits were submodules, and someone's bright recommendation to configure git to query submodules when running git status.

Here are the things I did to get Magit status on our large codebase to show in a reasonable time (around 1-2 seconds):

- git config --global core.preloadindex true # This should be defaulted to true, but sometimes might not be; it ensures git operations parallelize looking at index.

- git config --global gc.auto 256 # Reduce GC threshold; didn't do much in my case, but everyone recommends it in case of performance problems on Windows...

- git config status.submoduleSummary false # This did the trick! It significantly cut down time to show status output.

Unfortunately, it turned out that even with submoduleSummary=false, git status still checks if submodules are there, which impacts performance. On the command line, you can use --ignore-submodules argument to solve this, but for Magit, I didn't find an easy way to configure it (and didn't want to defadvice the function that builds the status buffer), so I ended up editing .git/config and adding "ignore = all" to every single submodule entry in that config.

With this, finally, I get around ~1s for Magit status (and about 0.5s for raw git status). It only gets longer if I issue a git command against the same repo from Windows side - git detects the index isn't correct for the platform, and rebuilds it, which takes several seconds.

Final note: if you want to check why Git is running slow on your end, set GIT_TRACE_PERFORMANCE to true before running your command[0], and you'll learn a lot. That's how I discovered submoduleSummary = false doesn't prevent git status from poking submodules.


[0] - https://git-scm.com/docs/git, ctrl+f GIT_TRACE_PERFORMANCE. Other values are 1, 2 (equivalent to true), or n, where n > 2, to output to a file descriptor instead of stderr.
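If you want to script the tracing step, something like the following sketch works. It only uses the `GIT_TRACE_PERFORMANCE` variable and the `--ignore-submodules` option mentioned above; `traced_status_command` is a made-up helper, and the run is skipped when git is not on the PATH:

```python
import os
import shutil
import subprocess


def traced_status_command(ignore_submodules: bool = True):
    """Return (argv, env) for a `git status` run with performance tracing on."""
    argv = ["git", "status"]
    if ignore_submodules:
        argv.append("--ignore-submodules=all")
    env = dict(os.environ, GIT_TRACE_PERFORMANCE="1")  # trace goes to stderr
    return argv, env


argv, env = traced_status_command()
if shutil.which("git"):
    # Runs in the current directory; outside a repository git just reports an error.
    proc = subprocess.run(argv, env=env, capture_output=True, text=True)
    print(proc.stderr)
```

Comparing the trace with and without `ignore_submodules` should show how much of the time goes into submodule checks.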

Wow that's very helpful. I'll give it a shot next time I'm at work

To be precise, are you saying WSL1 is faster compared to Windows, or compared to WSL2? With WSL2 (and the native-comp Emacs branch) I've never noticed any unusual slowdowns with Magit or anything else.

I haven't tried WSL1.

WSL1 process creation is faster compared to Windows, because part of the black magic it does to run Linux processes on NT kernel is using minimal processes - so called "pico processes"[0]. These are much leaner than standard Windows processes, and more suited for UNIX-style workflow.

I can't say if it's faster relative to WSL2, but I'd guess so. WSL2 is a full VM, after all.


[0] - https://docs.microsoft.com/en-us/archive/blogs/wsl/pico-proc...

It shouldn't actually be a noticeable difference. HW virtualization means that unless the guest is doing I/O or needs to be interrupted to yield to the host, the guest is kind of just doing its thing. Spawning a new user space process in a VM should, in theory, be basically the same speed as spawning a new user space process on the bare metal. How that compares to the WSL1 approach of pico processes I don't know, but Linux generally has a very optimized path for spawning a process that I would imagine is competitive.

Yeah, I hope this is one of the issues Microsoft address some time because although CreateProcess is a slightly nicer API in some regards the cost is very high. It may not be possible to fix it without removing backwards-compatibility, but maybe we could have a new "lite" API.

The bit about Windows Defender being hooked into every process is also infuriating. We pay a high price for malware existing even if we're never hit by it.

Yes. This makes me wonder if I could speed up our builds by 2x by whitelisting the source repository folder. If it's at all possible (and company policy allows for it)...

One thing that deeply frustrates me is that I simply don't know which things are slowed down by Defender. I can add my source repos to some "exclude folder" list deep in the Defender settings, but I've yet to figure out whether that actually does something, whether I'm doing it right, whether I should whitelist processes instead of folders or both, I have no idea.

If anyone here knows how to actually see which files Defender scans / slows down, then that would be awesome. Right now it's a black box and it feels like I'm doing it wrong, and it's easily the thing I dislike the most about developing on Windows.

Writing things that do a lot of forking, like using the multiprocessing or subprocess modules in Python, is basically unusable to my coworkers who use Windows.

Startup time for those processes goes from basically instant to 30+ seconds.

I researched this a little bit and it seems that it may be related to DEP.

It's basically just Windows: Back when the current Windows architecture was designed (OS/2 and Windows NT going forward--not Win9x) the primary purpose of any given PC was to run one application at a time. Sure, you could switch applications and that was well accounted for but the entire concept was that one application would always be in focus and pretty much everything related to process/memory/file system standpoint is based around this assumption.

Even for servers the concept was and is still just one (Windows) server per function. If you were running MSSQL on a Domain Controller this was considered bad form/you're doing something wrong.

The "big change" with the switch to the NT kernel in Windows 2000 was "proper" multi-user permissions/access controls but again, the assumption was that only one user would be using the PC at a time. Even if it was a server! Windows Terminal Server was special in a number of ways that I won't get into here but know that a lot of problems folks had with that product (and one of many reasons why it was never widely adopted) were due to the fact that it was basically just a hack on top of an architecture that wasn't made for that sort of thing.

Also, back then PC applications didn't have too many files and they tended to be much bigger than their Unix counterparts. Based on this assumption they built in hooks into the kernel that allow 3rd party applications to scan every file on use/close. This in itself was a hack of sorts to work around the problem of viruses which really only exist because Windows makes all files executable by default. Unfortunately by the time Microsoft realized their mistake it was too late to change it and would break (fundamental) backwards compatibility.

All this and more is the primary reason why file system and forking/new process performance is so bad on Windows. Everything that supposedly mitigates these problems (keeping one process open/using threads instead of forking, using OS copy utilities instead of copying files via your code, etc) are really just hacks to work around what is fundamentally a legacy/out-of-date OS architecture.

Don't get me wrong: Microsoft has kept the OS basically the same for nearly 30 years because it's super convenient for end users. It probably was a good business decision but I think we can all agree at this point that it has long since fallen behind the times when it comes to technical capabilities. Everything we do to make our apps work better on Windows these days are basically just workarounds and hacks and there doesn't appear to be anything coming down the pipe to change this.

My guess is that Microsoft has a secret new OS (written from scratch) that's super modern and efficient and they're just waiting for the market opportunity to finally ditch Windows and bring out that new thing. I doubt it'll ever happen though because for "new" stuff (where you have to write all your stuff from scratch all over again) everyone expects the OS to be free.

> Also, back then PC applications didn't have too many files and they tended to be much bigger than their Unix counterparts.

Okay, let me interrupt you right here. To this very day Linux has a default maximum number of file descriptors per process as 1024. And select(3), in fact, can't be persuaded to use FDs larger than 1023 without recompiling libc.

Now let's look at Windows XP Home Edition -- you can write a loop of "for (int i = 0; i < 1000000; i++) { char tmp[100]; sprintf(tmp, "%d", i); CreateFile(tmp, GENERIC_ALL, FILE_SHARE_READ, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL); }" and it will dutifully open a million of file handles in a single process (although it'll take quite some time) with no complaints at all. Also, on Windows, select(3) takes an arbitrary number of socket handles.

I dunno, but it looks to me like Windows was actually designed to handle applications that would work with lots of files simultaneously.

> fundamentally a legacy/out-of-date OS architecture

You probably wanted to write "badly designed OS architecture", because Linux (if you count it as continuation of UNIX) is actually an older OS architecture than Windows.

1024 is a soft limit you can change through ulimit.

The actual limit can be seen via 'sysctl fs.file-max'. On my stock install it's 13160005.
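The per-process soft and hard limits that ulimit manipulates can also be read and raised from inside a program (the system-wide fs.file-max is a separate number). A minimal sketch for Unix-like systems; `raised_soft_limit` and the 4096 target are arbitrary choices for illustration:

```python
import resource  # Unix only; this module does not exist on Windows


def raised_soft_limit(soft: int, hard: int, wanted: int = 4096) -> int:
    """Soft limit to request: at least `wanted`, but never above the hard limit."""
    if soft == resource.RLIM_INFINITY:
        return soft
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    return max(soft, wanted)


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft: {soft}, hard: {hard}")

try:
    # Raising the soft limit up to the hard limit needs no special privileges.
    resource.setrlimit(resource.RLIMIT_NOFILE, (raised_soft_limit(soft, hard), hard))
except (ValueError, OSError):
    pass  # some systems cap the value below the reported hard limit
print("soft now:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```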

> I doubt it'll ever happen though because for "new" stuff (where you have to write all your stuff from scratch all over again) everyone expects the OS to be free.

I think one way they could pull it off is to do a WSL2 with Windows - run the NT kernel as a VM on the new OS.

As for the price, I think they're already heading there. They already officially consider Windows to be a service - I'm guessing they're just not finished getting everyone properly addicted to the cloud. If they turn Windows into SaaS execution platform, they may just as well start giving it away for free.

>My guess is that Microsoft has a secret new OS (written from scratch) that's super modern and efficient and they're just waiting for the market opportunity to finally ditch Windows and bring out that new thing. I doubt it'll ever happen though because for "new" stuff (where you have to write all your stuff from scratch all over again) everyone expects the OS to be free.


More and more stuff gets offloaded onto the WSL for stuff which doesn't need interactive graphics or interoperability through the traditional windows IPC mechanisms.

In my experience, Magit is slow even on Linux. On my small repos at home, subjectively magit-status seems to take around 0.2-0.3 seconds. And that's just status, the most basic information you ask of git. Committing is several times slower. On a large codebase at work, magit-status usually takes around 10 seconds, sometimes longer. Again, I'm usually running it to just check some basic metadata (what branch I'm on, do I have a dirty tree, if yes, then what files are changed), so it's frustrating to wait. Honestly, I'd expect stuff like that to update effortlessly in real time without me issuing any commands. This is what happens in some other editors. However, currently I'm glued to Emacs because of Tramp for working remotely in a nice GUI and org-mode for time-tracking (TaskWarrior/TimeWarrior isn't for me).

I prefer Fork on Windows and Mac (prefer the Windows version for aesthetic reasons). Unfortunately, it's not available for Linux.

> Currently, many Linux distributions (including RHEL and Debian) have binary compatibility with the first x86_64 processor, the AMD K8, launched in 2003. [..] What this means is that by default, binaries provided by many Linux distributions won't contain instructions from modern Instruction Set Architectures (ISAs). No SSE4. No AVX. No AVX2. And more. (Well, technically binaries can contain newer instructions. But they likely won't be in default code paths and there will likely be run-time dispatching code to opt into using them.)

I've used Gentoo (everything compiled for my exact processor) and Kubuntu (default binaries) on the same laptop a few years ago and the difference in perceived software speed was negligible.

It depends on the software. I've recompiled the R core with -march=native and -ftree-vectorize and gotten 20-30% performance improvements on large dataframe operations.

If it were up to me, the R process would be a small shim that detects your CPU and then loads a .so that's compiled specifically for your architecture.
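A rough sketch of that shim idea, in Python for brevity. Everything here is hypothetical: the variant names, the `libR-<variant>.so` naming scheme, and the flag sets are made up for illustration, and reading `/proc/cpuinfo` only works on Linux. It just shows the shape of "detect features, pick the most specific build that the CPU can run".

```python
from pathlib import Path

# Hypothetical build variants, ordered from most to least demanding.
# Each entry lists the CPU flags (as named in /proc/cpuinfo) it requires.
VARIANTS = [
    ("avx512", {"avx512f", "avx2", "sse4_2"}),
    ("avx2", {"avx2", "sse4_2"}),
    ("sse4", {"sse4_2"}),
    ("baseline", set()),
]


def read_cpu_flags(cpuinfo_path="/proc/cpuinfo"):
    """Return the CPU feature flags Linux reports, or an empty set."""
    try:
        text = Path(cpuinfo_path).read_text()
    except OSError:
        return set()
    for line in text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def pick_variant(flags):
    """Pick the first (most demanding) variant whose requirements are met."""
    for name, required in VARIANTS:
        if required <= flags:
            return name
    return "baseline"


def library_name(flags, stem="libR"):
    """Build the hypothetical shared-object name for this CPU."""
    return f"{stem}-{pick_variant(flags)}.so"


if __name__ == "__main__":
    print(library_name(read_cpu_flags()))
```

A real shim would then hand the chosen name to the dynamic loader (in Python that would be `ctypes.CDLL`), and would fall back to the baseline build if the load fails.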

The same improvements would probably be seen in video/image codecs, especially on Linux where browsers seem incredibly eager to disable hardware acceleration.

My understanding is that the stdlib of the machine already figures out the faster code for the machine at run time. Such that, for most of the heavy stuff in many programs, it isn't that different.

Granted, I actually do think I can notice the difference on some programs.

I'd like to see some numbers comparing "backwards compatible" x86_64 performance with "bleeding edge" x86_64. That was something I had never considered, but it seems obvious in hindsight that you cannot use any modern instruction sets if you want to retain binary compatibility with all x86_64 systems.

.NET Core did quite a bit of work on introducing "hardware intrinsics". They updated their standard libraries to use those intrinsics and let the JIT optimize userland code. .NET Core performance is pretty good, if not the best [1].

I have myself seen 2-3x improvements in older .NET applications migrating to .NET Core (called just .NET from version 5 onwards). I think JITed code has a unique advantage here: any hardware-specific optimizations are automatically applied by the JIT, with newer versions of the JIT bringing even more performance improvements when the ISAs change.

[1]. https://devblogs.microsoft.com/dotnet/hardware-intrinsics-in...

Edit: Added reference link.

I suspect the difference will not be very large, as compilers are fairly bad at automatic vectorization. Most of the places that could benefit reside in your libc anyways and those are certainly tuned to your processor model.

I dunno, the "Trickle-Down Performance" section on Cosmopolitan describing the hand-tuned memcpy optimization gives me the impression that there might still be quite a lot of missed optimization opportunities there due to pessimistic assumptions about register use:


I probably agree with Justine about instruction cache bloat for these functions, but I remain unconvinced that diverging from System-V is something worth its tradeoffs. The discussion for that would likely be lengthy and unrelated to this topic, as compiling with newer CPU features would likely make performance comparisons worse under her scheme as the vector registers are temporaries.

The Cosmopolitan Libc memcpy() implementation is still binary compatible with the System V Application Binary Interface (x86_64 Processor Supplement). It gains the performance by restricting itself to a subset of the ABI, which the optional memcpy() macro is able to take advantage of. For example, the macro trick isn't even defined in strict ANSI mode.

> I remain unconvinced that diverging from System-V

I'm afraid I'm not well-versed enough in the subject to follow the leap here. I thought I was just linking to an example of what (to me) appears to be manually doing register coloring optimizations (or something close to it). How does that lead to "diverging from System-V"?

(I'm not asking you to dive into the discussion you alluded to, I'll take your word for it that it's a complicated one, just asking to clarify what the discussion even is)

I assume parent meant System V Application Binary Interface, Large AMD64 Architecture Processor Supplement, which defines the calling convention and as a part of it defines how registers are used.

So unless it is OK to break the ABI (i.e. you are going to call it only yourself, go wild), there won't be register coloring optimization over exported symbols.

Thank you for clarifying

This discussion flared up recently when mesa and Arch Linux were considering changing the defaults or providing new packages. https://www.phoronix.com/scan.php?page=news_item&px=Mesa-202...

From experience, using `-march=native` is not a magical "release the brake and speed up" switch. It works great for some projects and does almost nothing for others.

Including multiple code paths usually has very little overhead unless used in a tight loop or the branch predictors become confused by it.

Here is a thread from the Arch mailing list: https://lists.archlinux.org/pipermail/arch-general/2021-Marc...

That's mostly true. For something like TensorFlow -march=native can be like night and day. What I try to do with Cosmopolitan Libc, is CPUID dispatching, so it doesn't matter what the march flag is set to, for crucial code paths like CRC32, which go 51x faster if Westmere is available.

Your libc being tuned to your processor model seems extremely unlikely unless you’re compiling from source? I dunno, I only have one amd64 DVD of ubuntu

It has multiple implementations of functions and dynamically selects the right one based on CPU features.

There is a Linux distro that builds packages with newer instruction sets and compiler flags for performance reasons. It's Clear Linux: https://clearlinux.org/

The last section is really interesting. The author presents the following algorithm as the "obvious" fast way of doing diffing:

1. Split the input into lines.

2. Hash each line to facilitate fast line equivalence testing (comparing a u32 or u64 checksum is a ton faster than memcmp() or strcmp()).

3. Identify and exclude common prefix and suffix lines.

4. Feed remaining lines into diffing algorithm.

This seems like a terrible way of finding the common prefix/suffix! Hashing each line isn't magically fast, you have to scan through each line to compute the hash. And unless you have a cryptographic hash (which would be slow as anything), you can get false positives, so you still have to compare the lines anyway. Like, a hash will tell you for sure that two lines are different, but not necessarily that they are the same: different strings can have the same hash. In a diff situation, the assumption here is that 99% of the time, the lines will be the same, and only small parts of the file will change.

So, in reality, the hashing solution does this:

1. Split the files into lines

2. Scan through each line of both files, generating the hashes

3. For each pair of lines, compare the hashes. For 99% of pairs of lines (where the hash matches), scan through them again to make sure that the lines actually match

You're essentially replacing a strcmp() with a hash() + strcmp(). Compared to the naive way of just doing this:

1. Split the files into lines

2. For each pair of lines, strcmp() the lines once. Start from the beginning for the prefix, start from the end for the suffix, and in each case stop when you get to a mismatch

That's so much faster! Generating hashes is not free!

The hashes might be useful for the actual diffing algorithm (between the prefix/suffix) because it presumably has to do a lot more line comparing. But for finding common prefix/suffix, it seems like an awful way of doing it.
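For what it's worth, the naive version described above is only a few lines. This is a minimal sketch in Python rather than C, with `==` on strings standing in for strcmp(); the function name `trim_common_lines` is made up for the example.

```python
def trim_common_lines(a, b):
    """Strip the common prefix and suffix lines from two lists of lines.

    Returns (prefix_len, suffix_len, middle_a, middle_b). Each pair of
    lines is compared directly, once, with no hashing involved.
    """
    limit = min(len(a), len(b))

    # Walk forward from the start until the first mismatch.
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1

    # Walk backward from the end, never overlapping the prefix.
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1

    return prefix, suffix, a[prefix:len(a) - suffix], b[prefix:len(b) - suffix]
```

Only the two middle slices would then be fed to the actual diffing algorithm, which is where hashing may start to pay off.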

Isn't this exactly what the author says? From the posted article:

> Another common inefficiency is computing the lines and hashes of content in the common prefix and suffix. Use of memcmp() (or even better: hand-rolled assembly to give you the offset of the first divergence) is more efficient, as again, your C runtime library probably has assembly implementations of memcmp() which can compare input at near native memory speed.

So I think you're agreeing with him: it's a useful optimization to first remove the matching prefix and suffix using memcmp() to avoid having to do line splitting and hashing across the entire file. Especially since it's not uncommon for the files being compared to be mostly identical, with only a few changes in the middle, or some content added at the end.

Yeah, I misread that section: he talked about how the program spent longer time in the prefix/suffix section, and how picking a better hashing algorithm would improve things, and I was going "no! don't hash at all for that part! just compare the lines!". I missed the paragraph you quoted there.

Still, though, this is wrong: "Another common inefficiency is computing the lines and hashes of content in the common prefix and suffix." You have to compute the lines for the prefix at least, diffs use line numbers to indicate where the change is.

> diffs use line numbers to indicate where the change is.

The optimal solution would be a minor variation on the AVX-optimised memcmp that also counts the number of times it sees the newline character as it is comparing. You still do a single pass through the data, but you get the prefix comparison and the line number in the same go.

For modern CPUs, this is likely optimal.
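To make the single-pass idea concrete, here is a sketch of the logic only. It is a plain byte-at-a-time Python loop, nowhere near an AVX-optimised memcmp, and the function name is invented for the example; the point is just that the prefix length and the line count fall out of the same pass.

```python
def common_prefix_and_newlines(a, b):
    """Scan two byte strings once, returning (offset, newlines).

    offset is the index of the first differing byte (or the length of the
    shorter input), and newlines is how many newline bytes were seen in
    the matching prefix.
    """
    offset = 0
    newlines = 0
    for x, y in zip(a, b):
        if x != y:
            break
        if x == 0x0A:  # b"\n"
            newlines += 1
        offset += 1
    return offset, newlines
```

The newline count is the number of complete lines in the common prefix, which is what a diff needs to report the line number of the first change.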

You assumed the diff algorithm only compares each line against one other line. That's not true.

You look at each line many times in these algorithms. Running time is O(n log n) or O(n^2) not O(n).

So you generate N hashes and compare each hash against log N or N other hashes.

So, for big enough data it should be faster.
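A toy illustration of that argument, assuming a worst-case all-pairs comparison (real diff algorithms are smarter about which pairs they look at). Each line is hashed once up front; every later comparison checks the integers first and only touches the strings when the hashes agree. Python's built-in `hash()` is used purely as a stand-in for whatever checksum a real implementation would pick.

```python
def hashed_lines(lines):
    """Hash every line exactly once, keeping the line for confirmation."""
    return [(hash(line), line) for line in lines]


def same_line(x, y):
    """A hash mismatch proves inequality; a match is confirmed on the text."""
    return x[0] == y[0] and x[1] == y[1]


def count_matching_pairs(a, b):
    """Compare every line of a against every line of b (N * M comparisons)."""
    ha, hb = hashed_lines(a), hashed_lines(b)
    return sum(1 for x in ha for y in hb if same_line(x, y))
```

With N + M hashes computed once and N * M cheap integer comparisons, the string comparison only runs for pairs that probably match anyway.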

No, you misunderstand: he mentions a common optimization where before you run your diffing algorithm, you find the common line suffix/prefix for the file, and how that will improve performance (if you have a compact 5-line diff in a 10,000 line file, it's unnecessary to run the diffing algorithm over the whole thing). His point was that this suffix/prefix finding thing was surprisingly slow totally apart from the actual diffing.

I was talking about that part, how hashing there is unnecessary. As I mentioned at the end of my comment, for the actual diffing algorithm, it's fine to hash away.

"That's so much faster!"

Do you have benchmarks for that?

would love to see an experiment on this

> Laptops are highly susceptible to thermal throttling and aggressive power throttling to conserve battery. I hold the general opinion that laptops are just too variable to have reliable performance. Given the choice, I want CPU heavy workloads running in controlled and observed desktops or server environments.

Hallelujah. Running microbenchmarks on laptops is generally pointless

Measuring instruction counts instead of CPU cycles or wall time can help. Or you can temporarily disable CPU boost clocks to stay within the thermal envelope. You should do that anyway since boosting is another source of noise. On Linux it's as easy as

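    # systems using the intel_pstate driver
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

    # systems exposing the generic cpufreq boost switch (e.g. acpi-cpufreq)
    echo 0 > /sys/devices/system/cpu/cpufreq/boost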

Except then he goes on and says you can't expect consistent performance across servers either :)

Probably not as bad as laptop troubles, of course.

Would slow build configuration be a problem though? It isn't even slow compiling, on one machine you configure once and then you can compile n times (e.g. if you're developing)

He's definitely right about writing to Terminals though, or in my experience logging.

Often I rerun the entire build from scratch when I'm uploading some nontrivial change and I want to really make sure what I'm uploading builds. Clean builds using containers or chroots will often need to rerun a configure step to make sure the build is really clean.

You could argue for a "sufficiently smart cache", but if you manage to optimize the configure step to run in a small time without a cache, it's one moving part less to keep in one's mind.

If you need to run the configuration only once, it's probably not a problem. It's surprisingly often though that we get into a situation where we unexpectedly need to run a task many times. Trivially, this happens when you need to make a change in the configuration and require feedback.

Unfortunately autotools is dumb. If you add a new source file to Makefile.am, you will have to run autoreconf to generate your Makefile anew. That’s fine, but it also regenerates configure, even if you didn’t touch configure.ac. AND THEN the Makefile sees that configure is new, and it reruns it.

Autotools does not hear your screams.

Even on subsequent updates to the build configuration, I believe it checks if the environment has changed before running a build configuration. The assumption here is that the dependencies are set up correctly.

This is a fascinating set of shop-knowledge from someone who's clearly spent many years in a set of trenches that I hope I never have to. Great stuff.

Does anyone have any recommendations for interesting “lessons learned” type information like this blog article has?

Yeah, autoconf/autotools are a mishmash of old tools and scripts put together.

I still can't get my head around what it actually does when you do ./configure (probably conjure some 70's Unix daemon to make sure your machine is not some crazy variant with 25-bit addresses) and I tend to avoid it whenever possible

It was born in an era when there were many different *nix versions, all slightly different, with different names for include files, different C compilers with different flags, etc. Different OSes took differing numbers of arguments to certain library functions, etc. It was a necessary evil in the 90s.

I still remember the perl config script that would say "Congratulations. You aren't running Eunice!"

(Eunice was a bsd unix compat layer for VMS)

1. It is relatively easy to see what exactly configure is doing - it is logged to config.log and you can edit configure and add an echo or change something to troubleshoot failure. Troubleshooting cmake failures is much harder in my experience. 2. In many cases it is possible to reduce number of checks configure is doing: configure.ac files often contain unnecessary stuff, probably copied from some other project and kept just in case.

Oh I know what it is testing for, I just don't think a modern project needs to be "checking for special C compiler options needed for large files" or checking for the fsync command

Most such checks are performed because a software author put some macro into configure.ac, sometimes without good reason or without any thought at all.

I see this attitude in many different areas, e.g. people copy-paste config options from some outdated how-to or StackOverflow answer and are OK with not knowing what a given option does or whether it is relevant in their case.

Maybe it happens more often with autotools than with cmake (though I've seen enough bad CMakeLists.txt too) because there is a myth that autotools is very hard to learn, so developers don't even try to read the documentation and just put random stuff into their .ac/.am files until it works for them, albeit slowly.

If the configure tests were part of the dependency graph for parallel execution then any unneeded checks would simply not be run because nothing depends on them.

So if I'm compiling PostgreSQL from source, should I be doing:

    export CFLAGS='-O3 -march=native'

before ./configure? Because if I don't, it's using -O2 without specifying an architecture.

Only use -march=native if you're never-ever going to run the binary on another machine. This includes upgrading the processor or data rescue.

What does data at rest have to do with the CPU architecture?

-march=native turns on all the features that are available for your processor. So if, for example, you have a processor that supports AVX512 and another that doesn't, you'll get illegal instruction errors as soon as your other machine hits a region of code the compiler decided to optimize using those instructions.

You won't get any real useful error messages when this happens since it's so low level. Unless you know to look for this you'll just be scratching your head going, "but it works on my machine?!?"

Here's a dumb question: doesn't slow software affect the environment significantly?

To give an oversimplified answer, that depends on how things are connected and synchronized.

You know how bottlenecks are often explained using funnels in real life? Imagine two funnels, one directly connecting to another one. In that case, the effective rate of flow is determined by the funnel which has the smallest tip - the fact that the larger one lets through fluid quicker is irrelevant, as it is slowed down by the smaller one (we're obviously ignoring things like funnels potentially overflowing here, this is a simplified scenario). So then the bottleneck is easy to pin-point, but also it immediately becomes obvious that widening one funnel beyond the size of the other has no effect.

Now let's change the scenario: I have a bucket of water that I tip into the first funnel, going into a second bucket. Once the second bucket is full I tip the second bucket into the second funnel, going into a third bucket. In this case both funnels have an impact on the time it takes to fill the third bucket, but the smaller one still has the bigger impact.

And then of course there is the parallel scenario, you can probably see how that plays out.
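The three funnel scenarios can be written down as a toy model. This is only a sketch of the simplified picture above: rates are in arbitrary "litres per second", overflow is ignored, and the parallel case (funnels side by side draining the same bucket) is my own reading of how that scenario plays out.

```python
def streaming_rate(rates):
    """Funnels connected directly: the narrowest tip sets the flow."""
    return min(rates)


def batch_time(volume, rates):
    """Bucket between funnels: each stage must finish before the next starts."""
    return sum(volume / rate for rate in rates)


def parallel_rate(rates):
    """Funnels side by side: the flows simply add up."""
    return sum(rates)
```

With rates of 2 and 5 and a 10 litre bucket, the streaming setup flows at 2, the batch setup takes 5 + 2 = 7 seconds, and the parallel setup flows at 7. Widening the larger funnel changes nothing in the first case and only a little in the second.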

Our software environment is a complicated mix of all of these, and it can be really hard to pin-point what is the most significant effect. The "buckets" and "funnels" translate to all kinds of things - CPU and I/O are the most obvious ones, but there's more to it than that.

Also there's tons of side-effects that make this entire picture a lot more complicated than the simplified model I just described. The article gives quite a few examples, like thermal throttling, or branch prediction, but reality is even more depressing. Here's a great talk on why benchmarking is even harder than you think it is by Emery Berger:


This, by the way, is also why micro-benchmarks can be meaningless in the larger context. For those kinds of situations Coz is supposedly a better option (I've never had to optimize complicated situations like that). The talk I just linked goes into detail as to how it works and why it is better.

depends on why it is slow, is it CPU bound or waiting for I/O?

He singles out Windows for configure slowness, but MacOS is shamefully slow as well. I've seen configure run at least 2x as fast on the same machine booted into Linux or FreeBSD as compared to the MacOS that came on it.

For those with issues reading the site


Autoconf can use a cache file to speed up tests: https://www.gnu.org/software/autoconf/manual/autoconf-2.60/h...


Wow, a ton of nitty gritty details I was not aware of!

Speaking of thermal throttling on Macbooks, it's also worth pointing out that after 2 years the thermal paste on the CPU should be replaced, which is only a few dollars. I wish Apple made this a free maintenance along with removing internal dust.

Great content but please improve the contrast of your website <3

Huh? The foreground is #404040 and background is #F9F9F9. That’s 9.84:1 which is high enough for WCAG AAA. The background image is speckled, but even the lowest-brightness pixel is #E8F7FA which is 4.72:1, which is not AAA, but is AA.

I didn't notice the contrast problem until seeing the GP comment, but I suspect that speckled backgrounds are worse than a solid #E8F7FA because they mess with people's ability to do edge detection when trying to see the shapes of the letters in the text.

What are these ratios and the letter sequences?

Ratios are contrast ratios [1] while letter sequences are WCAG 2.0 conformance levels (AAA is the highest).

[1] https://www.w3.org/TR/WCAG20/#contrast-ratiodef
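For anyone who wants to check a colour pair themselves, the definition at [1] is short enough to sketch in Python. The function names are mine; the constants are the ones from the WCAG 2.0 relative luminance and contrast ratio definitions.

```python
def channel_to_linear(value):
    """Convert one 8-bit sRGB channel to its linear-light value."""
    c = value / 255
    if c <= 0.03928:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_colour):
    """Relative luminance of a colour given as '#RRGGBB'."""
    digits = hex_colour.lstrip("#")
    r, g, b = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return (0.2126 * channel_to_linear(r)
            + 0.7152 * channel_to_linear(g)
            + 0.0722 * channel_to_linear(b))


def contrast_ratio(colour_a, colour_b):
    """WCAG contrast ratio, always expressed as lighter over darker."""
    la = relative_luminance(colour_a)
    lb = relative_luminance(colour_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    print(round(contrast_ratio("#404040", "#F9F9F9"), 2))
```

WCAG 2.0 asks for at least 4.5:1 for AA and 7:1 for AAA on normal-size text, so the foreground/background pair quoted above (roughly 9.8:1) clears AAA.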

I removed the background image and made the text blacker. Might take a force refresh to pick up the CSS file change.

Is it good enough or are further tweaks needed? If more, my web design skills are mediocre, so actionable feedback would be appreciated.

Looks perfect to me after ctrl+shift+r. I'm glad you removed that background image, it was rather pointless.

In my experience, third party antivirus software does a better job than Windows Defender when it comes to file open/close performance. I always disable Defender or replace it with something else specifically because of the performance impact when working with many tiny files.

Maybe the third parties have handlers that don't block? There isn't any need to, after all - simply record the file details and return immediately. The virus scanner can always check that file later.

In fact, it makes more sense to do it that way because if the same file changes multiple times, the scanner will only check it once. Just have to make a trade-off on the duration - wait too long before you check the written file and it may have already gone on to infect something else.

I think their assumption is that this would race with another program opening or executing the file that was just downloaded if there is any possible delay at all, and thus by the time the scanner reaches it it's already too late.

I mean, Microsoft aren't dumb. Windows Defender is a competent AV product. If they're blocking on close there's probably a reason for it and they probably hate it. The trick with thread pooling file closes is one I'll stash in my brain for later: performance on Windows matters, especially as Win10 is getting more and more competitive vs macOS all the time.
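For my own future reference, a minimal sketch of what I understand the thread-pool trick to be: do the writes on the main thread, but hand each close() to a worker so a slow close doesn't stall the loop. This is Python with made-up names and an arbitrary pool size, not the article's actual code, and whether it helps at all depends on close() being the slow part on your system.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_files(directory, files, workers=4):
    """Write {name: bytes} into directory, closing handles on a thread pool."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for name, data in files.items():
            handle = open(os.path.join(directory, name), "wb")
            handle.write(data)
            # close() is where the scanner is said to block, so hand it off.
            futures.append(pool.submit(handle.close))
    # Leaving the with-block waits for all closes; surface any errors here.
    for future in futures:
        future.result()
    return len(futures)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(write_files(tmp, {f"file{i}.txt": b"hello" for i in range(100)}))
```

Collecting the futures matters: an exception raised inside close() would otherwise be silently dropped.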

> If you are running thousands of servers and your CPU load isn't coming from a JIT'ed language like Java (JITs can emit instructions for the machine they are running on... because they compile just in time), it might very well be worth compiling CPU heavy packages (and their dependencies of course) from source targeting a modern microarchitecture level so you don't leave the benefits of modern ISAs on the table.

Interesting, I wonder how this has affected language benchmarks and/or overall perception between JITed languages and native languages

> “ Programmers need to think long and hard about your process invocation model. Consider the use of fewer processes and/or consider alternative programming languages that don't have significant startup overhead if this could become a problem (anything that compiles down to assembly is usually fine).”

This is backwards. It costs extra developer overhead and code overhead to write those invocations in an AOT compiled language. The trade off is usually that occasional minor slowness from the interpreted language pales in comparison to the develop-time slowness, fights with the compiler, and long term maintenance of more total code, so even though every run is a few milliseconds slower, adding up to hours of slowness over hundreds of thousands of runs, that speed savings would never realistically amortize the 20-40 hours of extra lost developer labor time up front, plus additional larger lost time to maintenance.

People who say otherwise usually have a personal, parochial attachment to some specific “systems” language and always feel they personally could code it up just as fast (or, more laughably, even faster thanks to the compiler’s help) and they naively see it as frustration that other programmers don’t have the same level of command to render the develop-time trade off moot. Except that’s just hubris and ignores tons of factors that take “skill with particular systems language” out of the equation, ranging from “well good luck hiring only people who want to work like that” to “yeah, zero of the required domain specific libraries for this use case exist in anything besides Python.”

This is a case where this speed optimization actually wastes time overall.

> This is a case where this speed optimization actually wastes time overall.

That's too much of an absolute to be a good rule. If your heavyweight runtime is being launched 1000s of times to get a job done but you only do this once every few months, sure, don't do many optimizations, certainly don't worry about a rewrite in another language. The savings probably aren't worth it. If your heavyweight runtime is being launched 1000s of times to get a job done every day or multiple times a day, consider optimizing. Which may include changing the language. That's hardly controversial, this is the same thing we consider with every other programming task.

Is X expensive in your language and do you have to do this frequently? Then minimize X or rewrite in a language that handles it better.
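If you want numbers instead of opinions, measuring is cheap. A rough sketch: time a handful of launches of a do-nothing program and multiply by how often your job launches it. The function names and the example figures are made up, and a handful of runs on a noisy machine gives a ballpark at best.

```python
import subprocess
import sys
import time


def mean_startup_seconds(command, runs=5):
    """Average wall time to launch a command and wait for it to exit."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(command, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return (time.perf_counter() - start) / runs


def daily_overhead_seconds(per_launch, launches_per_job, jobs_per_day):
    """Total time per day spent purely on process startup."""
    return per_launch * launches_per_job * jobs_per_day


if __name__ == "__main__":
    startup = mean_startup_seconds([sys.executable, "-c", "pass"])
    print(f"startup: {startup * 1000:.1f} ms")
    print(f"per day: {daily_overhead_seconds(startup, 1000, 3):.1f} s")
```

Whether the resulting total justifies any work is then a question of how many people wait on that job, not of the language.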

> “ If your heavyweight runtime is being launched 1000s of times to get a job done every day or multiple times a day, consider optimizing. Which may include changing the language. That's hardly controversial”

No, that is controversial because the time saved per run (even 1000s of times per run with multiple daily runs) is never going to come close to amortizing the upfront sunk cost of that migration and future maintenance.

I’m specifically saying in the exact case you highlighted, people will short-sightedly think it’s a clear case to migrate out of the easy-but-slow interpreted language or never start with it to begin with, and they would be quantitatively wrong, missing the forest for the trees.

Think from the point of view of a tools/productivity engineer at a large company.

Yes, you invest some of your time to create the faster tool. Then hundreds to tens of thousands of people all use that faster tool to save time, day in and day out.

Just to put some concrete numbers to this, if you have a 100-person engineering team and you ask one of them to spend all their work time ensuring that the others are 1% more efficient than they would be otherwise (so saving each of them 5 minutes per typical 8 hour workday), you about break even. If you have a 500-person team, you come out ahead.

Now it's possible that the switch to a compiled language we are discussing would not save people 5 minutes per day, or that you don't have 100+ engineers. Obviously for a tool that only the developer of the tool will use the calculus is very different!
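The arithmetic behind those numbers, as a quick sketch. The function names are mine, and the model is deliberately crude: one full-time tool builder, a flat 1% gain for everyone else, nothing else considered.

```python
def minutes_saved_per_day(efficiency_gain=0.01, workday_hours=8):
    """Minutes each engineer gets back per workday."""
    return efficiency_gain * workday_hours * 60


def net_engineers_gained(team_size, efficiency_gain=0.01, tool_builders=1):
    """Engineer-equivalents gained, after subtracting the tool builders."""
    beneficiaries = team_size - tool_builders
    return beneficiaries * efficiency_gain - tool_builders
```

A 1% gain is 4.8 minutes of an 8 hour day. For a 100-person team the result is 99 * 0.01 - 1 = -0.01, so about break even; for 500 people it is 499 * 0.01 - 1 = 3.99, roughly four extra engineers' worth of output.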

Great, insightful post

I can't help but think some of these fall into premature optimization territory. Configuring a build for the machine is relatively rarely on the critical path. And it is mostly tests before the build. As such, it needs to be compared to the build with tests, which typically takes longer than just the build.

Similarly, the concern on interpreter startup feels like being about one of the least noticed times on the system. :(

I agree that usually the first bottleneck is the edit-rebuild cycle. But I think bottlenecks (plural) are a better way to view things than a single critical path. If my edit-rebuild cycle is fast enough that it doesn't bother me, then depending on what I am working on, I may quickly start noticing configure times. An accurate build system will reconfigure when any of a lot of different files are touched. (And inaccurate build systems waste a lot more developer time, just less evenly distributed and with more frustration involved.) So I reconfigure if I'm working on something relevant to the build system. I reconfigure when I pull down changes and rebase. I reconfigure when I'm switching between tasks (perhaps because my tests take a long time in CI), and in fact I won't switch tasks if reconfiguring takes too long. (And yes, I have multiple work trees, but my object directories can hit 20GB and it wastes even more time shuffling stuff around when I start running out of disk space for 4 work trees * 3-5 different configurations.)

And interpreter startup bites you all over the place! If it adds half a second latency to my shell prompt, I won't use it. And Greg gave the math for things where you're restarting the interpreter a million times, which isn't uncommon in my experience.

You should always profile. These things are heavily dependent on the type of stuff you work on. Et cetera. But my personal experience at several very different jobs says that these things do matter, a lot. Also, if you work at a moderately large place, the productivity loss from small inefficiencies is staggering. You have to look at the full picture to really see it properly; when things are slower, people don't just wait, they context switch and may never come back that day. Good for engagement numbers on HN, bad for productivity and flow.

The author appears to be mostly speaking from their experience of working at Mozilla, so I hope most of their claims are at least somewhat backed up by empirical (although possibly anecdotal) evidence

Fair. And I should have stated my main objection is that I don't find some of these surprising. Not that I don't agree they are slow.

Many of these will remain slow because they are far from the critical path of most end user systems.

Well, ok, maybe, maybe not - but why NOT make everything as efficient as you can? He's gone to all the trouble to show you what to do, it's no additional effort on your part to apply it.

Isn't this essentially Knuth's take?

I don't disagree, but I suspect the critical path has moved dramatically. By and large, the time to decide to use python takes far longer than using python. (And I dislike python...)
