
Ask HN: Why do maintainers optimize for small binaries? - nikisweeting
Just to be clear, I'm not talking about GUI apps or websites, those are separate conversations (cough cough 350mb Electron apps and 10mb Webpack bundles).

I'm talking specifically about CLI tools, webservers, and other tools distributed as static binaries or via package managers.

What's the reasoning for so many package maintainers optimizing for <5mb binaries at the expense of usability?

It seems like when >90% of hosts are running on 2008+ hardware with SSDs or even moderately fast HDDs, loading time and storage space are not major issues below the 30~50mb mark.

A recent example from HN: https://raphlinus.github.io/rust/2019/08/21/rust-bloat.html
======
caymanjim
The simple answer is that they don't. It's a false premise.

Developers who are specifically targeting systems with limited memory will try
to produce small binaries. If you're talking about a distribution like Ubuntu,
though, it's simply not a concern. At all. Applications are built to do
whatever it is that they need to do, and whatever size they end up being is
how big they are, almost without exception.

The reason CLI binaries are small is that ALL binaries are small. They are
compiled, and any resources they need are stored externally in other files.
They use shared libraries, making the code even smaller through re-use.

I have 1785 programs in /usr/bin on my Ubuntu server, and all but 10 of them
are under 5M. The ones that are larger are only big for unusual reasons (e.g.
mysql_embedded).

I'm not sure what you're referring to when you talk about usability. Are you
saying that my 1.1MB nginx lacks some utility? And that it lacks it because
someone was worried about the size of the binary? That's simply false to the
point of being nonsensical.

One of the biggest binaries I use is the Postgres server, at a hefty 6.5MB. Is
Postgres missing features that affect its usability?

~~~
nikisweeting
Nginx lacks nginx_http_perl_module, nginx_http_lua_module, and many other
modules that I wish it had out of the box. This is why I end up installing
openresty instead on most boxes.

Postfix lacks OpenDMARC and OpenDKIM out of the box, which are recommended for
almost all non-satellite mail server installs these days. Granted they're not
native Postfix modules, but they're so important that I wish it had them out-
of-the-box.

Same is true for caddy, certbot, and several other tools I use often.

~~~
peterwwillis
All Linux distros handle these things differently. Some build in every feature
under the sun, some try to modularize and build 30 related packages, some
strip it down to minimal dependencies. It depends on the package, the
maintainer, and the distro.

It is often necessary to build custom packages if you want functionality which
isn't core to the original software (perl and lua aren't necessary to run
nginx core). If you work with the package maintainer, you may find a way to
have the distro build the extra functionality as optional packages.

~~~
nikisweeting
> perl and lua aren't necessary to run nginx core

But that's the crux of my argument: most of the time I don't just want the
core, I want to be able to do almost everything, and I'm willing to pay the
binary-size price to have an "apt install nginx-everything" available.

Users with limited system resources could always "apt install nginx-minimal".

(If you can show me a distro that includes all the modules for nginx, caddy,
postfix, certbot, etc. by default, I'll switch in a heartbeat!)

------
mirashii
A few thoughts:

(1) I'm not sure that it is at the expense of usability, so I want to question
your premise.

(2) Size on disk is a proxy for many other kinds of bloat. That article
specifically mentions one in the opening, compile time. Surface area of the
codebase is another good example, where people are concerned that the amount
of libraries and code that they're pulling in means higher probability of
library bugs and maintenance burden.

(3) Bandwidth still has a cost.

~~~
nikisweeting
You don't wish nginx supported using environment variables in config files out
of the box?

I'd argue that lacking a way to do that is a significant hit to usability, and
it requires recompiling with the nginx_http_perl_module or lua module to get
that feature.
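
To illustrate, here's a minimal sketch of what that looks like once the perl
module is available (the environment variable and nginx variable names here
are placeholders, not anything nginx defines):

    # expose an environment variable to nginx config via ngx_http_perl_module
    env UPSTREAM_HOST;

    events {}

    http {
        perl_set $upstream_host 'sub { return $ENV{"UPSTREAM_HOST"}; }';

        server {
            listen 8080;
            location / {
                # the value can then be used anywhere nginx allows a variable
                add_header X-Upstream $upstream_host;
                return 200 "upstream is $upstream_host\n";
            }
        }
    }

Stock nginx has no built-in equivalent, which is why people reach for the perl
or lua modules (or template the config file before startup).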

------
singron
On my machine I have ~1500 packages. I don't really care if a few are very
large, but on the whole, they have to be pretty small. My SSD has limited
space and I'd like to use it for things where IO performance is important for
the tasks I'm doing. That random package I don't use that's pulled in as a
dependency for a feature I don't even have enabled takes up space.

E.g. if you use protobufs for C++, you probably don't want to pull in all of
ghc so protoc can support haskell. You would rather use that space for ccache
or something.

------
enkiv2
A lot of places still have pretty slow/unreliable internet. Developing
countries often end up being basically mobile-only because it requires less
infrastructure, but mobile-only doesn't mean 4G in such places. Other times,
relatively rich countries will have spotty internet access because of
geography: New Zealand famously had very poor internet service for a while
because of the complexity around getting cables there, and Australia has such
a big difference in population density between urban and rural areas that
wiring infrastructure in the sticks is tough. Even in the United States, a
surprising number of people are still using dial-up internet because broadband
is not available.

Another concern is embedded platforms. If rust can't produce meaningful code
that's under a megabyte, then it can't be used to write elevator controllers
or microwave oven displays or to program pacemakers -- and as regular
computers get more RAM to try to make up for the end of Moore's law, a larger
and larger percentage of cases where it makes sense to use a compiled systems
language over a JITed scripting language will be embedded ones. To some
people, if a compiler can't produce 1k binaries, it's worthless because it
forces you to substitute in a beefier machine.

There's also general performance and security concerns: any unnecessary code
adds to both vulnerability surface & footprint. If you work your system really
hard & you're always a meg or two short of running out of RAM, then saving a
meg or two of code size matters a whole lot. (This applies less to desktops &
more to applications where you might process lots of data on off-the-shelf
hardware: if you're clever & take advantage of parallelism, you can process
many TB of data on a machine with less than a gig of ram in a relatively short
time, and you can cut it down further if you can run more copies of your
application, which you can do better if your memory footprint is smaller.)

Finally, I don't think it's safe to overestimate what percentage of machines
are less than eleven years old. Ever since Moore's Law ended, the case for
upgrading hardware has been a lot weaker, and even when it was running strong
folks often went a decade without doing so.

Bloat doesn't necessarily translate into usability, and usability doesn't
necessarily transfer between users. There are a lot of folks for whom a 5mb
app is, necessarily, unusable.

~~~
nikisweeting
> "unnecessary code adds to both vulnerability surface & footprint"

I wholeheartedly agree with this. If there is unnecessary code, there's no
reason to keep it.

I'm more directing my gripes towards packages that have highly desirable,
well-audited add-ons / modules that package maintainers choose not to include
in the default distributions (under the reasoning that "users want small
binaries").

Take for example nginx, caddy, certbot, or postfix (there are many others
too).

All of these require recompiling from source with build flags to enable the
inclusion of even their most common add-on modules, e.g.
nginx_http_lua_module, nginx_http_perl_module, caddy:http.cache, certbot dns
plugins, etc.

Recompiling from source breaks the ability to use a package manager for
install and automated updates, which drastically reduces usability for the
majority of users. There are ways around this of course, but for the average
user, having to compile a package from source is a major hurdle.

Instead, why not distribute the binary with no add-ons as "apt install
packagename-minimal" for the users with bad internet / low-resource
requirements, and make the default "apt install packagename" distribution the
"batteries included" version?

(If you're interested, here's an old blog post of mine that goes into detail
on why I think package manager distributions are worth the effort to maintain
in general, even for static binaries or packages with dylibs:
[https://docs.sweeting.me/s/against-curl-sh](https://docs.sweeting.me/s/against-curl-sh))

~~~
coryrc
Install nginx-extra then?

~~~
nikisweeting
But why not have that default and the non-extras version 'apt install nginx-
minimal'?

What about caddy which doesn't offer module support _at all_ in the package
manager version?

Certbot and postfix also don't include their most commonly used modules out of
the box.

~~~
mholt
If apt and others could simply allow us (package authors) to construct a URL
from which the binary can be downloaded, and then plop it into the user's PATH
for them, this would not be a problem. (What I'm getting at is that all it
takes to install a custom build of Caddy -- with any number of plugins you
want -- is a GET request.)

------
nyc_pizzadev
Instruction cache. If your mega binary doesn't fit into cache, execution jumps
around to distant, unpredictable addresses and you end up fetching instructions
directly from main memory. This is why huge applications are so slow: they are
both instruction-memory bound and data-memory bound. Imagine how slow a CPU is
with no cache!

~~~
papermachete
A Ryzen 2300X has 256KB of instruction cache: 4x64 KiB, 4-way set
associative. How can you hope to fill that?

I've personally run Gentoo on a 4.2GHz i3-530 as well as a 4.5GHz FX-4350.
From the kernel to Firefox, I only gained performance going from -O2 to -Ofast.
I don't expect significant instruction-cache misses on a modern CPU.

~~~
AstralStorm
Remember that your process is not the only one running on the CPU or core.
Most of the time, at least; and even if it is, via some artifice, you still
end up running kernel code for syscalls and the C library too.

------
vortico
I'm still not convinced of your claim that "package maintainers optimize for
<5mb binaries at the expense of usability". I've never heard anyone talk about
that. Do you have more examples?

~~~
chucksmash
Possibly precipitated by this post which was on the front page yesterday with
400 upvotes:

[https://news.ycombinator.com/item?id=20761449](https://news.ycombinator.com/item?id=20761449)

~~~
vortico
That example was given in the original post. That opinion is Raph Levien's
("One of my hopes for xi-editor is that the core would be lightweight"), so if
that's the only example, the OP should have asked him directly.

------
nightfly
Business logic isn't going to be a large part of a program's size. Most space
is used by static resources like images, lookup tables, and the libraries for
handling them. People aren't cutting out features from their CLI programs to
save a couple of kilobytes.

~~~
marcosdumay
Hum... They are (specifically, cutting speed), as evidenced by the OP's link.

Not that it matters with the low number involved.

~~~
wademealing
I might be missing the point here, but larger binaries are not always slower.

------
gwern
The cynical answer is: binary size is, like the price of gasoline, highly
legible all out of proportion to its importance, and for that reason,
influential. It's easy to see instantly how big a package is because
everything reports it as an easy standard metric. You can see binary sizes go
up as you add functionality or dependencies. But it's not easy to see actual
end-user performance, bug-freeness, or security (binary size being only weakly
correlated, at best, to things like those, which the user actually cares
about). Setting up a meaningful benchmark or testsuite is much harder than
noticing that your binary is no longer <5mb. So, the latter happens more than
the former.

~~~
nikisweeting
You can still claim that you have a small core binary of 200kb, while offering
the 20 modules in the default package manager distribution that bring it up to
8mb.

Then refer people to "apt install packagename-minimal" if they only want the
core.

~~~
gwern
You can but that requires more of an attitude of gaming metrics than most
people will have. Lots of metrics are easy to game if you explicitly set out
to, but most people aren't dishonest; it's the ones which are easily gameable
without explicitly trying to which are the dangerous ones, because most people
are honest.

------
koala_man
I'm the maintainer of ShellCheck. I minimize the size of the binaries because
I pay per gigabyte transferred for the hosting, and to avoid playing into
Haskell's rep of being bloaty.

~~~
mehrdadn
Shellcheck is huge because of the language runtime, right? It's a couple tens
of megabytes IIRC, whereas, had it been a normal C++ program, I would've
expected it to be maybe a few hundred KB, maybe a MB or two if pushing it.

------
saint_abroad
> Once you accept bloat, it’s very hard to claw it back. If your project has
> multi-minute compiles, people won’t even notice a 10s regression in compile
> time. Then these pile up, and it gets harder and harder to motivate the work
> to reduce bloat, because each second gained in compile time becomes such a
> small fraction of the total.
> [https://raphlinus.github.io/rust/2019/08/21/rust-bloat.html](https://raphlinus.github.io/rust/2019/08/21/rust-bloat.html)

If accepting a feature that causes a 3x compile-time regression were easy,
then that 29s build is just 5 easy decisions away from a 2-hour build. At that
point nobody cares about adding more time to the builds, since they only run
nightly anyway, and developers commit hoping the build doesn't fail. This is
sadly all too familiar.

> Good programmers write good code. Great programmers write no code. Zen
> programmers delete code. [https://www.quora.com/What-are-some-things-that-only-someone...](https://www.quora.com/What-are-some-things-that-only-someone-who-has-been-programming-20-50-years-would-know/answer/John-Byrd-2)

------
peterwwillis
Well, ask yourself the reverse: what am I gaining by having unnecessarily
large binaries?

If you ever have an automated process built around your tool, eventually
something will run it lots of times. The more it's downloaded, copied, and
run, the more the "bigness" affects performance. It's best to choose smaller
whenever it doesn't take away features that you need.

~~~
nikisweeting
I'm gaining all the functionality of nginx_http_perl_module,
nginx_http_lua_module, caddy:http-cache, and all the other modules that I
wanted to have in the first place.

------
mntmoss
The most accurate answer with respect to the thing you linked is: all binaries
used to be much smaller, by a factor of 100x or more. So on this aspect of
performance we have had a kind of regression, where most things have a bigger
footprint than necessary. Some of it is to be expected, since 64-bit code is
larger, Unicode handling is larger, and we support more features even in the
smallest CLI tools. But there's also an element of the tooling making an
optimization that is locally sensible - expand code size to get some runtime
performance benefit - and globally useless: as featuresets grow, an increasing
percentage of all code is "run once/run never" rather than a hot loop, so
keeping it in that expanded form just wastes disk, memory, and cache.

------
thesuperbigfrog
Efficiency.

Here's an analogy using your question:

Why do aircraft builders optimize for weight?

Just to be clear, I'm not talking about cargo planes or helicopters, those are
separate conversations.

I'm talking specifically about passenger airliners, fighter jets, and other
aircraft that are built in large quantities.

What's the reasoning for so many aircraft builders optimizing for weights
<50000 kgs at the expense of usability?

~~~
nikisweeting
Given that most plugin-style functionality can be toggled in runtime config so
that it's never loaded into memory when unused, and given that maintainers can
provide a separate "apt install package-minimal" distribution with none of the
plugins included for people with low bandwidth:

How does a smaller binary make things more efficient?

------
biggestdecision
>loading time and storage space are not major issues below the 30~50mb mark.

Even if you have ample disk space and memory, smaller binaries can be a
performance advantage. A smaller binary with fewer instructions will more
easily fit into CPU caches. Binary size can be an indicator of performance.

~~~
nikisweeting
Not if you design it so that optional functionality can be toggled in runtime
config. Nginx's optional modules never get loaded into memory if you don't
include them in your config.
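
For example (assuming the module was built as a dynamic module, which is how
distros usually ship optional nginx modules), the shared object just sits on
disk until the config explicitly loads it:

    # a dynamically built module is only mapped in if the config asks for it
    # (the path below follows the nginx docs' convention; distros vary)
    load_module modules/ngx_http_perl_module.so;

    events {}

    http {
        # ...rest of the config; remove the load_module line above and the
        # module is never loaded into the worker processes at all
    }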

------
franciscop
Hi Nick! Happy to answer from my point of view. I just reduced one of my
packages to 1/1000 its original size:
[https://bundlephobia.com/result?p=drive-db@5.0.0](https://bundlephobia.com/result?p=drive-db@5.0.0)

There are several reasons that I like small packages, both when making them
and when consuming them:

\- Easier to maintain. One package does a single thing; it's easy to reason
about and compose with other packages. This is especially true for utility
packages. It's easier to document a single thing, to debug a single piece of
functionality, etc.

\- Faster installs. It can take several minutes under certain circumstances
to install larger packages/projects for me. Not everyone using tech tools lives
in a world with fast internet. Sure, this amounts to ~30 min/week max, but I'd
prefer to use my time differently.

\- You say 30-50mb, but my typical React project is 80-100 MB in node_modules.
As an example, my current laptop is ~4 months old, and the "projects" folder
already has 670k+ files and weighs 6+ GB.

\- Copying projects. While you talk about binaries, it's usual that it's
either a single minified file of a decent size, or hundreds/thousands of
dependencies. While the size itself doesn't matter so much, the number of
files matters for things like backups, searching in files, etc.

\- Signaling. People who care about this normally won't throw a lot of
dependencies on top if it can be easily avoided. So you know there normally
aren't many surprises, or Guy Fieri images:
[https://medium.com/s/silicon-satire/i-peeked-into-my-node-mo...](https://medium.com/s/silicon-satire/i-peeked-into-my-node-modules-directory-and-you-wont-believe-what-happened-next-b89f63d21558)

\- Marketing. Many people care about it for these or other reasons. Everything
else being the same, a smaller package is better, so there's no disadvantage
if you can easily shed some of the library size. I don't care that much,
especially because of what you say. But some people seem to do.

\- Tooling is easy! Rollup, webpack, uglify, etc. There are many easy (okay,
not Webpack) tools to bundle and minify a project, and they're needed for
front-end JS anyway.

Now, I wouldn't optimize at the expense of usability or dev time. For
instance, the two features I removed from `drive-db` were removed because they
were half-baked. File cache ~> replaced by in-memory cache, which also allows
for browser and Cloudflare Worker usage (the main point for refactoring).
MongoDB-like queries? ~> JS today is good enough not to need those for a small
spreadsheet. They didn't work the same as MongoDB, and not even I, the library
author, used them ever.

~~~
nikisweeting
Thanks for the detailed response!

I think I should've added more detail in my original post explaining that I'm
very much pro-removing code, my gripes were directed at packages that have
optional modules that aren't included by default because maintainers claim
"users want small binaries".

I'm of the opinion that most packages should include their optional add-ons in
their default distributions, and offer a separate "minimal" version without
any add-ons for the users with low-resource requirements.

(See my response to the other comment above too)

~~~
franciscop
I see, I totally agree! I want to rewrite
[https://serverjs.io/](https://serverjs.io/) at some point, and I'm thinking
about what functionality to add. e.g., CORS will probably be a config toggle
instead of having to add custom middleware.

But what about things where it is purely an option? There's a slippery slope
there IMHO. Database connection? For a server library, no way. Key-value
store? Also no way... no, wait, those are used for sessions, so now you have to
either do no sessions by default, have in-memory sessions which are super
tricky (because they "work", until there are tricky production issues), or add
a Redis/similar connection. Same for rendering templates: I added the 3-4 most
common engines but not others, and it feels really "meh". It's also A LOT of
work to add the top 3-4 options, even without documentation.

~~~
nikisweeting
Those decisions seem pretty reasonable from the perspective of a mostly-Django
dev! Do you think >30% of users would want to install those things every time
they use serverjs.io? If not it seems fine to leave them out.

I think Django supports db-backed/cache-backed/file-backed sessions out of the
box, while Flask doesn't have native sessions at all and they're only addable
with additional libraries, but I could be wrong.

------
devinjflick
For Linux cli tools the obvious (to me) answer is IoT devices. These devices
are typically pretty lacking in terms of hardware capabilities. So for common
tools having them as small as possible allows them to be distributed on a
wider range of devices.

~~~
nikisweeting
So provide "apt install packagename-minimal" for those with low-resource
requirements. Otherwise the tyranny of the minority forces the majority to
accept usability reductions.

------
mister_hn
Performance is a major concern. Static binaries also help by being much more
redistributable and hassle-free compared to dynamically linked ones (imagine
all those DLL, .so, or .jar files lying around - search for "DLL Hell" or "JAR
Hell" for instance).

~~~
nikisweeting
Most optional plugin-style functionality can be toggled at runtime and never
has to be loaded into memory, so why would including more optional modules in
the default distribution affect performance?

------
fulafel
The example you referenced, Druid, is in fact a GUI toolkit.

