
The Law of Leaky Abstractions - dwmkerr
https://github.com/dwmkerr/hacker-laws#the-law-of-leaky-abstractions
======
ryanbrunner
I see this fact used pretty often to dismiss the idea that we should use
abstractions at all, and I think that's pretty wrongheaded. As an example,
it's pretty common to hear that people should prefer writing raw SQL to using
an ORM or query builder because those are both "leaky abstractions".

Abstractions aren't necessarily only there so that you don't need to
understand anything about what's being abstracted. They are there so that your
code can get away from nitty-gritty implementation details, and be more
focused on the problem domain.

What would you rather see when you're coming into a new codebase:

    
    
        const formData = new FormData();
        formData.append('name', 'Widget');
        fetch(`/api/widgets/${id}`, {
           method: 'POST',
           body: formData
        })
    

or:

    
    
        Widget.save({ name: 'Widget' });

~~~
icxa
The argument against ORMs isn't that they are a leaky abstraction (although
they most definitely are).

The argument is although they appear to offer you value up front, they cost
you much more down the road. The moment you have a hot path query that needs
optimizing, you are dropping down to your ORM's "raw sql" mode. Then you do it
again. Then again. Then you are ripping out the ORM and spending cycles
refactoring it out of your code and replacing it with simpler abstractions.

I find that not believing raw SQL is preferable to an ORM usually comes down
to a lack of experience, or to having been coerced into ORMs by an
"enterprise grade" language (usually your C# developers of the world, no
offense to you all, but if every time you look up examples of data operations
it's dealing with EntityFramework, you are probably going to wind up with a
lot of devs using EntityFramework). The ORMers, as I call them, don't have
confidence in their own SQL ability, and they haven't experienced the
aforementioned situation enough times to realize you are better off starting
with SQL to begin with.

The realization, if I could summarize it: yes, all abstractions are at least
somewhat leaky, so you are better off using simpler ones than complex ones
(and if you need something more complex, compose it from simpler ones).
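
What "simpler abstractions" might look like in practice, sketched here
illustratively as a thin helper over Python's stdlib sqlite3 (the helper name
is made up): abstract only the boilerplate, and keep the SQL itself visible
at the call site.

```python
import sqlite3

def query(conn, sql, params=()):
    """Run raw SQL and return rows as dicts. No query generation,
    no identity map: the SQL stays visible at the call site."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widgets (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO widgets (name) VALUES ('Widget')")

rows = query(conn, "SELECT id, name FROM widgets WHERE name = ?", ("Widget",))
# rows == [{'id': 1, 'name': 'Widget'}]
```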

~~~
asdkhadsj
I'd not mind writing raw SQL if it were anything other than a big, non-typed
blob of crap as far as my language is concerned. I want compile-time safety in
my SQL. Tbh, I'm surprised that raw-SQL folks don't promote some kind of non-
ORM but compile-time type-safe SQL implementation. The runtime-ness of SQL
strings has always blown me away.

I imagine it would be pretty easy too. Move SQL out of the programming
language _(i.e., into files)_. Verify the syntax. Compare the SQL to the db to
ensure validity. Bam, compile-time verified SQL. Though I've never used a
setup like that, as that is what my ORM does, just in-language.
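
A rough sketch of that idea, assuming SQLite: at startup, ask the engine to
prepare (via EXPLAIN, without executing) every query against the live schema.
The query names here are hypothetical, and real tooling would load them from
standalone .sql files as described:

```python
import sqlite3

QUERIES = {  # in practice, loaded from standalone .sql files
    "get_widget": "SELECT id, name FROM widgets WHERE id = ?",
    "bad_query":  "SELECT nme FROM widgets",  # typo: column doesn't exist
}

def validate(conn, sql):
    """Prepare (don't run) the statement via EXPLAIN; this catches
    syntax errors and unknown tables/columns against the schema."""
    try:
        conn.execute("EXPLAIN " + sql, (None,) * sql.count("?"))
        return None
    except sqlite3.Error as exc:
        return str(exc)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widgets (id INTEGER PRIMARY KEY, name TEXT)")

errors = {name: validate(conn, sql) for name, sql in QUERIES.items()}
# errors["get_widget"] is None; errors["bad_query"] reports "no such column"
```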

~~~
jmull
Your compiler doesn’t — and _can’t_ — know what’s in your database.

To the extent you create a system that requires this constraint, you’ve
created a brittle (and soon to be broken) system.

(I suppose there may be some case where all the data is all known up-front and
will always be updated in lock-step with the consuming code/services but
that’s rare in my experience.)

~~~
dragonwriter
> Your compiler doesn’t — and can’t — know what’s in your database.

There's no reason you couldn't compile code against a schema, at least the
portions that would be exposed to the application anyway (independent of the
data content and non-exposed backing parts of the schema), just as you do
against header files (independent of the implementation). It is a form of
coupling, but it's coupling that exists _anyway_ in DB-consuming code; it's
just not typically statically verified, and so is prone to unnecessary
run-time breakage.

Unfortunately, you need tooling to statically analyze SQL schemas in whatever
flavor of SQL you are using (and vendor differences will matter here),
including inferring types through view definitions, etc.; then you need
tooling to map that to the type system of your implementation language; and
potentially you need extension points so that you can do custom mapping for
custom types from the database side.

Unless your DB and application platform share tight common control or are near
ubiquitous, getting someone to make the investment to do this and keep it
current is hard. It's not too hard to imagine Microsoft doing it for the SQL
Server / .NET combo, or Oracle doing it for Oracle DB / Java, but a solution
general and well-maintained enough to be usable is harder to see getting the
kind of support it would need.
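
A toy version of that schema-to-type-system mapping, using SQLite and Python
for illustration (the type table and names are invented; real tooling would
also have to handle nullability, views, and vendor-specific types):

```python
import sqlite3
from dataclasses import make_dataclass

# Minimal declared-type -> Python-type mapping (illustrative only).
SQL_TO_PY = {"INTEGER": int, "TEXT": str, "REAL": float, "BLOB": bytes}

def dataclass_for(conn, table):
    """Generate a typed record class from the table's live schema."""
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    fields = [(name, SQL_TO_PY.get(decl.upper(), object))
              for _cid, name, decl, *_rest in cols]
    return make_dataclass(table.capitalize(), fields)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widgets (id INTEGER, name TEXT)")

Widget = dataclass_for(conn, "widgets")   # fields: id (int), name (str)
w = Widget(id=1, name="Widget")
```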

~~~
jmull
Sure, you can validate your database's schema against types in your code.
Nothing wrong with that either, as far as it goes.

But that isn't a guarantee of anything at runtime.

You could make a runtime requirement that the schema and code match, but
you're going to pay a price for that tight coupling. E.g. you'll need to put
in place a mechanism to be able to update all your database clients and all
your database schemas atomically. Once you get a decent amount of data in your
database or scale horizontally, this could become infeasible (e.g., due to the
downtime to update the schema) or impossible (e.g., because you don't have
complete control of all databases, data, and database clients).

If you are going to go in this direction, you're usually going to be better
off targeting a mapping layer/API, not the base tables. The coupling is not so
rigid (e.g., there's room for simultaneous support for multiple versions of
the data access API/schema). This all exists, of course: ORMs and other
higher-level data access libraries generally do this kind of thing to a
greater or lesser degree. Importantly, the mapping layer/data access API
should _live with the database_. That is, be developed and deployed with the
database (directly or in controlled parallel).

------
jasonkester
A fun example from SQL Server that bit me the other day. Observe this simple
proc:

    
    
      create procedure SearchJackets
        @buttonCount int
      as
      select *
      from   jackets
      where  @buttonCount is null or buttonCount = @buttonCount
    
    

It's a common pattern you use for multi-param search queries that lets you
avoid building your query as a string. Normally there are lots of little
filter params in there, but we'll just show one for now. "Computer: List me
the jackets, and maybe just show the ones with a certain number of buttons."

That version takes 60 seconds to run, whereas this version takes less than a
second:

    
    
      create procedure SearchJackets
        @buttonCount int
      as
      declare @buttonCountCopy int
      set @buttonCountCopy = @buttonCount 
      select *
      from   jackets
      where  @buttonCountCopy is null or buttonCount = @buttonCountCopy 
    
    

... because it kicks the optimizer upside the head and convinces it to
reconfigure the query plan in an efficient way: copying the parameter into a
local variable defeats "parameter sniffing", so the plan is built from general
statistics rather than tailored to (and cached for) the first parameter value
the optimizer happens to see. There are hints you can use at CREATE time that
are supposed to do the same thing, but they don't actually work in this
particular case. So now one of my codebases has this hack sprinkled about in a
couple "Advanced Search" pages.

Fun stuff.

~~~
hackinthebochs
>There are hints you can use at CREATE time that are supposed to do the same
thing, but they don't actually work in this particular case.

Are you saying that "option(recompile)" doesn't work for you in this case? It
works just fine for me when using that 'X is null or' pattern for optional
filters.

~~~
jasonkester
Indeed, that's the option I was testing that had no effect. From what I've
read, it does seem to work in some cases. But not mine, sadly.

~~~
hackinthebochs
Well that's interesting. I'll have to keep your little hack in the back of my
mind. Thanks!

------
TickleSteve
"for a magnetic drive, reading data sequentially will be significantly faster
than random access (due to increased overhead of page faults),"

...err... no.

Magnetic drives have slow random access due to seek time, i.e. the time taken
for the head and disk to physically change position.

In comparison to that, SSDs are _effectively_ zero latency, but they still
have read-ahead/buffering/caching latencies to deal with.

~~~
mannykannot
The quoted text appears to be comparing the sequential and random-access
speeds of magnetic disks, so the difference with respect to SSDs does not
come into it. On the other hand, I do not understand what the author means in
the following clause, where the slower random access is attributed to the
overhead of page faults, unless the author has in mind a specific (and
unmentioned) scenario involving memory-mapped access. And if page faults are
the issue in that scenario when using magnetic disks, why would one not have
the same issue when using SSDs? I would have thought the causality goes the
other way: the overhead of page faults is higher when using magnetic disks
because of their relatively slow random access.

~~~
TickleSteve
Possibly... All I can surmise is that the OP may think disks are addressed in
a memory-mapped fashion and hence may be subject to page-faults for some
reason.

(Obviously, they're not).

~~~
tonyarkles
Reading that generously and since we’re talking about abstractions... reading
from disk via mmap does work via page faults! Except... it’s a layer up. Doing
random reads on an mmap’d file will likely have terrible performance until
those pages have been cached, but one layer down there’s no guarantee that
sequential reads from an mmap’d file are going to be sequential reads from
disk! (Because the file isn’t guaranteed to be laid out sequentially on disk)
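
The mechanism is easy to poke at from Python's mmap module; this sketch shows
only the interface (a file read expressed as a memory access), not the fault
behaviour or on-disk layout underneath:

```python
import mmap
import os
import tempfile

# Write a file with a known payload past the first 4 KiB page.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096 + b"hello")
os.close(fd)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # An ordinary slice: the OS services this access with a page
    # fault the first time the page isn't resident in memory.
    payload = mm[4096:4101]
    mm.close()
os.unlink(path)
```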

Others in this discussion have talked about some abstractions being perfect
and a consumer not needing to understand the layer beneath; I strongly
disagree. Ultimately, the physical reality of the machine will come into play
(disk, RAM, caches, network, CPU etc), and I am generally uncomfortable if I
don’t have a solid feel for how the high-level operations in an abstraction
are going to use those resources.

------
hcarvalhoalves
I feel "abstraction" (as a term) is thrown around a lot, but should only apply
to the _conceptual_ system (aka, while in the computer science realm).

Once you're dealing w/ the implementation and it turns into something
_concrete_ (aka the engineering realm), it's just a layer of indirection (and
it serves its purpose by reducing coupling). But you still have to consider
that underneath it all you're talking to some machine over the network:
there's latency, it can time out, there could be a load balancer in between,
the cache may not be coherent for consecutive requests, you assume some
operation is both atomic and produces instantaneous side effects, and so on.
If you ignore these details you end up w/ a system that looks great at a
conceptual level but is full of problems and race conditions, because you
ignored physics.

TL;DR "Leaky abstraction" is pretty much a tautology if you look up the
meaning of "abstraction"?

------
tabtab
Re: _Modern practices like 'Microservice Architecture' can be thought of as an
application of this law (The Unix Philosophy), where services are small,
focused and do one specific thing, allowing complex behaviour to be composed
from simple building blocks._

An OOP class or API can do the same thing. Microservices are overkill unless
multiple applications will be sharing the service, and even then they may have
unnecessary overhead compared to, say, stored procedures.

------
indogooner

       However, for a magnetic drive, reading data sequentially will be significantly faster than random access (due to increased overhead of page faults), but for an SSD drive, this overhead will not be present. 
    

Even SSDs have overhead for random access, although not as much as spinning
disks (~3 times, IIRC).

~~~
mehrdadn
I never understood how the myth that SSDs have no random access overhead
became so prevalent and oft-repeated. Did nobody ever measure?

~~~
dspillett
SSDs have effectively zero random read access overhead when compared to
traditional drives, because the overhead is a couple of orders of magnitude
smaller. Also for SATA connected SSDs the effect of this read latency is
reduced by bottlenecks elsewhere.

For the common home/office/other user the difference between zero and
effectively zero is, well, effectively zero, so the two easily conflate. It
isn't so much a myth as a convenient simplification.

(The fact that there is still latency is very easy to show though - just throw
something like crystaldiskmark at an SSD and show the measured throughput
difference between the sequential and random tests.)

For NVMe drives, where the bottlenecks of SATA are removed, the difference
starts to become more noticeable, and on any SSD random write latency is more
significant than random read latency. But NVMe has only recently become common
for the general user, and the tests people usually look at are random read,
not random write, since for those common users that is the most significant
measure in terms of how it will affect their day-to-day use patterns.
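
The sequential-versus-random difference is easy to measure yourself; a toy
sketch (Python, POSIX os.pread; the file here is tiny and stays in the OS
page cache, so a real benchmark would use a far larger file and drop caches
first):

```python
import os
import random
import tempfile
import time

BLOCK = 4096
BLOCKS = 1024  # a 4 MiB test file; real benchmarks use much more

fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(BLOCK * BLOCKS))

def read_blocks(order):
    """Read the given blocks one at a time, returning elapsed seconds."""
    start = time.perf_counter()
    for i in order:
        os.pread(fd, BLOCK, i * BLOCK)
    return time.perf_counter() - start

seq = read_blocks(range(BLOCKS))
rnd = read_blocks(random.sample(range(BLOCKS), BLOCKS))
print(f"sequential: {seq:.4f}s  random: {rnd:.4f}s")

os.close(fd)
os.unlink(path)
```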

------
_bxg1
I think the key is to make sure all leaks happen in the sphere of performance,
not correctness. So you can use the abstraction on its own without fragility,
and then you can learn more about it if you want to tighten up performance.

------
CSMR
A better formulation: Abstractions do not adequately describe the full working
of a system. So higher-level approaches will always require a knowledge of
lower-level operations.

However, the opposing and terribly named "Dependency Inversion Principle" is
true much more often: "High level [approaches] should not be dependent on low-
level implementations."

------
CapsAdmin
Maybe
[https://github.com/denysdovhan/wtfjs](https://github.com/denysdovhan/wtfjs)
counts as leaky abstractions.

It's definitely something that's difficult when building a language.
Abstracting and generalizing your language parser can lead to ambiguous
behavior.

------
amelius
Article could use a few more examples.

~~~
yasth
Joel's article ( [https://www.joelonsoftware.com/2002/11/11/the-law-of-
leaky-a...](https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-
abstractions/) ) has tons.

A classic trap for new players is that in most modern compiled languages you
can add strings and get a string, but strings are in fact immutable and can't
be added without making an entirely new string and disposing of the original
two. This means "a" + "b" is actually a horrible way to build up a string if
you have to do lots of little additions, so most languages have some other
method of making a string of strings/chars (StringBuilder in Java and C#,
strings.Builder in Go, etc).
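
The same trap, illustrated in Python: repeated concatenation copies the whole
accumulated string on every step, while str.join (the builder equivalent)
makes a single pass.

```python
def build_plus(parts):
    s = ""
    for p in parts:
        s = s + p  # each + copies the accumulated string: O(n^2) overall
    return s

def build_join(parts):
    return "".join(parts)  # a single allocation pass: O(n)

parts = ["a"] * 10_000
assert build_plus(parts) == build_join(parts) == "a" * 10_000
```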

~~~
mankeysee
Why are strings made immutable by default in some langs (I have seen this
mainly in Java and Python)? Nothing fundamentally requires strings to be so.
Has some analysis been done indicating most string operations in software
would benefit from the immutable form rather than the mutable form?

~~~
ChrisSD
At least in Python, string objects are widely used as keys to dictionaries or
as options in functions. For speed and efficiency these small strings are
"interned" so that there is only ever one instance of the same string.

Also for mutable strings you either have to allocate enough memory to fit the
final result or have some kind of rope data structure. Or else you end up
copying it anyway.
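
Interning is directly observable (sys.intern is the explicit form; the
strings below are built at runtime so the compiler can't fold them into a
single shared literal):

```python
import sys

# Two equal strings constructed at runtime: typically distinct objects...
a = "".join(["wid", "get"])
b = "".join(["wid", "get"])
assert a == b

# ...until both are interned, after which there is exactly one
# canonical instance, so identity comparison succeeds.
ia, ib = sys.intern(a), sys.intern(b)
assert ia is ib
```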

------
peter_d_sherman
Great list of programmer laws, ideas, and understandings, not just "The Law Of
Leaky Abstractions" (which of course is a classic in its own right...)

------
crimsonalucard
It's like the 2nd law of thermodynamics, or a surjective map: information
is lost as you go higher and higher into the abstractions.

