
How not to structure database-backed web apps: performance bugs in the wild - godelmachine
https://blog.acolyer.org/2018/06/28/how-_not_-to-structure-your-database-backed-web-applications-a-study-of-performance-bugs-in-the-wild/
======
latch
I've worked on moderately busy backend platforms (~10K-20K rps handled on
roughly four E5-2650s, aiming for 5ms p95 response times).

It greatly depends on what you're doing, but for the majority of systems which
are read heavy (and that most certainly includes "dynamic" sites like Amazon
or Wikipedia), I hold to two major beliefs:

1 - Have very long TTLs on your internal cache servers with a way to
proactively purge (message queues) and refresh in the background. Caching
shouldn't be a compromise between freshness and performance. Have both!

2 - Generate messages/payloads/views asynchronously in background workers and
keep your synchronous path as streamlined as possible (select 1 column from 1
table with indexed filters). Avoid serialization. Precalculate and denormalize.
Any personalization or truly dynamic content can be done: 1 - by having the
client make separate requests for that data; 2 - by merging the data into the
payload with some composition; 3 - by gluing bytes together (easier/safer with
protocol buffers than JSON).

Do things asynchronously. Use message queues / streams.
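A minimal Python sketch of point 1, assuming nothing beyond the stdlib: from the reader's point of view entries never expire; freshness comes from an out-of-band refresh hook (driven by a message-queue consumer or a background worker), not from a TTL. All names here are illustrative.

```python
import threading

class BackgroundRefreshCache:
    """Long-TTL cache: readers always get the stored value immediately;
    freshness is maintained by out-of-band refresh, not by expiry."""

    def __init__(self, loader):
        self._loader = loader      # the expensive computation
        self._lock = threading.Lock()
        self._values = {}          # key -> cached value

    def get(self, key):
        with self._lock:
            if key in self._values:
                return self._values[key]    # hot path: never recomputes
        value = self._loader(key)           # only the very first request pays
        with self._lock:
            self._values.setdefault(key, value)
            return self._values[key]

    def refresh(self, key):
        # Called by a background worker (e.g. on a message-queue event),
        # so readers keep seeing the old value until the new one lands.
        value = self._loader(key)
        with self._lock:
            self._values[key] = value
```

The point of the design is that `refresh` recomputes *before* replacing, so there is never a window where readers block on a cold cache.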

Beyond that, GC becomes noticeable. For example, Go's net/http used to (and
might still) allocate much more than third-party alternatives.

~~~
blattimwind
0 - Caching antipattern 101:

    
    
        key = calculate_cache_key()
        if not cache.has(key):
            data = expensive_calculation()
            cache.store(key, data)
        else:
            data = cache.get(key)

~~~
memset
This is interesting because this is generally how I implement caching! What
would pseudocode look like for a non-antipattern?

~~~
brightball
Depends on what you are doing.

To frame it another way, what happens if 100 requests come in at the same time
for that expensive value when it isn’t in the cache yet? The expensive
calculation will be run 100 times at the same time.

Ideally, you’d rather refresh the cache value in the background once and never
allow duplicate requests for it from the web.

If you’re running a language that makes it easier to deduplicate requests for
certain data, the original approach will last longer. The CacheEx library in
Elixir, for example, will only run the expensive calculation once, set the
cache and then send the value back to everything that requested it while it
was loading.
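The same only-run-once behaviour can be approximated outside the BEAM too. A hedged Python sketch (this is not the actual CacheEx implementation, just the idea: first caller computes, concurrent callers wait for that result):

```python
import threading

class SingleFlightCache:
    """First caller for a key computes the value; concurrent callers for the
    same key wait for that one result instead of recomputing it."""

    def __init__(self, compute):
        self._compute = compute
        self._lock = threading.Lock()
        self._cache = {}     # key -> finished value
        self._inflight = {}  # key -> Event set when the value is ready

    def get(self, key):
        while True:
            with self._lock:
                if key in self._cache:
                    return self._cache[key]
                event = self._inflight.get(key)
                if event is None:
                    # We are the first caller: claim the computation.
                    event = self._inflight[key] = threading.Event()
                    owner = True
                else:
                    owner = False
            if owner:
                value = self._compute(key)  # runs once per key
                with self._lock:
                    self._cache[key] = value
                    del self._inflight[key]
                event.set()
                return value
            event.wait()  # someone else is computing; loop and re-check
```

With this in place, 100 simultaneous requests for a cold key trigger one expensive calculation, not 100.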

~~~
rapind
CacheEx sounds interesting. Basically a debounce. Pretty sure you could solve
this outside of the application layer with Varnish but that depends on how the
view is composed. I prefer using a grace / stale period but that only works if
it's acceptable to return stale data during computation instead of queuing it
up.

~~~
brightball
Well, CacheEx is just using functionality that comes naturally on the BEAM
here. This is one of the reasons that a lot of CDNs, like CloudFront, are
written in Erlang.

CacheEx gets the ability to check for the presence of the cache key in Erlang
Term Storage (ETS) which is basically an in-memory cache. If the key is
present, it just returns the value.

If it's not, it checks to see whether a process exists with the cache key
name. If there isn't one, it creates one to request the resource.

For any other requests that come in until the value has been created, they
will be directed to the process that is getting the value.

When the process that was calculating things comes back, it will save the
value to ETS and then also send it back to all of the queued processes that
have been waiting for it.

In the case of Varnish, you'd be expecting it to send back the entire
completed view...HTML and all. This isn't something that you need to worry
about with Elixir because the view is never actually rendered in the
application. It's broken down into pieces that are never duplicated in memory
and then replayed directly to the socket...meaning you really only ever need
to cache expensive data and not what it's transformed into.

Here's a good read on why this view layer is so fast, if you're curious. Most
people report shock that their uncached performance with Elixir and Phoenix is
on par with statically cached HTML. I didn't believe it until I saw it myself.

[https://www.bignerdranch.com/blog/elixir-and-io-lists-
part-2...](https://www.bignerdranch.com/blog/elixir-and-io-lists-part-2-io-
lists-in-phoenix/)

~~~
rapind
That's pretty awesome. Varnish is my go to for Rails apps (and almost always
necessary).

------
knesek
When I was inexperienced I feared ORMs because of the negative performance
impacts I'd read they could have. I constantly worried about what would
happen if the amount of data increased and I hit an ORM-induced problem that I
could not resolve without a major rewrite of the data access layer. However,
whenever I've actually hit those problems in production, I found the same
thing the authors of the article did - ORM-induced performance problems can be
fixed by 1-5 lines of code changes, if you know where the actual problem is.
In fact, I learnt that there are no "ORM-induced performance problems", there
are only problems induced by a lack of understanding of how the ORM works.

As for knowing where the performance problem is, I find that skill (which I
think should be basic) is in fact very elusive in the engineers I encounter.
When I interview people, even those at what would be upper-intermediate to
senior level in terms of years of experience, an alarmingly small number have
ever done, or could do (or even would do), performance profiling like that
described in the article. Yet they do describe how they changed this ORM for
that ORM, this DB for that DB, this language for that language, in the name of
higher performance.

I've seen or heard of:

\- Teams spending weeks exchanging a SQL DB for a NoSQL DB because of an
unsolvable performance problem. When hitting the same problem with the NoSQL
DB, they find that the addition of a simple index is the solution in both
cases

\- Teams spending weeks exchanging Hibernate for OpenJPA in a complex
application, because of performance, without doing any performance analysis,
because they've read an article that says Hibernate is slow

\- Teams choosing complex architectures they don't really understand, for
performance reasons, without being able to articulate performance requirements
of the system they're building

These days, whenever someone mentions performance as a reason for anything, I
judge their competence based on their response to the question "And how are
you measuring and monitoring it?"

~~~
raarts
‘Unsolvable performance problems’ that could have been fixed by adding an
index??

How did these team members pass their job interviews?

By practicing algorithm puzzles?

~~~
blihp
Sadly, this occurs far more often than you might think. Large companies
especially do it to themselves, given how they structure their teams, with
developers and DBAs on different teams reporting to different managers (i.e.
you can trace their respective management chains separately until just below
the C level). These reporting structures are problematic since large companies
tend to hire people with very narrow skillsets, which makes it vital that
these groups work together. The end result is that you have teams throwing
things over the wall to each other, not knowing or caring what is happening on
the other side of the wall... dysfunction by design.

------
projectramo
The real joy of this article was the following:

Some academics (Yang, Subramanian, Lu, Yan, Cheung) were able to produce
massive improvements in about a dozen large, mature, battle tested open source
projects using just a few lines of code.

This should give hope to all those tepidly trying to get into open source.
Just go and take a look at the dozens of open source projects in Django or
whatever and you could improve the performance by keeping an eye on the ORM.

Better yet you might find another thing that makes them even better with ease.

Of course, I should add, I think the really clever thing the academics did is
to come up with this random link clicking program to time the worst load times
of the projects. That whole setup was gold.

~~~
Cthulhu_
What's intimidating to me is that to make a change like that, you first need
to get the code, get it and the tests running, and actually understand what
the code does - and whether your fix has the equivalent result or is subtly
broken.

They didn't just change a single line in an application, they wrote a huge
benchmark suite and dug through miles of code to find issues like this. I've
no clue how much time they spent on it, it must've been months.

~~~
mklepaczewski
This is not necessarily true. You would be surprised how many absolutely
trivial performance issues can be found in almost every project. Sometimes
it's really just about moving computation of a constant value out of a 'for'
loop.

~~~
soberhoff
I once sped up a program by 90% by turning `new String("foo")` into just
`"foo"`. The project was still rotten though, so it ultimately didn't matter.

~~~
sut-123
Your example sounds interesting, could you explain more

~~~
soberhoff
Are you familiar with Java? If you use just `"foo"`, then the string will be
interned and reused. If you use `new String("foo")`, each call creates a new
copy on the heap. This call was inside a very hot loop, thus eating almost all
the application's runtime.

~~~
sut-123
Thanks a lot for your explanation.

------
mping
My 2 cents: if you don't know what queries your ORM API is generating, you
shouldn't use one. I mean, sometimes there are bugs here and there, but people
should know the methods they call.

It's the equivalent of making a rest call and complaining about latency. Yeah,
it's a method, why is it taking so long?

~~~
arwhatever
That's an excellent point. I'm very comfortable with my ORM of choice as well
as its performance, because I've spent time running SQL Profiler while
stepping through my own code to learn what my ORM does, and when it does it.

------
krainboltgreene
There are some really glaring issues in this article. For example, the author
suggests that each -> update is worse than update_all, and while the
performance may be better, it's comparing apples to oranges! update_all, in
ActiveRecord, bypasses business logic, bypasses updating updated_at, bypasses
validation, and more. It's not the same thing at all. There is no difference
between where.first and find_by; that is just a bikeshed. And not every
map(column) can be turned into pluck(column) if there are business-logic
wrappers around that column.

------
siscia
I never understood why I don't enjoy working with ORMs, but reading this
article made it clearer for me.

They blur the distinction between working in memory and accessing the DB.

On one side it is very convenient; however, I really feel that such
performance-sensitive operations should be carefully considered and that most
SQL should be written by hand.

~~~
olavk
Writing SQL by hand does not solve typical performance problems like n+1
queries. If on the other hand you know enough to avoid n+1 queries, then you
can also avoid them when using an ORM, and save a lot of work.

If you like writing SQL by hand, by all means do so, but you will not
automatically get better performance by handwritten SQL as compared to ORM
generated SQL.
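To make this concrete, here is the same n+1 shape and its fix in plain Python + sqlite3 (toy schema invented for illustration). The round-trip count is a property of the access pattern, not of whether an ORM generated the SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'first'), (2, 1, 'second'), (3, 2, 'third');
""")

# The n+1 shape: one query for the list, then one more query per row.
# Hand-written SQL or ORM-generated, the round-trip count is identical.
def titles_n_plus_1():
    result = []
    for _, author_id, title in conn.execute(
            "SELECT id, author_id, title FROM posts ORDER BY id"):
        (name,) = conn.execute(
            "SELECT name FROM authors WHERE id = ?", (author_id,)).fetchone()
        result.append((title, name))
    return result

# The fix is also the same in both worlds: one round trip
# (a JOIN here; select_related / eager loading in an ORM).
def titles_joined():
    return list(conn.execute(
        "SELECT p.title, a.name FROM posts p"
        " JOIN authors a ON a.id = p.author_id ORDER BY p.id"))
```

Both functions return the same rows; the first issues 1 + N queries, the second exactly one.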

~~~
marcus_holmes
No, it's not automatically more performant; it's more that you can optimise
hand-written SQL for your specific schema and use case. You can't do that with
ORM-written SQL.

~~~
olavk
Depends on what kind of optimizations, and on the ORM I guess. The major
problems described in the article, like n+1 queries and filtering on the
client rather than in the database seem to be _easier_ to fix in an ORM than
in hand-written SQL with boilerplate wrappers. (Of course they are also easier
to _introduce_ in an ORM if you don't know what you are doing, but such is the
curse of powerful abstractions.)

Even things like query hints can be easily applied in the ORMs I know. It is
kind of hacky, since it breaks abstraction layers - but so are query hints in
SQL.

But of course there can be some special cases where you just have to drop down
to raw SQL for some reason. All the ORMs I know allow this.

------
superasn
This may be a little OT (OT because the points raised in the study are totally
valid and mine is just a comment), but for small companies and solo developers,
an ORM or whatever gets the job done quickly is the way to go.

Most sites and web apps never even break the 100k-hits/day mark, at which
point I believe inefficiency may not be the biggest issue. But wasting a month
trying to write raw SQL queries can hurt your project a lot more.

~~~
mnm1
Even if you get 1 hit / day and that takes a long time to return, the ORM has
failed your one customer. Inefficiency exists even at small scales with ORMs.
If your developers don't know how to write SQL, let them learn. Or fire them
if they won't.

~~~
blattimwind
> Inefficiency exists even at small scales with ORMs.

In other words, you have no idea what you're talking about.

~~~
mnm1
So the ORM taking many seconds for query overhead isn't a problem? This makes
the web page served by this api many seconds slower. A second of slowness
increases bounce rate by quite a bit and by two to three seconds, most
visitors abandon the website. I guess that doesn't matter to you as you post a
comment without substance and only an insult.

~~~
falsedan
I think you have bigger fish to fry than site performance if you're getting 1
hit per day.

~~~
mnm1
I think you completely missed the point here. If this happens for one
customer, it happens for all of them and no amount of hardware thrown at the
problem is going to solve this because it's already running on the best
hardware available.

------
Rafert
Interestingly their code checker is just a bunch of regular expressions:
[https://github.com/hyperloop-rails/static-
checker](https://github.com/hyperloop-rails/static-checker) \- I would've
expected custom Rubocop rules.

Rubocop already knows about `where.first? => find_by` for example:
[https://github.com/rubocop-
hq/rubocop/blob/master/lib/ruboco...](https://github.com/rubocop-
hq/rubocop/blob/master/lib/rubocop/cop/rails/find_by.rb)

~~~
Balf
They suggested naïve sequential string replacing for html templating, so I
would absolutely expect a bunch of regular expressions.

------
crocal
... Which begs the question: what good is an ORM if it does not prevent such
issues by design? Here, we are essentially saying that users of an ORM must
also have the SQL version in mind. Or call in an expert after the mess is made
:/...

~~~
FooBarWidget
I find this question to be very weird. I have never seen ORMs as tools for
completely abstracting away the database. I have always seen them as
convenience APIs for using the database. They exist to make your code shorter,
less repetitive, more readable, easier to reason about, and more maintainable;
not to make you forget about the database altogether.

I also don't see ORMs as exclusive. It's fine to use ORMs for 95% of your use
cases, but drop down to raw SQL for the remaining 5% where it's not a good
match. That's still a win for the goals I mentioned above.

~~~
doesnt_know
> I have never seen ORMs as tools for completely abstracting away the
> database.

Some are advertised that way. Entity Framework Code First, Migrations etc.

At our place, our DB dev team is larger than our api team, which in turn
collectively dwarfs our front end teams. ORMs in this sort of environment have
never really been given a chance, but I feel like it would be easy to justify
them in many situations...

~~~
adrianmsmith
> Some are advertised that way

As an example of this, the influential DHH of Basecamp blogged saying just
that: [https://m.signalvnoise.com/conceptual-compression-means-
begi...](https://m.signalvnoise.com/conceptual-compression-means-beginners-
dont-need-to-know-sql-hallelujah-661c1eaed983) "Basecamp 3 has about 42,000
lines of code, and not a single fully formed SQL statement as part of
application logic!"

~~~
sudhirj
Given that these are the people who wrote the Rails ORM, you'd expect them to
know how to use the ORM to generate high-quality SQL. Which is actually
quite doable - in Rails / ActiveRecord you're much better served by knowing
what happens for every ORM call, and the default development log prints every
generated SQL query as well. I think it now also provides alerts when the
queries are slow.

~~~
adrianmsmith
True, but my point was that that blog post supported the point that ORMs are
sometimes promoted as a way to avoid needing to know SQL. The intended
audience of that blog post was not people who write ORMs, it was to persuade
people who are writing applications that they don’t need to learn how to use a
database, that they only need to learn how to use an ORM.

~~~
sudhirj
Yeah, that isn't going to work. You'd want to use an ORM the way you'd use a
bicycle - it won't necessarily let you do something you couldn't before, but
it makes it easier. Using an ORM without knowing the SQL it generates is like
learning to ride a bicycle without learning to walk. You'll fall down at some
point and then you're well and truly screwed.

------
mendelk
Having worked many years both with ORMs and with "old-school" devs slinging
raw SQL strings and everything in between, I've come to the conclusion that
"ORMs don't kill [performance], people do". :)

------
brightball
I experienced fixing most of these first hand. I think 90% of the issues come
from developers who haven't worked directly with SQL enough to understand what
the ORM is doing.

The experience of building apps with hand coded SQL, while not good for a
project, is an excellent teacher.

------
thermodynthrway
It's a bit high level to mention this here, but the article doesn't cover the
one query problem I see constantly: paging using OFFSET and LIMIT. I swear
every app I've worked on uses it somewhere. And it's horrifically inefficient
90% of the time.

~~~
vorpalhex
Can you elaborate on the issues with that? I've used that pattern and haven't
found any major woes (yet), but also don't usually paginate complicated
queries.

~~~
bpicolo
In Postgres and MySQL, at least, the database has to scan and discard all of
the skipped rows every time you run a LIMIT/OFFSET query. It gets
progressively slower as your offset increases.

The efficient way to handle it is to put a lower bound on the pkid (or another
monotonically increasing column) using the last record fetched, and fetch in
ascending order, e.g.

    
    
        SELECT * from my_table where id > (last_row_id_seen) ORDER BY id asc limit 20;
    

Then you have an indexed jump to the correct rows to return. It's pretty
straightforward to abstract a batched database iterator with this pattern for
use application-wide (usually can be done in ~20-50 loc), but no ORM that I'm
aware of supports this pattern natively.

Do note that, by extension, it's impossible to efficiently batch iterate over
tables that don't have a unique, orderable key in postgres and mysql.
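A rough Python version of that batched iterator, with sqlite3 standing in for the database. It assumes a trusted table name and an integer primary key column named `id` (both assumptions, not requirements of the technique):

```python
import sqlite3

def iter_rows_keyset(conn, table, batch_size=1000):
    """Iterate a whole table in batches by seeking past the last seen id,
    so each batch is an indexed range scan instead of an OFFSET re-scan."""
    last_id = -1  # sentinel below any real id
    while True:
        # NOTE: `table` must be a trusted identifier; only values are bound.
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE id > ? ORDER BY id ASC LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]  # first column is assumed to be `id`
```

Each batch costs an index seek plus `batch_size` rows, regardless of how deep into the table you are, which is exactly the property OFFSET lacks.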

~~~
kraftman
What if you need to order by something other than id?

~~~
thermodynthrway
Then, generally, give the user a bunch of filtering options and limit page
depth to less than ~1000 entries

------
guiriduro
Wouldn't it just be better to invert the relation (sic) between App-ORM and
DB, and have everyone better understand what data modelling and a DBMS is,
then write SQL and some reverse-ORM to expose App services in SQL?

E.g. EXPOSE SERVICE(REST, GraphQL) BillOfMaterials (VARCHAR arg) AS (SELECT
... WHERE ... = arg etc.)

~~~
girvo
Yep! See here[0] and here[1] for that idea done with Postgres

[0] [https://postgrest.org/en/v5.0/](https://postgrest.org/en/v5.0/)

[1]
[https://www.graphile.org/postgraphile/](https://www.graphile.org/postgraphile/)

------
ssijak
ORM for saving objects and very simple queries. Writing SQL or using something
like JOOQ in Java to write type-safe SQL for everything else.

~~~
collyw
I disagree. I have used Django extensively. The ORM handles raw SQL queries,
which is great when you need something beyond the capabilities of the ORM, but
you lose a lot as well when you go that route.

Pagination and sorting are pretty easy additions when you retrieve data using
the ORM in the standard way, but with raw SQL you need to add extra code to
handle those specifically. I don't think you can use the Django admin with raw
SQL (I never tried, but it doesn't make much sense, as it basically generates
a set of views per table). Model methods don't make sense using SQL queries.

You should be able to write SQL if you want to be able to use the ORM
effectively, and I am certainly glad I knew it well before I started using
Django's ORM.

~~~
ssijak
What makes you think the Django ORM is a good ORM? What is bad about adding
"extra code" if that code is doing a useful and correct thing? And I wonder
why people find the Django admin useful for anything other than simple CRUD
apps. Writing code so it can go along with the Django admin is a strange way
of thinking; there should be other priorities first.

------
rmc
One of the example Rails applications they use is the code which powers the
OpenStreetMap website.

They populated their install by randomly filling in fields on the website.
Which doesn't include any map editing! For OSM they suggest changing how the
_diary feature_ operates, which is a tiny, almost irrelevant part of the OSM
website software stack. The OSM database has millions of geographic objects,
and they talk about the diary system on the website.

> For example, when we profile the latest version of Openstreetmap, a
> collaborative editable map system, we find that a lot of time is spent on
> generating a location_name string for every diary based on the diary’s
> longitude, latitude, and language properties stored in the diary_entry table

The paper claims to have filed bug reports, and has URLs. But those links
don't exist.

Paper: [https://hyperloop-
rails.github.io/220-HowNotStructure.pdf](https://hyperloop-
rails.github.io/220-HowNotStructure.pdf) openstreetmap-website:
[https://github.com/openstreetmap/openstreetmap-
website/](https://github.com/openstreetmap/openstreetmap-website/) Claimed
Issues submitted: [https://github.com/hyperloop-rails/issues-
summary](https://github.com/hyperloop-rails/issues-summary)

~~~
lelelelemon
Thanks for looking into this - I am one of the authors of the study. We have
updated the submitted issues link to [https://github.com/hyperloop-
rails/study-replication/tree/ma...](https://github.com/hyperloop-rails/study-
replication/tree/master/issues-summary). Let us know if you have any further
questions.

------
pjungwir
> Inefficient Rendering (IR). This one was a surprise to me

He's talking about the common Rails pattern of rendering a small Haml/ERB
partial over & over in a loop. I've noticed big perf hits from this before,
but never completely understood why. Rendering the exact same result directly
in the loop body, without calling a second partial, gives a big speedup.

There is an extensive discussion here:

[https://softwareengineering.stackexchange.com/questions/1571...](https://softwareengineering.stackexchange.com/questions/157108/rails-
does-using-partials-slow-views-rendering)

I would love to have some better understanding around this, although it sounds
like no one has a clear idea of the cause.

In development, I've noticed that Rails re-reads the partial file off disk
_every iteration of the loop_. I don't know if that happens in production too,
but if so it would explain a lot.

------
syastrov
I’ve experienced a lot of n+1 query problems as causes of bad performance.
Often this was a result of wrapping the ORM in abstraction layers (for
business logic and out of fear of being “locked in” to the ORM). We rewrote
that part of the application using a different ORM (in Python). Making use of
a tool that could find these problems automatically helped greatly, and we
didn’t have a single performance problem when the rewritten service went into
production.

This module can detect the n+1 queries problem automatically in Python ORMs:
[https://github.com/jmcarp/nplusone](https://github.com/jmcarp/nplusone)

Looking forward to seeing more automated tools like this in the Python/Django
world.

~~~
itsderek23
Scout also detects these for Django, ordering by the worst-performing N+1s:
[http://blog.scoutapp.com/articles/2018/04/30/finding-and-
fix...](http://blog.scoutapp.com/articles/2018/04/30/finding-and-
fixing-n-1s-in-django-apps)

~~~
sut-123
And bullet is used for Rails to detect N + 1 queries.
[https://github.com/flyerhzm/bullet](https://github.com/flyerhzm/bullet)

------
yread
This is cool. Are there more papers as practically useful for a (web)
developer as this one?

------
vladmihalcea
Most ORM-related problems are related to lack of knowledge of how the tool
works.

However, if you understand how databases work, how to tune the driver and how
to get the ORM tool to generate the same queries you'd otherwise write
yourself, then you are fine.

For more details, check out these [14 High-Performance Persistence
Tips]([https://vladmihalcea.com/14-high-performance-java-
persistenc...](https://vladmihalcea.com/14-high-performance-java-persistence-
tips/)).

------
unilynx
Pretty cool to read. We built a (currently proprietary) CMS with our own
scripting language, and instead of going the ORM way, we merged basic SQL into
the language itself. We did that mostly to eliminate sending raw strings to
databases (and all the injection risks and complexity that comes with it) but
it does allow a few extra optimisations because the compiler can look at both
the query and the language using it.

So if I do our equivalent of the ORM API Misuse case:

    
    
      IF(RecordExists(SELECT * FROM schema.variants WHERE track_inventory = 0))
      {
        ... 
      }
    

(RecordExists is a function that only checks whether the query returned
something, and has been marked that way) the compiler will already reduce this
to:

    
    
      IF(RecordExists(SELECT FROM schema.variants WHERE track_inventory = 0 LIMIT 1))
      {
        ... 
      }
    

And likewise a function that selects all columns from a database and returns
only one field, has the select reduced to only selecting that one column.

The drawback, of course, is that any SQL features of the underlying database
not exposed by the scripting language, are unreachable unless you fallback to
sending raw query strings again.

~~~
psadauskas
Of course it depends on the exact circumstances, but your query would probably
be faster as `SELECT COUNT(*)` than the `LIMIT 1` you optimize to now. In
fact, that’s one of the specific optimizations mentioned in the article.

~~~
adamzochowski
count(*) ends up counting all matching records; reading the first row is
much, much faster than finding and counting them all.

[https://wiki.postgresql.org/wiki/Slow_Counting](https://wiki.postgresql.org/wiki/Slow_Counting)
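The difference is easy to see even in a toy sqlite3 example (schema invented for illustration): both queries answer the same yes/no question, but COUNT(*) has to touch every matching row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants (id INTEGER PRIMARY KEY, track_inventory INTEGER)")
conn.executemany("INSERT INTO variants (track_inventory) VALUES (?)",
                 [(i % 2,) for i in range(10000)])

# COUNT(*) visits every matching row just to answer a yes/no question...
def exists_via_count():
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM variants WHERE track_inventory = 0").fetchone()
    return n > 0

# ...whereas LIMIT 1 can stop at the very first match.
def exists_via_limit():
    return conn.execute(
        "SELECT 1 FROM variants WHERE track_inventory = 0 LIMIT 1"
    ).fetchone() is not None
```

On this table the count version walks ~5000 rows; the LIMIT 1 version stops after one.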

------
acutesoftware
I need a very good reason before using any external library, in an attempt to
keep the total code base as clean as possible.

It is just too easy to be rushed and bring in a heap of code, so I prefer to
use SQL instead of ORMs.

All access to the database is done in a single module and they are wrapped in
functions like below

    
    
        def get_table_as_list(user_id, cols, tbl, where_clause, params_as_list, conn_str, order_by="1", maxrows='2000'):
            """
            This should be the ONLY place that selects from the database
            """
            db = get_db_conn(conn_str) 
            cur = db.cursor()
            where_clause += ' AND user_id = %s'
            sql = "SELECT " + cols + " FROM " + tbl + " WHERE " + where_clause + " ORDER BY " + order_by + " LIMIT " + maxrows
            params_as_list.append(str(user_id))
            cur.execute(sql, params_as_list)
            res = list(cur.fetchall())
            cur.close()
            db.close()
            return res
            

The database is designed and built first, then in the application the
definitions are done like below

    
    
        all_tables = [
            {'tbl':'as_note',
             'cols':['id','title','pinned', 'important','content','folder'],
             'col_types':['id','Text','Checkbox','Checkbox', 'Note','Text'],
            },
            {'tbl':'as_task',
            'cols':['id','Title','Pinned', 'Important','Notes','folder','Done'],
            'col_types':['id','Text','Checkbox','Checkbox','Note','Text','Checkbox'],
            }]
    
    

So far it is working well, and it is very simple to add new tables to the
schema and have them working in the application.

~~~
diek
Unless you're performing some magic elsewhere in the codebase, this will leak
connections if an exception is thrown since you're not closing the connection
in a 'finally' block. Alternatively, depending on your version of Python, you
could use a 'with' context to ensure the connection is closed.
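For illustration, a sketch of that 'with' version (sqlite3 standing in for whatever driver get_db_conn wraps; the function name and signature are simplified, not the original's):

```python
import sqlite3
from contextlib import closing

def get_rows(conn_str, sql, params):
    # closing() guarantees close() runs even if execute() raises,
    # which is what the original's missing try/finally would have done.
    with closing(sqlite3.connect(conn_str)) as db, \
         closing(db.cursor()) as cur:
        cur.execute(sql, params)
        return list(cur.fetchall())
```

Note that contextlib.closing is used deliberately: sqlite3's own connection context manager manages transactions, not closing.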

~~~
acutesoftware
Good point, thanks for that - there is a lot of error handling I haven't
shown, but I hadn't taken connection leaks into account.

------
mnm1
ORMs are really only useful for throwaway projects and beginners. I have yet
to see one without serious downsides in both performance and speed of
development, something that they are touted to improve but actually make
worse. The ORM I'm stuck with now (Doctrine2) adds a 10x overhead to queries.
For the most part, we don't even bother optimizing queries in such situations
because why waste time on something that could at most improve performance by
1/10th? The real fix is to get rid of the ORM. That requires a rewrite of two
apps that have been in development for four years each. I can't think of a
worse architectural decision for business (CRUD) software than using an ORM.
Irreversible and forever fucked.

~~~
thraxil
Don't know much about Doctrine2, but that sounds pretty terrible. I'm sorry
you are forced to work with something so inefficient.

I've been building apps with Django and Django's ORM for the last 10 years and
found essentially zero overhead in most cases. Every once in a while there's a
slow page, I open up the debug toolbar which shows me every SQL query that was
used to generate the page in a nice waterfall diagram, I see something a
little odd and change the order of some filters or add a `select_related`, or
`prefetch_related` or discover that some third party library is making a dumb
call (unavoidable problem of using third party libraries on any platform) and
find a workaround for that. Every once in a blue moon, it appears easier to
write a raw SQL statement than figure out what needs to be done to get the ORM
to generate it, so I do that. Of course, Django's ORM makes it stupid easy to
use a raw SQL statement:
[https://docs.djangoproject.com/en/2.0/topics/db/sql/](https://docs.djangoproject.com/en/2.0/topics/db/sql/)

I've worked with a few other ORMs in Python and other languages over the years
as well (in Go, Erlang, Elixir, Clojure, nodejs) and never really encountered
any where the ORM had a noticeable performance overhead (dominated by the
network latency back and forth from the database) and I've yet to work with
one that I couldn't just do a raw query when needed. The closest I've seen is
when I started using GORM in Go, I found it running slowly and discovered that
it automatically adds a "soft delete" functionality so every query gets an
additional "and not is_deleted" clause added. Disabled that feature and it was
fine. That was my own fault though for starting to build before I finished
reading the documentation, as it was pretty clearly explained in a later
section. (OK, also the ORMs I was using in Perl and Java back in the late
90's/early 00's were also pretty terrible, but those were prehistoric times.)

You always have to be careful of N+1 problems, but that's not just an ORM
thing. You run into that as soon as you have any abstraction in your code.
Once you have refactored to `get_list_of_items(some criteria)` and
`get_item_details(item)` functions/methods/whatever, whether it is using an
ORM underneath or raw SQL, developers working on the app have to know that
they can't loop over the results of the first and call the second on each of
them. Tradeoffs between reusability and performance are nothing new though and
not at all specific to web applications or ORMs.
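To make the N+1 shape concrete, here's a minimal sketch using Python's stdlib sqlite3 (the tables and data are invented for illustration): the first version issues one query per list item, the second fetches the same data in a single JOIN, which is roughly what `select_related`/`prefetch_related` arrange for you behind the scenes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'B1');
""")

# N+1 shape: one query for the list, then one extra query per item.
authors = conn.execute("SELECT id, name FROM author").fetchall()
n_plus_1 = [
    (name, [t for (t,) in conn.execute(
        "SELECT title FROM book WHERE author_id = ? ORDER BY id", (aid,))])
    for aid, name in authors
]

# Same data in one round trip with a JOIN -- what the ORM helpers generate.
joined = conn.execute("""
    SELECT a.name, b.title
    FROM author a JOIN book b ON b.author_id = a.id
    ORDER BY a.id, b.id
""").fetchall()
```

With two authors the difference is invisible; with ten thousand, the first version is ten thousand and one round trips.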

~~~
pbowyer
I use Doctrine2 in a number of applications. I have a love/hate relationship
with it (mostly through fitting it to legacy database schemas) but it's not
slow to put/retrieve data from the database. What's slower is the object
mapping (hydration); and if you map database rows and relations into objects
yourself then your overhead is going to be similar, just labelled as
'application' overhead rather than 'ORM' overhead.

The 10x slowdown I noticed was moving from native database functions (PHP's
PDO or MySQLi extensions) to Doctrine2's underlying database abstraction
layer, DBAL. Doing a prepared query (with the same SQL) in the native
extension was 10x faster than DBAL, which uses the native extension under the
wrapper. I never got to the bottom of why - DBAL is doing more (fair enough),
but not enough to explain that amount of slowdown.

~~~
mnm1
The only way manual processing of the data from the db could take as long as
doctrine hydration is if it's written poorly. An app developer who knows how
to write apps properly would never do hydration or any of the stupidity
doctrine does. Why would one map database rows into objects in the first
place? I've rewritten huge swaths of one of our doctrine apps getting a
roughly 5x to 10x improvement in speed and I left a lot of optimizations on
the table while still providing a general purpose serializer that uses
doctrine's meta data behind the scenes.

As for dbal, I think the overhead is in statement preparation and parameter
expansion (for array params) and it can be avoided by using the native
functions, though that overhead was never 10x for me.

------
ahallock
The problem with ORMs is that they make DB calls seem cheap, and when your
application is young and you don't have a lot of rows, everything works
wonderfully. Then the N+1 problems and the SELECT COUNT(*) performance issues
start adding up. Since the ORM code is baked into the models, your code is
littered with these calls -- in the service layer, views, background workers.

That's why I really like Promises/Futures in Javascript. You know exactly when
you're executing a DB operation and you have to think about the implications
more. It's not just a simple object accessor.

------
ianamartin
The only big problem with ORMs is when developers are allowed to use them to
construct database schemata.

As long as you have a not-crazy person in charge of the initial design, you
can allow—and even encourage—developers to make versioned changes to the
database.

If you let us start from scratch with ORMs, it will suck for everyone.

There’s a fundamental mismatch between what developers think of as objects and
how relational designers think of tables.

ORMs can smooth that over. But nothing can undo the wrong-headedness of a
developer thinking of a relational DB as an object store.

------
TekMol
Give me an O! Give me an R! Give me an M!

What does that spell? SLOW PERFORMANCE!

Today's programmers don't understand data. They understand frameworks. To find
the number of cars whose insurance has expired, they write:

    
    
        10 Nr=0
        20 Hey framework, give me all cars!
        Framework: Ok, here are 8001093 business objects representing all the cars in our DB. Each has all the attributes the car has. Color, mileage etc.
        30 Thanks!
        40 Foreach Cars as Car
        50 If Car->insurance_end_date < FancyDateLib.now() Nr++
    

I see variants of this everywhere. And the performance impact is hard to top:
it's often several million times slower than a simple SQL query.

But beefy hardware with lots of RAM takes care of it. 'Our software
is enterprise grade, so of course it cannot run on commodity hardware.'
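The contrast can be sketched with Python's stdlib sqlite3 (the `cars` table, its columns, and the fixed date are invented for illustration): the framework-style version pulls every row into the application and counts there; the SQL version ships back a single integer.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars (id INTEGER PRIMARY KEY, insurance_end_date TEXT)")
conn.executemany("INSERT INTO cars (insurance_end_date) VALUES (?)",
                 [("2000-01-01",), ("2099-01-01",), ("2001-06-30",)])

today = date(2018, 6, 28).isoformat()

# Framework-style: fetch every row, then count in the application.
out_of_insurance = sum(
    1 for (end,) in conn.execute("SELECT insurance_end_date FROM cars")
    if end < today)

# Database-style: let SQL count; only one integer crosses the wire.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM cars WHERE insurance_end_date < ?",
    (today,)).fetchone()
```

At 8 million rows the first version materialises 8 million objects; the second touches only an index.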

~~~
patkai
I love to write SQL queries and even use them extensively in my code. However,
almost everyone I talk to resists this, and warns me that "one day you will
regret..." It gives me an uneasy feeling that I actually might, though it
hasn't happened yet.

~~~
weavie
I think one of the problems with SQL queries is that as far as your
application is concerned they are just strings. There's no type information
or even syntax checking. And if you do something like 'select * from', the
shape of the results isn't deterministic.

If you scatter these throughout your code and then the database schema
changes, you have a hell of a refactoring job to make sure everything still
works. One advantage of an ORM is that if you keep your objects in line with
your database, the generated SQL will stay correct.

Personally, I am more than happy to take that hit. I think it is a price well
worth paying in order to have optimised queries that do exactly what I want
them to do. To make it easier for myself, though, I will always try to keep my
queries in one place in the code. Then all my code needs to know is that it is
getting clients_older_than(32) or whatever..
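A rough sketch of that "queries in one place" approach, using Python's stdlib sqlite3 (the table, data, and function names are invented): callers only ever see `clients_older_than`, so a schema change touches exactly one module.

```python
import sqlite3

def connect():
    # Stand-in for real connection setup; schema is illustrative only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clients (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO clients VALUES (?, ?)",
                     [("Ann", 41), ("Bob", 29), ("Cid", 35)])
    return conn

def clients_older_than(conn, age):
    # Parameterised query; one canonical place to update if the schema moves.
    return [name for (name,) in conn.execute(
        "SELECT name FROM clients WHERE age > ? ORDER BY age", (age,))]

conn = connect()
older = clients_older_than(conn, 32)
```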

~~~
adrianmsmith
You can get the best of both worlds by using e.g. jOOQ in Java. It allows you
to write e.g. db.select(MY_TABLE.MY_COL).from(MY_TABLE), where those values
are generated from the database, so they are guaranteed to exist and to be of
the right type.

It maps 1:1 to the SQL statement that gets executed, so there's no magic, e.g.
no extra N+1 queries being introduced without you noticing.

And if you change your schema and re-generate, you immediately get compile
errors showing you where you're referencing something that's now been deleted.
So it's not fragile like putting SQL in a String in your code (where, if the
schema changes, the compiler can't help you).

~~~
weavie
I wasn't familiar with jOOQ before. It looks pretty comprehensive.

~~~
matwood
Honestly, jOOQ is one of the killer libraries for Java that keep me using it
for backend code.

------
lukaseder
> “compared to other popular ORM frameworks such as Django and Hibernate,
> Rails has 2x more applications on GitHub with 400 or more stars than Django
> and Hibernate combined.”

As if the whole world was hosted on GitHub...

------
NightlyDev
How the hell can these pages have common load times of ~500ms? These must be
the worst-case pages, right? Hundreds of milliseconds is very slow if network
latency is out of the picture.

------
slifin
This should be translated for other back-end languages, maybe followed by a
guide on how to use a relational database as a relational database.

------
qaq
The problem with ORMs is they have to cater to the lowest common feature-set
of the supported RDBMS.

~~~
guipsp
This is not the case for pretty much any modern ORM

~~~
qaq
Sure it is. Show me the ORM that will support two-phase commit, window
functions, CTEs, and JSON operators in PG (just a random set of features).

~~~
bpicolo
SQLAlchemy supports all of the above - in general it's really terrific about
being extensible enough to support the corner cases.
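As an illustration of two of the features qaq lists, here is raw SQL run through Python's stdlib sqlite3: a CTE feeding a window function (window functions need SQLite >= 3.25; CTEs have been supported since 3.8.3). SQLAlchemy exposes the same constructs via, for instance, `.cte()` and `func.sum().over(...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A CTE (WITH ...) feeding a window function (SUM ... OVER).
rows = conn.execute("""
    WITH sales(region, amount) AS (
        VALUES ('north', 10), ('north', 20), ('south', 5)
    )
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, amount
""").fetchall()
```

An ORM that only targets the lowest common denominator can't express this; one with an escape hatch to full SQL, or first-class support for these constructs, can.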

~~~
qaq
It does look very nice and does support all of the above. I stand corrected
(not a Python dev. :))

------
mozumder
It's always amazing to see web pages take 2-4 seconds of back-end processing.
On a modern CPU, that's about 10 billion instructions.

10 billion instructions to send a few kilobytes of data.

There's so much waste in back-end server design. If your response times are
measured in seconds, and not in microseconds, you're doing something
seriously wrong.

~~~
czechdeveloper
From my experience, most of my waiting time is IO, not CPU load. Even simple
stuff, such as a SELECT COUNT(*) to display the total count in a grid over a
big table, can take seconds. It does not have to be complicated.
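One common workaround, sketched here with invented names using Python's stdlib sqlite3: maintain a denormalised row count with triggers, so the grid header never scans the big table. (Postgres users often read the estimate in `pg_class.reltuples` instead of an exact count.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT);
    -- Denormalised row count, kept current by triggers.
    CREATE TABLE row_counts (tbl TEXT PRIMARY KEY, n INTEGER);
    INSERT INTO row_counts VALUES ('big_table', 0);
    CREATE TRIGGER bt_ins AFTER INSERT ON big_table
        BEGIN UPDATE row_counts SET n = n + 1 WHERE tbl = 'big_table'; END;
    CREATE TRIGGER bt_del AFTER DELETE ON big_table
        BEGIN UPDATE row_counts SET n = n - 1 WHERE tbl = 'big_table'; END;
""")
conn.executemany("INSERT INTO big_table (payload) VALUES (?)",
                 [("a",), ("b",), ("c",)])
conn.execute("DELETE FROM big_table WHERE payload = 'a'")

# O(1) lookup instead of a full-table COUNT(*) scan.
(total,) = conn.execute(
    "SELECT n FROM row_counts WHERE tbl = 'big_table'").fetchone()
```

The trade-off is write amplification on every insert/delete, which is usually acceptable for a read-heavy grid.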

~~~
ianamartin
You have a problem. And it’s probably related to your db architecture, which
was probably designed by devs with ORMs.

~~~
olavk
What kind of database architecture could cause SELECT COUNT(*) to be slow?

~~~
kyberias
Any. "Architecture" has nothing to do with that. SELECT COUNT(*) is usually
"slow".

------
jlebrech
what people need isn't an ORM but a DSL, if only there was a Domain Specific
Language for Querying the database in a Structured manner.

