How not to structure database-backed web apps: performance bugs in the wild (acolyer.org)
486 points by godelmachine on June 28, 2018 | hide | past | favorite | 308 comments

I've worked on moderately busy backend platforms (~10K-20K rps handled on roughly four E5-2650 machines, aiming for 5 ms p95 response times).

It greatly depends on what you're doing, but for the majority of systems which are read heavy (and that most certainly includes "dynamic" sites like Amazon or Wikipedia), I hold to two major beliefs:

1 - Have very long TTLs on your internal cache servers with a way to proactively purge (message queues) and refresh in the background. Caching shouldn't be a compromise between freshness and performance. Have both!

2 - Generate messages/payloads/views asynchronously in background workers and keep your synchronous path as streamlined as possible (select 1 column from 1 table with indexed filters). Avoid serialization. Precalculate and denormalize. Any personalization or truly dynamic content can be handled by: 1) having the client make separate requests for that data, 2) merging the data into the payload with some composition, or 3) gluing bytes together (easier/safer with protocol buffers than JSON).

Do things asynchronously. Use message queues / streams.

Beyond that, GC becomes noticeable. For example, Go's net/http used to (and might still) allocate much more than some third-party alternatives.
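As a concrete illustration of point 1, here is a hedged single-process Python sketch of a long-TTL cache with background refresh (in production the purge/refresh trigger would come from a message queue as described; all names here are made up):

```python
import threading
import time

class RefreshAheadCache:
    """Serve cached values immediately; refresh stale ones in the background.

    Illustrative single-process sketch: `loader` stands in for whatever
    expensive computation or backing query you would otherwise run on
    the hot path.
    """

    def __init__(self, loader, soft_ttl):
        self._loader = loader
        self._soft_ttl = soft_ttl          # seconds before a value counts as stale
        self._lock = threading.Lock()
        self._entries = {}                 # key -> [value, stored_at, refreshing]

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                value, stored_at, refreshing = entry
                if (time.monotonic() - stored_at) > self._soft_ttl and not refreshing:
                    entry[2] = True        # claim the refresh so only one runs
                    threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
                return value               # always served without waiting
        # Cold miss: only the very first request for a key pays the full cost.
        # (Concurrent cold misses may duplicate work; add per-key locking if that matters.)
        value = self._loader(key)
        with self._lock:
            self._entries[key] = [value, time.monotonic(), False]
        return value

    def _refresh(self, key):
        value = self._loader(key)
        with self._lock:
            self._entries[key] = [value, time.monotonic(), False]
```

Readers never wait on a refresh: freshness comes from the background worker, not from short TTLs.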

0 - Caching antipattern 101:

    key = calculate_cache_key()
    if not cache.has(key):             # check...
        data = expensive_calculation()
        cache.store(key, data)         # ...then act: requests arriving in
                                       # between all recompute the value
    data = cache.get(key)              # read back the (possibly just-stored) value

Why is it an antipattern?

It’s susceptible to Thundering Herd whereby more requests come in for the same cache key before the initial computation is finished, and so you end up with lots of cache misses. The fix is usually to lock the cache key and have subsequent requests wait on the original computation but it’s a bit more complex to code.

Just be careful to lock the cache key, and not the cache. I've seen the latter done often, which of course kills performance if there are cache misses for multiple keys at the same time.
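A sketch of per-key locking in illustrative Python (single process and in-memory; a distributed cache would need a distributed lock instead):

```python
import threading

cache = {}                       # key -> computed value
meta_lock = threading.Lock()     # guards the two dicts; held only briefly
key_locks = {}                   # key -> lock serializing that key's computation

def get_or_compute(key, compute):
    with meta_lock:
        if key in cache:
            return cache[key]
        lock = key_locks.setdefault(key, threading.Lock())
    with lock:                   # blocks only requests for THIS key
        with meta_lock:
            if key in cache:     # double-check: another thread may have
                return cache[key]   # filled it while we waited for the lock
        value = compute()        # runs at most once per key
        with meta_lock:
            cache[key] = value
        return value
```

Note that `meta_lock` is never held across `compute()`, so misses on different keys proceed in parallel, which is exactly the property lost when the whole cache is locked.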

I've heard it called Cache Stampede. Any decent framework for memoizing method calls would cover this case though.

It depends on the backing; if you're using just memcached, there's no way to completely lock the key. None of the popular Rails caching frameworks I've seen handle this either.

Do you have an example of one? I'm curious what features they provide over ad hoc memoization mechanics.

If you're in the Java ecosystem, the CacheBuilder in Guava is pretty good: https://google.github.io/guava/releases/19.0/api/docs/com/go...

By default it handles the case of concurrent retrieval on the same key (the second one will just wait for the first one to finish and use that value rather than starting a duplicate computation). It also lets you configure more interesting things like eviction strategies, removal notifications, and statistics.

Last year was a lesson for me in why caches are a hard problem as I had to debug many cache issues from other people not thinking things through... (At least one of the issues was my own fault. :)) Since then whenever someone suggests we use a cache I instinctively pull out a set of questions[0] to ask. The three questions Guava has you consider can also lead you to using memcached or the like instead but my set tries to answer the question "Do you even need a cache?" and if so, generating helpful design documentation.

    Is the code path as fast as it can possibly be without the cache? Do you have numbers?
    Will the cache have a maximum size, or could it grow without bound?
    If it grows without bound, either because of unbounded entries or because of unbounded memory for any particular entry, under what conditions will it consume all system memory?
    If it has a maximum size, how does it evict things when it reaches that size?
    Are you trying to cache something that could change?
    If so, how is the cache invalidated?
    How can it be invalidated / evicted manually by another thread or signal? (Debuggability, testability, inspectability/monitorability, hit rate and other statistics?)
    Is there a race condition possibility for concurrent stores, retrieves, and evicts?
    How constrained are your cache keys, that is, what does it need to know about to create one?
    Do they need to take into account global info?
    Do they need to take into account contextual information (like organization ID if your application server runs in a multi-tenant system, or user ID, or browser type, or requested-language)?
    Or do they only depend on direct inputs to that code path?
[0] https://www.thejach.com/view/2017/6/caches_are_evil -- need to update it a bit but not much...

A coworker introduced me to Caffeine, which I think shares authors with Guava's cache. Kind of a 2.0 / lessons-learned iteration.


No it wouldn't, memoization can't provide responses to calls it hasn't seen yet, that's the whole point.

This is interesting because this is generally how I implement caching! What would pseudocode look like for a non-antipattern?

Depends on what you are doing.

To frame it another way, what happens if 100 requests come in at the same time for that expensive value when it isn’t in the cache yet? The expensive calculation will be run 100 times at the same time.

Ideally, you’d rather refresh the cache value in the background once and never allow duplicate requests for it from the web.

If you’re running a language that makes it easier to deduplicate requests for certain data, the original approach will last longer. The CacheEx library in Elixir, for example, will only run the expensive calculation once, set the cache and then send the value back to everything that requested it while it was loading.

CacheEx sounds interesting. Basically a debounce. Pretty sure you could solve this outside of the application layer with Varnish but that depends on how the view is composed. I prefer using a grace / stale period but that only works if it's acceptable to return stale data during computation instead of queuing it up.

Well CacheEx is just using functionality that comes naturally on the BEAM here. This is one of the reasons that a lot of CDNs like Cloudfront are written in Erlang.

CacheEx gets the ability to check for the presence of the cache key in Erlang Term Storage (ETS) which is basically an in-memory cache. If the key is present, it just returns the value.

If it's not, it checks to see if a process exists with the cache key name. If there isn't one, it creates one to request the resource.

For any other requests that come in until the value has been created, they will be directed to the process that is getting the value.

When the process that was calculating things comes back, it will save the value to ETS and then also send it back to all of the queued processes that have been waiting for it.

In the case of Varnish, you'd be expecting it to send back the entire completed view...HTML and all. This isn't something that you need to worry about with Elixir because the view is never actually rendered in the application. It's broken down into pieces that are never duplicated in memory and then replayed directly to the socket...meaning you really only ever need to cache expensive data and not what it's transformed into.

Here's a good read on why this view layer is so fast, if you're curious. Most people report shock that their uncached performance with Elixir and Phoenix is on par with statically cached HTML. I didn't believe it until I saw it myself.


That's pretty awesome. Varnish is my go to for Rails apps (and almost always necessary).

The easiest workaround is to acquire a lock (e.g. redis redlock) before starting to refresh the cache. That way only one process will perform expensive_calculation(). A good caching library will generally do some variation on this for you.

For larger, more complex scenarios the techniques mentioned by GGP above work well, i.e. not doing expensive_calculation() in your app process at all.

The issue has a bunch of names, I know at least three: cache stampede, thundering herd and dogpile effect.

There's a race condition between checking if something is in the cache and actually putting it in the cache. A more correct solution would have all threads wait while one does the actual work.

This gets more complicated if you have a distributed cache and/or distributed application servers. The typical solution there is to allow at most one computation per process/device.

If your language supports promises, cache the promise and not the result. If it doesn’t support promises, find or write one.
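In Python terms, "cache the promise" can be sketched by caching a concurrent.futures.Future instead of the finished value (a hedged, in-process illustration; names are made up):

```python
import threading
from concurrent.futures import Future

class PromiseCache:
    """Cache the in-flight Future, not just the finished value.

    Concurrent requests for a missing key all wait on the same Future,
    so the expensive computation runs exactly once.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._futures = {}            # key -> Future

    def get(self, key, compute):
        with self._lock:
            fut = self._futures.get(key)
            if fut is None:           # first caller: publish a pending Future
                fut = Future()
                self._futures[key] = fut
                is_owner = True
            else:
                is_owner = False
        if is_owner:
            try:
                fut.set_result(compute())
            except BaseException as exc:
                with self._lock:      # don't cache failures
                    del self._futures[key]
                fut.set_exception(exc)
        return fut.result()           # blocks while the computation is in flight
```

The key property: the Future is installed before the computation starts, so there is no window in which a second request can miss and start a duplicate.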

CloudFront has this anti-pattern; it drives me nuts!

Varnish + Grace period == solved

The amazing thing about point 1 is that modern immutable database patterns get it for free. When things like https://www.datomic.com/ get mainstream the world is gonna be in a very different place. Not as in today's apps get faster, but as in they are so much faster that new kinds of apps become possible that couldn't have been dreamed of before. My startup http://www.hyperfiddle.net is an experiment in this space

I hope you can hire a designer and frontend engineer soon! I understand it's an experiment but the site looks 15 years outdated. In any case it seems interesting, good luck!

Do this (point 1) properly and ~80% of your "speed" problems are solved, for a simple web-app-db tiered app.

1 - TTL should be infinite. If one needs finite TTL it means that cache invalidation logic is bogus.

Or it could mean that the cost of cache invalidation is greater than the cost of allowing stale data to be read for the remainder of the TTL, or any other number of justifications for finite TTL. The point is, it depends on your application - while it's true that shorter TTLs can mask faulty invalidation logic, I don't think it's correct to say that "TTL should be infinite" with no qualifiers whatsoever.

If that's the desired behavior then the cache layer should allow one to use stale (invalidated) content. The cache layer (or application layers above it) should be aware of the status of cached data.

It's fine if stale data is used deliberately, problems arise when stale data is assumed to be 'fresh'.

True but cache invalidation is famously hard to get perfect.

A nicer way to put it might be that finite TTLs should be viewed as an opportunity to optimize, not ideal or standard.

True, caching even simple data dependent on 2-3 variables may require writing dozens of test cases - in practice, it often turns out that the data doesn't rely on 2-3 but 5-10+ variables.

However, just because it's hard it doesn't mean we shouldn't do it the right way.

Of course it does - negotiating trade-offs and accepting the set that best achieves your objectives is a large part of software design (and also undermines the notion of doing things a "right way").

Wouldn't deploying less-than-precise TTLs be an appropriate trade-off for non-transactional caches, caches subject to network partitions, caches that can't subscribe to invalidation messages, caches that can't poll for change sets, caches that can't implement an eviction policy, etc.?

Certainly the cache should be transparent about the trade-offs it has made (such as not promising authoritative data if it's accepting eventual consistency via TTL).


It's impossible to create perfect systems, so it always makes sense to give yourself an extra out to protect you from unanticipated defects (otherwise known as a "belt and suspenders" approach).

By TTL, are you referring to Transistor Transistor Logic?

TTL in this context is "time-to-live" - "a mechanism that limits the lifespan or lifetime of data in a computer or network" (https://en.wikipedia.org/wiki/Time_to_live). E.g., the duration of a cache entry.

When I was inexperienced I feared ORMs because of the negative performance impacts I'd read they could have. I constantly worried about what would happen if the amount of data increased and I hit an ORM-induced problem that I could not resolve without a major rewrite of the data access layer. However, whenever I've actually hit those problems in production, I found the same thing the authors of the article did - ORM-induced performance problems can be fixed by 1-5 lines of code changes, if you know where the actual problem is. In fact, I learnt that there are no "ORM-induced performance problems", there are only problems induced by a lack of understanding of how the ORM works.

As for knowing where the performance problem is, I find that skill (which I think should be basic) is in fact very elusive in the engineers I encounter. When I interview people, even at what would be upper-intermediate to senior level in terms of years of experience, an alarmingly small number have ever done, could do, or even would do performance profiling like that described in the article. Yet they do describe how they changed this ORM for that ORM, this DB for that DB, this language for that language, in the name of higher performance.

I've seen or heard of:

- Teams spending weeks exchanging an SQL DB for a NoSQL DB because of an unsolvable performance problem. When hitting the same problem with the NoSQL DB, they find that the addition of a simple index is the solution in both cases

- Teams spending weeks exchanging Hibernate for OpenJPA in a complex application, because of performance, without doing any performance analysis, because they've read article that says Hibernate is slow

- Teams choosing complex architectures they don't really understand, for performance reasons, without being able to articulate performance requirements of the system they're building

These days, whenever someone mentions performance as a reason for anything, I judge their competence based on their response to the question "And how are you measuring and monitoring it?"

‘Unsolvable performance problems’ that could have been fixed by adding an index??

How did these team members pass their job interviews?

By practicing algorithm puzzles?

Sadly, this occurs far more often than you might think. Large companies especially do it to themselves given how they structure their teams with developers and DBAs being on different teams and reporting to different managers. (i.e. you can trace their respective management chain separately until just below the C level) These reporting structures are problematic since large companies tend to hire people with very narrow skillsets which makes it vital that these groups work together. The end result is that you have teams throwing things over the wall to each other not knowing or caring what is happening on the other side of the wall... dysfunction by design.

Funny story: at a, let's say, Fortune 50 tech company there was an investigation into why performance for a certain database query had become atrocious. The problem was that the query was against a table that started out quite small and then grew very large. And the database had been configured to use "query plan stability" to improve predictability of performance. However, the query plan that the db originally came up with for that nearly empty table was a full table scan, which was actually the fastest method under those conditions. Yet it continued to use a full table scan even as the table grew to many tens of millions of rows. It was a simple matter to switch the query to actually make use of the indexes that already existed.
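The scan-vs-index difference is easy to see for yourself. An illustrative sketch with Python's stdlib sqlite3 and a hypothetical orders table (SQLite's planner, not the database from the story, but the idea is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

query = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index the plan reads every row of the table.
before = conn.execute("EXPLAIN QUERY PLAN " + query, (7,)).fetchall()
print(before)   # the plan row mentions a SCAN of orders

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index the plan does an index lookup instead.
after = conn.execute("EXPLAIN QUERY PLAN " + query, (7,)).fetchall()
print(after)    # the plan row now references idx_orders_customer
```

Making this check a habit (with EXPLAIN / EXPLAIN ANALYZE in whatever database you actually run) is exactly the profiling step the teams in these stories skipped.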

Well, knowing that you need to profile is one step. But then you have to know HOW to profile.

Many years ago, a team member got asked to figure out what the performance implications would be if a specific application were to be changed from PostgreSQL to MongoDB (Mongo was very early at the time). That's a very difficult question to answer in general, but the way he decided to approach the problem was: create two programs. One would ask to connect to PG, grab the current time, run a query (once!), grab the time again, disconnect. And that's it. The other program would do the same for Mongo.

His 'findings' pointed out what some people were already expecting, that MongoDB was magically faster than PostgreSQL. Which was odd, as they both were running a single instance, with a simple data type, which PG should have zero problems handling.

I pointed out that he was measuring connection time too, which is slow on PG. He said that wasn't the case, because he started the timer after connecting to PG. I replied with "No, you are starting the timer after asking to connect to PG, you don't know if the connection is made at that point. Run a simple query first to be sure." He wouldn't accept that.

Turns out that not only was that correct (the Mongo library would connect immediately, the PG one would defer until needed), but neither of the DBs had indexes defined. So no numbers made any sense, the tested query was not based on any observed production queries, it would only be run one time, etc.

I have not seen more meaningless 'performance' testing since. But the PowerPoint graphs looked pretty, and since only management was in the room, they would likely have been convinced to migrate.

All that, for a badly written application, that a single box with SQLite would have zero problems handling.

On the other side, I had a coworker implement some simple, but effective, profiling of a Rails application. The slow path was traced to the caching code, specifically the code that generated the hash key. That was replaced with a better version and we got massive speedups.

Alright, I think I'll add some questions on profiling in my current team's interview process.

How would you test a CRUD app with a DB?

The way I do it now is just using some simple functional tests to GET/POST and measuring the time in a pytest script; I'm sure there are better ways to do it.

The fact that you feared ORMs probably pushed you to use raw database, and such experience really helps you understand how ORMs work.

Same is true with any abstraction – if you want to learn to use it right, first learn to make do without it.

Like so many things, it's useful to roll your own, not so you can use it, but so you can understand the complexities and trade-offs inherent in the problem space.

I've written a few ORMs before, and I can do it with fairly succinct and concise code quickly if needed. I still reach for a full ORM from the beginning, because swapping out that layer is painful, and you always want to do it sooner rather than later.

> Teams spending weeks exchanging SQL DB for No SQL DB because of unsolvable performance problem. When hitting the same problem with NoSQL DB, they find that addition of a simple index is solution in both cases

I don't get the idea that one has to pick either SQL or NoSQL, but a lot of people seem to think this way. Why not use both? The SQL portion can more or less be treated as a rich index of relationships, and the NoSQL portion can handle performance when it makes more sense for data to be embedded in parent documents.

Indeed, there is no reason to pick a "side". All big systems I have seen either have both or would benefit from both.

In fact, our current system has PG and several NoSQL databases, because each product has its own strengths.

The real joy of this article was the following:

Some academics (Yang, Subramanian, Lu, Yan, Cheung) were able to produce massive improvements in about a dozen large, mature, battle tested open source projects using just a few lines of code.

This should give hope to all those timidly trying to get into open source. Just go and take a look at the dozens of open source projects in Django or whatever and you could improve their performance by keeping an eye on the ORM.

Better yet you might find another thing that makes them even better with ease.

Of course, I should add, I think the really clever thing the academics did is to come up with this random link clicking program to time the worst load times of the projects. That whole setup was gold.

What's intimidating to me is that to make a change like that, you first need to get the code, get it and the tests running, and actually understand what the code does - and whether your fix has the equivalent result or is subtly broken.

They didn't just change a single line in an application, they wrote a huge benchmark suite and dug through miles of code to find issues like this. I've no clue how much time they spent on it, it must've been months.

This is not necessarily true. You would be surprised how many absolutely trivial performance issues can be found in almost every project. Sometimes it's really just about moving computation of a constant value out of a 'for' loop.

I once sped up a program by 90% by turning `new String("foo")` into just `"foo"`. The project was still rotten though, so it ultimately didn't matter.

Your example sounds interesting, could you explain more

Are you familiar with Java? If you use just `"foo"`, then the string will be interned and reused. If you use `new String("foo")`, each call creates a new copy on the heap. This call was inside a very hot loop, thus eating almost all the application's runtime.

Thanks a lot for your explanation.

You think so?

Maybe but the ORM lines should be pretty insulated. If you return the same results and only speed them up you should be okay.

To take an example, replacing any? with exists? doesn't seem to require testing everywhere in the project. I imagine that the average fix had lines like that.

I also suspect that the fixes were pareto distributed -- that a few fixes brought most of the benefit.

My 2 cents: if you don't know what queries your ORM API is generating, you shouldn't use one. I mean, sometimes there are bugs here and there, but people should know the methods they call.

It's the equivalent of making a rest call and complaining about latency. Yeah, it's a method, why is it taking so long?

That's an excellent point. I'm very comfortable with my ORM of choice as well as its performance, because I've spent time running SQL Profiler while stepping through my own code to learn what my ORM does, and when it does it.

There are some really glaring issues in this article. For example, the author suggests that each -> update is worse than update_all, and while the performance may be better, it's comparing apples to oranges! update_all, in ActiveRecord, bypasses business logic, bypasses updating updated_at, bypasses validation and more. It's not the same thing at all. There is no difference between where.first and find_by; that is just a bikeshed. Not every map(column) can be turned into pluck(column) if there are business logic wrappers around that column.

I never understood why I don't enjoy working with ORMs, but reading this article made it clearer for me.

They blur the distinction between working in memory and accessing the DB.

On one side it is very convenient; however, I really feel that such performance-sensitive operations should be carefully considered and that most SQL should be written by hand.

Writing SQL by hand does not solve typical performance problems like n+1 queries. If on the other hand you know enough to avoid n+1 queries, then you can also avoid them when using an ORM, and save a lot of work.

If you like writing SQL by hand, by all means do so, but you will not automatically get better performance by handwritten SQL as compared to ORM generated SQL.

Sorry, I believe I wasn't clear.

Definitely writing plain SQL does not solve performance issues. However, it does force the developer to draw a clear line distinguishing where he is accessing the database and where he is working with in memory data structures.

No, it's not automatically more performant; it's more that you can optimise hand-written SQL for your specific schema and use case. You can't do that with ORM-written SQL.

Depends on what kind of optimizations, and on the ORM I guess. The major problems described in the article, like n+1 queries and filtering on the client rather than in the database seem to be easier to fix in an ORM than in hand-written SQL with boilerplate wrappers. (Of course they are also easier to introduce in an ORM if you don't know what you are doing, but such is the curse of powerful abstractions.)

Even things like query hints can be easily applied in the ORMs I know. It is kind of hacky since it breaks abstraction layers - but so are query hints in SQL.

But of course there can be some special cases where you just have to drop down to raw SQL for some reason. All the ORMs I know allow this.

Depends on the ORM. SQLAlchemy will let you write any query via the ORM.

Sure, writing SQL won't solve design problems, but some ORMs need multiple queries to deliver data with relationships, joins, etc, that could be done in a single SQL query.

Which ORMs don't support joins?

I imagine all of the non-toy ones would support joins for getting raw tabular data or for restricting what single kind of item (e.g. "Customer") you're pulling out.

Where it might get trickier is when you have something like a Customer-Region-Vendor join, and you actually want it to go ahead and create objects for all of the Customers and Regions and Vendors whose data was pulled out of that one select.

In NHibernate and Entity Framework you just say .Include("Region").Include("Vendor") to eagerly load the associated entities. I imagine other ORMs have similar facilities.

It is actually not that bad if instead of a single query, you do one query per table.

It even offloads some of the burden to the app, which is generally easier to scale than the database.

The idea is that you query the "root" table, loop over the results and build an array of IDs, then do additional queries to the other tables with a "where whatever_id IN (...)".

My understanding is that a lot of SQL drivers support compressing the data anyway, so that the server (and client) avoid duplicating the data over and over just to fill the rectangular result.
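The "root query, then WHERE ... IN" batching described above, sketched with Python's stdlib sqlite3 (the posts/comments schema is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 'first'), (2, 'second');
    INSERT INTO comments VALUES (1, 1, 'hi'), (2, 1, 'yo'), (3, 2, 'ok');
""")

# 1. Query the "root" table once.
posts = conn.execute("SELECT id, title FROM posts ORDER BY id").fetchall()
post_ids = [post_id for post_id, _title in posts]

# 2. One batched query per child table instead of one per row (no N+1).
placeholders = ",".join("?" * len(post_ids))
rows = conn.execute(
    f"SELECT post_id, body FROM comments "
    f"WHERE post_id IN ({placeholders}) ORDER BY id",
    post_ids,
).fetchall()

# 3. Stitch the children onto their parents in the app layer.
comments_by_post = {}
for post_id, body in rows:
    comments_by_post.setdefault(post_id, []).append(body)
```

Two queries total, regardless of how many root rows there are, and no rectangular duplication of the parent columns across child rows.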

Waterline (for Sails.js) is horrible. Last I looked there's no way to do a kind of `deepPopulate` (requiring an inner join)

Sorry you feel that way! The rest of the team and I work hard on it as often as we can. But deep populate is not a feature we’ll be adding— Sails and Waterline are opinionated, and I don’t feel that the performance penalty and potential issues are worth it— instead, we recommend using a native SQL query when you need to do a custom join.

Built in support for WHERE subqueries, on the other hand, is on our roadmap. Currently working to finish core MSSQL support first though. Hope that helps!

SQL is way more expressive than most ORMs, so solving problems is easier in general.

I believe there could be ORMs (or DB libraries) that makes this separation clear. Ecto gets this right although it's not an ORM: an SQL query is a composable data structure that should be explicitly passed to a separate function, belonging to the database repository, in order to be executed.

Or you could just spend 15 minutes learning how to avoid N+1 queries by using the includes() function, and some of the other super obvious performance optimization patterns that every junior Rails dev should know...

This may be a little OT (OT because the points raised in the study are totally valid and mine is just a comment) but for small companies and solo developers ORM or whatever that gets the job done quickly is the way to go.

Most sites and web apps never even break the 100k hits/day mark, for which I believe inefficiency may not be the biggest issue. But wasting a month trying to write native SQL queries can hurt your project a lot more.

I actually make this calculation for most projects that I work on -- what level of traffic do I expect this to see? If it's a back-office function or is limited to a few hundred clients at a time, I don't even worry about performance.

But that said... writing native SQL queries wastes a month? Um, no. ORMs are convenient, but writing select, insert, update, and delete queries by hand isn't that hard. It's mildly verbose and you end up repeating yourself lots so it's annoying, particularly if you (like me) have the kind of programmer's mind that constantly wants to factor that out. But it's doable without adding a high factor to development time.

Writing anything is fast. It's maintaining it that becomes a problem.

I built a web app that is only used by maybe 3 people.

Still, inefficiency is a huge issue. I need close to desktop-like performance for data entry, and that requires both insert and read performance (they are executed in the same block and comprise several queries) below 1 second.

Requests/second is not the metric that matters.

It's the seconds for the user's main activity.

Depends on the application. E-commerce, and I believe many other industries as well, depends on fast responses for conversions. For example, a page that takes ~2 seconds to load will start to see higher bounce rates and lower conversions.

Of course full page cache is the first stopgap for this problem but you'll always have dynamic content.

> But wasting a month tryig to write native Sql queries can hurt your project a lot more.

Unless it would be the first contact a developer has with SQL, I really don't see how difference in time taken to query the database with an ORM vs SQL would be anything more than a couple of minutes, maybe an hour if you really have a hard time with it or your data model is really convoluted.

This really depends entirely on what your application is doing and how it generates and processes queries. Some things are just annoying to solve without an ORM / "query builder" / SQL-AST (unless you write a small ORM yourself).

I always take the approach of getting it functionally complete first, then optimizing for performance. It's easier to track optimization bugs (which can be nasty) and you always have something to revert to that at least works.

I agree that 99% of all projects never push beyond what relational databases are capable of. I've seen developers who were tasked with writing basically a todo list say that they're going with a nosql database because RDBMs don't scale.

Even if you get 1 hit / day and that takes a long time to return, the ORM has failed your one customer. Inefficiency exists even at small scales with ORMs. If your developers don't know how to write SQL, let them learn. Or fire them if they won't.

> Inefficiency exists even at small scales with ORMs

ORMs are not inherently inefficient. And in my time I have seen plenty of n+1 style queries written with raw SQL.

Lazy loading often fetches too little and eager loading too much. Most ORMs select all columns in all tables regardless of need. The query translation and hydration overhead of transforming the data into objects, even in a fast language, is always going to be a problem; it's just a matter of how big, compared to a system without the ORM. Since ORMs are superfluous, every line of code in them adds unnecessary overhead. They are indeed inherently inefficient by definition.

> The query translation and hydration overhead of transforming the data into objects even in a fast language are always going to be problems

No they are not. Have you measured this in a real-world application? This overhead is negligible compared to the cost of the query itself. And without an ORM you still have to load the data into some kind of objects or data structures before you pass it to presentation, you will just have to write the code yourself.

But yeah, lazy loading in a loop will kill performance. So don't do that.

It's definitely not always going to be a problem, but sometimes it is. An ORM is always slower than writing custom code, but it might still be more than fast enough.

Surely it depends on the custom code in question if it is faster or slower than the similar ORM logic?

You may be able to write faster code, but unless you are Donald Knuth, you can't guarantee that any custom code you write will always be faster than some library.

> Inefficiency exists even at small scales with ORMs.

In other words, you have no idea what you're talking about.

Personal attacks will get you banned here, regardless of how wrong someone else is. You've unfortunately been uncivil in at least one other comment in this thread too.

Could you please (re-)read https://news.ycombinator.com/newsguidelines.html and not use this site that way? The idea here is to post civilly and substantively, or not at all.


So the ORM taking many seconds of query overhead isn't a problem? This makes the web page served by this API many seconds slower. A second of slowness increases bounce rate by quite a bit, and by two to three seconds most visitors abandon the website. I guess that doesn't matter to you, as you post a comment without substance and only an insult.

I think you have bigger fish to fry than site performance if you're getting 1 hit per day.

I think you completely missed the point here. If this happens for one customer, it happens for all of them and no amount of hardware thrown at the problem is going to solve this because it's already running on the best hardware available.

What ORM setup gives you "many seconds" overhead on a small website? Are you sure the ORM is to blame? I suspect you are doing something like lazy loading in a loop or joining/filtering on the client side.

Interestingly their code checker is just a bunch of regular expressions: https://github.com/hyperloop-rails/static-checker - I would've expected custom Rubocop rules.

Rubocop already knows about `where.first? => find_by` for example: https://github.com/rubocop-hq/rubocop/blob/master/lib/ruboco...

They suggested naïve sequential string replacing for html templating, so I would absolutely expect a bunch of regular expressions.

... Which begs the question: what good is an ORM if it does not prevent such issues by design? Here, we are essentially saying users of an ORM must also keep the SQL version in mind. Or call in an expert after the mess is made :/...

I find this question to be very weird. I have never seen ORMs as tools for completely abstracting away the database. I have always seen them as convenience APIs for using the database. They exist to make your code shorter, less repetitive, more readable, easier to reason about, and more maintainable; not to make you forget about the database altogether.

I also don't see ORMs as exclusive. It's fine to use ORMs for 95% of your use cases, but drop down to raw SQL for the remaining 5% where it's not a good match. That's still a win for the goals I mentioned above.

> I have never seen ORMs as tools for completely abstracting away the database.

Some are advertised that way. Entity Framework Code First, Migrations etc.

At our place, our DB dev team is larger than our api team, which in turn collectively dwarfs our front end teams. ORMs in this sort of environment have never really been given a chance, but I feel like it would be easy to justify them in many situations...

> Some are advertised that way

As an example of this, the influential DHH of Basecamp blogged saying just that: https://m.signalvnoise.com/conceptual-compression-means-begi... "Basecamp 3 has about 42,000 lines of code, and not a single fully formed SQL statement as part of application logic!"

Given that these are the people who wrote the Rails ORM, you'd expect that they know how to use the ORM to generate high quality SQL. Which is actually quite doable - in Rails / ActiveRecord you're much better served by knowing what happens for every ORM call, and the default development log prints every generated SQL query as well. Think it now also provides alerts when the queries are slow.

True, but my point was that that blog post supported the point that ORMs are sometimes promoted as a way to avoid needing to know SQL. The intended audience of that blog post was not people who write ORMs, it was to persuade people who are writing applications that they don’t need to learn how to use a database, that they only need to learn how to use an ORM.

Yeah, that isn't going to work. You'd want to use an ORM the way you'd use a bicycle - it won't necessarily let you do something you couldn't before, but it makes it easier. Using an ORM without knowing the SQL it generates is like learning to ride a bicycle without learning to walk. You'll fall down at some point and then you're well and truly screwed.

They also let you compose SQL fragments in ways that would be a lot more difficult or less clear otherwise. In the case of Diesel, they let you typecheck the composition of those fragments. ORM hate is based on misunderstanding of the benefits those who do use them get from them.

I'm not very sure that Jooq and SQLAlchemy and other 'not quite ORMs' really help that much either.

I think, as the article suggests, that 'opaque' is as much the problem as the classic Object-Relational impedance mismatch, but I would posit that the opaqueness isn't just that the programmer can't _see_ what happens, but that they also don't _care_.

So I'd put my faith in tools instead. If programmers don't want to think about this, let the tools try instead.

The static analyser was interesting. Although static analysers have shallow comprehension, they were able to identify anti-patterns that were common enough to make it a useful linter. And of course their classic-mistakes approach can be applied to other languages too, even if each language and ORM needs its own corpus.

I sincerely hope that JetBrains and other IDE makers bring this kind of analysis into their IDEs. I write a lot of non-Ruby code, and the JetBrains IDEs do keep suggesting nice 'code simplification' and 'generate boilerplate' fixes as I type.

But while they check my regexes for well-formedness and colour my SQL, they don't have any kind of meta analysis and comprehension. Missed opportunity.

You could take this further for normal non-DB code too - they could warn me when I have inefficient O(n) search in a loop and so on.

> You could take this further for normal non-DB code too - they could warn me when I have inefficient O(n) search in a loop and so on.

For phpstorm you have the EA extended plugins which gives lot of hints like that. I'm sure you could find the same kind of plugins or write one backed by a static analysis tool for other languages.

ORMs are good for 90% of your use cases, the rest I'm happy to use SQL for.

I could attempt to deconstruct the ORM problem, but Martin Fowler had a nice article [1] on replying to "ORM hate" a few years back which is worth the read for folks that question the usefulness of ORMs (and I believe questioning things is a healthy thing to do!)

[1] https://martinfowler.com/bliki/OrmHate.html

As much as I like Fowler, he's missing an important option in his description: don't try to do object-relational mapping generically. Consider that you have business model objects. In the UI, you create a "view" of that data. Of course the UI view is not related in structure to the business model object. That would be absurd. We need to use collections of model objects and present the data in a flexible way. Just as absurd would be to build a framework for automatically mapping the data in the model layer to the UI view.

Now consider that your DB is just like your UI. It's a view of your model objects. Just in the same way, there is no reason why your model objects should be related structurally to your DB representation of the data. And in exactly the same way there is no particular reason why you should have a way to automatically map to and from your DB layer to your model objects.

In many cases, it will be less complex to build a bespoke OR mapping rather than try to find a system that will do it generically.

Computers are fast. Correctness is more important than performance. Most businesses fail.

With an ORM you write the first working version of your product quickly, validate product-market fit, and maybe spend a small amount of time profiling and optimising eventually when you need to scale. That's a good tradeoff.

> Which begs the question: what good is an ORM if it does not prevent by design such issues?

This isn't as terrible as you make it out to be. For example, general-purpose programming languages don't prevent the programmer from writing an O(2^n) algorithm; yet they are useful.

The value proposition of ORMs is that they simplify data access code, making the application simpler to write and more maintainable.

I have never seen anybody claim ORMs would eliminate performance issues.

Right up until the point where you have to hack around the ORM to get performance where it needs to be. Then you suddenly have more complex and harder to maintain "magic" code than if you'd just written some simple-but-boring boilerplate and SQL from the start.

I'm wading through the tedious and boring process of writing a data access layer for an application at the moment. It's repetitive, there's lots of error-checking, and testing it is a pain in the arse. My big consolation, though, is that in a year's time when I need to optimise the queries because we're getting load on it, it'll be easy to understand and simple to change.

ORMs don't make code easier to understand, they make it less boring to write. Those are not the same things.

Hacking around the ORM should be a question of:

    query.SQL("my complex query here")
If it's not, your ORM is broken and you should write a new one (they're really not that complex).

I guess this depends on the framework in use, but for example, in Hibernate you add a single "include" clause to avoid n+1 queries. It is only "magic" if you don't understand how it works, but then your SQL would be just as inefficient.

Any ORM I know allows you to easily drop down to raw SQL when you need it - but in the majority of cases you never need to. Writing everything manually in SQL with tedious boilerplate wrappers because you might need to manually optimize some queries at some point in the future seems like a massive violation of the YAGNI principle.

> Then you suddenly have more complex and harder to maintain "magic" code than if you'd just written some simple-but-boring boilerplate and SQL from the start.

Not my experience at all. Rather you have your straightforward ORM code with maybe a couple of "magic" hint annotations here and there where you need them. Your non-performance-critical queries are plain and simple, your performance-critical queries are less plain but remain readable. Whereas if you write boilerplate SQL the whole time all your queries are "readable" in theory, but there's just so much of them that you can never actually understand more than a fraction and have no way of knowing which differences are important and which are accidents.

(Unless, of course, you throw away the whole ORM infrastructure at the first sign of a performance problem and insist that you absolutely have to run custom SQL directly, disable entity caches, and so on. But don't do that.)

Having worked many years both with ORMs and with "old-school" devs slinging raw SQL strings and everything in between, I've come to the conclusion that "ORMs don't kill [performance], people do". :)

I experienced fixing most of these first hand. I think 90% of the issues come from developers who haven’t worked directly with SQL enough to understand what the ORM is doing.

The experience of building apps with hand coded SQL, while not good for a project, is an excellent teacher.

It's a bit high level to mention this, but the article doesn't cover the one query problem I see constantly: paging using offset and limit. I swear every app I've worked on uses it somewhere. And it's horrifically inefficient 90% of the time.

Simply depends on pagination depth.

Unfortunately there aren't any solutions that generalize as well as LIMIT + OFFSET, which is why we use it. In fact, other solutions usually take quite a bit of custom tailoring if you want Prev/Next and deep page jumping.

Can you elaborate on the issues with that? I've used that pattern and haven't found any major woes (yet), but also don't usually paginate complicated queries.

In postgres and mysql, at least, the database has to re-scan data every time you run a limit/offset query. It gets progressively slower as your offset increases.

The efficient way to handle it is to set a lower limit on the pkid (or other atomically increasing row) of the last record fetched, and fetch in ascending order e.g.

    SELECT * from my_table where id > (last_row_id_seen) ORDER BY id asc limit 20;
Then you have an indexed jump to the correct rows to return. It's pretty straightforward to abstract a batched database iterator with this pattern for use application-wide (usually can be done in ~20-50 loc), but no ORM that I'm aware of supports this pattern natively.

Do note that, by extension, it's impossible to efficiently batch iterate over tables that don't have a unique, orderable key in postgres and mysql.
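A sketch of such a batched iterator using the stdlib sqlite3 module (the function and table names are made up; a real version would parameterize the columns and validate the table name):

```python
import sqlite3

def iter_in_batches(conn, batch_size=2):
    """Yield all rows of `items` in id order using keyset pagination:
    each batch is an indexed seek (id > last seen), not an OFFSET re-scan."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]  # keyset cursor for the next batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, "row%d" % i) for i in range(1, 6)])
seen = [row_id for row_id, _ in iter_in_batches(conn)]
```

Each iteration seeks directly to the next id via the primary-key index, so the cost per batch stays flat no matter how deep you are in the table.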

What if you need to order by something other than id?

Then, generally, give the user a bunch of filtering options and limit page depth to less than ~1000 entries

The folks behind PostgreSQL / CitusDB wrote an awesome article on which methods of pagination work well and where they fail.


Highly recommended read.

The DB has to compute every record from 0 to the offset. A user can frequently crash a server just by going to page 100+ in a big list, since the database has to load all that data.

Easiest way to prevent this is add enough filtering options that it doesn't inconvenience users to limit depth to page 10 or something

Probably still better than putting all the rows in a temp table with row numbers and then selecting where row number is between x and y, which I have also seen.

Wouldn't it just be better to invert the relation (sic) between App-ORM and DB, and have everyone better understand what data modelling and a DBMS is, then write SQL and some reverse-ORM to expose App services in SQL?

E.g. EXPOSE SERVICE(REST, GraphQL) BillOfMaterials (VARCHAR arg) AS (SELECT ... WHERE ... = arg etc.)

Yep! See here[0] and here[1] for that idea done with Postgres

[0] https://postgrest.org/en/v5.0/

[1] https://www.graphile.org/postgraphile/

ORM for saving objects and very simple queries. Writing SQL or using something like JOOQ in Java to write type-safe SQL for everything else.

I disagree. I have used Django extensively. The ORM handles raw SQL queries, which is great when you need something beyond the capabilities of the ORM, but you lose a lot as well when you go that route.

Pagination and sorting are pretty easy additions when you retrieve data using the ORM in the standard way, but with raw SQL you need to add extra code to handle those specifically. I don't think you can use the Django admin with raw SQL (I never tried, but it doesn't make much sense, as the admin basically generates a set of views per table). Model methods don't make sense with raw SQL queries either.

You should be able to write SQL if you want to be able to use the ORM effectively and I am certainly glad I knew it well before started using Django's ORM.

What makes you think the Django ORM is a good ORM? What is bad about adding "extra code" if that code does something useful and correct? And I wonder why people find the Django admin useful for anything beyond simple CRUD apps. Writing code so it can go along with the Django admin is a strange way of thinking; there should be other priorities first.

One of the example Rails applications they use is the code which powers the OpenStreetMap website.

They populated their install by randomly filling in fields on the website. Which doesn't include any map editing! For OSM they suggest changing how the diary feature operates, which is a tiny, almost irrelevant part of the OSM website software stack. The OSM database has millions of geographic objects, and they talk about the diary system on the website.

> For example, when we profile the latest version of Openstreetmap, a collaborative editable map system, we find that a lot of time is spent on generating a location_name string for every diary based on the diary’s longitude, latitude, and language properties stored in the diary_entry table

The paper claims to have filed bug reports, and has URLs. But those links don't exist.

Paper: https://hyperloop-rails.github.io/220-HowNotStructure.pdf
openstreetmap-website: https://github.com/openstreetmap/openstreetmap-website/
Claimed issues submitted: https://github.com/hyperloop-rails/issues-summary

Thanks for looking into this - I am one of the authors of the study. We have updated the submitted issues link to https://github.com/hyperloop-rails/study-replication/tree/ma.... Let us know if you have any further questions.

> Inefficient Rendering (IR). This one was a surprise to me

He's talking about the common Rails pattern of rendering a small Haml/ERB partial over & over in a loop. I've noticed big perf hits from this before, but never completely understood why. Rendering the exact same result directly in the loop body, without calling a second partial, gives a big speedup.

There is an extensive discussion here:


I would love to have some better understanding around this, although it sounds like no one has a clear idea of the cause.

In development, I've noticed that Rails re-reads the partial file off disk every iteration of the loop. I don't know if that happens in production too, but if so it would explain a lot.

I’ve experienced a lot of n+1 query problems as causes of bad performance. Oftentimes this was a result of wrapping the ORM in abstraction layers (for business logic, and out of fear of being “locked in” to the ORM). We rewrote that part of the application using a different ORM (in Python). Making use of a tool that could find these problems automatically helped greatly, and we didn’t have a single performance problem when the rewritten service went into production.

This module can detect the n+1 queries problem automatically in Python ORMs: https://github.com/jmcarp/nplusone

Looking forward to seeing more automated tools like this in the Python/Django world.

Scout also detects these for Django, ordered by the worst-performing N+1s: http://blog.scoutapp.com/articles/2018/04/30/finding-and-fix...

And bullet is used for Rails to detect N + 1 queries. https://github.com/flyerhzm/bullet

This is cool. Are there more papers as practically useful for a (web) developer as this one?

Most ORM-related problems are related to lack of knowledge of how the tool works.

However, if you understand how databases work, how to tune the driver and how to get the ORM tool to generate the same queries you'd otherwise write yourself, then you are fine.

For more details, check out these [14 High-Performance Persistence Tips](https://vladmihalcea.com/14-high-performance-java-persistenc...).

Pretty cool to read. We built a (currently proprietary) CMS with our own scripting language, and instead of going the ORM way, we merged basic SQL into the language itself. We did that mostly to eliminate sending raw strings to databases (and all the injection risks and complexity that comes with it) but it does allow a few extra optimisations because the compiler can look at both the query and the language using it.

So if I do our equivalent of the ORM API Misuse case:

  IF(RecordExists(SELECT * FROM schema.variants WHERE track_inventory = 0))
(RecordExists is a function that only checks whether the query returned something, and has been marked that way) the compiler will already reduce this to:

  IF(RecordExists(SELECT FROM schema.variants WHERE track_inventory = 0 LIMIT 1))
And likewise a function that selects all columns from a database and returns only one field, has the select reduced to only selecting that one column.

The drawback, of course, is that any SQL features of the underlying database not exposed by the scripting language, are unreachable unless you fallback to sending raw query strings again.

Of course it depends on the exact circumstances, but your query would probably be faster as `SELECT COUNT(*)` than the `LIMIT 1` you optimize to now. In fact, that’s one of the specific optimizations mentioned in the article.

count(*) ends up counting all matching records; reading just the first row is much, much faster than finding and counting all of them.
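For a pure existence check, most databases can also stop at the first match via EXISTS; a sqlite3 sketch (schema invented to mirror the RecordExists example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants (id INTEGER PRIMARY KEY, track_inventory INTEGER)")
conn.executemany("INSERT INTO variants VALUES (?, ?)",
                 [(i, i % 2) for i in range(1, 101)])

# COUNT(*) touches every matching row just to compare the total to zero.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM variants WHERE track_inventory = 0").fetchone()

# EXISTS (like LIMIT 1) can stop at the first matching row.
(found,) = conn.execute(
    "SELECT EXISTS(SELECT 1 FROM variants WHERE track_inventory = 0)").fetchone()
```

Both answer "is there at least one?", but only the second can bail out early on a large table.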


That depends a lot on the implementation of the database and the communication protocol. In our case, that made no difference

I need a very good reason before using any external library in an attempt to keep the total code base as clean as possible.

It is just too easy to be rushed and bring in a heap of code, so I prefer to use SQL instead of ORM's.

All access to the database is done in a single module and they are wrapped in functions like below

    def get_table_as_list(user_id, cols, tbl, where_clause, params_as_list,
                          conn_str, order_by="1", maxrows="2000"):
        """This should be the ONLY place that selects from the database."""
        db = get_db_conn(conn_str)
        cur = db.cursor()
        where_clause += ' AND user_id = %s'
        sql = ("SELECT " + cols + " FROM " + tbl + " WHERE " + where_clause +
               " ORDER BY " + order_by + " LIMIT " + maxrows)
        cur.execute(sql, params_as_list + [user_id])
        return list(cur.fetchall())
The database is designed and built first, then in the application the definitions are done like below

    all_tables = [
        {'cols': ['id', 'title', 'pinned', 'important', 'content', 'folder'],
         'col_types': ['id', 'Text', 'Checkbox', 'Checkbox', 'Note', 'Text']},
        {'cols': ['id', 'Title', 'Pinned', 'Important', 'Notes', 'folder', 'Done']},
    ]

So far it is working well, and it is very simple to add new tables to the schema and have them working in the application.

Unless you're performing some magic elsewhere in the codebase, this will leak connections if an exception is thrown since you're not closing the connection in a 'finally' block. Alternatively, depending on your version of Python, you could use a 'with' context to ensure the connection is closed.
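A minimal sketch of the 'with' approach using the stdlib sqlite3 module (the helper is illustrative, not the parent's actual code):

```python
import sqlite3
from contextlib import closing

def get_rows(sql, params=()):
    # closing() closes the connection even if execute() raises,
    # so a failed query cannot leak the connection.
    with closing(sqlite3.connect(":memory:")) as db:
        db.execute("CREATE TABLE t (id INTEGER)")
        db.execute("INSERT INTO t VALUES (1)")
        return db.execute(sql, params).fetchall()

rows = get_rows("SELECT id FROM t WHERE id = ?", (1,))
```

Note that sqlite3's own `with conn:` form only manages transactions, not closing, which is why `contextlib.closing` (or a try/finally) is needed here.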

Good point, thanks for that - there is a lot of error handling I haven't shown, but I hadn't taken leaked connections into account.

What exactly is the problem you're attempting to solve with this?

Bonus questions:

What about maintenance or admin queries which aren't tied to a specific user_id?

What about sql injection?

> What exactly is the problem you're attempting to solve with this?

Keeping all database access in one place to avoid having selects around the codebase.

> What about maintenance or admin queries which aren't tied to a specific user_id?

This is the web interface for users, all admin stuff is done elsewhere

> What about sql injection?

The selects are passed as parameterised queries, so the where clause would be 'title = %s AND folder = %s'

It's nice to see that people are still able to write 2002-era PHP in Python today.

ORMs are really only useful for throwaway projects and beginners. I have yet to see one without serious downsides in both performance and speed of development, something that they are touted to improve but actually make worse. The ORM I'm stuck with now (Doctrine2) adds a 10x overhead to queries. For the most part, we don't even bother optimizing queries in such situations because why waste time on something that could at most improve performance by 1/10th? The real fix is to get rid of the ORM. That requires a rewrite of two apps that have been in development for four years each. I can't think of a worse architectural decision for business (CRUD) software than using an ORM. Irreversible and forever fucked.

Don't know much about Doctrine2, but that sounds pretty terrible. I'm sorry you are forced to work with something so inefficient.

I've been building apps with Django and Django's ORM for the last 10 years and found essentially zero overhead in most cases. Every once in a while there's a slow page, I open up the debug toolbar which shows me every SQL query that was used to generate the page in a nice waterfall diagram, I see something a little odd and change the order of some filters or add a `select_related`, or `prefetch_related` or discover that some third party library is making a dumb call (unavoidable problem of using third party libraries on any platform) and find a workaround for that. Every once in a blue moon, it appears easier to write a raw SQL statement than figure out what needs to be done to get the ORM to generate it, so I do that. Of course, Django's ORM makes it stupid easy to use a raw SQL statement: https://docs.djangoproject.com/en/2.0/topics/db/sql/

I've worked with a few other ORMs in Python and other languages over the years as well (in Go, Erlang, Elixir, Clojure, nodejs) and never really encountered any where the ORM had a noticeable performance overhead (dominated by the network latency back and forth from the database) and I've yet to work with one that I couldn't just do a raw query when needed. The closest I've seen is when I started using GORM in Go, I found it running slowly and discovered that it automatically adds a "soft delete" functionality so every query gets an additional "and not is_deleted" clause added. Disabled that feature and it was fine. That was my own fault though for starting to build before I finished reading the documentation, as it was pretty clearly explained in a later section. (OK, also the ORMs I was using in Perl and Java back in the late 90's/early 00's were also pretty terrible, but those were prehistoric times.)

You always have to be careful of N+1 problems, but that's not just an ORM thing. You run into that as soon as you have any abstraction in your code. Once you have refactored to `get_list_of_items(some criteria)` and `get_item_details(item)` functions/methods/whatever, whether it is using an ORM underneath or raw SQL, developers working on the app have to know that they can't loop over the results of the first and call the second on each of them. Tradeoffs between reusability and performance are nothing new though and not at all specific to web applications or ORMs.
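One common mitigation is to offer a bulk variant alongside the per-item helper, so callers never need the loop; a sketch with invented names using sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, detail TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, "d%d" % i) for i in range(1, 4)])

def get_item_details(item_id):
    # Per-item helper: fine in isolation, an N+1 trap inside a loop.
    row = conn.execute(
        "SELECT detail FROM items WHERE id = ?", (item_id,)).fetchone()
    return row[0]

def get_item_details_bulk(item_ids):
    # Bulk variant: one IN (...) query for the whole list.
    marks = ",".join("?" * len(item_ids))
    rows = conn.execute(
        "SELECT id, detail FROM items WHERE id IN (%s)" % marks,
        list(item_ids))
    return dict(rows)

looped = {i: get_item_details(i) for i in [1, 2, 3]}  # 3 queries
bulk = get_item_details_bulk([1, 2, 3])               # 1 query
```

Same data either way; the bulk variant just moves the loop into the database, which is the same trade an ORM's prefetch/eager-load machinery makes for you.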

PHP's concurrency model poses unique challenges to ORMs, but in this case slow hydration is the main culprit, and that can happen in any language, especially interpreted ones. It's no longer just fetching data and displaying it, but fetching, processing (in a slow interpreted language), and then displaying it.

That is one issue. I find the interface the orm provides to also be inferior in every way compared to sql so I just don't get why people would add something that provides even the tiny overhead you describe to get an inferior interface for dealing with the database. Some orms I've used create their own sql like language, so now there is something new to learn (that's useless outside the orm) that provides nothing over actual sql but has all the drawbacks of the orm and sql. Even working with simple objects and queries is more difficult as I hardly know what the orm is doing. When I do look at what it generates, it's always been atrocious in the orms I've used. In addition, trying to wrap my mind around the concepts the orm builds poorly on top of the relational database is truly angering. Taking a wonderful interface to a database and turning it into a set of nonsensical object relationships actually makes it more difficult and more time consuming to write the app with the orm. This is unnecessary complexity for complexity's sake, something proponents of simplicity avoid.

So my experience has not only been that of horrible performance but also that of a horrible ui that slows development to a crawl and creates apps that need to be rewritten to work properly, the exact opposite of what orm proponents claim. One of the main reasons I want to move to a functional language is that orms don't exist there.

I use Doctrine2 in a number of applications. I have a love/hate relationship with it (mostly through fitting it to legacy database schemas) but it's not slow to put/retrieve data from the database. What's slower is the object mapping (hydration); and if you map database rows and relations into objects yourself then your overhead is going to be similar, just labelled as 'application' overhead rather than 'ORM' overhead.

The 10x slowdown I noticed was moving from native database functions (PHP's PDO or MySQLi extensions) to Doctrine2's underlying database abstraction layer, DBAL. Doing a prepared query (with the same SQL) in the native extension was 10x faster than through DBAL, which uses the native extension under the wrapper. I never worked out why - DBAL does more (fair enough), but not enough to explain this amount of slowdown.

The only way manual processing of the data from the db could take as long as doctrine hydration is if it's written poorly. An app developer who knows how to write apps properly would never do hydration or any of the stupidity doctrine does. Why would one map database rows into objects in the first place? I've rewritten huge swaths of one of our doctrine apps getting a roughly 5x to 10x improvement in speed and I left a lot of optimizations on the table while still providing a general purpose serializer that uses doctrine's meta data behind the scenes.

As for dbal, I think the overhead is in statement preparation and parameter expansion (for array params) and it can be avoided by using the native functions, though that overhead was never 10x for me.

The problem with ORMs is that they make DB calls seem cheap, and when your application is young and you don't have a lot of rows, everything works wonderfully. Then all the N+1 problems start adding up and the SELECT COUNT(*) performance issues. Since the ORM code is baked into the models, your code is littered with these calls -- in the service layer, views, background workers.

That's why I really like Promises/Futures in Javascript. You know exactly when you're executing a DB operation and you have to think about the implications more. It's not just a simple object accessor.

The only big problem with ORMs is when developers are allowed to use them to construct database schemata.

As long as you have a not-crazy person in charge of the initial design, you can allow—and even encourage—developers to make versioned changes to the database.

If you let us start from scratch with ORMs, it will suck for everyone.

There’s a fundamental mismatch between what developers think of as objects and how relational designers think of tables.

ORMs can smooth that over. But nothing can undo the wrong-headedness of a developer thinking of a relational DB as an object store.

Give me an O! Give me an R! Give me an M!

What does that spell? SLOW PERFORMANCE!

Today's programmers don't understand data. They understand frameworks. To find the number of cars that are out of insurance they write:

    10 Nr=0
    20 Hey framework, give me all cars!
    Framework: Ok, here are 8001093 business objects representing all the cars in our DB. Each has all the attributes the car has. Color, mileage etc.
    30 Thanks!
    40 Foreach Cars as Car
    50 If Car->insurance_end_date < FancyDateLib.now() Nr++
I see variants of this everywhere. And the performance impact is hard to top. It's often several million times slower than a simple SQL query.
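Pushed into the database, the whole thing is one aggregate query; a sqlite3 sketch with an invented schema (the date values are arbitrary sample data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars (id INTEGER PRIMARY KEY, insurance_end_date TEXT)")
conn.executemany("INSERT INTO cars VALUES (?, ?)",
                 [(1, "2016-06-30"), (2, "2017-01-01"), (3, "2099-01-01")])

# Let the database count: no millions of objects materialized in app memory.
(expired,) = conn.execute(
    "SELECT COUNT(*) FROM cars WHERE insurance_end_date < DATE('now')"
).fetchone()
```

With an index on insurance_end_date this stays fast at 8 million rows, where the fetch-everything loop transfers and hydrates the whole table.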

But beefy hardware with lots of RAM and CPU takes care of it. 'Our software is enterprise grade, so of course it cannot run on commodity hardware'.

Not this again. ORMs are tools, basically dynamic code generators that run SQL and map the results to in-memory objects, and vice-versa. Some are simplistic and others are incredibly advanced, and the code itself is usually faster than your own sql->objects logic that you would write otherwise.

The issues with performance are almost always with the way the tool is used, like choosing a bad algorithm or the wrong data structure for a certain situation.

Lazy-loading a large child object in a tight loop over 100s of entries will always be slow because it's an improper approach, and nothing other than knowledge and competency will resolve that. It's rather amazing ORMs can be so controversial when the real issue is the developer, which speaks to the far bigger problem of expertise and quality in this industry.

It is the developer, but ORMs are so controversial in part because they often obscure that you're doing something crazily ineffective in ways that makes developers that don't understand the abstraction fail to see that they're doing something obviously wrong.

It's much more obviously crazy if you do a SELECT, instantiate objects from each returned row, then apply a filtering rule to those objects, than if you're "just" operating on something that looks entirely local. Both to you, and to people reviewing your code.

I prefer working with an ORM, but it does take discipline to use an ORM properly; to make sure you know and use the facilities for building queries that return only the rows you want, and instantiate objects with only the columns you need.

I tend to agree with you that it will only be resolved by knowledge and competence, and throwing out ORMs won't solve that. But I also understand the frustration at how many ORM users basically seem to use it as a means to let them pretend there's no database server there, rather than as a tool to generate queries more easily.

Perhaps the anti-pattern to rule them all is that as things become easier to do, it gives more people the opportunity to do them badly. We have so many incredible tools, and simple but powerful high level languages that allow us to write apps without bothering with complicated stuff like assembler. However, these tools don’t mean that we get to ignore how CPUs work, and ORMs don’t mean that we get to ignore how data structures work, or how we’re supposed to operate on them.

I think you’re 100% right. ORMs are just tools, they can be used well or poorly. That’s up to the developer. As a former DBA, I also think they’re great tools. Aside from making the business logic easier to write, they also bring the business logic and the data closer together.

> as things become easier to do, it gives more people the opportunity to do them badly.

A la, this is why we have Electron, PHP, and JavaScript. None of those things behaves badly on its own, but it's very easy to write something usable in them that is resource intensive, or insecure, or poorly maintainable.

>[...] ORMs are so controversial in part because they often obscure that you're doing something crazily ineffective ...

Then just turn on detailed SQL logging while you are coding and keep an eye on the queries generated by your ORM calls.

If you do this methodically for all your DAOs, then you should spot performance problems early when they can still be easily fixed.
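As a sketch of the idea: Python's sqlite3 driver exposes a trace hook that captures every statement sent to the database, and most ORMs offer an equivalent logging switch in their configuration. The table and query below are made up:

```python
import sqlite3

queries = []
conn = sqlite3.connect(":memory:")
# Log every statement the driver sends to the database.
conn.set_trace_callback(queries.append)

conn.execute("CREATE TABLE dao_demo (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("SELECT name FROM dao_demo WHERE id = 1")

# Reviewing the captured log shows exactly what each DAO call executed.
assert any("SELECT name FROM dao_demo" in q for q in queries)
```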

With all this continuous effort required, wouldn't it be simply easier to ditch the whole 'use ORM because it makes life easier' nonsense and use just some SQL-mapping tool like iBatis? Where you really, truly have 100% immediate control over queries, no dancing around with logging and wasting time figuring out why the lib decided to fetch this and not that.

I honestly don't get the fear some people have of SQL. If you can't write SQL well enough for 98% of the use cases out there, you are not a senior dev, not even an experienced one. It's just one of those basic dev requirements, and it won't go away in the next 50 years anyway.

True, but on the other hand, if the developer gets told they are loading nearly 20,000 objects, they should be smart enough to realize: something is going to need to be optimized here, and even if I don't know how to do it yet, I'd better find out.

That depends upon the context. I've seen developers argue over optimizing ETL processes that weren't in the slightest bit time sensitive, took 25 minutes and were going to be run once.

More often than not I've seen developers fret over performance when it's not actually bothering users. This fetishization of performance seems to be a cultural thing in tech.

Conversely, data integrity, normalization and transactionality are usually given far too little weight.

> This fetishization of performance seems to be a cultural thing in tech.

Tell me where you've seen that, because I'd love to move there. From my experience, there is a common fetishization of non-performance. The "premature optimization" adage taken to its extreme - "we can buy more hardware", "developer time > machine time", "who cares about wasting electricity of millions of our users", etc.

Pretty sure that is what tailing the log during development is all about. I don't see what is so obscure about it, in the Rails based examples in the post, one can just see the logs flying by with the exact queries being executed.

It is, but a developer that's using ORMs because they're scared of SQL tends to be the same type of developer that will decide not to look at those.

I'm not dismissing ORMs - I use Sequel (the Ruby ORM) for almost all my database access. But I've also seen enough people use ORMs as an excuse to pretend they don't need to understand SQL or understand the database to understand why some people look at ORMs with suspicion.

A lot of code that is obviously bad when the database queries are plain for everyone to see is not so obviously bad when it's less clear whether a method call translates to a database query or just extracts data locally.

Note that this is a general issue with this type of abstraction: It is a common complaint against transparent RPC wrappers as well that if they're too good at hiding that an object is remote it's easy for someone to carelessly cause massive amounts of unnecessary roundtrips.

I don't know, I find that a lot of the people that write the most inefficient ORM code would also not spot the inefficiencies in SQL queries either.

On the flipside, I've seen a lot of people who can write efficient SQL queries, but then turn around and ruin their performance with a bespoke code model that calls those SQL queries inefficiently and leaks memory and database connections along the way.

A good ORM makes it rather easy to fix an inefficient query of a fellow developer or previous self (almost all of the examples in this article are effectively one-line changes; many of the biggest performance gains I've made in ORM usage have been removing code and/or making it more readable), but fixing the mistakes of a bespoke model can be a huge challenge with a lot of surprises.

Not this again. ORMs are some of the most impenetrable code you'll ever see. Understanding the performance characteristics of a CRUD application using SQL is vastly simpler than trying to puzzle out the arcane ways in which ORMs decide to throw up garbage once you add one more thing to the mix.

It's just code, compiled into a library that you can use, and queries are either logged by the library or in your database. What exactly is so impenetrable? That's just strange. Also CRUD is the ideal scenario for ORMs which handle all the mapping, security, parameterization, and even conversions between data types and models seamlessly while letting you just work with your objects.

It seems you either didn't use ORMs correctly or used a very poor one.

>It's just code, compiled into a library that you can use.

That's also true for code that one shouldn't use.

Don't get me started on the abuse of Active Record.

Not that again. ORMs achieve one main thing: being able to map your app’s objects to a relational database and back.

But there are tons of other benefits:

1) Avoid all injection attacks by default by binding variables rather than interpolating their values

2) Write SQL code for you automatically, so you always have balanced parentheses and no typos or errors mixing statements

3) Autogenerate classes and methods from the SQL model, so there is one place where you can add custom methods on objects

4) Model fields and relationships in a way the app can understand, so you can make meaningful error checks and manipulations in the app instead of relying on the database engine to give you a nice error message or manually copying the data type logic.

5) Play nice with version control, making sure to encapsulate the code in ONE PLACE instead of a million places when the schema or model changes.

6) Pissing people off on HN so we can better discuss and explain principles of architecting good software while talking about the benefits of ORMs

7) Let you write an adapter to move to eg a graph database which is far faster.

8) In fact, just making you avoid doing joins in the database by default is already a feature as you can make your app far more scalable w sharding and possibly think about making it byzantine fault tolerant and distributed!

See for example this:



I was with you some of the way but "making you avoid doing joins in the database" made me drop my monocle. You want joins in the database, they are designed for joins. Moving joins to the client will kill performance and scalability.

And any sane ORM will perform the joins in the database by default.

No, that's not true if you're building a huge site. Google has been avoiding joins since as early as 2005.

Joins were good for the smaller websites, but they don't scale. By avoiding joins, you have shared-nothing models that can be partitioned horizontally aka sharding.

Now true, the latest and greatest databases such as CockroachDB go out of their way to try to do joins for you across partitions, even in an ACID manner, but then you have to use those. Better to avoid joins in the DB and do it in the app. You can then use a graph database instead of a relational database, going from O(log N) lookups to O(1) lookups for related data.

Oh and finally, the newest (and pretty cool) craze of BFT, Byzantine Fault Tolerance. You can't achieve that if you're doing joins across different publishers, because they're not supposed to be able to access each other's stuff "just like that".

Our ORM supports joins, even with multiple indexes, it even lets you define relationships and figures out the joins FOR YOU, but it is discouraged if you're building scalable sites.

By the way, thank you for proving point #6 hehe

> Joins were good for the smaller websites, but they don't scale

Most sites do not have to scale beyond this limitation (or can use database followers to throw a bit of money at the problem). Holding up Google as an example is a bit of an exaggeration, as almost nothing in the world has the scaling needs that Google has.

So you have to use sharding because your databases have limited capacity. But you can just join in the app, because the app has unlimited memory?

The app doesn’t hold the entire database at a time. The app simply does the following:

1) Get the root record(s) from id(s)

2) See what related records it needs, combine them into a list of ids, partition list by shard

3) Ask each shard for the corresponding records

4) Repeat from 2 if necessary

5) Return this whole tree / graph to the user
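The steps above can be sketched in a few lines of Python, with plain dicts standing in for shards and a toy id-modulo partitioning rule (everything here is hypothetical):

```python
# Two fake shards, keyed by record id; partitioning rule: shard = id % 2.
shards = [
    {0: {"id": 0, "name": "root", "child_ids": [1, 3]}},
    {1: {"id": 1, "name": "a", "child_ids": []},
     3: {"id": 3, "name": "b", "child_ids": []}},
]

def fetch(ids):
    """Partition ids by shard, ask each shard, merge the results (steps 2-3)."""
    by_shard = {}
    for i in ids:
        by_shard.setdefault(i % 2, []).append(i)
    records = {}
    for shard_no, shard_ids in by_shard.items():
        for i in shard_ids:
            records[i] = shards[shard_no][i]
    return records

def app_side_join(root_id):
    root = fetch([root_id])[root_id]         # step 1: get the root record
    children = fetch(root["child_ids"])      # steps 2-4 (one level deep here)
    return {**root, "children": list(children.values())}  # step 5

tree = app_side_join(0)
assert [c["name"] for c in tree["children"]] == ["a", "b"]
```

A real implementation would batch the per-shard requests in parallel rather than looping, but the id-partitioning shape is the same.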

Graph databases can do this in O(1) instead of O(log N) lookups.

Relational joins are just one way to achieve this, which can be made atomic in the ACID sense.

However, as you scale up your website, eg with 100,000,000 users, it would be silly to do massive joins. Google even says this in their docs now, for BigQuery.

Instead, design your systems from the beginning to be as parallel as possible, if you think they will scale.

Look at the problems with Ethereum for example. Or Twitter fail whales of the past.

A graph database basically caches related entities with the root entity, so lookups are fast when the query follows the paths of the graph. The price is that any other query is extremely costly. But if it works for your use case, more power to you. But not really relevant in a discussion about ORMs - Object Relational Mappers.

The relational model was designed to address the limitations of the network/graph database, especially to allow arbitrary (ad-hoc) querying and to decouple the physical storage from the logical model. But if you don't need all that, a graph database may be fine.

Most applications don’t need that. But if they do, just have indexes!

Acolyer again:


For the vast majority of use cases regular SQL databases blow graph databases out of the water. Unless you go for graph-specific algorithms like Shortest Path, and even then...

We are not all Google; I'm extremely happy with my joins and the performance of MySQL.

>The issues with performance are almost always with the way the tool is used

That's overly generic.

The truth is that some tools encourage bad performance habits and a lax attitude about it whereas others don't. ORMs do.

Encouragement and attitude doesn't force you to do anything. You still have to choose to use it, either well or poorly.

This is no different than any other tool that makes things easy but with obvious limits. Proper decision making is still up to you. There really isn't much controversial here if you get past the whole "ORM" hype/hate cycle.

>Encouragement and attitude doesn't force you to do anything. You still have to choose to use it, either well or poorly.

I don't believe in choices, people are flimsy. I believe in creating an environment that encourages good behavior.

> I don't believe in choices

...ok, people still make choices though, you're not controlling their minds. Perhaps educate your workforce so they make the right decisions by themselves, it's more effective and takes less effort than trying to coerce them through generalizations.

>...ok, people still make choices though, you're not controlling their minds.

No, but as a PM you can dictate they don't use an ORM.

Try reading the 2nd part of my comment.

I think the benefit of using an ORM outweighs the cost of making sure you don't have bad programmers on staff. We should be utilizing code reviews and proper source management anyway, right? This should be as simple as a senior/lead developer seeing someone iterating over 8MM rows and saying "don't do that."

Just a UI issue.

Compilers will warn you if you make stupid common mistakes. Not all mistakes, but many of the stupid common ones. If you make stupid common mistakes with an ORM, why doesn't the ORM warn you?

They require rather sophisticated analysis to detect. A compiler will detect trivial errors, but will not flag if you are using an O(n) algorithm when an O(1) could be used. Not yet, anyway.

I read the paper. A third, maybe half of the stuff the ORMs didn't detect looks simple to detect.

(Greybeards may remember how Perl's -W switch suddenly detected a frightful number of performance problems using mostly simple tests. Almost 20 years ago now.)

Maybe you are on to something. All the major ORM's are open source (AFAIK) so if you made a write-up of the common issues and how you suggest they could be detected by the ORM, I think it would be very well received.

I'd think you could do warnings on this simply by processing a log of the queries, looking for query patterns that match common poor code patterns.


    SELECT id FROM table [some conditions]
followed by a number of

   SELECT * FROM table WHERE id = ...
is one example of an anti-pattern suggesting that someone is doing an overly simplistic query followed by a loop. Similarly with signs of triggering loading of related objects instead of a JOIN.

Doing it on the emitted SQL would also make it quite easy to make it reasonably ORM agnostic - it doesn't need to be perfect, after all, so doing relatively crude pattern matching ought to be able to find at least the more basic problems.
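A minimal sketch of that crude pattern matching, assuming a captured query log and a made-up normalization rule (strip the trailing literal so identical query shapes compare equal):

```python
import re

# Hypothetical captured query log showing a classic N+1 shape.
log = [
    "SELECT id FROM cars WHERE owner_id = 7",
    "SELECT * FROM cars WHERE id = 1",
    "SELECT * FROM cars WHERE id = 2",
    "SELECT * FROM cars WHERE id = 3",
]

def looks_like_n_plus_one(queries, threshold=3):
    """Flag runs of identical single-row lookups that differ only in the id."""
    # Normalize: replace the trailing "= <literal>" with "= ?".
    shapes = [re.sub(r"=\s*\S+$", "= ?", q) for q in queries]
    run = 1
    for prev, cur in zip(shapes, shapes[1:]):
        run = run + 1 if prev == cur and "WHERE id = ?" in cur else 1
        if run >= threshold:
            return True
    return False

assert looks_like_n_plus_one(log)
```

Real query logs would need parameterized-query awareness and smarter fingerprinting, but even this level of matching catches the most basic loop-of-lookups pattern.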

The authors of the paper, if I’m remembering from reading it earlier today, built a static analyzer to help find these types of problems! Some of the work has already been done!

I use an ORM. I don't perform queries like that because the ORM makes it easy to build complex queries, joins, limiting the data returned etc, then execute them in one query, it's a convenience, not a straightjacket.

The problem is not ORMs, the problem is just people not thinking about the resources their query over some data takes, and trying to do things in memory that are better done in the database (as in your example), because databases are hard, and sql is not a particularly friendly language, so they take the easy way out. They'd do things like iterating over all instances just as much if not using a framework/ORM, because they understand their programming language and don't care to understand SQL, and also because they get away with it when there are 1000 cars in the fleet, and didn't anticipate having 1 million.

One of the biggest problems with ActiveRecord (the ORM in the article) is that it uses extremely obtuse names for extremely common methods. There are three different methods for finding how many things meet a given criteria[0], and none of them do the same thing. There are three different methods for loading data from an associated table[1], and none of them do the same thing either. None of the method names explain how the function is implemented or whether it's going to hit the database.

This is a problem that goes beyond the usual ORM antipatterns and bad database designs and becomes its own special kind of hell, because it makes code far harder to reason about and thus far harder to review, in a way that I don't remember encountering in Hibernate, Gorm, or anything else.

[0] count, length, and size

[1] includes, preload, and eager_load

This is mostly due to its popularity and various people needing different use cases. The long-time maintainer is now working on a much cleaner ORM for Rust, called Diesel[1].

1 - http://diesel.rs

Compile time SQL query checks... amazing! I've been trying to find a similar library for a while. Thanks for linking

jOOQ for Java does something similar. It's so good, I'm likely stuck with Java on the backend until I stop using RDBMs.

Hey Mat, thanks for spreading the love. Hope you never stop using RDBMS!

Fancy a couple of jOOQ stickers? :)

I wouldn't use the stickers, so they would be wasted. Just keep making jOOQ a great to use library. :)

And while I have you here...your blog posts are always great and very informative. It's interesting to see the differences between RDBMSs (I really only have experience with MSSQL/MYSQL some PG), and how they do one thing or another.

> Just keep making jOOQ a great to use library. :)

Will do! Thanks for the nice words about the blog.

> This is a problem that goes beyond the usual ORM antipatterns and bad database designs and becomes its own special kind of hell, because it makes code far harder to reason about and thus far harder to review, in a way that I don't remember encountering in Hibernate, Gorm, or anything else.

I feel like this brings up an annoying cultural problem in tech: developers hit intermediate skill levels and feel the need to proclaim their expertise by writing something in the you're-doing-it-wrong genre, but don't have the breadth of experience to realize that they're over-generalizing. So they go from “ActiveRecord has this problem” to “ORMs are bad” without asking whether anyone else has it better. I've read tons of similar posts where e.g. Java proved that exceptions, OOP, or static typing were bad, or Twitter proved you shouldn't use Rails.

Yes there are a few problems with ActiveRecord; it has been well used and has accreted lots of similar functions (a victim of its own success), and also it doesn't make it crystal clear when you are hitting the database.

IMO ORMs should be simple, unsurprising, and only have one or two methods which actually execute the sql and fetch data - they are there to make it simpler to build queries and map rows to objects, not to blur the line between fetching and manipulating data.

A few months back I was working with a group of framework programmers from a large consulting company.

They seemed to spend a solid third of their time googling for how to write code. Write code, hit a jam, google, find something on Stack Overflow, ruminate over what's found, copy-paste, and try it. They don't know their framework well enough to just write the code.

I'm more of an embedded software/hardware guy and manager, but I can sit down and write a pivot query in plain SQL without consulting any references.

I'm not saying these guys are doing something wrong. It may be that in the typical consultant work where you have existing old frameworks, new frameworks and new stuff coming in constantly, there is less value to master the tools that you are using. You write some glue, then move on.

That's a good thing, I think. Programmers should have a good understanding of general algorithms but it's up to the frameworks themselves to reduce the need to be "mastered". I don't want to spend time mastering a particular framework because that knowledge is not portable across languages.

I suspect it's a good thing too. It may also be the only way to scale programming to larger groups.

You can differentiate the workforce. Programming by copy-pasting example code within frameworks may allow skipping the requirement to understand general algorithms.

Less required expertise means less pay and cheaper products. You hire 10 low-paid, easily replaceable code monkeys who slap together pieces of software from ready-made components; it's like assembly work. Then you hire one guy who knows things to supervise them and solve the problems when they can't figure them out. It's like blue-collar assembly workers and engineers.

I think part of this is the industry and part of it is the technology. Entire languages on the web are basically meant to be glue. Embedded usually has different requirements from web: memory and performance are real problems you run into fast if you don't know what you're doing. On the web, by contrast, memory and performance are a problem of economics; everyone is being served from a cluster that can be scaled if the software is shoddy.

That has nothing to do with ORMs. I've seen plenty of people write code like:

  for each r in result
    if r.x > 12
Exactly the same thing without any ORM. If anything, ORMs should improve performance for novices, since they make it a lot easier for people to write better queries that run on the database.

My record was a 5 order of magnitude performance improvement on a site which a big consulting firm had been working on for most of a year, where page rendering times were somewhere north of 20 minutes (the server’s hard timeout).

None of their developers knew about WHERE constraints but they did know how to join tables so they were looping over hundreds of millions of rows in classic ASP using essentially the code you have above, with a bunch of unnecessary type conversions cargo culted into the inner loop (IIRC string to int to string).

Added business lesson: we billed double for a rush job but since it was only about 20 hours to fix the code and build out the remaining features (about half of the site), the original contractor still made considerably more. Our sales guy wasn’t canny enough to realize that he should have offered to do it for, say, a third of the other company's price.

Hey, I think those same consultants worked on the main website for a company I used to work for :)

Seeing how much business people like that got was definitely an educational moment about the efficiency of the IT market ;-)

I've also seen:

    SELECT * FROM DATA WHERE x > 12

where there's no corresponding index. There's also the infamous N+1 query pattern:

  // get list of ids
  for each id in ids
    r = SELECT * FROM DATA WHERE id=:id
    if r.x > 12
IMHO, both are signs of not fully understanding what the database does for you. In the same vein as the "learn JS before frameworks" argument, I'd argue devs should at least learn and understand SQL, indices, etc. before using ORMs.
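As a sketch of the fix for that N+1 loop (sqlite3 as a stand-in, toy data): collapse the per-id queries into one round trip and push the filter into SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, x INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?)", [(1, 5), (2, 20), (3, 30)])

ids = [1, 2, 3]

# The N+1 pattern: one query per id, filtering in the app.
slow = [r for i in ids
        for r in conn.execute("SELECT * FROM data WHERE id = ?", (i,))
        if r[1] > 12]

# One round trip: push both the id filter and the x filter into SQL.
placeholders = ",".join("?" * len(ids))
fast = conn.execute(
    f"SELECT * FROM data WHERE id IN ({placeholders}) AND x > 12", ids
).fetchall()

assert slow == fast == [(2, 20), (3, 30)]
```

Same result, but the second form makes one query instead of N+1 and lets an index on x do the filtering.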

> I'd argue devs should at least learn and understand SQL, indices, etc.

This was all dev 101 when I was coming up (yikes, almost 20 years ago now).

Yeah, I wish it were common knowledge...although, if you ask 10 different devs, you'll get 10 different answers on what "ought" to be included in a basic CS / developer education.

(You don't know the minutiae of pointer arithmetic!? You're not familiar with Docker, Vagrant, or AWS Lambda!? You can't construct a sed / awk one-liner with your eyes taped shut!? You don't know about concurrent skip lists!? You've never completed SICP or read Purely Functional Data Structures!? And so on.)

I agree that this has more to do with the developer than the ORM. I once watched someone write a foreach loop around a set of millions of records, to update their "BatchDate" property to the current date / time. To add insult to injury, to get the current time, they issued a separate query to the DB, in the form of "SELECT GETDATE()", then injected the result - as a string - into the update statement.

To be fair though, ORMs do move us one layer further away from the actual DB, so people tend to be even less likely to understand how to write ORM calls which result in performant SQL calls.

The ORM makes it more likely to do stuff in the application instead of the DB because many queries are hard or impossible to represent in an ORM fashion.

Additionally, the ORM makes it much harder to see what is going on under the hood. For example you don't know if a value of an object is stored in the main table or lazy loaded from a related table. So you have no clue if echo "$user.name lives in $user.city" results in 0 sql queries or one or two.

> The ORM makes it more likely to do stuff in the application instead of the DB

Again, this doesn't match my experience. A mediocre .NET developer who can access the database with LINQ is far more likely to run their queries on the database than a mediocre .NET developer who has to try to write SQL queries by hand.

And the truth is that modern ORMs are often pretty clever with their optimizations; in many cases the ORM will outperform an average developer doing the obvious thing in SQL.

Also, ORMs have enough visibility into your code to tell you off for doing stupid things. There are libraries working with ActiveRecord (and likely others too) which will warn you that you could skip a specific query by adding .eager_load(), or flag the 1+N queries you're doing needlessly.

This can be done on SQL itself, but it would be much harder.

> ORM makes it much harder to see what is going on under the hood

I've never in the last 20+ years seen an ORM that doesn't allow you to log the SQL queries with a single configuration option.

Heck, I'm disappointed in any ORM that doesn't log queries by default in development mode!

With an add on Django even displays them in the browser!

It's trivial to find optimizations in django debug toolbar..."oh yeah I should use select_related"

If you have no idea what you are doing with an ORM, you probably won't be able to write decent SQL anyway.

This is only true if you let the ORM design the schema for you. I would hazard a guess that most people only use ORMs for queries (including mutating ones). In which case you very much do know which table a field is in.

Even if you use the ORM to design the schema for you, an ORM isn't going to just willy nilly put fields in tables by some confusing whim. Even the most opaque ORM conventions for field naming tend to have relatively easy ways to spot that MyClass.propertyName binds to table my_class and field property_name, or whatever is the case.
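The convention is mechanical enough to sketch. The exact rules vary per ORM, but a rough Python approximation of the camelCase-to-snake_case mapping:

```python
import re

def to_snake(name):
    """Approximate the common ORM convention:
    MyClass.propertyName -> table my_class, column property_name."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

assert to_snake("MyClass") == "my_class"
assert to_snake("propertyName") == "property_name"
```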

ActiveRecord has find_by_sql, which I use for medium to complex queries. They are easier to write and understand in SQL than in Ruby. The problem is that there are many developers who don't know SQL. I've found them in Ruby, Python, and Node.js projects. They write whatever they manage to code with the ORM and don't understand the implications for database performance.

An ORM doesn’t have to be slow. Most can source their data from custom queries or stored procedures.

Not using an ORM requires writing lots of code to do what the ORM would otherwise do. Is that really appropriate today? Should we not take advantage of the available CPU/RAM? I say ORM is the correct choice in many cases. Not always.

ORM doesn’t have to be slow, but I have seen a lot of implementations like the parent comment. For example, doing a foreach over 2000 rows where an UPDATE would be much, much quicker. ORMs help a lot with developing software, but you need to profile and optimise - in some cases bypassing the ORM.

> I say ORM is the correct choice in many cases
I'm still undecided about this. Would be fun to compare some actual approaches. Which is your favorite ORM?

Currently I’m trying to use Dapper if possible. It allows me to write efficient queries manually and just maps the results to objects.

For simple queries though (single-table), Entity Framework is just as fast.

Because I’m mostly working on dashboards and stuff, writing to the database isn’t much of a concern.

What is the code to get the number of cars with expired insurance when using Dapper?

Something along the lines of

    db.Query<Car>("SELECT * FROM Cars WHERE InsuranceEndDate < @Date", new { Date = DateTime.Now });
/edit: Sorry, missed the "number of". Well, you get the idea. It'd use 'QuerySingle' instead.

That does not look very ORMish to me. It might return objects. But isn't one point of an ORM that you tell the ORM which data you want and not what SQL query to send to the DB?

Also a Dapper fan. I like mapping into objects without code generation, and complicated queries are easier to write than LINQ if you're familiar with actual SQL. The only thing I really don't like is bulk inserts, as it will do one statement per object. You can still use both Dapper and LINQ at the same time, but it might confuse others.
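Not Dapper, but the cost of one-statement-per-object versus a batched insert is easy to illustrate with Python's sqlite3 and a toy table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
rows = [(i, f"item-{i}") for i in range(1000)]

# One-statement-per-object, roughly what a naive mapper does under the hood:
#   for r in rows:
#       conn.execute("INSERT INTO items VALUES (?, ?)", r)

# Batched: one prepared statement, iterated driver-side.
conn.executemany("INSERT INTO items VALUES (?, ?)", rows)

(count,) = conn.execute("SELECT COUNT(*) FROM items").fetchone()
assert count == 1000
```

Against a remote database the difference is even starker, since each per-object statement pays a network round trip.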

> Which is your favorite ORM?

jOOQ. Not really an ORM, but handles the tedium of mapping results to an object while letting me write type checked SQL. IMO, it is the best solution for dealing with an RDBMS. I wish every language had a jOOQ equivalent.

“Today's programmers”: I'm sorry to disappoint you, but this isn't new. My father (a DB engineer, now on the verge of retirement) has been complaining about that for almost two decades!

Just as a reminder, Hibernate is 17 years old :).

Also, modern ORM frameworks give ways to iterate on the data from the programmers language, but translated directly to SQL. See what [Diesel](http://diesel.rs/) does, for instance.

The paper talks about other issues, which come from negligence or a lack of understanding of the performance costs of the queries. But this performance impact could also happen in a SQL-only environment (not using a framework won't help you create indexes, or add pagination to your queries).

That’s how I have seen it done in Perl, PHP, Django, etc as well. It’s not an ORM thing it is a naive programmer thing.

The ORM makes it easier to do things like Class.filter(insurance_end_date < today_date)

But hey what do I know, I am not berating the youf of today so I don’t belong in this skit.

    Class.filter(insurance_end_date < today_date)
Which ORM is that? It would be cool to compare some approaches to common problems.

ASP.NET MVC4 also has LINQ, so you can do stuff like Class.All(x => x.attribute == value).ForEach(foo => foo.method())

Django has similar functionality through Q and F expressions.

Most ORMs have this functionality of specific language constructs to allow complex queries to be expressed without writing SQL, just with slightly different semantics.

The Django ORM for example.

You can even traverse relations and still use the same syntax: https://docs.djangoproject.com/en/2.0/topics/db/queries/#loo...

Diesel has something similar (except that it can't overload <; you can overload < in Rust, but the type doesn't match what's needed, so you need to write it as insurance_end_date.lt(today_date)).
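A toy Python sketch of the operator-overloading trick these libraries use (all names hypothetical): overload a comparison operator on a column object so that comparing it yields a SQL fragment plus bound parameters rather than a boolean.

```python
class Field:
    """Toy column expression, Diesel/LINQ style: comparisons emit SQL."""
    def __init__(self, name):
        self.name = name

    def __lt__(self, other):
        # Return a parameterized fragment, never interpolate the value.
        return (f"{self.name} < ?", [other])

insurance_end_date = Field("insurance_end_date")
sql, params = insurance_end_date < "2018-06-28"
assert sql == "insurance_end_date < ?"
assert params == ["2018-06-28"]
```

A real query builder would wrap these fragments in a full SELECT and compose them with AND/OR, but the core mechanism is just this operator overload.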

C++ has a similar thing with sqlpp11: https://github.com/rbock/sqlpp11

Looks similar to Laravel’s Eloquent, which is inspired by Rails.

All the ORMs I have used can easily do the equivalent of this

     select count(*) from cars where condition;
If programmers are doing what you say they are, then the programmers are the problem, not the library.

Furthermore most ORMs (certainly any that I would consider using!) allow escaped SQL to be used - and if the query gets much more complicated than a couple of where clauses I consider using this feature.

A decent ORM used well allows programmers to program faster on the simple stuff but still write fast code for the complex stuff.

ORMs are terrible, but you can make the same mistakes in the article without an ORM, or with a query builder.

I think the article alludes to much more important problems like querying in a loop, querying the same information again just because it's not in function/object scope
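The query-in-a-loop (N+1) problem and its fix can be sketched with the stdlib sqlite3 module and a made-up schema (table and column names are purely illustrative):

```python
# N+1 illustration: one query per row vs. one joined query.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cars (id INTEGER PRIMARY KEY, owner_id INTEGER, model TEXT);
    INSERT INTO owners VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO cars VALUES (1, 1, 'civic'), (2, 2, 'leaf'), (3, 1, 'golf');
""")

# Anti-pattern: a query per row -- N+1 round trips to the database.
naive = []
for owner_id, name in db.execute("SELECT id, name FROM owners"):
    count = db.execute(
        "SELECT COUNT(*) FROM cars WHERE owner_id = ?", (owner_id,)
    ).fetchone()[0]
    naive.append((name, count))

# Fix: one joined, aggregated query does the same work server-side.
joined = db.execute("""
    SELECT o.name, COUNT(c.id)
    FROM owners o LEFT JOIN cars c ON c.owner_id = o.id
    GROUP BY o.id ORDER BY o.id
""").fetchall()

print(naive)   # [('alice', 2), ('bob', 1)]
print(joined)  # [('alice', 2), ('bob', 1)]
```

With an in-memory database the loop version is merely wasteful; over a network, each iteration adds a full round trip.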

A lot of the mistakes I see happen because developers don't learn SQL, use it wrong, then assume it's slow, then reach for NoSQL, then reinforce each other's belief that NoSQL saved their performance.

Let's not make this a conversation about ORM or not, but about how we make sure developers understand how to properly use their relational databases, which for the last decade or so have been painted as old & cruddy compared to sexy NoSQL.

This thread demonstrates that ORM is still (one of several of?) our field's Vietnam war... http://blogs.tedneward.com/post/the-vietnam-of-computer-scie...

On top of the data/frameworks issue there's the one of data locality. Frameworks make the tradeoff decisions for you, but not always the right ones since the right tradeoffs are application-dependent. In one context using a cluster of commodity machines that do a mix of compute for your service and store and shuffle data around is a good idea. In another context a few beefy servers with the beef divided up between the app servers and database servers depending on where most of the compute most efficiently takes place is a better idea. (Though even with poor performance design, hardware is still good enough that a lot of things run just fine by having multiple services on one modest machine. That doesn't stop lots of programmers from reaching for frameworks that demand easily scaleable multi-machine systems even when there's no need and won't ever be need.)

In any case consistency matters: SQL strings are just one form of shipping code to the data to compute remotely instead of retrieving the data and computing locally. When you bring in an ORM, inconsistency is bound to follow with lots of computation that could have been in a query or stored procedure now living in the ORM framework and the application itself.

I can at least understand many people's annoyance with stored procedures though. Most DBs procedural extension languages suck badly enough that shipping JavaScript strings over to NoSQL stores can seem like a big improvement, and various levels of SQL standard conformance is all you typically get to use from the broader design pool of declarative languages.

I love to write SQL queries and even use them extensively in my code. However, almost everyone I talk to resists this, and warns me that "one day you will regret..." It gives me an uneasy feeling that I actually might, though it hasn't happened yet.

I think one of the problems with SQL queries is that as far as your application is concerned they are just strings. There's no typing information or even syntax checking. Also, if you do something like 'select * from', the columns returned aren't stable across schema changes.

If you scatter these throughout your code and then the database schema changes, you have a hell of a refactoring job to make sure everything still works. One advantage of an ORM is that if you keep your objects in line with your database, the generated SQL will stay correct.

Personally, I am more than happy to take that hit. I think it is a price well worth paying in order to have optimised queries that do exactly what I want them to do. To make it easier for myself, though, I will always try to keep my queries in one place in the code. Then all my code needs to know is that it is getting clients_older_than(32) or whatever.
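A minimal sketch of that convention, using stdlib sqlite3 (the clients table and the function name are invented for illustration): the raw SQL lives in exactly one named function, so a schema change means updating one module, not grepping the whole codebase.

```python
# Hand-written SQL kept behind a named function; callers never see SQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO clients VALUES (1, 'ann', 41), (2, 'ben', 25), (3, 'cho', 37);
""")


def clients_older_than(age):
    """The only place this query lives in the codebase."""
    rows = db.execute(
        "SELECT name FROM clients WHERE age > ? ORDER BY age", (age,)
    )
    return [name for (name,) in rows]


print(clients_older_than(32))  # ['cho', 'ann']
```

If the clients table is later renamed or restructured, only this one function needs to change.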

You can get the best of both worlds by using e.g. jOOQ in Java (it allows you to write e.g. db.select(MY_TABLE.MY_COL).from(MY_TABLE), where those values are generated from the database, so they exist and are of the right type).

It maps 1:1 to the SQL statement that gets executed so there's no magic e.g. extra n+1 queries being introduced without you noticing.

But if you change your schema and re-generate, you get immediate compile errors showing where you're referencing something that's now been deleted, so it's not fragile like putting SQL in a String in your code (where, if the schema changes, the compiler can't help you).

Came here to endorse jOOQ as well. It alleviates a lot of boring boilerplate like ORMs do, but you neither trade off control nor risk wrong use silently causing problems.

I wasn't familiar with jOOQ before. It looks pretty comprehensive.

Honestly, jOOQ is one of the killer libraries for Java that keep me using it for backend code.

> There's no typing information or even syntax checking.

Intellij is pretty good at highlighting and pointing out typo errors. The code will still compile though.

> no typing information

Given the popularity of javascript, I thought this would now be considered a feature ;)

Refactoring scattered string-only queries isn't that bad if you're confident with tools like grep/ag/sed and so on and maybe have a bit of foresight in naming things not super generically. You can even add logging at the junction between app query callouts and the DB to log the full queries and the stacktrace of where they came from if you suspect you may have missed things in your text search due to people constructing query strings incrementally with other variables concatenated in. And as you mention having the convention of centralized named procedures (whether stored or in the app itself) mitigates this and other problems since there should just be one place to check.

To me the line of argument around lacking static type checks always felt like FUD, but maybe it's best countered with counter-FUD...

If you've opened up your DB layer to accept strings (that you are supposed to build with a SqlBuilder statically referencing column and table names) you still have no guarantees that everything still works because some code somewhere might have just used a handwritten String instead. In other words you still need tests.

The static type checking value only helps against schema changes that rename or remove things, which is generally pretty rare, and depending on the level of autogeneration and query mapping might not even help if the column type changes. It doesn't help when the semantics change. For example (much more commonly) the introduction of a new column that is expected to be filtered on and/or required to be set a non-null value in inserts. So the static references are mainly reduced to being a mechanism to make finding users of a table easier, and hope that whoever is making the schema changes is going to look around for those users and update them accordingly. (When you don't own the table, as is usually the case in large software with many teams, the table owner similarly doesn't own your code, so that's kind of a vain hope. The best solution is what is done in open code with unknown consumers -- versioning. Stop renaming/deleting/changing the semantics of things, just provide new things, under different versions or namespaces if they really need to share names.) But you can accomplish this to the same effectiveness by creating a static reference to the table near where you execute queries on it, and use easy to read handwritten strings for the rest. In my experience though many queries are trivial enough that a SqlBuilder-esque pattern isn't much overhead, it's not a hard hit most of the time, and more complicated ones may belong as stored procedures if you've already invested in that direction.

If you go the full ORM route, which it sounds like you aren't suggesting since you mention optimized queries that do exactly what you tell them to do, the table owner may graciously update the central object builder to set a non-null default value for everyone (or specify one in the table def), which would maybe stop things from breaking immediately, but maybe wouldn't actually stop breakages (especially on the select side where data is retrieved that under the new semantics was meant to be filtered), so end users are still on their own for whether this new column matters to them or not. And that's just one type of schema change that's not a simple rename or column removal, there are many others.

Don't let the dark side blind you. Managing data is complex. ORMs add more complexity while making data look easy. IMHO the biggest problem with most ORMs is that they satisfy the programming platform's and coders' aesthetics.

Writing better code makes programs more robust and reduces maintenance cost. From a management point of view, the boss may think that your project was easier than the project given to the other team that butchered their work.

So long as you're using prepared statements, you should be okay.

What difference does using prepared statements make?

You can avoid SQL injection. It makes it harder to shoot yourself in the foot, though injection is still possible if your prepared statement includes a string-interpolated variable.
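A quick illustration of the difference with stdlib sqlite3 (throwaway table, contrived malicious input): parameter binding treats the input as data, while string interpolation lets the input rewrite the query.

```python
# Parameterized query vs. string interpolation under a classic injection input.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (name TEXT, is_admin INTEGER);
    INSERT INTO users VALUES ('alice', 1), ('bob', 0);
""")

evil = "' OR '1'='1"

# Interpolated: the input becomes part of the SQL and matches every row.
injected = db.execute(
    f"SELECT name FROM users WHERE name = '{evil}'"
).fetchall()

# Parameterized: the input is compared as a literal string, matching nothing.
safe = db.execute(
    "SELECT name FROM users WHERE name = ?", (evil,)
).fetchall()

print(len(injected))  # 2 -- every row leaked
print(len(safe))      # 0
```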

Yes, good point.

You get an abstraction between your code and the DB: you can change the underlying tables, but the interface remains consistent, requiring no changes to the application. There are security benefits too: it's easier to prevent SQL injection, and you can grant the app user permission only to execute stored procedures, so no arbitrary queries can be run.

The only way you'll regret it is if you type the query directly into an editor which doesn't support syntax highlighting or doesn't validate the query before you run it and it blows up in production. This is 'typo crashed my code' kind of stuff.

For 95% of people good at computers, the solution becomes really obvious after it happens once.

You might regret it a bit more if you do something like a filesystem read to get the query.

There's your regret. It hurts so much, doesn't it? :)

That's one thing I love about Gemstone/Smalltalk.

In Smalltalk, you would write such a query as:

   (cars select: [ :car | car insurance endDate < now ]) size.
(There probably is a shortcut for just counting the elements, though admittedly this form has to touch each car to evaluate the query.)

So what Gemstone does is that when cars is a DB collection, it interprets the block and turns it into a query. So the code for iterating in-memory and the code for doing an optimized query is the same.

In Ruby, using Sequel:

    DB[:cars].where { insurance_end_date < Time.now }.count
(replacing "Time.now" with Sequel.function("NOW") if you want server side, or whatever date/time formatting function you want that'll return time/date in your desired format client side; client side it'll be evaluated once)

I think most ORMs can do a reasonable job at this in languages where you have sufficient flexibility in overloading behaviour.

Anyone who does this is just bad. I’d be surprised if they could do SQL without an ORM either. I really have to question the value of a CS degree when there are all of these people who have them and still don’t seem to realize that running database queries in a loop is a bad idea. And they’ll still get hired because they’ve successfully memorized some data structure trivia that they’ll almost certainly never use.

This isn't what happens when you use an ORM, this is what happens when you have a JavaScript code monkey writing data access code and then slapping the word "Enterprise" on it.

The places I've worked at that have written true Enterprise software, whether consulting, publicly traded healthcare companies, banks etc, have had extremely knowledgeable DBAs writing stored procedures for data access. The ORM usage was reserved for modifying the results of stored procedures or small things like account management. Nothing truly critical to the business was handled through an ORM directly pulling data from tables.

The problem isn't "kids these days." The problem is relational databases. It's a profoundly non-intuitive and fundamentally broken model. Developers shudder away from SQL precisely because it is so alien from the problem they're trying to solve.

As for performance, note that even when you know exactly what you're doing it is hard to wring good performance out of relational databases. This is why companies spend hundreds of thousands of dollars a year to hire "Database Administrators" to design and "tune" their database. (What other piece of software requires a highly paid, full time expert?) The most complicated databases are so complicated that we get "Junior DBAs" and "Senior DBAs" and multiple levels of certification.

And let's not forget what happens at extreme scales. At the lowest latencies and the largest data sets relational dbs are simply impossible to use. This might never be a problem for most businesses who are processing a few hundred messages a second (if that) but it should be on the mind of any startup that hopes to one day have millions of customers.

In the long run memory will become cheaper, faster and persistent. (Let us pray.) When that happens most everybody will abandon the big mess that are relational databases and just manipulate objects in memory as the gods intended. Then the relational model just becomes something to scare grand kids with.

Interesting. I have had a couple of people in interviews (who listed SQL/databases on their CV) write code a lot like that: pulling in two whole tables, joining them in memory, and then filtering the joined results.

It is possible I was overly harsh in my judgement, as I assumed it meant no real familiarity at all.

People do this because it’s easier, takes less time, and is within the paradigm their head is already in all day.

We should fix our abstractions instead of telling people they don’t understand data. The latter is fine every so often, but doesn’t scale and isn’t even true most of the time.

Or maybe we should just nut up and tell people they don’t understand data?

Why not both? Truly, we should work to have better abstractions. We shouldn’t glorify making our own lives more difficult. And we shouldn’t glorify ignorance either! It’s not an either/or dichotomy here

What a stupid comment. Your point has absolutely nothing to do with ORMs and only shows your utter inexperience with them.

You could replace everything you said with using SQL directly and just doing a select * from cars

This comment breaks the site guideline against calling names in arguments. Could you please read https://news.ycombinator.com/newsguidelines.html and stick to the rules, regardless of how wrong or annoying another comment might be?
