Database development mistakes made by application developers (stackoverflow.com)
154 points by phalien on Dec 13, 2010 | 69 comments

I never understood the groans of dismay from other developers when the subject of writing SQL comes up. I'm a dyed-in-the-wool application developer, but I love writing SQL.

Unlike HTML & CSS, which seem designed to spread into a greater morass of obtuse and interdependent hacks with every trial-and-error iteration, SQL has always felt elegant to me.

It takes just as much time to craft an SQL query as it does to translate a design into HTML/CSS, but 99% of the time I can sit back and be proud of my handiwork; even re-using it later.

On the other hand, I have never completed a webpage without looking at the resulting markup and thinking, "What an ugly goddamn kluge." SASS and JQuery and HTML5 and CSS3 don't make a difference. Maybe I'm just bad at it, though.

I eschew ORMs at every opportunity, and it is mysterious to me why "writes SQL for you" is such a common "feature." I would liken it to a tool which "paints paintings for you" or "designs your app for you." Why would I entrust a computer program to do something which requires the creative spark of a human?

ORMs are great for the 99% of your DB interaction which basically boils down to "Fetch a record from this table and put the result in an object", or "Fetch these records from that table and put the result in an array of objects".

However, for that remaining 1% of complicated queries, a good developer can do a much better job than any ORM.

I like the relational algebra underlying SQL. I don't like the clunky syntax, or the weak type system, or the lack of a good composition mechanism. Agreed that HTML and CSS are even worse!

Having written a SQL->SQL compiler (*), I can state that the language is a mess. I still like it better than the NoSQL tools I've used, though.

(*) The company's product was, broadly speaking, a specialized analytics engine, with temporal features. The analyses were written in a dialect of SQL-92 (just SELECT statements, plus extra features for temporal logic and currency conversion), which I had to compile to PL/SQL and T-SQL to run as stored procedures.

By "lack of a good composition mechanism," are you referring to the trappings of writing and dealing with SQL syntax itself? As in, "Ugh. I decided to change this join into a subquery, and now my @#$#% parentheses are imbalanced!"?

If so, I agree strongly, but I feel like that problem could be solved by better error reporting on the part of the compiler. I dread SQL errors now, at least with Oracle (which I work with the most often), because they almost never tell me what I need to know. To a certain extent it's essential to the abstraction of data reporting itself, but certain errors are just comically vague.

Not really, I'm referring more to the ability to factor out common patterns and compose them together to make queries, etc.

Or the lack of closure under composition. It really f-ing grates.

I think this has more to do with people hating to write CRUD vs crafting queries for a report of some sort. I can see the argument for thinking of SQL as wasted brain power when you're just doing simple interactions.

I agree as far as Create, Update and Delete are concerned.

It's the Read part that gets me. I'm all for frameworks that map HTML forms to model object properties to database fields so you don't have to write boilerplate, but I don't think you can treat "Read" like the other three. For the life of me I cannot articulate my reasoning right now, elegantly or otherwise, but ORMs brush a lot of complexity under the rug in that respect.

For what it's worth, here are a few from me:

Indexes. Everybody says it, but I'm still surprised at the number of devs who have next to no idea about indexes and how they work.

A few guidelines: every index adds a write penalty, so don't just add them to everything. If you've got a compound index, it will only be used if the query also uses all the fields in the index (if you build an index on surname, firstname and dob, but execute select * from users where firstname="Bob" and surname="Jones", the index won't be used). Benchmark the difference each index makes, and stand back in awe at the improvement a few well-placed indexes can contribute.

[Edit: as pointed out below, my experience is well out of date and it looks like there's a number of database engines where the compound index thing doesn't apply]
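Rather than memorizing per-engine rules, it's worth asking the engine itself. A minimal sketch using Python's built-in sqlite3 as an illustrative stand-in (not any of the engines discussed here; the table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (surname TEXT, firstname TEXT, dob TEXT, email TEXT)")
conn.execute("CREATE INDEX idx_name ON users (surname, firstname, dob)")

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether an index was chosen for the query.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Constraining the leading column of the compound index: the index is used.
print(plan("SELECT * FROM users WHERE surname = 'Jones'"))
# Skipping the leading column: a full table scan instead.
print(plan("SELECT * FROM users WHERE firstname = 'Bob'"))
```

The same two EXPLAIN calls against your real MySQL/Postgres/Oracle box will tell you what your version actually does.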

Learn to think in sets. Most devs are used to operating on discrete pieces of data; we're not used to thinking in sets, and so writing efficient SQL doesn't come naturally. Put in the effort and learn it and you'll be well rewarded. SQL is built around sets of data, and can almost assuredly do stuff faster than your code can. Hint: anytime you find yourself getting the result of a query, iterating over the result set and executing another query for each result, you're doing it wrong.
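To make that hint concrete, here's a toy contrast between the row-at-a-time loop and the set-based version, sketched in Python with sqlite3 (schema and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'Bob'), (2, 'Alice');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# The "doing it wrong" version: one extra query per user (the N+1 pattern).
totals_slow = {}
for uid, name in conn.execute("SELECT id, name FROM users"):
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?", (uid,)
    ).fetchone()
    totals_slow[name] = total

# The set-based version: one query, and the database does the work.
totals_fast = dict(conn.execute("""
    SELECT u.name, COALESCE(SUM(o.total), 0)
    FROM users u LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
"""))

print(totals_slow == totals_fast)  # → True
```

Same answer, but the second version is one round trip instead of N+1.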

Learn how your ORM works. If you're using an ORM (and most of us do), learn how to use it efficiently. This is where having a good working knowledge of SQL helps, because you can then find out how to help your ORM create good SQL. If you're using ActiveRecord, understand how to correctly use :conditions, :include etc, and what they're doing behind the scenes. Ideally you want your ORM to make one call to the DB, and to retrieve everything you need in that one call (calls are expensive, even for very small queries). Poorly written code causes the DB to get hit constantly. Not good.

> (If you build an index on surname, firstname and dob, but execute select * from users where firstname="Bob" and surname="Jones", the index won't be used).

I don't believe that's true, at least not for all RDBMS. I just checked the docs for MySQL and PostgreSQL and they seem to agree with me.

http://dev.mysql.com/doc/refman/5.0/en/multiple-column-index... | "The name index is an index over the last_name and first_name columns. The index can be used for queries that specify values in a known range for last_name, or for both last_name and first_name."

http://www.postgresql.org/docs/8.2/static/indexes-multicolum... | "A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns"

If we want to be 100% correct, indexes don't just hurt writes, and they don't hurt all writes either. Two examples (there are more):

- If I do an UPDATE query, and I don't modify indexed columns, the index doesn't need to be modified.

- If I have indexes on (last_name, first_name) and (last_name), the (last_name) index is redundant. Some queries may use the last_name index when I don't specify the first_name, but they could have used the other index. Now I have bloated the set of pages that need to be memory resident (affecting reads as well).

I'd say that suggesting there's a "penalty" at all implies a naive hardware setup[1].

Total system I/O is a finite resource, but that limit is hard to hit if indexes are on separate spindles[2] from the data and from each other. This costs substantially more up front, but the penalty disappears at scale.

Perhaps, then, as for the OP's original topic, one mistake is not hiring someone who knows these performance strategies. Another is using an RDBMS that doesn't support them.

[1] which, granted, is far and away the norm

[2] or sets of spindles, or SSDs

It's not totally redundant. The index on last_name alone will be more efficient than (last_name, first_name) because the index will be able to contain more keys per index page. The difference can be significant if you're doing something that does an index scan on this column.

This is called a "covering" index, e.g. the index on lastname+firstname also "covers" queries with a predicate on just the last name. Most relational databases will do this, but it's always good to be sure in your particular case.

It's a composite index. A covering index would be this:

  ALTER TABLE staff ADD INDEX idx (last_name, first_name);
  SELECT first_name FROM staff WHERE last_name = 'n';

A covering index is something very special - all results are returned from the index. Not all databases do this - Postgres does not, because it does not want to maintain all the versions (see MVCC in wikipedia) in the index.

Indeed you are correct.

[Edit: removed defensive response. I've updated my original comment to remove its absoluteness]

After reading the first example at http://www.postgresql.org/docs/8.2/static/indexes-intro.html, I'm a little confused.

I thought that operations using primary keys were always fast, that they were somehow always indexed.

I thought this would be fast:

  SELECT content FROM test1 WHERE id = constant;
and this would be slow because it wasn't on a primary key:

  SELECT content FROM test1 WHERE content = "xxx";
EDIT: this comment was supposing that 'id' was actually of type "serial primary key".
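For what it's worth, this is easy to verify against a query planner. A sketch with Python's built-in sqlite3 (using INTEGER PRIMARY KEY as a stand-in for "serial primary key"):

```python
import sqlite3

# The test1 table from the Postgres docs example, rebuilt in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1 (id INTEGER PRIMARY KEY, content TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN shows SEARCH (index/key lookup) vs SCAN (full scan).
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Primary-key lookup: resolved via the table's key, no scan needed.
print(plan("SELECT content FROM test1 WHERE id = 1"))
# Unindexed column: the whole table is scanned.
print(plan("SELECT content FROM test1 WHERE content = 'xxx'"))
```

So yes: the first query is fast, and the second is slow unless you also index content.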


In a primary index, the leaf nodes are the actual data records, so when you've followed the index to the leaf you have the exact physical location on disk that you need to get the data from. If you do a query that retrieves multiple records, for example

  SELECT * FROM test1 WHERE id > constant;
...then that translates very nicely into a bulk read from the underlying datastore.

Contrast that with a secondary index, where each leaf node essentially contains the physical address of the record, so for each record you're returning, you have to make an additional random seek to the underlying datastore to get the entire record. This is why primary indexes are faster than secondary indexes, in general. Various caching schemes smooth out the difference though.

Another good tip to keep in mind when writing indexes is that you can have covering indexes. If you often do a query of the type

  SELECT a, b FROM table WHERE c = ... AND d > ... ORDER BY e, f
If you then have an index over (c,d,e,f,a,b), that index can cover the entire query, i.e. you can return the records by just looking at the index, you don't need to fetch the actual records.
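SQLite, for one, will say so explicitly in its query plan; a small sketch of exactly this case (column names taken from the example above, table name invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a, b, c, d, e, f)")
conn.execute("CREATE INDEX idx_cover ON t (c, d, e, f, a, b)")

# Every column the query touches is in the index, so no row lookup is needed;
# SQLite reports this as a COVERING INDEX in the plan.
detail = " ".join(row[3] for row in conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT a, b FROM t WHERE c = 1 AND d > 2 ORDER BY e, f"
))
print(detail)
```

Other engines expose the same information under different names (e.g. "Using index" in MySQL's EXPLAIN output).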

> In a primary index

That's a clustered index. In some data stores primary keys are always clustered (innodb), in some they're not (myisam), and some even let you optionally cluster the table data on a non-primary index (sql server).

> Contrast that with a secondary index, where each leaf node essentially contains the physical address of the record, so for each record you're returning, you have to make an additional random seek to the underlying datastore to get the entire record.

Not necessarily: often the non-leaf nodes of the clustered index are stored somewhere completely separate from the leaf nodes and end up holding the address of a leaf-node page. In that case seeks may not be any better in principle, but scans on the clustered key where you need lots of non-key data from the data page are. If you do not need many non-key columns from the data page, scans on the clustered key can actually be slower, because the key columns are stored much further apart rather than all next to each other.

On the other hand, the secondary index may not end with row addresses but instead with primary keys, which then must be looked up in turn; in that case it logically cannot be any faster than a primary key for a seek to a data page.

> This is why primary indexes are faster than secondary indexes, in general.

Clustered indexes are actually often not faster for many operations. A secondary index is often much faster if it covers the fields selected by the query without needing to do a bookmark lookup to data pages, and scans on secondary indexes are generally faster than clustered-key scans, again moderated by whether you need to do bookmark lookups or not.

A primary key is just a unique index. All of its other properties and uses have nothing to do with the underlying implementation. It may be slightly faster for some types of queries than a "secondary" unique index if the RDBMS clusters the data by the primary key and you are using traditional hard drives. In your example, if you indexed the content column, the lookup on it would be fast. As a bonus, make sure to read up on composite indices. They are the bread and butter, so to speak.

I don't see any primary keys in that example, just one column that has an index (fast to look up on) and one that doesn't (slow to look up on, but they don't do that in the example).

My bad, I commented thinking 'id' was of serial primary key type.

Pedantically, SQL is built around bags. It's one of the worst misfeatures.

Good additions. I would also point out that if devs mess with indexes they definitely need to know the difference between clustered and nonclustered. I've seen atrocities due to a simple misunderstanding of this (guess what happens when you confuse the two and implement the opposite!)

Use non-clustered, right? clustered is just for massive operations with distributed databases and junk, right?

Clustered is directly related to the physical order of the records. Non-clustered is a logical order that uses pointers to the physical location. Only one clustered is allowed per table.


The concept also does not exist in some databases. Oracle, IIRC, does not have the notion of a clustered index. SQL Server does. Another illustration of why you should not assume that all relational databases work the same way.

DB2 supports clustering, and Oracle has a similar concept: http://www.oracle-base.com/articles/8i/IndexOrganizedTables....

You are definitely right about each RDBMS having their own quirks though! Solid advice: know what you are working with and its nuances.

> (If you build an index on surname, firstname and dob, but execute select * from users where firstname="Bob" and surname="Jones", the index won't be used).

I did not know that - I always thought that a compound index would be used for a query against a subset of those columns provided the subset was a prefix of the overall list - so in your example, a query on surname would use the index but a query on firstname and dob would not.

The poster corrected himself above - it's not the case.

Don't know if this was mentioned in the answers (haven't read them all):

Don't assume that once you make your query run perfectly it will stay that way forever. Once row count grows, your query might start to perform very poorly. Happened to me just recently, when a query was running in the 0.0001s range, but overnight crawled to a halt and took 1 minute(!) to execute. Always pay attention to your queries and monitor whether any of them is starting to slow down.

Your second point (the overnight performance hit) often needs to be explained to managers in projects that didn't spend the money on tools or proper performance testing.

Don't expect some gradual curve of performance loss. Once a table exceeds the cache size, an OK-performing query can go very bad with no warning (say 100x longer). This might mean that you really need to do a production patch of some kind. Plan for this before it happens.

Are there wrappers for database APIs which log the time spent on each query?

I know MySQL has an option to log slow queries for you: http://dev.mysql.com/doc/refman/5.1/en/slow-query-log.html

The MySQL slow query log could be of use. Of course you don't get a stack trace this way but it's essentially free (as in built already and not a huge performance hit).

The problem I have with this log is that legitimately slow queries (in my case usually those where I have concatenated hundreds or thousands of inserts into one query) make it hard to use in a meaningful way.

It's possible I'm using it wrong; I didn't get much past trying it out. It's fairly easy to roll my own rough timer in a DB wrapper, though I haven't got around to finishing this yet.
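For what it's worth, such a rough timer can be very small. A sketch of the idea as a sqlite3 wrapper (the class name and threshold are made up; the same shape works around any DB-API connection):

```python
import sqlite3
import time

class TimingConnection:
    """Thin wrapper that logs any query slower than a threshold."""

    def __init__(self, conn, threshold_s=0.01):
        self.conn = conn
        self.threshold_s = threshold_s
        self.log = []  # (elapsed_seconds, sql) pairs

    def execute(self, sql, params=()):
        start = time.perf_counter()
        cursor = self.conn.execute(sql, params)
        elapsed = time.perf_counter() - start
        if elapsed >= self.threshold_s:
            self.log.append((elapsed, sql))
        return cursor

# threshold_s=0.0 logs everything, just to show the mechanism.
db = TimingConnection(sqlite3.connect(":memory:"), threshold_s=0.0)
db.execute("CREATE TABLE t (x)")
db.execute("SELECT * FROM t")
for elapsed, sql in db.log:
    print(f"{elapsed * 1000:.3f} ms  {sql}")
```

Unlike the server-side slow query log, this naturally gives you the application context (you could append a stack trace to each log entry).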

You can get a profile (strace like) in the Percona-enhanced releases: http://www.percona.com/docs/wiki/percona-server:features:pro...

There is likely some overhead for this feature, as it needs to make more system calls (getrusage).

You should use your RDBMS for that kind of stuff, all the good ones can log everything you ever wanted to know about your database queries.

Also: don't make assumptions about how you believe a database works, just because you've used another one (or another version of the same).

And measure every optimization you make, too :) I saw someone doing an optimization that should have accelerated the processing, only to discover once in production it was 4x slower.

My addition to the already great list: using temporary tables to serve content. Temp tables are great when you have a complex report, and breaking things up into stages gives you a performance boost (think you need to do multiple lookups against a list of users that is small but takes a long time to build). So if you generate this report at midnight via a cron job, temp tables work great. However, generating a bunch of them at once is a performance hit. Instead, use indices or (in the worst case) caching with triggers.

Very good points. I have bookmarked the Stackoverflow post. I am glad that I had a good teacher for this subject in college. I really can't stress enough how important it is to design your database properly. I personally think that designing the database structure is one of the most difficult things to do when starting out a project. It is also the most fun :D Also, the satisfaction of optimizing your queries with just a few tweaks and seeing large amounts of improvements feels very good.

I love SQL and find it easy to solve problems with it. But I feel the opposite about ORMs. They add another learning barrier for me. I hate it because sometimes I know how to solve a problem in SQL but have a really hard time solving it using the ORM's syntax. I also have my doubts about the portability argument, because at the start of the project you are supposed to decide which DB you will be using anyway.

Also, am I the only one who wasted countless hours trying to figure out how to execute a query using an ORM? While being absolutely sure how to do it with SQL?

This is my current situation. I am not that experienced and have a lot to learn. That is why I will be spending more time learning the ORMs I will be using and more hours figuring it out :)

I started in database development in 1995ish. Surprised to see an article like this detailing the exact same problems we had back then. Our solution was to have classroom sessions with developers, to provide them with the education they wanted/needed. Biggest problem back then was the use of embedded SQL rather than stored procedures.

I was going to post the exact same thing. I started in '99 and am shocked, shocked to see this is still an ongoing issue. I thought by now most devs had at least read a little Codd, Date, and/or Pascal.

Maybe it comes from having started building web and db apps for large enterprises where the DB was the nucleus of the operation and the apps the satellites. It seems among the web crowd it's the opposite - the app is the nucleus, with the DB and presentation layer the satellites, resulting in ORM among other things.

In those enterprises, access to the DB's was strictly controlled by the DBA team and data made available to apps via stored procs only. The app devs would request a stored proc with certain parameters that would return a certain dataset, and the DBA team would implement it and grant access to that app only (or grant access to an already-existing sp that did the same thing).

The idea of ORM still bugs me for that reason, even though I use it anyway. I can't help but suspect it's the formalization of a convenient but bad practice. I could be wrong about that though, biased as I am in this regard.

> It seems among the web crowd it's the opposite - the app is the nucleus, with the DB and presentation layer the satellites, resulting in ORM among other things.

Indeed, this is one of the strongly-held "opinions" of Ruby on Rails. The database is more or less treated with contempt, a mere bitbucket.

#1 on my list: don't assume your application is the only one that will ever need access to your data. PUT YOUR DATA IN YOUR DATABASE.

#2: don't put stuff that aren't true facts into your database -- just because it's a reliable cross-process data store does NOT mean it's appropriate for all cross-process data sharing.

#3: don't fall in love with SQL. It fucking sucks.

A sort of hack I use sometimes (clearly documented) is to avoid joins when dealing with large data, mainly if caching is not an option (it should be, but hey, I found out it's not always like that!).

Say you have a table for people and another for the cities where they currently live. If you were to do this for a small country you won't deal with many different cities, so a join won't waste much time. But if you're using all the countries in the world then the number of cities is really big. When that happens, use two different and simple queries: one to get all the people, another to get all the cities. Then just check whether both entries exist in your sets and use them in a while or for-each loop. It can be a big improvement.

Note: This is a really simple example which isn't the best case scenario for this tip. Think large sets with lots of joins between them.

Can you explain this some more? You don't really list what you're trying to achieve in your example and I can't think of a case where looping over a large data set in code is better than querying against it.

I believe it's not easy to explain, but let me try again, and if it doesn't work and you are really interested in it, drop me a line by email and I'll try with some more complex examples, ok?

Once I had this problem with a calendar in some application from a client. It was a month calendar, and for every day there were some conditions that needed to be verified: is it a holiday?, is it fully scheduled?, is there at least one person available that day?, and so on.

How was it working? Well, for every day it would query the database for each one of those conditions, which in the end summed up to 5 minutes, since it was a not-so-small dataset.

So how could we fix this? I started with the usual approach: let's just try to get this done in one query and some joins. There was some improvement, but it was a messy query full of joins and even some sub-queries. It took something like 20 seconds to load that calendar, and that wasn't good enough.

Next approach: divide the most complex joins and sub-queries into some really simple queries. Examples: get a set of all the holidays, get a set of all the fully booked days, etc. Then, when you loop to show the calendar, you just check whether the day being iterated is in the holidays set, then in the next one, until you find one that works; if you don't, just leave it blank. This made the calendar load right away.

Benefits: cleaner and (much) quicker code (and also easier to cache if you can, just cache each query individually since some of them don't change that often, like holidays).
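If it helps, the shape of the fix can be sketched in a few lines of Python with sqlite3 (the tables, dates and conditions here are all invented stand-ins for the real app):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE holidays (day TEXT);
    CREATE TABLE fully_booked (day TEXT);
    INSERT INTO holidays VALUES ('2010-12-25');
    INSERT INTO fully_booked VALUES ('2010-12-13');
""")

# One simple query per condition, each result loaded into a set once...
holidays = {row[0] for row in conn.execute("SELECT day FROM holidays")}
booked = {row[0] for row in conn.execute("SELECT day FROM fully_booked")}

# ...then the per-day loop is just cheap set lookups, not per-day queries.
for day_num in range(1, 32):
    day = f"2010-12-{day_num:02d}"
    if day in holidays:
        print(day, "holiday")
    elif day in booked:
        print(day, "fully booked")
```

Two queries total for the whole month, instead of several per rendered day.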

Is it making any sense now?

This is much clearer, thank you. From your initial example, I thought you were suggesting to just load 2 large tables into memory at once and iterate through them to find matches.

Also, at the surface, this seems like it would be a good application for using materialized views. I obviously don't know the ins and outs of what the usage here was, but this seems like a case where slowing down the insert would be a good speed sacrifice for retrieval.

I'm not going to even feign being an expert at database optimization, but that seems like it might be a more efficient way to approach it.

I actually didn't know about materialized views, but from what I've read it seems this wouldn't be a good place to put them to work, because the conditions on the calendar were changed pretty frequently (most of them weren't, like I said in my previous post, but some were, and from what I read that could be a problem, since the view would need to be updated frequently).

It feels really good to learn something new, I'm thinking of some scenarios which might improve with this type of views, thanks!

A suggestion to use surrogate keys instead of natural ones doesn't seem right to me. IMO proper choice of natural keys leads to better mapping from a knowledge domain into the corresponding relational model; if you are unable to find natural keys maybe there's something wrong with your database schema?

Not being able to find a natural key isn't why you use a surrogate. You use a surrogate because natural keys are not static: they can change. A surrogate allows you to have a key that is unrelated to the data, data that will change. Basing keys on changing data is risky.

I've experienced this myself, and while natural keys do work in theory, in practice, they are prone to failure.

ON UPDATE CASCADE. Tis a pity it's not more widely implemented.
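It is implemented in SQLite, for example, though you have to opt in per connection. A small sketch (schema invented) showing a changed natural key rippling through:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this opt-in
conn.executescript("""
    CREATE TABLE countries (code TEXT PRIMARY KEY);
    CREATE TABLE cities (
        name TEXT,
        country_code TEXT REFERENCES countries(code) ON UPDATE CASCADE
    );
    INSERT INTO countries VALUES ('UK');
    INSERT INTO cities VALUES ('London', 'UK');
""")

# Changing the natural key cascades into the referencing rows automatically.
conn.execute("UPDATE countries SET code = 'GB' WHERE code = 'UK'")
print(conn.execute("SELECT country_code FROM cities").fetchone()[0])  # → GB
```

This keeps the database consistent, though as the reply below notes, it does nothing for references held outside the database.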

Which does nothing for anything utilizing the resource itself outside the database. I'm not referring at all to keeping the database consistent. Having something you can always refer to to grab the same data is invaluable.

Fair enough. I'm more worried about database consistency -- it's just the way I roll, I guess.

I'm all for database consistency. It's just not only database consistency that I'm worried about. I've just been burned by keys that were also data before. Having keys that aren't related at all to the data just hurts a few sensibilities and a few individuals sense of calm. =)

There are limits to even this. For example, you may have a table referencing itself (e.g. for tree structures) and on most systems you can't cascade updates. Also if your key is used in lots of other tables the cascade will become expensive.

No matter how correct you think your schema is today something will always come up. Surrogate keys also give you a way to always have a single column key to use in joins and foreign key constraints.

A professor of mine (this was YEARS ago) maintained that if each table did not have a natural key then the model was wrong. In theory this may be correct, but the use of surrogate keys avoids a lot of problems in practice.

The discussion in the comments of the following blog post makes some pretty good points in favor of surrogate keys:


Interesting list; it has a few good points, but unfortunately some are personal opinions rather than universally agreed practices.

6. Not simplifying complex queries through views

Number 6 needs to be used with discretion.

When you reference a complex view, you are in essence making use of a stored query snippet -- there is no magic to the view, and unless you've materialized it (which is very rare), there is no efficiency advantage. All it really is is code sharing.

So what's the problem then?

The problem then is that the query analyzer has to assume that you actually care about every element of said view, every returned column, every spurious join. I've dealt with too many performance problems where people made use of an overly generous view that invalidated most indexes, removed the potential for covering indexes (the #1 most important performance element of a database), and did a lot of joins that were irrelevant to the usage.

6 and 13 (use IN, not OR) really rely on the optimizer doing its job and can hide some nasty performance issues. I do know one release of Sybase would freak out on IN but be fine with OR (I have no idea why). Also, views were very dicey.

Given the rise of NoSQL solutions and the amazing lack of query runtime optimization performed by most relational DBs, looking at this list made me ask "why even bother?"

There's a yawning chasm of difference between "a query that gets the right data in the right way, but does it somewhat inefficiently," and "a query that gets the right data in a stupendously roundabout way." For the former, query optimization is a reasonable salve. For the latter, no amount of automated optimization can save you from crippling performance problems.

Look at it this way: if someone comes to you and says, "I want to get from New York to California in 2 hours, so I'm going to take a 747," off the top of your head you could tell them, "I don't think you can do that without a Concorde." A computer could easily optimize that plan.

But if someone says, "I want to get from New York to California in 2 hours, so I'm going to stick grapes in my nostrils and soak my shoes in canola oil," there's nothing you can do to help this person, because they clearly have no idea where to even start. A computer could do no better.

Using OR instead of IN() is taking a 747 instead of a Concorde. But many of the points in that article, like using DISTINCT to mask a lack of understanding about how you're asking for data, are very much the grapes and canola-oil Nikes of SQL.

EDIT: Whoops. Sorry; since you had been downvoted, I misread the gray-on-light-gray text as "Given the amazing amount of run-time query optimization performed by most relational DBs." Guess I should have run my comment through an HN Post Optimizer.

NoSQL solutions are not a drop-in replacement for relational databases, either, though. They're different tools for different jobs.

No matter what, you're going to need to look at the same data in different ways. If that isn't true, then the decision is easy. But in pretty much every non-trivial application I've ever worked with, the source information and the requirements for the resulting data format have never been the same for every feature.

When you decide to go with NoSQL, you are implicitly accepting some of the burden previously handled by the RDBMS: The onus is now on you to do your own "indexing," your own "views," your own "normalization." I only have experience with CouchDB, which provides an excellent array of tools for handling these responsibilities, but for certain types of data, there is still no substitute for an RDBMS.

The more I use CouchDB the more I see it as something that complements "traditional" relational database features.

I'm actually rather hoping that someone produces a hybrid NoSQL-SQL system that gives the best of both worlds.

"amazing lack of runtime optimization"? Do you have anything to back that up?

Most "old" RDBMSes (i.e., not MySQL) are actually OK to quite good at optimizing queries. Sure, you will always find things that are not optimized properly, but having worked on a query optimizer myself, this is actually much harder than you might think. Not only coming up with optimizations, but also making sure they are legal and don't degrade certain queries. There's nothing worse than optimizing a legit query to pieces...

