
Database development mistakes made by application developers - phalien
http://stackoverflow.com/questions/621884/database-development-mistakes-made-by-application-developers/
======
mcantor
I never understood the groans of dismay from other developers when the subject
of writing SQL comes up. I'm a dyed-in-the-wool application developer, but I
_love_ writing SQL.

Unlike HTML & CSS, which seem designed to spread into a greater morass of
obtuse and interdependent hacks with every trial-and-error iteration, SQL has
always felt elegant to me.

It takes just as much time to craft an SQL query as it does to translate a
design into HTML/CSS, but 99% of the time I can sit back and be proud of my
handiwork; even re-using it later.

On the other hand, I have never completed a webpage without looking at the
resulting markup and thinking, "What an ugly goddamn kluge." Sass and jQuery
and HTML5 and CSS3 don't make a difference. Maybe I'm just bad at it, though.

I eschew ORMs at every opportunity, and it is mysterious to me why "writes SQL
for you" is such a common "feature." I would liken it to a tool which "paints
paintings for you" or "designs your app for you." Why would I entrust a
computer program to do something which requires the creative spark of a human?

~~~
contextfree
I like the relational algebra underlying SQL. I don't like the clunky syntax,
or the weak type system, or the lack of a good composition mechanism. Agreed
that HTML and CSS are even worse!

~~~
mcantor
By "lack of a good composition mechanism," are you referring to the trappings
of writing and dealing with SQL syntax itself? As in, "Ugh. I decided to
change this join into a subquery, and now my @#$#% parentheses are
imbalanced!"?

If so, I agree strongly, but I feel like that problem could be solved by
better error reporting on the part of the compiler. I dread SQL errors now, at
least with Oracle (which I work with the most often), because they almost
_never_ tell me what I need to know. To a certain extent it's essential to the
abstraction of data reporting itself, but certain errors are just comically
vague.

~~~
contextfree
Not really, I'm referring more to the ability to factor out common patterns
and compose them together to make queries, etc.

------
cubicle67
For what it's worth, here are a few from me.

 _Indexes_. Everybody says it, but I'm still surprised at the number of devs
who have next to no idea about indexes and how they work.

A few guidelines: Every index adds a write penalty, so don't just add them to
everything. If you've got a compound index, it will only be used if the query
also uses all the fields in the index (If you build an index on surname,
firstname and dob, but execute select * from users where firstname="Bob" and
surname="Jones", the index won't be used). Benchmark the difference each index
makes, and stand back in awe at the improvement a few well-placed indexes can
contribute.

[Edit: as pointed out below, my experience is well out of date and it looks
like there's a number of database engines where the compound index thing
doesn't apply]
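
For illustration, the leftmost-prefix behaviour the replies below describe
looks roughly like this (same example table and columns as above; a sketch,
not gospel):

    CREATE INDEX idx_users_name ON users (surname, firstname, dob);

    -- Constrains the leading column, so the index is usable:
    SELECT * FROM users WHERE surname = 'Jones' AND firstname = 'Bob';

    -- Leading column unconstrained; many engines fall back to a scan
    -- here (some can still use the index, just less efficiently):
    SELECT * FROM users WHERE firstname = 'Bob';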

 _Learn to think in sets_. Most devs are used to operating on discrete pieces
of data; we're not used to thinking in sets, and so writing efficient SQL
doesn't come naturally. Put in the effort to learn it and you'll be well
rewarded. SQL is built around sets of data, and can almost assuredly do stuff
faster than your code can. Hint: any time you find yourself getting the result
of a query, iterating over the result set and executing another query for each
result, _you're doing it wrong_.
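
A toy example of that hint (users and orders are hypothetical tables):

    -- Row-by-row, N+1 style: one extra query per user
    --   SELECT * FROM orders WHERE user_id = ?   (executed N times)

    -- Set-based: one query does the whole job
    SELECT u.id, u.name, o.total
    FROM users u
    JOIN orders o ON o.user_id = u.id;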

 _Learn how your ORM works_. If you're using an ORM (and most of us do), learn
how to use it efficiently. This is where having a good working knowledge of
SQL helps, because you can then find out how to help your ORM create good SQL.
If you're using ActiveRecord, understand how to correctly use :conditions,
:include etc, and what they're doing behind the scenes. Ideally you want your
ORM to make one call to the DB, and to retrieve everything you need in that
one call (calls are expensive, even for very small queries). Poorly written
code causes the DB to get hit constantly. Not good.
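
Roughly what the difference looks like at the SQL level (the exact queries
vary by ORM and version, so treat this as a sketch with made-up tables):

    -- Lazy loading in a loop: one query per parent row
    SELECT * FROM comments WHERE post_id = 1;
    SELECT * FROM comments WHERE post_id = 2;
    SELECT * FROM comments WHERE post_id = 3;

    -- An eager load (e.g. :include) can batch it into a single query
    SELECT * FROM comments WHERE post_id IN (1, 2, 3);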

~~~
nbpoole
> (If you build an index on surname, firstname and dob, but execute select *
> from users where firstname="Bob" and surname="Jones", the index won't be
> used).

I don't believe that's true, at least not for all RDBMS. I just checked the
docs for MySQL and PostgreSQL and they seem to agree with me.

http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html | "The
name index is an index over the last_name and first_name columns. The index
can be used for queries that specify values in a known range for last_name,
or for both last_name and first_name."

http://www.postgresql.org/docs/8.2/static/indexes-multicolumn.html | "A
multicolumn B-tree index can be used with query conditions that involve any
subset of the index's columns, but the index is most efficient when there are
constraints on the leading (leftmost) columns"

~~~
morgo
If we want to be 100% correct, indexes don't just hurt writes, and they don't
hurt all writes either. Two examples (there are more):

\- If I do an UPDATE query, and I don't modify indexed columns, the index
doesn't need to be modified.

\- If I have indexes on (last_name, first_name) _and_ (last_name), the
(last_name) index is redundant. Some queries may use the last_name index when
I don't specify first_name, but they could have used the other index. Now I
have bloated the set of data that needs to be memory resident (affecting reads
as well).
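
A quick sketch of that second case (hypothetical table):

    CREATE INDEX idx_name_full ON people (last_name, first_name);
    CREATE INDEX idx_name_last ON people (last_name);  -- redundant; drop it

    -- The compound index's leading column already serves this query:
    SELECT * FROM people WHERE last_name = 'Smith';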

~~~
mmt
I'd say that suggesting there's a "penalty" at all implies a naive hardware
setup[1].

Total system I/O is a finite resource, but that limit can be hard to reach if
indexes are on separate spindles[2] from the data and from each other. This
can cost substantially more up front, but the difference disappears at scale.
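
PostgreSQL, for one, supports this via tablespaces (the paths and names here
are made up):

    CREATE TABLESPACE fast_idx LOCATION '/mnt/ssd0/pg_indexes';
    CREATE INDEX idx_orders_user ON orders (user_id) TABLESPACE fast_idx;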

Perhaps, then, as for the OP's original topic, one mistake is not hiring
someone who knows these performance strategies. Another is using an RDBMS that
doesn't support them.

[1] which, granted, is far and away the norm

[2] or sets of spindles, or SSDs

------
CWIZO
Don't know if this was mentioned in the answers (haven't read them all):

Don't assume that once you make your query run perfectly it will stay that way
forever. As the row count grows, your query might start to perform very
poorly. Happened to me just recently, when a query that had been running in
the 0.0001s range overnight crawled to a halt and took 1 minute(!) to execute.
Always pay attention to your queries and monitor whether any of them is
starting to slow down.
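
The first thing to reach for when that happens is the plan (hypothetical
query; MySQL, PostgreSQL and friends all have some form of EXPLAIN):

    -- Re-check the plan as the table grows: it can flip from an
    -- index scan to a full scan once the data distribution changes
    EXPLAIN
    SELECT * FROM events
    WHERE account_id = 42
    ORDER BY created_at DESC
    LIMIT 50;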

~~~
Estragon
Are there wrappers for database APIs which log the time spent on each query?

~~~
IgorPartola
The MySQL slow query log could be of use. Of course you don't get a stack
trace this way, but it's essentially free (as in already built, and not a huge
performance hit).
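
In MySQL 5.1+ you can even flip it on at runtime (older versions need
log-slow-queries in my.cnf instead):

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;  -- log anything slower than 1 second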

~~~
robryan
The problem I have with this log is that legitimately slow queries, in my case
usually those where I have concatenated hundreds or thousands of inserts into
one query, make it hard to use in a meaningful way.

It's possible I'm using it wrong; I didn't get much past trying it out. It's
fairly easy to roll my own rough timer in a DB wrapper, though I haven't got
around to finishing that yet.

------
thibaut_barrere
Also: don't make assumptions about how you believe a database works just
because you've used another one (or another version of the same).

And measure every optimization you make, too :) I saw someone make an
optimization that should have sped up the processing, only to discover in
production that it was 4x slower.
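
PostgreSQL's EXPLAIN ANALYZE is handy for exactly this, since it runs the
query and reports actual timings rather than estimates (made-up query):

    EXPLAIN ANALYZE
    SELECT c.name, COUNT(*)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name;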

------
IgorPartola
My addition to the already great list: using temporary tables to serve
content. Temp tables are great when you have a complex report and breaking
things up into stages gives you a performance boost (think: you need to do
multiple lookups against a list of users that is small but takes a long time
to build). So if you generate this report at midnight via a cron job, temp
tables work great. However, generating a bunch of them at once is a
performance hit. Instead, use indexes or (in the worst case) caching with
triggers.
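
A sketch of the cron-job case (PostgreSQL-flavored syntax, hypothetical
tables):

    -- Build the small-but-expensive user list once...
    CREATE TEMPORARY TABLE active_users AS
    SELECT u.id
    FROM users u
    JOIN logins l ON l.user_id = u.id
    WHERE l.created_at > now() - interval '30 days'
    GROUP BY u.id
    HAVING COUNT(*) > 5;

    -- ...then reuse it for each stage of the report
    SELECT a.id, SUM(o.total)
    FROM active_users a
    JOIN orders o ON o.user_id = a.id
    GROUP BY a.id;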

------
wenbert
Very good points. I have bookmarked the Stackoverflow post. I am glad that I
had a good teacher for this subject in college. I really can't stress enough
how important it is to design your database properly. I personally think that
designing the database structure is one of the most difficult things to do
when starting a project. It is also the most fun :D Also, the satisfaction of
optimizing your queries with just a few tweaks and seeing large improvements
feels very good.

I love SQL and find it easy to solve problems with it. But I feel the opposite
about ORMs. They add another "learning barrier" for me. I hate it when I know
how to solve a problem in SQL but have a really hard time solving it with the
ORM's syntax. I also have my doubts about the portability argument, since
you're supposed to decide which DB you'll be using at the start of the project
anyway.

Also, am I the only one who has wasted countless hours trying to figure out
how to execute a query using an ORM, while being absolutely sure how to do it
in SQL?

This is my current situation. I am not that experienced and have a lot to
learn. That is why I will be spending more time learning the ORMs I will be
using and more hours figuring it out :)

------
mdoyle
I started in database development in 1995ish. Surprised to see an article like
this detailing the exact same problems we had back then. Our solution was to
have classroom sessions with developers, to provide them with the education
they wanted/needed. Biggest problem back then was the use of embedded SQL
rather than stored procedures.

~~~
SkyMarshal
I was going to post the exact same thing. I started in '99 and am shocked,
shocked to see this is still an ongoing issue. I thought by now most devs had
at least read a little Codd, Date, and/or Pascal.

Maybe it comes from having started building web and db apps for large
enterprises where the DB was the nucleus of the operation and the apps the
satellites. It seems among the web crowd it's the opposite - the app is the
nucleus, with the DB and presentation layer the satellites, resulting in ORM
among other things.

In those enterprises, access to the DB's was strictly controlled by the DBA
team and data made available to apps via stored procs only. The app devs would
request a stored proc with certain parameters that would return a certain
dataset, and the DBA team would implement it and grant access to that app only
(or grant access to an already-existing sp that did the same thing).

The idea of ORM still bugs me for that reason, even though I use it anyway. I
can't help but suspect it's the formalization of a convenient but bad
practice. I could be wrong about that though, biased as I am in this regard.

~~~
jacques_chester
> It seems among the web crowd it's the opposite - the app is the nucleus,
> with the DB and presentation layer the satellites, resulting in ORM among
> other things.

Indeed, this is one of the strongly-held "opinions" of Ruby on Rails. The
database is more or less treated with contempt, a mere bitbucket.

------
jfb
#1 on my list: don't assume your application is the only one that will ever
need access to your data. PUT YOUR DATA IN YOUR DATABASE.

#2: don't put things that aren't true facts into your database -- just because
it's a reliable cross-process data store does NOT mean it's appropriate for
all cross-process data sharing.

#3: don't fall in love with SQL. It fucking sucks.

------
iaskwhy
A sort of hack I use sometimes (clearly documented) is to avoid joins when
dealing with large data, mainly if caching is not an option (it should be but,
hey, I found out it's not always like that!).

Say you have a table for people and another for the cities where they
currently live. If you were doing this for a small country you wouldn't deal
with many different cities, so a join wouldn't waste much time. But if you're
covering all the countries in the world, then the number of cities is really
big. When that happens, use two different and simple queries: one to get all
the people, another to get all the cities. Then just check that both entries
exist in your sets and use them in a while or for-each loop. It can be a big
improvement.

Note: This is a really simple example which isn't the best case scenario for
this tip. Think large sets with lots of joins between them.

~~~
torme
Can you explain this some more? You don't really list what you're trying to
achieve in your example and I can't think of a case where looping over a large
data set in code is better than querying against it.

~~~
iaskwhy
It's not easy to explain, but let me try again; if it doesn't work and you are
really interested, drop me a line by email and I'll try with some more complex
examples, ok?

Once I had this problem with a calendar in an application for a client. It was
a month calendar, and for every day there were some conditions that needed to
be verified: is it a holiday?, is it fully scheduled?, is there at least one
person available that day?, and so on.

How was it working? Well, for every day it would query the database for each
one of those conditions, which in the end added up to 5 minutes, since it was
a not-so-small dataset.

So how could we fix this? I started with the usual approach: let's just try to
get this done in one query and some joins. There was some improvement, but it
was a messy query full of joins and even some sub-queries. It took something
like 20 seconds to load the calendar, and that wasn't good enough.

Next approach: divide the most complex joins and sub-queries into some really
simple queries. Examples: get a set of all the holidays, get a set of all the
fully booked days, etc. Then, when you loop to render the calendar, you just
check whether the day being iterated is in the holidays set, then in the next
one, until you find one that matches; if none do, just leave it blank. This
made the calendar load right away.
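
In rough SQL terms (table and column names invented for the example):

    -- One cheap query per condition, instead of several queries per day:
    SELECT cal_date FROM holidays
    WHERE cal_date BETWEEN '2010-06-01' AND '2010-06-30';

    SELECT cal_date FROM bookings
    GROUP BY cal_date
    HAVING COUNT(*) >= 10;  -- say 10 bookings means "fully booked"

    -- The application then iterates the ~30 days of the month and does
    -- cheap in-memory membership checks against these small result sets.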

Benefits: cleaner and (much) quicker code (and it's also easier to cache if
you can: just cache each query individually, since some of them, like
holidays, don't change that often).

Is it making any sense now?

~~~
torme
This is much clearer thank you. From your initial example, I thought you were
suggesting to just load 2 large tables into memory at once and iterate through
them to find matches.

Also, at the surface, this seems like it would be a good application for using
materialized views. I obviously don't know the ins and outs of what the usage
here was, but this seems like a case where slowing down the insert would be a
good speed sacrifice for retrieval.

I'm not going to even feign being an expert at database optimization, but that
seems like it might be a more efficient way to approach it.

~~~
iaskwhy
I actually didn't know about materialized views, but from what I've read it
seems this wouldn't be a good place to use them, just because the conditions
on the calendar were changed pretty frequently (most of them weren't, like I
said in my previous post, but some were, and from what I read that could be a
problem, since the view would need to be updated frequently).

It feels really good to learn something new. I'm thinking of some scenarios
which might improve with this type of view. Thanks!

------
alexeyklyukin
A suggestion to use surrogate keys instead of natural ones doesn't seem right
to me. IMO proper choice of natural keys leads to better mapping from a
knowledge domain into the corresponding relational model; if you are unable to
find natural keys maybe there's something wrong with your database schema?

~~~
jasonlotito
Not being able to find a natural key isn't why you use a surrogate. You use a
surrogate because natural keys are not static: they can change. A surrogate
allows you to have a key that is unrelated to the data, data that will change.
Basing keys on changing data is risky.

I've experienced this myself, and while natural keys do work in theory, in
practice, they are prone to failure.
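
A minimal sketch of the pattern (hypothetical table, PostgreSQL-flavored):

    CREATE TABLE users (
        id    SERIAL PRIMARY KEY,     -- surrogate: stable, never changes
        email TEXT NOT NULL UNIQUE    -- natural candidate: still enforced
                                      -- as unique, but free to change
    );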

~~~
jfb
ON UPDATE CASCADE. Tis a pity it's not more widely implemented.
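
Where it is supported, it looks something like this (hypothetical tables;
customers.email must be UNIQUE for the reference to work):

    CREATE TABLE orders (
        id             SERIAL PRIMARY KEY,
        customer_email TEXT REFERENCES customers (email) ON UPDATE CASCADE
    );
    -- Changing a customer's email now ripples into orders automatically.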

~~~
jasonlotito
Which does nothing for anything utilizing the resource itself outside the
database. I'm not referring at all to keeping the database consistent. Having
an identifier you can always use to grab the same data is invaluable.

~~~
jfb
Fair enough. I'm more worried about database consistency -- it's just the way
I roll, I guess.

~~~
jasonlotito
I'm all for database consistency. It's just not only database consistency that
I'm worried about. I've been burned before by keys that were also data. Having
keys that aren't related at all to the data just hurts a few sensibilities and
a few individuals' sense of calm. =)

------
DrJokepu
Interesting list; it has a few good points, but unfortunately some are
personal opinions rather than universally agreed practices.

------
ergo98
_6\. Not simplifying complex queries through views_

Number 6 needs to be used with discretion.

When you reference a complex view, you are in essence making use of a stored
query snippet -- there is no magic to the view, and unless you've materialized
it (which is very rare), there is no efficiency advantage. All it really is is
code sharing.

So what's the problem then?

The problem then is that the query analyzer has to assume that you actually
care about every element of said view, every returned column, every spurious
join. I've dealt with too many performance problems where people made use of
an overly generous view that invalidated most indexes, removed the potential
for covering indexes (the #1 most important performance element of a
database), and did a lot of joins that were irrelevant to the usage.
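
A sketch of the failure mode (invented schema):

    -- A kitchen-sink view drags its joins into every caller:
    CREATE VIEW customer_summary AS
    SELECT c.id, c.name, a.city, SUM(o.total) AS lifetime_total
    FROM customers c
    JOIN addresses a ON a.customer_id = c.id
    JOIN orders    o ON o.customer_id = c.id
    GROUP BY c.id, c.name, a.city;

    -- Even though this only needs name, the optimizer may still have
    -- to evaluate both joins and the aggregate:
    SELECT name FROM customer_summary WHERE id = 42;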

~~~
protomyth
6 and 13 (use IN, not OR) really rely on the optimizer doing its job, and can
hide some nasty performance issues. I do know one release of Sybase would
freak out on IN but be fine with an OR (I have no idea why). Also, views were
very dicey.
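
The two forms in question (logically equivalent; whether the plans match is
up to the optimizer, which is exactly the point):

    SELECT * FROM users WHERE id IN (1, 2, 3);

    SELECT * FROM users WHERE id = 1 OR id = 2 OR id = 3;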

------
projectileboy
Given the rise of NoSQL solutions and the amazing lack of query runtime
optimization performed by most relational DBs, looking at this list made me
ask "why even bother?"

~~~
mcantor
There's a yawning chasm of difference between "a query that gets the right
data in the right way, but does it somewhat inefficiently," and "a query that
gets the right data in a stupendously roundabout way." For the former, query
optimization is a reasonable salve. For the latter, no amount of automated
optimization can save you from crippling performance problems.

Look at it this way: if someone comes to you and says, "I want to get from New
York to California in 2 hours, so I'm going to take a 747," then off the top
of your head you could tell them, "I don't think you can do that without a
Concorde." A computer could easily optimize that plan.

But if someone says, "I want to get from New York to California in 2 hours, so
I'm going to stick grapes in my nostrils and soak my shoes in canola oil,"
there's nothing you can do to help this person, because they clearly have no
idea where to even start. A computer could do no better.

Using _or_ instead of _in()_ is taking a 747 instead of a Concorde. But many
of the points in that article, like using _distinct_ to mask a lack of
understanding about how you're asking for data, are very much the grapes &
canola-oil Nikes of SQL.
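
A concrete version of the _distinct_ smell (made-up tables):

    -- DISTINCT papering over a fan-out join:
    SELECT DISTINCT u.id, u.name
    FROM users u
    JOIN orders o ON o.user_id = u.id;

    -- Asking for what you actually mean instead:
    SELECT u.id, u.name
    FROM users u
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);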

 _EDIT_ : Whoops. Sorry; since you had been downvoted, I misread the gray-on-
light-gray text as "Given the amazing amount of run-time query optimization
performed by most relational DBs." Guess I should have run my comment through
an HN Post Optimizer.

NoSQL solutions are not a drop-in replacement for relational databases,
either, though. They're different tools for different jobs.

No matter what, you're going to need to look at the same data in different
ways. If that isn't true, then the decision is easy. But in pretty much every
non-trivial application I've ever worked with, the source information and the
requirements for the resulting data format have never been the same for every
feature.

When you decide to go with NoSQL, you are implicitly accepting some of the
burden previously handled by the RDBMS: The onus is now on you to do your own
"indexing," your own "views," your own "normalization." I only have experience
with CouchDB, which provides an excellent array of tools for handling these
responsibilities, but for certain types of data, there is still no substitute
for an RDBMS.

~~~
arethuza
The more I use CouchDB the more I see it as something that complements
"traditional" relational database features.

I'm actually rather hoping that someone produces a hybrid NoSQL-SQL system
that gives the best of both worlds.

