

Database Design Mistakes - rootj
http://thomaslarock.com/2012/01/do-you-make-these-5-database-design-mistakes/

======
EzGraphs
"Now, for reasons I can’t explain, people just love GUIDs in database
designs."

Yeah - they seem kind of silly and meaningless, so instead you use sequential
integers. After all, they work nicely with object-relational frameworks like
ActiveRecord, they are sequential, and they are more "human readable". And
then one day you have a subset of data that you need to migrate from one
environment. And when you try, you end up with some clashing records created
in the target environment. No big deal, you will just increment all of your
sequential integers so that there are no clashes. Oh wait, there are foreign
keys - need to update those as well. Which reminds me, what is going to happen
when ids in the target system catch up to the ones imported? OK, we can change
the sequence to cause it to start adding records after the one imported. Hey,
this is really kind of complicated. There must be an easier way...

Hmmm... about those GUIDs? They completely eliminate an entire class of
problems that occur when migrating data across environments (a very common
scenario in some quarters).

If this type of scenario never will arise for you, then GUIDs may not make
sense for your application. Hopefully this does explain why some "people just
love" them.

~~~
SomeRandomUser
I've never worked with big databases so bear with me if this is a really silly
question.

Let's say I have database A with sequential IDs, and database B with
sequential IDs. Couldn't I just _translate_ the sequential IDs to GUIDs in
both databases, merge them to database C and then translate the records from A
and B to sequential IDs?

Or is modifying IDs a big no-no in this scenario?

~~~
zrail
Modifying IDs is a _giant pain in the ass_. At several previous jobs we've had
to rekey billions of records with sequential IDs and it never ever goes
correctly the first (or tenth) time, and when it eventually does work it takes
forever to complete.

------
venutip
My favorite part of this writeup is the revelation that the "TIMESTAMP" type
in MS SQL Server is not what 99.9% of reasonable, non-insane people would
consider it to be – namely, a timestamp. From the MS SQL Server documentation
the author links to:

"The timestamp data type is just an incrementing number and does not preserve
a date or a time. To record a date or time, use a datetime data type."

Seriously? In what way is that even _remotely_ acceptable? How many PMs and QA
engineers saw that and were like, "Cool makes sense SHIP IT."?

If MS SQL Server were a person, he'd be a colossal asshole.

------
ecopoesis
I don't use foreign keys because they've bitten me in the ass too many times.
I can maintain data integrity at the app level, so I do it there. Adding FKs to
the mix means my DB is also doing work I already have to deal with in the
app, and that's just slowing things down.

FKs are a kind of business logic and I want none of that in my DB. I want my
DB to give me storage, preferably ACID storage, and nothing more. I'll keep my
business logic in my app.

~~~
losvedir
Wow, I can't fathom how this works. Where do you store the relationship, then?
Say you have posts and comments. I would want a posts table and a comments
table, with a post_id foreign key in comments. How would you do it without a
foreign key?

~~~
chris_wot
He still has the key field, but doesn't add the constraint.
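A minimal sqlite3 sketch of that approach, with a hypothetical posts/comments schema: the plain `post_id` column still drives the join, but nothing stops an orphan row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    -- post_id is just a plain column: no FOREIGN KEY constraint declared.
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
""")
con.execute("INSERT INTO posts VALUES (1, 'hello')")
con.execute("INSERT INTO comments VALUES (1, 1, 'first!')")

# The join works exactly as it would with a declared constraint...
rows = con.execute(
    "SELECT p.title, c.body FROM posts p JOIN comments c ON c.post_id = p.id"
).fetchall()
assert rows == [("hello", "first!")]

# ...but the database happily accepts an orphan: the app alone must
# enforce integrity, which is the trade-off being described.
con.execute("INSERT INTO comments VALUES (2, 999, 'orphan')")
```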

------
tracker1
1\. INT32 is fairly natural and generally the default go-to in most higher-level
languages that, you know, actually _USE_ the data... anything smaller, or
larger than INT32 should be discouraged unless absolutely necessary... Yes,
storage should be considered, but you also need to consider the friction such
design can cause with different client languages. Beyond this, you can go with
Unsigned values, but again, this can cause a lot of friction.

2\. This is precisely why I don't like my RDBMSs talking directly to each
other. In fact, I like my RDBMS to be as stupid as possible. As mentioned
above, the structure most common in your programming environment is likely
what you should use... for time-insensitive date/time values I always
store/read in UTC; the same goes across boundaries. For events, I use the
local time (no offset specified) and/or the location or offset. It depends on
the use.

3\. Most databases that support UUID/GUID fields support a generator that will
create an index friendly sequential UUID, where the main portion is
sequential, with most of the value still random.

UUIDs are essential for many offline access/creation, synchronization and
sharding systems. The biggest mistake a DBA tends to make is to assume that a
single server will run everything well enough for everyone that uses a given
application.
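Python's standard library has no sequential-UUID generator, so here is a rough sketch of the idea only: a millisecond timestamp in the high bits, randomness in the rest (the COMB-GUID/UUIDv7 approach; real generators also set version and variant bits, which this toy skips):

```python
import os
import time
import uuid

def comb_uuid() -> uuid.UUID:
    """Sketch of an index-friendly 'sequential' UUID: the high 48 bits
    hold a millisecond timestamp, so ids sort roughly by creation time,
    while the remaining 80 bits stay random."""
    ms = int(time.time() * 1000).to_bytes(6, "big")
    return uuid.UUID(bytes=ms + os.urandom(10))

a, b = comb_uuid(), comb_uuid()
# Either b was created in a later millisecond (so it sorts after a),
# or both landed in the same millisecond and share the time prefix.
assert a < b or a.bytes[:6] == b.bytes[:6]
```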

4\. I tend to always have primary and foreign key constraints... additional
indexes should only be considered once you have typical use cases in place and
you have done the necessary profiling to determine which indexes will be of
the most benefit.

I would state that not understanding how indexes work in your given RDBMS can
be a big issue. I think MSSQL tends to be one of the most forgiving in this
area, and Oracle least so.

5\. (see my response to number 4)

The arguments for 1, 2, 3 are debatable at best... there are perfectly good
use cases for using a datatype larger than necessary (reducing friction in
code is huge), not to mention UUIDs being absolutely essential in many use
cases. For example, MSSQL will create a UUID field and a unique constraint
when you enable replication if a table doesn't already have one, so it may as
well be a primary key, if it makes sense.

As to 4-5, this comes down to understanding index strategies, which is a
separate learning issue imho.

------
jrarredondo
The worst database design mistake is starting database design after
development is almost done!

~~~
jasonlotito
Reminds me of this quote from Linus Torvalds:

"Bad programmers worry about the code. Good programmers worry about data
structures and their relationships."

------
sehrope
> If you know that the only possible values for a certain column are between 0
> and 100,000 then you don’t need to slap a BIGINT data type for that column
> when a INT would do just fine.

What conceivable column could have 100K possible values but not more than 4B?
Factoring in gaps in sequence numbers, wasted ids for testing, parallel
generation of ids, etc, it's not that hard to crack the 32-bit boundary.

If you're dealing with something you know is of a fixed small size, for
example surrogate keys for a constants table that is manually created by you
(the app designer) then you can use an int (32-bit), a short (16-bit), or even
a single byte (8-bit). For anything else though just use 64-bit ids.
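For reference, the ranges being argued about (signed, as most client languages see them):

```python
# Maximum values per signed integer width.
limits = {8: 2**7 - 1, 16: 2**15 - 1, 32: 2**31 - 1, 64: 2**63 - 1}

assert limits[32] == 2_147_483_647      # signed 32-bit ceiling (~2.1B)
assert 2**32 - 1 == 4_294_967_295       # the "4B" figure is the unsigned ceiling
assert limits[64] > 9 * 10**18          # the headroom behind "just use 64-bit ids"
```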

> what about those NCHAR(2000) columns that are storing mostly first and last
> names? How much extra overhead for those?

This makes no sense. With VARCHAR fields the size is just a max. You should
still pick a reasonable max, but just because a column is defined as
VARCHAR(256) doesn't mean it's stored as 256 bytes.

~~~
tracker1
CHAR/NCHAR are fixed length, VARCHAR/NVARCHAR are dynamic length... Usually
the biggest restrictions on varchar/nvarchar are indexing limits... generally
my indexed n/varchar fields (email, name, etc.) will be 100; non-indexed XML,
hashes, and JSON will be NVARCHAR(MAX)/NTEXT when they aren't indexed.

~~~
sehrope
Yes, but I see no point in using fixed-length string fields. Doubly so for the
examples the article gives (first and last names). Every time I've encountered
them it's been in legacy systems, and it's been a universal pain.

All your front-end code ends up doing TRIM(...) to clean up the extra
padding. Modern RDBMSs all handle variable-length fields efficiently, so it's
a waste of programmer time to have to deal with CHAR/NCHAR fields.
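A toy illustration of the padding complaint, mimicking a CHAR(10) column with plain Python strings:

```python
# A CHAR(10) column pads every value to its full width on storage,
# which is what forces the TRIM(...) calls complained about above.
def char_store(value: str, width: int = 10) -> str:
    return value.ljust(width)          # mimic fixed-width CHAR storage

stored = char_store("Ann")
assert stored == "Ann       "          # the trailing spaces come back too
assert stored != "Ann"                 # naive equality checks now fail
assert stored.rstrip() == "Ann"        # so every read needs a TRIM
```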

------
dialtone
In some databases, such as PostgreSQL, there is no difference between
varchar(), varchar(N) and text.

http://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/

And 7.8 MB really changes practically nothing in the overall equation. 1 is a
good point but badly argued.

------
akaiser
You should have started with the NCHAR(2000) example. Choosing INT over BIGINT
is really something that, as you calculated yourself, wouldn't change the fate
of a project. Especially not in our 64 bit world ;)

The point you make about timestamps is very clear to everybody. Do you have
more examples which people don't think about?

------
emilyst
> Yeah, that _is_ what she said.

Wow, what the fuck? Was there any reason for that?

------
chris_wot
Point 5 makes no sense. The whole point of using a surrogate key is that a
natural key is almost always flawed. There are no good reasons for the advice
he gives for point 5; in fact, it is a mistake to do what he says!

~~~
siddboots
I think his point could have been made clearer, but he's really just saying
that there will often be a mutable natural key that is captured in the
modelled data, and that it makes sense to enforce constraints over that key
if you know that the constraints exist.

~~~
chris_wot
Ok, maybe. But I can only think of examples where if the schema was
normalized properly then it would be a non-issue.

------
cletus
At the risk of self-promotion:

http://stackoverflow.com/questions/621884/database-development-mistakes-made-by-application-developers/621891#621891

To add some points:

> Multiply those rows by 4 bytes and you have 8 million bytes, or roughly
> 7.8MB of wasted space.

I have a hard time imagining any installation where this isn't a ludicrously
small amount of space, particularly in relation to the total data size.

As for using too big a type, another factor to consider is "how easy is it
to expand this later?" If it's easy, sure use a small type. If it's hard, err
on the side of too large.

> Now, for reasons I can’t explain, people just love GUIDs in database
> designs.

There are some pretty good (or at least understandable) reasons for liking
GUIDs for keys. Firstly you need to understand how the database generates IDs.

The first camp is typified by MySQL with its auto-incrementing keys. This is
super-convenient in many cases and generates ordered and mostly sequential
keys but it comes at a cost in that you typically have to round-trip to the DB
and insert something to get a key. This generates spurious round trips if you
need this key as a foreign key in other entities.

The other school is sequences typified by Oracle. A sequence isn't otherwise
tied to a particular table. You still need a round trip to the DB to get an ID
but it doesn't require you to insert anything. Sequences are in-memory and
generally fast (round trip notwithstanding). Sequences are mostly ordered and
partially sequential depending on your setup (this is less true in a clustered
environment).

So, GUIDs. They're 128 bits. Collisions are unlikely. OSs tend to come with
decent GUID generators (so you're not just using a 32 bit PRNG to generate
your supposedly 128 bit keyspace items). They require no DB round trip and
they have one other property that can be important depending on your setup and
your database: scattering.

If you have a partitioned database then you want to evenly spread your data as
much as possible. There are many techniques for this that might or might not
be data dependent (eg first character of email address has poor scattering
properties but a good hash of an email address will probably be spread much
more evenly). GUIDs neatly spread elements in your keyspace for very little
effort.
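A quick sketch of that scattering argument, with made-up email addresses: sharding on the first character piles skewed data onto one shard, while hashing a GUID key spreads it roughly evenly.

```python
import hashlib
import uuid
from collections import Counter

SHARDS = 4

# Skewed data: every address happens to start with 'j'.
emails = [f"john{i}@example.com" for i in range(1000)]

# First-character sharding sends everything to a single shard...
by_first_char = Counter(ord(e[0]) % SHARDS for e in emails)
assert len(by_first_char) == 1

# ...while hashing a GUID key spreads rows across all shards.
def shard_of(key: uuid.UUID) -> int:
    digest = hashlib.sha256(key.bytes).digest()
    return int.from_bytes(digest[:4], "big") % SHARDS

by_guid = Counter(shard_of(uuid.uuid4()) for _ in emails)
assert len(by_guid) == SHARDS
assert max(by_guid.values()) < 2 * (len(emails) // SHARDS)  # no hot shard
```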

> If you are using a surrogate key (and, ideally, one that is sequential)

I don't see sequential or ordered being strictly better properties for keys.

~~~
sehrope
> You still need a round trip to the DB to get an ID

No you don't. Not sure about MySQL but Oracle supports returning generated
keys as part of an insertion[1]. An additional round trip is not required as
it comes back in the "success" response of the insert itself.

[1]: http://info.michael-simons.eu/2007/10/09/jdbc-get-autogenerated-keys-on-a-oracle-db/
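The sqlite3 analogue of this (assuming Python's bundled SQLite) is `lastrowid`, which likewise hands back the generated key on the insert itself, with no separate SELECT round trip:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")

# The generated key comes back with the insert itself. (Oracle spells
# this `INSERT ... RETURNING id`; JDBC exposes it via getGeneratedKeys();
# Python's sqlite3 driver exposes it as cursor.lastrowid.)
cur = con.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
assert cur.lastrowid == 1

cur = con.execute("INSERT INTO users (name) VALUES (?)", ("bob",))
assert cur.lastrowid == 2
```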

------
EGreg
Okay so what can I say ... from my experience I disagree with his conclusions
for 3, 4 and 5.

3\. If you are building any kind of table that you expect to horizontally
partition, do NOT use sequential things like autoincrementing ids! The mutex
lock will be your bottleneck. Instead, use this algorithm to generate ids:

    
    
    10: id = GENERATE_UNIQUE_STRING(20)
    20: IF ( SELECT_ROW_BY_ID(id) ) GOTO 10
    30: INSERT_ROW_WITH_ID(id)

(yeah, it's written in BASIC just for fun)

20 lowercase letters give you 26^20 combinations. And the chance of a clash
becomes around 50% only after you have filled 26^10 ids (birthday "paradox").
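Checking that arithmetic with the standard birthday approximation: at exactly 26^10 ids the clash probability is about 39%, so "around 50%" is the right ballpark.

```python
import math

N = 26 ** 20                       # keyspace of 20 lowercase letters
n = 26 ** 10                       # ids generated, as the comment says

# Birthday approximation: P(collision) ~= 1 - exp(-n^2 / 2N).
# Here n^2 / 2N = 26^20 / (2 * 26^20) = 0.5 exactly.
p = 1 - math.exp(-n * n / (2 * N))
assert abs(p - (1 - math.exp(-0.5))) < 1e-12   # ~= 39%

# The true 50% point is at n ~= 1.1774 * sqrt(N), i.e. close to 26^10.
n50 = 1.1774 * math.sqrt(N)
assert 1 < n50 / 26 ** 10 < 1.2
```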

If you really have to have a sequentially incrementing "ordinal" field, e.g.
for messages posted to a stream in a specific order, then use transactions and
lock only one row, in the related table, which will contain the max value of
the field. For example, in MySQL use InnoDB, which implements row-level
locking, and have a stream.message_count field, which you use when you INSERT
into the
message table.
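A sketch of that counter-row pattern in sqlite3 (SQLite locks the whole database rather than one row, so this only illustrates the transaction shape, not InnoDB's row-level locking):

```python
import sqlite3

con = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
con.executescript("""
    CREATE TABLE stream  (id INTEGER PRIMARY KEY, message_count INTEGER);
    CREATE TABLE message (stream_id INTEGER, ordinal INTEGER, body TEXT);
    INSERT INTO stream VALUES (1, 0);
""")

def post_message(stream_id: int, body: str) -> int:
    # One transaction bumps the counter row and reuses the new value as
    # the message ordinal, so ordinals come out gapless and ordered.
    con.execute("BEGIN IMMEDIATE")
    con.execute("UPDATE stream SET message_count = message_count + 1 "
                "WHERE id = ?", (stream_id,))
    (ordinal,) = con.execute("SELECT message_count FROM stream WHERE id = ?",
                             (stream_id,)).fetchone()
    con.execute("INSERT INTO message VALUES (?, ?, ?)",
                (stream_id, ordinal, body))
    con.execute("COMMIT")
    return ordinal

assert [post_message(1, m) for m in ("a", "b", "c")] == [1, 2, 3]
```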

4\. Again, when you are sharding you probably don't want to have foreign key
constraints defined in your database layer. If this guy is shocked he must
have not really done sharding before.

5\. While you can use surrogate keys, that limits your horizontal partitioning to
the way you generate these keys. Oftentimes you will want to look things up in
your shards by a certain RANGE, and for that you want to preserve the ordering
in your primary key (up to normalization).

So there you have it, it's not that simple :)

