
Do you make these 5 database design mistakes? - SQLRockstar
http://thomaslarock.com/2012/01/do-you-make-these-5-database-design-mistakes/
======
ams6110
I think more important than obsessing over column sizes (in most databases,
varchar columns use roughly the space necessary, not the declared maximum)
is using correct datatypes. E.g. using varchar for a column that you know will
only have numbers. Or using int for timestamps (e.g. the Unix epoch
offset) and then having to do a lot of converting. Or sticking serialized
objects from your application domain into blob columns.

Most databases have a lot of built-in safety and functionality when you use
the correct datatypes, but you can't benefit from any of it if you treat
everything as a blob or a string.
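A minimal Python/SQLite sketch of the epoch-as-int conversion overhead (table and column names are hypothetical; SQLite conventionally stores date/times as ISO-8601 text):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")

# Timestamps stored as raw epoch ints: every read needs app-side conversion.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, placed_at INTEGER)")
conn.execute("INSERT INTO orders (placed_at) VALUES (?)",
             (int(datetime(2012, 1, 15, tzinfo=timezone.utc).timestamp()),))
epoch, = conn.execute("SELECT placed_at FROM orders").fetchone()
placed = datetime.fromtimestamp(epoch, tz=timezone.utc)  # manual conversion

# With a date/time-aware column, the database can answer date questions itself:
conn.execute("CREATE TABLE orders2 (id INTEGER PRIMARY KEY, placed_at TEXT)")
conn.execute("INSERT INTO orders2 (placed_at) VALUES ('2012-01-15 00:00:00')")
month, = conn.execute(
    "SELECT strftime('%m', placed_at) FROM orders2").fetchone()
```

With a real date/time type, range queries, grouping by month, and so on stay in SQL instead of leaking into application code.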

~~~
jbigelow76
I've got to disagree about not using varchar for a column you know will
only contain numbers. I'm assuming you mean something like the digits of a
phone number (dashes and parens have been scrubbed out). Even if a field will
only contain numbers I won't use a numeric based field type unless there is
the possibility that I would be using the value in some form of mathematical
function. I might add and subtract from an account balance but I would never
add or subtract from a Social Security Number or zip code.

(I don't keep to this rule for surrogate keys which will be int/bigint auto
incrementing, even though I won't be doing any math on the values themselves)

~~~
onemoreact
There is value in what you're saying.

But, if you store an SSN as, say, 123-45-6789, you're using at least 11 bytes
to store a 4-byte number. Use it as an index and you waste that space again;
store it in a cache and you have less space for useful information; compare it
with another SSN and it's slower; etc. Individually it's probably
irrelevant, but start doing this with several fields in a large application and
you can noticeably slow things down even if the extra disk space seems cheap.

PS: 95% of the time it's probably the safest option, but it's still something
you should consider based on how that data fits in with the rest of your
application.
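The size arithmetic above can be checked directly (a Python sketch, assuming ASCII text and a 32-bit integer):

```python
import struct

formatted = "123-45-6789"
text_size = len(formatted.encode("ascii"))      # 11 bytes as formatted text
int_size = len(struct.pack(">I", 123456789))    # 4 bytes as a 32-bit unsigned int
print(text_size, int_size)
```

Every index entry, cache slot, and comparison pays that ~3x difference again.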

~~~
kls
Some numeric data types will trim leading zeros, so a number like
00543 will come back from the DB as 543. Things like Social Security
numbers and phone numbers are unique identifiers that just happen to be
numeric; they should be treated as identifiers, not numbers, with regard to
data considerations.
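The leading-zero loss is easy to demonstrate (Python sketch; the value is a hypothetical five-digit code):

```python
code = "00543"
as_int = int(code)           # stored numerically, the leading zeros are gone
print(as_int)                # 543

# Reconstituting requires knowing the fixed width out of band:
restored = str(as_int).zfill(5)
print(restored)              # "00543"
```

The zero-padding width is extra knowledge the schema no longer carries, which is exactly the logic hazard being described.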

~~~
Tsagadai
Phone numbers are not unique identifiers. In several regions around the world,
phone numbers are reused. Please keep this in mind, because it is very
short-sighted to say you will never expand globally.

~~~
ListMistress
I agree, neither phone numbers nor social insurance numbers are unique.

But the issue is with storing them in numeric datatypes. They aren't
guaranteed to be unique forever (and indeed, are alphanumeric in many
jurisdictions), and the leading-zero problem is a killer when you go to
reconstitute them, for both performance and logic reasons. Same goes for
US ZIP codes.

This is why I have to have a very strong technical reason to make a non-math
column numeric, especially with externally set data (like SSNs, SINs, account
codes, etc.). The people who set them could just start adding letters or
symbols...and this isn't rare.

~~~
coolgeek
Indeed, Medicare numbers are usually the SSN with a letter appended

------
cleaver
One mistake I used to see a lot is not paying attention to indices. It's sort
of like the "go big" error: you just create indices all over the place
even if they could never be used. It could also be adding columns to an
index that won't improve access time, or not paying attention to the order of
columns in a composite index.

In reality, you need to analyse your code and profile your application to see
what actually is needed for your indices. Also essential is understanding the
overhead that an index creates.
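One way to check whether an index is actually used is the database's query-plan output. A sketch with SQLite from the Python stdlib (table and index names are hypothetical; the same idea applies to EXPLAIN in other databases):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); keep detail.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT * FROM users WHERE email = 'a@example.com'"
before = plan(query)   # full table scan, e.g. "SCAN users"

conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query)    # e.g. "SEARCH users USING INDEX idx_users_email ..."

print(before, after)
```

Running the plan check against real queries, on data of realistic size, is the profiling step being described.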

I don't see this as often today, however. I think that in a lot of cases
developers put off creating indices of any sort until a performance problem
materializes.

~~~
rhizome
_I think that in a lot of cases developers put off creating indices of any
sort until a performance problem materializes._

Which is perfectly fine.

~~~
junegunn
Not perfectly fine if you use MySQL. Adding an index to a table on MySQL locks
the entire table, which can introduce hours of downtime if the table is large.

~~~
dspillett
The problem there is two-fold:

1. Not having done proper index analysis in the first place. While you can't
get it right 100% of the time, if you are expecting tables to grow that large
you really should think hard about this sort of thing as close to the start as
practical.

2. Using a database system that can't perform an online index build without
locking the whole table. I know that such an operation needs to acquire _some_
locks during its activity no matter what system you use, which will create
performance issues for your live site if you are not able to schedule the
index change in a pre-planned "maintenance" downtime (i.e. an application
that has significant "high availability" requirements), but requiring a full
table lock here seems to be a fault in a system that claims to be "enterprise
ready".

~~~
rhizome
Exactly. There's a difference between "putting off indexes" and "waiting until
it takes hours to add them." The problem is likely evident well before that
point.

------
dspillett
On "not going big, just in case" I don't disagree, but the body text implies
that excessive use of storage space is the problem you are creating for
yourself, which isn't the most important factor here by quite a margin.

Space is cheap. What aren't as cheap are memory and I/O bandwidth: using large
datatypes limits the size of the working set you can fit into a given amount of
memory, _and_ slows down the process of reading data from permanent storage
into memory when needed and not already present.

Increasing the load on your I/O capability in this way is far more of a
problem than the extra storage space consumed.

------
dnewcome
I've got to push back on the anti-GUID stance for surrogate keys. They may be
big, but it is nice to have opaque IDs so that no one makes flawed assumptions
about ordering or relative insertion time, tries to guess valid IDs, etc.
Opaque is good...

~~~
ListMistress
GUIDs are HUGE and have a significant impact on performance. Go read Kimberly
Tripp's blog post referenced in the article. She does the math for you.

In almost all the cases that I see GUIDs, they were totally unnecessary for
the design. Even the developer who designed them could not give a reason why
they needed to be GUIDs. A row unique across the entire universe? Really?

I'm not saying there are no cases...just that in most cases they negatively
impact performance with little business or logic gain for that price. All
design decisions come down to cost, benefit and risk.
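The raw size difference is easy to see (a Python sketch; the index-overhead point is the part Tripp's post quantifies in detail):

```python
import uuid

guid = uuid.uuid4()
guid_size = len(guid.bytes)   # 16 bytes per value
print(guid_size)
# An int surrogate key is 4 bytes (8 for bigint), and in a clustered-index
# design the key is repeated in every secondary index entry and every
# referencing foreign-key column, so the overhead multiplies.
```

Random GUIDs also fragment clustered indexes on insert, which is a second cost on top of the size.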

~~~
jaylevitt
How do modern distributed OLTP systems deal with generating unique sequence
numbers? Back in the day, this was a big problem, and I always thought that
GUIDs would someday be a solution (though at the time, _any_ string was too
big/slow to be a primary key). Having one key-issuing server was a single
point of failure; sharding the key ranges by server made it difficult to add
new servers.

~~~
emmelaich
You can have the main key-issuing server issue blocks to distributed
key-issuing servers. The keys are not necessarily increasing with respect to
row creation time, nor are they gapless, but it's good enough for many
purposes. It's also still a SPoF, but not a single point of contention bad
enough to affect performance significantly.
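A minimal sketch of that block-allocation idea (sometimes called a hi/lo scheme; all names here are hypothetical and this is not any particular product's implementation):

```python
import itertools
import threading

class BlockAllocator:
    """Stands in for the central key-issuing server: hands out ID blocks."""
    def __init__(self, block_size=1000):
        self._blocks = itertools.count(0)
        self._block_size = block_size
        self._lock = threading.Lock()

    def next_block(self):
        with self._lock:
            start = next(self._blocks) * self._block_size
        return start, start + self._block_size

class NodeIssuer:
    """A distributed node that issues keys locally from its current block."""
    def __init__(self, allocator):
        self._allocator = allocator
        self._next = self._hi = 0

    def next_id(self):
        if self._next >= self._hi:  # block exhausted: fetch a new one
            self._next, self._hi = self._allocator.next_block()
        nid, self._next = self._next, self._next + 1
        return nid
```

Nodes only contact the allocator once per block, so contention is low; gaps appear whenever a node restarts mid-block, and IDs are not ordered by insertion time, matching the caveats above.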

~~~
jaylevitt
Interesting... how do you deal with ID collisions after rollover? That was our
big problem, even with 8-byte IDs: the ID allocator had to know the Global
Truth. Dealing with collisions as "exception, try the next one" was too
expensive, though with faster hardware that may be moot.

~~~
emmelaich
Yes, rollover is an interesting problem. At the scale our app is operating at,
that won't happen within the next few thousand centuries, though. I'm actually
describing how I might go about doing it, and what I /think/ Oracle RAC does.

I'm not actually certain of the implementation.

------
jtchang
I tend to view designing database schemas as an evolving process. Picking
larger datatypes is sometimes easier, to me, than having to change them at a
later date. If space really does become an issue, I can deal with it later.

You really need to understand the data you plan to put in.

------
zbowling
This webpage causes my chrome to lock up and my laptop fan to go on full
strength. Heavy HEAVY social media and user tracking javascript.

------
swah
I never took databases in college and don't have the energy to learn about
normal forms, but I find "designing" the schemas for my simple services a joy.

~~~
GlennS
I do see that you can go a long way with relational databases just using
experience.

However, they do have a formalised mathematical basis, and I think you dismiss
it too quickly. If you do spend the time to learn the maths, it will push
your abilities that little bit further. You'll also likely find it quite easy
to pick up, given that you already understand the field well.

