
Mistakes Beginners Make When Working with Databases - lobo_tuerto
http://www.craigkerstiens.com/2016/06/07/five-mistakes-databases/
======
endymi0n
Pretty weak, partly terrible advice.

\- Storing images and blobs: granted, usually not a good idea

\- Limit/offset will take you a veeeery long way until you have to think about
stuff like deep paging. And however you try to tackle that, if the stuff you
paginate _needs_ ordering, it's simply a hard problem and not a mistake.

\- UUID primary keys? Horrible advice that's only applicable at
Google/Facebook scale (and even they often use 64bit integer keys for a lot of
entities, see Graph API or Adwords). Will wreak havoc on insert and join
performance and index size. 64 bit is more than you'll ever need even for the
most serious application.

\- Default values on NULL columns: Okay advice, but a pretty random issue

\- Moving away from normalization is an optimization that's mostly premature; given the drawbacks, it should only be done with utmost care and to resolve specific performance problems, not as a general approach.

Overall, pretty mediocre advice.

~~~
dragonwriter
> UUID primary keys? Horrible advice that's only applicable at Google/Facebook
> scale

UUID primary keys aren't needed for big keyspaces, they are needed when you
need multiple processes (especially a flexible number of multiple processes)
to generate unique IDs independently of each other and the central database,
which will eventually get stored in the central database. This can be
important at _much_ smaller than Google/Facebook scale, depending on the use
case.
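
A minimal sketch of the pattern in Postgres (table and column names made up; you could equally add a server-side default like gen_random_uuid()):

    -- each process/device generates its own id, no central sequence needed
    CREATE TABLE readings (
        id          uuid PRIMARY KEY,   -- generated by the client, e.g. a UUIDv4
        device      text NOT NULL,
        payload     jsonb,
        recorded_at timestamptz NOT NULL DEFAULT now()
    );

    -- the client made this id up on its own, possibly while offline:
    INSERT INTO readings (id, device, payload)
    VALUES ('6fa1b0cc-3c7d-4cd4-9a35-6f2f2b7a6f3e', 'machine-42', '{"t": 21.5}');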

~~~
davidw
Oh yes! We ran into precisely this problem when we had a number of machines
that mostly run independently, that needed to communicate with a central
server. Integer ID's don't work well for that, especially in light of the
fact that Mysql reuses them in certain circumstances. I couldn't do anything
about Mysql being on the machines, but used Postgres on the server, of course.

The second generation of machines, where I did have a say in things, used
UUID's and Postgres.

~~~
muxxa
Am I the only one who thinks that primary keys should be derived from the
actual data? That way it's impossible for two processes to accidentally create
the same conceptual piece of data (which is still possible with uuids). It
also makes it much easier to recover from situations when you have to quickly
promote a slave to a master role without first verifying that the slave is up
to date. The main bonus though is that the database is much more
comprehensible, e.g. foreign keys are legible without having to join back to
the primary table. The relational model for data is pretty cool, but it breaks down when, e.g., UNION doesn't deduplicate two rows that differ only in an arbitrary integer primary key.

~~~
brianwawok
Okay, so you run a school and want to make a students database. What do you use as a PK?

1) Social security #? Fails when you have international students.

2) Last Name, First Name, Middle Name? Fails when you have a repeat name.

3) Last Name, First Name, Middle Name, Home Town, Start Year? I guess this works most of the time. Now you need to join to this table from the
classes table. So now you need all 5 keys duplicated in the classes table to
do the join.

Simple Integer primary keys "suck" in that you are adding bogus data to your database that has no value... but man, they sure solve a LOT of problems. You need to think fairly hard before you get rid of them. In some cases you totally can. But making your default data model include an Integer PK solves a LOT of problems.

~~~
sopooneo
I have actually done this for a classroom management app I made mostly for fun
when I taught high school. At first I wanted to use the district issued ID
number as the primary key for each student. Except some students came and
started classes and needed attendance records _before_ the district got around
to issuing them their official numbers. So I fell back to auto-increment int as the primary key, with their "district_id" nullable, default null, but unique-constrained, as just another field alongside their first_name.
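
A minimal sketch of that fallback (hypothetical names, Postgres syntax):

    CREATE TABLE students (
        id          serial PRIMARY KEY,  -- auto-increment surrogate key
        district_id text UNIQUE,         -- nullable until the district issues it;
                                         -- UNIQUE still allows many NULLs
        first_name  text NOT NULL,
        last_name   text NOT NULL
    );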

------
mjw_byrne
Not sure about point 3 ("use UUIDs instead of integer PKs"). Exhausting a
32-bit int takes a lot of usage; exhausting a 64-bit one is completely out of
reach for almost everyone. 128-bit UUIDs will take more space to index than
ints or bigints and are less human-readable. Depending on what UUID version
you use you may or may not lose ordering, which can be a nice-to-have.

I think this isn't a question of ints being a mistake and UUIDs being better,
it's about using the most appropriate type based on requirements.

~~~
Illniyar
With scale it's not about key exhaustion, it's about being able to create a key on multiple servers without having to consult a central location.

~~~
ashmud
For processing data types without quality natural unique keys, being able to
generate a unique key beforehand is very useful. I've written code both ways:
inserting to get an auto-incremented integer, and generating a UUID, and the
UUID approach was much easier to write and maintain.

~~~
jameshart
I'm going to make a bold claim: no datatype has a quality natural unique key.

I am struggling to think of a counterexample. Maybe chemical elements? Where
the natural key is of course the atomic number, not the chemical symbol...

Before you suggest 'zipcodes' or 'states', consider that zipcodes are not in
any sense natural, and anyway, like state codes, are so US-centric that they
don't belong as top level elements in most real database schemas.

------
Someone1234
The first point is strange, borderline wrong.

What they really MEAN is don't store images in your general purpose database,
in particular as long strings.

There are however databases with first party support for image storage, which
is useful because now you can store the image and metadata about the image
together (as well as re-using existing solutions like replication,
authentication, etc).

Additionally, their "solution" is bizarre: they've jumped from using a database to a paid service by a third party, which is itself backed by a database. That
seems very "apples & oranges" to me, I mean it would obviously work, but is a
big jump from in-house development using a general purpose database.

To give one specific example, have they not heard of Oracle Multimedia? That's exactly what it is designed to offer.

~~~
masklinn
> There are however databases with first party support for image storage,
> which is useful because now you can store the image and metadata about the
> image together (as well re-using existing solutions like replication,
> authentication, etc).

It also keeps them coherent: if you store images on a filesystem but metadata in a database, they don't share transactional contexts, so you will eventually end up in an inconsistent state ("dead" files without metadata, or live metadata missing the corresponding image data).
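
A rough illustration of the difference (hypothetical names; :image_bytes is a client-side bind parameter for a bytea column): in the database, one transaction covers both, so you can never commit one without the other.

    BEGIN;
    -- image bytes and their metadata commit (or roll back) together:
    INSERT INTO images (id, data) VALUES (1, :image_bytes);
    INSERT INTO image_meta (image_id, title, uploaded_at)
    VALUES (1, 'cat.png', now());
    COMMIT;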

------
stevecalifornia
"Use UUIDs instead of integer PKs" is 95% of the time a HORRIBLE idea. As a
rule of thumb, if you are a beginner, you should NEVER use a UUID instead of
an integer PK.

If you know the definition and the ins-and-outs of 'clustered index', 'index
fragmentation' and 'page split' then feel free to use a UUID if you see fit--
otherwise, please don't.

If you are a beginner and for some reason have to have a UUID: use a sequential, auto-incrementing integer for your clustered index column, and make the UUID a second column that serves as your primary key.
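
In SQL Server terms, that might look something like this sketch (names made up):

    -- the sequential int keeps the clustered index append-only (no page
    -- splits), while the UUID is still the key you expose and join on
    CREATE TABLE orders (
        seq INT IDENTITY(1,1) NOT NULL,
        id  UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
        CONSTRAINT pk_orders PRIMARY KEY NONCLUSTERED (id)
    );
    CREATE UNIQUE CLUSTERED INDEX cx_orders_seq ON orders (seq);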

This article feels like the blind leading the blind.

~~~
aianus
> "Use UUIDs instead of integer PKs" is 95% of the time a HORRIBLE idea. As a
> rule of thumb, if you are a beginner, you should NEVER use a UUID instead of
> an integer PK.

Can you enlighten us as to why? Seems to me if you're not at the scale where
you see the benefits of UUIDs you're also not at the scale to see the
drawbacks either (bigger, slower indices?).

~~~
balls187
Poor index performance affects all your queries, regardless of scale.

Better, more practical advice might be: if you don't need to use an X as a PK, don't use an X as a PK.

Where X can be either UUID or BIGINT.

If you don't expect your table to scale past a billion rows, INT is more than
fine.

------
drawkbox
Since there is some UUID hate here, some missing pros to using them:

\- No round trip for generating keys: data can be sent in with an existing or new uuid without having to hit the autonumber/keymaster, which removes a single point of failure for a small fee on each row. Storage is cheap, so a 16-byte uuid is not a deal-breaker; the benefit is speed and horizontal scalability. Optimized read-only tables and/or caching can be added wherever this has any impact at all.

\- In some databases, like Oracle, you need a sequence to even do autonumbering, which is a huge pain

\- Autonumbering is a pain when having to replicate across environments or
when you start getting multiple databases and clusters

\- Numeric ids for important data are not exposed, which prevents easily incrementing to the next/previous record (there are other ways to achieve this, but this is one)

\- Many databases have a UUID field or field optimized for unique
ids/guids/uuids now

\- If you were a piece of data wouldn't you want to be unique? All your data
are unique snowflakes with UUIDs. On a serious note, this can help to identify
data across all types and not just in the same table.

~~~
insulanian
You're optimizing for write, but the performance gain is negligible in
comparison to the performance loss due to index fragmentation and slow joins.

Also, on "storage is cheap": [http://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/](http://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/)

------
andysinclair
I'm surprised by this point: "the database tends to be an after-thought in
application design".

Maybe it's a generational thing (I'm 40+), but I tend to always start with
thinking about the data structures and design in conjunction with the UI
design.

Getting a solid database design in place early in the development cycle is critical to building a solid, stable system; you can of course alter the database as you start building, but having a good idea of the data structure should be a "before-thought" rather than an after-thought.

~~~
joslin01
I'm 30- and I always start with the data first as well. Linus said good
programmers always think data first, and that resonated a lot with me when I
first started coding. As I got further along, I realized how true it was that
application logic often follows the data model.

Treating data as an afterthought is a great way to end up re-defining your application logic every time you re-define your data because you forgot something.

------
jameshart
And once you've got over these initial issues you can move on to the Mistakes
People Who've Figured Out a Thing or Two About Databases Make:

\- storing their images externally, but forgetting to apply a consistent
backup strategy to their image data

\- Database is getting big, query returns lots of records, so just adding
pagination. Forgetting that the user really doesn't actually want to page
through results at all. The correct answer was probably to add proper faceted
filters and full text search.

\- using UUID keys everywhere, then making the mistake of mixing up UUID keys
from one table with UUIDs from another one.

\- Using nullable columns as a tool for schema change, but not deciding what
it actually _means_ for a particular column in a row to contain a NULL.

\- Thinking that they will get away with just storing a chunk of structured
data like an array in a particular column value because from the application
point of view it's really just one blob of data anyway. In general I give a
structured datatype like that two days before someone is writing a query that
digs into the inner structure of it.
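
For that last one, the two days usually end with a query like this (jsonb in Postgres, column names invented):

    -- the "it's just one blob" column, 48 hours later:
    SELECT order_id,
           payload->>'customer_email'  AS email,
           payload->'items'->0->>'sku' AS first_sku
    FROM orders
    WHERE payload @> '{"status": "shipped"}';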

------
nsxwolf
I recommend reading the article about pagination linked in the article. I came
away from that thinking the LIMIT ... OFFSET approach is the best one to use
in most circumstances. It's certainly a good starting point - you can monitor
how well it suits your application over time.

Consistency of results may be an issue, but any scheme you use is either going
to show inconsistency when you move between pages, or it's going to lie to
you. That's the reality of concurrent modification.

------
edw519
Nice starter list. Here's more, off the top of my head:

    
    
      - failure to lock rows with pessimistic locking (see sketch below)
      - failure to consider optimistic locking when viable (see sketch below)
      - data sets within a table (undernormalization)
      - too much duplicate app code that should be stored procedures
      - overuse of stored procedures that should be in app code
      - appending with insert
      - appending to datasets with variable length
      - "smart" keys that can never be changed without rewriting
      - underuse of indexing, slowing read performance
      - overuse of indexing, slowing update performance
      - records too big for good hashing (undernormalization)
      - columns with different typed data (when DBMS allows)
      - columns with different logical data (app driven)
      - enhancing by inserting columns instead of appending
      - poor or missing audits
      - poor or missing security
      - poor or missing archiving
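
To illustrate the first two, a rough sketch (made-up table):

    -- pessimistic: take the row lock up front
    BEGIN;
    SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;
    UPDATE accounts SET balance = balance - 100 WHERE id = 42;
    COMMIT;

    -- optimistic: no lock; detect concurrent writers via a version column
    UPDATE accounts
    SET balance = balance - 100, version = version + 1
    WHERE id = 42 AND version = :version_read_earlier;
    -- 0 rows updated => someone else got there first, so retry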

~~~
kafkaesq
_appending with insert_

As opposed to?

------
l8again
A bane of our existence, when working with dbs, is that we always conflate
storage with query optimization. When we talk about databases, we
unconsciously think of both concerns. This is traditionally instilled in us because that makes the most sense starting out, but almost always what ends up happening is that your dataset needs to be shared among multiple applications, each
with its own optimization requirements. And that's when things start to go
haywire. At some definition of scale, a better design is to have a storage
layer that is not really concerned with query optimization, and allow
different applications to design their own query systems. Said another way,
even when you are working in RDBMS's, it's good to acknowledge storage and
query as two separate concerns.

------
sbuttgereit
Craig Kerstiens's advice/writings are usually pretty sound, but I think he makes a fairly significant mistake in providing the advice he does here: over-generalization. I appreciate he's talking about advice to beginners, but even so I think he needs to set context in his examples. Missing the context of possible solutions is probably the biggest mistake that anyone makes when making their choices in this (and many other) regard(s).

None of the things mentioned are _necessarily_ bad depending on context. For
example, if I have a billion people per second visiting my website, yeah,
storing the site images in PostgreSQL alongside the rest of the data probably
isn't a great idea. But, if I am building an ERP system for the SMB market, it
_may_ be a fine idea given the degrees of concurrency I can expect, the cost
of server equipment, and the advantages of keeping related data (binary or
otherwise) together. The trade-offs are different. Simply put, it depends.

I think this post could be improved with some clarification, and I do think a tip sheet for beginners is a good idea coming from someone who does have good advice for PostgreSQL... it just needs to be less generally prescriptive and more instructive about how to think about the given advice.

~~~
sbuttgereit
Actually, thinking about it, I think the biggest database beginner mistake I
see is simply not coming to grips with what the data actually means. The
project I'm working on right now is a do-over project that exists because the
original project team didn't take the time to really understand _why_ they
were doing the project, _why_ the project mattered to the client; they simply
saw, "it needs to do 'x', it needs to do 'y'". The client, in turn, simply
took it for granted that the project team understood the problem. In the end, it resulted in irrecoverably bad data collection.

If you fail to understand the meaning in data and just try to jam values into some persistence store to get through some transaction someone told you to code up, you'll likely make decisions whose consequences you can't even see yet. Each table, each value, expresses some idea, some piece of information that, without even considering the application functionality, has meaning in the context of the rest of the data you're capturing. If you respect that relationship of ideas, you'll know whether or not normalization or de-normalization makes sense, and you will more likely have a flexible information architecture rather than a brittle one.

So there's my beginner's advice: really understand what you are stuffing into a database, and why, in the first place. Don't lose sight of the
larger context. Conceptualize the information as information. Then figure out
how the technology facilitates (or doesn't) the expression of that information
with the greatest clarity.

------
acveilleux
If only developers learned more about the DB engines they use and their
capabilities.

I've seen so many bone-headed decisions (EAV "schemas" being a recurring one) made because it was easier to program around the DB than to learn how to use it properly. The fact that a few classes insulate you from the bone-headed SQL doesn't make it a good idea.

~~~
zer00eyz
EAV isn't all bad; there are use-cases for it. If you have highly dynamic (user-configurable) data, then EAV might be a good way to STORE that data.

Is it ugly? It sure is. But move reads out to a cache layer and search out to a purpose-built system, and the DB becomes a persistence engine. That also solves the other issue you don't mention with EAV, and that's performance.

Edit (my ability to post should be forbidden till I have had coffee, cleaned
up for clarity)

~~~
acveilleux
Most EAV implementations I've seen had under 10 distinct "schema-less" virtual tables (sorry about the lack of terminology) with perhaps 40-60 distinct "columns", and the different variations of a record type usually had consistent attributes.

Basically the whole thing could've been replaced by 5-10 tables, maybe taking advantage of Postgres's ORDBMS capabilities for the common columns.

Queries tended to pile up complex CTEs and multiple self-joins. You know it's an anti-pattern when devs start complaining about postgres join performance and you see they've got two tables being joined 15 times...
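
The shape of the problem, roughly (names invented):

    -- EAV: one self-join per "column" just to reassemble a record
    SELECT e.id,
           n.value AS name,
           p.value AS price
    FROM entities e
    JOIN attrs n ON n.entity_id = e.id AND n.attr = 'name'
    JOIN attrs p ON p.entity_id = e.id AND p.attr = 'price';

    -- vs. the 5-10 plain tables it should have been:
    SELECT id, name, price FROM products;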

------
turrini
You can, instead of using UUID or integer sequences, implement something like
Instagram (and others) did years ago:

[http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram](http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram)
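
The core of it is a plpgsql function that packs a millisecond timestamp, a logical shard id, and a per-shard sequence into one bigint; roughly (simplified from the linked article):

    CREATE SEQUENCE ids_seq;

    CREATE OR REPLACE FUNCTION next_id(OUT result bigint) AS $$
    DECLARE
        our_epoch  bigint := 1314220021721;  -- custom epoch, in milliseconds
        seq_id     bigint;
        now_millis bigint;
        shard_id   int := 5;                 -- fixed per logical shard
    BEGIN
        SELECT nextval('ids_seq') % 1024 INTO seq_id;
        SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
        result := (now_millis - our_epoch) << 23;  -- 41 bits of time
        result := result | (shard_id << 10);       -- 13 bits of shard
        result := result | (seq_id);               -- 10 bits of sequence
    END;
    $$ LANGUAGE plpgsql;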

------
ReedRothchild
Not sure about point 1 as a blanket statement... many times external services
are not allowed and therefore aren't accessible. How is S3 storing them?
Probably in a DB. In some DB and application frameworks, IO issues with image
data can and have been streamlined. If not, better model design might help.
Putting them in S3 limits what you can do with the images. For basic
application images, yeah that doesn't belong in your DB...

As far as disk space... that could be a problem, but for most of the
applications that developers work on, they never need to scale to the point
where disk space is a problem. And if it does, you can always get more disk
space.

~~~
Mister_Snuggles
This advice really only applies to public web sites.

PostgreSQL could be used as the backend for, as an example, an internal
inventory tracking application. Something like this could allow users to
upload a generic photo of the item in inventory. Putting these in S3 or a CDN
makes no sense for an internal application.

And, of course, "internal application" doesn't necessarily mean a web
application. It could be a fat-client (WinForms, Qt, Cocoa, etc) that connects
directly to the database. It could connect to an application server (which
then connects to the database) and speak some custom protocol. It could be a
text-based terminal app. It could be that there are a variety of these
applications that all connect to the same database.

Better advice would be:

Store images outside of the database. If you are building a public-facing web
site or web application, something like S3 or a CDN may be a good fit. If you
are building an internal web application, or are otherwise unable to use S3 or
a CDN, storing images on a web server and storing the URL (or something that
allows you to determine the URL) is a better option. In some cases, such as
fat-clients that connect directly to the PostgreSQL database, you may find
that storing images in the database is truly the best option. Keep in mind
that this may impact performance in the following ways: (insert list of
potential issues).

~~~
ReedRothchild
I agree with your "better advice"... Even if it's a public facing app, having
them stored in the DB makes it much easier to restore, replicate, etc... Not
to mention allowing fine grained authorization rules. Any time I have worked
on a system (web app, fat-client, etc) where images, which are really data, are separate from the image metadata, over time a colossal mess occurs: orphaned files without metadata, metadata without files.

Even in your public web app, you also might have other back-end systems that
cannot access S3, but can access your application services or DB (yikes).
There are definitely many cases where external providers like S3 are the better
approach, but not all. Also, you don't have a performance problem until you
have a performance problem.

------
zzzeek
Please don't use UUID primary keys, at least by default. Not only will your PK
indexes take up 4x the space, all of the foreign keys referring to it will
also need indexes which also blow up in size, etc. I've seen very poor
performance due to this issue. A surrogate PK is still a surrogate PK, so
you're committing a relational algebra sin in any case (one which IMO is
necessary for practicality concerns).

------
bcg1
I really have to take issue with the normalization point. Modern databases
have optimizers that operate under the assumption that your schema is
normalized. When you try to get clever and do optimizations yourself, you can
very easily trick the optimizer into doing the wrong thing, and actually make
things worse. I would argue that de-normalizing is much more of a beginner
mistake than over-normalizing.

------
emilong
One way to generate UUIDs in a more index-friendly way, at least in Postgres,
is to use uuid-ossp's uuid_generate_v1mc() function. They end up sequential,
so maybe not the best if you need total unpredictability, but the lower order
bits should be reasonably random.

h/t to Wayne E. Seguin for his article
[http://www.starkandwayne.com/blog/uuid-primary-keys-in-postgresql/](http://www.starkandwayne.com/blog/uuid-primary-keys-in-postgresql/)

Docs here: [https://www.postgresql.org/docs/9.4/static/uuid-ossp.html](https://www.postgresql.org/docs/9.4/static/uuid-ossp.html)
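
Usage is just (the extension ships with Postgres contrib):

    CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

    CREATE TABLE posts (
        id    uuid PRIMARY KEY DEFAULT uuid_generate_v1mc(),
        title text NOT NULL
    );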

------
dep_b
Oh god somebody gave the advice that you can normalize too much and some n00b
will take it to heart.

Databases I've encountered that have been undernormalized: a gazillion.

Databases I've encountered that were too normalised: never.

Until you are a Really Smart Dude(tte) working at a Really Important Company being paid accordingly, you're probably not capable of deciding up front that your database does not need foreign keys, indexes and normalization, so please just normalize like you've learned in databases 101, okay?

------
hipjim
Point 1 is silly - this depends greatly on what your application needs to do.
If you need to store the image together with the metadata or other information
related to the image that is transactional data - then storing the image in
the database is the way to go. This also helps a lot with database backups,
migrations, etc.

~~~
Mister_Snuggles
Yup.

Not every PostgreSQL-backed application is a public-facing web app. A fat-
client that connects directly to the PostgreSQL database to do "stuff" with
transactional data will have a much easier time if any related images are
stored alongside the transactional data.

------
danso
This is much more common in spreadsheets, where free-wheeling data entry is
encouraged, but a novice problem I see in real-world data structuring
scenarios is breaking a column's enumeration...e.g. for a column that is
supposed to be zip codes, doing something like "90210 and 90240", instead of
creating a second column to handle edge case notes.

Related to that: inconsistent values for NULL/False, e.g. "No", "", "False",
"false", etc...(or rather, not using a boolean type to enforce this).

And related to that: general unawareness of what NULL means. It's not the same
as "" or "False" or 0, both in a technical sense and in a real-world sense.

------
LunaSea
The "Over normalization" point is basically saying that SQL is bad at handling
relational data.

Kind of ironic when you see how much bad press NoSQL databases got for missing
join functionalities.

~~~
collyw
Bad if it is used incorrectly. That could be said of many things.

------
mathattack
Great post. I find that many front end developers start with the User Design
first, ignoring how the data should be stored or accessed. While user
centricity is a great value, if you worry about screens needed rather than
data needed, you cause trouble down the road. I've found that a data-first
mental model is much more scalable and supportable. Many times the details of
the databases get abstracted away, and performance suffers.

~~~
lloyd-christmas
> I find that many front end developers start with the User Design first,
> ignoring how the data should be stored or accessed

This sentence seems contradictory. If you start with user design, you then
know how the data will be accessed which allows you to store it with purpose.
Obviously this shouldn't be at the expense of proper database design, but how
your app functions defines how you should store your data. While not directly
screen-by-screen design, one example I always see is that people default to
creating an Address table. Many apps never store more than one address per
user. If it's one-to-one, why force a join? (I think this is more of an
overlap of both of our views though, and it popped into my head because
Address tables are a pet-peeve)

~~~
developer2
>> how your app functions defines how you should store your data

I think this is the fallacy the parent you are replying to is attacking. Your
data should stand apart from your application. Applications change over time.
Many databases also serve multiple applications (web, mobile, api, stats,
etc.). Your data store should make sense on its own without being tied to an
application's design. If you build your database according to your
application, then making changes to the application can be difficult or
require refactoring the database to match the new application specs. If you
instead build your database to make sense standalone, it's up to each
application to use it appropriately.

~~~
lloyd-christmas
> Many databases also serve multiple applications... Your data store should
> make sense on its own without being tied to an application's design.

While your point is valid, I think it's theory vs. practice. I think most
databases don't get used for wildly different applications. I'd rather design
my database for something that I know is performance-sensitive than design it
for an imaginary application that might exist in the future. I've never
experienced a case where our application changes dramatically enough that
manipulating your database is anything serious enough to write home about. If
you previously only had a shipping address and now you need a billing address
as well, it's a pretty straightforward migration. Incremental change isn't
hard.

Companies that become Medco are few and far between. Their database should default to modeling one-to-one relationships as one-to-many just so someone might be able to use it more generically in the future. But that's because it's a
realistic use-case. My point was that there is a balance to be had between
stand-alone and real-life usage, and you shouldn't default to full generic
just to make it stand-alone.

~~~
mathattack
_While your point is valid, I think it's theory vs. practice. I think most databases don't get used for wildly different applications. I'd rather design my database for something that I know is performance-sensitive than design it for an imaginary application that might exist in the future._

Reminds me of the cynical saying for data warehouses, "Data in, but never
out." :-)

What I have seen frequently enough is myopic data designs hurting flexibility
later. For example - assuming 2 level customer relationships (Corporate parent
and individual store) with things like regions appearing as tags, rather than flexible hierarchies. The assumptions behind this then get built into the
code base, and fixing it requires more than just a database update. (And even
if you fix the database, you are missing the historical hierarchies)

~~~
lloyd-christmas
> The assumptions behind this then gets built into the code base

This logic shouldn't be in the code base. A query should be isolated from the
application logic, as I think we can all agree on. A change in the database
should only require changing the query/procedure. Your business logic
shouldn't be dependent on the internal workings of the query, just on its input/output, which shouldn't need changing. Adding a feature that requires a
database refactor shouldn't impact the internal logic of another feature
(unless it's intentional).

I'm not encouraging willy-nilly design. I'm not saying "stick everything on
one row". Just don't design a one-to-one as a one-to-many just because it
might theoretically change. However, you should still have the foresight to
put yourself in a position where that change is easy. People seem to think
that "refactoring" is a dirty word. I'm reasonably confident none of us have
worked on an application that has never been refactored. Plan on those potential refactors rather than convincing yourself that "this is how proper design works". I find _THAT_ is what inevitably leads to the painful refactors.

> And even if you fix the database, you are missing the historical hierarchies

I'm not following this one. Are you referring to an audit trail?

~~~
mathattack
_> And even if you fix the database, you are missing the historical
hierarchies

I'm not following this one. Are you referring to an audit trail?_

No - meaning that if you update the database schema, data will be missing that wasn't collected properly the first time. (If you didn't think you needed customer hierarchies, you didn't create them as customers came in)

I hear you on avoiding over-generalizing. That creates problems too.

~~~
lloyd-christmas
> data will be missing that wasn't collected properly the first time

Maybe I'm nitpicking but even if you had that field in the database, it's
still up to the application to collect it (unless it's something like a
timestamp, but that's just a dumb mistake regardless of how you design your
database)

~~~
mathattack
Yes - I think we both agree.

------
wvenable
This is passable advice at a scale that probably 99.999% of people will never
need. I've worked on very popular systems, very old systems, and very popular
and very old systems and I've never been at this scale.

When you're not at some massively huge scale, this advice is mostly terrible.
You should do exactly the opposite.

------
ivanhoe
I'd say the number one mistake newbies make is not setting proper indices... a much bigger deal than any of these...

------
barkingdog
Hmm, going to say a hard no on naively preferring UUIDs. It's a little bit of
extra work to go with the instagram style approach [0] of using many logical
postgres shards with a custom id function that implements twitter snowflake
style IDs, but it's an elegant, scalable and operationally simple solution. I
am very fond of it.

[0] [http://rob.conery.io/2014/05/29/a-better-id-generator-for-postgresql/](http://rob.conery.io/2014/05/29/a-better-id-generator-for-postgresql/)

------
r00fus
Over-normalization - one enterprise design pattern I've seen is to take fields
that have basic values (which may have more attributes than just text) and
create a field_values table which can be reused across objects. The key is to
have a composite natural key like "field_name, value_name" so you can reuse
across parent tables without creating separate value tables for common things
like status, phase, etc.

It allows more configurability as well - you could have a UI to control/audit
all such values and allow for easy external mapping.
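
A sketch of the pattern (hypothetical names):

    CREATE TABLE field_values (
        field_name text NOT NULL,  -- e.g. 'status', 'phase'
        value_name text NOT NULL,  -- e.g. 'open', 'closed'
        sort_order int,
        PRIMARY KEY (field_name, value_name)  -- the composite natural key
    );

    -- parent tables reuse it instead of growing their own lookup tables:
    CREATE TABLE tickets (
        id           serial PRIMARY KEY,
        status_field text NOT NULL DEFAULT 'status',
        status_value text NOT NULL,
        FOREIGN KEY (status_field, status_value)
            REFERENCES field_values (field_name, value_name)
    );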

------
isp
These specifically apply to Postgres, and don't necessarily generalise to all
databases.

One example: counter-example to #4. Oracle database (since 11g) has a "fast
add column" feature - which allows adding a non-NULL column with a default
value to an arbitrarily-large existing table, without a "rewrite" of the
table. (Behind the scenes, the default value is stored as metadata - and the
default value is then read from the metadata for preexisting rows, rather than
updating each and every row in the table.)
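
In other words, on 11g+ something like this is a metadata-only change, no matter how large the table (hypothetical names):

    -- no table rewrite: the default is stored as metadata and applied on
    -- read for preexisting rows
    ALTER TABLE big_table ADD (status VARCHAR2(10) DEFAULT 'NEW' NOT NULL);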

------
farnulfo
About default values on NULL columns, Oracle 11g and 12c have some interesting optimizations: [http://www.oracle.com/technetwork/articles/database/ddl-optimizaton-in-odb12c-2331068.html](http://www.oracle.com/technetwork/articles/database/ddl-optimizaton-in-odb12c-2331068.html)

------
perezdev
Number 2 is kind of funny:

>>The unfortunate part is: pagination is quite complex, and there isn’t a one-
size-fits-all solution.

Probably not. But the example he gave is a one-size-fits-most. I've used it on a dozen different projects and have never had a performance issue.

~~~
ivanhoe
The problem is actually any ORDER BY operation on a big table, but often it can be fixed with a bit of index magic... on a small enough DB (and that's like 99% of web projects) you don't really need to care about this, ever...
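
The usual "index magic" is keyset (a.k.a. seek) pagination: an index on the sort key plus a WHERE clause instead of OFFSET. A sketch (invented columns; the :last_* values come from the previous page):

    -- needs an index on (created_at, id)
    SELECT id, title, created_at
    FROM posts
    WHERE (created_at, id) < (:last_created_at, :last_id)
    ORDER BY created_at DESC, id DESC
    LIMIT 20;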

------
calebsurfs
As a beginner using MySQL I was expecting not using

    
    
      innodb_file_per_table 

or generally learning db configuration to be on the list. It's a huge pain
when your little web server runs out of disk space because innodb is eating it
all.
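
For anyone else bitten by this, the setting goes in my.cnf (and, if I recall correctly, it's on by default from MySQL 5.6 onward):

    [mysqld]
    innodb_file_per_table = 1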

~~~
molecule
Good advice, but that's specific to MySQL and its InnoDB engine, while the
article is about general relational databases.

------
jayzalowitz
Point one may make a huge difference in page load time... S3 is fast, but it isn't no-connection instant and cached like inline base64.

~~~
foxylion
I think he is talking about a web application. So providing an image tag with the URL, as a src attribute, to the image hosted on S3 seems like a better solution than embedding the image in the HTML response (and blocking the page rendering for a longer time).

------
_RPM
Facebook has ~2 billion users, and if they use 64-bit primary keys, there's no way they will reach that limit of (2^63) - 1.

------
fhood
So I'm curious, say you do want to store images in your own database, what is
the best way to implement that?

------
knucklesandwich
IMO normalizing up front is rarely ever a mistake. The primary advantage of
normalization is it makes your schema amenable to changes in requirements.
What if category ends up needing a description, and an id for a permalink-able
page with links to each post in a category, etc.? Normalize up front and then
denormalize where you need to when the requirements crystalize and your focus
becomes responsiveness.

~~~
epalmer
Only as a general rule: I have found that write-once, read-mostly tables (history tables) are great candidates for denormalizing. State tables (read & write) usually benefit from third normal form.

~~~
nsxwolf
Right, which I never understood until the first time I was tasked with
building a data warehousing and reporting application. My naive 3NF schema
approach led to joins that crippled report performance, and really messy
schemes to deal with underlying changes in the application's schema over time.

------
tkyjonathan
Using an ORM should be top. Clojure is the only language doing it right.

