

Disk space is cheap… That's not the point - kachnuv_ocasek
http://www.sqlskills.com/BLOGS/KIMBERLY/post/Disk-space-is-cheap.aspx

======
kevinpet
I feel like this a very forceful response to something, but I don't know what.
I'm missing some context to be able to make sense of the article.

~~~
steve-howard
I think she's refuting the idea that GUID primary keys are a one-size-fits-all
solution for database design. Jeff Atwood discussed some of the pros and cons
of this approach a few years ago [1]. The force does seem a little odd,
though, and the use of all-caps words and bold/underlined text throughout the
article made it harder to read and understand.

The take-away, as far as I can tell, is that database design requires some
analysis of what's best for one's particular data set, rather than duplicating
what looks hot right now. Which is a good idea in any context if performance
really matters.

[1] [http://www.codinghorror.com/blog/2007/03/primary-keys-ids-
ve...](http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-
guids.html)

~~~
famousactress
GUIDs are weird, but global ids are _awesome_. I'm not really sure what reason
I'd have to use GUIDs instead of assigning ids out of a single 64bit int key
space.. What am I missing?

But seriously. Global-id-space is something I'll probably use on every new
project I touch... presumably for the same reasons people use GUIDs. So nice
to know that you can change data types, pack things into caches without type-
ids, or log identifiers without extra metadata. Super handy.

~~~
jbri
How are you ensuring global-uniqueness in your global id space when you have
multiple systems that may not even be in contact with each other a lot of the
time?

~~~
eru
Like steve-howard said: You choose randomly from a large enough space. That
way you can make collisions as improbable as you like.

(You can also add an identifier for each system and the current time, if you
don't trust your source of randomness. But that's beside the point.)

~~~
saurik
That's not what steve-howard said, and it certainly isn't how GUID/UUIDs work:
they are based on a combination of your MAC address, a process ID, a
sequential ID, a timestamp, and a random number; with this algorithm, global
uniqueness is confirmed. A "type 4" UUID, or as we like to call "not a UUID",
is a 128-bit random number, and only got defined to retroactively deal with
use cases of UUID that caused "too much information leakage" (such as Word
documents).

~~~
eridius
Global uniqueness is certainly not "confirmed". With a 128-bit key, you have
at most 2^128 different values (and it's actually slightly less because I
think the different UUID schemes all reserve at least one digit). Sure,
collisions are extremely unlikely, but they are also extremely unlikely with
the simple 128-bit random number.

~~~
saurik
Actually, no: they are "confirmed". The collisions in this scheme are
administratively controlled.

Think about it this way: with careful planning, you can use a 24-bit number
and have a reasonably large space of non-conflicting values. I wouldn't trust
rand() for this, but if there was a single server allocating numbers in
sequential order, this would be fine.

While at 128-bits, rand() starts becoming reasonable, the UUID algorithm is
designed to produce the "actually guaranteed to be globally unique" experience
that the sequentially incrementing counter did.

On an individual computer, this requires coordinating with a library
(preferably OS-supported, such as on Windows) in order to obtain global locks,
allowing identifiers to then be identified by a timestamp (to guarantee that
you can reuse the hardware later; this requires your clock to be monotonically
increasing on any given piece of hardware, which in practice means "make it
accurate"; there is a sequence numbering mechanism for allowing clock
rollbacks).

Between computers, node identifiers are MAC addresses, which are unique by
administrative fiat (blocks are allocated to companies, who are responsible
for guaranteeing the uniqueness of their devices; if you feel like some of
these providers are not abiding by this, you can simply not use their hardware
for your networking infrastructure, which is a good idea anyway due to how
Ethernet works).

If you haven't skimmed the spec, I recommend you do at some point:
[http://www.ics.uci.edu/~ejw/authoring/uuid-guid/draft-
leach-...](http://www.ics.uci.edu/~ejw/authoring/uuid-guid/draft-leach-uuids-
guids-01.txt)

"""This specification defines the format of UUIDs (Universally Unique
IDentifier), also known as GUIDs (Globally Unique IDentifier). A UUID is 128
bits long, and if generated according to the one of the mechanisms in this
document, is either guaranteed to be different from all other UUIDs/GUIDs
generated until 3400 A.D. or extremely likely to be different (depending on
the mechanism chosen). UUIDs were originally used in the Network Computing
System (NCS) [1] and later in the Open Software Foundation's (OSF) Distributed
Computing Environment [2]."""

------
yason
Disk space is cheap but your access to it goes through quite a narrow
bandwidth.

------
ntoshev
The clustered index of a table determines the order in which actual records
are stored. This means that with a random GUIDs column as a clustered index
you insert records in random places of the table, while with the two other
options you append records at the end - no surprise this is much faster.

Now, the table is not just sequentially written records, they are actually
organized as some b-tree variant. Still, appending to a b-tree is a faster
operation than inserting in the middle.

Rebuilding the index (actually the table) with a FILLFACTOR parameter makes
space in each b-tree node so that new records don't cause too much reordering.
This is not much of a solution though, because as the table grows the 10%
unallocated space will fill up and the performance will be back where we
started.

To explain the benchmarks shortly:

\- the performance difference between ints and sequential GUIDs is because
GUIDs are larger

\- the performance difference between sequental and random GUIDs comes from
appending records to the table vs inserts at random positions in the table

\- FILLFACTOR helps temporarily with random inserts by making space available
before the inserts happen

------
njharman
Am I the only one who is less convinced after reading this? 10k inserts in
5min on a laptop. That's enough for 95+% of stuff I work on. Seems like my
time is better spent learning technology that has performance leap over
esoteric tuning of SQL Servers. Leave that to people like the author. I'm sure
there's plenty of them to handle the small % that need this level and no more.

~~~
lukesandberg
The benchmarks in the article are obviously limited. For instance there was no
accounting for read performance, or even moderately complex queries. Also
given how fast performance was dropping off for the guids this would probably
matter for any db that was doing 10k inserts over the course of any amount of
time. A drop of of 30% write performance is extremely significant for even
moderately sized applications.

Also i don't think that the percentage of applications that need this level of
performance and no more is as small as you think it is. plenty of successful
businesses run critical application on SQL server (and other dbs) and this
kind of degradation could be disastrous. I don't think the necessity for SQL
tuning is as esoteric at all.

------
maxxxxx
It's the same as "Don't optimize prematurely". True, but also don't be stupid.

------
eridius
What is a "sequential GUID"?

~~~
dspillett
This is where you start with a UUID that is as random as usual, but subsequent
UUIDs for that same field are sequential rather than each being random
relative to each other. This makes them a little faster to generate (though
the significance of this is likely minute compared to the cost of writing them
to disk except in very specific cases) and makes index maintenance operations
more likely to produce efficient results (fewer split pages, less need to
defragment the indexes later, and so forth).

Another possible advantage of "sequential" UUIDs comes when a number of them
get transmitted in a textual form, for instance if they are product IDs and
you are sending out a long HTML page of links to products. The sequential IDs
will offer more redundancy for any transport level compression to chew on than
the normal random ones (assuming they are indeed sufficiently random) will.
This may also come into play if your database engine supports compressed
storage directly, as MSSQL 2008 does for backups (std and enterprise editions)
and live data files (ent edition only), though to a much lesser extent unless
the compression algorithm chosen is specifically optimal for small variations
in to 128-bit aligned data.

------
typicalrunt
I completely agree.

I've been told many times, even in university by seemingly educated
professors, that disk space is not a major factor while programming. But that
doesn't mean it can be dismissed either.

I have seen some terrible waste of space by programmers that don't know or
care about the cost of enterprise disk space (let alone performance of
system/app).

