

Obfuscate Your Company - Derferman
http://blog.twilio.com/2009/05/obfuscate-your-company.html

======
dpifke
The solution proposed by this article (using MD5 hashes instead of numeric
identifiers for primary keys) omits the point that strings are way more
expensive to look up than integers. You're creating a bunch of extra work for
the database on every single lookup.

A better solution is to use a global sequence rather than per-table sequences.
That way your primary key is still numeric (and fast). A curious onlooker can
guess how fast your database is growing, but they don't know whether that
growth is users, orders, log messages, etc. etc.

~~~
edw519
"omits the point that strings are way more expensive to look up than integers"

So what? We're talking microseconds.

I take it a step further than OP. Every data base record has 2 IDs, an
internal sequential one and an external one. The external one is cross-
referenced to the internal one, which is never seen by human eyes. So every
read is actually 2 reads, which is way more expensive still. I have been doing
this for years and have never seen any noticeable degradation.

The reason I started this has nothing to do with security. It's so that I can
change any key on any data base at any time without actually changing the
primary key and without a conversion. (You'd be surprised how often you need
to do this in commercial applications.)

~~~
dpifke
Using database OIDs as publicly accessible identifiers is a neat idea, one I
haven't tried.

Looking at the Postgres docs
(<http://www.postgresql.org/docs/8.0/static/datatype-oid.html>) they
discourage the use of OIDs, going so far as to say the default in future
versions is not to create OIDs for user-created tables.

Also, this would mean that a dump/restore of your database results in
invalidating all your identifiers.

What RDBMS do you do this on and have you run into any problems?

EDIT: Sure enough, in modern Postgres, these are disabled by default.
Additionally, they appear to be sequentially assigned:

dev01=> select *,oid from test; test_id | test_string | oid
\---------+-----------------+------- 1 | blah | 29837 2 | blahblahblah | 29838
3 | xxxblahblahblah | 29839

~~~
edw519
I implement double IDs in my own code, not in any built-in function of any
DBMS. You're right to be suspicious.

------
10ren
It's a cute insight, but more for observers than doers. You're probably better
off concentrating your energies on building something people want.

e.g. if you have only 4 customers, then working on that reality is more
important that hiding it. However, your competitors _would_ thank you for
focusing on the latter.

------
drawkbox
Another commenter already posted but UUIDs (most platforms) and GUIDs
(Microsoft platforms) are the best unique key and no need to hash them. Most
larger web products use them at least as a stamp or reference but many
distributed systems (especially for sync) use them for simplicity but also
globally uniqueness across many database domains.

<http://tools.ietf.org/html/rfc4122.html>

<http://en.wikipedia.org/wiki/Universally_Unique_Identifier>

~~~
jasonkester
... as long as nobody ever has to type them in.

Sometimes, short and typeable is a bonus, and integers win there every time.

------
StrawberryFrog
There is a good reason why a GUID/UUID or the like should never be used as a
database primary key when records are regularly added.

The short answer is that the records are usually stored in the primary key
order (clustered index) and every new record being inserted at a random
location in the middle of this order rather than just added at the end causes
problems.

See [http://stackoverflow.com/questions/583001/improving-
performa...](http://stackoverflow.com/questions/583001/improving-performance-
of-cluster-index-guid-primary-key)

[http://sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-
KE...](http://sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-
the-clustering-key.aspx)

~~~
russss
Fair point, but most DBs (including MySQL and Postgres) don't cluster their
PKs by default - those posts are both referring to SQL Server which apparently
does (and that's seemingly an implementation detail).

------
farmerbuzz
How much does this really matter anyway? If your app is devoid of content or
users won't it be obvious even without seeing a "low" auto-incrementing ID?

~~~
mattmaroon
It certainly doesn't matter to your customers, of whom 99.999999999999% will
never notice an id in a url.

And if you're about to IPO, just multiply all ids by 5 where customers might
see them. You're not _lying_ about how many customers you are getting if some
analyst sees that, but you might get a few extra bucks per share.

------
eli
md5(rand())) doesn't really make much sense. It's slightly _less_ random than
just using rand().

~~~
babo
But it has a constant length and hides randomness from the observer.

~~~
eli
"Hide randomness" ?

~~~
babo
From hashes you'll recognize the fingerprint of the hash function but not the
underlying data. Here is an example of random UIDs, generating a few user's in
a row and it's obvious that it's random: random.sample(xrange(10000), 10) [88,
3833, 6353, 8113, 8983, 4280, 7878, 3050, 9409, 404]

~~~
eli
So?

And if you generated random IDs with the same number of bits as a hash
function they would be indistinguishable.

------
Tichy
Just occured to me that this would probably also apply to suggested user names
with numbers: "'Jenny' is already taken, we suggest 'Jenny178'". Would be fun
to work out the user numbers from that (proportion of Jennys in the population
etc).

------
jasonkester
This goes both ways. If your thing is actually good, those numbers will
quickly climb to respectable levels.

Start a new Twiddla meeting today and you'll get a six-digit room ID. That's a
piece of information I don't mind people finding out. It shows we've been
around a while and that tons of people have been using it. Consider it a
subtle form of marketing to those who speak AUTOINCREMENT.

That being said, I have in the past started with >1000 seeds for other
services where I knew we wouldn't be getting the same sort of traction
immediately.

------
Confusion
I'd go the other way: just advertise the data you would also report to said
analysts. Saves everyone trouble.

------
ilyak
In Russia, when a book is printed, publisher have to state a lot of meta-data
on the page first: date, number of copies printed, font, paper quality. It's
mandatory, I think it's inherited from USSR.

My western books just do not carry such information. Which makes book
examining a lot less fun! So! I think that users have right to see the correct
meta-data, and that it should be in fact enforced somehow. Imagine a browser
without an URL bar!

------
huhtenberg

      For example, here's one way to generate a random, 
      fixed-length key in PHP:
    
      md5(uniqid(rand(), true))
    

Bad example. The obfuscation is weak and easy to guess. Just remember to salt
your hashes, gentlemen, and you'll be fine (some conditions apply).

(edit) erm .. sorry, had a brain spasm and didn't notice the rand() .. :)

~~~
teej
I've gotten in the habit of using:

    
    
        SHA1(unix_timestamp + "misc salt phrase" + rand )[0, 8]
    

It gives a nice, simple, 8 character id, pretty much not likely to create a
collision in a space of 1M keys, and not guessable. Am i doing it wrong?

~~~
jhancock
this is just fine. you are only at risk if your db libs/frameworks have no
checks for if collision does some day occur. For example, as an id being a
primary key in a db table, your db will cough up an error if you get
collision, but will your ruby/python/blub libs/frameworks handle it
gracefully?

~~~
CalmQuiet
Is it not a simple thing to require "unique" of the table column? ...and then
if the occasional "error" occurs in attempt to create a new row in table have
your framework retry until your randomizing routine succeeds in generating an
unused ID? Or am I missing something?

