

Generating unique Base 62 encoded strings - KevBurnsJr
http://blog.kevburnsjr.com/php-unique-hash

======
Xk
This scares me.

> I could run it out to md5 and trim the first n chars but that’s not going to
> be very unique.

What? MD5 is going to be orders of magnitude better than what he's given.

> Storing a truncated checksum in a unique field means that the frequency of
> collisions will increase geometrically as the number of unique keys for a
> base 62 encoded integer approaches 62^n.

Well, duh. That's a given. And his solution won't do any better.

> I’d rather do it right than code myself a timebomb.

Doing it right would be using a real hash. Not something you came up with over
a cup of coffee.

> Pretty random-looking, huh?

If that's his idea of testing for randomness... Use any randomness test and I
guarantee you MD5 will perform better and faster.

> This is a minimum security technique.

This is the best piece of advice in the whole piece. Please never ever use
this for something you want to be secure. I haven't tried to break it (maybe
I'll do that over the weekend), but giving it a first glance I would be
willing to bet anyone with some skill would be able to do so.

"Anyone, no matter how unskilled, can design an algorithm that he himself
cannot break." -- Bruce Schneier

~~~
KevBurnsJr
If I want a string that's 5-6 chars (for, say, a URL shortener), truncating an
MD5 is a bad idea since it IS random.

These keys are GUARANTEED to be unique. You can run all the way up to 62^n
without any key conflicts.

If you truncated an MD5 to 3 characters, by 62^3/2 you'd have a 50% chance of
collision.

~~~
oakenshield
Who says you have to truncate? Split it into four 4-byte chunks and xor them.

~~~
Xk
You won't get any better with that. Even if MD5 was a true source of
randomness, the problem is still that you've only got 32 bits, so you'd expect
a collision after 2^16 with a random function.

Besides, xoring the other bits does nothing to increase the security on non-
broken hashing functions. Take the extreme case of xoring every bit to
generate either a 0 or a 1. You've put a lot of effort into generating that
single bit, but it's no more random than if you just took the lsb of the hash.
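
The 2^16 figure above is the birthday bound. A rough numerical check, sketched in Python (the function names are hypothetical, and the formula is the standard approximation for uniformly random keys, not anything taken from the original post):

```python
import math

def collision_probability(num_keys: int, space_size: int) -> float:
    """Approximate chance of at least one collision among num_keys
    uniformly random draws from space_size possible values."""
    return 1.0 - math.exp(-num_keys * (num_keys - 1) / (2.0 * space_size))

def keys_for_probability(p: float, space_size: int) -> int:
    """Approximate number of random keys needed before the collision
    probability reaches p (birthday approximation)."""
    return math.ceil(math.sqrt(2.0 * space_size * math.log(1.0 / (1.0 - p))))

if __name__ == "__main__":
    spaces = (
        ("32-bit value", 2 ** 32),
        ("3 base 62 chars", 62 ** 3),
        ("5 base 62 chars", 62 ** 5),
    )
    for label, space in spaces:
        print(label, "-> ~50% collision chance after",
              keys_for_probability(0.5, space), "random keys")
```

For random keys the 50% point scales with the square root of the key space, so it arrives long before the space is half full.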

------
bonzoesc
> $dec = ($num * $prime)-floor($num * $prime/$ceil)*$ceil;

This looks like a clumsy way to implement modular multiplication. My PHP is
(thankfully) rusty, but it looks like an affine cipher with a fixed key?

If it is, a user can obtain two consecutive "hashes" and calculate past and
future hashes. If they know the corresponding plaintext for a single hash,
they can calculate arbitrary hashes.
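
A minimal sketch of the scheme as described in this comment, in Python rather than PHP. It assumes the quoted line computes `(num * prime) mod ceil` with `ceil = 62^n`; the multiplier, the alphabet order, and all function names below are hypothetical and not taken from the original post.

```python
from math import gcd

# Assumed alphabet order; the original post may use a different one.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62(value: int, width: int) -> str:
    """Encode value as a fixed-width base 62 string."""
    digits = []
    for _ in range(width):
        value, rem = divmod(value, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def unbase62(text: str) -> int:
    """Decode a base 62 string back to an integer."""
    value = 0
    for ch in text:
        value = value * 62 + ALPHABET.index(ch)
    return value

def obfuscate(num: int, multiplier: int, width: int) -> str:
    """Multiply modulo 62^width, then base 62 encode.
    The mapping is one-to-one only if multiplier is coprime to 62^width."""
    modulus = 62 ** width
    if gcd(multiplier, modulus) != 1:
        raise ValueError("multiplier must be coprime to 62^width")
    return base62((num * multiplier) % modulus, width)

def predict_next(hash_a: str, hash_b: str) -> str:
    """Given the hashes of two consecutive inputs, predict the next hash.
    The difference of consecutive outputs is the multiplier itself."""
    width = len(hash_a)
    modulus = 62 ** width
    a, b = unbase62(hash_a), unbase62(hash_b)
    step = (b - a) % modulus
    return base62((b + step) % modulus, width)

WIDTH = 5
MULTIPLIER = 566201239  # hypothetical value, chosen only to be coprime to 62

if __name__ == "__main__":
    h41 = obfuscate(41, MULTIPLIER, WIDTH)
    h42 = obfuscate(42, MULTIPLIER, WIDTH)
    print(predict_next(h41, h42) == obfuscate(43, MULTIPLIER, WIDTH))
```

Because the multiplier is invertible modulo 62^n, no two inputs below 62^n share an output, which is the uniqueness property claimed for the scheme. The same linearity is what lets `predict_next` work.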

~~~
KevBurnsJr
PHP's modulus operator is broken for larger values.
[http://php.net/manual/en/language.operators.arithmetic.php#9...](http://php.net/manual/en/language.operators.arithmetic.php#99112)

True about consecutive hash calculation. Note that it's for obfuscation, not
encryption.

~~~
bonzoesc
Why not just use an algorithm that is less than 2,000 years old, such as AES
(usable through the <http://us3.php.net/manual/en/function.openssl-encrypt.php>
API):

> openssl_encrypt('asdf', 'aes-256-cfb', 'a password', false, 'initialization v');

returns: 4kavvg==

You get the same direct mapping of inputs to outputs but you don't have to
reinvent affine ciphers and make guesses about their security properties. And
since you're just using it to turn a small integer into something bigger to
make clients/users happy and not to protect data, Mr. Ptacek won't flame you
out.

~~~
KevBurnsJr
That might also be a good solution, particularly if reversibility is
necessary.

------
bkrausz
Why wouldn't you run MD5 with binary output, then convert the output to base
62? That's ~21 digits in base 62 and doesn't require figuring out your own
hashing function.
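
A rough Python sketch of this suggestion (the function name and alphabet order are assumptions). A 128-bit digest needs at most 22 base 62 digits, so the output is left-padded to that width.

```python
import hashlib

# Assumed alphabet order; any fixed ordering of the 62 symbols works.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
MD5_BASE62_WIDTH = 22  # 62^21 < 2^128 <= 62^22

def md5_base62(data: bytes) -> str:
    """Hash data with MD5 and render the 128-bit digest in base 62."""
    value = int.from_bytes(hashlib.md5(data).digest(), "big")
    digits = []
    while value:
        value, rem = divmod(value, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)).rjust(MD5_BASE62_WIDTH, ALPHABET[0])

if __name__ == "__main__":
    full = md5_base62(b"http://example.com/some/long/url")
    print(full)
    print(full[-6:])  # a 6-char slug taken from the trailing digits
```

If the string is shortened, taking the trailing characters is probably the safer choice, since the leading digit of the padded string only takes a few of the 62 possible values. A shortened slug can still collide, so it would need a uniqueness check on insert.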

~~~
KevBurnsJr
For shorter 5-6 char strings, à la TinyURL slugs.

~~~
bkrausz
You can truncate the md5 base 62 string to anything you want after
that...you're still going to have approximately as much uniqueness as his
hash.

~~~
KevBurnsJr
MD5 truncation does not guarantee collision avoidance. This does.

------
copper
> I chose primes near the golden ratio to maximize the appearance of
> randomness.

Anybody know what this appearance of randomness is?

~~~
KevBurnsJr
If you use a prime near 62^n/2, udihash(range(1,10)) would still be unique but
it would return a list that clearly has some linearity (V0001, 00002, V0003,
00004, V0005, 00006, V0007, 00008, V0009, 0000A, etc).

Using a prime near the golden ratio makes the list appear less linear (cJio3,
EdRc6, qxAQ9, TGtEC, 5ac2F, huKqI, KE3eL, wXmSO, YrVGR, BBE4U).
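
The effect can be reproduced with a short Python sketch. The multipliers below are hypothetical: they are only chosen to be coprime to 62^5 (enough for uniqueness), not necessarily prime, and they are not the values from the original post. The alphabet order is also an assumption.

```python
from math import gcd

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
WIDTH = 5
MODULUS = 62 ** WIDTH
GOLDEN_RATIO = (1 + 5 ** 0.5) / 2

def base62(value: int, width: int) -> str:
    """Encode value as a fixed-width base 62 string."""
    digits = []
    for _ in range(width):
        value, rem = divmod(value, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def nearest_coprime(target: int, modulus: int) -> int:
    """Smallest integer >= target that shares no factor with modulus."""
    candidate = target
    while gcd(candidate, modulus) != 1:
        candidate += 1
    return candidate

def sequence(multiplier: int, count: int = 10) -> list:
    """Outputs for the inputs 1..count under the given multiplier."""
    return [base62((n * multiplier) % MODULUS, WIDTH) for n in range(1, count + 1)]

# Multiplier just above half the key space: outputs alternate visibly.
near_half = nearest_coprime(MODULUS // 2 + 1, MODULUS)
# Multiplier near key space / golden ratio: outputs look scattered.
near_golden = nearest_coprime(round(MODULUS / GOLDEN_RATIO), MODULUS)

if __name__ == "__main__":
    print(sequence(near_half))
    print(sequence(near_golden))
```

Both sequences are collision-free; only the visible pattern differs, so this is about appearance and not about security.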

------
psadauskas
On a related note, I've been trying to figure out a way to encode MongoDB
ObjectIDs (a 24-char hex string, like `"4d82a373aeb4b69aec000001"`) into a
shorter Base64 string usable in URLs (e.g., `/posts/{id}`). The problem is, it
still takes a 16-char Base64 string to represent the same number as a 24-char
hex string, and the Base64 one is even uglier.

I've been contemplating a way to generate my own IDs, similar to this, but was
running into trouble with how to make sure they're always unique when generated
on distinct machines.

~~~
KevBurnsJr
Yes, at some point you need some sort of atomic ID generation. I'm using Riak
for the key-value store and Redis's atomic incr for the ID generation. Anyone
who wants to save an object into the bucket first has to get a unique ID from
Redis to be rotated and base62 encoded.

