
Base52 url shortener perfect hash function in C - termie
http://lambdajones.com/b52
======
gojomo
No AEIOU because of fear of spelling naughty words?

Over what set of inputs is this claimed to be a 'perfect hash'?

<http://en.wikipedia.org/wiki/Perfect_hash>

I guess I'm missing the point. Is an explanation of the motivation for this
anywhere?

~~~
lambdajones
b52 maps each short url code to a unique number, 0-(52^5)

~~~
gojomo
There's a shorter way without a bunch of mysterious constants:

    
    
      // decode a 5-digit base52 number
      unsigned long b52(const char *c) {
        unsigned long factor = 1;
        unsigned long result = 0;
        for(int i = 0; i < 5; i++) {
          result += factor * b52map[c[i]];
          factor *= 52;
        }
        return result;
      }
    

(52^5 < 2^32 so an 'unsigned long' is plenty; 52^11 < 2^64 so you could get an
11-digit base52 number into an 'unsigned long long'. I'd be tempted to
generalize the function to handle 0-11 digits, by requiring the parameter to
be null-terminated or passing in a length.)

~~~
lambdajones
Got around to profiling it last night. My code is faster, if you accept the
cost of mystery.

<http://lambdajones.com/b52why>

~~~
gojomo
How could that possibly make a difference for the intended application? Is the
server CPU-maxed and never faces IO waits where more calculation is free?

If speed were a valid goal (and it's not), then:

(1) You would only use overkill 64-bit integers if on a 64-bit processor and
profiling had proven them to be faster.

(2) Precalculating the factors would speed the clearer code, possibly making
it faster:

    
    
      const unsigned long factors[5] = 
        { 1, 52, 52*52, 52*52*52, 52*52*52*52 };
      // decode a 5-digit base52 number
      unsigned long b52(const char *c) {
        unsigned long result = 0;
        for(int i = 0; i < 5; i++) {
          result += factors[i] * b52map[c[i]];
        }
        return result;
      }
    

...or unroll the loop...

    
    
      // decode a 5-digit base52 number
      unsigned long b52(const char *c) {
        unsigned long result = b52map[c[0]];
        result += (52) * b52map[c[1]];
        result += (52*52) * b52map[c[2]];
        result += (52*52*52) * b52map[c[3]];
        result += (52*52*52*52) * b52map[c[4]];   
        return result;
      }
    

(Here I'm trusting the compiler to precalculate the powers of 52, but the code
is still clearly 'parsing a base' rather than 'mysterious multiplications and
divisions'.)

(3) You would strongly consider using a Base64-based approach, so all
conversions involve bit-shifts instead of multiplication. If certain
characters can't be used -- eg vowels -- you might still prefer large gaps in
your ID series. (If you're using a SQL database, it won't care -- but then
again, if there's a SQL database anywhere in the system, its operations will
make these micro-optimizations even more ludicrous than they are already.)

------
steilpass
So what does likely() do?

~~~
unlink
Presumably it is defined to __builtin_expect(x, 1)[0], a hint to gcc that one
branch (in this case the first) of the if is more likely.

[0] <http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html>

------
trezor
If your answer is a url shortener you are asking the wrong question.

~~~
lambdajones
I don't disagree. I tried to tell the product people that on several
occasions, but it fell on deaf ears. So might as well do it right, eh?

