Chasing pointers is cache death. Many theoretically optimal algorithms have been struck down by the God of Memory Latency.
E.g., if you measure it, tree performance will often be much improved by storing slightly less than one cache line's worth of data at each node. The CPU can pipeline and predict the shit out of the linear search at each node, to the point that all the theoretical O(1) or O(log N) lookups in the world get demolished by a CS101 scan of the array. This makes a relatively good general-purpose cache-aware data structure, because small collections tend to fit into a single cache line while large data structures have relatively few pointers to chase.
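A minimal sketch of the idea (illustrative only, not anyone's real implementation): a B-tree-style node holding a small sorted array of keys, sized so that ~7 eight-byte keys would fill one 64-byte cache line in a native-code layout, with a plain linear scan at each node instead of binary search. The `MAX_KEYS` value and node shape are assumptions for illustration; Python object overhead means the cache-line math only literally applies in C/C++/Rust.

```python
MAX_KEYS = 7  # ~one 64-byte cache line of 8-byte keys in a native layout

class Node:
    __slots__ = ("keys", "children")

    def __init__(self):
        self.keys = []      # sorted, len <= MAX_KEYS
        self.children = []  # len(keys) + 1 children if internal, [] if leaf

def contains(node, key):
    """Descend the tree, doing a CS101 linear scan within each node."""
    while node is not None:
        i = 0
        # Linear scan over a tiny contiguous array: trivially predictable
        # branches, no pointer chasing until we leave the node.
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if i < len(node.keys) and node.keys[i] == key:
            return True
        node = node.children[i] if node.children else None
    return False
```

The point is that descending the tree costs one cache miss per node, and everything inside a node is a branch-predictor-friendly scan, which is exactly the trade the comment describes.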
Of course, as the OP demonstrates, optimizing for your specific use case can yield even more dramatic improvements.