The 4cc achieves a decent distribution, in the sense that you don't have to examine very many collisions to find the character you're looking for (or to conclude it isn't there).
A poor distribution is an obvious bug in hashing; if you don't suffer from that bug, you don't have to do anything. If you have the bug, it's obvious you have to change your hash calculation to avoid it. The developers of 4cc may have struggled with bugs where they had buckets that were too large for efficient searching.
Hash tables don't always need convoluted hash functions. When the keys are pointers (e.g. object identities themselves are used as keys to associate objects with additional properties), a sophisticated hash function is unnecessary; it can be as simple as extracting a few bits of the pointer, avoiding the lowest bits (which might all be zero due to alignment!).
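A minimal sketch of such a pointer hash in C, assuming 16-byte-aligned allocations and a power-of-two table (the shift amount and table size are illustrative, not from any particular implementation):

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_BITS 10
#define TABLE_SIZE ((size_t)1 << TABLE_BITS)

/* Hash a pointer by its identity: skip the lowest 4 bits, which are
 * all zero for 16-byte-aligned allocations, and use the next
 * TABLE_BITS bits as the bucket index. */
static inline size_t ptr_hash(const void *p)
{
    uintptr_t u = (uintptr_t)p;
    return (size_t)(u >> 4) & (TABLE_SIZE - 1);
}
```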
I believe that the 4cc dictionaries hit upon the key insight of hashing: calculating a numeric key from an object, which then directly identifies a small search bucket.
The Four Corner Code abandons semantics like radicals. Codes are assigned according to certain stroke patterns in the four quadrants of the character, without regard for their semantic role. The inventors hit a key insight there: that any way of calculating a hash code is valid as long as it can be followed consistently and leads to short searches. The function can look at meaningless fragments of the object (exactly as when we take the middle bits of a pointer). A character's etymology need not play any role in how it is digested. Whereas in the radical methods, you have to know that, for instance, 火 and 灬 both mean "fire" and are understood as the same radical #86. So in some sense, predecessor methods like radical indexing may have been almost-hashing. It's hard to argue that 4cc isn't.
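In data-structure terms, a 4cc dictionary behaves like a hash table with 10,000 buckets whose hash function is "read a digit off each corner". A rough C sketch of that lookup structure, with the caveat that the corner-digit classification is a procedure the human reader performs, so the digits are taken as input here, and the entry type and chaining are my own hypothetical framing:

```c
#include <string.h>

/* One chained bucket per four-digit corner code. Characters whose
 * corners happen to yield the same digits collide into the same short
 * list, which is scanned linearly -- just as a reader scans the few
 * characters printed under one code in the dictionary. */
struct entry {
    const char *character;    /* UTF-8 encoded character */
    struct entry *next;       /* same code, possibly unrelated character */
};

#define NBUCKETS 10000
static struct entry *buckets[NBUCKETS];

/* Corners in reading order: upper-left, upper-right, lower-left,
 * lower-right; each digit 0..9 names a stroke shape. */
static unsigned fourcc(int ul, int ur, int ll, int lr)
{
    return (unsigned)(ul * 1000 + ur * 100 + ll * 10 + lr);
}

static const struct entry *lookup(unsigned code, const char *ch)
{
    for (const struct entry *e = buckets[code]; e; e = e->next)
        if (strcmp(e->character, ch) == 0)
            return e;
    return NULL;   /* short chain exhausted: character absent */
}
```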
> A poor distribution is an obvious bug in hashing; if you don't suffer from that bug, you don't have to do anything.
Right, but if you never had that problem to solve, then what you have made isn't a hash table. Often you don't need a hash table: if you have keys that already have a nice distribution, you can use a simpler data structure (like, IDK, a radix tree) and get all the properties you wanted.
> The inventors hit a key insight there: that any way of calculating a hash code is valid as long as it can be consistently followed, and leads to short searches.
If they did, then I would agree you're right. But do we know that they did? Or might they have seen it as just a different way of considering radicals? (E.g. did they ever try indexing anything else that way, not just characters?)
Note that a radix tree and a hash table are not mutually exclusive. A radix tree is one way of representing a sparse table, and a sparse table can serve as the storage for a hash table.
There's a trade-off there, though: if the table is very sparse and we're using hashing anyway, we could just shrink the table so that it isn't so sparse, and make it a regular array.
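For concreteness, a minimal sketch of the radix-tree-as-sparse-table idea in C, assuming 16-bit keys (hashed or not) split into two 8-bit radix digits; second-level pages are allocated only when some key lands in their range, so empty regions of the table cost nothing:

```c
#include <stdint.h>
#include <stdlib.h>

#define RADIX_BITS 8
#define FANOUT (1u << RADIX_BITS)

/* Two-level radix structure over a 16-bit key space: a sparse table
 * that only materializes the 256-slot pages actually in use. Feed it
 * hashed keys and it acts as a hash table's backing store; feed it
 * keys that are already well distributed and no hashing is needed. */
struct page { void *slot[FANOUT]; };
static struct page *root[FANOUT];

/* Return the address of the slot for this key, allocating its page
 * on first touch; returns NULL only if allocation fails. */
static void **sparse_slot(uint16_t key)
{
    unsigned hi = key >> RADIX_BITS;
    unsigned lo = key & (FANOUT - 1);
    if (!root[hi])
        root[hi] = calloc(1, sizeof *root[hi]);
    return root[hi] ? &root[hi]->slot[lo] : NULL;
}
```

If shrinking the table makes the keys dense, the second level collapses into a handful of fully used pages and the structure is effectively the regular array described above, which is exactly the trade-off.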
The key aspect of the four corner code is that it mashes together completely unrelated characters. There's no meaning to the index itself. It's not easy to look at a four corner code and figure out which characters it aliases.