I suppose this isn't a bad introduction to hash tables, but I would assume that most HN readers know how to implement a hash table. It seems like talking about fundamental CS concepts in.. Python is a perennial source of quick and easy blog posts for a lot of people.
IMO a better thing to focus on in Python would be how heavily optimized the Python dict implementation is, as compared with other hash table implementations out there. There used to be a good article online by Tim Peters about this, but I haven't been able to find it.
Something strikes me as out of place with this kind of post. On one hand, I want to encourage people to write posts about common concepts, but on the other hand, this is a fairly commonly written about concept, posted by the author, on a site with significant numbers of ads. I guess this seems like some people might have started to use HN as a way to click farm or something.
Trivial posts have a great introductory value. Especially in Python I imagine there to be a lot of people using hash tables without actually knowing how they work, or maybe even never having heard of hashes.
Of course it's arguable whether such a post belongs here, but a bunch of other introductory ones have been popular here before.
If you're interested in implementing a hash table I would assume you're also generally interested in data structures and algorithms. I'd highly suggest the Algorithms, Part 1 class on Coursera. It's totally free and offered through Princeton. It covers hash tables in the last segment, albeit implemented in Java, but still very thorough and is a great introduction to common data structures/algorithms and analysis, like linked lists, union find, trees, quick sort and time/space complexity. https://www.coursera.org/learn/algorithms-part1
I've ignored math prereqs my entire life and turned out fine. If you're genuinely interested, it's not too hard to learn the math as needed. It might actually be a better way to learn, because you're learning is always in service of other learning.
I haven't taken that Coursera course, but in college, I believe only calc I and II are the prereqs for intro to data structures and algorithms. Though I know solid engineers who understand applied CS who don't have a good grasp of math at all.
> It might actually be a better way to learn, because you're learning is always in service of other learning.
Me in freshman calculus: Why are we learning these Taylor series? It's so boring, and I'll never need to use it
Me three years later in upper-division meteorology: It turns out 90% of what we do is numerical methods because fluid dynamics is hard with limited data.
It's definitely easier for me to learn things when I have an application for it.
Well, if you know hash tables you will still need some education on the special quirks of python hash tables. They are in no way related to CS hash tables. And you should not base your own hash table on the python architecture.
They are sorted. Normal hash tables are unsorted, sorting needs lot of time and space. Nobody but python would do that. (ignoring PHP here). But once you went down this path, users start relying on this quirks and you cannot get away from that.
They feature special values for ints. 1 hashes to 1, 4 to 4 and so on. This looks like Lua arrays, which are either arrays or hashtables under the hood, and allows efficient switches from dense to sparse arrays. But mostly it costs time in the most critical fast path. And it doesn't help in security at all.
They are insecure by default. Only with some special cmdline flag they feature a randomized seed. Which is still insecure because the seed can be easily exposed by reading the memory of the seed.
They are extremely slow.
They are using siphash. Everybody using siphash is immediately exposed as having no idea about hash table security.
> They are sorted. Normal hash tables are unsorted, sorting needs lot of time and space. Nobody but python would do that. But once you went down this path, users start relying on this quirks and you cannot get away from that.
They most definitely aren't sorted. It's not a quirk either. Python dicts are essentially a tiny hashtable of pointers into an array of objects. For iteration, that array is traversed directly. That's where the insertion-order preservation comes from. It's cheap. It doesn't involve sorting.
> They are insecure by default. Only with some special cmdline flag they feature a randomized seed.
Hashes are randomized since Python 3.2 or so.
> Which is still insecure because the seed can be easily exposed by reading the memory of the seed.
Obviously.
> They feature special values for ints. 1 hashes to 1, 4 to 4 and do on.
That's a quirk of the hash function (which is a reduction modulo a prime), not the hashtable.
> Everybody using siphash is immediately exposed as having no idea about hash table security.
Ruby Hashes are insertion-ordered; JavaScript objects have numeric properties ordered numerically (so, in a sense, are sorted), but other propertied in insertion order.
Also, might be worth noting that the "The Pythonic Implementation of Python Hash Tables" section seems to cover either Python 3.6 or 3.7, and that the implementation seems to have changed a bit in 3.8: now `sys.getsizeof({}) == 64` and `sys.getsizeof({"a": 100}) == 232`. Does anyone have the low-down on how it's changed and possibly how it's likely to change again in future versions?
I tried to fix it, let me know if it’s broken again... It’s not super easy to check by myself because I'm not at home and I'm working with my phone... :D
This does introduce hash functions, list MD5 and other cryptographic hash functions as examples, and then go on to explain how hash functions are used in hash tables.
Not advocating for MD5, but it does strongly imply the use of cryptographic hash functions for hash tables, which is almost never the case, and a useful distinction to explain.
Cryptographic hash functions are trying to make the original data unrecoverable in any way, and essentially distributing as evenly throughout their output space as possible. They are also often trying to be slow to compute to prevent brute forcing.
On the other hand, hashing for a hash table doesn’t need to be secure in the same way (hence Python’s smaller bits just returning themselves as their hashed values). They also don’t need to distribute themselves evenly. They need to be fast, unlike a cryptographic hash, and they need to restrict the data to a fixed size.
Yes, maybe I haven't been clear, I tried just to explain what an hash is before introducing hash tables, if someone understood that cryptographic hash functions have to be used for hash table it's probably because my explanation wasn't clear enough... :(
Consider that I'm not a native speaker, so sometimes it's hard for me to send the exact message I would like to send... I will try to explain it better as soon as I will get anywhere I can use a computer :)
I'm really sorry for the ads guys, but this is a small site that lives thanks to them... I wish I could avoid them and live just with donations but it isn't possible yet.
Moreover, I don't know anything about SEO or stuff like that so the ads are configured automatically (I went to the AdSense dashboard and clicked on the ”do whatever you want” button!!! :D )
However, I will try to limit the ads in the next few days understanding with the AdSense reports which one are not bringing earnings and can be removed.
On mobile the ads are a bit too close to be completely comfortable, but they're definitely not overly annoying either.
The article is a nice introduction, the rest of the website looks great and you don't seem to have included any of the modern stupid annoyances that will make me hate you (in-page popups, notification request, app install banners, chat bubbles).