Hacker News new | past | comments | ask | show | jobs | submit login
Python Hash Tables: understanding dictionaries (thepythoncorner.com)
83 points by mastro35 on Aug 23, 2020 | hide | past | favorite | 37 comments



I suppose this isn't a bad introduction to hash tables, but I would assume that most HN readers know how to implement a hash table. It seems like talking about fundamental CS concepts in.. Python is a perennial source of quick and easy blog posts for a lot of people.


IMO a better thing to focus on in Python would be how heavily optimized the Python dict implementation is, as compared with other hash table implementations out there. There used to be a good article online by Tim Peters about this, but I haven't been able to find it.


Raymond Hettinger designed the latest implementation, more or less. He's given some talks about it.


Something strikes me as out of place with this kind of post. On one hand, I want to encourage people to write posts about common concepts, but on the other hand, this is a fairly commonly written about concept, posted by the author, on a site with significant numbers of ads. I guess this seems like some people might have started to use HN as a way to click farm or something.


The author has a lot of very similar and very trivial posts. I agree that it seems sketchy.


Trivial posts have a great introductory value. Especially in Python I imagine there to be a lot of people using hash tables without actually knowing how they work, or maybe even never having heard of hashes.

Of course it's arguable whether such a post belongs here, but a bunch of other introductory ones have been popular here before.


Some are trivial, some a little less... I'm trying to find the best format for the site


Anecdata: I don’t know how to implement a hash table, but I know how to use them. Not everyone here has a CS education background.


If you're interested in implementing a hash table I would assume you're also generally interested in data structures and algorithms. I'd highly suggest the Algorithms, Part 1 class on Coursera. It's totally free and offered through Princeton. It covers hash tables in the last segment, albeit implemented in Java, but still very thorough and is a great introduction to common data structures/algorithms and analysis, like linked lists, union find, trees, quick sort and time/space complexity. https://www.coursera.org/learn/algorithms-part1


What level of math does one need to understand to take this course?


I've ignored math prereqs my entire life and turned out fine. If you're genuinely interested, it's not too hard to learn the math as needed. It might actually be a better way to learn, because you're learning is always in service of other learning.

I haven't taken that Coursera course, but in college, I believe only calc I and II are the prereqs for intro to data structures and algorithms. Though I know solid engineers who understand applied CS who don't have a good grasp of math at all.


> It might actually be a better way to learn, because you're learning is always in service of other learning.

Me in freshman calculus: Why are we learning these Taylor series? It's so boring, and I'll never need to use it

Me three years later in upper-division meteorology: It turns out 90% of what we do is numerical methods because fluid dynamics is hard with limited data.

It's definitely easier for me to learn things when I have an application for it.


The math is isn't advanced, any high schooler would've encountered it. Brush up on the basics of exponents and logarithms and you'll be fine.


For the data structure & algorithm courses I've seen, basically none.


Erik Demaine is the man

https://www.youtube.com/watch?v=0M_kIqhwbFo

I hate all algorithms courses except this one, really made me fall in love with algorithms and data structures .

If you want to go deeper, this is the next one on the topic

https://www.youtube.com/watch?v=rvdJDijO2Ro


Well, if you know hash tables you will still need some education on the special quirks of python hash tables. They are in no way related to CS hash tables. And you should not base your own hash table on the python architecture.

They are sorted. Normal hash tables are unsorted, sorting needs lot of time and space. Nobody but python would do that. (ignoring PHP here). But once you went down this path, users start relying on this quirks and you cannot get away from that.

They feature special values for ints. 1 hashes to 1, 4 to 4 and so on. This looks like Lua arrays, which are either arrays or hashtables under the hood, and allows efficient switches from dense to sparse arrays. But mostly it costs time in the most critical fast path. And it doesn't help in security at all.

They are insecure by default. Only with some special cmdline flag they feature a randomized seed. Which is still insecure because the seed can be easily exposed by reading the memory of the seed.

They are extremely slow.

They are using siphash. Everybody using siphash is immediately exposed as having no idea about hash table security.


> They are sorted. Normal hash tables are unsorted, sorting needs lot of time and space. Nobody but python would do that. But once you went down this path, users start relying on this quirks and you cannot get away from that.

They most definitely aren't sorted. It's not a quirk either. Python dicts are essentially a tiny hashtable of pointers into an array of objects. For iteration, that array is traversed directly. That's where the insertion-order preservation comes from. It's cheap. It doesn't involve sorting.

> They are insecure by default. Only with some special cmdline flag they feature a randomized seed.

Hashes are randomized since Python 3.2 or so.

> Which is still insecure because the seed can be easily exposed by reading the memory of the seed.

Obviously.

> They feature special values for ints. 1 hashes to 1, 4 to 4 and do on.

That's a quirk of the hash function (which is a reduction modulo a prime), not the hashtable.

> Everybody using siphash is immediately exposed as having no idea about hash table security.

Well...


> They are sorted.

No, they are insertion-ordered.

> Nobody but python would do that.

Ruby Hashes are insertion-ordered; JavaScript objects have numeric properties ordered numerically (so, in a sense, are sorted), but other propertied in insertion order.


Raymond Hettinger, Modern Python Dictionaries A confluence of a dozen great ideas: https://www.youtube.com/watch?v=npw4s1QTmPg


FYI, the formatting of several of the Python code blocks is messed up, starting with lines like this:

```python linenums=”21” def get_value(self, input_key):

Also, might be worth noting that the "The Pythonic Implementation of Python Hash Tables" section seems to cover either Python 3.6 or 3.7, and that the implementation seems to have changed a bit in 3.8: now `sys.getsizeof({}) == 64` and `sys.getsizeof({"a": 100}) == 232`. Does anyone have the low-down on how it's changed and possibly how it's likely to change again in future versions?


I tried to fix it, let me know if it’s broken again... It’s not super easy to check by myself because I'm not at home and I'm working with my phone... :D


Formatting of several of the code segments is borked.

It should be mentioned that a Python set type is basically a naked hash table, or a dictionary with all keys and no data.

As noted, in 3.6 dicts are ordered, but in 3.8 they can be reversed(); and in 3.9 the new merge and update operators are added:

https://docs.python.org/3.9/whatsnew/3.9.html#dictionary-mer...


> It should be mentioned that a Python set type is basically a naked hash table, or a dictionary with all keys and no data.

Python's set and dict are actually completely different codebases. Notably, the naturally ordered hash table of 3.6 dicts is not used by or for sets.


In case anyone is curious, sets were dicts in the original implementation: https://github.com/python/cpython/blob/a690a9967e715663b7a42...

It was changed in 2005 for python 2.5 https://github.com/python/cpython/commit/9f1a6796eb83a2884df...


> a Python set type is basically a naked hash table, or a dictionary with all keys and no data.

Not exactly. Dicts, as you say, are ordered - but sets are not.


So not the best article and advocates for md5? Also full of ads. Why is this getting traction?


This is not advocating for md5. :/


This does introduce hash functions, list MD5 and other cryptographic hash functions as examples, and then go on to explain how hash functions are used in hash tables.

Not advocating for MD5, but it does strongly imply the use of cryptographic hash functions for hash tables, which is almost never the case, and a useful distinction to explain.

Cryptographic hash functions are trying to make the original data unrecoverable in any way, and essentially distributing as evenly throughout their output space as possible. They are also often trying to be slow to compute to prevent brute forcing.

On the other hand, hashing for a hash table doesn’t need to be secure in the same way (hence Python’s smaller bits just returning themselves as their hashed values). They also don’t need to distribute themselves evenly. They need to be fast, unlike a cryptographic hash, and they need to restrict the data to a fixed size.


Yes, maybe I haven't been clear, I tried just to explain what an hash is before introducing hash tables, if someone understood that cryptographic hash functions have to be used for hash table it's probably because my explanation wasn't clear enough... :(

Consider that I'm not a native speaker, so sometimes it's hard for me to send the exact message I would like to send... I will try to explain it better as soon as I will get anywhere I can use a computer :)

Thanks!


What’s wrong with MD5 for non-cryptographic hashes?


Everything is wrong with MD5. Insecure, extremely slow, bad statistical properties. https://github.com/rurban/smhasher


I'm really sorry for the ads guys, but this is a small site that lives thanks to them... I wish I could avoid them and live just with donations but it isn't possible yet.

Moreover, I don't know anything about SEO or stuff like that so the ads are configured automatically (I went to the AdSense dashboard and clicked on the ”do whatever you want” button!!! :D )

However, I will try to limit the ads in the next few days understanding with the AdSense reports which one are not bringing earnings and can be removed.

Thanks for the feedbacks!


On mobile the ads are a bit too close to be completely comfortable, but they're definitely not overly annoying either.

The article is a nice introduction, the rest of the website looks great and you don't seem to have included any of the modern stupid annoyances that will make me hate you (in-page popups, notification request, app install banners, chat bubbles).

Thanks for the article!


Ugh. Full of ads and full of inaccuracies.

To learn about Python’s dictionaries, I’d recommend watching the talks by Brandon Rhodes and Raymond Hettinger.


Yes, Raymond Hettinger’s talk is really great! I've specified at the end of the article that a great part of it is taken from him.


Any talks you recommend in particular?


The talks by Brandon Rhodes are these two:

* The Mighty Dictionary (PyCon 2010): https://www.youtube.com/watch?v=oMyy4Sm0uBs

* The Dictionary Even Mightier (PyCon 2017): https://www.youtube.com/watch?v=66P5FMkWoVU

There's also one by Raymond Hettinger, which seems to have been given first at a meetup and then, in shorter form, at PyCon 2017:

* Modern Dictionaries: https://www.youtube.com/watch?v=p33CVV29OG8

* Modern Python Dictionaries - A confluence of a dozen great ideas (PyCon 2017): https://www.youtube.com/watch?v=npw4s1QTmPg




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: