Uh? The post very clearly says: "I’m using Python as pseudocode; pretend I used your favorite low-level language". The goal is to show what's possible when you do your best; of course I'm not going to use Python for anything beyond demonstrating ideas.
> But you didn't actually prove a general case, you proved "for a good optimising rust compiler" didn't you?
Again, that was never my goal. I chose Rust because I stumbled upon this problem while working on a Rust library. I could've chosen C++, or C, or maybe even Go -- the result would've been the same, and I checked the codegen to make sure of that.
> I could take the lesson: for less than 10m things, on modern hardware with more than 4 cores and more than 4GB stop worrying and just code in the idiom which works for you.
The number of cores and RAM capacity have nothing to do with this. It's all about how well data fits in cache, and "less than 10m things" are likely to fit in L3 anyway. If your main takeaway from "here's how to process large data" was "I don't need to worry about this for small data", well, I don't know what to say.
Large and small are so contextual. I'm processing 350m events/day in 24h splits, and I managed to stop worrying about locality of reference because I'm the sole occupant of the machine. When I did worry about it, I found that a radix tree, an awk hash, and a perl/python hash/dict all occupied much the same space and time, but a tuned C implementation got 2-3x faster than any of them. Somebody else pointed out that keeping most of this memory-resident would be faster still, but then you have to process 24h of things against a single in-memory instance, which means buying into IPC to get the data "into" that memory.
It interested me that you didn't show the idea in Rust. That was the only point I was making: Python as pseudocode for thinking through ideas in a document is fine with me.
But overall, I liked your outcome. I just think it's important to remember that large and small are very contextual. Your large case looks to me to be >5m things, and for an awful lot of people doing stupid things, 5m is bigger than they'll ever see. If the target was only people who routinely deal in hundreds of millions of things, then sure.
> The number of cores and RAM capacity has nothing to do with this. It's all about how well data fits in cache, and "less than 10m things" are likely to fit in L3 anyway.
What matters is locality, since that allows prefetch to mask latency. If you have that, you are in a good place even if your data does not fit in the L3 cache. What you demonstrated is the benefit locality gives through its effect on prefetch. Fitting in L3 cache helps, but not as much as prefetch does. If you do not believe me, compare a random access pattern over data that fits in L3 cache against a sequential access pattern over the same data. The sequential pattern will win every time, because L3 cache is relatively slow and prefetch masks that latency.
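To make that experiment concrete, here's a rough Rust sketch of what I mean (the buffer size, pass count, and the little xorshift PRNG are arbitrary choices on my part, not anything from the article): sum the same ~4 MB buffer, once walking it through sequential indices and once through shuffled ones.

```rust
use std::hint::black_box;
use std::time::Instant;

// Tiny xorshift PRNG so the sketch has no external dependencies.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    // ~4 MB of u64s: small enough to sit in most L3 caches.
    let n: usize = 512 * 1024;
    let data: Vec<u64> = (0..n as u64).collect();

    // Two visit orders over the same data: sequential and shuffled (Fisher-Yates).
    let sequential: Vec<usize> = (0..n).collect();
    let mut shuffled = sequential.clone();
    let mut state = 0x9E3779B97F4A7C15u64;
    for i in (1..n).rev() {
        let j = (xorshift(&mut state) % (i as u64 + 1)) as usize;
        shuffled.swap(i, j);
    }

    // Sum the buffer in the given order, repeated so the timing isn't all noise.
    let time_sum = |order: &[usize]| {
        let start = Instant::now();
        let mut sum = 0u64;
        for _ in 0..100 {
            for &i in order {
                sum = sum.wrapping_add(data[i]);
            }
        }
        black_box(sum);
        start.elapsed()
    };

    println!("sequential: {:?}", time_sum(&sequential));
    println!("random:     {:?}", time_sum(&shuffled));
}
```

Both loops do the same per-element work (one index load plus one data load); only the order of the data accesses differs, which is exactly the thing prefetch can or cannot exploit.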
I have seen BIOS options for disabling prefetch and for disabling cache (although I only remember the cache-disable option on ancient systems). If you could find a machine with one, you could run some experiments to see which matters more.