Apparently the challenge explicitly allows for warming up the OS cache on a prior run, so O_DIRECT would be at a disadvantage here, as the working set fits in memory.
Edit: also, in my very simple tests a good hashmap can beat linear search already at fewer than 10 items. It depends a lot on the cost of the hash function.
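For anyone curious, this is roughly the kind of toy micro-benchmark I mean (not rigorous; the key names, sizes and iteration counts are arbitrary, and real results depend heavily on key length and the hash function used):

    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    int main() {
        const int kKeys = 8;
        const int kIters = 2000000;
        std::vector<std::pair<std::string, int>> linear;  // linear-scan "map"
        std::unordered_map<std::string, int> hashed;      // hash-based map
        std::vector<std::string> queries;
        for (int i = 0; i < kKeys; ++i) {
            std::string k = "station_" + std::to_string(i);
            linear.emplace_back(k, i);
            hashed.emplace(k, i);
            queries.push_back(k);
        }

        auto bench = [&](const char* name, auto&& lookup) {
            auto t0 = std::chrono::steady_clock::now();
            long long sum = 0;
            for (int i = 0; i < kIters; ++i) sum += lookup(queries[i % kKeys]);
            double ms = std::chrono::duration<double, std::milli>(
                            std::chrono::steady_clock::now() - t0).count();
            std::printf("%s: %.1f ms (sum=%lld)\n", name, ms, sum);
        };

        bench("linear scan", [&](const std::string& k) {
            for (auto& [key, v] : linear)  // a handful of string compares
                if (key == k) return v;
            return -1;
        });
        bench("unordered_map", [&](const std::string& k) {
            return hashed.at(k);           // one hash, then compare within a bucket
        });
    }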
I agree that this method will produce a better estimate of expected mean real-world performance under certain load conditions, but I still contend it muddies the waters about which solution is in fact the fastest.
Just modified the original post to add the file_fdw. Again, none of the instances (PG or ClickHouse) were optimised for the workload https://ftisiot.net/posts/1brows/
<<
Just clarified this in the README:
* Input value ranges are as follows:
- Station name: non null UTF-8 string of min length 1 character and max length 100 characters
- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit
>>
If we assume worst-case entries: -99.9;<100 char station name>
That is 106 chars plus a newline (1 or 2 chars), so up to 108 bytes per row, and with one billion rows that could be 108 GB of data! And I want to be picky here: the README says "UTF-8" and "max length 100 characters". We could start a holy war arguing about what a character even means in UTF-8. Let's be generous and assume it means one (8-bit) byte, as in C. I have good faith in the original author!
Some Googling tells me that the current fastest PCIe Gen5 NVMe drive is the Crucial T700, with a max sequential read of 12,400 MB/s. That is about 8.7 seconds to read 108 GB of data. A few other comments mentioned that the sample file is only 12 GB, so that is about 1 second, and it can be 100% cached by the Linux kernel for the 2nd and later reads. However, if the test file is larger than RAM (say, more than 32 GB), you are definitely screwed on file system caching. You will have to pay a multi-second "I/O tax" to read the file each time. This seems to far outweigh any computation time.
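A minimal sketch of that back-of-the-envelope arithmetic, using only the numbers above (GB taken as 1000 MB):

    #include <cstdio>

    int main() {
        // Assumed inputs from the comment above.
        const double worst_case_mb = 108000.0;  // 108 bytes/row * 1e9 rows
        const double sample_mb     = 12000.0;   // ~12 GB sample file
        const double seq_read_mbps = 12400.0;   // Crucial T700 max sequential read

        std::printf("worst case: %.1f s\n", worst_case_mb / seq_read_mbps);  // ~8.7 s
        std::printf("sample:     %.1f s\n", sample_mb / seq_read_mbps);      // ~1.0 s
    }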
Thoughts?
Last, I still think this is a great challenge. I think I just found my next interview question. It is easy to understand and to program a basic implementation of, and further discussion can get fractally complex about the myriad of optimizations that could be applied.
No trolling: can you imagine if someone tries this exercise in Python? There are a goofy number of optimizations for dealing with huge data in that language, thanks to SciPy, Pandas, and other friends. I am sure someone can write an insanely performant solution. (I know, I know: the original challenge says no external dependencies.)
Yeah, I had a discussion on Twitter about this. It turns out 12 GB is small enough to fit into memory, and the author evaluates submissions by running a solution 5 times in a row, so using direct I/O actually hurts: the kernel page cache is what ensures the file is already in memory for the 4 runs after the first. I have a direct I/O solution with SIMD string search and double parsing, just in C++ (using libraries). It runs in 6 seconds on my 24-core Linux box (NVMe).
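Not my actual code, but the direct I/O read loop is roughly this shape (Linux-specific; O_DIRECT requires aligned buffers and read sizes, and 4096 bytes is a safe alignment on most devices):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE 1   // needed for O_DIRECT
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        const size_t kBufSize = 1 << 20;              // 1 MiB per read, multiple of 4096
        int fd = open(argv[1], O_RDONLY | O_DIRECT);  // bypass the page cache
        if (fd < 0) { perror("open"); return 1; }

        void* buf = nullptr;
        if (posix_memalign(&buf, 4096, kBufSize) != 0) return 1;

        size_t total = 0;
        for (;;) {
            ssize_t n = read(fd, buf, kBufSize);      // hand each chunk to the parser here
            if (n <= 0) break;                        // 0 = EOF, <0 = error
            total += (size_t)n;
        }
        // A real solution also handles the unaligned tail of the file,
        // e.g. by falling back to a normal buffered read for the last chunk.
        std::printf("read %zu bytes\n", total);
        free(buf);
        close(fd);
        return 0;
    }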
Disk access can certainly be parallelized, and NVMe is blazing fast, so the bottleneck is more the CPU than the disk. There are systems built around modern hardware that recognize this (redpanda.com, where I work, is one such example).
Parsing is a lot of the compute time, and SIMD-style tricks like SWAR for finding delimiters can be helpful. StringZilla is a cool library if you want to see a clean implementation of these algorithms: https://github.com/ashvardanian/StringZilla
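For the curious, the usual SWAR trick for spotting a byte in a 64-bit word looks something like this (a sketch, assuming little-endian and GCC/Clang builtins; StringZilla and friends do something considerably more polished):

    #include <cstdint>
    #include <cstring>

    // Returns the index of the first ';' among the 8 bytes at p, or -1 if none.
    // Classic SWAR zero-byte trick: XOR makes matching bytes zero, then
    // (x - 0x01..) & ~x & 0x80.. flags them. False positives can only appear
    // above the first real match, so counting trailing zeros is safe.
    static inline int find_semicolon(const char* p) {
        uint64_t w;
        std::memcpy(&w, p, sizeof w);                 // unaligned-safe 8-byte load
        uint64_t x = w ^ 0x3B3B3B3B3B3B3B3BULL;       // ';' == 0x3B
        uint64_t hit = (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
        return hit ? (int)(__builtin_ctzll(hit) >> 3) : -1;
    }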
Maintaining sum+count is more expensive than you'd imagine, because to maintain the sum you need to convert the numbers from ASCII to decimal, and to do that you at least need to know the memory alignment of the number and the position of the decimal point.
All of that is far more expensive than the state-machine approach, which is pretty much "alignment doesn't matter, ASCII doesn't matter, we're just going to handle all possible states that could end up in a 32-bit register".
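For reference, the straightforward scalar ASCII-to-tenths conversion the parent is comparing against would look roughly like this, relying on the challenge's guarantee of exactly one fractional digit (a sketch, not anyone's actual submission):

    #include <cstdint>

    // Parses a 1BRC temperature ("-99.9" .. "99.9", always exactly one fractional
    // digit) into tenths of a degree, e.g. "-12.3" -> -123.
    // p points at the first character after the ';'; returns a pointer past '\n'.
    static inline const char* parse_tenths(const char* p, int* out) {
        int sign = 1;
        if (*p == '-') { sign = -1; ++p; }
        int value = *p++ - '0';                            // first integer digit
        if (*p != '.') value = value * 10 + (*p++ - '0');  // optional second digit
        ++p;                                               // skip the '.'
        value = value * 10 + (*p++ - '0');                 // the fractional digit
        *out = sign * value;
        return p + 1;                                      // skip the '\n'
    }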
I believe the whole thing can be done in 0.3 seconds with the following approach:
(Describing only the 'happy path' here - other paths can be made fast too, but will require different implementations)
* Since temperatures have only one decimal place, we have a finite number of temperatures. ~400 temperatures will cover all common cases.
* We also have a finite number of place names. (~400)
* Just make a lookup table of all temps and all place names (~400 x ~400 ≈ 160,000 combinations).
* Autogenerate a state machine that maps each of the above 160,000 things, at any rotation within a 4-byte register, to a unique bin in a hash table. The state machine has one 32-bit state register (16 bits to output, 16 to carry over to the next cycle); every cycle, a lookup in the state transition table is done and the next 4 bytes of data are XOR'ed on top.
* Run through all the data, at RAM speed, incrementing counters for each state the machine ends up in (there will only be 65k; a rough sketch of this inner loop is below the list). These counters fit fully in cache.
* With just 32 bits of state, AVX512 would let us run 512 copies of this in parallel per core if we like. I.e., compute will not be the bottleneck.
* From the values of the counters, you can calculate the answers. 65k is a far smaller number than 1 billion, so you don't need to do this bit fast.
* For anything that doesn't map to a valid bin (higher/lower temperatures, unknown place names), just fall back to slow code (one state of the state machine can be reserved for "escape to slow code"; use these escapes for min/max handling too, since it will only happen a few thousand times).
* I think this approach can operate at RAM speed with just one core with AVX512, so no benefit in splitting across cores.
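Not the parent's actual design, but a minimal sketch of how I imagine the per-4-byte step could look (the bit layout and the zero-filled tables here are placeholders; the real transition table would be autogenerated offline from the ~160k strings):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical layout: low 16 bits of `state` index the transition table,
    // high 16 bits select the counter to bump this cycle.
    static std::vector<uint32_t> transition(1 << 16);  // would be generated offline
    static std::vector<uint32_t> counters(1 << 16);    // 65k counters, fits in cache

    void consume(const char* data, size_t len) {
        uint32_t state = 0;
        for (size_t i = 0; i + 4 <= len; i += 4) {
            uint32_t next4;
            std::memcpy(&next4, data + i, 4);            // next 4 input bytes
            state = transition[state & 0xFFFF] ^ next4;  // table lookup, XOR data on top
            ++counters[state >> 16];                     // tally the state we landed in
        }
        // Tail bytes and the "escape to slow code" states are omitted here.
    }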