The One Billion Row Challenge (morling.dev)
482 points by madmax108 10 months ago | 350 comments



As far as I can see, the currently best-performing solution [0] does not account for hash collisions and therefore probably produces wrong results if enough different cities are in the dataset. Or am I missing something?

[0] https://github.com/gunnarmorling/1brc/blob/main/src/main/jav...


Yes, you are right. This came up yesterday and indeed two solutions were violating the "must work for all station names" rule by relying on specific hash functions optimized for the specific data set, which I unfortunately missed during evaluation. I've just removed these entries from the leaderboard for the time being. Both authors are reworking their submissions and then they'll be added back.

[0] https://twitter.com/mtopolnik/status/1742652716919251052


Have you considered having two datasets? The "development" dataset which is public, and the "competition" dataset which is private?

That might help against algorithms that are (accidentally, or on purpose) tuned to the specific dataset.


Yeah if you are releasing the competition dataset then it can and will 100% be gamed. What is to stop me from just hardcoding and printing the result without any computation?


Your code is submitted (per parent comment)


Sure, but who is manually checking every entry?


You don't need to check every entry, just the top few.


The next obvious solution is to have something that works fast for non-colliding station names, but then falls back to another (slow) implementation for colliding station names.

It's very cheap to detect station name collisions - just sample ~10,000 points in the file, and hope to find at least one of each station. If you find fewer stations than the full run does, you haven't yet seen them all; keep hunting. If you find two that collide, fall back to the slow method. Checking 0.001% of the dataset is cheap.


But how would you detect that a station name is colliding or not colliding? With a hash set?


Search for cuckoo hashing. It's a whole thing for data structures.


Cuckoo doesn't aid in detecting collisions; the algorithm is about what happens IF a collision is found. The whole reason we're hashing in the first place is to not have to linearly compare station names when indexing.

In other words: cuckoo hashing is a strategy for reacting to two values mapping to the same hash. But how do you know whether you have two different values, or two of the same, if they have an identical hash?


It seems like you have to run the slow collision-resistant thing over the whole file.


Collision detection is far cheaper. One method:

Simply add up all the bytes of all seen place names while running your fast algorithm. This is effectively a checksum of all bytes of place-name data.

Then at the end, calculate the 'correct' name-sum (which can be done cheaply). If it doesn't match, a collision occurred.
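
A minimal sketch of that checksum idea, assuming the fast path keeps one bucket per hashed name with the stored name bytes and a count (all names here are illustrative, not from any actual submission):

    final class CollisionCheck {
        long runningNameByteSum = 0;                 // every name byte seen on the fast path

        // called once per row with the raw station-name bytes of that row
        void onRow(byte[] nameBytes, int len) {
            for (int i = 0; i < len; i++) runningNameByteSum += nameBytes[i] & 0xFF;
        }

        // at the end: recompute the expected sum from the few hundred buckets and
        // their counts; a mismatch means two different names landed in one bucket
        boolean collided(Bucket[] buckets) {
            long expected = 0;
            for (Bucket b : buckets) {
                if (b == null) continue;
                long perName = 0;
                for (byte x : b.name) perName += x & 0xFF;
                expected += perName * b.count;
            }
            return expected != runningNameByteSum;
        }

        static final class Bucket { byte[] name; long count; }
    }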


You can design inputs that defeat this scheme though. I thought the point was to be correct for all valid inputs.


What are the constraints on station names - i.e. min length, max length, maximum number of different names?


Just clarified this in the README:

* Input value ranges are as follows:

- Station name: non null UTF-8 string of min length 1 character and max length 100 characters

- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit

* Implementations must not rely on specifics of a given data set, e.g. any valid station name as per the constraints above and any data distribution (number of measurements per station) must be supported


The README says max length 100 bytes, which I suppose we can (?) assume are octets. Also, it mentions that you can assume the station string does not contain the separator ';'.

I guess the station string is also supposed to be free of control characters like newlines, though spaces are allowed. This, however, is not stated.


I believe the whole thing can be done in 0.3 seconds with the following approach:

(Describing only the 'happy path' here - other paths can be made fast too, but will require different implementations)

* Since temperatures are given to only one decimal place, there is a finite number of temperatures. ~400 temperatures will cover all common cases.

* We also have a finite number of place names. (~400)

* Just make a lookup table of all temps and all place names. (~160,000)

* autogenerate a state machine that will map each of the above 160,000 things, at any rotation within a 4 byte register, to a unique bin in a hash table. The state machine will have one 32 bit state register (16 bits to output, 16 to carry over to the next cycle) where every cycle a lookup in the state transition table is done, and the next 4 bytes of data XOR'ed on top.

* run through all the data, at RAM speed, incrementing counters for each state the machine ends up in. (there will only be 65k). These counters fit fully in cache.

* With just 32 bits of state, with AVX512 we can be running 512 copies of this in parallel if we like, per core! AKA, compute will not be the bottleneck.

* from the values of the counters, you can calculate the answers. 65k is a far smaller number than 1 billion, so you don't need to do this bit fast.

* for anything that doesn't map to a valid bin (higher/lower temperatures, unknown place names), just fallback to slow code (one state of the state machine can be reserved for 'escape to slow code'. Use these escapes for min/max handling too, since it will only happen a few thousand times).

* I think this approach can operate at RAM speed with just one core with AVX512, so no benefit in splitting across cores.


You don't need the lookup table. All you are asked for is min/mean/max, which can all be computed in one pass without storing the data. All you need is a hash table with 400 entries and 3 floats (running min, mean and max) and an int (count, for updating the running mean). That's just 16 bytes; if you use 16 bytes for the name of the station you can fit everything under 16K.

IO will dominate the running time for this, and JSON parsing will be second.
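
For illustration, a minimal sketch of that single-pass aggregation (class and field names are made up, not from any submission):

    import java.util.HashMap;
    import java.util.Map;

    final class Aggregator {
        static final class Stats {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum = 0;
            long count = 0;
            void add(double t) { if (t < min) min = t; if (t > max) max = t; sum += t; count++; }
            double mean() { return sum / count; }
        }

        private final Map<String, Stats> byStation = new HashMap<>();

        // called once per parsed line; no raw measurements are ever stored
        void record(String station, double temperature) {
            byStation.computeIfAbsent(station, k -> new Stats()).add(temperature);
        }
    }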


I don't have a CS background and when I eventually had to do interviews for Google, "Calculate mean/median/mode of temperatures" was the interview question I went with, intending to avoid BS leetcode*

I always worried it was too easy, but I'm heartened by how many comments miss the insight I always looked for and you named: you don't need to store a single thing.

I do wonder if it'll work here, at least as simply as the tracking vars you mention. With so many rows, and the implicit spirit of "you should be able to handle _all_ data", overflow might become a legitimate concern - e.g. we might not be able to track the mean as simply as maintaining sum + count vars.

* "oh sorry times up, but the right answer is a red black self balancing terenary tree recursive breadth-first search", ironically, was one of my Apple interviews.


In general, median and mode are much harder than min/avg/max. You can't compute the former with constant memory in one pass (you can do approximate median, but not exact median).

(Here there is a restricted range for the temperature with only 1999 possible values (-99.9 to 99.9 in 0.1 increments), so you could do it in constant memory - something like 4*1999 bytes, about 8 KB, per unique place name.)
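
A rough sketch of that fixed-range trick, assuming temperatures have already been parsed as integer tenths of a degree (the class and its names are illustrative, not from any submission):

    final class MedianCounter {
        private final int[] counts = new int[1999];   // index 0 = -99.9, index 1998 = 99.9
        private long total = 0;

        void add(int tenths) {                        // tenths in [-999, 999]
            counts[tenths + 999]++;
            total++;
        }

        double median() {                             // lower median for even counts, for brevity
            long mid = (total - 1) / 2;
            long seen = 0;
            for (int i = 0; i < counts.length; i++) {
                seen += counts[i];
                if (seen > mid) return (i - 999) / 10.0;
            }
            throw new IllegalStateException("no data");
        }
    }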

For the sum, overflow is not an issue if you use 64-bit integers. Parse everything to integers in tenths of a degree and even if all 1 billion rows are 99.9 for the same place name (worst possible case), the total is only about 10^12, far below the ~9.2*10^18 limit of a signed 64-bit integer.


Putting this on top comment because I can't edit:

I am silly and wrote 'mode' and shouldn't have :P (wetware error: saw list of 3 items corresponding to leetcode and temperature dataset, my 3 were min/max/average, their 3 are mean/median/mode)


Maintaining sum+count is more expensive than you'd imagine... Because to maintain the sum, you need to convert the numbers from ascii to decimal. And to do that you at least need to know the memory alignment of the number and the position of the decimal point.

All far more expensive than the state machine approach, which is pretty much "alignment doesn't matter, ascii doesn't matter, we're gonna just handle all possible states that could end up in a 32 bit register".


Streaming calculation of the exact median with no storage at all is non-trivial at best in the general case, and I'm not aware of any way to calculate the mode at all. Any pointers to the methods you used in your interview answers?

If you came up with them on the fly, then... well, sign here for your bonus. Can you start Monday?


> Streaming calculation of the exact median with no storage at all is non-trivial at best in the general case

It's not "non-trivial" it's impossible. I'm not sure why people think median can be approximated at all. You need to look at every data point and store a counter for the lesser of: (a) all possible values, or (b) all elements. Just consider a data set with 1 million ones and 999,999 zeroes. Throwing away (or failing to look at) one single number can give you an error of 50%, or 100% for two numbers. If you want to make it really nasty, throw in another million random 64-bit floats between [-0.1, 0) and a million between (1, 1.1]. Four million elements, two million of them unique, in the range [-0.1, 1.1], and failing to account for two elements can get your answer wrong by +/- 1.0.

Unless you start by sorting the list, but I wouldn't call that a "streaming" calculation.


Yeah, that's why I hedged so heavily; AFAIK it's impossible to compute a streaming median (having spent some time trying).

If someone on HN knows how to do it, they will jump in and tell me exactly why it's "trivial," and I'll get some nice R&D work for free. Of course, it will probably involve mounting an FTP account, spinning up a CA and running rsync...


> If someone on HN knows how to do it, they will jump in and tell me exactly why it's "trivial,"

n.b. to you and your parent comment's commenter: you're the only two who used the word trivial in the entire comments section, modulo another comment not in this thread that contains "SQL in theory makes this trivial...". The fantasy of an HNer building a streaming median function over a weekend might not come to fruition :(


It certainly is possible to approximate the median: https://aakinshin.net/posts/p2-quantile-estimator/

A quick search shows that Boost.Accumulators have several median estimation implementations available to choose from: https://www.boost.org/doc/libs/1_84_0/doc/html/accumulators/...

> Median estimation based on the P^2 quantile estimator, the density estimator, or the P^2 cumulative distribution estimator.


No, it is not possible to approximate the median in the general case. The linked paper does no such thing. Their algorithm can be completely defeated by carefully tailored inputs. They only ever test them against very well-behaved random distributions.

But it brings up an important point: real data that you actually encounter is not a random sampling from "the general case". It is often possible to do a pretty good job approximating the median from real-world data, if you have some understanding of the distribution of that data. You just have to accept the fact that you might be totally wrong, if the data behaves in ways you don't expect. But of course, the more you have to know about your data before running the algorithm, the less useful the algorithm itself is.

The difference is whether you can make guarantees or not. Most algorithm design is concerned about such guarantees: on all possible inputs, this algorithm has at worst such-and-such performance. Hence the reliance on Big-O. You cannot ever make such guarantees on a "median approximator" without specifying some sort of precondition on the inputs.

What guarantees do we want to make? With "an estimator", it'd be nice to say something like: we'd like our approximation to get better given more values. That is: the more values we see, the less the next single value should be able to change our approximation. If you've looked at a billion values, all between [min, max], it'd be nice if you knew that looking at the next one or two values could only have an effect of at most 1 / f(1 billion) for some monotonically increasing f. Median does not have that property: looking at just two more values (even still within the range of [min, max]) could move your final answer all the way from max to min. If you stopped 2 data points earlier, your answer would be as wrong as possible. This remains true, for some inputs, even if you've looked at 10^10^10^10^300 values. The next 2 might change your answer by 100%.


Yeah, you can do it in O(n) time, but you can't do it at all in a streaming fashion. I think you just plain need O(n) memory.


I am silly and wrote 'mode' and shouldn't have :P (wetware error: saw list of 3 items corresponding to leetcode and temperature dataset, my 3 were min/max/average, their 3 are mean/median/mode)


Merely finding the start/end of each line will use more computation (in the inner loop) than the approach I outlined. Let alone converting the number from ascii to a float, or looking up the place name in a hash table (oh, and the name is variable length, so you're gonna need to find how long the name is first).


I deleted two previous comments because I realized I misunderstood your proposal. I understand it better now, but I am still confused about something...

Your state machine would need at least 2*160,000 states (you need an extra bit to flag whether you have reached a newline in the last word and need to increment a counter or not), correct? And you are assuming the input is 4 bytes, so won't your transition table need (2^32)*2*160,000 ~ 10^15 entries (at least 3 bytes each)?


The states don't need to map 1:1 with cities or temperatures. They merely need to encode all information collected so far which is still relevant. They also don't need to represent all possible situations - anything that is super rare (eg. temperature of 95C) can simply be diverted to a special "invalid" state which triggers regular code to take over for those few entries.


Hmm, still doesn't seem feasible. Even if you only have 256 "relevant" states (which I think you'll agree is far less than what you need) then given a 32-bit input your state transition table is 2^32*256 = 1 Terabyte.

You could shrink your input size to 2 bytes but then you can't work on a word at a time, and for a realistic number of relevant states your transition table is still way bigger than you can fit in even L3 cache.

Unless I am missing something very basic, this doesn't seem like a viable approach.


> IO will dominate the running time for this, and JSON parsing will be second.

Memory bandwidth might dominate, but probably not I/O. The input file is ~12GB, the machine being used for the test has 32GB, and the fastest and slowest of five runs are discarded. The slowest run will usually be the first run (if the file is not already cached in memory), after which there should be little or no file I/O.


Is there a way to validate from the app whether all file pages are cached in memory?

What if the code was run under constraints such as a max memory limit in a Docker container?


s/JSON/string/


I took a few shots at this in rust (following on from https://www.reddit.com/r/rust/comments/18ws370/optimizing_a_...), and every single time the bottleneck was the string parsing; everything else was largely irrelevant.

You can do this whole thing in about 24 seconds, as long as you're smart about how you chunk up the text and leverage threading. https://github.com/coriolinus/1brc/blob/main/src/main.rs


Interesting, I guess that makes sense.

Having a quick look at your code, couple of thoughts:

   - You shouldn't bother with parsing and validating UTF-8. Just pretend it's ASCII. Non ASCII characters are only going to show up in the station name anyway, and all you are doing with it is hashing it and copying it.

   - You are first chopping the file into line chunks and then parsing the line. You can do it in one go: just look at each byte until you hit a semicolon, and compute a running hash byte by byte. You can also parse the number into an int (ignoring the decimal point) using custom code and be faster than the generic float parser (see the sketch after this list).

   - If instead of reading the file using the standard library you mmap it, that should also speed things up a bit.
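
A hedged sketch of the second point - splitting on ';' with a running hash and parsing the temperature as an integer number of tenths (names are illustrative; real submissions differ):

    final class LineScanner {
        interface Sink { void accept(long stationHash, int tenths); }

        // parses one "<station>;<temp>\n" line starting at pos; returns the index
        // of the first byte of the next line
        static int parseLine(byte[] buf, int pos, Sink sink) {
            long hash = 0;
            while (buf[pos] != ';') {                 // hash the station name as we go
                hash = 31 * hash + (buf[pos] & 0xFF);
                pos++;
            }
            pos++;                                    // skip ';'
            boolean negative = buf[pos] == '-';
            if (negative) pos++;
            int tenths = 0;
            while (buf[pos] != '\n') {                // "12.3" -> 123, skipping the '.'
                if (buf[pos] != '.') tenths = tenths * 10 + (buf[pos] - '0');
                pos++;
            }
            sink.accept(hash, negative ? -tenths : tenths);
            return pos + 1;
        }
    }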


> I think this approach can operate at RAM speed with just one core with AVX512, so no benefit in splitting across cores.

A single core can't saturate RAM bandwidth. Cores are limited by memory parallelism and latency. Most modern x86 server chips can retire 2 SIMD loads per clock cycle (i.e. 32GB/s @ 1GHz using AVX2), so AVX-512 is not necessary to max out per-core bandwidth - but if you're loading from DRAM you'll likely cap out much, much earlier (closer to 10-16GB/s on commodity servers).

As long as the majority of your data spills into RAM, you'll see a massive throughput cliff on a single core. With large streaming use-cases, you'll almost always benefit from multi-core parallelism.

It's very easy to test: allocate a large block of memory (ideally orders of magnitude larger than your L3 cache), pre-fault it, so you're not stalling on page faults, and do an unrolled vector load (AVX2/AVX-512) in a tight loop.
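
A rough Java stand-in for that test (no AVX intrinsics here, just a plain loop over an array far larger than L3, which HotSpot may or may not auto-vectorize - treat the numbers as a ballpark only):

    final class BandwidthProbe {
        public static void main(String[] args) {
            long[] data = new long[1 << 28];                         // 2 GiB, well past any L3
            for (int i = 0; i < data.length; i++) data[i] = i;       // pre-fault the pages

            long checksum = 0;
            long start = System.nanoTime();
            for (int i = 0; i < data.length; i++) checksum += data[i];
            double seconds = (System.nanoTime() - start) / 1e9;

            double gib = data.length * 8.0 / (1 << 30);
            System.out.printf("%.2f GiB/s (checksum %d)%n", gib / seconds, checksum);
        }
    }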


> A single core can't saturate RAM bandwidth.

I was also intrigued by the same statement and was wondering about it as well. For AMD I don't know but on Intel I believe that each L1 cache-miss is going to be routed through the line-fill-buffers (LFB) first before going to L2/L3. Since this LFB is only 10-12 entries large, this number practically becomes the theoretical limit on the number of (load) memory operations each core can issue at a time.

> Most modern x86 server chips can retire 2 SIMD loads per clock cycle (i.e. 32GB/s @ 1GHz using AVX2),

I think in general yes, but it appears that the cores on Alder Lake and Sapphire Rapids have 3 SIMD load ports, so they can execute up to 3 AVX2 _mm256_load_si256 loads per clock cycle.


The goal is that none of the intermediate data spills - the only user of RAM is reading the input data.

But for that, you're right, the workload must be split across as many cores as necessary to keep DRAM maxed out or you're throwing away performance. Luckily this problem is easily split.


> With just 32 bits of state, with AVX512 we can be running 512 copies of this in parallel if we like, per core! AKA, compute will not be the bottleneck.

How can the state machine be run in parallel, when the next state always has a dependency on the previous state?

Also, how exactly would the state register be decoded? After you XOR it with 4 bytes of input, it could be practically any of the 4.3 billion possible values, in the case of an unexpected place name.

And even for expected place names longer than 4 bytes, wouldn't they need several states each, to be properly distinguished from other names with a common prefix?


I would ask for clarification about the rules - it isn't clear to me whether code that is special cased for the known 400 place names (even if it supports additional names via a slower path) would be considered valid:

> Q: Can I make assumptions on the names of the weather stations showing up in the data set?

> A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no ; character).


The code wouldn't be special cased... it would run through the first 10,000 or so rows to discover the station names and generate the state transition table. If any invalid entry is found later, you'll end up in an invalid state, and can detect that case and fall back to slow code (or even, theoretically, update the state transition table and return to the fast path).


Even simpler. Just have a Dictionary mapping place name to ID. The slow path is "I haven't seen this name before and I need to store it, making sure that I'm not racing any other thread doing the same."

You'd have an upper limit on how many place names you can have. But no hard coding of what they are.
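
A minimal sketch of that dictionary idea, with the thread-safety concern handled by the map itself (names and structure are illustrative only):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    final class StationIds {
        private final ConcurrentHashMap<String, Integer> ids = new ConcurrentHashMap<>();
        private final AtomicInteger next = new AtomicInteger();

        int idFor(String name) {
            Integer id = ids.get(name);            // fast path: name already seen
            if (id != null) return id;
            // slow path: possibly racing other threads, resolved by computeIfAbsent
            return ids.computeIfAbsent(name, k -> next.getAndIncrement());
        }
    }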


You have to read and parse the entire file to find the place names.


> autogenerate a state machine that will map each of the above 160,000 things, at any rotation within a 4 byte register, to a unique bin in a hash table. The state machine will have one 32 bit state register (16 bits to output, 16 to carry over to the next cycle) where every cycle a lookup in the state transition table is done, and the next 4 bytes of data XOR'ed on top.

You are basically describing a perfect hash, right?


nearly but not quite - you can deliberately 'collide' when there is no data you need to keep going forward (the obvious example being when you end on a '\r' character). This is necessary to keep an upper bound on the state space.

Remember that in the extreme, a state could mean "city A, temp zero, city b" - "0\rB" so it isn't a direct mapping from state machine state to city.


Where does your ~160,000 come from? There's a billion rows, couldn't there be a different place on every row giving a billion place names?


There are only ~400 place names in the generated data. 160,000 = 400*400

The state machine generation would need to know all 400 - but it's easy enough to scan the first few hundred thousand rows to find them all, and have a fallback in case a name is seen that has never been seen before. (The fallback would be done by having the state machine jump to an 'invalid' state; at the end you check that that invalid state's counter is zero, and if it isn't, you redo everything with slow code.)


> There are only ~400 place names in the generated data

The README says the maximum number of unique place names is 10 000, so you should probably design for the fast case to handle that many instead of just going by what is in the database that it happens to be measuring with at the moment.


This code is only fast in the 'happy' case (ie. ~400 unique places, with no odd distributions, temperatures with a std dev of 10 from a mean between 0 and 40). It is still correct in the 'unusual' cases, but would be slow because it would revert to fallback code.


Go for it!


You nerd sniper you!

But more seriously, the JVM's support for vector intrinsics is very basic right now, and I think I'd spend far more time battling the JVM to output the code I want it to output than is fun. Java just isn't the right language if you need to superoptimize stuff.

Theoretically all of the above is super simple SIMD stuff, but I have a suspicion that SIMD scatter/gather (needed for the state lookup table) isn't implemented.
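
For what it's worth, the incubating Vector API (jdk.incubator.vector, needs --add-modules) can at least express the delimiter-search part of such string processing; whether it compiles to anything fast, and whether scatter/gather for a state table is usable, is another question. A hedged sketch:

    import jdk.incubator.vector.ByteVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    final class DelimiterSearch {
        private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

        // index of the next ';' at or after 'from', or -1 if none
        static int indexOfSemicolon(byte[] buf, int from) {
            int i = from;
            for (; i + SPECIES.length() <= buf.length; i += SPECIES.length()) {
                ByteVector v = ByteVector.fromArray(SPECIES, buf, i);
                VectorMask<Byte> hits = v.eq((byte) ';');
                if (hits.anyTrue()) return i + hits.firstTrue();
            }
            for (; i < buf.length; i++) if (buf[i] == ';') return i;   // scalar tail
            return -1;
        }
    }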


If you really want to nerd snipe, write an optimized version in a non-JVM language to compare to the fastest Java one. But that's kind of boring anyway, we've seen this play out a thousand times.

I still appreciate the intention of seeing how far we can squeeze performance out of modern Java+JVM, though. Too bad very few folks have the freedom to apply that at their day jobs.


It sounds like you're unfamiliar with vectorized string processing tasks, so you could surely learn a lot by working on an implementation of the proposed approach. I think it's fine to make it in C++ and then port to Java.

For more such tasks you could go to highload.fun.


> run through all the data, at RAM speed,

Is the data given in RAM or do you have to wait for spinning rust I/O for some GB amount of data?


The test procedure is to run the code 5 times. Even if it isn't in cache the first time, it will be in the OS cache (ie. RAM) for the 2nd-5th runs. The test machine has 32GB RAM, while the data file is only 12GB, so it ought to be fully cached.


> The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

I think it's better to discard the two slowest, or simply accept the fastest as the result. There's (in my opinion) no good reason to discard the best runs.


This is a pretty standard measure called the Trimmed Mean: https://statisticsbyjim.com/basics/trimmed-mean/


Variability in software runtime arises mostly from other software running on the same system.

If you are looking for a real-world, whole-system benchmark (like a database or app server), then taking the average makes sense.

If you are benchmarking an individual algorithm or program and its optimisations, then taking the fastest run makes sense - that was the run with least external interference. The only exception might be if you want to benchmark with cold caches, but then you need to reset these carefully between runs as well.


For performance benchmarking, the minimal runtime is typically the best estimator if the computations are identical, because it measures perf without interrupts.

If the language is garbage collected, or if the test is randomized you obviously don't want to look at the minimum.


> the minimal runtime is typically the best estimator

Depends what you’re estimating. The minimum is usually not representative of “real world” performance, which is why we use measures of central tendency over many runs for performance benchmarks.


There are good reasons to discard the best runs. If you think of your system as having predictable behavior that's slowed down by background processing happening on the machine, it might make sense to use the best run. But if there are any intrinsic sources of non-determinism in the program (and it's far more likely than you might think), taking the best time is likely to be unrepresentative.

https://tratt.net/laurie/blog/2019/minimum_times_tend_to_mis... is a good piece on the subject.


If you don't think discarding the fastest is acceptable then why are you in favor of discarding the slowest?


Because there can be outliers in terms of slowness; perhaps the CPU gets a hiccup, the GC runs an abnormal number of extra times, the kernel decides to reschedule stuff, whatever - I don't know much about these things.

There can't be outliers in terms of fastness; the CPU doesn't accidentally run the program much faster.

But then again, what the hell do I know...


It's the same logic both ways. The OS runs 10 background tasks on average at any given time. On one of your runs it was doing 15 things, while on another it was doing 5 things. They are both outliers.


The rule lawyer in me wants to spend the first run spinning up a background daemon that loads everything into memory, pins it there, and maybe even prefetches everything into cache as the subsequent runs perform basically a linear scan (you never have to pagemiss if you have an oracle!).

> write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station

Depending on how far you want to stretch it, doesn’t precomputing the result on 1st run count too? Or pre-parsing the numbers into a compact format you can just slurp directly into rolling sums for subsequent runs.

Not the slightest in the spirit of the competition, but not against the rules as far as I can tell.

Edit: if we don't like pre-computation, we can still play with fancy out-of-the-box tricks like pre-sorting the input, pre-parsing, compacting, aligning everything, etc.

Edit 2: while we're here, why not just patch the calculate_time script to return 0 seconds :) And return 9999 for competitors, for good measure


There is a real issue with providing the contestant the real exact file that will be used in the contest. Because there are 1e9 shades of gray between hard coding the correct answer in a single line program which doesn't even read the input, and processing the file as if we don't know what it contains.

This might become a contest of judging what is fair pre-computing and what is not.

That's why machine learning contests don't let participants see the final data.


Fix:

A run consists of:

  * I generate random measurements with the provided script, which will be stored in a readonly file.

  * I run your program to calculate the results (and time it).

  * I run my baseline program and your entry is valid if my baseline program calculates the same results.


The filesystem has to be read-only as well, otherwise the first run can simply write the results to ~/.me_cheater and the program first reads that file before attempting to read the dataset.


I think that would violate this rule

> The computation must happen at application runtime, i.e. you cannot process the measurements file at build time (for instance, when using GraalVM) and just bake the result into the binary


Yeah, and I'd count caching results between runs into the same category, i.e. against the spirit of this challenge. Folks should keep in mind that the goal is to learn something new, rather than "winning" by cheating the rules.


Though the caching done by the OS (I/O for instance) is fair game


Yes, exactly. Retrieving the source data file from the page cache != storing and reloading precomputed results.


Ah, I didn't actually follow the link to the github repo. I don't think that precludes pre-processing the input into something that needs no parsing, however, or spawning a background daemon that prefetches everything in the right order.

The first run is indeed runtime, not build time, so technically I think I can still get away with pre-computation, sending that to the daemon (or just stashing it somewhere in /tmp, or, even crazier, patching the jarfile), and just printing it back out from the second run onwards.


> I don't think that precludes pre-processing the input into something that needs no parsing

Why not preprocess it into a file that contains the answer? (-: My read of the description was just: you're meant to read the file as part of the run.


I think the rules should stipulate each run is on its own tmpfs and all processes and pagecaches are cleared between runs.


Isn't this simply bound by the speed of the disk? Surely none of the suggested optimizations (SIMD, multi-threading) are relevant.

It would also depend on how many different stations there are and what they are for the hash lookup, but seriously I doubt this will be anything measurable compared to I/O.


A few things:

Disk access can certainly be parallelized, and NVMe is blazing fast, so the bottleneck is more the CPU than the disk. There are systems built around modern hardware that realize this (redpanda.com, where I work, is one such example)

Parsing is a lot of the compute time and some SIMD tricks like SWAR for finding delimiters can be helpful. Stringzilla is a cool library if you want to see a clean implementation of these algorithms: https://github.com/ashvardanian/StringZilla

See my reply here about the file being cached completely in memory after the first run: https://news.ycombinator.com/item?id=38864034


I would hope that any reasonably performant implementation would be faster not only than NVMe, but also faster than CPU to RAM data transfers.

The AMD EPYC-Milan in the test server supports memory reads at 150 gigabytes/sec, but that's a 32 core machine, and our test only gets 8 of those cores, so we probably can't expect more than 37 gigabytes per second of read bandwidth.

The total file is ~12 gigabytes, so we should expect to be able to get the task done in 0.3 seconds or so.

There are only ~400 weather stations, so all the results fit in cache. It's unclear whether it would be faster to avoid converting ascii to a float - it might pay to add up all the ascii numbers and only figure out how to convert to a float once you have the total.


> any reasonably performant implementation would be faster not only than NVMe

Saturating NVMe bandwidth is wicked hard if you don't know what you're doing. Most people think they're I/O bound because profiling hinted at some read() function somewhere, when in fact they're CPU bound in a runtime layer they don't understand.


Can you elaborate on the runtime layers and how to affect them?


> The total file is ~12 gigabytes, so we should expect to be able to get the task done in 0.3 seconds or so.

How? EPYC-Milan features PCIe 4.0 and theoretical maximum throughput of x4 sequential read is ~8GB/s.

In bare-metal machines, this figure is rather closer to 5-6GB/s, and in cloud environments such as CCX33, I imagine it could be even less.

So, unless I am missing something you'd need ~2 seconds only to read the contents of the ~12GB file.


This assumes the file is already in page cache, and therefore already in RAM. The 0.3 seconds is the time to get it from RAM to the CPU L3 cache.


That sounds like cheating. The exercise says to process a large file that's on disk. If the data is in RAM then that's a whole different game.


The benchmark is run 5 times without flushing file system caches. If you mmap it, it will still be in memory between process runs. The author intended this FWIW


That sounds like a benchmarking flaw.

But then, writing good benchmarks is an art in itself.


If the cache is not flushed, it will still be in memory between runs whether you mmap it or not.


That depends a lot on the OS. I would not count on a 13 GB file being cached in RAM after having been read through once.


Isn’t it run multiple times and the slowest is dropped? Why would you expect the file to be evicted from ram so quickly?


That makes sense now, thanks.


Are you sure that would give the same result? The first thing that struck me about this challenge is that there's going to be floating point precision issues.


All the data is only to one decimal place. You can shift the decimal place and do the whole thing with ints and get a 100% precise result.


You can, but will it be the same result? :) binary floating point can't represent 0.1 or 0.2 exactly.

To be fair, with only additions, and only a billion of them, you're probably fine working in floating point and rounding to one decimal in the end.
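
The standard two-line demonstration of the difference, for anyone who wants to see it:

    public class FloatDrift {
        public static void main(String[] args) {
            System.out.println(0.1 + 0.2);        // 0.30000000000000004 in binary doubles
            System.out.println((1 + 2) / 10.0);   // 0.3 when summing in integer tenths
        }
    }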


Just x10 it and count in integers?


I said that works here. The point is that floating point may give a different answer.


It looks like all existing solutions are too slow by a huge factor then :)


Totally depends on your workload and hardware. I have a normal consumer SSD in my desktop and it can easily sustain 7GB/s (56gbps) as long as I only use 700GB of the 2TB available. A normal server has enough PCIe lanes for 15 SSDs like mine so a normal server's IO bandwidth is comparable to its memory bandwidth. More expensive servers have faster lanes (PCIe 5.0) and more of those lanes.

Anyway the OP's file only has 1B rows so it'll be ~1GB compressed which fits in memory after the first discarded run. IO bandwidth shouldn't matter in this scenario:

> All submissions will be evaluated by running the program on a Hetzner Cloud CCX33 instance (8 dedicated vCPU, 32 GB RAM). The time program is used for measuring execution times, i.e. end-to-end times are measured. Each contender will be run five times in a row. The slowest and the fastest runs are discarded

Edit: The github repo says it's 12 GB uncompressed. That confirms that IO bandwidth doesn't matter


Daniel Lemire has an interesting talk on this: https://www.youtube.com/watch?v=wlvKAT7SZIQ. The key takeaway is that the disk is rarely the bottleneck.


It's an old bit of wisdom that's hard to stamp out. That's partly because anything that you don't understand below you in the runtime stack looks like "I/O" during profiling, so the misunderstanding perpetuates even among people who attempt to take a closer look, but don't know enough.


> Isn't this simply bound by the speed of the disk?

Depends on the OS and filesystem being used. The input file is ~12GB, and it's being run 5 times on a machine with 32GB, so after the first run the file could be cached entirely in memory. For example, Linux using ext2 would probably cache the entire file after the first run, but using ZFS it probably would not.


Clearly the fastest way to parse this is to load it all into RAM, and work backwards from the end. That has the advantage of putting the digits in ascending order, and then the delimiter, then the string... until EOF or NL is encountered.


I agree. From here: https://news.ycombinator.com/item?id=38865942

<< Just clarified this in the README: * Input value ranges are as follows:

- Station name: non null UTF-8 string of min length 1 character and max length 100 characters

- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit >>

If we assume worst case entries: -99.9;<100 char station name>

That is 106 chars plus newline (1 or 2 chars). That could be 108 GB of data! And, I want to be so picky here: the README says "UTF-8" and "max length 100 characters". Eh, we could start a holy war here arguing about what a character means in UTF-8. Let's be generous and assume it means one (8-bit) byte (like C). I have good faith in the original author!

Some Googling tells me that the current fastest PCIe Gen5 NVMe drive is Crucial T700 w/ Sequential read (max., MB/s) @ 12,400. That is 8.7 seconds to read 108 GB of data. A few other comments mentioned that the sample file is only 12 GB, so that is 1 second, and 100% can be cached by Linux kernel for 2nd and later reads. However, if the test file is greater than 32 GB, you are definitely screwed for file system caching. You will have to pay a multi-second "I/O tax" to read the file each time. This seems to far outweigh any computation time.

Thoughts?

Last, I still think this is a great challenge. I think I just found my next interview question. This question is easy to understand and program a basic implementation. Further discussion can be fractally complex about the myriad of optimizations that could be applied.

No trolling: Can you imagine if someone tries this exercise in Python? There are a goofy amount of optimizations to deal with huge data in that language thanks to SciPy, Pandas and other friends. I am sure someone can write an insanely performant solution. (I know, I know: The original challenge says no external dependencies.)


Yes, this is an extremely trivial problem. Anybody who knows how to program in more than one language is going to find this silly. awk or perl would finish it before jit compilation gets started.


It's a trivial problem to implement, but not trivial to implement faster than others. Awk won't beat solutions that are clever about multiple read queues and parallel processing.


Bit of an update, the record is now 8 seconds in java.

I have tried 2 naive awk implementations and both took 10x the basic java implementation btw (I'm sure it too can be optimized).


Just stick to Java


Please try it out in awk. It would be interesting to see whether awk can do this in a few seconds.


> Q: Can I make assumptions on the names of the weather stations showing up in the data set?

> A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no `;` character).

I'm unsure if it's intentional or not, but this essentially means that the submission should be correct for all inputs, but can and probably should be tuned for the particular input generated by `create_measurements.sh`. I can imagine submissions with a perfect hash function tuned for the given set of stations, for example.


Given this requirement, they would be wise to have the test data be different from the example data. That would prevent overfitting optimizations.


I'm not sure that it is overfitting to optimize your code for the test set. The requirement is just that you don't break compatibility for arbitrary names in the process.

This sort of thing shows up all the time in the real world - where, for example, 90% of traffic will hit one endpoint. Or 90% of a database is one specific table. Discovering and microoptimizing for the common case is an important skill.


That's why I was so unsure. I think, though, that it is generally better to have the input generator seeded and not disclose the chosen seed in advance, which prevents overfitting to a single input and yet encourages overfitting to a much bigger class of inputs.


That invites probing submissions to figure out a good hash for all names from that one seed.

I think having the names for performance tests public is fine. But there should be correctness tests on sets with random names.


Of course, I assume names are also randomly generated. But it might be the case that they are all ASCII-only, which can be exploited.


There should be no optimizations which rely on a priori knowledge of the dataset. I.e. if you calculate a perfect hash function at runtime by inspecting the dataset, that's fine (not sure whether it's beneficial), but hard coding it is not.


That is a tricky rule to enforce. For example, some solutions assume that the temperature values fit into an int, which could be interpreted as relying on a priori knowledge.


That's true. Some assumptions always have to be made.

In this case, we can assume place names that can be arbitrary strings and temperatures that occur naturally on earth. Optimizing around those constraints should be fine. We could probably limit the string length to something like 100 characters or so.

Bad assumptions would for example be to assume that all possible place names are found within the example data.


Somehow I doubt any weather station would register temperatures above 2.14 billion degrees


UTF-8 makes this far harder... But if you stick to the letter but not spirit of the rules, you can simply fall back to a slow implementation if you ever detect any byte > 127 (indicating a multibyte UTF-8 character).


Only if you validate the UTF-8 as being valid. If you just accept that it is you can treat it as just some bytes. Nothing in the spec I see requires processing UTF-8 as an actual Unicode string.

The easiest way to handle Unicode is to not handle it at all, and just shove it down the line. This is often even correct, as long as you don't need to do any string operations on it.

If the author wanted to play Unicode games, requiring normalization would be the way to go, but that would turn this into a very different challenge.


Since the tail of the line has a known format, I guess we are rescued by the fact that the last 0x3B is the semicolon, as the rest is just a decimal number. We can't know the first 0x3B byte is the semicolon since the place names are only guaranteed to not contain 0x3B but can contain 0x013B. So a parser should start from the rear of the line, read the number up to the semicolon, and then treat the place name as byte soup. Had two places shared a line, this challenge would have required real UTF parsing and been much harder.


It's easier than you think... utf-8 guarantees that all bytes of a multi-byte character have the high bit set. 0x3B (semicolon) does not have the high bit set. Therefore 0x3B is guaranteed to be your separator.

The same logic applies to newline - therefore, you can jump into the middle of the file anywhere and guarantee to be able to synchronize.
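
That makes chunk resynchronization a plain byte scan; a tiny sketch of the idea:

    final class ChunkSync {
        // Because every byte of a multi-byte UTF-8 sequence has the high bit set,
        // a raw 0x0A byte can only be a real newline, so after jumping to an
        // arbitrary offset we just scan forward to the next line boundary.
        static int nextLineStart(byte[] buf, int pos) {
            while (buf[pos] != '\n') pos++;
            return pos + 1;
        }
    }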


I'm not sure scanning backwards helps here. Running in reverse you still need to look for a newline while scanning over a UTF-8 string, which might plausibly contain a newline byte.

I'm no UTF-8 guru, but I think you might be able to use this sort of springboard for skipping over multi-byte codepoints, since as far as I understand the upper bits of the first byte encode the length:

    int lead = val & 0xFF; // treat the lead byte as unsigned

    if ((lead & 0xF8) == 0xF0) { // 4 byte codepoint
      // ignore 3 continuation bytes
    }
    else if ((lead & 0xF0) == 0xE0) { // 3 byte codepoint
      // ignore 2 continuation bytes
    }
    else if ((lead & 0xE0) == 0xC0) { // 2 byte codepoint
      // ignore 1 continuation byte
    }


Yes, sorry, I'm so used to this sort of thing with the work I've been doing lately I forgot to lay it out like that. Thank you for filling it in for me.


Just for fun. Speed testing awk vs Java.

  awk -F';' '{
      station = $1
      temperature = $2 + 0
      sum[station] += temperature
      count[station]++
      if (temperature < min[station] || count[station] == 1) {
          min[station] = temperature
      }
      if (temperature > max[station] || count[station] == 1) {
          max[station] = temperature
      }
  }
  END {
      n = asorti(count, sorted)
      printf "{"
      for (i = 1; i <= n; i++) {
          s = sorted[i]
          printf "%s=%.1f/%.1f/%.1f%s", s, min[s], sum[s] / count[s], max[s], (i < n ? ", " : "")
      }
      printf "}\n"
  }' measurement.txt


I'd like to see it speed tested against an instance of Postgres using the file Foreign Data Wrapper https://www.postgresql.org/docs/current/file-fdw.html

    CREATE EXTENSION file_fdw;
    CREATE SERVER stations FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE records (
      station_name text,
      temperature float
    ) SERVER stations OPTIONS (filename 'path/to/file.csv', format 'csv', delimiter ';');
    SELECT station_name, MIN(temperature) AS temp_min, AVG(temperature) AS temp_mean, MAX(temperature) AS temp_max
    FROM records
    GROUP BY station_name
    ORDER BY station_name;


Just modified the original post to add the file_fdw. Again, none of the instances (PG or ClickHouse) were optimised for the workload https://ftisiot.net/posts/1brows/


>The result is 44.465s!

Very nice! Is this time for the first run or second? Is there a big difference between the first and second run, please?


First run! I just re-ran the experiment: ~46 secs on the first run, ~22 secs on the following runs.


Not including this in the benchmark time is cheating:

    \copy TEST(CITY, TEMPERATURE) FROM 'measurements.txt' DELIMITER ';' CSV;


The loading time is included in both examples.

In the first one, the table is dropped, recreated, populated and queried. In the second example, the table is created from a file FDW over the CSV file. In both examples the loading time is included in the total time.


For some reason I'm impressed that the previous page in the documentation is https://www.postgresql.org/docs/current/earthdistance.html. Lotta fun stuff in Postgres these days.


Man, Postgres is so cool and powerful.


Using FDWs on a daily basis I fully realize their power and appeal but at this exclamation I paused and thought - is this really how we think today? That reading a CSV file directly is a cool feature, state of the art? Sure, FDWs are much more than that, but I would assume we could achieve much more with Machine Learning, and not even just the current wave of LLMs.

Why not have the machine consider the data it is currently seeing (type, even actual values), think about what end-to-end operation is required, how often it needs to be repeated, make a time estimate (then verify the estimate, change it for the next run if needed, keep a history for future needs), choose one of methods it has at its disposal (index autogeneration, conversion of raw data, denormalization, efficient allocation of memory hierarchy, ...). Yeah, I'm not focusing on this specific one billion rows challenge but rather what computers today should be able to do for us.


Using ClickHouse local:

    time clickhouse local -q "SELECT concat('{', arrayStringConcat(groupArray(v), ', '), '}')
    FROM
    (
        SELECT concat(station, '=', min(t), '/', max(t), '/', avg(t)) AS v
        FROM file('measurements.txt', 'CSV', 'station String, t Float32')
        GROUP BY station
        ORDER BY station ASC
    )
    SETTINGS format_csv_delimiter = ';', max_threads = 8" >/dev/null

    real 0m15.201s
    user 2m16.124s
    sys 0m2.351s

Most of the time is spent parsing the file


Your sum variable could get pretty large; consider using a streaming mean update, something like: new_mean = ((n * old_mean) + temp) / (n + 1)


Had this same thought as well. Definitely comes up when calculating variance this way in streaming data.

https://stats.stackexchange.com/a/235151/1036

I am not sure if `n*old_mean` is a good idea. Welford's is typically something like this inside the loop:

count += 1; delta = current - mean; mean += delta / count
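
A minimal sketch of that update (names illustrative):

    final class RunningMean {
        private long count = 0;
        private double mean = 0.0;

        void add(double value) {
            count++;
            double delta = value - mean;
            mean += delta / count;     // no large running sum is ever kept
        }

        double mean() { return mean; }
    }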


Interesting challenge; shame it's only Java. Can't wait till people start hand rolling their own JVM bytecode.


Check out the discussion[0], looks like there are submissions in several languages. Go, Rust, Python, and C++, to name a few

[0] https://github.com/gunnarmorling/1brc/discussions


It looks like the problem is dominated by reading in the data file. Some fast solutions just read the whole file into memory.


It would be a more interesting challenge if the data file were larger than the memory. I would love to see what people would come up with on some baby vm with 512 mb of ram.

Even more interesting would be small RAM, little local storage, and a large file only available over the network. I would like to see something other than HTTP, but realistically it would be HTTP.


As the file-memory ratio changes the problem becomes more and more stream processing, right? If the number of cities becomes too much to keep in memory then it becomes a database game of "let's see who can find a better index data structure for these I/O patterns and hardware".


For this problem, changing the file size to not fit in RAM doesn't really make the optimal solutions more interesting


Sixteen thousand Excel 97 spreadsheets?


Rather than read the file into memory, memory mapping can be used.


Memory mapping (at least on Linux) isn't actually faster than reading the file manually. Especially if you use appropriately sized buffers.

(Of course, the five times in a row might mess with that.)


Right, it's the five times in a row thing that makes an in-memory solution faster. Otherwise, this is a purely sequential one-pass problem, which is how you'd do it in practice.

Parallelism with edge effects is pretty common. Weather simulation, finite element analysis, and big-world games all have that issue. The middle of each cell is local, but you have to talk to the neighbor cells a little.


You wouldn't want to do this for a huge file. A very fast solution would use a small number of buffers and io_uring (or equivalent), keeping the page table and cache footprint small.


Yeah, so I had a discussion on Twitter about this; turns out 12GB is small enough to fit into memory, and the author evaluates submissions by running a solution 5 times in a row, so using direct IO actually hurts, because the kernel cache is what ensures the file is in memory for the 4 runs after the first. I have a direct IO solution with SIMD string search and double parsing, just in C++ (using libraries). It runs in 6 seconds on my 24 core linux box (NVMe).

Code: https://github.com/rockwotj/1brc

Discussion on Filesystem cache: https://x.com/rockwotj/status/1742168024776430041?s=20


I missed the "5 times in a row." If you do that, yeah, keeping the whole thing in memory is far better.


> double parsing

In case you haven't noticed yet, the input format guarantees exactly one fractional digit, so you can read a single signed integer followed by `.` and one digit instead.


Yeah, I missed this originally, and things could be faster with this assumption, without a full double parser. The fastest Java solution does some nearly branchless decoding for these.


could you just add the character values eg 49 for ascii 1, and then subtract off the offset once at the end instead of doing atoi on each line?

edit: doh that works for min and max but the average overflows.


Yes. I'm not sure it'll help, but def worth a try.


Wow, that's pretty fast considering how simple main.cc looks. I do love c++. Nice use of coroutines, too.


So you are basically at the mercy of the OS caching algorithm. That sounds like a bad plan for a benchmark. You are not measuring what you think you are (your code), you are measuring the OS caching policy.


Dumb question. With io_uring, how do you handle lines that straddle between chunks? I'm asking since, AFAIU, the submitted requests are not guaranteed to be completed in order.

(The easiest I can think of is submitting reads for "overlapped" chunks, I'm not sure there is an easier way and I'm not sure of how much performance overhead there is to it.)


You have to handle it manually. Remember the partial lines at the beginning/end of your chunks and merge them when their mates become available.


What is the downside of memory mapping in this scenario? Shouldn't the page table properly handle the case of doing a single sequential read over a range of pages? Accessing the contents of a file doesn't seem like something caching would matter for. Do you mean that reading of sequential pages will keep adding to the cache compared to reading from a single page? That seems like a similar thing as before where they will be the first things gone from the cache anyways.


Caching and paging almost always matter, even on things like this. The core problem is that the filesystem won't prefetch for you, and you will be stalling on page faults several times over the length of the file. Another problem with the size of the working set is that you will see several slow calls to (essentially) malloc in the kernel to hold all of that data, while a small, preallocated structure gives you none of that trouble.


> Shouldn't the page table properly handle the case of doing a single sequential read over a range of pages?

That's what I used to think, too. But the kernel ain't that smart.


But would that really be faster when you need to read every byte of a file?

I thought memory mapping solved a different problem.


Alternatively, "must be written in Java" can be interpreted to mean "must use the JVM to begin execution", and you can clearly spawn another process from Java...


Another rule was no external dependencies.

I guess you could, from Java itself, write a new binary and then run it, but that would be against the spirit of the challenge.


There's sun.misc.Unsafe which is a builtin that gives you raw read/write. Turn that into ROP, mmap your native program, and now you're full native :)


Just use JNA to allocate an RWX buffer with mmap, unpack a bit of clever code into the buffer, and forge a native function pointer to call it :)


Ooh fun, Advent of Code chaser!

A fair comparison between languages should include the make and build times. I haven't used Java / Maven for years, and I'm reminded why, heading into minute 2 of downloads for './mvnw clean verify'.


Java build times are very fast. You are just measuring your internet speed here.

(Also, gradle is faster as a build tool for incremental compilation)


Gradle and fast should not be in the same sentence.

I have no idea what gradle and its daemon do except spending orders of magnitude more time starting up than running the actual build.

You're much better off running javac directly. Then it's fast.


Then someone in your team failed to learn the bare minimum needed to write a sane gradle file, and is putting imperative stuff into the configuration phase.

For big projects, compiling only what’s necessary is a huge time win.


> Then someone in your team failed learning the bare minimum to write a sane gradle file

Then, in over 15 years and 8 versions, gradle failed to make a system that doesn't require "learning to write sane gradle files" just to make sure it's only two orders of magnitude, and not five, slower than it could be.

> and are putting imperative stuff into the configuration phase

Has nothing to do with the slowness that is gradle.

> For big projects, compiling only what’s necessary is a huge time win.

It is, and no one is arguing with that


It is your own failure, for not having learnt the tool, if you blame it rather than yourself.


A tool that:

- in 15 years is still slow by default

- in 15 years could not acquire any sensible defaults that shouldn't need extra work to configure

- is written in an esoteric language with next to zero tools to debug and introspect

- randomly changes and moves thing around every couple of years for no discernible reason and making zero impact on reducing its slowness

- has no configuration specification to speak of because the config is written in that same esoteric programming language

And its zealots blame its failures on its users


Build systems are a complex problem. Like, you can compare it to cargo/go's build system etc, but these forget about 90% of the problem, and come up with something hard-coded for the happy-path rust/go project, with zero non-native dependencies, all downloaded from the repository. This is not a bad thing, don't get me wrong. But it is not enough to compile, say, a browser.

Meanwhile, there are only a couple, truly general build systems, all of them quite complex — as you can’t get away with less. Gradle and bazel are pretty much the only truly generic build systems (I’m sure there are a couple more, but not many). You can actually compile a mixed java, c++, whatever project with gradle, for example, just fine, with proper parallelism, caching, etc.

As for your points:

> in 15 years is still slow by default

Absolutely false. A 3-4 lines gradle build file for java will be fast, and correctly parallelize and maximally utilize previously built artifacts.

> in 15 years could not acquire any sensible defaults that shouldn't need extra work to configure

It’s literally 4 lines for a minimal project, maybe even less. I’m getting the feeling that you have zero idea about what you talk about.

> is written in an esoteric language with next to zero tools to debug and introspect

I somewhat agree, though nowadays kotlin is also an alternative with strong typing, proper auto-complete, the only trade off is a tiny bit slower first compile of the config file. Also, both could/can be debugged by a normal java debugger.

> randomly changes and moves things around every couple of years for no discernible reason

There is some truth to this. But the speed has definitely improved.

> has no configuration specification

Kotlin is statically typed. Also, the API has okayish documentation, people just like to copy-paste everything and wonder why it fails. You wouldn’t be able to fly a plane either from watching 3 sitcoms showing a cockpit, would you?


> Absolutely false. A 3-4 lines gradle build file for java will be fast, and correctly parallelize and maximally utilize previously built artifacts.

I've never seen this on any project that utilizes gradle. Every time, without fail, it's a multi-second startup of gradle that eventually invokes something that is actually fast: javac.

> I’m getting the feeling that you have zero idea about what you talk about.

Only 7 years dealing with java projects, both simple and complex

>> has no configuration specification

> Kotlin is statically typed.

Kotlin being statically typed has literally nothing to do with a specification.

Because Gradle doesn't have a configuration specification. It has a Groovy/Kotlin module that grew organically, haphazardly and without any long-term plan. So writing gradle configuration is akin to reverse-engineering what the authors intended to do with a particular piece of code or API.

> Also, the API has okayish documentation

"okayish" is a synonim for bad. You'd think that a 15-year project in its 8th major version would have excellent API documentation

> people just like to copy-paste everything and wonder why it fails. You wouldn’t be able to fly a plane either from watching 3 sitcoms

So, a project with no specification to speak of, with "okayish" API that is slow as molasses by default out of the gate [1] keeps blaming its users for its own failure because somehow it's a plane, and not a glorified overengineered Makefile.

[1] A reminder, if you will, that gradle managed to come up with caching configs to speed up its startup time only in version 8, after 15 years of development.


They are trying to make gradle better, by implementing a "maven over it" - declarative build configuration.

But for the time being maven is way easier to configure and understand.


It has always been “an imperative code that outputs the declarative configuration that can be used for the build”. Which is a very reasonable thing to do, but unfortunately most people fail to understand that a println in the config file’s global scope is different than inside a closure for a task description.


> but unfortunately most people fail to understand that a println in the config file’s global scope is different than inside a closure for a task description.

Literally no one complaining about gradle being slow is doing that. Almost all of gradle's problems come not from people using it, but from Gradle itself.

I mean, you said it yourself: in 15 years the state of their API docs is "okayish". But somehow people are still expected to make sense of this shit because apparently it's a plane? But planes famously have extensive documentation, and procedures, and decades of expertise available. Exactly unlike gradle.


So what other general purpose build tool do you recommend/vouch for?

I actually really like Mill, but it is very small still. I’m unaware of too many playing in the same categories as Gradle.


"If you say a tool sucks, show me a better one" isn't as good as an argument as you think it is.

I know of at least one project that just wrote everything in Python because current build tools invariably suck: https://tonsky.me/blog/python-build/


It could be, if the tool were a work of good design and engineering. But not when the tool's claim to fame is that it does not use XML to describe the build config like Maven/Ant.


There are not many alternatives that are truly general build tools (not limited to single language/repository, etc), and properly parallelize and cache.

So, besides gradle and bazel, what else is there?


Gradle is faster than what? Than maven? Maybe. But not than Go or Cargo.


What step are we talking about? Javac itself is absolutely on the same order of magnitude speed as Go per loc, while rust is significantly slower (which makes sense, the latter is a properly optimizing compiler with heavy static analysis, while the former two just spews out java byte code/machine code).

Gradle with a daemon is also pretty fast, you are just probably used to some complex project with hundreds of dependencies and compare it to a cargo file with a single dependency.


Just a nitpick, but the static analysis in Rust doesn’t account for most of why compilation is slow, as proven by the fact that cargo check is much faster than cargo build.


Yeah, I know. I believe it’s mostly the amount of LLVM IR that is the problem (e.g. each generic instantiation outputs by default a new copy of roughly the same code), isn’t it?


> Javac itself is absolutely on the same order of magnitude speed as Go per loc

Not really when it has to run on a cold JVM. And before it warms up, it’s already done.

Then you are in the territory of keeping the compiler process alive between runs, but that is a memory hog and also not always reliable (gradle often decides to run a fresh daemon for whatever reason).

So theoretically, in lab conditions yes, in practice no.

> while rust is significantly slower

Everybody repeats that but there is surprisingly little evidence in the form of benchmarks. I can see rustc compiles over 50k LoC per second on average on my M2, which is roughly the same order of magnitude as Java. And checking without building is faster.


Javac doesn't run in JVM. It is a C (C++?) application.


https://github.com/openjdk/jdk/tree/master/src/jdk.compiler/...,

Wiki citation for good measure: https://en.wikipedia.org/wiki/Javac.

Maybe you're thinking of Hotspot, which is written in C++.


> Gradle with a daemon is also pretty fast

I've yet to see a project where Gradle daemon a) does anything useful and b) is actually used by gradle itself (instead of seemingly doing everything from scratch, no idea what it does in the seconds it takes for it to start up).


>I've yet to see a project where Gradle daemon a) does anything useful

Also my experience.

Only using the Gradle daemon does not help make a multi-project setup much faster. You also have to enable the build cache, configuration cache and configure-on-demand with `--configuration-cache --configure-on-demand` and hope nobody in the project breaks Gradle's ability to use these caches. But then it still took at least 10 seconds to build and start my services (and that's with incremental builds, like you changed one line of code after the first slow build). I spent two days and more after a release to speed this stuff up; before, it was 30 seconds, sometimes 60 seconds.

And the protobuf Gradle plugin sometimes did not update the generated code, so you had to force-delete the files on every build. And then other stuff in the caches broke and you had to delete the `.gradle` directory and sometimes even the `~/.gradle` directory. And sometimes the Gradle daemon hangs, so you have to force it to stop with `--stop`.

Go build, deno and bun are so much more reliable and faster. Something that was surprisingly fast was using the Gradle setup with skaffolding. Java hot code swapping is very fast.


This. Gradle wastes tremendous amounts of time on non-compile tasks. So even if Java takes a few seconds to compile, the whole build is often much, much longer. Interestingly, my impression is that maven is significantly more snappy.


I don't even think it's faster than Maven.


That depends on the project and its scope.


Itself.

> gradle is faster as a build tool for incremental compilation

Implicit:

> …than it is building from scratch, where it needs to download lots of stuff

I mean, yes, saying Java builds are fast does seem a bit “rolls eyes, yes technically by loc when the compiler is actually running” …but, ^_^! let’s not start banging on about how great cargo/rust compile times are… they’re really terrible once procedural macros are used, or sys dependencies invoke some heinous c dependency build like autoconf and lots of crates do… and there’s still a subpar incremental compilation story.

So, you know. Eh. Live and let live. Gradle isn’t that bad.


So maybe I was just unlucky with gradle and lucky with cargo. A project that is a mixture of 20k LoC Scala and 300k LoC Java took 6 minutes to compile, and incremental was still several tens of seconds. Cargo/Rust cold compiles a project of 1M LoC (all dependencies) in about 1:30 on the same hardware and incremental is like 2-4 seconds.

As for procedural macros - yes, they can be slow to compile, but so are Java annotation processors.


Scala is a much different case, it has one of the most advanced type systems, many implicit scoping rules etc. Even a java change that might affect the scala codebase could result in a slow incremental build. Also, the build system may not have been as modular as it could.

In my experience, java annotations get processed very fast.


> it has one of the most advanced type systems

Well, that's a bit of a stretch. It is definitely one of the most complex type systems due to interactions between nominal sub-typing and type classes which is a bit hairy. But in terms of what this type system can express or prove, it's both inferior (cannot express lifetimes or exclusive access) and superior (HKTs, path dependent types) to Rust in some aspects. Let's say they are close.


> It is definitely one of the most complex type systems due to interactions between nominal sub-typing and type classes

That’s what I meant mostly. Though if I’m being honest, I’m not familiar with the implementation of either language’s type system.


Scala is the problem here. Scala has several issues filed for slow compilation.

300K LOC Java should be done within ~1.5 minutes flat from zero. Can be even faster if you are running maven daemon for example. Or even within ~30 seconds if everything is within one module and javac is only invoked once.


...very relative


But run time is slow, is that what you want to convey?


No, I would even go as far as to say that a naive, non-microbenchmark java program can often perform better than a naive, low-level language one.


> A fair comparison between languages should include the make and build times

if you do that, you should also include programming time, and divide both by the number of runs the code will have over its lifetime. Also add an appropriate fraction of the time needed to learn to program.

In such a challenge, that likely would make a very naive version win.

Apart from being impractical, I think that would be against the idea behind this challenge.


> In such a challenge, that likely would make a very naive version win.

Not if you give a realistic number for an actual database of this scale. Tens of hours of dev time isn't much after you divide it by a few million.


Why would you clean?

“I discard cache and it is so slow, aarghhh”


Don't need maven

> No external dependencies may be used


maven is just a build tool. It creates a single uberjar which you run with the JDK directly - you don't ship the maven project.

But the real meaning is that you're not allowed to use external libraries, rather than build-tool-related dependencies.


At the Czech technical university C course, we had a very similar assignment. The submissions of all the course students were continuously evaluated on a leaderboard, and many spent tens of hours optimizing to get the extra points for a better grade (but mostly for status points).


I suspect Java is not the fastest language for this. I’d love to see unofficial contenders tackle this challenge using different languages.


There are a handful of implementations in other languages already. Here’s mine in C99: https://github.com/dannyvankooten/1brc

I’ve also seen versions in Rust, Go, Python, Clickhouse and DuckDB. The discussions tab on the GitHub repo lists some of these.


> Each contender will be run five times in a row. The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

Shouldn’t he take the single fastest time, assuming file (and JDK) being in file cache is controlled for?


This is done to simulate real-world performance. Your binary is not the only binary in the system and other services may be running as well. So fastest time is the happiest path and slowest is the unluckiest.

The range of remaining three is what you expect to get 99% of the time on a real world system.


Any production where you're routinely scanning a text file with a million records is probably a batch process, and I'd be shocked if the usual performance wasn't much closer to the worst case than the average.


But this is a contrived test, and looking for the fastest solution, so your arguments point to taking the fastest: the one with the least interference.


Yes, the one that works the fastest in the average case will be the fastest in the real world, and will be the one least affected by general noise present in the system.

The test is very well designed, we may say.


I agree that this method will produce a better estimate of expected mean real world performance under certain load conditions, but still contend it just muddies the waters about which solution is in fact the fastest.

And here is Andrei Alexandrescu arguing the same in 2012: https://forum.dlang.org/thread/mailman.73.1347916419.5162.di...


The one that happens to not be benchmarked at the same time as cron jobs will be the fastest in the real world...?


Not necessarily. Good code with nice parallelization and an efficient code path will always win. If the code is inefficient to begin with, having no cron jobs at the same time won't help.


> Your binary is not the only binary in the system and other services may be running as well.

Technically yes, but these days most of my machines are single purpose VMs; database/load balancer/app server/etc, so it still seems weird not to take the fastest.


Then, your VM is not the only VM sharing the bare metal. Same thing applies, only on a slightly higher level.

As long as you share the metal with other stuff (be it containers, binaries, VMs), there's always competition for resources, and your average time becomes your real world time.


Physical memory is not fungible like that across VMs. So, you can expect stuff loaded into memory to stay there unless the kernel inside your VM decides otherwise.


No, it is. VirtIO's ballooning device can "inflate" to pseudo-allocate memory in a VM and free physical memory for the host and other guests.

Moreover, even if your file is in memory, you cannot reserve "memory controller bandwidth". A VM using tons of memory bandwidth, or a couple of cores in the same NUMA node as your VM, will inevitably cause some traffic and slow you down.


That VM is hardly "single purpose".

There's logrotate and other cleanup tasks, monitoring, dns, a firewall, and plenty more running on that server. No matter how much you offload to the host (or forgo), there's always a kernel and supporting daemons running alongside or under your app.


I said "technically yes" :)


Why?

The author does explicitly want all the files to be cached in memory for the later runs: https://x.com/gunnarmorling/status/1742181941409882151?s=20


This must be Decodable's engineering problem disguised as a challenge.


Very fun challenge that nerd sniped me right away. Had to do a C version in standard C99 with POSIX threads. It[1] clocks in at just under 4 seconds on my AMD Ryzen 4800U Laptop CPU.

Should run about 10-20% faster than that on the mentioned Hetzner hardware.

- Since we only do one decimal of floating point precision it uses integer math right from the get-go.

- FNV-1a hash with linear probing and a load factor well under 0.5.

- Data file is mmap’d into memory.

- Data is processed in 8 totally separate chunks (no concurrent data structures) and then those aggregations are in turn aggregated when all threads have finished.

1: https://github.com/dannyvankooten/1brc
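
For readers following along in Java, a hedged sketch of the integer-math idea from the first bullet (my own illustration, not the linked C code): temperatures are parsed straight into tenths of a degree, so no floating point is needed until the final output.

    public class TempParse {
        // Parses a temperature like "-12.3" into tenths of a degree (-123).
        // Assumes the challenge format: optional '-', one or two integer digits, '.', one decimal digit.
        static int parseTenths(byte[] b, int start, int end) {
            int i = start;
            boolean neg = b[i] == '-';
            if (neg) i++;
            int v = 0;
            for (; i < end; i++) {
                if (b[i] != '.') v = v * 10 + (b[i] - '0');
            }
            return neg ? -v : v;
        }

        public static void main(String[] args) {
            byte[] s = "-12.3".getBytes();
            System.out.println(parseTenths(s, 0, s.length)); // prints -123
        }
    }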


Comparing solutions from the Rstats world: https://twitter.com/RandVegan/status/1742721195781267650


This has been super fun. My current PR[1] is another 15% faster than my entry in the README. Sadly I haven't been able to make any progress on using SIMD to accelerate any part of it.

I think the issues with hashing could be easily covered by having the 500 city names in the test data also be randomly generated at test time. There is no way to ensure there aren't hash collisions without doing a complete comparison between the names.

[1] https://github.com/gunnarmorling/1brc/pull/56


1 billion rows, 500 unique values...

It becomes very possible to find an instance of each unique value, then runtime-design a hash algorithm where those 500 values don't collide.

Java allows self modifying code after all (and this challenge also allows native code, which can also be compiled-at-runtime)


You have to compare the keys in order to figure out if you have a hash collision or a new key. Without first scanning the entire file to know what the list of keys are there isn't a way around it. Even determining that list of keys involves doing key comparisons for every row.
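
A minimal sketch of that key comparison in an open-addressing table (illustrative only - the Entry layout, table size and names are made up, not taken from any submission). The hash only decides where to start probing; correctness comes from the byte-for-byte comparison.

    import java.util.Arrays;

    class StationTable {
        static final class Entry { final byte[] name; long count; Entry(byte[] n) { name = n; } }

        final Entry[] table = new Entry[1 << 14];           // power of two, sized generously for ~500 stations

        Entry lookup(byte[] nameBytes, int hash) {
            int slot = hash & (table.length - 1);
            while (true) {
                Entry e = table[slot];
                if (e == null) {                            // empty slot: first time we see this name
                    return table[slot] = new Entry(nameBytes.clone());
                }
                if (Arrays.equals(e.name, nameBytes)) {     // same bytes: genuinely the same station
                    return e;
                }
                slot = (slot + 1) & (table.length - 1);     // same slot, different name: keep probing
            }
        }
    }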


looking at the sample code I’m quite glad I’ve never really touched Java. What a clunky language


Ok, I’ll bite - why?


It looks like they mostly work in JS, so maybe it's the type definitions that look clunky.

Java is explicit, yes. This is intentional :)


yea, I guess it's the lack of human-friendly default assumptions mixed with prescriptive design patterns. definitely an anachronistic design intent :(


have you worked with any compiled languages? It's not much different. It can get a lot worse ^_^. Java's one of the best languages to work in IMO.


I work professionally in C++, but I wouldn’t tout it either.

I find Python to be the most pleasant personally.


Python with types is indeed great.


Funny thing is, most js devs do typescript now, which is the same as java, if not worse, because you need an extra colon for your static types


Weird to think typescript is similar to a language that requires a class definition for “hello world” lol. You can’t even fit it on a single 80 character line…


Putting everything in a class has some really nice benefits.

For example, for performance monitoring/debugging, it's really nice to see memory usage and thread creation by package/class. Whereas, I've seen teams spend weeks trying to find memory leaks in NodeJS applications, because you just get "here's general usage by object-shape". Would literally take me less than a minute with Java...


It feels like the functionality you get per character typed is quite low. At the same time, the code isn't much cleaner and there's a lot of mental overhead in deciphering the abstractions (in this case Collector).


Java is probably the language I use the most, but I hate how much boilerplate it has. Something I’ve found to help with that is using Java Records instead of classes. A record is kind of like a struct, though I don’t think it’s exactly the same: there’s no boilerplate, and something I quite like about them is that their fields are immutable (more specifically “private final”), which works well with more functional programming, something that can be quite clunky with classes in java.

Anyway I’m not a software engineer so take what I say about software with a grain of salt. Also I feel like other java programmers will hate you if they need to read or use your code. For example, let’s say you had a 2D point data type that is a record. Instead of things being like “pos.add(5)”, if pos is a record you need to reassign it like “pos = pos.add(5)”, where add returns the new pos as an object. Similar to how BigInteger works in java.

Anyway I love records because there’s zero boilerplate; it feels very similar to me to how structs are written in rust and go, or how classes are done in scala. I just never see anyone talk about them. I guess that might be because most people starting new projects on the JVM are probably not programming in java?
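
A tiny sketch of the difference described above, with a hypothetical 2D point (not from the challenge code):

    // A record: immutable fields plus generated constructor, accessors, equals/hashCode/toString.
    record Point(double x, double y) {
        Point add(double d) { return new Point(x + d, y + d); } // returns a new value instead of mutating
    }

    public class RecordDemo {
        public static void main(String[] args) {
            Point pos = new Point(1.0, 2.0);
            pos = pos.add(5);                 // must reassign, just like BigInteger
            System.out.println(pos);          // Point[x=6.0, y=7.0]
        }
    }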


> instead of things being like “pos.add(5)” if pos is a record you need to reassign it like “pos = pos.add(5)”...

That's just mutable vs immutable data structures


I think the optimal strategy would be to use the "reduce" step in mapreduce. Have threads that read portions of the file and add data to a "list", 1 for each unique name. Then, this set of threads can "process" these lists. I don't think we need to sort, that'd be too expensive, just a linear pass would be good. I can't see how we can do SIMD since we want max/min which mandate a linear pass anyway.


Agreed, the aggregations chosen here are embarrassingly parallel, you just keep the count to aggregate means.

Would have been more interesting with something like median/k-th percentile, or some other aggregation not as easy.
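
For instance, a per-station partial could look like this (a hedged sketch with made-up names; temperatures kept in tenths of a degree). The mean is derived from (sum, count) only at the end, so merging per-thread partials is trivial - which is exactly what a median or percentile would break, since those need the raw values or an approximating sketch.

    class Stats {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0, count = 0;

        void add(long tenths) {               // one observation from this thread's chunk
            min = Math.min(min, tenths);
            max = Math.max(max, tenths);
            sum += tenths;
            count++;
        }

        void merge(Stats other) {             // combine another thread's partial result
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
            sum += other.sum;
            count += other.count;
        }

        double mean() { return sum / 10.0 / count; }
    }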


Not sure if this what you meant, but there are SIMD min/max instructions.

https://www.felixcloutier.com/x86/phminposuw


I suppose it defeats the spirit of the game to, knowing your worst run is discarded, calculate the results on the first run by whatever slow method you want, save them somewhere useful, and just read the results and print them out on the following runs?

Or at the very least, convert the input into a more convenient binary format for the following runs.


I thought the same, but to ensure fairness, I would suggest that the application should run in a stateless container without internet access, and the infrastructure (Hetzner VM) should be recreated from scratch for each run to eliminate all caches.


Interestingly, because your program runs on Linux and is run 5 times, Linux will almost certainly cache the 12gb file to RAM on the first invocation.

This means that future invocations don't have to load the file from disk. This also makes it pretty critical that your program doesn't use more than 16gb of ram itself (out of the server's 32gb) or it'll push the file out of cache making future invocations of your program slower.


Unfortunately I have no time to code this up but things that I would try to make it fast:

- Read with O_DIRECT (but compare it with the mmap approach). I know O_DIRECT gets much hate, but this could be one of the rare cases where it helps.

- Use a simple array with sentinel and linear search for the station names. This is dumb, but if the number of stations is small enough this could beat the hash. (In the back of my head I have a rule of thumb that linear search is faster for up to 1000 elements, but I'm not sure anymore where I got this from).


> Read with O_DIRECT

apparently the challenge explicitly allows for warming up the os cache on a prior run, so O_DIRECT would be disadvantaged here as the working set fits in memory.

edit: also in my very simple tests a good hashmap can beat linear search already at <10 items. It depends a lot on the hash function cost.


> beat linear search already at <10 items

The sources I found corroborate this. My 1000 elements rule of thumb is apparently way off.


I tried the linear search by station name in my first naive approach. Using a hashmap was at least 2-3x as fast with the ~415 distinct keys in the 1BRC dataset.


linear search is never faster assuming your hash algorithm is cheap enough. However, if your list only contains 2 items, then your hash algorithm better be "Is first character > 'm'".

When you want real speed like this and are happy with insane complexity, code that writes and self-compiles a hash function based on the data can make sense.


Here's my implementation in Go, which runs in under 5 seconds. It doesn't use anything too obscure, only the built-in maps and no external libraries. It's also the fastest solution I've tested on my M3 Pro Mac. Eager to see what beats it!

https://gist.github.com/corlinp/176a97c58099bca36bcd5679e68f...


Look at the leaderboard for the fastest solutions

https://github.com/gunnarmorling/1brc#results

The fastest is currently at 12 seconds (not on M3 Pro though).


There’s an implementation in C which should run in well under 2 seconds on your M3.

https://github.com/dannyvankooten/1brc


Under 2 seconds seems rather impossible on my machine since the disk maxes out at 5.1 GB/s. This one ran in 7.4s on my M3:

`bin/analyze measurements.txt 30.34s user 16.28s system 629% cpu 7.406 total`


On most Linux distributions and when the file is mmap'd, if you run it a second time the data will still be in RAM and not have to be read from disk. This gets the runtime down to 1.1s for this AMD 2950x (https://github.com/gunnarmorling/1brc/discussions/46#discuss...).

With SIMD and certain assumptions about the input this can seemingly be further reduced to well under a second, eg see https://github.com/gunnarmorling/1brc/discussions/138.


The fastest java solution from the current leaderboard runs within 2.6 seconds in my brand new M3 Mac.


Is that `./calculate_average_royvanrijn.sh`? It runs in 6.7s for me.

I would love it if you could run my solution and compare!


I don't understand, it should be pretty easy. A rolling average with BigDecimal would probably be sufficient, but a scientific lib might be better for a rolling average over more than a hundred million numbers.

https://stackoverflow.com/questions/277309/java-floating-poi...


The difficulty is creating the fastest implementation. If you look at the results of the submissions so far you’ll see a big difference in duration, between 11 seconds and more than 4 minutes.

11 seconds seems pretty impressive for a 12 GB file. Would be interesting to know what programming language could do it faster. For a database comparison you’d probably want to include loading the data into your database for a fair comparison.


Perl would do it quite fast and it has the benefit of accessing posix primitives directly.


A naive perl solution is really really slow compared to even the reference Java implementation. (I know, I've tried)


That's strange, you should be able to stream the file right into a tiny perl executable at the same speed as the bottlenecking hardware. The kernel will take care of all the logistics. You're probably trying to do too much explicitly. Just use a pipe. Perl should be done before the JIT completes.


Using cat to redirect the file to /dev/null takes 18s on my machine (a low-end NUC). Just running a noop on the file in Perl (ie. feeding it into a `while (<>)` loop but not acting on the contents) takes ~2 minutes.

1B lines is a lot, and Java ain't a slouch.


Why are you using cat at all? Use a pipe. This isn't hard stuff. Don't use <>, feed the file into a scalar or array. it should only take a few seconds to process a billion lines.

https://www.perl.com/pub/2003/11/21/slurp.html/#:~:text=Anot....


If it isn't hard, then perhaps you could demonstrate with a complete perl program that you think should beat java.


I profiled my attempt, actually reading each line is the bottleneck.


Perl is always going to be much faster than Java at tasks like this. Use stdin and chomp() instead of reading each line explicitly.

This is really a small, trivial task for a perl script. Even with a billion lines this is nothing for a modern cpu and perl.



Reddit?


It’s easy to solve but even fizzbuzz becomes complicated if you want double digit GB/s output.


It's really not. We're talking about gigahertz CPUs and likely solid state storage that can stream many gb/s.. running through a perl script. There really isn't much that is faster than that.


This would make for a really fun challenge in SQL, too.


1:05 in PostgreSQL 16 but it was harder than I thought to saturate the CPUs and not be disk-bound. Also, I ran on GCP not Hetzner, so maybe different hardware.

SQL in theory makes this trivial, handles many of the big optimizations and looking at the repo, cuts 1000+ LOC down to a handful. Modern SQL engines handle everything for you, which is the whole damned point of SQL. Any decent engine will handle parallelism, caching, I/O, etc. Some exotic engines can leverage GPU but for N=1 billion, I doubt GPU will be faster. Here's the basic query:

   SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps GROUP BY 1 ORDER BY 1;
In practice, generic SQL engines like PostgreSQL bloat the storage which is a Big Problem for queries like this - in my test, even with INT2 normalization (see below), pgsql took 37 bytes per record which is insane (23+ bytes of overhead to support transactions: https://www.postgresql.org/docs/current/storage-page-layout....). The big trick is to use PostgreSQL arrays to store the data by city, which removes this overhead and reduces the table size from 34GB (doesn't fit in memory) to 2GB (which does).

The first optimization is to observe that the cardinality of cities is small and can be normalized into integers (INTEGER aka INT4), and that the temps can as well (1 decimal of precision). Using SMALLINT (aka INT2) is probably not faster on modern CPUs but should use less RAM, which is better for both caching on smaller systems and cache hitrate on all systems. NUMERIC generally isn't faster or tighter on most engines.

To see the query plan, use EXPLAIN:

   postgres=# explain SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
                                                  QUERY PLAN
   ---------------------------------------------------------------------------------------------------------------
   Limit  (cost=13828444.05..13828445.37 rows=5 width=38)
     ->  Finalize GroupAggregate  (cost=13828444.05..13828497.22 rows=200 width=38)
         Group Key: city
         ->  Gather Merge  (cost=13828444.05..13828490.72 rows=400 width=38)
               Workers Planned: 2
               ->  Sort  (cost=13827444.02..13827444.52 rows=200 width=38)
                     Sort Key: city
                     ->  Partial HashAggregate  (cost=13827434.38..13827436.38 rows=200 width=38)
                           Group Key: city
                           ->  Parallel Seq Scan on temps_int2  (cost=0.00..9126106.69 rows=470132769 width=4)
 JIT:
   Functions: 8
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(13 rows)

Sigh, pg16 is still pretty conservative about parallelism, so let's crank it up.

   SET max_parallel_workers=16; set max_parallel_workers_per_gather=16;
   SET min_parallel_table_scan_size=0; set min_parallel_index_scan_size=0;
   SET parallel_setup_cost = 0; -- Reduce the cost threshold for parallel execution
   SET parallel_tuple_cost = 0.001; -- Lower the cost per tuple for parallel execution

   postgres=# explain SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
   ...
               Workers Planned: 14
   ...
top(1) is showing that we're burying the CPU:

   top - 10:09:48 up 22 min,  3 users,  load average: 5.36, 1.87, 0.95
   Tasks: 169 total,   1 running, 168 sleeping,   0 stopped,   0 zombie
   %Cpu(s): 12.6 us,  4.8 sy,  0.0 ni,  7.8 id, 74.2 wa,  0.0 hi,  0.7 si,  0.0 st
   MiB Mem :  32084.9 total,    258.7 free,    576.7 used,  31249.5 buff/cache
   MiB Swap:      0.0 total,      0.0 free,      0.0 used.  30902.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1062 postgres  20   0  357200 227952 197700 D  17.9   0.7   9:35.88 postgres: 16/main: postgres postgres [local] SELECT
   1516 postgres  20   0  355384  91232  62384 D  17.6   0.3   0:08.79 postgres: 16/main: parallel worker for PID 1062
   1522 postgres  20   0  355384  93336  64544 D  17.6   0.3   0:08.53 postgres: 16/main: parallel worker for PID 1062
   1518 postgres  20   0  355384  90148  61300 D  17.3   0.3   0:08.53 postgres: 16/main: parallel worker for PID 1062
   1521 postgres  20   0  355384  92624  63776 D  17.3   0.3   0:08.54 postgres: 16/main: parallel worker for PID 1062
   1519 postgres  20   0  355384  90440  61592 D  16.6   0.3   0:08.58 postgres: 16/main: parallel worker for PID 1062
   1520 postgres  20   0  355384  92732  63884 D  16.6   0.3   0:08.49 postgres: 16/main: parallel worker for PID 1062
   1517 postgres  20   0  355384  91544  62696 D  16.3   0.3   0:08.55 postgres: 16/main: parallel worker for PID 1062
Interestingly, when we match workers to CPU cores, %CPU drops to 14%, i.e. we don't leverage the hardware.

OK enough pre-optimization, here's the baseline:

   postgres=# SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
    city | min |         avg          | max
   ------+-----+----------------------+------
    0 |   0 | 276.0853550961625011 | 1099
    1 |   0 | 275.3679265859715333 | 1098
    2 |   0 | 274.6485567539599619 | 1098
    3 |   0 | 274.9825584419741823 | 1099
    4 |   0 | 275.0633718875598229 | 1097
   (5 rows)

   Time: 140642.641 ms (02:20.643)
I also tried to leverage a B-tree covering index (CREATE INDEX temps_by_city ON temps_int2 (city) INCLUDE (temp) ) but it wasn't faster - I killed the job after 4 minutes. Yes, I checked that pg16 used a parallel index-only scan (SET random_page_cost =0.0001; set min_parallel_index_scan_size=0; set enable_seqscan = false; SET enable_parallel_index_scan = ON;) - top(1) shows ~1.7% CPU, suggesting that we were I/O bound.

Instead, to amortize the tuple overhead, we can store the data as arrays:

   CREATE TABLE temps_by_city AS SELECT city, array_agg(temp) from temps_int2 group by city;

   $ ./table_sizes.sh
   ...total...   | 36 GB
   temps_int2    | 34 GB
   temps_by_city | 1980 MB
Yay, it now fits in RAM.

   -- https://stackoverflow.com/a/18964261/430938 adding IMMUTABLE PARALLEL SAFE
   CREATE OR REPLACE FUNCTION array_min(_data ANYARRAY) RETURNS NUMERIC AS $$
       SELECT min(a) FROM UNNEST(_data) AS a
   $$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

   SET max_parallel_workers=16; set max_parallel_workers_per_gather=16;
   SET min_parallel_table_scan_size=0; set min_parallel_index_scan_size=0;
   SET parallel_setup_cost = 0; -- Reduce the cost threshold for parallel execution
   SET parallel_tuple_cost = 0.001; -- Lower the cost per tuple for parallel execution

   postgres=# create table tmp1 as select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
   SELECT 1001
   Time: 132616.944 ms (02:12.617)

   postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
                                           QUERY PLAN
   -------------------------------------------------------------------------------------------------
   Sort  (cost=338.31..338.81 rows=200 width=98)
   Sort Key: city
   ->  Finalize HashAggregate  (cost=330.14..330.66 rows=200 width=98)
         Group Key: city
         ->  Gather  (cost=321.52..322.64 rows=600 width=98)
               Workers Planned: 3
               ->  Partial HashAggregate  (cost=321.52..322.04 rows=200 width=98)
                     Group Key: city
                     ->  Parallel Seq Scan on temps_by_city  (cost=0.00..0.04 rows=423 width=34)
(9 rows)

Ah, only using 3 cores of the 8...

---

   CREATE TABLE temps_by_city3 AS SELECT city, temp % 1000, array_agg(temp) from temps_int2 group by 1,2;

   postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city3 group by 1 order by 1 ;
                                               QUERY PLAN
   ---------------------------------------------------------------------------------------------------------
   Finalize GroupAggregate  (cost=854300.99..854378.73 rows=200 width=98)
   Group Key: city
   ->  Gather Merge  (cost=854300.99..854348.73 rows=2200 width=98)
         Workers Planned: 11
         ->  Sort  (cost=854300.77..854301.27 rows=200 width=98)
               Sort Key: city
               ->  Partial HashAggregate  (cost=854290.63..854293.13 rows=200 width=98)
                     Group Key: city
                     ->  Parallel Seq Scan on temps_by_city3  (cost=0.00..98836.19 rows=994019 width=34)
 JIT:
   Functions: 7
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(12 rows)

This 100% saturates the CPU and runs in ~65 secs (about 2x faster).

---

Creating the data

Here's a very fast lousy first pass for PostgreSQL that works on most versions - for pg16, there's random_normal(). I'm not on Hetzner so I used GCP (c2d-standard-8 with 16vCPU, 64GB, and 150GB disk, ubuntu 22.04):

   CREATE TABLE temps_int2 (city int2, temp int2);

   -- random()*random() is a cheap distribution. 1100 = 110 * 10 to provide one decimal of precision.
   INSERT INTO temps_int2 (city, temp) SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp from generate_series(1,1e9)i;


FWIW, you can make the data loading a decent bit faster:

1) use the integer version of generate_series, the 1e9 leads to the floating point version being chosen

2) move the generate_series() to the select list of a subselect - for boring reasons, that we should fix, the FROM version materializes the result first

3) using COPY is much faster, however a bit awkward to write

   psql -Xq -c 'COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e9::int8))) TO STDOUT WITH BINARY' | psql -Xq -c 'COPY temps_int2 FROM STDIN WITH BINARY'


wow! awesome tips. I knew COPY rocks but didn't realize it would win vs INSERT INTO SELECT FROM ! https://pganalyze.com/blog/5mins-postgres-optimizing-bulk-lo...

1/10th scale tests:

   psql -c 'create table temps_int2_copy as select * from temps_int2 where 1=0;'
   psql -Xq -c 'COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e8::int8))) TO STDOUT WITH BINARY' | psql -Xq -c 'COPY temps_int2_copy FROM STDIN WITH BINARY'
==> 32sec

   psql -c 'create table temps_int2_copy2 as select * from temps_int2_copy where 1=0;'
   psql -c 'INSERT INTO temps_int2_copy2 (city, temp) SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp from generate_series(1,1e8::int)i;'
==> 90sec

   psql -c 'create UNLOGGED table temps_int2_copy3 as select * from temps_int2_copy where 1=0;'
   psql -c 'INSERT INTO temps_int2_copy3 (city, temp) SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp from generate_series(1,1e8::int)i;'; date
==> 45sec (still not faster!)

Of course, if we really want to "go fast" then we want parallel loading, which means firing up N postgresql backends and each generate and write the data concurrently to different tables, then each compute a partial summary in N summary tables, and finally merge them together.

   echo "COPY temps_int2_copy_p1 from '/var/lib/postgresql/output100m.txt' with csv;" | /usr/lib/postgresql/16/bin/postgres --single -D /etc/postgresql/16/main/ postgres
==> 32sec (saturates one core)

The ultimate would be to hack into postgres and skip everything and just write the actual filesystem files in-place using knowledge of the file formats, then "wire in" these files to the database system tables. This normally gets hairy (e.g. TOAST) but with a simple table like this, it might be possible. This is a project I've always wanted to try.


> wow! awesome tips. I knew COPY rocks but didn't realize it would win vs INSERT INTO SELECT FROM !

We (postgres) should fix that at some point... The difference basically is that there's a dedicated path to insert many tuples at once that's often used by COPY that isn't used by INSERT INTO ... SELECT. The logic for determining when that optimization is correct (consider e.g. after-insert per-row triggers, the trigger invocation for row N may not yet see row N+1) is specific to COPY right now. We need to generalize it to be usable in more places.

To be fair, part of the reason the COPY approach is faster is that the generate_series() query actually uses a fair bit of CPU on its own, and the piped psql's lead to the data generation and data loading being run separately. Of course, partially paying for that by needing to serialize/deserialize the data and handling all the data in four processes.

When doing the COPYs separately to/from a file, it actually takes longer to generate the data than loading the data into an unlogged table.

  # COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp  FROM (SELECT generate_series(1,1e8::int8))) TO '/tmp/data.pgcopy' WITH BINARY;
  COPY 100000000
  Time: 21560.956 ms (00:21.561)

  # BEGIN;DROP TABLE IF EXISTS temps_int2; CREATE UNLOGGED TABLE temps_int2 (city int2 NOT NULL, temp int2 NOT NULL); COPY temps_int2 FROM '/tmp/data.pgcopy' WITH BINARY;COMMIT;

  BEGIN
  Time: 0.128 ms
  DROP TABLE
  Time: 0.752 ms
  CREATE TABLE
  Time: 0.609 ms
  COPY 100000000
  Time: 18874.010 ms (00:18.874)
  COMMIT
  Time: 229.650 ms
Loading into a logged table is a bit slower, at 20250.835 ms.

> Of course, if we really want to "go fast" then we want parallel loading, which means firing up N postgresql backends and each generate and write the data concurrently to different tables, then each compute a partial summary in N summary tables, and finally merge them together.

With PG >= 16, you need a fair bit of concurrency to hit bottlenecks due to multiple backends loading data into the same table with COPY. On my ~4 year old workstation I reach over 3GB/s, with a bit more work we can get higher. Before that the limit was a lot lower.

If I use large enough shared buffers so that IO does not become a bottleneck, I can load the 1e9 rows fairly quickly in parallel, using pgbench:

  c=20; psql -Xq -c "COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp  FROM (SELECT generate_series(1,1e6::int8))) TO '/tmp/data-1e6.pgcopy' WITH BINARY;" -c 'DROP TABLE IF EXISTS temps_int2; CREATE UNLOGGED TABLE temps_int2 (city int2 NOT NULL, temp int2 NOT NULL); ' && time pgbench -c$c -j$c -n -f <( echo "COPY temps_int2 FROM '/tmp/data-1e6.pgcopy' WITH BINARY;" ) -t $((1000/${c})) -P1

  real 0m26.486s
That's just 1.2GB/s, because the bottleneck is the per-row and per-field processing, due to their narrowness.

> The ultimate would be to hack into postgres and skip everything and just write the actual filesystem files in-place using knowledge of the file formats, then "wire in" these files to the database system tables. This normally gets hairy (e.g. TOAST) but with a simple table like this, it might be possible. This is a project I've always wanted to try.

I doubt that will ever be a good idea. For one, the row metadata contain transactional information, that'd be hard to create correctly outside of postgres. It'd also be too easy to cause issues with corrupted data.

However, there's a lot we could do to speed up data loading performance further. The parsing that COPY does, uhm, shows signs of iterative development over decades. Absurdly enough, that's where the bottleneck most commonly is right now. I'm reasonably confident that there's at least 3-4x possible without going to particularly extreme lengths. I think there's also at least a not-too-hard 2x for the portion of loading data into the table.


thx! Indeed, upon reading the source and sleeping on it, I agree, and in fact it looks like a single-user postgres backend with COPY FROM <filename> BINARY is approximately the same architecture as writing the database files in-place, and of course includes support for default values, triggers, constraints, TOAST and more.

I've reproduced the speed difference between pg COPY vs various cases.

Some results (middle result of 3 stable runs) from 1.4GB BINARY dump:

   echo "drop table if exists tbl; create table tbl(city int2, temp int2);copy tbl FROM '/citydata.bin' binary;" | ./pg/bin/postgres --single -D tmp -p 9999 postgres;

   real 0m34.508s


   # switching to unlogged table

   real 0m30.620s


   # hardcoding heap_multi_insert() to be a NOOP  (return early if ntuples>100)
   # fyi, heap_multi_insert() gets called with n=1000 tuples per call

   real 0m11.276s


   # hardcoding skip_tuple = true in src/backend/commands/copyfrom.c:1142

   real 0m6.894s


   # after testing various things
   time sh -c "tar cf - citydata.bin | (cd /tmp; tar xf -)"

   real 0m2.811s

Note: I tried increasing the blocksize (--with-blocksize) and also MAX_BUFFERED_TUPLES (copyfrom.c:65) but as expected they didn't help, I guess n=1000 tuples amortizes the overhead.


> top(1) is showing that we're burying the CPU:

I think it actually shows that you're IO bound (the 'D' in the 'S' column). On my workstation the query takes ~10.9s after restarting postgres and dropping the os caches. And this is a four year old CPU that wasn't top of the line at the time either.

> postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;

Note that you dropped the limit 5 here. This causes the query to be a good bit slower, the expensive part here is all the array unnesting, which only needs to happen for the actually selected cities.

On my workstation the above takes 1.8s after adding the limit 5.


Exactly. The issue is that postgresql takes 37 bytes per row normally, which then causes it to spill out of RAM on the limited VM specified for this challenge, causing the query to be I/O bound, hence the array representation and unnesting to fit it back into RAM. I'm guessing your machine has more RAM ?


> Exactly. The issue is that postgresql takes 37 bytes per row normally, which then causes it to spill out of RAM on the limited VM specified for this challenge, causing the query to be I/O bound, hence the array representation and unnesting to fit it back into RAM. I'm guessing your machine has more RAM ?

It does, but I restarted postgres and cleared the OS cache.

Btw, the primary bottleneck for the array-ified query is the unnest() handling in the functions. The minimal thing would be to make the functions faster, e.g. via:

CREATE OR REPLACE FUNCTION array_avg(_data anyarray) RETURNS numeric IMMUTABLE PARALLEL SAFE LANGUAGE sql AS $$ select avg(a) from (SELECT unnest(_data) as a) $$;

But that way the unnest is still done 3x. Something like

  SELECT a_min, a_max, a_avg FROM temps_by_city, LATERAL (SELECT min(u) a_min, max(u) a_max, avg(u) a_avg FROM (SELECT unnest(array_agg) u)) limit 5;
should be faster.



My one liner solution runs in 76ms on my 10 year old mac:

    q)\ts exec (min;max;avg)@\:measurement by city from flip `city`measurement!("SF";";") 0: `:weather_stations.csv
    76 8720688


Your 10 year old mac can ingest the input at 157GB/s? Is that from RAM or from disk?


Actually I only had a partial file :/ didn't realise that the file in the data folder was only a sample


1 Billion Row Challenge with Apache Pinot https://hubertdulay.substack.com/p/1-billion-row-challenge-i...

852 ms on an M1 Max


Not including ingestion is cheating; the core challenge is parsing the text input.


Is there a way to create this file without having java ?

This is for use on a BSD where Java may not work too well.

Thanks


You can run the bin/create-sample program from this C implementation here: https://github.com/dannyvankooten/1brc

It’s just the city names + averages from the official repository using a normal distribution to generate 1B random rows.


Would offering a cash prize increase or decrease the quality of submissions?


This can turn into a nice benchmark if done in multiple languages.


Would have been nice to accept other JVM languages like Scala, Clojure, Kotlin, etc and compare against Java, which I expect to have slightly better performance.


first search result for averaging streaming data:

https://nestedsoftware.com/2018/03/20/calculating-a-moving-a...

So you just walk the file, read a chunk, update the averages and move on. The resource usage should be next to nothing and speed should be limited by your disk IO.
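
A hedged Java sketch of that kind of incremental update (my illustration of the idea the linked article describes):

    // Incremental (running) mean: one update per sample, nothing else needs to be kept around.
    static double runningMean(double[] samples) {
        double mean = 0;
        long n = 0;
        for (double x : samples) {
            n++;
            mean += (x - mean) / n;   // mean_n = mean_(n-1) + (x_n - mean_(n-1)) / n
        }
        return mean;
    }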


Yes, just read a file in chunks and spread the math across cores. How many ways could you possibly implement that?? :)


Custom number parsing, minimising the number of memory allocations to not be punished by the garbage collector. All sorts of micro optimisations that make those solutions a terrible way to showcase a language (i.e. you can write much clearer and more concise code, but it will obviously be slower).


I agree that the simplest solution in each language is the best way to compare - however this problem seems less about showing off java and more about challenging folks.


Looking at some solutions, they seem to include their own double parsing implementation. I built a home-made serializer for csv files; I am using the default .NET parsing functions and I find that parsing numbers/dates is by far the slowest part of the process on large files.


Actually, I think you can also just average each chunk and then add it to the existing data. Like: read N rows (say all have one location, to keep it simple), average the data from the chunk, update/save min and max, move on to the next chunk, do the same, but now update the average by adding it to the existing/previously computed average and dividing by two. The result will be the same - disk IO will be the most limiting aspect. This "challenge" is not really a challenge; there is nothing complicated about it. It just seems "cool" when you say "process 1 billion rows the fastest you can".


Wouldn't this end up reducing the weight of earlier records by repeatedly dividing them into smaller chunks?

I.e. avg of {22.5, 23, 24} = 23.17... But:

1. 22.5

2. (22.5 + 23)/2 = 22.75

3. (22.75 + 24)/2 = 23.375


Say you load 1 million records and average them at 5.1. You then load another million and average them at 4.5. So you take (5.1 + 4.5) / 2 = 4.8. Rinse and repeat. As long as you keep the number of records processed per run about the same, your numbers will not be skewed. Only the last chunk will most likely be smaller, and it will introduce a small rounding error, like if it has only 10k records instead of 1M. But still, it is the simplest solution with a good enough outcome.


Essentially that is how integrals are calculated in mathematics, if I remember correctly. You take a curve and divide it into columns; the thinner the column, the smaller the deviation (because the curve has round edges, so your bar will have an inherent error). You simply calculate each column and then total them, and you get the area under the function. Same principle as radians with a circle. You are merely splitting the work into smaller pieces that you can process.


You have to weight the previous results appropriately, then it works.
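
A worked toy example of that weighting (made-up numbers, unequal chunks):

    chunk A: 1,000,000 rows, avg 5.1
    chunk B:   250,000 rows, avg 4.5

    unweighted: (5.1 + 4.5) / 2                               = 4.80
    weighted:   (1,000,000*5.1 + 250,000*4.5) / 1,250,000     = 4.98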


I've idly been considering how to score this with more languages, with far less memory. I'm thinking, buildpacks on a Raspberry Pi.


This is an IO bound problem. Trying to "go faster" by using multiple threads or vectorisation isn't going to achieve much.


No, it is not. The file is smaller than the available RAM, so will be cached in subsequent runs.


You're not up to date with how fast IO can be these days. Nobody who cares primarily about performance uses spinning rust anymore.


I'm pretty sure Hetzner servers (where the test is performed) have NVMe drives that can read at 4GB/s. So a theoretically IO-bound solution would be done in around 3 seconds. However the fastest current solution is around 12 seconds.

So it's not IO-bound.


I wonder why the choice to go with a shared compute for measuring this, given the variability that will introduce.


Kind of a shame that median isn't included to make it tougher.


Anyone up to also do this in Rust ?


See the "Show & Tell", where folks are discussing solutions in different languages, including Rust: https://github.com/gunnarmorling/1brc/discussions/categories....


but… can I use pandas?


It takes 330 seconds on a machine where the top OpenJDK version takes 5.75 secs and my .NET version takes 4.8 seconds. However it's just several lines of code and I used ChatGPT as a fun use case for how long it would take to have something working. It took around 5 mins. So if one needs to run this code only once Pandas would be a huge win considering development time. Interestingly the RAM usage was much more than 14 GB input file, probably 2-2.5x of that.

BTW asking ChatGPT to utilize all cores did not yield anything working in reasonable time.


hahaha... nice! Out of interest, could you share more about your .NET solution? (the specifics but not the code)



From a pure performance perspective it’d probably be quite slow in comparison to a dedicated implementation for this task. Especially as it wouldn’t fit in memory.


why wouldn't it fit into memory? the data set is only 12gb.


for my SQL implementation, I normalized the city names and temps to 2 bytes apiece, so it's even less!

https://news.ycombinator.com/item?id=38866073


This is just a map/reduce problem. Use Hadoop. It's Java, isn't it?


Why would I use Hadoop for such a small number of rows…?


1 billion is small for hadoop?


Anything that fits in RAM on one machine is easily too small for Hadoop. In those cases, the overhead of Hadoop is going to make it get destroyed by a single beefy machine. The only times where this might not be the case is when you're doing a crazy amount of computation relative to the data you have.

Note that you can easily reach 1TB of RAM on (enterprise) commodity hardware now, and SSDs are pretty fast too.

Old but gold post from 2014: https://adamdrake.com/command-line-tools-can-be-235x-faster-...



If it fits on one computer it's not a hadoop problem.


It fits on a dusty ten year old USB stick


Sounds like an awk problem tbh.


A Hadoop submission may help people realize that. But since you only have one machine to work with it should be obvious that you're not going to get any speed-up via divide and conquer.


~14GB file? it's on the small side for hadoop


> No external dependencies may be used


[flagged]


Did you try and run it to see how will it performs? Could be an interesting data point on the leaderboard.


I ran it, it was slower than the reference implementation and didn't format the output exactly as defined.


how to make someone else do your homework for free


Single line solve using clickhouse-local or duckdb.


correct. like this:

time duckdb -list -c "select map_from_entries(list((name,x))) as result from (select name, printf('%.1f/%.1f/%.1f',min(value), mean(value),max(value)) as x from read_csv('measurements.txt', delim=';', columns={'name': 'varchar', 'value':'float'}) group by name order by name)"

takes about 20 seconds


> No external dependencies may be used


The whole premise is silly though. Why would anyone use plain java to compute this when databases were built for this, or at least are the most finely tuned for it?


It's only silly if you miss the point of the challenge :) Which is to learn something new and have fun along the way.


But is it realistic at all?

Also will I get pilloried if I just make it a big StreamOf thing? ;-)


The Java runtime is an external dependency, isn’t it?


Any elixir devs in here find this funny?


You had 32GB, more than the expected 16GB of RAM, but for Java I doubt that even 32GB is enough. Java was absolutely good in the old days, but nowadays there are a lot more options than that poor guy.


Meaningless to talk about how much RAM you need for a JVM program as it will (obviously) depend on what the program is doing.



