1:05 in PostgreSQL 16, but it was harder than I thought to saturate the CPUs and not be disk-bound. Also, I ran on GCP, not Hetzner, so maybe different hardware.
SQL in theory makes this trivial: it handles many of the big optimizations and, looking at the repo, cuts 1000+ LOC down to a handful. Modern SQL engines handle everything for you, which is the whole damned point of SQL. Any decent engine will handle parallelism, caching, I/O, etc. Some exotic engines can leverage the GPU, but for N=1 billion I doubt a GPU will be faster. Here's the basic query:
SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps GROUP BY 1 ORDER BY 1;
In practice, generic SQL engines like PostgreSQL bloat the storage, which is a Big Problem for queries like this - in my test, even with INT2 normalization (see below), pgsql took 37 bytes per record, which is insane (23+ bytes of overhead to support transactions: https://www.postgresql.org/docs/current/storage-page-layout....). The big trick is to use PostgreSQL arrays to store the data by city, which removes this overhead and shrinks the table from 34GB (doesn't fit in memory) to 2GB (which does).
The first optimization is to observe that the cardinality of cities is small, so they can be normalized into integers (INTEGER aka INT4), and that the temps can be as well (1 decimal of precision, stored as tenths). Using SMALLINT (aka INT2) is probably not faster on modern CPUs, but it uses less RAM, which is better both for caching on smaller systems and for cache hit rate on all systems. NUMERIC generally isn't faster or tighter on most engines.
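For the record, the normalization itself is trivial. A sketch, assuming the raw measurements were first loaded into a staging table raw(city text, temp numeric) - that staging table and its column names are hypothetical, not part of the setup below (which just generates synthetic data directly):
CREATE TABLE cities (id int2 PRIMARY KEY, name text UNIQUE);
INSERT INTO cities
    SELECT (row_number() OVER (ORDER BY name))::int2, name
    FROM (SELECT DISTINCT city AS name FROM raw) s;
CREATE TABLE temps_int2 (city int2, temp int2);
INSERT INTO temps_int2
    SELECT c.id, round(r.temp * 10)::int2  -- tenths of a degree
    FROM raw r JOIN cities c ON c.name = r.city;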
To see the query plan, use EXPLAIN:
postgres=# explain SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Limit  (cost=13828444.05..13828445.37 rows=5 width=38)
   ->  Finalize GroupAggregate  (cost=13828444.05..13828497.22 rows=200 width=38)
         Group Key: city
         ->  Gather Merge  (cost=13828444.05..13828490.72 rows=400 width=38)
               Workers Planned: 2
               ->  Sort  (cost=13827444.02..13827444.52 rows=200 width=38)
                     Sort Key: city
                     ->  Partial HashAggregate  (cost=13827434.38..13827436.38 rows=200 width=38)
                           Group Key: city
                           ->  Parallel Seq Scan on temps_int2  (cost=0.00..9126106.69 rows=470132769 width=4)
 JIT:
   Functions: 8
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(13 rows)
Sigh, pg16 is still pretty conservative about parallelism, so let's crank it up.
SET max_parallel_workers=16; set max_parallel_workers_per_gather=16;
SET min_parallel_table_scan_size=0; set min_parallel_index_scan_size=0;
SET parallel_setup_cost = 0; -- Reduce the cost threshold for parallel execution
SET parallel_tuple_cost = 0.001; -- Lower the cost per tuple for parallel execution
postgres=# explain SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
...
Workers Planned: 14
...
top(1) is showing that we're burying the CPU:
top - 10:09:48 up 22 min, 3 users, load average: 5.36, 1.87, 0.95
Tasks: 169 total, 1 running, 168 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.6 us, 4.8 sy, 0.0 ni, 7.8 id, 74.2 wa, 0.0 hi, 0.7 si, 0.0 st
MiB Mem : 32084.9 total, 258.7 free, 576.7 used, 31249.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30902.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1062 postgres 20 0 357200 227952 197700 D 17.9 0.7 9:35.88 postgres: 16/main: postgres postgres [local] SELECT
1516 postgres 20 0 355384 91232 62384 D 17.6 0.3 0:08.79 postgres: 16/main: parallel worker for PID 1062
1522 postgres 20 0 355384 93336 64544 D 17.6 0.3 0:08.53 postgres: 16/main: parallel worker for PID 1062
1518 postgres 20 0 355384 90148 61300 D 17.3 0.3 0:08.53 postgres: 16/main: parallel worker for PID 1062
1521 postgres 20 0 355384 92624 63776 D 17.3 0.3 0:08.54 postgres: 16/main: parallel worker for PID 1062
1519 postgres 20 0 355384 90440 61592 D 16.6 0.3 0:08.58 postgres: 16/main: parallel worker for PID 1062
1520 postgres 20 0 355384 92732 63884 D 16.6 0.3 0:08.49 postgres: 16/main: parallel worker for PID 1062
1517 postgres 20 0 355384 91544 62696 D 16.3 0.3 0:08.55 postgres: 16/main: parallel worker for PID 1062
Interestingly, when we match workers to CPU cores, %CPU drops to ~14%, i.e. we don't leverage the hardware.
OK enough pre-optimization, here's the baseline:
postgres=# SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
city | min | avg | max
------+-----+----------------------+------
0 | 0 | 276.0853550961625011 | 1099
1 | 0 | 275.3679265859715333 | 1098
2 | 0 | 274.6485567539599619 | 1098
3 | 0 | 274.9825584419741823 | 1099
4 | 0 | 275.0633718875598229 | 1097
(5 rows)
Time: 140642.641 ms (02:20.643)
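Keep in mind the temps are stored as tenths of a degree and the cities as integer ids, so for human-readable output you'd scale back down (and join to a name table if you have one); roughly:
SELECT city, MIN(temp)/10.0 AS min_c, round(AVG(temp)/10.0, 1) AS mean_c, MAX(temp)/10.0 AS max_c
FROM temps_int2 GROUP BY 1 ORDER BY 1;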
I also tried to leverage a B-tree covering index (CREATE INDEX temps_by_city ON temps_int2 (city) INCLUDE (temp)),
but it wasn't faster - I killed the job after 4 minutes. Yes, I checked that pg16 used a parallel index-only scan
(SET random_page_cost = 0.0001; set min_parallel_index_scan_size=0; set enable_seqscan = false;
SET enable_parallel_index_scan = ON;) - top(1) showed ~1.7% CPU, suggesting that we were I/O bound.
Instead, to amortize the tuple overhead, we can store the data as arrays:
CREATE TABLE temps_by_city AS SELECT city, array_agg(temp) from temps_int2 group by city;
$ ./table_sizes.sh
...total... | 36 GB
temps_int2 | 34 GB
temps_by_city | 1980 MB
Yay, it now fits in RAM.
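The table_sizes.sh helper isn't shown; the underlying query is presumably something along the lines of:
SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
FROM pg_class
WHERE relkind = 'r' AND relnamespace = 'public'::regnamespace
ORDER BY pg_total_relation_size(oid) DESC;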
-- https://stackoverflow.com/a/18964261/430938 adding IMMUTABLE PARALLEL SAFE
CREATE OR REPLACE FUNCTION array_min(_data ANYARRAY) RETURNS NUMERIC AS $$
SELECT min(a) FROM UNNEST(_data) AS a
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
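The queries below also need array_max and array_avg, presumably defined along the same lines (a sketch):
CREATE OR REPLACE FUNCTION array_max(_data ANYARRAY) RETURNS NUMERIC AS $$
SELECT max(a) FROM UNNEST(_data) AS a
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
CREATE OR REPLACE FUNCTION array_avg(_data ANYARRAY) RETURNS NUMERIC AS $$
SELECT avg(a) FROM UNNEST(_data) AS a
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;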
SET max_parallel_workers=16; set max_parallel_workers_per_gather=16;
SET min_parallel_table_scan_size=0; set min_parallel_index_scan_size=0;
SET parallel_setup_cost = 0; -- Reduce the cost threshold for parallel execution
SET parallel_tuple_cost = 0.001; -- Lower the cost per tuple for parallel execution
postgres=# create table tmp1 as select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
SELECT 1001
Time: 132616.944 ms (02:12.617)
postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
 Sort  (cost=338.31..338.81 rows=200 width=98)
   Sort Key: city
   ->  Finalize HashAggregate  (cost=330.14..330.66 rows=200 width=98)
         Group Key: city
         ->  Gather  (cost=321.52..322.64 rows=600 width=98)
               Workers Planned: 3
               ->  Partial HashAggregate  (cost=321.52..322.04 rows=200 width=98)
                     Group Key: city
                     ->  Parallel Seq Scan on temps_by_city  (cost=0.00..0.04 rows=423 width=34)
(9 rows)
Ah, only using 3 cores of the 8...
---
To give the planner more rows to parallelize over, we can split each city's array into buckets (here via temp % 1000):
CREATE TABLE temps_by_city3 AS SELECT city, temp % 1000, array_agg(temp) from temps_int2 group by 1,2;
postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city3 group by 1 order by 1 ;
                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=854300.99..854378.73 rows=200 width=98)
   Group Key: city
   ->  Gather Merge  (cost=854300.99..854348.73 rows=2200 width=98)
         Workers Planned: 11
         ->  Sort  (cost=854300.77..854301.27 rows=200 width=98)
               Sort Key: city
               ->  Partial HashAggregate  (cost=854290.63..854293.13 rows=200 width=98)
                     Group Key: city
                     ->  Parallel Seq Scan on temps_by_city3  (cost=0.00..98836.19 rows=994019 width=34)
 JIT:
   Functions: 7
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(12 rows)
This 100% saturates the CPU and runs in ~65 secs (about 2x faster).
---
Creating the data
Here's a very fast, lousy first pass for PostgreSQL that works on most versions - for pg16, there's random_normal(). I'm not on Hetzner, so I used GCP (c2d-standard-8 with 8 vCPUs, 32GB RAM, and a 150GB disk, running Ubuntu 22.04 and PostgreSQL 16):
CREATE TABLE temps_int2 (city int2, temp int2);
-- random()*random() is a cheap distribution. 1100 = 110 * 10 to provide one decimal of precision.
INSERT INTO temps_int2 (city, temp) SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp from generate_series(1,1e9)i;
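If you want a less bogus distribution on pg16, the random_normal() mentioned above can be dropped in - the mean/stddev here are made up (in tenths of a degree):
-- pg16+ only; e.g. mean 15.0 C, stddev 10.0 C, stored as tenths
INSERT INTO temps_int2 (city, temp)
SELECT (1000*random())::int2 AS city, round(random_normal(150, 100))::int2 AS temp
FROM generate_series(1, 1e9::int8) i;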
FWIW, you can make the data loading a decent bit faster:
1) use the integer version of generate_series - the 1e9 literal causes the non-integer version to be chosen
2) move the generate_series() into the select list of a subselect - for boring reasons that we should fix, the FROM version materializes the result first
3) using COPY is much faster, however a bit awkward to write
psql -Xq -c 'COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e9::int8))) TO STDOUT WITH BINARY' | psql -Xq -c 'COPY temps_int2 FROM STDIN WITH BINARY'
psql -c 'create table temps_int2_copy as select * from temps_int2 where 1=0;'
psql -Xq -c 'COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e8::int8))) TO STDOUT WITH BINARY' | psql -Xq -c 'COPY temps_int2_copy FROM STDIN WITH BINARY'
==> 32sec
psql -c 'create table temps_int2_copy2 as select * from temps_int2_copy where 1=0;'
psql -c 'INSERT INTO temps_int2_copy2 (city, temp) SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp from generate_series(1,1e8::int)i;'
==> 90sec
psql -c 'create UNLOGGED table temps_int2_copy3 as select * from temps_int2_copy where 1=0;'
psql -c 'INSERT INTO temps_int2_copy3 (city, temp) SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp from generate_series(1,1e8::int)i;'; date
==> 45sec (faster, but still not as fast as COPY!)
Of course, if we really want to "go fast" then we want parallel loading, which means firing up N postgresql backends, having each generate and write its data concurrently to a different table, then having each compute a partial summary into one of N summary tables, and finally merging those together.
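The merge at the end is the only subtle part: MIN and MAX compose directly, but AVG has to be rebuilt from SUM and COUNT rather than by averaging the partial averages. A sketch of that final step, assuming each backend wrote a partial summary table summary_part_N(city, min_t, sum_t, n, max_t) (hypothetical names):
SELECT city,
       min(min_t) AS min,
       sum(sum_t)::numeric / sum(n) AS avg,
       max(max_t) AS max
FROM (
    SELECT * FROM summary_part_1
    UNION ALL
    SELECT * FROM summary_part_2
    -- UNION ALL ... one branch per backend
) u
GROUP BY 1 ORDER BY 1;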
echo "COPY temps_int2_copy_p1 from '/var/lib/postgresql/output100m.txt' with csv;" | /usr/lib/postgresql/16/bin/postgres --single -D /etc/postgresql/16/main/ postgres
==> 32sec (saturates one core)
The ultimate would be to hack into postgres and skip everything and just write the actual filesystem files in-place using knowledge of the file formats, then "wire in" these files to the database system tables. This normally gets hairy (e.g. TOAST) but with a simple table like this, it might be possible. This is a project I've always wanted to try.
> wow! awesome tips. I knew COPY rocks but didn't realize it would win vs INSERT INTO SELECT FROM !
We (postgres) should fix that at some point... The difference basically is that there's a dedicated path for inserting many tuples at once that's often used by COPY but isn't used by INSERT INTO ... SELECT. The logic for determining when that optimization is correct (consider e.g. after-insert per-row triggers: the trigger invocation for row N may not yet see row N+1) is specific to COPY right now. We need to generalize it to be usable in more places.
To be fair, part of the reason the COPY approach is faster is that the generate_series() query uses a fair bit of CPU on its own, and the piped psqls let the data generation and the data loading run separately. Of course, we partially pay for that by having to serialize/deserialize the data and handle all of it across four processes.
When doing the COPYs separately to/from a file, it actually takes longer to generate the data than loading the data into an unlogged table.
# COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e8::int8))) TO '/tmp/data.pgcopy' WITH BINARY;
COPY 100000000
Time: 21560.956 ms (00:21.561)
# BEGIN;DROP TABLE IF EXISTS temps_int2; CREATE UNLOGGED TABLE temps_int2 (city int2 NOT NULL, temp int2 NOT NULL); COPY temps_int2 FROM '/tmp/data.pgcopy' WITH BINARY;COMMIT;
BEGIN
Time: 0.128 ms
DROP TABLE
Time: 0.752 ms
CREATE TABLE
Time: 0.609 ms
COPY 100000000
Time: 18874.010 ms (00:18.874)
COMMIT
Time: 229.650 ms
Loading into a logged table is a bit slower, at 20250.835 ms.
> Of course, if we really want to "go fast" then we want parallel loading, which means firing up N postgresql backends, having each generate and write its data concurrently to a different table, then having each compute a partial summary into one of N summary tables, and finally merging those together.
With PG >= 16, you need a fair bit of concurrency to hit bottlenecks due to multiple backends loading data into the same table with COPY. On my ~4 year old workstation I reach over 3GB/s, with a bit more work we can get higher. Before that the limit was a lot lower.
If I use large enough shared buffers so that IO does not become a bottleneck, I can load the 1e9 rows fairly quickly in parallel, using pgbench:
c=20; psql -Xq -c "COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e6::int8))) TO '/tmp/data-1e6.pgcopy' WITH BINARY;" -c 'DROP TABLE IF EXISTS temps_int2; CREATE UNLOGGED TABLE temps_int2 (city int2 NOT NULL, temp int2 NOT NULL); ' && time pgbench -c$c -j$c -n -f <( echo "COPY temps_int2 FROM '/tmp/data-1e6.pgcopy' WITH BINARY;" ) -t $((1000/${c})) -P1
real 0m26.486s
That's just 1.2GB/s, because with rows this narrow the bottleneck is the per-row and per-field processing.
> The ultimate would be to hack into postgres and skip everything and just write the actual filesystem files in-place using knowledge of the file formats, then "wire in" these files to the database system tables. This normally gets hairy (e.g. TOAST) but with a simple table like this, it might be possible. This is a project I've always wanted to try.
I doubt that will ever be a good idea. For one, the row metadata contains transactional information that would be hard to create correctly outside of postgres. It'd also be too easy to cause issues with corrupted data.
However, there's a lot we could do to speed up data loading further. The parsing that COPY does, uhm, shows signs of iterative development over decades. Absurdly enough, that's where the bottleneck most commonly is right now. I'm reasonably confident that there's at least a 3-4x improvement possible without going to particularly extreme lengths. I think there's also at least a not-too-hard 2x in the portion that actually loads the data into the table.
thx! Indeed, upon reading the source and sleeping on it, I agree, and in fact it looks like a single-user postgres backend with COPY FROM <filename> BINARY is approximately the same architecture as writing the database files in-place, and of course includes support for default values, triggers, constraints, TOAST and more.
I've reproduced the speed difference between pg COPY and various stripped-down cases.
Some results (middle result of 3 stable runs) from 1.4GB BINARY dump:
echo "drop table if exists tbl; create table tbl(city int2, temp int2);copy tbl FROM '/citydata.bin' binary;" | ./pg/bin/postgres --single -D tmp -p 9999 postgres;
real 0m34.508s
# switching to unlogged table
real 0m30.620s
# hardcoding heap_multi_insert() to be a NOOP (return early if ntuples>100)
# fyi, heap_multi_insert() gets called with n=1000 tuples per call
real 0m11.276s
# hardcoding skip_tuple = true in src/backend/commands/copyfrom.c:1142
real 0m6.894s
# after testing various things
time sh -c "tar cf - citydata.bin | (cd /tmp; tar xf -)"
real 0m2.811s
Note: I tried increasing the blocksize (--with-blocksize) and also MAX_BUFFERED_TUPLES (copyfrom.c:65), but as expected they didn't help; I guess n=1000 tuples already amortizes the overhead.
I think it actually shows that you're IO bound (the 'D' in the 'S' column). On my workstation the query takes ~10.9s after restarting postgres and dropping the os caches. And this is a four year old CPU that wasn't top of the line at the time either.
> postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
Note that you dropped the limit 5 here. This makes the query a good bit slower; the expensive part is all the array unnesting, which only needs to happen for the actually selected cities.
On my workstation the above takes 1.8s after adding the limit 5.
Exactly. The issue is that postgresql takes 37 bytes per row normally, which then causes it to spill out of RAM on the limited VM specified for this challenge, causing the query to be I/O bound, hence the array representation and unnesting to fit it back into RAM. I'm guessing your machine has more RAM ?
> Exactly. The issue is that postgresql takes 37 bytes per row normally, which then causes it to spill out of RAM on the limited VM specified for this challenge, causing the query to be I/O bound, hence the array representation and unnesting to fit it back into RAM. I'm guessing your machine has more RAM ?
It does, but I restarted postgres and cleared the OS cache.
Btw, the primary bottleneck for the array-ified query is the unnest() handling in the functions. The minimal thing would be to make the functions faster, e.g. via:
CREATE OR REPLACE FUNCTION array_avg(_data anyarray)
RETURNS numeric IMMUTABLE PARALLEL SAFE LANGUAGE sql AS
$$
select avg(a)
from (SELECT unnest(_data) as a)
$$;
But that way the unnest is still done 3x. Something like this does the unnest only once:
SELECT a_min, a_max, a_avg FROM temps_by_city, LATERAL (SELECT min(u) a_min, max(u) a_max, avg(u) a_avg FROM (SELECT unnest(array_agg) u)) limit 5;
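Presumably with the city column and an ORDER BY added back for the full output, that becomes roughly:
SELECT city, a_min, a_avg, a_max
FROM temps_by_city,
     LATERAL (SELECT min(u) a_min, max(u) a_max, avg(u) a_avg
              FROM (SELECT unnest(array_agg) u) s) x
ORDER BY city LIMIT 5;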