1:05 in PostgreSQL 16, but it was harder than I thought to saturate the CPUs and not be disk-bound. Also, I ran on GCP, not Hetzner, so maybe different hardware.
SQL in theory makes this trivial: it handles many of the big optimizations and, looking at the repo, cuts 1000+ LOC down to a handful. Modern SQL engines handle everything for you, which is the whole damned point of SQL. Any decent engine will handle parallelism, caching, I/O, etc. Some exotic engines can leverage the GPU, but for N=1 billion I doubt a GPU will be faster. Here's the basic query:
SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps GROUP BY 1 ORDER BY 1;
In practice, generic SQL engines like PostgreSQL bloat the storage, which is a Big Problem for queries like this - in my test, even with INT2 normalization (see below), pgsql took 37 bytes per record, which is insane (23+ bytes of overhead to support transactions: https://www.postgresql.org/docs/current/storage-page-layout....). The big trick is to use PostgreSQL arrays to store the data by city, which removes this overhead and shrinks the table from 34GB (doesn't fit in memory) to 2GB (which does).
The first optimization is to observe that the cardinality of cities is small, so they can be normalized into integers (INTEGER aka INT4), and that the temps can be as well (1 decimal of precision, stored as tenths). Using SMALLINT (aka INT2) is probably not faster on modern CPUs, but it uses less RAM, which is better both for caching on smaller systems and for cache hit rate on all systems. NUMERIC generally isn't faster or tighter on most engines.
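For the record, the normalization itself is trivial. A sketch, assuming the raw measurements were first loaded into a staging table raw(city text, temp numeric) - that staging table and its column names are hypothetical, not part of the setup below (which just generates synthetic data directly):
CREATE TABLE cities (id int2 PRIMARY KEY, name text UNIQUE);
INSERT INTO cities
    SELECT (row_number() OVER (ORDER BY name))::int2, name
    FROM (SELECT DISTINCT city AS name FROM raw) s;
CREATE TABLE temps_int2 (city int2, temp int2);
INSERT INTO temps_int2
    SELECT c.id, round(r.temp * 10)::int2  -- tenths of a degree
    FROM raw r JOIN cities c ON c.name = r.city;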
To see the query plan, use EXPLAIN:
postgres=# explain SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Limit  (cost=13828444.05..13828445.37 rows=5 width=38)
   ->  Finalize GroupAggregate  (cost=13828444.05..13828497.22 rows=200 width=38)
         Group Key: city
         ->  Gather Merge  (cost=13828444.05..13828490.72 rows=400 width=38)
               Workers Planned: 2
               ->  Sort  (cost=13827444.02..13827444.52 rows=200 width=38)
                     Sort Key: city
                     ->  Partial HashAggregate  (cost=13827434.38..13827436.38 rows=200 width=38)
                           Group Key: city
                           ->  Parallel Seq Scan on temps_int2  (cost=0.00..9126106.69 rows=470132769 width=4)
 JIT:
   Functions: 8
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(13 rows)
Sigh, pg16 is still pretty conservative about parallelism, so let's crank it up.
SET max_parallel_workers=16; set max_parallel_workers_per_gather=16;
SET min_parallel_table_scan_size=0; set min_parallel_index_scan_size=0;
SET parallel_setup_cost = 0; -- Reduce the cost threshold for parallel execution
SET parallel_tuple_cost = 0.001; -- Lower the cost per tuple for parallel execution
postgres=# explain SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
...
Workers Planned: 14
...
top(1) is showing that we're burying the CPU:
top - 10:09:48 up 22 min, 3 users, load average: 5.36, 1.87, 0.95
Tasks: 169 total, 1 running, 168 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.6 us, 4.8 sy, 0.0 ni, 7.8 id, 74.2 wa, 0.0 hi, 0.7 si, 0.0 st
MiB Mem : 32084.9 total, 258.7 free, 576.7 used, 31249.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30902.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1062 postgres 20 0 357200 227952 197700 D 17.9 0.7 9:35.88 postgres: 16/main: postgres postgres [local] SELECT
1516 postgres 20 0 355384 91232 62384 D 17.6 0.3 0:08.79 postgres: 16/main: parallel worker for PID 1062
1522 postgres 20 0 355384 93336 64544 D 17.6 0.3 0:08.53 postgres: 16/main: parallel worker for PID 1062
1518 postgres 20 0 355384 90148 61300 D 17.3 0.3 0:08.53 postgres: 16/main: parallel worker for PID 1062
1521 postgres 20 0 355384 92624 63776 D 17.3 0.3 0:08.54 postgres: 16/main: parallel worker for PID 1062
1519 postgres 20 0 355384 90440 61592 D 16.6 0.3 0:08.58 postgres: 16/main: parallel worker for PID 1062
1520 postgres 20 0 355384 92732 63884 D 16.6 0.3 0:08.49 postgres: 16/main: parallel worker for PID 1062
1517 postgres 20 0 355384 91544 62696 D 16.3 0.3 0:08.55 postgres: 16/main: parallel worker for PID 1062
Interestingly, when we match workers to CPU cores, %CPU drops to ~14%, i.e. we don't leverage the hardware.
OK enough pre-optimization, here's the baseline:
postgres=# SELECT city, MIN(temp), AVG(temp), MAX(temp) FROM temps_int2 GROUP BY 1 ORDER BY 1 limit 5;
city | min | avg | max
------+-----+----------------------+------
0 | 0 | 276.0853550961625011 | 1099
1 | 0 | 275.3679265859715333 | 1098
2 | 0 | 274.6485567539599619 | 1098
3 | 0 | 274.9825584419741823 | 1099
4 | 0 | 275.0633718875598229 | 1097
(5 rows)
Time: 140642.641 ms (02:20.643)
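Keep in mind the temps are stored as tenths of a degree and the cities as integer ids, so for human-readable output you'd scale back down (and join to a name table if you have one); roughly:
SELECT city, MIN(temp)/10.0 AS min_c, round(AVG(temp)/10.0, 1) AS mean_c, MAX(temp)/10.0 AS max_c
FROM temps_int2 GROUP BY 1 ORDER BY 1;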
I also tried to leverage a B-tree covering index (CREATE INDEX temps_by_city ON temps_int2 (city) INCLUDE (temp)),
but it wasn't faster - I killed the job after 4 minutes. Yes, I checked that pg16 used a parallel index-only scan
(SET random_page_cost = 0.0001; set min_parallel_index_scan_size=0; set enable_seqscan = false;
SET enable_parallel_index_scan = ON;) - top(1) showed ~1.7% CPU, suggesting that we were I/O bound.
Instead, to amortize the tuple overhead, we can store the data as arrays:
CREATE TABLE temps_by_city AS SELECT city, array_agg(temp) from temps_int2 group by city;
$ ./table_sizes.sh
...total... | 36 GB
temps_int2 | 34 GB
temps_by_city | 1980 MB
Yay, it now fits in RAM.
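The table_sizes.sh helper isn't shown; the underlying query is presumably something along the lines of:
SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
FROM pg_class
WHERE relkind = 'r' AND relnamespace = 'public'::regnamespace
ORDER BY pg_total_relation_size(oid) DESC;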
-- https://stackoverflow.com/a/18964261/430938 adding IMMUTABLE PARALLEL SAFE
CREATE OR REPLACE FUNCTION array_min(_data ANYARRAY) RETURNS NUMERIC AS $$
SELECT min(a) FROM UNNEST(_data) AS a
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
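The queries below also need array_max and array_avg, presumably defined along the same lines (a sketch):
CREATE OR REPLACE FUNCTION array_max(_data ANYARRAY) RETURNS NUMERIC AS $$
SELECT max(a) FROM UNNEST(_data) AS a
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
CREATE OR REPLACE FUNCTION array_avg(_data ANYARRAY) RETURNS NUMERIC AS $$
SELECT avg(a) FROM UNNEST(_data) AS a
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;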
SET max_parallel_workers=16; set max_parallel_workers_per_gather=16;
SET min_parallel_table_scan_size=0; set min_parallel_index_scan_size=0;
SET parallel_setup_cost = 0; -- Reduce the cost threshold for parallel execution
SET parallel_tuple_cost = 0.001; -- Lower the cost per tuple for parallel execution
postgres=# create table tmp1 as select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
SELECT 1001
Time: 132616.944 ms (02:12.617)
postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
 Sort  (cost=338.31..338.81 rows=200 width=98)
   Sort Key: city
   ->  Finalize HashAggregate  (cost=330.14..330.66 rows=200 width=98)
         Group Key: city
         ->  Gather  (cost=321.52..322.64 rows=600 width=98)
               Workers Planned: 3
               ->  Partial HashAggregate  (cost=321.52..322.04 rows=200 width=98)
                     Group Key: city
                     ->  Parallel Seq Scan on temps_by_city  (cost=0.00..0.04 rows=423 width=34)
(9 rows)
Ah, only using 3 cores of the 8...
---
To give the planner more rows to parallelize over, we can split each city's array into buckets (here via temp % 1000):
CREATE TABLE temps_by_city3 AS SELECT city, temp % 1000, array_agg(temp) from temps_int2 group by 1,2;
postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city3 group by 1 order by 1 ;
                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=854300.99..854378.73 rows=200 width=98)
   Group Key: city
   ->  Gather Merge  (cost=854300.99..854348.73 rows=2200 width=98)
         Workers Planned: 11
         ->  Sort  (cost=854300.77..854301.27 rows=200 width=98)
               Sort Key: city
               ->  Partial HashAggregate  (cost=854290.63..854293.13 rows=200 width=98)
                     Group Key: city
                     ->  Parallel Seq Scan on temps_by_city3  (cost=0.00..98836.19 rows=994019 width=34)
 JIT:
   Functions: 7
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(12 rows)
This 100% saturates the CPU and runs in ~65 secs (about 2x faster).
---
Creating the data
Here's a very fast, lousy first pass for PostgreSQL that works on most versions - for pg16, there's random_normal(). I'm not on Hetzner, so I used GCP (c2d-standard-8 with 8 vCPUs, 32GB RAM, and a 150GB disk, running Ubuntu 22.04 and PostgreSQL 16):
CREATE TABLE temps_int2 (city int2, temp int2);
-- random()*random() is a cheap distribution. 1100 = 110 * 10 to provide one decimal of precision.
INSERT INTO temps_int2 (city, temp) SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp from generate_series(1,1e9)i;
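If you want a less bogus distribution on pg16, the random_normal() mentioned above can be dropped in - the mean/stddev here are made up (in tenths of a degree):
-- pg16+ only; e.g. mean 15.0 C, stddev 10.0 C, stored as tenths
INSERT INTO temps_int2 (city, temp)
SELECT (1000*random())::int2 AS city, round(random_normal(150, 100))::int2 AS temp
FROM generate_series(1, 1e9::int8) i;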
FWIW, you can make the data loading a decent bit faster:
1) use the integer version of generate_series - the 1e9 literal causes the non-integer version to be chosen
2) move the generate_series() into the select list of a subselect - for boring reasons that we should fix, the FROM version materializes the result first
3) using COPY is much faster, however a bit awkward to write
psql -Xq -c 'COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e9::int8))) TO STDOUT WITH BINARY' | psql -Xq -c 'COPY temps_int2 FROM STDIN WITH BINARY'
psql -c 'create table temps_int2_copy as select * from temps_int2 where 1=0;'
psql -Xq -c 'COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e8::int8))) TO STDOUT WITH BINARY' | psql -Xq -c 'COPY temps_int2_copy FROM STDIN WITH BINARY'
==> 32sec
psql -c 'create table temps_int2_copy2 as select * from temps_int2_copy where 1=0;'
psql -c 'INSERT INTO temps_int2_copy2 (city, temp) SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp from generate_series(1,1e8::int)i;'
==> 90sec
psql -c 'create UNLOGGED table temps_int2_copy3 as select * from temps_int2_copy where 1=0;'
psql -c 'INSERT INTO temps_int2_copy3 (city, temp) SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp from generate_series(1,1e8::int)i;'; date
==> 45sec (faster, but still not as fast as COPY!)
Of course, if we really want to "go fast" then we want parallel loading, which means firing up N postgresql backends, having each generate and write its data concurrently to a different table, then having each compute a partial summary into one of N summary tables, and finally merging those together.
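The merge at the end is the only subtle part: MIN and MAX compose directly, but AVG has to be rebuilt from SUM and COUNT rather than by averaging the partial averages. A sketch of that final step, assuming each backend wrote a partial summary table summary_part_N(city, min_t, sum_t, n, max_t) (hypothetical names):
SELECT city,
       min(min_t) AS min,
       sum(sum_t)::numeric / sum(n) AS avg,
       max(max_t) AS max
FROM (
    SELECT * FROM summary_part_1
    UNION ALL
    SELECT * FROM summary_part_2
    -- UNION ALL ... one branch per backend
) u
GROUP BY 1 ORDER BY 1;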
echo "COPY temps_int2_copy_p1 from '/var/lib/postgresql/output100m.txt' with csv;" | /usr/lib/postgresql/16/bin/postgres --single -D /etc/postgresql/16/main/ postgres
==> 32sec (saturates one core)
The ultimate would be to hack into postgres and skip everything and just write the actual filesystem files in-place using knowledge of the file formats, then "wire in" these files to the database system tables. This normally gets hairy (e.g. TOAST) but with a simple table like this, it might be possible. This is a project I've always wanted to try.
> wow! awesome tips. I knew COPY rocks but didn't realize it would win vs INSERT INTO SELECT FROM !
We (postgres) should fix that at some point... The difference basically is that there's a dedicated path for inserting many tuples at once that's often used by COPY but isn't used by INSERT INTO ... SELECT. The logic for determining when that optimization is correct (consider e.g. after-insert per-row triggers: the trigger invocation for row N may not yet see row N+1) is specific to COPY right now. We need to generalize it to be usable in more places.
To be fair, part of the reason the COPY approach is faster is that the generate_series() query uses a fair bit of CPU on its own, and the piped psqls let the data generation and the data loading run separately. Of course, we partially pay for that by having to serialize/deserialize the data and handle all of it across four processes.
When doing the COPYs separately to/from a file, it actually takes longer to generate the data than loading the data into an unlogged table.
# COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e8::int8))) TO '/tmp/data.pgcopy' WITH BINARY;
COPY 100000000
Time: 21560.956 ms (00:21.561)
# BEGIN;DROP TABLE IF EXISTS temps_int2; CREATE UNLOGGED TABLE temps_int2 (city int2 NOT NULL, temp int2 NOT NULL); COPY temps_int2 FROM '/tmp/data.pgcopy' WITH BINARY;COMMIT;
BEGIN
Time: 0.128 ms
DROP TABLE
Time: 0.752 ms
CREATE TABLE
Time: 0.609 ms
COPY 100000000
Time: 18874.010 ms (00:18.874)
COMMIT
Time: 229.650 ms
Loading into a logged table is a bit slower, at 20250.835 ms.
> Of course, if we really want to "go fast" then we want parallel loading, which means firing up N postgresql backends, having each generate and write its data concurrently to a different table, then having each compute a partial summary into one of N summary tables, and finally merging those together.
With PG >= 16, you need a fair bit of concurrency to hit bottlenecks due to multiple backends loading data into the same table with COPY. On my ~4 year old workstation I reach over 3GB/s, with a bit more work we can get higher. Before that the limit was a lot lower.
If I use large enough shared buffers so that IO does not become a bottleneck, I can load the 1e9 rows fairly quickly in parallel, using pgbench:
c=20; psql -Xq -c "COPY (SELECT (1000*random())::int2 as city, (random()*random()*1100)::int2 temp FROM (SELECT generate_series(1,1e6::int8))) TO '/tmp/data-1e6.pgcopy' WITH BINARY;" -c 'DROP TABLE IF EXISTS temps_int2; CREATE UNLOGGED TABLE temps_int2 (city int2 NOT NULL, temp int2 NOT NULL); ' && time pgbench -c$c -j$c -n -f <( echo "COPY temps_int2 FROM '/tmp/data-1e6.pgcopy' WITH BINARY;" ) -t $((1000/${c})) -P1
real 0m26.486s
That's just 1.2GB/s, because with rows this narrow the bottleneck is the per-row and per-field processing.
> The ultimate would be to hack into postgres and skip everything and just write the actual filesystem files in-place using knowledge of the file formats, then "wire in" these files to the database system tables. This normally gets hairy (e.g. TOAST) but with a simple table like this, it might be possible. This is a project I've always wanted to try.
I doubt that will ever be a good idea. For one, the row metadata contains transactional information that would be hard to create correctly outside of postgres. It'd also be too easy to cause issues with corrupted data.
However, there's a lot we could do to speed up data loading further. The parsing that COPY does, uhm, shows signs of iterative development over decades. Absurdly enough, that's where the bottleneck most commonly is right now. I'm reasonably confident that there's at least a 3-4x improvement possible without going to particularly extreme lengths. I think there's also at least a not-too-hard 2x in the portion that actually loads the data into the table.
thx! Indeed, upon reading the source and sleeping on it, I agree, and in fact it looks like a single-user postgres backend with COPY FROM <filename> BINARY is approximately the same architecture as writing the database files in-place, and of course includes support for default values, triggers, constraints, TOAST and more.
I've reproduced the speed difference between pg COPY and various stripped-down cases.
Some results (middle result of 3 stable runs) from 1.4GB BINARY dump:
echo "drop table if exists tbl; create table tbl(city int2, temp int2);copy tbl FROM '/citydata.bin' binary;" | ./pg/bin/postgres --single -D tmp -p 9999 postgres;
real 0m34.508s
# switching to unlogged table
real 0m30.620s
# hardcoding heap_multi_insert() to be a NOOP (return early if ntuples>100)
# fyi, heap_multi_insert() gets called with n=1000 tuples per call
real 0m11.276s
# hardcoding skip_tuple = true in src/backend/commands/copyfrom.c:1142
real 0m6.894s
# after testing various things
time sh -c "tar cf - citydata.bin | (cd /tmp; tar xf -)"
real 0m2.811s
Note: I tried increasing the blocksize (--with-blocksize) and also MAX_BUFFERED_TUPLES (copyfrom.c:65), but as expected they didn't help; I guess n=1000 tuples already amortizes the overhead.
I think it actually shows that you're IO bound (the 'D' in the 'S' column). On my workstation the query takes ~10.9s after restarting postgres and dropping the os caches. And this is a four year old CPU that wasn't top of the line at the time either.
> postgres=# explain select city, min(array_min(array_agg)), avg(array_avg(array_agg)), max(array_max(array_agg)) from temps_by_city group by 1 order by 1 ;
Note that you dropped the limit 5 here. This makes the query a good bit slower; the expensive part is all the array unnesting, which only needs to happen for the actually selected cities.
On my workstation the above takes 1.8s after adding the limit 5.
Exactly. The issue is that postgresql takes 37 bytes per row normally, which then causes it to spill out of RAM on the limited VM specified for this challenge, causing the query to be I/O bound, hence the array representation and unnesting to fit it back into RAM. I'm guessing your machine has more RAM ?
> Exactly. The issue is that postgresql takes 37 bytes per row normally, which then causes it to spill out of RAM on the limited VM specified for this challenge, causing the query to be I/O bound, hence the array representation and unnesting to fit it back into RAM. I'm guessing your machine has more RAM ?
It does, but I restarted postgres and cleared the OS cache.
Btw, the primary bottleneck for the array-ified query is the unnest() handling in the functions. The minimal thing would be to make the functions faster, e.g. via:
CREATE OR REPLACE FUNCTION array_avg(_data anyarray)
RETURNS numeric IMMUTABLE PARALLEL SAFE LANGUAGE sql AS
$$
select avg(a)
from (SELECT unnest(_data) as a)
$$;
But that way the unnest is still done 3x. Something like this does the unnest only once:
SELECT a_min, a_max, a_avg FROM temps_by_city, LATERAL (SELECT min(u) a_min, max(u) a_max, avg(u) a_avg FROM (SELECT unnest(array_agg) u)) limit 5;
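Presumably with the city column and an ORDER BY added back for the full output, that becomes roughly:
SELECT city, a_min, a_avg, a_max
FROM temps_by_city,
     LATERAL (SELECT min(u) a_min, max(u) a_max, avg(u) a_avg
              FROM (SELECT unnest(array_agg) u) s) x
ORDER BY city LIMIT 5;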