> It’s probably also fast because you have a warm cache - e.g. there’s enough memory for the DB to have the indexes 100% in memory, which is just not feasible with large DBs in the real world, where you can easily have >100GB of indexes + hot data, and the DB can’t keep it all in memory.
That's the point? Sure, there is a scale where it's infeasible, but you can quite easily (albeit at a price) get DB instances with hundreds or thousands of GiB of RAM. And even if you can't fit everything into it, your working set is often not the size of the entire data set. My company's main MySQL primary has around 800 GiB of storage, and only about 1/3 of that in RAM. Disk read IOPS are usually 0.
Nevertheless, I recreated this in Postgres 15.
The total index size of the two tables is 790 MB according to `\di+*`, so I'll set `shared_buffers` to 263 MB (roughly a third of that, mirroring the RAM ratio above), then restart and re-run.
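(For anyone following along, one way to do that sizing step - I'm assuming the `ALTER SYSTEM` route here; editing `postgresql.conf` directly works just as well:)

postgres=# \di+ *
postgres=# ALTER SYSTEM SET shared_buffers = '263MB';

`shared_buffers` can only change at server start, hence the restart before re-running.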
For reference, time with current settings is about 6.8 msec for `is_primary`, and 1395 msec for `NOT is_primary`.
After a restart, I ran the query for `NOT is_primary` first, which took 1740 msec. The first run of the query with `is_primary` took 21 msec.
My DB is hosted on older hardware, and the files themselves live on NVMe drives presented via Ceph over 1GbE.
EDIT: I forgot that Postgres leans on the OS page cache for a lot of its caching, not just its own `shared_buffers`. Re-did this, running `sync; sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'` in between shutting down/starting up Postgres. 16726 msec and 79 msec, respectively. So yes, a lot slower, but (a) I don't think this is realistic for a production database, and (b) I'm still not clear on how you think JOINs enter into this. The slowdown comes entirely from having to run a table scan.
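(Spelled out, the cold-cache cycle is roughly the following - the service name is an assumption, adjust for however your Postgres is managed:)

sudo systemctl stop postgresql                    # shut Postgres down so shared_buffers is gone
sync; sync                                        # flush dirty pages to disk
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'    # drop the OS page cache plus dentries/inodes
sudo systemctl start postgresql                   # come back up with everything cold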
First off, ty for running all these benchmarks, above and beyond!
FWIW, I don't think joins are bad - I'm 100% for normalized DB schemas with joins. But I've done tonnes of performance work over the past ~10 years and run into a bunch of real-world cases where, when caches are cold (which does happen frequently with large datasets and limited budgets), queries similar to the above (join two tables, read a page of data, sorting on one table and filtering on the other) take tens of seconds, sometimes even minutes with very large datasets. In those cases, the only solution has been to denormalize so we can create a compound index, with filter key(s) as the prefix and the sort key as the suffix, which makes the query consistently fast. I'm not at all suggesting this as the default - better to default to normalized schemas with joins, and only do this for these specific cases. Generally a company has just one or a few cases where this is necessary, and can do the ugly denormalization + compound indexes for those few hot spots. But when people say "joins are slow", these cases are examples of where it's true.
Re: your above 17 second query, if you do something like adding fields to the user table for primary and secondary contact method (denormalizing), and then create a compound index with the necessary filter keys first, and the sort key second, I think you’ll find the query (which no longer needs a join) is quite fast even if caches are ice cold.
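Something along these lines is what I have in mind - the column and index names here are just illustrative, not your actual schema:

postgres=# ALTER TABLE users ADD COLUMN phone_primary text, ADD COLUMN phone_secondary text;
postgres=# CREATE INDEX users_sec_phone_created_at ON users (phone_secondary, created_at);

When the filter on the leading column(s) is an equality, the index hands rows back already ordered by `created_at`, so the page query needs neither a sort nor a join.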
Created a new table that contains `user_id, created_at, phone_primary, phone_secondary` and inserted all 10,200,000 rows. Notably (I'll come back to this), due to how the rows were generated, the primary key (`user_id`) is an unsorted integer - this was _not_ done with a serial or identity column.
postgres=# CREATE INDEX sec_phone_created_at ON hn_phone_new (phone_secondary, created_at) WHERE phone_secondary IS NOT NULL;
I reset `shared_buffers` down to the same as before - 263 MB - although the size of this index is tiny, < 10 MB, so realistically I can't shrink buffers down that far anyway. I then did the same `sync/drop cache` as before.
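The query being timed is the same paging query as before - shaped roughly like this (the column list, sort direction, and page size are illustrative, not the exact statement):

postgres=# SELECT user_id, phone_secondary, created_at
postgres-#   FROM hn_phone_new
postgres-#  WHERE phone_secondary IS NOT NULL
postgres-#  ORDER BY created_at
postgres-#  LIMIT 100;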
So yes, significant improvement as you'd expect. I then dropped the index and swapped the order:
postgres=# DROP INDEX sec_phone_created_at;
postgres=# CREATE INDEX created_at_sec_phone ON hn_phone_new (created_at, phone_secondary) WHERE phone_secondary IS NOT NULL;
Reset everything as before, and re-ran the same query:
Time: 221.392 ms
Thinking that, like MySQL, Postgres might have saved a portion of `shared_buffers` to disk and loaded it back in on restart (honestly, I don't know if Postgres does this), I attempted to flush it by running a few `SELECT COUNT(*)` queries on other, larger tables, then re-running the query.
Time: 365.961 ms
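(As an aside, a more direct way to check whether the index pages are sitting in `shared_buffers` would be the `pg_buffercache` extension - a sketch, not something I ran here:)

postgres=# CREATE EXTENSION IF NOT EXISTS pg_buffercache;
postgres=# SELECT count(*) AS cached_buffers
postgres-#   FROM pg_buffercache
postgres-#  WHERE relfilenode = pg_relation_filenode('created_at_sec_phone')
postgres-#    AND reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database());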
This is what `EXPLAIN VERBOSE` looks like for the original index:
Yeah, I believe which order is best (sortKey/filterKey or filterKey/sortKey) really depends on the specific data/queries - best to try both and pick whichever wins, which looks like sortKey/filterKey in this case :)
But I think this does show how, for specific problem data/queries, you sometimes do have to denormalize so that you can create the ideal compound index. The default should still be normalized schemas with joins, but if problem queries pop up like this - taking 10, 20, 30 seconds when caches are cold - compromising a bit on clean schemas/code for performance makes sense.
If you’re limited in RAM and can’t upsize, then yes, this does appear to be a good trade off. You can always refactor later and normalize if necessary.
BTW, although it wouldn't have helped with your specific benchmark schema's creation of TYPEs, I'll plug my genSQL tool [0] for generating random data. It's primarily designed around MySQL, but it can easily produce CSVs, which every DB can load.
It turns out that a lot of random() calls in most languages are painfully slow, so mine avoids that by (mostly) batching them in a C library. It should be able to create a million somethings in under 10 seconds on modern hardware in Python 3.11.