
MySQL stores table data in a b+ tree where updates modify the data in place as transactions commit, and the overwritten data is moved to a secondary undo log to support consistent reads. MySQL's indexes store primary keys, and queries rely on tree traversal to find the row in the b+ tree, which can also contain references to older row versions in the undo log.
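To make that concrete, here's a minimal sketch of the InnoDB side (table and column names are made up): the primary key is the clustered b+ tree, and a secondary index stores the primary key value rather than a physical row address.

  -- InnoDB: rows live in the clustered index, keyed by the primary key
  CREATE TABLE users (
      id    BIGINT PRIMARY KEY,   -- clustered b+ tree key
      email VARCHAR(255),
      name  VARCHAR(255),
      KEY idx_email (email)       -- secondary index stores (email, id)
  ) ENGINE=InnoDB;

  -- A lookup through the secondary index finds the id, then traverses
  -- the clustered b+ tree to reach the full row (and, for older
  -- snapshots, follows its pointer into the undo log).
  SELECT name FROM users WHERE email = 'a@example.com';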

PostgreSQL tables are known as heaps, which consist of slotted pages where new data is written to the first page with sufficient free space. Since the heap isn't a b-tree, a row can't be resolved from just its primary key without a table scan, so Postgres uses the row's physical location, called a tuple ID (TID, or item pointer), which contains the page and the position (slot) of the row within that page. The TID (10, 3) tells Postgres the row is in block 10, slot 3, and it can be fetched directly from the buffer pool or disk without a tree traversal.
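You can see the TID for yourself via the hidden ctid system column (using a hypothetical users table on the Postgres side):

  -- ctid is the row's physical address in the heap: (block, slot)
  SELECT ctid, id, email FROM users WHERE id = 42;
  -- a ctid of (10,3) means block 10, slot 3, as in the example above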

When PostgreSQL updates a row, it doesn’t modify the original data directly. Instead, it:

  1) Writes a new version of the row to a new page
  2) Marks the old row as outdated by updating its tuple header and relevant page metadata
  3) Updates the visibility map to indicate that the page contains outdated rows
  4) Adjusts indexes to point to the new TID of the updated row
This means that indexes typically need to be updated even when the indexed column values didn't change (the sketch below shows the new row version landing at a new TID).
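A quick way to watch this happen, again with the hypothetical users table:

  SELECT ctid FROM users WHERE id = 42;   -- e.g. (10,3)
  UPDATE users SET name = 'new name' WHERE id = 42;
  SELECT ctid FROM users WHERE id = 42;   -- a different TID now; the old
                                          -- version stays behind as a dead
                                          -- tuple until VACUUM reclaims it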

Old row versions continue to accumulate in the heap until the VACUUM process permanently deletes them, but vacuuming competes with normal operations for I/O, and if it falls behind it leads to table bloat.
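You can watch the dead versions pile up and get reclaimed using the standard stats view pg_stat_user_tables (the table name is still hypothetical):

  -- dead row versions accumulated so far, and when autovacuum last ran
  SELECT relname, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
  WHERE relname = 'users';

  -- reclaim them by hand (VERBOSE prints per-table progress)
  VACUUM (VERBOSE) users;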

Overall this means Postgres does more disk I/O than MySQL for the same update-heavy work. The upside is Postgres doesn't have to worry about page splits, so things like bulk inserts can be much more efficient.


> The upside is Postgres doesn't have to worry about page splits, so things like bulk inserts can be much more efficient.

Not in the heap, but if you have any index on the table (I know, don’t do that for bulk loads, but many don’t / it isn’t feasible sometimes) then you’re still dealing with a B+tree (probably).

Also, MySQL still gets the nod for pure bulk load speed via MySQL Shell's Parallel Import Utility [0]. You can of course replicate this in Postgres by manually splitting the input file and running multiple \COPY commands, but having a tool do it all in one is lovely.

[0]: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-uti...
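Roughly what the manual Postgres version looks like (file names, database name, and the 4-way split are arbitrary; the split -n l/4 syntax is GNU split):

  # split the input at line boundaries, then run one \copy per chunk in parallel
  split -n l/4 data.csv chunk_
  for f in chunk_*; do
    psql -d mydb -c "\copy users FROM '$f' WITH (FORMAT csv)" &
  done
  wait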


> then you’re still dealing with a B+tree

Absolutely, though they're generally orders of magnitude smaller than the table file unless you're INCLUDE'ing lots of columns.
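For reference, that's the Postgres covering-index syntax (index and column names made up):

  -- INCLUDE'd columns are stored in the index leaf pages but aren't part
  -- of the key; each one grows the index on disk
  CREATE INDEX idx_users_email ON users (email) INCLUDE (name);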

There's pg_bulkload, which supports parallel writers as well as deferring index updates until the load completes. Not sure how it compares to what MySQL offers out of the box, but I definitely agree that the MySQL tooling ecosystem in general has a leg up.
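For the curious, a pg_bulkload run is driven by a control file; this is a rough sketch from memory, so treat the exact keys and values as assumptions to verify against the pg_bulkload docs:

  # users.ctl -- hypothetical control file
  INPUT  = /data/users.csv        # source file
  OUTPUT = public.users           # target table
  TYPE   = CSV
  WRITER = PARALLEL               # parallel writer mode

  $ pg_bulkload users.ctl -d mydb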


That's a perfect explanation, thank you very much!


