Show HN: pg-bulk-ingest – Now with multi-table support (github.com/uktrade)
18 points by michalc on Feb 12, 2024 | 6 comments
I (with others) made a Python utility for loading data into PostgreSQL incrementally from ETL-ish pipelines. Its API has supported ingesting into multiple tables at a sort of "structural" level from (more or less) the beginning, but didn't actually support it if you tried to do it. And I've been umming and ahhing about how best to do it. No way seemed perfect...

... but pushed by an actual use case, I finally made a decision and did it.




I have always used pg_bulkload. It's a bit of a pain to compile but the tool is really fast... I love how you can define what to do with constraint conflicts, etc.

Recently I used it to bulk import a billion rows and dedupe on a single column/constraint, throwing out the rows that conflicted. It did it in like two hours.

see: https://ossc-db.github.io/pg_bulkload/pg_bulkload.html


I didn't know about this - thanks!!


Thanks for sharing your tool!

I think it would be best to illustrate the README with a problem it tries to solve: currently I don't understand why a simple set of INSERT operations wouldn't suffice.
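Something like this is what I have in mind (illustrative only - the table and connection details are made up):

    import psycopg  # psycopg 3

    rows = [(1, "a"), (2, "b")]

    with psycopg.connect("postgresql://postgres@127.0.0.1:5432/postgres") as conn:
        with conn.cursor() as cur:
            # Plain batched INSERTs: simple, transactional, no extra library needed
            cur.executemany(
                "INSERT INTO my_table (id, value) VALUES (%s, %s)",
                rows,
            )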

Also, the phrase "ingest data into Postgres" sounds wrong: it would be Postgres, not your tool, that would ingest data (if it could); your tool should be described as the one "putting data into Postgres". And thus you have probably named your tool wrong...


Ah the name - you're not the first to mention it! Do you (or anyone lurking...) have any suggestions as to what it might better be called?

On what it does/why it exists, we've kept the README quite light to avoid duplication, with the main bits of the docs at https://pg-bulk-ingest.docs.trade.gov.uk/

But to try to answer the question here:

A set of INSERT statements - there are lots of cases where this would be fine, and pg-bulk-ingest (/its future name ;-) would be unnecessary, so you might as well just use them.

But there are lots of things that pg-bulk-ingest does that a set of INSERT statements doesn't:

- It uses COPY, which in many cases is (much?) faster than INSERT (rough sketch of what that looks like after this list)

- It creates the table if it doesn't exist

- It migrates an existing table to match the definition if it does exist (pg-bulk-ingest can't handle all possible changes, but enough to make it useful to us)

- And it migrates without holding a long ACCESS EXCLUSIVE lock on the table

- It offers a fairly structured way to do high-watermarked ingest - it stores and retrieves a high watermark so you can fetch the data "since" some previous ingest, with the exact meaning of "since" custom to each source you're ingesting from.

- And it (now) combines COPYs across multiple tables, without loading all incoming data into memory at once, while keeping the ingest within a single transaction, so other clients see all the data or none of it.
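
To make the COPY point above concrete, this is roughly the pattern (a hand-written psycopg sketch for illustration - not pg-bulk-ingest's actual internals, and the table is made up):

    import psycopg  # psycopg 3

    rows = [(1, "a"), (2, "b")]

    with psycopg.connect("postgresql://postgres@127.0.0.1:5432/postgres") as conn:
        with conn.cursor() as cur:
            # One COPY streams all the rows in a single protocol-level operation,
            # avoiding the per-statement overhead of issuing many INSERTs
            with cur.copy("COPY my_table (id, value) FROM STDIN") as copy:
                for row in rows:
                    copy.write_row(row)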

(The high watermarking is, I think, maybe the least clear of these in the docs, while also being one of the most important. Would really like a standalone example that makes it clear what it does...)
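In the meantime, a condensed sketch of what a high-watermarked ingest into two tables looks like (details trimmed, the tables and connection string are made up - the docs above have the full argument list):

    import sqlalchemy as sa
    from pg_bulk_ingest import HighWatermark, Upsert, Delete, ingest

    engine = sa.create_engine('postgresql+psycopg://postgres@127.0.0.1:5432/')

    # SQLAlchemy definitions of the target tables - created if missing,
    # migrated to match if they already exist
    metadata = sa.MetaData()
    products = sa.Table(
        "products", metadata,
        sa.Column("id", sa.INTEGER, primary_key=True),
        sa.Column("name", sa.VARCHAR, nullable=False),
        schema="my_schema",
    )
    orders = sa.Table(
        "orders", metadata,
        sa.Column("id", sa.INTEGER, primary_key=True),
        sa.Column("product_id", sa.INTEGER, nullable=False),
        schema="my_schema",
    )

    # Called with the high watermark stored by the previous ingest (None on the
    # first run). Yields batches of (new high watermark, batch metadata, rows),
    # each row paired with the table it belongs to - which is how a single
    # ingest now covers multiple tables
    def batches(high_watermark):
        # In a real pipeline you would fetch from the source "since" high_watermark
        yield '2024-02-12', None, (
            (products, (1, 'Product A')),
            (orders, (1, 1)),
        )

    with engine.connect() as conn:
        ingest(
            conn, metadata, batches,
            high_watermark=HighWatermark.LATEST,
            upsert=Upsert.IF_PRIMARY_KEY,
            delete=Delete.OFF,
        )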


pg_alchemy_objects_writer


Anyone use json_to_recordset? I’ve used it in the past to insert a few billion rows and was pretty happy with it
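
Rough shape of it, for anyone who hasn't seen it (hand-rolled illustration, the table is made up):

    import json
    import psycopg  # psycopg 3

    rows = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

    with psycopg.connect("postgresql://postgres@127.0.0.1:5432/postgres") as conn:
        with conn.cursor() as cur:
            # json_to_recordset expands a JSON array of objects into rows, so a
            # whole batch travels as a single parameter in a single statement
            cur.execute(
                """
                INSERT INTO my_table (id, value)
                SELECT id, value
                FROM json_to_recordset(%s::json) AS t(id int, value text)
                """,
                (json.dumps(rows),),
            )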



