One alternative approach I’ve been testing recently: load the entire CSV file into a single row of a single-column table in Postgres, then use a view to parse it into lines and values, and cache the result with a materialized view. 16 MB / 500k rows (Formula 1 lap times) takes only a few seconds to load and map to materialized view records.
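For anyone curious, a minimal sketch of what that looks like in Postgres (table, view, and column names are made up for illustration; the real schema and file path will differ, and header/quoting handling is omitted):

    -- Whole CSV file goes into one row of a one-column staging table.
    CREATE TABLE raw_laptimes (doc text);
    -- e.g. INSERT INTO raw_laptimes SELECT pg_read_file('/path/to/lap_times.csv');

    -- A view splits the blob into lines, then lines into fields.
    CREATE VIEW laptimes_parsed AS
    SELECT split_part(line, ',', 1) AS race_id,
           split_part(line, ',', 2) AS driver_id,
           split_part(line, ',', 3) AS lap,
           split_part(line, ',', 4) AS lap_time
    FROM raw_laptimes,
         regexp_split_to_table(doc, '\n') AS line
    WHERE line <> '';

    -- Cache the parsed result so it is only computed once.
    CREATE MATERIALIZED VIEW laptimes AS SELECT * FROM laptimes_parsed;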
The improvements from using a transaction are familiar to me and make intuitive sense. But can someone explain why using a prepared statement results in a roughly 5x improvement? Parsing the very simple SQL doesn't seem like it would account for much time, so is the extra time spent redoing query planning or something else?
>Parsing the very simple SQL doesn't seem like it would account for much time, so is the extra time spent redoing query planning or something else?
If you're inserting one million rows, even 5 microseconds of parse and planning time per query is five extra seconds on a job that could be done in half a second.
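And the prepared statement is exactly what removes that per-row cost: parse + plan happens once, and each row only binds and executes. SQLite exposes this through its C API (sqlite3_prepare_v2, then sqlite3_bind_*/sqlite3_step/sqlite3_reset in a loop) rather than through SQL, but purely as an illustration of the idea, here is the SQL-level equivalent in Postgres:

    -- Parsed and planned once:
    PREPARE ins (integer) AS
        INSERT INTO integers (value) VALUES ($1);

    BEGIN;
    EXECUTE ins(1);   -- each EXECUTE reuses the prepared statement instead of re-parsing
    EXECUTE ins(2);
    EXECUTE ins(3);
    -- ... one EXECUTE per row ...
    COMMIT;

    DEALLOCATE ins;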
Depends on the complexity of the planning, but... typically it's the planning (speaking for MSSQL; I'm not familiar with SQLite).
Parsing is linear; planning gets exponential very quickly. The planner has to consider data distribution, presence or absence of indexes, output ordering, foreign keys, uniqueness, and plenty more I can't think of right now.
So, planning is much heavier than parsing (for any non-trivial query).
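One way to get a feel for the split (in Postgres, at least; a hypothetical example, not from the article): EXPLAIN ANALYZE reports planning and execution time separately.

    EXPLAIN ANALYZE
    SELECT * FROM integers WHERE value = 42;
    -- The output ends with separate "Planning Time: ..." and
    -- "Execution Time: ..." lines, so you can see how much of a cheap
    -- statement is pure planning overhead.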
A very simple statement may not be as simple as it looks to execute.
The table here is unconstrained (from the article "CREATE TABLE integers (value INTEGER);") but suppose that table had a primary key and a foreign key – a simple insertion of a single value into such a table would look trivial but consider what the query plan would look like as it verifies the PK and FK aren't violated. And maybe there are a couple of indexes to be updated as well. And a check constraint or three. Suddenly a simple INSERT of a literal value becomes quite involved under the skin.
(edit: and you can add possible triggers on the table)
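As a rough illustration (made-up SQLite DDL, not the article's schema), the same one-column insert against a table like this has to check a PK, look up a FK, evaluate a CHECK, maintain an extra index, and fire a trigger:

    PRAGMA foreign_keys = ON;

    CREATE TABLE batches   (id INTEGER PRIMARY KEY);
    CREATE TABLE audit_log (inserted_value INTEGER);

    CREATE TABLE integers (
        id    INTEGER PRIMARY KEY,
        value INTEGER NOT NULL CHECK (value >= 0),
        batch INTEGER REFERENCES batches(id)
    );
    CREATE INDEX integers_value_idx ON integers (value);

    CREATE TRIGGER integers_audit AFTER INSERT ON integers
    BEGIN
        INSERT INTO audit_log (inserted_value) VALUES (NEW.value);
    END;

    -- INSERT INTO integers (id, value, batch) VALUES (1, 42, 1);
    -- now does far more work than against the unconstrained original.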
One big caveat to this sort of test: with DBs you are very often trading throughput for latency.
Meaning, it's certainly possible to get "the absolute fastest inserts possible", but at the same time you are impacting both other readers of that table AND writers across the DB.
This also gets messier when you are talking about multiple instances.
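One common way to make that trade in the other direction (a sketch only; the batch size is arbitrary) is to commit in smaller batches, giving up some raw insert throughput so locks are held for shorter stretches and concurrent readers/writers stall less:

    BEGIN;
    INSERT INTO integers (value) VALUES (1);
    -- ... more rows, up to some batch size (say 10,000) ...
    COMMIT;  -- other sessions get a turn here

    BEGIN;
    INSERT INTO integers (value) VALUES (10001);
    -- ... next batch ...
    COMMIT;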
With SQLite, by default, nobody else can even read the database while it's being written, so your comment would be better directed at a conventional database server.
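For completeness (and assuming the article sticks with the defaults): that writer-blocks-readers behaviour comes from SQLite's default rollback-journal mode; switching the database to write-ahead logging lets readers keep reading a consistent snapshot while a single writer writes.

    -- Run once per database; the setting is persistent.
    PRAGMA journal_mode = WAL;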