One alternative approach I’ve been testing recently: load the entire CSV file into a single row of a single-column table in Postgres, then use a view to parse it into lines and values, and cache the result with a materialized view. 16 MB / 500k rows (Formula 1 lap times) takes only a few seconds to load and map to materialized view records.
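For anyone curious, a minimal sketch of what that looks like in Postgres (table, view, and column names are made up for illustration; the real schema and file path will differ, and header/quoting handling is omitted):

    -- Whole CSV file goes into one row of a one-column staging table.
    CREATE TABLE raw_laptimes (doc text);
    -- e.g. INSERT INTO raw_laptimes SELECT pg_read_file('/path/to/lap_times.csv');

    -- A view splits the blob into lines, then lines into fields.
    CREATE VIEW laptimes_parsed AS
    SELECT split_part(line, ',', 1) AS race_id,
           split_part(line, ',', 2) AS driver_id,
           split_part(line, ',', 3) AS lap,
           split_part(line, ',', 4) AS lap_time
    FROM raw_laptimes,
         regexp_split_to_table(doc, '\n') AS line
    WHERE line <> '';

    -- Cache the parsed result so it is only computed once.
    CREATE MATERIALIZED VIEW laptimes AS SELECT * FROM laptimes_parsed;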
The improvements from using a transaction are familiar to me and make intuitive sense. But can someone explain why using a prepared statement results in a roughly 5x improvement? Parsing the very simple SQL doesn't seem like it would account for much time, so is the extra time spent redoing query planning or something else?
>Parsing the very simple SQL doesn't seem like it would account for much time, so is the extra time spent redoing query planning or something else?
If you're inserting one million rows, even 5 microseconds of parse and planning time per query is five extra seconds on a job that could be done in half a second.
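And the prepared statement is exactly what removes that per-row cost: parse + plan happens once, and each row only binds and executes. SQLite exposes this through its C API (sqlite3_prepare_v2, then sqlite3_bind_*/sqlite3_step/sqlite3_reset in a loop) rather than through SQL, but purely as an illustration of the idea, here is the SQL-level equivalent in Postgres:

    -- Parsed and planned once:
    PREPARE ins (integer) AS
        INSERT INTO integers (value) VALUES ($1);

    BEGIN;
    EXECUTE ins(1);   -- each EXECUTE reuses the prepared statement instead of re-parsing
    EXECUTE ins(2);
    EXECUTE ins(3);
    -- ... one EXECUTE per row ...
    COMMIT;

    DEALLOCATE ins;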
Depends on the complexity of the planning, but... typically it's the planning (speaking for MSSQL; I'm not familiar with SQLite).
Parsing is linear; planning gets exponential very quickly. The planner has to consider data distribution, presence or absence of indexes, output ordering, foreign keys, uniqueness, and plenty more I can't think of right now.
So, planning is much heavier than parsing (for any non-trivial query).
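One way to get a feel for the split (in Postgres, at least; a hypothetical example, not from the article): EXPLAIN ANALYZE reports planning and execution time separately.

    EXPLAIN ANALYZE
    SELECT * FROM integers WHERE value = 42;
    -- The output ends with separate "Planning Time: ..." and
    -- "Execution Time: ..." lines, so you can see how much of a cheap
    -- statement is pure planning overhead.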
A very simple statement may not be as simple as it looks to execute.
The table here is unconstrained (from the article "CREATE TABLE integers (value INTEGER);") but suppose that table had a primary key and a foreign key – a simple insertion of a single value into such a table would look trivial but consider what the query plan would look like as it verifies the PK and FK aren't violated. And maybe there are a couple of indexes to be updated as well. And a check constraint or three. Suddenly a simple INSERT of a literal value becomes quite involved under the skin.
(edit: and you can add possible triggers on the table)
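As a rough illustration (made-up SQLite DDL, not the article's schema), the same one-column insert against a table like this has to check a PK, look up a FK, evaluate a CHECK, maintain an extra index, and fire a trigger:

    PRAGMA foreign_keys = ON;

    CREATE TABLE batches   (id INTEGER PRIMARY KEY);
    CREATE TABLE audit_log (inserted_value INTEGER);

    CREATE TABLE integers (
        id    INTEGER PRIMARY KEY,
        value INTEGER NOT NULL CHECK (value >= 0),
        batch INTEGER REFERENCES batches(id)
    );
    CREATE INDEX integers_value_idx ON integers (value);

    CREATE TRIGGER integers_audit AFTER INSERT ON integers
    BEGIN
        INSERT INTO audit_log (inserted_value) VALUES (NEW.value);
    END;

    -- INSERT INTO integers (id, value, batch) VALUES (1, 42, 1);
    -- now does far more work than against the unconstrained original.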
One big caveat to this sort of test: with DBs you are very often trading throughput for latency.
Meaning, it's certainly possible to get "the absolute fastest inserts possible", but at the same time you are impacting both other readers of that table AND writers across the DB.
This also gets messier when you are talking about multiple instances.
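One common way to make that trade in the other direction (a sketch only; the batch size is arbitrary) is to commit in smaller batches, giving up some raw insert throughput so locks are held for shorter stretches and concurrent readers/writers stall less:

    BEGIN;
    INSERT INTO integers (value) VALUES (1);
    -- ... more rows, up to some batch size (say 10,000) ...
    COMMIT;  -- other sessions get a turn here

    BEGIN;
    INSERT INTO integers (value) VALUES (10001);
    -- ... next batch ...
    COMMIT;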
With SQLite, by default, nobody else can even read the database while it's being written, so your comment would be better directed at a conventional database server.
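For completeness (and assuming the article sticks with the defaults): that writer-blocks-readers behaviour comes from SQLite's default rollback-journal mode; switching the database to write-ahead logging lets readers keep reading a consistent snapshot while a single writer writes.

    -- Run once per database; the setting is persistent.
    PRAGMA journal_mode = WAL;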