Batch size one billion: SQLite insert speedups, from the useful to the absurd (voidstar.tech)
104 points by thunderbong on Sept 26, 2023 | 27 comments



One alternative way I’ve been testing recently: load the entire csv file into a single row column in Postgres, and then use a view to parse it into lines and values. Then cache the result with a materialized view. 16 MB, 500k rows (Formula 1 lap times) takes only a few seconds to load and map to materialized view records.
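
A minimal sketch of what that can look like, assuming a staging table raw_csv and made-up lap-time columns (driver, lap, lap_time):

  -- staging table: the entire csv file goes into one text column
  CREATE TABLE raw_csv (doc text);

  -- view that splits the blob into lines, and the lines into values
  CREATE VIEW laps AS
  SELECT split_part(line, ',', 1)      AS driver,
         split_part(line, ',', 2)::int AS lap,
         split_part(line, ',', 3)      AS lap_time
  FROM raw_csv,
       regexp_split_to_table(doc, E'\n') AS line
  WHERE line <> '';

  -- cache the parsed rows
  CREATE MATERIALIZED VIEW laps_cached AS SELECT * FROM laps;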


What do you mean by "row column in Postgres"? It's confusing. Could you please clarify that?


Sounds like the initial table has just one row and one column, containing the entire 16mb csv string.


Correct; at least, one row for every csv file you load.


That's a fantastic analysis. I wonder how PostgreSQL performs?


I've done some comparisons for inserts of various data sizes on mariadb, postgres, and sqlite without indexes.

I didn't have OP's patience to push it to 1B, though.

https://github.com/eatonphil/databases-intuition


Looks like PGSQL is ~2x slower for loading.


Looks like I'm not alone. But I don't claim to be a Postgres expert, and I don't think Mark does either (he's a MySQL guy, I believe).

http://smalldatum.blogspot.com/2023/09/trying-out-orioledb.h...


PostgreSQL supports heap tables, which should blow away SQLite's mandatory clustered index tables in an unindexed insert test.


The improvements from using a transaction are familiar to me and make intuitive sense. But can someone explain why using a prepared statement results in a roughly 5x improvement? Parsing the very simple SQL doesn't seem like it would account for much time, so is the extra time spent redoing query planning or something else?


>Parsing the very simple SQL doesn't seem like it would account for much time, so is the extra time spent redoing query planning or something else?

If you're inserting one million rows, even 5 microseconds of parse and planning time per query is five extra seconds on a job that could be done in half a second.
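
Put differently: with a prepared statement the SQL text is parsed and planned once up front, and each row only binds a new value into the existing statement. Using the article's schema, the statement that gets prepared once would be something like:

  -- prepared once; per row, only the ? parameter is bound and the statement re-executed
  INSERT INTO integers (value) VALUES (?);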


I'm curious which takes longer, parsing or planning.


Depends on the complexity of the planning, but typically planning takes longer (in MSSQL at least; I'm not familiar with SQLite).

Parsing is linear; planning gets exponential very quickly. It has to consider data distribution, the presence or absence of indexes, output ordering, foreign keys, uniqueness, and lots more I can't think of right now.

So, planning is much heavier than parsing (for any non-trivial query).


That's what I expect, but for these very simple batch insert queries, it's hard to guess which takes longer.


A very simple statement may not be as simple as it looks to execute.

The table here is unconstrained (from the article "CREATE TABLE integers (value INTEGER);") but suppose that table had a primary key and a foreign key – a simple insertion of a single value into such a table would look trivial but consider what the query plan would look like as it verifies the PK and FK aren't violated. And maybe there are a couple of indexes to be updated as well. And a check constraint or three. Suddenly a simple INSERT of a literal value becomes quite involved under the skin.

(edit: and you can add possible triggers on the table)
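
As a purely hypothetical illustration (made-up table and column names), the "same" one-value insert against something like this has far more work to do per row:

  CREATE TABLE drivers (id INTEGER PRIMARY KEY);

  CREATE TABLE integers (
    value     INTEGER PRIMARY KEY,             -- PK uniqueness check per insert
    driver_id INTEGER REFERENCES drivers(id),  -- FK lookup per insert
    CHECK (value >= 0)                         -- check constraint per insert
  );

  CREATE INDEX integers_by_driver ON integers(driver_id);  -- extra index to maintain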


Right, but in this particular example, the table is just

  CREATE TABLE integers (value INTEGER);


I said exactly that.


My bad, missed the first part.


Happens, we've all done it. NP


One big caveat to this sort of test. With DBs you are very often trading throughput for latency.

Meaning, it's certainly possible to get "the absolute fastest inserts possible" but at the same time, you are impacting both other readers of that table AND writers across the db.

This also gets more messy when you are talking about multiple instances.


With SQLite by default nobody else can even read the database while it's being written, so your comment would be better directed to a conventional database server.


Note that the default can be changed with a PRAGMA at compile time to Write-Ahead Logging (WAL), which is both faster and allows reading during writes.

[0] https://www.sqlite.org/wal.html


In my limited experiments WAL wasn't faster for high-row-speed writes.

When rows/s are maximized, SQLite is CPU-limited by its internal bytecode operations rather than by waiting on disk.


Too late to edit: This should have said "at run time", as it's set by making a relatively normal (can't be inside a transaction) SQL query.
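
For reference, the statement in question looks like:

  PRAGMA journal_mode=WAL;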


If you enable the WAL, reading while writing is possible. IMO it should be the default.


How does this compare to the following?

  create table testInt(num Integer);

  WITH n(n) AS (
    SELECT 1
    UNION ALL
    SELECT n+1 FROM n WHERE n < 1000000
  )
  insert into testInt select n from n;

On my machine (Mac Mini, 3 GHz Dual-Core Intel Core i7, 16 GB 1600 MHz DDR3) this takes:

  Run Time: real 0.362 user 0.357236 sys 0.004530


You tested just 1M records, not 1B?





