Getting the Most Out of Sqlite3 with Python (remusao.github.io)
191 points by pythux on Oct 22, 2017 | 33 comments

For a project a few years ago I chose to use apsw:


It is a different wrapper for sqlite than the one in the Python standard library. As far as I remember, the stdlib one had some brain-dead choices with regard to concurrency and transactions.

There may have been some other reasons I chose it too, but I don't recall at the moment. The issue is that Python has this spec called "DB-API" which is supposed to make all databases look alike. But I don't want that -- I just want to use sqlite!

So 'apsw' is a thin layer on top of sqlite. I'm able to read the sqlite documentation in C, and be reasonably confident in what it's doing. But as far as I remember the Python stdlib version had a bunch of magic in the wrappers that was annoying. I might be mistaken but I think it parsed its own SQL? I think it had some sort of silly cache of parsed queries, rather than letting you simply store/control that in the application.

I think it also inserted SQL statements for transactions between your own calls!!! I just want the raw SQL to go to sqlite as I intended.

I definitely ran into issues where non-obvious flags to the API increased insert performance by 10x, and it had nothing to do with sqlite itself! So I abandoned it for a thinner wrapper.

I have also transitioned to APSW because of the transaction point you mentioned. I was trying — and pulling my hair out for some time — to get the stdlib SQLite3 transactions to work how they're documented, but no combination of magic words seemed to do the trick. A bit of Googling and I came across APSW, which explicitly mentions that the stdlib is broken in this regard because, as you say, it's paying lip service to the DB-API. (However, that somewhat implies to me that the DB-API must be broken for all transactional RDBMSs. Why is SQLite different?)
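For what it's worth, the stdlib's implicit transaction behavior can be switched off without leaving the stdlib: passing `isolation_level=None` to `connect()` puts the connection in autocommit mode, so no `BEGIN` is ever inserted on your behalf and you can issue transaction statements yourself. A minimal sketch:

```python
import sqlite3

# isolation_level=None disables the module's implicit BEGIN insertion;
# the connection is in autocommit mode and raw SQL goes straight through.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (x INTEGER)")

conn.execute("BEGIN")                     # explicit transaction, no magic
conn.execute("INSERT INTO t VALUES (1)")
conn.execute("ROLLBACK")                  # the insert is undone, as expected

conn.execute("INSERT INTO t VALUES (2)")  # autocommit: effective immediately
print(conn.execute("SELECT x FROM t").fetchall())  # [(2,)]
```

This doesn't address the other wrapper magic mentioned above, but it does stop the stdlib from rewriting your transaction boundaries.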

Anyway, it’s a bit of extra work to install, because you have to compile it from source and there’s no, e.g., pip package. That’s not so difficult in the grand scheme of things. One thing I miss from the stdlib, however, is the adaptor and convertor interface for transparent type conversion between Python and SQLite; I ended up rolling my own for APSW.

Thanks for this pointer. It appears to have python bindings for the sqlite VFS layer and virtual tables too. That would be fun to experiment with.

You might also like this library, which allows you to create virtual tables using a very simple API (2 functions).


> 1. Use Bulk Operations

> If you need to insert a lot of rows at once in your database, you really should not use execute. The sqlite3 module provides a way to do bulk insertions: executemany.

Note that this is actually part of DBAPI 2.0[0], so pretty much every Python database driver should have an executemany(). Worst case scenario, it's implemented on top of execute() and you get little to no benefits (that's the case for psycopg2 currently[1]), best case scenario it can use much more efficient methods (seems to be the case for sqlite3).
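A minimal sketch of the sqlite3 case, where executemany() prepares the statement once and binds each parameter tuple against it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT, ts INTEGER)")

rows = [("login", 1), ("click", 2), ("logout", 3)]

# One call instead of a Python-level loop of execute() calls;
# sqlite3 reuses the prepared statement for every row.
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 3
```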

> 3. Cursors Can Be Iterated Upon

That is an optional extension but part of the DBAPI 2.0 spec as well[2], and most drivers should implement it.

> for row in connection.execute('SELECT * FROM events'):

The return value of execute() is spec-undefined, so beware that specific method, it is absolutely not portable, neither is execute/executemany on the connection object.
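The portable spelling of the same loop goes through an explicit cursor, which is the only object the spec guarantees is iterable (as an optional extension):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()                       # the portable DB-API entry point
cur.execute("CREATE TABLE events (id INTEGER)")
cur.executemany("INSERT INTO events VALUES (?)", [(1,), (2,)])

# Iterate the cursor, not the return value of execute(); this form
# works across most DB-API drivers, not just sqlite3.
cur.execute("SELECT id FROM events ORDER BY id")
for (event_id,) in cur:
    print(event_id)
```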

> 7. Use Placeholders to Interpolate Python Values

Warning: the actual placeholder is driver-dependent and not portable. You can query the driver for the preferred param style[3], and you can also ask it for its thread-safety guarantees (the sqlite3 module advertises module-level thread-safety, but not connection-level, let alone cursor-level).
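Both are exposed as module-level attributes on any DB-API-compliant driver; for sqlite3 it looks like this:

```python
import sqlite3

# paramstyle tells you which placeholder syntax the driver expects;
# threadsafety is a small integer (0-3) describing sharing guarantees.
print(sqlite3.paramstyle)    # 'qmark' -> use ? placeholders
print(sqlite3.threadsafety)  # 1 or 3, depending on the Python version
```

psycopg2, for comparison, advertises 'pyformat' (`%(name)s`-style placeholders), so code that hardcodes `?` is not portable across drivers.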

[0] https://www.python.org/dev/peps/pep-0249/#executemany

[1] http://initd.org/psycopg/docs/cursor.html#cursor.executemany

[2] https://www.python.org/dev/peps/pep-0249/#next

[3] https://www.python.org/dev/peps/pep-0249/#paramstyle

> you get little to no benefits (that's the case for psycopg2 currently[1])

Yeah, this bit me on the backside using psycopg2 as an underlying driver for Pandas DataFrame.to_sql(). My hack? Dump the dataframe to a StringIO as CSV and load that instead...
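A sketch of that hack, assuming a psycopg2 connection (`conn`, `table`, and the helper names here are hypothetical): build the CSV in memory with the stdlib, then stream it through Postgres's COPY protocol via `cursor.copy_from`, which avoids the per-row round trips entirely.

```python
import csv
import io

def rows_to_csv_buffer(rows):
    """Serialize rows to an in-memory CSV file suitable for COPY."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    buf.seek(0)
    return buf

def bulk_load(conn, table, rows):
    # copy_from streams the buffer through Postgres's COPY protocol
    # in one shot. Requires a live psycopg2 connection.
    buf = rows_to_csv_buffer(rows)
    with conn.cursor() as cur:
        cur.copy_from(buf, table, sep=",")
    conn.commit()

# The buffer-building half is plain stdlib and can be checked standalone:
buf = rows_to_csv_buffer([(1, "a"), (2, "b")])
print(buf.read())  # 1,a / 2,b (CRLF-terminated)
```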

In the past when I've had to insert many rows (for example), I wrapped it inside a transaction. Why and when would someone use executemany() instead? Or does executemany() do a transaction behind the scenes?

Why and when would someone use executemany() instead?

What kills you in this use case is roundtrips - send one row, wait for an acknowledgement (even if you are not committing it yet), then send one more, and so on and so on. There is a lot of dead time spent in protocol overheads or just idle. A well implemented executemany() sends as many rows at a time as will fit in a network packet and keeps streaming them as fast as it can.

Transactions are a separate orthogonal concern.

What executemany can do is prepare the statement once and call it multiple times with different sets of parameters, whereas execute may be preparing the statement every time it is called (possibly hoping that the database system has a statements cache) or not preparing them at all (and the DBMS only does simple query planning).

For some of the executemany use-cases the following execute would work as well:

INSERT INTO myt (myc1, myc2) VALUES ('c11', 'c12'), ('c21', 'c22');
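The same multi-row INSERT can be issued from Python while keeping the values bound as parameters — generate the placeholder groups and flatten the rows, rather than interpolating values into the SQL string. A sketch with sqlite3 (table and column names taken from the statement above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myt (myc1 TEXT, myc2 TEXT)")

rows = [("c11", "c12"), ("c21", "c22")]

# Build "VALUES (?, ?), (?, ?)" and flatten the parameters, so the
# whole batch reaches sqlite as a single statement.
placeholders = ", ".join("(?, ?)" for _ in rows)
flat = [value for row in rows for value in row]
conn.execute(f"INSERT INTO myt (myc1, myc2) VALUES {placeholders}", flat)

print(conn.execute("SELECT COUNT(*) FROM myt").fetchone()[0])  # 2
```

(Note sqlite has a limit on the number of bound parameters per statement, so very large batches still need chunking.)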

> return value ... undefined

I much prefer treating cursor.execute like list.append. Mutation methods should return None to emphasize their impurity.

Been meaning to try this, good write up.

An alternative between keeping all data in-memory and going full relational database that's worked well for me is Python + H5PY. Write a simple class where your data getters / setters interact directly with the H5 file and don't keep it in memory.

Super fast (data handling and setup), easy on memory, and doesn't lock you into the schema of a relational database.

I often use the Dataset library (from Mozilla, I believe). It provides a lot of convenience for incorporating SQLite (and other databases) into Python projects and it's very easy to use it to fall back to straight SQL queries and other statements.


Dataset is neat, but it relies on the monster that is sqlalchemy. I have recently been using Pony ORM for new experiments with sqlite, and then if the project sticks, data transfer to MySQL is trivial.

Pony is not as light as Dataset, but way less heavy/confusing than sqlalchemy.

I agree, but Pony is also less flexible than SQLAlchemy :/ For anything other than trivial queries, you'll end up writing some horrible code.

Having used Pony for a month or so, it feels like someone saw generators, thought SELECTs would be cool using them, and then implemented all other query types as well instead of shelving the whole thing like they should have.

Kenneth Reitz's records library is similar (and also reliant on Sqlalchemy): https://github.com/kennethreitz/records

Some really nice tips. I have worked quite a few times with Python + SQLite3 combination and I always enjoyed the API, that's why I chose Python for my toy document store that I created on top of SQLite3 (shameless plug http://blog.creapptives.com/post/47494540287/a-document-stor... ).

The docs cover all this and much more. Do yourself a favor and read them. Sqlite is a very capable library and this post is just the tip of the iceberg.


Agreed. But in general avoid 2.x if you can. It will be EOL at some point, quite possibly before the code you're writing now is EOL. Up to date docs are at:


> 7. Use Placeholders to Interpolate Python Values

Well, where you can. You can't use placeholders for table names, for example.

When you can't, just don't derive them from unsafe input.
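One common way to do that is to validate the identifier against a fixed whitelist before interpolating it — a minimal sketch (the table names and helper are hypothetical):

```python
import sqlite3

ALLOWED_TABLES = {"events", "users"}  # hypothetical whitelist

def select_all(conn, table):
    # Placeholders only work for values, not identifiers, so check the
    # table name against a known-good set before interpolating it.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    return conn.execute(f'SELECT * FROM "{table}"').fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (x INTEGER)")
conn.execute("INSERT INTO events VALUES (42)")
print(select_all(conn, "events"))  # [(42,)]
```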

One thing I was hoping to see; building sqlite3 for Python with json1 extension support. I've tried a few times and gave up, confused. How hard is it?

Entirely disabling synchronous is incredibly dangerous. Use "pragma journal_mode=wal" for a safe performance boost.

If your program loads 100 GB of data, queries it, then destroys the db... you don't care if a crash loses data. You just run it again.

I'm not sure synchronous=off qualifies as incredibly dangerous. You can corrupt the database with synchronous=off, but only if you lose power or take an OS reboot in the middle of a transaction. Even with synchronous=off, transactions are still ACID and the database is guaranteed to be intact following an application crash. It takes an OS crash or power failure to cause problems.

Disabling synchronous is dangerous and I don't know where it could be used practically (if you care about your data https://www.sqlite.org/howtocorrupt.html#_failure_to_sync).

That being said, 'journal_mode=wal' is a bit faster than the default, but nowhere even close to the speed of disabling synchronous completely.
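For reference, both are one-line pragmas from Python; a sketch (WAL requires a file-backed database, so an in-memory db won't report 'wal'):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)

# journal_mode=wal returns the mode actually in effect.
mode = conn.execute("PRAGMA journal_mode=wal").fetchone()[0]
print(mode)  # 'wal'

# synchronous=NORMAL is a common pairing with WAL: fewer fsyncs than
# FULL, without the power-loss corruption risk of synchronous=OFF.
conn.execute("PRAGMA synchronous=NORMAL")
print(conn.execute("PRAGMA synchronous").fetchone()[0])  # 1 (NORMAL)
```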

> ... don't know where it could be used practically.

Maybe with read only databases?

Would you get a speed increase in reads? I thought the synchronous and wal speedups would just be for writes (no good for a read-only database). Maybe I'm misinformed. Have to play with it and see.

Hmmm, I'm not sure either. If you got around to testing it, it'd be interesting to hear about. :)

sqlite3 is the most underrated database ever, especially with python. You can get a ton done with it. Who needs HBase and MongoDB...

Nice writeup. I'll keep this in mind for when PostgreSQL + SQLAlchemy would be overkill.

I used AutoHotkey for this kind of in-memory, mostly-read string lookup and NoSQL-type queries without sqlite. AutoHotkey is way faster than Python. Maybe it's the immutable strings slowing down Python, but there was like a >10x performance difference. I even had a Linux port of AutoHotkey at one time, but I think you can also just use Wine if you are on Linux.

Python is slow, you can downvote me all you want. Case in point: https://news.ycombinator.com/item?id=15548904
