I wish they would build compression directly into SQLite.
I use SQLite as a log store, mostly dumping JSON data into it.
Due to the lack of compression the DB is probably 10 times the size it could be.
You can also try sqlite-zstd [1], an SQLite extension that allows transparent compression of individual rows by automatically training dictionaries on groups of data rows.
Disclaimer: I made it and it's not production ready
I mean, they do: [1]. It's an extension, true, probably because they don't want to force zlib as a dependency. Still, very straightforward to build with and use, and as you can see from the code, not hard to implement either.
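For reference, a rough sketch of using it from Python, assuming ext/misc/compress.c has been built as a loadable library (the file name and schema here are made up, and your Python sqlite3 must allow extension loading); the extension exposes compress(X) and uncompress(X) as SQL functions:

    import sqlite3

    con = sqlite3.connect("logs.db")
    con.enable_load_extension(True)
    con.load_extension("./compress")   # assumed build output of ext/misc/compress.c
    con.enable_load_extension(False)

    con.execute("CREATE TABLE IF NOT EXISTS logs(id INTEGER PRIMARY KEY, body BLOB)")

    # Compress on the way in...
    con.execute("INSERT INTO logs(body) VALUES (compress(?))",
                ('{"event": "login", "user": 42}',))
    con.commit()

    # ...and decompress on the way out.
    (body,) = con.execute("SELECT uncompress(body) FROM logs").fetchone()
    print(body)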
Compression at the _database_ level would be more useful. If you have a large text dataset but need to use the Full Text Search feature, I don't think you can compress the index.
You could probably pass the data through a gzip filter before storage if you don't need the data to be indexable while in the database. I know, that's an extra step, but in most programming languages it's fairly straightforward.
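Something like this, as a sketch (the schema is illustrative): compress with gzip before the INSERT and decompress after the SELECT. The blob is opaque to SQLite while compressed, so it can't be indexed or queried with the JSON functions.

    import gzip, json, sqlite3

    con = sqlite3.connect("logs.db")
    con.execute("CREATE TABLE IF NOT EXISTS logs(id INTEGER PRIMARY KEY, body BLOB)")

    entry = {"event": "login", "user": 42, "msg": "user logged in"}

    # Compress before storing...
    con.execute("INSERT INTO logs(body) VALUES (?)",
                (gzip.compress(json.dumps(entry).encode("utf-8")),))
    con.commit()

    # ...and decompress after reading. SQLite never sees the plain JSON.
    (blob,) = con.execute("SELECT body FROM logs ORDER BY id DESC LIMIT 1").fetchone()
    restored = json.loads(gzip.decompress(blob))
    print(restored["event"])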
And to be honest, if you need to search through or index JSON blobs in a database you need to reconsider your design.
> And to be honest, if you need to search through or index JSON blobs in a database you need to reconsider your design.
As a quick-and-dirty logging solution it makes quite a lot of sense. You can add whatever fields you want to your structured logs, and if you find you need to search on a particular field, just add an index for that field.
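For example (a sketch with made-up field names): SQLite supports indexes on expressions, so once a particular JSON field becomes worth searching, you can index just that field with json_extract(), assuming the JSON1 functions are available in your build.

    import json, sqlite3

    con = sqlite3.connect("logs.db")
    con.execute("CREATE TABLE IF NOT EXISTS logs(id INTEGER PRIMARY KEY, body TEXT)")
    con.execute("INSERT INTO logs(body) VALUES (?)",
                (json.dumps({"level": "error", "request_id": "abc123"}),))

    # Later: searches on "level" become common, so index just that field.
    con.execute("CREATE INDEX IF NOT EXISTS idx_logs_level "
                "ON logs(json_extract(body, '$.level'))")

    # Queries that repeat the same expression can use the index.
    rows = con.execute("SELECT body FROM logs "
                       "WHERE json_extract(body, '$.level') = 'error'").fetchall()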
>You could probably pass the data through a gzip filter before storage
That pretty much voids all the benefit of using SQLite.
>And to be honest, if you need to search through or index JSON blobs in a database you need to reconsider your design.
This is absolutely intentional: the log contains information to manually re-play certain events in case of a fatal error, mostly for debugging or performance benchmarks.
This means 99.99% of the data will never be needed and will eventually be discarded.
But when I need to re-play data I need to be able to query for exactly the data I need.
Each JSON "blob" (which is actually stored as TEXT in SQLite) has a few additional fields to quickly filter out 95% or more of the unwanted data. The rest SQLite simply reads completely to separate wanted from unwanted entries further.
With a DB size of around 100 GB I get the data I need by reading 5 GB or less. That's perfectly fine for my use case, and it would still be fine even if I let the DB grow quite a lot larger.
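Roughly, the pattern is this (column names invented here for illustration, not my real schema): a few plain, indexable columns carry the coarse filters, and the JSON text is only parsed for whatever survives them.

    import json, sqlite3

    con = sqlite3.connect("events.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS events(
            id    INTEGER PRIMARY KEY,
            ts    INTEGER NOT NULL,  -- unix timestamp, coarse filter
            kind  TEXT    NOT NULL,  -- coarse filter
            body  TEXT    NOT NULL   -- full JSON entry, stored as TEXT
        )""")
    con.execute("CREATE INDEX IF NOT EXISTS idx_events_ts_kind ON events(ts, kind)")

    def replay_candidates(start_ts, end_ts, kind):
        # The indexed columns discard most rows cheaply; the remaining JSON
        # bodies are read in full and filtered further in the application.
        cur = con.execute(
            "SELECT body FROM events WHERE ts BETWEEN ? AND ? AND kind = ?",
            (start_ts, end_ts, kind))
        for (body,) in cur:
            event = json.loads(body)
            if event.get("retryable"):   # finer, app-level filter
                yield event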
I would imagine SQLite would do a lot of random access IO, which would make this difficult even without indexing.
But there is a gzip “variant” that we use in genomics called bgzip that supports this. It’s basically a chunked gzip file (multiple gzip records concatenated together) with an extra field (in the gzip header) for the uncompressed length of each chunk. Using this information, you can do random IO on the compressed file while only uncompressing the chunks necessary. I’m sure other compression formats have the same support.
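Not the exact BGZF layout, but the idea can be sketched in a few lines: compress fixed-size chunks as independent gzip members, keep a small index of their offsets, and later inflate only the chunk you need (file name and chunk size are arbitrary).

    import gzip

    CHUNK = 64 * 1024

    def write_chunked(path, data):
        index = []   # (offset in compressed file, compressed length)
        offset = 0
        with open(path, "wb") as f:
            for start in range(0, len(data), CHUNK):
                block = gzip.compress(data[start:start + CHUNK])
                f.write(block)
                index.append((offset, len(block)))
                offset += len(block)
        return index

    def read_chunk(path, index, n):
        off, clen = index[n]
        with open(path, "rb") as f:
            f.seek(off)                           # random access into the compressed file
            return gzip.decompress(f.read(clen))  # only this member gets inflated

    data = b"json log line\n" * 100_000
    idx = write_chunked("logs.gz", data)
    assert read_chunk("logs.gz", idx, 3) == data[3 * CHUNK:4 * CHUNK]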
> I would imagine SQLite would do a lot of random access IO, which would make this difficult even without indexing.
To individual blobs? Seems unlikely.
Postgres automatically compresses large values; it's part of the TOAST system:
> The TOAST management code is triggered only when a row value to be stored in a table is wider than TOAST_TUPLE_THRESHOLD bytes (normally 2 kB). The TOAST code will compress and/or move field values out-of-line until the row value is shorter than TOAST_TUPLE_TARGET bytes (also normally 2 kB, adjustable) or no more gains can be had.
So if a TOAST-able value is larger than TOAST_TUPLE_THRESHOLD it'll get compressed (using pglz by default, from 14.0 LZ4 will finally be supported), and if it's still above TOAST_TUPLE_TARGET it'll get stored out-of-line (otherwise it gets stored inline).
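On 14 you can opt a column into LZ4 and check what compression a stored value ended up with; a sketch (connection string and names are placeholders, and the server must be built with LZ4 support):

    import psycopg2

    conn = psycopg2.connect("dbname=test")   # placeholder DSN
    cur = conn.cursor()

    cur.execute("CREATE TABLE IF NOT EXISTS logs(id serial PRIMARY KEY, body text)")
    # PostgreSQL 14+: choose LZ4 instead of the default pglz for this column.
    cur.execute("ALTER TABLE logs ALTER COLUMN body SET COMPRESSION lz4")

    # A value over TOAST_TUPLE_THRESHOLD (~2 kB) triggers compression.
    cur.execute("INSERT INTO logs(body) VALUES (%s)", ("x" * 10000,))
    # pg_column_compression() (also 14+) reports how the stored value was compressed.
    cur.execute("SELECT pg_column_compression(body) FROM logs ORDER BY id DESC LIMIT 1")
    print(cur.fetchone()[0])
    conn.commit()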
Not the filesystem... to the database file itself. I thought that was what we were talking about.
The blobs in the SQLite archive are compressed, but one of the critiques was that the compression was better when multiple files were used to build the compression dictionary (zip vs. tarball).
And now I realize that I probably replied to the wrong comment...
Interesting. SQLite moves some data off-page (out of the B-tree node) if the row is bigger than a certain threshold (the beginning of it is still stored in-page). I wonder if this could be hooked to compress the overflow data.