Show HN: Reduce SQLite database size by up to 80% with transparent compression (github.com/phiresky)
420 points by phiresky on Aug 1, 2022 | 90 comments



This is really neat. I like how the compression can be applied to specific columns rather than whole rows - or whole pages. I could imagine storing JSON from an HTTP request here and using Generated Columns[1] to extract the bits my application is interested in.

I also like how simple and powerful the custom partitioning is.

[1]: https://www.sqlite.org/gencol.html


I’ve been following that pattern in PostgreSQL and love it. It cleanly decouples the ingest and keeps the stack simple.


Do you mean gencols or compression?


I meant consuming the raw json into a jsonb column and then using views or triggers to maintain the structured version of it.


Can you elaborate?


Not the author, but if you have a write-intensive workload and I/O starts to be your bottleneck, compressing the data application-side allows your database to write fewer bytes, and thus to handle more load for the same capacity.


Be aware of the number of iops. That’s a killer.


Look into ETL vs ELT

ETL you grab the data from somewhere, transform it to clean it/format it, then load it (in this case to postgres).

ELT, you grab the data, load it as is. Then you use another step to extract what you need or transform it. You keep the original data in case you need it there too.


On that subject I've often thought of writing a combination decompressor/JSON parser that can parse the contents of the dictionary before decompression for faster JSON parsing and lower memory usage.


Maybe the compression algorithm used in https://github.com/beenotung/compress-json can inspire you.


Thanks, but that wasn't quite what I meant. What I was thinking about was processing existing gzipped or zstd compressed JSON rather than having a new compression format.

I could see how customising the compressor could be helpful - a bit like how there's an `--rsyncable` option to `gzip`, but I'd like to keep compatibility with existing formats. I like to avoid coming up with new formats whenever possible because you might want to mine the data in 20 years time - and if it's a custom format it'll be much harder to deal with.

I don't have a current use-case for this, but I like thinking about it. In particular I like data stores to be as simple as possible. Managing persistent state is the hard bit of most systems as unlike with runtime state or code you have to manage forwards and backwards compatibility, migrations, etc. and you can't just "turn it off and on again" to fix issues.


a lot of document key/value databases that store arbitrary JSON tend to work this way (for example Cassandra does this, IIRC). some will even automatically build indices for you (though hand-tuning is usually better)


wait, are you saying that the columns of view tables could be compressed? that doesn't make sense to me cause views don't use extra space to begin with....?

storing raw JSON in sqlite means the entire blob would be compressed and live in a single column, right?


You'd compress the JSON, and leave the data that you extract from the JSON uncompressed using Generated Columns[1]. That way you can save the original JSON data, but still have fast querying, etc.

It would look something like this:

    CREATE TABLE t1(
        json TEXT,
        data INT GENERATED ALWAYS AS (json_extract(json, '$.d')) STORED
    );
If all you need is fast querying you could make `data` a `VIRTUAL` column and set up an index on it.
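A quick sketch of that variant from Python (continuing the example above; generated columns need SQLite 3.31+, so treat this as approximate):

    import sqlite3

    db = sqlite3.connect("example.db")
    # VIRTUAL: `data` is computed on demand and takes no space in the table,
    # but the index below still materializes it for fast lookups
    db.execute("""
        CREATE TABLE t1(
            json TEXT,
            data INT GENERATED ALWAYS AS (json_extract(json, '$.d')) VIRTUAL
        )
    """)
    db.execute("CREATE INDEX t1_data ON t1(data)")

    db.execute("INSERT INTO t1(json) VALUES (?)", ('{"d": 42}',))
    row = db.execute("SELECT data FROM t1 WHERE data = 42").fetchone()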

You can read more about this kind of technique here: https://dgl.cx/2020/06/sqlite-json-support

[1]: https://www.sqlite.org/gencol.html


The image near the top of the readme implies that compressing the whole file means not having random access, which is indeed true, but it also implies that there's nothing between whole file compression and row/column level compression, and that is not true. You could compress pages, kinda like ZFS compresses blocks, and still maintain random access, and get pretty close to whole-file compression.


That requires changing your filesystem to a compressing one though, and you lose the compression when you copy the file to another filesystem. It does have the advantage that the page cache could have the uncompressed copy though.


No, the database itself can compress pages the same way ZFS does. Just don't layer compression on compression; otherwise it should work fine.


This makes me wonder how sqlite augmented with blosc or some similar faster-than-memory compressor might behave. Size and speed wins?


AFAIK Blosc is just a container. Not a good fit for row-level compression.


the vfs isn’t row level, though!


What does "transparent" mean in this context?


Transparent to the application - basically the SQL queries (select, insert, update) in your code base don't have to change at all.

This would be in contrast to e.g. compressing the data in your code before inserting it into the database, then decompressing after you load it.
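The non-transparent version would look roughly like this (a Python sketch with made-up table and column names):

    import sqlite3, zlib

    db = sqlite3.connect("app.db")
    db.execute("CREATE TABLE IF NOT EXISTS docs(id INTEGER PRIMARY KEY, body BLOB)")

    # the application has to compress before every insert ...
    db.execute("INSERT INTO docs(body) VALUES (?)",
               (zlib.compress(b'{"some": "json payload"}'),))

    # ... and decompress after every select
    (blob,) = db.execute("SELECT body FROM docs WHERE id = 1").fetchone()
    payload = zlib.decompress(blob)

With the transparent approach those queries stay plain INSERT/SELECT, and the (de)compression happens inside SQLite via the extension.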


One option not listed (under "Other options for compressing data in a database") but probably closest to what this solution does under the hood, is to do row-interdependent compression at application level.

That is, while saving rows to the database, pick out every (say) 10th row and compress it by itself, but keep the compression state in memory. Then for each of the next 9 rows, compress the row against that dictionary and store a reference to the row it's based on. (For example, in Python's zlib module, you'd use the zdict parameter [1] of compress and decompress.) A bit like keyframes in video compression.

You can potentially get better results this way than a generic solution like the article, because you might have some application knowledge about what rows are likely to be similar to others. But it obviously requires application level code (e.g. fetching a logical row requires fetching two physical rows so you have enough information to decode), and it would be a nightmare if you need to remove or update the rows.

[1] https://docs.python.org/3/library/zlib.html#zlib.compressobj
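A toy sketch of that keyframe idea in Python (rows made up), using zlib's zdict:

    import zlib

    rows = [b'{"user": "alice", "event": "login",  "ts": 1659312000}',
            b'{"user": "bob",   "event": "logout", "ts": 1659312010}']

    keyframe = rows[0]                      # every 10th row: compressed on its own
    stored_keyframe = zlib.compress(keyframe)

    # the rows in between are compressed against the keyframe as a preset dictionary
    c = zlib.compressobj(zdict=keyframe)
    delta = c.compress(rows[1]) + c.flush()

    # decoding a dependent row needs the keyframe row too (two physical reads)
    d = zlib.decompressobj(zdict=keyframe)
    assert d.decompress(delta) == rows[1]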


> You can potentially get better results this way than a generic solution like the article because you might have some application knowledge about what rows are likely to be similar to others

The linked project actually does exactly this - you can basically specify an SQL expression (such as strftime(date, '%Y-%m')) that is then used to group rows together in order to decide what to train dictionaries on. What you're describing as picking every 10th row would be the expression (id/10). So you can use application knowledge, you just don't have to.

The difference is that in my project the dictionary is stored separately, so it's not affected by row deletions, and the trained dictionary is based on all samples, not just one of them. Since the dictionary is stored separately, it's only worth it if you combine at least maybe 100-1000 rows under one dictionary.

> Python's zlib module, you'd use the zdict parameter

Yes, the training feature of zstd is basically a better version of keeping another sample of data around and using it as a prefix. In fact you can use a data sample instead of a trained dictionary as the `-D` parameter of the zstd command line, zstd just has the additional feature of reducing a set of samples down to the most essential (data-saving) parts.
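Outside of SQLite you can play with the same idea via the third-party zstandard Python package (this is just an illustration of zstd training, not the extension's internals; samples are synthetic):

    import zstandard

    # synthetic "rows" that share a lot of structure
    samples = [f'{{"date": "2022-08-{d:02}", "user": {u}, "event": "page_view"}}'.encode()
               for d in range(1, 29) for u in range(200)]

    # reduce the sample set to a small shared dictionary ...
    dictionary = zstandard.train_dictionary(4096, samples)

    # ... then compress every row against that dictionary
    cctx = zstandard.ZstdCompressor(dict_data=dictionary)
    compressed = [cctx.compress(s) for s in samples]

    # decompression needs the same dictionary
    dctx = zstandard.ZstdDecompressor(dict_data=dictionary)
    assert dctx.decompress(compressed[0]) == samples[0]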


Thanks for making this project, it looks fantastic! I might end up using it if I can figure out how to from Python.

I did mention that this is what you're doing, I just thought it wasn't clear from the blog post. But I did miss that you can use an SQL expression to choose the grouping. This is ideal for the application I'm thinking of, which has a "message type" field that corresponds pretty closely with which rows are similar.


You can use it from python as follows:

    db = sqlite3.connect(...)
    db.enable_load_extension(True)  # extension loading is disabled by default
    db.load_extension("libsqlite_zstd.so")
You probably still have to figure out how to make that work cross-platform / how to ship that binary though.


Oh interesting, I was assuming that recompiling the whole sqlite module would be in order. Thanks again.


Your explanation is great. I like the key frame codec analogy. Like you said though, this really belongs at the DB layer. An application level solution would be much more than a nightmare.

Plugin/extension/module development is severely underused in our field. Extensibility, if possible, is often the best way to handle edge cases or increments in a dependency, without forking it.

See "Benefits of Being an Extension to PostgreSQL": https://www.citusdata.com/blog/2017/10/25/what-it-means-to-b...

Some great software that is extensible in brilliant ways through plugins, that comes to mind, is:

postgres: https://www.postgresql.org/docs/current/external-extensions....

nginx: https://www.nginx.com/resources/wiki/modules/

sqlite: https://www.sqlite.org/loadext.html

redis: https://redis.io/docs/modules/

babel: https://babeljs.io/docs/en/plugins/


This looks great. How would I apply this retroactively? I.e. tell SQLite to compress column x in an existing database, say, of type TEXT?


Oh, interesting. I scrape GitHub's APIs to generate user and repositories statistics data, then compress it with ZStandard to allow the databases to fit within GitHub's file size restrictions.[1] So, this is very relevant to me.

It's pleasantly surprising how much data you can fit within a ZStandard-compressed SQLite database. The two technologies fit very well together.

[1]: https://github.com/andrewmcwattersandco/github-statistics


I don't want to diss the effort, I have an honest question: wouldn't this be something more appropriate to do at the filesystem level?


Context-aware compression can be a lot faster than blind filesystem-level compression.

Some data, like plain text, JSON, or XML, compresses extremely well. You're almost certainly looking at compression rates of up to 90%. Dense binary formats, especially formats that are already compressed like most image formats, will barely compress at all (sometimes they can even get bigger), and attempting to compress them is an enormous waste of processing power.


Tried this on my twitter database's `json` column (ie the JSON response from the API fetching a singular tweet) with a `dict_chooser` of '%Y-%W' to bundle weeks together and it took 985k rows down from 6.6G to 1.5G. Not as good as compressing the whole thing obviously but really handy for a queryable backup.


At least in btrfs there's compress vs compress-force mount options. compress uses a cheap estimator of compressibility, whereas compress-force always passes data through the compression algorithm.

ZSTD has its own estimator to avoid doing unnecessary work, but I haven't found a comparison of these estimators (which is cheaper).

compress results in max extent size of 128K for compressed extents, and 128M for uncompressed data. Whereas compress-force results in 512K max extent size for uncompressed data. While there's a cost to tracking more extents, it might be less than the cost of frequent modification of large extents, particularly if snapshots are used. Of course there's no one answer, it'd be workload and hardware dependant. Hard drives have significant latency impact from non-contiguous extents. NVMe much less impact.


The zstd estimator is not that great, so you should definitely keep the auto-detection of the file system on.

If you try to compress a completely incompressible file, it will still be extremely slow at higher compression levels: https://github.com/borgbase/vorta/issues/485#issuecomment-88...


I can't recall if they actually implemented or just talked about it, but during the ZFS team meetings there was some discussion about using LZ4 as a "compression canary", to determine if it was worth attempting Zstandard. That is, first try compressing with LZ4, and if it did compress, do actual compression with Zstandard.

IIRC the argument went that LZ4 had a very quick bail-out routine, and quite fast compression, so the overall overhead would be quite small, at least for the higher Zstandard compression levels.


They implemented it; it landed in May: https://github.com/openzfs/zfs/commit/f375b23c


Quite the reverse actually, the btrfs estimator is way worse than the zstd one. By using the btrfs estimator, you leave a ton of compression ratio on the table, for no good reason.


Continuing the hypothetical without really knowing about today's filesystems, such a system could presumably calculate a moving per-page, per-fd, or per-process compression factor and turn it down or up if it started seeing lots of incompressible data?


For a columnar database this would probably work fairly well, as it would put data of similar compressibility adjacent to more of its kind, but row-based databases, where various types of data are intermingled, would probably not benefit nearly as much.


It would be interesting to try something like this on the file system level, yes.

Basically you could heuristically find interesting files, then train a compression dictionary for e.g. each file or each directory. Then you'd compress 32kB blocks of each file with the corresponding dictionary. Note that this would be pretty different from how existing file system compression works (by splitting everything into blocks of e.g. 128kB and just compressing those normally).

I think it would be hard to find heuristics that work well for this method. The result would be pretty different than what I'm doing here with different tradeoffs. sqlite-zstd is able to use application-level information of which data is similar and will compress well together. You couldn't do that within the filesystem. It also works "in cooperation" with the file format of the database instead of against it - e.g. you probably wouldn't want to compress the inner B-tree nodes.

On the other hand, it would then work for any file format not just SQLite databases. E.g. a huge folder of small json files


No. If you did that, SQLite would have to (logically) decompress just about your entire database on every query. For example, how would it read the database schema to determine whether “select foo from bar” is valid SQL for the database? If it figures that out, how would it figure out there’s an appropriate index to use? How would it read that index?

Databases do a lot of seeking into database files and typically only read a small part of them. That’s not a scenario that’s a good fit with whole-file compression.

Compressing individual pages is a bit better, but I think this can still be better.

Your file system probably would cache the decompressed database file, but that would use memory, lots of it if your database is large.


FS-level compression is usually limited to compressing relatively small blocks (think ~dozens of sectors), they don't normally compress a large file as a single solid block.


Yes, but still, the file system would do more decompression than needed, and that will cost performance (CPU and caching)


Great idea. zfs can compress data on disk, but it's also useful to keep the data compressed in memory so you can fit more data in memory. There are a lot of CPU/memory tradeoffs, so it's worth having the option to do it at the data level.

It's also worth noting that compression on the application layer works way better on column stores than row stores.


> Btrfs also supports transparent file compression. There are three algorithms available: ZLIB, LZO and ZSTD (since v4.14). Basically, compression is on a file by file basis. You can have a single btrfs mount point that has some files that are uncompressed, some that are compressed with LZO, some with ZLIB, for instance (though you may not want it that way, it is supported).

https://btrfs.wiki.kernel.org/index.php/Compression


> wouldn't this something that would be more appropriate to be done at the filesystem level?

Sometimes? You've gotten a ton of replies digging into the "why" and "when", and I only wanted to add that your question brought to mind this quote from the official SQLite website:

> "Think of SQLite not as a replacement for Oracle but as a replacement for fopen()"

https://www.sqlite.org/about.html


It depends on the data and application.

I use a similar approach for compressing data that goes into a couple of databases/stores. Having control over when and how to compress data can significantly improve performance; also, it sometimes makes sense to keep compressed data even in memory, since that can let you make better use of it.

Also keep in mind that often you can't control the filesystem.

One use case I am working on right now is storing highly repetitive data on very slow flash storage (which technically supports compression, but it is slow and the compression ratio is not great); by compressing at the application level I can significantly improve performance.

edit: grammar


Yeah but your app doesn't have control over the filesystem. Usually an sqlite db is bundled with an application or created by it.

That said, it looks like you could get the same results by doing client side compression and decompression of data yourself, and storing compressed blobs in sqlite. It only compresses text column data, not entire pages.


But it uses a shared zstd dictionary among rows!

The details are described here: https://phiresky.github.io/blog/2022/sqlite-zstd/



With this you can selectively compress per column


This kind of solution should work fine even if the database is encrypted.


This mentions something I've been wondering about:

>Depending on the data, this can reduce the size of the database by 80% while keeping performance mostly the same (or even improving it, since the data to be read from disk is smaller).

"Even improving it" is a bold claim!

Modern CPU's are often hungry for data, aren't they? While being incredibly fast and mostly idle?

Is reading a smaller compressed file and decompressing it in the CPU's hyperfast cache before use faster than reading it uncompressed?

Maybe you can speed up an entire server, even one with lots of free space and no need for compression, by adding full disk compression anyway!

Unless the CPUs are at 100% utilization this could always be used. Since the CPU is going to use the data anyway, it might be able to decompress it faster than the bus can deliver it at raw file-read speeds.

This chart I found, which only goes up to 2000, suggests cache access had an increasing performance gap with even RAM bandwidth, to say nothing of SSDs[1]

It is from here[2]

(However, compressing a full disk can only help if there are extra CPU cycles.

For activity spikes when the CPU doesn't have extra cycles or the cache for it, both compressed and uncompressed versions could be available, at most doubling disk usage, so that the uncompressed file could be read instead.

This can ensure the CPU isn't burdened with a decompression task, since this only helps if the fastest caches are used for it and they might have better things to do than uncompress files.

However, whenever the CPU has idle cores and lots of cache, which is most of the time, it could request the compressed version. If it is busy it could ask for the original version which saves the decompression work.)

Can this help solve the problem of waiting for the disk?

[1] http://www.extremetech.com/wp-content/uploads/2014/08/CPU-DR...

[2] https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu...


> Maybe you can speed up an entire server, even one with lots of free space and no need for compression, by adding full disk compression anyway!

Yes. Although not all filesystems support compression. One of the filesystems that does support it, and is regularly benchmarked, is ZFS. It is also interesting that while you are definitely using more CPU to read a compressed stream, if that causes you to finish your task earlier, it can be a net gain even from a CPU usage standpoint.

eg.

https://www.servethehome.com/the-case-for-using-zfs-compress...


Let's say I have a web visits analytics log database, with a column "user-agent" and a column "url" that have most of the time the same values (99.9% of rows have one of 100 typical values).

Would this work well with compression here?

ie. from 50 bytes on average for these columns... to a few bytes only?


Yes, if you use the "dictionary" option it will build a buffer of common strings and use that during compression, so your user-agent columns will largely become references to the dictionary. Make sure you set compact to true! See this for more information: https://facebook.github.io/zstd/#small-data


Will this mode require a separate file along the data.db Sqlite file, to store the compression dictionary?

From your last link:

> The result of this training is stored in a file called "dictionary", which must be loaded before compression and decompression.

Also, what happens if a column has been compressed based on 100 typical values, and then later we insert thousands of new rows with 500 new frequent values. Does the compression dictionary get automatically updated? Then do old rows need to be rewritten?

PS what is compact mode?


It stores these compression dictionaries in the database:

> If dictionary is an int i, it is functionally equivalent to zstd_decompress(data, (select dict from _zstd_dict where id = i)).

Compact saves you 8 bytes per compressed value by removing some metadata zstd uses:

> if compact is true, the output will be without magic header, without checksums, and without dictids. This will save 4 bytes when not using dictionaries and 8 bytes when using dictionaries. this also means the data will not be decodeable as a normal zstd archive with the standard tools. The same compact argument must also be passed to the decompress function.

You will need to update the dicts occasionally, but there's a helper function to do this for you, read the docs about zstd_incremental_maintenance.


Probably yes, but presumably you could achieve this by compressing the data on the file system level as well (for example with ZFS).


SQLite allows you to override the low-level paging routines; I wonder whether this could be done just by having a different pager (though maybe with not as efficient compression, unless the pages are big (64kb)).


LZ77+ ftw. I am a heavy user of Zstd. I think Brotli might be a better choice here. Or at least for my use-cases.


Awesome!

Hopefully this will encourage SQLite to support compression natively in some fashion.


I love the idea. Can I use it with Django projects somehow?


I'm using SQLite on zfs and btrfs with transparent compression supported by the file system; the query speed is very good and the compression ratio is also very good (a database below 3GB uncompressed comes out at 22% of its original size after compression).


Did you enable WAL? Any casual comparison in performance between WAL and no-WAL?


I enabled WAL. It performs better than the default (which is to keep a rollback journal).


What do you think about zfs vs btrfs for your use case? Why do you use both?


I used zfs because I heard of it first. But I realized a zfs partition cannot shrink in size; one has to copy it to another smaller partition if they have enough spare volume.

So I experimented with btrfs on another device, and both are working well after some daily usage.


I am happy for the author, but want to warn the credulous that this is a very strange way to approach compression with SQLite. The much more common way is to use SQLite vfs:

https://www.sqlite.org/vfs.html

Briefly search for SQLite compressed vfs and you will find many extensions. Some are quite mature.

I get the impression that the author doesn't know about the vfs, as it's not even mentioned in the readme. The clickbait title similarly seems amateurish. Caveat emptor.


I did actually previously write a VFS, though it did something else entirely:

https://phiresky.github.io/blog/2021/hosting-sqlite-database...

You're right that a comparison to compression VFS is completely missing. I knew about their existence but I have to admit I didn't look too closely.

Note that the main "novelty" here is training a shared / partitioned compression dictionary specifically to take advantage of redundancy that wouldn't appear within only one row, or even within a database page/block. The compression happens at the row level and can additionally use application knowledge - you couldn't do that at the VFS level. For example, you can have separate compression dictionaries per column and per group of rows with some commonality.

I'll have to compare to a compression vfs (do you have a favorite?) and see if maybe these two methods can even be combined.

Edit: I see that https://github.com/mlin/sqlite_zstd_vfs does actually also train dictionaries. It's still at the database-page level so can't take application knowledge into account or compress only parts of the data, but that's still pretty smart then.


How does this compare with modern compression algorithms like Brotli that do context modeling, etc? I've found that they manage to aggressively compress types of data you wouldn't expect to compress well, to the point that investing energy into doing compression yourself doesn't provide big returns anymore. The downside is that codecs like Brotli tend to compress very slowly, but I can imagine being able to do a setup where you only compress old rows so it would be cool to see an experiment with just compressing rows or columns and comparing the sizes with your method.


Zstd with training libraries should beat Brotli, and Brotli will struggle with non-text data, although I haven't benchmarked it.

Your underlying point remains valid though: the incremental complexity of building that training data probably doesn't warrant it, because the place where that becomes valuable is quite rare, particularly for typical SQLite databases. Still a neat trick though.


I was surprised to discover that Brotli is actually very adaptable. I spent a month or two doing research on custom compression techniques for WebAssembly early in the spec process and we ended up discovering that you can just throw a naive binary executable through brotli and end up with at most like a 5% size loss vs doing a bunch of fancy compression, at which point the cost of the fancy compression starts looking questionable. We ended up not shipping custom compression as a part of the spec as a result.


Curious if you compared Zstd against Brotli. I'd expect Zstd to beat Brotli by a fair margin for non-text payloads (+ faster for decompression which matters since these are compressed once / decompress many).


We didn't, and I'll have to make a note to investigate zstd for my own purposes later! Brotli is kind of the only game in town since it's shipped in every web browser now as a transport codec (the other one is gzip)


Let's not water down the meaning of the word "clickbait". I have no problem with this title.


> amateurish

There's no need to be this dismissive.


welcome to HN


Meh, I find HN to be among the least dismissive places on the internet, especially when you only consider nerd-heavy places. I like to think that this is in part due to comments like the one I wrote being commonplace. I've been at the receiving end of them as well (incl a few kicks under the butt from dang himself) and it's helped me assume good intent, not be overly dismissive, etc. I'm just passing the favour along.


I think the frustrating thing about HN is the (relatively) high number of comments that are both dismissive and substantive. It's much easier (for my brain at least) to ignore comments where the entirety is "this sucks" or "lame" (or even "If I made something this crappy I would throw myself off a building" which isn't uncommon in some forums).

In comparison, gp comment is both dismissive and substantive in that it brings up SQLite VFS layers that can accomplish a similar goal. Author of TFA even updated the README to mention these, addressing gp's complaint.

The whole "this is crap because of X" where X is something that subjectively could be considered legitimate criticism is relatively common on HN, so comments that I don't automatically skip over are rather more dismissive than on other forums.


To call it strange or unusual is fine but it isn't a compelling argument why not to use it. Is it better or worse than a vfs approach and why?


Taking an amateurish or fancy or experimental approach to compressing a sqlite database is a great way to end up with data loss. Doesn't make it a worthless or "bad" project but it's worthy of a warning.


Why not read the SQLite team's own explanation:

https://www.sqlite.org/zipvfs/doc/trunk/www/howitworks.wiki


> Briefly search for SQLite compressed vfs and you will find many extensions.

I found 3 with a brief search - CEVFS, sqlite_zstd_vfs, and ZIPVFS (not free, therefore discounting.)

Neither CEVFS nor sqlite_zstd_vfs support WAL mode (which means, e.g., no litestream replication).

Can you recommend a mature one that does support WAL mode?


    sudo apt-get install archivemount
    archivemount mydbimage.tgz /mnt/mydbimage -o big_writes
    # -o readonly  disable write support
    # -o nosave    do not save changes upon unmount
    # open/cache /mnt/mydbimage/mydbimage.db
    # flush
    fusermount -u /mnt/mydbimage

No LGPL/Rust/cargo dependency issues... as SQLite's best use-case is embedding ;)


I understand why you suggest this, even though archivemount is LGPL, but I digress.

FUSE is cool and all but given it's "userspace" and we're just reading a single file here, I don't find this compelling versus reading from a block device in-process.

TFA is compelling because it's compression over specific data sets without being in the hot path for every operation. Putting a sqlite.db in a tarball is... problematic. There is no random access, so if your database doesn't fit in memory/cache, performance will be abysmal.


LGPL is not great for porting, and linux block caching is far from the performance bottleneck if you know what you are doing.

I think we can agree it depends on the use-case. =)


Eh, tar is probably not an amazing format for this, since it's designed for tape drives, and has ridiculously slow random access.


archivemount supports numerous formats as it is using libarchive.

My point was it doesn't contaminate the parent projects license.



