An unscientific benchmark of SQLite vs. the file system (btrfs) (github.com/chrisdavies)
113 points by leeoniya on Feb 12, 2022 | 67 comments



It is stated regarding Btrfs -

> we're writing the entire file each time

This is a super-common misunderstanding of copy-on-write. Entire files are not re-written on each write to a file. New blocks are written, and then the (btree) metadata is updated to include the updated blocks in the file. [1] This causes greater fragmentation than writing new blocks over old ones within the file's existing structure, as non-COW filesystems can do. Another thing that somewhat slows btrfs is that all blocks have checksums calculated, stored in metadata, and checked on read. If entire files really did need to be copied on each edit, it would be a great deal slower!

[1] https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Files


I couldn't see any fsync (or similar) in your filesystem code (sorry if I missed them) -- it doesn't seem like a completely fair comparison, as sqlite has various promises about commits being completely made, etc, which you won't get from just chucking files on a filesystem.
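
For reference, here's a minimal sketch (illustrative names, not the author's actual benchmark code) of what a durable file write looks like in Go; os.WriteFile alone leaves the data in the page cache:

    package main

    import "os"

    // writeDurable writes data to path and only returns once the file
    // contents have been flushed to stable storage via fsync.
    func writeDurable(path string, data []byte) error {
        f, err := os.Create(path)
        if err != nil {
            return err
        }
        defer f.Close()
        if _, err := f.Write(data); err != nil {
            return err
        }
        // Without this Sync (fsync), the "write" may still be sitting in
        // the page cache when the benchmark finishes.
        return f.Sync()
    }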


Yeah. I mentioned that in the final test where I write to a temporary file, then rename. This is much slower, probably due to an implicit fsync.


You can disable the fsync calls in sqlite if you want to do a little better with this benchmark. You're also explicitly choosing the WAL in your go-sqlite3 configuration, which is not at all replicated by your filesystem test. I think, honestly, that you're just going to mislead and confuse people who don't know any better with this writeup. I can write faster to /dev/null, too, but that isn't a very interesting comparison.
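
For what it's worth, with the mattn/go-sqlite3 driver that's just a DSN tweak using the same _sync parameter the author quotes later in the thread. A sketch, assuming that driver:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        // _sync=0 maps to PRAGMA synchronous = OFF, i.e. SQLite no longer
        // waits on fsync -- roughly matching a filesystem test that never
        // syncs. Fine for a benchmark, not for data you care about.
        db, err := sql.Open("sqlite3", "./tmp.db?_journal=WAL&_sync=0")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
    }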


Does /dev/null support sharding?


Even better, it's embarrassingly parallel!


Yes.


There's some 10-year-old webpage that used to be popular on HN that goes into data integrity issues, fsync, and using the filesystem as a database; someone should dig it up so you can read it.



Yes that's the one.

And damn near every application is "broken" according to the ext4 documentation.


I am not aware of implicit fsyncs. Can you please link to what you are referring to?


Probably in reference to Ext4's[0] configurable rename-replace behavior under auto_da_alloc

[0]: https://www.kernel.org/doc/Documentation/filesystems/ext4.tx...

Edit: note this isn't a universal property either, so it's still wrong: https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_cra...


Yeah, considering that the test is on btrfs, I would really be interested in knowing what the OP is referring to.

Also, even if we were on ext4, it's worth pointing out that the ext4 documentation refers to apps that rely on that behavior as "broken applications", so it's hardly a good point of comparison.


If you only care about storing data and not doing any complex operations or queries against it, why are you using something like sqlite at all?

It’s not a fair comparison, as having the data in sqlite makes accessing it significantly easier. This is comparing apples and dogs, and I don’t see the merit.


In one application I use sqlite to store JPEG thumbnails. That's it; there's nothing else in there. It's super handy; I specifically needed to reduce the number of files that I open and close because that's slow on Windows/NTFS. SQLite made this trivial. I could have managed a binary pack format on my own but I didn't have to.
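
Schematically, it's something like this (a hypothetical sketch -- table and column names are made up, not the actual application):

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", "./thumbs.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // One table, one blob column: the whole "pack format" problem goes away.
        if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS thumbs (
            path TEXT PRIMARY KEY,
            jpeg BLOB NOT NULL
        )`); err != nil {
            log.Fatal(err)
        }

        // Each thumbnail is a row; only one OS-level file is ever opened.
        jpegBytes := []byte{0xFF, 0xD8} // stand-in for real JPEG data
        if _, err := db.Exec(`INSERT OR REPLACE INTO thumbs (path, jpeg) VALUES (?, ?)`,
            "photos/001.jpg", jpegBytes); err != nil {
            log.Fatal(err)
        }
    }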


Indeed, "SQLite does not compete with client/server databases. SQLite competes with fopen()."

https://www.sqlite.org/whentouse.html


Another alternative that I've seen used is a zip or tar with no compression if you are just appending files and reading but only rarely updating or deleting.

But sqlite is still better: it is more reliable (a bad write to the zip index at the end of the archive can destroy the whole archive), and sqlite also gives you a lot more potential flexibility later on if you need to add metadata or something else. It is better in terms of inserts and deletes, although you will still need to vacuum.


A bad write on a zip file destroys the O(1) seek time, but it doesn't destroy the zip. That goes back to PKZip trying to work on floppies and over modems. You can still do an O(n) seek on a particular file, or expand and recompress the file to recover whatever isn't truncated.

For this situation it does matter, but it is recoverable.


Oh interesting! Good to know


That property came back into use when we crossed the 2GB and then 4GB threshold for archives, because the TOC uses a 32 bit integer to reference backward into the file. A lot of implementations used a signed int, and even when they fixed it, organic growth in bandwidth and thus content size ran you into the wall in another couple of years. We had collectively mostly supported UTF-8 filenames around that same time period, so it was going from one problem to the next (or choosing between implementations that had fixed one but not the other).

For the 32 bit address problem, you could read forward from the front, and as long as no entry was over 4GB, you could still read the file. If the file count was low enough you could cache all of the entry objects, and if reads dominated opens, then you were functionally back to O(1) access (but O(n) startup).

There was a 64 bit extension going around, but when I stopped working with zip files every day I stopped tracking the progress.


You could have easily done that with a single file with a tail header for lookups.


I'm aware; I mentioned this at the end of my post. This is still easier. I want to clearly express how absolutely trivial it was to use SQLite for this. I didn't really have to think about anything.


Easier than SQLite? With fewer bugs and better tests?

SQLite advertises itself as an fopen replacement. Sounds like a perfect match for parent’s use case.


Depends on the indexing needs, right? Something like IFF should be fine if you need a sequence of images to store that you don’t need to actually query for individually.


Always, but I find it really hard to see how not using SQLite in this use case is the better option. Maybe if you need every last percent of performance? In 99% of other cases, just go with SQLite.


Yes easier, as in time to integrate and get working correctly. I’ve used SQLite extensively over the last 10 years and yes it’s a good solution, but not a replacement for fopen. A flat file would be easier to integrate, test, and harden over SQLite. Would take about the same time as integrating SQLite into a system for the same purpose and would be easily extendable to support features as the system grows.


> Would take about the same time as integrating SQLite into a system for the same purpose and would be easily extendable to support features as the system grows.

I find it much easier to add features to my post-2007 projects (when I started using SQLite) for the specific reason that I can open the SQLite file in a GUI and pretty quickly see what’s going on with data organization (schema) and how the customer uses the software I wrote (ie by what columns they actually use/misuse).

Prior to that, there’s various versions of my b-tree library and lots of zips, or linear text indexes, or any combination of whatever fit the need. Data storage implementation needs to be reasoned about in detail for each pre-2007 revisit in ways that don’t happen with SQLite projects.


What's the UI you use for viewing SQLite files?


At a client site, to debug issues and deal with things like bulk product name changes, I use https://sqlitebrowser.org/

For more involved work that I do at my shop or on my laptop, DataGrip by JetBrains is great. Before I got fed up with Apple, I used https://menial.co.uk/base/

DataGrip’s benefit to me is mainly reworking a customer DB at the ER diagram level, then I manually update my code to match.


By the way, SQLite started out as a TCL extension. It works really well in TCL and good old TCL/TK/SQLite gets CRUD done fast. I almost feel bad looking so heroic.


It's not uncommon to see people advocate for using sqlite as an alternative to flat files or json blobs. In particular, transactions provide nice properties vs a web of separate flat files, and the stronger schema can be a good alternative to json. There's a lot of great existing sqlite tooling, too.


Not only other people, the creators also advocate for it as a file format:

https://www.sqlite.org/aff_short.html

https://www.sqlite.org/appfileformat.html


Sqlite archive files are somewhat interesting too: https://www.sqlite.org/sqlar.html


The history around mongrel2 using SQLite as a config file format may be instructive. Seems like there was a lot of resistance in that case.


SQLite makes seeking in a file using a lookup table look like a complex operation from the developer’s perspective, which is absolutely a good thing unless you have a strong alternative rationale for your use case.


> then rename it to overwrite the existing file ... This makes me think that probably my initial file tests weren't waiting for fsync, but the rename forces the application to wait. I'm not sure.

I can answer this one.

This is conflating atomicity (atomic rename) with durability (fsync; whether data is guaranteed to be on disk and will survive power failure). These are two orthogonal things.

When you use sqlite with default settings, it will give you both atomicity and durability.

If you want to get both in a file system like btrfs, you need to (1) fsync the written file, (2) do the atomic rename, (3) fsync the directory that contains the renamed file.

Some file systems, like ext4, do automatic fsyncs "every couple seconds", but that is of course not something to rely on when benchmarking.
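
Concretely, the three steps look something like this in Go (a sketch with illustrative names, not the benchmark's actual code):

    package main

    import (
        "os"
        "path/filepath"
    )

    // atomicDurableWrite sketches the three steps above: fsync the data,
    // rename into place, then fsync the containing directory.
    func atomicDurableWrite(dir, name string, data []byte) error {
        tmp := filepath.Join(dir, name+".tmp")
        f, err := os.Create(tmp)
        if err != nil {
            return err
        }
        if _, err := f.Write(data); err != nil {
            f.Close()
            return err
        }
        if err := f.Sync(); err != nil { // (1) data blocks reach the disk
            f.Close()
            return err
        }
        if err := f.Close(); err != nil {
            return err
        }
        if err := os.Rename(tmp, filepath.Join(dir, name)); err != nil { // (2) atomic replace
            return err
        }
        d, err := os.Open(dir)
        if err != nil {
            return err
        }
        defer d.Close()
        return d.Sync() // (3) make the rename itself durable
    }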


I’m the author. Out of curiosity, how did you find and post this before I did?


i follow you on github :)

sorry!


Ha! No worries.


Go is perhaps not the best tool to use for this as calling sqlite via Cgo will incur a penalty. Might be significant in a very hot loop.


Go’s the tool I’m using to build a side project, so I was really only curious about Go + SQLite vs Go + file system.


You might find these projects interesting:

- https://pkg.go.dev/modernc.org/ql

- https://pkg.go.dev/modernc.org/sqlite

It’s hard to find much information about them, but the first one seems to be a database similar to sqlite, and the latter seems to be some kind of automated translation of sqlite to Go. I’m not sure either of these will outperform sqlite on CGo, but both of them should eliminate the CGo overhead and might give you some more room.

P.S.: The latter is a bit unorthodox in that it contains separate code for each architecture supported by Go, most likely a reflection of the method of the automated translation.
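
If it helps, switching drivers is mostly mechanical: modernc.org/sqlite registers itself under the driver name "sqlite" rather than "sqlite3". A sketch, assuming the standard database/sql interface:

    package main

    import (
        "database/sql"
        "log"

        _ "modernc.org/sqlite" // pure-Go port, no CGo involved
    )

    func main() {
        // Note the driver name is "sqlite" here, not "sqlite3" as with
        // the CGo-based mattn/go-sqlite3 driver.
        db, err := sql.Open("sqlite", "./tmp.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
    }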


I hadn’t heard of ql, but I tried the latter project, and it was much slower for inserts. Something like 500/second.


Depending on your data, there are also key-value stores like boltdb & badger.


I believe https://github.com/erthink/libmdbx is the highest-throughput transactional KV store with a Golang binding right now. (It’s the current storage backend for https://github.com/ledgerwatch/erigon, which switches quickly to whatever wacky library is optimally performant at the moment.)


I would use the term 'unrealistic' rather than 'unscientific'. Points not discussed:

- Filesystems are not equal, so portability may be a concern. For example, filesystems often have path or filename length limits, limits on the number of files or directory entries per directory, and cannot always store all characters as part of path or file names. Losses may not be evident when they occur or exhibit well-defined behavior, particularly if arbitrary filesystem-related tooling is in use.

- Filesystems do not permit complex indexing structures; they typically offer only a hierarchical indexing paradigm plus OS-level caching.

- Efficiency in real-world use depends on the approximate ratio of disparate reads, writes, and read-and-write access paths you will actually use (typically not just a contrived direct FS-hierarchical single-record write-or-retrieval), as well as the OS cache settings, the backend block storage, and whether there is any chance of contention from other processes.

- Persistence expectations are critical. If you want all writes to last, you are in a very different boat than if you just need raw performance. In the latter case, never touching the disk during normal operations and only periodically flushing memory state to disk is a clear win. With SQLite3 you can do this easily via https://sqlite.org/inmemorydb.html, or slightly less efficiently at the OS level by providing a RAM disk for it to work on. OP could try this and add it to the benchmarks.

- Filesystems do not offer SQLite's additional features such as triggers and well-defined error and write-persist behavior, which are well advised to prevent entire categories of issues on datastores anticipating a degree of longevity and criticality.

Oh, and SQLite3/FS performance via a bunch of Go libraries isn't necessarily the same as raw performance.
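
For the in-memory option, a sketch with the mattn/go-sqlite3 driver (assuming that driver; periodically flushing the state to disk is left to the application):

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        // With ":memory:" every new pool connection would get its own
        // private database, so cap the pool at one connection to keep a
        // single shared store.
        db, err := sql.Open("sqlite3", ":memory:")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
        db.SetMaxOpenConns(1)
    }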


The article doesn’t mention whether SQLite WAL mode is on. That could make SQLite significantly faster: https://sqlite.org/wal.html


It mentions it at the end: ./tmp.db?_timeout=5000&_journal=WAL&_sync=1

Not sure if it was updated after your comment.


A quick google shows there are a few FUSE SQLite implementations. Then you can use grep, ls, etc.


How would you defrag the sqlite file? Or would VACUUM or what have you automagically accomplish that?

Actually, having a sqlite filesystem is intriguing because you could in theory add any kind of metadata or filesystem feature you wanted to (such as a forward-error-correcting checksum field, auto-compressed/decompressed data, dedup, encryption, etc.) It would make it nearly trivial for anyone to experiment with new filesystems.


I don't know, maybe: `touch /mnt/sqlite/.vacuum` could be setup to vacuum?


Why not hook into fstrim for that?


Has anyone tried whether this will give you build performance benefits if you put your application source and/or dependencies on this fs?

I’ve played with a ramdisk before but didn’t notice any difference, probably due to the fs cache.



Try "delete from abcde;" to empty out a million row table with a gb or so of text blobs in it, and see how long that takes compared to deleting a gb of files.


How about a fast filesystem? This is by far the slowest.


That’s fair. It’s the file system I have easy access to. I should spin up a VPS and use ext4 or xfs. I may do that when I have some spare time. It looks like I may get double the perf by making that switch. SQLite also seems to get a bump on different file systems.

Edit: just ran it on EXT4 on Linode.

SQLite: 8.7s

EXT4: 18.3s


I’d bet that xfs is the fastest; it usually is.


If you're using the same file, it would make sense if the db 'warms up' by growing enough to comfortably hold all your data.


Good point. That’s the more probable explanation, actually. I’d imagine the execution plan would be cached within the first batch of requests. I hadn’t considered the file resizing, but it’s a good hunch.


Now fight: https://news.ycombinator.com/item?id=27897427

(Sadly, it seems like both benchmarks are flawed.)


Now you got me intrigued by the concept of a big flat raw sqlite file AS a filesystem... I see there is an old FUSE implementation


For SQLite in this test, writes specifically mean transactions, where each transaction contains exactly one write.


Yeah. It’s simulating the behavior of a CRUD server I’m thinking about writing. I wouldn’t be able to batch writes very easily there, so it would be a transaction per write.
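
For comparison, if the workload ever does allow batching, this is roughly what it buys (a sketch; the table, columns, and struct are illustrative, not the planned server's schema):

    package main

    import "database/sql"

    type record struct {
        ID   int64
        Body string
    }

    // insertBatch groups many writes into one transaction, so the commit/fsync
    // cost is paid once per batch instead of once per write.
    func insertBatch(db *sql.DB, records []record) error {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        stmt, err := tx.Prepare(`INSERT INTO records (id, body) VALUES (?, ?)`)
        if err != nil {
            tx.Rollback()
            return err
        }
        defer stmt.Close()
        for _, r := range records {
            if _, err := stmt.Exec(r.ID, r.Body); err != nil {
                tx.Rollback()
                return err
            }
        }
        return tx.Commit()
    }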


Last I checked BTRFS has heavy write amplification for small files. I definitely wouldn't use it as a DB


Interesting, I always found postgres to feel slower, but never went to the hassle of benchmarking it against mysql.


TL;DR: yes, probably for a real-world use case.

Anyway, the perf is good enough for what I need, and the flexibility of querying makes it well worth it. There is something nice about plain text files, though.



