
35% Faster Than The Filesystem (2017) - Tomte
https://www.sqlite.org/fasterthanfs.html
======
zaroth
For small- to mid-sized projects, I’ve always realized huge gains in
simplicity by having “Files” tables to store various assets.

It means instances in a web farm can easily pull the files down when they
initialize, it means files are automatically versioned, it provides an obvious
place to put the files when they are being uploaded on the Admin panel. It
means all the files are getting backed up as part of the database snapshots
automatically.

Things I’ve used this for range from static HTML assets, to logos which are
used to dynamically brand a page, to files that get attached to transactional
emails, etc.

So yes, if you are building a massive scale system holding millions of files,
sure, find a purpose-built system to store, distribute, version, and backup
your files.

But for many projects I believe the files-in-database approach is dismissed
too early for not being “clean” enough. Having a single source (the DB)
for everything the app needs to initialize is a massive simplification. You
may find it’s well worth the trade-offs. Really, the only downside I ever found
— at the scale I was operating at — was that it bloats the database backups.

~~~
perspective1
My (naive?) view is that Nginx caches static files by default after the first
hit so the inconvenience isn't worth it _in my web apps._ Would I really see
an improvement switching to a SQLite-esque file system?

~~~
majewsky
> My (naive?) view is that Nginx caches static files by default after the
> first hit

That's likely the OS's filesystem cache, not anything in nginx. If you're on
Linux, you can confirm this by clearing the OS filesystem cache with

    
    
      $ sudo -i
      # echo 3 > /proc/sys/vm/drop_caches
    

and checking if the next hit on your URL looks like a cold load.

------
blinkingled
> The performance difference arises (we believe) because when working from an
> SQLite database, the open() and close() system calls are invoked only once,
> whereas open() and close() are invoked once for each blob when using blobs
> stored in individual files. It appears that the overhead of calling open()
> and close() is greater than the overhead of using the database. The size
> reduction arises from the fact that individual files are padded out to the
> next multiple of the filesystem block size, whereas the blobs are packed
> more tightly into an SQLite database.

open() and close() are significantly slower[1] on non-Linux OSes so this would
be even more drastic on say Windows or macOS.

[1] I once had to debug a library that was a lot slower on Mac and Windows, and
it mostly turned out to be because it was doing open/close of files instead of
caching the handles - on Linux it was barely showing up on the perf traces and
I had no easy way to debug on the other OSes. On a hunch I cached the file
handles and the performance difference was mostly gone.
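
Roughly, the fix looked like caching the handles in a small table and reading
at explicit offsets. This is only an illustrative sketch with made-up names,
not the actual library code:

    /* Illustrative sketch (not the real library): cache open descriptors so
     * repeated reads of the same file skip the open()/close() pair. */
    #include <fcntl.h>
    #include <string.h>
    
    #define CACHE_SLOTS 64
    
    struct fd_cache_entry {
        char path[256];
        int  fd;                /* 0 means the slot is unused */
    };
    
    static struct fd_cache_entry cache[CACHE_SLOTS];
    
    /* Return a descriptor for path, opening it only on the first request. */
    static int cached_open(const char *path)
    {
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].fd > 0 && strcmp(cache[i].path, path) == 0)
                return cache[i].fd;              /* cache hit: no syscalls */
        }
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].fd == 0) {              /* first free slot */
                strncpy(cache[i].path, path, sizeof cache[i].path - 1);
                cache[i].fd = fd;
                break;
            }
        }
        return fd;
    }

Callers then read with pread() at explicit offsets, so sharing the cached
descriptor's file position is not a problem.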

~~~
vdfs
Also, it doesn't make for a surprising result: of course something that uses
open/close once will be faster than something that uses open/close n times.

~~~
blinkingled
It gets weird when you throw Linux in the mix - open/close are extremely cheap
on Linux, to the point of being free - at least used to be before Spectre and
Meltdown and what not. So on Linux doing n more open/closes isn't much
different than doing a single pair if n is not humongous and the program is
doing other things besides just open close.

------
muststopmyths
stat to get the file size is a pathologically bad pattern on Windows. If I remove
the call to fileSize() in kvtest and instead replace it with 16k (since the
max size in the post is 12k), I can significantly improve the time measured on
Windows 10.

Before:

    
    
      C:\temp>kvtest run test1.dir --count 100k --blob-api 
      --count 100000 --max-id 1000 --asc
      Total elapsed time: 9.495
      Microseconds per BLOB read: 94.950
      Content read rate: 104.0 MB/s
    

After:

    
    
      C:\temp>kvtest run test1.dir --count 100k --blob-api
      --count 100000 --max-id 1000 --asc
      Total elapsed time: 5.218
      Microseconds per BLOB read: 52.180
      Content read rate: 313.3 MB/s
    

These are the stable-ish numbers after a couple of runs of each version, to
let the cache warm up.

Windows is a completely different beast than Unix and you need to know the
beast to tame it.

You might say that stat+open+read+close should be compared/measured together.
In that case my answer would be that you would not read all files in a
directory in this way on Windows anyway. You'd use overlapped I/O with buffers
of FS cache granularity (which is 64Kb, unless it has changed in recent
versions of Windows). It wouldn't matter for smaller files anyway (since the
OS aggressively reads ahead when you open a file) and would be better for
larger files.

It is a bugbear of mine when I see complaints about Windows from people who
never actually tried to optimize for the OS.

Reading blobs from the SQLite db is still ~3x faster on my Surface Pro 2017
(don't call it a Surface Pro 5!), but that's to be expected.
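
For reference, the change is conceptually along these lines. This is only a
sketch, not the actual kvtest.c code, and the 16 KiB figure just comes from
the max blob size mentioned above:

    /* Sketch only: read a small blob file without a per-file size query.
     * Instead of stat()-ing each file to size a buffer, read into a fixed
     * 16 KiB buffer that is known to be larger than any blob in the test. */
    #include <fcntl.h>
    #include <unistd.h>
    
    #define MAX_BLOB 16384
    
    static ssize_t read_blob(const char *path, unsigned char buf[MAX_BLOB])
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        ssize_t n = read(fd, buf, MAX_BLOB);  /* short read gives the size */
        close(fd);
        return n;
    }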

~~~
SQLite
Thanks for this. I did not realize that fileSize() was a performance issue on
Windows. I'll attempt to update kvtest.c to deal with that, rerun the tests,
and update the page, as I have opportunity.

~~~
pdimitar
Apologies for a hijack but since you're here, I really wanted to say thank you
for sqlite! Saved me many headaches and made me a better programmer. You're
amazing for creating it.

And if you don't mind a question: are you guys up for making a strictly typed
variant of sqlite? I'm aware it is going to be a completely different beast
and many wouldn't expect backwards compatibility.

Just curious if such a strongly-typed variant is on your radar at all.

------
realshowbiz
I remember a job interview where the take home project was to parse a log file
and search/sort by various top attributes as fast as possible.

I thought about writing something from scratch, but I knew that whatever I
could optimize in an hour or two would be nothing compared to more robust
tools that already existed. So I used sqlite to ingest the log once, and query
it in various ways to produce the requested reports. The script was just a
dozen lines long or so, with one dependency (sqlite pkg).

The interviewer didn’t like this answer, and didn’t say why. But I’m glad to see
that I wasn’t way off base about how performant sqlite can be.

My answer was of course not a prod solution, just an answer to a question with
a narrow scope.

~~~
otterley
The interviewer didn't like the answer because it didn't answer the implicit
question, which is how well you can write your own software. We know that you
can load a table into SQLite and have it do the heavy lifting for you. But
we're often not looking for software _users_ as much as we're looking for
_engineers_.

When we interview people, the answers we want often aren't real-world answers
because we're trying to get a deeper understanding of how you think or how
well you know the details of a complex mechanism.

~~~
notyourday
> But we're often not looking for software _users_ as much as we're looking
> for _engineers_.

What the applicant illustrated is that he is a much better engineer than the
interviewer and the interviewer did not like it.

Good software engineering is about leveraging existing robust tools in a new
way, not about re-inventing a wheel while making it square.

For me that would have been an insta-hire.

~~~
ImNotTheNSA
I disagree. The applicant should have done more to understand the requirements
for the task. That’s a massive part of software engineering, and you would be
surprised how much time software engineers will spend _solving the wrong
problem_. He didn’t understand that part of the problem was rolling a novel
solution, or that use of a database might be restricted. Almost a daily problem for
software engineers is extracting this information from customers or systems
engineers.

> Good software engineering is about leveraging existing robust tools in a new
> way, not about re-inventing a wheel while making it square.

Use of external libraries or solutions is often very restrictive (if not
banned) in my field. We can use existing internal or known verified solutions.
Not everything is so black and white, and not all options are always on the
table. They rarely ever are.

I don’t really think “leveraging existing tools” is a defining part of
software engineering... though it certainly is part of it. Software
engineering is much more than just coming up with solutions.

~~~
notyourday
Some people do enjoy reinventing a wheel and making it square. And lots of
managers enjoy having those who work for them do it. It makes the managers
feel smart.

The applicant solved a problem that was presented to him. If the person who
interviewed him did not want him to do it that way, he or she would have
specified it.

------
dTal
This shouldn't be especially surprising - in fact I would expect this to be
the case, all other things being equal, for two reasons: firstly, open()
followed by read() involves two kernel-userspace context switches, while
SQLite operating on an already open file only needs to call read(), and
secondly SQLite is more aggressive about making forwards-incompatible changes
than most popular filesystems, giving it more latitude to optimize (ZFS is an
exception, but with most filesystems you expect a 10-year old computer to read
a disk created yesterday).

------
Const-me
The fact that SQLite is much faster than file systems is not surprising at
all. But there’s one surprising result there.

> All machines use SSD except Win7 which has a hard-drive.

And yet Win10 is much slower than Win7 in all tests (note they have latency on
the graphs).

How’s that possible? For a drive spinning at 5400 RPM, average random read
latency can’t be smaller than 2.2ms, since that is how long it takes, on
average, to rotate the disks 180°. An SSD should do much better than that.

I have a couple of possible explanations. Either that particular Win10 machine,
a 2016 Lenovo YOGA 910, has an _extremely_ slow SSD, or there’s something wrong
with the software/drivers/config of the PC, like an antivirus or wrong chipset
drivers.

~~~
karmakaze
There's the other obvious explanation: that _because_ the Win7 machine has
slower storage, it automatically uses a larger cache than the Win10 machine,
which more than offsets the difference.

~~~
Const-me
The Windows kernel uses all free RAM as a file cache. It’s a low-priority one,
i.e. when an app wants to allocate RAM, the cached file system pages are
evicted. Their Win7 test machine has 4GB RAM, the Win10 one 16GB.

I think I know what’s going on. They don’t publish absolute numbers, they only
publish results relative to SQLite on the same machine. Windows has an
approximately fixed overhead when opening files. Most of that overhead goes to
access checks; NTFS security descriptors are way more sophisticated than the 9
bits of security in Linux. With a fast disk that overhead dominates the cost,
and SQLite doesn’t open any files while running the test, since the DB is
already opened at startup. With a slow disk, that overhead is masked by the
large IO cost, which affects both SQLite and NTFS.

~~~
kukx
Nitpick, but file permissions in Linux normally take 12 bits. You probably did
not count SUID, SGID and the sticky bit.

~~~
Const-me
I’ve only counted these famous RWX bits, settable by chmod. You’re right,
there’s more security metadata involved in Linux.

Still, Windows permissions are way more complicated, especially when AD
domains are involved. These access control lists have arbitrary count of
entries each. There’s another list, SACL, for audit. Access permissions are
inherited through file system with non-trivial inheritance rules, e.g. allow +
deny = deny. User permissions are inherited through group membership, and
security groups can be nested i.e. include other groups. Security descriptors
may include custom permissions unknown to MS. Any thread can run as any user
in the forest through client impersonation; many network servers do. The
kernel tries to apply effective permissions instantly, as soon as anything
changes (but not for every read/write, as that would be too slow; it only does
access checks when opening files).

Combined, these features make access checks way more expensive in Windows, by
orders of magnitude.

------
c-smile
Slightly off the topic.

In one of my web applications (CRM-like) I was experimenting with an SQLite-
per-user approach. In that application, user-related data was clustered on a
per-user basis.

So instead of single-db-for-all-users I had one-db-per-user.

The performance gain was significant - around 25% for most requests. I suspect
that this is due to the fact that index trees were significantly shorter for
each particular user, data was more local, etc.

Also, db schema updates and other batch operations can be made without
stopping the server; just handle the db files one by one.
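
For illustration, handling the files one by one looks roughly like this
(paths, table and column names are made up; the real migration code was more
careful):

    /* Sketch only: apply a schema change to each per-user database in turn,
     * so the server keeps running while the small files are migrated. */
    #include <sqlite3.h>
    #include <stdio.h>
    
    static int migrate_user_db(const char *path)
    {
        sqlite3 *db;
        char *err = 0;
    
        if (sqlite3_open(path, &db) != SQLITE_OK)
            return -1;
        /* Example migration; table and column are invented. */
        int rc = sqlite3_exec(db, "ALTER TABLE contacts ADD COLUMN notes TEXT;",
                              0, 0, &err);
        if (rc != SQLITE_OK) {
            fprintf(stderr, "%s: %s\n", path, err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return rc == SQLITE_OK ? 0 : -1;
    }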

Just in case.

~~~
SiVal
This is _exactly_ the topic that made me read this thread. I have been
wondering whether one sqlite file per customer was a good option for an online
service where each customer needed only his own, private data but wanted to
work on it from his desktop, laptop, or phone, with the option of occasionally
downloading the file for additional backup. I just don't know enough about
multi-tenant database approaches to know if this is a good idea or absurd, and
I haven't found anyone talking about it.

Any pointers to further insights would be greatly appreciated. Even a forum
where (utterly unlike StackOverflow) I could ask such a question without
already almost knowing the answer.

~~~
carapace
SQLite in the browser...

[https://en.wikipedia.org/wiki/Web_SQL_Database](https://en.wikipedia.org/wiki/Web_SQL_Database)

It was killed though. Derp.

------
the8472
> The performance difference arises (we believe) because when working from an
> SQLite database, the open() and close() system calls are invoked only once

Yes, open() can introduce significant overhead, especially when doing it
sequentially on a remote filesystem, e.g. NFS.

But that's using the filesystem in the most inefficient way possible; there
are better ways to do it. E.g. you can pre-touch the files or even pre-open
them in parallel and then point to /proc/self/fd/<N> instead, which turns an
open() into a dup(). I was able to make some proprietary software run 5x
faster that way, simply by feeding it already-open file descriptors.
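
A rough sketch of the trick; how the alias path gets handed to the program
varies, this only shows the aliasing itself:

    /* Sketch only: pre-open a file and hand the consumer a /proc/self/fd/<N>
     * path, so its later open() resolves an already-open descriptor (a dup()
     * in effect) instead of doing a fresh path lookup. Works for the same
     * process or a child that inherits the descriptor. */
    #include <fcntl.h>
    #include <stdio.h>
    
    /* Writes a /proc/self/fd/<N> alias for real_path into out.
     * Returns the descriptor; keep it open while the alias is in use. */
    static int preopen_alias(const char *real_path, char *out, size_t outlen)
    {
        int fd = open(real_path, O_RDONLY);
        if (fd < 0)
            return -1;
        snprintf(out, outlen, "/proc/self/fd/%d", fd);
        return fd;
    }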

------
wolf550e
Beware opening sqlite files from untrusted sources.

[https://research.checkpoint.com/select-code_execution-
from-u...](https://research.checkpoint.com/select-code_execution-from-using-
sqlite/)

~~~
sigzero
That's true for everything.

~~~
wolf550e
Simpler file formats are less fraught, though they have fewer features.

~~~
gridlockd
What's simpler than SQLite that would serve a similar purpose? If it's not a
"battle-tested" format, it's still an issue no matter how simple the format
is. Remember, your average C programmer can easily fit 11 buffer overflows in
10 lines of code.

------
wayneftw
Wow, so maybe instead of a node_modules folder, npm should use a
node_modules.db?

~~~
bob1029
This sounds like a _fantastic_ idea. You would go from 4000 directories and
50000 files to a single file with b-tree indices, hashes, etc. You could
probably get a 100x speedup over the current approach.

~~~
dymk
Here you go:
[https://github.com/guardianproject/libsqlfs](https://github.com/guardianproject/libsqlfs)

~~~
Fnoord
Thank you, I was reading the comments exactly to figure out whether there was a
FUSE sqlfs module. TL;DR: this URL describes that, yes, there is. Why is this
relevant? Because this way you keep the flexibility of a filesystem and, as
a consequence of it being able to function like a filesystem, you keep the
"power of Unix/Linux", i.e. all your <asterisk>nix [1] tools keep working.

[1] Calling it Nix confuses with NixOS...

~~~
mkl
So call it *nix? It takes two asterisks to turn on italics between them.

~~~
Fnoord
Cheers. Yeah, I guess it triggered because I used two asterisks in my text,
since originally I used the term twice.

------
gigatexal
SQLite really is the best of both worlds when it comes to files and databases:
it is a file, one that can usually live entirely in memory; and it’s got all
the awesome utility of SQL (even window functions too) that can be used very
simply to do powerful things.

------
microcolonel
It really depends on which filesystem, too. You'll notice here that Microsoft's
NTFS implementation is horrendously slow; HFS+ is several times faster, but
still slower than SQLite.

Since they don't bother explaining which Linux filesystem they used, I suspect
they mean EXT4, and the particular version of EXT4 in the kernel in that
particular version of Ubuntu.

The real news is: ext4 is about as fast as SQLite for small objects.

~~~
Zenst
Which is why for decades many DB offerings offered raw or cooked storage.
Cooked meant that they used the file system; raw meant using the device
directly. I'm not sure of the upsides and downsides today, as I last worked in
that area about 20 years ago, but back then raw storage was very much a
noticeable performance gain. But then you are equally tied into the vendor's
backup mechanisms.

~~~
Spooky23
I ran a large (at the time) Informix implementation in the late 90s/early 2000s.

Raw disk was 35% faster than the file system (Solaris UFS) on our transaction
systems.

~~~
Zenst
As did I, on AIX and NCR towers. Did you find you had to offset the start
block of the raw partition? With AIX, even though the OS saw it as raw, it
would stamp over the first 8k (iirc) with volume control data, and with that
could easily corrupt a raw partition if you didn't factor in an offset.

~~~
Spooky23
I don’t recall the details, but we used Veritas Volume Manager for managing
the raw disk, and I’m pretty sure vxvm either hid that or had a driver that
didn’t do it.

Our restore run book was scripted and tight, so we may have been doing it
ourselves, but I didn’t write the code or at least didn’t break it! Fun times.

------
cfstras
This might be the answer to how to make file-heavy tools like npm/yarn work
adequately on Windows...

------
Tepix
What if you create something that is optimized for performance without
offering the full SQL language? Just a container to store small files with
padding to avoid the expensive open/close calls.

~~~
liability
I'm not sure how much better you'd get than sqlite if you were using a
straightforward rowid select with a prepared statement. Sqlite is pretty damn
snappy for something like that and using a prepared statement amortizes the
cost of compiling the SQL.
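
Roughly what that looks like with the plain SQLite C API; the table and column
names are invented for the example:

    /* Sketch only: read blobs by rowid through one prepared statement.
     * The SQL is compiled once; each lookup just binds, steps and resets. */
    #include <sqlite3.h>
    #include <string.h>
    
    static sqlite3_stmt *stmt;      /* prepared once at startup */
    
    int kv_init(sqlite3 *db)
    {
        return sqlite3_prepare_v2(db,
            "SELECT data FROM blobs WHERE rowid = ?1;", -1, &stmt, 0);
    }
    
    /* Copies up to buflen bytes of the blob into buf; returns bytes copied
     * or -1 if the row was not found. */
    int kv_read(sqlite3_int64 rowid, void *buf, int buflen)
    {
        int n = -1;
        sqlite3_bind_int64(stmt, 1, rowid);
        if (sqlite3_step(stmt) == SQLITE_ROW) {
            n = sqlite3_column_bytes(stmt, 0);
            if (n > buflen) n = buflen;
            if (n > 0) memcpy(buf, sqlite3_column_blob(stmt, 0), n);
        }
        sqlite3_reset(stmt);
        return n;
    }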

~~~
zzzcpan
You can get a point read with just a single aligned random I/O operation on
disk with this. I don't see how a B-tree based storage can compete.

And even more optimizations are possible, like grouping objects often accessed
together into a single batch to retrieve the whole group in just one random
operation too.

~~~
liability
I'm sure you can go faster, but how much faster? The rowid b-tree lookup, with
the b-tree in memory, will almost certainly be blazing fast compared to the
actual disk read.

------
OJFord
Next week on Show HN: `sqlitefs`, the SQLite-backed FUSE filesystem?

~~~
ruuda
Something similar exists:
[https://www.youtube.com/watch?v=wN6IwNriwHc](https://www.youtube.com/watch?v=wN6IwNriwHc)

~~~
ngirard
Interesting project, thanks.

For the record, their GitHub repository is [https://github.com/samzyy/DB-
based-replicated-filesystem](https://github.com/samzyy/DB-based-replicated-
filesystem)

------
vr46
Billions of tech-years ago, BeOS made a filesystem that had database-like
properties. Now, mainly because I have no idea what I’m talking about in this
area, I don’t know if anything sprung from that into mainstream computing, so
features like this from SQLite make the news. (It’s very cool, to be sure)

------
notyourday
As someone who has spent time consulting for different kinds of "we do stuff
on the internet" companies, I can confidently say that this is a premature
optimization for 99.999% of the companies/projects. The companies simply don't
have enough IO traffic to need it.

Is it cool? Sure. Is it sexy? Maybe. Is it _needed_? Nope.

Do boring stuff. Use files. If you send your small files over the internet via
web and you need to squeeze additional performance, do boring stuff again: use
Varnish.

~~~
fvdessen
> Do boring stuff. Use files.

I have the opposite experience. Files are far from 'boring'; they have too
many gotchas to count. Putting files in the db is for many use-cases actually
easier than putting them on the filesystem. By putting all data in the db you
have fewer moving parts, transactions, referential integrity, well-defined
behaviours, etc. The speed increase is just a nice side effect.

~~~
notyourday
> By putting all data in the db you have less moving parts, transactions,
> referential integrity, well defined behaviours, etc

If you are doing it for 100k files and below you are using a Ferrari to pick
up eggs in a corner store. If you are using it at 100k files and above, you
are using a Ferrari to move a pile of paving stones one by one.

~~~
derefr
Show me a filesystem that can efficiently hold onto all the inode information
for a blockchain represented as files and directories. (Hint: not even LevelDB
can hold onto all the trie information efficiently; solutions are being sought
that pack things tighter than LevelDB.)

~~~
notyourday
You don't _need to do it efficiently_. You are optimizing for a non-existent
problem.

Here's what you actually need to optimize for: you have a hundred million
files. They are all somehow reachable via
[http://origin/someuniqueurl](http://origin/someuniqueurl). You have multiple
copies because you aren't an idiot and you know that users actually hate
either losing files or getting corrupted files back. You have fingerprints
associated with every copy. Something happens and you can't reach part of the
tree, or you are getting incorrect hashes ( which you know because your origin
computes the hashes every time someone requests one, and if those hashes
don't match it triggers a re-request to a backup copy ). And now you need to

(a) ensure you still have the needed protection factor ( say you went from 3x
to 2x as one copy is dead ) for all affected files

(b) minimize the time needed to remove the "thing" ( probably a disk or a node
) that caused the failure.

(c) minimize the cost of ensuring (a) and (b)

People spent lots of time figuring out very clever ways of doing it via DB. It
neither scaled nor worked well when real life (i.e. disk crashes/nodes going
away/bad data was returned) happened.

The only thing that you really need to know is

(a) where the data and its copies are stored.

(b) what is the hash of the data that is stored there

(c) how fast you can recover the information about where that data is stored,
in the event of an issue with whatever system is used to keep track of (a) and
(b).

(d) preferably you also want to know what other data objects ( files ) could
be affected when some of the other objects are misbehaving ( if you know a
file at nodeA/volumeB had a fingerprint
Ykntr8H8pL9PyAtCwdw/CB5tToXTPf55+hKSZb0uhV0 when it was written and now for
some reason it says its fingerprint is
sw0SI4jkU5VVk6CH4oPYtwx+bK2hIlrw8hVM7i9zmNk while the rest of the copies are
saying their fingerprint is still Ykntr8H8pL9PyAtCwdw/CB5tToXTPf55+hKSZb0uhV0,
you could decide that you want to trash nodeA/volumeB because you no
longer trust it ).

~~~
derefr
Er, yes, you do need to do it efficiently. A blockchain needs to read about
1000 of these “files” per transaction in order to verify the next block (and
there are ~10000 transactions per block, and the accesses have no special
locality because they’re a mixture of requests submitted by unaffiliated
parties.) It’s an at-scale OLTP system (ingesting thousands of TPS) that is
required to execute on consumer hardware.

This is a real problem. The current implementation of Ethereum is IOPS-bound
precisely because of this overhead, with ~80% of the time spent verifying a
block being spent on the filesystem and index-access overhead of getting the
relevant data into memory, rather than being IO-bound by the actual bandwidth
of data required. This is an on-disk datastructures problem.

~~~
notyourday
Oh, this is the block chain discussion i.e. another solution in search of a
problem. Never mind.

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=14550060](https://news.ycombinator.com/item?id=14550060)

------
pedrocr
In the past I've had huge speedups by moving simple single-table databases
that had grown a bit (e.g., time series data) from sqlite to postgres. Insert
performance is also quite bad, forcing you to write applications with extra
caching layers to be able to do a bunch of inserts at once. Sqlite is great
for many applications, but its speed is somewhat oversold. I really wish the
postgres engine were easily embeddable into applications.
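
For what it's worth, the usual shape of "a bunch of inserts at once" in SQLite
is wrapping the batch in one transaction, so the journal is synced once per
batch rather than once per row. A rough sketch, with an invented schema and
error handling omitted:

    /* Sketch only: wrap many inserts in a single transaction so SQLite
     * syncs to disk once per batch instead of once per row. */
    #include <sqlite3.h>
    
    int insert_batch(sqlite3 *db, const sqlite3_int64 *ts,
                     const double *vals, int n)
    {
        sqlite3_stmt *ins;
    
        sqlite3_exec(db, "BEGIN;", 0, 0, 0);
        sqlite3_prepare_v2(db, "INSERT INTO samples(t, v) VALUES(?1, ?2);",
                           -1, &ins, 0);
        for (int i = 0; i < n; i++) {
            sqlite3_bind_int64(ins, 1, ts[i]);
            sqlite3_bind_double(ins, 2, vals[i]);
            sqlite3_step(ins);
            sqlite3_reset(ins);
        }
        sqlite3_finalize(ins);
        return sqlite3_exec(db, "COMMIT;", 0, 0, 0);
    }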

~~~
liability
I once tried out some RSS aggregator/reader that required me to install and
set up postgres for no good reason other than it's what the developer was
accustomed to. _Huge_ pain in the ass, particularly since simply using sqlite
would have been perfectly adequate (does an RSS reader _really_ need concurrent
writes? _Really?_ )

For any software meant to be used on a non-techie's desktop, anything other
than an embedded 'zero config' arrangement is a nonstarter.

~~~
pedrocr
This is exactly true, which is why embedding postgres would be great. It's
very easy to hit the performance limitations of sqlite and then be stuck with
the choice of either poor performance or forcing users to set up postgres like
you experienced.

------
herf
Wonder what the number is with "noatime"? Every open is a write on a normal
filesystem, but you can turn it off.

~~~
wolf550e
The default (relatime) is to update atime only once every 24h, so it's
negligible for many workloads.

~~~
ysleepy
Or more importantly atime is updated in tandem with mtime.

------
toxik
While an impressive feat on SQLite’s end, I sincerely hope no engineers see
this and take it at face value. The filesystem is slower because it is more
flexible, and I would imagine that once you have concurrent reads and writes
to your thumbnails, a filesystem is preferable.

~~~
chias
IMO the point of this article is pretty clearly to answer the developer
question "my application needs to read and write data, should I use the
filesystem or should I use a sqlite db?" and not make some kind of sweeping
"sqlite > fs" claim. In this context, face value is fine.

