Hacker News new | past | comments | ask | show | jobs | submit login
35% Faster Than The Filesystem (2017) (sqlite.org)
267 points by Tomte 63 days ago | hide | past | web | favorite | 164 comments

For small- to mid-sized projects, I’ve always realized huge gains in simplicity by haves “Files” tables to store various assets.

It means instances in a web-farm can pull the files down when they initialize easily, it means files are automatically versioned, it provides an obvious place to put the files when they are being uploaded on the Admin panel. It means all the files are getting backed up as part of the database snapshots automatically.

Things I’ve used this for range from static HTML assets, to logos which are used to dynamically brand a page, to files that get attached to transactional emails, etc.

So yes, if you are building a massive scale system holding millions of files, sure, find a purpose-built system to store, distribute, version, and backup your files.

But for many projects I believe the files-in-database approach is dismissed too early for not being “clean” enough. But having a single source (the DB) for everything the app needs to initialize is a massive simplification. You may find it’s well worth the trade-offs. Really the only downside I ever found — at the scale I was operating at — was that it bloats the database backups.

I though, one of the advantage here is, you are able to store/read metadata along the file content easy without worrying the metadata and file may desync in some way or corrupted. This makes serve something like Image (photo dimension) or video(length) metadata way easier. Some filesystem (like ntfs) do have place to save metadata along files, but most filesystem don't have the way to done that properly

You're referring to extended attributes, which are like a key/value database for each file. Wikipedia says:

> In Linux, the ext2, ext3, ext4, JFS, Squashfs, Yaffs2, ReiserFS, Reiser4, XFS, Btrfs, OrangeFS, Lustre, OCFS2 1.6, ZFS, and F2FS[9] filesystems support extended attributes (abbreviated xattr) when enabled in the kernel configuration.

> The Linux kernel allows extended attribute to have names of up to 255 bytes and values of up to 64KiB,[13] as do XFS and ReiserFS, but ext2/3/4 and btrfs impose much smaller limits, requiring all the attributes (names and values) of one file to fit in one "filesystem block" (usually 4 KiB).

Applications that make extensive use of xattrs typically recommend one of the FSs that allow full-size xattrs. For example, OpenStack Swift, an object storage service, recommends XFS as a backing storage. (Source: I operate Swift clusters at $dayjob.) But if your metadata is not larger than 4 KiB, any production-quality Linux FS will do.

Then you would encounter another problem, extended attributes arn't usually really portable between files system / operation systems (as the name 'extendded' suggests, it may vary between FSs or even not available at all). And may gone easily if any of the programs not operate on them correctly. Windows would show a big warning if your copy operation may cause the extra metadata along the file to gone, but I am not sure whether linux will do that.

Just store the metadata in the file name. Unless your file system besides metadata also have a 11 character filename limit.

What kind of versioning are your referring to? I know some databases has it but not sure if I understand it.

Personally I use S3 or similar for this. At the number of files I'm working with storing it in a relational database would be crazy expensive.

Scheme could be something like;

FileId, VersionId, Name, Data, UploadedTimestamp, UploadedBy, ...

If you’re on AWS, pushing and pulling from an S3 bucket is probably a great solution, and then of course there’s nothing to worry about in terms of backups.

Do you still need to keep an index of the files / metadata in a DB, or can you tag everything you need directly on the S3 objects and just pull the whole bucket?

Even if you’re not in AWS S3 works great, especially where you need to serve content direct to client’s browsers. S3 supports either making a file public and available to all, or using pre-signed URLs to provide short term authorisation to access a file to specific clients. No need to pull the bucket at all.

You will hit limits to how fast you can generate those crypto signatures for s3, and a limit to how much can serve; at a certain scale you will want to use a cdn

> You will hit limits to how fast you can generate those crypto signatures for s3

Presigned URLs are HMAC-SHA256:


I just measured GetObjectRequest.Presign() at 1063 ns. That's a million signatures per second of my 2015 MacBook Pro using the stock AWS SDK for Go _on one core_. This is plenty fast already, and it's guaranteed to be running in parallel on better hardware in production. There's no way signature generation is a bottleneck in any real application.

> and a limit to how much can serve

I've gotten hundreds of gigabits per second between S3 and EC2 without doing anything special. What problems are you referring to?

> at a certain scale you will want to use a cdn

So use a CDN too. CloudFront fits there neatly, and it can even sign its own requests to S3:


None of those criticisms are valid reasons to avoid S3. S3 is an excellent solution for many common problems.

Having said that, S3 has genuine drawbacks for certain workloads. (For example, the long tail on time to first byte is sometimes a dealbreaker.) In my opinion the biggest limitation is budgetary: typically one will spend all their money on AWS data transfer fees long before hitting S3 service limits.

Tons of tiny files is the problem where the bytes to sign ratio is low.

And I didn't say to avoid s3. I work for a subsiary of AWS I want you to use it a ton.

Let's say your serving up a bunch of thumbnails, maybe 1024 per user per load.

Or in my scenario 25k files. 25k signs per user is 40 users per second per server signing constantly.

For the last 6-8 months I've been working on a hybrid electron + web app that uses files on local disk for storage.

I have a Datastore implementation that uses disk when locally and firebase when remote.


You can read about it there.

I thought it would be a great idea to use just simple flat files as you can commit them to git, diff them, and there's kind of already an API for working with them.

In retrospect I regret it and wish we had something more powerful.

First. It's twice as much code to maintain. Which introduces bugs.

Seconds if you're keeping your PDFs online vs on disk they're duplicated. It would be nice to build in a higher level abstraction that was both network AND disk aware and just pulled things down when you needed it.

Third, the annotations on your PDF aren't really indexable without a 3rd party system.

Getting the right balance of remote-first vs web-first is tough.

Now, after building a complex webapp with electron support I think it's better to go web-first and then try to use the browsers local caching to get offline-first support.

I once did this specifically to get the files into the database backups, so i didn't have to deal with them separately!

The files were the storage for some third-party content targeting tool, and were edited using an admin UI that tool, then deployed to the servers where the live part of the tool would read them. You couldn't bring the system up in a complete state without them.

Apart from those files, we could restore the whole system using the latest code (really, the artifact built from it), and the latest database dump. So, before taking a database dump, we'd run a script to load those files into the database. We also had a script for undumping them, which we tested regularly in our staging environment.

We didn't use the database for deployment, because the third-party tool did that. Perhaps we should have - it would have been a bit of a subversion of the tool's usual mode of operation, but it might have had some advantages.

>Really the only downside I ever found — at the scale I was operating at — was that it bloats the database backups.

Easily solved by having two separate databases. One for dynamic content, one for static files, which is probably good practice regardless.

Doesn't that introduce the desync problem again? Is there something that can enforce consistency across the two databases?

I'm assuming the static and dynamic content don't need to be transactionally integrated. If they do, you could still solve it with two databases, but you'd use one database for the "current" data and another database to store "historical" data.

Also, remember the GP said "small to mid-size."

Backups are fundamentally limited to eventual consistent, there is no need for databases to be synchronously replicated for backups. I mean splitting database has no effect on backup consistency, although a more decent way of dealing with it is not splitting database, but simply running an async replica to do backups from.

Are you are talking past each other...? I suspect the grandparent post is concerned about the loss of ACID properties for updating data and metadata if they are split across two databases. This is a concern for regular application access too. Meanwhile, you assume that this consistency problem is already solved in the application, and so of course the backup and restoration problem can also then be handled.

To the grandparent: you typically solve this application consistency issue by using something like immutable object store semantics with versioned references. The object store is capable of answering requests for multiple versions of the asset, and the application metadata store tracks individual versions. You can sequence the order in which you commit to these stores, so the asset is always available before it is published in the metadata store. Alternatively, you can make consumers aware of temporary unavailability, so they can wait for the asset to become available even if they stumble on the new version reference before the content is committed.

You can also find hybrids where the metadata store is used to track the asset lifecycle, exposing previous and next asset version references in a sort of two-phase commit protocol at the application level. This can allow consuming applications to choose whether they wait for the latest or move forward with older data. It also makes it easier to handle failure/recovery procedures such as canceling an update that is taking too long and reclaiming resources.

Backups on databases like PostgreSQL are strongly consistent, regardless of where you take it from (master or replica). Postgres replication is strictly sequential wrt transaction commit order.

As such, splitting the database may incur in significant consistency issues that a backup doesn't incur into.

I believe this splitting technique is not a good one except for potentially narrow use cases.

Strong consistency is not what you think then. You can only do stale reads from backups.

I thought ahachete meant that a database backup is atomic, i e. it will only contain fully completed transactions. The problem with data split across databases then is that transactions don't span multiple databases, so you can't get an atomic snapshot of data spanning multiple databases.

Are you saying that even a single-database backup is not atomic?

No, it's atomic. It's just not important if separate databases are not atomic to each other when you recover from backups, you will still have data loss and an inconsistent state. I.e. things that should be in the database missing, and that shouldn't be overwriting missing IDs, etc. Backups cannot be strongly consistent, so you have to take the exact same approaches to deal with this whether you store everything in a single database or in two separate ones.

My (naive?) view is that Nginx caches static files by default after the first hit so the inconvenience isn't worth it in my web apps. Would I really see an improvement switching to a SQLite-esque file system?

> My (naive?) view is that Nginx caches static files by default after the first hit

That's likely the OS's filesystem cache, not anything in nginx. If you're on Linux, you can confirm this by clearing the OS filesystem cache with

  $ sudo -i
  # echo 3 > /proc/sys/vm/drop_caches
and checking if the next hit on your URL looks like a cold load.

If the files are being handled programmatically, pull them as byte[] directly from the database.

Once I tried to hookup a way to serve the files directly from the DB through IIS, and it was a total failure. By the time I had figured out getting all the various headers right, I was still running into some weird caching issue that would corrupt the request at random intervals.

So now, if the files are static hosting assets, I will initialize a ‘cache/‘ folder on the local filesystem (for that machine on the web-farm) when the first request comes in and just read all those files back into the local file system for the web server to handle.

The only remaining concern is refreshing the cache when files change in the DB.

Nginx doesn't cache files by default, but it has a file based caching module and a file based store module that can be used to retrieve files from backends and cache on local filesystem.

> was that it bloats the database backups.

Having a deduplicating backup solution might help with that. Simplest way to do that (in terms of ease-of-implementation), would be to keep backups on a ZFS or btrfs filesystem. Note that (AFAIK) with btrfs, deduplication must be triggered manually, so you'd need to run `btrfs filesystem defragment` on the backups to take advantage of it.

Note for other readers: The parent post probably means incremental ZFS snapshots, which won't store multiple copies of any given block unless the piece of data has been moved.

ZFS also has a block deduplication scheme, but turning it on is a poor idea. The deduplication table needs to be stored in memory for decent performance, and this takes a lot of memory. Even then, performance is rarely as good as with dedup turned off. It doesn't even save as much space as you'd hope -- it's per-record, and ZFS storage records are 128KiB by default. If you're trying to deduplicate 10KiB files with that, you're going to have a bad time.

(Note also: The ZFS record size is dynamic; the 128KiB default is the maximum record size it'll allow, i.e. the most data that has to be read and rewritten (somewhere else) if you modify one byte in a larger file. If you just write a single 10KiB file and force it to disk, then you'll get a 10KiB record. It's complicated; try not to read more into this than "dedup is rarely a good idea," though if I'm allowed a second line, I'll add "compression is always a good idea; enable zle at a minimum.")

Have you seen https://icculus.org/physfs/ ?

You could put everything inside a zip and that's it.

Personal opinion: Those things (static assets) belong in a container with Nginx serving them (and that’s not even counting SAAS options like S3). Easy to integrate with CDNs, easy to update, easy to roll out and back. And you get to take advantage of layers to keep the network and disk usage minimal.

I don’t see how a container solves the backing store / source of truth problem?

How are changes getting put into the container? Are you versioning the container each time a file is changed or added? When files change are all instances of the containers being restarted to get the new/changed files?

One answer: the files are stored in git and the image is built on commit and tagged with the current ref, then pushed. The deployment is then updated to reference the new tag and services are restarted, pulling the new image and launching the new container. This is the typical ci/cd setup for kubernetes deployments.

> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database.

open() and close() are significantly slower[1] on non-Linux OSes so this would be even more drastic on say Windows or macOS.

[1] I once had to debug a library that was lot slower on Mac and Windows and it mostly turned out to be because of doing open/close of files instead of caching the handles - on Linux it was barely showing up on the perf traces and I had no easy way to debug on the other OSes. On a hunch I cached the file handles and the performance difference was mostly gone.

Also doesn't make any sense, of course something that use open/close once will be faster than using open/close n time.

It gets weird when you throw Linux in the mix - open/close are extremely cheap on Linux, to the point of being free - at least used to be before Spectre and Meltdown and what not. So on Linux doing n more open/closes isn't much different than doing a single pair if n is not humongous and the program is doing other things besides just open close.

stat to get filesize is a pathlogically bad pattern on Windows. If I remove the call to fileSize() in kvtest and instead replace it with 16k (since the max size in the post is 12k), I can significantly improve the time measured on Windows 10.


  C:\temp>kvtest run test1.dir --count 100k --blob-api 
  --count 100000 --max-id 1000 --asc
  Total elapsed time: 9.495
  Microseconds per BLOB read: 94.950
  Content read rate: 104.0 MB/s

  C:\temp>kvtest run test1.dir --count 100k --blob-api
  --count 100000 --max-id 1000 --asc
  Total elapsed time: 5.218
  Microseconds per BLOB read: 52.180
  Content read rate: 313.3 MB/s
showing here the stable-ish numbers after a couple of runs of each version to let the cache warm up.

Windows is a completely different beast than Unix and you need to know the beast to tame it.

You might say that stat+open+read+close should be compared/measured together. In that case my answer would be that you would not read all files in a directory in this way on Windows anyway. You'd use overlapped I/O with buffers of FS cache granularity (which is 64Kb, unless it has changed in recent versions of Windows). It wouldn't matter for smaller files anyway (since the OS aggressively reads ahead when you open a file) and would be better for larger files.

It is a bugbear of mine when I see complaints about Windows from people who never actually tried to optimize for the OS.

reading blobs in Sqlite from db is still ~3x faster on my Surface Pro 2017 (don't call it Surface Pro 5!), but that's to be expected.

Thanks for this. I did not realize that fileSize() was a performance issue on Windows. I'll attempt to update kvtest.c to deal with that, rerun the tests, and update the page, as I have opportunity.

Apologies for a hijack but since you're here, I really wanted to say thank you for sqlite! Saved me many headaches and made me a better programmer. You're amazing for creating it.

And if you don't mind a question: are you guys up for making a strictly typed variant of sqlite? I'm aware it is going to be a completely different beast and many wouldn't expect backwards compatibility.

Just curious if such a strongly-typed variant is on your radar at all.

Also, that test program for the above arguments just loops through the first 1000 files 100 times. (iMax is 1000). Would be good to point that out a bit more clearly if it's intentional. Curious to know why if it is.

I remember a job interview where the take home project was to parse a log file and search/sort by various top attributes as fast as possible.

I thought about writing something from scratch, but I knew that whatever I could optimize in an hour or two would be nothing compared to more robust tools that already existed. So I used sqlite to ingest the log once, and query it in various ways to produce the requested reports. The script was just a dozen lines long or so, with one dependency (sqlite pkg).

The interviewer didn’t like this answer. didn’t say why. But I’m glad to see that I wasn’t way off base about how performant sqlite can be.

My answer was of course not a prod solution, just an answer to a question with a narrow scope.

The interviewer didn't like the answer because it didn't answer the implicit question, which is how well you can write your own software. We know that you can load a table into SQLite and have it do the heavy lifting for you. But we're often not looking for software _users_ as much as we're looking for _engineers_.

When we interview people, the answers we want often aren't real-world answers because we're trying to get a deeper understanding of how you think or how well you know the details of a complex mechanism.

If that’s the case then the restrictions should’ve been given in the problem statement and the interviewer should’ve been able to answer why the candidate’s answer was rejected.

Consider it a test of your ability to infer the obvious (a valuable skill.)

I'm sorry but what is obvious in an interview setting? Is it obvious that you can or cannot use standard library? Is it obvious that you can or cannot use state of the art existing software to solve problems?

I do many interviews and I remember even when I started I had to be pretty clear what the candidate can and cannot use. I actually appreciate that they can find existing solutions that are likely better than what one can come up with in a few hours but if I need to test them on how they can code something like that I'll specifically state it.

Yeah, although imagine if carpenters were interviewed/judged based on the quality by which they could create hammers and drills from scratch.

But it was a good thing overall. I showed me signs that the company has an elevated degree of not invented here culture, and that they aren’t very clear when requesting/scoping work.


If software engineers are carpenters, the hammers and drills are the programming languages, editors, and IDEs. And sqllite is an off the shelf plank.

> But we're often not looking for software _users_ as much as we're looking for _engineers_.

What the applicant illustrated is that he is a much better engineer than the interviewer and the interviewer did not like it.

Good software engineering is about leveraging existing robust tools in a new way, not about re-inventing a wheel while making is square.

For me that would have been an insta-hire.

I both agree and disagree with you.

I agree that the answer was a good one in the real-world sense, and one that I would like an engineer to use under the right circumstances.

I disagree that the answer demonstrated engineering prowess, however. Sometimes you have to assume the lack of existence of certain tools to understand how a candidate would have implemented the solution him- or herself to get a more thorough understanding of his or her engineering chops.

I disagree. Applicant should have done more to understand the requirements for the task. That’s a massive part of software engineering, and you would be surprised how much time software engineers will spend solving the wrong problem. He didn’t understand that part of the problem included rolling a novel solution or restricted use of a database. Almost a daily problem for software engineers is extracting this information from customers or systems engineers.

> Good software engineering is about leveraging existing robust tools in a new way, not about re-inventing a wheel while making is square.

Use of external libraries or solutions is often very restrictive (if not banned) in my field. We can use existing internal or known verified solutions. Not everything is so black and white, and not all options are always on the table. They rarely ever are.

I don’t think really think “leveraging existing tools” is a defining part of software engineering... though it certainly is part of it. Software engineering is much more than just coming up with solutions.

Some people do enjoy reinventing a wheel and making it square. And lots of managers enjoy having those who work for them to do it. It makes the managers feel smart.

The applicant solved a problem that was presented to him. If the person who interviewed him did not want him to do it that way, he or she would have specified it.

I agree that the candidate should have maybe asked if they can just use third party software but I also think that was failure on the part of the interviewer. If there are constraints about the tools the candidate may use to solve the given problem then those constraints need to be specified. Similarly to how you specify which programming language you allow to solve the problem or if you allow standard library use.

You deserve the programmers you're looking for.

God damn but is that a true statement.

I have come across cases like this in a couple interviews. What I did was do both. In those cases I often got an either an onsite and/or offer immediately.

that seems to be a great asnwer to the question as posed, and as would be faced in real-world situations rather than academic ones.

This shouldn't be especially surprising - in fact I would expect this to be the case, all other things being equal, for two reasons: firstly, open() followed by read() involves two kernel-userspace context switches, while SQLite operating on an already open file only needs to call read(), and secondly SQLite is more aggressive about making forwards-incompatible changes than most popular filesystems, giving it more latitude to optimize (ZFS is an exception, but with most filesystems you expect a 10-year old computer to read a disk created yesterday).

The fact that SQLite is much faster than file systems is not surprising at all. But there’s one surprising result there.

> All machines use SSD except Win7 which has a hard-drive.

And yet Win10 is much slower than Win7 in all tests (note they have latency on the graphs).

How’s that possible? For a drive spinning at 5400 RPM, average random read latency can’t be smaller than 2.2ms, this is how long it takes to rotate the disks 180° which is an average. An SSD should do much better than that.

I have a couple possible explanation. Either that particular Win10 machine, 2016 Lenovo YOGA 910, has _extremely_ slow SSD. Or there’s something wrong with software/drivers/config of the PC, like an antivirus, or wrong chipset drivers.

There's the other obvious explanation that because the Win7 machine has slower storage it automatically uses a larger cache than the Win10 machine that more than offsets the difference.

Windows kernel uses all free RAM as a file cache. Low priority one i.e. when an app wants to allocate RAM, the file system cached pages are evicted. Their Win7 test machine has 4GB RAM, Win10 16GB RAM.

I think I know what’s going on. They don’t publish absolute numbers, they only publish result compared to SQLight on the same machine. Windows has approximately fixed overhead when opening files. Most of that overhead goes to access checks, NTFS security descriptors are way more sophisticated than 9 bits of security in Linuxes. With a fast disk that overhead dominates in the cost, SQLight doesn’t open any files while running the test, the DB already opened at startup. With a slow disk, that overhead is masked by large IO cost which affects both SQLight and NTFS.

Nitpick, but file permissions in Linux take normally 12 bits. You probably did not count SUID, SGID and sticky bit.

I’ve only counted these famous RWX bits, settable by chmod. You’re right, there’s more security metadata involved in Linux.

Still, Windows permissions are way more complicated, especially when AD domains are involved. These access control lists have arbitrary count of entries each. There’s another list, SACL, for audit. Access permissions are inherited through file system with non-trivial inheritance rules, e.g. allow + deny = deny. User permissions are inherited through group membership, and security groups can be nested i.e. include other groups. Security descriptors may include custom permissions unknown to MS. Any thread can run as any user in the forest through client impersonation, many network servers do. The kernel tries to apply effective permissions instantly, as soon as anything changes (but not for every read/write, that would be too slow, it only does access checks when opening files).

Combined, these features make access checks way more expensive in Windows, by orders of magnitude.

Maybe they have a bad benchmark setup and the 7 box is setup to cache writes to the disk.

After thinking for a bit, I think their results are actually reasonable. Their graphs are relative to how SQLight performs on the same machine. When measuring absolute numbers i.e. CPU cycles, all modern versions of Windows have comparable overhead on opening files, it’s the same file system and the kernels are quite similar. When a drive is slow, that overhead is masked by IO cost which almost equally applies to SQLight and NTFS. For a fast SSD that overhead is much higher than the cost of doing IO, hence the results.

Slightly off the topic.

In one of web applications (CRM alike) I was experimenting with SQLite-per-user approach. In that application user related data was clustered on per-user basis.

So instead of single-db-for-all-users I had one-db-per-user.

Performance gain was significant - around 25% for most of requests. I suspect that this is due the fact that index trees were significantly shorter for each particular user, data more local, etc.

Yet db schema update and other batch requests can be made without stopping the server, just handle db files one by one.

Just in case.

The formalization of this approach is https://github.com/biokoda/actordb — which is essentially a sharded document store where one type of “document” is an SQL-addressable SQLite database.

Sadly, ActorDB seems to be inactive as a project—but the fundamental premise is sound.

I’d love to see the concept reimplemented as e.g. a Redis module (since a Redis cluster essentially has the same “shape” as ActorDB’s custom distributed K-V store.)

This is exactly the topic that made me read this thread. I have been wondering whether one sqlite file per customer was a good option for an online service where each customer needed only his own, private data but wanted to work on it from his desktop, laptop, or phone, with the option of occasionally downloading the file for additional backup. I just don't know enough about multi-tenant database approaches to know if this is a good idea or absurd, and I haven't found anyone talking about it.

Any pointers to further insights would be greatly appreciated. Even a forum where (utterly unlike StackOverflow) I could ask such a question without already almost knowing the answer.

SQLite in the browser...


It was killed though. Derp.

> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once

Yes, open() can introduce significant overhead, especially when doing it sequentially on a remote filesystem, e.g. NFS.

But that's using the filesystems in the most inefficient way possible, there are better ways to do it. E.g. you can pre-touch the files or even pre-open them in parallel and then point to /proc/self/fd/<N> instead, which turns an open() into a dup(). I was able to make some properietary software run 5x faster that way simply by feeding it already open file descriptors.

Beware opening sqlite files from untrusted sources.


That's true for everything.

simpler file formats are less fraught, though they have fewer features.

What's simpler than SQLite that would serve a similar purpose? If it's not a "battle-tested" format, it's still an issue no matter how simple the format is. Remember, your average C programmer can easily fit 11 buffer overflows in 10 lines of code.

Wow, so maybe instead of a node_modules folder, npm should use a node_modules.db?

This sounds like a fantastic idea. You would go from 4000 directories and 50000 files to a single file with b-tree indices, hashes, etc. You could probably get a 100x speedup over the current approach.

Thank you, I was reading the comments exactly to figure out if there was a FUSE sqlfs module. TL;DR this URL describes that, yes, there is. Why is this relevant? Because this way, you keep the flexibilities of a filesystem and, as a consequence of it being able to function like a filesystem, you keep the "power of Unix/Linux" ie. all your <asterisk>nix [1] tools keep working.

[1] Calling it Nix confuses with NixOS...

So call it *nix? It takes two asterisks two turn on italics between them.

Cheers. Yeah, I guess it triggered cause I used two asterisks in my text since originally I used the term twice.

well, it was *nix before nix (the package manager) existed. So I think Nix, as good as it is, stuffed up with that name.

Even better would be not to use it entirely - it'd solve far bigger problems with node.

There's already support for accessing zip and tar archives as filesystems (also see ASAR), which doesn't incur the overhead of SQL evaluation. At least on Windows, this would result in a huge speedup (just because you're avoiding antivirus latency). I don't expect it to help Mac or Linux that much (as their filesystems are comparatively much more efficient with larger numbers of small files).

I had a problem with zips, they can be only 232 bytes = 4.3 GB. The Zip64 format fixes this, however few libraries support it.

That sounds like an upside. Force people to keep their hypothetical node_modules.db "lean and mean"!

That would be amazing!

To work though, you'd need some kind of interface to actually access the contents. I guess that's doable, but not ideal.

Someone else posted libsqlfs, which is FUSE Sqlite file system, and should do the job...not exactly plug-and-play though, and presumably wouldn't work on Windows (maybe on WSL tho?)

WSL1 is terrible for I/O (which I suppose this is). WSL2 should be adequate.

Perhaps Dokan? Dokan has a FUSE wrapper [1]

[1] https://dokan-dev.github.io

Not sure if /s :)

With a FUSE filesystem, you're still incurring the syscall overhead that's causing this performance disparity in the first case, plus now the kernel has to context-switch into the FUSEing process. So what would be 2 context switches for each syscall (program -> kernel -> program) now is 3 or 4 context switches (program -> kernel -> FUSE -> (kernel ->) program).

FUSE has some amazing usecases, but performance is none of them.

Erm, no, no sarcasm, but in truth I know very little about FUSE, beyond it being an API for userland file systems.

I'm also approaching this as primarily a Windows user - a lot of the overhead on Windows is access checks, so my thinking is these would be implemented as noops.

It’s kind of the idea behind Yarn’s Plug’n’Play: https://yarnpkg.com/en/docs/pnp

SQLite really is the best of both worlds when it comes to files and databases: it is a file, one that can usually live entirely in memory; and it’s got all the awesome utility of SQL (even window functions too) that can be used very simply to do powerful things.

Really depends on which filesystem too, you'll notice here that Microsoft's NTFS implementation is horrendously slow, HFS+ is several times faster, but still slower.

Since they don't bother explaining which Linux filesystem they used, I suspect they mean EXT4, and the particular version of EXT4 in the kernel in that particular version of Ubuntu.

The real news is: ext4 is about as fast as SQLite for small objects.

Which is why for decades many dB offerings offered raw or cooked storage. Cooked being that they used the file system and raw being using the device directly. I'm not sure of the upsides and downsides today as I last worked in that area about 20 years ago, but was very much a case of raw a noticeable performance gain. But then you are equally tied into the vendors backup mechanisms.

I ran a large at the time Informix implementation in the late 90s/early 2000s.

Raw disk vs file system (Solaris UFS) performance was 35% faster on our transaction systems.

As did I upon AIX and NCR towers. Did you find you had to offset the start block of the raw partition? As with AIX, even though the OS saw it as raw, it would stamp over the first 8k (iirc) with volume control data and with that, could easily corrupt a raw partition if you didn't factor in an offset.

I don’t recall the details, but we used Veritas Volume Manager for managing the raw disk, and I’m pretty sure vxvm either hid that or had a driver that didn’t do it.

Our restore run book was scripted and tight, so we may have been doing it ourselves, but I didn’t write the code or at least didn’t break it! Fun times.

This might be the answer on how to make file-heavy tools like npm/yarn work adequately on windows...

What if you create something that is optimized for performance without offering the full SQL language? Just a container to store small files with padding to avoid the expensive open/close calls.

I'm not sure how much better you'd get than sqlite if you were using a straight forward rowid select with a prepared statement. Sqlite is pretty damn snappy for something like that and using a prepared statement amortizes the cost of compiling the SQL.

You can get a point read with just a single aligned random I/O operation on disk with this. I don't see how a B-tree based storage can compete.

And even more optimizations are possible, like grouping objects often accessed together into a single batch to retrieve the whole group in just one random operation too.

I'm sure you can go faster, but how much faster? The rowid b-tree lookup, with the b-tree in memory, will almost certainly be blazing fast compared to the actual disk read.

As sibling comments stated it's common in games. A well known open source VFS library is PhysicsFS¹, previously it was called PhysFS AFAIR.

¹ https://icculus.org/physfs/

Read-only versions of this are extremely common for games; I first encountered it with 90s DOOM WAD files.

Minecraft vastly improved performance when it switched from having one file per 16x16 map area to just stuffing 32x32 of those chunks into one file with a simple index at the beginning.

Look at Google’s LevelDB and Facebook’s RocksDB.

Write API abstractions that offer baked and templated Sql queries for your specific use case. Flexible enough for future growth, easy enough for every day normal use without inventing a whole new system.

Next week on Show HN: `sqlitefs`, the SQLite-backed FUSE filesystem?

I've taken a stab at this and made quite a bit of progress.

However, the idea was to use the sqlite fuse layer as a stacked layer over the underlying filesystem rather than a full scale file-system (which for example, is what libsqlfs provides).

The motivation was not to improve filesystem performance but improving the speed of querying (ie: lightning fast stat of the entire deeply nested directory tree) at the expense of slower writes. Additionally I could (in theory, never got around to it) store rich metadata about the files, build all sorts of indexes, apply a richer access-control mechanism etc. Unfortunately, it remains in pre-alpha state, because life.

Having it on top of FUSE would make it significantly slower than an in-kernel filesystem.

It would be interesting to see how it performs ie. on nodejs projects with node_modules.sqlite3 - for practical reasons it would be nice to mount it via fuse to node_modules - if needed.

Interesting project, thanks.

For the record, their GitHub repository is https://github.com/samzyy/DB-based-replicated-filesystem

Well, to be fair, it would be a lot faster than mysqlfs.

Billions of tech-years ago, BeOS made a filesystem that had database-like properties. Now, mainly because I have no idea what I’m talking about in this area, I don’t know if anything sprung from that into mainstream computing, so features like this from SQLite make the news. (It’s very cool, to be sure)

As someone who has spent time consulting for different kinds of "we do stuff on the internet" companies, I can confidently say that this is a premature optimization for 99.999% of the companies/projects. The companies simply don't have enough IO traffic to need it.

Is it cool? Sure. Is it sexy? Maybe. Is it needed? Nope.

Do boring stuff. Use files. If you send your small files over the internet via web and you need to squeeze additional performance, do boring stuff again: use Varnish.

> Do boring stuff. Use files.

I have the opposite experience. Files are far from 'boring', they have too many gotchas to count. Putting files in the db is for many use-cases actually easier than putting them on the filesystem. By putting all data in the db you have less moving parts, transactions, referential integrity, well defined behaviours, etc. The speed increase is just a nice side effect.

> By putting all data in the db you have less moving parts, transactions, referential integrity, well defined behaviours, etc

If you are doing it for 100k files and below you are using a Ferrari to pick up eggs in a corner store. If you are using it at 100k files and above, you are using a Ferrari to move a pile of paving stones one by one.

Show me a filesystem that can efficiently hold onto all the inode information for a blockchain represented as files and directories. (Hint: not even LevelDB can hold onto all the trie information efficiently; solutions are being sought that pack things tighter than LevelDB.)

You don't need to do it efficiently. You are optimizing for non-existent problem.

Here's what you actually need to optimize for: you have a hundred million files. The are all somehow reachable via http://origin/someuniqueurl. You have multiple copies because you aren't an idiot and you know that users actually hate either losing or getting corrupted files back. You have fingerprints associated with every copy. Something happens and you can't reach part of the tree or you are getting incorrect hashes ( which you know because your origin computes the hashes every time someone requests and one and if those hashes don't match it triggers a re-request to a backup copy ). And now you need to

(a) ensure you still have the needed protection factor ( you you went from 3x to 2x as one copy is dead ) for all affected files

(b) minimize the time needed to remove the "thing" ( probably a disk or a node ) that caused the failure.

(c) minimize the cost of ensuring (a) and (b)

People spent lots of time figuring out very clever ways of doing it via DB. It neither scaled nor worked well when real life (i.e. disk crashes/nodes going away/bad data was returned) happened.

The only thing that you really need to know is

(a) where the data and its copies are stored.

(b) what is the hash of the data that is stored there

(c) how fast you can recover the information about where that data is stored in event of the issue with whatever the system that is used to keep track between (a) and (b).

(d) preferably you also want to know what other data objects ( files ) could be affected when some of the other objects are misbehaving ( if you know a file at nodeA/volumeB had a fingerprint Ykntr8H8pL9PyAtCwdw/CB5tToXTPf55+hKSZb0uhV0 when it was written and now for some reason it says its fingerprint is sw0SI4jkU5VVk6CH4oPYtwx+bK2hIlrw8hVM7i9zmNk while the rest of the copies are saying their fingerprint is still Ykntr8H8pL9PyAtCwdw/CB5tToXTPf55+hKSZb0uhV0 you could decide that you want to trash the nodeX/volumeB because you no longer trust it.

Er, yes, you do need to do it efficiently. A blockchain needs to read about 1000 of these “files” per transaction in order to verify the next block (and there are ~10000 transactions per block, and the accesses have no special locality because they’re a mixture of requests submitted by unaffiliated parties.) It’s an at-scale OLTP system (ingesting thousands of TPS) that is required to execute on consumer hardware.

This is a real problem. The current implementation of Ethereum is IOPS-bound precisely because of this overhead, with ~80% of the time spent verifying a block being spent on the filesystem and index-access overhead of getting the relevant data into memory, rather than being IO-bound by the actual bandwidth of data required. This is an on-disk datastructures problem.

Oh, this is the block chain discussion i.e. another solution in search of a problem. Never mind.

The main added value of sqlite3 when I work with it is mostly compartmentalisation of your data in one DB and one file.

You not only store files as blobs, you can also store a lot of metadata on top of them, and then, when your needs inevitably grow, start using other SQL tables for the rest of your application configuration or state. And it's all in one neat tidy package.

For the scenario you describe I'd generate an nginx container -- or an Ansible script -- by pulling data from a sqlite3 DB. Technically your approach and mine are not mutually exclusive at all.

And, as a sibling commenter said, the speedy nature of sqlite3 is just a bonus. It's not always the chased-after feature (although with WAL and a few other flags enabled, sqlite3 is damn fast and not many DB engines can compete with it).

> You not only store files as blobs, you can also store a lot of metadata on top of them, and then, when your needs inevitably grow, start using other SQL tables for the rest of your application configuration or state. And it's all in one neat tidy package

That's a great solution in search of a problem for a small dataset.

For a large dataset it is a terrible solution, because disks fail and bit-rot happens. Ability to easily restore a tiny subset of data ( single file or multiple files ) using standard commands ( cp/rsync/etc ) is highly underappreciated.

Why of course. This entire discussion should always be complemented with the sqlite3's intended use cases. Even the authors admit it's not every app.

So yes, definitely reach for a richer DB system when your needs grow beyond the point where sqlite3's weaknesses start to show.

I believe the discussion in the entire HN thread is on the general topic of "is sqlite3 really that good for small-to-mid-sized projects", to which I believe the answer is definitely a yes.

While I agree with the sentiment, using the unix file API is wrought with peril if you try to do it properly.


sqlite does a good job to provide expected consistency even in the face of power loss. Half overwritten files or causal ordering violations can occur if you do not call fsync at the appropriate places.

Sorry, that's junk peddled by people who have no idea how often disks become "unhealthy" in the most spectacular way. Bitrot is real. Disks returning funky data is real. Sqlite does not even have online remote backup that would allow one to easily fix/tweak a portion of a database. When you start storing a few TB of data in every blob you are going to learn some very painful lessons that grey beards learned thirty years ago, except that you will learn them at a real misery producing scale.

Is it an efficient way of querying structured data that can easily be blown away/recreated? Absolutely. Storing bytestreams that should not change in SQLite? That's insanity. Edgy and hipster-esque insanity but insanity none the less.

First thing you can do is just add more RAM, any free ram will be used to cache disk access. Varnish will decrease the latency only if you are using some scripting between the file and its access.

If you are at the point where you need to handle more IO, throwing additional memory onto a file origin will not in my experience do much because filesystems's idea of what to cache is vastly different from user's requests. Not doing any disk IO for hot files is vastly preferable than to doing some. This has been demonstrated in spades at LiveJournal which extensively stuffed files into memcache before Varnish became a thing.

In the past I've had huge speedups by moving simple single-table databases that had grown a bit (e.g., time series data) from sqlite to postgres. Insert performance is also quite bad forcing you to write applications with extra caching layers to be able to do a bunch of inserts at once. Sqlite is great for many applications but it's speed is somewhat oversold. I really wish the postgres engine was embeddable into applications easily.

I once tried out some RSS aggregator/reader that required me to install and set up postgres for no good reason other than it's what the developer was accustomed to. Huge pain in the ass, particularly since simply using sqlite would have been perfectly adaquate (does a RSS reader really need concurrent writes? Really?)

For any software meant to be used on a non-technie's desktop, anything other than an embedded 'zero config' arrangement is a nonstarter.

This is exactly true, which is why embedding postgres would be great. It's very easy to hit the performance limitations of sqlite and then be stuck with the choice of either poor performance or forcing users to setup postgres like you experienced.

Amarok, the kitchen sink music player had an option of connecting to a Mysql server back in 2007. IIRC it used a slower embedded DB out of the box.

I would be curious about your particular implementation with SQLite. When enabling WAL, I am able to saturate a NVMe disk on which the the database file resides. In the single-node case, I have yet to see SQLite (when optimized) run slower than any other SQL implementation. I do agree that SQLite with 100% defaults will easily be overtaken by Postgres, and that if your use case requires that more than 1 node persist transactional data you should definitely be using Postgres.

We are currently using SQLite (C#/System.Data.Sqlite.Core/WAL=on/synchronous=normal) in production for multiple customers as the principal store of transactional business data and application state. We have had literally zero downtime or performance concerns as a consequence of SQLite being used for managing this data. The longer this goes on, the more I am led to believe that using a client-server database system for any single-node persistence scenario is never a good idea. The part that I really like is that I don't have to install anything on the customer's box along with our software. We use a self-contained deployment of .NET Core, and it brings the SQLite engine along for the ride.

The code for my most recent case is here:


The db code itself is here:


WAL is being used but that's not enough to get a mixed INSERT/SELECT to not completely starve reads without first caching INSERT and batching it.

>In the single-node case, I have yet to see SQLite (when optimized) run slower than any other SQL implementation. I do agree that SQLite with 100% defaults will easily be overtaken by Postgres, and that if your use case requires that more than 1 node persist transactional data you should definitely be using Postgres.

This is single node with WAL. If there's something I've missed I'd love to see it (or get a PR). But the research I did when I wrote the code seemed to indicate this was a known limitation of sqlite.

Is there a reason you are managing locking at the application layer on the SQLite connection object?


This is going to absolutely destroy your throughput. I am not sure about go, but in .Net, the underlying SQLite connection instance is inherently thread-safe and can be shared across execution contexts. If you are unable to share a single SQLite connection instance between multiple logical parallel execution contexts in go, I would recommend just creating a new SQLite connection in each case where you need one (e.g. at the beginning of every method currently performing a lock on connection). The only thing you need to remember with this approach is that the synchronous=normal pragma needs to be set each time you new up a connection. This is an extremely fast operation (doesn't touch disk IIRC), so you shouldn't worry too much about any extra overhead here.

It's a simplification for the threading implementation. Spawning a new connection per thread would work but then I'd probably have to deal with failed transactions. The bottleneck wasn't there though, INSERT performance will just naturally have that behavior:


Without batching several INSERTs into a single transaction you can't do very many of them which is why I ended up doing a caching layer and batching them.

I would recommend giving it a try regardless. This is the exact approach we use today and we are seeing zero issues. I am not sure why you think you will see failed transactions. It is a perfectly legitimate/recommended approach to create a new connection per logical transaction.

If the locking is moved to the database concurrent transactions can happen. If those touch the same rows one of them will have to fail to maintain consistency. The bottleneck wasn't there though. Even a single thread doing INSERTs is more than enough to slow down the database. You probably don't have a write-heavy workload.

Well, sqlite is not recommended for write-heavy workflows anyway.

I have the same impressions as your parent commenter: passing around an sqlite connection handle is thread-safe and always will be faster than application-level locking.

If you do however have a write-heavy workflow then it's likely time to replace sqlite.

>I have the same impressions as your parent commenter: passing around an sqlite connection handle is thread-safe and always will be faster than application-level locking.

I'm not debating that. It's just that that's not the current bottleneck so no point in dealing with the extra complexity. The current solution can saturate my SSD with low CPU usage.

>If you do however have a write-heavy workflow then it's likely time to replace sqlite.

That's definitely true. But there's no other SQL database to embed that I know of. That's why I mentioned wanting postgres as a lib. I know of other options that are non-SQL and will require extra rework. If there's an SQL option I'm all ears.

I did a research about a month ago and was very disappointed. Basically, outside BerkeleyDB and Firebase semi-embedded mode (since it is still not fully embedded) we're on our own.

From then on, you can just use BoltDB or its successor(s) if you don't care about SQL. Which I do, and you seem to as well.

At this point I am thinking of developing an Elixir wrapper for sqlite that can also do a best effort strong-typing and be done with it. It seems sqlite made many developers complacent, but then again, the raging consumerism and micro-management economy we live in worldwide does not exactly encourage tinkering and OSS contributions.

I've been loosely following this:


Since I'm writing in rust it might be a good fit. It's non-SQL but does have some nice features. I would need to figure out how to adapt all the syncer features to that model though. Right now I should just focus on finishing the merging code to be able to release a 1.0 some time soon.

Not to be a complete cynic but IMO you should just use any of the pretty well-done NoSQL K/V stores if you don't mind ditching SQL.

There's plenty that are very well done, BoltDB successors and FoundationDB included.

BTW I've heard very good things about the widely used sqlite Rust libraries (sorry, don't remember the names).

>Not to be a complete cynic but IMO you should just use any of the pretty well-done NoSQL K/V stores if you don't mind ditching SQL.

My thought process is that the current solution works so I'd only change to a K/V store with the extra complexity if that got me a better API and more of the stack in Rust. Consuming something in Go seems like a pain, while also adding a GC runtime just for this.

>BTW I've heard very good things about the widely used sqlite Rust libraries (sorry, don't remember the names).

I'm using rusqlite, seems to work fine.

I wonder how a Rust library deals with the fact that sqlite is loosely typed.

Obviously if it was embeddable it wouldn’t be as fast and be subject to the same design limitations as SQLite.

There's nothing about being embeddable that's a limitation. All I really want is a lib I can link to that gives me an entry point to launch the database against a given folder, with some of the tweaks the postgres config file has, and that gives me a connection back. That entry point can launch however many threads it wants and behave exactly like postgres itself. After that my app can connect to a standard postgres and take advantage of all the features and performance.

We've been moving so much into web apps that optimizing for desktop apps like these is not a big concern these days. Sqlite ends up being used in that niche but it's extremely easy to run into it's limitations even in very simple personal usage. It's fine for what it is, an alternative to open(), but there are plenty of use cases where optimizing for fast open is not relevant.

Here's an example of where I'd really like postgres to replace sqlite:


I've had to add an extra caching layer because of sqlite INSERT performance. Given that I've already had to code around limitations and the schema is simple I will probably end up using a simpler non-SQL database in the future instead.

> There's nothing about being embeddable that's a limitation.

An always-running in-memory server process is going to be a lot better at concurrency than multiple embeddable libraries trying to take locks out on the same file on disk.

Read what I wrote. What I want is always-running in-memory. Being embeddable doesn't mean you need to have the sqlite model where several processes access the same database. None of the sqlite uses I've had required that at all.

EDIT: Deleted a controversial comment about the often praised SQLite test suite not actually being public and about locking behaviour in SQLite. References are [1] and [2].

[1] https://www.sqlite.org/th3.html

[2] https://beets.io/blog/sqlite-nightmare.html

D. Richard Hipp has always been pretty up front about his business model. He provides sqlite for free, public domain, and does not typically entertain third party contributions to keep the licensing clean. Additionally his business sells proprietary extensions and support for sqlite. Having that test suite uniquely positions them to do so, and due to this business model they expect to have the funding to continue providing a public domain sqlite to the world for decades to come.

Resp to deleted comment:

sqlite is considered such because it's open source and well tested. That doesn't seem to imply to me that the tests are also open source. I don't think Hipp is lying about his proprietary test suite, so I just can't convince myself to lose sleep over this.

Yes, fair enough. Note that I never accused anybody of not telling the truth. My point was about how useful sqlite can be as the posterchild example of a good test suite when we can't see the secret sauce. Not that it doesn't exist. I deleted the other comment because after reading yours I realised it was needlessly controversial. I think you're right in that SQLite can be considered an example of having a good test suite when looked at from a different perspective, just not from the point of it (the test suite) being open source.

If I understand you correctly, there is another test suite in addition to the one you can download freely?

The TH3 test suite is proprietary:

>"The TH3 test harness is a set of proprietary tests, written in C that provide 100% branch test coverage (and 100% MC/DC test coverage) to the core SQLite library. The TH3 tests are designed to run on embedded and specialized platforms that would not easily support TCL or other workstation services. TH3 tests use only the published SQLite interfaces. TH3 consists of about 69.4 MB or 948.1 KSLOC of C code implementing 44753 distinct test cases. TH3 tests are heavily parameterized, though, so a full-coverage test runs about 1.7 million different test instances. The cases that provide 100% branch test coverage constitute a subset of the total TH3 test suite. A soak test prior to release does hundreds of millions of tests. Additional information on TH3 is available separately."


> The TH3 test harness is a set of proprietary tests


Wonder what the number is with "noatime"? Every open is a write on a normal filesystem, but you can turn it off.

The default (relatime) is to update atime only once every 24h, so it's negligible for many workloads.

Or more importantly atime is updated in tandem with mtime.

While an impressive feat on SQLite’s end, I sincerely hope no engineers see this and take it at face value. The filesystem is slower because it is more flexible, and I would imagine that once you have concurrent reads and writes to your thumbnails, a filesystem is to prefer.

IMO the point of this article is pretty clearly to answer the developer question "my application needs to read and write data, should I use the filesystem or should I use a sqlite db?" and not make some kind of sweeping "sqlite > fs" claim. In this context, face value is fine.

If you want to support concurrent reads and writes, you probably should pick a particular database engine and understand its concurrency semantics, instead of assuming that whatever filesystem your software lands on will have the semantics you expect.

Theoretically there could exist filesystems that can do well on thumbnails and small files, but in practice this has never been true. Databases are typically better at this and specialized solutions can get you even farther, like over an order of magnitude barrier farther.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact