It means instances in a web farm can easily pull the files down when they initialize, it means files are automatically versioned, and it provides an obvious place to put files when they are uploaded through the admin panel. It also means all the files get backed up automatically as part of the database snapshots.
Things I’ve used this for range from static HTML assets, to logos which are used to dynamically brand a page, to files that get attached to transactional emails, etc.
So yes, if you are building a massive scale system holding millions of files, sure, find a purpose-built system to store, distribute, version, and backup your files.
But for many projects I believe the files-in-database approach is dismissed too early for not being “clean” enough. Having a single source (the DB) for everything the app needs to initialize is a massive simplification, and you may find it’s well worth the trade-offs. Really, the only downside I ever found, at the scale I was operating at, was that it bloats the database backups.
> In Linux, the ext2, ext3, ext4, JFS, Squashfs, Yaffs2, ReiserFS, Reiser4, XFS, Btrfs, OrangeFS, Lustre, OCFS2 1.6, ZFS, and F2FS filesystems support extended attributes (abbreviated xattr) when enabled in the kernel configuration.
> The Linux kernel allows extended attributes to have names of up to 255 bytes and values of up to 64 KiB, as do XFS and ReiserFS, but ext2/3/4 and btrfs impose much smaller limits, requiring all the attributes (names and values) of one file to fit in one "filesystem block" (usually 4 KiB).
Applications that make extensive use of xattrs typically recommend one of the FSs that allow full-size xattrs. For example, OpenStack Swift, an object storage service, recommends XFS as backing storage. (Source: I operate Swift clusters at $dayjob.) But if your metadata is not larger than 4 KiB, any production-quality Linux FS will do.
Personally I use S3 or similar for this. At the number of files I'm working with, storing them in a relational database would be crazy expensive.
FileId, VersionId, Name, Data, UploadedTimestamp, UploadedBy, ...
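For illustration, a minimal sketch of what a table along those lines could look like, using Go's database/sql with the mattn/go-sqlite3 driver; the table and column names here are just hypothetical stand-ins for the list above, not anyone's actual schema.

    // Hypothetical sketch of a simple file-store table; column names mirror
    // the list above but are not from any particular project.
    package main

    import (
        "database/sql"
        "log"
        "time"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", "files.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        _, err = db.Exec(`CREATE TABLE IF NOT EXISTS files (
            file_id     INTEGER PRIMARY KEY,
            version_id  INTEGER NOT NULL,
            name        TEXT    NOT NULL,
            data        BLOB    NOT NULL,
            uploaded_at TIMESTAMP NOT NULL,
            uploaded_by TEXT    NOT NULL
        )`)
        if err != nil {
            log.Fatal(err)
        }

        // Each new version is just another row; the blob gets backed up with the DB.
        _, err = db.Exec(
            `INSERT INTO files (version_id, name, data, uploaded_at, uploaded_by)
             VALUES (?, ?, ?, ?, ?)`,
            1, "logo.png", []byte("...binary..."), time.Now(), "admin")
        if err != nil {
            log.Fatal(err)
        }
    }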
If you’re on AWS, pushing and pulling from an S3 bucket is probably a great solution, and then of course there’s nothing to worry about in terms of backups.
Do you still need to keep an index of the files / metadata in a DB, or can you tag everything you need directly on the S3 objects and just pull the whole bucket?
Presigned URLs are HMAC-SHA256:
I just measured GetObjectRequest.Presign() at 1063 ns. That's a million signatures per second of my 2015 MacBook Pro using the stock AWS SDK for Go _on one core_. This is plenty fast already, and it's guaranteed to be running in parallel on better hardware in production. There's no way signature generation is a bottleneck in any real application.
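For reference, presigning with the (v1) AWS SDK for Go looks roughly like this; bucket, key, and region are placeholders. The signing is pure CPU work (the HMAC mentioned above), with no network round trip.

    // Rough sketch: presign a GetObject URL with the v1 AWS SDK for Go.
    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    func main() {
        sess := session.Must(session.NewSession(&aws.Config{
            Region: aws.String("us-east-1"),
        }))
        svc := s3.New(sess)

        // Build the request without sending it, then sign it.
        req, _ := svc.GetObjectRequest(&s3.GetObjectInput{
            Bucket: aws.String("example-bucket"),
            Key:    aws.String("example-key"),
        })
        url, err := req.Presign(15 * time.Minute)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(url) // signed URL, valid for 15 minutes
    }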
> and a limit to how much can serve
I've gotten hundreds of gigabits per second between S3 and EC2 without doing anything special. What problems are you referring to?
> at a certain scale you will want to use a cdn
So use a CDN too. CloudFront fits there neatly, and it can even sign its own requests to S3:
None of those criticisms are valid reasons to avoid S3. S3 is an excellent solution for many common problems.
Having said that, S3 has genuine drawbacks for certain workloads. (For example, the long tail on time to first byte is sometimes a dealbreaker.) In my opinion the biggest limitation is budgetary: typically one will spend all their money on AWS data transfer fees long before hitting S3 service limits.
And I didn't say to avoid S3. I work for a subsidiary of AWS; I want you to use it a ton.
Let's say you're serving up a bunch of thumbnails, maybe 1024 per user per load.
Or, in my scenario, 25k files. 25k signatures per user works out to 40 users per second per server, signing constantly.
I have a Datastore implementation that uses disk when running locally and Firebase when remote.
You can read about it there.
I thought it would be a great idea to use just simple flat files as you can commit them to git, diff them, and there's kind of already an API for working with them.
In retrospect I regret it and wish we had something more powerful.
First, it's twice as much code to maintain, which introduces bugs.
Second, if you're keeping your PDFs online as well as on disk, they're duplicated. It would be nice to build a higher-level abstraction that was both network- and disk-aware and just pulled things down when you needed them.
Third, the annotations on your PDF aren't really indexable without a 3rd party system.
Getting the right balance of remote-first vs web-first is tough.
Now, after building a complex webapp with electron support I think it's better to go web-first and then try to use the browsers local caching to get offline-first support.
The files were the storage for some third-party content targeting tool, and were edited using that tool's admin UI, then deployed to the servers where the live part of the tool would read them. You couldn't bring the system up in a complete state without them.
Apart from those files, we could restore the whole system using the latest code (really, the artifact built from it), and the latest database dump. So, before taking a database dump, we'd run a script to load those files into the database. We also had a script for undumping them, which we tested regularly in our staging environment.
We didn't use the database for deployment, because the third-party tool did that. Perhaps we should have - it would have been a bit of a subversion of the tool's usual mode of operation, but it might have had some advantages.
Easily solved by having two separate databases. One for dynamic content, one for static files, which is probably good practice regardless.
Also, remember the GP said "small to mid-size."
To the grandparent: you typically solve this application consistency issue by using something like immutable object store semantics with versioned references. The object store is capable of answering requests for multiple versions of the asset, and the application metadata store tracks individual versions. You can sequence the order in which you commit to these stores, so the asset is always available before it is published in the metadata store. Alternatively, you can make consumers aware of temporary unavailability, so they can wait for the asset to become available even if they stumble on the new version reference before the content is committed.
You can also find hybrids where the metadata store is used to track the asset lifecycle, exposing previous and next asset version references in a sort of two-phase commit protocol at the application level. This can allow consuming applications to choose whether they wait for the latest or move forward with older data. It also makes it easier to handle failure/recovery procedures such as canceling an update that is taking too long and reclaiming resources.
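A rough sketch of the commit ordering described above, with all type and method names being hypothetical: write the immutable version to the object store first, then publish its reference in the metadata store, so a published reference always points at content that already exists.

    // Hypothetical interfaces standing in for "object store" and "metadata store".
    package main

    type ObjectStore interface {
        // PutVersion stores a new immutable version and returns its reference.
        PutVersion(assetID string, content []byte) (versionRef string, err error)
    }

    type MetadataStore interface {
        // SetCurrentVersion publishes which version consumers should read.
        SetCurrentVersion(assetID, versionRef string) error
    }

    // publishNewVersion sequences the two writes: content first, pointer second.
    func publishNewVersion(objects ObjectStore, meta MetadataStore, assetID string, content []byte) error {
        versionRef, err := objects.PutVersion(assetID, content)
        if err != nil {
            return err // nothing published; readers still see the previous version
        }
        // Only after the bytes are durable do we flip the reference, so any
        // consumer that sees the new versionRef can immediately fetch it.
        return meta.SetCurrentVersion(assetID, versionRef)
    }

    func main() {} // compile-only sketch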
As such, splitting the database may introduce significant consistency issues that a single-database backup doesn't have.
I believe this splitting technique is not a good one except for potentially narrow use cases.
Are you saying that even a single-database backup is not atomic?
That's likely the OS's filesystem cache, not anything in nginx. If you're on Linux, you can confirm this by clearing the OS filesystem cache with
$ sudo -i
# echo 3 > /proc/sys/vm/drop_caches
Once I tried to hook up a way to serve the files directly from the DB through IIS, and it was a total failure. Even after I had figured out how to get all the various headers right, I was still running into some weird caching issue that would corrupt the request at random intervals.
So now, if the files are static hosting assets, I will initialize a ‘cache/‘ folder on the local filesystem (for that machine on the web-farm) when the first request comes in and just read all those files back into the local file system for the web server to handle.
The only remaining concern is refreshing the cache when files change in the DB.
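For what it's worth, that warm-up step boils down to something like the following sketch, here in Go with SQLite purely for illustration (the setup above is IIS, so treat the table and column names as hypothetical): dump every asset blob out of the DB into the local cache folder so the web server just serves plain files.

    // Hypothetical warm-up: copy every static asset blob out of the DB into a
    // local cache/ directory; the web server then serves ordinary files.
    package main

    import (
        "database/sql"
        "log"
        "os"
        "path/filepath"

        _ "github.com/mattn/go-sqlite3"
    )

    func hydrateCache(db *sql.DB, cacheDir string) error {
        if err := os.MkdirAll(cacheDir, 0o755); err != nil {
            return err
        }
        rows, err := db.Query(`SELECT name, data FROM files`)
        if err != nil {
            return err
        }
        defer rows.Close()

        for rows.Next() {
            var name string
            var data []byte
            if err := rows.Scan(&name, &data); err != nil {
                return err
            }
            // filepath.Base guards against path traversal via stored names.
            dest := filepath.Join(cacheDir, filepath.Base(name))
            if err := os.WriteFile(dest, data, 0o644); err != nil {
                return err
            }
        }
        return rows.Err()
    }

    func main() {
        db, err := sql.Open("sqlite3", "files.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
        if err := hydrateCache(db, "cache"); err != nil {
            log.Fatal(err)
        }
    }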
Having a deduplicating backup solution might help with that. The simplest way to do that (in terms of ease of implementation) would be to keep backups on a ZFS or btrfs filesystem. Note that (AFAIK) with btrfs, deduplication must be triggered manually, so you'd need to run an out-of-band dedup tool such as `duperemove` over the backups to take advantage of it.
ZFS also has a block deduplication scheme, but turning it on is a poor idea. The deduplication table needs to be stored in memory for decent performance, and this takes a lot of memory. Even then, performance is rarely as good as with dedup turned off. It doesn't even save as much space as you'd hope -- it's per-record, and ZFS storage records are 128KiB by default. If you're trying to deduplicate 10KiB files with that, you're going to have a bad time.
(Note also: The ZFS record size is dynamic; the 128KiB default is the maximum record size it'll allow, i.e. the most data that has to be read and rewritten (somewhere else) if you modify one byte in a larger file. If you just write a single 10KiB file and force it to disk, then you'll get a 10KiB record. It's complicated; try not to read more into this than "dedup is rarely a good idea," though if I'm allowed a second line, I'll add "compression is always a good idea; enable zle at a minimum.")
You could put everything inside a zip and that's it.
How are changes getting put into the container? Are you versioning the container each time a file is changed or added? When files change are all instances of the containers being restarted to get the new/changed files?
open() and close() are significantly slower on non-Linux OSes so this would be even more drastic on say Windows or macOS.
I once had to debug a library that was a lot slower on Mac and Windows, and it mostly turned out to be because it was opening/closing files instead of caching the handles. On Linux it was barely showing up in the perf traces, and I had no easy way to debug on the other OSes. On a hunch I cached the file handles, and the performance difference was mostly gone.
C:\temp>kvtest run test1.dir --count 100k --blob-api
--count 100000 --max-id 1000 --asc
Total elapsed time: 9.495
Microseconds per BLOB read: 94.950
Content read rate: 104.0 MB/s
C:\temp>kvtest run test1.dir --count 100k --blob-api
--count 100000 --max-id 1000 --asc
Total elapsed time: 5.218
Microseconds per BLOB read: 52.180
Content read rate: 313.3 MB/s
Windows is a completely different beast than Unix and you need to know the beast to tame it.
You might say that stat+open+read+close should be compared/measured together. In that case my answer would be that you would not read all files in a directory in this way on Windows anyway. You'd use overlapped I/O with buffers of FS cache granularity (which is 64 KB, unless it has changed in recent versions of Windows). It wouldn't matter for smaller files anyway (since the OS aggressively reads ahead when you open a file) and would be better for larger files.
It is a bugbear of mine when I see complaints about Windows from people who never actually tried to optimize for the OS.
Reading blobs from the db in SQLite is still ~3x faster on my Surface Pro 2017 (don't call it Surface Pro 5!), but that's to be expected.
And if you don't mind a question: are you guys up for making a strictly typed variant of sqlite? I'm aware it is going to be a completely different beast and many wouldn't expect backwards compatibility.
Just curious if such a strongly-typed variant is on your radar at all.
I thought about writing something from scratch, but I knew that whatever I could optimize in an hour or two would be nothing compared to more robust tools that already existed. So I used sqlite to ingest the log once, and query it in various ways to produce the requested reports. The script was just a dozen lines long or so, with one dependency (sqlite pkg).
The interviewer didn’t like this answer, and didn’t say why. But I’m glad to see that I wasn’t way off base about how performant sqlite can be.
My answer was of course not a prod solution, just an answer to a question with a narrow scope.
When we interview people, the answers we want often aren't real-world answers because we're trying to get a deeper understanding of how you think or how well you know the details of a complex mechanism.
I do many interviews, and I remember that even when I started I had to be pretty clear about what the candidate could and couldn't use. I actually appreciate that they can find existing solutions that are likely better than what one could come up with in a few hours, but if I need to test whether they can code something like that themselves, I'll specifically say so.
But it was a good thing overall. It showed me signs that the company has an elevated degree of not-invented-here culture, and that they aren’t very clear when requesting/scoping work.
What the applicant illustrated is that he is a much better engineer than the interviewer and the interviewer did not like it.
Good software engineering is about leveraging existing robust tools in a new way, not about re-inventing a wheel while making it square.
For me that would have been an insta-hire.
I agree that the answer was a good one in the real-world sense, and one that I would like an engineer to use under the right circumstances.
I disagree that the answer demonstrated engineering prowess, however. Sometimes you have to assume the lack of existence of certain tools to understand how a candidate would have implemented the solution him- or herself to get a more thorough understanding of his or her engineering chops.
> Good software engineering is about leveraging existing robust tools in a new way, not about re-inventing a wheel while making it square.
Use of external libraries or solutions is often very restrictive (if not banned) in my field. We can use existing internal or known verified solutions. Not everything is so black and white, and not all options are always on the table. They rarely ever are.
I don’t really think “leveraging existing tools” is a defining part of software engineering... though it certainly is part of it. Software engineering is much more than just coming up with solutions.
The applicant solved a problem that was presented to him. If the person who interviewed him did not want him to do it that way, he or she would have specified it.
> All machines use SSD except Win7 which has a hard-drive.
And yet Win10 is much slower than Win7 in all tests (note they have latency on the graphs).
How’s that possible? For a drive spinning at 5400 RPM, average random read latency can’t be smaller than about 5.6 ms: that’s how long it takes the platters to rotate 180°, which is the average rotational delay. An SSD should do much better than that.
I have a couple of possible explanations. Either that particular Win10 machine, a 2016 Lenovo YOGA 910, has an _extremely_ slow SSD, or there’s something wrong with the software/drivers/config of the PC, like an antivirus or the wrong chipset drivers.
I think I know what’s going on. They don’t publish absolute numbers, they only publish results relative to SQLite on the same machine. Windows has an approximately fixed overhead when opening files. Most of that overhead goes to access checks; NTFS security descriptors are way more sophisticated than the 9 permission bits on Linux. With a fast disk that overhead dominates the cost, and SQLite doesn’t open any files while running the test, since the DB is already opened at startup. With a slow disk, that overhead is masked by the large IO cost, which affects both SQLite and NTFS.
Still, Windows permissions are way more complicated, especially when AD domains are involved. These access control lists have an arbitrary number of entries each. There’s another list, the SACL, for audit. Access permissions are inherited through the file system with non-trivial inheritance rules, e.g. allow + deny = deny. User permissions are inherited through group membership, and security groups can be nested, i.e. include other groups. Security descriptors may include custom permissions unknown to MS. Any thread can run as any user in the forest through client impersonation, and many network servers do. The kernel tries to apply effective permissions instantly, as soon as anything changes (but not for every read/write, that would be too slow; it only does access checks when opening files).
Combined, these features make access checks way more expensive in Windows, by orders of magnitude.
In one of my web applications (CRM-like) I was experimenting with a SQLite-per-user approach. In that application, user-related data was clustered on a per-user basis.
So instead of single-db-for-all-users I had one-db-per-user.
The performance gain was significant - around 25% for most requests. I suspect this is due to the fact that index trees were significantly shorter for each particular user, the data was more local, etc.
Yet db schema updates and other batch requests can be made without stopping the server; just handle the db files one by one.
Just in case.
Sadly, ActorDB seems to be inactive as a project—but the fundamental premise is sound.
I’d love to see the concept reimplemented as e.g. a Redis module (since a Redis cluster essentially has the same “shape” as ActorDB’s custom distributed K-V store.)
Any pointers to further insights would be greatly appreciated. Even a forum where (utterly unlike StackOverflow) I could ask such a question without already almost knowing the answer.
It was killed though. Derp.
Yes, open() can introduce significant overhead, especially when doing it sequentially on a remote filesystem, e.g. NFS.
But that's using the filesystems in the most inefficient way possible; there are better ways to do it. E.g. you can pre-touch the files or even pre-open them in parallel and then point to /proc/self/fd/<N> instead, which turns an open() into a dup(). I was able to make some proprietary software run 5x faster that way, simply by feeding it already-open file descriptors.
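For illustration, the mechanism looks something like this (Linux-only, path is a placeholder, and shown within a single process just to demonstrate the shape; the real win is handing these paths to another consumer):

    // Sketch: pre-open a file and expose it via /proc/self/fd/<N>, so a
    // consumer's open() becomes a cheap re-open of an already-open descriptor
    // instead of a fresh (and possibly slow, e.g. NFS) path lookup.
    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        f, err := os.Open("/etc/hostname")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Any code that expects a file name can open this path instead.
        fdPath := fmt.Sprintf("/proc/self/fd/%d", f.Fd())

        data, err := os.ReadFile(fdPath)
        if err != nil {
            panic(err)
        }
        fmt.Printf("read %d bytes via %s\n", len(data), fdPath)
    }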
Calling it Nix invites confusion with NixOS...
To work though, you'd need some kind of interface to actually access the contents. I guess that's doable, but not ideal.
Someone else posted libsqlfs, which is a FUSE SQLite file system and should do the job... not exactly plug-and-play though, and presumably wouldn't work on Windows (maybe on WSL tho?)
Perhaps Dokan? Dokan has a FUSE wrapper 
With a FUSE filesystem, you're still incurring the syscall overhead that's causing this performance disparity in the first place, plus now the kernel has to context-switch into the FUSE process. So what would be 2 context switches for each syscall (program -> kernel -> program) is now 3 or 4 context switches (program -> kernel -> FUSE -> (kernel ->) program).
FUSE has some amazing use cases, but performance is not one of them.
I'm also approaching this as primarily a Windows user - a lot of the overhead on Windows is access checks, so my thinking is these would be implemented as noops.
Since they don't bother explaining which Linux filesystem they used, I suspect they mean EXT4, and the particular version of EXT4 in the kernel in that particular version of Ubuntu.
The real news is: ext4 is about as fast as SQLite for small objects.
Raw disk was 35% faster than the file system (Solaris UFS) on our transaction systems.
Our restore run book was scripted and tight, so we may have been doing it ourselves, but I didn’t write the code or at least didn’t break it! Fun times.
And even more optimizations are possible, like grouping objects often accessed together into a single batch to retrieve the whole group in just one random operation too.
However, the idea was to use the sqlite fuse layer as a stacked layer over the underlying filesystem rather than a full scale file-system (which for example, is what libsqlfs provides).
The motivation was not to improve filesystem performance but to improve the speed of querying (i.e. lightning-fast stat of an entire deeply nested directory tree) at the expense of slower writes. Additionally I could (in theory, never got around to it) store rich metadata about the files, build all sorts of indexes, apply a richer access-control mechanism, etc. Unfortunately, it remains in a pre-alpha state, because life.
For the record, their GitHub repository is https://github.com/samzyy/DB-based-replicated-filesystem
Is it cool? Sure. Is it sexy? Maybe. Is it needed? Nope.
Do boring stuff. Use files. If you send your small files over the internet via web and you need to squeeze additional performance, do boring stuff again: use Varnish.
I have the opposite experience. Files are far from 'boring', they have too many gotchas to count. Putting files in the db is for many use-cases actually easier than putting them on the filesystem. By putting all data in the db you have less moving parts, transactions, referential integrity, well defined behaviours, etc. The speed increase is just a nice side effect.
If you are doing it for 100k files and below you are using a Ferrari to pick up eggs in a corner store. If you are using it at 100k files and above, you are using a Ferrari to move a pile of paving stones one by one.
Here's what you actually need to optimize for: you have a hundred million files. They are all somehow reachable via http://origin/someuniqueurl. You have multiple copies because you aren't an idiot and you know that users actually hate losing files or getting corrupted files back. You have fingerprints associated with every copy. Something happens and you can't reach part of the tree, or you are getting incorrect hashes (which you know because your origin computes the hashes every time someone requests a file, and if those hashes don't match it triggers a re-request to a backup copy). And now you need to
(a) ensure you still have the needed protection factor (you went from 3x to 2x as one copy is dead) for all affected files
(b) minimize the time needed to remove the "thing" (probably a disk or a node) that caused the failure.
(c) minimize the cost of ensuring (a) and (b)
People spent lots of time figuring out very clever ways of doing it via a DB. It neither scaled nor worked well when real life (i.e. disk crashes, nodes going away, bad data being returned) happened.
The only things that you really need to know are
(a) where the data and its copies are stored.
(b) what is the hash of the data that is stored there
(c) how fast you can recover the information about where that data is stored, in the event of an issue with whatever system is used to keep track of (a) and (b).
(d) preferably you also want to know what other data objects (files) could be affected when some of the other objects are misbehaving (if you know a file at nodeA/volumeB had a fingerprint Ykntr8H8pL9PyAtCwdw/CB5tToXTPf55+hKSZb0uhV0 when it was written and now for some reason it says its fingerprint is sw0SI4jkU5VVk6CH4oPYtwx+bK2hIlrw8hVM7i9zmNk while the rest of the copies are saying their fingerprint is still Ykntr8H8pL9PyAtCwdw/CB5tToXTPf55+hKSZb0uhV0, you could decide that you want to trash nodeA/volumeB because you no longer trust it).
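If it helps, the bookkeeping in (a)-(d) boils down to something like the following sketch; all names are hypothetical.

    // Hypothetical per-object bookkeeping for (a)-(d) above.
    package main

    // ReplicaLocation records where one copy of an object lives.
    type ReplicaLocation struct {
        Node   string // e.g. "nodeA"
        Volume string // e.g. "volumeB"
    }

    // ObjectRecord is the minimum you need to track per object.
    type ObjectRecord struct {
        URL         string            // http://origin/someuniqueurl
        Fingerprint string            // hash recorded when the object was written
        Replicas    []ReplicaLocation // where the copies live
    }

    // suspect reports whether a copy's current hash disagrees with the recorded
    // fingerprint, i.e. the node/volume holding it should no longer be trusted.
    func suspect(rec ObjectRecord, currentHash string) bool {
        return currentHash != rec.Fingerprint
    }

    func main() {} // compile-only sketch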
This is a real problem. The current implementation of Ethereum is IOPS-bound precisely because of this overhead: ~80% of the time to verify a block goes to the filesystem and index-access overhead of getting the relevant data into memory, rather than to the actual bandwidth of the data required. This is an on-disk data structures problem.
You not only store files as blobs, you can also store a lot of metadata on top of them, and then, when your needs inevitably grow, start using other SQL tables for the rest of your application configuration or state. And it's all in one neat tidy package.
For the scenario you describe I'd generate an nginx container -- or an Ansible script -- by pulling data from a sqlite3 DB. Technically your approach and mine are not mutually exclusive at all.
And, as a sibling commenter said, the speedy nature of sqlite3 is just a bonus. It's not always the chased-after feature (although with WAL and a few other flags enabled, sqlite3 is damn fast and not many DB engines can compete with it).
That's a great solution in search of a problem for a small dataset.
For a large dataset it is a terrible solution, because disks fail and bit-rot happens. Ability to easily restore a tiny subset of data ( single file or multiple files ) using standard commands ( cp/rsync/etc ) is highly underappreciated.
So yes, definitely reach for a richer DB system when your needs grow beyond the point where sqlite3's weaknesses start to show.
I believe the discussion in the entire HN thread is on the general topic of "is sqlite3 really that good for small-to-mid-sized projects", to which I believe the answer is definitely a yes.
sqlite does a good job of providing the expected consistency even in the face of power loss. Half-overwritten files or causal-ordering violations can occur if you do not call fsync at the appropriate places.
Is it an efficient way of querying structured data that can easily be blown away/recreated? Absolutely. Storing bytestreams that should not change in SQLite? That's insanity. Edgy and hipster-esque insanity but insanity none the less.
For any software meant to be used on a non-techie's desktop, anything other than an embedded 'zero config' arrangement is a nonstarter.
We are currently using SQLite (C#/System.Data.Sqlite.Core/WAL=on/synchronous=normal) in production for multiple customers as the principal store of transactional business data and application state. We have had literally zero downtime or performance concerns as a consequence of SQLite being used for managing this data. The longer this goes on, the more I am led to believe that using a client-server database system for any single-node persistence scenario is never a good idea. The part that I really like is that I don't have to install anything on the customer's box along with our software. We use a self-contained deployment of .NET Core, and it brings the SQLite engine along for the ride.
The db code itself is here:
WAL is being used, but that's not enough to keep a mixed INSERT/SELECT workload from completely starving reads without first caching the INSERTs and batching them.
>In the single-node case, I have yet to see SQLite (when optimized) run slower than any other SQL implementation. I do agree that SQLite with 100% defaults will easily be overtaken by Postgres, and that if your use case requires that more than 1 node persist transactional data you should definitely be using Postgres.
This is single node with WAL. If there's something I've missed I'd love to see it (or get a PR). But the research I did when I wrote the code seemed to indicate this was a known limitation of sqlite.
This is going to absolutely destroy your throughput. I am not sure about go, but in .Net, the underlying SQLite connection instance is inherently thread-safe and can be shared across execution contexts. If you are unable to share a single SQLite connection instance between multiple logical parallel execution contexts in go, I would recommend just creating a new SQLite connection in each case where you need one (e.g. at the beginning of every method currently performing a lock on connection). The only thing you need to remember with this approach is that the synchronous=normal pragma needs to be set each time you new up a connection. This is an extremely fast operation (doesn't touch disk IIRC), so you shouldn't worry too much about any extra overhead here.
Without batching several INSERTs into a single transaction you can't do very many of them, which is why I ended up adding a caching layer and batching them.
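For what it's worth, the batching amounts to something like this sketch (Go, database/sql with the mattn/go-sqlite3 driver; table and column names are hypothetical): buffer the writes, then flush them inside one transaction so the whole batch pays for a single commit instead of one per INSERT.

    // Hypothetical batching: flush buffered rows in one SQLite transaction.
    package batch

    import "database/sql"

    func FlushBatch(db *sql.DB, rows [][]interface{}) error {
        tx, err := db.Begin()
        if err != nil {
            return err
        }
        stmt, err := tx.Prepare(`INSERT INTO events (key, value) VALUES (?, ?)`)
        if err != nil {
            tx.Rollback()
            return err
        }
        defer stmt.Close()

        for _, row := range rows {
            if _, err := stmt.Exec(row...); err != nil {
                tx.Rollback()
                return err
            }
        }
        // One commit covers the entire batch.
        return tx.Commit()
    }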
I have the same impressions as your parent commenter: passing around an sqlite connection handle is thread-safe and always will be faster than application-level locking.
If you do however have a write-heavy workflow then it's likely time to replace sqlite.
I'm not debating that. It's just that that's not the current bottleneck so no point in dealing with the extra complexity. The current solution can saturate my SSD with low CPU usage.
>If you do however have a write-heavy workflow then it's likely time to replace sqlite.
That's definitely true. But there's no other SQL database to embed that I know of. That's why I mentioned wanting postgres as a lib. I know of other options that are non-SQL and will require extra rework. If there's an SQL option I'm all ears.
From then on, you can just use BoltDB or its successor(s) if you don't care about SQL. Which I do, and you seem to as well.
At this point I am thinking of developing an Elixir wrapper for sqlite that can also do best-effort strong typing, and be done with it. It seems sqlite made many developers complacent, but then again, the raging consumerism and micro-management economy we live in worldwide does not exactly encourage tinkering and OSS contributions.
Since I'm writing in rust it might be a good fit. It's non-SQL but does have some nice features. I would need to figure out how to adapt all the syncer features to that model though. Right now I should just focus on finishing the merging code to be able to release a 1.0 some time soon.
There's plenty that are very well done, BoltDB successors and FoundationDB included.
BTW I've heard very good things about the widely used sqlite Rust libraries (sorry, don't remember the names).
My thought process is that the current solution works so I'd only change to a K/V store with the extra complexity if that got me a better API and more of the stack in Rust. Consuming something in Go seems like a pain, while also adding a GC runtime just for this.
>BTW I've heard very good things about the widely used sqlite Rust libraries (sorry, don't remember the names).
I'm using rusqlite, seems to work fine.
We've been moving so much into web apps that optimizing for desktop apps like these is not a big concern these days. Sqlite ends up being used in that niche, but it's extremely easy to run into its limitations even in very simple personal usage. It's fine for what it is, an alternative to open(), but there are plenty of use cases where optimizing for fast open is not relevant.
Here's an example of where I'd really like postgres to replace sqlite:
I've had to add an extra caching layer because of sqlite INSERT performance. Given that I've already had to code around limitations and the schema is simple I will probably end up using a simpler non-SQL database in the future instead.
An always-running in-memory server process is going to be a lot better at concurrency than multiple embeddable libraries trying to take locks out on the same file on disk.
sqlite is considered such because it's open source and well tested. That doesn't seem to imply to me that the tests are also open source. I don't think Hipp is lying about his proprietary test suite, so I just can't convince myself to lose sleep over this.
>"The TH3 test harness is a set of proprietary tests, written in C that provide 100% branch test coverage (and 100% MC/DC test coverage) to the core SQLite library. The TH3 tests are designed to run on embedded and specialized platforms that would not easily support TCL or other workstation services. TH3 tests use only the published SQLite interfaces. TH3 consists of about 69.4 MB or 948.1 KSLOC of C code implementing 44753 distinct test cases. TH3 tests are heavily parameterized, though, so a full-coverage test runs about 1.7 million different test instances. The cases that provide 100% branch test coverage constitute a subset of the total TH3 test suite. A soak test prior to release does hundreds of millions of tests. Additional information on TH3 is available separately."