SQLite is 35% Faster Than The Filesystem (2017) (sqlite.org)





This gets posted a lot here, and I really wonder if anyone has bothered to try to replicate these results or test them with different sizes of data on different filesystems. You can see that the greatest disparity is with NTFS/Windows while Linux is pretty close in performance, but they don't bother to mention whether they've got Ubuntu formatted as ext4 or whatever. I can't really seem to find anything looking around, and this article is like 5 years old now. I remember showing this very article to a supervisor at one point and he scoffed at it as unnecessary overhead when all I needed was a plain file store and no relational queries at all.

It looks like this is the benchmarking code; I'll have to go over it out of curiosity sometime: https://www.sqlite.org/src/file/test/kvtest.c
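For anyone who wants a rough feel for what's being measured without reading the C, here's a minimal sketch of the same comparison in Python (blobs packed into one SQLite file vs. one file per blob). The table name, counts and sizes here are arbitrary choices of mine, not taken from kvtest.c:

  import os, sqlite3, time

  N, SIZE = 1000, 10_000  # 1000 blobs of ~10 KB; arbitrary, not kvtest's defaults
  blob = os.urandom(SIZE)

  # Variant 1: blobs packed into a single SQLite database.
  db = sqlite3.connect("blobs.db")
  db.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v BLOB)")
  db.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                 ((i, blob) for i in range(N)))
  db.commit()

  # Variant 2: one file per blob.
  os.makedirs("blobs", exist_ok=True)
  for i in range(N):
      with open(f"blobs/{i}", "wb") as f:
          f.write(blob)

  t = time.perf_counter()
  for i in range(N):
      db.execute("SELECT v FROM kv WHERE k = ?", (i,)).fetchone()
  print("sqlite reads:  ", time.perf_counter() - t)

  t = time.perf_counter()
  for i in range(N):
      with open(f"blobs/{i}", "rb") as f:
          f.read()
  print("per-file reads:", time.perf_counter() - t)

The second loop pays an open()/close() per blob, which is essentially the overhead the article attributes the difference to.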


I tried this recently (~2 years ago). SQLite is faster, but returning unused space to the OS is a pain. If you don't need that to happen promptly, it's a good solution.

Is it a known issue that the filesystem on Windows 10 is so slow? Being 5 times slower than macOS was roughly my experience but I thought there was just something wrong with my Windows laptop. I can't find any benchmark or explanation about this.

It isn't just Windows 10; as far as I know, every version of Windows has always wanted to spend like a minute (or multiple minutes!) slowly counting the files before it proceeds to slowly delete them one by one... until it finds one that's open in a background process and nixes the whole operation in a half-complete state. 5 times slower sounds conservative. It's awful.

Hey, maybe they could put rounded corners on the file deletion window, I bet that would fix it!


The only reason I'm posting on HN right now is because I dared to right-click a folder on a network share. MyFolderName (not responding), whelp, guess I'll go fuck off for 60 seconds while every item in that shortcut menu opens a network connection, navigates to...or whatever the hell it's doing. Yeah, I know, use the CLI. OTOH, I argue: fix your broken UI.

I worked at MSFT for eight years as a full-time, Flavorade-guzzling employee, and fifteen years later rarely a day goes by that I don't question how Microsoft manages to get anyone to buy that steaming pile they call an operating system.


Most people don't buy Windows - it just comes with their computer, thanks to Microsoft's deals with OEMs.

Kids get accustomed to Windows and Microsoft Office in schools thanks to Microsoft's lobbying, so when they grow up they keep using Windows and pay for Office.

Everyone uses Windows and Microsoft Office, so most companies and organizations have to pay for licenses for each of their employees.

People know Windows, so they develop on Windows, so they get locked into yet more Microsoft products and libraries and have to run Windows on servers. That's yet more licenses for Windows and for Visual Studio.

Their business strategy would almost be impressive if it weren't terrifyingly harmful.


Windows as an OS is... not the best. I miss Ubuntu/Pop!_OS. Switched due to MIDI controller driver compatibility issues that seemed immune to any WINE hackery =/

I actually really like the rest of what Microsoft does though.

VS Code, TypeScript, creating the Language Server Protocol, the .NET Foundation/.NET in general and C#, the list goes on and on. They're an influential organization which (depending on what kind of software you build/what you write) probably has a significant impact on your day-to-day experience as a developer.

The vertical integration you get when one company (for lack of a better word) "owns" all of these things can be beneficial for the end-user/developer, setting aside whatever anti-competitive arguments there are.

It's the same argument I hear from Apple users -- Apple stewards their digital experience and gives them continuity end-to-end.

Microsoft -- by nature of owning my OS, IDE, the specification and dev team driving my primary language, a cloud provider, and having their fingers in all of these parts of my digital life -- has the ability to be really awesome for me.

They could also do really sinister things if they wanted and I'd be real fucked I suppose, but that's a(n unlikely) risk I accept.


> rarely a day goes by that I don't question how Microsoft manages to get anyone to buy that steaming pile they call an operating system.

"But muh games!"


Yeah, this but unironically.

The Windows filesystem has dismal performance. The Windows graphics stack is arguably better than Linux’s. And it’s not just the drivers, so you can’t just blame it on GPU vendors, though poor support from a certain popular GPU vendor definitely doesn’t help.

Some of it is just the little things, like doing mouse movement at very high priority in kernel space [1] [2]. Other parts are bigger and more impressive, like allowing the graphics driver to crash/upgrade/restart without you also losing all of your open windows.

[1]: https://reactos.org/wiki/Techwiki:Win32k/desktops#win32k.21P...

[2]: https://devblogs.microsoft.com/oldnewthing/20061117-03/?p=28...


Yeah, 'cause the incredible fragmentation of Linux GUI toolkits and windowing systems is so fun to deal with. Year of Linux on the Desktop when.

> Year of Linux on the Desktop when.

Couple decades ago for me.

> Yeah, 'cause the incredible fragmentation of Linux GUI toolkits and windowing systems is so fun to deal with.

Sounds like probably "never" for you. You seem hostile toward Linux, so you're probably better off sticking with whatever you use currently.


I don’t think calling GP “hostile to Linux” after they critiqued one part of the ecosystem is very fair. All systems have their flaws, and if judging part of a system negatively makes you hostile to that system, then I guess Linus Torvalds is hostile to Linux too.

I no longer care. None of it matters one little bit anyhow.

Just take a look at the Windows desktop versus the Linux desktop. The Windows desktop is far more fragmented and incongruous.

With Linux, the main difference is whether your desktop is primarily Qt or GTK. Sometimes an app using the other toolkit will look kind of funny, but most of those differences are paved over by any competent distro (eg: Debian, Fedora, Ubuntu, OpenSUSE...).


There are two windowing systems on Linux - the old one and the new one.

GUI toolkits are irrelevant to games.


Yeah, maybe instead of making fun of that major industry, other operating systems should treat it as an opportunity to gain market share.

The Steam Deck is particularly interesting because of how much it is tackling that industry head-on. From the perspective of a mobile gaming console, the fact that it's a PC is awesome. From the perspective of advancing gaming on Linux, it's even more exciting.

> rarely a day goes by that I don't question how Microsoft manages to get anyone to buy that steaming pile they call an operating system.

Haha. Maybe because it works? I do miss XP though. It felt like the only OS that could run as well on a PC as on a tank.


This has very little to do with Windows itself. The issue is Explorer, which tries to estimate sizes and put the files in the Recycle Bin. It also likely checks all the permissions while counting.

Try the command line: "rmdir /s" and it's quick.


I think for the purpose of this discussion, explorer IS Windows. It is the UI that we all wish didn't have these issues. I like using the command line / PowerShell when I'm on Windows for large file operations, but sometimes I still go to drag files around out of habit and quickly remember I have made a huge mistake as everything grinds to a halt / takes forever. On Linux and macOS, GUI file operations seem, most of the time, like they are just a visual way to call the same operations you'd be using on the command line.

My habit is Shift+Del on Windows, but most of the time I use Cygwin anyway.

>I think for the purpose of this discussion, explorer IS Windows.

The article is about the file system, not the UI.


This is true, but I don't understand why Explorer has been plagued with performance issues (some of them intermittent; I presume based on network status) since forever.

I'm actually a big fan of Windows on the desktop, and as long as I avoid Explorer, I have no issues at all with file system performance while I'm developing, working from the command line, VS Code and Rider, and working with containers and WSL - all works great. But Explorer is such a central component of Windows that it affects pretty much all users. I genuinely don't understand how it can be so rubbish ¯\_(ツ)_/¯


Sometimes it's .dll extensions loaded into explorer.exe that cause these problems.

I once had a PDF file on the desktop, and for some reason this file was causing explorer.exe to freeze under certain circumstances. It's all a big mystery.


> It's all a big mystery.

Such is the nature of closed source black box software.


So true, I really have to say Explorer sucks. Recently I've been bothered by an issue where it tries to fully read every .ts file, as if it thinks it's a TypeScript file, but in my case they are usually streamed video files of several GB. It just never gives up and pins my disk usage at 100%; I have to kill explorer.exe every time. I even tried messing with the Registry entries for the .ts extension, but that was no good.

And even its item handler system is flawed. Still on .ts files: the context menu can stutter when I open it on them, then freeze forever if I so much as hover over the "Open with" submenu.


What about refusing to delete files that some process has open in a dangling descriptor? Is that just a GUI thing? I'm pretty sure it isn't.

Also, is there a better way to hunt down the background process than Sysinternals? It seems pretty ridiculous that a low-level debugging tool is required to do something as modest as reliably moving and deleting files, but I've never heard of a better workaround.


LockHunter is a nice right-click utility for this

You just outed yourself as a "power user". ;)

>What about refusing to delete files that some process has open in a dangling descriptor?

This is how it is on Windows. Linux handles open file deletion a lot better.


I think that's warts in the file explorer more so than warts in the system itself.

> warts in the file explorer more so than warts in the system itself

I'm struggling to understand why one shouldn't treat the behaviour of explorer.exe as simply part of the Windows system.

Are there third-party explorer.exe alternatives available?

If so, during Windows install, where do I click to _not_ install the MSFT explorer.exe and instead select one of the better alternatives?


I'm not excusing it, I'm just saying where the performance issue is. Filesystem ops are reasonably fast if you use the command line, do it programmatically, etc. With the caveat that Windows Defender can slow things down.

The last time I really used Windows (Win2k), there were alternative WinShell options.

You'd point a registry key at your program. There used to be a 3.1-style Program Manager; I miss that one.


Is "you can't delete a file if anything has it open" an explorer thing? I thought that was an OS thing.

That's an OS thing, but "wait forever while it grinds, finds one of those, and bails half finished" isn't.

I think it's the file manager in particular. When I need to do something muscular I open a DOS box and just do it at the command line. It feels plenty zippy, though I haven't bothered to do any benchmarks; I have to confess I've never learned PowerShell because it's so easy to get most things done with command.com.

Although Windows remains my favorite environment, I share the annoyance at how MS really polishes some things while failing on basic functionality, when they definitely have the resources and skills to fix that. Every time my Wi-Fi or WAN hiccups, I wonder why Windows still lacks decent network history/QoS tools out of the box.


Yep, "dir /s /b | findstr x" is way quicker than searching for files in explorer.

Oh man... Deleting a folder with a million files, or even only 100,000, is painful in Windows. You need to use the command line in order to sanely delete such folders.

It does that for the UI.

For power users, you're much better off using the command-line tools.


I have a different power user workaround: I don't use Windows if I can at all avoid it.

EDIT: Does the CLI override the "open in background" check? Or do you still need Mark Russinovich's Sysinternals to reliably perform elementary file system administration tasks in Windows?


See here for some notes from the WSL team on why certain filesystem operations that would be fast on Linux are slow on Windows:

https://github.com/microsoft/WSL/issues/873#issuecomment-391...

https://github.com/microsoft/WSL/issues/873#issuecomment-425...


Thanks for the links -- one pro tip that stood out to me was to use the D: drive (because it's likely to have fewer filter drivers attached).

"Windows's IO stack is extensible, allowing filter drivers to attach to volumes and intercept IO requests before the file system sees them. This is used for numerous things, including virus scanning, compression, encryption, file virtualization, things like OneDrive's files on demand feature, gathering pre-fetching data to speed up app startup, and much more. Even a clean install of Windows will have a number of filters present, particularly on the system volume (so if you have a D: drive or partition, I recommend using that instead, since it likely has fewer filters attached). Filters are involved in many IO operations, most notably creating/opening files."


I'm going to have to complain about this, because in real-world use these are far less of a problem than MFT contention on sub-900-byte files generated by typical Unix environments. In particular, things that are lockfile-heavy, VCSs, etc. are the most painful. The WSL thing reads like an excuse here. If you actually go and look at what's happening, it's small files which are the damage multiplier.

I think the real issue is that you shouldn't mix the two operating system paradigms, and it's far better to just run Linux in a VM and benefit from the near-native performance at the cost of a tiny bit of inconvenience. It's not a bad option when you consider the remote IDE capabilities that VS Code gives you, which is the one product they're doing 100% right.


I'm sorry, but I find that hard to believe when the linked posts are from a Microsoft employee, who says repeatedly that the problem isn't as simple as "NTFS is slow," that they spent a release optimising NTFS, that they've gotten rid of all the low-hanging fruit, and so on. I don't see why they'd be making excuses for NTFS when the underlying problems seem to be much more fundamental to Windows, which is much worse. You say you discovered the MFT contention issue when you were working on Subversion a decade ago, but these posts are from 2018, so I'm inclined to believe that the MFT contention issue has been mitigated and the real issues really are things like filter drivers, like they say.

I'm also not sure there are two paradigms here, if by that you mean there are workloads more suited to Linux and workloads more suited to Windows. Are there any file system operations that are actually faster on Windows? I'm still inclined to believe, like they say, that "file operations in Windows are more expensive than in Linux," even if large files are less impacted than small files.

I really hope Microsoft can improve this, rather than just sweeping it under the rug and saying to use Linux VMs, even if they have to make drastic changes like deprecating filesystem filter drivers completely. I think a lot of us depend on the performance of software that was designed for fast filesystems and ported to Windows (not just Git, either). I also think a lot of tasks fundamentally use a lot of small files. After all, what is source code?


No, it's still MFT contention. Someone at my company independently discovered the same problem recently with Git on WSL.

MSFT don't listen to or care about their customers' findings from experience. That's why we canned our partner status.

In this case they haven't touched NTFS at all - that's outright rubbish.

I pointed out the paradigm mismatch already. Run Unix loads on Unix, not NT.


This is actually a problem with the design of NTFS. Small files are stored in the MFT (Master File Table), which results in heavy lock contention and slowness if you're working with lots of small files, which is what most people tend to do these days.

NTFS really does quite well on large files, but it is hopeless on small Unixy things and source code. I first learned about this trying to make Subversion fast about a decade ago.

You can tune some of this out with fsutil, but it gives marginal gains. It's depressing, though, when you have a VM on Windows that is faster than the native OS filesystem :(


Windows' slow performance when dealing with millions of tiny files per hour (e.g. <50 KB) and its inability to offer fine-grained IO throttling for processes was why we moved from Windows to Linux in a previous career. Linux takes this scenario in its stride; Windows chokes (and, no, "background" IO priority on Windows isn't a solution).

Simply moving from files to a DBMS wasn't an option, since the whole point of this edge system was to allow third parties to FTP/SFTP/FTPS/etc in and out, to put/get files. We would have had to re-implement multiple transfer technologies/protocols to have a file-less database system (which didn't make cost-effective sense).


One thing I know is that mechanical disk drives have suffered a very big drop in performance in Windows 10.

I've asked this question several times: "Why would an OS need an SSD to work normally and be responsive when it was already fast without an SSD some years ago?" I often get responses implying that I don't know how computers work, that Win10 takes advantage of SSDs, and that it cannot work well on old tech. Even when you have a high amount of RAM and use an HDD, it's still slow.

I'm still looking for a decent filesystem or kernel developer to answer that question with a good enough explanation, because to me, either I'm stupid, or there is negligence, or some attempt to increase SSD sales. Maybe Windows now uses the SSD as some sort of "secondary RAM", and thus it will never be fast with an HDD.

I mean, Linux desktops still seem to be fast enough with mechanical HDDs.


Likely it is Windows Defender. Someone recently posted the issue tracker for this particular issue, and some are pointing at Defender as the cause of the I/O performance problems.

As a daily Windows user, I'm not surprised if it is Defender. I have some files and folders in the exclusion list because Defender was interfering with files that the software was trying to use.


That’s probably part of it, but NTFS has just been slow forever and ever. Anything involving lots of small files is just pain. Perhaps it seems “normal” for anyone used to Explorer progress dialogs, but it’s not.

And Microsoft just doesn’t seem to care :-(.


They care, as much as a group of people without a mind of its own can care, but it's a use case that was not common among windows software when NTFS was designed. Other design considerations were more important than high performance while writing or reading 1 million small files per minute.

It's not Windows Defender. Not entirely, anyway. NTFS is slow when reading or writing lots of small files. It just is, and it always has been, and it's because of how NTFS was designed, if I am remembering correctly.

Known issue? Yes. One of the reasons many games have faster load times under Wine when compared with the actual OS.

No clue on what the actual underlying issue is. And it's not like we can easily switch filesystems to compare.


If you're doing a bunch of I/O, it's often worth it to go into Windows Defender settings and turn off "real time protection" temporarily. It can make a big difference.

Absolutely, and for your project directories / places where lots of files thrash about, add those dirs as exceptions in the options.

I mean, I added C:\ as an exception because I do not need real-time AV, but I don't think it will actually work like that.


At the end of the article it says:

"SQLite is much faster than direct writes to disk on Windows when anti-virus protection is turned on. Since anti-virus software is and should be on by default in Windows, that means that SQLite is generally much faster than direct disk writes on Windows. "


This is a key point. Windows is starting up the anti-virus program and scanning every tiny file as it is written. This is less about NTFS being slow and more about anti-virus being considered a required overhead on Windows.

The fact that it is twice as slow as Windows 7 is puzzling on its own.

One would assume that it gets better with every upgrade, but a factor of 2x worse seems to indicate that they added some kind of logging or something.

Like the logging/access system Android started to integrate in Android 9, where every access needs to go through the StorageManager.


I think this has been known for a long time. E.g. Git takes forever compared to the equivalent operations on Linux.

Yes. NTFS filesystem performance when accessing many small files rapidly is slow. It's an NTFS limitation, not any specific version of Windows.

I believe it is caused by some per-file overhead that NTFS incurs, though I'm not 100% sure.


> It's an NTFS limitation, not any specific version of Windows.

I would guess their surprise is because the charts specifically show Win10 as being 100% slower than Win7 in the first chart, and still a good 30% slower in the best case (third chart).

That's not NTFS v other, that's Win7 v Win10, both on NTFS.


NTFS comes in different versions, you know.

I don't care what version of Windows you're on, or what version of NTFS you're on; NTFS is dog slow on small files.

I am not sure that NTFS has changed versions between Windows 7 and Windows 10, so let's just assume that it hasn't.

If the limitations of the filesystem do not change between operating systems, then what must change is how the OS deals with the filesystem limitation, and/or attempts to work around them.

Windows 7 does better than Windows 10. That doesn't mean NTFS isn't slow.


> NTFS comes in different versions, you know.

A completely pointless remark when versions of NTFS and Windows are correlated. And AFAIK there's been no update to NTFS itself since the addition of symlinks in NTFS 3.1, released with Windows XP. All "filesystem features" since were added at the OS level, using existing NTFS features.

> If the limitations of the filesystem do not change between operating systems, then what must change is how the OS deals with the filesystem limitation, and/or attempts to work around them.

Yeees? Which would make the issue vary based on the specific version of Windows maybe?

> Windows 7 does better than Windows 10. That doesn't mean NTFS isn't slow.

No, but I never claimed such a thing. I'm just pointing out that your assertion that

> It's […] not any specific version of Windows.

is bullshit.


>> It's […] not any specific version of Windows.

> is bullshit.

No, it's not. NTFS is slow, even when it's running under Linux. NTFS has a very pronounced weakness, here.

Performance changing between Windows 7 and Windows 10 means that ON TOP OF NTFS there are other performance problems in Windows 10. It does not mean that it isn't an NTFS problem in the case of Windows 10 -- it means that it isn't ONLY an NTFS problem in the case of Windows 10. It's an NTFS problem in addition to whatever other things Windows 10 is screwing up, when compared to Windows 7.

The difference between Windows 7 and Windows 10 alone? Yes, that's entirely because of Windows 10 and not because of NTFS. There. I never disagreed with that.


You are making a point that truly doesn't matter.

Everyone on HN does that 100% of the time they comment. It's not the "things that matter" section of the site, it's the "comments" section.

Personally, I have not seen that. Which operations did you think were slow? Small-block/large-block? Sequential/random I/O? Etc. Or did you just mean general productivity?

I know that the open function on files is very slow. This makes opening many small files very slow.

My theory is that the OS likes to do a deep inspection of your files in order to feed microsoft's machine learning models.

your theory is based on ignorance and suspicion instead of anything real.

And what is yours based on? It's proprietary software. We're not supposed to know how it works or what it's doing.

Somewhat related, I wrote a fuse-based file system in Rust recently that used SQLite as the backing store for file records, though not the file contents. I imagine I could use it for file content as well, so it's good to know more about its performance.

https://amoffat.github.io/supertag/


I wanted to learn about FUSE and SQLite in Go, so I built a filesystem that was stored entirely in SQLite. It's incomplete (read-only for directories at the moment, among other nits), but it was surprisingly easy and performed decently on some tests I did; nearly half the throughput of a raw filesystem.

I've wanted to do something like this for ages - definitely going to check this out! The limitations imposed by the hierarchical structure of traditional file systems bother me rather frequently.

Are there any notable trade-offs or limitations you’ve encountered with this approach?


SQLite, iirc, has a limit of 1GB per row, and that might too severely limit the utility of your file system if you don't end up splitting files into multiple fragments (rows in the database).

You are right: by default it's 1 billion bytes per row or blob, and it can be raised to 2GB with a compilation parameter [0]. Storing video streams may not be a good idea, but e.g. photos, thumbnails or some JSON documents may be absolutely fine.

[0] https://sqlite.org/limits.html
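If you do want to store files larger than that, the usual workaround is to split the content across multiple rows and reassemble it on read. A rough sketch in Python, with a made-up chunks table (not from any particular project):

  import sqlite3

  CHUNK = 1 << 20  # 1 MiB per row, comfortably under SQLite's blob limit

  db = sqlite3.connect("files.db")
  db.execute("""CREATE TABLE IF NOT EXISTS chunks
                (path TEXT, seq INTEGER, data BLOB,
                 PRIMARY KEY (path, seq))""")

  def store(path, src):
      # Split the source file into CHUNK-sized rows keyed by (path, seq).
      with open(src, "rb") as f:
          for seq, data in enumerate(iter(lambda: f.read(CHUNK), b"")):
              db.execute("INSERT OR REPLACE INTO chunks VALUES (?, ?, ?)",
                         (path, seq, data))
      db.commit()

  def load(path):
      # Reassemble the content in sequence order.
      rows = db.execute(
          "SELECT data FROM chunks WHERE path = ? ORDER BY seq", (path,))
      return b"".join(r[0] for r in rows)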


You read too fast:

> SQLite as the backing store for file records, though not the file contents


I believe they are referring to:

> I imagine I could use it for file content as well


So did you:

> I imagine I could use [SQLite] for file content as well


Very cool! I've been using TMSU for this purpose for a while, but I'll have to check out supertag :)

No idea what's with the SQLite articles making the front page every other day. This one is pretty old as well.

  The measurements in this article were made during the week of 2017-06-05 using a version of SQLite in between 3.19.2 and 3.20.0.

I think it's simply that a lot of folks on HN (including me) like SQLite very much and instantly vote up an article about SQLite (even when we've already seen it :D).

Yeah, IMO there’s certain tech that HN, in aggregate, likes a lot, and readily upvotes positive articles about - SQLite is in that category, along with Postgres, CockroachDB, Go, Rust, etc.

There’s also certain tech HN, in aggregate, strongly dislikes, and readily upvotes negative articles - Mongo, anything “modern JS ecosystem”, systemd, etc.


Often for good reason too. SQLite is amazing. Modern JS and frameworks...not so much.

I think tech like a lot of modern JS frameworks, and Mongo, is really good for dev productivity, especially in the earlier days of products. If you're at a very new startup where the company could die any day, and you must ship absolutely as fast as possible to keep the company alive, that can be a truly essential feature.

But then if said startup gains traction and the team/codebase/systems grow a lot, it can easily become hard to maintain, and you probably wish your backend was implemented in, say, Go/Postgres over Node/Mongo. Or that your mobile apps were written in Swift and Kotlin over React Native. And I think a lot of the HN crowd works at "startups becoming big businesses", so this is probably a common headache. But it doesn't necessarily mean the tech is BAD, just that it's good for certain things (like early days productivity), but a pain for others (maintainability as the system scales).

SQLite is a bit unique in that it's just a super high quality piece of software, that is arguably the best short AND long term solution for the problem it solves (mostly being an embedded DB). But for software where it's more of a tradeoff around early productivity vs. long term maintainability, I think HN is pretty strongly on the long term maintainability side, and that's more of an opinion/choice than a clearly "correct" answer.


I think that whenever some topic sticks on the front page, and has a couple of hundred upvotes, you'll see a trailing/decaying trend of similar posts the days after.

Some people see the main post, do some more research on the topic, find something cool / interesting, and post it here. Then this induces similar reactions, until the interest is low enough to make them disappear.

But then again, some recurring topics seem to stick around forever. It seems like there's a new Rust article here almost daily.


It's a social voting site, and it also belongs to a VC group. So when things make it to the front page, they're either voted by many people, or pushed forward by the site owners.

In the case of SQLite, it's probably the first.

Articles also come in batches (of a topic). Somebody sends something about X (e.g. SQLite) and then others see it, some start reading more about X on their own and digging up further articles, and inevitably a few will post those other articles they've found as well. Rinse, repeat.

Age of article also has little to do with being on the front page of HN. There are articles that have been submitted, and made it to the front page, dozens of times from 2010 to today, and their original publication date can be half a century old (e.g. some old Lisp articles, Dijkstra articles, "Worse is Better", etc.).


Because SQLite is the real deal. We see a lot of hype about technologies that have come and gone, but SQLite has stood the test of time AND is a pleasure to use.

In some ways it is the Platonic ideal of what a SQL database should be (from a developer's perspective). There isn't any mucking about with dev-ops configuration, because the DB is file-based. Just type "sqlite3 out.db" and you have a database to write queries against. At the same time it still manages to be quite performant.


Either it's a concerted effort by some number of people to promote it, or it's that most people didn't see the original news and now find it interesting.

Hmmm.

SQLite is 35% faster reading and writing within a large file than the filesystem is at reading and writing small files.

Most filesystems I know are very, very slow at reading and writing small files and much, much faster at reading and writing within large files.

For example, for my iOS/macOS performance book[1], I measured the difference writing 1GB of data in files of different sizes, ranging from 100 files of 10MB to 100K files of 10K each.

Overall, the times span about an order of magnitude, and even the final step, from individual file sizes of 100KB down to 10KB, was a factor of 3-4 difference.

[1] https://www.amazon.com/gp/product/0321842847
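The shape of that measurement is easy to reproduce yourself; here's a rough sketch (not the harness from the book, just the same idea of writing a fixed total amount of data in ever-smaller files):

  import os, time

  TOTAL = 1 << 30  # 1 GiB total, split into files of varying size

  for size in (10 * 2**20, 1 * 2**20, 100 * 2**10, 10 * 2**10):
      count = TOTAL // size
      data = os.urandom(size)
      outdir = f"out_{size}"
      os.makedirs(outdir, exist_ok=True)
      t = time.perf_counter()
      for i in range(count):
          with open(f"{outdir}/{i}", "wb") as f:
              f.write(data)
      print(f"{count:7d} files of {size:9d} bytes: "
            f"{time.perf_counter() - t:.1f}s")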


I wondered the other day how things would behave in Node if dependencies were in node_modules.sqlite3 (with the possibility to eject and edit if needed).

Well, pnpm centralizes it so that you're only referencing a single location via symlinks, and that is a major speed-up. I think moving dependencies out of the file system altogether would be nice. I've explored the possibility of this, but I think the way Snowpack does streaming imports [1] in version 3.0 may be the best solution overall. I haven't spent much time testing it, but it appears to be a major improvement to the process.

[1] https://michalkuncio.com/how-to-use-snowpack-without-node-mo...


Snowpack is great for the front end; what I was thinking about more is the millions of files after npm i everywhere on the back end - on your local dev machine, on CI, in Docker images, etc.

Shouldn't it be possible to make an SQL structure and mount it like an FS to find out?

I was thinking more about using ESM loader hooks [0] directly.

[0] https://nodejs.org/api/esm.html#esm_loaders


That would be the worst of both worlds: The overhead of storing your data in SQL, plus the overhead of filesystem search and access.

The better way to do this would be to have node access the modules in SQLite directly.


Filesystem search and access is different when you're talking about a single larger file versus multiple smaller files. All filesystems have a per-file overhead that would effectively be eliminated if you could pack all of your files into a database structure. Indices would also speed up access to individual rows of the table significantly.

There is overhead in small-file storage anyway, if the files are not exactly the size, or a multiple, of the sector size. Storing 1KB files on a disk where the sector size is 16KB is far more of an overhead expense than storing those files in an SQL database.
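To put rough numbers on that example: each 1KB file on a 16KB sector/cluster wastes 15KB, so about 94% of the allocated space is slack, and a million such files occupy roughly 16GB on disk for about 1GB of actual data. Packed into a single SQLite file, the same data stays close to the 1GB mark plus B-tree and index overhead.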


Past related threads:

35% Faster Than The Filesystem (2017) - https://news.ycombinator.com/item?id=20729930 - Aug 2019 (164 comments)

SQLite small blob storage: 35% Faster Than the Filesystem - https://news.ycombinator.com/item?id=14550060 - June 2017 (202 comments)


I'd be shocked if this weren't the case.

I guess you must be a low-level developer then, because to an application developer it would seem that SQLite write speed is actually bound by the filesystem performance (which it depends on).

The title is a joke, right? It's not actually "faster than the filesystem" unless it's not using the filesystem; it's faster than something else that also uses the filesystem because they use different system calls (context switching).

You can be faster than the filesystem, while using the filesystem, if you emulate something the filesystem does poorly in a more performant way.

That's what's happening here. Storing files in SQLite removes any per-file overhead in the filesystem. The filesystem now only has one file to deal with, instead of however many are stored inside the SQLite database.

This is a very real phenomenon, and definitely not a joke.


The correct comparison is that the "SQLite index" is faster than the "file system inode index".

Did Microsoft not try to embed SQL Server as the backing store for files in Windows "Chicago" to make search a fundamental part of the OS?

It was a Memphis (NT 4.0) goal, canceled. Later, a Longhorn (Vista, "NT 6.0") goal, also canceled.

The Windows file system is still as dumb as it was in NT 3.1 -- hardly any changes since then.


I believe you're thinking of WinFS [1], which was destined for Windows "Longhorn" [2].

[1]: https://en.wikipedia.org/wiki/WinFS

[2]: https://en.wikipedia.org/wiki/Development_of_Windows_Vista


Close, it was Cairo/NT5 for the initial incarnation. Then it kept getting pushed back, finally shipped as a beta, then just didn't ship at all:

https://en.wikipedia.org/wiki/WinFS


There's a great Software Engineering question here, and that is,

Should SQLite be modified such that its code, when compiled with a specific #define compilation flag set, can act as a drop-in filesystem replacement -- at the source-code level -- for an OS?

I wonder how hard it would be to retool SQLite/Linux -- to be able to accomplish that -- and what would be learned (like, what kind of API would exist between Linux/other OS and SQLite) -- if that were attempted?

Yes, there would be some interesting problems to solve, such as how SQLite would write to its database file if there is no filesystem/filesystem API underneath it.

Also, one other feature I'd like: the ability to do strong logging of up to everything that the filesystem does (if the appropriate compilation switches and/or runtime flags are set) -- but to somewhere else on the filesystem!

So maybe two instances of SQLite running at the same time, one acting as the filesystem and the other acting as the second, logging filesystem -- for that first filesystem...


As much as I love SQLite, if I understand correctly, this is saying that SQLite doing an fread() on an already-opened file is 35% faster than non-SQLite doing fopen()+fread() on each individual file.

Is this just a blatantly dishonest comparison, or did I miss something?


Locality strikes again!

> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database.
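You can see the padding half of that claim directly on Linux/macOS by comparing allocated blocks rather than file sizes. A small sketch, assuming you already have a directory of small files and an equivalent single-file database; the names "blobs/" and "blobs.db" here are placeholders:

  import os

  def allocated(path):
      # st_blocks is counted in 512-byte units on Linux and macOS.
      return os.stat(path).st_blocks * 512

  per_file = sum(allocated(os.path.join("blobs", name))
                 for name in os.listdir("blobs"))
  print("one file per blob:", per_file, "bytes allocated")
  print("single sqlite db: ", allocated("blobs.db"), "bytes allocated")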

One trick I used to use is to put a small read/write cache in front of file operations. I would usually pick something like the size of a sector or cluster. Windows is pretty bad at small writes, and merely 'ok' at reads, because you can usually just get away with the Windows file system cache in that case.

I came across the write-cache bit accidentally when trying to minimize wear on an embedded flash device. The improvement was borderline unbelievable; I thought I had done something wrong.

Minimizing open/close and keeping your reads/writes close to sector/cluster size on many filesystems can produce some very nice results, as you minimize the context switches from user space to the kernel. In the case of adding in a 'db' layer, packing probably helps as well, since the slack on small files is a huge percentage of the total file. So you would be more likely to hit the file cache, as well as any built-in caches in your stack.
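A minimal sketch of the write-side version of that trick; Python's buffered I/O layer already does the batching if you hand it a cluster-sized buffer, so the kernel sees far fewer, larger writes (the file name and record size here are made up):

  CLUSTER = 4096  # pick something near your filesystem's cluster/sector size

  # 100k tiny application-level writes, but the buffered layer coalesces
  # them, so the OS sees roughly CLUSTER-sized writes plus one final flush.
  with open("records.bin", "wb", buffering=CLUSTER) as f:
      for _ in range(100_000):
          f.write(b"x" * 17)  # deliberately awkward 17-byte records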


SQLite also performs well when there are a large number of "files". A simulation I wrote used a large number of files (10k+). Eventually I had to transition away from using the file system as a key-value store, because opens, reads, writes, and directory listings were slow when you have that many files in a single directory and that many open file descriptors.

I'd like to see them bench SQLite against different filesystems, since both fields have progressed quite a bit in the past 4 years.

I would love for directories of small files to be stored this way under ZFS; like infrequently updated tar or zip (no compression) files. (Though the filesystem layer of compression might operate on the whole file, or maybe as two streams, one for the file and one for the index.)

I think this is kind of what ReiserFS did back in the day for small files: keep them in the tree and share pages among lots of small files. It worked rather well; it's just that the file system had plenty of other issues with reliability and recovery at the time. It was vitally important that you didn't store any plain disk images containing a ReiserFS partition on a ReiserFS mount; fsck could decide to merge the two together, causing unknown amounts of corruption.

A killer feature if you allow me to say it.

This joke will never grow old...

I was under the impression that in NTFS files under 4k are stored in the MFT. I feel like they might have chosen 10k to break this barrier and end up in filespace land.

edit: https://en.wikipedia.org/wiki/NTFS#Resident_vs._non-resident...


It's surprising how fast you can get DETS (the persistent storage version of Elixir's in memory kv-store ETS) to act as a database of sorts. Even on a relatively slow SSD.

I wrote an ORM to use SQLite for serializing/persisting objects to disk (i.e. using SQLite DB as a file format).

This was one of the reasons why SQLite was an easy choice.


Everything is a cache

(2017)

So the answer is that NTFS needs a Raid controller with cache RAM? That seems to fix a lot of stuff.

Then making a small CDN using nginx and SQLite could be a thing?

A CDN is effectively read-only and served from memory for the most part.

But for a CDN, you'd probably have caching; that would eliminate the need to take advantage of this kind of optimization.




