Hacker News new | past | comments | ask | show | jobs | submit login
Hosting SQLite databases on GitHub Pages or any static file hoster (phiresky.github.io)
1812 points by phiresky 16 days ago | hide | past | favorite | 244 comments



Pop open the network pane in the browser tools and try running this SQL query for a demo of how clever this is:

    select country_code, long_name from wdi_country
    order by rowid desc limit 100
It fetches just 54.2KB of new data (across 49 small HTTP requests) to return 100 results - from a statically hosted database file that's 668.8MB!

I have an animated GIF demo of this here: https://twitter.com/simonw/status/1388933800445452290


Huh, that's actually kind of a worst case I didn't think about: Since you're doing a reverse table scan my "sequential access" detection doesn't kick in. If you do the same query but with a forward scan it should fetch roughly the same amount of data but only do like 5 HTTP requests since the request size doubles for every sequential access.

e.g.:

select country_code, long_name from wdi_country where rowid >= 164 order by rowid asc limit 100;


I solved a similar problem recently: given a stream of data, how should you choose packet size in an online way to minimize regret (a linear combination of spare capacity of last packet and total packets used).

Turns out doubling isn’t the best strategy. The optimal solution is actually to add a constant increment to packet size. How much depends on relative cost of the terms in the regret function.


> I’ve set the page size to 1 KiB for this database.

> That’s because I implemented a pre-fetching system that tries to detect access patterns through three separate virtual read heads and exponentially increases the request size for sequential reads.

> Since you're doing a reverse table scan my "sequential access" detection doesn't kick in.

You know, starting off with the default 4kB page size naturally adds some resistance to these kinds of failure cases. If the VFS isn't issuing many requests in parallel, I would think that setting up a page size near target_bandwidth * round_trip_time would be a better initial guess. 1kB would be appropriate for a pretty low latency-bandwidth product.


That's true, but it also means that random access will always use at least that amount of data even if it only has to fetch a tiny amount. I did a few (non-scientific) benchmarks on a few queries and 1kB seemed like an OK compromise.

And note that the request chunk size is bound to the SQLite page size, and to change that page size you have to rewrite the whole DB. So it can't be set on the fly unless you have multiple copies of the database.


1kb fits in most IP MTU sizes, so that seems reasonable.


Do most HTTP responses have less than ~500 bytes of headers? I guess specifically here, GH pages' responses.

It looks like one of the requests made to the DB included a little over 700 bytes of response status line and headers, so that would probably end up spilling into more than one response packet, unfortunately.


With http/3 header compression, I assume the answer is yes.

Hah, I thought "in reverse order by ID" might be a stress test but I was still very impressed by how it performed!


Would it be possible to use a datasette frontend with this as a "backend" and statically host the whole thing?


Not easily - Datasette is written in Python, so you could try running it in WebAssembly (like Mozilla did for Jupyter with https://hacks.mozilla.org/2019/03/iodide-an-experimental-too... ) but it would be enough work that it might be easier to reimplement a subset of Datasette directly in JavaScript.


I've got no experience with Datasette, but if it's all in pure python, then you can use Pyodide[0] to run it within a wasm module. I was surprised how easy it was - took about 15 minutes and I had a Python package working perfectly in the browser. The Python package is about 3x faster than the JS/wasm version, but I'm guessing that performance gap will narrow when we get wasm-gc.

[0] https://github.com/pyodide/pyodide


Thank you!


mind blown. how is this possible???


Use Http Range (normally used for pausing and continuing large file downloads) to request specific byte ranges from the file. From there you can pull only what you need. With sql indexes it'll be very tiny since the lookup is optimized. Of course if you select *, you're still going to pull the entire database locally.


TL;DR http, properly implemented, supports a ton more stuff than even many “web developers” are aware of, like… range requests, which are exactly what you’d think they’d be.


Also SQLite store data in pages and the page size can be tweaked. Combined with range requests any part of the database can be requested.


the most recent update to the W3C's own research webserver, written in Java, called Jigsaw, seems to be dated in 2007. I used it for a lot of purposes until 2002 but I don't know why I stopped working with Jigsaw only that by the time F# emerged in 2004 I was absorbed into a new direction :

https://jigsaw.w3.org/

iirc Jigsaw was used to develop and validate the WebDAV protocols and XQUERY which at the time I remember thinking XQUERY was sure to be the future as implementations of advanced data management and manipulation and query and distribution and declaration looked to be what the whole point of webservers were for. The incredible distractions caused by "rich media" as opposed to multimedia as it was understood then, are really worth thinking about. Saying that, however, the BBC is doing excellent work on restoring the balance of necessary powers to the network standards engineers https://www.bbc.co.uk/sounds/help/questions/about-bbc-sounds...

https://www.bbc.co.uk/rd/blog/2014-03-media-source-extension...

https://www.bbc.co.uk/rd/projects/nearly-live-production


The last link was a very interesting read - I wonder if BBC or anyone else has open-sourced a video streaming and editing system like that.

Connecting this to HTTP range requests, the edited video feed can consist of a list of the exact byte ranges that the clients need to download and play. Found this description of how Akamai uses range requests to serve low-latency streams: https://blogs.akamai.com/2020/11/using-ll-hls-with-byte-rang...


gpt-2 ?


BTW thank you havernator, because I have just realised what I can do with the setup I'm almost ready to pull the trigger on that'll give me a surfeit of online capacity (at least a baseload can be maintained while the rest is used for work instead of cloud time) : I am definitely going to investigate the possibility of providing a high level of standards specifications for simple web serving. If the W3C Jigsaw project had been maintained, I'd simply put it up and invite interested users to persuade me to send them a shell login. OK obviously that's far too naive today, but I would love to run a especially standards compliant host for negligible nominal or even no charge so people could maybe get a view of better ways to present the WWW.

frankly I think that unless we do things like this, the Internet is simply going to become a closed shop to anyone not wielding enterprise budgets and legal department capabilities.


Not trying to offend, but this comment was hard to follow in a weird way. Along with your profile this makes me wonder-- are you gpt-2?


Their profile contains a full essay of near incomprehensible mumbo jumbo:

https://news.ycombinator.com/user?id=Cullinet


So I'm not alone. I felt that too!


Amazing use of SQLite! Thanks!

So assuming no country has a name longer than 98 characters and that all country codes are 2 characters, that is over 500% overhead? Are you missing a /s in your post?


Since random accesses across the internet are really slow, for this kind of fairly small table (where SQLite stores the row data inline within the B-Tree of the table) it basically fetches the whole content for each row - so even if you query only the long_name and country_code column, it will in fact fetch the data of all 29 columns in that table.

If you want it to fetch less data for querying a subset of columns, you could create create an index on those columns - then SQLite will do an COVERING INDEX scan and thus read only the necessary data (with the B-Tree itself and the start / end page alignment being the only overhead).


Nothing to add to the conversation, just wanted to say I absolutely adore this. Years ago I used to think it'd be nice if search engines simply exposed all their shards to clients and let clients do all the results merging and suchlike. Of course that's probably a terrible idea in practice for performance, but this library is definitely implementing the spirit in a very practical way!

It also reminded me of a vague inverse of this hack. In old versions of Qemu (possibly it is still implemented, but I have vague memories it got ripped out), you could point Qemu at a directory on disk and it'd produce an emulated floopy disk drive with a virtual FAT12 image containing the directory contents. AFAIK it didn't keep the actual data in memory, I guess all it needed was file sizes to know how to build a virtual memory mapping that contained the filesystem metadata + proxied reads from the underlying files for data sectors. I look forward to seeing your implementation of this concept in a virtualized SQLite file <-> GraphQL proxy ;)

edit: insane, it still exists and apparently supports write mode?! https://en.wikibooks.org/wiki/QEMU/Devices/Storage#Virtual_F...


Have you actually read the article? SQLite is unmodified, and thinks it runs on a virtual file system, which fetches file chunks via HTTP range headers.

It's REALLY impressive that you only need to read 54 KB out of 700 MB, to fetch the records.


> It's REALLY impressive that you only need to read 54 KB out of 700 MB, to fetch the records.

the harsh reality is that doing sensible queries that only reference and return the data actually needed always makes things faster. Even with server DBMS. Oh, how many times have I lamented the naive "select *" for forcing all the row contents even when there was index coverage for the actually needed data.


Do most static site hosters support range requests?


Most web servers do out of the box, so I would assume most do. Basically all unless they have some reason to turn range processing off or are running a custom/experimental/both server that have implemented the feature (yet).

Not supporting range requests would be a disadvantage for any service hosting large files. Resuming failed long downloads wouldn't work so users might not be happy and there would be more load on your bandwidth and other resources as the AU falls back to performing a full download.


Generally yes. Because not having range support means you can't resume file downloads. Which is a pretty essential feature for a static file host.


More interestingly, do reverse-proxies like Varnish / CDNs like Cloudflare support range requests? If so, do they fetch the whole content on the back, and then allow arbitrary range requests within the cached content on the front?


Yes, Cloudflare behaves as you describe.


I was wondering that too. Support was spotty in general ~20 years ago but I assume things have improved since then.


It's impressive on one hand.

On the other it's still a lot of overhead.


I would say it's less overhead than downloading the entire db to query it locally...? What is your suggestion for accessing a static database with less overhead?


I would bet that if you compare it to a traditional server-client database (which functionally does essentially the same thing: you send it a query over the network, and get a result back), the overhead is probably massive. This is a very clever way to cram that kind of functionality into a static hosting site, and you can imagine some uses for it, but it's clearly not a particularly efficient compared to doing it the "right" way.


The idea is that you're weighing the pros cons vs an actual live database. This is basically only a good idea if you're having someone else paying the hosting fees.


Since it's a static database and the queries against it are most likely going to be static, just pre-run the queries and store the results statically in a more space-efficient format. When you're on a dogshit internet connection in a 3rd world country 50kb can actually be pretty unpleasant. Try rate limiting your internet to EDGE rated to see what I mean.

I'm not saying the whole thing isn't impressive, just that the concept itself is one of those "because I can" rather "because I should" things, which kinda devalues it a whole lot.


For a casual or personal use case though, the alternative of running a client-server database on something like a VPS is probably more overhead than this. It's unlikely to be a very scalable option, but for use cases as described by the author it seems like a good fit.


I know of the drawbacks of the approach and wouldn't chose it for a lot of my projects, but I would say it is very scalable in those cases where I would. Put the DB on GitHub Pages for free along with your HTML/JS code and you can scale to whatever GitHub is willing and capable of delivering. Yes, your users might transfer way more data than needed but you pay nothing for it and do not have to maintain servers.

In the standard scenario for personal projects (not enterprise) I would have a small VPS/Dedicated server with a REST service running - that would be hugged to death immediately if a link would make it to some site like HN. And also, I completely share the experience of the Author that after a couple of years you have moved on, the VPS is dead etc and you don't want to invest time.

Again, before considering using solution, be sure to understand how it works and the resulting limitations or you will likely chose wrong.


> sql.js only allows you to create and read from databases that are fully in memory though - so I implemented a virtual file system that fetches chunks of the database with HTTP Range requests when SQLite tries to read from the filesystem: sql.js-httpvfs. From SQLite’s perspective, it just looks like it’s living on a normal computer with an empty filesystem except for a file called /wdi.sqlite3 that it can read from.

From this paragraph it should be pretty clear that it's actually a great result. The database will obviously need to read more data than it presents, so more is fetched.


The clever part here is only fetching 50KB out of 680MB.


This might be true.

But this approach lets you actually work out what the optimal size is:

  select sum(length(country_code) + length(long_name)) from wdi_country;
gives: 6307

Or on average:

  select sum(length(country_code) + length(long_name))/count(*) from wdi_country;
gives: 23

(Note that it doesn't seem possible to use aggregation functions with a limit clause)


Man you are a frickin genius, seriously. like how you put all this together all the depth of knowledge of different topics this would require the low level and the high level and the way you explain it simply confidently and with impact. Your work is really an inspiration. You computer scienced the sheet out of this thing.

this achievement, and this blog post, to me is on par with blog posts that you would see from a major company where they solve some significant business critical technical challenge in-house. for example: the GitHub blog post about how they created their spinning globe of commits on their homepage, or a Netflix blog post of how they optimized their Network infrastructure to serve so many customers.

your work is truly incredible. You're next level of next level.


Thank you, I really appreciate it. It's pretty fun to do this kind of thing for yourself, but it's really rewarding to be able to share it with other people.


have you thought of getting this to work with network replicated p2p sqlite js library? forgot what it was called.

also does this mean that static sites now can scale without relying on db???

what other implications and application do you see?

wonderful work!!!! please let me know how i can donate


The question I had is answered by this line of code:

  xhr.setRequestHeader("Range", "bytes=" + from + "-" + to);
I am a little surprised you can just do that. In https://github.com/phiresky/sql.js-httpvfs/blob/master/src/l...


Range headers are a pretty standard tools to e.g. continue interrupted downloads and similar.

Any well designed system, especially if it has static sources and is server cached should support it.

Surprisingly many web-frameworks don't support it out of the box, or don't support it well.

Either way gh-pages are static content and probably with some server side regional caches, so I'm not surprised it works.


Range headers are also how DownloadThemAll works.

If you're pulling a single TCP stream across a crowded network, you get maybe 1/40th of the available bandwidth. If you do four range requests instead, you might see >3/40th of the bandwidth.

This place we were contracting at, the managers were widely rumored to stream sports games at their desks, and our release cycle happened to fall on a game day. My poor coworker was downloading our installer every week, and the ETA was over 40 minutes. "Are you using DTA?" "No, what's that?" <fiddle fiddle> ETA: 12 minutes.

12 minute pauses in the middle of a manual process are a lot easier to stomach than 40 minutes. Especially if something goes wrong and you have to do it twice.


I've learnt about this by using https://www.biodalliance.org

It's an embedded genome viewer, you can just point it at a multigigabyte reference files and .seg files and it loads super quick


Here is direct link to GitHub with the usage: https://github.com/dasmoth/dalliance/search?q=bytes%3D%27&ty...


One of the heaviest users of range requests is (or was) the Adobe Acrobat PDF plugin.


Also .mp4 files. The format is designed for seekability, and browsers take advantage of this.


You can even point VLC at a .iso file on a web server, and seek around in it.


Progressive JPEGs work well for this too, so you could have the same file used for a tiny thumbnail and large preview and full sized photo by sending different range requests. However you need to know how many bytes to request.

I'm surprised this isn't used on mobile browsers to lower data usage. I'm sure with a little research you could figure out what a good mapping from pixel size to byte size should be to give good enough results.


A browser doesn't have enough information to use this optimization. At the point where it's about to request an image, it doesn't know how large the resource will be, whether it'll be a progressive JPEG, or even whether it'll be a JPEG at all. Making range requests blindly would probably be a net loss -- for every progressive JPEG that the browser managed to save some time on, it'd have to make follow-up requests for many more non-progressive JPEGs, non-JPEG images, and progressive JPEGs which it didn't get enough data from on the first try.


You are correct.

However, one could use this approach: download as usual, and in a streaming fashion process the data and if it's a progressive JPEG, you can close the connection before you have received everything; and then you can cache the prefix and later download the rest if needed.

Fast clients will just swallow the whole file, while slow clients would be able to benefit from it.

It wouldn't work for pipelined HTTP connections though without cancelling the whole pipeline, so maybe not a very practical solution given the performance benefit that already gives. And HTTP/2 maybe doesn't support cancelling a transfer either, so.. ?

Maybe a direct "Accept" or "Prefer" header to indicate that it's enough to send just something useful for an icon would be a more ideal solution, but it would require server-side support.


I remember putting together a sloppy http range implementation that initially only supported single ranges, it had quite the explosive effect on adobe reader when it didn't get the expected response to its multi-range requests :)


I'm surprised that works, iirc pdf isn't defined in order and can't be parsed streaming


Linearised PDF files are defined in order and can be read from the start.

https://blog.idrsolutions.com/2010/02/linearized-pdf-files/


I think that's what the "optimize for Web" checkbox does


Wow so it actually does something! I wish programs would use such vague descriptions. (Or more of them had helpful instant tooltips.)


Linearized (a.k.a. web optimized) PDF files only help for displaying the first page quickly. The rest of the file is still in pretty much random access order.


For what it's worth, there's a format for content like pdfs that's optimized for viewing on screens (as opposed to printing), djvu: https://en.m.wikipedia.org/wiki/DjVu

Oh, wow initial release 1998,now I'm feeling a bit old...


I guess "all" that's needed for write would be a webdav server with support for PATCH x-update-range? :)

https://tools.ietf.org/html/rfc5789

https://sabre.io/dav/http-patch/

Unfortunately, solid, stand-alone webdav servers are harder to come by than decent http2/1.1 servers.


Not all webservers support/enable it, so YMMV.

But as long as you're dealing with a known server that does, then gravy!


nginx enables it by default. Another ingenious use of range request is zsync, it allows you to diff compressed binaries on a remote with local ones, so that you only have to download what has changed on an update. AppImage uses this


> Not all webservers support/enable it

Could you provide an example of server that does not?

AFAIK, Range is supported by all major CDNs, so not supporting it in web server would be a death knell for it's real-world adoption.


I can't think of a specific one, but a decent proportion (maybe a quarter?) of HTTP downloads I attempt don't support resuming a partial download, therefore don't support Range. (I.e. resuming always starts from the beginning.)

I would assume this is often because the site in question isn't using Apache etc. to serve a file directly, but is either essentially proxying it to some custom-built file serving service, or a script that processes/authenticates the file in some way, and they just never bothered to implement Range.


Depends on what’s being served. Any decent static file server should support it, but if the content is at all dynamically produced then the authors would have to think to implement it and rarely do.


This is easily the most clever web programming hack I’ve seen this year. Bravo. I had seen this used for video or audio of course but it never occurred to me you could use it for databases. There are probably a ton of other formats this is good for too.


I wonder if this could be used to serve dynamic maps.


This is pretty much exactly what we do to serve aerial/satellite imagery maps.

We convert the imagery into Cloud optimised geo tiffs and store them in S3 https://www.cogeo.org/ then the browser can request the tiles directly from S3.

Even the big imagery providers are now storing their imagery as COGs, eg https://registry.opendata.aws/sentinel-2-l2a-cogs/


I believe this is protomaps approach: re-encode the mbtiles (sqlite-based ) format in to something that can be requested with a http range request and thus served from a single dumb webserver that doesn't need to understand sqlite or mbtiles parsing


This is the approach I took with http://github.com/protomaps/pmtiles , though it's optimized for the very specific use case of going from Z/X/Y integer coordinates to binary blobs, and takes shortcuts to accomplish that (fixed-width keys and root index page)


Funny enough I was looking into mbtiles serverless solutions before I went to bed, now I start my day browsing HN and I find this clever solution. Love HN for this type of stuff :)


The innovation here is getting sql.js to use http and range requests for file access rather than all being in memory.

I wonder when people using next.js will start using this for faster builds for larger static sites?


See also https://github.com/bittorrent/sqltorrent, same trick but using BitTorrent


Yeah, that was one of the inspirations for this. That one does not work in the browser though, would be a good project to do that same thing but with sqlite in wasm and integrated with WebTorrent instead of a native torrent program.

I actually did also implement a similar thing fetching data on demand from WebTorrent (and in turn helping to host the data yourself by being on the website): https://phiresky.github.io/tv-show-ratings/ That uses a protobufs split into a hashmap instead of SQLite though.


This looks pretty efficient. Some chains can be interacted with without e.g. web3.js? LevelDB indexes aren't SQLite.

Datasette is one application for views of read-only SQLite dbs with out-of-band replication. https://github.com/simonw/datasette

There are a bunch of *-to-sqlite utilities in corresponding dogsheep project.

Arrow JS for 'paged' browser client access to DuckDB might be possible and faster but without full SQLite SQL compatibility and the SQLite test suite. https://arrow.apache.org/docs/js/

https://duckdb.org/ :

> Direct Parquet & CSV querying

In-browser notebooks like Pyodide and Jyve have local filesystem access with the new "Filesystem Access API", but downloading/copying all data to the browser for every run of a browser-hosted notebook may not be necessary. https://web.dev/file-system-access/


DuckDB can directly & selectively query Parquet files over HTTP/S3 as well. See here for examples: https://github.com/duckdb/duckdb/blob/6c7c9805fdf1604039ebed...


Would also be great to add (efficient) search to a static blog.


yea, sqlite FTS5 has been pretty amazing for quick search solutions (but I use english only)


Definitely. Just need to add a layer to the static site generator for it to populate the SQLite DB, right?


I'd also version the DB in the URL, else you could end up changing the file out from under someone who's already got the page loaded, with who-knows-what results depending on how different the file is. You could just prefix a head to every range request to check for changes, but that adds overhead and doesn't actually completely close the gap, so it'd still be possible to read a file different from the one you intended. Cost is more disk usage, depending on how many copies you keep around, but at least keeping the most recent "old" version seems reasonable unless you're skating really close to the quota on whatever system you're using.


Requests could use the If-Unmodified-Since header to guard against DB changes without the overhead of a separate HEAD request.

If the underlying DB has been changed then the server should respond with 412 Precondition Failed.


Microsoft Access Cloud Edition, basically?


Sort of. Access had a "Forms" feature that let you create basic GUIs on top of your database. Also, the OP's project is (currently) only providing a read-only view of the SQLite database. Adding write support is possible but will be far less impressive to the HN crowd because SQLITE_BUSY will rear its ugly head ;-)


I was mostly referring to the shared file access approach.


I'm curious, in what manner could this method speed up Next.js builds? That's all done locally, which negates the effect of HTTP range requests, right?


I'm guessing they mean rather than build a static Next site that generates 10k+ pages (or whatever large means in the given context), it instead creates one page that just queries the data from the client.

I have one Next static site that has about 20k pages and takes about 20 minutes to build and deploy. I think that's an acceptable build time. But I do know of other people around the net who have mentioned having sites with 20k-ish pages taking an hour+ to build. For them I could see the desire to try this sqlite trick.


You should write a post about this if you implement it. My humble suggestion for blog post title - 'Speed up your Next.js builds with this author's one weird trick! Vercel hates him!'


I generate my entire blog on nextjs from a sqlite data base already. https://lsmurray.com/blog/database-first-development

It’s got a ton of rough edges but the boilerplate is there to get a proof of concept pretty quickly


For really large sites Next.js already has Incremental Static Regeneration which is usually the right solution to fast [re]builds: https://www.smashingmagazine.com/2021/04/incremental-static-...


Its not the same because you have to rebuild all the pages if you change your data source. In this implementation you can upload a new data set and it will work.

Its just a different stack.


Generating 20k pages in 20mins is impressive, 16 pages a second on average.

In my experience, it can take a couple of minutes just to deploy 20 pages, but that could just be the overhead of Typescript and SASS compilation too...


Hugo claims <1ms a page. Which would mean 20k pages in under 20 seconds. 20k pages in 20 mins is not fast!


Confirmed: Build a 30k pages site with Hugo in 53 seconds, including grabbing JSON content from Contentful and deploy on AWS S3.

Took some effort to get everything under 60 seconds, speed was the main reason to use Hugo.


Oops, my memory was off. It's 10,925 pages and the last build took 18 minutes.


This is fantastically creative. And the author does a great job of starting out by describing why this is useful.

And then using SQLite to insert and update DOM elements? Holy cow, icing on the cake. Unlike the first part, there’s no explanation of why you’d want to do that. But by that point I was so drawn in that I didn’t care and was just enjoying the ride.


Yeaah I felt like at that point the article was already long enough so I didn't bother describing the DOM part too much - even though I spent more time implementing that than I did implementing the rest ;)

Basically SQLite has a virtual table mechanism [1] where you have to define a few functions that figure out how to scan your "fake" table / which indices to use and then how to read / write the actual data. I hook into this mechanism and redirect the request to DOM functions like querySelector() etc. Then there's the issue about SQLite being fully synchronous, but I have to run it in a WebWorker - and the WebWorker can't actually access the DOM and it can only communicate asynchronously with the main thread... So I have to do some weird stuff with SharedArrayBuffer and Atomics.wait to make that work [2].

[1] https://www.sqlite.org/vtab.html [2] https://github.com/phiresky/sql.js-httpvfs/blob/master/src/v...


"But by that point I was so drawn in that I didn’t care and was just enjoying the ride."

:-) There a high amount of SQLite content/articles/blogs on the web that can provide this effect. SQLite is to programmers like the stars are to astronomers. A wonder.


As everyone else has been saying, this is amazing work. It sounds like the biggest issue with loading the page is the initial sql.js download - it's about 1.2MB, is that right?

Might it be feasible to easily strip down SQLite so that it only compiles the parts for read-only use? The browser version is obviously somewhat read-only but that's because of the sandbox. I'm talking about excluding the code for CREATE, UPDATE, INSERT and everything else which is just for writing. The aim here would be to produce a significantly smaller WASM binary.

I'm guessing that the answer is no, there's no easy way of doing this without significant rewrites of SQLite's core, but... I can't be the only one to think of this, surely?


The actual transferred data for the sqlite code should only be 550kB (gzip compression).

Stripping out the write parts is a good idea. SQLite actually has a set of compile time flags to omit features [1]. I just tried enabling as many of those as possible, but it didn't seem to reduce wasm size much, though I might be doing something wrong. There's also no easy flags to disable CREATE / UPDATE / INSERT .

[1] https://www.sqlite.org/compile.html#omitfeatures


I smell a series of pull requests...

I'd be curious whether there's any changes which could be made in the file format to optimize for read-only usage. The SQLite format probably has some features which aren't needed in this context -- information about free pages and autoincrement counters isn't relevant in a read-only file, for instance.


I think it wouldn't change much - SQLite is already pretty optimized towards reads, for example a write always replaces a whole page and locks the whole DB. The free pages can easily be removed by doing VACUUM beforehand which should be done anyways to balance the b-trees.

The storage of SQLite is already really efficient, for example integers are always stored as varints so small ones only take a byte. The only thing I think could maybe be improved for this use case is changing the structure of the b-tree to be more columnar - since right now all the data from different columns is intermingled with the btree structure itself, querying a subset of columns has a high overhead.


Sounds feasible to me. Either by replacing all those functions on the C side with empty shells or maybe even with wasm-opt ( but probably the OP has already used it removed all garbage collectible paths.)


Amazing, especially - for me - that the FTS5 full-text search just works. Longer term, I am if it were possible to split the DB code into read and write parts and cross-compile only read part for delivery to the browser.


If you are interested in full-text search on the client, Lunr is also an option: https://lunrjs.com/docs/index.html


Lunr needs the full index on client though, right?

Being able to use fts-5 without the penalty of having to pull down the whole index make it work much better at larger scales, even with the penalty of additional network requests.


This could be genuinely interesting for tools like e.g. sphinx-doc, which currently has a client-side search that does indeed ship the entire index to the client.


There's probably a dataset size tradeoff with small enough number of documents, but even then this could have the option of if db is < x MB total, fetch all in async task and then use that.


Time to create the index for lunr limits the size of the data-set that can be used. If you have a lot of tiny documents, then it is more practical to scan the documents directly, rather than using the lunr index.


Modeling the DOM in SQL... Further evidence that anything we can imagine has some stable representation in third normal form. Is it fast? Maybe not. But is it correct? Provably.


There was a group that rewrote the Chromium DOM itself in a data oriented design (which learns from database design and sort of applies to cache utilization and locality) and got a 6X speedup in some places:

https://meetingcpp.com/mcpp/slides/2018/Data-oriented%20desi...


This paper is basically what we do, except we have a big container type, aptly named "Domain.cs". Inside, you will find a public List<T> of every type. We decided to emulate SQL throughout the vertical (i.e. each List is a table) in order to make mapping to SQL a trivial affair. None of our Domain types contains any complex type as a property. Everything can go 1:1 to SQL. We use LINQ (or SQL) to produce projections as appropriate.

There isn't actually just 1 big domain instance either. It's more like one per user session, and then a global instance.

The impact this had on reducing complexity and bugs is incredible. I haven't seen a null ref exception in a long time. Also, being able to dump your entire universe to disk by serializing a single object is really nice.


In the genomics world, Tabix indices enables similar use cases. An ordered TSV file is compressed in chunks (bgzip) and a Tabix index created to allow range based access by mapping from the index -> chunk. This allows a browser based genome browser zoomed into a section of the genome to fetch information from a multi gigabyte file.

http://www.htslib.org/doc/tabix.html


now if only tabix and most hokey bioinformatics formats would die and just be replaced with a formal schema spec in SQLite...


Honestly I think Tabix's bgzipped TSV is one of the less hokey bioinformatics formats, at least compared to the various custom binary formats floating around.

For web browser based genome browsers I suspect this (very cool!) sqlite hack would require many more http requests.


In a similar vein, I've mentioned this before, but if you're doing Python stuff, you can use the apsw package (not the one in PyPi, though) to write a VFS layer that SQLite will use to read the database.

I've used this for the same basic idea as this article, only letting me store SQLite databases in AWS's S3 that I can access with AWS APIs so they don't need to be public. It works well, though it's absolutely not for every use case, the overhead is considerable.

I even used it once to read SQLite database files in a zip file stored in S3 without having any local storage to use. Not one of my prouder moments, but hey, I coded my way out of the corner someone else designed for me.



Yep, exactly that one. There's a simple example of a VFS implementation on the examples page that's a reasonable starting point:

https://rogerbinns.github.io/apsw/example.html

Once you wrap your head around how you need to pass parameters to the helper, it's really straightforward, you just need to implement the xOpen and xRead calls.


Thank you!


I wrote a similar thing in Rust for a Factorio mod manager. Mods are hosted on the remote HTTP server as ZIP files, and the mod manager needs a single `info.json` file from the ZIP for the mod metadata. So the mod manager avoids downloading the whole mod and then unpacking it by building a file abstraction that uses HTTP range queries to download just the chunks it needs. For ZIP files the directory is stored at the end at an unknown offset, so the read pattern is to gradually seek backwards from the end until you find the start of the directory, then find the file entry, then seek and read the file.

I didn't fiddle with the window sizes like the submitted article (the chunk is fixed to 8KiB), but I did optimize it so that reading chunk N+1 of the file reused the response reader of chunk N rather than make a new request. Furthermore I keep an LRU cache of only the last three chunks in memory, because the ZIP files are each only read once.

[1]: https://github.com/Arnavion/fac-rs/blob/2d2622a1c9934719ce65...

[2]: https://github.com/Arnavion/fac-rs/blob/2d2622a1c9934719ce65...


This is hilariously clever.

Using the "Range" HTTP header to read chunks of the database file absolutely works!

But to be clear, there's no write equivalent, is there? You can't use "Range" with a PUT request.


The write equivalent would be the PATCH method using a "message/byteranges" body: https://tools.ietf.org/id/draft-wright-http-partial-upload-0...


Wow that's fascinating, thanks. That would actually turn HTTP into a kind of random-access filesystem, if adopted.

It's amazing but also slightly terrifying in the knowledge that then someone's going to write an SMB-over-HTTP connector.


TA was clear - there's no way to write, since static file hosting doesn't support dynamic write to begin with.

However, I imagine a service to support your scenario could be written in a standard back-end server language like Go or JS. The challenges involved would be much greater, however -- how to handle concurrency in particular. I suspect one would do better to just run PostgreSQL behind a web API.


That's basically re-inventing the Database but on the client side. We have gone a long way but we are closer to having the server side just as a data store.


Even if there was, I can't imagine your everyday static host ever supporting it.


Apart from the genius of the author (conceiving a sequence of relatively understandable steps that unlock a potentially huge area), this highlights how efficient a SQLite can be in terms of slow resource usage.


This is both a clever hack and a brilliant solution. But it also worries me a bit. Mostly because I've seen a lot of Visual Basic + MS Access solutions. They work fine on a single computer, but they you put a database on a network share to be able to share it between a few computers and the performance is often horrendous. If you're doing a lot of data processing it's often best to do it as close to the data as possible.

But as always, it's seldom the tools, but the right tool used for the wrong usecase that is the problem.


This is really cool! I wonder what the restrictions are and if we would ever be able to write to a SQLite db like this in the future. This could push more to the front end without needing to write apis.


The main restriction is that the DB really needs well fitting indexes, otherwise querying is really slow and fetches a lot of data.

Regarding writing:

You could of course implement a writing API with POST requests for changing pages of the database - but then you would lose most of the benefits of this (not requiring any special kind of server).

I also thought about implementing a kind of overlay filesystem, where chunks that are written to the file are stored in a local storage so the modified data is available locally while still reading everything else from the remote database.

Interestingly in SQLite that's already exactly what the WAL mode does: It's a second file next to the database that's just a set of pages that are overlaid over the main file when read queries happen - which allows concurrent readers and writers since the database itself isn't in an undefined state even when write transactions are happening.

So you could enable WAL mode and disable WAL auto checkpointing, then you get a downloadable WAL file that can be read by normal SQLite and written back to the main file. It would be neat, but I'm not sure what the actual use case would be ;)


> I also thought about implementing a kind of overlay filesystem, where chunks that are written to the file are stored in a local storage so the modified data is available locally while still reading everything else from the remote database.

If ever the intent were to involve eventually persisting those changes, then it would be worthwhile looking at remoteStorage, which works like this.


I was thinking you could implement a write API using the GitHub API, every write can be its own commit.


> I also thought about implementing a kind of overlay filesystem, where chunks that are written to the file are stored in a local storage so the modified data is available locally while still reading everything else from the remote database.

Perhaps adding IPFS to the mix for persisting data would be interesting, I'm sure there are use cases in peer to peer applications. Anyway, amazing innovation thank you for writing this :)


Seems like it might be possible to add WebRTC(or some other peer-based live system, I don't know the performance characteristics of WebRTC) to get realtime writes. Those edits would then be added to the WAL, and every say, 10 minutes anyone who has been active for 10 minutes could checkpoint and push the changes to github.

It's slightly more centralized than perfect, but man do you get a lot for a little.


Since you can do static hosting from a git repo, I wonder if you could directly push your changes to your git repo and have your CI/CD solution just deploy it instead?

There has to be a git.js implementation out there and you could move the DB to it's own repo and create an https access token (for Github)... the issue there is that someone could use that token to commit whatever to your database repo.


Maybe forcing the user to login with GitHub would be an option? And the changes to the database could be modelled as pull requests?


Over the last few months I tried to think of a clever way to set up a legacy site for a dictionary that I serve on a VM just because I also need to run sqlite. Since I want to make sure it'll run for longer than me paying for the VM this is the best possible solution. At some point no more updates will happen and it's going to be a static website. So bundling it like this is incredible. I can run multiple backups on different hosts with no additional costs.


If you just do an occasional key/value lookup, you don't need 1.2 MiB of WebAssembly. [1] That might already exceed your total database size.

I'd solve it via sharding: divide the database into N pieces via range- or hash-sharding. [1] Choose an N that's large enough for each piece to be reasonably small. When you look up a key, fetch the shard of interest.

You can put each piece into separate files (a little simpler to code, and most static servers will use pre-gzipped files for "Content-Encoding: gzip requests" easily, but you waste more disk space due to internal fragmentation) or one file (with range serving and an index of the byte range offset for each piece).

The format for each piece can be anything, eg json (simple) or an sstable-like format (more efficient). [3]

[1] Content-Length of https://phiresky.github.io/youtube-sponsorship-stats/sql-was...

[2] hash-sharding means: piece[i] has all the keys where hash(key) % N = i.

[3] https://github.com/google/leveldb/blob/master/doc/table_form... although they just say "formatted according to the code in block_builder.cc" instead of describing the most relevant part.


I have the feeling I'm watching some semi-major new web tech being born... out of the blue, in a blog post.

Fucking amazing, mad props. Beautiful work!


This is an really awesome idea! I have a plan for a static-ish site (updated daily) that I was going to use sqlite for anyway, but server side. I will definitely look into this approach instead!


   >>> From SQLite’s perspective, it just looks like it’s living on a normal computer with an empty filesystem except for a file called /wdi.sqlite3 that it can read from. 
Beyond static hosting : Now imagine also implementing a virtual file system that SENDS chunks of the database with HTTP Range requests when SQLite tries to write from the filesystem

Or more generally: I predict a WASI implementation which will treat ANY server resource as a virtual file, replacing REST.


First of all, this is a very cool web hack, I like it very much.

I have a question. It's a 668.8MB database file. What does actually happen if the query has to scan 300 mb before finding the right answer? Wouldn't it be better to do the work up front and deliver the answers as static json files? Sure you loose the flexibility of dynamic queries, but do you really have that flexibility in non trivial cases (e.g. 300 mb search)?


If your statistic or whatever can be precomputed, you can precompute it and put it in db table rather than compute it each time by reading the 300MB.


I am a huge fan of sqlite and will definitely be giving this a shot in the near future.

Incredible work.

I see myself using this in conjunction with a conventionally hosted pg db for dynamic content.


Using HTTP range requests like this is just SO clever.



I'm also impressed with the implementation for the blog itself. I love this sort of clean, version-tracked sort of implementation.


Solution works really well to databases which will not be updated frequently, like a standalone site.

Although one should be aware of one very important git behavior - git does not diff binary files (like SQLite dbs). That means 2 things:

1. Each db update will generate a new file in git, maintaining the whole old file in history, instead of the diff in bytes. This will accumulate a lot of clutter in the repo

2. As git does not diff binaries, there is a very small risk of corruption (especially if you work in multiple OSs, because of CRLF)

Ref - https://robinwinslow.uk/dont-ever-commit-binary-files-to-git


Git does actually diff binaries and stores them very efficiently :) If you do a single small UPDATE in your db file git will only store the changed information. It's just kinda slow, and for most binary files the effort git spends to try and compute deltas for binary files is useless - which is why it has a bad reputation for binary files.

Note that the diffs that git shows you are completely unrelated to the deltas it uses to compress it's database - which are always "binary deltas" and not line-based diffs.

Also I'm not sure why you mean that db corruption possibility has something to do with whether or not it stores diffs?


You can just store the database in text format (e.g. csv) in the git and turn it to SQLite db when building the website.


Minor nitpick: Git does not store diffs for any file format but always the full file for each version. So it does not really matter that its binary (except for not being able to VIEW the diff, but I guess you could even implement a plugin for that) but just that it's a big file. Even a huge text file would be fully stored per version.

/edit: The sibling comments mentions that git can infact delta compress older commits for storage efficency. But my point was that git commits are not deltas but full snapshots.


This is mostly correct. Git's core object model is snapshots, which can then optionally be compressed. That should be transparent though.


You are only gonna encounter corruption if u either a) messed up the gitconfig for line endings or b) named the database mydbfile.txt


For similar results, you can use SQL with IndexedDB running fully client-side: https://github.com/agershun/alasql

Nice. Reminds me of Bellard’s https://vfsync.org/ . It’s file system pages served over http, to back his jslinux browser VMs


The Hacker News demographic, mostly US and EU based, might be under appreciating how impactful this is because of CDNs. For anyone not in the continent of your central database, this means absurdly faster interactions for data visualization.

I moved from US to Brazil 3 years ago, and I still notice the latency when a site runs their backend only in one location. This cleaver solution makes interacting with the graph supper snappy even compared to enterprises that do have database servers in Brazil. Very impressive!


TL;DR: Compile SQLite to JS with emscripten, implement a virtual fs in JS to stream chunks of a statically hosted (readonly) SQL database.

If queries make use of indices, only a fraction of the database needs to be downloaded.

Also, you can use SQLite to query the DOM.


I can't figure out exactly how it knows which chunk to download. Does it always download the whole index first? Or does it include it in the built JS file itself?


Both the index and table data are btrees. These are trees - the root node sits in some known location (offset) in the file, referenced by the file header and metadata. As SQLite traverses the tree, it encounters new descendents it would like to visit, presumably identified by their byte offset in the file, which is all needed for this VFS magic to issue a suitable range request.

- SQlite opens the file and reads 4kb worth of header -> range request for byte 0-4096

- headers/metadata refers to index table with root node at 8192kb

- user issues SELECT * from index WHERE name = 'foo'

- SQLite reads root node from the file (range request for 8192kb..)

- Root node indicates left branch covers 'foo'. Left branch node at address 12345kb

- Fetch left branch (range request for 12345kb)

- New node contains an index entry for 'foo', row 55 of data page at 919191kb

- SQLite reads data page (range request for 91919191kb..)

etc etc etc


Thanks, I too was struggling to understand how it's able to do such efficient targeted range requests, you explained it nicely.


SQLite has runtime-pluggable VFS support, i.e. you give it a struct with functions for opening a file, reading some bytes, writing some bytes, synchronizing file contents, closing a file. This project provides such a VFS module, that, because it actually runs in the browser, performs HTTP requests to read data. Emscripten provides a way to run a mix of C/C++ code in the same environment as some JavaScript code inside the browser. The reason SQLite has this pluggable VFS support is to properly support embedded systems, different locking APIs, and things like database encryption.

https://www.sqlite.org/vfs.html


That's part of SQLite. It has been optimised to reduce disk reads, because those can be slow on spinning hard drives. Coincidentally, this translates well into an optimised algorithm that minimises the amount of HTTP range requests to make.


B-Tree indexes are designed to work like this, to require a low number of IO operations. The index contains pointers to other places in the index.


So the answer is “yes, it has to download the index first”? None of these comments answer my question.


The B-Tree is a tree that in this case is perfectly balanced. So if you do a query with an index in a database it will fetch an logarithmic amount of data from the index and then a constant amount of data from the table.

For the example the wdi_data table is 300MB and an index on it is 100MB in size. This index has a tree depth of 4 - which means SQLite has to read exactly 4 pages (4KiB) to get to the bottom of it and find the exact position of the actual row data.

you can check the depth of the b-trees with `sqlite3_analyzer`.


Everything in SQLite is stored in B-Trees. Data or indexes. So you don't need to download the whole index first; you only need to download the necessary pages of the trees to access data, whether it's part of an index or actual data


At least the information describing what tables and indices there are and where to find them - and then it gets what it needs once a query is run. Just like sqlite would if running from a local file on disk.


It just needs to download the pages (x KB chunks) to traverse from the header to the particular index and table schema, a fraction of the whole data.


It has to download some parts ("pages") of the index as the query execution proceeds, and some header/schema description pieces of it first before execution starts.


How close are we to hosting searchable torrent indices on IPFS?


I wonder sometimes, what if we focus on producing native JS VM hardware, would it perform same as or better than WASM? Let's run everything on JS.


This hack reminds me of Godel's hacking of an axiomatic mathematical system, to prove and plant his incompleteness theorem;

This is really a great hack: using a system in a unexpected and creative way for which it wan not originally deigned, but which is also very useful and nice.


Wicket stuff, loved it. Not sure if this will ever be “production ready” or merged by sql.JS-like projects but cool proof of concept.


Hmm... Is it possible to use this with two seperate databases, one a remote read only and one a local writable DB that could be re-exported to a static file host? Having just glanced at the code it looks like you would have to load two seperate libraries, one from this project and one from the original sql.js


Searching the Common Crawl does similar byte level access to S3 based on a smaller index of indexes. Really great, actually.


This is awesome. Got my wheels turning :)


Super cool. I wonder if an approach like this could ever work for a writable database. I guess the issue is that you couldn't have fine grained access control without a whole lot of server side validation, at which point you might as well just run a regular database server.


Very clever. I wonder if there are databases optimised for this use case. I can imagine something that always requires indexes to do the queries and stores data in disk in ways to make it easy to fetch only the bits you need.


That would be equivalent to always putting your data in b-trees or other structures, according to the request patterns, without keeping the tables themselves. Sort of how you need to do that in Redis for any kind of sane request strategy other than key-value lookups.


Hm, yeah, I think a key-value store would be easier to implement. I haven't looked at redis for some time now, but last time I did, persistence was done through snapshotting and everything would really be loaded into memory at start time. So that wouldn't work for this use case, where all you can do is serve a static file.

But my question revolves around databases assuming that the disk they access is local or at least a fast network storage. I wonder if there are any databases optimized to access slow storage over low bandwidth, where you're really trying to optimize the amount of data read more than anything else.


Well, every database already optimizes disk access (at least until the recent years with the ‘just add SSDs’ attitude). However, they tend to assume that indexes should be loaded into memory. For this use-case, you'd want a database that can use indexes themselves from disk and treat them like partitioned storage: e.g. when reading data for years 2019 to 2021, only request parts corresponding to that, and not previous years. Dunno whether SQLite can have indexes partially in memory—with its niche of low-end devices and apps, it's quite possible that it can.

Actually, this sort of partial access (i.e. partitioning) is rather easy to implement by addressing separate data files by name, instead of using numeric ranges into a database. Basically just put the data into files named by the years (in my example); or bucket the data into chunks of arbitrary size and use the chunks as files. Elementary to extend this to multiple fields in the index. In short, partitioning based on actual field values can be much easier in the static-http approach than using opaque ranges. Probably also more effective if something like http2 allows requesting several files in one request—since you can avoid requesting too little or too much.


Hi, I'm working on SkySQL, maybe it's useful for you: https://github.com/upshot-tech/SkySQL


stahp, you gonna get gh pages shut down


Very clever! Keep in mind this is a read-only dataset with the really clever bit being the ability to scan through the dataset and support sqlite query syntax.


This is super clever. If you add handling POST somehow as random access writes, you could have a read-write DB hosted pretty much anywhere


This is perfect for my need, been looking for a way to add search to my static site completely free of server. Now I can use sqlite as index.


This certainly does look like an interesting solution, I'd be keen to try it myself. However, just in case you didn't already know about Lunr (https://lunrjs.com/), it is fairly commonly used to implement search on static websites.

e.g. https://squidfunk.github.io/mkdocs-material/

There are of course other similar libraries too.

EDIT: Whoops, just saw a few comments below Lunr is already mentioned.


yeah I have come across Lunr and maybe a couple of other things in my research and I think for blogging it'll work well. What I'm interested in finding out is what works for a larger static site, that won't require you to load the index up-front. I'm also curious about how this sqlite thing picks the correct range to load (haven't looked at the code) and what the worst case might be.


> I'm also curious about how this sqlite thing picks the correct range to load

Indexes are usually btree, it finds what and where the relevant index is (probably from the schema queries), then goes trawling through the b-tree.

That’s how db engines normally work, they don’t liberally go through the entire db contents.

If there’s no index then it finds the start of the table and scans it sequentially (which is exactly what it sounds like).


Doesn't the webserver have to seek from the beginning of the ~600mb file to the range you want, unless the file is in memory?



That doesn't answer the question.

Yes the web server supports range requests. Yes it only returns 50kb.

But what mechanism is letting it scan to just those ranges in the binary file. Doesn't the file system make you seek to the starting block and then read from there?

The point is, while it looks very efficient, there might be a crap ton of IO going on for your little 50kb response.

EDIT: probably the webserver is doing an fseek(), and performance will vary based on file system impl. This is something I will need to benchmark.


For most filesystems it should be no issue to seek directly to a specific offset on a file, it's just like accessing an array, if you know where it starts and the offset you just need to add them and tell the disk to read that address.

>Of course it can’t write to this file, but a read-only database is still very useful.

perhaps read-only should be added to the title


I thought the "static" part of "static file hoster" was clear enough.


This is incredible ! Don't have a use case for this right now but am certainly saving this for later.


This is truly awesome. Very clever.


If you just want static data, using jsroot or jsfive might be a better option.


I think you missed the main part: the dataset is not in memory. This is not SQLite-in-the-browser but using a virtual file system over HTTP. Neither of those alternatives do anything similar.


Same with those! They'll just grab the chunks they need (also with range requests) to do what you want from binary ROOT or (presumably, I haven't actually used jsfive) HDF5 files.


This is pretty awesome. Now if only GitHub Pages did IPv6 as well... :)


This is great but wish it had write ability, not just read only.


Incredibly clever use of paging and http range queries! Well done.


Hello new paradigm.


This is amazing!

I'm also surprised that Github Pages lets you present arbitrary JS to the point where you can upload SQLite as WebAssembly. Isn't this dangerous?


What is dangerous here? Wasm is more crippled in access than normal js. And js is simply in every web page these days. You can't make a static site hosting service and tell people not to use js, that would be very lame.


this is genius. I love it.

This is pretty brilliant.


This is incredibly clever and fun.

That said, all you need to query an SQLite database on the server side is a simple PHP script (as opposed to a running db server), and most static hosting providers offer PHP support.


This seems cool, but doesn't Fossil (from the SQLite author) already do it "out of the box" ?

https://www.fossil-scm.org/home/doc/trunk/www/index.wiki


Fossil runs a server application.


This is certainly a clever hack. But it makes things cheaper and easier for the website owner at the expense of the users. This kind of solution will add to the bloat that is making the web unusable enough that people are seriously proposing a remote browser as a solution for desktop users (remember Mighty the other day?). We shouldn't let the cleverness distract us from this drawback.


Over high latency links this would be virtually unusable. Why not just download the entire database into memory over XHR on page load? SQLite databases of pure data usually aren’t over 10MB in size.


This particular demo page actually makes queries against an almost 700 MB large (fills one CD!) SQLite database. Because the amount of data read is almost negligible (few hundred kB), performance is limited by latency (as you say). However, high-latency links also tend to be slower, so downloading the entire database a-priori would almost always be much slower.

For example, on a 1 megabit/s link with 300 ms RTT, one example would take about 2 seconds for the data transfer itself while spending another 3 seconds or so on waiting. Downloading the entire file would take around an hour and a half.

For your 10 MB database, transferring it as a whole would take 80 seconds. Assuming this solution instead needs to read e.g. 250 kB (taking 2 seconds to transfer), it could still bounce around 250 times to the database before those 10 MB are fully downloaded. (This would be a really odd query, since it would only read on average two pages per read request)


Right but that is an artificially created demo by the author to justify the solution being presented (no offense). The question is how common are ~GB large SQLite databases in the real world relative to databases that are ~MB large?

In my experience SQLite databases of millions of rows of raw tabular data tend to compress very well into dozens of megabytes. Indeed SQLite is often touted as a file format for applications.


I'm running magnetico (https://github.com/boramalper/magnetico) on my VPS. I currently have an index of 1.6M magnet links stored in a 5GB database.

SQLite is most interesting not when the database is small, but when there are very few writes and all you do is reading. You can also look at https://datasette.io/ and see how SQLite is perfect for representing a lot of datasets and querying them


Even if you compress it, you need it in client memory which can also be a dealbreaker. Some people might need it for some projects, I know I did.


> you need it in client memory which can also be a dealbreaker

Most workstations have GBs of available memory. If not you can dump it in indexeddb as a raw data store.

I never disputed that it would be useful for some use cases. I only said it would be unusable with high latency links. If you have a low latency link and aren’t running very complex queries with lots of random seeks, then this should work fine for you.


My team has a few TB of data in SQLite files that are themselves dozens of GB each.

We're using them as a replacement for leveldb's sstables, but with the structure of full SQL. It is highly effective.


Do you think your team’s usage of SQLite is representative of the average SQLite user?


This is the fundamental flaw of 80% thinking. The fact that SQLite continues to reach for more users is what has made it such a successful general-purpose tool.


You didn’t answer the question.

> The fact that SQLite continues to reach for more users is what has made it such a successful general-purpose tool.

I never disputed this. You’re responding to a straw man.


Where has it been suggested that this is the best solution for "the average SQLite user", instead of a tool you can use if it fits your requirements? To take your 10MB number, the article starts by mentioning you can probably just download the entire thing if you aren't above that exact same number.


I made two claims:

> this would be unusable over high latency links.

That is objectively true

> SQLite databases of pure data usually aren’t over 10MB in size.

No one here has refuted this point.

Any other counterargument is addressing a claim I did not make.


It's not anymore unusable over high latency links than most website. Also worth noting that the caching is very smart, so once things are downloaded it's very fast.

But most high latency links are very slow (so downloading large databases is a horrible experience) and (more importantly) are often priced by the size of downloads.


> It's not anymore unusable over high latency links than most website.

That’s false. Not all web applications suffer equally from high latency links. Depends on how reliant the web application is on independent requests. Making one request and receiving a single bulk download is much less bad than making many dependent requests on a high latency link.


That is hilariously wrong for a lot of use cases. I will find this very handy for some SQLite databases I have that are several GBs in size. I am looking right now at using this contribution.


It’s not hilariously wrong or wrong at all that over high latency links this would be virtually unusable.

It’s certainly possible that people are using SQLite databases with sizes on the order of gigabytes but in my experience those are the exception not the rule.


Over high latency links most anything interesting is virtually unusable, so stop using it as a high horse to stand upon.


And the article starts by mentioning that you can download the entire file if it's not too large. And then goes on to present a solution for larger files. What more answer to "Why not just download the entire database" do you expect?


I have been using SQLite databases for a few user application that has been running for close to a decade now. They are usually about 1.5GB.

BTW, SQLite has a (theoretical?) max size of 140TB! (or so I've read)


What do you think is the size of the average SQLite database?


Given their widespread uses on phones, the overall average is likely very small. KBs even.

A more relevant question might be: what would the average size of SQLite databases for web type (or even this specific use case) applications. I don't know, but 10s or 100s of MBs might not be a bad guess.


However, would those 10s or 100s of MBs be something web sites would just like to share in their entirety to the client to query as they wish? At least many commercial services would prefer to keep most of the data secret and even gain insight from the access requests.

But for a more open or a hobbyist project where minimizing amounts of data transfer is less important than minimizing the amount of doing more work (server-side code), then this seems like a decent solution.

It is also worth reminding that this solution only practically works for read-only databases.


> SQLite databases of pure data usually aren’t over 10MB in size

Why do you think this?


The demo here uses a 670MB database file.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: