Hosting SQLite databases on any static file hoster (2021) (phiresky.github.io)
199 points by punnerud on Oct 12, 2022 | 51 comments



A point of great pain for me in the use of HTTP range headers is that they are absolutely incompatible with HTTP compression.

This boils down to an ambiguity found in the early days of HTTP: when you ask for a byte range with compression enabled, are you referring to a range in the uncompressed or compressed stream?

The only reasonable answer, in my opinion: a range of bytes in the uncompressed version, which is then compressed on-the-fly to save bandwidth.

I suspect the controversial point was that originally HTTP gzip encoding was used via pre-compression of the static files. On-the-fly encoding was probably too computationally expensive at the time.

Anyway, the result of this choice made decades ago is currently preventing effective use of ranged HTTP requests. As an example, it is fairly likely that chunks of a SQLite database could be significantly compressed, but both HTTP servers and CDNs refuse to compress "206 Partial Content" replies.

Interestingly, another use case of HTTP byte ranges is video streaming, which is not affected by this problem since video is _already_ compressed and there is no redundancy to remove anymore.
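You can see this in practice with a quick range request (a rough sketch in Python using requests; the URL is just a placeholder):

    import requests

    # Ask for the first 16 KiB of a file while advertising gzip support.
    # Servers/CDNs that honor the Range header typically answer with
    # "206 Partial Content" and no Content-Encoding: the range is taken
    # from the stored bytes and sent uncompressed.
    url = "https://example.com/data.sqlite3"  # placeholder
    resp = requests.get(url, headers={
        "Range": "bytes=0-16383",
        "Accept-Encoding": "gzip",
    })

    print(resp.status_code)                      # typically 206
    print(resp.headers.get("Content-Range"))     # e.g. "bytes 0-16383/104857600"
    print(resp.headers.get("Content-Encoding"))  # typically None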


> This boils down to an ambiguity found in the early days of HTTP

The RFC is clear on this: a different Transfer-Encoding is only for the transfer and does not affect the identity of the resource, while a different Content-Encoding does affect the identity of the resource.

It's not clear who did it first, but either browsers or web servers started using the Content-Encoding header (which means "this file is always compressed", like a tar.gz, and the user agent is not meant to uncompress it) with the meaning of Transfer-Encoding (which means the file itself is not compressed; compression is applied only for the sake of the transfer, and the user agent needs to uncompress it first). This is a violation of the HTTP spec.

This fuckup resulted in big confusion and now requires an annoying number of workarounds:

- E-Tag re-generation [1]

- Ranges and Compression are not usable together

- Browsers' "Resume Download" feature being removed

- RFC-compliant behavior being treated as a bug [2]

- Some corporate proxies are still unpacking tarballs on the fly, breaking checksum verification

Contrast this with email, which has its encodings correctly sorted out while using the same RFC 822-style encoding technique as HTTP.

[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=39727

[2] https://serverfault.com/questions/915171/apache-server-cause...

tl;dr: the HTTP standard is not being implemented correctly
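To illustrate the intended split, here is a minimal sketch of the two header sets as the spec defines them (illustrative values only, not tied to any particular server):

    # Content-Encoding describes the resource itself: the entity *is* a
    # gzip file, and ETags/ranges apply to those compressed bytes.
    content_encoding_response = {
        "Content-Type": "application/x-tar",
        "Content-Encoding": "gzip",   # the resource is stored compressed (a .tar.gz)
        "ETag": '"abc123"',           # identifies the compressed representation
    }

    # Transfer-Encoding is hop-by-hop: compression exists only on the wire,
    # and the recipient undoes it before interpreting ranges, ETags, etc.
    transfer_encoding_response = {
        "Content-Type": "application/x-tar",
        "Transfer-Encoding": "gzip",  # applied only for this transfer
        "ETag": '"abc123"',           # identifies the uncompressed resource
    }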


I think another issue with compressing partial content is that those compressed responses cannot be cached efficiently - it would have to be compressed on-the-fly for every range requested. And while compression is not as computationally expensive as it used to be, it does still add overhead that could be more than the bandwidth overhead of uncompressed data.

There should be a workaround for this using a custom pre-compression scheme, instead of relying on HTTP for the compression. Blocks within the file will have to be compressed separately, and you'll need some kind of index mapping uncompressed block offsets to compressed block offsets.

Unfortunately there doesn't seem to be a common off-the-shelf compression format that does this, and it means you can't just use a standard SQLite file anymore. But it's definitely possible.
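A rough sketch of what such a scheme could look like (a hypothetical format, using zlib with a fixed uncompressed block size; on a static host the seek-and-read over the compressed file would become an HTTP range request computed from the same index):

    import json
    import zlib

    BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (arbitrary choice)

    def compress_file(src_path, dst_path, index_path):
        """Compress src in independent blocks and write an offset index."""
        offsets = []  # offsets[i] = start of compressed block i in dst
        pos = 0
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while True:
                block = src.read(BLOCK_SIZE)
                if not block:
                    break
                compressed = zlib.compress(block)
                offsets.append(pos)
                dst.write(compressed)
                pos += len(compressed)
            offsets.append(pos)  # sentinel: total compressed size
        with open(index_path, "w") as f:
            json.dump({"block_size": BLOCK_SIZE, "offsets": offsets}, f)

    def read_range(dst_path, index, start, length):
        """Read an uncompressed range by decompressing only the blocks it touches."""
        block_size, offsets = index["block_size"], index["offsets"]
        first = start // block_size
        last = (start + length - 1) // block_size
        out = bytearray()
        with open(dst_path, "rb") as f:
            for i in range(first, last + 1):
                f.seek(offsets[i])
                out += zlib.decompress(f.read(offsets[i + 1] - offsets[i]))
        skip = start - first * block_size
        return bytes(out[skip:skip + length])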


Update: This is an example of a compression format that allows random access to 64KB chunks, and is compatible with gzip: https://manpages.ubuntu.com/manpages/kinetic/en/man1/dictzip...


I only partially buy the caching argument. It applies to any ranged request, not just compressed ones, and there are plenty of ways for origin servers to describe what and how to cache content. The origin may explicitly allow caching only when the access patterns are likely to repeat.


Why do you need HTTP compression, when all of SQLite is compressed? Will it give any value?

I would think HTTP/2 is of greater value, to save on TCP round-trip time.


SQLite isn't compressed. There is a commercial offering from the creators of SQLite that adds compression, but it's rarely seen in the wild, to be honest.


SQLite isn't compressed by default, but there are extensions like https://phiresky.github.io/blog/2022/sqlite-zstd/ that offer compression. It would be interesting to see how well that extension works with this project.
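For anyone curious, loading an extension like that from Python uses the standard extension-loading API (the library path below is an assumption; it depends on how and where you built the extension):

    import sqlite3

    conn = sqlite3.connect("data.db")
    conn.enable_load_extension(True)         # requires a Python build with extension support
    conn.load_extension("./libsqlite_zstd")  # hypothetical path to the compiled extension
    conn.enable_load_extension(False)
    # How tables are then marked for compression is specific to the extension;
    # see its documentation for the functions it actually exposes.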


I do not think SQLite compresses data by default. Can you point to any documentation suggesting otherwise?


"GitHub Pages" is presumably in the title as a hook, but in TFA the author notes that this will work for any statically-hosted website.


Not any statically-hosted website; it has to support byte-range requests (via the Range header).

This is normally used to pause/resume video, resume downloading files, etc.


Good note. Byte-range requests are supported by default for static content on all web servers, so to your point, this won't work if you've explicitly turned it off.
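A quick way to check a given host (sketch; placeholder URL):

    import requests

    url = "https://example.com/data.sqlite3"  # placeholder

    head = requests.head(url)
    print(head.headers.get("Accept-Ranges"))  # "bytes" if ranges are advertised

    # More reliable: request a small range and see whether you get a 206 back.
    resp = requests.get(url, headers={"Range": "bytes=0-99"})
    print(resp.status_code)   # 206 = range honored; 200 = whole file returned
    print(len(resp.content))  # 100 if the range was honored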


However, if you check the AFT, the author notes that GitHub is strictly required, so I don't know what to think.


Didn't the author mention it uses WASM and Emscripten? Both of which can be run/compiled in browsers.

I think it's possible everywhere, but you'd just need to do more than copy-paste the author's work.


My entire comment was fabricated. I was making fun of the fact that the other user was using an obscure acronym.


"TFA" is a popular initialism of "the fucking article"/"the fine article", with a long history alongside "RTFM", "RTFA", etc.


I don't think it's as popular as you think it is. Plus it's used by a bunch of other crap:

https://hn.algolia.com/?q=TFA

so it's probably worth just spelling the thing out.


> I don't think it's as popular as you think it is.

It is here, so you're one of today's lucky 10,000! https://xkcd.com/1053/

Also, this is the search you wanted: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu... (12,797 results as I type this).


Prior discussion (May 2, 2021):

https://news.ycombinator.com/item?id=27016630


It's funny, I was wondering whether this replicated Simon Willison's work on Datasette [1], but I guess I got things confused because Simon himself was the first to comment on that older post about how clever this was.

[1] https://simonwillison.net/2018/Oct/4/datasette-ideas/


There's a web-browser version of Datasette called datasette-lite, which runs on Python ported to WASM with Pyodide and can load SQLite databases.

I grafted the enhanced lazyFile implementation from Emscripten, and then from this implementation, onto datasette-lite relatively recently as a curious test. Threw an 18 GB CSV from CA's unclaimed property records here

https://www.sco.ca.gov/upd_download_property_records.html

into an FTS5 SQLite database, which came out to about 28 GB after processing:

POC, non-merging Log/Draft PR for the hack:

https://github.com/simonw/datasette-lite/pull/49

You can run queries through datasette-lite if you URL-hack into it and just get to the query dialog; browsing is kind of a dud at the moment, since Datasette runs a count(*), which downloads everything.

Example: https://datasette-lite-lab.mindflakes.com/index.html?url=htt...

Still, not bad for a $0.42/mo hostable, cached, CDN'd read-only database. It's on Cloudflare R2, so there are no bandwidth costs.
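For reference, turning a big CSV into an FTS5 database is roughly this (a sketch; the column names are made up and would need to match the real records):

    import csv
    import sqlite3

    conn = sqlite3.connect("unclaimed_property.db")
    # Hypothetical columns; the real CSV has its own schema.
    conn.execute(
        "CREATE VIRTUAL TABLE property USING fts5(owner_name, property_type, cash_reported)"
    )

    with open("property_records.csv", newline="") as f:
        reader = csv.DictReader(f)
        rows = ((r["OWNER_NAME"], r["PROPERTY_TYPE"], r["CASH_REPORTED"]) for r in reader)
        conn.executemany("INSERT INTO property VALUES (?, ?, ?)", rows)

    conn.commit()
    conn.close()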


Amazing. Runs smoothly. Elon might not like the real estate example :). By the way, the R2 service is in beta, right?



The author mentioned that running a web server risks it going down because they forgot to pay or something, and then uses "free" services instead.

Unlike paying, which you can easily set to auto-pay, free services can go up and down whenever they wish. In fact, it's probably in their terms of service. The more generous the free service, the more likely it is to get cut down and become unusable later, or just more expensive than the alternatives. Like Heroku's free tier.

Paying money means there's an actual incentive and legal contract for the company to provide the service you paid for.


Static sites benefit from being low-maintenance and highly-portable. You can easily switch from generous free services (Cloudflare Pages, Netlify, Vercel, Firebase, etc.) to highly-available paid services (Amazon S3, Cloudflare R2, Backblaze B2, Google Cloud Storage, etc.) Keeping a VM running and updated is more work (and sometimes higher cost).


(Read only)


In theory it should be possible to write back to static storage, but you'd have to deal with authentication and file management, caching writes, consistency/atomicity, etc. It'd be a lot of work.


Perhaps you could submit updates to the database as diffs and commit them via normal git channels


Which is still useful for querying a large database without having to download the entire thing.


Should be included in the title then.


Seafowl (https://www.splitgraph.com/blog/seafowl), which was only launched a couple of days ago, is a database that can be hosted statically


Hey, I co-built Seafowl, thanks for the plug!

To clarify, Seafowl itself can't be hosted statically (it's a Rust server-side application), but it works well for statically hosted pages. It's basically designed to run analytical SQL queries over HTTP, with CDN/Varnish caching of SQL query results. The web page's user downloads just the query result, rather than the required fragments of the database (which, if you're running aggregation queries, might mean scanning through a large part of it).


This is a cool project, and really well presented.

I've looked at doing something similar in the past, trying to query a large dataset through ranged lookups on a static file server. I ran into the issue of some hosters not supporting ranged lookups, or the CDNs I was using having behaviour like only fetching from the origin data source in 8 MB blocks, meaning there was quite a lot of latency when doing lots of small reads across a massive file.

It would have been interesting to find out a bit more about these topics, and see a bit more on performance.


This can potentially become very interesting: if all these pages are pushed to the edge via a CDN, then browsers can essentially have GraphQL built on top of SQLite on the client side.


This doesn't seem to work for me. Every time I run the SQL, the output is

[error: TypeError: /blog/_next/static/media/sqlite.worker.39534d39.js is not a valid URL.]


Reminder that Mozilla killed off SQLite in the browser to push IndexedDB, a much more limited database that they then went and implemented in Firefox with... SQLite.


That's a bit misleading. IndexedDB is a standard maintained by the World Wide Web Consortium (W3C) and is supported by all major browsers. Web SQL Database was a prior API developed by Apple. But Firefox refused to add support for it and argued against it becoming a standard because it would codify the quirks of SQLite. It was thus deprecated in favor of IndexedDB.


Which was the right choice; web standards shouldn't be tied to a single project.


Usually standards have holes, so it is necessary to ask what happens when a competing implementation notices a hole in the current standard. In that case it is better to have a reference implementation for it to match, so that programs which depend on the standard continue to work.

Therefore, tying a standard to a reference implementation, especially one as well regarded as SQLite, might not have been a bad outcome overall.


It's an open source project; they could've forked it and got their wish to keep the API unaffected by any changes to SQLite. This blog post [1], this amazing blog post, gives us all the insight needed to know this was going to go badly. The bit about whiteboarding a replacement: is that better or worse than believing JavaScript is really a replacement for SQL? Or that IndexedDB is somehow elegant? I can't tell; the whole thing makes me cringe.

[1] https://hacks.mozilla.org/2010/06/beyond-html5-database-apis...


Making a standard for WebSQL would have been the path of least resistance.


SQLite is a massive library, with a lot of functionality and optional extensions, and behaviour is added or changed over versions. Codifying all the behaviour into a standard would be a massive effort. And then you're either stuck on that version of SQLite, or your standard depends on what an external project is doing with new releases - which is really not what you want for a web standard.

Alternatively, you need to pick a small stable subset of SQLite functionality to expose. But that would mean you can't just use plain SQLite anymore - you either need to modify it significantly, or need to wrap it somehow. Which either way adds a lot of overhead again.


This is SQLite compiled to WebAssembly along with a storage layer ("virtual filesystem") that fetches database pages from a web server. So you have the full power of SQLite, and the page doesn't have to download unnecessary data from the server (like the entire dataset just to show you one row). It's a super nifty hack. SQLite built into the browser is somewhat orthogonal to this concern; it didn't allow you to store database pages on the server, and this one doesn't let you write to it.
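The core trick is small enough to sketch outside the browser: read the page size from the SQLite header, then fetch any page with a range request (placeholder URL; real implementations also cache and batch these reads):

    import struct
    import requests

    url = "https://example.com/data.sqlite3"  # placeholder

    def fetch(start, length):
        """Fetch a byte range of the remote database file."""
        headers = {"Range": f"bytes={start}-{start + length - 1}"}
        return requests.get(url, headers=headers).content

    # Bytes 16-17 of the 100-byte SQLite header hold the page size (big-endian);
    # the special value 1 means 65536.
    header = fetch(0, 100)
    page_size = struct.unpack(">H", header[16:18])[0]
    if page_size == 1:
        page_size = 65536

    def read_page(n):
        """Pages are numbered from 1; page n starts at (n - 1) * page_size."""
        return fetch((n - 1) * page_size, page_size)

    root = read_page(1)  # page 1 holds the header plus the schema table's root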


I hate indexedDB as much as almost anyone, but this is not an accurate characterization of what happened. People wanted to ship 'just SQLite' and magically anoint an open source library as a standard. There are lots of problems with that.


Still, it's fair to ask: given the world we have now, where there's kinda just "nothing" / broken and crappy implementations, would the alternative world in which SQLite was anointed have been better head-to-head? I think maybe yes, it would have?

The perfect defeated the good?


Maybe it could have been better. But aren't we talking about decisions that happened more than 10 years ago?

WebSQL was created in 2010 and IndexedDB work also started in 2010 - or around those years.

Sqlite didn’t have this amazing reputation it has now back then and decision to have Indexed DB as a standard looked good.

At least that is what I remember just reading about it from 2010-2013.

In retrospect it's easier to say that the alternative would probably have been better. But I don't think it looked that way a decade ago.


The new origin-private filesystem allows random-access reads and writes. This should make it possible to use WASM SQLite in the browser with local persistence - everything that WebSQL had.

A major advantage over WebSQL is that the developer is in complete control. The developer can pick the latest version of SQLite, enable any custom extensions (standard or additional ones), or use SQLCipher for encryption, instead of just using whatever the browser provides.

The only real disadvantage is the download size of the wasm code, but it's small enough that it won't be a blocker for sizeable interactive web apps. Just don't do this for your static blog.


Just write a server; don't bastardize GitHub Pages.

Why do people bend over backwards to avoid using a server for anything? It's literally 11 lines of code:

https://go.dev/play/p/vUo6wKni3_A


GitHub Pages is free forever, fast, and you don't need to think about software updates or CPUs or anything. And it's just a fun project.


>free forever

Like Geocities?


"Writing" a server is the easy bit. It's running it and letting people connect that is the problem. Also scaling to spikes in load etc.

This isn't just about github pages, it will work on most static hosting systems like Amazon S3 etc.



