Hosting SQLite Databases on GitHub Pages (phiresky.netlify.app)
567 points by isnotchicago on July 31, 2021 | 77 comments



Here's the original post: https://news.ycombinator.com/item?id=27016630. This guy is a genius.


I get this error on Firefox 90.0.2 on Debian 10. It works in Chrome, though.

[error: RuntimeError: abort(Error: Couldn't load https://phiresky.netlify.app/world-development-indicators-sq.... Status: 0). Build with -s ASSERTIONS=1 for more info.]

Other than that, it's pretty awesome and exactly what I was hoping for.


Looks like Netlify changed something since I wrote this article regarding what headers they send. Detecting support for Range-requests is kinda tricky and relies on heuristics [1]. Not sure why it still works in Chrome though.

You can go to this version of my blog, it should work there:

https://phiresky.github.io/blog/2021/hosting-sqlite-database...

Maybe the link could be updated?

Except for the DOM demo, since that needs some cross-origin isolation headers you can't set on GitHub Pages (that's the reason I mirrored it to Netlify originally)

[1] https://github.com/phiresky/sql.js-httpvfs/issues/13


Maybe because Firefox is stricter? Netlify (incorrectly) sends a 4,583-byte chunk, while GitHub (correctly) sends precisely 1024 bytes. Chrome might just trim it to 1024 bytes, whereas Firefox fails safe on the mismatch.

Edit: Netlify is indeed wrong: it asserts in the headers that the content length is 1024 bytes but sends 4,583 bytes of content. That will definitely fail in Firefox.


And it doesn't seem like it'd be easy to get Netlify to fix this type of behavior, unfortunately: https://answers.netlify.com/t/add-unpacked-content-length-he...


Don't rely on browser implementation details and a hope that they won't break in the future. Add a small supplementary file to your published data which has a known pattern at a fixed offset, then make a request for that offset and check the response.
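
Something like this, as a rough sketch (the probe file name, offset, and marker bytes are all placeholders you'd pick when publishing the data):

    // Probe whether the host serves HTTP Range requests correctly.
    // "probe.bin" is a hypothetical file published alongside the database,
    // with a known marker written at a known offset.
    async function rangeRequestsWork(url: string): Promise<boolean> {
      const offset = 4096;          // fixed offset chosen when publishing
      const expected = "RANGE-OK";  // 8-byte marker stored at that offset
      const resp = await fetch(url, {
        headers: { Range: `bytes=${offset}-${offset + expected.length - 1}` },
      });
      // A server that honours the range replies 206 Partial Content with
      // exactly the requested number of bytes.
      if (resp.status !== 206) return false;
      const body = new Uint8Array(await resp.arrayBuffer());
      if (body.length !== expected.length) return false;
      return new TextDecoder().decode(body) === expected;
    }

    // e.g. fall back to downloading the whole database if this returns false:
    // const ok = await rangeRequestsWork("/probe.bin");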


Browsers actually implement range-requests correctly, what Netlify has done is advertise support for it but send data incorrectly.


That doesn't change the thrust of the comment. If you're trying to work around spotty support and detecting the feature is "tricky", then change the program so it performs a small power-on self-test against a known dataset.

(And what's the point of crafting a comment in this tone? Is it supposed to be a retort? Whether or not Netlify is doing the wrong thing, if it works in Chrome, but not in Firefox, then that's a materially relevant fact. Don't rely on implementation details and a hope that they won't break in the future.)


Because both Chromium and Gecko follow the IETF (RFC)/W3C specs here; what Netlify is doing is plainly out of spec. What Chromium and Gecko do in that case is an implementation detail the spec explicitly leaves open: "okay, if you encounter a stupid server that somehow explicitly advertises range-request support but does it incorrectly, you can do pretty much anything and you're still compliant." Drop the request (like Firefox)? Compliant. Silently trim (like Chrome)? Still compliant. Return zeroes matching the announced length? Yes, still compliant, even if you think that's stupid. This is easily verifiable by opening your favourite browser's devtools, or even Fiddler. Yes, the library uses heuristics, but Netlify announces support for range requests, so there are no heuristics to apply here; either Netlify removes the header announcing support or Netlify fixes this problem, and then we can talk about heuristics.


A bit late for the original discussion, but in case it helps future readers, I've filed https://github.com/whatwg/fetch/issues/1295 to see if we can get the spec and browsers aligned. (My guess is we will update the spec to allow this and Firefox will update to align with the new spec, since the direction of specs over time is always toward more-lenient.)


That's one lovely trick.

If I may suggest one thing... instead of range requests on a single huge file, how about splitting the file into 1-page fragments in separate files and fetching them individually? This buys you caching (e.g. on CDNs) and compression (which you could also perform ahead of time), both things that are somewhat tricky with a single giant file and range requests.

With the reduction in size you get from compression, you can also use larger pages almost for free, potentially further decreasing the number of roundtrips.

There's also a bunch of other things that could be tried later, like using a custom dictionary for compressing the individual pages.
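
For the splitting step itself, a rough Node sketch (the page size, paths, and per-page URL scheme are assumptions, not anything from the article):

    // split.ts: chop a SQLite file into page-sized fragments for static hosting.
    // The page size is assumed to match the database's page size (e.g. 4096).
    import { promises as fs } from "fs";

    async function splitIntoPages(dbPath: string, outDir: string, pageSize = 4096) {
      const data = await fs.readFile(dbPath);
      await fs.mkdir(outDir, { recursive: true });
      for (let i = 0; i * pageSize < data.length; i++) {
        // Each fragment can be compressed and cached individually by a CDN;
        // the client then requests pages/<n> instead of a byte range.
        await fs.writeFile(`${outDir}/${i}`, data.subarray(i * pageSize, (i + 1) * pageSize));
      }
    }

    splitIntoPages("db.sqlite3", "pages").catch(console.error);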


> If I may suggest one thing... instead of range requests on a single huge file how about splitting the file in 1-page fragments in separate files and fetching them individually?

Edgesearch does that, though with Cloudflare Workers mediating searches: https://github.com/wilsonzlin/edgesearch

It uses Roaring Bitmaps to index, but could also use Stavros' trick (bloom/cuckoo filters) to further gate false positives on the client: https://news.ycombinator.com/item?id=23473365


I think that's the core innovation here, smart HTTP block storage.

I wonder if there has been any research into optimizing all HTTP range requests at the client level in a similar way, i.e. considering the history of requests to a particular URL and doing the same predictive exponential requests, or grabbing the full file asynchronously past a certain point.
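
Roughly the kind of policy I mean, as a sketch of a generic fetch wrapper (not the article's actual implementation, which uses three virtual read heads):

    // Read-ahead policy for HTTP range reads: if the current read starts where
    // the previous one ended, assume a sequential scan and double the amount
    // fetched, so a scan of N bytes costs O(log N) requests.
    class ReadAheadFetcher {
      private lastEnd = -1;
      private chunkSize: number;

      constructor(
        private url: string,
        private minChunk = 1024,
        private maxChunk = 1 << 20,
      ) {
        this.chunkSize = minChunk;
      }

      async read(offset: number, length: number): Promise<Uint8Array> {
        // Sequential access: grow the request size. Random access: reset it.
        this.chunkSize =
          offset === this.lastEnd
            ? Math.min(this.chunkSize * 2, this.maxChunk)
            : this.minChunk;
        const fetchLen = Math.max(length, this.chunkSize);
        const resp = await fetch(this.url, {
          headers: { Range: `bytes=${offset}-${offset + fetchLen - 1}` },
        });
        const buf = new Uint8Array(await resp.arrayBuffer());
        this.lastEnd = offset + buf.length;
        // The caller only asked for `length` bytes; the surplus would go into
        // a page cache in a real implementation.
        return buf.subarray(0, length);
      }
    }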


> Methods for remotely accessing/paging data in from a client when a complete download of the dataset is unnecessary:

> - Query e.g. parquet on e.g. GitHub with DuckDB: duckdb/test_parquet_remote.test https://github.com/duckdb/duckdb/blob/6c7c9805fdf1604039ebed...

> - Query sqlite on e.g. GitHub with SQLite: [Hosting SQLite databases on Github Pages - (or any static file hoster) - phiresky's blog](...)

>> The above query should do 10-20 GET requests, fetching a total of 130 - 270KiB, depending on if you ran the above demos as well. Note that it only has to do 20 requests and not 270 (as would be expected when fetching 270 KiB with 1 KiB at a time). That’s because I implemented a pre-fetching system that tries to detect access patterns through three separate virtual read heads and exponentially increases the request size for sequential reads. This means that index scans or table scans reading more than a few KiB of data will only cause a number of requests that is logarithmic in the total byte length of the scan. You can see the effect of this by looking at the “Access pattern” column in the page read log above.

> - bittorrent/sqltorrent https://github.com/bittorrent/sqltorrent

>> Sqltorrent is a custom VFS for sqlite which allows applications to query an sqlite database contained within a torrent. Queries can be processed immediately after the database has been opened, even though the database file is still being downloaded. Pieces of the file which are required to complete a query are prioritized so that queries complete reasonably quickly even if only a small fraction of the whole database has been downloaded.

>> […] Creating torrents: Sqltorrent currently only supports torrents containing a single sqlite database file. For efficiency the piece size of the torrent should be kept fairly small, around 32KB. It is also recommended to set the page size equal to the piece size when creating the sqlite database

Would BitTorrent be faster over HTTP/3 (UDP) or is that already a thing for web seeding?

> - https://web.dev/file-system-access/

> The File System Access API: simplifying access to local files: The File System Access API allows web apps to read or save changes directly to files and folders on the user’s device

Hadn't seen wilsonzlin/edgesearch, thx:

> Serverless full-text search with Cloudflare Workers, WebAssembly, and Roaring Bitmaps https://github.com/wilsonzlin/edgesearch

>> How it works: Edgesearch builds a reverse index by mapping terms to a compressed bit set (using Roaring Bitmaps) of IDs of documents containing the term, and creates a custom worker script and data to upload to Cloudflare Workers


> Would BitTorrent be faster over HTTP/3 (UDP) or is that already a thing for web seeding?

The BT protocol itself runs on both TCP and UDP, but it has preferred the UDP variant for many years already.


Thanks. There likely are relative advantages to HTTP/3 QUIC. Here's this from Wikipedia:

> Both HTTP/1.1 and HTTP/2 use TCP as their transport. HTTP/3 uses QUIC, a transport layer network protocol which uses user space congestion control over the User Datagram Protocol (UDP). The switch to QUIC aims to fix a major problem of HTTP/2 called "head-of-line blocking": because the parallel nature of HTTP/2's multiplexing is not visible to TCP's loss recovery mechanisms, a lost or reordered packet causes all active transactions to experience a stall regardless of whether that transaction was impacted by the lost packet. Because QUIC provides native multiplexing, lost packets only impact the streams where data has been lost.

And HTTP pipelining / multiplexing isn't exclusive to UDP or QUIC:

> HTTP/1.1 specification requires servers to respond to pipelined requests correctly, sending back non-pipelined but valid responses even if server does not support HTTP pipelining. Despite this requirement, many legacy HTTP/1.1 servers do not support pipelining correctly, forcing most HTTP clients to not use HTTP pipelining in practice.

> [Time diagram of non-pipelined vs. pipelined connection] The technique was superseded by multiplexing via HTTP/2,[2] which is supported by most modern browsers.[3]

> In HTTP/3, the multiplexing is accomplished through the new underlying QUIC transport protocol, which replaces TCP. This further reduces loading time, as there is no head-of-line blocking anymore.

https://en.wikipedia.org/wiki/HTTP_pipelining


SQLite not being in browsers, instead of IndexedDB, still saddens me today.

I designed a system 15 years ago that released dimensional star schemas for specific reports as SQLite databases into Adobe AIR (or whatever the prerelease name was) for a large retailer in the UK. We would query the data warehouse and build the SQLite DB file (I can't remember the exact DB sizes, but they weren't too big, 15 MB or so), and the computational performance we got from using SQLite as our crunching layer when building the dimensional reports with drilldown was just astounding.


You actually can use SQLite on some browsers with WebSQL today. Unfortunately, WebSQL has been deprecated and is already removed from some browsers (https://softwareengineering.stackexchange.com/questions/2202...).


Oh man, that's a shame. I remember looking into WebSQL when I was designing a little Phonegap tourist app, and ended up just using localstorage instead, like this guy, but I tucked it away as a possible technology for some use case I never quite got around to. SQLite is great and it doesn't make any sense to deprecate WebSQL in favor of some key/value "nosql db" pattern, since there are already plenty of other alternatives for storing nonrelational data.


It was deprecated because it was difficult to write a standard spec for the existing SQLite code (but a key value system is much easier to specify as there is no SQL language).


Yeah but a key value system lacks all the really good things about SQL. And locally with sqlite you don't really have to worry about latency, so you should be able to just get atomicity and consistency on the thread. This shouldn't be a lot to ask from an embedded web DB. I think as with other standards bickering, ten years from now something (Canvas API) will come out that more or less replicates the technology that was already standard ten years ago (Flash graphics), with lots of people cheering for it as if someone just invented sliced bread.


An additional issue was that to become a standard, there would have to be at least two different proof-of-concept implementations. That would mean someone would have to write a SQLite-compatible database that is not SQLite.


If it hasn't been done by the "rewrite it in Rust" crowd yet, it probably will eventually.


>Yeah but a key value system lacks all the really good things about SQL

exactly

>And locally with sqlite you don't really have to worry about latency, so you should be able to just get atomicity and consistency on the thread

For high-performance reads you don't need much more.

> I think as with other standards bickering, ten years from now something (Canvas API) will come out that more or less replicates the technology that was already standard ten years ago (Flash graphics), with lots of people cheering for it as if someone just invented sliced bread.

Flash and ActionScript revolutionized the web, and a lot of the web tooling we have now stems from the desire to compete with what Flash had made possible since 2003.


It's really only the write / ACID features of WebSQL that are interesting to me, though. There are lots of ways to get a high performance read on key/value pairs if you've already downloaded all the data. We don't need a new methodology for that. What's interesting is the ability for web app storage to respect foreign key constraints and transactions within a local data model, without having to rely on rolling your own middleware to resolve inconsistencies.


> Yeah but a key value system lacks all the really good things about SQL.

IndexedDB lacks native SQL support, but there is nothing about key-value systems in general that precludes the use of SQL. CockroachDB comes to mind as a SQL database built on top of a key-value store. SQL support could be built atop IndexedDB by a willing developer. JsStore is in that vein, although its approach is a little different.
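
For what it's worth, the key-value primitive such a layer would sit on is pretty small; a sketch with made-up store and index names:

    // Minimal IndexedDB primitives a SQL layer could sit on: a keyed object
    // store plus a secondary index, enough for point lookups and simple
    // "WHERE col = ?" style queries.
    function openDb(): Promise<IDBDatabase> {
      return new Promise((resolve, reject) => {
        const req = indexedDB.open("app-db", 1);
        req.onupgradeneeded = () => {
          const store = req.result.createObjectStore("users", { keyPath: "id" });
          store.createIndex("by_country", "country"); // secondary index on a "column"
        };
        req.onsuccess = () => resolve(req.result);
        req.onerror = () => reject(req.error);
      });
    }

    // Roughly equivalent to: SELECT * FROM users WHERE country = ?
    async function usersByCountry(country: string): Promise<unknown[]> {
      const db = await openDb();
      return new Promise<unknown[]>((resolve, reject) => {
        const req = db
          .transaction("users", "readonly")
          .objectStore("users")
          .index("by_country")
          .getAll(country);
        req.onsuccess = () => resolve(req.result);
        req.onerror = () => reject(req.error);
      });
    }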


Whoever that asshole was, he's received sooo much undeserved praise. There is pretty much nothing good about sliced bread.


SQLite was not designed for arbitrary execution with control over queries and internal state. Someone demonstrated that you could redirect the pointer addresses for various callbacks and potentially exploit it. The current solution is compiling SQLite to WebAssembly. Though I am certainly saddened that browsers don't natively have some sort of web-made equivalent.


I've been thinking about inventing exactly that for the last week or so.


Huh, this is a funny one. I had this idea a long time ago when doing some napkin design of a "static wiki". The problem was that the querying didn't fit how software optimizes content delivery, so millions of people requesting from a single database would most likely be difficult to serve performantly. Secondly, writing to said database would of course be impossible because of locking, and you'd need a server anyway to do any sort of session-based submission of data.

Very nice for read-only static data sets for small sites, though. In fact this may be very useful for county mapping systems, converting the GIS data over to tables in SQLite.

If at all possible it would be better if this could be in ES5 (no async/await) JavaScript; otherwise only very, very modern browsers are going to be able to access it. People with older phones (of which there are many) wouldn't be able to use it at all.


> People with older phones (which is many) wouldn't be able to use it at all.

Caniuse.com does not agree with this assertion.


Web statistics aren't a good reflection of every use case. They assume your users look like the general browser-using population, when often the people who are using an important service, namely a government service, are not the people using other services.


Just transpile it


I can't fully put my finger on why exactly, but I feel that this is a transformative idea. What's to stop me from emulating a private SQLite DB for every user of a web app, and using that instead of GraphQL?


I have software deployed that depends on locally run (MySQL) DBs sitting on user/retail-outlet machines, and I can think of a few reasons: big data downloads; modifying anything about the data structure can cause breakage between the middleware and the DB; and you're basically distributing a data model, along with what should normally be server-side code, to the client. Conceptually it's great, but for large data I'd hate to be paying for the bandwidth if it were public. And if it's not just a web page, you need a way to update it on every platform. Also, there are always things in a web app that simply cannot be client-side because they'd pose a security threat.

The concept reminds me a little bit of CD-ROM multimedia, in that it's so self-contained. For something like that it's great.


Nothing is stopping you from doing that right now with localStorage or IndexedDB. The issue is that the browser cannot be trusted to keep that data, or at least these APIs aren't designed for long-term persistent storage. If we could solve this problem, we could go a long way towards some level of decentralisation. On the other hand, which is more secure, your service or the user's machine? So there's a lot to consider.


>> If we could solve this problem, we could go a long way towards some level of decentralisation

I've heard variations of this batted around recently. Specifically, if we could allow web apps out of the sandbox so they could work like native apps, or have more access to the file system, we could maybe work our way out of the walled gardens and into totally distributed storage / processing / hosting. And it's true, it's just that...

>> On the other hand, which is more secure? Your service or the user's machine.

This.


Wouldn't such a model limit the user to just one device? Usually there's no sync of localStorage across devices.


If I'm understanding your scenario, I think PouchDB and CouchDB kind of address this concern, but for IndexedDB.


IMHO this reduces the need for a full-fledged database serving read-only information at scale. The restriction before was that a SQLite file had to live on a single server. Now, with SQLite on S3, you could write a set of scalable web services on top of the file on S3 and scale those services as much as needed.


Nothing, I'm sure, but the idea behind GraphQL is that you can query all of the different backend services you have in a single request, reducing the network latency associated with firing off requests to all those services individually. It would seem that an emulated SQLite database would bring you right back to having to perform multiple network requests, assuming your data needs are more complex than a single relation. Under normal usage, SQLite avoids the N+1 problem by not having IPC overhead, but that wouldn't apply here.


Yes, I was just thinking the same thing. It would be very interesting to skip GraphQL altogether using this alongside RESTful APIs.


You don't own the database. You can't be sure it's not being tampered with, and joining any other users' data together still requires your own backend. You also can't promise data won't be lost.

As another reply said, this could be useful for data-intensive readonly applications.


How about running GraphQL on top of it?


Good writeup, thanks!

All the code snippets, when run, give me the following error message:

[error: RuntimeError: abort(Error: server uses gzip or doesn't have length). Build with -s ASSERTIONS=1 for more info. (evaluating 'new WebAssembly.RuntimeError(e)')]

Could that be a Mobile Safari thing?


Same error on Firefox 90.0.2 desktop.


Same error on mobile chrome on my android


Same error on mobile Firefox, but mobile Chrome works for me (Android).


Same error with Vivaldi 4.1.2369.11 (Chromium based)


Yeah, it's sad Apple has crippled their browser; the browser on a $40 Android tablet is more powerful than Safari running on a $1200 iPhone.


It fails on other browsers as well (e.g. Firefox on Linux), and someone has discovered[1] that it may well be a Netlify bug (sending more data than it's supposed to). I agree with your point that Apple could do better, but in this case it does seem that Apple's browser is doing the right thing and ignoring invalid data provided by Netlify (which is the same behaviour as Firefox, at the very least). In any case, going direct[2] to GitHub should work fine; that link works for me in Firefox on Linux (whereas the Netlify one does not).

[1]: https://news.ycombinator.com/item?id=28016906

[2]: https://phiresky.github.io/blog/2021/hosting-sqlite-database...


There’s a market for both devices, and justifiably so. Why not celebrate the fact that people have a choice?


Separate from this, one of those devices is keeping the web from fully transitioning back to web apps.


You mean Safari and how it restricts local storage for privacy reasons? Or have I missed something else?



Why don't people ditch Safari and use Chrome instead?


I’m not sure if this is sincere but Apple do not allow other browser engines. They are all Safari skins.


> Safari skins

More like embedded WebViews.


Android is used more here in India. Personally I haven't even touched an Apple device to date, because it's just hard to find one except in an Apple store or if you have very rich friends. So I had no idea.

> They are all Safari skins

That's crazy.


There might be good reasons to do it from Apple’s point of view.

For example, if Apple allowed third-party browser engines, any vendor could offer a browser with a vulnerability. That browser could be abused to install apps through that vulnerability, including malicious ones.

I’m not implying that it’s ok for Apple to act like that. My point is that it’s consistent with their security model for iOS.


The real reason is that they want you to get your apps from the App Store and that's it. They can't get paid when people install PWAs or sideload open source apps. Same reason for killing the headphone jack. Same reason for having a differently shaped MagSafe for every generation of MacBook. To keep the suckers spending.


MagSafe has had only two variants while it existed (2006–2017). That’s more than a decade.

Both MacBook Pros I owned in that time came with their power supplies included, and each lasted me seven years.

I consider myself frugal, not a sucker.


If you're buying apple products still, you're telling on yourself.


An alternative explanation would be: people have individual preferences, and value is a subjective concept.


FWIW the scientific computing community (who often deal with petabytes of geodata) has been thinking about ideas like this for a while, e.g. techniques around file formats that are easy to lazily and partially parse, (ab)using FUSE to do partial reads via HTTP Range requests, some combination thereof, etc:

http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cl...


On yazz.com we have been embedding and running SQLite in web pages for over 2 years now. It is definitely something that works well.


The Webapp on https://yazz.com/app/homepage.html is pretty cool. Has a Visual Basic 1 vibe.


Wow okay, so is this like an HTTP-based buffer pool manager? Instead of reading pages from disk, it reads them via HTTP?


That’s how I understood it too


This is a great example of how as technology changes, it changes use cases, which can prompt a revisiting of what was once considered a good idea. You'll often see the pendulum of consensus swing in one direction, and then swing back to the exact opposite direction less than a decade later.

The 2010s saw REST-conforming APIs with JSON in the body, largely as an (appropriate) reaction to what came before, and also in accordance with changes in what browsers were able to do, and thus how much of web apps moved from the backend to the frontend.

But then, that brought even more momentum, and web apps started doing /even more/. There was a time when downloading a few megabytes per page, generating an SVG chart or drawing an image, and reacting to live user interaction were all unthinkable. But interactive charting is now the de facto standard. So now we need ways to access ranges and pieces of bulk data, and it looks a lot more like block storage access than REST.

---

These are core database ideas: you maintain a fast, easy-to-access local cache of key bits of data (called a buffer pool, stored in memory, in e.g. MySQL). In this local cache you keep information on how to access the remaining bulk of the data (called an index). You minimize dipping into "remote" storage that takes 10-100x the time to access.

Database people refer to the "memory wall" as the big gap in the cache hierarchy (CPU registers, L1-L3, main memory, disk/network) where, the second you dip beyond it, latency tanks (cue the "latency numbers every programmer should know" chart). And so you have to treat it specially and build your query plan to work around it. As storage techniques changed (e.g. SSDs, then NVMe and 3D XPoint, etc.), database research shifted to adapt techniques that leverage the new tools.

In this new case, the "wall" is just before the WAN internet, instead of being before the disk subsystem.

---

This new environment might call for a new database (and application) architectural style where executing large and complex code quickly on the client side is no problem at all in an era of 8-core CPUs, Emscripten, and JavaScript JITs. So the query engine can move to the client, the main indexes can be loaded and cached within the app, and the function of the backend is suddenly reduced to simply storing and fetching blocks of data, something "static" file hosts can do no problem.

The fundamental idea is: where do I keep my data stored, where do I keep my business logic, and where do I handle presentation. The answer is what varies. Variations on this thought:

We've already had products that completely remove the "query engine" from the "storage" and provide it as a separate service, e.g. Presto / Athena, where you set it up to use anything from flat files to RDBMSs as "data stores" across which it can do fairly complicated query plans, joins, predicate pushdown, etc. Slightly differently, Snowflake is an example of a database architected around storing the main data in large, cheap cloud storage like S3; no need to copy and keep entire files on the EC2 node, only the block ranges you know you need. Yet another example of leveraging the boundary between the execution and the data.

People have already questioned the wisdom of having a mostly dumb CRUD backend layer with minimal business logic between the web client and the database. The answer is that databases just suck at catering to this niche, but nothing vastly more complicated than that. They certainly could do granular auth, serde, validation, vastly better performance isolation, HTTP instead of a special protocol, a JavaScript client, etc. Some have tried.

Stored procedures are also generally considered bad (bad tooling, bad performance characteristics and isolation, a large element of surprise), but they needn't be. They're vastly better in some products that are generally inaccessible to or unpopular with large chunks of the public. But they're a half-baked attempt to keep business logic and data close together. And some companies decided at a certain point that their faults did not outweigh their benefits, and had large portions of critical applications written this way not too long ago.

---

Part of advancing as an engineer is to be able to weigh the cost of when it's appropriate to sometimes free yourself from the yoke of "best practices" and "how it's done". You might recognize that something about what you're trying to do is different, or times and conditions have changed since a thing was decided.

And also to know when it's not appropriate: existing, excellent tooling probably works okay for many use cases, and the cost of invention is unnecessary.

We see this often when companies and products that are pushing boundaries or up against certain limitations might do something that seems silly or goes against the grain of what's obviously good. That's okay: they're not you, and you're not them, and we all have our own reasons, and that's the point.


I would think that if we know the SQL queries we need, we could pre-execute them and store the results in simple indexed tables. The web app would then just need to ask for the data at a given index value. No SQL needed in the browser. Could this work? Pre-executing SQL.


Yeah, that's caching, and it's great when one can do it :-)

Edit: kidding aside, this is approximately what a materialized database view is, at least in some implementations (though probably not as common as the simpler, non-materialized kind): a view is defined as a SQL query that pulls some data, and its result is stored as-is. As data is inserted into the database, the materialized views (a.k.a. stored queries) that would include that row are updated too.
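
A build-time sketch of the pre-execution idea (assuming Node with better-sqlite3; the table, columns, and file layout are invented): run the query once per index value and publish the results as static JSON the web app can fetch directly.

    // prebuild.ts: pre-execute a query per index value and emit static JSON,
    // so the browser just fetches reports/<country>.json and never runs SQL.
    import Database from "better-sqlite3";
    import { mkdirSync, writeFileSync } from "fs";

    const db = new Database("stats.sqlite3", { readonly: true });
    mkdirSync("reports", { recursive: true });

    const countries = db
      .prepare("SELECT DISTINCT country FROM indicators")
      .all() as { country: string }[];
    const byCountry = db.prepare(
      "SELECT year, indicator, value FROM indicators WHERE country = ? ORDER BY year",
    );

    for (const { country } of countries) {
      // One static file per index value; regenerated whenever the data changes.
      writeFileSync(`reports/${country}.json`, JSON.stringify(byCountry.all(country)));
    }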


Neat. Just not to be used for authentication!


I wonder if you could combine this with a Google Spreadsheet to turn it into a real DB system with write access and a little obfuscated security wrapper.


Bummer on not being able to write to SQLite. I am using Neocities and was wondering how I could get a DB into play.


Depending on how often you need to write, you could use a CI pipeline on a cron to collect your updates, add them to the SQLite file, and commit the changes.


I don't think it would be impossible, although I think it may be easier to just use a JavaScript Git client and something like Critic markup to make changes.



