Similarly, Matt Mitchell, now a senior developer at Room Key, came to me saying, "Colin, there's this tool called Solr, and I think it will make searching our hotel database much easier." Again, I can only shake my head at what happens when you hire smart people, treat them like adults, and actually listen to what they come up with. Solr quickly went from a piecemeal solution to our searching needs to something far more interesting. In our problem domain, stale reads of certain data are tolerable, and exploiting that was a lever we could pull. Eventually, we ended up baking an instance of Solr/Lucene directly into our individual application processes, making it possible to achieve true linear horizontal scalability for the application.
As of this writing, we’re on track to having 9 million uniques a month, from zero at the start of 2012. We did so with absolutely no fuss, no late nights, no hand wringing, and a laughably small amount of additional opex spend. But I digress.
The thing is, no matter how you bake the data, it has to be edited somewhere. For all but the most trivial implementations, that "somewhere" is rarely going to be raw Markdown files or SQLite. On real-world mid-sized websites, content editors want something easier to work with. Having an actual CMS, even if it's just a decapitated WordPress, makes their lives much easier. And you can still get the benefits of baked data.
With incremental static regeneration (https://www.smashingmagazine.com/2021/04/incremental-static-...), you also get the benefit of post-build on the fly revalidation and "rebaking" without requiring a full rebuild, courtesy of a server that checks for data updates in the background and rebuilds only the affected pages.
It means users won't get instant deployment of their edits, though, which depending on your use case may be a non-starter.
The Mozilla.org site I reference in the article appears to use Django (potentially with the Django Admin CMS tool) to manage the data, but then bakes and distributes a SQLite file with the data that has been managed by Django.
That's what the "incremental static regeneration" I mentioned does. Every time the data changes, Next.js catches it within your specified revalidation period, and then rebakes only the affected pages by updating their local caches. It's magical.
Better still, with this system, you never have to manage the database layer yourself. There is a CMS, yes, but if you go headless in the cloud, that's someone else's scaling problem. Any edits, whether at build time or subsequently, get baked into static HTML + JS that can be served from any CDN.
The downside is that most of that is black-box magic. It's not clear to me whether it's a secret npm server, some serverless function, or something else entirely doing that "intelligent build process". But by and large it does work, way better than I expected it to.
1) On static pages, you can set a revalidation period (say 60s)
2) Those static pages are always served from the CDN cache, whether or not a revalidation period is set. Visitors always get baked HTML + cached JSON data.
3) Visitors 1, 2, 3, and 4 arrive within moments of each other, and each sees the same cached page.
4) Visitor 5 arrives after the 60s revalidation period. They STILL see the same cached page as visitors 1-4, BUT their visit also triggers a staleness check against the data origin. (This is the black-magic part; I'm not really sure how it does that. It "just works" when hosted on Vercel; on other providers it may be necessary to spin up a separate `next start` Node server. It's entirely unclear to me.)
5) Behind the scenes, Next.js sees the new data and rebuilds the affected page. It takes a while, though (maybe a minute or so, depending).
6) Visitor 6 visits at 65 seconds, not quite enough time for the new page to have been built. They still see the same cached version as visitors 1-5.
7) But visitor 7 arrives a minute later, at 150 seconds, and by then Next.js has updated the cache. Visitors 7+ see the newly baked page, with the updated data.
8) The cycle repeats every 60 seconds.
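The flow above is essentially a stale-while-revalidate cache. Here's a toy Python simulation of the logic (names like `ISRCache` and `fetch_page` are mine for illustration, not Next.js internals; the real thing does the rebuild asynchronously on the server):

```python
import time

class ISRCache:
    """Toy stale-while-revalidate cache mimicking the ISR flow above.

    fetch_page stands in for Next.js rebuilding a page from the data
    origin; revalidate is the period from step 1.
    """

    def __init__(self, fetch_page, revalidate=60, clock=time.monotonic):
        self.fetch_page = fetch_page
        self.revalidate = revalidate
        self.clock = clock
        self.cached = fetch_page()   # page baked at build time
        self.baked_at = clock()
        self.rebuild_pending = False

    def get(self):
        # Every visitor gets the cached page, stale or not (steps 2-4).
        if (self.clock() - self.baked_at > self.revalidate
                and not self.rebuild_pending):
            # The "sacrificial visitor" triggers a background rebuild.
            self.rebuild_pending = True
        return self.cached

    def finish_rebuild(self):
        # Next.js does this asynchronously (step 5); here it's explicit
        # to keep the sketch deterministic.
        if self.rebuild_pending:
            self.cached = self.fetch_page()
            self.baked_at = self.clock()
            self.rebuild_pending = False
```

So visitor 5 still gets the stale page but flips `rebuild_pending`, and only visitors arriving after `finish_rebuild` see the fresh one.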
So in production, new visitors will see updated baked pages pretty soon after the data change is made in the CMS. Caveats: it's not "instant" per se (just soon), and it requires a sacrificial visitor (visitor 5 in our example) -- or an editor or bot pretending to be a visitor -- to trigger that staleness recheck by visiting the page.
If you need truly realtime, pushed updates straight from the CMS into your Jamstack, that requires even more workarounds on top of the Next.js ISR I described above. Some CMS vendors have proprietary solutions, like DatoCMS's real-time updates API (https://www.datocms.com/docs/real-time-updates-api), but I don't think there is a solved, best-practices model to refer to. Basically some sort of pub/sub using workers and server-sent events, but that gets pretty complicated compared with client-side polling for updates. It's a slightly different problem from updating baked data in near- (but not actual) real-time.
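For comparison, the client-side polling approach is almost embarrassingly simple. A sketch, assuming a hypothetical endpoint that exposes a version stamp for the baked data (`fetch_version` would wrap an HTTP GET against it; a real client would also sleep between checks):

```python
def poll_for_updates(fetch_version, on_change, max_checks=None):
    """Naive client-side polling, the low-tech alternative to SSE/pub-sub.

    fetch_version: callable returning the current version stamp of the
        baked data (e.g. wrapping a GET to a hypothetical /data-version
        endpoint -- that endpoint is an assumption, not a real API).
    on_change: callback invoked with the new version when it changes.
    max_checks: stop after this many polls (useful for testing; a real
        client would loop forever with a delay between checks).
    """
    last = fetch_version()
    checks = 1
    while max_checks is None or checks < max_checks:
        current = fetch_version()
        checks += 1
        if current != last:
            on_change(current)
            last = current
```

The tradeoff is latency and wasted requests versus the operational complexity of a push pipeline.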
Or for data that changes infrequently enough (and not in reaction to user input) that you are OK with changing it via a deploy.
For small datasets (<100MB), perhaps, but I disagree with the author that this scales well. I wouldn't want to be deploying a multi-gigabyte binary every time I make a change to the dataset...
I check in a one-line change and CI/CD kicks off a docker build and push. A multi-gigabyte binary is pushed to AWS ECR every time. I hardly notice. One benefit is the consistency. Small change. Pushed. Large change. Pushed. Dataset. Pushed. New OS, new code, new lib, new data all the same.
Maybe it doesn't scale to frequent updates? If you are streaming me data, I can't update my image every second...
Just reference the data as a URL to static storage, problem solved.
What binary size? What RAM consumption? With a SQLite database or something similar you'll have a small and possibly unchanged program, one or more large replaced database files, and only the unavoidable reloading of actually used index pages contributing to memory use and "warmup" after an update.
Of course, if instead of relying on an efficient DBMS that manages indexing and caching, you have to discard and reload a large file in its entirety after an update in order to hold all its content in memory, then your database is part of the problem and not part of the solution.
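For concreteness, here's a minimal sketch of that baked-SQLite approach using Python's stdlib `sqlite3`. The `immutable=1` URI flag tells SQLite the file will never change, so it can skip locking; pages are then faulted in from disk only as queries touch them, rather than the whole file being loaded into memory. (The schema and file name are made up for the example.)

```python
import os
import sqlite3
import tempfile

# Bake a small database -- a stand-in for the deploy-time build step.
path = os.path.join(tempfile.mkdtemp(), "baked.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE hotels (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO hotels (name) VALUES (?)",
                [("Alpha Inn",), ("Beta Lodge",)])
con.commit()
con.close()

# At runtime, open the baked file read-only and immutable. Only the
# index/table pages a query actually touches get read into memory.
ro = sqlite3.connect(f"file:{path}?mode=ro&immutable=1", uri=True)
rows = ro.execute("SELECT name FROM hotels ORDER BY id").fetchall()
print(rows)  # [('Alpha Inn',), ('Beta Lodge',)]
ro.close()
```

Swapping in a new bake is then just replacing the file and reopening the connection.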
I find that larger database files on Cloud Run do need to be assigned a higher amount of RAM though, or they fail to start - but I'm not sure if that's a bug in how my code works. I've not yet spent much time tweaking SQLite parameters.
> I wouldn't want to be deploying a multi-gigabyte binary every time I make a change to the dataset...

Add to that the runtime startup costs: lambda warmup time, and RAM consumption proportional to binary size.
This pattern is mostly useful for content-oriented sites with human editors. It's pretty rare for any human-edited publication to generate more than a few hundred MBs of structured data!
I forgot how to count that low
~4.1TB RAID0 on a bunch of SSDs x N distributed nodes we deploy, LevelDB shards & client in the app.
Daily updates, blue/green needed
1 - https://github.com/J-Swift/cod-stats
2 - https://codstats-frontend.s3.amazonaws.com/index.html