The Baked Data architectural pattern (simonwillison.net)
70 points by edward on July 29, 2021 | 23 comments



It's good to finally have a name for this pattern. The first time I became aware of it, and the most ambitious implementation of it I've seen, was probably at Roomkey. I heard about it from Colin Steele, who was CTO there and who wrote two detailed posts about it (I helped edit this article):

https://www.google.com/amp/s/www.colinsteele.org/post/231037...

Colin wrote:

Similarly, Matt Mitchell, a senior developer at Room Key now, came to me saying, "Colin, there’s this tool called Solr, and I think it will make searching our hotel database much easier." Again, I can only shake my head at what happens when you hire smart people, treat them like adults, and actually listen to what they come up with. Solr quickly went from a piecewise solution to our searching needs to something far more interesting. In our problem domain, stale reads of certain data are tolerable, and exploiting that was a lever we could pull. Eventually, we ended up baking an instance of Solr/Lucene directly into our individual application processes, making it possible to achieve true linear horizontal scalability for the application.

As of this writing, we’re on track to having 9 million uniques a month, from zero at the start of 2012. We did so with absolutely no fuss, no late nights, no hand wringing, and a laughably small amount of additional opex spend. But I digress.


Next.js (and probably its competitors) does this same thing (https://nextjs.org/docs/basic-features/data-fetching#getstat...), just much more elegantly and modularly, allowing you to work with any headless CMS while still maintaining the benefits of a static, "baked" dataset that gets deployed alongside static code. It just caches API responses at build time into local JSON.
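
For reference, here's a minimal sketch of that build-time baking with the Next.js pages router; the CMS URL and the Hotel shape are made up for illustration, not taken from any real project:

    // pages/hotels.tsx -- illustrative only; endpoint and data shape are hypothetical.
    import type { GetStaticProps } from "next";

    type Hotel = { slug: string; name: string };

    // Runs at build time only: the CMS response is baked into the page's static
    // HTML plus a local JSON props file that ships with the deploy.
    export const getStaticProps: GetStaticProps<{ hotels: Hotel[] }> = async () => {
      const hotels: Hotel[] = await fetch("https://cms.example.com/api/hotels")
        .then((r) => r.json());
      return { props: { hotels } };
    };

    export default function Hotels({ hotels }: { hotels: Hotel[] }) {
      return (
        <ul>
          {hotels.map((h) => (
            <li key={h.slug}>{h.name}</li>
          ))}
        </ul>
      );
    }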

The thing is, no matter how you bake the data, it has to be edited somewhere. For all but the most trivial implementations, that "somewhere" is rarely well served by Markdown files or SQLite. On real-world mid-sized websites, content editors want something easier to work with. Having an actual CMS, even if it's just a decapitated WordPress, makes their lives much easier. And you can still get the benefits of baked data.

With incremental static regeneration (https://www.smashingmagazine.com/2021/04/incremental-static-...), you also get the benefit of post-build, on-the-fly revalidation and "rebaking" without requiring a full rebuild, courtesy of a server that checks for data updates in the background and rebuilds only the affected pages.
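
Concretely (same hypothetical endpoint as the sketch above), ISR is just the addition of a revalidate field to getStaticProps's return value:

    // Same page as the earlier sketch, now opting into ISR: the page keeps being
    // served from cache, and a request arriving more than 60 seconds after the
    // last build triggers a background re-bake of just this page.
    export const getStaticProps: GetStaticProps<{ hotels: Hotel[] }> = async () => {
      const hotels: Hotel[] = await fetch("https://cms.example.com/api/hotels")
        .then((r) => r.json());
      return { props: { hotels }, revalidate: 60 };
    };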


There's nothing to stop you implementing a WordPress-style CMS as part of this pattern, then baking its database on every deploy and shipping that as a copy.
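
As a rough sketch of what that bake step could look like (assuming Node 18+, better-sqlite3, and a hypothetical CMS export endpoint; table and field names are invented):

    // bake.ts -- deploy-time script: pull content out of the CMS and bake it into
    // a read-only SQLite file that ships alongside the application code.
    import Database from "better-sqlite3";

    type Post = { slug: string; title: string; body: string };

    async function bake() {
      // Hypothetical export endpoint; a real WordPress would expose its REST API instead.
      const posts: Post[] = await fetch("https://cms.example.com/api/posts")
        .then((r) => r.json());

      const db = new Database("content.db");
      db.exec("CREATE TABLE IF NOT EXISTS posts (slug TEXT PRIMARY KEY, title TEXT, body TEXT)");
      const insert = db.prepare("INSERT OR REPLACE INTO posts (slug, title, body) VALUES (?, ?, ?)");
      const insertAll = db.transaction((rows: Post[]) => {
        for (const p of rows) insert.run(p.slug, p.title, p.body);
      });
      insertAll(posts);
      db.close();
    }

    bake();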

It means users won't get instant deployment of their edits though, which depending on your use-case may be a non-starter.

The Mozilla.org site I reference in the article appears to use Django (potentially with the Django Admin CMS tool) to manage the data, but then bakes and distributes a SQLite file with the data that has been managed by Django.


> It means users won't get instant deployment of their edits though, which depending on your use-case may be a non-starter.

That's what the "incremental static regeneration" I mentioned does. Every time the data changes, Next.js catches it within your specified revalidation period, and then rebakes only the affected pages by updating their local caches. It's magical.

Better still, with this system, you never have to manage the database layer yourself. There is a CMS, yes, but if you go headless in the cloud, that's someone else's scaling problem. Any edits, whether at build time or subsequently, get baked into static HTML + JS that can be served from any CDN.


I think I see what you mean - so your production site is hosted entirely from a static CDN, but you have an intelligent build process running somewhere that can re-generate and then re-publish just a subset of the files when the underlying data changes?



Yes, exactly.

The downside is that most of that is black-box magic. It's not clear to me whether it's a secret npm server, some serverless function, or something else entirely doing that "intelligent build process". But by and large it does work, way better than I expected it to.

In detail:

1) On static pages, you can set a revalidation period (say 60s)

2) Those static pages are always served from the CDN cache, whether or not a revalidation period is set. Visitors always get baked HTML + cached JSON data.

3) Visitors 1, 2, 3, 4 visit within moments of each other, and each see the same cached page.

4) Visitor 5 arrives after the 60s revalidation period. They STILL see the same cached page as visitors 1-4, BUT their visit also triggers a staleness check against the data origin. (This is the part that's black magic; I'm not really sure how it does that. It "just works" if hosted on Vercel; on other providers it may be necessary to spin up a separate `next start` Node server. It's entirely unclear to me.)

5) Behind the scenes, Next.js via its magic sees new data and rebuilds the affected page. It takes a while, though (maybe a minute or so depending).

6) Visitor 6 visits at 65 seconds, not quite enough time for the new page to have been built. They still see the same cached version as visitors 1-5.

7) But visitor 7 visits a minute later, at 150 seconds, and by this time Next has updated the cache. Visitor 7+ see the newly baked page, with the updated data.

8) The cycle repeats every 60 seconds.

So in production, new visitors will see updated baked pages pretty soon after the data change is made in the CMS. Caveats: it's not "instant" per se (just soon), and it requires a sacrificial visitor (visitor 5 in our example) -- or an editor or bot pretending to be a visitor -- to trigger that staleness recheck by visiting the page.

If you need truly realtime updates pushed straight from the CMS into your Jamstack, that requires even more workarounds on top of the Next.js ISR described above. Some CMS vendors have proprietary solutions for this, like DatoCMS's real-time updates API (https://www.datocms.com/docs/real-time-updates-api), but I don't think there is a solved, best-practices model to refer to yet. Basically it comes down to some sort of pub/sub using workers and server-sent events, but that gets pretty complicated compared with client-side polling for updates. It's also a slightly different problem from updating baked data in near- (but not actual) real time.


I really like that approach. Thanks for validating it publicly so I feel less like I'm living an antipattern by hosting data next to code.


Applies only for read-only data?

Or for data that changes infrequently enough (and not in reaction to user input) that you are OK with changing it via a deploy.


The popular static site generator Hugo also supports static data in the form of a 'data' directory[1] where JSON files and other stuff can be stored.

[1] https://gohugo.io/templates/data-templates/#readout


Hugo doesn't execute any server-side code at run-time though, does it? I believe the content in that data directory is made available to the templates at compile time, but the actual deployed assets are still just static files.


Correct, but there's nothing stopping you from having a Lambda function regenerate S3 content from Hugo periodically or on request.
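
Sketching that out, very much hypothetically (it assumes the hugo binary and the site source are packaged with the function, the bucket name is invented, and a real version would also set content types and delete stale keys):

    // Hypothetical scheduled Lambda: rebuild the Hugo site and push the output to S3.
    import { execSync } from "node:child_process";
    import { readFileSync, readdirSync, statSync } from "node:fs";
    import { join, relative } from "node:path";
    import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

    const s3 = new S3Client({});
    const BUCKET = "my-static-site-bucket"; // illustrative name

    // Recursively list every file under a directory.
    const walk = (dir: string): string[] =>
      readdirSync(dir).flatMap((name) => {
        const full = join(dir, name);
        return statSync(full).isDirectory() ? walk(full) : [full];
      });

    export const handler = async () => {
      // Regenerate the site into /tmp, the only writable path in Lambda.
      execSync("./hugo --source ./site --destination /tmp/public", { stdio: "inherit" });

      for (const file of walk("/tmp/public")) {
        await s3.send(new PutObjectCommand({
          Bucket: BUCKET,
          Key: relative("/tmp/public", file),
          Body: readFileSync(file),
        }));
      }
    };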


It's perfect for serverless, like lambdas?

For small datasets (<100MB), perhaps, but I disagree with the author that this scales well. I wouldn't want to be deploying a multi-gigabyte binary every time I make a change to the dataset...


It depends on your deployment setup? Or the frequency of dataset changes?

I check in a one-line change and CI/CD kicks off a Docker build and push. A multi-gigabyte binary is pushed to AWS ECR every time. I hardly notice. One benefit is the consistency. Small change. Pushed. Large change. Pushed. Dataset. Pushed. New OS, new code, new lib, new data: all the same.
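
For illustration only (the base image, paths, and filenames are invented, not this setup's actual details), the "data ships like everything else" part can be as simple as a COPY in the Dockerfile:

    # Illustrative Dockerfile: the dataset is baked into the image next to the code,
    # so a data change and a code change go through the same build-and-push path.
    FROM node:20-slim
    WORKDIR /app
    COPY package.json package-lock.json ./
    RUN npm ci --omit=dev
    COPY src ./src
    # The multi-gigabyte dataset is copied in like any other file.
    COPY data/dataset.db ./data/dataset.db
    CMD ["node", "src/server.js"]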

Maybe it doesn’t scale to frequent updates? If you are streaming me data, I can’t update my image every second..


No, it's no good for big data sets because Lambda warmup time and RAM consumption are proportional to binary size.

Just reference the data as a URL to static storage, problem solved.


> No, it's no good for big data sets because Lambda warmup time and RAM consumption are proportional to binary size

What binary size? What RAM consumption? With a SQLite database or something similar you'll have a small and possibly unchanged program, one or more large replaced database files, and only the unavoidable reloading of actually used index pages contributing to memory use and "warmup" after an update.

Of course, if instead of relying on an efficient DBMS that manages indexing and caching you have to discard and reload a large file in its entirety after an update in order to hold all of its content in memory, then your database is part of the problem and not part of the solution.


Weirdly, Cloud Run claims that cold start time isn't coupled to the size of the binary.

I find that larger database files on Cloud Run do need to be assigned more RAM though, or they fail to start - but I'm not sure if that's a bug in how my code works. I've not yet spent much time tweaking SQLite parameters.


We switched from deployment:

> I wouldn't want to be deploying a multi-gigabyte binary every time I make a change to the dataset...

To runtime startup:

> lambda warmup time

And cost:

> RAM consumption is proportional to binary size


I have a few projects running on Cloud Run that deploy around a GB of SQLite data. Works pretty well! It's not so great above 2GB though; at that point the model of shipping a copy of the data with each deploy stops working so well.

This pattern is mostly useful for content-oriented sites with human editors. It's pretty rare for any human-edited publication to generate more than a few hundred MBs of structured data!


> multi-gigabyte

I forgot how to count that low


I deploy AI to prod. Low data gravity in high-performance environments (SSDs are relevant here) is a common component.

~4.1TB RAID 0 across a bunch of SSDs, times N distributed nodes we deploy, with LevelDB shards and the client in the app.

Daily updates, blue/green needed


I use a modified version of this here [1]. I do an ETL into SQLite, but then pre-generate a bunch of API responses into JSON files that are then baked into an S3 site [2] (rough sketch after the links below). It's really nice, but it's not a silver bullet. There are certain interaction patterns this precludes, and it wouldn't be good for highly dynamic data.

1 - https://github.com/J-Swift/cod-stats

2 - https://codstats-frontend.s3.amazonaws.com/index.html
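
A rough sketch of that pre-generation step, assuming better-sqlite3 and an invented schema (the real project's tables will differ):

    // pregen.ts -- read stats out of SQLite and pre-generate the "API responses"
    // as plain JSON files, ready to be uploaded to S3 as a static site.
    import Database from "better-sqlite3";
    import { mkdirSync, writeFileSync } from "node:fs";

    const db = new Database("stats.db", { readonly: true });
    mkdirSync("dist/api/players", { recursive: true });

    const players = db.prepare("SELECT id, name FROM players").all() as
      { id: number; name: string }[];

    // One JSON file per player, plus an index -- the "API" is just static files.
    writeFileSync("dist/api/players/index.json", JSON.stringify(players));
    for (const p of players) {
      const matches = db.prepare("SELECT * FROM matches WHERE player_id = ?").all(p.id);
      writeFileSync(`dist/api/players/${p.id}.json`, JSON.stringify({ ...p, matches }));
    }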




