Hello, creator of the project here. This entire project was motivated by another HN post where someone was asking where we can get information about architectures and system designs across our industry. Seeing there wasn't anything available, I decided to start it; glad to see it's resonated with the community.
I hope to create high quality posts from engineers who work on these systems and the challenges they are trying to solve - really dig into the problems and the technologies and strategies they used to solve them.
What have they learned? Where have they failed?
I also plan to write up deep technical dives on technologies we all rely on and use every day.
If you have any feedback on how to improve, or something you want to write up with me, please reach out.
Hey, I love this. Please keep doing it, we need more of this kind of high quality writing available on the web. When I was just getting started programming, I read the AOSA guides, but they were considerably harder to read than this.
There are so many things I love about this: the art style, the large font, the summary image, etc. Great work.
We worked hard to make it very approachable and to explain concepts that people often assume everyone is already aware of.
This means a lot, and I hope to continue producing these and more.
You should also check out http://aosabook.org/en/index.html, they've collected information on the architecture of a wide range of open source programs.
Thanks for doing this! The only things I noticed are that in this image (https://architecturenotes.co/content/images/2022/05/Datasett...) the word "acquire" is misspelled, and near the end GitHub is written as "Github" with incorrect casing, despite it being correct earlier on.
If you're asking about Datasette, it's named after the C64 tape drive I used to write my first ever "database" (a tiny program that stored and returned simple records when I was about 8 years old).
Simon thanks for taking the time to explain Datasette internals.
I'm not a Python developer myself, but I was intrigued by the use of hooks to extend Datasette and ended up looking into pluggy and how Datasette plugins use hooks.
I noticed that some Datasette hooks, like 'extra_template_vars', override values set by other hooks, while others, like 'extra_body_script', simply append to a list.
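From what I could tell, a plugin implements them roughly like this (a simplified sketch - the real hooks accept more optional arguments, and I may be getting details wrong):

    from datasette import hookimpl

    @hookimpl
    def extra_template_vars(datasette):
        # Returns a dict that gets merged into the template context -
        # another plugin returning the same key can override this value
        return {"greeting": "hello from plugin A"}

    @hookimpl
    def extra_body_script():
        # Returns JavaScript that Datasette appends to a list of scripts,
        # so multiple plugins can contribute without overwriting each other
        return "console.log('plugin A loaded');"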
How does Datasette avoid plugins clashing with each other? Say one plugin writes a header and another overrides it.
That's something I'm still figuring out. In particular, I want plugins that add extra content to pages to be able to collaborate with each other. I have an open issue for that here: https://github.com/simonw/datasette/issues/1191
I'm not too worried about extra template vars overrides though - that will only be a problem if two plugins accidentally use the same variable name. Maybe the documentation should encourage plugins to use a namespace based on their name though.
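Something like this, for a hypothetical plugin - prefix everything it returns with its own name:

    from datasette import hookimpl

    @hookimpl
    def extra_template_vars():
        # Namespacing the keys means two plugins can't clobber each other
        return {"my_plugin_message": "hello from my-plugin"}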
What can government data providers like data.cdc.gov do to support Datasette better? Is it expected that these providers would make SQLite distributions of their data? Or do you think people will just pull the data down as CSV and convert it into SQLite for Datasette, to make it easier to work with and to reproduce the work of others?
My ultimate dream is that providers like that will use Datasette itself - or a similar system that has the same characteristics: make it really easy for people to slice and dice the data using querystring parameters or even SQL queries and get back out just the data they want as JSON, CSV and other formats.
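To give a flavour of what I mean, Datasette's table pages already support this kind of thing - querystring filters plus a format suffix (the database, table and column names here are made up):

    /health_data/cases.json?state=CA&_size=100   (filtered rows as JSON)
    /health_data/cases.csv?state=CA&_size=100    (the same rows as CSV)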
I do think that SQLite is a really interesting format for publishing data, and I'd love to see more places publish raw SQLite files. It's much better at preserving things like column type information and relationships between tables than CSV is.
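For example, a SQLite file carries this kind of information directly in its schema, where a CSV export flattens everything down to untyped strings (table names here are made up):

    CREATE TABLE authors (
        id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        published INTEGER,                         -- column types survive
        author_id INTEGER REFERENCES authors(id)   -- so do relationships
    );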
For sites that are running CKAN with the datastore plugin on CSV/tabular resources, there is essentially a SQL endpoint to make queries against that data. There are also automated conversions of that data to CSV/JSON/XML. In my experience though, there's not a lot of activity around those APIs -- people tend to just download the data.
(CKAN is an open source, traditional web-app-style (Flask/Jinja/PostgreSQL) open data platform that powers a bunch of open data portals, including data.gov.ie)
Simon, I have a wish-list item for Datasette: pagination of ad hoc queries. I know that's a really difficult thing to implement, as it would require parsing and altering the SQL query, but with large datasets it would be so useful!
Datasette avoids offset/limit pagination because it performs poorly on huge queries - and I don't want random visitors (and crawlers) hurting the performance of a public instance by crawling through thousands of pages of offset/limit results.
That's why table pages implement keyset pagination instead - so you can do https://congress-legislators.datasettes.com/legislators/legi... and get back records following the one with A000106, which is a fast query because the ID column has an index on it.
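Under the hood that's roughly this query (simplified - the real implementation handles compound sort orders and more, and the table and column names here are illustrative):

    -- Keyset pagination: fetch the page after the row with id 'A000106'
    SELECT * FROM legislators
    WHERE id > 'A000106'
    ORDER BY id
    LIMIT 101;  -- one extra row tells us whether a next page exists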
Supporting this with arbitrary queries is harder. One idea I had is to allow the user to specify which column and sort order should be used for keyset pagination - so you could construct a URL like this:
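    /database_name?sql=select+*+from+my_table&_pagination_column=id&_pagination_order=asc&_next=A000106

(Parameter names there are just placeholders to illustrate the idea - I haven't settled on what they would actually be called.)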
If a pagination column has been specified, Datasette would use the same trick it uses on regular table pages and add next links that way.
Would that work for you?
The other, probably easier option is a setting that enables offset/limit pagination of arbitrary SQL queries - turned off by default, but easy to turn on for users who are running Datasette on a private server. If those queries take several seconds, at least the people running them have opted into it.
Either would work, but as you said the latter is easier to implement and would have done the job for what I was working on. If I had had more time I would have looked into trying to do it myself, but there is never enough time...
But AFAICT, it just doesn't scale whatsoever. That SQLite db is both the dataset index and the dataset content combined, right? So you're limited by how big that SQLite db can realistically be. The docs say "share data of any shape or any size", but AFAICT it can't handle large datasets containing large unstructured data like images and video, and multi-billion-data-point datasets are hard to store in a single machine/file.
Not really a criticism, but more wondering if there are scale optimizations in Datasette I'm not aware of since the docs do say any shape or size.
You're right, Datasette isn't the right tool for sharing billion point datasets (actually low-billions might be OK if each row is small enough).
I think of Datasette as a tool for working with "small data" - where I define small data as data that will fit on a USB stick, or on my phone.
My iPhone has a TB of storage these days, so small data can get you a very long way!
Using it for unstructured images and video would work fine using the pattern where those binary files live somewhere like S3 and the Datasette instance exposes URLs to them. I should find somewhere in the documentation to talk about that.
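The shape of that pattern is very simple - the database row just carries a pointer to the file (an illustrative schema, not something Datasette prescribes):

    CREATE TABLE photos (
        id INTEGER PRIMARY KEY,
        title TEXT,
        taken_at TEXT,
        s3_url TEXT  -- e.g. https://my-bucket.s3.amazonaws.com/photos/123.jpg
    );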
But yes, I should probably take "of any size" off the homepage, it does give a misleading impression.
Yeah - it's probably unfair of me to say it doesn't scale at all. But between large data and 2 extra orders of magnitude of rows, the single-SQLite-file approach quickly breaks down, even if you don't store the large content in-db.
> AFAICT it can't handle large datasets containing large unstructured data like images and video and multi-billion data point datasets are hard to store in a single machine/file
Images and videos can easily be yeeted in as binary blobs (same as with any other standard DB), and SQLite DBs scale into the hundreds of TB range as a single file. Are you comparing the single file strategy to something like a sharded cluster of DBs, or is your thought that a DB that stores objects as independent files is somehow superior?
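For example, with nothing but Python's standard library (file and table names made up):

    import sqlite3

    conn = sqlite3.connect("media.db")
    conn.execute("CREATE TABLE IF NOT EXISTS media (name TEXT, content BLOB)")

    # Store an image directly in the database as a binary blob
    with open("cat.jpg", "rb") as f:
        conn.execute(
            "INSERT INTO media (name, content) VALUES (?, ?)",
            ("cat.jpg", f.read()),
        )
    conn.commit()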
Is it possible to add an Excel-like spreadsheet data entry mechanism for ad-hoc data analysis (rather than receiving datasets from elsewhere)? If not, how should I create datasets?
I have a hard time believing people feel comfortable using Python applications in production for anything other than prototyping. I have seen some shit over my career, across many kinds of interpreted languages, that will never let me approve of that.
Unless one simply doesn't care about runtime quality.
Believe it. I've spent almost my entire career running Python applications in production, as have many of my friends, and many large companies that I've worked for or worked with.
Given the number of terrible, buggy sites I've seen built using Java or .NET I personally have trouble believing companies run those in production, but evidently they do!
A slightly less snarky answer: the thing I care about isn't the language, it's the process and environment around the project.
If I'm going to put something in production, I want it to have:
- Comprehensive tests, protected by CI
- Thorough, up-to-date documentation
- Code that lives in version control, with good commit messages that help answer "why" questions about how it works
- Good development environments
- A robust deployment process
The language influences these insofar as different languages have different cultures and tooling around them, but conceptually they are pretty language agnostic.
I know how to do all of these things well in Python, which is why I tend to continue to spend my time in Python land.
Yeah, it'd be crazy if sites like Reddit, SurveyMonkey, Dropbox, Spotify, Instagram, Pinterest, Lyft, and Sentry were built on a silly little prototyping language like Python instead of a real programming language. Right?
A Python web app with the wonderful Django ORM shows a lot more care about runtime quality than any Java or C# app that queries a SQL database with hand-assembled strings, or worse, Mongo, which - while not the norm anymore, thank god - I've seen way too many of.
(And I haven't seen any ORM to match Django's, in any language. Java and C# have horrible popular ORMs.)
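To make the contrast concrete, here's roughly what I mean - a sketch assuming a configured Django project with hypothetical models:

    from django.db import models

    class Author(models.Model):
        name = models.CharField(max_length=100)

    class Article(models.Model):
        title = models.CharField(max_length=200)
        author = models.ForeignKey(Author, on_delete=models.CASCADE)
        published = models.DateTimeField()

    # Parameterised, composable queries - no hand-assembled SQL strings
    recent = (Article.objects
              .filter(author__name="Ada", published__year=2022)
              .select_related("author")
              .order_by("-published")[:10])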