One of my favourite things about Git scraping is that, thanks to CI providers like GitHub Actions which are free for open projects, it costs nothing to run.
TerminusDB (co-founder here) was partially inspired by the git scraping approach to the revision-control-for-data problem. We built a database that gives you all of the functionality of git, but in a database, so you can query it, work with a commit graph, etc. It has git semantics for clone, fork, rebase, merge and the other major functions. We store & share deltas and use succinct data structures that allow us to pass around in-memory DBs that you can then query in place. We're relatively new, but happy to see data versioning wherever it might land. Open source forever, so really trying to be git for data. Technical white paper on our structure:
Very cool that it's based on RDF triples. I'm really curious about your product. What does performance look like? The TerminusDB website says it's in-memory? Can you have a database larger than your RAM?
RDF triples turn out to be a crucial part of the architecture as they make describing deltas really straightforward: it is just "these triples were added and these triples were taken away." Performance is good - you get a degradation in query time as you build more appended layers, but you can squash these to a single plane to speed up. Often we have a query branch where the layers are optimized for query and another branch with all the commit history in place. We are working on something we call delta roll-ups at the moment - these are like squashes that keep the history. Hopefully you'll soon be able to automate the roll-ups to keep query performance at a specified level (something like VACUUM in Postgres).
It is in-memory, so you are limited to what's in RAM for querying, but it persists to disk, and we are betting that memory is going to get bigger and cheaper over the next while.
Yes - the server is written in SWI-Prolog and the distributed store is in Rust. A great combo, we think.
We tried to take some of the best ideas from the semantic web and make them as practical as possible. Great to hear that people are getting knowledge graph jobs out there!
One comment on using git: isn't there a pull/push bottleneck that means it can't service any workload with a write rate greater than, say, one write every 30 seconds? "Write" is subject to interpretation of course, given commit != push (and I am unashamedly suggesting that a DB is better).
An excellent tool that can be used in conjunction with this approach is the word-level diff option for git diffs: `git diff --color-words`
Rather than processing entire lines (useful for code), it shows individual word changes, so it's great for tracking data as in the screenshots in the article and even better for text (output produced is similar to latexdiff for those familiar).
I highly recommend it... I have a git alias set up for this, and use it almost every day.
Mostly reviews of text edits on the books. I'll make some changes for a while, then add them to git, and review the changes with `git diff --color-words --cached` before I commit. Both the MATH&PHYS and LA books are about to get a point release (lots of typo fixes and improvements to certain explanations based on two years of reader feedback).
Even more useful is when collaborating with others, e.g., to see what changes the editor made. When I send diffs to her, I usually run `latexdiff` (takes >1h to generate on 500 pages), but when I check her changes I just use git.
Last but not least, I've been working on a French translation of the book and now I use the diffs to keep the two versions "in sync," meaning any improvements made to one edition are also made in the other. See https://minireference.com/blog/multilingual-authoring-for-th... for more info about this cool hack.
Yes! Great idea! I've been working on a similar, but different project recently. Using screenshots instead of git. I wanted to see the change in air quality due to the wildfires over time, so I'm making time-lapse videos: PurpleAir Air Quality 7200X Time-Lapse - Glass Fire - San Francisco Bay Area - 9/28-10/3 https://youtu.be/CFT-EIEmzfM
Thanks! Yes, it's puppeteer, plus imagemagick to add transparency using an overlay on an empty map, and ffmpeg to make the 60fps video. A lot of data, currently have over 200GB.
I actually suggested to GitHub back when the topics feature first came out that the ability to submit topic recommendations to repos would be beneficial, since maybe some repo owners don't know about topics, or don't know which ones would help their discovery, or maybe (like a lot of PRs) they would accept the suggestion but can't be bothered to do the legwork.
Now that ".github" is a thing, putting the topics in there would allow a real PR workflow for suggestions.
A course scheduler for my school, QuACS ( https://github.com/quacs/quacs ), was built entirely around git scraping. Around 7:30 AM EST each morning, a GitHub Action is kicked off to scrape various things (class hour schedule, course catalog, faculty directory, etc.) and produce a commit containing the current data. Then the website is rebuilt (also using GitHub Actions) and hosted on GitHub Pages. The only thing that's not static about the site is a call made to our Student Information System (SIS) to retrieve current enrollment/seat numbers for courses.
It's helpful as the data is now open for everyone in a nice format, so some other projects that utilize course data have been taking advantage of it.
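For anyone curious what the scrape half of a setup like this looks like, here is a minimal sketch - the endpoint URL, file path, and commit message are placeholders, not QuACS's actual code:

```python
# Rough sketch of a scrape-and-commit step like the one described above.
# The endpoint, file path, and commit message are placeholders, not QuACS code.
import json
import pathlib
import subprocess

import requests

CATALOG_URL = "https://example.edu/api/course-catalog.json"  # hypothetical feed


def main() -> None:
    data = requests.get(CATALOG_URL, timeout=30).json()

    # Write with sorted keys and fixed indentation so commits only show
    # diffs when the underlying data actually changes.
    path = pathlib.Path("data/catalog.json")
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps(data, indent=2, sort_keys=True) + "\n")

    # Stage the file and only commit when something changed; the GitHub
    # Actions workflow then pushes the new commit.
    subprocess.run(["git", "add", str(path)], check=True)
    if subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0:
        subprocess.run(["git", "commit", "-m", "Update course catalog"], check=True)


if __name__ == "__main__":
    main()
```

The workflow around it is just a cron-scheduled GitHub Action that checks out the repo, runs the script, and pushes.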
This is really cool, and coupled with a tool like https://github.com/augmentable-dev/askgit, could yield some interesting results. It could produce a table of specific values/text changing over time, measure frequency of changes, etc.
That's a very nice hack of GitHub Actions. The strength of it is that you get super reliable storage, but the risk is losing everything if you don't respect the GitHub terms of use.
The relevant section seems to be "Additionally, Actions should not be used for...any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used."
Don't get me wrong, I absolutely adore this usage of GH actions. But isn't this a TOS violation? I've thought about using GH actions as a generalized cron for online stuff and that feels like it walks the line (e.g. re-generating my static site via a Netlify webhook so it can update comments or whatnot). I feel okay about it because the static site is what's contained in the repo.
I've discussed this technique with GitHub employees in the past and no one has raised any concerns about it. Maybe I wasn't talking to the right employees though.
This is also a relatively simple way to run a useful one-page informational site if you combine it with GitHub Pages (or even just push to an S3 bucket/Netlify/whatever). You can even have search for your one-page informational site with libraries like lunr[0].
A lot of websites that are data driven (or community contribution driven) would benefit from this kind of model. For example, one thing I was thinking of building was an open database of laptops, similar to some other sites out there. Once you've got the data all in one place you can also start doing some interesting analytical queries with CI, and maybe add a second page with interesting trends.
I used to do this for job postings at target companies and a few other purposes in the heady era of 2008. One of the hassles is that some pages do not neatly contain their useful data in JSON, so the tiny deploy script becomes a little more complicated to support the various steps (a rough sketch follows below the list):
1. Collection: logins, curl
2. Pretty print: jq / tidy
3. Selectors: Beautiful Soup / jq
4. Annotation: svn / git commit
The GitHub Actions angle is new and welcome though.
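Something like the following covers the four steps for a page that doesn't expose clean JSON - the URL, CSS selector, and output file are made-up placeholders:

```python
# Rough sketch of the four steps above for an HTML page with no clean JSON.
# The URL, CSS selector, and output file are illustrative placeholders.
import json
import subprocess

import requests
from bs4 import BeautifulSoup

JOBS_URL = "https://example.com/careers"

# 1. Collection
html = requests.get(JOBS_URL, timeout=30).text

# 3. Selectors: pull the useful bits out of the markup
soup = BeautifulSoup(html, "html.parser")
postings = [
    {"title": a.get_text(strip=True), "url": a.get("href")}
    for a in soup.select("a.job-posting")
]

# 2. Pretty print: stable ordering and indentation keep the diffs readable
with open("jobs.json", "w") as f:
    json.dump(sorted(postings, key=lambda p: p["url"] or ""), f, indent=2, sort_keys=True)
    f.write("\n")

# 4. Annotation: record the snapshot in version control
subprocess.run(["git", "add", "jobs.json"], check=True)
if subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0:
    subprocess.run(["git", "commit", "-m", "Update job postings"], check=True)
```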
Oh, and I should mention, it's pretty cool to see how companies change job ads on a daily basis. Beyond just the spelling fixes, I've seen places advertise help-wanted postings for like a day and remove them the next day, on a regular basis. Or tack on an 'on call 24/7/365' statement. Or reuse a job posting but tack on the word 'Manager.'
If you're interested in Australia or New Zealand, we can chat. I have a scraper running on my spare laptop now, and am going to try porting it to GitHub Actions so it continues without me.
The source data is all HTML not JSON though, and I have to scrape pages, then parse job IDs, and then re-scrape the job listings themselves. Having it as a SQLite database is more helpful than the default search: e.g. all jobs that don't include the phrase "right to live and work in this location", all jobs that have email addresses, GROUP BY advertiser - features I wish for but don't expect will ever be added to the source site.
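Roughly, against an assumed `jobs(id, title, advertiser, description)` table (not the real schema), those queries look like:

```python
# Illustrative versions of the queries described above, against an assumed
# jobs(id, title, advertiser, description) table (not the real schema).
import sqlite3

db = sqlite3.connect("jobs.db")

# Jobs that never mention the residency requirement
no_residency = db.execute(
    "SELECT id, title FROM jobs WHERE description NOT LIKE "
    "'%right to live and work in this location%'"
).fetchall()

# Jobs that include an email address (a crude pattern match)
with_email = db.execute(
    "SELECT id, title FROM jobs WHERE description LIKE '%@%.%'"
).fetchall()

# Postings per advertiser
per_advertiser = db.execute(
    "SELECT advertiser, COUNT(*) AS n FROM jobs GROUP BY advertiser ORDER BY n DESC"
).fetchall()
```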
In retrospect it mostly didn't work. What I really learned from this is that help-wanted pages are a formality. I figure places that hire people who read HN pay recruiters, and applying directly with the company less often 'skips the line' and more often finds a direct line to the trashcan, since H-1B hiring laws require employers to try to find local talent first.
Nice, I didn't know that "Git scraping" was actually a term, but I used this technique to build an API which allows fetching HMRC exchange rates in JSON format.
The scrapers report errors to our Slack channel and we even have workflows to deactivate/re-activate a single scraper, since it's common that one fails and we need to fix it (to stop spamming the Slack channel with error messages).
I'm using cron on my cupboard server to update regularly rather than relying on Microsoft GitHub, and a standard post-update hook that emails me the diff when there are changes, so I can check out what new, funny, misspelled first names have been approved.
Git works really nicely for this kind of use, as long as you make sure you store the data in a "stable" and "diffable" format.
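For JSON sources, "stable" mostly just means writing the file the same way every time - something like this sketch (the `"id"` key is an assumed unique field):

```python
# One way to keep scraped JSON "stable" and "diffable": sort the records by a
# stable key and serialize with sorted keys and fixed indentation.
import json


def write_stable(path, records, key="id"):  # "id" is an assumed unique field
    records = sorted(records, key=lambda r: str(r[key]))
    with open(path, "w") as f:
        json.dump(records, f, indent=2, sort_keys=True, ensure_ascii=False)
        f.write("\n")
```

Without the sorting, an API that returns items in a different order on each request produces huge, meaningless diffs.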
I used to use git for backing up configs for network gear, most of which involved some sort of scraping because they didn't have APIs. It never really caught on with the rest of the network team, but I thought it was a great idea.
I've done a similar thing with APIs that return JSON. I write a small script which strictly queries the API and dumps the raw JSON into Postgres. Any structured data can be derived from the raw results (optionally with triggers), but what's key is that you never reject a response from the API - anything goes. A nice extension of this is having a distinct process using listen/notify on a derived table to drive notifications. I used this design to notify me of apartment listings changing and it was quite reliable.
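A minimal sketch of that pattern with psycopg2 - the table, channel, and endpoint names are made up for illustration:

```python
# Sketch of the "dump raw JSON, derive later" pattern described above.
# Table name, channel name, and endpoint are illustrative placeholders.
import json
import select

import psycopg2
import requests

conn = psycopg2.connect("dbname=listings")
conn.autocommit = True

# Ingest side: store the raw response verbatim; never reject anything.
payload = requests.get("https://example.com/api/listings", timeout=30).json()
with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO raw_responses (fetched_at, body) VALUES (now(), %s)",
        (json.dumps(payload),),
    )

# Notification side (a separate process in practice): a trigger on the derived
# table does NOTIFY new_listing; this loop wakes up and handles the alert.
with conn.cursor() as cur:
    cur.execute("LISTEN new_listing;")
while True:
    if select.select([conn], [], [], 60) == ([], [], []):
        continue  # timed out, keep waiting
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        print("new listing:", note.payload)
```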
I've used a similar technique to scrape GitHub profile statistics data and record it over time. It seems to work really well, particularly since the GitHub API doesn't record project view data for a period longer than 14 days.
So funny, because I've proposed the same thing! Not necessarily git, just a Merkle tree in general to diff changes to state over time. Super interesting read! The trade-off is that more CPU is required to build a picture of the state (you have to walk the tree) instead of more storage.
A decade ago, I had a professor who would update their syllabus site and not reliably tell the class. Wget in spider mode, a cron job, a Bitbucket repo with email notifications on push, and a Google Group helped out a lot!
Wow, that's a cool idea! I wonder, how do you handle bad responses (e.g. HTTP 400/500) from the server? It might mess up the logs twice if it's committed...
I think if that happens `curl` returns an error status which kills the Actions script before it gets to the commit - so you get an error message on the Action but nothing is committed to the repo.
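If the fetch step is a script rather than bare curl, the same guard is just an explicit status check - roughly like this (URL and file name are placeholders):

```python
# Same guard, made explicit: exit non-zero on a bad response so the workflow
# fails before the commit step ever runs. URL and file name are placeholders.
import sys

import requests

resp = requests.get("https://example.com/data.json", timeout=30)
if not resp.ok:
    sys.exit(f"Fetch failed: HTTP {resp.status_code}")

with open("data.json", "w") as f:
    f.write(resp.text)
```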
I just don't understand why Docker Hub doesn't have Atom or RSS feeds; it's the perfect situation for it: infrequent(?) changes that a large audience wants to be notified about.
The point is that this isn't particularly useful except to just see a raw diff of the files. The example shows some interesting data points about fires changing periodically, but there's basically nothing you can do with that information unless you put it in a real database.
Sure - that's what I did with my PG&E outages project (https://simonwillison.net/2019/Oct/10/pge-outages/). I wrote a Python script that iterated through the git commits and used them to create a SQLite database so I could run queries.
Essentially I was using the commit log as the point of truth for the data, and building a database as an ephemeral asset derived from that data.
What's different here is what you treat as the point of truth.
If the point of truth is the git repository and its history, then the SQLite database that you build from it is essentially a fancy caching layer - just like if you were to populate a memcached or redis instance or build an Elasticsearch index.
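A stripped-down version of that pattern looks something like this - not the actual PG&E script; the file name and schema here are placeholders:

```python
# Stripped-down sketch of replaying a git-scraping commit log into SQLite.
# Not the actual PG&E script; file name and schema are placeholders.
import json
import sqlite3
import subprocess


def git(*args):
    return subprocess.check_output(["git", *args], text=True)


db = sqlite3.connect("snapshots.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS snapshots "
    "(commit_hash TEXT, committed_at TEXT, item_id TEXT, data TEXT)"
)

# Oldest-first list of commits that touched the scraped file
for line in git("log", "--reverse", "--format=%H %cI", "--", "incidents.json").splitlines():
    commit_hash, committed_at = line.split(" ", 1)
    # The file exactly as it existed at that commit
    snapshot = json.loads(git("show", f"{commit_hash}:incidents.json"))
    for item in snapshot:
        db.execute(
            "INSERT INTO snapshots VALUES (?, ?, ?, ?)",
            (commit_hash, committed_at, str(item.get("id")), json.dumps(item)),
        )

db.commit()
```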
Not sure if databases really meet #3.
It may be hard for a database to beat the simplicity of Git when all you want is to add file revisions and look at diffs.
I actually think git is MUCH better suited to this than a traditional database.
I've built track-changes-over-time things in relational databases before and it's a real pain to implement!
I've actually started using a GitHub repository to back up the PostgreSQL database that powers my blog, because it's a really cheap way of getting a revision history of that database without having to write any extra code: https://github.com/simonw/simonwillisonblog-backup/commits/m...
Didn't cha know, innovation in the tech industry is all about using things differently than intended and building new compositions that result in the same end goals as older less sophisticated compositions.
A database is probably better if the schema doesn't evolve and the present state of the data fits neatly into a relational schema, because then you are just extending it with versioning and the transformation to a version tracking database is fairly mechanical.
But if the data is already in a textual format where [or which can be canonicalized so that] each line is an atomic unit, Git + existing common diff tools get you a lot with very little work and are resilient to schema changes.
You could also reduce the data to EAV triples and do a change-tracking DB for that, which would get you immunity to schema changes at the cost of losing all the value that a tracking DB has for known-schema data.
So, really, I'd say Git is the easy & general solution, though a DB might be worth the extra work if the schema was known to be fixed.
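To make the EAV point concrete, here is a toy sketch: flatten each record into (entity, attribute, value) triples and a snapshot diff becomes two set differences, regardless of what fields exist (the `"id"` key is an assumed entity identifier):

```python
# Toy sketch of the EAV idea: with records flattened to (entity, attribute,
# value) triples, a diff between snapshots is schema-independent.
def to_triples(records, key="id"):  # "id" is an assumed entity identifier
    return {
        (str(r[key]), attr, str(value))
        for r in records
        for attr, value in r.items()
    }


def diff(old_records, new_records):
    old, new = to_triples(old_records), to_triples(new_records)
    return new - old, old - new  # (triples added, triples removed)


added, removed = diff(
    [{"id": 1, "name": "Creek Fire", "containment": "32%"}],
    [{"id": 1, "name": "Creek Fire", "containment": "36%"}],
)
# added   == {("1", "containment", "36%")}
# removed == {("1", "containment", "32%")}
```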
Git is not suited to searching through the different fields that have changed (it's content-agnostic).
A database allows you to plot and search efficiently.
It's kinda neat but a pretty obvious TOS violation. I'm not sure github is really interested in people taking advantage of their free service to run their web scraper bots.
For now they are probably going to tolerate any usage, even if it harms them, in order to be the leader and kill the competition; then things will change. Monopoly 101.
https://github.com/simonw/ca-fires-history for example scrapes every 20 minutes and runs on the free GitHub Actions plan.
I've run these on free Travis CI and Circle CI plans in the past as well.
If you want to run a private Git scraper you can pay money to do so, which seems very reasonable to me.