Git scraping: track changes over time by scraping to a Git repository (simonwillison.net)
471 points by simonw 16 days ago | hide | past | favorite | 95 comments



One of my favourite things about Git scraping is that, thanks to CI providers like GitHub Actions which are free for open projects, it costs nothing to run.

https://github.com/simonw/ca-fires-history for example scrapes every 20 minutes and runs on the free GitHub Actions plan.

I've run these on free Travis CI and Circle CI plans in the past as well.

If you want to run a private Git scraper you can pay money to do so, which seems very reasonable to me.


Nifty! Do you plan to use that commit history in any way? Is it just for bookkeeping?


I'm hoping someone uses this data for something interesting - it may end up being me!

I ran the same kind of thing against PG&E outage data last year and used it to generate different visualizations: https://simonwillison.net/2019/Oct/10/pge-outages/


really appreciated your PGE stuff last year. I hope we won't need it, but if we do.. I hope you'll do it again this year. :)


TerminusDB (co founder here) was partially inspired by the git scraping approach to the revision control for data problem. We built a database that gives you all of the functionality of git, but in a database so you can query and with a commit graph etc. Has git semantics for clone, fork, rebase, merge and the other major functions. We store & share deltas and use succinct data structures that allow us to pass around in-memory DBs that you can then query in place. We're relatively new, but happy to see versioning data wherever it might land. Open source forever so really trying to be git for data. Technical white paper on our structure:

https://github.com/terminusdb/terminusdb-server/blob/dev/doc...


Very cool that it’s based off of RDF triples. I’m really curious about your product. What does performance look like? The TerminusDB website says it’s in memory? Can you have a database larger than your RAM?


RDF triples turn out to be a crucial part of the architecture as they make describing deltas really straightforward. It is just these triples were added and these triples were taken away. Performance is good - you get a degradation in query time as you build more appended layers, but you can squash these to a single plane to speed up. Often we have a query branch where the layers are optimized for query and another branch with all the commit history in place. We are working on something we call Delta roll ups at the moment - these are like squashes that keep the history. Hopefully you'll soon be able to automate the roll ups to keep query performance at a specified level (something like the vacuum cleaner in postgres). It is in-memory, so you are limited to what's in RAM for querying, but it persists to disk, and we are betting that memory is going to get bigger and cheaper over the next while.


+1 just wanted to say thanks for pointing out the use of RDF, I would have missed that.

I have been into the semantic web since year zero, and after years of doing deep learning work I just started a Knowledge Graph job two weeks ago.

Anyway thanks, I am going to dig into TerminusDB as soon as I get back from my morning hike.

EDIT: wow, TerminusDB is written in SWI-Prolog.


Yes - the server is in SWIPL and the distributed store is in Rust. Great combo we think.

We tried to take some of the best ideas from the semantic web and make them as practical as possible. Great to hear that people are getting knowledge graph jobs out there!


One comment on using git: isn't there a pull/push bottleneck that means it can't service any workload with a write rate greater than (say) one write every 30 seconds? "Write" is subject to interpretation of course, given commit != push (and I am unashamedly suggesting that a DB is better).


An excellent tool that can be used in conjunction with this approach is the word-level diff option for git diffs: `git diff --color-words`

Rather than processing entire lines (useful for code), it shows individual word changes, so it's great for tracking data as in the screenshots in the article and even better for text (output produced is similar to latexdiff for those familiar).

I highly recommend... I have a git alias setup for this, and use it almost every day.
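
For anyone curious what word-level diffing means in practice, here is a rough Python sketch of the idea using the standard `difflib` module. It is only an illustration and not how git implements it; the `word_diff` name is made up, and the `[-old-]` / `{+new+}` markers are borrowed from git's plain word-diff output.

```python
import difflib


def word_diff(old: str, new: str) -> str:
    """Compare two strings word by word instead of line by line.

    Removed words are wrapped as [-word-], added words as {+word+}.
    """
    old_words = old.split()
    new_words = new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)

    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.extend(old_words[i1:i2])
            continue
        if i2 > i1:  # words present only in the old text
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j2 > j1:  # words present only in the new text
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)


# Made-up sample data in the spirit of the fire-tracking example.
print(word_diff("acres burned: 1200 contained: 40%",
                "acres burned: 1350 contained: 45%"))
```

Splitting on whitespace is the simplest possible tokenizer; it throws away the original spacing, which is fine for eyeballing data changes but not for producing a patch.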


That's a great tip, thanks. Looks like you can use that option with 'git show' too:

    git show --color-words


Oh cool. I just learned a new git verb today! (Previously I always used `git diff <commit-1> <commit>` but `git show <commit>` is much simpler.)


> [I] use it almost every day.

What do you use it for? I'm wondering which web pages or what information are so interesting that one looks for changes almost every day :-)

Is it maybe for finding out if contents linked by Minireference text books change in significant ways? Or move to different URLs?


Mostly reviews of text edits on the books. I'll make some changes for a while, then add them to git, and review the changes `git diff --color-words --cached` before I commit. Both MATH&PHYS and LA books are about to get a point-release (lots of typo fixes and improvements to certain explanations based on two years of reader feedback).

Even more useful is when collaborating with others, e.g., to see what changes the editor made. When I send diffs to her, I usually run `latexdiff` (takes >1h to generate on 500 pages), but when I check her changes I just use git.

Last but not least, I've been working on a French translation of the book and now I use the diffs to keep the two versions "in sync," meaning any improvements made to one edition are also made in the other. See https://minireference.com/blog/multilingual-authoring-for-th... for more info about this cool hack.


magit users can enable this in all diffs by setting the variable `magit-diff-refine-hunk'.


Thanks! I have written my last half dozen books using markdown + leanpub.com, keeping my book artifacts in GitHub. I am going to try this!


Yes! Great idea! I've been working on a similar, but different project recently. Using screenshots instead of git. I wanted to see the change in air quality due to the wildfires over time, so I'm making time-lapse videos: PurpleAir Air Quality 7200X Time-Lapse - Glass Fire - San Francisco Bay Area - 9/28-10/3 https://youtu.be/CFT-EIEmzfM


How did you capture all these screenshots? Great video!


You can use Puppeteer to drive the web scraping and use page.screenshot() to capture images. You can find a quick example in the docs below:

https://github.com/puppeteer/puppeteer/blob/v5.3.1/docs/api....


Thanks! Yes, it's puppeteer, plus imagemagick to add transparency using an overlay on an empty map, and ffmpeg to make the 60fps video. A lot of data, currently have over 200GB.


Purpleair also has a JSON api if you want the raw data behind those screenshots.

Scarifying! Thanks!


I wrote something like this for archiving public Spotify playlists: https://github.com/mackorone/spotify-playlist-archive

It’s nice to know I can easily find songs I liked based on when I listened to them.


That's such a great example of this pattern in action.

Would you be up for adding the git-scraping topic so it shows up on https://github.com/topics/git-scraping ?


I actually suggested to GitHub back when the topics feature first came out that I thought the ability to submit topic recommendations to repos would be beneficial, since maybe some repo owners don't know about topics, or don't know which ones would help their discovery, or maybe (like a lot of PRs) would accept the suggestion but can't be bothered to do the legwork

Now that ".github" is a thing, putting the topics in there would allow a real PR workflow for suggestions


I'd love to be able to do that.


Done!


A course scheduler for my school, QuACS ( https://github.com/quacs/quacs ), was built entirely around git scraping. Around 7:30AM EST each morning, a GitHub Action is kicked off to scrape various things (class hour schedule, course catalog, faculty directory, etc) and produce a commit containing the current data. Then, the website is rebuilt (also using GitHub Actions) and hosted on GitHub Pages. The only thing that's not static about the site is a call made to our Student Information System (SIS) to retrieve current enrollment numbers for seats in courses.

It's helpful as the data is now open for everyone in a nice format, so some other projects that utilize course data have been taking advantage of it.


That's really smart. I found the scraping action here: https://github.com/quacs/quacs-data/blob/master/.github/work...

Looks like it's even tracking covid data released by the school: https://github.com/quacs/quacs-data/commits/master/covid.jso...


This is really cool, and coupled with a tool like https://github.com/augmentable-dev/askgit, could yield some interesting results. It could produce a table of specific values/text changing over time, measure frequency of changes, etc


Wow, finding this tool has made my day - thank you for making it!

I did a quick skim through the code, and it appears that askgit reads the repo into a SQLite DB first, then queries that. Is that right?

Also, have you looked at Fossil for an alternative VCS for this work, or is that not a consideration here?


I wrote a similar git scraper for tracking bug bounty programs and their scopes. Here's the data, updated hourly:

https://github.com/arkadiyt/bounty-targets-data

And the code:

https://github.com/arkadiyt/bounty-targets


That's a fantastic example - and it looks like you've had that running for three years now! https://github.com/arkadiyt/bounty-targets-data


That's a very nice hack of GitHub Actions. Its strength is super reliable storage, but the risk is that you lose everything because you don't respect the GitHub terms of use.

Here are the terms of use for Actions, if anyone is interested: https://docs.github.com/en/free-pro-team@latest/github/site-...


The relevant section seems to be "Additionally, Actions should not be used for...any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used."

Don't get me wrong, I absolutely adore this usage of GH actions. But isn't this a TOS violation? I've thought about using GH actions as a generalized cron for online stuff and that feels like it walks the line (e.g. re-generating my static site via a Netlify webhook so it can update comments or whatnot). I feel okay about it because the static site is what's contained in the repo.


I've discussed this technique with GitHub employees in the past and no-one has raised any concerns about it. Maybe I wasn't talking to the right employees though.


No, that's great to hear! My concerns are assuaged.


> super reliable

i've been scraping daily for most of this year, and fwiw gh actions fail pretty regularly. here's my actions log: https://github.com/sw-yx/gh-action-data-scraping/actions?que...

so if you want reliable data scraping you'll actually have to build in some sort of retry capability so as not to lose data. just fyi for anyone here
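
A minimal sketch of what that retry capability might look like in Python, assuming the scraper is a plain function you can call again. The `with_retries` helper, its parameters and the default delays are all hypothetical, not taken from any of the repos linked here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: let the job fail visibly
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

The `sleep` argument is injectable only so the backoff can be tested without actually waiting. This covers a flaky fetch inside a run; it does nothing for a scheduled run that never starts in the first place.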


I tend to run these against things that are unlikely to complain about it - most government data sources are OK in my experience.

If you're scraping against the terms of service of something you'll probably want to pay for a private repository.


I use this pattern to scrape the Redfin and Zillow estimates for the value of my house: https://github.com/williamsmj/real-estate-scrape/.


This is also a relatively simple way to run a useful one-page informational site if you combine with Github Pages (or even just push to an S3 bucket/netlify/whatever). You can even have search for your one-page informational site with libraries like lunr[0].

A lot of websites that are data driven (or community contribution driven) would benefit from this kind of model. One thing I was thinking of building was, for example, an open database of laptops, similar to some other sites out there. Once you've got the data all in one place you can also start doing some interesting analytical queries with CI and maybe add a second page with interesting trends.

[0]: https://lunrjs.com/guides/getting_started.html


I used to do this for job postings in target companies and a few other purposes in the heady era of 2008. One of the hassles is that some pages do not neatly contain their useful data in JSON, so the tiny deploy script becomes a little more complicated to support the various steps:

1. Collection: logins, curl

2. Pretty print: jq / tidy

3. Selectors: Beautiful Soup / jq

4. Annotation: svn / git commit

The Github Actions angle is new and welcome though.


Oh, and I should mention, it's pretty cool to see how companies change job ads on a daily basis. Beyond just the spelling fixes, I've seen places advertise help wanted postings for like a day and remove them the next day on a regular basis. Or tack on a 'on call 24/7/365' statement. Or reuse a job posting but tack on the word 'Manager.'


That's so interesting. Next time I'm job seeking I'll definitely look into doing this.


If you're interested in Australia or New Zealand, we can chat. I have a scraper running on my spare laptop now, and am going to try porting it to Github Actions so it continues without me.

The source data is all HTML not JSON though, and I have to scrape pages, then parse job IDs, and then re-scrape the job listings themselves. Having it as a SQLite database is more helpful than the default search: e.g. all jobs that don't include the phrase "right to live and work in this location", all jobs that have email addresses, GROUP BY advertiser - features I wish but don't expect would ever be added to the source site.


In retrospect it mostly didn't work. What I really learned from this is that help-wanted pages are a formality. I figure places that hire people who read HN pay recruiters, and applying directly with the company less often 'skips the line' and more often finds a direct line to the trashcan, since H-1B hiring laws require employers to try to find local talent first.


Nice, I didn't know that "Git scraping" was actually a term, but I used this technique to build an API which allows fetching HMRC exchange rates in JSON format.

https://github.com/matchilling/hmrc-exchange-rates


We are using this approach to scrape the COVID-19 data in Switzerland.

https://github.com/openZH/covid_19

There is a common runner script and a matrix configuration to run this script for each Swiss canton (see https://github.com/openZH/covid_19/blob/master/.github/workf...).

The scrapers report errors to our slack channel and we even have workflows to deactivate/re-activate a single scraper, since it's common that one fails and we need to fix it (to stop spamming the slack channel with error messages).


I've been recording changes to the three lists of approved first names in Denmark in git for the past 8 years: https://koldfront.dk/git/godkendtefornavne/about/

I'm using cron on my cupboard server to update regularly rather than relying on Microsoft GitHub, and a standard post-update hook that emails me the diff when there are changes, so I can check out what new, funny, misspelled firstnames have been approved.

Git works really nicely for this kind of use, as long as you make sure you store the data in a "stable" and "diffable" format.
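
One way to read "stable" and "diffable" for JSON data, as a hedged Python sketch: sort the keys, sort the records, and pretty-print so each value lands on its own line. The `canonical_json` helper and its `sort_key` parameter are hypothetical, and it assumes the payload is either a JSON object or a list of objects that all carry the sort field.

```python
import json
from typing import Optional


def canonical_json(raw: str, sort_key: Optional[str] = None) -> str:
    """Re-serialize JSON so that identical data always yields identical text."""
    data = json.loads(raw)
    if sort_key is not None and isinstance(data, list):
        # Upstream ordering often changes between requests; pin it down.
        data = sorted(data, key=lambda item: item[sort_key])
    # One value per line plus sorted keys keeps the line-based diffs small.
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
```

Without a step like this, a source that merely reorders its records produces a commit that touches every line even though nothing really changed.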


I've needed something like this for trying to wrangle third party javascript libraries (which you load from their servers):

* tracking changes

* kicking off a selenium test run when their js changes

* including their js into your js bundle to reduce network calls (git scrape => cd pipeline)

Very cool!


I used to use git for backing up configs for network gear, most of which involved some sort of scraping because they didn't have APIs. It never really caught on with the rest of the network team, but I thought it was a great idea.


Sounds like Infrastructure as Code to me. Shame it didn't catch on.


I’ve done a similar thing with APIs that return JSON. I write a small script which strictly queries the API and dumps the raw JSON into Postgres. Any structured data can be derived from the raw results (optionally with triggers) but what’s key is you will never reject a response from the API - anything goes. A nice extension of this is having a distinct process using listen/notify on a derived table to drive notifications. I used this design to notify me of apartment listings changing and it was quite reliable.


I've used a similar technique to scrape GitHub profile statistics data and record it over time. It seems to work really well, particularly since the GitHub API doesn't record project view data for a period longer than 14 days.

https://github.com/jstrieb/github-stats


I do this for FDA databases. Great to keep track of competitors and new product categories. Internal repo though.


So funny because I’ve proposed the same thing! Not necessarily git, just a merkle tree in general to diff changes to state over time. Super interesting read! The trade off being more cpu required to build a picture of state than storage requirements due to having to walk the tree.


A decade ago, I had a professor who would update their syllabus site and not reliably tell the class. Wget in spider mode, a cron job, a Bitbucket repo with email notifications on push, and a Google Group helped out a lot!


the problem with this is it's not easy to pull data out from git. eg: i want to visualize the changes over time.


Torchbear makes this very simple using the Gut version control system. It's Speakeasy

https://github.com/foundpatternscellar/run-tests/blob/master...

https://github.com/speakeasy-engine/gut

Much better even to parse the data in Lightouch and save it in ContentDB


Wow, that's a cool idea! I wonder, how do you handle bad responses (e.g. HTTP 400/500) from the server? It might mess up the logs twice if it's committed...


I think if that happens `curl` returns an error status which kills the actions script before it gets to the commit - so you get an error message on the action but nothing is committed to the repo.
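
Worth double-checking in your own workflow: as far as I know, plain `curl` exits 0 on an HTTP 4xx/5xx response unless you pass `--fail`, so the error page could still get committed. If the fetch step is a script instead, here is a minimal Python sketch of "stop before the commit"; the `scrape` function and its arguments are hypothetical, and it relies on `urllib.request.urlopen` raising `HTTPError` for error statuses.

```python
import sys
import urllib.error
import urllib.request


def scrape(url: str, out_path: str, opener=urllib.request.urlopen) -> int:
    """Download url to out_path; return a non-zero exit code on any fetch error."""
    try:
        with opener(url, timeout=30) as response:
            body = response.read()
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        print(f"fetch failed, leaving {out_path} untouched: {exc}", file=sys.stderr)
        return 1
    # Only reached on success, so a bad response never overwrites good data.
    with open(out_path, "wb") as f:
        f.write(body)
    return 0


# In a real script you would end with something like:
#   sys.exit(scrape("https://example.com/data.json", "data.json"))
# so that a failed fetch fails the CI step before any `git commit` runs.
```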


Ah, that makes sense, thanks!


A nice complement to this would be tooling that treats file revisions as time series data, e.g. plot the fire contained percentage over time.

Any existing tools out there?


Not that I've seen. I've written code to do that using the GitPython library in the past, but I had to write it custom for each project - e.g. https://simonwillison.net/2019/Oct/10/pge-outages/
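
For what it's worth, a generic version is not much code. Below is a rough sketch that shells out to the `git` command line instead of using GitPython, under a few loud assumptions: the tracked file is JSON with an object at the top level, the field you want is numeric, and the `file_versions` / `series` names and the `percent_contained` field are made up for illustration.

```python
import json
import subprocess
from typing import Iterable, Iterator, List, Tuple


def file_versions(repo: str, path: str) -> Iterator[Tuple[str, str]]:
    """Yield (commit_date, file_contents) for each commit touching path, oldest first."""
    log = subprocess.run(
        ["git", "-C", repo, "log", "--reverse", "--format=%H %cI", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout
    for line in log.splitlines():
        commit, date = line.split(" ", 1)
        shown = subprocess.run(
            ["git", "-C", repo, "show", f"{commit}:{path}"],
            capture_output=True, text=True,
        )
        if shown.returncode != 0:
            continue  # e.g. the commit deleted the file
        yield date, shown.stdout


def series(versions: Iterable[Tuple[str, str]], key: str) -> List[Tuple[str, float]]:
    """Pull one numeric field out of each JSON version of the file."""
    points = []
    for date, content in versions:
        value = json.loads(content).get(key)
        if value is not None:
            points.append((date, float(value)))
    return points


# Hypothetical usage: series(file_versions(".", "fire.json"), "percent_contained")
```

The resulting list of (date, value) pairs can go straight into whatever plotting tool you like. The per-project part that is hard to generalize is the bit that digs the value out of each file.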


Somewhat related Twitter thread from a few days ago: https://twitter.com/Programazing/status/1313479031069835266


Nice. Been thinking of doing the same for an ETF composite I’ve been following. I’d love to see how it changes over time.


Haha, I've written a scraper for docker tags for Elixir images this way. I'm even gonna open source it this weekend...


I just don't understand why docker hub doesn't have Atom or RSS feeds; it's the perfect situation for it: infrequent(?) changes that a large audience wants to be notified about


Git is eating the database world.


for the good


So using git instead of a database when a database would be more suited?


Git is a database. It's designed for storing the changes made to files over time.

But hey, if there's a different kind of database that you think would be better here, I'd be fascinated to see it. Just make sure that it can

1. handle major structural changes to the files being tracked (e.g. JSON schema revisions)

2. store and retrieve these changes as efficiently as git

3. do the above with as little work required by the user

Good luck!


the point is that this isn't particularly useful except to just see a raw diff of the files. the example shows some interesting data points about fires changing periodically but there's basically nothing you can do with that information unless you put it in a real database.


Sure - that's what I did with my PG&E outages project (https://simonwillison.net/2019/Oct/10/pge-outages/). I wrote a Python script that iterated through the git commits and used them to create a SQLite database so I could run queries.

Essentially I was using the commit log as the point of truth for the data, and building a database as an ephemeral asset derived from that data.
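
To make the "ephemeral database derived from the commit log" idea concrete, here is a minimal sketch. It is not the actual PG&E script: it assumes the file versions have already been pulled out of git as (commit hash, commit date, JSON text) tuples, that each version is a JSON list of records, and that every record has an `id` field. The `build_db` name and the `snapshots` table are hypothetical.

```python
import json
import sqlite3
from typing import Iterable, Tuple


def build_db(
    versions: Iterable[Tuple[str, str, str]],
    db_path: str = ":memory:",
) -> sqlite3.Connection:
    """Load every record from every committed version into one SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots "
        "(commit_hash TEXT, commit_date TEXT, item_id TEXT, data TEXT)"
    )
    for commit_hash, commit_date, json_text in versions:
        for item in json.loads(json_text):
            conn.execute(
                "INSERT INTO snapshots VALUES (?, ?, ?, ?)",
                (commit_hash, commit_date, str(item["id"]),
                 json.dumps(item, sort_keys=True)),
            )
    conn.commit()
    return conn
```

Because the database can be rebuilt from the repository at any time, there is no need to get the schema right up front: drop the file, change the script, and run it again.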


so full cycle, back to a database :)


What's different here is what you treat as the point of truth.

If the point of truth is the git repository and its history, then the SQLite database that you build from it is essentially a fancy caching layer - just like if you were to populate a memcached or redis instance or build an Elasticsearch index.


A decentralised revision control database revolution is brewing!


Is that sarcasm? Because again you're describing a database.


Not sure if databases really meet #3. It may be hard for a database to beat the simplicity of Git when all you want is add file revisions and look at diffs.


I actually think git is MUCH better suited to this than a traditional database.

I've built track-changes-over-times things in relational databases before and it's a real pain to implement!

I've actually started using a GitHub repository to back up the PostgreSQL database that powers my blog, because it's a really cheap way of getting a revision history of that database without having to write any extra code: https://github.com/simonw/simonwillisonblog-backup/commits/m...


I think you want temporal tables for historical dimensions. Postgres doesn't have good support, as it's a bolt-on. MariaDB has good support for it now.


I'm not sure I understand the difficulty, your columns are the fields in your object, the rows are the different points in time


Didn't cha know, innovation in the tech industry is all about using things differently than intended and building new compositions that result in the same end goals as older less sophisticated compositions.


A database is probably better if the schema doesn't evolve and the present state of the data fits neatly into a relational schema, because then you are just extending it with versioning and the transformation to a version tracking database is fairly mechanical.

But if the data is already in a textual format where [or which can be canonicalized so that] each line is an atomic unit, Git + existing common diff tools get you a lot with very little work and are resilient to schema changes.

You could also reduce the data to EAV triples and do a change-tracking DB for that, which would get you immunity to schema changes at the cost of losing all the value that a tracking DB has for known-schema data.

So, really, I'd say Git is the easy & general solution, though a DB might be worth the extra work if the schema was known to be fixed.


I’ve put a git-scraped history of 12 years of the FAA registration database into an EAVT database (Datomic) and it worked reasonably well.


Can you expand on why a database would be more well suited?

It seems to me that if the goal is to track changes over time, git is very well suited to that.


git is not suited to search through the different fields that have changed (it's content agnostic). A database allows you to plot and search efficiently.


Well you could say folders are like "tables" and files are like "fields" within those tables...

The main difference between git and a traditional DB is querying and relationships between fields


This is a reliably solid way of finding passwords and access keys


It's kinda neat but a pretty obvious TOS violation. I'm not sure github is really interested in people taking advantage of their free service to run their web scraper bots.


for what it's worth, when i did it (https://www.swyx.io/github-scraping/) and tweeted about it (https://twitter.com/swyx/status/1219739210434994177), github employees replied supportively. i really doubt they care unless you are running at massive scale.


for now they are probably going to tolerate any usage even if you harm them in order to be the leader and kill the competition, then things will change. 101 monopoly


If by github employees you mean the devs, I expect them to like it.

It's the managers and lawyers that you need to worry about.


Q: What part of this is a TOS violation?



