https://github.com/simonw/ca-fires-history for example scrapes every 20 minutes and runs on the free GitHub Actions plan.
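The scraping step itself is only a few lines of shell; the workflow just runs something along these lines on a cron schedule (the URL and file name here are made up, not the actual repo's script):

    # Fetch the latest data, commit only if it changed
    curl -s https://example.com/fires.json > fires.json
    git add fires.json
    git diff --quiet --cached || git commit -m "Latest data: $(date -u +%Y-%m-%dT%H:%M:%S)"
    git push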
I've run these on free Travis CI and Circle CI plans in the past as well.
If you want to run a private Git scraper you can pay money to do so, which seems very reasonable to me.
I ran the same kind of thing against PG&E outage data last year and used it to generate different visualizations: https://simonwillison.net/2019/Oct/10/pge-outages/
I have been into the semantic web since year zero, and after years of doing deep learning work I just started a Knowledge Graph job two weeks ago.
Anyway thanks, I am going to dig into TerminusDB as soon as I get back from my morning hike.
EDIT: wow, TerminusDB is written in SWI-Prolog.
We tried to take some of the best ideas from the semantic web and make them as practical as possible. Great to hear that people are getting knowledge graph jobs out there!
Rather than diffing entire lines (which is what's useful for code), it shows individual word changes, so it's great for tracking data like in the screenshots in the article, and even better for text (the output is similar to latexdiff, for those familiar with it).
I highly recommend it... I have a git alias set up for this, and use it almost every day.
git show --color-words
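For reference, one way to set it up as aliases (the alias names are just what I'd pick, nothing standard):

    git config --global alias.wdiff 'diff --color-words'
    git config --global alias.wshow 'show --color-words'
    # then, for example:
    git wshow HEAD      # word-level diff of the latest commit
    git wdiff HEAD~5    # word-level diff against five commits back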
What do you use it for? I'm wondering what web pages / what information is interesting enough that one checks for changes almost every day :-)
Is it maybe for finding out whether content linked from the Minireference textbooks changes in significant ways, or moves to different URLs?
It's even more useful when collaborating with others, e.g. for seeing what changes the editor made. When I send diffs to her, I usually run `latexdiff` (which takes >1h to generate on 500 pages), but when I check her changes I just use git.
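For anyone who hasn't used it, latexdiff works roughly like this (file names here are placeholders):

    # mark up the changes between two versions of the source
    latexdiff old/main.tex new/main.tex > diff.tex
    # compile diff.tex as usual to get a PDF with changes highlighted
    pdflatex diff.tex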
Last but not least, I've been working on a French translation of the book and now I use the diffs to keep the two versions "in sync," meaning any improvements made to one edition are also made in the other. See https://minireference.com/blog/multilingual-authoring-for-th... for more info about this cool hack.
It’s nice to know I can easily find songs I liked based on when I listened to them.
Would you be up for adding the git-scraping topic so it shows up on https://github.com/topics/git-scraping ?
Now that ".github" is a thing, putting the topics in there would allow a real PR workflow for suggestions
It's helpful as the data is now open for everyone in a nice format, so some other projects that utilize course data have been taking advantage of it.
Looks like it's even tracking covid data released by the school: https://github.com/quacs/quacs-data/commits/master/covid.jso...
I did a quick skim through the code, and it appears that askgit reads the repo into a SQLite DB first, then queries that. Is that right?
Also, have you looked at Fossil for an alternative VCS for this work, or is that not a consideration here?
And the code:
Here are the terms of use for Actions, if anyone is interested: https://docs.github.com/en/free-pro-team@latest/github/site-...
Don't get me wrong, I absolutely adore this usage of GH actions. But isn't this a TOS violation? I've thought about using GH actions as a generalized cron for online stuff and that feels like it walks the line (e.g. re-generating my static site via a Netlify webhook so it can update comments or whatnot). I feel okay about it because the static site is what's contained in the repo.
I've been scraping daily for most of this year, and FWIW GH Actions fail pretty regularly. Here's my actions log: https://github.com/sw-yx/gh-action-data-scraping/actions?que...
So if you want reliable data scraping you'll actually have to build in some sort of retry capability so as not to lose data. Just FYI for anyone here.
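Something as simple as a retry loop in the scrape step goes a long way (just a sketch, not my actual workflow):

    # try the fetch a few times before giving up
    for i in 1 2 3 4 5; do
        curl -sf https://example.com/data.json -o data.json && break
        echo "attempt $i failed, retrying in 60s" >&2
        sleep 60
    done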
If you're scraping against the terms of service of something you'll probably want to pay for a private repository.
A lot of websites that are data driven (or community contribution driven) would benefit from this kind of model. One thing I was thinking of building, for example, is an open database of laptops, similar to some other sites out there. Once you've got the data all in one place you can also start doing some interesting analytical queries with CI, and maybe add a second page with interesting trends.
1. Collection: logins, curl
2. Pretty print: jq / tidy
3. Selectors: Beautiful Soup / jq
4. Annotation: svn / git commit
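Strung together, that whole pipeline is only a few lines (URLs and field names here are invented):

    # 1-2. collect and pretty print so diffs stay line-oriented
    curl -s https://example.com/api/items | jq . > items.json
    # 3. pull out just the fields of interest
    jq '[.items[] | {id, name, price}]' items.json > items-trimmed.json
    # 4. annotate: the commit history is the change log
    git add items.json items-trimmed.json
    git commit -m "Scraped $(date -u +%F)"

The Beautiful Soup / tidy branch is the same idea for HTML sources.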
The GitHub Actions angle is new and welcome though.
The source data is all HTML rather than JSON though, and I have to scrape the index pages, parse out the job IDs, and then re-scrape the job listings themselves. Having it as a SQLite database is more helpful than the default search: e.g. all jobs that don't include the phrase "right to live and work in this location", all jobs that have email addresses, GROUP BY advertiser - features I'd like but don't expect will ever be added to the source site.
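For example (table and column names here are just illustrative, not my actual schema):

    # jobs mentioning an email address, grouped by advertiser
    sqlite3 jobs.db "
      SELECT advertiser, count(*) AS n
      FROM jobs
      WHERE description LIKE '%@%'
      GROUP BY advertiser
      ORDER BY n DESC;
    "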
There is a common runner script and a matrix configuration to run this script for each Swiss canton (see https://github.com/openZH/covid_19/blob/master/.github/workf...).
The scrapers report errors to our Slack channel, and we even have workflows to deactivate/re-activate a single scraper, since it's common for one to fail and need fixing (to stop it spamming the Slack channel with error messages).
I'm using cron on my cupboard server to update regularly rather than relying on Microsoft GitHub, and a standard post-update hook that emails me the diff when there are changes, so I can check out what new, funny, misspelled firstnames have been approved.
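The cron job itself is only a handful of lines (the URL and paths here are placeholders, and my actual hook setup differs slightly):

    # nightly: fetch, commit if changed, push
    cd /srv/firstnames
    curl -s https://example.org/approved-firstnames.csv > names.csv
    git add names.csv
    git diff --quiet --cached || {
        git commit -m "Update $(date +%F)"
        git push    # the post-update hook on the receiving repo mails me the diff
    }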
Git works really nicely for this kind of use, as long as you make sure you store the data in a "stable" and "diffable" format.
* tracking changes
* kicking off a selenium test run when their js changes
* including their JS in your JS bundle to reduce network calls (git scrape => CD pipeline)
Even better would be to parse the data in Lightouch and save it in ContentDB.
Any existing tools out there?
But hey, if there's a different kind of database that you think would be better here, I'd be fascinated to see it. Just make sure that it can
1. handle major structural changes to the files being tracked (e.g. JSON schema revisions)
2. store and retrieve these changes as efficiently as git
3. do the above with as little work required from the user as possible
Essentially I was using the commit log as the point of truth for the data, and building a database as an ephemeral asset derived from that data.
If the point of truth is the git repository and its history, then the SQLite database that you build from it is essentially a fancy caching layer - just like if you were to populate a memcached or redis instance or build an Elasticsearch index.
I've built track-changes-over-time systems in relational databases before, and it's a real pain to implement!
I've actually started using a GitHub repository to back up the PostgreSQL database that powers my blog, because it's a really cheap way of getting a revision history of that database without having to write any extra code: https://github.com/simonw/simonwillisonblog-backup/commits/m...
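The general shape of that kind of backup job is roughly this (a generic sketch rather than my exact setup):

    # dump the database and commit the dump
    pg_dump --no-owner blogdb > backup/blogdb.sql
    cd backup
    git add blogdb.sql
    git diff --quiet --cached || git commit -m "Backup $(date -u +%F)"
    git push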
But if the data is already in a textual format where [or which can be canonicalized so that] each line is an atomic unit, Git + existing common diff tools get you a lot with very little work and are resilient to schema changes.
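For JSON, canonicalizing can be as simple as (file names arbitrary):

    # sort keys and pretty-print so each field sits on its own line
    jq --sort-keys . data.json > canonical.json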
You could also reduce the data to EAV triples and do a change-tracking DB for that, which would get you immunity to schema changes at the cost of losing all the value that a tracking DB has for known-schema data.
So, really, I'd say Git is the easy & general solution, though a DB might be worth the extra work if the schema was known to be fixed.
It seems to me that if the goal is to track changes over time, git is very well suited to that.
The main difference between git and a traditional DB is querying and the relationships between fields.
It's the managers and lawyers that you need to worry about.