
How to Scrape 20M Records from GitHub - cglee
http://www.opensourcewatch.io/story/
======
minimaxir
Yikes, HTML scraping? That's bad/against TOS, especially when GitHub has a
robust API.

~~~
michaelrm
:) We explored using the API. Unfortunately, at the volume of data being
pulled down, it isn't feasible due to the number of authorized API keys that
would be needed.
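
For a rough sense of scale: GitHub's authenticated REST API is limited to
5,000 requests/hour per token, so a quick back-of-envelope calc (assuming,
hypothetically, one record per request) shows the problem:

    # Back-of-envelope sketch. Assumes GitHub's documented limit of
    # 5,000 authenticated REST requests/hour/token and, hypothetically,
    # one record per request.
    RECORDS = 20_000_000
    REQS_PER_HOUR_PER_TOKEN = 5_000

    hours_with_one_token = RECORDS / REQS_PER_HOUR_PER_TOKEN   # 4000.0 hours
    # To finish in, say, a 48-hour window you'd need ~84 tokens:
    tokens_for_48h = RECORDS / (REQS_PER_HOUR_PER_TOKEN * 48)  # ~83.3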

~~~
minimaxir
And you don't see this as unethical?

~~~
cglee
This is a huge topic of debate, but the consensus from "experts" seems to be
that if it's public data that's not behind some auth, then it's fair game. By
the way, this is what web crawlers, like Google's, do.

See: [https://www.quora.com/Is-website-scraping-legal-and-ethical](https://www.quora.com/Is-website-scraping-legal-and-ethical)

But I don't necessarily want to go down this rabbit hole here, as it detracts
from the technical issues outlined in the article. If you haven't read the
entire article, I recommend that you do; the team had to overcome many
interesting technical challenges along the way.

~~~
minimaxir
A _little_ web scraping is fine, if and only if the website lacks a canonical
way to do so (i.e. an API). Ignoring the API and scraping hard enough that you
have to write an article _assessing scalability of the scraping_ is bad.

~~~
michaelrm
A bit more context may be helpful. While scraping scalability was a concern
for us (for latency reasons), it mattered mainly for retrieving the historical
data as a one-time job, after which we throttled back. A later section
discusses a priority queue object (SRRPQQ) that we use in lieu of scaling
further.
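
To make the idea concrete, here's a minimal sketch of the general shape (this
is not the actual SRRPQQ, whose details are in the article; the class and
parameter names here are hypothetical): a priority queue of scrape targets
drained through a fixed delay, trading throughput for politeness instead of
scaling out.

    import heapq
    import time

    class ThrottledScrapeQueue:
        """Toy sketch, not the article's SRRPQQ: lower priority value
        means scraped sooner, and a fixed delay between requests stands
        in for horizontal scaling."""

        def __init__(self, delay_seconds=1.0):
            self.delay = delay_seconds
            self.heap = []

        def push(self, priority, url):
            # Tuples compare by priority first, so the heap pops the
            # most urgent URL next.
            heapq.heappush(self.heap, (priority, url))

        def drain(self, fetch):
            while self.heap:
                _, url = heapq.heappop(self.heap)
                fetch(url)              # e.g. requests.get(url)
                time.sleep(self.delay)  # throttle instead of scaling out

The point of the design is that once the one-time historical backfill is done,
ongoing freshness is a scheduling problem (which targets to hit next, how
fast) rather than a capacity problem.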

