
Web Scraping to Create Open Data - stummjr
https://blog.scrapinghub.com/2016/03/30/web-scraping-to-create-open-data/
======
jnotarstefano
I had to use a similar approach when creating a cluster analysis of the
amendments in the Italian Senate [0].

The Italian Senate offers a SPARQL endpoint [1], which unfortunately doesn't
offer access to the texts of the amendments. So I had to roll my own and
create a small spider for them using Scrapy [2].

[0]: [https://github.com/jacquerie/senato.py/blob/master/analysis....](https://github.com/jacquerie/senato.py/blob/master/analysis.ipynb)

[1]: [http://dati.senato.it/23](http://dati.senato.it/23)

[2]: [https://github.com/jacquerie/senato.py/blob/master/senato/sp...](https://github.com/jacquerie/senato.py/blob/master/senato/spiders/senato_spider.py)
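To illustrate the extraction step such a spider performs, here is a minimal sketch using only the Python standard library. It is not the linked Scrapy spider: the class name, the helper function, the assumption that amendment text lives in `<p>` elements, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphTextParser(HTMLParser):
    """Collects the text found inside every <p> element of a page."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # number of <p> elements currently open
        self._chunks = []      # text fragments of the paragraph being read
        self.paragraphs = []   # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace and skip empty paragraphs.
                text = " ".join("".join(self._chunks).split())
                if text:
                    self.paragraphs.append(text)
                self._chunks = []

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def extract_paragraphs(html):
    """Return the non-empty paragraph texts of an HTML document."""
    parser = ParagraphTextParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page content, invented for this example.
SAMPLE_HTML = """
<html><body>
  <h1>Amendment 1.1</h1>
  <p>In paragraph 1, replace <em>six months</em>
     with twelve months.</p>
  <p>   </p>
  <p>Delete paragraph 2.</p>
</body></html>
"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE_HTML):
        print(paragraph)
```

A real spider would also need to fetch pages, follow links, and handle the site's actual markup, which is the part a framework like Scrapy takes care of.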

------
harperlee
So what is the legality of this? Apart from the risk of having someone pull
the plug on the way one takes the information out, when can something without
a proper license be used?

~~~
dsp1234
In the US, there is no copyright protection for "facts" on their own. However,
a compilation/database of facts can have copyright protection based on a
three-part test[0].

    
    
        1. the collection and assembly of pre-existing material, facts, or data;
        2. the selection, coordination, or arrangement of those materials; and
        3. the creation, by virtue of the particular selection, coordination, or arrangement, of an original work of authorship.
    

But specifically there is no protection for the underlying facts themselves,
and there is no "sweat of the brow" doctrine. So scraping the data, and
rearranging the underlying facts into your own arrangement/organization is
almost always not copyright infringement. However, if that data is categorized
in some non-trivial way, and you keep that organization, then that is likely
to be copyright infringement.

However, if what you're scraping are not "facts", but some creative works,
such as blog posts, product descriptions, etc, then it is likely to be
copyright infringement.

Then on top of that, even if there is copyright infringement, other defenses
such as a license to use the data, or fair use may apply.

[0] - [http://www.pddoc.com/copyright/compilation.htm](http://www.pddoc.com/copyright/compilation.htm)

~~~
toomuchtodo
> So scraping the data, and rearranging the underlying facts into your own
> arrangement/organization is almost always not copyright infringement.

I'm not so sure. It would definitely be illegal in the US for me to
cherry-pick data out of Google Maps and add it to OpenStreetMap (and OSM has
policies addressing _exactly this_).

~~~
iolothebard
Yet companies like LexisNexis get most of the data they resell this way.

~~~
toomuchtodo
Are they scraping copyrighted data? Or public records? Big difference.

~~~
iolothebard
Facts aren't copyrightable.

They scrape everything in the world they can get their hands on.

~~~
toomuchtodo
Collections of facts are:
[https://www.unc.edu/courses/2006spring/law/357c/001/projects...](https://www.unc.edu/courses/2006spring/law/357c/001/projects/dougf/node5.html)

------
minimaxir
I'm not fond of the implication at the end that scraping is justifiable
because old websites are dinosaurs without APIs, and those websites are
_jerks_ for not doing so, and therefore scraping is the _moral_ thing to do.

I've scraped my share of BuzzFeed data and Foursquare data to make data
visualizations (with the latter explicitly saying "don't scrape" in their
Terms). But if either one told me to stop and take down my results, I would
not contest, since data is what drives the Internet ecosystem.

(For the record, neither service did; in fact, both tried to recruit me as a
result of the visualizations. The difference is that I am not using the data
to create a direct competitor that could cause them to lose business.)

~~~
kh_hk
Disclaimer, I wrote the article.

> I'm not fond of the implication at the end that scraping is justifiable
> because old websites are dinosaurs without APIs, and those websites are
> jerks for not doing so, and therefore scraping is the moral thing to do.

It was not my intention to imply that. The main idea behind CityBikes is that
public services should already provide this information since, well, it _is_
a public service. Along the same lines, a private company providing a public
service should already do so. See motives [1].

> I've scraped my share of BuzzFeed data and Foursquare data to make data
> visualizations (with the latter explicitly saying "don't scrape" in their
> Terms). But if either one told me to stop and take down my results, I would
> not contest, since data is what drives the Internet ecosystem.

Same as CityBikes is doing. If we receive a cease and desist, we remove their
service from our API. As for Foursquare, I do not see Foursquare as a public
service. Your tax dollars at work, and all that.

I tried to keep the article balanced but maybe it wasn't clear. There are many
transportation companies willing and happy to be scraped, or looking forward
to providing their information for people to reuse [2].

[1]: [https://blog.scrapinghub.com/2016/03/30/web-scraping-to-crea...](https://blog.scrapinghub.com/2016/03/30/web-scraping-to-create-open-data/#benefits)

[2]: [http://nabsa.net/current-members/](http://nabsa.net/current-members/)

~~~
seanp2k2
Why does your blog intentionally crash browsers that it thinks are Safari?

~~~
mryan
You appear to have replied to the wrong comment. ScrapingHub is not the site
that attempts to crash Safari - that is weboob.

------
rakoo
"Web scraping to create Open Data" is the exact reason why weboob
([http://weboob.org/](http://weboob.org/)) was created and still thrives
today. CityBikes already seems to be doing a big part of the job, and in
Python no less, so it should be easy to integrate its data and use it with
Boobsize
([http://weboob.org/applications/boobsize.html](http://weboob.org/applications/boobsize.html))

~~~
maxaf
That naming scheme definitely needs a long, hard rethink.

~~~
rakoo
It's funny, every time Weboob is presented somewhere, and every time there is
a post about the latest version of Weboob, the first comment is a variation of
"it's sexist/boobs are unprofessional/grow up", and very _very_ little time is
spent talking about the actual thing, what it does and why its only goal is to
become irrelevant. Sad thing.

Here's what they have to say about it, and why there's very little chance they
will change anything:

[http://laurent.bachelier.name/2013/12/weboob-the-asshole-det...](http://laurent.bachelier.name/2013/12/weboob-the-asshole-detector/)

(This comment is not directed at you directly)

~~~
WaxProlix
> It's funny, every time Weboob is presented somewhere, and every time there is
> a post about the latest version of Weboob, the first comment is a variation
> of "it's sexist/boobs are unprofessional/grow up", and very very little time
> is spent talking about the actual thing

Seems like a bad name then, no?

~~~
JamilD
There is a legitimate argument to be made if the naming was unintentional or
if people were making a big deal out of nothing.

But with application names like 'Handjoob', 'Boobsize', and 'Flatboob', I
think the naming is unavoidably distracting and inappropriate.

------
PlzSnow
Can anyone tell me which cloud provider they are using? I want to make sure
that scrapinghub are on the list. I block the IP addresses of all the major
cloud providers to prevent parasites such as this.
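For readers wondering what such a block list amounts to in practice, here is a minimal sketch of checking a client address against a set of CIDR ranges with Python's standard `ipaddress` module. The ranges are placeholders from the RFC 5737 documentation blocks, not any real provider's addresses, and the names `BLOCKED_NETWORKS` and `is_blocked` are hypothetical.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real cloud ranges.
# A real list would have to come from each provider's published ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(client_ip, networks=BLOCKED_NETWORKS):
    """Return True if client_ip falls inside any of the given networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "203.0.113.5"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

A linear scan like this is fine for a handful of ranges; with thousands of ranges the check would more likely live in a firewall or use a prefix-tree lookup.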

------
l1n
Heh. I do this with my Student Government data [1].

[1] [https://umbc.lin.anticlack.com/finance/](https://umbc.lin.anticlack.com/finance/)

~~~
yeukhon
You need to fix the certificate before showing that to the public.

