
IANAL but since the pages are published under Creative Commons Attribution-ShareAlike, if someone wishes to collect the text on the basis of the HTML version then there's not much you can do about it.

Wikimedia no doubt have caching, CDNs and all that jazz in place, so the impact on infrastructure is probably de minimis in the grand scheme of things (compared with the thousands of human page views the site serves every second).




> IANAL but since the pages are published under Creative Commons Attribution-ShareAlike, if someone wishes to collect the text on the basis of the HTML version then there's not much you can do about it.

They said "please don't", not "don't do it or we'll sue you".

But content license and site terms of use are different things.

From their terms of use, you aren’t allowed to:

> [Disrupt] the services by placing an undue burden on a Project website or the networks or servers connected with a Project website;

Wikipedia is also well within their rights to implement scraping countermeasures.


Yes, but they aren't going to care about just 2400 pages.

As a general rule, make your scraper non-parallel, set a User-Agent that includes contact details in case of an issue, and you're probably all good.
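
A rough sketch of that advice (the bot name, contact address and page list here are placeholders, not anything Wikimedia prescribes):

  import time
  import requests

  # Sequential, low-speed fetching with a descriptive User-Agent that
  # includes contact details, per the etiquette described above.
  HEADERS = {"User-Agent": "my-research-bot/0.1 (https://example.com; you@example.com)"}
  PAGES = ["https://en.wikipedia.org/wiki/Example"]  # placeholder URL list

  texts = {}
  for url in PAGES:
      resp = requests.get(url, headers=HEADERS, timeout=30)
      resp.raise_for_status()
      texts[url] = resp.text
      time.sleep(1)  # one request at a time, with a pause between hits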

After all, Wikipedia is meant to be used. Don't be unduly disruptive and don't scrape 20 million pages, but scraping a couple thousand is totally acceptable.

Source: used to work for Wikimedia, albeit not in the SRE department. My opinions are of course totally my own.


I don’t think the OP was talking specifically to the content author, but rather to all the people who read the article and get the idea to scrape Wikipedia.


Honestly, I'd rather people err on the side of scraping Wikipedia too much than live in fear of being disruptive and not do cool things as a result. Wikipedia is meant to be used to spread knowledge. That includes data-mining projects such as the one in this blog post.

(Before anyone takes this out of context: no, I'm not saying it's OK to be intentionally disruptive or to do things without exercising any care at all. Also, always set a unique, descriptive User-Agent with an email address if you're doing anything automated on Wikipedia.)


Having been on the other side of this, I’d rather we encourage people to make use of formats/interfaces designed for machines and use the right tool for the job instead of scraping everything.
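
In Wikipedia's case the machine-friendly route is the MediaWiki API (or the database dumps) rather than the rendered pages. A minimal sketch using the standard api.php query interface; the article title and contact address are placeholders:

  import requests

  # Ask the MediaWiki action API for plain-text extracts instead of
  # scraping and parsing rendered article HTML.
  API = "https://en.wikipedia.org/w/api.php"
  HEADERS = {"User-Agent": "my-research-bot/0.1 (you@example.com)"}
  params = {
      "action": "query",
      "format": "json",
      "prop": "extracts",
      "explaintext": 1,     # plain text rather than HTML
      "titles": "Example",  # placeholder article title
  }
  resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
  resp.raise_for_status()
  for page in resp.json()["query"]["pages"].values():
      print(page.get("extract", "")[:200])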

It’s incredibly easy for careless scrapers to disrupt a site and cost real money without having a clue what they’re doing.

I want people to think twice and consider what they are doing before they scrape a site.


> [Disrupt] the services by placing an undue burden on a Project website or the networks or servers connected with a Project website;

Two things:

  1) The English Wikipedia *alone* gets 250 million page views per day! So you would have to be doing an awful lot to cause "undue burden".

  2) The Wikipedia robots.txt page openly implies that crawling (and therefore scraping) is acceptable *as long as* you do it in a rate-limited fashion, e.g.:

  > Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please.

  > There are a lot of pages on this site, and there are some misbehaved spiders out there that go _way_ too fast.

  > Sorry, wget in its recursive mode is a frequent problem. Please read the man page and use it properly; there is a --wait option you can use to set the delay between hits, for instance.
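
  For instance, the --wait option it mentions (URL purely illustrative):

    wget --recursive --level=1 --wait=2 --user-agent="my-bot/0.1 (you@example.com)" https://en.wikipedia.org/wiki/Example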


1. You'd be surprised what kind of traffic scrapers can generate. I've seen scraping companies employ botnets to get around rate limiting, easily costing enough in extra server fees to constitute an "undue burden".

At a previous company we had exactly this problem: we published all of our content as machine-readable XML, but scrapers still cost us money by insisting on using our search interface to access it.

2. No one is going to jail for scraping a few thousand or even a few million pages, but just because low-speed web crawlers are allowed to index the site doesn't mean scraping for every possible use is permitted.


"Who's gonna stop me" is kind of a crappy attitude to take with a cooperative project like Wikipedia.

I mean, sure, you can do a lot of things you shouldn't with freely available services. There's even an economics term that describes this: the Tragedy of the Commons.

Each individual fish poacher's haul is also de minimis.



