Ask HN: How do you feel about web scraping?
141 points by jackschultz on June 6, 2017 | 163 comments
Most sites seem to have a ToS saying that scraping is not allowed, but in lots of cases that restriction seems unwarranted. If you see that, does it change your opinion?

Does the purpose of your scraping make a difference, if one use is just a project but another would be selling the data?

Are comments on sites like this public data or private?

What about sports statistics that are sometimes kept "private" by the league, rather than just open for people to use and write interesting articles about?

Overall thoughts?

I actually created a site (I edited and deleted the mention of the name here because apparently people don't believe that I don't care about people looking at the site.) that scrapes comments and posts from Reddit that link to Amazon products and shows that information. I'm thinking of adding Amazon links from HN and other sites, but just not sure about how people would feel about scraping from sites like this.




I built a simple CRUD app for a previous (small) employer. Nothing special technology-wise, but a good concept, sound business model, and backed up with a couple of full-time staff creating content for it. Line one of the T&Cs was "no scraping". Business model was based on sales to individual users but we were prepared to do analysis in aggregate if asked.

A scraper company, funded by magic money (Knight Foundation grants) and $1m of VC, convinced a (UK) Government department to pay them to scrape our site for some analysis the department wanted. They'd never contacted us, never asked for permission, never asked if we could supply the data. Our company was bumping along at this point and having to lay people off. Income from a nice lucrative Government contract would have kept a couple more people in work.

The scraper company's FAQ was, in my view, full-on unethical:

> "we check the robots.txt file. If the site permits robots in general to scrape their site (NOT just GoogleBot!), then we will do so. We will make no effort to look for other terms and conditions as well."

You will ostentatiously "make no effort to look" for T&Cs in case they prohibit the significant contract you're about to sign with the Government? Whoa.

So how I feel about web scraping is simple: "don't be evil". If you're diverting income or traffic from the original site, don't do it. If you're genuinely adding value, go for it, but be open, be prepared to work with the original site, and be prepared to accede to their wishes.


Put the Terms and Conditions (the part relevant to scraping) in the /robots.txt as well.
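
For instance (a sketch; robots.txt permits "#" comment lines, which crawlers ignore but scraper authors will see):

    # Terms of use: automated scraping of this site requires
    # written permission. Contact: admin@example.com
    User-agent: *
    Crawl-delay: 10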


Yes. Did that after this episode.


Were you seriously expecting bots to read your T&C? Or anyone, for that matter? Did you mention that it was okay for Google to scrape your site?


We're not talking generic "bots".

We're talking a custom scraper written for this site and this site only.

Yes, I am expecting the people who spend hours inspecting the source of my site, and then writing a custom scraper for it, to spend 30 seconds reading the T&Cs first.


Not sure why you'd expect that. If my web browser can download your source code, my software will as well.

If you want people to read it, put your content behind a sign-up with a checkbox.


It is _already_ behind a sign-up with a checkbox. They scraped their way past that too.


How? (Seriously, how does one do this?)


Simply log in first, then perform the scrape programmatically.

Seen here: https://kazuar.github.io/scraping-tutorial/
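
For the curious, a minimal sketch of that in Python with requests (the URL and form field names are hypothetical placeholders; a real site's login form will differ):

    import requests

    session = requests.Session()  # keeps cookies across requests

    # Log in once; the session stores whatever auth cookie the server sets.
    session.post("https://example.com/login",
                 data={"username": "me", "password": "secret"})

    # Subsequent requests go out as the logged-in user.
    page = session.get("https://example.com/members-only/reports")
    print(page.status_code, len(page.text))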


Oh... I thought they were able to circumvent logging in and could scrape directly... That makes much more sense now, thank you...


Ah, that changes things somewhat.


You could rate limit the site and, when a limit is hit, replace paragraphs with lorem ipsum.


Did your service offer a paid API? Scraping happens because of a lack of better options. Surely you can understand why the scrapers didn't want to contact you beforehand.


If you want my data on a paid API basis, then ask me about it. I need to know how big the demand is from third-party users before I even prioritize building a paid API; having the goddamn courtesy to ask for something would give me an idea.

If you're using my data to hijack my traffic, without asking, you could have all the right justifications in the world but you're still a prick. Who knows, maybe your orphanage-building app will move me to tears once I hear about it and I'll give you free access.


In what world is "engage a third-party scraping company" a better option than "drop a quick email to the site operator"?


Because with scraping you are in a legal grey area. But if you contact the site directly and they say "no", then there is no excuse to scrape.


Well, yeah, like if you ask someone to sell you their dog and they say "no". Doesn't justify stealing the dog.


Your analogy doesn't hold up. Your example is clearly theft, and is a criminal matter. Violation of a site's terms is a civil one, and again is a legal grey area. Scraping the site doesn't delete the content from a server...but there is only one copy of the dog.

EDIT - I can't reply to your comment below, but FWIW I agree that scraping sites in this manner is unethical. I am merely describing the logic that most scrapers go through for self justification and legal protection.


The analogy was "doing something anyway because you might be told 'no'", not "my server behaves like a canine quadruped". Copyright infringement is also potentially criminal in the UK, it's not as simple as you suggest.

But whatever. It just saddens me that the internet is a constant "don't be a dick" battle with companies like the scraper guys.

(edit - understood :) )


Yes, because they did not want to pay. This is not some complex issue.

Incidentally, I'll be shopping later. May I give you $5 to drive to London and take me there? I was going to steal your car, but then I figured I'd be a good citizen and demand you provide it to me at your own, prohibitive, loss.


How would the ToS apply to me if I never read them? If I write a script that downloads a page then I just downloaded it, never viewed it.

If you put something on an HTTP endpoint then I can download it. The terms of use of the material just regulate what I can do with it (republish it), but I can't possibly be forbidden to download it.

Anyone with a web server is of course free to just block my attempt to download their page, via a user-agent filter or any other method.

Edit: IANAL so I have zero idea how law is actually applied around the world. My view is just that "how on earth would it be possible to enforce a contract I never saw?".

My view of course also says that I'm free to download the data at

http://secret.somecompany.com:8081/secrets/data.bin

Because they exposed it at a public http endpoint. I'm not so sure everyone (including somecompany) would agree with that. I think it's still perfectly reasonable that I'm allowed to do that - and that the restrictions that may apply to material I reach that way is only related to how I use it. Them posting it publicly on the internet (even if it's not linked) is the same as them having it on a billboard. I very much doubt that's how law works though.


> My view of course also says that I'm free to download the data at http://secret.somecompany.com:8081/secrets/data.bin Because they exposed it at a public http endpoint.

That's a poor example, because they most likely did not intend for that information to be publicly available, and just did a poor job securing it.

Real-world analogy: if you happen to find someone's house with an unlocked front door, that does not give you permission to go looking around inside. It doesn't matter that you don't steal anything, and it doesn't matter that nobody knows you were there. You're still not allowed in and by going in you're breaking the law.

If your example was for content that's retrieved as the default web page for a public DNS entry, or a page that's reachable by following links from the website's default page (including urls in meta, link, and script tags), then I think you're on solid standing as far as being allowed to retrieve the content. At that point, copyright laws and ToS govern what you're allowed to do with the content, with the safe assumption being that you're allowed to view it, but not republish it (except for fair-use excerpts.)


> if you happen to find someone's house with an unlocked front door, that does not give you permission to go looking around inside.

But in this case you are asking the server for permission to come in. Unlike a lock, a server has situational awareness to decide if it says yes or no. We even call it a server, à la servant. If the maid acting as an agent for the homeowner at this hypothetical house invited you in, against the wishes of the homeowner, are you still breaking the law?


Like I said, someone who does a poor job securing their content, but who is not exposing it directly via publicly-exposed links, is not granting public access.

Your example is more akin to the maid wandering off leaving the front door open. That's not an invite to come in, that's a poor choice of maid.


I respectfully disagree. We wouldn't use terms like 'request' and 'server' if the mechanism were akin to a door standing wide open.

We can agree that a maid that lets the wrong people in is a poor choice of maid, but the maid is the one responding to the request. A door/lock does not deal with requests at all.


In this scenario, though, the maid is not choosing to wander off; the homeowner is telling the maid to leave the front door open.

To extend this analogy a little bit, let's say that the homeowner told the maid that she can "give the old TV, the toaster or any other item in the garage to anyone who asks." A stranger comes and asks the maid for the golf clubs in the garage that the homeowner forgot about.

In this case, the homeowner did not explicitly expose the golf clubs to the public, but gave permission to the maid to give them out. Did the homeowner not grant public access to the golf clubs?


I agree - I suppose it matters mostly how I find it. If I could be shown to have been poking around (walking up to every door to find an unlocked one) then certainly I have some dubious intent.

The question is of course in what situations I should have known I stumbled across something they didn't intend for me to see.

Google's scraper certainly walks up to every door, and then publishes what it finds. If you accidentally publish something and Google indexes it, and someone reaches it via Google, it's hard to say it wasn't public.


No, that's not how Google's scraper works. They only follow publicly-exposed links, and they check your robots.txt file to see if they should ignore any links they might have come across that you intended to be private.

If you accidentally put a link on your public website to content you didn't want public, and Google indexes it, then someone else fetches it, yeah, that's your fault for exposing it. It's also your responsibility to tell Google to remove it from their index. But if someone is poking around to see if urls are valid (e.g. the /.ssh/id_rsa requests my website gets) that's breaking and entering, even though I'd be negligent for leaving my keys on the webserver like that.


Actually, I don't think they state that that's the only way they find content. I've had plenty of sites with obscure domain names for development and they always seem to find their way onto Google somehow.

I imagine that in addition to public URLs they also use URLs entered into Chrome (which aren't that public), and since they host DNS they probably crawl domain names requested through them too.

Legally, I'm not sure what ground they stand on to do that, but they're big enough now, so I guess they don't worry too much about it nowadays.


I know - but let's assume that since I and Google could reach the document, it was inadvertently put on a public server. Since it was a mistake, it won't be filtered in robots.txt either.

For example, in the weeks before an Apple event, people tend to visit presumed unannounced product pages (by guessing) etc.


Well, if you go to another country and you don't know the laws, you will still get arrested if you break the law.


Sure. That's the starting point in that case. The question is whether the starting assumption for someone making an HTTP request should be that it's a "service" whose terms I'm expected to find, read, and follow.

I'm guessing that varies and is decided by legal precedent only once it becomes an issue.


Most countries apply intellectual property law to anything published on the internet. In other words, when it comes to something you pull down via http, you should assume that, if you attempt to republish it in any form, someone out there might decide that it's worth their time to use the court system to destroy you, so you should take the time to read the ToS.


Yes - like I said, I think it's perfectly fine that the material I download from the internet has terms and laws associated with it. That is, what I do with what I download is subject to laws as soon as I e.g. try to publish it.

But my point is this: what service have I used, and what terms can I be assumed to have agreed to, if I do this:

    $ wget -qO- "$random_url" > /dev/null
That's what web scraping is. What I do with the scraped material is something completely different.


In that case, you are bound to these terms of service: https://tools.ietf.org/html/rfc2616


IANAL, but it came up in a computing science/internet law course and in the Netherlands the current jurisprudence is that if you use (and continue to use) a service you are reasonably expected to read, understand, and adhere to the rules and conditions set by the service.

In this case it thus doesn't matter whether or not you even saw the contract. You are bound by the ToS because it can be expected of you to search for it during continued use of the service.


What happens if the TOS state that you are not allowed to use an ad-blocker while viewing their website? (Honest question)


There is another side to ToS in the Netherlands that I haven't mentioned: the ToS has to be reasonable. There is a blacklist of terms which cannot appear in a ToS under any circumstances because they are unreasonable to ask of users. For example, it is illegal for a company to require users to agree that when they buy goods from their shop they cannot return them whatsoever. There is also a gray list: these are conditions that companies may ask users to agree to if and only if they can prove it is a reasonable condition. Notice that this is a reversal of the burden of proof: normally a user has to prove that some terms in the ToS are unreasonable, which is hard. Now the company has to prove that their terms are reasonable, which is also hard but better for the user.

For example, it is illegal for companies to document the religious beliefs of a person, BUT if the service is for a certain religious community (e.g. a forum for a church) then a term that users have to disclose their religious beliefs is reasonable (given that they don't share this information with anyone, of course).

As for the ads, I don't know. It might be unreasonable to have users agree to subject themselves to third-party ads in order to use the service. But again, I am not a lawyer.


Disclaimer: Pure conjecture.

I'd say the TOS have to apply to the "service" you're consuming and not the website itself. Otherwise, what happens if you use just a text-based browser? Or any other means not foreseen by the TOS?

An ad-blocker would be just the same, it's something that your browser does.


Intent matters. Ad blockers are not the same as text based browsers.


I think this differs from country to country then, because I'm pretty sure that in e.g. Germany a contract is not valid or enforceable without prior explicit agreement from both parties. In Germany most ToS and EULAs are void anyway, because they are not presented to the user or buyer before the purchase or start of the transaction. Also, if a ToS contains a serious error, the whole ToS is invalid.

However, in other countries such as the US contract law is totally different, so this does not generalize at all.

Also, IANAL.


However, if a ToS attempts to remove enshrined rights or includes terms that are particularly onerous, then this can invalidate the section or sometimes all of the ToS.


There was a case in Denmark where a large real-estate agency was forced to pay damages caused by keeping a large real-estate listings website from scraping their listings.


If you publish the information, you might be breaking copyright rules. In some jurisdictions a collection of non-copyrighted items (like some catalog) might be protected by law.


If one wants to separate the question of scraping from the question of publishing something scraped (which are entirely different things), then the thought experiment should be e.g. that you scrape a catalog and your page only displays "right now there are 10224 articles in the IKEA catalog" or "right now 1234 bikes are for sale on craigslist".

That material has been reduced to a number - which certainly can't be considered protected material.


This is why I've always had a bit of a trigger when I see signs in a publicly accessible place that say "No Photography!", as I, personally, am of the opinion that if I have been able to freely and legally walk to a spot where I can see something with my eyes, I should be free to photograph or record it. It is my experience after all, and my experiencing something that is legal to see should not be infringed upon.


It's kind of shocking that you can't "see" any difference between a picture and looking with your eyes.

Also disappointing that you're (perhaps willfully) ignoring the many reasons why photography is banned/strongly discouraged in some public places.


Oh I can see it - I just don't agree with you or those that would say that they should not be able to photograph such things...

That's all - I just don't agree with you.


I think there is a contradiction between web scraping ToS and internet neutrality. Allowing a site to be scraped (no different from crawling nowadays) only by Google but not by others violates that principle and concentrates power within a few companies.


Yeah, wtf? Everyone lets Google crawl their site but it's naughty for anyone else to do it... such bullshit!


Probably because scraping by search engines is a transaction that benefits both sides, as search engines bring traffic to the content's source.

"it's naughty for anyone else to do it" if that everyone are companies that will be using or monetizing that content without providing any kind of attribution or compensation to the source/creator.


Except now that Google shows the info directly in the search results.


Google isn't the only search engine.


Allowing a site to only be scraped/crawled by professionals (i.e. Google) ensures that those engaging in such activity do so in a respectable manner.

It would be ideal to define the terms for scraping (similar to how API calls work) as opposed to the acceptable authors of scrapers.


I am not sure you can define those terms in a formal way. You can crawl the data in some way but use it for many different purposes that are difficult to control.


Fancy extension to robots.txt?


They are professionals now, but at the beginning they were not at all.


I would imagine even in the beginning that they, as developers, practiced good procedures while scouring the internet.

Given the difference in the size of the internet now and then, the backlash faced for poor practices may not be comparable... so perhaps not.


We had a client who contacted us with an issue: the site goes down every 15 minutes. After a couple of hours of debugging we found the culprit. Some Swedish equal rights server was "scraping" the site every 15 minutes with a massive ddos. We put a captcha up for that server IP only (let them scratch their heads now) and let it be. If you decide to scrape a site, don't bring it down, and don't assume it's a big server. Keep it civil and without impact on performance and nobody minds.


Correct call on keeping it civil. I know for all the scraping I do I put my name and contact info in the header so if people try to look and find an issue they know how to get in contact. But I also make sure I'm not making requests that would ever affect their servers. Gotta stay nice.


On this note, does anyone know of a drip/gradual scrape process for not overloading a host's servers, such that the process could run over the span of a day/week? I'm not worried about the content changing rapidly.


Uhhh, have a brain and common sense when you write your scraper?

The simplest thing to do is NOT parallelize your scraper, and sleep for a few seconds between requests.
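
Something like this, as a rough sketch (the URL list is hypothetical):

    import time
    import requests

    urls = ["https://example.com/page/%d" % i for i in range(1, 501)]

    for url in urls:
        resp = requests.get(url, headers={
            "User-Agent": "my-scraper/0.1 (contact: me@example.com)"})
        # ... parse resp.text here ...
        time.sleep(10)  # single-threaded, one request every ~10 seconds

    # 500 pages at 10 s apart finishes in under two hours; raise the
    # sleep to spread the crawl over a day or week.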


Why have you not reported it to them and/or the police?


You don't need police in environments where you can defend yourself.


OK, so for them it is no longer a problem, but what about other people? It is their ethical duty to report it to the police; otherwise they're an accessory to the crime (in some countries there are laws to enforce this).


If you make a shitty site that can't handle a few concurrent requests, that's your problem. It's not a DDoS. Not even a DoS. It's just someone requesting some pages you make available, and you failing to serve them.

Or do you also plan on suing Google once they scrape your site?


We were contacted as outsiders to come in and have a look. We managed to stop the site from crashing with configuration, but basically the site was being bombarded with too many dynamic requests and, having no cache, it took its toll. Their server did not wait for responses; it just sent a bunch of requests all at once.


Any decent webserver can throttle requests per source ip.


Looks like somebody didn't read the parent.

> "scraping" site every 15 minutes with massive ddos

Google doesn't scrape every 15 minutes. It was implied that they scraped the site often enough to cause a DDoS.


Looks like someone has never run a website with even a slight amount of traffic. A few requests every 15 minutes is nothing and your site should be able to handle it.

Additionally, it clearly wasn't a DDoS. Quoting from the parent:

> Some Swedish equal rights server was "scraping" the site every 15 minutes with a massive ddos. We put a captcha up for that server IP only (let them scratch their heads now) and let it be

If there is only one server ip, it is by definition not distributed. Just a misuse of the term DDoS.


> Google doesn't scrape every 15 minutes.

Google has a variable timing crawler. Google News crawls in almost realtime, but all Google crawlers back off if the site responds slow.

Additionally, Google respects `/robots.txt` (which allows the site owner to define a crawl delay) and uses sitemaps for hints, so it doesn't have to re-crawl every single HTTP object "every 15 minutes".


Yes, agreed, though you must also take into account the idea of the "straw that breaks the camel's back".


We gave the info to the client. We were in no position to report it.


How is a single server a ddos?

Did you mean "We couldn't handle the traffic caused by their scraping, which led to a denial of service"?


You are correct. DoS was the correct term.


I consult for a sh#tload of aggregators, all of them involving at least some kind of web scraping; most of the time quite a lot of the data sources were not asked directly for their approval.

The simple formula is: "Create more value than you take." - read it as "Create more value for those whom you take from than the value you actually take from them."

If you scrape sports results from a sports page and create a competing product: as#hole move, don't do it, and you will get sued anyway.

Scrape sports results from a sports page and aggregate them into nice charts, but for detailed inspection the user has to go to the source (which you link to and want the user to use): you are good to go (from a product point of view; legally, i.a.n.a.l).

The internet is not a zero-sum-game. Building on top of each other, even on data on top of each other, will lead to a better ecosystem, to more value for every participant.


In a couple places it sounds like you're interested in scraping HN -- if that's the case there is an official API: https://github.com/HackerNews/API
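
For example, fetching the current top story takes two GETs (endpoints as documented in that repo):

    import requests

    base = "https://hacker-news.firebaseio.com/v0"

    # List of top story IDs, then the details of the first one.
    top = requests.get(base + "/topstories.json").json()
    item = requests.get("%s/item/%d.json" % (base, top[0])).json()
    print(item["title"], item.get("url"))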

My own take on it in general is that for personal/research use I'm not morally opposed to scraping, even when it's in violation of the ToS, with two conditions: that it doesn't place an unreasonable burden on the server, and that it doesn't invade people's privacy. The legal significance of the ToS is murky at best (disclaimer: I'm not a lawyer) but if the site asks you specifically to stop scraping them or puts up a technical barrier you should stop (morally and, in the US at least, legally: see craigslist v 3taps)


I'm pretty much on this wavelength. If I can automate something that I'd otherwise do manually - or that a less technical person would happily do manually - I'm fine with having python or perl and a cron job do it. What I try to avoid is scope creep when my automation makes it easier to magnify the scale. My dad used to collect his local synoptic chart from the weather bureau website - and it frustrated him that they only archived 1 week's worth, so when he was away he'd need to ask someone else to download them for him or end up with gaps in his collection - so I happily scripted that up for him. What I wouldn't do for him, though, was make it grab _every_ chart for every region available - if he wasn't doing that by hand, I wasn't going to break the terms of service that much further and automate it for him... I'm happy to admit that when I do things like this they're without doubt technically infringing on someone's legal rights, but ethically I consider this to be in the "of course I'll walk across an intersection against a red light in the middle of the night with no cars around" class of wrongdoing.


Ahh thanks, that'll be good to use, and it seems like HN wouldn't be overly annoyed by me including these comments on the site. And in general it's good for other people running projects with the data as well.


The full HN dataset is also on Google Big Query: https://cloud.google.com/bigquery/public-data/hacker-news (updated daily)
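
A quick way to poke at it with the BigQuery Python client (a sketch; the table name is from memory, so check the linked page for the current schema, and note queries need a GCP project with billing set up):

    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT title, score
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = 'story'
        ORDER BY score DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.score, row.title)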


IANAL, but I'd suggest that it doesn't matter how people "feel"; there is a legal element here. For example, you can scrape NASDAQ data from any number of sites, but those sites pay NASDAQ for the data feed. You do not have the right to use that data. You need to get permission from NASDAQ. (I'm just using them as an example.)

You don't really get to decide how somebody else's data gets used.

Using your sports statistics example, this will become a grey area as writing becomes more automated, but at the moment, a writer gets a 'statistic' like a score which is made publicly available. There are no limits on using that statistic. But you didn't automate the process of spreading the stats; you, in theory, found a fact and wrote about it.

This is different from just giving a feed of stats, or linking through a bunch of services.


Of course, it goes without saying that the law depends on the jurisdiction where the scraping takes place.


I work in a company that does a lot of web scraping, mostly of companies whose ToS say no scraping.

Instead of violating the ToS, we have business contracts with each company that give us permission to scrape. We use this as a way to take control of integrations and put the ball in our court, as most of these companies have little to no technical expertise or resources. By doing this we can create an integration as quickly as we want, instead of waiting months, or never managing to get one at all if it had to be done through an API.

Scraping can be a powerful tool in this respect. Make sure you have permission; a ToS saying no scraping doesn't necessarily mean you can't get special permission.


A customer asked us to integrate their website with our mobile application. They had no IT department to do the integration, and they didn't want to open their database to the web, so I came up with this solution:

SSH into the machine, select using a DB command-line tool, return the data from stdout in CSV format, parse the data on our side. It's surprisingly fast and secure.
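
A sketch of that pattern, assuming an SQLite database (the host, paths, and query are made up):

    import csv
    import io
    import subprocess

    # Run the query on the customer's machine; the remote command is passed
    # as one string so the shell over there doesn't split the SQL apart.
    # sqlite3 -csv writes the rows to stdout.
    out = subprocess.check_output([
        "ssh", "deploy@customer-site.example",
        "sqlite3 -csv /var/www/app/data.db "
        "'SELECT id, name, price FROM products;'",
    ])

    # Parse the CSV on our side.
    rows = list(csv.reader(io.StringIO(out.decode("utf-8"))))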


SSH tunneling supports forwarding remote hosts to local ports. Then you could use a proper database connector in your app. For example access MySQL through an SSH tunnel: https://stackoverflow.com/questions/18373366/mysql-connectio...
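
e.g. (hypothetical host; 3306 is MySQL's default port):

    ssh -N -L 3306:127.0.0.1:3306 deploy@customer-site.example
    # then point your MySQL client/connector at localhost:3306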


I forgot to mention that it was an SQLite database.


They could even reverse SSH, so they don't have a port open.


If you can see it, you can scrape it. It is important to be nice and not overload the site's systems, but personally I don't see a difference between viewing the site myself and some script doing it for me for later consumption.

Technically, sites can do anything they want to make scraping more difficult. But from the moral (and, I hope, legal — at least in the future) standpoint, scraping should be your right.


>If you can see it, you can scrape it. It is important to be nice and not overload the site's systems.

It depends on what you do with the content, but generally I would agree.

The content doesn't become any less copyrighted just because you scraped it. Keep that in mind.

Being nice is important too. A site may not react well to you scraping the content over 1000 parallel connections, so ramp up slowly, and back off if the site becomes unresponsive or slow. Google, Bing and others index sites all the time, and they play nice and almost never cause issues for site operators. Amateurs often cause issues and crash sites, as do some wannabe search engines (I'm looking at you, Eniro; answer your emails).

If you don't play nice, expect to be blocked or get served 1GB junk files or images of beavers.

Things you should ALWAYS do when running a web scraper (a sketch follows this list):

* Back off if the target server becomes slow.

* Include a unique user agent.

* Include contact information, or make your contact information easy to find, should problems arise. Sites may have an API or data feeds they would be willing to share instead.

* Respect robots.txt.

* Publish your crawler's IP ranges, in case people REALLY want to block you.
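
A minimal sketch of those rules in Python (the site URL and crawl targets are made up):

    import time
    import urllib.robotparser
    import requests

    UA = "mycrawler/1.0 (+https://example.com/crawler; crawler@example.com)"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://target-site.example/robots.txt")
    rp.read()

    delay = rp.crawl_delay(UA) or 5  # honor Crawl-delay if set, else 5 s

    for url in ["https://target-site.example/page/1",
                "https://target-site.example/page/2"]:
        if not rp.can_fetch(UA, url):
            continue  # robots.txt puts this path off limits
        t0 = time.time()
        resp = requests.get(url, headers={"User-Agent": UA})
        elapsed = time.time() - t0
        # Back off hard if the server is responding slowly.
        time.sleep(max(delay, elapsed * 10))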


There is clearly a difference between scraping and not scraping, or you wouldn't want to scrape a site.

Why _should_ scraping be your right, morally? Your convenience doesn't remove ownership of data.


If the data is already published, it is effectively made public.


* most sites prohibit scraping but beg for Google to scrape them

* many good products/websites are based on scraped content

* many good products/websites are not feasible because of scraping limitations and limited access to data, even publicly funded data (e.g. no real estate ads with noise overlay despite the EU mandating noise maps in all member states; member states have prohibitive access rules for these maps)

* "rogue" scraping causes problems for many websites; blocking it creates problems for legitimate users of various proxy/anonymization services like Tor. Captchas are not a long-term solution, good programming defeats them.

I'd welcome a simple technical solution for scraping that takes into account the interests of site owners, other publishers, and the public. The sooner publishers get together and build one, the better for them.


Google at this point is the "phonebook" of the internet.

It is understandable that you want to be in the phonebook, but also that you don't want somebody random to enter, use your resources and copy everything because of some other interest.

If there were some official list of all "phonebooks", I would allow those to scan my site but block the others.

Scrapers, from my side, spam my logs/analytics, fake real visits, try to spend my AdWords budget; some of them sign up with spam data and screw up your conversions and testing, and so on. If it were only search engines I would allow them, but it's not.

At this point I have the whole AWS IP range blocked.


> It is understandable that you want to be in the phonebook, but also that you don't want somebody random to enter, use your resources and copy everything because of some other interest.

It's not that simple: Google has used scraped content in the past to build competing products to the sites it was from. Examples: reviews from shopping sites for Google Shopping, Yelp reviews for Google Places.

Google can do it because it basically owns your front door. Everyone else can't.

> At this point I have the whole AWS IP range blocked.

From past experience: this will likely block some apps, possibly services like Alexa from accessing your site.


Could you please elaborate on how good programming defeats captchas?


Indexing content is different from scraping content.


True, indexing content is not the same. But you can't index anything you haven't already scraped.

Indexing is simply the second step in the process; scraping is the first step, and (users) searching the index is the third step.


I laugh at people who try to exercise ownership of bits. It's foolish to place data out there in the open with the expectation that people are going to treat it nicely.

Data yearns to be free, stop fighting it!


Reddit addresses this specifically in their TOS and prohibits it. They mention a licensing program.

https://www.reddit.com/help/useragreement/

Doesn't really matter how we all feel about it. If you hit their radar, they'll go after you. Can be expensive, whether you're in the right or not.

You're more likely to hit their radar if you're trying to make money. I suspect you're using affiliate links, right?


Right, I've come across that page, but I include my name and contact in the header, and don't make more than a request every 15 seconds so it won't hurt their servers at all.

And I did have affiliate links initially, where I included the non-affiliate links next to them if people didn't want the affiliation. But then I got rid of those because people didn't like it. I don't care about making money on it, I just think it's interesting.


I suspect they are more interested in their copyrighted content than load on their servers. Hence the licensing program.

If you aren't trying to monetize, you will be lower on their radar, but not immune.


Whose copyright?


Reddit's.

"By submitting user content to reddit, you grant us a royalty-free, perpetual, irrevocable, non-exclusive, unrestricted, worldwide license to reproduce, prepare derivative works, distribute copies, perform, or publicly display your user content in any medium and for any purpose, including commercial purposes, and to authorize others to do so."

The original comment submitters could grant you a license on their own, but that might be difficult to coordinate.


To be clear then, reddit has no apparent copyright (on individual comments anyway). They have a license to the copyrighted work, which they're allowed to re-license.

I view this as significant mainly because it changes what they could sue you for, and any fair use evaluation if copyright does enter into it at some point.


And they specifically mention it's a "non-exclusive" license.

I suspect Reddit might claim a Collection Copyright in the compilation of posts/comments - so even if you could acquire your own license to a post and each of the comments - if you tried publishing that as a book, Reddit could claim the organisation of those individually copyrighted items is owned by them...


You'll note Reddit do not ask for or claim any copyright there - just a license to use whatever you publish there in specific (though very broad) ways.


They do ask for a license that allows them to resell your content.


Yes, that's still not the same thing as owning the copyright. The difference is significant, because I am not infringing on Reddit's rights by copying and re-publishing a Reddit comment.

If anyone, I'm infringing on the copyright of the original author of the comment, while Reddit has no grounds for a copyright infringement lawsuit.


> Reddit addresses this specifically in their TOS and prohibits it.

Where? I don't see anything that prohibits scraping. There are provisions against rehosting content, but that isn't scraping. You can scrape content without rehosting if you are locally scraping in an app, for example.


That is what OP is saying he will do... rehost the content.

"I actually created a site...that scrapes...from Reddit...and shows that information."


Though I don't exactly rehost the content. I link straight back to the Amazon link and the Reddit comment where the link came from. I'm not even reposting the comment text, which I feel makes a difference on that front, but possibly not in others' opinions.


If you're only reposting links, there's no issue. Content is what they care about. Reddit considers the comments to be content.


You omitted a key part there:

"that scrapes comments and posts from Reddit that link to Amazon products and shows that information"

It's ambiguous, but my interpretation is that it's the Amazon content (or perhaps a summary thereof) that's being rehosted, not the reddit content.


I'll concede that I'm now thoroughly confused as to what OP is scraping and republishing. He's revised it twice now.


Here's the link to the site if this makes more sense: http://www.productmentions.com

I got some comments about people not liking me throwing the URL in here, but I figure it explains things better.

And also, this post isn't just about whether or not what I'm doing here is legal and OK; I want overall thoughts on it as well.


Seems safe from copyright to me, it is indeed mostly links.

And seems to surface funny or interesting things based on popularity, like the 55 gallon drums of "personal lube". So, the aggregation has some purpose.

Good luck with it, seems an interesting idea.


I have nothing against scraping. Ideally, all information on the web is ready for consumption by machines and people alike, and any law or contract trying to address machines and people differently in this regard is going to be flawed and technically ambiguous. Potential traffic issues aside, this seems largely unproblematic to me.

It's what you actually do with the information that matters. For example, republishing or otherwise distributing information when you have no right to do so may be an ethical and legal issue.


I think this might be an interesting read for web scrapers:

https://en.wikipedia.org/wiki/Sui_generis_database_right


The last time I wrote a generalized web scraper (as opposed to something specific that we had permission for) I put a lot of effort into distributing the load across many websites so that no single site would feel any pain more than if you were just browsing as a normal human being.

We were, at the time, scraping for lead information to add to our marketing database, and this isn't the thing I'm the most proud of in my career. But we all make mistakes. I wouldn't do that again.

At the same time, we rotated things so that we weren't killing the websites in the niche market that we were trying to scrape for leads.

The algorithm was that we would seed Google, Yahoo, and Bing with certain keyword searches that were relevant. Then we'd take the search results from the APIs and stuff them into an array. Then we would sort them proportionally. Since (as was the case) we most often got the most hits from Google, followed by Yahoo, and then by Bing, we'd stuff the results into an array and intersperse them.

So if we had 3x Google results, 2x Yahoo, and 1x Bing, we wouldn't hit the Google results first. We'd hit a Google result 3x, then a Yahoo 2x, then a Bing 1x, and cycle.

It was a decent way of doing things.
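
A rough reconstruction of that proportional interleave in Python (my own sketch, not their code; assumes non-empty result lists):

    def proportional_interleave(sources):
        """Interleave result lists so each cycle takes from every source a
        number of items proportional to its share: with 6 Google, 4 Yahoo
        and 2 Bing results, each cycle emits 3 Google, 2 Yahoo, 1 Bing."""
        smallest = min(len(s) for s in sources)
        quotas = [max(1, round(len(s) / smallest)) for s in sources]
        iters = [iter(s) for s in sources]
        out, exhausted = [], [False] * len(sources)
        while not all(exhausted):
            for i, (it, quota) in enumerate(zip(iters, quotas)):
                for _ in range(quota):
                    try:
                        out.append(next(it))
                    except StopIteration:
                        exhausted[i] = True
                        break
        return out

    google = ["g1", "g2", "g3", "g4", "g5", "g6"]
    yahoo = ["y1", "y2", "y3", "y4"]
    bing = ["b1", "b2"]
    print(proportional_interleave([google, yahoo, bing]))
    # ['g1', 'g2', 'g3', 'y1', 'y2', 'b1', 'g4', 'g5', 'g6', 'y3', 'y4', 'b2']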

We never broke anyone's stuff. Even if it should have been.


I don't understand why you wouldn't be proud of this... what exactly did you do wrong?


I feel uncomfortable about what we were doing in general. I didn't feel bad about the implementation. We were using the results of the web scraper to feed what were basically spam emails trying to recruit people to sign up for our product.

In the grand scheme of things, I think that I did the most responsible thing I could have with the task I was assigned. But the task falls into a general category of things I don't approve of.

It certainly wasn't illegal, and it probably wasn't unethical, but it was definitely gross, and my internal standards tell me to avoid things that make me feel gross.

That said, I'm kind of pleased with the technical results. Up until that time in my career, I'd never encountered a sorting algorithm that handled things based on the proportion of similar items in an array. I'm sure that other people have done this and that there's no way it's novel at all. But it was a cool challenge to make a shady task perform in a way that didn't break other people's shit just so that we could maximize our own efficiency.


It sounds like what you're really asking is this ...

There are some websites whose primary business model is providing content in exchange for something: a subscription fee, or advertising eyeballs. They have a very strong financial interest in your not scraping their content and providing it to others on different terms.

There are other websites who make some content available and explicitly authorize people to use it: various datasets and RSS feeds and such.

And then there is a wide swath of websites that have adopted generic TOS that prohibit scraping, or they prohibit it because they haven't given it much thought and can't think of any particular reason off the top of their heads to permit it.

So what you really want to know is what sites in the third category would consider a sensible scraping policy, if they had to give it sufficient thought.

In other words, if they don't just default to a prohibition because it's already in a TOS template or because they haven't thought it through, what's the rationale for either blocking or not blocking scrapers?


The legal system does not "decide" on legality unless it's forced to. Consider the case of the Google Books project: eventually the courts did rule that it constituted fair use. Or consider the situation involving Flickr and Pinterest [1], or the one involving RapGenius and lyrics licensing [2].

So to answer your questions:

    > Are comments on sites like this public data or private?

    > Does the purpose of your scraping make a difference, if one use is just a project but another would be selling the data?
There is no correct answer, at least unless you are willing to wait a decade and spend millions while the cases make their way through the byzantine legal system of districts, circuits, appeals and supremes. Unlike science and technology, where there is a "correct" answer, you should approach the legal system with a different perspective, using instinct and acceptable risk tolerance.

History is littered with people who took a bet, and ended up succeeding or failing upwards.

Finally, ignorance is actually preferable to knowledge. By writing this question, or having this conversation over an email, you are simply creating a paper trail that can only harm you if you get sued tomorrow. [3]

[1] https://photo.stackexchange.com/questions/53304/what-is-the-...

[2] https://www.nytimes.com/2014/05/07/business/media/rap-genius...

[3] https://www.fastcompany.com/1588353/steal-it-and-other-inter...


If it's done in a way which causes no greater load on the servers than a human doing the same job, and especially if it's only for personal use, then I for one feel entirely comfortable about it. And I would give pretty short shrift to any robots.txt rules favouring Google alone, which are clearly morally unreasonable and in my non-professional opinion legally questionable too.


I scrape stuff for personal use (mostly to generate RSS feeds for things like newly added files on FTP servers, or new search results for a particular query in a classified ads service). In those use cases, I don't care what the ToS says. I'm just automating my browsing; I'm not exfiltrating data I shouldn't have access to, nor am I republishing it anywhere else.

But in general, I'm sympathetic to all scraping efforts that provide benefit to people for free.


There are many legitimate reasons for web scraping and I think it’s fair that you set out your desires for scraping in your robots.txt file. You can block other bots while allowing Googlebot. Alternatively, you can also use a WAF that restricts bot access.
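
For example, a minimal robots.txt that admits Googlebot and turns everyone else away (an empty Disallow means "allow everything"):

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /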

But I do have a problem with scraping other sites to divert traffic away from them. Perhaps it's acceptable if it's for data analysis. Also, I would be very careful, as some governments have very strict privacy laws (the EU, for example) and you never know what you are scraping.

But one of the other problems is that web scrapers can be used for very nefarious purposes – see this article on web scraping attacks: https://www.incapsula.com/web-application-security/web-scrap...


If something is accessible from the Internet, it's public. Read it or scrape it, doesn't matter. An API only exists to reduce the burden on scrapers and servers.

Though legally it might be punished, so you'd better not reveal yourself.


If it can be legally punished, I don't think it falls under a normal definition of public, does it?

My bank account can be accessed from the internet, is it public? Or do you mean without logging in? In that case, are quora answers public?


If you don't provide an API, or your API is either ridiculously expensive to use or behind some arcane vetting process (coughInstagramcough), don't be surprised if people scrape instead of using it.

And if you scrape, just don't be a dick about it. Don't hammer the site with your buggy scraper doing 1k hits per second. And don't resell the data.


>>> What about sports statistics that sometimes are "private" by the league, rather than just open for people to use and write interesting articles about it?

Not happening. Sports data is far too valuable. High sale value.

Don't give away for free what you could charge a lot of money for.


There is no such thing as public data that is private. If it is on the web for public consumption, the site has no right to forbid working with the data and re-using it elsewhere. They simply have no means to.

That does not mean that one has the right to publish the exact same content; then it becomes a copyright issue. But remixing it into another site, like yours? You have every right.

Note: that might depend on where you live. Legislation differs; US fair use is a different concept from Germany's public data concept, for example.


Copyright doesn't cover factual content, only the specific presentation of the content, and only if it's distinctive/artistic enough.

In Europe we have IPR covering data though.


I love scraping and hate it too. I have limited experience with it, but some clients want you to scrape competitors' sites to steal their customer lists, etc. It is a matter of ethics. I tried C# and a few other languages and it didn't work because some JavaScript on the site had an anti-scraping script. I declined the offer, as I didn't want to get sued by the competitor they wanted me to scrape.

Web scraping is going the way web bots work now: you have to scrape data to automate stuff.


If a site provides a sitemap index in its robots.txt, would you assume they're allowing their site to be crawled on the pages they've set out within it?


This doesn't address your question directly, but are you aware that Reddit provides an API? Why not use it instead of scraping?


"Web scraping" covers a pretty broad area from plain data collection to snagging content for republishing.

Can you be a bit more specific?


Sure. I'm just curious about people's opinions on whether or not it's necessary to follow a site's rules on data scraping, whether it's always OK if you're not trying to make money from it, or whether something like giving credit to where the data comes from makes a difference. Just things in general like that, if that makes sense.


Our opinion does very little. You would have to go over each site scraped and read their terms of service / agreements.


> You would have to go over each site scraped and read their terms of service / agreements.

Someone putting up a bunch of words doesn't bind you to a contract with respect to scraping or anything else. (At most they can give up a right. They can put something in the public domain, for example, by giving up their copyright.)

To give some extreme examples: What if the site says you owe them a million dollars if you even look at the site? What if someone put a sign on their lawn that says anyone stepping on the grass can be legally shot? Are you OK with those just because they said so?


> Someone putting up a bunch of words doesn't bind you to a contract

Scraping is copying, which is an exclusive privilege of the copyright owner under copyright. In the absence of a license or a copyright exception, you have no legal right to exercise that privilege. Someone putting words on a page might give you a license for the use you intend.

You don't need to be bound to a contract, because this isn't about restricting a right you have in the absence of license.


> Scraping is copying, which is an exclusive privilege of the copyright owner under copyright.

Is it necessarily copying in a sense that is different from how a web browser acquires the resource to render it? For example, scraping data to derive some non-protected facts could be considered consumption in the same sense a human reading it via a client would, and doesn't require storing a copy any more permanently than a browser.

Intent is probably a lot more important, legally, than the fact that you are technically copying something simply by "accessing" it in a browser. If you are scraping with the intent to extract and copy the copyright protected content, sure, that's a good point, but that's by no means the only use of scraping.

Personally, and I understand that this isn't how it works legally, I think that the contract is implied by the protocol. I request copies of information by means of HTTP. If the distributor doesn't want me to have a copy, they can choose not to respond with a copy. I never make a copy in that case: the server is, hence I have not reproduced the work. It's not me taking a book to a photocopier, it's me calling the publisher and asking them to send me a copy of page 237.


> Is it necessarily copying in a sense that is different from how a web browser acquires the resource to render it?

It's different in purpose, if not mechanism, and insofar as there is an implicit license to copy in a browser for display, it is another step to establish that such a license extends to other purposes even if the mechanism is the same.

Also note that the existence and scope of an implied license can be affected by the presence and terms of an explicit license, which is one reason you might want to read the T&C prior to scraping.


So if I (for example) scrape a website just to count words, have I made a copy, in purpose? I have no interest in making a copy, rather I am analyzing the content to derive an unprotectable fact. The host made a copy of the information for me so that I could.

Also, the validity of a set of T&C that I didn't explicitly agree to and likely never read is dubious, and inconsistently enforceable at best.


You seem to be saying that copying is never fair use. Copyright doesn't protect ideas or facts but material expression.


> You seem to be saying that copying is never fair use.

No, I'm not. Fair use is one (of several) copyright exceptions and, as such, is addressed in my post.


Rereading your post I see that I read it incorrectly the first time. Thanks for pointing that out.


Google does not ask every site owner permission to index it.


There is some quid pro quo with allowing Google to scrape though. You get traffic in return.

This is changing, though, with Google's slow but steady movement of organic search results down the fold.


robots.txt


How the data is used and the scrape frequency are also factors.

Google won't bombard your site with requests and at worst will cache data but not try to make use of it.



Your question is more suitable for the owners of those sites, not us. After all, it isn't us who would be suing you in court. What we say has little to no impact.


Oh, I totally get that, which is why I make sure any of the scraping I do isn't to make money, just to write about interesting analysis. But I'm also curious about people's opinions here on legality and whether they care about following a specific site's rules.


Where I live, if there is no exchange of money for goods then there is not a contract. If there is fair use of the material, then copyright doesn't apply.


It depends. What's the site? How are you scraping? Is it going to cause traffic issues on the server? Are you breaking terms of service or any other reasonable requests to not scrape? Did you even talk to the site owners? Are you scraping content behind a paywall or similar? All of these questions and others make up the answer to your question, so there's no generalisation to make really.


The same as I feel about torrenting: if you do it for your own needs I'm OK with it, but if you do it with commercial interest one way or another, I'm not OK with it.


Fantastic. Web scraping has enabled me to access a ton of data I would otherwise be forced to access manually.


This is the reason robots.txt was created: to tell web scrapers, and the people building them, what is off limits.

Of course, there are people building services that scrape certain sites that appear to be off limits.

Those people scraping sites where it is explicitly prohibited either:

a. are breaking the rules, and potentially the law if it's explicitly prohibited in a ToS, and will eventually have to deal with getting banned or sued. It's quite a gray area legally, but here are some laws that could be used against you:

* Violation of the Computer Fraud and Abuse Act (CFAA)

* Violation of the California Penal Code

* Violation of the Digital Millennium Copyright Act (DMCA)

* Breach of contract

* Trespass

* Misappropriation

Source: LinkedIn v. Doe Defendants

b. have an agreement with the website owners allowing them to scrape certain portions of their site.

c. are scraping data with no rules concerning it.

For example, Facebook has a ToS for scraping: https://www.facebook.com/apps/site_scraping_tos_terms.php At the bottom there is a form for those who want to get permission to scrape the site. And their robots.txt is heavily used to control crawlers with User-Agents they know: http://facebook.com/robots.txt

It's rare you would run into legal issues, but possible. The question is whether it's morally okay for you to scrape any data you want.


To examine the ethics of web scraping, I think it's useful to strip away all the technical trappings and just look at it as an interaction between two people. We'll call them Chloe the Creator and Sam the Scraper.

Chloe has expended effort/energy/resources to gather and collate information which provides value in some way to Sam.

Sam has invested nothing, risked nothing, and expended no effort with regard to that information.

Chloe decides, in her own interest, to give the public access to the information.

Does Chloe have a moral right to try and impose conditions on that access?

Does Sam have a moral duty to make a good-faith effort to abide by those conditions?

My answer to both of these questions is yes.


Web scraping is my birthright, no ToS on this earth can take it away.


> I actually created a site I called blah blah blah (don't worry, this isn't an ad for it)

https://media.giphy.com/media/EouEzI5bBR8uk/giphy.gif


Why don't you believe OP? If OP wanted to advertise the site, wouldn't they just do a Show HN? Is this more effective to warrant being dishonest?


because it would have been a lot easier to omit the entire paragraph.



