
I think many of the websites people will want to extract from have anti-scraping/anti-robot traffic controls that will try to keep out a scraper like this. Amazon.com, for instance. Probably Google properties too.

That they plan to respect robots.txt in the future suggests they don't mean to go places content owners don't want them. On the other hand, automatic IP rotation kind of suggests they do mean to (what other purpose is there for that?).

Either way, it might be a limitation on what you might dream of using it for.

My own experiments with scraping Amazon and Google have been stopped dead in the water by their anti-bot traffic controls. (Amazon recently improved theirs.)




Having built scrapers working against some of these measures, you would be really surprised at how often they are accidental. Shared hosting providers often set them as defaults, at least as far as I see from the work I've done.

The real barrier I find is the current case law in the US, which seems to be the jurisdiction of choice for many web companies. It's currently a real possibility that you will be criminally in breach of the law and suffer the cost if you blatantly and knowingly continue after being notified of their ToS. Yes, Google and other big companies have nothing to fear, but it's pretty much a case of "how many people are dumb enough to pick a fight with Mike Tyson?"

If you target your scraping to further your own business, and impinge on someone else's business model, you're in water that is currently murky. It really needs to be settled, but until another lawsuit rises to the Supreme Court in the US, we won't have that. So it's just a matter of being aware that while you're not trying to be an evil criminal, you may still be viewed as such by someone you scrape.


Is being blocked the worst thing that can happen? Can't they sue you for scraping and using their stuff?


Scraping and selling their content without consent. I would assume that you can definitely sue for that.


I see a lot of sites selling Google Search results (like services tracking your SERP positions). Could Google sue them?


We've invested very heavily in building out a solid infrastructure for extracting data. We want to make sure that the product Just Works for our users, and that includes rotating IP addresses (you don't have to fiddle with your own, we have access to a pool of thousands).

Robots.txt is a tricky balancing act. It was first conceived in 1994, and was designed for crawlers that tried to suck up all the pages on the web. ParseHub, on the other hand, is very specifically targeted by a human. A human tells ParseHub exactly which pages and which pieces of data to extract. From that point of view, ParseHub is more like a "bulk web browser" than a robot.

Here are some examples that make this line blurry. If I tell ParseHub to log into a site, visit a single page, and extract one piece of information on it, does that violate robots.txt? If yes, then your browser has been violating robots.txt for years. The screenshots of your most visited websites are updated by periodically polling those sites (and ignoring robots.txt). My browser is currently showing a picture of my Gmail inbox, which is blocked by robots.txt: https://mail.google.com/robots.txt

More importantly, your computer and browser already do a lot of robot-like stuff to turn your mouse click into a request that's sent to the server. You don't have to write out the full request yourself. Is that then considered a robot? If not, then why is it considered a robot when ParseHub does the same (again, assuming a single request) thing?

Furthermore, some sites don't specify rate limits in robots.txt, but still actively block IP addresses when they cross some threshold.

It is far from a perfect standard, so it makes a lot of practical sense to have the ability to rotate IPs, even if it's not appropriate to use that ability all the time.
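To make the rate-limit point concrete, here is a minimal sketch of how a client could look for a declared delay. It assumes Python's standard urllib.robotparser and the non-standard Crawl-delay directive that some sites use; the helper name and the two sample robots.txt bodies are made up for illustration and are not taken from any real site.

```python
import urllib.robotparser


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay (in seconds) declared for user_agent, or None."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


# Hypothetical robots.txt bodies, for illustration only.
WITH_DELAY = "User-agent: *\nCrawl-delay: 10\nDisallow: /search\n"
WITHOUT_DELAY = "User-agent: *\nDisallow: /search\n"

declared = crawl_delay_for(WITH_DELAY, "ExampleBot")
undeclared = crawl_delay_for(WITHOUT_DELAY, "ExampleBot")
```

When the file declares nothing, the parser returns None, and the client is left guessing at whatever threshold the site actually enforces.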

Our goal here is to be able to distinguish between the good type and bad type of scraping and give webmasters full transparency. Obviously this is a hard problem. If you have any feedback on any of this we'd love to hear it.

ps. we've tested our infrastructure on many Alexa top 100 sites and can say with moderate confidence that it will Just Work.

pps. if you're a webmaster, having ParseHub extract data from your site is probably far preferable to the alternative. People usually hack together their own scripts if their tools can't do the job. ParseHub does very aggressive caching of content and tries to figure out the traffic patterns of the host so that we can throttle based on the traffic the host is receiving. Hacked together scripts rarely go through the trouble of doing that.
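As a rough illustration of the caching and throttling idea, here is a small sketch. It is not ParseHub's implementation: it uses a fixed per-host minimum interval rather than adapting to the host's traffic, and the class name, parameters, and injected fetch function are all hypothetical.

```python
import time
from urllib.parse import urlsplit


class PoliteFetcher:
    """Caches responses and spaces out requests to each host."""

    def __init__(self, fetch, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch                # callable: url -> response body
        self._min_interval = min_interval  # seconds between hits on one host
        self._clock = clock
        self._sleep = sleep
        self._cache = {}                   # url -> cached body
        self._last_request = {}            # host -> time of last request

    def get(self, url):
        # Serve repeat requests from the cache without touching the host.
        if url in self._cache:
            return self._cache[url]
        host = urlsplit(url).netloc
        last = self._last_request.get(host)
        if last is not None:
            wait = self._min_interval - (self._clock() - last)
            if wait > 0:
                self._sleep(wait)
        body = self._fetch(url)
        self._last_request[host] = self._clock()
        self._cache[url] = body
        return body
```

The clock and sleep functions are injectable so the behavior can be checked without real delays. A production version would also need cache expiry and error handling, which this sketch leaves out.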


"Our goal here is to be able to distinguish between the good type and bad type of scraping and give webmasters full transparency. Obviously this is a hard problem. If you have any feedback on any of this we'd love to hear it."

Yes, as said before, plus:

- Obey robots.txt to the full extent

- Name your access, i.e. label your bot

- Don't use shady tactics such as IP rotation

- Provide web site owners the option to fully block access of your bots (yes, communicate your full IP ranges)

Again - this is from a content owner who paid for his content.
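For what it's worth, the first two points in the list above are cheap to implement. A minimal sketch, assuming Python's standard library; the bot name, the info URL, and the sample robots.txt are placeholders I made up, not anything ParseHub uses:

```python
import urllib.request
import urllib.robotparser

# Hypothetical bot name and info URL, used only for illustration.
BOT_NAME = "ExampleBot"
USER_AGENT = f"{BOT_NAME}/1.0 (+https://example.com/bot-info)"


def build_request_if_allowed(url, robots_txt):
    """Return a labeled Request if robots.txt permits the URL, else None."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(BOT_NAME, url):
        return None
    # The User-Agent header names the bot so site owners can identify it.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


# Hypothetical robots.txt body, for illustration only.
ROBOTS = "User-agent: *\nDisallow: /private/\n"

allowed = build_request_if_allowed("https://example.com/public/page", ROBOTS)
blocked = build_request_if_allowed("https://example.com/private/page", ROBOTS)
```

Here the request for the /private/ page is never built, and the one that is built carries a name a webmaster can match against in logs or block outright.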


Neither of us is a lawyer (as far as I know), and I assume you have legal counsel for a business like this. I wish you luck in your business and hope it doesn't come to anything legal.

Actually I hope even more it does come to something legal and you win, because I'd love to expand and make concrete fair use rights for scraping. I like scraping, scraping is both fun and very useful for the business domain I work in, and very frustrating when content providers don't allow it by either terms of service (which may or may not be legally enforceable if you haven't agreed to them, it's unclear, but scary enough with all the CFAA over-enforcement) or technological protections.

But I think you're being disingenuous about the difference between a bot and an interactive web browser. I think it's pretty straightforward to most people and will be to the courts if it comes to that.

Interestingly, the latest enhanced Amazon anti-bot protections I ran into say "To discuss automated access to Amazon data please contact...", but don't explicitly try to say "you are forbidden from automated access."


It's fair to say that robots.txt is a balancing act in this case, given its intended use. However, a website's terms of use are non-negotiable. Clauses banning any form of automated access or data gathering (especially for non-personal use) are fairly popular amongst sites with "deny everything" robots.txt files. There's a very real risk here for both you and your customers.

In the long run it'd be nice to see some sort of "fair access" to websites introduced into law. Unfortunately, we don't yet live in that world.



