

Web Scraping as a Service - aarondf
http://tubes.io

======
conroy
What's the legality behind web scraping? Can you sell access to data that
isn't yours?

~~~
malandrew
They can be hit with the tort of trespass to chattels. If you run a lot of
servers scraping a service, the operator can sue you for depriving them of the
use of that service. I simply don't see how something like this could ever be
launched as a business. However, it could exist in another form.

Last year I spent a bunch of time on a "bittorrent for webscraping" project we
ended up scrapping at work after we pivoted. A p2p approach, combining a tit-
for-tat strategy with headless-webkit scraping, was the only long-term way we
could see to scrape legally, or at least semi-legally. Our intent was to
simply open source the project and let the rest of the world run with it. All
the technology is there to democratize the proprietary data industry, as has
happened with music and movies. It's only a matter of time before a lot of
that data is freed from silos like Twitter and Craigslist.

The strategy was the following:

(1) Tit-for-tat scraping, where you intentionally avoid scraping websites
whose subnets suggest they're in your jurisdiction. My machines scrape sites
in Europe while your machines scrape sites in the US. You also avoid scraping
the services whose data you actually want; other peers fetch it for you.

(2) Headless webkit with various methods of avoiding browser fingerprinting.
The key is to look as much like legitimate traffic as possible, so that any
attempt to block a client would affect real users and scrapers alike. This is
the most powerful deterrent to blocking.

(3) A 1-1 correspondence between the templates a site uses (handlebars, ejs,
server-side templates, etc.) and a database of snippets describing exactly how
to convert those templates to JSON data. Any service that ships its template
definitions as part of a client-side webapp would be trivial to convert to a
scraping formula.

(4) All the "recipes" for sites and their templates would probably be handled
centrally at first, the same way torrents were. When a recipe breaks because a
site has been updated, the system would know, because it periodically checks
fresh results against a known corpus of previously scraped data. Once a broken
recipe has been identified, someone could go fix it (conceivably this could be
automated as well, since with a known corpus of data you could detect how an
XPath selector has changed).

(5) Distributed hash table pointing to caches of previously requested data so
that a swarm for a particular service doesn't end up creating a DDoS by
accident. Each datum should only be scraped once for a certain interval of
time after which the data expires and can be requested again.
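To make (3) and (4) concrete: a recipe could be as little as a mapping from
output field names to selectors, plus a check against previously scraped data.
Here's a minimal Python sketch; the recipe format, field names, and sample
markup are all hypothetical, and it uses the stdlib's ElementTree on well-
formed markup where a real system would want a forgiving HTML parser and full
XPath/CSS selectors:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical recipe: maps output JSON fields to ElementTree paths.
# A real recipe database would key these by site and template version.
RECIPE = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
}

def apply_recipe(markup, recipe):
    """Convert one scraped page into a JSON-ready record using a recipe."""
    root = ET.fromstring(markup)
    record = {}
    for field, path in recipe.items():
        node = root.find(path)
        record[field] = node.text if node is not None else None
    return record

def recipe_is_broken(record, known_good):
    """Broken-recipe detection per (4): if fields that were always present
    in the known corpus now come back empty, the site layout likely changed."""
    return any(record[f] is None and known_good.get(f) is not None
               for f in record)

page = """<html><body>
  <h1>Example Listing</h1>
  <span class="price">$42</span>
</body></html>"""

record = apply_recipe(page, RECIPE)
print(json.dumps(record))  # {"title": "Example Listing", "price": "$42"}
```

Once recipes are just data like this, they can be shared and repaired the same
way torrent files were.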

AFAIK such a system leaves you open to, at most, a lawsuit for inducing a
third party to breach contract, assuming that third party is even in a
jurisdiction where they can be held liable. But such a lawsuit would be
tenuous at best. Furthermore, the plaintiff would have to prove that the third
party ever visited the site in question through non-automated means: if they
never visited the site in person, they cannot be held to its terms of service
and therefore can't be liable for breaching a contract they never agreed to.

FWIW IANAL, but I have talked to a bunch of lawyers about this, and the above
is the closest I could come up with for a service that scrapes sites without
running afoul of the law.

Anyway, such a system is in the realm of black-hat tactics, but to be honest,
monopolizing data to extract exorbitant rents and abusing third-party
developers is much more despicable. The whole world would be better off if we
commoditized access to all sorts of metadata.
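The expiry behavior in step (5), the part that keeps a swarm from hammering
one service, could be sketched as a simple TTL cache. This is a rough Python
sketch with the DHT plumbing omitted; the class name, keying by URL, and the
1-hour default TTL are all assumptions, not part of anything we built:

```python
import time

class ScrapeCache:
    """Sketch of (5): each datum is scraped at most once per TTL window;
    within the window, peers get the cached copy instead of hitting the
    target site again. In the real design this table would live in a
    distributed hash table keyed by URL."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}  # url -> (fetched_at, data)

    def get(self, url):
        """Return cached data, or None if missing or expired."""
        entry = self._store.get(url)
        if entry is None:
            return None
        fetched_at, data = entry
        if time.time() - fetched_at > self.ttl:
            del self._store[url]  # expired: eligible for re-scraping
            return None
        return data

    def put(self, url, data):
        self._store[url] = (time.time(), data)

cache = ScrapeCache(ttl_seconds=3600)
cache.put("http://example.com/item/1", {"title": "Example"})
```

A cache miss (or expiry) is the swarm's signal that a datum may be scraped
again; a hit means some peer already has it.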

------
mikemcdonald
I really like the idea; I definitely could have used this in several of my
projects over the years. It'd be cool if there were more use cases to give a
better picture of what's possible with the scraping service. That's what drew
me into IFTTT when I first visited their website. The example of getting the
title tag from a website wasn't that exciting :) . Any chance of a trial
period? I'd be more willing to pay after seeing a functioning implementation
in my website/app.

~~~
gee_totes
Basically anything is possible with the scraping service (provided the target
pages don't render themselves via javascript).

It will grab the DOM and you can parse through it and return a nice little
JSON object.
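In Python, that grab-the-DOM-and-return-JSON flow might look like this rough
sketch using only the stdlib. The sample page and the choice to extract links
are made up for illustration, and tubes.io's actual API may look nothing like
this:

```python
import json
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Walk the DOM of a fetched page and collect every link into a
    structure that serializes cleanly to JSON."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# In a real scrape the HTML would come from the HTTP response body.
html = '<html><body><a href="/a">one</a><a href="/b">two</a></body></html>'

parser = LinkExtractor()
parser.feed(html)
print(json.dumps({"links": parser.links}))  # {"links": ["/a", "/b"]}
```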

~~~
malandrew
A headless webkit browser like PhantomJS can scrape pages that render
themselves via javascript.

------
bellwether
I could have a need for this type of service as well, but it's not clear what
benefit it provides over PhantomJS. Is it just the alternating of the scraping
IP?

A free trial would be helpful, but before I sign up, I need to know what I
gain from the service. It seems like I still have to write all the same logic,
and tubes.io just takes care of making the web request. If that's all, it's
not something I'd be willing to pay for.

~~~
gee_totes
I'm a tubes.io customer (a happy one, too), and my main uses for it are the
alternating IP and the fact that it lets me scale up scraping operations
without having to deal with too much on my backend.

The alternating IP is actually a big deal: it's a pain to run your own proxies
and manage the proxy switching yourself (so painful that I would pay 10
bucks/month for someone else to do it for me).

~~~
bellwether
true, alternating IPs are useful, except when you already have agreements to
do the scraping with the target site, which we do.

so what would interest me is something that made scraping easier or more
robust. so far, it looks like the service still has some growing to do before
it gets there.

