
Show HN: CloudScrape – Cloud-based web scraping platform - Cloudscrape
http://cloudscrape.com/
======
kennethologist
This can become prohibitively expensive even for a small web scraping job that
you're required to do every day. Say, scraping a news site for articles.

The tooling and intuitiveness are awesome, but at its heart doesn't it do what
any other headless, JavaScript-enabled browser does?

For example:
[https://phantomjscloud.com/site/pricing.html](https://phantomjscloud.com/site/pricing.html)

Is there a way to get a less feature-rich version of this, i.e. sans auto-
resolve CAPTCHA etc.?

Great work nonetheless, though.

~~~
henrik1409
Thank you for the kind words :) While you're right that we do run a headless-
browser-ish thing, what sets us apart from most of our competitors, other than
the point-and-click approach, is that we autodetect everything that's going on
in the browser, which also means that in most cases you don't need to know
what's going on. What this effectively means is that you'll often spend no
time reverse-engineering and be able to scrape even wildly complex, JavaScript-
heavy sites in minutes instead of hours, and have them be a lot more stable
than they would otherwise be, since there's no "Wait for 5 seconds" that will
only be enough 95% of the time.
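
To illustrate the fragility, here's a minimal sketch of the two approaches
using Selenium in Python (not our actual engine, and the target URL and
selector are made up):

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get("http://example.com/articles")  # hypothetical target page

    # Fragile: a fixed pause that is only long enough ~95% of the time.
    time.sleep(5)

    # Robust: wait on the page state itself, up to a generous timeout.
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".article"))
    )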

We see a lot of our clients manage to get what they need done, even daily
scrapes, for as little as $29/month, since scraping a news site daily will
often take no more than a few minutes.

~~~
x5n1
Keep your prices; your average developer is not a business person and advises
everyone to race to the bottom.

~~~
cookiecaper
Absolutely. People will always tell you they want everything for free, or
practically free. What they _say_ they'll pay and what they'll _actually_ pay
are usually two very different things, assuming you provide something that
can't be easily replaced.

~~~
Richdow
Agreed! Brilliant tool!

------
zer0defex
Nice timing, as just this week I had my first exposure to the CloudScrape
platform. Credit where credit is due: I found the platform refreshingly
feature-rich, even at the lower-tier (i.e. free) account levels, compared to
offerings from competitors.

The downsides were like those of so many other crawlers/spiders: too much
effort spent trying to meet the needs of all. Aiming for the first 95% is
easy; getting from 95% to 99%, though, not so much. Hence all the corpses
littered along this path.

Recommendations:

Define solid, real-world examples of your platform in practice, and really, no
bullshit, explain what makes you different from every other platform out there
that's 97% polished.

You didn't meet my needs because I needed much more in the way of parsing
network requests from pages, mostly in handling if/then scenarios. A
relatively simple scenario, to ME, and therein lies the challenge of the
crawler market: it's the last 5% that's custom to each scenario that so often
delivers 98% of the value.

Suggestion: identify and use the talent you obviously have to destroy some
high-value markets. Hint: prioritize targets that still champion a desktop app
at any point in the top 5 in terms of data collection.

Outside of that, and far more niche (and subsequently far lower priority):
allow for parsing/handling network requests and responses via a custom app/API
instead of just blocking all, some, or none.

~~~
henrik1409
Thank you for the recommendations! Would love to know more about your exact
use cases; we're aiming for the 100% :) If you reach out at
support@cloudscrape.com, we'd be happy to investigate whether we can make
CloudScrape work for you as well.

------
danielmiessler
Probably a super dumb question, but isn't this fairly unethical? The
"automatic IP rotation" feature isn't there for no reason.

~~~
angry-hacker
About as unethical as using adblock or disabling JavaScript. If you don't
want your content to be scraped, don't put it online!

~~~
manigandham
That's not a good argument. Do you ever leave your stuff lying around? I guess
we can just take it then, right?

Just because you have access to something doesn't give you permission to
access it in any manner possible.

~~~
yxdfasdjkljasdf
That is not how HTTP works; your analogy is not correct.

Nobody is taking anything. If you don't want someone to access your page, then
don't respond to their request.

~~~
dsjoerg
At a high enough frequency, scraping is indistinguishable from a DDoS attack.
Do you believe DDoS attacks are OK? How do you draw the line?

~~~
yxdfasdjkljasdf
There is a clear distinction between the two. You are presenting a straw-man
argument.

~~~
dsjoerg
You haven't quite laid out your argument so I have to guess what it is.

When you say "That is not how HTTP works" it suggests that your claim is that
anything that HTTP allows is ethically OK to do. However that is clearly a
ridiculous stance, since a DDoS attack is a stream of valid HTTP requests and
that's clearly not OK.

So I'm left wondering what your argument actually is for why unwelcome
scraping is ethically OK.

I find this an interesting question, because while I would love for protocols
to also define ethics, I feel that would be scope creep for the poor protocol
designers. There's a wide variety of conduct and ethics questions that a
protocol cannot address.

Where I myself draw the line is at protocol behavior intentionally designed to
obscure my intentions. For example, sending my requests from a wide variety of
IP addresses is behavior that is specifically designed to obscure where I'm
coming from; my only intent in doing so would be to circumvent the serving
machine's attempts to limit how much content it provides to a single
requestor. At that point I'm engaging in deceptive behavior; I've crossed an
ethical line.

~~~
yxdfasdjkljasdf
_When you say "That is not how HTTP works" it suggests that your claim is that
anything that HTTP allows is ethically OK to do. However that is clearly a
ridiculous stance, since a DDoS attack is a stream of valid HTTP requests and
that's clearly not OK._

That wasn't a response made to your comment, and you are mixing two different
arguments there. Your guess is not correct.

_So I'm left wondering what your argument actually is for why unwelcome
scraping is ethically OK._

I never even suggested such an argument.

The behavior you described in the last paragraph is only deceptive in the eyes
of a state actor engaged in information and privacy surveillance. Anonymity is
not unethical; it is a human right.

------
mojaam
Hmm, another cool web scraper is import.io.

I used to think web scraping was limited to obscure proprietary companies like
Connotate, so I'm glad to see more of these tools becoming available for
everyone to utilize and hopefully create something cool with.

~~~
MartijnHoutman
Import.io is a very cool tool indeed. We use it to extract unique product
links from the product overview pages of webshops, for adding them to Pricepin
(our own tool).

------
henrik1409
It's great to see discussions going on here. I'd like to tie a few comments to
the questions about the ethical aspects of web scraping:

As some have pointed out, scraping is not exactly a new thing, and a lot of
the biggest sites out there are built on the basis of web scraping or
crawling. We provide a tool and expect you to use that tool while abiding by
the law; if you don't, we will of course shut your account down immediately.
Breaking the law includes violating copyrights and performing DDoS attacks
(although those would be rather small attacks, since even 50 concurrent agents
is no big deal for most websites).

We consider ourselves good netizens. We wish nothing more than to provide a
good, easily accessible and safe tool for extracting valuable information from
the internet, be it for a price comparison site in a market that lacks
transparency, business intelligence for your company to make informed and
wiser decisions, or a PhD project that requires access to millions of data
points available online in unstructured form.

Additionally, if you feel the services we provide have ill intent: we don't
provide any services (CAPTCHA solving and proxy rotation) that anyone with a
bit of programming skill could not easily use in their own software. The main
difference is that we are actively improving, and focusing not only on making
a good experience for our users but also on minimizing the impact on the sites
being scraped. This involves several things, like automated throttling and
slow-site detection, request caching, and blocking requests to services such
as Google Analytics, so as not to interfere with site owners' stats.
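
Conceptually, the analytics blocking is just a request filter inside the
browser; a minimal sketch in Python (hypothetical hostnames, and our real
blocklist and engine differ):

    from urllib.parse import urlparse

    # Hypothetical blocklist; illustrative only.
    BLOCKED_HOSTS = {"www.google-analytics.com", "stats.g.doubleclick.net"}

    def should_block(url):
        """Return True if a browser request should be dropped instead of sent."""
        host = urlparse(url).hostname or ""
        return host in BLOCKED_HOSTS or host.endswith("google-analytics.com")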

~~~
chinathrow
If you are a good netizen, could you please provide your user agent so I can
block your bot on all the sites I operate?

Thank you.

EDIT: found that in your FAQ:

"Since disclosing IP’s and user agents would allow anyone to identify all
traffic coming from our system – we naturally never do."

That is the opposite of being a good netizen, and I hope I'll be able to sue
you once I find out your services are helping to scrape my content.

2nd EDIT: Found out that you reside in Denmark and therefore in the EU; that
makes it way easier, then.

------
alexivanovs
Anyone interested in 'similar' tools/platforms:

[http://codecondo.com/web-scraping-tools-extracting-
data/](http://codecondo.com/web-scraping-tools-extracting-data/)

~~~
reinhardt
And an "awesome list" of web scraping open source tools:
[https://github.com/lorien/awesome-web-
scraping](https://github.com/lorien/awesome-web-scraping)

------
ck2
Bot detection tip:

Humans don't read pages at dozens per minute.

It's not hard to code for that, especially if your static content, like
images, is served by alternate servers, so you can just focus on your content
servers.
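
A minimal sketch of that detection idea in Python (thresholds invented; count
only content-server hits, not static assets):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_PAGES_PER_WINDOW = 30  # humans rarely read pages faster than this

    hits = defaultdict(deque)  # ip -> timestamps of recent content requests

    def looks_like_bot(ip):
        """Sliding-window counter: flag IPs fetching content pages too fast."""
        now = time.time()
        q = hits[ip]
        q.append(now)
        while q and q[0] < now - WINDOW_SECONDS:
            q.popleft()
        return len(q) > MAX_PAGES_PER_WINDOW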

The problem I am dealing with lately is massive bot farms where you do not see
a repeating IP for an hour. I catch them eventually but it takes a toll.

But this is why you preemptively block all of AWS and providers like it.

------
melling
I built a little Swift/iOS "search engine". It contains about 2,200 URLs. I
only search page titles and tags that I've manually added. What would be my
best option for crawling the links and allowing the search to include the text
of each page?

[http://www.h4labs.com/dev/ios/swift.html](http://www.h4labs.com/dev/ios/swift.html)

This is a weekend project so I don't want to spend a lot of money on it.
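
For scale, the per-page step I have in mind is roughly this sketch (assuming
Python with requests and BeautifulSoup):

    import requests
    from bs4 import BeautifulSoup

    def page_text(url):
        """Fetch one page and return its visible text for indexing."""
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()  # drop non-visible content
        return soup.get_text(separator=" ", strip=True)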

------
idibidiart
Hit a certain threshold and you'll most likely get IP-banned by the site
you're scraping, or heavily throttled, unless the site doesn't care to
minimize bot traffic (which can cost the site owner valuable bandwidth and
server resources).

You may also get sued...

If the site owner wants their data to be available for automated extraction,
they'll provide an API, and they can price it to compensate for the cost of
serving all those bots.

~~~
djm_
If everyone had as negative an attitude as that, we never would have had
Google.

Countless sites depend on web scraping. You can scrape and be a good netizen;
the two are not mutually exclusive.

~~~
idibidiart
Google has a unique position in that any site that wants to be found has to
let Googlebot index its content. Google does not build a derived product from
your site's content that ends up competing with your site.

If someone needs to build a search site for real estate etc., why couldn't
they just scrape the Google search results, filter them (whitelist domains),
extract the actual links, and present them? In that case you'll need a
Google-specific scraper, which can be based on open-source scraping libraries.

Update: Google actually will IP-ban you if it thinks you're a bot trying to
scrape search results.

But they have an API:

[https://developers.google.com/web-
search/docs/?hl=en](https://developers.google.com/web-search/docs/?hl=en)

~~~
dchuk
First of all, Google does take your content and make a product out of it, by
selling ad space on search result pages. Those pages would have no value for
advertisers if it weren't for the content producers Google scraped to fill
them up.

Second, you linked to an API page that clearly says it was deprecated 5 years
ago; come on. No one should ever feel bad about scraping Google, considering
Google is the world's largest scraper itself.

------
chdir
- Auto-resolve CAPTCHAs

- Automatic IP rotation

That's just wrong. Is there a website where we can blacklist the IP addresses
of such violators?

~~~
yxdfasdjkljasdf
_That's just wrong._

You haven't explained how it is wrong, and why. None of those things is
"wrong" by itself. It is the malicious use of any tool that is unethical.

_Is there a website where we can blacklist IP addresses of such violators?_

What exactly do you think is being violated here?

~~~
chdir
Websites use CAPTCHAs & IP-based limits to prevent abuse of their resources &
to make it harder for copycats to mirror their data. There are often cases
where copycats outrank the original content in search rankings (see this
example:
[https://news.ycombinator.com/item?id=10103545](https://news.ycombinator.com/item?id=10103545)).

If I were a content owner/producer and I saw automated scraping from IP
addresses owned by CloudScrape that violated the ToS, I would sadly treat the
entire pool of IPs as violators (even though some might be genuine users who
are respecting the limits).

I'd like to know what a legitimate use case of auto-resolving CAPTCHAs and IP
rotation is, other than circumventing limits imposed by the webmaster.

P.S. Why the throwaway?

~~~
cookiecaper
There's already a tool to stop "copycats". It's called copyright (and for
inventions, patents). You can and should use that to enforce your rights to
your IP. It's not too hard to start issuing DMCA requests, and it's not even
that expensive to have a lawyer do it if you're making money. It doesn't
matter whether the illegal copy is obtained by a bot or a human.

While I agree that captchas and IP blocks _can_ be employed by target sites, I
don't agree that it should be illegal to circumvent them. I also don't agree
that it's necessarily unethical (though in some cases, it may be). If you have
public information posted on the public web, I don't think you have the right
to mandate that it only be accessed by certain tools. You should plan and
expect that it will be accessed by every tool capable of doing so.

If something is disrupting your business by "clogging the tubes" or whatever,
that's another thing, and they can be held liable for that. But it doesn't
matter that they clogged the tubes with one type of program or another; what
matters is that the tubes were clogged by their actions, and that's the part
that should be focused on in the subsequent legal proceedings. The specific
tool or tools used to clog the tubes is at most a tangential curiosity. We
don't want to make certain programs illegal.

Maybe we need a new amendment with "the right to bear code". We do not want to
go down a rabbit hole where certain programs are legal and certain programs
are not (at least not any more than we already are with the DMCA et al.). Down
with code control!

------
amelius
I'm wondering what the use cases for this are.

Also, do they respect robots.txt? And how are they going to avoid being
blocked by websites that don't want to be scraped? (I guess it would be easy
to determine the IP addresses of their scrape robots).
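
(Respecting robots.txt would at least be cheap for them. A sketch using
Python's standard library, with a hypothetical user agent since they don't
disclose theirs:)

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")  # hypothetical site
    rp.read()
    allowed = rp.can_fetch("CloudScrapeBot", "http://example.com/some/page")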

~~~
GPGPU
One of their features is "Automatic IP Address Rotation", so it may not be
that easy...

~~~
toomuchtodo
Use their free tier to scrape your own honeypot site, then log the traffic and
use it for blocking.

It wouldn't be hard to file abuse@ reports with their hosting provider as
well. Just pipe their scraper IPs into WHOIS.
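
A rough sketch of that pipeline in Python (assumes the whois CLI is installed;
the parsing is naive):

    import re
    import subprocess

    def abuse_contacts(ip):
        """Run whois for an IP and pull emails off abuse-related lines."""
        out = subprocess.run(["whois", ip], capture_output=True, text=True).stdout
        emails = set()
        for line in out.splitlines():
            if "abuse" in line.lower():
                emails.update(re.findall(r"[\w.+-]+@[\w.-]+\.\w+", line))
        return sorted(emails)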

------
cosmolev
Sergey from
[https://news.ycombinator.com/item?id=10403788](https://news.ycombinator.com/item?id=10403788)
would have been better off accepting the $15.

------
xxdesmus
I hope you guys have a good abuse team to handle the incoming reports.

------
lsh
Are there open-source alternatives to services like this and Diffbot?
Boilerpipe and similar libraries are OK, but they're targeted at article
extraction.

~~~
pablohoffman
Portia from Scrapinghub is 100% open source:
[http://scrapinghub.com/portia/](http://scrapinghub.com/portia/)

------
ausjke
Signed up and then was forwarded to a support-request page. Is there a how-to
guide for getting started with the free account?

I then wanted to back out; of course, there is no way to remove your account.

While I know what scrapers/robots/etc. do, I don't know how to start using
this one after spending some minutes reading through the website.

------
Richdow
This is just what I was looking for! Have you looked at the website? You get
free web scraping hours every month and start out with 20 hours. That will
take me some time to consume.
[http://cloudscrape.com/](http://cloudscrape.com/)

------
BinaryBird
Looks good. I've used WebScraper.io (running in Chrome on AWS EC2 instances)
before for many projects. It's quite powerful and free.
[http://webscraper.io](http://webscraper.io)

------
chinathrow
I need to add their IP range to my blocklist.

~~~
kuschku
Why? People will just scrape your data anyway.

What you put online is _public_. Treat the web the same as a public bulletin
board. Everything you post will be readable by everyone. Trying to prevent
this is pointless.

~~~
pingswept
It might be desirable to make data available to humans, but not bots, no?

~~~
atemerev
OK, how about hiring hordes of cheap human scrapers from Mechanical Turk?
Everything visible can and will be scraped if the need arises; the only
variable is the price.

------
Xspirits
I found it really, really expensive for what it's offering. Maybe I'm missing
the point here.

~~~
blowski
I used to work for PriceRunner, and I remember that while scraping a site and
extracting content from the results was easy, avoiding rate limits and
handling all the edge cases was hard. From the feature list, it seems that
CloudScrape is running full headless browsers, so it handles JavaScript and
allows screengrabs, and that stuff is expensive.
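
Even the "easy" part hides a lot of plumbing. A minimal sketch of the
throttle-and-backoff loop you inevitably end up writing (numbers invented):

    import time
    import requests

    def polite_get(url, retries=3, base_delay=2.0):
        """GET with a fixed crawl delay and exponential backoff on failure."""
        for attempt in range(retries):
            time.sleep(base_delay)  # be gentle between requests
            try:
                resp = requests.get(url, timeout=15)
                if resp.status_code == 429:  # rate-limited: back off harder
                    time.sleep(base_delay * 2 ** attempt)
                    continue
                resp.raise_for_status()
                return resp
            except requests.RequestException:
                time.sleep(base_delay * 2 ** attempt)
        raise RuntimeError("gave up on " + url)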

------
ck2
Wait, is this a ycomb-sponsored startup, or why else are they allowed to
self-promote on HN?

------
SoulMan
I miss Yahoo Pipes.

~~~
henrik1409
We've got something in store that might make you happy, then. Please stay
tuned :) [https://twitter.com/cloudscrape](https://twitter.com/cloudscrape)

------
juanescobarcom
Diffbot offers a series of awesome automatic APIs for data extraction: no
setting up manual rules, just provide a URL you want to extract data from and
they'll visually process it and extract the data automatically. They also
provide a crawler, a bulk extractor, and a free trial: www.diffbot.com

~~~
Richdow
Have you tried CloudScrape? From the looks of it they can do the same, and
then some.

