
Show HN: Scrape the web by injecting JavaScript into web pages - TheEspion
http://espion.io/
======
bitplanets
This is cheerio with node.js and extra bells. Example:
[http://projs.hackhat.com/the-making-of-the-hackernews-crawler-interesting-tips-to-know/](http://projs.hackhat.com/the-making-of-the-hackernews-crawler-interesting-tips-to-know/)

~~~
dale-cooper
Sounds more like phantomjs/casperjs.

~~~
TheEspion
Espion is similar to PhantomJS in that they're both headless browsers that you
can inject JavaScript into.

Espion comes with a lot more. First there is the infrastructure: processing
power, storage, connectivity and IP addresses that you don't have to
provision, set up or manage. Second, Espion includes the features that
surround extracting data from a site such as job scheduling, data quality
monitoring, online debugging and problem resolution and data delivery.

PhantomJS is perfectly viable, but if you need the features I highlighted and
use it, you'll have to build a lot yourself to get the job done.

~~~
gildas
So, is the Espion headless browser based on an existing one? If this is the
case, which one?

~~~
asdf123456
it's just PhantomJS wrapped up behind a cloud

------
corford
Or for scraping anything non-trivial, do it yourself with Casperjs,
AWS/Linode/DO, RabbitMQ and a bit of monitoring/alerting from someone like
Datadog. It will be cheaper and a lot more flexible.

Edit: realised the above sounds a bit harsh. I'm sure Espion can fill a gap
where clients need to scrape a limited amount of non-volatile data and don't
have the time to set up and manage something on their own.

~~~
gondo
You will need to manage the pool of IPs, which is sometimes the hardest part
since it has to be maintained continuously. The rest of the setup is usually a
one-time job.

~~~
corford
An easy solution to this is to have your scrapers on one network and a fleet
of squid proxies on another. As and when an IP gets banned you just cycle in
another squid instance on a new VM with a fresh IP (which can be from a
completely different netblock/geolocation if you want). You don't have to
touch the rest of the scraping apparatus.
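The cycling scheme described above can be sketched as a small pool: scrapers round-robin over active proxy endpoints, and when a target bans an IP a standby instance is promoted in its place. This is just an illustrative sketch; `ProxyPool` and the addresses below are made up, not part of squid or any real library.

```javascript
// Sketch of the "cycle in a fresh squid instance" idea: the scrapers
// talk to a pool of proxy endpoints, and when a target site bans an IP
// we retire it and promote a spare from a different VM/netblock. The
// scraping apparatus itself never changes.
class ProxyPool {
  constructor(active, spares) {
    this.active = active.slice(); // squid instances currently in rotation
    this.spares = spares.slice(); // fresh VMs/IPs waiting on standby
    this.next = 0;
  }

  // Hand the next proxy to a scraper, round-robin.
  pick() {
    const proxy = this.active[this.next % this.active.length];
    this.next += 1;
    return proxy;
  }

  // A target banned this IP: drop it and cycle in a standby instance.
  retire(banned) {
    this.active = this.active.filter((p) => p !== banned);
    if (this.spares.length > 0) this.active.push(this.spares.shift());
  }
}

const pool = new ProxyPool(
  ['10.0.0.1:3128', '10.0.0.2:3128'], // hypothetical squid endpoints
  ['192.168.5.9:3128']                // standby VM with a fresh IP
);
pool.retire('10.0.0.1:3128'); // banned IP replaced by the standby one
```

The point of the design is the decoupling: scrapers only ever see `pick()`, so swapping netblocks or geolocations is invisible to them.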

~~~
mhluongo
I deal with this every day. Eventually we gave up on the proxy pool idea, and
started running the headless browsers in the pool as Selenium nodes. It's
definitely not easy- for example, we've also had to build infrastructure that
helps keep track of IPs and their history.

We've open-sourced part of it.
[https://github.com/cardforcoin/shale](https://github.com/cardforcoin/shale)
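The "keep track of IPs and their history" part might look something like the sketch below: record each ban, and prefer the IP that was banned longest ago so recently burned IPs get time to cool off. To be clear, this is not shale's API (shale is a Clojure service for managing Selenium nodes); the names here are invented for illustration.

```javascript
// Illustrative IP-history tracker: each ban is recorded with a
// timestamp, and new work goes to the "coolest" IP (oldest or no ban).
class IpHistory {
  constructor(ips) {
    // Map of IP -> timestamp of its most recent ban (0 = never banned).
    this.lastBanned = new Map(ips.map((ip) => [ip, 0]));
  }

  recordBan(ip, when) {
    this.lastBanned.set(ip, when);
  }

  // Pick the IP with the oldest (or no) ban on record.
  coolest() {
    let best = null;
    for (const [ip, t] of this.lastBanned) {
      if (best === null || t < this.lastBanned.get(best)) best = ip;
    }
    return best;
  }
}

const hist = new IpHistory(['1.1.1.1', '2.2.2.2', '3.3.3.3']);
hist.recordBan('1.1.1.1', 1000);
hist.recordBan('3.3.3.3', 2000);
```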

------
dalacv
Since this looks to be a product targeting businesses, I would say you might
want to tone down the techie speak on the front page. Talk more about the
benefits and value you will bring to your customers.

~~~
TheEspion
Thanks, we'll definitely do that.

------
ivanhoe
We run an in-house data extraction infrastructure very similar to this
(spiders, PhantomJS, OCR, anonymous proxies, etc.), and it indeed takes some
time to set up properly. The main problem I see with turning this operation
into a SaaS product is that no matter how big an IP pool you have, if you have
a significant number of clients, those IPs will eventually all get
blacklisted. Small players, by contrast, generate a small amount of traffic,
can fly below the radar, and can thus offer the same service cheaper.

------
gingerlime
It sounds like a bit of a grey area, in between legit use and something
slightly dodgy (mentioning IP blacklisting and overriding CAPTCHAs...)

I'm curious to know what kind of companies / websites would need this kind of
service. And wouldn't this put the provider (the website being scraped) and
consumer (the people using this service) at an arms race? Can one build a
solid business on this basis?

I'm genuinely curious about the use-case, not trying to criticize or form any
judgment.

~~~
TheEspion
CAPTCHAs and IP blacklistings are things you encounter routinely in perfectly
legitimate and legal web scraping projects. Typical examples are businesses
that want to monitor their competitors' prices or the second-hand market for
their products.

We have plans to actively prevent the use of our platform for illegitimate
purposes (fraud, spam, etc.).

~~~
aikah
> CAPTCHAs and IP blacklistings are things you encounter routinely in
> perfectly legitimate and legal web scraping projects

It's only legal if the site's TOS says it is. You don't get to decide whether
you can scrape websites legally or not. People got sued for scraping, trust
me. And it's not even about fraud or spam.

~~~
TheEspion
It's more complex than that. Whether the site's TOS are enforceable is a
matter of jurisdiction and may depend on the intent of the web scraper.

Ryanair, for example, has lost several cases where it tried to forbid scraping
of its website, on the grounds that data scraping promoted free competition
and served consumers in general. See the latest decision here:
[https://uk.finance.yahoo.com/news/ryanair-suffers-setback-german-screen-141456801.html](https://uk.finance.yahoo.com/news/ryanair-suffers-setback-german-screen-141456801.html).

~~~
Jayd2014
Does the location of Espion in Mauritius have anything to do with this? Are
the servers located there? Good design of the site and good work.

~~~
TheEspion
Actually no, and I don't think our location would protect us from legal
liability. Our servers are outside Mauritius; hosting locally would be very
expensive and would add latency.

------
logn
What if instead of scraping, customers use the JS injection features for
spamming?

How are you approaching the liability issues involved, given that potentially
anyone can sign up, create workloads of varying levels of evil, and without
you necessarily vetting your customers?

I'm genuinely curious, not ranting.

~~~
TheEspion
We approach this the same way e-commerce or gaming companies approach fraud:
we'll actively monitor the infrastructure and set up safeguards to prevent
spam and other illegitimate activities.

------
ankimal
Can you talk about performance? Having written both vanilla non-js scrapers
(which get you far enough) and js scrapers for those pesky web pages, I can
confirm that our js scrapes took 10-20x the time.

Also, I am assuming dynamically changing IP addresses when you may get
blacklisted is built in?

~~~
TheEspion
The entire anonymous IP infrastructure is indeed built-in. Yes, performance is
lower than non-JS-enabled scraping. To counter this we offer a variety of
options to turn on/off performance trade-offs. They can be set per job and per
page.
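The per-job/per-page trade-offs mentioned above could plausibly be modelled as option merging: job-level defaults with per-page overrides. The option names below (`loadImages`, `runExternalScripts`, `timeoutMs`) are guesses at the kind of knobs meant, not Espion's actual API.

```javascript
// Hypothetical job-level defaults for speed/fidelity trade-offs.
const jobDefaults = {
  loadImages: false,        // skipping images is a big speed win
  runExternalScripts: true, // keep the site's own JS so the DOM builds
  timeoutMs: 30000,
};

// Per-page overrides win over the job-level defaults.
function optionsFor(pageOverrides) {
  return Object.assign({}, jobDefaults, pageOverrides || {});
}

// One page in the job happens to need images (e.g. for OCR).
const screenshotPage = optionsFor({ loadImages: true });
```

(PhantomJS exposes similar knobs for real, e.g. `page.settings.loadImages`.)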

------
volker48
At first I thought it was going to be a framework like this
[https://medialab.github.io/artoo/](https://medialab.github.io/artoo/), but it
seems a little more nefarious than that.

------
ivanca
This is an interesting idea, but what I would pay for is a GUI for scraping.
For example: clicking an element, seeing what classes it has, and being able
to click each class to see if it matches the elements I need. And then
generating the code for the actions I did in the GUI, maybe even being able to
do both (modify the JS and still work with the GUI when required; yes, it's
hard but possible).

The other major feature I need is paths of execution, for example if there are
two possible pages after certain step (think if-else) I want those views
visualized as interconnected nodes.

~~~
juddernaught
If you are looking for a GUI for scraping, check out kimono labs. It's a GUI
tool for scraping, and requires no code to set up and can find all similar
elements from the one you click on. It supports pagination and other types of
scrapes too.

------
baldfat
All I can say is ...

1) Inject JavaScript into web pages

2) An IP pool to hide your own IP address

3) Extract text from images and solve CAPTCHAs

Do you pay these guys in bitcoin on tor?

------
AlwaysBCoding
Just wondering, can the scraper record video of the site it's on? I feel like
that's something I have a lot of use cases for. Even if it's not possible is
there some hack you could do by taking a snapshot every 1/30th of a second and
stitching it together into a 30fps video? Always been curious about this.

~~~
TheEspion
That's a feature we never thought of. I can't promise it will be available at
release time, but quite possibly later next year.

------
jalfresi
The packages don't state whether the number of pages is per month, per day or
per hour. We currently scrape well over 5 million pages an hour for a lot less
(although, much like you, we are geared up for such loads), but it would be
interesting to see the cost per number of pages per hour you charge for odd
jobs/one-offs.

~~~
TheEspion
The packages are for a set number of pages, with no timeframe. I don't have
prices yet for a load in the millions of pages per hour. What kind of system
do you feed data into?

~~~
jalfresi
A very large array of MySQL databases. Basically, each month we fire up a
fresh new database and start streaming data into it. We're currently pulling
around 700 GB a month. Our reporting tools/systems run queries across this
array. It's actually not that bad speed-wise (reports of over 9,000 keywords
over a 1-week period for top-100 positions on a per-hour basis).
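The month-per-database scheme described above can be sketched as a simple sharding function: map a date to a shard name, and fan reporting queries out over the shards covering the requested period. The naming convention here is invented for illustration, not jalfresi's actual schema.

```javascript
// Hypothetical monthly-shard naming: one database per calendar month.
function shardName(date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, '0');
  return `scrape_${y}_${m}`;
}

// Which shards does a reporting query over [from, to] need to hit?
function shardsFor(from, to) {
  const shards = [];
  const cur = new Date(Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), 1));
  while (cur <= to) {
    shards.push(shardName(cur));
    cur.setUTCMonth(cur.getUTCMonth() + 1);
  }
  return shards;
}
```

A nice property of this layout is that old months become effectively read-only, so they can be compacted or archived without touching the live ingest database.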

------
Axsuul
What's the benefit of injecting JavaScript into the page when it comes to
scraping?

~~~
TheEspion
You gain access to the page's own DOM and JavaScript, which lets you call the
site's functions to fetch data if it helps you. You can also code your scraper
with the same techniques you use when building a page: jQuery, CSS selectors,
etc - basically all the good interfaces that have been developed over the past
20 years are available to your scraping code.
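A snippet you might inject could look like the sketch below: inside the headless browser, `document` is the page's live DOM, so CSS selectors come for free. Here a minimal stub stands in for the DOM so the extraction logic can run standalone; the selectors and fields are made up for illustration.

```javascript
// Extraction logic you could inject into a product-listing page.
// In the real injected script you would call extractListings(document).
function extractListings(doc) {
  return Array.from(doc.querySelectorAll('.listing')).map((row) => ({
    name: row.querySelector('.name').textContent.trim(),
    price: parseFloat(
      row.querySelector('.price').textContent.replace(/[^0-9.]/g, '')
    ),
  }));
}

// Tiny stub DOM standing in for the real page during this sketch.
const stubRow = {
  querySelector: (sel) =>
    sel === '.name'
      ? { textContent: '  Widget  ' }
      : { textContent: '$1,299.50' },
};
const stubDoc = {
  querySelectorAll: (sel) => (sel === '.listing' ? [stubRow] : []),
};
```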

------
xyby
Who is your target customer?

~~~
TheEspion
We target businesses that want to fulfil their web data extraction needs in-
house rather than hiring a third-party provider – which is what we originally
were. Espion was built for our own needs at first, and we're still in the
customer discovery phase. I expect we'll find many use cases in the coming
months.

~~~
xyby

        businesses that want to fulfil their web data extraction needs in-house
    

Well, since you offer a web scrape tool, that's obvious :) But who is the
typical customer who wants to do that?

~~~
ivanhoe
As someone who has provided scraping services for years, I can indeed confirm
there are a lot of totally legit businesses that need the ability to compile
various online data: from press clipping and social media trend monitoring,
through various data-mining and analysis tasks, to price comparison and
building dropship inventories. These days everyone talks about big data and
data analysis, but you first need to collect that data.

------
LunaSea
How is it different than Phantom.js or Zombie.js ?

------
codexon
Does this scrape website behind encapsula?

~~~
TheEspion
Yes!

~~~
codexon
That's interesting. You should advertise that.

For anyone else wondering, it is actually spelled Incapsula, I would edit my
original comment if I could anymore.

------
dmritard96
You can do this with Selenium also, btw.

