Espion is similar to PhantomJS in that they're both headless browsers that you can inject JavaScript into.
Espion comes with a lot more. First there is the infrastructure: processing power, storage, connectivity and IP addresses that you don't have to provision, set up or manage. Second, Espion includes the features that surround extracting data from a site such as job scheduling, data quality monitoring, online debugging and problem resolution and data delivery.
PhantomJS is perfectly viable, but if you use it and need the features I highlighted, you'll have to build a lot yourself to get the job done.
Or for scraping anything non-trivial, do it yourself with Casperjs, AWS/Linode/DO, RabbitMQ and a bit of monitoring/alerting from someone like Datadog. It will be cheaper and a lot more flexible.
Edit: realised the above sounds a bit harsh. I'm sure Espion can fill a gap where clients need to scrape a limited amount of non-volatile data and don't have the time to set up and manage something on their own.
An easy solution to this is to have your scrapers on one network and a fleet of squid proxies on another. As and when an IP gets banned you just cycle in another squid instance on a new VM with a fresh IP (which can be from a completely different netblock/geolocation if you want). You don't have to touch the rest of the scraping apparatus.
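The cycling idea above can be sketched as a small pool manager. Everything here is invented for illustration (names, structure) and isn't tied to squid or any particular proxy setup:

```javascript
// Minimal sketch of proxy cycling: keep a pool of proxy endpoints,
// hand them out round-robin, and when one gets banned, retire it and
// optionally swap in a fresh one. The scraping code itself never
// changes - only the pool membership does.
class ProxyPool {
  constructor(proxies) {
    this.active = [...proxies]; // e.g. ["10.0.1.5:3128", ...]
    this.banned = new Set();    // kept for auditing/debugging
    this.cursor = 0;
  }

  // Round-robin over the currently active proxies.
  next() {
    if (this.active.length === 0) {
      throw new Error("no proxies left - provision a new VM");
    }
    const proxy = this.active[this.cursor % this.active.length];
    this.cursor += 1;
    return proxy;
  }

  // Called when a target site bans this proxy's IP; the replacement
  // can come from a completely different netblock or region.
  retire(proxy, replacement) {
    this.banned.add(proxy);
    this.active = this.active.filter((p) => p !== proxy);
    if (replacement) this.active.push(replacement);
  }
}
```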
I deal with this every day. Eventually we gave up on the proxy pool idea and started running the headless browsers in the pool as Selenium nodes. It's definitely not easy; for example, we've also had to build infrastructure that helps keep track of IPs and their history.
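As a rough illustration of what "keeping track of IPs and their history" can mean, here's a minimal sketch of per-IP ban records; the field names and cooldown logic are invented for the example:

```javascript
// Record ban events per IP so a scheduler can avoid sending an IP
// back to a site that recently banned it. Purely illustrative.
class IpHistory {
  constructor() {
    this.events = new Map(); // ip -> [{site, bannedAt}, ...]
  }

  recordBan(ip, site, when = Date.now()) {
    if (!this.events.has(ip)) this.events.set(ip, []);
    this.events.get(ip).push({ site, bannedAt: when });
  }

  // An IP is "cooling off" for a site if it was banned there within
  // the given window (default: 24 hours).
  isCoolingOff(ip, site, windowMs = 24 * 3600 * 1000, now = Date.now()) {
    return (this.events.get(ip) || []).some(
      (e) => e.site === site && now - e.bannedAt < windowMs
    );
  }
}
```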
Since this looks to be a product targeting businesses, I would say you might want to dumb down your techie speak on the front page. Talk to more of the benefits and value that you will bring to your customers.
We run an in-house data extraction stack very similar to this (spiders, PhantomJS, OCR, anonymous proxies, etc.), and it does indeed take some time to set up properly. The main problem I see with turning this operation into a SaaS product is that no matter how big an IP pool you have, if you have a significant number of clients, those IPs will eventually all get blacklisted. Unlike the small players, who generate a small amount of traffic and can fly below the radar (and thus offer the same service more cheaply).
It sounds a bit like a grey area, in between legitimate use and something slightly dodgy (mentioning IP blacklisting and overriding CAPTCHAs...)
I'm curious to know what kind of companies/websites would need this kind of service. And wouldn't this put the provider (the website being scraped) and the consumer (the people using this service) in an arms race? Can one build a solid business on this basis?
I'm genuinely curious about the use-case, not trying to criticize or form any judgment.
CAPTCHAs and IP blacklisting are things you encounter routinely in perfectly legitimate and legal web scraping projects. Typical examples are businesses that want to monitor their competitors' prices or the second-hand market for their products.
We have plans to actively prevent the use of our platform for illegitimate purposes (fraud, spam, etc.).
> CAPTCHAs and IP blacklistings are things you encounter routinely in perfectly legitimate and legal web scraping projects
It's only legal if the site's TOS says it is. You don't get to decide whether you can scrape websites legally or not. People have been sued for scraping, trust me. And it's not even about fraud or spam.
It's more complex than that. Whether the site's TOS are enforceable is a matter of jurisdiction and may depend on the intent of the web scraper.
Ryanair, for example, has lost several cases where they tried to forbid scraping their website, on the grounds that data scraping promoted free competition and served consumers in general. See the latest decision here: https://uk.finance.yahoo.com/news/ryanair-suffers-setback-ge....
The legality may also depend on the type of data being collected. For example, it is likely safer to scrape Yelp to gather public facts like business locations and phone numbers versus if the data is "copyrightable" like customer reviews. Both, however, would violate Yelp's TOS. See: http://streetfightmag.com/2013/03/04/legal-battles-erupt-ove...
Actually no, and I don't think our location would protect us from legal liability. Our servers are outside Mauritius, hosting locally would be very expensive and induce latency.
Very interesting question. HTML5 audio and video are definitely a possibility, if there's a strong use case. If you have a specific idea, I'd be interested to hear it.
I don't expect Flash or Java support. Flash and Java apps that load their data from standard HTTP resources can be scraped regardless of support, the others need reverse engineering anyway and wouldn't fit with HTML scraping.
I use phantomjs as a headless audio player by loading youtube/soundcloud playlists and letting it play, but there is no official flash support on phantomjs anymore.
Sorry if you were expecting some novel idea, but it's just me trying to use a measuring tape to drive a screw.
What if instead of scraping, customers use the JS injection features for spamming?
How are you approaching the liability issues involved, given that potentially anyone can sign up, create workloads of varying levels of evil, and without you necessarily vetting your customers?
We approach this the same way e-commerce or gaming companies approach fraud: we'll actively monitor the infrastructure and set up safeguards to prevent spam and other illegitimate activities.
Can you talk about performance? Having written both vanilla non-js scrapers (which get you far enough) and js scrapers for those pesky web pages, I can confirm that our js scrapes took 10-20x the time.
Also, I am assuming dynamically changing IP addresses when you may get blacklisted is built in?
The entire anonymous IP infrastructure is indeed built-in. Yes, performance is lower than non-JS-enabled scraping. To counter this we offer a variety of options to turn on/off performance trade-offs. They can be set per job and per page.
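To give an idea of what per-job performance trade-offs typically look like in JS-enabled scraping, here's a hypothetical settings object; these option names are made up for illustration and are not Espion's actual API:

```javascript
// Hypothetical per-job settings illustrating the usual trade-offs
// when scraping with a full browser. Every name here is invented.
const jobConfig = {
  runJavascript: true,          // render dynamic pages (slower, but the whole point)
  loadImages: false,            // skip image downloads to save time and bandwidth
  loadCss: false,               // styling rarely matters for data extraction
  pageTimeoutMs: 15000,         // give up on pages that hang
  waitForSelector: ".results",  // proceed as soon as the target data is in the DOM
};
```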
This is an interesting idea, but what I would pay for is a GUI for scraping. For example: clicking an element, seeing what classes it has, and being able to click each class to see if it matches the elements I need. And then generating the code for the actions I performed in the GUI, maybe even being able to do both (modify the JS and still work with the GUI when required; yeah, it's hard but possible).
The other major feature I need is paths of execution, for example if there are two possible pages after certain step (think if-else) I want those views visualized as interconnected nodes.
If you are looking for a GUI for scraping, check out kimono labs. It's a GUI tool for scraping, and requires no code to set up and can find all similar elements from the one you click on. It supports pagination and other types of scrapes too.
Selenium could do that for you, and there's an IDE available as a Firefox plugin; branching might be tricky and you'd need to build some tooling around it in general, but it might give you a place to start.
Just wondering: can the scraper record video of the site it's on? I feel like I have a lot of use cases for that. Even if it's not possible, is there some hack you could do by taking a snapshot every 1/30th of a second and stitching them together into a 30fps video? I've always been curious about this.
The packages don't state whether the number of pages is per month, day or hour. We currently scrape well over 5 million pages an hour for a lot less (although, much like you, we are geared up for such loads), but it would be interesting to see the cost per number of pages per hour you'd charge for odd jobs/one-offs.
The packages are for a set number of pages, with no timeframe. I don't have prices yet for a load in the millions of pages per hour. What kind of system do you feed data into?
A very large array of MySQL databases. Basically, each month we fire up a fresh new database and start streaming data into it. We're currently pulling around 700 GB a month. Our reporting tools/systems run queries across this array. It's actually not that bad speed-wise (reports of over 9,000 keywords over a one-week period for top-100 positions on a per-hour basis).
It's a very single-purpose system, so probably not amenable to general-purpose crawling (though we do have a separate system, based on the design of our core system, that is a general-purpose web crawler/indexer).
Yes, and injecting JS into the page for easy analysis/collection. The delay is dynamic, based on captcha rates, proxy load and historic captcha costs. It constantly checks the current running costs and throttles the number of requests over the hour.
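The throttling described above might be sketched roughly like this; the formula, factors and thresholds are all invented for illustration, not the actual system:

```javascript
// Rough sketch of cost-aware throttling: the higher the observed
// CAPTCHA rate, or the closer we are to the hourly cost budget, the
// longer the delay between requests. All numbers are illustrative.
function nextDelayMs(captchaRate, hourlyCostSoFar, hourlyBudget, baseDelayMs = 1000) {
  // Scale delay up as CAPTCHAs become more frequent (captchaRate in [0, 1]).
  const captchaFactor = 1 + 10 * captchaRate;
  // Back off hard once we approach the hourly cost budget.
  const budgetUsed = hourlyCostSoFar / hourlyBudget;
  const costFactor = budgetUsed < 0.8 ? 1 : 1 + (budgetUsed - 0.8) * 20;
  return Math.round(baseDelayMs * captchaFactor * costFactor);
}
```

A loop would recompute this before every request, so the rate adapts continuously as CAPTCHA hits and costs accumulate over the hour.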
You gain access to the page's own DOM and JavaScript, which lets you call the site's functions to fetch data if it helps you. You can also code your scraper with the same techniques you use when building a page: jQuery, CSS selectors, etc - basically all the good interfaces that have been developed over the past 20 years are available to your scraping code.
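Injected scraping code ends up looking like ordinary front-end code. Here's a minimal sketch of the idea, with a tiny stub standing in for the page's real `document` (selector and page structure invented) so it runs standalone:

```javascript
// Stand-in for the real page DOM; in an injected script you would
// use the actual `document` (or jQuery, if the site ships it).
const document = {
  querySelectorAll: (selector) =>
    selector === ".price"
      ? [{ textContent: " $19.99 " }, { textContent: "$24.50" }]
      : [],
};

// The kind of function you'd inject: walk nodes matching a CSS
// selector and collect their text content.
function extractPrices(doc) {
  return Array.from(doc.querySelectorAll(".price")).map((el) =>
    el.textContent.trim()
  );
}
```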
We target businesses that want to fulfil their web data extraction needs in-house rather than hiring a third-party provider – which is what we originally were. Espion was initially built for our own needs, and we're still in the customer discovery phase. I expect we'll find many use cases in the coming months.
The general sentiment in this thread pretty much sums up the idea of "scraping as a service". To me, there is definitely a legitimate business need to be able to scrape. Whether people realize it or not, companies have in-house teams that build custom scraping tools. The challenge for you is going to be screening out the bad actors who might use your service to do things you would not approve of.
As someone who has provided scraping services for years, I can indeed confirm there are a lot of totally legitimate businesses that need the ability to compile various online data: from press clipping and social media trend monitoring, through various data mining and analysis tasks, to price comparison and building dropship inventories. These days everyone talks about big data and data analysis, but you first need to collect that data.