Show HN: Scrape the web by injecting JavaScript into web pages (espion.io)
109 points by TheEspion on Dec 23, 2014 | 54 comments



This is cheerio with Node.js and extra bells and whistles. Example: http://projs.hackhat.com/the-making-of-the-hackernews-crawle...


Sounds more like phantomjs/casperjs.


Espion is similar to PhantomJS in that they're both headless browsers that you can inject JavaScript into.

Espion comes with a lot more. First there is the infrastructure: processing power, storage, connectivity, and IP addresses that you don't have to provision, set up, or manage. Second, Espion includes the features that surround extracting data from a site, such as job scheduling, data quality monitoring, online debugging and problem resolution, and data delivery.

PhantomJS is perfectly viable, but if you use it and need the features I highlighted, you'll have to build a lot yourself to get the job done.
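For comparison, here's roughly what a bare-bones PhantomJS scrape looks like (a minimal sketch; the URL and selectors are placeholders):

    // scrape.js - run with: phantomjs scrape.js
    var page = require('webpage').create();

    page.open('http://example.com/', function (status) {
      if (status !== 'success') {
        console.error('Failed to load page');
        phantom.exit(1);
      }

      // evaluate() runs this function inside the page's own context,
      // so it has access to the live DOM.
      var headings = page.evaluate(function () {
        return Array.prototype.map.call(
          document.querySelectorAll('h1, h2'),
          function (el) { return el.textContent.trim(); }
        );
      });

      console.log(JSON.stringify(headings));
      phantom.exit();
    });

Everything around that script (scheduling, retries, storage, IP rotation) is the part you'd have to build yourself.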


So, is the Espion headless browser based on an existing one? If this is the case, which one?


It's just PhantomJS wrapped up behind a cloud.


Or for scraping anything non-trivial, do it yourself with CasperJS, AWS/Linode/DO, RabbitMQ, and a bit of monitoring/alerting from someone like Datadog. It will be cheaper and a lot more flexible.

Edit: realised the above sounds a bit harsh. I'm sure Espion can fill a gap where clients need to scrape a limited amount of non-volatile data and don't have the time to set up and manage something on their own.


You will still need to manage the pool of IPs, which is sometimes the hardest part, as it has to be maintained over time. All the other setup is usually a one-time job.


An easy solution to this is to have your scrapers on one network and a fleet of squid proxies on another. As and when an IP gets banned you just cycle in another squid instance on a new VM with a fresh IP (which can be from a completely different netblock/geolocation if you want). You don't have to touch the rest of the scraping apparatus.
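Roughly this shape, in Node.js terms (a hypothetical sketch - the proxy addresses and ban detection are made up, and provisioning the replacement VM happens out of band):

    // Hypothetical pool of squid proxies, one per VM.
    var proxies = [
      'http://squid-01.internal:3128',
      'http://squid-02.internal:3128',
      'http://squid-03.internal:3128'
    ];
    var current = 0;

    function currentProxy() {
      return proxies[current];
    }

    // Call this when a scrape through the current proxy hits a ban
    // page or starts getting 403s: drop the burned proxy and move on.
    function cycleProxy() {
      console.log('Retiring banned proxy: ' + proxies[current]);
      proxies.splice(current, 1);
      if (proxies.length === 0) {
        throw new Error('Proxy pool exhausted - provision more squid VMs');
      }
      current = current % proxies.length;
      return currentProxy();
    }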


I deal with this every day. Eventually we gave up on the proxy pool idea and started running the headless browsers in the pool as Selenium nodes. It's definitely not easy; for example, we've also had to build infrastructure that helps keep track of IPs and their history.

We've open-sourced part of it. https://github.com/cardforcoin/shale
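For anyone unfamiliar with the setup: a scraper script talks to one of the pooled nodes through the standard remote WebDriver protocol, something like this (a sketch using the selenium-webdriver Node package; the node address is a placeholder - shale's job is deciding which node you get):

    var webdriver = require('selenium-webdriver');

    // Point the client at a remote Selenium node instead of a local
    // browser; each node in the pool sits behind its own IP.
    var driver = new webdriver.Builder()
      .usingServer('http://selenium-node-01:4444/wd/hub')  // placeholder
      .forBrowser('firefox')
      .build();

    driver.get('http://example.com/')
      .then(function () { return driver.getTitle(); })
      .then(function (title) {
        console.log('Page title: ' + title);
        return driver.quit();
      });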


For some definition of easy, I guess.


Since this looks to be a product targeting businesses, I would say you might want to dumb down the techie speak on the front page. Talk more about the benefits and value that you will bring to your customers.


Thanks, we'll definitely do that.


We run an in-house data extraction infrastructure very similar to this (spiders, PhantomJS, OCR, anonymous proxies, etc.), and it indeed takes some time to set up properly. The main problem I see with turning this operation into a SaaS product is that no matter how big an IP pool you have, if you have a significant number of clients those IPs will eventually all get blacklisted - unlike the small players, who generate a small amount of traffic, can run below the radar, and can thus offer the same service cheaper.


It sounds like a bit of a grey area, in between legitimate use and something slightly dodgy (mentioning IP blacklisting and bypassing CAPTCHAs...).

I'm curious to know what kinds of companies/websites would need this kind of service. And wouldn't this put the provider (the website being scraped) and consumer (the people using this service) in an arms race? Can one build a solid business on this basis?

I'm genuinely curious about the use-case, not trying to criticize or form any judgment.


CAPTCHAs and IP blacklistings are things you encounter routinely in perfectly legitimate and legal web scraping projects. Typical examples are businesses that want to monitor their competitors' prices or the second-hand market for their products.

We have plans to actively prevent the use of our platform for illegitimate purposes (fraud, spam, etc.).


> CAPTCHAs and IP blacklistings are things you encounter routinely in perfectly legitimate and legal web scraping projects

It's only legal if the site's TOS says it is. You don't get to decide whether you can scrape websites legally or not. People have been sued for scraping, trust me. And it's not even about fraud or spam.


It's more complex than that. Whether the site's TOS are enforceable is a matter of jurisdiction and may depend on the intent of the web scraper.

Ryanair, for example, has lost several cases in which it tried to forbid scraping of its website, on the grounds that data scraping promoted free competition and served consumers in general. See the latest decision here: https://uk.finance.yahoo.com/news/ryanair-suffers-setback-ge...


The legality may also depend on the type of data being collected. For example, it is likely safer to scrape Yelp for public facts like business locations and phone numbers than for "copyrightable" data like customer reviews. Both, however, would violate Yelp's TOS. See: http://streetfightmag.com/2013/03/04/legal-battles-erupt-ove...


Does the location of Espion in Mauritius have anything to do with this? Are the servers located there? Nice site design, and good work.


Actually no, and I don't think our location would protect us from legal liability. Our servers are outside Mauritius; hosting locally would be very expensive and would add latency.


Will it support Java, Flash and HTML5 audio and video?


Very interesting question. HTML5 audio and video are definitely a possibility, if there's a strong use case. If you have a specific idea, I'd be interested to hear it.

I don't expect Flash or Java support. Flash and Java apps that load their data from standard HTTP resources can be scraped regardless of support; the others need reverse engineering anyway and wouldn't fit with HTML scraping.


I use PhantomJS as a headless audio player by loading YouTube/SoundCloud playlists and letting them play, but there is no official Flash support in PhantomJS anymore. Sorry if you were expecting some novel idea; it's just me trying to use a measuring tape to drive a screw.


What if instead of scraping, customers use the JS injection features for spamming?

How are you approaching the liability issues involved, given that potentially anyone can sign up, create workloads of varying levels of evil, and without you necessarily vetting your customers?

I'm genuinely curious, not ranting.


We approach this the same way e-commerce or gaming companies approach fraud: we'll actively monitor the infrastructure and set up safeguards to prevent spam and other illegitimate activities.


Can you talk about performance? Having written both vanilla non-JS scrapers (which get you far enough) and JS scrapers for those pesky web pages, I can confirm that our JS scrapes took 10-20x the time.

Also, I am assuming dynamically changing IP addresses when you may get blacklisted is built in?


The entire anonymous IP infrastructure is indeed built in. Yes, performance is lower than with non-JS-enabled scraping. To counter this we offer a variety of options to turn performance trade-offs on and off. They can be set per job and per page.
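To give a concrete sense of the kinds of trade-offs involved (illustrated here with PhantomJS settings rather than our actual API):

    var page = require('webpage').create();

    // Typical per-page performance knobs in a JS-enabled scraper:
    page.settings.loadImages = false;      // skip image downloads
    page.settings.resourceTimeout = 5000;  // give up on slow resources (ms)

    // Abort requests to third-party trackers before they load.
    page.onResourceRequested = function (requestData, networkRequest) {
      if (/google-analytics\.com|doubleclick\.net/.test(requestData.url)) {
        networkRequest.abort();
      }
    };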


At first I thought it was going to be a framework like this https://medialab.github.io/artoo/, but it seems a little more nefarious than that.


This is an interesting idea, but what I would pay for is a GUI for scraping. For example: clicking an element, seeing what classes it has, and being able to click each class to see if it matches the elements I need - and then generating the code for those actions I performed in the GUI. Maybe even both (modify the JS and still work with the GUI when required; yeah, it's hard but possible).

The other major feature I need is paths of execution: for example, if there are two possible pages after a certain step (think if-else), I want those views visualized as interconnected nodes.


If you are looking for a GUI for scraping, check out Kimono Labs. It's a GUI tool for scraping that requires no code to set up and can find all elements similar to the one you click on. It supports pagination and other types of scrapes too.


Selenium could do that for you, and there's an IDE available as a Firefox plugin; branching might be tricky and you'd need to build some tooling around it in general, but it might give you a place to start.


All I can say is ...

1) Inject JavaScript into web pages
2) IP pool to hide your own IP address
3) Extract text from images and solve CAPTCHAs

Do you pay these guys in Bitcoin over Tor?


Just wondering, can the scraper record video of the site it's on? I feel like that's something I have a lot of use cases for. Even if it's not possible, is there some hack you could do by taking a snapshot every 1/30th of a second and stitching them together into a 30fps video? Always been curious about this.
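Something like this, maybe (a rough sketch with PhantomJS's page.render plus ffmpeg; I have no idea how well the timing would hold up in practice):

    var page = require('webpage').create();
    var frame = 0;

    page.open('http://example.com/', function () {
      // Grab a screenshot roughly every 33ms for ~10 seconds.
      var timer = setInterval(function () {
        page.render('frames/frame-' + (frame++) + '.png');
        if (frame >= 300) {
          clearInterval(timer);
          phantom.exit();
        }
      }, 33);
    });

    // Then stitch the frames into a 30fps video:
    //   ffmpeg -framerate 30 -i frames/frame-%d.png -c:v libx264 out.mp4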


That's a feature we never thought of. I can't promise it will be available at release time, but quite possibly later next year.


The packages don't state whether the number of pages is per month, per day, or per hour. We currently scrape well over 5 million pages an hour for a lot less (although, much like you, we are geared up for such loads), but it would be interesting to see the cost per number of pages per hour you charge for odd jobs/one-offs.


The packages are for a set number of pages, with no timeframe. I don't have prices yet for a load in the millions of pages per hour. What kind of system do you feed data into?


A very large array of MySQL databases. Basically, each month we fire up a fresh database and start streaming data into it. We're currently pulling around 700 GB a month. Our reporting tools/systems run queries across this array. It's actually not that bad speed-wise (reports of over 9,000 keywords over a one-week period for top-100 positions on a per-hour basis).


If you are able to crawl that volume as cheaply as you say, you should definitely be offering it as a service.


It's a very single-purpose system, so probably not amenable to general-purpose crawling (though we do have a separate system, based on the design of our core system, that is a general-purpose web crawler/indexer).


Are you rendering the JavaScript on those pages? What's your in-domain delay?


Yes, and we inject JS into the page for easy analysis/collection. The delay is dynamic, based on CAPTCHA rates, proxy load, and historic CAPTCHA costs. The system constantly checks the current running costs and throttles the number of requests over the hour.
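The throttling logic is roughly this shape (a simplified sketch, not our actual code or numbers):

    // Adjust requests-per-hour based on how often we hit CAPTCHAs.
    var budget = {
      targetCaptchaRate: 0.02,  // acceptable fraction of requests
      requestsPerHour: 1000,    // current throttle
      min: 100,
      max: 5000
    };

    function adjustThrottle(captchasSeen, requestsMade) {
      var rate = captchasSeen / requestsMade;
      if (rate > budget.targetCaptchaRate) {
        // Too hot: back off to cut CAPTCHA (and solving-cost) pressure.
        budget.requestsPerHour =
          Math.max(budget.min, budget.requestsPerHour * 0.5);
      } else {
        // Running cool: slowly ramp back up.
        budget.requestsPerHour =
          Math.min(budget.max, budget.requestsPerHour * 1.1);
      }
      return budget.requestsPerHour;
    }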


Is there any place I can read more about the technical side of this? I would love to know how you achieve these rates.


What's the benefit of injecting JavaScript into the page when it comes to scraping?


You gain access to the page's own DOM and JavaScript, which lets you call the site's functions to fetch data if that helps. You can also code your scraper with the same techniques you use when building a page: jQuery, CSS selectors, etc. Basically, all the good interfaces that have been developed over the past 20 years are available to your scraping code.
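For example (a sketch using PhantomJS's page.evaluate for concreteness; the URL and selectors are made up, and it assumes the target page bundles jQuery):

    var page = require('webpage').create();

    page.open('http://example.com/products', function () {
      // Runs inside the page's own context, so the site's bundled
      // jQuery (and any data-loading functions it defines) is callable.
      var products = page.evaluate(function () {
        return $('.product').map(function () {
          return {
            name:  $(this).find('.title').text().trim(),
            price: $(this).find('.price').text().trim()
          };
        }).get();
      });

      console.log(JSON.stringify(products));
      phantom.exit();
    });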


Who is your target customer?


We target businesses that want to fulfil their web data extraction needs in-house rather than hiring a third-party provider (which is what we originally are). Espion was built for our own needs at first, and we're still in the customer discovery phase. I expect we'll find many use cases in the coming months.


The general sentiment on this thread pretty much sums up the state of "scraping as a service". To me, there is definitely a legitimate business need to be able to scrape; whether people realize it or not, companies have in-house teams that build custom scraping tools. The challenge for you is going to be weeding out the bad actors who may use your service to do things you would not approve of.


    businesses that want to fulfil their web data extraction needs in-house
Well, since you offer a web scrape tool, that's obvious :) But who is the typical customer who wants to do that?


As someone who has provided scraping services for years, I can indeed confirm there are a lot of totally legitimate businesses that need the ability to compile various online data: from press clipping and social media trend monitoring, through various data-mining and analysis tasks, to price comparison and building dropship inventories. These days everyone talks about big data and data analysis, but you first need to collect that data.


How is it different from PhantomJS or Zombie.js?


Does this scrape websites behind encapsula?


Yes!


That's interesting. You should advertise that.

For anyone else wondering, it is actually spelled Incapsula; I would edit my original comment if I still could.


You can do this with Selenium also, BTW.



