Espion is similar to PhantomJS in that they're both headless browsers that you can inject JavaScript into.
Espion comes with a lot more. First there is the infrastructure: processing power, storage, connectivity and IP addresses that you don't have to provision, set up or manage. Second, Espion includes the features that surround extracting data from a site such as job scheduling, data quality monitoring, online debugging and problem resolution and data delivery.
PhantomJS is perfectly viable, but if you use it and need the features I highlighted, you'll have to build a lot yourself to get the job done.
Or for scraping anything non-trivial, do it yourself with Casperjs, AWS/Linode/DO, RabbitMQ and a bit of monitoring/alerting from someone like Datadog. It will be cheaper and a lot more flexible.
Edit: realised the above sounds a bit harsh. I'm sure Espion can fill a gap where clients need to scrape a limited amount of non-volatile data and don't have the time to set up and manage something on their own.
An easy solution to this is to have your scrapers on one network and a fleet of squid proxies on another. As and when an IP gets banned you just cycle in another squid instance on a new VM with a fresh IP (which can be from a completely different netblock/geolocation if you want). You don't have to touch the rest of the scraping apparatus.
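The cycling idea above can be sketched as a small pool manager. Everything here is invented for illustration (names, structure) and isn't tied to squid or any particular proxy setup:

```javascript
// Minimal sketch of proxy cycling: keep a pool of proxy endpoints,
// hand them out round-robin, and when one gets banned, retire it and
// optionally swap in a fresh one. The scraping code itself never
// changes - only the pool membership does.
class ProxyPool {
  constructor(proxies) {
    this.active = [...proxies]; // e.g. ["10.0.1.5:3128", ...]
    this.banned = new Set();    // kept for auditing/debugging
    this.cursor = 0;
  }

  // Round-robin over the currently active proxies.
  next() {
    if (this.active.length === 0) {
      throw new Error("no proxies left - provision a new VM");
    }
    const proxy = this.active[this.cursor % this.active.length];
    this.cursor += 1;
    return proxy;
  }

  // Called when a target site bans this proxy's IP; the replacement
  // can come from a completely different netblock or region.
  retire(proxy, replacement) {
    this.banned.add(proxy);
    this.active = this.active.filter((p) => p !== proxy);
    if (replacement) this.active.push(replacement);
  }
}
```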
I deal with this every day. Eventually we gave up on the proxy pool idea and started running the headless browsers in the pool as Selenium nodes. It's definitely not easy; for example, we've also had to build infrastructure that helps keep track of IPs and their history.
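As a rough illustration of what "keeping track of IPs and their history" can mean, here's a minimal sketch of per-IP ban records; the field names and cooldown logic are invented for the example:

```javascript
// Record ban events per IP so a scheduler can avoid sending an IP
// back to a site that recently banned it. Purely illustrative.
class IpHistory {
  constructor() {
    this.events = new Map(); // ip -> [{site, bannedAt}, ...]
  }

  recordBan(ip, site, when = Date.now()) {
    if (!this.events.has(ip)) this.events.set(ip, []);
    this.events.get(ip).push({ site, bannedAt: when });
  }

  // An IP is "cooling off" for a site if it was banned there within
  // the given window (default: 24 hours).
  isCoolingOff(ip, site, windowMs = 24 * 3600 * 1000, now = Date.now()) {
    return (this.events.get(ip) || []).some(
      (e) => e.site === site && now - e.bannedAt < windowMs
    );
  }
}
```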
Since this looks to be a product targeting businesses, I would say you might want to dumb down your techie speak on the front page. Talk to more of the benefits and value that you will bring to your customers.
We run an in-house data extraction stack very similar to this (spiders, PhantomJS, OCR, anonymous proxies, etc.), and it does indeed take some time to set up properly. The main problem I see with turning this operation into a SaaS product is that no matter how big an IP pool you have, if you have a significant number of clients, those IPs will eventually all get blacklisted. Unlike the small players, who generate a small amount of traffic and can fly below the radar (and thus offer the same service more cheaply).
It sounds a bit like a grey area, in between legitimate use and something slightly dodgy (mentioning IP blacklisting and overriding CAPTCHAs...)
I'm curious to know what kind of companies/websites would need this kind of service. And wouldn't this put the provider (the website being scraped) and the consumer (the people using this service) in an arms race? Can one build a solid business on this basis?
I'm genuinely curious about the use-case, not trying to criticize or form any judgment.
CAPTCHAs and IP blacklisting are things you encounter routinely in perfectly legitimate and legal web scraping projects. Typical examples are businesses that want to monitor their competitors' prices or the second-hand market for their products.
We have plans to actively prevent the use of our platform for illegitimate purposes (fraud, spam, etc.).
> CAPTCHAs and IP blacklistings are things you encounter routinely in perfectly legitimate and legal web scraping projects
It's only legal if the site's TOS says it is. You don't get to decide whether you can scrape websites legally or not. People have been sued for scraping, trust me. And it's not even about fraud or spam.
It's more complex than that. Whether the site's TOS are enforceable is a matter of jurisdiction and may depend on the intent of the web scraper.
Ryanair, for example, has lost several cases where they tried to forbid scraping their website, on the grounds that data scraping promoted free competition and served consumers in general. See the latest decision here: https://uk.finance.yahoo.com/news/ryanair-suffers-setback-ge....
The legality may also depend on the type of data being collected. For example, it is likely safer to scrape Yelp to gather public facts like business locations and phone numbers versus if the data is "copyrightable" like customer reviews. Both, however, would violate Yelp's TOS. See: http://streetfightmag.com/2013/03/04/legal-battles-erupt-ove...
Actually no, and I don't think our location would protect us from legal liability. Our servers are outside Mauritius, hosting locally would be very expensive and induce latency.
Very interesting question. HTML5 audio and video are definitely a possibility, if there's a strong use case. If you have a specific idea, I'd be interested to hear it.
I don't expect Flash or Java support. Flash and Java apps that load their data from standard HTTP resources can be scraped regardless of support, the others need reverse engineering anyway and wouldn't fit with HTML scraping.
I use phantomjs as a headless audio player by loading youtube/soundcloud playlists and letting it play, but there is no official flash support on phantomjs anymore.
Sorry if you were expecting some novel idea, but it's just me trying to use a measuring tape to drive a screw.
What if instead of scraping, customers use the JS injection features for spamming?
How are you approaching the liability issues involved, given that potentially anyone can sign up, create workloads of varying levels of evil, and without you necessarily vetting your customers?
We approach this the same way e-commerce or gaming companies approach fraud: we'll actively monitor the infrastructure and set up safeguards to prevent spam and other illegitimate activities.
Can you talk about performance? Having written both vanilla non-js scrapers (which get you far enough) and js scrapers for those pesky web pages, I can confirm that our js scrapes took 10-20x the time.
Also, I am assuming dynamically changing IP addresses when you may get blacklisted is built in?
The entire anonymous IP infrastructure is indeed built-in. Yes, performance is lower than non-JS-enabled scraping. To counter this we offer a variety of options to turn on/off performance trade-offs. They can be set per job and per page.
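To give an idea of what per-job performance trade-offs typically look like in JS-enabled scraping, here's a hypothetical settings object; these option names are made up for illustration and are not Espion's actual API:

```javascript
// Hypothetical per-job settings illustrating the usual trade-offs
// when scraping with a full browser. Every name here is invented.
const jobConfig = {
  runJavascript: true,          // render dynamic pages (slower, but the whole point)
  loadImages: false,            // skip image downloads to save time and bandwidth
  loadCss: false,               // styling rarely matters for data extraction
  pageTimeoutMs: 15000,         // give up on pages that hang
  waitForSelector: ".results",  // proceed as soon as the target data is in the DOM
};
```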
This is an interesting idea, but what I would pay for is a GUI for scraping. For example: clicking an element, seeing what classes it has, and being able to click each class to see if it matches the elements I need. And then generating the code for the actions I performed in the GUI, maybe even being able to do both (modify the JS and still work with the GUI when required; yeah, it's hard but possible).
The other major feature I need is paths of execution, for example if there are two possible pages after certain step (think if-else) I want those views visualized as interconnected nodes.
If you are looking for a GUI for scraping, check out kimono labs. It's a GUI tool for scraping, and requires no code to set up and can find all similar elements from the one you click on. It supports pagination and other types of scrapes too.
Selenium could do that for you, and there's an IDE available as a Firefox plugin; branching might be tricky and you'd need to build some tooling around it in general, but it might give you a place to start.
Just wondering: can the scraper record video of the site it's on? I feel like I have a lot of use cases for that. Even if it's not possible, is there some hack you could do by taking a snapshot every 1/30th of a second and stitching them together into a 30fps video? I've always been curious about this.
The packages don't state whether the number of pages is per month, day or hour. We currently scrape well over 5 million pages an hour for a lot less (although, much like you, we are geared up for such loads), but it would be interesting to see the cost per number of pages per hour you'd charge for odd jobs/one-offs.
The packages are for a set number of pages, with no timeframe. I don't have prices yet for a load in the millions of pages per hour. What kind of system do you feed data into?
A very large array of MySQL databases. Basically, each month we fire up a fresh new database and start streaming data into it. We're currently pulling around 700 GB a month. Our reporting tools/systems run queries across this array. It's actually not that bad speed-wise (reports of over 9,000 keywords over a one-week period for top-100 positions on a per-hour basis).
It's a very single-purpose system, so probably not amenable to general-purpose crawling (though we do have a separate system, based on the design of our core system, that is a general-purpose web crawler/indexer).
Yes, and injecting JS into the page for easy analysis/collection. The delay is dynamic, based on captcha rates, proxy load and historic captcha costs. It constantly checks the current running costs and throttles the number of requests over the hour.
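The throttling described above might be sketched roughly like this; the formula, factors and thresholds are all invented for illustration, not the actual system:

```javascript
// Rough sketch of cost-aware throttling: the higher the observed
// CAPTCHA rate, or the closer we are to the hourly cost budget, the
// longer the delay between requests. All numbers are illustrative.
function nextDelayMs(captchaRate, hourlyCostSoFar, hourlyBudget, baseDelayMs = 1000) {
  // Scale delay up as CAPTCHAs become more frequent (captchaRate in [0, 1]).
  const captchaFactor = 1 + 10 * captchaRate;
  // Back off hard once we approach the hourly cost budget.
  const budgetUsed = hourlyCostSoFar / hourlyBudget;
  const costFactor = budgetUsed < 0.8 ? 1 : 1 + (budgetUsed - 0.8) * 20;
  return Math.round(baseDelayMs * captchaFactor * costFactor);
}
```

A loop would recompute this before every request, so the rate adapts continuously as CAPTCHA hits and costs accumulate over the hour.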
You gain access to the page's own DOM and JavaScript, which lets you call the site's functions to fetch data if it helps you. You can also code your scraper with the same techniques you use when building a page: jQuery, CSS selectors, etc - basically all the good interfaces that have been developed over the past 20 years are available to your scraping code.
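Injected scraping code ends up looking like ordinary front-end code. Here's a minimal sketch of the idea, with a tiny stub standing in for the page's real `document` (selector and page structure invented) so it runs standalone:

```javascript
// Stand-in for the real page DOM; in an injected script you would
// use the actual `document` (or jQuery, if the site ships it).
const document = {
  querySelectorAll: (selector) =>
    selector === ".price"
      ? [{ textContent: " $19.99 " }, { textContent: "$24.50" }]
      : [],
};

// The kind of function you'd inject: walk nodes matching a CSS
// selector and collect their text content.
function extractPrices(doc) {
  return Array.from(doc.querySelectorAll(".price")).map((el) =>
    el.textContent.trim()
  );
}
```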
We target businesses that want to fulfil their web data extraction needs in-house rather than hiring a third-party provider – which is what we originally were. Espion was initially built for our own needs, and we're still in the customer discovery phase. I expect we'll find many use cases in the coming months.
The general sentiment in this thread pretty much sums up the idea of "scraping as a service". To me, there is definitely a legitimate business need to be able to scrape. Whether people realize it or not, companies have in-house teams that build custom scraping tools. The challenge for you is going to be screening out the bad actors who might use your service to do things you would not approve of.
As someone who has provided scraping services for years, I can indeed confirm there are a lot of totally legitimate businesses that need the ability to compile various online data: from press clipping and social media trend monitoring, through various data mining and analysis tasks, to price comparison and building dropship inventories. These days everyone talks about big data and data analysis, but you first need to collect that data.