
I'm curious what others use to scrape modern (JavaScript-based) web applications.

The old web (HTML and links) works fine with tools like Scrapy, but for modern applications that rely on JavaScript this no longer works.

For my last project I used a Chrome plugin which controlled the browser's URL location and clicks. Results were transmitted to a backend server, and new jobs (clicks, URL changes) were retrieved from the server.

This worked fine but required some effort to implement. Is there an open source solution that is as helpful as Scrapy but solves the problems posed by modern JavaScript websites/applications?

With tools like Chrome headless this should now be possible, right?



I have used Selenium for this with quite a bit of success, or, as others have mentioned, just figure out where the API endpoints are with Fiddler and pull the data directly from the source.

Sometimes this can be a PITA, though; for example, Tableau obfuscates the JSON it sends back, so it's easier to use Selenium to wait ten seconds and then scrape the resulting HTML.
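
The "wait, then grab the rendered DOM" approach is only a few lines with Python Selenium. A minimal sketch, assuming chromedriver is on PATH; the URL is a placeholder:

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get("https://example.com/dashboard")
        time.sleep(10)             # crude: give the page's JS time to render
        html = driver.page_source  # the fully rendered DOM
    finally:
        driver.quit()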


Disclaimer: I'm a co-founder of Apifier [1].

It's not open source, but it's free up to 10k pages per month. And it can handle modern JS web applications (your code runs in the context of the crawled page). You can, for example, scrape an API key first and then use internal AJAX calls.

There's also a community page [2] where you can find and use crawlers made by other users.

[1] https://www.apifier.com

[2] https://www.apifier.com/community/crawlers
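
For the curious, that pattern (extract a key from the page, then call the site's internal AJAX API) looks roughly like this in plain Python. The URL, regex, and endpoint here are invented placeholders:

    import re
    import requests

    # Fetch the page and pull the embedded API key out of its source.
    html = requests.get("https://example.com/app").text
    key = re.search(r'apiKey:\s*"([^"]+)"', html).group(1)

    # Call the internal AJAX endpoint directly with that key.
    data = requests.get("https://example.com/internal/api/list",
                        params={"key": key}).json()
    print(data)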


Interesting. Are you seeing any product/market fit for this?


We see a lot of users who need data from the web, or APIs for sites that don't have one. Just not all of them can code, so we have to scale custom development.


Are these developers? Business people? I'm curious because we've been searching for a tool like this for a while, but ultimately management thought it was a bad idea to rely on scraping; there's simply no replacement for a REST API.


Both - developers on the free plan building their own RSS feeds for sites without one, and business people (mainly startups) building their products on top of Apifier.

Typical use is an aggregator that needs a common API for all partners who are not able to provide one. So they have a running API on Apifier in an hour. It might break once in a while - then you have to update your crawler (not that often if you use internal AJAX calls).


I see, so there's not much value beyond startups and bootstrappers.

I feel like it's a hard sell to enterprises. Scraping is viewed as inferior to an API, so it makes sense for enterprises to just pay the target website for access to the data.


It's also hard to get direct access to the data.

But you're right, it's a hard sell to enterprises, although we have some (e.g. a real estate developer creating pricing maps).


Yes, Chrome is the way to go in my opinion (or in general any browser with a proper DevTools API). Zero setup (start the browser, use the API), zero feature-lag, zero deviation from regular user behaviour, all the security features of the regular browser. The only downside is that it is not as easy to get started as some of the tooling aimed at CI and web-page testing, but once you've built a few tools you'll quickly get the hang of what needs to happen in which order.

I use Google Chrome on https://urlscan.io to get the most accurate representation of what a website "does" (HTTP requests, cookies, console messages, DOM tree, etc.). For Chrome, this is probably the best library available: https://github.com/cyrus-and/chrome-remote-interface. Headless mode works as well, but it still has some issues.
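
chrome-remote-interface is a Node library, but the DevTools protocol itself is just JSON over a websocket, so the same flow can be sketched in Python. This assumes Chrome was started with --remote-debugging-port=9222 and at least one tab open, plus the requests and websocket-client packages; the URL is a placeholder:

    import json
    import time

    import requests   # pip install requests
    import websocket  # pip install websocket-client

    # Find an open tab and connect to its DevTools websocket.
    targets = requests.get("http://localhost:9222/json").json()
    page = next(t for t in targets if t["type"] == "page")
    ws = websocket.create_connection(page["webSocketDebuggerUrl"])

    def call(method, params=None, _id=[0]):
        _id[0] += 1
        ws.send(json.dumps({"id": _id[0], "method": method,
                            "params": params or {}}))
        while True:  # skip async event notifications until our reply arrives
            msg = json.loads(ws.recv())
            if msg.get("id") == _id[0]:
                return msg

    call("Page.navigate", {"url": "https://example.com"})
    time.sleep(5)  # crude: let the page's JS run
    reply = call("Runtime.evaluate",
                 {"expression": "document.documentElement.outerHTML"})
    print(reply["result"]["result"]["value"])
    ws.close()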


I use Elixir and Hound because it has a nice clean API that's not difficult to mess around with. It's really straightforward.

https://github.com/HashNuke/hound


I used http://phantomjs.org/ as a headless browser for scraping a JS-based site. It was a couple years ago, though, maybe now there's something better.


Not open source but free: Kantu (https://kantu.io) uses OCR to support web scraping. You mark an anchor image/text with a green frame and mark the area of data that needs to be extracted with pink frames. The image inside the pink frames is then sent to https://ocr.space for processing, and the Kantu API returns the extracted text. This works very well as long as you do not need a lot of data. It is certainly not a "high-speed" solution for scraping terabytes of data.
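
If you only need the OCR step, the ocr.space REST API can also be called directly. A rough sketch, assuming a cropped screenshot saved as region.png; "helloworld" is the free demo key from their docs:

    import requests

    with open("region.png", "rb") as f:
        resp = requests.post(
            "https://api.ocr.space/parse/image",
            files={"file": f},
            data={"apikey": "helloworld", "language": "eng"},
        )
    print(resp.json()["ParsedResults"][0]["ParsedText"])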


I tried OCR for scraping and gave up because it was too slow and inaccurate.

OCR works well for certain scenarios where the UI is fixed, like desktop applications, but it's still fragile, much like CSS and XPath selectors.

In fact, OCR often performs far more slowly and less accurately than CSS/XPath selectors.

It has its niches, but I think it's suboptimal for web automation/scraping.


Splash https://github.com/scrapy-plugins/scrapy-splash

Runs a little headless browser.
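
A rough sketch of the scrapy-splash wiring, following the project README; it assumes a Splash instance listening on localhost:8050 (e.g. docker run -p 8050:8050 scrapinghub/splash), and the URL/selector are placeholders:

    # settings.py
    SPLASH_URL = 'http://localhost:8050'
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100}
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

    # spider
    import scrapy
    from scrapy_splash import SplashRequest

    class JsSpider(scrapy.Spider):
        name = 'js'

        def start_requests(self):
            # 'wait' gives the page time to execute its JavaScript
            yield SplashRequest('https://example.com', self.parse,
                                args={'wait': 2})

        def parse(self, response):
            # response.text is the rendered HTML, so normal selectors work
            for title in response.css('h1::text').getall():
                yield {'title': title}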


Interesting, is there a variant that uses Chrome? That would also get around most scraping protections.


Splash is not Chromium-based, I believe. Therefore it's buggy as hell and doesn't render websites as smoothly and easily as Chrome can.


Many times it's actually much easier to scrape a JS-based app. You just find the right API calls and you get nicely formatted data (mostly JSON).
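
E.g. once you've spotted the XHR in the browser's Network tab, plain requests is all you need. The endpoint, parameters, and response shape below are invented for illustration:

    import requests

    # Hypothetical internal endpoint found via the DevTools Network tab.
    resp = requests.get(
        "https://example.com/api/v1/items",
        params={"page": 1},
        headers={"X-Requested-With": "XMLHttpRequest"},  # some backends check this
    )
    resp.raise_for_status()
    for item in resp.json()["items"]:
        print(item["name"])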


Kimono was good for this, but it was acquired and shut down last year (IIRC). Not sure why their exit didn't lead to someone else moving into the space.


We use HTMLUnit. Works pretty well. Not super fast, but you want to scrape individual sites at a moderate rate anyway.


Have you run into issues? I'd think HTMLUnit isn't robust enough and its "browser" is limiting?


It's got a couple of idiosyncrasies but works well in general. Barfs out too much log info, though. XPath 1 is limiting, but you can use Saxon if you need to.


Holy crap! XPath 1 is still being used for it? I actually have no clue what the differences are between XPath versions, but I just assumed everyone was on XPath 2.

I guess my other question is - have you run into any situations where the JavaScript parsing or browser rendering wasn't good enough?


CasperJS with SlimerJS and/or PhantomJS works well.



