
I'm curious what others use to scrape modern (JavaScript-based) web applications.

The old web (HTML and links) works fine with tools like Scrapy, but for modern applications that rely on JavaScript this no longer works.

For my last project I used a Chrome plugin which controlled the browser's URL location and clicks. Results were transmitted to a backend server, and new jobs (clicks, URL changes) were retrieved from the server.

This worked fine but required some effort to implement. Is there an open source solution that is as helpful as Scrapy but solves the problems posed by modern JavaScript websites/applications?

With tools like Chrome headless this should now be possible, right?



I have used Selenium for this with quite a bit of success, or, as others have mentioned, just figure out where the API endpoints are with Fiddler and pull the data directly from the source.

Sometimes this can be a PITA, though; for example, Tableau obfuscates the JSON it sends back, so it's easier to use Selenium to wait ten seconds and then scrape the resulting HTML.
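
The "wait, then grab the rendered DOM" approach is only a few lines with Python Selenium. A minimal sketch, assuming chromedriver is on PATH; the URL is a placeholder:

    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get("https://example.com/dashboard")
        time.sleep(10)             # crude: give the page's JS time to render
        html = driver.page_source  # the fully rendered DOM
    finally:
        driver.quit()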


Disclaimer: I'm a co-founder of Apifier [1].

It's not open source, but it's free up to 10k pages per month. And it can handle modern JS web applications (your code runs in the context of the crawled page). You can, for example, scrape an API key first and then use internal AJAX calls.

There's also a community page [2] where you can find and use crawlers made by other users.

[1] https://www.apifier.com

[2] https://www.apifier.com/community/crawlers
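
For the curious, that pattern (extract a key from the page, then call the site's internal AJAX API) looks roughly like this in plain Python. The URL, regex, and endpoint here are invented placeholders:

    import re
    import requests

    # Fetch the page and pull the embedded API key out of its source.
    html = requests.get("https://example.com/app").text
    key = re.search(r'apiKey:\s*"([^"]+)"', html).group(1)

    # Call the internal AJAX endpoint directly with that key.
    data = requests.get("https://example.com/internal/api/list",
                        params={"key": key}).json()
    print(data)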


Interesting. Are you seeing any product/market fit for this?


We see a lot of users who need data from the web, or APIs for sites that don't have one. Just not all of them can code, so we have to scale custom development.


Are these developers? Business people? I'm curious because we've been searching for a tool like this for a while, but ultimately management thought it was a bad idea to rely on scraping; there's simply no replacement for a REST API.


Both - developers on the free plan building their own RSS feeds for sites without one, and business people (mainly startups) building their products on top of Apifier.

Typical use is an aggregator that needs a common API for all partners who are not able to provide one. So they have a running API on Apifier in an hour. It might break once in a while - then you have to update your crawler (not that often if you use internal AJAX calls).


I see, so there's not much value beyond startups and bootstrappers.

I feel like it's a hard sell to enterprises. Scraping is viewed as inferior to an API, so it makes sense for enterprises to just pay the target website for access to the data.


It's also hard to get direct access to the data.

But you're right, it's a hard sell to enterprises, although we have some (e.g. a real estate developer creating pricing maps).


Yes, Chrome is the way to go in my opinion (or in general any browser with a proper DevTools API). Zero setup (start the browser, use the API), zero feature-lag, zero deviation from regular user behaviour, all the security features of the regular browser. The only downside is that it is not as easy to get started as some of the tooling aimed at CI and web-page testing, but once you've built a few tools you'll quickly get the hang of what needs to happen in which order.

I use Google Chrome on https://urlscan.io to get the most accurate representation of what a website "does" (HTTP requests, cookies, console messages, DOM tree, etc.). For Chrome, this is probably the best library available: https://github.com/cyrus-and/chrome-remote-interface. Headless mode works as well, but it still has some issues.
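
chrome-remote-interface is a Node library, but the DevTools protocol itself is just JSON over a websocket, so the same flow can be sketched in Python. This assumes Chrome was started with --remote-debugging-port=9222 and at least one tab open, plus the requests and websocket-client packages; the URL is a placeholder:

    import json
    import time

    import requests   # pip install requests
    import websocket  # pip install websocket-client

    # Find an open tab and connect to its DevTools websocket.
    targets = requests.get("http://localhost:9222/json").json()
    page = next(t for t in targets if t["type"] == "page")
    ws = websocket.create_connection(page["webSocketDebuggerUrl"])

    def call(method, params=None, _id=[0]):
        _id[0] += 1
        ws.send(json.dumps({"id": _id[0], "method": method,
                            "params": params or {}}))
        while True:  # skip async event notifications until our reply arrives
            msg = json.loads(ws.recv())
            if msg.get("id") == _id[0]:
                return msg

    call("Page.navigate", {"url": "https://example.com"})
    time.sleep(5)  # crude: let the page's JS run
    reply = call("Runtime.evaluate",
                 {"expression": "document.documentElement.outerHTML"})
    print(reply["result"]["result"]["value"])
    ws.close()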


I use Elixir and Hound because it has a nice clean API that's not difficult to mess around with. It's really straightforward.

https://github.com/HashNuke/hound


I used http://phantomjs.org/ as a headless browser for scraping a JS-based site. It was a couple years ago, though, maybe now there's something better.


Not open source but free: Kantu (https://kantu.io) uses OCR to support web scraping. You mark an anchor image/text with a green frame and mark the area of data that needs to be extracted with pink frames. The image inside the pink frames is then sent to https://ocr.space for processing, and the Kantu API returns the extracted text. This works very well as long as you do not need a lot of data. It is certainly not a "high-speed" solution for scraping terabytes of data.
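
If you only need the OCR step, the ocr.space REST API can also be called directly. A rough sketch, assuming a cropped screenshot saved as region.png; "helloworld" is the free demo key from their docs:

    import requests

    with open("region.png", "rb") as f:
        resp = requests.post(
            "https://api.ocr.space/parse/image",
            files={"file": f},
            data={"apikey": "helloworld", "language": "eng"},
        )
    print(resp.json()["ParsedResults"][0]["ParsedText"])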


I tried OCR for scraping and gave up because it was too slow and inaccurate.

OCR works well for certain scenarios where the UI is fixed, like desktop applications, but it's still fragile, much like CSS and XPath selectors.

In fact, OCR often performs far more slowly and less accurately than CSS/XPath selectors.

It has its niches, but I think it's suboptimal for web automation/scraping.


Splash https://github.com/scrapy-plugins/scrapy-splash

Runs a little headless browser.
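
A rough sketch of the scrapy-splash wiring, following the project README; it assumes a Splash instance listening on localhost:8050 (e.g. docker run -p 8050:8050 scrapinghub/splash), and the URL/selector are placeholders:

    # settings.py
    SPLASH_URL = 'http://localhost:8050'
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100}
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

    # spider
    import scrapy
    from scrapy_splash import SplashRequest

    class JsSpider(scrapy.Spider):
        name = 'js'

        def start_requests(self):
            # 'wait' gives the page time to execute its JavaScript
            yield SplashRequest('https://example.com', self.parse,
                                args={'wait': 2})

        def parse(self, response):
            # response.text is the rendered HTML, so normal selectors work
            for title in response.css('h1::text').getall():
                yield {'title': title}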


Interesting, is there a variant that uses Chrome? That would also get around most scraping protections.


Splash is not Chromium-based, I believe. Therefore it's buggy as hell and doesn't render websites as smoothly and easily as Chrome can.


Many times it's actually much easier to scrape a JS-based app. You just find the right API calls and you get nicely formatted data (mostly JSON).
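
E.g. once you've spotted the XHR in the browser's Network tab, plain requests is all you need. The endpoint, parameters, and response shape below are invented for illustration:

    import requests

    # Hypothetical internal endpoint found via the DevTools Network tab.
    resp = requests.get(
        "https://example.com/api/v1/items",
        params={"page": 1},
        headers={"X-Requested-With": "XMLHttpRequest"},  # some backends check this
    )
    resp.raise_for_status()
    for item in resp.json()["items"]:
        print(item["name"])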


Kimono was good for this, but it was acquired and shut down last year (IIRC). Not sure why their exit didn't lead to someone else moving into the space.


We use HTMLUnit. Works pretty well. Not super fast, but you want to scrape individual sites at a moderate rate anyway.


Have you run into issues? I'd think HTMLUnit isn't robust enough and its "browser" is limiting?


It's got a couple of idiosyncrasies but works well in general. Barfs out too much log info, though. XPath 1 is limiting, but you can use Saxon if you need to.


Holy crap! XPath 1 is still being used for it? I actually have no clue what the differences are between XPath versions, but I just assumed everyone was on XPath 2.

I guess my other question is - have you run into any situations where the JavaScript parsing or browser rendering wasn't good enough?


CasperJS with SlimerJS and/or PhantomJS works well.



