Show HN: Headless Chrome Crawler (github.com)
172 points by yujiosaka 7 months ago | 33 comments



Pretty cool, but I recommend that anyone wanting to do this kind of thing check out the underlying Puppeteer library. You can do some really powerful stuff and build a custom crawler fairly easily.

https://github.com/GoogleChrome/puppeteer
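
For anyone who hasn't tried it, a single-page crawl in plain Puppeteer is only a handful of lines. A minimal sketch (the URL is just a placeholder):

  const puppeteer = require('puppeteer');

  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Wait until the network settles so JS-injected content has a chance to render.
    await page.goto('https://example.com', { waitUntil: 'networkidle0' });
    const hrefs = await page.evaluate(
      () => Array.from(document.querySelectorAll('a[href]'), a => a.href)
    );
    console.log(hrefs);
    await browser.close();
  })();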


Looks like this is actually built on top of puppeteer. See the "Note" under "Installation": https://github.com/yujiosaka/headless-chrome-crawler/blob/ma...


Puppeteer has some limitations. You can’t install extensions, for example.

I haven’t looked into it, but I imagine it has a pretty clear fingerprint as well. So it would be easier to block than stock chrome in headless mode.


Unless something has changed that I missed, you can install extensions (I complained when the default args messed this up [0]). For example, I built something that uses puppeteer and an extension to capture audio and video of a tab [1]. It's just headless mode that doesn't allow extensions [2] (which I now realize is probably what you meant).

[0] https://github.com/GoogleChrome/puppeteer/issues/850
[1] https://github.com/cretz/chrome-screen-rec-poc/tree/master/a...
[2] https://bugs.chromium.org/p/chromium/issues/detail?id=706008
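
For reference, the launch flags involved look roughly like this; note that headless must be off, and the extension path below is just a placeholder:

  const puppeteer = require('puppeteer');

  (async () => {
    const extensionPath = '/path/to/unpacked-extension'; // placeholder
    const browser = await puppeteer.launch({
      headless: false, // extensions only load in headful Chrome
      args: [
        `--disable-extensions-except=${extensionPath}`,
        `--load-extension=${extensionPath}`,
      ],
    });
    // ... drive the extension / pages here ...
    await browser.close();
  })();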


Puppeteer seems needlessly difficult to use on a VPS. I'd prefer an easily dockerized version, but there seems to be nothing robust, and they make it VERY hard to connect to a Docker container that's just running Chrome with the websocket/9222 interface exposed, sadly.
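
From what I understand, the friction is that Chrome binds the DevTools endpoint to localhost (so the container needs --remote-debugging-address=0.0.0.0) and the actual websocket URL contains a session token, so you have to look it up from /json/version rather than guessing it. A rough sketch of how the connection is supposed to work (host/port are placeholders):

  const http = require('http');
  const puppeteer = require('puppeteer');

  // Chrome publishes its real websocket URL at /json/version.
  function getWsEndpoint(host, port) {
    return new Promise((resolve, reject) => {
      http.get({ host, port, path: '/json/version' }, res => {
        let body = '';
        res.on('data', chunk => (body += chunk));
        res.on('end', () => resolve(JSON.parse(body).webSocketDebuggerUrl));
      }).on('error', reject);
    });
  }

  (async () => {
    const wsEndpoint = await getWsEndpoint('127.0.0.1', 9222); // placeholder host/port
    const browser = await puppeteer.connect({ browserWSEndpoint: wsEndpoint });
    const page = await browser.newPage();
    await page.goto('https://example.com');
    await browser.disconnect(); // leave the container's Chrome running
  })();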


I recently did this in Docker.

Let me quickly add instructions here. First you need to install some dependencies; add the following to your Dockerfile:

  RUN apt-get update && apt-get install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget
Second, launch Puppeteer with the --no-sandbox option:

  const browser = await puppeteer.launch({
    args: ['--no-sandbox'] /*, headless: false */
  });
That should do it.


I've done this recently actually. Take a look at the yukinying/chrome-headless-browser[0] image. You'll need to run with the SYS_ADMIN capability and up the shm_size to 1024M (you can work around the SYS_ADMIN cap with a seccomp file, but I didn't have much luck with that). Other than that oddness it works pretty well (and with Puppeteer 1.0, with far fewer crashes).

[0]: https://github.com/yukinying/chrome-headless-browser-docker


Yeah, I'd really rather people built extensions for Puppeteer than a whole new library.


This has been possible for a long time with any browser using Selenium for example. It has APIs and client libraries for many languages.

Also, using a real browser brings a lot of problems: high resource consumption, hangs, and it's unclear when a page has finished loading. You have to supervise all the browser processes. And if you use promises, there is a high chance you will miss error messages, because promises swallow them by default.
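
On the promises point, that's mostly about unhandled rejections: if nothing awaits or .catch()es a promise, Node only prints a warning (or nothing, depending on the version), so automation errors can vanish. A small sketch of making them loud:

  // Surface any promise rejection that nothing awaited or .catch()ed.
  process.on('unhandledRejection', err => {
    console.error('Unhandled rejection:', err);
    process.exit(1);
  });

  (async () => {
    // ... browser automation here ...
    throw new Error('this would otherwise be easy to miss');
  })().catch(err => {
    console.error(err);
    process.exit(1);
  });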


While as a developer I find this super interesting, as a system administrator this makes me cringe. We don't have a lot of resources for servers, and I end up spending a disproportionate amount of time banning IPs from bots running poorly configured tools like this, which aren't rate limited and crush websites.

I'm grateful that "Obey robots.txt" is listed as part of its standard behavior. If only scrapers cared enough to use it as well.
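
For what it's worth, even without library support a basic crawl delay is only a few lines in plain Puppeteer; a rough sketch (the delay and URLs are placeholders):

  const puppeteer = require('puppeteer');

  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const urls = ['https://example.com/a', 'https://example.com/b']; // placeholders
    for (const url of urls) {
      await page.goto(url);
      // ... extract whatever you need ...
      await sleep(2000); // be polite: pause between requests
    }
    await browser.close();
  })();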


I've found that mod_evasive[1] works particularly well in these situations and has helped us a lot (though I'm not a sysadmin and I'm sure there are better tools for this). But for someone who is just a webmaster, I'd recommend it as a quick-and-dirty fix for such hassles.

[1] https://www.digitalocean.com/community/tutorials/how-to-prot...


Such a crawler should not be difficult to ban just by looking at stats: if there are many requests per IP per unit of time, many requests from data-center IPs, or many requests from Linux browsers, they are likely bots and you can ban them (you can ban the whole data center to be sure).


There are a lot of folks reevaluating their crawling engines now that headless Chrome is maturing. To me, there are some important considerations in terms of CPU/memory footprint that go into distributing a large headless crawling architecture.

What we are not seeing open-sourced are the solutions companies are building around trimmed-down, specialized versions of headless browsers like headless Chrome, Servo, and WebKit. People are running distributed versions of these headless browsers using Apache Mesos, Kubernetes, and Kafka queues.


I was stuck the last time I used headless Chrome and needed to use a proxy with a username and a password. Headless Chrome just doesn't support it. Any changes on that?



Thanks. I wish it were simpler. It seems overkill to have to run an extra proxy with no auth in the middle just to authenticate with the one that has auth, only to make headless Chrome work.


There's also page.authenticate, which has worked well for me. https://github.com/GoogleChrome/puppeteer/blob/master/docs/a...


You can also use page.authenticate() for that - see a note at the bottom of the article. Also see https://github.com/GoogleChrome/puppeteer/pull/1732
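
A rough sketch of the combination (the proxy address and credentials are placeholders):

  const puppeteer = require('puppeteer');

  (async () => {
    const browser = await puppeteer.launch({
      args: ['--proxy-server=http://proxy.example.com:3128'], // placeholder proxy
    });
    const page = await browser.newPage();
    // Answers the proxy's auth challenge (also used for HTTP basic auth).
    await page.authenticate({ username: 'user', password: 'pass' }); // placeholders
    await page.goto('https://example.com');
    await browser.close();
  })();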


Actually, we just figured out how to do this. Details here: https://bugs.chromium.org/p/chromium/issues/detail?id=741872...


Awesome. I need to figure out a way to make it work with our Ruby code, but it shouldn't be that hard. Thanks.


I've been considering writing my own Puppeteer Docker image such that one could freeze the image at crawl time, after a page has loaded. This would allow me to rewrite the page-parsing logic after the page layout changes. Has anyone done this already, or does anyone know of other efforts to serialize the Puppeteer page object to handle parsing bugs?
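
As a stopgap, a cheaper approximation than freezing the whole container is to persist the rendered HTML at crawl time and replay the parsing against the saved snapshot later; a minimal sketch (the path is a placeholder):

  const fs = require('fs');
  const puppeteer = require('puppeteer');

  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com', { waitUntil: 'networkidle0' });
    // Serialize the fully rendered DOM so parsing logic can be re-run offline.
    const html = await page.content();
    fs.writeFileSync('snapshot.html', html); // placeholder path
    await browser.close();
  })();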


I'm thinking about adding a crawler to Bookmark Archiver, to augment the headless chrome screenshotting and PDFing that it already does.

Wget is also a pretty robust crawler, but people have requested a proxy that archives every site they visit in real time more often than they've requested a crawler.
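
For reference, the screenshot/PDF side is just a couple of Puppeteer calls; a minimal sketch (paths are placeholders):

  const puppeteer = require('puppeteer');

  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com', { waitUntil: 'networkidle0' });
    await page.screenshot({ path: 'page.png', fullPage: true }); // placeholder path
    await page.pdf({ path: 'page.pdf', format: 'A4' });          // note: pdf() only works in headless mode
    await browser.close();
  })();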


I can't see from the examples: how do I get back individual elements from the body?


Not a user of this tool, but https://github.com/yujiosaka/headless-chrome-crawler#event-n... points to https://github.com/GoogleChrome/puppeteer/blob/master/docs/a... where you can grab elements.

Basically, when the new-page event fires, you get the `page` object, and from there you can run queries against it.


`page.$(selector)` for the first match or `page.$$(selector)` for all matches, to be more specific.

https://github.com/GoogleChrome/puppeteer/blob/v1.1.0/docs/a...
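
A quick usage sketch (the selectors are placeholders, and this assumes you're inside an async function with a Puppeteer `page` in scope):

  const heading = await page.$('h1');        // first match, or null
  const links = await page.$$('a[href]');    // array of all matches
  // Element handles can be passed back into evaluate() to read from the DOM.
  const text = await page.evaluate(el => el.textContent, heading);
  console.log(text, links.length);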


Yes, I've used Puppeteer, but I couldn't see where this library exposed the page object. Looking closer, I saw:

  HCCrawler.launch({
    // Function to be evaluated in browsers
    evaluatePage: (() => ({
      title: $('title').text(),
    })),
    // Function to be called with evaluated results from browsers
    onSuccess: (result => {
      console.log(result);
    }),
  })

which doesn't look that good for my other worry: you land on a page whose DOM is built up dynamically with JS, evaluate it within some milliseconds, and then the DOM changes after you've already done your (now incorrect) evaluation.



Nice job! Can this be scaled and distributed to multiple machines?


maybe with this? https://Browserless.io


Also, how does this handle pages that load with a small number of links and then use JS to write in a bunch of DOM nodes and links?


I don't know about this project specifically, but typically with headless Chrome, you let it run the JS and then read the DOM.


The most naive code I see looks like the following:

    const page = await crawler.browser.newPage();
    await page.goto(url);
    await page.waitForSelector("a[href]");
    const hrefs = await page.evaluate(
      () => Array.from(document.body.querySelectorAll('a[href]'), ({ href }) => href)
    );
and then you do something with hrefs.

However, if you have a page that loads with 4 links defined, runs its script, and ends up with 100+ links, you miss the 100+ links. I notice people often fail to account for this in their crawlers, so I wondered whether this one does.
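
One way to hedge against that (not specific to this project) is to wait for the network to go idle and/or for the link count to stop growing before reading the DOM; a rough sketch (the threshold is a placeholder):

    const page = await crawler.browser.newPage();
    // Wait until there are no in-flight requests, so script-injected links exist.
    await page.goto(url, { waitUntil: 'networkidle0' });
    // Optionally also wait until the page has produced "enough" links.
    await page.waitForFunction(
      min => document.querySelectorAll('a[href]').length >= min,
      {},
      5 // placeholder threshold
    );
    const hrefs = await page.evaluate(
      () => Array.from(document.body.querySelectorAll('a[href]'), ({ href }) => href)
    );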


Is it possible to scrape songs from the Spotify web app with this?



