
Show HN: Headless Chrome Crawler - yujiosaka
https://github.com/yujiosaka/headless-chrome-crawler
======
ptasker
Pretty cool, but I recommend that anyone wanting to do this kind of thing check
out the underlying Puppeteer library. You can do some really powerful stuff and
make a custom crawler fairly easily.

[https://github.com/GoogleChrome/puppeteer](https://github.com/GoogleChrome/puppeteer)
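
For reference, a minimal sketch of what a hand-rolled crawl step looks like with plain Puppeteer (the URL and selector here are just placeholders):

    // Sketch: load one page and collect its outgoing links with plain Puppeteer.
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
      // Grab the href of every anchor on the rendered page.
      const hrefs = await page.$$eval('a[href]', links => links.map(a => a.href));
      console.log(hrefs);
      await browser.close();
    })();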

~~~
chatmasta
Puppeteer has some limitations. You can’t install extensions, for example.

I haven’t looked into it, but I imagine it has a pretty clear fingerprint as
well, so it would be easier to block than stock Chrome in headless mode.

~~~
kodablah
Unless something has changed that I missed, you can install extensions (I
complained when the default args messed this up [0]). For example, I built
something that uses puppeteer and an extension to capture audio and video of a
tab [1]. It's just headless mode that doesn't allow extensions [2] (which I
now realize is probably what you meant).

0 - [https://github.com/GoogleChrome/puppeteer/issues/850](https://github.com/GoogleChrome/puppeteer/issues/850)
1 - [https://github.com/cretz/chrome-screen-rec-poc/tree/master/a...](https://github.com/cretz/chrome-screen-rec-poc/tree/master/attempt1)
2 - [https://bugs.chromium.org/p/chromium/issues/detail?id=706008](https://bugs.chromium.org/p/chromium/issues/detail?id=706008)
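
The pattern I used is to launch non-headless and point Chrome at the unpacked extension via flags (the paths below are placeholders):

    // Sketch: load an unpacked extension; this only works with headless: false.
    const puppeteer = require('puppeteer');

    (async () => {
      const extensionPath = '/path/to/unpacked-extension'; // placeholder
      const browser = await puppeteer.launch({
        headless: false, // extensions are not supported in headless mode
        args: [
          `--disable-extensions-except=${extensionPath}`,
          `--load-extension=${extensionPath}`,
        ],
      });
      const page = await browser.newPage();
      await page.goto('https://example.com/');
      // ... drive the page while the extension does its work ...
      await browser.close();
    })();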

------
codedokode
This has been possible for a long time with any browser, using Selenium for
example. It has APIs and client libraries for many languages.

Also, using a real browser brings a lot of problems: high resource consumption,
hangs, not knowing when the page has finished loading, etc. You have to
supervise all the browser processes. And if you use promises, there is a high
chance that you will miss error messages, because promises hide them by default.
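
For instance, in Node you have to opt in to seeing them; a minimal sketch:

    // Sketch: surface promise rejections that would otherwise be swallowed.
    process.on('unhandledRejection', (reason) => {
      console.error('Unhandled promise rejection:', reason);
      process.exitCode = 1;
    });

    // And terminate await-heavy crawl code with an explicit catch:
    (async () => {
      // ... crawl logic with awaits ...
    })().catch((err) => {
      console.error(err);
      process.exitCode = 1;
    });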

------
tesin
While as a developer I find this super interesting, as a system administrator
this makes me cringe. We don't have a lot of resources for servers, and I end
up spending a disproportionate amount of time banning IPs from bots running
poorly configured tools like this, which aren't rate limited and crush
websites.

I'm grateful that "Obey robots.txt" is listed as part of its standard
behavior. If only scrapers cared enough to use it as well.
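
If you do run a tool like this, please at least turn down the concurrency and add a delay; something like the following (the option names are how I read this project's README, so treat them as assumptions and verify):

    // Sketch: a deliberately slow, single-connection crawl configuration.
    const HCCrawler = require('headless-chrome-crawler');

    (async () => {
      const crawler = await HCCrawler.launch({
        maxConcurrency: 1,    // one page at a time
        delay: 2000,          // wait 2 seconds between requests
        obeyRobotsTxt: true,  // assumed to be the default, per the README
        onSuccess: result => console.log(result),
      });
      crawler.queue('https://example.com/'); // placeholder start URL
      await crawler.onIdle();
      await crawler.close();
    })();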

~~~
superasn
I've found that _mod_evasive_ [1] works particularly well in these situations
and helped us a lot in dealing with this (though I'm not a sysadmin and I'm
sure there are better tools for it). But for someone who is just a webmaster,
I'd recommend it as a quick and dirty fix for such hassles.

[1] [https://www.digitalocean.com/community/tutorials/how-to-
prot...](https://www.digitalocean.com/community/tutorials/how-to-protect-
against-dos-and-ddos-with-mod_evasive-for-apache-on-centos-7)

------
tegansnyder
There are a lot of folks reevaluating their crawling engines lately now that
Chrome headless is maturing. To me there are some important considerations in
terms of CPU/memory footprint that go into distributing a large headless
crawling architecture.

The stuff we are not seeing open-sourced is the solutions companies are
building around trimmed-down, specialized versions of headless browsers such as
headless Chrome, Servo, and WebKit. People are running distributed versions of
these headless browsers using Apache Mesos, Kubernetes, and Kafka queues.

------
hartator
I was stuck last time I was using headless Chrome when I needed to use a proxy
with a username and a password. Headless Chrome just doesn't support it. Has
anything changed on that?

~~~
jancurn
There's a workaround - [https://blog.apify.com/how-to-make-headless-chrome-
and-puppe...](https://blog.apify.com/how-to-make-headless-chrome-and-
puppeteer-use-a-proxy-server-with-authentication-249a21a79212)

~~~
hartator
Thanks. I wish it were simpler. It seems overkill to have to run an extra,
unauthenticated proxy in the middle that authenticates with the real one, just
to make headless Chrome work.

~~~
timstapl
There's also page.authenticate, which has worked well for me.
[https://github.com/GoogleChrome/puppeteer/blob/master/docs/a...](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pageauthenticatecredentials)
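
A rough sketch of that approach, with a placeholder proxy address and credentials:

    // Sketch: answer the proxy's auth challenge with page.authenticate().
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch({
        args: ['--proxy-server=http://proxy.example.com:8000'], // placeholder
      });
      const page = await browser.newPage();
      // Supplies credentials for HTTP authentication, including proxy auth.
      await page.authenticate({ username: 'user', password: 'pass' });
      await page.goto('https://example.com/');
      await browser.close();
    })();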

------
princehonest
I've been considering writing my own Puppeteer Docker image such that one
could freeze the image at crawl time after a page has loaded. This would allow
me to rewrite the page-parsing logic after the page layout changes. Has anyone
done this already, or does anyone know of any other efforts to serialize the
Puppeteer page object to handle parsing bugs?
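
Not quite freezing the whole container, but a lighter-weight idea along the same lines is to snapshot the rendered DOM at crawl time so the parsing step can be replayed later (a sketch, with a placeholder output path):

    // Sketch: persist the post-JS HTML so parsing logic can be re-run offline.
    const fs = require('fs');
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
      // page.content() returns the full serialized HTML after scripts have run.
      fs.writeFileSync('/tmp/snapshot.html', await page.content()); // placeholder
      await browser.close();
    })();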

------
nikisweeting
I'm thinking about adding a crawler to Bookmark Archiver, to augment the
headless chrome screenshotting and PDFing that it already does.

Wget is also a pretty robust crawler, but people have requested a proxy that
archives every site they visit in real time more often than they have requested
a crawler.

------
bryanrasmussen
I can't see from the examples: how do I get back individual elements from the body?

~~~
diggan
Not a user of this tool, but [https://github.com/yujiosaka/headless-chrome-
crawler#event-n...](https://github.com/yujiosaka/headless-chrome-
crawler#event-newpage) points to
[https://github.com/GoogleChrome/puppeteer/blob/master/docs/a...](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#class-
page) where you can grab elements.

Basically, when the newpage event fires, you get the `page` object, which you
can use to run queries against the document.

~~~
ej12n
`page.$(selector)` for the first match or `page.$$(selector)` for all matches, to be more specific

[https://github.com/GoogleChrome/puppeteer/blob/v1.1.0/docs/a...](https://github.com/GoogleChrome/puppeteer/blob/v1.1.0/docs/api.md#pageselector)
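
For example, assuming `page` is a loaded Puppeteer Page inside an async function (the selectors are placeholders):

    // Sketch: pull individual elements out of the body of a loaded page.
    const title = await page.$eval('h1', el => el.textContent);   // first match
    const items = await page.$$eval('ul li', lis =>                // all matches
      lis.map(li => li.textContent.trim()));
    const button = await page.$('#submit'); // ElementHandle, e.g. for clicking
    if (button) await button.click();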

------
artur_makly
there's this too:
[https://github.com/brendonboshell/supercrawler](https://github.com/brendonboshell/supercrawler)

------
agotterer
Nice job! Can this be scaled and distributed to multiple machines?

~~~
artur_makly
maybe with this? [https://Browserless.io](https://Browserless.io)
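
The usual pattern with a hosted browser pool is to connect to a remote endpoint instead of launching Chrome locally (the endpoint below is a placeholder; check the provider's docs for the real one):

    // Sketch: drive a remote Chrome instance over the DevTools protocol.
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.connect({
        browserWSEndpoint: 'wss://chrome.example.com', // placeholder endpoint
      });
      const page = await browser.newPage();
      await page.goto('https://example.com/');
      console.log(await page.title());
      await browser.disconnect(); // leave the remote browser running
    })();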

------
bryanrasmussen
also, how does this handle pages that load with a small number of links and
then use JS to write in a bunch of DOM nodes and links?

~~~
trevyn
I don't know about this project specifically, but typically with headless
Chrome, you let it run the JS and then read the DOM.

~~~
bryanrasmussen
most naive code I see is like the following:

    const page = await crawler.browser.newPage();
    await page.goto(url);
    await page.waitForSelector("a[href]");
    const hrefs = await page.evaluate(
      () => Array.from(document.body.querySelectorAll('a[href]'), ({ href }) => href)
    );

and then you do something with hrefs.

However, if you have a page that loads with 4 links defined, runs its script,
and ends up with 100+ links, you miss the 100+ links. I notice people often
fail to account for this in their crawlers, so I wondered whether this one does.
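
One way to account for it, as a sketch (the link-count threshold and timeout are guesses you would tune per site):

    // Sketch: wait for script-inserted links before collecting hrefs.
    const page = await crawler.browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' }); // let the network settle
    // Or wait until the link count exceeds what the static markup shipped with:
    await page.waitForFunction(
      () => document.querySelectorAll('a[href]').length > 4,
      { timeout: 10000 }
    ).catch(() => {}); // fall through if the page really does have few links
    const hrefs = await page.evaluate(
      () => Array.from(document.body.querySelectorAll('a[href]'), ({ href }) => href)
    );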

------
londt8
Is it possible to scrape songs from the Spotify web app with this?

