
Show HN: Getting started with Puppeteer and Chrome Headless for Web Scraping - emadehsan
https://github.com/emadehsan/thal
======
paps
Where I work we prefer jQuery to the native DOM API for scraping. It really
speeds up the process of extracting data.

For example with Puppeteer you can do page.injectFile("jquery-3.2.1.min.js").
I think that would simplify your evaluate() calls.

It would also be easy to speed up the whole process by doing a single
evaluate() call per page with all your scraping code in it.
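
A minimal sketch of the single-evaluate() idea, assuming jQuery is loaded into the page first. It uses `page.addScriptTag`, which newer Puppeteer releases provide for injecting a script file. The helper name `scrapePage`, the selectors, and the local jQuery path are illustrative assumptions, not part of the tutorial:

```javascript
// Sketch: load jQuery once, then collect everything in one evaluate() round trip.
// `scrapePage`, the selectors, and the jQuery path are illustrative assumptions.
async function scrapePage(page, jqueryPath = 'jquery-3.2.1.min.js') {
  // Inject the local jQuery file into the page context.
  await page.addScriptTag({ path: jqueryPath });

  // This callback runs inside the browser, so `window` refers to the page.
  // Only serializable values (strings, arrays, plain objects) can be returned.
  return page.evaluate(() => {
    const $ = window.jQuery;
    return {
      title: $('title').text().trim(),
      headings: $('h1, h2').map((i, el) => $(el).text().trim()).get(),
      links: $('a[href]').map((i, el) => el.href).get(),
    };
  });
}
```

Each evaluate() call is a round trip between Node.js and the browser, so returning one object with all fields avoids paying that cost per field.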

BTW we just released an article with tips & tricks for Headless Chrome:
[https://blog.phantombuster.com/web-scraping-
in-2017-headless...](https://blog.phantombuster.com/web-scraping-
in-2017-headless-chrome-tips-tricks-4d6521d695e8) What do you think?

~~~
emadehsan
Good suggestion. I will update it soon. Thank you.

------
Giroflex
> Since the official announcement of Chrome Headless, many of the industry
> standard libraries for automated testing have been discontinued by their
> maintainers. The prominent of these are PhantomJS and Selenium IDE for
> Firefox.

Correct me if I'm wrong, but I believe Selenium IDE was discontinued due to
a lack of maintainers, and that has little if any relation to Chrome
Headless.

The IDE is just a more effective way of programming test behavior; the
Selenium WebDriver is still up and working with straight code (as is the case
in this tutorial).

~~~
jaxn
Related or not, it seems like a valid point.

We switched to chrome headless after a post from thoughtbot made me question
Capybara-WebKit's future.

~~~
escap
Selenium IDE was discontinued due to the change in Firefox's extension format
(from XPI to WebExtension). It had nothing to do with Chrome Headless.

see
[https://seleniumhq.wordpress.com/2017/08/09/firefox-55-and-s...](https://seleniumhq.wordpress.com/2017/08/09/firefox-55-and-
selenium-ide/) and associated HN discussion
[https://news.ycombinator.com/item?id=15061605](https://news.ycombinator.com/item?id=15061605)

~~~
tw21
Plus, there are already new IDEs showing up, for example
[https://chrome.google.com/webstore/detail/kantu-browser-
auto...](https://chrome.google.com/webstore/detail/kantu-browser-
automation/gcbalfbdmfieckjlnblleoemohcganoc)

------
ankit84
Great tutorial! Also, you look like a full-stack developer. How has the
reception been for the HospitalRun software you worked on?
([https://github.com/HospitalRun/hospitalrun-
frontend](https://github.com/HospitalRun/hospitalrun-frontend))

"Somewhat similar is the case with Internet that we traversed today in quest
of data."

~~~
emadehsan
The HospitalRun team is great and very welcoming. You can join their Slack
channel here: [https://hospitalrun.slack.com/](https://hospitalrun.slack.com/).
The project is expected to be taken on by the JS Foundation in the near
future. And yeah, I am a full-stack developer.

------
twsted
Two things:

1. Please do not test a web app with Chrome only. We don't want to go back to
a world with a single browser.

2. > So, until puppeteer supports this, we will rely on jsdom, a package
available via npm

jsdom is not just a package on npm; it's a piece of engineering art.

~~~
hugh7
I have given up on Firefox. Every bug I submit takes its own sweet time to get
fixed.

~~~
rodorgas
How about the bugfixes you submit, are they faster?

------
veb
Ooooh I read this fantastic introduction the other day and wrote this wee HN
demo using Cheerio.
[https://gist.github.com/veb/c1beab69b5eb1b07123e5eaf55b80320](https://gist.github.com/veb/c1beab69b5eb1b07123e5eaf55b80320)

------
testcross
Do you know if it is possible to render a page without serving it from a web
server? For example, I have the HTML of one page of my domain, generated by a
test. I would like to use puppeteer to render it, but I don't want to set up
an HTTP server for this. I would like to give page.goto a string with the HTML
plus a URL and have it render the page as if it came from the real server.

I guess I can cheat by intercepting the request and responding with the HTML I
already have. But I wonder if something already exists for this.
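
The interception idea could look roughly like this. It is a sketch assuming a recent Puppeteer release (`page.setRequestInterception`, `request.respond`, `request.continue`); older releases used slightly different names. The helper name `serveHtmlAt` is hypothetical:

```javascript
// Sketch: answer the navigation request for `url` with an in-memory HTML string.
// `serveHtmlAt` is a hypothetical helper; `page` is a Puppeteer Page.
async function serveHtmlAt(page, url, html) {
  // Normalize so that 'https://example.com' matches 'https://example.com/'.
  const target = new URL(url).href;

  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (request.url() === target) {
      // Fulfil the request locally; nothing is sent over the network.
      request.respond({ status: 200, contentType: 'text/html', body: html });
    } else {
      // Let every other request (images, scripts, ...) go through unchanged.
      request.continue();
    }
  });

  await page.goto(target);
}
```

Because the document is delivered under the real URL, relative links and same-origin requests behave as they would on the live site.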

~~~
houli
You should be able to use a data URI containing the HTML string

~~~
egeozcan
It's not allowed anymore:
[https://groups.google.com/a/chromium.org/forum/m/#!topic/bli...](https://groups.google.com/a/chromium.org/forum/m/#!topic/blink-
dev/GbVcuwg_QjM)

~~~
ConfucianNardin
That's not correct.

My initial assumption when reading the thread was that navigating to a data
URI would be handled like entering a data URI into the omnibox and would
still be allowed.

A small test case confirms that assumption - it works.
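
For reference, a small sketch of the data URI approach. The helper names `toDataUri` and `renderHtmlString` are hypothetical; `page.goto` and `page.content` are standard Puppeteer calls:

```javascript
// Sketch: wrap an HTML string in a data URI that page.goto() can navigate to.
// `toDataUri` and `renderHtmlString` are hypothetical helper names.
function toDataUri(html) {
  // encodeURIComponent escapes characters such as '#' and '%' that would
  // otherwise be interpreted as part of the URI syntax.
  return 'data:text/html;charset=utf-8,' + encodeURIComponent(html);
}

async function renderHtmlString(page, html) {
  await page.goto(toDataUri(html));
  return page.content(); // serialized HTML of the rendered document
}
```

Note that a page loaded this way has no real origin, so relative URLs will not resolve against your own domain. `page.setContent(html)` is another way to load a string without a server.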

------
garou
I am writing almost the same thing but for PDF [1]. But I am having trouble
with scaling.

I was able to get it running inside a Docker container.

At the moment the example in the repo just returns a blank PDF, but the
problem is in the API Gateway.

[1]
[https://github.com/tecnospeed/pastor](https://github.com/tecnospeed/pastor)

------
MrBlue
Puppeteer is definitely cool, but on a recent project I had to revert to
using NightmareJS as I needed to download files.

~~~
andrewguenther
This is currently being worked on in Headless Chrome. There's been tons of
development on the project and they're super open to feature requests.

------
gmac
A simple option for web scraping is just to use the developer console in a
real web browser.

I have a repo outlining the basics here: [https://github.com/jawj/web-
scraping-for-researchers](https://github.com/jawj/web-scraping-for-
researchers)
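
To give a flavour of the console approach, here is a tiny sketch that can be pasted into the developer console. The helper name `collectLinks` and the `a[href]` selector are illustrative assumptions, not taken from the linked repo:

```javascript
// Sketch: paste into the browser's developer console on the page to scrape.
// `collectLinks` is a hypothetical helper; `doc` defaults to the live document.
function collectLinks(doc = document) {
  return Array.from(doc.querySelectorAll('a[href]')).map((a) => ({
    text: a.textContent.trim(),
    href: a.href,
  }));
}

// In Chrome DevTools, the result can be copied to the clipboard with:
//   copy(JSON.stringify(collectLinks(), null, 2));
```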

------
jasan_s
Tried Puppeteer, and it's pretty awesome. I'm a newbie in terms of scraping,
but thus far it's been a pleasant experience with this tool. Has anyone used
artoo.js with puppeteer successfully?

------
testcross
Is it possible to call `const browser = await puppeteer.launch();` multiple
times in the same Node.js process? I haven't found any information about
that.

~~~
aslushnikov
It is possible. Beware though that each `puppeteer.launch()` will spawn a
chromium process.
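
A rough sketch of what that could look like. `withBrowsers` is a hypothetical helper, and the `puppeteer` argument is assumed to be the module returned by `require('puppeteer')`:

```javascript
// Sketch: start several independent browsers in one Node.js process.
// `withBrowsers` is a hypothetical helper; pass in require('puppeteer').
async function withBrowsers(puppeteer, count, task) {
  // Each launch() call spawns its own Chromium process.
  const browsers = await Promise.all(
    Array.from({ length: count }, () => puppeteer.launch())
  );
  try {
    // Run the task once per browser, in parallel.
    return await Promise.all(browsers.map((browser, i) => task(browser, i)));
  } finally {
    // Always close the browsers so no Chromium processes are left behind.
    await Promise.all(browsers.map((browser) => browser.close()));
  }
}
```

If separate processes are not required, opening several pages in one browser with `browser.newPage()` is usually the lighter option.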

------
naveedahmada036
I can write a mini script to scrape emails and GitHub. What's all this hype
about?

------
dchuk
Correction: it's "scraping"

~~~
emadehsan
Corrected! Thanks :)

------
desireco42
I tried it out when it was released. It works well and is decently fast.

~~~
testcross
What is faster than puppeteer? All the alternatives using Electron look
slower.

------
kasbah
Seems like most of the parsing is done by JSDOM in this tutorial.

~~~
emadehsan
Updated to use `page.evaluate`

