
Observations running 2M headless sessions - mrskitch
https://docs.browserless.io/blog/2018/06/04/puppeteer-best-practices.html
======
fake-name
I have a python project to manage headless chromium, which exposes basically
the entire dev-tools UI:

[https://github.com/fake-name/ChromeController](https://github.com/fake-
name/ChromeController)

It also supports some nice bits - there's a tab pool interface, it does
execution lifetime management, and there's actual parameter checking for
arguments generated by parsing the `protocol.json` file the browser supplies.

I use it as a mixin for some high(ish) volume web archiving I do as a hobby.
It does a nice job poking through things like buttflare and other bullshit
WAFs.

~~~
armitron
How do you archive? Screenshots?

I haven’t found a way to do mhtml with chrome headless which would be so
convenient..

~~~
fake-name
Archiving backend: [https://github.com/fake-
name/ReadableWebProxy](https://github.com/fake-name/ReadableWebProxy)

Stores the scraped content in a postgres database, with external blob storage
for binary content.

Additionally, historical records of the scraped content are kept, so I have
something that acts kind of like the internet archive too.

~~~
kchr
Thanks for your effort, this is an awesome project.

------
tnolet
Excellent article and tips! I learned a lot from Joel's knowledge on Puppeteer
and its intricacies.

I'm running a somewhat different business of which Puppeteer is a pretty big
part. All points made in the post are valid; I can only add the following
after having run ~400k Puppeteer sessions.

\- Race conditions happen. This issue [0] is causing roughly 3% of all
Puppeteer runs to fail in my case. I had to bake in a retry mechanism.

\- Memory matters, CPU not so much. Just looking at my Librato and AWS stats,
the CPU is mostly idle when running multiple concurrent sessions.

\- One way to establish compartmentalisation is to actually run each scraping
session in a separate, one-off Docker container. You pass in the code via a
disk mount or such. No hassle with too many tabs, or shared context between
runs. Each container is destroyed after running.

[0]:
[https://github.com/GoogleChrome/puppeteer/issues/1325](https://github.com/GoogleChrome/puppeteer/issues/1325)
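
The retry mechanism from the first point above can be sketched roughly like
this (a hedged illustration, not the commenter's actual code; the function
name and backoff numbers are made up):

```javascript
// Re-run a flaky async task a fixed number of times, backing off between
// attempts, so intermittent failures like puppeteer#1325 don't fail the job.
async function withRetries(task, { attempts = 3, backoffMs = 500 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Wait a bit longer after each failed attempt.
      await new Promise((r) => setTimeout(r, backoffMs * (i + 1)));
    }
  }
  throw lastError;
}

// Demo with a task that fails twice, then succeeds.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) throw new Error('Navigation failed');
  return 'ok';
};

withRetries(flaky).then((result) => console.log(result)); // → ok
```

In a real run, `task` would be the whole Puppeteer session (launch, navigate,
close), so a failed attempt starts from a clean browser.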

~~~
CapacitorSet
I'm amazed that one must go as far as creating separate containers -
duplicating the browser, dependencies and whatnot - to achieve tab isolation
and browser resiliency in Chrome, when Firefox achieves this with something as
lightweight as container tabs.

~~~
tnolet
Well, in my specific case it is also a security measure. In the context of a
SaaS you want very strict separation between user sessions. This was my main
concern.

------
evadne
Belated note: a work colleague of mine recently released an Elixir-based
solution for supervising Chrome, see:

[https://github.com/holsee/chroxy](https://github.com/holsee/chroxy)

[https://slides.com/holsee/deck-5](https://slides.com/holsee/deck-5)

~~~
mrskitch
Very nice! It's amazing to me how many developers just run headless without
any sort of service layer. You're gonna have a bad time :(

------
mrskitch
Hey folks! I'm Joel, entrepreveloper behind this browserless.io service, happy
to answer questions or chat!

~~~
dschnurr
Hey Joel–awesome service and post. Your tip about parallelizing via browser
instances instead of tabs resonated with me as I've had a lot of headaches
with one bad request taking down a full instance. That said, it seems like
the constant launching/closing of Puppeteer browser instances has some sort
of memory leak - here's what happened when I switched from parallelizing via
pages to browsers: [https://imgur.com/a/rCz3oow](https://imgur.com/a/rCz3oow)

Have you seen any similar behavior in your system? I initially thought I was
forgetting to close the browser instance but triple checked my code and it
does seem like I'm calling close() everywhere.

~~~
mtrunkat
We have found 2 related issues:

\- Sometimes the Promise returned by page.close() never resolves, so it's good
to call Promise.race() on it together with a Promise that resolves after some
timeout period (30s?)

\- Sometimes the Chrome process doesn't get killed, so we also manually kill
any remaining Chrome processes after browser.close()
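
The Promise.race() pattern from the first point looks roughly like this
(shown with a generic promise so it runs without Puppeteer; in practice you
would pass in `page.close()`):

```javascript
// Give a close() call a deadline instead of trusting it to resolve.
function closeWithTimeout(closePromise, timeoutMs = 30000) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('close timed out')), timeoutMs)
  );
  return Promise.race([closePromise, timeout]);
}

// Demo: a close() that never resolves gets cut off by the timeout.
const neverResolves = new Promise(() => {});
closeWithTimeout(neverResolves, 100)
  .then(() => console.log('closed'))
  .catch((err) => console.log(err.message)); // → close timed out
```

In production you would also clear the timer once the race settles, so a fast
close doesn't leave a pending timeout keeping the process alive.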

~~~
dschnurr
Good tips. How are you killing the remaining zombie Chrome processes? Just
polling the process list and running "kill" commands?

------
nkozyra
Some of the puppeteer notes are effectively in the docs and quick starts. I've
found puppeteer to be an absolute joy to use after years of phantom, custom
scrapers, common scraper libraries, etc.

At a certain scale things obviously get harder but making your "sessions"
ephemeral handles a lot of the resource issues.

The cat and mouse with people looking to block puppeteer access may heat up a
bit too. But for the most part puppeteer makes doing this stuff 100x easier.

~~~
mrskitch
Right - probably repeating some stuff already out there. The main issue I've
been seeing on Puppeteer's Slack channel is folks who are totally new and
don't read the docs, so I'm trying to help all devs effectively.

------
afinemonkey
Very interesting to see this. I've been running a similar infrastructure
rendering sometimes up to 500k pages a day, although often without images. I'm
also running on Digital Ocean, but using nightmare.js
([https://github.com/segmentio/nightmare](https://github.com/segmentio/nightmare)),
which runs on top of Electron, which in turns runs on top of Chromium.

The CPU and RAM patterns I see are different, with fixed CPU usage at near max
and memory oscillating between 65% and 80%. I believe this is due to the
different usage pattern, I basically _always_ have at least 20/30 jobs running
concurrently on each machine, and they're usually fairly long (up to 10
minutes or so).

Contrary to what you mention, I've never had an issue with pages crashing and
bringing the whole browser down. Maybe it has happened, but it's definitely
negligible compared to the benefits I get by running say 5 pages in parallel.
For some tasks I've also had some luck overriding the cross-content policy and
using dirty ol' iframes to render multiple webpages in the same session.

I've considered migrating to puppeteer, so it's encouraging to see large-scale
projects sharing their experience with it.

~~~
graystevens
Same experience here - we run a Docker instance of Chrome using tabs for
multiple pages rather than multiple browsers, and they regularly run for days
without issues. Of course RAM usage gradually expands, but it is easy enough
to systematically stop & start the container, thanks to some built-in error
and fault handling to retry any requests which failed.

I can't say I've seen one tab bring down the entire browser, but I'm sure
that's feasible; thanks to Docker and the fault handling above, it'd restart
the instance and be up within seconds.

~~~
mrskitch
This is what I've initially noticed as well (app will run for several days and
eventually the account will cancel due to unavailability ... you'll get
paged).

The screenshot I showed in that post is an instance that's been running for
_months_ under high load. I can't stress enough that using tabs/pages will
always result in frequent restarts, which can be really tricky if there are
other sessions that need to finish gracefully.

------
mtrunkat
The problem with fixed concurrency is that the memory required per Chrome
process varies a lot.

We ([https://www.apify.com](https://www.apify.com)) solve this by autoscaling
the number of parallel Puppeteer instances based on memory and CPU.
Our open source SDK ([http://github.com/apifytech/apify-
js](http://github.com/apifytech/apify-js)) implements this using class
PuppeteerCrawler ([https://www.apify.com/docs/sdk/apify-runtime-
js/latest#Puppe...](https://www.apify.com/docs/sdk/apify-runtime-
js/latest#PuppeteerCrawler)) which internally uses AutoscaledPool that
provides autoscaling:

\- [https://github.com/apifytech/apify-
js/blob/master/src/autosc...](https://github.com/apifytech/apify-
js/blob/master/src/autoscaled_pool.js)

\- [https://www.apify.com/docs/sdk/apify-runtime-
js/latest#Autos...](https://www.apify.com/docs/sdk/apify-runtime-
js/latest#AutoscaledPool)

Sadly, this feature is currently limited to our platform because it's mainly
built for running in Docker containers. Since a Docker container doesn't know
about its CPU consumption vs. its limits, it requires notifications about
reaching its CPU limit from the underlying platform. We have solved this using
WebSocket events, but we are currently working on extending it to work
anywhere/locally.

~~~
mrskitch
I've definitely noticed this as well and have been working on an "auto-scale"
feature (to tone down concurrency when under load). It's not an easy task, but
you can see my first pass here:
[https://github.com/joelgriffith/browserless/blob/master/src/...](https://github.com/joelgriffith/browserless/blob/master/src/Chrome.ts#L556)

------
squeaky-clean
Great article! I've got several apps that maintain long-running Selenium jobs
(they require a real JS engine for the pages they hit), and ruthlessly killing
containers after each job and on a timer is the only solution I found. Zombie
processes are a b __ __!

My team is in the planning phase of rewriting one, and we're going to use
Puppeteer this time. Definitely sharing this with them!

~~~
aisofteng
Would Tini help?

[https://github.com/krallin/tini/blob/master/README.md](https://github.com/krallin/tini/blob/master/README.md)

~~~
squeaky-clean
I may have misused the term zombie process. It seems that's when a terminated
process's entry doesn't get cleared from the process table.

In my experience, Selenium will appear to be functioning normally from the
outside, but the driver completely locks up and stops proceeding at seemingly
random times. I couldn't think of a sane way to detect this; killing
containers seemed easier and has worked reliably so far. I'm definitely open
to hearing suggestions, though.

That aside though Tini looks like a good starting point for containers like
mine anyways. So thanks, it does seem helpful!

~~~
TeMPOraL
Is it running out of entropy? Not sure if this applies to containers, but I
had something like this happen in a virtual machine we used for builds
(Jenkins). I made the mistake of instantiating a cryptographically secure PRNG
during _build_ time; the initialization needed entropy to seed the CSPRNG, and
this would lock up the whole virtual machine in the middle of the build. It
seems that headless VMs have no source of entropy unless explicitly configured
to use the host's source. Maybe something similar happens with containers?

------
unbearded
One thing I noticed is that if the website is behind Distil Networks, they
will block on the first request, making it cumbersome, if not impossible, to
automate some tasks.

I get that it is important to protect the information of their clients (which
seem to be content aggregators), but there are legitimate use cases for
allowing at least some scraping to happen - one being when the information is
about the person/company who wants to read it in an automated form so it can
be used for further processing.

------
testcross
At Ahrefs we run between 130 and 170 million sessions a day. The project
started before the release of Puppeteer, and I can tell you its release saved
me a lot of time. It's much easier to keep a correct browser state using it
than with what was previously available. It also interacts well with
Lighthouse. I wouldn't call headless Chrome stable or bug-free (it doesn't
even handle HTTPS correctly), but it's a good thing to have available.

~~~
graystevens
What issues do you have with HTTPS/TLS? I can’t say I’ve come across any
issues so far - is it a specific usecase to Ahrefs?

~~~
testcross
[https://github.com/GoogleChrome/puppeteer/issues/1159](https://github.com/GoogleChrome/puppeteer/issues/1159)

I don't think we have a fancy usage of Puppeteer, but we still get between 2%
and 5% errors depending on the version used. It's a small ratio, but at this
scale it's a lot of pages.

------
RyanShook
I’d like to know, what are people on HN using headless browsers for?

~~~
mrskitch
You’d actually be surprised by the applications, I definitely was. Some I’ve
seen:

\- Screenshots

\- PDF generation

\- Functional tests

\- Regression tests

\- Scraping SPAs

\- Crazy stuff like printing user-generated content for personalization
products.

~~~
leetbulb
In addition to some of these, I've also built an internal microservice for
analyzing URL redirect chains. Input a URL and it will give me all of the
redirects it goes through, plus information for each one including the method
(HTML meta, 301, etc.) and timing.
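
The service isn't open source, so purely as an illustrative sketch, a
per-hop classifier along these lines could distinguish HTTP redirects from
meta refreshes (the function name and shapes are invented):

```javascript
// Classify one hop of a redirect chain. HTTP 3xx redirects carry a
// Location header, while meta-refresh redirects come from the page body.
function classifyHop({ status, headers = {}, body = '' }) {
  if (status >= 300 && status < 400 && headers.location) {
    return { kind: `http-${status}`, target: headers.location };
  }
  const meta = body.match(
    /<meta[^>]+http-equiv=["']?refresh["']?[^>]+url=([^"'>\s]+)/i
  );
  if (meta) return { kind: 'meta-refresh', target: meta[1] };
  return { kind: 'final', target: null };
}

console.log(classifyHop({ status: 301, headers: { location: '/next' } }));
// → { kind: 'http-301', target: '/next' }
```

With Puppeteer, the HTTP hops are available directly via
`response.request().redirectChain()`; meta refreshes need the body check.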

~~~
kwerk
I've done some hacky redirect-analysis work, but only for the final URL. Have
you thought of open-sourcing it?

------
dom96
I recently played around with a headless Firefox instance to test a web app on
Travis and so far it has been working incredibly well. Here is a sample
.travis.yml file for those interested: [https://github.com/dom96/geckodriver-
travis/blob/master/.tra...](https://github.com/dom96/geckodriver-
travis/blob/master/.travis.yml)

------
maurice2k
I'm running ~600k to ~800k Chrome sessions a day, each in a new browser
process (no tabs/pages) because of cookie interference.

The magic is hidden in chrome's command line parameters which can
significantly speed up the browser start.

I'm not using headless because I need an extension, but most command-line
switches also apply to headless, I guess...

A prepared profile/user directory also helps speed things up.
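
The comment doesn't name the actual switches; purely as a hedged
illustration, these are commonly cited Chrome flags for a faster, quieter
start (the profile path is a placeholder, not the commenter's setup):

```javascript
const launchArgs = [
  '--no-first-run',               // skip first-run wizards
  '--no-default-browser-check',   // skip the default-browser prompt
  '--disable-background-networking',
  '--disable-default-apps',
  '--disable-dev-shm-usage',      // avoid the small /dev/shm in containers
  '--mute-audio',
  '--user-data-dir=/tmp/profile', // reuse a prepared profile directory
];

console.log(launchArgs.length); // → 7
```

These can be passed to the chrome binary directly or via
`puppeteer.launch({ args: launchArgs })`.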

~~~
steve19
Can you share the command line switches that help?

------
adrianolek
I think there is no need to manually `ADD dumb-init` when Docker provides an
init process (tini) via the `--init` flag [0].

[0]: [https://docs.docker.com/engine/reference/run/#specify-an-
ini...](https://docs.docker.com/engine/reference/run/#specify-an-init-process)

~~~
mrskitch
The reason I baked dumb-init into the image was to offer a better dev UX
(since I'm assuming some/lots/most folks out there will just `docker run`
without much context on what's going on). I actually like specifying it with
the --init switch better, but it opens new devs up to a world of hurt.

------
defied
At TestingBot [1] we run about 50k sessions / day in real Chrome browsers
(each session is a new Virtual Machine with Chrome freshly launched).

Compared to the other browsers we offer, Chrome is by far the most stable.

[1] [https://testingbot.com](https://testingbot.com)

~~~
k5jhn
Hi, could you elaborate as to why TestingBot runs one browser per VM, as
opposed to running multiple browsers on a beefier VM?

~~~
defied
Because of security: we want to make sure each test cannot be affected by a
previous test.

If you ran multiple tests on one VM, one customer could change a property on
the VM or alter a state that might cause the next test on that VM to fail.

By providing a pristine VM per test, we guarantee that each test cannot be
affected by any previous test.

~~~
k5jhn
Thanks for the info!

------
voiper1
Do people generally agree with `3. page.evaluate is your friend` over using
puppeteer's methods?

~~~
brod
The biggest annoyance with `page.evaluate` is transpiled JavaScript that adds
helpers outside the `page.evaluate` closure, which in turn are not serialized
and sent to the browser.
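
This gotcha can be demonstrated without a browser: `page.evaluate` serializes
the callback via `Function.prototype.toString` and re-evaluates it in the
page, so helpers closed over from the outer scope (including ones a
transpiler injects) are lost. A minimal simulation:

```javascript
// A helper defined outside the callback, as a transpiler might emit.
const helper = (x) => x * 2;

const callback = () => helper(21); // works locally, closure intact

// Simulate what page.evaluate does: re-create the function from its source.
const serialized = callback.toString();
const rehydrated = new Function(`return (${serialized})();`);

let error = null;
try {
  rehydrated(); // `helper` doesn't exist in the re-evaluated scope
} catch (e) {
  error = e;
}
console.log(error instanceof ReferenceError); // → true
```

The usual fixes are defining helpers inside the callback, passing plain data
(not functions) as evaluate arguments, or injecting helpers into the page
with `page.exposeFunction` or `page.addScriptTag`.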

~~~
mrskitch
That's a fantastic point: I'll add that as a gotcha

