Observations running 2M headless sessions (browserless.io)
315 points by mrskitch on June 4, 2018 | 84 comments

I have a python project to manage headless chromium, which exposes basically the entire dev-tools UI:


It also supports some nice bits - there's a tab pool interface, it does execution lifetime management, and there's actual parameter checking for arguments generated by parsing the `protocol.json` file the browser supplies.

I use it as a mixin for some high(ish) volume web archiving I do as a hobby. It does a nice job poking through things like buttflare and other bullshit WAFs.

How do you archive? Screenshots?

I haven’t found a way to do mhtml with chrome headless which would be so convenient..

Archiving backend: https://github.com/fake-name/ReadableWebProxy

Stores the scraped content in a postgres database, with external blob storage for binary content.

Additionally, historical records of the scraped content are kept, so I have something that acts kind of like the internet archive too.

Thanks for your effort, this is an awesome project.


Excellent article and tips! I learned a lot from Joel's knowledge on Puppeteer and its intricacies.

I'm running a somewhat different business of which Puppeteer is a pretty big part. All points made in the post are valid, I could only add the following after having run ~400k Puppeteer sessions.

- Race conditions happen. This issue [0] is causing roughly 3% of all Puppeteer runs to fail in my case. I had to bake in a retry mechanism.

- Memory matters, CPU not so much. Just looking at my Librato and AWS stats, the CPU is mostly idle when running multiple concurrent sessions.

- One way to establish compartmentalisation is to actually run each scraping session in a separate, one-off Docker container. You pass in the code via a disk mount or such. No hassle with too many tabs, or shared context between runs. Each container is destroyed after running.

[0]: https://github.com/GoogleChrome/puppeteer/issues/1325
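A retry mechanism like the one mentioned above can be sketched as a small wrapper; this is a hypothetical helper with made-up attempt counts and backoff, not the commenter's actual code:

```javascript
// Minimal retry wrapper for flaky Puppeteer runs (hypothetical helper).
// Retries a task a few times, backing off a little between attempts.
async function withRetries(task, { attempts = 3, delayMs = 500 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // linear backoff before the next attempt
      await new Promise((resolve) => setTimeout(resolve, delayMs * (i + 1)));
    }
  }
  throw lastError; // all attempts failed: surface the last error
}

// Usage (assumes `puppeteer` is installed):
// const title = await withRetries(async () => {
//   const browser = await puppeteer.launch();
//   try {
//     const page = await browser.newPage();
//     await page.goto('https://example.com');
//     return await page.title();
//   } finally {
//     await browser.close();
//   }
// });
```

Launching a fresh browser inside the retried task (rather than reusing one) sidesteps the race in issue [0] entirely, at the cost of a slower retry.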

Hey Tim! Nice to see you here! I agree with your points overall, especially the disposable containers as they’re super hard to keep up. The CPU and Memory notes are accurate as well — except for canvas intense sites (which makes sense).

Thanks for posting your thoughts!

I'm amazed that one must go as far as creating different containers - duplicating the browser, dependencies and whatnot - to achieve tab isolation and browser resiliency in Chrome, when Firefox does this with as little as container tabs.

Well, in my specific case it is also a security measure. In the context of a SaaS you want very strict separation between user sessions. This was my main concern.

Belated note: my work colleague has recently released an Elixir-based solution to supervise Chrome, see:



Very nice! It's amazing to me how many developers just run headless without any sort of service layer. You're gonna have a bad time :(

Hey folks! I'm Joel, entrepreveloper behind this browserless.io service, happy to answer questions or chat!

Hi. What is this exactly? Skimming through the site didn't give me a clear, detailed explanation.

Is it a programming api to chrome? As in:

host:~$ ./browserless-app example.com # will fetch and interact using js/DOM with the web @ url

Is this supposed to be used as a proxy between some random browser, a remote chrome browser and the http response? As in:

<IE/edge/safary script> page-load: ws.connect(ws://mybless-service/); this.html = ws.recv()</script>

Or is it more oriented to have a remote service which you can use to send commands to process and control browsing sessions? Login to foobar.org; update this data; fetch something; return data/status to peer.

I'm really lost. A suggestion: I can't check four layers of subproject dependencies in order to have a clear picture of what the thing does and how it does it. I don't know if this is a node server running standalone (can I deploy this service on my own server? Is this a SaaS?), or something you use in the client js. Or something you can access from the client side that internally connects to a websocket. Or even a mix of all this. A clear architectural design diagram helps IMO in these things:

- Layer A runs on yours/ours/both linux/$X server using Y framework

- Layer B Is a websocket service (standalone/saas) for clients to consume.

- Layer C is a browser client js API wrapper for the service.

- This is useful for this <full use case and run session of the example>

Don't get me wrong, this is not a hate comment. It's just a personal suggestion to help people who aren't deeply invested in the JS/node ecosystem. Maybe it's just me being thick this morning and not getting it.

Thanks, it looks very cool anyway.

Yeah, definitely a product that you don't realize you need...until you find the issue.

As of now it's a pool of headless Chrome instances that you can control remotely and run tasks through. Screenshots, PDFs, scraping... whatever you need. The reason that folks use this is because it's _really scary_ to run Chrome alongside your app's infra, so separating it out (much like a DB) is a good idea.

In the near future we'll offer functions and other REST endpoints for doing common and user-defined tasks. This makes it even more contained, so your app can keep running happily without babysitting Chrome.

Great question!
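For the curious, driving a remote pool of Chrome instances from Puppeteer looks roughly like this. `puppeteer.connect` with `browserWSEndpoint` is real Puppeteer API; the host name and `token` query parameter below are placeholders, not the service's documented URL scheme:

```javascript
// Build a websocket endpoint for a remote Chrome pool. The token query
// parameter is an assumed auth scheme for illustration only.
function buildWsEndpoint(host, token) {
  return token
    ? `wss://${host}?token=${encodeURIComponent(token)}`
    : `wss://${host}`;
}

// Usage (assumes `puppeteer` is installed and a remote pool is running):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.connect({
//   browserWSEndpoint: buildWsEndpoint('chrome.example.com', 'MY_TOKEN'),
// });
// const page = await browser.newPage();
// await page.goto('https://example.com');
// await browser.disconnect(); // detach; the pool keeps Chrome alive
```

The appeal is exactly the "much like a DB" point above: your app only holds a connection string, and all the Chrome process babysitting happens elsewhere.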

Hi Joel,

I had fun collaborating with you last year on a puppeteer docker thread in GitHub.

I’m super happy browserless is picking up. A bit jealous of your success but still rooting for you!

Containerized Puppeteer is amazing. Hope Google acquires browserless.io some day.

Thanks very much for the clarification.

Indeed it looks very useful.

Thanks for the article. I ran into an issue a few minutes ago and phantomjs isn't going to cut it. I am more of a hobbyist, so this was a nice set of pointers and a good intro; I'd never heard of puppeteer. I was going to use nightmareJS or selenium but reckon I'll try this out. Also, the screenshot tip is nice; I was driving myself crazy trying to figure out where the bug in my code was. There wasn't one, the page was just not rendering. A screenshot is a nice sanity check.

Hey Joel, awesome service and post. Your tip about parallelizing via browser instances instead of tabs resonated with me, as I've had a lot of headaches with one bad request taking down a full instance. That said, it does seem like the constant launch/close of puppeteer browser instances has some sort of memory leak. Here's what happened when I switched from parallelizing via pages to browsers: https://imgur.com/a/rCz3oow

Have you seen any similar behavior in your system? I initially thought I was forgetting to close the browser instance but triple checked my code and it does seem like I'm calling close() everywhere.

We have found 2 related issues:

- Sometimes the Promise returned by page.close() never resolves, so it's good to call Promise.race() on that together with a Promise that resolves after some timeout period (30s?)

- Sometimes the Chrome process doesn't get killed, so we also manually kill any remaining Chrome processes after browser.close()
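The first workaround above can be sketched as a small timeout-race helper; the 30s figure and the fallback kill are assumptions for illustration:

```javascript
// Race a promise against a timeout so a hung page.close() can't stall
// the whole worker. Hypothetical helper, not the commenter's code.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // clear the timer either way so the process can exit cleanly
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage (assumes a Puppeteer `page` and `browser`):
// try {
//   await withTimeout(page.close(), 30000);
// } catch (err) {
//   // close() hung: fall back to killing the browser process outright.
//   // browser.process() returns the child process for launched browsers.
//   browser.process()?.kill('SIGKILL');
// }
// await browser.close();
```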

Good tips. How are you killing remaining zombie Chrome processes? Just polling the processlist and running "kill" commands?

Are you using something like dumb-init or equivalent? That's a definite requirement as _not_ having it results in zombie processes hanging around (which do retain some memory footprint). Read more about that here: https://github.com/Yelp/dumb-init#why-you-need-an-init-syste...

I'm using dumb-init for the main node process in my docker container (node in-turn spawns chrome instances via puppeteer as requests come in). Is there a way to use dumb-init for each individual chrome instance as well? Or are you suggesting we make chrome the main container process and start new containers for each session?

Hey. What is the error rate you face? Like, how often it is not possible to render a page because the navigation fails, chrome crashes, a strange exception is raised by puppeteer, ...

Great write-up! Hands-on, code included, concise. Thanks for posting.

head(less)s up: the links in the footer of your docs are busted. it looks like they all point to an `/en/` route but the actual URLs include no such tier.

Ouch, thanks for the heads up!

Small feedback about the website: It should be easier to leave the docs subdomain and look at the main page. Took me way too long to get to your pricing page.

Appreciate the feedback! I'm undergoing a massive redesign of the site that should be much friendlier. What you're seeing right now is the work of one person under tight deadlines :)

Great article. Do you have any prevention of overloading a single domain if the number of concurrent requests is increased?

Entrepreveloper, I like that.

Some of the puppeteer notes are effectively in the docs and quick starts. I've found puppeteer to be an absolute joy to use after years of phantom, custom scrapers, common scraper libraries, etc.

At a certain scale things obviously get harder but making your "sessions" ephemeral handles a lot of the resource issues.

The cat and mouse with people looking to block puppeteer access may heat up a bit too. But for the most part puppeteer makes doing this stuff 100x easier.

Right - probably repeating some stuff already out there. The main issue I've been seeing on puppeteer's slack channel is folks who are totally new and don't read the docs, so I'm trying to help those devs effectively.

Very interesting to see this. I've been running a similar infrastructure rendering sometimes up to 500k pages a day, although often without images. I'm also running on Digital Ocean, but using nightmare.js (https://github.com/segmentio/nightmare), which runs on top of Electron, which in turn runs on top of Chromium.

The CPU and RAM patterns I see are different, with fixed CPU usage at near max and memory oscillating between 65% and 80%. I believe this is due to the different usage pattern, I basically always have at least 20/30 jobs running concurrently on each machine, and they're usually fairly long (up to 10 minutes or so).

Contrary to what you mention, I've never had an issue with pages crashing and bringing the whole browser down. Maybe it has happened, but it's definitely negligible compared to the benefits I get by running, say, 5 pages in parallel. For some tasks I've also had some luck overriding the cross-origin policy and using dirty ol' iframes to render multiple webpages in the same session.

I've considered migrating to puppeteer, so it's encouraging to see large-scale projects sharing their experience with it.

Same experience here - we run a docker instance of Chrome using tabs for multiple pages rather than multiple browsers, and they regularly run for days without issues. Of course their RAM usage gradually expands but it is easy enough to systematically stop & start the container, thanks to some built in error and fault handling to retry any requests which failed.

I can’t say I’ve seen one tab bring down the entire browser, though I’m sure that’s feasible; thanks to docker and the fault handling above, it’d restart the instance and be back up within seconds.

This is what I've initially noticed as well (app will run for several days and eventually the account will cancel due to unavailability ... you'll get paged).

The screenshot I showed in that post is an instance that's been running for _months_ under high load. I can't stress enough that using tabs/pages will always result in frequent restarts which can be really tricky if there's other sessions that need to gracefully finish.

The problem with fixed concurrency is that the required memory per Chrome process varies a lot.

We (https://www.apify.com) are solving this by autoscaling number of parallel Puppeteer instances based on memory and CPU. Our open source SDK (http://github.com/apifytech/apify-js) implements this using class PuppeteerCrawler (https://www.apify.com/docs/sdk/apify-runtime-js/latest#Puppe...) which internally uses AutoscaledPool that provides autoscaling:

- https://github.com/apifytech/apify-js/blob/master/src/autosc...

- https://www.apify.com/docs/sdk/apify-runtime-js/latest#Autos...

Sadly this feature is currently limited to our platform because it's mainly built for running in Docker containers. And as a Docker container doesn't know about its CPU consumption vs. its limits, it requires notifications about reaching its CPU limit from the underlying platform. We have solved this using Websocket events, but we are currently working on extending this to work anywhere/locally.
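The core idea of memory-driven autoscaling can be sketched in a few lines. The thresholds and step size below are made up for illustration; the AutoscaledPool linked above is far more sophisticated:

```javascript
// Toy autoscaling decision: pick the next concurrency level from the
// fraction of memory in use. All thresholds are illustrative guesses.
function nextConcurrency(current, memUsedFraction, { min = 1, max = 20 } = {}) {
  if (memUsedFraction > 0.85) return Math.max(min, current - 1); // shed load
  if (memUsedFraction < 0.6) return Math.min(max, current + 1);  // room to grow
  return current; // hold steady in the comfortable band
}

// Usage with real measurements (Node stdlib):
// const os = require('os');
// const used = 1 - os.freemem() / os.totalmem();
// concurrency = nextConcurrency(concurrency, used);
```

Note this only sees host-level memory; inside a Docker container you'd need the cgroup limits instead, which is exactly the platform-notification problem described above.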

I've definitely noticed this too and have been working on an "auto-scale" feature as well (to tone down concurrency when under load). It's not an easy task, but you can see my first pass here: https://github.com/joelgriffith/browserless/blob/master/src/...

Great article! I've got several apps that maintain long-running Selenium jobs (they require a real JS engine for the pages they hit), and ruthlessly killing containers after each job and on a timer is the only solution I found. Zombie processes are a b!

My team is in the planning phase of rewriting one, and we're going to use Puppeteer this time. Definitely sharing this with them!

I may have misused the term zombie process. Seems that's when a terminated process's entry isn't cleared from the process table.

In my experience Selenium will appear to be functioning normally from the outside, but the driver completely locks and doesn't proceed at seemingly random times. I couldn't think of a sane way to detect this; killing containers seemed easier and has worked reliably so far. I'm definitely open to hearing suggestions though.

That aside though Tini looks like a good starting point for containers like mine anyways. So thanks, it does seem helpful!

It runs out of entropy? Not sure if this applies to containers, but had something like this happen in a virtual machine we used for builds (Jenkins). I made the mistake of instantiating a cryptographically secure PRNG during build time; the initialization needed entropy to seed the CSPRNG with, and this would lock up the whole virtual machine in the middle of the build. It seems that headless VMs have no possible source of entropy, unless explicitly configured to use host's source. Maybe something similar happens with containers?

...the driver completely locks and doesn't proceed at seemingly random times...

I think the terminology for that is "wedged".

One thing that I noticed is that if the website is behind Distil Networks, they will block on the first request and make it cumbersome, if not impossible, to automate some tasks.

I get that it is important to protect the information of their clients (which seem to be content aggregators), but there are legitimate use cases for allowing at least some scraping to happen - one being when the information is about the person/company who wants to read it in an automated form so it can be used for further processing.

At Ahrefs we are running between 130 and 170 million sessions a day. The project started before the release of puppeteer. I can tell you the release of puppeteer saved me a lot of time. It's way easier to keep a correct browser state using it than with what was previously available. It also interacts well with lighthouse. I wouldn't call chrome headless stable or bug free (it doesn't even handle https correctly), but it's a good thing to have it available.

What issues do you have with HTTPS/TLS? I can’t say I’ve come across any issues so far - is it a specific usecase to Ahrefs?


I don't think we have a fancy usage of puppeteer. But still we get between 2% and 5% of errors depending on the version used. It's a small ratio, but at this scale it's a lot of pages.

What kind of hardware is Ahrefs running these headless browser sessions on?

Awesome stats, I've actually run across Ahrefs a few times during my development and purview of other products. Really: well done on that, I hold Ahrefs in high esteem

I’d like to know, what are people on HN using headless browsers for?

You’d actually be surprised by the applications, I definitely was. Some I’ve seen:

- Screenshots

- PDF generation

- Functional tests

- Regression tests

- Scraping SPAs

- Crazy stuff like printing user-generated content for personalization products..

Among some of these, I've also built an internal microservice for analyzing URL redirect chains. Input a URL and it will provide me with all of the redirects it goes through and information for each one including method (html meta, 301, etc), timing, etc.
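The redirect-chain part of this can be sketched with Puppeteer. `response.request().redirectChain()` is real Puppeteer API; the summary shape below is just one way to report it, not the commenter's actual service:

```javascript
// Turn a Puppeteer redirect chain (an array of Request objects) into a
// plain summary of hops. Illustrative helper, not the actual service.
function summarizeChain(requests) {
  return requests.map((req) => ({
    url: req.url(),
    status: req.response() ? req.response().status() : null,
  }));
}

// Usage (assumes `puppeteer` is installed):
// const page = await browser.newPage();
// const response = await page.goto('http://example.com', { waitUntil: 'networkidle2' });
// const hops = summarizeChain(response.request().redirectChain());
```

This only captures HTTP-level redirects (301/302 etc.); meta-refresh and JS-driven redirects need extra handling, e.g. listening for the page's `framenavigated` events.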

I’ve done some hacky redirect-analysis work, but only on the final url. Have you thought of open sourcing it?

Awesome! What did you use to go through the redirects? I’ve had issues with some browser automation only giving the final url.

I'm using Chrome Headless to analyze user-submitted pages on https://urlscan.io

I will save a screenshot, network request info, request content (for JS/CSS/HTML), cookies, log messages, JavaScript global variables etc. Annotate and correlate it all and you have a nice summary of what a certain website "does".

I am using them to run node-electron hosted WebRTC proxies, to enable UDP-like networking for a real time browser game.

Interacting with an app which doesn't provide a public API and has too much frontend code to be worth reverse engineering. I just hack into the app's webpack-bundled code and use it.

I'm using splash [1] for scraping websites for personal automation purposes.

[1] https://splash.readthedocs.io/en/stable/

I scrape lots of data from services while logged in. Scraping is done with curl, but every so often cookies expire, so I use selenium to login, save cookies, then go back to curl.

Isn't curl fit to also do the login part and save the cookies?

Tons of stuff gets loaded during login; it's hard to go through each request and code the proper urls and POST params etc. Quicker to write login code.
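The browser-to-curl handoff can be sketched like this. The commenter uses Selenium, but with Puppeteer the cookie objects from `page.cookies()` have a `{ name, value, ... }` shape that serializes to a header curl can replay (the login flow itself is elided):

```javascript
// Serialize browser cookie objects ({ name, value, ... }) into a
// Cookie header string for curl. Illustrative helper.
function toCookieHeader(cookies) {
  return cookies.map((c) => `${c.name}=${c.value}`).join('; ');
}

// Usage (assumes `puppeteer` is installed):
// await page.goto('https://example.com/login');
// ...fill in and submit the login form...
// const header = toCookieHeader(await page.cookies());
// Then replay with curl until the session expires:
//   curl -H "Cookie: <header>" https://example.com/data
```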

PDF generation for us. Had been relying on phantomjs for years, but it has been a bear. Attempted to write something with PDFKit to avoid the overhead of hosting headless chrome, but the effort was too great for our small team. Using chrome we are able to build high fidelity reports with the same web components we use in the application's UI.

I'm using Puppeteer for website / transaction monitoring at https://checklyhq.com. People script their own checks and we run them for you on a cron like schedule and alert when things go off the rails.

Personally? Dealing with cloudflare and other bullshit WAFs that break the internet.

Recording of webrtc video conferencing sessions , pdf generation, spa test suites

https://www.prerender.cloud/ - server-side rendering API for single-page apps

Headless browsers are the de facto standard for scraping websites which have some protection layer. Ad fraud is also a thing.

Automating logging into and data capture for financial/banking sites, building a new finance aggregation service.

Rendering high res vectors (SVG)

Built a DOM-to-png image api (hivis.io)

Personally, for testing.

I recently played around with a headless Firefox instance to test a web app on Travis and so far it has been working incredibly well. Here is a sample .travis.yml file for those interested: https://github.com/dom96/geckodriver-travis/blob/master/.tra...

I'm running ~600k to ~800k chrome sessions a day, each in a new browser process (no tabs/pages) because of cookie interference.

The magic is hidden in chrome's command line parameters which can significantly speed up the browser start.

I'm not using headless because I need an extension but most command line switches also apply to headless I guess...

A prepared profile/user directory also helps speeding up things.

Can you share the command line switches that help?

I would also be very interested in tips regarding this.

please do share!

I think there is no need to manually `ADD dumb-init` when Docker provides an init (tini) process via the `--init` flag.

[0]: https://docs.docker.com/engine/reference/run/#specify-an-ini...

The reason I baked dumb-init into the image was in order to offer better dev UX (since I'm assuming some/lots/most folks out there will just `docker run` without much context on what's going on). I actually like specifying with the init switch better, but it opens up new devs to a world of hurt

At TestingBot [1] we run about 50k sessions / day in real Chrome browsers (each session is a new Virtual Machine with Chrome freshly launched).

Compared to the other browsers we offer, Chrome is by far the most stable.

[1] https://testingbot.com

Hi, could you elaborate as to why TestingBot runs one browser per VM, as opposed to running multiple browsers on a beefier VM?

Because of security, we want to make sure each test can not be affected by a previous test.

If you'd run multiple tests on 1 VM, then one customer could change a property on the VM or alter a state that might cause the next test on the VM to fail.

By providing a pristine VM per test, we guarantee that each test cannot be affected by any previous test.

Thanks for the info!

Do people generally agree with `3. page.evaluate is your friend` over using puppeteer's methods?

The biggest annoyance with `page.evaluate` is transpiling JavaScript, which adds helpers outside of the `page.evaluate` closure that in turn are not serialized and sent to the browser.

That's a fantastic point: I'll add that as a gotcha
