It also supports some nice bits - there's a tab pool interface, it does execution lifetime management, and there's actual parameter checking for arguments generated by parsing the `protocol.json` file the browser supplies.
I use it as a mixin for some high(ish) volume web archiving I do as a hobby. It does a nice job poking through things like buttflare and other bullshit WAFs.
I haven’t found a way to do MHTML with headless Chrome, which would be so convenient.
Stores the scraped content in a postgres database, with external blob storage for binary content.
Additionally, historical records of the scraped content are kept, so I have something that acts kind of like the internet archive too.
I'm running a somewhat different business of which Puppeteer is a pretty big part. All points made in the post are valid; I can only add the following after having run ~400k Puppeteer sessions.
- Race conditions happen. This issue is causing roughly 3% of all Puppeteer runs to fail in my case. I had to bake in a retry mechanism.
- Memory matters, CPU not so much. Just looking at my Librato and AWS stats, the CPU is mostly idle when running multiple concurrent sessions.
- One way to establish compartmentalisation is to actually run each scraping session in a separate, one-off Docker container. You pass in the code via a disk mount or such. No hassle with too many tabs, or shared context between runs. Each container is destroyed after running.
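A rough sketch of the retry and one-off container ideas from the list above, assuming Node with Docker on the PATH. The image name, the mount path `/job`, and the helper names `dockerArgs`, `runSession` and `withRetry` are all hypothetical; the commenter does not say how their setup is wired.

```javascript
const { spawn } = require("node:child_process");

// Build the argument list for a throwaway container.
// `image`, `scriptDir` and `entry` are hypothetical placeholders.
function dockerArgs(image, scriptDir, entry) {
  return [
    "run",
    "--rm",                        // remove the container once it exits
    "-v", `${scriptDir}:/job:ro`,  // pass the scraping code in via a read-only mount
    image,
    "node", `/job/${entry}`,       // assumes the image ships a `node` binary
  ];
}

// Run one scraping session in its own container.
// Resolves on exit code 0 and rejects otherwise.
function runSession(image, scriptDir, entry) {
  return new Promise((resolve, reject) => {
    const child = spawn("docker", dockerArgs(image, scriptDir, entry), {
      stdio: "inherit",
    });
    child.on("error", reject);
    child.on("exit", (code) =>
      code === 0 ? resolve() : reject(new Error(`container exited with ${code}`))
    );
  });
}

// Retry a flaky async task up to `attempts` times, rethrowing the last error.
async function withRetry(task, attempts = 3) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Something like `withRetry(() => runSession("my-scraper-image", "/srv/jobs/42", "scrape.js"))` would then give each attempt a fresh container, so a run that hits the race condition leaves no state behind for the next one.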
Thanks for posting your thoughts!
Is it a programming api to chrome? As in:
host:~$ ./browserless-app example.com # will fetch the page at that URL and interact with it using JS/DOM
Is this supposed to be used as a proxy between some random browser, a remote chrome browser and the http response? As in:
<IE/edge/safari script> page-load: ws.connect(ws://mybless-service/); this.html = ws.recv()</script>
Or is it more oriented toward a remote service that you send commands to in order to run and control browsing sessions? E.g. log in to foobar.org; update this data; fetch something; return data/status to the peer.
I'm really lost. A suggestion: I can't check four layers of subproject dependencies in order to get a clear picture of what the thing does and how it does it. I don't know if this is a Node server running standalone (can I auto-deploy this service on my own server? Is this a SaaS?), or something you use in client-side JS, or something you access from the client side that internally connects to a websocket, or even a mix of all of these. A clear architectural design diagram helps in these cases, IMO:
- Layer A runs on yours/ours/both linux/$X server using Y framework
- Layer B is a websocket service (standalone/SaaS) for clients to consume.
- Layer C is a browser client js API wrapper for the service.
- This is useful for this <full use case and run session of the example>
Don't get me wrong, this is not a hate comment. It's just a personal suggestion to help people who aren't deeply invested in the JS/Node ecosystem. Maybe it's just me being thick this morning and not getting it.
Thanks, it looks very cool anyway.
As of now it's a pool of headless Chrome instances that you can control remotely and run tasks through. Screenshots, PDFs, scraping... whatever you need. The reason that folks use this is because it's _really scary_ to run Chrome alongside your app's infra, so separating it out (much like a DB) is a good idea.
In the near future we'll offer functions and other REST endpoints for doing common and user-defined tasks. This makes it even more contained, so your app can keep running happily without babysitting Chrome.
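To make the "control remotely" part concrete, here is a minimal sketch using Puppeteer's `connect()` call, which attaches to an already-running Chrome over a websocket instead of launching one locally. The function name `remoteScreenshot` and any endpoint you pass in are hypothetical, and ending the session with `browser.close()` is an assumption about how a pooled service wants sessions to finish.

```javascript
// `puppeteer` is passed in rather than required, so the sketch makes no
// assumption about which Puppeteer package the app has installed.
async function remoteScreenshot(puppeteer, wsEndpoint, url) {
  // Attach to a Chrome instance running elsewhere instead of launching one.
  const browser = await puppeteer.connect({ browserWSEndpoint: wsEndpoint });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.screenshot(); // image bytes
  } finally {
    // Assumption: one connection is one session, so end it here.
    await browser.close();
  }
}
```

From the app's side this would be called as `remoteScreenshot(require("puppeteer"), "ws://chrome-host:3000", "https://example.com")`, where the websocket address is a placeholder for wherever the Chrome pool is running. The app only holds a websocket connection, while Chrome's memory and crashes stay on the other machine.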
I had fun collaborating with you last year on a puppeteer docker thread in GitHub.
I’m super happy browserless is picking up. A bit jealous of your success but still rooting for you!
Containerized Puppeteer is amazing. Hope Google acquires browserless.io some day.
Indeed it looks very useful.
Have you seen any similar behavior in your system? I initially thought I was forgetting to close the browser instance but triple checked my code and it does seem like I'm calling close() everywhere.
- Sometimes the Promise returned by page.close() never resolves, so it's good to pass it to Promise.race() together with a Promise that resolves after some timeout period (30s?)
- Sometimes the Chrome process doesn't get killed, so we also manually kill any remaining Chrome process after browser.close()
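The two workarounds above can be sketched roughly like this. `withTimeout` and `closeSafely` are hypothetical helper names; `page.close()`, `browser.close()` and `browser.process()` are Puppeteer calls, and the 30 second default simply mirrors the figure floated in the comment.

```javascript
// Resolve with `fallback` if `promise` has not settled within `ms`.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Close a page and its browser without trusting either call to finish.
// Returns "closed", "failed" or "timed out" for the page.close() step.
async function closeSafely(page, browser, ms = 30000) {
  const pageResult = await withTimeout(
    page.close().then(() => "closed", () => "failed"),
    ms,
    "timed out"
  );

  // browser.process() is null when attached to a remote browser.
  const proc = browser.process();
  await withTimeout(browser.close().catch(() => {}), ms);

  // If the Chrome child process is still alive, kill it manually.
  if (proc && proc.exitCode === null) {
    proc.kill("SIGKILL");
  }
  return pageResult;
}
```

Racing against a timer does not cancel the stuck `page.close()`; it only stops the caller from waiting on it forever, which is why the explicit kill afterwards still matters.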
At a certain scale things obviously get harder but making your "sessions" ephemeral handles a lot of the resource issues.
The cat and mouse with people looking to block puppeteer access may heat up a bit too. But for the most part puppeteer makes doing this stuff 100x easier.
We (https://www.apify.com) are solving this by autoscaling the number of parallel Puppeteer instances based on memory and CPU. Our open source SDK (http://github.com/apifytech/apify-js) implements this using the PuppeteerCrawler class (https://www.apify.com/docs/sdk/apify-runtime-js/latest#Puppe...), which internally uses AutoscaledPool that provides autoscaling:
Sadly, this feature is currently limited to our platform because it's mainly built for running in Docker containers. Since a Docker container doesn't know its CPU consumption relative to its limits, it needs a notification from the underlying platform when it reaches its CPU limit. We have solved this using WebSocket events, but we are currently working on extending it to work anywhere, including locally.
The CPU and RAM patterns I see are different, with fixed CPU usage at near max and memory oscillating between 65% and 80%. I believe this is due to the different usage pattern: I basically always have at least 20/30 jobs running concurrently on each machine, and they're usually fairly long (up to 10 minutes or so).
Contrary to what you mention, I've never had an issue with pages crashing and bringing the whole browser down. Maybe it has happened, but it's definitely negligible compared to the benefits I get by running say 5 pages in parallel. For some tasks I've also had some luck overriding the cross-content policy and using dirty ol' iframes to render multiple webpages in the same session.
I've considered migrating to puppeteer, so it's encouraging to see large-scale projects sharing their experience with it.
I can’t say I’ve seen one tab bring down the entire browser, though I’m sure that's feasible. Thanks to docker and the fault handling above, it’d restart the instance and be up within seconds.
The screenshot I showed in that post is an instance that's been running for _months_ under high load. I can't stress enough that using tabs/pages will always result in frequent restarts which can be really tricky if there's other sessions that need to gracefully finish.
My team is in the planning phase of rewriting one, and we're going to use Puppeteer this time. Definitely sharing this with them!
In my experience Selenium will appear to be functioning normally from the outside, but the driver completely locks up and doesn't proceed at seemingly random times. I couldn't think of a sane way to detect this; killing containers seemed easier and has worked reliably so far. I'm definitely open to hearing suggestions though.
That aside though Tini looks like a good starting point for containers like mine anyways. So thanks, it does seem helpful!
I think the terminology for that is "wedged".
I get that it is important to protect the information of their clients (which seem to be content aggregators), but there are legitimate use cases for allowing at least some scraping to happen - one being when the information is about the person/company who wants to read it in an automated form so it can be used for further processing.
I don't think we have a fancy usage of puppeteer. But still we get between 2% and 5% of errors depending on the version used. It's a small ratio, but at this scale it's a lot of pages.
- PDF generation
- Functional tests
- Regression tests
- Scraping SPAs
- Crazy stuff like printing user-generated content for personalization products..
The magic is hidden in chrome's command line parameters which can significantly speed up the browser start.
I'm not using headless because I need an extension but most command line switches also apply to headless I guess...
A prepared profile/user directory also helps speed things up.
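As an illustration only: the comment does not say which switches it relies on, so the ones below are simply common Chromium switches that skip first-run work, combined with a reused profile directory. `chromeArgs` is a hypothetical helper name.

```javascript
// Build a Chrome argument list that skips first-run work and reuses a
// prepared profile. The choice of switches is an assumption made for
// illustration, not the commenter's actual list.
function chromeArgs(profileDir) {
  return [
    "--no-first-run",                   // skip the first-run setup flow
    "--no-default-browser-check",       // don't ask about the default browser
    "--disable-background-networking",  // avoid background network services at startup
    `--user-data-dir=${profileDir}`,    // reuse an already prepared profile
  ];
}
```

The resulting array can be handed to whatever starts the browser, for example as the `args` option of `puppeteer.launch()` with `headless: false` when an extension is needed.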
Compared to the other browsers we offer, Chrome is by far the most stable.
If you ran multiple tests on one VM, one customer could change a property on the VM or alter some state that might cause the next test on that VM to fail.
By providing a pristine VM per test, we guarantee that each test cannot be affected by any previous test.