
Theheadless.dev – open source Puppeteer and Playwright knowledge base - tnolet
https://theheadless.dev/
======
lihaoyi
I used puppeteer in my book’s
([https://www.handsonscala.com/](https://www.handsonscala.com/)) build
pipeline to convert HTML sources into PDFs, for online distribution and
printing. This let me write and style my book using common web technologies
(e.g. Bootstrap CSS) rather than needing to fiddle with specialized tooling
like Pandoc or LaTeX, and it ended up looking pretty good.

Works flawlessly, and has exactly the configuration knobs you would expect and
want. Took a bit of plumbing to call into a Node.js script from my Scala build
logic, but all in all ended up being like 20 lines of plumbing which was
straightforward to write and understand. A welcome change after struggling
with bugs in wkhtmltopdf!
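For anyone curious what that plumbing looks like, here is a minimal sketch of
the same idea. The function and option names are my own assumptions, not the
book's actual build script; it assumes a `puppeteer` module is installed and
passed in:

```javascript
// Hypothetical plumbing sketch - names and options are illustrative.
const pdfOptions = {
  format: 'A4',
  printBackground: true,                  // keep CSS backgrounds/colors in print
  margin: { top: '1in', bottom: '1in' },
};

// With a real `puppeteer` module passed in, this renders a local HTML file
// to a print-ready PDF.
async function printToPdf(puppeteer, htmlPath, outPath) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // networkidle0: wait until fonts, stylesheets, and images finish loading.
  await page.goto(`file://${htmlPath}`, { waitUntil: 'networkidle0' });
  await page.pdf({ path: outPath, ...pdfOptions });
  await browser.close();
}

console.log(typeof printToPdf); // "function"
```

`printBackground` and `margin` are the kind of configuration knobs mentioned
above; `page.pdf()` also accepts header/footer templates and page ranges.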

~~~
rossta
This is a really interesting use case! Screenshots of your book look great.

I'd just started looking into the existing tools like those you mentioned for
e-book authoring and it's quickly gotten overwhelming. From what you've
described, I'm interested in looking into puppeteer as an alternative
workflow.

------
jadell
One day I will write an extensive post (or set of them) about using Puppeteer
to bypass sites' anti-bot measures. It's a fascinating (and annoying)
cat-and-mouse game. But at the end of the day, almost all bot detection measures rely
on using Javascript to report back metrics about the browser, but those
measures are running in an environment where the bot completely controls what
Javascript reports back.

One of my favorite tricks I've seen employed is a detection measure that looks
to see if common detection-bypass tricks have been implemented (like checking
the toString output of commonly overridden native functions).
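That toString check fits in a few lines of plain JavaScript (a sketch;
`isNativePatched` is an illustrative name, not any real library's API):

```javascript
// Unpatched built-ins stringify to "function name() { [native code] }",
// so a page's detection script can spot a naive override.
function isNativePatched(fn) {
  return !Function.prototype.toString.call(fn).includes('[native code]');
}

console.log(isNativePatched(Math.random)); // false: still the native function

// A bot that swaps in its own implementation is exposed...
const patched = function random() { return 0.5; };
console.log(isNativePatched(patched));     // true: override detected
```

...which is why sophisticated bypasses go one level deeper and patch
`Function.prototype.toString` itself, restarting the cat-and-mouse game.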

[https://theheadless.dev/posts/challenging-flows/#bot-detection](https://theheadless.dev/posts/challenging-flows/#bot-detection)

~~~
rashkov
I wonder if google captcha will always be able to defeat puppeteer? Seems odd
for google to publish a set of abuse-able APIs, and not be able to detect
their use.

~~~
folkhack
There are farms of people who literally sit around all day and solve CAPTCHAs
- there's no surefire way to address this problem, and it usually ends up as
an orchestration of reputation-scoring tools (including making a user fill out
a CAPTCHA) to fingerprint a bot.

If you're good at spoofing all of that fingerprinting you'll blow straight
past them - it's all client-side, where you have control all the way down to
the bits and bytes.

~~~
vdfs
You can just use Google's speech-to-text API to solve reCAPTCHA's audio
challenge

~~~
judge2020
This is their answer:
[https://support.google.com/websearch/answer/86640?hl=en](https://support.google.com/websearch/answer/86640?hl=en)

------
chromedev
I've used and love Puppeteer, but it also makes me realize with enough money
and the right skills, these tools are exactly what would be used to mass
manipulate social media. You could literally create thousands or millions of
accounts and add enough entropy to make it undetectable.

I've created a couple scripts to delete accounts, and even signing in can
include randomization between scrolls and clicks to make each change entirely
unique and to mimic real user interactions. Sort of scary to think about what
is possible with this, especially given a large pool of residential IP
addresses.

~~~
tnolet
OP here: funny and true story. We had a misconfiguration in our rate limiting.
One smarty pants used it to blast his Twitch channel with "viewers", i.e.
1,000 concurrent Puppeteer sessions.

It was probably great for his engagement / viewership numbers till we shut him
down.

~~~
sovietmudkipz
I have to ask... What was the reason you shut this person down? Was it simply
that they were violating your rate limit, or something more?

E.g. if they were paying for the 1000 concurrent puppeteer sessions, would
everything be in the clear with your SaaS?

Presumably, your service doesn't care what the users use it for. Sure, though,
it's a violation of twitch's terms of service to fake viewers on the platform.
I may be naive -- could twitch sue a service that is used to fake viewers?

~~~
tnolet
They were violating our terms of service, rate limiting, responsible behaviour
etc. etc. So we care quite a bit what Checkly is used for. We have a page for
it:

[https://www.checklyhq.com/docs/browser-checks/responsible-use/](https://www.checklyhq.com/docs/browser-checks/responsible-use/)

Luckily, most users are completely ok. But fraud and abuse will always be a
thing.

~~~
Dylanlacey
That you were asked this question makes me a little sad.

Besides risks like Twitch (or worse, all of Amazon) blocking your traffic
wholesale, letting someone use your product to abuse another is just a crappy
thing to do, especially if you allow it because "well, we're making money".

------
gitgud
Anybody know of a framework which can record user sessions and output
Puppeteer commands? _(basically mouse clicks -> code)_

It would make writing end-2-end integration tests much easier...

~~~
hlenke
Hey, here is a basic & open source recorder for Puppeteer:
[https://github.com/checkly/puppeteer-recorder](https://github.com/checkly/puppeteer-recorder) from the same team
(disclaimer: I'm a co-founder of Checkly).

------
umaar
I'm working on a video course for all things browser automation, whether it's
testing, scraping, auditing, deploying etc.

I'm keeping the codebase open on GitHub:
[https://github.com/umaar/learn-browser-testing/](https://github.com/umaar/learn-browser-testing/) so anyone
who wants to follow along can do so for free.

I've almost finished some cool content such as:

\- An Amazon price checker which sends you a text message when the price
decreases

\- A Playwright script which gets to Wikipedia's Philosophy page
([https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy](https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy))

\- Having an automation script constantly running on a cheap Raspberry Pi

~~~
moooo99
A good use case would be indexing real estate portals for interested buyers. I
could imagine it as a relatively attractive side business, targeting real
estate agencies and buyers/renters alike. But I haven't had the time to
actually build it.

~~~
mxuribe
Having worked ~8 years in residential real estate (and overseen development
and maintenance of an in-house API)...I've learned that real estate data on
all those real estate sites (from Zillow to any other broker site, etc.) is
very much not clean, not always correct, nor updated at a realistic interval.
From a data perspective, it is extremely messy to work with. Plus, real estate
brokers (and related data brokers) have little incentive to clean up the data,
let alone make updates more real-time. This might not be a big issue for
personal apps...but just be aware of the not-so-little gaps around real estate
data. Now, if someone did this as an exercise to learn about APIs, scraping,
and web browser automation, then by all means enjoy; plenty to learn for sure!
But don't expect the data to be all that usable. (Oh, and my claims of data
messiness are not limited to automated access, as anyone searching for homes
can tell you about their disappointments when asking about a home only to hear
from the agent: "Sorry, that home was sold, we/someone forgot to take the
listing down...").

~~~
randomdude402
This is unfortunately the truth of the situation. I spent way too much time
writing scrapers for Zillow and Redfin to guide home purchasing decisions,
only to find out half the stuff listed for sale had been under contract for
weeks, places listed with a garage that had none, homes with pictures of pools
and no pool in the data fields... just tons of errors in the data that no
amount of vigilance could clean.

~~~
Dylanlacey
The mad thing is, I would happily pay for the real estate sites to provide me
with this kind of data because buying a house is so expensive and such a
pain...

But it would STILL suck, because the input is so dodgy.

------
nikisweeting
I've been a long-time Puppeteer user; it's been game-changing in so many ways.
We even built a whole platform with it to automate our company's social media
presence, along with countless small bash scripts, archiving utilities, QA
workflows, etc.

Playwright interests me even more though. We've been getting a lot of requests
for ArchiveBox.io to support other browsers as the rendering engine for web
archives, and it's always seemed daunting to try and reimplement multi-browser
support ourselves for puppeteer-style workflows, but Playwright seems to
completely take care of that!

------
vulpesx2
I've never really worked too heavily with headless, so what would be some
examples of 'real-life' applications using this API? Looking for some
inspiration to maybe build a side project around this :)

~~~
tnolet
Obviously tooting my own horn here, but these are some products:

\- [https://checklyhq.com](https://checklyhq.com) \- synthetic monitoring

\- [https://microlink.io](https://microlink.io) \- automation

\- [https://www.browserless.io](https://www.browserless.io) \- testing

\- [https://www.scrapingbee.com/](https://www.scrapingbee.com/) \- scraping

All are businesses leveraging these types of frameworks.

~~~
mrskitch
Thanks for the mention (browserless.io)! Really enjoying this site, let me
know if you'd like a guest post and I'd happily add something. We've gathered
a few weird tricks and tips as well. Definitely a lot that can be shared!

------
tnolet
Maybe fun to add: the guides and articles are all on GitHub, as is the source
code for the site - it's VuePress based, so it might help folks who want to
make their own knowledge base using that framework.

------
thekyle
How do these differ from something like Selenium?

~~~
mrskitch
Hey, I work on browserless.io, which supports both puppeteer and selenium.

Selenium uses a chatty HTTP interface, whereas puppeteer/playwright use
WebSockets or pipes to communicate. Under the hood, however, Selenium is
simply using Chrome's DevTools protocol to communicate with it. The way
Selenium does this is via another binary, generally a `driver`, that has the
protocol "baked" into it and exposes the HTTP Selenium API as its input
interface.

This is all a long way of saying that puppeteer/playwright have a lot fewer
moving parts and are generally more approachable. Selenium _does_ have a lot
more history behind it, better support across languages and frameworks, and is
more stable, but it's also much larger and "clunkier" feeling. It's also a lot
harder to scale with load-balancers since, again, it's all over HTTP, so
you'll need some way to load-balance with sticky sessions.

Practically speaking, they all do the same thing at some layer. Both are
high-level APIs around the DevTools protocol; it's just a question of which
higher-level interface you prefer and what your language/runtime is.
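To make the protocol difference concrete, here is roughly the shape of the
same navigation command in each. These are illustrative objects only, not a
working client; the WebDriver endpoint and the CDP `Page.navigate` method are
real, everything else is abbreviated:

```javascript
// WebDriver (Selenium): one HTTP round trip per command.
const webdriverNavigate = {
  method: 'POST',
  path: '/session/{session-id}/url',     // placeholder for the real session id
  body: { url: 'https://example.com' },
};

// Chrome DevTools Protocol (puppeteer/playwright): one JSON message on an
// already-open WebSocket; `id` correlates the asynchronous response.
const cdpNavigate = {
  id: 1,
  method: 'Page.navigate',
  params: { url: 'https://example.com' },
};

console.log(JSON.stringify(webdriverNavigate.body));
console.log(JSON.stringify(cdpNavigate));
```

The per-command HTTP round trips (plus the driver binary in between) are where
Selenium's "chattiness" and extra moving parts come from.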

~~~
Dylanlacey
^^ All of this. I work for Sauce Labs; we've been pretty focused on Selenium
but we're building out support for Puppeteer, Playwright, Cypress et al.

The newer automation tools benefit from being newer; They can take advantage
of hardened, well designed interfaces (like the Dev Tool protocol). Selenium's
been around for a bit longer, and was built when browsers didn't make it easy
to control them. That's influenced the semantics of Selenium quite a lot, as
well as explaining the extra moving parts (Drivers exist to map the Selenium
Wire Protocol (or W3C protocol) to whatever they're driving because Selenium
wasn't built with a specific browser in mind).

I feel like, at this point in time, the real difference is how much
abstraction you want from the browser. Selenium is a set of knives, Puppeteer
is a die cutter. You'll put in more work with Selenium, but maybe you need
something to happen a REALLY specific way. Or, you might just need shapes cut,
and Puppeteer will be more reliable and faster.

------
defied
I work for [https://headlesstesting.com](https://headlesstesting.com) where we
provide a grid of browsers, which people can use in combination with Puppeteer
and Playwright. One of the reasons people use this, instead of Selenium, is
because of the increase in speed.

------
fareesh
How does one reliably use puppeteer and know when a page has loaded?

I've tried using the various networkidle events, wait for some DOM element,
and I find myself just using 5 seconds or something like that as the most
reliable solution.

Is there a foolproof way of doing this? I feel like it should be way easier
and less hacky.

~~~
luckylion
You can't really know whether a site has fully loaded and is just sending
analytics ajax to the server or is querying some random service and will
trigger a navigation based on the response, whether there's a setTimeout
running somewhere that'll trigger a navigation etc. I've found a timeout to be
the safest bet. If onload has been fired and it hasn't jumped in the last X
seconds, it's probably going to stay at that URL.

Humans don't know either, they're just better at guessing based on visual
cues.
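That heuristic can be written down as a tiny pure predicate (illustrative, not
a Puppeteer API):

```javascript
// Sketch of the timeout heuristic: after `load` has fired, treat the page as
// settled once nothing (URL change, navigation) has happened for `quietMs`.
function isProbablySettled({ loadFired, lastActivityTs }, nowTs, quietMs) {
  return loadFired && nowTs - lastActivityTs >= quietMs;
}

console.log(isProbablySettled({ loadFired: true, lastActivityTs: 1000 }, 7000, 5000)); // true
console.log(isProbablySettled({ loadFired: true, lastActivityTs: 5000 }, 7000, 5000)); // false
```

The trade-off is baked into the name: it's only ever "probably" settled, and
the X seconds you choose is a bet on how long a delayed redirect might wait.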

~~~
ethbro
It's something of a UX question too.

One _could_ design a page that visually loaded, then jumped to a redirect
after 10 seconds on the page. But who would?

The primary approach is always event-based, because most pages do that sanely.

If not... the best approach I've found is looking for sentinel elements.

Essentially, something that only matches once the website is de facto loaded
(regardless of events). Sometimes it's a "search results found" bit of text,
sometimes a first element. But more or less, "How do I (a human) know when the
page is ready?"
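In Puppeteer terms the sentinel approach is usually `page.waitForSelector(...)`
with the element you picked; as a generic sketch, polling any check until it
passes looks like this (the helper name is mine):

```javascript
// Generic sentinel polling sketch. With Puppeteer the `check` would be
// something like `async () => await page.$('.search-results')`.
async function waitForSentinel(check, { timeoutMs = 10000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return true;      // sentinel appeared: page is "ready"
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('sentinel never appeared');
}
```

The timeout still exists, but only as a failure bound; the happy path returns
as soon as the human-visible "ready" cue shows up.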

~~~
luckylion
> But who would?

Affiliate networks (the shadier they are, the more likely they will), because
they are weird: they load third-party tracking beacons in transitional pages
and want to make extra sure that the beacons (which can also redirect multiple
times) have been loaded. To add to the fun, they're also adding random new
tracking domains (to avoid being blocked, I assume), so you can't even say
whether you expect some domain to be transitional or final to increase your
confidence in what you measure.

You're right though, looking for elements is a pretty good way if you know the
page you're checking. If you're going in blind, you can still look for things
they probably have (e.g. <nav>, <header>, <section> etc), but I haven't found
any that are reliably on a "real" page and reliably not on a redirect page.

~~~
ethbro
That's a use case I haven't encountered, nor considered!

Most of my work is making known transitions (e.g. page1 to page2) work
reliably, so I have the benefit of knowing the landing page structure.

If you're crawling pathological, _client-side_ redirect chains, maybe do
pattern-matching scans on loaded code for the full set of redirect methods?
There's only so many, and includes / doesn't-include seems a fair way to
bucket pages.
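A sketch of that includes / doesn't-include bucketing; the pattern list is
illustrative, not exhaustive:

```javascript
// Bucket a page's HTML by whether it contains a common client-side redirect
// mechanism. Regexes cover the usual suspects only - real pages find
// imaginative variants that slip past any fixed list.
const REDIRECT_PATTERNS = [
  /<meta[^>]+http-equiv=["']?refresh/i,                      // meta refresh tag
  /(?:self|window|top|document)\.location(?:\.href)?\s*=/i,  // location assignment
  /location\.(?:replace|assign)\s*\(/i,                      // location methods
];

function looksLikeRedirectPage(html) {
  return REDIRECT_PATTERNS.some((re) => re.test(html));
}

console.log(looksLikeRedirectPage(
  '<meta http-equiv="refresh" content="0; url=https://example.com/">'
)); // true
console.log(looksLikeRedirectPage('<nav>menu</nav><section>content</section>')); // false
```

As the thread notes, this breaks down once the redirect is buried in minified
or conditional JS, which is what pushes people back to a real headless browser.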

~~~
luckylion
Yeah, we had been doing that initially and found that there are lots of
imaginative ways to use e.g. refresh meta-tags that browsers do accept but we
did not (e.g. somebody might write `content="0.0; url=https://example.com/"`),
and more and more networks and agencies switching to JS-based redirects led to
a headless browser being easier in the end, despite dealing with these
specific issues.

A simple `self.location.href = ...` is still doable (-ish, because I've seen
conditional changes that were essentially `if (false) ...` to disable a
redirect, which we obviously didn't consider when pattern matching), but once
they include e.g. jQuery (and some do on a simple redirect page) it got far
too complicated.

------
ffpip
You might wanna switch to another analytics provider (temporarily) when
posting on HN, because 60%+ of users probably block it anyway, so you will
lose the insights, etc.

~~~
janOsch
I am asking out of curiosity: is it possible & legal to set up a simple proxy
server to redirect analytics data to the provider? So that the analytics
traffic goes to the same domain - and is not blocked?
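Technically it is just a reverse proxy. A minimal nginx sketch (paths and
upstream are illustrative, and whether a provider's terms permit this is a
separate question):

```nginx
# Hypothetical first-party path that forwards to the analytics host, so the
# script is served from your own domain instead of a commonly-blocked one.
location = /js/collect.js {
    proxy_pass https://www.google-analytics.com/analytics.js;
    proxy_set_header Host www.google-analytics.com;
    proxy_ssl_server_name on;   # send SNI for the upstream TLS handshake
}
```

The measurement hits would need a similar location block, and the page's
snippet has to reference the first-party URLs instead of the originals.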

~~~
spanhandler
Google probably doesn't want that happening, because they use the data to
inform decisions for all kinds of other things (search, ad sales, filling in
gaps in their creepy profiles of people, maybe even metrics-driven
machine-learning-for-web-design, who knows) and are more worried about website
owners feeding them mountains of fake data than about missing stats on some %
of ad-blocking users.

~~~
ffpip
Use uBlock Origin. It even blocks those sneaky CNAME-cloaked trackers

------
wprapido
Awesome! Would love to see something like that for Selenium

------
polskibus
What about Cypress though? How does it compare to Playwright? Is it going to
die now that MS has rolled out their own RPA?

~~~
tnolet
Cypress is an E2E testing solution with batteries included. Playwright is just
the browser automation part.

------
skywal_l
Chrome only...

~~~
rozenmd
"Playwright provides a set of APIs to automate Chromium, Firefox and WebKit
browsers."

~~~
skywal_l
Indeed, I missed that. So Playwright is definitely more interesting.

What is surprising is that the linked site is quoting a Playwright FAQ that
was removed a couple of months ago:
[https://github.com/microsoft/playwright/pull/1930/files](https://github.com/microsoft/playwright/pull/1930/files)

That FAQ seemed to give a lot of perspective on the subject. I wonder why it
was removed.

