
Creating an open-source solution to the headaches of headless browsers - mrskitch
https://sourcesort.com/interview/joel-griffith-browserless
======
mrskitch
(Author here): lots of comments ask why someone would pay for this. I think
the answer is simple if you step out of your developer shoes.

There's a lot of complexity in managing and even building the thing from the
start -- and then you have to support it. If you're working in a large org
then there's a chance that you can just DIY it, however for small-medium
businesses this isn't practical and it's a waste of time (their most precious
resource).

I like to think of it as a managed database. Sure, you can freely download
Postgres in a container and you're up and going, but there's a lot more cost
to it than just that. A fully-managed database saves you time and other
intangibles, so it can be worth the cost. It just depends on your
circumstances.

~~~
JustARandomGuy
Since you're reading this, a feature request: I would love it if you could put
up a REST endpoint for extracting all the images (example code here [1]) on a
web page, and more endpoints for extracting all the links, script addresses,
etc.

I was trying to do that on Browserless but couldn't get the final file
download to work (I adapted the Stack Overflow code linked below to put all
the web page's images into a ZIP file and download that). Presently I'm
running this on a Google Cloud Function, which works, but I'd rather
outsource it to you, especially since the function chokes on large web pages
(possibly it needs more RAM than the 2 GB limit currently available in GCF?).

[1]
[https://stackoverflow.com/a/52542490](https://stackoverflow.com/a/52542490)
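
The commenter's Node/GCF code isn't shown, but the zip-and-download step they describe can be sketched in Python (the `bundle_images` helper is hypothetical; fetching the image bytes is left to the caller):

```python
import io
import zipfile

def bundle_images(images):
    """Pack already-downloaded image bytes into an in-memory ZIP.

    `images` maps filename -> raw bytes; how you fetch them
    (puppeteer, an HTTP client, etc.) is out of scope here.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in images.items():
            zf.writestr(name, data)
    return buf.getvalue()

# Example with two fake "images":
archive = bundle_images({"a.png": b"\x89PNG...", "b.jpg": b"\xff\xd8..."})
```

On a memory-constrained function, streaming the archive to the response instead of buffering it whole would be the next step, but the in-memory version shows the shape.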

~~~
mrskitch
Just to follow up, you can use our /scrape API to do this:

curl -X POST \
  https://chrome.browserless.io/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://reddit.com/",
    "elements": [{ "selector": "img" }]
  }'

This will get all the <img> tags on the page and return their attributes
(which include their sources). If you want scripts as well, just add another
object to the elements array with a "selector" of "script".
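
For concreteness, here's that payload with the second selector added, plus a hypothetical helper for pulling out the `src` values; the exact response shape used below is an assumption, so verify it against what the API actually returns:

```python
import json

# Payload mirroring the curl example above, with a second selector
# added for <script> tags.
payload = {
    "url": "https://reddit.com/",
    "elements": [
        {"selector": "img"},
        {"selector": "script"},
    ],
}

def image_sources(response):
    """Pull `src` values out of a /scrape-style response.

    Assumes each entry under "data" carries a "results" list whose
    items hold an "attributes" list of {"name", "value"} pairs --
    adjust to the real response shape if it differs.
    """
    srcs = []
    for element in response.get("data", []):
        for result in element.get("results", []):
            for attr in result.get("attributes", []):
                if attr["name"] == "src":
                    srcs.append(attr["value"])
    return srcs

# The payload can then be POSTed with any HTTP client:
body = json.dumps(payload)
```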

------
pearjuice
What's with the snarky comments? The guy built a successful business with
happy customers. He didn't make it in 2% of his time; he built it over the
years, with experience from other projects, and solved an issue people are
willing to pay money for. Maintenance may take little time now, but that
doesn't mean it's easy to solve a problem and consistently ship the solution.
I honestly think most of you misunderstand serviced software and the demand
for it. The fact that you can do it with a Docker script and some time
doesn't mean it works properly for all use cases, or that everyone can and
should manage that infrastructure themselves. Some people happily pay so they
don't have to worry whether their instance is still running, and can call
support instead of firing up a terminal and digging through obscure Stack
Overflow answers. He even open-sourced his entire business. Instead of
throwing apples, try to learn something, or at least appreciate the effort.

~~~
mrskitch
I felt similar, thanks for echoing the sentiment.

------
iamEAP
I recently learned (by trial and error) some of the headaches associated with
running headless browsers at scale that Joel mentions here; wish I'd heard of
this service earlier. I ended up finding other solutions to fill in the gaps:
Puppeteer Cluster is one I'd recommend
([https://github.com/thomasdondorf/puppeteer-cluster](https://github.com/thomasdondorf/puppeteer-cluster))

I especially like the "host it yourself" commercial license model, here; while
automating browser _actions_ over a network works well enough, _detailed
scraping_ over a network can quickly become inefficient (as many requests for
elements or element attributes may incur individual round-trips). In some
cases, colocating your browser instance with your scraping logic becomes a
necessity.

~~~
mrskitch
We hear about puppeteer-cluster _a lot_, and we hear the same thing from folks
(that it's great). browserless.io essentially does "clustering" at the
infrastructure level, whereas puppeteer-cluster does it at the application
level.

Both essentially solve the same problem, just in different ways.

------
Thorrez
I'm confused about the license[1]. It seems to not be actually open source.
The Open Source Definition says[2]:

> The license must not restrict anyone from making use of the program in a
> specific field of endeavor. For example, it may not restrict the program
> from being used in a business, or from being used for genetic research.

But this seems to be doing that exact restriction.

Additionally the license seems like it contains a loophole:

>If you are creating an open source application under a license compatible
with the GNU GPL license v3, you may use browserless under the terms of the
GPLv3.

If I make an open source application, I can use browserless under the terms of
the GPLv3. That means I can redistribute browserless under the GPLv3. That
means people can take the browserless code I redistribute and use that for
commercial products (as long as they don't distribute a non-GPLv3 binary form
of the commercial products containing browserless, because that would break
the GPLv3).

[1]
[https://github.com/browserless/chrome#licensing](https://github.com/browserless/chrome#licensing)

[2] [https://opensource.org/osd#fields-of-endeavor](https://opensource.org/osd#fields-of-endeavor)

~~~
Reelin
Just checked the GitHub license page
([https://github.com/browserless/chrome/blob/master/LICENSE.md](https://github.com/browserless/chrome/blob/master/LICENSE.md)).

> This work is dual-licensed under GPL-3.0 OR the browserless commercial
> license. You can choose between one of them if you use this work.

So it's clearly GPLv3 (no loophole required), which AFAIK does allow
closed-source proprietary use within a company so long as the program isn't
redistributed externally (perhaps the developer didn't understand that?). It
seems the licensing section in the readme should have its wording adjusted
somewhat.

In fact, I think you're even in the clear to run a proprietary cloud service
using GPLv3 code which is why the AGPL (among others) exists. Some recent
drama ([https://techcrunch.com/2019/05/30/lack-of-leadership-in-open-source-results-in-source-available-licenses/](https://techcrunch.com/2019/05/30/lack-of-leadership-in-open-source-results-in-source-available-licenses/))
for reference.

(Oddly, the header underneath that states "GPL-3.0-or-later" which is a bit
inconsistent.)

------
jrockway
I wrote something like this a couple months ago and thought about selling it.
I ultimately decided that the price where you make it cost-prohibitive to mine
cryptocurrency is too high for someone that just wants to render PDFs without
dealing with the burstiness of running several copies of Chrome on their
production infrastructure. I was also concerned about the underlying browser
changing how things render when upgraded; I didn't want to run an outdated
browser, but I also didn't want to tell users "hey we updated Chrome, better
check the output of your batch job and make an emergency fix to your HTML".

How are you dealing with these issues?

~~~
RussianCow
How often do browsers break backwards compatibility? I don't think I can
recall a single time I've had working code break due to a browser upgrade,
with the exception of non-standard features (which is maybe what you're
referring to, but then it's a known risk and you should already be aware of
it).

Edit: And if a client really does need version X of Chrome, you could give
them the ability to pay extra to pin the version indefinitely.

------
hoten
I work on Lighthouse (user of the Chrome DevTools protocol) and work with the
DevTools team (obviously the main user of the protocol) - and when I first saw
browserless I was blown away. So cool! Good job with your success.

What was the hardest part re: working with the protocol?

~~~
mrskitch
The protocol is pretty easy; I think coordinating the necessary “enable”
calls is a bit cumbersome. Also the legacy JSON protocol is harder to
support, but I understand why.

The hardest part is debugging crashes and why they happened. You either get a
generic “Page crashed!” error (which I think is puppeteer's handler message)
or “browser disconnected!”. That, and Chrome's logs are crazy noisy; I
haven't gotten much out of them.

Those are probably the biggest, thanks for asking!

------
sneak
> _There's always been this thought that I've had, that advertising and
> "paid" attention is really in no one's best interest. You're likely to get
> users who really aren't going to get any value out of your software, so your
> churn increases, and you've also just paid for that user that's churned.
> These things get harder to tease out since it's almost impossible to ask
> "show me all the churned users this period acquired from advertising
> channels." Maybe that's possible, but you'd have to do a lot of wrangling
> together to get it all working._

Correct me if I’m wrong, but isn’t precisely that kind of analytics simply
_table stakes_ for any modern crm/marketing/customer intelligence suite in
2019? It seems like that is absolutely a solved problem.

~~~
mrskitch
Yeah, it's a pretty contrived example on my part. My sentiment is that there
are so many inputs to modeling behavior, and to finding signal in the noise,
that at this scale your time is likely better spent elsewhere. Unless you
have the revenue stream to do it and do it well, the effort can be a time
sink.

~~~
sneak
How many variables do you need to track, though? $X spend, Y signups at $Z
each, AA% churn, $BB retained MRR for an estimated $CC CLV based on an
$X/Y*(1-AA%) CAC - why does it need to be much more complicated than that when
you don’t have millions of users?

(Seriously though, I’m asking, not poking fun. You probably know more about
this stuff than I do, having actually done it. It seems really simple to
figure, to me. What am I missing?)
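
For concreteness, the back-of-the-envelope model above with made-up numbers (all figures illustrative; this reads the CAC term as spend divided by *retained* signups, which seems to be the intent of the (1-AA%) factor):

```python
# Illustrative numbers only, not from the thread.
spend = 1000.00   # $X: ad spend for the period
signups = 50      # Y: signups attributed to that spend
churn = 0.30      # AA: fraction of those signups that churn
price = 30.00     # monthly revenue per retained customer

retained = signups * (1 - churn)   # customers that stick around
cac = spend / retained             # cost per retained customer
retained_mrr = retained * price    # $BB: MRR kept after churn
# Naive CLV if `churn` is a monthly rate: expected lifetime is 1/churn months.
clv = price / churn                # $CC per retained customer
```

With these numbers: 35 retained customers, a CAC of about $28.57, $1,050 retained MRR, and a $100 naive CLV; the point of the downthread reply is that feeding *real* numbers into these slots is the hard part.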

~~~
matthiaswh
It's more a matter of being messy than complicated. You're trying to track
users who often visit your site multiple times before purchasing, coming
from different sources, on different devices, and somehow tie that
attribution to the purchase. None of the data inputs are consistent or tied
together. Analytics tells you one thing, your ad platform another, your
payment processor yet another, and then there's your email marketing tool and
your CRM. (And the serious tools for data monitoring and reporting have moved
to focusing on enterprises...) You have to somehow factor in refunds, free
trials, prorated billing, and early cancellations. You have multiple ad
campaigns running ad variations. Don't forget A/B testing.

Finally you see some data point that hints something might be working, but you
know you have to account for all the other factors involved. Did I make any
website edits that day? Did the ad network change their algorithm slightly?
Was there a holiday affecting traffic? When did I insert that new ad again?
Wait, I know I changed my ad bid at some point... Did I get an influx of
traffic from another source? Was it just a fluke?

If you want good, real data, it's messy. And far from a solved problem.

~~~
mrskitch
Wow, you said this so much better than I could, thanks for chiming in

------
jessaustin
Does this handle something like Distil [0]? Or is that type of scraping not
the focus of this product?

[0] [https://www.distilnetworks.com/block-bot-detection/](https://www.distilnetworks.com/block-bot-detection/)

~~~
stickfigure
No, it won't. However, Distil is not hard to work around if you automate a
real browser in headful mode.

~~~
jessaustin
Could you point me to some reasonably straightforward ways to do that? Thanks!

~~~
stickfigure
I don't think there's any HOWTO posted online; I just worked it out by trial
and error.

Use a real version of Chrome (not Chromium) and headful mode. Mask the
navigator.webdriver property. Pace your requests and take care to use "good"
IP addresses.

Keep in mind that as soon as Distil sees something obviously automated (like a
headless browser) the source IP address is "burned" for some number of days.
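
A rough Python sketch of two of those points, the webdriver mask and the request pacing; the selenium CDP call shown in the comment is one possible injection route, not a recipe from this thread:

```python
import random
import time

# JS to run before any page script: hides the navigator.webdriver
# flag that bot detectors commonly check first.
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""

def paced(min_s=4.0, max_s=9.0, sleep=time.sleep):
    """Sleep a human-ish random interval between requests.

    `sleep` is injectable so the pacing logic can be exercised
    without actually waiting.
    """
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay

# With selenium driving real (headful) Chrome, the snippet could be
# installed via the DevTools protocol, e.g.:
#   driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument",
#                          {"source": STEALTH_JS})
```

The "good IP addresses" and real-Chrome points have no code equivalent; they're operational choices.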

~~~
jessaustin
Thanks!

------
bobblywobbles
Looks interesting, however I can't view your webpage - I am getting this
error: "The character encoding of the plain text document was not declared.
The document will render with garbled text in some browser configurations if
the document contains characters from outside the US-ASCII range. The
character encoding of the file needs to be declared in the transfer protocol
or file needs to use a byte order mark as an encoding signature."

~~~
mrskitch
Interesting, I’ll take a look and see if our markup is encoded improperly.

------
encoderer
Congrats on the continued success. It can stun people when they are reminded
that the skills they sell to others can be applied for their own prosperity
and I think you see that here.

If you’re a developer with a day job there has never been a better time to get
started building and selling your own software.

It’s not glamorous but it is rewarding.

~~~
mrskitch
I spent ~4 hours on my 10th wedding anniversary debugging a production issue.
It's not a fun thing to talk about, and doesn't get a lot of attention, but
the truth of the matter is that when things are bad _they are bad_. I can see
now why folks say that this isn't for everyone.

------
jotto
For a similar headless Chrome project launched around the same time, but with
a price-per-api-request model, see
[https://www.prerender.cloud/](https://www.prerender.cloud/) (PDFs,
screenshots, pre-rendering). MRR is about the same.

~~~
trpc
Rendora is true FOSS, free and self-hosted with very lightweight usage of
resources.

[https://github.com/rendora/rendora](https://github.com/rendora/rendora)

~~~
kresten
Last commit 12 months ago.

[https://rendora.co/](https://rendora.co/) Seems to be gone.

------
rozenmd
Probably worth mentioning that's $24k MRR, not how much it costs...

~~~
NiekvdMaas
Still impressive for a one-man show who says he spends just 1% of his time on
it.

------
r_singh
I'm using chrome-aws-lambda on Lambda and it works like a dream. Luckily for
my use case I don't need images, fonts, etc.

There's also GCF for those on Google Cloud. I've used Browserless's trial and
felt the 2+ GB instances were kind of expensive, because they require
reservation, unlike Lambda where you get 400,000 GB-seconds and 1M requests
per month for free.

------
perl4ever
"How cool would it be if you could just fire up your browser, do the work you
want it to, and press a button and now it just magically does that someplace
for you without ever having to write code?"

Like, recording macros has been a thing forever, but how are you going to
magically generalize them, without _better-than-human_ AGI?

------
mrskitch
Thanks again for the questions. Please do email me if I happen to overlook
anything: joel at browserless dot io

------
ausjke
How is this different from Apify ([https://apify.com/](https://apify.com/))?
Apify seems able to do what browserless does, and it's also open source,
meaning you can self-host it freely.

------
msmithstubbs
Solo founder: I built a tool and it solves problems for people to the tune of
$24k/month

Almost everyone: That's great! Well done.

HN commenters: Pffft.

~~~
dang
That's not even close to a fair summary of this thread.

------
MuffinFlavored
What $288k/yr headache is there around `docker pull buildkite/puppeteer`?

[https://hub.docker.com/r/buildkite/puppeteer](https://hub.docker.com/r/buildkite/puppeteer)

~~~
xwdv
Yea, I’m thinking about making a company named after some kind of low-hanging
fruit and building out a bunch of these trivial little use cases that keep
popping up, putting them all behind a subscription model.

You’ll probably see me with an article in 6 months about how I’m making
$150k/mo for about 2% of my time.

~~~
pearjuice
I have set a calendar item for 6 months from now. Looking forward to your blog
post.

~~~
xwdv
Feeling the pressure now.

------
pmarreck
Scripting headless browsers for testing is an antipattern and should literally
be avoided as much as humanly possible

~~~
RussianCow
Care to elaborate?

~~~
pmarreck
I have never had a good experience incorporating a headless browser test into
a test suite. In literally every case it added so much complexity, suite run
time and uncertainty that I realized it was better to just do the test via
unit tests (which test the logic directly) and/or integration tests (which
test the HTTP output of the controllers) if it was at all possible to rework
the logic to operate in that fashion.

1) Increased load times and test run times due to browser complexity and
memory consumption

2) Impossible to run concurrently without additional instances, each of which
takes up massive memory

3) Tests are slow and often nondeterministic (literally THE WORST property a
test suite can have), with many cases where things like "sleep()" delays are
put in to circumvent some opaque browser latency issue, which is just gross

4) Even after suffering all of the above, you're still only testing ONE engine
(say, WebKit) instead of all of the popular ones (Blink (Chrome), Gecko
(Firefox), EdgeHTML (eh, Blink, I guess, now?), etc)

I did not enumerate all the disadvantages, but these should be enough to
support my position. The number of browser driver driven tests in your test
suite should be as close to "zero" as possible.

Does this discourage the use of SPAs? A lot, yes. But when necessary, I
manage to run a separate frontend JS test suite via jsdom, which doesn't
require firing up a headless browser; my build process runs both the frontend
and backend suites and only deploys if both pass.

~~~
RussianCow
Thanks! That's interesting, because at work we're struggling with browser-
specific regressions and were looking at headless browser testing to help
solve that. I agree with all the drawbacks you listed (except #4, since some
headless browser solutions let you use multiple browsers), but unit tests
don't do anything to help with differences in browsers. Do you just do manual
QA and hope for the best? Or is this not as big an issue for you as it is for
our company? (We still have to support IE 11, so that's where the majority of
browser-specific issues manifest.)

~~~
pmarreck
The “surface” of my sites is usually small enough to just check visually by
hand. I get that it might be necessary in some cases; note that I did say
“as close to zero as possible” and not simply “zero”. My criticism is that
it's such a Rube Goldberg-esque approach to testing that its use should be
minimized. It would be great if all browsers had to pass some standardized
spec before being considered viable; the lack of one is the source of the
cross-browser nondeterminism, IMHO.

