There's a lot of complexity in managing and even building the thing from the start -- and then you have to support it. If you're working in a large org, there's a chance you can just DIY it; for small-to-medium businesses, though, that isn't practical and it's a waste of time (their most precious resource).
I like to think of it as a managed database. Sure, you can freely download Postgres in a container and you're up and going, but there are a lot more costs to it than just that. Having a fully-managed database saves you time and other intangibles, which can make it worth the cost. It just depends on your circumstances.
These managed solutions can provide value for large companies, especially ones that fall into a DIY trap.
Solve problems with computers.
The link is showing up as https://sourcesort.com/interview/browserless.io instead of https://browserless.io
I was trying to do that on Browserless but couldn't get the final file download to work (I adapted Stack Overflow code linked below to put all the web page's images into a ZIP file and download that). Presently I'm running this on a Google Cloud Function, which works, but I'd rather outsource it to you, especially since the function chokes on large web pages (possibly it needs more RAM than the 2GB limit currently available in GCF?).
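For reference, the ZIP step itself is small once the image bytes are in hand. Here's a minimal sketch in Python (the fetching is omitted, and `images` is a hypothetical dict of filename to raw bytes, not the actual code I'm running):

```python
import io
import zipfile

def zip_images(images):
    """Pack filename -> bytes pairs into an in-memory ZIP archive.

    `images` is assumed to be a dict of filename -> raw image bytes
    (fetching them is done separately, e.g. via a headless browser).
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in images.items():
            zf.writestr(name, data)
    return buf.getvalue()  # bytes suitable for an HTTP response body
```

The catch on Cloud Functions is that everything is held in memory at once, which is presumably why large pages blow past the 2GB limit.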
curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com/", "elements": [{"selector": "img"}]}' \
  https://chrome.browserless.io/scrape
This will get all the <img> tags on a page and return their attributes (which include their sources). If you wanted scripts as well, just add a new object to the elements array with a "selector" of "script".
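If it helps, pulling just the URLs out of that response only takes a few lines. The response shape below is illustrative rather than the exact schema (check the API docs for the real one); I'm assuming each matched element exposes its attributes as name/value pairs:

```python
def image_sources(scrape_response):
    """Collect src values from a /scrape-style response.

    The shape assumed here is illustrative, not the exact schema:
    each entry in `data` has a selector plus matched `results`, and
    each result exposes its attributes as name/value pairs.
    """
    sources = []
    for entry in scrape_response.get("data", []):
        for result in entry.get("results", []):
            for attr in result.get("attributes", []):
                if attr.get("name") == "src":
                    sources.append(attr["value"])
    return sources
```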
For instance, if you're already programming and working 8+ hours a day in your home, the last thing you want to do is more of it. One of the biggest things you'll want to do is get out more often since you're home a lot, and staying home to work on a side-project doesn't sound awfully appealing.
The web app we are testing right now (SaaS for architects) makes heavy use of canvas elements and can't be tested headless.
Just as a bit of feedback, on the linked site you write:
>Browserless is simply a tiny web-server that “productionalizes” all the stuff about headless browsers and their automation capabilities.
I find that both hard to parse (productionalizes? wat) and saying nothing. How about "a tiny web-server that helps you use headless browsers without the hassle of setting them up on your environment"
The only downside to using a lot of IaaS-type stuff like this is that almost every month something goes down (GitHub or Azure or AppVeyor, etc.), and this is another thing that could go down, even for a few minutes, and basically the CI is stuffed, or Pingdom goes haywire. But this is a more general problem with cloudifying all the things, not with this particular service.
When thinking about the viability of a business, developers commonly make the mistake of assuming that nobody would pay for something that's possible for them to code up themselves.
Sure, many of us write code all day and love it. But most people have other responsibilities, value their time highly, and (correctly) prefer to pay reasonable amounts of money so they can spend their time doing more important or profitable things.
I especially like the "host it yourself" commercial license model, here; while automating browser _actions_ over a network works well enough, _detailed scraping_ over a network can quickly become inefficient (as many requests for elements or element attributes may incur individual round-trips). In some cases, colocating your browser instance with your scraping logic becomes a necessity.
Both essentially solve the same problem, just in different ways.
> The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.
But this seems to be doing that exact restriction.
Additionally the license seems like it contains a loophole:
>If you are creating an open source application under a license compatible with the GNU GPL license v3, you may use browserless under the terms of the GPLv3.
If I make an open source application, I can use browserless under the terms of the GPLv3. That means I can redistribute browserless under the GPLv3. That means people can take the browserless code I redistribute and use that for commercial products (as long as they don't distribute a non-GPLv3 binary form of the commercial products containing browserless, because that would break the GPLv3).
> This work is dual-licensed under GPL-3.0 OR the browserless commercial license. You can choose between one of them if you use this work.
So it's clearly GPLv3 (no loophole required), which AFAIK does allow closed-source proprietary use within a company so long as the program isn't redistributed externally (perhaps the developer didn't understand that?). It seems the licensing section in the readme should have its wording adjusted somewhat.
In fact, I think you're even in the clear to run a proprietary cloud service using GPLv3 code which is why the AGPL (among others) exists. Some recent drama (https://techcrunch.com/2019/05/30/lack-of-leadership-in-open...) for reference.
(Oddly, the header underneath that states "GPL-3.0-or-later" which is a bit inconsistent.)
How are you dealing with these issues?
Edit: And if a client really does need version X of Chrome, you could give them the ability to pay extra to pin the version indefinitely.
What was the hardest part re: working with the protocol?
The hardest part is debugging crash issues and why they happened. You either just get a generic “Page crashed!” error (which I think is puppeteer's handler message) or “browser disconnected!”. That, and Chrome's logs are just crazy noisy; I haven't gotten a lot out of them.
Those are probably the biggest, thanks for asking!
Correct me if I’m wrong, but isn’t precisely that kind of analytics simply table stakes for any modern CRM/marketing/customer-intelligence suite in 2019? It seems like that is absolutely a solved problem.
(Seriously though, I’m asking, not poking fun. You probably know more about this stuff than I do, having actually done it. It seems really simple to figure out, to me. What am I missing?)
Finally you see some data point that hints something might be working, but you know you have to account for all the other factors involved. Did I make any website edits that day? Did the ad network change their algorithm slightly? Was there a holiday affecting traffic? When did I insert that new ad again? Wait, I know I changed my ad bid at some point... Did I get an influx of traffic from another source? Was it just a fluke?
If you want good, real data, it's messy. And far from a solved problem.
For instance: we had a week last year with a flood of cancellations at once, and there was nothing I could attribute it to. Looking back it was just a coincidence, but it consumed a lot of my time (writing emails and looking at analytics) to figure out why instead of just pushing ahead.
I was likely over-reacting, but I have noticed there are a lot of people spending a lot of time doing analytics and researching trends instead of just executing. And, especially early on, you should just be executing and not thinking too much about trends.
Use a real version of Chrome (not Chromium) and headful mode. Mask the navigator.webdriver property. Pace your requests and take care to use "good" IP addresses.
Keep in mind that as soon as Distil sees something obviously automated (like a headless browser) the source IP address is "burned" for some number of days.
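"Pace your requests" can be as simple as enforcing a minimum, jittered gap between hits from a given IP. A rough sketch (the interval numbers here are invented; tune them per target):

```python
import random
import time

class Pacer:
    """Enforce a minimum (jittered) gap between requests.

    `sleep` and `clock` are injectable so the pacing logic is easy
    to test without real waiting.
    """

    def __init__(self, min_interval=5.0, jitter=2.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.min_interval = min_interval
        self.jitter = jitter
        self._sleep = sleep
        self._clock = clock
        self._last = None

    def wait(self):
        """Block until enough time has passed since the last request."""
        now = self._clock()
        if self._last is not None:
            gap = self.min_interval + random.uniform(0, self.jitter)
            remaining = self._last + gap - now
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Call `pacer.wait()` right before each navigation; the jitter keeps the traffic from looking metronomic, which is its own tell.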
If you’re a developer with a day job there has never been a better time to get started building and selling your own software.
It’s not glamorous but it is rewarding.
https://rendora.co/ seems to be gone.
There's also GCF for those on Google Cloud. I have used Browserless' trial and felt like the 2+ GB instances were kind of expensive because they require reservation, unlike Lambda where you get 400,000 GB-seconds and 1M requests per month for free.
Like, recording macros has been a thing forever, but how are you going to magically generalize them, without better-than-human AGI?
Almost everyone: That's great! Well done.
HN commenters: Pffft.
You are somewhat new so maybe you don't know that it's not really the way people communicate around here. Please check out the guidelines in the footer of the website.
Maybe examine your feelings, is it jealousy or envy or something?
- We support selenium with the same version of Chrome as puppeteer's. Everything is versioned together.
- Queueing/Concurrency and notifications. Arguably done in other efforts, but works.
- Numerous other APIs built on top of puppeteer to do core cases. No need to write your own integration.
- Monitoring and more.
People do pay for this in order to not think about the burden of supporting it. Managed databases are a thing, even though they are freely available to download and run.
Now, I won’t write another rant on the subject of Eternal September, and how our field is increasingly populated by charlatans who can’t do simple things on their own. But yes: that’s exactly how I feel.
Did you work out (time it took you to complete this trivial task + time to document it, version it, explain it to coworkers, and maintain it next time Chrome decides to break something) * (your $/h cost) / $0.00008 before doing that?
(This can be significantly larger for the purchased solution than an internal one as well.)
And fwiw I do run my own puppeteer cluster because it's economically advantageous at our scale.
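For what it's worth, plugging made-up numbers into that break-even math (say 3 hours all-in at $75/h, against the $0.00008 unit price quoted above) shows how big the count gets before DIY pays off:

```python
# All three inputs are assumptions for illustration, not real figures.
hours_spent = 3          # build + document + explain + maintain
hourly_cost = 75.0       # your fully-loaded $/h
unit_price = 0.00008     # $ per unit, from the comment above

break_even_units = (hours_spent * hourly_cost) / unit_price
print(round(break_even_units))  # units needed before rolling your own wins
```

At that price you need millions of units of usage before the in-house version is cheaper, which is exactly why scale is the deciding factor.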
You’ll probably see me with an article in 6 months about how I’m making $150k/mo for about 2% of my time.
1) Increased load times and test run times due to browser complexity and memory consumption
2) Impossible to run concurrently without additional instances, each of which takes up massive memory
3) Tests are slow and often nondeterministic (literally THE WORST property a test suite can have), with many cases where things like "sleep()" delays are put in to circumvent some opaque browser latency issue, which is just gross
4) Even after suffering all of the above, you're still only testing ONE engine (say, WebKit) instead of all of the popular ones (Blink (Chrome), Gecko (Firefox), EdgeHTML (eh, Blink, I guess, now?), etc)
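On point 3: the usual cure for those "sleep()" hacks is to poll for the condition you actually care about under a hard timeout, so the test returns the moment the condition holds instead of always paying the worst-case delay. A rough sketch (the names are mine, not from any particular framework):

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.05,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns truthy or `timeout` elapses.

    Returns True on success, False on timeout. Unlike a fixed sleep(),
    this finishes as soon as the condition holds, and it fails loudly
    (False) instead of silently racing when the page is slow.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if predicate():
            return True
        sleep(interval)
    return predicate()  # one last check at the deadline
```

Most browser drivers ship an equivalent (explicit waits); the point is that the timeout is an upper bound, not a built-in delay on every run.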
I did not enumerate all the disadvantages, but these should be enough to support my position. The number of browser driver driven tests in your test suite should be as close to "zero" as possible.
Does this discourage the use of SPAs? A lot, yes. But when necessary, I manage to do a separate frontend JS test suite via jsdom, which doesn't require firing up a headless browser, and my build process runs both the frontend and backend test suites and only deploys if they both pass.