Headless Chrome support in Cloud Functions and App Engine (cloud.google.com)
333 points by idoco on Aug 19, 2018 | 79 comments



I wrote a collection of Dockerfiles for images running Python 2.7 or Python 3.6 + Selenium with either Chrome or Firefox and using Xvfb for the X display (necessary for running Selenium headlessly).

https://github.com/seanpianka/docker-python-xvfb-selenium-ch...

Using this, in conjunction with AWS Step Functions, Lambda, and ECS, it became merely cents a month to run a headless scraper task in the cloud.


Can you elaborate a bit? Sounds interesting. I had never heard of AWS Step Functions before.

What does your workflow look like?


Not OP, but to elaborate on AWS Step Functions:

In short - this gives you the ability to pass the output of one lambda function to the input of another lambda function.

An example is one that I've written to regularly create a new copy of our Production RDS database in Ireland as a Staging RDS database in Oregon.

1. Cloudwatch Event starts Step Function on the 15th

2. Copy last Production snapshot from Ireland to Oregon

3. Restore this snapshot as a new RDS instance (It will fail until the snapshot is available and retry with exponential backoff - this is a step function feature)

4. In parallel:

  - Add tags to the instance (Once it's available)

  - Delete the snapshot copy (When finished restoring)

  - Modify the new instance with security groups and subnets

     - In parallel:

        - Run a SQL query to anonymize all of PII columns for GDPR compliance as data has now left the EU.

        - Call out to the Cloudflare API to update our DNS entry with the new RDS endpoint.

           - Delete the old Staging database instance
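For anyone who hasn't seen one, a workflow like the steps above is written as a JSON state machine definition. Here is a minimal sketch of the first few steps, built as a Python dict. The ARN prefix, the function names, and the retry numbers are all made up for illustration, and the nested parallel steps are left out; only the `Task`, `Parallel`, and `Retry` fields are standard Amazon States Language.

```python
import json

# Hypothetical Lambda ARN prefix; the real function names are not given in the thread.
ARN = "arn:aws:lambda:us-west-2:123456789012:function:"


def task(function_name, next_state=None, retry=None):
    """Build a Task state that invokes one Lambda function."""
    state = {"Type": "Task", "Resource": ARN + function_name}
    if retry:
        state["Retry"] = retry
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def branch(function_name):
    """Build a single-step branch for use inside a Parallel state."""
    return {"StartAt": function_name, "States": {function_name: task(function_name)}}


# Retry with exponential backoff: wait 60s, then 120s, 240s, and so on.
# The interval and attempt count are illustrative, not the commenter's values.
RETRY_UNTIL_READY = [{
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 60,
    "BackoffRate": 2.0,
    "MaxAttempts": 10,
}]

state_machine = {
    "StartAt": "CopySnapshot",
    "States": {
        "CopySnapshot": task("CopySnapshot", next_state="RestoreSnapshot"),
        "RestoreSnapshot": task(
            "RestoreSnapshot", next_state="PostRestore", retry=RETRY_UNTIL_READY
        ),
        "PostRestore": {
            "Type": "Parallel",
            "Branches": [
                branch("AddTags"),
                branch("DeleteSnapshotCopy"),
                branch("ModifyInstance"),
            ],
            "End": True,
        },
    },
}

# This JSON string is what would be uploaded as the state machine definition.
definition = json.dumps(state_machine, indent=2)
```

The output of each state becomes the input of the next, which is the "pass the output of one lambda to another" part.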


Do you run into problems with Lambda's 5-minute maximum execution time for those kinds of operations? I'd like to do something similar to this for both RDS and DynamoDB, but the execution time will often surpass 5 minutes, meaning I'd have to run a Step Functions worker on EC2 or ECS. That opens up a whole bunch of complexity with managing the worker code and its deployment, which I'd rather avoid if possible.


With the current implementation; no problems hitting the limit. As mentioned in my below comment, our query for anonymization would be the heaviest - but it's designed to be quick as we don't care about unique values for most data.

If we did though - Fargate is a great solution for it, but you wouldn't be able to feed data back into the next step without some additional complexity. Maybe have the next step poll an SQS queue, an S3 file, or a database entry for the next bit of data it needs, and just fail until it finds it. Once the Fargate task (or whatever) has done its job and placed the result in your method of choice, the step could continue.
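A rough sketch of that "fail until it finds it" step, assuming the state machine has a Retry policy on it. The `fetch_result` callable and the `job_id` field are hypothetical; in practice the lookup would hit S3, SQS, or a database row.

```python
class NotReadyYet(Exception):
    """Raised so the Step Functions Retry policy re-runs this step later."""


def make_handler(fetch_result):
    """Wrap a lookup function into a Lambda-style handler.

    fetch_result is a hypothetical callable that returns the stored result
    for a job id, or None if the long-running task has not finished yet.
    """
    def handler(event, context):
        result = fetch_result(event["job_id"])
        if result is None:
            # Failing here is deliberate: the state machine retries with backoff.
            raise NotReadyYet("no result yet for job " + event["job_id"])
        return {"job_id": event["job_id"], "result": result}
    return handler


# Usage with an in-memory dict standing in for the real store:
store = {}
handler = make_handler(store.get)
```

The handler keeps raising until the long-running task writes its result, then returns it as the input for the following state.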


> Run a SQL query to anonymize all of PII columns for GDPR compliance as data has now left the EU.

Do you ensure the values you replace with 'make sense' in the context of the application? i.e., are names turned into fake names?

If so, I would love to hear more about how you handle the complexities of this. If not, it's still a wonderful pipeline that I'm putting in my ideas box, thanks for sharing.


Nope! I probably could with the Faker library; but we don't care about that - and to do so would be a much heavier query. My query looks like this, so it runs extremely fast and isn't an issue on Lambda:

  `UPDATE Users SET FirstName = 'FAKEFIRSTNAME', LastName = 'FAKELASTNAME', StreetAddress = '123 FAKE ST.', Zip = '10001', PrimaryEmail = Cast(NewId() as varchar(36)) + '@x.com', Phone = '555-555-5555'`


I tried to use Headless Chrome on Cloud Functions for a project I'm working on, but even on the fastest instances loading pages was sometimes really slow (pages timing out after waiting for 60s).

It seems sometimes JS execution was taking a long time, so I guess that was preventing requests from being made. In a single CPU cloud function you have network requests, JavaScript execution, rendering, and the Node process controlling the browser all competing for resources.

That being said, it was super simple to get started!


Do simple hello-world HTML pages render ok? I found that rendering was slow-ish but totally acceptable for reasonably heavy HTML pages, so long as we weren't flooding page re-renders (eg by using React without any optimisation of render calls)


Yes those were fine.


How many requests per minute were you running? I was planning on using Headless Chrome on Cloud Functions to take in some arbitrary HTML and output it to a PDF. In my case I can’t see >2 rpm happening anytime soon.


Have you seen https://docs.browserless.io/docs/pdf.html? It’s in a docker image too if that’s all you need.


Just saw your other comment! The /function call looks interesting. Would I be able to use puppeteer’s request intercept API with it?

A bit more on my use case: currently using pdf.js for rendering HTML reports as a PDF. It’s been a pain. I’d like to essentially take the HTML of some (say, just a <table>) or all (full HTML doc) of the current page and send it off. The main gotcha I can think of is relative URLs but intercept request would resolve that.


Yes: anything available on the `page` object is available in the /function route, so request interception is available!


looks like Google claims another victim....I don't see the point of using browserless


Among other things with puppeteer we do screenshot generation using GKE on Google Cloud @ https://screenshots.cloud/ scaling up and down running instances depending on demand. We keep browser instances running constantly as the startup time is significant. I will be interested to see what the startup time is for puppeteer on this, will definitely be giving it a try.


Nice reference immediately above "used by us" :)

One completely unrelated thing. On Chrome 68.0.3440.84, I noticed the large icons (particularly the Kubernetes one) looked "weird", with jagged edges that didn't make any sense. Some poking with the devtools revealed that 'backface-visibility: hidden' seems to be disabling antialiasing.

Suggest opening the following in new tabs so you can flip back and forth between them:

- As is right now: https://i.imgur.com/nYzsukI.jpg

- Nicer-looking: https://i.imgur.com/GNlvx7Z.jpg

I noticed disabling this has an effect on the animation at the top (the edges of the moving webpage slides don't have constantly-moving jaggies).

There may well be a valid reason you have this enabled, perhaps for added performance. Or perhaps React added it in for you? :P


Thanks for spending time to let me know because I don't think I would have noticed it otherwise! I can't see it on retina but I can on my non retina display. We'll have to make the CSS rule more specific. As to why it's there I believe Firefox 57 or around that version had an issue with the sliding animation on the top of the page causing images to tear or not render at all when they scrolled in. This bug must have been solved recently because disabling backface-visibility on the image doesn't cause the same tearing.


I was very curious what was causing the non-antialiasing, it was fun.

And you can repro :) cool. Makes a lot of sense you can't see it on retina.

Interesting FF bug you hit. CSS3 GPU-accelerated animations are incredibly complex... heh, adding the rule fixed Firefox ~57, but now Chrome 68 is glitching out because the rule is there. I wonder if Google realizes yet. Ponders complexity of creating minimal testcase, versus waiting for someone else to notice :P


Nice to see this concept get into the big cloud platforms. We built something similar couple of years ago, primarily to get a sandbox for some compute jobs we were running on Heroku:

https://github.com/flowhub/jsjob


Does this mean that HC is preinstalled in the runtime?

Because as far as I know you can already run HC with other FaaS solutions, but having this out of the box would be really nice.


Running Headless Chrome / Chromium is a bit of a hassle on AWS Lambda and other FaaS providers. Chrome requires some specific bindings/binaries to work. I think the Chrome guys and girls convinced their Cloud coworkers to provide these in the underlying Linux machines that run the Cloud Functions.


exactly this. The base operating system comes with the system libraries necessary to support headless chrome out of the box.

(disclaimer: I work for Google Cloud)


Any chance of also supporting Firefox?


What's to really stop other providers from doing the same thing though? :)


Nothing. It's about developer convenience, not technical possibility.


Project priorities, I guess.


It actually is easier than people think, after chrome added the ability to disable shm (which isn't available on lambda) you can statically compile it into a single binary and run it with chromedriver & capybara easily.

Unfortunately Google Cloud Functions don't natively support golang yet so for my business this is a non-starter.


I just found this

https://github.com/adieuadieu/serverless-chrome/blob/master/...

Don’t know if it works. Going to try this week.


Seconded. It works great. I've been using this in a lambda web collection project for the past year. Initial load times for headless chrome can be a bit slow, but iterative uses after it has launched are pretty fast.


It works and quite well, it's been pretty much set-and-forget for me for quite some months now.


If Google Cloud ain’t your jam then check out browserless (https://browserless.io/). It can be considerably cheaper under certain situations, and we’ve been up and running for almost a year. Happy to answer questions if anyone has any.

EDIT: We’ve got stuff on GH: https://github.com/joelgriffith/browserless, and startup is under 100ms most of the time. Fonts and other things “just work” as well, plus there’s a slew of REST APIs for common stuff as well. Selenium webdriver support landing soon!


...and if browserless ain't your jam, check out https://www.prerender.cloud/

cheap, because we optimized for just 3 things:

pre-rendering, screenshots, or PDFs for $0.000365 per API request

   curl https://service.prerender.cloud/screenshot/https://google.com/ > out.jpg

   curl https://service.prerender.cloud/pdf/https://google.com/ > out.pdf

   curl https://service.prerender.cloud/https://google.com/ > out.html


Pricing is in tiers, not per request as you seem to imply?


Correct, only the final tier is variable rate; apologies for the confusion.

Under 20,000 monthly requests = $9 flat rate ($0.00045)

Under 100,000 monthly requests = $40 flat rate ($0.0004)

>= 100,000 monthly requests = variable rate @ $0.00036/req
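To see where the per-request figures in parentheses come from, here is the arithmetic as a small sketch. The tier prices are the ones quoted above; the 1,000-request month is a made-up example to show how a flat rate behaves under light use.

```python
def per_request(flat_price, requests):
    """Effective cost of one request when paying a flat monthly price."""
    return flat_price / requests


# Flat-rate tiers used right up to their caps (the best case for each tier).
small_tier_floor = per_request(9.00, 20_000)     # about $0.00045
large_tier_floor = per_request(40.00, 100_000)   # about $0.0004

# Hypothetical light month on the $9 tier: the flat fee is spread over fewer requests.
light_month = per_request(9.00, 1_000)           # about $0.009
```

So the parenthesized rates are floors that only apply when a tier is fully used.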


Yeah, that's completely different than what you were trying to imply. Still a cool service, though.


note: prerender.cloud just keeps a pool of instances running, so there is no startup latency, compared to Google Cloud functions


You’ve got a minor typo on your marketing site: http://s.rnbk.org/Photo-2018-08-20-05-41-P8447NLVuohO3iFOLhK... (They)


Ouch, thanks for the heads up!


I wish Google would support Ruby!

If you do too, check out the petition over at https://www.serverless-ruby.org


Something somewhat related: I think more Cloud providers need to start doing Docker-functions. It’s always going to be a waiting game for runtimes and upgrades. Docker is portable and can run just about anything, so why not support that?

https://zeit.co/blog/serverless-docker is an example of what I mean


See the "Serverless Containers" section on https://www.google.com/amp/s/gweb-cloudblog-publish.appspot....

It's still in alpha, so Google is not advertising it wide. It was announced about a week before Zeit's Serverless docker at GCP Next.

Disclaimer: I work at Google and I've had some involvement in the tech behind this.


That's an interesting domain: 'http://gweb-cloudblog-publish.appspot.com/'

(it just redirects to cloud.google.com)


Google Cloud has Docker based Cloud Functions ("Serverless Containers") in private alpha at the moment. More info at https://cloud.google.com/blog/products/gcp/cloud-functions-s..., and https://youtu.be/Y1sRy0Q2qig?t=33m54s.


Thanks @mrskitch, @hugelgupf, and @dantiberian. The Docker-based Cloud Functions look like the best long-term solution for supporting multiple languages.


What does this have to do with headless Chrome?


I think it would be pretty cool to use Cloud Functions as a Selenium Grid, sort of like Zalenium (https://github.com/zalando/zalenium) does with Kubernetes. If you could parallelize end-to-end tests enough, you could get massive burstable capacity to run parallel tests.


Not perfect but HtmlUnit anyone? I used it for scraping in the past with mixed experiences.


HtmlUnit API is really cool. It's fine for most use cases. But obviously the Javascript support is not perfect. [Shameless plug] I wrote a book about web scraping where I talk about HtmlUnit & headless chrome with Java: https://www.javawebscrapinghandbook.com


I've been screwing around with running Headless Chrome & Puppeteer on Lambda/Serverless/FaaS solutions. It's all a bit of a mixed bag. You CAN run Headless Chrome on AWS Lambda, but the cost involved is pretty crazy as you need ~1500 MB of RAM to comfortably run any code with Chrome.

Google Cloud of course has "inside knowledge" and I would love to switch to them for my SaaS https://checklyhq.com, were it not that Google Cloud Functions is just offered in four (!) regions...


Great to see you here Tim, love that chrome extension! We should chat a bit more sometime. I’d love to back checkly’s infrastructure.


that's it...im moving to GCP

sorry but Rekognition rekt it for any type of computer vision on AWS.

Great infrastructure...after all I do have an AWS Solution Architect Associate certification....which means jack shit

Great move by GCP, I'm also very pleased with Firebase and its integration with cloud functions....

BUT my biggest reservation still in 2018 when it comes to serverless is the cold start up time...

I built a token based API on AWS Lambda and registering, signing up took forever when the app was not at peak. that was 2014 tho.


We use GCP heavily. It's really great.

Latency between Google Cloud Functions and other GCP products improved significantly in the past week, as did start up and execution time.

However we only use GCFs for background tasks and not for any web API/micro services, etc. Still too much latency for that use case.


The Cold Start of AWS lambda is better now. You can get under 1000ms. However, Google Function needs to improve the Cold start.


Ok, I was just looking into this 1 week ago and was gonna spin up a VM to do headless. Now I get to keep my firebase project in cloud functions only. Much cleaner architecture.


My company provides a similar service, with both Chrome and FireFox headless support for Automated Testing/Screenshots: https://testingbot.com/support/getting-started/headless.html

We run each test in a new VM, running on our own private cloud (dedicated servers).

Note: we use the Selenium protocol for this, not yet Puppeteer.


I'm currently using a headless Chrome for my latest project www.blockedby.com (still in alpha stage, looking for feedback)

I've been looking at a non-local solution. I'm using Python and this article hints that Puppeteer is not the only way to invoke this. But I don't see any documentation on the DevTools protocol.

Does anyone know if it's supported? Or any providers that do?


You should just need to get a handle on the host and port to use Devtools protocol to talk to the launched headless chrome instance. If you use node.js to start it, you can get the port like this: https://github.com/GoogleChrome/puppeteer/blob/master/docs/a...

Then you can use PyChromeDevtools to connect to that host/port:

https://github.com/marty90/PyChromeDevTools/blob/master/READ...
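If it helps, here is a minimal stdlib-only sketch of the host/port discovery step, assuming Chrome was started with `--remote-debugging-port` and is reachable from your code. Chrome serves a small JSON document at `/json/version` that includes the WebSocket URL a DevTools client connects to. The default host and port below are assumptions, and this does not use PyChromeDevTools itself.

```python
import json
from urllib.request import urlopen


def version_url(host, port):
    """HTTP endpoint Chrome serves when started with --remote-debugging-port."""
    return f"http://{host}:{port}/json/version"


def websocket_url(version_json):
    """Pull the browser-level WebSocket URL out of the /json/version response."""
    return json.loads(version_json)["webSocketDebuggerUrl"]


def discover(host="localhost", port=9222):
    """Ask a running Chrome for its DevTools WebSocket URL (needs a live browser)."""
    with urlopen(version_url(host, port)) as response:
        return websocket_url(response.read().decode("utf-8"))
```

Calling `discover()` only works while a Chrome instance is listening; the two helper functions can be used on their own if you already have the JSON.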


I'm not sure if this is what you are looking for, but I have been using headless Chrome through Selenium (using nightwatch.js) for some time now and it works great. You will probably need to rewrite your tests though, but they will support FF too. YMMV.


I’ve tested HC on GCF and GAE standard with the IO launch. Sadly they’re 3-10x slower than App Engine Flex (same vm size on App Engine Flex vs Standard). Even a screenshot of google.com takes 6+ seconds on GCF / GAE Standard vs 2 seconds for Flex. I hope they fix this as spinning to zero is important for me but the latency is too high right now.


I don’t know what this means for browserless.io but I hope he still retains a strong niche and has a desirable product.


Tried HC just yesterday on cloud functions. For some reason, it runs extremely slow. Did some comparisons to AWS lambda with similar memory / cpu sizes and basic “load page and screenshot” jobs would take 2x - 3x more time on google cloud functions.

I’ll dig deeper soon but this is a bad start.


Is it still within say 5 seconds?

Do you get the impression there’s some config tuning to do or something you can’t control?


Can someone please explain what the advantage of running a snapshot service via GCF would be vs. AppEngine Standard (w/ node)?


Lower costs. With AppEngine you pay for a whole instance, regardless of how much utilisation it gets. With cloud functions you only pay for the time the function is being executed. If your code isn't being executed often, GCF are much cheaper.


Oh I see... AppEngine is also "pay for what you use" but it seems it's rounded to 15 mins rather than CloudFunctions which is 100ms. Thanks!
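A tiny sketch of what that rounding difference means for a short job. The 100ms and 15-minute increments are the figures mentioned in this comment, taken as assumptions rather than checked against the pricing docs, and the 2350ms run time is made up.

```python
def billed_ms(run_ms, increment_ms):
    """Round a run time up to the next billing increment (integer milliseconds)."""
    increments = -(-run_ms // increment_ms)  # ceiling division without floats
    return increments * increment_ms


run_ms = 2350                                  # hypothetical screenshot job
per_100ms = billed_ms(run_ms, 100)             # billed as 2400 ms
per_15min = billed_ms(run_ms, 15 * 60 * 1000)  # billed as 900000 ms
```

For infrequent short jobs, the coarse increment dominates the bill; for a constantly busy service the difference mostly disappears.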


I wonder how feature/pricing compete with Browserless?


Our small instance ($30/month) is roughly similar to their $44.38 instance on App Engine. If you were to run a full 10 concurrent sessions constantly, which a small browserless instance can max-out at, this would cost roughly $427 a month in Functions. So depends on your use-case


Thanks makes sense it would be based on use case. Not sure who is petty enough to downvote my comment.


For reference, how did you get $427? Thanks in advance


Running 10 concurrent Google Functions at 1GB/1CPU. Assumes 30 days in a month


I am getting a different price from the GCP calculator: almost $2000. In your example, that's 10 functions per second, with each function taking 1000ms to finish. 2592000s in a month * 10 invocations = 25920000 invocations a month, running with a 1GB function and 500 KB of network bandwidth (out). https://cloud.google.com/products/calculator/#id=555e1af7-c8...
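Here is the same calculation as a sketch, in case anyone wants to plug in their own numbers. The invocation count follows the comment above. The three unit prices are assumptions recalled from the 2018 list prices, not figures from this thread, and free tiers are ignored, so check the pricing page before relying on them.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60           # 2,592,000, assuming a 30-day month
CONCURRENCY = 10                                # ten functions running at all times
invocations = SECONDS_PER_MONTH * CONCURRENCY   # 25,920,000 one-second invocations

# Assumed unit prices in USD (not from the thread).
COMPUTE_PER_SECOND = 0.0000165      # 1 GB memory tier, per second of execution
PER_MILLION_INVOCATIONS = 0.40
EGRESS_PER_GB = 0.12
EGRESS_KB_PER_INVOCATION = 500      # outbound bandwidth from the comment above

compute_cost = invocations * COMPUTE_PER_SECOND
invocation_cost = invocations / 1_000_000 * PER_MILLION_INVOCATIONS
egress_gb = invocations * EGRESS_KB_PER_INVOCATION / 1_000_000  # decimal GB
egress_cost = egress_gb * EGRESS_PER_GB
total_cost = compute_cost + invocation_cost + egress_cost

print(f"compute ${compute_cost:,.2f}, invocations ${invocation_cost:,.2f}, "
      f"egress ${egress_cost:,.2f}, total ${total_cost:,.2f}")
```

Under these assumed rates the compute part alone is close to the $427 quoted earlier, and outbound bandwidth accounts for most of the gap up to roughly $2000.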


Ah, I wasn’t calculating their other fees like invocations and networking costs. Crazy it’s almost an order-of-magnitude


How Google is behaving recently... I can feel it becoming Oracle. I want to stay 10 miles far from it. Learned my lesson with Google maps.


I'm not buying it, Google Cloud just moved to node 8 earlier this year as the post says, but now it's node 10. It's just not good tech, it's unnecessary lock-in. Docker on anything is better, this is similar: https://zeit.co/blog/serverless-docker


It’s wonderful tech that has saved us thousands a month. The lock in is very little as it’s running a JS function. We’ve written our functions so that the export to the GC function merely passes in arguments to another function that is required. We could move to AWS, Zeit, or anywhere else with little friction.
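A minimal sketch of that thin-wrapper idea, written in Python rather than the JS described above. Everything here is hypothetical: `take_screenshot` is a stand-in for the real work, and the entry points only assume the general shape of each platform's handler (a request object with `args` for Cloud Functions, an `event` dict for Lambda behind API Gateway).

```python
def take_screenshot(url, width=1280):
    """Platform-neutral core logic; a stand-in that just echoes its inputs."""
    return {"url": url, "width": width}


def gcf_entry(request):
    """Cloud Functions style HTTP entry point: unpack arguments, then delegate."""
    args = request.args
    return take_screenshot(args["url"], int(args.get("width", 1280)))


def lambda_entry(event, context):
    """AWS Lambda style entry point: unpack arguments, then delegate."""
    params = event.get("queryStringParameters") or {}
    return take_screenshot(params["url"], int(params.get("width", 1280)))
```

Only the few lines in each entry point know about the platform, so moving providers means rewriting the wrapper and leaving the core function alone.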


Please do, then, it's better for everyone in the node ecosystem if people aren't running node 4.x in 2018.


Would you be happier if people switched to some other language for their cloud functions? Node is not that important.


It's nice to see serverless platforms adding support for headless Chrome. But there's still one problem with AWS Lambda / Cloud Functions / Zeit Now - the run time is limited to a few minutes. If you want to run any longer job, e.g. a web crawler, you need to either spin up the instances yourself or use a platform like Apify, which allows running arbitrarily long jobs, provides pre-built Docker images for headless Chrome or XVFB, and provides an SDK to simplify state persistence, access to proxies etc.

For example, a simple actor to convert HTML to PDF looks like this:

https://www.apify.com/jancurn/url-to-pdf

More info:

https://www.apify.com/docs/actor

https://www.apify.com/docs/sdk/apify-runtime-js/latest

https://www.apify.com/library?type=acts

Disclaimer: I'm a co-founder of Apify



