I wrote a collection of Dockerfiles for images running Python 2.7 or Python 3.6 + Selenium with either Chrome or Firefox and using Xvfb for the X display (necessary for running Selenium headlessly).
In short - this gives you the ability to pass the output of one lambda function to the input of another lambda function.
An example is one that I've written to regularly create a new copy of our Production RDS database in Ireland as a Staging RDS database in Oregon.
1. Cloudwatch Event starts Step Function on the 15th
2. Copy last Production snapshot from Ireland to Oregon
3. Restore this snapshot as a new RDS instance (It will fail until the snapshot is available and retry with exponential backoff - this is a step function feature)
4. In parallel:
- Add tags to the instance (Once it's available)
- Delete the snapshot copy (When finished restoring)
- Modify the new instance with security groups and subnets
- In parallel:
- Run a SQL query to anonymize all of PII columns for GDPR compliance as data has now left the EU.
- Call out to the Cloudflare API to update our DNS entry with the new RDS endpoint.
- Delete the old Staging database instance
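The retry-with-backoff in step 3 and the fan-out in step 4 correspond to the `Retry` and `Parallel` fields of the Amazon States Language. Here is a minimal sketch of what such a definition could look like, built as a Python dict so it can be dumped to JSON. The ARN prefix, state names, and retry numbers are all placeholders I made up, not the real workflow:

```python
import json

# Hypothetical ARN prefix; swap in real Lambda function ARNs.
ARN = "arn:aws:lambda:us-west-2:123456789012:function:"

# Retry rule: wait 30s, then 60s, then 120s, ... between attempts.
BACKOFF = {
    "ErrorEquals": ["States.ALL"],
    "IntervalSeconds": 30,
    "BackoffRate": 2.0,
    "MaxAttempts": 8,
}


def task(name, next_state=None, retry=None):
    """Build a Task state that invokes the (hypothetical) Lambda `name`."""
    state = {"Type": "Task", "Resource": ARN + name}
    if retry:
        state["Retry"] = [retry]
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def branch(name):
    """A single-state branch for use inside a Parallel state."""
    return {"StartAt": name, "States": {name: task(name, retry=BACKOFF)}}


definition = {
    "StartAt": "CopySnapshot",
    "States": {
        "CopySnapshot": task("CopySnapshot", "RestoreSnapshot"),
        # Fails until the snapshot copy is available, then succeeds on a retry.
        "RestoreSnapshot": task("RestoreSnapshot", "PostRestore", BACKOFF),
        "PostRestore": {
            "Type": "Parallel",
            "Branches": [
                branch("AddTags"),
                branch("DeleteSnapshotCopy"),
                branch("ModifyInstance"),
            ],
            "End": True,
        },
    },
}


def retry_delays(retry):
    """Seconds waited before each retry attempt under the given rule."""
    return [
        retry["IntervalSeconds"] * retry["BackoffRate"] ** i
        for i in range(retry["MaxAttempts"])
    ]


if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```

The nested "In parallel" block would be another `Parallel` state inside one of the branches; it is left out here to keep the sketch short.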
Do you run into problems with Lambda's 5-minute maximum execution time for those kinds of operations? I'd like to do something similar to this for both RDS and DynamoDB, but the execution time will often surpass 5 minutes, meaning I'd have to run a Step Functions worker on EC2 or ECS. That opens up a whole bunch of complexity with managing the worker code and its deployment, which I'd rather avoid if possible.
With the current implementation, no problems hitting the limit. As mentioned in my comment below, our query for anonymization would be the heaviest, but it's designed to be quick as we don't care about unique values for most data.
If we did though, Fargate is a great solution for it, but you wouldn't be able to feed data back into the next step without some additional complexity. Maybe have the next step poll an SQS queue, an S3 file, or a database entry for the next bit of data that it needs, and just fail until it finds it. Once the Fargate task (or whatever) has done its job and placed the result in your method of choice, the step could continue.
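A rough sketch of that "fail until the data shows up" idea, assuming the state machine has a Retry rule on this step. The in-memory dict stands in for S3/SQS/a database table, and every name here is hypothetical:

```python
class ResultNotReady(Exception):
    """Raised so the state machine's Retry rule re-runs this step later."""


# Stand-in for S3, SQS, or a database table; in-memory for illustration only.
RESULTS = {}


def publish_result(job_id, payload):
    """What the long-running worker (e.g. a Fargate task) would do when done."""
    RESULTS[job_id] = payload


def next_step_handler(event, context=None):
    """Lambda-style handler: fail until the worker's output exists."""
    job_id = event["job_id"]
    if job_id not in RESULTS:
        raise ResultNotReady(f"no result yet for {job_id}")
    return {"job_id": job_id, "result": RESULTS[job_id]}
```

The step keeps failing cheaply (one lookup per attempt) and only passes data on once the worker has published it.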
> Run a SQL query to anonymize all of PII columns for GDPR compliance as data has now left the EU.
Do you ensure the values you replace with 'make sense' in the context of the application? i.e are names turned into fake names?
If so, I would love to hear more about how you handle the complexities of this. If not, it's still a wonderful pipeline that I'm putting in my ideas box, thanks for sharing.
Nope! I probably could with the Faker library; but we don't care about that - and to do so would be a much heavier query. My query looks like this, so it runs extremely fast and isn't an issue on Lambda:
`UPDATE Users set FirstName = 'FAKEFIRSTNAME', LastName = 'FAKELASTNAME', StreetAddress = '123 FAKE ST.', Zip = '10001', PrimaryEmail = Cast(NewId() as varchar(36)) + '@x.com', Phone = '555-555-5555'`
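To show why this stays cheap (constants everywhere, one generated value per row for the email), here is a small runnable sketch. The query above looks like T-SQL; this version uses an in-memory SQLite database instead, with a trimmed-down, made-up `Users` schema and a `NewId` function registered from Python, so treat it as an illustration and not the real query:

```python
import sqlite3
import uuid


def make_db():
    """Create an in-memory database with a hypothetical Users table."""
    conn = sqlite3.connect(":memory:")
    # Register a NewId() SQL function that returns a fresh UUID per call.
    conn.create_function("NewId", 0, lambda: str(uuid.uuid4()))
    conn.execute(
        "CREATE TABLE Users ("
        "Id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, PrimaryEmail TEXT)"
    )
    conn.executemany(
        "INSERT INTO Users (FirstName, LastName, PrimaryEmail) VALUES (?, ?, ?)",
        [
            ("Ada", "Lovelace", "ada@example.com"),
            ("Alan", "Turing", "alan@example.com"),
            ("Grace", "Hopper", "grace@example.com"),
        ],
    )
    return conn


def anonymize(conn):
    """Overwrite PII with constants; only the email needs to stay unique."""
    conn.execute(
        "UPDATE Users SET FirstName = 'FAKEFIRSTNAME', "
        "LastName = 'FAKELASTNAME', "
        "PrimaryEmail = NewId() || '@x.com'"
    )
    return conn.execute(
        "SELECT FirstName, LastName, PrimaryEmail FROM Users"
    ).fetchall()


rows = anonymize(make_db())
```

Every row ends up with the same fake name, while the emails stay unique, which matters if the column has a uniqueness constraint.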
I tried to use Headless Chrome on Cloud Functions for a project I'm working on, but even on the fastest instances loading pages was sometimes really slow (pages timing out after waiting for 60s).
It seems sometimes JS execution was taking a long time, so I guess that was preventing requests from being made. In a single CPU cloud function you have network requests, JavaScript execution, rendering, and the Node process controlling the browser all competing for resources.
That being said, it was super simple to get started!
Do simple hello-world HTML pages render ok? I found that rendering was slow-ish but totally acceptable for reasonably heavy HTML pages, so long as we weren't flooding page re-renders (eg by using React without any optimisation of render calls)
How many requests per minute were you running? I was planning on using Headless Chrome on Cloud Functions to take in some arbitrary HTML and output it to a PDF. In my case I can’t see >2 rpm happening anytime soon.
Just saw your other comment! The /function call looks interesting. Would I be able to use puppeteer’s request intercept API with it?
A bit more on my use case: currently using pdf.js for rendering HTML reports as a PDF. It’s been a pain. I’d like to essentially take the HTML of some (say, just a <table>) or all (full HTML doc) of the current page and send it off. The main gotcha I can think of is relative URLs but intercept request would resolve that.
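For the relative URL gotcha, an alternative to intercepting requests is to rewrite the URLs before sending the HTML off. A minimal sketch using only the Python standard library; the regex handles just double-quoted `src`/`href` attributes, and the function name and URLs are made up for illustration:

```python
import re
from urllib.parse import urljoin

# Matches double-quoted src="..." and href="..." attributes only.
_ATTR = re.compile(r'\b(src|href)="([^"]*)"')


def absolutize(html, base_url):
    """Rewrite relative src/href values so they resolve against base_url."""
    def fix(match):
        attr, value = match.group(1), match.group(2)
        return f'{attr}="{urljoin(base_url, value)}"'

    return _ATTR.sub(fix, html)


snippet = '<img src="/logo.png"><link href="style.css">'
fixed = absolutize(snippet, "https://example.com/reports/1")
```

URLs that are already absolute pass through `urljoin` unchanged, so it is safe to run over a whole document.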
Among other things with puppeteer we do screenshot generation using GKE on Google Cloud @ https://screenshots.cloud/ scaling up and down running instances depending on demand. We keep browser instances running constantly as the startup time is significant. I will be interested to see what the startup time is for puppeteer on this, will definitely be giving it a try.
One completely unrelated thing. On Chrome 68.0.3440.84, I noticed the large icons (particularly the Kubernetes one) looked "weird", with jagged edges that didn't make any sense. Some poking with the devtools revealed that 'backface-visibility: hidden' seems to be disabling antialiasing.
Suggest opening the following in new tabs so you can flip back and forth between them:
Thanks for spending time to let me know because I don't think I would have noticed it otherwise! I can't see it on retina but I can on my non retina display. We'll have to make the CSS rule more specific. As to why it's there I believe Firefox 57 or around that version had an issue with the sliding animation on the top of the page causing images to tear or not render at all when they scrolled in. This bug must have been solved recently because disabling backface-visibility on the image doesn't cause the same tearing.
I was very curious what was causing the non-antialiasing, it was fun.
And you can repro :) cool. Makes a lot of sense you can't see it on retina.
Interesting FF bug you hit. CSS3 GPU-accelerated animations are incredibly complex... heh, adding the rule fixed Firefox ~57, but now Chrome 68 is glitching out because the rule is there. I wonder if Google realizes yet. Ponders complexity of creating minimal testcase, versus waiting for someone else to notice :P
Nice to see this concept get into the big cloud platforms. We built something similar a couple of years ago, primarily to get a sandbox for some compute jobs we were running on Heroku:
Running Headless Chrome / Chromium is a bit of a hassle on AWS Lambda and other FAAS providers. Chrome requires some specific bindings/binaries to work. I think the Chrome guys and girls convinced their Cloud coworkers to provide these in the underlying Linux machines that run the Cloud Functions.
It actually is easier than people think, after chrome added the ability to disable shm (which isn't available on lambda) you can statically compile it into a single binary and run it with chromedriver & capybara easily.
Unfortunately Google Cloud Functions don't natively support golang yet so for my business this is a non-starter.
Seconded. It works great. I've been using this in a lambda web collection project for the past year. Initial load times for headless chrome can be a bit slow, but iterative uses after it has launched are pretty fast.
If Google cloud ain’t your jam then check out browserless (https://browserless.io/). It can be considerably cheaper under certain situations, and we’ve been up and running for almost a year. Happy to answer questions if anyone has any.
EDIT: We’ve got stuff on GH: https://github.com/joelgriffith/browserless, and startup is under 100ms most of the time. Fonts and other things “just work” as well, plus there’s a slew of REST APIs for common stuff as well. Selenium webdriver support landing soon!
Something somewhat related: I think more Cloud providers need to start doing Docker-functions. It’s always going to be a waiting game for runtimes and upgrades. Docker is portable and can run just about anything, so why not support that?
Thanks @mrskitch, @hugelgupf, and @dantiberian. Docker-based Cloud Functions look like the best long-term solution for supporting multiple languages.
I think it would be pretty cool to use Cloud Functions as a Selenium Grid, sort of like Zalenium (https://github.com/zalando/zalenium) does with Kubernetes. If you could parallelize end-to-end tests enough, you could get massive burstable capacity to run parallel tests.
HtmlUnit API is really cool. It's fine for most use cases. But obviously the Javascript support is not perfect. [Shameless plug] I wrote a book about web scraping where I talk about HtmlUnit & headless chrome with Java: https://www.javawebscrapinghandbook.com
I've been screwing around with running Headless Chrome & Puppeteer on Lambda/Serverless/FAAS solutions. It's all a bit of a mixed bag. You CAN run Headless Chrome on AWS Lambda, but the cost involved is pretty crazy as you need ~1500 MB of RAM to comfortably run any code with Chrome.
Google Cloud of course has "inside knowledge" and I would love to switch to them for my SaaS https://checklyhq.com, were it not that Google Cloud Functions is just offered in four (!) regions...
Ok, I was just looking into this 1 week ago and was gonna spin up a VM to do headless. Now I get to keep my firebase project in cloud functions only. Much cleaner architecture.
I'm currently using a headless Chrome for my latest project www.blockedby.com (still in alpha stage, looking for feedback)
I've been looking at a non-local solution. I'm using Python and this article hints that Puppeteer is not the only way to invoke this. But I don't see any documentation on the DevTools protocol.
Does anyone know if it's supported? Or any providers that do?
You should just need to get a handle on the host and port to use the DevTools protocol to talk to the launched headless chrome instance. If you use node.js to start it, you can get the port like this: https://github.com/GoogleChrome/puppeteer/blob/master/docs/a...
Then you can use PyChromeDevtools to connect to that host/port:
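Independently of which client library you pick, a Chrome started with remote debugging serves a small HTTP endpoint at `/json/version` whose JSON includes a `webSocketDebuggerUrl`. A minimal standard-library sketch for discovering it from Python; the helper names are mine, and it assumes the host and port are reachable from where the code runs:

```python
import json
from urllib.request import urlopen


def version_endpoint(host, port):
    """URL of the DevTools HTTP endpoint that describes the browser."""
    return f"http://{host}:{port}/json/version"


def websocket_url(version_payload):
    """Extract the browser-level WebSocket URL from the /json/version body."""
    return json.loads(version_payload)["webSocketDebuggerUrl"]


def discover(host, port, timeout=5):
    """Ask a running Chrome for its DevTools WebSocket URL."""
    with urlopen(version_endpoint(host, port), timeout=timeout) as resp:
        return websocket_url(resp.read().decode("utf-8"))
```

Something like `discover("127.0.0.1", 9222)` would then give you the URL a DevTools protocol client connects to.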
I'm not sure if this is what you are looking for, but I have been using headless Chrome through Selenium (using nightwatch.js) for some time now and it works great. You will probably need to rewrite your tests though, but they will support FF too. YMMV.
I’ve tested HC on GCF and GAE standard with the IO launch. Sadly they’re 3-10x slower than App Engine Flex (same vm size on App Engine Flex vs Standard). Even a screenshot of google.com takes 6+ seconds on GCF / GAE Standard vs 2 seconds for Flex. I hope they fix this as spinning to zero is important for me but the latency is too high right now.
Tried HC just yesterday on cloud functions. For some reason, it runs extremely slow. Did some comparisons to AWS lambda with similar memory / cpu sizes and basic “load page and screenshot” jobs would take 2x - 3x more time on google cloud functions.
Lower costs. With AppEngine you pay for a whole instance, regardless of how much utilisation it gets. With cloud functions you only pay for the time the function is being executed. If your code isn't being executed often, GCF are much cheaper.
Our small instance ($30/month) is roughly similar to their $44.38 instance on App Engine. If you were to run a full 10 concurrent sessions constantly, which a small browserless instance can max-out at, this would cost roughly $427 a month in Functions. So depends on your use-case
I am getting a different price from GCP calculator. Almost $2000.
In your example that's 10 functions per second, so each function is taking 1000 ms to finish.
2,592,000 s in a month * 10 invocations per second = 25,920,000 invocations a month, running with a 1 GB function and 500 KB of network bandwidth (out).
https://cloud.google.com/products/calculator/#id=555e1af7-c8...
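The usage figures behind that estimate can be reproduced with a few lines of arithmetic. This sketch only computes the quantities (invocations, GB-seconds, egress); per-unit prices are deliberately left out since they change, and the function and parameter names are my own:

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 for a 30-day month


def monthly_usage(invocations_per_second, duration_s, memory_gb, egress_kb):
    """Return the billable quantities for a constantly busy function."""
    invocations = SECONDS_PER_MONTH * invocations_per_second
    return {
        "invocations": invocations,
        # Memory-time: each invocation holds memory_gb for duration_s seconds.
        "gb_seconds": invocations * duration_s * memory_gb,
        # Outbound traffic, converted from KB to GB (decimal units).
        "egress_gb": invocations * egress_kb / 1_000_000,
    }


usage = monthly_usage(
    invocations_per_second=10, duration_s=1.0, memory_gb=1.0, egress_kb=500
)
```

Multiplying each quantity by the current price from the calculator gives the monthly bill; at this volume the egress term alone is close to 13 TB.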
I'm not buying it, Google Cloud just moved to node 8 earlier this year as the post says, but now it's node 10. It's just not good tech, it's unnecessary lock-in. Docker on anything is better, this is similar: https://zeit.co/blog/serverless-docker
It’s wonderful tech that has saved us thousands a month. The lock in is very little as it’s running a JS function. We’ve written our functions so that the export to the GC function merely passes in arguments to another function that is required. We could move to AWS, Zeit, or anywhere else with little friction.
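The pattern described here, where the exported function only forwards arguments to platform-agnostic code, looks roughly like this. The sketch is in Python for illustration (the functions above are JS), and all names and event shapes are hypothetical:

```python
def render_report(html, page_format="A4"):
    """Platform-agnostic core logic; knows nothing about any cloud provider."""
    return {"format": page_format, "bytes": len(html.encode("utf-8"))}


def lambda_style_handler(event, context=None):
    """Thin adapter for an event/context style platform."""
    return render_report(event["html"], event.get("format", "A4"))


def http_style_handler(request_json):
    """Thin adapter for an HTTP-triggered function that receives parsed JSON."""
    return render_report(request_json["html"], request_json.get("format", "A4"))
```

Moving providers then means writing one more adapter of a few lines, while `render_report` stays untouched.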
It's nice to see serverless platforms adding support for headless Chrome. But there's still one problem with AWS Lambda / Cloud Functions / Zeit Now: the run time is limited to a few minutes. If you want to run any longer job, e.g. a web crawler, you need to either spin up the instances yourself or use a platform like Apify, which allows running arbitrarily long jobs, provides pre-built Docker images for headless Chrome or XVFB, and provides an SDK to simplify state persistence, access to proxies, etc.
For example, a simple actor to convert HTML to PDF looks like this:
https://github.com/seanpianka/docker-python-xvfb-selenium-ch...
Using this, in conjunction with AWS Step Functions, Lambda, and ECS, it became merely cents a month to run a headless scraper task in the cloud.