Running feature specs with Capybara and Chrome headless (drivy.engineering)
101 points by TimPetricola on July 5, 2017 | 34 comments

Earlier this month I wrote an ETL extractor using Capybara and a headless browser, to work around the lack of an API (PS: only do that as a last resort!).

One thing I learned and wanted to warn about here is that Chrome headless currently doesn't support file downloads [1].

PhantomJS won't work either (unless you use a custom build) [2].

I also tried with capybara-webkit, but no luck either [3].

The only driver that ultimately allowed me to download files at this point was Selenium + Firefox, with some tweaking of profile options: browser.download.dir, browser.download.folderList = 2, and browser.helperApps.neverAsk.saveToDisk set to the MIME type of the files to be downloaded.

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=696481

[2] https://github.com/ariya/phantomjs/issues/10052

[3] https://github.com/thoughtbot/capybara-webkit/issues/691
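A minimal sketch of that Firefox profile setup registered as a Capybara driver (the driver name, download directory, and MIME type below are placeholders, not values from the original setup):

```ruby
require "capybara"
require "selenium-webdriver"

Capybara.register_driver :firefox_download do |app|
  profile = Selenium::WebDriver::Firefox::Profile.new
  profile["browser.download.folderList"] = 2                      # 2 = use a custom download directory
  profile["browser.download.dir"] = "/tmp/downloads"              # placeholder path
  profile["browser.helperApps.neverAsk.saveToDisk"] = "text/csv"  # MIME type(s) to save without prompting
  Capybara::Selenium::Driver.new(app, browser: :firefox, profile: profile)
end
```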

Unfortunately, headless Chrome is still missing some features, and file downloads are one of them.

I believe Nightmare [1] (running on Electron) handles file downloads; it might be worth looking into.

[1] https://github.com/segmentio/nightmare

Seconded. I didn't have to try all of the approaches you went through; fortunately I was already using Selenium with chromedriver in my testing, and I found this great set of examples for testing file downloads:

[1] https://collectiveidea.com/blog/archives/2012/01/27/testing-...

[2] https://forum.shakacode.com/t/how-to-test-file-downloads-wit...

It looks like the second linked post derives from the first, but both are very similar. My tests don't actually download the files anymore (they are PDFs, and we opted to open them in a new tab instead, which is unfortunately harder to confirm with a test).

Add this to the list of useful wrappers for things I thought would be difficult to test as a regression spec but ultimately weren't hard at all, like email delivery with email-spec [3].

[3] https://github.com/email-spec/email-spec

>One thing I learned and wanted to warn about here is that Chrome headless currently doesn't support file downloads [1].

Not true. At least with C# you can use:

    // requires: using System.Net; using System.Linq; using OpenQA.Selenium;
    var client = new WebClient();
    client.Headers[HttpRequestHeader.Cookie] = cookieString(driver);
    client.DownloadFile(reportURL, savePath + fileName + ".xlsx");

Along with:

    string cookieString(IWebDriver driver)
    {
        var cookies = driver.Manage().Cookies.AllCookies;
        return string.Join("; ", cookies.Select(c => string.Format("{0}={1}", c.Name, c.Value)));
    }

Edit: Just wanted to emphasize that this ensures all the steps you did to log in won't be lost, since you're reusing the logged-in session cookies. The idea is to use Selenium to get the download URL, then download it with a regular HTTP client that sends the Selenium cookies.
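The same cookie-reuse trick can be sketched in Ruby for Capybara/Selenium users (selenium_cookie_string is a hypothetical helper name; the download itself can then go through any plain HTTP client):

```ruby
# Build a Cookie header from the live Selenium session so a plain
# HTTP client can fetch the file as the logged-in user.
def selenium_cookie_string(driver)
  driver.manage.all_cookies
        .map { |c| "#{c[:name]}=#{c[:value]}" }
        .join("; ")
end

# Usage sketch with open-uri (report_url obtained via Selenium):
#   require "open-uri"
#   data = open(report_url, "Cookie" => selenium_cookie_string(driver)).read
#   File.binwrite("report.xlsx", data)
```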

Yes - I could have done that in other cases, but not in the specific case I was working on, where I don't know the report URL at all, nor can I construct it (it's generated by some enterprise app / JavaScript code). I think the bug report I mentioned relates to that.

(otherwise you could use pretty much whatever you want to download the file, indeed).

Chrome headless migration has been my bane for the last two weeks. It's almost ready, but not yet.

The short version of what's wrong: Chrome headless does not support extensions, and the chromedriver used by Selenium works by starting Chrome with an automation extension that provides certain needed functionality. So things like setting the window size, taking screenshots, and a few other important features just cause chromedriver to crash or hang.

There are bugs filed, and some clever Googlers are trying to fix it all, but it's not ready for our purposes yet. I don't want to rewrite dozens of tests that work with PhantomJS just to get around Chrome's bugs.

Indeed, some features are not supported yet (setting size, alerts, ...). We're lucky not to use too many of them in our actual test suite!

Just implemented headless testing for a rails app today, and discovered that alert support has been worked around in the latest Capybara (2.14.4).


I believe chromedriver 2.30 (or maybe an earlier version) fixes these issues.

Certainly it is possible to pass chromeOptions: { args: %w[headless disable-gpu window-size=1920x1080] } to the capabilities object, and that works.
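As a sketch, that capabilities object wired into a Capybara driver might look like this (driver name is arbitrary; note that Chrome's --window-size flag expects a comma rather than an x):

```ruby
require "capybara"
require "selenium-webdriver"

Capybara.register_driver :headless_chrome do |app|
  capabilities = Selenium::WebDriver::Remote::Capabilities.chrome(
    chromeOptions: { args: %w[headless disable-gpu window-size=1920,1080] }
  )
  Capybara::Selenium::Driver.new(app, browser: :chrome,
                                 desired_capabilities: capabilities)
end
Capybara.javascript_driver = :headless_chrome
```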

Screenshots also work in Rails system tests.

>Screenshots also work in Rails system tests.

Do you have a link or something about how to make that work? About two weeks ago, I was struggling with screenshots and pushed it back to my to-do list.

See the top comment in this issue for my setup: https://github.com/rails/rails/issues/29688

Until the fix is released for Rails (the issue is resolved, but I don't think it's published to RubyGems), you'll also need the monkeypatch I added later in the issue report (or some other code to that effect).

I hear a lot of folks mentioning pain working with headless Chrome. This was the inspiration for Navalia, whose API is inspired by Nightmare's. It also handles the case where you want to run simultaneous scripts, by using new tabs (and even multiple Chrome instances). It's still early in life, and I'm about to do another breaking release, but here is the repo: https://github.com/joelgriffith/navalia

Kind of off topic, but: is browser automation really this complicated? I recently wanted to log in to a website, click through some pages, and download a .csv file. I saw that Chrome can be run headless - nice.

So I opened it headless, and there's this REPL - nice, I can run JS directly against the website. Now, how do I automate this?

This led me into all sorts of things that don't seem related at all (it is related, but still): Selenium stuff, automation setups, drivers, language bindings, Chrome API stuff, and no end in sight.

All I want is something like "chrome -headless -js script-flow.file http://URL"

Am I too overwhelmed, or is there no simple way without buttloads of third-party tools and setup required?

Yeah, it's actually way simpler than that, since Chrome (and I think Firefox now too) exposes an API for driving the browser. What you're after is the remote debugging protocol: https://chromedevtools.github.io/devtools-protocol/

But there are a bunch of tools that abstract that for you, so you don't have to hit that API manually. Anyway, here is a handy document from Mozilla on the protocol too: http://searchfox.org/mozilla-central/source/devtools/docs/ba...
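To illustrate how little is needed to get started: if Chrome is launched with --remote-debugging-port, its HTTP endpoint lists the open targets, each with a webSocketDebuggerUrl you can drive over the protocol. A minimal sketch (list_targets is a hypothetical helper name):

```ruby
require "net/http"
require "uri"
require "json"

# Parse the target list returned by Chrome's /json endpoint
# into human-readable "type: url" lines.
def list_targets(json_body)
  JSON.parse(json_body).map { |t| "#{t['type']}: #{t['url']}" }
end

# Against a running Chrome, started e.g. with:
#   chrome --headless --disable-gpu --remote-debugging-port=9222 https://example.com
#
#   puts list_targets(Net::HTTP.get(URI("http://localhost:9222/json")))
```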

Chrome headless is a bit on the edge. Selenium + Firefox/Chrome is pretty mature, and Selenium even publishes Docker images that remove a lot of setup complexity. Pick your favorite language, grab the requisite WebDriver gem/module/etc and point it at the container.

Additionally, for many use cases, the many browser automation SaaS's out there are a good solution.

I'm working on a high-level API to solve a lot of what you're describing. It's still in its infancy, but soon will be runtime agnostic: https://github.com/joelgriffith/navalia. File an issue for what's missing!

This one should handle downloading one day, at least; it's OK today for scraping: https://github.com/LucianoGanga/simple-headless-chrome

What does headless Chrome provide that a web scraper in any given language can't?

I was hoping it would let me automate a few tasks with a "user" flow (i.e. enter details, click, click). With e.g. curl or Python I didn't even get through the login screens, because they seem to require some special handling of cookies and request/response state (smfd, itc, id_ado, _ip_xat and so on).

Basically, auth on websites seems to require a whole bunch of machinery, so I started looking at simpler forms of automation instead. Still not sure which one I'll continue with.

(If anyone sees this, I am trying to log into iTunes Connect and Fabric and download metrics)

> What does headless Chrome provide that a web scraper in any given language can't?

Everything a web scraper can do, without reinventing all the infrastructure for handling web content (including JS/DOM interaction) from scratch. You could obviously do it all yourself in your language of choice, but why not focus on the application-specific parts?

Taking screenshots comes to mind. It's still extremely complicated to get right, even with the likes of PhantomJS, Chrome headless, etc.

A nice quick solution for that is SlimerJS with CasperJS.

"It is also supposed to improve memory usage and stability"

Did you do a comparison of memory usage against PhantomJS?

Words from the PhantomJS ex-maintainer [0]:

> I think people will switch to it, eventually. Chrome is faster and more stable than PhantomJS. And it doesn't eat memory like crazy.

[0] https://groups.google.com/forum/#!msg/phantomjs/9aI5d-LDuNE/...

Just this morning my team migrated from PhantomJS to Chrome headless for our Karma plus Jasmine tests. We did a bit of research before the move and found that tests ran faster and Chrome used less memory than PhantomJS (~2000 specs compiling HTML and such).

Any guess at how much faster the tests are running?

In fact, no, we didn't do any kind of benchmark ourselves; it's an assumption based on feedback from other users. As we knew we wanted to switch to headless Chrome anyway, we didn't investigate this part further.

I tried switching from poltergeist to Chrome headless last week using a very similar setup, but it was much, much slower. I assumed it was Selenium's fault, but didn't dig any deeper as it seemed to be expected from the guide I was using[1] ("I can anecdotally report that Capybara-Webkit seems significantly faster").

[1]: https://robots.thoughtbot.com/headless-feature-specs-with-ch...

Wow, I completely missed the announcement that the maintainer and sole developer of PhantomJS is stepping down.

Poltergeist had a network_traffic mode which we found quite useful (seeing whether requests were made, etc.).

Does Chrome headless have an equivalent? It's the only thing preventing us from moving across. And if not, where should I raise it?

Headless Chrome is controlled using the Chrome DevTools protocol, which has quite a few options for tracking and modifying network requests.

Everything you get in the Network inspector in Chrome DevTools should be available in Headless Chrome.
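A sketch of what that could look like from Ruby via the chrome_remote gem (this assumes Chrome is running with --remote-debugging-port=9222, and the gem's API may differ between versions):

```ruby
require "chrome_remote"  # gem install chrome_remote

chrome = ChromeRemote.client  # connects to localhost:9222 by default
chrome.send_cmd "Network.enable"

# Print every request the page makes - roughly what Poltergeist's
# network_traffic mode exposed.
chrome.on "Network.requestWillBeSent" do |params|
  puts "#{params['request']['method']} #{params['request']['url']}"
end

chrome.send_cmd "Page.navigate", url: "https://example.com"
chrome.listen
```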


Not that I know of, but AFAIK you can use a proxy.

Even with Poltergeist I used to debug a slow test suite with its --proxy-server option.

You can fire up mitmproxy and see the requests flowing in real time. You can even modify them or replay them.

When specs are failing, you actually want to see what's happening in the browser: Xvfb + normal Chrome + ffmpeg screen capture. This can be run in Docker, and thus on Macs as well.

> To prevent some issues in PhantomJS when elements would overlap, we had a lot of calls like this:

I don't understand. How would you do it when it's not overlapping?
