
Python Headless Web Browser Scraping on Amazon Linux - steven5158
http://fruchterco.com/post/53164489086/python-headless-web-browser-scraping-on-amazon-linux
======
fauigerzigerk
PhantomJS is brilliant, but Selenium is a questionable choice for this task.
For some reason, the creators of Selenium have decided that passing HTTP
status codes back through the API is and always will be outside the scope of
their project. So if you request a page and it returns 404 you have no way to
find out (other than using crude heuristics). This makes Selenium completely
unusable for anything I would have used it for.
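
(For what it's worth, the "crude heuristics" in question usually amount to sniffing the rendered page's title and body for error-page phrases. A minimal sketch; the patterns here are assumptions and need tuning per target site:)

```python
import re

# Phrases that error pages commonly contain. These are guesses;
# real sites need site-specific patterns.
ERROR_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\b404\b", r"not found", r"page (doesn't|does not) exist")
]

def looks_like_error_page(title, body_text):
    """Guess whether a rendered page is really an error page."""
    haystack = "%s %s" % (title, body_text[:2000])  # only scan the top
    return any(p.search(haystack) for p in ERROR_PATTERNS)
```

With Selenium you'd feed it driver.title and the body element's text, which is exactly the kind of guesswork a real status code would make unnecessary.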

Fortunately you can do it by using phantomjs directly instead of going through
the Selenium WebDriver API. Maybe one day the phantomjs WebDriver API
implementation (ghostdriver) will extend the API to pass HTTP status
information back to the caller. Until then, this API is unusable (at least for
me).

~~~
nirvdrum
Well, I think the matter is a bit more complicated than that. When dealing
with a full browser, you fetch a lot of resources. The status code for the
first page fetch may be easily obtained, but your API gets very wonky as soon
as you want to get status codes for all linked resources. Even if you managed
that, any Ajax requests would complicate things, especially if they have
deferred loading. And then you have WebSockets.

There are tools, such as BrowserMob Proxy, far better suited for monitoring
HTTP traffic. And they'll get you all the headers. You can even capture to HAR
so you can measure performance.
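
Once you have a HAR capture (from BrowserMob Proxy or the like), pulling the status codes back out is just dict-walking; a sketch assuming the HAR has already been loaded as a Python dict:

```python
def status_codes(har):
    """Map each requested URL to its HTTP status from a HAR capture."""
    return {
        entry["request"]["url"]: entry["response"]["status"]
        for entry in har["log"]["entries"]
    }

# A trimmed-down HAR structure for illustration.
har = {"log": {"entries": [
    {"request": {"url": "http://example.com/"},
     "response": {"status": 200}},
    {"request": {"url": "http://example.com/missing.js"},
     "response": {"status": 404}},
]}}
```

status_codes(har) gives you every resource's status, so the original URL's result is one lookup away.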

~~~
fauigerzigerk
Difficult edge cases are never a good reason not to support the 99.9% case.

Also, phantomjs has access to all the information you want and the WebDriver
API already has a capabilities negotiation facility.

[Edit] Don't forget that the original URL is the only one supplied by the
client of the API. It may be incorrect for very different reasons than all the
other resources included by the page itself. That's why it is justified to
treat it as a special case.

~~~
nirvdrum
These aren't edge cases. They're asked about constantly. Most people are using
Selenium because they care about everything on the page. Otherwise, your
stdlib HTTP client would be sufficient.

That aside, if PhantomJS already has the info, you can always fetch it with
executeScript.

If you do feel that strongly about the status code part though, I'd urge you
to comment on the public draft of the W3C spec:
[http://www.w3.org/TR/webdriver/](http://www.w3.org/TR/webdriver/)

~~~
cstejerean
From the point of view of simulating actual users, the fact that some random
third-party resource on the page failed to load is not particularly relevant.
That happens all the time as I browse around the web, and I never have to care
about it as long as the site continues to function. So it very much is an edge
case compared to the page itself failing to load.

~~~
nirvdrum
A JavaScript file failing to load will bork most pages. A CSS file failing to
load or a key image will cause most people to quit. And an Ajax request
failing in a single-page app will render it useless.

But, my point of view is from actual Selenium users. This is framed by
providing support on the IRC channel, on the mailing lists, triaging the issue
tracker, and by interacting with people at SeleniumConf and the local Boston
meetup. It's not some fringe use case and I'm not arguing the point for the
sake of arguing it. The original supposition that it's an edge case is not
accurate. And sure, the web breaks. That's why people using Selenium would
like a way to catch that. And that's a big part of why the BrowserMob Proxy
project exists.

~~~
encoderer
"A JavaScript file failing to load will bork most pages. A CSS file failing to
load or a key image will cause most people to quit."

Wha?

Sure, if, say, "app.js" fails to load, you have a problem.

But an analytics script?

A 3rd party ad script (which is what the GP gave as an example)?

These things can and do fail all the time.

------
slaxo
For anyone using PhantomJS, I'd recommend checking out CasperJS
([http://casperjs.org/](http://casperjs.org/)). It adds some nice features to
PhantomJS and takes out a lot of the pain points.

------
diminoten
I find it preferable to determine the requests that jQuery is making and
perform them myself to extract the necessary data, rather than load up a whole
browser just to do the same thing.

Selenium is _terrible_, performance-wise, and requires a _significant_
investment in environment in order to work reliably. I try to avoid it except
when I absolutely cannot.
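
The pattern is usually: watch the network tab (or Fiddler), find the JSON endpoint the page's jQuery hits, and call it directly. A sketch with a hypothetical endpoint; the URL and query parameter names here are made up:

```python
from urllib.parse import urlencode

# Hypothetical endpoint, discovered by watching the browser's network
# traffic (dev tools, Fiddler, etc.). Not a real API.
BASE = "http://example.com/api/listings"

def build_request_url(query, page=1):
    """Reconstruct the XHR URL the page's jQuery would request."""
    params = urlencode({"q": query, "page": page, "format": "json"})
    return "%s?%s" % (BASE, params)

# Fetch the result with urllib.request.urlopen(url) or python-requests,
# then json.loads() the body -- no browser required.
```
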

~~~
ArbitraryCrow
I wound up doing this myself, after spending an undue amount of time
struggling with a morass of insanely written Javascript. Fiddler proved
indispensable for observing the actual interaction with the web server.

------
brechin
If you're writing Python and need to do something like this, you could try
using Phantompy, a Python port of PhantomJS:
[https://github.com/niwibe/phantompy](https://github.com/niwibe/phantompy)

It's still "in an early stage of development" but it's on my list of libraries
to keep an eye on for when I have time to tackle the JS-heavy sites of the
world.

------
spikels
For scraping, phantomjs or casperjs is the best way to go, but you will have
to use some JavaScript [1]. Both give you access to everything a WebKit
browser user does, with either a Node-style callback syntax (phantomjs) or a
procedural/promises-style syntax (casperjs). Easy to set up, simple to use and
fast enough for scraping, but WebKit only (for now).

For testing on browsers other than WebKit (or vendor specific WebKit edge
cases) use Selenium. Harder to set up, more complex, probably faster (still
slow for testing) but not limited to WebKit.

[1] Sorry folks, but some JavaScript is required to interact programmatically
with the web; you'll also need some HTML and CSS.

------
xfour
One more thing: has anyone used BeautifulSoup lately? Is the project still
active? I mean, the website is cute and all, but I find pyquery (also based on
lxml) so much easier for parsing the scraped data.

~~~
ianhawes
Something to consider is that the trend over the past year has been to use
headless browsers over BeautifulSoup, cURL, etc., because headless browsers
are harder for anti-scraping systems to detect and can interpret JavaScript.

~~~
takluyver
That's what the OP is about ;-). But BeautifulSoup isn't a way to retrieve a
web page, it's a way to parse HTML. You can get the page with a headless
browser, and then transfer the DOM into a BeautifulSoup tree to do your
scraping.
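
i.e. something like this (a sketch; any object exposing a page_source attribute works, which is what Selenium's driver provides):

```python
from bs4 import BeautifulSoup

def soup_from_driver(driver):
    """Parse the browser's rendered DOM with BeautifulSoup."""
    # page_source is the DOM *after* JavaScript has run, not the raw
    # response body, so JS-inserted content is included.
    return BeautifulSoup(driver.page_source, "html.parser")
```
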

------
616c
I recently tried to get back into Selenium for a work-related project and,
despite its frustrations, it is one of my favorite open source gems I've found
in the last several years. When I showed it to uninitiated web devs, their
heads almost exploded with joy and amazement. Your setup with Selenium
intrigued me, since the pain point for me has become how difficult it is to
maneuver some browsers with Selenium IDE to throw together ideas, if that is
even encouraged anymore.

------
phaer
You are installing some devel packages, but I don't see anything compiling.
Does the selenium installation build native extensions? Then the commands
should probably be the other way round. Or is phantomjs compiling something on
the first run?

Minor nitpick: I don't think it is a good idea to copy a binary directly to
/usr/bin, without a package manager. You could just put it into /opt and
symlink to /usr/(local/)bin.

~~~
kawsper
The file that he is fetching (phantomjs-1.9.1-linux-x86_64.tar.bz2) contains
the executables for his platform, with some usage examples and a readme.

~~~
cinquemb
That doesn't seem like a very safe thing to do... don't they have source for
PhantomJS that one can checksum and build with ./configure && make && sudo
make install?

~~~
Wilya
PhantomJS is pretty big. IIRC, building it takes quite some time. I think they
bundle webkit and the necessary parts of Qt, and you'd have to be out of your
mind to build _that_ from source if you can avoid it.

Using official distribution packages would be a better idea, but their
freshness can vary, especially on RHEL.

~~~
cinquemb
Fair enough if you want to get straight to doing what you were planning to do.

------
j-kidd
Off topic: it is perfectly fine to install things like PyQt / PySide on a
headless server. I suppose the problem is that the distro doesn't provide
these packages?

Also, PhantomJS works fine in this case because the binary in the tarball is
statically compiled. You can find a whole lot of Qt stuff inside the PhantomJS
source repository. There ain't no such thing as "truly headless".

------
techaddict009
Wow, I was searching for something similar. I'm actually trying to build an
app which scrapes data from movie ticket booking sites and tells the user via
SMS whether tickets are still available, since not everyone has access to the
internet in India yet.

@Steven5158 thanks for the share.

If anyone here wants help building SMS apps, do contact me.

------
keypusher
We do quite a bit of web scraping / parsing on headless servers with Selenium.
What we did was just install some X packages and run a VNC server on the
headless clients with Firefox. The cool thing about that is you can then
connect to the VNC session to watch the scripts executing, take a screenshot
on failure, etc.
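
The screenshot-on-failure part can be wrapped in a small context manager; a sketch using Selenium's save_screenshot (the path handling is an assumption):

```python
from contextlib import contextmanager

@contextmanager
def screenshot_on_failure(driver, path="failure.png"):
    """Save a screenshot if the wrapped scraping step raises."""
    try:
        yield driver
    except Exception:
        driver.save_screenshot(path)  # Selenium WebDriver method
        raise
```

Used as `with screenshot_on_failure(driver, "step1.png"): driver.get(url); ...`, so any unexpected exception leaves behind a picture of what the browser was showing.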

~~~
Shakahs
Brilliant! I've been using Xvfb for headless operation, didn't even consider
using VNC.

------
JimmaDaRustla
I am under the assumption that python-requests would have the same issue: it
does not render the page, it only retrieves the original page response.

Very, very good to know when diving into scraping.
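
Concretely: the markup a plain client retrieves still contains the script text, but nothing ever runs it, so JS-populated containers stay empty. A self-contained illustration with made-up markup, using only the stdlib:

```python
from html.parser import HTMLParser

# What the server actually sends: an empty container that the page's
# JavaScript would fill in after load. (Made-up markup.)
RAW_HTML = """
<html><body>
  <div id="app"></div>
  <script>document.getElementById('app').innerHTML = 'hello';</script>
</body></html>
"""

class VisibleText(HTMLParser):
    """Collect the text a non-rendering client actually sees."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())

parser = VisibleText()
parser.feed(RAW_HTML)
# 'hello' is sitting in the script text but is never executed, so the
# visible page contains no text at all -- a headless browser would differ.
```
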

