
RoboBrowser: Your friendly neighborhood web scraper - pmoriarty
https://github.com/jmcarp/robobrowser
======
aexaey
I'm surprised nobody has mentioned WWW::Mechanize - the classic Perl library [1]
or the Python port of it [2], which is much closer to RoboBrowser than
selenium/phantomjs/horseman.

[1] [http://search.cpan.org/~ether/WWW-Mechanize-1.75/lib/WWW/Mechanize.pm](http://search.cpan.org/~ether/WWW-Mechanize-1.75/lib/WWW/Mechanize.pm)

[2]
[https://pypi.python.org/pypi/mechanize/](https://pypi.python.org/pypi/mechanize/)

~~~
Dolores12
Mechanize is outdated and python 2 only. We tried it and switched to
RoboBrowser.

~~~
jimmaswell
In some significant ways, Python 2 is a better language than 3, though.

~~~
tmerr
In what ways?

~~~
popey456963
Python 3 fan myself, but here are three things I've heard from colleagues:

1\. More modules for Python 2 than Python 3. A lot of projects are forced to
drop down to Python 2 so they can use the modules they want.

2\. It's what they're familiar with. Generally the older developers like using
Python 2 because they know exactly what they're getting.

3\. Better syntax. Apparently some people prefer the Python 2 syntax to
Python 3's. Not needing brackets for print statements seems to be the biggest
plus, even though in my opinion the function form looks more Pythonic.

------
est
I wish scrapers came in the form of a Chrome extension: it would record my
webpage actions as macros, then execute the macros on a remote headless server
without downtime, with periodic revisits. No need to program or configure
anything.

~~~
popey456963
SeleniumIDE[0] provides a nice and simple way of doing this; it's just a very
simple Firefox addon that lets you record and play back mouse movements,
typing, etc. You can then improve your macro through Selenium WebDriver.

[0] [http://www.seleniumhq.org/](http://www.seleniumhq.org/)

~~~
enibundo
One problem with Selenium, last time I used it, was that it is very slow.
Maybe this Python library fixes that (i.e. no browser will be shown).

~~~
popey456963
Alas, this is both a downside and an upside of Selenium. It's rather slow
because it does need to spin up a Firefox instance, but it is very user-
friendly and easy to learn because you can see exactly where you are at just
by looking at the web browser.

You can run headless Selenium, and speed it up by using a static Firefox
instance, but even then it'll be maybe 2-3x slower than some of the others.

The only reason this is (in my opinion) better than other solutions is that
you can see the physical webpage it's loading, and the sheer ease of use. You
don't even really need any coding experience to get a simple test running.

~~~
seanp2k2
You can record tests in Selenium or something like Capybara and replay them
using something like PhantomJS, which is a headless browser-like JS execution
environment that does things like generate a would-be DOM:
[https://github.com/jnicklas/capybara](https://github.com/jnicklas/capybara)

You can also use Selenium tests with things like
[https://www.browserstack.com/automate](https://www.browserstack.com/automate)
, where TL;DR they run your selenium test on dozens of browser + platform
combinations and send you the results, like screenshots and any javascript
errors. If you're familiar with CI stuff, you can see how powerful this has
the potential to be. It's non-trivial but very possible to run your own
cluster of selenium nodes as well; check out the official Selenium Grid:
[http://www.seleniumhq.org/projects/grid/](http://www.seleniumhq.org/projects/grid/)

~~~
enibundo
what is CI? ...CLI?

~~~
shrikant
Continuous Integration.

------
markbnj
Does this run JavaScript on the page? I've done quite a bit of scraping with
Scrapy, and have had to use PhantomJS in many cases because static HTML
doesn't get you what you're after.

~~~
tekacs
At a glance, no - it uses Requests to fetch pages and BeautifulSoup to parse
them, the latter of which only parses the HTML into a document object.

So static HTML parsing only.
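To make the distinction concrete, here is a stdlib-only sketch (no network,
invented markup) of what a static parser sees: links present in the served
HTML are found, while anything the page's JavaScript would inject at runtime
simply isn't in the document:

```python
from html.parser import HTMLParser

# The server's response as sent over the wire. The <a> inside the <script>
# would only exist after a browser executes the JS; a static parser treats
# the script body as opaque text.
STATIC_HTML = """
<html><body>
  <a href="/about">About</a>
  <script>document.write('<a href="/dynamic">Dynamic</a>');</script>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags in static markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))

collector = LinkCollector()
collector.feed(STATIC_HTML)
print(collector.links)  # ['/about'] - the JS-injected link is invisible
```

The same limitation applies to anything built on Requests plus an HTML
parser: the DOM that a browser would build after running scripts never
exists.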

------
toasterlovin
I've had a lot of success scraping websites with Capybara [1]. It's intended
for writing acceptance tests of web apps, but it works remarkably well for
scraping websites. It's written in Ruby, but the DSL it provides for
interacting with web pages should be pretty understandable to anybody who's
programmed before. It also supports multiple browsers, which means you can
trade off along these axes:

\- Headless vs. not

\- JS support vs. not

I put a repo together with a sample script [2] for scraping leads off of a
website which I will not name, but whose name rhymes with 'help'. It uses the
PhantomJS browser for headless JS support. It also includes a Vagrantfile so
you can avoid installing all the dependencies on your local machine.

[1]:
[https://github.com/jnicklas/capybara](https://github.com/jnicklas/capybara)

[2]: [https://github.com/toasterlovin/scraping-yalp](https://github.com/toasterlovin/scraping-yalp)

------
facepalm
I love PhantomJS or SlimerJS for scraping. Everything else includes extra
hassles for cookie management, JavaScript emulation, faking user agents and
whatnot. Best to simply use a headless browser. Selenium seems overly
complicated, too.

------
Benfromparis
Interesting for unprotected websites, but it's easy to detect and block: no
valid JS, no valid meta headers, no valid cookies, suspect behavior...

Selenium is a more elaborate solution, but it can still be detected most of
the time.

Disclosure: I'm a DataDome co-founder. If you want to detect bad bots and
scrapers on your website, don't hesitate to try it out for free and to share
your feedback with us: [https://datadome.co](https://datadome.co)

~~~
dchuk
I realize you have reasons not to answer this question, but out of curiosity,
what sorts of things can tip off the fact that a site is being scraped by a
real browser and Selenium?

~~~
Benfromparis
Of course I cannot go into much detail, but we are using behavior detection
and JavaScript tracking (mouse, scroll, screen...).

------
mathheaven
After a glance, it seems that if the page needs JavaScript you should use
Selenium, and otherwise you can use this. So this is like Selenium without
JavaScript. Am I right?

------
r1k
Does it support sites which require a JS enabled browser?

~~~
aexaey
It doesn't. To scrape (or fake-API) js-only websites you have to either:

\- drive a browser (Firefox/Chrome) via the already-mentioned
selenium/webdriver (potentially hiding the actual browser window in a
virtual X server by wrapping the whole thing with xvfb-run),

\- or use one of the webkit-based toolkits: phantomjs [1] or headless horseman
[2].

There is also an interesting project that combines the two: it drives Firefox
(or, more precisely, a slightly outdated version of Gecko) to emulate a
phantomjs-compatible API. [3]

phantomjs/slimerjs are pretty popular and even have tools that run on top of
them, such as casperjs [4], which is geared more toward automated website
testing but can be quite good at scraping or fake-APIing too.

[1] [http://phantomjs.org/](http://phantomjs.org/)

[2] [https://github.com/johntitus/node-horseman](https://github.com/johntitus/node-horseman)

[3] [https://slimerjs.org/](https://slimerjs.org/)

[4] [http://casperjs.org/](http://casperjs.org/)

~~~
brynedwards
I recently wrote a browser-driven scraper using Nightmare[1], which uses
Electron under the hood. Another option for those who prefer python is
dryscrape[2], although I haven't tried it.

[1]
[https://github.com/segmentio/nightmare](https://github.com/segmentio/nightmare)

[2]
[http://dryscrape.readthedocs.io/en/latest/](http://dryscrape.readthedocs.io/en/latest/)

~~~
nikolay
Dryscrape is really cool! Thanks for sharing!

------
pkmishra
What benefit does it provide in comparison to Scrapy?

~~~
alexroan
From what I can tell, having only recently started to use Scrapy, a lot more
"magic", shall we say, happens in the background, so long procedures that
could run to a few hundred lines using bs4/requests/mechanize/etc. can be cut
down to much less. Looking at RoboBrowser, it seems like it will reduce some
of the coding effort, but not to the extent that Scrapy does.

------
pcr0
Hmm, I can see why I'd want to use this library over piecing together requests
and BS4 myself for every project. I love how simple the examples look.

I have a project I'm working on that will involve scraping many different
websites on a daily basis. My only scraping experience so far is using
cheerio[0] to scrape a single page with a 1,000 row HTML table. Should I start
with something BS-based like this or should I jump straight into Scrapy? Or
are there any other alternatives I should try?

[0]:
[https://github.com/cheeriojs/cheerio](https://github.com/cheeriojs/cheerio)

~~~
jakubbalada
If you want to scrape many websites on a daily basis, have a look at
[https://www.apifier.com](https://www.apifier.com) as an alternative.

Disclaimer: I'm a cofounder there

------
gkst
I've used robobrowser for a project, where I needed to log in to a website and
subsequently access pages as a logged in user. It worked well and I like the
API. For "simple" scrapers that require authentication or some form of user
interaction this is a good tool. If I need to scrape many pages from a site as
fast as possible, I'd probably go for Scrapy though.
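What makes the logged-in flow work is session-cookie persistence: the login
response sets a session cookie, and the underlying session replays it on
every later request. A stdlib-only sketch of that mechanism (the header value
below is invented for illustration):

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header a login endpoint might return.
set_cookie = "sessionid=abc123; Path=/; HttpOnly"

# Parse it the way a client-side cookie store would.
jar = SimpleCookie()
jar.load(set_cookie)

# On subsequent requests, a session object sends the stored cookies back
# in a Cookie header, which is what keeps you "logged in":
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in jar.items()
)
print(cookie_header)  # sessionid=abc123
```

Libraries like RoboBrowser handle this bookkeeping for you by reusing one
session across requests, which is why the login survives page to page.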

------
thomasahle
I'd like to write a small scraper for a website that uses NTLM authentication;
the headers it sends are:

    
    
        HTTP/1.1 401 Unauthorized
        Server: Microsoft-IIS/8.5
        WWW-Authenticate: NTLM
        WWW-Authenticate: Negotiate
        ...
    

Does RoboBrowser support these kinds of protocols? I tried to get it to work
with Scrapy, but it seemed non-trivial...
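As far as I can tell, RoboBrowser has no built-in NTLM support; since it
rides on a Requests session, a third-party auth plugin such as requests-ntlm
may be an option, though I haven't verified it with RoboBrowser specifically.
Either way, you can confirm programmatically which schemes a server offers
from the 401 challenge, using only the stdlib (the sample headers mirror the
ones quoted above):

```python
def offered_auth_schemes(headers):
    """Given (name, value) header pairs, as returned by
    http.client.HTTPResponse.getheaders(), list the auth schemes
    the server advertises in WWW-Authenticate headers."""
    return [value.split()[0] for name, value in headers
            if name.lower() == "www-authenticate"]

sample = [
    ("Server", "Microsoft-IIS/8.5"),
    ("WWW-Authenticate", "NTLM"),
    ("WWW-Authenticate", "Negotiate"),
]
print(offered_auth_schemes(sample))  # ['NTLM', 'Negotiate']
```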

~~~
kej
It's been years since I've used it, but I think cntlm can do this. Point your
Scrapy code at the cntlm instance, and it should handle all of the NTLM
headers for you.

------
IanDrake
Just curious... what is everyone using scrapers for?

I've done a lot of work scraping various sites and I can tell you this: basing
any product on your ability to aggregate data via scraping will not work in
the long run.

Eventually you will be asked not to scrape and then you'll get sued if you
don't stop.

Case law is not in your favor here. See Craigslist v. 3Taps.

------
zo1
I had a quick look into the repository and unfortunately, it doesn't support
WebSockets. Does anyone know of a browser automation library/framework that
does support WebSockets?

------
bitfox
What are the differences (advantages) from Selenium WebDriver, and why should
I use it?

------
taesu
If this doesn't run JS, then what's its edge vs. the requests lib?

------
PhasmaFelis
Could someone explain what this is for, maybe with a couple of examples? This
is getting to be a problem on HN.

~~~
tlrobinson
Really? It's right there on the main Github page, a 3 sentence description and
6 code examples.

~~~
PhasmaFelis
I know, I read it. It's for "browsing the web without a standalone web
browser," and I'm sure that if that was something I had needed, I would have
said "Oh! How lovely!" But, since I didn't have that need already, I'm not
clear _why_ someone would want that. And I'd like to know! So could you give
me a couple of practical use cases? "User stories," if you're into that?

~~~
simula67
Here are a couple of use cases:

* Let's say you are Google and you want to test that the site is working correctly every day. You could code up a Python script that opens www.google.com, searches for "facebook", and makes sure that the first result points to www.facebook.com. This script can be configured to run every day, and if someone accidentally pushes an update to the site that causes www.facebook.com to stop showing up as the top result, the script automatically reverts the site back to its original state. This means users continue to get the best search results even if an engineer made a mistake with the ranking algorithm.

* Let's say you are eBay and you want to make sure that the prices for products on your site are competitive with those at Amazon. You can code up a Python script which searches for some products that customers regularly buy, like an iPhone, and extracts the lowest price offered at Amazon. It can then compare that with the lowest price offered for an iPhone on eBay. If the lowest price on eBay is much higher than Amazon's, you can offer a discount. This convinces customers that they are getting competitive offers from eBay and stops them from writing eBay off when they want to shop online.
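The decision logic in the second use case boils down to a one-line price
comparison once the scraping is done; a toy sketch with invented numbers and
an invented threshold:

```python
def needs_discount(our_price, competitor_price, tolerance=0.05):
    """True if our price exceeds the competitor's by more than the
    tolerance (5% by default) - i.e. a discount is warranted."""
    return our_price > competitor_price * (1 + tolerance)

# Suppose the scraper found the lowest iPhone price at the competitor:
print(needs_discount(our_price=699.0, competitor_price=649.0))  # True
print(needs_discount(our_price=655.0, competitor_price=649.0))  # False
```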

