
Upton: A Web Scraping Framework - t1c1
http://www.propublica.org/nerds/item/upton-a-web-scraping-framework
======
ricardobeat
Looks like a very nice integrated solution for data scientists and
researchers. It's interesting to see the different shapes tools like this
take, depending on their target users. I've been happy with node.js +
cheerio[1] and its simple, jQuery-like API:

    request = require 'request'
    cheerio = require 'cheerio'

    # note: request's callback receives (err, response, body)
    request 'http://website.com/list_of_stories.html', (err, res, body) ->
        $ = cheerio.load(body)
        callback $('#comments li a.commenter-name').map((i, el) -> $(el).text()).get()

Plus, if you need to handle JavaScript/AJAX, just swap that out for
jsdom/chimera with minor changes.

[1] [http://npmjs.org/package/cheerio](http://npmjs.org/package/cheerio)

~~~
chuckd1356
I found using cheerio with request and phantom.js makes anything possible.

[https://github.com/mikeal/request](https://github.com/mikeal/request)

[https://github.com/sgentle/phantomjs-node](https://github.com/sgentle/phantomjs-node)

------
danso
Cool library, Jeremy...another Ruby scraping framework you might want to check
out and improve upon is Artsy's Spidey:

[https://github.com/joeyAghion/spidey](https://github.com/joeyAghion/spidey)

It has a similar approach, but also leaves storage (and caching) up to the
end user.

~~~
jeremybmerrill
Thanks, Dan. I'll check that out, especially for good ideas on how to solve
things I haven't solved yet. Spidering and scraping seem to be very related,
but not quite the same -- and I admittedly know nothing about spidering.

------
miket
Instead of (or in addition to) manually creating scrapers, you can use Diffbot
to automatically extract this type of information from news articles using
computer vision:
[http://diffbot.com/products/automatic/article](http://diffbot.com/products/automatic/article).
It also allows you to create rules with a WYSIWYG editor:
[http://diffbot.com/products/custom/](http://diffbot.com/products/custom/)

~~~
Ecio78
I was starting a project using Scrapy: essentially, I was going to query a
page every day (a search page accessed via a POST call), get the results (a
list of links), and then fetch every single result page (downloading an XML
file).

The project was paused, but I'm thinking about restarting it, and I was
wondering whether something like Diffbot or import.io could be useful for me.
Any experience doing this kind of thing?
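
Roughly, the Scrapy version I had looked like this sketch (the URL, form
fields, and selectors are invented for illustration):

    import scrapy

    class SearchSpider(scrapy.Spider):
        name = "daily_search"

        def start_requests(self):
            # The search page is reached via a POST call.
            yield scrapy.FormRequest(
                "http://example.com/search",
                formdata={"query": "something"},
                callback=self.parse_results)

        def parse_results(self, response):
            # The result is a list of links to single result pages.
            for href in response.css("ul.results a::attr(href)").extract():
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_result)

        def parse_result(self, response):
            # Each result page points at an XML file to download.
            yield {"page": response.url,
                   "xml": response.css("a.xml::attr(href)").extract_first()}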

------
jliechti1
Does Upton have any way of dealing with JavaScript or AJAX calls? For a lot of
the scraping I do in Python, this is crucial. I currently use Selenium's
WebDriver (along with Beautiful Soup or lxml) for that - definitely open
to other options.
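
For reference, my usual pattern is roughly this sketch (the URL and selector
are placeholders):

    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://example.com/ajax-heavy-page")
    # The browser executes the page's JavaScript; in practice you'd add
    # an explicit wait here for the AJAX content to arrive.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print([a.get_text() for a in soup.select("#results a")])
    driver.quit()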

~~~
a8da6b0c91d
There's phantomjs for that, though I've actually had the most success with
Perl's WWW::Mechanize::Firefox and a headless X server.

~~~
kanzure
You don't even need phantomjs, really. You can use Python + webkitgtk+ through
the GObject bindings. The problem with phantomjs is that it's using an old
version of WebKit from an old version of QtWebKit from Qt 4.8, whenever that
was released. By comparison, webkitgtk+ can be compiled from upstream WebKit
whenever you please.

If you insist on controlling WebKit through JavaScript, you can use
gnome-seedjs. But this is problematic/annoying because there's no CommonJS
implementation yet... in phantomjs you can require() node modules in the
outside context. Not so much in gnome-seedjs.

Also, X means it's not actually headless, even if you're using
xserver-xorg-video-dummy or xvfb. For this reason, phantomjs got rid of the X
requirement a number of versions ago.
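
A bare-bones sketch of the webkitgtk+ approach (untested as written; the
signal and method names are from the WebKitGTK 1.x GObject bindings, so
adjust for your installed versions):

    from gi.repository import Gtk, WebKit

    def dump_rendered_html(url):
        view = WebKit.WebView()

        def on_load_finished(view, frame):
            # Old trick: stash the rendered DOM in the page title,
            # then read it back out through the main frame.
            view.execute_script(
                "document.title = document.documentElement.innerHTML;")
            print(view.get_main_frame().get_title())
            Gtk.main_quit()

        view.connect("load-finished", on_load_finished)
        window = Gtk.OffscreenWindow()
        window.add(view)
        window.show_all()
        view.load_uri(url)
        Gtk.main()

    dump_rendered_html("http://example.com/")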

~~~
dangayle
Do you have any other resources about using webkitgtk with python for this
sort of purpose?

~~~
kanzure
No, I haven't found a definitive tutorial or reference (even for WebKit
itself). In general, search for "import gi.repository.webkit" and you will
find relevant things. I am not very sure how the other phantomjs developers
are learning WebKit things... probably just reading code.

------
terhechte
For the past couple of years, I've always done my web scraping with trusty
Python + Beautiful Soup or ElementTree. I've recently started doing it with
Clojure + Enlive (mostly as an excuse to use Clojure for less academic
exercises), and I really like it.

From that perspective, Upton looks pretty cool, especially the debug mode.

~~~
dangoldin
Have you taken a look at Scrapy ([http://scrapy.org/](http://scrapy.org/))? My
evolution has been from Perl to Python and I recently did a project with
Scrapy that left me pretty happy.

~~~
look_lookatme
Scrapy is really awesome. It's a soup-to-nuts queuing, fetching, and
extraction workflow tool, and if I had to start a larger-than-trivial project
(like an RSS reader or shopping aggregation site) I would base my spider
toolchain on it.

~~~
dangoldin
Yeah - I was definitely impressed by it, and I've only just gotten started. It
felt as if I was able to get rid of all the boilerplate and just focus on
getting the next page that needed to be crawled and the information to
extract.

------
bergie
I've been using [http://diffbot.com/](http://diffbot.com/) for this sort of
stuff, together with [http://oembed.com/](http://oembed.com/)

~~~
shazzdeeds
Been loving Diffbot for my startup. The only downside is that you very rapidly
outgrow the freemium tier under any production load. It works fantastically
well, though.

------
hobonumber1
I use YQL to do web scraping. It lets me do something like this:

`select * from data.html.cssselect where url="www.yahoo.com" and css="#news a"`

Could you elaborate on the benefits of using Upton instead of this?

~~~
jeremybmerrill
AFAICT, YQL can only handle scraping individual pages that way.

Upton can scrape a whole set of pages, as long as you have an index page that
lists the pages you're interested in. Suppose you're interested in HN
commenters on front-page posts: you could specify the front page URL and a
selector for links to comment pages, and Upton would _automatically_ scrape
those pages and return them to you.

Upton could even write the commenter names to a CSV for you with just a
filename and a CSS selector/XPath expression.

It's not stuff you couldn't do with YQL or Python/BeautifulSoup. But it's
stuff that I didn't want to have to write over and over each time I wrote a
new scraper.
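
For instance, here's roughly the boilerplate that Upton absorbs, hand-rolled
with requests + BeautifulSoup (the selectors here are illustrative guesses,
and this is not Upton's API):

    import csv
    import requests
    from bs4 import BeautifulSoup

    INDEX = "https://news.ycombinator.com/"
    index = BeautifulSoup(requests.get(INDEX).text, "html.parser")
    # Links to the "instance" pages -- here, each post's comment page.
    links = [a["href"] for a in index.select("td.subtext a")
             if "item?id=" in a.get("href", "")]

    with open("commenters.csv", "w") as f:
        out = csv.writer(f)
        for link in links:
            page = BeautifulSoup(requests.get(INDEX + link).text, "html.parser")
            for name in page.select("a.hnuser"):
                out.writerow([link, name.get_text()])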

~~~
hobonumber1
Makes sense! Thanks for clarifying that.

------
inovica
This looks really good. We're mainly Python-focused and have been working on a
tool that tries to 'train' a crawler to extract specific elements of a page.
As this is a crawling thread, I hope you don't mind me asking for some advice
on where to take it from here :)

Here's how it currently works:

1) It has a queue of domains that I have pre-processed. For the initial
version I've restricted it to pages that I think are e-commerce, based on $
signs, add-to-cart/basket-type links, etc.

2) There is a visual tool that I then use to select certain parts of the page
- e.g. price, product, image - which I save out as XPaths.

3) Once I have done one URL, I send a crawler to that domain, extract other
pages that fit the profile of an e-commerce page, and try to use the mapping
from step 2 to extract the data (see the sketch at the end of this comment).

I have done a small video to show it in action:

[http://www.screencast.com/t/riB3iiVMiSk](http://www.screencast.com/t/riB3iiVMiSk)

I'm not sure if I'm doing this the right way. If a site/page changes
structure, then I may have to re-map the data. I was hoping that someone would
have some pointers on other ways to do this. I've also had some problems with
JavaScript-heavy sites.

If anyone has any knowledge of screen scraping, and of where it can be done
more automatically, I'd really appreciate a steer!
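
To make step 3 concrete, re-applying a saved mapping is essentially this (the
field names and XPaths are made up for illustration):

    import requests
    from lxml import html

    # A mapping saved out by the visual tool in step 2.
    mapping = {
        "product": "//h1[@class='product-title']/text()",
        "price":   "//span[@class='price']/text()",
        "image":   "//img[@id='main-image']/@src",
    }

    def extract(url):
        tree = html.fromstring(requests.get(url).text)
        record = {}
        for field, xpath in mapping.items():
            matches = tree.xpath(xpath)
            # An empty match is the signal that the page structure
            # changed and the mapping needs to be redone.
            record[field] = matches[0] if matches else None
        return record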

~~~
regularfry
I've done almost exactly this in the past. There's a hell of a lot of fiddling
involved in keeping the XPaths both stable and general enough to be useful.

One approach I found absolutely vital was to have a rewriting, caching proxy
between the crawler and the upstream site. This proxy allowed me to rewrite
the page content into something much simpler for the crawler to get to grips
with (RSS or Atom, say). I used Celerity
([http://celerity.rubyforge.org/](http://celerity.rubyforge.org/)) with a
hacked-on Mechanize API to do the rewriting, which let me handle JS-heavy
pages _almost_ as easily as static HTML ones. My original inspiration for this
was _why's Mousehole (the source for which is here:
[https://github.com/evaryont/mousehole](https://github.com/evaryont/mousehole),
I've got no idea if it runs on recent Rubies).

The proxy also gives you somewhere to raise an alert if, all of a sudden, your
scraping fails because of an upstream change.

One tool I always intended to make some use of, but never got round to, was
Ariel: [http://ariel.rubyforge.org/](http://ariel.rubyforge.org/). It looks
like it ought to be able to totally remove the need to manually extract
xpaths.
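
Stripped to its essence - and sketched here in Python rather than the Ruby I
actually used, with hypothetical selectors - the rewriting proxy did
something like this:

    import requests
    from bs4 import BeautifulSoup

    CACHE = {}

    def simplified(url):
        # Rewrite an upstream page into a simple, feed-like structure,
        # and cache the result so the crawler never re-fetches upstream.
        if url not in CACHE:
            soup = BeautifulSoup(requests.get(url).text, "html.parser")
            items = [{"title": a.get_text(), "link": a.get("href")}
                     for a in soup.select("div.story a.headline")]
            if not items:
                # The alerting hook: an empty rewrite almost always means
                # the upstream markup changed underneath you.
                raise RuntimeError("rewrite produced no items for " + url)
            CACHE[url] = items
        return CACHE[url]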

~~~
inovica
Thanks for this. I'll check these out.

------
MrBlue
Recently I've been using CasperJS for my scraping needs.
[http://casperjs.org/](http://casperjs.org/)

------
valtron
Web::Scraper is a really good one for Perl:
[http://search.cpan.org/~miyagawa/Web-Scraper-0.37/lib/Web/Sc...](http://search.cpan.org/~miyagawa/Web-Scraper-0.37/lib/Web/Scraper.pm)

~~~
draegtun
Typically with Perl there is more than one module for this :)

\- pQuery |
[https://metacpan.org/module/pQuery](https://metacpan.org/module/pQuery)

\- Mojo::UserAgent |
[https://metacpan.org/module/Mojo%3a%3aUserAgent](https://metacpan.org/module/Mojo%3a%3aUserAgent)

\- Scrappy |
[https://metacpan.org/release/Scrappy](https://metacpan.org/release/Scrappy)

\- Web::Query |
[https://metacpan.org/module/Web%3a%3aQuery](https://metacpan.org/module/Web%3a%3aQuery)

\- Web::Magic |
[https://metacpan.org/module/Web%3a%3aMagic](https://metacpan.org/module/Web%3a%3aMagic)

Above are specifically for scraping but one shouldn't forget WWW::Mechanize &
LWP.

My preference over the last few years has been pQuery. However, Web::Query is
Tokuhiro's pQuery _improvement_, and Mojo::UserAgent looks very nifty.

------
Legion
What separates this from Nokogiri? (Don't take that as criticism - more
working code out in the world is better. Just wondering, as I use Nokogiri
heavily for our company chat-bot and couldn't tell the answer at a quick
glance.)

~~~
htp
Reposting a comment by the author from the article:

> Upton depends on Nokogiri, which is basically the BeautifulSoup port for
> Ruby.

> If you just used vanilla Nokogiri, you'd be responsible for writing code to
> fetch, save (maybe), debug and sew together all the pieces of your web
> scraper. Upton does a lot of that work for you, so you can skip the
> boilerplate.

------
joshfraser
Recently I've been using Fake Browser (fakeapp.com) for web scraping. While
it's inefficient for large jobs, it's awesome for hacking together quick
scripts. With Fake you write your scraper in JavaScript and it runs in an
actual browser, so it's a very visual process. It's great for instances where
you want to get past complicated authentication systems without writing code:
just sign in manually and start your script.

------
mindcrime
Sounds pretty cool. Also, for those operating in "java land" there are Commons
HttpClient[1] and Apache Tika[2] which, together, are a pretty potent
combination for scraping web data.

[1]: [http://hc.apache.org/httpcomponents-client-ga/](http://hc.apache.org/httpcomponents-client-ga/)

[2]: [http://tika.apache.org/](http://tika.apache.org/)

------
dome82
What about import.io? Does anyone have experience using it?

Link: [http://import.io/](http://import.io/)

~~~
fourstar
Is this somewhat similar to embedly.com? That's what I currently use, and it
seems to work fine, but I'm curious whether it's the best thing to be using.
(I'm typically just grabbing the image thumbnail, but it'd be nice to grab
some text if it exists too.)

------
nwienert
I've always used anemone, with great results:
[http://anemone.rubyforge.org/](http://anemone.rubyforge.org/)

It seems like it's not being actively developed, but again, I've never had a
problem.

Edit: I realize Upton is focused on single-page scraping with data extraction.
You could use the two together nicely, in fact.

~~~
jeremybmerrill
I had never seen this before. I will check it out, at least for inspiration.

I'd say "scraping" is a little more focused on extracting data from specific
pages as opposed to ALL pages as in "spidering", but the two are certainly
cousins if not siblings. Anemone would probably be good at the same sorts of
tasks Upton is designed for (i.e. scraping data contained on multiple pages).

------
geekymartian
Nice lib - you might take some ideas from pismo:
[https://github.com/peterc/pismo](https://github.com/peterc/pismo). It's more
metadata-oriented, but it returns a Nokogiri doc as well.

~~~
jeremybmerrill
Thanks, will check it out.

------
kingkool68
The PHP Simple HTML DOM class makes scraping just like working with jQuery on
the backend:
[http://simplehtmldom.sourceforge.net/](http://simplehtmldom.sourceforge.net/)

~~~
paulhauggis
I used this for many years, but the memory footprint is terrible.

I would recommend QueryPath instead - it has a very small footprint and takes
a fraction of the CPU time.

------
gavingmiller
For anyone interested in a Node.js crawler, my company recently released
roach:
[https://github.com/PetroFeed/roach](https://github.com/PetroFeed/roach)

------
Shorel
What I need is something that can scrape .NET sites with lots of weird and
signed AJAX stuff just to populate a select control.

I'm using node.js with several libraries and nothing has worked so far.

~~~
gee_totes
You might want to look at one of the headless web browsers, like PhantomJS or
CasperJS:

[http://phantomjs.org/](http://phantomjs.org/)
[http://casperjs.org/](http://casperjs.org/)

~~~
Shorel
Already tried that, and already failed.

Some JS in the page makes webkit die.

I think a headless Firefox is my only hope.

------
praveenhm
Is web scraping legal?

~~~
adamnemecek
If anything, it might be against the TOS, but AFAIK, breaking a TOS is not illegal.

~~~
lsiebert
Who was the guy who got jail time for breaking a TOS? I think it was
iPhone-related.

~~~
adamnemecek
Are you thinking of weev?
[http://en.wikipedia.org/wiki/Weev#AT.26T_data_breach](http://en.wikipedia.org/wiki/Weev#AT.26T_data_breach)

