
Webscraping with Rvest - hanginghyena
http://www.programmingr.com/content/webscraping-rvest-easy-mba-can/
======
minimaxir
Rvest works fine with tabular data. If, however, you are working with data
outside of Wikipedia, you will find that website data is _very rarely_
available in a <table> and is instead part of a hierarchical tree, which is a
pain to process/clean in R.

In such cases, working with Python/BeautifulSoup4 and importing the clean and
normalized data into R will save frustration over time, even offsetting the
overhead of using two languages.
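As a sketch of that workflow (assuming BeautifulSoup4 is installed; the HTML fragment and field names below are made up for illustration): pull nested items out with CSS selectors, flatten them into rows, and write a CSV that R can then read with `read.csv`.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical nested page fragment of the kind that is painful to clean in R
html = """
<div class="listing">
  <div class="job"><span class="employer">YCombinator</span>
    <span class="position">Engineer</span></div>
  <div class="job"><span class="employer">Example Co</span>
    <span class="position">Analyst</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [
    {
        "employer": job.select_one(".employer").get_text(strip=True),
        "position": job.select_one(".position").get_text(strip=True),
    }
    for job in soup.select("div.job")
]

# Write the normalized rows to CSV for import into R
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["employer", "position"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```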

~~~
haddr
I'll work with any data, as long as it can be easily retrieved with a CSS
selector. Otherwise you'd have problems with any web scraping tool.

~~~
sixtypoundhound
JSON is pretty easy to unpack, if you can figure out the callback that fetches
the data.
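For instance, once you've found the endpoint in the browser's network tab, the payload usually flattens in a few lines (the payload below is made up for illustration):

```python
import json

# Hypothetical payload returned by a data callback/XHR endpoint
payload = '{"results": [{"city": "Boston", "price": 120}, {"city": "Austin", "price": 95}]}'

data = json.loads(payload)
rows = [(r["city"], r["price"]) for r in data["results"]]
print(rows)
```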

~~~
minimaxir
The primary use case for web scraping tools like Rvest is data that _doesn't_
have a JSON endpoint: everything is rendered server-side, or the page is
static.

------
baldfat
The reason so many people were mixing Python code with R was specifically for
this sort of task. Web scraping in R is what finally let me stop reaching for
tools outside of R; I haven't needed one in a few years now, and it is great.

Well done to Hadley Wickham for taking inspiration from libraries like
Beautiful Soup and bringing a great tool to R.

------
haddr
It really looks as easy as it can get. The good part of R is that many R
packages are designed in a similar way: highly specialized functions that each
do one job well. Combining that with %>% makes you really efficient.

------
jbmorgado
This seems like a really intuitive way of getting the tables. What would be
the most similar library in Python for those cases where R isn't available on
the system? (With the permissions on some lab machines, it unfortunately takes
weeks, or forever, to get R installed.)

------
josep2
I've written a few blog posts where I used Rvest to get data and R's great
visualization tools to visualize it. R has a ton of issues as a platform and
language but this is a fantastic package and it has a great ecosystem for
small data (the majority of data).

------
gabrielcsapo
Seeing all of these web scraping frameworks, doesn't it tell you that the web
needs more widespread semantic markup?

  <jobs-list>
    <job>
      <employer>YCombinator</employer>
      <position>...</position>
    </job>
  </jobs-list>

Something like that?

I know what everyone will say: it's terse and convoluted. But maybe something
like

  <ul semantic-markup="jobs">
    <li semantic-markup="job">
      <p semantic-markup="job-employer">YCombinator</p>
      <p semantic-markup="job-position">...</p>
    </li>
  </ul>

Seems like a lot of work though...maybe I take that back.
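For what it's worth, attribute-based markup like that would make extraction nearly mechanical with CSS attribute selectors. A sketch with BeautifulSoup, using the attribute names from the proposal above (the "Engineer" value is a placeholder):

```python
from bs4 import BeautifulSoup

# The attribute-tagged markup proposed above, with placeholder values
html = """
<ul semantic-markup="jobs">
  <li semantic-markup="job">
    <p semantic-markup="job-employer">YCombinator</p>
    <p semantic-markup="job-position">Engineer</p>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
employers = [
    p.get_text(strip=True)
    for p in soup.select('[semantic-markup="job-employer"]')
]
print(employers)
```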

~~~
nathancahill
If websites wanted to make their data accessible, they would create APIs. Data
on websites is inaccessible for a reason.

------
ankimal
I've had great success with Ruby/Mechanize for regular HTML scraping and
PhantomJS for dynamic page scraping.

------
data_spy
Rvest is for web scraping newbs. A more seasoned R user would still use
PhantomJS and RSelenium, since they actually collect all of a page's
information, while Rvest only collects a portion of it. Try it on
washingtonpost.com and you will see.

~~~
baldfat
> Rvest is for web scraping newbs

Downvoted for calling people newbs. Also, it always depends on what tool works
best.

