

Rvest: Easy web scraping with R - hadley
http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/

======
12423gsd
The RStudio guys have really made R a pleasure to use. Thank you guys!

The core language is still a confusing mess (I'm still never sure when to use
a matrix, a dataframe, a list..), but if you use their tools you can ignore it
for the most part.

In under 10 lines you can massage data and generate fantastic graphics.
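As a sketch of what that workflow looks like (an illustrative example, not from the post; it summarises R's bundled mtcars dataset with dplyr):

```r
library(dplyr)

# Group the built-in mtcars data by cylinder count and
# compute the mean mpg and horsepower for each group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), mean_hp = mean(hp)) %>%
  arrange(cyl)

print(by_cyl)
```

Piping the result into ggplot2 for a plot adds only a couple more lines.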

A little off topic: but does anyone know what their business model is? Are
they going to run out of money and burnout in a year or two?

~~~
smachlis
Here's how I think of it, which has been working for me:

matrix - If you have data that would make sense to be in a spreadsheet-type
format and all your data are numbers.

dataframe - If you have data that would make sense to be in a spreadsheet-type
format and some columns are numbers but other columns are something else
(character strings, dates, TRUE/FALSE); but each column is only one thing.
That is, you have one column that's all dates, another column that's all
numbers, yet another column that's all character strings, etc.

list - if you need to mix data types within a certain entity (vector or column
of data).
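The three cases above, as a minimal sketch (the example values are made up):

```r
# matrix: spreadsheet-shaped, every cell a number
m <- matrix(1:6, nrow = 2, ncol = 3)

# data frame: spreadsheet-shaped, each column a single type
df <- data.frame(
  date  = as.Date(c("2014-11-24", "2014-11-25")),
  count = c(10, 20),
  label = c("a", "b"),
  stringsAsFactors = FALSE
)

# list: mixed types within one container
l <- list(name = "rvest", version = 0.1, tags = c("html", "scraping"))

str(m); str(df); str(l)
```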

~~~
jowiar
To piggyback on what hadley said a bit, I find thinking of a data frame as a
"collection of records", and a matrix as "two dimensional data" to be a bit
better.

One useful question to ask is "Does it make sense to sort this data by
something?" If so, you have a data frame. Whereas if you want to
perform matrix math on something (inverting it, multiplying it by another
matrix, reducing it, etc.), you have a matrix. Things that I use a matrix for
can generally also be expressed as a data frame with columns rowId, colId, and
value. If it doesn't make sense in that format, a matrix is generally not the
appropriate structure.
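That rowId/colId/value correspondence can be sketched in a few lines of base R (the matrix contents here are made up; `rowId`, `colId`, and `value` are the column names from the comment above):

```r
# A small named matrix
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# The same data expressed as a long data frame.
# as.vector() unrolls the matrix column-major, so rowId cycles
# fastest and colId repeats once per column.
long <- data.frame(
  rowId = rep(rownames(m), times = ncol(m)),
  colId = rep(colnames(m), each  = nrow(m)),
  value = as.vector(m),
  stringsAsFactors = FALSE
)

print(long)
```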

~~~
hadley
That's a great explanation! Data frame for data analysis; matrix for math.

------
earino
The community and ecosystem around R is rapidly changing and adapting. R has a
long and storied history as a niche language for statistics and analysis. Much
like those disciplines have entered the mainstream of modern technology
enabled businesses, so follows the R ecosystem. Previously laborious tasks are
being revamped with elegant new APIs, as rvest does for scraping (and dplyr
for data manipulation, lubridate for dates, etc...)

Performance has also historically been an R bugaboo, but with changes to R's
copy-on-write semantics and other optimizations in the base language, current
benchmarks show it behaving on par with Python and other dynamic languages (if
not slightly better with tools such as dplyr and data.table.)

The magrittr package's implementation of a "pipe semantic" (often considered
the only truly successful implementation of a 'component architecture') and
the adoption of that model by tools such as rvest are really allowing the
functional, vectorized nature of R to shine through. These are really darned
exciting times to be a part of this community!
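The pipe reads like this in practice (an illustrative sketch on built-in data, not from the comment):

```r
library(magrittr)

# Without the pipe, the steps read inside-out:
round(mean(head(mtcars$mpg, 10)), 1)

# With %>%, the same computation reads left to right:
result <- mtcars$mpg %>%
  head(10) %>%
  mean() %>%
  round(1)

print(result)
```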

~~~
elliott34
Do you have any links to articles on the current benchmarks that you mention?
Not being snide just curious to read more.

~~~
hadley
I'd start here:
[https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-G...](https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping)

Performance isn't currently a huge focus for dplyr. In my opinion dplyr is
fast enough that the bottleneck becomes mostly cognitive - you spend more time
thinking about what you want to do than actually doing it.

------
danso
I guess I really do skip over who the submitter is when checking out HN
links...if this submission had been titled, "Rvest: Easy web scraping R
library by Hadley Wickham", I would've immediately been non-skeptical.

It looks like rvest intends to be the equivalent of Mechanize, with stateful
navigation in the works. Is there an R equivalent to just Beautiful Soup or
Nokogiri?

~~~
hadley
What are you looking for? rvest should support all the navigation tools from
beautiful soup/nokogiri (unless I've missed something), but currently doesn't
have any support for modifying the document (in which case I think your only
option is the XML package).
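A sketch of the kind of CSS-selector navigation being discussed (the HTML fragment is made up; this uses `read_html()`, the parsing entry point in current rvest/xml2, which postdates this thread's `html()`):

```r
library(rvest)

# Parse an inline HTML fragment -- no network needed
page <- read_html("<html><body>
  <div class='post'><span class='author'>hadley</span>
    <p>rvest makes scraping easy.</p></div>
  <div class='post'><span class='author'>danso</span>
    <p>Is there an R BeautifulSoup?</p></div>
</body></html>")

# Select nodes with a CSS selector, as in Beautiful Soup / Nokogiri
authors <- page %>% html_nodes(".post .author") %>% html_text()
print(authors)
```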

~~~
danso
No you didn't miss anything...I meant if there were standalone parsers for
R...Mechanize uses Nokogiri/Soup as a dependency.

~~~
hadley
Not that I'm aware of - rvest uses the R package XML which uses the C library
libxml.

------
minimaxir
Will this work with https websites?

One of the reasons I learned Python for data scraping was that R in general
does not play nice with https (RCurl requires a certificate and even then it's
pretty fussy)

~~~
hadley
Yes. It uses httr which wraps RCurl in such a way that everything should just
work.

------
hudibras
Hadley Wickham writes R packages faster than I can read the documentation on
them.

A true 10xer.

