As a very early reader of some of Wes's first chapters, I'm eagerly awaiting the final product. It's some fine work, and Wes is not one to cut corners in his exposition or skim over important things to know.
I hope this book will advance recognition of a great set of standard tools that will make working with imperfect data as painless as possible, in a tech stack that is very amenable to 'productionization'.
Here's a direct link:
I'm buying the hard copy, because I love books, but that doesn't get me in on the early release. They need to offer an "all editions" package.
But about a month ago I realized, "You know, what I really loved about my studies back in my academic years was numbers. That's what I want to do with code; I want to crunch numbers."
This realization made me switch over from learning Ruby to learning Python. This book is going to be perfect for my interests.
Wes, thank you for your efforts in putting this book together.
edit: I take it back about Haskell's list comprehension syntax. At the end of the day they're both great. Haskell's syntax just feels like reading set theory notation.
While Python's functional programming constructs leave a lot to be desired, they are usually good enough for numerical work. Instead of using reduce(...) to sum an array and having it compile down properly, you can just use arr.sum(), which is implemented in C by numpy anyway.
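A quick sketch of the contrast (the variable names here are just illustrative):

```python
# Functional style with functools.reduce versus numpy's vectorized sum,
# which pushes the loop down into C.
from functools import reduce
import operator

import numpy as np

data = [1.5, 2.5, 3.0, 4.0]

# Fold the list with operator.add -- pure-Python, one call per element
total_reduce = reduce(operator.add, data)

# Let numpy do the loop in compiled code
arr = np.array(data)
total_numpy = arr.sum()

print(total_reduce)  # 11.0
print(total_numpy)   # 11.0
```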
I love Haskell (and want it to beat Python), but at this point, Python is the clear winner for numerical work. Libraries matter more than language in this case (as Matlab demonstrates).
Found this old comment relevant: http://lambda-the-ultimate.org/node/2720#comment-40694
But I will be really happy to be proven wrong. I really think lots of things would be easier in Haskell.
Scratch that, the ongoing dev and research work that is making its way onto Haskell's Hackage is both astonishing and often superlative. The libs released in the past month, and the ones I know will be made available over the coming months... it's just great work!
(I'm biased, owing to using some of these tools for fun and profit :-).)
That aside, Pandas is a really nice piece of engineering that just works. It's a good role model for other libs, even when restricted to just its data frame part.
There are a few other projects, not quite public yet, that should make it possible to do some of the standard things you might want, like a nice statically typed, fast data frame (which will look to pandas as an initial role model), along with other parts of the data analysis flow.
I haven't programmed in any substantial quantity or quality since 2004, so I'll start with a more approachable language (Python), but I'll definitely keep an open mind.
I don't know MATLAB, R, etc. well enough to say whether they can use the cloud as easily as Python, but it sure was easy with Python.
R has great libraries but I would prefer to use Python.
Matlab is behind R and Python for both data cleaning and analysis. This is especially true if you have string variables and factor variables. Matlab's great if you are doing matrix operations on clean data (and you don't need to do anything fancy in how you report the results). But I don't find it worth using for real data analysis.
Python and R are both great, though they feel pretty similar to me. Python obviously has the big advantage if you want to do general purpose computing too. Python has been faster than R in most cases I've compared them. For my work, development speed is more important than execution speed... so this hasn't been a huge factor for me.
R has a couple of big advantages. First, the libraries. This is a big deal for me. I saw that there is a python equivalent to ggplot2 in the works. This will definitely strengthen the case for python, but the availability of libraries in R is awesome.
Second, the community and help resources in R are amazing. I rarely run into a problem in R that hasn't already been addressed on stackexchange.com.
Perhaps I should be more proactive about asking python questions when I run into them, but I usually just work it out myself (which is more time-consuming than looking the answer up online).
Lastly, I'm not an expert on big data. But, spending relatively little time with both R's bigmemory and Python's PyTables, it seems easier to get up to speed on big data with R at the moment.
Though I haven't met them, my sense is that Wes, Travis Oliphant and the other relevant python developers are putting in a heroic effort to get Python up to speed. I have every expectation that Python will be my choice of the future.
In general I agree with you; Matlab's age and origins show through in some warty ways, and one of them is string processing. Whenever I have to process anything that's not simple CSV or Excel, I use Python. (For XML, there's Perl's XPath command-line tool, which has come in pretty handy for simple XML extraction.)
That said, however, the Statistics Toolbox has classes dataset, nominal, and ordinal that take a huge amount of the pain out of working with string data. Dataset lets you mix column types and refer to them by name, and lets you name rows if you like. I think it's similar to a dataframe in R. Nominal and ordinal are efficient representations for string columns. They are a workaround for Matlab's lack of a runtime string pool, but also are fast and small.
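For comparison, a minimal pandas sketch of those same ideas, mixed column types, columns addressed by name, and labeled rows (the column names and values here are invented for illustration):

```python
import pandas as pd

# A DataFrame mixes column types and lets you refer to columns by name,
# much like Matlab's dataset class or an R data.frame.
df = pd.DataFrame(
    {"city": ["Boston", "Austin", "Denver"],   # string column
     "population": [618000, 790000, 600000]},  # numeric column
    index=["a", "b", "c"])                     # labeled rows, R-style

print(df["city"]["b"])           # look up a column, then a row label
print(df.loc["c", "population"]) # or address both at once with .loc
```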
Edit: re-reading your last paragraph, I guess you're saying that Python can reach parity with R's libraries, at which point its elegance and speed will win out. R's decades of lineage do seem to hobble it in terms of style and syntax, after all.
As an example, I like the ability to use expressions on the left side of an assignment (e.g. names(df) = "stuff"). But it sounds like you are right that the python developers get to learn from R's mistakes and avoid getting locked into legacy ideas.
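A minimal pandas sketch of the analogous operation, assuming you just want to rename all the columns at once:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# R lets you write names(df) <- c("a", "b"); the pandas analogue is
# assigning a new list to the columns attribute:
df.columns = ["a", "b"]

print(list(df.columns))  # ['a', 'b']
```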
As far as libraries... R has a lot, so I don't expect python to totally catch up soon. But I only use 10 or 15 R libraries, and those are really popular ones. So, unless you do an incredible range of stuff, python probably doesn't have to completely catch up.
One major advantage for R is the package management system (CRAN). The uniformity of the interface... the ability to search for stuff in it... that's been really useful for me. Not sure if anything like that is in the works for python.
Lastly, there are a lot of little helper functions that I've written for myself in python that are part of base R. The first example that comes to mind is head(), to view the top few lines of a data structure. It seems strange that python would be missing these little helpers, but I never found them.
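For what it's worth, pandas objects do carry a head() method, and a generic version for arbitrary iterables is only a few lines with itertools (the helper name head below just mirrors R's; it's a sketch, not a standard function):

```python
from itertools import islice

import pandas as pd

def head(data, n=6):
    """Return the first n items of any iterable, R-style."""
    return list(islice(iter(data), n))

print(head(range(100), 3))  # [0, 1, 2]

# pandas DataFrames ship their own method for this:
df = pd.DataFrame({"x": range(10)})
print(df.head(3))  # first three rows
```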
dot(A, dot(B, inverse(A)))
Numpy has a "matrix" type, so you can write:

import numpy

A = numpy.matrix('1 2; 3 4; 5 6')
B = numpy.matrix('1 2 3; 4 5 6')
C = numpy.matrix('1; 2; 3')

A * B * C      # * is matrix multiplication for the matrix type
(A * B) * C    # same result
A * (B * C)    # same result
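The same computation works on plain ndarrays with numpy.dot; a quick sketch checking that both parenthesizations agree:

```python
import numpy as np

# Plain ndarrays: * would be elementwise, so chain numpy.dot instead.
A = np.array([[1, 2], [3, 4], [5, 6]])  # 3x2
B = np.array([[1, 2, 3], [4, 5, 6]])    # 2x3
C = np.array([[1], [2], [3]])           # 3x1

left = np.dot(np.dot(A, B), C)   # (A * B) * C
right = np.dot(A, np.dot(B, C))  # A * (B * C)

print(left.ravel())  # [ 78 170 262]
# Matrix multiplication is associative, so both orders agree:
assert (left == right).all()
```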
Amusing coincidence: this article appeared on Slashdot this Wednesday:
Comparing R, Octave, and Python for Data Analysis: