Python for Data Analysis (new O'Reilly book from creator of Pandas) (oreilly.com)
186 points by plessthanpt05 on May 24, 2012 | 57 comments



Note: this is the "Early Release" version, code for "not done yet". I'll be putting the wraps on the full draft, with luck, by the end of June.


Congrats Wes on this Early Release milestone!

As a very early reader of some of Wes's first chapters, I'm eagerly awaiting the final product. It's some fine work, and Wes is not one to cut corners in his exposition or skim over important things to know.

I hope this book will advance recognition of a great set of standard tools that will make working with imperfect data as painless as possible, in a tech stack that is very amenable to 'productionization'.

Kudos, sir.


Consider browsing the criticism of Data Analysis with Open Source Tools (O'Reilly) for suggestions. Many people, including myself, expected a book more in line with what your book is probably trying to accomplish.


I was also very disappointed with DAwOST; it's a book about data analysis techniques with a bit of code mixed in. This is a book about _programming tools for working with data_, pretty much the exact opposite.


Hey Wes, is there a table of contents available yet?


I'll see if I can get one posted; should be soon.


It seems that if you follow the link to Safari Books Online, you get a table of contents.

Here's a direct link: http://my.safaribooksonline.com/9781449323592?portal=oreilly...


It seems far from final: the outline for Python basics is way more detailed than anything about Pandas.


I was going to ask, but the early release does come with (free!) upgrades towards the final product: http://shop.oreilly.com/category/early-release.do


Stupid publishers.

I'm buying the hard copy, because I love books, but they don't allow me to get in on the early release? They need to offer an "all editions" package.


As I get my feet wet again in programming, I've gone through phases of dabbling with web, app, and game programming.

But about a month ago I realized, "You know, what I really loved about my studies back in my academic years was numbers. That's what I want to do with code; I want to crunch numbers."

This realization made me switch over from learning Ruby to learning Python. This book is going to be perfect for my interests.

Wes, thank you for your efforts in putting this book together.


If you really want to crunch numbers, I highly recommend Haskell. It falls short of Python in terms of libraries and, by some accounts, readability, but it is substantially faster and excels at number crunching. I find it easier to express number crunching in a functional rather than an object-oriented style. Python has lambdas, but usually there's a more Pythonic way to express something. Although I must say, I find Haskell's list comprehension syntax slightly more elegant.

http://tryhaskell.org/

edit: I take it back about Haskell's list comprehension syntax. At the end of the day they're both great. Haskell's syntax just feels like reading set theory notation.
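
For a concrete (if toy) comparison, with the Haskell version shown as a comment:

  # Python: squares of the even numbers below 10
  squares = [x * x for x in range(10) if x % 2 == 0]

  # The Haskell equivalent reads almost like set-builder notation:
  #   squares = [x * x | x <- [0..9], even x]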


I'm not sure I believe you that Haskell is faster than Python. In general, most of your Python operations will be raw matrix operations handled by BLAS/LAPACK/etc.

While Python's functional programming constructs leave a lot to be desired, they are usually good enough for numerical work. Instead of using reduce(...) to sum an array and having it compile down properly, you can just use arr.sum(), which is implemented in C by numpy anyway.
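
A toy illustration of the difference (the array and names are just for the example):

  import operator
  from functools import reduce

  import numpy as np

  arr = np.arange(1000000)

  # Pure-Python reduction: the loop runs element by element in the interpreter.
  slow_total = reduce(operator.add, arr)

  # numpy's sum runs as a single vectorized call into C.
  fast_total = arr.sum()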

I love Haskell (and want it to beat Python), but at this point, Python is the clear winner for numerical work. Libraries matter more than language in this case (as Matlab demonstrates).


I think you're right that Python will be faster than Haskell when it's calling BLAS/LAPACK for matrix operations.

Found this old comment relevant: http://lambda-the-ultimate.org/node/2720#comment-40694


I've never programmed in Haskell, but I have in Clojure, and I must say that functional style and number crunching fit together really nicely. That said, I am not sure you can beat things like NumPy or pandas when it comes to complex/big datasets. I looked around in the Haskell world, and it seems that to do the same as, say, in pandas, you need to write a lot of things yourself, and you can't rely on the same superb documentation.

But I will be really happy to be proven wrong. I really think lots of things would be easier in Haskell.


And there's also a great and growing set of tools for numerical and data-related computation.

Scratch that: the ongoing dev and research work that is making its way onto Haskell's Hackage is both astonishing and often superlative code. The libs released in the past month, and the ones I know will be made available over the coming months... it's just great work! (I'm biased, owing to using some of these tools for fun and profit :-).)

That aside, Pandas is a really nice piece of engineering that just works. It's a good role model for other libs, even when restricted to just its data frame part.


Could you link to some of these libraries? Genuinely curious here (I really want Haskell to win).


The newest ones are meta-par (a work-stealing scheduler, as a library, that can shuttle work between CPU and GPU and distribute it across other machines), and a pending one you can find online: distributed Haskell (the production version is on GitHub, and there's a proof-of-concept package on Hackage called remote). Things that have been around a bit and have ongoing work: repa, dph, accelerate, and to some extent hmatrix.

There are a few other projects that aren't quite public yet that should make it possible to do some of the standard things you might want, like a nice statically typed, fast data frame (which is going to look at pandas as an initial role model), and some other parts of the data analysis flow.


And on the DataFrame front, look to https://github.com/cartazio/HasCol in a few days to a week for a prealpha/alpha look at what shall be the kernel of the Haskell story for a nice data frame. :)


Most number crunching is definitely functional; however, most programming (even in science) is about all the window dressing before you get to the actual heart of the number crunching. For those tasks, an imperative syntax is usually more approachable for those without formal CS backgrounds (especially if they know a little C, FORTRAN, or MATLAB). This is why Python is so popular with the scientific computing crowd.


Thanks for the advice :)

I haven't programmed in any substantial quantity or quality since 2004, so I'll start with a more approachable language (Python), but I'll definitely keep an open mind.


Do keep an open mind, but bear in mind that Haskell is a different style of programming altogether, and there are libraries to speed up number crunching in Python (NumPy).


If you want to crunch numbers, you should keep an eye on the Julia language (http://julialang.org/). It's "a high-level, high-performance dynamic programming language for technical computing". It seems math friendly, with a nice syntax.


Python is great for doing weird transformations of data that ideally you shouldn't have to do, or for normalizing legacy things into modern standards. I like developing in several languages, but I often pop open Python to really "look" at something. I sadly haven't had much of a reason (yet!) to use Pandas much, but anything we can do to get the word out about these kinds of techniques will hopefully increase procedural literacy in the world, and even help ease the flow of information.


Don't forget to brush up on the basics (with Python!):

http://greenteapress.com/thinkstats/


And don't forget to mention that the author publishes a free PDF version on the website above. I'm reading it now on my Kindle.


I'm thrilled that Pandas is named Pandas and not Pyndas.


Pandas is an extremely solid, well-tested and super-powerful library. Congrats to Wes on the release of this book.


wesm, do you cover using cloud-based computing resources for tackling computationally intensive data analysis problems? I use Python and PiCloud to turn what is an hours-long data analysis problem on my Mac into a minutes-long job on Amazon's hardware.

I don't know MATLAB, R, etc. well enough to say whether they can use the cloud as easily as Python, but it sure was easy with Python.
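
For the curious, the basic PiCloud pattern was roughly this (a sketch from memory; compute_stats stands in for your own analysis function):

  import cloud

  def compute_stats(data):
      # Stand-in for the expensive analysis step.
      return sum(data) / float(len(data))

  # Run the call on PiCloud's workers instead of locally:
  jid = cloud.call(compute_stats, range(1000000))  # returns a job id right away
  print(cloud.result(jid))                         # blocks until the job is done

  # Fan the same function out over many inputs in parallel:
  jids = cloud.map(compute_stats, [range(10), range(100), range(1000)])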


The focus is primarily on data manipulation / preparation / cleaning / processing tools, data algorithms, visualization, and practical analysis case studies. The platform where you use them is a separate matter-- I will have some material on distributed computing tools, however.


Sounds good. I can't tell you how thrilled I am to have a book like this coming out. I'm very interested in using it as a text for an instrumental analysis class I'm putting together. Thanks so much!


For R, MATLAB, or Python in a cloud Map/Reduce framework, try out Opani. http://opani.com/help/wiki


Didn't know about that one - thanks for the tip!


Anyone have experience as to how Python compares to R, Matlab (Octave), and other tools for data analysis?

R has great libraries but I would prefer to use Python.


I've done a lot of work using both R and Python, and a little bit using Matlab (and SAS and Stata).

Matlab is behind R and Python for both data cleaning and analysis. This is especially true if you have string variables and factor variables. Matlab's great if you are doing matrix operations on clean data (and you don't need to do anything fancy in how you report the results). But I don't find it worth using for real data analysis.

Python and R are both great, though they feel pretty similar to me. Python obviously has the big advantage if you want to do general-purpose computing too. Python has been faster than R in most cases where I've compared them. For my work, development speed is more important than execution speed... so this hasn't been a huge factor for me.

R has a couple of big advantages. First, the libraries. This is a big deal for me. I saw that there is a Python equivalent to ggplot2 in the works. This will definitely strengthen the case for Python, but the availability of libraries in R is awesome.

Second, the community and help resources for R are amazing. I rarely run into a problem in R that hasn't already been addressed on stackexchange.com.

Perhaps I should be more proactive about asking Python questions when I run into them, but I usually just work it out myself (which is more time-consuming than looking the answer up online).

Lastly, I'm not an expert on big data. But, spending relatively little time with both R's bigmemory and Python's PyTables, it seems easier to get up to speed on big data with R at the moment.

Though I haven't met them, my sense is that Wes, Travis Oliphant, and the other relevant Python developers are putting in a heroic effort to get Python up to speed. I have every expectation that Python will be my choice in the future.


Note: This reply is mostly helpful if you work with legacy Matlab code, or have colleagues who primarily know Matlab, and you have to work with string data.

In general I agree with you; Matlab's age and origins show through in some warty ways, and one of them is string processing. Whenever I have to process anything that's not simple CSV or Excel, I use Python. (For XML, there's the Perl xpath command-line tool, which has come in pretty handy for simple XML extraction.)

That said, however, the Statistics Toolbox has the classes dataset, nominal, and ordinal, which take a huge amount of the pain out of working with string data. Dataset lets you mix column types and refer to them by name, and lets you name rows if you like. I think it's similar to a data frame in R. Nominal and ordinal are efficient representations for string columns. They are a workaround for Matlab's lack of a runtime string pool, but they're also fast and small.


This is a really detailed comparison and seems balanced. Thanks for that. Do you not use both in conjunction (R for the stats libraries and Python for the more general computing)? Having done some background reading (but without yet getting down into the weeds with them), that was my take on the relationship.

Edit: re-reading your last paragraph, I guess you're saying that Python can reach parity with R's libraries, at which point its elegance and speed will win out. R's decades of lineage do seem to hobble it in terms of style and syntax, after all.


Even though I hear complaints about R's syntax, I don't know exactly what people dislike about it. In fact, I kind of like R's syntax.

As an example, I like the ability to use expressions on the left side of an assignment (e.g. names(df) = "stuff"). But it sounds like you are right that the Python developers are getting to learn from R's mistakes and avoid getting locked into legacy ideas.
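
(For comparison, the closest pandas analogue I know of is assigning to the columns attribute; a minimal sketch:)

  import pandas as pd

  df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

  # Renames every column at once, loosely like R's names(df) = c("x", "y"):
  df.columns = ["x", "y"]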

As far as libraries go... R has a lot, so I don't expect Python to totally catch up soon. But I only use 10 or 15 R libraries, and those are really popular ones. So, unless you do an incredible range of stuff, Python probably doesn't have to completely catch up.

One major advantage for R is the package management system (CRAN). The uniformity of the interface... the ability to search for stuff in it... that's been really useful for me. Not sure if anything like that is in the works for Python.

Lastly, there are a lot of little helper functions that I've written for myself in Python that are part of base R. The first example that comes to mind is head(), for viewing the top few lines of a data structure. It seems strange that Python would be missing these little helper functions, but I never found them.
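
The kind of helper I mean is trivial, which makes its absence all the stranger (a minimal sketch):

  from itertools import islice

  def head(data, n=6):
      """Return the first n items of any iterable, like R's head()."""
      return list(islice(data, n))

  head(range(100))  # [0, 1, 2, 3, 4, 5]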


I have been programming in Python too long to make an objective comparison with R. I have had to use R libraries at times, and I've found RPy to be a workable bridge from Python to R for this purpose. Depending on how it works under the hood, it might not be appropriate for big data, though. Also, I had to custom-modify some R libraries to work with my data, so it has been useful to know a bit of both, although I mostly picked up the R as I did the mods.


I would add that an advantage of Matlab for number bashing is its much more native handling of linear algebra. Let's say you have two matrices A and B; in Matlab I could write:

  A*B*A'
Whereas in Python it would be (approximately):

  dot(A, dot(B, A.T))
so Matlab can evaluate everything in the right order (right to left), whereas for numpy I end up writing a little recursive function to dot a list of arguments together from right to left, which feels a bit kludgey and more of an impediment to getting your ideas down in code. Especially when your equations get very big, as they often do with stats!
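
That helper is only a couple of lines, for what it's worth (a sketch; dot_all is just an illustrative name):

  from functools import reduce
  import numpy as np

  def dot_all(*mats):
      # Chain np.dot across any number of matrices.
      return reduce(np.dot, mats)

  # The Matlab expression A*B*A' then reads:
  # dot_all(A, B, A.T)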


If you intend matrix multiplication, there is no "right order": matrix multiplication is associative (http://en.wikipedia.org/wiki/Matrix_multiplication).

Numpy has a "matrix" type, so you can write:

        In [7]: A = numpy.matrix('1 2; 3 4; 5 6')
        In [8]: B = numpy.matrix('1 2 3; 4 5 6')
        In [9]: C = numpy.matrix('1; 2; 3')

        In [10]: A * B * C
        Out[10]:
        matrix([[ 78],
                [170],
                [262]])

        In [11]: (A * B) * C
        Out[11]:
        matrix([[ 78],
                [170],
                [262]])

        In [12]: A * (B * C)
        Out[12]:
        matrix([[ 78],
                [170],
                [262]])


It isn't native, but numexpr is a nice in-between:

http://code.google.com/p/numexpr/
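
A minimal usage sketch (note that numexpr targets element-wise array expressions rather than matrix products):

  import numpy as np
  import numexpr as ne

  a = np.random.rand(1000000)
  b = np.random.rand(1000000)

  # The expression string is compiled and evaluated in one pass over the
  # arrays, avoiding a temporary array for each intermediate result.
  result = ne.evaluate("2*a + 3*b**2")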


"A[n]yone have experience as to how Python compare to R, Matlab (Octave)"

Amusing coincidence: this article appeared on Slashdot this Wednesday:

Comparing R, Octave, and Python for Data Analysis:

http://developers.slashdot.org/story/12/05/23/1956219/compar...


You might want to look into RPy for using R libraries with Python.


I've very briefly experimented with rpy2. It got the job done, but I thought it was tricky enough that I'd want a good reason to combine R and Python. Otherwise, I'd try to do the whole project in one or the other. (And I haven't used rpy2 since I first tried it out.)
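
For reference, the basic rpy2 pattern is short even when the corner cases aren't (a minimal sketch):

  import rpy2.robjects as robjects

  # Evaluate R code from Python; results come back as R vectors.
  m = robjects.r('mean(c(1, 2, 3, 4))')
  print(m[0])  # 2.5

  # Or grab an R function and call it on converted Python data:
  r_sum = robjects.r['sum']
  print(r_sum(robjects.IntVector([1, 2, 3]))[0])  # 6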


For those interested in crunching numbers in Python, I recommend reading this blog: http://slendrmeans.wordpress.com/


Looking at the table of contents, this seems like a great reference book for several technologies. I think I am sold, especially with the free updates.


Does this include NumPy, or is it just pandas?


NumPy, IPython, matplotlib, pandas, a little bit of SciPy, scikit-learn, statsmodels, and plenty of other fun stuff.


Sold! Any suggested prerequisites?


Some working general-purpose Python knowledge would be helpful but not required.


Can we get the table of contents?


If you're in the LinkedIn "Python Data" group, I posted a comment with a list of the ToC (hope you don't mind, Wes!)

http://www.linkedin.com/groupItem?view=&gid=4388870&...


I'll blog/tweet/G+ a ToC pretty soon, doing a bit of reorganization before the next Early Release update


Where's the table of contents?


Can we use it for trade analysis?


Without a doubt (how I got my start in Python)



