
Python for Data Analysis (new O'Reilly book from creator of Pandas) - plessthanpt05
http://shop.oreilly.com/product/0636920023784.do
======
wesm
Note, this is the "Early Release" version, code for "not done yet". I'll be
putting the wraps on the full draft with luck by end of June.

~~~
TheCowboy
Consider browsing the criticism of Data Analysis with Open Source
Tools (O'Reilly) for suggestions. Many people, including myself, expected a
book that would have been more in line with what your book is probably trying
to accomplish.

~~~
wesm
I was also very disappointed with DAwOST; it's a book about data analysis
techniques with a bit of code mixed in. This is a book about _programming
tools for working with data_, pretty much the exact opposite.

------
hkmurakami
As I get my feet wet again in programming, I've gone through phases in
dabbling with web, app, and game programming.

But about a month ago I realized, "You know, what I really loved about my
studies back in academic years was _numbers_. That's what I want to do with
code; _I want to crunch numbers_ "

This realization made me switch over from learning Ruby to learning Python.
This book is going to be _perfect_ for my interests.

Wes, thank you for your efforts in putting this book together.

~~~
Cieplak
If you really want to crunch numbers, I highly recommend Haskell. It falls
short of Python in terms of libraries and by some accounts, readability, but
it is substantially faster and excels at number crunching. I find it easier to
express number crunching in functional rather than object-oriented style
programming. Python has lambdas, but usually there's a more pythonic way to
express something in Python. Although I must say, I find Haskell's list
comprehension syntax slightly more elegant.
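
To make the lambda-versus-pythonic point concrete, here is a minimal sketch.
The `values` list is made-up illustration data, and the Haskell line in the
comment is only there for comparison.

```python
# Hypothetical sample data, for illustration only.
values = [1, 2, 3, 4, 5, 6]

# Functional style: lambdas passed to map() and filter().
squares_fn = list(map(lambda x: x * x, filter(lambda x: x % 2 == 0, values)))

# The comprehension form that is usually considered more pythonic.
squares = [x * x for x in values if x % 2 == 0]

# Rough Haskell counterpart, for comparison:
#   [x * x | x <- values, even x]

# Both Python forms build the same list.
assert squares_fn == squares
```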

<http://tryhaskell.org/>

edit: I take it back about Haskell's list comprehension syntax. At the end of
the day they're both great. Haskell's syntax just feels like reading set
theory notation.

~~~
yummyfajitas
I'm not sure I believe you that Haskell is faster than Python. In general,
most of your python operations will be raw matrix operations handled by
blas/lapack/etc.

While Python's functional programming constructs leave a lot to be desired,
they are usually good enough for numerical work. Instead of using reduce(...)
to sum an array and having it compile down properly, you can just use
arr.sum(), which is implemented in C by numpy anyway.
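
A small sketch of the two approaches, assuming NumPy is installed. The array
contents are arbitrary sample data, and `functools.reduce` is the Python 3
home of `reduce`.

```python
from functools import reduce
import operator

import numpy as np

# Arbitrary sample data: the integers 1 through 100.
arr = np.arange(1, 101)

# Functional style: fold addition over the array, one element at a time,
# with every step going through the Python interpreter.
total_reduce = reduce(operator.add, arr)

# Vectorized style: the summation loop runs inside NumPy's compiled code.
total_sum = arr.sum()

# Both approaches agree on the result; they differ in where the loop runs.
assert total_reduce == total_sum
```

On large arrays the second form is typically much faster, since the per-element
work never goes through the interpreter.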

I love Haskell (and want it to beat Python), but at this point, Python is the
clear winner for numerical work. Libraries matter more than language in this
case (as matlab demonstrates).

~~~
Cieplak
I think you're right that Python will be faster than Haskell when it's calling
blas/lapack for matrix operations.

Found this old comment relevant:
<http://lambda-the-ultimate.org/node/2720#comment-40694>

------
th0ma5
Python is great for doing weird transformations of data that ideally you
shouldn't have to do, or normalizing legacy things into modern standards. I
like developing in several languages, but I often pop open Python to really
"look" at something. I sadly haven't had much of a reason (yet!) to use Pandas
too much, but anything we can do to get the word out about these kinds of
techniques will hopefully increase procedural literacy in the world, and even
help ease the flow of information.

------
look_lookatme
Don't forget to brush up on the basics (with Python!):

<http://greenteapress.com/thinkstats/>

~~~
neves
And don't forget to mention that the author publishes a free PDF version on
the website above. I'm reading it now on my Kindle.

------
joshu
I'm thrilled that Pandas is named Pandas and not Pyndas.

------
monatron
Pandas is an extremely solid, well-tested and super-powerful library. Congrats
to Wes on the release of this book.

------
wgrover
wesm, do you cover using cloud-based computing resources for tackling
computationally intensive data analysis problems? I use Python and PiCloud to
turn what is an hours-long data analysis problem on my Mac into a minutes-long
job on Amazon's hardware.

I don't know MATLAB, R, etc. well enough to say whether they can use the cloud
as easily as Python, but it sure was easy with Python.

~~~
wesm
The focus is primarily on data manipulation / preparation / cleaning /
processing tools, data algorithms, visualization, and practical analysis case
studies. The platform where you use them is a separate matter-- I will have
some material on distributed computing tools, however.

~~~
wgrover
Sounds good. I can't tell you how thrilled I am to have a book like this
coming out. I'm very interested in using it as a text for an instrumental
analysis class I'm putting together. Thanks so much!

------
apechai
Does anyone have experience with how Python compares to R, Matlab (Octave) and
other tools for data analysis?

R has great libraries but I would prefer to use Python.

~~~
dbecker
I've done a lot of work using both R and Python, and a little bit using Matlab
(and SAS and Stata).

Matlab is behind R and Python for both data cleaning and analysis. This is
especially true if you have string variables and factor variables. Matlab's
great if you are doing matrix operations on clean data (and you don't need to
do anything fancy in how you report them). But, I don't find it worth using
for real data analysis.

Python and R are both great, though they feel pretty similar to me. Python
obviously has the big advantage if you want to do general purpose computing
too. Python has been faster than R in most cases I've compared them. For my
work, development speed is more important than execution speed... so this
hasn't been a huge factor for me.

R has a couple of big advantages. First, the libraries. This is a big deal for
me. I saw that there is a python equivalent to ggplot2 in the works. This will
definitely strengthen the case for python, but the availability of libraries
in R is awesome.

Second, the community and help resources in R are amazing. I rarely run into a
problem in R that hasn't already been addressed on stackexchange.com.

Perhaps I should be more proactive about asking python questions when I run
into them, but I usually just work it out myself (which is more time consuming
than looking the answer up online.)

Lastly, I'm not an expert on big data. But, spending relatively little time
with both R's bigmemory and Python's PyTables, it seems easier to get up to
speed on big data with R at the moment.

Though I haven't met them, my sense is that Wes, Travis Oliphant and the other
relevant python developers are putting in a heroic effort to get Python up to
speed. I have every expectation that Python will be my choice of the future.

~~~
clebio
This is a really detailed comparison and seems balanced. Thanks for that. Do
you not use both in conjunction (R for the stats libraries and Python for the
more general computing)? Having done some background reading (but without yet
getting down into the weeds with them), that was my take on the relationship.

Edit: re-reading your last paragraph, I guess you're saying that Python can
reach parity with R's libraries, at which point its elegance and speed will
win out. R's decades of lineage do seem to hobble it in terms of style and
syntax, after all.

~~~
dbecker
Even though I hear complaints about R's syntax, I don't know exactly what
people dislike about it. In fact, I kind of like R's syntax.

As an example, I like the ability to use expressions on the left side of an
assignment (e.g. names(df) = "stuff"). But, it sounds like you are right that
the python developers are getting to learn from R's mistakes and avoid getting
locked into legacy ideas.

As far as libraries... R has a lot. So I don't expect python to totally catch
up soon. But, I only use 10 or 15 R libraries, and those are really popular
libraries. So, unless you do an incredible range of stuff, python probably
doesn't have to completely catch up.

One major advantage for R is the package management system (CRAN). The
uniformity of the interface... the ability to search for stuff in it... that's
been really useful for me. Not sure if anything like that is in the works for
python.

Lastly, there are a lot of little helper functions that I've written for
myself in python that are part of base R. The first example that comes to mind
is head() to view the top few lines of a data structure. It seems strange that
python would be missing these little helper functions, but I never found them.
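
For what it's worth, a helper along these lines takes only a few lines of
standard-library Python. This is a minimal sketch: the name `head` and the
default of 5 items are choices made here for illustration, and `records` is
made-up data. Current pandas DataFrame and Series objects also provide a
built-in `head(n)` method.

```python
from itertools import islice


def head(data, n=5):
    """Return the first n items of any iterable as a list."""
    # islice stops after n items, so this also works on generators and
    # other lazy iterables without consuming them fully.
    return list(islice(data, n))


# Hypothetical records, for illustration only.
records = [("row%d" % i, i * 10) for i in range(100)]

# Peek at the first three records.
preview = head(records, 3)
```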

------
salimmadjd
I recommend reading this blog, for those interested in crunching numbers in
Python: <http://slendrmeans.wordpress.com/>

------
ruffyen
Looking at the table of contents, this seems like a great reference book for
several technologies. I think I am sold, especially with the free updates.

------
brianobush
does this include numpy, or is it just pandas?

~~~
wesm
NumPy, IPython, matplotlib, pandas, a little bit of SciPy, scikit-learn,
statsmodels, and plenty other fun stuff

~~~
neovive
Sold! Any suggested prerequisites?

~~~
wesm
Some working general purpose Python knowledge would be helpful but not
required

------
npcomplete
Can we get the table of contents?

~~~
pwang
If you're in the LinkedIn "Python Data" group, I posted a comment with a list
of the ToC (hope you don't mind, Wes!)

[http://www.linkedin.com/groupItem?view=&gid=4388870&...](http://www.linkedin.com/groupItem?view=&gid=4388870&type=member&item=115808571&commentID=81103690&qid=043286c4-a34b-4614-8cde-95b8a79b8087&trk=group_most_popular-0-b-cmn&goback=%2Egmp_4388870#commentID_81103690)

------
neves
Where's the table of contents?

------
sodelate
can we use it for trade analysis?

~~~
wesm
Without a doubt (how I got my start in Python)

