
Practical Data Science in Python - doh
http://radimrehurek.com/data_science_python/
======
gallamine
It might be useful to those not familiar with it: this blog post was
written using IPython Notebooks - you can code, plot and then render to HTML
all in the browser. Most of my data science work is done using this format. If
Python isn't your language of choice, there are lots of plugins for IPython
Notebook that let you effectively do in-browser REPL with plotting and
documentation:
[http://ipython.org/notebook.html](http://ipython.org/notebook.html)

It's changed the way I work (and blog).

~~~
spot
If you like that, you might consider Beaker Notebooks, which have some more
advanced UI elements and better polyglot support:
[http://BeakerNotebook.com](http://BeakerNotebook.com)

You can combine multiple languages in the same notebook, and they can
communicate.

~~~
devilsdounut
Interesting. Any advantages of this over IPython? I see fancy rendering of
DataFrames, but there are libraries that let you do this in IPython.

~~~
spot
Thanks.

Beaker has autotranslation for communicating among languages, and you can have
multiple languages in one notebook.

Four months ago someone asked for compare and contrast:
[https://news.ycombinator.com/item?id=8366245](https://news.ycombinator.com/item?id=8366245)

Since then we have fixed a ton of bugs and fleshed out the concept. The
reflection API and JS scriptability are about to be released, along with a
bunch of UI polish, performance, and all kinds of fixes.

There have only been 2-3 of us working on it for 2 years so far. The project
is still young and developing, and in fact we are hiring! We especially need a
JavaScript engineer, Angular experience a plus:
[http://www.twosigma.com/careers/position/935.html](http://www.twosigma.com/careers/position/935.html)

~~~
IanCal
This looks really interesting, thanks. The communication is great; often I
want to munge the data / collect things in Python, then graph them in R.

------
neverminder
There's this thought constantly bugging me - Python is popular among data
scientists, but it also happens to be quite a slow language (roughly speaking)
in comparison to the likes of Java or Go for instance. Hypothetically
speaking, would it not be more beneficial to use something like Rust instead?

~~~
IanOzsvald
Don't forget that the higher-level functionality (e.g. the scikit-learn
routines Radim uses) is typically a wrapper for underlying C/Fortran routines,
and those are the real bottleneck. The relatively few lines of VM'd Python are
'slow' compared to e.g. C but aren't the bottleneck.
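
A rough way to see the gap using only the standard library (a sketch, not a
proper benchmark): in CPython the built-in `sum` runs its loop in C, while a
hand-written loop executes bytecode for every element. The helper name
`python_loop_sum` and the list size are made up for illustration, and the
timings will vary by machine.

```python
import timeit


def python_loop_sum(values):
    """Add the numbers one at a time in interpreted Python bytecode."""
    total = 0
    for v in values:
        total += v
    return total


# Illustrative input size; chosen only so the script finishes quickly.
values = list(range(200_000))

# Both approaches compute the same answer.
loop_result = python_loop_sum(values)
builtin_result = sum(values)  # in CPython, this loop runs in C

# Time a few repetitions of each; absolute numbers depend on the machine.
loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_seconds = timeit.timeit(lambda: sum(values), number=5)

print(f"Python loop:  {loop_seconds:.4f}s")
print(f"built-in sum: {builtin_seconds:.4f}s")
```

The same idea applies at a larger scale to libraries whose heavy lifting
happens in compiled code: the Python you write is mostly glue.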

The win with Python (and other dynamic languages) is that you can experiment
quickly with ideas when you're formulating a solution, that's a big part of
exploratory data science.

If you're curious about high-speed work in Python - Radim did a blog series on
how he sped up word2vec to be faster than Google's original C code:
[http://radimrehurek.com/2013/09/deep-learning-with-word2vec-...](http://radimrehurek.com/2013/09/deep-learning-with-word2vec-and-gensim/)

I'll also note [self promo!] that I wrote a book on High Performance Python,
if that's your cup of tea (and Radim wrote a section in it):
[http://shop.oreilly.com/product/0636920028963.do](http://shop.oreilly.com/product/0636920028963.do)

~~~
danieldk
_The win with Python (and other dynamic languages) is that you can experiment
quickly with ideas when you're formulating a solution, that's a big part of
exploratory data science._

And in my experience, very hard to reproduce after a couple of years. With
enough discipline, it's obviously possible to make well-structured Python
programs that will last. But in practice that rarely happens with scientific
software written in Python. Usually, there are many external dependencies,
it's fragile (no static type checking), and platform-dependent (usually OS X
or Linux). To add to the mess, most scientists like to hardcode paths to the
input data, etc.
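
On the hardcoded-paths point, one low-effort habit that helps is taking the
input location from the command line. The sketch below uses only `argparse`
and `pathlib` from the standard library; the function names, the
`--output-dir` option and the sample file are hypothetical, chosen purely for
illustration.

```python
import argparse
import tempfile
from pathlib import Path


def parse_args(argv=None):
    """Read the data location from the command line instead of hardcoding it."""
    parser = argparse.ArgumentParser(description="Hypothetical analysis script")
    parser.add_argument("input", type=Path, help="path to the input data file")
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("results"),
        help="directory for results (hypothetical option)",
    )
    return parser.parse_args(argv)


def count_lines(path):
    """Stand-in for the real analysis: count the lines in the input file."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def main(argv=None):
    args = parse_args(argv)
    n_lines = count_lines(args.input)
    print(f"{args.input}: {n_lines} lines")
    return n_lines


# Demo with an explicit argument list and a throwaway file. A real script
# would call main() with no arguments so that argparse reads sys.argv.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("a,b\n1,2\n3,4\n", encoding="utf-8")
    demo_count = main([str(sample)])
```

This does nothing about dependencies or platform quirks, but it removes one
common reason an old script fails on someone else's machine.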

Although I am not a fan of Java, I usually don't encounter the same problems
with older scientific Java software. If it's Mavenized you are usually ready
to go after a 'mvn compile', otherwise, you just dump the project structure in
an IDE and it usually works.

(The plague with scientific software in Java is that it is often not
thread-safe.)

Also, I think the quick experimentation is not limited to Python and
statically typed languages with a REPL can also provide that (Haskell, OCaml,
Scala). And since Go was mentioned: since compilation time in Go is usually
near-zero, it's the same.

~~~
gh02t
> And in my experience, very hard to reproduce after a couple of years.

Well, let's be honest with ourselves... this isn't limited to Python.
Scientific code that isn't a mess is almost nonexistent. For a lot of
scientists, writing code is totally secondary and many simply aren't skilled
programmers (nor should we necessarily expect them to be).

It is however deeper than that. As a graduate student, I was involved in a
government initiative to write a high quality large scale code package. This
was (still is, the program just got extended) a well funded and well organized
effort with hundreds of people, including literally dozens of people who can
legitimately claim to be the best in the world at their specialties. This
included some genuinely amazing computer scientists and software engineers who
enforced well planned coding practices.

And yet, the code is still far from ideal. A big part of this is its scale -
millions of lines of very technical numerics code and libraries all working
together. Most of what I consider to be the toughest work was on integrating
various disparate pieces and unifying them under one common input structure.

Point being, even with effectively unlimited resources using rigorous
development standards and statically typed languages (primarily C++11) there
are still tons of issues. A lot of it is because of incorporation of older
codes, which is inescapable in any non-trivial scientific code.

------
onderkalaci
For those looking for data science in Python, this is great. Thanks
for sharing!

------
afarrell
If you're coming from web development and used to using virtualenv, Anaconda
has environment management too. Run $(conda install conda-env). You can still
pip install things into conda environments too. You'll probably want to
$(conda install binstar) and use it to search for various packages that don't
come in stock Anaconda. For example, you can $(conda install --javascript node)

~~~
afarrell
That should be $(conda install --channel javascript nodejs).

------
mslate
I gave a strikingly/humorously similar talk at a meetup in Boston ~1.5 years
ago:

[http://nbviewer.ipython.org/github/mmautner/email_classifier...](http://nbviewer.ipython.org/github/mmautner/email_classifier/blob/master/gmail_importance.ipynb)

