

Data Science Toolbox - rkda
http://datasciencetoolbox.org/

======
jmount
I don't get it. DST installs 3 tools: R, Python, and DST itself. R and its
packages are VERY easy to install, Python and its packages are pretty easy to
install, and you don't need DST until you choose to install DST. It's not like
they are setting up a large, hard-to-configure stack (like
Hadoop/Mahout/Pig/Hive/Spark ...) for you.

~~~
jeroenjanssens
Author here. Although the software contained in the base is indeed relatively
easy to install, you'd be surprised how often people struggle installing
Python and its packages on Mac OS X or Microsoft Windows!

The initial reason I created DST was that I'm currently writing a book called
Data Science at the Command Line. Installing the command-line tools discussed
in this book is unfortunately not straightforward. I wanted to offer potential
readers a Vagrant box (with a shell provisioner) that installs these
command-line tools, similar to what Matthew Russell has done for his book
Mining the Social Web [1]. This blog post [2] provides some more explanation
and compares four different solutions.

The software packages that you mention sure are exciting! If there is an
author or teacher who wants to teach these to her readers or students, then
we should create a bundle (i.e., an Ansible playbook) for them!

[1] [https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition)

[2] [http://jeroenjanssens.com/2013/12/07/lean-mean-data-science-machine.html](http://jeroenjanssens.com/2013/12/07/lean-mean-data-science-machine.html)

~~~
paulgb
It's a pain to install BLAS and the numpy family on a box you don't have root
on, so I definitely see value in the VM approach. Does running in a VM degrade
performance to the point where this is mainly a teaching tool, or is it
comparable to a native install?

------
Terr_
I half-expected a "toolbox" more like Orange.

([http://orange.biolab.si/](http://orange.biolab.si/))

------
jayshahtx
Would love to see a few basic libraries installed ahead of time (NumPy and
SciPy, namely) and hear about performance implications of running virtually.

Cool tool though - I've done data science work for a few companies now and the
most frustrating thing is always getting set up on their stack.

------
paulgb
Appears to be unrelated to the Data Science Toolkit virtual machine image
which frontpaged yesterday.

[https://news.ycombinator.com/item?id=7835097](https://news.ycombinator.com/item?id=7835097)

~~~
jeroenjanssens
That's a nice coincidence! (I can imagine that all these names can be
confusing.) The Data Science Toolkit focuses on lots of interesting APIs. This
blog post [1] provides a brief comparison.

[1] [http://jeroenjanssens.com/2013/12/07/lean-mean-data-science-machine.html](http://jeroenjanssens.com/2013/12/07/lean-mean-data-science-machine.html)

------
manueslapera
I appreciate the author's effort, but for any aspiring data scientist this
website is much, much more useful:

[http://www.datasciencetoolkit.org/](http://www.datasciencetoolkit.org/)

------
spot
Looks cool. When it reaches 1.0 you might want to include Beaker:
[http://BeakerNotebook.com](http://BeakerNotebook.com)

------
hatred
@author, I have a trivial question: how is it different from Anaconda?

~~~
jeroenjanssens
Anaconda is an easy-to-install distribution that contains many, many Python
packages. It is not a virtual machine.

The Data Science Toolbox also contains Python and the most popular packages
for doing data science. Besides that, it also contains R and, again, some
great packages.

The main difference, though, is that the DST aims to be a platform where
authors and teachers can create custom software and data bundles for their
readers and students. (I have been thinking about creating a bundle that would
install Anaconda on the DST.)

If you only need to use Python, then Anaconda may very well satisfy your
needs. Good luck!

