

Tools to get started in machine learning - nwenzel
http://k2company.com/blog/2012/09/06/toolbox-for-learning-machine-learning-and-data-science/

======
rm999
While I think Python is great, I think the author discounts R too casually. I
share his frustrations about R, but what's nice is R (well, S) was designed to
be a statistical computing tool, and it does anything related to that quite
well. Especially for data analysis, which is where I spend a lot of my time
(perhaps most) in the whole model building process, R is amazing. Also, R is
very usable out of the box for math and data visualization, whereas Python
requires many libraries. It's good to know R.

I think R and Python can be used in conjunction very effectively. Analyze your
data in R, prototype your algorithms in either, build products in Python.

~~~
dbecker
With rpy2 you can even embed R code in Python.

------
dbecker
>I found the syntax baffling, the documentation copious, but written for
mathematicians instead of hackers

I'm always surprised how much people hate the syntax of R. I primarily work in
Python, but I use R once or twice a week... and the syntax seems very clean to
me. Can someone give me an example of what you dislike with R's syntax?

I'm even more surprised to hear complaints about the documentation in R. The
help files in R are much more complete and well organized than docstrings in
the python libraries we use. Even the web usually lacks anything as useful as
what I get from the vignettes function in my R interpreter.

~~~
zmmmmm
Personally it's not _so_ much syntax as the confusing data model that gets me
in R. So many different but very similar data types - lists, data frames,
matrices, tables, vectors - all very similar but slightly different syntax,
very frequently converted silently from one to the other when you call
functions but resulting in strange quirks that are extremely hard to debug at
the other end. The combination of loose data typing and this plethora of
similar data types makes it a nightmare to work with at times. On the other
hand when you grok it and it works for you ... it's amazing.

~~~
dbecker
The complaints about R still seem counterintuitive to me. Python has the same
data types listed above, and many more.

A list in R is a list in Python. A data frame in R is a data frame in the
pandas library. A matrix in R is a matrix in numpy. A vector in R is a
1-dimensional ndarray in numpy.

But Python adds dictionaries, tuples, iterators, sets, and a bunch of other
data types that aren't used in R.

R's lists and vectors are relatively similar... but you could say the same
thing about numpy's matrix and ndarray. You could probably say the same thing
about python's sets, tuples and lists.

To be honest, I'd have said the strength of python is that it has many more
data types than R... rather than fewer data types.
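
A rough sketch of the mapping described above, assuming numpy and pandas are installed. The variable names are illustrative only, and the R-to-Python correspondences are approximate rather than exact equivalences:

```python
import numpy as np
import pandas as pd

# R list -> Python list (both can hold mixed types)
items = [1, "two", 3.0]

# R vector -> 1-dimensional numpy ndarray
vector = np.array([1.0, 2.0, 3.0])

# R matrix -> 2-dimensional numpy ndarray
matrix = np.array([[1, 2], [3, 4]])

# R data frame -> pandas DataFrame (named columns, possibly of different types)
frame = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

print(vector.ndim, matrix.shape, frame.shape)
```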

------
gallamine
If you're using a Mac, _please_ don't install all of this individually.
Instead, install the Scipy Superpack:
<http://fonnesbeck.github.com/ScipySuperpack/>

~~~
gammarator
The superpack is a great one-click option. For staying up to date, I find a
package manager (like macports or homebrew) more effective for managing my
python packages.

------
EvanKelly
Does anyone have any experience with Octave and how it compares to the Python
setup that OP suggests?

What are the benefits/pitfalls?

~~~
apl
Octave sits at an awkward half-way point between MATLAB and
Python/R/Julia/etc. You get the shitty syntax of MATLAB at not quite MATLAB's
speed and miss out on support as well as various incompatible toolboxes. So
unless you have hard dependencies or lots of MATLAB code sitting around,
Octave isn't that attractive an option.

It's great for matrix/vector maths, though. If that's all you do -- go for it.
Everything over and above that is a royal pain in MATLAB and, conversely, in
Octave.

------
peteTorrione
A nice list of tools. We tried to use Python 3.0, but had the same problems as
the author...

And if you are lucky enough to have free-ish access to MATLAB, here's a free,
BSD, open-source, github repo'd machine learning toolbox to help you get
started:

<http://www.newfolderconsulting.com/prt/>

Full disclosure: I'm involved.

------
shreeshga
If you are going to use Python for ML, use the Python package from Enthought
[<http://www.enthought.com/products/epd_free.php>]

It has most of the libraries, out of the box.

~~~
rrosen326
(Author) - Nice tip. Thanks - I'll update my list. (Wish I had known at the
start of my learning!)

------
Radim
Don't forget gensim (and its tutorials): <http://radimrehurek.com/gensim>

Plus it's the only one on that list that will scale beyond the "My Data Set"
size.

------
pfanner
>If you are a real data scientist or expert, skip this

I am a "real data scientist" and need some advanced data analysis techniques and
machine learning. Can someone recommend an introduction for me?

~~~
ihodes
What's your background? (i.e. what do you already know?)

~~~
pfanner
Probably not much. I started working with bigger data some months ago and now
I notice that I need some real techniques and not just my "ok try this and
this". I'm coding in C (where I "create" the data (numerical integration of
stochastic differential equations)) and Python (plotting). I need
methods/algorithms/techniques to analyse the data "on the fly" because I can't
save it all (it's too much data).

~~~
tst
Just a small tip which may ease the search for methods. The general term for
"on the fly" learning is online learning [1]. The rest depends on your problem
but there are often online variants of offline methods, e.g. when you work
with Gaussian process regressions.

[1]: <http://en.wikipedia.org/wiki/Online_machine_learning>
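
As a small illustration of processing data "on the fly" without storing it, here is a sketch of Welford's online algorithm for a running mean and variance. This is one simple streaming technique, not necessarily the right method for the problem above; the class name `RunningStats` and the sample values are made up for the example:

```python
class RunningStats:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new value into the statistics; nothing else is stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined (NaN) for fewer than two values.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for x in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.update(x)

print(stats.mean, stats.variance)
```

Each value is seen once and then discarded, so memory use stays constant however long the stream runs.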

------
mlmilleratmit
Solid summary of some powerful tools. No wonder Python is the new default for
academic data analysis.

~~~
FrojoS
Andrew Ng, in his machine learning class [1], urges people to use Matlab
instead of Python, because in his experience people develop faster with Matlab
than with any other tool/language.

Personally, I am experienced with Matlab but not so much with Python, so I am
not able to judge. I definitely hate the fact that Matlab is proprietary and
partly closed source. Also, I think Python syntax is much prettier, while
Matlab is not even designed to be a real language. But alas, it comes with
very powerful functionality out of the box.

Note that the OP mentioned in the introduction that he had no access to
Matlab through his employer or university and hence dismissed it.

[1] <https://www.coursera.org/ml> (one of the first videos)

~~~
otoburb
It's too bad the OP didn't find Octave, which is an open-source Matlab clone
that was also used in Andrew Ng's ML course.

~~~
0ren
I am not sure that Octave performs as fast as MatLab; for example, I think
MatLab does a better job in parallelizing non-vectorized code.

