

Choosing a First Machine Learning Project: Start by Reading or by Doing? - danger
http://blog.smellthedata.com/2010/07/choosing-first-machine-learning-project.html

======
jacoblyles
I'm not sure what you are imagining the scope of your first project to be, but
I would recommend you begin by understanding and implementing some well-known
algorithms. Start with a Gaussian mixture model trained with the EM
algorithm. Then do linear and logistic regression; the perceptron is also
pretty simple, and it's a simpler version of the widely used support vector
machine.
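The perceptron is probably the quickest of these to get running. Here's a minimal sketch (my own toy example, not from the thread; the data and learning rate are made up) of the classic mistake-driven update rule on a small linearly separable problem:

```python
# Minimal perceptron sketch: learn a linear separator in 2-D with the
# classic rule "on a mistake, nudge the weights toward the example".
def train_perceptron(data, epochs=20, lr=1.0):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:  # y is +1 or -1
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:  # misclassified
                w[0] += lr * y * x[0]
                w[1] += lr * y * x[1]
                b += lr * y
    return w, b

# Toy concept: points above the line x1 + x2 = 1 are positive.
data = [((0, 0), -1), ((1, 0), -1), ((0, 1), -1), ((1, 1), 1), ((2, 1), 1)]
w, b = train_perceptron(data)
```

Because the data is separable, the perceptron convergence theorem guarantees this loop stops making mistakes after finitely many updates.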

The handwriting recognition database here is fantastic for testing a variety
of simple ML models:

<http://yann.lecun.com/exdb/mnist/>

In our machine learning class, we would use data from the KDD cup for our
projects. Why don't you create a submission for old KDD cups and see if your
model can do better than random? 1998 is good for logistic and linear
regression:

<http://www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html>

Using the 2007 dataset, you can try out some of the matrix factoring methods
that have worked well for the Netflix prize:

<http://www.cs.uic.edu/~liub/Netflix-KDD-Cup-2007.html>

These should all be relatively small tasks. Learning how to interpret your
results and iterate your models to make them better will take longer than
understanding the algorithms.

~~~
danger
There are so many possible directions that somebody might want to go, though.
If you're ultimately interested in reinforcement learning, you don't
immediately need to understand EM in order to "get" Q learning. Or to work on
discrete problems related to computer vision, having a basis in linear
programming and algorithms will get you further than logistic regression or
SVMs.

I'll agree that anybody doing machine learning should understand the basic
idea of maximum likelihood learning (e.g., logistic regression). But how far
do you go beyond that? This is really the heart of the issue, I think. Yes,
there are a ton of things that are really useful to know, and any machine
learning student should learn them at some point. If you're trying to do an
interesting project that teaches you something about research (and thus looks
good on a graduate school application), though, what is the best use of your
time?
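To make the "basic idea of maximum likelihood" concrete, here is a hedged sketch of 1-D logistic regression fit by gradient ascent on the log-likelihood (my own toy data and step size, not anything from the thread):

```python
import math

# Logistic regression as maximum-likelihood learning: each gradient step
# moves (w, b) in the direction that makes the observed labels more probable.
def fit_logistic(xs, ys, steps=2000, lr=0.1):
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):  # y in {0, 1}
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # model's P(y=1 | x)
            gw += (y - p) * x  # gradient of the log-likelihood w.r.t. w
            gb += (y - p)      # ... and w.r.t. b
        w += lr * gw
        b += lr * gb
    return w, b

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```

The `(y - p)` term is the whole story: the gradient of the log-likelihood is just "observed minus predicted", which is why this model is such a clean first encounter with maximum likelihood.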

~~~
dododo
You're right in that you need direction. Each subclass of machine learning has
some elementary algorithms worth learning:

For unsupervised learning or probabilistic latent models, the EM algorithm,
Metropolis-Hastings sampling, and variational methods (variational EM,
variational Bayes, expectation propagation).

For supervised learning, linear regression, perceptrons, support vector
machines.

For reinforcement learning, Q-learning, E^3, Rmax.
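Of the reinforcement learning entries, tabular Q-learning is the easiest to implement from scratch. A hedged sketch on a made-up 5-state corridor (the environment, constants, and episode count are all my own toy choices, not from the thread):

```python
import random

# Tabular Q-learning on a tiny corridor MDP: states 0..4, actions step
# left (-1) or right (+1), reward 1 for reaching the goal state 4.
N_STATES = 5
ACTIONS = (-1, +1)
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.2

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
random.seed(0)

for episode in range(500):
    s = 0
    while s != N_STATES - 1:  # episode ends at the goal
        if random.random() < EPS:  # epsilon-greedy exploration
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Q-learning update: bootstrap off the best next action,
        # except at the terminal state, which has no future value.
        target = r if s2 == N_STATES - 1 else r + GAMMA * max(
            Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2
```

After training, the greedy policy (pick the action with the larger Q-value) walks straight to the goal, and the Q-values approach the discounted returns gamma^(distance to goal).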

------
endtime
Start by doing...pick a relatively easy algorithm, implement it, and fully
understand why it's doing what it's doing. If you start out by implementing
something with extreme math-fu it may just seem like magic.

My first ML project was to implement STAGGER (Schlimmer and Granger, 1986),
which is a very simple algorithm for handling concept drift. Then I trained it
on the domain {red, green, blue} X {square, circle, triangle} X {small,
medium, large}. I fed it 40 positive examples of small red square, then 40
positive examples of large green triangle, then 40 positive examples of medium
blue circle, and watched the learned concept change. I understood how and why
it worked, and that felt pretty good.
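The flavor of that experiment is easy to reproduce even without STAGGER itself. Here is a deliberately trivial drift-aware learner (my own sketch, not Schlimmer and Granger's algorithm): it keeps exponentially decayed counts of attribute values seen in positive examples, so old concepts fade as new ones arrive:

```python
# A toy concept-drift tracker over the domain
# {red, green, blue} x {square, circle, triangle} x {small, medium, large}.
COLORS = ["red", "green", "blue"]
SHAPES = ["square", "circle", "triangle"]
SIZES = ["small", "medium", "large"]

counts = {v: 0.0 for v in COLORS + SHAPES + SIZES}
DECAY = 0.9  # how quickly the learner forgets old evidence

def observe_positive(example):
    for v in counts:
        counts[v] *= DECAY  # decay everything a little
    for v in example:
        counts[v] += 1.0    # reinforce the attributes just seen

def learned_concept():
    # Report the currently most-supported value of each attribute.
    return (max(COLORS, key=counts.get),
            max(SHAPES, key=counts.get),
            max(SIZES, key=counts.get))

# Feed 40 positives of each concept in turn, as in the experiment above.
phases = [("red", "square", "small"),
          ("green", "triangle", "large"),
          ("blue", "circle", "medium")]
history = []
for concept in phases:
    for _ in range(40):
        observe_positive(concept)
    history.append(learned_concept())
```

By the end of each 40-example phase the decayed counts of the previous concept have all but vanished, so `history` tracks the drift from small red square to large green triangle to medium blue circle.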

------
apurva
For what it's worth, I think it's important to start with interesting
problems. My first interaction with ML was an implementation of Naive Bayes
for classifying spam, written from scratch (i.e., no libraries) and borrowing
many ideas from PG's A Plan for Spam. This is what got me really interested
in the field, much more than randomly picking up topics would have; there are
just so many areas to choose from. Another approach would be to read up on
standard supervised learning techniques and observe how the parameters of
these algorithms behave on datasets. Something like Weka really comes in
handy if you wish to focus on analyzing the behavior of such techniques
first. Best of luck!
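The library-free version really is small. A bare-bones sketch in the spirit of A Plan for Spam (my own minimal example, not the commenter's code; the training sentences are made up): per-class word counts, add-one smoothing, and a log-probability comparison:

```python
import math
from collections import Counter

# Naive Bayes spam filter from scratch: count words per class, then score
# a new document by log prior + sum of smoothed log word likelihoods.
def train(spam_docs, ham_docs):
    return {
        "spam": Counter(w for d in spam_docs for w in d.split()),
        "ham": Counter(w for d in ham_docs for w in d.split()),
        "n_spam": len(spam_docs),
        "n_ham": len(ham_docs),
    }

def classify(model, doc):
    vocab = set(model["spam"]) | set(model["ham"])
    scores = {}
    for label in ("spam", "ham"):
        counts = model[label]
        total = sum(counts.values())
        score = math.log(model["n_" + label])  # log prior (unnormalized)
        for w in doc.split():
            # Add-one (Laplace) smoothing so unseen words don't zero out.
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

spam = ["buy cheap pills now", "cheap pills cheap offer"]
ham = ["meeting notes attached", "lunch tomorrow at noon"]
model = train(spam, ham)
```

The "naive" independence assumption is what makes the whole classifier a sum of per-word log terms, which is also why it scales so well to real spam corpora.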

------
raintrees
Doing. If I just read, I don't get the other parts, like getting the debugging
aspect correct, finding out the requirements of the environment, etc. Plus, I
have less investment if I haven't typed the code in myself.

Edit: And continue with the formalized learning through reading (and
experimentation). Later get a handle on common conventions, as they usually
help accuracy/readability.

