
Machine Learning Exercises in Python, Part 1 - jdwittenauer
http://www.johnwittenauer.net/machine-learning-exercises-in-python-part-1/
======
jupiter90000
Often this sort of material is a collection of methods and explanations of
how they work, which is obviously important to being able to use them.
However, I usually feel like the example problems are much cleaner and simpler
than those I've encountered in business. I feel like there's this missing link
between learning the methods and doing something that actually adds
significant value for a business using machine learning. Perhaps it's just me
or my field, though.

I found that usually a lot of the work involved just transforming or examining
data in relatively simple ways, or relying on human expert decisions about
important thresholds for outliers. For example, I could run an outlier
algorithm on the data, and either the returned outliers were very obvious and
could have been found with a manual query given the business context, or it
returned a lot of false-positive outliers that were useless for the business.
Other times, we'd have a predictive model that was good for 95% of cases but
would make our company look ridiculous on predictions for the other 5%, so we
couldn't use it in production -- and the nature of the data was such that we
couldn't use the model for only certain value ranges.

Perhaps it was just the nature of our realm of business (telecom), and these
approaches are more useful in others (advertising, stock trading, etc.). Does
anyone have experience with business fields where this stuff made a sizable
impact on something they productionized that they can share?

~~~
jim-greer
Depending on the business needs, returning outliers can be useful even if
there are a bunch of false positives.

I'm not a machine learning guy, but when I was at Kongregate, we had a problem
with credit card fraud on our virtual goods platform. It wasn't serious
fraudsters, just dipshit teens with their parents' credit cards.

I had labeled data: historical transactions, with chargebacks, which I fed
into Weka. I included all kinds of stuff we knew about the user. A simple
rule-based classifier could pick out risky transactions, with a lot of false
positives.
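The pipeline described here can be sketched roughly as follows. This is a
hypothetical reconstruction in Python/scikit-learn, not the original Weka
setup; the feature names, labels, and data are all invented for illustration.

```python
# Hypothetical reconstruction of the pipeline described above, in
# Python/scikit-learn rather than Weka. Feature names, labels, and
# data are all invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Invented per-transaction features
account_age_days = rng.integers(0, 365, n)   # new players are riskier
spend_per_day = rng.exponential(5.0, n)      # fast spenders are riskier
times_muted = rng.poisson(0.5, n)            # proxy for abusiveness

# Synthetic labels standing in for historical chargebacks
chargeback = (
    ((account_age_days < 30) & (spend_per_day > 8.0)) | (times_muted > 2)
).astype(int)

X = np.column_stack([account_age_days, spend_per_day, times_muted])

# A shallow tree keeps the rules interpretable, like a rule-based classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, chargeback)

# Flag risky transactions for human review instead of auto-banning;
# false positives are tolerable because a person makes the final call.
flagged = clf.predict(X)
print("flagged for review:", int(flagged.sum()))
```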

I made a simple tool for our customer service team to review these risky
transactions. They would decide whether to warn the user, temporarily block
them from buying or temp ban them, or permanently ban them.

This worked pretty well for us. The risk factors were new players, players
spending quickly, and users who were dicks - as measured by how often others
had muted them in chat, how often they swore in chat, etc.

As an aside, saying "fuck" or "shit" in chat wasn't very predictive of fraud -
often those terms aren't signs of an abusive user, since they might just be
saying "fuck, I suck at this game". What was predictive was users who said
"Gay", "Penis", or "Rape". People who use those terms on a game platform are
largely dickheads. So the score for abusiveness became known as the "Gay,
Penis, Rape Score" or "GPR" for short.

~~~
jupiter90000
Very cool, thanks! I didn't realize that in certain contexts, many false
positive outliers wouldn't necessarily be such a bad thing, especially when
they could be further refined with human interaction.

------
Animats
I took that course from the pre-Coursera Stanford videos, when someone from
Black Rock Capital taught the course at Hacker Dojo. Did the homework in
Octave, although it was intended to be done in Matlab.

It was painful. Those videos are just Ng at a physical chalkboard, with
marginally legible writing. All math, little motivation, and, in particular,
few graphics, although most of the concepts have a graphical representation.

~~~
CloudYeller
Spot on. I respect the depth of Ng's knowledge, but for 99% of people, knowing
how to implement a linear regression algorithm is completely useless. Hardly
anyone is trying to write a better ML algorithm; the rest of us just need to
import code written by PhDs. So it's far better to understand higher-level
concepts: when you should use a certain ML method, what assumptions go into
it, and generally how the underlying algorithms work.

~~~
RockyMcNuts
Well, not if you want to be a data scientist, I think.

If you don't do a class where you build things from first principles, you'll
never know how to tweak code you imported.

The linear regression algorithm he teaches is a stepping stone to neural
networks: it's a neural network with no hidden layer and no nonlinearity.
True, you would probably never use that in the field, but you have to start
with something simple.
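That equivalence fits in a few lines: one linear "layer", no activation,
trained by gradient descent on squared error. A minimal sketch with synthetic
data:

```python
# Minimal illustration of the point above: linear regression is a
# neural network with no hidden layer and no nonlinearity -- one
# linear "layer" trained by gradient descent on squared error.
# Data is synthetic.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)  # the whole "network": one weight vector, no activation
lr = 0.1
for _ in range(500):
    y_hat = X @ w                            # forward pass
    grad = 2.0 * X.T @ (y_hat - y) / len(y)  # gradient of mean squared error
    w -= lr * grad                           # gradient descent step

print(np.round(w, 2))  # recovers something close to true_w
```

Adding hidden layers and a nonlinearity between them turns this same training
loop into the neural networks covered later in the course.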

After I took the Ng course and put a couple of algos built from examples in
the course into production, I said, "oh, let me use R or scikit-learn instead
of this hacky Octave." And off the shelf using default parameters, none of
them performed nearly as well. You need to understand the algorithm pretty
granularly to be able to then cross-validate and tune parameters.
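The default-vs-tuned gap can be illustrated with a short sketch (my example,
not the original poster's setup): compare a scikit-learn model's
cross-validated score under library defaults to its score after a small grid
search. The dataset and grid values are illustrative.

```python
# Sketch of the tuning point above: a model with library defaults vs.
# the same model after a small cross-validated grid search. Dataset
# and grid values are illustrative.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Off-the-shelf, default parameters
default_score = cross_val_score(SVC(), X, y, cv=5).mean()

# Search over parameters you understand: C (regularization strength)
# and gamma (RBF kernel width)
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.001, 0.01]},
    cv=5,
)
grid.fit(X, y)

print(f"default: {default_score:.3f}  tuned: {grid.best_score_:.3f}")
```

Knowing the algorithm tells you which parameters belong in that grid and what
ranges are sensible, which is the point being made above.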

The field is sufficiently new that for anything interesting, an off-the-shelf
import from scikit-learn is not going to be anywhere near state of the art;
you should have the ability to roll your own.

It would be interesting to re-implement Ng's examples and assignments in
TensorFlow.

~~~
shostack
For people getting started with ML do you think it is more important to learn
first principles and the "boring" math like this, or do you think it is
important to give the learner some quick wins and keep the excitement and
interest levels up?

~~~
RockyMcNuts
Do what feels good for you :)

Ng is a fine place to start, you get some pretty quick wins, doing MNIST from
first principles within a month or two. You just need to know or get
comfortable with matrix multiplication. It strikes a reasonable balance
between being rigorous and approachable for a committed student at an
undergrad level.

Principles of Statistical Learning is easier
[https://lagunita.stanford.edu/courses/HumanitiesandScience/S...](https://lagunita.stanford.edu/courses/HumanitiesandScience/StatLearning/Winter2015/about)

LAFF linear algebra is just starting
[http://www.ulaff.net/](http://www.ulaff.net/)

Hinton's Neural Networks is offered in the fall
[https://www.coursera.org/learn/neural-networks](https://www.coursera.org/learn/neural-networks)

For my money, I wouldn't do something like Practical Machine Learning in R,
because I think you'll learn more R than machine learning. I wouldn't do the
Udacity TensorFlow course because I think it assumes a lot of stuff you would
learn in Ng's class ... I think Ng is a fine place to start.

------
fitzwatermellow
During the time of the original class, I don't think scikit-learn and Spark
were quite as mature. But perhaps Octave still enjoys a certain prominence in
academic machine learning research. Matlab was also used for the recent EdX
SynthBio class. It just feels a bit archaic now, doing science in a GUI on the
desktop instead of on a cloud server via the CLI ;)

------
ivan_ah
Related: the demos from Kevin P. Murphy's excellent ML book, implemented in
Octave [1] and (partially) in Python [2].

[1]
[https://github.com/probml/pmtk3/tree/master/demos](https://github.com/probml/pmtk3/tree/master/demos)
[2]
[https://github.com/probml/pmtk3/tree/master/python/demos](https://github.com/probml/pmtk3/tree/master/python/demos)

------
mark_l_watson
Very nice. I took the class twice and think it is easiest to use Octave, but
after taking the class these Python examples might help some people.

------
jjallen
Seems like, to compensate for day-to-day weight/water fluctuations, one would
need to track the _trailing_ activity and food data for a period of days prior
to the data analyzed. I'm thinking 3-5.

0.2 lbs/kg lost is mostly a rounding error. Our weight could fluctuate that
much on a daily basis from the amount of salt consumed.

~~~
Noseshine
I think you clicked on the wrong thread; you probably wanted to post here:

> Machine Learning and Ketosis

[https://news.ycombinator.com/item?id=12279415](https://news.ycombinator.com/item?id=12279415)

------
NelsonMinar
Ng's machine learning class is excellent, but the main thing holding it back
is its use of Matlab/Octave for the exercises. A Python version (with auto-
grading of exercises) would be a huge improvement.

------
motyar
Can I find the same in R?

~~~
Noseshine
Try

[https://www.edx.org/course/applied-machine-learning-microsoft-dat203-3x](https://www.edx.org/course/applied-machine-learning-microsoft-dat203-3x)

[https://lagunita.stanford.edu/courses/HumanitiesSciences/Sta...](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about)

> This is not a math-heavy class, so we try and describe the methods without
> heavy reliance on formulas and complex mathematics. We focus on what we
> consider to be the important elements of modern data analysis. Computing is
> done in R. There are lectures devoted to R, giving tutorials from the ground
> up, and progressing with more detailed sessions that implement the
> techniques in each chapter.

------
denfromufa
What is the best learning resource for Gaussian processes (kriging) using
Python?

~~~
0xmohit
Have you seen Gaussian Processes for Machine Learning [0]?

The entire text is freely available online at the mentioned URL.

[0]
[http://www.gaussianprocess.org/gpml/](http://www.gaussianprocess.org/gpml/)

~~~
denfromufa
That is a classical resource for GPs, but it uses Matlab. I'm looking for
something practical using Python.

~~~
0xmohit
I haven't used it, but you may want to explore GPy [0] if you haven't already.

[0] [https://github.com/SheffieldML/GPy](https://github.com/SheffieldML/GPy)

~~~
denfromufa
I never understood the difference between GPy and the GP implementation in
sklearn. I'm using the latter, but I still don't understand most of the
parameters that go into the model.
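For what it's worth, a minimal sketch of GP regression in scikit-learn, with
the parameter that matters most, the kernel, spelled out. The toy data is
invented. Roughly, GPy covers a larger model zoo (sparse approximations,
multi-output models, etc.), while sklearn's GaussianProcessRegressor handles
the basic exact-GP case.

```python
# Minimal sketch of GP regression in scikit-learn on toy data, with
# the most important parameter -- the kernel -- made explicit.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = np.linspace(0.0, 10.0, 30).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)

# The kernel is the prior over functions: RBF's length_scale controls
# smoothness, WhiteKernel models observation noise. fit() tunes both
# by maximizing the marginal likelihood.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

mean, std = gp.predict(np.array([[5.0]]), return_std=True)
print(mean[0], std[0])  # prediction plus an uncertainty estimate
```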

------
earthpalm
Let's talk about how much of what Andrew Ng knows about machine learning and
AI he learned from Michael I. Jordan.

~~~
freyr
He's the Michael Jordan of machine learning, after all.

~~~
zeristor
Aye

------
danjoc
[https://www.coursera.org/about/terms/honorcode](https://www.coursera.org/about/terms/honorcode)

I will not make solutions to homework, quizzes, exams, projects, and other
assignments available to anyone else (except to the extent an assignment
explicitly permits sharing solutions). This includes both solutions written by
me, as well as any solutions provided by the course staff or others.

~~~
jdwittenauer
None of the material in these posts could be used directly to complete
assignments for the class. I suppose someone could attempt to "back-port" some
of the Python code to Octave, but if you're going to that much trouble it's
probably easier to just solve it in Octave in the first place.

~~~
rawnlq
I took the course a while back, and most of the assignments were just straight
copy-pasting from the PDF or translating some math formula into Octave, never
more than 10 lines of code. It's so spoon-fed already that I don't see why
anyone would want to cheat by porting it from Python.

~~~
0xmohit
Having taken the class myself, I couldn't agree more.

The class is meant to _introduce_ one to machine learning. As such, the
problems are usually fairly simple, and one wouldn't need to _cheat_ unless
all one is attempting to do is solve them without looking at either the
lecture videos or the slides.

(Translating from Python to Octave might, on the other hand, require more
effort than implementing the solutions in Octave in the first place.)

