Machine Learning Exercises in Python, Part 1 (johnwittenauer.net)
520 points by jdwittenauer on Aug 12, 2016 | 62 comments



Often this sort of material is a collection of methods and explanations of how they work, which is obviously important for being able to use them. However, I usually find the example problems much cleaner and simpler than those I've encountered in business. There seems to be a missing link between learning the methods and doing something that actually adds significant value for a business using machine learning. Perhaps it's just me or my field, though.

I found that lots of the work involved just transforming or examining data in relatively simple ways, or relying on human experts to set the important thresholds for outliers. For example, I could run an outlier algorithm on data, and either the returned outliers were so obvious they could have been found with a manual query given the business context, or the algorithm returned a lot of false positives that were useless for the business. Other times we'd have a predictive model that was good for 95% of cases but would make our company look ridiculous on predictions for the other 5%, so we couldn't use it in production, and the nature of the data was such that we couldn't restrict the model to only certain value ranges.
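To make that concrete, here's a minimal sketch of the kind of comparison I mean (column names are hypothetical, and a plain z-score rule stands in for the outlier algorithm):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("call_records.csv")  # hypothetical telecom dataset

    # Generic statistical outlier detection: flag anything more than
    # three standard deviations from the mean usage.
    z = (df["minutes_used"] - df["minutes_used"].mean()) / df["minutes_used"].std()
    algorithmic = df[np.abs(z) > 3]

    # Expert threshold: a domain expert already knows that more than
    # 10000 minutes a month on a residential line is suspicious.
    manual = df[df["minutes_used"] > 10000]

    # In my experience the two sets overlapped almost entirely, so the
    # algorithm added little beyond the expert's query.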

Perhaps it was just the nature of our realm of business (telecom), and these approaches are more useful in others (advertising, stock trading, etc.). Does anyone have experience with business fields where this stuff made a sizable impact on something that was actually productionized?


Depending on the business needs, returning outliers can be useful even if there are a bunch of false positives.

I'm not a machine learning guy, but when I was at Kongregate, we had a problem with credit card fraud on our virtual goods platform. It wasn't serious fraudsters, just dipshit teens with their parents' credit card.

I had labeled data: historical transactions, with chargebacks, which I fed into Weka. I included all kinds of stuff we knew about the user. A simple rule-based classifier could pick out risky transactions, with a lot of false positives.
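I used Weka, but roughly the same idea sketched in Python with scikit-learn (feature names hypothetical) looks like this:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Historical transactions labeled with whether they were charged back.
    data = pd.read_csv("transactions.csv")  # hypothetical export
    features = ["account_age_days", "spend_rate", "times_muted", "swear_count"]
    X, y = data[features], data["charged_back"]

    # A shallow tree yields human-readable rules. Lots of false positives
    # is acceptable here, because a human reviews every flagged transaction.
    clf = DecisionTreeClassifier(max_depth=3)
    clf.fit(X, y)
    data["risky"] = clf.predict(X)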

I made a simple tool for our customer service team to review these risky transactions. They would decide whether to warn the user, temporarily block them from buying or temp ban them, or permanently ban them.

This worked pretty well for us. The risk factors were new players, players spending quickly, and users who were dicks - as measured by how often others had muted them in chat, how often they swore in chat, etc.

As an aside, saying "fuck" or "shit" in chat wasn't very predictive of fraud - often those terms aren't signs of an abusive user, since they might just be saying "fuck, I suck at this game". What was predictive was users who said "Gay", "Penis", or "Rape". People who use those terms on a game platform are largely dickheads. So the score for abusiveness became known as the "Gay, Penis, Rape Score" or "GPR" for short.


Very cool, thanks! I didn't realize that in certain contexts, many false positive outliers wouldn't necessarily be such a bad thing, especially when they could be further refined with human interaction.


I've had a similar experience in insurance. Our predictive algorithms have been used sparingly: they guide our strategy, but we don't fully trust the actual data. That's how we leverage our analysis.

For us, small increments do give a sizable impact. And we don't aim to predict 100% of cases either. We take what we get and see how we can use it.

In business, we don't care about accuracy. We care about improvement.


Thanks for your comment, this is exactly the type of information I was interested in.


Chiming in to say that I have the exact same experience :) I work in security, and we use these methods to detect anomalies and classify malicious content or URLs. A silly false positive is embarrassing, even if it happens once. Humans always augment our methods, or we have to set expectations with the customer that we are trading accuracy for speed. Fast customer support usually helps against false positives too.


Yes, augmenting machine intelligence with human intuition is great, because machines don't yet have human intuition and we don't know how to program it.


While I agree that data munging is very important and very difficult, I disagree that it should be part of every course teaching any kind of data manipulation.

I took a course called data mining at university and it largely consisted of munging data.

Biased by that one course, I would expect anything called "data mining" to contain a lot of practice and theory about cleaning data and a machine learning course to focus on what to do with the cleaned data.


These are just introductory courses, teaching the theory.

Teaching best practices for applying these methods to particular fields is probably beyond the expertise of any one person. Perhaps there's an opportunity for professors or practitioners of each field here?


I would argue that if you know the physics behind the problem, then even semi-empirical models easily beat machine learning. I have seen this consistently on my datasets.


I took that course from the pre-Coursera Stanford videos, when someone from Black Rock Capital taught the course at Hacker Dojo. Did the homework in Octave, although it was intended to be done in Matlab.

It was painful. Those videos are just Ng at a physical chalkboard, with marginally legible writing. All math, little motivation, and, in particular, few graphics, although most of the concepts have a graphical representation.


Spot on. I respect the depth of Ng's knowledge, but for 99% of people, knowing how to implement a linear regression algorithm is completely useless. Hardly anyone is trying to write a better ML algorithm; the rest of us just need to import code that was written by PhDs. So it's far better to understand higher-level concepts: when you should use a certain ML method, what assumptions go into it, and generally how the underlying algorithms work.


Well, not if you want to be a data scientist, I think.

If you don't do a class where you build things from first principles, you'll never know how to tweak code you imported.

The linear regression algorithm he teaches is a stepping stone to neural networks: it's a neural network with no hidden layer and no nonlinearity. True, you would probably never use it in the field, but you have to start with something simple.
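A minimal numpy sketch of that stepping stone, in the spirit of the course's first exercise (variable names are my own, not the course's):

    import numpy as np

    # Linear regression by batch gradient descent: a "network" with no
    # hidden layer and no nonlinearity, i.e. prediction = X @ theta.
    def gradient_descent(X, y, alpha=0.02, iters=5000):
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iters):
            error = X @ theta - y
            theta -= (alpha / m) * (X.T @ error)  # gradient of squared loss
        return theta

    # Prepend a column of ones so theta[0] acts as the intercept.
    X = np.column_stack([np.ones(100), np.linspace(0, 10, 100)])
    y = 3 + 2 * X[:, 1] + np.random.randn(100) * 0.5
    print(gradient_descent(X, y))  # roughly [3, 2]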

After I took the Ng course and put a couple of algorithms built from its examples into production, I said, "oh, let me use R or scikit-learn instead of this hacky Octave." And off the shelf, using default parameters, none of them performed nearly as well. You need to understand an algorithm pretty granularly to be able to cross-validate and tune its parameters.
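For example, here's roughly what tuning an off-the-shelf model involves once you understand the knobs (a generic scikit-learn sketch on a toy dataset, not the actual models I shipped):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)

    # Default parameters rarely match a hand-tuned implementation;
    # cross-validated grid search over the key knobs closes most of the gap.
    grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [1e-4, 1e-3, 1e-2]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)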

The field is sufficiently new that for anything interesting, an off-the-shelf import from scikit-learn is not going to be anywhere near state of the art; you should have the ability to roll your own.

It would be interesting to re-implement Ng's examples and assignments in TensorFlow.


For people getting started with ML, do you think it is more important to learn first principles and the "boring" math like this, or to give the learner some quick wins and keep the excitement and interest levels up?


Do what feels good for you :)

Ng is a fine place to start: you get some pretty quick wins, doing MNIST from first principles within a month or two. You just need to know, or get comfortable with, matrix multiplication. It strikes a reasonable balance between rigor and approachability for a committed student at the undergrad level.

Statistical Learning is easier https://lagunita.stanford.edu/courses/HumanitiesandScience/S...

LAFF linear algebra is just starting http://www.ulaff.net/

Hinton's Neural Networks is offered in the fall https://www.coursera.org/learn/neural-networks

For my money, I wouldn't do something like Practical Machine Learning in R, because I think you'll learn more R than machine learning. I wouldn't do the Udacity TensorFlow course because I think it assumes a lot of stuff you would learn in Ng's class ... I think Ng is a fine place to start.


This feels like a pretty loaded question. It seems like you can have math with quick wins, keeping excitement and interest. When you say "boring" math, are you referring to the overall content or the way it's taught?

Most of my experiences with "boring" math were boring because it was taught poorly or I wasn't ready for it.

ML is such a broad canopy that it probably includes many who aren't ready for the math and will find it boring. It's similar to the distinction between appliers and "methodologists" in statistics.

Breaking down "people getting started with ML" into what they want to do with it feels more tractable. Maybe it's an issue of courses signaling who they are geared for.


I'm really glad it's out there though. I may be in the minority, but I would love to write better ML algorithms!


Can you recommend courses/books/etc that take that approach?


Agreed. Though this is pretty consistent with college CS/Math courses in general (at least in my experience). A lot of dense theoretical content covered in scribbles and slides. You don't really learn anything until you just do practice problems or research the same topics independently.


> You don't really learn anything until you just do practice problems or research the same topics independently.

This is to be expected. As my Linear Systems textbook says, "math is a contact sport."


You're right -- the class is intended to be a primer for your learning or, ideally, something you come into having already read the material, ready to gain insights.


The current Coursera course's videos are pretty unadorned, but he's not using a physical chalkboard any more. I also found that for most of them I can use the subtitles instead of the audio and play them back at about 2x speed.


That alone would be a big help. Reading his chalkboard work is hard.


I wrote up the course a few years ago for easy reference. I need to update the notes, as I have about a year's worth of (minor) typos that people have pointed out [which is hugely appreciated], but in general they seem well received.

http://holehouse.org/mlclass/


Heh, this is how my CS classes were.


During the time of the original class, I don't think scikit-learn and Spark were quite as mature. But perhaps Octave still enjoys a certain prominence in academic machine learning research. Matlab was also used for the recent edX SynthBio class. And it just feels a bit archaic now, doing science in a GUI on the desktop instead of on a cloud server via the CLI ;)


Related: the demos from Kevin P. Murphy's excellent ML book, implemented in Octave [1] and (partially) in Python [2].

[1] https://github.com/probml/pmtk3/tree/master/demos

[2] https://github.com/probml/pmtk3/tree/master/python/demos


Very nice. I took the class twice and think it is easiest to use Octave, but after taking the class these Python examples might help some people.


Seems like, to compensate for day-to-day weight/water fluctuations, one would need to track the trailing activity and food data for a period of days prior to the data analyzed. I'm thinking 3-5.

0.2 lbs/kg lost is mostly a rounding error. Our weight can fluctuate that much on a daily basis just from the amount of salt consumed.


I think you clicked on the wrong thread, you probably wanted to post here:

> Machine Learning and Ketosis

https://news.ycombinator.com/item?id=12279415


Ng's machine learning class is excellent, but the main thing holding it back is its use of Matlab/Octave for the exercises. A Python version (with auto-grading of exercises) would be a huge improvement.


Can I find the same in R?


Try

https://www.edx.org/course/applied-machine-learning-microsof...

https://lagunita.stanford.edu/courses/HumanitiesSciences/Sta...

> This is not a math-heavy class, so we try and describe the methods without heavy reliance on formulas and complex mathematics. We focus on what we consider to be the important elements of modern data analysis. Computing is done in R. There are lectures devoted to R, giving tutorials from the ground up, and progressing with more detailed sessions that implement the techniques in each chapter.


What is the best learning resource for Gaussian processes (kriging) using Python?


Have you seen Gaussian Processes for Machine Learning [0]?

The entire text is freely available online at the mentioned URL.

[0] http://www.gaussianprocess.org/gpml/


That is the classical resource for GPs, but it uses Matlab. I'm looking for something practical using Python.


I haven't used it, but you may want to explore GPy [0] if you haven't already.

[0] https://github.com/SheffieldML/GPy
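A minimal regression example in the style of the GPy README looks something like this (toy data, RBF kernel chosen arbitrarily):

    import numpy as np
    import GPy

    # Toy 1-D regression problem.
    X = np.random.uniform(0, 10, (50, 1))
    Y = np.sin(X) + 0.1 * np.random.randn(50, 1)

    kernel = GPy.kern.RBF(input_dim=1)            # squared-exponential kernel
    model = GPy.models.GPRegression(X, Y, kernel)
    model.optimize()                              # fit kernel hyperparameters

    mean, var = model.predict(np.array([[5.0]]))  # posterior mean and variance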


I never understood the difference between GPy and the GP implementation in sklearn. I'm using the latter, but I still don't understand most of the parameters that go into the model.
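For what it's worth, here is a sketch against the newer GaussianProcessRegressor API with the main parameters annotated (my annotations, not the official docs; the pre-0.18 GaussianProcess class takes different arguments):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel

    X = np.random.uniform(0, 10, (50, 1))
    y = np.sin(X).ravel() + 0.1 * np.random.randn(50)

    gp = GaussianProcessRegressor(
        kernel=ConstantKernel(1.0) * RBF(length_scale=1.0),  # prior covariance
        alpha=1e-2,               # noise added to the kernel diagonal
        n_restarts_optimizer=5,   # redo hyperparameter fits from random starts
        normalize_y=True,         # center the targets before fitting
    )
    gp.fit(X, y)
    mean, std = gp.predict(np.array([[5.0]]), return_std=True)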


Let's talk about how much of what Andrew Ng knows about machine learning and AI was taught to him by Michael I. Jordan.


He's the Michael Jordan of machine learning, after all.


Aye


https://www.coursera.org/about/terms/honorcode

> I will not make solutions to homework, quizzes, exams, projects, and other assignments available to anyone else (except to the extent an assignment explicitly permits sharing solutions). This includes both solutions written by me, as well as any solutions provided by the course staff or others.


None of the material in these posts could be used directly to complete assignments for the class. I suppose someone could attempt to "back-port" some of the Python code to Octave, but if you're going to that much trouble it's probably easier to just solve it in Octave in the first place.


I took the course a while back, and most of the assignments were just straight copying and pasting from the PDF or translating some math formula into Octave, never more than 10 lines of code. It's so spoon-fed already that I don't see why anyone would want to cheat by porting it from Python.


Having taken the class myself, I couldn't agree more.

The class is meant to introduce one to machine learning. As such, the problems are usually fairly simple, and one wouldn't need to cheat unless all one is attempting to do is solve them without looking at either the lecture videos or the slides.

(Translating from Python to Octave might, on the other hand, require more effort than implementing the solutions in Octave directly.)


I know I'm just burning karma here, but this really triggers me.

Andrew Ng gives you a free introductory course in one of the hottest topics in computing, and in exchange asks you not to do one thing. And you do that one thing. I have my solutions in Octave, and it would be really convenient for me to back them up on GitHub, but I keep them on a USB stick for this very reason. I am respecting the wishes of the man who was so kind as to teach me about machine learning.

You should take them down if you don't have the explicit permission to share them as the honor code states. You don't have explicit permission, do you?


Since you linked to the honor code, can you point to anything in these blog posts that qualifies - in the context of completing the class - as "solutions to homework, quizzes, exams, projects, and other assignments"? Could anything I provided be directly used to "cheat" and finish the class without doing the work?

And just as a practical matter, there are dozens of github repos with literal (as in copy, paste, submit, done) solutions to these problems already available.

You're certainly free to disagree, but I do not view this as violating either the spirit or the intent of the honor code. This content has been out for years and is not in any way novel or unique. It's simply another vector through which the material can be learned, possibly opening it up to an even wider audience. Which is, I believe, what Andrew's goal was all along.


[flagged]


I don't see much value in continuing to debate this. You're entitled to your opinion and I respect that, but I do not share your viewpoint.


> The way I see it, Andrew Ng is an entrepreneur co-founder with a startup called Coursera. He makes his courses free, because he's a good guy, and free attracts the audience that makes his platform worth something.

I liked this class, and Andrew seems like a great guy, but I'd like to point out that it's no longer free to take the evaluations for Coursera courses. Coursera is a startup and needs to find a way to monetize the courses to stay in business. I have no problem with that, but I think it's a little disingenuous to present Coursera as a free service when it's clearly not.


I see this written on HN and Reddit a lot lately, yet I've just signed up and begun two courses, including Andrew's ML course, and have yet to be asked for money or to hit restrictions beyond the ones in place last year.


I'm doing the ML course at the moment and I'm asked to pay for a certificate after every screen.


Yes, you are right. I should have written "forced to" instead of "asked to", but now it is too late to edit.

Anyway, my point stands. I can take these courses for free, despite people on HN and reddit claiming you can't.


These are direct solutions to the exercises. I took the class some time ago and also did it in Python.


As far as I can remember, you can't submit assignments in Python, can you? Or maybe you did it in Python first and then ported to Octave before submitting? If so, how did that work out for you? At first I thought I wanted to go down the same path (because I'm comfortable with Python but not with Octave), but then concluded it was too much trouble backporting everything.


> As far as I can remember, you can't submit assignments in Python, can you?

When I took the class earlier this year, the answer was, effectively, "no". I mean, yeah, you could do some trickery with calling Python from Octave using whatever FFI Octave has, or you could possibly reverse engineer the protocol they use to talk from your code to the upstream server... but anybody doing all that would be doing more work than just completing the assignments in Octave to begin with.


There exist implementations of an interface between Python and the Coursera grading server.

https://github.com/mstampfer/Coursera-Stanford-ML-Python is an example.


Of course somebody would have reverse engineered the protocol already. Oh well. I still think most people would find it easier to just do the assignments themselves than deal with all this, but I'll grant that there's "always one in every crowd" as they say.


We have Pytave. You don't need FFI anymore; or rather, we wrote the FFI for you. Write your code with SciPy, and Octave sees Octave-shaped objects in return.


The course itself uses Octave; the OP just ported the code to Python.


From what I remember of the course, it was itself mostly "porting" code (well, formulas) from the textbook to Octave/Matlab.


Yes, and Octave/Matlab is not that far from "standard" mathematical notation anyway... But the exercises were still useful, as they helped me remember the concepts.


They were supposed to be free, so they blinked first.



