I found that usually a lot of the work involved just transforming or examining data in relatively simple ways, or relying on human experts to decide the important thresholds for outliers. For example, I could run an outlier algorithm on data and either the returned outliers were so obvious they could have been found with a manual query given the business context, or it returned a lot of false-positive outliers that were useless for the business.
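For reference, the kind of "simple" outlier pass being described can be as basic as a Tukey fence on a single metric. This is a minimal sketch (the metric name and data are made up for illustration), showing why the obvious outliers are obvious either way:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] -- the classic Tukey fence."""
    qs = statistics.quantiles(values, n=4)  # three quartile cut points
    q1, q3 = qs[0], qs[2]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# Hypothetical example: monthly call minutes with one obvious spike.
usage = [120, 130, 125, 118, 122, 5000]
print(iqr_outliers(usage))  # [5000]
```

A domain expert eyeballing that data would flag the same point, which is the commenter's complaint: the algorithm either confirms the obvious or floods you with borderline cases.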
Other times, we'd have a predictive model that was good for 95% of cases but would make our company look ridiculous on predictions for the other 5%, so we couldn't use it in production, and the nature of the data was such that we couldn't restrict the model to only certain value ranges.
Perhaps it was just the nature of our realm of business (telecom), and these approaches are more useful in others (advertising, stock trading, etc.). Can anyone share experience from a business field where this stuff made a sizable impact on something that was actually productionized?
I'm not a machine learning guy, but when I was at Kongregate, we had a problem with credit card fraud on our virtual goods platform. It wasn't serious fraudsters, just dipshit teens with their parents' credit cards.
I had labeled data: historical transactions with chargebacks, which I fed into Weka, along with all kinds of stuff we knew about the user. A simple rule-based classifier could pick out risky transactions, albeit with a lot of false positives.
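The original classifier came out of Weka; as a rough illustration of what a rule-based risk flagger looks like, here is a toy sketch in Python. The feature names and thresholds are entirely hypothetical, but the shape matches the description: cheap rules, high recall, humans reviewing the flags:

```python
# Hypothetical features and thresholds -- the real rules were learned in Weka.
def risk_score(txn):
    score = 0
    if txn["account_age_days"] < 7:    # new players were a risk factor
        score += 2
    if txn["spend_last_hour"] > 50:    # players spending quickly
        score += 2
    if txn["times_muted"] > 3:         # "users who were dicks"
        score += 1
    return score

def flag_for_review(txns, threshold=3):
    # Deliberately loose: lots of false positives, humans make the final call.
    return [t for t in txns if risk_score(t) >= threshold]

txns = [
    {"id": 1, "account_age_days": 2, "spend_last_hour": 80, "times_muted": 0},
    {"id": 2, "account_age_days": 400, "spend_last_hour": 5, "times_muted": 1},
]
print([t["id"] for t in flag_for_review(txns)])  # [1]
```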
I made a simple tool for our customer service team to review these risky transactions. They would decide whether to warn the user, temporarily block them from buying or temp ban them, or permanently ban them.
This worked pretty well for us. The risk factors were new players, players spending quickly, and users who were dicks - as measured by how often others had muted them in chat, how often they swore in chat, etc.
As an aside, saying "fuck" or "shit" in chat wasn't very predictive of fraud - often those terms aren't signs of an abusive user, since they might just be saying "fuck, I suck at this game". What was predictive was users who said "Gay", "Penis", or "Rape". People who use those terms on a game platform are largely dickheads. So the score for abusiveness became known as the "Gay, Penis, Rape Score" or "GPR" for short.
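A chat feature like that reduces to a keyword hit count. A toy sketch (the term lists and scoring are simplified guesses at what the Weka model actually weighted):

```python
ABUSE_TERMS = {"gay", "penis", "rape"}  # terms that were actually predictive
# Plain profanity ("fuck", "shit") is deliberately ignored -- it wasn't predictive.

def gpr_score(chat_lines):
    """Count hits on the predictive terms across a user's chat lines."""
    words = " ".join(chat_lines).lower().split()
    return sum(1 for w in words if w.strip(".,!?") in ABUSE_TERMS)

print(gpr_score(["fuck, I suck at this game"]))          # 0 -- just frustration
print(gpr_score(["this game is gay", "rape incoming"]))  # 2
```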
For us, small increments do give a sizeable impact. And we don't aim to predict 100% of the cases either. We take what we get and see how we can use it.
In business, we don't care about accuracy. We care about improvement.
I took a course called data mining at university and it largely consisted of munging data.
Biased by that one course, I would expect anything called "data mining" to contain a lot of practice and theory about cleaning data and a machine learning course to focus on what to do with the cleaned data.
Teaching best practices for applying these methods to particular fields is probably beyond the expertise of any one person. Perhaps there's an opportunity for professors or practitioners of each field here?
It was painful. Those videos are just Ng at a physical chalkboard, with marginally legible writing. All math, little motivation, and, in particular, few graphics, although most of the concepts have a graphical representation.
If you don't do a class where you build things from first principles, you'll never know how to tweak code you imported.
The linear regression algorithm he teaches is a stepping stone to neural networks: it's a neural network with no hidden layer and no nonlinearity. True, you would probably never use that in the field, but you have to start with something simple.
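That framing can be made concrete: gradient descent on mean squared error for a single linear unit is exactly the course's linear regression. A minimal sketch (toy data, no libraries):

```python
# Linear regression trained by gradient descent: structurally a neural net
# with no hidden layer and no nonlinearity.
def fit(xs, ys, lr=0.01, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # gradient of mean squared error with respect to w and b
        dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        db = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= lr * dw
        b -= lr * db
    return w, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # generated from y = 2x + 1
w, b = fit(xs, ys)
print(round(w, 2), round(b, 2))  # converges to roughly 2.0 and 1.0
```

Adding a hidden layer and a nonlinearity to this same loop is the conceptual jump to a neural network, which is why the course starts here.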
After I took the Ng course and put a couple of algos built from examples in the course into production, I said, "oh, let me use R or scikit-learn instead of this hacky Octave." And off the shelf using default parameters, none of them performed nearly as well. You need to understand the algorithm pretty granularly to be able to then cross-validate and tune parameters.
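The tuning step being described is a grid search over hyperparameters scored by cross-validation. This is a dependency-free sketch of the idea (a 1-D ridge penalty stands in for whatever parameter needs tuning; in practice you'd use something like scikit-learn's cross-validation utilities):

```python
def ridge_fit(xs, ys, lam):
    # closed-form 1-D ridge regression through the origin: w = sum(xy) / (sum(x^2) + lam)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, folds=5):
    """k-fold cross-validation: hold out every folds-th point, average the test error."""
    n = len(xs)
    err = 0.0
    for f in range(folds):
        tr_x = [xs[i] for i in range(n) if i % folds != f]
        tr_y = [ys[i] for i in range(n) if i % folds != f]
        w = ridge_fit(tr_x, tr_y, lam)
        err += sum((w * xs[i] - ys[i]) ** 2 for i in range(f, n, folds))
    return err / n

xs = list(range(1, 21))
ys = [3 * x + (-1) ** x for x in xs]  # roughly y = 3x with alternating noise
best = min([0.0, 0.1, 1.0, 10.0], key=lambda lam: cv_error(xs, ys, lam))
print(best)
```

The point of the comment stands: picking the grid and interpreting the scores requires understanding what the parameter does inside the algorithm, which defaults won't give you.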
The field is sufficiently new that for anything interesting, an off-the-shelf import from scikit-learn is not going to be anywhere near state of the art; you should have the ability to roll your own.
It would be interesting to re-implement Ng's examples and assignments in TensorFlow.
Ng is a fine place to start, you get some pretty quick wins, doing MNIST from first principles within a month or two. You just need to know or get comfortable with matrix multiplication. It strikes a reasonable balance between being rigorous and approachable for a committed student at an undergrad level.
Principles of Statistical Learning is easier https://lagunita.stanford.edu/courses/HumanitiesandScience/S...
LAFF linear algebra is just starting
Hinton's Neural Networks is offered in the fall
For my money, I wouldn't do something like Practical Machine Learning in R, because I think you'll learn more R than machine learning. I wouldn't do the Udacity TensorFlow course because I think it assumes a lot of stuff you would learn in Ng's class ... I think Ng is a fine place to start.
Most of my experiences with "boring" math was because it felt taught poorly or I wasn't ready for it.
ML is such a broad canopy that it probably includes many who aren't ready for the math, and will find it boring. It's the same with the distinction between appliers and "methodologists" in statistics.
Breaking down "people getting started with ML" into what they want to do with it feels more tractable. Maybe it's an issue of courses signaling who they are geared for.
This is to be expected. As my Linear Systems textbook says, "math is a contact sport."
0.2 lbs/kilos lost is mostly a rounding error. Our weight can fluctuate that much on a daily basis just from the amount of salt consumed.
> Machine Learning and Ketosis
> This is not a math-heavy class, so we try and describe the methods without heavy reliance on formulas and complex mathematics. We focus on what we consider to be the important elements of modern data analysis. Computing is done in R. There are lectures devoted to R, giving tutorials from the ground up, and progressing with more detailed sessions that implement the techniques in each chapter.
The entire text is freely available online at the mentioned URL.
I will not make solutions to homework, quizzes, exams, projects, and other assignments available to anyone else (except to the extent an assignment explicitly permits sharing solutions). This includes both solutions written by me, as well as any solutions provided by the course staff or others.
The class is meant to introduce one to machine learning. As such, the problems are usually fairly simple, and one wouldn't need to cheat unless all one is attempting to do is solve them without looking at either the lecture videos or slides.
(Translating from Python to Octave might, on the other hand, require more effort in comparison to implementing the solutions in Octave.)
Andrew Ng gives you a free introductory course in one of the hottest topics in computing, and in exchange asks you not to do one thing. And you do that one thing. I have my solutions in Octave, and it would be really convenient for me to back them up on GitHub, but I keep them on a USB stick for this very reason. I am respecting the wishes of the man who was so kind as to teach me about machine learning.
You should take them down if you don't have the explicit permission to share them as the honor code states. You don't have explicit permission, do you?
And just as a practical matter, there are dozens of github repos with literal (as in copy, paste, submit, done) solutions to these problems already available.
You're certainly free to disagree, but I do not view this as violating either the spirit or the intent of the honor code. This content has been out for years and is not in any way novel or unique. It's simply another vector through which the material can be learned, possibly opening it up to an even wider audience. Which is, I believe, what Andrew's goal was all along.
I liked this class, and Andrew seems like a great guy, but I'd like to point out that it's no longer free to take the evaluations for Coursera courses. Coursera is a startup and needs to find a way to monetize its courses to stay in business. I have no problem with that, but I think it's a little disingenuous to present Coursera as a free service when it's clearly not.
Anyway, my point stands. I can take these courses for free, despite people on HN and reddit claiming you can't.
When I took the class earlier this year, the answer was, effectively, "no". I mean, yeah, you could do some trickery with calling Python from Octave using whatever FFI Octave has, or you could possibly reverse engineer the protocol they use to talk from your code to the upstream server... but anybody doing all that would be doing more work than just completing the assignments in Octave to begin with.
https://github.com/mstampfer/Coursera-Stanford-ML-Python is an example.