
Machine learning is easier than it looks - jasonwatkinspdx
http://insideintercom.io/machine-learning-way-easier-than-it-looks/
======
eof
I feel I'm in a _somewhat_ unique position to talk about the easiness/hardness
of machine learning; I've been working for several months on a project with a
machine learning aspect alongside a well-cited, respected scientist in the
field. But I effectively "can't do" machine learning myself. I'm a primarily
'self-trained' hacker; I started programming by writing 'proggies' for AOL in
middle school, around 1996.

My math starts getting pretty shaky around Calculus; vector calculus is beyond
me.

I did about half of Andrew Ng's machine learning class on Coursera. Machine
learning is conceptually much simpler than one would guess, both gradient
descent and the shallow-neural-network type, and it is actually pretty simple
to get basic things to work.

I agree with the author that the notation, etc, can be quite intimidating vs
what is "really going on".

However, _applied_ machine learning is still friggin' hard, at least to me,
and I consider myself a pretty decent programmer. Naive solutions are just
unusable in almost any real application. The author's use of loops and maps
is great for teaching machine learning, but everything needs to be
transformed into higher-level vector/matrix problems in order to be genuinely
useful.

That isn't unattainable by any means, but the fact remains (imho) that without
a strong base in vector calculus and the idiosyncratic techniques for
transforming these problems into more efficient computations, usable
machine learning is far from "easy".
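
To illustrate the kind of transformation I mean, here's a toy sketch (my own
example, not from the article): the same predictions computed with an explicit
Python loop and as a single matrix-vector product in numpy.

      import numpy as np

      X = np.random.rand(10000, 50)   # 10k examples, 50 features
      w = np.random.rand(50)          # a weight vector

      # naive: one dot product per example, in a Python loop
      preds_loop = [sum(xj * wj for xj, wj in zip(x, w)) for x in X]

      # vectorized: the same computation as one matrix-vector product
      preds_vec = X.dot(w)

      assert np.allclose(preds_loop, preds_vec)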

~~~
csmatt
Does there exist a tool that will translate mathematical symbols or an entire
equation of them into plain English?

~~~
andreasvc
This sounds to me like a pointless exercise. There is a reason for using
mathematical notation for non-trivial formulas, which is that it is more
compact and succinct, allowing it to convey information efficiently and
unambiguously. Think of a formula with a few levels of parentheses; you're not
going to be able to express that clearly in a paragraph of text. It's not so
much the symbols and notation themselves which are hard to grasp, but the
mental model of the problem space; once you have that, the formula will
usually make sense because you can relate it to this mental model.

~~~
icelancer
>There is a reason for using mathematical notation for non-trivial formulas,
which is that it is more compact and succinct, allowing it to convey
information efficiently and unambiguously.

Maybe, but not always. Remember that Richard Feynman took great issue with how
integration was taught in most math classes and devised his own method
(inspired by the Calculus for the Practical Man texts).

~~~
andreasvc
You can always try to find an even better notation, but the only point I was
making is that for certain cases anything is better than a wall of awkward
text.

------
hooande
Most machine learning concepts are very simple. I agree with the author that
mathematical formulae can be unnecessarily confusing in many cases. A lot of
the concepts can be expressed very clearly in code or plain English.

For example, a matrix factorization could be explained with two arrays, a and
b, that represent objects in the prediction:

    
    
  for each example:
    prediction = 0
    for each weight w:
      prediction += a[w] * b[w]
    err = actual_value - prediction
    for each weight w:
      a[w] += err * b[w] * small_number
      b[w] += err * a[w] * small_number
    

It's that simple: multiply the weights of a by the weights of b, calculate the
error, adjust the weights, repeat.
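
Here's a runnable version of that loop in Python, as a sketch (the sizes,
target value, and learning rate are made up; `small_number` is the learning
rate):

      import random

      n = 10
      small_number = 0.01  # the learning rate
      a = [random.random() for _ in range(n)]
      b = [random.random() for _ in range(n)]
      actual_value = 3.0

      for _ in range(1000):
          prediction = sum(a[w] * b[w] for w in range(n))
          err = actual_value - prediction
          for w in range(n):
              # update both factors from the same error signal
              a_new = a[w] + err * b[w] * small_number
              b_new = b[w] + err * a[w] * small_number
              a[w], b[w] = a_new, b_new

      print(sum(a[w] * b[w] for w in range(n)))  # close to 3.0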

K-Nearest Neighbor/KMeans are based on an even simpler operation:

    
    
      dist = 0
      for each weight w: dist += (a[w] - b[w])**2
    

Then make predictions/build clusters based on the smallest aggregate distance.
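
For instance, a toy nearest-neighbor prediction built on nothing but that
distance (the points and labels here are hypothetical):

      def sq_dist(a, b):
          # squared Euclidean distance between two points
          return sum((a[w] - b[w]) ** 2 for w in range(len(a)))

      def predict(new_point, examples):
          # examples: list of (point, label); return the closest label
          return min(examples, key=lambda e: sq_dist(new_point, e[0]))[1]

      training = [([0, 0], "blue"), ([5, 5], "red")]
      print(predict([1, 1], training))  # -> "blue"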

There are more advanced concepts. There are some serious mathematics involved
in some predictors. But the most basic elements of statistical prediction are
dead simple for a trained programmer to understand. Given enough data, 80%
solutions can easily be achieved with simple tools.

We should be spreading the word about the simplicity of fundamental prediction
algorithms, not telling people that it's hard and a lot of math background is
required. Machine learning is very powerful and can improve all of our lives,
but only if there is enough data available. Since information tends to be
unevenly distributed we need to get the tools into the hands of as many people
as possible. It would be much better to focus on the concepts that everyone
can understand instead of keeping statistics secret behind the ivy-clad walls
of academia.

~~~
ACow_Adonis
Spot on. I agree completely. I work at this stuff for a living, and it never
ceases to amaze me how the academic literature and math papers discuss
fundamentally simple ideas using styles and techniques programming evolved
past in the last few decades: comments, structure, not-Greek-characters (but
they make me look smart!), abstractions. When was the last time you saw a
stat/math paper in machine learning with comments/structure? And guess what,
it's as undecipherable as code without comments/structure.

On the other hand, I'm also learning that what I/others find easy, a great
deal of people/programmers find hard. The number of programmers/hackers who
can actually implement such techniques on new real-world problems without
someone holding their hand is, I'm discovering, a very small minority. So
maybe it's harder than it looks after all, and we just think it's easy because
we've spent so much time with it? After all, programming isn't a walk in the
park for most people, and machine learning isn't a walk in the park for most
programmers.

~~~
bnegreve
> _When was the last time you saw a stat/math paper in machine learning with
> comments/structure? And guess what, it's as undecipherable as code without
> comments/structure._

What do you mean? Academic papers are usually written in plain English, with
mathematical formulas where necessary. What kind of comments would you
like to see?

~~~
dylandrop
Well, to quote the article, you can see things like "p is absolutely
continuous with respect to the Lebesgue measure on En" and hundreds of
variable names and sub/superscripts, none of which are intuitively named. It's
really hard to argue that anyone who is not in academia would understand this
without passing through it multiple times.

That being said, I think mathematicians should be perfectly permitted to do
this, seeing as most people who read their papers are themselves
mathematicians. Spelling out every dumb detail would probably just be a
waste of their time for the one case where some brave soul of a programmer
tries to decipher it.

~~~
icelancer
Like another comment above alluded to, mathematicians tend to have parallel
gripes about code. While technically we're speaking the same language, the
dialect is often quite different.

~~~
dylandrop
Yes, which is why I mentioned that it makes sense for mathematicians to use
the language they do.

------
j2kun
The author clearly didn't read the page of the math paper he posted in trying
to argue his point. It says, and I quote:

 _Stated informally, the k-means procedure consists of simply starting with k
groups each of which consists of a single random point, and thereafter adding
each new point to the group whose mean the new point is nearest._

Admittedly, it's not the prettiest English sentence ever written, but it's
just as plainly and simply stated as the author of this article puts it.

The article itself is interested in _proving_ asymptotic guarantees of the
algorithm (which the author of the article seems to completely ignore, as if
it were not part of machine learning at all). Of course you need mathematics
for that. If you go down further in the paper, the author reverts to a simple
English explanation of the various parameters of the algorithm and how they
affect the quality of the output.

So basically the author is cherry-picking his evidence and not even doing a
very good job of it.

------
munificent
This was a great post because I've heard of "k-means" but assumed it required
more math than my idle curiosity would be willing to handle. I love
algorithms, though, and now I feel like I have a handle on this. That's
awesome!

However, the higher level point of the post "ML is easy!" seems more than a
little disingenuous. Knowing next to nothing about machine learning, obvious
questions still come to mind:

Since you start with random points, are you guaranteed to reach a global
optimum? Can it get stuck?

How do you know how many clusters you want? How do I pick K?

This assumes that distance in the vector space strongly correlates to
"similarity" in the thing I'm trying to understand. How do I know my vector
model actually does that? (For example, how does the author know "has some
word" is a useful metric for measuring post similarity?)

I like what I got out of the post a lot, but the "this is easy" part only
seems easy because it swept the hard part under the rug.

~~~
andreasvc
You are asking exactly the right questions. As far as I know k-means will work
well when you can answer those questions for your problem, otherwise not so
much. In other words there's no silver bullet.

If you rephrase the "ML is easy" idea from the article to "it's easy to do
some simple but cool things with ML" then it's true, but in pushing the
envelope you can make it as complex as you like.

------
sieisteinmodel
Also: aerodynamics is not really hard, anyone can fold paper planes! Or:
programming 3D games is easy, just build new levels for an old game! Or: I
don't know what I am doing here, but look, this photoshop effect looks really
cool on my holiday photos!

etc.

Seriously: the writer would not be able to write anything about K-Means if not
for people looking at it from a mathematical viewpoint. This angle is of
tremendous importance if you want to know how your algorithm behaves in corner
cases.

This does not suffice if you have an actual application (e.g. a
recommendation, hand-tracking, or object-recognition engine). These need
to work _as well as you can make them_ , because every improvement will
result in $$$.

~~~
nikatwork
Author is not proposing "1 weird machine learning trick mathematicians HATE!"

They are encouraging non-maths people to give the basics a try even though it
seems intimidating. And it worked on me: like some others in this thread, my
maths is shaky beyond differential calculus and my eyes glaze over when I see
notation, but now I'd like to give this a whirl. I have no illusions that I
will suddenly be an expert.

------
Daishiman
It's easy until you have to start adjusting parameters, understanding the
results meaningfully, and tuning the algorithms for actual "Big Data". Try
doing most statistical analysis with dense matrices and watch your app go out
of memory in two seconds.

It's great that we can stand on the shoulders of giants, but having a certain
understanding of what these algorithms are doing is critical for choosing them
and the parameters in question.

Also, K-means is relatively easy to understand intuitively. Try doing that
with Latent Dirichlet Allocation, Pachinko Allocation, etc. Even Principal
Component Analysis and Linear Least Squares have some nontrivial properties
that need to be understood.

------
amit_m
tl;dr: (1) Author does not understand the role of research papers (2) Claims
mathematical notation is more complicated than code and (3) Thinks ML is easy
because you can code the wrong algorithm in 40 lines of code.

I will reply to each of these points:

1\. Research papers are meant to be read by researchers who are interested in
advancing the state of the art. They are usually pretty bad introductory
texts.

In particular, mathematical details regarding whether or not the space is
closed, complete, convex, etc. are usually both irrelevant and
incomprehensible to a practitioner but are essential to the inner workings of
the mathematical proofs.

Practitioners who want to apply the classic algorithms should seek a good
book, a Wikipedia article, a blog post, or a survey paper. Just about anything
OTHER than a research paper would be more helpful.

2\. Mathematical notation is difficult if you cannot read it, just like any
programming language. Try learning to parse it! It's not that hard, really.

In cases where there is an equivalent piece of code implementing some
computation, the mathematical notation is usually much shorter.

3\. k-means is very simple, but it's the wrong approach to this type of
problem. There's an entire field called "recommender systems" with algorithms
that would do a much better job here. Some of them are pretty simple too!

~~~
Aardwolf
I'm pretty good at logic, problem solving, etc., but I do find parsing
mathematical notation quite hard. Is there actually a good way to learn it?

What I have the most difficulty with is that it's not always clear which
symbols/letters are knowns, which are unknowns, and which are values you
choose yourself. Not all symbols/letters are introduced; you sometimes have to
guess what they are. Sometimes the axes of graphs are not labeled. Sometimes
explanations or examples for border cases are missing. And sometimes, in
slides, parsing the formulas takes too much time compared to the pace of the
presentation, or the memory of what was on previous slides fades away, so a
formula on a later slide that uses something from a previous one can no longer
be parsed.

Also, when you need to program the [whatever is explained mathematically in a
paper], you have to tell the computer exactly how it works, for every edge
case, while in math notation people can and will be inexact.

Maybe there should be a compiler for math notation that gives an error if it's
incomplete. :)

~~~
amit_m
You probably want to look at a couple of good undergrad textbooks (calculus,
linear algebra, probability). The good textbooks explain the notation and have
an index for all the symbols.

Unfortunately, in most cases you have to know a little bit about the field in
order to be able to parse the notation. The upside is that having some
background is pretty much a necessity for not screwing up when you try to
implement some algorithm.

------
myth_drannon
On Kaggle "The top 21 performers all have an M.S. or higher: 9 have Ph.D.s and
several have multiple degrees (including one member who has two Ph.D.s)."

[http://plotting-success.softwareadvice.com/who-are-the-
kaggl...](http://plotting-success.softwareadvice.com/who-are-the-kaggle-big-
data-wizards-1013/)

~~~
vikp
I've done pretty well at Kaggle with just a Bachelor's in American History:
[http://www.kaggle.com/users/19518/vik-
paruchuri](http://www.kaggle.com/users/19518/vik-paruchuri), although I
haven't been using it as much lately. A lot of competition winners have lacked
advanced degrees. What I like most about Kaggle is that the only thing that
matters is what you can show. I learned a lot, and I highly recommend it to
anyone starting out in machine learning. I expanded on some of this stuff in a
Quora answer if you are interested: [http://www.quora.com/Kaggle/Does-
everyone-have-the-ability-t...](http://www.quora.com/Kaggle/Does-everyone-
have-the-ability-to-do-well-on-Kaggle-competitions-if-they-put-enough-time-
and-effort-into-them).

~~~
nl
Man..

To other readers, follow that link at your own risk. Vikp has a blog[1] which
has hours and hours of excellent writing about machine learning.

[1] [http://vikparuchuri.com/blog/](http://vikparuchuri.com/blog/)

~~~
vikp
Thanks, nl! Now I know that at least one person reads it.

~~~
sk2code
Count me as 2. Though I am sure there are lots of people who will be reading
the material on your blog. Pretty good stuff.

------
xyzzyz
I'd like to chime in here as a mathematician.

Many people here express their feelings that math or computer science papers
are very difficult to read. Some even suggest that they're deliberately
written this way. The truth is that yes, they are in fact deliberately written
this way, but the reason is actually the opposite of many HNers' impression:
authors want to make the papers _easier_ to understand, not more
difficult.

Take for example a page from the paper that's linked in this article. Someone
here on HN complains that the paper talks about "p being absolutely continuous
with respect to the Lebesgue measure on En", hundreds of subscripts and
superscripts, and unintuitively named variables, and that this makes the paper
very difficult to understand, especially without doing multiple passes.

For non-mathematicians, it's very easy to identify with this sentiment. After
all, what does it even mean for a measure to be absolutely continuous with
respect to Lebesgue measure? Some of these words, like "measure" or
"continuous", make some intuitive sense, but how can a "measure" be
"continuous" with respect to some other measure, and what the hell is Lebesgue
measure anyway?

Now, if you're a mathematician, you know that Lebesgue measure in simple cases
is just the natural notion of area or volume, but you also know that it's very
useful to be able to measure much more complicated sets than just rectangles,
polyhedra, balls, and other similarly regular shapes. You know the Greeks
successfully approximated areas of curved shapes (like a disk) by polygons, so
you try to define such a measure by inscribing or circumscribing nice, regular
shapes for which the measure is easy to define, but you see it only works for
very simple and regular shapes, and is very hard to work with in practice. You
learned that Henri Lebesgue constructed a measure that assigns a volume to
most sensible sets you can think of (indeed, it's hard to even come up with an
example of a non-Lebesgue-measurable set), you've seen the construction of
that measure, and you know that it's indeed a cunning and nontrivial piece of
work. You also know that any measure on Euclidean space satisfying some
natural conditions (like the measure of a rectangle with sides a, b being
equal to the product ab, and the measure not changing if you move a set around
without changing its shape) must already be Lebesgue measure. You have also
worked a lot with Lebesgue measure, it being arguably the most important
measure of them all. You have an intimate knowledge of Lebesgue measure. Thus,
you see a reason to honor Lebesgue by naming the measure he constructed after
him. Because of all of this, whenever you read or hear about Lebesgue measure,
you know precisely what you're dealing with.

You know that a measure p is absolutely continuous with respect to q if,
whenever q(S) is zero for some set S, p(S) is also zero. You also know that if
you tried to express the concept defined in the previous sentence without
using names for the measures involved and a notation for the value a measure
assigns to some set, the sentence would come out awkward and complicated,
because you would have to say that a measure is absolutely continuous with
respect to some other measure if, whenever that other measure assigns a zero
value to some set, the value assigned to that set by the first measure must be
zero as well. You also know that, since you're not a native English speaker
(and I am not), your chance of making a grammatical error in a sentence
riddled with prepositions and conjunctions is very high, and that would make
the sentence even more awkward. Your programmer friend suggested that you
should use more intuitive and expressive names for your objects, but p and q
are just any measures, and apart from the property you're now trying to
define, they don't have any additional interesting properties that would help
you find names more sensible than SomeMeasure and SomeOtherMeasure.

But you not only know the definition of absolute continuity of measures: in
fact, if the definition were the only thing you knew about it, you'd have
forgotten it long ago. You know that absolute continuity is important because
of the Radon-Nikodym theorem, which states that if p is absolutely continuous
with respect to q, then p(A) is in fact the integral over A of some function g
with respect to measure q (that is, p(A) = int_A g dq). You know that it's
important because it can help you reduce many questions about measure p to
questions about the behaviour of function g with respect to measure q (which
in our machine learning case is a measure we know very, very well, the
Lebesgue measure).
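
In symbols (a condensed restatement; the sigma-finiteness hypotheses are
omitted):

      \text{Definition: } p \ll q \iff \big( q(S) = 0 \Rightarrow p(S) = 0 \big)

      \text{Radon-Nikodym: } p \ll q \Rightarrow \exists\, g \ge 0
      \text{ such that } p(A) = \int_A g \, dq, \qquad g = \frac{dp}{dq}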

You also know why the hell it's called absolutely continuous: if you think
about it for a while, the function g we just mentioned is kind of like a
derivative of measure p with respect to measure q, kind of like dp/dq. Now, if
you write p(A) = int_A (dp/dq) dq = int_A p'(q) dq, even though none of the
symbols dp/dq or p'(q) make sense, it seems to mean that p is an "integral of
its derivative", and you recall that there's a class of real-valued functions
for which this is true as well; guess what, the class of absolutely continuous
functions. If you think about these concepts even harder, you'll see that the
latter concept is a special case of our absolutely continuous measures, so all
of this makes perfect sense.

So anyway, you read that "p is absolutely continuous with respect to Lebesgue
measure", and instantly tons of associations light up in your memory: you know
what they are working with, you have some ideas why they might need it,
because you remember making a similar assumption in some similar context to
obtain some result (and as you read the paper further, you realize you were
right). All of what you're reading makes perfect sense, because you are very
familiar with the concepts the author introduces, with the methods of working
with them, and with the known results about them. Every sentence you read is a
clear consequence of the previous one. You feel you're home.

 _..._

Now, in an alternate reality, a nonmathematician-you also tries to read the
same paper. As the alternate-you hasn't spent months and years internalizing
these concepts until they became vis second nature, ve has to look up every
other word, digressing into Wikipedia to run a DFS for the connected component
containing a concept ve doesn't yet understand. You spend hours, and afterwards
you feel you've learned nothing. You wonder if the mathematicians deliberately
try to make everything complicated.

Then you read a blog post which expresses the idea behind this paper very
clearly. Wow, you think, these asshole mathematicians are really trying to
keep their knowledge in an ivory tower of obscurity. But, since you only made
it through the first few paragraphs of the paper, you missed an intuitive
explanation that's right there on the page of the paper reproduced by that
blog post:

 _Stated informally, the k-means procedure consists of simply starting with k
groups each of which consists of a single random point, and thereafter adding
each new point to the group whose mean the new point is nearest. After a point
is added to a group, the mean of that group is adjusted in order to take
account of that new point._

Hey, so there was an intuitive explanation in that paper after all! So, what
was all that bullshit about measures and absolute continuity all about?

You try to implement an algorithm from the blog post, and, as you finish, one
sentence from blog post catches your attention:

 _Repeat steps 3-4. Until documents’ assignments stop changing._

You wonder: when does that actually happen? How can you be sure that the
assignments will stop changing at all at some point? The blog post doesn't
mention that. So you grab that paper again...

~~~
Houshalter
I can get that academic papers do this, but even Wikipedia articles are
written in this cryptic language. For example, you read about some algorithm
and want to find out how it works and you get this:
[https://upload.wikimedia.org/math/f/1/c/f1c2177e965e20ab29c4...](https://upload.wikimedia.org/math/f/1/c/f1c2177e965e20ab29c4ba51f70bbdfc.png)

You try to highlight some of the symbols and find out that it's an image. You
can't even click on things to find out what they mean, as is standard for
everything else written in text. You try to google something like "P function"
and get completely unrelated stuff.

~~~
nilkn
Symbols are used because equations like that would be crazy long and hard to
keep track of if every symbol were a fully-formed word or phrase. It's an
unfortunate side effect that those unfamiliar with mathematical notation are
somewhat left in the dark.

The real problem here is the difficulty of looking up the meaning of a symbol
you don't know. The notation is great once you know it, but it's a pain to
learn if you're doing so completely out of context. Maybe you don't know that
the sigmas there represent summation. Your idea of just clicking the sigma and
seeing an explanation is a pretty great one that I'd like to see implemented.

~~~
eli_gottlieb
Crazy long and difficult-to-track equations are why _function abstraction_ was
invented. If this were code, we would say to cut it into understandable
chunks, label them as functions, and then compose them.

~~~
nilkn
Writing it all as code would exclude nonprogrammers just as much as this
excludes people not familiar with basic mathematical notation.

At least everybody should share some common mathematical instruction from
school. The same certainly cannot be said of programming.

As somebody who can read both code and math, though, I'd greatly, greatly
prefer to see math in math notation, not code. Code would be so much more
verbose that it would take easily twice as long to digest the information.
You'd also completely lose all the incredibly powerful visual reasoning that
comes along with modern mathematical notation.

------
kephra
The question "do I need hard math for ML?" often comes up in #machinelearning
on irc.freenode.net.

My point here is: you don't need hard math (most of the time), because most
machine learning methods are already coded in half a dozen different
languages. So it's similar to FFT: you do not need to understand why FFT
works, just when and how to apply it.

The typical machine learning workflow is: data mining -> feature extraction ->
applying an ML method.
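
As a minimal sketch of that workflow in Python (scikit-learn; the texts and
labels here are made up):

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB

      # data mining: assume we've already collected labeled texts
      texts = ["cheap pills now", "meeting at noon", "win cash fast"]
      labels = [1, 0, 1]  # 1 = spam

      # feature extraction: bag-of-words counts
      vec = CountVectorizer()
      X = vec.fit_transform(texts)

      # applying an ML method
      clf = MultinomialNB().fit(X, labels)
      print(clf.predict(vec.transform(["free cash pills"])))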

I often joke that I'm using Weka as a hammer, to check if I managed to shape
the problem into a nail. Now, the critical part is feature extraction. Once
this is done right, most methods show more or less good results; just pick the
one that fits best in results, time, and memory constraints. You might need to
recode the method from Java to C to speed it up, or to embed it. But this
requires nearly no math skills, just code reading, writing, and testing
skills.

------
tlarkworthy
I find newbs in ML don't appreciate cross validation. That's the one main
trick: keep some data out of the learning process to test an approach's
ability on data it has not seen. With this one trick you can determine which
algorithm is best, and the parameters. Advanced stuff like Bayesian methods
means you don't need it, but for your own sanity you should still always cross
validate. Machine learning is about generalisation to unseen examples, and
cross validation is the metric that tests this. Machine learning is cross
validation.
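
A minimal sketch of that one trick (scikit-learn, synthetic data; my example,
not the parent's):

      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      X, y = make_classification(n_samples=1000, random_state=0)

      # hold some data out of the learning process entirely
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, random_state=0)

      model = LogisticRegression().fit(X_train, y_train)
      print(model.score(X_test, y_test))  # accuracy on unseen data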

~~~
syllogism
First, you're always going to need to evaluate supervised models on some data
you didn't train on. No method will ever relieve the need for that.

Second, I never use cross-fold if I can help it. It's a last resort if the
data's really small. Otherwise, use two held-out sets, one for development,
and another for testing. Cross-fold actually really sucks!

The smaller problem is that cross-validation has you repeating everything N
times, when your experiment iteration cycle is going to be a big bottleneck.
The bigger problem is that it can introduce a lot of subtle bugs.

Let's say you need to read over your data to build some sort of dictionary of
possible labels for the instances, given one of the attributes. This can save
you a lot of time. But if you're cross-validating, you can forget to
cross-validate these pre-processes too, and so perform a subtle type of
cheating. Coding that stuff correctly often has you writing cross-fold
validation logic into the run-time. Nasty.

With held-out data, you can ensure your run-time never has access to the
labels. It writes out the predictions, and a separate evaluation script is
run. This way, you know your system is "clean".

Finally, when you're done developing, you can do a last run on a test set.

Bonus example of cross-fold problems: sometimes your method assumes
independence between examples, even if this isn't strictly true. For instance,
if you're doing part-of-speech tagging, you might assume that sentences in
your documents are independent of each other (i.e. you don't have any cross-
sentence features). But they're not _actually_ independent! If you cross-fold,
and get to evaluate on sentences from the same documents you've trained on,
you might do much better than in a realistic evaluation, where you have to
evaluate on fresh documents.

~~~
tlarkworthy
Yes, I realise the importance of a testing set too. I see that as just another
level of validation: hierarchies of left-out data. If you understand in your
heart what cross validation _is_, then the step to using a testing set is a
trivial extension.

I brought up cross validation in particular because I see PhDs eating books
on advanced techniques like structure learning, but they have no practical
experience. They ignore the basics for so long, because they don't realise
validation is the most important bit. If you have not drawn the curve of
training-set error and validation error against training time, you have not
done machine learning yet...

This article doesn't mention it either :(

------
upquark
Math is essential for this field; anyone who tells you otherwise doesn't know
what they are talking about. You can hack together something quick and dirty
without understanding the underlying math, and you certainly can use existing
libraries and tools to do some basic stuff, but you won't get very far.

Machine learning is easy only if you know your linear algebra, calculus,
probability and stats, etc. I think this classic paper is a good way to test
if you have the right math background to dive deeper into the field:
[http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pd...](http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf)

------
Ihmahr
As a graduate in artificial intelligence and machine learning I can tell you
that machine learning IS hard.

Sure, the basic concepts are easy to understand. Sure, you can hack together a
program that performs quite well on some tasks. But there are so many
(interesting) problems that are not at all easy to solve or understand.

Like structural engineering: it is easy to understand the concepts, and it is
even easy to build a pillow fort in the living room, but it is not easy to
build an actual bridge that is light, strong, etc.

------
pallandt
It's actually incredibly hard, especially if you want to achieve better
results than with a current 'gold standard' technique/algorithm applied to
your particular problem.

While the article doesn't have this title (why would you even choose one with
such a high bias?), I presume the submitter decided upon it after being
encouraged by this statement from the article's author: 'This data indicates
that the skills necessary to be a data "wizard" can be learned in disciplines
other than computer sciences and mathematics.'

This is a half-baked conclusion. I'd reason most Kaggle participants are,
first of all, machine learning fans, either professionals or 'amateurs' with
no formal qualifications who have studied it as a hobby. I doubt people with a
degree in cognitive sciences, or otherwise in the 'other' categories mentioned
in the article, learned enough just through their university studies to
readily be able to jump into machine learning.

------
tptacek
Is k-means really what people are doing in serious production machine-learning
settings? In a previous job, we did k-means clustering to identify groups of
similar hosts on networks; we didn't call it "machine learning", but rather
just "statistical clustering". I had always assumed the anomaly models we
worked with were far simpler than what machine learning systems do; they
seemed unworthy even of the term "mathematical models".

~~~
patio11
_Is k-means really what people are doing in serious production machine-
learning settings?_

Yes, that is one option. Ask me over Christmas and I'll tell you a story about
a production system using k-means. (Sorry guys, not a story for public
consumption.)

More broadly, a _lot_ of machine learning / AI / etc is simpler than people
expect under the hood. It's almost a joke in the AI field: as soon as you
produce a system that actually works, people say "Oh that's not AI, that's
just math."

~~~
glimcat
Most of the classifiers I've seen used in industry aren't even k-means.
They're fixed heuristics (often with plausibly effective decision boundaries
found via k-means).

Consider the Netflix Prize, where the most effective algorithm couldn't
actually be used in production. Commercial applications tend to favor "good
enough" classification with low computational complexity.

> It's almost a joke in the AI field: as soon as you produce a system that
> actually works, people say "Oh that's not AI, that's just math."

I think that's the standard "insert joke here" any time someone has a "What is
AI?" slide.

------
misiti3780
I disagree with this article, although I did find it interesting. Replace
k-means with a supervised learning algorithm like an SVM, and use features
more complicated than binary ones, and this article would read a lot
differently.

Also, maybe "article recommendation" is "easy" in this context, but other
areas such as computer vision and sentiment analysis are not. Some other
questions I might ask:

How do you know how well this algorithm is performing?

How are you going to compare this model to other models? Which metrics will
you use? What statistical tests would you use and why?

What assumptions are you making here? How do you know you can make them, and
why?

There are a lot of things that this article fails to address.

Disclaimer: I realize more complex models + features don't always lead to
better performance, but you need to know how to verify that to be sure.

~~~
iskander
SVM prediction is pretty straightforward, even if you're using a kernel/dual
formulation:

"Here are some training points from classes A and B, each with an importance
weight. Compute the weighted average of your new point's similarity to all the
training exemplars. If the total comes out positive then it's from class A,
otherwise it's from class B."

If you tried to explain training from a quadratic programming perspective it
would get messy, but if you use stochastic gradient descent then even the
training algorithm (for a primal linear SVM) is pretty light on math.
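
A sketch of that light-on-math version (primal linear SVM trained with SGD on
the hinge loss; plain Python with toy data, not Bottou's actual code):

      import random

      # toy 2D points with labels +1 / -1
      data = [([2.0, 1.0], 1), ([1.5, 2.0], 1),
              ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]

      w, b, lr, lam = [0.0, 0.0], 0.0, 0.1, 0.01

      for _ in range(100):
          x, y = random.choice(data)
          margin = y * (w[0] * x[0] + w[1] * x[1] + b)
          for i in range(2):
              w[i] -= lr * lam * w[i]       # regularization (weight decay)
              if margin < 1:
                  w[i] += lr * y * x[i]     # hinge-loss subgradient step
          if margin < 1:
              b += lr * y

      # predict: the sign of the decision function
      print([1 if w[0] * p[0] + w[1] * p[1] + b > 0 else -1
             for p, _ in data])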

Of course, this is only possible after a decade or two of researchers have
played with and digested the ideas for you. I think until Léon Bottou's SGD
SVM paper, even cursory explanations of SVMs seemed to get derailed into
convex duality, quadratic programming solvers, and reproducing kernel Hilbert
spaces.

~~~
misiti3780
I'm not sure I agree; I'll say it differently:

As far as I know, k-means (as the author described it) has two parameters: the
initial number of clusters K and the max iterations (although I could be
wrong). SVMs have many more.

UPDATE: here is a gist:

[https://gist.github.com/josephmisiti/7572696](https://gist.github.com/josephmisiti/7572696)

taken from here :
[http://www.csie.ntu.edu.tw/~cjlin/libsvm/](http://www.csie.ntu.edu.tw/~cjlin/libsvm/)

I do not think that picking up LibSVM, compiling it, and running it is
straightforward for everyone...

~~~
iskander
I think there's a big difference between explaining the essence of an
algorithm and understanding all the details and decisions that go into a
particular implementation.

If explaining SVM, I'd stick with the linear primal formulation, which only
really requires choosing the slack parameter C. If I needed to perform or
explain non-linear prediction, I'd switch to the RBF kernel, which gives you
one additional parameter (the RBF variance).

Implementations of K-means, by the way, can also potentially have lots of
parameters. Check out: [http://scikit-
learn.org/stable/modules/generated/sklearn.clu...](http://scikit-
learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html#sklearn.cluster.MiniBatchKMeans)

~~~
misiti3780
Fair enough. BTW, I just realized I bookmarked your blog
([http://blog.explainmydata.com](http://blog.explainmydata.com)) a few weeks
back; I was really enjoying those posts. Nice work.

~~~
iskander
Thanks :-)

I'm going to finish grad school soon and the job I'm going to will be pretty
rich with data analysis, hopefully will lead to more blog posts.

------
apu
For those wanting to get started (or further) in machine learning, I highly
recommend the article, "A Few Useful Things to Know About Machine Learning,"
by Pedro Domingos (a well respected ML researcher):
[http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf).
It's written in a very accessible style (almost no math); contains a wealth of
practical information that everyone in the field "knows", but no one ever
bothered to write down in one place, until now; and suggests the best
approaches to use for a variety of common problems.

As someone who uses machine learning heavily in my own research, a lot of this
seemed like "common sense" to me when I read it, but on reflection I realized
that this is _precisely_ the stuff that is most valuable _and_ hardest to find
in existing papers and blog posts.

------
mrcactu5
The equations look fine to me; I was a math major in college. Honestly, I get
so tired of humanities people -- or programmers -- bragging about how much
they hate math.

Except:

[https://gist.github.com/benmcredmond/0dec520b6ab2ce7c59d5#fi...](https://gist.github.com/benmcredmond/0dec520b6ab2ce7c59d5#file-
kmeans-rb)

I didn't know k-means clustering was that simple. I am taking notes...

    
    
      * pick two centers at random
    
      run 15 times:
    
        * for each post, find the closest center
        * take the average point of your two clusters
          as your new center
    

This is cool. It is 2-means clustering and we can extend it to 5 or 13...
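
Here's a minimal runnable sketch of that extension in Python (toy random
points; k picked arbitrarily):

      import random

      def kmeans(points, k, iterations=15):
          centers = random.sample(points, k)
          for _ in range(iterations):
              clusters = [[] for _ in range(k)]
              # for each point, find the closest center
              for p in points:
                  dists = [sum((p[d] - c[d]) ** 2 for d in range(len(p)))
                           for c in centers]
                  clusters[dists.index(min(dists))].append(p)
              # take the average point of each cluster as its new center
              for i, cl in enumerate(clusters):
                  if cl:  # guard against empty clusters
                      centers[i] = [sum(p[d] for p in cl) / len(cl)
                                    for d in range(len(cl[0]))]
          return centers

      pts = [[random.random(), random.random()] for _ in range(100)]
      print(kmeans(pts, k=5))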

We don't need any more math, as long as we don't ask whether this algorithm
_converges_ or _how quickly_.

------
syllogism
I write academic papers, and I've started writing blog posts about them, and I
think this post doesn't cover one of the main reasons that academic papers are
less accessible to non-specialists.

When you write an academic paper, it's basically a diff on previous work. It's
one of the most important considerations when the paper first comes out. The
reviewers and the people up-to-the-minute with the literature need to see
which bit is specifically new.

But to understand your algorithm from scratch, someone needs to go back and
read the previous four or five papers --- and probably follow false leads
along the way!

It's another reason why academic code is often pretty bad. You really, really
should write your system to first replicate the previous result, and then
write your changes in on top of it, with a _bare minimum_ of branching logic,
controlled by a flag, so that the same runtime can produce both results. And
you should be able to look at each point where you branch on that flag, and
check that your improvements are only exactly what you say they are.

When you start from scratch and implement a good bang-for-buck idea, yes, you
can get a very simple implementation with very good results. I wrote a blog
post explaining a 200-line POS tagger that's about as good as any around.[1]
Non-experts would usually not predict that the code could be so simple, from
the original paper, Collins (2002).[2]

I've got a follow-up blog post coming that describes a pretty good parser that
comes in at under 500 lines, and performs about as accurately as the Stanford
parser. The paper I wrote this year, which adds 0.2% to its accuracy, barely
covers the main algorithm --- that's all background. Neither does the paper
before me, released late last year, which adds about 2%. Nor the paper before
that, which describes the features...etc.

When you put it together and chop out the false starts, okay, it's simple. But
it took a lot of people a lot of years to come up with those 500 lines of
Python... and they're almost certainly on the way towards a local maximum! The
way forward will probably involve one of the many other methods discussed
along the way, which don't help this particular system.

[1] [http://honnibal.wordpress.com/2013/09/11/a-good-part-of-
spee...](http://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-
tagger-in-about-200-lines-of-python/)

[2]
[http://acl.ldc.upenn.edu/W/W02/W02-1001.pdf](http://acl.ldc.upenn.edu/W/W02/W02-1001.pdf)

------
pyduan
As someone who works in machine learning, I have mixed feelings about this
article. While encouraging people to start learning about ML by demystifying
it is a great thing, this article comes off as slightly cocky and dangerous.
Programmers who believe they understand ML while only having a simplistic view
of it risk creating not just less-than-optimal algorithms but downright
dangerous models:
[http://static.squarespace.com/static/5150aec6e4b0e340ec52710...](http://static.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png)

In the context of fraud detection (one of the main areas I work in these
days), a model that is right for the wrong reasons might lead to catastrophic
losses when the underlying assumption that made the results valid suddenly
ceases to be true.

Aside from the fact that the techniques he mentions are some of the simplest
in machine learning (and are hardly those that would immediately come to mind
when I think "machine learning"), the top comment on the article is spot on:

> "The academic papers are introducing new algorithms and proving properties
> about them, you’re applying the result. You’re standing on giants’ shoulders
> and thinking it’s easy to see as far as they do."

While understanding _how_ an algorithm works is of course important (and I do
agree that algorithms are often more readable when translated to code),
understanding _why_ (and _when_ ) they work is equally important. Does
K-Means always reach a stable configuration? When can you expect it to
converge fast? How do you choose the number of clusters, and how does this
affect convergence speed? Does the way you initialize your centroids have a
significant effect on the outcome? If yes, which initializations tend to work
better in which situations?

These are all questions I might ask in an interview, but more importantly,
being able to answer these is often the difference between blindly applying a
technique and applying it intelligently. Even for "simple" algorithms such as
K-Means, implementing them is often only the tip of the iceberg.

~~~
benofsky
Hey, author here.

> Aside from the fact the techniques he mentioned are some of the simplest in
> machine learning (and are hardly those that would immediately come to mind
> when I think "machine learning")

The primary point of this article was the contrast between something as simple
as K-Means, and the literature that describes it. It wasn't meant as a full
intro to ML, but rather something along the lines of "give it a try, you might
be surprised by what you can achieve".

> Even for "simple" algorithms such as K-Means, implementing them is often
> only the tip of the iceberg.

Yup. But getting more people to explore the tip of the iceberg is, in my
opinion, a good thing. We don't discourage people from programming because
they don't instantly understand the runtime complexity of hash tables and
binary trees. We encourage them to use what's already built knowing that smart
people will eventually explore the rest of the iceberg.

~~~
pyduan
Thanks for responding. I fully agree with your comment -- as I said, I too
think many people are put off by the apparent complexity of machine learning,
and demystifying how it works is a great thing. Unfortunately there's always a
risk that a "hey, maybe this isn't so hard after all" turns into a "wow, that
was easy". While I think the former is great, the latter is dangerous because
machine learning is often used to _make decisions_ (sometimes crucial ones,
for example when dealing with financial transactions), so I would argue more
care should be taken than if we were talking about general-purpose
programming: if you trust an algorithm with making important business
decisions, then you'd better have an intimate knowledge of how it works.

While I again agree with the underlying sentiment, I was just a bit
disappointed that it seems to invite the reader to be satisfied with himself
rather than motivate him to dig deeper. Nothing a future blog post can't solve
though!

------
rdtsc
A lot of concepts are easier when you know how they work.

CPUs were magical for me before I took a computer architecture course. So was
AI and machine learning. Once you see the "trick" so to speak you lose some of
the initial awe.

------
aidos
Most of the comments on here are from people in the field of ML saying "this
is a toy example, ML is hard."

Maybe that's the case. And maybe the title of the submission ruffled some
feathers but the thrust of it is that ML is approachable. I'm sure there's
devil in the detail, but it's nice for people who are unfamiliar in a subject
to see it presented in a way that's more familiar to them with their current
background.

I have a university background in Maths and Comp Sci, so I'm not scared of
code or mathematical notation. Maybe if I'd read the comments on here I'd get
the sense that ML is too vast and difficult to pick up. I'm doing Andrew Ng's
Coursera course at the moment and so far it's all been very easy to
understand. I'm sure it gets harder (I even hope so), and maybe I'll never get
to the point where I'm an expert at it, but it would be nicer to see more of a
nurturing environment on here instead of the knee-jerk reactions this seems to
have inspired.

~~~
m_ke
Ng's Coursera class is really dumbed down, and although he gives a nice
intuitive explanation of the basics, it's nowhere near as hard as his actual
Stanford class (or any other ML class from a good engineering school).

~~~
mrcactu5
many people still can't pass it or understand it...

------
ronaldx
I'm cynical about how machine learning of this type might be used in practice
and this is an illustration of why: the stated goal is a "you might also like"
section.

There is no reason to believe the results are any better than a random method
with respect to the goal (and it's reasonable to believe they may be worse);
we would have to measure this separately, by clickthrough rate or a user
satisfaction survey, perhaps.

I believe you would get far better results by always posting the three most
popular articles. If you want to personalise, post personally-unread articles.
A lot less technical work, a lot less on-the-fly calculation, a lot more
effective. The machine learning tools do not fit the goal.

The most effective real example of a "you might also like" section is the Mail
Online's Sidebar of Shame. As best as I can tell, they display their popular
articles in a fixed order.

Machine Learning seems to make it easy to answer the wrong question.

~~~
andreasvc
This is really only an argument that the example in the article is not
realistic (which it doesn't have to be, it might be expository). There are in
fact countless applications of machine learning in actual daily use, such as
detecting credit card fraud, where simpler manual methods would perform
measurably worse in terms of money lost.

~~~
ronaldx
Sure, there are realistic applications of Machine Learning with great results.

But the article has failed in its headline goal ("ML is easier than it looks")
if it chooses an example that is mathematically more complicated and still
less effective than a naive alternative.

The first difficult task is to identify an effective method.

------
agibsonccc
Wait till you have to hand-craft your algorithms because the off-the-shelf
ones are too slow ;). You can stand on the shoulders of giants all day, but
until you actually sit down and write an SVM, or something more cutting-edge
like stacked deep autoencoders, yourself, machine learning isn't "easy".

In the end, libs are there for simpler use cases or educational purposes.
Realistically, that's more than good enough for 90% of people.

That being said, it's not impossible to learn. Oversimplifying the statistics,
tuning, and work that goes into the algorithms you're using, though? Not a
good idea.

------
danialtz
I recently read a book called "Data Smart" [1], where the author does k-means
and prediction algorithms literally in Excel. This was quite eye-opening, as
it shows that the entry into ML is not so enigmatic. However, translating your
data into a format/model on which to run ML is another challenge.

[1] [http://www.amazon.com/Data-Smart-Science-Transform-
Informati...](http://www.amazon.com/Data-Smart-Science-Transform-
Information/dp/111866146X)

------
panarky
Sure, some ML concepts are intuitive and accessible without advanced math.

But it would help to highlight some of the fundamental challenges of a
simplistic approach.

For example, how is the author computing the distance between points in
n-dimensional space?

And does this mean that a one-paragraph post and a ten-paragraph post on the
same topic probably wouldn't be clustered together?

~~~
dylandrop
Well, I don't think any of the problems you mentioned are difficult to solve,
even for a non-seasoned programmer.

For your first question: just use Euclidean distance, in other words
sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2). IMO, anyone who can
program can do that.

Also for your second question - probably - but he's using a very simplistic
model. To fix that, you simply fix the heuristic. For example, you could
instead score groups by how frequently certain important words occur, like
"team" and "support".

~~~
m_ke
You have binary features; why would you use Euclidean distance?

~~~
dylandrop
Well, in this case you can also use a simpler norm, in other words just adding
up the differences. The poster asked about finding the distances between
points in N-dimensional space in general (he didn't specify the author's
example). But the idea is that your means wouldn't be bucketed into just a
couple of values; they'd have a larger distribution. In any case, it doesn't
matter much for this problem, but if you scored it by word count rather than a
simple 1 or 0, it would matter.

------
sytelus
Building a model on some training data is easy. Debugging it is _hard_. The
machine-learned model is essentially the equivalent of code that gives
probabilistic answers, but there are no breakpoints to set or watches to add.
When the model doesn't give the answer you expect, you need to do _statistical
debugging_. Is the problem in the training data? Do you have correct sampling?
Are you missing a feature? Is your feature selection optimal? Does your
featurizer have a bug somewhere? The possibilities are endless. Debugging an
ML model is a black art. Most ML "users" would simply give up if the model
doesn't work on some cases, or if they added a cool feature but the results
are not that good. In my experience, less than 10% of people who claim ML
expertise are actually good at statistical debugging and able to identify the
issue when a model doesn't work as expected.

------
mau
tl;dr: the ML algorithms look hard when you read the papers, while the code
looks simpler and shorter, and you can get pretty decent results in a few
lines of R/Python/Ruby, so ML is not that complex.

I disagree in so many ways:

1\. Complex algorithms are usually very short in practice (e.g. Dijkstra's
shortest path or edit distance are the first that come to mind).

2\. ML is not just applying ML algorithms: you have to evaluate your results,
experiment with features, visualize data, and think about what you can exploit
and which patterns you can discover to improve your models.

3\. If you know the properties of the algorithms you are using, you can have
insights that might help you improve your results drastically. It's very easy
to apply the right algorithms with the wrong normalizations and still get
decent results in some tests.

------
hokkos
Matrix multiplication, orthonormal bases, triangular matrices, gradient
descent, integrals, Lebesgue measure, convexity, and the mathematical notation
in the paper are not harder than the code shown here. It is better to have
solid proof that what you are doing is sound and will converge before jumping
into the code.

------
outworlder
I don't get all the negative comments.

From my limited text comprehension abilities, the author did not say that the
whole field is trivial and that we should sack all the academics.

Instead, the argument is that basic machine learning techniques are easy and
one shouldn't be afraid of applying them.

~~~
freshhawk
Likely it was this part at the beginning that exposed the author's anti-
intellectualism/ignorance:

"The majority of literature on machine learning, however, is riddled with
complex notation, formulae and superfluous language. It puts walls up around
fundamentally simple ideas."

That's what got me; it's such a flat-out stupid thing to say. Especially the
"puts walls up" phrasing, implying that math beyond what the author
understands is a conspiracy. _Especially_ from someone who writes code...
which is full of "complex notation", symbols, and "superfluous
classes/libraries/error checking/abstractions" by the same absurd reasoning.

~~~
benofsky
Hi, author here. I love reading academic papers and I love math. But it's an
unfortunate fact that academic papers make up the majority of the literature
on ML, and that their notation and writing style exclude a large population
who would otherwise find them useful. That's all I'm trying to get at here.

~~~
freshhawk
They don't exclude people by using math any more than programmers exclude
people by using subclasses or anonymous functions.

I too would love to see more entry level ML resources, but we are talking
about academic papers here.

Imagine how stupid it would seem if some outsider came into HN and started
telling programmers to dumb down their code when it's open source, because
their advanced techniques are too hard for a novice to understand off the bat.
Instead of optimizing for efficiency, elegance, and readability for their
peers (the people they are collaborating with to solve actual problems),
they'd be told to always cater to novices.

The language in your post is textbook anti-intellectualism, isn't it? And
strangely entitled. You would certainly not apply these criteria to code, but
since it's not your field they must cater to your lack of experience? You know
better than they do how to communicate ML research to other experts?
------
samspenc
Upvoted this for an interesting read, but I agree with the sentiments in the
comments that (1) ML is in general hard (2) some parts of ML are not that
hard, but are likely the minority (3) we are standing on the shoulders of
giants, who did the hard work.

------
BjoernKW
The fundamentals of linear algebra and statistics are indeed quite easy to
understand. Common concepts and algorithms such as cosine similarity and
k-means are very straightforward.

Seemingly arcane mathematical notation is what frightens off beginners in many
cases, though. Once you've understood that, for instance, a sum symbol is
actually nothing but a loop, many things become a lot easier.
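
For instance, here is the sum symbol and its loop side by side (a trivial
sketch):

      x = [1, 2, 3, 4]

      # the sum symbol (sigma) over x_i is just this loop:
      total = 0
      for x_i in x:
          total += x_i

      assert total == sum(x)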

However, the devil's in the details. Many edge cases and advanced methods of
machine learning are really hard to understand. Moreover, when 'good enough'
just isn't good enough any more, things tend to become very complex very
quickly.

------
Irishsteve
The post does do a good job of showing how easy it is to implement knn.

The post doesn't really go into centroid selection or evaluation, or the fact
that clustering on text is going to be painful once you move to a larger
dataset.

------
gms
The difficult aspects take centre stage when things go wrong.

------
kamilafsar
A while back I implemented k-means in JavaScript. It's a really simple,
straightforward algorithm, which makes sense to me as a visual thinker and a
non-mathematician. Check out the implementation here:

[https://github.com/kamilafsar/k-means-
visualizer/blob/master...](https://github.com/kamilafsar/k-means-
visualizer/blob/master/k-means.js)

------
Toenex
I think this is one of the reasons why it should become standard practice to
provide code implementations of described algorithms. It not only provides an
executable demonstration of the algorithm but, as importantly, an alternative
description that may be more accessible to other audiences. It can also be
used as confirmation that what is intended is indeed what is being done.

------
adammil
It is nice to read about this in plain language. But can someone explain what
the X and Y axes are meant to represent in the graph?

~~~
davmre
They don't mean anything in particular. The _actual_ analysis is being done in
a high-dimensional space, in which each post is represented by a high-
dimensional vector of the form [0,0,1,0,...., 0,1,0]. The length of the vector
is the total number of distinct words used across all blog posts (maybe
something like 30,000), and each entry is either 0 or 1 depending on whether
the corresponding word occurs in this post. All the distances and cluster
centers are actually being computed in this 30000-dimensional space; the two-
dimensional visualization is just for intuition.

If you're wondering how the author came up with the two-dimensional
representation, the article doesn't say, but it's likely he used something
like Principal Component Analysis
([http://en.wikipedia.org/wiki/Principal_component_analysis](http://en.wikipedia.org/wiki/Principal_component_analysis)).
This is a standard technique for dimensionality reduction, meaning that it
finds the "best" two-dimensional representation of the original
30,000-dimensional points, where "best" in this case means something like
"preserves distances", so that points that were nearby in the original space
are still relatively close in the low-dimensional representation.
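
A sketch of that pipeline (scikit-learn; the posts are stand-ins, and the
author may well have used a different reduction):

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.decomposition import PCA

      posts = ["how we scaled our api", "hiring our first designer",
               "why we rewrote the api"]

      # binary bag-of-words: one dimension per distinct word
      X = CountVectorizer(binary=True).fit_transform(posts).toarray()

      # project the high-dimensional vectors down to 2D for plotting
      coords = PCA(n_components=2).fit_transform(X)
      print(coords)  # one (x, y) pair per post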

------
pesenti
Of the two methods described, search vs. clustering, the first one (simpler
and not involving ML) is better for this use case. The only reason it seems to
give worse results is that it's only used on the titles and not the full body
(unlike the clustering approach). So I guess machine learning is easier to
misuse than it looks...

------
dweinus
They should try using tf-idf to create the initial representation of the
keywords per post...also, I find there are many cases where applying machine
learning/statistics correctly is harder than it looks, this single case not
withstanding.
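
A sketch of that representation (scikit-learn; toy posts):

      from sklearn.feature_extraction.text import TfidfVectorizer

      posts = ["machine learning is easy", "machine learning is hard",
               "my cat learns nothing"]

      # tf-idf downweights words shared across most posts ("machine",
      # "learning", "is") relative to distinctive ones ("cat")
      X = TfidfVectorizer().fit_transform(posts)
      print(X.toarray().round(2))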

------
Rickasaurus
It may be easier to do than it looks, but it's also harder to do well.

------
m_ke
This is as valid as someone stating that computer science is easy because they
know HTML.

------
fexl
I like the simple explanation of K-Means, and I like the contrast with the
dense set-theoretic language -- a prime example of "mathematosis" as W.V.O.
Quine put it.

~~~
fexl
I admit that was an unfair swipe at formalism, but I do think that formalism
is far more effective in the light of a clear, intuitive exposition that
conveys the concept straight into the mind. Formalism is good for battening
things down tightly after the fact.

