Machine learning is easier than it looks (insideintercom.io)
459 points by jasonwatkinspdx on Nov 20, 2013 | 167 comments



I feel I'm in a somewhat unique position to talk about how easy or hard machine learning is: I've been working for several months on a project with a machine learning component, alongside a well-cited, respected scientist in the field. But I effectively "can't do" machine learning myself. I'm a primarily 'self-trained' hacker; I started programming by writing 'proggies' for AOL in middle school, around 1996.

My math starts getting pretty shaky around Calculus; vector calculus is beyond me.

I did about half of Andrew Ng's machine learning class on Coursera. Machine learning is conceptually much simpler than one would guess, both gradient descent and the shallow neural-network type, and in fact it is pretty simple to get basic things to work.

I agree with the author that the notation, etc, can be quite intimidating vs what is "really going on".

However, applied machine learning is still friggin' hard, at least to me, and I consider myself a pretty decent programmer. Naive solutions are just unusable in almost any real application; the author's use of loops and maps is great for teaching machine learning, but everything needs to be transformed into higher-level vector/matrix problems in order to be genuinely useful.

That isn't unattainable by any means, but the fact remains (imho) that without a strong base in vector calculus and the idiosyncratic techniques for transforming these problems into more efficient computations, usable machine learning is far from "easy".


This response, and the replies to it, actually come as a bit of a surprise to me. I am a graduate student in artificial intelligence with a comparatively weak background in mathematics who is still very capable of applying, implementing, and improving upon machine learning techniques. I have never taken a calculus or linear algebra class. (Don't ask how I managed that; it's an effect of the odd route I took to get where I am.) The point is you shouldn't be telling yourself you can't do it because you don't know the math; you may surprise yourself.

That being said there are plenty of barriers to a software engineer picking up some papers on machine learning and getting started. These are the result of some serious issues that I believe exist with academia at present and specifically the "publication game". I will not however get into this as this comment will quickly turn into the longest paper I've ever written. It is a problem with the system, and as the reader it is not your fault.

Your issue with applied machine learning may be largely imagined. Take for example the k-means clustering algorithm in the article. It is quite useful in a number of places, be it data analysis or intelligent user interfaces; it's straightforward to implement, and many usable implementations exist online. The same is true for other models such as linear regression and naive Bayes; even things like support vector machines are pretty easy to use these days.
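
For instance, clustering with one of those off-the-shelf implementations is only a few lines. A minimal sketch (assumes scikit-learn is installed; the random feature matrix is just a stand-in for real data):

    # Minimal sketch: cluster the rows of a feature matrix with an
    # off-the-shelf k-means implementation.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(100, 5)          # stand-in for your real feature matrix
    model = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = model.fit_predict(X)       # cluster index for each row
    print(labels[:10], model.cluster_centers_.shape)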

In any case, as selfish as this is, the sentiment you're expressing provides some comfort that I'll be in demand when I enter the job market. :P Though if you really want to implement and apply some simple yet useful algorithms, the author recommends Programming Collective Intelligence at the end of the article, and I can definitely second that recommendation.


Just wanted to say thanks for the recommendation! I actually started writing the book because I felt that a lot of the existing textbooks were unnecessarily obtuse. The details are complicated but you can still get some interesting results and have some fun with a high-level understanding.

(I'll warn you that a lot of the book is very dated now.)


Hey Toby,

First off thanks for writing the book. It was the first book recommended to me for ML while I was doing my undergrad and it got me using ML as opposed to studying ML. The book opened my eyes to the fact that I really could do these things, and it definitely played a role in influencing me to pursue my MSc in AI. So again, thank you. :)

Secondly I'll agree that the book is most definitely dated at this point though it provides a survey of quite a broad selection of algorithms. It certainly isn't going to put the reader on the cutting edge or anywhere near it but it will get them started which appears to be the hardest part for some of those participating in this comment thread.

Have you considered a second edition?


I think you're glossing over the fact that for a non-trivial problem, you have to pick the right machine learning algorithm from a large set of possibilities, and typically you have to set a number of parameters to make it work. On top of that, you have to follow proper methodology to get meaningful results; e.g., to determine whether the accuracy you're getting generalizes to other data as well. I think all of this contributes to making applied machine learning a non-trivial skill. And indeed, AI is just the right background for that, but the point is that you need such a background.


That's true, though as my optimization friends put it, a really close answer that is easy to arrive at is often better than a precise answer that is difficult to arrive at. Selecting the right algorithm is a bit of an art, though some simple rules can make it pretty easy: say, clustering for groups, linear regression for estimating continuous values, naive Bayes / logistic regression for classification. Without parameter tuning you can go from having a bunch of data to having a model that does something quite easily.
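
As a rough illustration of that no-tuning path (a toy sketch, assuming scikit-learn; the iris dataset stands in for whatever data you actually have):

    # Toy sketch: default-parameter naive Bayes, no tuning, data to model in a few lines.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = GaussianNB().fit(X_train, y_train)       # defaults only
    print("held-out accuracy:", clf.score(X_test, y_test))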

When we're talking about applied machine learning, it is generally in the context of developing a product or a feature. The most important part is going from data to feature, not necessarily squeezing out the last few percentage points of performance. In this context getting a good answer easily is more important than getting the best answer.


[deleted]


The combinatorial number of possibilities quickly grows beyond what can be tested automatically, so you have to start from good intuitions. That's actually what machine learning is _not_ great at; it finds patterns in data (maybe even previously unknown ones), but arguably doesn't come up with creative insights.


Another recommendation for Programming Collective Intelligence here. It really demystified the algorithms behind sites like Amazon for me.


I feel the exact opposite. ML is a huge part of my job (or was), and I topped out at Discrete Math 2 or so, finishing up a few courses of the usual Calculus but never anything as esoteric as Real Analysis or the like. (I think I finished up some Linear Algebra.)

Anyway, my math is not that great and I'm way out of practice when it comes to theoretical work. Not only that, I suck at algorithm design and I'm at best a moderately decent developer from a theory standpoint.

But applied machine learning came naturally, because I have a strong grasp of frequentist and Bayesian statistics, as well as of solving problems backwards. If you try to apply ML or any other "exotic" solution by starting with "oh man, here is this awesome technique," you will definitely be frustrated; I know that feeling all too well.

ML is just another tool in the shed. With practice, you start to understand where simple linear regression techniques are useful and when more heavy-duty methods are applicable. Keep solving real world problems rather than trying to look for stuff to hit with your new hammer (ML in this case) and I bet you'll come around a lot quicker.


I was going to say something along those lines, but you saved me the effort.

Although in my opinion, using ML and getting some decent results is not that hard. Using it and being close to the state of the art (i.e. improving your recognition rate from 65% to 80%) is a completely different matter; it is hard, and it requires a full understanding of the details of the algorithm in use.

Still, it doesn't mean that you can't do interesting things with an overall understanding of machine learning, but making a good cat detector is hard [0] [1].

[0] http://research.google.com/archive/unsupervised_icml2012.htm...

[1] http://techtalks.tv/talks/machine-learning-and-ai-via-brain-...


For a 65% to 80% jump, I'd say it depends on the subfield. With NLP (which is basically applied machine learning), it's not uncommon to see algorithms with >80% accuracy.


My take on this is somewhat different - my background is pure mathematics in my undergraduate degree, then applied machine learning at Facebook (ads optimization) and GS (algo trading), and currently at graduate school in ML.

My take is that 'applying' machine learning is much easier than one would think, but 'understanding' machine learning is _much_ harder than one would think.

I imagine most people start with an application - that's why they're interested in the first place. Applying an ML algorithm to a given dataset is trivial - only a few lines in R/Python with scikits-learn, etc. There's even a handy cheat-sheet [1] to tell you which algorithm to use. Your regression model spits out a few parameters, some estimates of quality of fit, maybe some pretty graphs, and you can call it a day.
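
Something like the following is roughly what that trivial first stage looks like (a sketch, not the cheat-sheet's exact recipe; assumes scikit-learn and made-up data):

    # Sketch of the "trivial application" stage: fit, read off the parameters,
    # check one quality-of-fit number, call it a day.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.random.rand(200, 3)                     # placeholder features
    y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * np.random.randn(200)
    reg = LinearRegression().fit(X, y)
    print(reg.coef_, reg.intercept_)               # the parameters it spits out
    print("R^2:", reg.score(X, y))                 # one crude estimate of fit quality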

The next stage is implementation - up until now, the black box has been spitting out parameters, and you want to understand what these numbers actually are and how they are determined. This means getting your hands dirty with the nitty-gritty of ML algorithms - understanding each algorithm's statistical basis, understanding its computational performance (as you alluded to), and reading textbooks/journal articles. Many of these algorithms have a simple 'kernel' of an idea that can be obscured by a lot of mathematical sophistication - for SVMs, it's the idea of finding a plane that separates your data as well as possible; logistic regression is just finding the line of best fit in logit space; k-means is just repeatedly assigning points to clusters then updating the clusters, etc.

It's not _that_ difficult to get working implementations of these algorithms (although debugging an incorrect implementation can be frustrating). You can implement simple versions of gradient boosting machines - a theoretically sophisticated technique that often works very well in practice - in around 100 lines of Python [2] or Go [3]. You can implement Hopfield networks (a basic neural network for unsupervised learning) in Julia [4] and Haskell [5] without much trouble either.
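
To give a sense of how small those kernels are, here's a rough sketch of logistic regression by batch gradient descent in plain numpy (made-up data, no intercept or regularization; a sketch rather than any of the linked implementations):

    # Rough sketch: batch gradient descent for logistic regression.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(X, y, lr=0.1, steps=1000):
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # gradient of mean log loss
            w -= lr * grad
        return w

    X = np.random.randn(500, 3)
    y = (X @ np.array([2.0, -1.0, 0.5]) > 0).astype(float)
    print(fit_logistic(X, y))   # should roughly recover the direction [2, -1, 0.5]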

I think a lot of the value from spending 3-5+ years in graduate school comes in when an off-the-shelf technique isn't cutting it - maybe you're overfitting, you want to play with hyperparameters, your datasets are scaling faster than the off-the-shelf implementations can deal with, etc. Solving these kinds of problems is where a lot of the mathematical/statistical sophistication comes into play - Bayesian hyperparameter optimization of existing ML techniques improved on the state of the art in image processing [6]; understanding _why_ the LASSO works (and when it doesn't) can mean having to understand the distribution of eigenvalues of random Gaussian matrices [7]. That's saying nothing of figuring out how to take an existing algorithm and scale it out in parallel across 1000+ machines [8], which can mean using some deep convex optimization theory [9] to justify your approximations and prove that the approximated model you learn is well-behaved, etc.

I suspect that some of the gulf between ML in practice and in theory arises from the differing goals of the two parties. If you just want a parameter estimate, then a lot of the rigour will seem useless, and if you want to understand a procedure deeply, you can't do away with rigour. Take the k-means paper used in the article. In the screenshot, the paper's author is presenting results of the flavour

> Under weak conditions on the data, as the number of iterations increases, regardless of the random choices made, the k-means clustering tends towards a good clustering of the data.

Depending on your viewpoint, this result is either obvious, non-obvious, or irrelevant. To prove this needs (at a glance) theorems from topology, measure theory, martingale convergence, etc.

I suppose some people find that fascinating, and want to understand _why_ this is true, and what 'good', 'tends', and 'weak' _mean_ in the above statement - and some people just find this irrelevant to their goal of finding similar blog articles. Both viewpoints are reasonable, but I suspect this divergence is the cause of articles like this one.

Applying machine learning isn't that hard, but _understanding_ it is.

[1]: https://lh3.ggpht.com/-ME24ePzpzIM/UQLWTwurfXI/AAAAAAAAANw/W...

[2]: http://www.trungh.com/2013/04/a-short-introduction-to-gradie...

[3]: https://github.com/ajtulloch/decisiontrees

[4]: https://github.com/johnmyleswhite/HopfieldNets.jl

[5]: https://github.com/ajtulloch/hopfield-networks

[6]: http://arxiv.org/pdf/1206.2944.pdf

[7]: http://arxiv.org/pdf/math/0410542v3.pdf

[8]: http://books.nips.cc/papers/files/nips25/NIPS2012_0598.pdf

[9]: http://www.cs.berkeley.edu/~jduchi/projects/DuchiHaSi10.pdf


Totally agree with you. I'm usually on the applications side of things, and you benefit enormously if you can understand the math behind an algorithm (it's also pretty much required for implementing most papers). That being said, actually writing a new paper is a completely different beast. In fact, many people who can write machine learning papers aren't going to be that good at actually implementing a useful solution without lots of engineering resources backing them up. There are hax, then there are math skills. Overlap is rare.


Does there exist a tool that will translate mathematical symbols or an entire equation of them into plain English?


That sounds good but what I think you're really asking for is an explanation of each of the symbols in their given context and an introduction to the mathematical "jargon" and idioms of the equation's branch of mathematics.

Such things do exist. Unfortunately, they are called "textbooks".


This sounds to me like a pointless exercise. There is a reason for using mathematical notation for non-trivial formulas, which is that it is more compact and succinct, allowing it to convey information efficiently and unambiguously. Think of a formula with a few levels of parentheses; you're not going to be able to express that clearly in a paragraph of text. It's not so much the symbols and notation themselves which are hard to grasp, but the mental model of the problem space; once you have that, the formula will usually make sense because you can relate it to this mental model.


I agree to some extent. However, I do think that there is a level of "ramping up" that requires a good bit of teaching and understanding.

It sort of is the members-only cigar club of academics. While I can read, understand, and subsequently implement most things in computer-science land, when it comes to mathematical notation, there is a lot left to understand.

I think a LOT of people would benefit from a five or six video course simply showing a translation of complex notations to a working algorithm in a popular language, so the commonly used symbols start having meaning.


>There is a reason for using mathematical notation for non-trivial formulas, which is that it is more compact and succinct, allowing it to convey information efficiently and unambiguously.

Maybe, but not always. Remember that Richard Feynman took great issue with how integration was taught in most math classes and devised his own method (inspired by the Calculus for the Practical Man texts).


You can always try to find an even better notation, but the only point I was making is that in certain cases anything is better than a wall of awkward text.


No. Feynman never took issue with integration notation or how integration is defined or taught. The story you're referring to is how he learned of a technique for computing integrals that was not covered in schools. The technique is called "differentiation under the integral", and is arguably even more involved.


Translating into a nice programming language might work out more nicely, for coders at least.


Not in the context of a length-limited conference paper. An implementation adds details for memory allocation and use of data structures &c. This is a distraction in the exposition of an algorithm. It's useful if authors make a well-commented implementation available, but the actual paper should contain an abstract definition of the algorithm.


No I was thinking of csmatt's suggestion, to automatically translate the math in the paper to something more accessible to programmers.


Many mathematical concepts cannot be translated into programming, since computation is by definition discrete. For example, even a simple concept like irrational numbers cannot be completely captured by code.


A better approach might be translating them into more uniform s-expressions. Mathematica does this. You can also check out a similar approach in Structure and Interpretation of Classical Mechanics where tensor expressions are reduced to scheme expressions.


Mathematical symbols are overloaded by context, this could be done but you'd still need to tell it whether it was a formula from "machine learning" or "topology".

Then you'd still fail to account for symbols invented/repurposed in a single paper or by a single author.

Basically, the tool itself is a machine learning problem.


Even the simple case of taking a PDF, finding the first use of any given symbol in that paper, and turning the PDF into an interactive document where hovering on any use of the symbol shows a tooltip showing the first use - now that would be both useful and feasible.

But... the Venn diagram of people able to do such a thing, people motivated to do such a thing (i.e. people reading academic papers who aren't math whizzes), and people with the time to do such a task... seems not to have a sufficient intersection. Most of us in this thread fulfill categories 1 and 2, but not 3... come to think of it, I have work to do... ducks.


Some friends and I are trying to do something similar actually.


Imo this would be hard if not impossible.

Mathematical equations don't have any meaning taken by themselves; they just express a certain relationship between quantities without stating anything specific about what those quantities are.

Any translation of an equation to English would be a certain (usually lossy) interpretation of the equation for some particular quantities. But the same equation can have multiple interpretations in wildly different contexts.

A striking example is quantum mechanics. Physicists in the first half of the 20th century figured out a mathematical model (e.g. the Schrödinger equation) that worked perfectly well for predicting the behaviour of fundamental particles. Interpreting what the equation actually means (let's say in English) proved to be hard, though, and spawned multiple, sometimes conflicting or paradoxical interpretations. In fact, to this very day physicists are split on their favorite interpretation of QM. On the positive side, the mathematics of QM just works no matter how the result of the calculation is interpreted.


Yes. If the symbols are in MathML there are accessibility tools which will read them out to you - effectively translating them into plain spoken English.

eg http://www.dessci.com/en/products/mathplayer/tech/accessibil...

(yes, I know that's for IE. There are other systems though).


What you're suggesting is converting a universal, concise, formal language into the potentially ambiguous and/or redundant "plain English". What's the point of that? One has to learn some math before being able to read math.


Translating a collection of equations to C would be a good start too.


Sounds like a problem for machine learning :)


I agree that the notation is tough. Especially for folks who need to learn to do this stuff as a one-off and not as a career. That's why I wrote a book that teaches the algorithms with nearly no math notation while trying to be very rigorous: http://www.amazon.com/Data-Smart-Science-Transform-Informati...


Good point. However, the author does recommend using open source ML libraries, so applying the techniques may not require some of the implementation knowledge you suggest. It's likely more relevant to understand the semantic constraints and pitfalls (eg, am I over-fitting right now?) since the work of a small number of talented implementers can be easily reused.


Most machine learning concepts are very simple. I agree with the author that mathematical formulae can be unnecessarily confusing in many cases. A lot of the concepts can be expressed very clearly in code or plain English.

For example, a matrix factorization could be explained with two arrays, a and b, that represent objects in the prediction:

  for each example
    prediction = 0
    for each weight w
      prediction += a[w] * b[w]
    err = actual_value - prediction
    for each weight w
      a[w] += err * b[w] * small_number
      b[w] += err * a[w] * small_number

It's that simple: multiply the weights of a by the weights of b, calculate the error, adjust the weights, repeat.
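
A runnable toy version of the same update (made-up users, items, and ratings; the learning rate and rank are arbitrary) might look like:

    # Toy stochastic-gradient version of the update above: each training example
    # is (user, item, actual_value); rank is the size of the factor vectors.
    import numpy as np

    n_users, n_items, rank, lr = 50, 40, 8, 0.01
    U = 0.1 * np.random.randn(n_users, rank)    # "a" vectors, one per user
    V = 0.1 * np.random.randn(n_items, rank)    # "b" vectors, one per item
    examples = [(np.random.randint(n_users), np.random.randint(n_items),
                 np.random.rand() * 5) for _ in range(1000)]   # fake ratings
    for _ in range(20):                          # a few passes over the data
        for u, i, actual in examples:
            err = actual - U[u] @ V[i]           # prediction error
            # tuple assignment so V[i] is updated with the old U[u]
            U[u], V[i] = U[u] + lr * err * V[i], V[i] + lr * err * U[u]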

K-Nearest Neighbor/KMeans are based on an even simpler operation:

  dist = 0
  for each weight w: dist += (a[w] - b[w])**2

Then make predictions/build clusters based on the smallest aggregate distance.
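
In numpy the whole distance-and-assign step is a couple of lines (a sketch; `points` and `centroids` are hypothetical arrays):

    # Sketch: squared Euclidean distance from every point to every centroid,
    # then assignment to the nearest one.
    import numpy as np

    points = np.random.rand(100, 5)              # hypothetical data
    centroids = np.random.rand(3, 5)             # hypothetical cluster centers
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignment = dists.argmin(axis=1)            # nearest centroid per point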

There are more advanced concepts, and there is some serious mathematics involved in some predictors. But the most basic elements of statistical prediction are dead simple for a trained programmer to understand. Given enough data, 80% solutions can easily be achieved with simple tools.

We should be spreading the word about the simplicity of fundamental prediction algorithms, not telling people that it's hard and a lot of math background is required. Machine learning is very powerful and can improve all of our lives, but only if there is enough data available. Since information tends to be unevenly distributed we need to get the tools into the hands of as many people as possible. It would be much better to focus on the concepts that everyone can understand instead of keeping statistics secrets behind the ivy clad walls of academia.


Spot on. I agree completely. I work at this stuff for a living, and it never ceases to amaze me how the academic literature and math papers discuss fundamentally simple ideas using styles and techniques that programming evolved past in the last few decades. Comments, structure, not-Greek characters (but they make me look smart!), abstractions: when was the last time you saw a stat/math paper in machine learning with any of those? And guess what, it's as undecipherable as code without comments or structure.

On the other hand, I'm also learning that what I and others find easy, a great deal of people/programmers find hard. The number of programmers/hackers who can actually implement such techniques on new real-world problems without someone holding their hand is, I'm discovering, a very small minority. So maybe it's harder than it looks after all, and we just think it's easy because we've spent so much time with it? After all, programming isn't a walk in the park for most people, and machine learning isn't a walk in the park for most programmers.


> Comments

Papers are not just blocks of equations. There is English giving context. There is also the issue of length limits for a paper. It's assumed that the reader has a background in the area, so the authors can be a bit more concise.

> structure

Most papers in math follow a common structure of definition-theorem-proof. You can think of lemmas as subroutines that are 'called' in a proof.

>not-greek-characters (but they make me look smart!)

Unless you're advocating spelling out things (i.e. calling something "decay constant" instead of using an alpha), I don't think there is a fundamental difference between using a Greek or English alphabet. Greek is nice because it gives you more characters to use beyond the English alphabet (which is already used extensively in math). If you _are_ advocating replacing single symbols with words, I think that would be pretty hard to parse, and rather cumbersome in longer formulas.

>abstractions

Math is _all about_ abstraction. I don't know what you mean by this.


A lot of the difficulty in mathematics is trying to come up with the correct, precise definitions for "fundamentally simple" ideas (the French are historically well known for this). There is literally no way to get the precision afforded by mathematics without requiring substantial jargon, it would be like trying to write an entire operating system without functions. Mere intuition doesn't cut it here, we want logically airtight proofs, and for this you need exact definitions and terminology.

For example, everyone intuitively knows what holes are. Well, how do you rigorously define a hole? It took centuries to come up with the correct definition, which is given in terms of homotopy groups, and the definition of homotopy groups will look incomprehensible to the non-mathematician. Seemingly "easy" statements like "bounded three dimensional objects without holes are spheres" (i.e. the Poincare conjecture) turn out to be very, very hard (worth a million dollars and a Fields medal!)


> When was the last time you saw a stat/math paper in machine learning with comments/structure? And guess what, its as undecipherable as code without comments/structure.

What do you mean? Academic papers are usually written in plain English, with mathematical formulas where necessary. What kind of comments would you like to see?


Well, to quote the article, you can see things like "p is absolutely continuous with respect to the Lebesgue measure on En" and hundreds of variable names and sub/superscripts, none of which are intuitively named. It's really hard to argue that anyone who is not in academia would understand this without passing through it multiple times.

That being said, I think mathematicians should be perfectly permitted to do this, seeing as most people who read their papers are themselves mathematicians. Thus spelling out every dumb detail would probably just be a waste of their time for the rare case where a brave soul who is a programmer tries to decipher it.


Like another comment above alluded to, mathematicians tend to have parallel gripes about code. While technically we're speaking the same language, the dialect is often quite different.


Yes, which is why I mentioned that it makes sense for mathematicians to use the language they do.


Keep in mind that to a mathematician, the code is "unnecessarily confusing" compared to the mathematical notation. They speak math; we speak code.


> Most machine learning concepts are very simple.

Adding to that, a lot of methods that fall under the "machine learning" label are concepts that are pretty well understood in statistical point estimation, hypothesis testing, and numerical analysis. I'm not 100% sure that it really ought to be treated as a new discipline.


The author clearly didn't read the page of the math paper he posted in trying to argue his point. It says, and I quote:

Stated informally, the k-means procedure consists of simply starting with k groups each of which consists of a single random point, and thereafter adding each new point to the group whose mean the new point is nearest.

Admittedly, it's not the prettiest English sentence ever written, but it's just as plainly and simply stated as the author of this article's version.

The paper itself is interested in proving asymptotic guarantees for the algorithm (which the blog's author seems to completely ignore, as if they were not part of machine learning at all). Of course you need mathematics for that. If you go down further in the paper, its author reverts to a simple English explanation of the various parameters of the algorithm and how they affect the quality of the output.

So basically the author is cherry-picking his evidence and not even doing a very good job of it.


This was a great post because I've heard of "k-means" but assumed it required more math than my idle curiosity would be willing to handle. I love algorithms, though, and now I feel like I have a handle on this. That's awesome!

However, the higher level point of the post "ML is easy!" seems more than a little disingenuous. Knowing next to nothing about machine learning, obvious questions still come to mind:

Since you start with random points, are you guaranteed to reach a global maximum? Can it get stuck?

How do you know how many clusters you want? How do I pick K?

This assumes that distance in the vector space strongly correlates to "similarity" in the thing I'm trying to understand. How do I know my vector model actually does that? (For example, how does the author know "has some word" is a useful metric for measuring post similarity?)

I like what I got out of the post a lot, but the "this is easy" part only seems easy because it swept the hard part under the rug.


You are asking exactly the right questions. As far as I know k-means will work well when you can answer those questions for your problem, otherwise not so much. In other words there's no silver bullet.

If you rephrase the "ML is easy" idea from the article to "it's easy to do some simple but cool things with ML" then it's true, but in pushing the envelope you can make it as complex as you like.


Great questions! I can answer a few of them.

- Yes, you can get stuck in a local maximum. However, see the k-means++ algorithm for a clever initialization scheme that gets you within a good constant factor of the global maximum.

- The best way to pick K is to fit the algorithm with a range of K values and see which K seems to give the best application performance (rough sketch after this list).

- It is "your job" in creating the feature space to ensure that feature-space nearness corresponds to application-space nearness. This is not easy at all, and essentially the only reliable test if you've done a good job is if the final algorithm performs as you'd like it to.

I think that the post is really observing that, once you understand it, ML algorithms aren't really that complex. However, once you understand it, most things aren't that complex, and that doesn't make fully understanding them easy in the first place.


> This assumes that distance in the vector space strongly correlates to "similarity" in the thing I'm trying to understand. How do I know my vector model actually does that? (For example, how does the author know "has some word" is a useful metric for measuring post similarity?)

There isn't a great answer to this I'm afraid. It's a combination of "Well it seems to make sense" and "we tried it and it worked". There can be a lot of trial and error.

> (For example, how does the author know "has some word" is a useful metric for measuring post similarity?)

Specifically about this one, that's mostly because it makes some sense and is used a lot elsewhere. It's simple and tends to work reasonably well, doesn't require trying to parse and understand sentences, etc. The idea is that the words you say are likely to be closely linked to the thing you're talking about.

The next step up from this (usually called BoW, or Bag of Words) is fairly straightforward, it's how often each word appears as a fraction of the total document. Up from there is one of the most commonly used approaches called TF-IDF (term frequency inverse document frequency) which is almost the same but it modifies the counts based on how rare the words are. I go straight for TF-IDF as it's fairly sensible, simple and tends to work well in practice.
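
A minimal sketch of that progression (assumes scikit-learn; the documents are made up - TfidfVectorizer does both the counting and the rarity reweighting):

    # Sketch: raw documents -> TF-IDF matrix, ready for clustering or similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "machine learning is easier than it looks"]
    tfidf = TfidfVectorizer().fit_transform(docs)   # sparse (n_docs, n_terms)
    print(tfidf.shape)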

> I like what I got out of the post a lot, but the "this is easy" part only seems easy because it swept the hard part under the rug.

It's the difference between making something that works and making something that works well. A lot of domain knowledge and technical knowledge might be required for the second, but the first might be a lot more attainable than people think.


This is why the author of this post is wrong. The answer is that there is no known efficient algorithm to ensure you get to a maximum of the k-Means objective (and there is unlikely to be one), even if you fix the dimension of the vectors to two. Also, nobody knows how to pick K.


Also: aerodynamics is not really hard, anyone can fold paper planes! Or: programming 3D games is easy, just build new levels for an old game! Or: I don't know what I am doing here, but look, this photoshop effect looks really cool on my holiday photos!

etc.

Seriously: The writer would not be able to write anything about K-Means if not for people looking at it from a mathematical viewpoint. This angle is of tremendous importance if you want to know how your algorithm behaves in corner cases.

This does not suffice if you have an actual application (e.g. a recommendation, hand tracking, or object recognition engine). These need to work as well as you can make them, because every improvement will result in $$$.


Author is not proposing "1 weird machine learning trick mathematicians HATE!"

They are encouraging non-maths people to give the basics a try even though it seems intimidating. And it worked on me: like some others in this thread, my maths is shaky beyond differential calculus and my eyes glaze over when I see notation - but now I'd like to give this a whirl. I have no illusions that I will suddenly be an expert.


It's easy until you have to start adjusting parameters, understanding the results meaningfully, and tuning the algorithms for actual "Big Data". Try doing most statistical analysis with dense matrices and watch your app go out of memory in two seconds.

It's great that we can stand on the shoulders of giants, but having a certain understanding of what these algorithms are doing is critical for choosing them and the parameters in question.

Also, K-means is relatively easy to understand intuitively. Try doing that with Latent Dirichlet Allocation, Pachinko Allocation, etc. Even Principal Component Analysis and Linear Least Squares have some nontrivial properties that need to be understood.


tl;dr: (1) the author does not understand the role of research papers, (2) claims mathematical notation is more complicated than code, and (3) thinks ML is easy because you can code the wrong algorithm in 40 lines of code.

I will reply to each of these points:

1. Research papers are meant to be read by researchers who are interested in advancing the state of the art. They are usually pretty bad introductory texts.

In particular, mathematical details regarding whether or not the space is closed, complete, convex, etc. are usually both irrelevant and incomprehensible to a practitioner but are essential to the inner workings of the mathematical proofs.

Practitioners who want to apply the classic algorithms should seek a good book, a wikipedia article, blog post or survey paper. Just about anything OTHER than a research paper would be more helpful.

2. Mathematical notation is difficult if you cannot read it, just like any programming language. Try learning to parse it! It's not that hard, really.

In cases where there is an equivalent piece of code implementing some computation, the mathematical notation is usually much shorter.

3. k-means is very simple, but it's the wrong approach to this type of problem. There's an entire field called "recommender systems" with algorithms that would do a much better job here. Some of them are pretty simple too!


I'm pretty good at logic, problem solving, etc..., but do find parsing mathematical notation quite hard. Is there actually a good way to learn it?

What I have most difficulty with is: it's not always clear which symbols/letters are knowns, which are unknowns, and which are values you choose yourself. Not all symbols/letters are always introduced; you sometimes have to guess what they are. Sometimes the axes of graphs are not labeled. Sometimes explanations or examples for border cases are missing. And sometimes, in slides and the like, parsing the mathematical formulas takes too much time compared to the pace of the presentation, or the memory of what was on previous slides fades away, so a formula on a later slide that uses something from a previous one can no longer be parsed.

Also, when you need to program the [whatever is explained mathematically in a paper], you have to tell the computer exactly how it works, for every edge case, while in math notation people can and will be inexact.

Maybe there should be a compiler for math notation that gives an error if it's incomplete. :)


You probably want to look at a couple of good undergrad textbooks (calculus, linear algebra, probability). The good textbooks explain the notation and have an index for all the symbols.

Unfortunately, in most cases, you have to know a little bit about the field in order to be able to parse the notation. The upside is that having some background is pretty much a necessity to not screwing up when you try to implement some algorithm.


On Kaggle "The top 21 performers all have an M.S. or higher: 9 have Ph.D.s and several have multiple degrees (including one member who has two Ph.D.s)."

http://plotting-success.softwareadvice.com/who-are-the-kaggl...


I've done pretty well at Kaggle with just a Bachelors in American History: http://www.kaggle.com/users/19518/vik-paruchuri, although I haven't been using it as much lately. A lot of competition winners have lacked advanced degrees. What I like most about Kaggle is that the only thing that matters is what you can show. I learned a lot, and I highly recommend it to anyone starting out in machine learning. I expanded on some of this stuff in a Quora answer if you are interested: http://www.quora.com/Kaggle/Does-everyone-have-the-ability-t....


Man..

To other readers, follow that link at your own risk. Vikp has a blog[1] which has hours and hours of excellent writing about machine learning.

[1] http://vikparuchuri.com/blog/


Thanks, nl! Now I know that at least one person reads it.


Count me as 2. Though I am sure there are lots of people who will be reading the material on your blog. Pretty good stuff.


While this may be true that doesn't by any means imply that someone with "only" a B.Sc. in Computer Science can't understand and use machine learning. In fact the only thing this implies is that those who have focused the most in the field are performing the best in competitions. Fairly standard wouldn't you say? The result shouldn't discourage anyone.


I'd like to chime in here as a mathematician.

Many people here express their feeling that math or computer science papers are very difficult to read. Some even suggest that they're deliberately written this way. The truth is that yes, they in fact are deliberately written this way, but the reason is actually the opposite of many HNers' impression: authors want to make the papers easier to understand, not more difficult.

Take for example a page from a paper that's linked in this article. Someone here on HN complains that the paper talks about "p being absolutely continuous with respect to the Lebesgue measure on En", hundreds of subscripts and superscripts, and unintuitively named variables, and that this makes the paper very difficult to understand, especially without doing multiple passes.

For non-mathematicians, it's very easy to identify with this sentiment. After all, what does it even mean for a measure to be absolutely continuous with respect to Lebesgue measure? Some of these words, like "measure" or "continuous", make some intuitive sense, but how can a "measure" be "continuous" with respect to some other measure, and what the hell is Lebesgue measure anyway?

Now, if you're a mathematician, you know that Lebesgue measure in simple cases is just a natural notion of area or volume, but you also know that it's very useful to be able to measure much more complicated sets than just rectangles, polyhedra, balls, and other similar regular shapes. You know the Greeks successfully approximated areas of curved shapes (like a disk) by polygons, so you try to define such a measure by inscribing or circumscribing nice, regular shapes for which the measure is easy to define, but you see it only works for very simple and regular shapes and is very hard to work with in practice. You learned that Henri Lebesgue constructed a measure that assigns a volume to most sensible sets you can think of (indeed, it's hard to even come up with an example of a non-Lebesgue-measurable set), you've seen the construction of that measure, and you know that it's indeed a cunning and nontrivial piece of work. You also know that any measure on Euclidean space satisfying some natural conditions (like: the measure of a rectangle with sides a, b is equal to the product ab, and if you move a set around without changing its shape, its measure shouldn't change) must already be Lebesgue measure. You also worked a lot with Lebesgue measure, it being arguably the most important measure of them all. You have an intimate knowledge of Lebesgue measure. Thus, you see a reason to honor Lebesgue by naming the measure he constructed after him. Because of all of this, whenever you read or hear about Lebesgue measure, you know precisely what you're dealing with.

You know that a measure p is absolutely continuous with respect to q if, whenever q(S) is zero for some set S, p(S) is also zero. You also know that if you tried to express the concept defined in the previous sentence without using names for the measures involved, and without a notation for the value a measure assigns to some set, the sentence would come out awkward and complicated, because you would have to say that a measure is absolutely continuous with respect to some other measure if, whenever that other measure assigns a zero value to some set, the value assigned to that set by the first measure must be zero as well. You also know that, since you're not a native English speaker (and I am not), your chance of making a grammatical error in a sentence riddled with prepositions and conjunctions is very high, and it would make this sentence even more awkward. Your programmer friend suggested that you should use more intuitive and expressive names for your objects, but p and q are just any measures, and apart from the property you're just now trying to define, they don't have any additional interesting properties that would help you find names more sensible than SomeMeasure and SomeOtherMeasure.

But you not only know the definition of absolute continuity of measures: in fact, if the definition was the only thing you knew about it, you'd have forgotten it long ago. You know that absolute continuity is important because of the Radon-Nikodym theorem, which states that if p is absolutely continuous with respect to q, then p(A) is in fact the integral over A of some function g with respect to the measure q (that is, p(A) = int_A g dq). You know that it's important, because it can help you reduce many questions about the measure p to questions about the behaviour of the function g with respect to the measure q (which in our machine learning case is a measure we know very, very well, the Lebesgue measure).

You also know why the hell it's called absolutely continuous: if you think about it for a while, the function g we just mentioned is kind of like a derivative of the measure p with respect to the measure q, kind of like dp/dq. Now, if you write p(A) = int_A (dp/dq) dq = int_A p'(q) dq, even though none of the symbols dp/dq or p'(q) makes sense, it seems to mean that p is an "integral of its derivative", and you recall that there's a class of real-valued functions for which that is true as well; guess what, the class of absolutely continuous functions. If you think about these concepts even harder, you'll see that the latter concept is a special case of our absolutely continuous measures, so all of this makes perfect sense.

So anyway, you read that "p is absolutely continuous with respect to Lebesgue measure", and instantly tons of associations light up in your memory: you know what they are working with, and you have some ideas why they might need it, because you remember making a similar assumption in some similar context to obtain some result (and as you read the paper further, you realize you were right). All of what you're reading makes perfect sense, because you are very familiar with the concepts the author introduces, with the methods of working with them, and with the known results about them. Every sentence you read is a clear consequence of the previous one. You feel you're home.

...

Now, in an alternate reality, a nonmathematician-you also tries to read the same paper. As the alternate-you hasn't spent months and years internalizing these concepts until they become vis second nature, ve has to look up every other word, digressing into Wikipedia and using DFS to find a connected component containing a concept you just don't yet understand. You spend hours, and afterwards you feel you learned nothing. You wonder if the mathematicians deliberately try to make everything complicated.

Then you read a blog post which expresses the idea behind this paper very clearly. Wow, you think, these asshole mathematicians are really trying to keep their knowledge in an ivory tower of obscurity. But, since you only made it through the first few paragraphs of the paper, you missed an intuitive explanation that's right there on the page from the paper reproduced by that blog post:

Stated informally, the k-means procedure consists of simply starting with k groups each of which consists of a single random point, and thereafter adding each new point to the group whose mean the new point is nearest. After a point is added to a group, the mean of that group is adjusted in order to take account of that new point

Hey, so there was an intuitive explanation in that paper after all! So, what was all that bullshit about measures and absolute continuity all about?

You try to implement an algorithm from the blog post, and, as you finish, one sentence from blog post catches your attention:

Repeat steps 3-4. Until documents’ assignments stop changing.

You wonder: when does that actually happen? How can you be sure that they will stop at all? The blog post doesn't mention that. So you grab that paper again...


A bit of a counter-point...

The problem with viewing these things as mathematical problems is that you buy into the simplifications you've made to fit the task to the model, and stop thinking about how the thing is actually working in your specific case. In short, it's a leaky abstraction.

Convergence is a good example. Lots of machine learning courses will go through the convergence proofs of, say, the perceptron algorithm --- even though for all the hard problems, it's probably a really bad idea to let your algorithm converge! Your features are high dimensional and noisy, and so are your labels. Any linear separation you find is going to be terribly over-fit.
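
For reference, the perceptron update itself is tiny; in line with the point above, a practical sketch caps the number of passes rather than running to convergence (a rough sketch with a made-up interface, assuming labels in {-1, +1}):

    # Rough perceptron sketch: fixed number of epochs instead of
    # running to convergence (which can badly over-fit noisy data).
    import numpy as np

    def perceptron(X, y, epochs=5):              # y in {-1, +1}
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (w @ xi) <= 0:           # mistake-driven update
                    w += yi * xi
        return w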

Taking it back to the blog post, when thinking about this k-Means clustering, it's actually much more important to think of it in terms of, "okay, all the algorithm gets to see is a term-document matrix". You should then expect that most other operations over term-document matrices will probably give you similarish results.

The differences between, say, k-means clustering and an LDA model are still important for someone working in the area to understand. They should also understand how a hierarchical Dirichlet process solves problems you might have with an LDA model.

Ultimately, you need to be able to develop a good tool box, and that means reasoning about the performance of various methods on your problem domain. (It doesn't mean "blindly twiddle knobs on a collection of crappy implementations", a la Weka...)


Regarding the perceptron, most modern texts also have a regret analysis for it, which, coupled with an online-to-batch conversion, tells you how well you can expect a perceptron to perform on unseen data after a single pass, and it's usually a very good estimate (the answer is: on average, about as well as it did on the examples in the training data).


This isn't often discussed in my field (NLP) -- thanks!

We usually use averaged perceptron though, which seems like it would make this hard to analyse.


The averaged perceptron is the one to which the proof applies :-)


I don't know if that's a counterpoint per se. That's the heart of why ML is actually challenging: knowing how fragile a notion like convergence is is important, and it requires at least part of that mathematical mumbo jumbo.


"I have have a Builder class, and each instance can be populated with an self-ordering collection of strategies. Invoking the build method will produce an anonymous closure which the caller can choose to execute as soon as it has the global mutex. Doing it at the wrong time could also ruin the temporal locality benefits, changing big-O drastically."

"Darn you programmers, use plain English!" :p


This reminded me of Bret Victor's talk http://worrydream.com/MediaForThinkingTheUnthinkable/

Basically, a lot of the difficulty in understanding concepts in mathematical/engineering disciplines lies in the default language or representation used to communicate them, not in the concepts themselves.

I wish there were a bigger focus in academia in general on communicating the concepts first and using the formal notation/language second. I think the best professors (in terms of making students learn) are the ones that do this.


> You wonder: when does that actually happen? How can you be sure that they will stop at all? The blog post doesn't mention that. So you grab that paper again...

This is exactly what I experience whenever I read academic papers (math and statistics, not computer science though).

Great explanation, xyzzyz.

Another key part of the writing is commonality which (in my history of courses) is sometimes underrated.

If pi means profit, let it always stand for profit in that context. That may/will be confusing when you don't know the shorthand, but once you do, that's what makes the related literature easy to read.

But it all boils down to: know the basics. Rarely is an academic paper written for the purpose of (re)introducing all concepts (that's what textbooks are for), so if you expect to pick up a paper and understand it on its own, then you are likely to be disappointed.

The alternative would be even more difficult to read - "okay, we get it! f is a function"


I can totally appreciate the value of having well defined and well understood definitions and notations that might require lots of background to understand. This allows for precise and concise discourse.

But there have been many published bounds in Machine Learning Theory that turned out to be completely vacuous. That is, they prove a bound on generalization error that is always greater than 1.

http://hunch.net/?p=181

To me, that definitely seems like complicated math for the sake of showing off and writing papers rather than writing for clarity and demonstrating something useful.


I think some people just have a better grasp of the technical language that is commonly used for describing most of this stuff. The language definitely helps the people most closely involved, but it also leaves a lot of people out.

Being an abstraction, the language helps in developing the field further just by allowing to find new abstractions on top of the already existing ones. Very similar to what happens with programming languages in general.


It would be math for the sake of showing off only if the authors or referees knew before publishing that the bounds were vacuous. In fact, most mathematicians would probably be embarrassed if that happened to one of their results, and it's nothing to show off about.


Not really. This is more of an example of a "failed experiment" than showing off. Research is hard. You never know if what you're doing will be useful, or just plain wrong.


> Your programmer friend suggested that you should use more intuitive and expressive names for your objects, but p and q are just any measures, and apart from the property you're just now trying to define, they don't have any additional interesting properties that would help you find names more sensible than SomeMeasure and SomeOtherMeasure.

That's common with Haskell variables, too. So longer names would not make the following easier to understand, either:

    map _ [] = []
    map f (x:xs) = f x : map f xs


I disagree. Haskell is notoriously hard to read if you're not experienced with the language. However, even ignoring the syntactic obscurity, this is easier to understand if you're not too familiar with the language:

    map _ [] = []
    map myFunc (first:rest) = myFunc first : map myFunc rest


Eh. Your construction is perhaps easier to understand if you've never used the language, and are trying to learn it just by reading arbitrary source code. But who writes code for such a person?

You might as well spell out:

int iterator; for (iterator = ...

But everybody uses 'i', because it's faster and simpler, takes less time to read, and doesn't fill up your screen so that surrounding elements can be easily associated (e.g. a[i].do(x) vs myArray[iterator].doCoreFunction(element_x) <- yuck)


I write all the time:

    for(int index = 0; index < collection.size(); ++index)

I don't believe in the philosophy of saving seconds of programmer time at the cost of understandability, especially when the latter is increasingly more important. Also, calling it "index" makes it really obvious when you're possibly using it wrong (ie, as anything other than an index). My version of your latter example would never look so ugly, because I would never have ended up with such badly worded things in the first place. (Who puts domain functions on their model objects?)

    for(int index = 0; index < things.size(); ++index) {
        final Thing thing = things.get(index);
        businessUnit.doSomething(thing);
    }


At some point this construction has the opposite effect of your stated goal. Programming languages have their own syntax, grammar, and idioms. Working with those isn't so much about saving time as about making clear statements. I'm not certain how calling your variable "index" is better than calling it "i" - the fact that it's an index is implied by the construct.

I'm not disagreeing that having decent names is a good idea, just that idioms and convention convey a ton of meaning. For instance if any of my team ever did:

   def func(self, a, b, c): ...

There would be a problem. (most of the time anyway...)

By analogy: do you always refer to people strictly by their full name (that is, never calling Robert Smith "Bob", "Rob", "Robert", etc.)? Do you always refer to your vehicle as "my $Year $Make $Model" rather than "my car"? It would be strictly clearer and easier to do so, with less ambiguity and chance for confusion. Why don't you? I'm guessing because the language you speak has idioms and constructs that allow shortening this, and understanding is not lost.

Basically my point is that assuming the reader of your code is a complete neophyte with no understanding of common idioms is largely ridiculous, and potentially harmful to the actual neophyte-plus-a-little-experience folks, because they wonder why the common pattern is not being followed... asking "why is this different? It must be special, not just a for loop..."


But people also write:

    for(int i = 1; i <= n; ++i)
        sum += i;

This would look immediately wrong if i were actually named index.

Also, your analogy to linguistic concepts is bad. Sure I call people Bob, but I'm also not expecting to have to reread my conversation with them half a year later and modify it to say something else instead. Also, those linguistic concepts are not idioms. They are just alternate names and indexical constructions, respectively. An idiom is something like "What's up?", where the literal meaning of "what is up" is not intended, but rather another meaning that is completely lost unless you are in the know.


Using the variable names i/j/k for loops is such a widely used convention that IMHO it doesn't really matter for a lot of people. Virtually everyone I know will read "array[i]" and know exactly what's going on.

And believe it or not, those variable names aren't arbitrary. They hearken from the ancient times: http://programmers.stackexchange.com/a/86911


I tend to use

  for (auto iter = something.begin(); iter != something.end(); ++iter)
in C++. It's an abbreviation, but a clearer one (both to me and to the reader of my code). An integer is (to me) an index, not an iterator, so I'd kind of call that misleading anyway.


Thanks for the explication but...aren't all aspects of computational machine-learning mathematics available without Lebesgue measure? That is, with the simpler Borel measure? Or even without introducing the notion of measure?

Other than pathological cases which would never be programmable anyway, couldn't we ignore any requirement to speak of Lebesgue measure? This is computation we're talking about, not theoretical mathematics, after all.


I haven't read the rest of the paper linked by the blog post (and I know next to nothing about machine learning itself anyway), but the page linked clearly suggests the motivation behind Lebesgue measure: the theorem the author wants to prove says that some sequence of random variables he defines is almost surely convergent, and that the limit is almost surely equal to some other random variable ("almost surely" is another mathematical term with a well-defined meaning that can be very confusing, but this one is actually easy enough that the Wikipedia explanation is understandable). There are some strong and useful theorems for the convergence of limits of Lebesgue integrals, and there are also theorems that sometimes guarantee that whenever the integrals converge, the functions converge as well. The author probably wants to use these theorems.

Now, could it all be done without mentioning the Lebesgue integral and Lebesgue measure? In most cases, yes; in this case, almost surely. The question is, would it actually be easier and simpler? The author would probably need to reproduce the ideas needed for the proofs of the convergence theorems for Lebesgue integrals, which wouldn't be really important or relevant. It's much easier to put your problem in a setting in which it has already been solved.

Anyway, as I mentioned above, every mathematician is intimately familiar with Lebesgue measure and integral, so putting the problem in some different, "simpler" setting than this would probably make it more difficult to understand.


Lebesgue measure is actually simpler than Borel measure, since it includes all null sets for convenience.

Yes, you could not talk about Lebesgue measure at all. However, the importance of theory is that you gain a deeper understanding of the abstract objects that the code is supposed to represent in the first place. This in turn, can have practical applications.

For example, you could implement a Fibonacci function using recursion as usual. But if you study the math a bit more deeply, then you can discover Binet's formula and get it in constant time.
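
To make that concrete, here's a rough Python sketch (my own, not from the parent comment; note the closed form only stays exact up to roughly n = 70 because of floating point):

    import math

    def fib_recursive(n):
        # straightforward recursion: exponential time
        return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)

    def fib_binet(n):
        # Binet's formula: a constant number of arithmetic operations
        phi = (1 + math.sqrt(5)) / 2
        return round(phi ** n / math.sqrt(5))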


I can get that academic papers do this, but even Wikipedia articles are written in this cryptic language. For example, you read about some algorithm and want to find out how it works and you get this: https://upload.wikimedia.org/math/f/1/c/f1c2177e965e20ab29c4...

You try to highlight some of the symbols and find out that it's an image. You can't even click on things and find out what they mean like is standard for everything written in text. You try to google something like "P function" and get completely unrelated stuff.


Symbols are used because equations like that would be crazy long and hard to keep track of if every symbol were a fully-formed word or phrase. It's an unfortunate side effect that those unfamiliar with mathematical notation are somewhat left in the dark.

The real problem here is the difficulty of looking up the meaning of a symbol you don't know. The notation is great once you know it, but it's a pain to learn if you're doing so completely out of context. Maybe you don't know that the sigmas there represent summation. Your idea of just clicking the sigma and seeing an explanation is a pretty great one that I'd like to see implemented.


If there was an explanation of what it does in natural language (which is often done on many Wikipedia articles) then it would be fine. Even just the ability to copy and paste the symbols into google would be great, but it's just an image, and a non-math person won't know what they are called.

I still think that raw math is unnecessarily cryptic and unhelpful, even when I understand the symbols fine. It's like reading source code without comments or named variables or indentation of any kind. Actually that's pretty much exactly what it is.


> Even just the ability to copy and paste the symbols into google would be great, but it's just an image, and a non-math person won't know what they are called.

Okay, so in the case of a wikipedia article, if you're really ignorant of the symbology used and don't know where to start, just click 'edit' and look at the latex code. If that's not helpful, like the case where the latex for the P(.) function is simply 'P', read the contextual information.

"The posterior distribution can be defined by marginalizing over joint distribution, factorized into its independent prior distributions and then dividing by the partition function."... or whatever.

If the article doesn't have any contextual info, then you probably shouldn't rely on it if you're learning. Look somewhere else.

> It's like reading source code without comments or named variables or indentation of any kind

Math is an abstraction. It is more abstract than most code, which is why implementing math in code is something people make a living doing. How would you prefer that math equation you gave be presented? As the (currently) top comment states, those symbols are FULL of meaning. Understanding what that equation means is more than just knowing that "P stands for probability, and the funny angled thing is a sigma and that means sum." In fact, quite often an equation represents an idea that is simple enough to conceptualize, but extremely non-trivial to bring into more concrete terms. For example, that equation you posted looks like the kind of thing you'd see in a paper using Bayesian methods. In many cases, implementing that equation would require Markov chain Monte Carlo sampling, maybe a gradient descent algorithm, a quadratic program solver, who knows. Presenting the ideas that way would be infeasible, at best.


Crazy long and difficult-to-track equations are why function abstraction was invented. If this were code, we would say to cut it into understandable chunks, label them as functions, and then compose them.


Writing it all as code would exclude nonprogrammers just as much as this excludes people not familiar with basic mathematical notation.

At least everybody should share some common mathematical instruction from school. The same certainly cannot be said of programming.

As somebody who can read both code and math, though, I'd greatly, greatly prefer to see math in math notation, not code. Code would be so much more verbose that it would take easily twice as long to digest the information. You'd also completely lose all the incredibly powerful visual reasoning that comes along with modern mathematical notation.


This is abstracted and very understandable. The picture is the definition of a function, which in turn invokes the definition of other functions. The P's are abstractions for conditional probability, the sums are loops etc. If this were code, it would be no more than a few lines and very understandable. Unlike code however, for all this there is very thorough documentation for mathematics in millions of pages by thousands of people in many different languages at all levels. The documentation is textbooks, and they're pretty good.


But those functions might still be cryptic. And the rabbit hole may go deeper than the amount of text one can be reasonably allocated to a discussion.

Consider how much work Bourbaki had to write before he/they got to real analysis.


I thought most wikipedia articles use a "where..." beneath every equation explaining the symbols. I could be wrong though


This is akin to stating that "public static void main(String[] args)" is complete nonsense when one didn't take the time to learn a little Java first.

The linked formula has been completely removed from its context

http://en.wikipedia.org/wiki/Averaged_one-dependence_estimat...

Below the formula in the page, you see everything defined and logically derived, and a brief overview of what it's supposed to do. You are correct in that Wikipedia may not be the best place to learn from scratch if you are a novice, but its purpose is more to be a concise reference than a thorough introduction.

If you need more guidance, there are thousands of textbooks at all levels of understanding, and if you opened a single one of them, you would immediately see that the "P" is standard notation for (conditional) probability.


Wikipedia needs this: http://www.mathjax.org/


The question "do I need hard math for ML" often comes up in #machinelearning at irc.freenode.net

My point here is: you don't need hard math (most of the time) because most machine learning methods are already coded in half a dozen different languages. So it's similar to FFT: you do not need to understand why FFT works, just when and how to apply it.

The typical machine learning workflow is: Data mining -> feature extraction -> applying a ML method.

I often joke that I'm using Weka as a hammer, to check if I managed to shape the problem into a nail. Now the critical part is feature extraction. Once this is done right, most methods show more or less good results. Just pick the one that fits best in terms of results, time and memory constraints. You might need to recode the method from Java to C to speed it up, or to embed it. But this requires nearly no math skills, just code reading, writing and testing skills.
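
In Python the hammer-swinging looks roughly like this (a sketch using scikit-learn rather than Weka, on a synthetic feature matrix, since producing X and y is the part that actually takes the work):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC

    # stand-in for "data mining -> feature extraction"
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # try a few off-the-shelf methods; keep whichever fits your accuracy/time/memory constraints
    for model in (GaussianNB(), DecisionTreeClassifier(), SVC()):
        print(type(model).__name__, model.fit(X_train, y_train).score(X_test, y_test))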


I find newbs in ML don't appreciate cross-validation. That's the one main trick: keep some data out of the learning process to test an approach's ability on data it has not seen. With this one trick you can determine which algorithm is best, and the parameters. Advanced stuff like Bayes means you don't need it, but for your own sanity you should still always cross-validate. Machine learning is about generalisation to unseen examples; cross-validation is the metric that tests this. Machine learning is cross-validation.
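
A minimal version of the trick (a scikit-learn sketch; the dataset and model are just placeholders):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    # 5-fold cross-validation: fit on 4/5 of the data, score on the unseen 1/5, five times over
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean(), scores.std())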


This was of utmost importance in my ERP class at GaTech - implementing a data segmenting algorithmic solution always requires that you hold out a sample set. This really is across all statistics, if I'm not mistaken - the validation set and the holdout.


First, you're always going to need to evaluate supervised models on some data you didn't train on. No method will ever relieve the need for that.

Second, I never use cross-fold if I can help it. It's a last resort if the data's really small. Otherwise, use two held-out sets, one for development, and another for testing. Cross-fold actually really sucks!

The first, smaller problem is that cross-validation has you repeating everything N times, when your experiment iteration cycle is going to be a big bottleneck. The second is that it can introduce a lot of subtle bugs.

Let's say you need to read over your data to build some sort of dictionary of possible labels for the instances, given one of the attributes. This can save you a lot of time. But if you're cross-validating, you can forget to do these pre-processing steps per fold too, and so perform a subtle type of cheating. Coding that stuff correctly often has you writing cross-fold validation logic into the run-time. Nasty.

With held-out data, you can ensure your run-time never has access to the labels. It writes out the predictions, and a separate evaluation script is run. This way, you know your system is "clean".

Finally, when you're done developing, you can do a last run on a test set.

Bonus example of cross-fold problems: sometimes your method assumes independence between examples, even if this isn't strictly true. For instance, if you're doing part-of-speech tagging, you might assume that sentences in your documents are independent of each other (i.e. you don't have any cross-sentence features). But they're not _actually_ independent! If you cross-fold, and get to evaluate on sentences from the same documents you've trained on, you might do much better than in a realistic evaluation, where you have to evaluate on fresh documents.


Yes, I realise the importance of a testing set too. I see that as just another level of validation: hierarchies of left-out data. If you understand in your heart what cross-validation is, then the step to using a testing set is a trivial extension.

I brought up cross-validation in particular because I see PhDs eating books on advanced techniques like structure learning, but they have no practical experience. And they ignore the basics for so long, as they don't realise validation is the most important bit. If you have not drawn the curve of training-set error and validation error against training time, you have not done machine learning yet...

This article doesn't mention it either :(


Math is essential for this field, anyone who tells you otherwise doesn't know what they are talking about. You can hack together something quick and dirty without understanding the underlying math, and you certainly can use existing libraries and tools to do some basic stuff, but you won't get very far.

Machine learning is easy only if you know your linear algebra, calculus, probability and stats, etc. I think this classic paper is a good way to test if you have the right math background to dive deeper into the field: http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pd...


As a graduate in artificial intelligence and machine learning I can tell you that machine learning IS hard.

Sure, the basic concepts are easy to understand. Sure, you can hack together a program that performs quite well on some tasks. But there are so many (interesting) problems that are not at all easy to solve or understand.

Like structural engineering it is easy to understand the concepts, and it is even easy to build a pillow fort in the living room, but it is not easy to build an actual bridge that is light, strong, etc.


It's actually incredibly hard, especially if you want to achieve better results than with a current 'gold standard' technique/algorithm, applied on your particular problem.

While the article doesn't have this title (why would you even choose one with such a high bias?), I presume the submitter decided upon it after being encouraged by this claim from the article's author: 'This data indicates that the skills necessary to be a data “wizard” can be learned in disciplines other than computer sciences and mathematics.'

This is a half-baked conclusion. I'd reason most Kaggle participants are first of all, machine learning fans, either professionals or 'amateurs' with no formal qualifications, having studied it as a hobby. I doubt people with a degree in cognitive sciences or otherwise in the 'other' categories as mentioned in the article learned enough just through their university studies to readily be able to jump into machine learning.


Is k-means really what people are doing in serious production machine-learning settings? In a previous job, we did k-means clustering to identify groups of similar hosts on networks; we didn't call it "machine learning", but rather just "statistical clustering". I had always assumed the anomaly models we worked with were far simpler than what machine learning systems do; they seemed unworthy even of the term "mathematical models".


Is k-means really what people are doing in serious production machine-learning settings?

Yes, that is one option. Ask me over Christmas and I'll tell you a story about a production system using k-means. (Sorry guys, not a story for public consumption.)

More broadly, a lot of machine learning / AI / etc is simpler than people expect under the hood. It's almost a joke in the AI field: as soon as you produce a system that actually works, people say "Oh that's not AI, that's just math."


Most of the classifiers I've seen used in industry aren't even k-means. They're fixed heuristics (often with plausibly effective decision boundaries found via k-means).

Consider the Netflix Prize, where the most effective algorithm couldn't actually be used in production. Commercial applications tend to favor "good enough" classification with low computational complexity.

> It's almost a joke in the AI field: as soon as you produce a system that actually works, people say "Oh that's not AI, that's just math."

I think that's the standard "insert joke here" any time someone has a "What is AI?" slide.


We often sell our machine learning as mathematical optimization, because ML & AI have bad connotations to industrial customers.

See: http://en.wikipedia.org/wiki/AI_winter


You might want to retest that assumption. Call it a "Big Data" tool and see how you go.

The AI Winter was a long time ago now.


Ha yeah welcome to the field, linear regression is now machine learning. Formerly known as pattern recognition > artificial intelligence > statistics.


K-means is simple, easy to implement and it often works quite well. It's unlikely to be state-of-the-art for any particular problem, but if you just need to do some straightforward clustering as part of a larger data-processing pipeline, K-means is a reasonable default choice, at least until you figure out that the clustering step is your bottleneck in terms of overall performance.

For example, this paper http://www.stanford.edu/~acoates/papers/coatesng_nntot2012.p... demonstrates that using K-means to learn feature representations for images (i.e., use k-means to cluster small patches of pixels from natural images, then represent images in the basis of cluster centers; repeat this several times to form a "deep" network of feature representations) gives image classification results that are competitive with much more complex / intimidating deep neural network models.


Whatever problem you are solving, if you don't try k-means before spending a huge amount of programmer and computer time on more complex algorithms, you are doing it wrong.

Just as most programming is CRUD, most machine learning in practice is k-means or linear models.


Many previous k-means applications are now done using support vector machines. They are a bit more advanced, but not a lot.


In my opinion, K-means wouldn't really count as machine learning; learning vector quantization is a close counterpart.


I disagree with this article, although I did find it interesting. Replace k-means with a supervised learning algorithm like an SVM, and use some features more complicated than binary ones, and this article would look a lot different.

Also - maybe "article recommendation" is "easy" in this context, but other areas such as computer vision, sentiment analysis are not. Some other questions I might ask

How do you know how well this algorithm is performing?

How are you going to compare this model to other models? Which metrics will you use? What statistical tests would you use and why?

What assumptions are you making here? How do you know you can make them, and why?

There are a lot of things that this article fails to address.

Disclaimer: I realize more complex models + features don't always lead to better performance, but you need to know how to verify that to be sure.


SVM prediction is pretty straight-forward, even if you're using a kernel/dual formulation:

"Here are some training points from classes A and B, each with an importance weight. Compute the weighted average of your new point's similarity to all the training exemplars. If the total comes out positive then it's from class A, otherwise it's from class B."

If you tried to explain training from a quadratic programming perspective it would get messy, but if you use stochastic gradient descent then even the training algorithm (for primal linear SVM) is pretty light on math.

Of course, this is only possible after a decade or two of researchers have played with and digested the ideas for you. I think until Léon Bottou's SGD SVM paper, even cursory explanations of SVMs seemed to get derailed into convex duality, quadratic programming solvers, and reproducing kernel Hilbert spaces.
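
For illustration, a toy Pegasos-style SGD trainer for a primal linear SVM (my own simplification: no bias term, no averaging, a fixed number of epochs):

    import numpy as np

    def train_linear_svm(X, y, lam=0.01, epochs=10):
        # X: (n, d) feature matrix, y: labels in {-1, +1}
        n, d = X.shape
        w = np.zeros(d)
        t = 0
        for _ in range(epochs):
            for i in np.random.permutation(n):
                t += 1
                eta = 1.0 / (lam * t)
                if y[i] * X[i].dot(w) < 1:
                    # hinge loss is active: shrink w and step towards this example
                    w = (1 - eta * lam) * w + eta * y[i] * X[i]
                else:
                    # correct with margin: only the regularizer acts
                    w = (1 - eta * lam) * w
        return w

    # predict with: np.sign(X.dot(w))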


I'm not sure I agree. I'll say it differently:

As far as I know, k-means (as the author described it) has two parameters: the initial number of clusters K and max iterations (although I could be wrong). SVMs have:

UPDATE: here is a gist:

https://gist.github.com/josephmisiti/7572696

taken from here : http://www.csie.ntu.edu.tw/~cjlin/libsvm/

I do not think that picking up LibSVM, compiling it, and running it is straightforward for everyone...


I think there's a big difference between explaining the essence of an algorithm and understanding all the details and decisions that go into a particular implementation.

If explaining SVM, I'd stick with the linear primal formulation, which only really requires choosing the slack parameter C. If I needed to perform or explain non-linear prediction, I'd switch to the RBF kernel, which gives you one additional parameter (the RBF variance).

Implementations of K-means, by the way, can also potentially have lots of parameters. Check out: http://scikit-learn.org/stable/modules/generated/sklearn.clu...
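
In library terms that's roughly the difference between these two lines (a scikit-learn sketch; the parameter values are purely illustrative):

    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC, SVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    LinearSVC(C=1.0).fit(X, y)                     # linear primal SVM: one knob, the slack parameter C
    SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X, y)  # RBF kernel: one extra knob for the kernel width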


fair enough - btw i just realized i bookmarked your blog (http://blog.explainmydata.com) a few weeks back - i was really enjoying those posts. nice work


Thanks :-)

I'm going to finish grad school soon and the job I'm going to will be pretty rich with data analysis, hopefully will lead to more blog posts.


For those wanting to get started (or further) in machine learning, I highly recommend the article, "A Few Useful Things to Know About Machine Learning," by Pedro Domingos (a well respected ML researcher): http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf. It's written in a very accessible style (almost no math); contains a wealth of practical information that everyone in the field "knows", but no one ever bothered to write down in one place, until now; and suggests the best approaches to use for a variety of common problems.

As someone who uses machine learning heavily in my own research, a lot of this seemed like "common sense" to me when I read it, but on reflection I realized that this is precisely the stuff that is most valuable and hardest to find in existing papers and blog posts.


The equations look fine to me - I was a math major in college. Honestly, I get so tired of humanities people -- or programmers -- bragging about how much they hate math.

Except:

https://gist.github.com/benmcredmond/0dec520b6ab2ce7c59d5#fi...

I didn't know k-means clustering was that simple. I am taking notes...

  * pick two centers at random

  run 15 times: 

  * for each post, find the closest center
  * take the average point of your two clusters 
  as your new center
This is cool. It is 2-means clustering and we can extend it to 5 or 13...

We don't need any more math, as long as we don't ask whether this algorithm converges or how quickly
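
Those notes translate to Python almost line for line (a toy sketch: no convergence check, and an empty cluster just keeps its old center):

    import random

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def kmeans(points, k=2, iterations=15):
        centers = random.sample(points, k)             # pick k centers at random
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:                           # for each point, find the closest center
                nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
                clusters[nearest].append(p)
            for c in range(k):                         # the average of each cluster becomes the new center
                if clusters[c]:
                    centers[c] = [sum(dim) / len(dim) for dim in zip(*clusters[c])]
        return centers, clusters

    # e.g. kmeans([[0, 0], [0, 1], [9, 9], [10, 10]], k=2)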


I write academic papers, and I've started writing blog posts about them, and I think this post doesn't cover one of the main reasons that academic papers are less accessible to non-specialists.

When you write an academic paper, it's basically a diff on previous work. It's one of the most important considerations when the paper first comes out. The reviewers and the people up-to-the-minute with the literature need to see which bit is specifically new.

But to understand your algorithm from scratch, someone needs to go back and read the previous four or five papers --- and probably follow false leads, along the way!

It's another reason why academic code is often pretty bad. You really, really should write your system to first replicate the previous result, and then write your changes in on top of it, with a _bare minimum_ of branching logic, controlled by a flag, so that the same runtime can provide both results. And you should be able to look at each point where you branch on that flag, and check that your improvements are only exactly what you say they are.

When you start from scratch and implement a good bang-for-buck idea, yes, you can get a very simple implementation with very good results. I wrote a blog post explaining a 200-line POS tagger that's about as good as any around.[1] Non-experts would usually not predict that the code could be so simple, from the original paper, Collins (2002).[2]

I've got a follow-up blog post coming that describes a pretty good parser that comes in at under 500 lines, and performs about as accurately as the Stanford parser. The paper I wrote this year, which adds 0.2% to its accuracy, barely covers the main algorithm --- that's all background. Neither does the paper before me, released late last year, which adds about 2%. Nor the paper before that, which describes the features...etc.

When you put it together and chop out the false-starts, okay, it's simple. But it took a lot of people a lot of years to come up with those 500 lines of Python...And they're almost certainly on the way towards a local maximum! The way forward will probably involve one of the many other methods discussed along the way, which don't help this particular system.

[1] http://honnibal.wordpress.com/2013/09/11/a-good-part-of-spee...

[2] http://acl.ldc.upenn.edu/W/W02/W02-1001.pdf


As someone who works in machine learning, I have mixed feelings about this article. While encouraging people to start learning about ML by demystifying it is a great thing, this article comes off as slightly cocky and dangerous. Programmers who believe they understand ML while only having a simplistic view of it risk not just creating less-than-optimal algorithms, but downright dangerous models: http://static.squarespace.com/static/5150aec6e4b0e340ec52710...

In the context of fraud detection (one of the main areas I work in these days), a model that is right for the wrong reasons might lead to catastrophic losses when the underlying assumption that made the results valid suddenly ceases to be true.

Aside from the fact the techniques he mentioned are some of the simplest in machine learning (and are hardly those that would immediately come to mind when I think "machine learning"), the top comment on the article is spot on:

> "The academic papers are introducing new algorithms and proving properties about them, you’re applying the result. You’re standing on giants’ shoulders and thinking it’s easy to see as far as they do."

While understanding how the algorithm works is of course important (and I do agree that they are often more readable when translated to code), understanding why (and when) they work is equally important. Does each K-Means iteration always reach a stable configuration? When can you expect it to converge fast? How do you choose the number of clusters, and how does this affect convergence speed? Does the way you initialize your centroids have a significant effect on the outcome? If yes, which initializations tend to work better in which situations?

These are all questions I might ask in an interview, but more importantly, being able to answer these is often the difference between blindly applying a technique and applying it intelligently. Even for "simple" algorithms such as K-Means, implementing them is often only the tip of the iceberg.
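
To make one of those questions concrete, here's the kind of quick sanity check I mean for initialization sensitivity (a scikit-learn sketch on made-up data):

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    for seed in range(5):
        km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
        # different random initializations can settle into different local optima
        print(seed, round(km.inertia_, 1))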


Hey, author here.

> Aside from the fact the techniques he mentioned are some of the simplest in machine learning (and are hardly those that would immediately come to mind when I think "machine learning")

The primary point of this article was the contrast between something as simple as K-Means, and the literature that describes it. It wasn't meant as a full intro to ML, but rather something along the lines of "give it a try, you might be surprised by what you can achieve".

> Even for "simple" algorithms such as K-Means, implementing them is often only the tip of the iceberg.

Yup. But getting more people to explore the tip of the iceberg is, in my opinion, a good thing. We don't discourage people from programming because they don't instantly understand the runtime complexity of hash tables and binary trees. We encourage them to use what's already built knowing that smart people will eventually explore the rest of the iceberg.


Thanks for responding. I fully agree with your comment -- as I said, I too think many people are sometimes put off by the apparent complexity of machine learning, and demystifying how it works is a great thing. Unfortunately there's always a risk that a "hey, maybe this isn't so hard after all" might turn into a "wow, that was easy". While I think the former is great, the latter is dangerous because machine learning is often used to make decisions (sometimes crucial ones, for example when dealing with financial transactions), so I would argue more care should be taken than if we were talking about general-purpose programming: if you trust an algorithm with making important business decisions, then you had better have an intimate knowledge of how it works.

While I again agree with the underlying sentiment, I was just a bit disappointed that it seems to invite the reader to be satisfied with himself rather than motivate him to dig deeper. Nothing a future blog post can't solve though!


A lot of concepts are easier when you know how they work.

CPUs were magical for me before I took a computer architecture course. So was AI and machine learning. Once you see the "trick" so to speak you lose some of the initial awe.


Most of the comments on here are from people in the field of ML saying "this is a toy example, ML is hard."

Maybe that's the case. And maybe the title of the submission ruffled some feathers but the thrust of it is that ML is approachable. I'm sure there's devil in the detail, but it's nice for people who are unfamiliar in a subject to see it presented in a way that's more familiar to them with their current background.

I have a university background in Maths and Comp Sci so I'm not scared of code or mathematical notation. Maybe if I'd read the comments on here I'd get the sense that ML is too vast and difficult to pick up. I'm doing Andrew Ng's coursera course at the moment and so far it's all been very easy to understand. I'm sure it gets harder (I even hope so) and maybe I'll never get to the point where I'm expert at it, but it would be nicer to see more of a nurturing environment on here instead of the knee jerk reactions this seems to have inspired.


Ng's coursera class is really dumbed down and although he gives a nice intuitive explanation of the basics, it's nowhere near as hard as his actual Stanford class (or any other ml class from a good engineering school).


many people still can't pass it or understand it...


I'm cynical about how machine learning of this type might be used in practice and this is an illustration of why: the stated goal is a "you might also like" section.

There is no reason to believe the results are any better than a random method with respect to the goal (and it's reasonable to believe they may be worse) - we would have to measure this separately, by clickthrough rate or a user satisfaction survey, perhaps.

I believe you would get far better results by always posting the three most popular articles. If you want to personalise, post personally-unread articles. A lot less technical work, a lot less on-the-fly calculation, a lot more effective. The machine learning tools do not fit the goal.

The most effective real example of a "you might also like" section is the Mail Online's Sidebar of Shame. As best as I can tell, they display their popular articles in a fixed order.

Machine Learning seems to make it easy to answer the wrong question.


This is really only an argument that the example in the article is not realistic (which it doesn't have to be, it might be expository). There are in fact countless applications of machine learning in actual daily use, such as detecting credit card fraud, where simpler manual methods would perform measurably worse in terms of money lost.


Sure, there are realistic applications of Machine Learning with great results.

But the article has failed in its headline goal ("ML is easier than it looks") if it chooses an example that is mathematically more complicated and still less effective than a naive alternative.

The first difficult task is to identify an effective method.


Wait till you have to hand-craft your algorithms because the off-the-shelf ones are too slow ;). In the end you can stand on the shoulders of giants all day, but until you actually sit down and write an SVM, or even something more cutting-edge like stacked deep autoencoders, yourself, machine learning isn't "easy".

In the end, libs are there for simpler use cases or educational purposes. Realistically, that's more than good enough for 90% of people.

That being said, it's not impossible to learn. Oversimplifying the statistics, tuning, and work that goes into the algorithms you're using, though? Not a good idea.


I recently read a book called "Data Smart" [1], where the author does k-means and prediction algorithms literally in Excel. This was quite eye-opening, as it shows the barrier to entry for ML is not so high. However, translating your data into a format/model you can run ML on is another challenge.

[1] http://www.amazon.com/Data-Smart-Science-Transform-Informati...


Sure, some ML concepts are intuitive and accessible without advanced math.

But it would help to highlight some of the fundamental challenges of a simplistic approach.

For example, how is the author computing the distance between points in n-dimensional space?

And does this mean that a one-paragraph post and a ten-paragraph post on the same topic probably wouldn't be clustered together?


Well I don't think any of the problems you mentioned are difficult to solve, even by a non-seasoned programmer.

For your first question, just use Euclidean distance -- in other words, sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2). IMO, anyone who can program can do that.
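
Something like this, say (a throwaway Python version):

    import math

    def euclidean_distance(p, q):
        # p and q are equal-length sequences of numbers
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # e.g. euclidean_distance([0, 1, 1], [1, 1, 0])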

Also for your second question - probably - but he's using a very simplistic model. To fix that, you simply fix the heuristic. For example, you could instead score groups by how frequently certain important words occur, like "team" and "support".


It's not that it's hard to code, it's that a simplistic approach will lead to questions that might not be intuitive.

If you use Euclidean distance, then you essentially count 1 for each word difference. That means long posts will have longer distances from shorter posts with the same topic.

If you fix that by treating "important" words differently, then you have to figure out importance, and then normalize by number of terms, etc.

Before you know it you're building a naive Bayes classifier. Again, the challenge isn't in the coding, it's in understanding the limitations of the more simplistic design.


You have binary features, why would you use euclidean distance?


Well, in this case you can also just use a simpler norm, in other words adding up the differences. The poster asked about finding the distance between points in N-dimensional space in general (he didn't specify the author's example). But the idea is that your means wouldn't be bucketed into just a couple of values, but would have a larger distribution. In any case, it doesn't matter much for this problem, but if you scored it by word count rather than a simple 1 or 0, it would matter.


Building a model on some training data is easy. Debugging it is hard. The machine-learned model is essentially the equivalent of code that gives probabilistic answers. But there are no breakpoints to set or watches to add. When the model doesn't give the answer you expect, you need to do statistical debugging. Is the problem in the training data, do you have correct sampling, are you missing a feature, is your feature selection optimal, does your featurizer have a bug somewhere? The possibilities are endless. Debugging an ML model is a black art. Most ML "users" would simply give up if the model doesn't work on some cases, or if they added a cool feature but the results are not that good. In my experience, less than 10% of people who claim ML expertise are actually good at doing statistical debugging and able to identify the issue when a model doesn't work as expected.


tl;dr: the ML algorithms look hard when you read the papers, while the code looks simpler and shorter; also, you can get pretty decent results in a few lines of R/Python/Ruby, so ML is not that complex.

I disagree in so many ways:

1. Complex algorithms are usually very short in practice (e.g. Dijkstra's shortest path or edit distance are the first that come to mind).

2. ML is not just applying ML algorithms: you have to evaluate your results, experiment with features, visualize data, think about what you can exploit and discover patterns that can improve your models.

3. If you know the properties of the algorithms you are using, you can have insights that might help you improve your results drastically. It's very easy to apply the right algorithms with the wrong normalizations and still get decent results in some tests.


Matrix multiplication, orthonormal bases, triangular matrices, gradient descent, integrals, Lebesgue measure, convexity, and the mathematical notation in the paper are not harder than the code shown here. It is better to have solid proof that what you are doing is sound and will converge before jumping into the code.


I don't get all the negative comments.

From my limited text comprehension abilities, the author did not say that the whole field is trivial and that we should sack all academics.

Instead, the argument is that basic Machine Learning techniques are easy and one shouldn't be afraid of applying them.


Likely it was this part at the beginning that exposed the author's anti-intellectualism/ignorance:

"The majority of literature on machine learning, however, is riddled with complex notation, formulae and superfluous language. It puts walls up around fundamentally simple ideas."

That's what got me, it's such a flat out stupid thing to say. Especially the "puts walls up" phrasing to imply that math beyond what the author understands is a conspiracy. Especially from someone who writes code ... which is full of "complex notation", symbols and "superfluous classes/libraries/error checking/abstractions" by the same absurd reasoning.


Hi, author here. I love reading academic papers and I love math. But it's an unfortunate fact that academic papers make up the majority of the literature on ML — and that their notation and writing style exclude a large population who would otherwise find them useful. That's all I'm trying to get at here.


They don't exclude people by using math any more than programmers exclude people by using subclasses or anonymous functions.

I too would love to see more entry level ML resources, but we are talking about academic papers here.

Imagine how stupid it would seem if some outsider came in to HN and started telling programmers to dumb down their code when it's open source because their advanced techniques are too hard for a novice to understand off the bat. Instead of optimizing for efficiency, elegance and readability for their peers - the people they are collaborating with to solve actual problems - they're told to cater to novices always.

The language in your post is textbook anti-intellectualism, isn't it? And strangely entitled. You would certainly not apply these criteria to code, but since it's not your field, they must cater to your lack of experience? You know better than they do how to communicate ML research to other experts?


Upvoted this for an interesting read, but I agree with the sentiments in the comments that (1) ML is in general hard (2) some parts of ML are not that hard, but are likely the minority (3) we are standing on the shoulders of giants, who did the hard work.


The fundamentals of linear algebra and statistics are indeed quite easy to understand. Common concepts and algorithms such as cosine similarity and k-means are very straightforward.

Seemingly arcane mathematical notation is what frightens off beginners in many cases, though. Once you've understood that - for instance - a sum symbol actually is nothing else but a loop, many things become a lot easier.
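
For example, a sum over a list of values is just (in Python):

    xs = [2.0, 3.0, 5.0]

    total = 0.0
    for x in xs:     # the sigma is nothing but this accumulating loop
        total += x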

However, the devil's in the details. Many edge cases and advanced methods of machine learning are really hard to understand. Moreover, when 'good enough' isn't good enough any more, things tend to become very complex very quickly.


The post does do a good job of showing how easy it is to implement k-means.

The post doesn't really go into centroid selection or evaluation, or the fact that clustering on text is going to be painful once you move to a larger dataset.


The difficult aspects take centre stage when things go wrong.


Some while back I implemented k-means in JavaScript. It's a really simple, straight forward algorithm which makes sense to me, as a visual thinker and a non-mathematician. Check out the implementation here:

https://github.com/kamilafsar/k-means-visualizer/blob/master...


I think this is one of the reasons why it should become standard practice to provide code implementations of described algorithms. It not only provides an executable demonstration of the algorithm but, as importantly, an alternative description that may be more accessible to other audiences. It can also be used as confirmation that what is intended is indeed what is being done.


It is nice to read about this in plain language. But can someone explain what the X and Y axes are meant to represent in the graph?


They don't mean anything in particular. The actual analysis is being done in a high-dimensional space, in which each post is represented by a high-dimensional vector of the form [0,0,1,0,...., 0,1,0]. The length of the vector is the total number of distinct words used across all blog posts (maybe something like 30,000), and each entry is either 0 or 1 depending on whether the corresponding word occurs in this post. All the distances and cluster centers are actually being computed in this 30000-dimensional space; the two-dimensional visualization is just for intuition.

If you're wondering how the author came up with the two-dimensional representation, the article doesn't say, but it's likely he used something like Principal Component Analysis (http://en.wikipedia.org/wiki/Principal_component_analysis). This is a standard technique for dimensionality reduction, meaning that it finds the "best" two-dimensional representation of the original 30,000-dimensional points, where "best" in this case means something like "preserves distances", so that points that were nearby in the original space are still relatively close in the low-dimensional representation.
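
A rough sketch of that pipeline with scikit-learn (my guess at the details; the example posts are invented):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import PCA

    posts = ["how we scaled our support team",
             "new billing features for your account",
             "lessons from our last engineering outage"]

    # binary bag-of-words: one dimension per distinct word, entries are 0 or 1
    X = CountVectorizer(binary=True).fit_transform(posts).toarray()

    # squash the high-dimensional points down to 2-D, e.g. for a scatter plot
    coords_2d = PCA(n_components=2).fit_transform(X)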


Sure, author here.

The point of the graphs are as an example more than anything — in any real world solution it's not trivial to plot your documents (as representing documents with thousands of words in 1d, 2d or 3d is not trivial).

But if _you could_ represent your documents in 2d, you could plot them like I do here. And, the X-axis would represent 1 feature of your documents, and the Y-axis another.

Tl;dr don't focus on the axes — focus on the idea of being able to compare the relative distances between documents.


The illustration has 2 dimensions, x and y.

If we have x, y, z, i, j, and n then we'd have six dimensions.

If each word represents a dimension, any blog title with that word gets a 1 on that word's dimension, or a 0 if it does not.

You can then calculate the distance between these points considering all of the dimensions, although clearly it's going to be a little limited on binary inputs, since the square of the differences is always going to be 1 or 0. Like for 4 dimensions, sqrt( ( 0 - 1 )^2 + ( 0 - 0 )^2 + ( 0 - 0 )^2 + ( 1 - 0 )^2 ), which just always gets simplified to something like sqrt( 1 + 0 + 0 + 1 ), which is a little boring.

The binary values cause your data points to all be stacked directly on top of each other, which leads me to believe that using binary inputs is a less than ideal application for k-means. Just look at it in the 2d case, where you have either [0,1], [0,0], [1,0], or [1,1] for each data point. Not very hard to determine the clustering there... basically just doing an overly complex boolean expression.


They don't really represent anything.

Think of those graphs as being something like the rubber-sheet analogy that's often used to explain general relativity - it's not a great analogy in all respects, but humans can't really visualize 4 dimensions and it captures the gist of the idea well enough. Machine learning models have a similar problem: in reality the data points in k-means clustering exist in some very high-dimensional space that's impossible to draw. So for the sake of visual aids it's common to draw 2D plots where the only thing they're meant to suggest is that points that appear close together in the visual aid would also be fairly close to each other in the actual problem space.


The graphs are just illustrations. The actual graphs would be high-dimensional (one dimension or axis for each word in the total word set), and each axis would have only two ticks on it: 0 and 1. A post is mapped to the binary vector in this space whose coordinates are determined by the presence or absence of words in the post.


Of the two methods described - search vs. clustering - the first one - simpler and not involving ML - is better for this use case. The only reason it seems to give worse results is that it's only used with the titles and not the full body (unlike the clustering approach). So I guess machine learning is easier to mis-use than it looks...


They should try using tf-idf to create the initial representation of the keywords per post... Also, I find there are many cases where applying machine learning/statistics correctly is harder than it looks, this single case notwithstanding.


It may be easier to do than it looks, but it's also harder to do well.


This is as valid as someone stating that computer science is easy because they know HTML.


I like the simple explanation of K-Means, and I like the contrast with the dense set-theoretic language -- a prime example of "mathematosis" as W.V.O. Quine put it.


I admit that was an unfair swipe at formalism, but I do think that formalism is far more effective in the light of a clear, intuitive exposition that conveys the concept straight into the mind. Formalism is good for battening things down tightly, after the fact.



