No doubt this is a difficult position to master, and those who perform well are able to tackle lots of mathematical and computational challenges. They are also model builders who tend to relentlessly seek complex models in order to solve complex problems.
The other, rarer side is the learning theorist, who may or may not understand the model-building, algorithmic, and computational tools but understands well the theories that allow us to have reasonable expectations that the tools of the first group will work at all. These guys have a funny story in that they were the old statisticians who ended up with egg on their faces after proclaiming that essentially all of ML was impossible. Turns out the first group managed to redefine the problem slightly and make major headway (and money).
The thing I want to bring to light, however, is that the second group knows the math that bounds the capacities of ML algorithms. This isn't easy. It's one thing to say you recognize that the curse of dimensionality exists, but it's another to have felt its mathematical curves and built an intuition for what forces are sufficient to cause disruption.
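To make that concrete, here is a toy sketch (my own illustration, not from any particular source) of one of those curves: in high dimensions, pairwise distances between uniformly random points concentrate around their mean, so "near" and "far" neighbours become almost indistinguishable.

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        # 500 points drawn uniformly from the unit hypercube in d dimensions
        points = rng.random((500, d))
        dists = pdist(points)  # all pairwise Euclidean distances
        # the relative spread of distances shrinks as d grows
        print(f"d={d:5d}  std/mean of pairwise distances = {dists.std() / dists.mean():.3f}")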
The more experience you have with the learning maths, the more likely you are, I feel, to apply very simple algorithms, to be scared of "little x's" (real data) enough to treat them with great care, and to explore the problem space with confidence about which steps will lead you to folly.
It's a fine line between the two, though. Stray too far toward the first group and you'll spend a month building an algorithm that does a millionth of a percentage point better than Fisher's LDA. Spend too much time in the second camp and you'll confidently state that no algorithm exists that does better than a millionth of a percentage point over Fisher's LDA... and then lose purely by never trying.
You can build an extremely complicated model that is not useful, where a simpler one might suffice.
I find data cleansing (if you include feature selection in that) hard, and I consider it a refinement. If I am working on a classification problem, I start with naive Bayes and a trivial feature generator (if words are the features, split on whitespace and discard some symbols), train it, and cross-validate. Depending on the results of cross-validation on differently sized datasets (say 100 tweets, 200, 500, 1000, 2000, 5000), I decide whether to refine the Bayes model further or pick another algorithm.
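In scikit-learn terms that baseline looks roughly like the sketch below; load_tweets, texts, and labels are placeholders for however you actually get your data.

    # Rough sketch of the baseline described above (placeholder data loader).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    texts, labels = load_tweets()  # hypothetical: raw tweet strings and their classes

    model = make_pipeline(
        CountVectorizer(tokenizer=str.split),  # trivial feature generator: split on whitespace
        MultinomialNB(),
    )

    # Cross-validate on increasingly large slices to decide whether to keep
    # refining naive Bayes or switch to another algorithm.
    for n in (100, 200, 500, 1000, 2000, 5000):
        scores = cross_val_score(model, texts[:n], labels[:n], cv=5)
        print(n, scores.mean())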
I avoid SVMs because I have a hard time figuring out the kernel and the relations in the data. I mostly don't use linear classifiers because the relationship is very rarely linear.
Generally, if the features are pseudo-independent (naive Bayes assumes independent events, but it might work fine even if the events aren't independent), naive Bayes does the job. If not, it's time to refine the feature generator and selector.
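For reference, the "naive" part is just the conditional-independence factorization of the likelihood; beyond the choice of per-feature distributions, that is essentially the whole modelling assumption:

    P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)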
Regarding stronger assumptions, is there anything other than independence (thus the name naive) that it assumes?
But I would take a more pessimistic interpretation of this.
That is: all our "learning algorithms" have failed to learn, and those armed with some clever heuristics succeed against the broken methods we have so far.
Maybe it's the pain of my previous AI job talking, but when the choice comes down to an expert's opaque hunch, it doesn't feel like a victory for human intelligence. A victory for human intelligence looks more like the discovery of a physical law, where you both deal with a phenomenon and communicate how someone else can also deal with it.
What is quintessentially human in the modern sense is human beings understanding ourselves rather than
That said, one shouldn't underestimate the amount of commonality between problems that may appear unrelated to some people. For example, this post talks about the gains in machine translation performance from including larger contexts. The same principle applies to many other sequence learning problems. For example, you have a very similar issue with handwriting recognition, where it is often not possible (even for a human) to determine the correct letter classification for a given handwritten character without seeing it within the context of the word.
1) Programmers that have the needed math skills, or mathematicians with the needed coding skills
2) A distributed ML framework
Solving problem one is not easy but it's straightforward.
Solving problem two is harder. While there are a lot of open source machine learning projects, almost all of them seem to focus on being used by a person rather than by a program. Moreover, very few do distributed processing, with Mahout (http://mahout.apache.org/) being an exception. Mahout is promising, but the documentation is still thin, and I'm not sure it's gaining momentum in terms of mindshare yet.
You need a decent understanding of calculus (mid-1800s level, multivariate calculus), a more decent understanding of linear algebra (1950s), information theory (1960s), and probability and statistics, with the last having shifted the most from the past due to the more recent respect for Bayesian methods. Note that the years in parentheses are not to say that nothing new has been used from those areas; it's more that if you pick up a book on that topic from that year, you would be pretty well covered for the purposes of ML.
Also worth having a vague idea of are things like PAC learning, topology, and computational complexity, such as Valiant's work on evolvability. If you are doing work related to genetic programming, then category theory and type theory have riches to be plundered.
Or, if you want to be more hardcore and are looking at very high-dimensional data and reductions of it, you might look at algebraic geometry (in particular algebraic varieties) and group theory. So basically the answer to your question is: as little or as much math as you want, depending on the problem and your interest in trying approaches beyond the typical toolkits of linear algebra and statistics.
Could you expand on this a bit? I don't understand the meaning. Are you saying that if the problem you are working on can be solved with genetic algorithms, then you could blow it away with category and type theory?
I don't have a vested interest in either, I am just curious. Thanks.
Imagine all the applications for consumer products if algorithms were really able to understand language (as far as you can understand something if you are a computer program and not a sentient human being), for example if we were able to do real text summarization.
I believe this is not only possible, but not as far away as people think. However, to reach that goal we need to let go of the idea that NLP is mostly about clever feature engineering and instead start building algorithms that derive those features themselves. Part of the problem is how evaluation is set up in NLP. What the best algorithm is gets decided based on who gets the best performance on some dataset. This sounds all nice and objective, but you will always be able to get the best performance if you try enough combinations of features (overfitting the test set). These small improvements say little about real-world performance.
For the NLP people among you, this is an interesting paper that tries to do a lot of things differently: http://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf
This is the corresponding tutorial, which is quite entertaining as well: http://videolectures.net/nips09_collobert_weston_dlnl/
I think this is less true for machine translation, where there are more and bigger test sets and less feature engineering going on.
On the other hand, there exist alternatives such as LaSVM that can effectively scale linearly to large datasets (though the optimizer works in the dual representation, as with SMO, and not like Pegasos).
The key takeaway from the paper (for me) was that the computation time on a single processor was not significantly better than that of the standard implementation provided by SVM-Light. However, with a variety of tricks permitted by the use of an SGD/Pegasos-like method, the authors were able to get a significant speedup when using a compute cluster (e.g. a ~200x reduction in computation time on 512 processors).
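For anyone who hasn't seen it, the single-example linear Pegasos update is small enough to sketch. This is my paraphrase of the basic (non-distributed, linear-kernel) version, so treat it as illustrative rather than as the paper's implementation.

    import numpy as np

    def pegasos(X, y, lam=0.01, n_iters=100_000, seed=0):
        """Sketch of linear Pegasos: SGD on the primal hinge-loss SVM objective.
        X is an (n, d) array, y holds labels in {-1, +1}."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, n_iters + 1):
            i = rng.integers(n)            # pick one training example at random
            eta = 1.0 / (lam * t)          # decreasing step size
            if y[i] * (X[i] @ w) < 1.0:    # margin violated: hinge loss is active
                w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1.0 - eta * lam) * w  # only the regularizer shrinks w
            # optional projection onto the ball of radius 1/sqrt(lam)
            norm = np.linalg.norm(w)
            if norm > 0:
                w *= min(1.0, 1.0 / (np.sqrt(lam) * norm))
        return w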
[deleted part about kernelizing Pegasos; realized I don't know that area]
The difficulty with Pegasos for non-linear kernels is that the support set quickly becomes very large, so evaluating the model becomes very slow. Note that since the alpha values are not constrained to be non-negative (unlike in the standard dual algorithms), they never get clipped to zero; instead they just slowly converge to zero. It's still (I think) one of the fastest methods in terms of theoretical convergence guarantees, but perhaps not as fast as LaSVM or something similar in practice.
However, there's been a more general trend in machine learning toward using linear models with lots of features instead of kernel models, partially because of these sorts of scalability issues.
However, it seems that you need to compute the kernel expansion over the full set of samples (or maybe just the accumulated past samples?); that does not sound very online to me...
At first I tried using an off-the-shelf classifier to figure out which parameters would work well. That failed because, by the time I had sampled a decent proportion of the possible parameter values, the channel would change (the number of possible combinations is on the order of a few million).
It turned out that the real problem is not learning the performance of the available parameters; rather, it lies in "learning how to learn": my ML system needs to adaptively search the space, responding to the history of previous explorations and their outcomes. This kind of exploration would be effective only with an understanding of how the underlying modulation/coding algorithms work and interact with each other.
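One way to picture that (a bandit-style framing I'm adding for illustration, not the poster's actual system) is to keep a decayed performance estimate per parameter setting, exploit the current best most of the time, and keep exploring so the estimates can track channel changes.

    import random
    from collections import defaultdict

    class AdaptiveSearch:
        """Toy epsilon-greedy search over parameter settings with decayed
        estimates, so older observations fade as the channel drifts."""

        def __init__(self, settings, epsilon=0.1, decay=0.9):
            self.settings = settings              # candidate parameter combinations
            self.epsilon = epsilon                # exploration rate
            self.decay = decay                    # how much weight old outcomes keep
            self.estimates = defaultdict(float)   # running performance per setting

        def choose(self):
            if not self.estimates or random.random() < self.epsilon:
                return random.choice(self.settings)              # explore
            return max(self.estimates, key=self.estimates.get)   # exploit

        def observe(self, setting, reward):
            # exponential moving average: old outcomes decay away over time
            self.estimates[setting] = (self.decay * self.estimates[setting]
                                       + (1 - self.decay) * reward)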
What did help in understanding the models was the application of newer feature selection techniques that give a ranked list of features, such as grafting.
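I don't have a grafting implementation handy, but as a stand-in, any L1/sparse method gives you a similar ranked-list view of the features; for example (X, y, and feature_names are assumed to come from your vectorizer):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Stand-in for grafting: rank features by |coefficient| from an
    # L1-penalized logistic regression (assumes X, y, feature_names exist).
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X, y)

    weights = np.abs(clf.coef_).ravel()
    ranked = sorted(zip(feature_names, weights), key=lambda p: p[1], reverse=True)
    for name, w in ranked[:20]:   # top 20 features by weight
        print(f"{w:8.3f}  {name}")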
For example: trying to model environmental impact of Bill Gates's 66,000 sq ft house during a hackathon -> discovery that we need fuzzy set analysis (https://github.com/seamusabshere/fuzzy_infer) -> new, marketable capabilities in our hotel modelling product (https://github.com/brighterplanet/lodging/blob/master/lib/lo...).
I think it would have been better if this were just the first part of a multi-article write-up on ML, with this one being an intro and follow-ups covering specific approaches.
If you have to be really clever with feature engineering, then what's the point of even calling yourself a machine learning person?