
Machine Learning for Everyday Tasks - linkregister
http://blog.mailgun.com/machine-learning-for-everyday-tasks/
======
minimaxir
This is a case where knowing the data and its _inherent structure_ is just as
important, if not more important, than knowing how to apply machine learning.

In this case, the formula was (time ~ content length + number of tags).
Except, by construction, the content length is _positively correlated_ with
the number of tags! And since parsing tags is the primary purpose of HTML
scrapers, it's likely that content length is completely redundant or a poor
predictor of time (which was the conclusion). If you wanted to calculate the
magnitude/significance of an explanatory variable, a linear regression is
more than sufficient, and R has great tools for that.
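
For illustration, here's a minimal sketch of that check in Python with
statsmodels (synthetic data and hypothetical column names, not the article's):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    num_tags = rng.poisson(50, 500)
    # content length correlated with tag count by construction
    content_length = num_tags * 40 + rng.normal(0, 200, 500)
    time = 0.1 * num_tags + rng.normal(0, 1, 500)
    df = pd.DataFrame({"time": time, "content_length": content_length,
                       "num_tags": num_tags})

    fit = smf.ols("time ~ content_length + num_tags", data=df).fit()
    print(fit.summary())  # p-values show which predictor actually matters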

The quantile plot revealed that there is a skew in processing time; presumably
the number of tags has a similar skew.
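
A quick numeric way to see that kind of skew (illustrative stand-in data, not
the article's timings):

    import numpy as np

    times = np.random.lognormal(0, 1, 10_000)  # stand-in for skewed timings
    print(np.percentile(times, [50, 90, 95, 99]))  # median vs. tail diverge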

~~~
thanatropism
I keep telling people in my profession who didn't really learn important
things in school: "regression" should really be called "iid regression".

For example: time-series econometrics is a whole field dedicated to trying to
squeeze iid information from stuff that's not iid. So you can use regression.

But they're receptive to my message because they were repeatedly warned that
curve fitting was not regression analysis. Somehow this is lost in our "data"
culture.

~~~
ekianjo
> trying to squeeze iid information from stuff that's not iid. So you can use
> regression.

I'm not sure I understand your last sentence here - so you can, or you can't,
use regression for non-iid stuff? It would seem that you cannot.

~~~
pyromine
There certainly are forms of regression that can work with heteroskedastic
errors, i.e. non-iid errors (arising from clustering, autoregressive
processes, and more), but iid errors are part of the standard ordinary least
squares assumptions and interpretations.

If you do not have iid errors, any interpretation of the model will be
skewed.

There are, however, heteroskedasticity-robust interpretations of
regressions.
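
For instance, in Python/statsmodels, heteroskedasticity-consistent standard
errors are one fit option (a sketch with synthetic data, not a recipe from
the article):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 500)
    y = 2 * x + rng.normal(0, x)  # error variance grows with x
    X = sm.add_constant(x)

    robust = sm.OLS(y, X).fit(cov_type="HC3")  # HC ("sandwich") SEs
    print(robust.bse)  # compare with sm.OLS(y, X).fit().bse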

~~~
felixpl
Weighted least squares should do the trick, i.e. scale your data with the
inverse covariance matrix.
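
A sketch of that in statsmodels, assuming a known (or estimated) per-point
noise scale; the weights are the inverse variances:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 500)
    sigma = 0.5 + x                       # hypothetical noise scale per point
    y = 2 * x + rng.normal(0, sigma)
    X = sm.add_constant(x)

    wls = sm.WLS(y, X, weights=1.0 / sigma**2).fit()
    print(wls.params)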

------
tomelders
I'm currently building a UI for an app that asks the user to input a
ridiculous amount of info in exchange for an accurate trade-in price on a
vehicle. I mentioned to the client that this is a classic linear regression
problem and that they could probably get more accurate estimates with far
fewer input variables. Also, their backend datastore is in a real mess and is
going to be a constant source of problems going forward.

But they simply weren't interested in even considering it. They just do not
believe that machine learning can outperform their current process.

~~~
edpichler
> Ridiculous amount of info

Just to understand: when you said "ridiculous", did you mean "a few" or "a lot"?

~~~
tomelders
About 50 fields. And a lot of those fields are really difficult for your
average person to know how to answer.

------
nacc
I might have missed something, but if they only have 2 features, simply
plotting the data out will make any trend very clear.

~~~
sergey-obukhov
Re-pasting my comment here where it belongs as a reply:

It could be a great idea to plot the data on 2 axes (x - html length or html
size, y - processing time, if I understand this correctly). It's simple and
elegant. I'll try that. It could be, though, that the chart will get messy
with all these data points. One of the reasons I like the percentiles approach
is that it makes it clear that there is a trade-off between message processing
time and the number / percentage of messages we can process.
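
Something like this, perhaps (a matplotlib sketch with made-up data; a low
alpha keeps the dense regions readable):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    html_size = rng.lognormal(10, 1, 20_000)   # hypothetical byte sizes
    proc_time = html_size * 1e-6 + rng.lognormal(-3, 0.5, 20_000)

    plt.scatter(html_size, proc_time, s=2, alpha=0.05)
    plt.xscale("log")
    plt.xlabel("HTML size (bytes)")
    plt.ylabel("processing time (s)")
    plt.show()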

~~~
minimaxir
> It's simple and elegant.

The OP likely made that comment because plotting the data is often done
_before_ the fancy machine learning, as part of the exploratory data
analysis (especially with a low number of variables!).

------
Animats
Now this has potential. Put a tool in the mail reader which reads
"transactional" emails and matches them up with actual transactions. Discard
entire emails which have no match to a transaction, and discard advertising in
transactional emails. This would improve spam filtering.

It would also be useful for mail readers which understand conversations,
converting them to message-like bubble format.

~~~
jasikpark
Well, Hop is kind of a thing.

~~~
Animats
That's nice, but it uses an unnecessary hosted service, which is bad for
security and reliability. This should all be done in an IMAP client.

------
nerfhammer
I once wrote a classifier to tell whether the metadata on a web page was any
good.

The best features? Look for classic HTML mistakes on the page: are they still
using font tags, do they use GIFs instead of PNGs, did they include a keywords
meta tag, did they specify an encoding or are there windows-1252 characters
present anywhere, etc. I came up with 20 or so signs of bad HTML elsewhere on
the page, and collectively those features were much more predictive than any
of the content itself.
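
A minimal sketch of the idea (hypothetical feature set and helper, not my
actual code):

    import re
    from sklearn.linear_model import LogisticRegression

    def bad_html_features(html: str) -> list:
        h = html.lower()
        return [
            int("<font" in h),                    # legacy font tags
            int(".gif" in h),                     # gifs instead of pngs
            int('name="keywords"' in h),          # keywords meta tag
            int("charset" not in h),              # no declared encoding
            int(bool(re.search("[\x80-\x9f]", html))),  # windows-1252 range
        ]

    # X = [bad_html_features(page) for page in pages]; y = quality labels
    # clf = LogisticRegression().fit(X, y)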

------
bsbechtel
I'm waiting for a "Machine Learning Tools for the Non-technical" post. When we
can make machine learning accessible to business owners and non-technical
people, that is when machine learning will really have a big impact on
society. Until then, its accessibility is just too limited.

~~~
tomelders
I'm currently trying to crack it. What I've realised so far is that once you
get past the scary mathematical notation, it's really quite simple under the
hood. Maybe that won't be the case as I learn more, but linear regression and
gradient descent are really very simple and easy to grok.

So far I've been cobbling together a learning regimen by combining Coursera,
Khan Academy, and a book called "Machine Learning for Programmers".
Individually they all eventually brush over some key concepts that make it
difficult to get an intuition for what's happening. But when combined they
fill in the gaps. Also, this repo has been like the Rosetta Stone for someone
who doesn't have a single maths qualification, not even a GCSE.

[https://github.com/Jam3/math-as-code/blob/master/README.md](https://github.com/Jam3/math-as-code/blob/master/README.md)
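
As a taste of how simple it is, here's a toy gradient descent fit of
y = w*x + b (my own sketch, not from any of those courses):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 200)
    y = 3.0 * x + 2.0 + rng.normal(0, 1, 200)  # synthetic: true w=3, b=2

    w, b, lr = 0.0, 0.0, 0.01
    for _ in range(2000):
        err = (w * x + b) - y            # prediction error
        w -= lr * 2 * np.mean(err * x)   # d(MSE)/dw
        b -= lr * 2 * np.mean(err)       # d(MSE)/db

    print(w, b)  # lands near 3 and 2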

~~~
stewbrew
I hope you are aware that those introductory courses give you a rather
high-level view of things and often hide the problematic points from you. It's
like learning to drive and only being taught about the accelerator.

------
YeGoblynQueenne
>> CART showed slightly better results for this task but its main advantage is
that it gives you a decision tree that is easy to understand, explain and
implement vs SVM or Random Forests that work like a black box and require
using heavy ML libraries in production.

I don't get that. Random Forests are ensembles of decision trees and can be
used for regression. I've not used the R CART library, so I might be missing
something, but the interpretability of Random Forests should be the same as
that of, well, any other model built by a decision-tree learner (i.e. just as
unreadable as any other classifier's model, once its parameters become
sufficiently many).
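
(The quoted claim is that a single tree stays readable; for reference, here's
what dumping one as rules looks like in scikit-learn, with made-up feature
names - the blog itself used R, so this is only illustrative:)

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    X = np.random.rand(200, 2)       # hypothetical [content_length, num_tags]
    y = 2 * X[:, 0] + 5 * X[:, 1]    # synthetic processing time
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["content_length", "num_tags"]))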

------
linkregister
I got lost at the cluster analysis. Can someone explain that in more detail?
(Maybe Sergey?)

~~~
sergey-obukhov
There are many approaches
[https://en.wikipedia.org/wiki/Cluster_analysis](https://en.wikipedia.org/wiki/Cluster_analysis).
In the research k-means clustering was used. It's probably not the best
clustering algo for this task, and much better results would be achieved if a
different one were used or if some data manipulation were applied prior to the
clustering. Here's a k-means clustering explanation if you're interested:
[https://youtu.be/_aWzGGNrcic](https://youtu.be/_aWzGGNrcic)
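
A minimal k-means sketch in Python/scikit-learn (illustrative only - the
research itself was done in R, and these feature names are hypothetical):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(1000, 2)   # hypothetical [html_size, processing_time]
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)    # one centroid per cluster
    print(km.labels_[:10])        # cluster assignment per message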

------
c06n
Are we now going to call every statistical technique "Machine Learning"?

~~~
thanatropism
No, no, no. Statistical techniques start from hypotheses and are validated by
tests and confidence intervals.

Here we basically don't care what our chances of getting it wrong are. It's a
marginal feature in some email client, not quality control in a brewery.

~~~
jfoutz
I kinda think you're being downvoted because people don't know the origin of
the t-test. I thought it was funny.

------
nether
"Everyday tasks," I was picturing making coffee, driving my car...

------
apathy
this is strange -- why is there a little bit of Python in the middle of an R
implementation/prototyping?

~~~
sergey-obukhov
While the research and prototyping were done in R, the service itself is in
Python. The implementation is so simple that it doesn't require any specific
ML libs. More complex tasks will probably require using scikit-learn (if we
want to stay on Python ground) or plugging in R in one way or another, etc.

~~~
apathy
So the whole thing was just an exercise in choosing a cut point? It would seem
like you might want to revisit this sort of thing every now and then, since
web "frameworks", your parsing code, and CPU hardware all change over time.

~~~
sergey-obukhov
You're absolutely right!

------
tlarkworthy
And that's why you PCA before clustering (for normalization).
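
A sketch of that ordering in scikit-learn (scale, then PCA, then cluster;
the feature names are hypothetical):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    X = np.random.rand(1000, 2)  # hypothetical [html_size, processing_time]
    pipe = make_pipeline(StandardScaler(),       # normalize scales first
                         PCA(n_components=2),    # decorrelate/rotate
                         KMeans(n_clusters=3, n_init=10, random_state=0))
    labels = pipe.fit_predict(X)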


