Hacker News
Machine Learning for Everyday Tasks (mailgun.com)
342 points by linkregister on Nov 22, 2016 | 75 comments

This is a case where knowing the data and its inherent structure is just as important as, if not more important than, knowing how to apply machine learning.

In this case, the formula was (time ~ content length + number of tags). Except, by construction, the content length is positively correlated with the number of tags! And since parsing tags is the primary purpose of HTML scrapers, it's likely that content length is completely redundant or a poor predictor of time (which was the conclusion). If you wanted to calculate the magnitude/significance of an explanatory variable, a linear regression is more than sufficient, and R has great tools for that.
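To make the redundancy point concrete, here is a minimal sketch on synthetic data (not the article's data; the 40 characters per tag and the noise level are made-up assumptions). It only shows that when length is roughly proportional to tag count, the two predictors carry nearly the same information.

```python
import random

random.seed(0)

# Synthetic pages: each tag contributes some characters of markup and text.
# The 40-character average per tag is an arbitrary assumption.
n_tags = [random.randint(10, 2000) for _ in range(500)]
content_length = [t * 40 + random.gauss(0, 2000) for t in n_tags]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(n_tags, content_length)
print(f"correlation(tags, length) = {r:.3f}")
```

With predictors this correlated, adding the second one to the regression changes the fit very little, which matches the conclusion that either feature could be used.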

The quantile plot revealed a skew in processing time; presumably the number of tags has a similar skew.

I keep telling people in my profession who didn't really learn important things in school: "regression" should really be called "iid regression".

For example: time-series econometrics is a whole field dedicated to try to squeeze iid information from stuff that's not iid. So you can use regression.

But they're receptive to my message because they were repeatedly warned that curve fitting was not regression analysis. Somehow this is lost in our "data" culture.

i.i.d.: independent and identically distributed

> to try to squeeze iid information from stuff that's not iid. So you can use regression.

I'm not sure I understand your last sentence here - so you can, or you can't use regression for non iid stuff? It would seem that you cannot.

There are most certainly forms of regression that can work with heteroskedastic errors, i.e. non-iid errors (clustered, autoregressive, or otherwise), but iid errors are part of the general ordinary least squares assumptions and interpretations.

If you do not have iid errors, any interpretation of the model will be skewed.

There are, however, other heteroskedasticity-robust interpretations of regressions.

Weighted least squares should do the trick, i.e. scale your data with the inverse covariance matrix of the errors.
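As a rough illustration of the weighting idea, here is a minimal sketch for a single predictor, assuming the error variances are known and the errors are uncorrelated (so the inverse covariance matrix is diagonal and reduces to inverse-variance weights). The data and the helper name `weighted_least_squares` are made up for the example.

```python
import random

random.seed(1)

# Synthetic heteroskedastic data: noise standard deviation grows with x.
xs = [float(i) for i in range(1, 201)]
sigmas = [0.5 * x for x in xs]  # assumed known error standard deviations
ys = [3.0 + 2.0 * x + random.gauss(0, s) for x, s in zip(xs, sigmas)]

def weighted_least_squares(xs, ys, weights):
    """Fit y = a + b*x by minimising sum(w * residual**2)."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Inverse-variance weights: the diagonal case of the inverse covariance matrix.
weights = [1.0 / s ** 2 for s in sigmas]
a, b = weighted_least_squares(xs, ys, weights)
print(f"intercept = {a:.2f}, slope = {b:.2f}")
```

Noisy observations get small weights, so the fit is driven by the points that were measured precisely. With correlated errors the full covariance matrix is needed (generalized least squares), which this sketch does not cover.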

I'm currently building a UI for an app that asks the user to input a ridiculous amount of info in exchange for an accurate trade-in price on a vehicle. I mentioned to the client that this is a classic linear regression problem and that they could probably get more accurate estimates with far fewer input variables. Also, their backend datastore is in a real mess and is going to be a constant source of problems going forward.

But they simply weren't interested in even considering it. They just do not believe that machine learning can outperform their current process.

On a tangential note, "The robust beauty of improper linear models in decision making"[1] is probably one of my favorite papers on modeling and decision research.

> Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge's predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors.

[1] http://www.niaoren.info/pdf/Beauty/9.pdf
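For a feel of why equal weights hold up, here is a small simulation sketch. It is not the paper's experiment: the "optimal" weights are hypothetical, the features are independent standard normals, and the sign of every predictor is assumed known. It compares scores from the optimal weights with scores from unit weights.

```python
import random

random.seed(5)

# Hypothetical "optimal" weights; all positive, i.e. the direction of each
# predictor is assumed known, which is what unit weighting relies on.
optimal_weights = [1.0, 2.0, 3.0, 4.0]

def score(features, weights):
    """Linear model output: weighted sum of the features."""
    return sum(f * w for f, w in zip(features, weights))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

samples = [[random.gauss(0, 1) for _ in optimal_weights] for _ in range(5000)]
optimal_scores = [score(s, optimal_weights) for s in samples]
unit_scores = [score(s, [1.0] * len(s)) for s in samples]

r = pearson(optimal_scores, unit_scores)
print(f"correlation between optimal and unit-weighted scores = {r:.3f}")
```

Under these assumptions the two scores track each other closely even though the unit weights ignore a fourfold difference in importance between predictors.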

I studied this a couple of years back. You can even prove error bounds.

For example, using Jensen's inequality, I was able to prove that a unit weighted linear model will get the same result as the optimal linear model 75% of the time on average, provided the features are independent and normally distributed.


I haven't figured out how to prove the same result for binary features, but simulations suggest that will work also.

> prove that a unit weighted linear model will get the same result as the optimal linear model 75% of the time on average

This is such an important result & yet doesn't seem to be a widely known result. Your proof is very well done. You should consider publishing it in CMJ. Or if you don't mind, I'd be happy to rework it & send it in :) Thanks for writing it up.

I'm not familiar with CMJ but happy to discuss the idea. Write to me, email is in my profile.

As a paper it would be significantly stronger if it did the binary case, though.

Hey thanks! I prefer the CMJ (http://www.maa.org/press/periodicals/college-mathematics-jou...) to AMM etc. because your proof is ultimately undergrad math - some calculus, complex analysis, change of coords and inequalities. It's a pity the CMJ is not as well known. My professor used to circulate copies in class and we learnt more from it than from the grad journals which went out of the way to make things abstract. You also have a fantastic blog on Harris Benedict equation, which I have combined with an ML approach on WeightGreat - truth be told, this was one of those miracles that happen when you read Hacker News regularly. You find someone has worked on the exact same problem you are interested in, using a different approach, and you can combine that with your own insight to come up with a very promising idea.

I'll take a look. And send me an email, what you are doing sounds interesting, would love to learn more.

That is surprising. I wouldn't've guessed 75%. I would've guessed it could be arbitrarily close to 50%.

I haven't read the paper but based on your extract are such improper models more susceptible to black swan events? I'd suspect a more clinical model might actually be even more susceptible (more variables means a greater chance of some variables being wrong and/or missing correlations between variables which would amplify a wing variable to a greater degree) but I'm curious if this aspect had been explored.

Most people get scared by things they can't visualize. I have a friend at a large, popular tech company. He was tasked with solving a problem that a linear regression was perfect for. The thing they were trying to model was even known to be linear! His boss wouldn't let him do it because he couldn't visualize a best plane.

Instead, he had to do a series of best fit lines in the various possible 2D projections. At the end, he showed the boss that a linear regression provided the same result and the boss was still untrusting.

Your friend's boss doesn't sound like he should be managing data scientists or ML engineers

"Nobody ever gets fired for contracting IBM."

Heh. I have exactly the opposite problem. One of my senior product managers is crazy in love with the concept of ML, despite knowing absolutely nothing about it beyond what he's read in some puff pieces in business magazines. Any time I run into any non-trivial problem his first suggestion is often "What's so hard about that? Just collect a bunch of historical data and feed it into a machine learning algorithm (in the cloud!)".

edit: Also his definition of "Big Data" seems to be "more data than can be printed on an A4 piece of paper"

Where do you find these people? I have plenty of ML to sell them.

I'm joking of course, but I'm also quite curious: in which sector(s) does this seem to happen the most? My clients have been weary of ML hype, actually.

> But they simply weren't interested in even considering it. They just do not believe that machine learning can outperform their current process.

Right here is where a startup could be born...

I considered it, but the used car business is something I have absolutely no desire to get into. Fill your boots.

Are you sure they gave you the right reason? Or do they want the user to input the data, so they can use it for some other analysis?

> Ridiculous amount of info

Just to understand, when you said "ridiculous" did you mean "a few" or "a lot"?

About 50 fields. And a lot of those fields are really difficult for your average person to know how to answer.

> that they could probably get more accurate estimates with far fewer input variables

suggests it meant "a lot"

He definitely means a lot.

I might have missed something, but if they only have 2 features, simply plotting the data out will make any trend very clear.

Re-pasting my comment here where it belongs as a reply:

It could be a great idea to plot the data on 2 axes (x: HTML length or HTML size, y: processing time, if I understand this correctly). It's simple and elegant. I'll try that. The chart could get messy with all these data points, though. One of the reasons I like the percentiles approach is that it makes it clear that there is a trade-off between message processing time and the number / percentage of messages we can process.
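The trade-off in the percentiles approach can be sketched like this. The processing times are synthetic (the log-normal shape and its parameters are assumptions, chosen only to give a right-skewed distribution), and `share_processed` is a hypothetical helper, not something from the article.

```python
import random
from bisect import bisect_right

random.seed(2)

# Synthetic, right-skewed processing times in seconds.
times = sorted(random.lognormvariate(-1.0, 1.5) for _ in range(10_000))

def share_processed(sorted_times, cutoff):
    """Fraction of messages whose processing time is at or below the cutoff."""
    return bisect_right(sorted_times, cutoff) / len(sorted_times)

for cutoff in (1, 5, 20):
    share = share_processed(times, cutoff)
    print(f"cutoff {cutoff:>2}s -> {share:.1%} of messages processed")
```

Each candidate cutoff maps directly to the share of messages that would still be handled, which is the number a product decision needs.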

> It's simple and elegant.

The OP likely made that comment because plotting the data is often done before the fancy machine learning as a part of the exploratory data analysis. (especially with a low number of variables!)

IanCal's comment about setting the alpha is a good one. Another easy option is to make a heatmap of 2D bin counts. Since you're using R already, you can use ggplot for this:
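The 2D bin count idea can also be sketched without a plotting library. The following Python is an illustration only, on synthetic data with arbitrarily chosen bin sizes; it is not the ggplot snippet, and `bin_counts_2d` is a hypothetical helper.

```python
import random
from collections import Counter

random.seed(3)

# Synthetic (html_length, processing_time) pairs; the relationship is made up.
points = []
for _ in range(5000):
    length = random.uniform(0, 100_000)
    seconds = max(0.0, length / 20_000 + random.gauss(0, 0.5))
    points.append((length, seconds))

def bin_counts_2d(points, x_width, y_width):
    """Count points per rectangular bin, keyed by (x_bin, y_bin) index."""
    return Counter((int(x // x_width), int(y // y_width)) for x, y in points)

counts = bin_counts_2d(points, x_width=10_000, y_width=1.0)

# Crude text heatmap: rows are time bins (slowest on top), columns are length bins.
shades = " .:*#"
peak = max(counts.values())
for y_bin in range(max(y for _, y in counts), -1, -1):
    row = "".join(
        shades[counts.get((x_bin, y_bin), 0) * len(shades) // (peak + 1)]
        for x_bin in range(10)
    )
    print(f"{y_bin:>2}s |{row}|")
```

Binning sidesteps overplotting entirely: the density shows up as the shade of each cell instead of a pile of overlapping points.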


Set a low alpha on the points in the chart if there are loads.

I'm reminded of Anscombe's Quartet as an example of how a little visualization can reveal things that might be hidden behind summary statistics.


For higher dimensions, a RadViz plot with lower alpha is really useful too.

[0] http://docs.orange.biolab.si/2/widgets/rst/visualize/radviz....

Agreed. Also, they don't discuss how they chose their features... It seems the problem was already solved before even applying ML to it. Maybe the example is too naive?

HTML size and HTML tag count were a natural choice. If they hadn't worked out, the next step would have been to try something else. You're right that it's a very naive example and that in a way it was solved before any ML was applied. The surprising part for me was that both features turned out to be interchangeable, i.e. either of them could be used. I would have expected HTML tag count to be much more accurate / reliable, etc. Another interesting part for me was the threshold. It's somewhat clear that it should be somewhere between 1 and 20 sec, but where exactly?

The slowest part of any parser is the lexer - guessing whatever processing they do on the parse tree structure is insignificant by comparison.

Would the person or persons down voting all the author of TFA's comments mind explaining why? Seems a bit off.

Now this has potential. Put a tool in the mail reader which reads "transactional" emails and matches them up with actual transactions. Discard entire emails which have no match to a transaction, and discard advertising in transactional emails. This would improve spam filtering.

It would also be useful for mail readers which understand conversations, converting them to message-like bubble format.

Well Hop is kind of a thing.

That's nice, but it uses an unnecessary hosted service, which is bad for security and reliability. This should all be done in an IMAP client.

I once wrote a classifier to tell whether the metadata on a web page was any good.

The best features? Look for classic html mistakes on the page. Still using font tags, do they use gifs instead of png, did they include a keywords meta tag, did they specify an encoding or are there windows-1252 characters present anywhere, etc. I came up with 20 or so signs of bad html elsewhere on the page, and collectively those features were much more predictive than any of the content itself.

I'm waiting for a "Machine Learning Tools for the Non-technical" post. When we can make machine learning accessible to business owners and non-technical people, that is when machine learning will really have a big impact on society. Until then, its accessibility is just too limited.

I'm currently trying to crack it. What I've realised so far is that once you get past the scary mathematical notation, it's really quite simple under the hood. Maybe that won't be the case as I learn more, but linear regression and gradient descent are really very simple and easy to grok.
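To show how little is under the hood, here is a minimal sketch of linear regression fitted by gradient descent on mean squared error. The toy data, learning rate and step count are arbitrary choices for illustration, and `fit_line` is a hypothetical helper name.

```python
# Toy data lying close to y = 2x + 1 (values chosen by hand for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point under the current w and b.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step downhill.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = fit_line(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

The whole algorithm is the loop body: measure the errors, work out which way the loss slopes, and nudge the parameters the other way. A learning rate that is too large makes the updates diverge, which is one of the details introductory material tends to skip.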

So far I've been cobbling together a learning regimen by combining Coursera, Khan Academy, and a book called "Machine Learning for Programmers". Individually they all eventually brush over some key concepts, which makes it difficult to get an intuition for what's happening. But when combined they fill in the gaps. Also, this repo has been like the Rosetta Stone for someone who doesn't have a single maths qualification, not even a GCSE.


I hope you are aware that those introductory courses give you a rather high level view of things and often hide any problematic points from you. It's like learning to drive and only learning about the accelerator.

This has generally been my experience with mathematical notation applied to CS topics; the notation itself is basically impenetrable, but once you can get the underlying ideas translated into readable form they turn out to be straightforward and easy to apply.

I'm in a similar boat, and that repo looks very useful! Thanks.

If you're at all into Python or R I'd suggest Jose Portilla's courses on Udemy as well, they're very good at covering the applied aspects.

For practical use cases, machine learning is actually easier than what these blog posts imply. One of my pet peeves with some marketing thought pieces on Machine Learning/AI is that they deliberately abstract so much of the process in order to sound smarter. (the original submission did a good job in explaining the terminology/code, however)

ML workflows can nowadays be simplified to a few lines of code, which is one of the reasons I now open-source my code in Jupyter/R Notebooks to help facilitate transparency.

The accessibility of machine learning does not affect its impact on society. It already has had a big impact on society whether people realize it or not. It is used for countless tasks that people take for granted (spam filter, facial recognition, translation, etc.).

We live in a society where people are completely dependent on technologies that they know nothing about. The average person who uses a computer or a smartphone wouldn't even be able to tell you how to represent the number 2 in binary yet these devices have a huge impact on their lives. Whether this is a good thing or a bad thing I'm not sure since it is no longer possible to understand every aspect of technology at our current (and accelerating) level of progress.

> The accessibility of [cars] does not affect its impact on society.

I would think the accessibility of something is directly related to its impact on society most of the time. There is no doubt ml has already had an impact, but putting it in the hands of non-technical users opens up possibilities that the technical crowd is incapable of imagining.

I'm not sure I understand your 'cars' comment. The vast majority of people who drive cars have no idea how an engine works yet they are completely dependent on them.

'How' something works doesn't affect its impact on society; the more important consideration is 'what' it does.

My original point was that machine learning is not something that the average person can apply in his or her own business yet. Yes, big tech companies use it to improve their operations, but a small business owner can't yet define what would even be an applicable ML problem for their business. Millions of small businesses have websites, and even their own custom software, yet ML is still out of reach for many of these businesses. This is where I think a huge impact can be, but has yet to be, made.

I completely agree with you. It just seemed like your initial comment was dismissing the effect/impact of ML because not everyone has the time to learn how to implement it.

Machine Learning only works if you have data that it can use. Non-technical people are unlikely to be in possession of this data. But I still agree with you.

This is starting to get at what I'm trying to express... non-technical people don't even know what ML is capable of, how to define what could be an ML problem, or how to go about acquiring the tools to solve said problem. Once that information is made easily available to, at the very least, the average programmer, then I think ML will have a much larger impact than we can imagine.

I think nutonian.com had cracked that (machine learning for non-programmers) quite well.

>> CART showed slightly better results for this task but its main advantage is that it gives you a decision tree that is easy to understand, explain and implement vs SVM or Random Forests that work like a black box and require using heavy ML libraries in production.

I don't get that. Random Forests are ensembles of decision trees and can be used for regression. I've not used the R CART library, so I might be missing something, but the interpretability of Random Forests should be the same as that of, well, any other model built by a decision-tree learner (i.e. just as unreadable as any other classifier's model, once their parameters become sufficiently many).

I got lost at the cluster analysis. Can someone explain that in more detail? (Maybe Sergey?)

There are many approaches (https://en.wikipedia.org/wiki/Cluster_analysis). In the research, k-means clustering was used. It's probably not the best clustering algorithm for this task, and much better results would be achieved with a different one or if some data manipulation were applied prior to the clustering. Here's a k-means clustering explanation if you're interested: https://youtu.be/_aWzGGNrcic

They are trying to predict what types of html pages cause parsing problems. They have a set of variables that describe an html page. They are trying to figure out which variable or combination of variables predict a bad parsing outcome.

Think of cluster analysis as simply plotting those variables as points on a graph. Then drawing a circle around points that are close to each other.
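The "draw a circle around nearby points" picture corresponds roughly to k-means, which the research used. Below is a minimal sketch of the plain algorithm on synthetic 2D points; the two blobs and their positions are invented for the example, and this is not the article's code.

```python
import random

random.seed(4)

def squared_distance(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm); returns (centroids, labels)."""
    centroids = random.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Two synthetic blobs of points, around (0, 0) and (10, 10).
blob_a = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
blob_b = [(random.gauss(10, 1), random.gauss(10, 1)) for _ in range(100)]
centroids, labels = kmeans(blob_a + blob_b, k=2)
print(sorted(centroids))
```

The algorithm has to be told how many clusters to look for, and it measures closeness with raw distances, so features on very different scales usually need normalizing first.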

What part? :)

Are we now going to call every statistical technique "Machine Learning"?

Nope, we're switching to "AI" these days.

Only if done using a machine.

No, nene no. Statistical techniques start from hypotheses and are validated by tests and confidence intervals.

Here we basically don't care what our chances of getting it wrong are. It's a marginal feature in some email client, not quality control in a brewery.

I kinda think you're being down voted because people don't know the origin of the t test. I thought it was funny.

No, that's just Machine Drunk Brawling, aka. Sophisticated Stupidity.

Certainly Data Mining and applied statistics

"Everyday tasks," I was picturing making coffee, driving my car...

this is strange -- why is there a little bit of Python in the middle of an R implementation/prototyping?

While the research and prototyping were done in R the service itself is in python. The implementation is so simple that it doesn't require any specific ML libs. More complex tasks will probably require using scikit (if desired to stay on python ground) or plugging in R in one way or another, etc.

so the whole thing was just an exercise in choosing a cut point? It would seem like you might want to revisit this sort of thing every now and then, since web "frameworks", your parsing code, and CPU hardware all change over time.

you're absolutely right!

And that's why you PCA before clustering (for normalization).
