In this case, the formula was (time ~ content length + number of tags). Except, by construction, content length is positively correlated with the number of tags! And since parsing tags is the primary purpose of HTML scrapers, it's likely that content length is completely redundant, or at best a poor predictor of time (which was the conclusion). If you want to estimate the magnitude/significance of an explanatory variable, a linear regression is more than sufficient, and R has great tools for that.
The quantile plot revealed a skew in processing time; presumably the number of tags has a similar skew.
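For anyone who wants to see what that looks like, here's a small sketch using scipy (the lognormal stand-in for "processing times" is my assumption, just to produce a right-skewed variable). A normal Q-Q plot of skewed data bends away from the reference line in the upper tail:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical right-skewed processing times (lognormal as a stand-in).
times = rng.lognormal(mean=3.0, sigma=0.8, size=2000)

print("sample skewness:", stats.skew(times))  # positive => right skew
# probplot computes the quantile-quantile points against a normal;
# for skewed data the upper-tail points drift above the fitted line.
(osm, osr), (slope, intercept, r) = stats.probplot(times, dist="norm")
```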
For example: time-series econometrics is a whole field dedicated to trying to squeeze iid information out of data that isn't iid. So you can use regression.
But they're receptive to my message because they were repetitively warned that curve fitting was not regression analysis. Somehow this is lost in our "data" culture.
I'm not sure I understand your last sentence here: so can you, or can't you, use regression for non-iid data? It would seem that you cannot.
If you do not have iid errors, any interpretation of the model will be skewed.
There are, however, heteroskedasticity-robust interpretations of regressions.
But they simply weren't interested in even considering it. They just do not believe that machine learning can outperform their current process.
> Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge's predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors.
For example, using Jensen's inequality, I was able to prove that a unit-weighted linear model will get the same result as the optimal linear model 75% of the time on average, provided the features are independent Gaussians.
I haven't figured out how to prove the same result for binary features, but simulations suggest it holds there as well.
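For readers who want to poke at this themselves, here's a rough Monte Carlo sketch of the flavor of the claim (not the proof; the weight distribution, sample sizes, and evaluation metric are all my own assumptions). It compares out-of-sample correlation with the target for OLS-fitted weights versus all-ones weights, on independent Gaussian features:

```python
import numpy as np

rng = np.random.default_rng(3)

def compare_once(n_train=100, n_test=1000, k=5):
    # Hypothetical setup: positive true weights, independent standard
    # normal features, unit-variance noise.
    w = rng.uniform(0.5, 1.5, k)
    def make(n):
        X = rng.normal(size=(n, k))
        return X, X @ w + rng.normal(size=n)
    Xtr, ytr = make(n_train)
    Xte, yte = make(n_test)
    beta, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)  # fitted "optimal" weights
    r_ols = np.corrcoef(Xte @ beta, yte)[0, 1]
    r_unit = np.corrcoef(Xte.sum(axis=1), yte)[0, 1]  # all weights = 1
    return r_ols, r_unit

results = np.array([compare_once() for _ in range(200)])
print("mean r (OLS, unit):", results.mean(axis=0))
```

In runs like this the unit-weighted model tends to land remarkably close to the fitted one, which is the Dawes "improper linear models" point in miniature.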
This is such an important result & yet doesn't seem to be a widely known result. Your proof is very well done. You should consider publishing it in CMJ. Or if you don't mind, I'd be happy to rework it & send it in :) Thanks for writing it up.
As a paper it would be significantly stronger if it did the binary case, though.
Instead, he had to do a series of best-fit lines in the various possible 2D projections. At the end, he showed the boss that a linear regression gave the same result, and the boss was still unconvinced.
edit: Also his definition of "Big Data" seems to be "more data than can be printed on an A4 piece of paper"
I'm joking of course, but I'm also quite curious: in which sector(s) does this seem to happen the most? My clients have been wary of ML hype, actually.
Right here is where a startup could be born...
Just to understand: when you said "ridiculous", did you mean "a few" or "a lot"?
suggests it meant "a lot"
It could be a great idea - to plot the data on 2 axes (x - html length or html size, y - processing time, if I understand this correctly). It's simple and elegant. I'll try that. It could be, though, that the chart will get messy with all these data points. One of the reasons I like the percentiles approach is that it makes it clear that there is a trade-off between message processing time and the number / percentage of messages we can process.
The OP likely made that comment because plotting the data is usually done before any fancy machine learning, as part of exploratory data analysis (especially with a low number of variables!).
It would also be useful for mail readers which understand conversations, converting them to message-like bubble format.
The best features? Look for classic html mistakes on the page. Still using font tags, do they use gifs instead of png, did they include a keywords meta tag, did they specify an encoding or are there windows-1252 characters present anywhere, etc. I came up with 20 or so signs of bad html elsewhere on the page, and collectively those features were much more predictive than any of the content itself.
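As a rough illustration of what extracting those signals might look like (this is my own hypothetical sketch covering a small subset of the ~20 signs mentioned, not the original feature set):

```python
import re

def bad_html_features(html: str) -> dict:
    """A few hypothetical 'classic HTML mistakes' as boolean features."""
    lower = html.lower()
    return {
        "font_tag":      "<font" in lower,                  # deprecated <font>
        "gif_images":    ".gif" in lower,                   # gifs instead of png
        "keywords_meta": 'name="keywords"' in lower,        # keywords meta tag
        "no_charset":    "charset" not in lower,            # no declared encoding
        # Curly quotes that often betray pasted windows-1252 content.
        "win1252_chars": bool(re.search(r"[\u2018\u2019\u201c\u201d]", html)),
    }

page = '<html><font size="2">Hello</font><img src="pic.gif"></html>'
feats = bad_html_features(page)
```

Each boolean becomes one column in the feature matrix fed to whatever classifier sits downstream.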
So far I've been cobbling together a learning regimen by combining Coursera, Khan Academy, and a book called "Machine Learning for Programmers". Individually they all eventually brush over some key concepts, which makes it difficult to get an intuition for what's happening. But combined, they fill in the gaps. Also, this repo has been like the Rosetta Stone for someone who doesn't have a single maths qualification, not even a GCSE.
If you're at all into Python or R I'd suggest Jose Portilla's courses on Udemy as well, they're very good at covering the applied aspects.
ML workflows can nowadays be reduced to a few lines of code, which is one of the reasons I now open-source my code as Jupyter/R notebooks to help facilitate transparency.
We live in a society where people are completely dependent on technologies that they know nothing about. The average person who uses a computer or a smartphone wouldn't even be able to tell you how to represent the number 2 in binary yet these devices have a huge impact on their lives. Whether this is a good thing or a bad thing I'm not sure since it is no longer possible to understand every aspect of technology at our current (and accelerating) level of progress.
I would think the accessibility of something is directly related to its impact on society most of the time. There is no doubt ml has already had an impact, but putting it in the hands of non-technical users opens up possibilities that the technical crowd is incapable of imagining.
'How' something works doesn't affect its impact on society; the more important consideration is 'what' it does.
I don't get that. Random Forests are ensembles of decision trees and can be used for regression. I've not used the R CART library, so I might be missing something, but the interpretability of Random Forests should be the same as that of, well, any other model built by a decision-tree learner (i.e. just as unreadable as any other classifier's model, once their parameters become sufficiently many).
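To the point that Random Forests do regression: here's a minimal scikit-learn sketch (synthetic data and the particular target function are my assumptions, just to have something to fit). The forest averages the predictions of its many regression trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 500)

# An ensemble of regression trees; predictions are averaged across trees.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = model.predict([[1.0, 1.0]])
```

Each individual tree in `model.estimators_` is inspectable on its own, which is the interpretability point: one tree is readable, a hundred of them together is not.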
Think of cluster analysis as simply plotting those variables as points on a graph. Then drawing a circle around points that are close to each other.
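The mental model above maps almost directly onto k-means, which literally assigns each point to the nearest of k centers. A minimal sketch with scikit-learn (the two synthetic blobs are my own stand-in data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
# Two hypothetical blobs of points on a 2-D "graph".
a = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))
b = rng.normal(loc=[5, 5], scale=0.5, size=(100, 2))
X = np.vstack([a, b])

# k-means "draws the circles": each point joins the nearest of k centers.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # roughly (0, 0) and (5, 5)
```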
Here we basically don't care what our chances are of getting it wrong. It's a marginal feature in some email client, not quality control in a brewery.