And this is not unique to machine learning, per se. https://fivethirtyeight.com/features/trump-noncitizen-voters... has a great widget that shows that as you get more data, you do not necessarily decrease inherent noise. In fact, it stays essentially constant. (Granted, this is in large part because machine learning has most of its roots in statistics.)
More explicitly, with ML, you are building probabilistic models. This contrasts with the analytic models most folks are used to. That is, with an analytic model, you run the calculations for an object moving across the field and you get something within the measurement bounds you expected. With a probabilistic model, you get something that is consistent, within some bounds, with the data you have previously collected.
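To make the contrast concrete, here's a minimal sketch (made-up numbers, nothing from the article): the analytic model returns one number from the physics, while a probabilistic model fit to previously collected data returns a prediction with an uncertainty band.

    import numpy as np

    # Analytic model: plug the physics in, get a single number out.
    # x(t) = x0 + v*t for an object moving across the field.
    x0, v, t = 0.0, 3.0, 2.0
    print("analytic prediction:", x0 + v * t)  # exactly 6.0, up to measurement error

    # Probabilistic model: fit previously collected (noisy) observations and
    # report a prediction *with* uncertainty rather than a single exact value.
    rng = np.random.default_rng(0)
    times = np.linspace(0, 5, 50)
    observed = x0 + v * times + rng.normal(0, 0.5, size=times.shape)

    slope, intercept = np.polyfit(times, observed, 1)
    residual_sd = np.std(observed - (intercept + slope * times))
    print(f"probabilistic prediction: {intercept + slope * t:.2f} +/- {2 * residual_sd:.2f}")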
(None of this is to say this is a bad article. Just a bias to keep in mind as you are reading it. Hopefully, it helps you challenge it.)
It's a very good article, though in the context of deciding how many variables should be in a model of some complex phenomenon, this example is a little tougher to wrap your head around. It's not quite a predictive model, but there were some variables left out. A naive model I suppose is "this data is generated by infallible respondents", whereas a better model would incorporate that error rate. There isn't as much of a question about which pieces of information are relevant, though, like you might encounter when trying to predict future drug use from household income, race, age, number of books read as a child, number of pets, and so on.
Here's Andrew, the author of that blog post:
> Yes, I think it makes a lot of sense to criticize particular frequentist or Bayesian methods rather than to criticize freq or Bayes statisticians.
Which is exactly what I did. There are times when frequentist methods are effective. I just wouldn't use them to tell me that I have a disease.
illegals labeled citizens: 0.01
For sufficiently well-behaved signals, the estimator of the strength of various frequency components (i.e. the Fourier transform) is pretty stable as one enlarges the window of measurement, which is post-hoc validation that the signal is well-behaved.
This might not be true for very weird signals, and enlarging the window of measurement might significantly change the model parameters--meaning that one might need a non-parametric model to get by (enlarging the number of parameters with the number of measurements) rather than restricting to a finite number of frequency components. E.g., suppose the first 100 measurements show a nice sinusoid of some frequency, but the next 100 measurements show a pretty much flat signal. Then you are forced to revise downward the parameter corresponding to the importance of the sinusoid component and increase the importance of the zero-frequency component. The thing is, one never truly knows how the signal is going to behave in the future. Any parametric model is a bias that things won't get too complicated.
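Here's a toy numpy sketch of that scenario (my own made-up signal): the estimated strength of the sinusoid component changes substantially once the window includes the flat stretch.

    import numpy as np

    # Hypothetical "weird" signal: a sinusoid for the first 100 samples, then flat.
    n, f = 100, 5
    t = np.arange(2 * n)
    signal = np.where(t < n, np.sin(2 * np.pi * f * t / n), 0.0)

    def component_strength(x, freq, period):
        # Magnitude of the frequency-`freq` (cycles per `period` samples) component.
        spectrum = np.abs(np.fft.rfft(x)) / len(x)
        return spectrum[int(freq * len(x) / period)]

    print("first 100 samples:", component_strength(signal[:n], f, n))  # ~0.5
    print("all 200 samples:  ", component_strength(signal, f, n))      # roughly halved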
Another closely related thing is the No Free Lunch Theorem.
These are concepts I believe are very important to internalize if you work with machine learning. Fundamentally we are making predictions (i.e., guessing) about the nature of entirely unknown information. So there is a certain inherent impossibility to the task in the general sense. It shouldn't always work.
So, if anything, I only meant they were related in that they are both good targets to internalize when working with ML. If there are better targets, I'm definitely interested in learning more.
Introduction, Regression/Classification, Cost Functions, and Gradient Descent:
Perceptrons, Logistic Regression, and SVMs:
Neural networks & Backpropagation:
If there existed an underlying model with a finite number of parameters, then you could potentially find it with a finite number of measurements. But if there is no underlying model, the only viable approach is non-parametric estimation. If you try to truncate a non-parametric model to a finite/fixed-dimensional model, you are introducing a strong source of bias (the "bias" of assuming that there actually exists a not-crazily-complicated underlying model).
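A small sketch of that kind of bias, using made-up bimodal data: a fixed two-parameter Gaussian puts a lot of density where there is no data, while a non-parametric kernel density estimate (whose effective complexity grows with the sample) follows the actual shape.

    import numpy as np
    from scipy.stats import norm, gaussian_kde

    # Hypothetical data: the "true" distribution is bimodal, so it is not in the
    # two-parameter Gaussian family at all.
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

    # Fixed-dimensional model: a single Gaussian (mean, sd). Strongly biased here.
    mu, sd = data.mean(), data.std()
    print("single-Gaussian density at 0:", norm.pdf(0, mu, sd))  # ~0.19, but no data there

    # Non-parametric estimate: no commitment to a parametric form.
    kde = gaussian_kde(data)
    print("KDE density at 0:            ", kde(0.0)[0])          # close to zero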
A simple model can be "robust" to errors because the errors tend to cancel out.
More complex models are typically more sensitive to slight changes in the input (and therefore noise), so even if your model is exactly correct, small errors in the input can be amplified yielding large errors in the predictions.
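A quick illustration with made-up data: the same small measurement noise barely moves a straight-line fit but can move a high-degree polynomial fit by an order of magnitude more.

    import numpy as np

    # Hypothetical setup: the true relationship is a straight line.
    rng = np.random.default_rng(2)
    x = np.linspace(0, 1, 12)
    y_clean = 2 * x + 1
    y_noisy = y_clean + rng.normal(0, 0.05, size=x.shape)  # small errors in the input

    x_test = 0.96  # an arbitrary point to predict at
    for degree in (1, 11):
        p_clean = np.polyval(np.polyfit(x, y_clean, degree), x_test)
        p_noisy = np.polyval(np.polyfit(x, y_noisy, degree), x_test)
        print(f"degree {degree:2d}: prediction shift caused by noise = {abs(p_noisy - p_clean):.3f}")
    # The degree-1 fit barely moves; the degree-11 fit amplifies the same noise,
    # even though it matches the training points more closely.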
Or it can be preferable because it tends to fail in certain characteristic ways.
For example, Newtonian mechanics breaks down at both the high and low mass/energy scales, but it's usually possible to determine whether you'll have to employ GR or QM (and perhaps despair if both are required simultaneously).
Also, it's much easier to test simple models (and typically also to train them if we're talking ML).
Furthermore, if your first approximation is well-understood, then you can complicate it to address new data.
Simplicity has many virtues; just because we can employ more complicated methods now doesn't mean that we should, or that the sole reason previous generations of scholars opted for simple models is because they couldn't perform the calculations needed by more complex ones.
In general we can't expect to know the underlying truth with certainty, but maybe a question is so important and so well investigated that it merits a seriously convoluted answer.
For less well-studied or relatively unimportant problems, a model that is obviously wrong but easy to use is usually preferable.
It's called the bias-variance trade-off for a reason (although you should still try to get a good deal, which in this case means minimizing error or perhaps risk).
0. This point can be particularly subtle-- we generally have more computing power available than (quality) data.
I can try millions of variations on a given hypothesis for a small data set relatively quickly and find a handful of results with publishable p-values.
This is how you get replication crises.
Even if one of my hypotheses is invalidated, I can generate many more that are still consistent with the data.
Take string theory for example-- a naive approach to testing the various parameterizations/permutations of the basic idea would run out of time before the heat death of the universe, so absent theoretical work constraining the possibility space, it would be effectively unfalsifiable.
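For what it's worth, the p-value point is easy to demonstrate with a few lines of simulation (pure noise, many hypotheses; any "hits" are false positives by construction):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n_subjects, n_hypotheses = 30, 1000

    outcome = rng.normal(size=n_subjects)                      # the thing we "explain"
    predictors = rng.normal(size=(n_hypotheses, n_subjects))   # unrelated variables

    significant = 0
    for pred in predictors:
        r, p = stats.pearsonr(pred, outcome)
        if p < 0.05:
            significant += 1
    print("'publishable' results at p < 0.05:", significant)   # ~50 expected, all spurious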
I guess as long as the users' expectations are correct it can be useful in some very specific areas. Referencing the AlphaGo games last year: I was a Go player for more than a decade, and yet AlphaGo's weird moves inspired new insights that break the conventional structure / thinking framework of a Go player. From that angle, I do think that even though DL is somewhat of a black box, humans can pick up new insights, because it explores areas that would normally seem ridiculous for a human with 'common sense' to explore.
With AlphaGo there are several things you have to consider. It is not only using deep learning; it also uses Monte Carlo tree search, an algorithm that is good at exploring search spaces like games. But the key factor in AlphaGo was deep learning evaluating the states.
What I want to say is that neural networks are not meant to explore but to discover patterns. That is something very different. They are very interesting because they work like our brains (discovering and completing patterns), but they are as bad at searching as we are.
I see.. Thanks for sharing. I assume you mean evaluating game board states, finding valuations objectively and figuring out the pattern of where to move that can lead to a higher probability of winning? 20 years ago I made an Othello game that used a search tree (I think configurable from 3 to 9 levels deep), assigned valuations to different board positions, and let the computer assume that the player would make the best move to his advantage. It turned out it could beat human players easily, but that is such a small search space and the valuation weights are very clear. Thus I was amazed at the computer winning on such a large game board, where even for professional players it is arguable which state is the better state. EDIT - ok, I just googled Monte Carlo tree search; I must have accidentally implemented that search at the time, based on common sense and what I could do with the programming language I had back then (Visual Basic).
I find it very interesting how some trading algorithms are using deep learning to extract features from news, apart from the variables you mentioned.
About the search, in case you are interested: you have basic tree search, then you have A* (A-star), which uses a heuristic to decide which node to expand next (widely used in pathfinding in games, where the heuristic is the Euclidean distance to the target point). For games you use a search where in one state you maximize the heuristic (your move) and in the next one you minimize it (your opponent's move) (sorry, I don't remember the name). And what Monte Carlo tree search does is not evaluate some of the intermediate states: it just makes some random moves and evaluates the final state, which improves the exploration/exploitation balance of the search (and works pretty well).
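If it helps, here's a toy side-by-side sketch (a made-up Nim game, nothing to do with AlphaGo's actual code): an exact game-tree evaluation versus the random-rollout evaluation that Monte Carlo tree search builds on.

    import random

    # Toy game: a pile of stones, players alternate removing 1-3, and whoever
    # takes the last stone wins.
    def moves(stones):
        return [m for m in (1, 2, 3) if m <= stones]

    def negamax(stones):
        # Exact value for the player about to move: +1 forced win, -1 forced loss.
        if stones == 0:
            return -1  # the previous player took the last stone; we lost
        return max(-negamax(stones - m) for m in moves(stones))

    def rollout_value(stones, n_rollouts=500):
        # The Monte Carlo idea: don't evaluate intermediate states, just play
        # random moves to the end and average the final outcomes.
        total = 0
        for _ in range(n_rollouts):
            s, sign = stones, 1          # sign = +1 while it's "our" turn
            while s > 0:
                s -= random.choice(moves(s))
                sign = -sign
            total += -sign               # the player who just moved won
        return total / n_rollouts

    pile = 7
    print("exact game-tree value:", negamax(pile))        # +1: a forced win exists
    print("random-rollout value :", rollout_value(pile))  # cheap, noisy estimate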
I'd call that quantitative analysis, not technical analysis. The difference between them is the difference between astronomy and astrology. Technical analysis refers to classic trading strategies based on visual patterns in the price charts like double top or bottom, head and shoulders, etc.; these patterns and their traders suffer from massive hindsight bias. Feeding prices into computers looking for patterns is not technical analysis unless you've programmed the computer to look for said human-found patterns. If the computer is actually searching for real patterns and you're testing them properly with forward tests on fresh data, you're doing quantitative analysis, not technical analysis.
TA is practiced by manual traders; quants are generally automated traders, or traders doing proper statistics rather than relying on the visual patterns manual traders think they see.
Not to say that people don't, but as mentioned, they don't last. How long they do last is an even more difficult prediction problem.
Look at the ImageNet ILSVRC competition. Hand-built models can't approach SoTA results built with CNNs.
ML is mainly interested in prediction (correlation instead of causation), typically over some data that just fell in your lap.
The first case is where data science has got a bad name; people swing into domains and companies full of cocksure ideas, produce insights that are risible or obvious and get ejected. Sometimes it takes years for sufficient knowledge to be acquired by analysts to deal with difficult domains.
Lots of people use Bayesian inference to do the second. Tools like Stan and PyMC3 are really popular and effective.
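For instance, a minimal PyMC3 sketch along those lines (entirely made-up data; the variable names are just placeholders), estimating a treatment effect with full posterior uncertainty rather than a point estimate:

    import numpy as np
    import pymc3 as pm

    # Hypothetical question: does a treatment shift an outcome, and by how much?
    rng = np.random.default_rng(4)
    control = rng.normal(1.0, 1.0, size=50)
    treated = rng.normal(1.2, 1.0, size=50)

    with pm.Model():
        mu_control = pm.Normal("mu_control", mu=0, sigma=10)
        effect = pm.Normal("effect", mu=0, sigma=10)
        sigma = pm.HalfNormal("sigma", sigma=10)
        pm.Normal("obs_control", mu=mu_control, sigma=sigma, observed=control)
        pm.Normal("obs_treated", mu=mu_control + effect, sigma=sigma, observed=treated)
        trace = pm.sample(1000, tune=1000, cores=1)

    # The full posterior for "effect" says how strongly the data supports a shift,
    # not just a point estimate.
    print("effect:", trace["effect"].mean(), "+/-", trace["effect"].std())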
I invite you to read Chris Bishop's or Stephen Muggleton's books.
Anyone who works with data will find it hard to imagine data that "fell into your lap"; all the data I've ever used successfully required slogging and grinding.
"fell into your lap" has nothing to do with how hard the work is, it's about the difference between a controlled experiment and an observational study. the bulk of ML is observational in nature (focusing on prediction) and therefore has nothing to say about causation or understanding the causal variables of the underlying reality.
One big deal is applications to dynamic domains.
Regardless, that post was a great read.
You should never, ever extrapolate. It doesn't matter what your model is, it won't work.
On a side note, it could be that there is a breakpoint at Magnitude 7.25, where the slope of the line really changes, and a segmented linear regression is appropriate (https://en.wikipedia.org/wiki/Segmented_regression). But we would need more data to be sure, anyway.
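A rough sketch of what that would look like (made-up data with a slope change at 7.25, breakpoint found by brute-force least squares):

    import numpy as np

    rng = np.random.default_rng(5)
    mag = np.linspace(4, 9, 80)
    # Made-up data: slope -1.0 below magnitude 7.25, slope -2.5 above it.
    log_freq = 10 - 1.0 * mag - 1.5 * np.maximum(0, mag - 7.25)
    log_freq = log_freq + rng.normal(0, 0.2, size=mag.shape)

    def segmented_sse(breakpoint):
        # Continuous piecewise-linear basis: intercept, slope, extra slope past the break.
        X = np.column_stack([np.ones_like(mag), mag, np.maximum(0, mag - breakpoint)])
        coef = np.linalg.lstsq(X, log_freq, rcond=None)[0]
        return np.sum((X @ coef - log_freq) ** 2)

    candidates = np.linspace(5, 8.5, 200)
    best = min(candidates, key=segmented_sse)
    print("estimated breakpoint:", round(best, 2))  # should land near 7.25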
But a sensible thing to do would be to draw many samples from the posterior distribution, instead of just using the maximum likelihood estimate. That way the prediction accurately represents the uncertainty resulting from not having any data above magnitude 8, as well as, perhaps, your background knowledge that earthquakes of magnitude 15 never happen.
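In numpy terms, that might look something like this (toy data standing in for the earthquake counts, flat priors, noise level assumed known): draw posterior samples of the fitted line and look at how the prediction spread widens once you move past the data.

    import numpy as np

    rng = np.random.default_rng(6)
    mag = rng.uniform(4, 8, 60)                         # no observations above magnitude 8
    log_freq = 9 - 1.0 * mag + rng.normal(0, 0.3, 60)   # made-up stand-in data

    X = np.column_stack([np.ones_like(mag), mag])
    sigma = 0.3                                         # assume the noise level is known
    beta_hat = np.linalg.lstsq(X, log_freq, rcond=None)[0]
    cov = sigma ** 2 * np.linalg.inv(X.T @ X)           # posterior covariance, flat prior

    samples = rng.multivariate_normal(beta_hat, cov, size=2000)
    for m in (6.0, 9.5):                                # interpolation vs extrapolation
        preds = samples @ np.array([1.0, m])
        print(f"magnitude {m}: {preds.mean():.2f} +/- {2 * preds.std():.2f}")
    # The spread at 9.5 is clearly wider: the posterior reflects the lack of data
    # above magnitude 8 (it still won't encode "magnitude 15 never happens";
    # that needs an informative prior).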
I was also curious about how the data in the past few years did not follow the same trend as before. Does anyone know if that is what geologists mean by being 'overdue' for an earthquake? Like California supposedly has been for a while?
Here's another plot, this time of UK seismic frequency, where again the frequency of high-magnitude earthquakes seems 'under' the expected curve. Then again, these are just 2 plots...
Most of the time there is no a priori way of determining this.
You come to the problem with your own assumptions (or you inherit them), and that guides you (or misguides you).
Even better, you can put priors on the parameters of your model and give it the full Bayesian treatment via MCMC. This avoids overfitting, and gives you information about how strongly your data specifies the model.
It is far too frequently misunderstood as the science of making certainty from uncertainty.