What maths must be understood to enable pursuit of either of the above fields? Are there any seminal texts/courses/content which should be consumed before starting?
You absolutely need a solid grounding in multi-variable calculus, linear algebra, probability theory and information theory. It will also be helpful to be well versed in graph theory.
In my opinion one of the best starting points is "Information Theory, Inference and Learning Algorithms" by David MacKay. It's a bit long in the tooth now, but it is still one of the most approachable and well written books in the field.
Another old book that stands up very well is "Probability Theory: the Logic of Science" by E. T. Jaynes.
"Elements of Statistical Learning" by Tibshirani is also good.
"Bayesian Data Analysis" by Andrew Gelman is another great read.
"Deep Learning" by Ian Goodfellow and Yoshua Bengio is useful for getting caught up with recent advances in that field.
I'm not super interested in ML but I am very interested in applied mathematics in computer science. I've got a fair bit of linear algebra due to cryptography, but have had virtually no need of any form of calculus (unless I'm relying on it without knowing it) in my career.
So beyond just saying that you'd need grounding in multivariable calculus to do serious ML work, I would be super interested in hearing more about why that is and what kinds of problems crop up in ML that demand it.
Calculus essentially discusses how things change smoothly and it has a very nice mechanism for talking about smooth changes algebraically.
A system which is at an optimum will, at that exact point, be no longer increasing or decreasing: a metal sheet balanced at the peak of a hill rests flat.
Many problems in ML are optimization problems: given some set of constraints, what choice of unknown parameters minimizes error? This can be very hard (NP-hard) in general, but if you design your situation to be "smooth" then you can use calculus and its very nice set of algebraic solutions.
You also need multivariate calculus because, while you're typically only trying to minimize "error", you do so by changing many, many parameters at once. This means that you've got to talk about smooth changes in a high-dimensional space.
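To make that concrete, here's a minimal numpy sketch (all numbers and names illustrative) of gradient descent minimizing a squared error over several parameters at once; the gradient is exactly the multivariate-calculus object in question, the direction of steepest increase of the error in 5-dimensional parameter space:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                 # 100 examples, 5 parameters
    true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
    y = X @ true_w + 0.1 * rng.normal(size=100)   # noisy observations

    w = np.zeros(5)                               # unknown parameters
    lr = 0.01                                     # step size
    for _ in range(2000):
        grad = 2 * X.T @ (X @ w - y) / len(y)     # gradient of mean squared error
        w -= lr * grad                            # step "downhill"

    print(w)                                      # close to true_w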
--
The other side of calculus is integration which talks about "measuring" how big things are. Most of probability is discussing very generalized ratios: of the total, "how big is this piece" is analogous to "what are the odds this will happen".
The general discussion of measure is complex and essentially the only tool to tackle it involves gigantic (infinite, really) sums of small, well-behaved pieces to form a complex whole.
It just happens to turn out (and this is the big secret of calculus) that this machinery (integration) is dual to the study of smooth changes and you can knock them both out together.
--
So ultimately, ML hinges upon being able to measure things (integration) and talk about how they change (differentiation). Those two happen to be the same concept in a way, and they are essentially what you study in calculus.
A lot of probability theory requires it. For instance, ML is largely framed mathematically as a series of optimisation problems, which are then solved by finding the gradient and performing gradient descent; this requires elementary calculus to calculate the gradient.
Additionally, if you want to calculate a probability given a density function, or evaluate an expectation, you need to calculate several integrals. This arises quite often in the theoretical sections of ML papers/textbooks.
The use of calculus in ML is probably similar to the use of number theory in crypto- you can do applied work fine without it, but you understand the work a lot better by knowing the math, and are less likely to make dumb mistakes.
Most of ML is fitting models to data. To fit a model you minimise some error measure as a function of its real valued parameters, e.g. the weights of the connections in a neural network. The algorithms to do the minimisation are based on gradient descent, which depends on derivatives, i.e. differential calculus.
If you're doing Bayesian inference you're going to need integral calculus because Bayes' law gives the posterior distribution as an integral.
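As a toy illustration of that integral (a sketch; the coin-bias example and the grid approximation are standard, and the grid is just numerical integration):

    import numpy as np

    theta = np.linspace(0.001, 0.999, 999)     # grid over a coin's bias
    dtheta = theta[1] - theta[0]
    prior = np.ones_like(theta)                # uniform prior
    likelihood = theta**7 * (1 - theta)**3     # observed 7 heads, 3 tails

    unnorm = prior * likelihood                # numerator of Bayes' law
    posterior = unnorm / (unnorm.sum() * dtheta)   # denominator is an integral

    post_mean = (theta * posterior).sum() * dtheta # expectations are integrals too
    print(post_mean)                               # ~0.667, the Beta(8, 4) mean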
For ML you just need Calculus 1 and 2. Curl/div and Stokes' theorem are Calculus 3, which is a physics thing. You don't need that for ML.
You may need the basics of functional analysis in certain areas of ML, which is arguably Calculus 4.
> Most of ML is fitting models to data. To fit a model you minimise some error measure as a function of its real valued parameters, e.g. the weights of the connections in a neural network. The algorithms to do the minimisation are based on gradient descent, which depends on derivatives, i.e. differential calculus.
> If you're doing Bayesian inference you're going to need integral calculus because Bayes' law gives the posterior distribution as an integral.
The most obvious thing is understanding back-propagation. Backprop is pretty much all partial derivatives / chain rule manipulations. Also a lot of machine learning involves convex optimization which entails some calculus.
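For instance, here is backprop written out by hand for a one-hidden-unit network; every line of the backward pass is just the chain rule (a sketch with made-up numbers):

    import numpy as np

    x, y = 2.0, 1.0           # one input, one target
    w1, w2 = 0.5, -0.3        # two weights

    # forward pass
    h = np.tanh(w1 * x)       # hidden activation
    yhat = w2 * h             # prediction
    loss = 0.5 * (yhat - y)**2

    # backward pass: partial derivatives via the chain rule
    dloss_dyhat = yhat - y
    dloss_dw2 = dloss_dyhat * h
    dloss_dh = dloss_dyhat * w2
    dloss_dw1 = dloss_dh * (1 - h**2) * x   # tanh'(z) = 1 - tanh(z)^2

    print(dloss_dw1, dloss_dw2)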
Much of ML is optimization. This is linked to calculus by derivatives. There is the simple part that at a minimum or maximum the derivative is 0. However, more relevance comes from gradient descent. This depends very heavily on calculating derivatives, and it's one of the most universal fast optimization methods.
Beyond that, for iterative methods, convergence is a matter of limits. This again is calculus. Formulating iteration as repeatedly applying a function, we converge locally to a fixed point of that function when the magnitude of the derivative at that fixed point is less than 1. Again derivatives come in.
Finally, for error estimation, Taylor expansions are often useful. Again, the topic here is calculus. Notably, all I can think of regards limits and derivatives, not integrals. That might just be due to my hatred of integrals though.
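A concrete sketch of the fixed-point claim: cos has a fixed point near 0.739, where |cos'(x)| = |sin(x)| is roughly 0.67 < 1, so the iteration converges:

    import math

    x = 1.0
    for _ in range(100):
        x = math.cos(x)        # repeatedly apply the function

    print(x)                   # ~0.7390851, the fixed point of cos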
I have a pretty good math background, but understanding K-L divergence ([0], a measure of the difference between two probability distributions) required revisiting some calculus. It's needed for understanding models with probabilistic output, used in both generative models and reinforcement learning.
Almost every corner of an ML problem has an optimization problem that needs to be solved: There is a function that you want to minimize subject to constraints. Typically these are everywhere smooth, or sometimes almost everywhere smooth. So calculus shows up in (i) algorithms to find the bottom of these functions (if they exist) or (ii) deriving the location of the minima in closed form. These functions would be "how close am I to the correct parameter", "What losses would these settings rake up on average" etc etc.
The reason why this differs from a purely optimization / mathematical programming problem is that we can only approximately evaluate the actual function (the performance of our model on new / unseen data) that we care to optimize. Great optimization algorithms need not be (and often are not) good ML algorithms. In ML we have to optimize a function that's getting revealed to us slowly, one datapoint at a time. The true function typically involves a continuum of datapoints. This is where we can bring probability into the picture (another option is to treat it as an adversarial game with nature). In the probabilistic approach, we make the assumption that the functions being revealed to us are in some probabilistic proximity of the true function and that the sample is closing in on it slowly. We have to be careful not to be too eager to model the revealed function; our goal is to optimize the function where these revealed functions are ultimately headed.
Those things aside, if you have to choose just one prereq, I think it has to be linear algebra, and you already have that in your bag. Without it, a lot of multivariate calculus will not make much sense anyway. Then one can push things a little bit and go for the linear algebra where your vectors have infinite dimension. This becomes important because often your data has far too much information to encode in a finite dimensional vector. Thankfully a lot of intuition carries over to infinite dimension (except when it does not). This goes by the name functional analysis. Not absolutely essential, but a lack of intuition here can rein you in from doing certain kinds of work. You will just get a better (at times spatial or geometric) understanding of the picture, etc.
Other than their motivating narratives, there is not much difference between probability/stats and information theory. There is a one-to-one mapping between many if not all of their core problems. A lot of this applies to signal processing too. Many of the problems that we are stuck at in these domains are the same. Sometimes a problem seems better motivated in one narrative over the other. Some will call it finding the best code for the source, others will call it parameter estimation, yet others will call it learning.
Or, if I may paraphrase for the CS audience, blame the reals \mathbb{R}. Otherwise it would have been the problem of reverse engineering a noisy Turing machine that we can access only through its input and output. Pretty damn hard even if we don't get into reals. In those situations you could potentially get by without calculus; algebra by itself should go a long way, but as I said it gets frigging hard. Learning even the lowly regular expression from examples is hard. Calculus would still be helpful because many combinatorial / counting problems that come up can be dealt with via generating function techniques, where you would run into integral calculus with complex numbers.
> Almost every corner of an ML problem has an optimization problem that needs to be solved: There is a function that you want to minimize subject to constraints. Typically these are everywhere smooth, or sometimes almost everywhere smooth. So calculus shows up in (i) algorithms to find the bottom of these functions (if they exist) or (ii) deriving the location of the minima in closed form. These functions would be "how close am I to the correct parameter", "What losses would these settings rake up on average" etc etc.
> The reason why this differs from a purely optimization / mathematical programming problem is that we can only approximately evaluate the actual function (the performance of our model on new / unseen data) that we care to optimize. Great optimization algorithms need not be (and often are not) good ML algorithms. In ML we have to optimize a function that's getting revealed to us slowly, one datapoint at a time. The true function typically involves a continuum of datapoints. This is where we can bring probability into the picture
The optimization techniques required to actually fit models are almost all powered by some form of gradient descent, and integration is usually required in truly probabilistic models to go from a density function to predictions.
All of statistics and machine learning involves lots of integrals and derivatives. For example: expected values are integrals, and model fitting is done by hill climbing in the direction of the derivative.
> "Bayesian Data Analysis" by Andrew Gelman is another great read.
If you want to read that book you need real analysis, more specifically measure theory (unless that subject is part of probability theory for you). You cannot get into the last few chapters without it. Dirichlet Processes are described using measures.
I don't believe you need multivar calc or info theory. Info theory stuff is used, but not that often. I believe you're slanted toward a researcher/PhD position. Gini index, entropy, and such are taken as given when needed.
My recollection is that you need neither real analysis nor measure theory to appreciate it, but it's been a while since I read it. You might get more out of it if you have studied those.
I disagree on multivar calc. Statistics often makes use of matrix derivatives. I have found it helpful to know.
What's required as a prereq to Measure Theory? Any suggestions on good resources for learning Measure Theory?
I have a vague notion that Probability and Measure Theory are intertwined / related somehow, but have never studied the latter specifically.
The relationship is that measure theory provides the theoretical framework for making probability theory rigorous.
The only formal prerequisite for learning measure theory is that you should know series and sequences. For a reference, I'm not so sure, maybe Halmos's book. The important parts are probably:
- Monotone convergence theorem
- Dominated convergence theorem
- The construction of the Lebesgue integral
- Fubini's theorem and Tonelli's theorem
I would probably try not to get bogged down in details of construction of measures (unless you like that) and take the Lebesgue measure (essentially length) as given. Also check out the Radon-Nikodym theorem which states that we can always (ish) work with density functions.
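For reference, the Radon-Nikodym statement being alluded to is roughly this (a sketch of the standard statement):

    If $\nu \ll \mu$ (i.e., $\nu(A) = 0$ whenever $\mu(A) = 0$) and both measures
    are $\sigma$-finite, then there is a density $f \ge 0$ with
    \[
        \nu(A) = \int_A f \, d\mu \quad \text{for all measurable } A,
    \]
    written $f = d\nu/d\mu$. Probability density functions are exactly this $f$,
    with $\mu$ taken to be Lebesgue measure.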
The typical prerequisite for measure theory is a two-semester real analysis course, a la Rudin or any of its alternatives (I particularly like Pugh's book). A solid topological background is also a good idea, although you can probably get away with whatever you learned in real analysis. Two standard measure theory texts are Folland's Real Analysis and the first half of Rudin's Real and Complex Analysis.
Probability theory is, in measure-theoretic terms, the study of measures of total mass 1. There are some good resources that mtzet mentions, but I just wanted to note that a lot of the integration terminology which you take for granted reading about probability theory is formally defined in measure theory. It's also very nice for making signal processing math more formal.
I disagree that you need a solid founding in information theory. Almost all that I've seen about IT in ML is minimizing the KL divergence, which can be learned by browsing the wiki page.
Well, information theory isn't much more than the logarithm of probability theory, so it doesn't hurt to learn it anyway. The only thing you need to know is that given a probability distribution P there exists a compression scheme to encode a value X with a message of P_length(X) = log(1/P(X)) bits. This can be summarised as BITS = log(1/PROBABILITY).

Entropy is just the average number of bits you need to encode a random value from distribution P with the compression scheme of distribution P, i.e. E_P[P_length(X)]. The KL(P,Q) divergence is when you encode a random value from distribution P with the compression scheme of distribution Q. Say you're compressing English text but you're using a compressor tailored to Spanish. The KL divergence is how many extra bits you need (on average) compared to encoding the English text with the English compressor:

    KL(P,Q) = E_P[log(1/Q(X))] - E_P[log(1/P(X))] = sum_x P(x) log(P(x)/Q(x))
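In code, for discrete distributions, the whole story is a few lines (a sketch with illustrative numbers, base-2 logs):

    import numpy as np

    P = np.array([0.5, 0.25, 0.25])   # the "English" source
    Q = np.array([0.8, 0.1, 0.1])     # compressor tailored to "Spanish"

    bits_P = np.log2(1 / P)                    # ideal code lengths under P
    entropy = (P * bits_P).sum()               # E_P[log2(1/P(X))] = 1.5 bits
    cross = (P * np.log2(1 / Q)).sum()         # E_P[log2(1/Q(X))]
    kl = cross - entropy                       # extra bits from the wrong code

    print(entropy, cross, kl)                  # kl >= 0, zero only when P == Q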
Maybe it's more that all that is essential for a molecular biologist isn't necessary for a general practitioner? It's just... those conference calls where you're explaining that just because the classifier is working really well now doesn't mean that we can use it in production, those calls can get difficult and annoying, and sometimes the "other side" wins - with predictable results.
You bring up a very important point and a difficult one which is, if the decision making is in the hands of someone who does not understand the nuances too well nor has the time or inclination, what do you do ?
If your salary is going to depend on how many models you pushed out and not how well they continued to perform, many will optimize over the number of models pushed out.
A major source of problems (and sometimes a gift) is that you cannot prove an empirical statistical claim true or false in finite time. There is always this non-zero probability that the weirdest thing would happen. It could be just sheer bad luck that the model did so poorly in this cycle.
That's not because you need little background in information theory. That's because KL-divergences are such a universal info-theoretic quantity that if you deeply understand them, you understand much to most of information theory.
This is like saying, "You don't need to really know calculus, just integrals."
A lot of data is best represented graphically, and while you can shoehorn this sort of data into a vector space by projecting using a graph distance metric, the results are likely to be inferior.
I agree a lot of data can be represented graphically, but if you look at the literature it mostly is getting shoehorned into vector spaces. This doesn't mean people shouldn't learn about Graph Laplacians and friends, but I don't think it's an entry requirement.
For calculus, I'd skip the more physics like finding of integrals and derivatives. What matters is understanding the concepts of integrals and derivatives, and knowing properties like the chain rule. It pays much less to know that the integral of 1/x is ln(x) (or the other way round).
The linear algebra and probability theory are most important imho. I'd also distinguish between probability theory and statistics. Both are important, but they are distinct disciplines.
It will depend on the level you plan to engage in the ML/AI space. If you just want a job in ML/AI, you are in luck. Due to the growing assortment of available, mostly to fully automated, solutions like DataRobot, H2O, scikit-learn, and Keras (w/ TensorFlow), the only math you will absolutely 'need' is probably just statistics. Regardless of what's going on behind the scenes with whatever automatically tuned and selected algorithm your chosen solution uses, you will still need some stats in the end to show the brass that 'your' model works. The upside is that you can then spend time learning feature extraction, data engineering, and the aforementioned toolkits, in particular what models they make available.
If you want to develop new techniques and algorithms, then the sky's the limit; you'll of course want stats too, though.
I found this course to be very helpful; it has a good balance of reading material and labs to apply what you learn. The course is from the University of Texas at Austin's Department of Statistics and Data Sciences.
Note: In this course, Dr. Michael J. Mahometa uses R. But I'd recommend you not to focus on R vs Python debates; the goal of this course is to learn about Statistics & Data Analysis in real-world scenarios. With that in mind, even just going through the reading material and lecture videos will be valuable enough if you're starting from scratch (but I'd recommend you to take the extra step and complete the Labs too).
This https://www.amazon.com/Probability-Statistics-Engineers-Scie... is the newer version of the stats book I had in undergrad.
But @anst makes a good point about scikit-learn. There is a lot of good math to learn just from the docs, and you can then investigate further on Wikipedia, Quora, and StackExchange.
For what's up in Data Science, I like datatau.com.
And there are some great podcasts too, like datascienceathome and partiallyderivative (there are lists).
There's a series of courses on Coursera, part of a Specialization from Duke titled something like "Statistics and Probability with R" or something like that. I've taken the first few classes in that series and have found them pretty good. The class on Bayesian Statistics is a little more difficult, but not too bad. I'll just say that you might want to complement the class with another book or other references on Bayesian stats. I've used this book:
What "maths" is keras? Or scikit-learn?
For what it's worth, to understand scikit-learn doc/tutorial I'd say you'll need Probability, Linear Algebra, Multivariate Calculus and, yeah, Stats.
Not necessarily at a PhD level, but still. And the more you understand maths, the farther you can get in AI/ML.
These libraries leave most of the actual day-to-day work as ETL (extract, transform, load). ETL happens to be highly data- and problem-dependent, so it can't be easily automated or reused. For this reason I think the best asset for a good applied ML person is a solid programming background. You should have a working knowledge of statistics and linear algebra, but the most useful skill really is being able to write good code. It's different for research, of course.
Those are ML and AI frameworks that use a tremendous amount of mathematics under the hood, but you can also reliably treat them as blackbox learning systems too. Understanding the model generation procedure and setup is often unneeded. And many tools will help direct you toward what algorithms makes the most sense for your data, and even have competitions to figure out which actually works best. I agree, it's a little disappointing, but admittedly it doesn't take a PhD to do this stuff anymore.
It is important to note that just because you can do all the stuff a PhD Scientist might regularly do, doesn't mean that someone will hire you for it. In that case you might need to have a PhD in mathematics, computer science or a related field. But that is more a consequence of competition and long term talent investment, than the practice of ML/AI itself.
Competition (labor supply side) and ultimate success of current ML approaches.
As the market starts to overheat, it seems that there will be a labor shortage/good quality workers will be scarce and we'll have to make simple tools for simpletons. But this is all a huge "if". Eventually the market will contract a lot and slack labor market conditions will have companies hiring them PhDs.
There exists a tool called TPOT (Tree-based Pipeline Optimization Tool) [0] that aims to automate the knob-twiddling that tends to go with optimizing Machine Learning models. As these models often have a number of parameters to tune and tweak over large scales, such a tool can be useful to identify performant combinations of these parameters and save time in doing so.
However, many ML practitioners are wary of similar automated ML pipelines, especially as they focus on non-expert users. A huge part of "data science" is the "data" itself. It often has idiosyncrasies and quirks that must be identified and accounted for in any model that hopes to make useful predictions. There are many pitfalls that come from not understanding the base statistical/mathematical assumptions of these tools, and a simplified Automatic ML Suite runs the risk of providing misleading results when used as a one-size-fits-all solution.
Even for expert users, such tools often make it difficult (either by mathematical need or software design) to interpret the reasons and causes for their results. "Black boxes" like this are definitely hard to sell up the chain.
These tools do, however, have an important place in saving practitioners time and energy on the "knob-twiddling". It's a little like robot-assisted surgery: the robot doesn't actually do the surgery, but it makes the surgeon's job a whole lot easier.
That is making the assumption that the person using the tool is a surgeon (an expert in the field who could function independently if needed) which is not who the targeted demographic of such tools is. No-one who understands ML to some non-zero extent would use a plug-and-play ML tool, given that there is ML left to do otherwise. A better analogy would be a janitor activating the red button of the robot machine, which then does its complex surgery where if something goes wrong, the janitor would not be able to replace/understand the problem other than trying to restart it/kick it.
Perhaps, but the meta/hyper-optimization techniques used to implement TPOT, AutoML, etc. are perfectly valid replacements for grid search and stepwise feature selection.
Basic probability is very helpful: expectation, standard deviation, P(A and B) = P(A)*P(B) if A and B are independent, P(A or B) = P(A)+P(B) if A and B are mutually exclusive. Also, knowing algebra is very helpful.
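Those rules are easy to check by simulation, which is also a good way to build intuition (a sketch; all numbers illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    A = rng.random(n) < 0.3       # event A with P(A) = 0.3
    B = rng.random(n) < 0.5       # independent event B with P(B) = 0.5
    print((A & B).mean())         # ~0.15 = P(A) * P(B), by independence

    x = rng.normal(loc=2.0, scale=3.0, size=n)
    print(x.mean(), x.std())      # expectation ~2, standard deviation ~3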
In a way, you don't really need to know much more because there is a lot of good software out there.
If you want to learn more math, learn linear regression, logistic regression, p-values, probability density functions, cumulative distribution functions, the Central Limit Theorem, Gaussian distributions, exponential distributions, the binomial distribution, and (maybe) Student's t-distribution.
If you want to learn even more, first learn matrices (adding, multiplying, inverting, rank, span, matrix decomposition (SVD, and eigendecomposition are the most important)).
If you want to learn even more, it's time to learn calculus. Integral calculus is needed for continuous probability distributions and information theory. Differential calculus is needed to understand back propagation.
There are a lot of other good suggestions written by the other commentators.
If you care about actually reading the journals, as I do, and you had a very poor math education (as mine was, abysmally opposed to both math and science as enemies of religion), then here are the things I've determined I need to know to read journals:
- Core statistics. You need to be familiar with how statisticians treat data, because it comes up a lot.
- Calculus. You do not need to be a wizard at working the numbers but you do need to understand how to describe the process of differentiation and integration over multiple variables comfortably.
- Linear algebra. It's essentially the basis for everything, even more than statistics.
- Numerical methods for computing. I constantly have to refer to references to understand why people make the choices they do.
- Theory of computation and the research clustered around it. Familiarity here helps a lot. Sometimes I even catch errors or am able to recognize improvements available. Also there is a lot of crossover, as one would expect. An example: everyone is remembering how good automatic differentiation is! And given that properly combined differentiable equations are also differentiable, AD lets you optimize over your optimization process. It's differentiable turtles all the way down.
My next big challenge is nonparametric statistics. Many researchers tell me that this is a very fruitful place to be and many methods there are increasingly making improvements in ML.
It depends on how deep you want to go and what your goals are, but I'd say that CuriouslyC pretty much nailed it. Multi-variable calculus, linear algebra, and probability / stats are definitely the core.
If you're interested in finding more "freely available online" maths references, check out:
There's also a TON of high-quality maths instructional content on Youtube, Videolectures.net, etc. For example, there's some really good stuff by David MacKay (also mentioned in CuriouslyC's post) here:
Surprising level of disagreement here on a few items for a sub field that has its own degree tracks.
Multivariable calc you either "absolutely" need or don't really need. You should be well versed in graph theory, or you don't need it much.
Surely some of the contradiction is caused by different assumptions about what the goal is. But some of it's hard to relate to as a reader. For example, I haven't been in the field but have tried to read enough to understand the concepts, and having studied graph theory I don't see how it's a top-5 recommendation.
I don't doubt anyone's experience, would just be nice to know which assumption is behind a suggestion.
To apply known methods in cases where they mostly work, you don't need to know the math behind them, you just need to know basic stats and basic probability to interpret the results. So if the assumption is that you'll simply be solving your problems by applying the known methods using the (great!) tooling made by others, then you don't need the math background; you can certainly train undergrads to solve quite nifty problems with the powerful tools without going into much if any detail about the underlying math, treating it as an engineering problem of following best practices. After all, the choice of e.g. a particular gradient descent optimization algorithm is not based on their mathematical properties (the proven bounds are so far away from practical results, and a better proven bound doesn't correlate that much with having better results) but on empirical evaluation, and in most cases you're not going to implement any of the low-level structures/formulas on your own anyway, in practical solution development you're just going to choose them from a list by name in the framework of your choice.
On the other hand, if the assumption is that your particular problem is not solvable easily and reliably with the current approaches, then quite a lot of the math background helps - if you want to improve on the current results, or debug/understand why your solution doesn't work as intended, or why the conceptual solution can't work on your problem because of incompatible assumptions, then these areas of math are useful. If you want to use a new bleeding-edge construct, or a rare niche construct that's not yet implemented in the framework of your choice, then you're going to need to write it yourself, and then you need to understand how it works.
There's a large distance between using and applying ML techniques and researching and improving ML techniques; it's a continuum, but there's space for many people standing purely in the applied end.
I think it's more nuanced. On average the better grasp of the theory an engineer has, the more pathways to success they have. Making better decisions, less guessing, leading a team, wanting to have input into future products and services, and so on.
Just having things be less opaque reduces cognitive load, makes more room for creative solutions.
None. You can be a productive ML engineer without understanding the math. Many elitist engineers here will downvote me, but it's true. ML libraries that allow you to quickly get productive have come a long way. BUT, you have to have a solid understanding of WHICH algorithms/tools to use WHEN. There is also a lot of "voodoo" knowledge to gain that isn't well documented or explained (unrelated to maths).
But a good number of people doing this work haven't taken real analysis, or it's been a while, so you should be current on multivariable and vector calculus. Calculus of variations shows up from time to time.
For math reviews, look at the following (there's others if you want more refs, ping me):
Make sure to differentiate between AI researcher and applied AI software engineer, or whatever that is called.
The former needs the mathematical background mentioned here to develop groundbreaking algorithms or improve on existing ones, while the latter merely implements them and requires a much smaller mathematical toolset.
It depends whether you want to work more as an engineer / data analyst, or more as a "ML researcher". For the latter, then, yes, as everyone says below, you need to be totally comfortable with multivariable calculus, linear algebra, probability and statistics, numerical optimization etc. But many jobs are more practical in nature, in which case the main essential skill is being able to run a bunch of different models with different parameter values, collect and interpret the results efficiently and reproducibly, and be able to talk about them and make recommendations for the way forward. In those jobs you're not actually going to need to be able to derive updates for backpropagation, even though it's certainly satisfying to understand it.
Yep. We have to keep in mind the distinction between "applied ML" and "ML research" (while realizing that this is a continuum, not a binary distinction). Not everybody is doing cutting edge original research... some people really can just get by with downloading DL4J, reading a few tutorials, and then applying a basic network to their problem, and create some value in the process.
I think cars are a good analogy. In the early days of automobiles, you needed to be something just short of a mechanical engineer to keep one going for any length of time, and it was routine to need to carry around tools and spare parts to perform significant repairs. You really needed to know a pretty good bit about how the car worked to use it effectively. But over time cars developed better abstractions and became more dependable and it became possible to operate a car without caring one lick about how it works, beyond know that it needs gas (or electricity!) and taking it in for the occasional tuneup /tire change / alignment / etc.
I wouldn't say we're at the point yet where ML affords one the opportunity to be completely divorced from caring about the underlying details, but I think we are at a point where you can legitimately get useful stuff done without needing to be able to, say, derive the equations for backprop by hand.
Honestly, I don't think having to learn some stuff before starting anything is necessary, especially for learning a field as wide as ML/AI. It's much better to start out trying to learn something you're interested in, and then trying to fill in the gaps. This will also help you understand and motivate the underlying theory you're reading.
So for example, start with some source in ML/AI you'd like to read. If you get stuck, ask somewhere (possibly an online forum like this) what field you're having trouble with and how to get started there.
Maybe unrelated to OP's question, but I have always felt that it is impossible to get a job in AI/ML without a PhD in that field (by getting a job I mean doing something new/useful and not just coding algorithms devised by other people). I studied mechatronics in university and am fairly comfortable with math (calculus, linear algebra and stats); I even wrote a small neural network back then to optimise parameters for lathe machining. But that's nowhere near enough for a job in AI/ML. Unlike writing a web page, which someone can learn within a week to produce something usable, I feel like you need years and years of studying to barely get a start in ML/AI, and there is no hope for us non computer scientists at all.
[Added] Of course writing webpages pays well enough, but I still can't shake off this feeling that I am missing something by not jumping on the AI/ML train though.
Statistics and Probability - For a non-math background, OpenIntro.org with R and SAS labs is a good one. Khan Academy videos on the same again make a lot of concepts easier.
I won't really comment about ML/AI in general. But, if you specifically care about getting into Deep Learning, I would say only bother looking into:
- Basic linear algebra and matrix algebra.
Since you would rely on frameworks like Tensorflow to handle figuring out the derivatives for you, you don't really need to know much calculus. Just read up on what the derivative of a function at a particular point signifies. This should give you enough intuition to understand things initially.
A skill that would really come in useful would be the ability to look at a function and think about how increasing/decreasing one of the variables would affect its value. This would help develop intuition around a lot of concepts used in Deep Learning topologies.
I previously attempted Andrew Ng's old course, but didn't complete the tutorials. Now I would start in this order:
Watch the course.fast.ai lectures quickly, just to see a lot of practical ML/AI applications. You'll see how effective you can be just by knowing the tools with very little math background.
Next I'd look at the NEW Andrew Ng introduction on Coursera. It is much more approachable than his first course. You might still feel a little overwhelmed by a few equations, but then you'll implement them yourself in numpy. (And the ipython/jupyter notebooks are really well written, walking you through every step.)
I wish the people who answer this question were current deep learning engineers or data scientists who use deep learning in real-world settings; I am worried that people who are not credible are giving advice, which is not valuable. I am a masters student taking a PhD class in Bayesian machine learning to figure this out as well. I hope to have a better answer for this by the end of the course!
> I wish the people who answer this question were current deep learning engineers or data scientists who use deep learning in real-world settings
Why do you want answers only from people doing deep learning? Deep learning is just a subset of the overall field (albeit an incredibly popular and useful one).
Anyway, the simple solution is just to use some simple machine learning of your own to analyze the data set which these threads constitute. Look for patterns... are certain answers being repeated over and over again, by different posters? Then I'd argue that your Bayesian posterior for "this is legitimately important" should go up.
Take Linear Algebra for example... given the sheer number of people saying "linear algebra" in their answers, it seems a reasonable bet to me that LA is really, truly useful. Either that or there's some really freaking group-think shit going on. :-)
I guess what I am looking for is advice from practitioners who won't lead astray people who are really interested in diving deep into ML.
I have attempted to read the Statistical Learning book, and it's so daunting because the book expects a lot of background knowledge, and it takes a while to really wrap your head around these concepts. I think people should learn from a lighter book before diving into these books if you are lacking the background.
My current approach to pursuing a career in DL and ML is going to graduate school, taking a graduate ML course, and trying to apply my knowledge to different problems I am interested in.
I am reading the Bishop book Pattern Recognition now. I think from the perspective of having to re-learn a lot of calculus and probability, that book is more approachable than Statistical learning.
My advice (which I am attempting now) to dive deep into ML is as follows:
1. Taking Bayesian ML class (at Cornell)
2. Read/Study Pattern Recognition by Bishop, for 5hrs/day
3. Try exercises, if fail, review solutions
4. If lost (which is usual), review missing concepts from MIT OCW Scholar courses
>I wish the people who answer this question are people that are current deep learning engineers or data scientist that use deep learning in real world settings
There simply aren't very many people in those roles because the number of ML/AI/DL jobs out there are still limited, I think.
1. You can get a long way with high school calculus and probability theory.
2. Regarding books I second the late David MacKay's "Information Theory, Inference and Learning Algorithms" and the second edition of "Elements of Statistical Learning" by Tibshirani et al. (there's also a more accessible version of a subset of the material targeting MBA students called James et al., An Introduction to Statistical Learning). Duda/Hart/Stork's Pattern Classification (2nd ed.) is also great.
The self-published volume by Abu-Mostafa/Magdon-Ismail/Lin, Learning from Data: A Short Course is impressive, short and useful for self-study.
3. Wikipedia is surprisingly good at providing help, and so is Stack Exchange, which has a statistics sub-forum, and of course there are many online MOOC courses on statistics/probability and more specialized ones on machine learning.
4. After that you will want to consult conference papers and online tutorials on particular models (k-means, Ward/HAC, HMM, SVM, perceptron, MLP, linear and logistic regression, kNN, multinomial naive Bayes, ...).
Probability theory and linear algebra are pretty much the core. Learning LA will help you become comfortable with multi-dimensional quantities, vector spaces, and give you some powerful computational techniques, e.g., SVD == PCA.
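The SVD/PCA connection in a few lines of numpy (a sketch on synthetic data; the squared singular values, scaled by n - 1, are the principal variances):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])  # 3 feature scales
    Xc = X - X.mean(axis=0)                    # center the data

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt                            # principal directions
    explained_var = s**2 / (len(Xc) - 1)       # principal variances

    print(explained_var)                       # roughly [9, 1, 0.01]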
If you want to understand SVMs deeply, a course in convex optimization. In general, proving maximum likelihood estimation for a lot of classic machine learning models involves using the method of Lagrange multipliers. But not deep neural networks :)
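For instance, the maximum likelihood estimate for multinomial probabilities is a textbook Lagrange-multiplier calculation (sketched):

    Maximize $\sum_i n_i \log p_i$ subject to $\sum_i p_i = 1$. Form
    \[
        L(p, \lambda) = \sum_i n_i \log p_i + \lambda \Big(1 - \sum_i p_i\Big),
    \]
    set $\partial L / \partial p_i = n_i / p_i - \lambda = 0$, so $p_i = n_i / \lambda$;
    the constraint gives $\lambda = \sum_j n_j$, hence $\hat{p}_i = n_i / \sum_j n_j$,
    the empirical frequencies.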
Provides a very good idea of the courses required and their time frame. I roughly followed along this path but took "Analytics Edge" https://www.edx.org/course/analytics-edge-mitx-15-071x-3 for introduction into ML algorithms.
A thorough, intuitive grounding in statistics is crucial, IMO.
Doing any kind of ML means questioning all the assumptions that go into your results and understanding how those assumptions could affect the outcome. That process starts in stats.
It depends on what "pursuing ML/AI" means. I've written a recommendation engine with barely understanding linear algebra and a spam filter without knowing Bayes theorem. A programmer can work on ML systems without having a solid foundation in higher maths. However, if you want to develop your own solutions then you surely need the math.
For ML, the other users gave a good coverage of topics. But AI is an incredibly broad field, and each specialty uses different math topics. Learning all of the math would be infeasible. What are your particular interests?
Russell and Norvig have a good book at http://aima.cs.berkeley.edu that covers many different topics in AI, although it is definitely not comprehensive. I would say that whatever you learn in an undergraduate CS degree would give you a good starting point for learning any particular AI topics.
Assuming:

- you are starting with the equivalent of a high school level of maths
- you want to take a ML course or read an ML book without feeling totally lost
As some commenters have said, Calculus, Probability and Linear Algebra will be very helpful.
Some people like to recommend the "best" or "most important" books which you "should" read, but there is a strong chance these will end up sitting on a bookshelf, barely touched.
So I will recommend some books which are perhaps more accessible.
- Calculus by Gilbert Strang
- Linear Algebra by Gilbert Strang
For Probability: I don't have any favourites, sorry.
Various universities have very good course content freely available online, often including textbook recommendations, course notes, exercises, sample exams, and video lectures. Realistically it is probably going to be quite difficult to learn this on your own.
Probability, and thus multivariate calculus and partial differential equations. Linear algebra. Convex optimization, and thus multivariate calculus and partial differential equations. Some principles of statistics are usually helpful.
Why do you need partial differential equations? I don't think you necessarily need any knowledge of differential equations to do ML, though the top ML people certainly would know it because of their general math education.
I spent a lot of time messing with PDEs as a student but sadly that knowledge hasn't been very useful - I've only seen them come up in quite specialised areas like optical flow...
Some people have had a more comprehensive view on this -- if I were to focus on one field of math to understand really well though, it'd be statistical reasoning and the understanding of probability and uncertainty.
Calculus (preferably both multi-variate and discrete), probability, statistics, operations research, graph theory, topology, computational complexity. All depends on how deep you'd like to go.
Generally you should have college freshman and sophomore calculus.
(1.1) Functions
So, there you can understand better what a function is. E.g., the function

    f(x) = 3x^2 + 1.
(1.2) Derivatives
Then you will learn how to find the slope of the graph of a function. That is the derivative of the function. E.g., for the function f with f(x) = 3x + 2, as in high school algebra, the slope is 3. Then for each x, the derivative of f at x is just 3.

The derivative of the function f is denoted by either of

    f'(x)   or   d/dx f(x)

E.g., for the function f(x) = 3x^2 + 1 it turns out that

    f'(x) = 6x.
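A quick numerical sanity check of that derivative (a sketch):

    def f(x):
        return 3 * x**2 + 1

    x, h = 2.0, 1e-6
    print((f(x + h) - f(x)) / h)   # slope of a tiny secant, ~12.0 = 6 * 2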
(1.3) Integration
For the function

    g(x) = 6x

maybe we want to know what function f(x) will give us

    f'(x) = g(x)

Finding such a function f is anti-differentiation, that is, it undoes differentiation. So, sure,

    f(x) = 3x^2 + C

for any constant C.

Such anti-differentiation is also the way to find the area under a curve. So, we can use that to find the area of a circle, the volume of a cylinder, etc. Done that way, anti-differentiation is integration.

The fundamental theorem of calculus shows how differentiation and integration are related.
(1.4) Analytic Geometry
Commonly taught at the beginning of a calculus course is analytic geometry. So, take a cone and cut it. Then the cut surfaces will be one of a circle, an ellipse, a parabola, a hyperbola, or just two crossed straight lines. Since those curves come from a cone, they are the conic sections. There is some simple associated algebra.

Conic sections are important off and on; e.g., applied math is awash in circles; the planets move in ellipses; a baseball moves in a parabola, or nearly so; an electron moving toward a negative charge will turn away from that charge in a hyperbola. It turns out that in linear algebra (below) circles and ellipses are important.
(1.5) Role of Calculus
Calculus was invented by Newton as part of working with force and acceleration to understand the motion of the planets. E.g., if at time t the function d(t) gives distance traveled, then the function v(t) = d'(t) is the velocity at time t and the function a(t) = v'(t) is the acceleration at time t. Then Newton's second law is

    F(t) = m a(t)

where F(t) is the force at time t applied to mass m.

Calculus is the first approach to the analysis of continuous change and is a pillar of civilization. Knowledge of calculus will commonly be assumed in work in ML/AI, data science, statistics, optimization, applied math, engineering, etc.

E.g., a lot in ML, AI, and data science is getting best fits to data; best fitting means minimizing the errors in the fit; such minimization is mostly a calculus problem; one of the main steps in ML is steepest descent, and that comes from a derivative.
Probability theory (e.g., evaluating coin tossing, poker hands, accuracy in ML) will be important in ML/AI, etc.; two of the basic notions in probability are cumulative distributions and density distributions; the cumulative comes from an integration, and the density from a differentiation.
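E.g., in code, differentiating the cumulative numerically recovers the density (a sketch using scipy's standard normal):

    from scipy.stats import norm

    x, h = 0.5, 1e-6
    slope = (norm.cdf(x + h) - norm.cdf(x)) / h   # differentiate the cumulative
    print(slope, norm.pdf(x))                     # the two agree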
(2.1) Linear Equations

The start of linear algebra was seen in high school algebra: solving systems of linear equations. E.g., we seek numerical values of x and y so that

    3x - 2y = 7
    -x + 2y = 8

So, that is two equations in the two unknowns x and y. Well, for positive integers m and n, we can have m linear equations in n unknowns ("linear" is as in the above example; omitting here a careful definition). Then, depending on the constants, there will be none, one, or infinitely many solutions.

E.g., likely the central technique of ML and data science is fitting a linear equation to data. There the central idea is the set of normal equations, which are linear (and, crucially, symmetric and non-negative semi-definite, as covered carefully in linear algebra).
(2.2) Gauss Elimination
The first technique for attacking linear equations is Gauss elimination. There we can determine whether there are none, one, or infinitely many solutions. For one solution, we can find it. For infinitely many solutions, we can find one solution and characterize the rest as coming from arbitrary values of several of the variables.
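In practice the elimination is a library call; e.g., for the 2 x 2 system above (a sketch; np.linalg.solve uses an elimination-style factorization):

    import numpy as np

    A = np.array([[3.0, -2.0],
                  [-1.0, 2.0]])
    b = np.array([7.0, 8.0])
    print(np.linalg.solve(A, b))   # the unique solution: x = 7.5, y = 7.75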
(2.3) Vectors and Matrices
A nice step forward in working with systems of linear equations is the subject of vectors and matrices. A good start is just the system

    3x - 2y = 7
    -x + 2y = 8

we saw above. What we do is just rip out the x and y, call that pair a vector, leave the constants on the left as a matrix, and regard the constants on the right side as another vector. Then the left side becomes the matrix theory product of the matrix of the constants and the vector of the unknowns x and y. The matrix will have two rows and two columns, written roughly as

    [  3  -2 ]
    [ -1   2 ]

So, this matrix is said to be 2 x 2 (2 by 2). Sure, for positive integers m and n, we can have a matrix that is m x n (m by n), which means m rows and n columns. The vector of the unknowns x and y is 2 x 1 and is written

    [ x ]
    [ y ]

So, we can say that the matrix is A; the unknowns are the components of vector v; the right side is vector b; and the system of equations is

    Av = b

where Av is the matrix product of A and v. How is this product defined? It is defined to give us just what we had with the equations we started with -- here omitting a careful definition. So, we use a matrix and two vectors as new notation to write our system of linear equations. That's the start of matrix theory.
It turns out that our new notation is another pillar of civilization. Given an m x n matrix A and an n x p matrix B, we can form the m x p matrix product AB. Amazingly, this product is associative. That is, if we have a p x q matrix C, then we can form the m x q product

    ABC = (AB)C = A(BC)

It turns out this fact is profound and powerful. The proof is based on interchanging the order of two summation signs, and that fact generalizes.

The matrix product is the first good example of a linear operator in a linear system. The world is awash in linear systems. There is a lot on linear operators, e.g., Dunford and Schwartz, Linear Operators. Electronic engineering, acoustics, and quantum mechanics are awash in linear operators.

To build a model of the real world, for ML, AI, data science, etc., the obvious first cut is to build a linear system. And if one linear system does not fit very well, then we can use several, in patches of some kind.
(2.4) Vector Spaces
For the set of real numbers R and a positive integer n, consider the set V of all n x 1 vectors of real numbers. Then V is a vector space: we can write out the definition of a vector space and see that the set V does satisfy that definition. That's the first vector space we get to consider. But we encounter lots more vector spaces; e.g., in 3 dimensions, a 2-dimensional plane through the origin is also a vector space.

Gee, I mentioned dimension; we need a good definition and a lot of associated theorems. Linear algebra has those.

So, for a matrix A, a vector x, and the vector of zeros 0, the set of all solutions x to

    Ax = 0

is a vector space, and it and its dimension are central to what we get in many applications, e.g., at the end of Gauss elimination, fitting linear equations to data, etc.
(2.5) Eigenvalues and Eigenvectors
"Eigen" in German translates to English as special, unique, singular, or some such. Well, for an n x n matrix A, we might have that

    Ax = lx

for a number l. In this case what matrix A does to vector x is just change its length by l and keep its direction the same. So, l and x are quite special. Then l is an eigenvalue of A, and x is a corresponding eigenvector of A. These eigen quantities are central to the crucial singular value decomposition, the polar decomposition, principal components, etc.
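Numerically, e.g. (a sketch):

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
    vals, vecs = np.linalg.eig(A)   # eigenvalues 3 and 1

    x = vecs[:, 0]                  # an eigenvector (a column)
    print(A @ x, vals[0] * x)       # the two sides of Ax = lx agree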
(2.6) Texts
A good, now quite old, intermediate text in linear algebra is by Hoffman and Kunze, IIRC now available for free as a PDF on the Internet.

A special, advanced linear algebra text is P. Halmos, Finite Dimensional Vector Spaces, written in 1942 when Halmos was an assistant to John von Neumann at the Institute for Advanced Study. The text is an elegant finite dimensional introduction to infinite dimensional Hilbert space. There is an entertaining article about Harvard's course Math 55; at one time that course used that book by Halmos and also, see below, Baby Rudin.

For more there is

    Richard Bellman, Introduction to Matrix Analysis.
    Horn and Johnson, Matrix Analysis.

There is much more, e.g., on numerical methods. There a good start is LINPACK: the software, associated documentation, and references.
(5) More
The next two topics would be probability theory and statistics. For a first text in either of these two, I'd suggest you find several leading research universities, call their math departments, and find what texts they are using for their first courses in probability and statistics. I'd suggest you get the three most recommended texts, carefully study the most recommended one, and use the other two for reference. Similarly for calculus and linear algebra.
For more, that would take us into a ugrad math major. Again, make some phone calls for a list of recommended texts. One of those might be

    W. Rudin, Principles of Mathematical Analysis

aka "Baby Rudin". It's highly precise and challenging. For more,

    H. Royden, Real Analysis
    W. Rudin, Real and Complex Analysis
    L. Breiman, Probability
    M. Loeve, Probability
    J. Neveu, Mathematical Foundations of the Calculus of Probability

The last two are challenging.
For Bayesian work, the key object is conditional expectation from the Radon-Nikodym theorem, with a nice proof by John von Neumann in Rudin's Real and Complex Analysis. After those texts, you can often derive the main results of statistics on your own or just use Wikipedia a little. E.g., for the Neyman-Pearson result in statistical hypothesis testing, there is a nice proof from the Hahn decomposition via the Radon-Nikodym theorem.
I have been inspired by some of your past posts suggesting a path for studying mathematics and doing graduate level work, and have changed my direction to try and follow what you suggest. Is there any way I can get in touch with you privately? (I'm not looking for help with specific technical questions if you're concerned about that.)
Are you doing the "Get the book. Read the book. Do the exercises." method? If you are, what's your experience?
I have had some books stored up since forever, and graycat's post did motivate me to finally get around to reading them, but I find it hard to integrate into my daily routine. His 24h challenge killed my productivity for a day, and I can't really afford to get distracted by some tricky proof when I'm supposed to do something else.
Yes, I'm working through a few books that way. I didn't see his 24h challenge so I'm not sure what it is, but what has been effective for me is blocking off a few hours every day to work on this stuff. I haven't gotten to the really difficult material he's talking about yet, but I'm looking forward to seeing how this goes. Good luck to both of us!
In a different comment chain on the same submission (https://news.ycombinator.com/item?id=15024640), he challenged the commenters disagreeing with him to do these exercises in 24 hours. The tone was pretty abrasive, TBH, but I found the questions interesting enough that I tackled them in earnest.
I posted my solution attempts, so don't scroll down too far if you want to try them on your own ;)
It very much does. Boosting, one of the best off-the-shelf ensemble classifiers, is derived from a game-theoretic formulation. Besides that, there is a huge body of literature about prediction under non-probabilistic sequences of test cases. This line of work is primarily held up by game-theoretic arguments and those of online convex optimization.
Not a mention so far about game theory or Nash equilibrium.
It depends on what you're doing. I was literally just watching a video on Generative Adversarial Networks this morning, and game theory did come up there, at least in passing. If one sat down and started reading the papers on this subject and trying to implement / improve stuff in this area, I suspect game theory would be at least moderately important.
There is also the field of Competitive Learning where game theory has some application. See, for example:
> What maths must be understood to enable pursuit of either of the above fields?
None.
> Are there any seminal texts/courses/content which should be consumed before starting?
No.
You don't need to know binary to start being a programmer/developer either. Just start already. As long as you are not in charge of a medical diagnosis or financial model, you don't get any drawback in experimenting (and failing miserably).
Assuming applied ML, the most difficult part will be the human-political business element of it: People not understanding your model or using its output correctly, bias, feedback loops, acquiring enough resources, etc. The more you can explain to them, without resorting to heavy maths, the better communicator you are.
That said, it can't hurt to do Ng's Coursera course (a lot of top performers started out with this course). Learning from Data by Caltech's Abu-Mostafa goes very wide on machine learning. "Programming Collective Intelligence" is a, somewhat dated, good book.
As for seminal texts, the field is too wide for this. A better bet is: Find a professor in the field you are interested in. Say "Deep Learning", you could have a look at LeCun, Hinton, Schmidhuber, Bengio, ... Now look at their PhD-students, their papers, their courses, their conference talks, their software, their current research. Basically become a student under the most authoritative professor in the subfield you can find and resonate with, without ever paying any university tuition or them knowing you exist. This is very possible these days.
But by all means: Just start out. Machine learning is fun. Learning about dry 100 year old maths not so much. Make mistakes. Learn to detect and avoid overfit. Find out if you are passionate and curious about parts of the field, then the theory will come eventually. A lot of the time these questions seem to demand answers like: "You need a PhD-level understanding of mathematics" Just so your brain can go: "I am not good enough for this, so let's look at something easier". Don't use this as an excuse. Start making intelligent stuff. There are 16-year-olds on Kaggle routinely beating maths PhD's.
Also remember that, despite the current trend of calling everything "AI", that AI is a very wide field, of which mathematics is only a small part. There is philosophy, linguistics, cognitive science, physics, neuroscience, psychology, computer science, robotics, logic, ... all these parts vary wildly in their prerequisite maths knowledge.