1. Convex Optimization -- not all problems are convex, but solutions for nonconvex problems end up primarily using convex methods with slight adaptations.
2. Stochastic Optimization -- ML is pretty much all stochastic optimization. No surprise there.
3. Statistical/Theoretical Machine Learning -- courses built around concentration bounds, PAC learnability, and the Valiant/Vapnik school of thought. This gives you what you need to talk about generalizability and sample complexity.
4. Numerical Linear Algebra -- being smart about linear algebra is most of efficient machine learning: knowing which kinds of factorizations help you solve problems efficiently. Can you do a Gram-Schmidt (QR) factorization? A Cholesky decomposition? An LU factorization? When do these things fail? When do you benefit from sparse representations? (See the small sketch just after this list.)
5. Graphical Models -- Markov chains, Markov fields, causal relationships, HMMs, factor graphs, forward-backward algorithm, sum-product algorithms.
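For item 4, a tiny numpy/scipy illustration of the tradeoff (my own sketch, not from any course): Cholesky costs roughly half of LU but only applies to symmetric positive definite matrices, and it fails loudly when the matrix isn't.

    import numpy as np
    from scipy.linalg import cholesky, lu_factor, lu_solve, solve_triangular

    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 500))
    spd = A @ A.T + 500 * np.eye(500)   # symmetric positive definite by construction
    b = rng.standard_normal(500)

    # Cholesky: only for SPD matrices, roughly half the flops of LU.
    L = cholesky(spd, lower=True)
    x = solve_triangular(L.T, solve_triangular(L, b, lower=True))

    # LU with partial pivoting: works for any nonsingular matrix.
    lu, piv = lu_factor(A)
    y = lu_solve((lu, piv), b)

    # Failure mode: Cholesky raises LinAlgError on an indefinite matrix.
    try:
        cholesky(A + A.T, lower=True)
    except np.linalg.LinAlgError:
        print("not positive definite -- Cholesky fails, LU still works")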
If you're in school, take advantage of the fact that you're in school.
Once you have a grasp on these things (and you'll have to catch up on real analysis, matrix calculus, and a few other fields of math), you'll be able to start reasoning about ways to improve existing methods or come up with your own. I think a lot of it is just developing mathematical maturity to give you a vocabulary to think about things with.
Very specifically, there are two books that I think provide a good foundation. First is Convex Functional Analysis by Kurdila and Zabarankin. Not many people know about this book, but essentially it provides a self-contained background to prove the Generalized Weierstrass Theorem, which details the conditions necessary for the existence and/or uniqueness of a solution to an optimization problem. This is important because even convex problems don't necessarily have a minimum. For example, min exp(-x) doesn't have one, but it does have an infimum. The background necessary to understand this book is real analysis, and as a quick aside I think Rudin's Principles of Mathematical Analysis is the best for this. Second is Nocedal and Wright's Numerical Optimization book. It provides a good overview of the powerful algorithms in optimization that we should be using. Now, its weakness is that it often cheats and uses a stronger constraint qualification than we're afforded in practice. Candidly, I find that the derivative of the constraints will not remain full rank, so we will likely violate the LICQ. Further, it covers a number of algorithms that really shouldn't be used in practice, ever. That said, it does cover the good algorithms, and it has the best presentation of the books I know.
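A quick numerical illustration of that existence point (my own toy example, not from the book): gradient descent on f(x) = exp(-x) decreases forever, because the infimum of 0 is never attained at any finite x.

    import math

    # f(x) = exp(-x) is convex and bounded below by 0, but has no minimizer:
    # inf f = 0 is approached only as x -> infinity.
    x = 0.0
    for _ in range(200):
        grad = -math.exp(-x)   # f'(x) = -exp(-x)
        x -= 1.0 * grad        # gradient descent step
    print(x, math.exp(-x))     # x keeps growing; f(x) -> 0 but never reaches it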
Sadly, I don't know of any killer books for numerical linear algebra. And, yes, I've read cover to cover things like Saad's Iterative Methods for Sparse Linear Systems, Trefethen and Bau's Numerical Linear Algebra, and Golub and van Loan's Matrix Computations. They're valuable and well-written, but don't quite cover what I end up having to do to make linear algebra work on my solvers.
Anyway, this is all biased and opinion, so take what you will. If someone else has some of their favorite references for optimization or numerical linear algebra, I'd love to hear.
A CS/AI PhD program is nothing like a Math/AMath/Stat PhD. In the latter, there is zero expectation that you will start producing anything in the first few semesters. In fact, it is explicitly required of you to load up on rigorous core courses, so that you can pass your prelims at the end of the 2nd year and become a formal "PhD candidate". The attrition rate in those programs is about 40% or more, so a lot of these people simply find out they don't have the math maturity, drop out with a Masters at the end of 2 years, get a job and call the whole thing off.
So in the latter case, yes, such a person can follow your guidelines. In fact, most of the material you listed is called AMC (applied math core) or CACM or some other abbreviation, and is already taught as part of the core.
Now, the former case is where this particular student is. CS PhD programs are a sort of weird beast in the US. They are housed in engineering, not liberal arts. The expectation is to produce papers right out of the gate. At least lightweight papers, posters, something... you cannot coast for 2 years saying you are learning convex math and stochastic calculus. So if you read the material that comes out of students in that phase, it tends to be of low quality and heavy on empirical evidence (I tested 7 functions on 3 datasets and these 2 came first; here are the charts and graphs). As the student matures, his papers gather more heft, and by the time he graduates the final 2-3 papers will be very good... at least that's the expectation. Reality again is quite bleak and results are all over the place. Because of the hectic hiring climate, lots of CS grads will just take an ABD and get the hell out. 150k starting plus RSUs is nothing to sneeze at. The ones who do finish tend to take a full 4+ years and are in the teens as a percentage of the incoming cohort :(
Also, you are not really permitted to sign up for whatever you want just because you have math deficiencies. Your advisor will have to sign off on each semester's load. He has to ensure you are on track, not just following your own whimsical path into theoretical math because you fancy it. The CS core is quite distinct from AMath and touches upon the material you mentioned very superficially. All in all, this student is between a rock and a hard place. It's not going to be easy for him at all if he truly wants to understand everything. The best bet is to do what the top reddit advice says: pick some narrow corner where you are comfy, write 2-3 papers in that corner, get the hell out, and learn the rest later on your own time in your research career.
We have an exam after the first two years and a similar process. It depends on what your advisor expects/wants. Mine has been flexible. And I focused more on efficient software engineering and applications for prior methods than on new research as I got up to speed on other matters. It made obvious a lot of ways I could improve them, just by being forced to look at and implement all of the details under the hood.
And, at least in my CS department, there is a very heavy emphasis on mathematics with the ML/AI folks. They coauthor a number of papers with the applied math department and the rest of their papers are mostly proofs. They'll usually back it up with proof-of-concept implementations, but in that regard, they're very much like researchers in applied math except that they use Python instead of MATLAB.
For ML, the OP already has linear algebra, which is sufficient. Deep neural networks are backprop, which is basically high school math. You could have mentioned ODEs and sensitivity analysis, which I think are more relevant than convex optimization. For NNs we don't even care about identifiability, from both the statistics and dynamical systems points of view. NNs blow away SVMs and almost everything except random forests in some domains. Both of these have this interesting property that nobody understands them except as black boxes, for the most part. Boosting is another example. It really is stranger than fiction.
That being said, I think statistics/probability theory and Bayesian stats/networks are useful to know for any scientist.
I would talk to your advisor about what to do. They will be able to advise on what's important and what to learn/focus on.
Is this true? Boosting is pretty well formulated in the PAC framework and the classical algorithms (e.g. Adaboost) are well-characterized.
(Source: http://l2r.cs.uiuc.edu/Teaching/CS446-17/LectureNotesNew/boo... "The original boosting algorithm was proposed as an answer to a theoretical question in PAC learning [The Strength of Weak Learnability; Schapire, 89].")
It took a while, but there's been a lot of work lately explaining neural nets' performance over the last 5 years or so, from papers showing PAC learnability for specific architectures (https://arxiv.org/abs/1710.10174), to work saying that most local optima are close to global optima (http://www.offconvex.org/2016/03/22/saddlepoints/), to work saying that the optimization error incurred (as separate from approximation and estimation errors) serves as a form of regularization for deep neural networks.
And understanding how these things work helps improve and speed up these methods and models: it's hybrid algorithms which are enabling performance in time-series data and more complex tasks. The future will nearly certainly use neural networks as part of many algorithms, but I doubt that the full machinery will be simple feed-forward nets of ever-increasing sizes.
Another problem is that most data scientist positions are filled by statisticians who will be giving you the job interview. Almost all of the questions will be around stats. I personally feel a mastery of those courses would be great, but they would also not help me land a job, because improving LDA to run on small text input by using a variational autoencoder doesn't help me recite the formula for a t-test.
This is, essentially, Math 55. All 4 books have been used at different stages in this famous course.
Statistical/Theoretical: Shai Shalev-Shwartz & Shai Ben-David's Understanding Machine Learning (http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning...)
Mohri's Foundations of Machine Learning (https://cs.nyu.edu/~mohri/mlbook/)
The two above courses could share SSS's Online Learning text (https://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf). To be fair, the stochastic variants of most optimization algorithms can be learned reasonably quickly off of a statistical machine learning/basic optimization background. There's the option of Spall's Intro to Stochastic Search and Optimization, which covers neural networks, reinforcement learning, annealing, MCMC, and a wide variety of other applications and techniques. (http://www.jhuapl.edu/ISSO/)
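As a tiny concrete example of the "stochastic variant" point (my own sketch, nothing from Spall): SGD on least squares is just gradient descent where each step looks at one random sample.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 5))
    w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
    y = X @ w_true + 0.1 * rng.standard_normal(1000)

    # SGD for min_w (1/2n) * ||Xw - y||^2, one sample per step.
    w = np.zeros(5)
    for t in range(10000):
        i = rng.integers(len(y))
        grad = (X[i] @ w - y[i]) * X[i]    # gradient of one squared residual
        w -= 0.01 / (1 + t / 1000) * grad  # decaying step size, standard for SGD
    print(w)  # close to w_true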
Similar to what kxyvr said, I also don't know of any killer linear algebra text, which is why I think a course is so useful. The matrix cookbook is helpful along the way. kxyvr is also entirely right that general nonlinear optimization is important -- though perhaps less indispensable. (Going the other way, the Bertsimas linear optimization textbook I've had for years mostly gathers dust.)
For PGMs: I got Predicting Structured Data back when it was new (https://mitpress.mit.edu/books/predicting-structured-data), but I think that Chris Bishop's treatment in PRML is easier to follow. He has some lecture slides which expand on it quite well. (https://www.microsoft.com/en-us/research/people/cmbishop/)
Bishop would also be my go-to intro ML book over Murphy.
I can't in fairness offer recommendations for the rest of the intermediate undergraduate math texts because I took them so long ago, but I can say that I have benefited from reviewing the MIT OCW courses from time to time.
I find a better approach is to focus on the few basic ideas you need specifically for your work and dig deep there. Nobody can be expert-level in everything, but you can be expert-level in your specific domain of research.
Also for ML stuff, it's hard to overemphasize the importance of understanding linear algebra really well. Here is an excerpt of a book I wrote on LA which should get you started on learning this fascinating swiss-army-knife-like subject: https://minireference.com/static/excerpts/noBSguide2LA_previ...
BTW I really like what i've seen of your guidebooks, or whatever you call them (neo-textbooks?).
Not in my experience. It's possible to get the equivalent of a bachelor's and master's in math within two years (which is enough to overcome the issues listed in the post), but it's all you'll be doing for that period of time. Well worth it imo.
In fact, most people who learn this math learn it in a period of 2-3 years. Far from being impossible, learning this math in a few years is normal. It's not even a full time job. Most people learn all this math while also doing other classes and school stuff. Even a very dedicated math major probably only spends 20-25 hours a week actually studying math. I'm not sure much more than that is sustainable for most people anyway.
Now, I'll grant, this is going to be a lot harder to do without the structure of well thought out syllabi and lectures, but it's certainly manageable.
From what I've heard from some foreign students, the 400 level American undergrad courses are what they are expected to learn as freshmen.
About that - I don't think OP (or their research group) can afford to hold off publishing and attending ML conferences for 2 whole years while brushing up on math.
When Einstein was working on general relativity, he had a lot of help from friends and colleagues who pointed him towards the math he needed. He didn’t learn differential geometry until he was already deep into general relativity.
Find a level of abstraction that you’re comfortable with and learn to be okay with black boxes at the lower level, and only dig into those boxes when what’s inside them actually matters.
It helps a lot that in CS you can often see the code that the authors published along with the paper. Just staring at formulae doesn't mean much, because for all you know the author just hammed up the equations to get their paper into a top conference. That's not to say that the equations are excessive, or the authors are being misleading, but I think there is definitely an expectation in some fields that putting equations in makes your paper look clever even if they're broadly unnecessary.
It's also wildly different depending on the field. If you look at variational methods in computer vision, images are [continuous] mappings from some domain onto the reals (I : Ω → R³ for colour). Does that change the fact that an image in memory is just a bunch of numbers in a grid? Not really, but it's bloody confusing the first time you see it.
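For what it's worth, the gap between the two views is about one interpolation call (a sketch of my own, assuming scipy is available): the continuous map is just the pixel grid plus an interpolation rule.

    import numpy as np
    from scipy.interpolate import RegularGridInterpolator

    img = np.random.default_rng(0).random((480, 640))  # "just numbers in a grid"

    # The continuous view I : Omega -> R is the grid plus interpolation.
    I = RegularGridInterpolator(
        (np.arange(480), np.arange(640)), img, method="linear")
    print(I([[100.5, 200.25]]))  # the image "evaluated" between pixel centres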
This doesn't help with understanding the maths, but at some point you have to give up and say "This guy proved it, and someone else peer reviewed it, so I can use it to solve my problem". It's perfectly OK to stand on other people's work and still make creative contributions to your field, that's the point of research.
We actively work to make our writing hard to understand in this field. I do this all the time myself. I don't really need this complex looking equation to make my point. But if I don't have it in there a reviewer will think my writing is not academic enough. So there you have it. Once you go in realizing this is the case everywhere, it becomes a lot easier to understand academic papers.
I wish there was a ELI5 section in each paper.
As well as his lover, later wife, lest we forget:
Researchers are generally expected to be experts in their field. The people writing the papers on arxiv likely spent most of their lives learning about machine learning and mathematics.
Unfortunately, there’s not an easy path to become an expert. One just has to dig in and learn from the ground up.
Edit: The good news is that it's never been as easy to learn math as it is now. When I was an undergrad in math, there were almost no resources available to learn the intuitions behind the math. One just had to keep doing proofs and exercises over and over and hope that it would 'click' at some point. But, sometimes that wouldn't happen until many years later. Nowadays, one can watch YouTube videos where experts describe the intuition behind the math. It's awesome.
That also ignores applications of machine learning, which is also a massive (and lucrative) field. But because it's a trendy field, I think there is an obsession with people needing to understand everything theoretical that comes out for fear of missing the boat.
Some of the really interesting papers that have come out over the last few years - for example artistic style transfer and Faster R-CNN - have hardly any maths in. You can count the equations on one hand in both those papers. No doubt the authors know their stuff, but how readable are those papers compared to e.g. a 100-page proof? Which did I learn more from?
They're a combination of two things: intuitive network architecture and a clever loss function. The first thing is a combination of intuition and programming, the second involves a little maths, but nothing outrageous.
If the goal is to make a lot of money doing applied ML, then become a consultant and aim to know 10% more than the customers. If the goal is to create models that are relatively effective, then read tutorials, play with data, experiment, and iterate. But, if the goal is to create very effective models and be able to actually explain why they work (which I think is what many companies want), then one has to understand the math.
That is, showing some interesting relationships, trends, predictions, or inferences on a data analysis portal or consumer web site is one thing. But, using ML to dispense medication, regulate a medical device, drive a power plant, or identify criminal suspects - those may require different skills.
(BTW I don't mean to disparage the middle group, as that's largely what I do. But, luckily I have people in the latter group who can validate what I'm doing.)
This approach will help make incremental improvements. You might even get lucky and hit on something that gets cited really well.
1. Describe a new technique;
2. Show that it works;
3. Explain why it works.
Understanding why things work is easily the hardest thing. This is where the most maths gets deployed...But often people are reaching for the fancier maths when they can't find a simpler intuition behind the idea. You can also use fancier analysis to substitute for less impressive empirical results. These explanations might convince reviewers, but that doesn't make them any more likely to be correct.
I find it effective to take a very "computer's eye view" of things. Instead of thinking primarily about the formalisation, I mostly think about what's being computed. What sort of information is flowing around, during both the prediction and the updates? What dynamics emerge?
Eh, I would say math is the language we use to communicate to computers presently, but the underlying _concepts_ don't require "more maths" and can often be grasped through an intuitive approach. For example, almost everyone who uses photoshop understands the underlying concept of a convolution (e.g. Gaussian blur), even if they don't know the mathematics that can be used to describe the operation. Yes, there are difficult notions that formalization or generalization assist with--perhaps it is better to see math as augmenting the initial intuition, rather than driving the intuition?
For example: convolution is not _just_ Gaussian blur; it also lets you find an object in the scene (or a shape of something in the time dimension). How is that related to Gaussian blur, and why are they the same operation? It takes time to understand the full domain of the concept.
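One way to see that they really are the same operation (my own toy example): convolving with a Gaussian kernel blurs, while convolving with a flipped template acts as a matched filter that peaks where the object sits.

    import numpy as np
    from scipy.signal import convolve2d

    rng = np.random.default_rng(0)
    scene = rng.random((64, 64))
    template = np.ones((5, 5)); template[1:4, 1:4] = 2.0  # the "object"
    scene[20:25, 30:35] += template                       # hide it in the scene

    # Same operation, two different kernels:
    gx = np.exp(-np.linspace(-2, 2, 9) ** 2)
    gauss = np.outer(gx, gx) / np.outer(gx, gx).sum()
    blurred = convolve2d(scene, gauss, mode="same")       # Gaussian blur

    matched = convolve2d(scene, template[::-1, ::-1], mode="same")
    print(np.unravel_index(matched.argmax(), matched.shape))  # near (22, 32)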
Indeed, there's no royal road to geometry!
Still, I thought this line was famous "enough".
When you start the book, it would give the theorem and proof at a level that would be used in a research journal. For each step of the proof, you would have two options for getting more detail.
The first option would be at the same level, but less terse. E.g., if the proof said something like "A implies B", asking for more detail might change that to "A implies B by the Soandso theorem". Asking for more detail there might elaborate on how you use the Soandso theorem with A.
The second expansion option gives you the background to understand what is going on. In the above example, doing this kind of expansion on the Soandso theorem would explain that theorem and how to prove it.
Both types of expansion can be applied to the results of either type of expansion. In particular, you can use the second type to go all the way down to high school mathematics.
If you started with just high school math, and used one of these books, you would get the basics...but only those parts of the basics you need to understand the starting theorem.
Pick a different starting theorem, and you get a different subset of the basics. It should be possible to pick a set of theorems to treat this way that together end up covering most of the basics.
That might be a more engaging way to teach mathematics, because you are always working directly toward some interesting theorem.
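A toy data model for the idea above (entirely my own sketch): each proof step is a node with two kinds of expansion, "same level but less terse" and "background", and both can recurse.

    from dataclasses import dataclass

    @dataclass
    class Step:
        text: str                         # e.g. "A implies B"
        detail: "Step | None" = None      # option 1: same level, less terse
        background: "Step | None" = None  # option 2: prerequisite material

    proof = Step(
        "A implies B",
        detail=Step(
            "A implies B by the Soandso theorem",
            detail=Step("how the Soandso theorem applies to A ..."),
            background=Step("Soandso theorem: statement, proof, and its own background ...")))

    # Walking 'detail' expansions; 'background' expansions could recurse
    # all the way down to high school math.
    node = proof
    while node is not None:
        print(node.text)
        node = node.detail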
Sadly, the monetization of this is tricky. It probably has to be an open source effort. You'd need some visionary like Wales or Khan, but they are very, very rare.
You may be interested in this kind of laying out a proof:
There is a point where one starts to see "behind" the symbols. It's a strange sensation, as if one could understand the ideas in a non-verbal way. The symbols become optional. Intimidation crawls back before curiosity at this point.
An amazing book on the subject is:
"Hadamard - The psychology of invention in the mathematical field"
At one level, you're observing a technical construction and trying to ensure that it's (mostly) sound; but at another level, you're trying to understand the broader picture of how it fits in, what the builder was trying to accomplish or what perspective of the world they're trying to share.
Mathematics is -- like any language -- just the articulation of an experience, of an insight, of an understanding. As you get further into mathematics (and possess more technical skills of your own), it becomes more important to see "Oh, he's trying to apply the machinery of homotopy to type theories as a means of discussing equivalence" than it is to get bogged down in the technical details. Often, the details are wrong in the first draft, but in a fixable way. (This is extremely common in major proofs.)
> There is a point where one starts to see "behind" the symbols. It's a strange sensation, as if one could understand the ideas in a non-verbal way
I think at some point, you have to compile mathematics to non-verbal ideas for computational reasons -- your verbal processing skills are simply too slow and too simple compared to other systems. Your visual and motor systems are way more powerful and (in the case of motor systems) operate in high dimensions. Much like GPUs in computers, if you can find a representation of a problem that works on a specialized system, you can often get a big computational boost; in mathematics, we have to push our understanding of self and experience to the limits to find more efficient representations of ideas, so we can operate on more interesting or complex ones.
I think most mathematicians work in extremely personal, non-portable internal representations, and then use the symbols as a way to create an external representation that the other mathematicians can compile into their own internal representations.
If you see mathematics as extremely high level code meant to be compiled to equivalent internal representations on thousands of slightly different compilers, I think the language starts to make more sense -- it's meant to be a reverse compilation target for machine code that's been under revision for ~3000 years, so of course it looks a little funky.
I will say this --
One thing I've noticed as I've gotten older is that we do a really poor job of teaching students the story of mathematics -- the human motivations, the community, the long standing projects (some have gone on for hundreds of years; some are still ongoing).
I sincerely believe that for young kids (less than, say 10), it would be better for their development to teach skills 4 days a week and simply tell them part of the story on the 5th. It would make mathematics much more relatable and understandable.
A few people have thought about this very idea.
You may take a look at:
In a way it is like a magic trick. Frustrating when you don't know how it works, but when you find out it's like: oh, was that all there was to it? However, unlike a magic trick, math leaves you with something that can actually be useful.
> the age of 14
And also that a few people have exceptional intellectual abilities.
Modern AI is evolving rapidly, but there is a foundation upon which everyone draws. The Sutton and Barto book is one such foundational text.
Find a collaborator in the Math department to work with. And participate daily in the Stack Exchange forums for math and stats, such as Cross Validated.
I can also recommend CASI by Efron and Hastie. Deep historical understanding of where we are today in probabilistic inference.
To be honest, the 3blue1brown videos seem really wonderful at explaining what is going on without going too deep, whereas the math in ML lectures seems to be about proving everything, always teaching in math notation.
I guess this is happening because most of ML comes straight from research, since it's all new, so it's being taught mostly by people who can grok the math -- mathematicians -- not by people who are programmers. This really shows how much math should stay math and not leak into fields where practice matters more. Programming languages and pseudocode exist for a reason. Computers don't talk math.
So as years go by, ML will be taught more as a practice subject rather than a theory one, and things will get better. I think it's just a matter of how it's being taught, because reading code will always make more sense than reading high level math. Videos and oral explanations also help a lot.
I felt similar to you when I first started learning ML, but their code-first approach really helped it click for me on an intuitive level. Then you can go back and dig into the maths behind it.
And when the time comes to write his own papers, he should remember to intentionally make them harder to read for outsiders. E.g. instead of writing "I calculated the total error by summing the per-neuron errors", one should write "the loss function utilized an integral over the output lattice using a discretized method by Newton et al.", or some other bullshit.
It's a huge, slow, painful investment, but totally doable and with tremendous ROI if you want to work with stats/ML/optimization/really any numerical computing for a living.
The reason I recommend this route is that most of the more advanced math books you will encounter will assume this stuff as the readers' common knowledge. Having that foundation, the majority of the literature is already tailored to you!
Showing up at thesis defenses is good too. You learn a lot from the back and forth with the advisors.
The key is to understand at a high level why the different math techniques are being used without actually understanding all the details. This won't be sufficient for your own work, but at least you'll have a good idea how your part fits in the scheme of things.
One solution would be simply to arrange a local seminar and understand a couple of papers in full detail. It would help to invite a couple of mathematically aware students from mathematics, physics, or the part of the CS faculty where they prove stuff. They should be able to explain and answer questions immediately, which is way more effective than reading whole books or taking courses. Those can be read for details later.
If the papers for the seminar are deep learning papers, part of the outcome of the seminar is likely to be an understanding that the authors of these papers do not necessarily understand the mathematics themselves.
I got into math for exactly the same reasons while doing research in computer vision as an undergrad, and taking the requisite time off to learn advanced math (actually going overboard on it) has been an incredible boon to my AI career.
The point is that researchers have their little niche and they try to make contributions in areas adjacent to it. It's unrealistic to think everyone publishing papers understands all the other papers, particularly in such a cross-disciplinary field as ML. There's also a big gap between a researcher deep in their career and a student fresh out of a masters program.
It's also hard to transition from someone who's used to reading and understanding textbooks to someone who's often reading really technical research and understanding very little of it at first. You just have to push through and have confidence that you'll eventually learn enough to make a contribution. That's what it means to "become an expert"--you start off as not being an expert and then beat your head against the wall for a few years until you bootstrap your way out of it. And if you want to do it in a reasonable amount of time, you should probably choose something you have some of the fundamentals for.
>"Professional heavy math people are those who said in the 60s that the perceptron's limitations proved all AI was impossible. And in the 90s that one hidden layer was all you needed, deep learning was useless."
Can anyone provide the citations for this? I was aware of the latter but not the first one. You can find people still repeating the one layer stuff up to a few years ago just by reading stackexchange.
I realize that even this alleged misunderstanding is not the same as a claim that AI is impossible. The closest attempt at a mathematical proof of the impossibility of AI that I am aware of is the Lucas-Penrose argument from Gödel's first incompleteness theorem.
Minsky, M. L. and Papert, S. A. 1969. Perceptrons. Cambridge, MA: MIT Press.
But sorry, I don't have citations. It's stuff I've read a long time ago, in random books picked at library shelves by looking at the covers :)
And I'm so impressed by how much better the comments are here than on reddit! Good job HN, you rock.
That is their main problem. All those useless pages are what becomes useful later.
And we find the same kind of attitude everywhere in tech: why read a full RFC when you can assume shit and get it done with a two-paragraph tutorial?
I'll give some high level views and also outline the math topics mentioned, high school through much of a Ph.D. in parts of pure/applied math. I respond in three parts:
I have a good pure/applied math Ph.D. and work in applied math and computing; while I call my work applied math and not artificial intelligence (AI), machine learning (ML), or computer science, it appears from the OP that there is significant overlap between my background and work and what the OP is concerned with.
The Reddit post by the guy in Germany was terrific, although easy to parody as a big feature of old German culture! :-)! That post is maybe a bit over organized.
I've posted too often that the best future for computer science was pure/applied math, e.g., that someone seriously interested in the future of computing should as an undergraduate just major in pure math with some applied math and, essentially, f'get about anything specifically about computer science.

Or, for the essential computer science: write some code in some common procedural programming language for some simple exercises; check out at the library D. Knuth's The Art of Computer Programming, Volume 3, Sorting and Searching; learn about big-O notation, the heap data structure, and heap sort; as an exercise, program, say, a priority queue based on the heap data structure; learn about the Gleason bound and how heap sort achieves it and so is in an important sense the fastest possible sort algorithm; as a side exercise look at AVL trees; and call computer science done for an undergraduate! This is partly a joke, but only partly.
Well, it appears that the OP has started to discover some of why I've said such things about math.
This role for math is just a special case of the old standard situation that, in nearly all fields, the best work mathematizes the field, as in, e.g., mathematical physics. Indeed, there is the old joke that good coverage of the math needed for theoretical physics is so much about just the math that it can do the physics just in the footnotes.

It is a standard situation that nearly everyone in the STEM fields is convinced that they need to know more math. As I read papers by computer science professors, I tend to agree!
Here I'll try to help the person with their lament in the OP:

(1) Start at about 50,000 feet up and begin to identify in what fields, broadly on what problems, you want to work. Remember: one of the keys to success is good, early work in problem selection.
(2) If you want to work in AI, I suggest you try to regard the current headline topics in AI/ML as nearly irrelevant. Sure, for political reasons, you might have to keep up, and maybe there are some current, hot applications, but I don't see that work -- first-cut, ballpark, basically empirical curve fitting to huge quantities of data -- as much of a start on AI.
E.g., at one time I was hired to work in an AI project, specifically expert systems. My first reaction was that expert systems -- rule based programming, working memory and the RETE algorithm, rule firing conflict resolution -- were junk and, in particular, nothing like a good start on AI, programming style, or anything else. After 20 years my opinion has changed: expert systems were worse than junk, as a style of programming, for much of anything, and in particular on anything like AI.
Maybe I got some revenge: we were trying to use expert systems for the monitoring and management of large server farms and networks. For the monitoring, for essentially health and wellness, they were using essentially just thresholds set by hand on single variables, one at a time. My view was: wrap that data and processing in all the rules possible, and still the results won't be very good.
Sure, for monitoring there are two ways to be wrong: (A) a false alarm and (B) a missed detection. So, right, we're forced into statistical hypothesis testing with (i) a null hypothesis that the system is healthy, (ii) a false alarm as Type I error, and (iii) a missed detection as Type II error. Quickly we decide that we need some statistical hypothesis tests, with the null hypothesis that the system is healthy, that are both multidimensional (treat several variables jointly) and distribution-free (make no assumptions about probability distributions), and we find that apparently there were none such in the literature. So, I invented a large class of such tests that, really, totally knocked the socks off expert systems for much of monitoring. And what I did was just some applied math and applied probability -- some group theory, some probability theory based on measure theory, some measure preserving as in ergodic theory, etc.
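For flavor, here is a generic sketch of one way to get a distribution-free, multidimensional test with a preset false alarm rate -- a permutation construction, just to illustrate the genre, not the specific class of tests described above:

    import numpy as np

    def permutation_test(healthy, current, alpha=0.01, n_perm=999, seed=0):
        """Distribution-free test: is `current` (n x d) consistent with
        `healthy` (m x d)?  False alarm rate ~ alpha by construction,
        with no assumptions on the underlying distribution."""
        rng = np.random.default_rng(seed)
        pooled = np.vstack([healthy, current])
        n = len(current)

        def stat(a, b):  # distance between multivariate means
            return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

        observed = stat(healthy, current)
        count = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)  # relabel the pooled data at random
            if stat(pooled[:-n], pooled[-n:]) >= observed:
                count += 1
        p = (count + 1) / (n_perm + 1)
        return p < alpha  # raise an alarm only if p falls below the preset alpha

    # e.g. alarm = permutation_test(healthy_window, current_window)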
So, yes, I could say that my monitoring took in training data, did some machine learning, was better at its monitoring than humans so was some artificial intelligence, was computer science, etc., but I didn't: I just called my work some applied math. E.g., I got real hypothesis tests where the false alarm rate was adjustable and known in advance and some useful best possible results on detection rate. So, as I wrote in the paper, take away the mathematical assumptions, derivations, theorems and proofs, forget about false alarm rate and detection rate, and just do the specified data manipulations -- and call the work AI.
So, such real problems and applied things are one approach to computer science, AI, ML, etc. Then, sure, for such work as in what I did for detection, you need, really, the Ph.D. coursework in pure and applied math and some ability to do publishable theorems and proofs in math -- for both, net, you need a pure/applied math Ph.D.
So, back to AI and mostly setting aside the current headlines: for AI, I'd suggest that from 50,000 feet up you start with a mommy kitty cat (domestic short hair tabby) and 6 or so of her kittens, maybe only 2 days old, and watch how they learn. Their learning is astoundingly fast. E.g., for how they learn to use their hind legs, you can see fairly directly some of just how they do that. In just that one video clip, in real time of likely just a week or so, those kittens go from nearly helpless balls of fur to young kitty cats. Easy to guess that in two months they will be safely 40 feet up a tree catching something or other, effortlessly doing gymnastic feats that would shame Olympic athletes, etc. Astounding.

Okay, for AI from 50,000 feet up, start by trying to guess how those kittens learn. Maybe there will be some math in that, maybe not. Then if you can implement your guess in software, if the software appears from good tests in practice to learn some things well, and if your guess is fairly general, then maybe you have some progress on AI. Here f'get about my view and confirm with some AI profs who would have to review your work, hire you for a research slot, give you a research grant, etc.
So, for some research, you could (i) do some math as I did for that monitoring or (ii) try to do some real AI, e.g., by starting by watching those kittens learn, where maybe, eventually, there will be some math. For both those research directions, notice that you need (i) some overview of the real problem, (ii) some intuitive insights, and (iii) some good, new ideas. Maybe the ideas will be mathematical or use math and maybe not. E.g., you might go a long way on what those kittens are doing before any math gets used.
(3) But for the math as mentioned by the OP, I'll try to give an outline:
(i) Start with the real numbers, e.g., as learned likely well enough by the 9th grade. Then, sure, learn about the complex numbers. So, net, have a high school major in math, e.g., everything short of calculus.
(ii) Learn college calculus well. At least in part, you can do well alone: due to some circumstances beyond my control, for freshman calculus I just got a good book, studied, worked the exercises, and learned. Then at a college with a good math department, I started on sophomore calculus. You can do such stuff yourself.
(iii) It would be good to take a course in abstract algebra, especially one where nearly all the exercises are proofs. So, learn about sets, functions, groups, rings, fields, more about the real and complex numbers, maybe some basic number theory, the greatest common divisor and least common multiple algorithms, and some about vector spaces. It might touch on cryptography and coding theory.

Really the more important, maybe main, value of the course is just learning how to write proofs. The math there is nearly all just so childishly simple that it's easy to learn to write very highly precise proofs -- crucial stuff if later you want to publish theorems and proofs.
Blunt fact of life, politically incorrect observation: without such training in writing theorems and proofs, and, really, just in math notation and how to do math derivations, it's tough ever to learn how. So, you can find chaired professors of computer science at top computer science departments in top US research universities who, however, fumble terribly with just standard math notation and, especially, with how to write theorems and proofs.
Super simple view: in math, there are sets with elements. That's the logical foundation of essentially all of current pure/applied math. The details are in Zermelo-Fraenkel axiomatic set theory assuming the axiom of choice. E.g., one can construct from sets a set that looks like the real numbers we knew about in the 9th grade. Soon define ordered pairs and, then, functions. After that, a huge fraction of everything is functions. The proofs -- as actually written, but without the crucial intuitive ideas that permitted finding the proof -- are all essentially just symbol substitution as in basic logic and Whitehead and Russell. Warning: any math you write for publication should be easily translated back to just sets and symbol substitution; that and nothing else is the criterion. If you also want the proof to be readable by humans, there is more. E.g., in the proof you might mention one by one each theorem assumption and where it gets used.
(iv) Learn linear algebra. Really the subject grew out of Gauss elimination for solving systems of linear equations -- with some additional attention to numerical stability (e.g., partial pivoting, double precision inner product accumulation, and iterative improvement) -- and that's nearly always still the way to do it. It's fun to program a good Gauss elimination routine, e.g., just in C. There you see clearly that any such system of equations has none, one, or infinitely many solutions. Later you will discover that the set of all the solutions is an affine subspace, that is, a vector space plus some one vector; that is, a plane that does not pass through the origin. And you will discover that the left side with the unknowns is a linear function -- big time stuff. So, that start was a small thing for a big subject.
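In that spirit, a bare-bones sketch of such a routine (in Python with numpy rather than C, and with only the partial pivoting -- none of the double precision accumulation or iterative improvement, and no singularity check):

    import numpy as np

    def gauss_solve(A, b):
        """Solve Ax = b by Gauss elimination with partial pivoting."""
        A = A.astype(float).copy(); b = b.astype(float).copy()
        n = len(b)
        for k in range(n - 1):
            p = k + np.argmax(np.abs(A[k:, k]))      # partial pivoting
            A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
            for i in range(k + 1, n):
                m = A[i, k] / A[k, k]                # elimination multiplier
                A[i, k:] -= m * A[k, k:]
                b[i] -= m * b[k]
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):               # back substitution
            x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
        return x

    # e.g. gauss_solve(np.array([[2., 1.], [1., 3.]]), np.array([3., 5.]))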
Next up, we consider n-tuples of real (or complex, here and always in linear algebra) numbers. Then we see how to make the n-tuples a vector space. They are the most important of the vector spaces.

But you should also see the abstract (with no mention of n-tuples) definition of a vector space because (a) the n-tuples are the most important example, (b) even when working with just n-tuples you often need the more abstract definition (especially for subspaces, which are often the real interest, e.g., hyperplanes in curve fitting and multivariate statistics), and (c) there are other important vector spaces that are not just n-tuples (in signal processing, sets of random variables, wave functions in quantum mechanics, solutions to some differential equations, and much more).
So, learn about linear independence and bases (essentially coordinate systems). Learn about inner products, distance, angles, and orthogonality. See generalizations of cosines, e.g., the Schwarz inequality, and the Pythagorean theorem.
Learn about eigenvalues and eigenvectors -- those eigenvectors are often the most important ones in applications, e.g., your favorite coordinate axes.

Then for the crown jewel, learn about the polar decomposition and, thus, singular values, principal components (e.g., data compression), the core of the normal equations in statistics, etc.
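A quick taste of that crown jewel (my own sketch): numpy's SVD of centered data gives the principal components, and truncating the singular values gives data compression.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.3]])

    Xc = X - X.mean(axis=0)            # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rows of Vt are the principal axes; s**2 / (n - 1) are their variances.
    k = 1
    X_compressed = (Xc @ Vt[:k].T) @ Vt[:k] + X.mean(axis=0)  # rank-k approximation
    print(s**2 / (len(X) - 1))         # variance along each principal axis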
There is a remark in G. Simmons that the two pillars of mathematical analysis are linearity and continuity. The superposition in physics is essentially linearity. In applied math, linearity is the main tool, the key to the land of milk and honey. Well, a good course in linear algebra is a good start on linearity. In particular, those linear equations solved with Gauss elimination are linear as in linear transformations in, right, linear algebra.

Then, sure, that version of linearity takes one through much of all of applied math, e.g., Fourier analysis, the fast Fourier transform, X-ray diffraction, Banach space, Hilbert space, the classic Dunford and Schwartz, Linear Operators, etc.
So, a Banach space is just a vector space where the scalars are the real or complex numbers, there is a norm, that is, a definition of distance, and the space is complete in that norm. Complete is what the rational numbers are not but the real numbers are. Or, complete means that a sequence that appears to converge, that is, converges in the weaker sense of Cauchy, actually does have something to converge to and does. E.g., in the rationals, the approximations to more and more decimal places of pi have nothing to converge to, but in the reals they do.
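To make the pi example concrete, a toy sketch with Python's exact rationals: the decimal truncations form a Cauchy sequence in the rationals, but their limit is irrational, so within Q there is nothing for them to converge to.

    from fractions import Fraction

    # Decimal truncations of pi: 3, 31/10, 314/100, ... all rational.
    digits = "31415926535897932384"
    approx = [Fraction(int(digits[:k + 1]), 10**k) for k in range(len(digits))]

    # Cauchy: successive terms get arbitrarily close to each other...
    print([float(approx[k + 1] - approx[k]) for k in range(5)])
    # ...but the limit, pi, is irrational: no element of Q to converge to.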
In theorem proving, it is nice to have completeness. But, sure, computing knows next to nothing about completeness because we compute essentially only with rational numbers. So, one can do a lot of work without completeness. Indeed, in applications, often we are just approximating, and the rationals can get as close as we please to the real answer.
Banach spaces are not trivial or useless: e.g., based on the Hahn-Banach theorem, there is the grand applied math dessert: David G. Luenberger, Optimization by Vector Space Methods, John Wiley and Sons.
A Hilbert space is a Banach space where the norm comes from an inner product. E.g., the set of all real valued random variables X such that E[X^2] is finite forms a Hilbert space. That the space is complete for those random variables is nearly mind blowing; actually the proof takes some work.
(v) There remains Baby Rudin, Principles of Mathematical Analysis. There you see calculus done with essentially full care, as theorems and proofs. So, again, you get lessons in how to write theorems and proofs on the way to being a mathematician.

The main content of the book is just showing that a real valued continuous function defined on a closed interval of the reals has a Riemann integral. The key is that that interval is compact. So, learn about compactness, which is of quite general usefulness. Then with compactness and continuity, you have uniform continuity. Now the doors to grandeur start to open: that the Riemann integral works is a short proof. And, later in the book, you get the three epsilon proof that the uniform limit of continuous functions is a continuous function (it was a question on one of my Ph.D. qualifying exams -- from baby Rudin, I got it!).

Compactness is so powerful that it is nearly the same as just finiteness -- there's a famous, old paper on that.
Well, for a positive integer n and the set of real numbers R, in R^n a set is compact if and only if it is closed and bounded.

Now we are cleaning up the Riemann integral and a lot of associated stuff. At the back, Rudin gives a nice, short definition of a set of the real numbers that has measure zero (without really getting deep into measure theory) and, then, shows the Riemann integral exists if and only if the function is continuous everywhere except on a set of measure zero. Nice. Now an exercise is to find a function that is differentiable but whose derivative is not Riemann integrable. Might look in Gelbaum and Olmstead, Counterexamples in Analysis.
Also the later editions of baby Rudin cover the exterior algebra of differential forms. That material is of interest in differential geometry, some applications, and general relativity.

That's an overview of what baby Rudin is about. Go through that book carefully and you will come out with (a) some good knowledge of the "principles" of the analysis part of pure math and (b) much better skills at doing math derivations, definitions, theorems, and proofs. If you want to write and publish new proofs for, say, AI, baby Rudin is one of your best mentors, maybe your Fairy Godmother?
For being an applied mathematician, sure, you already guessed, and you were correct, that essentially always in practice the Riemann integral exists; so, why sweat the details? Okay, then just use baby Rudin to learn about compactness, continuity, uniform continuity, measure zero, and more on how to write proofs. And, really, focusing just on compactness, continuity, etc., you can pull that off in a few, nice weekends, maybe just one. Then look at that result near the back on the uniform limit of continuous functions being continuous to see how to do such work.

Sure, Rudin discusses compactness, etc. on metric spaces. Well, easily enough, the set of real numbers R is also a metric space! And for positive integer n, so is R^n.
So, why say metric space instead of, say, just R^n? Well, first, the theory is cleaner because a metric space has so many fewer assumptions than R^n, so that you can see more clearly just what assumptions make the results true. Second, maybe some fine day all you will have is just a metric space and, then, you can still use the results -- don't hold your breath while waiting for a significant application, either pure or applied, with a non-trivial metric space that isn't also much more! Or, proving the stuff in a metric space is no more difficult, more general, and maybe, and actually, more useful.

Or, maybe math had the results in R^n and then invented a metric space just to have a place with just enough to make the results true! So, a metric space was invented to have the least assumptions needed for proving the results; so, what came first were the results, and the metric space definition came later.
(vi) Continuing on, there is the subject of measure theory. That was from H. Lebesgue, a student of E. Borel, right, in France near 1900.

They were correct: they improved on the Riemann integral. Don't worry: whenever the Riemann integral exists, the integral from Lebesgue's measure theory gives the same numerical answer.

So, why a new (way of defining the) integral? Two biggie reasons:

(a) For a lot of theorem proving about integrals -- e.g., differentiation under the integral sign, clean treatment of what physics needed from the Dirac delta function (right, there can be no such function, but measure theory has a good answer), definition of an integral of several variables, interchange of order of integration in iterated integrals, tying off some old, loose ends in Fourier theory, the deep connections between integration and linear operators, and more -- Lebesgue's work is crucial, and

(b) Lebesgue's integral is much more general than the Riemann integral, and that generality is crucial, especially as the foundation for probability theory.
Now, for what Lebesgue did:

First, he developed measure theory. That's essentially just a grown up theory of area like you have known about since grade school. E.g., given a set, in the real numbers, R^n, something more complicated, or something fully abstract, the measure of that set is essentially just its area. With the generalization, sure, you can have some sets with measure infinity, negative measure, complex valued measure, etc. But on the real line, with the usual measure, Lebesgue measure, the measure of an interval is just its length, and you already know about that.

But Lebesgue measure is darned general: it's tricky to show that there is a set of the reals that is not Lebesgue measurable, and the usual proof uses the axiom of choice.
So, for measure theory, there is a measure space with three things: there is a space, just some non-empty set, say, M; there is a collection of measurable sets, subsets of M called, say, S; and there is a measure m. Then the collection of measurable sets S satisfies the simple, essentially obvious axioms we would want for area and, thus, is a sigma algebra of sets. Then for each set A in S, its measure is the real (or complex) number m(A). We ask that m have the properties we want for a good theory of area. It's just a grown up version of area. The theory is not trivial; it was tricky to get all the details just right so that we have a good theory of area, that is, so that area works like we want it to.

E.g., for the space, we can have the set of real numbers R. For the measurable sets, we have the intervals and, then, all the other sets needed to have a sigma algebra (a short proof shows that this definition is well defined). Then the Lebesgue measure of an interval is just its length, and the measures of all the other measurable sets are what they, then, have to be (need some theorems and proofs here).
So, for probability theory, a probability space is just a measure space (each point in that space is one experimental trial); a probability, P, is just a positive measure with maximum value 1; an event A is just one of the measurable sets; and the probability of A is just its measure P(A). What we want for probability is already so close to a theory of area that, really, we have little choice but just to follow what Lebesgue did. That's what A. Kolmogorov observed in his 1933 paper.
Second, with the foundation of measure theory, Lebesgue defined the Lebesgue integral.

So, what is being integrated is (usually) a function taking real or complex values. The domain of the function is a measure space. Then for the integral, say, in the case of a real valued function, we partition on the Y axis, that is, in the range of the function instead of in its domain. So, we don't have to do Riemann-like partitions of the domain of the function and, thus, the domain can be much more general.

As the first step, we only integrate functions that are >= 0, and we do that by approximating, right, again with essentially rectangles, only from below the function, not both above and below as for Riemann. Here the domain of the function can be the whole space, e.g., the whole real line. We don't care about either continuity or compactness.

For a function that is both positive and negative, we multiply the negative part by -1 and integrate the two parts separately. If at least one of the two results is not infinity, then we subtract, and that's the integral.
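A crude numerical cartoon of that construction (mine, and only a cartoon -- real measure theory does this with simple functions, not grids): partition the range and sum level height times the measure of the set above each level.

    import numpy as np

    # "Integrate" f >= 0 over [0, 1] by slicing its *range*, Lebesgue style:
    # integral of f ~ sum over levels y of dy * measure({x : f(x) > y}).
    f = lambda x: x**2
    x = np.linspace(0.0, 1.0, 100_001)       # stand-in for the domain
    levels = np.linspace(0.0, 1.0, 1001)
    dy = levels[1] - levels[0]
    integral = sum(dy * np.mean(f(x) > y) for y in levels)
    print(integral)  # ~ 1/3, matching the Riemann answer where both exist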
A random variable is just such a function, and its expectation is just its integral.
Don't feel like the Lone Ranger; not everyone knows this stuff. E.g., from all I've been able to see from quantum mechanics, the wave functions are differentiable. Then I'm told that the wave functions, wondrous, are also continuous. From baby Rudin, of course they are continuous; every differentiable function is continuous! Then I'm told that the wave functions form a Hilbert space. Well, I can see that they can be points in a Hilbert space, but they can't form a Hilbert space because the continuous functions won't be complete.

In elementary probability, it is common, e.g., for finding the expectation of a Gaussian random variable, to integrate over the whole real line. Tilt: baby Rudin defines the Riemann integral only over compact sets. So, one can use an improper integral, and then they want to differentiate under the integral sign. Tilt again -- the needed theorems are not so easy to find. So, really, this common stuff, in probability, physics, etc., of integrating over the whole real line or over all of R^n is using the Lebesgue theory, where we have clean theorems for such integrals.
That's your overview and your work for the weekend. You are now permitted two beers and half a pizza, but only if you have someone you like a lot for the rest of the pizza and some more beer. You can substitute Chianti for the beer. But no more math permitted; maybe a movie, but not math!
I had always heard variations on the first part -- that going to a good school was supposed to humble you by showing you how much you don't actually know.
Never heard the second part. That's great.
"Complex models are rarely useful (unless for those writing their dissertations)." (V.I.Arnold)
The reddit OP is asking about specific resources that help with intuition. In contrast, many math teaching materials work from an analytical and symbolic approach (e.g., axioms and properties). Unfortunately, most people can't truly learn intuition that way, and they end up in mindless "plug & chug" to pass the test.