Confession as an AI researcher; seeking advice (reddit.com)
346 points by subroutine 8 months ago | 98 comments

Being a grad student, this person is at a perfect place to build up their math background. Any school almost certainly offers the following:

1. Convex Optimization -- not all problems are convex, but solutions for nonconvex problems end up primarily using convex methods with slight adaptations.

2. Stochastic Optimization -- ML is pretty much all stochastic optimization. No surprise there.

3. Statistical/Theoretical Machine Learning -- courses built around concentration bounds, PAC learnability, and the Valiant/Vapnik school of thought. This gives you what you need to talk about generalizability and sample complexity.

4. Numerical Linear Algebra -- being smart about linear algebra is most of efficient machine learning. Knowing which kinds of factorizations help you solve problems efficiently. Can you do a Gram-Schmidt factorization? Cholesky decomposition? LU factorization? When do these things fail? When do you benefit from sparse representations?

5. Graphical Models -- Markov chains, Markov fields, causal relationships, HMMs, factor graphs, forward-backward algorithm, sum-product algorithms.
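To make the "when do these things fail?" question from item 4 concrete, here is a minimal pure-Python sketch of a Cholesky factorization. It is an illustration only: the function name `cholesky` and the two test matrices are made up for this example, and a real solver would use a tuned library routine instead.

```python
import math

def cholesky(a):
    """Return lower-triangular L with L * L^T == a.

    Raises ValueError when a pivot is not strictly positive, which is
    how the factorization fails on a matrix that is not positive definite.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Subtract the contribution of the columns already computed.
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

# Symmetric positive definite: the factorization succeeds.
spd = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(spd)

# Symmetric but indefinite (eigenvalues 3 and -1): the factorization fails.
indefinite = [[1.0, 2.0], [2.0, 1.0]]
try:
    cholesky(indefinite)
    failed = False
except ValueError:
    failed = True
```

For the indefinite matrix the second pivot comes out negative, so there is no real square root to take. That is the kind of failure mode a numerical linear algebra course teaches you to anticipate.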

If you're in school, take advantage of the fact that you're in school.

Once you have a grasp on these things (and you'll have to catch up on real analysis, matrix calculus, and a few other fields of math), you'll be able to start reasoning about ways to improve existing methods or come up with your own. I think a lot of it is just developing mathematical maturity to give you a vocabulary to think about things with.

Alright, so I do research in and produce optimization tools professionally. From my biased point of view, someone is better off learning generic optimization theory and algorithms rather than specialized versions like convex or stochastic optimization. Generally speaking, generic nonlinear optimization methods form the foundation for everything and then there's a series of tricks to specialize the algorithm.

Very specifically, there are two books that I think provide a good foundation. First, is Convex Functional Analysis by Kurdila and Zabarankin. Not many people know about this book, but essentially it provides a self-contained background to prove the Generalized Weierstrass Theorem, which details the conditions necessary for the existence and/or uniqueness of a solution to an optimization problem. This is important because even convex problems don't necessarily have a minimum. For example, min exp(-x) doesn't have one, but it does have an infimum. The background necessary to understand this book is real analysis and as a quick aside I think Rudin's Principles of Mathematical Analysis is the best for this. Second, is Nocedal and Wright's Numerical Optimization book. It provides a good overview of the powerful algorithms in optimization that we should be using. Now, its weakness is that it often cheats and uses a stronger constraint qualification than we're afforded in practice. Candidly, I find that the derivative of the constraints will not remain full rank and we will likely violate the LICQ. Further, it covers a number of algorithms that really shouldn't be used in practice, ever. That said, it does cover the good algorithms and it generally has the best presentation out of the other books.
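As a rough illustration of the exp(-x) example, the sketch below runs plain gradient descent on it. The step size of 1.0 and the 50 iterations are arbitrary choices made for this example, not something from either book.

```python
import math

def f(x):
    return math.exp(-x)

def grad_f(x):
    return -math.exp(-x)

x = 0.0
values = []
for _ in range(50):
    # Fixed step size of 1.0 (an arbitrary choice for this sketch).
    x -= 1.0 * grad_f(x)
    values.append(f(x))

# f keeps decreasing toward its infimum 0, but every value stays positive
# and x keeps growing: there is no minimizer for the iteration to reach.
```

The objective is convex and bounded below, yet the iterates drift off to infinity. An existence theorem tells you in advance whether a minimizer exists at all.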

Sadly, I don't know of any killer books for numerical linear algebra. And, yes, I've read cover to cover things like Saad's Iterative Methods for Sparse Linear Systems, Trefethen and Bau's Numerical Linear Algebra, and Golub and van Loan's Matrix Computations. They're valuable and well-written, but don't quite cover what I end up having to do to make linear algebra work on my solvers.

Anyway, this is all biased and opinion, so take what you will. If someone else has some of their favorite references for optimization or numerical linear algebra, I'd love to hear.

A lot of practical difficulties with your proposal.

A CS/AI PhD program is nothing like a Math/AMath/Stat PhD. In the latter, there is zero expectation that you will start producing anything in the first few semesters. In fact, it is explicitly required of you to load up on rigorous core courses, so that you can pass your prelims at the end of 2nd year and become a formal “PhD candidate”. The attrition rate in those programs is about 40% or more, so a lot of these people simply find out they don’t have the math maturity, drop out with a Masters at the end of 2 years, get a job and call the whole thing off.

So in the latter case, yes, such a person can follow your guidelines. In fact, most of the material you listed is called AMC (applied math core) or CACM or other abbrev...and is already taught as part of core.

The former case is where this particular student is. CS PhD programs are a sort of weird beast in the US. They are housed in eng, not liberal arts. The expectations are to produce papers right out of the gate. At least lightweight papers, posters, something...you cannot coast for 2 years saying you are learning convex math & stoc calc. So if you read the material that comes out of students in that phase, it tends to be of low quality and heavy on empirical evidence (I tested 7 functions on 3 datasets and these 2 came first, here are the charts and graphs). As the student matures, his papers gather more heft and by the time he graduates the final 2-3 papers will be very good...at least that’s the expectation. Reality again is quite bleak and results are all over the place. Because of the hectic hiring climate, lots of CS grads will just take an ABD and get the hell out. 150k starting plus RSUs is nothing to sneeze at. The ones who do finish tend to take a full 4+ years and are in the teens % of incoming cohort :(

Also, you are not really permitted to sign up for whatever you want just because you have math deficiencies. Your advisor will have to sign off on each sem load. He has to ensure you are on track, not just following your own whimsical path into theoretical math because you fancy it. CS core is quite distinct from AMath and touches upon the material you mentioned very superficially. All in all, this student is between a rock and a hard place. It's not going to be easy for him at all if he truly wants to understand everything. Best bet is to do what the top reddit advice is - pick some narrow corner where you are comfy, write 2-3 papers in that corner, get the hell out and learn the rest later on your own time in your research career.

I can only speak from experience in my CS PhD program, but my above recommendations are based directly on that experience.

We have an exam after the first two years and a similar process. It depends on what your advisor expects/wants. Mine has been flexible. And I focused more on efficient software engineering and applications for prior methods than on new research as I got up to speed on other matters. It made obvious a lot of ways I could improve them, just by being forced to look at and implement all of the details under the hood.

And, at least in my CS department, there is a very heavy emphasis on mathematics with the ML/AI folks. They coauthor a number of papers with the applied math department and the rest of their papers are mostly proofs. They'll usually back it up with proof-of-concept implementations, but in that regard, they're very much like researchers in applied math except that they use Python instead of MATLAB.

I think this is way too much for a pure CS person. It is not likely they will make a big contribution on the math side without being a mathematician first. E.g. an applied mathematician to CS.

For ML, the OP already has linear algebra which is sufficient. Deep neural networks are backprop, which is basically high school math. You could have mentioned ODEs, sensitivity analysis which I think are more relevant than convex optimization. For NNs we don't even care about identifiability from either the statistics or the dynamical systems point of view. NNs blow away SVMs and almost everything except for random forests in some domains. Both of these have this interesting property that nobody understands them except in terms of black boxes for the most part. Boosting is another example. It really is stranger than fiction.
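For readers who want to see what the backprop claim amounts to, here is a minimal sketch for a single sigmoid neuron with squared-error loss. The gradient is just the chain rule, checked against a finite-difference estimate. The function names, the input values and the step `eps` are illustrative assumptions, and a real network repeats the same pattern layer by layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error of one sigmoid neuron on one example."""
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

def backprop(w, b, x, y):
    # Forward pass, keeping the intermediate values.
    z = w * x + b
    a = sigmoid(z)
    # Backward pass: one chain-rule factor per forward step.
    dL_da = a - y
    da_dz = a * (1.0 - a)
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz  # (dL/dw, dL/db)

def numeric_grad(f, v, eps=1e-6):
    """Central finite-difference estimate of df/dv."""
    return (f(v + eps) - f(v - eps)) / (2.0 * eps)

w, b, x, y = 0.5, -0.2, 1.5, 1.0
dw, db = backprop(w, b, x, y)
dw_num = numeric_grad(lambda v: loss(v, b, x, y), w)
db_num = numeric_grad(lambda v: loss(w, v, x, y), b)
```

The analytic and numeric gradients should agree to many decimal places. Nothing beyond the chain rule and the derivative of the sigmoid is involved.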

That being said, I think statistics/probability theory and Bayesian stats/networks are useful to know for any scientist.

I would talk to your advisor about what to do. They will be able to advise on what's important and what to learn/focus on.

> Boosting is another example. It really is stranger than fiction.

Is this true? Boosting is pretty well formulated in the PAC framework and the classical algorithms (e.g. Adaboost) are well-characterized.

You're correct. Boosting was directly formulated in the PAC framework.

(Source: http://l2r.cs.uiuc.edu/Teaching/CS446-17/LectureNotesNew/boo... "The original boosting algorithm was proposed as an answer to a theoretical question in PAC learning [The Strength of Weak Learnability; Schapire, 89.]")

It took a while, but there's been a lot of work lately explaining neural nets' performance over the last 5 years or so, from papers showing PAC learnability for specific architectures (https://arxiv.org/abs/1710.10174) to work saying that most local optima are close to global optima (http://www.offconvex.org/2016/03/22/saddlepoints/), to work saying that the optimization error incurred (as separate from approximation and estimation errors) serves as a form of regularization for deep neural networks.

And understanding how these things work helps improve and speed up these methods and models: it's hybrid algorithms which are enabling performance in time-series data and more complex tasks. The future will nearly certainly use neural networks as part of many algorithms, but I doubt that the full machinery will be simple feed-forward nets of ever-increasing sizes.

This would’ve literally been my answer. Brilliant. I can’t recommend this answer enough.

I finished my masters, but one problem is that this is over half the courses in a 10 course masters program, which generally only allows for 4 total electives. As a CS grad, I felt handcuffed by required core courses and other 'select 2/4 of these' requirements, which took up 6 courses in the major. The remaining 4 courses were electives, and 3 were allowed outside of CIS. Those courses were mostly offered in the same semester, and would also have conflicts with required core courses.

Another problem is that most data scientist positions are filled by statisticians who will be giving you the job interview. Almost all of the questions will be around stats. I personally feel a mastery of those courses would be great, but they would also not help me land a job because improving LDA to run on small text input by using a variational autoencoder doesn't help me recite the formula for a t-test.

What courses (starting from pre-calculus) should one take to do what you listed above? I want to match your recommendations to course titles starting with pre-calculus. List book recommendations as well if you would. Thanks!

I guess baby Rudin (or Hubbard & Hubbard for something simpler) in the analysis department; and Halmos (or Axler) in the linear algebra department.

This is, essentially, Math 55. All 4 books have been used at different stages in this famous course.

Halmos seems to discuss the same things as Hoffman and Kunze, which is the more “standard” and recommended book. Nevertheless after these you will still have to read up on multilinear algebra (tensors and determinant-like functions) as well as stuff on the numerical side of linear algebra.

Convex: Bertsekas - Convex Optimization Theory, Convex Optimization Algorithms. Nesterov - Lecture Notes (http://citeseerx.ist.psu.edu/viewdoc/download?doi=

Statistical/Theoretical: Shai Shalev-Shwartz & Shai Ben-David's Understanding Machine Learning (http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning...) Mohri's Foundations of Machine Learning (https://cs.nyu.edu/~mohri/mlbook/)

The two above courses could share SSS's Online Learning text (https://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf). To be fair, the stochastic variants of most optimization algorithms can be learned reasonably quickly off of a statistical machine learning/basic optimization background. There's the option of Spall's Intro to Stochastic Search and Optimization, which covers neural networks, reinforcement learning, annealing, MCMC, and a wide variety of other applications and techniques. (http://www.jhuapl.edu/ISSO/)

Similar to what kxyvr said, I also don't know of any killer linear algebra text, which is why I think a course is so useful. The matrix cookbook is helpful along the way. kxyvr is also entirely right that general nonlinear optimization is important -- though perhaps less indispensable. (Going the other way, the Bertsimas linear optimization textbook I've had for years mostly gathers dust.)

For PGMs: I got Predicting Structured Data back when it was new (https://mitpress.mit.edu/books/predicting-structured-data), but I think that Chris Bishop's treatment in PRML is easier to follow. He has some lecture slides which expand on it quite well. (https://www.microsoft.com/en-us/research/people/cmbishop/)

Bishop would also be my go-to intro ML book over Murphy.

I can't in fairness offer recommendations for the rest of the intermediate undergraduate math texts because I took them so long ago, but I can say that I have benefited from reviewing the MIT OCW courses from time to time.

Great, thanks!

Learning all the advanced math within a few years is a hopeless endeavour. It would take decades of hard work because there is so much out there, and you need to know all of it if you want to make progress (e.g. pull a fancy-name theorem out of nowhere to solve some practical problem).

I find a better approach is to focus on a few basic ideas you need specifically for your work and digging deep in there. Nobody can be expert-level in everything, but you can be expert level in your specific domain of research.

Also for ML stuff, it's hard to overemphasize the importance of understanding linear algebra really well. Here is an excerpt of a book I wrote on LA which should get you started on learning this fascinating swiss-army-knife-like subject: https://minireference.com/static/excerpts/noBSguide2LA_previ...

What you say is correct, but that reddit thread concerns (I think) a narrow class of devs of which there are many here: US CS majors who graduate with only a few semesters of calc and discrete and are underequipped to bootstrap themselves if they need to understand, say, topology. But they have to gain some understanding, so i put a few book lists in that reddit thread, took about ... 5 minutes.

BTW I really like what i've seen of your guidebooks, or whatever you call them (neo-textbooks?).

"Learning all the advanced math within a few years is a hopeless endeavour."

Not in my experience. It's possible to get the equivalent of a bachelors and masters in math within two years (which is enough to overcome the issues listed in the post), but it's all you'll be doing for that period of time. Well worth it imo.

> Not in my experience. It's possible to get the equivalent of a bachelors and masters in math within two year


In fact, most people who learn this math learn it in a period of 2-3 years. Far from being impossible, learning this math in a few years is normal. It's not even a full time job. Most people learn all this math while also doing other classes and school stuff. Even a very dedicated math major probably only spends 20-25 hours a week actually studying math. I'm not sure much more than that is sustainable for most people anyway.

Now, I'll grant, this is going to be a lot harder to do without the structure of well thought out syllabi and lectures, but it's certainly manageable.

If you're already a grad student you can usually take any undergrad course at your institution for free. It will slow down your progress on your graduate degree, but you might as well do it right if it's what you really want.

If you want to learn decent undergrad math you really don't even need that much. 1yr analysis, 1yr algebra, topology. Then maybe a combination of a more in-depth linear algebra course, an optimization course, a probability course, a stats course. You could probably do all that and a few more electives in one year full time.

From what I've heard from some foreign students, the 400 level American undergrad courses are what they are expected to learn as freshmen.

> It's possible to get the equivalent of a bachelors and masters in math within two years (which is enough to overcome the issues listed in the post), but it's all you'll be doing for that period of time.

About that - I don't think OP (or their research group) can afford to hold off publishing and attending ML conferences for 2 whole years while brushing up on math.

The ratio of explore/exploit depends on what your goals are. If it's just to bang out a PhD in applied ML then perhaps, but if you're in it for the long haul it's well worth it. It would unlock a whole bunch of research directions and catching back up to the field with solid math skills in hand would be quick. If you pay attention to the leaders in AI, they're mostly applied mathematicians in disguise.

You don’t need to understand everything in every field and I doubt very many people do.

When Einstein was working on general relativity, he had a lot of help from friends and colleagues who pointed him towards the math he needed. He didn’t learn differential geometry until he was already deep into general relativity.

Find a level of abstraction that you’re comfortable with and learn to be okay with black boxes at the lower level, and only dig into those boxes when what’s inside them actually matters.

I think an important lesson for any grad student is to learn to read through the bullshit in papers and try and understand what the authors actually did.

It helps a lot that in CS you can often see the code that the authors published along with the paper. Just staring at formulae doesn't mean much, because for all you know the author just hammed up the equations to get their paper into a top conference. That's not to say that the equations are excessive, or the authors are being misleading, but I think there is definitely an expectation in some fields that putting equations in makes your paper look clever even if they're broadly unnecessary.

It's also wildly different depending on the field. If you look at variational methods in computer vision, images are [continuous] mappings from some domain onto the reals (I : Ω → R^3 for colour). Does that change the fact that an image in memory is just a bunch of numbers in a grid? Not really, but it's bloody confusing the first time you see it.

This doesn't help with understanding the maths, but at some point you have to give up and say "This guy proved it, and someone else peer reviewed it, so I can use it to solve my problem". It's perfectly OK to stand on other people's work and still make creative contributions to your field, that's the point of research.

>I think an important lesson for any grad student is to learn to read through the bullshit in papers and try and understand what the authors actually did.

We actively work to make our writing hard to understand in this field. I do this all the time myself. I don't really need this complex looking equation to make my point. But if I don't have it in there a reviewer will think my writing is not academic enough. So there you have it. Once you go in realizing this is the case everywhere, it becomes a lot easier to understand academic papers.

I get your point but I wish this wasn't the case with most research. I, like the author, am not a math guy but have been reading tons of ML papers recently. I usually skip the formal definition parts and get to the 'juicy' implementation parts.

I wish there was a ELI5 section in each paper.

What have been your favorite papers so far?

That's hard as I haven't read too many. The recent DeepMind papers (the ones about imagination) were good. The papers were pretty standard but they came along with an explanatory blog post [1] and some videos covered them too [2][3][4]. This supplementary content is what made them accessible for me.

[1] https://deepmind.com/blog/agents-imagine-and-plan/ [2] https://www.youtube.com/watch?v=xp-YOPcjkFw [3] https://www.youtube.com/watch?v=agXIYMCICcc [4] https://www.youtube.com/watch?v=56GW1IlWgMg

> When Einstein was working on general relativity, he had a lot of help from friends and colleagues

As well as his lover, later wife, lest we forget:


Are you sure? I thought his first wife helped him only at the beginning, the GP talked about GR which came later.

Quite possibly. I suppose I read "when Einstein was working he got help with the math" - so "he also got help (from his wife and others) when working on special relativity" might be more accurate.

In particular, Einstein worked with Marcel Grossmann: https://en.wikipedia.org/wiki/Marcel_Grossmann who did most of the calculating and verifying of intuitive ideas that Einstein came up with.

How does one expect to do ‘AI research’ without much of a background in math? Machine learning is pretty much all math.

Researchers are generally expected to be experts in their field. The people writing the papers on arxiv likely spent most of their lives learning about machine learning and mathematics.

Unfortunately, there’s not an easy path to become an expert. One just has to dig in and learn from the ground up.

Edit: The good news is that it’s never been as easy to learn math as it is now. When I was an undergrad in math, there were almost no resources available to learn the intuitions behind the math. One just had to keep doing proofs and exercises over and over and hope that it would ‘click’ at some point. But, sometimes that wouldn’t happen until many years later. Nowadays, one can watch YouTube videos where experts describe the intuition behind the math. It’s awesome.

Absolutely you can do 'AI Research' without a degree in maths. Sure you need a grounding in linear algebra, stats, probability and calculus, but not much more than a CS or physics degree will teach you. That stuff is indeed learned easily, and it's not what the Reddit user is worried about.

That also ignores applications of machine learning, which is also a massive (and lucrative) field. But because it's a trendy field, I think there is an obsession with people needing to understand everything theoretical that comes out for fear of missing the boat.

Some of the really interesting papers that have come out over the last few years - for example artistic style transfer [1] and Faster R-CNN [2] - have hardly any maths in. You can count the equations on one hand in both those papers. No doubt the authors know their stuff, but how readable are those papers compared to e.g. a 100-page proof? Which did I learn more from?

They're a combination of two things: intuitive network architecture and a clever loss function. The first thing is a combination of intuition and programming, the second involves a little maths, but nothing outrageous.

[1] https://arxiv.org/abs/1508.06576 [2] https://arxiv.org/abs/1506.01497

You're right - I see now when he says 'AI Research' he really means doing applied ML. And, that can certainly be done without knowing everything.

If the goal is to make a lot of money doing applied ML, then become a consultant and aim to know 10% more than the customers. If the goal is to create models that are relatively effective, then read tutorials, play with data, experiment, and iterate. But, if the goal is to create very effective models and be able to actually explain why they work (which I think is what many companies want), then one has to understand the math.

That is, showing some interesting relationships, trends, predictions, or inferences on a data analysis portal or consumer web site is one thing. But, using ML to dispense medication, regulate a medical device, drive a power plant, or identify criminal suspects - those may require different skills.

(BTW I don't mean to disparage the middle group, as that's largely what I do. But, luckily I have people in the latter group who can validate what I'm doing.)

Not a degree, no, but you should have depth in the math techniques you are using to solve your problem. You should also develop a high level abstracted understanding of math to have the intuition that your techniques are likely the right ones.

This approach will help make incremental improvements. You might even get lucky and hit on something that cites really well.

Actually I think it's sometimes harmful to take the maths too seriously. There are three parts to the ideal paper:

1. Describe a new technique; 2. Show that it works; 3. Explain why it works.

Understanding why things work is easily the hardest thing. This is where the most maths gets deployed...But often people are reaching for the fancier maths when they can't find a simpler intuition behind the idea. You can also use fancier analysis to substitute for less impressive empirical results. These explanations might convince reviewers, but that doesn't make them any more likely to be correct.

I find it effective to take a very "computer's eye view" of things. Instead of thinking primarily about the formalisation, I mostly think about what's being computed. What sort of information is flowing around, during both the prediction and the updates? What dynamics emerge?

> Machine learning is pretty much all math.

Eh, I would say math is the language we use to communicate to computers presently, but the underlying _concepts_ don't require "more maths" and can often be grasped through an intuitive approach. For example, almost everyone who uses photoshop understands the underlying concept of a convolution (e.g. Gaussian blur), even if they don't know the mathematics that can be used to describe the operation. Yes, there are difficult notions that formalization or generalization assist with--perhaps it is better to see math as augmenting the initial intuition, rather than driving the intuition?

It's not always helpful even when someone gives you a correct direction but none of his abstracted vision of how to apply that direction.

For example: convolution is not _just_ Gaussian blur; it allows someone to find an object on the scene (or a shape of something in time dimension). How is that related to Gaussian blur and why are they the same? It takes time to understand the full domain of the concept.
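A small sketch of why the two uses are the same operation, assuming a 1-D signal and a hand-rolled `convolve` helper (all names and numbers below are made up for illustration). The same function blurs when given a Gaussian kernel and scores matches when given a reversed template.

```python
import math

def convolve(signal, kernel):
    """Full discrete 1-D convolution: out[n] = sum_k signal[k] * kernel[n - k]."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def gaussian_kernel(radius, sigma):
    """Sampled Gaussian weights, normalized to sum to 1."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Use 1: blur. A single spike gets spread out, but its total mass is kept.
spike = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
blurred = convolve(spike, gaussian_kernel(radius=2, sigma=1.0))

# Use 2: detection. Convolving with the reversed template is
# cross-correlation, which scores how well the template lines up at each shift.
template = [1.0, -1.0, 2.0]
scene = [0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 2.0, 0.0, 0.0]
scores = convolve(scene, template[::-1])
best = max(range(len(scores)), key=scores.__getitem__)
match_start = best - (len(template) - 1)  # index in scene where the match begins
```

In this toy scene the highest score lands exactly where the template sits. On real images a bright region can outscore a true match, which is why practical detectors normalize the correlation. That caveat is an addition here, not something from the comment above.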

To understand the underlying principles, but not to use it. In this case, it's a point and click -- if they like the effect, keep it. Literally no maths are required for understanding, yet the outcome is achieved.

> Unfortunately, there’s not an easy path to become an expert. One just has to dig in and learn from the ground up.

Indeed, there's no royal road to geometry!

I have no idea why you were downvoted; this was the exact quote that came to my mind when I read the text you quoted, as it sums up perfectly the principle that knowledge of math just doesn't come passively. Some context from Wikipedia about the quote: 'Euclid is said to have replied to King Ptolemy's request for an easier way of learning mathematics that "there is no Royal Road to geometry," according to Proclus' [1].

[1]: https://en.wikipedia.org/wiki/Royal_Road#A_metaphorical_.E2....

I guess I cannot rely on a widespread knowledge of the history of mathematics :(

Still, I thought this line was famous "enough".

"“I understood nothing, but it was really fascinating,” he said. So Scholze worked backward, figuring out what he needed to learn to make sense of the proof. “To this day, that’s to a large extent how I learn,” he said. “I never really learned the basic things like linear algebra, actually — I only assimilated it through learning some other stuff.”"


I've long wanted a series of interactive math ebooks that work that way. Each would take one interesting theorem, such as the prime number theorem, and work backward.

When you start the book, it would give the theorem and proof at a level that would be used in a research journal. For each step of the proof, you would have two options for getting more detail.

The first option would be at the same level, but less terse. E.g., if the proof said something like "A implies B", asking for more detail might change that to "A implies B by the Soandso theorem". Asking for more detail there might elaborate on how you use the Soandso theorem with A.

The second expansion option gives you the background to understand what is going on. In the above example, doing this kind of expansion on the Soandso theorem would explain that theorem and how to prove it.

Both types of expansion can be applied to the results of either type of expansion. In particular, you can use the second type to go all the way down to high school mathematics.

If you started with just high school math, and used one of these books, you would get the basics...but only those parts of the basics you need to understand the starting theorem.

Pick a different starting theorem, and you get a different subset of the basics. It should be possible to pick a set of theorems to treat this way that together end up covering most of the basics.

That might be a more engaging way to teach mathematics, because you are always working directly toward some interesting theorem.

Yes, you and absolutely everyone else in the world that loves math, didn't have time to get a phd and isn't elitist wants this.

Sadly, the monetization of this is tricky. It probably has to be an open source effort. It needs some visionary like Wales or Khan, but they are very, very rare.

It's a great idea and I think it's much bigger than maths. If you do not already know about it, searching around what a "Dynabook" is cannot be a waste of time.

You may be interested in this kind of laying out a proof: https://lamport.azurewebsites.net/pubs/proof.pdf

Yeah. Reading the post I see a guy overwhelmed by a bunch of equations and numbers. Which isn't to say he shouldn't learn them, but math is always far more intimidating when you don't understand it than other subjects.

> "intimidating"



There is a point where one starts to see "behind" the symbols. It's a strange sensation, as if one could understand the ideas in a non-verbal way. The symbols become optional. Intimidation crawls back before curiosity at this point.

An amazing book on the subject is:

  "Hadamard - The psychology of invention in the mathematical field"

What took me a long time -- and is still a skill I'm developing -- is to both verify and "read" the math at the same time, to see the proof and the story at the same time.

At one level, you're observing a technical construction and trying to ensure that it's (mostly) sound; but at another level, you're trying to understand the broader picture of how it fits in, what the builder was trying to accomplish or what perspective of the world they're trying to share.

Mathematics is -- like any language -- just the articulation of an experience, of an insight, of an understanding. As you get further into mathematics (and possess more technical skills of your own), it becomes more important to see "Oh, he's trying to apply the machinery of homotopy to type theories as a means of discussing equivalence" than it is to get bogged down in the technical details. Often, the details are wrong in the first draft, but in a fixable way. (This is extremely common in major proofs.)

> There is a point where one starts to see "behind" the symbols. It's a strange sensation, as if one could understand the ideas in a non-verbal way

I think at some point, you have to compile mathematics to non-verbal ideas for computational reasons -- your verbal processing skills are simply too slow and too simple compared to other systems. Your visual and motor systems are way more powerful and (in the case of motor systems) operate in high dimensions. Much like GPUs in computers, if you can find a representation of a problem that works on a specialized system, you can often get a big computational boost; in mathematics, we have to push our understanding of self and experience to the limits to find more efficient representations of ideas, so we can operate on more interesting or complex ones.

I think most mathematicians work in extremely personal, non-portable internal representations, and then use the symbols as a way to create an external representation that the other mathematicians can compile into their own internal representations.

If you see mathematics as extremely high level code meant to be compiled to equivalent internal representations on thousands of slightly different compilers, I think the language starts to make more sense -- it's meant to be a reverse compilation target for machine code that's been under revision for ~3000 years, so of course it looks a little funky.


I will say this --

One thing I've noticed as I've gotten older is that we do a really poor job of teaching students the story of mathematics -- the human motivations, the community, the long standing projects (some have gone on for hundreds of years; some are still ongoing).

I sincerely believe that for young kids (less than, say 10), it would be better for their development to teach skills 4 days a week and simply tell them part of the story on the 5th. It would make mathematics much more relatable and understandable.

> Ed: ...

A few people have thought about this very idea. You may take a look at:


I liked but didn't love mathematics in high school and as such I just did what I had to do and moved on. A decade later I worked through a CS degree and gravitated towards books about mathematicians and now I have a deep fascination with mathematics and I wish I read these books when I was in high school!

A survey of how mathematicians think about mathematics [citation needed] found 80% visually, 15% kinesthetically, and 5% symbolically (i.e. in terms of notation).

> math is always far more intimidating when you don't understand it than other subjects.

In a way it is like a magic trick. Frustrating when you don't know how it works, but when you find out it's like: oh was that all there's to it? However, unlike a magic trick, math leaves you with something that can be actually useful.

And once you understand it you can't see how you didn't understand it before.

Hmm that's very interesting. I just don't understand how he made it through university. When I was enrolled in CS I somewhat got along with Algebra and was completely lost when it came to Analysis and so I dropped out. Back then I was working so hard at my courses I felt that I simply had no time to even consider "other stuff". I would like to know how it was obvious to him what he had to do.

> Scholze started teaching himself college-level mathematics at

> the age of 14

And also that a few people have exceptional intellectual abilities, built-in.

I think this is a fairly common first-year grad student emotional response. And quite frankly, it is the job of your mentor and department to ensure you receive sufficient training for an academic or research career.

Modern AI is evolving rapidly, but there is a foundation upon which everyone draws. The Sutton and Barto book is one such foundational text.

Find a collaborator in the Math department to work with. And participate daily in Stack Exchange forums for math and stats, such as Cross Validated.

I can also recommend CASI by Efron and Hastie. Deep historical understanding of where we are today in probabilistic inference.


The explanation by TillWinter about systematic task assessment is quite impressive


I always had the same feeling. I'm not bad at math in general (I'm not well versed in high level maths either), but as a developer, trying to jump in the ML field seems really impossible. One would think that he could teach himself ML algorithms, but you ALWAYS end up reading math notation instead of pseudo code.

To be honest, the 3blue1brown videos seem really wonderful at explaining what is going on without going too deep, whereas the math in ML lectures seems to be trying to prove everything and is always taught using math notation.

I guess this is happening because most of ML is coming from research since it's all new, so it's being taught mostly by people who can grok the math, meaning mathematicians; it's not taught by people who are programmers. This really shows how much math should keep being math, and not leak into fields where practice matters more. Programming languages and pseudo code are not for nothing. Computers don't talk math.

So as years go by, ML will be taught more as a practice subject rather than a theory one, and things will get better. I think it's just a matter of how it's being taught, because reading code will always make more sense than reading high level math. Videos and oral explanations also help a lot.

You might appreciate the Deep Learning for Coders courses from Fast.ai. It's basically ML as a practice subject rather than theory as you suggested.

I felt similar to you when I first started learning ML but their code first approach really helped it click for me on an intuitive level. Then you can go back and dig into the maths behind it.

Currently getting my Masters in AI. I'll be honest, I can understand the concepts when presented to me, but the mathematical proofs are beyond me. I've learned to be ok with that. There's just not enough time in a 2 year program to teach myself the underlying vagaries of everything I encounter.

I think he should just go on with his research and don’t bother with understanding every obscure reference in papers. One can grasp the core ideas surprisingly well even when skipping over proofs and formulas.

And when the time comes to write his own papers, he should remember to intentionally make it harder to read for outsiders. E.g. instead of writing “I calculated the total error by summing the per-neuron errors”, one should write “the loss function utilized an integral over the output lattice using a discretized method by Newton et al.”, or some other bullshit.

As an amateur who has jumped in and out of learning basic ML over the years, it has been interesting to see the web of terminology expand to the point where your post is no longer satire. Writing a dictionary or annotating papers to decipher ML-speak to basic-math speak would be a pretty worthwhile endeavor for someone (I see glimmers of this in the work being done by the folks at fast.ai) and in general would probably not remove much real information.

Sounds like a need for an “imposter's handbook” for Mathematics, just like there is for Computer Science: https://bigmachine.io/products/the-imposters-handbook

A little off-topic, but I hope this is another reminder that despite the huge success/hype these days, deep learning/machine learning are just more tools for solving problems, and you can't go really far if you just treat them like a magic framework (i.e. import tensorflow as tf) and fail to understand the underlying principles.

When I left school, I found myself in a similar boat, and decided to set a goal of getting myself the knowledge equivalent of an undergrad degree in math. I already had a physics degree under my belt, so it wasn't as long of a path as it might be for others, but over a bunch of years of self study it paid off. When it comes to your career a few years is nothing. The strong foundation pays regular dividends because learning things that use it comes so much more quickly.

It's a huge, slow, painful investment, but totally doable and with tremendous ROI if you want to work with stats/ML/optimization/really any numerical computing for a living.

The reason I recommend this route is that most of the more advanced math books you will encounter will assume this stuff as the readers' common knowledge. Having that foundation, the majority of the literature is already tailored to you!

A technique I have found that works well is to read a lot of papers and their citations, but don't dive deep. Each paper usually provides some easy to grasp insight (far too little per paper, but that is elitism for you) that you can use to get a good picture of the field. Reread papers to grasp more insight. Once you have a good overall picture, find some area/problem that really interests you, where you like the math and that isn't covered well, and bone up on the math techniques. Do your research and present.

Showing up at thesis defenses is good too. Learn a lot from the back and forth with advisors.

The key is to understand at a high level why the different math techniques are being used without actually understanding all the details. This won't be sufficient for your own work, but at least you'll have a good idea how your part fits in the scheme of things.

That person approaches the issue as his personal problem. However, there are likely many other students around him with the same problem. It is a problem of the whole field recently.

One solution would be simply to arrange a local seminar, and understand a couple of papers in full detail. It would help to invite a couple of mathematically aware students, from mathematics, physics, or the part of the CS faculty where they prove stuff. They should be able to explain and answer questions immediately, which is way more effective than reading whole books or taking courses. Those can be read for details later.

If the papers for the seminar are deep learning papers, part of the outcome of the seminar is likely to be an understanding that the authors of these papers do not necessarily understand the mathematics themselves.

This all does bring up a good point. Why not create a publicly accessible 'math tree' that people can use to learn about any kind of math? If there is a symbol or step they don't understand, they should be able to follow it all the way down to basic counting.

Based on your post, I highly recommend taking a year or two off to focus on math only. You can get the equivalent of a bachelors and masters in pure math in just two years (if that's all you're doing with your time), and it would be enough to fix all the issues you're experiencing. Just take the pure math courses instead of computational math, as abstract and difficult as possible, it will generalize much better :).

I got into math for exactly the same reasons while doing research in computer vision as an undergrad, and taking the requisite time off to learn advanced math (actually going overboard on it) has been an incredible boon to my AI career.

I think the real issue is that the OP has a mistaken expectation that they should understand everything. For instance, the group that wrote the Wasserstein GAN paper are surely those that think night and day about distance metrics. And they might be totally lost reading a paper about some energy based method that relies on concepts from physics.

The point is that researchers have their little niche and they try to make contributions in areas adjacent to it. It's unrealistic to think everyone publishing papers understands all the other papers, particularly in such a cross-disciplinary field as ML. There's also a big gap between a researcher deep in their career and a student fresh out of a masters program.

It's also hard to transition from someone who's used to reading and understanding textbooks to someone who's often reading really technical research and understanding very little of it at first. You just have to push through and have confidence that you'll eventually learn enough to make a contribution. That's what it means to "become an expert"--you start off as not being an expert and then beat your head against the wall for a few years until you bootstrap your way out of it. And if you want to do it in a reasonable amount of time, you should probably choose something you have some of the fundamentals for.

From one of the comments:

>"Professional heavy math people are those who said in the 60s that the perceptron's limitations proved all AI was impossible. And in the 90s that one hidden layer was all you needed, deep learning was useless."

Can anyone provide the citations for this? I was aware of the latter but not the first one. You can find people still repeating the one layer stuff up to a few years ago just by reading stackexchange.

I wonder if the author is referring to the oft-repeated claim that Minsky's and Papert's proof that a perceptron cannot learn the Xor function had a chilling effect on research into neural networks generally, even though Minsky and Papert themselves had shown that multi-layer networks were capable of doing so [1][2].

I realize that even this alleged misunderstanding is not the same as a claim that AI is impossible. The closest attempt at a mathematical proof of the impossibility of AI that I am aware of is the Lucas-Penrose argument from Gödel's first incompleteness theorem [3].

[1] https://en.wikipedia.org/wiki/Perceptrons_(book)

[2] Minsky M. L. and Papert S. A. 1969. Perceptrons. Cambridge, MA: MIT Press.

[3] http://www.iep.utm.edu/lp-argue/

Thanks, the first ref looks like it may be the one.

Besides, heavy math people were also those who proved that recurrent multi-level neural networks were Turing complete and pushed the field up from the ground.

But sorry, I don't have citations. It's stuff I've read a long time ago, in random books picked at library shelves by looking at the covers :)

Do you really not need familiarity with the relevant math to be admitted to AI doctoral programs? I wouldn't have thought that was the case.

How does one get into an ML PhD like this? I was under the impression it was impossible if you’re not a math double major.

I am a math double major doing a research masters in ML/CV right now, and I know plenty of pure CS majors who are doing just fine. The math that 95% of ML scientists use is not that hard to grasp. Sure, when they encounter functional analysis stuff, they start to cry inside a little, but that doesn't happen very often.

Interesting. I’m interested in a related MS/PhD with a similar math background as OP and assumed I was disqualified.

As Alan Kay noted, the right point of view can add 80 IQ. I was in a quantitatively heavy field and always felt outclassed by those with strong physics and maths backgrounds. Nevertheless I published two papers in Nature journals and overturned about 10 years of high profile research, not because I was smarter but because I spent more time trying to find the right perspective, and when I found anomalies, instead of brushing over them, confident in my own intelligence, I drilled down until I found the root of the problem, something that everyone else had overlooked. You don’t need to be a classical genius to make a contribution but you probably do have to be tenacious.

One thing that hasn't been mentioned: learning mathematics from talking to another human can be 10-100 times faster than getting it from books. Another thing: mathematics is huge and seems to accommodate all personality types. Pick something that turns you on and grow outwards from there. The folks on reddit seem to be obsessed with Rudin, and that's good stuff, but there's so many other roads to follow.

And I'm so impressed by how much better the comments are here than on reddit! Good job HN, you rock.

> the “utility density” of reading those 1000-page textbooks is very low. A lot of pages are not relevant, but I don’t have an efficient way to sift them out. I understand that some knowledge might be useful some day, but the reward is too sparse to justify my attention budget. The vicious cycle kicks in again.

That is their main problem. All those useless pages are what becomes useful later.

And we find the same kind of attitude everywhere in tech: why read a full RFC when you can assume shit and get done with a 2 paragraphs tutorial?

I'll try to reply to the frustrations of the author of the OP.

I'll give some high level views and also outline the math topics mentioned, high school through much of a Ph.D. in parts of applied math.

I respond in three parts:

Part I

I have a good pure/applied math Ph.D. and work in applied math and computing; while I call my work applied math and not artificial intelligence (AI), machine learning (ML), or computer science, it appears from the OP that there is significant overlap between my background and work and what the OP is concerned about.

The Reddit post by the guy in Germany was terrific although easy to parody as a big feature of old German culture! :-)! That post is maybe a bit over organized.

I've posted too often that the best future for computer science was pure/applied math, e.g., that someone seriously interested in the future of computing should as an undergraduate just major in pure math with some applied math and, essentially f'get about anything specifically about computer science.

Or, for the essential computer science, write some code in some common procedural programming language for some simple exercises, check out at the library D. Knuth's The Art of Computer Programming, Volume 3, Sorting and Searching, learn about big-O notation, the heap data structure and heap sort, as an exercise program, say, a priority queue based on the heap data structure, learn about the Gleason bound and how heap sort achieves it so is in an important sense the fastest possible sort algorithm, as a side exercise look at AVL trees, and call computer science done for an undergraduate! This is partly a joke but not entirely.
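For anyone who wants to try that exercise, here is a minimal sketch in Python of a priority queue on a binary heap, plus a sort built on top of it. The names `MinHeap` and `heap_sort` are just illustrative, and this is the simple "push everything, pop everything" variant rather than Knuth's in-place heap sort; it still uses O(n log n) comparisons.

```python
class MinHeap:
    """Priority queue backed by a binary heap stored in a Python list."""

    def __init__(self):
        self._items = []

    def __len__(self):
        return len(self._items)

    def push(self, item):
        """Append the item, then sift it up until its parent is no larger."""
        items = self._items
        items.append(item)
        i = len(items) - 1
        while i > 0:
            parent = (i - 1) // 2
            if items[i] < items[parent]:
                items[i], items[parent] = items[parent], items[i]
                i = parent
            else:
                break

    def pop(self):
        """Remove and return the smallest item."""
        items = self._items
        if not items:
            raise IndexError("pop from empty heap")
        top = items[0]
        last = items.pop()
        if items:
            # Move the last leaf to the root and sift it down.
            items[0] = last
            i, n = 0, len(items)
            while True:
                left, right = 2 * i + 1, 2 * i + 2
                smallest = i
                if left < n and items[left] < items[smallest]:
                    smallest = left
                if right < n and items[right] < items[smallest]:
                    smallest = right
                if smallest == i:
                    break
                items[i], items[smallest] = items[smallest], items[i]
                i = smallest
        return top


def heap_sort(values):
    """Sort by pushing every value onto a heap and popping them in order."""
    heap = MinHeap()
    for v in values:
        heap.push(v)
    return [heap.pop() for _ in range(len(heap))]
```

Each push and pop walks at most one root-to-leaf path, so costs about log2(n) comparisons, which is where the n log n total comes from.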

Well, it appears that the OP has started to discover some of why I've said such things about math.

This role for math is just a special case of the old standard situation that, in nearly all fields, the best work mathematizes the field as in, e.g., mathematical physics. Indeed, there is the old joke that good coverage of the math needed for theoretical physics is so much about just the math that can do the physics just in the footnotes.

It is a standard situation that nearly everyone in the STEM fields is convinced that they need to know more math. As I read papers by computer science professors, I tend to agree!

Here I'll try to help the person with their lament in the OP:

(1) Start at about 50,000 feet up and begin to identify in what fields, broadly on what problems, you want to work. Remember: One of the keys to success is good, early work in problem selection.

(2) If you want to work in AI, I suggest you try to regard the current headline topics in AI/ML as nearly irrelevant. Sure, for political reasons, might have to keep up, and maybe there are some current, hot applications, but I don't see that work -- first-cut, ballpark, basically empirical curve fitting to huge quantities of data -- as much of a start on AI.

E.g., at one time I was hired to work in an AI project, specifically expert systems. My first reaction was that expert systems -- rule based programming, working memory and the RETE algorithm, rule firing conflict resolution -- were junk and, in particular, nothing like a good start on AI, programming style, or anything else. After 20 years my opinion has changed: Expert systems were worse than junk, as a style of programming, for much of anything, and in particular on anything like AI.

Maybe I got some revenge: We were trying to use expert systems for the monitoring and management of large server farms and networks. For the monitoring, for essentially health and wellness, they were using essentially just thresholds set by hand on single variables one at a time. My view was, wrap that data and processing in all the rules possible, and still the results won't be very good.

Sure, for monitoring there are two ways to be wrong, (A) a false alarm and (B) a missed detection. So, right, we're forced into statistical hypothesis testing with (i) a null hypothesis that the system is healthy, (ii) a false alarm is Type I error, (iii) a missed detection is Type II error. Quickly decide that need some statistical hypothesis tests, with the null hypothesis that the system is healthy, that are both multidimensional (treat several variables jointly) and distribution-free (make no assumptions about probability distributions) and find that apparently there were none such in the literature. So, I invented a large class of such tests that, really, totally knocked the socks off expert systems for much of monitoring. And what I did was just some applied math and applied probability -- some group theory, some probability theory based on measure theory, some measure preserving as in ergodic theory, etc.

So, yes, I could say that my monitoring took in training data, did some machine learning, was better at its monitoring than humans so was some artificial intelligence, was computer science, etc., but I didn't: I just called my work some applied math. E.g., I got real hypothesis tests where false alarm rate was adjustable and known in advance and some useful best possible results on detection rate. So, as I wrote in the paper, take away the mathematical assumptions, derivations, theorems and proofs, forget about false alarm rate and detection rate, and just do the specified data manipulations and call the work AI.
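To make "distribution-free with a false alarm rate known in advance" concrete, here is a small Python sketch of a textbook rank-based test. It is not the class of tests from the paper described above, just an illustration of the idea. Assumptions: the healthy observations and a new healthy observation are exchangeable, their scores have no ties, and the score function (here the hypothetical `score`, a sum of squares) is fixed before looking at the data. Under those assumptions, a new healthy score exceeds the k-th largest of n healthy scores with probability exactly k/(n+1), whatever the underlying distribution.

```python
from fractions import Fraction


def score(observation):
    """Hypothetical fixed score: squared distance from the origin.

    Any function chosen before seeing the data would do; this one just
    collapses several variables into one number so they are treated jointly.
    """
    return sum(c * c for c in observation)


def alarm_threshold(healthy_scores, k):
    """Return the k-th largest score seen while the system was healthy."""
    if not 1 <= k <= len(healthy_scores):
        raise ValueError("k must be between 1 and the number of healthy scores")
    return sorted(healthy_scores, reverse=True)[k - 1]


def false_alarm_rate(n, k):
    """Exact Type I error rate under exchangeability and no ties: k / (n + 1)."""
    return Fraction(k, n + 1)


def is_alarm(observation, healthy_observations, k):
    """Raise an alarm when the new score exceeds the k-th largest healthy score."""
    threshold = alarm_threshold([score(x) for x in healthy_observations], k)
    return score(observation) > threshold
```

With 999 healthy observations and k = 10, the false alarm rate would be exactly 1%, which is the sense in which the rate is adjustable and known in advance. The sketch says nothing about detection rate.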

So, such real problems and applied things are one approach to computer science, AI, ML, etc. Then, sure, for such work as in what I did for detection, need, really, the Ph.D. coursework in pure and applied math and some ability to do publishable theorems and proofs in math -- for both, net, need a pure/applied math Ph.D.

So, back to AI and mostly setting aside the current headlines: For AI, I'd suggest that from 50,000 feet up start by watching


with a mommy kitty cat (domestic short hair tabby) and 6 or so of her kittens, maybe only 2 days old, and how they learn. Their learning is astoundingly fast. E.g., for how they learn to use their hind legs, can see fairly directly some of just how they do that. In just that one video clip, in real time of likely just a week or so, those kittens go from nearly helpless balls of fur to young kitty cats. Easy to guess that in two months they will be safely 40 feet up a tree catching something or other, effortlessly doing gymnastic feats that would shame Olympic athletes, etc. Astounding.

Okay for AI from 50,000 feet up, start by trying to guess how those kittens learn. Maybe there will be some math in that, maybe not. Then if you can implement your guess in software, if the software appears from good tests in practice to learn some things well, and if your guess is fairly general, then maybe you have some progress on AI. Here f'get about my view and confirm with some AI profs who would have to review your work, hire you for a research slot, give you a research grant, etc.

So, for some research, (i) could do some math as I did for that monitoring or (ii) try to do some real AI, e.g., by starting by watching those kittens learn where maybe, eventually there will be some math.

For both those research directions, notice that need (i) some overview of the real problem, (ii) some intuitive insights, (iii) some good, new ideas. Maybe the ideas will be mathematical or use math and maybe not. E.g., might go a long way on what those kittens are doing before use much math.

(3) But for the math as mentioned by the OP, I'll try to give an outline:

(i) Start with the real numbers, e.g., as learned likely well enough by the 9th grade. Then, sure, learn about the complex numbers. So, net, have a high school major in math, e.g., everything short of calculus.

(ii) Learn college calculus well. At least in part, you can do well alone: Due to some circumstances beyond my control, for freshman calculus I just got a good book, studied, worked the exercises, and learned. Then at a college with a good math department, I started on sophomore calculus. You can do such stuff yourself.

(iii) It would be good to take a course in abstract algebra, especially one where nearly all the exercises are proofs. So, learn about sets, functions, groups, rings, fields, more about the real and complex numbers, maybe some basic number theory, the greatest common divisor and least common multiple algorithms, and some about vector spaces. Might touch on cryptography and coding theory.
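The greatest common divisor algorithm mentioned there is Euclid's, and it is short enough to show whole. A minimal Python sketch (the function names are mine), with the least common multiple obtained from the identity gcd(a, b) * lcm(a, b) = |a * b|:

```python
def gcd(a, b):
    """Euclid's algorithm: replace (a, b) by (b, a mod b) until b is zero."""
    a, b = abs(a), abs(b)
    while b:
        a, b = b, a % b
    return a


def lcm(a, b):
    """Least common multiple via gcd(a, b) * lcm(a, b) = |a * b|."""
    if a == 0 or b == 0:
        return 0
    return abs(a * b) // gcd(a, b)
```

Proving that the loop terminates and returns the right answer is a typical first exercise of the proof-writing kind such a course is good for.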

Really the more important, maybe main, value of the course is just learning how to write proofs. The math there is nearly all just so childishly simple that it's easy to learn to write very highly precise proofs -- crucial stuff if later want to publish theorems and proofs.

Blunt fact of life, politically incorrect observation: Without such training in writing theorems and proofs, and, really, just in math notation and how to do math derivations, tough ever to learn how. So, can find chaired professors of computer science at top computer science departments in top US research universities who, however, fumble terribly with just standard math notation and, especially, with how to write theorems and proofs.

Part II

Super simple view: In math, there are sets with elements. That's the logical foundation of essentially all of current pure/applied math. The details are in Zermelo-Fraenkel axiomatic set theory assuming the axiom of choice. E.g., can construct from sets a set that looks like the real numbers we knew about in the 9th grade. Soon define ordered pairs and, then, functions. After that, a huge fraction of everything is functions. The proofs -- as actually written but without the crucial intuitive ideas that permitted finding the proof -- are all essentially just symbol substitution as in basic logic and Whitehead and Russell. Warning: Any math you write for publication should be easily translated back to just sets and symbol substitution; that and nothing else is the criterion. If you also want the proof to be readable by humans, there is more. E.g., in the proof might mention one by one each theorem assumption and where it gets used.

(iv) Learn linear algebra. Really the subject grew out of Gauss elimination for solving systems of linear equations -- with some additional attention to numerical stability (e.g., partial pivoting, double precision inner product accumulation, and iterative improvement) -- that's nearly always still the way to do it. It's fun to program a good Gauss elimination routine, e.g., just in C.
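The comment suggests C; to stay short, here is a sketch of the same routine in Python instead. It does Gauss elimination with partial pivoting only. The double precision inner product accumulation and iterative improvement mentioned above are left out, and the singularity tolerance `1e-12` is an arbitrary choice of mine.

```python
def solve_linear_system(a, b):
    """Solve a x = b for square a by Gauss elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy [a | b] so the inputs are not modified.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]

        # Eliminate this column from every row below the pivot row.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - s) / m[row][row]
    return x
```

For example, `solve_linear_system([[0, 1], [1, 0]], [2, 3])` only works because of the row swap: without pivoting the first step would divide by zero.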

There see clearly that any such system of equations has none, one, or infinitely many solutions. Later will discover that the set of all the solutions, when there are any, is an affine subspace, that is, a vector subspace shifted by some one vector; that is, a plane that need not pass through the origin. And will discover that the left side with the unknowns is a linear function -- big time stuff.

So, that start was a small thing for a huge future.

Next up, we consider n-tuples of real (or complex, here and always in linear algebra) numbers. Then we see how to make the n-tuples a vector space. They are the most important of the vector spaces.

But should also see the abstract (with no mention of n-tuples) definition of a vector space because (a) the n-tuples are the most important example, (b) even when working with just n-tuples often need the more abstract definition (especially for subspaces, which are often the real interest, e.g., hyperplanes in curve fitting and multivariate statistics), and (c) there are other important vector spaces that are not just n-tuples (in signal processing, sets of random variables, wave functions in quantum mechanics, solutions to some differential equations, and much more).

So, learn about linear independence and bases (essentially coordinate systems).

Learn about inner products, distance, angles, and orthogonality. See generalizations of cosines, e.g., the Schwarz inequality, and the Pythagorean theorem.
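A small Python sketch of those definitions for n-tuples of reals (helper names are mine): the Schwarz inequality |u . v| <= |u| |v| is what guarantees the quotient below lies in [-1, 1], so that it can be called a cosine at all.

```python
import math


def dot(u, v):
    """Standard inner product of two n-tuples of reals."""
    return sum(x * y for x, y in zip(u, v))


def norm(u):
    """Length induced by the inner product."""
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine of the angle between nonzero u and v."""
    return dot(u, v) / (norm(u) * norm(v))
```

With u = (3, 4) and v = (4, -3) the inner product is 0, so the vectors are orthogonal, and the Pythagorean theorem holds: |u + v|^2 = 50 = 25 + 25 = |u|^2 + |v|^2.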

Learn about eigenvalues and eigenvectors -- those eigenvectors are often the most important ones in applications, e.g., your favorite coordinate axes.

Then for the crown jewel, learn about the polar decomposition and, thus, singular values, principal components (e.g., data compression), the core of the normal equations in statistics, etc.
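For readers who want the formulas, here is how those pieces connect, written for a real square matrix $A$ and a real data matrix $X$ (the symbols are my notation, not from any particular text):

```latex
% Polar decomposition: U orthogonal, P symmetric positive semidefinite.
\[ A = U P, \qquad P = (A^{\mathsf T} A)^{1/2} . \]

% Diagonalize P with an orthogonal V; the \sigma_i \ge 0 are the singular values of A.
\[ P = V \Sigma V^{\mathsf T}, \qquad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_n) . \]

% Singular value decomposition, with W = U V orthogonal.
\[ A = U V \Sigma V^{\mathsf T} = W \Sigma V^{\mathsf T} . \]

% Principal components: for a data matrix X with centered columns and
% X = W \Sigma V^{\mathsf T}, the columns of V are the principal directions.
\[ X^{\mathsf T} X = V \Sigma^{2} V^{\mathsf T} . \]

% Normal equations of least squares for fitting y \approx X \beta.
\[ X^{\mathsf T} X \, \beta = X^{\mathsf T} y . \]
```

Keeping only the largest few $\sigma_i$ and the matching columns of $W$ and $V$ is the data compression use.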

There is a remark in G. Simmons that the two pillars of mathematical analysis are linearity and continuity. The superposition in physics is essentially linearity. In applied math, linearity is the main tool, the key to the land of milk and honey. Well, a good course in linear algebra is a good start on linearity. In particular, those linear equations solved with Gauss elimination are linear as in linear transformations in, right, linear algebra.

Then, sure, that version of linearity takes one through much of all of applied math, e.g., Fourier analysis, the fast Fourier transform, X-ray diffraction, Banach space, Hilbert space, the classic Dunford and Schwartz, Linear Operators, etc.

So, a Banach space is just a vector space where the scalars are the real or complex numbers, there is a norm, that is, a definition of distance, and the space is complete in that norm. Complete is what the rational numbers are not but the real numbers are. Or, complete means that a sequence that appears to converge, that is, converges in the weaker sense of Cauchy, actually does have something to converge to and does. E.g., in the rationals, the approximations to more and more decimal places of pi have nothing to converge to but in the reals do.
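The pi example can be checked with exact rational arithmetic. A minimal Python sketch, using 20 hard-coded decimal places of pi (the names are mine): every term is a rational number, and any two terms from the n-th place onward differ by less than 10^-n, which is the Cauchy property, yet the limit pi is not rational.

```python
from fractions import Fraction

# First 21 digits of pi; the decimal point belongs after the leading 3.
PI_DIGITS = "314159265358979323846"


def pi_truncated(n):
    """Return pi truncated to n decimal places as an exact rational number."""
    if not 0 <= n < len(PI_DIGITS):
        raise ValueError("n must be between 0 and 20")
    return Fraction(int(PI_DIGITS[: n + 1]), 10 ** n)


def gap(m, n):
    """Exact distance between the m-place and n-place truncations."""
    return abs(pi_truncated(m) - pi_truncated(n))
```

For instance `gap(20, 5)` is smaller than `Fraction(1, 10**5)`: the sequence squeezes together inside the rationals even though it has nowhere in the rationals to land.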

In theorem proving, nice to have completeness. But, sure, computing knows next to nothing about completeness because we compute essentially only with rational numbers. So, can do a lot of work without completeness. Indeed, in applications, often we are just approximating, and the rationals can get as close as we please to pi!

Banach spaces are not trivial or useless: E.g., based on the Hahn-Banach theorem, there is the grand applied math dessert buffet

David G. Luenberger, Optimization by Vector Space Methods, John Wiley and Sons.

A Hilbert space is a Banach space where the norm comes from an inner product.

E.g., the set of all real valued random variables X such that E[X^2] is finite form a Hilbert space. That the space is complete for those random variables is nearly mind blowing; actually the proof is short.

(v) There remains Baby Rudin, Principles of Mathematical Analysis.

There see calculus done with essentially full care, as theorems and proofs. So, again, get lessons in how to write theorems and proofs on the way to being a good mathematician.

The main content of the book is just showing that a real valued continuous function defined on a closed, bounded interval of the reals has a Riemann integral.
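A numerical illustration of what "has a Riemann integral" means, as a Python sketch. To keep it short I assume the function is nondecreasing, so that on each subinterval the smallest value is at the left end and the largest at the right end; for a general continuous function the same squeeze works but the minimum and maximum are not at the endpoints.

```python
def lower_upper_sums(f, a, b, n):
    """Lower and upper sums of a nondecreasing f on [a, b], n equal pieces."""
    h = (b - a) / n
    lower = sum(f(a + i * h) for i in range(n)) * h        # left endpoints
    upper = sum(f(a + (i + 1) * h) for i in range(n)) * h  # right endpoints
    return lower, upper
```

For f(x) = x^2 on [0, 1] the two sums trap 1/3 between them and differ by (f(1) - f(0)) / n = 1/n, so they are forced to a common limit as n grows. That common limit is the integral.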

The key is that that interval is compact. So, learn about compactness, which is of quite general usefulness.

Then with compactness and continuity, have uniform continuity. Now the doors to grandeur start to open: That the Riemann integral works is a short proof. And, later in the book, get the three epsilon proof that the uniform limit of continuous functions is a continuous function (was a question on one of my Ph.D. qualifying exams -- from baby Rudin, I got it!).
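For reference, the three epsilon argument in outline, in my notation: $f_n \to f$ uniformly on a metric space $(E, d)$, each $f_n$ continuous, and $x \in E$, $\varepsilon > 0$ fixed.

```latex
% Step 1 (uniform convergence): choose N with
\[ \sup_{t \in E} \lvert f_N(t) - f(t) \rvert < \frac{\varepsilon}{3} . \]

% Step 2 (continuity of f_N at x): choose \delta > 0 with
\[ d(x, y) < \delta \;\Longrightarrow\; \lvert f_N(x) - f_N(y) \rvert < \frac{\varepsilon}{3} . \]

% Step 3 (triangle inequality): for every y with d(x, y) < \delta,
\[ \lvert f(x) - f(y) \rvert
   \le \lvert f(x) - f_N(x) \rvert
     + \lvert f_N(x) - f_N(y) \rvert
     + \lvert f_N(y) - f(y) \rvert
   < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3}
   = \varepsilon . \]
```

Uniform convergence is what lets one $N$ serve both $x$ and $y$ in step 3; with only pointwise convergence the third term cannot be controlled.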

Compactness is so powerful that it is nearly the same as just finiteness -- there's a famous, old paper on that.

Well for a positive integer n and the set of real numbers R, in R^n a set is compact if and only if it is closed and bounded. Now we are cleaning up the Riemann integral and a lot of associated stuff.

At the back, Rudin gives a nice, short definition of a set of the real numbers that has measure zero (without really getting deep into measure theory) and, then, shows the Riemann integral of a bounded function exists if and only if the function is continuous everywhere except on a set of measure zero. Nice. Now an exercise is to find a function that is differentiable but whose derivative is not Riemann integrable. Might look in Gelbaum and Olmsted, Counterexamples in Analysis.
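The usual form of that definition and of the criterion, stated in my own notation rather than quoted from the book:

```latex
% A set E of real numbers has measure zero if, for every \varepsilon > 0,
% it can be covered by countably many open intervals of total length < \varepsilon:
\[ E \subset \bigcup_{i=1}^{\infty} (a_i, b_i)
   \quad\text{and}\quad
   \sum_{i=1}^{\infty} (b_i - a_i) < \varepsilon . \]

% Criterion: for a bounded f on [a, b], with D_f the set of points
% where f is discontinuous,
\[ f \text{ is Riemann integrable on } [a, b]
   \iff D_f \text{ has measure zero} . \]
```

Any countable set, e.g. the rationals, has measure zero: cover the $i$-th point by an interval of length $\varepsilon / 2^{i+1}$, for a total length of $\varepsilon / 2$.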

Also the later editions of baby Rudin cover the exterior algebra of differential forms. That material is of interest in differential geometry, some applications, and general relativity.

That's an overview of what baby Rudin is about.

Go through that book carefully and will come out with (a) some good knowledge of the "principles" of the analysis part of pure math and (b) much better skills at doing math derivations, definitions, theorems, and proofs. If you want to write and publish new proofs for, say, AI, baby Rudin is one of your best mentors, maybe your Fairy Godmother?

For being an applied mathematician, sure, you already guessed, and you were correct, that essentially always in practice the Riemann integral exists; so, why sweat the details? Okay, then just use baby Rudin to learn about compactness, continuity, uniform continuity, measure zero, and more on how to write proofs. And, really, focusing just on compactness, continuity, etc., can pull that off in a few, nice weekends, maybe just one. Then look at that result later in the book, that the uniform limit of continuous functions is continuous, to see how to do such work.

Sure, Rudin discusses compactness, etc. on metric spaces. Well, easily enough, the set of real numbers R is also a metric space! And for positive integer n, so is R^n.

So, why say metric space instead of, say, just R^n? Well, first, the theory is cleaner because a metric space has so many fewer assumptions than R^n so that can see more clearly just what assumptions make the results true. Second, maybe some fine day all you will have is just a metric space and, then, can still use the results -- don't hold your breath while waiting for a significant application, either pure or applied, with a non-trivial metric space that isn't also much more! Or, proving the stuff in a metric space is no more difficult, is more general, and maybe is actually more useful.

Or, maybe math had the results in R^n and then invented a metric space just to have a place to have just enough to make the results true! So, a metric space was invented to have the least assumptions needed for proving the results; so, what came first were the results, and the metric space definition came later. Maybe!

Part III

(vi) Continuing on, there is the subject of measure theory. That was from H. Lebesgue, a student of E. Borel, right, in France near 1900.

They were correct: They improved on the Riemann integral. Don't worry: Whenever the Riemann integral exists, the integral from Lebesgue's measure theory gives the same numerical answer.

So, why a new (way of defining the) integral? Two biggie reasons:

(a) For a lot of theorem proving about integrals, e.g., differentiation under the integral sign, clean treatment of what physics needed from the Dirac delta function (right, there can be no such function, but measure theory has a good answer), definition of an integral of several variables, interchange of order of integration in iterated integrals, tying off some old, loose ends in Fourier theory, the deep connections between integration and linear operators, and more, Lebesgue's work is crucial and terrific stuff.

(b) Lebesgue's integral is much more general than the Riemann integral, and that generality is crucial, especially as the foundation for probability theory.

Now, for what Lebesgue did:

First, he developed measure theory. That's essentially just a grown up theory of area like you have known about since grade school. E.g., given a set, in the real numbers, R^n, something more complicated, or something fully abstract, the measure of that set is essentially just its area. With the generalization, sure, can have some sets with measure infinity, negative measure, complex valued measure, etc. But on the real line, with the usual measure, Lebesgue measure, the measure of an interval is just its length, and you already know about that.

But Lebesgue measure is darned general: It's tricky to show that there is a set of the reals that is not Lebesgue measurable, and the usual proof uses the axiom of choice.

So, for measure theory, there is a measure space with three things: There is a space, just some non-empty set, say, M; there is a collection of measurable sets, subsets of M called, say, S; and there is a measure m. Then the collection of measurable sets S satisfies the simple, essentially obvious axioms we would want for area and, thus, is a sigma algebra of sets. Then for each set A in S, its measure is the real (or complex) number m(A). We ask that m have the properties we want for a good theory of area. It's just a grown up version of area. The theory is not trivial; it was tricky to get all the details just right so that have a good theory of area, that is, so that area works like we want it to.
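Written out for the common case of a positive measure, the "simple, essentially obvious axioms" are the standard ones below, using the same names M, S, and m as above. The signed and complex valued cases change the range of m and need extra care.

```latex
$S$ is a sigma algebra of subsets of $M$:
\begin{enumerate}
  \item $M \in S$;
  \item if $A \in S$, then $M \setminus A \in S$;
  \item if $A_1, A_2, \dots \in S$, then $\bigcup_{k=1}^{\infty} A_k \in S$.
\end{enumerate}

$m \colon S \to [0, \infty]$ is a (positive) measure:
\begin{enumerate}
  \item $m(\emptyset) = 0$;
  \item if $A_1, A_2, \dots \in S$ are pairwise disjoint, then
  \[
    m\Bigl( \bigcup_{k=1}^{\infty} A_k \Bigr) = \sum_{k=1}^{\infty} m(A_k) .
  \]
\end{enumerate}
```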

E.g., for the space, can have the set of real numbers R. For the measurable sets, have the intervals and, then, all the other sets needed to have a sigma algebra (a short proof shows that this definition is well defined). Then the Lebesgue measure of an interval is just its length, and the measures of all the other measurable sets are what they, then, have to be (need some theorems and proofs here).

So, for probability theory, a probability space is just a measure space (each point in that space is one experimental trial); a probability, P, is just a positive measure that gives the whole space measure 1; an event A is just one of the measurable sets; and the probability of A is just its measure P(A). What we want for probability is already so close to a theory of area that, really, we have little choice but just to follow what Lebesgue did. That's what A. Kolmogorov observed in his 1933 monograph.
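A toy version of that dictionary, as a minimal sketch: it assumes a finite space (one roll of a fair die, a hypothetical example), where every subset can be an event and the measure is just a sum of point weights. The names OMEGA, WEIGHTS, events, and P are made up for illustration.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical example: one roll of a fair six-sided die.
# Each point of OMEGA is one experimental trial.
OMEGA = frozenset(range(1, 7))
WEIGHTS = {omega: Fraction(1, 6) for omega in OMEGA}


def events(space):
    """All subsets of a finite space: the largest sigma algebra on it."""
    items = sorted(space)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def P(event):
    """Probability of an event = its measure = the sum of its point weights."""
    return sum((WEIGHTS[omega] for omega in event), Fraction(0))


SIGMA_ALGEBRA = events(OMEGA)
EVEN = frozenset({2, 4, 6})
ONE = frozenset({1})

if __name__ == "__main__":
    print("P(whole space) =", P(OMEGA))
    print("P(even) =", P(EVEN))
    print("P(even or one) =", P(EVEN | ONE))
```

On a finite space countable additivity reduces to finite additivity, so none of the real difficulty of measure theory shows up here; the point is only the correspondence between events and measurable sets.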

Second, with the foundation of measure theory, Lebesgue defined the Lebesgue integral.

So, what is being integrated is (usually) a function taking real or complex values. The domain of the function is a measure space.

Then for the integral, say, in the case of a real valued function, we partition on the Y axis, that is, in the range of the function instead of in its domain. So, we don't have to do Riemann-like partitions of the domain of the function and, thus, the domain can be much more general.

As the first step, we only integrate functions that are >= 0, and we do that by approximating, right, again with essentially rectangles, only from below the function, not both above and below as for Riemann. Here the domain of the function can be the whole space, e.g., the whole real line. We don't care about either continuity or compactness.

For a function that is both positive and negative, we multiply the negative part by -1 and integrate the two parts separately. If at least one of the two results is not infinity, then we subtract, and that's the integral.
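In symbols, the two steps above are roughly as follows. This is the standard construction; the symbols s, a_i, A_i, f^+, and f^- are my notation for illustration.

```latex
For measurable $f \ge 0$, approximate from below by simple functions
$s = \sum_{i=1}^{n} a_i \mathbf{1}_{A_i}$ with $a_i \ge 0$ and $A_i \in S$:
\[
  \int f \, dm
  = \sup \Bigl\{ \sum_{i=1}^{n} a_i \, m(A_i) \; : \; 0 \le s \le f \Bigr\} .
\]

For general measurable $f$, set $f^+ = \max(f, 0)$ and $f^- = \max(-f, 0)$,
so that $f = f^+ - f^-$ with $f^+, f^- \ge 0$. If at least one of
$\int f^+ \, dm$ and $\int f^- \, dm$ is finite, then
\[
  \int f \, dm = \int f^+ \, dm - \int f^- \, dm .
\]
```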

A random variable is just such a function, and its expectation is just its integral.
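On a finite probability space the whole construction collapses to a few lines, which makes the "partition the range" idea easy to see. A minimal sketch, assuming two fair coin tosses as the space; the names P_POINT, level_set_measures, and lebesgue_integral are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical finite probability space: two fair coin tosses.
P_POINT = {
    "HH": Fraction(1, 4),
    "HT": Fraction(1, 4),
    "TH": Fraction(1, 4),
    "TT": Fraction(1, 4),
}


def heads_minus_tails(outcome):
    """A random variable: a real valued function on the space."""
    return outcome.count("H") - outcome.count("T")


def number_of_heads(outcome):
    """Another random variable, this one never negative."""
    return outcome.count("H")


def level_set_measures(X, p_point):
    """Measure of each set {X = y}: partition by values in the range of X."""
    measures = defaultdict(Fraction)
    for outcome, weight in p_point.items():
        measures[X(outcome)] += weight
    return dict(measures)


def lebesgue_integral(X, p_point):
    """Integrate the positive and negative parts separately, then subtract."""
    levels = level_set_measures(X, p_point)
    positive_part = sum((y * m for y, m in levels.items() if y > 0), Fraction(0))
    negative_part = sum((-y * m for y, m in levels.items() if y < 0), Fraction(0))
    return positive_part - negative_part


if __name__ == "__main__":
    print("E[heads - tails] =", lebesgue_integral(heads_minus_tails, P_POINT))
    print("E[number of heads] =", lebesgue_integral(number_of_heads, P_POINT))
```

Nothing in lebesgue_integral looks at how the outcomes are arranged; it only needs the measure of each level set, which is why the domain can be so general.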


Don't feel like the Lone Ranger; not everyone knows this stuff. E.g., from all I've been able to see from quantum mechanics, the wave functions are differentiable. Then I'm told that the wave functions, wondrous, are also continuous. From baby Rudin, of course they are continuous; every differentiable function is continuous! Then I'm told that the wave functions form a Hilbert space. Well, I can see that they can be points in a Hilbert space, but they can't form a Hilbert space because the space of continuous functions is not complete in the L^2 norm: a sequence of continuous functions can converge in that norm to a discontinuous function.

In elementary probability, it is common, e.g., for finding the expectation of a Gaussian random variable, to integrate over the whole real line. Tilt: Baby Rudin defines the Riemann integral only over compact sets. So, can use an improper integral, and then they want to differentiate under the integral sign. Tilt again -- the needed theorems are not so easy to find. So, really, this common stuff, in probability, physics, etc. of integrating over the whole real line or over all of R^n is using the Lebesgue theory where have clean theorems for such things.
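The improper integral workaround can be seen numerically: integrate over the compact interval [-T, T] and let T grow. A minimal sketch, assuming a Gaussian with mean 1 and standard deviation 1 and a plain midpoint rule; the function names and parameter values are hypothetical choices for illustration.

```python
import math


def gaussian_density(x, mu=1.0, sigma=1.0):
    """Density of a Gaussian random variable with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def truncated_expectation(T, mu=1.0, sigma=1.0, n=20_000):
    """Midpoint-rule estimate of the integral of x * density(x) over [-T, T]."""
    h = 2.0 * T / n
    total = 0.0
    for k in range(n):
        x = -T + (k + 0.5) * h  # midpoint of the k-th subinterval
        total += x * gaussian_density(x, mu, sigma)
    return total * h


if __name__ == "__main__":
    # As T grows, the Riemann integrals over [-T, T] approach the mean mu.
    for T in (1.0, 2.0, 4.0, 8.0, 12.0):
        print(T, truncated_expectation(T))
```

This only shows the limit of Riemann integrals over compact intervals; it says nothing about when differentiation under the integral sign is justified, which is where the Lebesgue theory earns its keep.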

That's your overview and your work for the weekend. You are now permitted two beers and half a pizza but only if you have someone you like a lot for the rest of the pizza and some more beer. Can substitute Chianti for the beer. But no more math permitted; maybe a movie, but not math!

Perhaps you can focus on improving the usability of AI tool-sets for a wider market rather than on finding the Next Big Magic Equation. Example: https://github.com/RowColz/AI -- an AI expert (or several) may do the initial setup, but factor tables allow more "typical" office workers to tune and prune the results.

I like the comment that likens this sort of deeply-linked knowledge to a DAG. In my own (limited) experience, once I’ve mentally found the DAG where every node either references some other node or baseline knowledge, the learning task almost immediately switches from daunting to routine. Just work on understanding each node in the dependency chain until you get to the one you seek!

From TillWinter's response:

> Also: doing the master is to understand that you don't know anything, and doing your doctorate is to learn the others know nothing as well.

I had always heard variations on the first part -- that going to a good school was supposed to humble you by showing you how much you don't actually know.

Never heard the second part. That's great.

I am an AI researcher and faculty member at a large and famous university. I probably know less math than that Reddit poster. Math is important if you are specifically interested in the math of AI. If you are interested in inventing algorithms and solutions you mostly don't need the math.

Starting from pre-calculus, what areas of mathematics (with book recommendations) should one study rigorously to have the foundations to pursue a PhD in Machine Learning?

The depth and illegibility of the field he is describing make me believe that we are much further away from general-purpose AI than I previously thought.


"Complex models are rarely useful (unless for those writing their dissertations)." (V.I.Arnold)

Perhaps the author should write a ML tool to help sift through all the material ;)

Most-upvoted reply is excellent.

The top-voted reply by TillWinter may be excellent but it actually doesn't answer the question posed by the OP. TillWinter's reply is mostly about the "divide-and-conquer" approach to learning. Yes, efficient study habits are good to know but that type of answer can be generically applied to any learning endeavor.

The reddit OP is asking about specific resources that help with intuition. In contrast, many math teaching materials work from an analytical and symbolic approach (e.g. axioms and properties.) Unfortunately, most people can't truly learn intuition that way and they end up in mindless "plug & chug" to pass the test.

You should link to the reply instead, because the most-upvoted reply might change.
