
Foundations of Machine Learning - gohwell
https://bloomberg.github.io/foml/
======
hackermailman
If you know basic undergrad probability/statistics/linear algebra (matrices,
vectors, eigenvalues), you can take CMU's graduate course in ML, which is
intended to prepare PhD students to understand research papers in the field.

(Includes recorded lectures)
[https://sites.google.com/site/10715advancedmlintro2017f/lect...](https://sites.google.com/site/10715advancedmlintro2017f/lectures)

CMU also has an 'Applied Machine Learning' undergrad course that is
unfortunately fully paywalled, but it uses the text: Witten, I. H. & Frank, E.
(2005). Data Mining: Practical Machine Learning Tools and Techniques, second
edition.

~~~
cbHXBY1D
And what about after that? In my experience at least, I can highly recommend
Stanford's EE364a Convex Optimization.
[http://web.stanford.edu/class/ee364a/](http://web.stanford.edu/class/ee364a/)

When I was an undergrad at Berkeley, one particularly well known research lab
would only look at your resume if you took that course online. Warning: that
class is not for the faint of heart. And be good at linear algebra.

~~~
graycat
Stephen Boyd's _Convex Optimization_,

[http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf](http://stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf)

is gorgeous material. Large parts of it draw from theorems of the alternative,
linear programming, Jensen's inequality, R. T. Rockafellar, etc., but Boyd
makes it all into a clean whole.

That's the upside! The downside is that, IMHO, rarely does any one person need
more than a small part of the whole book, and the parts that someone needs are
likely also available in the subject they are working with.

So, at least download Boyd's book and use it for reference: if you ever have a
question about convexity, then start with Boyd!

~~~
davidsrosenberg
Funny - I had the same thoughts about Boyd and Vandenberghe’s book, which is
why I compiled this “extreme abridgment” of what you need for the class:
[https://davidrosenberg.github.io/mlcourse/Notes/convex-optimization.pdf](https://davidrosenberg.github.io/mlcourse/Notes/convex-optimization.pdf)

~~~
iampno
Nice

------
stochastic_monk
The title of this page feels like a reference to the excellent Mohri et al.
Foundations of Machine Learning. I’d recommend both it and Shai
Shalev-Shwartz’s book for VC theory/Rademacher complexity sorts of statistical ML.

However, I’m not crazy about this summary. It’s less foundational than it is a
survey. I probably dislike the format and formatting more than anything, but I
would not recommend this over other resources.

~~~
benrbray
Yeah, this space feels pretty saturated already, and people seem to use the
same topics / presentation / ordering over and over. What a ton of duplicate
effort!

It'd be nice to see a course try to put its own spin on "machine learning" and
1) present standard topics in an unusual way and 2) include topics that intro
ML students might not normally see.

~~~
LolWolf
Take a peek at our class [0]! We present a bunch of topics not covered in
undergraduate courses (at least to our knowledge), such as proximal gradient
descent, random features, etc., without referencing any probability. The
course is aimed mostly at sophomores/juniors in any STEM field, with the only
prerequisite being the introductory linear algebra course, EE103 [1]. All of
the material (minus solutions) is available online for both courses.

I'm probably highly biased, but I'd like to say this is a fairly fresh take
which departs heavily from the usual CS229 (Ng's course) presentation style
and order since it's meant for a completely different audience (and was, to be
fair, written this past quarter, unlike 229 which was written perhaps 15 years
ago).

\---

[0] [http://ee104.stanford.edu](http://ee104.stanford.edu)

[1] [http://ee103.stanford.edu](http://ee103.stanford.edu)

~~~
wodenokoto
> All of the material (minus solutions) is available online for both courses.

I can only find lecture slides when looking at the course site.

Are there no readings or problems?

~~~
LolWolf
There are problems, but I’m afraid that we took them down since we need to
rewrite a few before next year. Sorry! I’m sure poking around the site might
yield some results, if you’re careful enough ;) (at least for 104, someone
else is teaching 103 and I’m not sure what all they’re doing with the course,
but all of the main problems are available on the book’s website).

------
graycat
I looked over the whole list of topics. They have a page of questions about
prerequisites, and for every question I could answer that I was fully
comfortable with it. But the questions have to do with probability
and optimization, and my applied math Ph.D. was in those fields with research
in stochastic optimal control.

I can begin to see some of why Bloomberg is interested in the material: Maybe
with all the economic and stock market data they collect, they can build some
surprisingly useful predictive _machine learning_ models. Okay. Maybe one good
result will be that Michael Bloomberg will give some more money to Johns
Hopkins!

But otherwise I was disappointed:

(1) Put simply and bluntly, it's all about just one now very old topic --
regression analysis. Well, when old regression analysis doesn't fit very well,
then try logistic regression, ridge regression, regression trees, other forms
of regression, other forms of curve fitting, e.g., neural networks.

Or, look, guys, it's all just empirical curve fitting. Ptolemy tried empirical
curve fitting. He used his _epicycles_. They didn't fit very well. Then,
later, from work of Kepler, a falling apple, etc., Newton guessed that (A)
there was a law of motion, force equals mass times acceleration, and (B)
there was a force directly proportional to the product of two masses and
inversely proportional to the square of the distance between them. That fit
the data great!

Lesson: Lots of things change continuously. Usually the changes are
differentiable, that is, have a well defined tangent. Well, that tangent is
linear. So at least locally, lots of things are linear. So, for lots of
things, a promising first cut approach is to do a linear fit. For more, neural
networks can approximate anything continuous, and lots of things are
continuous. So, net, curve fitting has some promise and utility. But, still,
curve fitting is just guessing without any real basis in science, e.g., it
couldn't find or replace what Newton did.

Really, the now classic texts in regression analysis start with something the
machine learning curve fitting does not: The classic texts assume that there
really is a linear equation, that we would have the equation exactly except
for some errors in the data, and then do some nice applied math to show how to
get the errors down and get a good approximation to the equation that has
already been assumed to exist. The assumptions can vary, stronger or weaker,
but in the theory there was little or no role for just empirical fitting.
Well, machine learning is charging ahead without that assumption that a linear
equation exists. And, wonder of wonders, apparently often now such an
equation, even as a good approximation, doesn't exist. So, we are back to
empirical curve fitting, struggling like Ptolemy. We already know: Some
successes are possible, but like Ptolemy we face some severe limitations.

(2) Okay, when simple regression doesn't fit very well, we keep trying? Okay.
Say, we try logistic regression, ridge regression, L1 or L2 regularization
regression, regression trees, boosting, ..., neural networks, etc. Uh, which
is better, L1 regularization or L2 regularization? Uh, this sounds like
throwing stuff against the wall until something appears to stick. Sure, that
can work at times, but are we really satisfied with that? Wouldn't we want
some more solid reasons for the tool we pick? Lots of places elsewhere in
applied math, applied probability, and applied statistics we do have solid
reasons.
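
(To make that menu concrete: the one structural difference between L1 and L2
regularization that is easy to state is that an L1 penalty tends to drive
irrelevant coefficients to exactly zero, while an L2 penalty only shrinks them.
A tiny sketch, assuming scikit-learn and made-up synthetic data, nothing more:)

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
beta = np.zeros(20)
beta[:3] = [3.0, -2.0, 1.5]                  # only the first 3 of 20 features matter
y = X @ beta + rng.normal(0, 0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)           # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)          # L2 penalty

print(np.round(lasso.coef_, 2))              # most irrelevant coefficients land at exactly 0
print(np.round(ridge.coef_, 2))              # everything shrunk a bit, nothing exactly 0
```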

With so many efforts to patch up regression, maybe we might suspect that for a
lot of problems regression is not the right tool?

(3) There is a lot more to applied statistics, applied probability, etc. and
possibly of value for applications, maybe including Bloomberg's customers,
than empirical curve fitting. Commonly this work has equal justification to be
called _machine learning_ because it also takes in data, estimates some
parameters, builds a model, and gives results. Moreover, the better cases work
with clear assumptions so that when the assumptions hold we know we have
something solid, and some of the methods use meager assumptions that are
relatively realistic in practice. The Bloomberg course is just empirical curve
fitting, nearly all versions of regression analysis, and omits all the rest.
On the applied math shelves of the research libraries, regression is only a
tiny fraction of the whole. Where's the rest?

~~~
clircle
Nice comment, echoes a lot of my feelings about ML. I have a question. You
write

> (2) Okay, when simple regression doesn't fit very well, we keep trying?
> Okay. Say, we try logistic regression, ridge regression, L1 or L2
> regularization regression, regression trees, boosting, ..., neural networks,
> etc. Uh, which is better, L1 regularization or L2 regularization? Uh, this
> sounds like throwing stuff against the wall until something appears to
> stick. Sure, that can work at times, but are we really satisfied with that?
> Wouldn't we want some more solid reasons for the tool we pick? Lots of
> places elsewhere in applied math, applied probability, and applied
> statistics we do have solid reasons.

I'm wondering what these "solid reasons" could be? Some sort of experience
based on past data? An example would be helpful.

~~~
graycat
Let's see: Pick a system that we know to be linear. Get a lot of pairs of real
inputs and outputs and find the coefficients of the linear function that
relates the inputs to the outputs. Then given a new input, can say what the
output would be.

Okay, the function, the _system_, between a violin on the stage at Carnegie Hall
and a seat near the roof is linear. So, I'm typing quickly here, do regression
with a recording at the stage and at the seat and look for the coefficients in
the convolution. Then, given an oboe, we can say what it will sound like in that
seat.

Hooke's law with small deflections is linear. So, take as independent variables
the forces on a space frame and as dependent variables the deflections, and
estimate all the spring stiffness values.

Take 200 recipes for tomato sauce, all from the same 10 ingredients. For each
recipe, measure the weight of each ingredient and the weight of the protein in
the final sauce, and estimate the protein in each of the 10 ingredients. Then,
for any new tomato sauce recipe, weigh the ingredients and get the protein in
the final sauce.
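
That last one is nothing more than a linear least-squares problem. A rough
sketch with numpy, where every number is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_protein = rng.uniform(0.0, 0.2, size=10)    # protein per gram of each ingredient (unknown in practice)
A = rng.uniform(0, 500, size=(200, 10))          # grams of each of the 10 ingredients in 200 recipes
y = A @ true_protein + rng.normal(0, 1.0, 200)   # measured protein in each finished sauce

est, *_ = np.linalg.lstsq(A, y, rcond=None)      # estimated protein content per gram of each ingredient

new_recipe = rng.uniform(0, 500, size=10)        # weigh the ingredients of a new recipe...
print(new_recipe @ est)                          # ...and predict the protein in the final sauce
```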

------
Jasonseah
I am new to machine learning and have difficulty understanding the equations. I
finished 2 courses of Andrew Ng's Machine Learning class on Coursera. I
understand the flow and concepts and knew how to write those
equations/algorithms, but it always bugs me that I don't understand the
equations themselves. Do you have any suggestions for where I should start
learning the math behind them online?

~~~
siddboots
If you want to keep up with the maths, the main prerequisites for more
rigorous ML are typically undergrad-level calculus, linear algebra, and a bit
of basic probability and statistics.

For calculus, google "MIT 18.01", "MIT 18.02", (and "MIT 18.03" if you like),
which are all freely available on youtube. You should be comfortable with
single-variable calculus, and at least familiar with multi-variable
techniques.

For linear algebra, try "MIT 18.06", which is Gilbert Strang's MIT course. Or
try 3blue1brown's "The essence of linear algebra" series, which is the best
explanation I've ever seen of many concepts, but it is shorter and less in-
depth than a full course.

For basic statistics, try Khan Academy's "AP Statistics" sequence.

~~~
colmvp
While I think 3b1b's videos are great for developing a better understanding of
the purpose of certain concepts, I don't think there's any way around manually
doing the problems in, say, Strang's linear algebra book, since that's where
you spend time trying to apply these concepts to problems.

~~~
siddboots
Agreed. Nothing beats practice when it comes to math. I know that I spend far
too much time watching course videos, and nowhere near enough working through
problem sets.

~~~
id_rsa
Working through some of Strang's problems has also helped me. 3blue1brown is a
great introduction to give you intuition, but you cannot commit the skills to
long-term memory without struggling through problems.

------
graycat
Several of the comments in this thread clarify what the Bloomberg machine
learning work is about. Then with the clarification, maybe I see some issues
with the work.

We can start with just the simplest case, ordinary, plain old regression with
just one independent variable. So, for some positive integer n, we have pairs
of real numbers (y_i, x_i) for i = 1, 2, ..., n. Here we are trying to predict
the y_i from the x_i; we are trying to build a model that will predict y from
a corresponding x where x is not in the data. Our model is to be a straight
line, e.g., in high school form

y = ax + b

Okay, doing this, with the classic assumptions, we can draw a graph of the
data, the fitted line, and the confidence (prediction) interval at each real
number x.

For some intuition, suppose the x_i are all between 0 and 5. Maybe then the
fit is quite good, the line is close to the data, and the confidence intervals
are small.

But, IIRC, roughly or exactly, the confidence interval curves are actually
hyperbolas. So, while the upper and lower curves are close for x between 0 and
5, for x outside the interval [0,5] the curves can grow far apart. So, if we
are interested in the predictions of the model for, say, x = 20, the
confidence intervals may be very wide, enough for us to conclude that, even
though we have a straight line that fits our data closely, still our model is
useless at x >= 20.
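
A quick way to see the widening (a sketch with statsmodels on made-up numbers,
just for illustration): fit on x values in [0, 5], then ask for the 95%
prediction interval at x = 2.5 and at x = 20.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=40)    # the data really is a line plus noise

fit = sm.OLS(y, sm.add_constant(x)).fit()

x_new = np.array([2.5, 20.0])                      # one point inside [0, 5], one far outside
pred = fit.get_prediction(sm.add_constant(x_new, has_constant="add"))
print(pred.summary_frame(alpha=0.05)[["mean", "obs_ci_lower", "obs_ci_upper"]])
# The prediction interval at x = 2.5 is narrow; at x = 20 it is visibly wider.
```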

So, this little example illustrates a broad point about such curve fitting:
The model might work well for independent variable x (likely a vector of
several components) close to the training data but commonly be awful
otherwise.

How serious this situation is can vary a lot depending on the application.
E.g., if we are interested only in the values of y for x a lot like the training
data, e.g., in the interval [0,5], then maybe we don't care about the y value or
the confidence interval when x = 20.

But quite broadly there are applications where what we want such models to
tell us is the value of y for some x not close to what we have seen in our
training data.

Uh, let's see: IIRC Bloomberg is selling stock market and economic data, often
in nearly real time, to investors, some of whom are traders and make trades
quickly, within a few seconds, based on the data from Bloomberg.

I'm not an up-to-date expert on just what Bloomberg's customers are doing, but
from 20,000 feet or so up, maybe the situation is something like this; broadly,
the investors vary:

(A) Some investors want a portfolio constructed much like in the work of H.
Markowitz or W. Sharpe. In simple terms, they want the portfolio to have good
expected return with low standard deviation of return and maybe, then, buy on
margin to raise the rate of return while still having the risk relatively low.

(B) Some investors are interested in relationships between stocks and options
-- e.g., the Black-Scholes work is an example of this. IIRC, a more general
case is some stochastic process, maybe Brownian motion, reaching a boundary
and a first exit. The exit has a value, so the investment problem is a
boundary value problem. IIRC, one can design and attempt to evaluate _exotic
options_ with such ideas.

(C) Some investors are just stock pickers and buy when they sense a sudden
rise in price.

But a theme in (A)-(C) is that the investors are looking for something
unusual. So, in the model building, the unusual may have been rare in the
training data and the testing data. In that case, without more assumptions,
theories, or whatever, the prediction of the model for unusual input data may
be poor.

That is, it appears that the model building techniques promise that the model
will do poorly in just the application cases of greatest interest to investors
-- the unusual cases that are not well represented in the training and test
data.

So, maybe first cut some of what is needed is some anomaly detection.

So, we could use more information about the systems we are trying to model. A
linearity assumption is one such. In Newton's second law and law of gravity,
we can check that for falling apples. Next we can try on the planets in our
solar system and, nicely enough, see that it works. And then we can be pretty
sure for a rocket at Mach 15 headed for stationary orbit, etc. But with just
empirical curve fitting, apparently mostly we don't have such additional
information.

IIRC L. Breiman's first interest in empirical curve fitting was for clinical
medical data. So, maybe in that data he was trying to predict some disease but
the independent variable data he was using was common in his training and
testing data. I.e., he wasn't really looking to exploit some anomaly for some
once-in-20-years way to get rich quick.

~~~
davidsrosenberg
So the extrapolation-type problem you describe (an input not near any of your
training examples) is an issue. Unless you have a world model you believe in
(i.e. you've done some science -- not just statistics), it's hard to know if your
prediction function works out there where you’ve never seen any examples. If
you’ve seen some data out there, but relatively fewer than you see in
deployment, then importance weighting or other approaches from covariate shift
/ domain adaptation could help.

Anomaly detection is definitely another important area, but I struggle to pull
together a coherent unit on the topic. One issue is that it’s difficult to
define precisely, at least partly because everybody means something a little
bit different by it.

Also, based on classical hypothesis testing, I think that to some extent you
have to know what you’re looking for to be able to detect it (ie to have power
against the alternatives/anomalies you care about)... For that reason, I think
it’s hard to separate anomaly detection from more general risk
analysis/assessment, because you need to know the type of thing you care about
finding.

In any case, I made an attempt at anomaly detection: there's
[https://bloomberg.github.io/foml/#lecture-15-citysense-probabilistic-modeling-for-unusual-behavior-detection](https://bloomberg.github.io/foml/#lecture-15-citysense-probabilistic-modeling-for-unusual-behavior-detection),
which is simply about building a conditional probability model and flagging
behavior as anomalous if it has low probability or probability density under
the model. I also used to have 1-class SVMs in a homework
([https://davidrosenberg.github.io/mlcourse/Archive/2017/Homew...](https://davidrosenberg.github.io/mlcourse/Archive/2017/Homework/hw4.pdf)
Problem 11).
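
(For a concrete picture of the "flag low density" idea, here is a minimal
sketch, assuming scipy and synthetic 2-d "behavior" data; it is not taken from
the lecture or the homework:)

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))               # observed "normal" behavior

kde = gaussian_kde(train.T)                     # density model fit to the training behavior
threshold = np.quantile(kde(train.T), 0.01)     # treat the lowest-density 1% as the alarm level

new_points = np.array([[0.1, -0.3],             # ordinary-looking point
                       [5.0, 5.0]])             # far from anything seen in training
print(kde(new_points.T) < threshold)            # [False  True] -- only the second point is flagged
```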

~~~
graycat
So, for anomaly detection, before evaluating the model at x, we might want to
know if x would be an _anomaly_ in the training data x_i, i = 1, 2, ..., n.
Sure, x is likely a vector with several to many components.

An anomaly detector should be at least as good as a statistical hypothesis
test.

So, for the null hypothesis, assume that x is distributed like the training
data.

Okay, except we don't really know the distribution of the training data.

"Ma! Help! What am I supposed to do now???"

So, we need a statistical hypothesis test that is both multi-dimensional and
distribution-free.

Let's see: In ergodic theory we consider transformations that are measure
preserving .... Yup, we can have a group (as in abstract algebra) of those, sum
over the group, ..., and calculate the significance level of the test and,
thus, get a real hypothesis test, multi-dimensional and distribution free. For
some of the details of the test, there are lots of variations, i.e., options,
knobs to turn.
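
(One simple cousin of that idea, just for illustration and not my actual
construction: a rank-based test. Score each point by its distance to its k-th
nearest neighbor among the other points; under the null hypothesis that the new
point is distributed like the training data, the rank of its score is uniform,
so the rank gives a distribution-free, multi-dimensional p-value. A sketch with
numpy and invented data:)

```python
import numpy as np

def knn_score(i, Z, k=5):
    """Distance from Z[i] to its k-th nearest neighbor among the other rows of Z."""
    d = np.sort(np.linalg.norm(np.delete(Z, i, axis=0) - Z[i], axis=1))
    return d[k - 1]

def p_value(x_new, X_train, k=5):
    """Distribution-free p-value for: x_new is distributed like the training rows."""
    Z = np.vstack([X_train, x_new])
    scores = np.array([knn_score(i, Z, k) for i in range(len(Z))])
    return np.mean(scores >= scores[-1])         # rank of the new point's score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
print(p_value(rng.normal(size=5), X_train))      # typical point: p-value not small
print(p_value(np.full(5, 6.0), X_train))         # far-out point: p-value near 1/201
```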

Detection rate? Hmm. Depends ...! We don't have enough data to use the Neyman-
Pearson approach, but in a curious but still relevant sense the detection rate
is the highest possible.

I just call this work statistics, but maybe it would also qualify according to
some definitions as _machine learning_. But my work is not merely heuristic
and has nothing to do with regression analysis or neural networks. So, again,
my work is an example that there can be more to machine learning than
empirical curve fitting.

So, before applying an empirically fitted model at x, we want x to be
distributed like the training data, and at least we want a hypothesis test not
to reject the
null hypothesis that x is so distributed.

More generally, if we are looking for anomalies in the data, say, a rapid
real-time stream, then when we see an anomaly, we investigate further. In this
case, an
anomaly detector is a first cut filter, an alarm, to justify further
investigation.

Looking back on what I did, I suspect that more could be done and that some of
what I did could be done better.

Of course, my interests now are my startup. Yes, there the crucial core is
some applied math I derived.

Maybe I'll use my anomaly detection work for real-time monitoring for zero-day
problems in security, performance, failures, etc. in my server farm.

~~~
graycat
As a very general but crude and blunt approach to show that the hypothesis
tests were not _trivial_, I used the result of S. Ulam that Le Cam called
"tightness", as in P. Billingsley, _Convergence of Probability Measures_. When
doing both multi-dimensional and distribution-free, you are nearly way out in
the ozone, so you get pushed into some abstract techniques! Meow!

