Derive Yourself a Kalman Filter

Waterluvian · on May 13, 2019

"It's really quite simple. Here, let me show you all these formulas..."

This reminds me of last week. I bought a house and want to get into woodworking so I looked up intro videos on YouTube. "It's easy. Just follow me over to this table saw and router and planer and all these other tools you don't own."

I know I'm not being fair. But a recurring frustration I have is when experts claim it's easy or simple or for beginners and then talk right over you. They don't mean to, but it can be insulting and demoralizing. "well if it's for beginners and I don't know what these glyphs mean, the problem must be me."

So back to the topic. Am I wrong or could you begin with, "a kalman filter is a way to get a guesstimate of a value from different sources where the trust in each source can vary."

usgroup · on May 13, 2019

There are no Kalman filters for beginners because no one is out there trying to make maths more complicated than it needs to be.

To understand what’s happening you need work linear algebra l, probability theory and distribution theory. There’s no way to explain it without because it makes no sense without. Kalman filters are born of those things.

gjm11 · on May 13, 2019

You could, but you should follow it up with "where one of those sources is your (incomplete, noisy) information about what the value _used_ to be and how it changes over time".

If you just have (say) three different ways of estimating the position of an aeroplane, then you don't need a Kalman filter. The place where the Kalman filter adds value is where you measure its position _repeatedly_ and make use of the fact that it's known to be moving approximately in a straight line.

bigred100 · on May 13, 2019

How do EKF and ensemble Kalman Filters fit into this?

gjm11 · on May 13, 2019

The Kalman filter assumes that all errors are gaussian, that all updates are linear (i.e., new state = linear function of old state), and that observations are linear (i.e., what you measure is a noisy version of a linear function of the state).

The extended Kalman filter allows for state updates and observations to be nonlinear, by the straightforward expedient of replacing them with linear approximations near to the current estimate.

The ensemble Kalman filter also allows for nonlinear updates and observations; instead of keeping track of the expectation and covariance of the state (i.e., everything you need to define a Gaussian model, things that behave nicely under linear transformations but not under nonlinear ones) it keeps track of an "ensemble" of sample values, applies the update and observation functions to those, and then estimates expectations and covariances from this ensemble. It's a sort of Monte Carlo Kalman filter.

A couple of other things worth knowing about:

Intermediate between the extended KF and the ensemble KF is the "unscented Kalman filter". Like the ensemble KF it estimates things using a number of samples; but instead of propagating those samples through repeated steps, it picks the sample points at each step on the basis of the estimated expectation and covariance, and uses them only to compute new expectations and covariances. More expensive than the EKF but copes better with substantial nonlinearities.

Extrapolating beyond the ensemble KF is the "particle filter", which uses the same "track an ensemble of samples" approach but gives up the assumption that all the errors are Gaussian. You need a larger ensemble to get good results, I think, but it can cope with a wider range of scenarios. (I find the name "particle filter" annoyingly distracting; the "particles" are the samples, which I guess you're supposed to think of as a cloud of points in possible-configuration-of-the-system space, and of course it's a "filter" in the same way as the Kalman filter is.)

isegrim · on May 13, 2019

You may have chanced upon this already but here is a wonderfully explained (IMHO) intro to Kalman filters: https://www.bzarg.com/p/how-a-kalman-filter-works-in-picture...

SiempreViernes · on May 13, 2019

I don't know man, the link you clicked on says "Derive Yourself a Kalman Filter", so complaining about a formal treatment seems misguided rather than simply unfair.

your-nanny · on May 13, 2019

you are assuming that the reader is familiar with this specific understanding of "derive". someone not familiar with formal methods wouldn't.

mlevental · on May 13, 2019

what other understanding of derive is there???

coolgeek · on May 13, 2019

Derived data - e.g. indexes or summaries.

Derivative financial instruments - e.g. options, CDSs.

I'm sure there are many more examples.

krick · on May 13, 2019

You could, kind of. Basically, Kalman filter is this: you have 2 noisy data sources with a known real-life connection (like, vehicle position and velocity), so instead of an accurate measurement for each you have 2 "wide" gaussians. When you multiply them, you still get a gaussian, but a much narrower one, meaning: you still don't know the exact position for sure, but (kind of surprisingly) now you know it with a much greater precision than you had with the direct measurement.

In fact, I agree that this one is not really good explanation. I cannot remember, where I saw my favourite one, but there are really plenty of them in the internet, many are well illustrated and pretty clear overall, so if you are actually interested, I think you won't have much trouble understanding it after a few links from the first page of your favorite search-engine.

touisteur · on May 13, 2019

Hi,

Trying to simplify even further: you don't need 2 sources, just one, for which the distribution of noise is gaussian. And a model of how the data should evolve over time. Straight-line constant-speed for example, but also, static, turning, accelerating/decelerating... You need to be able to predict the next measurement from the previous one. The Kalman filter then uses the difference between the prediction from the model and the measurement, the incertitude on the measurement, and gives you: a smoothed estimation of the 'real' position of the object, an estimation of the 'noisyness' of your input data (i.e. How much does the data fit the model), and the ability to predict next positions accurately...

Well, not sure this word-soup simplifies anything...

Kalman filters are a bit counter intuitive if you don't have a strong background in probabilities... They clicked for me when 1- shown in comparison with other simpler filters 2- I played with each dimension of the formula on a toy example...

They're also very fun because they can be used in so many ways (sensor data fusion, non gaussian noise models, incredibly complex trajectory models...).

gdy · on May 13, 2019

"it can be insulting"

Insulting, really? Somebody cares to try to explain you something and the word you can find is "insulting"?

It's like everyone owes it to you to chew knowledge in small enough bits that you can swallow them without any effort.

What a kindergarden.

your-nanny · on May 13, 2019

Well, I dont agree with OP but i think you miss the mark. it's not deigning to provide an explanation that would be insulting, but the (apparent) promise of it being an accessible and intuitive explanation, and quite obviously not being one that OP considered insulting.

gdy · on May 13, 2019

I understood OP exactly as you said. This doesn't justify OP's whining, not one bit.

bonoboTP · on May 13, 2019

A Kalman filter is a specific instantiation of the Bayes filter, which is very elegant and intuitive. Then once you plug in a bunch of Gaussians and linearity in the Bayes filter and crunch the math, a Kalman filter falls out.

This is how it's treated in the book Probabilistic Robotics by Thrun, Burgard, Fox and I found it to be one of the best treatments on Kalman filters. A really great book overall.

commandlinefan · on May 13, 2019

Yeah, this piece reads like the Wikipedia page for Kalman filters. It’s right, but it’s not very informative if you don’t already understand the topic. There’s definitely no attempt to simplify anything here.

FPGAhacker · on May 13, 2019

Yes, when people try to explain things they will often say it’s easy in hopes that it will help the learner relax and not feel overwhelmed.

But that backfires the moment something is not easy for the learner. It’s completely demotivating.

It’s better in my opinion to say, “it’s okay if you don’t understand everything at first.” If the goal is to keep the learner from being immediately discouraged.

pranjalv123 · on May 13, 2019

There's a place for both kinds of explanations. This is a good way of explaining what a Kalman filter is to someone who has a mathematical background, just like if you go look up a woodworking video, it might show an interesting way for someone who is familiar with woodworking to build a chair.

Neither is useful to the novice, but an article like this going out of its way to explain, say, matrix notation, or a video about building chairs that spends the first ten minutes explaining how to use a table saw, is less useful and interesting to the experienced reader or viewer.

I suppose I agree that articles like this shouldn't be presented as easy, or for the beginner, but there's a difference between "simple" and "easy".

semi-extrinsic · on May 13, 2019

When it comes to the woodworking: a japanese pull saw ( https://en.m.wikipedia.org/wiki/Japanese_saw ) is a great and fairly cheap way to get started. It enables high accuracy and is still quite fast.

Then you'll need a cheap but sturdy table, a hand plane, some vices, a tape measure, a carpenter's square, a cheap cordless drill. Screwing and glueing is the best way to get solid joints for a novice.

NougatRillettes · on May 13, 2019

Thanks for your comment (and thanks to everyone who answered below). I have been way too many times in the same situation where someone says something is "trivial" or "easy" so I completely get your point and will try not to make the same mistake for the next posts.

Regarding your question, I think the answers you have had are on point!

heinrichhartman · on May 13, 2019

If you enjoyed this, you might also like this article from my blog:

http://heinrichhartmann.com/blog/2014/12/11/Generative-Model...

This is a study of generating time series/stochastic processes. The estimators for the parameters lead straight up to Kalman filters. The state space models are taken from the Kalman setup. This was the first time I understood how Kalman filters come about. Its really a three step process:

1. Stationary processes --> Classical parameter estimation.

2. Discrete state space --> Markov models

3. Continues state space --> Kalman filters.

xchip · on May 13, 2019

Make sure to read this bit to make the formal explanation easier to understand.

https://aguaviva.github.io/KalmanFilter/KalmanFilter.html

paillou · on May 13, 2019

This one is great to me, less formula and more drawings :)

https://www.bzarg.com/p/how-a-kalman-filter-works-in-picture...

richajak · on May 13, 2019

A few months ago I try to use Kalman Filter as one of the filters to stabilize the GPS from mobile phones due to various factors: loss of signal, jumpy stationary coordinate, fake GPS, etc. I found that Kalman Filter was not giving me satisfactory result, adjusting its parameters many times without further improvement. Finally, I just need to write my own simple filter, without fancy math filters (kalman/etc) as I can differentiate the speed of car/motorbike/pedestrian, distance of jumpy signal (less than 20 meters in four directions), accuracy of gps signal in the city/suburb, etc.

bonoboTP · on May 13, 2019

I think there is no such thing as a conditional random variable. There are conditional probability distributions, but I've never heard of A|B itself being a random variable and I think it doesn't fit the definition. A random variable (I'm not a mathematician) is a function that takes an elementary event and returns the value of the variable.

You may say this is pedantry, but I think it's important to keep track of what is what, especially when being a beginner. You can afford to be sloppy once you're more advanced.

kgwgk · on May 13, 2019

http://www.maths.qmul.ac.uk/~pettit/MTH5122/notes15.pdf

bonoboTP · on May 13, 2019

Hm, I googled it now, the issue is discussed in detail here: https://math.stackexchange.com/questions/612468/how-to-forma...

NougatRillettes · on May 13, 2019

Thanks for your comments! I don't see how what is discussed here conflicts with the notation I introduced into the post, do you still believe there is a soundness issue in what I have written?

bonoboTP · on May 13, 2019

I'm just saying that the notation of say A * X|Y * B seemed unfamiliar to me. I only know conditional notation within a P(...). Or an expectation, etc. Apparently your way of writing is used by others as well, but it may be good to know that it is not fully rigorous.

Again, there are different people preferring different presentations. I as a student was often frustrated by abused notations and was often confused by such things when trying to understand something in detail. For a more cursory and "practical" understanding it could be good enough.

kgwgk · on May 13, 2019

> it may be good to know that it is not fully rigorous

What is the problem with A|B=b being a random variable? (Apart from you unfamiliarity with the concept, I mean.)

Edit: I don’t say there are no problems, I ask what do you think the problem is? There is no problem in the discrete case. In the continuous setting things are indeed more complicated (but if the limiting process is well defined there are no issues).

Note that the same lack of rigour that you find in conditional ramdom variables affects conditional probabilities. If you can accept the latter there is no reason to reject the former.

bonoboTP · on May 13, 2019

A random variable is different concept from a distribution. For me personally it is helpful to keep them separate, but I can see that others may not care about the complete conceptual picture.

In the PDF file linked above I can see conditional probabilities, conditional distributions and conditional expectation etc, which are all valid and rigorous. I can see that the author thinks it's a good idea to merge these into a single concept of conditional random variable for didactic reasons, but that's not a rigorous concept.

Practically, if you have two random variables then you can take their joint distribution. What would be the joint distribution of (A|B) and (C|D)? For actual random variables it's simple: you can take intersections in event space, but a "conditional random variable" does not correspond to any subset of the event space.

Very simply speaking (this is my working model, not the exact precise math definition which involves a lot of measure theory): in probability theory we have an event space containing atomic events that cover all possible outcomes for the whole experiment/observation. A random variable is a function that maps from each such potential (atomic) event to a number. That's right. The random variable is a function but not the mass function, which maps from a number to a probability.

Conditional probability P(A|B) is an expression defined to mean P(A,B)/P(B). That's a clear definition. I am yet to see the actual definition of a conditional random variable.

Again, disclaimer 1: I can see the practicality of disregarding formality. Still I argue this is best done only when you do know better but it would be tedious to be technically correct all the time. But as a beginner I find it more useful to keep track of the correct concepts. For example not distinguishing random variables and distributions can be very confusing when considering more advanced things, like mutual information and KL-divergence. The former operates on random variables, the latter on distributions. I remember this was a difficult realization for me because the material we used didn't emphasize the difference enough, probably in the name of practicality.

Disclaimer 2: my point is a minor one overall.

kgwgk · on May 13, 2019

> Practically, if you have two random variables then you can take their joint distribution.

If they are defined in the same sample space.

> a "conditional random variable" does not correspond to any subset of the event space

I would say it's exactly the other way around, the domain of a "conditional random variable" is a subset of the domain of the "unconditioned" random variable (the subset where the conditioning holds).

srean · on May 13, 2019

I think it will help if you think in terms of conditioning on (for example, a coarser sigma algebra). You would get another random variable that is measurable on the sigma algebra you conditioned on. If that is coarser so would be the new function you obtained by conditioning.

bonoboTP · on May 13, 2019

Let's talk about a fair dice roll to make it concrete, and let the rolled number be X and let the event that we rolled an even number be E. P(X=6|E) = 1/3. P(X|E) is a distribution where 1,3,5 has 0 probability mass and 2,4,6 have 1/3 each.

If we consider X|E as a random variable, what is its value if we roll an odd number? Undefined? What does that mean? Random variables always have some value.

Sure you can build a new event space (sigma algebra) but then you can't use random variables over the original one.

Let's consider two independent rolls, X and Y. You can't compute the joint distribution P(Y, (X|E)), it just doesn't make sense as the two "variables" are defined over different spaces. Note that this is not the same as P(X,Y | E). The latter is simple a conditional probability, without any concept "conditional random variables".

Again, this is totally obvious to people who have experience with probabilities, but could be confusing to students. Such cases are where students who try to understand the details may be left more confused than students who just want to get the main idea.

srean · on May 13, 2019

Sure you can. The TLDR would be "piecewise constant projection"

I think picking up a standard graduate probability book will clear this up better than any long comment trail. There are no problems defining a coarser sigma algebra using an original one and then defining a function measurable on the new sigma algebra. Note this continues to be an r.v. in the original space as meaurability is preserved. A consistent definition the values of the conditioned r.v. would be the piecewise constant approximation of the original r.v. over the indivisible elements of the coarser sigma algebra.

Let me try another route.

You seem to be accepting of a conditional expectation. Now what is a conditional expectation if not a function. Now all we need is that function be measurable with respect to the new sigma algebra, thats ensured byconstruction. Hope it helped some

bonoboTP · on May 13, 2019

> I think picking up a standard graduate probability book will clear this up better than any long comment trail.

Can you recommend one? I just picked up Probability and Measure by Billingsley and it does not mention "conditional random variable" a single time in over 600 pages. It does have a lot of "conditional probability", "conditional distribution", "conditional expectation" etc.

> You seem to be accepting of a conditional expectation.

Conditional expectation is defined in terms of conditional probabilities, and those are in turn explicitly defined as P(A|B)=P(A,B)/P(B), so there's nothing not to accept.

srean · on May 13, 2019

Billingsley is pretty darn good. It might have left the connection as a dotted line given that the notion is no different from conditional expectation. The only connection you have to make is conditional expectation is a function and a random variable. You must have seen expectation taken of a conditional expectation. That should should convince you that condititional expectation is indeed a random variable. Since that r.v. was obtained by conditioning its not a stretvh to call it a conditioned r.v.

Any book that explains conditioning over a sigma algebra should suffice. You could try Loeve, Dudely or Neveu but dont remember if its mentioned explicitly.

BTW conditional expectation is really more fundamental than conditional probability. Its the former that yields the latter in measure theoretic probability. If you want to drink from the source that would be Kolmogorov.

Finally if you are reading Billingsley you are adequately qualified to call yourself a mathematician.

bonoboTP · on May 14, 2019

It's getting a little tedious. Please show me a concrete citation of a serious textbook (not a tutorial/handout by a grad student or a paper by a random researcher) that puts the three words "conditional random variable" next to each other (consistently, not simply as a one-off potential mistake). Google doesn't show serious sources for it.

While I agree with isolated points of your comment I think it doesn't add up to a useful/coherent concept of conditional random variable.

srean · on May 14, 2019

Thats a little too much to ask, perhaps if they were grep'able I could have obliged, unfortunately I dont have a photographic memory.

More concretely its just another name for conditional expectation. I am assuming you are aware that conditional expectation is a random variable obtained via conditioning (equivalently as a piecewise approximation in L_2). If you arent familiar with that view point that would be the place to start. Kolmogorov, Neveu, Dudely, Billingsley will all cover that view point.

bonoboTP · on May 15, 2019

> I am assuming you are aware that conditional expectation is a random variable

That's not what we're considering here, but things of the form X|Y=y for a concrete y. Even as E[X|Y=y], that's not a function, y is specified. Do you agree we shouldn't call X|Y=y a conditional random variable?

srean · on May 15, 2019

Oh absolutely for a specific y its not function (or a random variable) one usually thinks of Y as a variable and not a constant.

kgwgk · on May 15, 2019

The expectation E[X|Y=y] is a fixed value. (Edit: it’s the expectation of the random variable “X|Y=y”, while E[X|Y] is a random variable because it’s a function of the random variable Y: for each element in the sample space there is a corresponding value of “y” and in turn there is a value of the expectation E[X|Y=y].)

X|Y=y (as used in the blog post being discussed) is a random variable: it’s a function from a subset of the original sample space (corresponding to the elements for which the value of the random variable Y is y) to real values (or whatever the image of the X random variable is).

srean · on May 16, 2019

Yes you are right. I had messed up in the comment above. It continues to be a function on the restriction Y=y

kgwgk · on May 13, 2019

> If we consider X|E as a random variable, what is its value if we roll an odd number? Undefined? What does that mean? Random variables always have some value.

Random variables have some value on their domain, and for the random variable X | E=1 the sample space is restricted to the elementary events {2,4,6} which conform the composite event E=1. The original sample space is partitioned in the subspaces {1,3,5} and {2,4,6} when we condition on the values of the random variable E (0:odd, 1: even).

> Sure you can build a new event space (sigma algebra) but then you can't use random variables over the original one.

I guess we all agree then.

> Let's consider two independent rolls, X and Y. You can't compute the joint distribution P(Y, (X|E)), it just doesn't make sense as the two "variables" are defined over different spaces.

The variables X and Y describing independent rolls are also defined over different spaces and to have a joint distribution you have to define a "common" sample space of the form {x=1,y=1},{x=2,y=1},..,{x=6,y=6}.

You could do the same for a roll of a dice and the toss of a coin. Or do you think that computing the joint distribution of a coin toss and a dice roll doesn't make sense because they are defined over different spaces?

bonoboTP · on May 13, 2019

> You could do the same for a roll of a dice and the toss of a coin. Or do you think that computing the joint distribution of a coin toss and a dice roll doesn't make sense because they are defined over different spaces?

Of course it doesn't! You first have to define them on a common space (the Cartesian product), and for that you have to specify their joint probabilities. One example might be that you model them as independent. Otherwise we wouldn't know how the coin and the dice relate. Sure independence is usually a good default assumption, but it's still a necessary step.

kgwgk · on May 13, 2019

What did you mean with the following paragraph then?

> Let's consider two independent rolls, X and Y. You can't compute the joint distribution P(Y, (X|E)), it just doesn't make sense as the two "variables" are defined over different spaces.

Do you agree that you cannot compute the joint distribution P(Y,X) either because the two variables are defined over different spaces?

bonoboTP · on May 13, 2019

I meant them to be defined on the same space. It's a single experiment, the outcome of which are two rolls that happen to be independent.

kgwgk · on May 13, 2019

If you mean that the space for this single experiment composed of two rolls (random variables X and Y) is the cartesian product of {x=1,x=2,x=3,x=4,x=5,x=6} and {y=1,y=2,y=3,y=4,y=5,y=6}, then I agree.

But the fact that each variable alone is defined on the "same" sample space {1,2,3,4,5,6} is irrelevant.

The situation is no different from the joint probability for random variables X and Z corresponding to a single experiment consisting of a dice roll and a coin toss, where the relevant space is the cartesian product of {x=1,x=2,x=3,x=4,x=5,x=6} and {z=1,z=2}.

And it is also similar for the situation you asked about, with a random variable Y and a "conditional" random variable X|Even. The relevant space is the cartesian product of {y=1,y=2,y=3,y=4,y=5,y=6} and {x=2,x=4,x=6}.

bonoboTP · on May 13, 2019

Let's consider something with less independence, because it makes things harder to notice. Temperature indoors T1, temperature outdoors T2, IsOvercast O.

Let's say T2|O=1 is a "conditional random variable". Let's consider the average temperature indoors and outdoors. What would ((T1|O=1) + T2)/2 even mean? How could you use the two "variables" in the same expression? What is even their joint distribution? They are defined over different spaces!

This means, we must always carefully condition all variables used together on the exact same things. So ((T1|O=1) + (T2|O=1))/2 is valid. But then why do this on all variable instances that we use? It would be very tedious. At some point we want to get to a distribution (or some function of a distribution, like the expectation or variance), so it's much simpler to say for example P((T1 + T2)/2 | O=1), which is just a good old conditional distribution. Conditioning is an operation on a distribution and in my mind the bar (|) is really a slot in the P() notation and is short for P(A,B)/P(B). A bar popping up elsewhere (like in expectations) must be directly determined by the distribution (a random variable is not).

Overall, since you cannot mix differently conditioned "conditional random variables" in a single expression, you may just as well put your conditioning on the side of the whole expression in the P().

Not sure if my point is coming across...

kgwgk · on May 14, 2019

> How could you use the two "variables" in the same expression?

Do you expect to be able to use every random variable which can be conceived in the same expression?

If you object to the name “conditional random variable” [+] that’s debatable, but if you say that the resulting thing is not a random variable I think you are wrong.

Another thing that is a random variable, even though I suspect you may not like it, is the probability distribution of a random variable.

[+] which I don’t think was actually used by the OP, by the way.

a_imho · on May 13, 2019

In my very ignorant world view Kalman filters make some prediction from some input. As do various ML techniques. As do various statistical models and techniques. What are some good sources that show the connection between all these techniques and help me pick the right one for specific use cases?

For example, I want to predict a time series let's say the number of visitors of a site. I know some characteristics of the series (periodic, seasonal), but how should I go about it?

bonoboTP · on May 13, 2019

> In my very ignorant world view Kalman filters make some prediction from some input.

The word "filter" indicates that it turns one sequence (usually a time series of measurements) into another sequence (an estimate for underlying states). You can think of it as denoising the sequence of measurement by using knowledge about how the underlying system behaves. For example, if our GPS measurement says we suddenly jumped 100 meters compared to a second ago, we can weigh this against our prior knowledge that a car (the underlying system) is not likely to make such a sudden position change.

The Kalman filter weights the incoming measurement against what our model would predict. Both the prediction and the measurement are probabilistic and they are weighted according to the uncertainties of them. The more certain source of information is weighted higher.

> For example, I want to predict a time series let's say the number of visitors of a site. I know some characteristics of the series (periodic, seasonal), but how should I go about it?

That's not what the Kalman filter is for as you are not trying to denoise some sequence of noisy measurements.

mr_toad · on May 13, 2019

Most statistical time series models (e.g. ARIMA) models can be obtained using Kalman filters, if you fit them automatically, rather than choose the p-d-q values by hand. That’s how the auto.arima function in R works.

https://stats.stackexchange.com/questions/316676/arima-vs-ka...

thethirdone · on May 13, 2019

> In my very ignorant world view Kalman filters make some prediction from some input.

This doesn't seem like an accurate view of what Kalman filters do.

Fundamentally, they just combine noisy measurements over time to produce a more precise estimate of what the measurements should be.

There is a prediction step, but that is based on an existing model of how the state should evolve over time.