
Quick explanation of how a Kalman filter works - xchip
http://htmlpreview.github.io/?https://github.com/aguaviva/KalmanFilter/blob/master/KalmanFilter.html
======
ssivark
Nice summary. I think of the concept at a slightly higher level of
abstraction:

It's just Bayesian inference to update the posterior. A theoretical model
gives an estimate of what the state of the system should be (call that the
"prior"). A sensor measurement gives an independent estimate of the state
(call that the "likelihood"). Composing (multiplying) the two gives the
posterior. Further, in this formulation it doesn't really matter what you call
the prior or the posterior. You can easily combine estimates from many
different sources/sensors.

Kalman filters are just the special case where the likelihood and prior are
both Gaussian -- the distributions can be specified with a couple of numbers,
and there is a simple closed-form expression for composing them. More
generally, one can use all the tools available for Bayesian inference.
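
To make the Gaussian special case concrete, a quick Python sketch (toy
numbers mine, not tied to any particular filter):

```python
# Fusing two independent Gaussian estimates of the same scalar state by
# multiplying their PDFs. The product of two Gaussian PDFs is (up to
# normalization) another Gaussian whose precision (inverse variance) is
# the sum of the two precisions.

def fuse(mu1, var1, mu2, var2):
    """Posterior mean/variance from two Gaussian estimates of one quantity."""
    precision = 1.0 / var1 + 1.0 / var2
    var = 1.0 / precision
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var

# Model says ~10.0 (variance 4.0); sensor says 12.0 (variance 1.0).
# The fused estimate lands closer to the more confident source, and is
# tighter than either input.
mu, var = fuse(10.0, 4.0, 12.0, 1.0)
print(mu, var)  # 11.6 0.8
```

Note the symmetry: swap which estimate you call "prior" and which
"likelihood" and nothing changes, which is why you can chain in as many
sources as you like.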

~~~
civility
> It's just Bayesian inference to update the posterior.

I've had this discussion with a number of smart people (math PhDs and so on)
over the years, and I've never gotten a good explanation. I understand in the
very abstract sense how Kalman filters (and particle filters and so on) can be
seen as Bayesian, but when it gets to actually applying Bayes' theorem it
really gets complicated very fast. For instance, if you're estimating a 3D
position (that will be A) and your measurements are some 2D subspace (these
will be B), the least squares approach works nicely (normalized products of
Normal PDFs).

However, I'm not sure I understand a single component of the expression P(A|B)
= P(B|A) P(A) / P(B) in this context. What does A|B or B|A even mean when A is
a 3D Gaussian and B is 2D? What are the dimensionalities of these operations?
What does it mean to divide a product of pdfs?

I don't expect an answer, but I always wince a bit when I see Bayes in this
context. For Normal distributions, I prefer thinking of it as least squares.

~~~
rsp1984
_However, I'm not sure I understand a single component of the expression
P(A|B) = P(B|A) P(A) / P(B) in this context._

You've come to the right place :)

In the context of least squares estimation you can usually ignore the
denominator P(B) and only look at P(A|B) ~ P(B|A) P(A). Real mathematicians
(!= me) cringe here because the numerator alone isn't a proper PDF, but since
we're only seeking the value where it has its maximum, that's ok.

Good, so now we have P(A|B) ~ P(B|A) P(A). What you want to know is the
maximum of P(A|B), that is "where's the most probable position of my 3D point
A given all the 2D measurements B"?

Let's look at the solution P(B|A) P(A). The first factor is the likelihood
function, the second factor is the prior. The prior is easy: It's some prior
belief about the distribution of A. So if you know the position of your 3D
point somewhat (even without looking at B), say you know it must be in a range
of coordinates, or you know that Z must be positive or any other statement
about A that you know must hold, that information would go into the prior.

P(B|A) on the other hand is your likelihood. That's kind of the "meat" of
Bayes. Your likelihood is a distribution over B, given A. Meaning, if you knew
your 3D point A, what would the PDFs of your 2D observations of it look like?
Note that the likelihood is to be understood as a _function of the state_,
meaning its domain is A.

So to summarize: argmax(P(A|B)) = argmax(P(B|A) P(A)). To get the most
probable location of your 3D point, you look in your likelihood domain A (in
this case the 3D space) for the points that would produce the observations B
that you have, and then also take into account your prior beliefs about A to
narrow down that set of candidates.

In the concrete case of least squares both likelihood and prior are assumed to
be gaussian, so if you multiply them you get a new gaussian with a single
maximum. Neat!
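
If it helps, here's that argmax story as a brute-force Python sketch (1D and
toy numbers mine; the real 3D-to-2D case just swaps in multivariate PDFs):

```python
import numpy as np

# A 1D sketch of argmax(P(A|B)) = argmax(P(B|A) P(A)) by brute force.
# Prior: A ~ N(0, 2^2). Measurement model: B = A + noise, noise ~ N(0, 1).

a = np.linspace(-10, 10, 20001)                 # candidate states (the domain is A)
prior = np.exp(-0.5 * a**2 / 4.0)               # P(A), up to a constant
b = 3.0                                         # the measurement we actually got
likelihood = np.exp(-0.5 * (b - a)**2 / 1.0)    # P(B|A) as a function of A, B fixed
unnormalized = likelihood * prior               # P(B) dropped: it can't move the argmax

a_map = a[np.argmax(unnormalized)]
# Gaussian closed form for comparison: precision-weighted mean
a_closed = (b / 1.0 + 0.0 / 4.0) / (1.0 / 1.0 + 1.0 / 4.0)
print(a_map, a_closed)  # both ~2.4
```

The MAP estimate lands between the prior mean (0) and the measurement (3),
pulled toward the measurement because it's the more confident of the two.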

~~~
civility
I appreciate the kind reply, but I don't think you understood the gist of my
complaint. I have a degree (just undergrad) in math, and I've implemented
Kalman filters, Kalman smoothers, information filters, particle filters and so
on at least a dozen times. I know what operations to perform, and I even have
an intuition about why they work.

When I complain about the Bayes theorem way of describing or thinking about
Kalman filters, I really mean that I don't understand what almost anything in
that expression means. The notation is intuitive if read as English, but
opaque to what operations are being performed on what mathematical objects.
Capital P means probability of an event or predicate, and A is some
abstraction of the thing I'm trying to estimate, and B is the abstraction of a
new observation or measurement.

Think of it this way: I have some knowledge of my prior estimate, and I'm able
to characterize it as a Normally distributed random variable. It has a mean
and a covariance in 3D space (let's call those x and P). There is a
multivariate function for that pdf (lowercase p); it's an exponential of the
negative square of the Mahalanobis distance normalized to have a unit volume.
The same thing applies for my measurement, except it's only a 2D mean and
covariance (let's call those z and R). I already know that I can use my H
matrix, some matrix inverses and multiplications to combine this information.
It's really just a case of multiplying the pdf for the state estimate and the
measurement and re-normalizing to a unit volume. In other words, I believe
both things are true, and they are independent, so I can multiply their pdfs
and re-normalize to get a combined result. The rest is just linear algebra.

Now let's get to Bayes. P(A) is supposed to be the probability of an "event"
or predicate. I can squint sideways and translate that as integrate my pdf for
the prior estimate (Normal with mean x and covariance P) over some
unspecified bounds and convert that to a probability of my estimate being in
those bounds, but I'm not sure that's what is intended. Again a similar thing
applies for B (Normal with mean z and covariance R). It's frustrating that the
bounds are never stated, because they could be radically different spaces in
the numerator and denominator. I guess I'll give that a pass because maybe the
notation would be too cumbersome if it were included, but this simplification
seems to never be stated in any of the books or papers I've read on it.

Next, P(B|A) has to be probability of another "event", and if I squint again,
the best I'm able to come up with is that it means a new Normal pdf with mean
= H'z and covariance = H'RH. However, I haven't seen that spelled out
anywhere, and so really that's just assuming the conclusion I want, which is
super questionable. It also doesn't help me understand the left side of the
Bayes equation - that vertical bar there seems to mean something different.
When I look to the definition of conditional probability, I don't see anything
about the vertical bar applied to multivariate Normal distributions.

If you translate it all to English, it reads as a coherent sentence, and
that's fine. This is the "prior state estimate", that's the "observation", and
this other thing is the "a posteriori", etc. However, if Bayes really helps
with understanding the math, the vertical bar has to mean something specific
as an operator, and one would hope it meant the same thing on the left and
right sides of the equation.

Similarly, in order for all of those P( ... ) to become scalars so
multiplication and division are well defined, you need some integration bounds
to turn the event into a probability. But at this point, I'm not even sure if
that's what is intended by the notation - maybe they aren't scalars, and
multiplication and division mean something radically different here. I
honestly don't know.

~~~
abstrakraft
Probability notation generally works best for the people who already
understand the concept in question. Let me take a crack at your question.

The equation in question is P(A|B) = P(B|A) P(A) / P(B). In modern Kalman
filter literature, this would be stated as something like: P(x_k | z_k) =
P(z_k | x_k) P(x_k) / P(z_k). It is generally left as implicit in these sorts
of equations that everything is also conditioned on the sequence z_1 to
z_{k-1}. In this equation, x_k is a free variable in the state space (possibly
multi-dimensional, so a vector), while z_k is the measurement, which is a
realization of the random variable distributed as N(H(x_k), R_k). The result
is a PDF over the free variable x_k.

So let's tackle the terms one by one:

1. P(z_k | x_k) - this is the probability that we measured z_k, given that
the true object is at x_k. This is the aforementioned normal distribution
N(H(x_k), R_k).

2. P(x_k) - this is the prior probability of the state estimate, generally
after propagation through the motion model from P(x_{k-1} | z_{k-1}). In the
Kalman filter, this is also Gaussian, and conveniently has the same mean as
the term above (see note below).

3. P(z_k) - this is the denominator that someone else mentioned earlier can
be effectively ignored, which is right - you only need it to normalize the
numerator. If you must compute it, it can be factored as the integral over the
entire state space of P(z_k|x_k)*P(x_k). Given z_k, this is a number, not a
function.

4. P(x_k | z_k) - The result, which is a Gaussian PDF. You can arrive at it
numerically by plugging in specific values for x_k, in which case (1) and (2)
are numbers. Or symbolically, in which case (1) and (2) are functions, and
you'll end up with the form of a Gaussian PDF.

Note: The original article quotes a distribution for the product of two
Gaussians with arbitrary means. It does not state that this is an
approximation, which is exact only in the case of equal means. This is why
unbiased measurements are one of the Kalman assumptions.
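
To make the four terms concrete, here's a numeric sketch (scalar toy numbers
mine, H = identity):

```python
import numpy as np

# Numeric version of the four terms in P(x_k|z_k) = P(z_k|x_k) P(x_k) / P(z_k):
# prior x_k ~ N(2.0, 1.5), measurement noise R = 0.5, observed z_k = 3.0.

xs = np.linspace(-5, 10, 15001)                  # grid over the state space
dx = xs[1] - xs[0]

prior = np.exp(-0.5 * (xs - 2.0)**2 / 1.5) / np.sqrt(2 * np.pi * 1.5)  # 2. P(x_k)
z = 3.0
lik = np.exp(-0.5 * (z - xs)**2 / 0.5) / np.sqrt(2 * np.pi * 0.5)      # 1. P(z_k|x_k), a function of x_k
evidence = np.sum(lik * prior) * dx              # 3. P(z_k): integrate out x_k -> a single number
posterior = lik * prior / evidence               # 4. P(x_k|z_k): a proper PDF again

print(np.sum(posterior) * dx)                    # ~1.0: it normalizes
print(xs[np.argmax(posterior)])                  # ~2.75, between prior mean and measurement
```

Plugging a specific value of x_k into `prior` and `lik` gives you the
"numbers" reading; keeping them as arrays over the grid gives you the
"functions" reading.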

~~~
gugagore
I think the crux of the complaint is the imprecision in saying e.g. "P(z_k
| x_k) - this is the probability that we measured z_k, given that the true
object is at x_k"

Technically, you can only give a useful answer about the probability _density_
of the random variable Z_k at the value z_k, conditioned on X_k = x_k. In the
Bayesian interpretation of the Kalman filter, you never have an event "I
measured z_k" (that event has probability 0, of course).

I agree that the probability notation is the issue here. Look at how wikipedia
shows Bayes' rule for continuous random variables on both sides of the |:
[https://en.wikipedia.org/wiki/Bayes%27_theorem#Random_variab...](https://en.wikipedia.org/wiki/Bayes%27_theorem#Random_variables)
That's the kind of explicit and precise notation I would use to help someone
understand the Kalman filter from a Bayesian perspective.

Once you use that definition of Bayes' rule, then you can substitute the
definitions of the multivariate normal pdf, Do The Math, and derive the Kalman
filter recursive updates.
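
As a sanity check on that "Do The Math" step, here's the scalar case done
both ways in Python (toy numbers mine, H = 1):

```python
import numpy as np

# Bayes' rule over densities reproduces the usual Kalman measurement
# update x' = x + K(z - x), P' = (1 - K)P with K = P/(P + R).

x_prior, P = 1.0, 2.0          # prior mean and variance
z, R = 4.0, 1.0                # measurement and its noise variance

# Closed-form Kalman update
K = P / (P + R)
x_post = x_prior + K * (z - x_prior)
P_post = (1 - K) * P

# Same posterior via multiplying the densities on a grid and normalizing
g = np.linspace(-20, 20, 400001)
dg = g[1] - g[0]
post = np.exp(-0.5 * (g - x_prior)**2 / P) * np.exp(-0.5 * (z - g)**2 / R)
post /= np.sum(post) * dg
mean = np.sum(g * post) * dg
var = np.sum((g - mean)**2 * post) * dg

print(x_post, P_post)   # 3.0 0.666...
print(mean, var)        # same values to grid accuracy
```

The multivariate derivation is the same substitution with matrix algebra in
place of the scalar divisions.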

~~~
civility
Thank you for your reply here, and the one below. I wish I had seen it before
posting my sibling message to this one. I'm a bit too tired to go any further
with this tonight, but I plan to look at your links tomorrow.

Cheers.

------
fermienrico
I remember struggling with understanding Kalman filters (and Particle
filters). I found [1] to be an amazingly clear explanation:

[1] [http://www.bzarg.com/p/how-a-kalman-filter-works-in-
pictures...](http://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/)

~~~
blt
Additionally, this video made particle filters crystal clear to me:
[https://www.youtube.com/watch?v=aUkBa1zMKv4](https://www.youtube.com/watch?v=aUkBa1zMKv4)

------
f1notformula1
This is the most intuitive introduction to Kalman filtering that I've ever
seen. Even just reading that part was the most enlightening 2-minutes of my
entire month. Kudos to the author!

~~~
xchip
Much appreciated! :)

------
ramzyo
This is a very nice breakdown of Kalman Filtering. One slight correction -
Kalman Filtering is one approach to solving the SLAM problem, but SLAM doesn’t
require a KF. A particle filter, for example, can be used to solve the SLAM
problem.

For anyone interested in further resources to better understand the intuition
behind KFs, I’ve found this resource incredibly valuable (although a little
pricey): [https://www.amazon.com/Kalman-Filter-Beginners-MATLAB-
Exampl...](https://www.amazon.com/Kalman-Filter-Beginners-MATLAB-
Examples/dp/1463648359/ref=br_lf_m_zuvo8fyr45xsoyk_ttl?_encoding=UTF8&s=books)

~~~
bcaine
Another absolutely fantastic resource is this Jupyter Notebook based textbook
on Kalman Filters and related topics: [https://github.com/rlabbe/Kalman-and-
Bayesian-Filters-in-Pyt...](https://github.com/rlabbe/Kalman-and-Bayesian-
Filters-in-Python)

An addition to your correction: there are many other ways to solve the SLAM
problem beyond (kalman/information/particle) filters. Optimization based
approaches are very popular (search terms: Graph SLAM, Factor Graphs, Pose
Graphs).

~~~
ramzyo
True! I don’t find too many people out there who know about this stuff. Any
interest in trading battle stories through DM? Would be interested to know
what you’re working on and where you’ve been!

------
rsp1984
_We also do the opposite, and we use our IMU to better estimate the position
of those features/clues. This is exactly what SLAM does, at its core it is
just a Kalman filter with the twist that each time it finds a reliable visual
clue it treats it as a sensor and makes it part of its state, a thing that
allows us to tell our position much better and build a map._

Some nitpicking here: There are _some_ SLAM systems and flavors that use a
Kalman filter as the main estimator under the hood but most are either hybrids
or separate the tracking and map optimization (using batch / bundle adjustment
methods).

Digging a little deeper here I think it's important to mention that a Kalman
filter is just a recursive least squares estimator, and is in fact equivalent
to an Information Filter (which, IMO, is much easier to understand if you're
already familiar with linear regression / LS systems).

In fact a Kalman filter is just a fancy mathematical reformulation of an
Information Filter that allows you to save quite a bit of computation time
when each observation is low-dimensional compared to your state.
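
A quick numeric check of that equivalence (toy numbers mine):

```python
import numpy as np

# The Kalman measurement update and the information filter update give the
# same estimate; they just parameterize it differently (mean/covariance vs
# information vector/matrix).

x = np.array([0.0, 1.0])                  # prior mean
P = np.array([[2.0, 0.3], [0.3, 1.0]])    # prior covariance
H = np.array([[1.0, 0.0]])                # observe the first component only
R = np.array([[0.5]])                     # measurement noise
z = np.array([1.2])

# Kalman form
S = H @ P @ H.T + R
K = P @ H.T @ np.linalg.inv(S)
x_kf = x + K @ (z - H @ x)
P_kf = (np.eye(2) - K @ H) @ P

# Information form: add precision contributions, then convert back
Lam = np.linalg.inv(P) + H.T @ np.linalg.inv(R) @ H
eta = np.linalg.inv(P) @ x + H.T @ np.linalg.inv(R) @ z
P_if = np.linalg.inv(Lam)
x_if = P_if @ eta

print(np.allclose(x_kf, x_if), np.allclose(P_kf, P_if))  # True True
```

The information form inverts matrices sized like the state; the Kalman form
inverts S, sized like the measurement - hence the computational win when
observations are small relative to the state.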

------
laythea
I used to work at a company that did Dynamic Positioning systems for vessels
and oil rigs and I worked on the periphery software systems running with a
Kalman-based algorithm there. Didn't work on the Kalman stuff then, but I do
think it's cool what they can achieve and have always been interested. I will
add this to my reading TODO. Thank you.

------
loxias
This is a positively fantastic explanation. One of the best I've seen despite
being so short.

Minor nitpick: It might be a bit more clear if the terms "F", "s", "v" were
explicitly defined, as well as a sentence explaining the matrix M inversion.
It's really great, but when I read "our prediction formula F could be as
simple as s = v*t_dot" I immediately think "F? was F defined earlier? Did I
miss it? Also what is s?".

[overall, really great explanation, bookmarked]

------
ianstallings
I wish I had this summary quite a few years ago. I ended up buying
aeronautical software textbooks and trying to glean the information from those
to figure out how to integrate multiple IMU inputs.

I learned that real world IMU use in navigation is pretty ugly and has to be
supplemented due to constant errors. This can be particularly difficult when
flying. In my case I used other sensors, such as a GPS. But a strictly IMU-
only approach to lengthy navigation is, AFAIK, impossible.

Either way, good write up.

~~~
tnecniv
> I learned that real world IMU use in navigation is pretty ugly and has to be
> supplemented due to constant errors. This can be particularly difficult when
> flying. In my case I used other sensors, such as a GPS. But a strictly IMU-
> only approach to lengthy navigation is, AFAIK, impossible.

Yeah IMU error can be annoying unless you shell out $$$, but this issue occurs
with many other kinds of sensors as well.

One common trick is to treat the IMU error as another state variable that is
evolving over time. It won't eradicate the compounding error, but it will
reduce the amount of calibration necessary.
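
A sketch of that trick (toy setup and numbers mine, with a hypothetical
unbiased reference sensor alongside the biased one - the bias is only
observable given some independent information):

```python
import numpy as np

# Bias-as-a-state: the state is [position, bias]. A reference sensor (think
# GPS) plus the biased sensor let the filter estimate the bias online
# instead of requiring up-front calibration.

rng = np.random.default_rng(0)
true_pos, true_bias = 5.0, 0.8

x = np.array([0.0, 0.0])                    # initial guesses for [pos, bias]
P = np.diag([10.0, 1.0])                    # initial uncertainty
H = np.array([[1.0, 0.0],                   # reference sensor: position only
              [1.0, 1.0]])                  # biased sensor: position + bias
R = np.diag([0.5, 0.1])                     # per-sensor noise variances

for _ in range(500):                        # static state, so no predict step here
    z = np.array([true_pos + rng.normal(0, np.sqrt(R[0, 0])),
                  true_pos + true_bias + rng.normal(0, np.sqrt(R[1, 1]))])
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(2) - K @ H) @ P

print(x)  # position near 5.0, bias estimate near 0.8
```

In a real IMU setup the bias also gets a slow random-walk process model, so
the filter keeps tracking it as it drifts.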

------
piracykills
Interesting - I don't have the strongest math background so I'm usually
somewhat intimidated by topics like this, but this makes it seem like a fairly
simple weighted average based on squared-error values, like in standard
deviation? I assume this would only work on a normal distribution too?

Makes it seem a lot more approachable than many people have made it sound to
me, but I may be horribly misunderstanding still.

~~~
dbcurtis
I struggled a long time trying to get past all the matrix math that the usual
Kalman filter tutorial starts out with. KF's make a lot more sense if you
start from an example. By the end of the example, you realize that matrix math
simplifies the notation hugely. But without the intuition about what it is
doing for you, it doesn't help much -- at least it doesn't help _me_ much.

So... my over-simplified touchstones for KF's:

1. Everything is a Bayesian quantity -- a mean (best guess) and a std. dev.
(confidence).

2. You have a model of the system state.

3. You have some sensors, and you have some model for the accuracy &
precision (noise) in the measurements.

Now, two rules:

1. The model runs open loop and at a regular cadence: "Tick tock, tick tock,
the model updates on the clock." Of course, since you are running open loop,
your confidence in every value of the model gets worse with each iteration
(std. dev. grows). How much you adjust (reduce) the confidence is based on the
precision of the system and the control inputs.

2. Sensor readings are applied as they come in: "Use 'em if you got 'em."

So.... now the moving average part... if you are twice as confident in your
current estimate as you are in the reliability of the new measurement, do a
weighted average of 2/3 of the current estimate and 1/3 of the sensor. This is
why the confidence is carried along for all state values and sensor readings.
Of course compute the weights of the moving average according to the current
confidence values with every update.

And that is about 103% of what I know about Kalman filters :)
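
In code, the two rules plus the weighted average look like this (numbers
mine):

```python
# The model "tick" inflates the variance (confidence drops); each sensor
# reading is then blended in with confidence-derived weights.

def tick(mu, var, process_noise_var):
    """Open-loop model update: best guess carried forward, confidence degrades."""
    return mu, var + process_noise_var

def sensor(mu, var, z, r):
    """Weighted average: the weights fall out of the two variances."""
    w = var / (var + r)               # weight given to the measurement
    return mu + w * (z - mu), (1 - w) * var

mu, var = tick(9.0, 0.5, 0.5)         # tick: variance 0.5 -> 1.0
mu, var = sensor(mu, var, 12.0, 2.0)  # twice as confident in the estimate as in the sensor
print(mu, var)                        # 10.0 = 2/3 of 9.0 + 1/3 of 12.0; variance 2/3
```

That `w` is the scalar version of the Kalman gain, recomputed from the
current confidences at every update.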

------
theoh
Another intuitive intro, previously:
[https://news.ycombinator.com/item?id=12648035](https://news.ycombinator.com/item?id=12648035)

------
SatvikBeri
Great explanation! If you want a very detailed deep dive into Kalman Filters,
I found this sequence of videos really helpful:
[https://www.youtube.com/watch?v=CaCcOwJPytQ](https://www.youtube.com/watch?v=CaCcOwJPytQ)

It's 55 short videos, each one going over a specific topic or example.

------
AceJohnny2
I love this. I've idly tried to understand the Kalman filter forever
(ever=10yr+), and always encountered explanations that threw you straight at
the linear algebra[1]. (yeah, even "quick" or "simple" explanation).

This explanation starting from the conceptual is great, and finally provides
the needed foundation for me to understand the rest.

[1] _huff_ not that I couldn't understand the linear algebra, just that I, uh,
didn't have time for that... >_>

~~~
xchip
Thanks, I wrote this because I had been stuck in that same situation for ages
too

------
rweba
Here is another "simple" explanation:

[https://rweba.livejournal.com/446326.html](https://rweba.livejournal.com/446326.html)

It's definitely WAAAY oversimplified, but it gives the first idea:

The "filter" just cleverly averages the noisy real data at time T with the
Kalman estimate from time (T-1).

------
shaklee3
I posted this a few days ago, but I think it's the best intro to Kalman
filters I've ever seen: [https://github.com/rlabbe/Kalman-and-Bayesian-
Filters-in-Pyt...](https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-
Python)

------
dennisgorelik
The inventor of Kalman Filter:
[https://en.wikipedia.org/wiki/Rudolf_E._K%C3%A1lm%C3%A1n](https://en.wikipedia.org/wiki/Rudolf_E._K%C3%A1lm%C3%A1n)

Born: Rudolf Emil Kálmán May 19, 1930 Budapest, Hungary

Died: July 2, 2016 (aged 86) Gainesville, Florida

------
YSFEJ4SWJUVU6
The page breaks the back button.

~~~
knolan
Seems fine for me on MacOS Safari.

~~~
ly
It's broken for me on MacOS Firefox

------
gnachman
How do you know the variance of your sensors? Do you have to experiment in a
controlled environment?

~~~
xchip
Good question, yes, you measure it when the rocket is on the ground.

------
eggie5
also useful for incorporating noisy GPS sensor readings! I tried to do this in
an undergrad autonomous vehicle project but failed at the time :(

------
xchip
Kalman experienced people, just wondering, what did you use the filter for?
And, what are you guys working on nowadays?

~~~
bit1
An aircraft's Flight Management Computer uses a Kalman filter to combine the
data from all available sensors (GPS, IMU, altimeter, ADC, VOR, DME, etc.) to
compute the position of the aircraft.

~~~
caf
A GPS itself uses a Kalman filter to combine the position estimates from many
satellites as well.

------
randyrand
More gain 'k' explanation. 'K' determines how much weight you give sensor A
and sensor B.

A gain of 1.0 means B is weighted 100% and A is 0%. Gain of 0.5 means each
gets 50/50 weight.

Look at the maths and notice how when the gain is 1.0, the mean just becomes
the mean of B.
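
In code (names mine):

```python
# The gain as a blend: fused mean = mean_a + k * (mean_b - mean_a),
# i.e. (1 - k) * mean_a + k * mean_b.

def blend(mean_a, mean_b, k):
    return (1 - k) * mean_a + k * mean_b

print(blend(2.0, 6.0, 1.0))   # 6.0 -- k = 1.0: mean of B only
print(blend(2.0, 6.0, 0.0))   # 2.0 -- k = 0.0: mean of A only
print(blend(2.0, 6.0, 0.5))   # 4.0 -- 50/50 weighting
```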

~~~
caf
Notice also that if you replace k with (1-k) then you get the same results,
but with a and b swapped.

------
nerfhammer
This seems to use measurement and average interchangeably? a measurement is an
instantaneous "average"?

if you figure out a decayed moving average as an "average"... why not just use
that smoothed value directly? what do you even need variance for?

------
RosanaAnaDana
Hmm. Could probably hammer this into a form applicable for a raster, where the
previous 'instance' is a linear interpolation via Roberts cross or some such.

------
IshKebab
> those readings represent gaussians and combining them means multiplying them

I don't think he means multiplying surely? That part is unclear.

~~~
mlevental
how else would you combine PDFs?

~~~
IshKebab
Ah I realised he meant the product of the two PDFs but it sounded like he
meant the PDF of the product of two normally distributed random variables.
Which would make no sense.

------
albertTJames
Very cool way of introducing kalman filters

------
signa11
actually, iirc, there is a very cool document by ramsey (i guess) which
attempts to provide a simple and intuitive derivation of the Kalman filter
without going too deep into mathematics...

------
jquast
much like a baby bird learns to fly

