Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: Markov chains explained visually (setosa.io)
1070 points by vicapow on July 29, 2014 | hide | past | favorite | 92 comments

Beautiful. I had seen Markov chains mentioned before, but had not looked them up. Skimming the wikipedia page made sense (it's a state machine with transitions determined by probabilities instead of defined events), but I would not have had an intuitive understanding of why they are useful. The explanation mid-way down about modeling the distribution of sunny and rainy days really made it click for me.

The explanation mid-way down about modelling the distribution of sunny and rainy days really made it click for me.

Another very easy to understand is language. Say, I give you a small text. You could create a small state machine containing the possible transitions between words. You could also compute probabilities (estimated from the text) of going from one word to another (e.g the -> text vs. the -> possible, etc), and you'll have a Markov chain.

Of course, this is a very weak model of language, and usually in such models probabilities are modelled on at least the previous two states. But it turns out to be very useful in practice, e.g. you can compute the fluency of different formulations of the same semantics.

Is that how things like Swiftkey or Google Now can predict words so well? If so, how do they do it so quickly?

Ray Kurzweil, one of the fathers of speech recognition, talks in his book How to Create a Mind about his use of "Hidden Markov Models" as a breakthrough in speech recognition work. I believe Kurzweil used his HMM's to recognize the flow of sounds and speech patterns as words, so it's not exactly the same as what Swiftkey does with word prediction, but it's safe to safe Markov chains have an important place in language modeling.

Hidden Markov Models: http://en.wikipedia.org/wiki/Hidden_Markov_model

Kurzweil's Book (He goes into more detail on speech recognition as well as a slew of other topics): http://www.howtocreateamind.com/

Hidden Markov Models are also an important tool in computational biology - we use them to represent and search for motifs in sequences, such as DNA binding sites, protein domains with a particular function, etc.

That's indeed one way to implement predictive typing. Looking up the list of possible words is not expensive, you basically step through a small automaton. The challenge is making the models small enough to fit it on a phone. See e.g.:


> Is that how things like Swiftkey or Google Now can predict words so well? If so, how do they do it so quickly?

Yes. It's most definitely an important part of their algorithm.

But why would it be slow? The average person has a vocabulary of max. 50-100k words. Maybe it was even less, I forget. You'd probably do more than fine with just the top 10k anyway.

With a clever data structure, that's peanuts for today's mobile hardware.

In particular (and I dunno if that's how they do it) just the lookup needs to be fast, the storing of new words can be done offline in some lost half-second when the user isn't interacting with the device. With that in mind they can even use really cool data-structures such as tries that can do super-efficient partial prefix matching.

Except they also seem to do fuzzy matching, so there needs to be some Levenshtein distance type of thing in there. That can be costly, but it's also been around for some decades and that part doesn't need to be super-exact, so I bet there's some really clever tricks for that as well. If anyone knows, I'd love to hear about it too :)

Very probably with the Viterbi algorithm: http://en.wikipedia.org/wiki/Viterbi_algorithm

Viterbi is used for decoding e.g. HMMs, when there are multiple possible state sequences given the observations, and you want to find the most probable. In a normal (non-hidden) Markov model, you can just follow the state transitions and get out a probability (or go to a state and see what the next possibilities are).

An HMM is exactly what you have in both predictive typing and speech recognition, since in both cases you've got some form of sensor noise to deal with.

But predictive typing usually corrects per word. Given a sequence of words, you can give the top suggestions (like Swype and others do) using a non-hidden Markov model. For partially typed words, it's easier to take words from the same suggestion list and rank them by combination of probability and some similarity measure (e.g. edit distance). Possibly complemented by non-suggestions through some other method (e.g. levenshtein automata). If you want to correct

  predctiv tping an spech recgnition
Yes, you you'll want to use a hidden Markov model. If you already have

  predictive typing and speech recgn
then a normal Markov model will serve you fine. And since such predictive keyboards do corrections per word, they are like the latter example and not the former. For speech recognition (e.g. Google Now), you indeed need an HMM.

...which is an implementation of a Hidden Markov Model, so, yes.

Shannon's paper becomes a lot more fun once you've grokked Markov chains.


There's a bunch of fun stuff that suddenly becomes possible.

It's gently distressing that by far the most use that Markov chains have seen so far is to generate English-like gibberish text to beat spam filters.

> It's gently distressing that by far the most use that Markov chains have seen so far is to generate English-like gibberish text to beat spam filters.

I don't think that's the most use they've gotten, just the most obvious and visible use.

This is really nice.

Minor nit #1: https://www.dropbox.com/s/2meqa8hhen9ztba/Screenshot%202014-... Seems like the graph visualization is sticking to the wrong coordinates (dragging it to the left doesn't help; it moves back to the center)

Minor nit #2. I'd love to see a visualization of the "probability mixing" interpretation of markov chains and stationary distributions, which is what PageRank is really about. That is, it'd be really nice to have a visualization of the fact that Markov chains are ultimately memoryless (it eventually doesn't matter in which state you start for the distribution of events). I think it could be done by exchanging "probabilities conditioned on the past", which is most easily done by multiplying the entire probability vector by the stochastic matrix and visualizing that.

One way to get at what you're saying in #2 is to have a counter that accumulates the number of epochs spent in state 1, state 2, state 3, for one of these examples. Have a way to reset this counter, and compute the relative frequency (by normalizing the counts-in-each-epoch by the number of epochs so far).

Come to think of it, you could even calculate the first eigenvector of the transition matrix, and show this eigenvector next to the relative frequency mentioned above.

In fact, you could then plot an L1 distance of the relative frequency (above) to the eigenvector, with one point for each epoch. This distance would decrease (non-monotonically) towards zero as the number of epochs increases. I believe it is (in the expectation sense) exponentially fast, but with an exponent that depends on the transition matrix. (Typical transition matrices would exhibit fast mixing, http://en.wikipedia.org/wiki/Markov_chain_mixing_time).

Now look up Hidden Markov models, http://en.wikipedia.org/wiki/Hidden_Markov_model

How they can be calibrated in the finite case http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

And how they can be evaluated for arbitrary models http://en.wikipedia.org/wiki/Particle_filter

This is fantastic. I have encountered markov chains in my career and always thought of them as a black box. This simple visualization makes it so easy to understand what has previously been so hard for me. Thank you.

The one thing to add to this is that usually each state doesn't emit a single token ("I am in state 1" then "I am in state 2") but instead you assume that each state has a range of possible actions and the likelihood of a choice of action varies with state.

So if might not be that your model is sunny versus rainy but instead cold front v warm front. Since rain is more likely during a cold front your observation of rain increases your belief that the system is in the "cold front" state.

That's a Hidden Markov Model.

Thought I'd point out a little typo. In your table with sliders for adjusting probabilities of state transitions, the P(B|B) probability reads "P(B|A)".

Edit 1: Also, P(A|B) reads "P(A|A)".

Edit 2: Not trying to be too nitpicky, though. It's a really nice visualization. Really excited about the growing use of d3 to visualize algorithms. Is this inspired by Mike Bostock's post by that title?

Presumably in the B row it should read "P(A|B)" and "P(B|B)"

Thank you! I understood what Markov Chains are now. Nicely done and in a simple understandable fashion.

I am also trying to understand what they call Hidden Markov Model (specifically, I just cannot wrap my head around how it gets used in speech. They just look like entirely different things). Would be awesome to see an update with the Hidden MM.

To understand an HMM: Start with a MM with two states. In state "A" we always observe "X" in state "B" we always observe "Y". That's a basic MM. Now lets change it to a HMM. Now, in state "A" we observe "X" 90% of the time and "Y" 10% of the time, and in state B we observe "Y" 90% of the time and "X" 10%. The observations don't tell us exactly what state we're in. The state is "hidden." We still get a clue about the state, but there is some uncertainty.

This is an excellent explanation, thanks!

A short overview of using HMMs for speech recognition:

Speech is temporal, a speech sample can be represented as a sequence of data points. Thus, a simple way to compare two speech samples can be an algorithm to compare their corresponding sequences. One such algorithm is DTW, which is an equivalent of the Levenshtein distance algorithm for comparing strings.

So now we have a method of comparing two speech samples. Thus, a simple way to recognize an unknown speech sample can be to keep samples of all possible utterances (phonemes, words, sentences) and just return the best match. That’s not possible - so in speech recognition, they only keep samples of phonemes, and try to find the most likely concatenation of them relative to the unknown sample. The search space is huge, but luckily dynamic programming algorithms exist to make the search fast (they’re called Connected Word Recognition algorithms).

The problem with this approach is that it does not scale. There are tons of variations possible for the same spoken unit (different styles and durations). This is where the Hidden Markov Model comes in. A single HMM can be used to represent all variations of that unit. For example, it can be used to represent multiple speech samples of the word “apple”, or of the phoneme “æ” and so on. There are methods for evaluating the similarity of an HMM and a speech sample, and for training an HMM using multiple samples.

You can also connect phoneme HMMs together to get a big HMM representing words/sentences. The same CWR algorithms as before apply.

Imagine that you've got a markov model just like what this article suggests. Now imagine that in each round your markov model emits a state and then sends that state to a noise box which converts that state into some kind of sound.

As a simple model, say S1 leads to a rising C, S2 leads to a falling F, S3 leads to a stable Bb.

In a more complex model, this sound box is actually plays two or three sounds all at once. As a further complexity, many output sounds overlap between states.

If you're a scientist observing just the sounds being made then you're talking about an HMM (because the markov model is "hidden" behind the sound box). We model speech this way by assuming (sort of) that speech is composed of phonemes (each state) which might sort of overlap in their actual sound.

So a scientist using an HMM tries to reverse the process—listen to the sounds and figure out the phonemes that generated it by knowing something about how people make transitions between phonemes in, say, English words.

Suppose the Markov "state" is what word a user just intended to speak. You can use a non-hidden Markov Model to generate sentences.

But for speech-recognition, you don't know what the user said (that's your goal) and you're guessing based on audio sensor data. This means the true state is "hidden" from you, and you're trying to make educated guesses.

Using a Hidden Markov Model allows you to combine your imperfect indicators (audio samples that sound like certain words) with statistical knowledge (how likely certain words are to follow other words) to improve your guesses.

I've seen Markov chains applied to language generation - producing sentences that make sense grammatically but not literally. Anyone know what the connection is here? I think I have an idea but would like to see if it gets independently verified by someone else.

There are complicated ways of doing this, but the naïve way is as follows:

First you need a corpus of text that's grammatically correct

Each node in the chain is a word or piece of punctuation. Each word has a certain probability of being followed by every other word in the corpus, including itself. There are a few different ways to start the sentence. One approach is to start from the node for the punctuation mark ".", and only selecting a following node that is not a period, since sentences don't tend to start with punctuation. From there, use a random number generator to pick a following node based on your probability matrix, rinse, repeat.

If you'll notice, there's no guarantee that it will be grammatically correct. There's just some statistical likelihood that it will be.

If you'll notice, there's no guarantee that it will be grammatically correct. There's just some statistical likelihood that it will be.

Which is also true for human speakers.

Here is cool generator that demonstrates this in action: http://projects.haykranen.nl/markov/demo/

Markov chains work well for artificial language and name generations. One of my first programs in 1979, when I was a 13yr old kid, was to sum up the occurrence of characters following other characters in text files, and to roll dice on this table to generate names for role playing games.

I later learned that I reinvented Markov this way. I still have those printouts, and use them when ever I need a name for a role playing game NPC.

Someone made a Twitter account to generate baby names with markov chains, actually: https://twitter.com/markovbaby

Yeah, same. The first time I ever heard "Markov chains" was when I was a 13-year-old hanging out on black hat SEO forums. Back in the day, software like "YASG" (or something like that, "yet another site generator") made a lot of money with simple PHP scripts for rewriting Wikipedia articles with Markov chains, thus skirting Google's duplicate content algorithms, which were much weaker at the time.

There were actually quite a few vulnerabilities back then. I wrote a script that allowed me to create a Google account with the IP address of a visitor to the website, all without them knowing. All I had to do was open the registration page with a server side script, download the CAPTCHA, display it to the user and ask them to fill it out. When they filled it out, the submitted form targeted the Google registration form in a 1x1 iFrame, then another button targeted the logout form. Google was not checking the referrer of the sign up form, nor was it comparing the IP address that received the CAPTCHA to the one that submitted it.

I had a friend load that script into thousands of generated blogspot blogs, which got long tail google traffic and asked the user to "fill the CAPTCHA to continue." The script ran for ~2 weeks and generated ~60000 Google accounts all from unique IP addresses.

That was around 2007, so obviously it's all patched up by now. I was 15 years old and never did anything with the accounts, so if anyone from google is reading this, keep the lawyers away from me please.

Blackhat SEO actually has a lot of clever tricks. I haven't been part of that space in a while, for a lot of reasons, but I can attribute 99% of my knowledge of marketing to time spent trawling through blackhat SEO forums reading not my about that, but also landing page optimization, conversion tracking, copywriting, etc. It was definitely a worthwhile learning experience for me.

Oops that veered off topic. But yes, Markov chains. Blackhat content generation. :)

Usually, you'd have a row and column per word in your dictionary. Rows would represent the current word, and the columns would represent the next word in the sentence, with the transition probability of each cell in the matrix being determined by counting the frequency of occurence of each word pair in some body of text.

Here's a nice introduction/exercise which is part of a python tutorial.


This is an absolutely magical and intuitive (not to mention beautiful) way to imagine the complex mathematical concept of a Markov Chain. This is the exact sort of pedagogical tools that MOOCs and other educational software platforms need to build and adopt to bring education into the 21st Century and finally replace traditional teaching methods. How does this compare to what even the best teacher could draw on a whiteboard? Teachers will still play an essential role in the emotional and social development of students, and they can then focus their energy on these things which software probably will struggle to ever replace.

Back when I was in college (~20 years ago) I was struggling to understand generative models, and I asked my CS professor.

he said, "imagine god is sitting around emitting DNA sequences. She has sevearl 4-sided biased dice, rolls one of the 4-sided die, BAM, emit an A! Again, roll the die, BAM, emit a A! Roll again, BAM, emit a T! Now, imagine god is a fickle person, and between rolls, decides to roll a die to decide which of the biased die to roll.

For some reason, that helped.

Very nice! I've never had to work with Markov chains but I've read about them and they seem to pop up in lots of places.

Nice and simple and interactive explanation.

Gian Carlo Rota is always a pleasure to quote, despite he knowing it. One from his reminiscence about Jack Schwartz, in his "Indiscrete Thoughts" Book (TL;DR: Markov Chains seen as random maps):

The first lecture by Jack I listened to was given in the spring of 1954 in a seminar in functional analysis. A brilliant array of lecturers had been expounding throughout the spring term on their pet topics. Jack's lecture dealt with stochastic processes. Probability was still a mysterious subject cultivated by a few scattered mathematicians, and the expression "Markov chain" conveyed more than a hint of mystery. Jack started his lecture with the words, "A Markov chain is a generalization of a function." His perfect motivation of the Markov property put the audience at ease. Graduate students and instructors relaxed and followed his every word to the end.

Beuatiful visualizations.

I created a Markov chain generator: https://gist.github.com/grant/561834963dc526495c45

var numNodes=10;var roundNum=100;var a=[];for(var i=0;i<numNodes;++i){var connections=[];var sum=0;for(var j=0;j<numNodes;++j){var randNum=Math.random()/numNodes;randNum=Math.round(randNumroundNum)/roundNum;connections[j]=randNum;sum+=randNum}connections=connections.map(function(e){var t=e(1/sum);t=Math.round(troundNum)/roundNum;return t});sum=connections.reduce(function(e,t){return e+t});connections[numNodes-1]+=1-sum;connections[numNodes-1]=Math.round(connections[numNodes-1]roundNum)/roundNum;a[i]=connections}console.log(JSON.stringify(a))

Copy and paste the output into the side bar.

Please indent each line with a double space so it's properly monospace formatted.

Yeah, your comment breaks the readability of this comment thread on my mobile (as in, everything is horrible stretched horizontally).


It really helps, I used to learn Markov Chain Theory in graduate school, but either I was not paying attention, or the tutor did not really intended to get things better-explained, I never truely care about its real life application.

But this tutorial, both visually attractive and expained with real life examples, make me want to re-learn this topic. Just quote one passage:

[if you made a Markov chain model of a baby's behavior, you might include "playing," "eating", "sleeping," and "crying" as states, which together with other behaviors could form a 'state space': a list of all possible states.]

Thanks for sharing!

This is really great, but could you put in a bit how some transition matrices aren't markov (e.g. [0 1; 1 0]) and the convergence criterion where you can take M^n n->infinity and get the occupancy of the states?

Why isn't your example Markov?

It is in fact Markov; Markov just means that the probability distribution of the future depends only on the present, and so the past adds no additional information in conjunction with the present. That's certainly the case here.

This is an example of a Markov chain that is not aperiodic; what that means is that, given a starting node, at any point in time in the future, it will always be the case that it is impossible to be at a certain node. This ends up meaning that the Markov chain never ends up reaching a steady state; rather, its behavior is periodic!

fix some n in n. if my starting state is [1; 0] then the probability of the occupancy being [1; 0] after n cycles is either 1 or 0. If the starting state is [0; 1] then the probabilty of the occupancy being [1; 0] is exactly the opposite, so for a fixed point in the future, the probability is tightly past-dependent.

Yes, but I think it is still a Markov process since the information is contained in its current state.

There was a bug in the playground earlier that I just fixed that allows you to share your Markov chains via the url hash. For example: http://setosa.io/markov/#%7B%22tm%22%3A%5B%5B0.9%2C0.1%2C0%2...

Hey, I think that link is so long it's breaking HN's layout! Reckon you can chuck it in a URL shortener or something? :)

Verboten on HN.

Great! It would be nice to be able to stop the animations though, they are distracting while you are trying to read the text.

The sunny/rainy probability example is perfect as a scenario.

This is at a tangent, but I'm a fresh CS undergrad and this simple explanation really hooked me.

So my question is, where can I find more of this stuff? MOOCs are tough to manage with university, but if I wanted to learn more about these mathematical concepts presented in an interesting way, where should I start looking?

I'm a tad bit indecisive about how good I am with CS theory but I know if I took the leap and mastered some basics I would enjoy it. Any recommendations will help.

Dasher http://www.inference.phy.cam.ac.uk/dasher/ takes some of the Markov chain ideas to the next level. If you are a visual thinker it is a good way to get a feel for probabilistic compression techniques, specifically PPM and arithmetic coding.

Also good is the Viterbi Workshop (EDIT: unfortunately the URL below leads to a broken version of the applet.) http://www.alantro.com/viterbi/workshop.htm It used to have animated Java illustrations of the Viterbi algorithm, mentioned in other comments on this post. There are other working applets out there, e.g. http://www.wirelesscommunication.nl/reference/chaptr05/recei...

The general field is called "algorithm animation" but it seems be out of fashion. The old SRC Modula-3 distribution used to contain a cool package for creating animations of algorithms. Here is a video made by the authors: http://www.youtube.com/watch?v=zIgu9q0vVc0

Again a bit of a tangent, but I wrote an essay on learning math on your own based on my experiences: https://medium.com/@amathstudent/learning-math-on-your-own-3... - perhaps it will help you.

the other visualizations on setosa.io are pretty great. There really aren't enough of these around if you ask me. another set of good tutorials that are at this level is Red Blob Games: http://www.redblobgames.com/ focused on games but introduces math/coding concepts very intuitively and in a practical manner.

Man the text is really well written! Am I right, everyone?

Lewis, is that you?

I wanted to tell Lewis that his domain (heylewis) is expiring.

Also, yes. Keep it up!

Very nice concept; a mvp

Just on first glance: 1. first diagram and others, ball jumps from beginning of BtoA arc to B without sliding along the arc.

2. second diagram box was no P(B|B). That is boxes are mislabeled.

3. strange, but arcs are sometimes at an angle. It appears to happen if they are scrolled to, but no if drawn on the initial screen.

4. while the R S on the next diagram does settle to a steady state, it starts with random Rs and Ss marching across at random rates.

Good Start!

That's pretty cool. The markov chain diagrams seem very similar (identical?) to deterministic finite automota. Would it be correct or incorrect to say that a Markov Chain can be thought of as a DFA where the changes in state are determined by probability?

Yes, that's exactly what it is. There's nothing more to it.

Great explanation, but wow, those animations were awful! The movement was great, but the seizure-like jump every time it hit a state was unwatchable. I had to cover them up to get through the text.

Has anyone thought about or attempted to model game AI with Markov Chains instead of decision trees? Ex: NPCs, wildlife or enemies that use Markov Chains to react to their surroundings.

Not sure about individual-scale models, but Markov Chains are common in Mathematical Ecology. You can do quite a lot in animal group modeling with transition matrices and basic probability theory. This is a good book if you can get it cheap (like most monograph textbooks): http://www.amazon.com/Introduction-Stochastic-Processes-Biol...

Also, props to submission author Victor Powell. Those force-directed graphs are visually interesting, add to the article, and are even responsive (try changing the window size).

Just a decision tree with a probability of transitioning to a branch?

Beautiful. What did you use to create the graphics?

I'd be interested in this as well. Is it D3?

From browsing the source, it looks like the graphics are rendered in D3 and the sliders are connected to the graphics using Angular.

I'm outta control on this page!

[ [0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1], [0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1], [0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1], [0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1], [0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1], [0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1], [0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1], [0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1], [0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1], [0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1] ]

Great explanation! Very easy. It could be perfect if you added some easy to understand real life examples to play with.

Very cool. Would love to see this generalized to an interactive visualization of Markov Decision Processes (MDPs).

And where is Temple OS creator's comments? I believe they are markov chains of a sort.

This totally reminds me of SpaceChem on higher levels (puzzle game for programmers).

I would not "require" people to know Markov chains but I am usually surprised how many programmers have no idea what it is and how it works and how it can be used. It is a very powerful tool to model queues which is something most distributed systems deal with :).

Can I use this for my class? Creative commons with attribution?

To answer your first question, yes.

Look so cool! Still hard to understand

This is really cool, nice work.

Excellent visualization!

Really cool: well done!

can I fork these?


Here's a model of chutes and ladder using Markov http://www.datagenetics.com/blog/november12011/index.html

And another for Candyland http://www.datagenetics.com/blog/december12011/index.html

Nice. These articles are great. Very much appreciated.

"For example, the algorithm Google uses to determine the order of search results, called PageRank, is a type of Markov chain."

I had to research that to understand it: http://en.wikipedia.org/wiki/PageRank

Here is some key text from Wikipedia:

Google recalculates PageRank scores each time it crawls the Web and rebuilds its index. As Google increases the number of documents in its collection, the initial approximation of PageRank decreases for all documents.

The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain in which the states are pages, and the transitions, which are all equally probable, are the links between pages.

If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again.

That's increadibly easy (it is!)

I can't understand why Google owns it.

Especially today ("~15 years later") and with a lot of good open source software available.

The problem with Markov chains is that they are named after a person, which makes math seem more like a "private club". For instance, why use "abelian group", when "commutative group" will do? The reasons for wanting to be a member of an exclusive group are psychological.

Applications are open for YC Summer 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact