Steve's Explanation of the Viterbi Algorithm (2003) (toronto.edu)
71 points by gballan 11 months ago | 16 comments



The best introduction to HMMs and the Viterbi algorithm is the classic tutorial by Rabiner (see e.g. https://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf ).

What is missing from Steve's clear explanation is (1) a graphic depiction of the central data structure called the "trellis" (Fig. 6 in the above paper) and (2) a citation to the original Viterbi paper [1].

For many years, HMMs, together with a Viterbi triphone decoder, defined the state of the art in ASR (automatic speech recognition), until neural networks eventually took over.

[1] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Trans. Inform. Theory, vol. IT-13, pp. 260-269, Apr. 1967.


My favorite story about Viterbi coding is that they apparently launched the Voyager probe with a Viterbi encoder even though, at the time, there wasn't a computer on Earth capable of decoding the stream in real time (encoding is much less computationally intensive than decoding). Years later they remotely switched the downlink over to Viterbi coding to enable higher bitrates.


This is very much like the least action principle [0]. The optimal path between two points is the combination of the optimal path from the start to an intermediate point and the optimal path from that intermediate point to the end.

[0]: https://en.wikipedia.org/wiki/Stationary-action_principle


The text for understanding and implementing HMMs is "A Revealing Introduction to Hidden Markov Models" by Mark Stamp; the PDF is a Google search away!



The Viterbi algorithm, the Baum-Welch algorithm, the forward-backward algorithm, beam search, etc. are all very closely related, all using basically the same kind of dynamic programming. Viterbi calculates the max and argmax; Baum-Welch (forward-backward) calculates the sum (forward pass) and then also the posteriors (backward pass). The difference is just max vs. sum. And to get the argmax, you just backtrack. Beam search is an approximation to the max and argmax.
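The max-vs-sum point can be made concrete in a few lines. Here is a minimal sketch in plain Python (a toy 2-state HMM with made-up probabilities) where the Viterbi and forward recursions are literally the same loop, differing only in max vs. logsumexp, plus the backtracking for the argmax:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Toy 2-state HMM in log space (all numbers hypothetical).
LOG_INIT = [math.log(0.6), math.log(0.4)]
LOG_TRANS = [[math.log(0.7), math.log(0.3)],
             [math.log(0.4), math.log(0.6)]]
LOG_EMIT = [[math.log(0.9), math.log(0.1)],   # state 0 emitting symbol 0 / 1
            [math.log(0.2), math.log(0.8)]]   # state 1 emitting symbol 0 / 1

def decode(obs):
    """Run Viterbi (max) and forward (sum) side by side.
    The two recursions are identical except max vs. logsumexp."""
    n = len(LOG_INIT)
    vit = [LOG_INIT[s] + LOG_EMIT[s][obs[0]] for s in range(n)]
    fwd = list(vit)
    backptrs = []
    for o in obs[1:]:
        trans_scores = [[vit[p] + LOG_TRANS[p][s] for p in range(n)]
                        for s in range(n)]
        backptrs.append([max(range(n), key=lambda p: trans_scores[s][p])
                         for s in range(n)])
        vit = [max(trans_scores[s]) + LOG_EMIT[s][o] for s in range(n)]
        fwd = [logsumexp([fwd[p] + LOG_TRANS[p][s] for p in range(n)])
               + LOG_EMIT[s][o] for s in range(n)]
    # Backtrack the argmax path from the stored backpointers.
    state = max(range(n), key=lambda s: vit[s])
    path = [state]
    for bp in reversed(backptrs):
        state = bp[state]
        path.append(state)
    path.reverse()
    return max(vit), logsumexp(fwd), path
```

The forward score is necessarily at least the Viterbi score, since the sum over all paths includes the best one.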

Btw, with automatic differentiation, calculating just the forward scores (the sum) is enough: automatic differentiation then automatically yields the backward pass.
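That claim can be checked even without an autodiff framework: the gradient of the forward log-sum with respect to a per-frame emission score is exactly the state posterior that the backward pass computes. Below is a sketch (toy 2-state model, made-up numbers) where a finite-difference quotient stands in for automatic differentiation:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Toy 2-state HMM (hypothetical numbers). frame_scores[t][s] is the
# log emission score of state s at frame t.
LOG_INIT = [math.log(0.6), math.log(0.4)]
LOG_TRANS = [[math.log(0.7), math.log(0.3)],
             [math.log(0.4), math.log(0.6)]]

def log_forward_sum(frame_scores):
    """log P(observations): logsumexp over all state paths, forward pass only."""
    fwd = [LOG_INIT[s] + frame_scores[0][s] for s in range(2)]
    for t in range(1, len(frame_scores)):
        fwd = [logsumexp([fwd[p] + LOG_TRANS[p][s] for p in range(2)])
               + frame_scores[t][s] for s in range(2)]
    return logsumexp(fwd)

def posteriors(frame_scores):
    """Classic forward-backward: gamma[t][s] = P(state_t = s | observations)."""
    T = len(frame_scores)
    fwd = [[LOG_INIT[s] + frame_scores[0][s] for s in range(2)]]
    for t in range(1, T):
        fwd.append([logsumexp([fwd[t-1][p] + LOG_TRANS[p][s] for p in range(2)])
                    + frame_scores[t][s] for s in range(2)])
    bwd = [[0.0, 0.0] for _ in range(T)]   # last frame: log 1
    for t in range(T - 2, -1, -1):
        bwd[t] = [logsumexp([LOG_TRANS[s][q] + frame_scores[t+1][q] + bwd[t+1][q]
                             for q in range(2)]) for s in range(2)]
    logz = logsumexp(fwd[T-1])
    return [[math.exp(fwd[t][s] + bwd[t][s] - logz) for s in range(2)]
            for t in range(T)]
```

Numerically perturbing `frame_scores[t][s]` and differencing `log_forward_sum` reproduces `posteriors(...)[t][s]`, which is exactly what autodiff would give you from the forward pass alone.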

This is not just for HMMs; it applies to many kinds of search. Naturally to CTC (which can be seen as a simple special case of an HMM) and the transducer, but also to encoder-decoder or decoder-only (language) models, and more.


Do you have a good reference (or references) that starts from the basics and goes all the way to the CTC loss? The distill.pub article is good, but for me it doesn't provide enough intuition...


You can study CTC in isolation, ignoring all the HMM background. That is also how CTC was originally introduced: by mostly ignoring the existing HMM literature. So e.g. look at the original CTC paper. But I think the distill.pub article (https://distill.pub/2017/ctc/) is also good.

For studying HMMs, any speech recognition lecture should cover them. We teach this at RWTH Aachen University, but I don't think there are public recordings; you can probably find other lectures online.

You can also find a lot of tutorials for Kaldi: https://kaldi-asr.org/

Maybe check this book: https://www.microsoft.com/en-us/research/publication/automat...

The relation of CTC and HMMs becomes intuitively clear once you get the concept of HMMs. In speech recognition this is often all formulated in terms of finite state automata (FSA), finite state transducers (FST), or weighted FSTs (WFST). The CTC FST just looks a bit different (simpler) than the traditional HMM topologies, but in all cases you can think of states with possible transitions.
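For illustration, the CTC topology can be written down in a few lines: a blank state before, between, and after every label, with self-loops, advance transitions, and a blank-skip between two different labels. The representation below is my own sketch, not taken from any particular toolkit:

```python
BLANK = "_"

def ctc_states(labels):
    """The expanded CTC state sequence for a label sequence:
    a blank before, between, and after every label."""
    states = [BLANK]
    for lab in labels:
        states += [lab, BLANK]
    return states

def ctc_arcs(labels):
    """Allowed transitions (as state-index pairs) in the CTC topology:
    self-loop, advance to the next state, and skip the blank
    between two *different* labels."""
    states = ctc_states(labels)
    arcs = set()
    for i, s in enumerate(states):
        arcs.add((i, i))                      # self-loop, like an HMM loop
        if i + 1 < len(states):
            arcs.add((i, i + 1))              # advance
        if i + 2 < len(states) and s != BLANK and s != states[i + 2]:
            arcs.add((i, i + 2))              # skip blank between distinct labels
    return states, arcs
```

For the labels "aa" the blank between the repeated labels is mandatory (no skip arc), which is exactly what makes the CTC collapse rule unambiguous.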

This is all mostly about the modeling. Training differs more: for CTC, you usually calculate the log-prob of the full sequence by summing over all possible alignments directly, while for HMMs, people often use a fixed alignment and calculate framewise cross entropy.
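The contrast can be shown on a toy example. The sketch below (hypothetical numbers, transition probabilities deliberately left out to keep it short) computes a framewise cross entropy against one fixed alignment versus a full-sum loss over an explicitly enumerated set of alignments:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Hypothetical per-frame log-probs over 2 states for 3 frames
# (rows: frames, columns: states).
LOGP = [[math.log(0.7), math.log(0.3)],
        [math.log(0.6), math.log(0.4)],
        [math.log(0.2), math.log(0.8)]]

def framewise_ce(logp, alignment):
    """HMM-style training loss: cross entropy against one fixed alignment."""
    return -sum(logp[t][s] for t, s in enumerate(alignment))

def full_sum(logp, alignments):
    """CTC-style training loss: -log of the probability summed over
    all allowed alignments (here enumerated explicitly; in practice
    this sum is computed by the forward algorithm)."""
    return -logsumexp([sum(logp[t][s] for t, s in enumerate(a))
                       for a in alignments])

# For a 2-state left-to-right model over 3 frames, the monotonic
# alignments that visit both states in order are:
ALIGNMENTS = [(0, 0, 1), (0, 1, 1)]
```

Since any fixed alignment is one term inside the sum, the full-sum loss can never exceed the framewise loss computed against that alignment.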

I did some research on the relation of CTC training and HMM training: https://www-i6.informatik.rwth-aachen.de/publications/downlo...

Maybe my PhD thesis gives a good overview: https://www-i6.informatik.rwth-aachen.de/publications/downlo...


The analogy doesn't make sense to me. What if the shortest path from Waterloo to New York doesn't go through Toronto?

And if the point is that you already know that it does, then... the rest is beyond obvious?

What am I missing?


The analogy speaks to the number of paths that need to be considered so that, later, we can find the overall shortest path (there are likely other intermediate places being evaluated along with Toronto). Once we know the shortest path from Waterloo to Toronto, that is the only one we need to consider (for that segment) from then on.
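A tiny worked example of that argument, with made-up driving times: once a shortest-path search has settled the best Waterloo→Toronto distance, any optimal route to New York that passes through Toronto must start with exactly that segment.

```python
import heapq
from collections import defaultdict

# Toy directed graph with hypothetical driving times (hours).
EDGES = {
    ("Waterloo", "Toronto"): 1.2,
    ("Waterloo", "Hamilton"): 0.8,
    ("Hamilton", "Toronto"): 0.7,
    ("Toronto", "NewYork"): 8.0,
    ("Hamilton", "NewYork"): 9.5,
}

def shortest(edges, src, dst):
    """Plain Dijkstra; returns (distance, path)."""
    adj = defaultdict(list)
    for (a, b), w in edges.items():
        adj[a].append((b, w))
    heap = [(0.0, src, [src])]
    done = set()
    while heap:
        d, node, path = heapq.heappop(heap)
        if node == dst:
            return d, path
        if node in done:
            continue
        done.add(node)
        for nxt, w in adj[node]:
            if nxt not in done:
                heapq.heappush(heap, (d + w, nxt, path + [nxt]))
    return float("inf"), []
```

On this graph the best route to New York happens to go through Toronto, and its Waterloo→Toronto prefix coincides with the shortest Waterloo→Toronto path, so every other Waterloo→Toronto prefix can be discarded.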


Strictly speaking this analogy doesn’t hold in real life since paths in road networks are not symmetric.


This youtube channel intuitively explains algorithms on a whiteboard: https://youtu.be/IqXdjdOgXPM


The classic failure of this approach is a form of the hill-climbing problem (the fastest way from Cambridge to Palo Alto starts by driving east, to the airport).


It simply says that if you want to evaluate all paths from Cambridge to Palo Alto that go through the airport, you only need to consider those that use the quickest way of getting from Cambridge to the airport, because none of the others can be faster. This is different from getting stuck in a local minimum; it's merely a dynamic programming approach that reduces the computational burden of finding the global optimum.


With a little hand-waving, it's the same insight as in Q-learning, I think, right?


I'm unfortunately not familiar with that, but maybe someone else will see the connection.



