What is missing from Steve's clear explanation is (1) a graphic depiction of the central data structure called the "trellis" (Fig. 6 in the above paper) and (2) a citation to the original Viterbi paper [1].
For many years, HMMs - together with a Viterbi triphone decoder - defined the state of the art in ASR (automatic speech recognition), until neural networks eventually took over.
[1] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Trans. Inform. Theory, vol. IT-13, pp. 260-269, Apr. 1967.
My favorite story about Viterbi coding is that apparently they launched the Voyager probe with a convolutional encoder even though at the time there wasn't a computer on Earth capable of running the Viterbi decoder in real time (encoding is far less computationally intensive than decoding). Years later they remotely switched over to the Viterbi-decoded mode to enable higher bitrates.
This is very much like the least action principle [0]. The optimal path between two points is the combination of the optimal path from the starting point to an intermediate point and the optimal path from that intermediate point to the end point.
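In symbols: if best(x, y) is the cost of the optimal path from x to y, and every path has to cross some intermediate layer of points m (in the Viterbi trellis, the states at a given time step), then

    best(x, y) = min over m of [ best(x, m) + best(m, y) ]

which is exactly Bellman's principle of optimality that the dynamic programming exploits.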
The Viterbi algorithm, the Baum-Welch algorithm, the forward-backward algorithm, beam search, etc. are all very closely related, all using basically the same kind of dynamic programming. Viterbi calculates the max and argmax; Baum-Welch (forward-backward) calculates the sum (forward) and then also the posteriors (backward). The difference is just max vs. sum, and to get the argmax, you just backtrack. Beam search is an approximation of the max and argmax.
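To make the max-vs-sum point concrete, here is a minimal sketch in Python (the toy interface - log-space init/trans/emit tables and an observation list - is mine, not from any particular library): the same trellis pass does Viterbi with max and the forward pass with logsumexp, and the stored backpointers give the argmax path.

    import math

    def trellis_pass(init, trans, emit, obs, use_max=True):
        # init[s], trans[i][j], emit[s][o] are log-probs; obs is a list
        # of symbol indices. use_max=True -> Viterbi; False -> forward.
        if use_max:
            combine = max
        else:
            # logsumexp; numerically unstable, but fine for a sketch
            combine = lambda xs: math.log(sum(math.exp(x) for x in xs))
        S = len(init)
        alpha = [init[s] + emit[s][obs[0]] for s in range(S)]
        back = []
        for o in obs[1:]:
            prev, alpha, ptr = alpha, [], []
            for j in range(S):
                cands = [prev[i] + trans[i][j] for i in range(S)]
                alpha.append(combine(cands) + emit[j][o])
                ptr.append(max(range(S), key=lambda i: cands[i]))  # best predecessor
            back.append(ptr)
        return alpha, back

    def backtrack(alpha, back):
        # Argmax at the last frame, then follow the backpointers home.
        state = max(range(len(alpha)), key=lambda s: alpha[s])
        path = [state]
        for ptr in reversed(back):
            state = ptr[state]
            path.append(state)
        return path[::-1]

Beam search would be the same loop, just keeping only the k best entries of alpha per frame.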
Btw, with automatic differentiation, just calculating the forward scores (sum) is enough: automatic differentiation then automatically gives you the backward pass.
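A hedged PyTorch sketch of that (a toy HMM with random parameters, my own variable names): differentiating the forward log-score with respect to the emission scores yields exactly the state posteriors that an explicit backward pass would compute.

    import torch

    T, S = 6, 3                                    # frames, states
    emit = torch.randn(T, S, requires_grad=True)   # emission log-scores
    trans = torch.log_softmax(torch.randn(S, S), dim=1)
    init = torch.log_softmax(torch.randn(S), dim=0)

    alpha = init + emit[0]
    for t in range(1, T):
        # alpha[j] = logsumexp_i(alpha[i] + trans[i, j]) + emit[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + trans, dim=0) + emit[t]
    log_Z = torch.logsumexp(alpha, dim=0)          # forward score (sum over paths)

    log_Z.backward()
    posteriors = emit.grad  # emit.grad[t, j] == p(state at time t is j | obs)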
This is not just for HMMs but it applies to many kinds of search. Naturally for CTC (can be seen as a simple special case of HMM), transducer, but also encoder-decoder or decoder-only (language) models, and more.
Do you have a good reference (or references) that starts from the basics and goes all the way to the CTC loss? The distill.pub article is good, but personally it doesn't give me a good enough intuition...
You can study CTC in isolation, ignoring all the HMM background. That is how CTC was originally introduced: the paper mostly ignores the existing HMM literature. So e.g. look at the original CTC paper. But I think the distill.pub article (https://distill.pub/2017/ctc/) is also good.
For studying HMMs, any speech recognition lecture should cover them. We teach this at RWTH Aachen University, but I don't think there are public recordings; you should be able to find other lectures online, though.
The relation of CTC and HMM becomes intuitively clear once you get the concept of HMMs. In speech recognition, this is often all formulated in terms of finite state automata (FSA), finite state transducers (FST), or weighted FSTs (WFST). The CTC FST just looks a bit different (simpler) than the traditional HMM topologies, but in all cases you can think of states with possible transitions.
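A small sketch of the CTC topology (the helper names are mine): interleave blanks with the labels, then each state allows a self-loop, a step to the next state, and a skip over a blank when the neighboring labels differ.

    def ctc_states(labels, blank=0):
        # e.g. label ids [3, 1, 20] -> [0, 3, 0, 1, 0, 20, 0]
        ext = [blank]
        for l in labels:
            ext += [l, blank]
        return ext

    def ctc_transitions(ext, blank=0):
        # From state s: stay (s -> s), advance (s -> s+1), and skip the
        # blank (s -> s+2) only between two *different* labels, so that
        # repeated labels are forced through a blank.
        trans = []
        for s in range(len(ext)):
            trans.append((s, s))
            if s + 1 < len(ext):
                trans.append((s, s + 1))
            if s + 2 < len(ext) and ext[s] != blank and ext[s + 2] != ext[s]:
                trans.append((s, s + 2))
        return trans

A traditional HMM topology would look the same in this representation, just with different states (e.g. several per triphone) and transitions.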
This is all mostly about the modeling; the training differs more. For CTC, you usually calculate the log-prob of the full label sequence by summing over all possible alignments, while for HMMs, people often use a fixed alignment and calculate framewise cross entropy.
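For illustration, with PyTorch both losses are one call (shapes as in the torch.nn.functional docs; the fixed alignment here is random, just to show the form of the computation):

    import torch
    import torch.nn.functional as F

    T, N, C, L = 50, 1, 20, 10    # frames, batch size, classes, target length
    log_probs = torch.randn(T, N, C).log_softmax(-1)

    # CTC training: sum over all alignments of the L labels to the T frames.
    targets = torch.randint(1, C, (N, L))    # 0 is the blank
    loss_ctc = F.ctc_loss(log_probs, targets,
                          torch.full((N,), T, dtype=torch.long),
                          torch.full((N,), L, dtype=torch.long), blank=0)

    # HMM-style framewise training: a fixed alignment gives one label per
    # frame, and the loss is ordinary framewise cross entropy.
    alignment = torch.randint(0, C, (T * N,))   # stand-in for a real alignment
    loss_ce = F.nll_loss(log_probs.view(T * N, C), alignment)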
The analogy speaks to the number of paths that need to be considered so that, later, we can find the overall shortest path (there are likely other intermediate places we're also evaluating, along with Toronto). Once we know the shortest path from Waterloo to Toronto, that's the only one we need to consider (on that segment) from then on.
The classic failure of this approach is a form of the hill climbing problem (fastest way from Cambridge to Palo Alto is to start driving east to the airport).
It simply says that if you want to evaluate all paths from Cambridge to Palo Alto that go through the airport, you only need to consider those that use the quickest way of getting from Cambridge to the airport, because all others can't be faster than that. This is different from getting stuck in a local minimum, it's merely a dynamic programming approach to reducing the computational burden of finding the global one.
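A toy version of that pruning argument in Python (numbers made up): every full route via the airport extends some Cambridge-to-airport prefix, so keeping only the best prefix loses nothing.

    to_airport = {"I-90": 25, "back roads": 40}          # minutes
    airport_to_pa = {"nonstop": 390, "one stop": 450}    # minutes

    best_pruned = min(to_airport.values()) + min(airport_to_pa.values())
    full = {f"{p} + {s}": pc + sc
            for p, pc in to_airport.items()
            for s, sc in airport_to_pa.items()}
    assert best_pruned == min(full.values())  # pruning is exact, not greedy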