"In both models, there's an unobserved state that changes over time according to relatively simple rules, and you get indirect information about that state every so often.
In Kalman filters, you assume the unobserved state is Gaussian-ish and it moves continuously according to linear-ish dynamics (depending on which flavor of Kalman filter is being used).
In HMMs, you assume the hidden state is one of a few classes, and the movement among these states uses a discrete Markov chain.
In my experience, the algorithms are often pretty different for these two cases, but the underlying idea is very similar." - THISISDAVE
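(A minimal sketch of that shared structure, contrasting one filtering step in each model; standard textbook updates, toy matrices made up purely for illustration.)

```python
import numpy as np

# One Kalman filter step: continuous Gaussian state, linear dynamics.
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition (e.g. position/velocity)
H = np.array([[1.0, 0.0]])               # we observe only position, noisily
Q, R = 0.01 * np.eye(2), np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)            # current belief: mean and covariance

x_pred, P_pred = A @ x, A @ P @ A.T + Q                   # predict
z = np.array([1.2])                                       # new noisy observation
S = H @ P_pred @ H.T + R
K = P_pred @ H.T @ np.linalg.inv(S)                       # Kalman gain
x = x_pred + K @ (z - H @ x_pred)                         # update mean
P = (np.eye(2) - K @ H) @ P_pred                          # update covariance

# One HMM forward (filtering) step: discrete state, Markov transition matrix.
T = np.array([[0.9, 0.1], [0.2, 0.8]])   # T[i, j] = P(next state j | current state i)
B = np.array([[0.7, 0.3], [0.1, 0.9]])   # B[i, o] = P(observation o | state i)
alpha = np.array([0.5, 0.5])             # current belief over the two hidden states

obs = 1                                  # new discrete observation
alpha = (alpha @ T) * B[:, obs]          # push belief through the chain, weight by emission
alpha /= alpha.sum()                     # renormalize to a distribution
print(x, alpha)                          # same idea, very different bookkeeping
```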
-- vs LSTM/RNN:
"Some state-of-the-art industrial speech recognition [0] is transitioning from HMM-DNN systems to "CTC" (connectionist temporal classification), i.e., basically LSTMs. Kaldi is working on "nnet3" which moves to CTC, as well.
Speech was one of the places where HMMs were _huge_, so that's kind of a big deal." -PRACCU
"HMMs are only a small subset of generative models that offers quite little expressiveness in exchange for efficient learning and inference." - NEXTOS
"IMO, anything that be done with an HMM can now be done with an RNN. The only advantage that an HMM might have is that training it might be faster using cheaper computational resources. But if you have the $$$ to get yourself a GPU or two, this computational advantage disappears for HMMs." - SHERJILOZAIR
Hybrid DNN-HMM type systems (or CNN-HMM) still roundly beat purely RNN-based systems on public benchmarks (~5% across tasks such as Switchboard and WSJ, even a bit behind on TIMIT), and most of the best internal systems at the companies I know (barring Baidu) are still using this DNN-HMM scheme, though RNN+CTC or CNN+CTC is not far behind performance-wise these days.
RNN or CNN+CTC is tantalizing because it is conceptually simple, should "scale with data", and you can build a toy version in a few hundred lines of code! Which is a far cry from the massive complexity of most speech systems in use today.
I think we are still a ways off from replacing all HMMs with RNNs, but with tighter integration of the language model into the speech system and training the whole shebang end-to-end there will be some interesting results. There are a few areas of machine translation where people have tried this "deep fusion" with promising improvements.
An RNN can be trained to
a) Simulate the random output behavior of the sequence
b) Predict the categorical hidden states at any given sequence location from supervised training data.
An HMM is able to carry out both a and b. It also has the capacity to do:
c) Predict the categorical hidden states at any given sequence location without reliance on supervised training data.
Effectively, an HMM can identify/reveal hidden categorical states directly from unsupervised learning. To my knowledge, unsupervised labeling of hidden states is not a trivial task for RNNs.
HMMs are a good model if you know what the categorical states are before you start modeling. If you let the HMM learn the hidden states (e.g. with expectation maximization), there is no good reason to believe they will have obvious semantic value to you.
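(A minimal sketch of that unsupervised setting, assuming the third-party hmmlearn package is installed: EM/Baum-Welch fits the parameters with no state labels at all, and the recovered state ids carry no guaranteed semantics.)

```python
import numpy as np
from hmmlearn import hmm   # assumed dependency; pip install hmmlearn

# Toy 1-D observations drawn from two regimes; in practice X is (n_samples, n_features).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

# EM (Baum-Welch) learns transition and emission parameters without any labels.
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
model.fit(X)

states = model.predict(X)         # Viterbi-decoded hidden state per observation
print(model.transmat_)            # learned transition matrix
print(states[:5], states[-5:])    # state ids 0/1 are arbitrary: no semantic guarantee
```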
For RNNs, you can get very similar info. If you do have meaningful hidden state, you can treat a few labeled examples as training data for a supervised learning procedure. The RNN winds up predicting the state variable along with the future of the sequence. If you want something exploratory, you can use the RNN's hidden state activation over the course of a sequence as a general real valued vector that is amenable to cluster analysis. If the RNN state vectors segment well, it's likely these segmentations will have at least as much meaning as learned HMM states.
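(A minimal sketch of that clustering idea, assuming PyTorch and scikit-learn are available; the LSTM here is untrained purely to show the mechanics -- in practice you would train it first, e.g. on next-step prediction, and then cluster its states.)

```python
import torch
import numpy as np
from sklearn.cluster import KMeans

# Toy sequence of feature vectors: shape (1, T, n_features).
T, n_features, hidden_size = 200, 8, 32
x = torch.randn(1, T, n_features)

lstm = torch.nn.LSTM(n_features, hidden_size, batch_first=True)
with torch.no_grad():
    outputs, _ = lstm(x)              # (1, T, hidden_size): hidden state at each step

H = outputs.squeeze(0).numpy()        # (T, hidden_size) real-valued state trajectory
segments = KMeans(n_clusters=4, n_init=10).fit_predict(H)
print(segments[:20])                  # one cluster id per time step ~ "discovered" discrete states
```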
An interesting observation about computational complexity:
HMMs require the Viterbi algorithm to find the most likely state sequence. The complexity of the algorithm depends on the number of previous decisions on which we condition the current one. Time complexity is O(n * S^(k+1)), where n is the length of the sequence, k is the number of previous decisions we condition on, and S is the number of possible decisions; for a standard first-order HMM, k = 1, giving O(n * S^2).
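(For reference, a minimal first-order Viterbi decoder in numpy with toy parameters, illustrating the O(n * S^2) cost for k = 1.)

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """First-order Viterbi decoding in O(n * S^2).

    obs    : sequence of observation indices, length n
    log_A  : (S, S) log transition matrix, log_A[i, j] = log P(state j | state i)
    log_B  : (S, V) log emission matrix
    log_pi : (S,) log initial state distribution
    """
    n, S = len(obs), log_pi.shape[0]
    delta = np.zeros((n, S))
    back = np.zeros((n, S), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_A        # (S, S): previous state x next state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]                  # backtrack the best path
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(viterbi([0, 1, 1, 0], np.log(A), np.log(B), np.log(pi)))
```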
Now, why does an HMM require the Viterbi search procedure while an LSTM/RNN doesn't?
Inference in an LSTM is linear in the length of the sequence -- ignoring the classifier decision time.
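(Concretely, that search-free decoding is just one forward pass plus a per-step argmax; a minimal sketch with untrained PyTorch modules, sizes purely illustrative.)

```python
import torch

n_features, hidden_size, n_labels, seq_len = 8, 32, 5, 100
lstm = torch.nn.LSTM(n_features, hidden_size, batch_first=True)
classifier = torch.nn.Linear(hidden_size, n_labels)

x = torch.randn(1, seq_len, n_features)
with torch.no_grad():
    h, _ = lstm(x)                     # one pass over the sequence: O(seq_len)
    labels = classifier(h).argmax(-1)  # independent argmax per step: no DP, no search
print(labels.shape)                    # (1, seq_len)
```

There is no explicit dependency between output labels to maximize over, so nothing forces a dynamic program; the cost is that per-step decisions cannot be revised later, which is exactly what beam search (discussed below) tries to mitigate.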
Why does this work without Viterbi?
After all, even Conditional Random Fields need dynamic programming, and a Maximum Entropy Markov Model cannot both learn quickly without dynamic programming and keep its inference free of label bias.
What is so special about an LSTM that it doesn't need DP?
An RNN obviously learns a much better representation of features. Still, why then won't HMMs work with that same representation without the search part?
It really baffles me.
Although, there does exist a process that doesn't use a NN for sequence labeling, is linear in the length of the sequence, and works well. Instead of using a simple classifier as in MEMMs, one can use a cost-sensitive classifier, learn the future loss associated with each decision, and greedily avoid that loss. This allows the usage of a simple multiclass classifier, removes the search part, and can work as well as a NN given the right features.
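(A rough sketch of one reading of that idea; the stand-in cost model and feature encoding are illustrative, not the actual system. Each candidate label is scored by a regressor of expected future loss, and the cheapest one is taken greedily in a single left-to-right pass.)

```python
import numpy as np

N_LABELS = 3

def one_hot(i, size):
    v = np.zeros(size)
    if i >= 0:                     # i == -1 encodes "no previous label yet"
        v[i] = 1.0
    return v

def greedy_label(seq_feats, cost_model):
    """seq_feats: (n, d) array; cost_model: anything with a scikit-learn-like predict()."""
    prev, out = -1, []
    for x_t in seq_feats:          # one pass: linear in the sequence length
        cands = np.stack([
            np.concatenate([x_t, one_hot(prev, N_LABELS), one_hot(y, N_LABELS)])
            for y in range(N_LABELS)
        ])
        costs = cost_model.predict(cands)   # predicted future loss for each candidate label
        prev = int(np.argmin(costs))        # greedily avoid the loss
        out.append(prev)
    return out

# Stand-in cost model so the sketch runs end to end; in practice this regressor is
# trained on rollouts that reveal the loss actually incurred after each decision.
class RandomCost:
    def predict(self, X):
        return np.random.rand(len(X))

print(greedy_label(np.random.randn(10, 4), RandomCost()))
```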
RNNs for discrete structured prediction tasks often use a technique called beam search during decoding (see for example papers on neural MT), which is about as close as you can get to a true Viterbi when you don't have independence assumptions. The CTC mentioned above is also a special type of "reductionist" dynamic programming which uses some awesome and clever stuff to remain backproppable. It only works for monotonic alignments (input longer than the desired output).
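(A minimal, framework-free sketch of that beam search idea; the step scorer here is a made-up toy table, whereas for an RNN decoder it would be one decoder step conditioned on the prefix.)

```python
import numpy as np

def beam_search(step_log_probs, vocab_size, max_len, beam_width=4, eos=0):
    beams = [([], 0.0)]                                   # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:              # finished hypotheses carry over
                candidates.append((prefix, score))
                continue
            logp = step_log_probs(prefix)                 # (vocab_size,) scores for next symbol
            for v in range(vocab_size):
                candidates.append((prefix + [v], score + logp[v]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]                   # keep only the top-k prefixes
    return beams[0][0]

# Toy scorer so the sketch runs: a fixed random "model" over a 6-symbol vocabulary.
rng = np.random.default_rng(0)
table = np.log(rng.dirichlet(np.ones(6), size=20))
print(beam_search(lambda prefix: table[len(prefix)], vocab_size=6, max_len=10))
```

Unlike Viterbi, nothing guarantees the returned sequence is the global optimum; the beam width trades accuracy for time.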
Still, beam search can be simulated and improved by running decision processes in parallel.
For example, instead of learning the sequence labeling as a sequence of n decisions (where n is the length of the sequence), you can learn it as a sequence of 3n+1 decisions, where you make 3 decisions for each sequence element and, after the 3n decisions, use one extra decision to pick whichever of the three decision streams minimizes the loss (at inference time the classifier will, hopefully, pick the stream that minimizes test loss).
This simulates a beam search, can be done during both learning and inference, and is probably more effective than taking confidence scores of particular decisions and keeping a beam of the most confident partial sequences.
Beam search is a heuristic thing that improves performance and is done mostly to allow you to correct mistakes you made at the beginning of the process.
The question remains the same; for example, in the paper above they approximate the partition function of CRFs with a beam but get superior results to other structured prediction methods.
Sure, there are a lot of potential things to do - either by enumerating partway towards the full "brute force" loss as I think you are hinting, or treating it as subsequences, or something else [0, 1, 2]. The point is, IMO RNNs do need some kind of structured loss (more than per step likelihood) to be competitive with HMM approaches using Viterbi decoding, in addition to some sequence smarts (such as beam search) in decoding.
At least in NMT, enumerating the possible decisions as 3n + 1 is almost impossible since the softmax size is generally the memory bottleneck in training - and a bigger vocabulary is typically a huge win. It is more feasible in speech, but often your labels are themselves triphones and you end up with a pretty large vocabulary too.
Figuring out how to get RNNs closer to on par with DNN-HMMs with Viterbi decoding (or full on sequence training [3]) through something like "deep fusion" with a language model (or something else) is something I am very interested in.
> IMO RNNs do need some kind of structured loss (more than per step likelihood) to be competitive with HMM approaches using Viterbi decoding
This is exactly what they do with the dependency parser I've cited, so your opinion is definitely valid. Although their approach is not general, since they approximate Hamming loss with log-loss and, again, make it work only on sequences.
The paper above also has a very good analysis of how to remove the search component from inference and achieve linear time complexity with competitive results.
A consistent reduction (in the sense used in the machine learning theory of reductions) from structured learning to multiclass classification seems to be possible. I just haven't seen anyone couple the learning procedure with neural networks. (Daume did mention they trained RNNs with the reductionist approach, but it seems the code didn't make it into Vowpal Wabbit.)
The approach above works with any loss you want (from F-score to any weird thing you might think of); the loss doesn't have to decompose over the structure (one can just announce the loss after the labeling is done and learn from that loss), and it can work on any kind of structure, from images to sequences to documents for translation. It can also use an O(log n) consistent reduction for the multiclass step if speed is an issue and the number of classes is large. It can easily work as an online method too, not requiring the full structured input.
For example, simple sequence tagging runs at around 500k tokens per second (depending on the number of possible labels); word count is only 2-4 times faster than that.
There still aren't any papers using the above consistent reduction in the framework of NNs, but I guess they'll be coming soon.
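(A tiny sketch of the "announce the loss only after the labeling is done" part: finish a rollout, compute a non-decomposable loss such as 1 - F1 on the whole output, and feed that single scalar back as the observed cost for the decisions in the rollout. Names are illustrative.)

```python
import numpy as np

def f1(pred, gold, positive=1):
    pred, gold = np.asarray(pred), np.asarray(gold)
    tp = np.sum((pred == positive) & (gold == positive))
    p = tp / max(np.sum(pred == positive), 1)
    r = tp / max(np.sum(gold == positive), 1)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

pred = [1, 0, 1, 1, 0]                # a finished rollout from the greedy labeler
gold = [1, 1, 1, 0, 0]
rollout_cost = 1.0 - f1(pred, gold)   # one scalar cost for the whole structure
print(rollout_cost)                   # fed back to the cost-sensitive learner
```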
A coworker of mine used to ask job candidates (usually folks with PhDs) with HMMs on their CV "what's hidden in a hidden Markov model?" Lots of people couldn't answer that question.
Are there any open tools for learning HMMs from large datasets? I.e., if I have millions of observations from millions of users and want to learn an HMM from the data, what are my options?
There is also the HTK toolkit, but last I used that, years ago, I was pretty disappointed; it did not compile on a 64-bit system. Both of these are single-machine toolkits, so there is a limit to how much you can scale, but I think you may be able to deal with a million or so points with mlpack. There's also MATLAB's implementation, but that is not very fast.
> A Markov chain is a sequence of random variables X1, X2, X3, ..., Xt, ..., such that the probability distribution of Xt+1 depends only on t and xt (Markov property), in other words:
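(The equation that follows in the slides, per the defense a few comments down, is the standard Markov property:)

\[
P(X_{t+1} = x \mid X_1 = x_1, \ldots, X_t = x_t) = P(X_{t+1} = x \mid X_t = x_t)
\]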
No. In a Markov process, the future does depend on the past, even all of the past. But what is special is that the past and the future are conditionally independent given the present. If we are not given the present, then all of the past can be relevant in predicting the future.
Out of a pretty big slide deck, explicitly intended for a beginner audience, this is a weird thing to pick on. The following is "Proposition 1.1" from the relevant section of a standard text (Bhattacharya and Waymire):
"A stochastic process X0, X1, X2, ... has the Markov property if and only if for each n the conditional distribution of Xn+1 given X0...Xn is a function only of Xn."
The only difference between the above Prop. 1.1 and the statement in TFA is that TFA omits to specify it means a conditional distribution of Xt+1 given X1...Xt (as opposed to an unconditional distribution).
But, in literally the following line, TFA goes on to clarify what it means by stating the equation that the words correspond to, and the equation has the correct conditional distribution.
Markov chains are much simpler than Markov models. The lecture starts off with the simpler idea to motivate the more complex one that you're describing.
It's more reasonable to parse "depends only on" as expressing conditional independence than unconditional, and that parsing has the added bonus of making the original statement correct.
No. Once again: in a Markov process, including a discrete-time Markov chain, even one with a finite state space, the past can, and in practice usually does, provide a lot of powerful information for predicting the future. The future really can depend on the past. Typically the past and the future are not independent -- not even close.
But the Markov assumption (property) says that GIVEN (excuse my emphasis -- it is just crucial here) the present, the past contributes no additional information for predicting the future. That is, the past and the future may not be independent at all, but they are conditionally independent given the present.
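(Written out, for reference, in a standard formulation: given the present, the past and the future are conditionally independent,)

\[
P\bigl((X_{t+1}, X_{t+2}, \ldots) \in A,\; (X_1, \ldots, X_{t-1}) \in B \,\bigm|\, X_t\bigr)
= P\bigl((X_{t+1}, X_{t+2}, \ldots) \in A \bigm| X_t\bigr)\, P\bigl((X_1, \ldots, X_{t-1}) \in B \bigm| X_t\bigr),
\]

even though, unconditionally, past and future are typically far from independent.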
To discuss the Markov property, we desperately need the concept of conditional independence -- there is no substitute, no easier terminology or description.
Alas, conditional independence is not so easy to discuss, even in the simplest cases. And in the grown-up cases, one needs the Radon-Nikodym theorem of measure theory and the associated machinery of conditional expectation.
Tellingly, the part I quoted never mentioned conditional independence or even conditioning.
Sorry, again, the quote was wrong. I tried to correct the quote and, thus, keep readers who are trying to learn from being misled.
My sympathies are with the readers trying to learn. Early in my career, I read lots of such elementary, easy-to-understand introductions to lots of topics in applied math. The big problem was that, as here, the content ranged from not very good down to actually wrong. Later I got a first-class, rock-solid foundation in grad school. In particular, my background in Markov processes came from a star student of E. Cinlar, a world-class expert long at Princeton. And my Ph.D. dissertation was in stochastic optimal control, where the Markov assumption is just crucial -- indeed, stochastic optimal control is also commonly called Markov decision processes. In addition, I've applied Markov processes in some serious work in US national security.
What I'm saying here is on the center of the target. What I quoted no one should try to learn from -- it's wrong.
"In both models, there's an unobserved state that changes over time according to relatively simple rules, and you get indirect information about that state every so often. In Kalman filters, you assume the unobserved state is Gaussian-ish and it moves continuously according to linear-ish dynamics (depending on which flavor of Kalman filter is being used). In HMMs, you assume the hidden state is one of a few classes, and the movement among these states uses a discrete Markov chain. In my experience, the algorithms are often pretty different for these two cases, but the underlying idea is very similar." - THISISDAVE
-- vs LSTM/RNN:
"Some state-of-the-art industrial speech recognition [0] is transitioning from HMM-DNN systems to "CTC" (connectionist temporal classification), i.e., basically LSTMs. Kaldi is working on "nnet3" which moves to CTC, as well. Speech was one of the places where HMMs were _huge_, so that's kind of a big deal." -PRACCU
"HMMs are only a small subset of generative models that offers quite little expressiveness in exchange for efficient learning and inference." - NEXTOS
"IMO, anything that be done with an HMM can now be done with an RNN. The only advantage that an HMM might have is that training it might be faster using cheaper computational resources. But if you have the $$$ to get yourself a GPU or two, this computational advantage disappears for HMMs." - SHERJILOZAIR