Hacker News

Can you provide articles comparing CRFs directly with LSTMs? Most articles on LSTMs don't actually compare against CRFs, and an LSTM isn't a drop-in replacement for a CRF. I haven't personally seen evidence that neural networks uniformly beat CRFs across all tasks. E.g. [1] directly compares a CRF and an LSTM: the CRF achieves an F1 of 97.533 while the LSTM gets 97.848.

In fact, because CRFs remain so competitive, many works combine them with neural networks (e.g. [2]).

[1] https://arxiv.org/pdf/1606.03475.pdf

[2] https://arxiv.org/abs/1508.01991




tensor: my main point was and is that features learned by a suitable deep model (whether recurrent or attention-based) routinely outperform human-designed features. This has been shown in a large and growing number of sequence tasks (WMT language translation datasets, Stanford Question Answering Dataset, WikiText language modeling datasets, Penn Treebank dataset, IMDB and Stanford Sentiment Treebank movie review datasets, etc. -- to name a few).

Now, in some cases, and depending on the task, it might make sense to have the last layer of a deep model be a CRF layer. In the OP's case, for example, one could try replacing all those one-off feature functions with a proven deep architecture -- in other words, instead of having ψ at each time step be equal to exp(sum(weighted feature functions)), have it be a function of the output of the deep model.
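To make the contrast concrete, here's a minimal sketch of the two ways of defining the per-step potential ψ. The function names and toy inputs are illustrative only (not from any particular library); the "neural" version just assumes the deep model has already produced a scalar score for the tag at that time step:

```python
import math

def psi_handcrafted(weights, feature_values):
    """Classic CRF potential: exp of a weighted sum of hand-designed
    feature functions evaluated at this time step."""
    return math.exp(sum(w * f for w, f in zip(weights, feature_values)))

def psi_neural(emission_score):
    """Neural-CRF potential: exp of a score emitted by a deep model
    (e.g. the LSTM output projected to this tag's dimension)."""
    return math.exp(emission_score)

# Same downstream CRF machinery either way; only the source of the
# score changes.
print(psi_handcrafted([0.5, -1.0], [1.0, 0.0]))  # exp(0.5) ≈ 1.6487
print(psi_neural(0.5))                           # identical value
```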

That said, for something like the OP's task, the first thing I would try would be one of the readily available LSTM architectures[a], with a standard softmax layer predicting a distribution over the vocabulary of tags at each time step, and feeding that into a standard beam search.[b]

[a] Example: https://github.com/salesforce/awd-lstm-lm/blob/master/model....

[b] Intro to beam search algorithm: https://www.youtube.com/watch?v=UXW6Cs82UKo
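For anyone unfamiliar with the decoding step in [b], here's a toy sketch of beam search over per-step tag distributions. It's deliberately simplified: `log_probs[t]` maps each tag to its log-probability at step t, and a real tagger would condition each step's distribution on the previously predicted tags rather than treating steps independently:

```python
import math

def beam_search(log_probs, beam_width=2):
    """Keep the beam_width highest-scoring partial tag sequences
    at each step; return the best complete sequence."""
    beams = [([], 0.0)]  # (tag sequence, cumulative log-prob)
    for step in log_probs:
        candidates = [
            (seq + [tag], score + lp)
            for seq, score in beams
            for tag, lp in step.items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy two-step example with a two-tag vocabulary:
steps = [
    {"NOUN": math.log(0.6), "VERB": math.log(0.4)},
    {"NOUN": math.log(0.3), "VERB": math.log(0.7)},
]
print(beam_search(steps))  # ['NOUN', 'VERB']
```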



