
This is the kind of problem for which LSTM RNNs -- and more recently, fully-attention-based deep neural nets -- produce state-of-the-art results.

I wonder if the author ever tried using, say, an AWD LSTM RNN[a] or a Transformer-like model[b] for this task.

Using an RNN or an attention model for this would eliminate the need for feature engineering such as:

  feature_1 = 1 if x_t is capitalized and y_t equals "NAME";
              0 otherwise.
This is one of seven carefully engineered feature functions listed in the article, and the author states that the seven are only a partial list.
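In code, each of those is just a hand-written indicator function. A rough Python sketch (mine, not the article's) of the one quoted above:

  def feature_1(x, y, t):
      # x: list of tokens, y: list of tags, t: position in the sequence
      # fires only when the token at t is capitalized and tagged as a name
      return 1.0 if x[t][0].isupper() and y[t] == "NAME" else 0.0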

Moreover, a modern RNN or attention model would likely produce more accurate predictions and generalize much better.

[a] https://arxiv.org/abs/1708.02182 / https://github.com/salesforce/awd-lstm-lm

[b] https://arxiv.org/abs/1706.03762 / https://github.com/tensorflow/tensor2tensor




This article is dated 2015. Can’t blame the author too much for not trying things that would be invented 2 years later.

But yeah, it would be great follow-up work.


Ah, I didn't notice the article's date until now. Thanks for pointing that out! Makes more sense now.

Yes, it would be great follow-up work.


Can you provide articles comparing CRFs directly with LSTMs? Most articles on LSTMs don't actually compare against CRFs, and an LSTM isn't a drop-in replacement for a CRF. I haven't personally seen neural networks uniformly beat CRFs on all tasks. For example, [1] directly compares CRFs and an LSTM: the CRF achieves an F1 of 97.533 while the LSTM gets 97.848, a gap of only about 0.3 points.

In fact, because of the competitiveness of CRFs, there are many works that combine them with neural networks (e.g. [2]).

[1] https://arxiv.org/pdf/1606.03475.pdf

[2] https://arxiv.org/abs/1508.01991


tensor: my main point was and is that features learned by a suitable deep model (whether recurrent or attention-based) routinely outperform human-designed features. This has been shown on a large and growing number of sequence tasks: the WMT translation datasets, the Stanford Question Answering Dataset, the WikiText language modeling datasets, the Penn Treebank, and the IMDB and Stanford Sentiment Treebank movie review datasets, to name a few.

Now, in some cases, depending on the task, it might make sense to have the last layer of a deep model be a CRF layer. In the OP's case, for example, one could try replacing all those one-off feature functions with a proven deep architecture -- in other words, instead of having ψ at each time step be equal to exp(sum(weighted feature functions)), have it be a function of the output of the deep model.
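As a rough sketch of that idea (illustrative PyTorch-style code, not a tested implementation; the layer sizes and names are made up), the CRF's emission scores come from the LSTM instead of from hand-written features:

  import torch.nn as nn

  class LSTMEmissions(nn.Module):
      # The LSTM hidden state at each time step is projected to one score per
      # tag; those scores play the role of log-potentials (log psi_t) that a
      # CRF layer would consume in place of exp(weighted feature functions).
      def __init__(self, vocab_size, num_tags, emb_dim=128, hidden=256):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, emb_dim)
          self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
          self.to_tags = nn.Linear(2 * hidden, num_tags)

      def forward(self, token_ids):                # (batch, seq_len)
          h, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, 2*hidden)
          return self.to_tags(h)                   # (batch, seq_len, num_tags)

A tag-transition matrix plus the usual forward-algorithm loss on top of those scores gives you roughly the LSTM-CRF setup in the paper tensor cites.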

That said, for something like the OP's task, the first thing I would try is one of the readily available LSTM architectures[a] with a standard softmax layer predicting a distribution over the tag vocabulary at each time step, feeding those per-step distributions into a standard beam search.[b]

[a] Example: https://github.com/salesforce/awd-lstm-lm/blob/master/model....

[b] Intro to beam search algorithm: https://www.youtube.com/watch?v=UXW6Cs82UKo
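For [b], a minimal beam search sketch (my own illustration; log_probs_next is a stand-in for however the trained model scores the next tag given the tags chosen so far):

  def beam_search(log_probs_next, seq_len, beam_width=5):
      # Each beam is a (tag_sequence, total_log_prob) pair.
      beams = [([], 0.0)]
      for t in range(seq_len):
          candidates = []
          for tags, score in beams:
              # log_probs_next(tags, t) yields (tag, log_prob) pairs for
              # step t, conditioned on the tags chosen so far
              for tag, lp in log_probs_next(tags, t):
                  candidates.append((tags + [tag], score + lp))
          # keep only the beam_width highest-scoring partial sequences
          beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
      return beams[0][0]  # best-scoring complete tag sequence

With beam_width=1 this degenerates into greedy per-step argmax.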




