Can you provide articles comparing CRFs directly with LSTMs? Most articles on LSTMs don't actually compare against CRFs, and an LSTM isn't a drop-in replacement for a CRF. I haven't personally seen that neural networks have uniformly beaten CRFs on all tasks. E.g. [2] directly compares CRFs and an LSTM: the CRF achieves an F1 of 97.533 while the LSTM gets 97.848, a difference of only about 0.3 points.
In fact, because of the competitiveness of CRFs, there are many works that combine them with neural networks (e.g. [2]).
tensor: my main point was and is that features learned by a suitable deep model (whether recurrent or attention-based) routinely outperform human-designed features. This has been shown on a large and growing number of sequence tasks (the WMT machine translation datasets, the Stanford Question Answering Dataset, the WikiText language modeling datasets, the Penn Treebank, and the IMDB and Stanford Sentiment Treebank movie review datasets, to name a few).
Now, in some cases, and depending on the task, it might make sense to have the last layer of a deep model be a CRF layer. In the OP's case, for example, one could try replacing all those one-off feature functions with a proven deep architecture -- in other words, instead of having ψ at each time step be equal to exp(sum(weighted feature functions)), have it be a function of the deep model's output.
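To make that concrete, here is a minimal sketch (mine, not the article's) of computing the CRF potentials from a deep encoder instead of from hand-written feature functions. It assumes PyTorch, and every size and name (VOCAB_SIZE, NUM_TAGS, NeuralCRFPotentials, etc.) is illustrative:

    import torch
    import torch.nn as nn

    VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, NUM_TAGS = 10_000, 128, 256, 45

    class NeuralCRFPotentials(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
            self.encoder = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True,
                                   bidirectional=True)
            # Per-step emission scores: a learned function of the encoder
            # output stands in for the sum of weighted, hand-designed
            # feature functions.
            self.emit = nn.Linear(2 * HIDDEN_DIM, NUM_TAGS)
            # Learned tag-to-tag transition scores (the pairwise part of psi).
            self.trans = nn.Parameter(torch.zeros(NUM_TAGS, NUM_TAGS))

        def forward(self, tokens):                   # tokens: (batch, seq_len)
            h, _ = self.encoder(self.embed(tokens))  # (batch, seq_len, 2*hidden)
            emissions = self.emit(h)                 # (batch, seq_len, num_tags)
            # log psi_t(y_prev, y) = emissions[:, t, y] + trans[y_prev, y];
            # training would maximize the CRF log-likelihood computed from
            # these scores with the standard forward algorithm.
            return emissions, self.trans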
That said, for something like the OP's task, the first thing I would try is one of the readily available LSTM architectures[a], with a standard softmax layer predicting a distribution over the vocabulary of tags at each time step, feeding those distributions into a standard beam search.[b]
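For concreteness, a rough sketch of that baseline (again mine, with PyTorch assumed and all sizes and names made up for illustration) -- an LSTM tagger with a per-step softmax over the tag vocabulary, plus a small beam search over the resulting log-probabilities:

    import torch
    import torch.nn as nn

    class LSTMTagger(nn.Module):
        def __init__(self, vocab_size=10_000, embed_dim=128,
                     hidden_dim=256, num_tags=45):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                                bidirectional=True)
            self.out = nn.Linear(2 * hidden_dim, num_tags)

        def forward(self, tokens):                         # (batch, seq_len)
            h, _ = self.lstm(self.embed(tokens))
            return torch.log_softmax(self.out(h), dim=-1)  # per-step tag log-probs

    def beam_search(log_probs, beam_size=5):
        # log_probs: (seq_len, num_tags) for one sentence. With a purely
        # independent per-step softmax the top beam equals greedy argmax;
        # the beam pays off once sequence-level constraints or scores are
        # added (e.g. forbidding invalid BIO tag transitions).
        beams = [([], 0.0)]                            # (tag sequence, total log-prob)
        for step_scores in log_probs:                  # one time step at a time
            candidates = []
            for tags, score in beams:
                for tag, tag_score in enumerate(step_scores.tolist()):
                    candidates.append((tags + [tag], score + tag_score))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = candidates[:beam_size]
        return beams[0][0]                             # best-scoring tag sequence

Training such a tagger is just token-level cross-entropy against the gold tags, which is part of why it makes an easy first baseline.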
I wonder if the author ever tried using, say, an AWD-LSTM[a] or a Transformer-like model[b] for this task.
Using an RNN or an attention model for this would eliminate the need for feature engineering such as:
This is one of seven carefully engineered feature functions listed in the article, and the author states that the seven are only a partial list. Moreover, using a modern RNN or attention model would likely produce better predictions, with much better generalization.
[a] https://arxiv.org/abs/1708.02182 / https://github.com/salesforce/awd-lstm-lm
[b] https://arxiv.org/abs/1706.03762 / https://github.com/tensorflow/tensor2tensor