
This is the kind of problem for which LSTM RNNs -- and more recently, fully-attention-based deep neural nets -- produce state-of-the-art results.

I wonder if the author ever tried using, say, an AWD LSTM RNN[a] or a Transformer-like model[b] for this task.

Using an RNN or an attention model for this would eliminate the need for feature engineering such as:

  feature_1 = 1 if x_t is capitalized and y_t equals "NAME";
              0 otherwise.
This is one of seven carefully engineered feature functions listed in the article, and the author states that the seven are only a partial list.
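In code, each of those is just a hand-written indicator function. A rough Python sketch (mine, not the article's) of the one quoted above:

  def feature_1(x, y, t):
      # x: list of tokens, y: list of tags, t: position in the sequence
      # fires only when the token at t is capitalized and tagged as a name
      return 1.0 if x[t][0].isupper() and y[t] == "NAME" else 0.0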

Moreover, a modern RNN or attention model would likely produce more accurate predictions and generalize much better.

[a] https://arxiv.org/abs/1708.02182 / https://github.com/salesforce/awd-lstm-lm

[b] https://arxiv.org/abs/1706.03762 / https://github.com/tensorflow/tensor2tensor




This article is dated 2015. Can’t blame the author too much for not trying things that would be invented 2 years later.

But yeah, it would be great follow-up work.


Ah, I didn't notice the article's date until now. Thanks for pointing that out! Makes more sense now.

Yes, it would be great follow-up work.


Can you provide articles comparing CRFs directly with LSTMs? Most articles on LSTMs don't actually compare against CRFs, and an LSTM isn't a drop-in replacement for a CRF. I haven't personally seen neural networks uniformly beat CRFs on all tasks. For example, [1] directly compares CRFs and an LSTM: the CRF achieves an F1 of 97.533 while the LSTM gets 97.848, a gap of only about 0.3 points.

In fact, because of the competitiveness of CRFs, there are many works that combine them with neural networks (e.g. [2]).

[1] https://arxiv.org/pdf/1606.03475.pdf

[2] https://arxiv.org/abs/1508.01991


tensor: my main point was and is that features learned by a suitable deep model (whether recurrent or attention-based) routinely outperform human-designed features. This has been shown on a large and growing number of sequence tasks: the WMT translation datasets, the Stanford Question Answering Dataset, the WikiText language modeling datasets, the Penn Treebank, and the IMDB and Stanford Sentiment Treebank movie review datasets, to name a few.

Now, in some cases, depending on the task, it might make sense to have the last layer of a deep model be a CRF layer. In the OP's case, for example, one could try replacing all those one-off feature functions with a proven deep architecture -- in other words, instead of having ψ at each time step be equal to exp(sum(weighted feature functions)), have it be a function of the output of the deep model.
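As a rough sketch of that idea (illustrative PyTorch-style code, not a tested implementation; the layer sizes and names are made up), the CRF's emission scores come from the LSTM instead of from hand-written features:

  import torch.nn as nn

  class LSTMEmissions(nn.Module):
      # The LSTM hidden state at each time step is projected to one score per
      # tag; those scores play the role of log-potentials (log psi_t) that a
      # CRF layer would consume in place of exp(weighted feature functions).
      def __init__(self, vocab_size, num_tags, emb_dim=128, hidden=256):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, emb_dim)
          self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
          self.to_tags = nn.Linear(2 * hidden, num_tags)

      def forward(self, token_ids):                # (batch, seq_len)
          h, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, 2*hidden)
          return self.to_tags(h)                   # (batch, seq_len, num_tags)

A tag-transition matrix plus the usual forward-algorithm loss on top of those scores gives you roughly the LSTM-CRF setup in the paper tensor cites.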

That said, for something like the OP's task, the first thing I would try is one of the readily available LSTM architectures[a] with a standard softmax layer predicting a distribution over the tag vocabulary at each time step, feeding those per-step distributions into a standard beam search.[b]

[a] Example: https://github.com/salesforce/awd-lstm-lm/blob/master/model....

[b] Intro to beam search algorithm: https://www.youtube.com/watch?v=UXW6Cs82UKo
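For [b], a minimal beam search sketch (my own illustration; log_probs_next is a stand-in for however the trained model scores the next tag given the tags chosen so far):

  def beam_search(log_probs_next, seq_len, beam_width=5):
      # Each beam is a (tag_sequence, total_log_prob) pair.
      beams = [([], 0.0)]
      for t in range(seq_len):
          candidates = []
          for tags, score in beams:
              # log_probs_next(tags, t) yields (tag, log_prob) pairs for
              # step t, conditioned on the tags chosen so far
              for tag, lp in log_probs_next(tags, t):
                  candidates.append((tags + [tag], score + lp))
          # keep only the beam_width highest-scoring partial sequences
          beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
      return beams[0][0]  # best-scoring complete tag sequence

With beam_width=1 this degenerates into greedy per-step argmax.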




