The write up on that (from Google, who organized it and provided the data) was really interesting: http://blog.kaggle.com/2018/02/07/a-brief-summary-of-the-kag...
The approach used here has secured the 6th position in the Kaggle Russian Text Normalization Challenge by Google's Text Normalization Research Group.
We modified the sentence to say, "An earlier version of the approach used here has secured the 6th position in the Kaggle Russian Text Normalization Challenge by Google's Text Normalization Research Group".
You then argue that the number of "unacceptable errors" is a better measure of model performance, which seems reasonable. However, you don't really show any analysis of these errors other than a table with some hand-picked examples. I would spend some time trying to actually quantify these errors so you can analyze them and show a proper plot or table that summarizes the results.
I think the work is interesting but I would be careful how you present the results. I would suggest adding more plots/tables to back up your claims in a more objective manner or tone down the conclusions a bit. This is meant to be constructive criticism btw, it's not an attack :-) I think with a bit more work you'll be ready for a proper peer review.
The paper by Sproat and Jaitly which introduces the challenge rightly notes that the acceptability of errors and quality of output is more important than accuracy for a real application. The number of instances in all of the critical semiotic classes is too low (1-2k for some, even less than 100 for others) for a meaningful comparison in accuracy.
But you are right to point out that the 'unacceptability' of errors could be analyzed better. However, we could not think of a way to quantify such errors or to form a metric that measures them. These 'silly' errors are subjective by their very nature and depend on a human reading them. As you have suggested, we are working on preparing a table of sorts to summarize all these errors and to show a link between the availability/frequency of particular types of examples and the performance of our model on those types. Something of this sort, for example:
* The training set had 17,712 examples in DATE of the form xx/yy/zzzz. Upon analyzing the mistakes in the DATE class, we did not find any mistakes made on dates of the above form.
* On the other hand, if we look into the mistakes made in the MEASURE class, we find that the DNC network made exactly 4 mistakes. The mistakes were in the units (g/cm3, ch, mA). Upon searching the training set for occurrences of these units, we found that 'mA' occurs 3 times, 'g/cm3' occurs 7 times and 'ch' occurs 8 times, whereas other measurement units like kg occur 296 times and cm occurs 600+ times.
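A table like that could be put together with something as small as the sketch below. It only illustrates the idea and is not our actual analysis code: the function name `frequency_vs_errors` is made up for this example, the training counts are the ones quoted above, and "600+" is entered as 600, i.e. as a lower bound.

```python
from typing import Dict, Iterable, List, Tuple


def frequency_vs_errors(
    train_counts: Dict[str, int], error_units: Iterable[str]
) -> List[Tuple[str, int, bool]]:
    """Return (unit, training_count, had_error) rows, rarest units first."""
    errored = set(error_units)
    rows = [(unit, count, unit in errored) for unit, count in train_counts.items()]
    # Sort by training frequency so rare units and their errors line up at the top.
    return sorted(rows, key=lambda row: row[1])


# Counts quoted in the comment above; "600+" is entered as 600 (a lower bound).
train_counts = {"mA": 3, "g/cm3": 7, "ch": 8, "kg": 296, "cm": 600}
# Units in which the MEASURE mistakes were reported.
error_units = ["g/cm3", "ch", "mA"]

rows = frequency_vs_errors(train_counts, error_units)
for unit, count, had_error in rows:
    print(f"{unit:>6}  train={count:>4}  error={'yes' if had_error else 'no'}")
```

Sorting by training frequency puts the rare units and the units with mistakes next to each other, which is the link we want the table to show.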
If you have any other ideas on how to analyze and report the results, please let us know. We will be glad to improve the quality of our work (By the way, we are undergrads and this is our very first research paper). Thanks again!
Only in arXiv you could get away with that kind of language :). Good paper though! Kudos.
"Another direction to go from here would be to increase the size of the context window during the data preprocessing stage to feed even more contextual information into the model."
Could you comment on how the training time would scale with increasing the size of the context window? Is there a sweet spot?
The memory requirements of the DNC are quite high. We used a GTX 1060 for training. Increasing the context window beyond 3 increases the sequence length by a huge amount, causing memory problems. However, we also found that the DNC works quite well even with a small batch size. We used a batch size of 16 for all our experiments. The training time for a batch size of 16, a context window of size 3 and 200k steps is 48h on a GTX 1060 system.
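To give a feel for why the sequence length grows with the window, here is a rough sketch. It rests on assumptions that are not spelled out in this thread: the window is counted in words on each side of the token to be normalized, the model reads the result character by character, and the `<norm>` / `</norm>` markers and the helper name `build_window` are hypothetical.

```python
from typing import List


def build_window(tokens: List[str], index: int, window: int) -> str:
    """Join up to `window` words on each side of tokens[index] into one input string."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    # The markers around the target token are a hypothetical convention for this sketch.
    return " ".join(left + ["<norm>", tokens[index], "</norm>"] + right)


sentence = "The meeting on 12/05/2017 was moved to the main hall at noon".split()
target = sentence.index("12/05/2017")

# Character length of the input for a few window sizes.
lengths = {w: len(build_window(sentence, target, w)) for w in (1, 3, 5)}
for w, n in lengths.items():
    print(f"window={w}  characters={n}")
```

Every extra word of context adds its characters to every single training example, so the per-step memory of a recurrent model unrolled over the sequence goes up with it.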
145, i.e. "one hundred and forty five". Oh! This is immediately obvious what you're doing.
Official DeepMind DNC code is here: https://github.com/deepmind/dnc
They use "an unmodified version of the architecture as specified in the original paper", and it looks like they copy & pasted the core code.
This paper lacks some further explanation. Why do they use XGBoost for predicting whether some word is to be normalized? And why do they use DNC for the seq2seq model? I think a single shared model for both tasks might be a cleaner solution, e.g. an encoder with an output layer for the prediction, whose output is also fed to the decoder. The motivation for DNC is also not too clear, although I can guess that they think this is too hard for an LSTM. But for DNC, to get the advantages out of it, it should get some time for doing internal calculations, which you could provide by introducing internal computation steps. They don't do that. Also, in their results section, they do not compare to any other model, so it is not clear whether XGBoost is the best choice, nor whether the DNC really helps here.
On the other hand, as mentioned, when comparing text normalization systems it is more important to look at the exact kinds of errors made by the system (not only the overall accuracy). Our model showed improvement over the baseline model in https://arxiv.org/abs/1611.00068. The DNC showed improvement in certain semiotic classes such as DATE, CARDINAL and TIME, making zero unacceptable predictions in these classes, whereas the LSTM was susceptible to these kinds of mistakes even when a lot of training data was available. Yes, we do not use internal computation steps; the model replaces a standard LSTM in a seq-to-seq model with a DNC. However, thanks for the suggestion. It would be interesting to see the performance improvements if the internal computation steps are increased.