
Text Normalization using Memory Augmented Neural Networks - subho406
https://arxiv.org/abs/1806.00044
======
nl
I don't think the paper highlighted enough that this was a Kaggle competition
entry.

The write up on that (from Google, who organized it and provided the data) was
really interesting: [http://blog.kaggle.com/2018/02/07/a-brief-summary-of-the-
kag...](http://blog.kaggle.com/2018/02/07/a-brief-summary-of-the-kaggle-text-
normalization-challenge/)

~~~
subho406
The model specification used for the Kaggle competition was quite different
from the one described in the paper. The paper compares on the same test set
used by [https://arxiv.org/abs/1611.00068](https://arxiv.org/abs/1611.00068).
DNC showed significant improvement over LSTM as a recurrent unit of a seq-to-
seq model with almost zero unacceptable mistakes in certain semiotic classes.
LSTM, on the other hand, is susceptible to these kinds of mistakes even when a
lot of data is available.

~~~
nl
I'm confused. On [https://github.com/cognibit/Text-Normalization-
Demo](https://github.com/cognibit/Text-Normalization-Demo) it says:

 _The approach used here has secured the 6th position in the Kaggle Russian
Text Normalization Challenge by Google's Text Normalization Research Group._

~~~
subho406
I'm sorry for the misunderstanding. We added the sentence because
the model used in the competition was also based on DNC. But, changes were
made when writing the paper, for instance, we did not use any attention
mechanism at the seq-to-seq level in the competition. Besides, the paper
concentrates more on comparing the kinds of errors made by the DNC network
(avoiding unacceptable mistakes; not the overall accuracy), which shows an
improvement over the LSTM model in the paper
([https://arxiv.org/abs/1611.00068](https://arxiv.org/abs/1611.00068)). On the
other hand, overall accuracy was more important for the Kaggle competition.

We modified the sentence to say, "An earlier version of the approach used here
has secured the 6th position in the Kaggle Russian Text Normalization
Challenge by Google's Text Normalization Research Group".

~~~
nl
Ok, got it I think.

------
Analog24
The conclusions you draw from the results that are presented seem to be a bit
of a stretch. From the accuracy table it is pretty clear that the LSTM model
outperforms your DNC model in most classes (for both datasets the LSTM
achieves the same or better accuracy in 10 out of the 14 classes).

You then argue that the number of "unacceptable errors" is a better measure of
model performance, which seems reasonable. However, you don't really show any
analysis of these errors other than a table with some hand-picked examples. I
would spend some time trying to actually quantify these errors so you can
analyze them and show a proper plot or table that summarizes the results.

I think the work is interesting but I would be careful how you present the
results. I would suggest adding more plots/tables to back up your claims in a
more objective manner or tone down the conclusions a bit. This is meant to be
constructive criticism btw, it's not an attack :-) I think with a bit more
work you'll be ready for a proper peer review.

~~~
amandavinci
Thanks for the great feedback.

The paper by Sproat and Jaitly which introduces the challenge rightly notes
that the acceptability of errors and quality of output is more important than
accuracy for a real application. The number of instances in all of the
critical semiotic classes is too low (1-2k for some, even less than 100 for
others) for a meaningful comparison in accuracy.

But you are right to point out that the 'unacceptability' of errors could be
analyzed better. However, we could not think of a way to quantify or form a
metric that measures such errors. These 'silly' errors are subjective by their
very nature and depend on a human reading them. As you have suggested, we are
working on preparing a table of sorts to summarize all these errors and show a
link between the availability/frequency of particular types of examples and the
performance of our model on those particular types. Something of this sort for
example:

* The training set had 17,712 examples in DATE of the form xx/yy/zzzz. Upon analyzing the mistakes in the DATE class, we did not find any mistakes in dates of this form.

* On the other hand, if we look into the mistakes made in the MEASURE class, we find that the DNC network made exactly 4 mistakes. The mistakes were in the units (g/cm3, ch, mA). Searching the training set for these units, we found that 'mA' occurs 3 times, 'g/cm3' occurs 7 times and 'ch' occurs 8 times, whereas other measurement units like kg occur 296 times and cm occurs 600+ times.
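
The frequency check described in the two bullets above could be sketched roughly as follows. This is only an illustration, assuming the training data is available as (semiotic class, written token) pairs; the `unit_counts` helper and the toy rows are hypothetical and are not taken from the paper's code or data.

```python
import re
from collections import Counter

def unit_counts(rows, semiotic_class="MEASURE"):
    """Count how often each unit appears within one semiotic class.

    `rows` is an iterable of (class_label, written_token) pairs. The unit is
    assumed to be whatever follows the leading number in the token.
    """
    counts = Counter()
    for label, token in rows:
        if label != semiotic_class:
            continue
        # Strip a leading number such as "12", "3.5" or "1,000".
        unit = re.sub(r"^[\d.,\s]+", "", token).strip()
        if unit:
            counts[unit] += 1
    return counts

# Toy rows, invented for illustration only.
rows = [
    ("MEASURE", "5 kg"), ("MEASURE", "12 kg"), ("MEASURE", "3.5 cm"),
    ("MEASURE", "20 mA"), ("DATE", "01/02/2017"), ("PLAIN", "hello"),
]
counts = unit_counts(rows)
for unit, n in counts.most_common():
    print(unit, n)
```

Joining such counts with the per-unit error counts on the test set would give the kind of table the parent comment asks for.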

If you have any other ideas on how to analyze and report the results, please
let us know. We will be glad to improve the quality of our work (By the way,
we are undergrads and this is our very first research paper). Thanks again!

------
cuddlypsycho
"...while avoiding the kind of silly errors made by the LSTM based recurrent
neural architectures."

Only in arXiv you could get away with that kind of language :). Good paper
though! Kudos.

"Another direction to go from here would be to increase the size of the
context window during the data preprocessing stage to feed even more
contextual information into the model."

Could you comment on how the training time would scale with increasing the
size of the context window? Is there a sweet spot?

~~~
subho406
Thank you for the review! We will surely correct these mistakes before
submitting for final publication.

The memory requirements of DNC are quite high. We used a GTX 1060 for training.
Increasing the context window beyond 3 increases the sequence
length by a huge amount, causing memory problems. However, we also found that
DNC works quite well even with a small batch size. We used a batch size of 16 for
all our experiments. The training time for a batch size of 16, context window
of size 3 and 200k steps is 48h on a GTX 1060 system.
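
To make the effect of the context window on sequence length concrete, here is a minimal sketch. It assumes a character-level input in which the token to be normalized is wrapped in `<norm>` markers and surrounded by `window` words on each side; the markers, the `windowed_input` helper and the example sentence are assumptions for illustration, not the exact preprocessing used in the paper.

```python
def windowed_input(tokens, index, window=3):
    """Return the character sequence for tokens[index] plus its context.

    Up to `window` words are taken from each side of the target token, and
    the target is wrapped in <norm> ... </norm> (an assumed convention).
    """
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    text = " ".join(left + ["<norm>", tokens[index], "</norm>"] + right)
    return list(text)

sentence = "The meeting on 12/05/2017 starts at 10 am in room 145 .".split()
target = sentence.index("12/05/2017")

# Input length in characters for a few window sizes.
lengths = {w: len(windowed_input(sentence, target, w)) for w in (1, 3, 5)}
print(lengths)
```

Every extra context word adds its characters to each training example, so the number of recurrent steps (and the activations kept for backpropagation) grows with the window.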

------
__bee
Good paper. I was wondering what the state of the art is in using neural networks
for text segmentation, lemmatisation and part-of-speech tagging.
Morphological approaches are dominant in this space.

------
bmc7505
Text normalization, i.e. the transformation of words from the written to the
spoken form. Uh, is this speech synthesis?

145, i.e. "one hundred and forty five". Oh! Now it is immediately obvious what
you're doing.

~~~
kylebgorman
"text normalization" just refers to that sort of thing. it's a subcomponent of
the "front-end" of a speech synthesizer but it can also be used for speech
recognition (if your training data contains things like "145" you may want to
convert it to read "one hundred and forty five", for various reasons) and
information extraction (perhaps you want to treat "145" and "one hundred and
forty five" as the same).
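
As a toy illustration of the written-to-spoken mapping for the "145" example, here is a small rule-based sketch for cardinals below 1000. It is not the neural approach from the paper, and the `cardinal_to_words` helper is hypothetical; it only shows what the target output looks like.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def cardinal_to_words(n):
    """Spell out an integer in the range 0..999 in English words."""
    if not 0 <= n < 1000:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} and {cardinal_to_words(rest)}"

print(cardinal_to_words(145))
```

Real text normalization is harder than this because the same digits can be a cardinal, a year, a time or part of a measure, which is why context matters.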

------
dharma1
Seq2Seq DNC looks interesting.. any implementation on github yet?

~~~
albertzeyer
Their code is here: [https://github.com/cognibit/Text-Normalization-
Demo](https://github.com/cognibit/Text-Normalization-Demo)

Official DeepMind DNC code is here:
[https://github.com/deepmind/dnc](https://github.com/deepmind/dnc)

They use "an unmodified version of the architecture as specified in the
original paper", and it looks like they copy & pasted the core code.

This paper lacks some further explanation. Why do they use XGBoost for
predicting whether some word is to be normalized? And why do they use DNC for
the seq2seq model? I think a single shared model for both tasks might be a
cleaner solution. E.g. an encoder with an output layer for the prediction,
whose output is also fed to the decoder. The motivation for DNC is also
not too clear, although I can guess that they think this is too hard for an
LSTM. But for DNC, to get the advantages out of it, it should support some
time for doing internal calculations, which you could get by introducing
internal computation steps. They don't do that. Also, in their results
section, they do not compare to any other model, so it is not clear whether
XGBoost is the best choice, and also not whether the DNC really helps here.

~~~
subho406
Hi, we tried using a single model for the entire seq-to-seq task but the
number of examples in PLAIN is huge which causes the model to perform worse on
other classes. The reason we used XGBoost was to separate the two very
different tasks (predicting whether a word is to be normalized; predicting the
sequence of normalized tokens).
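
The two-stage split can be sketched as below. This is only a schematic under stated assumptions: `toy_classifier` stands in for the XGBoost model and `toy_normalizer` stands in for the DNC seq-to-seq model; both stand-ins and the `normalize_sentence` helper are hypothetical and far simpler than the real components.

```python
def normalize_sentence(tokens, needs_normalization, normalize_token):
    """Run the two stages token by token.

    Stage 1 decides whether a token must be normalized; stage 2 is only
    called for tokens that pass stage 1, so PLAIN tokens are copied as-is.
    """
    out = []
    for i, tok in enumerate(tokens):
        if needs_normalization(tokens, i):
            out.append(normalize_token(tokens, i))
        else:
            out.append(tok)
    return out

def toy_classifier(tokens, i):
    # Stand-in for the classifier: flag any token containing a digit.
    return any(ch.isdigit() for ch in tokens[i])

def toy_normalizer(tokens, i):
    # Stand-in for the seq-to-seq model: a tiny lookup table.
    table = {"145": "one hundred and forty five"}
    return table.get(tokens[i], tokens[i])

result = normalize_sentence("Room 145 is open".split(),
                            toy_classifier, toy_normalizer)
print(result)
```

With this split, the second-stage model never sees the very large PLAIN class during training.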

On the other hand, as mentioned, when comparing text normalization systems it
is more important to look at the exact kinds of errors made by the system (not
only the overall accuracy). Our model showed improvement over the baseline
model in [https://arxiv.org/abs/1611.00068](https://arxiv.org/abs/1611.00068).
DNC showed improvement in certain semiotic classes such as DATE, CARDINAL and
TIME, making zero unacceptable predictions in these classes, whereas LSTM was
susceptible to these kinds of mistakes even when a lot of training data was
available. Yes, we do not use internal computation steps; the model replaces a
standard LSTM in a seq-to-seq model with a DNC. However, thanks for the
suggestions. It would be interesting to see the performance improvements if the
internal computation steps are increased.

