
Deep Text Correcter - atpaino
http://atpaino.com/2017/01/03/deep-text-correcter.html
======
jmiserez
Interesting idea. I went ahead and tested:

> Alex went to the kitchen to store the milk in the fridge.

Corrected:

> Alex went to the kitchen to the store the milk in the fridge.

Gathering a large, high quality dataset from the internet is probably not so
easy. A lot of the content on HN/Reddit/forums is of low quality grammatically
and often written by non-native English speakers (such as myself). Movie
dialogues don't necessarily consist of grammatically correct sentences like
the ones you'd write in a letter. Perhaps there is some public domain
contemporary literature available that could be used instead or alongside the
dialogues?

EDIT: Unrelated to this project, I have this general fear of language
recommendation tools trained on just low-quality comments or emails. A simple
thesaurus and a grammar-checker are often enough to find the right words when
writing. But a tool that could understand my intent and then propose
restructured or similar sentences and words that convey the same meaning could
be a true killer application.

~~~
babuskov
> A lot of the content on HN/Reddit/forums is of low quality grammatically and
> often written by non-native English speakers (such as myself).

Corrected:

> A lot of the content on the HN/Reddit/forums is of the quality grammatically
> and non-native written by written English speakers (such myself). as UNK

Yeah. It's got a long way to go. No idea where "as UNK" came from.

~~~
pilooch
UNK usually comes from the final sampling step: the distribution of words
contains a special token UNK to represent all words with insufficient
statistics in the training corpus.
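
For the curious, here is a minimal sketch (not the project's actual code) of how
such a fixed-size vocabulary typically works, with every rare or unseen word
collapsing to UNK:

    # Minimal sketch: a fixed-size vocabulary built from the most frequent
    # training words; everything else maps to UNK.
    from collections import Counter

    UNK = "UNK"

    def build_vocab(sentences, max_size=20000):
        counts = Counter(w for sent in sentences for w in sent.split())
        words = [w for w, _ in counts.most_common(max_size)]
        return {w: i for i, w in enumerate([UNK] + words)}

    def encode(sentence, vocab):
        # Rare/unseen words collapse to the UNK id, which is why "UNK" can
        # surface verbatim when decoder output is mapped back to text.
        return [vocab.get(w, vocab[UNK]) for w in sentence.split()]

    vocab = build_vocab(["a lot of the content", "written by non-native speakers"])
    print(encode("written by babuskov", vocab))  # the unseen name becomes UNK's id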

------
daveytea
This is really cool. If you're looking for more datasets to train your model,
here are a few relevant ones:

\- [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)

\- [http://trec.nist.gov/data/qamain.html](http://trec.nist.gov/data/qamain.html)

\- [http://opus.lingfil.uu.se/OpenSubtitles2016.php](http://opus.lingfil.uu.se/OpenSubtitles2016.php)

\- [http://corpus.byu.edu/full-text/wikipedia.asp](http://corpus.byu.edu/full-text/wikipedia.asp) OR [https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia)

\- [http://opus.lingfil.uu.se/](http://opus.lingfil.uu.se/)

I'd love to see how good your model gets.

~~~
atpaino
Thanks for the links! I'll have to try out some of these. The data is
definitely the limiting factor at this point.

~~~
siscia
Have you considered Project Gutenberg?

Is there any reason why that huge corpus would not be useful?

[https://www.gutenberg.org/](https://www.gutenberg.org/)

~~~
gleenn
I think the issue is you need parallel corpuses of (sentences with grammatical
errors) -> (same sentence but correct). Gutenberg has a lot of only the latter,
I think.

~~~
sarabande
You don't need parallel corpora -- the OP was generating the incorrect
versions of training data through random perturbation (dropping articles,
etc.)
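
A minimal sketch of what that perturbation might look like (the function name
and drop probability here are my guesses, not the OP's actual code):

    # Generate (corrupted, correct) training pairs by randomly dropping articles.
    import random

    ARTICLES = {"a", "an", "the"}

    def perturb(sentence, drop_prob=0.25, rng=random.Random(0)):
        kept = [w for w in sentence.split()
                if not (w.lower() in ARTICLES and rng.random() < drop_prob)]
        return " ".join(kept)

    correct = "Alex went to the kitchen to store the milk in the fridge."
    pair = (perturb(correct), correct)  # (input with dropped articles, target)
    print(pair)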

~~~
gleenn
The author can probably get a lot of traction by doing random perturbation,
but they won't be getting things in the correct ratios (i.e. lots of dropped
hyphens but not many dropped apostrophes), and they also probably won't make
all the types of errors that humans actually make. It will work, but a huge
part of machine learning and doing NLP is getting those ratios right. This is
one reason Google slays with their translator: they have huge corpuses that
allow them to capture fine-grained distinctions in their models.

------
brandonb
Interesting idea! I think this is analogous to the idea of a denoising
autoencoder in computer vision. Here, instead of introducing Gaussian noise at
the pixel level and using a CNN, you're introducing grammatical "noise" at the
word level and using an LSTM.

I think that general framework applies to many different domains. For example,
we trained a denoising sequence autoencoder on HealthKit data (sequences of
step counts and heart rate measurements) in order to predict whether somebody
is likely to have diabetes, high blood pressure, or a heart rhythm disorder
based on wearable data. I've also seen similar ideas applied to EMR data
(similar to word2vec). It's worth reading "Semi-Supervised Sequence Learning",
where they use a non-denoising sequence autoencoder as a pretraining step, and
compare a couple of different techniques:
[https://papers.nips.cc/paper/5949-semi-supervised-sequence-learning.pdf](https://papers.nips.cc/paper/5949-semi-supervised-sequence-learning.pdf)

Toward the end, you start thinking about introducing different types of
grammatical errors, like subject-verb disagreement. I think that's a good way
to think about it. In the limit, you might even have a neural network generate
increasingly harder types of grammatical corruptions, with the goal of
"fooling" the corrector network. As the the corruptor network and corrector
network compete with each other, you might end up with something like a
generative adversarial network:
[https://arxiv.org/abs/1701.00160](https://arxiv.org/abs/1701.00160)

~~~
unhammer
It seems like it would be challenging to get the corruptor to generate
examples that are of the same kind that humans make, while still being
"productive" (in the linguistic sense, i.e. not just overfitting on examples
from a corpus of low-quality text).

It's easy enough to just drop random words or run Levenshtein edits on single
words to create wrong-in-this-context words (then/than), but grammar errors
include much more than can be covered by that method, and the method will
generate many errors of a kind never made by humans. And if you restrict your
method to things already seen in a corpus, it's easy to overfit and miss out
on a whole lot of good stuff.

------
jrapdx3
The "correcter" is a worthy effort, and it needs to start somewhere. It shows
the magnitude of the task considering that missing articles are not the most
crucial grammatical issues in on-line discourse. The meaning of a phrase is
usually comprehensible with or without the article, and native speakers can
easily overlook this kind of error made by non-native speakers.

OTOH more troublesome to readers are common errors such as misuse of "its" vs.
"it's", "to" vs. "too", and "their", "there" and "they're". These mistakes are
quite prevalent among native-speaking writers, and so are more ubiquitous than
the missing-article problem.

The "correcter" didn't correct the latter class of errors. Understandably this
would be a much harder goal to accomplish given the highly contextual nature
of grammatically correct word choices.

It prompts a question about how well the data-driven approach can handle the
problem. Obviously that's what the research is trying to answer. It sure seems
to point to something fairly easy for a human to do that's near or at the
limit of what we can get a computer to do.

------
danso
Tried out some of the classic garden path sentences [0]; of the 4 examples, it
got all but one right:

Original: _The complex houses married and single soldiers and their families._

Deep Text Corrector: _The complex houses married and a single soldiers and
their families._

OT: does anyone know of a more substantial list of garden path sentences that
people use in testing NLP software?

[0]
[https://en.wikipedia.org/wiki/Garden_path_sentence](https://en.wikipedia.org/wiki/Garden_path_sentence)

------
saycheese
Deep Proofreading Tool Comparisons:

[http://www.deepgrammar.com/evaluation](http://www.deepgrammar.com/evaluation)

[https://blogs.nvidia.com/blog/2016/03/04/deep-learning-fix-grammar/](https://blogs.nvidia.com/blog/2016/03/04/deep-learning-fix-grammar/)

------
stephanheijl
Looks like a cool project; I would love to see this as a browser plugin of
some sort. As for the corpus, I suspect that using articles from Wikipedia
would be appropriate. Large articles, especially, are routinely checked and
cleaned up. It has the added benefit of being available in multiple languages.

([https://en.wikipedia.org/wiki/Wikipedia:Database_download](https://en.wikipedia.org/wiki/Wikipedia:Database_download))

EDIT: I see this has already been suggested, along with a number of other
sources, in another comment by daveytea.

------
camoby
Is the spelling of the name supposed to be ironic? ;)

~~~
saycheese
(Comment removed.)

~~~
kurthr
Well, I think the dictionary disagrees.
[http://www.dictionary.com/browse/corrector](http://www.dictionary.com/browse/corrector)

I'm going to assume that it was meant to be ironic (or errorful).

~~~
tyingq
The demo itself agrees. It changes the word correcter, but not the word
corrector :)

~~~
camoby
So its name is a built-in test case?

Genius. ;)

------
YeGoblynQueenne
>> Thus far, these perturbations have been limited to:

      + the subtraction of articles (a, an, the)
      + the subtraction of the second part of a verb contraction (e.g. “‘ve”, “‘ll”, “‘s”, “‘m”)
      + the replacement of a few common homophones with one of their counterparts (e.g. replacing “their” with “there”, “then” with “than”)

Oooh, that's _very_ tricky what they're trying to do there.

"Perturbations" that cause grammatical sentences to become ungrammatical are
_very_ hard to create, for the absolutely practical reason that the only way
to know whether a sentence is ungrammatical is to check that a grammar rejects
it. And, for English (and natural languages generally) we have no such
(complete) grammars. In fact, that's the whole point of language modelling:
everyone's trying to "model" (i.e. approximate, i.e. guess at) the structure
of English (etc.)... because nobody has a complete grammar of it!

Dropping a few bits off sentences may sound like a reasonable alternative (an
approximation of an ungrammaticalising perturbation) but, unfortunately, it's
really, really not that simple.

For instance, take the removal of articles: consider the sentence: "Give him
the flowers". Drop the "the". Now you have "Give him flowers". Which is
perfectly correct and entirely plausible, conversational, everyday English.

In fact, dropping words is de rigueur in language modelling, either to generate
skip-grams for training, or to clean up a corpus by removing "stop words"
(uninformative words like "the" and "and") or, generally, cruft.

For this reason you'll notice that the NUCLE corpus used in the CoNLL-2014
error correction task mentioned in the OP is not auto-generated, and instead
consists of student essays corrected by professors of English.

tl;dr: You can't rely on generating ungrammaticality unless you can generate
grammaticality.

~~~
unhammer
[https://encrypted.google.com/search?q=%22give%20him%20flower...](https://encrypted.google.com/search?q=%22give%20him%20flowers%22)
– if your data set is good enough, you won't be suggesting insertion there.

Maybe calling the perturbed (and model-filtered!) sentence "ungrammatical" is
too strong a statement, but that doesn't stop the system from being useful. We
don't have a perfect model of NL grammar, but nor do we have complete
dictionaries of any NL lexicons – we still rely on spell checkers because they
have fairly acceptable false positive rates. Spell checkers also do these
perturbations: Levenshtein edits. Sometimes they generate sequences that the
model (dictionary) will remove because they're already in there with some
frequency, and they get a true negative; sometimes they generate sequences
that were simply missing from their data set, and they get a false positive;
and sometimes they generate a sequence that is neither in their model nor in
the language, and you get a true positive. The same general principle applies
to grammar checkers (though they are of course much harder to make with an
acceptable false positive rate, and it's much more difficult to generate
confusion sets).

~~~
YeGoblynQueenne
>> – if your data set is good enough, you won't be suggesting insertion there.

Generally, I doubt you should ever suggest any corrections to the use of
articles. This tends to be colloquial, or even personal, and there's no point
in trying to force one style on your users. Unless you like pissing them off
(MS Word sure does).

As to spellcheckers: I think that goes the other way, doesn't it? You apply
some edits to an out-of-vocabulary word you found, to see whether you can
arrive at an in-vocabulary word to suggest as a correction.
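
A minimal sketch of that edit-then-filter idea, with a toy dictionary (a real
spellchecker would use a full word list and rank candidates by frequency):

    # Generate edit-distance-1 variants of an out-of-vocabulary word and keep
    # only the ones that land in the vocabulary.
    import string

    VOCAB = {"corrector", "correct", "corrected"}  # toy dictionary

    def edits1(word):
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
        inserts = [l + c + r for l, r in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def suggest(word):
        if word in VOCAB:
            return [word]                    # already in the vocabulary
        return sorted(edits1(word) & VOCAB)  # in-vocabulary candidates

    print(suggest("correcter"))  # -> ['corrector']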

In any case, I'm not disagreeing with anything you say. I'm just saying that
"grammatical" and "ungrammatical" have strict definitions and you can't just
throw them around like skittles.

------
smoyer
'Kvothe went to the market'

Off-topic but I've been waiting for the third and final book in the trilogy
for a long time ... I've come to the conclusion that Rothfuss can't find a way
to tie all the plot threads together.

I'm also wondering if anyone else thinks Rothfuss looks like Longfellow in a
lot of his publicity shots.

------
topynate
> Unfortunately, I am not aware of any publicly available dataset of (mostly)
> grammatically correct English.

How about books?

~~~
wiz21c
Guess who's got that massive amount of data nowadays? Google, Facebook and
co. So good luck if you want to work on AI stuff...

~~~
mortehu
And Project Gutenberg:
[https://www.gutenberg.org/](https://www.gutenberg.org/)

The language in many of those books may be a bit archaic, though.

See also Distributed Proofreaders:
[http://www.pgdp.net/c/](http://www.pgdp.net/c/)

~~~
jcoffland
There are a fair number of OCR mistakes in many of the Gutenberg texts. It
would be interesting to try to correct them. They are, however, not the same
types of errors as those addressed here.

------
raverbashing
While the set of errors it officially corrects is limited, I tried a few
phrases:

Didn't fix misuse of its: "The tool worked on it's own power"

"He should of gone yesterday" gets corrected to "the He should of gone
yesterday"

"To who does this belong?" doesn't get corrected

"A Apple a day keeps the doctor away" doesn't change

~~~
johnhenry
I tried these, which stem from the given example:

"I'm going to store" gets corrected to "I'm going to the store" (looks good)

"I'm going to store food" gets corrected to "I'm going to the store store"
(I'll forgive this due to lack of a prepositional phrase)

"I'm going to store food in the closet" gets corrected to "I'm going to the
store food in the closet" (yeah, this is wrong)

It seems like the AI is good at fixing sentences where adding a missing "the"
is the solution. So when that does turn out to be the solution, it works fine.
When that isn't the solution, well, when all you have is a hammer, every
problem looks like a nail...

Still really interesting, though. The AI may not [yet] be able to solve
general grammatical errors, but it may be good enough that it works in
specific cases. It may even be the case that we'll design systems that depend
on multiple agents offering suggestions and others aggregating those
suggestions into a more refined result.

~~~
tyingq
It does seem to pick up on the specific cases where adding "the" isn't right,
like:

Alex went to school, Alex went to bed, Alex went to prison

------
zitterbewegung
This is a neat project. I think as a follow-up step he should compare it to
Word's grammar checker.

------
ematvey
Nice work! I've been playing with exactly this idea for some time. Potentially
it could be way bigger than simple grammatical corrections.

My list of things to try, in addition to what you've already done:

\- replacing named entities with metadata-annotated tokens;

\- dropping random words, not just articles;

\- replacing random words with rarer synonyms (see the sketch below);

\- annotating with POS tags from some external parser;

\- running a syntax corrector before feeding sentences into the grammatical model;

I think this problem is easier than it appears on the surface. The generated
deformation does not have to be a perfect replica of typical human errors. It
just has to be sufficiently diverse.
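
For the rarer-synonyms item, here is the sketch I mentioned (the synonym table
and frequency counts are toy placeholders; a real system would use WordNet or
a thesaurus plus corpus-derived counts):

    # Replace random words with rarer synonyms to create "unnatural" inputs.
    import random

    SYNONYMS = {"big": ["large", "sizeable"], "fast": ["quick", "rapid"]}
    FREQ = {"big": 1000, "large": 800, "sizeable": 50,
            "fast": 900, "quick": 700, "rapid": 200}

    def rarer_synonym(word, rng):
        rarer = [s for s in SYNONYMS.get(word, [])
                 if FREQ.get(s, 0) < FREQ.get(word, 0)]
        return rng.choice(rarer) if rarer else word

    def perturb_synonyms(sentence, prob=0.5, rng=random.Random(0)):
        return " ".join(rarer_synonym(w, rng) if rng.random() < prob else w
                        for w in sentence.split())

    print(perturb_synonyms("the big dog runs fast"))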

Also, I think the seq2seq module is getting deprecated, as it doesn't do
dynamic rollouts.

------
WhitneyLand
Alex, nice work, this is exciting. I've been wanting to work on something
similar because the quality of common grammar checkers (like MS Word's) has
made so little progress.

Have you considered combining your approach with a rules-based system? Some
systems using only an elaborate set of rules for common mistakes have had
pretty good performance. I wonder if the two approaches could be combined.

Btw, what is the highest performing grammar checker you've found that's
commonly available?

------
hbornfree
This is terrific! Very coincidentally, we were thinking of implementing a
sentence de-noiser using sequence-to-sequence models just this evening. I work
in the NLP domain writing machine translation systems. NLP parsers are only
accurate for grammatically correct sentences, which creates the need for
something like Deep Text Correcter. Thank you for this. Will try this out this
week and let you know how it goes.

------
UhUhUhUh
Re. intent, but with regard to spelling: I often wonder if there could be
rules to correct errors due to keystrokes in the immediate vicinity of the
intended letter (e.g. "keu" vs. "key"). It would check combinations of the
surrounding letters, first oin (that's an unintended addition here) the
horizontal axis. That happens to me all the time, probably because I'm not a
good typist, but still.
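
Something like this, maybe (a minimal sketch; the adjacency map covers only a
few keys of a QWERTY layout and the dictionary is a toy):

    # Correct "fat-finger" typos by swapping letters for their keyboard neighbours.
    ADJACENT = {"u": "yihj", "y": "tugh", "i": "uojk", "e": "wrsd", "o": "ipkl"}
    VOCAB = {"key", "keys"}  # toy dictionary

    def adjacency_candidates(word):
        # Try each letter's physical neighbours and keep dictionary hits.
        cands = set()
        for i, ch in enumerate(word):
            for repl in ADJACENT.get(ch, ""):
                cands.add(word[:i] + repl + word[i + 1:])
        return sorted(c for c in cands if c in VOCAB)

    print(adjacency_candidates("keu"))  # 'u' sits next to 'y' -> ['key']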

------
OJFord
I can't make it work for anything other than the missing 'the' example.

For example:

> _Do you know where I been_

'corrects' to:

> _Do you know where I 's been_

------
mikeflynn
Interesting project, and I love the example they used on the demo page. (Go
Cardinals!)

------
ashildr
And so it begins: [http://www.goodreads.com/book/show/13184491-avogadro-corp](http://www.goodreads.com/book/show/13184491-avogadro-corp)

------
macawfish
"I gotta take shit"

->

"I gotta take the shit"

Sorry to be airing out my personal business and everything but... everybody
poops!

~~~
BuuQu9hu
[https://www.youtube.com/watch?v=kQTW7Pd1vqc](https://www.youtube.com/watch?v=kQTW7Pd1vqc)

------
grizzles
For training data, you could try ebook torrents, e.g. books with a Creative
Commons license.

------
koliber
Would the works archived in Project Gutenberg be a good training corpus?

------
burnbabyburn
Isn't an n-gram Bayesian model sufficient here?

Also, isn't the test data linearly dependent on the training set here, thus
creating a skewed performance measurement?
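
Roughly, such an n-gram baseline could look like this (the bigram counts are
made-up toy numbers, purely illustrative):

    # Score candidate article insertions with bigram counts; pick the best.
    BIGRAMS = {("going", "to"): 800, ("to", "the"): 500,
               ("the", "store"): 300, ("to", "store"): 40}

    def score(sentence):
        words = sentence.lower().split()
        return sum(BIGRAMS.get(pair, 0) for pair in zip(words, words[1:]))

    candidates = ["I'm going to store", "I'm going to the store"]
    print(max(candidates, key=score))  # the higher-scoring candidate wins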

------
guelo
What about using books such as from Project Gutenberg?

------
sigmonsays
but will it correct "lets eat grandma"

------
vacri
> _" Kvothe went to market"_

This is not a grammatically incorrect sentence; it depends on context.
Products are take to an abstract concept of 'market', for example.

