Hacker News
Deep Text Correcter (atpaino.com)
228 points by atpaino on Jan 8, 2017 | 66 comments

Interesting idea. I went ahead and tested:

> Alex went to the kitchen to store the milk in the fridge.


> Alex went to the kitchen to the store the milk in the fridge.

Gathering a large, high quality dataset from the internet is probably not so easy. A lot of the content on HN/Reddit/forums is of low quality grammatically and often written by non-native English speakers (such as myself). Movie dialogues don't necessarily consist of grammatically correct sentences like the ones you'd write in a letter. Perhaps there is some public domain contemporary literature available that could be used instead or alongside the dialogues?

EDIT: Unrelated to this project, I have this general fear of language recommendation tools trained on just low-quality comments or emails. A simple thesaurus and a grammar-checker are often enough to find the right words when writing. But a tool that could understand my intent and then propose restructured or similar sentences and words that convey the same meaning could be a true killer application.

> A lot of the content on HN/Reddit/forums is of low quality grammatically and often written by non-native English speakers (such as myself).


> A lot of the content on the HN/Reddit/forums is of the quality grammatically and non-native written by written English speakers (such myself). as UNK

Yeah. It's got a long way to go. No idea where "as UNK" came from.

UNK usually comes from the final sampling step: the distribution of words contains a special token UNK to represent all words with insufficient statistics in the training corpus.

It's like the equivalent of NaN; what you put when you don't know what should be there.
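To make the UNK explanation concrete, here is a minimal sketch of how a vocabulary with an UNK placeholder is commonly built. The `min_count` threshold, the function names, and the toy corpus are assumptions for illustration and are not taken from the project's code.

```python
from collections import Counter

UNK = "UNK"  # placeholder for words with too few training occurrences


def build_vocab(sentences, min_count=2):
    """Keep words seen at least min_count times; everything else maps to UNK."""
    counts = Counter(word for s in sentences for word in s.split())
    return {w for w, c in counts.items() if c >= min_count} | {UNK}


def encode(sentence, vocab):
    """Replace out-of-vocabulary words with the UNK placeholder."""
    return [w if w in vocab else UNK for w in sentence.split()]


# Toy corpus (hypothetical): "fridge" and "kitchen" each occur only once.
corpus = [
    "the milk is in the fridge",
    "the milk is in the kitchen",
]
vocab = build_vocab(corpus)
tokens = encode("the milk is in the closet", vocab)
print(tokens)
```

Since UNK is an ordinary entry in the output vocabulary, the decoder can also emit it, which is one plausible way a stray "UNK" ends up in a corrected sentence.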

coincidentally, the op corrected this as: It's like the equivalent of NaN

Yep, it definitely has room to improve. The work thus far has primarily been a proof-of-concept for the methodology used to generate training samples (i.e. starting with grammatically correct text and introducing errors). Next step is to try to include more high quality data, after which I may try out comment data from HN, etc. I think it would be interesting to see what the effect of somewhat noisier data like that would have on the model.

Programmatically generating incorrect grammar from correct sentences must be really tough. There are so many more ways to incorrectly structure a sentence than there are correct ones.

Random idea: what happens if you use Google translate to generate the incorrect sentence, I.e. Translate it to other languages and then back again. If the resulting sentence doesn't match the original, add it to the dataset.

>> But a tool that could understand my intent and then propose restructured or similar sentences and words that convey the same meaning could be a true killer application.

Yeah but that's hard though. Who knows what is your intent when you're generating an utterance, written or spoken? Sometimes you yourself may even read what you wrote (or hear what you said) and wonder what you meant.

Not to mention what an absolute nightmare it would be, trying to compile data on the linguistic intent behind utterances! How do you even start to collect that? Ask people to say things, then ask them what they meant... but what did they mean when they explain what they meant in the first instance?

The worst thing is that the very notion of what is grammatical changes with context. For instance, to go back to dropping the "the"'s: imagine you read the phrase "eat soup with spoon". Is that an ungrammatical form of "eat the soup with the spoon", or is it an instruction, perhaps something you'd find in a soup-eating manual, which therefore is perfectly valid as a terse form of English?

What you intend an utterance to mean affects whether it is grammatical and the grammaticality of the utterance affects its meaning. Nice, eh?

You make a great point.

Presumably biasing the training sets to those built from corrected sources would help, but what about using n-grams to throw out poor examples of ungrammatical sentences? It wouldn't be perfect, but some kind of score based on a much lower prevalence of n-grams from the generated sentence vs the original might indicate acceptable cases, and those where they were fairly similar might indicate the generated sentence could in fact be valid (and thus should be discarded).
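A rough sketch of that n-gram filter, assuming bigrams, a tiny hypothetical reference corpus, and an arbitrary `margin` threshold (none of these come from the project): a perturbed sentence is kept as an "error" example only if it contains noticeably more unattested bigrams than the original.

```python
from collections import Counter


def bigrams(tokens):
    """Adjacent token pairs of a tokenised sentence."""
    return list(zip(tokens, tokens[1:]))


def train_bigram_counts(sentences):
    """Count bigrams over a reference corpus of (mostly) correct text."""
    counts = Counter()
    for s in sentences:
        counts.update(bigrams(s.lower().split()))
    return counts


def unseen_fraction(sentence, counts):
    """Share of the sentence's bigrams never observed in the reference corpus."""
    grams = bigrams(sentence.lower().split())
    if not grams:
        return 0.0
    return sum(1 for g in grams if counts[g] == 0) / len(grams)


def keep_as_error_example(original, perturbed, counts, margin=0.2):
    """Keep the pair only if the perturbed text looks clearly less attested."""
    gap = unseen_fraction(perturbed, counts) - unseen_fraction(original, counts)
    return gap >= margin


# Hypothetical reference corpus; a real one would be far larger.
reference = [
    "give him the flowers",
    "give him flowers",
    "alex went to the kitchen",
]
counts = train_bigram_counts(reference)
```

With this toy data, the pair ("give him the flowers", "give him flowers") is discarded because the shortened sentence is itself attested, while ("alex went to the kitchen", "alex went to kitchen") is kept.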

> A tool that could understand my intent and then propose structure or similar sentences and words

I'm working on this exact thing right now fwiw, in my app called Prompts. I imagine machine assisted apps are going to percolate up over the coming years, or months, in a way we haven't seen yet. It's pretty exciting, if we can get it right. Written language seems like a decent place to start.

Dropping `in the fridge` works.

"Alex went to kitchen to store the milk" corrects to "Alex went to the kitchen to store the milk"

On the other hand, it never removes unnecessary articles, demonstrating one of the deficiencies of the training set.

Couldn't that be rectified with a Parts of Speech tagger? Or am I overlooking something?

The training data never says to remove any articles, so a PoS tagger won't help.

I also got some weird results in the small sample of testing I did.

> The player chose to go to the good team rather than the sucky one.

> The player chose to go to the good team than rather the sucky one.

And it seems to like duplicating adjectives:

> Being worse off than even worse off makes one feel sad.

> Being worse worse than even worse worse makes one feel sad.

Interesting example, seems the algo cannot infer whether 'store' is the verb or the noun. It would need to understand the meaning and context of the sentence to do this.

Maybe it's also to do with how they generate the training data. Author did say removal of articles was one thing they used to generate the 'incorrect' sentence in the training data. If the film data uses 'the store' much more than 'store', you can imagine how the algo could get biased to thinking store is always preceded with 'the'.

Technical documentation may be dry, but it is generally well read and edited, grammatically correct, and easily obtained.

Thanks for the links! I'll have to try out some of these. The data is definitely the limiting factor at this point.

Have you considered Project Gutenberg?

Is there any reason why that huge corpus would not be useful?


I think the issue is you need parallel corpora of (sentences with grammatical errors) -> (same sentence but correct). Gutenberg has a lot of only the latter, I think.

You don't need parallel corpora -- the OP was generating the incorrect versions of training data through random perturbation (dropping articles, etc.)
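As a minimal sketch of what such random perturbation could look like, the code below applies the three perturbation types the project describes (dropping articles, dropping the second part of a contraction, swapping common homophones). The probabilities, word lists, and function names are assumptions for illustration, not the author's implementation.

```python
import random

ARTICLES = {"a", "an", "the"}
CONTRACTION_SUFFIXES = ("'ve", "'ll", "'s", "'m")
# Small illustrative homophone table; a real one would be larger.
HOMOPHONES = {"their": "there", "there": "their", "then": "than", "than": "then"}


def perturb(sentence, rng, p_article=0.25, p_contraction=0.25, p_homophone=0.25):
    """Return a noisy copy of a (presumed correct) whitespace-tokenised sentence."""
    out = []
    for tok in sentence.split():
        low = tok.lower()
        # 1. Drop articles.
        if low in ARTICLES and rng.random() < p_article:
            continue
        # 2. Drop the second part of a contraction ("I've" -> "I").
        suffix = next(
            (s for s in CONTRACTION_SUFFIXES
             if low.endswith(s) and len(low) > len(s)),
            None,
        )
        if suffix and rng.random() < p_contraction:
            tok = tok[: -len(suffix)]
        # 3. Swap a homophone for its counterpart (casing is not preserved).
        elif low in HOMOPHONES and rng.random() < p_homophone:
            tok = HOMOPHONES[low]
        out.append(tok)
    return " ".join(out)


def make_pair(sentence, rng, **kwargs):
    """Build one (noisy source, clean target) training pair."""
    return perturb(sentence, rng, **kwargs), sentence
```

Calling `make_pair("Alex went to the kitchen", random.Random(0))` yields a (source, target) tuple; because the noise is random, the source may equal the target, which also gives the model examples where nothing needs fixing.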

The author can probably get a lot of traction by doing random perturbation, but they won't be getting things in the correct ratios (i.e. lots of dropped hyphens but not many dropped apostrophes) and also probably won't make all the types of errors that humans actually make. It will work, but a huge part of machine learning and doing NLP is getting those ratios right. This is one reason Google slays with their translator: they have huge corpora that allow them to get fine-grained distinctions in their models.

Interesting idea! I think this is analogous to the idea of a de-noising autoencoder in computer vision. Here, instead of introducing Gaussian noise at the pixel level and using a CNN, you're introducing grammatical "noise" at the word level and using an LSTM.

I think that general framework applies to many different domains. For example, we trained a denoising sequence autoencoder on HealthKit data (sequences of step counts and heart rate measurements) in order to predict whether somebody is likely to have diabetes, high blood pressure, or a heart rhythm disorder based on wearable data. I've also seen similar ideas applied to EMR data (similar to word2vec). It's worth reading "Semi-Supervised Sequence Learning", where they use a non-denoising sequence autoencoder as a pretraining step, and compare a couple of different techniques: https://papers.nips.cc/paper/5949-semi-supervised-sequence-l...

Toward the end, you start thinking about introducing different types of grammatical errors, like subject-verb disagreement. I think that's a good way to think about it. In the limit, you might even have a neural network generate increasingly harder types of grammatical corruptions, with the goal of "fooling" the corrector network. As the corruptor network and corrector network compete with each other, you might end up with something like a generative adversarial network: https://arxiv.org/abs/1701.00160

It seems like it would be challenging to get the corruptor to generate examples that are of the same kind that humans make, while still being "productive" (in the linguistic sense, i.e. not just overfitting on examples from a corpus of low quality text).

It's easy enough to just drop random words or run Levenshtein-edits on single words to create wrong-in-this-context mistakes (then/than), but grammar errors include much more than can be covered by that method, and the method will generate many errors that are of a kind never made by humans. And if you restrict your method to things already seen in a corpus, it's easy to overfit and miss out on a whole lot of good stuff.

I like the analogy to de-noising autoencoders; that's a good way of thinking about this.

> In the limit, you might even have a neural network generate increasingly harder types of grammatical corruptions, with the goal of "fooling" the corrector network.

Very interesting. I wonder how many constraints would need to be added to the corruptor model to ensure the corrupted sentence retains the same meaning as the original. Somewhat related to that, I've thought that a more basic curriculum learning setup could be deployed quite effectively here, and am hoping to try that out soon.

The "correcter" is a worthy effort, and it needs to start somewhere. It shows the magnitude of the task considering that missing articles are not the most crucial grammatical issues in on-line discourse. The meaning of a phrase is usually comprehensible with or without the article, and native speakers can easily overlook this kind of error made by non-native speakers.

OTOH more troublesome to readers are common errors such as misuse of "its" vs. "it's", "to" vs. "too", and "their", "there" and "they're". These mistakes are quite prevalent among native-speaking writers, and so are more ubiquitous than the missing article problem.

The "correcter" didn't correct the latter class of errors. Understandably this would be a much harder goal to accomplish given the highly contextual nature of grammatically correct word choices.

It prompts a question about how well the data-driven approach can handle the problem. Obviously that's what the research is trying to answer. It sure seems to point to something fairly easy for a human to do that's near or at the limit of what we can get a computer to do.

Tried out some of the classic garden path sentences [0], and of the 4 examples, it got all but one right:

Original: The complex houses married and single soldiers and their families.

Deep Text Corrector: The complex houses married and a single soldiers and their families.

OT: does anyone know of a more substantial list of garden path sentences that people use in testing NLP software?

[0] https://en.wikipedia.org/wiki/Garden_path_sentence

Looks like a cool project, I would love to see this as a browser plugin of some sort. As for the corpus, I suspect that using articles from Wikipedia would be appropriate. Especially large articles are routinely checked and cleaned up. It has the added benefit of being available in multiple languages.


EDIT: I see this has already been suggested, along with a large number of other sources in another comment by daveytea.

Is the spelling of the name supposed to be ironic? ;)

(Comment removed.)

Well, I think the dictionary disagrees. http://www.dictionary.com/browse/corrector

I'm going to assume that it was meant to be ironic (or errorful).

The demo itself agrees. It changes the word correcter, but not the word corrector :)

So its name is a built-in test case?

Genius. ;)

Is the final word not supposed to be "corrector?"


>> Thus far, these perturbations have been limited to:

  + the subtraction of articles (a, an, the)
  + the subtraction of the second part of a verb contraction (e.g. “‘ve”, “‘ll”, “‘s”, “‘m”)
  + the replacement of a few common homophones with one of their counterparts (e.g. replacing “their” with “there”, “then” with “than”)

Oooh, that's _very_ tricky what they're trying to do there.

"Perturbations" that cause grammatical sentences to become ungrammatical are _very_ hard to create, for the absolutely practical reason that the only way to know whether a sentence is ungrammatical is to check that a grammar rejects it. And, for English (and generally natural languages) we have no (complete) such grammars. In fact, that's the whole point of language modelling- everyone's trying to "model" (i.e. approximate, i.e. guess at) the structure of English (etc)... because nobody has a complete grammar of it!

Dropping a few bits off sentences may sound like a reasonable alternative (an approximation of an ungrammaticalising perturbation) but, unfortunately, it's really, really not that simple.

For instance, take the removal of articles: consider the sentence: "Give him the flowers". Drop the "the". Now you have "Give him flowers". Which is perfectly correct and entirely plausible, conversational, everyday English.

In fact, dropping words is de rigueur in language modelling, either to generate skip-grams for training, or to clean up a corpus by removing "stop words" (uninformative words like the the's and and's) or generally, cruft.

For this reason you'll notice that the NUCLE corpus used in the CoNLL-2014 error correction task mentioned in the OP is not auto-generated, and instead consists of student essays corrected by professors of English.

tl;dr: You can't rely on generating ungrammaticality unless you can generate grammaticality.

https://encrypted.google.com/search?q=%22give%20him%20flower... – if your data set is good enough, you won't be suggesting insertion there.

Maybe calling the perturbed (and model-filtered!) sentence "ungrammatical" is too strong a statement, but that doesn't stop the system from being useful. We don't have a perfect model of NL grammar, but nor do we have complete dictionaries of any NL lexicons – we still rely on spell checkers because they have fairly acceptable false positive rates. Spell checkers also do these perturbations: Levenshtein-edits. And sometimes they generate sequences that the model (dictionary) will remove because it's already in there with some frequency and they generate a true negative, sometimes they generate sequences that were simply missing from their data set and they generate a false positive, and sometimes they generate a sequence that is neither in their model nor in the language and you get a true positive. The same general principle applies to grammar checkers (but they of course are much harder to make with an acceptable false positive rate, and it's much more difficult to generate confusion sets).

>> – if your data set is good enough, you won't be suggesting insertion there.

Generally, I doubt you should ever suggest any corrections to the use of articles. This tends to be colloquial, or even personal and there's no point trying to force one style on your users. Unless you like pissing them off (MS Word sure does).

As to spellcheckers- I think that goes the other way, doesn't it? You apply some edits to an out-of-vocabulary word you found, to see whether you can arrive at an in-vocabulary word to suggest as a correction.
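For reference, the edit-then-lookup direction can be sketched in a few lines, in the style of the well-known single-edit candidate generator. The vocabulary here is a toy assumption, and a real spell checker would also rank candidates by frequency.

```python
import string


def edits1(word):
    """All strings one Levenshtein-style edit (plus transposition) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, vocabulary):
    """Suggest in-vocabulary words one edit away from an out-of-vocabulary word."""
    if word in vocabulary:
        return []  # already a known word, nothing to correct
    return sorted(edits1(word) & vocabulary)


# Toy vocabulary (hypothetical).
vocab = {"then", "than", "the", "store", "milk"}
print(suggest("thn", vocab))
```

Note that "thn" is one edit from "than", "the", and "then" alike, so the edit step alone cannot choose between them; that choice is where context, and hence grammar, comes in.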

In any case, I'm not disagreeing with anything you say. I'm just saying that "grammatical" and "ungrammatical" have strict definitions and you can't just throw them around like skittles.

So what? Who says the training data has to be perfect? Your comment reminds me of an essay by Peter Norvig, in which he showed that a simple statistical model could tell the difference between a nonsense sentence that was grammatically correct and one that wasn't, but gave them both very low probability of occurring naturally.

Similarly, the machine may learn that "Give him the flowers" has higher probability than "give him flowers". Or it may learn that both are possible sentences and not be able to correct it. Which is also OK, we can't expect any system to be perfect.

You're talking about the grammatical, but nonsensical sentence "colorless green ideas sleep furiously" proposed by Chomsky, along with its ungrammatical counterpart, "Furiously sleep ideas green colorless." [1] Chomsky famously said that one of the two sentences is ungrammatical, but both would be assigned a low probability by a statistical model trained on English corpora. The one who showed that the ungrammatical sentence was (much) less likely than the grammatical one was Fernando Pereira, not Peter Norvig. Peter Norvig has argued against Chomsky in a different instance, and cited Pereira during that (one-sided) exchange, if memory serves.

This is completely unrelated to what I'm saying, of course. The fact of the matter is that you can't just swap some words out of a sentence and call it "ungrammatical". You can call it a "perturbed" sentence. You can call it a "sentence with some words dropped". You can call it Daisy Duck for all I care. But it's not an ungrammatical sentence, because that sort of thing has a very precise definition, which I gave above.


[1] The sentence is famous enough to have a Wikipedia page all its own: https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_fu...

'Kvothe went to the market'

Off-topic but I've been waiting for the third and final book in the trilogy for a long time ... I've come to the conclusion that Rothfuss can't find a way to tie all the plot threads together.

I'm also wondering if anyone else thinks Rothfuss looks like Longfellow in a lot of his publicity shots.

> Unfortunately, I am not aware of any publicly available dataset of (mostly) grammatically correct English.

How about books?

Wikipedia, Books are usually ok

Online forums are a disaster (HN is not too bad, but there are a lot of natives and non-natives fumbling some things)

Guess who's got that massive amount of data nowadays? Google, Facebook and co. So good luck if you want to work on AI stuff...

And Project Gutenberg: https://www.gutenberg.org/

The language in many of those books may be a bit archaic, though.

See also Distributed Proofreaders: http://www.pgdp.net/c/

There are a fair number of OCR mistakes in many of the Gutenberg texts. It would be interesting to try to correct them. They are however not the same types of errors which are addressed here.

Who's got that massive amount of data nowadays? Everybody who wants it.

I recommend http://opus.lingfil.uu.se/ as a starting point to find lots of sources of quality, multilingual data.

And if all you want is sheer quantity, at rather the expense of quality, then there's a several terabyte Web crawl at http://commoncrawl.org/.

While there are limited errors it can officially correct, I tried a few phrases:

Didn't fix misuse of its: "The tool worked on it's own power"

"He should of gone yesterday" gets corrected to "the He should of gone yesterday"

"To who does this belong?" doesn't get corrected

"A Apple a day keeps the doctor away" doesn't change

I tried these, which stem from the given example:

"I'm going to store" gets corrected to "I'm going to the store" (looks good)

"I'm going to store food" gets corrected to "I'm going to the store store" (I'll forgive this due to lack of a prepositional phrase)

"I'm going to store food in the closet" gets corrected to "I'm going to the store food in the closet" (yeah, this is wrong)

It seems like the AI is good at fixing sentences where adding a missing "the" is the solution. So when that does turn out to be the solution, it works fine. When that isn't the solution, well, when all you have is a hammer, every problem looks like a nail...

Still really interesting, though. The AI may not [yet] be able to solve general grammatical errors, but it may be good enough that it works in specific cases. It may even be the case that we'll design systems that depend on multiple agents offering suggestions and others aggregating suggestions into a more refined result.

It does seem to pick up on the specific cases where adding "the" isn't right, like:

Alex went to school, Alex went to bed, Alex went to prison

This is a neat project. I think as a follow up step he should compare it to Word's grammar checker.

Nice work! I was playing with exactly this idea for some time. Potentially it could be way bigger than simple grammatical corrections.

My list of things to try, in addition to what you've already done:

- replacing named entities with metadata-annotated tokens;

- dropping random words, not just articles;

- replacing random words with rarer synonyms;

- annotate with POS tags from some external parser;

- run syntax corrector before feeding sentences in grammatical model;

I think this problem is easier than it appears on the surface. Generated deformation does not have to be a perfect replica of typical human errors. It just has to be sufficiently diverse.

Also, I think seq2seq module is getting deprecated, as it doesn't do dynamic rollouts.

Alex, nice work, this is exciting. I've been wanting to work on something similar because the quality of common grammar checkers (like MS Word) has made so little progress.

Have you considered combining your approach with a rules-based system? Some systems using only an elaborate set of rules for common mistakes have had pretty good performance. I wonder if these two approaches could be combined.

Btw, what is the highest performing grammar checker you've found that's commonly available?

This is terrific! Very coincidentally, we were thinking just this evening of implementing a sentence de-noiser using sequence-to-sequence models. I work in the NLP domain writing Machine Translation systems. But NLP parsers are accurate only for grammatically correct sentences, which creates the need for something like deep text correcter. Thank you for this. Will try this out this week and let you know how it goes.

Re. intent but with regards to spelling. I often wonder if there could be rules to correct errors due to key strokes in the immediate vicinity of the intended letter (e.g. "keu" vs. "key"). Would check combination of the surrounding letters, first oin (that's an unintended addition here) horizontal axis. That happens to me all the time, probably because I'm not a good typist but still.
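A minimal sketch of such a rule, assuming a QWERTY layout and checking only horizontal neighbours as a first step: each letter is swapped for the key to its left or right, and the result is kept if it is a known word. The layout table, function names, and vocabulary are illustrative assumptions.

```python
# QWERTY letter rows; only horizontal adjacency is modelled in this sketch.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def horizontal_neighbours(ch):
    """Keys immediately left and right of ch on the same keyboard row."""
    for row in ROWS:
        i = row.find(ch)
        if i != -1:
            return [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
    return []  # not a letter key


def adjacent_key_candidates(word, vocabulary):
    """Known words reachable by replacing one letter with a neighbouring key."""
    candidates = set()
    for i, ch in enumerate(word):
        for neighbour in horizontal_neighbours(ch):
            candidate = word[:i] + neighbour + word[i + 1:]
            if candidate in vocabulary:
                candidates.add(candidate)
    return sorted(candidates)


# Toy vocabulary (hypothetical).
vocab = {"key", "kit", "let"}
print(adjacent_key_candidates("keu", vocab))
```

Unintended additions such as "oin" for "on" would need a separate rule that deletes a letter when it sits next to one of its keyboard neighbours; that case is not covered here.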

I can't make it work for anything other than the missing 'the' example.

For example:

> Do you know where I been

'corrects' to:

> Do you know where I 's been

Interesting project, and I love the example they used on the demo page. (Go Cardinals!)

"I gotta take shit"


"I gotta take the shit"

Sorry to be airing out my personal business and everything but... everybody poops!

For training data, you could try ebook torrents, eg. books with a creative commons license.

Would the works archived in Project Gutenberg be a good training corpus?

isn't an n-gram Bayesian model sufficient here?

also, isn't the test data dependent on the training set here, so creating a skewed performance measurement?

What about using books such as from Project Gutenberg?

but will it correct "lets eat grandma"

> "Kvothe went to market"

This is not a grammatically incorrect sentence; it depends on context. Products are taken to an abstract concept of 'market', for example.
