> Alex went to the kitchen to store the milk in the fridge.
> Alex went to the kitchen to the store the milk in the fridge.
Gathering a large, high quality dataset from the internet is probably not so easy. A lot of the content on HN/Reddit/forums is of low quality grammatically and often written by non-native English speakers (such as myself). Movie dialogues don't necessarily consist of grammatically correct sentences like the ones you'd write in a letter. Perhaps there is some public domain contemporary literature available that could be used instead or alongside the dialogues?
Unrelated to this project, I have this general fear of language recommendation tools trained on just low-quality comments or emails. A simple thesaurus and a grammar-checker are often enough to find the right words when writing. But a tool that could understand my intent and then propose restructured or similar sentences and words that convey the same meaning could be a true killer application.
> A lot of the content on the HN/Reddit/forums is of the quality grammatically and non-native written by written English speakers (such myself). as UNK
Yeah. It's got a long way to go. No idea where "as UNK" came from.
Coincidentally, the OP corrected this as: "It's like the equivalent of NaN"
Random idea: what happens if you use Google Translate to generate the incorrect sentence, i.e. translate it to other languages and then back again? If the resulting sentence doesn't match the original, add it to the dataset.
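That round-trip idea is easy to prototype. Below is a minimal sketch of just the filtering logic; `fake_translate` is a stand-in stub of my own (a real version would call an actual MT API), and the pivot language choice is arbitrary:

```python
import re

def normalize(sentence):
    """Lowercase and strip punctuation so trivial differences don't count."""
    return re.sub(r"[^\w\s]", "", sentence.lower()).split()

def round_trip_pair(sentence, translate):
    """Translate to a pivot language and back. If the round trip changed
    the wording, return (noisy, original) as a training pair, else None."""
    pivot = translate(sentence, src="en", dst="de")
    back = translate(pivot, src="de", dst="en")
    if normalize(back) != normalize(sentence):
        return (back, sentence)  # (corrupted input, gold correction)
    return None

# Stub standing in for a real MT API, purely for illustration:
# it "loses" articles on the way back, like a sloppy round trip might.
def fake_translate(text, src, dst):
    return text.replace("the ", "") if dst == "en" else text

pair = round_trip_pair("Alex went to the kitchen.", fake_translate)
# pair == ("Alex went to kitchen.", "Alex went to the kitchen.")
```

One caveat that matters here: as discussed elsewhere in the thread, a changed round trip isn't necessarily ungrammatical, so you'd still want some filter on top of this.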
Yeah but that's hard though. Who knows what is your intent when you're generating an utterance, written or spoken? Sometimes you yourself may even read what you wrote (or hear what you said) and wonder what you meant.
Not to mention what an absolute nightmare it would be, trying to compile data on the linguistic intent behind utterances! How do you even start to collect that? Ask people to say things, then ask them what they meant... but then what did they mean when they explained what they meant in the first instance?
The worst thing is that the very notion of what is grammatical changes with context. For instance, to go back to dropping the "the"'s: imagine you read the phrase "eat soup with spoon". Is that an ungrammatical form of "eat the soup with the spoon", or is it an instruction, perhaps something you'd find in a soup-eating manual, which is therefore perfectly valid as a terse form of English?
What you intend an utterance to mean affects whether it is grammatical and the grammaticality of the utterance affects its meaning. Nice, eh?
Presumably biasing the training sets toward those built from corrected sources would help, but what about using n-grams to throw out poor examples of ungrammatical sentences? It wouldn't be perfect, but some kind of score could work: a much lower prevalence of n-grams in the generated sentence vs. the original might indicate acceptable cases, while fairly similar prevalence might indicate the generated sentence could in fact be valid (and thus should be discarded).
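As a rough sketch of that scoring idea (toy corpus, bigrams instead of longer n-grams, and an arbitrary 0.6 threshold, all assumptions on my part):

```python
from collections import Counter

def ngrams(tokens, n=2):
    return list(zip(*(tokens[i:] for i in range(n))))

def build_counts(corpus, n=2):
    counts = Counter()
    for sentence in corpus:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts

def prevalence(sentence, counts, n=2):
    """Average corpus frequency of the sentence's n-grams."""
    grams = ngrams(sentence.lower().split(), n)
    return sum(counts[g] for g in grams) / len(grams) if grams else 0.0

corpus = ["alex went to the kitchen",
          "she went to the store",
          "he walked to the store yesterday"]
counts = build_counts(corpus)

original = "she went to the store"
perturbed = "she went to store"
# Keep the pair only if the perturbation made the sentence markedly less
# n-gram-plausible than the original; similar scores suggest the perturbed
# sentence may still be valid English and should be discarded.
keep = prevalence(perturbed, counts) < 0.6 * prevalence(original, counts)
```

With a real corpus you'd want smoothing and longer n-grams, but the shape of the filter is the same.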
I'm working on this exact thing right now fwiw, in my app called Prompts. I imagine machine assisted apps are going to percolate up over the coming years, or months, in a way we haven't seen yet. It's pretty exciting, if we can get it right. Written language seems like a decent place to start.
"Alex went to kitchen to store the milk"
"Alex went to the kitchen to store the milk"
> The player chose to go to the good team rather than the sucky one.
> The player chose to go to the good team than rather the sucky one.
And it seems to like duplicating adjectives:
> Being worse off than even worse off makes one feel sad.
> Being worse worse than even worse worse makes one feel sad.
Maybe it's also to do with how they generate the training data. The author did say removal of articles was one thing they used to generate the 'incorrect' sentence in the training data. If the film data uses 'the store' much more often than 'store', you can imagine how the algorithm could get biased toward thinking 'store' is always preceded by 'the'.
I'd love to see how good your model gets.
Is there any reason why that huge corpus would not be useful?
I think that general framework applies to many different domains. For example, we trained a denoising sequence autoencoder on HealthKit data (sequences of step counts and heart rate measurements) in order to predict whether somebody is likely to have diabetes, high blood pressure, or a heart rhythm disorder based on wearable data. I've also seen similar ideas applied to EMR data (similar to word2vec). It's worth reading "Semi-Supervised Sequence Learning", where they use a non-denoising sequence autoencoder as a pretraining step, and compare a couple of different techniques:
Toward the end, you start thinking about introducing different types of grammatical errors, like subject-verb disagreement. I think that's a good way to think about it. In the limit, you might even have a neural network generate increasingly harder types of grammatical corruptions, with the goal of "fooling" the corrector network. As the the corruptor network and corrector network compete with each other, you might end up with something like a generative adversarial network:
It's easy enough to just drop random words or run Levenshtein edits on single words to create wrong-in-this-context mistakes (then/than), but grammar errors include much more than can be covered by that method, and the method will generate many errors of a kind never made by humans. And if you restrict your method to things already seen in a corpus, it's easy to overfit and miss out on a whole lot of good stuff.
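For what it's worth, the real-word-error part of that method is only a few lines. This sketch generates all strings at edit distance 1 and keeps the ones that collide with a vocabulary, i.e. exactly the then/than class of wrong-in-this-context words (the tiny vocabulary here is just for illustration):

```python
import string

def one_edits(word):
    """All strings at Levenshtein distance 1 from `word` (lowercase only)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b
                for c in string.ascii_lowercase]
    inserts = [a + c + b for a, b in splits for c in string.ascii_lowercase]
    return set(deletes + replaces + inserts) - {word}

def real_word_errors(word, vocab):
    """Edits that are themselves valid words: the dangerous
    wrong-in-this-context class (then/than)."""
    return sorted(one_edits(word) & vocab)

vocab = {"then", "than", "the", "them", "thin"}
print(real_word_errors("then", vocab))  # ['than', 'the', 'them', 'thin']
```

Everything this produces is a plausible typo, but as the comment says, it's still only a sliver of the space of real grammar errors.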
> In the limit, you might even have a neural network generate increasingly harder types of grammatical corruptions, with the goal of "fooling" the corrector network.
Very interesting. I wonder how many constraints would need to be added to the corruptor model to ensure the corrupted sentence retains the same meaning as the original. Somewhat related to that, I've thought that a more basic curriculum learning setup could be deployed quite effectively here, and am hoping to try that out soon.
OTOH, more troublesome to readers are common errors such as misuse of "its" vs. "it's", "to" vs. "too", and "their", "there" and "they're". These mistakes are quite prevalent among native-speaking writers, so they're even more ubiquitous than the missing-article problem.
The "correcter" didn't correct this latter class of errors. Understandably, this would be a much harder goal to accomplish given the highly contextual nature of grammatically correct word choices.
It prompts a question about how well the data-driven approach can handle the problem. Obviously that's what the research is trying to answer. It sure seems to point to something fairly easy for a human to do that's near or at the limit of what we can get a computer to do.
Original: The complex houses married and single soldiers and their families.
Deep Text Corrector: The complex houses married and a single soldiers and their families.
OT: does anyone know of a more substantial list of garden path sentences that people use in testing NLP software?
EDIT: I see this has already been suggested, along with a number of other sources, in another comment by daveytea.
I'm going to assume that it was meant to be ironic (or errorful).
+ the subtraction of articles (a, an, the)
+ the subtraction of the second part of a verb contraction (e.g. “‘ve”, “‘ll”, “‘s”, “‘m”)
+ the replacement of a few common homophones with one of their counterparts (e.g. replacing “their” with “there”, “then” with “than”)
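Those three perturbations are simple enough to implement. A rough sketch (whitespace tokenization and straight apostrophes assumed; the project's actual code may well differ):

```python
import re

ARTICLES = {"a", "an", "the"}
HOMOPHONES = {"their": "there", "there": "their",
              "then": "than", "than": "then"}
CONTRACTION_TAIL = re.compile(r"('ve|'ll|'s|'m)\b")

def drop_first_article(sentence):
    """Subtract the first article (a, an, the), if any."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in ARTICLES:
            return " ".join(tokens[:i] + tokens[i + 1:])
    return sentence

def clip_contractions(sentence):
    """Subtract the second part of each verb contraction."""
    return CONTRACTION_TAIL.sub("", sentence)

def swap_homophones(sentence):
    """Replace common homophones with one of their counterparts."""
    return " ".join(HOMOPHONES.get(t, t) for t in sentence.split())

print(drop_first_article("Alex went to the kitchen to store the milk"))
# Alex went to kitchen to store the milk
```

Note that the contraction rule as written would also clip possessives ("Alex's"), and, as discussed below, dropping an article doesn't always yield an ungrammatical sentence; both are reasons auto-generated data needs filtering.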
"Perturbations" that cause grammatical sentences to become ungrammatical are
_very_ hard to create, for the absolutely practical reason that the only way
to know whether a sentence is ungrammatical is to check that a grammar rejects
it. And, for English (and natural languages generally) we have no such
(complete) grammars. In fact, that's the whole point of language modelling:
everyone's trying to "model" (i.e. approximate, i.e. guess at) the structure
of English (etc)... because nobody has a complete grammar of it!
Dropping a few bits off sentences may sound like a reasonable alternative (an
approximation of an ungrammaticalising perturbation) but, unfortunately, it's
really, really not that simple.
For instance, take the removal of articles: consider the sentence: "Give him
the flowers". Drop the "the". Now you have "Give him flowers". Which is
perfectly correct and entirely plausible, conversational, everyday English.
In fact, dropping words is de rigueur in language modelling, either to generate
skip-grams for training, or to clean up a corpus by removing "stop words"
(uninformative words like "the" and "and") or, generally, cruft.
For this reason you'll notice that the NUCLE corpus used in the CoNLL-2014
error correction task mentioned in the OP is not auto-generated, and instead consists of student
essays corrected by professors of English.
tl;dr: You can't rely on generating ungrammaticality unless you can decide grammaticality in the first place, and for natural languages, we can't.
Maybe calling the perturbed (and model-filtered!) sentence "ungrammatical" is too strong a statement, but that doesn't stop the system from being useful. We don't have a perfect model of NL grammar, but nor do we have complete dictionaries of any NL lexicon; we still rely on spell checkers because they have fairly acceptable false-positive rates. Spell checkers also do these perturbations: Levenshtein edits. Sometimes they generate a sequence the model (dictionary) will remove because it's already in there with some frequency, and they get a true negative; sometimes they generate a sequence that was simply missing from their data set, and they get a false positive; and sometimes they generate a sequence that is neither in their model nor in the language, and they get a true positive. The same general principle applies to grammar checkers (though those are, of course, much harder to build with an acceptable false-positive rate, and it's much more difficult to generate confusion sets).
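That taxonomy is concrete enough to write down. In this toy sketch, `dictionary` is the checker's word list and `language` is a hypothetical ground-truth oracle that no real checker has, which is exactly why false positives happen in practice:

```python
def classify(candidate, dictionary, language):
    """Classify one generated perturbation of a word, per the taxonomy
    above. `language` stands in for a ground-truth oracle."""
    if candidate in dictionary:
        return "true negative"   # already in the model: filtered out, correctly
    if candidate in language:
        return "false positive"  # real word the model's data set happened to miss
    return "true positive"       # not a word at all: a usable generated error

dictionary = {"then", "than", "the"}   # what the spell checker knows
language = dictionary | {"thane"}      # hypothetical complete lexicon

print(classify("than", dictionary, language))   # true negative
print(classify("thane", dictionary, language))  # false positive
print(classify("thn", dictionary, language))    # true positive
```

The grammar-checker analogue just swaps word membership for sentence grammaticality, with the added problem that the "dictionary" is itself only a statistical model.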
Generally, I doubt you should ever suggest any corrections to the use of
articles. This tends to be colloquial, or even personal and there's no point
trying to force one style on your users. Unless you like pissing them off (MS
Word sure does).
As to spell checkers, I think that goes the other way, doesn't it? You apply some edits to an out-of-vocabulary word you found, to see whether you can arrive at an in-vocabulary word to suggest as a correction.
In any case, I'm not disagreeing with anything you say. I'm just saying that "grammatical" and "ungrammatical" have strict definitions and you can't just throw them around like skittles.
Similarly, the machine may learn that "Give him the flowers" has higher probability than "give him flowers". Or it may learn that both are possible sentences and not be able to correct it. Which is also OK, we can't expect any system to be perfect.
This is completely unrelated to what I'm saying, of course. The fact of the
matter is that you can't just swap some words out of a sentence and call it
"ungrammatical". You can call it a "perturbed" sentence. You can call it a
"sentence with some words dropped". You can call it Daisy Duck for all I care.
But it's not an ungrammatical sentence, because that sort of thing has a very
precise definition, which I gave above.
The sentence is famous enough to have a Wikipedia page all its own:
Off-topic but I've been waiting for the third and final book in the trilogy for a long time ... I've come to the conclusion that Rothfuss can't find a way to tie all the plot threads together.
I'm also wondering if anyone else thinks Rothfuss looks like Longfellow in a lot of his publicity shots.
How about books?
Online forums are a disaster (HN is not too bad, but there are a lot of natives and non-natives fumbling some things).
The language in many of those books may be a bit archaic, though.
See also Distributed Proofreaders: http://www.pgdp.net/c/
I recommend http://opus.lingfil.uu.se/ as a starting point to find lots of sources of quality, multilingual data.
And if all you want is sheer quantity, rather at the expense of quality, then there's a several-terabyte Web crawl at http://commoncrawl.org/.
Didn't fix misuse of its: "The tool worked on it's own power"
"He should of gone yesterday" gets corrected to "the He should of gone yesterday"
"To who does this belong?" doesn't get corrected
"A Apple a day keeps the doctor away" doesn't change
"I'm going to store" gets corrected to "I'm going to the store" (looks good)
"I'm going to store food" gets corrected to "I'm going to the store store" (I'll forgive this due to lack of a prepositional phrase)
"I'm going to store food in the closet" gets corrected to "I'm going to the store food in the closet" (yeah, this is wrong)
It seems like the AI is good at fixing sentences where adding a missing "the" is the solution. So when that does turn out to be the solution, it works fine. When that isn't the solution, well, when all you have is a hammer, every problem looks like a nail...
Still really interesting, though. The AI may not [yet] be able to solve general grammatical errors, but it may be good enough that it works in specific cases. It may even be the case that we'll design systems that depend on multiple agents offering suggestions and others aggregating those suggestions into a more refined result.
Alex went to school, Alex went to bed, Alex went to prison
My list of things to try, in addition to what you've already done:
- replacing named entities with metadata-annotated tokens;
- dropping random words, not just articles;
- replacing random words with rarer synonyms;
- annotating with POS tags from some external parser;
- running a syntax corrector before feeding sentences into the grammatical model;
I think this problem is easier than it appears on the surface. The generated deformation does not have to be a perfect replica of typical human errors. It just has to be sufficiently diverse.
Also, I think seq2seq module is getting deprecated, as it doesn't do dynamic rollouts.
This is not a grammatically incorrect sentence; it depends on context. Products are taken to an abstract concept of 'market', for example.
Have you considered combining your approach with rules based system? Some systems using only an elaborate set of rules for common mistakes have had pretty good performance. I wonder if these two approaches could be combined.
Btw, what is the highest performing grammar checker you've found that's commonly available?
> Do you know where I been
> Do you know where I 's been
"I gotta take the shit"
Sorry to be airing out my personal business and everything but... everybody poops!
Also, isn't the test data linearly dependent on the training set here, creating a skewed performance measurement?