Deep Text Correcter (atpaino.com)
58 points by atpaino 1 hour ago | 22 comments





This is really cool. If you're looking for more datasets to train your model, here are a few relevant ones:

- https://archive.org/details/stackexchange
- http://trec.nist.gov/data/qamain.html
- http://opus.lingfil.uu.se/OpenSubtitles2016.php
- http://corpus.byu.edu/full-text/wikipedia.asp OR https://en.wikipedia.org/wiki/Wikipedia:Database_download#En...
- http://opus.lingfil.uu.se/

I'd love to see how good your model gets.

Thanks for the links! I'll have to try out some of these. The data is definitely the limiting factor at this point.

Interesting idea. I went ahead and tested:

> Alex went to the kitchen to store the milk in the fridge.

Corrected:

> Alex went to the kitchen to the store the milk in the fridge.

Gathering a large, high-quality dataset from the internet is probably not so easy. A lot of the content on HN/Reddit/forums is of low quality grammatically and often written by non-native English speakers (such as myself). Movie dialogues don't necessarily consist of grammatically correct sentences like the ones you'd write in a letter. Perhaps there is some public domain contemporary literature available that could be used instead of or alongside the dialogues?

EDIT: Unrelated to this project, I have this general fear of language recommendation tools trained on just low-quality comments or emails. A simple thesaurus and a grammar-checker are often enough to find the right words when writing. But a tool that could understand my intent and then propose restructured or similar sentences and words that convey the same meaning could be a true killer application.

> A lot of the content on HN/Reddit/forums is of low quality grammatically and often written by non-native English speakers (such as myself).

Corrected:

> A lot of the content on the HN/Reddit/forums is of the quality grammatically and non-native written by written English speakers (such myself). as UNK

Yeah. It's got a long way to go. No idea where "as UNK" came from.

UNK usually comes from the final sampling step: the distribution of words contains a special token UNK to represent all words with insufficient statistics in the training corpus.
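To make the UNK explanation concrete, here is a minimal sketch of how a vocabulary with an unknown-word token is often built. It is not taken from the project; the `min_count` threshold, the function names, and the toy corpus are all assumptions for illustration.

```python
from collections import Counter

UNK = "UNK"  # special token standing in for every rare or unseen word


def build_vocab(sentences, min_count=2):
    """Keep words seen at least min_count times; everything else maps to UNK."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    vocab = {word for word, n in counts.items() if n >= min_count}
    vocab.add(UNK)
    return vocab


def encode(sentence, vocab):
    """Replace out-of-vocabulary words with the UNK token."""
    return [word if word in vocab else UNK for word in sentence.split()]


# Toy corpus (hypothetical): "kitchen" and "kvothe" each appear only once.
corpus = [
    "alex went to the kitchen",
    "alex went to the market",
    "kvothe went to the market",
]
vocab = build_vocab(corpus, min_count=2)
print(encode("kvothe went to the fridge", vocab))
# ['UNK', 'went', 'to', 'the', 'UNK']
```

Since UNK is an ordinary entry in the output vocabulary, a decoder can emit it like any other word, which is one way a stray "UNK" can end up in a corrected sentence.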

Yep, it definitely has room to improve. The work thus far has primarily been a proof-of-concept for the methodology used to generate training samples (i.e. starting with grammatically correct text and introducing errors). The next step is to try to include more high-quality data, after which I may try out comment data from HN, etc. I think it would be interesting to see what effect somewhat noisier data like that would have on the model.
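For anyone curious what "starting with grammatically correct text and introducing errors" might look like, here is a rough sketch. It is not the project's actual code: the choice of dropping articles as the error type, the `drop_prob` parameter, and the function names are assumptions made for illustration.

```python
import random

ARTICLES = {"a", "an", "the"}  # assumed error type: randomly dropped articles


def add_noise(sentence, drop_prob=0.25, rng=random):
    """Return a corrupted copy of sentence with some articles removed."""
    kept = [
        word for word in sentence.split()
        if not (word.lower() in ARTICLES and rng.random() < drop_prob)
    ]
    return " ".join(kept)


def make_pairs(sentences, drop_prob=0.25, seed=0):
    """Build (noisy input, clean target) pairs for a correction model."""
    rng = random.Random(seed)  # seeded so the pairs are reproducible
    return [(add_noise(s, drop_prob, rng), s) for s in sentences]


clean = ["Alex went to the kitchen to store the milk in the fridge."]
for noisy, target in make_pairs(clean, drop_prob=0.5):
    print(noisy, "->", target)
```

The model is then trained to map the noisy sentence back to the clean one, so the quality of the clean text matters more than where it comes from.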

> A tool that could understand my intent and then propose structure or similar sentences and words

I'm working on this exact thing right now fwiw, in my app called Prompts. I imagine machine-assisted apps are going to percolate up over the coming years, or months, in a way we haven't seen yet. It's pretty exciting, if we can get it right. Written language seems like a decent place to start.

Dropping `in the fridge` works.

"Alex went to kitchen to store the milk" corrects to "Alex went to the kitchen to store the milk"

Tried out some of the classic garden path sentences [0]; of the 4 examples, it got all but one right:

Original: The complex houses married and single soldiers and their families.

Deep Text Corrector: The complex houses married and a single soldiers and their families.

OT: does anyone know of a more substantial list of garden path sentences that people use in testing NLP software?

[0] https://en.wikipedia.org/wiki/Garden_path_sentence

Re. intent, but with regard to spelling: I often wonder if there could be rules to correct errors due to keystrokes in the immediate vicinity of the intended letter (e.g. "keu" vs. "key"). It would check combinations of the surrounding letters, first oin (that's an unintended addition here) the horizontal axis. That happens to me all the time, probably because I'm not a good typist, but still.
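A rule like that could be sketched roughly as follows. This is only an illustration: it assumes a QWERTY layout, looks at horizontal neighbors only, handles single-letter substitutions but not extra letters, and the `WORDS` dictionary is a made-up toy.

```python
# Rows of a QWERTY keyboard; only horizontal neighbors are considered here.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each letter to the keys immediately left and right of it.
NEIGHBORS = {}
for row in ROWS:
    for i, ch in enumerate(row):
        NEIGHBORS[ch] = set(row[max(i - 1, 0):i] + row[i + 1:i + 2])


def candidates(word, dictionary):
    """Dictionary words reachable by swapping one letter for a horizontal neighbor."""
    found = set()
    for i, ch in enumerate(word):
        for repl in NEIGHBORS.get(ch, ()):
            guess = word[:i] + repl + word[i + 1:]
            if guess in dictionary:
                found.add(guess)
    return found


WORDS = {"key", "on", "the", "fridge"}  # toy dictionary, hypothetical
print(candidates("keu", WORDS))  # {'key'}
```

An accidental extra letter such as "oin" for "on" would need a separate deletion rule, since this sketch only swaps letters in place.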

Looks like a cool project, I would love to see this as a browser plugin of some sort. As for the corpus, I suspect that using articles from Wikipedia would be appropriate. Especially large articles are routinely checked and cleaned up. It has the added benefit of being available in multiple languages.

(https://en.wikipedia.org/wiki/Wikipedia:Database_download)

EDIT: I see this has already been suggested, along with a large number of other sources, in another comment by daveytea.

Deep Proofreading Tool Comparisons:

http://www.deepgrammar.com/evaluation

https://blogs.nvidia.com/blog/2016/03/04/deep-learning-fix-g...

> Unfortunately, I am not aware of any publicly available dataset of (mostly) grammatically correct English.

How about books?

Guess who's got that massive amount of data nowadays? Google, Facebook and co. So good luck if you want to work on AI stuff...

And Project Gutenberg: https://www.gutenberg.org/

The language in many of those books may be a bit archaic, though.

See also Distributed Proofreaders: http://www.pgdp.net/c/

There are a fair number of OCR mistakes in many of the Gutenberg texts as well. It would be interesting to try to correct them. They are, however, not the same types of errors that are addressed here.

Is the spelling of the name supposed to be ironic? ;)

Appears you're saying there's a typo in "Deep Text Correcter" - but there is not.

Well, I think the dictionary disagrees. http://www.dictionary.com/browse/corrector

I'm going to assume that it was meant to be ironic (or errorful).

Is the final word not supposed to be "corrector?"

https://en.m.wikipedia.org/wiki/Corrector

'Kvothe went to the market'

Off-topic but I've been waiting for the third and final book in the trilogy for a long time ... I've come to the conclusion that Rothfuss can't find a way to tie all the plot threads together.

I'm also wondering if anyone else thinks Rothfuss looks like Longfellow in a lot of his publicity shots.

This is terrific! Very coincidentally, we were thinking just this evening of implementing a sentence de-noiser using sequence-to-sequence models. I work in the NLP domain writing Machine Translation systems. But NLP parsers are accurate only for grammatically correct sentences, which creates the need for something like Deep Text Correcter. Thank you for this. Will try this out this week and let you know how it goes.

