

How we found a million style and grammar errors in the English Wikipedia - chmars
https://fosdem.org/2014/schedule/event/how_we_found_600000_grammar_errors/

======
taspeotis
The slug in the URL is how_we_found_600000_grammar_errors. Did they not get
enough attention at first? How many style and grammar errors did they really
find?

> So we manually looked at 200 of the errors, finding that 29 of the 200
> errors were real errors. Projected to the whole Wikipedia (currently at 4.3
> million articles), that's about 1.1 million real errors

Projections are fine, but maybe the title should be something like "How we
found (probably) a million style and grammar errors in the English Wikipedia".
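
A minimal sketch of the arithmetic behind that projection, for anyone who wants to check it. Only the 29, the 200 and the 1.1 million come from the quote. The total number of flagged matches is not given there, so the code merely backs out the figure those numbers imply. The Wilson interval is an added illustration of sampling uncertainty, not something the talk claims to have computed.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

REAL, SAMPLE = 29, 200        # from the quoted manual review
PROJECTED_REAL = 1_100_000    # "about 1.1 million real errors"

# Share of sampled matches that were real errors.
precision = REAL / SAMPLE

# Assumption: the projection is simply precision * total flagged matches,
# so this is the total that the quoted numbers imply, not a reported figure.
implied_flagged = PROJECTED_REAL / precision

low, high = wilson_interval(REAL, SAMPLE)
print(f"precision: {precision:.1%}")
print(f"implied flagged matches: {implied_flagged:,.0f}")
print(f"precision interval from 200 samples: {low:.1%} to {high:.1%}")
```

With a sample of only 200, the precision could plausibly sit anywhere from roughly 10% to 20%, so "about a million" carries an uncertainty of a few hundred thousand either way.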

~~~
sheetjs
> "How we found (probably) a million style and grammar errors in the English
> Wikipedia".

Would you have clicked if that were the title?

The sad part is that I suspect many people upvoted based on the title alone,
without reading the article.

------
anaphor
I hope these are not "errors" like splitting infinitives, using conjunctions
at the beginning of sentences, or putting prepositions at the end of
sentences. Because those aren't real errors in English grammar.

------
JazCE
To all the other commenters: this isn't about Wikipedia. It's about a tool
that copyeditors can use to help them enforce house style and grammar rules.

If you work for a magazine publisher, it's likely you have two lines, a web
line and a published line. The web line might not go through the
copyeditors/production/subeditors before it goes online, whereas the published
line will likely go through those people/departments. Adding this to the web
line can help enforce the rules of the publication so that web and magazine
both match up nicely.

------
ivan_ah
Here are the rules this system checks:
[https://github.com/languagetool-org/languagetool/tree/master...](https://github.com/languagetool-org/languagetool/tree/master/languagetool-core/src/main/java/org/languagetool/rules)

Imagine a pre-commit hook for writing-quality tests ;)

Here are some other scripts (much more basic) for "automated language tests":
[https://github.com/ivanistheone/writing_scripts](https://github.com/ivanistheone/writing_scripts)
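
A rough sketch of what such a pre-commit check might look like. It does not call LanguageTool or the scripts linked above. The single doubled-word rule, the function names and the `.md`/`.txt` filter are all hypothetical stand-ins for a real rule set. Only the `git diff --cached` invocation is a standard Git command.

```python
import re
import subprocess

# Hypothetical rule: the same word typed twice in a row ("the the").
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def find_repeated_words(text):
    """Return (line_number, word) for each doubled word within a line."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in REPEATED_WORD.finditer(line):
            hits.append((number, match.group(1)))
    return hits

def staged_text_files():
    """List added/copied/modified staged files that look like prose."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [p for p in out.splitlines() if p.endswith((".md", ".txt"))]

def main():
    """Return 1 (which would block the commit) if any file has a hit."""
    failed = False
    for path in staged_text_files():
        # Simplification: reads the working-tree copy, not the staged blob.
        with open(path, encoding="utf-8") as handle:
            for number, word in find_repeated_words(handle.read()):
                print(f"{path}:{number}: repeated word '{word}'")
                failed = True
    return 1 if failed else 0

# In an actual .git/hooks/pre-commit script, finish with:
#     import sys; sys.exit(main())
```

A rule this naive will also flag legitimate doubles such as "that that", which is the same false-positive problem at a much smaller scale.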

------
kevingadd
The false positive rate seems high enough to make this effectively useless. I
wonder if they have a strategy for reducing it?

------
afhof
Is this even news? Pick any Wikipedia article at random that has more than one
author. It isn't hard to find run-on sentences and verb tense mismatches.

------
LeeHunter
It would be more interesting to know how that compares, perhaps per hundred
words, with other bodies of text, professionally edited and not.

And, for an article on finding grammatical errors, there were a surprising
number of grammatical errors.

------
dwyer
How I found a million style and grammar errors in the English Wikipedia: by
browsing it.

------
hnriot
So what! Grammatical and style errors are only important when they affect the
meaning. The majority of pages I read that contain poor grammar are still
perfectly understandable. I correct many that I see, but you don't have to
look far to find more.

I'll take a Wikipedia that has these kinds of errors over a time before
Wikipedia. Having spent my boyhood going to the library and immersing myself
in encyclopedias and world maps, I am gobsmacked at how wonderful resources
like Wikipedia, Freebase and Google Maps are.

