Looks like a cool project. I would love to see this as a browser plugin of some sort. As for the corpus, I suspect that using articles from Wikipedia would be appropriate. Large articles in particular are routinely checked and cleaned up. Wikipedia also has the added benefit of being available in multiple languages.
(https://en.wikipedia.org/wiki/Wikipedia:Database_download)
EDIT: I see this has already been suggested, along with a large number of other sources, in another comment by daveytea.