
Ask HN: Best real day-to-day language data sets? - caio1982
Hi there. In the past I've run some NLP code against Cicero's works so I could practice some Latin. I am now looking for something more mundane that reflects real day-to-day language usage.

I am particularly interested in data sets for German, but I think I can use Folha de Sao Paulo's texts for Brazilian Portuguese (it's one of the biggest newspapers and media portals in Brazil and has a respected language manual).

Any suggestions for other languages? Scraping data sets would be OK!

PS1: I don't consider NLTK-level data sets really useful here.

PS2: For English I suppose the whole internet and NLP frameworks are enough already.

PS3: Data sets don't need to be tagged; in fact, anything resembling a corpus will suffice, as it would be used for X-as-second-language :-)
======
physicsyogi
Try the Europarl corpus
([http://www.statmt.org/europarl/](http://www.statmt.org/europarl/)) or the
TED corpus
([https://github.com/ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus](https://github.com/ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus)).

