
In theory you could do this with books? Sentences in a translation should still correspond to the original, and you could add some heuristics to identify paragraphs, quotation marks, etc. You might be wrong occasionally, but doing the extraction per chapter or per paragraph would mitigate it.



Only in theory. Sentences are hardly ever the same across languages, as books are not one-to-one translations and a sentence in one language might be three in another. Even the paragraphs might not match in Asian languages.


It depends on the language pairing. Take a look at Don Quixote in English and Spanish: they're very comparable at the sentence level (at least in this translation). Of course, it also depends on the artistic license taken by the translator. Maybe it's something you could do with a bit of human intervention to sync up the two texts.

http://www.cervantesvirtual.com/obra-visor/el-ingenioso-hida...

https://www.fulltextarchive.com/page/Don-Quixotex5770/
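
For what it's worth, here is a rough sketch of the heuristic approach discussed above, run on one paragraph at a time. It pairs sentences purely by character length, loosely in the spirit of length-based aligners such as Gale-Church but much cruder, and it allows one sentence to map to two or three on the other side to cover the "one sentence becomes three" case. Everything here is an assumption for illustration: the names `split_sentences`, `align`, and `BEADS`, the penalty values, and the naive punctuation-based splitter (which would not work for languages that don't end sentences with `.`, `!`, or `?` followed by a space).

```python
import re

def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ! or ?.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Allowed groupings (source count, target count) with an illustrative penalty
# that makes the aligner prefer plain 1-to-1 matches when lengths agree.
BEADS = {(1, 1): 0.0, (1, 2): 0.3, (2, 1): 0.3, (1, 3): 0.6, (3, 1): 0.6}

def align(src, tgt, ratio=1.0):
    """Align two lists of sentences by character length.

    ratio is the assumed target/source length ratio for the language pair.
    Returns a list of (source_group, target_group) pairs, or None if the
    paragraph cannot be covered by the allowed groupings.
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                if i + di > n or j + dj > m:
                    continue
                ls = sum(len(s) for s in src[i:i + di]) * ratio
                lt = sum(len(t) for t in tgt[j:j + dj])
                # Relative length mismatch, between 0 and 1.
                mismatch = abs(ls - lt) / max(ls, lt)
                c = cost[i][j] + mismatch + penalty
                if c < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = c
                    back[i + di][j + dj] = (i, j)
    if cost[n][m] == inf:
        return None  # flag this paragraph for a human to sync up
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

A paragraph where `align` returns `None`, or where the groupings look lopsided, is a natural place for the human intervention mentioned above. Restricting each run to a single paragraph or chapter keeps one bad match from throwing off everything after it.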



