
In fact, someone has run word2vec on the Voynich manuscript: http://blog.christianperone.com/2016/01/voynich-manuscript-w... (web archive while it's down: http://web.archive.org/web/20160205003817/http://blog.christ...) Such methods could someday completely decode the thing, but for now they only show the relationships between words and between clusters of words, not their meaning.
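To make "relationships between words, not their meaning" concrete, here is a minimal sketch. It is not word2vec itself but a simpler count-based stand-in: each word is represented by the words that appear near it, and two words count as related when those neighbourhoods overlap. The toy corpus, the window size and the function names are all made up for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

def cooccurrence_vectors(tokens, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy corpus (an assumption); a real run would use a transcription of the text.
tokens = "the cat sat on the mat the dog sat on the rug".split()
vecs = cooccurrence_vectors(tokens)

# "cat" and "dog" occur in similar contexts, so they score as more related
# than "cat" and "on", without the code knowing what any of the words mean.
print(cosine(vecs["cat"], vecs["dog"]), cosine(vecs["cat"], vecs["on"]))
```

Nothing here needs a dictionary, which is the point: the same code would run unchanged on tokens from an undeciphered text.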

Of course, we have no idea how the Voynich manuscript is encrypted (encryption would violate the assumptions of word2vec), or whether it has any meaning at all. It's also an incredibly small dataset compared to modern text corpora, so there is probably significant uncertainty and overfitting, on top of other problems like inconsistent spellings and many transcription errors. But in principle this is a good strategy.

>how would a person do if they were locked in a room with lots of books written in a language unknown to them?

If you spent all day reading them, for years, and you somehow didn't get bored and kept at it, eventually you would start to see the patterns. You would learn how "slithy toves" are related to "brillig", even if you have no idea how that would translate to English. Study it long enough, and you may even be able to produce text in that language, indistinguishable from the real text. You may be able to predict the next word in a sentence, and identify mistakes, etc. Perhaps carry out a conversation in that language.

And I think eventually you would understand what the words mean, by comparing the patterns to those found in English. Once you have guesses for translations of just a few words, you can translate the rest. Because you know the relationships between words, and so knowing one word constrains the possibilities of what the other words can be.

If the translation it produces is nonsense, the words you guessed must have been wrong, and you can try again with other words. Eventually you will find a translation that isn't nonsense, and there you go. This would be very difficult for humans, because the number of hypotheses to test is so large, and analyzing text takes forever. Computers can do it at lightspeed though.
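A rough sketch of that guess-and-check loop, under heavy assumptions: the "unknown language" is a handful of invented tokens, the English reference is a few toy sentences, and "isn't nonsense" is approximated by counting how many adjacent word pairs in the candidate translation also occur in the English reference. It brute-forces every assignment, which only works because the vocabulary is tiny.

```python
from itertools import permutations

# Toy English reference text (an assumption) and the word pairs it contains.
english = "dogs chase cats . cats chase mice . mice fear cats . cats fear dogs .".split()
known_bigrams = set(zip(english, english[1:]))

# Text in the invented "unknown" language; the tokens are made up.
foreign = "blik zor tann . tann zor mip . mip vel tann .".split()

def score(mapping, text):
    """Count adjacent pairs in the translated text that also occur in English."""
    translated = [mapping.get(w, w) for w in text]
    return sum(pair in known_bigrams for pair in zip(translated, translated[1:]))

foreign_vocab = sorted(set(foreign) - {"."})
english_vocab = sorted(set(english) - {"."})

# Try every assignment of English words to foreign words and keep the best one.
best_score, best_mapping = -1, None
for candidate in permutations(english_vocab, len(foreign_vocab)):
    mapping = dict(zip(foreign_vocab, candidate))
    s = score(mapping, foreign)
    if s > best_score:
        best_score, best_mapping = s, mapping

print(best_score, best_mapping)
```

On a corpus this small, several mappings can tie for the best score, so the winner is not guaranteed to be the intended one. With a real vocabulary the number of assignments explodes, so exhaustive search would have to give way to something smarter, such as fixing a few guessed words first and letting them constrain the rest.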

I'm familiar with this particular attack, as it was discussed here previously. It's a worthwhile attempt, but the identification of star names, if real, hasn't been confirmed. Your reservations are justified.

More generally, has any attempt been made to identify the meanings of words in any sufficiently large corpus of text in a known foreign language (for example, Finnish), without being provided with a translation into English, and then compare the identified meanings to the actual meanings, as a first step towards translation?

There was a paper where they trained word vectors for English and Chinese at the same time, but forced a few Chinese words to have the same vectors as their English translations. This gave accurate translations for many Chinese words that had no translation supplied.
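The idea of using a few known word pairs as anchors can be sketched without any neural network. This is not the paper's method (that trained both sets of vectors jointly); it is a simpler count-based stand-in. Each untranslated word is described by which anchor words appear to its left and right, and because anchors are shared between the two languages, the descriptions are directly comparable. The corpora, the invented foreign tokens and the seed pairs below are all assumptions for illustration.

```python
from collections import Counter
from math import sqrt

def context_vectors(sentences, seed_labels):
    """For each non-seed word, count the seed words found to its left and right.

    `seed_labels` maps a seed word to a label shared across both languages.
    """
    vectors = {}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            if word in seed_labels:
                continue
            vec = vectors.setdefault(word, Counter())
            for j, other in enumerate(sentence):
                if other in seed_labels:
                    side = "left" if j < i else "right"
                    vec[(seed_labels[other], side)] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy, non-parallel corpora (assumptions); the foreign tokens are invented.
english = [s.split() for s in ["cat drinks milk", "dog eats meat", "dog drinks milk"]]
foreign = [s.split() for s in ["bax tep ard", "miv sul ono", "bax tep ard"]]

# The only translations supplied up front: two seed pairs.
seeds = {"sul": "drinks", "tep": "eats"}

english_vecs = context_vectors(english, {e: e for e in seeds.values()})
foreign_vecs = context_vectors(foreign, seeds)

def translate(word):
    """Return the English word whose anchor context best matches `word`."""
    return max(english_vecs, key=lambda e: cosine(foreign_vecs[word], english_vecs[e]))

for w in sorted(foreign_vecs):
    print(w, "->", translate(w))
```

Two seed pairs are enough here only because the toy grammar is so regular. On real text the context vectors would be noisy, and many more anchors (or dense learned vectors) would be needed.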

Doing this without any translated words at all would be more difficult, but I believe it's possible. It's actually a project I want to try in the near future.
