I'd be interested in seeing more direct evidence, like SVD-factorizing the PMI matrix (which is similar to what word2vec is doing) and seeing how much of the variance is explained by the first components. If you want to do this, check out: https://minhlab.wordpress.com/2015/06/08/a-new-proof-for-the...
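For anyone who wants to try it, here is a rough sketch of that check in NumPy. Everything in it is my own assumption rather than something from the linked post: the toy corpus, the window size, the use of positive PMI (negative and undefined entries clipped to zero), and measuring "variance explained" as each squared singular value's share of the total, without centering the matrix.

```python
import numpy as np

def ppmi_matrix(tokens, window=2):
    """Build a positive-PMI word/context matrix from a token list."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    # Count co-occurrences inside a symmetric window around each token.
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[index[w], index[tokens[j]]] += 1
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    # PMI = log( p(w, c) / (p(w) p(c)) ); unseen pairs give -inf.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0
    return vocab, np.maximum(pmi, 0.0)

def explained_share(matrix):
    """Share of total squared singular values carried by each component."""
    s = np.linalg.svd(matrix, compute_uv=False)
    energy = s ** 2
    return energy / energy.sum()

# Toy corpus, purely illustrative; swap in the real token stream.
tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab, ppmi = ppmi_matrix(tokens)
share = explained_share(ppmi)
print(np.round(np.cumsum(share)[:3], 3))  # cumulative share of first 3 components
```

If the first handful of components carry most of the total, a low-dimensional picture is at least plausible; if the spectrum is flat, a 2D plot is throwing most of the structure away.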
From the guy who helped make t-SNE:
When I run t-SNE, I get a strange ‘ball’ with uniformly distributed points?
This usually indicates you set your perplexity way too high. All points now want to be equidistant. The result you got is the closest you can get to equidistant points as is possible in two dimensions. If lowering the perplexity doesn’t help, you might have run into the problem described in the next question. Similar effects may also occur when you use highly non-metric similarities as input.
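To make the "all points want to be equidistant" remark concrete, here is a small sketch of how perplexity relates to the Gaussian bandwidth in the input similarities. It is only an illustration under my own assumptions (random data, a single point, hand-picked sigma values). Real t-SNE works the other way around and searches for the sigma that hits the requested perplexity.

```python
import numpy as np

def conditional_p(sq_dists_i, sigma):
    """P(j|i) for one point, given its squared distances to the other points."""
    logits = -sq_dists_i / (2.0 * sigma ** 2)
    logits -= logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    nz = p[p > 0]
    return 2.0 ** (-(nz * np.log2(nz)).sum())

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 5))  # made-up data: 50 points in 5 dimensions
sq = ((points[0] - points[1:]) ** 2).sum(axis=1)  # point 0 to the other 49

for sigma in (0.5, 2.0, 50.0):
    p = conditional_p(sq, sigma)
    # p.max() * len(p) equals 1.0 when every neighbour is weighted equally.
    print(sigma, round(perplexity(p), 2), round(p.max() * len(p), 2))
```

As sigma grows, the perplexity approaches the number of other points (49 here) and the similarities flatten toward uniform, so every point is asked to sit about equally close to every other point. That cannot be done in two dimensions, which is consistent with the 'ball' described above.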
Well if that were the case it would be perfectly pointless to make such a visualization... The goal of dimensionality reduction is to provide a useful summarization of the data; it is a valid question to ask to what degree it is successful at that.
He believes he has translated about 10 words in the manuscript, which is huge, and he thinks the script may have been invented to express a language once spoken between the near east and the Himalayas, maybe Turkic or Caucasian...
There are some contemporary alchemical manuscripts which are written at least partly in code, for example, but the illustrations strongly suggest that Voynich is pure fantasy.
Even if the overall goal was some kind of hoax, the text itself almost certainly carries semantic meaning, and that in itself is fascinating due to its apparent indecipherability.
Still, if text produced with the Cardan grille didn't have such statistics, that would be proof for me that the "language" is real. That's what I hoped to find. If the Cardan grille text passed the statistics too, I'd be even more inclined to believe the text is a kind of hoax.
The game isn't "let's find which statistics Voynich passes" but "let's disprove the hypothesis that it is a hoax" by finding a test which all claimed hoax schemes obviously fail.
3. Why do you give hoax higher prior probability?
EDIT: I was wrong about the possible distributions so there's no point 2.
3. Because nothing else "fits" with anything else we know of. When we find the first text in an unknown language, typically it's some religious thing. The illustrations are at best "dream" constructs, when not an intentional attempt to make an "original" hoax. Otherwise, it's obvious that it's not something out of some non-western culture.
And (I think this one is really fascinating) some researchers at the Louvre a while back even used spectroscopic analysis on a painting by Murillo to determine that the obsidian he painted on began its life in a 14th century Aztec obsidian mine!
[(u'princess', 0.519856333732605), (u'latifah', 0.47644317150115967),