
Voynich Manuscript: word vectors and t-SNE visualization of some patterns - perone
http://blog.christianperone.com/2016/01/voynich-manuscript-word-vectors-and-t-sne-visualization-of-some-patterns/?a=b
======
juxtaposicion
I implemented the new t-SNE in sklearn, so I've got some experience in reading
these diagrams. Unfortunately, as wonderful as the algorithm is, it's
extremely hard to interpret what it means rigorously. I've seen many diagrams
that look like this one -- and they were generated from actual noise. So take
the plots with a big grain of salt :)

I'd be interested in seeing more direct evidence, like SVD factorizing the PMI
matrix (which is similar to what word2vec is doing) and seeing how much of the
variance is explained by the first components. If you want to do this, check
out:
[https://minhlab.wordpress.com/2015/06/08/a-new-proof-for-the...](https://minhlab.wordpress.com/2015/06/08/a-new-proof-for-the-equivalence-of-word2vec-skip-gram-and-shifted-ppmi/)
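
A minimal sketch of that check, assuming NumPy is available. The corpus is a made-up placeholder (not manuscript data), the function names are hypothetical, and it uses plain positive PMI with no shift, so it only approximates what the linked post describes. "Explained variance" here means the share of squared singular values, without centering the matrix.

```python
import numpy as np

def ppmi_matrix(sentences, window=2):
    """Build a positive PMI matrix from tokenized sentences."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    # Count co-occurrences inside a symmetric context window.
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # pairs that never co-occur
    return vocab, np.maximum(pmi, 0.0)

def explained_variance(matrix):
    """Share of squared singular values carried by each component."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    energy = singular_values ** 2
    return energy / energy.sum()

# Placeholder corpus with two obvious clusters (x* and y* tokens).
toy_corpus = [["x1", "x2", "x1", "x2"], ["y1", "y2", "y1", "y2"]]
vocab, ppmi = ppmi_matrix(toy_corpus)
ratios = explained_variance(ppmi)
print(vocab)
print(ratios)
```

If a handful of components carry most of the mass, that is at least weak evidence of structure; a flat spectrum would look more like noise.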

~~~
bearzoo
I once read that high perplexity can generate embeddings that are very tightly
bound on a unit circle... not sure if this is what is going on

~~~
Houshalter
The diagram shown is only a visualization. The actual word vectors have many
dimensions. To reduce them to 2 dimensions, they use a method which tries to
keep similar vectors as close to each other as possible while keeping
dissimilar ones apart. This creates the shape seen on the scatter plot. Just
looking at the scatter plot by itself doesn't tell you anything about the
underlying data.

~~~
bearzoo
I did not claim anything about the underlying data. The 2-dimensional
embeddings were forced into the unit circle because the 'perplexity'
hyperparameter for t-SNE was set too high.

From the guy who helped make t-SNE:

When I run t-SNE, I get a strange ‘ball’ with uniformly distributed points?

This usually indicates you set your perplexity way too high. All points now
want to be equidistant. The result you got is the closest you can get to
equidistant points as is possible in two dimensions. If lowering the
perplexity doesn’t help, you might have run into the problem described in the
next question. Similar effects may also occur when you use highly non-metric
similarities as input.
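
A rough sketch of why that happens, using only the standard library. In t-SNE the perplexity of a point's Gaussian neighbour distribution is 2 raised to its entropy, and a larger perplexity corresponds to a wider kernel. The distances and sigma values below are made up for illustration, and the function names are hypothetical.

```python
import math

def conditional_probabilities(distances_sq, sigma):
    """Gaussian neighbour probabilities for one point, given squared distances."""
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in distances_sq]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probabilities):
    """2 ** Shannon entropy (in bits) of the distribution."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy

# Squared distances from one point to five neighbours (made-up values).
distances_sq = [1.0, 2.0, 4.0, 8.0, 16.0]

for sigma in (0.5, 2.0, 50.0):
    p = conditional_probabilities(distances_sq, sigma)
    print(sigma, round(perplexity(p), 3), [round(x, 3) for x in p])
```

As sigma grows, the probabilities flatten toward uniform and the perplexity approaches the number of neighbours, which is the "all points want to be equidistant" regime the quote describes.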

------
vonnik
I think this approach has a lot of potential, and I wonder what a statistical
comparison of character co-occurrences between the Voynich manuscript and
other writing systems would reveal. For anyone curious, here is Stephen Bax's
video on his 2014 findings.

[https://m.youtube.com/watch?index=1&v=fpZD_3D8_WQ&list=LLATc...](https://m.youtube.com/watch?index=1&v=fpZD_3D8_WQ&list=LLATcCtXq6Eg7iFjmWQ1CNkA)

He believes he has translated about 10 words in the manuscript, which is huge,
and he thinks the script may have been invented to express a language once
spoken between the near east and the Himalayas, maybe Turkic or Caucasian...
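
The character co-occurrence comparison suggested at the start of this comment could be sketched roughly as follows. This is only one possible choice (within-word character bigrams compared by Jensen-Shannon divergence), the function names are hypothetical, and the sample strings are placeholders rather than real transcriptions.

```python
import math
from collections import Counter

def bigram_distribution(text):
    """Relative frequencies of adjacent character pairs within words."""
    counts = Counter()
    for word in text.split():
        counts.update(zip(word, word[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def jensen_shannon_divergence(p, q):
    """JS divergence in bits: 0 for identical distributions, at most 1."""
    js = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            js += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            js += 0.5 * qk * math.log2(qk / mk)
    return js

# Placeholder samples; a real comparison would load full transcriptions.
sample_a = "the cat sat on the mat"
sample_b = "that hat sat at the back"
dist_a = bigram_distribution(sample_a)
dist_b = bigram_distribution(sample_b)
print(jensen_shannon_divergence(dist_a, dist_b))
```

Lower values mean the two samples use character pairs in more similar proportions; on its own this says nothing about meaning.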

~~~
benbreen
It blows me away how many conflicting but semi-convincing theories there are
about this manuscript (I read one fairly recently that argued for a New World
origin of many of the plants, for instance). Which gets me thinking, has
anyone analyzed the actual paper it's made out of to get any clues about where
it originated? I know that spectroscopy can sometimes be used to make a guess
at geographic origins of biological material but I haven't heard anything
about it being used on the Voynich MS.

~~~
TillE
I've seen nothing that persuades me even a little bit away from the most
common theory, which is that it's basically a hoax. It's a pseudo-occult book
produced by some creative individual for fun and/or profit.

There are some contemporary alchemical manuscripts which are written at least
partly in code, for example, but the illustrations strongly suggest that
Voynich is pure fantasy.

~~~
throwaway2048
It obeys statistical models of natural language that "fake text" would not at
all. This sort of statistical modelling of language would almost certainly be
unknown at the time it was produced. I find the hoax theory to be pretty
unsatisfactory.

Even if the overall goal was some kind of hoax, the text itself almost
certainly carries semantic meaning, and that in itself is fascinating due to
its apparent indecipherability.

~~~
acqq
Any links to the "statistical" proof that the text is not fake? I've seen
just the opposite, that the statistics are really too weird, compared to
anything we know. Which, to me, is a bit too much.

~~~
ajuc
[http://journals.plos.org/plosone/article?id=10.1371/journal....](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0066344)

~~~
acqq
Thanks.

Still, if text produced with a Cardan grille did not show such statistics,
that would be proof for me that the "language" is real. That's what I hoped
to find. If Cardan-grille-generated text passed the statistics too, I'd still
lean toward the text being a kind of hoax.

The game isn't "let's find which statistics Voynich passes" but "let's
disprove the hypothesis that it is a hoax" by finding a test which all claimed
hoax schemes obviously fail.

~~~
ajuc
1\. The Cardan grille was invented in 1550. The Voynich manuscript is almost
100 years older by carbon dating.

3\. Why do you give hoax higher prior probability?

EDIT: I was wrong about the possible distributions so there's no point 2.

~~~
acqq
1\. Technically, the vellum seems to be older; we don't know when the
manuscript was written.

3\. Because nothing else "fits" with anything else we know of. When we find
the first text in an unknown language, typically it's some religious thing.
The illustrations are at best "dream" constructs, when not an intentional
attempt to make an "original" hoax. Otherwise, it's obvious that it's not
something out of some non-western culture.

------
danharaj
You know, I really like this, because it's an example of the kind of structure
machine learning finds without my own understanding of the training set
clouding my understanding of the machine's understanding.

------
haddr
Very interesting approach, but I would say this just scratches the surface.
There are several factors that might really limit statistical analysis of
this manuscript [1].

[1] [http://www.ciphermysteries.com/2013/03/09/this-week-a-talk-a...](http://www.ciphermysteries.com/2013/03/09/this-week-a-talk-at-stanford-on-the-voynich-manuscript)

~~~
Houshalter
That blog post is assuming that the manuscript is a cipher. If so then it is
unlikely that statistical tools will help much. But I don't think that's been
proven. Many seem to believe that it's a real lost language.

------
lawpoop
>>> model.most_similar("queen")

[(u'princess', 0.519856333732605), (u'latifah', 0.47644317150115967),

------
splitbrain
This is the first time I've heard about the cultural extinction theory. If
that were the case, shouldn't there be more documents using the same script?
But assuming the theory is right, is there any way to decipher it without
finding a Rosetta stone?

~~~
alexwebb2
If there were more documents with the same script, then it would have been
decoded a long time ago, and you probably would never have even heard about
it.

------
acqq
Do we get any new insight with this?

~~~
Houshalter
The author didn't give any novel insights about the text itself. It's just a
proof of concept. If someone could translate a few words, it could give strong
hints as to what the translation of the other words should be.
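
A toy sketch of how such hints might be looked up: rank the other words by cosine similarity to a word whose meaning is assumed known. The vectors and token names are invented for illustration and the helper names are hypothetical; with a trained model the vectors would come from the model itself.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(vectors, query, top_n=2):
    """Return the top_n words most similar to `query`, best first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]

# Invented 3-dimensional vectors for four placeholder tokens.
toy_vectors = {
    "token_a": [0.9, 0.1, 0.0],
    "token_b": [0.8, 0.2, 0.1],
    "token_c": [0.0, 0.9, 0.4],
    "token_d": [0.1, 0.8, 0.5],
}

# Suppose "token_a" had a known translation; its neighbours are candidates
# for related meanings.
neighbours = nearest_neighbours(toy_vectors, "token_a")
print(neighbours)
```

Neighbours in this sense share contexts, not necessarily meanings, so any candidate would still need independent confirmation.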

