Voynich Manuscript: word vectors and t-SNE visualization of some patterns (christianperone.com)
131 points by perone on Jan 18, 2016 | 29 comments

I implemented the new t-SNE in sklearn, so I've got some experience in reading these diagrams. Unfortunately, as wonderful as the algorithm is, it's extremely hard to interpret what it means rigorously. I've seen many diagrams that look like this one -- and they were generated from actual noise. So take the plots with a big grain of salt :)

I'd be interested in seeing more direct evidence, like factorizing the PMI matrix with SVD (which is similar to what word2vec is doing) and seeing how much of the variance is explained by the first components. If you want to do this, check out: https://minhlab.wordpress.com/2015/06/08/a-new-proof-for-the...
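A minimal sketch of that check, in plain Python. The toy sentences and the window size are assumptions for illustration, not the article's data. It builds a positive PMI matrix and uses power iteration to estimate the share of the squared Frobenius norm carried by the first singular component; a full SVD (for example `numpy.linalg.svd`) would give the whole spectrum instead of just the top value.

```python
import math
from collections import Counter

def ppmi_matrix(sentences, window=2):
    """Return (vocab, matrix) of positive PMI values from tokenized sentences."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(w, tokens[j])] += 1
    vocab = sorted({w for w, _ in pair_counts})
    index = {w: k for k, w in enumerate(vocab)}
    total = sum(pair_counts.values())
    row_sums = Counter()
    for (w, _), count in pair_counts.items():
        row_sums[w] += count
    size = len(vocab)
    matrix = [[0.0] * size for _ in range(size)]
    for (w, c), count in pair_counts.items():
        # PMI = log(p(w, c) / (p(w) p(c))). The window is symmetric, so the
        # row sums double as column sums.
        pmi = math.log(count * total / (row_sums[w] * row_sums[c]))
        matrix[index[w]][index[c]] = max(pmi, 0.0)
    return vocab, matrix

def top_component_share(matrix, iterations=200):
    """Estimate sigma_1^2 / sum_k sigma_k^2 by power iteration on M^T M."""
    n = len(matrix)
    frob_sq = sum(x * x for row in matrix for x in row)
    if frob_sq == 0.0:
        return 0.0
    v = [1.0 / math.sqrt(n)] * n  # all-positive unit start vector
    sigma_sq = 0.0
    for _ in range(iterations):
        mv = [sum(row[k] * v[k] for k in range(n)) for row in matrix]  # M v
        mtmv = [sum(matrix[r][k] * mv[r] for r in range(n)) for k in range(n)]  # M^T (M v)
        norm = math.sqrt(sum(x * x for x in mtmv))
        if norm == 0.0:
            return 0.0
        sigma_sq = sum(x * x for x in mv)  # ||M v||^2 with ||v|| = 1
        v = [x / norm for x in mtmv]
    return sigma_sq / frob_sq

sentences = [["a", "b", "a", "b"], ["c", "d", "c", "d"]]  # placeholder tokens
vocab, matrix = ppmi_matrix(sentences)
share = top_component_share(matrix)
print(f"{len(vocab)} words, first component carries {share:.1%} of the squared norm")
```

This is only a rough proxy. Comparing the share against the same statistic computed on a shuffled version of the text would give a more meaningful baseline.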

Hi, thanks for the feedback. There are three main points that make me believe that the clusters aren't artificial. First, I made the clusters with DBSCAN on the original data (100-d word vectors) and not after the t-SNE embedding. Second, I manually inspected some clusters and they make sense when compared to the similarity queries on the word2vec model (take a look at the star names, for instance). Third, I took the folios from the two main clusters (red/blue) (https://www.reddit.com/r/MachineLearning/comments/419e5a/voy...) and they seem to match the folios from the two-languages hypothesis.
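To illustrate the first point, here is a minimal pure-Python DBSCAN sketch run directly on the high-dimensional vectors, so the cluster labels never depend on the 2-D layout. The `eps` and `min_pts` values and the toy 5-d vectors are assumptions for illustration; the settings actually used are not stated in this thread. In practice `sklearn.cluster.DBSCAN` would be the usual choice.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Two tight groups and one far-away point, in 5 dimensions (made-up data).
group_a = [[0.0, 0.0, 0.0, 0.0, 0.1 * k] for k in range(4)]
group_b = [[5.0, 5.0, 5.0, 5.0, 5.0 + 0.1 * k] for k in range(4)]
outlier = [[20.0, 0.0, 0.0, 0.0, 0.0]]
labels = dbscan(group_a + group_b + outlier, eps=0.25, min_pts=3)
print(labels)
```

The labels computed this way can then be used only to color the points of a 2-D plot, which keeps the clustering and the visualization independent.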

I had the same experience. Whenever I use t-SNE with noisy data or data grouped densely around some point, the result tends to have a visual structure very similar to the example in this experiment: a circular shape with some points forming small "clusters" that carry no significant meaning.

I once read that high perplexity can generate embeddings that are very tightly bound to a unit circle... not sure if this is what is going on.

The diagram shown is only a visualization. The actual word vectors have many dimensions. To reduce them to 2 dimensions, they use a method which tries to keep similar vectors as close to each other as possible while keeping dissimilar words apart. This creates the shape seen on the scatter plot. Just looking at the scatter plot by itself doesn't tell you anything about the underlying data.

I did not claim anything about the underlying data. The 2-dimensional embeddings were forced into the unit circle because the 'perplexity' hyperparameter for t-SNE was set too high.

From the guy who helped make t-SNE:

When I run t-SNE, I get a strange ‘ball’ with uniformly distributed points?

This usually indicates you set your perplexity way too high. All points now want to be equidistant. The result you got is the closest you can get to equidistant points as is possible in two dimensions. If lowering the perplexity doesn’t help, you might have run into the problem described in the next question. Similar effects may also occur when you use highly non-metric similarities as input.
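For context, t-SNE's perplexity is 2 raised to the Shannon entropy of each point's conditional neighbor distribution. The sketch below is plain Python and uses a fixed Gaussian bandwidth `sigma` rather than the per-point search a real implementation performs to hit a target perplexity. It shows how a wide bandwidth pushes that distribution toward uniform, which is the "all points want to be equidistant" regime described above. The sample points and bandwidths are made up for illustration.

```python
import math

def conditional_probs(points, i, sigma):
    """p(j|i) proportional to exp(-||x_i - x_j||^2 / (2 sigma^2)), with p(i|i) = 0."""
    weights = [
        0.0 if j == i else math.exp(-math.dist(points[i], q) ** 2 / (2 * sigma ** 2))
        for j, q in enumerate(points)
    ]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 ** H(P), where H is the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0], [9.0, 0.0]]
narrow = perplexity(conditional_probs(points, 0, sigma=0.5))
wide = perplexity(conditional_probs(points, 0, sigma=1000.0))
print(f"narrow bandwidth: {narrow:.3f}, wide bandwidth: {wide:.3f}")
```

With the narrow bandwidth the first point effectively sees only its two nearest neighbors, so the perplexity comes out close to 2. With the very wide one it treats all five other points almost equally, so the perplexity approaches 5 and the local structure is lost.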

> Just looking at the scatter plot by itself doesn't tell you anything about the underlying data.

Well if that were the case it would be perfectly pointless to make such a visualization... The goal of dimensionality reduction is to provide a useful summarization of the data; it is a valid question to ask to what degree it is successful at that.

I think this approach has a lot of potential, and I wonder what a statistical comparison of character co-occurrences between the Voynich manuscript and other writing systems would reveal. For anyone curious, here is Stephen Bax's video on his 2014 findings.


He believes he has translated about 10 words in the manuscript, which is huge, and he thinks the script may have been invented to express a language once spoken between the near east and the Himalayas, maybe Turkic or Caucasian...
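One script-independent statistic that could feed the kind of character co-occurrence comparison mentioned above is the conditional entropy of a character given the previous one, since it does not require the two alphabets to match. A minimal sketch follows; the sample strings are placeholders, not transcriptions of any real text.

```python
import math
from collections import Counter

def conditional_entropy(text):
    """H(next char | current char) in bits, over adjacent pairs within words."""
    pairs = Counter()
    for word in text.split():
        pairs.update(zip(word, word[1:]))  # pairs never span a space
    firsts = Counter()
    for (a, _), n in pairs.items():
        firsts[a] += n
    total = sum(pairs.values())
    # H(Y|X) = -sum over (x, y) of p(x, y) * log2 p(y|x)
    return -sum((n / total) * math.log2(n / firsts[a]) for (a, _), n in pairs.items())

samples = {
    "fully predictable": "abab abab abab",
    "one branching point": "ab ac",
}
for name, text in samples.items():
    print(f"{name}: {conditional_entropy(text):.3f} bits")
```

Running the same function over transliterations of several writing systems would give one number per text that can be compared directly, though it says nothing about which specific character pairs differ.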

It blows me away how many conflicting but semi-convincing theories there are about this manuscript (I read one fairly recently that argued for a New World origin of many of the plants, for instance). Which gets me thinking, has anyone analyzed the actual paper it's made out of to get any clues about where it originated? I know that spectroscopy can sometimes be used to make a guess at geographic origins of biological material but I haven't heard anything about it being used on the Voynich MS.

I've seen nothing that persuades me even a little bit away from the most common theory, which is that it's basically a hoax. It's a pseudo-occult book produced by some creative individual for fun and/or profit.

There are some contemporary alchemical manuscripts which are written at least partly in code, for example, but the illustrations strongly suggest that Voynich is pure fantasy.

It obeys statistical models of natural language that "fake text" would not obey at all. This sort of statistical modelling of language would almost certainly have been unknown at the time it was produced. I find the hoax theory to be pretty unsatisfactory.

Even if the overall goal was some kind of hoax, the text itself almost certainly carries semantic meaning, and that in itself is fascinating due to its apparent indecipherability.

Any links to the "statistical" proof that the text is not fake? I've seen just the opposite, that the statistics are really too weird compared to anything we know. Which, to me, is a bit too much.


Still, if text produced with a Cardan grille didn't have such statistics, that would be proof for me that the "language" is real. That's what I hoped to find. If Cardan-grille-generated text passed the statistics too, I'd still be more inclined to believe the text is a kind of hoax.

The game isn't "let's find which statistics Voynich passes" but "let's disprove the hypothesis that it is a hoax" by finding a test which all claimed hoax schemes obviously fail.

1. The Cardan grille was invented in 1550. The Voynich manuscript is almost 100 years older by carbon dating.

3. Why do you give hoax higher prior probability?

EDIT: I was wrong about the possible distributions so there's no point 2.

1. Technically, the vellum seems to be older; we don't know when the manuscript was written.

3. Because nothing else "fits" with anything else we know of. When we find the first text in an unknown language, typically it's some religious thing. The illustrations are at best "dream" constructs, when not an intentional attempt to make an "original" hoax. Otherwise, it's obvious that it's not something out of some non-western culture.

Yes, it's definitely a real language that has been written down, and not one that we know of. So the only question is whether it's a made-up language or a real language that has been lost. A made-up language is possible, but seems less likely. There are thousands of real languages that have been lost to the world.

The vellum was carbon dated not too long ago, indicating that the book dated back to the 15th century: http://phys.org/news/2011-02-experts-age.html

I'd seen that, but I'm wondering if there's any way to use spectroscopy or some other technique to determine roughly what part of the world the vellum came from. It's been shown to be possible with wine:


And (I think this one is really fascinating) some researchers at the Louvre a while back even used spectroscopic analysis on a painting by Murillo to determine that the obsidian he painted on began its life in a 14th century Aztec obsidian mine!


Thanks for the feedback. I also believe that this approach has a lot of potential, especially after the work of Stephen Bax: a few word translations can help us figure out transformations that could allow translation in vector space.
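One way such a transformation could be estimated is to fit a linear map between two embedding spaces from a small seed dictionary of known word pairs, then apply it to untranslated words. The sketch below does this by least squares for toy 2-d vectors in plain Python. All vectors here are invented for illustration; real word vectors would be around 100-d and something like `numpy.linalg.lstsq` would replace the hand-written 2x2 solve.

```python
def fit_linear_map(sources, targets):
    """Least-squares 2x2 matrix W minimizing sum ||W x - z||^2 over 2-d pairs."""
    # Accumulate A = sum x x^T and B = sum z x^T; the minimizer is W = B A^{-1}.
    a = [[0.0, 0.0], [0.0, 0.0]]
    b = [[0.0, 0.0], [0.0, 0.0]]
    for x, z in zip(sources, targets):
        for r in range(2):
            for c in range(2):
                a[r][c] += x[r] * x[c]
                b[r][c] += z[r] * x[c]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-12:
        raise ValueError("need at least two linearly independent source vectors")
    inv = [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]
    return [[sum(b[r][k] * inv[k][c] for k in range(2)) for c in range(2)] for r in range(2)]

def apply_map(w, x):
    """Map a 2-d source vector into the target space."""
    return [w[0][0] * x[0] + w[0][1] * x[1], w[1][0] * x[0] + w[1][1] * x[1]]

# Hypothetical seed dictionary: source-space vectors and their known
# target-space counterparts (here the hidden relation is a 90-degree rotation).
seed_sources = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seed_targets = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
w = fit_linear_map(seed_sources, seed_targets)
predicted = apply_map(w, [2.0, 3.0])  # an "untranslated" source vector
print(predicted)
```

The predicted vector would then be matched to its nearest neighbors in the target space to propose candidate translations. Whether a linear map is adequate for the Voynich vectors is an open assumption.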

You know, I really like this, because it's an example of the kind of structure machine learning finds without my own understanding of the training set clouding my understanding of the machine's understanding.

Very interesting approach, but I would say this just scratches the surface. There are several factors that might really limit statistical analysis of this manuscript [1].

[1] http://www.ciphermysteries.com/2013/03/09/this-week-a-talk-a...

That blog post is assuming that the manuscript is a cipher. If so then it is unlikely that statistical tools will help much. But I don't think that's been proven. Many seem to believe that it's a real lost language.

Well, one factor is the sample size is very small.

>>> model.most_similar("queen")

[(u'princess', 0.519856333732605), (u'latifah', 0.47644317150115967),

This is the first time I've heard of the cultural extinction theory. If that were the case, shouldn't there be more documents using the same script? But assuming the theory is right, is there any way to decipher it without finding a Rosetta stone?

If there were more documents with the same script, then it would have been decoded a long time ago, and you probably would never have even heard about it.

Do we get any new insight with this?

The author didn't give any novel insights about the text itself. It's just a proof of concept. If someone could translate a few words, it could give strong hints as to what the translations of the other words should be.
