
Language, trees, and geometry in neural networks - 1wheel
https://pair-code.github.io/interpretability/bert-tree/
======
ttctciyf
This looks to be an interesting piece on a very interesting paper! Somewhat
tangentially (I'm afraid) I just wanted to comment on this para from the
article's intro:

> Language is made of discrete structures, yet neural networks operate on
> continuous data: vectors in high-dimensional space. A successful language-
> processing network must translate this symbolic information into some kind
> of geometric representation

I was a bit surprised by another article linked here recently[1] that
discusses "direct speech-to-speech translation without relying on intermediate
text representation" which (if I read it correctly) works by taking frequency-
domain representations of speech as input and producing frequency-domain
representations of translated speech as output. This is about as near as you
get to "continuous" input and output data in the digital domain, and it calls
into question (in my mind, anyhow) the assumption that discrete structures are
fundamental to language processing (in humans too, for that matter).
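
To make the "continuous" point concrete, here's a minimal numpy sketch of the
kind of frequency-domain representation involved; the synthetic waveform,
sample rate, and window sizes are illustrative stand-ins, not Translatotron's
actual pipeline:

    import numpy as np

    # A 1-second synthetic "speech" waveform: a chirp plus noise, 16 kHz.
    sr = 16000
    t = np.arange(sr) / sr
    wave = np.sin(2 * np.pi * (200 + 300 * t) * t) + 0.01 * np.random.randn(sr)

    # Short-time Fourier transform: the signal becomes a dense grid of
    # real-valued magnitudes over (time, frequency) -- continuous-valued
    # data, with no discrete symbols anywhere in the representation.
    frame, hop = 400, 160                      # 25 ms windows, 10 ms hop
    window = np.hanning(frame)
    frames = np.stack([wave[i:i + frame] * window
                       for i in range(0, len(wave) - frame, hop)])
    spectrogram = np.abs(np.fft.rfft(frames, axis=1))

    print(spectrogram.shape)    # (num_frames, num_freq_bins) = (98, 201)

A model that maps one such array to another never has to commit to tokens,
phonemes, or any other discrete inventory along the way.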

I don't mean to detract from the paper, which looks highly interesting; it's
just that this assumption of given discrete structures in language has been a
bugbear of mine for some time now :)

1: https://ai.googleblog.com/2019/05/introducing-translatotron-end-to-end.html

------
jf-
I’m impressed by the method of mapping higher-dimensional vectors to a
consistent tree representation, but I’m not sure what the take-home point is
after that. That the BERT embeddings are (possibly randomly) branching
structures? I’m only eyeballing figure 5 here, but the BERT embeddings seem to
approximate the dependency parse tree only to the same extent that the random
trees do.
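
One way to firm up that eyeballing would be to correlate pairwise squared
embedding distances with gold tree distances over all word pairs; a minimal
numpy sketch, where `emb` and `tree_dist` are random stand-ins for the actual
BERT vectors and parse distances:

    import numpy as np

    rng = np.random.default_rng(0)
    n, dim = 10, 64

    # Stand-ins: `emb` would be the BERT vectors for one sentence,
    # `tree_dist` the pairwise path lengths in its gold dependency parse.
    emb = rng.normal(size=(n, dim))
    tree_dist = np.triu(rng.integers(1, 8, size=(n, n)), 1)
    tree_dist = tree_dist + tree_dist.T        # symmetric, zero diagonal

    # Pairwise squared Euclidean distances between embeddings.
    sq = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)

    # Correlate the upper triangles: the higher the correlation, the more
    # closely the embedding geometry tracks the parse tree.
    iu = np.triu_indices(n, 1)
    r = np.corrcoef(sq[iu], tree_dist[iu])[0, 1]
    print(f"r = {r:.3f}")    # near 0 for these random stand-ins

Run per sentence against a random-tree baseline (as in the figure), this would
say whether BERT beats the random control or merely matches it.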

~~~
pacala
Figure 5(c) illustrates the shape of the projections for a random branching
embedding of the correct tree structure. This roughly matches the ideal
Pythagorean embedding, and also the BERT embedding. Keep in mind that BERT
only sees word sequences, with no explicit notion of a tree structure. In
theory, there are O(N^3) possible parse trees, which are not completely
arbitrary graphs, but rather have a context-free structure. Thus figure
5(d) is too weak: its embeddings are picked completely randomly, with no
tree-based constructive process. I wish there were a figure 5(e) showing a
random branching embedding of a random parse tree, to give a sense of how much
randomly embedding the right parse tree vs. randomly embedding some random
parse tree influences the final result. The hard problem in parsing is finding
the right tree...
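
For concreteness, here is a minimal sketch of what a random branching
embedding of a known tree could look like, assuming the paper's Pythagorean
criterion (squared Euclidean distance tracking tree path distance); the toy
tree and dimension are made up, and this is my reading of the construction,
not the authors' code:

    import numpy as np

    # Hypothetical toy parse tree as {child: parent}; node 0 is the root.
    parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
    dim = 512   # random unit vectors in high dimension are nearly orthogonal
    rng = np.random.default_rng(0)

    # Each node sits at its parent's position plus a fresh random unit
    # vector, so the squared distance between two nodes approximately sums
    # the unit steps along the tree path between them (Pythagoras).
    emb = {0: np.zeros(dim)}
    for child in sorted(parent):
        step = rng.normal(size=dim)
        emb[child] = emb[parent[child]] + step / np.linalg.norm(step)

    def tree_distance(a, b):
        # Path length between a and b via their lowest common ancestor.
        def path_to_root(n):
            path = [n]
            while n in parent:
                n = parent[n]
                path.append(n)
            return path
        pa, pb = path_to_root(a), path_to_root(b)
        lca = next(n for n in pa if n in pb)
        return pa.index(lca) + pb.index(lca)

    for a in range(6):
        for b in range(a + 1, 6):
            sq = np.sum((emb[a] - emb[b]) ** 2)
            print(f"{a}-{b}: tree dist {tree_distance(a, b)}, sq dist {sq:.2f}")

The hypothetical figure 5(e) would amount to re-running this with `parent`
replaced by a random tree over the same nodes and comparing the shapes.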

------
DoctorOetker
This is fantastic!

Can this be generalized to embedding (1) graphs or (2) DAGs?

