

A framework for vector-based natural language semantics - daoudc
http://arxiv.org/abs/1101.4479

======
bravura
The question ultimately boils down to: Can we represent variable-length
sequences (e.g. sentences) in a fixed-length representation (vector with k
bits), without losing any meaning?

Fixed-length representations are more useful, because we can use standard
learning machinery (ML) to predict over them. Learning techniques over
variable-length sequences are more primitive: a CRF over token sequences, for
example, cannot incorporate long-distance dependencies, whereas a two-layer
neural network can approximate essentially any continuous function (the
universal approximation theorem).
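To make that concrete, here is a minimal sketch (toy random embeddings and a
hypothetical `encode` function, not any particular published model): once every
sentence is squashed into a fixed-length vector, any off-the-shelf fixed-input
learner applies to it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word vectors (hypothetical): each word gets a random k-dim embedding.
k = 8
vocab = {w: rng.normal(size=k) for w in "the cat sat on mat dog ran fast".split()}

def encode(sentence):
    """Collapse a variable-length sentence into a fixed-length vector
    (here: a simple mean of word vectors)."""
    return np.mean([vocab[w] for w in sentence.split()], axis=0)

# Every sentence, whatever its length, now lives in R^k ...
v1 = encode("the cat sat on the mat")
v2 = encode("dog ran fast")
assert v1.shape == v2.shape == (k,)

# ... so any standard fixed-input learner applies, e.g. a linear scorer.
w = rng.normal(size=k)
score = float(w @ v1)
```

The mean-of-vectors encoder obviously loses word order; the whole debate below
is about encoders that don't.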

I used to believe that a variable-length sequence would, ipso facto, require a
variable-length representation. However, Leon Bottou argued there must be an
upper bound (1000?) on the number of bits required to represent English
sentences that a human could recognize and parse in the course of normal
conversation. I'm not talking about a pathological grammatical case or some
Old Testament-like inventory of someone's possessions. I mean simply a
sentence that you could parse and repeat back to me in your own words.

My problem with the cited work is that it is purely theoretical, and does no
empirical work to explore potential limitations of the framework. It is
difficult, without throwing the approach at real data, to see if it is
actually an effective model for practical use. I haven't evaluated the
approach deeply enough to poke any specific holes in it.

The author writes "there is currently no general semantic formalism for
representing meaning in terms of vectors". However, I believe this is untrue.
The author is seemingly unaware of the entire connectionist literature on
fixed-length representations, which is based upon recursive neural networks.
For example, the recursive auto-associative memory (RAAM) work by Pollack in
1988, the Labeled RAAM architecture, holographic reduced representations
(Plate, 1991), and the recursive nets used by Sperduti and collaborators in
the mid-90s are all highly germane, but remain uncited. In
principle, these architectures are powerful enough to represent all meaning in
fixed length vectors, and operate over these vectors effectively. The problem
with these approaches isn't theoretical, it's practical. We simply don't know
how to train these architectures effectively. I find it annoying when a
theoretician makes claims on the basis of existing theoretical models while
remaining ignorant of existing empirical ones.

RAAM in particular is pretty cool. It's a fixed-length machine trained to eat
the input left-to-right. It is designed so that it can uncompress itself
right-to-left. So it has two basic operations: Consume, and uncompress. Each
time it eats an input, it outputs a new machine of the same fixed-length. Each
time it uncompresses a token, it outputs the token and a new machine of the
same length. Very cool!
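A sketch of that interface (untrained random weights, purely to illustrate the
two operations; a real RAAM learns the weights so that uncompress actually
inverts consume):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # the fixed representation size

# Untrained illustration weights; training would make uncompress invert consume.
W_enc = rng.normal(scale=0.1, size=(d, 2 * d))
W_dec = rng.normal(scale=0.1, size=(2 * d, d))

def consume(state, token_vec):
    """Eat one token; return a new 'machine' of the same fixed length."""
    return np.tanh(W_enc @ np.concatenate([state, token_vec]))

def uncompress(state):
    """Emit the most recently eaten token and the machine that preceded it."""
    out = W_dec @ state
    token_vec, prev_state = out[:d], out[d:]
    return token_vec, prev_state

state = np.zeros(d)
for tok in [rng.normal(size=d) for _ in range(5)]:
    state = consume(state, tok)         # still d-dimensional after every step
tok_hat, state_hat = uncompress(state)  # one right-to-left decoding step
```

Note the key property: the representation never grows, no matter how many
tokens are consumed.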

As you can tell, I am more excited by purely empirical and data-driven vector-
based methods. For vector-based word meanings, see the language model of
Collobert + Weston, which I summarized in this paper:
<http://www.aclweb.org/anthology/P/P10/P10-1040.pdf> You can also download
some word representations and code to play with here:
<http://metaoptimize.com/projects/wordreprs/>

~~~
werg
I have done some work with RAAMs and other methods based on simple recurrent
networks. It's terribly tedious: you can hardly represent more than five
symbols or so.

I did my Bachelor's thesis on related techniques (pressing variable-length
strings into fixed-size vectors so they can be fed to neural nets - URL
below). Speaking only of neural networks, I can say that their decoding
capabilities are severely limited. I did develop a Jordan network with a
conventional additive sigmoid NN that could encode/decode more than 40
symbols! The technique was based on Cantor coding, but I had to set the
weights by hand and could not retrain without losing performance (unstable
fixed-point attractor of Backpropagation Through Time) - i.e. it's kind of a
dirty hack.
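For readers who haven't met Cantor coding: the general idea (a sketch of the
principle, not my exact scheme) is to nest each symbol's sub-interval inside
the previous one, leaving gaps between the used sub-intervals, so an entire
string collapses into a single scalar that can be peeled back off robustly.

```python
m = 4             # alphabet size
base = 2 * m - 1  # use every other digit in base 7, leaving gaps for robustness

def cantor_encode(symbols):
    """Pack a symbol string into one number in [0, 1)."""
    x = 0.0
    for s in reversed(symbols):   # most significant symbol ends up outermost
        x = (x + 2 * s) / base    # symbol s becomes the (even) digit 2*s
    return x

def cantor_decode(x, n):
    """Peel n symbols back off the scalar code."""
    out = []
    for _ in range(n):
        x *= base
        d = int(round(x))
        if d % 2:                 # rounding overshot into the gap: snap down
            d -= 1
        out.append(d // 2)
        x -= d
    return out

seq = [3, 0, 2, 2, 1, 0, 3, 1]
code = cantor_encode(seq)         # one scalar holds the whole string
assert cantor_decode(code, len(seq)) == seq
```

The gaps between used digits are what make the decoding tolerant of small
perturbations - and also why the scheme is so fragile to retraining: the
weights must keep the iterated map exactly on this fractal structure.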

It's also really hard to adapt parameters for sufficiently large nets (so my
idea is that you need huge nets to represent language). In order to deal with
this complexity limitation I've also looked into reservoir computing style
networks (Echo State Networks) which use a large randomly initialized network
paired with a linear learned network. ESNs may be good at modelling many kinds
of temporal dynamics, but their capacity to represent relations seems rather
limited as well.
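The ESN recipe itself is simple enough to sketch in a few lines (a toy
next-step prediction task with made-up sizes, not my thesis setup): a large
fixed random reservoir, with only a linear readout trained on top.

```python
import numpy as np

rng = np.random.default_rng(1)
n_res, n_in = 100, 1

# Large random reservoir, fixed forever; only the linear readout is learned.
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))  # spectral radius < 1: "echo" property

def run_reservoir(u):
    """Drive the reservoir with input sequence u; collect the states."""
    x, states = np.zeros(n_res), []
    for u_t in u:
        x = np.tanh(W @ x + W_in @ np.atleast_1d(u_t))
        states.append(x)
    return np.array(states)

# Toy task: predict a sine wave one step ahead.
t = np.linspace(0, 8 * np.pi, 500)
u, y = np.sin(t[:-1]), np.sin(t[1:])
X = run_reservoir(u)

# Ridge-regression readout - the only trained parameters in the whole model.
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_res), X.T @ y)
pred = X @ W_out
```

This works nicely for smooth temporal dynamics like the sine above; the
limitation I ran into is that a linear readout over random features is a weak
device for representing the kind of relational structure language needs.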

So making an arbitrary-length string walk into a vector may sound attractive,
but either (a) don't expect to be able to decode it using conventional NNs, or
(b) don't expect to achieve compression; i.e. blowing up your representation
may help (cf. LVQ nets), while compressing techniques such as Hinton's deep
belief networks won't.

<http://gpickard.wordpress.com/2008/11/11/well-heres-my-actual-thesis/>

~~~
bravura
I believe the problem is the training algorithm, backprop, not the model
(NNs). "unstable fixed point attractor of Backpropagation through Time", as
you said.

Ilya Sutskever in Geoff Hinton's lab has had great success recently using
hessian-free optimization to train recurrent networks. He has trained
_character-level_ RNNs on Wikipedia, and they can generate very long sequences
of quasi-plausible text. In particular, it seems like they can remember many
symbols.

See his most recent pubs: <http://www.cs.toronto.edu/~ilya/pubs/>
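The character-level recurrence being trained there is itself tiny; here's a
sketch of just the generation loop (untrained random weights and a toy
alphabet, so the output is gibberish - the point is the feedback structure,
where each sampled character is fed back in as the next input):

```python
import numpy as np

rng = np.random.default_rng(0)
chars = list("abcdefghij ")  # toy alphabet; a real model uses the full charset
V, H = len(chars), 32

# Untrained parameters, purely to show the shape of the recurrence.
Wxh, Whh, Why = (rng.normal(scale=0.1, size=s) for s in [(H, V), (H, H), (V, H)])

def sample(n, seed_idx=0):
    """Generate n characters, feeding each sample back in as the next input."""
    h, idx, out = np.zeros(H), seed_idx, []
    for _ in range(n):
        x = np.eye(V)[idx]                 # one-hot input character
        h = np.tanh(Wxh @ x + Whh @ h)     # recurrent state update
        p = np.exp(Why @ h); p /= p.sum()  # softmax over the next character
        idx = rng.choice(V, p=p)
        out.append(chars[idx])
    return "".join(out)

text = sample(40)
```

All the long-range memory has to live in the hidden state h, which is exactly
what made these nets so hard to train with plain backprop through time.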

------
daoudc
This is a distillation of my PhD work, which I completed about four years ago.
Any comments appreciated!

~~~
sqrt17
It heaps on a lot of theoretical musing without providing either empirical
evidence of practical relevance or a theoretical guarantee along the lines of
"as long as the model is within delta of [something intuitively verifiable]
it will have [useful property] with p>1-epsilon."

For what you're trying to do, there is a relatively solid baseline, namely
doing something trivial to first-order logic (e.g. assigning a dimension to
each Herbrand formula or ground term) and turning generalized quantifiers
into tensors (with some mathematical plumbing required). This
would allow you to reuse model-theoretic semantics with a serving of tensors
and vector spaces on top, and it would do at least those things that
generalized quantifiers can do. You could then argue empirically for some kind
of finite-dimensional approximation to that, or expose neat theoretical
properties from a formal viewpoint.
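The flavor of that baseline can be sketched in a few lines (a toy example of my
own making, realized with plain vector algebra rather than explicit tensors):
one dimension per ground term, predicates as 0/1 vectors, quantifiers as maps
over those vectors.

```python
import numpy as np

# Toy Herbrand domain: one vector dimension per ground term.
domain = ["fido", "rex", "felix", "tweety"]

def pred(*members):
    """A unary predicate as a 0/1 vector over the domain."""
    return np.array([1.0 if d in members else 0.0 for d in domain])

dog   = pred("fido", "rex")
barks = pred("fido", "rex", "tweety")

# Generalized quantifiers as maps on predicate vectors.  In the full story
# these become (bi)linear tensor maps; boolean vector algebra suffices here.
def every(p, q):
    return bool(np.all(p <= q))  # everything that is p is also q

def some(p, q):
    return bool(p @ q > 0)       # the inner product detects overlap

assert every(dog, barks)         # every dog barks
assert some(barks, dog)          # something that barks is a dog
```

The empirical question would then be how well a low-dimensional approximation
of these vectors preserves the quantifiers' behavior.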

As is, you don't do any of that. It's more or less like going into a hardware
store, getting some pipe and toilet bowls and making an abstract sculpture
without ever talking about the house for which you want to provide the
plumbing.

~~~
daoudc
You're right - I talk about how to incorporate logical representations in my
thesis, but the reviewers asked me to remove it as it wasn't complete enough.
We had some more thoughts about the correct way to do this in this paper:
<http://homepages.feis.herts.ac.uk/~dc09aav/publications/iwcs-2011.pdf>

------
mikhailfranco
One of the main references, and other interesting stuff can be found on Bob
Coecke's site:

<http://www.comlab.ox.ac.uk/bob.coecke/>

Personally, I think the NLP problem is, and will always be, a graph problem.
To the extent that you can approximate and accelerate graph algorithms with
vectors, then fine, but vectors are not the fundamental space.

One interesting aspect of Coecke's work is the reuse of Hilbert spaces and
mathematical formalism from quantum mechanics. In fact, the most fascinating
papers on his site are the ones where he simplifies and visualizes QM
intuition based on monoidal categories, very much in the spirit of Baez &
Stay:

<http://math.ucr.edu/home/baez/rosetta.pdf>

Neither Clarke nor Coecke gives enough prominence to the great work of Aerts &
Gabora on QM-style attribute spaces and 'collapse' of knowledge vectors, e.g.

<http://arxiv.org/abs/quant-ph/0402207>

<http://arxiv.org/abs/quant-ph/0402205>

Coecke clearly knew of their work: even though he's now at Oxford, he did his
PhD at the Free University of Brussels with A&G.

Mik

