Towards A Token-Free Future In NLP (2022) (peltarion.com)
52 points by optimalsolver on Jan 31, 2023 | 15 comments



I'm a native Hebrew speaker (a language with wild word morphology), and I also worked on NLP at a large bank where we trained models to analyze Bloomberg chats (which are almost English but behave like a DSL).

In both cases, "token-free" has been the way to go for extractive tasks like entity recognition.

It's unfortunately a technique that is off the beaten path, so a lot of extra engineering effort needs to be paid down. For example, when dealing with spans but embedding the text at the byte level, one needs to account for multi-byte codepoints that can throw off the alignment between "encoded" and raw bytes.
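
A minimal sketch of the bookkeeping this requires, assuming plain Python and UTF-8 throughout (the helper names are made up):

    # Align character-level entity spans with byte-level model positions under
    # UTF-8, where a single codepoint may occupy several bytes.
    def char_to_byte_span(text: str, start: int, end: int) -> tuple[int, int]:
        """Map a [start, end) span in characters to a [start, end) span in UTF-8 bytes."""
        byte_start = len(text[:start].encode("utf-8"))
        byte_end = byte_start + len(text[start:end].encode("utf-8"))
        return byte_start, byte_end

    def byte_to_char_span(text: str, byte_start: int, byte_end: int) -> tuple[int, int]:
        """Inverse mapping; assumes the byte offsets fall on codepoint boundaries."""
        data = text.encode("utf-8")
        return len(data[:byte_start].decode("utf-8")), len(data[:byte_end].decode("utf-8"))

    text = "שלום world"                    # Hebrew letters are 2 bytes each in UTF-8
    print(char_to_byte_span(text, 0, 4))   # (0, 8), not (0, 4)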

As the article alludes, the increase in sequence length can be very expensive. In extractive tasks, I've found it effective to use much lighter models with limited (but large) context length like ByteNet[0].

Thinking out loud: as always, there is no one tool for everything. Often the summarization/few-shot capabilities of an off-the-shelf transformer are so far ahead that it's not worth building a model from ±scratch just to sidestep the subtleties of tokenization. Other times, you don't need the unbounded context, or you have a simpler task, and can forgo the power of off-the-shelf models for the specificity of a token-free one.

[0] Neural Machine Translation in Linear Time (https://arxiv.org/abs/1610.10099)


I don't deny the incredible success of transformers, but at the end of the day they are not that amenable to interpretation.

Word2Vec was (despite its poor performance compared to transformers) much more amenable to mathematical interpretation (partition function, analogies, ...).

Rhetorical question (i.e. not directed at you): how about fixing word2vec instead of letting it rot in the history of machine learning books?

We are all familiar with word2vec's property that the vector sum "king" - "man" + "woman" lies close to the vector of "queen". Imperfections for other words are due to things like word ambiguity (the vector of "ball" is a linear interpolation between the vectors of "ball.1", "ball.2", ... that one would get if the corpus had each word annotated with its distinct senses). Other reasons for imperfections could be a limited corpus, stopping training "early" due to diminishing returns or out of a rational fear of overfitting, ...
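
For reference, the analogy can be reproduced with gensim's pretrained vectors; the downloader name below is assumed to be the Google News word2vec release as packaged by gensim:

    import gensim.downloader as api

    kv = api.load("word2vec-google-news-300")   # large download the first time
    print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # 'queen' typically tops the list, with the caveats about word senses above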

These analogy sets are directed shifts, for example "man"-"woman", "bull"-"cow", "husband"-"wife", ...

Suppose one were to ask an audience to randomly give examples of antonym pairs: people would submit things like:

loud, silent

dry, wet

love, hate

poor, rich

young, old

male, female

strong, weak

day, night

etc..., then someone submits:

bad, good

At this point the imaginary lecturer stops consulting the crowd and explains that in word2vec antonyms appear at a similar distance from each other, and that the axes going through each pair are roughly parallel.

The crowd slowly comes to the realization that this effectively sorts the members of each pair into one of two global classes, say the class containing "good" and the class containing "bad".

good: silent, wet, hate, rich, old, female, weak, night

bad: loud, dry, love, poor, young, male, strong, day

Obviously something went horribly wrong. "This is unethical and must be stopped now!" a certain person exclaims.

The programmers blame the humans who expressed the corpus. Or perhaps the team that curated the corpus. Or perhaps they shift responsibility to the end-user for using the embeddings responsibly.

Someone mutters "algorithmic bias", whatever that means; surely the problem can't be due to the simple mathematical constraints we posed, or could it?

Proof by contradiction:

Suppose there was such a thing as an antonym shift vector.

One expects that if such a shift exists, the antonym of the antonym of a word should be near the original word: "male" + 2 x v_antonym = "male".

Simple linear algebra in R^n implies this cannot happen unless each word has the same embedding location as its antonym: 2 x v_antonym = 0 forces v_antonym = 0.

If v_antonym = 0 the equation is satisfied trivially, but then the model cannot differentiate between concepts and their antonyms.

If v_antonym != 0, then the supposed antonym direction that emerges from the algorithm is forced into spontaneous symmetry breaking, so the slightest bias in the corpus will be amplified.

So yes, sometimes the oversimplification is due to the mathematician or programmer, ... and not due to the corpus, corpus curator or end-user!

=====

Now imagine generalizing word2vec so that instead of the embeddings lying in R^n they lie in T^n, an n-dimensional torus.

Now consider a shift vector with all coordinates zero except one, which is halfway around the torus in that direction.

Applying such a shift twice results in a zero shift. Aha! A non-zero vector that, applied twice, returns to the original position!

How many such vectors are available? Naively n such vectors (one for each coordinate), but really all combinations are valid, i.e.

(0, 0, 0.5, 0, 0.5, ...) x 2 = (0, 0, 0, 0, 0, ...)

So the number of such antonym shifts on an n-dimensional torus is 2 to the power of n (one for each subset of coordinates).

That should more than suffice to allow many types of undirected antonyms: negating such an antonym shift vector on the torus gives back the same shift, so antonym pairs can form without the algorithm forcing them into a global positive and a global negative class.

For example, the shifts for (child, parent, child, ...) and (male, female, male, ...) would be different types of antonym shifts, and sometimes their combination is also occupied by words: the combined shift satisfies both (father, daughter, father, ...) and (mother, son, mother, ...).

Other advantages of a toroidal representation: weekdays can be placed 1/7th of a full cycle apart along some direction (or indeed 0/7ths in others, or 2/7ths, etc.).

Months 1/12th of a cycle apart, and so on.
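
A quick numerical check of this (nothing but numpy and modular arithmetic; the dimension and shifted coordinates are arbitrary):

    # Embeddings live in T^n = R^n mod 1. A shift with 0.5 in some coordinates is
    # a non-zero vector that is its own inverse, and a 1/7 shift cycles weekdays.
    import numpy as np

    n = 8
    x = np.linspace(0.1, 0.8, n)                 # some word embedding on the torus

    v_antonym = np.zeros(n)
    v_antonym[[2, 4]] = 0.5                      # one of the 2**n involutive shifts
    assert np.allclose((x + 2 * v_antonym) % 1.0, x)   # antonym of antonym = original
    assert not np.allclose((x + v_antonym) % 1.0, x)   # but the shift itself is non-trivial

    v_day = np.zeros(n)
    v_day[0] = 1 / 7                             # weekday shift: back to start after 7 steps
    assert np.allclose((x + 7 * v_day) % 1.0, x)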

=======

Character-level tokens:

In word2vec it's unclear how to generalize from words to sentences:

simply adding the word vectors of a sentence does not preserve meaning:

for example:

"the cat bit the dog"

means something very different from

"the dog bit the cat"

so concatenation of words can't be addition of vectors, since vector addition commutes: v + w = w + v.

so "the" + ("cat" + "bit") + "the dog" would equal:

"the" + ("bit" + "cat") + "the dog".

If we can swap adjacent words (due to vector commutativity), we can repeat swapping and move "cat" to the place of "dog" and vice versa.
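
A toy demonstration, with random stand-ins for trained word2vec vectors:

    # Summing the word vectors of a sentence is order-invariant, so both
    # sentences collapse to the same point.
    import numpy as np

    rng = np.random.default_rng(0)
    vec = {w: rng.standard_normal(50) for w in ["the", "cat", "bit", "dog"]}

    s1 = sum(vec[w] for w in "the cat bit the dog".split())
    s2 = sum(vec[w] for w in "the dog bit the cat".split())
    assert np.allclose(s1, s2)   # identical representations for different meanings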

Similarly word2vec is not used at the character level.

Observe that some words or concepts do in fact roughly commute in English:

"the green round block" means roughly the same as "the round green block"

So adjectives largely commute (although not entirely: "the large red block" sounds more natural to me than "the red large block").

so we sometimes want commutativity and sometimes don't want it, depending on the pair of words.

Next, consider training some kind of generalized word2vec, let's call it character2element, such that the character elements combine mathematically to produce a new element that roughly corresponds to what we call a word vector in word2vec.

Some words are made of 1 character, others are made of many characters.

So the elements should have some type of "product" such that "multiplying" 2 elements results in an element of the same space.

This could be satisfied with matrices (char2matrix) or perhaps "geometric algebra" multivectors (char2gamv).

In this way we train each character's (matrix or multivector) coefficients such that, for example, multiplying the characters of two synonyms (assuming the words have exactly one sense) results in the same matrix or multivector. (I don't propose this as the training loss, though!)

i.e. multiplying the characters q u i c k = f a s t (or roughly so)

Similarly, multiplying g r e e n and r o u n d should result in elements that roughly commute (with an extra space character): the full products for "green round" and "round green" should be similar.
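
A rough sketch of how such a char2matrix scheme could look; the near-identity initialization and the dimensions are arbitrary choices, not a recipe:

    # Every character owns a d x d matrix and a string is the ordered product of
    # its characters' matrices, so word order matters by construction.
    import numpy as np

    d = 16
    rng = np.random.default_rng(0)
    # near-identity initialization keeps long products from exploding or vanishing
    char_mat = {c: np.eye(d) + 0.01 * rng.standard_normal((d, d))
                for c in "abcdefghijklmnopqrstuvwxyz "}

    def embed(s: str) -> np.ndarray:
        m = np.eye(d)
        for c in s:
            m = m @ char_mat[c]
        return m

    # Order sensitivity comes for free, unlike summed word vectors:
    assert not np.allclose(embed("the cat bit the dog"), embed("the dog bit the cat"))
    # Training would then push embed("quick") towards embed("fast"), and push
    # embed("green round") and embed("round green") towards each other.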

word2vec uses an inner product between a context vector and a focus-word vector. (From a physics perspective, the only way to get an invariant is by multiplying covariant with contravariant coordinates, so there is a hidden metric. Instead of adding the context and focus vectors after training to get the "embeddings", I believe providing an explicit metric and using the same contravariant coordinates for both would make more sense; or alternatively, keep a unit metric and redistribute the parameters freed up from the context vectors to the focus-word vectors.)

In word2vec this inner product is used to get a scalar (possibly negative), and a partition function is used to get the joint probability of a context and focus word.

How should the matrix or multivector be compressed to a scalar to pass to the partition function? Perhaps a series of informed guesses (determinant or trace of a matrix, norm of a multivector, ...) could be tried and the better-performing one selected...

Or we could instead run through the corpus a whole number of sentences at a time, corrupt it (at first only one character at a time), and optimize the coefficients for an autoencoding loss.

Can you read this: 𝕆𝔹𝕍𝕀𝕆𝕌𝕊𝕃𝕐?

When we read it we see the normal capitalized word "OBVIOUSLY", but somehow entirely emphasized in this mathematical double-struck notation.

So the matrix for each double-struck letter should be related to the matrix of the normal capital letter.

Suppose there were an invertible matrix M with inverse N, so that MN = NM = 1.

then we can hypothesize that

"𝕆" = M "O" N,

𝔹 = M "B" N, etc...

so that

"π•†π”Ήπ•π•€π•†π•Œπ•Šπ•ƒπ•" = M"O"N M"B"N M"V"N M"I"N M"O"N M"U"N M"S"N M"L"N M"Y"N

but since NM=1 we get:

M"OBVIOUSLY"N

so this effectively allows such decorative words to form

The same argument can be made for upper and lower case.
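
A quick numerical sanity check of the telescoping, with random near-identity matrices standing in for the letters:

    # If each styled letter is M @ A @ N with N = inv(M), the Ns and Ms cancel
    # pairwise and the styled word is just M @ (plain word) @ N.
    import numpy as np

    d = 8
    rng = np.random.default_rng(1)
    letters = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(9)]  # "O","B","V",...
    M = rng.standard_normal((d, d))
    N = np.linalg.inv(M)

    styled = np.eye(d)
    for A in letters:
        styled = styled @ (M @ A @ N)       # product of "double-struck" letters

    plain = np.eye(d)
    for A in letters:
        plain = plain @ A                   # product of plain letters

    assert np.allclose(styled, M @ plain @ N)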

Not sure how much of what I wrote is compatible with HN's markdown...

so not only words but complete sentences would result in a "thought matrix" or "thought multivector".

Decoding such a thought multivector might be done by performing gradient descent on unit-sum interpolations of character matrices for each position in a string of such matrices.
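
A rough PyTorch sketch of that decoding-by-optimization idea (the sizes and character ids below are arbitrary, and convergence to the right characters is not guaranteed):

    # Learn, for each position, a unit-sum (softmax) mixture over character
    # matrices whose ordered product should match a given "thought matrix".
    import torch

    d, vocab = 16, 27
    torch.manual_seed(0)
    char_mat = torch.eye(d) + 0.1 * torch.randn(vocab, d, d)   # frozen character matrices

    ids = [5, 0, 18, 19]                               # arbitrary ids, e.g. "f a s t"
    target = torch.eye(d)
    for i in ids:                                      # the thought matrix to decode
        target = target @ char_mat[i]

    logits = torch.zeros(len(ids), vocab, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=0.1)
    for _ in range(500):
        opt.zero_grad()
        weights = torch.softmax(logits, dim=-1)        # unit-sum interpolation per position
        mix = (weights @ char_mat.reshape(vocab, -1)).reshape(len(ids), d, d)
        prod = torch.eye(d)
        for m in mix:
            prod = prod @ m
        loss = ((prod - target) ** 2).sum()
        loss.backward()
        opt.step()

    print(logits.argmax(dim=-1))   # may or may not recover [5, 0, 18, 19]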


I just discovered https://github.com/shahzainmehboob/word2matrices from about 5 years ago. There doesn't seem to be an associated paper, though, and it's not immediately clear what norm they used, but it looks like they flatten the matrix and compute the inner product on the resulting vector, by analogy with word2vec vectors; embeddings that occur frequently should thus be closer to the 0 matrix. I.e. they seem to be using the Frobenius norm of matrices, which seems very reasonable.


They also only optimize for bigram statistics (2 matrices), so they don't utilize the associative property of matrices A(BC)=(AB)C, corresponding to string concatenation...
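
For what it's worth, a quick check that the flatten-then-dot-product score is exactly the Frobenius inner product trace(A @ B.T), and that matrix products associate, so bigram composition extends to longer strings:

    import numpy as np

    rng = np.random.default_rng(0)
    A, B, C = (rng.standard_normal((8, 8)) for _ in range(3))

    assert np.isclose(A.ravel() @ B.ravel(), np.trace(A @ B.T))   # Frobenius inner product
    assert np.allclose(A @ (B @ C), (A @ B) @ C)                  # associativity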


I work in an industry where most of the text is not natural: super wide vocabulary, lots of symbols and abbreviations. I built some UTF-8-tokenized models that do very well in comparison to SentencePiece/WordPiece tokenizers.

Unless your corpus fits the tokenizer vocabulary very well, I think there are many cases where "token-free" models can be advantageous. You only have to determine how to deal with the expansion of the input dimension and how it might blow up your attention costs.
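
For the unfamiliar, a byte-level "tokenizer" really is this simple (the jargon line below is invented, not real Bloomberg data):

    # The vocabulary is just the 256 byte values (plus whatever special ids you
    # reserve), so nothing is ever out-of-vocabulary, at the cost of longer sequences.
    def encode(text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(ids: list[int]) -> str:
        return bytes(ids).decode("utf-8", errors="replace")

    ids = encode("IRS/3M FLT 26JAN25 +14.5bp vs 98.125")
    print(len(ids), decode(ids) == "IRS/3M FLT 26JAN25 +14.5bp vs 98.125")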


Interesting. Could you give some examples of "words" in this field?


Can't you just train a piece tokenizer like SentencePiece on your irregular corpus? That's what I've done with my amino-acid model.

Or are you saying your tokenizer does better than that?


I think in most applications standard tokenizers work fine. There's a point where too much of your vocabulary gets chopped up into single-character tokens anyway, and then it ends up performing much worse. I'm not able to quantify it exactly for you; I'd have to benchmark it over some synthetic data to get a better idea.

One could imagine a situation where the opportunity for misspelling is very high - say, for example, OCR of medium- to low-quality images. With a non-character-level tokenizer, your model also has to learn to associate complete tokens with any number of incomplete tokens. Instead, it's simpler to just let the model infer words from complete or mostly complete character sequences, rather than from randomly chopped-up word-piece tokens.


I've actually been working on a Unicode-only code-prediction LM, and it already works pretty well. The main issue with Unicode models, as the article also found, is the large sequence length required compared to a SentencePiece or BPE tokenizer. The current direction to resolve this seems to be structured state space models (S4/DSS), which scale linearly with sequence length compared to O(n^2) for transformers. I haven't read the Charformer paper yet, but if it does dynamic pooling of the tokens it could be a promising step towards having this same functionality in transformer models.


We should just operate on pixels of images of text: https://arxiv.org/abs/2212.08045


Or strokes.


My understanding is that transformer models scale quadratically with the sequence length, so it makes sense to encode your text into as few tokens as possible.
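
A back-of-the-envelope illustration, assuming roughly four characters per subword token:

    chars = 4096                    # characters in some document
    tokens = chars // 4             # rough subword token count, assuming ~4 chars/token

    # self-attention builds an n x n score matrix, so cost grows with n^2
    print((chars ** 2) / (tokens ** 2))   # -> 16.0: ~16x more attention compute at char level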

In any case, I don't understand why this is called 'Token-Free'. There are still tokens, there is just one for each character.


I expect it is analogous to scannerless parsing - there's still a scanner, it just recognizes individual characters. It may be more about where you draw the line and how important it is to your understanding (or ability to convey the idea).


[shameless plug] You can effectively train a token-free model if you somehow know how to segment the text into meaningful chunks.

https://arxiv.org/abs/2211.12677


I find subword tokenisers like SentencePiece to be pretty great. They make the sequence length smaller and you can get away with very, very small vocabularies.
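
A rough example of what that looks like with the SentencePiece Python bindings (file names are placeholders, options untuned):

    import sentencepiece as spm

    # train a small-vocabulary model on a plain-text corpus
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="small", vocab_size=1000,
        model_type="unigram", character_coverage=1.0,
    )

    sp = spm.SentencePieceProcessor(model_file="small.model")
    print(sp.encode("the quick brown fox", out_type=str))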



