Hacker News new | comments | show | ask | jobs | submit login
Show HN: Word2Bits – Quantized Word Vectors (github.com)
221 points by maxlam 9 months ago | hide | past | web | favorite | 110 comments

This is great and echoes a recent fascination for me. One application of compact word embeddings is that if they're small enough you can ship a whole model to the user in a web app so that semantic computations can be done entirely client-side, which is useful for privacy. I did a naive 1-bit quantization a few months ago in order to fit a large-vocabulary word embedding into a smallish (< 3 MB) JSON file so that synonyms could be generated by the user without hitting a server. (See https://docs.google.com/presentation/d/1SfbfnuNSlDRtUOpN1KNc... for a description and https://prototypes.cortico.ai/search-history/ for the application -- happy to send you more details/code if you want) I never got as far as formally evaluating the quality of these vectors, though.

You can also replace rare words with a linear combination of 1..3 of their closest neighbours, especially for topic classification tasks. This would allow higher precision for more frequent words and still handle the rare words, reducing the number of vectors you need to send to the thin client.

Finally a use for 5g bandwidth: sending big ML models to the client.

Can you expand on the privacy use case? I don't understand how applying a word vector model on the client improves privacy.

The tool I made works with local data -- it analyzes your Google search history from a file on your disk. You can filter it by specifying a topic, which is then matched semantically against the search queries in your history using the word vectors. Given the sensitive nature of this data, I felt it important not to send these topic queries (nor any of the raw search history queries) to a remote server. I can also imagine using the word vectors to do a topic clustering of the search history.

You presumably do not have to ship the content of the user's query up to the server.

How does that improve privacy in this case? The sort of queries that can be made against a word vector model, and only a word vector model, are pretty trivial.

Interesting to see that "science" and "fiction" are so similar according to this metric. Not surprising, though, given how often they co-occur in text, but this clearly shows the limitations of the method. The unit of interest is not really character strings but lexical entries and there can be multiple lexical entries associated with one character string. For instance, the word "bank" can mean "financial institution" or "edge of a river" but both meanings would contribute to a common word vector. You could say that this only affects ambiguous words, but first ambiguous words are very common and second these words probably distort the whole vector space (by introducing "short circuits") and therefore also affect the vectors for non-ambiguous words. A long way left to go for these methods I suppose.

Edit: Have people tried to detect ambiguous words by measuring local conflict in word2vec space? E.g. "laboratory" is similar to "science" and "science" is similar to "fiction" but there is no evidence to suggest that "laboratory" should be similar to "fiction".

One method to partially rectify the problem you mention is to add additional preprocessing to the text to identify ngrams and add part of speech tags to words. For example [The, river, bank] might be restructed as [The, river bank (bigram)] where "river bank" is embedded as its own word vector.

This can also be used to disambiguate words like "hit" which can be used as a verb and a noun. You just replace "hit" with "hit|noun" and "hit|verb".

Yes! And, also, I swear I've seen this method done before. I can't remember the paper. Can anyone help?

Basically, instead of making neural word embeddings, much like you describe, the objects being embedded in a vector space were "hit|noun(1)" "hit|noun(2)" "hit|verb", and so on.

I believe they used WordNet or some ontology, or maybe a POS corpus like Penn Treebank....

I think you're talking about sense2vec

Yes!!! It was! Thanks so much!

This is a known problem and work is being done on it. I'm not sure whether there's work in the specific direction that you specified in your edit, but I did manage to find this paper from 2012 via a quick search: http://www.aclweb.org/anthology/P12-1092

This paper in particular has ~700 citations right now; I didn't go through them to see the latest work but this likely doesn't represent the cutting edge.

I imagine that marginal computational cost would be the deciding factor in choosing a more advanced model here. Granted, these embeddings are often generated infrequently so I don't see the harm in extra one-off training time. It's possible that choosing between definitions adds overhead elsewhere in the system.

I thought I knew what a vector was, but I have no idea what any of this project means.

Would someone please explain in simple language what this is, and why it's cool?

The idea is that you can kind of capture the "meaning" of a word with a sequence of numbers (a vector) -- and then you use these vectors for machine learning tasks to do cool stuff like answer questions!

Word2Vec is one of the algorithms to do this. Given a bunch of text (like Wikipedia) it turns words into vectors. These vectors have interesting properties like:

vector("man") - vector("woman") + vector("queen") = vector("king")


distance(vector("man"), vector("woman)) < distance(vector("man"), vector("cat"))

What Word2Bits does is make sure that the numbers that represent a word is limited to just 2 values (-.333 and +.333). This reduces the amount of storage the vectors take and surprisingly improves accuracy in some scenarios.

If you're interested in learning more, check out http://colah.github.io/posts/2014-07-NLP-RNNs-Representation... which has a lot more details about representations in deep learning!

Thank you. That really helped.

Quick run down on word2vec and what happened here:

A big problem with NLP is understanding the semantic associations between words (or even lemmas. Lemmas in this context refer to different meanings of the same word, like a baseball bat vs. a vampire bat). For example "run" and "sprint" are similar in meaning but convey different connotations; kings and queens are both high-level monarchs but we need to encode the difference in gender between them for a true semantic understanding. The problem is that the information in words themselves don't accurately convey all of this information through their string (or spoken) representations. Even dictionary definitions lack explanations of connotations or subtle differences in context, and furthermore aren't easily explained to a computer

Word2vec is an algorithm that maps words to vectors, which can be compared to one another to analyze semantic meaning. These are typically high-dimensional (e.g. several hundred dimensions). When words are used often together, or in similar contexts, they are embedded within the vector space closer to each other. The idea is that words like "computer" and "digital" are placed closer together than "inference" and "paddling".

Usually these vector mappings are represented as something like a python dictionary with each key in the dictionary corresponding to some token or word appearing at least one (maybe more) times in some set of training data. As you can imagine these can be quite large if the vocabulary of the training data is diverse, and due to being in such a high-dimensional space, the precision of a vector entry may not need to be as high as doubles or floats encode. Floats are 32 bits. The authors of this paper/repo figured out a way to quantize vector entries into representations with smaller numbers of bits, which can be used in storage to make saved word2vec models even smaller. This is really useful because a big problem with running word2vec instances is that they can take up space on the order of gigabytes. I haven't read it all yet but it seems the big innovation might have been figuring out a way to work with these quantized word vectors in-memory without losing much performance

Edit: seems it may not work in-memory

Training these quantized word vectors has to be done in full precision (so no memory gains during training the word vectors).

But when you save them to disk every value is either -1/3 or +1/3 so one could encode the word vectors in binary. This can lead to reducing memory usage during application time if you kept the word vectors in this compressed format (though you'd need to write a decode function in tensorflow or pytorch to take a sequence of bits corresponding to a word and convert it into a vector of -1/3s and +1/3s)

Oh interesting I see, so this is like a digital format mostly for sharing models between storage / over networks. I definitely think it would be possible (and useful!) to extend it to in-memory usage, though a C-function wrapper might be better than a native python function. Personally I'm often more frustrated by word2vec's size in memory than in storage so it might be used more in this manner. Would you mind if I submit a pull request?

Absolutely, please do!

I wonder, shouldn't the vector space ideally be much more than high-dimensional than hundreds? Like many thousands at least. (I mean, our brains likely can hold much more than hundreds of dimensions internally, I'd guess.) It's just that we don't have the computing power yet?

That's the curse of dimensionality in effect. In very high dimensions, the probability that two randomly chosen vectors are orthogonal increases, and conventional distance metrics become less useful since most vectors end up very far from each other.

In geometric terms, think of a circle embedded in a square. As the number of dimensions increases (e.g. a sphere embedded in a cube, a 4-dimensional ball embedded in a 4-dimensional cube, etc), most of the volume in the cube is outside the radius of the sphere. Any vectors you have effectively become sparse.

Basically this means that, while you need large dimensionality to model complex relations, high dimensionality makes it difficult to model these relationships (and need exponentially more data).

I don't really know why hundreds (I typically see 200-400) of dimensions are "better" than thousands, but there are a few justifications I can think of. You definitely get diminishing returns as you increase dimensionality, and size is a big concern. As an example, FastText's English word2vec is 300-d with 1 million word vectors, giving it a size of around 1 million * 300 floats * 4 bytes/float = 12GB (I haven't loaded it in a while so I forget if they do anything to compress it or if it gets bloated by something). Considering this often has to reside in-memory you really have to question whether marginal gains are worth a single model occupying on the order of 10x that much memory monetarily (and at sizes like that you even have on-disk storage, boot times, etc. as considerations)

Intuitively I can suggest the following: One of the problems is that even relatively huge corpuses like wikipedia can't deal with high dimensionality that well because even they likely lack the sheer size and diversity of content to "flesh it out", so to speak. My intuition tells me that this is due to the training process overfitting "clusters" of information together due to the size of the model. If you have too many dimensions, there's a lot of space for clusters to form, and with that will come some loss of semantic differentiability between clusters - or at least my hunch tells me so. You definitely want "computer" to be more associated with "mail" than "taupe" but if the frequencies of their associations are small they'll essentially be interpreted as noise or overfit. One thing to note is that word2vec embeddings are trained using a shallow neural net, and it's entirely possible for it to be a generic ML problem of too-many-parameters/bad network topology given the input when dimensions get too high.

With dimensions in this context it can be easy to forget that each additional dimension added can (potentially) add an order of complexity to the model - a smaller model lies on a hyperplane in the new vector space. What may happen is that an added dimension (by added, I mean before training, not after) might add some beneficial complexity for a specific subset of the model; e.g. if we were originally in very low dimensions, adding one dimension may allow king/queen and boy/girl to separate in the vector space based on gender( although in reality you can't create correspondences to individual dimensions and properties like this usually) but simply lead to noise or overfitting in other subsets. I think in very high dimensions this overfitting is likely to manifest itself in either too much clustering or strangely high similarity words resulting from outliers/noise in the data.

I've never really seen thousands of dimensions used in the wild, but I don't know of any papers that explicitly compare performance among different dimensions (word2vec is hard to evaluate with a single metric, though). Perhaps once you get to internal Google or Facebook levels of big data you could use distributed computing to make it work, but again, I haven't seen references to that.

From the Arxiv paper, multidimensional word vectors are used in natural language processing. However they can be many gigabytes in size making then cumbersome. This approach aims to solve that issue.

The wiki page on Word2vec seems helpful.


Only tangentially related, but we've recently tried to find an encoding of text that's "stable" with regards to it's characters (as opposed to stable wrt semantic meaning as here). That is, similar words (or fragments) such as "carg" and "cargo" should yield a simliar encoding.

To our surprise, we couldn't find any example or description of someone doing this before. Is this such an uncommon problem or did we just not search in the right places?

FWIW, our use case is a search tool with live result listing, so we're dealing with word fragments and would like the outcome to be somewhat stable as the user types along. We ended up rolling our own, but it has certain shortcomings, such as a hard character limit.

> To our surprise, we couldn't find any example or description of someone doing this before. Is this such an uncommon problem or did we just not search in the right places?

This is one of the defining differences between Word2Vec and Fasttext. But fasttext incorporates these character vectors as part of calculating the semantic vector, so you can't expect carg and cargo to end up being similar, but people have thought of it.

I don't think partial search is that uncommon, but I don't think it is usually solved by using vector representations similarly to word vectors. It seems like what you are looking for is usually accomplished by edit-distance / Levenshtein distance [1]

[1] https://en.wikipedia.org/wiki/Levenshtein_distance

Yeah, Levenstein distance is pretty close to our goal metric of "similarity". The thing is that we're feeding the search query into a neural network, hence we need some kind of vector represenation.


Or don't feed the search query directly into a neural network?

I don't understand; what do you mean?

Do what a search engine does: detect a more common form via edit distance and substitute that word.

You might be interested in phonetic algorithms for similarly sounding words:



Your specific example would be relevant to word-stemming and lemmatization. Stemming is the process of removing suffixes from words for standardization (e.g. swim, swims, swimming, swimmer could all be stemmed to just "swim") across inflections/conjugations. Lemmatization is similar but uses contexts. Actually, some stemmers wouldn't stem cargo to carg by default, but they definitely could be modified to exhibit that kind of behavior, or used as one step in a multistep standardization process



Levenshtein distance is a good metric for individual comparisons but if you're doing a lot of pairwise comparisons/want to index it's not a great option sometimes.


You definitely want to also look into tries/prefix trees. These take each character in the word and use it for an O(1) index for the next level of the tree. For example, "brea" queries the top node "b", pointing the next node "r", then "e", then "a". If you next read "d", the trie would indicate that this represents a completed word-fragment at the b->r->e->a->d node of the trie. If you combine this data structure with a statistical model, you can use it for things like spell-checking and autocompletion


(I've edited this comment twice now to make it more clear, hopefully this is sufficient). Let me know if you'd like me to point you to any other resources. I've worked with NLP a decent amount and could even work with you guys, if interested my email is in my profile and we can arrange further conversations

Stemming is similar, but probably wouldn't help solving our problem because we're potentially dealing with word fragments shorter than a stem. Also, we're feeding the search query into a neural network, so we need to create a vector representation of it.

You could modify word2vec to embed word fragments as part of the training process. You could also use stemming before training, or if you have a decent amount of computational resources you could embed trie entries with word2vec representations of word fragments and probabilistic models of the likely next character/syllable/word, which would allow you to use something like a markov process

For word completion, I would create a trie and encode naive char-to-char and absolute (word prefix vs. all recorded) frequencies for each edge during its creation, noting that the "end word" value is also possible and deserves a frequency. Then when a user is entering a single word without prior context, you simulate a markov process from the last character to the end of the word. If the user has input a short but unlikely combination you can observe the frequencies of untraversed edges from the nodes of the current path and start a markov process from there if it is much more likely. That gets you to the end of your current word in terms of its string representation. From there you can use n-grams if desired, or go straight into sanitization preceding vectorizaton, to construct a likely query

If I were you I would decouple processes in your pipeline. I mentioned the best way I know of combining word vectors and word fragments (embedding word fragments of a corpus into word2vec, then indexing them with a trie) and I don't think it would be feasible for size reasons - although perhaps the topic of this thread could make it more computationally amenable. It sounds like what you want to do is 1 first infer a word/phrase, 2 stem/sanitize it, 3 map it to its word2vec representation, then 4 do some search query using the vector. 2 and 3 could be combined if desired (would decrease corpus vocab substantially and be good space-wise, improve embeddings of stems of rare words by reducing overfitting, but lose some semantic complexity), perhaps even aggressively, but not with 1 unless you further modify the training process / augment the corpus with fragments.

TL;DR: There's not a good way I know of to use a word2vec mapping trained on a vanilla corpus to directly account for short spelling errors since individual spelling error fragments will be rare or not present within the vanilla data. You seem to think Levenshtein will help but keep in mind this is an expensive pairwise string comparison algorithm. Unless you implement a good way to check which strings to compare the input fragment to, you will likely perform too many comparisons because you won't know where to start

Our problem is not about auto-completion (we're not dealing with that much data to need sophisticated algorithms for that). What we're doing with our NN is ordering the set of results (matches) we already have. In other words, we're assigning a relevance number in [0, 1] to each result, based on the query string and training based on past user choices (clicking a result). In order to maintain some consistency and robustness, we need our NN to yield similar results for similar word fragments.

So if the NN has previously learned meaningful result priorities for "cargo", they should ideally also work out for "carg" (and vice versa) because of the live listing nature of our tool.

Ah ok I think I get it. So herein lies the problem (let me know if any of this is incorrect): you want to encode fragments of a word similarly to completed versions of the word, without doing inference. The easiest way of augmenting your training data into a word2vec embedding with artificially misspelled inputs doesn't seem like it would work considering how many misspellings there are per word.

I think the best way to do this is to create a second neural network which smooths out fragments into word2vec vectors corresponding to the derived word (or the derived word itself). In both approaches you start by making a dataset where each word in the vocabulary is the output for multiple incorrectly spelled, artificially generated inputs. For example you want to have the inputs "crg", "carg", "argo", "crgo", "cago", "cargo", "cargop" "cartgo" all have outputs to "cargo" in this data, whether it's the string "cargo" itself or the w2vec embedding of it. The approach where w2vec embeddings are the output allows for words like "carg" to be interpreted as something like a median between "car" and "cargo" both as input to your main NN and for training purposes, which might be want you want. There's some info on this here [0] but they use it to regenerate words themselves, which you probably don't want. Note that including the identity/low training error is very important unless you do a preliminary vocabulary check.

The second approach of generating correct spellings instead of approximate vectors fails if it doesn't get a close enough approximation, although it seems if levenstein distance <=2, the approximation can be corrected cheaply [1]. Sorry I couldn't be more of help, I haven't really encountered this type of problem before. Good luck, you have an interesting problem to solve!

[0] https://machinelearnings.co/deep-spelling-9ffef96a24f6 [1] http://norvig.com/spell-correct.html

Ah, so your first suggestion would be basically building an auto-encoder based on training data with correct and incorrect words (fragments). This might work, but it would require a lot of computation: Each word of the vocabulary multiplied by all its similar counterparts. And this wouldn't cover new/unknown terms yet.

What we ended up doing for now is a two-dimensional input layer with per-column one-hot encoding of characters (i.e. one character is one column, 128 rows for the ascii alphabet). Then, apply a convolution with kernel dimensions 3x128, which flattens data to one dimesion and combines three neighboring characters. The second part builds an "assiciation" between neighbors, which helps yielding similar outputs for similar word fragments.

This works quite well, except for some nasty limitations:

- Search queries have a hard limit in length, caused by our input layer dimensions

- Due to varying search query length, input nodes on the right side are often unused/zero, leading an weighting bias on the left side when training. That is, the start of search queries receives more attention that the end. But that's not necessarily a bad thing.

creichenbach: I believe I have an algorithm with working code, for you, but don't want to spam this thread. send me an email if you want to. (on my profile).

[Also. I hate having to spam discussion threads with personal user-to-user comments.. but there's no message user feature. This message will self destruct once it's goal has been achieved.]

You can use a character RNN that reads the fragmented input char by char, and feed its output to your NN.

We recently open sourced a library that does exactly this:


For word2vec, GloVE, and fastText. It is able to generate vectors for out-of-vocabulary words through "fragments" of words or subword character n-grams rather.

You can try encoding your input in shingles maybe, and feeding vectorized shingles (instead of a one hot encoded dictionary) into the usual CBOW or skipgram thing to train the embedding.

You'd need to invest plenty of effort into shingling and vectorizing properly to get useful results, though.

Ah, so we'd divide our word (fragment) into parts and treat the parts like words for usage with CBOW or skipgram?

Uh yeah I have no idea if it'd perform well, but instead of having a sparse vector with the one-hot encoding of 'cargo', you enter a sparse vector with the 'car', 'arg' and 'rgo' dimensions set high.

Top of my head speculation, I never tried this...

Ah, that's actually not too far idea-wise from what we ended up doing: https://news.ycombinator.com/item?id=16637525

FastText can generate word vectors for partial and/or unknown words based on similar chargram patterns.

That sounds interesting, do you happen to have a documentation link or similar? I can't seem to find any info about it.

There are some somewhat reasonable docs at the bottom of https://fasttext.cc/docs/en/unsupervised-tutorial.html

I really should do a blog post about it or something.

Enriching Word Vectors with Subword Information: https://arxiv.org/abs/1607.04606

Optimizations are always great :-)

This sounds similar to Maciej Kula's experiments in "Binary Latent Representations for Efficient Ranking: Empirical Assessment", https://arxiv.org/abs/1706.07479.

Maciej shared this [1]: "FWIW I ran similar experiments on recommendation tasks. My initial results were very encouraging (and similar to those reported here), but the effect disappeared entirely after adding regularization to the baseline. I would have more confidence in the results reported here if the authors ran a more principled hyperparameter search, and added regularization to their baseline."

[1] https://twitter.com/Maciej_Kula/status/976028573487239168

This could be useful with DNA/protein sequences - there's dna2vec and similar approaches to mine sequences for regulatory elements and more, the problem when compared with word2vec is that you have an order of magnitude more words. You don't really have words so you split the continuous DNA sequence into overlapping mers and treat that as words (afaik). That means that dna2vec takes forever (>24h with one chromosome) and builds a huge vector. I wonder whether this will speed up things...

Very cool that this beats Word2Vec on SQuAD! I wonder if the current state of the art models are using the standard GloVe word vecs and might see improvement from this. Tbh I don't know too much about how those have been implemented though, haha.

I'm curious, how many values did you try for the quantization functions? Without thinking too much about it, that seems like one of the hyperparams that could have a pretty big impact on performance.

You're definitely right, the quantization function and its values definitely have an impact on performance.

For 1 bit I think I tried something like -1/+1, -.5/+.5, -.25/+.25, -.333/+.333. and something like -10/+10 -- (and I think a few more). It seemed -.333/+.333 worked the best while +10/-10 did the worst on the google analogy task (getting like 0% right). All this was tuned on 100MB of Wikipedia data.

Have you considered doing gradient descent on the quantization steps? It looks to me like the model should be differentiable with respect to those values, so I'm not sure why you'd have to fix them to a constant.

Hm what do you mean? I'm not quite seeing how to differentiate with respect to the quantization steps.

Say you have a function f(q(x)) where q quantizes x into one of s_1, ..., s_n. Then if q(x) = s_i for a certain x, df/ds_i = df/dq and df/ds_j = 0 for all j != i.

That breaks down for values of x precisely at the boundary between steps, so I should have qualified "differentiable" with "almost everywhere".

It also occurs to me that this might interact strangely with the approximation dq/dx = 1, but since the quantization steps are globally shared, I think it should be stable anyway.

If the evaluation suite for your code doesn't require too much manual interaction, I might try and see for myself.

That's definitely an interesting idea -- it seems this would allow for boundaries that "change" along with the data (instead of having static boundaries as it is). Would be interested to know how that turns out!


So, this approach computes a "traditional" neural embedding, in say, R^50, and then "brutally" replaces each of the reals with an integer in Z_2,4,8...

I can't quite put my finger on it, but my hunch is that this naive method, while already delivering interesting results, can be drastically improved upon.

* don't use a fixed bit depth for all vector components

I guess it depends on what you're trying to optimize for -- what algebraic properties you wish to preserve for the end application:

If the end goal is: "linearity be damned, I want a stupidly fast but inaccurate way of doing approximate nearest neighbor search", then turning words into bitvectors, and using hamming distance, not(xor(a,b)), &c" works.

Either way, thanks for the ideas, OP. (was going to go on with some mathematical stuff which I suspect would improve upon it, but decided to either shutup, or put-up-and-credit-you.)

I would appreciate if you elaborated on your supposed improvements.

Very nice, thanks for creating this!

In addition to (possibly new) hardware that supports much larger memory, compression techniques like this might allow us to start operations with “phrase embedding” or eventually even whole “sentence embedding.”

It's interesting and slightly uncomfortable that the illustration for similar words uses the word "man" as an example, given the gender biases that result from learning word vectors solely from word distributions.

To explain, although I’m sure the author himself is familiar with the issue: For any word that is disproportionately associated with one gender in the corpus, the model will learn that gender difference as part of the representation of the word, pretty much baking it into all applications. [1] It's all fun and games when this helps you find the difference between "king" and "queen", but it becomes a problem when the same difference appears between "genius" and "beautiful".

I haven't evaluated these vectors for built-in biases, but I assume they would have similar problems to the pre-computed word2vec and GloVe embeddings. (If they don't -- if quantization is a natural way to counteract bias -- then that's an awesome result! But an unlikely one.)

To the author: I don’t mean this to sound like an accusation that you haven’t done this yet; I know that short papers can’t tell two stories at the same time. But the next step is pretty clear, right? How do these vectors measure on the Word Embedding Association Test for implicit bias? Does Bolukbasi’s de-biasing method, or an analogue of it, work on quantized embeddings?

[1] Bolukbasi et al., "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." https://arxiv.org/abs/1607.06520

In modern linguistics grammar is used descriptively rather than prescriptively, i.e. it's about describing how speakers actually use language rather than being about telling them how to use language 'properly'.

If there's gender bias in certain words it might be interesting to point that out but it's not the linguists' job to 'de-bias' the grammar and the underlying model.

There's an undeniable gender bias in certain words (men probably are less frequently referred to as 'beautiful' than women while on the other hand 'genius' likely is more often used when referring to men). Glossing over that by smoothing models not only misrepresents how speakers use language but probably doesn't really help the cause either.

If the corpus used displays a disproportionate association of certain words with one gender this could just mean that the corpus is insufficient for representing a language in general (as opposed to just a particular register or sociolect), which is a common problem in computational linguistics, not just when it comes to gender biases.

There’s two purposes for using such a corpus, one is to get an accurate representation of how language is used. Another is to use it as part of a machine human interface, in which case you want to be careful about whether you offend your customers or users or whether you’re even violating the law by biasing certain decisions somehow.

> it's not the linguists' job to 'de-bias' the grammar and the underlying model.

Sure, but this is not a project of linguistic research. It’s a tool that could see real-world usage. Not perpetuating the stereotypes that suffuse the data we feed into such models does seem like a worthy Endeavour, and I find it encouraging to see the creator taking these concerns seriously in this threat.

(As opposed to the community-at-large, which quickly send them to the bottom of the threat)

The issue is that the undesired language constructs one would like to remove are not universal. What is an improvement for one user group will be a regression for another & vice versa.

I’m a white, able-bodied, mostly heterosexual man, and I reject the theory that (non)discrimination is a zero-sum game.

Debiasing word vectors is not guaranteed to yield less discriminatory results in every application. In many languages, gender is expressed grammatically and there are different words for e.g. male and female scientists. If you debias the word vectors to remove associations with gender, it becomes impossible to distinguish those two. Now if you translate between two languages with that distinction, the model will have to fix itself to one of the two possibilities, and the translation error will be minimized by always picking the more common form. "First year of all-female Nobel prize winners and machine translation gets it totally wrong!" By debiasing the word vectors, the model has become more biased and less accurate overall.

If you create a discriminatory machine learning model, it's unlikely to be because you didn't debias your input representation. More often, you're training it on a task whose real-world statistics are biased, and the model learns to accurately reflect that bias to solve your task. The solution to that is not to modify the input, but rather the output you expect the model to provide.

Ah yes. This approach to ML bias is popular in some large corporate research groups:

- Say a lot of nice things about fairness

- Talk about the importance of de-biasing and doing it as well as possible

- Show (legitimately) that the farther downstream in your ML process you apply de-biasing, the more sound the results are

- Assume that any successful ML model will be used downstream in something else

- Therefore, never de-bias anything, except the final output of something where your company benefits from showing fairness: that is, a demonstration at a conference talk about fairness

- The time to de-bias real applications is always "later" and everyone can say they are working on doing the right thing

People who release data artifacts have the ability to de-bias now instead of later. It's not perfect, but it is good. If you rely on the developer after you to do the de-biasing that you're not doing, you're ensuring that de-biasing won't happen, because they won't do it either.

My point is that the appropriate debiasing is task-specific. You can completely eradicate bias one one task, but as soon as someone builds on top of your results, the bias is going to creep right back in, and likely to be worse due to the information loss in previous layers. If you just hand a bias-conscious developer a bunch of debiased word vectors with the implication that it will make their model less biased, I don't think that's really helpful. Like security, bias in machine learning needs to be addressed on the level of complete systems, it's not something you can leave to some simple input sanitizer.

Ah. In that case, I agree with where you're coming from, but not on the details:

I think there's no reason to believe that bias from multiple layers of ML end up neatly rolled up into one wad of bias that can be extracted at the end of the process. If there is an empirical demonstration that this can happen, I'd love to see it.

And I think the cost of de-biasing is much lower than you think it is. I mean, I de-bias word vectors. The accuracy loss on intrinsic evaluations from doing so is tiny; it's much smaller than what you gain from easy wins that not everyone does, like improving your OOV lookup strategy (another way to improve ML results by intervening manually on your insufficent data!).

A simple example for a thought experiment: let's say you're classifying movie reviews for sentiment, so you've got a word-embedding layer (trained on general word embeddings) and a classifier layer (trained on the specific data).

The word-embedding layer will learn from corpora that it is negative to say "gay". The classifier layer will learn from the specific dataset of movie reviews that it's negative to say "Steven Seagal" (this actually happens in simple classifiers trained on the MovieLens data set). And sure, I could accept this one as objective truth, it's just an amusing and memorable example.

I think that if you save all the de-biasing for the end, you are likely to be able to fix the "Seagal" problem but not the "gay" problem, whose representation has become too complicated by going through another step of machine learning.

So I don't see the moral hazard of doing more de-biasing, as long as nobody presents it as a one-stop fix to the problem, which I sure don't.

That's because what's called 'debiasing' is not actually removing bias but adding another opposite one at the superficial layers we can actually notice.

That's possibly a fair description of the kind of as-late-as-possible de-biasing that I find insufficient.

I think there is not nearly enough research into de-biasing at training time, perhaps because it would be discouraged by the reward structure of our field.

("You mean I can make my model more complex and my paper longer, and in return I lose the 0.15% accuracy gain that's the entire reason I could call the result 'state of the art'? Sure, I'll get right on that!")

The corpus alone can't be sufficient. The corpus displays associations that are disproportionate for how we want to use it, because corpora are from the past. But the point of ML is to build something that will be used in the future. [1]

There is no objective way to describe this difference between your training set and your test set, because you don't have data from the future. It requires a conscious decision. Enforcing that the future should be like the past is a conscious decision, and it's a lazy and unfortunate one.

This is just one instance of a problem that ML is already quite familiar with, and has been since "ML" was called "statistics": observed data produces a biased estimator of the actual distribution. You always have to correct your objectively-measured distributions for what you can't measure objectively. Here I mean "biased" in the mathematical sense; when it comes to human biases encoded in word embeddings, it also produces bias in the ethical sense. But there are many simpler instances of this.

For example: a maximum-likelihood language model assigns a probability of 0 to any word it has never seen. Maximum likelihood is, when measured, a better model of the input data than anything else. If you implement this completely objective language model, it would be useless when applied; it would output the impossible probability of 0 for most inputs. Instead you have to "smooth" and "correct" it for the fact that other words exist.

[1] https://twitter.com/random_walker/status/975700725807439879

Personally I am rather uncomfortable with the idea of 'correcting' the learning process to have the machine output what we think it should instead of what it objectively learns from the dataset. It is trading a bias against an ideal world with a bias against reality.

Where did you get the idea that data, of all things, is objective? Data is shit! Data is the noxious raw material that we have to process with great difficulty into something useful!

If you are familiar at all with machine learning, you should recognize that human decisions affect every step of the process, especially the part where the data is produced and collected. It is not an oracle of objective truth.

And let me quote Arvind Narayanan for why you are not going to get the right answer in your search for objectivity: "Training data is from the past and test data is from the future. We use ML because we want to learn from the past, not reproduce it." [1]

[1] https://twitter.com/random_walker/status/975700725807439879

IIRC, if you remove the 'uh' and 'hmm' from audio samples to train for speech recognition, you get lesser performance than if you leave them in. So it is not that easy to differentiate a-priori between actual data and noise. If you look at the history of NLP we went from 'sentences are instances of a well defined grammar' to 'sentences are a bunch of statistically related tokens'. Embracing the mess is how we got better results !

And as for de-biasing gender terms, It makes the assumption that the gender differences in language only have nefarious purposes, but without an actual proof that this is the case, you may very well throw the baby with the bathwater.

A good example of this is when the french government mandated that Resumes must be anonymous and not mention sex, gender, etc. It actually resulted in worse outcome for people from lower socio-economic background. It turned out that on average people were more forgiving of bad Resumes if it came from people who could be expected to be disadvantaged.

You've given an example of where a human decision in a data-processing pipeline harms the result. There are many examples where human decisions are a benefit. Look at every other decision in a successful speech-recognition pipeline for examples. We're not just putting raw waveforms into a black box handed to us by aliens.

If you read the paper, you'll see that de-biasing is not based on assumptions, it is also based on data. Bolukbasi ran a pretty substantial crowdsourced survey to find the comparisons that word vectors make that people consider inappropriate. Don't make hollow demands for proof when you're not even aware of what's already been shown.

Crowdsourcing comes with its own set of biases, of course, and we may want to revisit this data sometime, but so far this is a pretty reasonable proxy for whether an ML model will cause problems when it is deployed. And it's much better than nothing, the option that I'm struggling to understand why you prefer.

I read the Bolukbasi paper and I now see where you are getting at. I think that the problem is not one of bias but rather that words vector embed the whole meaning regardless of the context. And that debiasing is rather stop-gap solution as there is an endless list of bias to correct depending on the usage of the model.

You may have fixed PROGRAMMER - MAN + WOMAN = HOMEMAKER, and still get SOLDIER - AMERICAN + ARAB = TERRORIST, the list is endless

Yep, so there's a lot of work to do. Bolukbasi, Cheng, et al. will straightforwardly admit that their model only applies to the one axis of gender bias. Fortunately there are other people working on this too, and I am one of them.

The baked-in assumption that Arabs or Muslims are terrorists, or that terrorists are Arabs or Muslims, is something that the de-biasing process in ConceptNet Numberbatch (which I make) attempts to mitigate at the same time as gender and racial bias. And of course there is much more to do.

It's a fascinating and productive field of research. Why did you have such a negative initial reaction to it?

Because I find in AI's candid responses to the questions we ask a fascinating opportunity to understand humanity better. When we try to conform those answers to our desired biases we learn less from it. But now I understand that such aspects are a subset of the greater problem of getting our models to answer in a specific context.

Ultimately the Model would need to detect & learn the contexts by itself.

And more to the point; I see you are publishing data sets after correcting them for fairness. I think the unfair results can sometimes be more useful. I once prototyped a short-story tagging & recommendation system based on word2vec and I would not be surprised if raw vectors gave better results. People's taste in litterature (especially romantic) are very dependant on gender stereotypes.

Interesting, definitely need to think about debiasing. Seems like it won't really work straight out of the box since it'd destroy the 1 bit-ness of the vectors.

Though if only a few of the vectors are de-biased then you can still save a lot of space since all the other vectors are still represented using 2 numbers (while the de-biased vectors are represented using the full range of 32 bit numbers).

There may be a way to build it into the loss function so that it happens before the quantization, right?

(and holy crap, look how fast the HN conservatives are getting to my comment)

Literally everything is either Left or Right. Anyone who disagrees with you is definitely a sexist conservative. No other possibility. The center is a black hole.

Hm not sure, would need to think more about this -- definitely an interesting idea though!

Thanks for being willing to discuss it. And sorry if I'm complicating your task.

But any interesting release of NLP data has the potential to affect the way the field progresses, so take it as a compliment that I consider this an interesting release of NLP data. That's why I'm asking you to actively consider the downstream effects of word vectors and find out if you can make them better.

Thanks for the compliments! And thanks also for bringing up the debiasing! It's interesting to see what people care about and any conversation that leads to new ideas is a plus.

One of the closest related words is lady, and then effeminate, so it's quite unclear exactly what there is to be upset about? In fact it appears it's saying men are related to being effeminate, which should get you excited.

What a great quantitative evaluation of bias you just carried out.

Alright let me try to be real. After reading http://www.sciencemag.org/news/2017/04/even-artificial-intel... and https://en.wikipedia.org/wiki/Implicit-association_test (in that order thank you), it looks like the implicit association test is a good way to magnify small differences (I'm thinking floating point error) in how much you associate one of a few words with a certain other. So, presumably, quantification would really be a natural way to eliminate those types of subtle differences.

I'm glad we could at least get you to read it, but you're still not caught up. The effect is many orders of magnitude larger than floating point error.

The notion that implicit bias could be corrected by quantization (not "quantification") is an interesting hypothesis, with a low prior probability, which could be tested by experiment and easily published if it is true.

Very cool! I like the visualizations a lot. Did you try to get an interpretation for what each quantized vector dimension means (have just skimmed, not read)?

Also, I am curious why you chose to go straight to publishing on Arxiv? I am actually also in CS224N right now and have a project me and my collaborator feel is publication worthy, but our plan is to go the normal route of submitting to a conference and only putting it on Arxiv after the review process (though our code is open source, so now that I think about it maybe that's not that useful a plan...).

Definitely tried to figure out if the dimensions mean anything -- as far as I can tell they don't really mean much :(

If you want them to be meaningful without changing the model...

Couldn't you rotate the basis to minimize the distance between each basis vector and it's nearest neighbor?

Haven't tried this but this is definitely a good idea for visualizing what's going on!

One of the fascinating things about neural embedding such as these is that the individual component dimensions have NO "real" semantic meaning to us humans. It's better to think of them as single points in a higher dimensional space.

(of course, with a clever network design you could probably FORCE "meaning" onto some components)

Is there an explanation for why almost all the words most distant from "man" are either uppercase or otherwise lexically anomalous? Is it largely driven by their being low frequency in the corpus?

can this be done with fasttext as well?

Word2bits is definitely great for memory-constrained applications but for server use memory isn't as much a constraint (there's a direct word -> vector relationship so you can just put it in a database)

it would be amazing to combine this with fasttext's ability to generate vectors for out-of-vocab words.

The data engineer in me thanks you. Great work.

The paper would be a great weekend read.

Can someone explain how to read the graph?

From the paper

> Figure 2 shows a visualisation of 800 dimensional 1 bit word vectors trained on English Wikipedia (2017). The top 100 closest and furthest word vectors to the target word vector are plotted. Distance is measured by dot product; every 5 word vectors are labelled. A turquoise line separates the 100 closest vectors to the target word from the 100 furthest vectors (labelled “...”). We see that there are qualitative similarities between word vectors whose words are related to each other.

Yeah, I should definitely put more detail in the writeup -- thanks for the feedback!

What's happening with figure 1a (epochs vs google accuracy) is that as you train for more epochs the full precision loss continues to decrease (dotted red line) but accuracy also starts decreasing (solid red line). This indicates overfitting (since you'd expect accuracy to increase if loss decreases). The blue lines (quantized training with 1 bit) do not show this which suggests that quantized training seems to act as a form of regularization.

Figure 1b is pretty similar, except on the x axis we have vector dimension. As you increase vector dimension, full precision loss decreases, yet after a certain point full precision accuracy decreases as well. I took this to mean that word2vec training was overfitting with respect to vector dimension.

Oops, thought you meant the graphs (with the dotted/solid lines) in the writeup.

If you're referring to the image under "Visualizing Quantized Word Vectors" then each row is a word vector (and there are only two colors since each parameter is either -1/3 or +1/3).

Thanks for the reply. Yes, I did mean the image under "Visualizing Quantized Word Vectors".

I did get that the colors indicated values of dimensions, but I suppose what I really meant is, what is the take-away message? To me, it just looks like noise. Is there a pattern I should look for and go "a-ha, I see"?

You can kind of see that words that are similar have similar looking vector values (that's why there are vertical stripes of yellow / black). But you're right in that most of it just looks like noise. I put the picture there mainly to show what the quantized vectors look like.

In Figure 2, have you noticed that every 25 words or so, there are off-pattern words? They are represented by visually differing lines (looks like a glitch, and it may be a glitch as this happens for all visualization charts).

What words correspond to these "glitches" for "mushroom" for example? In the case of "mushroom" there is a glitch line just below "earthstar".

Can you provide the full y-axis word vector for any of the visualization charts?

Nice work.

Here's the one for man:

['man', 'woman', 'boy', 'handsome', 'stranger', 'gentleman', 'young', 'drunkard', 'devil', 'lonely', 'lady', 'lad', 'drunken', 'beggar', 'kid', 'effeminate', 'brave', 'bearded', 'himself', 'dressed', 'loner', 'meek', 'sees', 'hustler', 'girl', 'coward', 'thief', 'wicked', 'person', 'balding', 'dashing', 'deranged', 'tramp', 'mysterious', 'him', 'pretends', 'lecherous', 'friend', 'shepherd', 'portly', 'bespectacled', 'jolly', 'thug', 'gangster', 'dapper', 'genius', 'slob', 'beast', 'hero', 'hoodlum', 'policeman', 'elderly', 'drunk', 'manly', 'mustachioed', 'ruffian', 'cop', 'burly', 'beard', 'fool', 'terrified', 'scarecrow', 'scruffy', 'lover', 'peddler', 'remembers', 'supposedly', 'gambler', 'bloke', 'bastard', 'acquaintance', 'mighty', 'playboy', 'unshaven', 'prostitute', 'pimp', 'mans', 'skinny', 'carefree', 'scoundrel', 'crook', 'obsessed', 'surly', 'fancies', 'accosted', 'foolish', 'jovial', 'cocky', 'shifty', 'loves', 'narrator', 'butler', 'dying', 'casually', 'waiter', 'evil', 'frightened', 'gigolo', 'conman', 'cunning']

(Edit: Actually this isn't quite right as it doesn't match the image. Many of these vectors actually have the same distance to "man" and the dict doesn't keep a deterministic order. What you can do is modify https://github.com/agnusmaximus/Word2Bits/blob/development/s... and run it on the 1 bit 400k vectors and see what it prints out. To run it do: `python w2bvisualize.py path_to_1bit800d400kvectors` then it should generate similar figures as in the writeup)

interesting that 'artists' is a furthest neighbor of 'man' in the example

This might be because "Artist" has an uppercase "A" -- I trained all the word vectors to be case sensitive so "Artist" is not the same as "artist" (which should be closer to "man" than "Artist")

Why would you do that? If you look at something like LSA, they goal is to uniform those, rather than distinguish. artist and Artist should be (near)100% match; what are you trying to do here?

Main reason I did it this way is because Facebook's DrQA (which I evaluate the vectors on for the SQuAD task) uses case sensitive vectors.

Was a tough decision between choosing whether to train case sensitive vectors / case insensitive vectors and a future task would be to train case-insensitive vectors.

Makes sense. The plural vs singular probably impacts that one too.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact