Edit: Have people tried to detect ambiguous words by measuring local conflict in word2vec space? E.g. "laboratory" is similar to "science" and "science" is similar to "fiction" but there is no evidence to suggest that "laboratory" should be similar to "fiction".
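A toy sketch of that "local conflict" test (made-up 4-d vectors standing in for a trained model; the thresholds and the function name are arbitrary choices of mine): flag a word when two of its near neighbors are far from each other.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 4-d embeddings, made up for illustration; real vectors would come
# from a trained word2vec model.
vecs = {
    "laboratory": [0.9, 0.8, 0.1, 0.0],
    "science":    [0.8, 0.5, 0.6, 0.1],
    "fiction":    [0.1, 0.0, 0.9, 0.8],
}

def is_conflicted(w, vecs, hi=0.5, lo=0.3):
    """Flag w if two of its near neighbors are far from each other:
    w ~ a and w ~ b but a !~ b suggests w mixes two senses."""
    near = [u for u in vecs if u != w and cosine(vecs[w], vecs[u]) > hi]
    return any(cosine(vecs[a], vecs[b]) < lo
               for i, a in enumerate(near) for b in near[i + 1:])
```

With these toy vectors, "science" gets flagged (it sits between "laboratory" and "fiction", which are dissimilar to each other) while "laboratory" does not.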
This can also be used to disambiguate words like "hit" which can be used as a verb and a noun. You just replace "hit" with "hit|noun" and "hit|verb".
Basically, instead of making neural word embeddings, much like you describe, the objects being embedded in a vector space were "hit|noun(1)" "hit|noun(2)" "hit|verb", and so on.
I believe they used WordNet or some ontology, or maybe a POS corpus like Penn Treebank....
This paper in particular has ~700 citations right now; I didn't go through them to see the latest work but this likely doesn't represent the cutting edge.
I imagine that marginal computational cost would be the deciding factor in choosing a more advanced model here. Granted, these embeddings are often generated infrequently so I don't see the harm in extra one-off training time. It's possible that choosing between definitions adds overhead elsewhere in the system.
Would someone please explain in simple language what this is, and why it's cool?
Word2Vec is one of the algorithms to do this. Given a bunch of text (like Wikipedia) it turns words into vectors. These vectors have interesting properties like:
vector("man") - vector("woman") + vector("queen") = vector("king")
distance(vector("man"), vector("woman")) < distance(vector("man"), vector("cat"))
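A toy illustration of these properties with made-up 3-d vectors (real embeddings have hundreds of dimensions that don't map onto properties this cleanly):

```python
# Toy 3-d vectors, purely illustrative: real word2vec dimensions don't
# correspond neatly to properties like "royalty" or "gender".
v = {
    "man":   [0.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "cat":   [0.0, 0.5, 0.0],
}

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# vector("man") - vector("woman") + vector("queen") ...
target = [m - w + q for m, w, q in zip(v["man"], v["woman"], v["queen"])]
nearest = min(v, key=lambda word: distance(v[word], target))
print(nearest)  # -> king
```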
What Word2Bits does is make sure that the numbers representing a word are limited to just two values (-.333 and +.333). This reduces the amount of storage the vectors take and, surprisingly, improves accuracy in some scenarios.
If you're interested in learning more, check out http://colah.github.io/posts/2014-07-NLP-RNNs-Representation... which has a lot more details about representations in deep learning!
A big problem with NLP is understanding the semantic associations between words (or even word senses; a word sense in this context is one of the distinct meanings of a word, like a baseball bat vs. a vampire bat). For example, "run" and "sprint" are similar in meaning but convey different connotations; kings and queens are both high-level monarchs, but we need to encode the difference in gender between them for a true semantic understanding. The problem is that words themselves don't accurately convey all of this information through their string (or spoken) representations. Even dictionary definitions lack explanations of connotations or subtle differences in context, and furthermore aren't easily explained to a computer.
Word2vec is an algorithm that maps words to vectors, which can be compared to one another to analyze semantic meaning. These are typically high-dimensional (e.g. several hundred dimensions). When words are used often together, or in similar contexts, they are embedded within the vector space closer to each other. The idea is that words like "computer" and "digital" are placed closer together than "inference" and "paddling".
Usually these vector mappings are represented as something like a Python dictionary, with each key corresponding to some token or word appearing at least once (maybe more) in some set of training data. As you can imagine these can be quite large if the vocabulary of the training data is diverse, and in such a high-dimensional space, the precision of a vector entry may not need to be as high as floats or doubles encode. Floats are 32 bits. The authors of this paper/repo figured out a way to quantize vector entries into representations with smaller numbers of bits, which makes saved word2vec models even smaller on disk. This is really useful because a big problem with running word2vec instances is that they can take up space on the order of gigabytes. I haven't read it all yet, but it seems the big innovation might have been figuring out a way to work with these quantized word vectors in-memory without losing much performance.
Edit: seems it may not work in-memory
But when you save them to disk every value is either -1/3 or +1/3 so one could encode the word vectors in binary. This can lead to reducing memory usage during application time if you kept the word vectors in this compressed format (though you'd need to write a decode function in tensorflow or pytorch to take a sequence of bits corresponding to a word and convert it into a vector of -1/3s and +1/3s)
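A minimal sketch of that encode/decode step in plain Python (one bit per dimension; not necessarily the repo's actual on-disk format):

```python
def encode(vec):
    """Pack a vector of -1/3 and +1/3 values into bytes, one bit per dimension."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits.to_bytes((len(vec) + 7) // 8, "little")

def decode(data, dim):
    """Unpack the bits back into a list of -1/3 and +1/3 values."""
    bits = int.from_bytes(data, "little")
    return [1/3 if bits >> i & 1 else -1/3 for i in range(dim)]

v = [1/3, -1/3, -1/3, 1/3, 1/3]
packed = encode(v)            # 5 dimensions -> 1 byte instead of 5 floats
restored = decode(packed, 5)
```

So an 800-dimensional 1-bit vector fits in 100 bytes instead of 3200 bytes of floats.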
In geometric terms, think of a circle embedded in a square. As the number of dimensions increases (e.g. a sphere embedded in a cube, a 4-dimensional ball embedded in a 4-dimensional cube, etc), most of the volume in the cube is outside the radius of the sphere. Any vectors you have effectively become sparse.
Basically this means that, while you need large dimensionality to model complex relations, high dimensionality makes it difficult to model these relationships (and need exponentially more data).
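To put numbers on that intuition (assuming the simple ball-in-cube picture above): the inscribed unit n-ball has volume pi^(n/2) / Γ(n/2 + 1), against the cube's 2^n, and the ratio collapses fast.

```python
import math

def ball_fraction(n):
    """Fraction of the cube [-1, 1]^n occupied by the inscribed unit n-ball."""
    ball_volume = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    return ball_volume / 2 ** n

for n in (2, 3, 10, 100):
    print(n, ball_fraction(n))
# 2 -> ~0.785, 3 -> ~0.524, 10 -> ~0.0025, 100 -> ~1.9e-70
```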
Intuitively I can suggest the following: One of the problems is that even relatively huge corpuses like wikipedia can't deal with high dimensionality that well because even they likely lack the sheer size and diversity of content to "flesh it out", so to speak. My intuition tells me that this is due to the training process overfitting "clusters" of information together due to the size of the model. If you have too many dimensions, there's a lot of space for clusters to form, and with that will come some loss of semantic differentiability between clusters - or at least my hunch tells me so. You definitely want "computer" to be more associated with "mail" than "taupe" but if the frequencies of their associations are small they'll essentially be interpreted as noise or overfit. One thing to note is that word2vec embeddings are trained using a shallow neural net, and it's entirely possible for it to be a generic ML problem of too-many-parameters/bad network topology given the input when dimensions get too high.
With dimensions in this context it can be easy to forget that each additional dimension added can (potentially) add an order of complexity to the model - a smaller model lies on a hyperplane in the new vector space. What may happen is that an added dimension (by added, I mean before training, not after) might add some beneficial complexity for a specific subset of the model; e.g. if we were originally in very low dimensions, adding one dimension may allow king/queen and boy/girl to separate in the vector space based on gender (although in reality you usually can't create correspondences between individual dimensions and properties like this) but simply lead to noise or overfitting in other subsets. I think in very high dimensions this overfitting is likely to manifest itself in either too much clustering or strangely high similarity between words resulting from outliers/noise in the data.
I've never really seen thousands of dimensions used in the wild, but I don't know of any papers that explicitly compare performance among different dimensions (word2vec is hard to evaluate with a single metric, though). Perhaps once you get to internal Google or Facebook levels of big data you could use distributed computing to make it work, but again, I haven't seen references to that.
The wiki page on Word2vec seems helpful.
To our surprise, we couldn't find any example or description of someone doing this before. Is this such an uncommon problem or did we just not search in the right places?
FWIW, our use case is a search tool with live result listing, so we're dealing with word fragments and would like the outcome to be somewhat stable as the user types along. We ended up rolling our own, but it has certain shortcomings, such as a hard character limit.
This is one of the defining differences between word2vec and fastText. But fastText incorporates these character n-gram vectors only as part of calculating the full word's semantic vector, so you can't expect "carg" and "cargo" to end up being similar - but people have thought of it.
I don't think partial search is that uncommon, but I don't think it is usually solved by using vector representations similarly to word vectors. It seems like what you are looking for is usually accomplished by edit-distance / Levenshtein distance 
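For reference, the classic dynamic-programming Levenshtein; note it's O(len(a) * len(b)) per pair, which is why running it against a whole vocabulary gets expensive.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence,
    keeping only the previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("carg", "cargo"))  # 1: one insertion turns "carg" into "cargo"
```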
Or don't feed the search query directly into a neural network?
Your specific example would be relevant to word stemming and lemmatization. Stemming is the process of removing suffixes from words for standardization (e.g. swim, swims, swimming, swimmer could all be stemmed to just "swim") across inflections/conjugations. Lemmatization is similar but uses context. Actually, some stemmers wouldn't stem cargo to carg by default, but they definitely could be modified to exhibit that kind of behavior, or used as one step in a multistep standardization process.
Levenshtein distance is a good metric for individual comparisons, but if you're doing a lot of pairwise comparisons or want to build an index, it's often not a great option.
You definitely want to also look into tries/prefix trees. These take each character in the word and use it for an O(1) index for the next level of the tree. For example, "brea" queries the top node "b", pointing the next node "r", then "e", then "a". If you next read "d", the trie would indicate that this represents a completed word-fragment at the b->r->e->a->d node of the trie. If you combine this data structure with a statistical model, you can use it for things like spell-checking and autocompletion
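A bare-bones version of that data structure (insert plus prefix completion; no frequencies or statistical model on top):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode, the O(1) index per level
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True   # marks a completed word at this node

    def completions(self, prefix):
        """All stored words beginning with prefix (DFS from the prefix node)."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out, stack = [], [(node, prefix)]
        while stack:
            n, s = stack.pop()
            if n.is_word:
                out.append(s)
            for ch, child in n.children.items():
                stack.append((child, s + ch))
        return sorted(out)

t = Trie()
for w in ["bread", "break", "brew", "cargo"]:
    t.insert(w)
print(t.completions("brea"))  # -> ['bread', 'break']
```

Attaching a frequency to each edge and to the `is_word` flag gets you to the weighted-completion scheme described further down.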
(I've edited this comment twice now to make it more clear, hopefully this is sufficient). Let me know if you'd like me to point you to any other resources. I've worked with NLP a decent amount and could even work with you guys, if interested my email is in my profile and we can arrange further conversations
For word completion, I would create a trie and encode naive char-to-char and absolute (word prefix vs. all recorded) frequencies for each edge during its creation, noting that the "end word" value is also possible and deserves a frequency. Then when a user is entering a single word without prior context, you simulate a Markov process from the last character to the end of the word. If the user has input a short but unlikely combination, you can observe the frequencies of untraversed edges from the nodes of the current path and start a Markov process from there if it is much more likely. That gets you to the end of your current word in terms of its string representation. From there you can use n-grams if desired, or go straight into sanitization preceding vectorization, to construct a likely query.
If I were you I would decouple the processes in your pipeline. I mentioned the best way I know of to combine word vectors and word fragments (embedding word fragments of a corpus into word2vec, then indexing them with a trie), and I don't think it would be feasible for size reasons - although perhaps the topic of this thread could make it more computationally amenable. It sounds like what you want to do is (1) infer a word/phrase, (2) stem/sanitize it, (3) map it to its word2vec representation, then (4) run a search query using the vector. (2) and (3) could be combined if desired (this would decrease corpus vocabulary substantially, which is good space-wise, and improve embeddings of stems of rare words by reducing overfitting, but lose some semantic complexity), perhaps even aggressively - but not with (1), unless you further modify the training process / augment the corpus with fragments.
TL;DR: There's not a good way I know of to use a word2vec mapping trained on a vanilla corpus to directly account for short spelling errors, since individual spelling-error fragments will be rare or absent in the vanilla data. You seem to think Levenshtein will help, but keep in mind this is an expensive pairwise string comparison algorithm. Unless you implement a good way to check which strings to compare the input fragment to, you will likely perform too many comparisons because you won't know where to start.
So if the NN has previously learned meaningful result priorities for "cargo", they should ideally also work out for "carg" (and vice versa) because of the live listing nature of our tool.
I think the best way to do this is to create a second neural network which smooths out fragments into word2vec vectors corresponding to the derived word (or the derived word itself). In both approaches you start by making a dataset where each word in the vocabulary is the output for multiple incorrectly spelled, artificially generated inputs. For example, you want the inputs "crg", "carg", "argo", "crgo", "cago", "cargo", "cargop", "cartgo" to all map to the output "cargo" in this data, whether that's the string "cargo" itself or the w2vec embedding of it. The approach where w2vec embeddings are the output allows words like "carg" to be interpreted as something like a median between "car" and "cargo", both as input to your main NN and for training purposes, which might be what you want. There's some info on this here, but they use it to regenerate words themselves, which you probably don't want. Note that including the identity / low training error is very important unless you do a preliminary vocabulary check.
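A toy generator for that artificial-misspelling dataset (edit-distance-1 corruptions only; a real typo model would be keyboard-aware; the function name and edit choices are mine, not from any paper):

```python
import random

def misspellings(word, n=5, seed=0):
    """Build (misspelled input -> correct word) training pairs from
    edit-distance-1 corruptions: deletions, duplications, adjacent swaps."""
    rng = random.Random(seed)
    variants = set()
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])        # delete char i: "carg"
        variants.add(word[:i] + word[i] + word[i:])  # duplicate char i: "cargoo"
        if i + 1 < len(word):                        # swap chars i, i+1: "cagro"
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    variants.discard(word)
    sample = rng.sample(sorted(variants), min(n, len(variants)))
    # Include the identity pair: a correctly spelled input maps to itself,
    # matching the "including the identity is very important" point above.
    return [(word, word)] + [(m, word) for m in sample]

pairs = misspellings("cargo")
```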
The second approach of generating correct spellings instead of approximate vectors fails if it doesn't get a close enough approximation, although it seems that if the Levenshtein distance is <= 2, the approximation can be corrected cheaply. Sorry I couldn't be of more help, I haven't really encountered this type of problem before. Good luck, you have an interesting problem to solve!
What we ended up doing for now is a two-dimensional input layer with per-column one-hot encoding of characters (i.e. one character is one column, 128 rows for the ASCII alphabet). Then we apply a convolution with kernel dimensions 3x128, which flattens the data to one dimension and combines three neighboring characters. The second part builds an "association" between neighbors, which helps yield similar outputs for similar word fragments.
This works quite well, except for some nasty limitations:
- Search queries have a hard limit in length, caused by our input layer dimensions
- Due to varying search query length, input nodes on the right side are often unused/zero, leading to a weighting bias toward the left side during training. That is, the start of a search query receives more attention than the end. But that's not necessarily a bad thing.
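For what it's worth, that input layer can be sketched in numpy like this (untrained random kernel, shapes only; the length limit of 16 and all names are my own assumptions, not your actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
MAX_LEN, ALPHABET = 16, 128            # hard length limit from the input layer
W = rng.normal(size=(3, ALPHABET))     # one 3x128 kernel (untrained, illustrative)

def one_hot(query):
    """One character per column position: a MAX_LEN x 128 one-hot grid."""
    x = np.zeros((MAX_LEN, ALPHABET))
    for i, ch in enumerate(query[:MAX_LEN]):
        x[i, ord(ch) % ALPHABET] = 1.0
    return x

def conv_3x128(x):
    """Slide the 3x128 kernel along the character axis; each output value
    mixes three neighboring characters, flattening the grid to 1-d."""
    return [float(np.sum(W * x[i:i + 3])) for i in range(MAX_LEN - 2)]

a, b = conv_3x128(one_hot("cargo")), conv_3x128(one_hot("carg"))
# a and b agree on their first positions, since "cargo" and "carg" share
# their first windows of three characters; unused right-side positions
# stay zero, which produces the left-side weighting bias noted above.
```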
[Also, I hate having to spam discussion threads with personal user-to-user comments... but there's no message-user feature. This message will self-destruct once its goal has been achieved.]
Of word2vec, GloVe, and fastText, only fastText is able to generate vectors for out-of-vocabulary words, through "fragments" of words - or rather, subword character n-grams.
You'd need to invest plenty of effort into shingling and vectorizing properly to get useful results, though.
Top of my head speculation, I never tried this...
I really should do a blog post about it or something.
This sounds similar to Maciej Kula's experiments in "Binary Latent Representations for Efficient Ranking: Empirical Assessment", https://arxiv.org/abs/1706.07479.
Maciej shared this: "FWIW I ran similar experiments on recommendation tasks. My initial results were very encouraging (and similar to those reported here), but the effect disappeared entirely after adding regularization to the baseline. I would have more confidence in the results reported here if the authors ran a more principled hyperparameter search, and added regularization to their baseline."
I'm curious, how many values did you try for the quantization functions? Without thinking too much about it, that seems like one of the hyperparams that could have a pretty big impact on performance.
For 1 bit I think I tried something like -1/+1, -.5/+.5, -.25/+.25, -.333/+.333, and something like -10/+10 (and I think a few more). It seemed -.333/+.333 worked the best while -10/+10 did the worst on the Google analogy task (getting like 0% right). All this was tuned on 100MB of Wikipedia data.
That breaks down for values of x precisely at the boundary between steps, so I should have qualified "differentiable" with "almost everywhere".
It also occurs to me that this might interact strangely with the approximation dq/dx = 1, but since the quantization steps are globally shared, I think it should be stable anyway.
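Sketched in code, the forward/backward pair under discussion (my own minimal version of the dq/dx = 1 straight-through approximation, not the repo's implementation):

```python
def quantize(x):
    """Forward pass: snap each parameter to -1/3 or +1/3 (a step function,
    non-differentiable exactly at the boundary x = 0)."""
    return [1/3 if xi >= 0 else -1/3 for xi in x]

def quantize_grad(grad_out):
    """Backward pass under the straight-through approximation dq/dx = 1:
    the gradient passes through the step function unchanged."""
    return grad_out

x = [-0.8, -0.01, 0.2, 0.9]
q = quantize(x)                                   # [-1/3, -1/3, 1/3, 1/3]
g = quantize_grad([0.1, -0.2, 0.3, 0.05])         # passed through as-is
```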
If the evaluation suite for your code doesn't require too much manual interaction, I might try and see for myself.
So, this approach computes a "traditional" neural embedding, in say, R^50, and then "brutally" replaces each of the reals with an integer in Z_2,4,8...
I can't quite put my finger on it, but my hunch is that this naive method, while already delivering interesting results, can be drastically improved upon.
* don't use a fixed bit depth for all vector components
I guess it depends on what you're trying to optimize for -- what algebraic properties you wish to preserve for the end application:
If the end goal is "linearity be damned, I want a stupidly fast but inaccurate way of doing approximate nearest neighbor search", then turning words into bitvectors and using Hamming distance (not(xor(a, b)), &c.) works.
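That bitvector route really is cheap; a toy version with 8-bit "word vectors" (made-up bit patterns, each bit standing for the sign of one dimension):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit-packed vectors: xor, then popcount."""
    return bin(a ^ b).count("1")

# Toy 8-bit "word vectors"; in practice these would be packed sign bits.
cat, dog, car = 0b10110010, 0b10110110, 0b01001101
nearest = min((dog, car), key=lambda v: hamming(cat, v))
print(hamming(cat, dog), hamming(cat, car))  # 1 8
```

On real hardware this is a single XOR plus a popcount instruction per machine word, which is what makes the approximate nearest-neighbor search so fast.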
Either way, thanks for the ideas, OP. (was going to go on with some mathematical stuff which I suspect would improve upon it, but decided to either shutup, or put-up-and-credit-you.)
In addition to (possibly new) hardware that supports much larger memory, compression techniques like this might allow us to start operations with “phrase embedding” or eventually even whole “sentence embedding.”
To explain, although I’m sure the author himself is familiar with the issue: For any word that is disproportionately associated with one gender in the corpus, the model will learn that gender difference as part of the representation of the word, pretty much baking it into all applications.  It's all fun and games when this helps you find the difference between "king" and "queen", but it becomes a problem when the same difference appears between "genius" and "beautiful".
I haven't evaluated these vectors for built-in biases, but I assume they would have similar problems to the pre-computed word2vec and GloVe embeddings. (If they don't -- if quantization is a natural way to counteract bias -- then that's an awesome result! But an unlikely one.)
To the author: I don’t mean this to sound like an accusation that you haven’t done this yet; I know that short papers can’t tell two stories at the same time. But the next step is pretty clear, right? How do these vectors measure on the Word Embedding Association Test for implicit bias? Does Bolukbasi’s de-biasing method, or an analogue of it, work on quantized embeddings?
 Bolukbasi et al., "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." https://arxiv.org/abs/1607.06520
If there's gender bias in certain words it might be interesting to point that out but it's not the linguists' job to 'de-bias' the grammar and the underlying model.
There's an undeniable gender bias in certain words (men probably are less frequently referred to as 'beautiful' than women while on the other hand 'genius' likely is more often used when referring to men). Glossing over that by smoothing models not only misrepresents how speakers use language but probably doesn't really help the cause either.
If the corpus used displays a disproportionate association of certain words with one gender this could just mean that the corpus is insufficient for representing a language in general (as opposed to just a particular register or sociolect), which is a common problem in computational linguistics, not just when it comes to gender biases.
Sure, but this is not a project of linguistic research. It's a tool that could see real-world usage. Not perpetuating the stereotypes that suffuse the data we feed into such models does seem like a worthy endeavor, and I find it encouraging to see the creator taking these concerns seriously in this thread.
(As opposed to the community at large, which quickly sent them to the bottom of the thread.)
If you create a discriminatory machine learning model, it's unlikely to be because you didn't debias your input representation. More often, you're training it on a task whose real-world statistics are biased, and the model learns to accurately reflect that bias to solve your task. The solution to that is not to modify the input, but rather the output you expect the model to provide.
- Say a lot of nice things about fairness
- Talk about the importance of de-biasing and doing it as well as possible
- Show (legitimately) that the farther downstream in your ML process you apply de-biasing, the more sound the results are
- Assume that any successful ML model will be used downstream in something else
- Therefore, never de-bias anything, except the final output of something where your company benefits from showing fairness: that is, a demonstration at a conference talk about fairness
- The time to de-bias real applications is always "later" and everyone can say they are working on doing the right thing
People who release data artifacts have the ability to de-bias now instead of later. It's not perfect, but it is good. If you rely on the developer after you to do the de-biasing that you're not doing, you're ensuring that de-biasing won't happen, because they won't do it either.
I think there's no reason to believe that bias from multiple layers of ML end up neatly rolled up into one wad of bias that can be extracted at the end of the process. If there is an empirical demonstration that this can happen, I'd love to see it.
And I think the cost of de-biasing is much lower than you think it is. I mean, I de-bias word vectors. The accuracy loss on intrinsic evaluations from doing so is tiny; it's much smaller than what you gain from easy wins that not everyone does, like improving your OOV lookup strategy (another way to improve ML results by intervening manually on your insufficient data!).
A simple example for a thought experiment: let's say you're classifying movie reviews for sentiment, so you've got a word-embedding layer (trained on general word embeddings) and a classifier layer (trained on the specific data).
The word-embedding layer will learn from corpora that it is negative to say "gay". The classifier layer will learn from the specific dataset of movie reviews that it's negative to say "Steven Seagal" (this actually happens in simple classifiers trained on the MovieLens data set). And sure, I could accept this one as objective truth, it's just an amusing and memorable example.
I think that if you save all the de-biasing for the end, you are likely to be able to fix the "Seagal" problem but not the "gay" problem, whose representation has become too complicated by going through another step of machine learning.
So I don't see the moral hazard of doing more de-biasing, as long as nobody presents it as a one-stop fix to the problem, which I sure don't.
I think there is not nearly enough research into de-biasing at training time, perhaps because it would be discouraged by the reward structure of our field.
("You mean I can make my model more complex and my paper longer, and in return I lose the 0.15% accuracy gain that's the entire reason I could call the result 'state of the art'? Sure, I'll get right on that!")
There is no objective way to describe this difference between your training set and your test set, because you don't have data from the future. It requires a conscious decision. Enforcing that the future should be like the past is a conscious decision, and it's a lazy and unfortunate one.
This is just one instance of a problem that ML is already quite familiar with, and has been since "ML" was called "statistics": observed data produces a biased estimator of the actual distribution. You always have to correct your objectively-measured distributions for what you can't measure objectively. Here I mean "biased" in the mathematical sense; when it comes to human biases encoded in word embeddings, it also produces bias in the ethical sense. But there are many simpler instances of this.
For example: a maximum-likelihood language model assigns a probability of 0 to any word it has never seen. Maximum likelihood is, when measured, a better model of the input data than anything else. If you implement this completely objective language model, it would be useless when applied; it would output the impossible probability of 0 for most inputs. Instead you have to "smooth" and "correct" it for the fact that other words exist.
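The smoothing point in concrete terms (a toy unigram model; the 10k vocabulary size is a made-up assumption):

```python
from collections import Counter

def mle_prob(word, counts, total):
    """Maximum likelihood: unseen words get probability 0."""
    return counts[word] / total

def laplace_prob(word, counts, total, vocab_size):
    """Add-one (Laplace) smoothing: reserve a little mass for unseen words."""
    return (counts[word] + 1) / (total + vocab_size)

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
total, V = len(corpus), 10_000        # assume a 10k-word vocabulary

print(mle_prob("dog", counts, total))         # 0.0 -- useless when applied
print(laplace_prob("dog", counts, total, V))  # small but nonzero
```

The smoothed model is a strictly worse fit to the training data by the likelihood measure, and yet the only usable one - which is the "correct for what you can't measure" point.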
If you are familiar at all with machine learning, you should recognize that human decisions affect every step of the process, especially the part where the data is produced and collected. It is not an oracle of objective truth.
And let me quote Arvind Narayanan for why you are not going to get the right answer in your search for objectivity: "Training data is from the past and test data is from the future. We use ML because we want to learn from the past, not reproduce it." 
And as for de-biasing gender terms: it makes the assumption that gender differences in language only have nefarious purposes, but without actual proof that this is the case, you may very well throw the baby out with the bathwater.
A good example of this is when the French government mandated that résumés must be anonymous and not mention sex, gender, etc. It actually resulted in worse outcomes for people from lower socio-economic backgrounds. It turned out that, on average, people were more forgiving of bad résumés if they came from people who could be expected to be disadvantaged.
If you read the paper, you'll see that de-biasing is not based on assumptions, it is also based on data. Bolukbasi ran a pretty substantial crowdsourced survey to find the comparisons that word vectors make that people consider inappropriate. Don't make hollow demands for proof when you're not even aware of what's already been shown.
Crowdsourcing comes with its own set of biases, of course, and we may want to revisit this data sometime, but so far this is a pretty reasonable proxy for whether an ML model will cause problems when it is deployed. And it's much better than nothing, the option that I'm struggling to understand why you prefer.
You may have fixed PROGRAMMER - MAN + WOMAN = HOMEMAKER, and still get SOLDIER - AMERICAN + ARAB = TERRORIST, the list is endless
The baked-in assumption that Arabs or Muslims are terrorists, or that terrorists are Arabs or Muslims, is something that the de-biasing process in ConceptNet Numberbatch (which I make) attempts to mitigate at the same time as gender and racial bias. And of course there is much more to do.
It's a fascinating and productive field of research. Why did you have such a negative initial reaction to it?
Ultimately the Model would need to detect & learn the contexts by itself.
And more to the point: I see you are publishing data sets after correcting them for fairness. I think the unfair results can sometimes be more useful. I once prototyped a short-story tagging & recommendation system based on word2vec, and I would not be surprised if raw vectors gave better results. People's tastes in literature (especially romance) are very dependent on gender stereotypes.
Though if only a few of the vectors are de-biased then you can still save a lot of space since all the other vectors are still represented using 2 numbers (while the de-biased vectors are represented using the full range of 32 bit numbers).
(and holy crap, look how fast the HN conservatives are getting to my comment)
But any interesting release of NLP data has the potential to affect the way the field progresses, so take it as a compliment that I consider this an interesting release of NLP data. That's why I'm asking you to actively consider the downstream effects of word vectors and find out if you can make them better.
The notion that implicit bias could be corrected by quantization (not "quantification") is an interesting hypothesis, with a low prior probability, which could be tested by experiment and easily published if it is true.
Also, I am curious why you chose to go straight to publishing on arXiv? I am actually also in CS224N right now and have a project my collaborator and I feel is publication-worthy, but our plan is to go the normal route of submitting to a conference and only putting it on arXiv after the review process (though our code is open source, so now that I think about it maybe that's not that useful a plan...).
Couldn't you rotate the basis to minimize the distance between each basis vector and its nearest neighbor?
(of course, with a clever network design you could probably FORCE "meaning" onto some components)
Word2bits is definitely great for memory-constrained applications but for server use memory isn't as much a constraint (there's a direct word -> vector relationship so you can just put it in a database)
it would be amazing to combine this with fasttext's ability to generate vectors for out-of-vocab words.
The paper would be a great weekend read.
> Figure 2 shows a visualisation of 800 dimensional 1 bit word vectors trained on English Wikipedia (2017). The top 100 closest and furthest word vectors to the target word vector are plotted. Distance is measured by dot product; every 5 word vectors are labelled. A turquoise line separates the 100 closest vectors to the target word from the 100 furthest vectors (labelled “...”). We see that there are qualitative similarities between word vectors whose words are related to each other.
What's happening with figure 1a (epochs vs google accuracy) is that as you train for more epochs the full precision loss continues to decrease (dotted red line) but accuracy also starts decreasing (solid red line). This indicates overfitting (since you'd expect accuracy to increase if loss decreases). The blue lines (quantized training with 1 bit) do not show this which suggests that quantized training seems to act as a form of regularization.
Figure 1b is pretty similar, except on the x axis we have vector dimension. As you increase vector dimension, full precision loss decreases, yet after a certain point full precision accuracy decreases as well. I took this to mean that word2vec training was overfitting with respect to vector dimension.
If you're referring to the image under "Visualizing Quantized Word Vectors" then each row is a word vector (and there are only two colors since each parameter is either -1/3 or +1/3).
I did get that the colors indicated values of dimensions, but I suppose what I really meant is, what is the take-away message? To me, it just looks like noise. Is there a pattern I should look for and go "a-ha, I see"?
What words correspond to these "glitches" for "mushroom" for example? In the case of "mushroom" there is a glitch line just below "earthstar".
Can you provide the full y-axis word vector for any of the visualization charts?
['man', 'woman', 'boy', 'handsome', 'stranger', 'gentleman', 'young', 'drunkard', 'devil', 'lonely', 'lady', 'lad', 'drunken', 'beggar', 'kid', 'effeminate', 'brave', 'bearded', 'himself', 'dressed', 'loner', 'meek', 'sees', 'hustler', 'girl', 'coward', 'thief', 'wicked', 'person', 'balding', 'dashing', 'deranged', 'tramp', 'mysterious', 'him', 'pretends', 'lecherous', 'friend', 'shepherd', 'portly', 'bespectacled', 'jolly', 'thug', 'gangster', 'dapper', 'genius', 'slob', 'beast', 'hero', 'hoodlum', 'policeman', 'elderly', 'drunk', 'manly', 'mustachioed', 'ruffian', 'cop', 'burly', 'beard', 'fool', 'terrified', 'scarecrow', 'scruffy', 'lover', 'peddler', 'remembers', 'supposedly', 'gambler', 'bloke', 'bastard', 'acquaintance', 'mighty', 'playboy', 'unshaven', 'prostitute', 'pimp', 'mans', 'skinny', 'carefree', 'scoundrel', 'crook', 'obsessed', 'surly', 'fancies', 'accosted', 'foolish', 'jovial', 'cocky', 'shifty', 'loves', 'narrator', 'butler', 'dying', 'casually', 'waiter', 'evil', 'frightened', 'gigolo', 'conman', 'cunning']
(Edit: Actually this isn't quite right as it doesn't match the image. Many of these vectors actually have the same distance to "man" and the dict doesn't keep a deterministic order. What you can do is modify https://github.com/agnusmaximus/Word2Bits/blob/development/s... and run it on the 1 bit 400k vectors and see what it prints out. To run it do: `python w2bvisualize.py path_to_1bit800d400kvectors` then it should generate similar figures as in the writeup)
It was a tough decision whether to train case-sensitive or case-insensitive vectors; a future task would be to train case-insensitive vectors.