
Fast word vectors with little memory usage in Python - jinqueeny
https://github.com/ThoughtRiver/lmdb-embeddings
======
patelajay285
This is interesting work. We at Plasticity (YC S17) open-sourced
something similar called Magnitude
([https://github.com/plasticityai/magnitude](https://github.com/plasticityai/magnitude))
a few months ago for querying vector embeddings quickly with low memory usage
using SQLite and a standard universal file format (.magnitude) between
word2vec, fastText, and GloVe. We also added features for out-of-vocabulary
lookups, bulk queries, concatenating embeddings from multiple models, etc. It
may also be of interest to folks looking at this.
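
Roughly, usage looks like this (path illustrative; see the README for the
exact details, including the multi-model concatenation API):

    from pymagnitude import Magnitude

    vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")

    vectors.query("cat")           # single lookup
    vectors.query(["cat", "dog"])  # bulk lookup
    vectors.query("uberx")         # out-of-vocabulary words still get a vector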

We also have a paper on Magnitude we will be presenting at EMNLP 2018 and ELMo
support is coming soon!

~~~
hyc_symas
SQLite (and SQL in general) is ridiculously slow for specialized applications
like this.
[http://www.lmdb.tech/bench/microbench/](http://www.lmdb.tech/bench/microbench/)

~~~
patelajay285
It depends on what you're optimizing for. SQLite is a little slower for random
disk reads, but that doesn't matter much given that we use an in-memory LRU
cache and most text follows a Zipfian distribution, which hits the cache very
often even with a small cache size (see my comment elsewhere in this thread).

Developers are more familiar with SQLite; it's bundled with Python, it's easy
to add in nearly every language, and it has specialized features we use, like
the Full Text Search module for out-of-vocabulary lookups.
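
To sketch the general idea (simplified, not our exact implementation; requires
an SQLite build with FTS5, which Python's bundled SQLite usually has): index
each vocabulary word as its character trigrams in an FTS table, then match an
out-of-vocabulary word's trigrams against it to find similar known words.

    import sqlite3

    def trigrams(word):
        # character trigrams; short words fall back to the whole word
        return [word[i:i + 3] for i in range(max(len(word) - 2, 1))]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE vocab USING fts5(word, grams)")
    for w in ("cat", "category", "dog"):
        conn.execute("INSERT INTO vocab VALUES (?, ?)", (w, " ".join(trigrams(w))))

    oov = "catt"  # unseen / misspelled word
    query = " OR ".join(trigrams(oov))  # "cat OR att"
    print(conn.execute(
        "SELECT word FROM vocab WHERE vocab MATCH ? ORDER BY rank", (query,)
    ).fetchall())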

~~~
hyc_symas
Wrappers for LMDB are available for nearly every language as well.
[https://symas.com/lmdb/technical/#wrappers](https://symas.com/lmdb/technical/#wrappers)

------
danieldk
Note that the original word2vec binary format is extremely bad for low-memory
use. It stores the words and vectors interleaved (and words are obviously of
variable length). Of course, you could circumvent this problem by building a
separate (sorted) index file.

However, newer formats, such as the fastText binary format, store the
embedding matrix contiguously. In such formats you can just memory-map the
embedding matrix and you only have to load the vocabulary into memory. This is
even simpler than the approach described here and in the Delft README: you
don't have any serialization/deserialization overhead [1], you can let the OS
decide how much to cache in memory, and you have one dependency fewer (mmap is
part of POSIX [2]).
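
As a rough illustration (the file layout here is made up for the example; it
is not fastText's actual binary format): store the matrix as raw little-endian
float32 rows, and the vocabulary as a separate text file, one word per line.

    import numpy as np

    DIM = 300

    # Only the vocabulary dict lives in Python memory.
    with open("vocab.txt", encoding="utf-8") as f:
        row_of = {word.strip(): i for i, word in enumerate(f)}

    # The OS pages vectors in (and out) on demand.
    matrix = np.memmap("matrix.f32", dtype="<f4", mode="r").reshape(-1, DIM)

    def embedding(word):
        return matrix[row_of[word]]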

[1] Of course, if you have a system with different endianness, you have to do
byte swapping.

[2]
[http://pubs.opengroup.org/onlinepubs/7908799/xsh/mmap.html](http://pubs.opengroup.org/onlinepubs/7908799/xsh/mmap.html)

~~~
hyc_symas
Note that LMDB already uses mmap and has the same advantages you list (no
ser/deser, OS controls the cache).

~~~
danieldk
But then the embedding matrix will not be stored contiguously. Contiguity has
benefits, especially when frequent words are stored together (which happens in
some implementations, because they sort the vocabulary by frequency), due to
the page granularity of the mapping, etc. Also, in this particular
implementation:

 _By default, LMDB Embeddings uses pickle to serialize the vectors to bytes
(optimized and pickled with the highest available protocol). However, it is
very easy to use an alternative approach such as msgpack_

So, they use serialization/deserialization.
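
For contrast, a minimal sketch of avoiding that overhead with raw LMDB (keys
and paths illustrative, not this library's actual layout): store the raw
float32 bytes and rebuild arrays with np.frombuffer, which gives a view rather
than a pickle/msgpack pass.

    import lmdb
    import numpy as np

    env = lmdb.open("vectors.lmdb", map_size=2 ** 30)

    vec = np.random.rand(300).astype("<f4")
    with env.begin(write=True) as txn:
        txn.put(b"cat", vec.tobytes())

    with env.begin() as txn:
        # txn.get() returns bytes; frombuffer wraps them without copying
        restored = np.frombuffer(txn.get(b"cat"), dtype="<f4")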

------
visarga
You can also speed up the loading of embeddings by using BPE (byte pair
encoding) to segment words into a smaller dictionary of char ngrams, and
learning ngram embeddings instead of words.

You can replace a vocabulary of 500K words with 50K n-grams, and it also works
on unseen words and on compounding languages such as German. It's interesting
that it can both join frequent words together and split infrequent words into
pieces, depending on the distribution of characters. Another advantage is that
the n-gram embedding table is much smaller, making it easy to deploy on
resource-constrained systems such as mobile phones.

Neural Machine Translation of Rare Words with Subword Units

[https://arxiv.org/abs/1508.07909a](https://arxiv.org/abs/1508.07909a)

A library (with Python bindings) for BPE n-grams: sentencepiece

[https://github.com/google/sentencepiece](https://github.com/google/sentencepiece)
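
A rough sketch (corpus path and sizes illustrative): train a 50K-piece BPE
model, then segment any word, seen or unseen, into subword pieces.

    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        "--input=corpus.txt --model_prefix=bpe --vocab_size=50000 --model_type=bpe"
    )

    sp = spm.SentencePieceProcessor()
    sp.load("bpe.model")
    print(sp.encode_as_pieces("Donaudampfschiff"))  # e.g. ['▁Donau', 'dampf', 'schiff']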

~~~
stochastic_monk
Fixed arxiv link:
[https://arxiv.org/abs/1508.07909](https://arxiv.org/abs/1508.07909)

------
spennihana
I almost went with a mem-mapped approach for a version I wrote in Java that
uses fork-join:
[https://github.com/spennihana/FasterWordEmbeddings](https://github.com/spennihana/FasterWordEmbeddings)

Edit: this work uses a different byte layout plus a parallel reader that loads
the word vecs into memory as compressed byte arrays. Load time is seconds
(haven't benchmarked with current SSDs). The memory footprint is on the order
of the size of your word vecs (memory is cheap for me, but it could easily be
extended to support mem-mapping if memory is scarce).

------
infocollector
Could you please add a license to the source code? Thanks.

~~~
domhudson
Hi, thanks for your interest. I have added a GPLv3 license now.

Thanks

------
atrudeau
Very nice work! :) Are there any benchmarks available? I'm curious how this
compares to caching frequent word vectors (Zipf's law helps here) and disk-
seeking the rest.

~~~
patelajay285
See my top-level reply about a similar library called Magnitude that we have
built and open-sourced:
[https://github.com/plasticityai/magnitude](https://github.com/plasticityai/magnitude).

It uses an in-memory LRU cache (the cache size is a configurable argument on
the constructor) in front of SQLite's indexed disk seeks. And you're right:
due to the Zipfian properties of most text, you can see gains even with a
small in-memory cache size :).
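
The idea in miniature (not Magnitude's actual internals; the schema is
illustrative): an in-memory LRU cache in front of indexed SQLite disk seeks,
so Zipf-distributed queries mostly never touch the disk.

    import sqlite3
    from functools import lru_cache
    import numpy as np

    # assumed schema: vectors(word TEXT PRIMARY KEY, vec BLOB)
    conn = sqlite3.connect("vectors.sqlite")

    @lru_cache(maxsize=10_000)
    def query(word):
        row = conn.execute(
            "SELECT vec FROM vectors WHERE word = ?", (word,)
        ).fetchone()
        return np.frombuffer(row[0], dtype="<f4") if row else None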

------
guybedo
So far I've been using RocksDB for this use case. Is there any benchmark that
compares the two databases?

~~~
danielmorozoff
We use RocksDB in production and research, and have used LMDB for research
purposes. Here's a benchmark:

[https://www.influxdata.com/blog/benchmarking-leveldb-vs-rocksdb-vs-hyperleveldb-vs-lmdb-performance-for-influxdb/](https://www.influxdata.com/blog/benchmarking-leveldb-vs-rocksdb-vs-hyperleveldb-vs-lmdb-performance-for-influxdb/)

I think RocksDB gives better combined read + write performance, whereas LMDB
is heavily skewed toward read performance in our experience, which is mirrored
by the benchmark.

~~~
hyc_symas
It's not as clear-cut as that. RocksDB gives better write performance up to
record sizes of about 1/4 page size. Above that size, LMDB write performance
is superior.

[http://www.lmdb.tech/bench/ondisk/](http://www.lmdb.tech/bench/ondisk/)

Also, the InfluxDB blog post is missing quite a lot of the discussion that
originally occurred here:
[https://disqus.com/home/discussion/influxdb/benchmarking_leveldb_vs_rocksdb_vs_hyperleveldb_vs_lmdb_performance_for_influxdb/](https://disqus.com/home/discussion/influxdb/benchmarking_leveldb_vs_rocksdb_vs_hyperleveldb_vs_lmdb_performance_for_influxdb/)

~~~
danielmorozoff
Awesome, did not know this. Thanks!

