- Encode 1GB in 20sec
- Provide BPE/Byte-Level-BPE/WordPiece/SentencePiece...
- Compute exhaustive set of outputs (offset mappings, attention masks, special token masks...)
- Written in Rust with bindings for Python and node.js
Github repository and doc: https://github.com/huggingface/tokenizers/tree/master/tokeni...
- Rust: https://crates.io/crates/tokenizers
- Python: pip install tokenizers
- Node: npm install tokenizers
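For a sense of what the Python API looks like, here is a rough quick-start sketch (the corpus file name "my_corpus.txt" is a placeholder, and the exact API may differ slightly between versions of the library):

  from tokenizers import Tokenizer
  from tokenizers.models import BPE
  from tokenizers.pre_tokenizers import Whitespace
  from tokenizers.trainers import BpeTrainer

  # Build and train a byte-pair-encoding tokenizer on a local text file.
  tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
  tokenizer.pre_tokenizer = Whitespace()
  trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
  tokenizer.train(["my_corpus.txt"], trainer=trainer)

  # Encoding returns ids, tokens, offsets, attention masks, etc.
  encoding = tokenizer.encode("Hello, y'all! How are you?")
  print(encoding.ids, encoding.tokens, encoding.offsets)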
I maintained my own NLP libraries for about 20 years: the simple ones were examples in my books, and the more complex (and less understandable) ones I sold as products and used to pull in lots of consulting work.
I have completely given up developing my own NLP tools, and these days I generally use spaCy, huggingface, TensorFlow, and Keras through their Python APIs (via the Hy language (hylang), a Lisp that sits on top of Python). I am retired now, but my personal research is in hybrid symbolic and deep learning AI.
They seem to have found the ideal balance of software engineering capability and neural network knowledge, in a team of highly effective and efficient employees.
Idk what their monetization plan is as a startup, but they are 100% undervalued at $20 million, and that is just based on the quality of the team. Now, if only I can figure out how to put a few thousand $ in a series-A startup as just some guy.
> put a few thousand $ in a series-A
Not a good idea.
I can't think of many small teams that can be acquired and can build a company's ML infrastructure as fast as this team.
If they have the money for it, OCI and Azure may also be keeping an eye out for them.
SentencePiece should make it possible to shrink the memory requirements of your indexes for search and typeahead use cases.
Is this possible using HuggingFace (or another word embedding based library)?
I know that there are some simple heuristics like merging noun token sequences together to extract ngrams but they are too simplistic and very error prone.
Indeed. Do you have an example of a library or snippet that demonstrates this?
My limited understanding of BERT (and other) word embeddings was that they only contain the word's position in the (I believe) 768-dimensional space, but don't contain queryable temporal information, no?
I like ngrams as a sort of untagged / unlabelled entity.
One of the simpler ways to try that out in your code seems to be running BERT-as-a-service https://github.com/hanxiao/bert-as-service , or alternatively the huggingface libraries that are discussed in the original article.
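If you go the bert-as-service route, the client side looks roughly like this (a sketch based on that project's README; the server must be started separately with bert-serving-start and a downloaded BERT checkpoint):

  from bert_serving.client import BertClient

  # Assumes a bert-serving-start process is already running locally.
  bc = BertClient()
  vectors = bc.encode(["the mouse ran up the clock", "a wireless mouse and keyboard"])
  print(vectors.shape)  # one fixed-size vector per input sentence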
It's kind of the other way around compared to word2vec-style systems. Before, you used to have a 'thin' embedding layer that's essentially just a lookup table, followed by a bunch of complex neural network layers (e.g. multiple Bi-LSTMs followed by a CRF); in the 'current style' you have 'thick embeddings', which means running through all the many transformer layers of a pretrained BERT-like system, followed by a thin custom layer that's often just glorified linear regression.
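A rough sketch of that contrast, assuming PyTorch and the transformers library (the model name, layer sizes, and tag count are illustrative, not anything from this thread):

  import torch.nn as nn
  from transformers import AutoModel

  # 'Old style': a thin embedding layer (just a lookup table) followed by a
  # comparatively heavy task model, e.g. a BiLSTM tagger (often with a CRF on top).
  class OldStyleTagger(nn.Module):
      def __init__(self, vocab_size=30000, emb_dim=100, hidden=256, num_tags=9):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, emb_dim)        # thin lookup table
          self.encoder = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
          self.head = nn.Linear(2 * hidden, num_tags)

      def forward(self, token_ids):
          states, _ = self.encoder(self.embed(token_ids))
          return self.head(states)

  # 'Current style': thick contextual embeddings from a pretrained transformer,
  # followed by a thin task head that is little more than linear regression.
  class TransformerTagger(nn.Module):
      def __init__(self, model_name="bert-base-uncased", num_tags=9):
          super().__init__()
          self.encoder = AutoModel.from_pretrained(model_name)  # all the transformer layers
          self.head = nn.Linear(self.encoder.config.hidden_size, num_tags)

      def forward(self, **inputs):
          hidden = self.encoder(**inputs).last_hidden_state     # one vector per input token
          return self.head(hidden)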
Would you say they are still usually called "embeddings" when using this new style? This sounds more like just a pretrained network which includes both some embedding scheme and a lot of learning on top of it, but maybe the word "embedding" stuck anyway?
However, the analogy is still somewhat meaningful, because if you want to look at the properties of a particular word or token, it's not just a general pretrained network: it still preserves the one-to-one mapping between each input token and the output vector corresponding to that token, which is very important for all kinds of sequence labeling or span/boundary detection tasks. So you can use them just like word2vec embeddings. For example, word similarity or word difference metrics with these 'transformer-stack embeddings' would work just as well as with word2vec (though you'd have to get to a word-level measurement instead of wordpiece or BPE subword tokens), with the added bonus of having done contextual disambiguation; you probably could build a decent word sense disambiguation system just by directly clustering these embeddings. The mouse-as-animal and mouse-as-computer-peripheral occurrences should have clearly different embeddings.
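A rough sketch of that idea with the transformers library (the model name, sentences, and helper function are my own illustration): pull the contextual vector for "mouse" out of two sentences and compare them the way you would compare word2vec vectors.

  import torch
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModel.from_pretrained("bert-base-uncased")

  def word_vector(sentence, word):
      # Average the contextual vectors of the wordpieces that make up `word`.
      enc = tokenizer(sentence, return_tensors="pt")
      with torch.no_grad():
          hidden = model(**enc).last_hidden_state[0]            # (num_tokens, 768)
      tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
      pieces = tokenizer.tokenize(word)
      start = tokens.index(pieces[0])
      return hidden[start:start + len(pieces)].mean(dim=0)

  animal = word_vector("The mouse hid under the floorboards.", "mouse")
  device = word_vector("I plugged a wireless mouse into my laptop.", "mouse")

  # The two senses of "mouse" should end up with clearly different vectors.
  print(torch.nn.functional.cosine_similarity(animal, device, dim=0))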
All NLP neural nets (based on LSTM or Transformer) do this. It's their main function - to create contextual representations of the input tokens.
The word's 'position' in the 768-dimensional space is its embedding, and it can be compared with other words by dot product. There are libraries that can do dot product ranking fast (such as annoy).
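For instance, a tiny sketch with annoy (the random vectors are placeholders; in practice you would index the 768-dimensional vectors discussed above):

  import random
  from annoy import AnnoyIndex

  dim = 768
  index = AnnoyIndex(dim, "dot")   # annoy also supports "angular" for cosine-style ranking
  for i in range(1000):
      index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
  index.build(10)                  # 10 trees; more trees = better recall, more memory

  query = [random.gauss(0, 1) for _ in range(dim)]
  print(index.get_nns_by_vector(query, 5))   # ids of the 5 nearest stored vectors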
Open-SESAME (2017): https://arxiv.org/abs/1706.09528 / https://github.com/swabhs/open-sesame
VAMPIRE (2019): https://arxiv.org/abs/1906.02242 / https://github.com/allenai/vampire
On a related note, as a newcomer just trying to get things done (i.e. applied NLP), I find the whole ecosystem great but frustrating: so many frameworks and libraries, but no clear way to compose them together. Any resources out there that help make sense of things?
I haven’t yet tried TFIDF though so I’ll see what that will do.
What problems can you solve with NLP? Sentiment analysis? Semantic analysis? Translation?
What cool problems are there?
It's mostly understanding text and generating text. You can do named entity extraction, question answering, summarisation, dialogue bots, information extraction from semi-structured documents such as tables and invoices, spelling correction, typing auto-suggestions, document classification and clustering, topic discovery, part-of-speech tagging, syntactic parsing, language modelling, image description and image question answering, entailment detection (whether two statements support one another), coreference resolution, entity linking, intent detection and slot filling, building large knowledge bases (databases of subject-relation-object triples), spam detection, toxic message detection, ranking search results in search engines, and many many more.
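As a hedged illustration (not from the comment above), several of these tasks are available as one-liners through the transformers pipeline API:

  from transformers import pipeline

  sentiment = pipeline("sentiment-analysis")
  print(sentiment("The new tokenizers library is blazing fast."))

  ner = pipeline("ner")
  print(ner("Hugging Face is based in New York City."))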
The sentence "Hello, y'all! How are you ?" is tokenized into words. Those words are then encoded into integers representative of the words' identity in the model's dictionary.
>>> output = tokenizer.encode("Hello, y'all! How are you ?")
Encoding(num_tokens=13, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])
>>> print(output.ids, output.tokens, output.offsets)
[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 100, 1029, 102]
['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
[(0, 0), (0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13), (14, 17), (18, 21), (22, 25), (26, 27), (28, 29), (0, 0)]
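The offsets make it easy to map each token back to the exact span of the input string, for instance by slicing the same sentence directly:
>>> sentence = "Hello, y'all! How are you 😁 ?"
>>> [sentence[start:end] for start, end in output.offsets]
['', 'Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '😁', '?', '']
The empty strings correspond to the special [CLS] and [SEP] tokens, whose (0, 0) offsets don't point at any input text.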