
Show HN: HuggingFace – Fast tokenization library for deep-learning NLP pipelines - julien_c
https://github.com/huggingface/tokenizers
======
julien_c
TL;DR: Hugging Face, the NLP research company known for its transformers
library (DISCLAIMER: I work at Hugging Face), has just released a new open-
source library for ultra-fast & versatile tokenization for NLP neural net
models (i.e. converting strings into model input tensors).

Main features:

- Encode 1GB in 20 seconds
- Provide BPE / Byte-Level BPE / WordPiece / SentencePiece...
- Compute an exhaustive set of outputs (offset mappings, attention masks, special token masks...)
- Written in Rust with bindings for Python and Node.js

GitHub repository and docs:
[https://github.com/huggingface/tokenizers/tree/master/tokeni...](https://github.com/huggingface/tokenizers/tree/master/tokenizers)

To install:

- Rust: [https://crates.io/crates/tokenizers](https://crates.io/crates/tokenizers)
- Python: pip install tokenizers
- Node: npm install tokenizers
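
For a feel of the Python API, here is a minimal sketch of training and using
a BPE tokenizer (the corpus file name is a placeholder, and exact API details
may differ between versions):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer
    
    # Build an empty BPE tokenizer and train it on a plain-text corpus.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]"])
    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder file
    
    output = tokenizer.encode("Hello, y'all! How are you?")
    print(output.tokens)   # sub-word tokens
    print(output.ids)      # their integer ids
    print(output.offsets)  # character offsets back into the input string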

------
mark_l_watson
I love the work done and made freely available by both spaCy and HuggingFace.

I had my own NLP libraries for about 20 years: the simple ones were examples
in my books, and the more complex (and less understandable) ones I sold as
products and used to pull in lots of consulting work.

I have completely given up developing my own NLP tools; these days I
generally use the Python bindings (via the Hy language (hylang), a Lisp that
sits on top of Python) for spaCy, Hugging Face, TensorFlow, and Keras. I am
retired now, but my personal research is in hybrid symbolic and deep learning
AI.

~~~
dunefox
Hybrid symbolic and NN approaches will be my next area of hobby research; I'm
currently getting my master's degree in NLP. Do you have a few good resources
to get started with or read about?

------
screye
I can't believe the level of productivity this Hugging Face team has.

They seem to have found the ideal balance of software engineering capability
and neural network knowledge, in a team of highly effective and efficient
employees.

Idk what their monetization plan is as a startup, but it is 100% undervalued
at $20 million, and that's just based on the quality of the team. Now, if
only I could figure out how to put a few thousand $ in a series-A startup as
just some guy.

~~~
manojlds
> Idk what their monetization plan is as a startup

> put a few thousand $ in a series-A

Not a good idea.

~~~
screye
I see them as an acqui-hire target. Especially for Facebook, since they are
geographically so close to the FAIR labs in NY; or Google, where they could
be integrated into Google AI the way DeepMind was (esp. since Google uses a
ton of Transformers anyway).

I can't think of many small teams that could be acquired and build out a
company's ML infrastructure as fast as this one.

If they have the money for it, OCI and Azure may also be keeping an eye out
for them.

------
ZeroCool2u
We use both spaCy and HuggingFace at work. Is there a comparison of this vs.
spaCy's tokenizer [1]?

1. [https://spacy.io/usage/linguistic-features#tokenization](https://spacy.io/usage/linguistic-features#tokenization)

------
LunaSea
It used to be that pre-deep-learning tokenizers would extract n-grams
(n-token-sized chunks), but this doesn't seem to exist anymore in the word
embedding tokenizers I've come across.

Is this possible using HuggingFace (or another word-embedding-based library)?

I know that there are some simple heuristics, like merging noun token
sequences together to extract n-grams, but they are too simplistic and very
error-prone.

~~~
brockf
Most implementations are actually moving in the opposite direction.
Previously, there was a tendency to aggregate words into phrases to better
capture the "context" of a word. Now, most approaches split words into
sub-word parts or even characters. With networks that capture temporal
relationships across tokens (as opposed to older, "bag of words" models),
multi-word patterns can effectively be captured by attending to the temporal
order of sub-word parts.
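
As an illustrative sketch of that splitting, here is how a pretrained BERT
WordPiece tokenizer (the standard "bert-base-uncased" checkpoint, loaded via
the transformers library) breaks a single word into sub-word parts:

    from transformers import BertTokenizer
    
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # One word becomes several ordered sub-word pieces; downstream layers
    # attend over the order of these pieces.
    print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']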

~~~
LunaSea
> multi-word patterns can effectively be captured by attending to the temporal
> order of sub-word parts

Indeed. Do you have an example of a library or snippet that demonstrates this?

My limited understanding of BERT (and other) word embeddings was that they
only contain the word's position in the 768 (I believe) dimensional space,
but don't contain queryable temporal information, no?

I like n-grams as a sort of untagged / unlabelled entity.

~~~
PeterisP
When using BERT (and the many things like it, such as the earlier ELMo and
ULMFiT, and the later RoBERTa/ERNIE/ALBERT/etc.) as the 'embeddings', you
provide as input all the tokens in a sequence. You don't get an "embedding
for word foobar in position 123"; you get an embedding for the whole sequence
at once, so whatever corresponds to that token is a 768-dimensional
"embedding for word foobar in position 123 conditional on _all the particular
other words that were before and after it_". Including very long-distance
relations.

One of the simpler ways to try that out in your code seems to be running
bert-as-service
[https://github.com/hanxiao/bert-as-service](https://github.com/hanxiao/bert-as-service),
or alternatively the huggingface libraries that are discussed in the original
article.

It's kind of the other way around compared to word2vec-style systems. Before,
you used to have a 'thin' embedding layer that's essentially just a lookup
table, followed by a bunch of complex layers of neural networks (e.g.
multiple Bi-LSTMs followed by a CRF); in the 'current style' you have "thick
embeddings", which means running through all the many transformer layers of a
pretrained BERT-like system, followed by a thin custom layer that's often
just glorified linear regression.
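
As a minimal sketch of this "thick embeddings" usage with the huggingface
transformers library (recent-version API; the model name is the standard
public checkpoint):

    import torch
    from transformers import AutoModel, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    
    inputs = tokenizer("The quick brown fox", return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    # Shape (1, num_tokens, 768): one contextual vector per input token,
    # conditioned on every other token in the sequence.
    print(hidden.shape)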

~~~
Erlich_Bachman
> in the 'current style' you have "thick embeddings", which means running
> through all the many transformer layers of a pretrained BERT-like system,
> followed by a thin custom layer that's often just glorified linear
> regression.

Would you say they are still usually called "embeddings" when using this new
style? This sounds more like just a pretrained network which includes both
some embedding scheme and a lot of learning on top of it, but maybe the word
"embedding" stuck anyway?

~~~
PeterisP
They _do_ seem to still be called "embeddings", although yes, that's become a
somewhat misleading misnomer in some sense.

However, the analogy is still somewhat meaningful: if you want to look at the
properties of a particular word or token, it's not just a _general_
pretrained network; it still preserves the one-to-one mapping between each
input token and the output vector corresponding to it, which is very
important for all kinds of sequence labeling or span/boundary detection
tasks. So you can use them just like word2vec embeddings. For example, word
similarity or word difference metrics with 'transformer-stack-embeddings'
would work just as well as word2vec (though you'd have to get to a word-level
measurement instead of wordpiece or BPE subword tokens), with the added bonus
of having done contextual disambiguation. You probably could build a decent
word sense disambiguation system just by directly clustering these
embeddings; mouse-as-animal and mouse-as-computer-peripheral should have
clearly different embeddings.
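
A hedged sketch of that clustering idea (the sentences are illustrative, and
the lookup assumes the word survives as a single wordpiece): compare the
contextual vectors of "mouse" in two senses and they should land far apart.

    import torch
    from transformers import AutoModel, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    
    def vector_for(sentence, word):
        # Contextual vector of `word` within `sentence` (assumes `word`
        # maps to a single token in the vocabulary).
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]
        idx = enc.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids(word))
        return hidden[idx]
    
    animal = vector_for("the mouse hid from the cat", "mouse")
    gadget = vector_for("plug the mouse into the usb port", "mouse")
    # The two senses should score noticeably lower than two uses of the
    # same sense would.
    print(torch.cosine_similarity(animal, gadget, dim=0))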

------
useful
Somewhat related: if someone wants to build something awesome, I haven't seen
anything that merges Lucene with BPE/SentencePiece.

SentencePiece ought to make it possible to shrink the memory requirements of
your indexes for search and typeahead stuff.

------
hnaccy
Great! Just did a quick test and got a 6-7x speedup on tokenization.

~~~
clmnt
Mind sharing what tests you ran & with which setup? Thanks!

------
orestis
Are there examples of how this can be used for topic modeling, document
similarity, etc.? All the examples I've seen (gensim) use bag-of-words, which
seems outdated.

~~~
ogrisel
Big transformer neural networks are probably overkill for topic modeling.
More traditional methods implemented in Gensim or scikit-learn, such as
TF-IDF vectors followed by SVD (a.k.a. LSI), or LDA, or NMF, are probably
just fine for extracting topics (soft clustering).
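
For reference, a minimal sketch of that traditional pipeline with
scikit-learn (the corpus and parameter choices are purely illustrative):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    docs = [
        "the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today",
        "investors worry about the market",
    ]
    # TF-IDF vectors followed by truncated SVD (a.k.a. LSI/LSA).
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    topics = TruncatedSVD(n_components=2).fit_transform(tfidf)
    print(topics.shape)  # (4, 2): each document as a mix of 2 latent topics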

~~~
ogrisel
The reason is that you do not need to finely understand the structure of
individual sentences to group documents by similar topics. Word order does
not matter much for this task. Hence the success of methods that use bag of
words (e.g. TF-IDF) as their input representation.

~~~
orestis
It might be that the corpus I was trying to cluster needs better
preprocessing, or perhaps better n-grams. Using bigrams only, I saw a lot of
common words that were meaningless, but adding them as stop words made the
results worse. Hence my wondering if some other vectorization would produce
better results.

On a related note, as a newcomer just trying to get things done (i.e. applied
NLP), I find the whole ecosystem great but frustrating: so many frameworks
and libraries, but no clear ways to compose them together. Are there any
resources out there that help make sense of things?

~~~
nestorD
If I understand your problem correctly, you can use TF-IDF to reduce the
weight of meaningless words.

~~~
orestis
It's not meaningless words; it's common English words that are overloaded,
and I think considering their position in sentences instead would give better
results.

I haven't yet tried TF-IDF, though, so I'll see what that does.

------
echelon
I'm very familiar with the TTS, VC, and other "audio-shaped" spaces, but I've
never delved into NLP.

What problems can you solve with NLP? Sentiment analysis? Semantic analysis?
Translation?

What cool problems are there?

~~~
mraison
I believe many folks are particularly attracted to NLP because the Turing test
[1] is an NLP problem.

[1]
[https://en.m.wikipedia.org/wiki/Turing_test](https://en.m.wikipedia.org/wiki/Turing_test)

~~~
brokensegue
Disagree. I think its value is mostly unrelated to that.

------
m0zg
Question for HuggingFace folks. Your repos do not contain any tests. Why is
that? How do you ensure your stuff actually works after you make a change?

------
virtuous_signal
I didn't realize that particular emoji had a name. I thought it was a play on
this:
[https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc...](https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franchise)#Facehugger)

~~~
rjmorris
All emoji have a name. I've found Emojipedia to be a good source of info
about emoji. [https://emojipedia.org/hugging-face/](https://emojipedia.org/hugging-face/)

------
manojlds
Title is off? Should mention Tokenizers as the project.

------
rsp1984
What does tokenization (of strings, I guess) do?

~~~
wyldfire
The README [1] shows a great example:

The sentence "Hello, y'all! How are you ?" is tokenized into words. Those
words are then encoded into integers representative of the words' identity in
the model's dictionary.

    >>> output = tokenizer.encode("Hello, y'all! How are you  ?")
    Encoding(num_tokens=13, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])
    >>> print(output.ids, output.tokens, output.offsets)
    [101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 100, 1029, 102]
    ['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
    [(0, 0), (0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13), (14, 17), (18, 21), (22, 25), (26, 27), (28, 29), (0, 0)]

But there's also good detail in the source [2] which says, "A Tokenizer works
as a pipeline, it processes some raw text as input and outputs an Encoding.
The various steps of the pipeline are: ...."

[1] [https://github.com/huggingface/tokenizers#quick-examples-usi...](https://github.com/huggingface/tokenizers#quick-examples-using-python)

[2] [https://github.com/huggingface/tokenizers/tree/master/tokeni...](https://github.com/huggingface/tokenizers/tree/master/tokenizers#what-is-a-tokenizer)

------
tarr11
Why is this company called HuggingFace?

~~~
itronitron
I assume it is a reference to the movie Alien.

