Hacker News
Show HN: HuggingFace – Fast tokenization library for deep-learning NLP pipelines (github.com)
168 points by julien_c 8 months ago | 42 comments

TL;DR: Hugging Face, the NLP research company known for its transformers library (DISCLAIMER: I work at Hugging Face), has just released a new open-source library for ultra-fast and versatile tokenization for NLP neural net models (i.e. converting strings into model input tensors).

Main features:
- Encode 1GB in 20sec
- Provide BPE/Byte-Level-BPE/WordPiece/SentencePiece...
- Compute exhaustive set of outputs (offset mappings, attention masks, special token masks...)
- Written in Rust with bindings for Python and node.js

Github repository and doc: https://github.com/huggingface/tokenizers/tree/master/tokeni...

To install:
- Rust: https://crates.io/crates/tokenizers
- Python: pip install tokenizers
- Node: npm install tokenizers

I love the work done and made freely available by both spaCy and HuggingFace.

I had my own NLP libraries for about 20 years. The simple ones were examples in my books; the more complex and less understandable ones I sold as products, and they pulled in lots of consulting work.

I have completely given up my own work developing NLP tools, and generally I use the Python bindings (via the Hy language (hylang) which is a Lisp that sits on top of Python) for spaCy, huggingface, TensorFlow, and Keras. I am retired now but my personal research is in hybrid symbolic and deep learning AI.

Hybrid symbolic and NN will be my next area of hobby research; I'm currently getting my master's degree in NLP. Do you have a few good resources to get started or read about?

Very interested in working on symbolic and deep learning projects for NLP as well.

I can't believe the level of productivity this Hugging Face team has.

They seem to have found the ideal balance of software engineering capability and neural network knowledge, in a team of highly effective and efficient employees.

Idk what their monetization plan is as a startup, but it is 100% undervalued at 20 million, and that is just the quality of that team. Now, if only I can figure out how to put a few thousand $ in a series-A startup as just some guy.

> Idk what their monetization plan is as a startup

> put a few thousand $ in a series-A

Not a good idea.

I see them as an acqui-hire target. Especially from Facebook, since they are so geographically close to the FAIR labs in NY, or from Google, where they could get integrated into Google AI like DeepMind did (esp. since Google uses a ton of Transformers anyway).

I can't think of many small teams that can be acquired and can build a company's ML infrastructure as fast as this team.

If they have the money for it, OCI and Azure may also be keeping a look out for them.

We use both SpaCy and HuggingFace at work. Is there a comparison of this vs SpaCy's tokenizer[1]?

1. https://spacy.io/usage/linguistic-features#tokenization

It used to be that pre-deep-learning tokenizers would extract n-grams (chunks of n consecutive tokens), but this doesn't seem to exist anymore in the word embedding tokenizers I've come across.

Is this possible using HuggingFace (or another word embedding based library)?

I know that there are some simple heuristics like merging noun token sequences together to extract ngrams but they are too simplistic and very error prone.
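For reference, the classic n-gram extraction mentioned above can be sketched in a few lines. This is a minimal illustration on whitespace-split tokens, not something taken from the HuggingFace library; the function name `ngrams` and the sample sentence are made up for the example.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order of appearance."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical input: naive whitespace tokenization of a toy sentence.
tokens = "new york is a big city".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

This extracts every n-gram rather than only the meaningful ones, so picking out phrases like "new york" would still need a scoring step on top.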

Most implementations are actually moving in the opposite direction. Previously, there was a tendency to look to aggregate words into phrases to better capture the "context" of a word. Now, most approaches are splitting words into sub-word parts or even characters. With networks that capture temporal relationships across tokens (as opposed to older, "bag of words" models), multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts.
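To make the sub-word splitting concrete, here is a rough sketch of greedy longest-match-first splitting, the general scheme WordPiece-style tokenizers use at encoding time. The toy vocabulary and the function name `wordpiece` are invented for illustration, and the `##` continuation prefix follows the BERT convention; this is not the implementation in the tokenizers library.

```python
from typing import List, Optional, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # marks a piece that continues a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than a real one.
toy_vocab = {"token", "##ization", "##izer", "un", "##related"}
split_a = wordpiece("tokenization", toy_vocab)
split_b = wordpiece("unrelated", toy_vocab)
split_c = wordpiece("xyz", toy_vocab)
```

A model that sees `token` followed by `##ization` can then learn the relationship between the two pieces from their order, which is the point made above.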

> multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts

Indeed. Do you have an example of a library or snippet that demonstrates this?

My limited understanding of BERT (and other) word embeddings was that they only contain the word's position in the 768 (I believe) dimensional space but don't contain queryable temporal information, no?

I like ngrams as a sort of untagged / unlabelled entity.

When using BERT (and all the many things like it, such as the earlier ELMo and ULMFiT and the later RoBERTa/ERNIE/ALBERT/etc.) as the 'embeddings', you provide as input all the tokens in a sequence. You don't get an "embedding for word foobar in position 123"; you get an embedding for the whole sequence at once, so whatever corresponds to that token is a 768-dimensional "embedding for word foobar in position 123 conditional on all the particular other words that were before and after it", including very long-distance relations.

One of the simpler ways to try that out in your code seems to be running BERT-as-a-service https://github.com/hanxiao/bert-as-service , or alternatively the huggingface libraries that are discussed in the original article.

It's kind of the other way around compared to word2vec-style systems; before that you used to have a 'thin' embedding layer that's essentially just a lookup table followed by a bunch of complex layers of neural networks (e.g. multiple Bi-LSTMs followed by CRF); in the 'current style' you have "thick embeddings" which is running through all the many transformer layers in a pretrained BERT-like system, followed by a thin custom layer that's often just glorified linear regression.

> in the 'current style' you have "thick embeddings" which is running through all the many transformer layers in a pretrained BERT-like system, followed by a thin custom layer that's often just glorified linear regression.

Would you say they are still usually called "embeddings" when using this new style? This sounds more like just a pretrained network which includes both some embedding scheme and a lot of learning on top of it, but maybe the word "embedding" stuck anyway?

They do seem to be still called "embeddings" although yes, that's become somewhat of a misnomer.

However, the analogy is still somewhat meaningful, because if you want to look at the properties of a particular word or token, it's not just a general pretrained network: it still preserves the one-to-one mapping between the input token and the output vector corresponding to each particular token, which is very important for all kinds of sequence labeling or span/boundary detection tasks. So you can use them just like word2vec embeddings. For example, if you did word similarity or word difference metrics with 'transformer-stack-embeddings', that would work just as well as word2vec (though you'd have to get to a word-level measurement instead of wordpiece or BPE subword tokens), with the added bonus of contextual disambiguation. You could probably build a decent word sense disambiguation system just by directly clustering these embeddings; mouse-as-animal and mouse-as-computer-peripheral should have clearly different embeddings.
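One way to get from subword vectors to a word-level measurement is to average the vectors of the pieces that make up each word. The sketch below assumes BERT-style `##` continuation markers and uses mean pooling, which is only one common choice; the function name `pool_wordpieces` and the tiny 2-dimensional vectors are made up for the example.

```python
from typing import List, Tuple


def pool_wordpieces(
    tokens: List[str], vectors: List[List[float]]
) -> List[Tuple[str, List[float]]]:
    """Merge '##' continuation pieces into words and mean-pool their vectors."""
    words: List[str] = []
    groups: List[List[List[float]]] = []
    for tok, vec in zip(tokens, vectors):
        if tok.startswith("##") and words:
            # Continuation piece: glue it onto the previous word.
            words[-1] += tok[2:]
            groups[-1].append(vec)
        else:
            words.append(tok)
            groups.append([vec])
    # Average each group of piece vectors dimension by dimension.
    pooled = [[sum(col) / len(g) for col in zip(*g)] for g in groups]
    return list(zip(words, pooled))


# Hypothetical contextual vectors (real ones would have hundreds of dimensions).
tokens = ["token", "##ization", "rocks"]
vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 1.0]]
word_vectors = pool_wordpieces(tokens, vectors)
```

Taking only the first piece's vector is another option; which works better likely depends on the task.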

> Do you have an example of a library or snippet that demonstrates this?

All NLP neural nets (based on LSTM or Transformer) do this. It's their main function - to create contextual representations of the input tokens.

The word's 'position' in the 768-dimensional space is an embedding, and it can be compared with other words by dot product. There are libraries that can do dot product ranking fast (such as annoy).
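As a rough illustration of ranking by dot product, the sketch below scores candidates by cosine similarity, i.e. the dot product of length-normalized vectors, using brute force over a tiny made-up set of 3-dimensional vectors. Libraries such as annoy do this approximately and much faster on large collections; none of their API is used here, and all names and numbers are hypothetical.

```python
import math
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Dot product of the two vectors after scaling each to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank(query: List[float], candidates: Dict[str, List[float]]) -> List[str]:
    """Return candidate names sorted from most to least similar to the query."""
    return sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)


# Toy embeddings; real ones come from a trained model.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
others = {k: v for k, v in embeddings.items() if k != "cat"}
ranked = rank(embeddings["cat"], others)
```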

Somewhat related: if someone wants to build something awesome, I haven't seen anything that merges Lucene with BPE/SentencePiece.

SentencePiece ought to make it possible to shrink the memory requirements of your indexes for search and typeahead stuff.

Great! Just did a quick test and got a 6-7x speedup on tokenization.

Mind sharing what tests you ran & with which setup? Thanks!

Are there examples on how this can be used for topic modeling, document similarity etc? All the examples I’ve seen (gensim) use bag-of-words which seems to be outdated.

They don't use huggingface, but some of the modern approaches for topic modeling use variational auto-encoders, see:

Open-SESAME (2017): https://arxiv.org/abs/1706.09528 / https://github.com/swabhs/open-sesame

VAMPIRE (2019): https://arxiv.org/abs/1906.02242 / https://github.com/allenai/vampire

Thanks! I hadn’t seen VAMPIRE! So stoked to see a new approach to topic modeling. SVD etc are very much a local max

Big transformer neural networks are probably overkill for topic modeling. More traditional methods implemented in Gensim or scikit-learn, such as TF-IDF vectors followed by SVD (aka LSI), LDA, or NMF, are probably just fine to extract topics (soft clustering).

The reason is that you do not need to finely understand the structure of individual sentences to group documents by similar topics. Word order does not matter much for this task. Hence the success of methods that use bag of words (e.g. TF-IDF) as their input representation.
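For anyone unfamiliar with the bag-of-words TF-IDF representation, here is a minimal sketch of the textbook, unsmoothed formula in plain Python. It is not how Gensim or scikit-learn implement it (scikit-learn, for instance, smooths the IDF term by default), and the function name `tfidf` and the three toy documents are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[str]) -> List[Dict[str, float]]:
    """Weight each term by (frequency in the doc) * log(N / docs containing it)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


docs = ["the cat sat", "the dog sat", "the stock market fell"]
weights = tfidf(docs)
```

Word order is discarded entirely, and a term that occurs in every document (here "the") gets weight zero, which is how TF-IDF plays down uninformative common words.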

It might be that the corpus I was trying to cluster needs better preprocessing, or perhaps better n-grams. Using bigrams only, I saw a lot of common words that were meaningless, but adding them as stop words made the results worse. Hence my wondering if some other vectorization would produce better results.

On a related note, as a newcomer just trying to get things done (i.e. applied NLP), I find the whole ecosystem great but frustrating: so many frameworks and libraries, but no clear way to compose them together. Any resources out there that help make sense of things?

If I understand your problem correctly, you can use TF-IDF to reduce the weight of meaningless words.

It’s not meaningless words - it’s common English words that are overloaded and I think considering their position in sentences instead would give better results.

I haven’t yet tried TFIDF though so I’ll see what that will do.

With the appropriate amount of data of course.

I'm very familiar with the TTS, VC, and other "audio-shaped" spaces, but I've never delved into NLP.

What problems can you solve with NLP? Sentiment analysis? Semantic analysis? Translation?

What cool problems are there?

> What problems can you solve with NLP?

It's mostly understanding text and generating text. You can do named entity extraction, question answering, summarisation, dialogue bots, information extraction from semi-structured documents such as tables and invoices, spelling correction, typing auto-suggestions, document classification and clustering, topic discovery, part of speech tagging, syntactic trees, language modelling, image description and image question answering, entailment detection (if two affirmations support one another), coreference resolution, entity linking, intent detection and slot filling, build large knowledge bases (databases of triplets subject-relation-object), spam detection, toxic message detection, ranking search results in search engines and many many more.

I believe many folks are particularly attracted to NLP because the Turing test [1] is an NLP problem.

[1] https://en.m.wikipedia.org/wiki/Turing_test

Disagree. I think its value is mostly unrelated to that.

All of the above. It's like asking what problems you can solve with math. HuggingFace's transformers are said to be a Swiss Army knife for NLP. I haven't worked with them yet, but the main fundamental utility seems to be generating fixed-length vector representations of words. Word2vec started this, but the vectors have gotten much better with stuff like BERT.

I thought transformers are mainly used for multi-word embeddings?!

There's a lot! Sentence detection, parts of speech (POS) detection to name a couple. These can be used to determine key concepts in documents that lack metadata. For example: you could cluster on common phrases to identify relationships in data.

Question for HuggingFace folks. Your repos do not contain any tests. Why is that? How do you ensure your stuff actually works after you make a change?

I didn't realize that particular emoji had a name. I thought it was a play on this: https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc...

All emoji have a name. I've found emojipedia to be a good source of info about emoji. https://emojipedia.org/hugging-face/

Title is off? Should mention Tokenizers as the project.

What does tokenization (of strings, I guess) do?

The README [1] shows a great example:

The sentence "Hello, y'all! How are you 😁 ?" is tokenized into words. Those words are then encoded into integers representative of the words' identity in the model's dictionary.

    >>> output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
    Encoding(num_tokens=13, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])
    >>> print(output.ids, output.tokens, output.offsets)
    [101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 100, 1029, 102]
    ['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
    [(0, 0), (0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13), (14, 17), (18, 21), (22, 25), (26, 27), (28, 29), (0, 0)]
But there's also good detail in the source [2] which says, "A Tokenizer works as a pipeline, it processes some raw text as input and outputs an Encoding. The various steps of the pipeline are: ...."

[1] https://github.com/huggingface/tokenizers#quick-examples-usi...

[2] https://github.com/huggingface/tokenizers/tree/master/tokeni...
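To show the idea without the library, here is a toy sketch of the same three outputs (tokens, ids, offsets) using a regular expression and a made-up vocabulary. The function name `encode`, the splitting rule and the id numbers are all assumptions for illustration; the real library adds special tokens such as [CLS] and [SEP], applies normalization and uses a trained vocabulary.

```python
import re
from typing import Dict, List, Tuple


def encode(
    text: str, vocab: Dict[str, int], unk: str = "[UNK]"
) -> Tuple[List[str], List[int], List[Tuple[int, int]]]:
    """Split on words and punctuation, then map each token to an integer id."""
    tokens: List[str] = []
    ids: List[int] = []
    offsets: List[Tuple[int, int]] = []
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tok = match.group().lower()
        if tok not in vocab:
            tok = unk  # anything outside the vocabulary maps to the unknown token
        tokens.append(tok)
        ids.append(vocab[tok])
        offsets.append(match.span())  # (start, end) character positions in text
    return tokens, ids, offsets


# Hypothetical vocabulary; the ids are arbitrary.
toy_vocab = {"[UNK]": 0, "hello": 1, ",": 2, "how": 3, "are": 4, "you": 5, "?": 6}
text = "Hello, how are you ?"
tokens, ids, offsets = encode(text, toy_vocab)
```

The offsets let you map any token back to the exact span of the original string, for example `text[7:10]` for the third token.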

Why is this company called HuggingFace?

I assume it is a reference to the movie Alien.
