In 2008, when I was on the Bing team, Sergei had just developed this library and was putting it to work on the document understanding pipeline. He was incredibly passionate about it and would evangelize it to anybody who cared to listen. When I saw the headline on HN, I immediately guessed that this was Sergei's work. And of course it is! Sergei, if you are reading this, try to guess who I am ;)
I've been impressed with several of the Bing projects; I recently found that they open-sourced the JIT compiler for C++ they use on Bing [1]. More broadly, I've been increasingly impressed with Microsoft's efforts in open source and elsewhere, especially with these releases under the MIT license.
Note that the top-level summary shows it is 10x faster than spaCy, which is a more impressive achievement than being faster than NLTK (it is 20x faster than NLTK).
NLTK is designed for learning, not for production systems, speed, or efficiency. Its code is often the most straightforward way to implement an algorithm, so that it is easy to read later.
I have been experimenting with tokenization in Rust in https://github.com/rth/vtext, mostly relying on Unicode segmentation with a few custom rules. The resulting tokenizer was also around 10x faster than spaCy at comparable precision (see the benchmarks section of the readme).
Anyone familiar enough with this library to explain how they've achieved such impressive performance? Otherwise, may dig into it for a blog post when I have time.
There are deterministic finite state machines underneath, implemented in lean C++. These automata allow string operations to be implemented optimally, or close to it. In the readme file there is a link describing how to recompile the linguistic resources; if you look inside the makefile used for resource compilation you will see the steps.
Do you use any kind of library or typical design pattern for your FSM implementation? When I've used them, I always fall back to very C-like code (an enum for the states, a state variable, and a big switch statement inside a loop).
We don't represent automata as code (that leads to compiler errors, etc.); we represent them as graphs, just as data.
Inside the code we use:
state --> int
set of states --> sorted / unique'd array of int's
input symbol --> int
output symbol --> int
transition function --> abstracted out behind an interface and implemented differently depending on whether the automaton is read-only or mutable, on the state, etc.
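Roughly, the idea looks like this (a toy sketch in Python; the names and layout are illustrative only, not Bling Fire's actual data structures):

```python
# Minimal sketch of a DFA kept purely as data, in the spirit described above.
# States and symbols are ints; transitions are just a dict (they could equally
# be a flat sorted array for cache-friendliness).  This is NOT Bling Fire's
# layout -- it only illustrates the "automaton as a graph of data" idea.

ACCEPTING = {2}                      # set of accepting states
TRANSITIONS = {                      # (state, input symbol) -> next state
    (0, 1): 1,                       # e.g. symbol 1 = "letter"
    (1, 1): 1,
    (1, 2): 2,                       # e.g. symbol 2 = "space" ends a token
}

def run(symbols):
    """Run the DFA over a sequence of int symbols; return True if accepted."""
    state = 0
    for sym in symbols:
        state = TRANSITIONS.get((state, sym))
        if state is None:            # no transition -> reject
            return False
    return state in ACCEPTING

print(run([1, 1, 2]))  # True
print(run([2]))        # False
```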
I'm going to have to think about that a bit. Maybe I'm missing the point, but it sounds like an interpreter. Thanks for posting the code. This feels like something I need to understand.
One advantage is that the core appears to be written in C/C++. NLTK is Python, and spaCy is a lot of C/C++ with a significant amount of Python.
I’ll have to dig in further, but I suspect there’s a lot of optimization for the task at hand here. spaCy and NLTK try to be much more general NLP toolkits.
I was more so referring to spaCy, where (IIRC) a lot of the lower-level tokenization and parsing is written in Cython, so (I assume) the simple fact that Bling Fire is written in C++ wouldn't explain the huge performance gains.
There’s still a good amount of overhead in the Cython portions of code unless no Python objects or the GIL are involved, and then there’s still overhead communicating between the C and Python worlds.
That’s part of why I stopped writing Cython and just use pybind11.
Love seeing these powerful, production libraries be as easy to install as `pip install blingfire`. They've even included a token-based classifier demo in a Jupyter notebook.
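If I remember the Python API correctly (the function names may have changed since, so check the README), basic use looks roughly like this:

```python
# Rough usage sketch -- function names are from the blingfire Python package
# as I recall them; consult the repo's README for the current API.
from blingfire import text_to_words, text_to_sentences

text = "This is one sentence. And here is another, with punctuation!"

print(text_to_sentences(text))  # one sentence per line
print(text_to_words(text))      # space-separated tokens
```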
Yes, indeed, although TensorFlow and PyTorch have a long way to go before their GPU versions are equally easy to install. Their GPU versions are currently incredibly hard to install correctly, even on Linux. Some users just give up and use prebuilt containers, which, although good for production code, can be hard to develop with.
PyTorch has installed without any problems for quite a long time now (they ship a wheel with all libraries included). Unfortunately, the same can't be said about TensorFlow.
In my own experience, I've found it much easier to install TensorFlow with GPU support from scratch (including cuDNN...) than it is to install Spark/Hadoop.
>This is what happens when you let programmers name projects.
OK, this is pretty OT, but coders don't have a monopoly on bad naming. I worked on an internal project that had a unique and well-recognized name within the company. An exec forced us to change it to an acronym that he liked but that made no sense to anybody else.
Two years later, after the new name had finally caught on, he realized that the project didn't do what he thought and had been telling people. In order to save face, he created another project with _the same name_ so he could tell people that's what he meant all along.
So now there's one project that's had two different names, one of which overlaps with a separate-but-closely-related project. As you might imagine, this has not been confusing to anyone. But at least that executive only looks like a jackass to people who've been around long enough to watch this farce unfold.
Programmers may suck at naming, but at least they understand the concept of using distinct names for related things...
Names are hard, but that’s no reason to bowl straight into the gutter. Calling it Bling is just confusing. That’s like McDonald’s selling a Big Mc. Oh wait...
In NLP, tokenization is one of the first steps in the pipeline: breaking text down into individual words. I don't know about others, but for the project I'm working on, tokenization speed doesn't matter much in the big picture because the subsequent steps take much longer.
From the readme it appears that it delivers the same results as NLTK, just much faster. Therefore publishing it won't give away any of the secret sauce that makes up Bing's search results.
Just to clarify, I didn't mean to imply that this is Bing's secret sauce. But if it runs at this scale, it's probably extremely robust and optimized software.
>Currently released model supports most of the languages except East Asian (Chinese Simplified, Traditional, Japanese, Korean, Thai). You should expect good results if a language uses space as a main token delimiter.
1. Why put only perf data in the blurb? To me it would be much more helpful to first see a few API calls that solve problems I have had before, or may have in the future. It remains mostly opaque to me what this software does.
2. If this is mostly about splitting text into words, why does the code dump consist of dozens of directories with many hundreds of source files?
I understand that they are proud of their perf numbers, but those are (to me) not as important as first understanding what I could do with the project.
Regular expression mangling sounds like it could be useful for much more than word splitting in a search engine context.
The library we have published is first of all a finite-state-machine manipulation library; it was also developed to support linguistic applications (large numbers of entries, Unicode, compile once / use many times, etc.). Tokenization is one of its applications. We needed to start somewhere. In our Deep Learning era not everything that we have created is still relevant, but tokenization is. What we might add to the project: BPE or a BPE variant, support for East Asian languages, Multi-Word Expressions ("New York" as one token).
fefe23, sorry we did not put all this information into the GitHub readme; we will put more documentation into the doc folder soon, and I hope it will answer some of your questions. To specifically answer your questions:
The regular expression support roughly follows the early POSIX standard... it does not have many of the features that NFA-based regular expression engines like those in C#/Python or PCRE have (a small illustration follows after this list) …
Machines are easy to create, but right now it is all done via command-line tools, so you would have to write your own code to create them programmatically.
It does not have a JIT.
Machines operate on ints (int32); input weight maps and variable-length coding are used in places.
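Regarding the regex feature gap above, here is the kind of thing meant (an example in Python's re, which is a backtracking engine; this is not Bling Fire code):

```python
# Backreferences are a backtracking/NFA-engine feature (Python's re, PCRE,
# .NET regex) that pure DFA-based matchers generally cannot support.  This
# only illustrates the kind of feature being referred to above.
import re

m = re.search(r"\b(\w+) \1\b", "it is is a typo")  # find a repeated word
print(m.group())  # "is is"
```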
One thing that can be important but tricky with some tokenization libraries is to un-tokenize, i.e. recover the offsets of the tokens in the original string. Is this supported in your library? How?
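For example, something like this naive post-hoc alignment is what I mean (a hypothetical helper of my own, not part of any library):

```python
# Naive post-hoc alignment: walk through the original string and find each
# token in order, recording (start, end) offsets.  This only works when tokens
# appear verbatim in the text (no normalization), which is exactly why native
# support from the tokenizer would be preferable.
def align(text, tokens):
    offsets, pos = [], 0
    for tok in tokens:
        start = text.find(tok, pos)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        end = start + len(tok)
        offsets.append((start, end))
        pos = end
    return offsets

print(align("Hello, world!", ["Hello", ",", "world", "!"]))
# [(0, 5), (5, 6), (7, 12), (12, 13)]
```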
No, it does not; we are thinking about it. Extra segmentation of tokens is a great idea; I just don't like (who cares? :-)) the inconsistencies that BPE sometimes produces. Look at BERT's tokenizer, for example: "transformers" is one token, but "transformer" is segmented into "transform", "##er". The word "outperforms" becomes "out", "##per", "##forms". It is truly amazing that BERT works so well with such tokenization.
IMHO the fact that character-level language models work mostly just as well as word-level or BPE tokenizers makes it clear that the "embedding" view is outdated nowadays; the meaning is really built (and stored) in the layers just fine.
In my probably even more humble opinion (since I'm merely a novice NLP researcher), I think you are spot on. I've been experimenting with layering two models on top of each other: one uses the most banal character-based token embedding (bags of characters) with clustering, the other uses embeddings as wide as there are clusters. I found that even though the tokens "apple" and "elppa" count as one and the same, the context in which the individual tokens are used together is enough to localize the right cluster of documents very precisely. It seems context is more important than the individual words. There is also the curious fact that humans have no trouble at all decoding tokens even though the characters have been displaced, which seems to support the idea that a model as trivial (and easy to compute on) as bags of characters is indeed a valid one.
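A toy version of the bag-of-characters representation I mean (my own toy code, nothing rigorous):

```python
# Toy bag-of-characters embedding: a token is reduced to its character counts,
# so "apple" and "elppa" deliberately map to the same vector.  This is only an
# illustration of the representation described above.
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def bag_of_characters(token):
    counts = Counter(token.lower())
    return [counts.get(ch, 0) for ch in ALPHABET]

print(bag_of_characters("apple") == bag_of_characters("elppa"))  # True
```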
Maybe you are right from a theoretical point of view; however: 1. latency at inference time may be prohibitive; 2. look at the success of Facebook's fastText model, which computes vectors for words and character n-grams as well. What you would otherwise have to learn with layers and filters, this model simply memorizes. It trains faster, so you can feed it huge amounts of training data, and it runs fast as well.
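For reference, the character n-grams fastText builds per word look roughly like this (a sketch of the idea with its commonly cited default n-gram range, not Facebook's code):

```python
# Sketch of fastText-style character n-grams: the word is wrapped in boundary
# markers and split into overlapping n-grams; each n-gram gets its own vector
# that is summed with the word vector.  Sketch of the idea only.
def char_ngrams(word, n_min=3, n_max=6):
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams

print(char_ngrams("where", 3, 4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
```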
It's a good thing neither Bing nor Google removes so-called stop words, because if they did I wouldn't be able to google the lyrics of that great song by "The The".
If you use a system that requires you to remove stop words from your lexicon, I suggest you find better tech, because removing them destroys linguistic context. If you're using a search engine that doesn't care about linguistic context, again, you should find better tech.
Even the most naive search engine should at least be using tf-idf or some other statistical tool to determine both a token's document frequency and its lexical frequency. That's very important context to have.
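To illustrate the point: the IDF part already down-weights stop words automatically, no hand-made list required (toy numbers, not any particular engine's scoring):

```python
# Toy illustration: a term that appears in every document gets IDF
# log(N / df) = 0, so stop words fade out of the ranking on their own.
import math

docs = [
    "the cat sat on the mat",
    "the dog ate the homework",
    "the bird chased the dog",
]

def idf(term):
    df = sum(term in doc.split() for doc in docs)
    return math.log(len(docs) / df) if df else 0.0

print(round(idf("the"), 3))  # 0.0   -- appears everywhere, weight vanishes
print(round(idf("dog"), 3))  # 0.405 -- appears in 2 of 3 docs
print(round(idf("cat"), 3))  # 1.099 -- rare term, high weight
```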
If you insist on filtering stop words, most tokenizers would assume you did the filtering yourself or would want you to supply a word list to use as a filter. Be aware that you are not making tokenization much simpler: a proper search engine stores each distinct token once and only once in its index, so you hardly save any space. What you did do was slow down the indexing process and spend many more CPU cycles per token, which translates to wasted energy in my mind.
Leave your stop words right where they are, and save the planet.
MWE: 1. Given your corpus, run something like the word2phrase algorithm on it. 2. Compute the phrases. 3. Compile them into one minimal DFA and do a left-most-longest match between the input sequence of tokens and the compiled DFA. We will include this as a separate API with phrases computed on some large corpus.
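A crude sketch of the left-most-longest idea (a plain set of phrases stands in for the compiled minimal DFA, so this is only an approximation of the scheme above):

```python
# Crude sketch of left-most-longest multi-word-expression matching: scan the
# token sequence and, at each position, take the longest phrase that matches.
# A plain set of phrases stands in for the compiled minimal DFA.
PHRASES = {("new", "york"), ("new", "york", "city"), ("machine", "learning")}
MAX_LEN = max(len(p) for p in PHRASES)

def merge_mwes(tokens):
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):  # longest first
            if tuple(t.lower() for t in tokens[i:i + n]) in PHRASES:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_mwes("I moved to New York City last year".split()))
# ['I', 'moved', 'to', 'New_York_City', 'last', 'year']
```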
Word-Guessing: 1. We compute a mapping, as a finite state machine, from a word to its properties. 2. We generalize the finite state machine by shrinking some paths and creating new states with the union of the properties, so that a. no mistakes are made w.r.t. the known words, and b. the model is as general as possible. The result is a finite state machine that does not make mistakes on the known set of words and is as general as possible for typical unseen words.
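For intuition only, a much simpler suffix-based guesser shows the same kind of behavior (known words keep their exact properties, unseen words fall back to the most specific pattern); this is a stand-in analogy, not the DFA generalization described above:

```python
# Intuition-only stand-in for word guessing: exact words keep their known
# properties; unseen words fall back to the longest matching suffix.  This is
# NOT the DFA-generalization algorithm described above, just an analogy.
KNOWN = {"running": {"VERB"}, "walked": {"VERB"}, "table": {"NOUN"}}
SUFFIX_RULES = {"ing": {"VERB"}, "ed": {"VERB"}, "ness": {"NOUN"}}

def guess(word):
    if word in KNOWN:                          # a. no mistakes on known words
        return KNOWN[word]
    for n in range(len(word) - 1, 0, -1):      # b. generalize via suffixes
        props = SUFFIX_RULES.get(word[-n:])
        if props:
            return props
    return {"UNKNOWN"}

print(guess("running"))    # {'VERB'}  (known word)
print(guess("jumping"))    # {'VERB'}  (guessed from '-ing')
print(guess("happiness"))  # {'NOUN'}  (guessed from '-ness')
```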
We will put up a better description than this comment once the models are published.