Bling Fire: Finite state machine and regular expression manipulation library (github.com)
301 points by catam 37 days ago | 65 comments

In 2008 when I was on the Bing team, Sergei had just developed this library and was putting it to work on the document understanding pipeline. He was incredibly passionate about it and would evangelize it with anybody that cared to listen. When I saw the headline on HN, I immediately guessed that this was Sergei's work. And of course it is! Sergei, if you are reading this, try to guess who I am ;)

Thank you for remembering us ;)

I've been impressed with several of the Bing projects; I recently found that they open sourced the JIT compiler for C++ they use on Bing [1]. I've been more and more impressed with Microsoft's efforts in open source and elsewhere, especially with these under the MIT license.

[1] https://github.com/BitFunnel/NativeJIT

Is Bling related to Bing? Reading the repo and the comments I'm really confused by this.

First sentence in the introduction:

> Hi, we are a team at Microsoft called Bling (Beyond Language Understanding), we help Bing be smarter.

Yes it is; it's from the team in Bing that builds the index and does document understanding.

Note that the top-level summary shows it is 10x faster than spaCy, which is a more impressive achievement than being faster than NLTK (it is 20x faster than NLTK).

NLTK is designed for learning, not for production systems, speed, or efficiency. Its code is often the most straightforward way to implement an algorithm, so that it is easy to read later.

I have been experimenting with tokenization in Rust in https://github.com/rth/vtext, mostly relying on Unicode segmentation with a few custom rules. The result was also around 10x faster than spaCy for comparable precision (see the benchmarks section of the readme).

To further your point, NLTK is written in pure Python whereas the backend for spaCy, thinc, is written in C/C++/Python. Really no comparison here.

Anyone familiar enough with this library to explain how they've achieved such impressive performance? Otherwise, I may dig into it for a blog post when I have time.

There are deterministic finite state machines underneath, implemented in lean C++. These automata allow string operations to be implemented optimally, or close to it. The readme links to instructions for recompiling the linguistic resources; if you look inside the makefile used for resource compilation you will see the steps.

Do you use any kind of library or typical design pattern for your FSM implementation? When I've used them, I typically always go back to very C-like code (enum for the states, a state variable, and a big switch statement inside a loop).

We don't represent automata as code, which leads to compiler errors etc.; we represent them as graphs, just as data.

Inside the code we use:

- state --> int
- set of states --> sorted/unique'd array of ints
- input symbol --> int
- output symbol --> int

The transition function is abstracted out behind an interface and implemented differently based on whether the automaton is read-only or changeable, based on the state, etc.
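As a sketch of what "automaton as data" means (this is illustrative Python, not the library's actual C++ interfaces): the whole machine is a transition table, and running it is one table lookup per input symbol, with no switch statements or code generation involved.

```python
# A DFA represented purely as data: states and input symbols are ints,
# and the transitions live in a plain dict instead of compiled code.
# This toy machine accepts sequences containing an even number of 1s.

ACCEPTING = {0}
TRANSITIONS = {
    (0, 0): 0, (0, 1): 1,   # from state 0: '0' stays, '1' flips parity
    (1, 0): 1, (1, 1): 0,   # from state 1: symmetric
}

def run_dfa(symbols):
    """Interpret the transition table over a sequence of int symbols."""
    state = 0
    for sym in symbols:
        state = TRANSITIONS[(state, sym)]
    return state in ACCEPTING

print(run_dfa([1, 0, 1]))  # True: two 1s
print(run_dfa([1, 1, 1]))  # False: three 1s
```

Because the machine is data, it can be loaded from a compiled resource file at runtime, swapped out, or minimized offline without rebuilding any code.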

I'm going to have to think about that a bit. Maybe I'm missing the point, but it sounds like an interpreter. Thanks for posting the code. This feels like something I need to understand.

Thanks - will dive in and start checking things out.

What is lean C++?

I think he just means efficient C++ programming.


Maybe?

One advantage is that the core appears to be written in C/C++. NLTK is pure Python, and spaCy is a lot of C/C++ with a significant amount of Python.

I’ll have to dig in further, but I suspect there’s a lot of optimization for the task at hand here. spaCy and NLTK try to be much more general NLP tools.

I was more referring to spaCy, where (IIRC) a lot of the lower-level tokenization and parsing is written in Cython, so (I assume) the simple fact that Bling Fire is written in C++ wouldn't explain the huge performance gains.

There’s still a good amount of overhead in Cython portions of code unless there are no python objects or GIL involved, and then there’s still overhead communicating between C and Python worlds.

That’s part of why I stopped writing Cython and just use pybind11.

Love seeing these powerful, production libraries be as easy to install as `pip install blingfire`. They've even included a token-based classifier demo in Jupyter Notebook:


Yes, indeed, although TensorFlow and PyTorch have a long way to go before their GPU versions are equally easy to install. Their GPU versions are currently incredibly hard to install correctly, even on Linux. Some users just give up and use prebuilt containers that, although good for production code, can be hard to develop with.

This has been vastly improved fairly recently with the advent of the tensorflow-gpu meta package available via conda. See https://towardsdatascience.com/tensorflow-gpu-installation-m...

PyTorch has installed without any problem for quite a long time (they ship a whl with all libraries included). The same can't be said for TensorFlow, unfortunately.

On Windows, in order to get Torch I need to use Anaconda.

Your name leaked from an identifier to a property of your account in my head and I actually thought this was a dead comment. Weird.

This is unrelated, but you should see the potential that this holds for blind people: https://arxiv.org/abs/1901.02527

In my own experience, I've found it much easier to install tensorflow with GPU support from scratch (including cuDNN...) than it is to install Spark/Hadoop.

And running a marathon is easier than doing Iron Man, but neither of them is a walk in the park.

The title is wrong. It's not "Bing Fire", it's "Bling Fire"

Thanks, this makes sense. Also, anything 'Bing' and open source might be just the client for an Azure subscription.

This seems to be a true standalone open source project.

Confusingly, Bling Fire is part of the Bing search engine.

This is what happens when you let programmers name projects. I'm sure the product owners love this. /s

Names are hard. ;-)

>This is what happens when you let programmers name projects.

OK, this is pretty OT, but coders don't have a monopoly on bad naming. I worked on an internal project that had a unique and well recognized name within the company. Exec forced us to change it to an acronym that he liked but made no sense to anybody else.

Two years later, after the new name had finally caught on, he realized that the project didn't do what he thought and had been telling people. In order to save face, he created another project with _the same name_ so he could tell people that's what he meant all along.

So now there's one project that's had two different names, one of which overlaps with a separate-but-closely-related project. As you might imagine, this has not been confusing to anyone. But at least that executive only looks like a jackass to people who've been around long enough to watch this farce unfold.

Programmers may suck at naming, but at least they understand the concept of using distinct names for related things...

Good lord, executives are such a burden on Earth.

It says Bling stands for "Beyond Language Understanding". They must be really good programmers.

Names are hard, but that’s no reason to bowl straight into the gutter. Calling it Bling is just confusing. That’s like McDonald’s selling a Big Mc. Oh wait...

Can someone suggest a few practical use cases of this for normal plebs like me?

In NLP, tokenization is one of the first steps in the pipeline to break down text into individual words. I don't know about others but for the project I'm doing, the tokenization speed doesn't matter much in the big picture because the following steps take much longer.

It matters for anything that should be more or less realtime, so where fast feedback is needed.

If you're talking about Natural Language Processing, think voice operated - everything.

It sounds like they released the tokenizer that powers Bing, which is one of the largest search engines in the world. That's pretty significant.

From the readme it appears that it will deliver the same results as NLTK, just much faster. Therefore publishing it won't give away any of the secret sauce that makes up Bing's search results.

The tokenization results are not the same; you can see some examples in the readme.

Just to clarify, I didn't mean to imply that this is Bing's secret sauce. But if it runs at this scale, it's probably extremely robust and optimized software.

>Currently released model supports most of the languages except East Asian (Chinese Simplified, Traditional, Japanese, Korean, Thai). You should expect good results if a language uses space as a main token delimiter.

Any plans to change this?

What does MS fall back on for these languages?

Two things about this baffle me.

1. Why only put perf data in the blurb? To me it would be much more helpful to first see a few API calls this allows that solve problems I have had before, or may have in the future. It remains mostly opaque to me what this software does.

2. If this is mostly about splitting text into words, why is the code dump dozens of directories with many hundreds of source code files in there?

I understand that they are proud of their perf numbers, but those are (to me) not as important as first understanding what I could do with the project.

Regular expression manipulation sounds like it could be useful for much more than word splitting in a search engine context.

The library we have published is first of all a finite state machine manipulation library; it was also developed to support linguistic applications (large numbers of entries, Unicode, compile once / use many times, etc.). Tokenization is one of its applications. We needed to start somewhere. In our deep learning era not everything we have created is still relevant, but tokenization is. What we might add to the project: BPE or a BPE variant, support for East Asian languages, and multi-word expressions ("New York" as one token).

Could you please explain what kind of manipulation your library supports on finite state machines?

Does it contain a regex engine to easily create the state machines in the first place?

Does it have a JIT?

Do the state machines operate on char? wchar_t?

These are the kinds of details that I would love to see on the github entry page. It's obvious to you what your library does, but I have no idea :-)

fefe23, sorry we did not put all this information into the GitHub readme; we will put more documentation into the doc folder soon, and I hope it will answer some of your questions. To specifically answer yours:

The regular expressions follow a somewhat early POSIX standard; the engine does not have many of the features that NFA-based regular expressions have in C#/Python or PCRE …

Machines are easy to create, but right now it is all done via command-line tools, so you would have to write your own code to create them programmatically.

It does not have a JIT.

The machines operate on ints (int32); input weight maps and variable-length coding are used in places.

One thing that can be important but tricky with some tokenization libraries is to un-tokenize, i.e. recover the offsets of the tokens in the original string. Is this supported in your library? How?

Does it also do BPE (byte pair encoding)? can't tell

No, it does not; we are thinking about it. Extra segmentation of tokens is a great idea; I just don't like (who cares? :-)) the inconsistencies that BPE sometimes produces. Look at how BERT tokenizes, for example: "transformers" is one token, but "transformer" is segmented into "transform" "##er". The word "outperforms" becomes "out", "##per", "##forms". It is truly amazing that BERT works so well with such tokenization.
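For readers unfamiliar with the segmentation being discussed: BERT's WordPiece tokenizer uses greedy longest-match-first segmentation against a fixed vocabulary, which is exactly what produces the inconsistencies above. A toy sketch (the vocabulary here is made up just to reproduce the comment's examples):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first segmentation, WordPiece style."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are prefixed
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1  # no match: try a shorter prefix
        else:
            return ["[UNK]"]  # nothing in the vocab matched at all
        start = end
    return pieces

# Toy vocabulary reproducing the segmentations mentioned in the comment.
vocab = {"out", "##per", "##forms", "transform", "##er"}
print(wordpiece("outperforms", vocab))   # ['out', '##per', '##forms']
print(wordpiece("transformer", vocab))   # ['transform', '##er']
```

The inconsistency comes from the vocabulary, not the algorithm: whichever whole words made the frequency cut stay intact, while near-identical words get split.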

Looks like an opportunity to do BPE right. Would love to see your take on it with this engine!

IMHO, the fact that character-level language models work mostly just as well as word-level or BPE tokenizers makes it clear that the "embedding" view is wrong nowadays; the meaning is really built (and stored) in the layers, just fine.

In my probably even more humble opinion (I'm merely a novice NLP researcher), I think you are spot on. I've been experimenting with layering two models on top of each other: one uses the most banal character-based token embedding (bags of characters) with clustering, and the other uses embeddings as wide as there are clusters. I found that even though the tokens "apple" and "elppa" count as one and the same, the context of how the individual tokens are used together is enough to localize the right cluster of documents very precisely. It seems context is more important than the individual words. There is also the curious fact that humans have no trouble at all decoding tokens even though the characters have been displaced, which seems to support the idea that a model as trivial (and easy to compute on) as bags of characters is indeed a valid one.

Maybe you are right from a theoretical point of view; however: 1. latency at inference time may be prohibitive; 2. look at the success of the fastText model from Facebook, which computes vectors for words and n-grams as well. What you would have to learn with layers and filters, this model simply memorizes. It trains faster, hence you can feed it huge training data, and it runs fast as well.

Perhaps I'm mistaken because I thought blingfire was mostly a tokenizer. But now I realize it might also be a language model.

How are tokens/phrases/documents represented, computationally? Theoretically, do they live in a vector space?

That's cool, very powerful lib. But I'm not sure if I still have to use other libraries to remove stop words to make the tokenizer simpler.

>> stop words

It's a good thing neither Bing nor Google removes so-called stop words, because if they did I wouldn't be able to google the lyrics from that great song by "The The".

If you use a system that requires you to remove stop words from your lexicon, I suggest you find better tech, because removing them destroys linguistic context. If you're using a search engine that doesn't care about linguistic context, again, you should find better tech.

Even the most naive search engine should at least be using tf-idf or some other statistical tool to determine both a token's document frequency as well as its lexical frequency. That's very important context to have.

If you insist on using stop words, most tokenizers would assume you did the filtering, or they'd want you to submit a word list to use as a filter. Be aware that you are not making tokenization much simpler: a proper search engine stores each distinct token once and only once in its index, so you hardly save any space. What you did was slow down the indexing process and use many more CPU cycles per token, which translates to energy waste in my mind.

Leave your stop words right where they are, and save the planet.

Can you explain how you use FSTs and REs for Multi-word expression matching and Unknown word-guessing?

MWE: 1. Given your corpus, run something like the word2phrase algorithm on it. 2. Compute the phrases. 3. Compile them as one minimal DFA and do a left-most-longest match between the input sequence of tokens and the compiled DFA. We will include this as a separate API with phrases computed on some large corpus.
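A rough sketch of step 3 in plain Python (the library compiles the phrases into a minimal DFA; here a set of token tuples with a bounded window approximates the same left-most-longest behavior):

```python
def merge_mwes(tokens, phrases, max_len=4):
    """Left-most-longest match of known phrases over a token sequence."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate window first (longest match wins).
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = tuple(tokens[i:i + n])
            if n > 1 and cand in phrases:
                out.append("_".join(cand))  # e.g. "New York" -> "New_York"
                i += n
                break
        else:
            out.append(tokens[i])  # no phrase starts here; keep the token
            i += 1
    return out

phrases = {("New", "York"), ("New", "York", "City")}
print(merge_mwes(["I", "love", "New", "York", "City"], phrases))
# ['I', 'love', 'New_York_City']
```

A compiled DFA gets the same result in a single left-to-right pass without re-probing windows, which is what makes the production version fast even with millions of phrases.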

Word guessing: 1. We compute a mapping from a word to its properties as a finite state machine. 2. We generalize the machine by shrinking some paths and creating new states with the union of the properties, so that a. no mistakes are made with respect to the known words, and b. the model is as general as possible for typical unseen words.
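A loose analogy in Python (the library does this with a generalized finite state machine over the whole word; a suffix table captures the same core idea of mapping unseen words to the union of properties of known words that resemble them):

```python
def build_suffix_guesser(lexicon, max_suffix=3):
    """Map word suffixes to the union of properties of words sharing them."""
    table = {}
    for word, props in lexicon.items():
        for n in range(1, min(max_suffix, len(word)) + 1):
            table.setdefault(word[-n:], set()).update(props)
    return table

def guess(word, table, max_suffix=3):
    """Longest matching suffix wins; unseen words still get a guess."""
    for n in range(min(max_suffix, len(word)), 0, -1):
        if word[-n:] in table:
            return table[word[-n:]]
    return set()

# Tiny made-up lexicon for illustration.
lexicon = {"running": {"VERB"}, "walking": {"VERB"}, "kitten": {"NOUN"}}
table = build_suffix_guesser(lexicon)
print(guess("jumping", table))  # {'VERB'}: shares the "ing" suffix
```

Known words are always answered exactly, while novel words fall back to the most specific generalization available, mirroring the "no mistakes on known words, as general as possible otherwise" constraint described above.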

We will put up a better description than this comment once the models are published.

Just for the record, NLTK is the slowest implementation out there.

It's also 10x faster than spaCy :) see the benchmarks in the repo.

How does this compare to HFST?
