
Bling Fire: Finite state machine and regular expression manipulation library - catam
https://github.com/Microsoft/BlingFire
======
switch2017
In 2008 when I was on the Bing team, Sergei had just developed this library
and was putting it to work on the document understanding pipeline. He was
incredibly passionate about it and would evangelize it to anybody who cared
to listen. When I saw the headline on HN, I immediately guessed that this was
Sergei's work. And of course it is! Sergei, if you are reading this, try to
guess who I am ;)

~~~
sergeio76
Thank you for remembering us ;)

------
mgreenleaf
I've been impressed with several of the Bing projects; I recently found that
they open-sourced the JIT compiler for C++ that they use on Bing [1]. I've
been more and more impressed with Microsoft's efforts in Open Source and
elsewhere, especially with these under the MIT license.

[1]
[https://github.com/BitFunnel/NativeJIT](https://github.com/BitFunnel/NativeJIT)

~~~
appleflaxen
Is Bling related to Bing? Reading the repo and the comments I'm really
confused by this.

~~~
seeekr
First sentence in the introduction:

> Hi, we are a team at Microsoft called Bling (Beyond Language Understanding),
> we help Bing be smarter.

------
rlayton2
Note that the top-level summary shows it is 10x faster than spaCy, which is a
more impressive achievement than being faster than NLTK (it is 20x faster
than NLTK).

NLTK is designed for learning, not for production systems, speed, or
efficiency. Its code is often the most straightforward way to implement an
algorithm, which also makes it easier to read later.

~~~
avinium
Anyone familiar enough with this library to explain how they've achieved such
impressive performance? Otherwise, I may dig into it for a blog post when I
have time.

~~~
sergeio76
There are deterministic finite state machines underneath, implemented in lean
C++. These automata make it possible to implement string operations optimally,
or close to it. In the readme file there is a link explaining how to recompile
the linguistic resources; if you look inside the makefile that is used for
resource compilation you will see the steps.
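
For intuition, here is a toy sketch of the idea (my own illustration, not
BlingFire's actual code): a two-state DFA that splits text into word tokens
in a single left-to-right pass, doing constant work per character:

    # Toy illustration only: a hand-rolled DFA tokenizer. Real linguistic
    # automata are compiled from resources, but the single-pass structure
    # is the same.
    def dfa_tokenize(text):
        IN_WORD, BETWEEN = 0, 1          # the two states of the automaton
        state, start, tokens = BETWEEN, 0, []
        for i, ch in enumerate(text):
            is_sep = not ch.isalnum()
            if state == BETWEEN and not is_sep:
                state, start = IN_WORD, i    # transition: a token starts
            elif state == IN_WORD and is_sep:
                state = BETWEEN              # transition: the token ends
                tokens.append(text[start:i])
        if state == IN_WORD:
            tokens.append(text[start:])
        return tokens

    print(dfa_tokenize("This is a simple example."))
    # ['This', 'is', 'a', 'simple', 'example']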

~~~
mlacks
What is lean C++?

~~~
Abishek_Muthian
I think he just means efficient C++ programming.

------
danso
Love seeing these powerful, production libraries be as easy to install as `pip
install blingfire`. They've even included a token-based classifier demo in
Jupyter Notebook:

[https://github.com/Microsoft/BlingFire/blob/master/doc/Bling...](https://github.com/Microsoft/BlingFire/blob/master/doc/Bling%20Fire%20Tokenizer%20Demo.ipynb)
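
After `pip install blingfire`, the basic calls from the demo notebook look
roughly like this (tokens and sentences come back as single delimited
strings):

    from blingfire import text_to_words, text_to_sentences

    text = "Bling Fire is fast. It splits text into words and sentences."
    print(text_to_words(text))      # space-separated tokens as one string
    print(text_to_sentences(text))  # one sentence per line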

~~~
ShadowBannedUsr
Yes, indeed, although TensorFlow and PyTorch have a long way to go before
their GPU versions are equally easy to install. Their GPU versions are
currently incredibly hard to install correctly, even on Linux. Some users just
give up and use prebuilt containers, which, although good for production code,
can be hard to develop with.

~~~
solomatov
PyTorch has installed without any problems for quite a long time (they have a
wheel with all libraries included). The same can't be said about TensorFlow,
unfortunately.

~~~
ctoth
On Windows, in order to get Torch I need to use Anaconda.

------
bastawhiz
The title is wrong. It's not "Bing Fire", it's "Bling Fire"

~~~
Lowkeyloki
Confusingly, Bling Fire is part of the Bing search engine.

This is what happens when you let programmers name projects. I'm sure the
product owners love this. /s

Names are hard. ;-)

~~~
breischl
>This is what happens when you let programmers name projects.

OK, this is pretty OT, but coders don't have a monopoly on bad naming. I
worked on an internal project that had a unique and well-recognized name
within the company. An exec forced us to change it to an acronym that he liked
but that made no sense to anybody else.

Two years later, after the new name had finally caught on, he realized that
the project didn't do what he thought and had been telling people. In order to
save face, he created another project with _the same name_ so he could tell
people that's what he meant all along.

So now there's one project that's had two different names, one of which
overlaps with a separate-but-closely-related project. As you might imagine,
this has not been confusing to anyone. But at least that executive only looks
like a jackass to people who've been around long enough to watch this farce
unfold.

Programmers may suck at naming, but at least they understand the concept of
using distinct names for related things...

~~~
pts_
Good lord, executives are such a burden on Earth.

------
harrygeez
Can someone suggest a few practical use cases of this for normal plebs like
me?

~~~
zone411
In NLP, tokenization is one of the first steps in the pipeline to break down
text into individual words. I don't know about others, but for the project
I'm doing, the tokenization speed doesn't matter much in the big picture
because
the following steps take much longer.

~~~
atoav
It matters for anything that should be more or less realtime, so where fast
feedback is needed.

------
ma2rten
It sounds like they released the tokenizer that powers Bing, which is one of
the largest search engines in the world. That's pretty significant.

~~~
dx034
From the readme it appears that it will deliver the same results as NLTK, just
much faster. Therefore publishing it won't give away any of the secret sauce
that makes up Bing's search results.
~~~
jiguang98
The tokenization results are not the same; you can see some examples in the
readme.

~~~
ma2rten
Just to clarify, I didn't mean to imply that this is Bing's secret sauce. But
if it runs at this scale, it's probably extremely robust and optimized
software.

------
natch
>Currently released model supports most of the languages except East Asian
(Chinese Simplified, Traditional, Japanese, Korean, Thai). You should expect
good results if a language uses space as a main token delimiter.

Any plans to change this?

What does MS fall back on for these languages?

------
fefe23
Two things about this baffle me.

1. Why only put perf data in the blurb? To me it would be much more helpful
to first see a few API calls this allows that solve problems I have had
before, or may have in the future. It remains mostly opaque to me what this
software does.

2. If this is mostly about splitting text into words, why does the code dump
contain dozens of directories with many hundreds of source files?

I understand that they are proud of their perf numbers, but those are (to me)
not as important as first understanding what I could do with the project.

Regular expression mangling sounds like it could be useful for much more than
word splitting in a search engine context.

~~~
sergeio76
The library we have published is first of all a finite state machine
manipulation library; it is also developed to support linguistic applications
(large numbers of entries, Unicode, compile once use many times, etc.).
Tokenization is one of its applications. We needed to start somewhere. In our
Deep Learning era not everything we have created is still relevant, but
tokenization is. What we might add to the project is BPE or a BPE variant,
support for East Asian languages, and Multi Word Expressions ("New York" as
one token).

~~~
fefe23
Could you please explain what kind of manipulation your library supports on
finite state machines?

Does it contain a regex engine to easily create the state machines in the
first place?

Does it have a JIT?

Do the state machines operate on char? wchar_t?

These are the kinds of details that I would love to see on the github entry
page. It's obvious to you what your library does, but I have no idea :-)

~~~
sergeio76
fefe23, sorry we did not put all this information into the GitHub readme; we
will put more documentation into the doc folder soon, and I hope it will
answer some of your questions. To specifically answer them:

The regular expressions follow roughly the early POSIX standard... they do
not have many of the features that NFA-based regex engines like those in
C#/Python or PCRE have …

Machines are easy to create, but right now it is all done via command-line
tools, so if you want to create them from code you will have to write that
yourself.

Does not have JIT.

The machines operate on ints (int32); input weight maps and variable-length
coding are used in places.
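
To make the int32 alphabet and the input weight map concrete, here is a
conceptual sketch (my own toy, not BlingFire's real data structures): a map
collapses the huge Unicode range into a few equivalence classes, so the DFA's
transition table stays tiny:

    # Hypothetical input weight map: letters -> 0, digits -> 1, other -> 2
    def input_weight(cp):                # cp is a Unicode code point (int)
        ch = chr(cp)
        return 0 if ch.isalpha() else 1 if ch.isdigit() else 2

    # Transition table indexed by [state][weight]; -1 means "no transition".
    # This toy machine accepts identifier-like strings: letter (letter|digit)*
    TRANS = [
        [1, -1, -1],   # state 0: must start with a letter
        [1,  1, -1],   # state 1: letters or digits may follow
    ]
    ACCEPTING = {1}

    def accepts(s):
        state = 0
        for cp in map(ord, s):           # one int symbol per input character
            state = TRANS[state][input_weight(cp)]
            if state == -1:
                return False
        return state in ACCEPTING

    print(accepts("Bling2019"))   # True
    print(accepts("2019Bling"))   # False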

------
bravura
One thing that can be important but tricky with some tokenization libraries is
to un-tokenize, i.e. recover the offsets of the tokens in the original string.
Is this supported in your library? How?
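
(For anyone needing this in the meantime, one generic fallback, not a
BlingFire API, is to re-locate each token in the original string in order;
this works as long as the tokens appear verbatim in the input:

    def token_offsets(text, tokens):
        offsets, pos = [], 0
        for tok in tokens:
            start = text.index(tok, pos)   # ValueError if not found verbatim
            offsets.append((start, start + len(tok)))
            pos = start + len(tok)
        return offsets

    text = "Bling Fire, the tokenizer."
    print(token_offsets(text, ["Bling", "Fire", ",", "the", "tokenizer", "."]))
    # [(0, 5), (6, 10), (10, 11), (12, 15), (16, 25), (25, 26)]

It breaks down when the tokenizer normalizes or rewrites tokens, which is why
native offset support matters.)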

------
make3
Does it also do BPE (byte pair encoding)? can't tell

~~~
sergeio76
No it does not; we are thinking about it. Extra segmentation of tokens is a
great idea, I only don't like (who cares? :-)) the inconsistencies that BPE
sometimes produces. Looking at how BERT tokenizes, for example: "transformers"
is one token but "transformer" is segmented into "transform" "##er". The word
"outperforms" becomes "out", "##per", "##forms". It is truly amazing that BERT
works so well with such tokenization.
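
The inconsistency comes straight from greedy longest-match-first segmentation
against a fixed vocabulary. A toy WordPiece-style sketch with a made-up
vocabulary shows it: whether the surface form happens to be in the vocabulary
completely changes the split:

    # Made-up vocabulary for illustration; "##" marks a continuation piece.
    VOCAB = {"transform", "transformers", "##er", "##s",
             "out", "##per", "##forms"}

    def wordpiece(word):
        pieces, start = [], 0
        while start < len(word):
            for end in range(len(word), start, -1):    # longest match first
                cand = word[start:end] if start == 0 else "##" + word[start:end]
                if cand in VOCAB:
                    pieces.append(cand)
                    start = end
                    break
            else:
                return ["[UNK]"]                       # no piece matched
        return pieces

    print(wordpiece("transformers"))  # ['transformers']
    print(wordpiece("transformer"))   # ['transform', '##er']
    print(wordpiece("outperforms"))   # ['out', '##per', '##forms']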

~~~
make3
IMHO the fact that character-level language models work mostly just as well
as word-level or BPE tokenizers makes it clear that the "embedding" view is
wrong nowadays; the meaning is really built (and stored) in the layers just
fine.

~~~
misterman0
In my probably even more humble opinion, due to the fact I'm merly a novice
NLP researcher, I think you are spot on. I've been experimenting with layering
two models on top of each other, one using the utmost banal character based
token embedding (bags-of-characters) with clustering, the other using
embeddings as wide as there are clusters, and found that even though the
tokens "apple" and "elppa" count as one and the same, the context of how the
individual tokens are used together is enough for very precisely localizing
the right cluster of documents. It seems context is more important than the
individual words are. There is also this curious fact that humans have no
trouble at all decoding tokens even tough teh the characters have been
displaced, which seems to support the idea that a model as trivial (and easy
to compute on) as bags of characters is indeed a valid one.
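
For the curious, the bag-of-characters representation I mean is as simple as
it sounds: order is discarded, so anagrams like "apple" and "elppa" collide
on purpose:

    from collections import Counter

    def bag_of_characters(token):
        return Counter(token.lower())    # character counts, order discarded

    print(bag_of_characters("apple") == bag_of_characters("elppa"))   # True
    print(bag_of_characters("apple") == bag_of_characters("apples"))  # False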

~~~
sergeio76
Maybe you are right from a theoretical point of view; however: 1. latency at
inference time may be prohibitive; 2. look at the success of Facebook's
fastText model, which computes vectors for words and n-grams(!) as well. What
you would have to learn with layers and filters, this model memorizes. It
trains faster, hence you can feed it huge training data, and it runs fast as
well.
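
For reference, the subword trick fastText uses is easy to sketch: each word's
vector is the sum of a vector for the word itself plus vectors for its
boundary-marked character n-grams, so related surface forms share most of
their subword units (a sketch of the n-gram extraction only, not Facebook's
code):

    def char_ngrams(word, n_min=3, n_max=6):
        w = f"<{word}>"                  # fastText marks word boundaries
        return [w[i:i + n] for n in range(n_min, n_max + 1)
                           for i in range(len(w) - n + 1)]

    print(char_ngrams("apple", 3, 4))
    # ['<ap', 'app', 'ppl', 'ple', 'le>', '<app', 'appl', 'pple', 'ple>']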

~~~
misterman0
Perhaps I'm mistaken because I thought blingfire was mostly a tokenizer. But
now I realize it might also be a language model.

How are tokens/phrases/documents represented, computationally? Theoretically,
do they live in a vector space?

------
ironfootnz
That's cool, a very powerful lib. But I'm not sure if I still have to use
other libraries to remove stop words to make the tokenizer output simpler.

~~~
misterman0
>> stop words

It's a good thing neither Bing nor Google removes so-called stop words,
because if they did I wouldn't be able to google the lyrics of that great
song by "The The".

If you use a system that requires you to remove stop words from your lexicon,
I suggest you find better tech, because removing them destroys linguistic
context. If you're using a search engine that doesn't care about linguistic
context, again, you should find better tech.

Even the most naive search engine should at least be using tf-idf or some
other statistical tool to determine both a token's document frequency and its
lexical frequency. That's very important context to have.
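
For anyone unfamiliar, plain tf-idf fits in a few lines: a term like "the"
keeps its term frequency, but its idf collapses toward zero because it
appears in nearly every document, which is exactly the statistical
downweighting that makes hard stop-word deletion unnecessary (toy corpus, my
own sketch):

    import math

    docs = [
        "the the lyrics".split(),
        "the search engine".split(),
        "linguistic context matters".split(),
    ]

    def tf_idf(term, doc, docs):
        tf = doc.count(term) / len(doc)              # term frequency
        df = sum(term in d for d in docs)            # document frequency
        idf = math.log(len(docs) / df) if df else 0.0
        return tf * idf

    print(tf_idf("the", docs[0], docs))      # ~0.27: common across documents
    print(tf_idf("lyrics", docs[0], docs))   # ~0.37: rarer, weighted up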

If you insist on removing stop words, most tokenizers would assume you did
the filtering yourself, or they'd want you to submit a word list to use as a
filter. Be aware that you are not making tokenization much simpler: a proper
search engine stores each distinct token once and only once in its index, so
you hardly save any space. What you did was slow down the indexing process
and use lots more CPU cycles for each token, which translates to energy waste
in my mind.

Leave your stop words right where they are, and save the planet.

------
bravura
Can you explain how you use FSTs and REs for Multi-word expression matching
and Unknown word-guessing?

~~~
sergeio76
MWE: 1. Given your corpus, you run something like the word2phrase algorithm
on it. 2. Compute the phrases. 3. Compile them as one Min DFA and do
left-most-longest matching between the input sequence of tokens and the
compiled DFA. We will include this as a separate API with phrases computed on
some large corpus.
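
A rough sketch of step 3 (a simple phrase set stands in for the compiled Min
DFA here; the matching discipline is the same): at each position take the
longest phrase match, so "New York City" comes out as one token:

    PHRASES = {("new", "york"), ("new", "york", "city")}
    MAX_LEN = max(len(p) for p in PHRASES)

    def mwe_merge(tokens):
        out, i = [], 0
        while i < len(tokens):
            # left-most-longest: try the longest candidate span first
            for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
                if tuple(t.lower() for t in tokens[i:i + n]) in PHRASES:
                    out.append(" ".join(tokens[i:i + n]))
                    i += n
                    break
            else:
                out.append(tokens[i])
                i += 1
        return out

    print(mwe_merge("I moved to New York City last year".split()))
    # ['I', 'moved', 'to', 'New York City', 'last', 'year']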

Word-Guessing: 1. We compute a mapping, as a finite state machine, from a
word to its properties. 2. We generalize the finite state machine by
shrinking some paths and creating new states with the union of the
properties, so that a. no mistakes are made wrt the known words, and b. the
model is as general as possible. The result is a finite state machine that
a. does not make mistakes wrt the known set of words and is b. as general as
possible for typical unseen words.
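
A rough approximation of the same idea with a suffix-keyed table instead of
the actual FSM generalization (my own simplification): known words map to
property sets, and an unseen word backs off to its longest known suffix,
returning the union of the properties stored there:

    KNOWN = {
        "running": {"VERB"}, "talking": {"VERB"},
        "quickly": {"ADV"},  "happily": {"ADV"},
    }

    # Union the property sets over every suffix of every known word.
    SUFFIX_PROPS = {}
    for word, props in KNOWN.items():
        for k in range(1, len(word) + 1):
            SUFFIX_PROPS.setdefault(word[-k:], set()).update(props)

    def guess(word):
        for k in range(len(word), 0, -1):    # longest matching suffix wins
            props = SUFFIX_PROPS.get(word[-k:])
            if props:
                return props
        return {"UNKNOWN"}

    print(guess("jogging"))   # {'VERB'} via suffix "ing"
    print(guess("slowly"))    # {'ADV'}  via suffix "ly"

No mistakes are made on the known words themselves (a full word is its own
longest suffix), while unseen words still get a sensible guess.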

We will put up a better description than this comment once the models are
published.

------
StreamBright
Just for the record, NLTK is the slowest implementation out there.

~~~
jiamguo
It's also 10x faster than spaCy :) see the benchmarks in the repo.

------
algorithm314
How does this compare to hfst?

