> Hi, we are a team at Microsoft called Bling (Beyond Language Understanding), we help Bing be smarter.
NLTK is designed for learning, not for production systems, speed, or efficiency. The code is often written in the most straightforward way to implement an algorithm, so that it is easy to read later.
Inside the code we use:
state --> int
set of states --> sorted / unique'd array of ints
input symbol --> int
output symbol --> int
transition function is abstracted out behind an interface and implemented differently depending on whether the automaton is read-only or changeable, on the state, etc.
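To make that concrete, here is a rough sketch of what such a representation could look like; everything below is invented for illustration and is not BlingFire's actual interface:

```cpp
// Minimal sketch of the representation described above; all names here are
// made up for illustration and are not BlingFire's real API.
#include <algorithm>
#include <map>
#include <vector>

using State  = int;   // state --> int
using Symbol = int;   // input/output symbol --> int

// set of states --> sorted / unique'd array of ints
using StateSet = std::vector<State>;

inline void Normalize(StateSet& s) {
    std::sort(s.begin(), s.end());
    s.erase(std::unique(s.begin(), s.end()), s.end());
}

// The transition function hides how the automaton is stored, so read-only
// (e.g. memory-mapped) and changeable (under-construction) automata can
// share the same traversal code.
struct ITransitions {
    virtual ~ITransitions() = default;
    // Destination state for (from, input), or -1 if no such transition.
    virtual State Get(State from, Symbol input) const = 0;
};

// One possible changeable implementation; a read-only build might instead
// binary-search a packed, memory-mapped array.
struct MutableTransitions : ITransitions {
    std::map<std::pair<State, Symbol>, State> delta;

    State Get(State from, Symbol input) const override {
        auto it = delta.find({from, input});
        return it == delta.end() ? -1 : it->second;
    }
    void Set(State from, Symbol input, State to) { delta[{from, input}] = to; }
};
```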
I’ll have to dig in further, but I suspect there’s a lot of optimization for the task at hand here. SpaCy and NLTK try to be much more general for NLP.
That’s part of why I stopped writing Cython and just use pybind11.
This seems to be a true standalone open source project.
This is what happens when you let programmers name projects. I'm sure the product owners love this. /s
Names are hard. ;-)
OK, this is pretty OT, but coders don't have a monopoly on bad naming. I worked on an internal project that had a unique and well-recognized name within the company. An exec forced us to change it to an acronym he liked but that made no sense to anybody else.
Two years later, after the new name had finally caught on, he realized that the project didn't do what he thought and had been telling people. In order to save face, he created another project with _the same name_ so he could tell people that's what he meant all along.
So now there's one project that's had two different names, one of which overlaps with a separate-but-closely-related project. As you might imagine, this has not been confusing to anyone. But at least that executive only looks like a jackass to people who've been around long enough to watch this farce unfold.
Programmers may suck at naming, but at least they understand the concept of using distinct names for related things...
Any plans to change this?
What does MS fall back on for these languages?
1. Why put only perf data in the blurb? To me it would be much more helpful to first see a few examples of API calls that solve problems I have had before, or may have in the future. It remains mostly opaque to me what this software does.
2. If this is mostly about splitting text into words, why does the code dump consist of dozens of directories with many hundreds of source files?
I understand that they are proud of their perf numbers, but those are (to me) not as important as first understanding what I could do with the project.
Regular expression mangling sounds like it could be useful for much more than word splitting in a search engine context.
Does it contain a regex engine to easily create the state machines in the first place?
Does it have a JIT?
Do the state machines operate on char? wchar_t?
These are the kinds of details that I would love to see on the github entry page. It's obvious to you what your library does, but I have no idea :-)
The regular expressions are roughly early POSIX standard... they do not have many of the features that NFA-based regular expression engines like those in C#/Python or PCRE have …
Machines are easy to create, but right now it is all done via command-line tools, so if you want to build them programmatically you will have to write that code yourself.
Does not have JIT.
Machines operate on ints (int32); input weight maps and variable-length coding are used in places.
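For anyone unfamiliar with variable-length coding, here is a small sketch in the LEB128 style; this is only an assumed scheme for illustration, not necessarily the encoding BlingFire actually uses:

```cpp
// Sketch of a LEB128-style variable-length coding for int32 symbols.
// Small, frequent symbol ids take 1 byte; large ids take up to 5 bytes.
// Assumed scheme for illustration, not BlingFire's actual on-disk format.
#include <cstdint>
#include <vector>

void EncodeVarint(uint32_t value, std::vector<uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>(value) | 0x80);  // 7 payload bits + continuation bit
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

uint32_t DecodeVarint(const uint8_t*& p) {
    uint32_t value = 0;
    int shift = 0;
    while (*p & 0x80) {                                      // continuation bit set
        value |= static_cast<uint32_t>(*p++ & 0x7F) << shift;
        shift += 7;
    }
    value |= static_cast<uint32_t>(*p++) << shift;           // final byte
    return value;
}
```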
How are tokens/phrases/documents represented, computationally? Theoretically, do they live in a vector space?
It's a good thing neither Bing nor Google remove so called stop words, because if they did I wouldn't be able to google them for the lyrics from that great song by "The the".
If you use a system that requires you to remove stop words from your lexicon, I suggest you find better tech, because removing them destroys linguistic context. If you're using a search engine that doesn't care about linguistic context, again, you should find better tech.
Even the most naive search engine should at least be using tf-idf (roughly sketched below) or some other statistical tool to determine both a token's document frequency and its lexical frequency. That's very important context to have.
If you insist on using stop-word filtering, most tokenizers will assume you did the filtering yourself, or they'll want you to supply a word list to use as a filter. Be aware that you are not making tokenization much simpler, because a proper search engine stores each distinct token once and only once in its index, so you have hardly saved any space. What you did do was slow down the indexing process and spend more CPU cycles on each token, which translates to wasted energy in my mind.
Leave your stop words right where they are, and save the planet.
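For reference, here is a toy sketch of the tf-idf weighting mentioned above, with made-up numbers; the point is that a token appearing in nearly every document gets a weight near zero, so stop words are down-weighted without being deleted:

```cpp
// Toy illustration of tf-idf, not how Bing (or any real engine) computes
// relevance.  A token that appears in nearly every document ("the", "and",
// ...) gets an idf close to zero, so stop words are down-weighted
// automatically instead of having to be deleted from the index.
#include <cmath>
#include <cstdio>

double tf_idf(double term_count_in_doc,  // occurrences of the term in this document
              double doc_length,         // total tokens in this document
              double docs_with_term,     // number of documents containing the term
              double total_docs) {       // number of documents in the collection
    double tf  = term_count_in_doc / doc_length;
    double idf = std::log(total_docs / docs_with_term);
    return tf * idf;
}

int main() {
    // Made-up numbers: a stop word in ~99% of documents vs. a rare content word.
    std::printf("stop word: %f\n", tf_idf(30, 1000, 990000, 1000000));  // ~0.0003
    std::printf("rare word: %f\n", tf_idf(5, 1000, 1200, 1000000));     // ~0.034
}
```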
Word-Guessing: 1. We compute a mapping from a word to its properties as a finite state machine. 2. We generalize the finite state machine by shrinking some paths and creating new states with the union of the properties, so that a. no mistakes are made wrt the known words and b. the model is as general as possible. The result is a finite state machine that a. does not make mistakes wrt the known set of words and b. is as general as possible for typical unseen words.
We will put up a better description than this comment once the models are published.
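In the meantime, here is a toy sketch of the general idea, not the actual BlingFire algorithm (the real thing is compiled into a compact finite state machine) and with all names invented: a suffix-trie guesser that never makes a mistake on a known word and falls back to the longest known suffix for unseen words.

```cpp
// Toy suffix-trie word guesser, for illustration only -- not the actual
// BlingFire model or file format.  Known words get exactly their own
// property sets back ("no mistakes wrt the known words"); unseen words get
// the union of properties seen for the longest matching suffix
// ("as general as possible").
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Node {
    std::map<char, int> next;     // child node index per character
    std::set<std::string> tags;   // union of properties of all words through this node
    bool terminal = false;        // a known word ends exactly here
    std::set<std::string> exact;  // that word's own properties
};

struct Guesser {
    std::vector<Node> nodes;

    Guesser() { nodes.emplace_back(); }   // nodes[0] is the root

    // Index words by reversed characters so common suffixes share paths.
    void Add(const std::string& word, const std::set<std::string>& tags) {
        int n = 0;
        nodes[n].tags.insert(tags.begin(), tags.end());
        for (auto it = word.rbegin(); it != word.rend(); ++it) {
            if (nodes[n].next.count(*it) == 0) {
                nodes.emplace_back();                           // may reallocate,
                nodes[n].next[*it] = (int)nodes.size() - 1;     // so re-index here
            }
            n = nodes[n].next[*it];
            nodes[n].tags.insert(tags.begin(), tags.end());
        }
        nodes[n].terminal = true;
        nodes[n].exact = tags;
    }

    std::set<std::string> Guess(const std::string& word) const {
        int n = 0;
        size_t matched = 0;
        for (auto it = word.rbegin(); it != word.rend(); ++it) {
            auto f = nodes[n].next.find(*it);
            if (f == nodes[n].next.end()) break;
            n = f->second;
            ++matched;
        }
        if (matched == word.size() && nodes[n].terminal)
            return nodes[n].exact;   // known word: exact properties, no mistakes
        return nodes[n].tags;        // unseen word: generalize from longest known suffix
    }
};

int main() {
    Guesser g;
    g.Add("running", {"VERB"});
    g.Add("walking", {"VERB"});
    g.Add("cats",    {"NOUN"});
    g.Add("dogs",    {"NOUN"});
    for (const auto& t : g.Guess("jogging")) std::cout << t << ' ';  // VERB
    std::cout << '\n';
    for (const auto& t : g.Guess("birds"))   std::cout << t << ' ';  // NOUN
    std::cout << '\n';
}
```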