
The library we have published is first of all a finite state machine manipulation library; it is also developed to support linguistic applications (large numbers of entries, Unicode, compile once use many times, etc.). Tokenization is one of its applications, and we needed to start somewhere. In the Deep Learning era not everything we have created is still relevant, but tokenization is. What we might add to the project is BPE or a BPE variant, support for East Asian languages, and multi-word expressions ("New York" as one token); see the sketch below for what the BPE part would look like.
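To give an idea of what a BPE variant adds on top of plain finite-state tokenization, here is a minimal, purely illustrative sketch of the greedy merge step. The merge table and the function name are hypothetical and are not part of the published library.

  # Illustrative only: one greedy BPE-style merge loop.
  # `merges` maps a pair of adjacent symbols to its merge priority (lower = earlier).
  def bpe_tokenize(word, merges):
      symbols = list(word)
      while len(symbols) > 1:
          # Collect adjacent pairs that have a learned merge, with their ranks.
          pairs = [(merges[(a, b)], i)
                   for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
                   if (a, b) in merges]
          if not pairs:
              break
          _, i = min(pairs)                      # best-ranked pair wins
          symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
      return symbols

  # With these (hypothetical) learned merges, "lower" -> ['low', 'er']
  merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
  print(bpe_tokenize("lower", merges))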



Could you please explain what kind of manipulation your library supports on finite state machines?

Does it contain a regex engine to easily create the state machines in the first place?

Does it have a JIT?

Do the state machines operate on char? wchar_t?

These are the kinds of details that I would love to see on the github entry page. It's obvious to you what your library does, but I have no idea :-)


fefe23, sorry we did not put all this information into the GitHub readme; we will put more documentation into the doc folder soon, and I hope it will answer some of your questions. To answer your questions specifically:

Regular expressions follow roughly the early POSIX standard; the engine does not have many of the features that NFA-based regular expression engines such as those in C#/Python or PCRE have (one example follows below).
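As a concrete example of that gap (my illustration, not taken from the library): backreferences are a feature of NFA-based engines like Python's re module or PCRE that a pure finite-state engine cannot express.

  # Illustrative only: a backreference (\1) matches a repeated word.
  # This works in Python's NFA-based re engine but has no DFA equivalent.
  import re
  print(re.search(r"\b(\w+)\s+\1\b", "see the the cat").group(0))  # -> "the the"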

Machines are easy to create, but right now it is all done via command-line tools, so if you want to build them programmatically you will have to write that code yourself.

No, it does not have a JIT.

Machines operate on ints (int32); input weight maps and variable-length coding are used in places.
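To make the "machines operate on int32" point concrete, here is a toy sketch (not this library's API) of a two-state tokenizer DFA driven by code points that are first mapped through a crude input class map; the class names and transition table are invented for illustration.

  # Illustrative only: a hand-built DFA over Unicode code points that
  # emits maximal alphabetic tokens.
  def classify(cp):
      # Map a code point to a small input-symbol class (a crude "input weight map").
      return "LETTER" if chr(cp).isalpha() else "OTHER"

  # States: 0 = outside a token, 1 = inside a token.
  TRANSITIONS = {
      (0, "LETTER"): 1, (0, "OTHER"): 0,
      (1, "LETTER"): 1, (1, "OTHER"): 0,
  }

  def tokenize(text):
      state, start, tokens = 0, None, []
      for i, cp in enumerate(ord(c) for c in text):
          next_state = TRANSITIONS[(state, classify(cp))]
          if state == 0 and next_state == 1:
              start = i                       # token begins
          elif state == 1 and next_state == 0:
              tokens.append(text[start:i])    # token ends
          state = next_state
      if state == 1:
          tokens.append(text[start:])
      return tokens

  print(tokenize("New York, 2024"))  # -> ['New', 'York']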



