
Finite state transducers in Go - mirceasoaica
https://github.com/couchbaselabs/vellum
======
mci
Nitpick: as I understand them, transducers translate strings to strings, not
strings to integers.

There is a roughly 17-line algorithm for building FSMs from lexicographically
ordered strings:
[http://sun.aei.polsl.pl/~mciura/publikacje/lexicon.pdf](http://sun.aei.polsl.pl/~mciura/publikacje/lexicon.pdf)
(Figure 8). Extending it to map the strings to integers, also in lexicographic
order, is an exercise for the reader.
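
For readers who want something concrete, here is a minimal Go sketch of building a minimal acyclic automaton from sorted strings. It is not a transcription of Figure 8. It follows the well-known incremental construction for sorted input described by Daciuk et al., and every name in it (`builder`, `add`, `finish`, `replaceOrRegister`, and so on) is made up for illustration. It only accepts or rejects strings; mapping them to integers is still left out.

```go
package main

import (
	"fmt"
	"strings"
)

// state is one node. Its edges stay sorted by label because words arrive
// in lexicographic order.
type state struct {
	id    int
	final bool
	edges []edge
}

type edge struct {
	label byte
	to    *state
}

// builder keeps a registry of canonical states keyed by signature.
type builder struct {
	root     *state
	registry map[string]*state
	nextID   int
}

func newBuilder() *builder {
	b := &builder{registry: map[string]*state{}}
	b.root = b.newState()
	return b
}

func (b *builder) newState() *state {
	b.nextID++
	return &state{id: b.nextID}
}

// signature encodes finality plus every (label, target ID) pair, so two
// states with equal signatures accept the same suffixes.
func (s *state) signature() string {
	var sb strings.Builder
	fmt.Fprintf(&sb, "%t", s.final)
	for _, e := range s.edges {
		fmt.Fprintf(&sb, "|%d>%d", e.label, e.to.id)
	}
	return sb.String()
}

// add inserts word. Assumption: word sorts at or after every earlier word.
func (b *builder) add(word string) {
	s, i := b.root, 0
	// Walk the prefix shared with the previous word (always the last edges).
	for i < len(word) && len(s.edges) > 0 && s.edges[len(s.edges)-1].label == word[i] {
		s = s.edges[len(s.edges)-1].to
		i++
	}
	// The previous word's remaining suffix can no longer change: freeze it.
	if len(s.edges) > 0 {
		b.replaceOrRegister(s)
	}
	for ; i < len(word); i++ {
		n := b.newState()
		s.edges = append(s.edges, edge{word[i], n})
		s = n
	}
	s.final = true
}

// replaceOrRegister freezes the subtree under s's last edge, bottom-up,
// pointing the edge at an equivalent registered state when one exists.
func (b *builder) replaceOrRegister(s *state) {
	last := &s.edges[len(s.edges)-1]
	child := last.to
	if len(child.edges) > 0 {
		b.replaceOrRegister(child)
	}
	sig := child.signature()
	if q, ok := b.registry[sig]; ok {
		last.to = q
	} else {
		b.registry[sig] = child
	}
}

// finish freezes the last word's path and returns the root.
func (b *builder) finish() *state {
	if len(b.root.edges) > 0 {
		b.replaceOrRegister(b.root)
	}
	return b.root
}

// accepts reports whether word is in the set.
func accepts(root *state, word string) bool {
	s := root
	for i := 0; i < len(word); i++ {
		var next *state
		for _, e := range s.edges {
			if e.label == word[i] {
				next = e.to
			}
		}
		if next == nil {
			return false
		}
		s = next
	}
	return s.final
}

// countStates counts the distinct states reachable from s.
func countStates(s *state, seen map[*state]bool) int {
	if seen[s] {
		return 0
	}
	seen[s] = true
	n := 1
	for _, e := range s.edges {
		n += countStates(e.to, seen)
	}
	return n
}
```

Adding "tap", "taps", "top", "tops" in that order and calling `finish` should leave the "ap(s)" and "op(s)" branches sharing their tail states, which is where the space saving over a plain trie comes from.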

Full disclosure: I am a co-author of that paper.

~~~
burntsushi
I skimmed your paper. It's good stuff!

The implementation in the OP is loosely based on this paper:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3698&rep=rep1&type=pdf)

The algorithm in your paper is roughly similar, but the key difference is
the observation that the output tape of the transducer can itself be
compressed, so long as your outputs follow a certain algebra.[1]
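
To make "a certain algebra" concrete for unsigned integer outputs, here is a small Go sketch. As I read the linked docs, the operations are a common prefix (the minimum), a subtraction, and a concatenation (addition). The Go function names, including `split`, are made up for illustration and are not vellum's API.

```go
package main

// prefix is the part of the output two keys can share: the smaller value.
func prefix(a, b uint64) uint64 {
	if a < b {
		return a
	}
	return b
}

// sub removes a shared prefix p from a. Assumption: p <= a.
func sub(a, p uint64) uint64 { return a - p }

// cat recombines outputs collected along a path.
func cat(a, b uint64) uint64 { return a + b }

// split factors two outputs that must pass through the same transition:
// the shared part stays on that transition and the remainders are pushed
// further down each key's own path.
func split(existing, incoming uint64) (shared, restExisting, restIncoming uint64) {
	shared = prefix(existing, incoming)
	return shared, sub(existing, shared), sub(incoming, shared)
}
```

For example, if one key maps to 5 and another to 3 and both share their first transition, that transition can carry 3, the first key's later path carries the remaining 2, and the second key's later path carries 0. Summing along either path gives back the original value.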

There are two other critical pieces to a non-toy implementation that most
papers don't cover, particularly with respect to very large sets (billions of
entries):

1. Representation of the FST itself. To squeeze the most juice, you ideally
want most states to occupy no more than 1 byte. Anything by Jan Daciuk[2] in
this space is excellent.

2. The hash map used during construction of the FST is universally glossed
over, but it is absolutely critical. A standard hash map will store all
states, which means you'll need to fit your entire FST into memory. Throwing
away the hash map means that your FST isn't compressed very well. So you need
something in between that's bounded and tuned to give the best bang for your
buck.
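
As a rough illustration of the "something in between" from point 2, here is a minimal Go sketch of a bounded registry: a fixed number of buckets, each holding a few entries in most-recently-used order. The type and function names, the string key standing in for a serialized state, and the bucket sizing are all assumptions for illustration, not a description of how any particular library does it.

```go
package main

import "hash/fnv"

// cell maps the serialized form of a frozen state (hypothetical encoding)
// to the address where that state was already written.
type cell struct {
	key  string
	addr int
	used bool
}

// boundedRegistry uses a fixed amount of memory: buckets * ways cells.
type boundedRegistry struct {
	cells   []cell
	buckets int
	ways    int
}

func newBoundedRegistry(buckets, ways int) *boundedRegistry {
	return &boundedRegistry{
		cells:   make([]cell, buckets*ways),
		buckets: buckets,
		ways:    ways,
	}
}

// findOrAdd returns the address of an equivalent state if one is still
// cached (true), otherwise caches addr under key and returns it (false).
// A miss may evict the least recently used entry in the bucket, which
// costs some compression but never correctness.
func (r *boundedRegistry) findOrAdd(key string, addr int) (int, bool) {
	h := fnv.New64a()
	h.Write([]byte(key))
	start := int(h.Sum64()%uint64(r.buckets)) * r.ways
	bucket := r.cells[start : start+r.ways]

	for i, c := range bucket {
		if c.used && c.key == key {
			// Hit: move the entry to the front of its bucket.
			copy(bucket[1:i+1], bucket[:i])
			bucket[0] = c
			return c.addr, true
		}
	}
	// Miss: shift everything down one slot, dropping the last entry.
	copy(bucket[1:], bucket[:r.ways-1])
	bucket[0] = cell{key: key, addr: addr, used: true}
	return addr, false
}
```

The trade-off is the one described above: more buckets and ways means more memory and better sharing of equivalent states, fewer means a smaller footprint and a larger FST.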

All of these considerations make FSTs like these non-trivial to implement. :-)

[1] -
[https://docs.rs/fst/0.1.38/fst/raw/struct.Output.html](https://docs.rs/fst/0.1.38/fst/raw/struct.Output.html)

[2] -
[http://www.cs.put.poznan.pl/dweiss/site/publications/downloa...](http://www.cs.put.poznan.pl/dweiss/site/publications/download/fsacomp.pdf)

------
gok
The diagrams are weird... usually the graphical notation "a/4" for an arc of
an FST would mean that "4" is a weight and the arc is an acceptor arc (input
and output are the same) with label "a". So I'd expect that to actually be
"a:4".

It looks like this only supports unweighted finite state transducers?

~~~
mschoch
I might be using non-standard notation; I just followed what I saw used here:
[http://blog.burntsushi.net/transducers/](http://blog.burntsushi.net/transducers/)

The values after the / are the output values associated with the transition.

Weighted finite state transducers are not supported.

~~~
burntsushi
The notation is standard in the literature AFAIK:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3698&rep=rep1&type=pdf)

FSTs are commonly used for morphological analysis, and AFAIK, there isn't much
cross-pollination between those folks and folks that use FSTs more as data
structures of large sets or maps. So there's probably a bit of a communicative
mismatch.

------
fishywang
This is very cool! I can see that there are some major restrictions on its
application (you cannot make changes after it's built, and the value type is
restricted), but in cases where you can work within those restrictions, this
could be very efficient.

~~~
mschoch
Thanks,

Author here. Yes, the observation that it is immutable is correct. One approach
you can use is to have some lighter-weight representation for the most recent
data, then have one or more FSTs representing the older data. As time permits,
keep merging the new data and the older FSTs down into new, larger FSTs.
Merging them is straightforward since you can iterate the contents in order,
which is the order you need to build new ones. In this way, it is very similar
to having a WAL up front and one or more segments backing an LSM storage.
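
A minimal Go sketch of that merge step, using plain sorted slices as stand-ins for FST iterators. The `entry` type and `mergeSorted` function are hypothetical names, and "newest segment wins on duplicate keys" is an assumed policy, not something the library dictates.

```go
package main

// entry is one key/value pair as it would come out of an in-order iterator.
type entry struct {
	key string
	val uint64
}

// mergeSorted merges segments that are each sorted by key with no repeats.
// Segments are passed newest first; when a key appears in several segments,
// the value from the earliest argument wins. The result is sorted, which is
// the order a builder for a new, larger FST needs.
func mergeSorted(segments ...[]entry) []entry {
	pos := make([]int, len(segments))
	var out []entry
	for {
		// Pick the segment whose current key is smallest. The strict
		// comparison makes ties go to the newest segment.
		best := -1
		for i, seg := range segments {
			if pos[i] >= len(seg) {
				continue
			}
			if best == -1 || seg[pos[i]].key < segments[best][pos[best]].key {
				best = i
			}
		}
		if best == -1 {
			return out
		}
		e := segments[best][pos[best]]
		// Advance every segment sitting on this key so that older
		// duplicates are dropped.
		for i, seg := range segments {
			if pos[i] < len(seg) && seg[pos[i]].key == e.key {
				pos[i]++
			}
		}
		out = append(out, e)
	}
}
```

With many segments a heap would pick the smallest key faster than this linear scan, but the shape of the loop stays the same.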

------
ganfortran
Meaning a high-performance auto-completion feature is possible in Go now? Neat

