
Finite state machines as data structures (2015) - heydenberk
https://blog.burntsushi.net/transducers/#finite-state-machines-as-data-structures
======
unhammer
(with proper formatting this time)

Note that it is perfectly possible to have more complicated values associated
with FST keys, just not in the
[https://docs.rs/fst/0.3.3/fst/map/index.html](https://docs.rs/fst/0.3.3/fst/map/index.html)
implementation. FSTs _can also be cyclic_ – this lets you represent things
you couldn't with just a hash table, such as maps with infinitely many keys.

-----

Anyone who wants to play around with this should try the HFST[1] library,
which lets you create compact and possibly cyclic string-to-string maps, which
are closed under union, reversal, inversion and composition (intersection and
difference are also available, but transducers are only closed under those in
restricted cases, e.g. when every arc has input-equals-output). HFST makes it
quite easy to do different operations on FSTs:

    
    
        $ echo 'c a t 0:%+N 0:%+Sg' | hfst-regexp2fst > cat.fst
        $ echo cat |hfst-lookup -q cat.fst
        cat     cat+N+Sg        0,000000
    

(the default is for each arc to have input-equals-output, but you can use : to
map inputs to outputs, and use % to escape special characters; 0 is
epsilon/"nothing")
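
To make the arc model concrete, here is a minimal pure-Python sketch of the same `cat.fst`. It is not HFST's implementation; `compile_path`, `lookup` and the `EPS` marker are names made up for this illustration. Each arc carries an input symbol and an output symbol, and an epsilon input consumes nothing from the word.

```python
from collections import defaultdict

EPS = ""  # illustrative stand-in for HFST's 0 / @0@ (epsilon)

def compile_path(pairs):
    """Build a linear transducer from (input, output) symbol pairs.

    Returns (arcs, finals); arcs maps state -> [(inp, out, next_state)].
    State 0 is the start state.
    """
    arcs = defaultdict(list)
    for state, (inp, out) in enumerate(pairs):
        arcs[state].append((inp, out, state + 1))
    return arcs, {len(pairs)}

def lookup(fst, word):
    """Return every output string the transducer pairs with `word`.

    Assumes there are no epsilon-input cycles (true for this linear machine).
    """
    arcs, finals = fst
    results = []
    stack = [(0, 0, "")]  # (state, position in word, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(word) and state in finals:
            results.append(out)
        for inp, o, nxt in arcs.get(state, []):
            if inp == EPS:
                # epsilon input: emit output without consuming a letter
                stack.append((nxt, pos, out + o))
            elif pos < len(word) and word[pos] == inp:
                stack.append((nxt, pos + 1, out + o))
    return results

# Same shape as 'c a t 0:%+N 0:%+Sg'
cat = compile_path([("c", "c"), ("a", "a"), ("t", "t"), (EPS, "+N"), (EPS, "+Sg")])
print(lookup(cat, "cat"))   # ['cat+N+Sg']
print(lookup(cat, "cats"))  # []
```

An empty result list plays the role of the `+?` / `inf` line that `hfst-lookup` prints for an unknown word.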

    
    
        $ echo cats |hfst-lookup -q cat.fst
        cats    cats+?  inf
    

That FST only knows the singular, so let's make one for the plural:

    
    
        $ echo 'c a t 0:%+N s:%+Pl' | hfst-regexp2fst > cats.fst
        $ echo cats |hfst-lookup -q cats.fst
        cats    cat+N+Pl        0,000000
    

and combine them:

    
    
        $ hfst-union -1 cat.fst -2 cats.fst >feline.fst
        $ hfst-fst2strings feline.fst
        cat:cat+N+Sg
        cats:cat+N+Pl
    

and make it go from analysis to form:

    
    
        $ hfst-invert feline.fst > ɟǝlᴉuǝ.fst
        $ echo 'cat+N+Pl' | hfst-lookup -q ɟǝlᴉuǝ.fst
        cat+N+Pl        cats    0,000000
        $ hfst-fst2strings ɟǝlᴉuǝ.fst
        cat+N+Sg:cat
        cat+N+Pl:cats
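
Union and inversion are simple on an explicit arc list. The sketch below is a toy model in Python, not what HFST does internally; the dict layout and the helpers `path`, `union`, `invert` and `pairs` are assumptions of this example. Union adds a fresh start state with epsilon arcs into both machines, and inversion swaps the input and output symbol on every arc.

```python
EPS = ""  # illustrative epsilon marker

def path(symbol_pairs):
    """Linear transducer over (input, output) symbol pairs; start state is 0."""
    arcs = [(i, inp, out, i + 1) for i, (inp, out) in enumerate(symbol_pairs)]
    n = len(symbol_pairs) + 1
    return {"start": 0, "finals": {n - 1}, "arcs": arcs, "n": n}

def union(a, b):
    """Shift b's states past a's and add a new start with epsilon arcs to both."""
    off = a["n"]
    new_start = a["n"] + b["n"]
    arcs = list(a["arcs"])
    arcs += [(s + off, i, o, d + off) for s, i, o, d in b["arcs"]]
    arcs += [(new_start, EPS, EPS, a["start"]),
             (new_start, EPS, EPS, b["start"] + off)]
    finals = set(a["finals"]) | {f + off for f in b["finals"]}
    return {"start": new_start, "finals": finals, "arcs": arcs, "n": new_start + 1}

def invert(t):
    """Swap input and output on every arc."""
    return {**t, "arcs": [(s, o, i, d) for s, i, o, d in t["arcs"]]}

def pairs(t):
    """Enumerate all (input, output) string pairs; only valid for acyclic machines."""
    found = set()
    stack = [(t["start"], "", "")]
    while stack:
        state, inp, out = stack.pop()
        if state in t["finals"]:
            found.add((inp, out))
        for s, i, o, d in t["arcs"]:
            if s == state:
                stack.append((d, inp + i, out + o))
    return found

cat = path([("c", "c"), ("a", "a"), ("t", "t"), (EPS, "+N"), (EPS, "+Sg")])
cats = path([("c", "c"), ("a", "a"), ("t", "t"), (EPS, "+N"), ("s", "+Pl")])
feline = union(cat, cats)

print(sorted(pairs(feline)))
print(sorted(pairs(invert(feline))))
```

The `pairs` helper does roughly what `hfst-fst2strings` does for these small acyclic examples.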
    

this is what the states look like:

    
    
        $ hfst-fst2txt feline.fst
        0       1       c       c       0.000000
        0       6       @0@     @0@     0.000000
        1       2       a       a       0.000000
        2       3       t       t       0.000000
        3       4       @0@     +N      0.000000
        4       5       @0@     +Sg     0.000000
        5       0.000000
        6       7       c       c       0.000000
        7       8       a       a       0.000000
        8       9       t       t       0.000000
        9       10      @0@     +N      0.000000
        10      11      s       +Pl     0.000000
        11      0.000000
    

there's some redundancy, so we should minimize it:

    
    
        $ hfst-minimize feline.fst  >min.fst
        $ hfst-fst2txt min.fst
        0       1       c       c       0.000000
        1       2       a       a       0.000000
        2       3       t       t       0.000000
        3       4       @0@     +N      0.000000
        4       5       @0@     +Sg     0.000000
        4       5       s       +Pl     0.000000
        5       0.000000
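
One way to convince yourself that minimization changed only the shape and not the mapping is to parse both text dumps and compare the string pairs they define. This is a rough Python sketch, assuming the layout shown above (arc lines have five fields, final-state lines have two, and the source state of the first line is the start state); `parse` and `strings` are names invented for this example.

```python
FELINE = """\
0 1 c c 0.000000
0 6 @0@ @0@ 0.000000
1 2 a a 0.000000
2 3 t t 0.000000
3 4 @0@ +N 0.000000
4 5 @0@ +Sg 0.000000
5 0.000000
6 7 c c 0.000000
7 8 a a 0.000000
8 9 t t 0.000000
9 10 @0@ +N 0.000000
10 11 s +Pl 0.000000
11 0.000000
"""

MIN = """\
0 1 c c 0.000000
1 2 a a 0.000000
2 3 t t 0.000000
3 4 @0@ +N 0.000000
4 5 @0@ +Sg 0.000000
4 5 s +Pl 0.000000
5 0.000000
"""

def parse(text):
    """Parse the text dump into (start, arcs, finals, states)."""
    arcs, finals, states, start = {}, set(), set(), None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if start is None:
            start = int(fields[0])
        if len(fields) == 5:  # src dst input output weight
            src, dst = int(fields[0]), int(fields[1])
            # @0@ is epsilon: it contributes nothing to the string
            inp = "" if fields[2] == "@0@" else fields[2]
            out = "" if fields[3] == "@0@" else fields[3]
            arcs.setdefault(src, []).append((inp, out, dst))
            states.update((src, dst))
        else:  # final state line: state weight
            finals.add(int(fields[0]))
            states.add(int(fields[0]))
    return start, arcs, finals, states

def strings(fst):
    """All (input, output) pairs of an acyclic transducer (weights ignored)."""
    start, arcs, finals, _ = fst
    found, stack = set(), [(start, "", "")]
    while stack:
        state, inp, out = stack.pop()
        if state in finals:
            found.add((inp, out))
        for i, o, dst in arcs.get(state, []):
            stack.append((dst, inp + i, out + o))
    return found

feline, minimal = parse(FELINE), parse(MIN)
print(len(feline[3]), "states before,", len(minimal[3]), "states after")
print(strings(feline) == strings(minimal))  # True
```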
    

Now let's make an FST that turns slashes into +-es and increases the weight
for every slash we see (~$[ a ] means any string that does not contain a):

    
    
        $ echo '%/:%+ ~$[ %/ ]' | hfst-regexp2fst | hfst-reweight --end-states-only --addition=1 |hfst-fst2txt
        0       1       /       +       0.000000
        1       1       +       +       0.000000
        1       1       @_IDENTITY_SYMBOL_@     @_IDENTITY_SYMBOL_@     0.000000
        1       1.000000
    
        $ echo '%/:%+ ~$[ %/ ]' | hfst-regexp2fst | hfst-reweight --end-states-only --addition=1 | hfst-repeat > dir.fst
        $ echo /a | hfst-lookup -q dir.fst
        /a      +a      1,000000
    
        $ echo /ab/c | hfst-lookup -q dir.fst
        /ab/c   +ab+c   2,000000
    
        $ echo /ab/c/d//e | hfst-lookup -q dir.fst
        /ab/c/d//e      +ab+c+d++e      5,000000
    

– that's not something you can do with (just) a hash table.
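
For intuition, here is a hand-built two-state machine in Python that gives the same outputs and weights on the lookups above. It is a sketch, not a dump of `dir.fst`: the weight sits on the slash arc instead of on the end state, `ANY_OTHER` is a made-up stand-in for `@_IDENTITY_SYMBOL_@`, and state 0 is treated as final on the assumption that the repetition may occur zero times. The loops on state 1 are what make it cyclic, so it accepts infinitely many inputs.

```python
ANY_OTHER = object()  # matches any symbol that has no explicit arc

# state -> {input symbol: (output symbol or None for identity, weight, next state)}
ARCS = {
    0: {"/": ("+", 1.0, 1)},
    1: {"/": ("+", 1.0, 1), ANY_OTHER: (None, 0.0, 1)},
}
FINALS = {0, 1}

def lookup(word):
    """Return (output, total weight), or None if the word is not accepted.

    Weights are added along the path, as in the tropical semiring.
    """
    state, out, weight = 0, "", 0.0
    for sym in word:
        table = ARCS[state]
        if sym in table:
            o, w, nxt = table[sym]
        elif ANY_OTHER in table:
            _, w, nxt = table[ANY_OTHER]
            o = sym  # identity arc: copy the input symbol
        else:
            return None  # no arc for this symbol in this state
        out += o
        weight += w
        state = nxt
    return (out, weight) if state in FINALS else None

print(lookup("/a"))          # ('+a', 1.0)
print(lookup("/ab/c"))       # ('+ab+c', 2.0)
print(lookup("/ab/c/d//e"))  # ('+ab+c+d++e', 5.0)
print(lookup("a"))           # None
```

A hash table would need one entry per possible path; this table has three arcs.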

-----

On Debians you can install the package `giella-sme` which gives you a 37M
cyclic FST of 587060 states and 1101943 arcs which turns North Sámi word forms
into analyses. North Sámi has productive compounding, so e.g. "school bus
coffee" is a word that I suppose someone might say:

    
    
        $ echo 'skuvlabussegáfe' | hfst-lookup -q /usr/share/giella/sme/analyser-disamb-gt-desc.hfstol |head -1
        skuvlabussegáfe	skuvlabusse+G3+Sem/Veh+N+Err/Orth+SgNomCmp+Cmp#gáfe+Sem/Plant+N+Sg+Nom	10,000000
    

and there's a bit of ambiguity in the analysis:

    
    
        $ echo 'skuvlabussegáfe' | hfst-lookup -q /usr/share/giella/sme/analyser-disamb-gt-desc.hfstol |wc -l
        64
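
Both the compounding and the ambiguity come from the cycle: after reading one stem the machine can loop back and read another. The Python sketch below imitates that with a toy lexicon. The stems and `+N` tags are invented for illustration and have nothing to do with the real giella-sme analyser's contents; `analyses` is likewise a hypothetical helper.

```python
# Hypothetical toy lexicon: stem -> list of analyses for that stem
LEXICON = {
    "skuvla": ["skuvla+N"],
    "busse": ["busse+N"],
    "gáfe": ["gáfe+N"],
    "skuvlabusse": ["skuvlabusse+N"],
}

def analyses(word, lexicon=LEXICON):
    """All ways to read `word` as one or more lexicon stems, joined by '#'."""
    results = []
    stack = [(0, [])]  # (position in word, analyses of the stems read so far)
    while stack:
        pos, parts = stack.pop()
        if pos == len(word) and parts:
            results.append("#".join(parts))
            continue
        for stem, tags in lexicon.items():
            if word.startswith(stem, pos):
                for tag in tags:
                    # reading a stem "loops back" to the start of the lexicon
                    stack.append((pos + len(stem), parts + [tag]))
    return sorted(results)

for analysis in analyses("skuvlabussegáfe"):
    print(analysis)
# skuvla+N#busse+N#gáfe+N
# skuvlabusse+N#gáfe+N
```

With only four stems there are already two readings; a real lexicon with several tags per stem multiplies these quickly.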
    
    
    

[1] [https://github.com/hfst/hfst/](https://github.com/hfst/hfst/), using
among others
[http://openfst.org/twiki/bin/view/FST/WebHome](http://openfst.org/twiki/bin/view/FST/WebHome)
under the hood. `sudo apt install hfst` on Debians

~~~
burntsushi
Can you say more about the trade-offs? In your cyclic formulation, can you
still build nearly minimal FSTs in a streaming linear fashion with constant
heap memory?


