
Finite state machines as data structure for representing ordered sets and maps - dbaupp
http://blog.burntsushi.net/transducers/
======
haberman
Interesting stuff. The article is quite verbose -- here is the CliffsNotes
version:

The trie is a well-known data structure for storing string sets or string-
keyed maps
([https://en.wikipedia.org/wiki/Trie](https://en.wikipedia.org/wiki/Trie)).
You can compress a trie by using a DAG/FSM instead of a tree -- this lets you
share states for suffixes, whereas trees only let you share states for
prefixes. But for large sets, minimizing the DAG over the whole trie is too
expensive. If you insert the keys in sorted order, you can minimize the DAG
on-the-fly much more cheaply.
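A compact sketch of that on-the-fly minimization, loosely following the
Daciuk/Mihov/Watson scheme (this is my own Python illustration, not the
article's Rust code; all names are mine): keep a registry of states keyed by
their outgoing transitions, and after each sorted insertion, fold the part of
the previous key's path that can no longer change into the registry.

```python
class State:
    def __init__(self):
        self.final = False
        self.edges = {}  # label -> State

    def key(self):
        # Hashable signature: two states with equal keys accept the same
        # set of suffixes (their children are already canonical).
        return (self.final,
                tuple(sorted((c, id(s)) for c, s in self.edges.items())))

class Dafsa:
    def __init__(self):
        self.root = State()
        self.registry = {}  # signature -> canonical State
        self.prev = ""

    def insert(self, word):
        assert word > self.prev, "keys must be inserted in sorted order"
        # Length of the common prefix with the previous key.
        common = 0
        while (common < len(self.prev) and common < len(word)
               and self.prev[common] == word[common]):
            common += 1
        # States past the common prefix can never change again, so they
        # are safe to minimize now.
        self._minimize(common)
        # Append the new word's unique suffix as fresh states.
        node = self.root
        for c in word[:common]:
            node = node.edges[c]
        for c in word[common:]:
            node.edges[c] = State()
            node = node.edges[c]
        node.final = True
        self.prev = word

    def finish(self):
        self._minimize(0)

    def _minimize(self, down_to):
        # Replace states along the previous key, bottom-up, with
        # registered equivalents where one exists.
        path = [self.root]
        for c in self.prev:
            path.append(path[-1].edges[c])
        for i in range(len(self.prev), down_to, -1):
            canon = self.registry.setdefault(path[i].key(), path[i])
            path[i - 1].edges[self.prev[i - 1]] = canon

    def contains(self, word):
        node = self.root
        for c in word:
            if c not in node.edges:
                return False
            node = node.edges[c]
        return node.final
```

For example, inserting "mon", "mop", "son", "sop" in order yields just four
states (root, one shared "o"-state, one shared branching state, one shared
final state), since both the "m"/"s" prefixes and the "n"/"p" suffixes
collapse together.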

If you have a map instead of a set, the DAG is called an FST (finite state
transducer) instead of a finite state acceptor. In this case every edge has
some value that you accumulate into the final result. For example, if the map
values are integers, each edge can carry an integer that is added to the final
result. It takes a bit more cleverness in the algorithm to share prefix/suffix
states while maintaining the transducer invariants.
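To make the edge-output idea concrete, here is a tiny hand-built transducer
for the map {"jan": 1, "jun": 6, "jul": 7} -- a plain-Python sketch of my own,
not the `fst` crate's actual representation; the state numbering and output
placement are illustrative:

```python
# Each transition is label -> (next_state, output); looking up a key
# sums the outputs along its path. A shared prefix edge carries as much
# of the value as is common to every key passing through it.
fst = {
    0: {"j": (1, 1)},               # every value is at least 1
    1: {"a": (2, 0), "u": (3, 5)},  # "ju..." adds 5 more
    2: {"n": (4, 0)},               # jan = 1
    3: {"n": (4, 0), "l": (4, 1)},  # jun = 1+5, jul = 1+5+1
    4: {},                          # shared final state
}
final_states = {4}

def lookup(word):
    state, total = 0, 0
    for c in word:
        if c not in fst[state]:
            return None
        state, out = fst[state][c]
        total += out
    return total if state in final_states else None
```

So `lookup("jul")` follows j (+1), u (+5), l (+1) and returns 7, while
`lookup("ja")` returns None because state 2 is not final.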

~~~
burntsushi
OP here. That is an excellent summary. Thank you for that. :-)

The length of these articles always surprises me. I start off with a simple
idea: "I just want to explain this data structure." But when I join that with
my desired target audience (some programming, some data structure experience),
it takes a lot to explain and build up each idea with good examples!

~~~
nickpsecurity
Other commenters mention that you've rediscovered an old technique. You still
get my props for being clever enough to do that. :) Nonetheless, it might be
worth digging through any surveys of FSMs now to see what else you might
find, eh?

~~~
burntsushi
Well, I wouldn't say I rediscovered it. I knew it was old. Sorry if I had
given the wrong impression! My references section[1] includes citations going
back to at least 2000, but the technique is certainly older than that. (I'm
not so sure the algorithm for construction presented in the article is older
though. As far as I know, it was a novel result for Daciuk, Mihov and Watson
in 2000.)

[1] -
[http://blog.burntsushi.net/transducers/#references](http://blog.burntsushi.net/transducers/#references)

~~~
nickpsecurity
Yes, my skimming didn't catch that. So, you started with something that was
once shown to work and got results with modern tech/problems. Still a good way
to do programming.

~~~
burntsushi
Hmm. I'm not quite sure what you mean. I guess I had two primary motivations
for writing the blog:

1. The knowledge I gained while writing the `fst` crate was difficult to
obtain. I hoped to distill some of it down into a more easily consumable
format so that others could learn more quickly than I could.

2. Advertise a real implementation that someone can actually use, or failing
that, port to another language.

~~~
nickpsecurity
Makes sense. Posts on solutions promoting comprehension and immediate utility
are always good to have. You seem to have pulled it off, except that it's
maybe too long. You already noted that, though.

------
mckoss
Also called DAWGs (Directed Acyclic Word Graphs [1]). John Resig blogged about
a similar problem 5 years ago, for which I wrote a JavaScript solution [2].

[1]
[https://en.m.wikipedia.org/wiki/Directed_acyclic_word_graph](https://en.m.wikipedia.org/wiki/Directed_acyclic_word_graph)

[2] [https://github.com/mckoss/lookups](https://github.com/mckoss/lookups)

~~~
kmike84
The name DAWG is ambiguous; there are two different structures called DAWG.
Sometimes it is used as a synonym for DAFSA (which stores only the keys), and
sometimes it refers to an automaton that contains all substrings of the keys,
not only the keys themselves. The linked DAWG Wikipedia article is about the
wrong one.

------
blackkettle
Nice article. Google has an incredible library for this stuff which is also
worth checking out: [http://www.openfst.org](http://www.openfst.org)

------
tlarkworthy
Have a look at HSMs (hierarchical state machines) if you have not. They fix a
few combinatorial-explosion problems with FSMs. Also known as statecharts,
they are flattened to FSMs when run, but they make expressing complex systems
more user-friendly.

[https://en.wikipedia.org/wiki/UML_state_machine](https://en.wikipedia.org/wiki/UML_state_machine)

~~~
burntsushi
Combinatoric explosion really isn't an issue here for constructing ordered
sets or maps. In fact, the algorithm presented constructs a _minimal_ FSA in
linear time. This blog is basically about hacking FSAs for use as a data
structure rather than for computational purposes. :-)

------
kmike84
I'd like to try compressing DOI URLs; what was the file you used?
[https://archive.org/details/doi-urls](https://archive.org/details/doi-urls)
has 11 files; on my machine they are 4.3GB combined (after uncompressing), and
there is no file that is 2.8GB uncompressed.

~~~
burntsushi
Right. I really should have been better about data provenance, because I know
better than that.

On my disk, I do indeed see `2007.csv`, `2008.csv`, ..., `2013.csv`. Combined,
they are 4.1GB uncompressed. Together they appear to contain 100,488,054
lines (50,244,026 CSV records). Since I only cared about indexing the URLs, I
extracted them from the CSV data with xsv
([https://github.com/BurntSushi/xsv](https://github.com/BurntSushi/xsv)),
which is also one of my creations; this command just pulls out the second
column from each CSV file, but a simple Python script should do nicely as
well:

    
    
        xsv select 2 20*.csv > urls-unsorted
    

And then remove quotes (because xsv correctly encodes CSV data):

    
    
        sed -i 's/^"\(.\+\)"$/\1/g' urls-unsorted
    

Finally, sort and dedupe them:

    
    
        LC_ALL=C sort -u urls-unsorted > urls-sorted
    
    

I just ran this procedure again, and urls-sorted is byte-for-byte equivalent
to the data I used in the blog post.

Judging by the timestamps on the original CSV files, I did this around the
beginning of August. In fact, I had completely forgotten that I had to do this
at all! Sorry about that. When I get a chance, I'll update the blog post.

~~~
kmike84
Thanks for the details! A really nice article, by the way.

------
elcritch
This makes me particularly excited. And yes, that cements me as a nerd/geek,
but oh this is pretty sweet.

It seems like this might have an application to XML processing. The ability to
store and mutate changes efficiently could be a huge boon for a wide range of
xml/html processing.

Multiple levels of indirection, however, could pose a big problem for cache
locality. Is there anything out there on the cache-locality properties of
tries and these finite state machines?

------
jackokring
Nice for fixed indexes, such as all words in all Latin-alphabet languages.

------
danieldk
burntsushi: thanks for the great write-up! Now I know where to point people
who have an interest in this :) (besides Jan Daciuk's thesis).

