Finite state machines as data structure for representing ordered sets and maps (burntsushi.net)
213 points by dbaupp on Nov 12, 2015 | 25 comments



Interesting stuff. The article is quite verbose -- here is the CliffsNotes version:

The trie is a well-known data structure for storing string sets or string-keyed maps (https://en.wikipedia.org/wiki/Trie). You can compress a trie by using a DAG/FSM instead of a tree -- this lets you share states for suffixes, whereas trees only let you share states for prefixes. But for large sets, minimizing the whole trie after the fact is too expensive. If you insert the keys in sorted order, you can minimize the DAG on the fly much more cheaply.
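
If it helps, here is a toy Rust sketch of that sorted-order trick. Everything in it (the `Builder`, `State`, `freeze`, etc.) is invented for illustration and is nothing like the actual fst internals -- it just shows the hash-consing idea: keep the previous key's path on a stack, and when a new key diverges, freeze the old suffix, reusing any state that already exists with the same shape.

    use std::collections::HashMap;

    // A frozen (deduplicated) state: finality plus outgoing transitions.
    #[derive(Clone, PartialEq, Eq, Hash)]
    struct State {
        is_final: bool,
        trans: Vec<(u8, usize)>, // (byte label, index of target state)
    }

    // Builds a minimal acyclic DFA from keys fed in sorted order.
    struct Builder {
        states: Vec<State>,              // frozen states
        registry: HashMap<State, usize>, // frozen state -> its index (this is what shares suffixes)
        stack: Vec<State>,               // not-yet-frozen states along the previous key's path
        prev: Vec<u8>,                   // the previous key
    }

    impl Builder {
        fn new() -> Builder {
            let root = State { is_final: false, trans: vec![] };
            Builder { states: vec![], registry: HashMap::new(), stack: vec![root], prev: vec![] }
        }

        // Freeze every stack entry deeper than `depth`, reusing equal states.
        fn freeze(&mut self, depth: usize) {
            while self.stack.len() > depth + 1 {
                let state = self.stack.pop().unwrap();
                let id = match self.registry.get(&state).copied() {
                    Some(id) => id,
                    None => {
                        let id = self.states.len();
                        self.registry.insert(state.clone(), id);
                        self.states.push(state);
                        id
                    }
                };
                // Hook the frozen state into its parent, now the top of the stack.
                let label = self.prev[self.stack.len() - 1];
                self.stack.last_mut().unwrap().trans.push((label, id));
            }
        }

        fn insert(&mut self, key: &[u8]) {
            // Sorted input means the diverging suffix of the previous key can
            // never gain new transitions, so it is safe to freeze it now.
            let common = key.iter().zip(&self.prev).take_while(|&(a, b)| a == b).count();
            self.freeze(common);
            for _ in common..key.len() {
                self.stack.push(State { is_final: false, trans: vec![] });
            }
            self.stack.last_mut().unwrap().is_final = true;
            self.prev = key.to_vec();
        }

        fn finish(mut self) -> (Vec<State>, usize) {
            self.freeze(0);
            let root = self.stack.pop().unwrap();
            self.states.push(root);
            let root_id = self.states.len() - 1;
            (self.states, root_id)
        }
    }

    fn contains(states: &[State], root: usize, key: &[u8]) -> bool {
        let mut s = root;
        for &b in key {
            match states[s].trans.iter().find(|&&(label, _)| label == b) {
                Some(&(_, next)) => s = next,
                None => return false,
            }
        }
        states[s].is_final
    }

    fn main() {
        let mut builder = Builder::new();
        for key in ["jul", "jun", "mar", "may"] {
            builder.insert(key.as_bytes());
        }
        let (states, root) = builder.finish();
        assert!(contains(&states, root, b"jun") && contains(&states, root, b"may"));
        assert!(!contains(&states, root, b"ju"));
        // A trie for these four keys needs 9 states; sharing suffixes gives 6.
        println!("{} states", states.len());
    }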

If you have a map instead of a set, the DAG is called an FST (finite state transducer) instead of a finite state acceptor. In this case every edge has some value that you accumulate into the final result. For example, if the map values are integers, each edge can carry an integer that is added to the final result. It takes a bit more cleverness in the algorithm to share prefix/suffix states while maintaining the transducer invariants.
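
To make the accumulation concrete, here is a tiny hand-built transducer in the same spirit (the `FstState`/`lookup` names and the state numbering are made up for illustration, not how the fst crate encodes anything). It maps "jun" -> 6 and "jul" -> 7: the shared "ju" prefix carries the common output 6, and the diverging final edges carry the remainders 0 and 1.

    // One state of a toy transducer: each edge carries a byte label, an output
    // value to add when the edge is taken, and the index of the target state.
    struct FstState {
        is_final: bool,
        trans: Vec<(u8, u64, usize)>,
    }

    // Look up a key by summing the outputs along its path.
    fn lookup(states: &[FstState], root: usize, key: &[u8]) -> Option<u64> {
        let mut state = root;
        let mut sum = 0u64;
        for &b in key {
            let &(_, out, next) = states[state].trans.iter().find(|&&(label, _, _)| label == b)?;
            sum += out;
            state = next;
        }
        if states[state].is_final { Some(sum) } else { None }
    }

    fn main() {
        let states = vec![
            FstState { is_final: true, trans: vec![] },                            // 0: shared final state
            FstState { is_final: false, trans: vec![(b'l', 1, 0), (b'n', 0, 0)] }, // 1: after "ju"
            FstState { is_final: false, trans: vec![(b'u', 0, 1)] },               // 2: after "j"
            FstState { is_final: false, trans: vec![(b'j', 6, 2)] },               // 3: root
        ];
        assert_eq!(lookup(&states, 3, b"jun"), Some(6));
        assert_eq!(lookup(&states, 3, b"jul"), Some(7));
        assert_eq!(lookup(&states, 3, b"ju"), None);
    }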


OP here. That is an excellent summary. Thank you for that. :-)

The length of these articles always surprises me. I start off with a simple idea: "I just want to explain this data structure." But when I join that with my desired target audience (some programming, some data structure experience), it takes a lot to explain and build up each idea with good examples!


For the record, I do not consider your writing "verbose". Writing advanced material for a beginner audience necessarily means you will have to explain a lot. That's okay.


Other commenters mention that you've rediscovered an old technique. You still get my props for being clever enough to do that. :) Nonetheless, it might be worth digging through surveys of FSMs now to see what else you might find, eh?


Well, I wouldn't say I rediscovered it. I knew it was old. Sorry if I had given the wrong impression! My references section[1] includes citations going back to at least 2000, but the technique is certainly older than that. (I'm not so sure the algorithm for construction presented in the article is older though. As far as I know, it was a novel result for Daciuk, Mihov and Watson in 2000.)

[1] - http://blog.burntsushi.net/transducers/#references


Yes, my skimming didn't catch that. So, you started with something that was once shown to work and got results with modern tech/problems. Still a good way to do programming.


Hmm. I'm not quite sure what you mean. I guess I had two primary motivations for writing the blog:

1. The knowledge I gained while writing the `fst` crate was difficult to obtain. I hoped to distill some of it down into a more easily consumable format so that others could learn more quickly than I could.

2. Advertise a real implementation that someone can actually use, or failing that, port to another language.
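
For anyone evaluating it: building and querying a map is only a few lines. This is from memory, so take the exact method names with a grain of salt -- the crate docs are authoritative -- but it looks roughly like this:

    use fst::Map;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Keys must be provided in sorted order; values are u64.
        let map = Map::from_iter(vec![("bar", 2u64), ("baz", 3), ("foo", 1)])?;
        assert_eq!(map.get("baz"), Some(3));
        assert_eq!(map.get("quux"), None);
        Ok(())
    }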


Makes sense. Posts that promote both comprehension and immediate utility are always good to have. You seem to have pulled it off, except maybe for the length, which you already noted.


Very nearly the same technique is used in glibc for representing Unicode character classes. Essentially you save a huge amount of space by sharing common leaf nodes of the trie.

I wrote code to construct these quickly during startup for JOE. I also noticed the minimization you can do while building the trie in order: basically, if the last constructed leaf matches the previous one, use it instead. In JOE we also have both sets (for character classes) and maps (for regex and case-conversion tables).
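
Not glibc's or JOE's actual code, but a rough sketch of that leaf-sharing idea for a two-level character-class table (here I dedupe against every previously built leaf with a hash map rather than only the most recent one):

    use std::collections::HashMap;

    // A two-level character-class table with shared leaf blocks. Each leaf
    // covers 256 code points; identical leaves are stored only once.
    fn build_table(is_member: impl Fn(u32) -> bool) -> (Vec<usize>, Vec<[bool; 256]>) {
        let mut index = Vec::new();                    // top level: (cp >> 8) -> leaf id
        let mut leaves: Vec<[bool; 256]> = Vec::new(); // deduplicated leaf blocks
        let mut seen: HashMap<[bool; 256], usize> = HashMap::new();
        for block in 0..0x1100u32 {                    // 0x1100 * 256 covers U+0000..=U+10FFFF
            let mut leaf = [false; 256];
            for low in 0..256u32 {
                leaf[low as usize] = is_member((block << 8) | low);
            }
            let id = *seen.entry(leaf).or_insert_with(|| {
                leaves.push(leaf);
                leaves.len() - 1
            });
            index.push(id);
        }
        (index, leaves)
    }

    fn contains(index: &[usize], leaves: &[[bool; 256]], cp: u32) -> bool {
        leaves[index[(cp >> 8) as usize]][(cp & 0xff) as usize]
    }

    fn main() {
        // Example class: ASCII digits. Every block but the first is the all-false
        // leaf, so thousands of identical leaves collapse into a single shared one.
        let (index, leaves) = build_table(|cp| (b'0' as u32..=b'9' as u32).contains(&cp));
        assert!(contains(&index, &leaves, '7' as u32));
        assert!(!contains(&index, &leaves, 'a' as u32));
        println!("{} index entries, {} distinct leaves", index.len(), leaves.len());
    }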


Also called DAWGs (Directed Acyclic Word Graphs [1]). John Resig blogged about a similar problem 5 years ago, for which I wrote a JavaScript solution [2].

[1] https://en.m.wikipedia.org/wiki/Directed_acyclic_word_graph

[2] https://github.com/mckoss/lookups


The name DAWG is ambiguous: there are two different structures called DAWG. Sometimes it is used as a synonym for DAFSA, and sometimes it refers to an automaton that contains all substrings of the keys, not only the keys themselves. The linked DAWG Wikipedia article describes the wrong one.


My Java library, which does some of the things described in the article (DAWGs, perfect hashing, Levenshtein automata):

https://github.com/danieldk/dictomaton



Similar, but the data structure described here is a different kind of beast. It is more efficient, allows for fuzzy matching, regular expression searches and more.
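
From memory, a fuzzy-match query against the fst crate looks roughly like this (the Levenshtein automaton sits behind a feature flag, and module paths have moved between versions, so treat the exact paths as approximate):

    use fst::automaton::Levenshtein;
    use fst::{IntoStreamer, Set};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Keys must be inserted in sorted order.
        let set = Set::from_iter(vec!["fa", "fo", "fob", "focus", "foo", "food", "foul"])?;

        // All keys within edit distance 1 of "foo".
        let lev = Levenshtein::new("foo", 1)?;
        let keys = set.search(lev).into_stream().into_strs()?;
        assert_eq!(keys, vec!["fo", "fob", "foo", "food"]);
        Ok(())
    }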


Nice article. Google has an incredible library for this stuff which is also worth checking out: http://www.openfst.org


Have a look at HSMs if you have not. They fix a few combinatorial-explosion problems with FSMs. Also known as statecharts. They are flattened to FSMs when run, but they make expressing complex systems more user-friendly.

https://en.wikipedia.org/wiki/UML_state_machine


Combinatorial explosion really isn't an issue here for constructing ordered sets or maps. In fact, the algorithm presented constructs a minimal FSA in linear time. This blog is basically about hacking FSAs for use as a data structure rather than for computational purposes. :-)


I'd like to try compressing DOI URLs; what was the file used? https://archive.org/details/doi-urls has 11 files; on my machine they are 4.3GB combined (after uncompressing), and there is no file that is 2.8GB uncompressed.


Right. I really should have been better about data provenance, because I know better than that.

On my disk, I do indeed see `2007.csv`, `2008.csv`, ..., `2013.csv`. Combined, they are 4.1GB uncompressed. Together they appear to contain 100,488,054 lines (50,244,026 CSV records). Since I only cared about indexing the URLs, I extracted them from the CSV data (this command just pulls out the second column from the CSV files using xsv, which is also one of my creations: https://github.com/BurntSushi/xsv --- but a simple Python script should do nicely as well):

    xsv select 2 20*.csv > urls-unsorted
And then remove quotes (because xsv correctly encodes CSV data):

    sed -i 's/^"\(.\+\)"$/\1/g' urls-unsorted
Finally, sort and dedupe them:

    LC_ALL=C sort -u urls-unsorted > urls-sorted

I just ran this procedure again, and urls-sorted is byte-for-byte equivalent to the data I used in the blog post.

Judging by the timestamps on the original CSV files, I did this around the beginning of August. In fact, I had completely forgotten that I had to do this at all! Sorry about that. When I get a chance, I'll update the blog post.


Thanks for the details! A really nice article, by the way.


This makes me particularly excited. And yes, that cements me as a nerd/geek, but oh this is pretty sweet.

It seems like this might have an application to XML processing. The ability to store and mutate changes efficiently could be a huge boon for a wide range of XML/HTML processing.

Multiple indirect references, however, could pose a big problem for cache locality. Is there any data out there on the cache-locality properties of tries and these finite state machines?


Nice for fixed indexes such as all words in all Latin-alphabet languages.


burntsushi: thanks for the great write-up! Now I know where to point people who have an interest in this :) (besides Jan Daciuk's thesis).


[flagged]


Note to flaggers and downvoters: This comment is actually a comment on one of the state machines in the article.


Thanks @teddyh; glad someone got it :).



