The trie is a well-known data structure for storing string sets or string-keyed maps (https://en.wikipedia.org/wiki/Trie). You can compress a trie by using a DAG/FSM instead of a tree -- this lets you share states for suffixes, whereas trees only let you share states for prefixes. But for large sets, minimizing the DAG after building the whole trie is too expensive. If you insert the keys in sorted order, you can minimize the DAG on the fly much more cheaply.
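For anyone curious what the on-the-fly minimization looks like, here is a minimal Python sketch of the sorted-order construction -- my own toy code in the spirit of the Daciuk et al. incremental algorithm, not the `fst` crate's implementation. The key idea: once a new key arrives, everything past its common prefix with the previous key is frozen, so those states can be checked against a register of already-built states and shared.

```python
class Node:
    __slots__ = ("edges", "final")

    def __init__(self):
        self.edges = {}   # char -> Node
        self.final = False

    def signature(self):
        # A node's identity for minimization: finality plus outgoing edges.
        return (self.final, tuple(sorted((c, id(n)) for c, n in self.edges.items())))


class SortedDawg:
    """Builds a minimal DAWG from keys inserted in strictly sorted order."""

    def __init__(self):
        self.root = Node()
        self.register = {}        # signature -> canonical node
        self.prev_key = ""
        self.stack = [self.root]  # path of nodes spelling out the previous key

    def insert(self, key):
        assert key > self.prev_key, "keys must arrive in strictly sorted order"
        # Length of the common prefix with the previous key.
        cp = 0
        while cp < min(len(self.prev_key), len(key)) and self.prev_key[cp] == key[cp]:
            cp += 1
        # The previous key's suffix is now frozen: minimize it bottom-up.
        self._minimize(cp)
        # Append the new suffix as a fresh chain of nodes.
        node = self.stack[-1]
        for ch in key[cp:]:
            nxt = Node()
            node.edges[ch] = nxt
            self.stack.append(nxt)
            node = nxt
        node.final = True
        self.prev_key = key

    def _minimize(self, down_to):
        while len(self.stack) - 1 > down_to:
            child = self.stack.pop()
            parent = self.stack[-1]
            sig = child.signature()
            if sig in self.register:
                # An equivalent state already exists: share it.
                parent.edges[self.prev_key[len(self.stack) - 1]] = self.register[sig]
            else:
                self.register[sig] = child

    def finish(self):
        self._minimize(0)
        return self.root

    def contains(self, key):
        node = self.root
        for ch in key:
            node = node.edges.get(ch)
            if node is None:
                return False
        return node.final
```

The register makes suffix sharing automatic: e.g. the final "s" states of "cats" and "facts" collapse into one node, because their signatures match.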
If you have a map instead of a set, the DAG is called an FST (finite state transducer) instead of a finite state acceptor. In this case every edge carries some value that you accumulate into the final result. For example, if the map values are integers, each edge can carry an integer that is added to the final result. It takes a bit more cleverness in the algorithm to share prefix/suffix states while maintaining the transducer invariants.
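To make the lookup side concrete, here is a tiny hand-built sketch (the states and output values below are chosen by hand for illustration, not produced by any library). Each state is a pair of (final output or None, edge table); the value of a key is the sum of the edge outputs along its path plus the final output at the accepting state:

```python
# State = (final_output or None, {char: (edge_output, next_state)}).
# final_output is not None iff the state is accepting.

def fst_lookup(root, key):
    state = root
    total = 0
    for ch in key:
        if ch not in state[1]:
            return None
        out, state = state[1][ch]
        total += out
    if state[0] is None:
        return None
    return total + state[0]

# Hand-built transducer for {"jul": 7, "jun": 6}. The shared prefix "ju"
# carries the common part of the output (6, the minimum), and the final
# edges carry the remainders -- this is how prefix states stay shareable.
s3 = (0, {})
s2 = (None, {"l": (1, s3), "n": (0, s3)})
s1 = (None, {"u": (0, s2)})
root = (None, {"j": (6, s1)})
```

So fst_lookup(root, "jul") walks j(+6), u(+0), l(+1), final(+0) and returns 7, while "jun" returns 6 -- both paths share the prefix states and the final state.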
The length of these articles always surprises me. I start off with a simple idea: "I just want to explain this data structure." But when I join that with my desired target audience (some programming, some data structure experience), it takes a lot to explain and build up each idea with good examples!
 - http://blog.burntsushi.net/transducers/#references
1. The knowledge I gained while writing the `fst` crate was difficult to obtain. I hoped to distill some of it down into a more easily consumable format so that others could learn more quickly than I could.
2. Advertise a real implementation that someone can actually use, or failing that, port to another language.
I wrote code to construct these quickly during startup for JOE. I also noticed the minimization you can do while building the trie in sorted order: basically, if the last constructed leaf matches the previous one, use it instead. In JOE we also have sets (for character classes) and maps (for regexes and case conversion tables).
On my disk, I do indeed see `2007.csv`, `2008.csv`, ..., `2013.csv`. Combined, they are 4.1GB uncompressed and contain 100,488,054 lines (50,244,026 CSV records). Since I only cared about indexing the URLs, I extracted them from the CSV data. The first command below pulls out the second column from each CSV file using xsv, which is another of my creations: https://github.com/BurntSushi/xsv --- but a simple Python script should do nicely as well.
xsv select 2 20*.csv > urls-unsorted
LC_ALL=C sort -u urls-unsorted > urls-sorted
sed -i 's/^"\(.\+\)"$/\1/g' urls-sorted
Judging by the timestamps on the original CSV files, I did this around the beginning of August. In fact, I had completely forgotten that I had to do this at all! Sorry about that. When I get a chance, I'll update the blog post.
It seems like this might have applications to XML processing. The ability to store and mutate changes efficiently could be a huge boon for a wide range of XML/HTML processing tasks.
Multiple levels of indirection, however, could pose a big problem for cache locality. Are there any studies out there on the cache-locality properties of tries and these finite state machines?