
Show HN: Subsequential Transducer for Efficient Text Rewriting - deniskyashif
https://github.com/deniskyashif/ssfst
======
k2xl
Could someone explain why this is better than simple string substitution? Is
regular string substitution much slower than this?

~~~
binarymax
I am not yet 100% certain so I requested full-text access to read the paper,
but I can say that a naive approach to string substitution will certainly be
less performant than this, and offer less flexibility in replacement
complexity without significant effort.

The reason being that in a naive approach, a vocabulary of size M and a
document of token size N is an O(m*n) operation in the best case - while this
is claimed to be an operation of O(2n).

------
andy_ppp
I’ve wanted this specifically for terms and conditions documents for a long
while... any thoughts on it’s applicability to this subject; so many legal
document sections are the same but you can’t just copy them.

~~~
tyingq
What would it do that some basic templating library wouldn't do?

Not knowing the specifics, I would lean toward a template library, as it would
be more precise, and could account for things like changing gender, plurality,
a vs an, etc, in a structured way.

------
brudgers
If it meets the guidelines, this might make a good 'Show HN'. Show HN
guidelines:
[https://news.ycombinator.com/showhn.html](https://news.ycombinator.com/showhn.html)

------
binarymax
Interesting project. Question on this: _" For any text t of length |t| the
time it takes to perform a rewrite is O(|t|+|t'|) where t' denotes the
resulting output string"_

Wouldn't the vocabulary size fit into the order complexity? Vocabs that would
be considered useful in this context tend to be quite large. Are you achieving
worst case logn of search in the vocab?

~~~
deniskyashif
Hi, thanks! The outputs are encoded in the transitions and the states
themselves, so the vocabulary size doesn't affect the search. It affects,
however, the construction because for each node in the trie we have to add an
outgoing transition for each distinct symbol of the input vocabulary.

