
TRE: A Regex Engine with Approximate Matching - totalperspectiv
http://ducktape.blot.im/tre-a-regex-engine-with-approximate-matching
======
chaoxu
An interesting theoretical problem that arises: Given a regular language L,
define LD(L,k) to be the set of strings with Levenshtein distance at most k
from some string in L.

1\. Is LD(L,k) regular? (yes)

2\. How hard is it to construct a regular expression(or DFA, NFA) of LD(L,k)
given a regular expression(or DFA, NFA) of L? (probably it is something k
times the original size in terms of NFA)

I believe a constant k here does not make much sense. since we might match
super long strings, then having more errors seems ok. Maybe we can ask for the
following:

LDM(L,ε) is the set of strings, such that if x in LDM(L,ε), then there is a y
in L, such that distance between x and y is at most ε|y|.

LDM(L,ε) does not seem to be a regular language.

~~~
poorcedure
If memory serves, it returns the number of deletes/inserts/edits. So what you
mention can still be handled.

~~~
poorcedure
Nevermind, you can set the cost/max of each:

[https://laurikari.net/tre/documentation/regaexec/](https://laurikari.net/tre/documentation/regaexec/)

------
tyingq
Looks pretty handy, but wish it could capture the actual edit distance
somewhere. You can set a max, but many use cases call for knowing how close
the match was.

------
vram22
Apropos of regex engines: PCRE [1] is an interesting and somewhat widely used
[2] C library for Perl-Compatible Regular Expressions.

I was once on a product team where we used PCRE, and one of my tasks was
building it from its C source on many Unix variants (such as Linux, Solaris,
HP-UX, AIX, Tru64 UNIX, etc.) and Windows (x 32/64-bit OS versions for some of
them). I learned something about the differences between the C compiler
toolchains on all those OSes in the process. There was quite a bit of
variation between them, in the commands and steps needed to compile and link
programs.

[1] [https://www.pcre.org/](https://www.pcre.org/)

[https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expres...](https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions)

[https://en.wikipedia.org/wiki/Philip_Hazel](https://en.wikipedia.org/wiki/Philip_Hazel)
(PCRE creator)

[2] From the Wikipedia article about PCRE:

>A number of prominent open-source programs, such as the Apache and Nginx HTTP
Servers, and the PHP and R scripting languages, incorporate the PCRE library;
proprietary software can do likewise, as the library is BSD licensed. As of
Perl 5.10, PCRE is also available as a replacement for Perl's default regular
expression engine through the re::engine::PCRE module.

>The library can be built on Unix, Windows, and several other environments.
PCRE is distributed with a POSIX C wrapper, a native C++ wrapper,[a] several
test programs, and the utility program pcregrep built in tandem with the
library.

------
poorcedure
Years ago in a pinch, I used this library to isolate data in OCR results.
Looking for phone numbers or tax IDs, it would return values with a B in it
and you could correct to an 8 or other similar transforms. Probably much
better tools for this now, but at the time it was extremely useful.

------
gitgud
Interesting, would help with [1] ReDOS attacks? It would seem that fuzzy
matching could put a upper limit on execution time right?

[1]
[https://en.m.wikipedia.org/wiki/ReDoS](https://en.m.wikipedia.org/wiki/ReDoS)

~~~
onlydnaq
I believe the correct way to protect against ReDOS is to use one of the many
regular expression implementations that do not have exponential worst cases.

Nowadays there are good DFA or Tree based matching algorithms that are O(mn)
where m is the length of the regular expression and n is the length of the
input.

------
laumars
This looks pretty interesting. Anyone know if this library has been ported to
any other languages aside Perl?

~~~
tyingq
C, Perl, Python, and Haskell:
[https://laurikari.net/tre/faq/](https://laurikari.net/tre/faq/)

