Hacker News new | past | comments | ask | show | jobs | submit login
Programming Techniques: Regular expression search algorithm (1968) (acm.org)
32 points by userbinator on Aug 12, 2023 | hide | past | favorite | 1 comment



This is Thompson's Construction.

There is a nice description given by Russ Cox:

https://swtch.com/~rsc/regexp/regexp1.html

The following project has an interesting implementation in Elixir, which converts the NFA directy into a process network:

https://github.com/mike-french/myrex

The network runs all possible traversals concurrently without back-tracking. It automatically scales to use all cores (Erlang BEAM runtime). Multiple input strings can be processed concurrenty. It can also generate matching strings concurrently (Monte Carlo). It implements captures and Unicode character sets.

While it is designed for concurrency, it is not meant to be the fastest regex implementation. There is an example of a highly ambiguous match that launches 900k traversals and reports all capture results in about 10s.

There is a short proof sketch for the formula to calculate the number of possible matches in this case as a dot product of vectors sliced from Pascal's Triangle.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: