
Fast request routing using regular expressions - nikic
http://nikic.github.io/2014/02/18/Fast-request-routing-using-regular-expressions.html
======
sjtgraham
Are regular expressions are good choice for routing? A trie seems like a
better choice than a sequence of regexps. O(n) vs O(log n).

~~~
skj
Or, O(n^2) vs O(n log n).

~~~
skj
Now that I read the article, seems like they're merging all the regexps into
one, which really could be O(n). However, the trie will not be O(log n). You
have to at least look at every character in the input, so there's a hard O(n)
lower bound.

~~~
sjtgraham
You're confusing n size of route set, with k length of the path to be matched.
With a trie the worst case running time to match a route grows O(log n) with
an array of regexps it's O(n). The length of a path string to be matched is
independent of how many routes are in the route set.

~~~
amalcon
In theory, the combined regex could be O(1) in the size of the route set.

Of course, in practice it's not. PCRE is backtracking-based rather than
automaton-based, so increasing the complexity of the regex does have an effect
on runtime.

~~~
jerf
Which probably means we should be using a NFA-based engine for routing, rather
than PCRE, because I don't think "don't backtrack" is a very serious
restriction for routing. (Yes, of the hundreds of developers who may read
this, one of them may have used backtracking in a routing expression. But in
general I don't think it's that serious a restriction.)

------
joosters
It's a shame that the regex API doesn't seem to provide a way to indicate
which part of a (?:foo|bar|...) matched. The underlying regex code must know
this. A large part of the article is coming up with workarounds to solve this
problem.

~~~
JadeNB
> It's a shame that the regex API doesn't seem to provide a way to indicate
> which part of a (?:foo|bar|...) matched. The underlying regex code must know
> this.

As lmm points out (I think), one can get a big advantage precisely by _not_
requiring the regex code to know this. (Regexes are 'really' DFAs, and a DFA
that makes it to its start state not just _does_ not but _can_ not know how it
got there.) The more power one gives regexes, the less, well, regular they
are, and the slower they become to match.

~~~
_delirium
A DFA can give you more than just a yes/no decision, though sometimes they're
narrowly defined as only machines that give yes/no decisions. But if you have
multiple accept states and number or label the states, a DFA can report not
only 'accept' but 'accept' plus the label of the accepting state (this could
be seen as a special case of a Moore machine, in which you only care about the
last output).

You'd have to compile the DFA differently, of course. A really simple version
would just compile each alternation into its own DFA separately, and then
chain them together with "try each in sequence". That's still a DFA, so it's
not a matter of a different class of computations, though it's often slower by
a constant factor (due to not being able to share any information between the
alternations when matching them).

~~~
JadeNB
Good point: a DFA can do _exactly_ what joosters asked, namely, distinguish
between "matched regex '(a|b)' to input 'a'" and "matched regex '(a|b)' to
input 'b'". However, it may be worth noting—as I'm sure you know, but others
may not—that there is no way for even a juiced-up DFA to say, for example, how
many a's were actually in the input when the regex is 'a * ' (spaces to fool
the Markdown processor).

P.S. I might as well mention that I obviously meant to refer to a DFA reaching
its end, rather than start, state.

------
lmm
Fast until you hit the catastrophic backtracking problem, and then your
routing code spins indefinitely. I've hit a production issue of this form.

For this kind of problem you need to use actual regular expressions, not
Perl's bastardization of the concept. Or produce your own DFA - it's not as
hard as it looks, and gives you a much better idea what's going on.

~~~
nikic
You won't hit catastrophic backtracking problems with the types of regular
expressions you need for routing.

Even if you do use some very weird expression, the worst that can happen is
that you hit the backtracking limit and a route fails to match because of
that. There won't be any code spinning indefinitely.

------
jheriko
its interesting to see this. i don't often use regular expressions but for the
most general case of this problem they are a good fit - however the specific
example given, and most real world cases i imagine can be optimised further by
doing away with regex altogether and using more traditional lookup and
transformation approaches.

the example given can determine the route required by everything following
/user/ \- in the case that the 7th character of the string is a digit then we
have the last case - all that remains to distinguish the other two is to count
the number of slashes - if you want to distinguish bad matches then validating
the extracted values will be necessary too. of course this kind of logic fails
to scale to the general case - but for real world optimisation i can imagine
it being much more fruitful (possibly you could do a match like i described
faster than the regex function will parse its own input string before doing
any matching at all!)

there is the smell of a very classical problem here though - "guess
optimising". optimisation should be driven by measurement - identify the slow
part then optimise it. timings that show the performance of the solutions
overall are interesting and prove the point - but what would be more
interesting imo would be 'this was the hot spot (point at timing), therefore i
was right (point at new timing)'.

------
adamtj
ctrl+f tree

No results.

The chunking seems like a missed opportunity. It still uses a loop, so it's
still O(n) in the number of routes. That is, 200 routes will take twice as
long. 1000 routes will take 10 times as long. But maybe 100 routes is all we
ever see in reality?

Regardless, it's a small leap from that to using a tree. I would like to have
seen timings for that, and for larger numbers of routes.

Building a trie from a list of strings is trivial. Building one from a list of
regexes is not. But you don't need the complexity of inspecting or
interpreting the regexes.

With a regular binary tree, you'll need to compile multiple regexes. The root
is a regex with all urls in only two groups: (a|b|c|d)|(e|f|g|h). If the first
group is non-empty, then you move to the next node: (a|b)|(c|d). If the second
group is non-empty, then you try (c)|(d). If the first group of that is non-
empty, then the url was c. That should be O(k * log n) where k is the length
of the url and n in the number of urls.

The grouping method requires 10 regex matches for 100 routes. 2^10 == 1024, so
the tree method can do 10 times more routes in about the same amount of work.
A million routes with a tree is only twice as hard as a thousand, or twice as
hard as a hundred with chunking. A million routes with chunking in 10,000
times harder than a hundred.

On the other hand, the tree requires O(n * log n) memory, where n chunks of
constant size requires only O(n) memory. Perhaps that's significant.

edit: misplaced asterisks triggered italics.

edit2: Actually, you don't even need to add or check groups. At each node you
only need to combine half the regexes into one with no outer group. For
example, the top node regex from above could be just a|b|c|d. If it matches,
go to the left node (which is just a|b). If it doesn't match, go to the right
node (which is just e|f). As an edge case, the right-most tip node will all
need to have a regex to distinguish between it and non-matching urls. This
also makes it easier to dynamically add routes without recompiling the whole
tree. Just add the route to the right side of each node, which requires no
work until you get to the rightmost tip. Occasionally rebalance. Is
dynamically modifying the routes something that people do?

~~~
adamtj
Or how about this: a single regex, but with groups that you can walk like a
tree:

(((a)|(b))|((c)|(d))) | (((e)|(f))|((g)|(h)))

The number of groups will be O(n * log n). So 10-20 times the number of urls
at most. It's certainly not quadratic.

The groups will in depth-first order. It's slightly complicated by the fact
that the regexes contain capturing groups themselves, but you can precompute
an offset table that accounts for them.

Or give each tree-group a name:

(?<0>(?<00>(?<000>a)|(?<001>b))|(?<01>(?<010>c)|(?<011>d))) |
(?<1>(?<10>(?<100>e)|(?<101>f))|(?<11>(?<110>g)|(?<111>h)))

If you use names like that, walking the tree is easy. Just concatenate "0" or
"1" to the current node node to find the child nodes. Also, the final matching
group name is a binary number corresponding to matching route's index into the
list of routes. For example, url "e" has the name "100". That's a binary 4,
which is url "e"'s position in the list.

------
jeremymcanally
I built a router like this for Rails back in the 2.3 era. It was ridiculously
fast for small route sets, but as the sets grew, the standard approach was
actually faster. If your route set is small and routing is a bottleneck,
though, this approach might work for you. :)

(I don't know how it would stand up against Journey, the Rails 4 router
business. I imagine it's only gotten faster so there may be next to no gain)

~~~
nikic
Did you read on to the end? I also had the problem that performance degraded
with large route sets, but resolved it by matching in chunks of ten routes.

------
tlarkworthy
Oh that is so neat how you bundle the whole thing up and then extract which
one was hit! Bravo at exploiting the significant regex compiling technology to
the max. That's one trick that I will remember.

------
zimpenfish
That's a nice clear exposition of the thought process which most of us would
go through to (eventually) get a decent solution.

------
adnam
This ... is insane

~~~
rjbond3rd
Why?

~~~
adnam
Because it would be a nightmare to maintain.

~~~
nikic
It's not like you're writing the regular expressions and lookup tables per
hand. You write the routes in whatever format you deem nice and it's
automatically translated from there ;)

