
We sacrifice features and compatibility. We have sacrificed zero correctness as far as we are aware. We maintain wart-for-wart compatibility with libpcre in order to have a good basis for comparison for fuzzing, and we believe we recognize 100% of libpcre constructs even when we don't support them.

Our semantics differ, but roughly speaking, we provide the set of matches that libpcre would produce if you hooked up a callout that rejects each match and sends libpcre back to look for another (our ordering will be different, but is also well defined).
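
Concretely, here's a minimal sketch of that callout trick using the newer pcre2 API (the pattern, subject and harness are illustrative, not part of Hyperscan): append a callout to the pattern and have it reject every match, and the engine backtracks and bumps along through all of them.

    /* Sketch: enumerate every match the engine can reach by appending
     * a callout and rejecting each match it reports. A positive return
     * from the callout fails the match at that point, so the engine
     * backtracks (and then bumps the start position along), giving
     * roughly the "all matches" semantics described above. */
    #define PCRE2_CODE_UNIT_WIDTH 8
    #include <pcre2.h>
    #include <stdio.h>

    static int on_match(pcre2_callout_block *cb, void *data) {
        (void)data;
        printf("match: [%zu, %zu)\n",
               (size_t)cb->start_match, (size_t)cb->current_position);
        return 1;  /* reject; force pcre2 to keep searching */
    }

    int main(void) {
        /* "(?C1)" fires only once the rest of the pattern has matched */
        PCRE2_SPTR pat = (PCRE2_SPTR)"a+b(?C1)";
        PCRE2_SPTR subj = (PCRE2_SPTR)"xaabaab";
        int err; PCRE2_SIZE erroff;
        pcre2_code *re = pcre2_compile(pat, PCRE2_ZERO_TERMINATED, 0,
                                       &err, &erroff, NULL);
        pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
        pcre2_match_context *mc = pcre2_match_context_create(NULL);
        pcre2_set_callout(mc, on_match, NULL);
        /* one unanchored call: pcre2 advances the start itself, so the
         * callout sees [1,4) [2,4) [4,7) [5,7) before NOMATCH returns */
        pcre2_match(re, subj, PCRE2_ZERO_TERMINATED, 0, 0, md, mc);
        pcre2_match_data_free(md);
        pcre2_match_context_free(mc);
        pcre2_code_free(re);
        return 0;
    }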



If a pattern uses quantification and the match extends successfully, won't two matches be reported (instead of one)? That's what I mean--not that it's buggy. As I'm sure you know quite well, deciding when matches can be reported is quite tricky, especially with Perl's ordered alternation rules.

A few questions:

- What kind of overhead is actually implied by using start of matches in streaming mode?

- Is the backend a machine code JIT of the automaton, an interpreter, a table-driven search, or...?

- Why not lazy quantifiers? They're almost always preferable to greedy quantification in a streaming scenario.

cheers, Jon


Sorry - might have sounded a bit defensive there. We've seen a few systems that have a tendency to sweep correctness under the carpet with shortcuts, especially when you hit nasties like "large bounded repeats in streaming mode".

Start of Match (SoM) in streaming mode is a real bear for some patterns. It really depends on the pattern - some are free, others (that we would otherwise support) are very expensive or impossible. Our SoM is fully general (i.e. it doesn't time out or give up after N bytes or K stream writes) so the worst case means tracking lots of 64-bit offsets through our automata.
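
To make the cost concrete, a toy sketch (hypothetical, nothing like Hyperscan's real engine code) of what "tracking lots of 64-bit offsets" means when the NFA state is just a bitvector:

    /* Sketch: the cheap bitvector NFA step is just
     * "next |= trans[state][byte]", but carrying start-of-match means
     * a parallel 64-bit offset per state, merged on every transition. */
    #include <stdint.h>

    #define NSTATES 64

    struct nfa {
        uint64_t trans[NSTATES][256];  /* successor bitmask per (state, byte) */
        uint64_t init, accept;
    };

    struct stream_state {
        uint64_t active;               /* one bit per NFA state */
        uint64_t som[NSTATES];         /* start offset carried by each state */
    };

    static void step(const struct nfa *n, struct stream_state *s,
                     uint8_t byte, uint64_t offset,
                     void (*report)(uint64_t from, uint64_t to)) {
        /* a match may start at this byte: arm idle init states with a
         * fresh SoM; already-active ones keep their earlier start */
        for (uint64_t m = n->init & ~s->active; m; m &= m - 1)
            s->som[__builtin_ctzll(m)] = offset;
        s->active |= n->init;

        uint64_t next = 0, next_som[NSTATES];
        for (int i = 0; i < NSTATES; i++) next_som[i] = UINT64_MAX;

        for (uint64_t a = s->active; a; a &= a - 1) {
            int from = __builtin_ctzll(a);
            for (uint64_t m = n->trans[from][byte]; m; m &= m - 1) {
                int to = __builtin_ctzll(m);
                next |= 1ULL << to;
                if (s->som[from] < next_som[to])   /* earliest start wins */
                    next_som[to] = s->som[from];
            }
        }
        s->active = next;
        for (int i = 0; i < NSTATES; i++) s->som[i] = next_som[i];

        for (uint64_t m = next & n->accept; m; m &= m - 1)
            report(s->som[__builtin_ctzll(m)], offset + 1);  /* [SoM, end) */
    }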

The backend is a big pile of code that coordinates the efforts of many different automata and subsystems, most of which are a combination of a C implementation plus a big table of some sort to implement NFA/DFA/literal matcher/whatever. This isn't something of which we are particularly proud. We would like to make this a lot simpler and more interpreter-based.

A JIT would be really cool but we had some reluctance in our user base to run arbitrary code. I would be fascinated to support a JIT project even if it's not on our mainline.

I respectfully differ on lazy vs greedy. In streaming mode, both require information that's expensive to maintain and imply that you can 'predict the future' (especially once you put a complex pattern after the lazy quantifier). We've been using "automata semantics" for a long time and this is simple and easy to keep track of in streaming mode.

Capturing semantics and lazy/greedy simulation are a work in progress for us. We used to have more work in this area but we didn't have strong demand for it. We do have some IP in this that isn't in Hyperscan - I may blog about this soon. We did have a subsystem that got 100% compatibility for some patterns ("100% right, 50% of the time"?) in block mode for libpcre that covered quantifier semantics and capturing subexpressions.


- With a PCRE/RE2-like NFA implementation, where you have green threads representing the states in the automata, you can just store the start of match in the thread (see the sketch after this list). You do have to be careful then about not killing some threads off. If HyperScan is a DFA, though, then SoM is indeed hard.

- Redgrep is a cool project by Paul Wankadia that uses LLVM to JIT regexp automata into machine code. https://github.com/google/redgrep. The two things that are tough about JITting for very large automata, though, are breaking the code up into functions (a compiler can't deal with one gigantor block), and all that entails, and the fact that a machine code representation of a large automaton might be many times bigger than a more traditional representation, so what you gain in decode time you lose on L3 cache hits.

- I don't grok what you mean about predicting the future wrt lazy/greedy quantification. With lazy quantification, you can stop producing matches once you've gotten the first match (or once you've moved past the quantification iff the subsequent subpattern and the quantification subpattern are disjoint). Looking for "<html>.+</html>" in a large stream is very bad, whereas looking for "<html>.+?</html>" is almost certainly what's desired.

- Capturing in a streaming mode just doesn't seem worth it.

- Large spans of bounded repetition are indeed the devil. It should be possible to replace states with a counter, but I think it requires dynamic mallocs (though I haven't worked on it yet).
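
On the first point, a minimal sketch of the thread-carries-its-SoM idea (the standard Pike-VM representation; struct names and sizes here are made up):

    /* Sketch: in a Pike-style VM each "thread" is a program counter
     * plus the offset it was spawned at, so SoM travels with the
     * thread for free. Threads are added in priority order and a pc
     * is never added twice per step; note that dropping the duplicate
     * also drops its SoM, which is the "careful about killing threads
     * off" caveat above. */
    #include <stdint.h>

    #define MAXPC 1024

    struct thread { uint32_t pc; uint64_t som; };

    struct tlist {
        struct thread t[MAXPC];
        uint8_t on_list[MAXPC];   /* one flag per program counter */
        int n;
    };

    static void add_thread(struct tlist *l, uint32_t pc, uint64_t som) {
        if (l->on_list[pc]) return;  /* higher-priority twin already queued */
        l->on_list[pc] = 1;
        l->t[l->n].pc = pc;
        l->t[l->n].som = som;
        l->n++;
    }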


We really like to stick to the original meaning of NFA, so our NFAs have the same problems as a DFA from a SoM perspective. That is, our NFAs are just bitvectors over states, not threads that can hold their own SoM or information about prior quantifiers. "NFA" has suffered a bit of semantic drift over the years...

We've heard of redgrep. I think there's a role for JITs here but I'm not sure I buy the concept that it should be a straightforward "turn the DFA states into code and jump around".

Predicting the future: suppose we have a regular expression made up of R1, R2 and Q (where R1 and R2 are regexes and Q is a quantified something-or-other - NOT necessarily with Q and R2 disjoint!) as follows: /(R1)(Q)+(R2)/, and we're in streaming mode. 'Predicting the future' means that we need to understand what happens beyond a stream boundary to know what to do now. Suppose a stream boundary falls in the middle of something matching both (Q)+ and (R2): we can't know yet what is going to happen to future matches of R2.

This can be done but is hard - the problems are very similar to SoM tracking in a traditional NFA or DFA, and mean that you are starting to track loads of extra state.

The other problem for us with trying to support lazy/greedy quantification is ordering. We might have another regex (in a multiple-regex set) that is unambiguously meant to throw a match between two potential matches that we would be holding back - held until we see whether the equivalent of (R2) matches for the second instance. So we can't just fire and forget if we go the quantifier-support route.

The fact is that the client base in network apps is very familiar with 'automata semantics' (i.e. lazy/greedy is a no-op and you get all matches) and supporting anything beyond that is a lot more expensive to get right.

Yes, there's something a bit peculiar about the idea of capturing and streaming mode together, isn't there? "Oh, we have the offsets of what matched, but you threw the data away, sorry".

Bounded repetition is a huge pain and is a work in progress for us in the general case. Single-character-wide bounded repetition (e.g. \s{100} or .{80,120} etc) works quite well. Depending on the form of the bounded repeat and the context, it is sometimes possible to replace the repeat by tracking the location of either the first or last entry, but the general case is horrible: /a.{20000}b/s is just as bad as it looks. We actually do this, but you'd better have 20,000 bits handy per stream state. :-)
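
For the morbidly curious, a toy sketch (illustrative only, nothing like our actual code) of the brute-force end of that spectrum for /a.{20000}b/s: a ring buffer with one bit per recent input position, recording where an 'a' was seen.

    /* Sketch: on 'b' at offset o, test whether an 'a' occurred at
     * exactly o - 20001 (the 'a', then 20000 arbitrary bytes, then
     * the 'b'). The ring must recall the last REP+2 positions - the
     * "20,000 bits per stream state" cost. */
    #include <stdint.h>

    #define REP   20000
    #define SPAN  (REP + 2)              /* window of positions we must recall */

    struct repeat_state {
        uint64_t bits[(SPAN + 63) / 64]; /* ~2.5 KB per stream */
        uint64_t offset;                 /* bytes consumed so far */
    };

    static inline void set_bit(uint64_t *b, uint64_t i) { b[i / 64] |= 1ULL << (i % 64); }
    static inline void clr_bit(uint64_t *b, uint64_t i) { b[i / 64] &= ~(1ULL << (i % 64)); }
    static inline int  get_bit(const uint64_t *b, uint64_t i) { return (b[i / 64] >> (i % 64)) & 1; }

    /* Feed one byte; returns 1 if /a.{20000}b/s matched ending here. */
    static int feed(struct repeat_state *s, uint8_t c) {
        uint64_t slot = s->offset % SPAN;    /* slots are reused cyclically */
        int hit = 0;
        if (c == 'b' && s->offset >= REP + 1) {
            uint64_t a_pos = s->offset - (REP + 1);
            hit = get_bit(s->bits, a_pos % SPAN);
        }
        clr_bit(s->bits, slot);              /* expire the stale position */
        if (c == 'a') set_bit(s->bits, slot);
        s->offset++;
        return hit;
    }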


re: predicting the future/streaming/"stitching", why does supporting quantification require more state than what you already have to maintain to support streaming at all? Why do you need to make a decision when you've reached the current block--can't that decision wait until you've seen more data or been told there is no more data?

--

...I can see that if you're using bit vectors to represent active states, that's how you can make use of SSE4/string instructions and POPCNT... I've been thinking about how to do that but it just doesn't seem to work with a VM implementation. Maybe for a hybrid filtration scheme, though...
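
For concreteness, the flavor of bit-parallelism I mean (the classic shift-and scan over one 64-bit state word; toy code, sketched for a literal):

    /* Sketch: the whole active-state set advances with a shift, an OR
     * and an AND per input byte - no per-thread bookkeeping at all,
     * which is exactly what resists a thread-per-state VM formulation. */
    #include <stdint.h>
    #include <string.h>

    /* Match a literal (<= 64 bytes); returns start offset or -1. */
    static int64_t shift_and(const char *pat, const uint8_t *buf, size_t len) {
        size_t m = strlen(pat);
        uint64_t mask[256] = {0}, state = 0, accept = 1ULL << (m - 1);
        for (size_t i = 0; i < m; i++)
            mask[(uint8_t)pat[i]] |= 1ULL << i;   /* where each byte may sit */
        for (size_t i = 0; i < len; i++) {
            state = ((state << 1) | 1) & mask[buf[i]];
            if (state & accept)
                return (int64_t)(i + 1 - m);  /* fixed length: SoM is free */
        }
        return -1;
    }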

--

Have you implemented Watson's MultiRE with input skipping, or Wu-Manber? I've tried some hacky approaches to skipping before and the problem seems to be that a sufficiently large pattern set will always include a short pattern (e.g., "PE") and now you're stuck with a bad l-min.
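
To make the bad-l-min problem concrete, a sketch of the classic Wu-Manber SHIFT build (block size B = 2; table layout is illustrative): a single 2-byte pattern like "PE" caps the default shift at lmin - B + 1 = 1, so the scan degenerates to a table lookup at every position.

    /* Sketch: no shift may exceed lmin - B + 1, and only the first
     * lmin bytes of each pattern contribute, so one short pattern
     * poisons skipping for the whole set. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define B   2                      /* block size */
    #define TBL (1 << 16)              /* one entry per 2-byte block */

    static size_t lmin(const char **pats, size_t n) {
        size_t m = strlen(pats[0]);
        for (size_t i = 1; i < n; i++) {
            size_t l = strlen(pats[i]);
            if (l < m) m = l;
        }
        return m;
    }

    static void build_shift(const char **pats, size_t n, size_t *shift) {
        size_t lm = lmin(pats, n);
        size_t dflt = lm - B + 1;      /* lmin = 2  =>  max shift 1 */
        for (size_t i = 0; i < TBL; i++) shift[i] = dflt;
        for (size_t p = 0; p < n; p++) {
            /* only blocks inside the first lmin bytes matter */
            for (size_t j = 0; j + B <= lm; j++) {
                size_t block = ((uint8_t)pats[p][j] << 8) | (uint8_t)pats[p][j + 1];
                size_t s = lm - B - j;  /* distance from block to window end */
                if (s < shift[block]) shift[block] = s;
            }
        }
    }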

--

Ordering is horrible, both between patterns and within patterns (Perl's ordered alternation rules). It took us a couple of years to get full compatibility with PCRE (for forensics, correctness is more important than line-speed searching; and most of our input comes from spinning rust, not ephemeral packets). We had to sacrifice the strict guarantee on O(M)/static memory usage during searching in order to gain full compatibility. Fortunately, branch prediction and reserving memory pretty much remove that cost, so that with normal pattern sets we don't see mallocs.

Streaming matching is criminally understudied from a theoretical perspective. You cannot match some patterns in O(M) memory in a streaming model as an NFA algorithm should. The one class we know about is when a lower priority path is a prefix of a higher priority one; you have to buffer the lower priority match until the higher priority match resolves. E.g., "a+b|a" on a string of a's is pathological. These kinds of patterns can be detected; what I don't know is whether there are other classes (I don't think so, but...).
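
A toy sketch of the buffering that "a+b|a" forces (assuming leftmost, non-overlapping, PCRE-priority semantics; illustrative, not our implementation):

    /* Sketch: each 'a' seen is a pending low-priority "a" match that
     * cannot be reported yet - a later 'b' would make the
     * higher-priority a+b win from the earliest start instead. Here
     * the pending set happens to compress to a run; in general it
     * need not, which is what breaks the O(M) memory bound. */
    #include <stdint.h>
    #include <stdio.h>

    struct pending { uint64_t first_a, count, offset; };

    static void feed(struct pending *p, uint8_t c) {
        if (c == 'a') {
            if (p->count == 0) p->first_a = p->offset;
            p->count++;                 /* one more buffered "a" match */
        } else if (c == 'b' && p->count > 0) {
            /* a+b resolves and wins from the earliest buffered start */
            printf("a+b: [%llu, %llu)\n",
                   (unsigned long long)p->first_a,
                   (unsigned long long)(p->offset + 1));
            p->count = 0;
        } else {
            /* no 'b' arrived: the buffered "a" matches finally fire */
            for (uint64_t i = 0; i < p->count; i++)
                printf("a: [%llu, %llu)\n",
                       (unsigned long long)(p->first_a + i),
                       (unsigned long long)(p->first_a + i + 1));
            p->count = 0;
        }
        p->offset++;    /* (an end-of-stream flush is omitted here) */
    }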

--

I've always found the DPI use case fascinating, as it is a completely different input model than what I've worked on, which has its own rules, and yet has many of the same demanding theoretical requirements in its treatment of automata.


I think we'll need to take this discussion offline, and we will be posting more details about how Hyperscan is implemented to our blog over the next few weeks.

Our implementations clearly start from different emphases; some of our design decisions will probably become clearer as we post about our implementation and the hard constraints in DPI.



