1) it works "faster" than regex in a specific scenario of treating text as entities in natural language. (E.g. higher conceptual abstractions such as qualifiers, variations, etc). If one were to reconstruct Nevod's rules using pure traditional regex (a very complicated regex), executing that regex would be slower because it's more of a "dumb" character sequence matching engine instead of a higher level natural language parser.
2) it's currently an unreleased "text search engine" that presumably will be licensed for you to integrate into your own software. The text matching engine is currently only used in their proprietary database engine. Whether the engine is a library one statically links in like Sqlite -- or -- it's a separate runtime like ElasticSearch that you make API calls to, I don't know.
I notice the CEO is Yury Chetyrko and the submitter is ychetyrko, so maybe he can explain in more detail what exactly Nevod is.
My reading of that is that they compared apples and oranges. They didn't use NLP there at all (stemming, parts of speech tagging, etc); they just relaxed handling of whitespace. The Nevod 'equivalent' was a less general expression to make it seem more maintainable.
The Nevod example translates to (ruby):
pattern = Regexp.new( "(?<Name>ejection fraction|LVEF)( by visual inspection)?
(?<Qualifier>(is|of)( (at least|about|greater than|less than|equal to))?)
(?<Value>[0-9]+(-[0-9]+)?|normal|moderate|severe)".gsub(/\s+/, "\\s+"),
Regexp::IGNORECASE)
["ejection fraction is at least 70-75",
"ejection fraction of about 20",
"ejection fraction of 60",
"ejection fraction of greater than 65",
"ejection fraction of 55",
"ejection fraction by visual inspection is 65",
"LVEF is normal"].each do |line|
puts line
puts pattern.match(line).inspect
end
The only trick I used was to substitute literal whitespace in the regex with a whitespace pattern, so that the typed regex was more readable.
If you're looking for a tool that allows you to incorporate legitimate NLP approaches, you should have a look at `odin`. Here's a paper https://doi.org/10.1093/database/bay098 showing its usage in the medical domain.
And the code is open-sourced as part of the `processors` library out of the CLULab at the University of Arizona: https://github.com/clulab/processors
The most detailed (though not completely up-to-date) documentation is probably in the manual here: https://arxiv.org/abs/1509.07513
I'm using it at my current job to build an analysis tool for customer-agent phone calls.
It allows you to build rules that match on different levels of abstraction: tokens, pos-tags, dependency paths. You can even match tokens based on word similarity (as measured by cosine similarity of word vectors).
And these rules can "cascade" (i.e. build off of each other). So you can find an entity or event in rule 1 and then look for how that interacts with another matched entity or event in a later rule.
That sounds very much like what Rust's RegEx engine does. I understand it extracts longer literals (when available) to find a starting point for where a match might be according to [1].
This is not a new technique with the Rust regex engine. We had a considerably more comprehensive literal 'factoring' approach in Hyperscan about a decade earlier (which also satisfied a lifelong ambition of mine; specifically misusing the netflow algorithm in a graph for something).
The multiple literal implementation in that matcher is also a partial lift from Hyperscan's "Teddy" small group literal matcher (not salty about that as "burntsushi" has been very clear about inspiration). I do wish they'd pull in the rest of the algorithm at some point - the bits that allow merging of Teddy literals into buckets to reduce false positive rates...
While we're wishing for things, I'd love to see you write up how some of the algorithms (like Teddy) work. Hyperscan is clearly a treasure trove of them, but the effort required to extract its secrets is quite large! For example, I've spent quite a bit of time perusing its code (and your excellent mailing list posts), and I still don't think I could describe its execution flow at a high level.
As far as literals go, I was doing that well before I heard about Hyperscan. I got the idea from hints in the literature. (On mobile or else I'd provide a link.)
We have a paper accepted to NSDI, so this will hopefully shed some light on things. A conference paper is, sadly, too small to fit more than a few subsystems so a writeup of Teddy isn't included.
As I don't work for Intel any more, it's unlikely that I will put in the effort for anything significantly more comprehensive. I am considering another building another regex matcher but it wouldn't be the encyclopedic list of optimizations approach. Paul Terra refers to Hyperscan as the "Dwarf Fortress of regular expression matchers". I have some new ideas I'd like to try out, in any case.
I wasn't suggesting that your use of literal factoring comes from Hyperscan, only the Teddy matcher. Literal factoring seems an old technique and I don't know the genuinely first cite of that.
Gotya. Thanks for the clarification! I look forward to reading your paper! I'll hold out hope for the folks working on Hyperscan at Intel to do some more in depth write ups.
I didn't mean to insinuate that it was. The Rust RegEx engine was simply the first that came to mind for me. What I did intend to say that if this is indeed the optimisation that Nevod claims it has, it is not the first to do it.
However I did take the time to read up on Hyperscan and would like to thank you for contributing to what looks like an excellent tool.
> Nevod is a language and technology that provide pattern-based text search. Nevod is specially aimed to rapidly reveal entities and their relationships in texts written in the natural language. This patent pending technology is unique on the market.
What is this? A website, a tool? Can I run it locally? Is it free software, open source, close source? Where is the information about the patent?
Also, you mention it's easier (why?) and hundred time faster, but I don't see anywhere a comparison with traditional regular expressions.
The claims seem dubious to me as well. What exactly does faster mean here? Since a speedup of two orders of magnitude is mentioned I'm assuming it refers to matching speed.
Is it faster than PCRE? Is it faster than some other engine or is the claim that it's faster than any available RegEx engine?
I had a quick look at the reference and this looks like it will accept context-free languages (since recursion is allowed). I strongly doubt that a CFG parser is magically faster than any RegEx engine.
> It is hundreds of times faster than the well-known regular expressions, and the speed doesn’t degrade linearly when you add patterns.
The author seems to believe that matching time for RegExes is linear in the time of the pattern? Once compiled to a DFA, the pattern only has negligible influence on the matching time.
EDIT: I've been trying to figure out what kind of parser this generates. It might be a LR(k) parser? It breaks on the following (admittedly contrived) example:
#S = E + End;
E = {E + "+" + E, E + "*" + E, T};
T = {[1+]Alpha, N + "()"};
N = {P, Q};
P = {"a" + P, "ab"};
Q = [1+]"a";
> Once compiled to a DFA, the pattern only has negligible influence on the matching time.
With the right optimisations, certain longer patterns can be much faster to scan for, on account of the Boyer-Moore approach.
If I asked you to search through a book for 10 consecutive pages of the letter 'Q', you wouldn't need to check every page. The same optimisation can be applied to regex. (Not that most regex implementations bother to do it, though.)
You're right. What I meant to say is that for any (formal) regular expression, we can compile it to a DFA and then match any strings in time linear in the length of the string.
Certainly the pattern length matters both for pre-processing (compiling to DFA) and the runtime in a Boyer-Moore approach. However, as you mentioned in the Boyer-Moore average case of Θ(n/m) a longer pattern is faster, rather than slower as the page on Nevod seems to imply.
This is cute! It's basically a nice syntax for full parsing of extended regular expressions. It would make for a very nice tool. Far too often, regular expression tools only implement "matching" (i.e., extract a list of strings). Full parsing gives you the complete parsetree, and thus preserve more structural information, especially under repetitions.
I'm very much in support of anything that uses regex parsing instead of matching. Matching is nearly always the wrong tool and cause more bugs than anything. The only reason it's used is that the engines are easier to implement. I wrote a small paper on how to retrofit parsing on top of an engine that only gives you matching recently (https://gabriel.radanne.net/papers/tyre/tyre_paper.pdf).
Given the amount of prior art in the academic community, patenting this is probably worthless, but eh ...
Well, except Perl doesn't parse regular grammars (it parses much more) and is far from being "one pass" (since the complexity guarantees are not valid anymore) ....
Except Perl 6 doesn't enforce it, so you have to guess yourself if you are really in the regular case. Additionally, PCRE used to be exponential for certain "bad cases", some of which were (truly) regular expressions.
My point is simply that Perl doesn't give you any guarantee except "we will try to parse it". Maybe you will hit the right case for the right version of Perl, who knows ? The Web is full of DDOS attack based on exponential PCREs.
With actual regular expression, the complexity is guaranteed.
> My point is simply that Perl doesn't give you any guarantee except "we will try to parse it".
Perl 6 gives you pretty good control over where to use backtracking and where not. It parses regular languages with a NFA or DFA, and can switch to a backtracking engine.
The way regular expressions work in Perl doesn't randomly degrade across Perl versions. For a given regex you can determine what it's complexity would be across versions just like you would for an arbitrary chunk of Perl code.
It's true you don't always get a complexity guarantee for free like in a DFA implementation of formal regular expressions.
No, Parsing expression grammar (PEG) are not regular grammars at all.
For more resources, I think the related work of my paper has the main ones. The "Kleene meets Church" project has lot's of very good publications on the topic: https://di.ku.dk/kmc/publications/
1. the ability to easily break patterns into named subpatterns which can be referenced later on
2. the `@` operator which gives you the ability to talk about things inside these subpatterns
These seem like worthwhile additions which would make regex more manageable. I don't see any reason why the whole of regex would need to be abandoned for some completely new, potentially proprietary technology, though.
It also reminds me of parser combinators (in the form they are popular in Haskell, for instance).
It looks like there are some interesting things in this project beyond regex, but the way that it skates over the fact that there are already automata-based multiple pattern matchers is pretty rank.
When we started developing automata-based pattern matching s/w in 2006 - 2006! - we were well aware that it wasn't really a new thing. 12 years of work on Hyperscan (https://github.com/intel/hyperscan), 1 acquisition and 1 open source release later - as well as multiple other projects doing similar things (re2, Rust regex, Parabix, icgrep, Rosie Pattern Language, ...) has only added to this crowded field. All this has taught me many things, but the thing that I keep returning to is that "a lot of people apparently don't know about prior art, or assiduously pretend not to". It will be interesting to see what they try to patent.
> Pattern definition language is simple and clear, thus very easy to learn
Once you get into more complicated expressions, I don't see this as much easier or simpler than regexes. For example, this expression from the tutorial looks as much like gibberish to me as an equivalent regex:
Domain = Word + [1+] ("." + Word + [0+] {Word, "_", "-"});
That said, I can see some possible value in pattern matching based on text tokens, rather than individual characters. I'm sure there is a subset of pattern matching problems that could be solved more simply using this.
In the end, automata are automata and anything that tries to do what regex's do winds up looking a lot like regex's -- which look a lot like regular expressions (as a representation of a finite state machine). The difference between "regex alternatives" and regex's is that the alternatives tend to be more verbose and less well documented...and less likely to elicit good answers on StackOverflow...and perhaps less likely to generate good questions there.
I think the hard part of pattern matching is reasoning about pattern matching. The obscurity of Regex notation is mostly a function of unfamiliarity with the concepts. [:word:]+ is not easier to reason about than \w+ and "\w+" is much better documented than "[:word:]+" or "Word + [1+]."
The other problem with learning Regex's is that regex notation is someone else's code. There's always the attraction of fixing it. I've dunning-kuger'ed it myself. Fortunately, making my new more sensible superduper regex notation complete required RTFM'ing...and then I'd read the manual and realized I'd already fixed regex notation by fixing the absence of knowledge in my head. Plus I could talk to other people about pattern matching using the common language of pattern matching.
First of all, thanks everyone for the valuable feedback, critics, suggestions, etc. Truly appreciate that!
And thanks for patience to all people who are playing with the Nevod right now and getting errors. We have an unexpectedly high interest and number of visitors is very high in our playground. Meanwhile, it's a preview of the technology, please keep in mind.
Let me clarify few things. The speed of Nevod is based on two things:
1. Nevod operates on words, not individual characters. From my 30 years of coding experience I would say that roughly 95% of text processing/rules are word-based. So it covers most of tasks.
2. Nevod matches MULTIPLE patterns against document in ONE PASS. Patters/expressions are indexed by state machine and filtered effectively during matching.
So, at a RULE DECK of 1000 patterns, we got 300 times improvement comparing to RegExp. This is a kind of tasks when you let say need to highlight syntax, to massively verify and route incoming requests in cloud, perform massive text validations, track certain phrases and relations in massive human communication, etc.
With growing number of patterns the difference between Nevod and RegExp is higher and higher and go far beyond 300 times. So, I'm a kind of disagree with moderators removed the word "hundreds" from the headline. :)
We will publish benchmark results and we will also release "negrep" (Nevod GREP) command line tool for Windows, Linux, Mac, so everyone will be able to run "negrep", play with it and verify benchmark him/herself. Generally, Nevod technology will be available both in form of a library, in form of command line tool, and in form of service.
> 2. Nevod matches MULTIPLE patterns against document in ONE PASS. Patters/expressions are indexed by state machine and filtered effectively during matching.
This is a good idea. It's such a good idea that we did it in 2006, sold a company to Intel based around in in 2013 and open-sourced the resulting library (https://github.com/intel/hyperscan) in 2015.
All multi-pattern systems get a huge improvement over one-at-a-time regex. It would be embarrassing if they did not. Our rule of thumb for Hyperscan was that it should be about 5-10x faster than libpcre on single patterns and the gap should get progressively bigger.
I don't doubt that there is quite a bit of interesting work (entirely outside of regex - the word-based focus looks interesting) going on in this project, but the fact that you couldn't be bothered to Google around for a sensible baseline is not encouraging.
Also, which bit of your system do you think is novel enough to patent? You may find considerable prior art out there, which you don't really seem to be aware of based on how much emphasis you're placing on the not-entirely-novel idea of using a state machine.
Geoff, for me it's real honor that you, well-known creator of Hyperscan, paid attention to what we are doing. We truly respect all the prior work in this field (Hyperscan, RE2, Rust, etc), so different benchmarks will be available.
However, we all need to keep in mind that Nevod operates on words, not characters, and Nevod targets large rule decks (1000-100000 patterns). So, we don't compete in the field of individual pattern matching or streaming matching (to some extent, goals of Hyperscan and Nevod are different). In other words, a lot of things depend on target goals, which usually are reflected in the kind of patterns and in their number.
Regarding patent. Sorry, for legal reasons I can't expose details at the moment.
Staying away from streaming is a good idea for a lot of designs. We all lost a lot of our hair making streaming work in Hyperscan, and it was often the thing suppressing easier optimizations. Or at least we would have to have a separate approach for streaming and block mode; not exactly great s/w engineering.
Word-based approaches are interesting. So is large scale. Hyperscan does embarrassingly badly on natural language stuff at scale due to some baked in assumptions about relative frequencies of literal factors and the literal matchers tend to fall apart on really huge literal sets in this domain. So there's definitely an opportunity here.
I would strongly encourage you to produce some repeatable benchmarks (at least, things where we can see what the task is for something like RE2 or Hyperscan). We spent many years in stealth mode with a lot of superlatives and NDA-only Powerpoint decks and I don't think it did ourselves any favors. It means that you wind up being the only guardian of your own honesty, which is ... tricky ("What the hell is a medium-complexity pattern, anyhow?").
Isn't this basically what regexp-based lexer generators (eg. lex, 1975) do? Stdlib regexps need to run multiple passes because the library itself has no idea whether it's being used with one pattern or multiple patterns, but any lexer generator that has all the patterns at once (i.e. all of them) can merge the DFAs generated and match the input in a single pass.
Hell, I've been known to do quick & dirty versions of this technique by concatenating all the patterns together with "|" before passing them to Pattern.compile. It takes all of one line of code. Backtracking regexp engines don't really benefit from this, but DFA-based regexp engines like RE2 can get massive speedups.
The idea of compiling everything together into a single DFA is very old. However, this doesn't always work - it will result in a combinatorial explosion in many cases (informally, if the patterns contain a lot of states that overlap with other potential states you might be in from other patterns or elsewhere in that pattern). So DFA merging is a dark art and many people spend whole careers taping things to the side of the basic DFA algorithm to handle this.
This doesn't seem to come up with most lexers because the idea of using some crazy overlapping tokens doesn't sound great - i.e. you're probably not going to want to allow your tokens to overlap that much. So it's not a problem for lex, but it might be a problem if you're scanning 10,000 text analytics patterns simultaneously.
Concatenating patterns is fine for RE2 - and RE2 can survive this - to a point. However, RE2 just falls back to a 'survivable' Thompson NFA regex when determinization fails and has no other strategies.
For backtracking engines, concatenating patterns with "|" is equivalent to just trying one pattern, then the next, etc.
> 2. Nevod matches MULTIPLE patterns against document in ONE PASS. Patters/expressions are indexed by state machine and filtered effectively during matching.
Regexps do that too. Specifically, there is a whole class of "lexers"/"tokenizers" which are regexp-based, and which specialize in matching 1000's of patterns against the text, in one pass.
(You can kinda approximate it in regular regexp by doing multiple rules at once, each in it's own group, and seeing which groups are non-empty, like in "(rule1)|(rule2)|(rule3)" and so on. If your regexp library is good, it will match this in time independent from number of rules. But real lexers will have nicer syntax)
I see a lot of criticism but not a lot of praise. For me creating something so comprehensive and also easy to use is an achievement. And in my opinion this is much better than Regex for most cases because of the ease of use and readability. I think the main pushback you are seeing is just because people are invested in Regex.
Um.. no? He is seeing pushback because one of his main claims (100x faster) is very likely factually wrong. And also because he is not even acknowledging all other prior art in this space.
> Also, patent sends marketing message that we are serious about the technology we created.
That's the message it sends to business people maybe, but your target audience here is hackers and tinkerers and it sends a completely different message (namely: "keep this technology out of your projects or you'll get into legal trouble down the road").
Pretty cool tool, the NLP focus is surprising but I'm not sure if it is a marketing ploy or fundamental to the technology. The nice syntax is also an improvement over things like LPeg, which seem to obscure the grammar by forcing it to be written in Lua. Seems like it could overlap in functionality with Rebol/RED parse, which I believe has been posted on here before, LPeg, or just PEGs and Packrat parsing in general.
The parse engine is implemented with a PEG, so it is probably a lot slower, but also supports pattern matching over whole words. Also, the database aspect of Nevod seems interesting and offers potentially a huge speed. But as far as I can tell, the syntax looks like a PEG with some implicit rules that separate words. As far as speed goes, I definitely believe that Nevod could be very fast but I haven't seen any numbers.
Shameless plug, I've implemented something similar with PEGs for Janet, a lisp I have been working on for a while. I make no claims to it's speed, but the peg interpreter is written in tight C so it shouldn't be too slow. The peg module in Janet works with a DSL that looks like a lispy EBNF and results in a recursive parser compiled to bytecode for the interpreter.
"Nevod is a language and patent-pending technology for pattern-based text search."
Yawn. I will stay with my patent-free regexps (or whatever other tech I may need) than to rely on something that will let you put a gun to my head if I ever wish to sell my product.
Is this another syntax (of the mechanisms written form) vs semantics (of how it actually operates) confusion moment?
If Nevod builds a different textual model and applies what you "say" to it, differently to the regex underlying model, thats about the speedup. how you say what you want in pattern matching, thats just a pure syntactic moment: I like regex from the UNIX philosophy because of the syntax fluidity for saying things. What actually happens when I say a|b|c|d is it builds a DFA and its not that bad, but if I do individual /a/p /b/p /c/p patterns in SED, same engine, but no DFA builder, its slow.
So is Nevod a new syntax and a DSL tied to language, or is it a new syntax and a generalized text matching model with some real semantic shift from regex?
You could easily do something like this using open source parser generators, like Pegjs(https://github.com/pegjs/pegjs) and own the final code yourself.
[edit] After reading my comment it sounds like I don't like packrat parsers. I actually love them and when they are available for the language I'm using they are my first choice, but the first rule of engineering is everything has it's trade-offs, so...
I'm not familiar with Pegjs, but other PEG parsers I've seen tend to use the packrat algorithm, which is suboptimal for regular languages, because it memoizes parses to speed up backtracking, and regular languages do not need backtracking.
For example, if you were to write a recursive-descent parser for JSON and convert it to a packrat parser, you will often find the packrat parser is slower.
Now, extended regex's include backtracking, and that's where packrat parsers can soundly defeat recursive-descent parsers: super-linear time parses can become linear time. This makes packrat parsers a wonderful "default choice" but if constant factors are important and your language is regular, you will want to look beyond packrat parsers.
Sounds exciting if it's real, but extraordinary claims require extraordinary evidence.
OP (@ychetyrko) are you involved with this project? If so, announcing this before you have usable code that people can use without licencing encumbrance might have been an opportunity lost. Much more detail is required.
Higher level text searching has been around for decades. In fact lawyers use this all the time. For example:
motion w/3 (denied or deny!)
This will match any sentence with the word "motion" within three words of the word "denied" or any other word that starts with "deny" (e.g., "denying"). Often, these systems use regex underneath with fancy libraries built on top to determine word separation, sentence separation, etc.
The truth is that regex is really great for character level searching, but if you commonly do word or sentence level searching there are a variety of solutions already available.
Sadly, RegEx has evolved far away from the original regular expression we learnt in school, and it is certainly less NFA like. This make it harder to execute a faster speed, e.g. backtracing makes it more context sensitive etc.
I’m just glancing at the reference (https://nevod.nezaboodka.com/#reference) and this looks like a superset of regular expressions? I’m curious as to how you’re getting the speedup you claim on this.
I learned about lua patterns while trying out OpenBSD httpd and I quite liked it as Regex alternative. It's easy to learn and covers most of my needs: https://man.openbsd.org/patterns.7
If they want to actually succeed in making this ubiquitous, it cannot be a grammarly-like plug-in that sends text back to the motherland. It has to be standalone and locally hosted. Otherwise it’s just another MITM / spying prone library.
Please add implementation details. I've once made a search engine that could support a similar feature set. It's possible that we did the same core tricks but I have no way to evaluate this.
After digging around their website, I found this blog post which explains it better:
https://blog.nezaboodka.com/post/2019/594-using-nevod-for-te...
So my summary would be:
1) it works "faster" than regex in a specific scenario of treating text as entities in natural language. (E.g. higher conceptual abstractions such as qualifiers, variations, etc). If one were to reconstruct Nevod's rules using pure traditional regex (a very complicated regex), executing that regex would be slower because it's more of a "dumb" character sequence matching engine instead of a higher level natural language parser.
2) it's currently an unreleased "text search engine" that presumably will be licensed for you to integrate into your own software. The text matching engine is currently only used in their proprietary database engine. Whether the engine is a library one statically links in like Sqlite -- or -- it's a separate runtime like ElasticSearch that you make API calls to, I don't know.
I notice the CEO is Yury Chetyrko and the submitter is ychetyrko, so maybe he can explain in more detail what exactly Nevod is.