
Show HN: Nevod is easier and faster than RegExp - ychetyrko
https://nevod.nezaboodka.com/#playground
======
jasode
Like the other sibling comments mentioned, I too was confused about "100x
faster than regex" and what the actual product was about.

After digging around their website, I found this blog post which explains it
better:

[https://blog.nezaboodka.com/post/2019/594-using-nevod-for-te...](https://blog.nezaboodka.com/post/2019/594-using-nevod-for-text-analytics-in-healthcare)

So my summary would be:

1) it works "faster" than regex in a specific scenario of treating text as
_entities in natural language_. (E.g. higher conceptual abstractions such as
qualifiers, variations, etc). If one were to reconstruct Nevod's rules using
pure traditional regex (a very complicated regex), executing _that_ regex
would be _slower_ because it's more of a "dumb" character sequence matching
engine instead of a higher level natural language parser.

2) it's currently an unreleased "text search engine" that presumably will be
licensed for you to integrate into your own software. The text matching engine
is currently only used in their proprietary database engine. Whether the
engine is a library one statically links in, like SQLite, or a separate
runtime you make API calls to, like ElasticSearch, I don't know.

I notice the CEO is Yury Chetyrko and the submitter is ychetyrko, so maybe he
can explain in more detail what exactly Nevod is.

~~~
bazzargh
My reading of that is that they compared apples and oranges. They didn't use
NLP there at all (stemming, parts of speech tagging, etc); they just relaxed
handling of whitespace. The Nevod 'equivalent' was a less general expression
to make it seem more maintainable.

The Nevod example translates to (ruby):

    
    
        pattern = Regexp.new( "(?<Name>ejection fraction|LVEF)( by visual inspection)?
                               (?<Qualifier>(is|of)( (at least|about|greater than|less than|equal to))?)
                               (?<Value>[0-9]+(-[0-9]+)?|normal|moderate|severe)".gsub(/\s+/, "\\s+"), 
                             Regexp::IGNORECASE)
    
        ["ejection fraction is at least 70-75",
        "ejection  fraction of about 20",
        "ejection fraction  of 60",
        "ejection  fraction of greater than 65",
        "ejection fraction of 55",
        "ejection fraction by visual  inspection is 65",
        "LVEF is normal"].each do |line|
          puts line
          puts pattern.match(line).inspect
        end
    

The only trick I used was to substitute literal whitespace in the regex with a
whitespace pattern, so that the typed regex was more readable.

~~~
mikecapizzi
If you're looking for a tool that allows you to incorporate legitimate NLP
approaches, you should have a look at `odin`. Here's a paper
[https://doi.org/10.1093/database/bay098](https://doi.org/10.1093/database/bay098)
showing its usage in the medical domain.

And the code is open-sourced as part of the `processors` library out of the
CLULab at the University of Arizona:
[https://github.com/clulab/processors](https://github.com/clulab/processors)

The most detailed (though not completely up-to-date) documentation is probably
in the manual here:
[https://arxiv.org/abs/1509.07513](https://arxiv.org/abs/1509.07513)

I'm using it at my current job to build an analysis tool for customer-agent
phone calls.

It allows you to build rules that match on different levels of abstraction:
tokens, pos-tags, dependency paths. You can even match tokens based on word
similarity (as measured by cosine similarity of word vectors).

And these rules can "cascade" (i.e. build off of each other). So you can find
an entity or event in rule 1 and then look for how that interacts with another
matched entity or event in a later rule.

------
Cynddl
> Nevod is a language and technology that provide pattern-based text search.
> Nevod is specially aimed to rapidly reveal entities and their relationships
> in texts written in the natural language. This patent pending technology is
> unique on the market.

What is this? A website, a tool? Can I run it locally? Is it free software,
open source, closed source? Where is the information about the patent?

Also, you mention it's easier (why?) and a hundred times faster, but I don't
see a comparison with traditional regular expressions anywhere.

~~~
codazoda
The patent kills it for me even though it may just be for defense (I didn't
look and it sounds like they didn't say).

~~~
airstrike
Well, it will eventually expire so we can revisit this then

------
Cu3PO42
The claims seem dubious to me as well. What exactly does "faster" mean here?
Since a speedup of two orders of magnitude is mentioned, I'm assuming it
refers to matching speed.

Is it faster than PCRE? Is it faster than some other engine or is the claim
that it's faster than any available RegEx engine? I had a quick look at the
reference and this looks like it will accept context-free languages (since
recursion is allowed). I strongly doubt that a CFG parser is magically faster
than any RegEx engine.

> It is hundreds of times faster than the well-known regular expressions, and
> the speed doesn’t degrade linearly when you add patterns.

The author seems to believe that matching time for RegExes is linear in the
length of the pattern? Once compiled to a DFA, the pattern only has negligible
influence on the matching time.
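
To make that concrete, a toy illustration of my own with a hand-built DFA for
(a|b)*abb: matching is one table lookup per input character, i.e. linear in
the input however complicated the pattern was before compilation:

    delta = [
      { "a" => 1, "b" => 0 },  # state 0: start
      { "a" => 1, "b" => 2 },  # state 1: just saw "a"
      { "a" => 1, "b" => 3 },  # state 2: just saw "ab"
      { "a" => 1, "b" => 0 },  # state 3: just saw "abb" (accepting)
    ]

    def dfa_match?(delta, input)
      state = 0
      input.each_char { |c| state = delta[state][c] }
      state == 3
    end

    puts dfa_match?(delta, "abababb")  #=> true
    puts dfa_match?(delta, "abba")     #=> false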

EDIT: I've been trying to figure out what kind of parser this generates. It
might be an LR(k) parser? It breaks on the following (admittedly contrived)
example:

    
    
      #S = E + End;
      E = {E + "+" + E, E + "*" + E, T};
      T = {[1+]Alpha, N + "()"};
      N = {P, Q};
      P = {"a" + P, "ab"};
      Q = [1+]"a";
    

with an input of "aaaaaaab()+beta*c"

~~~
MaxBarraclough
> Once compiled to a DFA, the pattern only has negligible influence on the
> matching time.

With the right optimisations, certain longer patterns can be much _faster_ to
scan for, on account of the Boyer-Moore approach.

If I asked you to search through a book for 10 consecutive pages of the letter
'Q', you wouldn't need to check every page. The same optimisation can be
applied to regex. (Not that most regex implementations bother to do it,
though.)
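
A minimal Boyer-Moore-Horspool sketch (my own illustration of that skipping):
on a mismatch the search window slides by up to the pattern length, so longer
patterns skip more of the text:

    def horspool_index(text, pat)
      m = pat.length
      # Bad-character table: how far the window may slide when the text
      # character aligned with the pattern's last position is `c`.
      shift = Hash.new(m)
      pat[0..-2].each_char.with_index { |c, i| shift[c] = m - 1 - i }
      i = 0
      while i <= text.length - m
        return i if text[i, m] == pat
        i += shift[text[i + m - 1]]  # mismatch: skip up to m characters
      end
      nil
    end

    # Searching for ten consecutive "Q"s inspects only ~1/10 of the text:
    puts horspool_index("x" * 1000 + "Q" * 10, "Q" * 10)  #=> 1000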

~~~
Cu3PO42
You're right. What I meant to say is that for any (formal) regular expression,
we can compile it to a DFA and then match any strings in time linear in the
length of the string.

Certainly the pattern length matters, both for pre-processing (compiling to
the DFA) and for the runtime in a Boyer-Moore approach. However, as you
mentioned, in the Boyer-Moore average case of Θ(n/m), a longer pattern is
faster, rather than slower as the page on Nevod seems to imply.

------
jehna1
Here's a somewhat related project that aims to make regular expressions more
verbose:

[https://github.com/VerbalExpressions/JSVerbalExpressions](https://github.com/VerbalExpressions/JSVerbalExpressions)

Here's a simple example for matching URLs:

    
    
        const tester = VerEx()
          .startOfLine()
          .then('http')
          .maybe('s')
          .then('://')
          .maybe('www.')
          .anythingBut(' ')
          .endOfLine();
    

The project has been ported to many different languages and it outputs a
normal regular expression that you can use to match your text.

------
Drup
This is cute! It's basically a nice syntax for full parsing of extended
regular expressions. It would make for a very nice tool. Far too often,
regular expression tools only implement "matching" (i.e., extracting a list
of strings). Full parsing gives you the complete parse tree, and thus
preserves more structural information, especially under repetitions.
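
For instance (a small Ruby illustration of my own), a capture group under a
repetition only retains its last iteration when you merely match:

    m = /\A(?:(\w+),)+\z/.match("a,b,c,")
    p m[1]  #=> "c" -- the "a" and "b" iterations are gone; a parse tree keeps all three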

I'm very much in support of anything that uses regex parsing instead of
matching. Matching is nearly always the wrong tool and causes more bugs than
anything. The only reason it's used is that the engines are easier to
implement. I recently wrote a small paper on how to retrofit parsing on top of
an engine that only gives you matching
([https://gabriel.radanne.net/papers/tyre/tyre_paper.pdf](https://gabriel.radanne.net/papers/tyre/tyre_paper.pdf)).

Given the amount of prior art in the academic community, patenting this is
probably worthless, but eh ...

~~~
perlgeek
The Perl 6 grammar engine does full parsing, and gives you the parse tree that
matches the structure of the rules.

~~~
Drup
Well, except Perl doesn't parse regular grammars (it parses much more) and is
far from being "one pass" (since the complexity guarantees are not valid
anymore) ....

~~~
tux1968
Regular grammars are a strict subset of what Perl 6 will parse, no? If you
stick to that subset it will parse in a single pass.

~~~
Drup
Except Perl 6 doesn't enforce it, so you have to guess yourself whether you
are really in the regular case. Additionally, PCRE used to be exponential for
certain "bad cases", some of which were (truly) regular expressions.

My point is simply that Perl doesn't give you any guarantee except "we will
try to parse it". Maybe you will hit the right case for the right version of
Perl, who knows? The Web is full of DDoS attacks based on exponential PCREs.

With actual regular expressions, the complexity is guaranteed.
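
The classic shape of such an attack (a well-known ReDoS example, not specific
to any particular site) is a nested quantifier that forces a backtracking
engine to try exponentially many ways of splitting the input:

    require "benchmark"

    pattern = /\A(a+)+b\z/  # nested quantifiers over the same character
    input = "a" * 28        # no trailing "b", so the match must fail
    puts Benchmark.realtime { pattern =~ input }
    # On a purely backtracking engine the time roughly doubles with each
    # extra "a"; a DFA-based engine rejects the input in linear time.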

~~~
perlgeek
> My point is simply that Perl doesn't give you any guarantee except "we will
> try to parse it".

Perl 6 gives you pretty good control over where to use backtracking and where
not. It parses regular languages with an NFA or DFA, and can switch to a
backtracking engine.

More details in
[https://www.apress.com/us/book/9781484232279](https://www.apress.com/us/book/9781484232279)
(sorry for the plug, wrote it myself).

------
feanaro
I see two novel aspects of this language:

1. the ability to easily break patterns into named subpatterns which can be
referenced later on

2. the `@` operator which gives you the ability to talk about things inside
these subpatterns

These seem like worthwhile additions which would make regex more manageable. I
don't see any reason why the whole of regex would need to be abandoned for
some completely new, potentially proprietary technology, though.

It also reminds me of parser combinators (in the form they are popular in
Haskell, for instance).

~~~
al2o3cr
The Oniguruma engine has had named, callable subpatterns for a while:

[https://github.com/kkos/oniguruma/blob/master/doc/RE#L400](https://github.com/kkos/oniguruma/blob/master/doc/RE#L400)

Supported (via Oniguruma) in Ruby 2.0.0+, dunno about other languages.
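
For example, in Ruby (whose engine is Onigmo, an Oniguruma fork) a named group
can call itself via \g<name>, which already goes beyond regular languages:

    # \g<name> re-invokes the named group, here to match balanced parentheses:
    balanced = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/
    p balanced.match?("(a(b)c)")  #=> true
    p balanced.match?("(a(b)c")   #=> false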

------
glangdale
It looks like there are some interesting things in this project beyond regex,
but the way that it skates over the fact that there are already automata-based
multiple pattern matchers is pretty rank.

When we started developing automata-based pattern matching s/w in 2006 - 2006!
- we were well aware that it wasn't really a new thing. 12 years of work on
Hyperscan
([https://github.com/intel/hyperscan](https://github.com/intel/hyperscan)), 1
acquisition and 1 open source release later - as well as multiple other
projects doing similar things (re2, Rust regex, Parabix, icgrep, Rosie Pattern
Language, ...) has only added to this crowded field. All this has taught me
many things, but the thing that I keep returning to is that "a lot of people
apparently don't know about prior art, or assiduously pretend not to". It will
be interesting to see what they try to patent.

------
JeffRosenberg
> Pattern definition language is simple and clear, thus very easy to learn

Once you get into more complicated expressions, I don't see this as much
easier or simpler than regexes. For example, this expression from the tutorial
looks as much like gibberish to me as an equivalent regex:

    
    
        Domain = Word + [1+] ("." + Word + [0+] {Word, "_", "-"});
    

That said, I can see some possible value in pattern matching based on text
tokens, rather than individual characters. I'm sure there is a subset of
pattern matching problems that could be solved more simply using this.
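
To make the comparison concrete, here is a rough regex rendering of the
tutorial's Domain pattern above (my own approximation; Nevod's Word token need
not map exactly onto a regex character class):

    domain = /[[:alpha:]]+(?:\.[[:alpha:]]+[\w-]*)+/
    p domain.match?("blog.nezaboodka.com")  #=> true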

~~~
brudgers
In the end, automata are automata and anything that tries to do what regex's
do winds up looking a lot like regex's -- which look a lot like regular
expressions (as a representation of a finite state machine). The difference
between "regex alternatives" and regex's is that the alternatives tend to be
more verbose and less well documented...and less likely to elicit good answers
on StackOverflow...and perhaps less likely to generate good questions there.

I think the hard part of pattern matching is reasoning about pattern matching.
The obscurity of Regex notation is mostly a function of unfamiliarity with the
concepts. [:word:]+ is not easier to reason about than \w+ and "\w+" is much
better documented than "[:word:]+" or "Word + [1+]."

The other problem with learning Regex's is that regex notation is someone
else's code. There's always the attraction of fixing it. I've Dunning-
Kruger'ed it myself. Fortunately, making my new more sensible superduper regex
notation complete required RTFM'ing...and by the time I'd read the manual I
realized I'd already fixed regex notation by fixing the absence of knowledge
in my head. Plus I could talk to other people about pattern matching using the
common language of pattern matching.

------
BeeOnRope
Any details and/or reproducible benchmarks backing up the "100s of times
faster" claim?

------
ychetyrko
First of all, thanks everyone for the valuable feedback, criticism,
suggestions, etc. I truly appreciate it!

And thanks for your patience to everyone who is playing with Nevod right now
and getting errors. Interest is unexpectedly high, and the number of visitors
to our playground is very large. Please keep in mind that this is a preview of
the technology.

Let me clarify a few things. The speed of Nevod is based on two things:

1. Nevod operates on words, not individual characters. From my 30 years of
coding experience, I would say that roughly 95% of text processing rules are
word-based. So it covers most tasks.

2. Nevod matches MULTIPLE patterns against a document in ONE PASS.
Patterns/expressions are indexed by a state machine and filtered efficiently
during matching (a sketch of the general one-pass idea follows below).
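
To illustrate just the one-pass mechanics, here is a minimal sketch of the
classic Aho-Corasick technique over literal words (an illustration of the
general approach only, not Nevod's actual engine):

    class AhoCorasick
      def initialize(patterns)
        @goto = [{}]; @out = [[]]; @fail = [0]
        patterns.each do |pat|  # build the trie of all patterns
          s = 0
          pat.each_char do |c|
            @goto[s][c] ||= (@goto << {}; @out << []; @fail << 0; @goto.size - 1)
            s = @goto[s][c]
          end
          @out[s] << pat
        end
        queue = @goto[0].values  # BFS to fill in failure links
        until queue.empty?
          s = queue.shift
          @goto[s].each do |c, t|
            f = @fail[s]
            f = @fail[f] while f != 0 && !@goto[f][c]
            @fail[t] = (@goto[f][c] && @goto[f][c] != t) ? @goto[f][c] : 0
            @out[t] |= @out[@fail[t]]
            queue << t
          end
        end
      end

      def scan(text)  # a single pass, whatever the number of patterns
        s = 0
        text.each_char.with_index do |c, i|
          s = @fail[s] while s != 0 && !@goto[s][c]
          s = @goto[s][c] || 0
          @out[s].each { |pat| yield pat, i - pat.length + 1 }
        end
      end
    end

    ac = AhoCorasick.new(%w[he she his hers])
    ac.scan("ushers") { |pat, at| puts "#{pat} @ #{at}" }
    #=> she @ 1, he @ 2, hers @ 2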

So, with a RULE DECK of 1000 patterns, we got a 300x improvement compared to
RegExp. This is the kind of task where you need, let's say, to highlight
syntax, to massively verify and route incoming requests in the cloud, to
perform massive text validations, to track certain phrases and relations in
massive human communication, etc.

As the number of patterns grows, the difference between Nevod and RegExp gets
higher and higher and goes far beyond 300x. So I kind of disagree with the
moderators' removal of the word "hundreds" from the headline. :)

We will publish benchmark results, and we will also release "negrep" (Nevod
GREP), a command line tool for Windows, Linux, and Mac, so everyone will be
able to run "negrep", play with it, and verify the benchmark themselves.
Generally, Nevod technology will be available in the form of a library, a
command line tool, and a service.

~~~
glangdale
> 2. Nevod matches MULTIPLE patterns against a document in ONE PASS.
> Patterns/expressions are indexed by a state machine and filtered efficiently
> during matching.

This is a good idea. It's such a good idea that we did it in 2006, sold a
company to Intel based around it in 2013, and open-sourced the resulting
library
([https://github.com/intel/hyperscan](https://github.com/intel/hyperscan)) in
2015.

All multi-pattern systems get a huge improvement over one-at-a-time regex. It
would be embarrassing if they did not. Our rule of thumb for Hyperscan was
that it should be about 5-10x faster than libpcre on single patterns and the
gap should get progressively bigger.

I don't doubt that there is quite a bit of interesting work (entirely outside
of regex - the word-based focus looks interesting) going on in this project,
but the fact that you couldn't be bothered to Google around for a sensible
baseline is not encouraging.

Also, which bit of your system do you think is novel enough to patent? You may
find considerable prior art out there, which you don't really seem to be aware
of based on how much emphasis you're placing on the not-entirely-novel idea of
using a state machine.

~~~
ychetyrko
Geoff, for me it's a real honor that you, the well-known creator of Hyperscan,
paid attention to what we are doing. We truly respect all the prior work in
this field (Hyperscan, RE2, Rust, etc.), so different benchmarks will be
available.

However, we all need to keep in mind that Nevod operates on words, not
characters, and Nevod targets large rule decks (1000-100000 patterns). So we
don't compete in the field of individual pattern matching or streaming
matching (to some extent, the goals of Hyperscan and Nevod are different). In
other words, a lot depends on the target goals, which are usually reflected in
the kind of patterns and in their number.

Regarding the patent: sorry, for legal reasons I can't expose details at the
moment.

Thank you for your valuable comments.

~~~
glangdale
I don't think I'm all _that_ well-known. :-)

Staying away from streaming is a good idea for a lot of designs. We all lost a
lot of our hair making streaming work in Hyperscan, and it was often the thing
suppressing easier optimizations. Or at least we would have to have a separate
approach for streaming and block mode; not exactly great s/w engineering.

Word-based approaches are interesting. So is large scale. Hyperscan does
embarrassingly badly on natural language stuff at scale due to some baked in
assumptions about relative frequencies of literal factors and the literal
matchers tend to fall apart on really huge literal sets in this domain. So
there's definitely an opportunity here.

I would strongly encourage you to produce some repeatable benchmarks (at
least, things where we can see what the task is for something like RE2 or
Hyperscan). We spent many years in stealth mode with a lot of superlatives and
NDA-only PowerPoint decks, and I don't think it did us any favors. It means
that you wind up being the only guardian of your own honesty, which is ...
tricky ("What the hell is a medium-complexity pattern, anyhow?").

~~~
ychetyrko
Based on what you are saying, I'm concluding that we are on the right path.
Again, thanks a lot for all the valuable feedback and advice!

------
bakpakin
Pretty cool tool. The NLP focus is surprising, but I'm not sure if it is a
marketing ploy or fundamental to the technology. The nice syntax is also an
improvement over things like LPeg, which seems to obscure the grammar by
forcing it to be written in Lua. Seems like it could overlap in functionality
with Rebol/RED parse, which I believe has been posted on here before, LPeg, or
just PEGs and packrat parsing in general.

[https://www.red-lang.org/2013/11/041-introducing-parse.html](https://www.red-lang.org/2013/11/041-introducing-parse.html)

The parse engine is implemented with a PEG, so it is probably a lot slower,
but it also supports pattern matching over whole words. Also, the database
aspect of Nevod seems interesting and potentially offers a huge speedup. But
as far as I can tell, the syntax looks like a PEG with some implicit rules
that separate words. As far as speed goes, I definitely believe that Nevod
could be very fast, but I haven't seen any numbers.

Shameless plug: I've implemented something similar with PEGs for Janet, a lisp
I have been working on for a while. I make no claims about its speed, but the
PEG interpreter is written in tight C so it shouldn't be too slow. The peg
module in Janet works with a DSL that looks like a lispy EBNF and results in a
recursive parser compiled to bytecode for the interpreter.

[https://janet-lang.org/peg.html](https://janet-lang.org/peg.html)

You can also mess with the language in a browser on the home page.

[https://janet-lang.org/](https://janet-lang.org/)

------
janoc
"Nevod is a language and _patent-pending_ technology for pattern-based text
search."

Yawn. I will stay with my patent-free regexps (or whatever other tech I may
need) rather than rely on something that will let you put a gun to my head if
I ever wish to sell my product.

No thanks. Software patents are evil crap.

------
rusk
Now you've got 3 problems!

~~~
agumonkey
off by one

------
Zelphyr
One of the best and most powerful alternatives to RegEx I've ever seen is the
PARSE function used in the Rebol and Red programming languages.

    
    
        page: read http://hq.rebol.net
        parse page [thru <title> copy title to </title>]
        print title
        The REBOL Developer Network

~~~
eggy
I like the compactness and syntax of Red a lot.

Pharo's Petit Parser is a framework for building parsers that is also pretty
nifty [1].

[1]
[https://github.com/moosetechnology/PetitParser](https://github.com/moosetechnology/PetitParser)

------
ggm
Is this another confusion of syntax (the mechanism's written form) vs
semantics (how it actually operates)?

If Nevod builds a different textual model and applies what you "say" to it
differently from the underlying regex model, that's where the speedup comes
from. _How_ you say what you want in pattern matching is purely a syntactic
matter: I like regex, coming from the UNIX philosophy, because of the _syntax_
fluidity for saying things. What actually happens when I say a|b|c|d is that
it builds a DFA, and that's not so bad; but if I do individual /a/p /b/p /c/p
patterns in sed, same engine but no DFA builder, it's slow.

So is Nevod a new syntax _and_ a DSL tied to language, or is it a new syntax
_and_ a generalized text matching model with some real semantic shift from
regex?

------
asah
You lost me at "patent pending"

------
imhoguy
This thread reminds me of the lexers vs. parsers debates:
[https://stackoverflow.com/questions/2842809/lexers-vs-parser...](https://stackoverflow.com/questions/2842809/lexers-vs-parsers)

------
tianshuo
You could easily do something like this using open-source parser generators,
like Pegjs ([https://github.com/pegjs/pegjs](https://github.com/pegjs/pegjs)),
and own the final code yourself.

~~~
aidenn0
[edit] After reading my comment it sounds like I don't like packrat parsers. I
actually love them, and when they are available for the language I'm using
they are my first choice, but the first rule of engineering is that everything
has its trade-offs, so...

I'm not familiar with Pegjs, but other PEG parsers I've seen tend to use the
packrat algorithm, which is suboptimal for regular languages, because it
memoizes parses to speed up backtracking, and regular languages do not need
backtracking.

For example, if you were to write a recursive-descent parser for JSON and
convert it to a packrat parser, you will often find the packrat parser is
slower.

Now, extended regex's include backtracking, and that's where packrat parsers
can soundly defeat recursive-descent parsers: super-linear time parses can
become linear time. This makes packrat parsers a wonderful "default choice"
but if constant factors are important _and_ your language is regular, you will
want to look beyond packrat parsers.

------
peteforde
Sounds exciting if it's real, but extraordinary claims require extraordinary
evidence.

OP (@ychetyrko), are you involved with this project? If so, announcing this
before you have usable code that people can use without licensing encumbrance
might have been a lost opportunity. Much more detail is required.

------
speedplane
Higher level text searching has been around for decades. In fact lawyers use
this all the time. For example:

motion w/3 (denied or deny!)

This will match any sentence with the word "motion" within three words of the
word "denied" or any other word that starts with "deny" (e.g., "denying").
Often, these systems use regex underneath with fancy libraries built on top to
determine word separation, sentence separation, etc.
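
A toy Ruby sketch of that word-window idea (my own illustration, not any
particular product's implementation):

    def motion_denied?(sentence, window = 3)
      words = sentence.downcase.scan(/[a-z]+/)
      ms = words.each_index.select { |i| words[i] == "motion" }
      ds = words.each_index.select { |i| words[i] == "denied" || words[i].start_with?("deny") }
      ms.product(ds).any? { |i, j| (i - j).abs <= window }
    end

    puts motion_denied?("The motion was denied.")                       #=> true
    puts motion_denied?("The motion to dismiss the appeal was denied.") #=> false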

The truth is that regex is really great for character level searching, but if
you commonly do word or sentence level searching there are a variety of
solutions already available.

------
runxel
> […] works in every language

> Operates on words, not characters.

Yeah, well. How shall I tell you?

------
mikemoka
Regex is quite powerful and free anyway. I would consider some of these
open-source tools to tackle its complexity:

[http://regex.inginf.units.it/](http://regex.inginf.units.it/)
[http://buildregex.com/](http://buildregex.com/)
[https://regexr.com/](https://regexr.com/)

------
stevefan1999
Sigh.
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

Sadly, RegEx has evolved far away from the original regular expressions we
learned in school, and it is certainly less NFA-like. This makes it harder to
execute at high speed; e.g., backtracking makes it more context-sensitive,
etc.

------
mimixco
This could be interesting but without any information about licensing, cost,
or available platforms, most people are going to ignore it.

------
saagarjha
I’m just glancing at the reference
([https://nevod.nezaboodka.com/#reference](https://nevod.nezaboodka.com/#reference))
and this looks like a superset of regular expressions? I’m curious as to how
you’re getting the speedup you claim on this.

------
meruru
I learned about Lua patterns while trying out OpenBSD httpd, and I quite liked
them as a regex alternative. They're easy to learn and cover most of my needs:
[https://man.openbsd.org/patterns.7](https://man.openbsd.org/patterns.7)

------
dzink
If they want to actually succeed in making this ubiquitous, it cannot be a
Grammarly-like plug-in that sends text back to the motherland. It has to be
standalone and locally hosted. Otherwise it's just another MITM / spying-prone
library.

------
cheeaun
Found the GitHub repo
[https://github.com/Nezaboodka](https://github.com/Nezaboodka) but I can't
seem to find the code :/

Stepping thru code in Inspector is pretty troublesome as well...

------
jermaustin1
I tried to use the URL matching pattern from the reference, and I got the
following error:

Error: Compilation error: Pattern capture is not supported in this version of
the pattern matching engine, pattern 'Url'

------
amitport
Please add implementation details. I once made a search engine that could
support a similar feature set. It's possible that we did the same core tricks,
but I have no way to evaluate this.

------
chaitanya
This might be a good tool, but what is patent-worthy about it?

------
bratao
This reminded me how much I miss LPeg, Lua's parsing expression grammars. It
was very fast and much nicer to work with than regex.

------
JoelMcCracken
I have often thought that an EBNF-grammar-like system would work better in
many situations than regular expressions.

------
ungamed
The syntax reminds me very much of the under-appreciated pyparsing library.

------
maxk42
Patent-pending?

No thanks.

------
tentakull
lol, spare me, only reinvent the broken and work towards standardization

------
beiller
But can it parse XML!?

