Hacker News

Too many to enumerate; the utter lack of Unicode support alone is a killer. The tool supports only the academic interpretation of regular expressions, those strictly equivalent to some NFA/state machine, which is useless because for the last forty years we have been using programming languages and libraries with extended abilities to deal with real-world problems.

Some programmers have adopted the vernacular "regular expression" for the former and "regex" for the latter to make the distinction easier; see the quote in http://enwp.org/Regexen#Patterns_for_non-regular_languages
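A minimal Python illustration of that distinction (my own example, not from the linked article): a repetition pattern stays within regular languages, while a backreference recognizes the "squares" language ww, which no finite automaton can accept.

```python
import re

# "Regular expression" in the CS sense: recognizable by a DFA/NFA.
regular = re.compile(r'^(?:ab)+$')

# "Regex" with a backreference: strictly more powerful than regular.
# (\w+)\1 matches exactly the strings of the form ww.
non_regular = re.compile(r'^(\w+)\1$')

assert regular.match('ababab')
assert non_regular.match('abcabc')            # w = 'abc', repeated
assert non_regular.match('abcabd') is None    # not a square
```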




Real regular expressions are hardly useless; I speculate that the majority of real-world uses of regexes are actually regular. Many state-of-the-art regex engines (RE2, Hyperscan, rust-regex) support only regular expressions.


> I speculate that the majority of real-world uses of regexes are actually regular

I just surveyed a corpus of regexes with a crude static-analysis tool and only 4% fit that restriction; I believe the result is accurate to within an order of magnitude. It makes sense: non-regular features are widely available, and so people use them.
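The kind of crude syntactic check mentioned above could look like this sketch (the feature list and function name are my own illustration, not the parent's actual tool). Note that backreferences are truly non-regular, while lookarounds are regular in theory but rejected by most DFA-based engines anyway:

```python
import re

# Hypothetical classifier: flag a pattern if it contains constructs
# that regular-only engines such as RE2 refuse. Purely syntactic, so
# it both over- and under-approximates.
NON_REGULAR = re.compile(
    r'\\[1-9]'            # backreferences \1 .. \9
    r'|\(\?=|\(\?!'       # lookahead
    r'|\(\?<=|\(\?<!'     # lookbehind
    r'|\(\?R\)|\(\?&'     # recursion / subroutine calls (PCRE)
)

def looks_regular(pattern: str) -> bool:
    return NON_REGULAR.search(pattern) is None

assert looks_regular(r'[a-z]+[0-9]*')
assert not looks_regular(r'(a+)b\1')
assert not looks_regular(r'foo(?=bar)')
```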

> state-of-the-art regex engines (RE2, Hyperscan, rust-regex)

These are a clear regression from the actual state of the art that's in use everywhere. (I know that RE2's reason for being is precisely to have fewer features.) The advent of Perl (and, relatedly, libpcre) utterly obliterated the competition of that time, and newcomers have not been able to wrest the crown away.


PCRE is hardly used "everywhere". Vim switched from a backtracking engine to a DFA one. Emacs has its own non-backtracking engine, as does GNU grep. Ripgrep, notably, is built around rust-regex, and VS Code uses ripgrep for its searches. Google, of course, uses its own RE2 in its products. Hyperscan has some case studies of its use:

https://www.hyperscan.io/2016/01/21/rspamd-1-1-released-hype...

https://www.hyperscan.io/2020/09/28/optimize-azure-cloud-sec...

https://www.hyperscan.io/2018/10/19/hyperscan-adopted-by-git...

Even PCRE itself has a DFA mode, because people want and use that. It's not just academic navel-gazing.

I'm not saying that non-regular engines, PCRE at the forefront, are not popular. They are. But they are not the be-all and end-all of regex; regular engines still see a lot of use, especially in performance-sensitive applications.
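A concrete reason performance-sensitive users reach for regular-only engines (my own toy demonstration): nested quantifiers make a backtracking matcher do exponential work on a failing input, while a DFA/NFA-simulation engine (RE2, rust-regex, Hyperscan) stays linear.

```python
import re

# Classic pathological pattern for backtracking engines. The input is
# kept tiny on purpose: each extra 'a' roughly doubles the work, so a
# few dozen characters would already hang Python's re module.
pathological = re.compile(r'(a+)+$')
subject = 'a' * 18 + 'b'

assert pathological.match(subject) is None  # fails only after ~2^18 backtracks
```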


> PCRE is hardly used "everywhere".

The topic under discussion was the extensions that make regexes non-regular (as popularised by Perl and libpcre), not PCRE per se. Per this site's rules, I assume good faith: you simply misunderstood me and did not deliberately put up and topple this straw man.

Adoption of non-regular extensions is overwhelmingly larger than adoption of the opposite.

1. These non-regular extensions can be found in Java/Kotlin/Scala/etc., JavaScript, Perl, PHP, Python, Ruby, C#, R, Swift, Matlab, Julia, Haxe, OCaml and literally dozens of other languages on the various popularity charts, and as a first-pick option in C, C++ and Lua. Go and Rust are the exceptions to the rule! There are millions of pieces of software written with these that one can't even see, because they are not public.

2. Programmers and end users want features and power much more than they want determinism. (Performance is a red herring: the vast majority of the time it is good enough, or even identical to the non-extended engines.) That's why ripgrep and GNU grep and rspamd have them.

https://github.com/BurntSushi/ripgrep/blob/master/FAQ.md#how...

https://www.gnu.org/software/grep/manual/html_node/Regular-E...

https://rspamd.com/announce/2016/03/21/rspamd-1.2.0.html

3. A factual survey of where the libraries are used. This is invisible for the aforementioned programming languages because they have regex built in, but libpcre alone versus re2 and libhs shows clearly which paradigm is dominant and which is a niche.

libpcre: ag, apache2, blender, clamav, cppcheck, exim, fish, git, gnome-builder, godot, grep, haproxy, kodi, libvte, lighttpd, lldb, mariadb, mongodb, mutt/neomutt, mysql-workbench, nginx, nmap, pam, postfix, Qt5/Qt6, rspamd, selinux, sway, swig, syslog-ng, systemd, uwsgi, varnish, vlc, wget, zsh … … … and 110 more.

re2: bloaty, chromium/chromedriver/qtwebengine, clickhouse, libgrpc

libhs: libndpi, rspamd


The tool could be extended to support Unicode, whereas AFAIK it would not be possible to extend it to support backreferences. Are there any other “regex” features that would be impossible to support?


>> Too many to enumerate

I take back my previous claim; it was an overstatement.

> The tool could be extended to support Unicode

Not an easy task. There are some things in the standard that do not map neatly onto states, notably the case folding of characters that changes the number of characters, and the treatment of the generic line boundary. Edit: after browsing UTS #18, I am almost certain that a conforming implementation cannot be mapped the way the tool exemplifies. Maybe there's a neat work-around possible.
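The case-folding problem in a nutshell (my own example): Unicode full case folding can change string length, so a caseless match can't be a per-character state transition.

```python
import re

# U+00DF LATIN SMALL LETTER SHARP S full-case-folds to two characters.
assert 'ß'.casefold() == 'ss'
assert len('ß') == 1 and len('ß'.casefold()) == 2

# Python's re does only simple (length-preserving) folding, so it
# misses exactly this case:
assert re.fullmatch('(?i)SS', 'ß') is None
```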

> features that would be impossible to support?

(?=, (?!, (?<=, (?<!, (?{, (??{, (?&, (?(…), (?>, (*asr:, (*SKIP)


Some of these are in fact possible, though with some restrictions. I wrote a blog post on how to support some negative lookbehinds: http://allanrbo.blogspot.com/2020/01/alternative-to-negative...


Intersection and complement are both expressible in regular expressions (in the CS sense). Is that not what you mean by impossible to support?

(I'm too lazy to look up what the other listed notations mean.)
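The reason intersection stays regular is the classic product construction: run two DFAs in lockstep, with states being pairs of states. A toy sketch in Python (encoding entirely my own, purely to illustrate the CS-sense claim above):

```python
def run(dfa, s):
    # Simulate a complete DFA encoded as start state, accept set, and
    # a (state, char) -> state transition table.
    state = dfa['start']
    for ch in s:
        state = dfa['delta'][(state, ch)]
    return state in dfa['accept']

# Language A: even number of a's.   Language B: strings ending in b.
even_a = {'start': 0, 'accept': {0},
          'delta': {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1}}
ends_b = {'start': 0, 'accept': {1},
          'delta': {(0, 'a'): 0, (1, 'a'): 0, (0, 'b'): 1, (1, 'b'): 1}}

def intersect(d1, d2, alphabet='ab'):
    # Product construction: pair states, accept iff both accept.
    states1 = {s for s, _ in d1['delta']}
    states2 = {s for s, _ in d2['delta']}
    return {
        'start': (d1['start'], d2['start']),
        'accept': {(s1, s2) for s1 in d1['accept'] for s2 in d2['accept']},
        'delta': {((s1, s2), ch): (d1['delta'][(s1, ch)], d2['delta'][(s2, ch)])
                  for s1 in states1 for s2 in states2 for ch in alphabet},
    }

both = intersect(even_a, ends_b)
assert run(both, 'aab')        # two a's, ends in b
assert not run(both, 'ab')     # odd number of a's
assert not run(both, 'aa')     # does not end in b
```

Complement is even simpler: flip the accept set of a complete DFA.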


> Is that not what you mean

no, I meant the generic line boundary



