>> I’m looking forward to someone improving the performance of Go’s regexp package, which is quite slow.
Hah, this reminds me how, several years ago, I rewrote one of my homework assignments from Go to Python, since the Go version was so terribly slow because of regexes. I really hoped it was better by 2021.
Interesting, though not too surprising. Do you remember what the regexes were that your assignment used?
Go's regexp package does have a couple of advantages over Python, Perl, and so on: 1) it's guaranteed linear time in the length of the input, regardless of the regex, see https://swtch.com/~rsc/regexp/regexp1.html, and 2) it's a relatively simple implementation.
That's excellent! Those all look like pretty nice small code changes that all add up. I especially like the very small "avoid copy" change (https://go-review.googlesource.com/c/go/+/355789) that adds up to a 30% speedup on many benchmarks. I hope they get included in Go 1.19. Good work!
Simplicity of implementation isn't what users need, though; they need performance. For example, you can make GCC into a much simpler C compiler by compiling at -O0, but in practice nobody does that.
Totally agreed: almost all users (me/GoAWK included) want performance and don't care nearly as much about simplicity under the hood. Simplicity of implementation is of value for educational purposes, but we could easily have a small, simple 3rd party package for that. Go's regexp package is kinda too complex for a simple educational demonstration and too simple to be fast. :-)
I actually tried BurntSushi's https://github.com/BurntSushi/rure-go (bindings to Rust's regex engine) with GoAWK and it made regex handling 4-5x as fast for many regexes, despite the CGo overhead. However, rure-go (and CGo in general) is a bit painful to build, so I'm not going to use that. Maybe I'll create a branch for speed freaks who want it.
I've also thought of using https://gitlab.com/cznic/ccgo to convert Mawk's fast regex engine to Go source and see how that performs. Maybe on the next rainy day...
Yeah, that's a good idea; I did consider it but haven't tried it yet. Do you hook in and look at the regex string before it's compiled, or do you hook in at the parsed regex AST level (e.g. regexp/syntax in Go)?
In the Go code where I did this, it was a little different, with a static pattern. Something like "(\w+) apple" to find all apple adjectives or whatever, but the regexp wasted so much time matching words that weren't followed by "apple". A quick scan for "apple" to eliminate impossible matches made it faster. This depends on knowing the regex and corpus, so it's probably less relevant for AWK.
It's been a while since I've looked at the source code, but it is almost certainly already doing basic prefix literal optimizations.
The more advanced literal optimizations come into play when every match must start with one of a few possible characters or strings. Then you get into things like Aho-Corasick or fancy SIMD algorithms (Teddy).
Oh, regarding rure-go: the note in its bugs section about using int is inaccurate. The Go spec guarantees that the length of any object fits in an int, so you can't build a haystack that's too big.
> almost all users (me/GoAWK included) want performance and don't care nearly as much about simplicity under the hood.
That works up until you're at scale, where the complexity of the implementation causes hard-to-understand edge conditions, and bugs simply from the complexity exceeding what the developers can handle.
Exactly: at some point the code is such a mess that further optimizations and features are harder to get merged. And the smaller the number of contributors, the bigger the problem gets, until the main maintainer calls it quits and a completely new library is needed.
Is that also true when even the essential complexity of the project is demanding? Runtimes, especially those with JIT compilers, are not easy to begin with.
I don't exactly agree. Sure, end users don't care about the implementation directly, but simplicity of implementation does affect them indirectly. Go is already over ten years old, with maybe many more years ahead, and all code bases rot. I think the simpler the implementation, the easier it is to cure rot and code smells, which hopefully means Go has a long life as the implementation stays easy to work on over time. So while users may not care, it does impact them.
You can make the same argument about CPUs. Modern CPUs are horrendously complex. But nobody is asking to remove, say, out-of-order execution on simplicity grounds, because that would hurt users and cost millions of dollars at scale for no reason other than engineering aesthetics.
It's only in a few areas, like programming languages and operating systems, that we're obsessed with simplicity, and it makes little sense to me.
Branch prediction is a CPU "complexity" that got us into some amount of trouble.
I don't see simplicity as a "virtue", as such: it's all about the ability to reason about a system. The more complex it is, the harder this becomes. This makes it harder for implementers to work on it, and harder for users to figure out what is going wrong.
On the other hand, complexity often offers useful enhancements such as features or performance. There is some amount of tension here, which is hardly unique to software: you see this in, say, tax systems, regulation/law, design of cars and all sort of things, etc.
I'd say it's probably because we're all worse at writing code than we'd like to imagine. So writing unnecessarily complex code, especially in a FOSS compiler or system, makes little sense, since someday someone else is going to have to step in and learn it.
I agree, but simplicity of implementation is a net positive in a vacuum. It's definitely worth trading some of it away for better performance, but it has lots of upsides that users indirectly benefit from, so I think it's important to at least strike a balance.
Simplicity of implementation also contributes to Go’s fast compile times, which is a different sort of performance. Trying to find a sweet spot between “slow” interpreted languages and “fast” compiled languages with long compile times (e.g. C++ template hell) is a worthy goal.
I think multi-profile compilation is much better: have a really fast debug build that does hardly any optimization, and a release build that can take as long as it needs but is really optimized.
> Simplicity of implementation isn't what users need, though; they need performance.
It's a tradeoff, in the end. I mean sure, users don't really need to know how things work under the hood, but the people building and maintaining the language do; Go's philosophies on maintainability extend to the language's internals as well.
This is one reason why generics took over ten years to land; they needed to find the middle ground between functionality and sanity within the compiler. Java's generics implementation takes up a huge chunk of the spec and tooling. I don't even want to know about Scala's. It added so much more complexity to the language that I'm not surprised Java stagnated for as long as it did.
Regexps are one of the things you see attacked all the time (mostly DoS). Users care a lot about security, and simplicity of implementation correlates with it. It's not something users need directly, but they do benefit from it.
Performance is relative, so I'm not really sure of the point being made here. Sure, if regex is the constrained resource in your program this matters, but again, it's all relative.
Go is not about providing the fastest implementations out of the box, it's about having a broad toolset in the standard library to solve the problems Go was built for.
Faster (and often more complex) implementations are a maintenance burden for Go contributors. It's far better for a high performance regex library to be a third party package for those that need it.
Those for whom regex is a limiting factor in performance will soon find out why. But for most people, fast regex matching is nothing compared to the overhead of a simple HTTP request.
For most regexes, backtracking and the Thompson NFA have the same asymptotic complexity, which is why most languages adopted backtracking. The implementors of such languages knew what they were doing, especially when you consider that adopting the Thompson NFA means giving up backreferences. The differences only arise with pathological regexes.
I used to think that backtracking was superior to the Thompson NFA in practice on typical regexes, but modern implementations of the Thompson NFA have shown that I was wrong in that and the Thompson NFA can match backtracking's performance. Still, the situation isn't nearly as simple as Russ's article makes it out to be by only focusing on /a?a?a?aaa/, a regex which nobody would write. (This is not to deny that people do write pathological regexes from time to time. But what's the size of the userbase that writes pathological regexes compared to the size of the userbase that uses backreferences?)
FWIW, I would say that it's difficult for a Thompson NFA on its own to beat backtracking in non-pathological cases. So I actually think your prior is still mostly correct.
Now a hybrid NFA/DFA (or a "lazy DFA") that does subset construction at search time using a Thompson NFA can definitely beat out a backtracker. A lazy DFA will generally match the speed of a fully compiled DFA, and is about an order of magnitude faster than a Thompson NFA simulation:
People will gleefully write that regex if it causes a denial of your service. People would similarly blow up FTP servers with "ls ****" globs.
Arguably the best implementation would be a backtracking-based implementation that goes fast, with some sort of strong API for controlling the amount of effort the implementation puts into matching, because the major concern is the truly pathological cases. For the most part I'm not all that concerned about cases where one or the other implementation is slightly faster on this or that input, I'm worried about the DOS attack. However, determining the correct amount of "effort" to put into a match could be nontrivial; there's the obvious "time out after this amount of clock time" which has its uses for security, but if you want something more sophisticated (including measuring by actual effort applied rather than just clock time, which only loosely correlates with how long you've actually had to run) it's difficult to even know exactly how to express what you want, let alone implement it sensibly in the current programming environment.
If such a thing exists, I haven't seen it, but that's not to say it doesn't. In general, though, programming languages and libraries have poor-to-nonexistent support for bounding the amount of resources spent on a given internal computation. That's understandably a rather niche need that's hard to justify much effort on, especially since "bound the amount of resources for a given program", when well implemented, tends to cover a lot of the use cases.
It was 7 years ago, so I don't exactly remember what regex I used. I remember it was an information retrieval course, and I was supposed to write a crawler and index web pages. So I think part of it was definitely finding all the links within a page. I was working on extra credit, so there might have been some funky stuff.
According to my totally unscientific benchmarks, if the performance of Rust's regular expression module were 1x, Go's was 8x and Ruby's was 32x, and Java's was 42x. Using Google's Java regex module improved the speed quite a bit but still was at ~18x. I was very impressed to see Rust doing so well. And it was sad to see Java so underperforming in such a typical workload. I know we're measuring libraries not languages, but I think regexes are so prevalent that not optimizing for it would hinder the language's real life performance.
For the JVM benchmark, unless you used Pattern.compile() and let the VM warm up (or used the JMH benchmark framework), your numbers are likely wildly off.
It depends on the goal of the measurement. For example, if you wanted to write a grep in Java, would the total runtime of the program be faster or slower if you "waited for the VM to warm up?"
The Java and Go numbers are not apples to apples. The Java solution uses the standard library, while the Go one calls out to native code (a wrapped C library) as far as I can tell. Same with Python.
Can't comment on that, but why are these people putting spaces after commas (correct) but not around the assignment sign? It's so weird to read a=b.
That’s very straightforward translation of C code using pcre to Python using ctypes. It doesn’t look great because the ugliness isn’t wrapped in a library. But Python ctypes definitely isn’t hard to use. It is unsafe as hell though.
That's interesting that Ruby is as slow as it is, the regex engine (Onigmo) is written in C. I wonder where the bottleneck is compared to the other languages.
Seems to be mainly a test of compiling regexes, rather than executing them. Which is a bit pointless because few applications would not pre-compile regexes ... and then it would actually be penalising languages that put more effort into optimisation and hence perform better in the real world at execution.
Why would you even say this? It is certainly not a test of compiling regexes. Go and profile any one of those programs. You'll see that the majority of time is spent executing the search, not the compilation.
Look at the runtimes being mentioned here. We're talking on the order of seconds. The benchmark itself calls for 15 pretty simple regexes to be compiled. Regex compilation of 1ms or longer would be considered slow I think. So that's 15ms. That's nearly nothing compared to the full runtime of the benchmark. Even if you assumed each regex took 10ms on average to compile, you're still only looking at 150ms total.
I say this as the author of Rust's regex engine and as someone who has spent a non-trivial time looking at and thinking about this particular benchmark.
Unfortunately, the test code belongs to the company I work for so I cannot take it out. It was done to determine what language we should use for an internal tool. I hope to conduct another benchmark one day, publicly this time.
There's a lot of room for improvement on the compiler and library end. RE2 and Hyperscan demonstrate the ceiling here.
The NFA simulator is heavily optimized for readability, which means lots of recursion, and fewer special cases outside of small regexps. The compiler also doesn't perform certain optimizations like vectorization and emitting jump tables, which might be useful here.
There isn't a metaprogramming facility to generate equivalent Go code like in re2go: https://re2c.org/manual/manual_go.html. The best we can do is pre-declare the regexps globally as to initialize them once, but we still have to run the interpreter.
If it helps, a version is simply not a decimal number, and the dot is not a decimal point. It's just a separator. The dot is often used for purposes other than a decimal point (a few times in this comment!).
Version numbers are effectively base infinity, not base 10. Periods are digit separators, and each digit is written in decimal. The usual representation for IP addresses is basically the same (four base-256 digits separated by periods)
Interesting how somebody can be tripped up by 1.2 vs 1.18 yet not tripped up by 1/2 vs 1/18. It could just take time, but I think there’s also a factor of feeling that dates are worth learning while version numbers don’t deserve direct attention and instead should be a subset of a nearby concept that does.
For Virgil, because I am a fan of the Roman poet, I am using Roman numerals for the major versions and appending a monotonically-increasing-regardless-of-major-version "00XXX" build number to that. This ensures they always sort lexicographically.
...until I get to version IX, i.e. 9, which would sort incorrectly. So I'll just skip 9, 19, 29...90...heh. :)
If your algorithm is “choose the next number for which its Roman numeral sorts lexicographically higher than the current number”, then if you start at 1 you will only get 35 elements, since 38 (XXXVIII) is the lexicographically-highest Roman numeral.
Using this precise rule, the starting point that gets you furthest is 250, which grants you 153 elements before stopping at 1038 (MXXXVIII).
It’s possible to devise a longer sequence by skipping some numbers (e.g. after 1000 you’d jump to 1100), but I’m too lazy to calculate how you’d do all that.
All this also assumes subtraction rules, which were a late addition to Roman numerals; without subtraction rules, four is written IIII instead of IV, and nine VIIII instead of IX, and so all of 1–49 would be lexicographically sorted.
    from docutils.utils.roman import toRoman

    def sequence(start):
        last = ''
        out = []
        for i in range(start, 5000):
            this = toRoman(i)
            if this > last:
                last = this
                out.append(i)
        return out

    # Brute force way because I’m lazy; takes around ten seconds on my machine.
    start, s = max(((start, sequence(start)) for start in range(1, 5000)),
                   key=lambda x: len(x[1]))
    print(f'{start} gives you {len(s)} items, {s}')
Me too. I have a project suffering pretty badly from the performance hit taken when a regexp doesn't start with a fixed string. Matching against `[ab]cd` is up to 9x slower than against `ab[cd]`.
One of our ACLs at work relies heavily on Go regexp. The performance of evaluation is actually not too bad. What is quite terrible is the performance of *compiling* regexps.
I find this a bit surprising -- do you have numbers? Though even if compiling them is relatively slow it doesn't matter too much, because usually you're compiling once and evaluating many many times (e.g., in the case of a typical AWK script, you compile once and evaluate for each line in the input).
Yeah, if you can call straight into a C module (e.g., regex or csv or etc) and not do anything of consequence in Python, that’s the sweet spot from a performance perspective.