In addition to the relatively clean Rust -> C interface, it's very nice to have a higher degree of compile-time assurance of the correctness of the property model itself. As the models become non-trivial, it becomes increasingly difficult to build them reliably in C itself.
QuickCheck-style tools typically generate inputs based on high-level static information, like types or programmer-provided specs. To reach the sophistication of modern fuzzers, you need to be smarter.
For one, you should inspect the code of the function. If you know how the code branches, you can more efficiently generate inputs that exercise both sides of the branch.
But since you can't always figure out how to exercise both sides of a branch based on static analysis alone, fuzzers also observe the dynamic execution behavior of the program. This allows them to randomly adjust inputs until they know both sides of a branch have been hit.
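As a toy sketch of that feedback loop (not any real fuzzer's design, and the target function is contrived to have a branch that's hard to hit by blind generation), randomly mutating an input buffer while watching branch coverage looks roughly like this:

```rust
// Toy coverage-observing fuzz loop: mutate a buffer at random until
// both sides of the target's branch have been observed.

// Contrived "program under test" that reports which branch it took.
fn target(input: &[u8], coverage: &mut [bool; 2]) {
    if input.first() == Some(&0x7f) {
        coverage[0] = true; // the hard-to-hit "then" side
    } else {
        coverage[1] = true; // the easy "else" side
    }
}

// Tiny deterministic PRNG so the example needs no external crates.
fn next_rand(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

fn fuzz() -> [bool; 2] {
    let mut seen = [false; 2];
    let mut input = vec![0u8; 4];
    let mut rng = 0x12345678u64;
    for _ in 0..100_000 {
        // Randomly mutate one byte of the current input.
        let i = (next_rand(&mut rng) as usize) % input.len();
        input[i] = next_rand(&mut rng) as u8;
        target(&input, &mut seen);
        if seen.iter().all(|&b| b) {
            break; // both sides of the branch have been hit
        }
    }
    seen
}

fn main() {
    let seen = fuzz();
    println!("both sides hit: {}", seen.iter().all(|&b| b));
}
```

Real fuzzers are far smarter (they keep a corpus of coverage-increasing inputs rather than blindly mutating one buffer), but the observe-and-adjust shape is the same.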
I imagine that lifting fuzzing techniques to a higher level could have efficiency benefits. Maybe someone builds a fuzzer that runs on Rust MIR? Or maybe Rust could dump out additional metadata (alongside the LLVM bitcode) to help the fuzzer figure things out?
I'd still consider this to be essentially "static" information, since it doesn't change as the program executes.
The Rust community is pretty aware of QuickCheck (there's a Rust version), I would say.
It seems a lot nastier to me, which I think is why QC-style automatic testing has made fewer inroads in impure languages.
It also makes fewer inroads into fuzzing, because fuzzing starts a few steps back, where you know even less about the application. You could do a grammar-based fuzz of the application, but then you have to worry about the map/terrain problem of the grammar/application. Are you missing some case in the application because your grammar is incomplete?
In QC, the map is the terrain, because the "grammar" is just a type that your application uses, and this isn't really the case in fuzzing. It could be, when fuzzing some languages (like Java) where you could interrogate the program you're testing (even a binary program) for the types it uses. Think about C programs, though, where the application input and the "types" that the application uses are (or can be) pretty distant.
Anyway, I don't think it's disquieting. I think they're two very different software assurance techniques that apply in two very different domains and that explains why people don't say too much about them in the same breath. They're united by randomness, but I don't think that much more.
Letting the generation be guided by static types was just a good design decision for the original Haskell version of the library---it really helps with convenience.
 I was actually about to footnote here that it was invented in Erlang-land, but apparently not: https://en.wikipedia.org/wiki/QuickCheck . Not sure why I thought that, maybe just because Joe Armstrong has been an advocate of QC-style testing for a long time? Or maybe it was he who invented QC in Haskell?
 Which, granted, in Haskell will mostly be run-time checked, but at least you'll get an error at the earliest possible opportunity.
I think I will use this in the future.
IIRC the bits for OSX support are almost there, someone was trying this out before.
From the docs:
Unlike C, Undefined Behavior is pretty limited in scope in Rust. All the core language cares about is preventing the following things:
* Dereferencing null or dangling pointers
* Reading uninitialized memory
* Breaking the pointer aliasing rules
* Producing invalid primitive values:
  * a bool that isn't 0 or 1,
  * an undefined enum discriminant,
  * a char outside the ranges [0x0, 0xD7FF] and [0xE000, 0x10FFFF],
  * a non-utf8 str
* Unwinding into another language
* Causing a data race
Rust is otherwise quite permissive with respect to other dubious operations. Rust considers it "safe" to:
* Have a race condition
* Leak memory
* Fail to call destructors
* Overflow integers
* Abort the program
* Delete the production database
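To make the "permissive" side concrete, here's a small sketch showing a few of those operations happening in entirely safe Rust (using `Box::leak`, `mem::forget`, and wrapping arithmetic, which makes the overflow explicit and well-defined):

```rust
// "Safe" does not mean "no leaks" or "no overflow": everything below
// compiles and runs without any `unsafe`.

fn leak() -> &'static mut i32 {
    // Box::leak deliberately never frees the allocation.
    Box::leak(Box::new(42))
}

fn skip_destructor() {
    // mem::forget skips Drop, so the String's buffer is never freed.
    std::mem::forget(String::from("never dropped"));
}

fn overflow() -> u8 {
    // wrapping_add wraps around instead of being undefined behavior.
    255u8.wrapping_add(1)
}

fn main() {
    assert_eq!(*leak(), 42);
    skip_destructor();
    assert_eq!(overflow(), 0);
    println!("leaked, skipped a destructor, and overflowed -- all safe");
}
```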
It's just that Rust doesn't consider it unsafe to build a design that permits deadlocks and race conditions.
Hence trivial. Making something provably deadlock-free where the message flow is a DAG is a trivial case.
> race conditions are impossible
I very much doubt that since race conditions are not only a technical issue (i.e. "is updating that counter atomic?"), but largely a specification issue. Recall that issues like lost updates (or all those file system races) are race conditions, as well as the many issues with non-atomic REST APIs.
This bug was in the regex-syntax crate, which doesn't use any unsafe. The regex crate itself does have two uses of unsafe however (one for eliding bounds checks in the inner DFA loop and another for eliding bounds checks in Teddy).
It means that when you mistakenly assume an invariant that does not hold in safe code, the worst that happens is a panic. This is much better than what could happen, since it stops the process in a well defined manner. In a memory-unsafe language, you'll be lucky to get a crash. Instead, you might suffer silent data corruption or a potential exploit, or both.
The code here made an assumption that didn't hold, and predictably hit a panic when that assumption was tested.
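To see how well-defined that failure mode is: assuming the default unwinding panic strategy (not `panic = "abort"`), a panic can even be observed from within the process, because it stops at a known boundary instead of corrupting anything:

```rust
fn main() {
    // Indexing out of bounds in safe Rust panics; it never reads
    // out-of-bounds memory. catch_unwind observes the panic (the
    // default hook still prints the panic message to stderr).
    let result = std::panic::catch_unwind(|| vec![1, 2, 3][10]);

    // The unwind stopped at a well-defined boundary; the process is
    // still alive and can report the failure.
    assert!(result.is_err());
    println!("caught the panic; no memory was corrupted");
}
```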
Some folks use "crash" to mean "segfault", which is related to safety. This can get confusing :)
Segfaults are usually indicative of bad memory safety issues; ones which were not anticipated.
For example, the following two pieces of code do essentially the same thing on a null ptr, assuming the first one triggers a segfault:

int contents = *ptr;

if (ptr == NULL) { abort(); }
int contents = *ptr;
Also, segfaults don't just occur on null pointers; they occur whenever you dereference a pointer whose virtual address is not currently mapped into the process. A use-after-free, for example. In that case, there are two ways it can go -- it can either segfault (the best result!), or the pointer may by chance still point to memory owned by the process, just not the memory it's supposed to. This can lead to all kinds of badness.
In Rust, for example, you represent nullables with `Option<T>` (like Haskell's Maybe or Java's Optional). Rust forces you to deal with the `None` case whenever you wish to interact with the Option. You have the ability to assert that the Option contains something and obtain the inner value via `unwrap()` or `expect()`. These will panic (unwind, though this can also be set to abort; for simplicity, let's just assume it's an abort), which usually ends with the program or thread exiting. Now, runtime-wise, `let x = nullable.unwrap()` isn't really that different from `int contents = *ptr;` when the latter is null and segfaults. Both end up with the program exiting, and with much sadness on the part of the user.
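Concretely, the two roads through an `Option` look like this; the `match` forces the `None` case to be handled, while `unwrap` asserts it away:

```rust
fn describe(nullable: Option<i32>) -> String {
    // The type system forces us to acknowledge the None case.
    match nullable {
        Some(x) => format!("got {}", x),
        None => String::from("got nothing"),
    }
}

fn main() {
    assert_eq!(describe(Some(7)), "got 7");
    assert_eq!(describe(None), "got nothing");

    // The programmer can still assert "this is never None":
    let x = Some(7).unwrap(); // panics if the assumption is wrong
    assert_eq!(x, 7);
}
```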
However, the first one was anticipatory. With the Rust panic, the programmer said "okay, I believe that this can never be None here, and if not, something is horribly wrong, please abort". Something went horribly wrong, and it aborted. With the segfault, there was no explicit acknowledgement, and it was the kernel -- not the programmer -- intervening and shutting down the program. Like I already said, the kernel intervening is the best possible result here, other results are much worse.
So while effect-wise an explicit abort and a segfault are quite similar, implication-wise they're quite different. A lot of people thus distinguish between the two by using "crash" to refer to segfaults and other memory safety crashes exclusively. There's not much consistency here, so I thought it would be worth clarifying -- "'Safety' doesn't mean your program will never crash." is false if you define crash as segfault :)
A panic is a way of signalling "uh oh the program made an assumption that doesn't hold, there's something horribly wrong, please exit". A fatal exception.
Most languages, including Rust, will not let you encode all invariants of a program into the type system. Only languages that come bundled with proof checkers can do that, and that's tedious. So you will have situations where you have to assume an invariant to be true, and if it isn't, there's something horribly wrong and the program should exit. That's a panic.
Rust makes sure that wrong assumptions about invariants can't lead to memory unsafety. It can't magically make bugs go away.
 - http://rust-lang.github.io/book/second-edition/ch09-00-error...
I suppose it's possible that's what generates most instances of this question, but I had the same question when I was first introduced to Rust and it came from a different root.
I had been reading about Erlang, and from that context I expected "safety" to be about avoiding system downtime (e.g. gracefully handling failed asserts) as much as anything.
There will still be plenty of bugs to go around.
I suspect the only way to answer that for certain is to do it. I'd be vaguely interested in trying with Idris or similar.
Looking at the code from Reddit it seems like the important distinctions in this instance - between whitespace and not, between the raw input string and the input string with whitespace skipped, between a character that has been confirmed to lie between 0 and 7 and an unknown input character - ought to be trackable in Rust as well though.
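One way that last distinction could be tracked in Rust's type system is a newtype whose checked constructor is the only way to get a value (the names here are hypothetical, not the actual regex-syntax code):

```rust
/// A character already confirmed to lie in '0'..='7'.
/// The only way to construct one is via `new`, so any function that
/// takes an OctalDigit can rely on the invariant without re-checking.
#[derive(Debug, Clone, Copy, PartialEq)]
struct OctalDigit(u8);

impl OctalDigit {
    fn new(c: char) -> Option<OctalDigit> {
        if ('0'..='7').contains(&c) {
            Some(OctalDigit(c as u8 - b'0'))
        } else {
            None
        }
    }

    fn value(self) -> u8 {
        self.0
    }
}

fn main() {
    assert_eq!(OctalDigit::new('5').map(OctalDigit::value), Some(5));
    assert_eq!(OctalDigit::new('9'), None);
    println!("only confirmed octal digits can exist");
}
```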
> And if so, how much overhead is associated with said process?
In theory it should be zero overhead at runtime. In practice, of course, one would have to see how it turns out in benchmarks.
 Which, sure, would require explicitly choosing possible nontermination when running the regex, if that's the desired behaviour. At one job my team deliberately added a step counter and limit to a regex matcher (abort after 10000 or so) as that was better behaviour than running indefinitely.
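A sketch of that kind of step counter, on a deliberately tiny backtracking matcher (just literal bytes plus `c*`, nothing like a real engine) that gives up instead of running indefinitely:

```rust
/// Result of a budgeted match attempt.
#[derive(Debug, PartialEq)]
enum MatchResult {
    Matched(bool),
    BudgetExceeded,
}

/// Match `pat` (literals plus `c*`) against all of `text`, giving up
/// after `budget` recursive steps.
fn matches(pat: &[u8], text: &[u8], budget: &mut u32) -> MatchResult {
    if *budget == 0 {
        return MatchResult::BudgetExceeded;
    }
    *budget -= 1;
    match (pat.first(), pat.get(1)) {
        // Empty pattern matches only the empty remainder.
        (None, _) => MatchResult::Matched(text.is_empty()),
        // `c*`: try consuming zero or more copies of c (backtracking).
        (Some(&c), Some(&b'*')) => {
            let mut rest = text;
            loop {
                match matches(&pat[2..], rest, budget) {
                    MatchResult::Matched(true) => return MatchResult::Matched(true),
                    MatchResult::BudgetExceeded => return MatchResult::BudgetExceeded,
                    MatchResult::Matched(false) => {}
                }
                match rest.first() {
                    Some(&t) if t == c => rest = &rest[1..],
                    _ => return MatchResult::Matched(false),
                }
            }
        }
        // Plain literal byte.
        (Some(&c), _) => match text.first() {
            Some(&t) if t == c => matches(&pat[1..], &text[1..], budget),
            _ => MatchResult::Matched(false),
        },
    }
}

fn main() {
    let mut budget = 10_000;
    assert_eq!(matches(b"a*b", b"aaab", &mut budget), MatchResult::Matched(true));
    println!("matched within budget");
}
```

On a pathological pattern like `a*a*b` against a long run of `a`s with no `b`, the same function returns `BudgetExceeded` once the counter runs out, which is the "abort rather than spin" behaviour described above.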
Putting a limit on the regex engine won't fly. How big should that limit be? I use this particular regex engine to search files that are many GB in size. Some people might search bigger files. What should the limit be? Limits only work when you can restrict your use cases significantly, which is not a luxury I can enjoy.
The broader point I was trying to make with my question is that "invalid states aren't represented" isn't particularly interesting to me on its own. What's interesting to me is what it takes to achieve it. i.e., What do I have to give up to make it so? Developer time? Performance? Code complexity? Other practical things? Keep in mind that I'm after things that work, and not a hypothetically advanced-enough tool.
The limit just needs to be applied to backtracking (and any other extended regex features that lead to Turing completeness), not to input sizes or similar. I'm content to have a tool that errors if a given regex backtracks to the same point more than, say, 10000 times. I appreciate the principle of not restricting use cases, but in practice the overwhelming majority of the time that leads to the tool going into what's effectively a busy loop, which is not very user-friendly even if it's in some sense correct.
(If I had control over the whole system I would say that perl-style extended regexes are fundamentally the wrong abstraction - the whole original point of regular expressions was that they were expressions for regular languages that were efficiently implementable, and the rare use cases that require turing-complete parsing/matching are better addressed through languages that are designed as turing-complete. But often in practice one doesn't have that freedom).
> The broader point I was trying to make with my question is that "invalid states aren't represented" isn't particularly interesting to me on its own. What's interesting to me is what it takes to achieve it. i.e., What do I have to give up to make it so? Developer time? Performance? Code complexity? Other practical things? Keep in mind that I'm after things that work, and not a hypothetically advanced-enough tool.
It's interesting in so far as it affects defect rates rather than in and of itself, sure. My sense is that it's natural to avoid this kind of problem in a language that encourages you to represent this kind of distinction, and that the shift of style shouldn't carry a cost beyond that. I mean, in this example there was a big explanatory comment above the panicking line about why the unwraps were safe, which comes with its own cost (both for the initial developer and for anyone reading the code); to my mind, expressing that comment in a type system rather than prose would not have been substantially more expensive, and would have caught the issue. But of course that's a lot to claim without a concrete implementation showing these advantages in practice; most of the experience I'm drawing on is unfortunately private, and skepticism is quite warranted.
Also, FWIW, Rust's regex engine guarantees linear time search. (Interestingly, it may decide to use backtracking to resolve the search, but it's only an optimization. And the backtracking is always bounded so that it never revisits the same state on the same input byte.)
Ah, fair enough - I saw "extended" and assumed that meant backreferences and the like (which I think require backtracking?). If the search is guaranteed-linear-time in any case then why the requirement for a Turing-complete language?
Because suggestions of using Turing-incomplete languages seem somewhat uninteresting to me. To be honest, I have never used one in anger, but if you told me to go write a production grade regex engine in Agda or Coq, I'd say you're nuts. It sounds like a fine experiment to do given infinite time, but I want people to actually use what I build. And I have a very strong suspicion that using one of those languages would take me 10 times as long to write it, if it's even possible at all.
I try hard to blend the practical with the theory. I'm not aware of any Turing-incomplete language that comes close enough to the practical side of things. I hope that changes as PL research advances. :-)
For the most part, my problem with your initial comment on the matter is that it neglects everything except for the "type system could have stopped it" part. But in reality, there are a host of trade offs at play given the actual choices available to me. (Even if those trade offs may be fundamentally orthogonal to the notion of type safety, they still exist and they are still things I must confront.)
I feel like an Agda/Coq implementation would be dead in the water pretty quickly. How easy or hard is it to manipulate raw byte buffers in those languages? Or memory layout? That's why Rust is such a nice tool for a regex engine, because it gives you access to the hardware at the level of C without any fuss at all, while still remaining reasonably safe.
A long time ago, we had a discussion about ASTs and property based testing. I noted that my "AST" couldn't be round-tripped with the concrete syntax accurately, and you noted that it was therefore not an AST. I'm in the middle of making that happen, and I've tried a bit harder this time around to use the type system to avoid invalid states. I do still have one that I haven't been able to rule out though, which is related to an invariant on a sequence of operations applied to a stack: https://github.com/BurntSushi/regex/blob/syntax-rewrite/rege...
What if it took 3 times as long to write? I'm somewhat wildly guessing that would be unacceptable because from your perspective you don't see the benefits for it taking 3 times longer?
I'm interested in all the possible trade offs here, not just the type safety trade off. (That's why Rust is so appealing to me, because it nails a lot of my requirements while also giving a fairly high degree of safety.)
Idris is where my hopes would lie; while I've only played with it a little it doesn't feel like it takes a lot more work - indeed mostly it just behaves like Haskell.
> For the most part, my problem with your initial comment on the matter is that it neglects everything except for the "type system could have stopped it" part. But in reality, there are a host of trade offs at play given the actual choices available to me. (Even if those trade offs may be fundamentally orthogonal to the notion of type safety, they still exist and they are still things I must confront.)
Sure; I've got my own agenda/narrative and I'm talking about this bug because it aligns with that. (My belief that greater use of type systems is the best way forward for the industry is genuine and good-faith, but that doesn't mean I'm right)
> A long time ago, we had a discussion about ASTs and property based testing. I noted that my "AST" couldn't be round-tripped with the concrete syntax accurately, and you noted that it was therefore not an AST. I'm in the middle of making that happen, and I've tried a bit harder this time around to use the type system to avoid invalid states. I do still have one that I haven't been able to rule out though, which is related to an invariant on a sequence of operations applied to a stack: https://github.com/BurntSushi/regex/blob/syntax-rewrite/rege....
Glad to hear that.
The invariant comes from only ever adding alternations to the stack via push_or_add_alternation, right?
My first thought is to encapsulate the stack and the way alternations are added to / removed from it behind an opaque type. i.e. rather than Parser having direct access to the stack, it has a wrapped stack that it can push to (in a way that will always add alternations when necessary) and pop from (in a way that will provide it with the invariant - e.g. by returning an alternation-with-content or non-alternation). That doesn't offer any more proof, but practically it means the two halves of the invariant are in the same place and easier to review, and it's clear where the responsibility is.
Going further and looking to actually eliminate the unreachable, it seems like a stack with an alternation on top is treated differently both on push and pop (not just at the end, but also in pop_group)? Maybe those should be different types. Or maybe the representation isn't quite right - maybe it would be better for each alternate to be its own stack entry (i.e. the AST looks like Alternate(case1, Alternate(case2, Alternate(case3, ...)))). That way seems like a better separation of concerns - on encountering a '|' you just push an entry onto the stack rather than having to push-or-mutate, and then when you know the alternation has finished (because you hit ')' or the end) you peek and pop all the alternations that are currently at the top (possibly zero, which I think should actually simplify pop_group because there are no longer two different cases). In turn that suggests a representation where AstGroup always contains AstAlternation (possibly of only one alternate) and the root value is always AstAlternation too.
I don't know whether that would work - maybe there are constraints I haven't thought of - but that's the kind of direction I'd be thinking in. Where there's a conditional, see if it can be moved into data (an enum or similar); where the representation has invalid states, look for a representation that avoids them. I think to a certain extent this awkwardness reflects a true awkwardness in the regex grammar - you can't tell whether what you're parsing is part of an alternation or not until you hit the | - but to the extent that it's possible to parse regexes it should be possible to come up with a representation that corresponds to what you do when parsing, and I feel like there's a correspondence/duality between operator precedence and deciding that the children of a certain node must be a particular type of node (e.g. if we wanted to parse arithmetic expressions consisting of +, *, () and variables/literals, we could say that both arguments to + are products (possibly of 1 element)).
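A rough sketch of that nested-alternation shape (the type names here are invented for illustration, not the actual regex-syntax types):

```rust
// Each Alternate node holds one case plus the rest of the alternation,
// so "a|b" becomes Alternate(a, Single(b)) and parsing '|' is just a
// push, with no push-or-mutate step.
#[derive(Debug, PartialEq)]
enum Alt {
    Single(Ast),
    Alternate(Ast, Box<Alt>),
}

#[derive(Debug, PartialEq)]
enum Ast {
    Literal(char),
    // A group always contains an alternation, possibly of one case.
    Group(Box<Alt>),
}

fn main() {
    // "(a|b)" in this representation:
    let ast = Ast::Group(Box::new(Alt::Alternate(
        Ast::Literal('a'),
        Box::new(Alt::Single(Ast::Literal('b'))),
    )));
    println!("{:?}", ast);
}
```

The point of the shape is that an `Alt` with no cases simply can't be constructed, so the "empty alternation" invalid state is gone by construction.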
 Including at runtime; if you're looking to write a library that can be embedded into other languages then Rust is indeed probably still your best option. One of my very long-term projects is to extend Rust (or a Rust-like language) with Noether-like functionality, but that's purely theoretical at this point.
(Turing incompleteness is not too much of a restriction for them in practice.)
Not sure about production grade, but Google has plenty of results for "Agda regular expressions".