BTW, if you're like me and have wanted to target WASM bytecode directly for ages, without any of the heavy existing toolchains, but have never managed to put the separate pieces of the spec together yourself: the core maintainer of this library is co-authoring a book on exactly that[0]. I took the risk of trying the early-access version and it's really nice and accessible.
I'm mentioning it because it implements a toy language using Ohm to explain how WebAssembly works (gee wonder why). So it actually includes a mini-Ohm tutorial.
Funny enough, since WASM bytecode is designed to be compact, and speccing out parsers tends to be a more verbose business, the book technically ends up spending more words on implementing the parser than on WASM itself (all with the goal of developing a mental model for WASM though, which it succeeded at for me so this is not a critique).
I am jealous of kids these days learning the theory of parsing. There are so many great resources out there! Ohm in particular looks great, attention to detail, care for the theory. Makes me wish I had a project to try it out.
I am a big fan of PEG parsers. They do come with their own set of issues and difficulties, but I have always found them great to work with. My go-to tool (also a PEG parser, similar to Ohm) used to be pegjs, now https://peggyjs.org/
When I needed speed, or a more classical take, I would use jison. But I think today I would need a good reason not to build a hand-written parser.
I just wish someone would write a "how Marpa parsing works for dummies" article. I've been curious about how it actually works ever since I was colleagues with Aria Stewart, who ported it to JS once upon a time and mentioned its existence to me (we worked remotely and in opposite time zones, so I never had a chance to sit down with her for a good old in-person explanation).
Great to see this is alive and progressing! I believe Ohm started life in Alan Kay’s research group, to build a graphical OS and office suite in 10k lines of code. I found this talk immensely inspiring https://m.youtube.com/watch?v=ubaX1Smg6pY
Very close! Alex Warth created OMeta (https://en.wikipedia.org/wiki/OMeta) as part of the STEPS project. Ohm was designed as a kind of successor to OMeta, but was created after STEPS.
The main difference in Ohm grammars vs "standard" PEGs is support for left recursion. It's the same approach that was used in OMeta, which is described in the paper "Packrat parsers can support left recursion": https://web.cs.ucla.edu/~todd/research/pepm08.pdf
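To make the seed-growing idea from that paper concrete, here is a tiny self-contained JavaScript sketch (this is not Ohm's or the paper's actual code; `parseExpr` and the single-digit-number restriction are simplifications for the example). The rule is left-recursive, `expr <- expr "-" num / num`; a naive PEG interpreter would recurse on `expr` forever, so instead we seed the memo entry with failure and then re-run the rule repeatedly, letting the "recursive call" read the current best result out of the memo:

```javascript
function parseExpr(input) {
  const memo = new Map(); // position -> best result so far, or null
  // num matches a single digit (a simplification for the sketch)
  const num = (pos) =>
    /[0-9]/.test(input[pos] ?? '') ? { val: Number(input[pos]), end: pos + 1 } : null;
  function expr(pos) {
    if (memo.has(pos)) return memo.get(pos); // a recursive call would land here
    memo.set(pos, null); // the failure "seed" breaks the infinite loop
    let best = null;
    while (true) {
      let attempt = null;
      const lhs = memo.get(pos); // stands in for the recursive expr call
      if (lhs && input[lhs.end] === '-') {
        const rhs = num(lhs.end + 1);
        if (rhs) attempt = { val: lhs.val - rhs.val, end: rhs.end };
      }
      if (!attempt) attempt = num(pos); // fall through to the second alternative
      if (!attempt || (best && attempt.end <= best.end)) break; // seed stopped growing
      best = attempt;
      memo.set(pos, best);
    }
    return best;
  }
  return expr(0);
}
```

Running `parseExpr('9-2-3')` yields `{ val: 4, end: 5 }`, i.e. the left-associative reading (9-2)-3, which is exactly what left recursion buys you over the usual right-recursive PEG workaround.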
The other thing Ohm does that's different from many PEG parsing libraries is a strict separation of syntax and semantics. Many PEG tools mix these two concepts, so you write little snippets of code (semantic actions) alongside the rules.
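To illustrate the separation (with a hand-rolled tree and visitor, not Ohm's actual API): the parse produces only a tree, and each semantic operation is a separate table of functions you can add later without touching the grammar:

```javascript
// A parse tree as plain data -- the "syntax" side knows nothing about semantics.
const tree = { type: 'add', left: { type: 'num', value: 1 }, right: { type: 'num', value: 2 } };

// Generic dispatcher: look up the node's type in an operation table.
function apply(ops, node) {
  return ops[node.type](node, (child) => apply(ops, child));
}

// Two independent "semantic operations" over the same tree:
const evalOp = {
  add: (n, rec) => rec(n.left) + rec(n.right),
  num: (n) => n.value,
};
const printOp = {
  add: (n, rec) => `(${rec(n.left)} + ${rec(n.right)})`,
  num: (n) => String(n.value),
};

apply(evalOp, tree);  // 3
apply(printOp, tree); // "(1 + 2)"
```

The payoff is that one grammar can serve many purposes (evaluation, pretty-printing, type checking) without the rules themselves accumulating snippets of host-language code.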
The top thing I want to know about a parser/parser-generator library is: will it go fast? After a poor experience with Chevrotain startup performance I'm more open to parser generators than parser libraries; I'm very happy to pay for a build step if my compiler will boot up faster (important in the browser) and have better throughput.
Ohm is a PEG parser generator. There are theoretical algorithms for making PEG parsers run fast, and people have published some excellent papers on them, but I've never seen them deployed in real software. (Search for PEG automatic cut insertion.) I haven't tried Ohm, but nothing on its home page suggests that it's an exception to the general rule that PEG parsers are very slow.
Probably if what you most want is for your parser to go fast, you should write a predictive recursive descent parser by hand.
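For anyone who hasn't written one, here is a minimal sketch of a predictive recursive-descent parser for a toy arithmetic grammar (error handling omitted for brevity). "Predictive" means every decision is made by peeking at one token, so there is no backtracking and no memo table:

```javascript
function evaluate(src) {
  // Trivial tokenizer: numbers and the four punctuation characters we use.
  const tokens = src.match(/\d+|[+*()]/g) ?? [];
  let i = 0;
  const peek = () => tokens[i];
  const next = () => tokens[i++];

  function expr() {          // expr := term ('+' term)*
    let v = term();
    while (peek() === '+') { next(); v += term(); }
    return v;
  }
  function term() {          // term := atom ('*' atom)*
    let v = atom();
    while (peek() === '*') { next(); v *= atom(); }
    return v;
  }
  function atom() {          // atom := number | '(' expr ')'
    if (peek() === '(') { next(); const v = expr(); next(); return v; }
    return Number(next());
  }
  return expr();
}

evaluate('2+3*4');   // 14
evaluate('(2+3)*4'); // 20
```

Each grammar rule becomes one function and each loop consumes input strictly left to right, which is why this style tends to be hard to beat on raw speed.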
It's worth noting that Python 3 uses a PEG-based parser and its performance (when it was introduced at least) was within 10% of the old parser: https://peps.python.org/pep-0617/
In absolute terms, the numbers given in the PEP work out to about a thousand machine instructions per byte of input, for both the old and new parsers. From my point of view, that's pretty slow.
Note that this includes the time for tokenization, which is generally the bulk of the time spent in parsing, and which I believe did not change between the old recursive-descent parser and the new PEG parser.
Typically the tokenizer must iterate over some 32 characters per line of input code, which it reduces to some 8 tokens for the benefit of the parser proper, so usually the tokenizer is the bottleneck. However, the performance in this case seems too poor to attribute to the tokenizer.
That PEP also discusses some avoidable inefficiencies in the old parser that accounted for a substantial fraction of its runtime.
One of the biggest advantages of PEGs is that you don't need a separate tokenization pass. But that means you are incurring Packrat's inefficiency per character instead of per token. As I said above, I believe the CPython implementation does not do this.
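To make the cost model concrete, here is a sketch of the packrat idea (illustrative only; `makeParser` and the rule shape are inventions for this example). Each (rule, position) pair is memoized so unlimited backtracking stays linear-time, and in a scannerless PEG the position is a *character* index, so the memo table grows with characters rather than tokens:

```javascript
function makeParser(rules, input) {
  const memo = new Map(); // one memo table per input string, keyed by rule:position
  let hits = 0;
  function apply(name, pos) {
    const key = name + ':' + pos;
    if (memo.has(key)) { hits++; return memo.get(key); }
    const result = rules[name](input, pos, apply);
    memo.set(key, result);
    return result;
  }
  return { apply, hitCount: () => hits };
}

const rules = {
  // Each rule returns the end position on success, or -1 on failure.
  digit: (input, pos) => (/[0-9]/.test(input[pos] ?? '') ? pos + 1 : -1),
  sum: (input, pos, apply) => {
    // alternative 1: digit '+' digit
    const a = apply('digit', pos);
    if (a !== -1 && input[a] === '+') {
      const b = apply('digit', a + 1);
      if (b !== -1) return b;
    }
    // alternative 2: digit alone -- re-parses digit at pos, served from the memo
    return apply('digit', pos);
  },
};

const p = makeParser(rules, '5');
p.apply('sum', 0); // consumes the '5'; the backtracked digit lookup was a memo hit
```

The memo is what makes the backtracking in alternative 2 free, but every cell of it is paid for per character when there is no tokenizer in front.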
I've written and worked on PEG parser generators, and I think PEG parsing is the best option in a wide variety of circumstances. But it is rarely the fastest option and usually far from it.
Yes, you're right that PEG parsing is usually not the fastest option. As with many things, the operative question is: is it fast _enough_?
> Note that this includes the time for tokenization, which is generally the bulk of the time spent in parsing
Hmmm, I've never heard this before and I'm curious if you can point me to any benchmarks or other sources that demonstrate this. If it's true it's surprising to me!
In my line of work (writing JavaScript that will run on Android devices) there is no such thing as fast enough. If you’re not going as fast as possible the experience is bad. After all this is a platform where getting data from the local disk is frequently slower than the network because there’s so little CPU & disk bandwidth…
Also, if you're computing for 3 milliseconds for every touchmove event instead of 0.3 milliseconds, you'll run the battery down a lot faster. Though not actually ten times faster, because the screen keeps sucking power even when the CPU is asleep.
I can't! But if you have a conventional lex/yacc parser handy, it should be easy to test by measuring the time for just lexing a large input file and discarding the tokens. If you do the experiment, I'm interested to hear your results!
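For anyone wanting to run that experiment, a skeleton might look like the following (the tokenizer here is a toy stand-in; substitute your real lexer and parser, and prefer `performance.now()` in a browser for finer-grained timing):

```javascript
// Toy lexer: numbers, identifiers, single-character operators; skips whitespace.
function lex(src) {
  return src.match(/\d+|[A-Za-z_]\w*|[^\s\w]/g) ?? [];
}

// Roughly 8 tokens per line, echoing the ratio mentioned above.
const input = 'x = 1 + 2 * y;\n'.repeat(10000);

const t0 = Date.now();
const tokens = lex(input);
const lexMs = Date.now() - t0;

// Then time the full parse of the same input and compare:
//   const p0 = Date.now(); parse(input); const parseMs = Date.now() - p0;
//   console.log('lexing fraction:', lexMs / parseMs);
```

Discarding the tokens (as the parent suggests) isolates the lexing cost; just make sure the JIT can't dead-code-eliminate the lexer call, e.g. by keeping a reference to the result.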
I don't believe that software is ever fast enough, but trading off speed is often worthwhile, for example to get composability or readability.
Project             Parsing/Lexing Ratio
----------------------------------------
jdk8                               5.34x
Spring Framework                   3.81x
Elasticsearch                      5.76x
RxJava                             5.69x
JUnit4                             4.51x
Guava                              5.37x
Log4j                              2.92x
Obviously this is just one toolkit, one language, and one set of examples. But in these benchmarks, the tokenization (lexing) time is a small fraction of the total parse time.
Well, "small" is relative. It's certainly not the bottleneck, but it's still between 17.4% and 34.2% of the total time. That's definitely still in a range where optimizing it could have a measurable impact on overall performance (depending on how much room is left for optimization).
Right, but it's clearly not the situation I was describing where cutting the parse time to zero for the stages after the tokenization stage would only slightly decrease total time.
This looks very interesting, particularly the visualization tool.
One thing I think it sheds light on is the difference between a simple library and a framework. Many frameworks can be thought of as providing an interpreter that you configure using callbacks. Debugging a high-level program by stepping through the interpreter that's executing it isn't easy. Custom debugging tools make it easier.
In the past, I always had a distaste for 'stringy' data, always preferring more structure (e.g. objects). Once I learned about parsers, I realized you can still have structure in a string.
Looks quite nice. The online editor with color coding looks sleek. I suppose they are using a PEG parser to parse the grammar in the online editor.
Can you also specify color coding for your grammar?
This is a parser generator. In the online editor, even with the example grammar, you can see some delay between changing the grammar and the input being re-parsed. I wrote an interactive backtracking parser with caching that does not show such a delay, even with a somewhat larger grammar. Search for 'IParse Studio'.
Has anybody out there tried to convert JS to an AST and back and hit an OOM?! (I tried @babel/parser.) I pray this one doesn't disappoint.
Oh, and comments!!! My general observation is that you cannot capture the comments when code is generated from an AST. Although esprima has something in its standard to this effect, implementations generally got stuck on weird commenting styles and on how to generate code from them.
I have a rule that says: You are not allowed to use these tools until you have written a recursive descent parser yourself. Parsing manually is not that hard, but giving good errors while parsing automatically is not that easy.
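The error-reporting point deserves an illustration. In a hand-written parser, at any failure point you know exactly what you expected, so the message can say so (a minimal sketch for a comma-separated digit list like "1,2,3"; `parseList` is invented for this example):

```javascript
function parseList(src) {
  let pos = 0;
  function fail(expected) {
    const found = JSON.stringify(src[pos] ?? 'end of input');
    throw new SyntaxError(`Expected ${expected} at position ${pos}, found ${found}`);
  }
  function digit() {
    if (!/[0-9]/.test(src[pos] ?? '')) fail('a digit');
    return Number(src[pos++]);
  }
  const out = [digit()];          // list := digit (',' digit)*
  while (pos < src.length) {
    if (src[pos] !== ',') fail("','");
    pos++;
    out.push(digit());
  }
  return out;
}

parseList('1,2,3'); // [1, 2, 3]
// parseList('1,x') throws: Expected a digit at position 2, found "x"
```

Getting a message of this quality out of a generated parser typically requires extra machinery (expected-set tracking, error recovery rules), which is exactly the difficulty the parent is pointing at.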
Oh hell yeah. Stoked to see something like this in TS. I enjoy making a toy DSL here and there, but it's a lot of friction to relearn Lex and Yacc every time. The editor looks great too, really like the visualizations.
For anybody already using this, what's your workflow like for getting from this to IDE syntax highlighting?
I could swear I recently saw an article about a PEG tool that came with LSP support; sadly I can't find it. The closest I've come is pest, which has an ide-tools crate and some plugins for LSP/IDE/editor support:
I have been on the lookout for something that could help me with Jinja files, and this seems like a plausible candidate. Extra bonus points if I could actually embed the Python in the same grammar, but at this point even having the code blocks segregated from the literal parts would be a good start.
Side note, I really wish software would stop using EE terminology to name things. Ohm, Capacitor, Electron, etc. It muddies up search results for no good reason. Is there a reason this has become a trend lately?
Ohm is presumably called that because it's based on OMeta, which was called that because it was a successor to META-II that had inheritance, like OO languages.
More generally, I suspect software programmers have EE envy because electrical engineers build things that actually do work, and their designs get better over time.
PEGs are extremely difficult to debug when things go wrong because you get a MASSIVE trace (if you're lucky) to sift through. This tool has rewind/playback controls to dig into what the PEG rules are doing.
Previously, the best I'd seen was cpp-peglib's online parser, here: https://yhirose.github.io/cpp-peglib/
Edit: and now having read through comments here, Chevrotain's PEG tool looks really impressive, although it doesn't use a simple PEG syntax for the input: https://chevrotain.io/playground/