Packrat Parsing: Simple, Powerful, Lazy, Linear Time [pdf] (csail.mit.edu)
99 points by gjvc 12 days ago | 24 comments

Packrat Parsing: Simple, Powerful, Lazy, Linear Time, Unlimited Memory

But I still think it's a pretty good idea.

Yes, this. It's really common for naive implementations of packrat parsing to work fantastically well on short input strings (i.e., benchmarks) and then take literal minutes of "linear time" to parse, say, a 1 KB source file. Particularly if you do it (as is relatively common) in a scripting language with a naive memory-management policy, where creating and connecting vast numbers of small objects can be a performance problem.

It turns out that, in practice, it's better to be selective about where you apply the packrat mechanism.

In practice, IME, a good rule of thumb is that if you're programmatically generating any of the rules, then the grammar is probably too big for a packrat parser to be fast enough on realistic inputs. If not, and if the entire parser was written by a single person in a time-frame that makes sense for a parser, then things are probably going to be OK in all but the worst implementations. YMMV.

Memory usage is linear too. It’s bounded by length of input * number of rules in the grammar.
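To make that bound concrete, here's a toy packrat-style parser in Python (the grammar and names are made up for illustration, not taken from the paper): each (rule, position) pair is computed at most once and cached, which is what gives linear time, and the memo table is what grows as input length × rule count.

```python
def parse_arith(text):
    """Toy packrat parser for sums of single digits, e.g. "1+2+3".

    Every rule is memoized on its input position, so each (rule, pos)
    pair is evaluated at most once -- linear time, at the cost of up to
    len(text) * number_of_rules memo entries.
    """
    memo = {}  # (rule_name, pos) -> (value, next_pos) or None

    def rule(fn):
        def wrapped(pos):
            key = (fn.__name__, pos)
            if key not in memo:
                memo[key] = fn(pos)
            return memo[key]
        wrapped.__name__ = fn.__name__
        return wrapped

    @rule
    def digit(pos):
        if pos < len(text) and text[pos].isdigit():
            return (int(text[pos]), pos + 1)
        return None

    @rule
    def expr(pos):
        # expr <- digit '+' expr / digit
        left = digit(pos)
        if left is None:
            return None
        val, p = left
        if p < len(text) and text[p] == '+':
            rest = expr(p + 1)
            if rest is not None:
                rval, rp = rest
                return (val + rval, rp)
        return left

    result = expr(0)
    if result is None or result[1] != len(text):
        raise ValueError("parse error")
    return result[0]
```

With only two rules the table stays small, but a realistic grammar has dozens or hundreds of rules, and that multiplier is exactly the overhead people in this thread are complaining about.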

For example, 1 GB per character is linear... but it's still a hell of a lot.

The mathematical function doesn't matter as much as the practical overhead.

I mean... the clue's in the name, right?

(I worked on PEG and Packrat for my MEng.)

But packrat parsers don’t require nearly that much space in practice, right? Most languages I’ve worked with will have a relatively small number of production rules, requiring only a small number of results at each parsing position.

Languages with a small number of rules barely require parser tooling; ad-hoc hand-written recursive descent parsers don't add any dependencies, are trivial to write, and are known to run well if you have constant lookahead.
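As a rough illustration of how little machinery that takes, here's a hand-rolled recursive descent evaluator in Python with a single token of lookahead, for a hypothetical grammar (expr <- term (('+'|'-') term)*; term <- NUMBER):

```python
def evaluate(tokens):
    """Recursive descent with one token of lookahead; no parser
    framework, no memoization, just one function per nonterminal."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def next_tok():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def term():
        # term <- NUMBER
        return int(next_tok())

    def expr():
        # expr <- term (('+'|'-') term)*
        value = term()
        while peek() in ('+', '-'):
            op = next_tok()
            rhs = term()
            value = value + rhs if op == '+' else value - rhs
        return value

    result = expr()
    if pos != len(tokens):
        raise ValueError("trailing input")
    return result
```

Since the next token alone decides which branch to take, there is no backtracking and therefore nothing worth memoizing.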

I think in practice they're pretty chunky.

I've also had luck with storing results for only a very small subset of a grammar's productions.

Any parsing algorithm uses "unlimited memory" if you make the input big enough.

But, seriously, it's called "packrat parsing," because it saves all the things. You have to expect it's going to use a bit of memory.

I still use it for everything. Memory at that scale just isn't an issue anymore. Your grammar would have to be huge and your data structures highly unoptimized for it to matter all that much.

It is an interesting principle.

If I am reading the graphs on page 20 correctly, though, packrat parsing gets crushed by the competition when parsing programming languages (Java, in the example). It is only competitive for simple grammars.

Beyond that, I feel like recursive descent parsers are the PostgreSQL of parsing: they are not exciting, but they get the job done well without fail. The rare cases where you need something else require subtle justification.

If you have a language that does TCO, recursive descent parsers are fantastic to write. I wrote a JSON parser in Guile Scheme that was fast enough for my purposes in about 170 LOC.

It was extremely simple: no state variables; each function was its own state. The only place with any call-stack buildup was, of course, nested arrays/hashes, but Guile doesn't do stack overflows, just out-of-memory errors, so that was less of an issue.
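A rough Python analogue of that structure (not the Guile code, and only a toy JSON subset of numbers and arrays): no parser object or state variables, just one function per construct that takes a position and returns (value, new_position), so the call stack only grows with nesting depth.

```python
def parse_value(s, pos):
    # Dispatch on the next character; each construct is its own function.
    if s[pos] == '[':
        return parse_array(s, pos)
    return parse_number(s, pos)

def parse_number(s, pos):
    end = pos
    while end < len(s) and s[end].isdigit():
        end += 1
    return int(s[pos:end]), end

def parse_array(s, pos):
    pos += 1  # consume '['
    items = []
    while s[pos] != ']':
        value, pos = parse_value(s, pos)
        items.append(value)
        if s[pos] == ',':
            pos += 1
    return items, pos + 1  # consume ']'
```

In a language with proper tail calls, the loops here become tail-recursive helper functions and the only non-tail recursion left is the nesting itself, which matches the comment above.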

Pegged is an excellent Packrat parser library in the D language:


If you want to understand the compile-time techniques available in the D language that the Pegged library uses, this is a handy guide:


I use the esrap packrat parser in Common Lisp for computational chemistry file formats. I've written half a dozen parsers with it and had a few written for me by others. They are easy to write and easy to maintain - even when I don't look at them for a couple of years. I am happy to not have yacc and bison in my life anymore (no offense intended).

Tag title with 2002? It’s a great paper.

dammit. missed the edit window. will do so in future; thank you for the reminder.

For LL and LR languages, not for general CFGs. The abstract also mentions other classes but isn't specific; it's probably referring to other deterministic classes of CFG.

Conservation of genius.

Nice paper. Two columns, though. Why? In the 1980s, computer science conferences like STOC and FOCS would hand out telephone-book-sized proceedings, and it took two columns to squeeze it all in. That's not relevant now, but the two columns make it that much harder to read on a tablet. I don't generally wish ill on anyone, but it's hard not to wish that the editors responsible for this idiotic convention would just die or retire, in either order.

If you had said "phone" I would have some sympathy, but "tablet"? Seriously? You act like two columns is somehow ridiculous, but magazines and newspapers also all use multiple columns. You only see single columns on narrow formats like handheld books (most paperbacks). A tablet is large enough to show an entire page at once at a reasonable font size, and at that width you want two columns. If anything the technological restriction issue is that it was hard to make a two-column typewriter, so typed, instead of typeset, documents often have a humorously large single column. (Even on a phone it just isn't difficult enough to read that it deserves such hate.)

Consider using k2pdfopt or some such tool.

Also, this paper is from ICFP 2002 whose proceedings were published in print form. Several libraries hold it: https://www.worldcat.org/title/proceedings-of-the-seventh-ac...

Anyway, it appears that from 2017 onwards the ICFP proceedings are no longer published in print (they're published as part of PACMPL, an electronic journal), and indeed the ICFP 2020 papers, for example (https://dl.acm.org/toc/pacmpl/2020/4/ICFP), are not in two columns anymore. So everything seems reasonable, IMO.

Narrow columns are easier to mark up with a pen for editing and note-making purposes.

On the other hand, I have often wondered why this convention was used so much, especially when some types of content would never reasonably fit into a single column (and people didn't want to float a graphic in there).

What sort of tablet are you using in 2020 which doesn't have the pinch-to-zoom functionality?

"two columns to squeeze it all in" Just use a single column and half the pages then...
