
> the basic idea of which is instantly shot down by the fact that language grammars are just more complicated than counting individual characters

I'm sure someone overly excited is at this very moment trying to make this a mathematically precise statement so they can see if you're right, and if so, if there's a way to change the author's approach to support more complicated computations on the text. Maybe you can help them along if you're not just guessing.




There was a link to the Wikipedia article on the Dyck language [1] somewhere, which led me to the Chomsky–Schützenberger representation theorem [2], which says that every context-free language can be obtained by applying a homomorphism to the intersection of a Dyck language (possibly with multiple different kinds of parentheses) with a regular language.
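
In symbols (my paraphrase, not a quote from the article): for every context-free language L there exist some k, a regular language R, and a homomorphism h such that

    L = h(D_k \cap R)

where D_k is the Dyck language with k kinds of parentheses.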

The Dyck language can be recognized by a monoid as described. The regular language can be recognized by a monoid whose elements are functions between states of the recognizing automaton, each symbol being identified with the transitions the automaton performs when reading the symbol. The intersection should be describable by the product of monoids. (If such a thing exists, anyway. My abstract algebra is a bit rusty.)
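
Here is a rough sketch of what I have in mind (hypothetical Python, nothing from the article; the two-state DFA is a made-up stand-in for the regular language):

    # Sketch only (hypothetical Python, nothing from the article): a monoid of
    # DFA state-transition functions for "the regular language", paired with
    # the bracket-counting monoid via a product, so that chunks of text can be
    # summarized independently and combined associatively.
    from functools import reduce

    STATES = range(2)               # toy two-state DFA standing in for R

    def trans_of_char(c):
        # Transition function induced by one character, as a tuple mapping
        # each DFA state to its successor. (Made-up rule: backslash -> state 1.)
        return tuple(1 if c == "\\" else 0 for _ in STATES)

    TRANS_IDENTITY = tuple(STATES)  # identity function, for the empty chunk

    def trans_combine(f, g):
        # Monoid operation: reading chunk f then chunk g composes the functions.
        return tuple(g[f[s]] for s in STATES)

    def dyck_of_char(c):
        # Bracket-counting element: (unmatched ')' so far, unmatched '(' so far).
        return (0, 1) if c == "(" else (1, 0) if c == ")" else (0, 0)

    def dyck_combine(a, b):
        # Openings left over from a cancel closings at the start of b.
        matched = min(a[1], b[0])
        return (a[0] + b[0] - matched, a[1] + b[1] - matched)

    def of_char(c):
        # Product monoid: just track both components side by side.
        return (trans_of_char(c), dyck_of_char(c))

    def combine(x, y):
        return (trans_combine(x[0], y[0]), dyck_combine(x[1], y[1]))

    # Associativity is what makes this chunk/parallel friendly.
    summary = reduce(combine, map(of_char, "foo(bar\\baz)"),
                     (TRANS_IDENTITY, (0, 0)))
    print(summary)  # ((0, 0), (0, 0)): every start state ends in state 0; parens balance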

The hard part is dealing with the homomorphism that turns the intersection of the two languages into the context-free language. I'm guessing that the general construction might require turning parentheses into empty strings, which would destroy the simplicity of the monoidal construction.

Anyway, I'm unfortunately not excited enough to see whether I can get it to work, at least not right now. But maybe it helps point someone else in the right direction.

[1] https://en.wikipedia.org/wiki/Dyck_language [2] https://en.wikipedia.org/wiki/Chomsky%E2%80%93Sch%C3%BCtzenb...


To get the parentheses right, you have to parse the language.

There is an extensive body of literature on parsing that goes back decades. Most of it I don't think is that useful, but some of it is about parallel parsing, and if you are interested there are quite a number of people with something to say about it. However, the speed wins in practice are not very big.

On the other hand, if you just write the parser so that it's fast to begin with, you don't really have a problem. The language I am working on parses 2.5 million lines of code per second on a laptop, and I have only spent a couple of hours working on parser speed. To do this it does run in parallel, but in the obvious way, using ordinary data structures (one input file at a time as a distinct parallel unit), so it's not "parallel parsing" in the algorithmic sense.
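
To be concrete about the obvious way (this is just the shape of it, in Python rather than the actual implementation):

    # Sketch of the shape only (Python rather than the actual implementation):
    # one whole file per work item in an ordinary worker pool, rather than any
    # parallel-parsing algorithm.
    from concurrent.futures import ProcessPoolExecutor

    def parse_file(path):
        # Hypothetical stand-in for a fast single-threaded per-file parser.
        with open(path) as f:
            return path, f.read().count("\n")

    def parse_all(paths):
        with ProcessPoolExecutor() as pool:
            return dict(pool.map(parse_file, paths))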


Why do you need to parse the language to get parens correct? For most languages, comments and strings will need to be considered, but neither of these requires doing a full parse.
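
For example, a toy sketch like this (for a C-like syntax; real languages obviously have more cases) gets parentheses right without building any parse tree:

    # Toy sketch: check that parentheses balance in C-like source without
    # parsing, by scanning once and skipping string literals and comments.
    def parens_balanced(src):
        depth = 0
        i, n = 0, len(src)
        while i < n:
            c = src[i]
            if c == '"':                        # string literal
                i += 1
                while i < n and src[i] != '"':
                    i += 2 if src[i] == "\\" else 1
            elif src.startswith("//", i):       # line comment
                while i < n and src[i] != "\n":
                    i += 1
            elif src.startswith("/*", i):       # block comment
                end = src.find("*/", i + 2)
                i = n if end == -1 else end + 1
            elif c == "(":
                depth += 1
            elif c == ")":
                depth -= 1
                if depth < 0:
                    return False
            i += 1
        return depth == 0

    print(parens_balanced('f("(", /* ) */ x)'))  # True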

Of course, I don't disagree with your point that a fast parser makes the distinction here less useful. However, that number sounds interestingly large without context; do you have more info I can read about it?


Parsers often don't handle degenerate cases very well, but degenerate cases are quite common in text editing. (Think about how rarely work-in-progress code actually parses correctly.)

Lexers/tokenizers handle degenerate cases very well (after many years of being used for syntax highlighting in IDEs), and some parsers intended for IDE consumption are getting much better at dealing with them (the Roslyn family and TypeScript in my experience have some very interesting work in this area), but most parsers still have a long way to go, especially because many of the most common parser generators have never concerned themselves with degenerate cases.


You make a good point (not only is parsing unnecessary, lexing-only/"ad hoc" analysis can be more resilient for many of the tasks in an editor). However, your phrasing makes it sound like you're disagreeing with me, and I don't understand with which part of my comment. Could you expand?


Interesting. I'm not sure what tone you were seeing, other than that I realize "degenerate" has a very negative connotation; it's just the most apt technical term I can find.

I was neither agreeing nor disagreeing with your points, simply expanding "sideways", because I think the conversation about the usefulness of parsing in text editors and information processing gets derailed by the "degenerate" cases where things don't parse (and those are very important to text editors).

I think people often forget or underappreciate the lexing half of the lexer/parser divide. Yet the syntax highlighting engines in most of the text editors we use these days already hint that you can do a lot of user-meaningful things with "just" rudimentary, generic lexers.
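
To illustrate the kind of rudimentary, generic lexer I mean (a made-up toy, not any particular editor's engine):

    # Made-up illustration of a rudimentary, generic lexer of the kind a
    # syntax highlighter can get a lot of mileage out of: it never fails,
    # it just classifies every character into some token.
    import re

    TOKEN_RE = re.compile(r"""
        (?P<comment>\#[^\n]*)
      | (?P<string>"(?:\\.|[^"\\])*"?)          # tolerate unterminated strings
      | (?P<number>\d+(?:\.\d+)?)
      | (?P<word>[A-Za-z_]\w*)
      | (?P<space>\s+)
      | (?P<punct>.)
    """, re.VERBOSE)

    def lex(text):
        for m in TOKEN_RE.finditer(text):
            yield m.lastgroup, m.group()

    for kind, value in lex('total = count + 1  # running total'):
        print(kind, repr(value))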

As a continued aside: I felt I got really good results using a lexer as the basis for a character-based, somewhat "semantic" diff tool, but to date I've yet to see it come into general usage outside my prototype toy (http://github.com/WorldMaker/tokdiff).
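
The general idea, as a crude sketch (not tokdiff itself, just the shape of the approach):

    # Crude sketch of the general idea (not tokdiff itself): tokenize both
    # versions with a trivial lexer, then diff the token streams so changes
    # are reported per token rather than per line.
    import difflib
    import re

    def tokens(text):
        # Trivial lexer: words, runs of whitespace, and single punctuation.
        return re.findall(r"\w+|\s+|.", text)

    def token_diff(old, new):
        a, b = tokens(old), tokens(new)
        for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
            if op != "equal":
                yield op, "".join(a[i1:i2]), "".join(b[j1:j2])

    for change in token_diff("let total = a + b;", "let total = a + b + c;"):
        print(change)    # e.g. ('insert', '', ' + c')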



