

On the Speed of Light, Innovation, and the Future of Parsing - jng
http://www.codekana.com/blog/2009/04/02/on-the-speed-of-light-innovation-and-the-future-of-parsing/

======
Hexstream
And here's something I'd like to see in the future: A persistent OS-provided
facility for explaining the origin of configuration decisions.

Linux has most configuration in text files, which is already better than
binary files, but sometimes you have to do a lot of digging around in
documentation to determine what statements of what files determine what
configuration options.

In many ways, the links between configuration files are as opaque as if they
were dynamic (try determining the semantic links between disparate config
files using standard Unix tools). Yet it's clear that most configuration is in
fact static, so it would be great to be able to access this data (such as
dependencies between settings) easily, to aid automatic dependency management
and debugging.

It would be great if in most cases, when the actual configuration differs from
what you expect, you could ask the system why and it would tell you all the
specific statements of config files that affected the configuration of the
particular thing you're asking about (at various levels of granularity and
abstraction). You could then easily see what you need to change to have the
configuration you want.
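The "ask the system why" idea can be sketched in a few lines: a hypothetical
config loader that records, for every key, each file and line that assigned
it. The file names, keys, and `key = value` format below are all invented for
illustration; no real OS facility works this way.

```python
# Hypothetical sketch of configuration provenance: every assignment
# remembers the file and line that produced it, so the final value of
# any key can be explained. File names and keys are made up.

class TracedConfig:
    def __init__(self):
        self.values = {}   # key -> current (winning) value
        self.history = {}  # key -> list of (file, line_no, value)

    def load(self, filename, lines):
        for line_no, line in enumerate(lines, start=1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip()
            self.values[key] = value
            self.history.setdefault(key, []).append((filename, line_no, value))

    def why(self, key):
        """Every statement that affected this key, in application order."""
        return [f"{f}:{n}: {key} = {v}" for f, n, v in self.history.get(key, [])]

cfg = TracedConfig()
cfg.load("/etc/app.conf", ["timeout = 30", "mode = safe"])
cfg.load("~/.apprc", ["timeout = 60"])
print(cfg.why("timeout"))
# Shows both the system-wide default and the user override that won.
```

A real facility would need this at the OS level, across every format and
tool, which is exactly what makes the idea hard and interesting.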

------
Raplh
On the idea that problems which seem solved to some people are far from
solved, indeed. Of course.

I suspect, however, that the idea that there has ever been some kind of
consensus that all problems in software were solved, or all problems in
physics were solved, is BS. Fun BS, because it's always fun to ascribe a
single personality to a large group and then point out the foolish
inconsistencies or limitations in what "they" say, as though "they" were a
single person.

------
albertni
This was a nice read and I learned a thing or two, thanks.

Hopefully progress continues in areas like better parsing for more helpful
coding environments, because while it feels incredibly "impure", there are
times when I almost feel that having language rules (such as "End If", though
hopefully not as ugly) solely for the purpose of making my editor more helpful
would be a worthwhile tradeoff. Emphasis on "almost", but still.

Just as intelligent programming language design saves time, so do intelligent
syntax highlighting and other editor-provided assistance (and a non-negligible
amount at that).

------
abecedarius
The big new idea that's said to be uncharted territory reminds me of research
from the 90s, actually:

"We also exploit the tight integration between versioning and incremental
analysis to provide a novel history-sensitive approach to error handling. Our
error recovery mechanism reports problems in terms of the user's own changes
in a language-independent, non-correcting, automated, and fully incremental
manner."

[http://techreports.lib.berkeley.edu/accessPages/CSD-97-946?a...](http://techreports.lib.berkeley.edu/accessPages/CSD-97-946?abstract=)

(Excellent thesis, BTW.)

~~~
jng
I'm the author of the article. I was aware of the research you link to, and I
enjoyed it, but research is too often focused on the formal side and too
little on the user-experience side. I even mention it succinctly in the
article (not by name):

"I had a look at what was available on incremental parsing, and found a few
research initiatives, and even mentions of a few products. But it turns out
none of them was focused on the kind of problem I was trying to solve. They
were either trying to improve the performance [...] or they were trying to
study the formal mathematics of incremental parsing"

I mean to include other research initiatives in the above description.

Their technique is based on not accepting text edits until they parse
properly, a technique which I describe in the article and which I dismissed as
a hack, too basic for my goals.

My system is more sophisticated and powerful: I actually incorporate
partially-wrong stuff into the parse tree! My tool highlights an incomplete
'if' block with the correct if-color, because it knows what it's dealing with!

I can also apply heuristics to choose the preferred parse, which they don't.
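jng's implementation isn't public, but the idea of incorporating
partially-wrong constructs into the parse tree can be illustrated with a toy
non-correcting parser: an unterminated block still becomes a node, just
flagged as incomplete, so an editor could still identify and color it.

```python
# Toy sketch of error-tolerant parsing (not jng's actual code): '{'/'}'
# tokens are parsed into nested block nodes, and running out of input
# inside a block yields a node marked incomplete instead of an error.

def parse_blocks(tokens, i=0, depth=0):
    """Parse a flat token list into a tree, tolerating missing '}'."""
    children = []
    while i < len(tokens):
        tok = tokens[i]
        if tok == "{":
            node, i = parse_blocks(tokens, i + 1, depth + 1)
            children.append(node)
        elif tok == "}":
            return {"kind": "block", "complete": True, "children": children}, i + 1
        else:
            children.append(tok)
            i += 1
    if depth > 0:
        # Input ended inside a block: keep it in the tree, flag it.
        return {"kind": "block", "complete": False, "children": children}, i
    return {"kind": "root", "complete": True, "children": children}, i

tree, _ = parse_blocks(["if", "{", "stmt", "{", "inner"])
# Both unclosed blocks appear in the tree, each flagged incomplete,
# so a highlighter still knows the 'if' block's extent.
```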

There's plenty of work pending in incremental parsing, and that thesis just
skims the surface (even if the academic-looking presentation may make it seem
like it's going very deep).

~~~
scottdw2
I'd be interested in learning more about how you implemented this.

I recommended something similar to this for use as the parser for the VB.NET
IDE, back around the start of the VS 2008 development cycle. It ended up being
kind of expensive, because it required pushing changes all the way down to
the text representation. All the IDE schedule time for Orcas went into
supporting the LINQ features, so nobody was really interested in hearing about
expensive IDE changes (by the way, I'm not disputing the choice; from a
product perspective, it was the right one).

I was focused on making the parser faster and on minimizing rebinding of
declarations, rather than on improving intermediate states, but I can see how
it might be possible to make your approach work.

My idea was based on several things:

1) Incremental tokenization, using a specialized editor text representation

2) Knowledge of the amount of lexical and syntactic lookahead needed by the VB
grammar

3) An index from tokens to parse tree nodes

When an edit happened, the incremental scanner would move backwards a
constant number of tokens (based on some grammar properties) and then scan
forward, patching up the token stream, until the scanner produced a duplicate
token in the appropriate offset position.
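A toy sketch of that rescan-and-terminate step (not the actual VB.NET
scanner): tokens here are just whitespace-separated words, and `backup` stands
in for the grammar-derived lookahead constant.

```python
# Simplified incremental re-lexing: after an edit at edit_pos that
# changed the text length by delta, back up a constant number of
# tokens, rescan forward, and stop as soon as a rescanned token matches
# an old token at the expected (shifted) offset; the old tail is reused.

import re

def lex(text, start=0):
    """Toy lexer: whitespace-separated words as (offset, text) tokens."""
    return [(m.start() + start, m.group()) for m in re.finditer(r"\S+", text)]

def incremental_relex(old_tokens, new_text, edit_pos, delta, backup=2):
    # Find the first old token touching the edit, then back up `backup` tokens.
    first = len(old_tokens)
    for i, (off, txt) in enumerate(old_tokens):
        if off + len(txt) >= edit_pos:
            first = i
            break
    start_idx = max(0, first - backup)
    start_off = old_tokens[start_idx][0] if old_tokens else 0
    # Old tokens entirely after the edit, keyed by their shifted position.
    shifted = {(off + delta, txt): j
               for j, (off, txt) in enumerate(old_tokens) if off >= edit_pos}
    fresh = []
    for tok in lex(new_text[start_off:], start=start_off):
        if tok in shifted:
            # Duplicate token at the expected offset: reuse the old tail.
            tail = [(off + delta, txt) for off, txt in old_tokens[shifted[tok]:]]
            return old_tokens[:start_idx] + fresh + tail
        fresh.append(tok)
    return old_tokens[:start_idx] + fresh

old = lex("a bb ccc dddd")
new_tokens = incremental_relex(old, "a bXb ccc dddd", edit_pos=3, delta=1)
# Only 'a' and 'bXb' were re-lexed; 'ccc' and 'dddd' were reused, shifted.
```

A real lexer also has multi-character token classes and states, which is why
the backup distance has to come from properties of the actual grammar.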

The parse tree index would then be used to figure out where to restart
parsing, again based on some properties of the grammar.

My focus was on producing the same trees as the existing parser, and on
minimizing changes to declarations (because of all the rebinding that needs to
be done). So my incremental termination condition was based on entering a
method (in the top-down, hand-written parser) that would generate a node of
the same type as the "current position" (offset based on the edit), with the
same parent and the same starting token (based on object identity) as in the
existing parse tree. This all worked because the scanner was incremental,
which required the editor to structure its data in a certain way.

I would imagine that you probably used something similar to this, but with
different termination conditions.

I'd really like to see how similar (or not) what you did is to the scheme I
came up with.

~~~
JabavuAdams
I'm kind of working on something similar.

One thing that struck me about C++ is that although template parsing, etc, is
pretty hard, you can get a long way by segmenting the source into delimited
regions. This might also speed up your semantic analysis, by allowing you to
write code that's operating on a known kind of region.

For instance, after pre-processing, it should be easy to find the nested "{",
"[", "(", '"', "'", and "<" regions. Segmenting the source in this way first
might make
further processing easier.
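A rough sketch of such a segmentation pass, assuming preprocessed input and
handling only braces, brackets, parens, and string/char literals ("<" is
deliberately left out, since disambiguating templates from less-than is the
hard part in C++):

```python
# Segment source text into a tree of delimiter-bounded regions.
# String and char literals are skipped whole, so delimiters inside
# them don't open or close regions. A sketch, not a C++ lexer.

OPEN, CLOSE = "{[(", "}])"
MATCH = {"{": "}", "[": "]", "(": ")"}

def segment(src):
    root = {"delim": None, "parts": []}
    stack = [root]
    i = 0
    while i < len(src):
        c = src[i]
        if c in "\"'":
            # Skip a string/char literal, honoring backslash escapes.
            j = i + 1
            while j < len(src) and src[j] != c:
                j += 2 if src[j] == "\\" else 1
            stack[-1]["parts"].append(src[i:j + 1])
            i = j + 1
        elif c in OPEN:
            region = {"delim": c, "parts": []}
            stack[-1]["parts"].append(region)
            stack.append(region)
            i += 1
        elif c in CLOSE:
            if len(stack) > 1 and MATCH[stack[-1]["delim"]] == c:
                stack.pop()
            i += 1
        else:
            stack[-1]["parts"].append(c)
            i += 1
    return root

tree = segment('f(a, "x)y") { g[i]; }')
# The ')' inside the string literal does not close the paren region.
```

Later passes can then work region by region, knowing what kind of construct
each region can possibly contain.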

It seems that the standard parsing methods we learn in CS are optimized for
single-pass speed, but not for maintainability or development efficiency.
Computers have gotten fast. Using multiple passes of simpler operators seems
like a promising approach.

I see the process of parsing as some kind of folding, where you start with a
linear sequence of chars and gradually fold it over and over into a "lumpy"
tree.

------
antirez
A simpler thing that can be exploited is indentation:

    
    
        while(1) {
          if(foo) {
             if (bar) {
               ... I'm editing here!
          }
        }
    

It's clear from the indentation that the mismatch is about if(bar). Of course,
in order to get the level of the open bracket you need to get the index of the
first non-blank character of the same line. It's a trick, but in practice it
may work well.

I'd like GCC to be able to give me a hint about this instead of pointing
to the end of the file... that is, if an unmatched bracket is detected and
there is something odd about the indentation levels of the brackets, it could
show:

Warning: unclosed bracket may be around line ...

It's even simple to tell from the source whether it's a program where
indentation is usually respected or not.
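The heuristic can be sketched directly: track the indentation of the line each
"{" appears on, and when a "}" shows up at shallower indentation than the
innermost open bracket, flag that opener as the likely culprit.

```python
# Indentation heuristic for locating a missing '}'. It ignores strings
# and comments, so it's only a hint, exactly as described above.

def suspect_unclosed(source):
    """Return line numbers of '{' brackets that were likely never closed."""
    stack = []      # (line_no, indent) for each open '{'
    suspects = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        indent = len(line) - len(line.lstrip())
        for c in line:
            if c == "{":
                stack.append((line_no, indent))
            elif c == "}":
                # A closer at shallower indentation than the innermost
                # opener suggests that opener was never closed.
                while stack and stack[-1][1] > indent:
                    suspects.append(stack.pop()[0])
                if stack:
                    stack.pop()
    suspects.extend(ln for ln, _ in stack)  # still open at end of file
    return suspects

src = "while(1) {\n  if(foo) {\n     if (bar) {\n       x();\n  }\n}\n"
print(suspect_unclosed(src))
# → [3]: the 'if (bar)' line opened the bracket that was never closed.
```

On the example above it points straight at the if(bar) line, instead of the
end of the file.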

~~~
jng
I plan to add indentation support to my system in the future... it's a very
interesting area: gathering information from all available sources and using
that to help the user.

I do believe using the history of edits, as the article describes, has a much,
much greater potential.

------
pshc
In the next century, I can only hope, people will look back and laugh about
how we used to store source code in textual form, and marvel at the idea of a
"syntax error."

------
liuliu
I love this guy's ViEmu and definitely will pay for his new toy codekana.

------
antirez
Btw, every time I read "we are just at the beginning, and we already think we
know all about that" I get a strange feeling. Basically, nobody can tell. The
contrary has happened plenty of times: think about AI, which basically stopped
producing significant results for decades. The only way to know is to invest
more work in the field.

And if we want to remember the advice in "You and Your Research", what really
matters is to recognize, in time, the fields where it's worth spending time
and investigating more.

~~~
jng
Maybe it's obvious just to me? There _will_ be full AI. We _will_ create
artificial brains. We _will_ fully understand how the parallel system in the
brain works. We _will_ create brains more powerful than ours.

I don't think it's that far, either.

~~~
bdr
The only claim I strongly disagree with is that "we will fully understand how
the parallel system in the brain works". How can you be sure? If
understanding is made of abstraction, what if the brain is computationally
irreducible? Our understanding will increase, of course, but at some point the
"explanation" of some component will be equivalent to describing all the
neurons in that component and their connections.

~~~
jng
Obviously we're discussing faith and intuition here, as none of us has the
facts. I don't think there is anything computationally irreducible in the
brain.

About some component not being expressible as the result of simpler
computations: that has to happen at some point, by definition. But that point
is probably within the reach of our understanding.

All in all, I don't think there is much magic, just lack of knowledge on our
part. Of course, it's just my intuition.

