
Rope Science – Computer science concepts behind the Xi editor - krat0sprakhar
https://github.com/google/xi-editor/tree/master/doc/rope_science
======
arximboldi
Interesting! I find it amazing that they went all the way to implementing a
CRDT to support efficient plugins.

At the moment I am also writing a text editor with my partner, to showcase
the C++ RRB-tree implementations in Immer [1]. We have just started, but my
plan is to stop at 1000k lines. Interestingly, with such a persistent data
structure you already go a long way in implementing undo and parallel
processing, and many editing algorithms are super simple thanks to
slicing/concat. Memory consumption is still sub-optimal (also, I am using
wchar_t to simplify my life), but I am very satisfied with the results so far
(it is the fastest editor I have on my machine when editing a 1GB file, and it
can even edit a ~100MB file on a Raspberry --- larger files fail only due to
excessive memory use :/).

[1] [https://github.com/arximboldi/immer](https://github.com/arximboldi/immer)

~~~
geezerjay
Pardon my ignorance, but what's a CRDT?

~~~
oconnor663
Here's the relevant section:
[https://github.com/google/xi-editor/blob/master/doc/rope_science/rope_science_09.md#a-crdt-approach-to-async-plugins-and-undo](https://github.com/google/xi-editor/blob/master/doc/rope_science/rope_science_09.md#a-crdt-approach-to-async-plugins-and-undo)

------
gbrown_
I love to geek out on this sort of stuff. Data Structures for Text Sequences
by Charles Crowley [1] is a great read. I would also recommend checking out
the Vis editor [2]. It's an interesting Vi-like editor that uses the piece
chain as its data structure and supports Sam's structural regular expressions.

[1]
[https://www.cs.unm.edu/~crowley/papers/sds.pdf](https://www.cs.unm.edu/~crowley/papers/sds.pdf)

[2] [https://github.com/martanne/vis](https://github.com/martanne/vis)

------
pcwalton
For those interested in reading more about this idea, the term is "monoid
cached trees". The Fenwick tree is a particularly well-known special case.

(I looked into monoid cached trees for float placement in CSS at one point in
an effort to come up with an O(n log n) algorithm. I succeeded, but the
constant factor was so high that it wasn't worth the price. I ended up
switching to an algorithm based on splay trees that, while O(n^2) in the worst
case, ended up being O(n) with a small constant factor on real-world pages.)
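As a concrete illustration of the caching idea, here is a minimal Fenwick tree sketch in Python (illustrative only, unrelated to xi's or pcwalton's actual code): partial sums are cached per node so that prefix sums and point updates both run in O(log n).

```python
# Minimal Fenwick (binary indexed) tree: caches partial sums of an
# array so prefix sums and point updates both take O(log n).
class Fenwick:
    def __init__(self, n):
        self.tree = [0] * (n + 1)  # 1-indexed internally

    def update(self, i, delta):
        """Add delta to element i (0-indexed)."""
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & (-i)  # climb to the next responsible node

    def prefix_sum(self, i):
        """Sum of elements 0..i inclusive."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)  # drop the lowest set bit
        return total

f = Fenwick(8)
for idx, val in enumerate([3, 1, 4, 1, 5, 9, 2, 6]):
    f.update(idx, val)
print(f.prefix_sum(3))  # 3 + 1 + 4 + 1 = 9
```

The same shape generalizes from sums to any associative combining operation, which is what the "monoid cached tree" framing captures.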

------
cscheid
Reading the fourth entry on parenthesis matching made me wonder whether one
could store, in the monoid, partial views into the table that is generated by
CYK parsing:
[https://en.wikipedia.org/wiki/CYK_algorithm](https://en.wikipedia.org/wiki/CYK_algorithm)

I love the idea of using monoids like they're described in the blog series,
but the examples suggest that there's a certain amount of non-generalizable
cleverness that goes into defining each monoid. Could you do CYK subtables
inside the monoid, so that people can define arbitrary CF grammars, as long as
they're in Chomsky normal form?

~~~
modeless
I've been wondering lately if it would be possible to make an entire compiler
stack incremental, so that the binary changes on disk as I type. I am
positively sick of waiting tens of seconds or even minutes for the compiler to
redo all the work it's already done thousands of times just to make a one-byte
change to my binary.

~~~
cscheid
Presumably, moving to an infrastructure like this (everything is incremental)
is the biggest difference between old compilers and new compilers, because of
the now-ubiquitous IDE. I think I remember watching a talk by Anders Hejlsberg
about this.

------
barrkel
I first discovered ropes - aka cords - from Boehm's library, alongside the
conservative collector.

IIRC, Internet Explorer used a binary tree to represent its strings, at least
at version 5 in the early 2000s, because of the inefficiency of doing lots of
copying for string operations - looped concatenation was one of the primary
drivers. That doesn't mean it went all the way to ropes, of course.

------
jblow
Please read this with a grain of salt as it does not seem practical or
necessary. It seems like the kind of thing written by a young person who is
excited but doesn't really have much experience. Most of the ideas would not
be real-world-useful as stated.

Excitement is nice to feel, but it takes some experience to know when
excitement is really aimed in a productive direction. Otherwise we end up with
the kind of motivation that so often produces over-complex and mis-aimed
software: having a "cool idea" for "exciting technology" and then looking for
places to apply it, and the applications don't really fit or don't really
work, but we don't want to notice that, so we don't.

To pull examples: an entire one of these essays is on "paren matching" and how
it would be really great if you monoidized (ugh) and parallelized that ... the
basic idea of which is instantly shot down by the fact that language grammars
are just more complicated than counting individual characters. Hey bro, what
if there is a big comment in the middle of your file that has some parens in
it? The author didn't even think of this, and relegates this to a comment at
the end of that particular essay: "Jonathan Tomer pointed out that real
parsing is much more interesting than just paren matching." Which is a short
way of saying "this entire essay is not going to work so you probably
shouldn't read it, but I won't tell you that until the bottom of the page, and
even then I will only slyly allude to that fact." Which in itself is
contemptuous of the reader -- it is the kind of thing that happens when you
are excited enough about your ideas that the question of whether they are
correct is eclipsed. This leads to bad work.

There's the essay about the scrollbar -- if you have a 100k-line text file, do
you really want a really long line somewhere in the middle to cause the
scrollbar to be narrow and tweaky in the shorter, well-behaved majority of the
file? No, you probably don't! But this shoots down the idea that you might
want to do a big parallel thing to figure out line length, so he declines to
think about it. In reality what you probably want is the scrollbar to be sized
based on a smooth sliding window that is slightly bigger than what appears on
the screen (but not too much).

Besides which, computers are SO FAST that if you just program them in a
straightforward way, and don't do any of the modern software engineering stuff
that makes programs slow, then your editor is going to react instantly for all
reasonable editing tasks.

I don't want to be overly critical and negative -- these sorts of thoughts
are fine if they are your private notes and you are thinking about technical
problems and asking friends for feedback. It becomes different when you post
them to Hacker News and/or the rest of the internet, because this contains an
implicit claim that these are worth many readers' time. But in order to be
worth many readers' time, much more thought would have had to go in ... and as
a result, the ideas would have changed substantially from what they are now.

I didn't read past essay 4, so if it gets more applicable to reality after
that I don't know!

~~~
dkarl
_the basic idea of which is instantly shot down by the fact that language
grammars are just more complicated than counting individual characters_

I'm sure someone overly excited is at this very moment trying to make this a
mathematically precise statement so they can see if you're right, and if so,
if there's a way to change the author's approach to support more complicated
computations on the text. Maybe you can help them along if you're not just
guessing.

~~~
jblow
To get the parentheses right, you have to parse the language.

There is an extensive body of literature on parsing that goes back decades.
Most of it I don't think is that useful. But some of it is about parallel
parsing. If you are interested, there are quite a number of people with
something to say about it. However, the speed wins in practice are not very
big.

On the other hand, if you just write the parser so that it's fast to begin
with, you don't really have a problem. The language I am working on parses 2.5
million lines of code per second on a laptop, and I have only spent a couple
of hours working on parser speed. To do this it does go in parallel, but it
goes parallel in the obvious way using ordinary data structures (1 input file
at a time as a distinct parallel unit). So it's not "parallel parsing" in the
algorithmic sense.

~~~
dbaupp
Why do you need to parse the language to get parens correct? For most
languages, comments and strings will need to be considered, but neither of
these requires doing a full parse.

Of course, I don't disagree with your point that a fast parser makes the
distinction here less useful. However, that number sounds interestingly large
without context; do you have more info I can read about it?

~~~
WorldMaker
Parsers don't often handle degenerate cases very well, but degenerate cases
are quite common in text editing. (Think about how often work in progress code
might actually parse correctly.)

Lexers/tokenizers handle degenerate cases very well (after many years of being
used for syntax highlighting in IDEs), and some parsers intended for IDE
consumption are getting much better at dealing with degenerate cases (the
Roslyn family and Typescript in my experience have some very interesting work
put into this area), but most parsers still have a long way to go.
(Especially, because many of the most common parser-generators themselves have
never bothered to concern themselves with degenerate cases.)

~~~
dbaupp
You make a good point (not only is parsing unnecessary, lexing-only/"ad hoc"
analysis can be more resilient for many of the tasks in an editor). However,
your phrasing makes it sound like you're disagreeing with me, and I don't
understand with which part of my comment. Could you expand?

~~~
WorldMaker
Interesting. I'm not sure what tone you were seeing, other than I realize
"degenerate" has a very negative tone, but is the most apt technical term I
can find.

I was neither agreeing nor disagreeing with your points, simply expanding
"sideways", because I think the conversation about the usefulness of parsing
to general usage in text editors and information processing gets derailed by
the "degenerate" cases where things don't parse (because those are very
important to text editors).

I think people often forget or underappreciate the lexing half of the
lexer/parser divide. Yet syntax highlighting engines in most of the text
editors we use these days already hint that you can do a lot of user
meaningful things with "just" rudimentary, generic lexers.

As a continued aside: I felt I got really good results using a lexer as the
basis for a character-based somewhat "semantic" diff tool, but still to date
I've yet to really see it come into general usage outside my prototype toy
([http://github.com/WorldMaker/tokdiff](http://github.com/WorldMaker/tokdiff)).

------
amirouche
Is there any comprehensive documentation on the rope datastructure?

~~~
nirvdrum
If you're talking about ropes in general, Boehm et al. [1] is the
authoritative source. There's a Wikipedia page, too, but I find it way more
confusing than it needs to be.

If you're interested in applications to other domains, we use ropes as the
basis of the String data type in TruffleRuby. TruffleRuby is open source, so
you can see that implementation by checking out the code in the
org.truffleruby.core.rope package [2]. We had to extend the basic idea of
ropes to better match Ruby's semantics, such as making them encoding-aware. I
gave a talk about it at last year's RubyKaigi [3] that dives into real world
trade-offs.

There are also a lot of various rope implementations out there. You probably
can find one for your language of choice.

[1] -
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.9450&rep=rep1&type=pdf)

[2] -
[https://github.com/graalvm/truffleruby](https://github.com/graalvm/truffleruby)

[3] -
[https://www.youtube.com/watch?v=UQnxukip368](https://www.youtube.com/watch?v=UQnxukip368)
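For a quick feel of the structure itself, here is a toy rope sketch in Python: a binary tree of string leaves in which each concatenation node caches the length of its left subtree, so indexing never copies strings. This is only an illustration, far simpler than Boehm's or TruffleRuby's real implementations (no balancing, no encoding awareness).

```python
# Toy rope: concatenation nodes cache the left subtree's length
# ("weight"), so indexing descends in O(depth) without copying strings.
class Leaf:
    def __init__(self, s):
        self.s = s
    def __len__(self):
        return len(self.s)
    def index(self, i):
        return self.s[i]

class Concat:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.weight = len(left)            # cached: length of left subtree
        self.length = len(left) + len(right)
    def __len__(self):
        return self.length
    def index(self, i):
        if i < self.weight:
            return self.left.index(i)      # descend left
        return self.right.index(i - self.weight)  # adjust and descend right

rope = Concat(Concat(Leaf("hello, "), Leaf("rope ")), Leaf("world"))
print("".join(rope.index(i) for i in range(len(rope))))  # hello, rope world
```

Concatenation is O(1) here (just allocate a new node); a production rope additionally rebalances and splits oversized leaves.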

------
asrp
This write up is really good! Are there similar write-ups for things other
than text editors?

I wonder whether wanting only a printable-character ASCII editor would
simplify things a lot or only a little. And I guess no tabs.

> Part 2 Line breaking

I don't really understand the problem here. Can't we count the line breaks
like anything else? Is it because that's not the values we want in the end?

> Part 4: Again, making this into a monoid is pretty easy. You store two
> copies of the (t, m) pair - one for the simple case, and one for the case
> where the beginning of the string is in a comment. You also keep two bits to
> keep track of whether the string ends or begins a comment. In principle, you
> have to do the computation twice for both cases, whether the first line is a
> comment or not, but in practice it doesn’t make the computation any more
> expensive: you compute (t, m) for the first line and for the rest of the
> string, and just store both the first value and the monoid sum.

What if a node of the rope contains an "end comment" and (later) a "("? What
should the two pairs of (t, m) be? Now that substring might be entirely
inside, outside or partially outside and inside a comment.

Although I do understand the general idea of computing the result for all
possible initial/input states to achieve parallelism.
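That general idea is easiest to see in the simplest case, plain paren matching without comments: summarize each chunk as a pair (unmatched closers, unmatched openers). Combining two summaries is associative, which is exactly what lets a rope cache one summary per node and merge them in parallel. A hedged Python sketch (my reconstruction, not the series' actual code):

```python
# Paren-matching monoid: each text chunk reduces to a pair
# (unmatched ')', unmatched '('); combining summaries is associative,
# so chunks can be summarized independently and merged in any grouping.
def summarize(chunk):
    close, open_ = 0, 0
    for ch in chunk:
        if ch == '(':
            open_ += 1
        elif ch == ')':
            if open_ > 0:
                open_ -= 1   # matches an earlier '(' in this chunk
            else:
                close += 1   # unmatched closer escapes to the left
    return (close, open_)

def combine(a, b):
    """Monoid operation: a's openers absorb b's closers."""
    matched = min(a[1], b[0])
    return (a[0] + b[0] - matched, a[1] + b[1] - matched)

# Splitting the text anywhere gives the same answer as a full scan:
left, right = summarize("(()("), summarize("))()")
print(combine(left, right))  # (0, 0) -> fully balanced
```

The comment-aware version in Part 4 is this same trick, except that each node stores the summary twice (once per possible "am I inside a comment?" input state) plus the bits describing how the chunk changes that state.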

------
beefsack
For those interested in following the project or participating in discussions,
there's a subreddit over at
[https://www.reddit.com/r/xi_editor/](https://www.reddit.com/r/xi_editor/)

~~~
cmyr
For anyone interested in contributing, there is also (as of yesterday) #xi on
irc.mozilla.org.

------
pklausler
The text editor that I use for everything is one that I wrote myself a decade
ago in 5K lines of C, based on gap buffers. Save the "advanced computer
science" for the problems that need it.

~~~
nikolay
[http://scienceblogs.com/goodmath/2009/02/18/gap-buffers-or-why-bother-with-1/](http://scienceblogs.com/goodmath/2009/02/18/gap-buffers-or-why-bother-with-1/)

------
vardump
Would it be possible to add in-memory LZ4 compression in xi-editor? For those
huge log, XML, CSV, etc. files?

Maybe it'd still be possible to maintain good response time while enjoying
4-10x memory savings.

~~~
raphlinus
It would be possible. I'm thinking of adding "pack/unpack" operations to the
Leaf data structure, but it's more motivated by getting a varint encoding for
line breaks; right now each break is a 64 bit integer (on 64 bit builds), so
if you have a file consisting of empty lines, the corresponding line break data
structure is 8 times bigger than the text. With the varint encoding I have in
mind, it would be bounded at 1/4.

That said, I'm very skeptical that lz4 on text would be worth it.
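The varint idea here is the standard base-128 scheme (as in LEB128/protobuf): store each break as a delta from the previous one, 7 payload bits per byte plus a continuation bit, so typical short lines cost one byte instead of eight. A sketch under those assumptions (my illustration, not xi's actual format):

```python
# LEB128-style varint: 7 payload bits per byte, high bit = "more bytes".
# Encoding line-break positions as deltas keeps most values tiny.
def encode_varint(n):
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)

def encode_breaks(offsets):
    prev, out = 0, bytearray()
    for off in offsets:
        out += encode_varint(off - prev)  # delta from previous break
        prev = off
    return bytes(out)

breaks = [10, 25, 27, 300]
enc = encode_breaks(breaks)
print(len(enc), "bytes vs", 8 * len(breaks), "as u64s")  # 5 bytes vs 32
```

For a file of empty lines every delta is 1, so the break structure shrinks from 8x the text size to 1x; the 1/4 bound mentioned above presumably comes from additional packing in the real design.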

------
erikb
I opened two of the rope documents and I don't even get the problems they try
to solve. How can I decide whether these problems are mine as well?

Sure my text editors aren't perfect, but they mostly get the job done, so any
editor coming along needs to show that it tries to solve a problem that the
user has. I'm not yet convinced this one does, so I probably will never find
out what makes it brilliant.

~~~
staticassertion
Given a long string it will make insertion into the middle of text fast. Have
you ever opened a file in Atom, added one character, and the whole thing locks
up? That's the use case this solves, among others.

~~~
fjdlwlv
No editor besides Atom has this problem. Do they all use ropes?

~~~
staticassertion
I just attempted to open a 2MB file in gedit and it took about 25 seconds to
load enough to click into some text. Attempting to edit it did not go well and
the process effectively hung.

I opened the file in IntelliJ and it's performing admirably, but there is
noticeable lag when entering input. Keep in mind that features were disabled
automatically due to the file size.

2MB may not be very representative for all files, but when having to jump into
generated protobuf class files I've certainly wished my IDE could handle the
load. I can't tell you what data structures these IDEs are using internally
though.

And, again, the fact that features had to be disabled in IntelliJ really says
a lot. With more power, you can have more features, you can get feedback
faster in your IDE, you can have more linters, more plugins, etc. That's a
huge benefit to having fast software. Performance is _enabling_.

------
anodin3
In case you're wondering what's going on, this is the list of posts:
[https://github.com/google/xi-editor/tree/master/doc/rope_science](https://github.com/google/xi-editor/tree/master/doc/rope_science)

