
Text Editor: Data Structures - LaSombra
http://www.averylaird.com/programming/editor/2017/09/30/the-piece-table/
======
raphlinus
I use ropes in xi editor. I did not find the argument against ropes
convincing. Yes, they're not trivial to implement, but in a proper programming
language you're not dealing with the data structure directly, you're always
going through the interface, so you get the logic right once in the rope
library implementation and then forget about it.

In a low-level design, your editing operations would be poking at the data
structure directly. There, the simplicity of a gap buffer is a pretty big win.
I agree in this environment ropes are too complicated. However, I don't see
any good reason to architect a text editor in this way. Use abstractions.
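A minimal illustration of that layering (all names here are mine, not xi's): editing code talks only to an abstract buffer interface, so swapping a gap buffer for a rope is a change to one implementation, not to every editing operation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical interface: editing operations never touch the
// underlying structure directly.
class TextBuffer {
public:
    virtual ~TextBuffer() = default;
    virtual void insert(std::size_t pos, const std::string& s) = 0;
    virtual void erase(std::size_t pos, std::size_t len) = 0;
    virtual char at(std::size_t pos) const = 0;
    virtual std::size_t size() const = 0;
};

// One possible backend: a plain contiguous string. A rope-backed
// implementation would expose exactly the same interface, and the
// editing logic above it would not change.
class StringBuffer : public TextBuffer {
    std::string data_;
public:
    void insert(std::size_t pos, const std::string& s) override { data_.insert(pos, s); }
    void erase(std::size_t pos, std::size_t len) override { data_.erase(pos, len); }
    char at(std::size_t pos) const override { return data_[pos]; }
    std::size_t size() const override { return data_.size(); }
};
```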

The linked article contains a factual error: the referenced Crowley paper does
not consider ropes, so it cannot be cited in support of the claim that piece
tables outperform ropes.

There's one other important concern with piece tables I didn't see addressed.
It depends on the file contents on disk not changing. If your file system
supported locking or the ability to get a read-only snapshot, this would be
fine, but in practice most don't. It's very common, say, to check out a
different git branch while the file is open in the editor. Thus, the editor
_must_ store its own copy to avoid corruption. In the long term, I would like
to see this solved by offering read-only access to files, but that's a deeper
change that can be made piecewise.

~~~
Someone
_”I agree in this environment ropes are too complicated. However, I don't see
any good reason to architect a text editor in this way. Use abstractions.”_

Abstractions can cost you dearly. If your abstraction for indexing into your
buffer moves from O(1) with a small constant to O(log(n)) with a larger one,
that global replace using regular expressions can get a lot slower. Even a
simple page down may get noticeably slow when at the end of a large file with
long lines.

~~~
amelius
Then you've chosen the wrong abstraction.

~~~
rootlocus
Maybe I've misinterpreted the discussion, but there's not always a perfect
abstraction [1].

1\. [https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/](https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/)

------
userbinator
_The worst way to store and manipulate text is to use an array. Firstly, the
entire file must be loaded into the array first, which raises issues with time
and memory. Even worse still, every insertion and deletion requires each
element in the array to be moved. There are more downsides, but already this
method is clearly not practical. The array can be dismissed as an option
rather quickly._

The authors of the dozens of tiny editors using this "structure", which were
quite popular on the PC in the late 80s through early 90s, would disagree. A
memcpy()/memmove() runs at many GB/s on a typical machine today, so you would
have to be editing absolutely _huge_ files to notice. Even back then, memory
bandwidth was a few MB/s --- still plenty fast, considering that the typical
files of the time were also much smaller.

I can remember a time when I was attempting to write my own editor, at first
spending a lot of time obsessing over the data structures (it was harder to
find such information at the time) --- only to realise that a lot of the
editors I'd tried --- including the one I was using the most at the time ---
were working perfectly well with just one big buffer.

I've opened files of a few hundred MB in Windows' Notepad, which also belongs
to this family of editors; and on a machine a few years old, opening the file
takes the longest because it has to be read into memory --- once it's opened,
moving around and editing lines doesn't show much lag at all. "Worse is
better", indeed.
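For the curious, the "one big array" approach really is this small (a sketch, not any particular editor's code): insertion shifts the tail right with memmove(), which is exactly the operation that runs at memory bandwidth.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// The entire file lives in one flat byte array; every edit is a
// memmove() of the tail plus a memcpy() of the new text.
struct FlatBuffer {
    std::vector<char> bytes;

    void insert(std::size_t pos, const char* text, std::size_t len) {
        bytes.resize(bytes.size() + len);
        std::memmove(bytes.data() + pos + len,   // shift tail right
                     bytes.data() + pos,
                     bytes.size() - len - pos);
        std::memcpy(bytes.data() + pos, text, len);
    }

    void erase(std::size_t pos, std::size_t len) {
        std::memmove(bytes.data() + pos,         // shift tail left
                     bytes.data() + pos + len,
                     bytes.size() - pos - len);
        bytes.resize(bytes.size() - len);
    }

    std::string str() const { return std::string(bytes.begin(), bytes.end()); }
};
```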

~~~
mpweiher
Emphatically: Yes! KISS.

Moby Dick, at its slender 752 pages, is 1.2MB of text. You can save the entire
text to disk on every keystroke on just about any system today and keep up
with typing just fine.

Assuming you are actually dealing with text in a _text editor_, you should be
fine.

If you have 100MB+ files, chances are they aren't actually text.

~~~
weavie
Log files..

~~~
mpweiher
Fair point, but do you actually _edit_ log files in a text editor?

Because if you're just viewing, the "array of characters" representation is
going to beat just about any other hands down, especially if you just mmap()
the whole thing.

~~~
weavie
True. Generally no, but sometimes I will insert several empty lines around a
particular line of interest..

------
Johnny_Brahms
Y'all should have a look at ewig, which uses Immer - really fast persistent
data structures for C++. Ewig is just a proof of concept, but the code is
really nice. I have thought about taking that and building something bigger.

The upsides are the same as for a piece table (really simple undo/redo) but
with the downside of not being able to just mmap a file. You also get
basically zero memory usage when you do cut and paste (you can paste a file
into itself until it is bigger than RAM without problems, since you are
actually not copying the contents, just the pointer)

Look at the YouTube video as well. It is all very cool, at least if you are
not already spoiled by using clojure :)

[https://github.com/arximboldi/immer/blob/master/README.rst](https://github.com/arximboldi/immer/blob/master/README.rst)

Edit: Forgot to mention: Ewig can be found among Immer's author Arximboldi's
repos. On my phone right now on GPRS connection, so maybe another friendly
soul can provide the link.

~~~
arximboldi
Thanks a lot for mentioning it! :-)

Ewig uses RRB-Trees (Relaxed Radix Balanced Trees), which, like ropes, are
confluent (support fast concatenation) but have very stable bounds otherwise,
similar to a vector-like type.

EDIT: The parent mentions a video; I guess it is the CppNow talk:
[https://www.youtube.com/watch?v=ZsryQp0UAC8](https://www.youtube.com/watch?v=ZsryQp0UAC8)
Last week I did another version of that talk at CppCon (with slightly deeper
coverage of Ewig) but I don't think it is on YouTube yet.

~~~
Johnny_Brahms
I think immer is brilliant. I just saw that you have guile bindings, which
makes it even more exciting!

I like that they are lgpl as well. I am a believer in that kind of freedom
definition, and I hope you can dual-license it successfully.

------
maxbrunsfeld
Atom uses a piece-table-inspired data structure to represent text[0]. We store
the file's original contents in a single contiguous buffer. All of the user's
unsaved edits are then stored in a separate mutable structure called a Patch,
which we represent as a splay tree. When we need to read from the buffer in a
background thread, we can 'freeze' the patch, and store any subsequent writes
in a new patch which is layered on top of the previous one.

Even though we do load the entire file into memory (as opposed to mmap-ing the
file), the piece table design is still very useful. It makes it very cheap to
compute the buffer's current unsaved change set, which is a value that we
periodically serialize for crash recovery purposes (similar to vim's `.swp`
files).

It's also just a very compact way to store a large chunk of text, which is
good from a cache-locality and memory usage perspective.

[0]
[https://github.com/atom/superstring/blob/master/src/core/text-buffer.h](https://github.com/atom/superstring/blob/master/src/core/text-buffer.h)
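A stripped-down sketch of the general idea (this is not Atom's actual Patch/splay-tree code, and deletion is omitted): the original contents stay immutable, typed text goes into an append-only "add" buffer, and the document is a list of pieces pointing into one buffer or the other. The unsaved change set is then simply the pieces that point into the add buffer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal piece-table sketch: insert() splits the piece containing
// the target position and points a new piece at the appended text.
class PieceTable {
    std::string original_, add_;
    struct Piece { bool in_add; std::size_t start, len; };
    std::vector<Piece> pieces_;

public:
    explicit PieceTable(std::string original) : original_(std::move(original)) {
        if (!original_.empty())
            pieces_.push_back({false, 0, original_.size()});
    }

    void insert(std::size_t pos, const std::string& text) {
        std::size_t add_start = add_.size();
        add_.append(text);
        std::vector<Piece> out;
        std::size_t offset = 0;
        bool placed = false;
        for (const Piece& p : pieces_) {
            if (!placed && pos >= offset && pos <= offset + p.len) {
                std::size_t head = pos - offset;
                if (head > 0) out.push_back({p.in_add, p.start, head});
                out.push_back({true, add_start, text.size()});
                if (head < p.len) out.push_back({p.in_add, p.start + head, p.len - head});
                placed = true;
            } else {
                out.push_back(p);
            }
            offset += p.len;
        }
        if (!placed) out.push_back({true, add_start, text.size()});
        pieces_ = std::move(out);
    }

    // Reading walks the piece list, pulling from whichever buffer
    // each piece points into.
    std::string text() const {
        std::string s;
        for (const Piece& p : pieces_)
            s.append(p.in_add ? add_ : original_, p.start, p.len);
        return s;
    }
};
```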

------
colanderman
I've been doing the same thing (implementing a text editor for fun) and
settled on a data structure which is a hybrid rope+2-3 tree+B+tree+gap buffer.
I don't know what to call it but maybe someone here has seen it before.

Basically the idea is, the inner nodes are a 2-3 tree which track the size of
each child like a rope does. The leaves are gap buffers of a fixed size (a few
cache lines), which, when full, split into two like leaves of a B+tree. So you
get the dense-ish packing of a gap buffer with the O(log n) performance
guarantee of a 2-3 tree, while avoiding the potential linear copies associated
with ropes and gap buffers.

(I know this is overkill for a toy text editor but it wouldn't be a side
project if it weren't!)
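One ingredient of the hybrid can be sketched in isolation (the 2-3 tree of sizes above it is omitted, and the capacity is an arbitrary choice): a fixed-capacity gap-buffer leaf that splits in two when full, the way a B+tree leaf would.

```cpp
#include <cassert>
#include <string>

// A gap-buffer leaf of fixed capacity. Text lives in buf[0, gap_start)
// and buf[gap_end, CAP); the hole between is the gap.
struct Leaf {
    static constexpr std::size_t CAP = 64;   // "a few cache lines"
    char buf[CAP];
    std::size_t gap_start = 0, gap_end = CAP;

    std::size_t size() const { return CAP - (gap_end - gap_start); }
    bool full() const { return gap_start == gap_end; }

    void move_gap(std::size_t pos) {         // pos in [0, size()]
        while (gap_start > pos) buf[--gap_end] = buf[--gap_start];
        while (gap_start < pos) buf[gap_start++] = buf[gap_end++];
    }

    void insert(std::size_t pos, char c) {   // caller ensures !full()
        move_gap(pos);
        buf[gap_start++] = c;
    }

    // On overflow, move the upper half of the text into a fresh leaf,
    // as a B+tree leaf split would.
    Leaf split() {
        Leaf right;
        std::size_t half = size() / 2;
        move_gap(size());                    // park the gap at the end
        for (std::size_t i = half; i < gap_start; i++)
            right.buf[right.gap_start++] = buf[i];
        gap_start = half;
        gap_end = CAP;
        return right;
    }

    std::string text() const {
        return std::string(buf, gap_start)
             + std::string(buf + gap_end, CAP - gap_end);
    }
};
```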

~~~
audidude
I did something like this recently too. A b+tree w/ linked leaves containing
piece tables in the leaf nodes. To avoid the memmove() costs it uses the
right edge of the leaf to keep a sorting queue, so the operations are still
O(1) inside the leaf.

I still need to finish up deletions, but early performance numbers are showing
similar/better than the RBTree approach while being much more
cacheline/allocation friendly.

Ultimately, I want to use this in a replacement for GtkTextBuffer/GtkTextView
in gtk 4.x.

[https://github.com/chergert/pieceplustree](https://github.com/chergert/pieceplustree)

------
abainbridge
Beyond just loading, editing and saving text, how do these data structures
compare when it comes to implementing other "tricky" features, such as:

1\. Multi-line regular expressions. Typically the regex library is given an
array of text to search in (\*). We have no such buffer to give.

2\. Line wrap. Adding or removing a single character near the top of the file
might require the entire file to be re-wrapped. You need to complete the
wrapping process before you know how many lines are in the buffer and
therefore can update the size and position of the vertical scroll bar's
handle.

3\. Column mode editing of text containing tabs.

Beyond those features, these data structures (gap buffer and piece table)
don't seem well suited to operations that affect the entire buffer, such as
convert-tabs-to-spaces.

\* At least when I used PCRE, it seemed to require this.

~~~
jdmichal
1\. All of them could probably easily create a bi-directional or markable
stream of characters, which should be enough for the regex engine to work
with.

2\. That's a display issue. I would suspect that line-wrapping would want its
own data structure within the display system.

3\. I think this would fall under the same note as (2).

Basically, a layered approach like (display + editing mode) / (buffer) /
(array or ropes or pieces or whatever). Because the editing mode would effect
the edits being made to the buffer, which would then be translated into
whatever underlying data structure is actually storing the file.
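Point (1) above can be sketched concretely (names are hypothetical, and pieces are assumed non-empty): a bidirectional, "markable" character cursor over a list of pieces, which is the kind of stream a regex engine could consume without the text ever being joined into one array.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A bidirectional character cursor over a sequence of text pieces.
// Position is a (piece index, offset) pair; pieces must be non-empty.
class PieceCursor {
    const std::vector<std::string>* pieces_;
    std::size_t piece_ = 0, offset_ = 0;

public:
    explicit PieceCursor(const std::vector<std::string>& pieces)
        : pieces_(&pieces) {}

    bool at_end() const { return piece_ == pieces_->size(); }
    char get() const { return (*pieces_)[piece_][offset_]; }

    void next() {
        if (++offset_ == (*pieces_)[piece_].size()) { piece_++; offset_ = 0; }
    }
    void prev() {
        if (offset_ == 0) { piece_--; offset_ = (*pieces_)[piece_].size(); }
        offset_--;
    }

    // "Markable": the position is a plain value, so saving and
    // restoring it (for regex backtracking) is just a copy.
    PieceCursor mark() const { return *this; }
};
```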

~~~
abainbridge
Reasonable points. As for (1), I couldn't find how to get PCRE to work with a
stream like that. But that's either my fault or PCRE, not the text editor data
structure.

As for 2, I'd just say that I think these are much harder than the problem
that the linked article discusses. There have been other articles on the same
topic on Hacker News recently. They all seem to focus on solving the trivial
problem and ignoring the difficult ones. I mean, a std::list<std::string> is
fine for the main data structure if you just want to load, edit and save and
don't require (2) or (3).

~~~
jdmichal
I've also never seen a PCRE that _actually_ works with such a stream, though
it would be _possible_ to construct one.

For your remaining points, sure. I don't know enough about text processing to
know where the hard problems are. I do think that engineers sometimes overlook
the possibility of having multiple data structures for multiple use cases for
a single set of data, because it doesn't seem as elegant as one magical data
structure that does it all.

------
sleepychu
Had an assignment in functional programming which was to implement a text
editor, an interesting data structure was provided.

Buffer of X is a BackwardsList of X, Cursor X, ForwardsList of X

So if you make Line a Buffer of char and your document a Buffer of Line you
can easily insert and remove characters on a line or lines in the document.

If you wanted to move the cursor forwards you could just place the cursor onto
the head of the backwards list behind you and take the head off the forwards
list in front of you to be your cursor.
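The same structure translates directly out of the functional setting. Here is a character-level sketch in C++ (the nested Buffer-of-Lines version works the same way, one level up), with the cursor sitting between the two stacks:

```cpp
#include <cassert>
#include <string>
#include <vector>

// A list zipper for text: characters left of the cursor in one stack,
// characters right of it in another, each with its top nearest the cursor.
struct Zipper {
    std::vector<char> before;  // text left of the cursor
    std::vector<char> after;   // text right of the cursor

    void insert(char c) { before.push_back(c); }
    void backspace()    { if (!before.empty()) before.pop_back(); }

    // Moving the cursor pops from one stack and pushes on the other.
    void left()  { if (!before.empty()) { after.push_back(before.back()); before.pop_back(); } }
    void right() { if (!after.empty())  { before.push_back(after.back()); after.pop_back(); } }

    std::string text() const {
        std::string s(before.begin(), before.end());
        s.append(after.rbegin(), after.rend());  // after is stored reversed
        return s;
    }
};
```

Insertion and deletion at the cursor are O(1), and cursor motion is O(1) per step, which is why the structure is such a pleasant fit for an editor exercise.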

~~~
icen
Such a structure is called a Zipper, for anyone wanting to look into it
further.

~~~
eru
In this case it's a Zipper for a list. You can write zippers for other data
structures as well (and in fact you can mechanically derive them, even).

------
nebabyte
> They should! Why aren’t they!?!?! Somebody needs to make that a markdown
> extension. Every time you want to insert an indexed footnote, you type [^#]

Probably because there's no elegant way to support all the potential edge
cases of that default behavior. For instance, if you introduced a reference
with that format, you can't refer to it later - if you use the static number
and it changes, you're now pointing to the wrong reference. So it reduces to
using names, and the hassle of coming up with a name for each new link or
footnote reduces to just using numbers - so the only thing that needs to be
supported is manually named or numbered items.

The author correctly describes the ideal solution - a plugin that replaces
unnamed links before saving or such - but likely fails to understand why that
(as opposed to adding behavioral cruft to a markup language) is the correct
level of abstraction for such a solution. Imagine if a project like wikipedia
was riddled with the ambiguity of dozens of people's various attempts to
wrangle the autonumbering to their writing.

~~~
zeveb
Interestingly (and IMHO unsurprisingly) org-mode does this correctly[0].
Footnotes are identified by 'fn:' and a unique token (of course, numbers are
unique tokens); they can be easily inserted with C-c C-x f, and can also be
renumbered — including normalisation.

I'm reminded of this discussion from a week and a half ago:
[https://news.ycombinator.com/item?id=15321850](https://news.ycombinator.com/item?id=15321850)

IMHO org-mode is, as the original author puts it, 'one of the most reasonable
markup languages to use for text.' I suggest that rather than trying to
improve Markdown, folks just use org-mode instead.

[0]
[http://orgmode.org/manual/Footnotes.html](http://orgmode.org/manual/Footnotes.html)

~~~
thomastjeffery
I still think that org-mode and Markdown have quite different targets.

org-mode seeks to be organized and manipulable, whereas Markdown seeks to be
readable, and parsable.

------
azinman2
I had a technical interview once where the interviewer asked me to create a
data structure for a text editor written for a 1980s computer with extremely
slow and expensive I/O and almost no ram. I ended up with a variant of the
piece table and afterwards he told me that’s how Word worked (he used to work
at Microsoft).

~~~
cobalt
That reminds me of this tutorial for writing a text editor (in win32 land):
[http://www.catch22.net/tuts/piece-chains](http://www.catch22.net/tuts/piece-chains)

------
tlb
I know the conventional wisdom is that you can't use an array, because O(n).
How big a file can you insert a character into with that O(n) operation
before it takes more than one refresh time (16 ms)? From the following test
program, the file can be 150 MB of text (MBP 2015). I've never had a source
file, even a generated one, this long. So I claim that this wisdom is
obsolete.

    
    
      #include <chrono>
      #include <cstdio>
      #include <cstdlib>
      #include <string>
      using namespace std;
      
      int main(int argc, char **argv)
      {
        size_t size = (size_t)atol(argv[1]);
        string editor_buffer(size, 'x');
      
        auto start = chrono::steady_clock::now();
        for (int iter=0; iter<100; iter++) {
          editor_buffer.insert(5, "foo");
        }
        auto elapsed = chrono::duration<double, milli>(
            chrono::steady_clock::now() - start);
        printf("%f ms per insert\n", elapsed.count() / 100);
      }

------
emmelaich
More abstractly, I enjoyed a description of a text editor in a Categories for
Computer Science class I took waay back.

It's two stacks head to head. Moving the cursor is popping off one and
pushing on the other. All the other operations were given too. There were the
category diagrams and everything -- the meat of which I've forgotten but I
might be able to find if anyone is interested.

~~~
mcguire
I'd be interested. Using two stacks is a nifty technique.

~~~
z0r
That technique sounds analogous to using a zipper to traverse a list (I've
found a decent looking link for the technique if you're interested -
[https://ferd.ca/yet-another-article-on-zippers.html](https://ferd.ca/yet-another-article-on-zippers.html))

------
catpolice
One thing I think about a lot in these situations is how constraints that come
in at the high level can end up influencing your low level design. I've spent
the last couple months writing a codemirror replacement for an in-house DSL
IDE (in Javascript).

My particular use case required that I parse whatever was written in the
editor in a few ways: a lexer pass for syntax highlighting and a full parse
for semantic feedback (much of which is provided somewhat unpredictably from
AJAX calls, for various reasons). And I knew that touching the DOM (which I'd
have to do to keep the syntax highlighting/semantic feedback up to date) is
typically going to be a lot slower than doing a few thousand loop comparisons
in JS.

I looked into structures like ropes and whatnot that would enable fast edits,
but I realized that in the end I wasn't going to do better than linear in the
worst case, for a few reasons, e.g.: 1\. Suppose the user enters an open quote
at the first character: every other character just flipped from outside a
quote to inside or vice versa and I'm going to have to re-parse the whole
document to update my semantic analysis (and syntax highlighting, though that
would only require a partial re-parse). 2\. Edits that change the line number
associated with errors are going to require me to update that display, and you
can make a document with a number of errors that grows linearly, so entering a
newline at the first character was potentially linear...

So I ended up just using a doubly linked list of tokens (as returned by the
lexer) and re-lexing tiny sections around a cursor when the user enters text.
Collectively the whole thing turns out linear, but it saves a ton of work
by reusing the same structure when I'm doing parsing and semantic analysis.
Doing it this way let me do my own DOM reconciliation and update the absolute
minimum number of DOM nodes for every edit, which ultimately has a huge effect
on performance because touching the DOM is expensive. And one can set this up
so that (at least from the lexer's perspective) undo involves just snipping
the old middle segments of the list back into place.

So it turned out that higher level requirements (eventually having to actually
parse the text and update the display in various ways) made it so I wouldn't
really save any work optimizing at the low level.

------
z3t4
The most costly operation is usually showing the text on the screen, so you
want to optimize the data structure for that. Although it depends on what
level you are programming on. Updating arrays is _not slow_ (unless the file
is huge) and can be done in parallel.

------
thomastjeffery
The other day, I was implementing a Piece Table in Haskell, and realized a
"Piece" is a "Monoid", and a "Piece Table" (which I call a "Buffer") is just a
list of "Pieces".

    
    
        -- A Piece is a String, start, and end.
        data Piece a = Piece { list :: [a] -- 'a' will be 'Char' later, but we can support any type here.
                             , start :: Int
                             , size :: Int
                             } deriving (Show)
    
        -- Pieces are Monoids, like Lists, Trees, etc.
        --  note that the type declarations here are not allowed (Monoid defines them already). I put them there for your leisure.
        instance Monoid (Piece a) where
            -- mempty is an empty Piece.
            mempty :: Piece a
            mempty = Piece mempty 0 0
            -- mconcat takes a list of Pieces, and condenses them into one Piece.
            mconcat :: [Piece a] -> Piece a
            mconcat = foldl mappend mempty
            -- mappend applies the next Piece's edit to the first: keep what
            -- comes before nextStart, splice in nextList, and drop the
            -- nextSize replaced elements.
            mappend :: Piece a -> Piece a -> Piece a
            mappend (Piece firstList firstStart firstSize)
                    (Piece nextList nextStart nextSize) =
                        let start = take nextStart firstList
                            middle = nextList
                            end = drop (nextStart + nextSize) firstList
                        in Piece (concat [start, middle, end]) firstStart firstSize
    
        type Buffer = [Piece Char]
    

Now we can implement some pure manipulations:

    
    
        -- replace creates a new Piece, and appends it to the Buffer.
        replace :: String -> Int -> Int -> Buffer -> Buffer
        replace text from to buffer = buffer ++ [(Piece text from to)]
    
        insert :: String -> Int -> Buffer -> Buffer
        insert text at = replace text at 0
    
        delete :: Int -> Int -> Buffer -> Buffer
        delete from to = replace "" from to
    

...and all that is left is to get the text from the buffer

    
    
        bufferText :: Buffer -> String
        -- mconcat folds our Pieces together, resulting in one Piece
        -- list gets the Piece's "list" (String = [Char] in Haskell)
        bufferText = list . mconcat
    

Now to use it, all we need to do is string our manipulations together:

    
    
        >>> buffer = replace "Hello" 0 7
                   $ insert ", world!" 7
                   $ insert "Goodbye" 0
                   $ mempty
    
        >>> bufferText buffer
        "Hello, world!"

~~~
thomastjeffery
One advantage that is obvious now is that "Char" isn't specified until we
define "Buffer", meaning a "Piece" can deal with _any_ datatype, so you can
easily get your Piece Table to deal with _any_ character format from ASCII,
UTF-8, UTF-16, etc. to raw bytes, Hex values, etc. It's even fairly obvious
how to merge buffers that use different character formats.

The strongest point I see regarding Piece Tables is that Pieces are so
discrete. You can define a Piece to be whatever you want it to be, as long as
you implement "mappend" for it, or you can put the Piece in a box with other
data, and make your buffer a list of boxes, and define "mappend" for the rest
of the "box"'s data.

Another fairly obvious implementation is an undo/redo tree (like Vim has):
Instead of Buffer being a List of Pieces, it can be a Tree of Pieces. The
buffer is merged by merging the leftmost leaves of the tree, and "undo" is
done by swapping the last left Leaf with an empty Leaf. Redo is done by
rotating the last left node's leaves.

A final note regarding the above implementation: This simple mappend can only
merge Pieces in order. If a Piece tries to edit part of another Piece that
doesn't exist (out of bounds), it will be appended, or prepended without any
filler, because that is how "splitAt" handles edge cases (rather than being
implemented with Maybe). It wouldn't be too difficult to implement "mappend"
for these edge cases, but you would need to decide on a filler character like
space, or you would have to implement some kind of lazy merge that just keeps
the second Piece around until it can be merged.

~~~
eru
If you use persistent data structures in your implementation (like would be
the default in Haskell), your data structures themselves don't need to know
about undo/redo. Just keep references to the old state around (in an undo tree
or a linear structure for simplicity), and just switch back to them if
necessary. Don't explicitly manipulate your data structures. Sharing will take
care of keeping the overheads low.
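A sketch of that point, with a plain immutable string standing in for a persistent structure (a real persistent rope or RRB-tree would share most nodes between versions instead of copying, but the shape of the code is identical): old states are just kept as references, and undo switches back to one.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Each version of the document is an immutable value; the history is
// a stack of references to those values. Undo never "reverses" an
// edit -- it simply points back at an older version.
using Doc = std::shared_ptr<const std::string>;

struct History {
    std::vector<Doc> versions{std::make_shared<const std::string>("")};

    const std::string& current() const { return *versions.back(); }

    void insert(std::size_t pos, const std::string& s) {
        auto next = std::make_shared<std::string>(current());  // new version
        next->insert(pos, s);
        versions.push_back(next);                              // old one kept
    }

    void undo() { if (versions.size() > 1) versions.pop_back(); }
};
```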

~~~
thomastjeffery
> Don't explicitly manipulate your data structures. Sharing will take care of
> keeping the overheads low.

Immutability is one of the strong points of Piece Tables.

Don't forget that this is the most _simple_ implementation of a Piece Table.
If you are in any way concerned about performance, you can quite easily
implement caching, sharing, etc.

~~~
eru
I suspect we are in violent agreement. I was just taking aim at

> [...] "undo" is done by swapping the last left Leaf with an empty Leaf. Redo
> is done by rotating the last left node's leaves.

Which is more complicated than you need to be in an immutable setting to
implement undo/redo.

------
jussij
Back in the mid 90s I created a shareware text editor that ran on Windows 3.1
(on a 80386 machine) and was written in C++ using pure Win16.

At that time I used a _double link list_ like structure with a _line cursor_
to help with list traversal.

While this was not one of the _formal text editor_ design patterns, I found
that pattern worked very well.

~~~
usrusr
Maybe it only worked well because you never encountered a file consisting of
one very long line? Back then most somewhat human-readable data formats were
line-based. Lines also were the de-facto standard work unit for stream
processing, so there was an implicit understanding that a large file
without newlines would not be processable as text even if all parts of that
line were human readable. This only changed once XML went big which made most
newlines a convenience that is frequently omitted in machine to machine
communication, a paradigm shift that almost all younger formats followed. Now
those m2m versions of optionally linebroken data occasionally hit our editors
and make the limitations of lines as the basic unit of text editors painfully
noticeable.

That being said, I have often seen otherwise reliable editors choke on very
long lines, so your former self would still be in good company, even two
decades later.

~~~
jussij
> Maybe it only worked well because you never encountered a file consisting of
> one very long line?

You're correct that very long lines are hard to handle, but I'm not sure
this is because of the internal data structure used.

In this case the internal data structure allocated a line buffer at every node
of the double linked list and even on that Win16 environment that allowed for
a line size of up to 64 kBytes (i.e. 2^16) in length.

However, to handle tabs correctly the editor is forced to recalculate the
column position on every user event and since that calculation has to consider
the entire line, the time needed to do this _column calculation_ grows with
line length.

So the editor could handle long lines, but as line lengths approached 500
characters the speed became noticeably slow.
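The column calculation being described might look like this (a sketch; the tab width and names are my assumptions): with tabs, the display column of an offset depends on every character before it, so the scan is O(line length), which is why repeating it on every user event hurt on long lines.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Display column of `offset` within `line`: every tab advances to the
// next tab stop, so the whole prefix must be scanned each time.
std::size_t display_column(const std::string& line, std::size_t offset,
                           std::size_t tab_width = 8) {
    std::size_t col = 0;
    for (std::size_t i = 0; i < offset && i < line.size(); i++) {
        if (line[i] == '\t')
            col += tab_width - (col % tab_width);  // jump to next tab stop
        else
            col++;
    }
    return col;
}
```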

~~~
eru
You could guess the tab handling with some fast method at first, and
asynchronously run the proper handling during a pause in user input a few
seconds later.

------
testcross
Interesting comment on lobste.rs explaining why the piece table is rarely
superior to the gap buffer:
[https://lobste.rs/s/xpab69/text_editor_data_structures](https://lobste.rs/s/xpab69/text_editor_data_structures)

------
tudorw
If you're interested in this kind of structure it's worth taking a look at
'transclusion' devised by Ted Nelson in the mid 60's
[https://en.wikipedia.org/wiki/Transclusion](https://en.wikipedia.org/wiki/Transclusion)

------
jmnicolas
I'm always amazed that people are still working on new text editors. When I
have an idea for an app and I see there's even just one alternative, I usually
lose all motivation for coding it (until I realize the alternative doesn't
fulfill my need).

~~~
dilap
I use emacs, but it has lots of problems; I'm still waiting for an editor that
combines its good qualities while resolving the problems. (And I'm tempted to
try to write one.)

~~~
shalabhc
What do you think are the problems of emacs?

~~~
kaens
30 years of cruft + mix/matching of semi-incompatible semi-backwards-
compatible systems if you use 3rd-party elisp and/or maintain a config for a
long time.

I love emacs, use it daily, and think that its positives far outweigh its
negatives, but hooo boy sometimes you need to wade through decades of muck to
make it do a thing, and it's often not clear what the best way to go about
trying to do a thing is, particularly to people who haven't been using emacs
for a decade+.

------
gabrielcsapo
Incredibly thorough, are there any plans to implement this into a text editor?

~~~
Johnny_Brahms
It is already in use in a bunch of smaller editors like Vis (iirc)

~~~
panic
Yeah, the vis implementation is here:
[https://github.com/martanne/vis/blob/master/text.c](https://github.com/martanne/vis/blob/master/text.c)

~~~
steinso
That was a great read, and really shows the benefits of writing good comments.

------
norswap
I'm more interested in what he will use for the UI. The popular option
nowadays seems to be embedding WebKit, but that has obvious drawbacks. On the
other hand, having scouted the grounds myself a year ago, it is one of the
easiest options to set up, and it looks the best.

------
jdmoreira
The first thing that came to my mind was mmap.

I’m sure there is something very wrong with using mmap and I can probably find
the answer with just a google search but it would have been great if it was
covered by the article.

------
gigatexal
Well if I ever wanted to ask do I really need to know advanced data structures
this answers that question. Really interesting article.

------
platz
Figure out whatever VSCode is doing and use that.

------
Ygg2
Are some types of binary trees favored for making ropes? I heard 2-3 finger
trees are good as a backing data structure for a rope.

~~~
prospero
I’d think a relaxed radix tree would be ideal.

