
A Brief Glance at How Various Text Editors Manage Their Textual Data (2015) - hoffmannesque
http://ecc-comp.blogspot.com/2015/05/a-brief-glance-at-how-5-text-editors.html
======
akavel
The article is missing one of the important but commonly overlooked
structures: a "piece chain" [1] (aka "piece table"). Not easy to find editors
using it, especially among the popular ones, but:

\- from what I've read somewhere, MS Word was/is (?) using it internally
[http://1017.songtrellisopml.com/whatsBeenWroughtUsingPieceTa...](http://1017.songtrellisopml.com/whatsBeenWroughtUsingPieceTables)

\- AbiWord [1]

\- [1]: [http://www.catch22.net/tuts/piece-chains](http://www.catch22.net/tuts/piece-chains)

\- [https://github.com/martanne/vis](https://github.com/martanne/vis)

\- the text editor of the Oberon OS used it [1]

One of its most interesting advantages is that it trivially supports fast,
unlimited, and persistent undo/redo.
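
For readers who haven't seen one, here is a minimal sketch of the idea. It is
not taken from any of the editors listed above; the class and method names are
made up for illustration, and undo is done by keeping earlier piece lists,
which is a simplification of what a real piece chain does with spans.

```python
class PieceTable:
    """Text is a list of (buffer, start, length) pieces; buffers are never edited."""

    def __init__(self, text):
        self.original = text      # read-only original contents
        self.added = ""           # append-only buffer for inserted text
        self.pieces = [("orig", 0, len(text))] if text else []
        self.undo_stack = []      # earlier piece lists
        self.redo_stack = []

    def text(self):
        bufs = {"orig": self.original, "add": self.added}
        return "".join(bufs[b][s:s + n] for b, s, n in self.pieces)

    @staticmethod
    def _split_at(pieces, offset):
        """Return (new_list, index) so that a piece boundary sits at `offset`."""
        pos = 0
        for i, (buf, start, length) in enumerate(pieces):
            if offset == pos:
                return list(pieces), i
            if offset < pos + length:
                cut = offset - pos
                left, right = (buf, start, cut), (buf, start + cut, length - cut)
                return pieces[:i] + [left, right] + pieces[i + 1:], i + 1
            pos += length
        if offset == pos:
            return list(pieces), len(pieces)
        raise IndexError("offset out of range")

    def _commit(self, new_pieces):
        self.undo_stack.append(self.pieces)
        self.redo_stack.clear()
        self.pieces = new_pieces

    def insert(self, offset, s):
        pieces, i = self._split_at(self.pieces, offset)
        pieces.insert(i, ("add", len(self.added), len(s)))
        self.added += s           # old pieces stay valid: we only append
        self._commit(pieces)

    def delete(self, offset, length):
        pieces, i = self._split_at(self.pieces, offset)
        pieces, j = self._split_at(pieces, offset + length)
        self._commit(pieces[:i] + pieces[j:])

    def undo(self):
        if self.undo_stack:
            self.redo_stack.append(self.pieces)
            self.pieces = self.undo_stack.pop()

    def redo(self):
        if self.redo_stack:
            self.undo_stack.append(self.pieces)
            self.pieces = self.redo_stack.pop()


pt = PieceTable("hello world")
pt.insert(5, ",")
pt.delete(0, 1)
print(pt.text())   # ello, world
pt.undo()
pt.undo()
print(pt.text())   # hello world
```

Because no character data is ever overwritten, every earlier piece list still
describes a valid document, which is why undo falls out almost for free.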

~~~
jhallenworld
A problem with this data structure is that it's not self optimizing. Suppose
you insert a character after every other character (not common, but easy to do
with a macro). You will end up with a link-node per character. On the one
hand, the size expansion is probably fine because the undo records of other
data structures would equal it (if they support unlimited undo). On the other,
all data access now involves pointer chasing for each character. This could be
bad for screen update, search, etc.

The doubly-linked list of gap buffers method is somewhat similar, but is self-
optimizing because adjacent small buffers can be merged.
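
A rough sketch of the effect, under simplifying assumptions: pieces are plain
strings here instead of (buffer, offset, length) references, and `merge_small`
with its `limit` parameter is a hypothetical stand-in for the kind of merging a
list of small buffers allows. It is not code from any editor.

```python
def insert_piece(pieces, offset, text):
    """Insert `text` at character `offset`, splitting at most one piece."""
    pos = 0
    for i, piece in enumerate(pieces):
        if pos <= offset <= pos + len(piece):
            cut = offset - pos
            parts = [piece[:cut], text, piece[cut:]]
            return pieces[:i] + [p for p in parts if p] + pieces[i + 1:]
        pos += len(piece)
    raise IndexError(offset)


def run_macro(pieces):
    """Insert an 'x' after every two original characters, like an editor macro."""
    total = sum(len(p) for p in pieces)
    offset = 2
    while offset <= total:
        pieces = insert_piece(pieces, offset, "x")
        total += 1
        offset += 3  # skip the new 'x' plus the next two original characters
    return pieces


def merge_small(pieces, limit=64):
    """Merge neighbouring pieces while the combined size stays within `limit`."""
    merged = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(piece) <= limit:
            merged[-1] += piece
        else:
            merged.append(piece)
    return merged


pieces = run_macro(["a" * 1000])
print(len(pieces))               # 1000 pieces for 1500 characters
print(len(merge_small(pieces)))  # far fewer once neighbours are merged
```

The merge step copies character data, which is exactly what a piece chain
avoids doing, so the two approaches trade fragmentation against copying.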

~~~
akavel
The piece chain approach _does_ indeed have some cons too, not only
advantages. E.g. retrieving a byte at a specific absolute offset requires
iterating sequentially over all preceding "pieces" of the chain in the basic
variant of the structure (i.e. if used without additional helpers/caches/...).
But which method doesn't bring its own disadvantages as well as advantages? It
would probably be interesting if someone managed to write a _comprehensive_
comparison/analysis of _all_ the various known data structures for editors...

My intention was to highlight the piece-chain method as it seems to me
surprisingly often overlooked (not known?), while having quite some noteworthy
features. That said, I'm starting to think that maybe its non-trivialness can
be seen in itself as one of the disadvantages too (i.e. by maybe making the
program more complex than for the other, "dumber" methods)? Not sure on that,
though. Adding undo upon the other ("simpler") approaches may possibly make
them more complex anyway?

------
mhw
One interesting aspect of the implementation of sam is that there are actually
two different text representations. What's discussed here is the part in the
main sam process, but there's another data structure in the samterm process
that contains only those sections of the file that have actually been
displayed in the samterm GUI.

This meant that you could run sam and samterm on opposite ends of a slow link
and still be able to edit very large files. The remote sam process loads the
file into the data structure described in the original article. samterm only
loaded (over the slow link) the section of the file needed to draw a window
containing the part of the text the user was looking at. As you moved around
the file samterm would fill in parts of the data structure with the text you
needed to see.

The data structure used on the samterm end is called a Rasp: a file with
holes. See
[https://plan9port.googlesource.com/plan9/+/refs/heads/master...](https://plan9port.googlesource.com/plan9/+/refs/heads/master/src/cmd/samterm/rasp.c)
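
To make "a file with holes" concrete, here is a small sketch. It is not how
rasp.c is written: `HoleyText` and the `fetch` callback are hypothetical names,
and it stores one slot per character for brevity.

```python
class HoleyText:
    """Client-side view of a remote file: known length, partly filled contents."""

    def __init__(self, length, fetch):
        self.fetch = fetch             # fetch(lo, hi) -> str, e.g. over a slow link
        self.chars = [None] * length   # None marks a hole

    def holes(self, start, end):
        """Yield (lo, hi) ranges inside [start, end) that are not loaded yet."""
        lo = None
        for i in range(start, end):
            if self.chars[i] is None:
                if lo is None:
                    lo = i
            elif lo is not None:
                yield lo, i
                lo = None
        if lo is not None:
            yield lo, end

    def read(self, start, end):
        """Return text[start:end], fetching only the parts that are still holes."""
        for lo, hi in list(self.holes(start, end)):
            self.chars[lo:hi] = self.fetch(lo, hi)  # assumes fetch returns hi - lo chars
        return "".join(self.chars[start:end])


remote = "0123456789" * 10
requests = []

def fetch(lo, hi):
    requests.append((lo, hi))
    return remote[lo:hi]

view = HoleyText(len(remote), fetch)
view.read(10, 20)   # the window the user is looking at
view.read(5, 25)    # scrolling: only the two missing edges are fetched
print(requests)     # [(10, 20), (5, 10), (20, 25)]
```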

------
flyingmutant
If you are interested in this, "Data Structures for Text Sequences"
([https://www.cs.unm.edu/~crowley/papers/sds.pdf](https://www.cs.unm.edu/~crowley/papers/sds.pdf))
by Charles Crowley is an excellent paper to study.

~~~
amelius
Yes, very nice. But it needs an update on structured documents (tree-like, as
in HTML). And, ideally, it should be accompanied by work that addresses
collaborative editing in such structured documents.

------
fallat
Author here: I woke up this morning and I'm pretty surprised! Finally some
feedback over a year later.

I will compile a list of more editors people want to see (I'm seeing a lot of
requests for Atom and for editors that use piece tables; vis uses piece tables).

Thank you for the positive feedback. My blog before had a total of 30,000
views over 2-3 years, and now just overnight has double that - makes me pretty
happy and want to put more effort into write ups! :)

------
zelos
Wow, this prompted me to go look for a GPL PalmOS editor I wrote 11 years ago
and _someone's put it on GitHub [1]_ and updated it 2 months ago. It was the
first serious programming I did.

I used a bizarre structure where each onscreen line in the document
(soft-wrapped) was represented by a (uint16?) character count and pointer to an
array containing the characters. I had to write a custom memory allocator that
put adjacent lines next to each other in memory to make moving characters
between lines efficient.

Way too much overhead, and if you changed font you had to completely reload
the file, but it was pretty quick even on the old 68K Palms.

[1] [https://github.com/rtiangha/SiEd-Dana](https://github.com/rtiangha/SiEd-Dana)

------
lewisjoe
I'd like to see someone write on how web based rich text editing tools store
their textual data.

Like,

\+ quill.js: [https://github.com/quilljs/quill](https://github.com/quilljs/quill)

\+ draft.js: [https://facebook.github.io/draft-js/](https://facebook.github.io/draft-js/)

\+ prosemirror.js: [https://github.com/ProseMirror/prosemirror](https://github.com/ProseMirror/prosemirror)

\+ trix.js: [https://github.com/basecamp/trix](https://github.com/basecamp/trix)

Each of them might be relying on some form of _Tree_ to store the content
which actuates the views to react. I'm only guessing. A deeper, closer look
might be interesting and useful.

~~~
zeven7
Usually web based rich text editors just use a <div> with
contenteditable=true. They don't manually manage the textual data. The browser
manages the data, automatically adding and removing <p> elements to contain
the data. The editor just adds buttons which call built-in functions that
transform the text.

I imagine someone somewhere has written a more complex editor that manages
things a lot more manually, though. I checked the first one you linked
(quill), and it, as expected, uses contenteditable.

~~~
asteadman
contenteditable has its own problems
([https://medium.com/medium-eng/why-contenteditable-is-terribl...](https://medium.com/medium-eng/why-contenteditable-is-terrible-122d8a40e480)),
and more modern rich text editors do build up some kind of model of the data
they are representing. My favorite is prosemirror, which presents the user a
contenteditable, but maps changes as transformations to its own internal
document model. Each version of the document is immutable and it uses a fast
diffing algo (similar to react) to determine the DOM mutation steps necessary
to represent the newest version of the document. As a result, it becomes
"trivial" to support multi-user collaboration, since you just import the other
user's transformations to the document model as they come in
([http://prosemirror.net/demo/collab.html#edit-Example](http://prosemirror.net/demo/collab.html#edit-Example)).
Since the entire document is now modeled as a tree structure, it becomes easy
to do custom serialization/deserialization, e.g. to markdown and back again
([http://prosemirror.net/demo/markdown.html](http://prosemirror.net/demo/markdown.html)).

------
StreakyCobra
"The Craft of Text Editing"
([http://www.finseth.com/craft/](http://www.finseth.com/craft/)) could also
interest people who like reading about text editor internals.

------
brynjolf
I had a 4gb textfile I wanted to open. Vim choked while Scite handled it like
it was no big deal. Later I configured vim to handle it okay but Scite still
is there in my toolbox if I need it.

~~~
spatulon
Turning off line wrapping and syntax highlighting before loading a large file
seems to help enormously in Vim.

~~~
RBerenguel
And on emacs. I have "turn on/off syntax highlighting" as a shortcut.
Especially useful for largish JSON files; for some reason emacs chokes easily
on them.

~~~
kuschkufan
I like how you say "on emacs". Makes it sound like "on Windows", i.e. Emacs
the OS... ;)

~~~
RBerenguel
Well, I'm not native so on and in are somehow hard to pinpoint. Quite likely
I'd have said also on vim :) (I use emacs+evil by the way)

------
marijn
Potentially relevant: CodeMirror's BTree representation
[http://marijnhaverbeke.nl/blog/codemirror-line-tree.html](http://marijnhaverbeke.nl/blog/codemirror-line-tree.html)

~~~
espadrine
As I was reading the original post, I was wondering what a similar comparison
of Web-based text editors would look like: Ace, CodeMirror, Atom, Monaco… They
all feel like they have slightly different performance characteristics, but
I've never dug into how each was done. That post is a great start!

------
erichocean
Unsorted counted B-trees are an interesting way to manage text in a text
editor:

[http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtre...](http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html)

You can do all the normal editing operations in log time, and B-trees can
(obviously) be run from disk, so you don't need to use a lot of RAM, either,
and can efficiently edit extremely large text files. In-memory, B-trees also
use the memory hierarchy efficiently.
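
A sketch of the counting idea, with loud caveats: this uses a binary tree
instead of a real B-tree, does no leaf splitting or rebalancing, and all names
(`Node`, `build`, `char_at`, `insert`, `to_str`) are made up for illustration.
What it shows is how per-subtree counts let you find a position without
scanning the text.

```python
class Node:
    def __init__(self, left=None, right=None, text=""):
        self.left, self.right, self.text = left, right, text
        # Leaves count their own characters; inner nodes cache the subtree total.
        self.count = len(text) if left is None else left.count + right.count


def build(chunks):
    """Build a balanced tree over a non-empty list of string chunks."""
    if len(chunks) == 1:
        return Node(text=chunks[0])
    mid = len(chunks) // 2
    return Node(build(chunks[:mid]), build(chunks[mid:]))


def char_at(node, index):
    """Find a character by position using only the cached counts."""
    while node.left is not None:
        if index < node.left.count:
            node = node.left
        else:
            index -= node.left.count
            node = node.right
    return node.text[index]


def insert(node, index, s):
    """Insert `s` at `index`, bumping the counts along the path to the leaf."""
    node.count += len(s)
    if node.left is None:
        node.text = node.text[:index] + s + node.text[index:]
    elif index <= node.left.count:
        insert(node.left, index, s)
    else:
        insert(node.right, index - node.left.count, s)


def to_str(node):
    if node.left is None:
        return node.text
    return to_str(node.left) + to_str(node.right)


text = "the quick brown fox"
root = build([text[i:i + 4] for i in range(0, len(text), 4)])
insert(root, 4, "very ")
print(to_str(root))        # the very quick brown fox
print(char_at(root, 9))    # q
```

A real counted B-tree keeps many children per node and splits or merges nodes
to stay balanced, which is where the logarithmic bound for edits comes from.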

~~~
inspectahdeck
This seems functionally pretty similar to the 'rope' [0] data structure used
by GtkTextBuffer

[0]
[https://en.wikipedia.org/wiki/Rope_(data_structure)](https://en.wikipedia.org/wiki/Rope_\(data_structure\))

------
melling
"For some, they want editing to have high physical efficiency - using keyboard
shortcuts and command keys, maybe the odd time the mouse comes in handy.
Others want their editors to be virtually efficient - to make the most out of
their resources (RAM, disk space) - or to be "small". "

In 2015, I would think that most (i.e. > 90%) want an editor that makes them
most efficient as developers.

A little side project I have is to create a set of notes on editors I've used,
or want to use, so I can compare them, and use each more efficiently.

[https://github.com/melling/EditorNotes](https://github.com/melling/EditorNotes)

Editing should be a precise skill with less hand movement and fewer keys
pressed.

~~~
marssaxman
The editor which makes me most productive as a developer is the one I have to
think about least. I spend an order of magnitude more time reading and
thinking than typing, so no amount of increased editor efficiency could make a
significant improvement to my productivity. Having to stop focusing on the
problem at hand in order to remember how to operate some sophisticated editor
feature, on the other hand, could be a significant distraction. I want to use
the simplest, most straightforward editor capable of doing the job, and if I
have to spend a few extra minutes here and there doing some repetitive code
munging by hand, it is not a problem because I can continue doing the hard
part of the job (thinking) while my fingers are busy typing.

~~~
melling
You are going to be using an editor for dozens of hours a week over decades.
It's probably worth the effort to master one of the better editors.

~~~
marssaxman
After using an editor for dozens of hours a week over decades, I decided it
was worth the effort to write one of my own. I've been living in it for close
to a year and a half now, and while it has a couple of quirks that
occasionally bug me, it's nothing serious, and I feel pretty happy with my
decision so far.

------
bch
nvi[0] (New vi, BSD) by Keith Bostic uses bdb[1] (Berkeley Database) as its
backend, which is interesting. It may be because he (and Margo Seltzer)
_invented_ the bdb that it was on his mind, but interesting design decision
nonetheless.

[0] [https://en.wikipedia.org/wiki/Nvi](https://en.wikipedia.org/wiki/Nvi)

[1]
[https://en.wikipedia.org/wiki/Berkeley_DB](https://en.wikipedia.org/wiki/Berkeley_DB)

------
JulianMorrison
I'm curious what JOE editor[1] uses. With its syntax highlighting turned off,
that program can open really huge files much quicker than vi or emacs.

1\. [http://joe-editor.sourceforge.net/](http://joe-editor.sourceforge.net/)

~~~
jhallenworld
JOE was written in the final days of expensive memory and was written so that
it can edit files larger than memory. Even today this is sometimes useful: you
can edit an 8 GB file on a 32-bit machine.

It uses a doubly linked list of gap buffers. Each gap buffer has a header and
a 4K data page. The headers are always in memory, but the data pages can be
swapped out to a file in /tmp. The memory usage limit is 32 MB. Possibly this
is no longer a good idea- it's easily possible that you could have more RAM
than /tmp space.

The header has the data page's offset in the swap file, the link pointers, the
gap location and a count of the number of newlines in the gap buffer.

When a file is read in, the gap buffers are completely full. So read-in turns
into a direct read of the file into memory (or into the swap file). The only
thing it has to do is count the newlines in each 4K data page and generate the
headers.

The newline count is to speed up seeks to specific line numbers. [A long
standing enhancement idea is to generate the newline count on demand and use
mmap. This would allow the read in to be a NOP- just demand load the pages
from the original file as needed and use copy-on-write when any change is made
to preserve the original. But I'm also not sure it's a good idea to not take a
snapshot of the original file- so this probably should be optional.]

JOE uses smart pointers to the edit buffer. Each pointer has the address of
the header and a memory pointer to the data page (which is always swapped in
if there is a pointer to it). The software virtual memory system has a
reference count on each page. Each pointer holds a reference on the data page
it's pointing to. If there is no pointer to a page, the reference count is
zero, so it can be swapped out.

The other purpose of the smart pointers is to automatically stick to the text
they are pointing to, even through insert and delete operations. So if you
insert at one point in the file, any pointers to further locations are updated
(including line number, byte offset, column number and memory offset).
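
A rough sketch of why the per-page newline count helps with line seeks. This is
not JOE's code: a Python list stands in for the doubly linked list, immutable
`bytes` stand in for the gap buffers, and `Header`, `load` and `offset_of_line`
are hypothetical names.

```python
PAGE = 4096  # data page size mentioned above


class Header:
    def __init__(self, data):
        self.data = data                    # stand-in for the 4K data page
        self.newlines = data.count(b"\n")   # cached so seeks can skip the page


def load(contents, page=PAGE):
    """Cut the file into full pages and count the newlines in each one."""
    return [Header(contents[i:i + page]) for i in range(0, len(contents), page)]


def offset_of_line(headers, line):
    """Byte offset where 0-based `line` starts, or None if the file has fewer
    than `line` newlines."""
    if line == 0:
        return 0
    offset, remaining = 0, line
    for h in headers:
        if h.newlines < remaining:
            # The line starts in a later page: skip this one without reading it.
            remaining -= h.newlines
            offset += len(h.data)
            continue
        pos = -1
        for _ in range(remaining):          # scan only inside the target page
            pos = h.data.index(b"\n", pos + 1)
        return offset + pos + 1
    return None


contents = b"".join(b"line %d\n" % i for i in range(5000))
headers = load(contents)
start = offset_of_line(headers, 1234)
print(contents[start:start + 9])   # b'line 1234'
```

Only the headers are touched until the target page is found, which matters when
the data pages may be swapped out to disk.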

~~~
unwind
Guessing from your user name, are you one of the authors of JOE? That seems
rather likely, and is just generally why HN is awesome. Thanks for that very
well-written explanation.

------
cyphar
One of the more interesting editors I've seen being worked on is edlib
([https://github.com/neilbrown/edlib](https://github.com/neilbrown/edlib)).
The theory is that rather than doing a buffer-based approach as in Emacs, the text
is _actually_ just a representation of a file. The consequence being that you
can have a presentation open as one representation and also be modifying the
presentation as a different representation. It's very young (and is a toy
project as well), but it's definitely a cool idea IMO.

~~~
hittaruki
Isn't an emacs buffer exactly that? Depending on what modes are turned on the
way the file data is presented will also change.

For example, you can open an SVG file: if you use the image mode you see the
image representation, otherwise you get the code.

~~~
cyphar
Emacs buffers are copies of the representation, not _actual_ representations.
The author had a talk at a recent linux.conf.au (can't find the link right
now), which explains it much better than I could.

------
jmngomes
Would love to see Notepad++ added to this, especially given its efficient
handling of large text files.

~~~
ank_the_elder
Notepad++ uses Scintilla internally.

~~~
voltagex_
Does it still use an upstream Scintilla or a fork?

~~~
ygra
It's unlikely they replaced the editor data structure in a fork; that'd
basically mean they couldn't use any patches.

------
LeifCarrotson
> The buffer is initialized as a vector with 10,000 lines. If a file is loaded
> with more than 10,000 lines, or surpasses this limit while editing, it will
> add 1/10th of the amount of max lines to the buffer....This behavior only
> becomes useful after the 10,000 line mark, although a little wasteful. If
> say while typing we hit the 1,000,000 line mark, Moe will allocate space for
> 100,000 lines. This is not very resource efficient, unless you are someone
> who typically types an additional 100,000 lines after your millionth line.

This is incorrect. Intuitively, if you're using large files, you're likely to
generate large changes in file sizes. Algorithmically, this is a very common
approach. Collections in most languages double their size whenever the
existing capacity is exceeded. This minimizes the number of times you need to
reallocate lists of increasing size, and works nicely with the memory
management systems of various operating systems.

It's also not resource efficient to need to change the size of the collection
for every new line. I would suggest that a 10,000 line initial capacity and
increasing by 10% on reallocation are appropriate compromises.
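
A quick back-of-the-envelope check of the three strategies discussed here. The
function names are made up for this sketch; the 10,000-line starting capacity
and the "add 1/10th" rule come from the quoted article.

```python
def reallocations(target, grow, initial=10_000):
    """Count how many times the capacity must grow to hold `target` lines."""
    capacity, count = initial, 0
    while capacity < target:
        capacity = grow(capacity)
        count += 1
    return count


def ten_percent(capacity):
    return capacity + capacity // 10   # the rule described in the article


def doubling(capacity):
    return capacity * 2                # common in standard-library collections


def one_line(capacity):
    return capacity + 1                # grow on every new line


for name, grow in [("+10%", ten_percent), ("x2", doubling), ("+1", one_line)]:
    print(name, reallocations(1_000_000, grow))
```

Both geometric strategies need only a handful of reallocations to reach a
million lines (7 for doubling, a few dozen for 10% growth), while growing one
line at a time needs 990,000.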

------
mhd
It would be interesting to see something like this for "rich text" editors,
too. Where it's not just text, but might include formatting instructions or
even media. Would be especially nice if one could contrast something from the
WordStar era with 90s OO tech (flyweight objects?) to whatever's currently in
vogue (functional shenanigans?).

~~~
m_mueller
I feel that rich text is sadly something our trade has mostly just given up
on. What's in vogue? Still HTML, which is IMO a really really sad state of
affairs, since there's basically no way you can cleanly map between HTML and a
wysiwyg editor. The simple question of "where is the cursor?" is almost
impossible to answer in the general case.

~~~
espadrine
There is a recent wealth of new Web-based rich text editors. ProseMirror,
Basecamp's Trix, FastMail's Squire, Guardian's Scribe, Summernote…

Even old ones such as CKEditor plan on updating their engine:
[https://medium.com/content-uneditable/ckeditor-5-the-future-...](https://medium.com/content-uneditable/ckeditor-5-the-future-of-rich-text-editing-2b9300f9df2c#.gt4uibrtr).

~~~
m_mueller
It's certainly good to see some development. I remain skeptical however, until
we have an OSS editor available for production use that internally maps (and
requires storage) to a sane data model (!=HTML).

------
agentgt
I wonder what Linus' favorite editor microemacs/uEmacs uses for data
structures. It's actually fairly hard to find the source code (all I could find
is some ftp and I admit I was too lazy to fire up ftp and download the code).

EDIT: oh here is where it is:
[https://git.kernel.org/cgit/editors/uemacs/uemacs.git](https://git.kernel.org/cgit/editors/uemacs/uemacs.git)

And it looks like it's line-based like old vi (I think):
[https://git.kernel.org/cgit/editors/uemacs/uemacs.git/tree/l...](https://git.kernel.org/cgit/editors/uemacs/uemacs.git/tree/line.c)

Interestingly there was a line limit per window based on a signed char (127)
that just got changed to an int (that was the last change made).

------
amelius
Tcl/Tk's editor also used balanced trees.

------
voltagex_
I wonder what Wordpad and Notepad do? Notepad will fail on anything bigger
than ~50MB but Wordpad will work on it and have a progress bar to show it
hasn't completely hung.

Text editing on Windows still mostly sucks.

~~~
nikbackm
Notepad and Wordpad are more like demo applications for the Windows common
edit control and Rich Text edit control respectively.

Hardly meant as serious editors. But they will do for simple tasks.

------
pklausler
aoeui uses a simple mmap for unmodified files and a gap buffer mmap'ed from a
temp file for modified texts. This allows huge files to be viewable instantly.
Gap buffers are awesome, BTW.

------
golergka
> b->nc += m;
> q0 += m;
> s += m;
> n -= m;

Why do people who write in C so often make it look like they have a shortage
of letters to use for names?

~~~
kps
They had a math/science background, so they wrote things like

    
    
      F = m * a
    

rather than

    
    
      ADD STATE-SALES-TAX TO STATE-TAXABLE-SUBTOTAL GIVING SUBTOTAL-INCLUDING-STATE-SALES-TAX
    

or

    
    
       activityLoggingAccessForIsolatedWorldsPerWorldBindingsLongAttributeAttributeSetterCallbackForMainWorld();

------
detrino
I've written a Data Structure[1] for C++ to handle this problem. It uses
counted B+Trees internally and is a drop in replacement for C++14's
std::vector.

[1]
[https://github.com/det/segmented_tree](https://github.com/det/segmented_tree)

------
mavhc
The 8bit Acorn BBC editor called Edit put the entire file in RAM, all the
characters before the cursor at the start of RAM, all the characters after at
the end. Quick for inserting, but there was a half second delay if you jumped
from the top to the bottom of the text.

~~~
EdiX
This data structure is called a gap buffer. It's rather common in text editors
because it's simple to implement and efficient for the most common use cases.

In the usual implementation the gap is only moved when you actually insert
text, not when the cursor is moved, which makes it even more efficient.
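
A minimal sketch of a gap buffer with that lazy gap movement. This is an
illustration under my own assumptions, not code from any particular editor; the
class and the `gap_moves` counter are hypothetical and exist only to make the
behaviour visible.

```python
class GapBuffer:
    def __init__(self, text="", capacity=64):
        capacity = max(capacity, 2 * len(text))
        self.buf = list(text) + [""] * (capacity - len(text))
        self.gap_start, self.gap_end = len(text), capacity  # gap is buf[start:end]
        self.cursor = len(text)
        self.gap_moves = 0

    def __len__(self):
        return len(self.buf) - (self.gap_end - self.gap_start)

    def text(self):
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])

    def move_cursor(self, pos):
        self.cursor = max(0, min(pos, len(self)))  # cheap: the gap stays put

    def _move_gap(self, pos):
        if pos == self.gap_start:
            return
        self.gap_moves += 1
        if pos < self.gap_start:      # shift text before the gap to its far side
            n = self.gap_start - pos
            self.buf[self.gap_end - n:self.gap_end] = self.buf[pos:self.gap_start]
            self.gap_start, self.gap_end = pos, self.gap_end - n
        else:                         # shift text after the gap to its near side
            n = pos - self.gap_start
            self.buf[self.gap_start:self.gap_start + n] = \
                self.buf[self.gap_end:self.gap_end + n]
            self.gap_start, self.gap_end = pos, self.gap_end + n

    def _grow(self):
        extra = max(len(self.buf), 16)
        self.buf[self.gap_end:self.gap_end] = [""] * extra
        self.gap_end += extra

    def insert(self, s):
        self._move_gap(self.cursor)   # the gap only moves when text changes
        for ch in s:
            if self.gap_start == self.gap_end:
                self._grow()
            self.buf[self.gap_start] = ch
            self.gap_start += 1
        self.cursor = self.gap_start

    def delete_before_cursor(self, n=1):
        self._move_gap(self.cursor)
        n = min(n, self.gap_start)
        self.gap_start -= n           # deleted characters simply join the gap
        self.cursor = self.gap_start


g = GapBuffer("hello world")
g.move_cursor(0)
g.move_cursor(5)
print(g.gap_moves)   # 0
g.insert(",")
print(g.text())      # hello, world
print(g.gap_moves)   # 1
```

Typing several characters in a row after that costs no further gap moves,
since the gap already sits at the cursor.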

------
z3t4
I'm making a code editor and chose a grid (a two-dimensional array) and a
string. The editor can then cross-check the string against the array to see if there is a bug
somewhere. This is of course not efficient, but it keeps me sane. The grid
also makes it easy to render the text.

~~~
afandian
Is that a standard immutable string? Does this mean allocating a whole new
string every time a character edit is made?

~~~
ant6n
Surely not one allocation per key, that could be up to like 10 allocations per
second.

~~~
z3t4
You will not notice anything unless you are watching the memory graph. I see
no reason for "optimization" at this time.

------
hathym
C/C++ still rocks.

------
rasz_pl
Gnome Gedit is "interesting" when working with big files

2005-2016:"it takes too long to open big files"
[https://bugzilla.gnome.org/show_bug.cgi?id=172099](https://bugzilla.gnome.org/show_bug.cgi?id=172099)

------
octatoan
I needed this :)

