A Brief Glance at How Various Text Editors Manage Their Textual Data (2015) (ecc-comp.blogspot.com)
332 points by hoffmannesque on March 8, 2016 | hide | past | favorite | 97 comments

The article is missing one of the important but commonly overlooked structures: a "piece chain" [1] (aka "piece table"). It's not easy to find editors that use it, especially among the popular ones, but:

- from what I've read somewhere, MS Word was/is (?) using it internally [http://1017.songtrellisopml.com/whatsBeenWroughtUsingPieceTa...]

- AbiWord [1]

- [1]: http://www.catch22.net/tuts/piece-chains

- https://github.com/martanne/vis

- the text editor of the Oberon OS used it [1]

One of its most interesting advantages is that it trivially supports fast, unlimited, and persistent undo/redo.

In the same vein as the article, the code for insertion can be found at [0]. The interesting thing about the piece chain is indeed that it is a persistent data structure.

This makes it relatively easy to implement non-chronological undo/redo operations, i.e. an undo tree.

Because old versions of the text are available one could for example link compiler error messages back to the state when the file was last saved. Thus after a long build only those errors which are still relevant (i.e. not already fixed in the meantime) would be displayed.

Piece chains also map well to POSIX systems, where the initial file can be mmap(2)-ed. That is the reason why vis supports editing of large files efficiently. All editing operations are independent of the file size, but linear in the number of non-consecutive changes. Of course things like "go to line nn" will have to read the file content; there is no way around that. But, for example, navigating using the nn% command is possible in constant time. The way syntax highlighting is implemented, it should also (mostly) work for all types of files.
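For readers who want the shape of the idea without digging into text.c: a minimal, hypothetical piece-table sketch (Python for brevity; the names and details here are mine, not vis's):

```python
# Hypothetical minimal piece table: the original text is never modified;
# edits only append to an add-buffer and update a list of "pieces"
# (spans pointing into one of the two buffers).

class PieceTable:
    def __init__(self, original):
        self.original = original            # read-only (could be an mmap'd file)
        self.added = ""                     # append-only buffer for insertions
        # each piece: (buffer_name, start, length)
        self.pieces = [("orig", 0, len(original))] if original else []

    def _find(self, offset):
        """Locate the piece containing `offset` (linear in #pieces)."""
        pos = 0
        for i, (_, _, length) in enumerate(self.pieces):
            if pos + length >= offset:
                return i, offset - pos
            pos += length
        return len(self.pieces), 0

    def insert(self, offset, text):
        i, split = self._find(offset)
        new = ("add", len(self.added), len(text))
        self.added += text
        if i == len(self.pieces):           # appending past the end
            self.pieces.append(new)
            return
        buf, start, length = self.pieces[i]
        # split the existing piece around the insertion point
        left = (buf, start, split)
        right = (buf, start + split, length - split)
        self.pieces[i:i + 1] = [p for p in (left, new, right) if p[2] > 0]

    def text(self):
        bufs = {"orig": self.original, "add": self.added}
        return "".join(bufs[b][s:s + n] for b, s, n in self.pieces)
```

Since pieces only ever point into immutable buffers, keeping an old copy of the `pieces` list is a complete snapshot of the document, which is why unlimited (and non-chronological) undo falls out almost for free.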

[0] https://github.com/martanne/vis/blob/master/text.c#L555

Yo martanne, I'll be adding vis too in the future. Keep it up :)

A problem with this data structure is that it's not self optimizing. Suppose you insert a character after every other character (not common, but easy to do with a macro). You will end up with a link-node per character. On the one hand, the size expansion is probably fine because the undo records of other data structures would equal it (if they support unlimited undo). On the other, all data access now involves pointer chasing for each character. This could be bad for screen update, search, etc.

The doubly-linked list of gap buffers method is somewhat similar, but is self-optimizing because adjacent small buffers can be merged.

The piece chain approach does indeed have some cons too, not only advantages. E.g. retrieving a byte at a specific absolute offset requires iterating sequentially over all preceding "pieces" of the chain in the basic variant of the structure (i.e. if used without additional helpers/caches/...). But which method doesn't bring its own disadvantages as well as advantages? It would probably be interesting if someone managed to write a comprehensive comparison/analysis of all the various known data structures for editors...

My intention was to highlight the piece-chain method as it seems to me surprisingly often overlooked (not known?), while having quite a few noteworthy features. That said, I'm starting to think that maybe its non-triviality can itself be seen as one of the disadvantages (i.e. it may make the program more complex than the other, "dumber" methods would)? Not sure about that, though. Adding undo on top of the other ("simpler") approaches may well make them more complex anyway.

Of course all data structures have different trade-offs. I just think the piece chain has some of the most favorable ones. The fragmentation issue you mention is true.

Its effects can be mitigated by:

1) storing the pieces in a balanced search tree instead of a linked list. This would make random access logarithmic in the number of non-consecutive changes.

2) implementing some kind of compaction algorithm which merges adjacent changes. This would increase memory usage because not only the changed text but also some unrelated data in between would have to be stored in a contiguous memory region. Furthermore, this duplication of "unchanged data" would harm some other nice properties of the data structure. For example, marks could no longer be represented as simple pointers into the underlying buffers because their content would not be unique (i.e. a character could be part of multiple buffers).

The pointer chasing you mention is an issue for all non-trivial (i.e. not stored in a consecutive memory region) data structures. Screen updates are a non issue since the output region is of fixed size (a terminal). For syntax highlighting using LPeg the relevant text region is temporarily copied to a contiguous memory region. This is simplified by the fact that vis currently always soft wraps lines and does not support folds.

Gap buffers have other issues, for example you only have one gap. What happens if you support multiple cursors/selections which might be on totally different regions of the file? You will have to move the gap.

This is not an issue for a doubly-linked list of gap buffers unless the cursors are on the same line. However what happens if you have a single line file (like minified Javascript)?

Anyway, I find the piece chain and its support for (non-chronological) undo/redo, as well as mark management, rather elegant.

In practice it seems to work quite well for vis. I haven't encountered performance issues in daily usage. On the contrary, users have reported that it is "blazing fast".

Catch22 has an article about this that goes pretty in-depth.


I would guess that the rope data structure has probably displaced the piece chain in new projects.

I was looking for an old post I read about a piece chain text editor, turns out it was about vis: "[dev] [RFC] Design of a vim like text editor" [0]

[0] http://lists.suckless.org/dev/1409/23497.html

How do you do a linear search in a piece chain? Doesn't seem very efficient.

I interviewed there earlier this year and they asked me a question about the piece table. I didn't know what it was called until now, but based on that question, they might still use it.

by "there" do you mean the MS Word team? or something else?

office team

One interesting aspect of the implementation of sam is that there are actually two different text representations. What's discussed here is the part in the main sam process, but there's another data structure in the samterm process that contains only those sections of the file that have actually been displayed in the samterm GUI.

This meant that you could run sam and samterm on opposite ends of a slow link and still be able to edit very large files. The remote sam process loads the file into the data structure described in the original article. samterm only loaded (over the slow link) the section of the file needed to draw a window containing the part of the text the user was looking at. As you moved around the file samterm would fill in parts of the data structure with the text you needed to see.

The data structure used on the samterm end is called a Rasp: a file with holes. See https://plan9port.googlesource.com/plan9/+/refs/heads/master...

If you are interested in this, [Data Structures for Text Sequences](https://www.cs.unm.edu/~crowley/papers/sds.pdf) by Charles Crowley is an excellent paper to study.

Yes, very nice. But it needs an update on structured documents (tree-like, as in HTML). And, ideally, it should be accompanied by work that addresses collaborative editing in such structured documents.

Author here: I woke up this morning and I'm pretty surprised! Finally some feedback over a year later.

I will compile a list of more editors people will want to see (I'm seeing a lot of requests for Atom and for editors with piece tables (vis uses piece tables)).

Thank you for the positive feedback. My blog before had a total of 30,000 views over 2-3 years, and now just overnight it has double that - makes me pretty happy and want to put more effort into write-ups! :)

Wow, this prompted me to go look for a GPL PalmOS editor I wrote 11 years ago and _someone's put it on GitHub [1]_ and updated it 2 months ago. It was the first serious programming I did.

I used a bizarre structure where each onscreen line in the document (soft-wrapped) was represented by a (uint16?) character count and a pointer to an array containing the characters. I had to write a custom memory allocator that put adjacent lines next to each other in memory to make moving characters between lines efficient.

Way too much overhead, and if you changed font you had to completely reload the file, but it was pretty quick even on the old 68K Palms.

[1] https://github.com/rtiangha/SiEd-Dana

I'd like to see someone write on how web based rich text editing tools store their textual data.


+ [quill.js](https://github.com/quilljs/quill)

+ [draft.js](https://facebook.github.io/draft-js/)

+ [prosemirror.js] (https://github.com/ProseMirror/prosemirror)

+ [trix.js](https://github.com/basecamp/trix)

Each of them might be relying on some form of tree to store the content, which actuates the views to react. I'm only guessing. A deeper, closer look might be interesting and useful.

Usually web based rich text editors just use a <div> with contenteditable=true. They don't manually manage the textual data. The browser manages the data, automatically adding and removing <p> elements to contain the data. The editor just adds buttons which call built-in functions that transform the text.

I imagine someone somewhere has written a more complex editor that manages things a lot more manually, though. I checked the first one you linked (quill), and it, as expected, uses contenteditable.

contenteditable has its own problems (https://medium.com/medium-eng/why-contenteditable-is-terribl...), and more modern rich text editors do build up some kind of model of the data they are representing. My favorite is ProseMirror, which presents the user a contenteditable, but maps changes as transformations to its own internal document model. Each version of the document is immutable, and it uses a fast diffing algorithm (similar to React) to determine the DOM mutation steps necessary to represent the newest version of the document. As a result, it becomes "trivial" to support multi-user collaboration, since you just import the other user's transformations to the document model as they come in (http://prosemirror.net/demo/collab.html#edit-Example). Since the entire document is now modeled as a tree structure, it becomes easy to do custom serialization/deserialization, e.g. to markdown and back again (http://prosemirror.net/demo/markdown.html).

Here's a simple demo: http://jsbin.com/satakudifa/1/edit?html,css,output

It's way easier than you might expect. Only one line of JS needed for this demo (inline `onclick` handler for the bold button). Shortcuts like Ctrl+B, Ctrl+I, Ctrl+U, Ctrl+C, Ctrl+V, Ctrl+X, Ctrl+Z work out of the box (in Chrome; IIRC Firefox might open bookmarks for one of those).

Most rich-text editors I know of (Quill, Trix, ...) use contentEditable just as a render surface.

Try opening your console while using Quill and type `quill.getContents()`. And try this with Trix: `document.querySelector('trix-editor').editor.getDocument()`.

Seems like you don't know what 'rich text' is at all.

"The Craft of Text Editing" (http://www.finseth.com/craft/) could also interest people who like reading about text editor internals.

I had a 4gb textfile I wanted to open. Vim choked while SciTE handled it like it was no big deal. Later I configured vim to handle it okay, but SciTE is still there in my toolbox if I need it.

> I had a 4gb textfile ...

A few years ago I had to research text editors designed to handle files >200 MB. Vim, which I love, wasn't up to the job. Here's what I found (sorry, I don't seem to have the links at my fingertips; also this is partly based on 4-year-old memories):


* EmEditor: This was very impressive. The speed is breathtaking for what it does, and it can handle advanced functions such as column operations on the full file with ease. We chose this and users were very happy; it transformed their ability to work with large text files.


These were on our shortlist to test next, but we didn't get to them due to available time and EmEditor's fantastic results:

* VEDIT: Designed for large files. They charge for it, IIRC, but based on its reputation it is worth the investment.

* The Semware Editor (TSE): Also a great reputation

* 010 Editor: http://www.sweetscape.com/010editor/

* PilotEdit: http://www.pilotedit.com/


Also of interest:

* File Query: Treats file as a database, enabling parsing, queries; very flexible and powerful, per reviews: http://www.agiledatasoftware.com/

* PDT-Windows: A database editor which reputedly has a max filesize of 18 EB (exabytes)

* bvi: Binary VI


Finally, some excellent resources:

* http://texteditors.org/cgi-bin/wiki.pl?CategoryLargeFileHand...

* http://www.knudvaneeden.com/links/file/addactivity/computer/...

Interesting. I've found Geany (UI) and the ubiquitous `less` pretty effective at handling very large files in my experience. I might try SciTE out to see how it compares.

Turning off line wrapping and syntax highlighting before loading a large file seems to help enormously in Vim.

And on emacs. I have "turn on/off syntax highlighting" as a shortcut. Especially useful for largish JSON files; for some reason emacs chokes easily on them.

I like how you say "on emacs". Makes it sound like "on Windows", i.e. Emacs the OS... ;)

Well, I'm not native so on and in are somehow hard to pinpoint. Quite likely I'd have said also on vim :) (I use emacs+evil by the way)

Emacs has (or maybe had, I haven't checked for a few years) a hard limit of 512 MiB on 32-bit platforms.

Here's my list of commands for large files in Vim, but I'm no Vim guru:

    set nobackup
    set nowritebackup
    set noswapfile
Also, from the Vim Tips wiki: "Faster loading of large files" http://vim.wikia.com/wiki/VimTip611

One can turn off syntax highlighting only for long lines in vim.

    set synmaxcol=N_characters

On some older Unixes I've used, there was a command called bfs, for big file scanner. It could open text files in read-only mode and you could move around in them like in an editor.

Also, for files too large for an editor, Unix split and csplit commands can help break it into chunks; edit them, then join back with cat.

I've opened text files twice that size with vim. Unless it got much worse in the last 10 years, the problem you had is that vim was trying to count how many lines are in your file; abort that operation by pressing ^C and it will work fine.

Try "joe". (Force the syntax highlighting off though. I wish they had never added it.)

Syntax highlighting was the number one most requested feature before it had it :-)

It drags things to a crawl and there's no global off switch or command line off switch. Having to edit distro provided .rc files makes me grumpy :-(

You should submit a bug for this. You can say: joe --highlight foo.c for one file, but you're right, it's not global. Maybe better to auto-disable if it finds long lines or large files.

On the other hand, I submit that JOE's highlighter is the fastest available (someone should test this - I've not done it recently).

You can temporarily turn off all syntax-highlighting and line wrapping in vim by opening a file like this:

vim -u NONE big_file.txt

This works very well for very large files.

Potentially relevant: CodeMirror's BTree representation http://marijnhaverbeke.nl/blog/codemirror-line-tree.html

As I was reading the original post, I was wondering what a similar comparison of Web-based text editors would look like: Ace, CodeMirror, Atom, Monaco… They all feel like they have slightly different performance characteristics, but I've never dug into how each was done. That post is a great start!

Unsorted counted B-trees are an interesting way to manage text in a text editor:


You can do all the normal editing operations in log time, and B-trees can (obviously) be run from disk, so you don't need to use a lot of RAM, either, and can efficiently edit extremely large text files. In-memory, B-trees also use the memory hierarchy efficiently.
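As a toy illustration of the counted-tree idea (binary here instead of B-ary, and entirely hypothetical names): every internal node caches the size of its subtree, so finding the chunk that contains byte offset k is a log-depth descent.

```python
# Every internal node caches its subtree's total character count, so a
# byte-offset lookup never scans text, it just follows the counts down.
# (A counted B-tree generalizes this to wide, disk-friendly nodes.)

class Node:
    def __init__(self, left=None, right=None, chunk=None):
        self.left, self.right, self.chunk = left, right, chunk
        # leaf: size of its chunk; internal: sum of children's sizes
        self.size = len(chunk) if chunk is not None else left.size + right.size

def char_at(node, k):
    while node.chunk is None:        # internal node: descend using the counts
        if k < node.left.size:
            node = node.left
        else:
            k -= node.left.size      # skip everything in the left subtree
            node = node.right
    return node.chunk[k]
```

Keeping a per-node newline count alongside the byte count gives log-time seeks to line numbers in exactly the same way.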

This seems functionally pretty similar to the 'rope' [0] data structure used by GtkTextBuffer

[0] https://en.wikipedia.org/wiki/Rope_(data_structure)

Yeah, the "goto this byte offset" operation is possibly fast enough that you can just use byte offsets for buffer pointers instead of anything more elaborate (pointing to the data structure internals). The leaf nodes could be gap buffers.

You could have (a secondary index of) line counts as well as bytes so that you can also seek to specific lines in log time.

"For some, they want editing to have high physical efficiency - using keyboard shortcuts and command keys, maybe the odd time the mouse comes in handy. Others want their editors to be virtually efficient - to make the most out of their resources (RAM, disk space) - or to be "small". "

In 2015, I would think that most (i.e. > 90%) want an editor that makes them most efficient as developers.

A little side project I have is to create a set of notes on editors I've used, or want to use, so I can compare them, and use each more efficiently.


Editing should be a precise skill with less hand movement and fewer keys pressed.

The editor which makes me most productive as a developer is the one I have to think about least. I spend an order of magnitude more time reading and thinking than typing, so no amount of increased editor efficiency could make a significant improvement to my productivity. Having to stop focusing on the problem at hand in order to remember how to operate some sophisticated editor feature, on the other hand, could be a significant distraction. I want to use the simplest, most straightforward editor capable of doing the job, and if I have to spend a few extra minutes here and there doing some repetitive code munging by hand, it is not a problem because I can continue doing the hard part of the job (thinking) while my fingers are busy typing.

You are going to be using an editor for dozens of hours a week over decades. It's probably worth the effort to master one of the better editors.

After using an editor for dozens of hours a week over decades, I decided it was worth the effort to write one of my own. I've been living in it for close to a year and a half now, and while it has a couple of quirks that occasionally bug me, it's nothing serious, and I feel pretty happy with my decision so far.

I 100% agree. Which is why I've gone with Emacs (hehe).

Which keybindings? Classic, vim, ergoemacs?


Emacs is extremely customizable. The question is which keybindings are the most efficient? Does a modal mode help? god-mode perhaps?


Recently, spacemacs.

nvi[0] (new vi, BSD) by Keith Bostic uses bdb[1] (Berkeley DB) as its backend, which is interesting. It may be because he (and Margo Seltzer) invented bdb that it was on his mind, but it's an interesting design decision nonetheless.

[0] https://en.wikipedia.org/wiki/Nvi

[1] https://en.wikipedia.org/wiki/Berkeley_DB

I'm curious what JOE editor[1] uses. With its syntax highlighting turned off, that program can open really huge files much quicker than vi or emacs.

1. http://joe-editor.sourceforge.net/

JOE was written in the final days of expensive memory and was written so that it can edit files larger than memory. Even today this is sometimes useful: you can edit an 8 GB file on a 32-bit machine.

It uses a doubly linked list of gap buffers. Each gap buffer has a header and a 4K data page. The headers are always in memory, but the data pages can be swapped out to a file in /tmp. The memory usage limit is 32 MB. Possibly this is no longer a good idea - it's easily possible that you could have more RAM than /tmp space.

The header has the data page's offset in the swap file, the link pointers, the gap location and a count of the number of newlines in the gap buffer.

When a file is read in, the gap buffers are completely full. So read-in turns into a direct read of the file into memory (or into the swap file). The only thing it has to do is count the newlines in each 4K data page and generate the headers.

The newline count is to speed up seeks to specific line numbers. [A long-standing enhancement idea is to generate the newline count on demand and use mmap. This would allow the read-in to be a NOP - just demand-load the pages from the original file as needed and use copy-on-write when any change is made to preserve the original. But I'm also not sure it's a good idea to not take a snapshot of the original file - so this should probably be optional.]

JOE uses smart pointers to the edit buffer. Each pointer has the address of the header and a memory pointer to the data page (which is always swapped in if there is a pointer to it). The software virtual memory system has a reference count on each page. Each pointer holds a reference on the data page it's pointing to. If there is no pointer to a page, the reference count is zero, so it can be swapped out.

The other purpose of the smart pointers is to automatically stick to the text they are pointing to, even through insert and delete operations. So if you insert at one point in the file, any pointers to later locations are updated (including line number, byte offset, column number and memory offset).
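A toy version of that "marks stick to their text" behavior, using plain offsets instead of JOE's smart pointers (all names here are mine):

```python
# Every insertion shifts the marks that sit at or after the insertion
# point, so a mark keeps pointing at the same character it was set on.

class Buffer:
    def __init__(self, text=""):
        self.text = text
        self.marks = {}              # name -> byte offset

    def set_mark(self, name, offset):
        self.marks[name] = offset

    def insert(self, offset, s):
        self.text = self.text[:offset] + s + self.text[offset:]
        for name, pos in self.marks.items():
            if pos >= offset:        # mark sits after the edit: shift it
                self.marks[name] = pos + len(s)
```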

Guessing from your user name, are you one of the authors of JOE? That seems rather likely, and is just generally why HN is awesome. Thanks for that very well-written explanation.

Remarkably satisfying treatise. Thank you very much.

Thank you, that was really interesting.

One of the more interesting editors I've seen being worked on is edlib (https://github.com/neilbrown/edlib). The theory is that rather than doing a buffer-based approach as in Emacs, the text is actually just a representation of a file. The consequence is that you can have a presentation open as one representation and also be modifying the presentation as a different representation. It's very young (and is a toy project as well), but it's definitely a cool idea IMO.

Isn't this just MVC (model-view-controller), with multiple views?

Isn't an emacs buffer exactly that? Depending on what modes are turned on the way the file data is presented will also change.

For ex, how you can open a svg file, if you use the image mode you see the image representation otherwise you get the code.

Emacs buffers are copies of the representation, not actual representations. The author had a talk at a recent linux.conf.au (can't find the link right now), which explains it much better than I could.

Would love to see Notepad++ added to this, especially given its efficient handling of large text files.

Notepad++ uses Scintilla internally.

Does it still use an upstream Scintilla or a fork?

It's unlikely they replaced the editor data structure in a fork; that'd basically mean they couldn't use any patches.

> The buffer is initialized as a vector with 10,000 lines. If a file is loaded with more than 10,000 lines, or surpasses this limit while editing, it will add 1/10th of the amount of max lines to the buffer....This behavior only becomes useful after the 10,000 line mark, although a little wasteful. If say while typing we hit the 1,000,000 line mark, Moe will allocate space for 100,000 lines. This is not very resource efficient, unless you are someone who typically types an additional 100,000 lines after your millionth line.

This is incorrect. Intuitively, if you're using large files, you're likely to generate large changes in file sizes. Algorithmically, this is a very common approach. Collections in most languages double their size whenever the existing capacity is exceeded. This minimizes the number of times you need to reallocate lists of increasing size, and works nicely with the memory management systems of various operating systems.

It's also not resource efficient to need to change the size of the collection for every new line. I would suggest that a 10,000 line initial capacity and increasing by 10% on reallocation are appropriate compromises.
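The arithmetic behind that trade-off is easy to check. A quick sketch (my own, not Moe's code) counting how many reallocations each growth policy needs to reach a million lines from a 10,000-line start:

```python
# Count reallocations needed to reach 1,000,000 lines from a 10,000-line
# initial capacity under different growth policies.

def reallocs(target, start=10_000, grow=lambda c: c + c // 10):
    capacity, n = start, 0
    while capacity < target:
        capacity = grow(capacity)    # one reallocation per growth step
        n += 1
    return n

one_line = reallocs(1_000_000, grow=lambda c: c + 1)    # grow by one line
ten_pct  = reallocs(1_000_000)                          # Moe's +10% policy
doubling = reallocs(1_000_000, grow=lambda c: c * 2)    # typical vector
```

Growing one line at a time needs 990,000 reallocations, +10% needs about 50, and doubling needs just 7; any geometric factor keeps the amortized cost per appended line constant.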

It would be interesting to see something like this for "rich text" editors, too. Where it's not just text, but might include formatting instructions or even media. Would be especially nice if one could contrast something from the WordStar era with 90s OO tech (flyweight objects?) to whatever's currently in vogue (functional shenanigans?).

I feel that rich text is sadly something our trade has mostly just given up on. What's in vogue? Still HTML, which is IMO a really really sad state of affairs, since there's basically no way you can cleanly map between HTML and a wysiwyg editor. The simple question of "where is the cursor?" is almost impossible to answer in the general case.

There is a recent wealth of new Web-based rich text editors. ProseMirror, Basecamp's Trix, FastMail's Squire, Guardian's Scribe, Summernote…

Even old ones such as CKEditor plan on updating their engine: https://medium.com/content-uneditable/ckeditor-5-the-future-....

It's certainly good to see some development. I remain skeptical however, until we have an OSS editor available for production use that internally maps (and requires storage) to a sane data model (!=HTML).

I wonder what data structures Linus' favorite editor, MicroEMACS/uEmacs, uses. It's actually fairly hard to find the source code (all I could find is some ftp site, and I admit I was too lazy to fire up ftp and download the code).

EDIT: oh here is where it is: https://git.kernel.org/cgit/editors/uemacs/uemacs.git

And it looks like it's line-based like old vi (I think): https://git.kernel.org/cgit/editors/uemacs/uemacs.git/tree/l...

Interestingly there was a line limit per window based on a signed char (127) that just got changed to an int (that was the last change made).

Tcl/Tk's editor also used balanced trees.

I wonder what Wordpad and Notepad do? Notepad will fail on anything bigger than ~50MB but Wordpad will work on it and have a progress bar to show it hasn't completely hung.

Text editing on Windows still mostly sucks.

Notepad and Wordpad are more like demo applications for the Windows common edit control and Rich Text edit control respectively.

Hardly meant as serious editors. But they will do for simple tasks.

I believe it's one big linear buffer that gets resized as needed. The simplest solution, takes the least amount of space, and although it may seem quite inefficient to move essentially everything in the file after the character you insert, if you consider that a modern CPU can do memcpy()/memmove() at several GB per second, it doesn't seem like any delay should be perceptible. (I.e. any slowness is down to some other aspects of bad implementation.) Many of the old DOS text editors did the same, in a time when memory bandwidth was not much more than a few MB/s and files rarely over 1MB, and it worked well enough.
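In most high-level languages that approach is a one-liner; in Python, for instance, a `bytearray` splice does the same tail shift that memmove() does in C:

```python
# The whole "one big linear buffer" editor core in miniature: a splice
# into a bytearray shifts everything after the edit point in place.

def insert(buf, offset, data):
    buf[offset:offset] = data        # in-place tail shift, like memmove()

buf = bytearray(b"hello world")
insert(buf, 5, b",")                 # buf is now b"hello, world"
```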

I've opened 200MB+ files in Notepad before, it depends entirely on processor speed.

The problem with Notepad is that it basically just loads the contents of a file into a big Windows-stock text field. That component was never designed to be used as a full-fledged text editor.

aoeui uses a simple mmap for unmodified files and a gap buffer mmap'ed from a temp file for modified texts. This allows huge files to be viewable instantly. Gap buffers are awesome, BTW.

> b->nc += m;
> q0 += m;
> s += m;
> n -= m;

Why do people who write in C so often make it look like they have a shortage of letters to use for names?

They had a math/science background, so they wrote things like

  F = m * a
rather than

  force = mass * acceleration

It was written in the 80s when there was a shortage of letters.

A little anecdote: When I first learned programming I didn't know you could have long variable names. When I later found out you could, it felt like cheating.

I've written a data structure [1] for C++ to handle this problem. It uses counted B+ trees internally and is a drop-in replacement for C++14's std::vector.

[1] https://github.com/det/segmented_tree

The 8-bit Acorn BBC editor called Edit put the entire file in RAM: all the characters before the cursor at the start of RAM, all the characters after at the end. Quick for inserting, but there was a half-second delay if you jumped from the top to the bottom of the text.

This data structure is called a gap buffer; it's rather common in text editors because it's simple to implement and efficient in the most common use cases.

In the usual implementation the gap is only moved when you actually insert text, not when the cursor is moved, which makes it even more efficient.
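For the curious, a minimal gap-buffer sketch (my own illustration, not any particular editor's code): inserting at the gap is O(1) per character, and moving the gap costs a copy proportional to the distance moved.

```python
# Minimal gap buffer: one array holding the text with a movable hole
# ("gap") at the edit point. buf[start:end] is the gap.

class GapBuffer:
    def __init__(self, text="", gap=16):
        self.buf = list(text) + [None] * gap
        self.start = len(text)
        self.end = len(self.buf)

    def move_gap(self, offset):
        while self.start > offset:           # gap moves left: copy chars right
            self.start -= 1
            self.end -= 1
            self.buf[self.end] = self.buf[self.start]
        while self.start < offset:           # gap moves right: copy chars left
            self.buf[self.start] = self.buf[self.end]
            self.start += 1
            self.end += 1

    def insert(self, offset, s):
        self.move_gap(offset)                # O(distance), then O(1) per char
        for ch in s:
            if self.start == self.end:       # gap exhausted: double the buffer
                grow = [None] * len(self.buf)
                self.buf[self.end:self.end] = grow
                self.end += len(grow)
            self.buf[self.start] = ch
            self.start += 1

    def text(self):
        return "".join(self.buf[:self.start] + self.buf[self.end:])
```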

I'm making a code editor and chose a grid: a two-dimensional array, and a string. Then it can cross-check the string and array to see if there is a bug somewhere. This is of course not efficient, but it keeps me sane. The grid also makes it easy to render the text.

Is that a standard immutable string? Does this mean allocating a whole new string every time a character edit is made?

Yes, thank god for the garbage collector. Although this is JavaScript, so I see it as a mutable string.

I don't understand? Strings in JS are immutable. There's no way to mutate them.

Using strings like this will thrash your garbage collection, especially with long lines. An array, or more interesting data structure, might be more sensible.

(only pursuing this because I'm interested in your project!)

Sorry, I keep mixing up variables with types. I do not think about types when I program (JavaScript).

You might find my blog interesting: http://webtigerteam.com/johan/en/blog/editor2.htm http://webtigerteam.com/johan/en/blog/editor.htm

Surely not one allocation per key, that could be up to like 10 allocations per second.

You will not notice anything unless you are watching the memory graph. I see no reason for "optimization" at this time.

That's why I asked! (The answer was yes)

C/C++ still rocks.

Gnome Gedit is "interesting" when working with big files

2005-2016:"it takes too long to open big files" https://bugzilla.gnome.org/show_bug.cgi?id=172099

I needed this :)
