Text Editor: Data Structures (averylaird.com)
664 points by LaSombra on Oct 2, 2017 | 121 comments

I use ropes in xi editor. I did not find the argument against ropes convincing. Yes, they're not trivial to implement, but in a proper programming language you're not dealing with the data structure directly, you're always going through the interface, so you get the logic right once in the rope library implementation and then forget about it.

In a low-level design, your editing operations would be poking at the data structure directly. There, the simplicity of a gap buffer is a pretty big win. I agree in this environment ropes are too complicated. However, I don't see any good reason to architect a text editor in this way. Use abstractions.
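To make the "go through the interface" point concrete, here is a minimal rope sketch (illustrative only, not xi's actual implementation): internal nodes store the length of their left subtree as a "weight", so indexing descends the tree in O(log n), and concatenation is O(1) because it only allocates a new root.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Minimal rope sketch: internal nodes store the length of their left
// subtree ("weight"), so at(i) descends left or right in O(log n).
struct RopeNode {
    std::string leaf;                      // non-empty only for leaves
    std::shared_ptr<RopeNode> left, right;
    size_t weight = 0;                     // total length of left subtree

    size_t length() const {
        if (!left && !right) return leaf.size();
        return weight + (right ? right->length() : 0);
    }
    char at(size_t i) const {
        if (!left && !right) return leaf[i];
        if (i < weight) return left->at(i);
        return right->at(i - weight);
    }
};

// Concatenation is O(1): just make a new root over the two ropes.
std::shared_ptr<RopeNode> concat(std::shared_ptr<RopeNode> a,
                                 std::shared_ptr<RopeNode> b) {
    auto n = std::make_shared<RopeNode>();
    n->left = a;
    n->right = b;
    n->weight = a->length();
    return n;
}
```

Callers only ever see length(), at() and concat(); the balancing logic a production rope needs can be added behind that interface without touching any editing code.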

The linked article contains a factual error: the referenced Crowley paper does not consider ropes, so it cannot be cited in support of the claim that piece tables outperform ropes.

There's one other important concern with piece tables I didn't see addressed. It depends on the file contents on disk not changing. If your file system supported locking or the ability to get a read-only snapshot, this would be fine, but in practice most don't. It's very common, say, to checkout a different git branch while the file is open in the editor. Thus, the editor must store its own copy to avoid corruption. In the long term, I would like to see this solved by offering read-only access to files, but that's a deeper change that can be made piecewise.

I'd add that the assumptions that the specialized data structures (gap buffer and piece table) make are easily violated—someone mentioned search-and-replace further down—at which point I can imagine their performance degrading unless special measures, i.e. more complexity, are added.

So yeah, in my editor (CodeMirror) I'm also using a rope/tree style representation [1], because it's pleasantly general and easy to reason about.

[1]: http://marijnhaverbeke.nl/blog/codemirror-line-tree.html

mmap(...MAP_PRIVATE...) will protect you from changes made to the file on disk.

My reading of the man page[0] is the opposite; your changes won't be reflected to the file, but "It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region."

I'd love to see the other direction. One approach that I think is worth exploring is when the file contents correspond to a git hash that the editor has access to. In fact, there is a pretty close relationship between "deltas" and git commits, and this could possibly be exploited usefully.

[0]: http://man7.org/linux/man-pages/man2/mmap.2.html
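A short sketch of the semantics in question (Linux/POSIX; the helper name is mine): MAP_PRIVATE guarantees that your stores are not written back to the file, but it does not guarantee that later writes to the file stay invisible in pages you haven't touched yet, which is exactly the man-page caveat above.

```cpp
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// MAP_PRIVATE gives a copy-on-write mapping.
// Guaranteed: stores into the mapping are NOT written back to the file.
// Unspecified: whether writes to the file AFTER mmap() show up in
// pages you have not yet faulted in -- the problem discussed above.
char* map_file_private(const char* path, size_t len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    // PROT_WRITE is fine on a read-only fd because the mapping is private.
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after close
    return p == MAP_FAILED ? nullptr : static_cast<char*>(p);
}
```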

Wow, you're right, and the POSIX page indicates the same thing. I believe it used to work properly in Unix / SunOS / HP-UX / AIX (IBM) / Solaris.

I did a little digging, and don't think so. Being able to access a readonly snapshot is a property of the filesystem, it's a guarantee that if you overwrite file contents, the original data is still available. With btrfs and friends, you could create a "reflink", but these filesystems are not in very widespread use, and it's also not clear to me that it's an overall performance win.

No need for a fancy filesystem; when one process asks for MAP_PRIVATE (aka Copy-On-Write = COW) on a file-backed page, all other references to that page can be marked COW as well, and if anybody does a write to the page, a new copy of the page gets created by the OS, and the various processes' references get sorted out to point to one copy or the other. The OS already needs almost this much implemented to allow multiple explicit COW references to the same page, which is surely allowed in Linux and Posix.

That idea works when the page is swapped in, but I don't think you're accounting for the case where the only copy of the contents is in the file that's backing the mapped region.

I'm sure we're boring everybody, but yes, I am accounting for the case where the only copy is in the file. The OS knows that there are live references to the non-resident, file-backed page, marked as such in one or more process page tables. For any change to happen to the actual page on disk, from any program (in user space or not, and even if it thinks it's using write(2) rather than mmap()), the OS does the moral equivalent of mapping the page under the covers: the modification happens in memory, and the page eventually gets written back to the physical disk. It's all unified between write() and mmap() pages, and all the wonderful semantics of mmap and sharing apply here as well. I'm not sure that this goes all the way back to the original implementation of mmap(), but as you point out, it's a pretty useless observation given the unfortunate backsliding in the POSIX specification (perhaps some vendor didn't manage to get up to speed on the unified page scheme?)

All of this relies on the rest of the system playing well and only using POSIX rename() to replace files. So to do it safely, you need to make a swap file copy (à la vim) and do the paginated mmap() + page-replacement cache on the swap file.

However, oftentimes you also need to do encoding conversion, and you can do this at the same time as the copy to the swap file.

If you detect you have btrfs/xfs w/ reflink support, and no encoding change is necessary, you can cheat and just do a CoW of the underlying block ranges and let the FS garbage collect it afterwards (just unlink() your reflink copy immediately).

Another convenient thing you can do with the swap file, is use the tail of it for a write ahead log of the changes in a piecetable and crash recovery.

Believe me, I'm not bored.

"I agree in this environment ropes are too complicated. However, I don't see any good reason to architect a text editor in this way. Use abstractions."

Abstractions can cost you dearly. If your abstraction for indexing into your buffer moves from O(1) with a small constant to O(log(n)) with a larger one, that global replace using regular expressions can get a lot slower. Even a simple page down may get noticeably slow when at the end of a large file with long lines.

Something I have always wanted to try is using a high-level abstraction, say ropes, for representing some domain-specific data, where for some operations navigating the high-level data structure dominates the cost. What if one could do a linear scan through raw memory, collect the "hits", and map those back to handles in the high-level data structure?

Some hits will be false positives, where by dumb luck it looks like you have a match, so all hits have to be resolved against the user data.

It feels analogous to a bidirectional lens, but instead of being between high level data structures, it is using the underlying raw memory.

Then you've chosen the wrong abstraction.

Maybe I've misinterpreted the discussion, but there's not always a perfect abstraction [1].

1. https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...

The worst way to store and manipulate text is to use an array. Firstly, the entire file must be loaded into the array first, which raises issues with time and memory. Even worse still, every insertion and deletion requires each element in the array to be moved. There are more downsides, but already this method is clearly not practical. The array can be dismissed as an option rather quickly.

The authors of the dozens of tiny editors using this "structure", which were quite popular on the PC in the late 80s through early 90s, would disagree. A memcpy/memmove() runs at many GB/s on a typical machine today, so you would have to be editing absolutely huge files to notice. Even back then, memory bandwidth was a few MB/s --- still plenty fast, considering that the typical files of the time were also much smaller.

I can remember a time when I was attempting to write my own editor and at first spending a lot of time obsessing over the data structures (it was harder to find such information at the time) --- only to realise that a lot of the editors I'd tried, including the one I was using the most at the time --- were working perfectly well with just one big buffer.

I've opened files of a few hundred MB in Windows' Notepad, which also belongs to this family of editors; and on a machine a few years old, opening the file takes the longest because it has to be read into memory --- once it's opened, moving around and editing lines doesn't show much lag at all. "Worse is better", indeed.
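The "one big buffer" approach really is this simple: inserting a character is a single memmove of everything after the insertion point. A minimal sketch (the function name is mine):

```cpp
#include <cassert>
#include <cstring>

// The "one big array" editor: inserting a character at position pos
// shifts everything after it with a single memmove. len is the current
// text length; the caller guarantees the buffer has room for one more byte.
void buffer_insert(char* buf, size_t* len, size_t pos, char c) {
    memmove(buf + pos + 1, buf + pos, *len - pos);  // shift the tail right
    buf[pos] = c;
    ++*len;
}
```

At multiple GB/s of memory bandwidth, that memmove is the whole cost of an edit, which is why these editors felt fine on ordinary files.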

Emphatically: Yes! KISS.

Moby Dick, at its slender 752 pages, is 1.2MB of text. You can save the entire text to disk on every keystroke on just about any system today and keep up with typing just fine.

Assuming you are actually dealing with text in a text editor, you should be fine.

If you have 100MB+ files, chances are they aren't actually text.

Many times in my career I've had to open very large files (usually CSV data) and inspect/edit them by hand.

Of course there are other ways to do this work, but a text editor that can efficiently open and edit very large files (e.g. Vim) is a great tool for the job. I will put up with otherwise excellent code-specific editors/IDEs that cannot work on such files but my default general-purpose text editor is always going to be something that can.

I appreciate that the design and implementation of text editors is basically an art form (one I've dabbled in myself) in which simplicity has aesthetic value, but efficiency and flexibility are very important for an editor that's going to be used for real work.

Textual dumps of SQL databases also fall in this category. I remember having downloaded a huge sql dump. I couldn't open it in most editors because they all buffered into RAM. The file was about 13G in size and did not fit into the RAM of any machine I own. But in this case using ropes wouldn't have helped either, I guess.

Text editors should also take huge files into account and provide sequential reading from disk to memory. Even in Emacs, I couldn't work with the file. I ended up fixing it and importing it into PostgreSQL. And then I spent hours indexing the necessary fields :).

I believe you can use the vlf package (https://github.com/m00natic/vlfi) for dealing with large files in Emacs. I haven't used it myself, so I am not sure how stable it is.

Thanks for the pointer. I guess I could also have written a small program or script to go through the file. Regular fopen and fread should work. However the vlf package does exactly what I looked for. Only prerequisite I can see is that one must compile Emacs with bignum support on 32bit systems to read files greater than 512M.

Log files..

Fair point, but do you actually edit log files in a text editor?

Because if you're just viewing, the "array of characters" representation is going to beat just about any other hands down, especially if you just mmap() the whole thing.

True. Generally no, but sometimes I will insert several empty lines around a particular line of interest..

I've used vi where I meant the pager quite often. I was late in making the use of head, tail, grep, wc, du, stat, etc. to peek into data part of my muscle memory.

Yes. Occasionally I'll open a 100MB+, or even a GB, log file without realizing it, and it will take vim many seconds to give me control.

I am dealing with gigabyte sized log files almost every day.

Why not send them into a log aggregator for search and long term storage and logrotate on the server so you don't get this problem? Hell, why not logrotate anyway and keep multiple smaller files?

Sure. My response was to the comment that 100MB+ files are most likely not text. Like others I just tried to express that there are plenty of use cases for a text editor that can handle very large files. It's just a very practical thing.

I was consulting on a project the other day; turned out they never rotated their main Apache log file, and after 5 years it's now 32GB.

Are you sure the editors you're referring to weren't using gap buffers? The logic for a gap buffer is hardly any more complicated, and the performance benefit is significant.
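For reference, the gap buffer logic really is only a few lines more than the plain array: the text lives in one array with a movable "gap" at the cursor, insertions fill the gap, and cursor motion slides it. A minimal sketch (class name and sizes are mine):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Minimal gap buffer: [gap_start_, gap_end_) is the free region.
// Inserting at the cursor is O(1); moving the cursor by k costs O(k).
class GapBuffer {
    std::vector<char> buf_;
    size_t gap_start_ = 0, gap_end_ = 0;
public:
    explicit GapBuffer(size_t capacity = 64)
        : buf_(capacity), gap_end_(capacity) {}

    void move_cursor(size_t pos) {      // pos in [0, size()]
        while (gap_start_ > pos)        // slide the gap left
            buf_[--gap_end_] = buf_[--gap_start_];
        while (gap_start_ < pos)        // slide the gap right
            buf_[gap_start_++] = buf_[gap_end_++];
    }
    void insert(char c) {
        if (gap_start_ == gap_end_) grow();
        buf_[gap_start_++] = c;
    }
    void erase() {                      // delete before cursor: widen the gap
        if (gap_start_ > 0) --gap_start_;
    }
    size_t size() const { return buf_.size() - (gap_end_ - gap_start_); }
    std::string str() const {
        return std::string(buf_.begin(), buf_.begin() + gap_start_) +
               std::string(buf_.begin() + gap_end_, buf_.end());
    }
private:
    void grow() {                       // double capacity, keep both halves
        std::vector<char> bigger(buf_.size() * 2);
        std::copy(buf_.begin(), buf_.begin() + gap_start_, bigger.begin());
        size_t tail = buf_.size() - gap_end_;
        std::copy(buf_.begin() + gap_end_, buf_.end(), bigger.end() - tail);
        gap_end_ = bigger.size() - tail;
        buf_ = std::move(bigger);
    }
};
```

The performance win is that a burst of typing at one spot costs O(1) per keystroke, instead of one whole-buffer memmove per keystroke.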

Back when I first (and last) had to write my own text editor, 30 years ago, memory was at a premium, so I did what everybody else in the old Commodore days did: use the screen buffer, a paltry 40*25=1000 bytes, as additional memory.

When you first loaded a file, it would go straight to the screen buffer, B. Well, almost straight: you had to move to a new line when a CR character was seen. The rest of the file would go into buffer C, the region at the end of the free memory space.

As you scrolled down, the first line would go into buffer A, the region at the beginning of the heap, minus trailing whitespace. And so on.

In hindsight, this is a variant of gap buffers. If you edit the first line in a file, changes stay within the current paragraph on screen, until the paragraph needs to take up an extra line. Even then, though, the line at the bottom that scrolls off the screen only needs to be prepended to C, which, again, is at the end of the memory map and thus grows backwards.

The point is that each insertion at the start of the file would memcopy the entire buffer, though. With a few hundred MB file that sounds slow even today.

Are you sure that they don't keep a buffer of 10k lines or a gap buffer?

> With a few hundred mb file that sounds slow even today.

That depends upon how fast one can perform the memcopy of the entire buffer.

It has been said that latencies of 100ms are perceived as instantaneous (https://www.pubnub.com/blog/2015-02-09-how-fast-is-realtime-...).

So if we take 100ms as our latency budget, then the max file size we can 'edit' with a straightforward memcopy (i.e., no gap buffer) becomes how much data the CPU can move in memory in 100ms.

With a modern CPU, moving on the order of 5-7 GiB/sec (http://jamesslocum.com/post/64209577678) a simplistic memcopy could handle editing the start of a 512MiB to 716MiB file and still feel 'instantaneous'.

With the much smaller file sizes of yesteryear (someone did mention DOS; many of the old DOS editors had a 64KiB max file size limit), editing a 64KiB file within a 100ms latency budget only required the CPU to be capable of moving memory at 640KiB/sec. On an 8088 IBM PC at 4.77MHz with a 4-cycle latency per byte (https://trixter.oldskool.org/2013/01/10/optimizing-for-the-8...), the rep movsb instruction should achieve a maximum of 1164KiB/sec memory move speed: more than enough headroom to make editing a 64KiB file appear instant, even using the basic memcopy method. Even if we assume 8 cycles per byte (4 to read, 4 to write), we still get 582KiB/sec of memcopy performance, enough that editing would appear instant on all files smaller than 58KiB.
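The arithmetic above boils down to one formula (the function name is mine): the largest file whose full memmove still fits the latency budget is simply budget times bandwidth.

```cpp
#include <cassert>
#include <cstddef>

// Largest file size (bytes) whose whole-buffer move fits in the given
// latency budget at the given memory bandwidth: budget_s * bytes_per_s.
constexpr size_t max_editable_bytes(double budget_s, double bytes_per_s) {
    return static_cast<size_t>(budget_s * bytes_per_s);
}
```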

> It has been said that latency amounts of 100ms are perceived as instantenous (https://www.pubnub.com/blog/2015-02-09-how-fast-is-realtime-...).

Yes, but that was 50 years ago(!!!) and technology + expectations have moved on considerably since then, which that article explores. Thus my takeaway from that article is that 100ms is very perceivable and thus simply not fast enough for some applications.

For what it's worth, the article correlates with my own personal anecdote as well.

Going back to the original point, 100ms might still be good enough for a text editor, generally speaking. But going back to my own anecdote, I do find it highly frustrating when running into lag on a text editor when I'm banging out quick edits when either in the zone or working to a tight deadline. Frustrating enough that I have stopped using some text editors because of it.

If you are talking about typing, then I agree: 100ms lag after every keystroke would be noticeable and annoying. But edits?

Surely you are not claiming to be doing 10 edits a second, where you might need to start complaining about 100ms lag?

I'm not really sure how you've interpreted my claim but the context of this discussion was about holding the text as an unbuffered array. This would mean every keystroke is effectively an edit. But even that aside, I'm sure we've all done stuff like:

    type "foobar\n" -> select "foobar\n" -> copy
...or the equivalent with replacing text within blocks of the same document; or other repetitive textual tasks that could easily be macroed if it were a job you'd need to repeat again at a future point in time.

> A response time of 100ms is perceived as instantaneous.

please no :( my eyes hurt just from the thought of it. 10ms at most and I'm sure some people can perceive even faster stuff.

You're conflating two different things.

A typical screen refreshes at 60Hz, giving 16ms between refreshes. If you take more than 16ms between animation frames, it will appear choppy and "bad".

The 100ms budget is around pressing a button and seeing a response. If you pressed your refresh button and the page loaded in 100ms, you consider that sufficiently "instantaneous". (This breaks down for video game controls where you're continuously giving input. And maybe 100ms is too long to respond to someone typing at 70WPM, but expecting a 10ms response is much too greedy.)

Is it, though? IntelliJ, a full-blown IDE, gets below it: https://pavelfatin.com/typing-with-pleasure/

I don’t see the logic behind having a full 60fps in a 3d game but not in a textarea (even though the linked article explains why it’s difficult).

xxxdarrenxxx's dead comment is long and relevant. If you don't have dead comments turned on, it might be worthwhile to enable them.

your eyeballs have more than 10ms latency iirc

Even if there was a minute latency between our eyes and brain, we would still be able to perceive the same sub-second differences we can now. We would just be too late to react to them.

Even if 100ms delay wasn’t noticeable, my peak writing speed is over 10 characters per second. Every character which comes less than 100ms after the previous one adds some delay, and I can imagine you end up writing for a bit, then stop and wait as the computer types 10 characters per second trying to catch up with you.

The limit probably has more to do with typing speed, doesn't it? 100wpm at 5.1 characters per word is 8.5 characters per second.

~5ms is a more realistic budget given the 60-120 Hz refresh rate of most displays.

I take issue with the 100ms statement, but perhaps I'm missing something.

My hands automatically tend to play ahead of the beat when I play a piano piece on a MIDI keyboard with anything more than 12ms of latency.

Doesn't matter what I think I perceive, reality is my hands (muscle memory) intuitively perceive lag and begin to (measurably) hit notes just before the beat.

Likewise when I play a multiplayer (pvp) game, 100ms is considered even for the average gamer to be slow and annoying and is very much perceivable.

I like it no more than 25ms, and I have the luxury to get 7ms in best case scenario over fiber and ethernet. I can anticipate for the lag sure, but I feel it, which in this context is perceiving.

I am very confident I do not have to look at a gauge to perceive the lag, and can easily perceive it in a blind test around the 10 - 20ms response times.

I hit just one key and I notice if I lag or not. My ears are finetuned to expect the responding note to sound with at least 10ms accuracy within the resulting timeframe.

> A memcpy/memove() runs at many GB/s on a typical machine today, so you would have to be editing absolutely huge files to notice.

That's only good if your text happens to be raw text; I guess it won't work so well if it's rich text which needs to be converted to lightweight objects before being processed, where you'll want every object to be memory-aligned. And I wonder how well that approach works if your virtual memory space is paginated, which wasn't common in the 90s but is practically everywhere nowadays.

I wrote a wysiwyg, memory-based rich text editor from scratch on the Mac back in the 90s called Scorpio. It had support for fonts, colors, styles, inline pictures, etc. It used a single buffer for text and I had no problem with insert/delete into this buffer even back then, even with ginormous amounts of text.

The "trick" (not really a trick, but..) was to use external data structures for the style runs that had offsets into the text array for styles/rich object support. That way I didn't need to worry about object alignment or whatever.

The downside of that approach was that there was bookkeeping to do in these external structures when text was inserted/deleted (indexes would need to be increased/decreased by a fixed amount, etc). The other major downside was that multi-level undo became problematic. It was easy enough to do one level of undo, which I did.

BTW, writing a word processor/text editor from scratch is a blast. I highly recommend it!

In my first attempt at an editor, written in an amazingly ugly (Borland) C/C++ mix, I assumed text was always delimited with some form of newline. I loaded the initial file into memory completely, and then constructed an array of "line" structs, each containing a pointer to the start of the line, a length, and a pointer to the 'previous' version. If this was NULL, it was the initially loaded buffer.

The editing of a line itself was far from optimal but straightforward: it involved allocating a new struct, filling it in, pointing the previous-version field at the currently active version when the undo mechanism was enabled (otherwise at the previous version of the last version, which should be the initial buffer), swapping the line pointer in the line array, and freeing the old buffer/struct. Inserting a new line involved a memmove of the "line" array and an insert, and worst case a realloc of that array if it was too small.

I never really finished it, but I had a primitive undo mechanism in place that had issues with line inserts, which I never ironed out (the project was abandoned at that point). It wasn't very memory-friendly when I enabled this undo mechanism, since it caused the entire history of every line to be kept indefinitely, and every single-character insert or delete did a full copy of that line. It caused memory issues even when creating relatively small files; certainly when starting from scratch, the memory bloat was incredible ^^

Still it was a fun project for me as a 16-year old kid.

> In my first attempt at an editor, written in an amazingly ugly (Borland) C/C++ mix, I assumed text was always delimited with some form of newline. I loaded the initial file into memory completely, and then constructed an array of "line" structs, each containing a pointer to the start of the line, a length, and a pointer to the 'previous' version. If this was NULL, it was the initially loaded buffer.

Something like this is how various vi implementations, and even some older (pre-GNU) versions of Emacs, work.

So 16-year-old you was not far off the mark.

Y'all should have a look at ewig, which uses Immer: really fast persistent data structures for C++. Ewig is just a proof of concept, but the code is really nice. I have thought about taking that and building something bigger.

The upsides are the same as for a piece table (really simple undo/redo) but with the downside of not being able to just mmap a file. You also get basically zero memory usage when you do cut and paste (you can paste a file into itself until it is bigger than RAM without problems, since you are actually not copying the contents, just the pointer)

Look at the YouTube video as well. It is all very cool, at least if you are not already spoiled by using clojure :)


Edit: Forgot to mention: Ewig can be found among Immer's author Arximboldi's repos. On my phone right now on GPRS connection, so maybe another friendly soul can provide the link.

Thanks a lot for mentioning it! :-)

Ewig uses RRB-Trees (Relaxed Radix Balanced Trees), which, like ropes, are confluent (support fast concatenation) but have very stable bounds otherwise, similar to a vector-like type.

EDIT: The parent mentions a video; I guess it is the CppNow talk: https://www.youtube.com/watch?v=ZsryQp0UAC8 Last week I did another version of that talk at CppCon (with slightly deeper coverage of Ewig) but I don't think it is on YouTube yet.

I think immer is brilliant. I just saw that you have guile bindings, which makes it even more exciting!

I like that they are lgpl as well. I am a believer in that kind of freedom definition, and I hope you can dual-license it successfully.

Atom uses a piece-table-inspired data structure to represent text[0]. We store the file's original contents in a single contiguous buffer. All of the user's unsaved edits are then stored in a separate mutable structure called a Patch, which we represent as a splay tree. When we need to read from the buffer in a background thread, we can 'freeze' the patch, and store any subsequent writes in a new patch which is layered on top of the previous one.

Even though we do load the entire file into memory (as opposed to mmap-ing the file), the piece table design is still very useful. It makes it very cheap to compute the buffer's current unsaved change set, which is a value that we periodically serialize for crash recovery purposes (similar to vim's `.swp` files).

It's also just a very compact way to store a large chunk of text, which is good from a cache-locality and memory usage perspective.

[0] https://github.com/atom/superstring/blob/master/src/core/tex...
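For anyone unfamiliar with the shape of this structure, here is a minimal piece-table sketch (illustrative only; Atom's real implementation, linked above, uses a splay tree of edits rather than a flat piece list). The original text is never modified, typed text goes into an append-only "add" buffer, and the document is a list of (which buffer, offset, length) pieces.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal piece table: the document is a sequence of pieces, each
// pointing into either the immutable original buffer or the
// append-only add buffer. An insert splits at most one piece.
class PieceTable {
    struct Piece { bool added; size_t off, len; };
    std::string original_, add_;
    std::vector<Piece> pieces_;
public:
    explicit PieceTable(std::string original)
        : original_(std::move(original)) {
        if (!original_.empty())
            pieces_.push_back({false, 0, original_.size()});
    }
    void insert(size_t pos, const std::string& text) {
        Piece np{true, add_.size(), text.size()};
        add_ += text;                       // append-only
        size_t i = 0, at = 0;
        while (i < pieces_.size() && at + pieces_[i].len < pos)
            at += pieces_[i++].len;         // find the piece containing pos
        if (i == pieces_.size()) { pieces_.push_back(np); return; }
        Piece cur = pieces_[i];
        size_t split = pos - at;            // split it around the insert
        std::vector<Piece> repl;
        if (split > 0) repl.push_back({cur.added, cur.off, split});
        repl.push_back(np);
        if (split < cur.len)
            repl.push_back({cur.added, cur.off + split, cur.len - split});
        pieces_.erase(pieces_.begin() + i);
        pieces_.insert(pieces_.begin() + i, repl.begin(), repl.end());
    }
    std::string str() const {
        std::string out;
        for (const Piece& p : pieces_)
            out += (p.added ? add_ : original_).substr(p.off, p.len);
        return out;
    }
};
```

The unsaved change set mentioned above falls straight out of this representation: it is exactly the set of pieces that point into the add buffer.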

I've been doing the same thing (implementing a text editor for fun) and settled on a data structure which is a hybrid rope+2-3 tree+B+tree+gap buffer. I don't know what to call it but maybe someone here has seen it before.

Basically the idea is, the inner nodes are a 2-3 tree which track the size of each child like a rope does. The leaves are gap buffers of a fixed size (a few cache lines), which, when full, split into two like leaves of a B+tree. So you get the dense-ish packing of a gap buffer with the O(log n) performance guarantee of a 2-3 tree, while avoiding the potential linear copies associated with ropes and gap buffers.

(I know this is overkill for a toy text editor but it wouldn't be a side project if it weren't!)

I did something like this recently too. A b+tree w/ linked leaves containing piecetables in the leaf nodes. To avoid the memmove() costs it uses the right-edge of the leaf to keep a sorting queue so the operations are still O(1) inside the leaf.

I still need to finish up deletions, but early performance numbers are showing similar or better results than the RBTree approach while being much more cacheline/allocation friendly.

Ultimately, I want to use this in a replacement for GtkTextBuffer/GtkTextView in gtk 4.x.


Beyond just loading, editing and saving text, how do these data structures compare when it comes to implementing other "tricky" features, such as:

1. Multi-line regular expressions. Typically the regex library is given an array of text to search in. We have no such buffer to give.

2. Line wrap. Adding or removing a single character near the top of the file might require the entire file to be re-wrapped. You need to complete the wrapping process before you know how many lines are in the buffer and therefore can update the size and position of the vertical scroll bar's handle.

3. Column mode editing of text containing tabs.

Beyond those features, these data structures (gap buffer and piece table) don't seem well suited to operations that affect the entire buffer, such as convert-tabs-to-spaces.

At least when I used PCRE, it seemed to require this.

1. All of them could probably easily create a bi-directional or markable stream of characters, which should be enough for the regex engine to work with.

2. That's a display issue. I would suspect that line-wrapping would want its own data structure within the display system.

3. I think this would fall under the same note as (2).

Basically, a layered approach like (display + editing mode) / (buffer) / (array or ropes or pieces or whatever). Because the editing mode would effect the edits being made to the buffer, which would then be translated into whatever underlying data structure is actually storing the file.

Reasonable points. As for (1), I couldn't find how to get PCRE to work with a stream like that. But that's either my fault or PCRE, not the text editor data structure.

As for 2, I'd just say that I think these are much harder than the problem that the linked article discusses. There have been other articles on the same topic on Hacker News recently. They all seem to focus on solving the trivial problem and ignoring the difficult ones. I mean, a std::list<std::string> is fine for the main data structure if you just want to load, edit and save and don't require (2) or (3).

I've also never seen a PCRE that actually works with such a stream, though it would be possible to construct one.

For your remaining points, sure. I don't know enough about text processing to know where the hard problems are. I do think that engineers sometimes overlook the possibility of having multiple data structures for multiple use cases for a single set of data, because it doesn't seem as elegant as one magical data structure that does it all.

PCRE's partial match[0] API allows it to be used on non-contiguous data structures. We use this feature in Atom[1] to allow for fast regex search in a background thread.

[0] http://www.pcre.org/original/doc/html/pcrepartial.html#SEC2

[1] https://github.com/atom/superstring/blob/ed57b08a74220dd33e4...

>...(gap buffer and piece table) don't seem well suited to operations that affect the entire buffer

With a gap buffer, you can just move the gap to the beginning (or end -- whichever is closer) of the buffer, and you'll have the "entire buffer" ready for one of those operations.

For (1) you can just convert your whole file to a contiguous array if necessary, then run your library, and if any changes happened, convert back.

Had an assignment in functional programming which was to implement a text editor; an interesting data structure was provided.

Buffer of X is a BackwardsList of X, Cursor X, ForwardsList of X

So if you make Line a Buffer of char and your document a Buffer of Line you can easily insert and remove characters on a line or lines in the document.

If you wanted to move the cursor forwards you could just place the cursor onto the head of the backwards list behind you and take the head off the forwards list in front of you to be your cursor.
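The character-level half of that Buffer can be sketched with two stacks (a list zipper; the struct name is mine, and I've modeled Buffer-of-char rather than Buffer-of-Line): the text behind the cursor in one stack, the text in front in another, so cursor motion and edits at the cursor are O(1).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Zipper buffer: back_ holds text before the cursor (top = nearest),
// front_ holds text after it. All operations at the cursor are O(1).
struct ZipperBuffer {
    std::vector<char> back_, front_;

    void insert(char c) { back_.push_back(c); }
    void remove() { if (!back_.empty()) back_.pop_back(); }
    void forward() {   // move cursor right by one character
        if (!front_.empty()) {
            back_.push_back(front_.back());
            front_.pop_back();
        }
    }
    void backward() {  // move cursor left by one character
        if (!back_.empty()) {
            front_.push_back(back_.back());
            back_.pop_back();
        }
    }
    std::string str() const {
        std::string s(back_.begin(), back_.end());
        s.append(front_.rbegin(), front_.rend());  // front_ is stored reversed
        return s;
    }
};
```

As noted below, this is conceptually a gap buffer: the "gap" is simply the unused capacity between the tops of the two stacks.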

Such a structure is called a Zipper, for anyone wanting to look into it further.

In this case it's a Zipper for a list. You can write zippers for other data structures as well (and in fact you can mechanically derive them, even).

This is conceptually the same as a Gap Buffer, it just uses a different memory layout.

They were trying to get you to learn ZipLists, I imagine.

In Haskell a ZipList has not much to do with a zipper, if memory serves right. It's just a newtype wrapper around a normal list to get a different Applicative instance.

Aka a "zipper".

> They should! Why aren’t they!?!?! Somebody needs to make that a markdown extension. Every time you want to insert an indexed footnote, you type [^#]

Probably because there's no elegant way to support all the potential edge cases of that default behavior. For instance, if you introduced a reference with that format, you can't refer to it later - if you use the static number and it changes, you're now pointing to the wrong reference. So it reduces to using names, and the hassle of coming up with a name for each new link or footnote reduces to just using numbers - so the only thing that needs to be supported is manually named or numbered items.

The author correctly describes the ideal solution - a plugin that replaces unnamed links before saving or such - but likely fails to understand why that (as opposed to adding behavioral cruft to a markup language) is the correct level of abstraction for such a solution. Imagine if a project like wikipedia was riddled with the ambiguity of dozens of people's various attempts to wrangle the autonumbering to their writing.

Interestingly (and IMHO unsurprisingly) org-mode does this correctly[0]. Footnotes are identified by 'fn:' and a unique token (of course, numbers are unique tokens); they can be easily inserted with C-c C-x f, and can also be renumbered — including normalisation.

I'm reminded of this discussion from a week and a half ago: https://news.ycombinator.com/item?id=15321850

IMHO org-mode is, as the original author puts it, 'one of the most reasonable markup languages to use for text.' I suggest that rather than trying to improve Markdown, folks just use org-mode instead.

[0] http://orgmode.org/manual/Footnotes.html

I still think that org-mode and Markdown have quite different targets.

org-mode seeks to be organized and manipulable, whereas Markdown seeks to be readable, and parsable.

I had a technical interview once where the interviewer asked me to create a data structure for a text editor written for a 1980s computer with extremely slow and expensive I/O and almost no ram. I ended up with a variant of the piece table and afterwards he told me that’s how Word worked (he used to work at Microsoft).

That reminds me of this tutorial for writing a text editor (in win32 land): http://www.catch22.net/tuts/piece-chains

Out of curiosity, what was the job position?

I know the conventional wisdom is that you can't use an array, because insertion is O(n). But how big a file can you insert a character into with that O(n) algorithm before it takes more than one refresh time (16 ms)? From the following test program, the file can be 150 MB of text (MBP 2015). I've never had a source file, even a generated one, that long. So I claim that this wisdom is obsolete.

  #include <stdlib.h>
  #include <stdio.h>
  #include <string>
  #include <chrono>
  using namespace std;
  int main(int argc, char **argv) {
    size_t size = (size_t)atol(argv[1]);
    string editor_buffer(size, 'x');
    auto t0 = chrono::steady_clock::now();
    for (int iter = 0; iter < 100; iter++)
      editor_buffer.insert(5, "foo");
    auto ms = chrono::duration<double, milli>(
        chrono::steady_clock::now() - t0).count();
    printf("%.2f ms per insert\n", ms / 100);
    return 0;
  }

More abstractly, I enjoyed a description of a text editor in a Categories for Computer Science class I took waay back.

It's two stacks head to head. Moving the cursor is popping from one and pushing onto the other. All the other operations were given too. There were the category diagrams and everything -- the meat of which I've forgotten, but I might be able to find it if anyone is interested.

I'd be interested. Using two stacks is a nifty technique.

That technique sounds analogous to using a zipper to traverse a list (I've found a decent looking link for the technique if you're interested - https://ferd.ca/yet-another-article-on-zippers.html)

This is the book. The preview is limited though and I can't find the editor reference directly.

Categories and Computer Science R.F.C Walters, 1992 https://books.google.com.au/books?id=FurEQgAACAAJ&dq=edition...

One thing I think about a lot in these situations is how constraints that come in at the high level can end up influencing your low-level design. I've spent the last couple of months writing a CodeMirror replacement for an in-house DSL IDE (in JavaScript).

My particular use case required that I parse whatever was written in the editor in a few ways: a lexer pass for syntax highlighting and a full parse for semantic feedback (much of which is provided somewhat unpredictably from AJAX calls, for various reasons). And I knew that touching the DOM (which I'd have to do to keep the syntax highlighting/semantic feedback up to date) is typically going to be a lot slower than doing a few thousand loop comparisons in JS.

I looked into structures like ropes and whatnot that would enable fast edits, but I realized that in the end I wasn't going to do better than linear in the worst case, for a few reasons, e.g.: 1. Suppose the user enters an open quote at the first character: every other character just flipped from outside a quote to inside or vice versa, and I'm going to have to re-parse the whole document to update my semantic analysis (and syntax highlighting, though that would only require a partial re-parse). 2. Edits that change the line number associated with errors are going to require me to update that display, and you can make a document with a number of errors that grows linearly, so entering a newline at the first character was potentially linear...

So I ended up just using a doubly linked list of tokens (as returned by the lexer) and re-lexing tiny sections around the cursor when the user enters text. Collectively the whole thing is still linear, but it saves a ton of work by re-using the same structure when I'm doing parsing and semantic analysis. Doing it this way let me do my own DOM reconciliation and update the absolute minimum number of DOM nodes for every edit, which ultimately has a huge effect on performance because touching the DOM is expensive. And one can set this up so that (at least from the lexer's perspective) undo involves just snipping the old middle segments of the list back into place.

So it turned out that higher level requirements (eventually having to actually parse the text and update the display in various ways) made it so I wouldn't really save any work optimizing at the low level.

The most costly operation is usually showing the text on the screen, so you want to optimize the data structure for that. Although it depends on what level you are programming on. Updating arrays is not slow (unless the file is huge) and can be done in parallel.

The other day, I was implementing a Piece Table in Haskell, and realized a "Piece" is a "Monoid", and a "Piece Table" (which I call a "Buffer") is just a list of "Pieces".

    -- A Piece is a String, start, and end.
    data Piece a = Piece { list :: [a] -- 'a' will be 'Char' later, but we can support any type here.
                         , start :: Int
                         , size :: Int
                         } deriving (Show)

    -- Pieces are Monoids, like Lists, Trees, etc.
    --  note that the type declarations here are not allowed (Monoid defines them already). I put them there for your leisure.
    instance Monoid (Piece a) where
        -- mempty is an empty Piece.
        mempty :: Piece
        mempty = Piece mempty 0 0
        -- mconcat takes a list of Pieces, and condenses them into one Piece.
        mconcat :: [Piece] -> Piece
        mconcat = foldl mappend mempty
        -- mappend takes two Pieces, and puts them together, resulting in one Piece.
        mappend :: Piece -> Piece -> Piece
        mappend (Piece firstList firstStart firstSize)
                (Piece nextList nextStart nextSize) =
                    let start = fst $ splitAt nextStart firstList
                        middle = nextList
                        end = snd $ splitAt (nextStart + nextSize) firstList
                    in Piece (concat [start, middle, end]) firstStart firstSize

    type Buffer = [Piece Char]
Now we can implement some pure manipulations:

    -- replace creates a new Piece, and appends it to the Buffer.
    replace :: String -> Int -> Int -> Buffer -> Buffer
    replace text from to buffer = buffer ++ [(Piece text from to)]

    insert :: String -> Int -> Buffer -> Buffer
    insert text at = replace text at 0

    delete :: Int -> Int -> Buffer -> Buffer
    delete from to = replace "" from to
...and all that is left is to get the text from the buffer

    bufferText :: Buffer -> String
    -- mconcat folds our Pieces together, resulting in one Piece
    -- list gets the Piece's "list" (String = [Char] in Haskell)
    bufferText = list . mconcat
Now to use it, all we need to do is string our manipulations together:

    >>> buffer = replace "Hello" 0 7
               $ insert ", world!" 7
               $ insert "Goodbye" 0
               $ mempty

    >>> bufferText buffer
    "Hello, world!"

One advantage that is obvious now is that "Char" isn't specified until we define "Buffer", meaning a "Piece" can deal with any datatype, so you can easily get your Piece Table to deal with any character format from ASCII, UTF-8, UTF-16, etc. to raw bytes, Hex values, etc. It's even fairly obvious how to merge buffers that use different character formats.

The strongest point I see regarding Piece Tables is that Pieces are so discrete. You can define a Piece to be whatever you want it to be, as long as you implement "mappend" for it, or you can put the Piece in a box with other data, and make your buffer a list of boxes, and define "mappend" for the rest of the "box"'s data.

Another fairly obvious implementation is an undo/redo tree (like Vim has): Instead of Buffer being a List of Pieces, it can be a Tree of Pieces. The buffer is merged by merging the leftmost leaves of the tree, and "undo" is done by swapping the last left Leaf with an empty Leaf. Redo is done by rotating the last left node's leaves.

A final note regarding the above implementation: This simple mappend can only merge Pieces in order. If a Piece tries to edit part of another Piece that doesn't exist (out of bounds), it will be appended, or prepended without any filler, because that is how "splitAt" handles edge cases (rather than being implemented with Maybe). It wouldn't be too difficult to implement "mappend" for these edge cases, but you would need to decide on a filler character like space, or you would have to implement some kind of lazy merge that just keeps the second Piece around until it can be merged.

If you use persistent data structures in your implementation (like would be the default in Haskell), your data structures themselves don't need to know about undo/redo. Just keep references to the old state around (in an undo tree or a linear structure for simplicity), and just switch back to them if necessary. Don't explicitly manipulate your data structures. Sharing will take care of keeping the overheads low.

> Don't explicitly manipulate your data structures. Sharing will take care of keeping the overheads low.

Immutability is one of the strong points of Piece Tables.

Don't forget that this is the most simple implementation of a Piece Table. If you are in any way concerned about performance, you can quite easily implement caching, sharing, etc.

I suspect we are in violent agreement. I was just taking aim at

> [...] "undo" is done by swapping the last left Leaf with an empty Leaf. Redo is done by rotating the last left node's leaves.

Which is more complicated than it needs to be in an immutable setting to implement undo/redo.

Back in the mid 90s I created a shareware text editor that ran on Windows 3.1 (on a 80386 machine) and was written in C++ using pure Win16.

At that time I used a double link list like structure with a line cursor to help with list traversal.

While this was not one of the formal text editor design patterns, I found that pattern worked very well.

Maybe it only worked well because you never encountered a file consisting of one very long line? Back then, most somewhat human-readable data formats were line-based. Lines were also the de-facto unit of work for stream processing, so there was an implicit understanding that a large file without newlines would not be processable as text, even if every part of that line was human-readable. This only changed once XML went big, which made most newlines a convenience that is frequently omitted in machine-to-machine communication, a paradigm shift that almost all younger formats followed. Now those m2m versions of optionally line-broken data occasionally hit our editors and make the limitations of lines as the basic unit of text editors painfully noticeable.

That being said, I have often seen otherwise reliable editors choke on very long lines, so your former self would still be in good company, even two decades later.

> Maybe it only worked well because you never encountered a file consisting of one very long line?

You're correct that very long lines are hard to handle, but I'm not sure this is because of the internal data structure used.

In this case the internal data structure allocated a line buffer at every node of the double linked list and even on that Win16 environment that allowed for a line size of up to 64 kBytes (i.e. 2^16) in length.

However, to handle tabs correctly the editor is forced to recalculate the column position on every user event and since that calculation has to consider the entire line, the time needed to do this column calculation grows with line length.

So the editor could handle long lines, but as line lengths approached 500 characters the speed became noticeably slow.

You could approximate the tab handling with some fast method at first, and asynchronously run the proper calculation to catch up during a pause in user input a few seconds later.

Interesting comment on lobste.rs to explain why the piece table is rarely superior to the gap buffer: https://lobste.rs/s/xpab69/text_editor_data_structures

If you're interested in this kind of structure it's worth taking a look at 'transclusion' devised by Ted Nelson in the mid 60's https://en.wikipedia.org/wiki/Transclusion

I'm always amazed that people are still working on new text editors. When I have an idea for an app and I see there's even only one alternative I usually loose all motivation for coding this app (until I realize the alternative doesn't fulfill my need).

I had the same thought first but kept on reading. The topic, which data structures should an editor employ, is actually quite interesting and well explained.

I am torn between this feeling, and the opposite. There are some great projects that seem to have very similar goals to mine, but I still yearn for a truly modular editor, and am not sure if I am better off working from scratch, or benefiting from others' work. Either way, it's a learning experience, and - to me - quite fun. Text editing is a very complex, yet approachable problem.

I'm certain that you've looked into it (if you're implementing your own), but what about Emacs is a bad fit for your goals? From what I can tell, it is very modular. If you're going to build something heavily modular, but different from Emacs, I'd be rather interested in what you end up with.

See my comment here:


Emacs comes fairly close, but not close enough for me.

What I want is modularity to the extreme. I want to define an entire UI/UX as a user without running into roadblocks.

I have envisioned a set of default UIs that the user can simply override, or reference in their own UI. This way a new user is presented with a familiar editor, but can cleanly step away from defaults. No workarounds.

> I'd be rather interested in what you end up with.

My problem is that I care so much about planning ahead that I have started over several times. It would be very helpful for me to get feedback from others, and bounce ideas back and forth.

I am considering putting a serious effort into this project, since I am currently unemployed, and need to build a resume, etc.

I use emacs, but it has lots of problems; I'm still waiting for an editor that combines its good qualities while resolving the problems. (And I'm tempted to try to write one.)

> And I'm tempted to try to write one.

It looks like there are quite a few of us. I wonder how many would be interested in putting a group project together. I would really like to consider and hash out some ideas with a group.

Sure, I’m down. Maybe start a slack or a subreddit or something? :)

What do you think are the problems of emacs?

30 years of cruft + mix/matching of semi-incompatible semi-backwards-compatible systems if you use 3rd-party elisp and/or maintain a config for a long time.

I love emacs, use it daily, and think that its positives far outweigh its negatives, but hooo boy, sometimes you need to wade through decades of muck to make it do a thing, and it's often not clear what the best way to go about it is, particularly for people who haven't been using emacs for a decade+.

My biggest problem with Emacs is that most modes map keys to be compatible with its default layout, so you really can't get away from the default keybindings.

Evil-mode is nice in this regard, but I don't want Vim's defaults either, not to mention Emacs' bindings are still there, just hiding in insert mode.

The feature I want the most (that isn't readily available in an editor) is to define keybindings from scratch. I don't want to deal with undoing everyone else's work just to get started on my own.

bad performance and lots of bugs.

>until I realize the alternative doesn't fulfill my need

And programmers spend most of their day interacting with a text editor, so many hold strong opinions about them (see editor wars).

It is likely that some will not find one that fit their needs, and thus build their own.


Some people aren't like that. This should not be news. And it's "lose".

Incredibly thorough. Are there any plans to implement this in a text editor?

It is already in use in a bunch of smaller editors like Vis (iirc)

Yeah, the vis implementation is here: https://github.com/martanne/vis/blob/master/text.c

That was a great read, and really shows the benefits of writing good comments.

I'm more interested in what he will use for the UI. The popular option nowadays seems to be embedding WebKit, but that has obvious drawbacks. On the other hand, having scouted the grounds myself a year ago, it is one of the easiest options to set up, and it looks the best.

The first thing that came to my mind was mmap.

I'm sure there is something very wrong with using mmap, and I could probably find the answer with a quick Google search, but it would have been great if it were covered by the article.

Well, if I ever wanted to ask whether I really need to know advanced data structures, this answers that question. Really interesting article.

Figure out whatever VSCode is doing and use that.

Are some types of binary trees favored for making ropes? I've heard 2-3 finger trees are good as a backing data structure for a rope.

I’d think a relaxed radix tree would be ideal.
