It doesn't require you to break the text up into lines and hope for short lines or anything; you can insert, split, cut, paste, delete, whatever, with at worst O(log(n)) for "expensive" operations (like beginning to edit at a random point in the text, or cutting/pasting a huge chunk of text) and O(1) for "cheap" operations (like insertion/deletion of a character after you've started editing).
It's based on a finger tree containing densely packed byte or UTF arrays (strictly speaking, this isn't necessary, but gives an appreciable constant time/memory speedup), and tree nodes are tagged with the length of all text underneath them. Here's an implementation specialized to byte arrays (not UTF-8). https://hackage.haskell.org/package/rope
I'm proud to say that the only bugs AFL found in my implementation were related to my (documented) lack of UTF-8 validation. And my skip list implementation beats the performance of the SGI C++ STL rope implementation by about 20x while being several times smaller. (It comfortably performs millions of random operations per second on documents up to 1M in size, with a smooth logarithmic decline in performance after that.)
Unfortunately, in C it's quite awkward to generically control what aggregate information is cached at the internal nodes of the tree. You need control over that to efficiently implement many of the operations a normal text editor needs, for example converting between line/character positions and byte offsets. I can't just hardcode all of those translations into the data type, because the bookkeeping is expensive and it kills cache coherence. It's a cost you only want to pay when you need it.
I'm curious to try porting the skip list code to unsafe Rust to see if I can add that abstraction capability while maintaining the current performance profile. Data structures are fun!
Funnily enough, when looking for a job out of college I was asked what data structure I would use for a text editor in a phone screen, and answered, "a pair of finger trees." The next hour was spent trying to explain over the phone what a finger tree was. And then I was asked to pick something as simple as possible, so I answered, "an array." No, not that simple: "A cut buffer. No? How about a linked list... of character arrays of bounded size." I didn't get an onsite.
Even PCs in the DOS era, with memory bandwidths in the single-digit MB/s range, were OK with text editors that used a linear buffer, simply because people didn't edit or generate huge text files very often ("huge" being more than several hundred KB).
I am developing an experimental text editor in C. What I am currently using is an array of lines (strings). Entering a new line is a memmove in an array of pointers; entering a new character is a memmove in an array of chars. I concatenated all the sources of the Linux kernel; the resulting file is 542 MB and 19,726,498 lines. Insertion of a line is then slightly laggy, probably on the edge of noticeability. I also tested it a bit with a file containing a single long line, and that was also OK. Based on what the profiler shows me, rendering takes much more time, and that's what I have to optimize.
I agree, but as you said in those cases the final size is known so it is not a series of one-character operations (which would be quadratic complexity and definitely noticeable.)
> Based on what profiler shows me rendering takes much more time and that's what I have to optimize.
That's been my experience playing around with text editing too: the time taken to modify the buffer is tiny in comparison to rendering the text itself. It's here that, e.g., updating only the regions that changed gives a noticeable improvement in responsiveness.
There was quite a nice post discussed here lately on the text management of web text editors.
> So what is the actual data structure of the great Atom text buffer?...
> @lines = [''];
I think that in JS simple operations on arrays of strings have much more overhead than in C. A few things off the top of my head: the additional metadata that has to be managed behind the scenes, and garbage collection. But I don't really know how that would add up in overall performance. Certainly performance would look different if the text were rendered by a dedicated library instead of the advanced layout engine that lives inside modern browsers. It could be an interesting project to write an editor in JS but use, for example, Pango bindings to render the text.
There is an editor called vis, scriptable in Lua, that uses piece chains, which the paper suggests is the best approach.
Even gap buffers seem somewhat odd to me. I'm assuming the gap is only created/inserted when text is typed, not just on movement of the point/cursor.
That, or, thinking about it longer, I suppose the contents of the file can be out of sync while the buffer is unsaved. So it's probably possible to easily insert and shift a gap into the text periodically, as needed. (I'm envisioning the rotate-memory code from Programming Pearls to keep the gap where you want it.)
Similarly for the piece chain (which, I should add, was a piece of hilarity to google): if it's possible to "snapshot" the base of the piece chain every so often to what is on disk, it would just mean inverting some of the change pieces. Which doesn't sound too hard.
You load a file. All the text is one giant buffer. You make a gap buffer and need the gap somewhere in the original buffer. Otherwise... how does this work?
You want to move the gap to a new location, which is the same operation as inserting it originally. This is effectively rotating part of the buffer.
To be clear, using memcpy to move the buffer without using extra memory is the same as rotating.
If you don't believe me. Try it. Move some memory of arbitrary size using constant memory and see what algorithms you find. :)
It's a pretty tortured way to think about it; after googling I finally figured out what you meant. Sounds like the kind of algorithm I would never, ever ask about in a programming interview!
Now put the cursor just after the word "the". How do you get the gap there? You likely started with "this is the startGAP" and want "this is theGAP start". Look at only " startGAP" and see that getting "GAP start" is just a rotation.
Could you see this another way? Almost certainly. But as a mutating, in-place algorithm, it's probably tough to beat doing it this way.
Finally... where did I say this was an interview question? It might be a fun discussion, but it would be a terrible fact-based question.
Edit: Curse my phone typing. I am still contemplating this and see roughly what you are getting at. I didn't mean to post yet. I do challenge some of this, but I'm still thinking. Apologies for posting early.
Edit 2: On the assumption that I don't care what is in the gap, I think I fully agree with you and I was wrong. I'd still challenge it somewhat, but I'd have to think longer. Basically, my challenge is that undo and redo are more than a bit easier if you take care of the contents of the gap.
The part that moves the gap is https://github.com/darius/dole/blob/master/text.lua#L178 and it's just a memmove. There's nothing like the permutation algorithm Bentley wrote about, his rotation by double reversal. If I'm missing your point, I'm sorry this is so troublesome; maybe it's not worth the bother of correcting me.
(Added: I wrote the above before your edit. I hope this helps; probably there are better sources to read, but I picked this one because I wrote it.)
And it is absolutely no bother. Thanks for discussing it! Again, I would never dream of this as an interview thing. Fun discussion, though.
Some day I may dive into a text editor. Many of the data structures that are supposedly used there don't leap out to me as obvious wins nowadays. They often pass the "appealing argument" test, but I have grown very wary of that.
I had some fun with my editor, but if I try again I'll stay away from the tty, or anyway look for a better cross-platform tty library first.
I recall coming across this when hearing about vis.
The hardest part was rendering in a performant way. One page was fine but with several pages it got horribly slow. I had to invent some caching to avoid recalculating the layout all the time.
Definitely one of the most educational things I have ever done.
The files opened by the program are read-only, so I didn't need to look into or worry about piece chains, but it's nice to be able to open multi-gigabyte files, with highlighting, scroll instantly to anywhere in the file, and have very low CPU and memory usage.
It was an interesting project. Text parsing becomes so simple once you have the right logic. And there are so many things to watch out for, e.g. different versions of PDF are structured differently, some vendors had HTML documents, some flat text, and some with no structure at all.
Thanks for sharing!
Nearly all of the load is shouldered by the Cocoa text system.
You can use the same basic techniques presented in the original article for loading large files instantly on macOS (I did), and then use Core Text APIs for rendering the actual text instead of the Win32 API calls.
You need to use the Core Text APIs over the NS* equivalents, or at least you do if you want to handle text with long lines, because all the NS* calls eventually get down to Core Text, and one of the functions (I forget which one, but likely CTTypesetterCreateWithAttributedString) becomes noticeably sluggish when processing long lines (think >10k characters per line), especially if most of that text isn't even appearing on the screen. At least it does for Chinese; I'm not sure about other scripts.
However, if you break that same 10k chunk up into a bunch of smaller text chunks first, and stop once you get off the visible screen, the combined time of processing each chunk is significantly less than processing it as a single piece of text. Checking my source code, I currently have the maximum length to split on set at 1,024, and that works quite well.
The Unicode text handling in Windows has been significantly revamped from the APIs listed in these articles, as have the text rendering APIs, which now go through DirectX.
Just pointing out that the technology for Notepad has actually changed and improved since 2005.