Design and Implementation of a Win32 Text Editor (2005) (catch22.net)
173 points by setra on Dec 4, 2016 | 51 comments

I don't see it mentioned in the text, but essentially the best-known data structure for representing large amounts of random-location-editable text is called the Rope. It's based on my favorite data structure, the Finger Tree, which has very impressive asymptotic performance characteristics over a huge range of intuitively (but not actually) expensive operations.

It doesn't require you to break the text up into lines and hope for short lines or anything; you can insert, split, cut, paste, delete, whatever, with at worst O(log(n)) for "expensive" operations (like beginning to edit at a random point in the text, or cutting/pasting a huge chunk of text) and O(1) for "cheap" operations (like insertion/deletion of a character after you've started editing).

It's based on a finger tree containing densely packed byte or UTF arrays (strictly speaking, this isn't necessary, but gives an appreciable constant time/memory speedup), and tree nodes are tagged with the length of all text underneath them. Here's an implementation specialized to byte arrays (not UTF-8). https://hackage.haskell.org/package/rope

Finger trees are very cool. I have an efficient rope implementation in C based on skip lists here:


I'm proud to say that the only bugs AFL found in my implementation were related to my (documented) lack of UTF8 validation. And my skip list implementation beats the performance of the SGI C++ STL rope implementation by about 20x while being several times smaller. (It comfortably performs in the millions of random operations/second on documents up to 1M in size, and it has a smooth logarithmic decline in performance after that).

Unfortunately, in C it's quite awkward to generically control what aggregate information is cached at internal nodes in the tree. You need control over that to be able to efficiently implement lots of the operations you need in a normal text editor --- for example, converting between line/character positions and offset positions. I can't just hardcode all of those translations into the data type because the bookkeeping is expensive and it kills cache coherence. It's a cost you only want to pay when you need it.

I'm curious to try porting the skip list code to unsafe rust to see if I can add that abstraction capacity in while maintaining the current performance profile. Data structures are fun!

In 2010 I tried using a pair of FingerTrees, whose elements were strings, to make an in-browser text editor in JavaScript. It was visibly slow on small buffers, much slower than using a plain pair of Strings. It'd have been fine in C++, and I'm really quite surprised how bad it ended up being. Maybe JS is faster now.

Funnily enough, when looking for a job out of college I was asked what data structure I would use for a text editor in a phone screen, and answered, "a pair of finger trees." The next hour was spent trying to explain over the phone what a finger tree was. And then I was asked to pick something as simple as possible, so I answered, "an array." No, not that simple: "A cut buffer. No? How about a linked list... of character arrays of bounded size." I didn't get an onsite.

You're not going to be helped by V8 if you use complicated operations and data structures. Raw strings over a certain size will probably be prioritized by any JIT, as they are a classic example of what to optimize. You're also going to lose any semblance of cache locality if you do something like tree navigation in a dynamic runtime.

It's worth noting that the only thing that has to be fast in a gap buffer is memmove(), which usually works fine even in interpreters. (In Javascript you'd probably represent this via substring operations.)

I've read a lot about different advanced data structures for text editing, but in my experience the reality is that, with memory bandwidths and CPU speeds being what they are today, and given the speed at which human inputs occur, one big linear buffer is the simplest way to do it, and the performance is not as bad as you'd think. Indeed, it seems inefficient to be moving blocks of memory up to the whole size of the file whenever you insert a character; but consider that ~30CPS is the fastest input rate you're likely to encounter for individual single-character inserts (keyboard autorepeat), and that a modern CPU can memmove() at >1GB/s: you are unlikely to notice any difference in performance until the file you're editing is several tens of MB. And that is a relatively rare use-case --- large files are usually opened not for actual editing, but for browsing and searching, like logs, in which case the overhead of constructing a more complex data structure upon open may actually be slower.

Even PCs in the DOS era with memory bandwidths in the single-digit MB/s range were OK with text editors that use a linear buffer, simply because people didn't really edit nor generate huge text files very often ("huge" being more than several hundred KB.)

I've noticed that with text files that are large, but not that large, almost all the text editors I've found on Android choke badly on simple things like moving around in the file. For example, my main org-mode file at work is 560K. The only appified text editor on Android that handles it reasonably is VimTouch (Emacs, vim, and similar all handle it easily in a Debian chroot on the same hardware). So it has to be the case that better data structures for text editing still matter in at least some realistic environments today.

I agree with you up to a point. Simpler data structures will not result in abysmal performance nowadays. However, I would not use the fastest human character input as a benchmark. In an advanced editor you may want the computer to do some edits/inputs for you: code pasting, filtering of lines, or search/replace should also be considered. That said, it is probably easy to move blocks if you know the size of the edits.

I am developing an experimental text editor in C. What I am currently using is an array of lines (strings). Entering a new line is a memmove in the array of pointers. Entering a new character is a memmove in an array of chars. I concatenated all the sources of the Linux kernel; the resulting file is 542MB and 19'726'498 lines. Insertion of a line is then slightly laggy --- probably on the edge of noticeability. I also tested it a bit with a file containing a single long line and it was also OK. Based on what the profiler shows me, rendering takes much more time, and that's what I have to optimize.

> Code pasting, filtering of lines or search/replace should also be considered. That said it is probably easy to move blocks if you know the size of edits.

I agree, but as you said in those cases the final size is known so it is not a series of one-character operations (which would be quadratic complexity and definitely noticeable.)

> Based on what profiler shows me rendering takes much more time and that's what I have to optimize.

That's been my experience playing around with text editing too; the time taken to modify the buffer is tiny in comparison to rendering the text itself. It is here that e.g. updating only the regions which changed will have a noticeable improvement in responsiveness.

I wonder what data structure the Atom text editor uses --- it's famously slow on large files, but I doubt that's where the bottleneck is; it's more like an IDE so parsing and rendering are taking the bulk of the time. It is written in JavaScript and browser-based, but having seen JS run a PC emulator and boot a usable Linux kernel, I don't think that is the bottleneck either.

> I wonder what data structure the Atom text editor uses --- it's famously slow on large files, but I doubt that's where the bottleneck is [...]

There was a quite nice post [1], discussed here [2] recently, on text management in web text editors.


> Every time text is inserted, it's inserted as one "chunk", then split up by its line endings. This is done by invoking a regular expression engine. Personally I think this is overkill, but it certainly lets Atom continue to be easily modifiable. I can imagine the same thought is running through a few people reading this. It pushes all the new lines to a stack (or more technically: a regular JavaScript array). Already I don't want to find myself opening a large file. It then uses "spliceArray" to replace a range of lines.

> So what is the actual data structure of the great Atom text buffer?...

> @lines = [''];

> A regular JavaScript array. Ooof.

I think that in JS simple operations on arrays of strings have much more impact than in C. A few things off the top of my head: additional metadata that has to be managed behind the scenes, and garbage collection. But I don't really know how it all adds up to overall performance. Certainly performance would look different if the text were rendered by a dedicated library instead of the advanced layout engine that lives inside modern browsers. It could be an interesting project to write an editor in JS but use, for example, Pango [3] bindings to render the text.

[1] https://ecc-comp.blogspot.de/2016/11/a-glance-into-web-tech-...

[2] https://news.ycombinator.com/item?id=12927173

[3] http://www.pango.org/

The PC word processor "Final Word", also available on some earlier machines, used a gap buffer for the part of the file that was in memory.

Given the desire for syntax highlighting and other demanding features in modern editors, it would be fun to see a comparison of "piece chains" or "the rope" with a gap buffer, which is the data structure historically used by Emacs. Sure, it's easy to see how different editing itself fares with these various data structures, but what about syntax highlighting?

There exists a phenomenal paper comparing ropes, gap buffers, and piece chains whose title I cannot recall.

There is an editor called Vis, scriptable in Lua, that uses piece chains, which the paper suggests are best.

I would be very interested if you could find that paper. Briefly reading up on piece chains, I can't see how they would be the preferred way, sadly. (I'm assuming the failing is on me, hence the "sadly".)

Even gap buffers seem somewhat odd to me. I'm assuming the gap is only created/inserted when text is typed, not just when the point/cursor moves.

That, or thinking longer, I suppose that the contents of the file can be out of sync while the buffer is not saved. So, probably possible to easily insert and shift a gap periodically into the text, as needed. (I'm envisioning the rotate memory code from Programming Pearls to keep the gap where you want it.)

Similarly for the piece chain (which, I should add, was a piece of hilarity to google): if it is possible to "snapshot" the base of the piece chain every so often to what is on disk, it would just mean inverting some of the change pieces. Which doesn't sound too hard.

You don't need to rotate memory, because you don't care about the contents of the gap.

My thinking:

You load a file. All text is one giant buffer. You make a gap buffer and need it somewhere in the original buffer. Otherwise... How does this work?

You want to move the gap to a new location. Which is the same as originally moving it in. This is effectively rotating part of the buffer.

https://en.wikipedia.org/wiki/Gap_buffer has a good explanation. You allocate extra space and do memmove() to open/move/close the gap.

This is exactly what I described. You make a gap buffer and move it into the original text. As typing happens, the size of the buffer is effectively decreased. You move the cursor/point, and need to move what is left of the gap. This is the same as rotating it.

To be clear, using memcpy to move the buffer without using extra memory is the same as rotating.

If you don't believe me. Try it. Move some memory of arbitrary size using constant memory and see what algorithms you find. :)

I can only repeat what abecedarius said: it's nothing like a rotation because you don't care about the bytes in the gap.

It's a pretty tortured way to think about it; after googling I finally figured out what you meant. Sounds like the kind of algorithm I would never, ever ask about in a programming interview!

It is exactly like a rotation. Think of it this way. You have a file. Contents "this is the start". You copy it to memory. (Or map it. Doesn't matter.)

Now the cursor is just after the word "the". How do you get the gap there? You likely started with: "this is the startGAP". You want "this is theGAP start". Look at only " startGAP" and see that getting "GAP start" is just a rotation.

Could you see this another way? Almost certainly. However, as a mutable algorithm, probably tough to beat using this way.

Finally.... Where did I say this was an interview question? Might be a fun discussion. Terrible fact based question.

We were just pointing out that changing "startGAP" to "GAPstart" is doing extra work; it's just as good to end up with "stastart"; we don't care what's in the bytes that you're writing as "GAP". The location of the gap is typically represented with a couple of pointers or ints.

Which didn't help your point. You don't care what is in the gap, but you do care what is on either side of it. So you have to move those.

Edit: Curse my phone typing. I am still contemplating this and see roughly what you are getting at. I didn't mean to post yet. I do challenge some of this, but am still thinking. Apologies for posting early.

Edit 2: On the assumption that I don't care what is in the gap, I think I fully agree with you and I was wrong. I challenge that somewhat, but would have to think longer. Basically, my challenge is that undo and redo are more than a bit easier if you take care of the contents of the gap.

Here's a gap buffer in Lua. The code for inserting and deleting: https://github.com/darius/dole/blob/master/text.lua#L171 (The representation is documented at the start of the file. Some of the code in this function is only for assertions.)

The part that moves the gap is https://github.com/darius/dole/blob/master/text.lua#L178 and it's just a memmove. There's nothing like the permutation algorithm Bentley wrote about, his rotating by double reversal. If I'm missing your point, I'm sorry this is so troublesome -- maybe it's not worth the bother of correcting me.

(Added: I wrote the above before your edit. I hope this helps; probably there are better sources to read, but I picked this one because I wrote it.)

Yeah, apologies for the edits. I meant to hold on the thought some last time.

And it is absolutely no bother. Thanks for discussing it! Again, I would never dream of this as an interview thing. Fun discussion, though.

Some day I may dive into a text editor. Many of the data structures that are supposedly used there don't leap out to me as obvious wins nowadays. They often pass the "appealing argument" test, but I have grown very wary of that.

I hope you do! There are bound to be new things to discover.

I had some fun with my editor, but if I try again I'll stay away from the tty, or anyway look for a better cross-platform tty library first.

Data Structures for Text Sequences by Charles Crowley?


I recall coming across this when hearing about vis.

Yes! I believe that was it. If not it's very similar.

This paper only discusses editing.

Syntax highlighting is primarily a "read" operation, and for that I suspect a plain buffer is going to have the advantage in speed since all the characters are contiguous and no pointer-chasing is required to read, unlike with the other data structures.

People interested in this stuff might also want to take a look at the rope in https://github.com/google/xi-editor. It's not a finger tree, rather it's based on a B-tree, which is simpler but doesn't have quite as good asymptotic guarantees for operations at the beginning and end of the sequence. Also, Rust.

I once wrote an HTML WYSIWYG editor for Windows 3.1. I had bid for a project but didn't check that the library I wanted to use would actually work with 16 bit Windows. So I had to do it myself. It took me several all-nighters to make it work.

The hardest part was rendering in a performant way. One page was fine but with several pages it got horribly slow. I had to invent some caching to avoid recalculating the layout all the time.

Definitely one of the most educational things I have ever done.

This is a really great series of articles, and I used a number of the techniques mentioned in them for a program I made - https://www.chinesetextanalyser.com/

The files opened by the program are read-only, so I didn't need to look in to or worry about piece-chains, but it's nice to be able to open multi-gigabyte files, with highlighting, and scroll around instantly to anywhere in the file, and have very low CPU and memory usage.

I always enjoy reading articles like this. I am a .Net developer and recently did an application for parsing some of our B2B PDF documents (different vendors, different document structures) and loading the data into a backend db.

It was an interesting project. Text parsing becomes so simple once you have the logic worked out. And there are so many things to watch out for, e.g. different versions of PDF are structured differently, some vendors had HTML documents or flat text, and some had no structure at all.

Thanks for sharing!

What did you use for parsing PDFs?

I used the iTextSharp library to convert the PDF files to text and then went from there. Once you have the file in text format you can determine how to parse it --- that would be the structure you will see all the time for that document. In my case, each document differs by vendor, hence different parsers. That's the gist of it.

Honestly, a very good article --- is anyone aware of a similar type of tutorial for OSX that covers small to huge text file editing?

TextEdit, the actual editor that ships on macOS, is open source.


Nearly all of the load is shouldered by the Cocoa text system.

And it's not very good with huge text files and long lines.

Yep, this is the problem; just because it's bare-metal syscall-wise does not mean it does partial/incremental loading correctly! :/

Incremental loading is one of the things I dislike about the native cocoa text/document components - especially if you have a large file and want to scroll to the end. Scroll and wait. Scroll and wait. Scroll and wait.

You can use the same basic techniques presented in the original article for loading large files instantly on macOS (I did), and then use Core Text APIs for rendering the actual text instead of the Win32 API calls.

You need to use the Core Text APIs over the NS equivalents, or at least you do if you want to handle text with long lines, because all the NS* calls eventually get down to Core Text, and one of the functions (I forget which one, but likely CTTypesetterCreateWithAttributedString) becomes noticeably sluggish when processing long lines (think > 10k characters per line), especially if most of that text might not even be appearing on the screen. At least it does for Chinese; not sure about other scripts.

However, if you break that same 10k chunk up into a bunch of smaller text chunks first, and stop once you get off the visible screen, the combined time of processing each chunk is significantly less than processing it as a single piece of text. Checking my source code, I've currently got the maximum length to split on set at 1,024 and that works quite well.

It's not a tutorial, but Ridiculous Fish wrote about his experience writing Hex Fiend (https://www.google.com/&q=site:ridiculousfish.com%2Fblog+%22... ); see, for example, http://ridiculousfish.com/blog/posts/old-age-and-treachery.h... .

Great article. Would be an interesting discussion to have with a developer candidate during an interview.


Perhaps add "2005" to the title?

Have the cutting-edge techniques for implementing Notepad changed that much since 2005?

Actually yes!

The Unicode text handling in Windows has been significantly revamped from the APIs listed in these articles, as have the text rendering APIs, which now go through DirectX.

You can still use Uniscribe in Windows 10 but as you said it has been superseded by DirectWrite (used in Microsoft Word, WPF, etc).

Yep, you can definitely still use it all, just like you can continue to use the even older APIs too - hooray for backwards compatibility.

Just pointing out that the technology for Notepad has actually changed and improved since 2005.

It's the house style to add the year to titles from previous years.
