
Design and Implementation of a Win32 Text Editor (2005) - setra
http://www.catch22.net/tuts/neatpad
======
wyager
I don't see it mentioned in the text, but essentially the best-known data
structure for representing large amounts of random-location-editable text is
called the Rope. It's based on my favorite data structure, the Finger Tree,
which has very impressive asymptotic performance characteristics over a huge
range of intuitively (but not actually) expensive operations.

It doesn't require you to break the text up into lines and hope for short
lines or anything; you can insert, split, cut, paste, delete, whatever, with
at worst O(log(n)) for "expensive" operations (like beginning to edit at a
random point in the text, or cutting/pasting a huge chunk of text) and O(1)
for "cheap" operations (like insertion/deletion of a character after you've
started editing).

It's based on a finger tree containing densely packed byte or UTF arrays
(strictly speaking, this isn't necessary, but gives an appreciable constant
time/memory speedup), and tree nodes are tagged with the length of all text
underneath them. Here's an implementation specialized to byte arrays (not
UTF-8).
[https://hackage.haskell.org/package/rope](https://hackage.haskell.org/package/rope)
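
For the curious, here is a minimal sketch in C of the length-tagging idea
(my own illustration, not the finger-tree-based Haskell package linked
above; allocation and rebalancing are omitted): every node records how many
bytes hang below its left child, so finding an arbitrary offset is an
O(log(n)) descent, and an edit only has to rebuild the nodes on the path
back to the root instead of shifting the rest of the buffer.

    #include <stddef.h>

    /* Leaf = densely packed byte chunk; internal node = concatenation.
       For an internal node, `weight` is the byte count of the left
       subtree; for a leaf, it is the length of `text`. */
    typedef struct RopeNode {
        struct RopeNode *left, *right;  /* NULL for leaves */
        size_t weight;
        const char *text;               /* leaf payload, NULL for internal nodes */
    } RopeNode;

    /* O(log(n)) lookup of the byte at position i. */
    char rope_index(const RopeNode *n, size_t i)
    {
        while (n->text == NULL) {       /* descend internal nodes */
            if (i < n->weight) {
                n = n->left;
            } else {
                i -= n->weight;         /* skip everything in the left subtree */
                n = n->right;
            }
        }
        return n->text[i];              /* leaf: direct array access */
    }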

~~~
userbinator
I've read a lot about different advanced data structures for text editing, but
in my experience the reality is that, with memory bandwidths and CPU speeds
being what they are today, and the speed at which human inputs occur, one big
linear buffer is the simplest way to do it and the performance is not as bad
as you'd think. It seems inefficient to be moving blocks of memory up to the
whole size of the file whenever you insert a character, but when you consider
that ~30 CPS is the fastest input rate you're likely to encounter for
individual single-character inserts (keyboard autorepeat), and that a modern
CPU can memmove() at >1 GB/s, you are unlikely to notice any difference in
performance until the file you're editing is several tens of MB; and that is a
relatively rare use case (large files are usually opened not for actual
editing, but for browsing and searching, like logs, in which case the overhead
of constructing a more complex data structure upon open may actually be
slower).

Even PCs in the DOS era, with memory bandwidths in the single-digit MB/s
range, were OK with text editors that used a linear buffer, simply because
people didn't really edit or generate huge text files very often ("huge" being
more than several hundred KB.)
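
To put rough numbers on the figures above: at ~30 single-character inserts per
second and >1 GB/s of memmove() bandwidth, every keystroke can afford to shift
on the order of 30 MB of tail before input starts to lag, which is where the
"several tens of MB" threshold comes from. A back-of-the-envelope sketch of
that naive insert (buffer layout and names are my own, and capacity handling
is omitted):

    #include <string.h>

    /* One contiguous buffer holding the whole file. */
    typedef struct {
        char  *data;    /* capacity >= len + 1 is assumed in this sketch */
        size_t len;     /* bytes currently in use */
    } LinearBuffer;

    /* Insert one character: a single memmove() of the tail.
       O(len - pos) work per keystroke, but imperceptible at modern
       memory bandwidth until files reach tens of megabytes. */
    void insert_char(LinearBuffer *buf, size_t pos, char c)
    {
        memmove(buf->data + pos + 1, buf->data + pos, buf->len - pos);
        buf->data[pos] = c;
        buf->len++;
    }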

~~~
hawski
I agree with you to some point. Simpler data structures will not result in
abysmal performance nowadays. However, I would not use the fastest human
character input as a benchmark. In an advanced editor you may want the
computer to do some edits/inputs for you. Code pasting, filtering of lines or
search/replace should also be considered. That said, it is probably easy to
move blocks if you know the size of the edits.

I am developing an experimental text editor in C. What I am currently using is
an array of lines (strings). Entering a new line is a memmove in an array of
pointers. Entering a new character is a memmove in an array of chars. I
concatenated all the sources of the Linux kernel; the resulting file is 542 MB
and 19,726,498 lines. Inserting a line is then slightly laggy - probably on
the edge of noticeability. I also tested it a bit with a file containing a
single long line and it was also OK. Based on what the profiler shows me,
rendering takes much more time, and that's what I have to optimize.
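
Roughly, the shape of what I mean (an illustrative sketch rather than my
actual code; allocation and capacity handling are omitted):

    #include <string.h>

    /* Buffer = array of NUL-terminated line strings. */
    typedef struct {
        char **lines;   /* one pointer per line */
        size_t count;   /* lines in use (capacity assumed sufficient) */
    } Buffer;

    /* Entering a new line: one memmove over the pointer array. */
    void insert_line(Buffer *buf, size_t row, char *line)
    {
        memmove(&buf->lines[row + 1], &buf->lines[row],
                (buf->count - row) * sizeof(char *));
        buf->lines[row] = line;
        buf->count++;
    }

    /* Entering a new character: one memmove inside that line's chars
       (the trailing NUL is moved along with the tail). */
    void line_insert_char(char *line, size_t col, char c)
    {
        memmove(line + col + 1, line + col, strlen(line) - col + 1);
        line[col] = c;
    }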

~~~
userbinator
_Code pasting, filtering of lines or search/replace should also be considered.
That said, it is probably easy to move blocks if you know the size of the
edits._

I agree, but as you said, in those cases the final size is known, so it is not
a series of one-character operations (which would be quadratic complexity and
definitely noticeable).
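
For example (continuing the contiguous-buffer sketch from above, with capacity
assumed sufficient), a paste of k bytes is one shift of the tail plus one
copy, rather than k separate one-character shifts:

    #include <string.h>

    /* Insert k bytes at pos in a buffer currently holding len bytes. */
    void insert_block(char *data, size_t len, size_t pos,
                      const char *src, size_t k)
    {
        memmove(data + pos + k, data + pos, len - pos);  /* shift the tail once */
        memcpy(data + pos, src, k);                      /* drop the block in   */
    }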

_Based on what the profiler shows me, rendering takes much more time, and
that's what I have to optimize._

That's been my experience playing around with text editing too; the time taken
to modify the buffer is tiny in comparison to rendering the text itself. It is
here that, e.g., updating only the regions which changed gives a noticeable
improvement in responsiveness.

I wonder what data structure the Atom text editor uses --- it's famously slow
on large files, but I doubt that's where the bottleneck is; it's more like an
IDE so parsing and rendering are taking the bulk of the time. It is written in
JavaScript and browser-based, but having seen JS run a PC emulator and boot a
usable Linux kernel, I don't think that is the bottleneck either.

~~~
hawski
> I wonder what data structure the Atom text editor uses --- it's famously
> slow on large files, but I doubt that's where the bottleneck is; it's more
> like an IDE so parsing and rendering are taking the bulk of the time. It is
> written in JavaScript and browser-based, but having seen JS run a PC
> emulator and boot a usable Linux kernel, I don't think that is the
> bottleneck either.

There was a quite nice post [1], discussed here [2] recently, on text
management in web text editors.

Quote:

> Every time text is inserted, it's inserted as one "chunk", then split up by
> its line endings. This is done by invoking a regular expression engine.
> Personally I think this is overkill, but it certainly lets Atom continue to
> be easily modifiable. I can imagine the same thought is running through a
> few people reading this. It pushes all the new lines to a stack (or more
> technically: a regular JavaScript array). Already I don't want to find
> myself opening a large file. It then uses "spliceArray" to replace a range
> of lines.

> So what is the actual data structure of the great Atom text buffer?...

> @lines = [''];

> A regular JavaScript array. Ooof.

I think that in JS simple operations on arrays of strings have much more
impact than in C. A few things off the top of my mind: the additional metadata
that has to be managed behind the scenes, and garbage collection. But I don't
really know how that adds up to overall performance. Certainly performance
would look different if the text were rendered by a dedicated library instead
of the advanced layout engine that lives inside modern browsers. It could be
an interesting project to write an editor in JS but use, for example, Pango
[3] bindings to render the text.

[1] [https://ecc-comp.blogspot.de/2016/11/a-glance-into-web-tech-...](https://ecc-comp.blogspot.de/2016/11/a-glance-into-web-tech-based-text.html)

[2]
[https://news.ycombinator.com/item?id=12927173](https://news.ycombinator.com/item?id=12927173)

[3] [http://www.pango.org/](http://www.pango.org/)

------
maxxxxx
I once wrote an HTML WYSIWYG editor for Windows 3.1. I had bid for a project
but didn't check that the library I wanted to use would actually work with 16
bit Windows. So I had to do it myself. It took me several all-nighters to make
it work.

The hardest part was rendering in a performant way. One page was fine but with
several pages it got horribly slow. I had to invent some caching to avoid
recalculating the layout all the time.

Definitely one of the most educational things I have ever done.

------
imron
This is a really great series of articles, and I used a number of the
techniques mentioned in them for a program I made -
[https://www.chinesetextanalyser.com/](https://www.chinesetextanalyser.com/)

The files opened by the program are read-only, so I didn't need to look into
or worry about piece-chains, but it's nice to be able to open multi-gigabyte
files, with highlighting, and scroll around instantly to anywhere in the file,
and have very low CPU and memory usage.

------
iask
I always enjoy reading articles like this. I am a .Net developer and recently
did an application for parsing some of our B2B PDF documents (different
vendors, different document structures) and loading the data into a backend
db.

It was an interesting project. Text parsing becomes so simple once you have
the logic worked out. And there are so many things to watch out for, e.g.
different versions of PDF are structured differently, some vendors had HTML
documents, some flat text, and some with no consistent structure at all.

Thanks for sharing!

~~~
bbernoulli
What did you use for parsing PDFs?

~~~
iask
I used the iTextSharp library to convert the PDF files to text and then went
from there. Once you have the file in text format you can work out how to
parse it, based on the structure you will see every time for that document. In
my case, the documents differed by vendor, hence different parsers. That's the
gist of it.

------
dingdingdang
Honestly very good article - is anyone aware of a similar type of tutorial
for OSX that covers editing of small to huge text files?

~~~
valleyer
TextEdit, the actual editor that ships on macOS, is open source.

[https://developer.apple.com/library/content/samplecode/TextE...](https://developer.apple.com/library/content/samplecode/TextEdit/Introduction/Intro.html)

Nearly all of the load is shouldered by the Cocoa text system.

~~~
imron
And it's not very good with huge text files and long lines.

~~~
dingdingdang
Yep, this is the problem: just because it's bare metal sys-call-wise does not
mean it does partial/incremental loading correctly! :/

~~~
imron
Incremental loading is one of the things I dislike about the native cocoa
text/document components - especially if you have a large file and want to
scroll to the end. Scroll and wait. Scroll and wait. Scroll and wait.

You can use the same basic techniques presented in the original article for
loading large files instantly on macOS (I did), and then use Core Text APIs
for rendering the actual text instead of the Win32 API calls.

You need to use the Core Text APIs over the NS equivalents, or at least you do
if you want to handle text with long lines, because all the NS* stuff
eventually goes down to Core Text, and one of the functions (I forget which
one, but likely CTTypesetterCreateWithAttributedString) becomes noticeably
sluggish when processing long lines (think > 10k characters per line),
especially if most of that text might not even be appearing on the screen. At
least it does for Chinese; not sure about other scripts.

However if you break that same 10k chunk up into a bunch of smaller text
chunks first, and stop once you get off the visible screen, the combined time
of processing each chunk is significantly less than processing them as a
single piece of text. Checking my source code, I've currently got the maximum
length to split on set at 1,024 and that works quite well.
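
Roughly, the chunking looks like this (an illustrative sketch rather than my
actual code; it assumes a CFAttributedStringRef with the display attributes
already applied, ignores horizontal scrolling, and splits at fixed offsets,
which a real implementation would adjust so it never cuts a surrogate pair or
grapheme cluster):

    #include <CoreText/CoreText.h>

    /* Lay out one very long line in chunks of at most 1,024 characters,
       stopping once we are past the right edge of the viewport. */
    static void layout_long_line(CFAttributedStringRef attrStr, CGFloat visibleWidth)
    {
        CFIndex length = CFAttributedStringGetLength(attrStr);
        CGFloat x = 0;

        for (CFIndex start = 0; start < length && x < visibleWidth; start += 1024) {
            CFIndex chunkLen = (length - start < 1024) ? (length - start) : 1024;

            CFAttributedStringRef chunk =
                CFAttributedStringCreateWithSubstring(kCFAllocatorDefault, attrStr,
                                                      CFRangeMake(start, chunkLen));
            CTTypesetterRef typesetter = CTTypesetterCreateWithAttributedString(chunk);
            CTLineRef line = CTTypesetterCreateLine(typesetter, CFRangeMake(0, chunkLen));

            /* ... position the CTLine at x and draw it into the CGContext ... */
            x += CTLineGetTypographicBounds(line, NULL, NULL, NULL);

            CFRelease(line);
            CFRelease(typesetter);
            CFRelease(chunk);
        }
    }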

------
bluedino
Great article. Would be an interesting discussion to have with a developer
candidate during an interview.

------
hugozap
Mirror?

~~~
softblush
[https://web.archive.org/web/20161129233642/http://www.catch2...](https://web.archive.org/web/20161129233642/http://www.catch22.net/tuts/neatpad)

------
dewiz
Perhaps add "2005" to the title?

~~~
nacs
Have the cutting-edge techniques for implementing Notepad changed that much
since 2005?

~~~
imron
Actually yes!

The Unicode text handling in Windows has been significantly revamped from the
APIs listed in these articles, as have the text rendering APIs, which now go
through DirectX.

~~~
batina
You can still use Uniscribe in Windows 10 but as you said it has been
superseded by DirectWrite (used in Microsoft Word, WPF, etc).

~~~
imron
Yep, you can definitely still use it all, just like you can continue to use
the even older APIs too - hooray for backwards compatibility.

Just pointing out that the technology for Notepad has actually changed and
improved since 2005.

