At the moment I am also writing a text editor with my partner, to showcase the C++ RRB-tree implementations in Immer. We have just started, but my plan is to stop at 1000k lines. Interestingly, with such a persistent data structure you already go a long way toward implementing undo and parallel processing, and many editing algorithms are super simple thanks to slicing/concat. Memory consumption is still sub-optimal (also, I am using wchar_t to simplify my life), but I am very satisfied with the results so far (it is the fastest editor I have on my machine when editing a 1GB file, and it can even edit a ~100MB file on a Raspberry Pi --- larger files fail only due to excessive memory use :/).
I'm developing my own text editor. To test a naive implementation of the text editing routines --- an array of line strings --- I once concatenated all the sources of a recent Linux kernel (4.6.3). From what I remember it was all the .c, .h and .S files. The resulting file was 542MB and 19'726'498 lines. It was good enough even when adding a line at the front.
[The actual analysis is much more subtle, as the tree invariants are a minimum and maximum number of children for non-root non-leaf nodes, and a minimum and maximum leaf size. When freshly loaded, the code is careful to maximally pack both, and after extensive editing the distribution will be more varied. However, it's still rigorously O(log n), so it ends up being a small constant factor. Might be worth doing some empirical performance testing.]
(I looked into monoid cached trees for float placement in CSS at one point in an effort to come up with an O(n log n) algorithm. I succeeded, but the constant factor was so high that it wasn't worth the price. I ended up switching to an algorithm based on splay trees that, while O(n^2) in the worst case, ended up being O(n) with a small constant factor on real-world pages.)
I love the idea of using monoids like they're described in the blog series, but the examples suggest that there's a certain amount of non-generalizable cleverness that goes into defining each monoid. Could you do CYK subtables inside the monoid, so that people can define arbitrary CF grammars, as long as they're in Chomsky normal form?
If that doesn't happen (like in the "parens" or "other" description), then to use CYK I think you could maybe keep a piece of the table for each substring (i.e. all entries with both indices inside the substring) in each node, but then I don't think the operation at the parent is commutative. And you have to decide whether to copy the info from the children or not (if not, you'll spend an extra log n time looking down the tree).
IIRC, Internet Explorer used a binary tree to represent its strings, at least at version 5 in the early 2000s, because of the inefficiency of doing lots of copying for string operations - looped concatenation was one of the primary drivers. That doesn't mean it went all the way to ropes, of course.
Excitement is nice to feel, but it takes some experience to know when excitement is really aimed in a productive direction. Otherwise we end up with the kind of motivation that so often produces over-complex and mis-aimed software: having a "cool idea" for "exciting technology" and then looking for places to apply it, and the applications don't really fit or don't really work, but we don't want to notice that, so we don't.
To pull examples: an entire one of these essays is on "paren matching" and how it would be really great if you monoidized (ugh) and parallelized that ... the basic idea of which is instantly shot down by the fact that language grammars are just more complicated than counting individual characters. Hey bro, what if there is a big comment in the middle of your file that has some parens in it? The author didn't even think of this, and relegates this to a comment at the end of that particular essay: "Jonathan Tomer pointed out that real parsing is much more interesting than just paren matching." Which is a short way of saying "this entire essay is not going to work so you probably shouldn't read it, but I won't tell you that until the bottom of the page, and even then I will only slyly allude to that fact." Which in itself is contemptuous of the reader -- it is the kind of thing that happens when you are excited enough about your ideas that the question of whether they are correct is eclipsed. This leads to bad work.
There's the essay about the scrollbar -- if you have a 100k-line text file, do you really want a really long line somewhere in the middle to cause the scrollbar to be narrow and tweaky in the shorter, well-behaved majority of the file? No, you probably don't! But this shoots down the idea that you might want to do a big parallel thing to figure out line length, so he declines to think about it. In reality what you probably want is the scrollbar to be sized based on a smooth sliding window that is slightly bigger than what appears on the screen (but not too much).
Besides which, computers are SO FAST that if you just program them in a straightforward way, and don't do any of the modern software engineering stuff that makes programs slow, then your editor is going to react instantly for all reasonable editing tasks.
I don't want to be overly critical and negative -- these sorts of thoughts are fine if they are your private notes and you are thinking through technical problems and asking friends for feedback. It becomes different when you post them to Hacker News and/or the rest of the internet, because that carries an implicit claim that they are worth many readers' time. But in order to be worth many readers' time, much more thought would have had to go in ... and as a result, the ideas would have changed substantially from what they are now.
I didn't read past essay 4, so if it gets more applicable to reality after that I don't know!
As to being a young person, I don't know if you remember, but I was already of drinking age when you and me and Adam Sah hung out a bit and talked about the "rush" language.
I admit to being excited, but because I think "modern" editors really are too bloated and slow and it causes a thousand paper cuts throughout the day, and I think that can be fixed.
Sorry for presuming age + experience level, it's how this came across to me. Actually I think Rush is a prime example of "excited about general ideas that turn out not to be right or relevant to much". But we were students, and I guess that is what students often do.
I agree modern editors are too slow and bloated. I would write one if I didn't have way too many other things happening. But I don't think they are slow and bloated due to a lack of computer science concepts. I think they are slow because most of the world, over the last 25 years, has lost the art of writing software that is remotely efficient.
If I were to write an editor, it would store text as arrays of lines (since lines are what you care about) with maybe one level of hierarchy, such that each 10k lines of the file are in one array. I think that would be fine, and if it ran into problems with very large files, relatively minor modifications would take it the rest of the way. (Of course this is untested, but I feel pretty confident about it.) Rather than calling malloc all the time, a specialized allocator would be in play.
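A minimal sketch of that two-level scheme (the class name and the 10k chunk size are just illustrative, not anyone's actual implementation):

```python
# Two-level "array of lines": the file is split into chunks of up to
# CHUNK lines, so inserting a line only shifts one chunk's list, not
# the whole file.
CHUNK = 10_000  # illustrative chunk size from the comment above

class ChunkedLines:
    def __init__(self, lines):
        self.chunks = [lines[i:i + CHUNK]
                       for i in range(0, len(lines), CHUNK)] or [[]]

    def insert(self, lineno, text):
        # Find the chunk containing lineno, then do an ordinary insert.
        for chunk in self.chunks:
            if lineno <= len(chunk):
                chunk.insert(lineno, text)
                return
            lineno -= len(chunk)
        self.chunks[-1].append(text)

    def line(self, lineno):
        for chunk in self.chunks:
            if lineno < len(chunk):
                return chunk[lineno]
            lineno -= len(chunk)
        raise IndexError(lineno)

buf = ChunkedLines([f"line {i}" for i in range(25_000)])
buf.insert(0, "new first line")  # shifts only the first 10k-line chunk
```

A "minor modification" for huge files would be keeping per-chunk line counts in a small index so the chunk search is O(number of chunks) or better, rather than scanning.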
I do think it's a good idea to make a better editor so I wish you good luck with that (dude I am so sick of emacs).
As I said elsewhere on the thread, the good thing about ropes is their worst case performance. Super-long line? Rope still efficient. 10 million line file? Rope still efficient. Massive sequence of edits relative to the pristine file? Rope still efficient. It's pretty easy to write code which works well in the common cases but then degrades when things get interesting. I'm not going to apologize for caring about that.
This is what I object to about the rope representation -- it intentionally destroys this information that you actually want available most of the time. I don't think that's nice at all.
It's possible you could make the rope work better in this sense by annotating each piece... I dunno, haven't thought about it.
As for the worst-case performance thing ... I think my scheme would do fine with super-long lines or 10 million line files. But dude, I don't even have an editor today that works okay on 10k-line files, and I don't think it's the internal data representation that's the problem, I think it's because of all the other decisions that get made (or lack thereof).
Besides, what's a "line"? Is it a source line, or a visual line after wrapping? In an editor, you care deeply about both, depending on what exactly you're doing. With an array of lines, you pick one for the representation, and when you want the other one, it's quite painful. With ropes, no problem at all, just two different Cursor objects into the same rope.
I agree, other decisions besides buffer representation are important, and I'm doing my best to get those right too.
It's really not that painful. A logical line of text may span multiple visual lines because of word wrapping. A visual line will never span more than one logical line. So you always have a 1-to-N relationship between logical and visual lines, and everyone knows you have to go to M-to-N before things get objectively painful.
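That 1-to-N relationship is easy to see in a toy wrapping sketch (the fixed width and naive character-based wrapping are just for illustration):

```python
def visual_lines(logical_line, width=10):
    """Split one logical line into the visual lines it occupies after
    naive fixed-width wrapping (ignoring word boundaries)."""
    if not logical_line:
        return [""]
    return [logical_line[i:i + width]
            for i in range(0, len(logical_line), width)]

# One logical line -> N visual lines; a visual line never spans two
# logical lines, so a running sum over per-line counts converts between
# the two coordinate systems.
assert visual_lines("a" * 25) == ["a" * 10, "a" * 10, "a" * 5]
```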
Here's the thing about text editors: they are an extremely well known problem. It's really hard to say that anything about writing a text editor is "difficult" when we have hundreds of examples of free, open source software that we can sit with and analyze for requirements. There really aren't many new lessons to learn, certainly not in the in-memory representation of the file.
It's really nice to talk about parallelizing your text editor operations, but here in the year two thousand seventeen, Vi is still single-threaded and not at all hardware accelerated, and it still runs essentially infinitely faster than most other text editors. The only time Vi's single-threadedness becomes a problem is when plugin authors can't be arsed not to load up Python.
What data structure does Vi use? An array of lines.
There are, of course, computer science concepts that are very smart. But we don't need these to save us from slow software, because today's slow software problem is just the result of people doing bad things in layer upon layer. We have to stop doing all the bad stuff and dig us out of the hole we're in, just to get back to neutral. Once we are back at neutral, then we can try thinking about some computer science smarty stuff to take us forward.
You give an example of Object-Oriented Programming as an idea that has set software engineering back by decades, but really, isn't it the misapplication of that tool that does the damage? Consider that, time and again, software engineers have developed a code base of functions which all need a bit of shared state, and the technique of calling all those functions while passing along a reference to the 'context' was more simply expressed as calling a function "from" the context itself.
The tools help with the cognitive burden of understanding the entire system. And tools that allow one to make durable assumptions about part of the system allow the engineer to 'free up' space in their brain for other parts of the system. That desire to add abstraction in order to 'move up' the conceptual tree and get a wider view of the overall system has been part of how humans think since they first started collecting into tribes.
Certainly "over abstracting" is a huge issue. There was a great story about the Xerox Star system (first word processor that was multi-lingual and multi-fonted) that Dave Curbow used to tell about how the 'call stack' to get a character on the screen had reached insane levels (and that made things very slow). All due to abstraction. And yet there are good examples too where at Sun the adding of support for the 3b2 file system was accomplished quickly, and everything still worked, due to the Virtual File System (VFS) abstraction.
My point is that it isn't the tools that set computer science back, it is the misapplication of them that does that. And what I liked about Raph's discussions on Xi is the exploring and testing whether or not a tool he had available was applicable to writing text editors.
 Can you imagine the challenge of having to talk to someone in the tribe for 10 minutes to determine what their role was and capabilities? So much easier to say "You're a hunter right?" and when they say yes just assume various hunter capabilities are available.
One big problem is that one can't deploy the existing methods of zero-cost abstraction (templates/monomorphized generics) across process boundaries or the kernel/userspace boundary. Let's work on this, not act as if abstractions themselves are the problem.
(As an aside: I think the "implementation inheritance" aspect of OOP is at least as harmful as jblow suggests. "Associate related pieces of data with the code that operates on them as a unit" is a pretty reasonable idea on its own.)
I don't want to sound like I'm opposing you here, but I would be genuinely interested in a write-up about OOP setting engineering back by decades. Could you elaborate, please?
You can make a decent argument that's the only realistic way extremely large systems like the web can possibly come about. It's not pretty but the track record of the alternatives is worse.
Of course, in the name of not using the same golden hammer for all problems, extremely large systems like the web and text editors should each be considered in their own right, with the best solution found for each.
I'm not just saying that people have lost track of the importance of efficiency. I am saying they've lost track of how to actually do it. I think at least 95% of the programmers working in Silicon Valley have no practical idea of how to make code run fast. Of the remaining 5%, a very small number are actually good at making code run fast. It's a certain thing that you either get or don't. (I didn't really get it when I started in games, even though I thought I did ... it took a while to really learn.)
Forget 16ms, I would be happy to get to an order of magnitude slower than that for much of today's software ... it would be a massive increase in human happiness.
I'm sure someone overly excited is at this very moment trying to make this a mathematically precise statement so they can see if you're right, and if so, if there's a way to change the author's approach to support more complicated computations on the text. Maybe you can help them along if you're not just guessing.
The Dyck language can be recognized by a monoid as described. The regular language can be recognized by a monoid whose elements are functions between states of the recognizing automaton, each symbol being identified with the transitions the automaton performs when reading the symbol. The intersection should be describable by the product of monoids. (If such a thing exists, anyway. My abstract algebra is a bit rusty.)
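As a hedged sketch of both constructions (the parity automaton and all names here are made up for illustration, not taken from the essays):

```python
# Two monoids for summarizing text chunks. Both operations are
# associative, so chunk summaries can be computed independently and
# combined in any grouping -- which is what makes them parallelizable.
from functools import reduce

# 1) Dyck/paren monoid: (c, o) = counts of unmatched ')' and '('.
def paren_elem(ch):
    return (1, 0) if ch == ")" else (0, 1) if ch == "(" else (0, 0)

def paren_op(x, y):
    m = min(x[1], y[0])   # opens on the left cancel closes on the right
    return (x[0] + y[0] - m, x[1] + y[1] - m)

# 2) Transition-function monoid for a DFA. An element f says: f[s] is
#    the state reached from s after reading the chunk; the monoid
#    operation is function composition. Illustrative automaton: two
#    states tracking the parity of 'a's seen.
STATES = (0, 1)

def delta(state, ch):
    return state ^ 1 if ch == "a" else state

def dfa_elem(text):
    f = list(STATES)                       # identity on states
    for ch in text:
        f = [delta(f[s], ch) for s in STATES]
    return tuple(f)

def dfa_op(f, g):                          # read f's chunk, then g's
    return tuple(g[f[s]] for s in STATES)

# Chunk summaries combine to the whole-string summary:
assert dfa_op(dfa_elem("aa"), dfa_elem("b")) == dfa_elem("aab")
assert reduce(paren_op, map(paren_elem, "(()")) == (0, 1)  # one open '('

# The product monoid for the intersection is just the pair of the two,
# combined componentwise.
```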
The hard part is dealing with the homomorphism that turns the intersection of the two languages into the context-free language. I'm guessing that the general construction might require turning parentheses into empty strings, which would destroy the simplicity of the monoidal construction.
Anyway, I'm unfortunately not excited enough to see whether I can get it to work, at least not right now. But maybe it helps point someone else in the right direction.
There is an extensive body of literature on parsing that goes back decades. Most of it I don't think is that useful. But some of it is about parallel parsing. If you are interested, there are quite a number of people with something to say about it. However, the speed wins in practice are not very big.
On the other hand, if you just write the parser so that it's fast to begin with, you don't really have a problem. The language I am working on parses 2.5 million lines of code per second on a laptop, and I have only spent a couple of hours working on parser speed. To do this it does go in parallel, but it goes parallel in the obvious way using ordinary data structures (1 input file at a time as a distinct parallel unit). So it's not "parallel parsing" in the algorithmic sense.
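That "obvious" parallelism -- one file per unit, an ordinary sequential parser inside -- might look roughly like this (parse_file is a stand-in that just counts lines; a real compiler would build an AST, and a CPU-bound parser in CPython would want processes rather than threads):

```python
# Sketch of embarrassingly parallel parsing: each input file is its own
# task, and within a task the parser is plain sequential code.
from concurrent.futures import ThreadPoolExecutor

def parse_file(path):
    # Stand-in for a real parser; counting lines keeps the sketch
    # self-contained.
    with open(path, encoding="utf-8", errors="replace") as f:
        return path, sum(1 for _ in f)

def parse_all(paths):
    # Threads for portability of the sketch; swap in a process pool for
    # CPU-bound work under the GIL.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(parse_file, paths))
```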
Of course, I don't disagree with your point that a fast parser makes the distinction here less useful. However, that number sounds interestingly large without more context -- do you have more info I can read about it?
Lexers/tokenizers handle degenerate cases very well (after many years of being used for syntax highlighting in IDEs), and some parsers intended for IDE consumption are getting much better at dealing with degenerate cases (the Roslyn family and Typescript in my experience have some very interesting work put into this area), but most parsers still have a long way to go. (Especially, because many of the most common parser-generators themselves have never bothered to concern themselves with degenerate cases.)
I was neither agreeing nor disagreeing with your points, simply expanding "sideways", because I think the conversation about the usefulness of parsing for general use in text editors and information processing gets derailed by the "degenerate" cases where things don't parse (and those are very important to text editors).
I think people often forget or underappreciate the lexing half of the lexer/parser divide. Yet the syntax highlighting engines in most of the text editors we use these days already hint that you can do a lot of user-meaningful things with "just" rudimentary, generic lexers.
As a continued aside: I felt I got really good results using a lexer as the basis for a character-based somewhat "semantic" diff tool, but still to date I've yet to really see it come into general usage outside my prototype toy (http://github.com/WorldMaker/tokdiff).
Second, it's completely reasonable to imagine data-structure-driven improvements to the task of writing source code. Imagine, for example, an AST editor. There are tools like org-mode and paredit which are halfway to true AST editing, and plenty of languages have tooling sufficient to support it, if there were demand. An AST editor would, of course, generalize the lessons here about ropes to pretty-printed ASTs, but there's no innate reason why it couldn't be done.
Third, a meta-comment, to address your comment elsewhere in the thread: "There are, of course, computer science concepts that are very smart. But we don't need these to save us from slow software, because today's slow software problem is just the result of people doing bad things in layer upon layer. We have to stop doing all the bad stuff and dig us out of the hole we're in, just to get back to neutral. Once we are back at neutral, then we can try thinking about some computer science smarty stuff to take us forward."
There are assumptions here about the nature of CS. CS is a science of abstractions. Complaining about layers in CS is complaining about the very nature of CS. Software is slow and insecure and hard to use because the tasks that we demand of software are extremely complex and our human processes for creating code are not sufficiently high-level, powerful, and expressive enough for us to design good systems on the first try.
Whenever somebody says, "I have removed a useless abstraction," they usually forget to also say, "By replacing it with a useful abstraction."
If you're interested in applications to other domains, we use ropes as the basis of the String data type in TruffleRuby. TruffleRuby is open source, so you can see that implementation by checking out the code in the org.truffleruby.core.rope package. We had to extend the basic idea of ropes to better match Ruby's semantics, such as making them encoding-aware. I gave a talk about it at last year's RubyKaigi that dives into real-world trade-offs.
There are also a lot of various rope implementations out there. You probably can find one for your language of choice.
 - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14....
 - https://github.com/graalvm/truffleruby
 - https://www.youtube.com/watch?v=UQnxukip368
I wonder if wanting only a printable-character ASCII editor would simplify things a lot or only a little. And I guess no tabs.
> Part 2 Line breaking
I don't really understand the problem here. Can't we count the line breaks like anything else? Is it because those aren't the values we want in the end?
> Part 4: Again, making this into a monoid is pretty easy. You store two copies of the (t, m) pair - one for the simple case, and one for the case where the beginning of the string is in a comment. You also keep two bits to keep track of whether the string ends or begins a comment. In principle, you have to do the computation twice for both cases, whether the first line is a comment or not, but in practice it doesn’t make the computation any more expensive: you compute (t, m) for the first line and for the rest of the string, and just store both the first value and the monoid sum.
What if a node of the rope contains an "end comment" and, later, a "("? What should the two (t, m) pairs be? Now that substring might be entirely inside a comment, entirely outside one, or partly both.
Although I do understand the general idea of computing the result for every possible initial/input state to achieve parallelism.
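A small sketch of that trick, using Pascal-style { } comments so the scanner can work one character at a time (the state names and the paren-counting payload are just illustrative):

```python
# Summarize a chunk for EVERY possible starting lexer state, so chunks
# can be computed independently and stitched together afterwards.
OUT, IN = 0, 1   # lexer states: outside / inside a { } comment

def summarize(text):
    """Return, for each possible start state, (parens_seen, end_state),
    counting only '(' characters that fall outside comments."""
    result = []
    for start in (OUT, IN):
        state, count = start, 0
        for ch in text:
            if state == OUT:
                if ch == "{":
                    state = IN
                elif ch == "(":
                    count += 1
            elif ch == "}":
                state = OUT
        result.append((count, state))
    return result

def combine(left, right):
    # Thread each of the left chunk's end states into the right chunk's
    # precomputed answers; no rescanning of text is needed.
    return [(c1 + right[end][0], right[end][1]) for (c1, end) in left]

# Splitting anywhere -- even inside a comment -- gives the same summary:
text = "(({(})("
assert combine(summarize(text[:3]), summarize(text[3:])) == summarize(text)
```

This is why the "end comment" inside a node isn't fatal: the node's summary already carries the answer for both possible incoming states, and the parent picks the right one when combining.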
Maybe it'd still be possible to maintain good response time while enjoying 4-10x memory savings.
That said, I'm very skeptical that lz4 on text would be worth it.
Sure my text editors aren't perfect, but they mostly get the job done, so any editor coming along needs to show that it tries to solve a problem that the user has. I'm not yet convinced this one does, so I probably will never find out what makes it brilliant.
I opened the file in IntelliJ and it's performing admirably, but there is noticeable lag when entering input. Keep in mind that features were disabled automatically due to the file size.
2MB may not be very representative for all files, but when having to jump into generated protobuf class files I've certainly wished my IDE could handle the load. I can't tell you what data structures these IDEs are using internally though.
And, again, the fact that features had to be disabled in IntelliJ really says a lot. With more power you can have more features: you can get feedback faster in your IDE, you can have more linters, more plugins, etc. That's a huge benefit of having fast software. Performance is enabling.