I think this overstates the difficulty. This of course depends a lot on the language, but for a reasonable one (not C++) you can just go and write the parser by hand. I’d ballpark this as three weeks, if you know how the thing is supposed to work.
> it doesn’t have to redo the whole thing on every keypress.
This is probably what makes the task seem harder than it is. Incremental parsing is nice, but not mandatory. rust-analyzer and most IntelliJ parsers re-parse the whole file on every change (IJ does incremental lexing, which is simple).
> The reason (most) LSP servers don’t offer syntax highlighting is because of the drag on performance.
I am surprised to hear that. We never had performance problems with highlighting on the server in rust-analyzer. I remember that for Emacs specifically there were client side problems with parsing LSP JSON.
> Every keystroke you type must be sent to the server, processed, a partial tree returned, and your syntax highlighting updated.
That’s not the bottleneck for syntax highlighting, typechecking is (and it’s typechecking that makes highlighting especially interesting).
In general, my perception of what’s going on with proper parsing in the industry is a bit different. I’d say status quo from five years back boils down to people just getting accustomed to the way things were done. Compiler authors generally didn’t think about syntax highlighting or completions, and editors generally didn’t want to do the parsing stuff. JetBrains were the exception, as they just did the thing. In this sense, LSP was a much-needed stimulus to just start doing things properly. People were building rich IDE experiences before LSP just fine (see dart analyzer), it’s just that relatively few languages saw it as an important problem to solve at all.
It's not necessarily hard, but it takes a lot of work, and you will absolutely learn a lot of stuff about the language long after 3 weeks. I've programmed in Python for 18 years and still learned more about the language from just working with the parser, not even implementing it!
And this doesn't even count error recovery / dealing with broken code ...
It isn't trivial to work around the parts that aren't context free, but it's also nothing insurmountable that requires more than 120 hours of effort. The document explicitly points out which grammar rules are not context free and gives an algorithm that can be used as an alternative.
Parsing is really not as challenging a job as a lot of people make it out to be and it's an interesting exercise to try yourself and get an intuitive feel for. You can use a compiler compiler (like yacc) if you feel like it to just get something up and running, but the downside of such tools is they do very poorly with error handling. Rolling out a hand written parser gives much better error messages and really is nothing that crazy. C++ is the only mainstream language I can think of that has a grammar so unbelievably complex that it would require a team of people working years to implement properly (and in fact none of the major compilers implement a proper C++ parser).
For statically typed languages things get harder because you first need to parse an AST, and then perform semantic analysis on it, but if all you need is syntax highlighting, you can skip over the semantic analysis.
I wish we could move toward semantic highlighting.
I will chime in with you though and agree: as a writer and teacher of parsers, it doesn’t have to be that hard. In fact, if you implement your parser as a PEG, it really doesn’t have to be much longer than the input to a parser generator like YACC. Parser combinators strongly resemble EBNF notation; it’s almost a direct translation. That’s why parser generators are possible to write in the first place. But in my opinion they are wholly unnecessary, since the grammar itself is really all you need if you’ve designed your grammar correctly. Just by expressing the grammar you’re 90% of the way to implementing it.
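To make the "almost a direct translation" point concrete, here is a minimal sketch of parser combinators in Python. The helper names (`token`, `seq`, `alt`, `many`, `lazy`) and the two-rule toy grammar are made up for illustration and are not taken from any particular library.

```python
import re

# A parser is a function: (text, pos) -> (value, new_pos), or None on failure.

def token(pattern):
    """Match a regex at the current position, skipping leading whitespace."""
    rx = re.compile(r"\s*(" + pattern + r")")
    def parse(text, pos):
        m = rx.match(text, pos)
        return (m.group(1), m.end()) if m else None
    return parse

def seq(*parsers):
    """EBNF concatenation: a b c"""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def alt(*parsers):
    """PEG ordered choice: a / b (the first alternative that matches wins)."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

def many(parser):
    """EBNF repetition: { a } (assumes `parser` always consumes input)."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None:
                return values, pos
            value, pos = result
            values.append(value)
    return parse

def lazy(thunk):
    """Defer the lookup so rules can refer to each other recursively."""
    return lambda text, pos: thunk()(text, pos)

# Hypothetical toy grammar, written in EBNF:
#   expr = term { ("+" | "-") term } ;
#   term = number | "(" expr ")" ;
number = token(r"\d+")
term = alt(number, seq(token(r"\("), lazy(lambda: expr), token(r"\)")))
expr = seq(term, many(seq(alt(token(r"\+"), token("-")), term)))

def parse_all(text):
    """Return the parse tree, or None if the whole input does not match."""
    result = expr(text, 0)
    if result is None:
        return None
    tree, pos = result
    return tree if text[pos:].strip() == "" else None
```

The two grammar lines at the bottom read nearly the same as the EBNF in the comment above them, which is the whole point. A real parser would also want error reporting and a proper tree type; this only shows the shape.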
> And this doesn't even count error recovery / dealing with broken code ...
With a hand written parser, you mostly get error resilience for free. In rust-analyzer’s parser, there’s very little code which explicitly deals with recovery. The trick is, during recursive descent, to just not bail on the first error.
In a C-like language, I'd imagine you'd use braces or semicolons to see how far to skip ahead: the error bubbles up to a parser that knows how to recover (say, a statement or function body), which scans ahead to where it thinks its node ends and returns an error node, allowing the parent to continue.
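A rough sketch of that "skip to the next semicolon and emit an error node" idea, in Python. The tiny `let NAME = NUMBER ;` language and every function name here are invented for the example; this is not how rust-analyzer or any other real parser is structured.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|\S)")

def tokenize(src):
    """Split source into numbers, identifiers and single-character symbols."""
    return TOKEN.findall(src)

def parse_program(tokens):
    """Parse `let NAME = NUMBER ;` statements without stopping at the first error."""
    pos, nodes, errors = 0, [], []

    def expect(check, what):
        nonlocal pos
        if pos < len(tokens) and check(tokens[pos]):
            pos += 1
            return tokens[pos - 1]
        raise SyntaxError(f"expected {what} at token {pos}")

    def statement():
        expect(lambda t: t == "let", "'let'")
        name = expect(str.isidentifier, "a name")
        expect(lambda t: t == "=", "'='")
        value = expect(str.isdigit, "a number")
        expect(lambda t: t == ";", "';'")
        return ("let", name, int(value))

    while pos < len(tokens):
        start = pos
        try:
            nodes.append(statement())
        except SyntaxError as err:
            # The error bubbles up to the statement level, which knows how to
            # recover: skip to just past the next ';' and record an error node.
            errors.append(str(err))
            while pos < len(tokens) and tokens[pos] != ";":
                pos += 1
            pos += 1  # always advance, so the loop is guaranteed to terminate
            nodes.append(("error", tokens[start:pos]))
    return nodes, errors
```

With the input `let a = 1; let b = ; let c = 3;` the broken middle statement becomes an `("error", ...)` node and the statements on either side still parse normally. Skipping to a semicolon is the crudest possible synchronisation point; a real parser would pick recovery tokens per construct.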
Also, thanks dunham for the sixten suggestion!
I am given to understand that this is not a problem any more (since Emacs 27.1). Before that, the JSON parser was written in elisp, which is a slow language (though this is somewhat mitigated by the recent native compilation). But now Emacs prefers to use native bindings (jansson), and afaik this has solved most of the performance grievances raised by LSP clients.
I don't agree. Newer languages are all being designed with the constraint that the grammar should be easy to parse and not require indefinite lookahead and full compilation to get back on track after an error.
That's a big change from the C/C++ heritage.
It's no coincidence that "modern" languages (call it the last 10 or so years) tend to have things like explicit variable assignment (let-statement-like) and delimiters between variable and type, for example.
I think that says less about the difficulty of parsing and more that language designers have realised that 'easy to parse' is not incompatible with good readability and terse syntax. In fact, the two go hand in hand: languages that are easy for computers to understand are often easy for users to understand too.
I don't agree. Lisp is "easy" to parse, but difficult to add structure to. Tcl similarly. Typeless languages are now out of favor--everybody wants to be able to add types.
Perl is a nightmare and probably undecidable. Satan help you if you miss a period in COBOL because God sure won't. FORTRAN is famous for its DO LOOP construct that would hose you.
About the only language that wasn't hot garbage to parse was Pascal. And I seem to recall that was intentional.
I have no idea what you mean by this, or how you think it relates to your original claim that having languages with a less terrible grammar than C++ or even C is some recent development.
> Perl is a nightmare
And it's pretty clearly C-inspired, even if it added lots of new syntactic horrors of its own invention. Also, it's late 80ies not early 70ies, so hardly a poster-case for languages becoming grammatically saner.
> About the only language that wasn't hot garbage to parse was Pascal.
In addition to Pascal and Lisp, which you already mentioned, Algol, Prolog, APL, and Smalltalk are all famous languages from around the same time as C or significantly older, and none of them are "hot garbage to parse". Neither are important 80ies languages like Postscript or SML. In fact the only significant extant 70s language I can think of off the top of my head that is syntactically noticeably more deranged than C and maybe even C++ is TeX.
> And I seem to recall that was intentional.
Well yes, why would anyone create a language that's really hard to parse for no discernible benefit? This is not the counterintuitive recent insight you make it out to be. If anything, the trend for popular languages would seem to be toward becoming harder to parse -- none of the significant languages from the 2010s (like Swift, Julia or Rust) are anywhere near as easy to parse as the languages I listed above.
Python is considerably more picky in what it allows in which constructs; it does all the numerical typing stuff implicitly (so switching between integer, big int, and float happens most of the time without users even knowing about it), and b/c of backwards compatibility, coercions between numbers and booleans still happen (`True + 1` is still `2` in Python 3.9). By extension, this includes empty lists, strings, and dictionaries evaluating to `False` in appropriate contexts.
I guess what I want to say is that strong and loose typing exists on a continuum much like human languages are never just of a single idealized kind.
'2' * 3   # '222' (string repetition)
'2' + 3   # TypeError (no implicit str/int coercion)
(Though Python does have optional type annotations these days.)
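For anyone who wants to poke at the Python behaviours mentioned in this subthread, here is a small self-checking snippet. `try_add` is just a helper made up for the example.

```python
def try_add(a, b):
    """Return a + b, or a description of the TypeError if Python refuses."""
    try:
        return a + b
    except TypeError as err:
        return f"TypeError: {err}"

# bool is a subclass of int, so booleans still take part in arithmetic.
assert True + 1 == 2

# Empty containers and empty strings are falsy in boolean contexts.
assert not [] and not "" and not {}

# Integers grow to arbitrary precision silently; true division yields a float.
assert isinstance(2 ** 100, int)
assert isinstance(1 / 2, float)

# str * int is defined (repetition), but str + int is rejected outright.
assert "2" * 3 == "222"
assert try_add("2", 3).startswith("TypeError")
```

So Python is loose about numbers and truthiness but strict about mixing strings and numbers, which fits the "continuum" framing.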
A hand-tuned Rust parser on a 2021 machine? I can imagine it handling hundreds of kilobytes without major issues.
Still, there's some "performance tuning itch" that this doesn't quite scratch. I can't get past the notion that this kind of thing ought to be done incrementally, even when the practical evidence says that it's not worth it.
Glances at the memory usage of Goland in a moderately sized project and weeps
Having a parser which generates an AST is just the first step. Then, you actually need to implement all the rules of the language, so for instance the scoping rules, how the object system works, any other built-in compound/aggregate types, other constructs like decorators, generics, namespaces or module systems, and on and on and on. Depending on the language, this will usually be the main work.
And then of course there's dynamic typing - if you want to enable smart completions for a dynamically typed language, you need to implement some kind of type deduction. This alone can take a lot of time to implement.
And don't even try to be smart with dynamically typed languages, it cannot possibly be reliable, short of actually executing the program. If your programs are short enough you won't need it, and if you do need such static analysis… consider switching to a statically typed language instead.
There’s still work to do, but having tree sitter in neovim feels like a great step forward.
Heh. A long time ago I wrote a video game somewhat similar to Williams Defender, and casting about for some sort of "theme" for the game, I hit upon the "editor wars", the ancient storied battle between vi and emacs. You are ostensibly "vi", (a little spaceship vaguely reminiscent of the Vipers from Battlestar Galactica) cruising through system memory, evading system processes, GDB instances, etc trying to recover your ".swp" files. How to represent Emacs? Obviously, via a giant blimp! and I could display all sorts of messages on the side of the blimp, singing the praises of Emacs, and disparaging fans of vi. And the Emacs blimp had a "memory leak", which meant that pieces of the xemacs source code would literally leak out of the back end of the blimp, with the letters floating lazily away, like smoke. So that meant I had to take a look at the xemacs source, dig through it and try to find some funny bits to put in. Of course, "semantic bovinate" jumped out at me.
Quite a lot of languages are already supported, it's really nice to see. I might have a use for such a library for a personal project :)
You can play around with the playground here: https://tree-sitter.github.io/tree-sitter/playground
(I appreciate the complexity of the problem, btw)
On the other hand, I find the "frivolous indulgence" perspective extremely obnoxious along with the related implication of moral or technical superiority of not using syntax highlighting.
As a side note, the way it helps many people who prefer it has some fascinating cog-psych underpinnings: https://en.wikipedia.org/wiki/Visual_search
Sometimes I wonder if those who don't prefer it might have some synesthesia which might allow their brain hardware to provide what the syntax highlighting does for the rest of us.
You get the same attitude a lot for things like autocomplete and even for static typing.
If I could have some sort of focus follows mind where highlighting automatically happens commensurate to what level of granularity I'm currently thinking about the code at I would be extremely interested, but absent "focus follows mind" it's a trade-off that everybody has to make for themselves.
Some people prefer to highlight almost everything, some almost nothing, some people find it helpful for some languages/tasks but not for others.
It's similar IME to the extent to which preferred debugging styles (printf versus interactive versus hybrid versus situational choices) are also something people have to figure out, and, well, different people are different, and that's neither a bad thing nor an avoidable thing.
I would not at all be surprised if people who started off one way or the other (for generational reasons or indeed any 'whatever environment they were first introduced to' style reasons) are less likely to end up switching, but that probably says more about perceived switching costs than about what would be most comfortable for somebody.
e.g. I know people who took a month to be comfortable without synhi but then loved that, and I've spent weeks trying to be more comfortable -with- and given up, and honestly anything that half screws your productivity for over a week is going to be a hard sell even if the end result -would- be better (waves in "also, still can't manage to drive emacs" ;)
It was hard for the first few hours, but then I eventually got used to it, and now I can't use anything else.
I know this is not quite as extreme as working without syntax highlighting :)
Ie, the paragraph (or block of code) your cursor is focused on is visible, the rest of the code is blurred out.
Edited to add that I found this for VS Code which I might try:
Although I can definitely imagine this being a plugin in different editors
It certainly sounds like an experiment that would be interesting to try, though.
I can imagine more useful highlighting than color coding the types of the symbols encountered. Lighting up the active scopes. Giving the same hue to names that look like each other. There are probably highlighters out there that do that. But "simple" syntax highlighting is still the norm.
> most colours are actively adding irrelevant information to the cognitive load of existing. It should be obvious that apples are red and the sky is blue.
That’s silly, because it does add relevant information. Obviously it’s a spectrum - too many colours can hide information, but when used appropriately it’s fine.
Also everyone is different. Perhaps your brain gets distracted by the colours more than the majority of people.
As you had guessed a little later, there are a few different emacs packages that do this. One of them is "rainbow parentheses", which gives every bracket a different colour (remember that emacs supports lisp, so differentiating between lots of different parentheses is arguably more useful in emacs than in any other editor).
Another one is highlight parentheses, which highlights all parens that enclose the cursor position, and gives a darker colour to those "further away" from the cursor.
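The core of a rainbow-parentheses style highlighter is just a nesting-depth counter. Here is a sketch in Python; the palette and function names are invented, and it is not how the emacs packages are actually implemented.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
PALETTE = ["red", "orange", "green", "blue"]  # arbitrary example colours

def bracket_depths(text):
    """Return (index, char, depth) for every bracket; depth -1 marks a mismatch."""
    stack, out = [], []
    for i, ch in enumerate(text):
        if ch in PAIRS.values():
            stack.append(ch)
            out.append((i, ch, len(stack) - 1))
        elif ch in PAIRS:
            if stack and stack[-1] == PAIRS[ch]:
                stack.pop()
                out.append((i, ch, len(stack)))
            else:
                out.append((i, ch, -1))
    return out

def color_for(depth):
    """Cycle through the palette by depth; mismatched brackets get 'error'."""
    return "error" if depth < 0 else PALETTE[depth % len(PALETTE)]
```

Matching brackets end up with the same depth and therefore the same colour. Note that this naive version knows nothing about strings or comments, so a paren inside a string literal would throw the colouring off; doing it properly needs at least a lexer.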
Emacs' 'Rainbow Identifiers' does that. I like it.
LSPs provide an "outline" which can be very useful to navigate through code. I find the "jump to symbol" function in my text editor to be faster than scanning all of the code to find the line.
Also most themes dim the comments, but IMO if something in the code needed an explanation, it should be brighter, not dimmer.
That makes me crazy! I use base2tone, which is not nearly as minimal as your theme but more than most, and I modify the comments to be bright.
I also like the idea of using colour to distinguish different identifiers, e.g. https://wordsandbuttons.online/lexical_differential_highligh...
Somehow I never found a need to change that. I highlight comments, keywords, and strings. Comment and string highlights are helpful if they contain code-like text, to make them obviously not-code. Keywords give some structure to the text.
Everything else is frivolous to me. Books do not highlight verbs in green, either.
While I will not argue with your general point -- I also don't really need highlighting and I read a lot of plaintext code -- I wonder about this.
Would this make languages easier for non-native speakers? Would it improve comprehension?
It's funny that the industry spends so much time on syntax highlighting for programming languages, when humanity's written languages are arguably more complex and difficult to parse and master.
When I've been trying to learn languages, I can typically part-of-speech tag unknown words quite easily (common prefixes/suffixes/word length/sentence position give lots of information – and some of this is shared across languages as well). The comprehension difficulty is nearly always due to content words I haven't seen before (or have forgotten).
So, it occurred to me that whether syntax highlighting is actually useful depends somewhat on the context, what are you trying to do?!
I suppose it's easy to extend that realization to people who are different and might feel overloaded by information more easily, so I can sympathize with what you're saying (hope this doesn't sound condescending, I am just trying to say people can have very different cognition overload levels, regardless of how capable they actually are in general).
(I have almost no -ambient- highlighting in my baseline but I know lots of people who do and still derive great value from showmatch for the feedback - from discussions with other people rainbow parens style lisp modes seem to provide a maximum overkill approach to that question but I very much prefer maximum underkill in my own tooling even while wanting to be very sure I'm not making it unduly difficult for collaborators with opposite preferences)
Some of the alternatives can be found by starting at: https://www.gnu.org/software/emacs/manual/html_node/emacs/Ma...
I can sympathise with both sides; I like syntax highlighting when it's done well - when it's distracting I turn it off.
Seeing a keyword highlighted within a comment is an instant red flag - unfortunately it happens loads in Azure Data Studio (which I need to occasionally use).
Never happened in TreeSitter though.
even if you can, surely you're wasting time and/or focus on an automatable task.
But we all have something we do 'the hard way' because it feels like more effort to relearn the task than it's worth, or because we tried the easy way once and were put off by some side-effect.
paren highlighting never comes as a single unit, it's always packaged with other 'helpful' tools, some subset of which will always be infuriating to someone.
True, but mentally balancing parentheses is usually something that you do while writing the code: you pushpop a little stack in your head and this becomes second nature.
Mentally verifying if parentheses are balanced while reading code is hardly ever required. You can usually safely assume that they are (unless that darn compiler tells you otherwise).
maybe i just dont have that stack well enough built in my head--if im editing in a plugin-free vim, i do find i have to backtrack and count to make sure i've put the right number of kets at the end of a nested expression.
if i used s-expressiony instead of tab-heavy languages more often i'm sure i'd be better at it.
I do have a minimal amount of highlighting though.
Much of syntax highlighting in the wild is junk, just distracting eye candy.
Ironically, the most successful IDEs today, the Jetbrains ones, are demonstrations of this. They are built out of reusable components that are combined to produce a wide range of IDEs for different languages.
LSP and DAP aren't perfect, but they're a huge step in the right direction. There's no reason people shouldn't be able to use the editor of their choice along with common tooling for different languages. The fact that IDEs had (for a while) better autocomplete, for example, than emacs wasn't because of some inherent advantage an IDE has over an editor. It's because the people that wrote the language analysis tools behind that autocomplete facility deliberately chose to package them in such a way that they could only be used with one blessed editor. It's great to see the fight back against that, and especially so to see Microsoft (of all people) embracing it with LSP, Roslyn, etc.
One point in favor of tight integration and against LSPs is that editing programs isn't like editing unstructured text at all and shouldn't be presented as such. There are tons of ways in which the IDE UX can be enhanced using syntactic and semantic knowledge of programs. Having a limited and standardized interface between the UI and a module providing semantic information will just hamper such innovation.
It's true that if you own both the editor and the language analysis tools you can more rapidly add new capabilities, but many facilities that were historically the domain of IDEs, such as autocomplete, are very easy to standardise an interface for, and this has been done. Supporting such interfaces doesn't prevent you from also supporting nonstandard/internal interfaces for more cutting-edge capabilities. The argument made by Jetbrains is similar to the one you've made and it's entirely false. They could easily support LSP and it would have no impact on their ability to innovate. They refuse to do so for purely business reasons (as is their right).
Editing instead of replying as the depth limit is reached (bad form, perhaps, but gmueckl's reply is in the form of a question and I'd like to respond): The necessary UI capabilities for the features you describe already exist in emacs. Multiple alternative implementations of them, actually (lsp-mode vs eglot). It's the editor's job to provide the UI and the LSP server's job to provide the backend. The interface between them is easy to standardise and it has been done (yes, even for the features you mention).
(Seems I can reply after all so have done so, and now I appear unable to edit the GP and remove the text above from it :-/)
As a trivial example, let's say for some reason I keep needing to generate the sha256 hash of a password and add it to the current file. I could add this to my .emacs:
(defun my-insert-password-hash (password)
  "Insert the SHA-256 hash of PASSWORD at point."
  (interactive "MPassword: ")
  (insert (secure-hash 'sha256 password)))
(global-set-key (kbd "C-c p h") 'my-insert-password-hash)
This kind of user interaction can equally easily allow the user to select from lists of dynamically determined options, including very large ones (with nice fuzzy matching menus if you use ivy, helm or similar). It's also trivial to write functions that prompt for several pieces of information, ask for different information depending on context, etc.
In the case of LSP, the server only has to provide information about what options are available and what possible responses are permitted. It's easy for emacs to dynamically provide the corresponding UI.
You're effectively confirming that there needs to be feature-specific integration code for each and every navigation/refactoring/... feature in the editor. Once you have that, you again have tight coupling.
User to emacs: I want to perform a code action.
Emacs to server: The selection is this. What code actions can you perform?
Server to emacs: I can perform actions called "Extract Method", "Extract Base Class", etc.
Emacs to user: Choose an action from this list
User to emacs: I'll do "Extract method"
Emacs to server: We're going with "Extract Method", what info do you need?
Server to emacs: I need an item called "method name" which is a string and an item called "visibility", which is one of the following: "public", "private", "protected".
Emacs to user: Enter method name... Select visibility...
Emacs to server: Here are the parameter values.
Server: Here are the edits necessary to perform the code action
Emacs: Updates code.
No special per-action code is required. If you want to see an implementation have a look at lsp-mode.
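To illustrate why no per-action client code is needed, here is a toy simulation of the dialogue above in Python. Everything in it (`ToyServer`, the field descriptions, `run_code_action`) is hypothetical and only mirrors the conversation as written; the real LSP messages (e.g. `textDocument/codeAction`) have different shapes, so treat this as a sketch of the idea rather than of the protocol.

```python
class ToyServer:
    """Stand-in for a language server; the message shapes here are made up."""

    ACTIONS = {
        "Extract Method": [
            {"name": "method name", "type": "string"},
            {"name": "visibility", "type": "choice",
             "options": ["public", "private", "protected"]},
        ],
        "Extract Base Class": [{"name": "class name", "type": "string"}],
    }

    def list_actions(self, selection):
        return list(self.ACTIONS)

    def describe(self, action):
        return self.ACTIONS[action]

    def perform(self, action, selection, params):
        # A real server would compute text edits; this just describes them.
        return [f"{action} on {selection!r} with {params}"]


def run_code_action(server, selection, choose, read_string):
    """Generic client loop: nothing in here is specific to any one action."""
    action = choose("Action", server.list_actions(selection))
    params = {}
    for field in server.describe(action):
        if field["type"] == "choice":
            params[field["name"]] = choose(field["name"], field["options"])
        else:
            params[field["name"]] = read_string(field["name"])
    return server.perform(action, selection, params)
```

The editor only supplies two generic UI primitives, `choose` (pick from a list) and `read_string` (prompt for text). Adding a new action on the server side means adding an entry to `ACTIONS`; the client loop does not change.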
I hope that makes it clear. I've spent more of my day explaining this than I should have now, so I'll leave it there.
Maybe. That can certainly be a downside of standardisation in general. However, it doesn't necessarily follow in all cases, and this is, I think, one where it doesn't. The features LSP provides are stable - and have been standard across most editors/IDEs for quite some time. Implementing them once for N editors, rather than N times, is just far more approachable (and appealing) for language tooling developers.
It doesn't stop those developers (or anyone else) adding features beyond the LSP standard. But that means doing it in an editor-specific way. Which is no worse than where we were before anyway.
People buy IDE -> money goes to improving the IDE -> IDE gets better
People download one of 6 competing open-source plugins -> a couple of people improve it a little -> 3 years pass, the author loses interest -> someone else reinvents the wheel, there are now 7 competing open-source plugins, 3 of which are good but not maintained anymore.
Great features require time, I just don't see non-commercial work succeeding here.
Doesn't mean it's not possible to create fantastic commercial open-source standalone language tools, it's just not happening for some reason. Probably just because most businesses are still hesitant to open-source their core business?
I think there _are_ ways to do it right. For instance, open-sourcing a Windows application is not necessarily problematic if 99.9% of your user base has never ever compiled something from source. Heck, my father is the kind of person who doesn't know the difference between "Windows" and "gmail". He has purchased software for his business once, it would've made no difference to him whether it was open-source or not.
Despite my believing that it's possible, I can't really think of any examples other than Red Hat and Qt off the top of my head...
What a fatuous remark. I publish the Coca-Cola recipe to Pepsi drinkers. I'm fairly sure the recipe will eventually get around to a home brewer who's sick of paying the Coca-Cola company for its product.
I used to work with Eclipse and it supported everything* through plugins just nicely.
I ask because modern editors can do most things people often regard as IDE only but there is still the odd gem that’s worth hearing about.
- Add a parameter to a method signature and fill it in with a default value.
- Reorder method parameters.
- Extract a block of code to a method that infers the correct input and output types.
The most advanced refactoring I've done with IntelliJ is structural replace for Java which can do something like: for every private method matching the name "*DoThing" defined on a class that extends Foo, insert a log line at the beginning of the method: https://www.jetbrains.com/help/idea/structural-search-and-re...
I make heavy use of the "integrated" aspect of IntelliJ. One of the nicer benefits is that SQL strings are type-checked against a database schema.
But the above things are not done in LSP generally. It doesn’t have first-class support for structural search replace. It doesn’t have support for interactive refactors which require user input.
> The features language servers provide are only some of the many things that IDE users have had for literally decades.
Yes of course, because that's what they were explicitly designed to do. The novel thing about language servers isn't that they enable code intelligence features like auto-complete and variable renaming. It's that they do so over a standard protocol that any editor or IDE (or website or CI system or ...) can use.
And the reason for that is mostly down to fragmentation: the vim guys are doing their thing; the Emacs theirs, etc.
Now focus that energy into a singular project like a Language Server and the payout is likely to be many orders of magnitude greater.
I think that's a bit different from just being a "Microsoft offering".
Very hard to figure it all out when it's not your core domain.
I used Nom. Even though it's not incremental, parsing is easily fast enough to just reparse the entire document on each change.
An alternative is to just use Tree Sitter as your parser for the CLI too. You won't use the incremental parsing feature in the CLI but that's fine.
Supporting IntelliJ may be tricky but there is a WIP plugin that adds LSP support.
While I agree... he might be surprised to know that that is what all language servers do anyway, even if they don't provide syntax highlighting. Every keystroke gets sent over the LSP. As JSON. It's amazing it works as well as it does.
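For the curious, this is roughly what one keystroke looks like on the wire, sketched in Python. The `textDocument/didChange` method and the `Content-Length` framing come from the LSP specification; the helper names and the example URI are made up, and whether a server accepts range-based (incremental) changes like this or wants the full text each time depends on the sync kind negotiated at startup.

```python
import json

def frame(message):
    """Encode a JSON-RPC message with the Content-Length header LSP uses."""
    body = json.dumps(message, separators=(",", ":")).encode("utf-8")
    return b"Content-Length: %d\r\n\r\n" % len(body) + body

def did_change(uri, version, line, character, inserted):
    """Build an incremental didChange notification for one inserted string."""
    position = {"line": line, "character": character}
    return {
        "jsonrpc": "2.0",
        "method": "textDocument/didChange",
        "params": {
            "textDocument": {"uri": uri, "version": version},
            "contentChanges": [
                # An empty range (start == end) means a pure insertion.
                {"range": {"start": position, "end": position}, "text": inserted}
            ],
        },
    }

# Hypothetical example: the user types "x" at line 0, column 5.
packet = frame(did_change("file:///tmp/example.py", 2, 0, 5, "x"))
```

So a single typed character turns into a couple of hundred bytes of JSON plus a header, which sounds wasteful but is tiny compared with the work the server then does on the document.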
I was wondering whether, if both aim to achieve similar goals, it makes any sense to run them both, but now I can educate myself.