I don't think it's easy to write a good syntax coloring engine like the one in Vim.
Syntax coloring has to handle context: different rules for material nested in certain ways.
Vim's syntax higlighter lets you declare two kinds of items: matches and regions. Matches are simpler lexical rules, whereas regions have separate expressions for matching the start and end and middle. There are ways to exclude leading and trailing material from a region.
Matches and regions can declare that they are contained. In that case they are not active unless they occur in a containing region.
Contained matches declare which regions contain them.
Regions declare which other regions they contain.
That's the basic semantic architecture; there are bells and whistles in the system due to situations that arise.
I don't think even Justine could develop that in an interview, other than as an overnight take home.
Everything that is white (not colored) is literal template material. Lisp code is embedded in directives, like @(do ...). In this scheme, TXR keywords appear purple, TXR Lisp ones green. They can be the same; see the (and ...) in line 149, versus numerous occurrences of @(and).
Quasistrings contain nested syntax: see 130 where `<a href ..> ... </a>` contains an embedded (if ...). That could itself contain a quasistring with more embedded code.
TXR's txr.vim" and tl.vim* syntax definition files are both generated by this:
Some random things that the author seem to have missed:
> but TypeScript, Swift, Kotlin, and Scala take string interpolation to the furthest extreme of encouraging actual code being embedded inside strings
Many more languages support that:
C# $"{x} plus {y} equals {x + y}"
Python f"{x} plus {y} equals {x + y}"
JavaScript `${x} plus ${y} equals ${x + y}`
Ruby "#{x} plus #{y} equals #{x + y}"
Shell "$x plus $y equals $(echo "$x+$y" | bc)"
Make :) echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"
> Tcl
Tcl is funny because comments are only recognized in code, and since it's a homoiconic, it's very hard to distinguish code and data. { } are just funny string delimiters. E.g.:
xyzzy {#hello world}
Is xyzzy a command that takes a code block or a string? There's no way to tell. (Yes, that means that the Tcl tokenizer/parser cannot discard comments: only at evaluation time it's possible to tell if something is a comment or not.)
my $foo = 5;
my $bar = 'x';
my $quux = "I have $foo $bar\'s: @{[$bar x $foo]}";
print "$quux\n";
This prints out:
I have 5 x's: xxxxx
The "@{[...]}" syntax is abusing Perl's ability to interpolate an _array_ as well as a scalar. The inner "[...]" creates an array reference and the outer "@{...}" dereferences it.
For reasons I don't remember, the Perl interpreter allows arbitrary code in the inner "[...]" expression that creates the array reference.
This was a fun read, but it left me a bit more sympathetic to the lisp perspective, which (if I've understood it) is that syntax, being not an especially important part of a language, is more of a hurdle than a help, and should be as simple and uniform as possible so we can focus on other things.
Which is sort of ironic because learning how to do structural editing on lisps has absolutely been more hurdle than help so far, but I'm sure it'll pay off eventually.
Having a simple syntax might be fine for computers but syntax is mainly designed to be read and written by humans. Having a simple one like lisp then just makes syntactic discussions a semantic problem, just shifting the layers.
And I think an complex syntax is far easier to read and write than a simple syntax with complex semantics. You also get a faster feedback loop in case the syntax of your code is wrong vs the semantics (which might be undiscovered until runtime).
Jury's out re: whether I feel this in my gut. Need more time with the lisps for that. But re: cognitive load maybe it goes like:
1. 1 language to rule them all, fancy syntax
2. Many languages, 1 simple syntax to rule them all
3. Many languages and many fancy syntaxes
Here in the wreckage of the tower of babel, 1. isn't really on the table. But 2. might have benefits because the inhumanity of the syntax need only be confronted once. The cumulative cost of all the competing opinionated fancy syntaxes may be the worst option. Think of all the hours lost to tabs vs spaces or braces vs whitespace.
Lisp has reader macros which allow you to reprogram its lexer. Lisp macros allow you to program the translation from the visible structure to the parse tree.
I am surprised to hear that structural editing has been a hurdle for you, and I think I can offer a piece of advice. I also used to be terrified by its apparent complexity, but later found out that one just needs to use parinfer and to know key bindings for only three commands: slurp, barf, and raise.
With just these four things you will be 95% there, enjoying the fruits of paredit without any complexity — all the remaining tricks you can learn later when you feel like you’re fluent.
<rant> It's not so much the editing itself but the unfamiliarity of the ecosystem. It seems it's a square-peg I've been crafting a round hole of habits for it:
I guess I should use emacs? How to even configure it such that these actions are available? Or maybe I should write a plugin for helix so that I can be in a familiar environment. Oh, but the helix plugin language is a scheme, so I guess I'll use emacs until I can learn scheme better and then write that plugin. Oh but emacs keybinds are conflicting with what I've configured for zellij, maybe I can avoid conflicts by using evil mode? Oh ok, emacs-lisp, that's a thing. Hey symex seems like it aligns with my modal brain, oh but there goes another afternoon of fussing with emacs. Found and reported a symex "bug" but apparently it only appears in nix-governed environments so I guess I gotta figure out how to report the packaging bug (still todo). Also, I guess I might as well figure out how to get emacs to evaluate expressions based on which ones are selected, since that's one of the fun things you can do in lisps, but there's no plugin for the scheme that helix is using for its plugin language (which is why I'm learning scheme in the first place), but it turns out that AI is weirdly good at configuring emacs so now my emacs config contains most that that plugin would entail. Ok, now I'm finally ready to learn scheme, I've got this big list of new actions to learn: https://countvajhula.com/2021/09/25/the-animated-guide-to-sy.... Slurp, barf, and raise you say? excellent, I'll focus on those.
I'm not actually trying to critique the unfamiliar space. These are all self inflicted wounds: me being persnickety about having it my way. It's just usually not so difficult to use something new and also have it my way.</rant>
It almost looks like special error handling syntax but still remains satisfying once you realize it's an || logical-or statement and it's using short circuiting rules to execute handle error if the action returns a non-zero value.
I’d be interested to see a re-usable implementation of joe's[0] syntax highlighting.[1] The format is powerful enough to allow for the proper highlighting of Python f-strings.[2]
It’s just that nesting them arbitrarily is now allowed, right? That shouldn’t matter much for a mere syntax highlighter then. And one could even argue that code that relies on this too much is not really for human consumption.
Also, you can now use the same quote character that encloses an f-string within the {} expressions. That could make them harder to tokenize, because it makes it harder to recognise the end of the string.
$ sml < hello.sml
Standard ML of New Jersey (64-bit) v110.99.5 [built: Thu Mar 14 17:56:03 2024]
- = hello, world
$ mlton hello.sml && ./hello
hello, world
Given how C was considered one of the "expressive" languages when it arrived, it's curious that nested comments were never part of the language.
There are 3 things I find funny about that comment: ML didn’t have single-line comments, so same level of surprising limitation. I’ve never heard someone refer to C as “expressive”, but maybe it was in 1972 when compared to assembly. And what bearing does the comment syntax have on the expressiveness of a language? I would argue absolutely none at all, by definition. :P
> ML didn’t have single-line comments, so same level of surprising limitation.
It is not quite clear to me why the lack of single-line comments is such a surprising limitation. After all, a single-line block comment can easily serve as a substitute. However, there is no straightforward workaround for the lack of nested block comments.
> I’ve never heard someone refer to C as “expressive”, but maybe it was in 1972 when compared to assembly.
I was thinking of Fortran in this context. For instance, Fortran 77 lacked function pointers and offered a limited set of control flow structures, along with cumbersome support for recursion. I know Fortran, with its native support for multidimensional arrays, excelled in numerical and scientific computing but C quickly became the preferred language for general purpose computing.
While very few today would consider C a pinnacle of expressiveness, when I was learning C, the landscape of mainstream programming languages was much more restricted. In fact, the preface to the first edition of K&R notes the following:
"In our experience, C has proven to be a pleasant, expressive and versatile language for a wide
variety of programs."
C, Pascal, etc. stood out as some of the few mainstream programming languages that offered a reasonable level of expressiveness. Of course, Lisp was exceptionally expressive in its own right, but it wasn't always the best fit for certain applications or environments.
> And what bearing does the comment syntax have on the expressiveness of a language?
Nothing at all. I agree. The expressiveness of C comes from its grammar, which the language parser handles. Support for nested comments, in the context of C, is a concern for the lexer, so indeed one does not directly influence the other. However, it is still curious that a language with such a sophisticated grammar and parser could not allocate a bit of its complexity budget to support nested comments in its lexer. This is a trivial matter, I know, but I still couldn't help but wonder about it.
Fair enough. From my perspective, lack of single line comments is a little surprising because most other languages had it at the time (1973, when ML was introduced). Lack of nested comments doesn’t seem surprising, because it isn’t an important feature for a language, and because most other languages did not have it at the time (1972, when C was introduced).
I can imagine both pro and con arguments for supporting nested comments, but regardless of what I think, C certainly could have added support for nested comments at any time, and hasn’t, which suggests that there isn’t sufficient need for it. That might be the entire explanation: not even worth a little complexity.
Except that text inside #if 0 still has to lex correctly.
(unifdef has some evil code to support using C-style preprocessor directives with non-C source, which mostly boils down to ignoring comments. I don’t recommend it!)
Another syntax oddity (not mentioned here) that breaks most highlighters: In Java, unicode escapes can be anywhere, not just in strings. For example, the following is a valid class:
class Foo\u007b}
and this assert will not trigger:
assert
// String literals can have unicode escapes like \u000A!
"Hello World".equals("\u00E4");
I also argue that failing to syntax highlight this correctly is a security issue. You can terminate block comments with Unicode escapes, so if you wanted to hide some malicious code in a Java source file, you just need an excuse for there to be a block of Unicode escapes in a comment. A dev who doesn’t know about this quirk is likely to just skip over it, assuming it’s commented out.
InputCharacter:
UnicodeInputCharacter but not CR or LF
UnicodeInputCharacter is defined as the following in section 3.3:
UnicodeInputCharacter:
UnicodeEscape
RawInputCharacter
UnicodeEscape:
\ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
u {u}
HexDigit:
(one of)
0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
RawInputCharacter:
any Unicode character
As a result the lexical analyser honours Unicode escape sequences absolutely anywhere in the program text. For example, this is a valid Java program:
public class Bar {
public static void \u006d\u0061\u0069\u006e(String[] args) {
System.out.println("hello, world");
}
}
Here is the output:
$ javac Bar.java && java Bar
hello, world
However, this is an incorrect Java program:
public class Baz {
// This comment contains \u6d.
public static void main(String[] args) {
System.out.println("hello, world");
}
}
> TypeScript, Swift, Kotlin, and Scala take string interpolation to the furthest extreme of encouraging actual code being embedded inside strings. So to highlight a string, one must count curly brackets and maintain a stack of parser states.
Presumably this is also true in Python - IIRC the brace-delimited fields within f-strings may contain arbitrary expressions.
More generally, this must mean that the lexical grammar of those languages isn't regular. "Maintaining a stack" isn't part of a finite-state machine for a regular grammar - instead we're in the realm of pushdown automata and context-free grammars.
Is it even possible to support generalized string interpolation within a strictly regular lexical grammar?
Complicated interpolation can be lexed as a regular language if you treat strings as three separate lexical things, eg in JavaScript template literals there are,
`stuff${
}stuff${
}stuff`
so the ${ and } are extra closing and opening string delimiters, leaving the nesting to be handled by the parser.
You need a lexer hack so that the lexer does not treat } as the start of a string literal, except when the parser is inside an interpolation but all nested {} have been closed.
Justine gets very close to the hairiest parsing issue in any language without encountering it:
Perl's syntax is undecidable, because the difference between treating some characters as a comment or as a regex can depend on the type of a variable that is only determined e.g. based on whether a search for a Collatz counterexample terminates, or just, you know, user input.
I think possibly the most hilariously complicated instance of this is in perl’s tokenizer, toke.c (which starts with a Tolkien quote, 'It all comes from here, the stench and the peril.' — Frodo).
There’s a function called intuit_more which works out if $var[stuff] inside a regex is a variable interpolation followed by a character class, or an array element interpolation. Its result can depend on whether something in the stuff has been declared as a variable or not.
But even if you ignore the undecidability, the rest is still ridiculously complicated.
As soon as I saw this was part of llamafile I was hoping that it would be used to limit LLM output to always be "valid" code as soon as it saw the backticks, but I suppose most LLMs don't have problems with that anyway. And I'm not sure you'd want something like that automatically forcing valid code anyway
llama.cpp does support something like this -- you can give it a grammar which restricts the set of available next tokens that are sampled over
so in theory you could notice "```python" or whatever and then start restricting to valid python code. (in least in theory, not sure how feasible/possible it would be in practice w/ their grammar format.)
for code i'm not sure how useful it would be since likely any model that is giving you working code wouldn't be struggling w/ syntax errors anyway?
but i have had success experimentally using the feature to drive fiction content for a game from a smaller llm to be in a very specific format.
yeah, ive used llama.cpp grammars before, which is why i was thinking about it. i just think it'd be cool for llamafile to do basically that, but with included defaults so you could eg, require JSON output. it could be cool for prototyping or something. but i dont think that would be too useful anyway, most of the time i think you would want to restrict it to a specific schema, so i can only see it being useful for something like a tiny local LLM for code completion, but that would just encourage valid-looking but incorrect code.
i think i just like the idea of restricting LLM output, it has a lot of interesting use cases
gotchya. i do think that is a cool idea actually -- LLMs tiny enough to do useful things with formally structured output but not big enough to nail the structure ~100% is probably not an empty set.
Right, c" is for when you want to pass a literal string to some other word, not print it. But I agree that it's not very common, because you normally use s" for that, which leaves the address and length on the stack, while c" leaves just an address on the stack, pointing to a one-byte count field followed by the bytes. I think adding c" in Forth-83 (and renaming " to s") was a mistake, and it would have been better to deprecate the standard words that expect or produce such counted strings, other than count itself. See https://forth-standard.org/standard/alpha, https://forth-standard.org/standard/core/Cq, https://forth-standard.org/standard/core/COUNT, and https://forth-standard.org/standard/core/Sq.
You can easily add new string and comment syntaxes to Forth, though. For example, you can add BCPL-style // comments to end of line with this line of code in, I believe, all standard Forths, though I've only tested it in GForth:
: // 10 word drop ; immediate
Getting it to work in block files requires more work but is still only a few lines of code. The standard word \ does this, and see \ decompiles the GForth implementation as
: \
blk @
IF >in @ c/l / 1+ c/l * >in ! EXIT
THEN
source >in ! drop ; immediate
This kind of thing was commonly done for text editor commands, for example; you might define i as a word that reads text until the end of the line and inserts it at the current position in the editor, rather than discarding it like my // above. Among other things, the screen editor in F83 does exactly that.
So, as with Perl, PostScript, TeX, m4, and Lisps that support readmacros, you can't lex Forth without executing it.
Counted (“Pascal”) strings are rare nowadays so C" is not often used. Its addr len equivalent is S" and that one is fairly common in string manipulation code.
As for C#'s triple-quoted strings, they actually came from Java before and C# ended up adopting the same or almost the same semantics. Including stripping leading whitespace.
I don't understand why you wouldn't use Tree Sitter's syntax highlighting for this. I mean it's not going to be as fast but that clearly isn't an issue here.
Meanwhile NeoVim doesn’t syntax highlight my commit message properly if I have messed with "commit cleanup" enough.
The comment character in Git commit messages can be a problem when you insist on prepending your commits with some "id" and the id starts with `#`. One suggestion was to allow backslash escapes in commit messages since that makes sense to a computer scientist.[1]
But looking at all of this lexical stuff I wonder if makes-sense-to-computer-scientist is a good goal. They invented the problem of using a uniform delimiter for strings and then had to solve their own problem. Maybe it was hard to use backtick in the 70’s and 80’s, but today[2] you could use backtick to start a string and a single quote to end it.
What do C-like programming languages use single quotes for? To quote characters. Why do you need to quote characters? I’ve never seen a literal character which needed an "end character" marker.
Raw strings would still be useful but you wouldn’t need raw strings just to do a very basic thing like make a string which has typewriter quotes in it.
Of course this was for C-like languages. Don’t even get me started on shell and related languages where basically everything is a string and you have to make a single-quote/double-quote battle plan before doing anything slightly nested.
There’s surprising layers to this. That the reporter in that thread says that git-commit will “happily” accept `#` in commit messages is half-true: it will accept it if you don’t edit the message since the `default` cleanup (that you linked to) will not remove comments if the message is given through things like `-m` and not an editing session. So `git commit -m'#something' is fine. But then try to do rebase and cherry-pick and whatever else later, maybe get a merge commit message with a commented "conflicted" files. Well it can get confusing.
> Maybe it was hard to use backtick in the 70’s and 80’s, but today[2] you could use backtick to start a string and a single quote to end it.
That's how quoting works by default in m4 and TeX, both defined in the 70s. Unfortunately Unicode retconned the ASCII apostrophe character ' to be a vertical line, maybe out of a misguided deference to Microsoft Windows, and now we all have to suffer the consequences. (Unless we're using Computer Modern fonts or other fonts that predate this error, such as VGA font ROM dumps.)
In the 70s and 80s, and into the current millennium on Unix, `x' did look like ‘x’, but now instead it looks like dogshit. Even if you are willing to require a custom font for readability, though, that doesn't solve the problem; you need some way to include an apostrophe in your quoted string!
As for end delimiters, C itself supports multicharacter literals, which are potentially useful for things like Macintosh type and creator codes, or FTP commands. Unfortunately, following the Unicode botch theme, the standard failed to define an endianness or minimum width for them, so they're not very useful today. You can use them as enum values if you want to make your memory dumps easier to read in the debugger, and that's about it. I think Microsoft's compiler botched them so badly that even that's not an option if you need your code to run on it.
> Unfortunately Unicode retconned the ASCII apostrophe character ' to be a vertical line
Unicode does not precribe the appearance of characters. Although in the code chart¹ it says »neutral (vertical) glyph with mixed usage« (next to »apostrophe-quote« and »single quote«), font vendors have to deal with this mixed usage. And with Unicode the correct quotation marks have their own code points, making it unnecessary to design fonts where the ASCII apostrophe takes their form, but rendering all other uses pretty ugly.
I would regard using ` and ' as paired quotation marks as a hack from times when typographic expression was simply not possible with the character sets of the day.
_________
¹
0027 ' APOSTROPHE
= apostrophe-quote (1.0)
= single quote
= APL quote
• neutral (vertical) glyph with mixed usage
• 2019 ’ is preferred for apostrophe
• preferred characters in English for paired quotation marks are 2018 ‘ & 2019 ’
• 05F3 ׳ is preferred for geresh when writing Hebrew
→ 02B9 ʹ modifier letter prime
→ 02BC ʼ modifier letter apostrophe
→ 02C8 ˈ modifier letter vertical line
→ 0301 $́ combining acute accent
→ 030D $̍ combining vertical line above
→ 05F3 ׳ hebrew punctuation geresh
→ 2018 ‘ left single quotation mark
→ 2019 ’ right single quotation mark
→ 2032 ′ prime
→ A78C ꞌ latin small letter saltillo«
> That's how quoting works by default in m4 and TeX, both defined in the 70s.
Good point. And it was in m4[1] I saw that backtick+apostrophe syntax. I would have probably not thought of that possibility if I hadn’t seen it there.
[1] Probably on Wikipedia since I have never used it
> Unfortunately Unicode retconned the ASCII apostrophe character ' to be a vertical line, maybe out of a misguided deference to Microsoft Windows, and now we all have to suffer the consequences. (Unless we're using Computer Modern fonts or other fonts that predate this error, such as VGA font ROM dumps.)
I do think the vertical line looks subpar (and I don’t use it in prose). But most programmers don’t seem bothered by it. :|
> In the 70s and 80s, and into the current millennium on Unix, `x' did look like ‘x’, but now instead it looks like dogshit.
Emacs tries to render it like ‘x’ since it uses backtick+apostrophe for quotes. With some mixed results in my experience.
> Even if you are willing to require a custom font for readability, though, that doesn't solve the problem; you need some way to include an apostrophe in your quoted string!
Aha, I honestly didn’t even think that far. Seems a bit restrictive to not be able to use possessives and contractions in strings without escapes.
> As for end delimiters, C itself supports multicharacter literals, which are potentially useful for things like Macintosh type and creator codes, or FTP commands.
I should have made it clear that I was only considering C-likes and not C itself. A language from the C trigraph days can be excused. To a certain extent.
Syntax coloring has to handle context: different rules for material nested in certain ways.
Vim's syntax higlighter lets you declare two kinds of items: matches and regions. Matches are simpler lexical rules, whereas regions have separate expressions for matching the start and end and middle. There are ways to exclude leading and trailing material from a region.
Matches and regions can declare that they are contained. In that case they are not active unless they occur in a containing region.
Contained matches declare which regions contain them.
Regions declare which other regions they contain.
That's the basic semantic architecture; there are bells and whistles in the system due to situations that arise.
I don't think even Justine could develop that in an interview, other than as an overnight take home.
reply