Weird Lexical Syntax (justine.lol)
433 points by jart 38 days ago | 227 comments



I think my favorite C trigraph was something like

  do_action() ??!??! handle_error()
It almost looks like special error-handling syntax, and it remains satisfying once you realize it's just an || logical-or expression that uses short-circuit rules to execute handle_error() only if do_action() returns zero (i.e. a false value).


https://en.wikipedia.org/wiki/Digraphs_and_trigraphs_(progra...

??! is converted to |

So ??!??! becomes ||, i.e. “or”
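
For anyone who wants to see the translation step in isolation, here is a minimal Python sketch of trigraph replacement. The name `replace_trigraphs` is made up for illustration, and the sketch ignores everything else a real C preprocessor does in its early translation phases (line splicing, for example).

```python
import re

# The nine trigraph sequences defined by the C standard:
# "??" followed by the key maps to the value.
TRIGRAPHS = {
    "=": "#", "/": "\\", "'": "^",
    "(": "[", ")": "]", "!": "|",
    "<": "{", ">": "}", "-": "~",
}

# Matches "??" followed by one of the nine trigraph characters.
TRIGRAPH_RE = re.compile(r"\?\?([=/'()!<>-])")

def replace_trigraphs(source):
    """Replace every trigraph in a single left-to-right pass (hypothetical helper)."""
    return TRIGRAPH_RE.sub(lambda m: TRIGRAPHS[m.group(1)], source)

print(replace_trigraphs("do_action() ??!??! handle_error()"))
# prints: do_action() || handle_error()
```

A single regex pass is used instead of nine chained `str.replace` calls so the order of the table cannot matter.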


Did you choose the legacy C trigraphs over || for aesthetic purposes?


No, I don't really write C. I was remarking on something I saw once and I wanted to share it because the original article mentions trigraphs.


Could you review my comment on HN? Please educate me if there is something I haven’t understood, rather than downvoting my question.


The grandparent post is specifically about trigraphs. Saying something about trigraphs was the end-in-itself, trigraphs were chosen to illustrate something about trigraphs. So your question made no sense. Hope that helps.


Maybe the confusion was the other way, more like "why is that funny/interesting?"

An attempt to answer that: In English, mixing ?! at the end of a question is a way of indicating bewilderment. Like "What was that?!"


> So your question made no sense. Hope that helps.

I think that is uncharitable. The question ("Did you choose the legacy C trigraphs over || for aesthetic purposes?") makes perfect sense to me. I think context makes it reasonably clear that the answer is 'yes,' but that doesn't mean that the question doesn't make sense, only perhaps that it didn't need to be asked.


My question was precisely about why the user likes trigraphs over just using || in this case. It is a very clear question and makes perfect sense.


The post shows a “favorite C trigraph” thing, not that they were going out of their way to use trigraphs in actual code or that you should. Using trigraphs is the whole premise so no, your question makes no sense in that context.

FWIW the ??!??! double trigraph as error processing is funny because of the meaning of ?! and various combinations of ? and !. It is funny and it has trigraphs. That’s the whole point.


> The post shows a “favorite C trigraph” thing, not that they were going out of their way to use trigraphs in actual code or that you should.

But I am free to be curious and ask the author why he chose it! We are not computers but human beings! There is no HN rule that says that I cannot be curious and ask a question that arose from a thread but is not connected to it! [1].

[1] https://www.iflscience.com/charles-babbage-once-sent-the-mos...


> But I am free to be curious and ask the author why he chose it

They chose it because the discussion is about trigraphs. It is not a particularly surprising choice to talk about trigraphs in that context.

> There is no HN rule that says that I cannot be curious and ask a question that arose from a thread but is not connected to it!

I did not write that you did not follow HN rules, just that your question was very strange in the context. I did not downvote you, but I understand why some people did. Your question was a bit passive-aggressive, even if you did not mean it.


My reading of the downvoted question was one of genuine curiosity of why the author chose that as a favorite trigraph, as in “why that one instead of another”, not as criticism of the choice of trigraph over something more conventional. I may be wrong of course, but it didn’t seem like a particularly malicious question to me and your rationale unfortunately doesn’t convince me otherwise. Not that it has to, this is all very subjective after all, but just offering up a counter opinion.

I gave the question a +1 because I, as previously stated, read it to be genuine curiosity. Maybe a smiley would’ve helped, I don’t know. ¯\_(ツ)_/¯


I didn't downvote your comment but understand why it looks "wrong": it's like, in a thread on English oddities, you replied to someone bringing up the "buffalo buffalo buffalo" example with the question "why are you so fond of bovines"?


It has nothing to do with that. I could ask why he didn't choose a different homonymic ambiguity [1].

[1] https://journals.linguisticsociety.org/proceedings/index.php...


I think the misunderstanding is a bit of missing context from the original comment. I read it as “one of my favorite trigraphs [that I have seen]”. Their comment didn’t make a claim to using said trigraph, just that they had seen it somewhere and thought it was interesting.

Your comments seem to suggest your reading was “one of my favorite trigraphs [to use is]”, which is understandable as a valid interpretation.

The whole of this chain of comments is a misunderstanding on the original comment’s ambiguity.


Easiest way to get downvotes is to ask people not to give them. You just gotta ignore the haters.


This was a fun read, but it left me a bit more sympathetic to the lisp perspective, which (if I've understood it) is that syntax, being not an especially important part of a language, is more of a hurdle than a help, and should be as simple and uniform as possible so we can focus on other things.

Which is sort of ironic because learning how to do structural editing on lisps has absolutely been more hurdle than help so far, but I'm sure it'll pay off eventually.


Having a simple syntax might be fine for computers but syntax is mainly designed to be read and written by humans. Having a simple one like lisp then just makes syntactic discussions a semantic problem, just shifting the layers.

And I think a complex syntax is far easier to read and write than a simple syntax with complex semantics. You also get a faster feedback loop when the syntax of your code is wrong than when the semantics are (which might go undiscovered until runtime).


Jury's out re: whether I feel this in my gut. Need more time with the lisps for that. But re: cognitive load maybe it goes like:

1. 1 language to rule them all, fancy syntax

2. Many languages, 1 simple syntax to rule them all

3. Many languages and many fancy syntaxes

Here in the wreckage of the tower of babel, 1. isn't really on the table. But 2. might have benefits because the inhumanity of the syntax need only be confronted once. The cumulative cost of all the competing opinionated fancy syntaxes may be the worst option. Think of all the hours lost to tabs vs spaces or braces vs whitespace.


I think 3 is not only a natural state, but the best state.

I don’t think we can have 1 language that satisfies the needs of all people who write code, and thus, we can’t have 1 syntax that does that either.

3 seems the only sensible solution to me, and we have it.


I dunno, here in 3 the hardest part of learning a language has little to do with the language itself and more to do with the ecosystem of tooling around that language. I think we could more easily get on to the business of using the right language for the job if more of that tooling was shared. If only each language, for instance, did not have its own package manager, its own IDE, its own linters and language servers, all with their own idiosyncrasies arising not from deep philosophical differences of the associated language but from accidental quirks of perspective of whoever decided that their favorite language needed a new widget.

I admire the widget makers, especially those wrangling the gaps between languages. I just wish their work could be made easier.


I really like the Linux package managers. If you're going to write an application that will run on some system, it's better to bake dependencies into it. And with virtualization and containerization, the system is not tied to a physical machine. I've been using containers (incus) more and more for real development purposes as I can use almost the same environment to deploy. I don't care much about the IDE, but I'm glad we have LSP, Tree-sitter, and DAP. The one thing I do not like is the proliferation of tooling version managers (NVM, ...) instead of managing the environment itself (tied to the project).


I've always thought these complaints are really just a reflection of how stuck we are in the C paradigm. The idea that you have to edit programs as text is outdated IMO. It should be that your editor operates on the syntax tree of the source code. Once you do that, the code can be displayed in any way.


I also believe this, and we're actually about half way there via MPS <https://github.com/JetBrains/MPS#readme> but I'm pretty sure that dream is dead until this LLM hype blows over, since LLMs are not going to copy-paste syntax trees until the other dream of a universal representation materializes[1]

1: There have been several attempts at Universal ASTs, including (unsurprisingly) a JVM-centric one from JetBrains https://github.com/JetBrains/intellij-community/blob/idea/24...


The problem with this statement is that it assumes ease of parsing is something universal and stable. And this is certainly not true. You may believe syntax A is so much easier simply because it's the syntax you have been dealing with for most of your career, so your brain is trained for it. On top of that, the particular task can make a lot of difference: most people would agree that regex is a simplification versus writing the same logic in the usual if-then way for pattern matching in strings, but I'm not sure many would like to have their whole programs looking that way (but even that could be subjective, see APL).


This is interesting. My first thought was that a language where more meaning is expressed in syntax could catch more errors at compile time. But there seems to be no reason why meaning encoded in semantics could not also be caught at compile time.

The main benefit of putting things in the syntax seems to be that many errors would become visually obvious.


I don't understand your distinction between syntax and semantics. If the semantics are complex, wouldn't that mean the syntax is thus complex?


lisp's syntax is simple - it's just parentheses to define a list, and the first element of a list is executed as a function.

but for example a language like C has many different syntaxes for different operations, like function declaration or variable or array syntax, or if/switch-case etc etc.

so to know C syntax you need to learn all these different ways to do different things, but in lisp you just need to know how to match parentheses.

But of course you still want to declare variables, or have if/else and switch case. So you instead need to learn the builtin macros (what GP means by semantics) and their "syntax" that is technically not part of the language's syntax but actually is since you still need all those operations enough that they are included in the standard library and defining your own is frowned upon.


Lisp has way more syntax, that doesn't cover any of the special forms. Knowing about application syntax doesn't help with understanding `let` syntax. Even worse, with macros, the amount of syntax is open-ended. That they all come in the form of S-expressions doesn't help a lot in learning them.


Most languages' abstract machines expose a very simple API, it's up to the language to add useful constructs to help us write code more efficiently. Languages like Lisp start with a very simple syntax, then add those constructs with the language itself (even though those can be fixed using a standard), others just add it through the syntax. These constructs plus the abstract machine's operations form the semantics, syntax is however the language designer decided to present them.


Lisp has reader macros which allow you to reprogram its lexer. Lisp macros allow you to program the translation from the visible structure to the parse tree.

For example, https://pyret.org/

It really isn’t simple or necessarily uniform.


I've heard that certain lisps (Common Lisp comes up when I search for reader macros) allow for all kinds of tinkering with themselves. But the ability of one to make itself not a lisp anymore, while interesting, doesn't seem to say much about the merits of sticking to s-expressions, except maybe to point out that somebody once decided not to.


Reader macros are there to program and configure the reader. The reader is responsible for reading s-expressions into internal data structures. There are basically two main uses of reader-macros: data structures and reader control.

A CL implementation will implement reading lists, symbols, numbers, arrays, strings, structures, characters, pathnames, ... via reader macros. Additionally the reader implements various forms of control operations: conditional reading, reading and evaluation, circular datastructures, quoting and comments.

This is user programmable&configurable. Most uses will be in the two above categories: data structure syntax and control. For example we could add a syntax for hash tables to s-expressions. An example for a control extension would be to add support for named readtables. For example a Common Lisp implementation could add a readtable for reading s-expressions from Scheme, which has a slightly different syntax.

Reader macros were optimized for implementing s-expressions, thus the mechanism isn't that convenient as a lexer/parser for actual programming languages. It's a bit painful to do so, but possible.

A typical reader macro usage, beyond the usage described above, is one which implements a different token or expression syntax. For example there are reader macros which parse infix expressions. This might be useful in Lisp code where arithmetic expressions can be written in a more conventional infix syntax. The infix reader macro would convert infix expressions into prefix data.
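
To make the infix-to-prefix conversion concrete, here is a rough Python sketch of the transformation such a reader macro would perform. It is not Common Lisp and uses no real reader-macro API; `infix_to_prefix`, `tokenize`, and the four-operator precedence table are assumptions made purely for illustration.

```python
import re

# Assumed binding powers for four left-associative operators.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def tokenize(text):
    """Split into integers, identifiers, operators and parentheses."""
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", text)

def infix_to_prefix(text):
    """Turn an infix expression into a prefix s-expression string."""
    tokens = tokenize(text)
    pos = 0

    def parse_atom():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_expr(1)
            pos += 1  # skip the closing ")"
            return node
        return tok

    def parse_expr(min_prec):
        # Precedence climbing: keep folding operators that bind tightly enough.
        nonlocal pos
        left = parse_atom()
        while (pos < len(tokens) and tokens[pos] in PRECEDENCE
               and PRECEDENCE[tokens[pos]] >= min_prec):
            op = tokens[pos]
            pos += 1
            right = parse_expr(PRECEDENCE[op] + 1)
            left = [op, left, right]
        return left

    def render(node):
        if isinstance(node, list):
            return "(" + " ".join(render(n) for n in node) + ")"
        return node

    return render(parse_expr(1))

print(infix_to_prefix("1 + 2 * 3"))    # prints: (+ 1 (* 2 3))
print(infix_to_prefix("(a + b) * c"))  # prints: (* (+ a b) c)
```

The sketch does no error handling (unbalanced parentheses will fail), which a real reader extension would need.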


Is Pyret based on reader macros? I would think it's much easier to use a syntax parser for that.


I am surprised to hear that structural editing has been a hurdle for you, and I think I can offer a piece of advice. I also used to be terrified by its apparent complexity, but later found out that one just needs to use parinfer and to know key bindings for only three commands: slurp, barf, and raise.

With just these four things you will be 95% there, enjoying the fruits of paredit without any complexity — all the remaining tricks you can learn later when you feel like you’re fluent.


Thanks very much for the advice, it's timely.

<rant> It's not so much the editing itself but the unfamiliarity of the ecosystem. It seems it's a square peg, and I've been crafting a round hole of habits for it:

I guess I should use emacs? How to even configure it such that these actions are available? Or maybe I should write a plugin for helix so that I can be in a familiar environment. Oh, but the helix plugin language is a scheme, so I guess I'll use emacs until I can learn scheme better and then write that plugin. Oh but emacs keybinds are conflicting with what I've configured for zellij, maybe I can avoid conflicts by using evil mode? Oh ok, emacs-lisp, that's a thing. Hey symex seems like it aligns with my modal brain, oh but there goes another afternoon of fussing with emacs. Found and reported a symex "bug" but apparently it only appears in nix-governed environments so I guess I gotta figure out how to report the packaging bug (still todo). Also, I guess I might as well figure out how to get emacs to evaluate expressions based on which ones are selected, since that's one of the fun things you can do in lisps, but there's no plugin for the scheme that helix is using for its plugin language (which is why I'm learning scheme in the first place), but it turns out that AI is weirdly good at configuring emacs so now my emacs config contains most of what that plugin would entail. Ok, now I'm finally ready to learn scheme, I've got this big list of new actions to learn: https://countvajhula.com/2021/09/25/the-animated-guide-to-sy.... Slurp, barf, and raise you say? excellent, I'll focus on those.

I'm not actually trying to critique the unfamiliar space. These are all self inflicted wounds: me being persnickety about having it my way. It's just usually not so difficult to use something new and also have it my way.</rant>


> Oh but emacs keybinds are conflicting with what I've configured for zellij,

Don't do that. ;)

Emacs is a graphical application! Don't use it in the terminal unless you really have to (i.e., you're using it on a remote machine and TRAMP will not do).

> it turns out that AI is weirdly good at configuring emacs

I was just chatting with a friend about this. ChatGPT seems to be much better at writing ELisp than many other languages I've asked it to work with.

Also while you're playing with it, you might be interested in checking out kakoune.el or meow, which provide modal editing in Emacs but with the selection-first ordering for commands, like in Kakoune and Helix rather than the old vi way.

PS: symex looks really interesting! Hadn't seen that one


Well, elisp probably accounts for like 85% of the lisp code on GH and co, so that'd make sense


To be fair, I am not a "lisper" and I don't know Emacs at all. I am just a Clojure enjoyer who uses IntelliJ + Cursive with its built-in parinfer/paredit.


I never bothered with structural editing on Emacs. I just use the sentence/paragraph movement commands. M-a, M-e, M-n, M-p, M-T, M-space, etc.


> I'm not sure who wants to be able to syntax highlight C at 35 MB per second, but I am now able to do so

Fast, but tcc *compiles* C to binary code at 29 MB/s on a really old computer: https://bellard.org/tcc/#speed Should be possible to go much faster but probably not needed


Justine vs Bellard, that's a nice setup.


Some random things that the author seems to have missed:

> but TypeScript, Swift, Kotlin, and Scala take string interpolation to the furthest extreme of encouraging actual code being embedded inside strings

Many more languages support that:

    C#             $"{x} plus {y} equals {x + y}"
    Python         f"{x} plus {y} equals {x + y}"
    JavaScript     `${x} plus ${y} equals ${x + y}`
    Ruby           "#{x} plus #{y} equals #{x + y}"
    Shell          "$x plus $y equals $(echo "$x+$y" | bc)"
    Make :)        echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"
> Tcl

Tcl is funny because comments are only recognized in code, and since it's homoiconic, it's very hard to distinguish code and data. { } are just funny string delimiters. E.g.:

    xyzzy {#hello world}
Is xyzzy a command that takes a code block or a string? There's no way to tell. (Yes, that means that the Tcl tokenizer/parser cannot discard comments: only at evaluation time it's possible to tell if something is a comment or not.)

> SQL

PostgreSQL has the very convenient dollar-quoted strings: https://www.postgresql.org/docs/current/sql-syntax-lexical.h... E.g. these are equivalent:

    'Dianne''s horse'
    $$Dianne's horse$$
    $SomeTag$Dianne's horse$SomeTag$
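
A small Python sketch of how a highlighter or tool might recognize these three spellings and recover the same value. The function name `string_value` is hypothetical, the tag pattern is a simplification of PostgreSQL's identifier rules, and escape-string (`E'...'`) syntax is not handled.

```python
import re

# $tag$ ... $tag$ with an optional tag. Group 1 always participates in the
# match (possibly as the empty string), so the backreference \1 also works
# for the untagged $$ ... $$ form.
DOLLAR_QUOTED = re.compile(r"\$((?:[A-Za-z_]\w*)?)\$(.*?)\$\1\$", re.DOTALL)

def string_value(literal):
    """Return the text denoted by a quoted or dollar-quoted literal."""
    match = DOLLAR_QUOTED.fullmatch(literal)
    if match:
        return match.group(2)  # dollar-quoted bodies are taken verbatim
    if len(literal) >= 2 and literal[0] == "'" and literal[-1] == "'":
        return literal[1:-1].replace("''", "'")  # '' is an escaped quote
    raise ValueError("not a string literal this sketch understands")

for text in ("'Dianne''s horse'", "$$Dianne's horse$$",
             "$SomeTag$Dianne's horse$SomeTag$"):
    print(string_value(text))
# prints "Dianne's horse" three times
```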


Perl lets you do this too:

    my $foo = 5;
    my $bar = 'x';
    my $quux = "I have $foo $bar\'s: @{[$bar x $foo]}";
    print "$quux\n";
This prints out:

    I have 5 x's: xxxxx
The "@{[...]}" syntax is abusing Perl's ability to interpolate an _array_ as well as a scalar. The inner "[...]" creates an array reference and the outer "@{...}" dereferences it.

For reasons I don't remember, the Perl interpreter allows arbitrary code in the inner "[...]" expression that creates the array reference.


> For reasons I don't remember, the Perl interpreter allows arbitrary code in the inner "[...]" expression that creates the array reference.

...because it's an array value? Aside from how the languages handle references, how is that part any different from, for example, this in python:

  >>> [5 * 'x']
  ['xxxxx']
You can put (almost) anything there, as long as it's an expression that evaluates to a value. The resulting value is what goes into the array.


I understand that's constructing an array. What's a bit odd is that the interpreter allows you to string interpolate any expression when constructing the array reference inside the string.


It's not...? Well, not directly: It's string interpolating an array of values, and the array is constructed using values from the results of expressions. These are separate features that compose nicely.


> What's a bit odd is that the interpreter allows you to string interpolate any expression when constructing the array reference inside the string.

Why? Surely it is easier for both the language and the programmer to have a rule for what you can do when constructing references to anonymous arrays, without having to special case whether that anonymous array is or is not in a string (or in any one of the many other contexts in which such a construct may appear in Perl).


You also don't need quotes around strings (barewords). So

    my $bar = x;
should give the same result.

Good luck with lexing that properly.

https://perlmaven.com/barewords-in-perl


If you're writing anything approaching decent perl that won't be accepted.


Doesn't really matter for a syntax highlighter, because it is out of your control what you get. For the llamafile highlighter even more so since it supports other legacy quirks, like C trigraphs as well.


"use strict" will prevent it and I think strict will be assumed/default soon.


As of Perl 5.12, `use`ing a version (necessary to ensure availability of some of the newer features) automatically implies `use strict`.

https://perldoc.perl.org/strict#HISTORY


> actual code being embedded inside strings

My view on this is that it shouldn’t be interpreted as code being embedded inside strings, but as a special form of string concatenation syntax. In turn, this would mean that you can nest the syntax, for example:

    "foo { toUpper("bar { x + y } bar") } foo"
The individual tokens being (one per line):

    "foo {
    toUpper
    (
    "bar {
    x
    +
    y
    } bar"
    )
    } foo"
If `+` does string concatenation, the above would effectively be equivalent to:

    "foo " + toUpper("bar " + (x + y) + " bar") + " foo"
I don’t know if there is a language that actually works that way.
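
Here is a minimal Python sketch of a lexer for that hypothetical language, just to show that the token list above falls out of a very small state machine. The function `lex` and the whole language are assumptions for illustration; the sketch handles no escape sequences and assumes braces never appear in code outside of interpolation holes.

```python
def lex(src):
    """Split src into tokens, treating `"...{`, `}...{` and `}..."` as string fragments."""
    tokens = []
    depth = 0  # number of interpolation holes currently open
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch == '"' or (ch == "}" and depth > 0):
            # A string fragment starts at a quote, or at the brace closing a hole.
            if ch == "}":
                depth -= 1
            j = i + 1
            while j < n and src[j] not in '"{':
                j += 1
            if j == n:
                raise SyntaxError("unterminated string fragment")
            if src[j] == "{":
                depth += 1  # the fragment ends by opening a new hole
            tokens.append(src[i:j + 1])
            i = j + 1
        elif ch.isalnum() or ch == "_":
            j = i
            while j < n and (src[j].isalnum() or src[j] == "_"):
                j += 1
            tokens.append(src[i:j])
            i = j
        else:
            tokens.append(ch)  # single-character operator or punctuation
            i += 1
    if depth != 0:
        raise SyntaxError("unbalanced interpolation braces")
    return tokens

for token in lex('"foo { toUpper("bar { x + y } bar") } foo"'):
    print(token)
```

The only state beyond an ordinary lexer is the `depth` counter, which is exactly the brace-balancing requirement.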


Indeed in some of the listed languages you can nest it like that, but in others (e.g. Python) you can't. I would guess they deliberately don't want to enable that and it's not a problem in their parser or something.


As of python 3.6 you can nest fstrings. Not all formatters and highlighters have caught up, though.

Which is fun, because correct highlighting depends on language version. Haskell has similar problems where different compiler flags require different parsers. Close enough is sufficient for syntax highlighting, though.

Python is also a bit weird because it calls the format methods, so objects can intercept and react to the format specifiers in the f-string while being formatted.
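
A minimal sketch of that hook: whatever follows the colon inside the braces is passed as a string to the object's `__format__` method. The class `Name` and its made-up `upper` specifier are purely illustrative.

```python
class Name:
    def __init__(self, text):
        self.text = text

    def __format__(self, spec):
        # `spec` is the raw text after the colon in the f-string.
        if spec == "upper":  # made-up specifier for this sketch
            return self.text.upper()
        return format(self.text, spec)  # fall back to normal str formatting

who = Name("jart")
print(f"{who:upper}")  # prints: JART
print(f"[{who:>6}]")   # prints: [  jart]
```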


I didn't mean nested f-strings. I mean this is a syntax error:

    >>> print(f"foo {"bar"}")
    SyntaxError: f-string: expecting '}'
Only this works:

    >>> print(f"foo {'bar'}")
    foo bar


You're using an old Python version. On recent versions, it's perfectly fine:

    Python 3.12.7 (main, Oct  3 2024, 15:15:22) [GCC 14.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print(f"foo {"bar"}")
    foo bar


Even when nesting is disallowed, my point is that I find it preferable to not view it (and syntax-highlight it) as a “special string” with embedded magic, but as multiple string literals with just different delimiters that allow omitting the explicit concatenation operator, and normal expressions interspersed in between. I think it’s important to realize that it is really just very simple syntactic sugar for normal string concatenation.


While you're conceptually right, in practice I think it bears mentioning that in C# the two syntaxes compile differently. This is because C#’s target platform, the .NET Framework, has always had a function called `string.Format` that lets you write this:

  var str = string.Format("{0} is {1} years old.", name, age);
When interpolated strings were introduced later, it was natural to have them compile to this instead of concatenation.


There's no reason in principle why

    name + " is " + age + " years old."
couldn't compile to exactly the same. (Other than maybe `string.Format` having some additional customizable behavior, I don't know C# that well.)


As in Python, and in Rust with the format! macro (which doesn't even support arbitrary expressions), the full C# syntax for interpolated/formatted strings is {<interpolationExpression>[,<alignment>][:<formatString>]}, i.e. there is more going on than just a simple wrapper around concat or StringBuilder.


When not using the format specifiers or alignment it will indeed compile to just string.Concat (which is also what the + operator for strings compiles to). Similar to C compilers choosing to call puts instead of printf if there is nothing to be formatted.


If it’s treated strictly as simple concatenation syntactic sugar then you are allowing something like print(“foo { func() ); Which seems janky af.

> just very simple syntactic sugar for normal string concatenation.

Maybe. There’s also possibly a string conversion. It seems reasonable to want to disallow implicit string conversion in a concatenation operator context (especially if overloading +) while allowing it in the interpolation case.


I failed to mention the balancing requirement, that should of course remain. But it's an artificial requirement, so to speak, that is merely there to double-check the programmer's intent. The compiler/parser wouldn't actually care (unlike for an arithmetic expression with unbalanced parentheses, or scope blocks with unbalanced braces), the condition is only checked for the programmer's benefit.

> There’s also possibly a string conversion. It seems reasonable to want to disallow implicit string conversion in a concatenation operator context (especially if overloading +) while allowing it in the interpolation case.

Many languages have a string concatenation operator that does implicit conversion to string, while still having a string interpolation syntax like the above. It's kind of my point that both are much more similar to each other than many people seem to realize.


> "foo { …

That should probably not be one token.

> My view on this is that it shouldn’t be interpreted as code being embedded inside strings

I’m not sure exactly what you’re proposing and how it is different. You still can’t parse it as a regular lexical grammar.

How does this change how you highlight either?

Whatever you call it, to the lexer it is a special string, it has to know how to match it, the delimiters are materially different than concatenation.

I might be being dense but I’m not sure what’s formally distinct.


> > "foo { …

> That should probably not be one token.

It's exactly the point that this is one token. It's a string literal with opening delimiter `"` and closing delimiter `{`, and that whole token itself serves as a kind of opening "brace". Alternatively, you can see `{` as a contraction of `" +`. Meaning, aside from the brace balancing requirement, `"foo {` does the same as `"foo " +` would.

Still alternatively, you could imagine a language that concatenates around string literals by default, similar to how C behaves for sequences of string literals. In C,

    "foo" "bar" "baz"
is equivalent to

    "foobarbaz"
Similarly, you could imagine a language where

    "foo" some_variable "bar"
would perform implicit concatenation, without needing an explicit operator (as in `"foo" + x + "bar"`). And then people might write it without the inner whitespace, as:

    "foo"some_variable"bar"
My point is that

    "foo{some_variable}bar"
is really just that (plus a condition requiring balanced pairs of braces). You can also re-insert the spaces for emphasis:

    "foo{ some_variable }bar"
The fact that people tend to think of `{some_variable}` as an entity is sort-of an illusion.
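
For what it's worth, Python happens to have both ingredients, so the equivalence can be checked directly. This is only a sketch of the analogy: Python f-strings go through `format()` rather than a literal `+`, and `some_variable` is just a placeholder value.

```python
some_variable = "XYZ"

# Adjacent string literals are concatenated by the compiler, as in C.
adjacent = "foo" "bar" "baz"

# The interpolated form and the spelled-out concatenation give the same string.
interpolated = f"foo{some_variable}bar"
concatenated = "foo" + format(some_variable) + "bar"

print(adjacent)                      # prints: foobarbaz
print(interpolated == concatenated)  # prints: True
```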

> How does this change how you highlight either?

You would highlight the `"...{`, `}...{`, and `}..."` parts like normal string literals (they just use curly braces instead of double quotes at one or both ends), and highlight the inner expressions the same as if they weren't surrounded by such literals.


> It's exactly the point that this is one token.

Fair enough. The point, as you have acknowledged, being that unlike + you have to treat { specially for balancing (and separately from the “).

> The fact that people tend to think of `{some_variable}` as an entity is sort-of an illusion.

I guess. I just don’t know what being an illusion means formally. It’s not an illusion to the person that has to implement the state machine that balances the delimiters.

> You would highlight the `"...{`, `}...{`, and `}..."` parts like normal string literals (they just use curly braces instead of double quotes at one or both ends), and highlight the inner expressions the same as if they weren't surrounded by such literals

Emacs does it this way FWIW. But I’m not sure how important it is to dictate that the brace can’t be a different color.

In any event, I can agree your design is valid (Kotlin works this way), but I don’t necessarily agree it is any more valid than, say, how Python does it, where there can be format specifiers and implicit conversion to string is performed, whereas it is not with concatenation. I’m not seeing the clear definitive advantage of interpolated strings being an equivalent to concatenation vs some other type of method call.

The other detail is order of evaluation or sequencing. String concat may behave differently. Not sure I agree it is wrong, because at the end of the day it is distinct looking syntax. Illusion or not, it looks like a neatly enclosed expression, and concatenation looks like something else. That they might parse, evaluate or behave differently isn't unreasonable.


Ruby takes this to 100. As much as I love Ruby, this is valid Ruby, and I can't defend this:

    puts "This is #{<<HERE.strip} evil"
    incredibly 
    HERE
Just to combine the string interpolation with her concern over Ruby heredocs.

My other favorite evil quirk in Ruby is that whitespace is a valid quote character in Ruby. The string (without the quotes) "% hello " is a quoted string containing "hello" (without the quotes), as "%" in contexts where there is no left operand initiates a quoted string and the next character indicates the type of quotes. This is great when you do e.g. "%(this is a string)" or "%{this is a string}". It's not so great if you use space (I've never seen that in the wild, so it'd be nice if it was just removed - even irb doesn't handle it correctly)


https://pbs.twimg.com/media/GbEfj6fbQAQRUB7?format=png&name=...

That's so going in the blog post later today.


Heh. I love Ruby, but, yes, the parser is "interesting", for values of interesting left undefined for its high obscenity content.


And don't overlook the fact that the bareword, or its "HERE" friend, is still in an interpolation context, so...

    puts "hello #{<<onoz.strip} world"
    recursion is #{<<onoz.strip}
    recursive
    onoz
    onoz
    puts "that was fun"
yields

  hello recursion is recursive world
  that was fun
and then there's its backtick friend

    puts "hello #{<<`onoz`.strip} world"
    date -u
    onoz
coughs up

    hello Sun Nov  3 17:25:32 UTC 2024 world
and for those trying out your percent-space trick, be aware that it only tolerates such a thing in a standalone expression context so

  puts (% hello )+" world"
  # or
  x = % hello #
  puts x
because when I tried it "normally" I got

    $ /usr/bin/ruby -e 'puts % hello  + "world"'

    -e:1:in `<main>': undefined local variable or method `hello' for main:Object (NameError)
    $ /usr/bin/ruby -v
    ruby 2.6.10p210 (2022-04-12 revision 67958) [universal.x86_64-darwin21]
but, at the intersection is "ruby parsing is the 15th circle of hell"

    ruby -e 'puts (% #{<<FOO.strip}  )+ " world"
    hello
    FOO
    '


> $ /usr/bin/ruby -e 'puts % hello + "world"'

Yes, it's roughly limited in use to places where it is not ambiguous whether it would be the start of a quoted string or the modulus operator, and after a method name would be ambiguous.

> but, at the intersection is "ruby parsing is the 15th circle of hell"

It's surprisingly (not this part, anyway) not that hard. You "just" need to create a forward reference, and keep track of heredocs in your parser, and when you come to the end of a line with heredocs pending, you need to parse them and assign them to the variable references you've created.
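The "pending heredocs" bookkeeping described above can be sketched roughly as follows, in Python rather than Ruby. This is only an illustration under simplifying assumptions: `split_heredocs` and `HEREDOC` are hypothetical names, only the plain `<<TAG` form is recognised (no `<<-`, `<<~` or quoted tags), a `<<` shift followed by an identifier would be misread, and heredocs opened inside another heredoc's body are not handled.

```python
import re

# Hypothetical, simplified heredoc opener: "<<" followed by a bare tag.
HEREDOC = re.compile(r"<<([A-Za-z_]\w*)")

def split_heredocs(source: str):
    """Return (code_line, [bodies]) pairs, bodies in the order the tags appear."""
    lines = source.split("\n")
    result = []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        # Forward references created while scanning this line.
        pending = HEREDOC.findall(line)
        bodies = []
        # The line has ended: now read one body per pending tag.
        for tag in pending:
            body = []
            while i < len(lines) and lines[i] != tag:
                body.append(lines[i])
                i += 1
            i += 1  # skip the terminator line itself
            bodies.append("\n".join(body))
        result.append((line, bodies))
    return result

demo = 'puts "This is #{<<HERE.strip} evil"\nincredibly\nHERE\nputs "done"'
parsed = split_heredocs(demo)
```

Here `parsed[0]` pairs the `puts` line with the body `incredibly`, and the second `puts` line comes back with no bodies attached.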

It is dirty, though, and there are so many dark corners of parsing Ruby. Having written a partial Ruby parser, and being a fan of Wirth-style grammar simplicity while enjoying using Ruby is a dark, horrible place to live in. On the one hand, I find Ruby a great pleasure to use, on the other hand, the parser-writer in me wants to spend eternity screaming into the void in pain.


How are you so awesome?


Thanks. I'm a big fan of your work, so that is appreciated...


> Scala

Note about Scala's string interpolation. They can be used as pattern match targets.

  val s"${a} + ${b}" = "1 + 2";
  println(a) // 1
  println(b) // 2


One cool feature of C# interpolated strings is that they are lazy. Many loggers used to implement their own interpolation because something like

    log.trace($"Entering iteration {i} for customer {c.ID} [{c.ShortName}]");
in a hot loop would call string.Concat every time it was called before the logger could bail out of the method.

C# lets you declare an overload that accepts a `DefaultInterpolatedStringHandler` (or your own custom implementation of the handler pattern) and this overload will take precedence and allow you to delay the building of the string until after you've checked whether logging it is required.


Is this a bash-ism?

    "$x plus $y equals $((x+y))"


No, it's portable shell syntax.


"$((" arithmetic expansion is POSIX (XCU 2.6.4 "Arithmetic Expansion").

But if I'm not mistaken, it originated in csh.


This works in "sh" as well for me.


On some systems (like on mine) sh is just a link to bash, so I couldn't test it.


Isn't bash supposed to act like sh when executed with that name?


It still has bashisms


> Is this a bash-ism?

> "$x plus $y equals $((x+y))"

No, it is specified in POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V...


  Make :)        echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"
I'm guessing this is the reason for the :) but to be clear for anyone else: Make is only doing half of the work, whatever comes after "shell" is being passed to another executable, then make captures its stdout and interpolates that. The other executable is "sh" by default but can be changed to whatever.


Python f-strings are kind of wild. They can even contain comments! They also have slightly different rules for parsing certain kinds of expressions, like := and lambdas. And until fairly recently, strings inside the expressions couldn't use the quote type of the f-string itself (or backslashes).
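A small illustration of the := and lambda point, assuming CPython 3.8 or later. Inside a replacement field a bare colon starts the format spec, so both constructs need parentheses to be read as expressions. The variable names are made up for the example.

```python
x = 10

# The ':' begins a format spec, so this formats x with the spec "=5"
# (pad to width 5); it does not assign anything.
as_format_spec = f"{x:=5}"

# Parentheses turn it into an assignment expression; x becomes 5.
as_assignment = f"{(x := 5)}"

# A lambda also needs parentheses, otherwise its ':' is taken as a format spec.
with_lambda = f"{(lambda v: v + 1)(1)}"
```

After running this, `as_format_spec` is `'   10'`, `as_assignment` is `'5'`, and `with_lambda` is `'2'`.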


VHDL

There is a record constructor syntax in VHDL using attribute invocation syntax: RECORD_TYPE'(field1expr, ..., fieldNexpr). This means that if the first field of your record is a subtype of a character type, you can get a record construction expression like this one: REC'('0',1,"10101").

Good luck distinguishing between '(' as a character literal and "'", "(" and "'0'" at lexical level.

Haskell.

Haskell has context-free syntax for bracketed ("{-" ... "-}") comments. Lexer has to keep bracketed comment syntax balanced (for every "{-" there should be accompanying "-}" somewhere).


> Many more languages support that:

Julia as well:

    Julia    "$x plus $y equals $(x+y)"


jq: "\("hello" + "world")!!"

I wish PG had dollar-bracket quoting where you have to use the closing bracket to close, that way vim showmatch would work trivially. Something like ${...}$.


Shell "$x plus $y equals $((x+y))"

Shell "$x plus $y equals $((expr $x + $y))"


Correction: Shell "$x plus $y equals $(expr $x + $y)"


> PostgreSQL has the very convenient dollar-quoted strings

I did not know that. Today I learned.


Another syntax oddity (not mentioned here) that breaks most highlighters: In Java, unicode escapes can be anywhere, not just in strings. For example, the following is a valid class:

    class Foo\u007b}
and this assert will not trigger:

    assert
        // String literals can have unicode escapes like \u000A!
        "Hello World".equals("\u00E4");


I also argue that failing to syntax highlight this correctly is a security issue. You can terminate block comments with Unicode escapes, so if you wanted to hide some malicious code in a Java source file, you just need an excuse for there to be a block of Unicode escapes in a comment. A dev who doesn’t know about this quirk is likely to just skip over it, assuming it’s commented out.


I once wrote a puzzle using this, which (fortunately) doesn't work any more, but would do interesting things on older JDK versions: https://pastebin.com/raw/Bh81PwXY


I have never seen this in Java! Are there any use cases where it could be useful?


I don't know about usefulness but it does let us write identifiers using Unicode characters. For example:

  public class Foo {
      public static void main(String[] args) {
          double \u03c0 = 3.14159265;
          System.out.println("\u03c0 = " + \u03c0);
      }
  }
Output:

  $ javac Foo.java && java Foo
  π = 3.14159265
Of course, nowadays we can simply write this with any decent editor:

  public class Foo {
      public static void main(String[] args) {
          double π = 3.14159265;
          System.out.println("π = " + π);
      }
  }
Support for Unicode escape sequences is a result of how the Java Language Specification (JLS) defines InputCharacter. Quoting from Section 3.4 of JLS <https://docs.oracle.com/javase/specs/jls/se23/jls23.pdf>:

  InputCharacter:
    UnicodeInputCharacter but not CR or LF
UnicodeInputCharacter is defined as the following in section 3.3:

  UnicodeInputCharacter:
    UnicodeEscape
    RawInputCharacter

  UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

  UnicodeMarker:
    u {u}

  HexDigit:
    (one of)
    0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

  RawInputCharacter:
    any Unicode character
As a result the lexical analyser honours Unicode escape sequences absolutely anywhere in the program text. For example, this is a valid Java program:

  public class Bar {
      public static void \u006d\u0061\u0069\u006e(String[] args) {
          System.out.println("hello, world");
      }
  }
Here is the output:

  $ javac Bar.java && java Bar
  hello, world
However, this is an incorrect Java program:

  public class Baz {
      // This comment contains \u6d.
      public static void main(String[] args) {
          System.out.println("hello, world");
      }
  }
Here is the error:

  $ javac Baz.java
  Baz.java:2: error: illegal unicode escape
      // This comment contains \u6d.
                                   ^
  1 error
Yes, this is an error even if the illegal Unicode escape sequence occurs in a comment!
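To make the "escapes are translated before anything else" behaviour concrete, here is a rough Python sketch of such a pre-pass. It follows the UnicodeEscape grammar quoted above (one or more `u`, then exactly four hex digits), and it applies the JLS rule that a backslash only starts an escape when it is preceded by an even number of backslashes. The function name is hypothetical, and surrogate pairs are left as two separate code units.

```python
HEX = set("0123456789abcdefABCDEF")

def translate_unicode_escapes(source: str) -> str:
    out = []
    i, n = 0, len(source)
    backslashes = 0  # contiguous backslashes immediately before position i
    while i < n:
        ch = source[i]
        starts_escape = (
            ch == "\\" and backslashes % 2 == 0
            and i + 1 < n and source[i + 1] == "u"
        )
        if starts_escape:
            j = i + 1
            while j < n and source[j] == "u":  # UnicodeMarker: u {u}
                j += 1
            digits = source[j:j + 4]
            if len(digits) < 4 or any(d not in HEX for d in digits):
                # Raised no matter where we are, comments included.
                raise SyntaxError(f"illegal unicode escape at offset {i}")
            out.append(chr(int(digits, 16)))
            i = j + 4
            backslashes = 0
            continue
        backslashes = backslashes + 1 if ch == "\\" else 0
        out.append(ch)
        i += 1
    return "".join(out)

brace = translate_unicode_escapes("class Foo\\u007b}")
comment = translate_unicode_escapes("// escapes like \\u000A!")
```

`brace` comes out as `class Foo{}`, and `comment` now contains a real line break before the `!`, which is why the `!` in the assert example upthread ends up outside the comment. Feeding it `// This comment contains \u6d.` raises the error, because the pass has no idea it is inside a comment.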


I wonder if the full Unicode range was accepted because some companies write code in languages other than English.


Javac uses the platform encoding [0] by default to interpret Java source files. This means that Java source code files are inherently non-portable. When Java was first developed (and for a long time after), this was the default situation for any kind of plain text files. The escape sequence syntax allows transforming [1] Java source code into a portable (that is, ASCII-only) representation that is completely equivalent to the original, and also converting it back to any platform encoding.

Source control clients could apply this automatically upon checkin/checkout, so that clients with different platform encodings can work together. Alternatively, IDEs could do this when saving/loading Java source files. That never quite caught on, and the general advice was to stick to ASCII, at least outside comments.

[0] Since JDK 18, the default encoding defaults to UTF-8. This probably also extends to javac, though I haven’t verified it.

[1] https://docs.oracle.com/javase/8/docs/technotes/tools/window...
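The ASCII-only direction of that transformation can be sketched in a few lines of Python. This is not the actual tool linked in [1], just an illustration of the idea; `to_ascii_java` is a made-up name, and the sketch does not treat backslashes already present in the source specially.

```python
def to_ascii_java(source: str) -> str:
    """Replace every non-ASCII character with \\uXXXX escapes (UTF-16 code units)."""
    out = []
    for ch in source:
        if ord(ch) < 128:
            out.append(ch)
            continue
        units = ch.encode("utf-16-be")  # 2 bytes, or 4 for a surrogate pair
        for k in range(0, len(units), 2):
            out.append("\\u%04x" % int.from_bytes(units[k:k + 2], "big"))
    return "".join(out)

portable = to_ascii_java("double π = 3.14159265;")
```

`portable` is `double \u03c0 = 3.14159265;`, the same spelling used in the Foo example upthread. Characters outside the Basic Multilingual Plane come out as two escapes, one per surrogate.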


> Of all the languages, I've saved the best for last, which is Ruby. Now here's a language whose syntax evades all attempts at understanding.

TeX with its arbitrarily reprogrammable lexer: how adorable


Lisp reader macros allow you to program its lexer too.


You can basically define a new language with a few lines of code in Racket.


> Every C programmers (sic) knows you can't embed a multi-line comment in a multi-line comment.

And every Standard ML programmer might find this to be a surprising limitation. The following is a valid Standard ML program:

  (* (* Nested (**) *) comment *)
  val _ = print "hello, world\n"
Here is the output:

  $ sml < hello.sml           
  Standard ML of New Jersey (64-bit) v110.99.5 [built: Thu Mar 14 17:56:03 2024]
  - = hello, world

  $ mlton hello.sml && ./hello
  hello, world
Given how C was considered one of the "expressive" languages when it arrived, it's curious that nested comments were never part of the language.


There are 3 things I find funny about that comment: ML didn’t have single-line comments, so same level of surprising limitation. I’ve never heard someone refer to C as “expressive”, but maybe it was in 1972 when compared to assembly. And what bearing does the comment syntax have on the expressiveness of a language? I would argue absolutely none at all, by definition. :P


> ML didn’t have single-line comments, so same level of surprising limitation.

It is not quite clear to me why the lack of single-line comments is such a surprising limitation. After all, a single-line block comment can easily serve as a substitute. However, there is no straightforward workaround for the lack of nested block comments.

> I’ve never heard someone refer to C as “expressive”, but maybe it was in 1972 when compared to assembly.

I was thinking of Fortran in this context. For instance, Fortran 77 lacked function pointers and offered a limited set of control flow structures, along with cumbersome support for recursion. I know Fortran, with its native support for multidimensional arrays, excelled in numerical and scientific computing but C quickly became the preferred language for general purpose computing.

While very few today would consider C a pinnacle of expressiveness, when I was learning C, the landscape of mainstream programming languages was much more restricted. In fact, the preface to the first edition of K&R notes the following:

"In our experience, C has proven to be a pleasant, expressive and versatile language for a wide variety of programs."

C, Pascal, etc. stood out as some of the few mainstream programming languages that offered a reasonable level of expressiveness. Of course, Lisp was exceptionally expressive in its own right, but it wasn't always the best fit for certain applications or environments.

> And what bearing does the comment syntax have on the expressiveness of a language?

Nothing at all. I agree. The expressiveness of C comes from its grammar, which the language parser handles. Support for nested comments, in the context of C, is a concern for the lexer, so indeed one does not directly influence the other. However, it is still curious that a language with such a sophisticated grammar and parser could not allocate a bit of its complexity budget to support nested comments in its lexer. This is a trivial matter, I know, but I still couldn't help but wonder about it.


> Fortran 77 lacked function pointers

But we did have dummy procedures, which covered one of the important use cases directly, and which could be abused to fake function/subroutine pointers stored in data.


Fair enough. From my perspective, lack of single line comments is a little surprising because most other languages had it at the time (1973, when ML was introduced). Lack of nested comments doesn’t seem surprising, because it isn’t an important feature for a language, and because most other languages did not have it at the time (1972, when C was introduced).

I can imagine both pro and con arguments for supporting nested comments, but regardless of what I think, C certainly could have added support for nested comments at any time, and hasn’t, which suggests that there isn’t sufficient need for it. That might be the entire explanation: not even worth a little complexity.


> C certainly could have added support for nested comments at any time

After C89 was ratified, adding nested comments to C would have risked breaking existing code. For instance, this is a valid program in C89:

  #include <stdio.h>

  int main() {
      /* /* Comment */
      printf("hello */ world");
      return 0;
  }
However, if a later C standard were to introduce nested comments, it would break the above program because then the following part of the program would be recognised as a comment:

      /* /* Comment */
      printf("hello */
The above text would be ignored. Then the compiler would encounter the following:

      world");
This would lead to errors like undeclared identifier 'world', missing terminating " character, etc.


Given the neighboring thread where I just learned that the lexer runs before the preprocessor, I’m not sure that would be the outcome. There’s no reason to assume the comment terminator wouldn’t be ignored in strings. And even today, you can safely write printf(“hello // world\n”); without risking a compile error, right?


> Given the neighboring thread where I just learned that the lexer runs before the preprocessor, I’m not sure that would be the outcome.

That is precisely why nested comments would end up breaking the C89 code example I provided above. I elaborate this further below.

> There’s no reason to assume the comment terminator wouldn’t be ignored in strings.

There is no notion of "comment terminator in strings" in C. At any point of time, the lexer is reading either a string or a comment but never one within the other. For example, in C89, C99, etc., this is an invalid C program too:

  #include <stdio.h>

  int main() {
      /* Comment
      printf("hello */ world");
      return 0;
  }
In this case, we wouldn't say that the lexer is "honoring the comment terminator in a string" because, at the point the comment terminator '*/' is read, there is no active string. There is only a comment that looks like this:

      /* Comment
      printf("hello */
The double quotation mark within the comment is immaterial. It is simply part of the comment. Once the lexer has read the opening '/*', it looks for the terminating '*/'. This behaviour would hold even if future C standards were to allow nested comments, which is why nested comments would break the C89 example I mentioned in my earlier HN comment.

> And even today, you can safely write printf("hello // world\n"); without risking a compile error, right?

Right. But it is not clear what this has got to do with my concern that nested comments would break valid C89 programs. In this printf() example, we only have an ordinary string, so obviously this compiles fine. Once the lexer has read the opening quotation mark as the beginning of a string, it looks for an unescaped terminating quotation mark. So clearly, everything until the unescaped terminating quotation mark is a string!
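The "either a string or a comment, never one inside the other" point can be shown with a toy scanner. This is a Python sketch, not how any real C compiler is written; `scan_c` is a hypothetical name, and it ignores character literals, `//` comments and the rule that a string cannot span lines.

```python
def scan_c(source: str):
    """Split C-like text into ('code' | 'comment' | 'string', text) chunks."""
    chunks = []
    i, n = 0, len(source)
    code_start = 0

    def flush(end):
        # Emit any plain code collected since the last comment or string.
        if end > code_start:
            chunks.append(("code", source[code_start:end]))

    while i < n:
        if source.startswith("/*", i):
            flush(i)
            end = source.find("*/", i + 2)       # quotes in here mean nothing
            end = n if end == -1 else end + 2
            chunks.append(("comment", source[i:end]))
            i = code_start = end
        elif source[i] == '"':
            flush(i)
            j = i + 1
            while j < n and source[j] != '"':    # "/*" or "//" in here mean nothing
                j += 2 if source[j] == "\\" else 1
            end = min(j + 1, n)
            chunks.append(("string", source[i:end]))
            i = code_start = end
        else:
            i += 1
    flush(n)
    return chunks

valid = scan_c('/* /* Comment */\nprintf("hello */ world");')
broken = scan_c('/* Comment\nprintf("hello */ world");')
```

For `valid` the chunks are comment, code, string, code, with `*/` sitting harmlessly inside the string. For `broken` the comment swallows `printf("hello */`, then ` world` is code, and the string that starts at the stray quote never terminates.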


AFAIK, C didn't get single line comments until C99. They were a C++ feature originally.


Oh wow, I didn’t remember that, and I did start writing C before 99. I stand corrected. I guess that is a little surprising. ;)

Is it true that many languages had single line comments? Maybe I’m forgetting more, but I remember everything else having single line comments… asm, basic, shell. I used Pascal in the 80s and apparently forgot it didn’t have line comments either?


Some C compilers supported it as an unofficial extension well before C99, so that could be why you didn't realise or don't remember. I think that included both Visual Studio (which was really a C++ compiler that could turn off the C++ bits) and GCC with GNU extensions enabled.


That's my recollection, that most languages had single line comments. Some had multi-line comments but C++ is the first I remember having syntaxes for both. That said, I'm not terribly familiar with pre-80s stuff.


I was barely too young for this to make much of an impact at the time (but older than many, perhaps most, here). I understand why C was considered a "high level language", but it still hits me weird, given today's context.


Lexing nested comments requires maintaining a stack (or at least a nesting-level counter). That wasn’t traditionally seen as being within the realm of lexical analysis, which would only use a finite-state automaton, like regular expressions.
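A quick Python illustration of that difference, using the Standard ML comment from upthread. A regular expression is enough for flat comments but stops at the first closer, while a depth counter finds the real end. The names `FLAT_COMMENT` and `nested_comment_end` are invented for this sketch, and it assumes `start` points at an opening `(*`.

```python
import re

# Flat (non-nested) comments are regular: opener, anything, first closer.
FLAT_COMMENT = re.compile(r"\(\*.*?\*\)", re.S)

def nested_comment_end(text: str, start: int = 0) -> int:
    """Return the index just past the comment that opens at `start`."""
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("(*", i):
            depth += 1
            i += 2
        elif text.startswith("*)", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated comment")

src = '(* (* Nested (**) *) comment *)\nval _ = print "hello, world\\n"'
flat = FLAT_COMMENT.match(src).group(0)
full = src[:nested_comment_end(src)]
```

`flat` ends too early, at `(* (* Nested (**)`, while `full` is the whole `(* (* Nested (**) *) comment *)`. The counter is the only extra state needed, but it is unbounded, which is what takes it out of finite-automaton territory.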


Well there is one way to nest comments in C, and that's by using #if 0:

  #if 0
  This is a
  #if 0
  nested comment!
  #endif
  #endif


Except that text inside #if 0 still has to lex correctly.

(unifdef has some evil code to support using C-style preprocessor directives with non-C source, which mostly boils down to ignoring comments. I don’t recommend it!)


> Except that text inside #if 0 still has to lex correctly.

Are you sure? I just tried on godbolt and that’s not true with gcc 14.2. I’ve definitely put syntax errors intentionally into #if 0 blocks and had it compile. Are you thinking of some older version or something? I thought the pre-processor ran before the lexer since always…


There are three (relevant) phases (see “translation phases” in section 5 of the standard):

• program is lexed into preprocessing tokens; comments turn into whitespace

• preprocessor does its thing

• preprocessor tokens are turned into proper tokens; different kinds of number are disambiguated; keywords and identifiers are disambiguated

If you put an unclosed comment inside #if 0 then it won’t work as you might expect.
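A toy model of the first two phases shows why, sketched in Python under heavy simplification: comments are replaced by a space first (ignoring string literals), and only then are `#if 0` groups dropped. Both function names are hypothetical, and `#else`/`#elif` and every other directive are ignored.

```python
import re

def comments_to_spaces(src: str) -> str:
    # Each /* ... */ becomes one space; an unclosed comment runs to end of input.
    return re.sub(r"/\*.*?(?:\*/|\Z)", " ", src, flags=re.S)

def drop_if0(src: str) -> str:
    kept, depth = [], 0  # depth > 0 means we are inside a skipped "#if 0" group
    for line in src.split("\n"):
        stripped = line.strip()
        if depth:
            if stripped.startswith("#if"):
                depth += 1
            elif stripped.startswith("#endif"):
                depth -= 1
            continue
        if stripped == "#if 0":
            depth = 1
            continue
        kept.append(line)
    if depth:
        raise SyntaxError("unterminated #if")
    return "\n".join(kept)

ok = "#if 0\nThis is a\n#if 0\nnested comment!\n#endif\n#endif\nint x;"
bad = "#if 0\n/* oops\n#endif\nint x; /* real comment */\nint y;"
survivor = drop_if0(comments_to_spaces(ok))
```

`survivor` is just `int x;`. Running the same two steps on `bad` raises the "unterminated #if" error: the comment opened by `/* oops` extends to the next `*/`, so the `#endif` has already been turned into whitespace by the time the preprocessor step looks for it.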


Ah, I see. You’re right!


Pascal always supported the same nested comment syntax as your example.


This is not just true of Standard ML; it's also true of regular ML.


I’d be interested to see a re-usable implementation of joe's[0] syntax highlighting.[1] The format is powerful enough to allow for the proper highlighting of Python f-strings.[2]

0. https://joe-editor.sf.net/

1. https://github.com/cmur2/joe-syntax/blob/joe-4.4/misc/HowItW...

2. https://gist.github.com/irdc/6188f11b1e699d615ce2520f03f1d0d...


I've actually made several lexers and parsers based on the joe DFA style of parsing. The state and transition syntax was something that I always understood much more easily than the standard tools.

The downside is your rulesets tend to get more verbose and are a little bit harder to structure than they might ideally be in other languages more suited towards the purpose, but I actually think that's an advantage, as it's much easier to reason about every production rule when looking at the code.


Interestingly, python f-strings changed their syntax at version 3.12, so highlighting should depend on the version.


It’s just that nesting them arbitrarily is now allowed, right? That shouldn’t matter much for a mere syntax highlighter then. And one could even argue that code that relies on this too much is not really for human consumption.


Also, you can now use the same quote character that encloses an f-string within the {} expressions. That could make them harder to tokenize, because it makes it harder to recognise the end of the string.


Author hasn’t tried to highlight TeX. Which is good for their mental health, I suppose, as it’s generally impossible to fully highlight TeX without interpreting it.

Even parsing is not enough, as it’s possible to redefine what each character does. You can make it do things like “and now K means { and C means }”.

Yes, you can find papers on arXiv that use this god-forsaken feature.


I wrote https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafil... and it does a reasonable job highlighting without breaking for all the .tex files I could find on my hard drive. My goal is to hopefully cover 99.9% of real world usage, since that'll likely cover everything an LLM might output. Esoteric syntax also usually isn't a problem, so long as it doesn't cause strings and comments to extend forever, eclipsing the rest of the source code in a file.


Yes, when the goal isn’t to support 100% of all the weird stuff, then it’s orders of magnitude easier!


I couldn't believe it when I learned that \makeatletter does not “make (something) at a letter (character)” but rather “treats the '@' character as a letter when parsing”.


Same with Common Lisp (you can redefine the read table), although that’s likely abused less often on arXiv.


I don't think it's easy to write a good syntax coloring engine like the one in Vim.

Syntax coloring has to handle context: different rules for material nested in certain ways.

Vim's syntax highlighter lets you declare two kinds of items: matches and regions. Matches are simpler lexical rules, whereas regions have separate expressions for matching the start and end and middle. There are ways to exclude leading and trailing material from a region.

Matches and regions can declare that they are contained. In that case they are not active unless they occur in a containing region.

Contained matches declare which regions contain them.

Regions declare which other regions they contain.

That's the basic semantic architecture; there are bells and whistles in the system due to situations that arise.

I don't think even Justine could develop that in an interview, other than as an overnight take home.


Here is an example of something hard to handle: TXR language with embedded TXR Lisp.

This is the "genman" script which takes the raw output of a manpage to HTML converter, and massages it to form the HTML version of the TXR manual:

https://www.kylheku.com/cgit/txr/tree/genman.txr

Everything that is white (not colored) is literal template material. Lisp code is embedded in directives, like @(do ...). In this scheme, TXR keywords appear purple, TXR Lisp ones green. They can be the same; see the (and ...) in line 149, versus numerous occurrences of @(and).

Quasistrings contain nested syntax: see 130 where `<a href ..> ... </a>` contains an embedded (if ...). That could itself contain a quasistring with more embedded code.

TXR's txr.vim and tl.vim syntax definition files are both generated by this:

https://www.kylheku.com/cgit/txr/tree/genvim.txr


Naively, I would have assumed that the "correct" way to write a syntax highlighter would be to parse into an AST and then iterate over the tokens and update the color of a token based on the type of node (and maybe just tracking a diff to avoid needing to recolor things that haven't changed). I'm guessing that if this isn't done, it's for efficiency reasons (e.g. due to requiring parsing the whole file to highlight rather than just the part currently visible on the screen)?


> I would have assumed that the "correct" way to write a syntax highlighter would be to parse into an AST and then […] I'm guessing that if this isn't done, it's for efficiency reasons

It’s not only running time, but also ease of implementation.

A good syntax highlighter should do a decent job highlighting both valid and invalid programs (rationale: in most (editor, language) pairs, writing a program involves going through moments where the program being written isn’t a valid program)

If you decide to use an AST, that means you need to have good heuristics for turning invalid programs into valid ones that best mimic what the programmer intended. That can be difficult to achieve (good compilers have such heuristics, but even if you have such a compiler, chances are it isn’t possible to reuse them for syntax coloring)

If this simpler approach gives you most of what you can get with the AST approach, why bother writing that?

Also, there are languages where some programs can’t be perfectly parsed or syntax colored without running them. For those, you need this approach.


> I don't think even Justine could develop that in an interview

Not so sure I’d put money on that opinion ;)


In the C# multiquoted strings, how does it know this:

   Console.WriteLine("""""");
   Console.WriteLine("""""");
Are 2 triplequoted empty strings and not one "\nConsole.WriteLine(" sixtuplequoted string?


It's a syntax error!

  Unterminated raw string literal.
https://replit.com/@Wei-YenYen/DistantAdmirableCareware#main...


Ah, so there is no backtracking in lexer for this case. Makes sense.


If the opening quotes are followed by anything that is not a whitespace before the next new-line (or EOF), then it's a single-line string.

I imagine implementing those things took several iterations :)


The former, I'd say.

https://learn.microsoft.com/en-us/dotnet/csharp/programming-...

For a multi-line string the quotes have to be on their own line.


> TypeScript, Swift, Kotlin, and Scala take string interpolation to the furthest extreme of encouraging actual code being embedded inside strings. So to highlight a string, one must count curly brackets and maintain a stack of parser states.

Presumably this is also true in Python - IIRC the brace-delimited fields within f-strings may contain arbitrary expressions.

More generally, this must mean that the lexical grammar of those languages isn't regular. "Maintaining a stack" isn't part of a finite-state machine for a regular grammar - instead we're in the realm of pushdown automata and context-free grammars.

Is it even possible to support generalized string interpolation within a strictly regular lexical grammar?


Complicated interpolation can be lexed as a regular language if you treat strings as three separate lexical things, eg in JavaScript template literals there are,

    `stuff${
    }stuff${
    }stuff`
so the ${ and } are extra closing and opening string delimiters, leaving the nesting to be handled by the parser.

You need a lexer hack so that the lexer does not treat } as the start of a string literal, except when the parser is inside an interpolation but all nested {} have been closed.
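Here is a rough Python sketch of that scheme. The pieces of a template literal become "head", "middle" and "tail" tokens. For brevity the brace counting that the parser would normally do is kept in the lexer as a small stack, which is an assumption of this sketch rather than how any particular JavaScript engine does it. `lex_template` is a hypothetical name, and ordinary quoted strings and comments inside the interpolated code are not handled.

```python
def lex_template(source: str):
    tokens = []
    stack = []  # one open-brace counter per active ${ ... } interpolation
    i, n = 0, len(source)

    def read_chunk(start):
        # Read template text until "${" (interpolation opens) or "`" (template ends).
        j = start
        while j < n:
            if source[j] == "\\":
                j += 2
            elif source[j] == "`":
                return j + 1, False
            elif source.startswith("${", j):
                return j + 2, True
            else:
                j += 1
        raise SyntaxError("unterminated template literal")

    while i < n:
        ch = source[i]
        if ch == "`":
            end, opens = read_chunk(i + 1)
            tokens.append(("head" if opens else "string", source[i:end]))
            if opens:
                stack.append(0)
            i = end
        elif ch == "}" and stack and stack[-1] == 0:
            # All nested braces are closed, so this "}" resumes the string.
            end, opens = read_chunk(i + 1)
            tokens.append(("middle" if opens else "tail", source[i:end]))
            if not opens:
                stack.pop()
            i = end
        elif ch in "{}":
            if stack:
                stack[-1] += 1 if ch == "{" else -1
            tokens.append(("punct", ch))
            i += 1
        elif ch.isspace():
            i += 1
        else:
            j = i
            while j < n and not source[j].isspace() and source[j] not in "`{}":
                j += 1
            tokens.append(("code", source[i:j]))
            i = j
    return tokens

simple = lex_template("`stuff${\n}stuff${\n}stuff`")
nested = lex_template("`a${ {x: `b${y}c`}.x }d`")
```

`simple` is exactly the three string-like tokens from the example above. In `nested`, the `}` that closes the object literal is emitted as plain punctuation, while the `}` before `c` and the one before `d` resume their respective templates.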


> Is it even possible to support generalized string interpolation within a strictly regular lexical grammar?

Almost certainly not, a fun exercise is to attempt to devise a Pumping tactic for your proposed language. If it doesn’t exist, it’s not regular.

https://en.m.wikipedia.org/wiki/Pumping_lemma_for_regular_la...


  select'select'select
is a perfectly valid SQL query, at least for Postgres.

Languages' approach to whitespace between tokens is all over the place


The author may have missed that lexing C is actually context-sensitive, i.e. you need a symbol table: https://en.wikipedia.org/wiki/Lexer_hack

Of course, for syntax highlighting this is only relevant if you want to highlight the multiplication operator differently from the dereferencing operator, or declarations differently from expressions.

More generally, however, I find it useful to highlight (say) types differently from variables or functions, which in some (most?) popular languages requires full parsing and symbol table information. Some IDEs therefore implement two levels of syntax highlighting, a basic one that only requires lexical information, and an extended one that kicks in when full grammar and type information becomes available.


> this is only relevant if you want to highlight the multiplication operator differently from the dereferencing operator

Can you mention one editor which does that?


I could be stretching the definition of "does" but the newfound(?) tree-sitter support in Emacs[1] I believe would allow that since it for sure understands the distinction but I don't possess enough font-lock ninjary to actually, for real, bind a different color to the distinct usages

  /* given foo.c */
  int main() {
    int a, *b;
    a = 5 * 10;
    b = &a;
    printf("a is %d\n", *b);
  }
and then M-x c-ts-mode followed by navigating to each * and invoking M-x treesit-inspect-node-at-point in turn produces, respectively:

  (declaration declarator: (pointer_declarator "*"))

  right: (binary_expression operator: "*")

  arguments: (argument_list (pointer_expression operator: "*"))
1: https://www.emacswiki.org/emacs/Tree-sitter


These examples are unambiguous. Try with something more spicy like

  return (A)*(B);
which depends on A being a type or a variable.
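For anyone who hasn't seen the lexer hack, here is a toy version in Python: the tokenizer consults a table of typedef names, and that table is updated as declarations go by, so the same spelling `A` gets a different token kind depending on what came earlier. Everything here (`lex_with_typedefs`, the tiny keyword set, the "last identifier before the semicolon is the new type name" rule) is a simplification invented for illustration.

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\S")
KEYWORDS = {"typedef", "int", "return"}

def lex_with_typedefs(source: str):
    typedefs = set()      # the symbol-table feedback the lexer depends on
    tokens = []
    in_typedef = False
    last_name = None
    for word in TOKEN.findall(source):
        if word[0].isalpha() or word[0] == "_":
            if word in KEYWORDS:
                kind = "keyword"
            elif word in typedefs:
                kind = "type_name"
            else:
                kind = "identifier"
                last_name = word
        else:
            kind = "punct"
        tokens.append((kind, word))
        if word == "typedef":
            in_typedef = True
        elif word == ";" and in_typedef:
            typedefs.add(last_name)   # the name just declared is now a type
            in_typedef = False
    return tokens

as_cast = lex_with_typedefs("typedef int A; return (A)*(B);")
as_product = lex_with_typedefs("int A; return (A)*(B);")
```

In `as_cast` the `A` inside the parentheses is a `type_name`, so `(A)*(B)` reads as a cast of a dereference. In `as_product` the same `A` is an `identifier` and the expression is a multiplication.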


I don't think they implied there is. The sentence you quoted is essentially "this is relevant for their article about weird lexical syntax, but (almost definitely) not relevant to their original problem of syntax highlighting".


hey


I don't think the lexer hack is relevant in this instance. The lexer hack just refers to the ambiguity of `A * B` and whether that should be parsed as a variable declaration or an expression. If you're building a syntax tree, then this matters, but AFAICT all the author needs is a sequence of tokens and not a syntax tree. Maybe "parser hack" would be a better name for it.


I’d be shocked if jart didn’t know this, but it seems unlikely that an LLM would generate one of these most vexing parses, unless explicitly asked


I think you're thinking of something different to the issue in the parent comment. The most vexing parse is, as the name suggests, a problem at the parsing stage rather than the earlier lexing phase. Unlike the referenced lexing problem, it doesn't require any hack for compilers to deal with it. That's because it's not really a problem for the compiler; it's humans that find it surprising.


Given all the things that were new to the author in the article, I wouldn’t be shocked at all. There’s just a huge number of things to know, or to have come across.


Justine is proficient in C, she is the author of a libc (cosmopolitan) among other things, like Actually Portable Executables [1].

I would expect her to know C quite well, and that's probably an understatement.

[1] https://justine.lol/ape.html


At one point there was an open source project to formally specify Ruby, but I don’t know if it’s still alive: https://github.com/ruby/spec

Hmm, it seems to be alive, but based more on behavior than syntax.


Nice read.

I guess the article could be called Falsehoods Programmers Assume of Programming Language Syntaxes.


Justine gets very close to the hairiest parsing issue in any language without encountering it:

Perl's syntax is undecidable, because the difference between treating some characters as a comment or as a regex can depend on the type of a variable that is only determined e.g. based on whether a search for a Collatz counterexample terminates, or just, you know, user input.

https://perlmonks.org/?node_id=663393

C++ templates have a similar issue, I think.


I think possibly the most hilariously complicated instance of this is in perl’s tokenizer, toke.c (which starts with a Tolkien quote, 'It all comes from here, the stench and the peril.' — Frodo).

There’s a function called intuit_more which works out if $var[stuff] inside a regex is a variable interpolation followed by a character class, or an array element interpolation. Its result can depend on whether something in the stuff has been declared as a variable or not.

But even if you ignore the undecidability, the rest is still ridiculously complicated.

https://github.com/Perl/perl5/blob/blead/toke.c#L4502


Wow. I wonder how that function came to be in the first place. Surely it couldn't have started out that complicated?


> C++ templates have a similar issue

TIL! I went and dug up a citation: https://blog.reverberate.org/2013/08/parsing-c-is-literally-...


Yup, bash and GNU Make have the same issue as Perl does, and I mention the C++ issue here too:

Parsing Bash is Undecidable - https://www.oilshell.org/blog/2016/10/20.html

I remember a talk from Larry Wall on Perl 6 (now Raku), where he says this type of thing is a mistake. Raku can be statically parsed, as far as I know.


Parsing POSIX shell is undecidable too:

https://news.ycombinator.com/item?id=30362718


Yes, good point -- aliases make parse time depend on runtime. That is mentioned in

Morbig: A static parser for POSIX shell - https://scholar.google.com/scholar?cluster=15754961728999604...

(at the time I wrote the post about bash, I hadn't implemented aliases yet)

But it's a little different since it is an intentional feature, not an accident. It's designed to literally reinvoke the parser at runtime. I think it's not that good/useful a feature, and I tend to avoid it, but many people use it.


How could a search for a Collatz counterexample possibly terminate? ;)


> You'll notice its hash function only needs to consider a single character in in a string. That's what makes it perfect,

Is that a joke?

https://en.m.wikipedia.org/wiki/Perfect_hash_function


No. Taking the value of a single character is a correct perfect hash function, assuming there exists a position for the input string set where all characters differ.


It just so happens that a single character was all that's needed in that case. In the general case it wouldn't be
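
To make the condition above concrete, here is a minimal Python sketch (not the article's gperf-generated code; `distinguishing_index` is a hypothetical helper name). It looks for a character position at which every keyword differs. If such a position exists, that single character is a collision-free key for the fixed set; if not, more characters are needed.

```python
from typing import Optional, Sequence


def distinguishing_index(words: Sequence[str]) -> Optional[int]:
    """Return the first index at which every word has a different character."""
    shortest = min(len(w) for w in words)
    for i in range(shortest):
        column = [w[i] for w in words]
        if len(set(column)) == len(words):
            return i
    return None  # no single position separates all the words


print(distinguishing_index(["if", "else", "while", "return"]))  # 0
print(distinguishing_index(["for", "far", "fir"]))              # 1
print(distinguishing_index(["do", "double"]))                   # None
```

The last call shows the general case: when one word is a prefix of another, no single shared position can tell them apart.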


Forth has a default syntax, but Forth code can execute during the compilation process allowing it to accept/compile custom syntaxes.


> The languages I decided to support are Ada, Assembly, BASIC, C, C#, C++, COBOL, CSS, D, FORTH, FORTRAN, Go, Haskell, HTML, Java, JavaScript, Julia, JSON, Kotlin, ld, LISP, Lua, m4, Make, Markdown, MATLAB, Pascal, Perl, PHP, Python, R, Ruby, Rust, Scala, Shell, SQL, Swift, Tcl, TeX, TXT, TypeScript, and Zig.

A few (admittedly silly) questions about the list:

1. Why no Erlang, Elixir, or Crystal?

Erlang appears to be just at the author's boundary at #47 on the TIOBE index. https://www.tiobe.com/tiobe-index/

2. What is "Shell"? Sh, Bash, Zsh, Windows Cmd, PowerShell..?

3. Perl but no Awk? Curious why, because Awk is a similar but comparatively trivial language. Widely used, too.

To be fair, Awk, Erlang, and Elixir rank low on popularity. Yet m4, Tcl, TeX, and Zig aren't registered in the top 50 at all.

What's the methodology / criteria? Only things the author is already familiar with?

Still a fun article.


TIOBE's index is quite literally worthless, especially with regards to its stated purpose, let alone as a general point of orientation.

I wish that people would stop lending it any credibility.


"Shell" in the context of a syntax highlighting language picker almost always means a Unixy shell, most likely something along the lines of Bash.


This was a delightful read, thanks!


As soon as I saw this was part of llamafile I was hoping that it would be used to limit LLM output to always be "valid" code as soon as it saw the backticks, but I suppose most LLMs don't have problems with that anyway. And I'm not sure you'd want something like that automatically forcing valid code anyway


llama.cpp does support something like this -- you can give it a grammar which restricts the set of available next tokens that are sampled over

so in theory you could notice "```python" or whatever and then start restricting to valid python code. (at least in theory, not sure how feasible/possible it would be in practice w/ their grammar format.)

for code i'm not sure how useful it would be since likely any model that is giving you working code wouldn't be struggling w/ syntax errors anyway?

but i have had success experimentally using the feature to drive fiction content for a game from a smaller llm to be in a very specific format.
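
The general idea of restricting the next token can be sketched in a few lines of Python. This is a toy illustration only: it does not use llama.cpp's grammar format or API, the scores are made up, a greedy pick stands in for sampling, and `constrained_pick` and `digits_only` are hypothetical names.

```python
from typing import Callable, Dict


def constrained_pick(scores: Dict[str, float], prefix: str,
                     is_valid_prefix: Callable[[str], bool]) -> str:
    """Pick the highest-scoring token that keeps the output valid."""
    allowed = {tok: s for tok, s in scores.items()
               if is_valid_prefix(prefix + tok)}
    if not allowed:
        raise ValueError("grammar allows no token here")
    return max(allowed, key=allowed.get)


def digits_only(text: str) -> bool:
    """Toy 'grammar': the output must consist of decimal digits."""
    return text.isdigit()


# Made-up scores: the unconstrained favourite is "hello".
scores = {"hello": 0.9, "42": 0.5, "7": 0.3}
print(constrained_pick(scores, "1", digits_only))  # 42
```

The model's favourite token is discarded because it would break the grammar, so the best remaining candidate wins.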


yeah, ive used llama.cpp grammars before, which is why i was thinking about it. i just think it'd be cool for llamafile to do basically that, but with included defaults so you could eg, require JSON output. it could be cool for prototyping or something. but i dont think that would be too useful anyway, most of the time i think you would want to restrict it to a specific schema, so i can only see it being useful for something like a tiny local LLM for code completion, but that would just encourage valid-looking but incorrect code.

i think i just like the idea of restricting LLM output, it has a lot of interesting use cases


gotchya. i do think that is a cool idea actually -- LLMs tiny enough to do useful things with formally structured output but not big enough to nail the structure ~100% is probably not an empty set.


I don't understand why you wouldn't use Tree Sitter's syntax highlighting for this. I mean it's not going to be as fast but that clearly isn't an issue here.

Is this a "no third party dependencies" thing?


I don't want to require everyone who builds llamafile from source to install Rust. I don't even require that people install the gperf command, since I can build gperf as a 700kb actually portable executable and vendor it in the repo. Tree sitter I'd imagine does a really great highly precise job with the languages it supports. However it appears to support fewer of them than I currently do. I'm taking a breadth first approach to syntax highlighting, due to the enormous number of languages LLMs understand.


I think the Rust component of tree-sitter-highlight is actually pretty small (Tree Sitter generates C for the actual parser).

But fair enough - fewer dependencies is always nice, especially in C++ (which doesn't have a modern package manager) and in ML where an enormous janky Python installation is apparently a perfectly normal thing to require.


I somehow thought Conan[1] was the C++ package manager; it's at least partially supported by GitLab, for what that's worth

1: https://docs.conan.io/2/introduction.html


No, if anything vcpkg is "the C++ package manager", but it's nowhere near pervasive and easy-to-use enough to come close to even Pip. It's leagues away from Cargo, Go, and other actually good PL package managers.


I knew that Microsoft used that on Windows but had no idea it was multi-platform: https://github.com/microsoft/vcpkg/releases/tag/2024.10.21 (MIT, like a lot of their stuff)

Microsoft is such an odd duck, sometimes, but I'm glad to take advantage of their "good years" while it lasts


Have you developed against TreeSitter? Some feedback from people who use it here - https://news.ycombinator.com/item?id=39783471

And here - https://lobste.rs/s/9huy81/tbsp_tree_based_source_processing...


Yes I have, and it worked very well for what I was using it for (assembly language LSP server). I didn't run into any of the issues they mentioned (not saying they don't exist though).

For new projects I use Chumsky. It's a pure Rust parser which is nice because it means you avoid the generated C, and it also gives you a fully parsed and natively typed output, rather than Tree Sitter's dynamically typed tree of nodes, which means there's no extra parsing step to do.

The main downside is it's more complicated to write the parser (some fairly extreme types). The API isn't stable yet either. But overall I like it more than Tree Sitter.


> Perl also has this goofy convention for writing man pages in your source code

The world corpus of software would be much better documented if everywhere else had stolen this from Perl. Inline POD is great.


Perl and Python stole it from Emacs Lisp, though Perl took it further. I'm not sure where Java stole it from, but nowadays Doxygen is pretty common for C code. Unfortunately this results in people thinking that Javadoc and Doxygen are substitutes for actual documentation like the Emacs Lisp Reference Manual, which cannot be generated from docstrings, because the organization of the source code is hopelessly inadequate for a reference manual.


> Emacs Lisp Reference Manual, which cannot be generated from docstrings, because the organization of the source code is hopelessly inadequate for a reference manual.

Well, they're not doing themselves any favors by just willy nilly mixing C with "user-facing" defuns <https://emba.gnu.org/emacs/emacs/-/blob/ed1d691184df4b50da6b...>. I was curious if they could benefit from "literate programming" since OrgMode is the bee's knees but not with that style coding they can't


I didn't mean that specifically the Emacs source code was not organized in the right way for a reference manual. I meant that C and Java source code in general isn't. And C++, which is actually where people use Doxygen more.

The Python standard library manual is also exemplary, and also necessarily organized differently from the source code.


> The Python standard library manual is also exemplary

Maybe parts of it are, but as a concrete example https://docs.python.org/3/library/re.html#re.match is just some YOLO about what, specifically, is the first argument to re.match: string, or compiled expression? Well, it's both! Huzzah! I guess they get points for consistency because the first argument to re.compile is also "both"

But, any idea what type re.compile returns? cause https://docs.python.org/3/library/re.html#re.compile is all "don't you worry about it" versus its re.match friend who goes out of their way to state that it is an re.Match object

Would it have been so hard to actually state it, versus requiring someone to invoke type() to get <class 're.Pattern'>?
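
For anyone who wants to check the behaviour described above, a short sketch (assuming a recent Python 3, where `re.Pattern` and `re.Match` are public names):

```python
import re

pattern = re.compile(r"(\w+) (\w+)")

# re.match accepts either a pattern string or an already compiled pattern.
m1 = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
m2 = re.match(pattern, "Isaac Newton, physicist")

print(type(pattern).__name__)       # Pattern
print(isinstance(m2, re.Match))     # True
print(m1.group(0) == m2.group(0))   # True
```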


I'm surprised to see that it's allowed to pass a compiled expression to re.match, since the regular expression object has a .match method of its own. To me the fact that the argument is called pattern implies that it's a string, because at the beginning of that chapter, it says, "Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). (...) Usually patterns will be expressed in Python code using this raw string notation."

But this ability to pass a compiled regexp rather than a string can't have been an accidental feature, so I don't know why it isn't documented.

Probably it would be good to have an example of invoking re.match with a literal string in the documentation item for re.match that you linked. There are sixteen such examples in the chapter, the first being re.match(r"(\w+) (\w+)", "Isaac Newton, physicist"), so you aren't going to be able to read much of the chapter without figuring out that you can pass a string there, but all sixteen of them come after that section. A useful example might be:

    >>> [s for s in ["", " ", "a ", " a", "aa"] if re.match(r'\w', s)]
    ['a ', 'aa']
It's easy to make manuals worse by adding too much text to them, but in this case I think a small example like that would be an improvement.

As for what type re.compile returns, the section you linked to says, "Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below." Is your criticism that it doesn't explicitly say that the regular expression object is returned (as opposed to, I suppose, stored in a table somewhere), or that it says "a regular expression object" instead of saying "an re.Pattern object"? Because the words "regular expression object" are a link to the "Regular Expression Objects" section, which begins by saying, "class re.Pattern: Compiled regular expression object returned by re.compile()." To me the name of the class doesn't seem like it adds much value here—to write programs that work using the re module, you don't need to know the name of the class the regular expression objects belong to, just what interface they support.

(It's unfortunate that the class name is documented, because it would be better to rename it to a term that wasn't already defined to mean "a string that can be compiled to a regular expression object"!)

But possibly I've been using the re module long enough that I'm blind to the deficiencies in its documentation?

Anyway, I think documentation questions like this, about gradual introduction, forward references, sequencing, publicly documented (and thus stable) versus internal-only names, etc., are hard to reconcile with the constraints of source code, impossible in most languages. In this case the source code is divided between Python and C, adding difficulty.


> If you ever want to confuse your coworkers, then one great way to abuse this syntax is by replacing the heredoc marker with an empty string

Maybe I am in favor of the death penalty after all


The final line number count is missing Julia. Based on the file in the repo, it would be at the bottom of the first column: between ld and R.

Among the niceties listed here, the one I'd wish for Julia to have would be C#'s "However many quotes you put on the lefthand side, that's what'll be used to terminate the string at the other end". Documentation that talks about quoting would be so much easier to read (in source form) with something like that.
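
A rough Python sketch of how a lexer can handle that rule: count the opening quotes, then look for a run of the same length. This is a simplification under stated assumptions, not C#'s actual lexer; it ignores C#'s minimum of three quotes, its multi-line indentation rules, and content that ends in a quote. `read_raw_string` is a hypothetical name.

```python
from typing import Tuple


def read_raw_string(src: str, start: int = 0) -> Tuple[str, int]:
    """Scan a raw string literal; return (contents, index after the literal)."""
    n = 0
    while start + n < len(src) and src[start + n] == '"':
        n += 1
    if n == 0:
        raise ValueError("expected an opening quote")
    fence = '"' * n  # the closing delimiter mirrors the opening one
    end = src.find(fence, start + n)
    if end == -1:
        raise ValueError("unterminated raw string")
    return src[start + n:end], end + n


print(read_raw_string('"""say "hi" now"""'))
print(read_raw_string('""""a """ b""""'))
```

The second call shows why the quote count matters: a run of three quotes inside the body is plain text when the literal was opened with four.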


One nicety that Julia does have that I didn't know about (or had forgotten) is nested multi-line comments.

    #= this one
       has a #= nested
       comment =# inside of it
       and that works fine! =#
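
What makes nested comments different for a highlighter is that the scanner has to keep a depth counter instead of stopping at the first closing marker. A minimal Python sketch of that idea (it is not Julia's real lexer, and it ignores string literals and `#` line comments; `strip_nested_comments` is a hypothetical name):

```python
def strip_nested_comments(src: str) -> str:
    """Remove #= ... =# comments, honouring nesting via a depth counter."""
    out = []
    depth = 0
    i = 0
    while i < len(src):
        pair = src[i:i + 2]
        if pair == "#=":
            depth += 1          # entering one more level of comment
            i += 2
        elif pair == "=#" and depth > 0:
            depth -= 1          # leaving one level
            i += 2
        else:
            if depth == 0:
                out.append(src[i])  # only keep text outside all comments
            i += 1
    return "".join(out)


print(repr(strip_nested_comments("a #= x #= y =# z =# b")))  # 'a  b'
```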


The article doesn’t address why they chose not to use Tree-sitter and instead rolled their own syntax highlighting system.

Edit: same question with answer here https://news.ycombinator.com/item?id=42026554


I've done a fair bit of forth and I've not seen c" used. The usual string printing operator is ." .


Right, c" is for when you want to pass a literal string to some other word, not print it. But I agree that it's not very common, because you normally use s" for that, which leaves the address and length on the stack, while c" leaves just an address on the stack, pointing to a one-byte count field followed by the bytes. I think adding c" in Forth-83 (and renaming " to s") was a mistake, and it would have been better to deprecate the standard words that expect or produce such counted strings, other than count itself. See https://forth-standard.org/standard/alpha, https://forth-standard.org/standard/core/Cq, https://forth-standard.org/standard/core/COUNT, and https://forth-standard.org/standard/core/Sq.

You can easily add new string and comment syntaxes to Forth, though. For example, you can add BCPL-style // comments to end of line with this line of code in, I believe, all standard Forths, though I've only tested it in GForth:

    : // 10 word drop ; immediate
Getting it to work in block files requires more work but is still only a few lines of code. The standard word \ does this, and see \ decompiles the GForth implementation as

  : \
    blk @
    IF     >in @ c/l / 1+ c/l * >in ! EXIT
    THEN
    source >in ! drop ; immediate
This kind of thing was commonly done for text editor commands, for example; you might define i as a word that reads text until the end of the line and inserts it at the current position in the editor, rather than discarding it like my // above. Among other things, the screen editor in F83 does exactly that.

So, as with Perl, PostScript, TeX, m4, and Lisps that support readmacros, you can't lex Forth without executing it.


Counted (“Pascal”) strings are rare nowadays so C" is not often used. Its addr len equivalent is S" and that one is fairly common in string manipulation code.
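
For readers who haven't met the two layouts, here is a small Python sketch of a counted string (one length byte followed by the characters) and of splitting it back into the characters-plus-length form. It only models the memory layout with `bytes`; `to_counted` and `count` are hypothetical helper names, the latter loosely modelled on Forth's COUNT.

```python
def to_counted(text: bytes) -> bytes:
    """Encode as a counted string: one length byte, then the characters."""
    if len(text) > 255:
        raise ValueError("a one-byte count cannot exceed 255")
    return bytes([len(text)]) + text


def count(counted: bytes) -> tuple:
    """Split a counted string into (characters, length)."""
    n = counted[0]
    return counted[1:1 + n], n


print(to_counted(b"hello"))        # b'\x05hello'
print(count(to_counted(b"hello")))  # (b'hello', 5)
```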


Wouldn’t it be possible to let the LLM do the highlighting? Instead of returning code in plain text, it could return code within html with the appropriate tags. Maybe it’s harder than it sounds… but if it’s just for highlighting the code the LLM returns, I wouldn’t mind the highlighting not being 100% accurate.


Would be much slower and eat up precious context window.


While developing the syntax for a programming language in the early 80s, I discovered that allowing spaces in identifiers was unambiguous, e.g., upper left corner = scale factor * old upper left corner.


> Ruby is the union of all earlier languages, and it's not even formally documented.

It's documented, but you need $250 to spare: https://www.iso.org/standard/59579.html


Well, according to (ahem) a copy that I found, it only goes up to MRI 1.9 and goes out of its way to say "welp, the world is changing, so we're just going to punt until Ruby stabilizes" which is damn cheating for a standard IMHO

Also, while doing some digging I found there actually are a number of the standards that are legitimately publicly available https://standards.iso.org/ittf/PubliclyAvailableStandards/in...


ISO Ruby is a tiny, dated subset of Ruby. I doubt you'll find much Ruby that conforms to it.

The Ruby everyone uses is much better defined by RubySpec etc. via test cases, but that's not complete either.


Impressive work!

I am surprised Smalltalk and Prolog are in there though.


As for C#'s triple-quoted strings, they actually came from Java before and C# ended up adopting the same or almost the same semantics. Including stripping leading whitespace.


No AWK?


Meanwhile NeoVim doesn’t syntax highlight my commit message properly if I have messed with "commit cleanup" enough.

The comment character in Git commit messages can be a problem when you insist on prepending your commits with some "id" and the id starts with `#`. One suggestion was to allow backslash escapes in commit messages since that makes sense to a computer scientist.[1]

But looking at all of this lexical stuff I wonder if makes-sense-to-computer-scientist is a good goal. They invented the problem of using a uniform delimiter for strings and then had to solve their own problem. Maybe it was hard to use backtick in the 70’s and 80’s, but today[2] you could use backtick to start a string and a single quote to end it.

What do C-like programming languages use single quotes for? To quote characters. Why do you need to quote characters? I’ve never seen a literal character which needed an "end character" marker.

Raw strings would still be useful but you wouldn’t need raw strings just to do a very basic thing like make a string which has typewriter quotes in it.

Of course this was for C-like languages. Don’t even get me started on shell and related languages where basically everything is a string and you have to make a single-quote/double-quote battle plan before doing anything slightly nested.

[1] https://lore.kernel.org/git/vpq3808p40o.fsf@anie.imag.fr/

[2] Notwithstanding us Europeans that use a dead-key keyboard layout where you have to type twice to get one measly backtick (not that I use those)


> The comment character in Git commit messages can be a problem when you insist on prepending your commits with some "id" and the id starts with `#`

https://git-scm.com/docs/git-commit#Documentation/git-commit...


See "commit cleanup".

There are surprising layers to this. That the reporter in that thread says that git-commit will “happily” accept `#` in commit messages is half-true: it will accept it if you don’t edit the message since the `default` cleanup (that you linked to) will not remove comments if the message is given through things like `-m` and not an editing session. So `git commit -m'#something'` is fine. But then try to do rebase and cherry-pick and whatever else later, maybe get a merge commit message with a commented list of "conflicted" files. Well it can get confusing.


> Maybe it was hard to use backtick in the 70’s and 80’s, but today[2] you could use backtick to start a string and a single quote to end it.

That's how quoting works by default in m4 and TeX, both defined in the 70s. Unfortunately Unicode retconned the ASCII apostrophe character ' to be a vertical line, maybe out of a misguided deference to Microsoft Windows, and now we all have to suffer the consequences. (Unless we're using Computer Modern fonts or other fonts that predate this error, such as VGA font ROM dumps.)

In the 70s and 80s, and into the current millennium on Unix, `x' did look like ‘x’, but now instead it looks like dogshit. Even if you are willing to require a custom font for readability, though, that doesn't solve the problem; you need some way to include an apostrophe in your quoted string!

As for end delimiters, C itself supports multicharacter literals, which are potentially useful for things like Macintosh type and creator codes, or FTP commands. Unfortunately, following the Unicode botch theme, the standard failed to define an endianness or minimum width for them, so they're not very useful today. You can use them as enum values if you want to make your memory dumps easier to read in the debugger, and that's about it. I think Microsoft's compiler botched them so badly that even that's not an option if you need your code to run on it.
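
To see why the missing endianness rule hurts, here is a Python sketch of the two obvious ways to pack a four-character code into an integer. It only shows the two candidate values; which one (if either) a given C compiler produces for 'TEXT' is implementation-defined and not something this sketch can tell you.

```python
code = b"TEXT"  # e.g. a classic Macintosh file type code

big = int.from_bytes(code, "big")        # first character in the high byte
little = int.from_bytes(code, "little")  # first character in the low byte

print(hex(big))     # 0x54455854
print(hex(little))  # 0x54584554
```

Two programs that disagree on the packing will disagree on the value, which is the portability problem described above.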


> Unfortunately Unicode retconned the ASCII apostrophe character ' to be a vertical line

Unicode does not prescribe the appearance of characters. Although in the code chart¹ it says »neutral (vertical) glyph with mixed usage« (next to »apostrophe-quote« and »single quote«), font vendors have to deal with this mixed usage. And with Unicode the correct quotation marks have their own code points, making it unnecessary to design fonts where the ASCII apostrophe takes their form, but rendering all other uses pretty ugly.

I would regard using ` and ' as paired quotation marks as a hack from times when typographic expression was simply not possible with the character sets of the day.

_________

¹

    0027 ' APOSTROPHE
    = apostrophe-quote (1.0)
    = single quote
    = APL quote
    • neutral (vertical) glyph with mixed usage
    • 2019 ’ is preferred for apostrophe
    • preferred characters in English for paired quotation marks are 2018 ‘ & 2019 ’
    • 05F3 ׳ is preferred for geresh when writing Hebrew
    → 02B9 ʹ modifier letter prime
    → 02BC ʼ modifier letter apostrophe
    → 02C8 ˈ modifier letter vertical line
    → 0301 $́ combining acute accent
    → 030D $̍ combining vertical line above
    → 05F3 ׳ hebrew punctuation geresh
    → 2018 ‘ left single quotation mark
    → 2019 ’ right single quotation mark
    → 2032 ′ prime
    → A78C ꞌ latin small letter saltillo«



This is an excellent document. I disagree with its normative conclusions, because I think being incompatible with ASCII, Unix, Emacs, and TeX is worse than being incompatible with ISO-8859-1, Microsoft Windows, and MacOS 9, but it is an excellent reference for the factual background.


> That's how quoting works by default in m4 and TeX, both defined in the 70s.

Good point. And it was in m4[1] I saw that backtick+apostrophe syntax. I would have probably not thought of that possibility if I hadn’t seen it there.

[1] Probably on Wikipedia since I have never used it

> Unfortunately Unicode retconned the ASCII apostrophe character ' to be a vertical line, maybe out of a misguided deference to Microsoft Windows, and now we all have to suffer the consequences. (Unless we're using Computer Modern fonts or other fonts that predate this error, such as VGA font ROM dumps.)

I do think the vertical line looks subpar (and I don’t use it in prose). But most programmers don’t seem bothered by it. :|

> In the 70s and 80s, and into the current millennium on Unix, `x' did look like ‘x’, but now instead it looks like dogshit.

Emacs tries to render it like ‘x’ since it uses backtick+apostrophe for quotes. With some mixed results in my experience.

> Even if you are willing to require a custom font for readability, though, that doesn't solve the problem; you need some way to include an apostrophe in your quoted string!

Aha, I honestly didn’t even think that far. Seems a bit restrictive to not be able to use possessives and contractions in strings without escapes.

> As for end delimiters, C itself supports multicharacter literals, which are potentially useful for things like Macintosh type and creator codes, or FTP commands.

I should have made it clear that I was only considering C-likes and not C itself. A language from the C trigraph days can be excused. To a certain extent.


I'd forgotten about `' in Emacs documentation! That may be influenced by TeX.

C multicharacter literals are unrelated to trigraphs. Trigraphs were a mistake added many years later in the ANSI process.


The comment character is also configurable:

    git config core.commentchar <char>
This is helpful where you want to use, say, Markdown to have tidily formatted commit messages make up your pull request body too.


I want to try to set it to `auto` and see what spicy things it comes up with.


There are no problems caused by using unary delimiters for strings, because using paired delimiters for strings doesn't solve the problems unary delimiters create.

By nature, strings contain arbitrary text. Paired delimiters have one virtue over unary: they nest, but this virtue is only evident when a syntax requires that they must nest, and this is not the case for strings. It's but a small victory to reduce the need for some sort of escaping, without eliminating it.

Of the bewildering variety of partial solutions to the dilemma, none fully satisfactory, I consider the `backtick quote' pairing among the worst. Aside from the aesthetic problems, which can be fixed with the right choice of font, the bare apostrophe is much more common in plain text than an unmatched double quote, and the convention does nothing to help.

This comes at the cost of losing a type of string, and backtick strings are well-used in many languages, including by you in your second paragraph. What we would get in return for this loss is, nothing, because `don't' is just as invalid as 'don't' and requires much the same solution. `This is `not worth it', you see', especially as languages like to treat strings as single tokens (many exceptions notwithstanding) and this introduces a push-down to that parse for, again, no appreciable benefit.

I do agree with you about C and character literals, however. The close quote isn't needed and always struck me as somewhat wasteful. 'a is cleaner, and reduces the odds of typing "a" when you mean 'a'.
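
The trade-off with paired delimiters can be shown with a small Python sketch of a scanner for nesting `...' strings (a toy, not any real language's lexer; `scan_paired` is a hypothetical name). Nesting needs a depth counter, and a lone apostrophe still ends the string early.

```python
def scan_paired(src: str) -> str:
    """Scan a `...' string that starts at src[0], allowing nested pairs."""
    if not src.startswith("`"):
        raise ValueError("expected an opening backtick")
    depth = 0
    for i, ch in enumerate(src):
        if ch == "`":
            depth += 1
        elif ch == "'":
            depth -= 1
            if depth == 0:
                return src[1:i]  # contents between the outermost pair
    raise ValueError("unterminated string")


print(scan_paired("`This is `not worth it', you see'"))
print(scan_paired("`don't'"))  # don
```

The first call nests correctly; the second stops at the apostrophe in the contraction, so escaping is still required.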


Glad to see confirmed that PHP is the most non weird programming language ;)


I recently learned php's heredoc can have space before it and it will remove those spaces from the lines in the string:

    $a = <<<EOL
        This is
        not indented
            but this has 4 spaces of indentation
        EOL;
But the spaces have to match: if any line has fewer spaces of indentation than the closing EOL, it gives an error.
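
The rule can be sketched in Python as follows. This mirrors the behaviour described above (the flexible heredoc syntax arrived in PHP 7.3) but is not PHP's implementation; `dedent_heredoc` is a hypothetical name, and treating blank lines as exempt is an assumption of this sketch.

```python
def dedent_heredoc(body_lines, closing_indent):
    """Strip closing_indent leading spaces from every line of the body."""
    result = []
    for line in body_lines:
        if line.strip() == "":
            result.append("")  # assumption: blank lines are exempt
            continue
        if not line.startswith(" " * closing_indent):
            raise ValueError("line is indented less than the closing marker")
        result.append(line[closing_indent:])
    return "\n".join(result)


body = [
    "    This is",
    "    not indented",
    "        but this has 4 spaces of indentation",
]
print(dedent_heredoc(body, 4))
```

The closing marker's indentation sets how much is removed, so anything indented further keeps the difference.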


There are two types of languages: the ones full of quirks and the ones no one uses.



