Weird Lexical Syntax

TomatoCo · 2024-11-02T14:19:10 1730557150

I think my favorite C trigraph was something like

  do_action() ??!??! handle_error()

It almost looks like special error handling syntax but still remains satisfying once you realize it's an || logical-or statement and it's using short circuiting rules to execute handle error if the action returns a non-zero value.

jacobn · 2024-11-03T16:14:57 1730650497

https://en.wikipedia.org/wiki/Digraphs_and_trigraphs_(progra...

??! Is converted to |

So ??!??! Becomes || i.e. “or”

wslh · 2024-11-02T14:31:49 1730557909

Did you choose the legacy C trigraphs over || for aesthetic purposes?

TomatoCo · 2024-11-08T20:48:21 1731098901

No, I don't really write C. I was remarking on something I saw once and I wanted to share it because the original article mentions trigraphs.

wslh · 2024-11-02T20:10:55 1730578255

Could you review my comment on HN? Please educate me if there is something I haven’t understood, rather than downvoting my question.

samatman · 2024-11-02T20:50:44 1730580644

The grandparent post is specifically about trigraphs. Saying something about trigraphs was the end-in-itself, trigraphs were chosen to illustrate something about trigraphs. So your question made no sense. Hope that helps.

Izkata · 2024-11-03T02:54:46 1730602486

Maybe the confusion was the other way, more like "why is that funny/interesting?"

An attempt to answer that: In English, mixing ?! at the end of a question is a way of indicating bewilderment. Like "What was that?!"

JadeNB · 2024-11-03T15:51:43 1730649103

> So your question made no sense. Hope that helps.

I think that is uncharitable. The question ("Did you choose the legacy C trigraphs over || for aesthetic purposes?") makes perfect sense to me. I think context makes it reasonably clear that the answer is 'yes,' but that doesn't mean that the question doesn't make sense, only perhaps that it didn't need to be asked.

wslh · 2024-11-03T07:41:46 1730619706

My question was precisely about why the user like trigraphs over using just || on this case. It is a very clear question and makes all the sense.

kergonath · 2024-11-03T09:33:20 1730626400

The post shows a “favorite C trigraph” thing, not that they were going out of their way to use trigraphs in actual code or that you should. Using trigraphs is the whole premise so no, your question makes no sense in that context.

FWIW the ??!??! double trigraph as error processing is funny because of the meaning of ?! and various combinations of ? and !. It is funny and it has trigraphs. That’s the whole point.

wslh · 2024-11-03T15:51:12 1730649072

> The post shows a “favorite C trigraph” thing, not that they were going out of their way to use trigraphs in actual code or that you should.

But I am free to be curious and ask the author why he choose it! We are not computers but human beings! There is no HN rule that says that I cannot be curious and asks a question that arised from a thread but it is not connected to that! [1].

[1] https://www.iflscience.com/charles-babbage-once-sent-the-mos...

kergonath · 2024-11-04T10:30:57 1730716257

> But I am free to be curious and ask the author why he choose it

They chose it because the discussion is about trigraphs. It is not a particularly surprising choice to talk about trigraphs in that context.

> There is no HN rule that says that I cannot be curious and asks a question that arised from a thread but it is not connected to that!

I did not write that you did not follow HN rules, just that your question was very strange in the context. I did not downvote you, but I understand why some people did. Your question was a bit passive agressive, even if you did not mean it.

mstade · 2024-11-03T13:26:24 1730640384

My reading of the downvoted question was one of genuine curiosity of why the author chose that as a favorite trigraph, as in “why that one instead of another”, not as criticism of the choice of trigraph over something more conventional. I may be wrong of course, but it didn’t seem like a particularly malicious question to me and your rationale unfortunately doesn’t convince me otherwise. Not that it has to, this is all very subjective after all, but just offering up a counter opinion.

I gave the question a +1 because I, as previously stated, read it to be genuine curiosity. Maybe a smiley would’ve helped, I don’t know. ¯\_(ツ)_/¯

teo_zero · 2024-11-03T08:52:13 1730623933

I didn't downvote your comment but understand why it looks "wrong": it's like, in a thread on English oddities, you replied to someone bringing up the "buffalo buffalo buffalo" example with the question "why are you so fond of bovines"?

wslh · 2024-11-03T15:57:14 1730649434

It has nothing to do with that. I could ask why he didn't choose a different homonymic ambiguity [1].

[1] https://journals.linguisticsociety.org/proceedings/index.php...

phatskat · 2024-11-04T02:07:01 1730686021

I think the misunderstanding is a bit of missing context from the original comment. I read it as “one of my favorite trigraphs [that I have seen]”. Their comment didn’t make a claim to using said trigraph, just that they had seen it somewhere and thought it was interesting.

Your comments seem to suggest your reading was “one of my favorite trigraphs [to use is]”, which is understandable as a valid interpretation.

The whole of this chain of comments is a misunderstanding on the original comment’s ambiguity.

James_K · 2024-11-03T15:11:15 1730646675

Easiest way to get downvotes is to ask people not to give them. You just gotta ignore the haters.

__MatrixMan__ · 2024-11-02T13:49:11 1730555351

This was a fun read, but it left me a bit more sympathetic to the lisp perspective, which (if I've understood it) is that syntax, being not an especially important part of a language, is more of a hurdle than a help, and should be as simple and uniform as possible so we can focus on other things.

Which is sort of ironic because learning how to do structural editing on lisps has absolutely been more hurdle than help so far, but I'm sure it'll pay off eventually.

mqus · 2024-11-02T14:19:33 1730557173

Having a simple syntax might be fine for computers but syntax is mainly designed to be read and written by humans. Having a simple one like lisp then just makes syntactic discussions a semantic problem, just shifting the layers.

And I think an complex syntax is far easier to read and write than a simple syntax with complex semantics. You also get a faster feedback loop in case the syntax of your code is wrong vs the semantics (which might be undiscovered until runtime).

__MatrixMan__ · 2024-11-02T15:44:46 1730562286

Jury's out re: whether I feel this in my gut. Need more time with the lisps for that. But re: cognitive load maybe it goes like:

1. 1 language to rule them all, fancy syntax

2. Many languages, 1 simple syntax to rule them all

3. Many languages and many fancy syntaxes

Here in the wreckage of the tower of babel, 1. isn't really on the table. But 2. might have benefits because the inhumanity of the syntax need only be confronted once. The cumulative cost of all the competing opinionated fancy syntaxes may be the worst option. Think of all the hours lost to tabs vs spaces or braces vs whitespace.

dartos · 2024-11-02T16:25:25 1730564725

I think 3 is not only a natural state, but the best state.

I don’t think we can have 1 language that satisfies the needs of all people who write code, and thus, we can’t have 1 syntax that does that either.

3 seems the only sensible solution to me, and we have it.

__MatrixMan__ · 2024-11-02T16:45:35 1730565935

I dunno, here in 3 the hardest part of learning a language has little to do with the language itself and more to do with the ecosystem of tooling around that language. I think we could more easily get on to the business of using the right language for the job if more of that tooling was shared. If each language, for instance did not have it's own package manager, its own IDE, its own linters and language servers all with their own idiosyncrasies arising not from deep philosophical differences of the associated language but instead from accidental quirks of perspective from whoever decided that their favorite language needed a new widget.

I admire the widget makers, especially those wrangling the gaps between languages. I just wish their work could be made easier.

skydhash · 2024-11-02T17:42:15 1730569335

I really like the Linux package managers. If you're going to write an application that will run on some system, it's better to bake dependencies into it. And with virtualization and containerization, the system is not tied to a physical machine. I've been using containers (incus) more and more for real development purposes as I can use almost the same environment to deploy. I don't care much about the IDE, but I'm glad we have LSP, Tree-sitter, and DAP. The one thing I do not like is the proliferation of tooling version manager (NVM,..) instead of managing the environment itself (tied to the project).

James_K · 2024-11-03T15:17:32 1730647052

I've always thought these complaints are really just a reflection of how stuck we are in the C paradigm. The idea that you have to edit programs as text is outdated IMO. It should be that your editor operates on the syntax tree of the source code. Once you do that, the code can be displayed in any way.

mdaniel · 2024-11-03T17:19:39 1730654379

I also believe this, and we're actually about half way there via MPS <https://github.com/JetBrains/MPS#readme> but I'm pretty sure that dream is dead until this LLM hype blows over, since LLMs are not going to copy-paste syntax trees until the other dream of a universal representation materializes[1]

1: There have been several attempts at Universal ASTs, including (unsurprisingly) a JVM-centric one from JetBrains https://github.com/JetBrains/intellij-community/blob/idea/24...

broken-kebab · 2024-11-03T13:13:41 1730639621

The problem with this statement is that it assumes parsing-easiness as something universal, and stable. And this is certainly not true. You may believe syntax A is so much easier simply because it's the syntax you have been dealing with most of your career thus your brain is trained for it. On top of it a particular task can make a lot of difference: most people would agree that regex is simplification versus writing the same logic in usual if-then way for pattern matching in strings, but I'm not sure many would like to have their whole programs looking that way (but even that could be subjective, see APL).

andai · 2024-11-03T09:54:07 1730627647

This is interesting. My first thought was that a language where more meaning is expressed in syntax could catch more errors at compile time. But there seems to be no reason why meaning encoded in semantics could not also be caught at compile time.

The main benefit of putting things in the syntax seems to be that many errors would become visually obvious.

drewr · 2024-11-02T15:18:36 1730560716

I don't understand your distinction between syntax and semantics. If the semantics are complex, wouldn't that mean the syntax is thus complex?

SuperCuber · 2024-11-02T17:00:09 1730566809

lisp's syntax is simple - its just parenthesis to define a list, first element of a list is executed as a function.

but for example a language like C has many different syntaxes for different operations, like function declaration or variable or array syntax, or if/switch-case etc etc.

so to know C syntax you need to learn all these different ways to do different things, but in lisp you just need to know how to match parenthesis.

But of course you still want to declare variables, or have if/else and switch case. So you instead need to learn the builtin macros (what GP means by semantics) and their "syntax" that is technically not part of the language's syntax but actually is since you still need all those operations enough that they are included in the standard library and defining your own is frowned upon.

kryptiskt · 2024-11-03T15:58:06 1730649486

Lisp has way more syntax, that doesn't cover any of the special forms. Knowing about application syntax doesn't help with understanding `let` syntax. Even worse, with macros, the amount of syntax is open-ended. That they all come in the form of S-expressions doesn't help a lot in learning them.

skydhash · 2024-11-02T18:01:25 1730570485

Most languages' abstract machines expose a very simple API, it's up to the language to add useful constructs to help us write code more efficiently. Languages like Lisp start with a very simple syntax, then add those constructs with the language itself (even though those can be fixed using a standard), others just add it through the syntax. These constructs plus the abstract machine's operations form the semantics, syntax is however the language designer decided to present them.

fanf2 · 2024-11-02T15:16:43 1730560603

Lisp has reader macros which allow you to reprogram its lexer. Lisp macros allow you to program the translation from the visible structure to the parse tree.

For example, https://pyret.org/

It really isn’t simple or necessarily uniform.

__MatrixMan__ · 2024-11-02T21:11:38 1730581898

I've heard that certain lisps (Common Lisp comes up when I search for reader macros) allow for all kinds of tinkering with themselves. But the ability of one to make itself not a lisp anymore, while interesting, doesn't seem to say much about the merits of sticking to s-expressions, except maybe to point out that somebody once decided not to.

lispm · 2024-11-03T10:00:12 1730628012

Reader macros are there to program and configure the reader. The reader is responsible for reading s-expressions into internal data structures. There are basically two main uses of reader-macros: data structures and reader control.

A CL implementation will implement reading lists, symbols, numbers, arrays, strings, structures, characters, pathnames, ... via reader macros. Additionally the reader implements various forms of control operations: conditional reading, reading and evaluation, circular datastructures, quoting and comments.

This is user programmable&configurable. Most uses will be in the two above categories: data structure syntax and control. For example we could add a syntax for hash tables to s-expressions. An example for a control extension would be to add support for named readtables. For example a Common Lisp implementation could add a readtable for reading s-expressions from Scheme, which has a slightly different syntax.

Reader macros were optimized for implementing s-expressions, thus the mechanism isn't that convenient as a lexer/parser for actual programming languages. It's a a bit painful to do so, but possible.

A typical reader macro usage, beyond the usage described above, is one which implements a different token or expression syntax. For example there are reader macros which parse infix expressions. This might be useful in Lisp code where arithmetic expressions can be written in a more conventional infix syntax. The infix reader macro would convert infix expressions into prefix data.

lispm · 2024-11-03T09:44:57 1730627097

Is Pyret based on reader macros? I would think it's much easier to use a syntax parser for that.

nlitened · 2024-11-02T14:34:03 1730558043

I am surprised to hear that structural editing has been a hurdle for you, and I think I can offer a piece of advice. I also used to be terrified by its apparent complexity, but later found out that one just needs to use parinfer and to know key bindings for only three commands: slurp, barf, and raise.

With just these four things you will be 95% there, enjoying the fruits of paredit without any complexity — all the remaining tricks you can learn later when you feel like you’re fluent.

__MatrixMan__ · 2024-11-02T14:57:56 1730559476

Thanks very much for the advice, it's timely.

<rant> It's not so much the editing itself but the unfamiliarity of the ecosystem. It seems it's a square-peg I've been crafting a round hole of habits for it:

I guess I should use emacs? How to even configure it such that these actions are available? Or maybe I should write a plugin for helix so that I can be in a familiar environment. Oh, but the helix plugin language is a scheme, so I guess I'll use emacs until I can learn scheme better and then write that plugin. Oh but emacs keybinds are conflicting with what I've configured for zellij, maybe I can avoid conflicts by using evil mode? Oh ok, emacs-lisp, that's a thing. Hey symex seems like it aligns with my modal brain, oh but there goes another afternoon of fussing with emacs. Found and reported a symex "bug" but apparently it only appears in nix-governed environments so I guess I gotta figure out how to report the packaging bug (still todo). Also, I guess I might as well figure out how to get emacs to evaluate expressions based on which ones are selected, since that's one of the fun things you can do in lisps, but there's no plugin for the scheme that helix is using for its plugin language (which is why I'm learning scheme in the first place), but it turns out that AI is weirdly good at configuring emacs so now my emacs config contains most that that plugin would entail. Ok, now I'm finally ready to learn scheme, I've got this big list of new actions to learn: https://countvajhula.com/2021/09/25/the-animated-guide-to-sy.... Slurp, barf, and raise you say? excellent, I'll focus on those.

I'm not actually trying to critique the unfamiliar space. These are all self inflicted wounds: me being persnickety about having it my way. It's just usually not so difficult to use something new and also have it my way.</rant>

pxc · 2024-11-02T22:01:31 1730584891

> Oh but emacs keybinds are conflicting with what I've configured for zellij,

Don't do that. ;)

Emacs is a graphical application! Don't use it in the terminal unless you really have to (i.e., you're using it on a remote machine and TRAMP will not do).

> it turns out that AI is weirdly good at configuring emacs

I was just chatting with a friend about this. ChatGPT seems to be much better at writing ELisp than many other languages I've asked it to work with.

Also while you're playing with it, you might be interested in checking out kakoune.el or meow, which provide modal editing in Emacs but with the selection-first ordering for commands, like in Kakoune and Helix rather than the old vi way.

PS: symex looks really interesting! Hadn't been that one

cenamus · 2024-11-03T10:19:48 1730629188

Well, elisp probably accounts for like 85% of the lisp code on GH and co, so that'd make sense

nlitened · 2024-11-02T17:37:57 1730569077

To be fair, I am not a "lisper" and I don't know Emacs at all. I am just a Clojure enjoyer who uses IntelliJ + Cursive with its built-in parinfer/paredit.

xenophonf · 2024-11-02T16:50:21 1730566221

I never bothered with structural editing on Emacs. I just use the sentence/paragraph movement commands. M-a, M-e, M-n, M-p, M-T, M-space, etc.

petters · 2024-11-03T05:33:09 1730611989

> I'm not sure who wants to be able to syntax highlight C at 35 MB per second, but I am now able to do so

Fast, but tcc *compiles* C to binary code at 29 MB/s on a really old computer: https://bellard.org/tcc/#speed Should be possible to go much faster but probably not needed

BiteCode_dev · 2024-11-03T09:11:41 1730625101

Justine vs Bellard, that's a nice setup.

pdw · 2024-11-02T13:16:16 1730553376

Some random things that the author seem to have missed:

> but TypeScript, Swift, Kotlin, and Scala take string interpolation to the furthest extreme of encouraging actual code being embedded inside strings

Many more languages support that:

    C#             $"{x} plus {y} equals {x + y}"
    Python         f"{x} plus {y} equals {x + y}"
    JavaScript     `${x} plus ${y} equals ${x + y}`
    Ruby           "#{x} plus #{y} equals #{x + y}"
    Shell          "$x plus $y equals $(echo "$x+$y" | bc)"
    Make :)        echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"

> Tcl

Tcl is funny because comments are only recognized in code, and since it's a homoiconic, it's very hard to distinguish code and data. { } are just funny string delimiters. E.g.:

    xyzzy {#hello world}

Is xyzzy a command that takes a code block or a string? There's no way to tell. (Yes, that means that the Tcl tokenizer/parser cannot discard comments: only at evaluation time it's possible to tell if something is a comment or not.)

> SQL

PostgreSQL has the very convenient dollar-quoted strings: https://www.postgresql.org/docs/current/sql-syntax-lexical.h... E.g. these are equivalent:

    'Dianne''s horse'
    $$Dianne's horse$$
    $SomeTag$Dianne's horse$SomeTag$

autarch · 2024-11-02T15:36:18 1730561778

Perl lets you do this too:

    my $foo = 5;
    my $bar = 'x';
    my $quux = "I have $foo $bar\'s: @{[$bar x $foo]}";
    print "$quux\n";

This prints out:

    I have 5 x's: xxxxx

The "@{[...]}" syntax is abusing Perl's ability to interpolate an _array_ as well as a scalar. The inner "[...]" creates an array reference and the outer "@{...}" dereferences it.

For reasons I don't remember, the Perl interpreter allows arbitrary code in the inner "[...]" expression that creates the array reference.

Izkata · 2024-11-02T19:29:27 1730575767

> For reasons I don't remember, the Perl interpreter allows arbitrary code in the inner "[...]" expression that creates the array reference.

...because it's an array value? Aside from how the languages handle references, how is that part any different from, for example, this in python:

  >>> [5 * 'x']
  ['xxxxx']

You can put (almost) anything there, as long as it's an expression that evaluates to a value. The resulting value is what goes into the array.

autarch · 2024-11-02T19:38:26 1730576306

I understand that's constructing an array. What's a bit odd is that the interpreter allows you to string interpolate any expression when constructing the array reference inside the string.

Izkata · 2024-11-02T19:56:56 1730577416

It's not...? Well, not directly: It's string interpolating an array of values, and the array is constructed using values from the results of expressions. These are separate features that compose nicely.

JadeNB · 2024-11-03T04:11:50 1730607110

> What's a bit odd is that the interpreter allows you to string interpolate any expression when constructing the array reference inside the string.

Why? Surely it is easier for both the language and the programmer to have a rule for what you can do when constructing references to anonymous arrays, without having to special case whether that anonymous array is or is not in a string (or in any one of the many other contexts in which such a construct may appear in Perl).

weinzierl · 2024-11-02T20:38:58 1730579938

You also don't need quotes around strings (barewords). So

    my $bar = x;

should give the same result.

Good luck with lexing that properly.

https://perlmaven.com/barewords-in-perl

shawn_w · 2024-11-02T23:26:38 1730589998

If you're writing anything approaching decent perl that won't be accepted.

weinzierl · 2024-11-03T10:00:43 1730628043

Doesn't really matter for a syntax highlighter, because it is out of your control what you get. For the llamafile highlighter even more so since it supports other legacy quirks, like C trigraphs as well.

emmelaich · 2024-11-03T01:57:30 1730599050

"use strict" will prevent it and I think strict will be assumed/default soon.

JadeNB · 2024-11-03T04:13:38 1730607218

As of Perl 5.12, `use`ing a version (necessary to ensure availability of some of the newer features) automatically implies `use strict`.

https://perldoc.perl.org/strict#HISTORY

layer8 · 2024-11-02T17:15:19 1730567719

> actual code being embedded inside strings

My view on this is that it shouldn’t be interpreted as code being embedded inside strings, but as a special form of string concatenation syntax. In turn, this would mean that you can nest the syntax, for example:

    "foo { toUpper("bar { x + y } bar") } foo"

The individual tokens being (one per line):

    "foo {
    toUpper
    (
    "bar {
    x
    +
    y
    } bar"
    )
    } foo"

If `+` does string concatenation, the above would effectively be equivalent to:

    "foo " + toUpper("bar " + (x + y) + " bar") + " foo"

I don’t know if there is a language that actually works that way.

panzi · 2024-11-02T17:38:56 1730569136

Indeed in some of the listed languages you can nest it like that, but in others (e.g. Python) you can't. I would guess they deliberately don't want to enable that and it's not a problem in their parser or something.

Tarean · 2024-11-02T17:59:01 1730570341

As of python 3.6 you can nest fstrings. Not all formatters and highlighters have caught up, though.

Which is fun, because correct highlighting depends on language version. Haskell has similar problems where different compiler flags require different parsers. Close enough is sufficient for syntax highlighting, though.

Python is also a bit weird because it calls the format methods, so objects can intercept and react to the format specifiers in the f-string while being formatted.

panzi · 2024-11-02T19:24:55 1730575495

I didn't mean nested f-strings. I mean this is a syntax error:

    >>> print(f"foo {"bar"}")
    SyntaxError: f-string: expecting '}'

Only this works:

    >>> print(f"foo {'bar'}")
    foo bar

pdw · 2024-11-02T20:27:28 1730579248

You're using an old Python version. On recent versions, it's perfectly fine:

    Python 3.12.7 (main, Oct  3 2024, 15:15:22) [GCC 14.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print(f"foo {"bar"}")
    foo bar

layer8 · 2024-11-02T17:57:04 1730570224

Even when nesting is disallowed, my point is that I find it preferable to not view it (and syntax-highlight it) as a “special string” with embedded magic, but as multiple string literals with just different delimiters that allow omitting the explicit concatenation operator, and normal expressions interspersed in between. I think it’s important to realize that it is really just very simple syntactic sugar for normal string concatenation.

Timwi · 2024-11-03T00:46:47 1730594807

While you're conceptually right, in practice I think it bears mentioning that in C# the two syntaxes compile differently. This is because C#’s target platform, the .NET Framework, has always had a function called `string.Format` that lets you write this:

  var str = string.Format("{0} is {1} years old.", name, age);

When interpolated strings were introduced later, it was natural to have them compile to this instead of concatenation.

layer8 · 2024-11-03T03:51:31 1730605891

There's no reason in principle why

    name + " is " + age + " years old."

couldn't compile to exactly the same. (Other than maybe `string.Format` having some additional customizable behavior, I don't know C# that well.)

epcoa · 2024-11-03T05:07:30 1730610450

Like python, and Rust with the format! macro (which doesn't even support arbitrary expressions), C# the full syntax for interpolated/formatted strings is this: {<interpolationExpression>[,<alignment>][:<formatString>]}, ie there is more going on then just a simple wrapper around concat or StringBuilder.

ygra · 2024-11-03T08:07:41 1730621261

When not using the format specifiers or alignment it will indeed compile to just string.Concat (which is also what the + operator for strings compiles to). Similar to C compilers choosing to call pits instead of printf if there is nothing to be formatted.

epcoa · 2024-11-03T03:07:52 1730603272

If it’s treated strictly as simple concatenation syntactic sugar then you are allowing something like print(“foo { func() ); Which seems janky af.

> just very simple syntactic sugar for normal string concatenation.

Maybe. There’s also possibly a string conversion. It seems reasonable to want to disallow implicit string conversion in a concatenation operator context (especially if overloading +) while allowing it in the interpolation case.

layer8 · 2024-11-03T03:45:32 1730605532

I failed to mention the balancing requirement, that should of course remain. But it's an artificial requirement, so to speak, that is merely there to double-check the programmer's intent. The compiler/parser wouldn't actually care (unlike for an arithmetic expression with unbalanced parentheses, or scope blocks with unbalanced braces), the condition is only checked for the programmer's benefit.

> here’s also possibly a string conversion. It seems reasonable to want to disallow implicit string conversion in a concatenation operator context (especially if overloading +) while allowing it in the interpolation case.

Many languages have a string contenation operator that does implicit conversion to string, while still having a string interpolation syntax like the above. It's kind of my point that both are much more similar to each other than many people seem to realize.

epcoa · 2024-11-02T18:52:13 1730573533

> "foo { …

That should probably not be one token.

> My view on this is that it shouldn’t be interpreted as code being embedded inside strings

I’m not sure exactly what you’re proposing and how it is different. You still can’t parse it as a regular lexical grammar.

How does this change how you highlight either?

Whatever you call it, to the lexer it is a special string, it has to know how to match it, the delimiters are materially different than concatenation.

I might be being dense but I’m not sure what’s formally distinct.

layer8 · 2024-11-03T03:31:13 1730604673

> > "foo { …

> That should probably not be one token.

It's exactly the point that this is one token. It's a string literal with opening delimiter `"` and closing delimiter `{`, and that whole token itself serves as a kind of opening "brace". Alternatively, you can see `{` as a contraction of `" +`. Meaning, aside from the brace balancing requirement, `"foo {` does the same a `"foo " +` would.

Still alternatively, you could imagine a language that concatenates around string literals by default, similar to how C behaves for sequences of string literals. In C,

    "foo" "bar" "baz"

is equivalent to

    "foobarbaz"

Similarly, you could imagine a language where

    "foo" some_variable "bar"

would perform implicit concatenation, without needing an explicit operator (as in `"foo" + x + "bar"`). And then people might write it without the inner whitespace, as:

    "foo"some_variable"bar"

My point is that

    "foo{some_variable}bar"

is really just that (plus a condition requiring balanced pairs of braces). You can also re-insert the spaces for emphasis:

    "foo{ some_variable }bar"

The fact that people tend to think of `{some_variable}` as an entity is sort-of an illusion.

> How does this change how you highlight either?

You would highlight the `"...{`, `}...{`, and `}..."` parts like normal string literals (they just use curly braces instead of double quotes at one or both ends), and highlight the inner expressions the same as if they weren't surrounded by such literals.

epcoa · 2024-11-03T04:30:41 1730608241

> It's exactly the point that this is one token.

Fair enough. The point, as you have acknowledged, being that unlike + you have to treat { specially for balancing (and separately from the “).

> The fact that people tend to think of `{some_variable}` as an entity is sort-of an illusion.

I guess. I just don’t know what being an illusion means formally. It’s not an illusion to the person that has to implement the state machine that balances the delimiters.

> You would highlight the `"...{`, `}...{`, and `}..."` parts like normal string literals (they just use curly braces instead of double quotes at one or both ends), and highlight the inner expressions the same as if they weren't surrounded by such literals

Emacs does it this way FWIW. But I’m not sure how important it is to dictate that the brace can’t be a different color.

In any event, I can agree your design is valid (Kotlin works this way), but I don’t necessarily agree it is any more valid than say how Python does it where there can format specifiers, implicit conversion to string is performed whereas not with concatenation. I’m not seeing the clear definitive advantage of interpolated strings being an equivalent to concatenation vs some other type of method call.

The other detail is order of evaluation or sequencing. String concat may behave differently. Not sure I agree it is wrong, because at the end of the day it is distinct looking syntax. Illusion or not, it looks like a neatly enclosed expression, and concatenation looks like something else. That they might parse, evaluate or behave different isn't unreasonable.

vidarh · 2024-11-03T10:15:10 1730628910

Ruby takes this to 100. As much as a I love Ruby, this is valid Ruby, and I can't defend this:

    puts "This is #{<<HERE.strip} evil"
    incredibly 
    HERE

Just to combine the string interpolation with her concern over Ruby heredocs.

My other favorite evil quirk in Ruby is that whitespace is a valid quote character in Ruby. The string (without the quotes) "% hello " is a quoted string containing "hello" (without the quotes), as "%" in contexts where there is no left operand initiates a quoted string and the next characters indicates the type of quotes. This is great when you do e.g. "%(this is a string)" or "%{this is a string}". It's not so great if you use space (I've never seen that in the wild, so it'd be nice if it was just removed - even irb doesn't handle it correctly)

jart · 2024-11-03T16:41:38 1730652098

https://pbs.twimg.com/media/GbEfj6fbQAQRUB7?format=png&name=...

That's so going in the blog post later today.

vidarh · 2024-11-03T19:40:41 1730662841

Heh. I love Ruby, but, yes, the parser is "interesting", for values of interesting left undefined for its high obscenity content.

mdaniel · 2024-11-03T17:32:21 1730655141

And don't overlook the fact that the bare-world, or its "HERE" friend, are still in an interpolation context, so...

    puts "hello #{<<onoz.strip} world"
    recursion is #{<<onoz.strip}
    recursive
    onoz
    onoz
    puts "that was fun"

yields

  hello recursion is recursive world
  that was fun

and then there's its backtick friend

    puts "hello #{<<`onoz`.strip} world"
    date -u
    onoz

coughs up

    hello Sun Nov  3 17:25:32 UTC 2024 world

and for those trying out your percent-space trick, be aware that it only tolerates such a thing in a standalone expression context so

  puts (% hello )+" world"
  # or
  x = % hello #
  puts x

because when I tried it "normally" I got

    $ /usr/bin/ruby -e 'puts % hello  + "world"'

    -e:1:in `<main>': undefined local variable or method `hello' for main:Object (NameError)
    $ /usr/bin/ruby -v
    ruby 2.6.10p210 (2022-04-12 revision 67958) [universal.x86_64-darwin21]

but, at the intersection is "ruby parsing is the 15th circle of hell"

    ruby -e 'puts (% #{<<FOO.strip}  )+ " world"
    hello
    FOO
    '

vidarh · 2024-11-04T00:49:16 1730681356

> $ /usr/bin/ruby -e 'puts % hello + "world"'

Yes, it's roughly limited in use to places where it is not ambiguous whether it would be the start of a quoted string or the modulus operator, and after a method name would be ambiguous.

> but, at the intersection is "ruby parsing is the 15th circle of hell"

It's surprisingly (not this part, anyway) not that hard. You "just" need to create a forward reference, and keep track of heredocs in your parser, and when you come to the end of a line with heredocs pending, you need to parse them and assign them to the variable references you've created.

It is dirty, though, and there are so many dark corners of parsing Ruby. Having written a partial Ruby parser, and being a fan of Wirth-style grammar simplicity while enjoying using Ruby is a dark, horrible place to live in. On the one hand, I find Ruby a great pleasure to use, on the other hand, the parser-writer in me wants to spend eternity screaming into the void in pain.

jart · 2024-11-04T02:11:46 1730686306

How are you so awesome?

vidarh · 2024-11-04T14:45:47 1730731547

Thanks. I'm a big fan of your work, so that is appreciated...

mbo · 2024-11-03T07:59:21 1730620761

> Scala

Note about Scala's string interpolation. They can be used as pattern match targets.

  val s"${a} + ${b}" = "1 + 2";
  println(a) // 1
  println(b) // 2

orthoxerox · 2024-11-05T08:04:18 1730793858

One cool feature of C# interpolated strings is that they are lazy. Many loggers used to implement their own interpolation because something like

    log.trace($"Entering iteration {i} for customer {c.ID} [{c.ShortName}]");

in a hot loop would call string.Concat every time it was called before the logger could bail out of the method.

C# lets you declare an overload that accepts a `DefaultInterpolatedStringHandler` (or your own custom implementation of the handler pattern) and this overload will take precedence and allow you to delay the building of the string until after you've checked whether logging it is required.

panzi · 2024-11-02T17:35:26 1730568926

Is this a bash-ism?

    "$x plus $y equals $((x+y))"

jwilk · 2024-11-02T19:01:24 1730574084

No, it's portable shell syntax.

LukeShu · 2024-11-02T20:47:59 1730580479

"$((" arithmetic expansion is POSIX (XCU 2.6.4 "Arithmetic Expansion").

But if I'm not mistaken, it originated in csh.

jonahx · 2024-11-02T18:58:27 1730573907

This works in "sh" as well for me.

panzi · 2024-11-02T19:21:43 1730575303

On some systems (like on mine) sh is just a link to bash, so I couldn't test it.

Izkata · 2024-11-03T02:50:42 1730602242

Isn't bash supposed to act like sh when executed with that name?

saagarjha · 2024-11-03T20:53:14 1730667194

It still has bashisms

susam · 2024-11-02T21:21:55 1730582515

> Is this a bash-ism?

> "$x plus $y equals $((x+y))"

No, it is specified in POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V...

Izkata · 2024-11-03T02:46:44 1730602004

  Make :)        echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"

I'm guessing this is the reason for the :) but to be clear for anyone else: Make is only doing half of the work, whatever comes after "shell" is being passed to another executable, then make captures its stdout and interpolates that. The other executable is "sh" by default but can be changed to whatever.

bastawhiz · 2024-11-03T19:59:24 1730663964

Python f-strings are kind of wild. They can even contain comments! They also have slightly different rules for parsing certain kinds of expressions, like := and lambdas. And until fairly recently, strings inside the expressions couldn't use the quote type of the f-string itself (or backslashes).

thesz · 2024-11-02T21:47:31 1730584051

VHDL

There is a record constructor syntax in VHDL using attribute invocation syntax: RECORD_TYPE'(field1expr, ..., fieldNexpr). This means that if your record has a first field a subtype of a character type, you can get record construction expression like this one: REC'('0',1,"10101").

Good luck distinguishing between '(' as a character literal and "'", "(" and "'0'" at lexical level.

Haskell.

Haskell has context-free syntax for bracketed ("{-" ... "-}") comments. Lexer has to keep bracketed comment syntax balanced (for every "{-" there should be accompanying "-}" somewhere).

sundarurfriend · 2024-11-02T20:58:19 1730581099

> Many more languages support that:

Julia as well:

    Julia    "$x plus $y equals $(x+y)"

cryptonector · 2024-11-03T19:38:54 1730662734

jq: "\("hello" + "world")!!"

I wish PG had dollar-bracket quoting where you have to use the closing bracket to close, that way vim showmatch would work trivially. Something like ${...}$.

1vuio0pswjnm7 · 2024-11-02T22:33:35 1730586815

Shell "$x plus $y equals $((x+y))"

Shell "$x plus $y equals $((expr $x + $y))"

1vuio0pswjnm7 · 2024-11-04T18:42:36 1730745756

Correction: Shell "$x plus $y equals $(expr $x + $y)"

therein · 2024-11-02T19:10:31 1730574631

> PostgreSQL has the very convenient dollar-quoted strings

I did not know that. Today I learned.

skitter · 2024-11-02T12:41:22 1730551282

Another syntax oddity (not mentioned here) that breaks most highlighters: In Java, unicode escapes can be anywhere, not just in strings. For example, the following is a valid class:

    class Foo\u007b}

and this assert will not trigger:

    assert
        // String literals can have unicode escapes like \u000A!
        "Hello World".equals("\u00E4");

mistercow · 2024-11-02T15:02:20 1730559740

I also argue that failing to syntax highlight this correctly is a security issue. You can terminate block comments with Unicode escapes, so if you wanted to hide some malicious code in a Java source file, you just need an excuse for there to be a block of Unicode escapes in a comment. A dev who doesn’t know about this quirk is likely to just skip over it, assuming it’s commented out.

styglian · 2024-11-03T15:17:48 1730647068

I once wrote a puzzle using this, which (fortunately) doesn't work any more, but would do interesting things on older JDK versions: https://pastebin.com/raw/Bh81PwXY

ivanjermakov · 2024-11-02T13:59:22 1730555962

I have never seen this in Java! Is there any use cases where it could be useful?

susam · 2024-11-02T14:14:56 1730556896

I don't know about usefulness but it does let us write identifiers using Unicode characters. For example:

  public class Foo {
      public static void main(String[] args) {
          double \u03c0 = 3.14159265;
          System.out.println("\u03c0 = " + \u03c0);
      }
  }

Output:

  $ javac Foo.java && java Foo
  π = 3.14159265

Of course, nowadays we can simply write this with any decent editor:

  public class Foo {
      public static void main(String[] args) {
          double π = 3.14159265;
          System.out.println("π = " + π);
      }
  }

Support for Unicode escape sequences is a result of how the Java Language Specification (JLS) defines InputCharacter. Quoting from Section 3.4 of JLS <https://docs.oracle.com/javase/specs/jls/se23/jls23.pdf>:

  InputCharacter:
    UnicodeInputCharacter but not CR or LF

UnicodeInputCharacter is defined as the following in section 3.3:

  UnicodeInputCharacter:
    UnicodeEscape
    RawInputCharacter

  UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

  UnicodeMarker:
    u {u}

  HexDigit:
    (one of)
    0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

  RawInputCharacter:
    any Unicode character

As a result the lexical analyser honours Unicode escape sequences absolutely anywhere in the program text. For example, this is a valid Java program:

  public class Bar {
      public static void \u006d\u0061\u0069\u006e(String[] args) {
          System.out.println("hello, world");
      }
  }

Here is the output:

  $ javac Bar.java && java Bar
  hello, world

However, this is an incorrect Java program:

  public class Baz {
      // This comment contains \u6d.
      public static void main(String[] args) {
          System.out.println("hello, world");
      }
  }

Here is the error:

  $ javac Baz.java
  Baz.java:2: error: illegal unicode escape
      // This comment contains \u6d.
                                   ^
  1 error

Yes, this is an error even if the illegal Unicode escape sequence occurs in a comment!

ivanjermakov · 2024-11-02T16:47:06 1730566026

I wonder if full unicode range was accepted because some companies are writing code in non-english.

layer8 · 2024-11-02T18:27:16 1730572036

Javac uses the platform encoding [0] by default to interpret Java source files. This means that Java source code files are inherently non-portable. When Java was first developed (and for a long time after), this was the default situation for any kind of plain text files. The escape sequence syntax allows to transform [1] Java source code into a portable (that is, ASCII-only) representation that is completely equivalent to the original, and also to convert it back to any platform encoding.

Source control clients could apply this automatically upon checkin/checkout, so that clients with different platform encodings can work together. Alternatively, IDEs could do this when saving/loading Java source files. That never quite caught on, and the general advice was to stick to ASCII, at least outside comments.

[0] Since JDK 18, the default encoding defaults to UTF-8. This probably also extends to javac, though I haven’t verified it.

[1] https://docs.oracle.com/javase/8/docs/technotes/tools/window...

pwdisswordfishz · 2024-11-02T12:57:24 1730552244

> Of all the languages, I've saved the best for last, which is Ruby. Now here's a language whose syntax evades all attempts at understanding.

TeX with its arbitrarily reprogrammable lexer: how adorable

fanf2 · 2024-11-02T15:12:57 1730560377

Lisp reader macros allow you to program its lexer too.

skydhash · 2024-11-02T18:04:06 1730570646

You can basically define a new language with a few lines of code in Racket.

susam · 2024-11-02T13:57:17 1730555837

> Every C programmers (sic) knows you can't embed a multi-line comment in a multi-line comment.

And every Standard ML programmer might find this to be a surprising limitation. The following is a valid Standard ML program:

  (* (* Nested (**) *) comment *)
  val _ = print "hello, world\n"

Here is the output:

  $ sml < hello.sml           
  Standard ML of New Jersey (64-bit) v110.99.5 [built: Thu Mar 14 17:56:03 2024]
  - = hello, world

  $ mlton hello.sml && ./hello
  hello, world

Given how C was considered one of the "expressive" languages when it arrived, it's curious that nested comments were never part of the language.

dahart · 2024-11-02T14:12:07 1730556727

There are 3 things I find funny about that comment: ML didn’t have single-line comments, so same level of surprising limitation. I’ve never heard someone refer to C as “expressive”, but maybe it was in 1972 when compared to assembly. And what bearing does the comment syntax have on the expressiveness of a language? I would argue absolutely none at all, by definition. :P

susam · 2024-11-02T15:24:47 1730561087

> ML didn’t have single-line comments, so same level of surprising limitation.

It is not quite clear to me why the lack of single-line comments is such a surprising limitation. After all, a single-line block comment can easily serve as a substitute. However, there is no straightforward workaround for the lack of nested block comments.

> I’ve never heard someone refer to C as “expressive”, but maybe it was in 1972 when compared to assembly.

I was thinking of Fortran in this context. For instance, Fortran 77 lacked function pointers and offered a limited set of control flow structures, along with cumbersome support for recursion. I know Fortran, with its native support for multidimensional arrays, excelled in numerical and scientific computing but C quickly became the preferred language for general purpose computing.

While very few today would consider C a pinnacle of expressiveness, when I was learning C, the landscape of mainstream programming languages was much more restricted. In fact, the preface to the first edition of K&R notes the following:

"In our experience, C has proven to be a pleasant, expressive and versatile language for a wide variety of programs."

C, Pascal, etc. stood out as some of the few mainstream programming languages that offered a reasonable level of expressiveness. Of course, Lisp was exceptionally expressive in its own right, but it wasn't always the best fit for certain applications or environments.

> And what bearing does the comment syntax have on the expressiveness of a language?

Nothing at all. I agree. The expressiveness of C comes from its grammar, which the language parser handles. Support for nested comments, in the context of C, is a concern for the lexer, so indeed one does not directly influence the other. However, it is still curious that a language with such a sophisticated grammar and parser could not allocate a bit of its complexity budget to support nested comments in its lexer. This is a trivial matter, I know, but I still couldn't help but wonder about it.

pklausler · 2024-11-02T17:08:00 1730567280

> Fortran 77 lacked function pointers

But we did have dummy procedures, which covered one of the important use cases directly, and which could be abused to fake function/subroutine pointers stored in data.

dahart · 2024-11-02T15:40:15 1730562015

Fair enough. From my perspective, lack of single line comments is a little surprising because most other languages had it at the time (1973, when ML was introduced). Lack of nested comments doesn’t seem surprising, because it isn’t an important feature for a language, and because most other languages did not have it at the time (1972, when C was introduced).

I can imagine both pro and con arguments for supporting nested comments, but regardless of what I think, C certainly could have added support for nested comments at any time, and hasn’t, which suggests that there isn’t sufficient need for it. That might be the entire explanation: not even worth a little complexity.

susam · 2024-11-02T20:32:51 1730579571

> C certainly could have added support for nested comments at any time

After C89 was ratified, adding nested comments to C would have risked breaking existing code. For instance, this is a valid program in C89:

  #include <stdio.h>

  int main() {
      /* /* Comment */
      printf("hello */ world");
      return 0;
  }

However, if a later C standard were to introduce nested comments, it would break the above program because then the following part of the program would be recognised as a comment:

      /* /* Comment */
      printf("hello */

The above text would be ignored. Then the compiler would encounter the following:

      world");

This would lead to errors like undeclared identifier 'world', missing terminating " character, etc.

dahart · 2024-11-03T15:09:05 1730646545

Given the neighboring thread where I just learned that the lexer runs before the preprocessor, I’m not sure that would be the outcome. There’s no reason to assume the comment terminator wouldn’t be ignored in strings. And even today, you can safely write printf(“hello // world\n”); without risking a compile error, right?

susam · 2024-11-03T18:14:45 1730657685

> Given the neighboring thread where I just learned that the lexer runs before the preprocessor, I’m not sure that would be the outcome.

That is precisely why nested comments would end up breaking the C89 code example I provided above. I elaborate this further below.

> There’s no reason to assume the comment terminator wouldn’t be ignored in strings.

There is no notion of "comment terminator in strings" in C. At any point of time, the lexer is reading either a string or a comment but never one within the other. For example, in C89, C99, etc., this is an invalid C program too:

  #include <stdio.h>

  int main() {
      /* Comment
      printf("hello */ world");
      return 0;
  }

In this case, we wouldn't say that the lexer is "honoring the comment terminator in a string" because, at the point the comment terminator '*/' is read, there is no active string. There is only a comment that looks like this:

      /* Comment
      printf("hello */

The double quotation mark within the comment is immaterial. It is simply part of the comment. Once the lexer has read the opening '/*', it looks for the terminating '*/'. This behaviour would hold even if future C standards were to allow nested comments, which is why nested comments would break the C89 example I mentioned in my earlier HN comment.

> And even today, you can safely write printf("hello // world\n"); without risking a compile error, right?

Right. But it is not clear what this has got to do with my concern that nested comments would break valid C89 programs. In this printf() example, we only have an ordinary string, so obviously this compiles fine. Once the lexer has read the opening quotation mark as the beginning of a string, it looks for an unescaped terminating quotation mark. So clearly, everything until the unescaped terminating quotation mark is a string!

masfuerte · 2024-11-02T16:22:52 1730564572

AFAIK, C didn't get single line comments until C99. They were a C++ feature originally.

dahart · 2024-11-02T16:33:10 1730565190

Oh wow, I didn’t remember that, and I did start writing C before 99. I stand corrected. I guess that is a little surprising. ;)

Is true that many languages had single line comments? Maybe I’m forgetting more, but I remember everything else having single line comments… asm, basic, shell. I used Pascal in the 80s and apparently forgot it didn’t have line comments either?

quietbritishjim · 2024-11-03T10:46:19 1730630779

Some C compilers supported it as an unofficial extension well before C99, so that could be why you didn't realise or don't remember. I think that included both Visual Studio (which was really a C++ compiler that could turn off the C++ bits) and GCC with GNU extensions enabled.

masfuerte · 2024-11-02T16:42:58 1730565778

That's my recollection, that most languages had single line comments. Some had multi-line comments but C++ is the first I remember having syntaxes for both. That said, I'm not terribly familiar with pre-80s stuff.

michaelcampbell · 2024-11-03T14:54:35 1730645675

I was barely too young for this to make much of an impact at the time, (but older than many, perhaps most, here), I understand why C was considered a "high level language", but it still hits me weird, given today's context.

layer8 · 2024-11-02T17:24:06 1730568246

Lexing nested comments requires maintaining a stack (or at least a nesting-level counter). That wasn’t traditionally seen as being within the realm of lexical analysis, which would only use a finite-state automaton, like regular expressions.

gsliepen · 2024-11-02T15:10:46 1730560246

Well there is one way to nest comments in C, and that's by using #if 0:

  #if 0
  This is a
  #if 0
  nested comment!
  #endif
  #endif

fanf2 · 2024-11-02T15:51:18 1730562678

Except that text inside #if 0 still has to lex correctly.

(unifdef has some evil code to support using C-style preprocessor directives with non-C source, which mostly boils down to ignoring comments. I don’t recommend it!)

dahart · 2024-11-02T16:40:24 1730565624

> Except that text inside #if 0 still has to lex correctly.

Are you sure? I just tried on godbolt and that’s not true with gcc 14.2. I’ve definitely put syntax errors intentionally into #if 0 blocks and had it compile. Are you thinking of some older version or something? I thought the pre-processor ran before the lexer since always…

fanf2 · 2024-11-02T17:32:20 1730568740

There are three (relevant) phases (see “translation phases” in section 5 of the standard):

• program is lexed into preprocessing tokens; comments turn into whitespace

• preprocessor does its thing

• preprocessor tokens are turned into proper tokens; different kinds of number are disambiguated; keywords and identifiers are disambiguated

If you put an unclosed comment inside #if 0 then it won’t work as you might expect.

dahart · 2024-11-02T18:24:37 1730571877

Ah, I see. You’re right!

akira2501 · 2024-11-03T21:09:20 1730668160

Pascal always supported the same nested comment syntax as your example.

kragen · 2024-11-02T16:52:41 1730566361

This is not just true of Standard ML; it's also true of regular ML.

irdc · 2024-11-02T13:03:32 1730552612

I’d be interested to see a re-usable implementation of joe's[0] syntax highlighting.[1] The format is powerful enough to allow for the proper highlighting of Python f-strings.[2]

0. https://joe-editor.sf.net/

1. https://github.com/cmur2/joe-syntax/blob/joe-4.4/misc/HowItW...

2. https://gist.github.com/irdc/6188f11b1e699d615ce2520f03f1d0d...

akira2501 · 2024-11-03T21:12:29 1730668349

I've actually made several lexers and parsers based on the joe DFA style of parsing. The state and transition syntax was something that I always understood much more easily than the standard tools.

The downside is your rulesets tend to get more verbose and are a little bit harder to structure than they might ideally be in other languages more suited towards the purpose, but I actually think that's an advantage, as it's much easier to reason about every production rule when looking at the code.

pama · 2024-11-02T13:33:56 1730554436

Interestingly, python f-strings changed their syntax at version 3.12, so highlighting should depend on the version.

irdc · 2024-11-02T13:42:43 1730554963

It’s just that nesting them arbitrarily is now allowed, right? That shouldn’t matter much for a mere syntax highlighter then. And one could even argue that code that relies on this too much is not really for human consumption.

pansa2 · 2024-11-02T13:55:36 1730555736

Also, you can now use the same quote character that encloses an f-string within the {} expressions. That could make them harder to tokenize, because it makes it harder to recognise the end of the string.

murkt · 2024-11-02T19:05:08 1730574308

Author hasn’t tried to highlight TeX. Which is good for their mental health, I suppose, as it’s generally impossible to fully highlight TeX without interpreting it.

Even parsing is not enough, as it’s possible to redefine what each character does. You can make it do things like “and now K means { and C means }”.

Yes, you can find papers on arXiv that use this god-forsaken feature.

jart · 2024-11-02T19:10:15 1730574615

I wrote https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafil... and it does a reasonable job highlighting without breaking for all the .tex files I could find on my hard drive. My goal is to hopefully cover 99.9% of real world usage, since that'll likely cover everything an LLM might output. Esoteric syntax also usually isn't a problem, so long as it doesn't cause strings and comments to extend forever, eclipsing the rest of the source code in a file.

murkt · 2024-11-03T04:50:32 1730609432

Yes, when goal isn’t to support 100% of all the weird stuff, then it’s orders of magnitude easier!

bobbylarrybobby · 2024-11-02T21:38:44 1730583524

I couldn't believe it when I learned that \makeatletter does not “make (something) at a letter (character)” but rather “treats the '@' character as a letter when parsing”.

nathell · 2024-11-02T19:47:39 1730576859

Same with Common Lisp (you can redefine the read table), although that’s likely abused less often on arXiv.

kazinator · 2024-11-02T13:56:59 1730555819

I don't think it's easy to write a good syntax coloring engine like the one in Vim.

Syntax coloring has to handle context: different rules for material nested in certain ways.

Vim's syntax higlighter lets you declare two kinds of items: matches and regions. Matches are simpler lexical rules, whereas regions have separate expressions for matching the start and end and middle. There are ways to exclude leading and trailing material from a region.

Matches and regions can declare that they are contained. In that case they are not active unless they occur in a containing region.

Contained matches declare which regions contain them.

Regions declare which other regions they contain.

That's the basic semantic architecture; there are bells and whistles in the system due to situations that arise.

I don't think even Justine could develop that in an interview, other than as an overnight take home.

kazinator · 2024-11-02T14:06:54 1730556414

Here is an example of something hard to handle: TXR language with embedded TXR Lisp.

This is the "genman" script which takes the raw output of a manpage to HTML converter, and massages it to form the HTML version of the TXR manual:

https://www.kylheku.com/cgit/txr/tree/genman.txr

Everything that is white (not colored) is literal template material. Lisp code is embedded in directives, like @(do ...). In this scheme, TXR keywords appear purple, TXR Lisp ones green. They can be the same; see the (and ...) in line 149, versus numerous occurrences of @(and).

Quasistrings contain nested syntax: see 130 where `<a href ..> ... </a>` contains an embedded (if ...). That could itself contain a quasistring with more embedded code.

TXR's txr.vim" and tl.vim* syntax definition files are both generated by this:

https://www.kylheku.com/cgit/txr/tree/genvim.txr

saghm · 2024-11-02T21:08:21 1730581701

Naively, I would have assumed that the "correct" way to write a syntax highlighter would be to parse into an AST and then iterate over the tokens and update the color of a token based on the type of node (and maybe just tracking a diff to avoid needing to recolor things that haven't changed). I'm guessing that if this isn't done, it's for efficiency reasons (e.g. due to requiring parsing the whole file to highlight rather than just the part currently visible on the screen)?

Someone · 2024-11-02T22:12:37 1730585557

> I would have assumed that the "correct" way to write a syntax highlighter would be to parse into an AST and then […] I'm guessing that if this isn't done, it's for efficiency reasons

It’s not only running time, but also ease of implementation.

A good syntax highlighter should do a decent job highlighting both valid and invalid programs (rationale: in most (editor, language) pairs, writing a program involves going through moments where the program being written isn’t a valid program)

If you decide to use an AST, that means you need to have good heuristics for turning invalid programs into valid ones that best mimic what the programmer intended. That can be difficult to achieve (good compilers have such heuristics, but even if you have such a compiler, chances are it isn’t possible to reuse them for syntax coloring)

If this simpler approach gives you most of what you can get with the AST approach, why bother writing that?

Also, there are languages where some programs can’t be perfectly parsed or syntax colored without running them. For those, you need this approach.

tomcam · 2024-11-03T00:13:36 1730592816

> I don't think even Justine could develop that in an interview

Not so sure I’d put money on that opinion ;)

rererereferred · 2024-11-02T13:07:06 1730552826

In the C# multiquoted strings, how does it know this:

   Console.WriteLine("""""");
   Console.WriteLine("""""");

Are 2 triplequoted empty strings and not one "\nConsole.WriteLine(" sixtuplequoted string?

yen223 · 2024-11-02T13:29:04 1730554144

It's a syntax error!

  Unterminated raw string literal.

https://replit.com/@Wei-YenYen/DistantAdmirableCareware#main...

Joker_vD · 2024-11-02T13:44:50 1730555090

Ah, so there is no backtracking in lexer for this case. Makes sense.

Joker_vD · 2024-11-02T13:23:56 1730553836

If the opening quotes are followed by anything that is not a whitespace before the next new-line (or EOF), then it's a single-line string.

I imagine implementing those things took several iterations :)

ygra · 2024-11-02T13:18:54 1730553534

The former, I'd say.

https://learn.microsoft.com/en-us/dotnet/csharp/programming-...

For a multi-line string the quotes have to be on their own line.

pansa2 · 2024-11-02T13:02:18 1730552538

> TypeScript, Swift, Kotlin, and Scala take string interpolation to the furthest extreme of encouraging actual code being embedded inside strings. So to highlight a string, one must count curly brackets and maintain a stack of parser states.

Presumably this is also true in Python - IIRC the brace-delimited fields within f-strings may contain arbitrary expressions.

More generally, this must mean that the lexical grammar of those languages isn't regular. "Maintaining a stack" isn't part of a finite-state machine for a regular grammar - instead we're in the realm of pushdown automata and context-free grammars.

Is it even possible to support generalized string interpolation within a strictly regular lexical grammar?

fanf2 · 2024-11-02T15:58:36 1730563116

Complicated interpolation can be lexed as a regular language if you treat strings as three separate lexical things, eg in JavaScript template literals there are,

    `stuff${
    }stuff${
    }stuff`

so the ${ and } are extra closing and opening string delimiters, leaving the nesting to be handled by the parser.

You need a lexer hack so that the lexer does not treat } as the start of a string literal, except when the parser is inside an interpolation but all nested {} have been closed.

aphantastic · 2024-11-02T14:31:24 1730557884

> Is it even possible to support generalized string interpolation within a strictly regular lexical grammar?

Almost certainly not, a fun exercise is to attempt to devise a Pumping tactic for your proposed language. If it doesn’t exist, it’s not regular.

https://en.m.wikipedia.org/wiki/Pumping_lemma_for_regular_la...

yen223 · 2024-11-02T12:52:28 1730551948

  select'select'select

is a perfectly valid SQL query, at least for Postgres.

Languages' approach to whitespace between tokens is all over the place

layer8 · 2024-11-02T17:43:13 1730569393

The author may have missed that lexing C is actually context-sensitive, i.e. you need a symbol table: https://en.wikipedia.org/wiki/Lexer_hack

Of course, for syntax highlighting this is only relevant if you want to highlight the multiplication operator differently from the dereferencing operator, or declarations differently from expressions.

More generally, however, I find it useful to highlight (say) types differently from variables or functions, which in some (most?) popular languages requires full parsing and symbol table information. Some IDEs therefore implement two levels of syntax highlighting, a basic one that only requires lexical information, and an extended one that kicks in when full grammar and type information becomes available.

teo_zero · 2024-11-03T08:07:57 1730621277

> this is only relevant if you want to highlight the multiplication operator differently from the dereferencing operator

Can you mention one editor which does that?

mdaniel · 2024-11-03T18:08:04 1730657284

I could be stretching the definition of "does" but the newfound(?) tree-sitter support in Emacs[1] I believe would allow that since it for sure understands the distinction but I don't possess enough font-lock ninjary to actually, for real, bind a different color to the distinct usages

  /* given foo.c */
  int main() {
    int a, *b;
    a = 5 * 10;
    b = &a;
    printf("a is %d\n", *b);
  }

and then M-x c-ts-mode followed by navigating to each * and invoking M-x treesit-inspect-node-at-point in turn produces, respectively:

  (declaration declarator: (pointer_declarator "*"))

  right: (binary_expression operator: "*")

  arguments: (argument_list (pointer_expression operator: "*"))

1: https://www.emacswiki.org/emacs/Tree-sitter

teo_zero · 2024-11-03T21:13:38 1730668418

These examples are unambiguous. Try with something more spicy like

  return (A)*(B);

which depends on A being a type or a variable.

quietbritishjim · 2024-11-03T10:50:55 1730631055

I don't think they implied there is. The sentence you quoted is essentially "this is relevant for their article about weird lexical syntax, but (almost definitely) not relevant to their original problem of syntax highlighting".

dummy7777 · 2024-11-03T18:08:43 1730657323

alekratz · 2024-11-03T01:56:27 1730598987

I don't think the lexer hack is relevant in this instance. The lexer hack just refers to the ambiguity of `A * B` and whether that should be parsed as a variable declaration or an expression. If you're building a syntax tree, then this matters, but AFAICT all the author needs is a sequence of tokens and not a syntax tree. Maybe "parser hack" would be a better name for it.

legobmw99 · 2024-11-02T18:11:52 1730571112

I’d be shocked if jart didn’t know this, but it seems unlikely that an LLM would generate one of these most vexing parses, unless explicitly asked

quietbritishjim · 2024-11-02T19:34:15 1730576055

I think you're thinking of something different to the issue in the parent comment. The most vexing parse is, as the name suggests, a problem at the parsing stage rather than the earlier lexing phase. Unlike the referenced lexing problem, it does't require any hack for compilers to deal with it. That's because it's not really a problem for the compiler; it's humans that find it surprising.

layer8 · 2024-11-02T18:42:00 1730572920

Given all the things that were new to the author in the article, I wouldn’t be shocked at all. There’s just a huge number of things to know, or to have come across.

jraph · 2024-11-03T04:21:25 1730607685

Justine is proficient in C, she is the author of a libc (cosmopolitan) among other things, like Actually Portable Executables [1].

I would expect her to know C quite well, and that's probably an understatement.

[1] https://justine.lol/ape.html

mcphage · 2024-11-02T12:41:52 1730551312

At one point there was an open source project to formally specify Ruby, but I don’t know if it’s still alive: https://github.com/ruby/spec

Hmm, it seems to be alive, but based more on behavior than syntax.

playingalong · 2024-11-02T14:04:41 1730556281

Nice read.

I guess the article could be called Falsehoods Programmers Assume of Programming Language Syntaxes.

SonOfLilit · 2024-11-02T14:30:36 1730557836

Justine gets very close to the hairiest parsing issue in any language without encountering it:

Perl's syntax is undecidable, because the difference between treating some characters as a comment or as a regex can depend on the type of a variable that is only determined e.g. based on whether a search for a Collatz counterexample terminates, or just, you know, user input.

https://perlmonks.org/?node_id=663393

C++ templates have a similar issue, I think.

fanf2 · 2024-11-02T15:44:40 1730562280

I think possibly the most hilariously complicated instance of this is in perl’s tokenizer, toke.c (which starts with a Tolkien quote, 'It all comes from here, the stench and the peril.' — Frodo).

There’s a function called intuit_more which works out if $var[stuff] inside a regex is a variable interpolation followed by a character class, or an array element interpolation. Its result can depend on whether something in the stuff has been declared as a variable or not.

But even if you ignore the undecidability, the rest is still ridiculously complicated.

https://github.com/Perl/perl5/blob/blead/toke.c#L4502

ufo · 2024-11-02T23:41:34 1730590894

Wow. I wonder how that function came to be in the first place. Surely it couldn't have started out that complicated?

swolchok · 2024-11-02T15:45:03 1730562303

> C++ templates have a similar issue

TIL! I went and dug up a citation: https://blog.reverberate.org/2013/08/parsing-c-is-literally-...

chubot · 2024-11-03T03:27:34 1730604454

Yup, bash and GNU Make have the same issue as Perl does, and I mention the C++ issue here too:

Parsing Bash is Undecidable - https://www.oilshell.org/blog/2016/10/20.html

I remember a talk from Larry Wall on Perl 6 (now Raku), where he says this type of thing is a mistake. Raku can be statically parsed, as far as I know.

jwilk · 2024-11-03T07:18:25 1730618305

Parsing POSIX shell in undecidable too:

https://news.ycombinator.com/item?id=30362718

chubot · 2024-11-03T14:45:49 1730645149

Yes, good point -- aliases makes parse time depend on runtime. That is mentioned in

Morbig: A static parser for POSIX shell - https://scholar.google.com/scholar?cluster=15754961728999604...

(at the time I wrote the post about bash, I hadn't implemented aliases yet)

But it's a little different since it is an intentional feature, not an acccident. It's designed to literally reinvoke the parser at runtime. I think it's not that good/useful a feature, and I tend to avoid it, but many people use it.

layer8 · 2024-11-02T18:36:02 1730572562

How could a search for a Collatz counterexample possibly terminate? ;)

lupire · 2024-11-02T13:57:47 1730555867

> You'll notice its hash function only needs to consider a single character in in a string. That's what makes it perfect,

Is that a joke?

https://en.m.wikipedia.org/wiki/Perfect_hash_function

jaen · 2024-11-03T14:41:41 1730644901

No. Taking the value of a single character is a correct perfect hash function, assuming there exists a position for the input string set where all characters differ.

anitil · 2024-11-04T03:36:16 1730691376

It just so happens that a single character was all that's needed in that case. In the general case it wouldn't be

jim_lawless · 2024-11-02T14:21:55 1730557315

Forth has a default syntax, but Forth code can execute during the compilation process allowing it to accept/compile custom syntaxes.

metadat · 2024-11-02T15:40:13 1730562013

> The languages I decided to support are Ada, Assembly, BASIC, C, C#, C++, COBOL, CSS, D, FORTH, FORTRAN, Go, Haskell, HTML, Java, JavaScript, Julia, JSON, Kotlin, ld, LISP, Lua, m4, Make, Markdown, MATLAB, Pascal, Perl, PHP, Python, R, Ruby, Rust, Scala, Shell, SQL, Swift, Tcl, TeX, TXT, TypeScript, and Zig.

A few (admittedly silly) questions about the list:

1. Why no Erlang, Elixir, or Crystal?

Erlang appears to be just at the author's boundary at #47 on the TIOBE index. https://www.tiobe.com/tiobe-index/

2. What is "Shell"? Sh, Bash, Zsh, Windows Cmd, PowerShell..?

3. Perl but no Awk? Curious why, because Awk is a similar but comparatively trivial language. Widely used, too.

To be fair, Awk, Erlang, and Elixir rank low on popularity. Yet m4, Tcl, TeX, and Zig aren't registered in the top 50 at all.

What's the methodology / criteria? Only things the author is already familiar with?

Still a fun article.

Yasuraka · 2024-11-02T17:32:17 1730568737

Tiobes's index is quite literally worthless, especially with regards to its stated purpose, let alone as a general point of orientation.

I'd wish that purple would stop lending it any credibility.

Kwpolska · 2024-11-03T11:18:57 1730632737

"Shell" in the context of a syntax highlighting language picker almost always means a Unixy shell, most likely something along the lines of Bash.

skrebbel · 2024-11-02T12:18:38 1730549918

This was a delightful read, thanks!

notsylver · 2024-11-02T12:57:08 1730552228

As soon as I saw this was part of llamafile I was hoping that it would be used to limit LLM output to always be "valid" code as soon as it saw the backticks, but I suppose most LLMs don't have problems with that anyway. And I'm not sure you'd want something like that automatically forcing valid code anyway

dilap · 2024-11-02T13:26:36 1730553996

llama.cpp does support something like this -- you can give it a grammar which restricts the set of available next tokens that are sampled over

so in theory you could notice "```python" or whatever and then start restricting to valid python code. (in least in theory, not sure how feasible/possible it would be in practice w/ their grammar format.)

for code i'm not sure how useful it would be since likely any model that is giving you working code wouldn't be struggling w/ syntax errors anyway?

but i have had success experimentally using the feature to drive fiction content for a game from a smaller llm to be in a very specific format.

notsylver · 2024-11-02T15:42:45 1730562165

yeah, ive used llama.cpp grammars before, which is why i was thinking about it. i just think it'd be cool for llamafile to do basically that, but with included defaults so you could eg, require JSON output. it could be cool for prototyping or something. but i dont think that would be too useful anyway, most of the time i think you would want to restrict it to a specific schema, so i can only see it being useful for something like a tiny local LLM for code completion, but that would just encourage valid-looking but incorrect code.

i think i just like the idea of restricting LLM output, it has a lot of interesting use cases

dilap · 2024-11-02T16:08:16 1730563696

gotchya. i do think that is a cool idea actually -- LLMs tiny enough to do useful things with formally structured output but not big enough to nail the structure ~100% is probably not an empty set.

IshKebab · 2024-11-02T14:21:50 1730557310

I don't understand why you wouldn't use Tree Sitter's syntax highlighting for this. I mean it's not going to be as fast but that clearly isn't an issue here.

Is this a "no third party dependencies" thing?

jart · 2024-11-02T17:28:51 1730568531

I don't want to require everyone who builds llamafile from source need to install rust. I don't even require that people install the gperf command, since I can build gperf as a 700kb actually portable executable and vendor it in the repo. Tree sitter I'd imagine does a really great highly precise job with the languages it supports. However it appears to support fewer of them than I am currently. I'm taking a breadth first approach to syntax highlighting, due to the enormity of languages LLMs understand.

IshKebab · 2024-11-02T19:41:10 1730576470

I think the Rust component of tree-sitter-highlight is actually pretty small (Tree Sitter generates C for the actual parser).

But fair enough - fewer dependencies is always nice, especially in C++ (which doesn't have a modern package manager) and in ML where an enormous janky Python installation is apparently a perfectly normal thing to require.

mdaniel · 2024-11-02T22:15:54 1730585754

I somehow thought Conan[1] was the C++ package manager; it's at least partially supported by GitLab, for what that's worth

1: https://docs.conan.io/2/introduction.html

IshKebab · 2024-11-02T22:46:39 1730587599

No, if anything vcpkg is "the C++ package manager", but it's nowhere near pervasive and easy-to-use enough to come close to even Pip. It's leagues away from Cargo, Go, and other actually good PL package managers.

mdaniel · 2024-11-03T00:03:26 1730592206

I knew that Microsoft used that on Windows but had no idea it was multi-platform: https://github.com/microsoft/vcpkg/releases/tag/2024.10.21 (MIT, like a lot of their stuff)

Microsoft is such an odd duck, sometimes, but I'm glad to take advantage of their "good years" while it lasts

chubot · 2024-11-03T03:31:23 1730604683

Have you developed against TreeSitter? Some feedback from people who use it here - https://news.ycombinator.com/item?id=39783471

And here - https://lobste.rs/s/9huy81/tbsp_tree_based_source_processing...

IshKebab · 2024-11-03T12:51:46 1730638306

Yes I have, and it worked very well for what I was using it for (assembly language LSP server). I didn't run into any of the issues they mentioned (not saying they don't exist though).

For new projects I use Chumsky. It's a pure Rust parser which is nice because it means you avoid the generated C, and it also gives you a fully parsed and natively typed output, rather than Tree Sitter's dynamically typed tree of nodes, which means there's no extra parsing step to do.

The main downside is it's more complicated to write the parser (some fairly extreme types). The API isn't stable yet either. But overall I like it more than Tree Sitter.

petesergeant · 2024-11-02T15:02:32 1730559752

> Perl also has this goofy convention for writing man pages in your source code

The world corpus of software would be much better documented if everywhere else had stolen this from Perl. Inline POD is great.

kragen · 2024-11-02T16:55:31 1730566531

Perl and Python stole it from Emacs Lisp, though Perl took it further. I'm not sure where Java stole it from, but nowadays Doxygen is pretty common for C code. Unfortunately this results in people thinking that Javadoc and Doxygen are substitutes for actual documentation like the Emacs Lisp Reference Manual, which cannot be generated from docstrings, because the organization of the source code is hopelessly inadequate for a reference manual.

mdaniel · 2024-11-02T22:12:48 1730585568

> Emacs Lisp Reference Manual, which cannot be generated from docstrings, because the organization of the source code is hopelessly inadequate for a reference manual.

Well, they're not doing themselves any favors by just willy nilly mixing C with "user-facing" defuns <https://emba.gnu.org/emacs/emacs/-/blob/ed1d691184df4b50da6b...>. I was curious if they could benefit from "literate programming" since OrgMode is the bee's knees but not with that style coding they can't

kragen · 2024-11-03T00:47:54 1730594874

I didn't mean that specifically the Emacs source code was not organized in the right way for a reference manual. I meant that C and Java source code in general isn't. And C++, which is actually where people use Doxygen more.

The Python standard library manual is also exemplary, and also necessarily organized differently from the source code.

mdaniel · 2024-11-03T18:18:19 1730657899

> The Python standard library manual is also exemplary

Maybe parts of it are, but as a concrete example https://docs.python.org/3/library/re.html#re.match is just some YOLO about what, specifically, is the first argument to re.match: string, or compiled expression? Well, it's both! Huzzah! I guess they get points for consistency because the first argument to re.compile is also "both"

But, any idea what type re.compile returns? cause https://docs.python.org/3/library/re.html#re.compile is all "don't you worry about it" versus its re.match friend who goes out of their way to state that it is an re.Match object

Would it have been so hard to actually state it, versus requiring someone to invoke type() to get <class 're.Pattern'>?

kragen · 2024-11-03T19:01:11 1730660471

I'm surprised to see that it's allowed to pass a compiled expression to re.match, since the regular expression object has a .match method of its own. To me the fact that the argument is called pattern implies that it's a string, because at the beginning of that chapter, it says, "Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). (...) Usually patterns will be expressed in Python code using this raw string notation."

But this ability to pass a compiled regexp rather than a string can't have been an accidental feature, so I don't know why it isn't documented.

Probably it would be good to have an example of invoking re.match with a literal string in the documentation item for re.match that you linked. There are sixteen such examples in the chapter, the first being re.match(r"(\w+) (\w+)", "Isaac Newton, physicist"), so you aren't going to be able to read much of the chapter without figuring out that you can pass a string there, but all sixteen of them come after that section. A useful example might be:

    >>> [s for s in ["", " ", "a ", " a", "aa"] if re.match(r'\w', s)]
    ['a ', 'aa']

It's easy to make manuals worse by adding too much text to them, but in this case I think a small example like that would be an improvement.

As for what type re.compile returns, the section you linked to says, "Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below." Is your criticism that it doesn't explicitly say that the regular expression object is returned (as opposed to, I suppose, stored in a table somewhere), or that it says "a regular expression object" instead of saying "an re.Pattern object"? Because the words "regular expression object" are a link to the "Regular Expression Objects" section, which begins by saying, "class re.Pattern: Compiled regular expression object returned by re.compile()." To me the name of the class doesn't seem like it adds much value here—to write programs that work using the re module, you don't need to know the name of the class the regular expression objects belong to, just what interface they support.

(It's unfortunate that the class name is documented, because it would be better to rename it to a term that wasn't already defined to mean "a string that can be compiled to a regular expression object"!)

But possibly I've been using the re module long enough that I'm blind to the deficiencies in its documentation?

Anyway, I think documentation questions like this, about gradual introduction, forward references, sequencing, publicly documented (and thus stable) versus internal-only names, etc., are hard to reconcile with the constraints of source code, impossible in most languages. In this case the source code is divided between Python and C, adding difficulty.

tomcam · 2024-11-03T00:11:10 1730592670

> If you ever want to confuse your coworkers, then one great way to abuse this syntax is by replacing the heredoc marker with an empty string

Maybe I am in favor of the death penalty after all

sundarurfriend · 2024-11-02T20:56:46 1730581006

The final line number count is missing Julia. Based on the file in the repo, it would be at the bottom of the first column: between ld and R.

Among the niceties listed here, the one I'd wish for Julia to have would be C#'s "However many quotes you put on the lefthand side, that's what'll be used to terminate the string at the other end". Documentation that talks about quoting would be so much easier to read (in source form) with something like that.

sundarurfriend · 2024-11-03T01:48:19 1730598499

One nicety that Julia does have that I didn't know about (or had forgotten) is nested multi-line comments.

    #= this one
       has a #= nested
       comment =# inside of it
       and that works fine! =#

KolmogorovComp · 2024-11-04T09:40:35 1730713235

The article doesn’t address why they chose not use treesiter and instead roll their own syntax highlight system.

Edit: same question with answer here https://news.ycombinator.com/item?id=42026554