Hacker News new | past | comments | ask | show | jobs | submit login
Wuffs the Language (github.com/google)
187 points by bshanks 9 months ago | hide | past | favorite | 75 comments



> There is no operator precedence. A bare a * b + c is an invalid expression. You must explicitly write either (a * b) + c or a * (b + c).

Honestly I've often wished for this in mainstream languages. It seems like operator precedence should go the way of bracketless if and implicit int casts. (Though I wonder if they wind up making exceptions here for chains of method calls? I guess technically those rely on operator precedence sort of?)

Edit: Yeah I see the example code has "args.src.read_u8?()". So it looks like they figured out how to keep the good stuff.


Yes please! I'm always using parenthesis for every compound expression, and I've heard so many times from coworkers or code reviewers smuggly going "you know you can skip that, right?". At the same time I've heard the same people having discussions and scratching their heads about precedence in some attempt to code golf their way through a feature. Not to mention bugs caused by incorrect assumptions. Or pausing to figure out what some previously written expression actually does. Meanwhile, I'll gladly write `X + (Y / Z)`. You can thank me later.


Thank you!

In some sane industries where functional safety is required, this is strictly enforced. It leaves no ambiguity - what you write is what you get - and when the excrement hits the quickly revolving pressure modification device, you can just glance at the expressions and tell if they make sense or not.


LISP-like languages have enforced operator precedence due to polish notation e.g. (+ (* a b) (+ c d))


In addition the variadic prefix-notation means the operators are not limited to being binary:

  3*x*y*z+w
becomes:

  (+ (* 3 x y z) w)


Also, pleasantly the 0-ary invocation is the identity so (+) is 0 and (*) is 1.


In addition to requiring remembering the precedence of + versus * this requires you to remember order of evaluation. Is it (ab)c or a(bc)? And no, with certain types those are not necessarily the same. Floats, for example.


1. It doesn't require you to remember precedence, since there is no ambiguity

2. It doesn't require you to remember order of evaluation because the order is unspecified (* x y z) is defined to be "The product of x, y, and z" with no requirement on the order of evaluation. If you need a well-defined order of evaluation then you can do that explicitly: (* x (* y z))


At least you could write a simple macro to do left to right evaluation with standard operators to get the same effect if you wanted.


Does lisp have native bigints, or would something like (+ MAX_INT MAX_INT MIN_INT) still suffer from operator precedence issues?


Common Lisp's integers are transparently multi-precision. There is no need to work with a separate type, or to use special syntax for writing bignum tokens in source code.

Bignum support first appeared in the MacLisp dialect in 1970 or 1970, one of the main predecessor dialects of Common Lisp.

According to Gabriel and Steele's Evolution of Lisp paper, "bignums—arbitrary precision integer arithmetic—were added [to MacLisp] in 1970 or 1971 to meet the needs of Macsyma users".


I dont' know about other lisps, but both common-lisp and scheme have native bignums and fixnums are promoted to bignums automatically.


there's no operator precedence if you don't have (multiple) operators that could precede each other. In LISP-like languages these are simply functions (or more correctly, forms) which have other expressions as arguments, like any other functions or forms. LISP works just fine without much of the things we take for granted in ALGOL-like languages.


Polish notation enforces binary operators. LISP doesn't, so you have to have the parentheses. (+ a b c) is + a + b c or + + a b c in polish notation. These are the same, of course until thet are not, such as with floating point arithmetic or in case you trap on integer overflows.


That is a completely wrong-headed view. There is no precedence there because there is no ambiguity. The parentheses in your example are the function call parentheses, not the optional grouping parentheses. They are mandatory.

There are some issues of associativity in the semantics of some Lisp functions. For instance we know that that the syntax (+ 1.0 2 3.0 4) is a list object containing certain items in a known order.

But how are they added? This could depend on dialect. I think in Common Lisp, the result has to be as if they were added left to right. When inexact numbers are present, it matters.

This isn't a matter of syntax; it's a semantic matter, which depends on the operator.

For instance in (- 1 2 3 4), the 2 3 4 are treated as subtrahends which are subtracted from the 1. But (- 2) means subtract 2 from 0.

In TXR Lisp, I allowed the expt operator to be n-ary: you can write (expt 2 3 4). But this actually means (expt 2 (expt 3 4)): it is a right to left reduction! It makes more sense that way, because it corresponds to:

     4
    3
   2

The left-to-right interpretation would be

     3 + 4
   2
which is less useful: you can code that yourself using (expt 2 (+ 3 4)), for any number of additional terms after the 4.


APL (and, its derivatives, I think) evaluate strictly right to left, so

a * b + c

is a * (b + c). It might be jarring at first but I really came to enjoy the consistency, I never had to remember operator precedence, which helps in a language like APL where most functions are infix.


Conversely, Smalltalk is left-to-right, so

a + b * c is (a + b) * c

which is simply a result of every operation being a message send - muddling the rules with precedence would be likewise confusing, and would ruin the simplicity of the grammar.


and in forth

a b + c *

I believe language with "default precedence" was meant to help us write less (parenthesis) but in the long run we ended up abusing (it)


I tend to think it's fine for the very most common and obvious operators (MDAS, etc), but as soon as you get outside of those I agree. In particular I've been bitten by the precedence of JavaScript's ?? operator:

  function foo(a) {
    return a ?? 10 + " is the num";  // a ?? (10 + " is the num")
  }

  foo(12) // 12


Me too! So far I have seen four actual bugs in large numerical code bases that were caused by overlooking operator precedence. I expect to see more in the years to come.

I think that precedence of '*' over '+' is acceptable (as everyone knows it instinctively) but I would love a way to require parenthesis for everything else.


Everyone doesn't know it instinctively, although I used to think so.

A while ago I taught an introductory spreadsheet class for adults. I got them to try "=2*3+4" and about half the class were surprised that the result wasn't 20. It's a lesson that has stayed in my mind.


I don't know if it was a joke (good one if so), but you seem to have mixed the operators up in your example.


APL wasn't about to bet that * is obviously over + in precedence. There everything is of equal precedence, right-to left. Until you put parenthesis. And that's for verbs (monadic and dyadic operators over nouns), not for other parts...


I don't think ditching the "basic" operator precedence (MDAS etc.) is a good idea, but I strongly agree that operator precedence should be a partial order, not a total order. See also [1].

[1] https://foonathan.net/2017/07/operator-precedence/


Have there been attempts at creating languages that use a postfix (RPN) notation?


Forth is (I think?) the oldest and most well-known. Postscript, the printer control language, is possibly more widely-deployed. And Factor is a modern take on Forth.


Dont forget to add UNIX's dc (in old times bc was a wrapper for dc)


There's also Bitcoin Script, which is a forth-like language.


to add to the list: HPL, the programming language on Hewlett-Packard's RPN calculators.


like forth?


That's just brilliant, and now I think about it, I wonder why no other newer languages have adopted this. I wish this to be the new norm of 2020's.


Only the good use it like that. Pony eg enforces it, and it was way before wuffs. Rust on the other hand lives with precedence rules which you have to remember by hard.


That isn't a bad idea, but keep in mind, there is always a usability aspect (I'll just call it the programmer computer interface problem) of "what makes a programming language popular". For example, consider PL/I: https://en.wikipedia.org/wiki/PL/I#Implementation_issues

When people see for example, * or + (or a, b, c). They may have some preassumptions about some implied associativity from arithmetic (depending on what they are taught and what level of math they are at), that may be hard to break. If you have learned some college (abstract) algebra, it may mean something quite different. How about the = sign? Of course, a, b, c may be meaningless to someone who is not a native latin-1 speaker either. My point I guess is that these are just matters of convention, there is just some implied commutativity or associativity usually implied, but this is all arbitrary.

Now, one intereting "quirk" with with PL/I was that certain things looked similar "to what people were used to" (relative to say other PL/I code, or FORTRAN or COBAL), but worked differently even in some small spatial area on a screen (two blocks of nearby code in some editor). For example, if the programmer's eye saw a block of code, reflexively, depending on their experience they may be able to predict what the result of the computation could do. PL/I was an interesting experiment because of the lack of reserved keywords. This made it very expressive but very hard to understand code in context. For example, in pseudo PL/I: foo = 1; = = 2; bar 2 + foo. You are basically changing the grammatical syntax of the language in 3 lines.

But on the other hand, everything is just a symbol and this may not be completely unusual. Consider the diversity of the world's languages and how they are written and how meaning is derived. Natural language grammars may connotate very different representations and transformations, but people learn because they see enough examples. Consider for the differences between Han, Brahmic scripts, Arabic BiDi, various African scripts, Cuneiform, Emoji, whatever. Perhaps all computer languages are "overfit" due to for example, Chomsky's ideas and BNF (keep in mind Chomsky's ideas about morphology were quite different).

Now, let's consider mathematical notation. Depending on how much pure math (or say, mathematical physics or other sciences) you consider, there may be more and more semantic overhead with the conventions of mathematical notation, and people often historically just "cartesianize" and "euclidized" things for convenience because of lack of tooling (think of a sheet of paper metaphor, we've simply moved it over to a computer. it's a skewmorph). Clearly we have better computer graphics, so why haven't developer tools and languages changed along with it? Maybe with more immersive manipulation they will.


Surprisingly little discussed so far, aside from these past related threads:

Wuffs’ PNG image decoder - https://news.ycombinator.com/item?id=26714831 - April 2021 (135 comments)

C performance mystery: delete unused string constant - https://news.ycombinator.com/item?id=23633583 - June 2020 (105 comments)

That first one was just yesterday but this is a rare case where we would not downweight the follow-up post (https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...).


Apparently the language was renamed at some point, google/puffs redirects to google/wuffs?

Puffs was discussed few years back https://news.ycombinator.com/item?id=15711767


Thankfully they changed the name, in Germany it would be quite impossible to use it in any public discussion, as Der Puff is a special kind of boys club.


Yeah, Coq also feels the heat with name...


Yeah, I was curious about that as well. The README file links to a Google Groups discussion about the name change that seems to have been memory-holed, but apparently it was renamed to avoid confusion with a NetBSD component: https://news.ycombinator.com/item?id=15712659


This is a fascinating spin: a pure language, designed for libraries, not for complete programs. A tip of the hat to whoever was able to break out of the "a language has to do x y and z" thinking and perceive that this is a possibility.


Interestingly, I was thinking about this exact thing a few hours ago, before I saw this on HN. The thought was that you can design much more interesting languages by not making them be everything for everyone. Not quite domain specific, but also not quite general purpose.

I made a language a while back that was used to implement custom logic for a product (I've since replaced it with a more declarative system that's basically TOML but where values can be expressions that get evaluated to generate the actual values). One goal of this language was that it should always terminate[1], so it had no unbounded loops. Another goal was that it should be deterministic, so all input was gathered before execution and all output was accumulated to be processed at the end. The entire thing ran in a database transaction (so input could be queried, then the code was executed as a pure function of this input, then the result would be written back to the database or sent elsewhere). Externally triggered events would cause this to run. Essentially an event driven synchronous[2] language of the transformational system variety. It was slightly inspired by Lustre[3], which is used in critical systems like aeroplanes, trains and power plants. I'm a big fan of this style of language.

Basically, by constraining what the language can do or can be used for, you can design much more powerful semantics or language features for the things that it is designed for, similar to a domain specific language. I guess it really would be a somewhat more general domain specific language, or at least domain specific to multiple domains.

I was thinking about this while walking home from the shops and was wondering if such a language would be beneficial to solve some challenges I hit in my work and I was going to spend some time thinking what semantics would be useful, but haven't done so yet and then came across this HN submission. :)

[1] I was also thinking about how the halting problem doesn't really say that determining if a program halts is impossible, just that there are programs that are not computable. If you add constraints (like not being able to feed the program to itself as is done in the halting problem and not allowing unbounded loops) then it is possible to determine if a program will terminate or not.

[2] https://en.wikipedia.org/wiki/Synchronous_programming_langua...

[3] https://en.wikipedia.org/wiki/Lustre_(programming_language)


> If you add constraints (like not being able to feed the program to itself as is done in the halting problem and not allowing unbounded loops) then it is possible to determine if a program will terminate or not.

Dhall is a good example - https://github.com/dhall-lang/dhall-haskell .


I'm not familiar with Dhall, but that looks pretty neat! Thanks for sharing.


I'm not sure if languages are designed to build libraries or apps specifically. The languages I use are designed to communicate with a computer. It's the frameworks that dictate how we package a set of instructions.

So a "pure language" here is just a bs marketing term rather than any inherent feature in the language. As far as purity goes, a language like c does that just fine. That's as pure a language as it can get.

This whole opinionated "can never allocate memory" is condescending to engineers. A powerful language should be safe by default (to take some of the pressure off the developers of having to be always careful) but have the knobs to let them take full control when needed. C# does this very well.


Arguably Elm takes this approach to SPA development. The language spec itself is general purpose but in practice it's been developed to serve a narrow purpose (too narrow for some!)


Yes, an interesting concept. But I guess it won't see a wide adoption: Who wants to learn a new language, if you can't use it for anything but X? And who will use this language for X, if you haven't been able to learn it while doing Y?


Who would want to learn Javascript for the frontend and Python/Ruby/C#/Java/PHP/Whatever for the backend? And HTML for UI with CSS for styling?

You say this, but people do it all the time.


People who want to do X well will learn the language. You can learn more than one language.


I disagree. It's just a question of how many people are trying to do what it's designed for, and how much benefit it provides for that focused domain.


> Traditionally, the first program anyone writes in a given programming language is something that prints "Hello world". This doesn't work for Wuffs, for two reasons. One is that Wuffs doesn't have a string type per se. Two is that Wuffs code doesn't even have the capability to write to files directly, such as to stdout. Wuffs is a language for writing libraries, not complete programs, and the less Wuffs can do, the less Wuffs can do that is surprising (such as upload your files to the internet), even when processing untrusted input.


To be honest, I'm not sure what to make of this. Wuff the library makes sense as a drop in for the C standard library, but the language, I'm not sure how it fits.

It seems to offer some of the features offered by languages like D and Rust, while staying more C like, but also removing one of the few actual reasons to use C, which both D and Rust also provide on top of the other features offered by Wuff.

It's cool and all but it seems confused as to whether it wants to be a library for C, an extension to C or a standalone language. As a stand alone language, I'm not sure I really see the benefits over alternatives as a C library, it does have some interesting ideas.


The readme has more explanation:

> Wuffs (Wrangling Untrusted File Formats Safely) is formerly known as Puffs (Parsing Untrusted File Formats Safely).

> Wuffs is a memory-safe programming language (and a standard library written in that language) for wrangling untrusted file formats safely. Wrangling includes parsing, decoding and encoding. Example file formats include images, audio, video, fonts and compressed archives.

> Wuffs is not a general purpose programming language. It is for writing libraries, not programs. The idea isn't to write your whole program in Wuffs, only the parts that are both performance-conscious and security-conscious. For example, while technically possible, it is unlikely that a Wuffs compiler would be worth writing entirely in Wuffs.


The purpose of the language is clear to me: make it practical to prove that >90% of a C file-munging library is safe from several common types of errors. I will be looking into reorganizing some of my existing C library code of this type into large Wuffs components and small interaction-with-outside-world components.


The only part of the Wuffs spec I just read that I dislike:

Strings. I would really prefer strings to work like existing C and 'bash' style quoting. At least the simple aspects of it, the parts of the rules that are easy to remember and simple. A string should always be a sequence of octets, but easily coerced by a casting operator to a numeric format from any index. I'm not sure what the syntax for that would be offhand.


> I would really prefer strings to work like existing C...

The way "strings" work in C is a big source of the kind of bugs that Wuffs is supposed to prevent.

> A string should always be a sequence of octets, but easily coerced by a casting operator to a numeric format from any index

The Wuffs language just uses "a sequence of octets" in favor of a string type.


Yeah, when I got to the bottom and saw that they don't have strings, I immediately decided not to spend any more time learning about the language.

I'm not sure what I would want them to be like for this language, but I'd definitely want something.


It isn't made especially clear on the linked page, but Wuffs is a language to write parsers with. It's not a "general purpose" language, though it might find use in other domains.

If you've written a parser, you will have noticed that the built-in string types of other languages are counter-productive. You really want to work with plain ranges over bytes, and Wuffs offers that to you.


I like the idea. The inability to do something is an often underrated feature.


I'm confused by the "all functions are methods" restriction. the "what" seems clear, but the "why" is eluding me, and I'd love to read an explanation.


How does error handling work in Wuffs? That seems to be an important aspect for a reliable language, it wasn't immediately clear from the docs.

Edit ah found it, https://github.com/google/wuffs/blob/main/doc/note/statuses....


>"Wuffs (Wrangling Untrusted File Formats Safely) is formerly known as Puffs (Parsing Untrusted File Formats Safely). Wuffs is a memory-safe programming language (and a standard library written in that language) for wrangling untrusted file formats safely. Wrangling includes parsing, decoding and encoding. Example file formats include images, audio, video, fonts and compressed archives.

It is also fast. On many of its GIF decoding benchmarks, Wuffs measures 2x faster than "giflib" (C), 3x faster than "image/gif" (Go) and 7x faster than "gif" (Rust).

Goals and Non-Goals

Wuffs' goal is to produce software libraries that are as safe as Go or Rust, roughly speaking, but as fast as C, and that can be used anywhere C libraries are used. This includes very large C/C++ projects, such as popular web browsers and operating systems (using that term to include desktop and mobile user interfaces, not just the kernel).

Wuffs the Library is available as transpiled C code. Other C/C++ projects can use that library without requiring the Wuffs the Language toolchain. Those projects can use Wuffs the Library like using any other third party C library. It's just not hand-written C.

However, unlike hand-written C,

Wuffs the Language is safe with respect to buffer overflows, integer arithmetic overflows and null pointer dereferences.

A key difference between Wuffs and other memory-safe languages is that all such checks are done at compile time, not at run time. If it compiles, it is safe, with respect to those three bug classes.

The trade-off in aiming for both safety and speed is that Wuffs programs take longer for a programmer to write, as they have to explicitly annotate their programs with proofs of safety. A statement like x += 1 unsurprisingly means to increment the variable x by 1. However, in Wuffs, such a statement is a compile time error unless the compiler can also prove that x is not the maximal value of x's type (e.g. x is not 255 if x is a base.u8), as the increment would otherwise overflow. Similarly, an integer arithmetic expression like x / y is a compile time error unless the compiler can also prove that y is not zero.

Wuffs is not a general purpose programming language. It is for writing libraries, not programs. The idea isn't to write your whole program in Wuffs, only the parts that are both performance-conscious and security-conscious. For example, while technically possible, it is unlikely that a Wuffs compiler would be worth writing entirely in Wuffs."

PDS: Would like to see a future AV1 / AOM / libaom / FFmpeg -- written/compiled in Wuffs...


Wuffs seems fascinating and I really wanted to like it. But when I look at the code for the JSON decoder it seems so low level, and full of places for bugs to hide. JSON is a pretty simple spec and this obscures it (although to be fair it's also handling UTF-8).

https://github.com/google/wuffs/blob/main/std/json/decode_js...

Yes it prevents buffer overflows and integer overflow, but it can't prevent logical errors.

I'd rather see efficient code generated from a short high level spec, not an overwhelming amount of detail in a language verified along a few dimensions.

---

Logical errors in parsing also lead to security vulnerabilities. For example, here is an example of parser differentials in HTTP parsing:

https://about.gitlab.com/blog/2020/03/30/how-to-exploit-pars...

The canonical example of this class of bug is forging SSL certificates to take advantage of buggy parsers, but I don't have a link handy. There should be one off of https://langsec.org/ if anyone can help dig it up.

Again, this has nothing to do with buffer or integer overflows.

(aside: while googling for that I found the claim that mRNA vaccines work by parser differentials: https://twitter.com/maradydd/status/1342891437537505280?lang... If anyone understands that I'd be curious on an opinion/analysis :) )

At the very least, any language for parsing should include support for regular languages (regexes). The RFCs for many network protocols use this metalanguage, and there's no reason it shouldn't be executable. They compile easily to efficient code.

The VPRI project claimed to generate a TCP/IP implementation from 200 lines of code, although it's not really a fair comparison because it hasn't been tested in the wild: https://news.ycombinator.com/item?id=846028 .

Still I think that style has better engineering properties. Oil's lexer, which understands essentially all of bash, is generated from a short source file

https://www.oilshell.org/release/0.8.8/source-code.wwz/front...

which generates

https://www.oilshell.org/release/0.8.8/source-code.wwz/_devb...

which goes on to generate 28,000 lines of C code. It's short, but it really needs a better regex metalanguage to be readable: https://www.oilshell.org/release/latest/doc/eggex.html

A large part of JSON can be described by regular languages, and same with HTTP, etc.

-----

edit: An re2c target for wuffs could make sense. The generated code already doesn't allocate any memory, although it uses tons of pointers which could be dangling.

And in fact that was a problem Cloudflare, which sprayed the user data of their customers all over the Internet back in 2017: https://en.wikipedia.org/wiki/Cloudbleed

That was with Ragel and not re2c, which perhaps has a more error prone API.


> I'd rather see efficient code generated from a short high level spec

This is a holy grail for many PL researchers, but I don't think that there's any languages that reached this level of sophistication with expressiveness/practicality enough for production usages. At least with the status quo, you will probably need to write a massive amount of formal proofs if you want logical correctness, even with deceivingly simple specifications.


It doesn't have to be the same metalanguage for every program. You can write your own code generators adapted to the specific problems. They should have a "pit of success", and the knowledge of the domain is used to ensure that.

There are few research-level issues here; it's just good engineering.

The 0% or 100% mindset is bad engineering. You want something that's short, and that you can explain to other people, and that other people can write an independent implementation of. If the proof is 10x longer than normal code, and it's written in a metalanguage that the relevant people don't know, then it's not very useful.

Proofs are not guarantees. CompCERT has had logic bugs despite being written in a formal language. (It does reduce the number of bugs drastically in general, but it's also an extremely expensive technique, and not what I'm advocating.)

There are no guarantees in engineering, just good practices. Groveling through bytes one at a time in imperative languages is not an ideal engineering practice, even if the imperative language comes with more guarantees than most.


> JSON is a pretty simple spec a

Yet there are no proper implementations because it's too simple, sometimes ambiguous, and there are several standards. JSON parsing is a minefield, http://seriot.ch/parsing_json.php


JSON is simple compared to any textual format and many binary formats. Compare it with HTTP 1.1, 2, 3, or say Apache Arrow.

JSON is a compromise, and specifies only syntax, without semantics, and without an API. But I think that was a good tradeoff compared to say XML syntax, schemas, and APIs like SAX and DOM. You can build a lot of stuff on top, and people have.


Unfortunately, the lack of specification is becoming a security liability in modern times. That's a big reason why I'm developing https://concise-encoding.org/

- 1:1 compatible binary and text formats. Edit in text, send in binary.

- The binary encoding is super simple, and what's in use 99% of the time since it's usually machines talking to each other. Text is only for humans to see the data and to input the data.

- The format is more complex than JSON, but FAR better specced to avoid variance in implementations (important for security).


Tony Garnock-Jones (known for AMQP and RabbitMQ) is also developing Preserves [1] which I found promising.

[1] https://preserves.gitlab.io/preserves/


Avro for example is extremely complex compared to json.


Bencode is nice and simple arguably text based format, though needs a bit of editor support to be really human friendly.


No, binary formats are often simpler. Really anything that involves text is hugely complicated.


Not sure how this relates to what I said, maybe you misread it.


The first sentence: "JSON is simple compared to any textual format and many binary formats. "


The Pfizer vaccine indeed contains an important parser differential exploit to evade a security system. Ctrl+F "1-methyl-3’-pseudouridylyl" in this excellent post:

https://berthub.eu/articles/posts/reverse-engineering-source...


So a similar sense to Haxe: https://haxe.org


I think they're pretty different. Haxe is designed to compile to many different languages, sort of like a least common denominator.

Whereas Wuffs translates only C, and is pretty semantically close to it. Its goal is really safety, while Haxe seems to be a portability (e.g. running the same game on many platforms).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: