Perfecting GLFW for Zig, and finding lurking undefined behavior that went unnoticed (hexops.com)
136 points by todsacerdoti 80 days ago | 57 comments



I've actually found this bug before, but since I've been stuck on a six years out of date version of GLFW, I assumed it had been fixed and just wrote a ticket to update GLFW. It was wild to see this in my feed today!


Can anyone provide some context? I don't know what GLFW is, and the blog post doesn't really explain. From Googling, I see what it is now, but I don't really have a sense of how important GLFW bindings are. Is this very niche? Or, like is this a major contribution to game development? For that matter, is this mainly going to help Zig game developers, or is this a bit of tooling for everyone, kind of like zig cc?


Author here, sorry for the missing context. GLFW is very popular among game developers for opening graphics windows in a cross-platform way.

This helps just Zig game developers for now, it's not a major contribution to game development (though I hope other things I do with Mach engine in the future will be.)

I do think with some minor tweaks this could be used to make bindings for GLFW in other languages easier to install, e.g. by just requiring `zig`, but I haven't done that here.


Thanks! No need to apologize, I'm not in your target audience, and that's fine. Was just curious, is all.


glfw is very widely known because it's usually one of the recommended libraries to use when starting out graphics programming. It's basically a cross-platform library to create a window and initialize an OpenGL context .. or something like that. (It's been _many_ years since I last used it).


Is it safe to say that SDL is a strict superset of GLFW?


I think that is a reasonable statement. SDL includes other functionality like image loading, audio, etc., whereas GLFW is primarily focused just on getting a window with a graphics context for OpenGL/Vulkan/etc. set up.


SDL’s advantage is that it supports more platforms (like mobile and Switch), and is much more battle-tested with hundreds of games using it, including Valve’s ones. Many people seem to use SDL for only windowing and input and seem okay with it.


This documentation has a good description of capabilities of it and alternatives.

https://www.khronos.org/opengl/wiki/Related_toolkits_and_API...


A little while ago I tinkered a bit with automatically generating "more idiomatic" language bindings from C-APIs by parsing the AST-dump output of C headers from Clang (a better approach would be to use libclang I guess, but the JSON AST dump is good enough for simple C headers).

The parsed AST information is then further processed by a language-specific script which replaces some "C-isms" with target-language concepts (things like replacing pointer/size pair structs with target-language-slices). This could be further helped by custom annotations on the C-API, but so far I'm just looking for special type name patterns in the C headers. IME making C-APIs "binding friendly" even helps with making the C API a bit safer (for instance by providing a size to each data pointer).

Anyway... currently it's good enough to automatically generate Zig bindings for my Sokol libraries, but I think with a bit more work a more robust and universal tool could be created which essentially does the same thing but is more flexible. Microsoft is doing something similar now to expose the Windows APIs to C++ and Rust, but as far as I have seen this is on a whole different complexity level than my handful of python scripts ;)

Here's a bit more detailed information: https://floooh.github.io/2020/08/23/sokol-bindgen.html


Seems like a cool project, and kudos to the author's attention to detail.

These lines tickled my Bone of Irony:

> No installing apt packages. No dealing with missing header errors. It should just work out-of-the-box, and for every platform

I think it’s interesting/ironic because at some point the layers of header files were pitched as a way of making things more modular and better. A means of simplifying. Aptitude was meant to do the same thing. Simplify a process that was formerly seen as icky and complicated.

The author could have just as well mentioned just about any other software install/deploy/configure technology. In the end it always seems that yesteryear's "it makes things easier" is eventually derided and replaced with some new simple, straightforward mechanism. As whatever the new approach is scales in time and application, it takes the place of what it derided and becomes the newly derided. One big merry-go-round.


There is a key difference here- if your build instructions include "install this package using apt" then you are limiting your developers to using Ubuntu or Debian, and even then limiting it to a particular version range within those distros.

If your build instructions are "zig build" then that works for everyone, on every platform: Windows, macOS, Linux (all distros!), FreeBSD, etc., and it works for all versions of them.


I mean, the build instructions could also say "install this package" and then you install it however your distribution does it.


Installing a package to build some code is just fundamentally wrong, though. Why am I making unspecified, permanent changes to my underlying operating system in order to build a random piece of code?

That just makes no sense, and it is insane that this is the standard way of doing things.


Install in context of library package manager doesn't necessarily mean install to operating system. It can be just caching it somewhere in your home directory.


Part of the problem is, packages have different names on different systems, and many are missing altogether (is it libxml-dev or libxml-devel or does your OS simply not have that package, because the package repository owners are in a state of gang warfare with the libxml devs?)


One thing that has been bugging me for quite a while is how a Zig library would be packaged downstream. AFAICT all Zig programs have to vendor their Zig dependencies, which is a huge turn-off for Debian or Fedora developers. There is no existing way to install Zig source files, and IMHO this hurts reusability (in a different way).


This is an issue with any modern systems language like Go and Rust, and it doesn't seem like this trend is going to stop. NPM will happily download precompiled binaries for certain packages, and NuGet only works with prebuilt packages (like the java ecosystem).

In the current day, everyone wants their own package manager, and compiled builds to have no dependencies. While this makes distribution and building easier, it makes it impossible for distro maintainers to integrate it properly into their ecosystem, apply fixes, and upgrade libraries across swathes of programs. I don't get why people can't accommodate distro packaging in their tooling as well - many languages like Python handle this just fine.

I believe we'll end up at a point where the system package manager will mean nothing - it'll manage end-user programs, which will all bundle their own versions of libraries - even ones considered "system" like libc and libX11. No sharing of (security) fixes and feature upgrades, and you'll have to learn a new package manager each time you switch languages or sometimes even projects...


Welcome to app store land, where the system is merely a collection of independent applications with no relationship to each other.

Makes it indeed easier to build, deploy and sell software, compared to the former world of one big distribution where thousands of software packages were designed to work together.

I do believe the driving force in this trend is not coming from the users but from the producers' desire to assert ownership of the software. Think of the enclosure trend, but for software.


> Welcome in appstore land, where the system is merely a collection of independent applications with no relationship with each other.

"App store" is having a centralized distro repository where you get all applications from (e.g. Windows Store), and "merely a collection of independent applications with no relationship with each other" is exactly the opposite (e.g. that's how the Windows apps were distributed before the advent of Windows Store).


The article discusses some undefined behaviour resulting from shifting an unsigned char left by 24 places. Any idea why the compiler wasn't warning about that? It seems like it would be easy to implement that warning in the compiler. I feel that I'm missing something.


That’s the point made at the end of the article. The compiler check does exist, but because it isn’t on by default, nobody used it in 6 years. Zig has the checks on by default, so the first time someone used it, they found and fixed it for everyone.

> Anybody using GLFW could have enabled UBSan in their C compiler. Anybody could have run into this same crash and debugged it in the last 6 years. But they didn’t. Only because Zig has good defaults, because it places so much emphasis on things being right out of the box, and because there is such an emphasis on having safety checks for undefined behavior - were we able to catch this undefined behavior that went unnoticed in GLFW for the last 6 years.


I have fixed similar issues in Allegro and SDL (which are libraries that cover a superset of GLFW's scope) because I do use UBSan in my projects. Unfortunately, I haven't used GLFW in any project yet, so I couldn't fix this one :)

(the point on good defaults stays valid though)


This is what I appreciate about another aspect of Zig, checked arithmetic, which is on by default in safe release builds.

For example, if a u64 would overflow through addition, then instead of just allowing the addition to wrap the u64 around to 0, Zig will rather detect this and crash at runtime with an error, unless you explicitly indicate through the "+%" operator that you do intend to allow the addition to wrap.

If this is an unknown unknown for you, you don't need to worry about it because Zig will detect it, simply because the default is correct.

Whereas in other languages that support something like this (and I think even Zig's checked arithmetic is still among the most extensive across all operators), you have to actually opt in to checked arithmetic; it's not the default in safe builds, which always surprises me.


I was expecting a compile time warning and thought ubsan shouldn't be necessary. But CUViper explains why the UB can only be caught at runtime.


Integer promotion happening is known at compile time, so it’s still a fair point to expect a compile time warning.


I disagree. A simplified version of the problem expression is:

    unsigned char foo = 0x80;
    long bar = foo << 24;

It overflows and bar is set to -2147483648. If foo was initialized with 0x7f, there'd be no overflow and bar would be set to 2130706432.

In the original problem, the value of foo isn't known at compile time.


Probably why it's never been caught or been a problem.

Can't think of a reasonable case where a compiler would generate "bad" code because of this.


Agreed. I think it was only a problem because of ubsan. I guess ubsan added checks to the generated code that looked at the value being shifted left by 24, saw that overflow occurred, and therefore raised its undefined behaviour signal.

The code would never fail on a two's complement machine. What ubsan is saying is that the rules of C don't guarantee this code to work - it only works because the overflow writes into the sign bit, which is where the next thing expected it to be anyway.

If the above is true, I don't think their "fix" helps: https://github.com/glfw/glfw/pull/1986/files. I would have thought that the important part is to change the longs to unsigned longs. Edit: Ah, right, that's what the discussion says: https://github.com/glfw/glfw/pull/1986


Maybe what I'm suggesting isn't what you had in mind.

The fix for glfw is to manually promote to an appropriate unsigned type, to override the potentially UB-inducing default signed promotion. Since that potentially UB-inducing default can't change now, I was suggesting a warning for it.

It would be an aggressive warning and would probably trigger on a lot of existing code, but honestly, I would enable it for all new code. I really dislike the promotion logic in C.


You're right.

Their expression (simplified) was: *target = pixels[j] << 24, where pixels contains items of type unsigned char.

I hadn't realized that the pixels[j] was being promoted to an int rather than unsigned int. Eeep.

> I really dislike the promotion logic in C.

Agreed. I'll add that to my list of C dislikes.

I still think that even with manually promoting to the appropriate unsigned type isn't sufficient. Because target has type long*, so coercing the result of (unsigned int)pixels[j] << 24 to long still causes overflow and therefore UB... unless there's yet more I've missed.

Edit: Damn it, again, that's what the discussion says (https://github.com/glfw/glfw/pull/1986#issuecomment-95800024...).


It's not actually shifting a char, because integer promotion happens first.

https://github.com/glfw/glfw/pull/1986#issuecomment-95578417...


Furthermore, what ubsan catches is a shift into the sign bit of that promoted char.


gcc and clang should warn about this:

    #include <stdint.h>
    
    void foo() {
        uint32_t foo = 1;
        foo <<= 32;
    }
According to godbolt, the following warning is emitted for gcc:

    warning: left shift count >= width of type [-Wshift-count-overflow]
And for clang:

    warning: shift count >= width of type [-Wshift-count-overflow]
https://godbolt.org/z/ffccWexMP


The problem in the article is something like this:

    int main() {
        unsigned char foo = 0x80;
        unsigned int x = foo << 24;
    }
and with ubsan enabled it says

    /app/example.c:5:30: runtime error: left shift of 128 by 24 places cannot be represented in type 'int'
    SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /app/example.c:5:30 in

What's non-obvious is also the rules around arithmetic. The types unsigned char and unsigned int are mentioned in my snippet, but the arithmetic - the `<<` operator - is performed using the type int.


Different issue, that's shifting so far it would never fit. Here's an example with clang ubsan: https://godbolt.org/z/6Wvhn5rP9

Note that with foo = 0 there's no UB.


Yet another example of unnecessary undefined behavior in C as a consequence of needing to run on dead hardware. Bit shifting should clearly be defined to be operating on the underlying bits, with the resulting value being whatever that bit pattern would be on the target architecture.


“Whatever that bit pattern would be on the target architecture” is the original understanding of undefined.


> the left operand, which starts out as unsigned char, will get promoted to int

This is a big flaw in C. Quietly promoting an unsigned type to a signed type, simply because it was added to by a signed number? And that signed number is deterministically known to be > 0? C has major problems like this, you don't even need to look to UB to run into them. I think mixing signed and unsigned values should always give you a stern warning, explaining the pitfalls, and yet (a) the warning you get is some useless Chernobyl error light like "mixing signed and unsigned values" and (b) that warning is probably not enabled easily, without bringing in lots of stupid warnings too.


Zig, Rust, and (as much as I hate it) WASM seem to be doing magical things for several years running.


I enabled ubsan on a local build of our code. It found some issues, but good god the build time was atrocious! I doubt it's something you could enable on a CI pipeline.


We run sanitizers nightly, but not as part of presubmit checks. It strikes a good balance between productivity and safety.


I’ve always run sanitized builds as presubmit without any meaningful issue. It’s always nicer to have these issues caught before code gets merged into a shared repo. The long part of things is code review anyway so waiting on a build isn’t the biggest deal in the world.


That's great if it's fast enough. For some builds, like the GP and mine, it's extremely slow.


My point is I've run really big builds (i.e. taking about an hour to build non-sanitized) & we still did sanitized builds pre-submit. How big is the disparity you're seeing between sanitized & non-sanitized builds?


An hour, ouch. That's pretty long. We try to keep our presubmit under 10 minutes, and that's already frustrating.

I can't figure out how to dig the numbers out of our CI infrastructure for the nightly sanitizer runs.


I guess my point was more that it depends on how you structure your workflow. As long as you can do reviews and kick off asynchronous merge requests, the actual time it takes starts to become largely irrelevant in my experience (aside from the impact it has on local iteration).

These days I'm on a different project and CI ASAN builds are 33 minutes vs normal builds of 21 minutes. I'd say that extra 10 minutes is fine because sanitized builds run in parallel to normal builds & this is clean builds vs incremental.


Sure, but then you waste time bisecting to find the commit which introduced the issue.. I wonder if the bisection couldn't also be automated in the CI.


The system we use at work does indeed bisect a build breakage automatically.


I haven't found the sanitizer builds to be particularly slow. You mean running the test suite? Even that wasn't particularly bad.


"The GLFW code is pretty popular, and it’s been around for 6 years."

More than 15 years, in fact. I contributed a few changes to the OS X side back in 2005.


I think he meant that the function in question was introduced 6 years ago.


tl;dr:

Glfw is a library used most often by beginners learning graphics programming [but also in production systems]. A bit of code in that library shifts a char value c left by 24 places, yielding a 32-bit int value. If the top bit of the char (c & 0x80) is 1, that shifts it into the sign bit of the int, changing its sign, which is Undefined Behavior, and the compiler is free to do anything, including both what you wanted, and crashing. Compiling with "undefined behavior sanitizer" on, as Zig does by default, traps the event, causing the requested crash. Building without sanitization, nothing notices, and everything "just works".

A simple way to fix this would be to shift [wrong: c by 24u, "(c << 24u)"][right: "(unsigned(c) << 24)"], eliminating the Undefined Behavior while producing an identical instruction sequence, and all would be well.

Among permitted behaviors is launching missiles and just not even executing. In this case the compiler was not able to prove that a char value with the high bit set certainly would or certainly could not appear in that context. If it could prove it could not, then there would be no UB and everything would be fine. If it could prove such a value would appear, then it would consider itself justified in deleting all the code both after and leading up to the misuse, back to the point where a choice determined whether to enter the erroneous bit, and also the code making the choice, and likely the code that computes the value used in making the choice.

Perhaps surprisingly, this happens all the time and makes our programs smaller and faster than they would be otherwise. Still, people often complain about it.

If your testing does not exercise the code path that was deleted, then even with the UB sanitizer turned on, you might not notice that the compiler would have, without, chosen to delete that code. But deleting it is more likely to get your mistake noticed.

The most common source of this particular brand of UB is multiplying int values where the product does not properly fit in an int result. (Even Rust allows this, although it can be persuaded to trap misuses, with some care.) You might be tempted to do all your multiplications with unsigned ints, but you do not really get better results that way; often that just makes it even harder to find mistakes.


The (not yet merged) PR https://github.com/glfw/glfw/pull/1986 casts to signed long, which only avoids UB (is 64 bits) on 64-bit platforms which aren't Windows.

> Wouldn't it be better to cast to unsigned int or something similar (like uint32_t), since long can still be 32 bits? If the size of long and int is the same, wouldn't the cast as proposed in the PR be semantically identical to the current state?


Casting to signed long is demonstrably wrong.

The code in question is (probably) pretty slow, offering plenty of room for improvement beyond just fixing the UB. But it has seemed to work well enough, so far, like most things.


> A simple way to fix this would be to shift c by 24u, "(c << 24u)" which would promote both values to unsigned int

This works for most arithmetic operators, but not shift: "The type of the result is that of the promoted left operand". The type of the right-hand-side has no effect on the type of the result.

To fix this, you'd need to cast the left-hand-side to a type that is wide enough.


It should be made clear that glfw is also used by production software as well; I’ve spotted it in games, and also my terminal emulator of choice (kitty.) It is often a reasonable alternative to SDL if you just want to get a window with an OpenGL or Vulkan context and don’t need many of the other features SDL offers.



