The "volatile" keyword should never be used for C/C++ multithreaded code. It's specifically intended for access to device-mapped addresses and does not account for any specific memory model, so using it for multithreading will lead to breakage. Please use the C/C++ memory model facilities instead.
(As a contrast, note that in Java the "volatile" keyword can be used for multithreading, but again this does not apply to C/C++.)
> Please use the C/C++ memory model facilities instead
I should point out that for more than half of my professional career, those facilities did not exist, so volatile was the most portable way of implementing e.g. a spinlock without the compiler optimizing away the check. There was a period, after compilers started aggressively inlining and before C11 came out, in which it could be quite hard to convince a compiler that a value might change.
The problem is that volatile alone never portably guaranteed atomicity or memory barriers, so such a spinlock would simply not work correctly on many architectures: other writes around it might be reordered in a way that makes the lock useless.
It does kinda sorta work on x86 due to its much-stronger-than-usual ordering guarantees for plain load and store instructions, even in the absence of explicit barriers. And because x86 was so dominant, people could get away with that for a while in "portable" code (which wasn't really portable).
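For illustration, here's roughly what the memory-model version of that spinlock looks like today (a minimal sketch in Rust, whose Acquire/Release orderings map directly onto the C11/C++11 model; the same shape works with C11 atomics or C++ std::atomic):

    use std::sync::atomic::{AtomicBool, Ordering};

    // A minimal test-and-set spinlock: the atomic type provides both the
    // atomicity and the ordering guarantees that volatile never did.
    pub struct SpinLock {
        locked: AtomicBool,
    }

    impl SpinLock {
        pub const fn new() -> Self {
            SpinLock { locked: AtomicBool::new(false) }
        }

        pub fn lock(&self) {
            // Acquire on success keeps the critical section from being
            // reordered before the lock acquisition.
            while self
                .locked
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_err()
            {
                std::hint::spin_loop();
            }
        }

        pub fn unlock(&self) {
            // Release makes the critical section's writes visible before
            // the lock is observed as free again.
            self.locked.store(false, Ordering::Release);
        }
    }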
The C/C++ memory model is directly derived from the Java 5 memory model. However, the decision was made that volatile in C/C++ specifically referred to memory-mapped I/O stuff, and the extra machinery needed to effect the sequential consistency guarantees was undesirable. As a result, what is volatile in Java is _Atomic in C and std::atomic in C++.
C/C++ also went further and adopted a few different notions of atomic variables, so you can choose between a sequentially-consistent atomic variable, a release/acquire atomic variable, a release/consume atomic variable (which ended up going unimplemented for reasons), and a fully relaxed atomic variable (whose specification turned out to be unexpectedly tortuous).
Importantly, these aren't types; they're operations.
So it's not that you have a "release/acquire atomic variable"; you have an atomic variable, and it so happens you choose to do a Release store to that variable. In other code maybe you do a Relaxed load from the same variable, and elsewhere you have a compare-exchange with different ordering rules.
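A quick sketch of that in Rust (which exposes the same model; the variable and function names here are made up for illustration):

    use std::sync::atomic::{AtomicU32, Ordering};

    static FLAG: AtomicU32 = AtomicU32::new(0);

    fn publish() {
        // A Release store in one place...
        FLAG.store(1, Ordering::Release);
    }

    fn peek() -> u32 {
        // ...a Relaxed load somewhere else...
        FLAG.load(Ordering::Relaxed)
    }

    fn claim() -> bool {
        // ...and a compare-exchange elsewhere, with its own pair of orderings
        // (one for success, one for failure), all on the very same variable.
        FLAG.compare_exchange(0, 1, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }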
Since we're talking about Mutex here, here's the entirety of Rust's "try_lock" for Mutex on a Linux-like platform:
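(Paraphrased from memory of the futex-based Mutex in std; UNLOCKED and LOCKED are its integer state constants.)

    pub fn try_lock(&self) -> bool {
        // One compare-exchange: if the futex is UNLOCKED, store LOCKED with
        // Acquire ordering; otherwise a Relaxed load reports the current
        // value, which we only use to decide success via is_ok().
        self.futex
            .compare_exchange(UNLOCKED, LOCKED, Acquire, Relaxed)
            .is_ok()
    }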
That's a single atomic operation: we hope the futex is UNLOCKED, and if it is we store LOCKED to it with Acquire ordering; if it wasn't, we use a Relaxed load to find out what it was instead of UNLOCKED.
We actually don't do anything with that load, but the Ordering for both operations is specified here, at the call site, not in the type of the variable.
I much prefer the pattern I've noticed with the recent generation of Go projects. What I mean is that I find myself more and more often going to a project's repository to check for an issue or open a pull request, only to find that it is written in Go. After the initial hype cycle, Go quietly became the engine many useful tools were written in. While I understand that Zig needs to get the word out to some extent early on, ultimately having great projects arrive that are written in Zig (and don't need a "written in Zig" tagline as a sort of marketing gimmick) would be the best statement. This pattern I've observed with Go holds true for projects written in TypeScript, Python, and even C as well, but Go is the most recent entry.
Effectively, less focus on slogans and more on great projects that solve actual problems. Sometimes moving in silence can speak volumes. (In chess, there's a saying: "Move in silence. Only speak when it's time to say, 'Checkmate.'")
I just asked my local AI chatbot, and they said that this is the endgame for Zig:
Congratulation !!
A.D.2111
All bases of Rust were destroyed.
It seems to be peaceful.
But it is incorrect.
Rust is still alive. Zig must fight against Rust again
And down with them completely!
Good luck.
Agreed. It's easy to have memory safety when you don't even support heap allocation. Now if OP had said "Java" or "C#" instead of "COBOL", they would've had a solid point. But the way Rust ensures memory safety without mandating GC, while still allowing for complex allocation patterns, is practically unfeasible for any of the usual "legacy" languages, with the notable exception of Ada.
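To make that concrete, a tiny Rust sketch (made up for illustration) of the sort of thing the compiler enforces with no GC in the picture:

    fn main() {
        let data = vec![1, 2, 3]; // heap allocation, owned by `data`
        let view = &data[0];      // a borrow into that allocation
        // drop(data);            // compile error if uncommented: `data`
        //                        // cannot be moved while `view` borrows it
        println!("{view}");
        // `data` is freed deterministically here, at end of scope,
        // with no garbage collector involved.
    }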
The nice thing about a vaguely English-like language is that your average LLM is going to do a better job of making sense of it, because it can leverage what it has learned from the entire training set, not just the code-specific portion of it.
I've used o365 copilot to analyze a COBOL app I had source code to, and it was great at explaining how the code worked. Made writing an interface to it a breeze with some sample code and I swear I am not a COBOL person, I'm just the Linux guy trying to help a buddy out...
It also does a reasonable job of generating working COBOL. I had to fix up just a few errors in the data definitions, as the LLM generated badly sized data members, but it was pretty smooth. Much smoother than my experiences with LLMs and Python. What a crapshoot Python is with LLMs...
> Facilitating graph-level optimizations has been one of the most central tenets of tensorflow's design philosophy since its inception.
Agreed, of course, but it's not like they came up with this approach from scratch. They seem to have just picked it up from Theano (now Aesara/PyTensor).
Often enough, hardware-specific optimizations can be performed automatically by the compiler. On the flip side, depending on a small set of general-purpose primitives makes it easier to apply hardware-agnostic optimization passes to the model architecture. There are many efforts that are ultimately going in this direction, from Google's Tensorflow to the community project Aesara/PyTensor (née Theano) to the MLIR intermediate representation from the LLVM folks.
I'm a compiler engineer at a GPU company, and while tinygrad kernels might be made more performant by the JIT compiler underlying every GPU chip's stack, oftentimes a much bigger picture is needed to properly optimize for all of the chip's resources. The direction that companies like NVIDIA are going in involves whole-model optimization, so I really don't see how tinygrad can be competitive here. I see it as most useful in embedded, but Hotz is trying to make it a thing for training. Good luck.
> There are many efforts that are ultimately going in this direction, from Google's Tensorflow to the community project Aesara/PyTensor (née Theano) to the MLIR intermediate representation from the LLVM folks.
The various GPU companies (AMD, NVIDIA, Intel) are some of the largest contributors to MLIR, so saying that they're going in the direction of standardization is not wholly true. They're using MLIR as a way to share optimizations (really, to stay at the cutting edge), but, unlike tinygrad, MLIR has a much higher-level view of the whole computation, and the companies' backends will thus be able to optimize over the whole model.
If tinygrad were focused on MLIR's ecosystem I'd say they'd have a fighting chance of getting NVIDIA-like performance, but they're off doing their own thing.
> It is increasingly less expensive (in OpEx-per-inference-call terms) to run larger models, as your call concurrency goes up. Which doesn't matter to individuals just doing one thing at a time; but it does matter to Inference-as-a-Service providers, as they can arbitrarily "pack" many concurrent inference requests from many users
In a way it also matters to individuals, because it allows them to run more capable models with a limited amount of system RAM. Yes, fetching model parameters from mass storage during inference is going to be dog slow (while NVMe transfer bandwidth is getting up there, it's not yet comparable to RAM), but that only matters if you insist on getting your answer interactively, in real time. With a local model, it's trivial to make LLM inference a batch task. Some LLM inference frameworks can even save a checkpoint of a single in-progress inference to disk so it can be cleanly resumed later.
You may want to consider https://marabos.nl/atomics/ for an approachable overview that's still quite rigorous.