Requiring the same operations in the same order is a tough constraint in an environment where core counts are increasing and clock speeds/IPC are not. It's hard to rewrite some of these algorithms to use a parallel decomposition that's the same as the serial one.
I've done a lot of work on reproducibility in machine learning systems, and it's really, really hard. Even the JVM got me by changing some functions in `java.lang.Math` between versions & platforms (while keeping to their documented 2ulp error bounds).
Most people aren't spreading single calculations across multiple cores, and the ones that do are already deep enough into the technical weeds to handle deterministic chunking and combining.
"parallel decomposition that's the same as the serial one" would be difficult in many ways, but only needed when you can't afford a one-time change.
Many systems I've worked on, if they're not serial, can be written as embarrassingly parallel with synchronized checkpoints for a relatively small cost, but yeah, the middle ground is hard. That's the nature of the beast though. Imagine trying to rearrange the instructions in your program the same way. You're going to get all sorts of wacky behavior unless you invest an enormous amount of effort into avoiding problems the way superscalar CPUs do.
Relatively few programs truly need to operate that way though and they're usually written by people who know the cost of their choices.
You can definitely do operations in a reproducible way (assuming you do the reductions in a defined order) but, yeah, you might lose some performance. Unfortunately the ML people seem to pick better performance over correctness basically every time :(
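As a rough sketch of what "reductions in a defined order" can look like in practice (the chunk size, thread pool, and sum example here are illustrative, not from any particular system): compute partial sums in parallel over fixed chunk boundaries, then combine the partials in a fixed left-to-right order, so the floating point result doesn't depend on thread scheduling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DeterministicSum {
    // Partial sums are computed in parallel over fixed chunk boundaries,
    // but combined in a fixed left-to-right order, so the result is the
    // same on every run regardless of how the threads are scheduled.
    static double deterministicSum(double[] data, int chunkSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Double>> partials = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunkSize) {
                final int s = start;
                final int e = Math.min(start + chunkSize, data.length);
                partials.add(pool.submit(() -> {
                    double sum = 0.0;
                    for (int i = s; i < e; i++) {
                        sum += data[i];
                    }
                    return sum;
                }));
            }
            // The fixed combination order is what buys the reproducibility;
            // a plain parallel stream sum makes no such ordering guarantee.
            double total = 0.0;
            for (Future<Double> f : partials) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

The cost is that the association order is pinned, so the runtime can't rebalance or reorder the work however it likes, which is where the performance loss comes from.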
Some of the gradients are present in the TF C API (and thus the C++ API), but it's hit & miss which are in C and which are in Python. There was an attempt years ago to port more gradients to C, but it petered out the way most TF-related efforts at Google seem to.
We use the C API to generate gradients for TF-Java, and have had some success training models with it. Replicating the Python bits in another language is a huge effort, though, and one we haven't completed.
The encoder's embedding is contextual: it depends on all the tokens. If you pull out the embedding layer from a decoder-only model, that's a fixed embedding where each token's representation doesn't depend on the other tokens in the sequence. The bi-directionality is also important for getting a proper representation of the sequence, though you can train decoder-only models to emit a single embedding vector once they have processed the whole sequence left to right.
Fundamentally it's the difference between bidirectional attention in the encoder and a triangular (or "causal") attention mask in the decoder.
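To make the mask difference concrete, here's a tiny sketch (the additive 0 / -infinity mask convention and the sequence length are just illustrative):

```java
public class AttentionMasks {
    // Encoder-style: every position can attend to every other position,
    // so nothing is masked out before the softmax (all zeros).
    static float[][] bidirectionalMask(int seqLen) {
        return new float[seqLen][seqLen];
    }

    // Decoder-style: position i can only attend to positions j <= i,
    // so everything above the diagonal gets an additive -infinity.
    static float[][] causalMask(int seqLen) {
        float[][] mask = new float[seqLen][seqLen];
        for (int i = 0; i < seqLen; i++) {
            for (int j = i + 1; j < seqLen; j++) {
                mask[i][j] = Float.NEGATIVE_INFINITY;
            }
        }
        return mask;
    }
}
```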
There are a bunch of these things in a word2vec space. I had a blog post years ago on my group's blog which trained word2vec on a bunch of wikias so we could find out who the Han Solo of Doctor Who is (which, somewhat inexplicably, I think was Rory Williams). You need to implement word2vec carefully, and then the similarity search, but there are plenty of vaguely interesting things in there once you do.
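The analogy query itself is just vector arithmetic plus a nearest-neighbour search by cosine similarity, roughly like this sketch (the vector map stands in for whatever word2vec model you've trained; the names and structure are illustrative):

```java
import java.util.Comparator;
import java.util.Map;

public class Analogies {
    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // analogy(vectors, "Han Solo", "Star Wars", "Doctor Who") asks
    // "Han Solo is to Star Wars as ? is to Doctor Who".
    static String analogy(Map<String, float[]> vectors, String a, String b, String c) {
        float[] va = vectors.get(a);
        float[] vb = vectors.get(b);
        float[] vc = vectors.get(c);
        float[] query = new float[va.length];
        for (int i = 0; i < query.length; i++) {
            query[i] = va[i] - vb[i] + vc[i];
        }
        // Nearest neighbour to the query vector, excluding the three inputs.
        return vectors.entrySet().stream()
                .filter(e -> !e.getKey().equals(a) && !e.getKey().equals(b) && !e.getKey().equals(c))
                .max(Comparator.comparingDouble(e -> cosine(query, e.getValue())))
                .map(Map.Entry::getKey)
                .orElseThrow();
    }
}
```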
It's a good point about true and false positives though, which makes me wonder if anyone's taken a large database of expected outputs from such "equations" and used it to calculate validation scores for different models in terms of precision and recall.
Java has had the ability to run a single source file directly via `java <path-to-file.java>` since Java 11 - https://openjdk.org/jeps/330. And it's had a REPL (`jshell`) since Java 9.
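For example, a file like this runs with `java Hello.java` and no separate `javac` step:

```java
// Hello.java - launched directly with `java Hello.java` on JDK 11+ (JEP 330).
public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello from a single source file");
    }
}
```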
There's usually a two- or three-step training procedure: first training to predict the next word on a huge corpus of text (billions or trillions of words), then possibly some instruction tuning (giving the model question & answer pairs and training on the answers), and then finally RLHF (or RLAIF, DPO, etc.) where the model is trained to match human preferences. It's this last step that is used to increase the helpfulness & harmlessness of the model, training it not to respond to certain topics.
ONNX Runtime needed a good demo application in Java and this was fun to do (I maintain the Java API for ONNX Runtime and wrote this SD implementation). I've added SDv2 and SDXL support (and the turbo variants thereof) since the initial release, and I'll upgrade it to the latest ONNX Runtime when that comes out to get FP16 support among other things.
I did. It depends what you want: for an overview of how ONNX Runtime works, Microsoft have a bunch of things on https://onnxruntime.ai, but the Java content on there is a bit lacking as I've not had time to write much. Eventually I'll probably write something similar to the C# SD tutorial they have on there, but for the Java API.
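In the meantime, the basic inference loop in the Java API is fairly compact; a minimal sketch looks something like this (the model path, input name, batch shape, and output type are placeholders for whatever model you're actually running):

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.util.Arrays;
import java.util.Map;

public class OrtExample {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
             OrtSession session = env.createSession("model.onnx", opts)) {
            // One example with four float features; the input name, shape and
            // output type all depend on the model you're running.
            float[][] features = new float[][]{{1.0f, 2.0f, 3.0f, 4.0f}};
            try (OnnxTensor input = OnnxTensor.createTensor(env, features);
                 OrtSession.Result output = session.run(Map.of("input", input))) {
                float[][] scores = (float[][]) output.get(0).getValue();
                System.out.println(Arrays.toString(scores[0]));
            }
        }
    }
}
```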
For writing ONNX models from Java we added an ONNX export system to Tribuo in 2022 which can be used by anything on the JVM to export ONNX models in an easier way than writing a protobuf directly. Tribuo doesn't have full coverage of the ONNX spec, but we're happy to accept PRs to expand it, otherwise it'll fill out as we need it.
I have been very impressed at how performant the runtime's CPU inference is. It beat out hand-written AVX intrinsics by almost an order of magnitude. I had to go find a machine with no discrete GPU to convince myself it really wasn't using one, despite it saying it wasn't.
There's a correction to that tweet: a larger vocab means fewer tokens for any given sequence (usually, assuming the extra vocabulary isn't there to add other languages or character sets).