You buy a DGX A100, or a cluster of them, for training and running large deep learning models (or for doing "traditional" HPC).
IBM's solution is more of a small inference engine that is part of the CPU, so you don't need to move your data off-chip when doing a little bit of inferencing as part of some other workflow. I don't work with mainframes, so I could be talking out of my behind, but maybe something like DL-assisted fraud detection as part of processing bank transactions?
The technician advised me to buy from Miele, Siemens or Bosch, as Samsung apparently has lots of issues.
The unholy Trinity of appliance hell. Every brand that makes these has issues. If you get 3-5 years of use out of any of them (post ~2005) you're lucky.
I'm firmly convinced that every washing machine or dishwasher brand just wants to steal from you
I learned that a lot of machines break down because of the combination of low temperature washing and the types of soap/detergent people use. It clogs up the machines and without regular maintenance, the occasional hot wash or better soap, it destroys components.
Apparently low temperatures do not fully dissolve modern soap (which is thick and heavily perfumed), leaving behind a lot of residue. An occasional hot wash clears it out.
I don't remember the name of the better soap, but basically what he says is that for clothes that are just a bit smelly but not really dirty/stained, modern soap is massive overkill. It also doesn't need all this perfume. Your clothes are fine smelling neutral, they don't have to smell like a day in the Alps.
Would really advise to talk to a local maintenance guy, they can probably explain it much better.
I believe most front loaders these days have both a self clean cycle (you're supposed to run it every month or two, it's basically an extremely long hot water rinse+spin that you don't add soap or put clothes in for), and a drain filter that should be accessible near the bottom front (expect black slime if you haven't cleaned the filter recently).
E.g., micro models are just constrained optimization based on the idea of representing preference relations over abstract sets with continuous functions. So obviously, the math is then very simple. This is considered a feature. You can also use more complex math, which helps with certain proofs (especially existence and representation).
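To make "constrained optimization" concrete, here's a minimal sketch of the textbook consumer problem, assuming Cobb-Douglas utility and a linear budget constraint (illustrative numbers, not from any particular textbook):

```python
import numpy as np
from scipy.optimize import minimize

# Textbook consumer problem: maximize U(x, y) = x^a * y^(1-a)
# subject to the budget constraint px*x + py*y <= m.
a, px, py, m = 0.3, 2.0, 1.0, 10.0

res = minimize(
    lambda q: -(q[0] ** a) * (q[1] ** (1 - a)),  # negate to maximize
    x0=[1.0, 1.0],
    bounds=[(1e-9, None), (1e-9, None)],
    constraints=[{"type": "ineq", "fun": lambda q: m - px * q[0] - py * q[1]}],
)

# Analytic Marshallian demand for Cobb-Douglas: x* = a*m/px, y* = (1-a)*m/py
print(res.x)  # ≈ [1.5, 7.0]
```

The solver just reproduces the closed-form demand functions, which is the point: representing preferences with a continuous utility function reduces the whole model to a standard optimization problem.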
You could grab some higher-level math-for-econ textbooks, which typically include the models as examples, and just skip over the math you don't need.
For example, for micro, you can get the following:
I think it treats the typical micro models (up to oligopoly) in the first 50 or so pages while explaining set theory, lattices, monotone comparative statics with Tarski/Topkis, etc.
They are, in fact, not really good academic papers. Finding a clever name and then choosing the most obtuse engineering-cosplay terms does not make a good paper. It's just difficult to read. And so, next, many well-known results get rediscovered to much acclaim in ML and head-scratching elsewhere.
For example, yes, they are kernel matrices. Indeed, the connection between reproducing kernel Hilbert spaces and attention matrices has been exploited to create approximating architectures that are linear (not quadratic) in memory requirements for attention.
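The linear-memory trick is just reassociating the kernel product. A minimal numpy sketch, assuming one common positive feature map (elu(x)+1, as used in some linear-attention papers): both forms compute the same output, but the second never materializes the n×n matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))

def phi(x):
    # A positive feature map, elu(x)+1; one common choice for linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

# Quadratic form: build the full n x n kernel matrix phi(Q) phi(K)^T, normalize rows.
A = phi(Q) @ phi(K).T                          # n x n attention kernel matrix
quad = (A / A.sum(axis=1, keepdims=True)) @ V

# Linear form: reassociate as phi(Q) (phi(K)^T V); only d x d intermediates.
KV = phi(K).T @ V                              # d x d
Z = phi(K).sum(axis=0)                         # d, for the normalization
lin = (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]

assert np.allclose(quad, lin)
```

Memory goes from O(n²) to O(d²) in the intermediate, which is the whole "efficient transformer" pitch.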
Or, as the author of the article also recognizes, the fact that attention matrices are also adjacency matrices of a directed graph can be used to show that attention models are equivariant (or unidentified, as the author says), and are therefore excellent tools for modeling graphs (see: the entire literature on geometric deep learning) and rather bad tools for modeling sequences of text.
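The equivariance claim is easy to check numerically: plain self-attention with no positional information cannot see token order, so permuting the input rows just permutes the output rows. A small sketch (random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
X = rng.normal(size=(n, d))
Wq, Wk, Wv = rng.normal(size=(3, d, d))

def attention(X):
    # Plain softmax self-attention, no positional embeddings.
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)
    return A @ (X @ Wv)

P = np.eye(n)[rng.permutation(n)]  # a random permutation matrix

# Permutation equivariance: attention(P X) == P attention(X).
# The model treats the input as a set (or graph), not a sequence.
assert np.allclose(attention(P @ X), P @ attention(X))
```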
LLMs may or may not collapse to a single centroid if the amount of text data and parameters and whatever else are not in some intricate balance that nobody understands, and so they are inherently unstable tools.
All of this is true.
But then, here is the infuriating thing:
all this matters very little in practice. LLMs work, and on top of that, they work for stupid reasons!
The problem of "identification" was quickly solved by another engineering feat, which was to slap on "positional embeddings". As usual, this too didn't happen because there was a deep mathematical understanding. Rather, it was attempted and it worked.
Or, take the "efficient transformers" that "solve" the issue of quadratic memory growth by using kernel methods. Turns out, in practice, it just doesn't matter. OpenAI, or Anthropic, or Meta simply do not care about slapping on another thousand GPUs. They care about throughput. The only efficiency innovation that really established itself was fusing kernels (GPU kernels, that is) in a clever way to make it go brrrrr. And as clever as that is, there's little deep math behind it.
Results are speculation and empirics.
The proof is in the pudding, which is excellent.
not for long. steam engines existed long before statistical mechanics, but we don't get to modernity without the latter
Trial and error makes the universe go round.
> The problem of "identification" was quickly solved by another engineering feat, which was to slap on "positional embeddings". As usual, this too didn't happen because there was a deep mathematical understanding. Rather, it was attempted and it worked.
Wasn't that tried, because of robotics?
It's a commonly solved issue that the hand of a robot must know each joint's orientation in space. Typically, each joint (a degree of freedom) has a rotary encoder built in. There is more than one type, but the "absolute" version matches the one used in positional embeddings:
(full article: https://www.akm.com/global/en/products/rotation-angle-sensor... )
I find that parallel very fitting, since a positional embedding uses a sequence of sinusoids of increasing frequency. In GPTs with learned positional embeddings (such as GPT-2), where the network is free to use anything it would like, it seems that it actually learns the same pattern as the predefined one (albeit a little bit more wonky).
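For reference, the predefined sinusoidal embedding is a few lines of numpy (following the formulas from "Attention Is All You Need"):

```python
import numpy as np

def sinusoidal_positions(n_pos, d_model):
    # PE[p, 2i]   = sin(p / 10000^(2i/d_model))
    # PE[p, 2i+1] = cos(p / 10000^(2i/d_model))
    pos = np.arange(n_pos)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(128, 64)
# Each sin/cos column pair rotates at a different frequency, much like the
# concentric tracks of an absolute rotary encoder: read together, they
# uniquely identify the position.
```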
The arithmetic intensity of unfused attention is too low on typical GPUs; it's even more of a memory-bandwidth issue than a memory-capacity issue. Just see how much faster FlashAttention is.
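A back-of-envelope version of that argument, with rough fp16 numbers (an assumption-laden sketch, not a profile):

```python
# Arithmetic intensity (FLOPs per byte of HBM traffic) of unfused attention.
n, d, B = 4096, 64, 2                      # seq len, head dim, bytes per fp16

# Step 1: S = Q K^T, materialized in HBM.
matmul_flops = 2 * n * n * d
matmul_bytes = B * (2 * n * d + n * n)     # read Q and K, write S
print(matmul_flops / matmul_bytes)         # ~62 FLOPs/byte

# Step 2: row softmax over S, also through HBM.
softmax_flops = 5 * n * n                  # max, sub, exp, sum, div (roughly)
softmax_bytes = B * (2 * n * n)            # read S, write the probabilities
print(softmax_flops / softmax_bytes)       # ~1.2 FLOPs/byte

# An A100 delivers roughly 312 TFLOP/s fp16 against ~2 TB/s of HBM bandwidth,
# i.e. it needs ~156 FLOPs per byte moved to stay busy. Both steps fall far
# short, so unfused attention stalls on memory; fusing the steps so S never
# round-trips through HBM (FlashAttention) is what fixes it.
```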
I used to be in that boat, now I'm in this boat.
I'm not comfortable sharing many specific details about my business publicly, in this forum. But I am comfortable sharing them with people who are where I was and want to get where I am. I'm happy to. I'm sure you understand.
In service of answering your question anyway: I'm a hardware product design engineer. I can take ownership of an entire complex piece of hardware (medical device, IT product, etc.), architect it, design it mechanically and electrically, and do the systems engineering. I'm able to deliver entire complex hardware products that work well (more than well), can be mass manufactured, and meet cost targets. I've designed surgical robotics systems, artificial hearts and other class III implanted devices, stuff in the Disney parks, Times Square, and even the Smithsonian. I have a website about myself with more information at www.iancollmceachern.com
For an individual partner, we need to get to know each other first, and then that person needs to be willing to only get paid when a project is successful, because otherwise I'd face having to pay a full salary from the start, before we even have the first project.
How does one find such a person?