Since no one specifically answered your question yet: yes, you should be able to get usable performance. A Q4_K_M GGUF of DeepSeek-R1 is 404GB. This is a 671B-parameter MoE that "only" has 37B active parameters per pass. You'd probably expect something in the ballpark of 20-30 tok/s for text generation (depending on how much of the memory bandwidth can actually be utilized).
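Rough sketch of that napkin math (the ~819 GB/s peak MBW figure and ~0.6 bytes/weight for Q4_K_M are assumptions; token generation at bs=1 is treated as purely bandwidth bound):

    # Napkin math, not a benchmark.
    model_size_gb   = 404        # Q4_K_M GGUF of DeepSeek-R1
    total_params_b  = 671        # total parameters (billions)
    active_params_b = 37         # parameters touched per token (MoE, billions)
    bytes_per_param = model_size_gb / total_params_b        # ~0.60 bytes at Q4_K_M
    gb_read_per_token = active_params_b * bytes_per_param   # ~22 GB read per token

    peak_mbw_gb_s = 819          # Apple's quoted M3 Ultra memory bandwidth
    for eff in (0.5, 0.7, 0.9):  # fraction of peak MBW actually achieved
        print(f"{eff:.0%} of peak -> ~{eff * peak_mbw_gb_s / gb_read_per_token:.0f} tok/s")
    # prints roughly 18, 26, 33 tok/s, hence the 20-30 tok/s ballpark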
From my napkin math, the M3 Ultra's TFLOPS figure is still relatively low (around 43 FP16 TFLOPS?), but it should be more than enough to handle bs=1 token generation (which should be way under 10 FLOPs/byte). Now, as far as its prefill/prompt processing speed goes... well, that's another matter.
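Sketch of that arithmetic-intensity argument (rule-of-thumb numbers, not measurements: ~2 FLOPs per active parameter per token, ~0.6 bytes/parameter of weight reads at Q4, and the 43 TFLOPS / 819 GB/s figures from above):

    active_params   = 37e9
    flops_per_token = 2 * active_params            # ~74 GFLOP per generated token
    bytes_per_token = 0.6 * active_params          # ~22 GB of weight reads per token
    intensity = flops_per_token / bytes_per_token  # ~3.3 FLOPs/byte

    fp16_tflops = 43                               # ballpark GPU FP16 throughput
    mbw_gb_s    = 819
    machine_balance = fp16_tflops * 1e12 / (mbw_gb_s * 1e9)   # ~52 FLOPs/byte
    # intensity (~3.3) << machine balance (~52): bs=1 decode is bandwidth bound,
    # not compute bound. Prefill batches many tokens at once, so it flips to
    # compute bound, which is why prompt processing is the weak spot.
    print(intensity, machine_balance)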
I actually think it’s not a coincidence and they specifically built this M3 Ultra for DeepSeek R1 4-bit. They also highlight in their press release that they tested it with 600B class LLMs (DeepSeek R1 without referring to it by name). And they specifically did not stop at 256 GB RAM to make this happen. Maybe I’m reading too much into it.
Pretty sure this has absolutely nothing to do with Deepseek, or even local LLM at large, which has been a thing for a while and has been an obvious use case since the original Llama leak and llama.cpp came around.
Fact is, Mac Pros in the Intel days supported 1.5TB of RAM in some configurations[1], and that was the expectation of their high-end customer base 6 years ago. They needed to address the gap for those customers, so they would have shipped such a product regardless. Local LLM is a cherry on top. Deepseek in particular almost certainly had nothing to do with it. They will still need to double the RAM supported by their SoCs to get there. Perhaps in a Mac Pro or a different quad-Max-glued chip.
I understand why they are excited about it; just pointing out that it is a happy coincidence. They would have, and should have, made such a product to address the needs of plain RAM users, not VRAM users in particular, before they could credibly cut macOS releases for Intel.
RIP Jay Miner who watched his unified memory daughters Agnus, Denise and Paula be slowly murdered by Jack Tramiel's vengeance against Irving Gould. [Why couldn't the shareholders have stormed their boardroom 180 days before the company ran out of cash, installed interim management who, in turn, would have brought back the megalomaniac Founder that would, until his dying breath, keep spreading their cash to the super brilliant geniuses that made all the magic chips happen and then turn the resulting empire over to ops people to make their workplace so uncomfortable they all retire early and live happily ever after on tropical islands and snowy mountain tops?]
Yep! Though one could argue the Amiga wasn't true unified memory due to the chip RAM limitations. Depending on the Agnus revision, you'd be limited to 512K, 1 MB, or 2 MB max of RAM addressable by the custom chips ("chip RAM").
fun fact: M-series that are configured to use more than 75% of shared memory for GPU can make the system go boom...something to do with assumptions macOS makes that can be fixed by someone with a "private key" to access kernel mode (maybe not a hardware limit).
That or it's the luckiest coincidence! In all seriousness, Apple is fairly consistent about not pushing specs that don't matter and >256GB is just unnecessary for most other common workloads. Factors like memory bandwidth, core count and consumption/heat would have higher impact.
That said, I doubt it was explicitly for R1, but rather based on where the industry was a few years ago, when GPT-3's 175B was SOTA but everyone was still looking larger. "As much memory as possible" is the name of the game for AI in a way that's not true for other workloads. It may not be true for AI forever either.
The high end Intel Macs supported over a TB of RAM, over 5 years ago. It's kinda crazy Apple's own high end chips didn't support more RAM. Also, the LLM use case isn't new... Though DeepSeek itself may be. RAM requirements always go up.
Just to clarify. There is an important difference between unified memory, meaning accessible by both CPU and GPU, and regular RAM that is only accessible by CPU.
As mentioned elsewhere in this thread, unified memory has existed long before Apple released the M1 CPU, and in fact many Intel processors that Apple used before supported it (though the Mac pros that supported 1.5TB of RAM did not, as they did not have integrated graphics).
The presence of unified memory does not necessarily make a system better. It’s a trade off: the M-series systems have high memory bandwidth thanks to the large number of memory channels, and the integrated GPUs are faster than most others. But you can’t swap in a faster GPU, and when using large LLMs even a Mac Studio is quite slow compared to using discrete GPUs.
Design work on the Ultra would have started 2-3 years ago, and specs for memory at least 18 months ago. I’m not sure they had that kind of inside knowledge for what Deepseek specifically was doing that far in advance. Did Deepseek even know that long ago?
Don't they build these Macs just-in-time? The bandwidth doesn't change with the RAM, so surely it couldn't have been that hard to just... use higher capacity RAM modules?
"No chance?" But it has been reported that the next generation of Apple Silicon started production a few weeks ago. Those deliveries may enable Apple to release its remaining M3 Ultra SKUs for sale to the public (because it has something Better for its internal PCC build-out).
It also may point to other devices ᯅ depending upon such new Apple Silicon arriving sooner, rather than later. (Hey, I should start a YouTube channel or religion or something. /s)
It's not completely out of the question that the 512gb version of M3 Ultra was built for their internal Apple silicon servers powering Private Compute Cloud, but not intended for consumer release, until a compelling use case suddenly arrived.
I don't _think_ this is what happened, but I wouldn't go as far as to call it impossible.
The scenario is that the 512gb M3 Ultra was validated for the Mac Studio, and in volume production for their servers, but a business decision was made to not offer more than a 256gb SKU for Mac Studio.
I don't think this happened, but it's absolutely not "literally impossible". Engineering takes time, artificial segmentation can be changed much more quickly.
This change is mostly just using higher density ICs on the assembly line and printing different box art with a SKU change. It does not take much time, especially if they had planned it as a possible product just in case management changed its mind.
That's absurd. Fabbing custom silicon is not something anybody does for a few thousand internal servers. The unit economics simply don't work. Plus, Apple is using OpenAI to provide its larger models anyway, so the need never even existed.
My thoughts too. This product was in the pipeline maybe 2-3 years ago. Maybe with LLMs getting popular a year ago they tried to fit more memory, but it's almost impossible to do that so close to a launch. Especially when the memory is fused on, not just a module you can swap.
Your conclusion is correct but to be clear the memory is not "fused." It's soldered close to the main processor. Not even a Package-on-Package (two story) configuration.
I think by "fused" I meant it's stuck onto the SoC module, not part of the SoC die as I may have worded it. While you could maybe still add DRAM packages later in the manufacturing process, it's probably not easy, especially if you need more packages and a larger module, which might cause more design problems. The DRAM sits close because the memory controller is in the SoC, so the memory controller would probably also change with higher memory sizes, which means this couldn't have been a last-minute change.
An M3 Ultra is two M3 Max dies connected via Apple's UltraFusion fabric, so physics.
Did not mean to shit on anyone's parade, but it's a trap for novices, with the caveat that you reportedly can't buy a GB10 until "May 2025" and the expectation that it will be severely supply constrained. For some (overfunded startups running on AI monkey code? Youtube Influencers?), that timing is an unacceptable risk, so I do expect these things to fly off the shelves and then hit eBay this Summer.
Any ideas on power consumption? I wonder how much power that would use. It looks like it would be more efficient than everything else that currently exists.
I would be curious what context window size could be expected when generating a ballpark 20 to 30 tokens per second using DeepSeek-R1 Q4 on this hardware?
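For the capacity side of that question (not the speed side), here's a rough sketch assuming an MLA-style compressed KV cache as described in the DeepSeek-V3 paper (512-dim latent + 64-dim RoPE keys, 61 layers, 2 bytes/element); llama.cpp's actual cache layout may differ and can be considerably larger:

    kv_bytes_per_token = (512 + 64) * 61 * 2      # ~70 KB per token of context
    free_ram_gb = 512 - 404                       # RAM left after the Q4_K_M weights
    max_context = free_ram_gb * 1e9 / kv_bytes_per_token
    print(f"~{kv_bytes_per_token/1024:.0f} KB/token -> ~{max_context/1000:.0f}K tokens of headroom")
    # capacity-wise well past 128K; how fast it stays at long context is a
    # separate (prefill/attention) question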
I think this framing isn't quite right either. DeepSeek's R1 isn't very different from what OpenAI has already been doing with o1 (and that other groups have been doing as well). As for distilling: the R1 "distilled" models they released aren't even proper (logit) distillations, just SFTs, so not fundamentally new at all. But it's great that they published their full recipes, and it's also great to see that it's effective. In fact, we've now seen with LIMO and s1/s1.1 that even as few as 1K reasoning traces can get most LLMs to near-SOTA on math benchmarks. This mirrors the "Alpaca" moment in a lot of ways (and you could even directly mirror, say, LIMO w/ LIMA).
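To make the "SFT, not proper distillation" point concrete, here's a generic PyTorch sketch of the distinction (not DeepSeek's actual training code):

    import torch
    import torch.nn.functional as F

    # Plain SFT: cross-entropy on the teacher's sampled tokens. This is what the
    # released R1 "distilled" models reportedly are: supervised fine-tuning on
    # reasoning traces, with no access to teacher logits.
    def sft_loss(student_logits, target_ids):
        return F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                               target_ids.view(-1))

    # Proper logit distillation: match the teacher's full token distribution via
    # a temperature-softened KL divergence.
    def logit_distill_loss(student_logits, teacher_logits, T=2.0):
        s = F.log_softmax(student_logits / T, dim=-1)
        t = F.softmax(teacher_logits / T, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * (T * T)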
I think the main takeaway of GPT-4.5 (Orion) is that it basically puts the "hit a wall" talk from the end of last year into perspective. Here we have a model that has been trained on, by many accounts, 10-100X the compute of GPT-4 and is likely several times larger in parameter count, but is only... subtly better, certainly not super-intelligent. I've been playing around w/ it a lot the past few days, both with several million tokens' worth of non-standard benchmarks and by talking to it, and it is better than previous GPTs (in particular, it makes a big jump in humor), but I think it's clear that the "easy" gains in the near future are going to come from figuring out how as many domains as possible can be approximately verified/RL'd.
As for the release? I suppose they could just have kept it internally for distillation/knowledge transfer, so I'm actually happy that they released it, even if it ends up not being a really "useful" model.
It's great to see vLLM getting faster/better for DeepSeek. I tested vLLM vs SGLang a couple weeks ago and SGLang's DeepSeek support was much better/faster (on 2 x p5 H100 nodes). It's great that no one's standing still, I saw this recent AMD article that reported SGLang perf on MI300X has increased by 4X over the past couple weeks: https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR...
(w/ the extra memory V3/R1 fits on a single MI300X or H200 node)
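Quick capacity check on that, using public per-GPU memory specs (weights only; KV cache and activation overhead ignored, V3/R1 at 671B params served in FP8):

    weights_gb = 671
    nodes = {
        "8x H100 80GB":    8 * 80,    # 640 GB  -> too small, hence 2 x p5 nodes
        "8x H200 141GB":   8 * 141,   # 1128 GB -> fits
        "8x MI300X 192GB": 8 * 192,   # 1536 GB -> fits comfortably
    }
    for name, cap_gb in nodes.items():
        print(f"{name}: {cap_gb} GB -> {'fits' if cap_gb > weights_gb else 'does not fit'}")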
It'll be interesting to see if either project can take advantage/get any benefits from this FlashMLA implementation.
I've been using GenSpark.ai for the past month to do research (its agents usually run for ~20 minutes, but I've seen them go up to almost 2 hours on a task). It uses a Mixture-of-Agents approach with GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, and searches hundreds of sources.
I reran some of these searches and I've so far found OpenAI Deep Research to be superior for technical tasks. Here's one example:
I've been giving Deep Research a good workout, although I'm still mystified as to whether switching between the different base models matters, besides o1 pro always seeming to fail to execute the Deep Research tool.
Yeah, it seems to not be able to execute the tool calling properly. Maybe it's a bad interaction w/ its own async calling ability or something else (eg, how search and code interpreter can't seem to run at the same time for 4o).
Last fall I built a new workstation with an EPYC 9274F (24C Zen4 4.1-4.3GHz, $2400), 384GB 12 x 32GB DDR5-4800 RDIMM ($1600), and a Gigabyte MZ33-AR0 motherboard. I'm slowly populating with GPUs (including using C-Payne MCIO gen5 adapters), not focused on memory, but I did spend some time recently poking at it.
I spent extra on the 9274F because of some published benchmarks [1] showing the 9274F hitting STREAM TRIAD results of 395 GB/s (on 460.8 GB/s of theoretical peak memory bandwidth). Sadly, my results have been nowhere near that. I did testing with LIKWID, Sysbench, and llama-bench, and even w/ an updated BIOS and NUMA tweaks, I was getting <1/2 the Fujitsu benchmark numbers:
Assuming that you populated the channels correctly, which I believe you did, I can only think that this issue could be related to the motherboard itself or RAM. I think you could start by measuring the single-core RAM bandwidth and latency.
Since the CPU is clocked quite high, the figures you should be getting are, I'd guess, around ~100 ns (probably a bit less) of latency and 40-ish GB/s of BW. If those figures don't match, then it could be a motherboard (HW) issue, a BIOS (SW) issue, or a RAM stick issue.
If those figures closely match then it's not a RAM issue but a motherboard (BIOS or HW) and you could continue debugging by adding more and more cores to the experiment to understand at which point you hit the saturation point for the bandwidth. It could be a power issue with the mobo.
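If you want a second opinion on the single-core number without mlc/likwid, a crude sketch along these lines works (numpy's copy is a single-threaded memcpy; it counts read + write traffic and ignores write-allocate, so treat the result as a ballpark, not a STREAM figure):

    import time
    import numpy as np

    n = 1 << 27                       # 128M float64 = 1 GiB, well past any cache
    src = np.ones(n)
    dst = np.empty_like(src)

    best = float("inf")
    for _ in range(5):
        t0 = time.perf_counter()
        np.copyto(dst, src)           # single-threaded memcpy
        best = min(best, time.perf_counter() - t0)

    gb_moved = 2 * src.nbytes / 1e9   # read src + write dst
    print(f"~{gb_moved / best:.1f} GB/s single-core copy bandwidth")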
Yeah, the channels are populated correctly. As you can see from the mlc-results.txt, the latency looks fine:
mlc --idle_latency
Intel(R) Memory Latency Checker - v3.11b
Command line parameters: --idle_latency
Using buffer size of 1800.000MiB
Each iteration took 424.8 base frequency clocks ( 104.9 ns)
As does the per-channel --bandwidth_matrix results:
I've tried various NUMA configurations (from 1 domain to a per-CCD config) and it doesn't seem to make much difference.
Updating from the board-delivered F14 to the latest 9004 F31 BIOS (the F33 releases bricked the board and required using a BIOS flasher for manual recovery) gave a marginal (5-10%) improvement, but nothing major.
Since I'm not so concerned with CPU inference, I feel like the debugging/testing I've done is... the amount I'm going to do, which is enough to at least characterize, if not fix, the performance.
I might write up a more step-by-step guide at some point to help others but for now the testing scripts are there - I think most people who are looking at theoretical MBW should probably do their own real-world testing as it seems to vary a lot more than GPU bandwidth.
w/ likwid-bench S0:5GB:8:1:2: 129136.28 MB/s. At S0:5GB:16:1:2: 184734.43 MB/s (roughly the plateau; S0:5GB:12:1:2 is 186228.62 MB/s and S0:5GB:48:1:2 is 183598.29 MB/s). According to lstopo, my 9274F has 8 dies with 3 cores each (currently each die is set to its own NUMA domain, i.e. the L3 strategy). In any case, I also gave `numactl --interleave=all likwid-bench -t load -w S0:5GB:48:1:2 -i 100` a spin and topped out in about the same place: 184986.45 MB/s.
Yes, you're correct that your CPU has 8 CCDs, but the bandwidth with 8 threads is already too low. Those 8 cores should be able to get you to roughly half of the theoretical bandwidth; for comparison, 8x Zen 5 cores can reach the ~230 GB/s mark.
Can you repeat the same likwid experiment but with 1, 2, and 4 threads? I'm wondering at what point it begins to deteriorate quickly.
Maybe also worth doing: repeat the 8-thread run but force likwid to pick every third physical core, so that you get a 1-thread-per-CCD experiment setting.
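Something like this would automate the thread-count sweep (a sketch; it assumes likwid-bench is on PATH, that the S0:5GB:&lt;N&gt;:1:2 workgroup string from the runs above is the right one for this box, and that the "MByte/s:" line in the output is parseable this way):

    import re
    import subprocess

    for threads in (1, 2, 4, 8, 12, 16, 24):
        cmd = ["likwid-bench", "-t", "load", "-i", "100",
               "-w", f"S0:5GB:{threads}:1:2"]
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        m = re.search(r"MByte/s:\s+([\d.]+)", out)
        bw_gb_s = float(m.group(1)) / 1000 if m else float("nan")
        print(f"{threads:>2} threads: {bw_gb_s:8.1f} GB/s")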
With `likwid-bench -i 100 -t load -w M0:5GB:1 -w M1:5GB:1 -w M2:5GB:1 -w M3:5GB:1 -w M4:5GB:1 -w M5:5GB:1 -w M6:5GB:1 -w M7:5GB:1` we get 187976.60 MB/s.
Obviously there's a bottleneck somewhere - at 33.5 GB/s per channel you'd expect to get close to 400 GB/s aggregate, but in reality it doesn't even reach half of that. Bad memory controller? A bottleneck on the motherboard? Hard to tell; without swapping hardware, I'm not sure there's much more that can be done to diagnose it.
I see. I'm out of other ideas besides playing with BIOS tweaks for memory and CPU. I can see that there are plenty of them, for better or for worse.
At a quick glance, some of them look interesting, such as "Workload tuning" where you can pick different profiles; there is a "memory throughput intensive" profile. You could also try explicitly disabling the DIMM slots that are not in use, given that you populate only half of them. I wouldn't hold my breath that any of these will make a big difference, but you can give it a try.
Another idea: AFAICS there have been a few memory-bw zen-related bugs reported to likwid and, in particular, https://github.com/RRZE-HPC/likwid/issues/535 may suggest that you could be hitting a similar bug but with another CPU series.
The bug report used AMDuProf to confirm that the bandwidth was actually ~2x what likwid reported. You could try the same.
I have read the R1 paper. My observation is that there is no information whatsoever about how they overcame the limitations of the H800 compared to the H100, which is what the parent article is about. That's the piece I'm curious about.
I will concede that I have not read all their papers or looked through their code, but that's why I asked the question: I hoped someone here might be able to point me to specific places in specific papers instead of an arXiv search.
Give Section 3 of the DeepSeek-V3 paper a read. They discuss their HAI-LLM framework and give a pretty in-depth description of their DualPipe algorithm, including how its pipeline bubbles compare to other pipeline-parallel schedules. They also describe how they work around NVLink limits and tons of other optimizations in extreme depth. The section is 10 pages long, and it's relatively dense, not fluff!
One might argue he's had a pattern for even longer. While he did do some early hypervisor glitching, even his PS3 root key release was basically just applying fail0verflow's ECDSA exploit (fail0verflow didn't release the keys specifically because they didn't want to get sued ... so that was a pretty dick move [1]).
For his projects, I think it's important to look at what he's done that's cool (eg, reversing 7900XTX [2], creating a user-space driver that completely bypasses AMD drivers for compute [3]) and separating it from his (super cringe) social media postings/self-hype.
Still, at the end of the day, here's hoping that someone at AMD realizes that having terrible consumer and workstation support will continue to be a huge albatross/handicap - it cuts them off from basically all academic/research development (almost every ML library and technique you can name or use in production is CUDA-first because of this) and from the non-hyperscaler enterprise market as well. Any dev can get a PO for a $500 Nvidia GPU (or has one in their workstation laptop already). What's the pathway for ROCm? (Honestly, if I were in charge, my #1 priority would be to make sure ROCm is installed and works w/ every single APU shipped, even the 2CU ones.)
Fireworks, Together, and Hyperbolic all offer DeepSeek V3 API access at reasonable prices (and the full 128K context), and none of them will retain/train on user-submitted data. Hyperbolic's pricing is $0.25/M tokens, which is actually pretty competitive with even DeepSeek's "discount" API pricing.
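All three expose OpenAI-compatible endpoints, so trying one is mostly a base-URL change. A hedged sketch (the base URL and model id below are assumptions - check the provider's docs; Hyperbolic shown here):

    from openai import OpenAI

    client = OpenAI(base_url="https://api.hyperbolic.xyz/v1",  # assumed endpoint
                    api_key="YOUR_API_KEY")
    resp = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",                       # assumed model id
        messages=[{"role": "user", "content": "Explain MLA in two sentences."}],
        max_tokens=256,
    )
    print(resp.choices[0].message.content)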
I've done some testing, and if you're inferencing on your own system (2xH100 node, 1xH200 node, or 1xMI300X node), sglang performs significantly better than vLLM on deepseek-v3 (also, vLLM had a stop-token issue for me - not sure if that's been fixed - while sglang did not have output oddities).
But you pay treble damages if you knowingly (vs unknowingly) violate a patent (35 U.S.C. § 284). Of course, everything is patented, so engineers are just told not to read patents.