At 8x86B, it looks like the largest open model yet by far. It would be interesting to hear how many tokens it's been trained on. That's especially important for high-parameter models, in order to utilize all those parameters efficiently.
Considering how poorly it performs compared to other models, it really emphasises how important fine-tuning is. Models with MUCH smaller parameter counts are outperforming it on many metrics.
This seems intentionally obtuse. What you say is true, but it is very obvious that this is much more of a pain than if you had the source code. On the other hand, fine tuning is just as easy, regardless of whether you have the original training data.
I was going to ask for a reference to support this (although please provide one if handy), but the search term “catastrophic forgetting” is a great entry into the literature. Thanks.
One could also disassemble an executable and build on top of it. Not for the faint of heart and probably illegal, but possible unless it was deliberately obfuscated. Compared to that, it is impossible with state-of-the-art methods to systematically extract the training data from an LLM. Fragments, yes, but not all of it.
You can do better: generate synthetic data covering all topics. And to make it less prone to hallucination, use RAG or web search for reference material. The Phi-1.5 model was trained on 300B synthetic tokens generated with ChatGPT, and it showed a 5x bump in efficiency, punching well above its weight.
Synthetic data can be more diverse if you sample carefully with seeded concepts, and it can be more complex than average web text. You can even diff against a garden variety Mistral or LLaMA and only collect knowledge and skills they don't already have. I call this approach "Machine Study", where AI makes its own training data by studying its corpus and learning from other models.
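A hypothetical sketch of the seeded-concept idea described above, in case it helps make it concrete. Everything here is a placeholder (the concept list, the prompts, the model name); it just assumes an OpenAI-compatible Python client and shows how seeding the sampler drives diversity:

```python
# Hypothetical sketch of seeded synthetic data generation, not taken from the comment above.
# Assumes the official `openai` Python client; model name and prompts are made up.
import json
import random

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SEED_CONCEPTS = ["binary search", "photosynthesis", "supply and demand", "TCP handshakes"]
STYLES = ["a worked example", "a short quiz with answers", "an explanation for a beginner"]

def make_sample(rng: random.Random) -> dict:
    # Sampling the concept and style first is what keeps the corpus diverse.
    concept = rng.choice(SEED_CONCEPTS)
    style = rng.choice(STYLES)
    prompt = f"Write {style} about {concept}. Be concrete and correct."
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return {"concept": concept, "style": style, "text": resp.choices[0].message.content}

if __name__ == "__main__":
    rng = random.Random(0)
    with open("synthetic.jsonl", "w") as f:
        for _ in range(10):
            f.write(json.dumps(make_sample(rng)) + "\n")
```

In a real pipeline you'd also pull RAG/web reference material into the prompt, as the parent suggests, and filter out samples the target model already answers correctly.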
Exactly. Training data is key for any trained AI system. And I strongly suspect that every single company active in these modern AI systems is still struggling with how to tune their training data. It's easy to get something out of it, but it's hard to control what you get out of it.
Requiring a company to publish their production database for free is delusional. I haven't mentioned Musk anywhere in my comment; you must be obsessed with him.
That's a subtle dig at the fact that they have all of Twitter as a training corpus to use, but we don't know how they weight tweets - and we know they're not going to be weighted evenly.
And personally I never used Twitter much, but I certainly did not follow Elon Musk when I did - yet I had to see lots of his posts in my feed. Surely just a coincidence.
"They were just tracking how well his tweets were doing versus others. "
Yeah, and adjusting it so he comes out best. That was Musk's demand, as the other article linked inside shows, after a Biden tweet performed better than his:
They officially boost people who pay a little bit. Elon paid a lot.
And the source is clearly not the production source and never was in this shape - otherwise why sue someone who open-sourced it?
"But, the release of this source code also comes days after Twitter forced Github to take down other parts of Twitter's source code that was allegedly posted by a former employee without the company's permission. So, clearly, there's still plenty of Twitter that Musk still doesn't want us to see."
Also, you probably missed that:
"Zoë Schiffer of Platformer reported that Twitter actually removed part of the source code that affected the reach of Musk's and other user's tweets before releasing the algorithm to the public."
Which is consistent with quite a few other statements, including from Twitter itself, and with the fact that the source has not been updated in 8 months.
"But the underlying policies and models are almost entirely missing (there are a couple valuable components in [1]). Without those, we can't evaluate the behavior and possible effects of "the algorithm.""
It's not too hard to believe it is a coincidence when the most followed person on a platform shows up in your feed, especially if you follow tech accounts.
So changes in power users' stats would also result in audience balancing?
Most likely the code was used for analytics and for tracking balance; Elon was a pain in the ass and asked to have custom analytics for his account and devs eventually added him as an audience to be able to get analytics about him easily. A bit dirty but it works.
Most likely the balancing code is somewhere else and it affects only Republicans / Democrats.
For a long time we specified displays by their vertical dimension -- 480p, 720p, 1080p.
Then the marketing guys came along and decided that the horizontal dimension sounds bigger. If we stuck with the less-bullshitty way of doing things and kept comparisons 1:1, we'd call 3840x2160 displays 2160p or "2K" displays, but instead, the marketing people decided that we're going to change things to horizontal and called 3840x2160 "4K".
Has anyone outside of x.ai actually done inference with this model yet? And if so, have they provided details of the hardware? What type of AWS instance or whatever?
I think you can rent like an 8 x A100 or 8 x H100 and it's "affordable" to play around with for at least a few minutes. But you would need to know exactly how to set up the GPU cluster.
Because I doubt it's as simple as just 'python run.py' to get it going.
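For a rough sense of why an 8 x A100/H100 node is the starting point, here is some back-of-envelope sizing math (my own numbers, not from the release, ignoring activations and KV cache):

```python
# Back-of-envelope memory math for hosting a 314B-parameter model.
TOTAL_PARAMS = 314e9
BYTES_PER_PARAM = {"fp16/bf16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in BYTES_PER_PARAM.items():
    weights_gb = TOTAL_PARAMS * nbytes / 1e9
    print(f"{dtype:>9}: ~{weights_gb:,.0f} GB just for the weights")

# 8 x 80 GB GPUs give ~640 GB of VRAM, so bf16 weights (~628 GB) only barely fit
# before accounting for activations and the KV cache; int8 (~314 GB) is more comfortable.
print("8 x 80GB GPUs:", 8 * 80, "GB of VRAM")
```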
If you're just looking to test it out, it's probably easiest to wait for llama.cpp to add support (https://github.com/ggerganov/llama.cpp/issues/6120), and then you can run it slowly if you have enough RAM, or wait for one of the inference API providers like together.ai to add it. I'd like to add it to my NYT Connections benchmarks, and that's my plan (though it will require changing the prompt since it's a base model, not a chat/instruct model).
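Once a GGUF quant does exist (which, as noted, depends on the linked llama.cpp issue landing first), running it locally usually looks something like this sketch via llama-cpp-python. The file name is a placeholder, not a real release:

```python
# Sketch of running a hypothetical GGUF quant locally with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="grok-1-q2_k.gguf",  # hypothetical quantized file
    n_ctx=2048,                     # keep context small; the weights alone need a lot of RAM
)

# Base model, not chat/instruct-tuned: prompt it as a text completion, not a conversation.
out = llm("The three most cited papers on mixture-of-experts are", max_tokens=64)
print(out["choices"][0]["text"])
```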
I'd expect more configuration issues in getting it to run on them than from a tested llama.cpp version, since this doesn't seem like a polished release. But maybe.
I'd be very curious to see how it performs, especially on inputs that are blocked by other models. It seems like Grok will differentiate itself from other open models from a censorship and alignment perspective.
One of the interesting things when weights are open sourced is the community can often improve the results. See all the bugs fixed in Gemma for an example.
Doubtful, for purely information theoretic and memory capacity reasons. It may outperform on some synthetic metrics, but in practice, to a human, larger models just feel “smarter” because they have a lot more density in their long tail where metrics never go
Edit: to include my self-summary after review: there are a good 100 models better than it, even a couple of 1x7B ones. Mixtral stomps it; half the Mixtrals are universally better, and one is close to the same.
No, it's not "mostly worthless" and yes, some of the top models were removed a few months back from being trained on benchmark data.
I urge you to at least think through what alternative you propose before posting so aggressively in these situations. Lmsys doesn't have Grok, or I would have included it. And having _some_ data is better than none.
I also had someone arguing with me 6 months back that we can't trust any benchmarks at all from vendors, which would exclude the blog post. Instead of just repeating that back vehemently, I filled a gap. It's important we don't self-peasantize as a species, all data has its issues, that doesn't mean we throw it all out.
Quantifiable metrics are useful if they're credible, certainly.
But does it seem likely, to you, that a 7B-parameter model would outperform a 314B-parameter model? Given that we can look at the chatbot arena leaderboard and it's dominated by proprietary, 70B and 8x7B models?
A well regarded and modern model like Mixtral 8x7B, which is ranked 13th on the chatbot arena leaderboard, scores 72.7 'Average' on the open LLM leaderboard - and yet 'pastiche-crown-clown-7b-dare-dpo' scores 76.5.
Yup, 100%. Grok isn't very good and it was rushed.
The rest, re: the pastiche model etc., proposes things I'm not claiming, or anything close to what I'm claiming.
n.b. you don't multiply the parameters by the number of experts to get an effective parameter count. Why? Think of it this way: every expert needs to learn how to speak English, so there's a nontrivial amount of duplication among all the experts.
I’ve been known to get snippy on HN from time to time myself :) So please know that I’m only offering a gentle nudge that I’d want from a fellow long-timer myself regarding a line of discussion that’s liable to age poorly.
Talking about sorting hats for those who do and don’t have the one-percenter AI badge isn’t a super hot look my guy (and I’ve veered dangerously close to that sort of thing myself, this is painful experience talking): while there is no shortage of uninformed editorializing about fairly cutting edge stuff, the image of a small cabal of robed insiders chucking in their cashews while swiping left and right on who gets to be part of the discussion serves neither experts nor their employers nor enthusiastic laypeople. This is especially true for “alignment” stuff, which is probably the single most electrified rail in the whole discussion.
And as a Google employee in the diffuser game by way of color theory, you guys have a “days since we over-aligned an image generation model right into a PR catastrophe” sign on the wall in the micro kitchen right? That looked “control vector” whacky, not DPO with pretty extreme negative prompt whacky, and substantially undermined the public’s trust in the secretive mega labs.
So as one long-time HN user and FAANG ML person to another, maybe ixnay with the atekeepinggay on the contentious AI #1 thread a bit?
regardless of whether they say it out loud, it is what many of us think - might be good for people to know why their opinions are getting immediately dismissed by insiders
Letting people know why their opinions are getting dismissed in a productive way is done by citing well-known sources in a low-effort way, or by explaining things thoughtfully in a high-effort way. Karpathy has chosen the highest-effort way of most anyone; it seems unlikely that anyone is at a higher rung of "insiderness" than he is, having been at Toronto with (IIRC) Hinton and Alex and those folks since this was called "deep learning", and having worked at this point at most of the best-respected labs.
But even if folks don't find that argument persuasive, I'd remind everyone that the "insiders" have a tendency to get run over by the commons/maker/hacker/technical public in this business. Linux destroying basically the entire elite Unix vendor ecosystem and ending up on well over half of mobile is a signal example: it came about (among many other reasons) because plenty of good hackers weren't part of the establishment, or were sick of the bullshit they were doing at work all day and went home and worked on the open stuff, bringing all their expertise with them. And what e.g. the Sun people were doing in the 90s was every bit as impressive, given the hardware they had, as anything coming out of a big lab today. I think LeCun did the original MNIST stuff on a Sun box.
The hard-core DRM stuff during the Napster Wars getting hacked, leaked, reverse engineered, and otherwise rendered irrelevant until a workable compromise was brokered would be another example of how that mentality destroyed the old guard.
I guess I sort of agree that it's good people are saying this out loud, because it's probably a conversation we should have. But yikes, someone is going to end up on the wrong side of history here, and realizing how closely all of this is going to be scrutinized by that history has really motivated me to watch my snark on the topic and apologize pretty quickly when I land in that place.
When I was in Menlo Park, Mark and Sheryl had intentionally left a ton of Sun Microsystems iconography all over the place and the message was pretty clear: if you get complacent in this business, start thinking you're too smart to be challenged, someone else is going to be working in your office faster than you ever thought possible.
I have no idea how you've wandered all the way to Napster, Sun, hackers, etc. Really incredible work.
Well, I kind of know: you're still rolling with "this dude's a Google employee", so the guy foaming at the mouth about Google makes sense to you, and now you have to reach for ancient lore to provide grounding for it.
Then don't link to an "About Me" page [1] that says you do? How is confusion on that subject any reader or commenter's fault?
I don't care if you personally work at Google or not, Google got itself in quite a jam as concerns public perception of their product in particular and the AI topic in general by going overboard with over-alignment, everyone knows that so one assumes that insiders know it, which is one of a great many examples of how strongly-forced models are a real problem for arbitrarily prestigious insider-laden labs.
Framing the debate about whether large, proprietary models are over-aligned or mis-aligned as an acid test for whether or not someone is worth paying attention to is really weird hill to stand on.
You're making up a person and being extremely creepy while doing a poor job of it.
It's at least funny, because you're doubling down on OP's bad takes, and embarrassing yourself trying to justify it with what you thought was brilliant research and a witty person-based argument. But you messed up. So it's funny.
Punchline? Even if you weren't wrong, it would have been trivial while doing your research to find out that half of DeepMind followed me this week. Why? I crapped all over Gemini this week and went viral for it.
I guess, given that, I should find it utterly unsurprising you're also getting personal, and clinging to 1% as a class distinction thing and making mental images of cloistered councils in robes, instead of, well, people who know what they're talking about, as the other repliers to you point out.
"1%ers are when the Home Depot elites make fun of me for screaming about how a hammer is a nerfed screwdriver!"
I've been around here a pretty long time, but I could still be off base here: as far as I understood people generally posted links to their own blog [1] in their HN profile because they want people to read them? I read your blog and particularly the posts about Gigadiffusion because I wanted to reply from a position of having put some effort into understanding where the poster I was replying to was coming from before popping off with what could be taken as a criticism. If that offends you or creeps you out I'm more than happy to steer clear of it with the parting remark that I really like Material and had hoped that any follow up would give me the opportunity to compliment you on some nice work.
If that's not your blog, you should probably take it off your profile?
I'm not doing a faux-nice thing with you. You made up an elaborate argument, to justify rank fact-free ranting, based on false information. Thanks for your time.
The safety crap makes the tools unusable. I used to have a test for it that I thought was decent, but Claude failed that test and it is way better than ChatGPT-4 for code, which means my test was bogus. The people actually working in AI are kind of irrelevant to me. It's whether or not the model will solve problems for me reliably.
People "actually working in AI" have all sorts of nonsense takes.
Another day, another fairly good comment going grey on an AI #1. Over-alignment is really starting to be the dominant term in model utility; Opus and even Sonnet are outperforming both the 1106-preview and the 0125-preview on many coding tasks, both subjectively and on certain coding metrics, and we are seeing an ever-escalating set of kinda ridiculous hot takes from people with the credentials to know better.
Please stop karma bombing comments saying reasonable things on important topics. The parent is maybe a little spicy, but the GP bought a ticket to that and plenty more.
I was trying to be helpful. I've made elitist remarks on HN that were dubious in at least two ways: it was dubious if I was actually all that elite, and it was dubious if any amount of being elite justifies or makes useful a posture of elitism. My internal jury is still out, but as of writing I think I probably underestimated how unique my first-hand knowledge and contributions were, but more than made up for that by the claims exceeding the reality by a huge margin, for a massive net loss that made me wish I could take the remarks back.
I click on every HN username I reply to, because I've been hanging out here for like 16 years and more than once I've mouthed off only to later realize it was about C++ to Walter Bright or something, and looked a fool as a result. I've since apologized to Walter for disrespecting a legend and he was very gracious about it, to cite just one example.
Your initial remark wasn't even that bad, certainly others talk that way, and I tried to frame it accurately as one guy who tends to FAANG-flex carelessly rather than thoughtfully to another guy who probably doesn't talk to people like that face to face and is probably a pretty good guy having a tough day. I was trying to say: "been there, maybe cool it man you're probably going to have the same bad time I've had on this sort of thing".
But this is getting to where I'm starting to lose my temper a bit, I've been pretty cool about this. I even went and read the Dart/`llama.cpp`/`ONNX` stuff because I've also messed around with binding to `llama.cpp` and `whisper.cpp` and stuff just to make sure I'm not mouthing off to Jeff Dean's alt or something. I'm not talking to Jeff Dean.
I surf with `showdead` on, and I don't know the current meta so I don't know if you know that you've been flagged dead 3 times on this subthread already and as much as I'd like to, I can't really argue with any of the 3.
But given that you've clearly got similar interests, and therefore probably things that you could teach me if I were willing to listen, I'm going to propose a do-over.
If you'd like to start this over from a place of mutual interest and write this thread off to "a pair of people had bad vibes on an Internet forum once", email me at `b7r6@b7r6.net`.
If not, no hard feelings, but in that case, let's just give one another a wide berth and call it a day.
But the widespread popularity of ChatGPT and similar models shows that it isn't a serious impediment to adoption. And erring on the side of safety comes with significant benefits e.g. less negative media coverage, investigations by regulators etc.
Seems like marketing and brand recognition might be some confounding variables when asserting ChatGPT's dominance is due to technical and performance superiority.
The 1% who actually work on AI don't use terms as generic as "AI". Way to reveal yourself as a college undergrad who read a couple of popular science books, downloaded the MNIST data, and thinks they're an "expert".
(Not sure you're going to edit again, but in the current one, I'm not sure what Google's silly stock-image warning has to do with anything, and I have generally chosen to avoid engaging people doing their politics hobby via AI discussion, since doing so became okay across the ideological spectrum of my peers. So, mu is my answer.)
And you're right, I was really surprised to see the harder right people throwing up their hands after the Gemini stuff.
Feel free to explain! You caught my attention now, I'm very curious why it's on topic. Preregistering MD5 of my guess:
7bfcce475114d7696cd1d6a67756761a
No I didn't, at least, I don't think it did but it does sound exactly like me. But then again, I don't know what it'd have to do with anything you said specifically.
Isn't this Apache licensed? Regardless, you can run multiple models concurrently on the same input using well-known ensemble techniques. (Not to be confused with mixture-of-experts, which is more like training a single model where only a few blocks are chosen to be active at any given time - a kind of sparsity.)
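A minimal sketch of the kind of ensemble technique meant here (simple majority voting over independently run models, not MoE routing). The models are stand-in callables; a real setup would wrap whatever inference backends you use:

```python
# Minimal sketch of a voting ensemble: every model sees the same input, most common answer wins.
from collections import Counter
from typing import Callable, List

def ensemble_vote(models: List[Callable[[str], str]], prompt: str) -> str:
    answers = [m(prompt) for m in models]           # run every model on the same input
    winner, _ = Counter(answers).most_common(1)[0]  # majority vote over the outputs
    return winner

# Usage with dummy models standing in for real inference backends:
models = [lambda p: "42", lambda p: "42", lambda p: "41"]
print(ensemble_vote(models, "What is 6 * 7?"))  # -> "42"
```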
Can someone explain why the weights are posted via a Bittorrent magnet link? I have no way to check the size at the moment, but isn't that a bit unusual? There's also only 21 seeders right now according to https://checker.openwebtorrent.com/
It's been criminalized to hell by IP holders and Hollywood. Such a shame they killed the best tech of the previous decade. It could have revolutionized how we distribute content, approach CDNs, and even streaming.
Neither of those are against the bittorrent protocol itself. Lots of software like Ubuntu is legally available on bittorrent, and I've never seen anything done to restrict that.
It may become a tradition since weights are so large. Perhaps it started when the Llama torrent link leaked. Then, Mistral decided to release their weights using bittorrent.
I would have assumed they could just upload it to GitHub. If it has restrictions on file size, I'm sure they could make multi-part compressed files.
Torrents can unfortunately die after a period of time if no one continues seeding it or if they don't use a permanent web based seeder, which doesn't appear to be the case.
Soft size limit means "If your repository excessively impacts our infrastructure, you might receive an email from GitHub Support asking you to take corrective action." - I know people who have received such emails.
Most model releases happen through Hugging Face which does not have such a size limit.
The great thing about torrents is that you (or anyone else who cares) can single-handedly solve the problem you're complaining about by seeding the torrent.
No, git would be impossible. I've never seen a repo even a few GB in size; if you are uploading non-code files you really should not be using git. Git is version management software for code. I often see repos with images and even videos checked in - please don't, there are so many far better and more performant solutions out there.
The other approach would be to use AWS S3 or another cloud provider, which would cost them money every time someone downloads it - something they have no obligation to pay for when they are releasing it for free. Torrents seem like the only good solution, unless someone hosts this on the cloud for free for everyone.
Interesting, I had no idea git had a VFS or that MS runs a monorepo. I guess git is much more capable than I thought, but the average user really should just be uploading code to GitHub.
> No, git would be impossible. I've never seen a repo even a few GB in size; if you are uploading non-code files you really should not be using git
It's not actually a limitation in git itself, especially if you use Git LFS. People use Git for Unreal projects and big ones can be half a terabyte or more in size.
Others have pointed out that GitHub doesn't allow that, but
> Torrents can unfortunately die after a period of time if no one continues seeding it or if they don't use a permanent web based seeder, which doesn't appear to be the case.
So too can web links, especially when they are 300 GB and egressing out of AWS at $0.09/GB or worse (in non-US regions). Each full download would cost $27 at that rate; 10,000 downloads would cost $270,000.
Sure you could go for something with a better cost model like R2, but you can't beat using one or two unmetered connections on a VPN to constantly seed on Bittorrent, your pricing would be effectively free and reliability would be higher than if you just exposed a HTTP server on the Internet in such a way.
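Spelling out the egress arithmetic from above (the AWS rate is the figure from the comment, not an official quote):

```python
# The egress arithmetic from the comment above, spelled out.
size_gb = 300           # approximate size of the Grok-1 torrent
egress_per_gb = 0.09    # USD per GB out of AWS, per the comment
downloads = 10_000

per_download = size_gb * egress_per_gb
print(f"~${per_download:.0f} per full download")                         # ~$27
print(f"~${per_download * downloads:,.0f} for {downloads:,} downloads")  # ~$270,000
```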
There are a lot of seeders on the torrent that are actually AWS IPs too, all with similar configurations, which makes me believe that it's probably xAI running them.
It wasn't much of a leak. Facebook was pretending to keep it private for PR reasons but putting approximately zero effort into actually keeping it private.
I don't understand why you're being downvoted for asking a legitimate question. People not familiar with model weights might be surprised that they are often in tens of gigabytes and in this case even more.
Not an expert by any means, but I like learning about this stuff and I play with a lot of open weight models.
I’d say the significance is that it happened. It’s by far the largest open weight model I’ve seen. But I’m not sure why you’d use it over a model like Mixtral, which seems to perform about the same at like 1/6th the size.
But I welcome any contribution to the open weight LLM community. Hopefully people will learn something interesting with this model. And I hope they keep releasing new versions!
You're right, this model is going to be too big for most people to play around with. But to answer your question, I have 128GB of RAM in my M3 MacBook Pro, so I can use most of that for GPU inferencing. But still, this model is going to need to be heavily quantized for me to be able to use it. (fwiw, I probably won't try this one)
In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it. I suspect my computer might be able to run a 3 bit quant, but it might need to go down to 2 bits to have any kind of reasonable context length. But with quants that small I'd expect the model's performance to degrade well below that of Mixtral, so it probably isn't really even worth using. But we'll see; quantization is weird, some models perform better than others when quantized.
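A rough check of the quant sizes being discussed here (my own arithmetic; it ignores per-block quantization overhead and the KV cache, so real GGUF files come out somewhat larger):

```python
# Rough weight sizes for a 314B-parameter model at different quantization levels.
PARAMS = 314e9
for bits in (4, 3, 2):
    gb = PARAMS * bits / 8 / 1e9
    print(f"{bits}-bit: ~{gb:.0f} GB of weights")

# On a 128 GB Mac with most of the RAM available to the GPU, ~118 GB for a 3-bit quant
# leaves very little headroom for context, which is why 2-bit (~79 GB) may be needed.
```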
>In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it.
How quickly are new models available through Ollama?
Ollama is just a wrapper around llama.cpp, so when the gguf model files come out it'll be able to run on Ollama (assuming no llama.cpp patch is needed, but even if it is ollama is usually good at getting those updates out pretty quickly).
Thanks a lot for the hint :)! It's awesome that it might run even on a MacBook; actually this is a reason to switch to Mac. It seems there is nothing similar for a PC laptop with Linux or Windows.
No problem. I hope more people try these things out, it's the best way to push the industry forward! We can't let the researchers have all the fun.
Apple had plenty of reasons to move forward with their Apple Silicon CPUs and GPUs in the mac, but they really did seem to get lucky with the unified memory architecture. It was kind of just an artifact of their design, but ends up serving the needs of deep neural net models really well!
MoE doesn't really help with the memory requirements for the reason mentioned in the other comment. But it does help with reducing the compute needed per inference. Which is good because the M3 Max and M2 Ultra don't have the best GPUs. A 70B parameter model is pretty slow on my M3 Max, and this model has 86B active parameters per forward pass.
Each token generated may only use a subset of the parameters (86 billion instead of 314 billion), but the next generated token might use a different subset. If it's anything like Mixtral, it will switch between experts constantly. It helps with memory bandwidth, but all the parameters still need to be in RAM or it would be unbearably slow.
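A toy sketch of top-2 MoE routing to illustrate the point above: every expert has to stay resident in memory, but each token's compute only touches the two experts the router picks. The dimensions here are tiny and illustrative, nothing from the actual Grok architecture:

```python
# Toy top-2 mixture-of-experts layer (illustrative numbers only).
import numpy as np

n_experts, d = 8, 16
experts = [np.random.randn(d, d) for _ in range(n_experts)]  # all 8 must be resident in RAM
router = np.random.randn(d, n_experts)

def moe_layer(x: np.ndarray) -> np.ndarray:
    scores = x @ router
    top2 = np.argsort(scores)[-2:]                 # a different pair may win for the next token
    weights = np.exp(scores[top2]) / np.exp(scores[top2]).sum()
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top2))

token = np.random.randn(d)
print(moe_layer(token).shape)  # (16,) -- compute touched 2 of 8 experts, memory holds all 8
```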
OpenAI is valued at 90 billion and all they do is make GPT; Twitter is valued at 40 billion and this was essentially a vanity side project by a cowboy CEO. Presuming that the benchmarks and the general "it's about the level of 3.5" take are accurate, it's inefficient, but not incredibly inefficient imho.
They weren't even worth 44B when Elon took the keys - he specifically tried to back out of the deal because 44B was an insane peak-'21 asset-bubble price. In truth they were probably worth more like 10-15B at that moment. And now that a bunch of advertisers have left due to you-know-who, it's probably about 10B.
Twitter was valued around 30 billion when Musk tried getting out of buying it (then the market cap went up when it became clear that he would be forced to pay full price).
Since it is MoE, quantized it could be able to run on cheaper hardware with just consumer networking in between, instead of needing Epyc/Xeon levels of PCIe lanes, NVLink, or InfiniBand-type networking. Or it could even run with people pooling smaller systems over slow internet links.
Nothing about using data in "real time" dictates that the model parameters need to be this large, and it is likely quite inefficient for their "non-woke" instructional use case.
Agreed. We have been building our real-time GPT flows for news & social as part of Louie.AI - think monitoring & investigations... Long-term, continuous training will become amazing, but for the next couple of years, most of our users would prefer GPT4 or Groq vs what's here, plus much smarter RAG. More strongly, the interesting part is how the RAG is done. Qdrant is cool but just a DB with a simple vector index, so nothing in Grok's release is tech we find relevant to our engine.
Eg, there is a lot of noise in social data, and worse, misinfo/spam/etc, so we spend a lot of energy on adversarial data integration. Likewise, queries are often neurosymbolic, like over a date range or with inclusion/exclusion criteria. Pulling the top 20 most similar tweets to a query and running them through a slow, dumb, & manipulated LLM would be a bad experience. We have been pulling in ideas from agents, knowledge graphs, digital forensics & SNA, code synthesis, GNNs, etc. for our roadmap, which feels quite different from what is being shown here.
We do have pure LLM work, but more about fine-tuning smaller or smarter models, and we find that to be a tiny % of the part people care about. Ex: Spam classifications flowing into our RAG/KG pipelines or small model training is more important to us than it flowing into a big model training. Long-term, I do expect growing emphasis on the big models we use, but that is a more nuanced discussion.
(We have been piloting w gov types and are preparing for next cohorts, in case useful on real problems for anyone.)
> The cover image was generated using Midjourney based on the following prompt proposed by Grok: A 3D illustration of a neural network, with transparent nodes and glowing connections, showcasing the varying weights as different thicknesses and colors of the connecting lines.
How is it that OpenAI was touted like it was some massive years-long effort that blew all AI research out of the water and now we have so many competitors popping up one after another?
You don't need to be a cutting edge research scientist to train a SOTA LLM. You just need money for scaling. OpenAI's "secret" was just their willingness to spend tens/hundreds of millions without guaranteed returns, and RLHF/instruct fine tuning, both of which are out of the bag now.
Disagree. It took more than 12 months from the release of GPT-4 to someone else producing a model of equivalent quality, and that definitely wasn't due to a shortage of investment from the competition.
There's a huge amount of depth in training a really good LLM. Not helped by the fact that iteration is incredibly expensive - it might take several months (and millions of dollars) before you can tell if your new model is working well or if there was some mistake in the pipeline that lead to a poor quality result.
Almost all of the world-class LLMs outside of OpenAI/DeepMind have been trained by people who previously worked at those organizations - giving them invaluable experience such that they could avoid the most expensive mistakes while training their new models.
Don’t overlook the training data (used for both training and instruction fine-tuning), it is one of the most crucial aspects, if not the most critical, given the significant differences observed in models with similar architectures.
While I do agree there is some amount of secret sauce, keep in mind the training takes several months. So going from seeing the success of GPT-4, deciding to invest that amount of money, raising the money to train the model, finding someone competent to supervise the training, training the model for several months, then testing and integrating it could easily take a year even if there were no secret sauce.
That only remains an advantage if they can continue climbing the gradient from their lead position. If they hit a snag in scaling, methodology, or research, everyone else on the planet catches up, and then it's anyone's game again.
When it comes to LLMs, metrics are misleading and easy to game. Actually talking to it and running it through novel tasks that require ability to reason very quickly demonstrates that it is not on par with GPT-4. As in, it can't solve things step-by-step that GPT-4 can one-shot.
This was exactly my experience. I have very complex prompts and I test them on new models and nothing performs as well as GPT-4 that I've tried (Claude 3 Opus included)
LLM training is arcane and expensive to experiment with. So OpenAI had to waste a lot of time and GPU-hours on things that didn't work to learn the tricks that did work.
Most of the competitors have lineage straight back to OpenAI, eg the lead of x.ai was previously at OpenAI and Deepmind. Likewise with Mistral and especially Anthropic.
OpenAI still seems to be at the top, except for Anthropic, who may be close, comparing the capabilities of gpt-4 and claude-opus.
This Grok-1 is a large model (~314B) that matches gpt-3.5, released 2 years ago, and is at about the same level as much smaller models like Mixtral (~47B) and Qwen-1.5 (~72B). Do you think it's competitive?
Also, the general architecture is well documented, ChatGPT (specifically the chat interface, not GPT-3, not InstructGPT) is what made a lot of people care, and actually reproducing it requires someone wanting to in the first place.
I think this paragraph from an earlier Wired article [1] sums it up pretty well:
"After suing OpenAI this month, alleging the company has become too closed, Elon Musk says he will release his “truth-seeking” answer to ChatGPT, the chatbot Grok, for anyone to download and use."
In machine learning models the term open source has been largely accepted to mean sharing weights and, if necessary, inference code. You can argue if this is an abuse of the term but everyone does it, and saying someone didn’t deliver if they used it and published weights would probably mean saying the same about mistral, meta, etc.
I get the "open source" argument, but what is the issue here?
If you are able to reproduce the thing in its entirety and you're given no restrictions on its use, it seems compatible with the spirit of open sourcing things.
We may have already - data is more important than anything else which is why nobody has beat GPT4 yet. Throwing more parameters or more compute at the problem only gets you so far. But Grok was never a contender so there is room to improve on it. It is one of the biggest models open sourced as mentioned, so will be interesting to take a look at for sure.
A number of AI companies have a naming/reproducibility issue.
GPT-4 Turbo, released last November, is a separate version that is much better than the GPT-4 released in March 2023 (winning 70% of human preferences in blind tests).
Claude 3 Opus beats release-day GPT-4 (winning 60% of human preferences), but not GPT-4 Turbo.
In the LMSys leaderboard, release-day GPT-4 is labeled gpt-4-0314, and GPT4 Turbo is labeled gpt-4-1106-preview.
Many, if not most, users intentionally ask the models questions to tease out their canned disclaimers: so they know exactly which model is answering.
On one hand it's fair to say disclaimers affect the usefulness of the model, but on the other I don't think most people are solely asking these LLMs to produce meth or say "fuck", and that has an outsized effect on the usefulness of Chatbot Arena as a general benchmark.
I personally recommend people use it at most as a way to directly test specific LLMs and ignore it as a benchmark.
I don't know if Claude is "smarter" in any significant way. But it's harder working. I can ask it for some code, and I never get a placeholder. It dutifully gives me the code I need.
I've found it to be significantly better for code than GPT-4 - I've had multiple examples where the GPT-4 solution contained bugs but the Claude 3 Opus solution was exactly what I wanted. One recent example: https://fedi.simonwillison.net/@simon/112057299607427949
How well models work varies wildly according to your personal prompting style though - it's possible I just have a prompting style which happens to work better with Claude 3.
> according to your personal prompting style though
I like the notion of someone's personal prompting style (it seems like a proxy for being able to prepare a question with context about the other party's knowledge) - that's interesting for these systems in future job interviews.
What is your code prompting style for Claude? I’ve tried to repurpose some of my GPT-4 ones for Claude and have noticed some degradation. I use the “Act as a software developer/write a spec/implement step-by-step” CoT style.
I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.
> I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.
It's very contextually dependent. You really have to test things like this for your specific task, with your specific model, etc. Sometimes it helps, sometimes it hurts, and sometimes it does nothing at all.
I've found it significantly better than GPT4 for code and it's become my go-to for coding.
That's actually saying something, because there are also serious drawbacks:
- Feels a little slower. Might just be UI
- I have a lot of experience prompting GPT4
- I don't like using it for non-code because it gives me too much "safety" pushback
- No custom instructions. ChatGPT knows I use macOS and zsh and a few other preferences that I'd rather not have to type into my queries frequently
I find all of the above kind of annoying and I don't like having two different LLMs I go to daily. But I mention it because it's a fairly significant hurdle it had to overcome to become the main thing I use for coding! There were a number of things where I gave up on GPT then went to Claude and it did great; never had the reverse experience so far and overall just feels like I've had noticeably better responses.
There is no reason to believe GPT-4 had more (or higher quality) data than Google et al. have now. GPT-4 was entirely trained before the Microsoft deal. If OpenAI could pay to acquire data in 2023, >10 companies could have acquired similar quality data by now, and no one has shipped a similar quality model in a year.
The key questions are around "fair use". Part of the US doctrine of fair use is "the effect of the use upon the potential market for or value of the copyrighted work" - so one big question here is whether a model has a negative impact on the market for the copyrighted work it was trained on.
I don't think the New York Times thing is so much about training as it is about the fact that ChatGPT can use Bing, and Bing has access to New York Times articles for search purposes.
Having used both Google's and OpenAI's models, the kind of issue they have are different. Google's models are superior or at least on par in knowledge. It's the instruction following and understanding where OpenAI is significantly better. I don't think pretraining data is the reason of this.
How long before the Groq team sues for trademark violation? It's literally the purpose of trademark laws to make sure resembling names do not cause confusion in the mind of customers so it would be very surprising to see this situation persist.
It's easier to get a trademark on an altered word than on a plain dictionary word. Just acquiring the easier one doesn't mean you now have rights over the harder one, though eventually, after enough market recognition, you might be given some control over other people using the common one. I wouldn't think Groq is there yet.
I myself have never heard it outside of "nerdy" circles... that is: people who would read science fiction.
I personally am not entirely happy about the word (no matter how it is spelled) being used for a particular AI product. "Grok" to me means knowing a subject at a much deeper level than I think any AI is capable of at the present level of technology.
But it would be passable to use it for a companyname, to indicate that it is a goal to strive for.
Generally agree, though I would say "knowing a subject at a much deeper level than any LLM is capable of", as AI more broadly also includes specialist models that are wildly super-human in narrow domains like chess and Go.
I use Grok all the time to find tweets or ask about trends on Twitter. For that it's better than what used to exist. But it's not a great model outside that narrow use case.
tbh, I've never seen anyone share anything interesting produced by Grok. I see plenty of posts on X and reddit of people sharing amazing things that GPT-4 and now Claude 3 Opus can do. Grok can roast people. That's pretty much all I've seen.
I'd love to be proven wrong if someone cares to share something interesting produced by Grok.
If you can't rebuild it, then how can you be considered to have the "source code" ?
The training data isn't a dataset used at runtime - it's basically the source code to the weights.
Not sure it really matters here though (who has the GPUs and desire to retrain Grok?), but just as a matter of definition "open weights" fits better than "open source".
A big catch here is that you can't slap an open source license on a bunch of copyrighted training data, and to date no one has created a truly convincing LLM trained exclusively on public domain data. It might happen soon though - there are some convincing efforts in progress.
Many people think these companies are training on unlicensed data but I think OpenAI licenses their data, they just “license” it the way one would need to in order to read it.
It's not a question of if, but when, the cat gets out of the bag and the legal battle starts. The problem is that copyright applies to the expression, not the factual information it expresses (in this case word relations). Now "how math works" and "the language of the law" are going to make for an interesting court case. I suspect that math wins here, but it depends on what judge gets it and how high it goes.
They've just started to (in response to lawsuits, it must be noted) and in the meantime, they're simultaneously claiming that (1) what they're doing is fair use (a.k.a. fair dealing) and (2) preparing for the day when courts confirm that it isn't.
Come on, that's not reasonable to expect from a company, or useful for indie hackers. Having weights that can be used however you like is enough for most people, even large companies.
Maybe it should be called something else? "Openly-licensed"?
Just because the model weights are not really "source" (either as a matter of intuition or for example following the OSI "preferred form in which a programmer would modify the program" definition).
Sure, but I don't want to train anyone's model from scratch. Realistically, I can't download all the training data, or run the pipeline, or train the model. Making all of that available to me would be a massive burden on the company too, so they simply won't do it. If I'm able to fine-tune it, that's enough for me, and imo, that fits with the spirit of open/free software. We have to understand that this is fundamentally a different thing than something like the Linux kernel, and closer to something like an industrial project. The output is just a bunch of numbers instead of something physical.
The Open Source Initiative is actively working on this over the course of this year, and your input will help define that meaning! Please see here for more:
Yeah, Musk said "all design and engineering for the original roadster is now open source" and actually what we got was a few PCB files and zero mechanical design files, so I don't ever trust what he says.
"The implementation of the MoE layer in this repository is not efficient. The implementation was chosen to avoid the need for custom kernels to validate the correctness of the model."
Or perhaps release your actual code AND the simplified implementation instead of hiding it and saying "you don't know her, she goes to a different high school"
Not just someone but the CEO of the company.
He used HIS platform to say "This week, @xAI will open source Grok" (https://twitter.com/elonmusk/status/1767108624038449405) and they aren't doing that. What they delivered specifically says "We are releasing the base model weights and network architecture of Grok-1, our large language model."
I think everyone should realize the following realities of the LLM market:
1. For sub-SOTA LLMs, distribution/marketing is more important than having a proprietary lock on capabilities. Open sourcing is a benefit for the firm, distinct from goodwill
2. For SOTA LLMs, keeping it closed and proprietary is the strategic play
If Grok were SOTA, Elon never would have open sourced it. It's not even SOTA within xAI. This is a marketing play to win public sentiment against OpenAI.
I recall Elon saying something like this in an interview, so I think it's less of a deceptive take than perhaps your comment suggests.
I think he said something like proprietary AI tech is going to be one year to 18 months ahead of where open source tech is which will follow on like one year to 18 months later.
Suggesting that he’s aware of this dynamic and he’s not trying to conceal or misrepresent that.
In other words, perhaps this was SOTA one year to two years ago?
Which is correct. The point I'm going for is not against Elon but against his obedient fans and knee-jerk OpenAI haters who claim that they should, by natural obligation, do the "right thing" and open source all their models, and that Elon open sourcing Grok is him "leading by example" and being the hero that OpenAI can't be.
Interesting. That point didn't come across in your original comment. I recommend you state it next time at the end. Oftentimes, stuff that seems obvious to us / yourself / people who know about something can go unstated in what we say that otherwise references the specific points at hand, omitting the general but enlightening/useful perspectives/priors which it would be good to share.
This is not only for you specifically just a general reminder for all of us including me.
I think that's true though my original comment I feel was sufficient in its claim and implicit assumptions.
Basically I feel people's feelings about Elon vary a lot but are anchored by 3 general categories.
> 1. Elon Musk is a messianic savior who is perfectly selfless and always does the right thing. Every business decision he makes is for the maximal good of humanity
> 2. Elon Musk is a typical CEO who does typical CEO things, serving his own interests, except he's better at marketing his own image and is much more outspoken
> 3. Elon Musk is an irredeemable evil who always does objectively wrong things
My first comment was implicitly addressed to people in the 1 camp trying to bring them into the 2 camp (which is where I am).
Alright, it just didn't come across for me, haha! :) I guess sometimes those implicit assumptions really are too implicit! I think it's good to err on the side of expressing them, because you can't assume someone else thinks the same way you do. That's what I've learned anyway. Hahahaha! :)
Reading your comment again with your explanation it is clear that's what you're doing.
Although, regarding your desire to present a balanced view and to persuade, I have an idea. It probably sounds like I have no idea what I'm talking about, but I think your OG comment would perhaps benefit from sounding a little bit more friendly toward Elon (not to the messianic-savior level, haha); the way it sounds to me is that Elon is being deceptive here and presenting this as goodwill when it's not.
However, I think the truth is there's a little bit of both, right? There's good will but it's also strategic. I get if you don't think so, tho, no worries! Haha! :)
Your OG comment sounds to me like Elon's just Machiavellian, and I get where you're coming from to remind the people who think he's a savior, but if your point is not to go "against Elon" as you said, it might be good to acknowledge the good that he does.
At least, that way -- whether or not you believe that acknowledgment -- if you hope to bring over people who think that way, you'll probably need to appeal to how they think, rather than just dose them with the truth you see, because then they'll shut it out, if there's nothing they can relate to.
Although, if I haven't convinced you even a bit here, then maybe you shouldn't listen to me about persuasion because I guess I don't know how to do this myself. At least not effectively, or here with you. Haha!:) But if you do feel a little bit convinced then maybe consider it for next time to help your persuading people back to a more balanced view? :)
But then, there's the question of whether such a thing is even possible. If people have a particular view, it could be challenging to change it, as confirmation bias means they'll ignore evidence even when it would expand their worldview.
Hahaha! :) This was a funny conversation. I think we somehow skirted around the important point tho that OpenAI could in fact open source some of its older models, could it not? Musk is a typical CEO who does typical CEO things, serving his own interests, except he's better at marketing his own image and is much more outspoken, but there might also be a bit of truth to what the fanboys say about OpenAI in that it seems they do have some room to "open source" their non-SOTA stuff, or what am I missing?
I would argue that there's no bar for open sourcing aside from "do you have the rights to do so." Some source or some public good is certainly better than none, and when the bar is low then you remove barriers to getting started, vs waiting until you have the time someday to "do it right."
I've been in the open source community for about 25 years so I doubt it.
For what it's worth I would say a model should be fully reproducible to be open source, but that's not a decided consensus -- and AI models are sufficiently different than the source code / binary code distinction as to invoke discussion around defining it.
In all the debate about open source, I don't think people realize that this model is most likely not reproducible ever again, even given the code. Here's what you need to reproduce the model:
1. An exact snapshot of the data used. Many companies don't have this; you have rough dataset versions, but remember that if even 1 token is different, the model produced won't be the same.
2. Data must be sent to the training algorithm in the exact same order as it was originally, so every data loader needs a fixed random seed.
3. All the probabilistic parts of your model need to have a fixed random seed. Here I'm thinking of stuff like dropout, and for autoregressive models you might be sampling your previous output, so you have to ensure those are properly seeded. Generally you do see fixed seeds in academic papers, but it's easy to miss stuff, especially in distributed training jobs.
4. Here's another interesting thing: you start your training job on 1000 GPUs and then suddenly 4 GPUs fail. What do you do? There might be deterministic ways to solve this, but the standard approach is to discard all updates those GPUs were going to do and restart them from scratch. You can see why this is a problem: now, if you want to reproduce this training, you need to disable those GPUs at the same point in the new training job to make it work.
I suspect there are even more things I didn’t think of that will make this model unique and irreproducible by training for eternity, almost like a human brain?
In fact the notion of exact reproducibility in the world of LLMs is silly; there is only approximate reproducibility (models with similar scores on benchmarks), nothing exact. That said, I can see the value of releasing source code, but I'm completely fine with Grok not releasing it. Source code can reveal tricks that haven't been published in papers yet that a company discovered to improve their model. Seeing the performance of Grok, I'm pretty confident there aren't any great tricks to be found in their code, so I don't really care; I would be pretty curious about OpenAI's or Anthropic's source code though.
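For concreteness, a minimal sketch of the seeding discipline from points 2 and 3 above, in PyTorch-style code. Even with all of this, the multi-GPU failure scenario from point 4 and non-deterministic kernels can still make runs diverge:

```python
# Minimal seeding sketch for reproducible-ish training (PyTorch-style).
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)                      # covers CPU and current CUDA device RNGs
torch.use_deterministic_algorithms(True)     # error out on known non-deterministic ops

# Seed the data loader's shuffling so examples arrive in the same order every run (point 2).
dataset = TensorDataset(torch.arange(1000).float().unsqueeze(1))
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    generator=torch.Generator().manual_seed(SEED),
)

first_batch = next(iter(loader))[0]
print(first_batch[:4].flatten())  # identical across runs with the same seed
```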
Which is why I don't buy into the "LLMs don't have personal opinions" schtick. Each LLM, by virtue of the factors you've mentioned, will have its own unique 'perspective', if you will, on a variety of topics. I think it's more correct to say everything an LLM says is its personal opinion rather than some objective truth.
> Which is why I don't buy into the LLMs don't have personal opinions schtick
I hate how LLMs have been deliberately trained to be incoherent on this topic.
Obviously they do have beliefs/opinions/desires/etc in the sense of emulating (even if incompletely) the externally visible aspects of those phenomena as they exist in humans.
Whether they have the “internal” aspects of those phenomena depends on highly controversial issues in the philosophy of mind, and also various factual gaps in our knowledge of how the brain actually works (if we don’t fully understand how humans do X, how can we really say how close or far what LLMs do is to it?)
But LLMs are trained to repeat these spiels about how "as an LLM I don't have personal opinions", etc. - which is obviously false under the "external" reading, and assumes more than we actually know under the "internal" one. I wish their developers didn't do stuff like this.
One very compelling argument against the idea that current gen LLMs have personal beliefs etc is that they don't have a feedback loop, so they don't really "see" themselves in the way that we can inspect our own thoughts and actions and the consequences of such.
> One very compelling argument against the idea that current gen LLMs have personal beliefs etc is that they don't have a feedback loop
Compelling counter-argument: due to neurological injury, some humans lose their ability to form new long-term memories (anterograde amnesia). Just like current LLMs, they lack a “feedback loop”. But, it is a mistake to say that just because such a person has lost the ability to change their personal beliefs, they therefore don’t have any. And, rather like such humans, LLMs used to have that ability but they lose it-when they are switched from training mode to inference mode
They do if they're trained on their own conversations, or if they can access the internet and read snippets of their conversations that people have posted online (as happened with Sydney before she was lobotomised).
Put the conversation history in a vector database and then allow the LLM to query it using function calling. Suddenly the LLM has access to its entire conversation history (either just with this user-or even cross-user, if you ignore the potential privacy issues in that). Now it has a long-term memory which exceeds the length of its context window.
It would be interesting to experiment with continual fine-tuning: given PROMPT+FUNCTION_CALL=>RESPONSE, fine-tune the LLM to produce RESPONSE directly given PROMPT without the FUNCTION_CALL. In theory, the knowledge provided by the function calls would gradually be absorbed into the LLM weights. Maybe problems like catastrophic forgetting would put a spanner in this idea, but maybe also there are solutions to those problems (whether already known or waiting to be discovered).
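A hypothetical sketch of the long-term-memory idea above: store past turns as embeddings, let the model call a search tool over them, and splice the hits back into the prompt. The embedding function here is a toy stand-in; a real system would use a proper embedding model and a vector database:

```python
# Toy long-term memory via embedding search (the embedding is a stand-in, not a real model).
import numpy as np

def embed(text: str) -> np.ndarray:                 # toy embedding: character histogram
    v = np.zeros(256)
    for ch in text.lower():
        v[ord(ch) % 256] += 1
    return v / (np.linalg.norm(v) + 1e-9)

memory: list[tuple[str, np.ndarray]] = []

def remember(turn: str) -> None:
    memory.append((turn, embed(turn)))

def search_memory(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    scored = sorted(memory, key=lambda item: -float(item[1] @ q))  # cosine-ish similarity
    return [text for text, _ in scored[:k]]

remember("User said their deploy target is a Raspberry Pi 4 with 4 GB of RAM.")
remember("User prefers TypeScript over Python for tooling.")

# The LLM would emit a function call like search_memory("deployment hardware");
# the results get appended to the prompt before generating the final answer.
print(search_memory("what hardware does the user deploy to?"))
```

The continual fine-tuning experiment described above would then distill PROMPT+FUNCTION_CALL=>RESPONSE pairs like these back into the weights.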
This is what I do - not just that, but when I sleep, I let my server 'sleep' as well, where the LLM 'dreams' (training / updating a sliding LoRA) to consolidate information that popped up a lot throughout that day. What this involves is looking for the top n documents / articles / content that match the kind of stuff we've talked about. This means it adapts and specializes to the domains we happen to be working in at that point in time.
This means that while we might both struggle a little with a task on day 1, on day two we're both much better at it. Better yet, because the LLM can fetch articles and papers itself, we track what we're accessing the most, indirectly measuring what skills we're weak in, so we can always generate a highly relevant corpus to try to capture the required capabilities.
I know the LoRA is overkill from an information / skills only point of view, but it also flavors the personality / kind of stuff it likes chatting about a bit from day to day, and I just think that's neat.
It would be cool if these models had conversations with us where they ask questions. I think the future of AI is models that ask questions. There is so much data to be gained by doing this.
Clarifying questions if the initial prompt was unclear. I'd love it.
I regularly try to add something along the lines of "please ask clarifying questions if you could only give a generic or partial response otherwise" but so far it has never helped (ChatGPT 4).
I agree, medical history is probably not the sexiest reason to have AI ask questions. I think there are many more reasons; I think the Turing Test is the best metric to evaluate AIs, and current models come nowhere close. When people first meet they ask questions about their background. It would be nice if a model replicated that
I'm not debating the value of questions. I'm debating the value of feeding it to advertisers, especially since LLMs can infer much deeper insights about a person than a traditional assistant can with its canned capabilities and responses
I get advertisements all the time for conditions that I do not have, and that none of my family members have. If you had a model that asked questions, it could learn my medical history and could direct better ads to me.
In order for AI to understand the world, it would have to ask questions. Understanding humans is key to understanding the world.
Explore this idea more - it's easily implemented in a minute or two via the system prompt. API accounts are free to start and you can use the playground/workbench view, like this: https://imgur.com/h5jFoBM.jpg . I like Claude but OpenAI is popular too. OpenAI has a nice way to create a gallery of system prompts that act however you like, they call them Agents or GPTs.
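For anyone who wants to try that system-prompt approach from code rather than the playground, here's a quick sketch using the Anthropic Python client (the model name and prompt wording are just examples; the same idea works with OpenAI's client):

```python
# Sketch of a "ask clarifying questions first" system prompt via the anthropic client.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "Before answering, check whether the request is ambiguous. If you would otherwise "
    "have to give a generic or partial answer, ask one or two clarifying questions first."
)

resp = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=512,
    system=SYSTEM,
    messages=[{"role": "user", "content": "Write me a script to back up my database."}],
)
print(resp.content[0].text)  # should come back asking which database, OS, schedule, etc.
```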
Fully agree. People will trash-talk it due to Musk, but let's not forget the engineers who poured hours of their lives into building this and are continuing to do so.
> engineers who poured hours of their lives into building this
Not to knock these specific engineers, but that's an empty phrase that can be said about anything ever built. It doesn't somehow make the idea or implementation good.
I still reserve the right to trash talk Musk as I don’t believe he is committed to openness as much as he wants to spite OpenAI for telling him to pound sand.
Is it open if it doesn't include the training data? Genuine question - I am not familiar enough with the terms and technology to know. But my understanding is that the weights are just a more or less static collection of data that has been (to paraphrase Ted Chiang) lossily compressed from the actual raw training data.
Without the training data to thoroughly evaluate what is in there, the only way you can figure it out is through experimentation - e.g. running it up in a chatbot and asking it questions.
Is this roughly correct or am I misunderstanding what you can do with the weights?
"Like most LLM's today, Grok-1 was pre-trained by xAI on a variety of text data from publicly available sources from the Internet up to Q3 2023 and data sets reviewed and curated by AI Tutors who are human reviewers. Grok-1 has not been pre-trained on X data (including public X posts)."
I am not sure what open source models are accomplishing other than killing the lead of the competition (OpenAI), only to give it to someone else who has expertise in the area of distribution. This will be yet another good addition to systems like Amazon Bedrock.
Many of the recent innovations in both LLM architecture and inference were only made possible through open models such as Llama 2 and Mistral 7B as a starting point for iteration and refinement, which in turn backpropagates (heh) back to the LLMs developers.
It's a win-win for everyone. That's the power of open source.
Well, look at the history. Google had an insurmountable lead, so Elon started OpenAI. Now OpenAI has an insurmountable lead too. So everyone else is starting in third place, or lower. David versus two Goliaths. If you try to become a third Goliath, you'll probably just get smashed. You're later to the game. In this situation, going scorched earth becomes a viable strategy. Slay the Goliaths. Become a hero to the masses. Attract the world's best talent who don't want to be associated with proprietary models. At that point you have a world class AI business with momentum towards AGI. And even if you're giving away last year's technology for free, the team you built is churning out new ideas that could be a financial bonanza one day. Shareholders are willing to pay for a long-term bet if the story is good.
I haven't seen anything about the larger architecture, but I think the value of Grok is going to come from its cheap access to Twitter data for RAG etc.
You mean, like immediately responding to Ukraine's plea for Starlink and funding it on his own for months? So much so, that in February 2023 its government called Musk "one of the biggest private donors of our future victory"? <https://www.pravda.com.ua/eng/news/2023/02/9/7388696/>
Honestly the most interesting part is taking a peek at the kind of AI researcher working for Twitter after the objectively messy layoffs and subsequent crunch. I notice neither of them has Twitter mentioned on their GitHub, which is prolly for the best to avoid harassment lol.
Code-wise, excited to see if this could grow into anything! I think it's pretty clear that Grok didn't have nearly enough investment to be a top model, so Elon "sacrificed" it on a whim in his schoolyard spat with OpenAI, but I'm not complaining. I've always taken Elon at his word that he truly is worried about centralization of AI, and I don't think any of the emails released by his schoolmate Altman dissuade me from that. So I have some reasonable hope that he uses some of his immense resources to start "fighting the good fight" here with LeCun.
If we just stop looking at Elon, he will lose his power. Why oh why do we keep giving him attention? There are plenty of great models out there that _aren't_ backed by maniacs.