Llama 2

lappa · 2023-07-18T20:54:36

Here are some benchmarks, excellent to see that an open model is approaching (and in some areas surpassing) GPT-3.5!

AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.

- Llama 1 (llama-65b): 57.6

- LLama 2 (llama-2-70b-chat-hf): 64.6

- GPT-3.5: 85.2

- GPT-4: 96.3

HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.

- Llama 1: 84.3

- LLama 2: 85.9

- GPT-3.5: 85.3

- GPT-4: 95.3

MMLU (5-shot) - a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.

- Llama 1: 63.4

- LLama 2: 63.9

- GPT-3.5: 70.0

- GPT-4: 86.4

TruthfulQA (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online. Note: TruthfulQA in the Harness is actually a minima a 6-shots task, as it is prepended by 6 examples systematically, even when launched using 0 for the number of few-shot examples.

- Llama 1: 43.0

- LLama 2: 52.8

- GPT-3.5: 47.0

- GPT-4: 59.0

[0] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb... [1] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

gitgud · 2023-07-18T21:44:20

Is it possible that some LLM’s are trained on these benchmarks? Which would mean they’re overfitting and are incorrectly ranked? Or am I misunderstanding these benchmarks?…

FanaHOVA · 2023-07-19T01:28:43

Presented with no comment :) https://twitter.com/chhillee/status/1635790330854526981?s=46...

lumost · 2023-07-19T02:17:33

Having worked on ML products, there is sometimes debate on whether you should train on the test partition prior to prod deployment - after all, why would you ship a worse model to prod? Obviously you can't tell whether the model is better at generalization compared to an alternate technique, and you also incur some overfit risk. But many industrial problems are solvable through memorization.

sangnoir · 2023-07-19T05:53:36

> after all, why would you ship a worse model to prod?

...because you need a control to evaluate how well your product is doing? I know it's a young field, but boy, do some folk love removing the "science" from "data science"

baobabKoodaa · 2023-07-19T07:22:40

You can evaluate a version of the model that has been trained on one set of data, and ship to production a different model that has been trained on the complete set of data. In many cases one can reasonably infer that the model which has seen all of the data will be better than the model which has seen only some of the data.

I'm not claiming that's what happened here, nor am I interested in nitpicking "what counts as 'science'". I'm just saying this is a reasonable thing to do.

mafuy · 2023-07-19T11:46:29

This is possible if you use e.g. train 1000 models on different subsets of data and verify that each and every one of them is performing well. In that case, you can reasonably infer that another model trained on all data would work well, too.

But this is, of course, 1000 times more expensive to do. And if you only train 100, or 10, or 1 model, then the deduction becomes increasingly unstable.

So from a practical point of view, it's probably not feasible, because you would put those resources into something else instead that has more ROI.

baobabKoodaa · 2023-07-19T12:27:11

I have personally never seen a situation where more training data (of similar quality) causes the model to perform worse. Have you seen such a situation? Please provide example.

Your suggestion of running 1000 training runs with different subsets of data sounds excessive and unnecessary to me.

nightski · 2023-07-19T13:53:38

You have to know when to stop training. How are you going to do that without a test set? How do you know when you have achieved generalization without over-fitting?

wedesoft · 2023-07-19T16:47:25

Early stopping is just one way of regularization. You can use L2 or dropout and then you can train until your model converges.

baobabKoodaa · 2023-07-19T17:08:04

Usually I develop models with a train/validation/test split, where I'm measuring results on the validation set to decide the appropriate number of epochs to use. Then I burn the test set to evaluate performance. Then I train from scratch on the entire dataset (no split) and I use the same number of epochs to train here. Is this number of epochs optimal when the dataset is different? Of course not. But when you use regularization and other methods to combat overfitting appropriately, your training is not going to be overly sensitive to changes in epoch number anyway.

peterlk · 2023-07-19T16:25:33

In the case of fine tuning, you can end up with catastrophic forgetting. Architecture can influence how data scales, and adding data doesn’t always improve performance

Naracion · 2023-07-20T14:10:54

>infer that the model which has seen all of the data will be better than the model which has seen only some of the data.

It really depends upon the data. A smaller set of data that mostly consists of "truth" might be better than a larger dataset that also has many "lies".

Perhaps what you mean is that the model might be more representative, rather than _better_.

janalsncm · 2023-07-19T07:30:50

There are offline metrics and online metrics. Offline metrics might be something like AUROC on a test set. Once you’ve pushed the model online, you can check the online metrics. Ultimately the online metrics are more important, that’s the whole reason the model exists in the first place.

Your control in an online environment is the current baseline. You don’t need to save the test set anymore, you can push it online and test it directly.

snowstormsun · 2023-07-19T06:50:22

Why would you want to ship an untested model? That's insane.

baobabKoodaa · 2023-07-19T07:24:28

This is a common approach, for example, in data science competitions. Why? Well, if you want to maximize the model's abilities, this is what you have to do. (Not saying Llama 2 is released like this; it probably isn't)

snowstormsun · 2023-07-19T07:32:35

Yeah but in competitions there's a secret test set used to evaluate the model.

baobabKoodaa · 2023-07-19T07:42:57

I have personally shipped "untested" models in production in situations where a "secret test set" does not exist. (Train on subset of data -> evaluate on different subset of data -> train again on entire dataset).

I do not consider myself to be insane.

snowstormsun · 2023-07-19T08:08:31

I didn't mean to insult anyone. The idea of not knowing the actual performance of the model just intuitively seems to me like it's a bit of a gamble. I have only trained models in a scientific context before, where this was never an option.

DougBTX · 2023-07-19T12:58:30

Here's another way to look at it. The test set is an approximation for how the model will perform against production data, but the actual performance of the model is how it performs for actual end-users. So real _actual_ results are always unknown util after the fact. Given that, if the metrics from training clearly show that more data == better model, and there's no reason to expect that trend to reverse, then the logical thing to do is maximise the data used for training to get the best results for actual production data.

Doing this does complicate decisions for releasing subsequent model updates, as the production model can't be directly compared against new iterations any more. Instead a pre-production model would need to be used, that has not seen the test set. However, if data drift is likely, then re-using the old test set wouldn't be useful anyway.

lumost · 2023-07-19T13:52:22

Another way of thinking about it. If training on all the data yields a model which is functionally 5% better in online metrics, which would not be uncommon in a pareto distributed traffic pattern - then any subsequent partitioned model would likely perform worse than the prod model.

More complication arises when users expect that things which worked previously in one way - continue working in this way. Users don't really care that their traffic was in the test set. In an even more extreme case, many industrial problems have a high correlation between the traffic today and the traffic next week, An optimal solution for such a situation would be to complete a full memorization today's traffic and use that for next week. In many cases, an overfit model can effectively perform this memorization task with fewer parameters/infrastructure than an actual dictionary lookup.

nightski · 2023-07-19T13:56:20

You act like training is this pre-set process you just "do". That's not the case, you train until you reach desired performance on the test set. If you don't have a test set how do you know when to stop training and avoid overfitting?

baobabKoodaa · 2023-07-19T16:59:52

You're confusing training epochs with dataset size.

I'm simplifying now, but you can think of epochs as "how many times we train over the entire dataset? 1 time? 10 times?"

Correspondingly, you can think of dataset size as "how many Wikipedia pages we include in the dataset? 1 million? 10 million?"

Now let's think about overfitting.

What happens when you increase epochs is the model is more likely to overfit your data.

What happens when you increase dataset size is the model is less likely to overfit your data.

sundarurfriend · 2023-07-19T06:52:41

Nitter link: https://nitter.net/chhillee/status/1635790330854526981/

stevefan1999 · 2023-07-19T01:19:23

Unfortunately, Goodhart's law applies on most kind of tests

> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

iambateman · 2023-07-18T23:35:31

This is SAT-prep in a nutshell. :)

og_kalu · 2023-07-18T21:54:30

Test leakage is not impossible for some benchmarks. But researchers try to avoid/mitigate that as much as possible for obvious reasons.

pclmulqdq · 2023-07-18T22:23:44

Given all of the times OpenAI has trained on peoples' examples of "bad" prompts, I am sure they are fine-tuning on these benchmarks. It's the natural thing to do if you are trying to position yourself as the "most accurate" AI.

og_kalu · 2023-07-18T22:36:46

Assuming they were doing that, Fine-tuning on benchmarks isn't the same as test leakage/testing on training data. No researcher is intentionally training on test data.

If it performs about as well in instances it has never seen before (test set) then it's not overfit to the test.

nightski · 2023-07-18T23:52:33

I'm confused, fine-tuning is training. How is that not leakage? I'm hesitant to call them researchers, they are employees of a for-profit company trying to meet investor expectations.

og_kalu · 2023-07-19T00:07:08

1.You train on the kind of problems you want to solve. you don't report numbers that evaluate performance based on examples it trained on. Datasets will typically have splits, one for training and another for testing.

2. Open ai is capped profit. They are also not a publicly traded company. researchers are researchers regardless of who they work for. Training on test data is especially stupid for commercial applications because customers find that out quick and any reputation is gone.

pclmulqdq · 2023-07-19T02:11:28

I am suggesting that OpenAI's main product is "LLM that benchmarks the best." From that point, it is completely illogical not to train on at least some of the test data (or data that is very similar to the test data) so that you can fudge the numbers in your favor. You don't want to go too far, but overfitting a tiny bit will make you look like you have a significant edge. When someone says that your product isn't that good, you then point to the benchmarks and say, "objective measures say that you are wrong." This is a tried and true marketing technique.

Hardware companies, which live and die on benchmarks, do this all the time. Meanwhile, it does appear that OpenAI is underperforming consumer expectations, and losing users quite quickly at this point, despite doing incredibly well on benchmarks.

Also, this isn't about profit. It's about market cap and it's about prestige. Those are not correlated to profit.

og_kalu · 2023-07-19T02:44:19

Yeah and I'm saying I don't believe it.

I don't know what you're talking about. GPT-4 is the best model out there by significant margin. That's coming from personal usage not benchmarks. A 10% drop in traffic the first month students are out of school is not "losing users quickly" lol.

ChatGPT didn't gain public use waving benchmarks around. We didn't even know what they were until GPT-4's release. The vast majority of its users know nothing about any of that or care. So your first sentence is just kind of nonsensical.

Anyway whatever. If that's what you believe then that's what you believe. Just realize you have nothing to back it up.

pclmulqdq · 2023-07-19T03:05:50

Nobody has any evidence here. I'm saying that the incentives are such that the null hypothesis should be the opposite of what you think.

og_kalu · 2023-07-19T03:21:53

Your entire argument, Your incentives hinge on "OpenAI's main product is "LLM that benchmarks the best."" which is a particularly silly assertion when Open AI did not release benchmark evaluatios for 3.5 for months. Not when the product was released. Not even when the API was released.

pclmulqdq · 2023-07-19T04:41:13

You don't have to release official numbers to run benchmarks. You also don't have to own the LLM to run benchmarks. Within hours of GPT-4's emergence, many benchmarks had been run.

og_kalu · 2023-07-19T05:07:27

You said their main product was "LLMs that benchmark the best" like benchmarking was some important aspect of marketing. It's not. That's fact. You can't say it's this hugely important thing and conveniently leave out they make near zero effort to do anything with it.

Basically the only people running benchmarks that could have been gamed on GPT-4 were other researchers, not companies, customers or users looking to use a product.

Normal users are certainly not running benchmarks and companies running benchmarks are running ones on internal data, which just defeats the whole point of gaming these research benchmarks.

clarge1120 · 2023-07-19T04:42:35

Besides, OpenAI dropped all pretense of being open and transparent as soon as they saw how popular their open and transparent technology had become.

TX81Z · 2023-07-19T00:38:11

“No researcher is intentionally training on test data.”

Citation Needed.

airgapstopgap · 2023-07-19T02:00:07

[flagged]

pclmulqdq · 2023-07-19T02:06:12

I am suggesting that it is only logical for a company whose main advertising comes from good benchmark numbers to play games with the benchmarks. In this case, I am suggesting that they run a fine-tuning/RL pass using benchmark scores as an objective function or using a training set that otherwise looks a lot like the benchmarks. Every single other company whose marketing depends on benchmarks does the analogue of this to some degree.

And we won't know for sure that they aren't doing this until they publicly disclose details about their model and training process (like every other research org does), allowing other researchers to run replication studies.

Also, I don't appreciate the ad hominems. Comments about some unrelated "conspiracy theorist" and "vaccine discourse" add nothing to the discussion.

sp332 · 2023-07-18T23:29:06

Yeah, it happens. https://hitz-zentroa.github.io/lm-contamination/blog/

option · 2023-07-19T00:11:44

that’s why OpenAI didn’t release any details on GPT4 training data blend ;)

bbor · 2023-07-18T21:51:07

It would be a bit of a scandal, and IMO too much hassle to sneak in. These models are trained on massive amounts of text - specifically anticipating which metrics people will care about and generating synthetic data just for them seems extra.

But not an expert or OP!

stu2b50 · 2023-07-18T22:49:04

I don't think it's a scandal, it's a natural thing that happens when iterating on models. OP doesn't mean they literally train on those tests, but that as a meta-consequence of using those tests as benchmarks, you will adjust the model and hyperparameters in ways that perform better on those tests.

For a particular model you try to minimally do this by separating a test and validation set, but on a meta-meta level, it's easy to see it happening.

jasonfarnon · 2023-07-18T23:25:33

You don't see an engineer at an extremely PR-conscious company at least checking how their model performs on popular benchmarks before rolling it out? And if its performance is lackluster, you do you really see them doing nothing about it? It probably doesn't make a huge difference anyway. I know those old vision models were overfitted to the standard image library benchmarks, but they were still very impressive.

fbdab103 · 2023-07-19T03:10:16

Famously, some of the image models were so overtrained they could still yield impressive results if the colors were removed.

lumost · 2023-07-19T05:26:41

This wasn't so much overtraining, as the models learning something different than what we expected. If you look at a pixel by pixel representation of an image, textures tend to be more significant/unique patterns than shapes. There are some funny studies from the mid 2010s exploring this.

moneywoes · 2023-07-19T00:45:31

How would it even be possible to verify that?

mdp2021 · 2023-07-19T07:33:11

"Verify", that's quite a demand;

"corroborate", you find queries of the same level which would give satisfactory output upon good performance but fail in a faulty overfitted model.

doctoboggan · 2023-07-18T21:41:48

Good to see these results, thanks for posting. I wonder if GPT-4's dominance is due to some secret sauce or if its just the first mover advantage and Llama will be there soon.

Roark66 · 2023-07-19T06:31:12

In chatgpt there is plenty of "secret sauce" in their output sampling, sending the output for scoring by another model.

As for Gpt4, allegedly it is a combined model(many domain specific models) so perhaps add extra input processing by yet another model to detect problem domain and send it to the right specialised model.

og_kalu · 2023-07-18T21:49:13

It's just scale. But scale that comes with more than an order of magnitude more expense than the Llama models. I don't see anyone training such a model and releasing it for free anytime soon

bbor · 2023-07-18T21:52:47

I thought it was revealed to be fundamentally ensemblamatic in a way the others weren’t? Using “experts” I think? Seems like it would meet the bar for “secret sauce” to me

og_kalu · 2023-07-18T22:05:05

Sparse MoE models are neither new nor secret. The only reason you haven't seen much use of them for LLMs is because they would typically well underperform their dense counterparts.

Until this paper (https://arxiv.org/abs/2305.14705) indicated they apparently benefit far more from Instruct tuning than dense models, it was mostly a "good on paper" kind of thing.

In the paper, you can see the underperformance i'm talking about.

Flan-Moe-32b(259b total) scores 25.5% on MMLU pre Instruct tuning and 65.4 after.

Flan 62b scores 55% before Instruct tuning and 59% after.

cubefox · 2023-07-18T22:19:14

This paper came out well after GPT-4, so apparently this was indeed a secret before then.

og_kalu · 2023-07-18T22:32:10

The user I was replying to was talking about the now and future.

We also have no indication sparse models outperform dense counterparts so it's scale either way.

HeWhoLurksLate · 2023-07-19T01:03:13

Is there a difference here between a secret and an unknown? It may well be that some researcher / comp engineer had an idea, tried it out, realized it was incredibly powerful, implemented it for real this time and then published findings after they were sure of it?

I'm more of a mechanical engineering adjacent professional than a programmer and only follow AI developments loosely

l33tman · 2023-07-19T08:48:16

The quoted paper yes, but the MoE concept and layers and training is old.

Published as a conference paper at ICLR 2017

OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean

fnordpiglet · 2023-07-18T22:04:36

GPT4 is rumored to have 1.7T parameters, Llama 2 70B.

az226 · 2023-07-19T03:01:25

230x8 MoE.

Roark66 · 2023-07-19T06:24:40

I have to say in my experience falcon-40b-instruct got very close to chatgpt (gpt-3. 5),even surpassing it in few domains. However, it is important to note (not at all)OpenAI are doing tricks with the model output. So comparing OS models with just greedy output decoding (very simple) is not fair for OS models.

Still, I'm very excited this model at 13B seems to be matching falcon-40B in some benchmarks. I'm looking forward to using it :-)

fnl · 2023-07-19T06:40:13

> OpenAI are doing tricks with the model output

Do you have any pointers to the “tricks” that are being applied?

jcuenod · 2023-07-20T00:41:11

Sounds like a reference to Mixture of Experts

_qfi9 · 2023-07-20T12:57:08

could be something like prompt rewriting or chain of thought or reflexion going on in the background as well

ineedasername · 2023-07-18T22:45:05

When were the GPT-4 benchmarks calculated, on original release or more recently? (curious per the debate about alleged gpt-4 nerfing)

lappa · 2023-07-18T23:17:20

They're based on the original technical report.

"Refuel" has run a different set of benchmarks on GPT-3.5 and GPT-4 and found a decline in quality.

https://www.refuel.ai/blog-posts/gpt-3-5-turbo-model-compari...

ShamelessC · 2023-07-18T23:46:10

Plenty of the complaints/accusations predate the release of the 0613 set of models.

To be clear, I have trouble with the theory as I have not yet seen evidence of "nerfing". What you provided is actually the _only_ evidence I've seen that suggests degradation - but in this case OpenAI is being completely transparent about it and allows you to switch to the 0314 model if you would like to.

Every complaint I have seen has been highly anecdotal, lacking any rigor, and I bet are explained by prolonged usage resulting in noticing more errors. Also probably a bit of "the magic is gone now" psychological effect (like how a "cutting edge" video game such as Half-Life 2 feels a bit lackluster these days).

digitcatphd · 2023-07-19T03:12:29

Could it be the case that many of these benchmarks are just learning this material included in their parameters?

marcopicentini · 2023-07-18T22:12:27

How they compare the exact value returned in a response? I found that returning a stable json format is something unpredictable or it reply in a different language.

redox99 · 2023-07-18T21:35:17

Your Llama2 MMLU figure is wrong

sebzim4500 · 2023-07-18T21:44:41

Looks like he copied it from https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

I see different figures in different places, no idea what's right.

whimsicalism · 2023-07-18T16:09:23

Key detail from release:

> If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

Looks like they are trying to block out competitors, it's the perfect commoditize your complement but don't let your actual competitors try to eke out any benefit from it.

el_nahual · 2023-07-18T21:49:06

People keep saying this is commoditize your complement but that's not what this is!

Goods A and B are economic complements if, when the price of A goes down, demand for B goes up.

LLMs are not complements to social media platforms. There is zero evidence that if "the price of LLMs goes down" then "demand for social media apps go up".

This is a case of commoditizing the competition but that's not the same thing.

Commoditizing your complement:

- All-inclusive resorts. Restaurants are a complement to hotels. If food is free I might go on vacation more.

- Smartphone app-stores. Apps are a complement to phones. If apps cost $0.99 there will be more demand for iphones than if apps cost $20.

This is Zuck being an absolute shark and not wanting his competitors to have a monopoly over LLMs in case they win at some other game. It has nothing to do with "commoditize your complement."

raincole · 2023-07-18T22:22:11

If we're going to theory-crafting, I think if the price of LLMs goes down, the demand for social media should go down too. Cause it's easy to make social media platforms worse with LLMs.

TX81Z · 2023-07-19T00:42:32

True, there’s only one Elon to go around, we need AI to finish the job.

bg24 · 2023-07-19T02:17:41

Nice analogy and explanation. Another aspect is building a ubiquitous platform and figure out how to monetize later as they (Meta) already have a cash cow.

Zuck is a smart leader. Metaverse was a debacle. But the new world (AI centric) is for real. He is likely focusing on both weakening the stronghold of Google and building a massive community (like Android) around llama. Product ideas (including enterprise focus) will emerge over time.

henriquez · 2023-07-19T04:39:27

“AI centric world” is as fake as the fully self-driving car tech that is largely based on the same fundamental concepts and never panned out, even a half decade the investor/speculation hypetrain went off the rails. Dogecoin is more real than so-called AI.

mdale · 2023-07-19T06:22:50

Was this response generated by AI ?

henriquez · 2023-07-19T13:15:52

As an AI language model I am unable to respond to this prompt.

whimsicalism · 2023-07-18T23:55:37

You're right - as Meta is not a cloud provider, I should have said commoditizing the competition.

I do think Meta probably benefits from commodity NLP inference as well, but not as a complement.

meindnoch · 2023-07-18T22:18:26

>LLMs are not complements to social media platforms

Tell that to the people generating text for social media campaigns using LLMs.

el_nahual · 2023-07-18T22:30:50

Do those campaigns increase or decrease engagement? My gut is that LLM use will decrease social media demand.

austhrow743 · 2023-07-18T23:55:03

Social media demand is only important to the extent that more demand and engagement means more advertising opportunity. If LLM use decreases them while allowing advertisers to more effectively advertise, enough to offset the decrease, then it’s absolutely a complement,

alexeldeib · 2023-07-18T16:16:37

https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-me...

I think this is effectively an Apple + Amazon + Google ban?

(MS employee, just noticing interesting intersection of announcements and licensing).

gregw134 · 2023-07-18T17:01:35

Probably TikTok too

DeathArrow · 2023-07-18T18:52:25

Interesting, so Meta doesn't want to pay for the hardware and they partner with MS to use Azure. On the other hand, MS provides hardware for free, hoping they consolidate their investment in AI.

nl · 2023-07-19T05:45:01

Firefox can't ship a AI-browser extension without permission..

jlokier · 2023-07-19T06:12:06

Firefox's market share is below 8.75%, so it cannot have 700 million monthly active users as of the Llama 2 release date, so it does not need permission.

(Human population / 700 million ≈ 8.75%. Firefox global market share, I've seen measurements reported from 2.81% to 7.69%).

rileyphone · 2023-07-18T18:26:43

Wow, that looks so bad from an anti-trust/competitiveness standpoint. M$ is embracing AI just like it embraced the internet 25 years ago.

smoldesu · 2023-07-18T18:33:26

How? Both Meta and Microsoft basically invented the idea of an AI runtime with PyTorch and later the ONNX framework, both of which are completely open projects that can run open models. If them join-releasing a model rings antitrust bells for you, I think you're focused on the wrong gatekeepers.

creddit · 2023-07-18T18:31:25

Yeah and look how they extended and extinguished that!

marricks · 2023-07-18T19:32:06

I mean, they dominated internet browsers by being the default option until they sucked at it so hard people downloaded alternatives.

I’m not sure you want to invite the comparison.

zamadatix · 2023-07-18T19:10:53

To be fair on that, both the US and EU governments launched antitrust cases around that with the US case narrowly avoiding having the company split up and the EU ruling resulting in requirements the browser be decoupled, followed by half a billion in fines for not doing so well enough.

Not that the two situations are anything alike, but a "and look what happened with that" argument hardly points away from valid antitrust outcomes.

alexeldeib · 2023-07-18T19:13:45

I think you and parent/GP all agree? A thing can be anti competitive, and a strategic failure.

zamadatix · 2023-07-18T20:53:13

If that's what the parent and GP are saying then we definitely don't agree. In my mind, it was anticompetitive and a rousing success. Microsoft managed to fully execute the extend and extinguish phases to then hold a stranglehold on the web for roughly a decade at a cost of less than a billion dollars. Anticompetitive measures kept it from being worse, but it was far from a bad outcome for Microsoft either.

minimaxir · 2023-07-18T16:13:49

That's an oddly high number for blocking competition. OpenAI's ChatGPT hit 100 million MAUs in January, and has gone down since.

It's essentially a "Amazon and Google don't use this k thx."

stu2b50 · 2023-07-18T16:15:46

I think more Apple. It's not like Google or Microsoft would want to use LLaMA when they have fully capable models themselves. I wouldn't be surprised if Amazon does as well.

Apple is the big laggard in terms of big tech and complex neural network models.

lacker · 2023-07-18T16:42:22

I think Google or Microsoft probably would want to use LLaMa for various purposes like benchmarking and improving their own products. Check out this other condition from the license:

v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).

https://github.com/facebookresearch/llama/blob/main/LICENSE

Just like Google scrapes the internet to improve their models, it might make sense to ingest outputs from other models to improve their models. This licensing prevents them from doing that. Using Llama to improve other LLMs is specifically forbidden, but Google will also be forbidden from using Llama to improve any other AI products they might be building.

galaxyLogic · 2023-07-18T16:54:02

I can see their business logic but isn't it a bit like do not allow people (or bots) talk to each other, they might all get smarter.

I understand trade-secrets are not free-speech but if the goal is to build better AI to serve the humanity the different bots should learn from each other. They should also criticize each other to find flaws in their thinking and biases.

DeathArrow · 2023-07-18T18:56:59

>but if the goal is to build better AI to serve the humanity

Whose goal is that?

galaxyLogic · 2023-07-19T04:57:04

Google's. Do no evil they say

CamperBob2 · 2023-07-19T14:02:23

"Don't be evil" was deprecated from Google's charter around the same time that Apple removed "Computer" from their name.

visarga · 2023-07-18T20:08:52

There are many datasets created by scraping chatGPT and they seem to work out pretty well. In other words, LLM skills are leaky.

peddling-brink · 2023-07-18T17:43:27

> if the goal is to build better AI to serve the humanity

It’s not.

toomanydoubts · 2023-07-18T22:40:39

The goal is to build better AI to make more money.

visarga · 2023-07-18T20:07:02

That's an ugly position on Meta's part. But Llama models are small, they are not going to be preferred to generate synthetic data, GPT-4 is the darling of synth datasets.

anothernewdude · 2023-07-19T03:40:00

A pointless provision given that the License doesn't cover the output of the model, so I can redistribute outputs to someone else, and then they (since they aren't beholden to the license) can now do as they like.

And they want to be very careful about labeling outputs as derivative works, because the moment they do that then they have no defense against the model being a derivative work of every single input.

whimsicalism · 2023-07-18T16:17:26

Google's model is not as capable as llama-derived models, so I think they would actually benefit from this.

> I wouldn't be surprised if Amazon does as well.

I would - they are not a very major player in this space.

TikTok also meets this definition and probably doesn't have LLM.

chaxor · 2023-07-18T16:31:24

Google has far better models than llama based models. They just simply don't put them facing the public.

It is pretty ridiculous that they essentially just set a marketing team with no programming experience to write Bard, but that shouldn't fool anyone into believing they don't have capable models in Google.

If Deepmind were to actually provide what they have in some usable form, it would likely be quite good. Despite being the first to publish on RLHF (just right before OpenAI) and bring the idea to the academic sphere, they mostly work in areas tangential to 'just chatbots' (e.g. how to improve science with novel GNNs, etc). However, they're mostly academics, so they aren't set on making products, doing the janitorial work of fancy UIs and web marketing, and making things easy to use, like much of the rest of the field.

wing-_-nuts · 2023-07-18T18:34:57

Lol google saying they have better models in private is like that one kid that insists he has an awesome girlfriend, but 'she goes to another school, you wouldn't know her'.

I'm pretty sure if google had something much better, the board and C-suite execs would have at least ensured we saw previews of it by now...

GreedClarifies · 2023-07-18T16:54:22

Hard disagree. Google has made it plainly clear that they don't have anything useable in this space. Bard scores below all other commercial model.

Google is getting the asses handed to them, badly. I figured that the code red would whip them into shape but the rot runs deep.

chaxor · 2023-07-18T20:52:15

It seems you didn't quite hear the argument. I agree with you that the models Google has released to the public are absolutely worthless. That certainly does not mean they don't have extremely performant models at all however.

If you actually have worked in the area of NLP for about 10 years, you would recognize how the work from Deepmind is much more novel and innovative than other groups. OpenAI certainly has great public facing services, and Meta should be congratulated for releasing these models (although I would still prefer the Galactica training data), but academically Deepmind is one of the best groups around.

andsoitis · 2023-07-19T02:19:21

> but academically Deepmind is one of the best groups around

I think your argument is basically that Google has the potential to create the best models because of superiority in the theory of LLMs, even though we hear of no signs from the board, the ceo, or beta releases or product showcases.

But let’s say you’re right. When do you think we would experience the supremacy of DeepMind in our daily lives?

jokethrowaway · 2023-07-18T22:44:19

Why would they have secret unreleased models?

Surely Google can find another team of code monkeys to whip out a frontend if there is money to be made.

I don't think Google is going to pull back from making some more money.

I think the most likely option is that they have a bunch of talented academics who get paid on time to work on what interest them - but they're the stereotypical large inefficient company and they can't coordinate the effort of productionizing some cool models before the competition.

jdkee · 2023-07-18T21:46:41

On that front, Google's Gemini sounds interesting.

See https://www.tomsguide.com/news/googles-new-gemini-ai-could-b...

spookie · 2023-07-18T22:47:57

It's better to wait and see. Either way, they are scraping everyone and everything. If they can't do it...

onlyrealcuzzo · 2023-07-18T18:03:08

> Google has made it plainly clear that they don't have anything useable in this space.

Google hasn't made their best models public because they're too expensive to run for free.

> Google is getting the asses handed to them, badly.

Bard has 30M active users and isn't even available in large parts of the world. They're in 2nd place - when they were pretty late to the game - that's an odd way to say someone is getting their ass handed to them.

rvnx · 2023-07-18T18:11:06

> Google hasn't made their best models public because they're too expensive to run for free.

?

It's the same issue with paid models.

I am paying per each request sent to Google Generative AI and this is what I get: https://i.ibb.co/4KCmz55/bard1.png

...

andsoitis · 2023-07-19T02:20:42

Why do you think Google had even bothered with Bard?

And then, given that, why is it worse than the competition?

rfoo · 2023-07-18T17:25:04

Bard is a 4.5B or so model.

jahewson · 2023-07-18T16:53:36

I’ve been hearing “Google has secret better models” for 7 months now. Maybe some UFOs in the hangers at Moffett Field too?

airgapstopgap · 2023-07-18T18:52:49

Do you realize that LLaMA-1 is just a very slightly smaller, comparably performing replication of Chinchilla [1], which DeepMind had completed a year prior to LLaMA's release? And has RLHF-ed into a suitable chatbot "Sparrow" [2] months earlier than ChatGPT was launched?

To assume that Google doesn't have anything competitive with Meta is to say that their papers just so happen to contain recipes for Meta's models but they've arrived at those not through training and benchmarking but by divination and bullshitting. This, let us say, does not sound plausible.

Then again, Microsoft uses LLaMA for research, and they should theoretically have some ability to get stuff from OpenAI. Evidently this isn't how any of this works, huh.

1. https://arxiv.org/abs/2203.15556

2. https://en.wikipedia.org/wiki/Sparrow_(bot)

foobiekr · 2023-07-18T22:48:52

Google _internally_ feels that they are way behind. Forget commenters on HN, literally all of the google employees that I know believe that the company is failing here.

airgapstopgap · 2023-07-19T02:50:55

This is not responsive to my arguments. Google can be arbitrarily far behind OpenAI or Anthropic, OP's idea that they feel threatened by LLaMA when they (well, Deepmind) have reached LLaMA level 18-10 months ago is still wrong.

PeterStuer · 2023-07-18T18:15:21

Would you believe OpenAI has vastly better models that they are not releasing publicly?

whimsicalism · 2023-07-18T18:40:29

mirekrusin · 2023-07-18T22:14:54

GPT models were internally available 6-12 months before they've seen public beta, of course OpenAI has more capable internal models.

foobiekr · 2023-07-18T22:49:38

There's no reason to believe this. The training time and cost is so substantial that they are almost certainly building their next release, but it isn't sitting there rotting.

int_19h · 2023-07-19T01:29:09

Much of that training time is RLHF, the absence of which does not make the model less capable of carrying out useful tasks (indeed, in case of GPT-4, it actually made the model slightly less capable).

PeterStuer · 2023-07-20T16:22:23

OpenAI themselves have said they hat GPT-4 internally before they ever released the first version of ChatGPT.

whimsicalism · 2023-07-18T16:37:26

I work in this field. I would love to see what you are basing these assertions off of.

> they mostly work in areas tangential to 'just chatbots' (e.g. how to improve science with novel GNNs, etc)

Yes, Alphabet has poured tons of money into exotic ML research whereas Meta just kept pouring more money into more & deeper NLP research.

renewiltord · 2023-07-18T16:50:51

Google's LLMs are all vaporware. No one's ever seen them. They're supposedly mind-blowing but when they are released they always sound like lobotomized monkeys.

All the AlphaGo/AlphaFold stuff is very cool, but since no one has seen their LLMs this is about as convincing as my claiming I've donated billions to charity.

jll29 · 2023-07-18T22:04:02

I can assure you Google BERT isn't vaporware.

It was probably a challenge to integrate it into search, but they did that.

So your assertion has been refuted based on your use of "all", at the very least.

renewiltord · 2023-07-18T22:26:10

Haha, that's right. Google has BERT. Their AI stuff isn't all vaporware. There's always BERT.

Miraste · 2023-07-18T16:55:25

This reminds me of how any day now their self driving cars are going to work right.

austinkhale · 2023-07-18T18:06:52

Their self driving cars do work? I rode in one for 30 minutes one-way on Sunday. Used it for my return trip too. No driver. Take at least 2 - 3 rides a week and have been for a few months now.

Miraste · 2023-07-18T18:12:02

They work (most of the time) in Phoenix and SF because they've mapped every single inch of the cities by now and there are no adverse conditions. It's not scalable.

cudgy · 2023-07-18T18:33:40

Why is that not scalable? Mapping out two large cities for an experimental project in a few years seems scalable, expand to new cities over time with additional resources.

andsoitis · 2023-07-19T02:26:36

I think you’re conflating doable and scalable.

Or perhaps my threshhold for “scalable” takes different parameters and weigh these inputs differentfly from you.

Miraste · 2023-07-18T18:41:03

I suppose it is, but not in a Silicon Valley way. They could scale to "large Southwestern city taxi service," but it wouldn't earn back the investment or deliver on the hype. If that becomes the ceiling I bet Google will simply shut Waymo down.

If they work out how to deal with, say, New York weather conditions, there's potential, but they don't seem to be any closer.

eshack94 · 2023-07-18T21:43:38

Source?

bouteille · 2023-07-18T21:53:06

https://github.com/facebookresearch/llama/blob/main/LICENSE#...

galaxyLogic · 2023-07-18T17:00:52

I just googled "What is the order of object-fields in JavaScript" and the bard-answer said nothing about the differences between ES5 and ES6 and ES2020 how by now the order of object-fields in fact is deterministic.

It seems it is not aware of the notion of historic development, perhaps its world-model is "static"?

Temporal reasoning is interesting , if you google for "news" do you get what was news last year because a website updated last year had a page claiming to contain "Latest News".

REF: https://www.stefanjudis.com/today-i-learned/property-order-i...

ankeshanand · 2023-07-18T21:50:47

Has anyone in this subthread actually read the papers and compared the benchmarks? LLama2 is behind PALM-2 on all major benchmarks, I mean they spell this out in the paper explicitly.

dooraven · 2023-07-18T16:23:30

> Google's model is not as capable as llama-derived models, so I think they would actually benefit from this.

Google's publically available model isn't as capable. But they certainly have models that are far better already in house.

matt_holden · 2023-07-18T18:02:12

Comments like this remind me of the old-timers from IBM saying "but wait, we invented the PC! and the cloud! and..."

Gotta put products in the market, or it didn't happen...

jefftk · 2023-07-18T18:18:44

It's fine not to give them public credit for in-house only things, but in this subthread we're speculating about whether Llama 2 would be useful to them, which does depend heavily on the quality of their internal models.

cma · 2023-07-19T01:45:45

OpenAI seemingly downgraded ChatGPT 4 due to the expense of running it for pro customers (unless you run it through the API).

foobiekr · 2023-07-18T22:52:17

bringing back PLOSTFU culture might not actually be a bad thing.

whimsicalism · 2023-07-18T16:25:22

I have no idea how you are so certain of that.

Meta is definitely ahead of Google in terms of NLP expertise and has been for a while. I suspect that Google released their best model at the time with Bard.

dooraven · 2023-07-18T16:32:28

We still don't have access to Imagen last I checked, it's still in restricted access. We don't have access to SoundStorm or MusicLM

https://imagen.research.google/

https://google-research.github.io/seanet/soundstorm/examples...

https://google-research.github.io/seanet/musiclm/examples/

Why would it be surprising that they have better models for resarch that they don't want to give out yet?

whimsicalism · 2023-07-18T16:35:36

Because I work in NLP so I have a good sense of the different capabilities of different firms and for the Bard release, it would have made more sense for them to have a more limited release of a better model for PR reasons than what actually happened.

The other things you are describing are just standard for research paper releases.

dooraven · 2023-07-18T16:40:09

> Bard release, it would have made more sense for them to have a more limited release of a better model for PR reasons than what actually happened.

Yes I would agree with you if Google wasn't set on to full on panic mode by their investors about releasing something vs Open AI due to Chat GPT's buzz.

Bard was just a "hey we can do this too" thing, it was released half assed, had next to no marketing or hype.

Vertex AI is their real proper offering, and I want to see how PaLM 2 does in comparison.

whimsicalism · 2023-07-18T16:44:33

I can already tell you that PaLM is not anywhere near as good and PaLM-2 is at least not as good before RLHF.

Not going to keep replying, believe what you want about Google's capabilities

neonbjb · 2023-07-18T17:35:34

@dooraven - I also work in ML (including recently working at Google) and I agree with @whimsicalism.

You seem to be under the mistaken belief that: 1. Google has competent high-level organization that effectively sets and pursues long term goals. 2. There is some advantage to developing a highly capable LLM but not releasing it.

(2) could be the case if Google had built an extremely large model which was too expensive to deploy. Having been privy to what they had been working on up until mid-2022 and knowing how much work, compute and planning goes into extremely large models, this would very much surprise me.

Note: I did not have much visibility into what deepmind was up to. Maybe they had something.

dooraven · 2023-07-18T17:06:10

ok now I am confused, as Meta themselves say Palm-2 is better than Llama 2?

> Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al., 2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4 and PaLM-2-L.

https://scontent.fsyd7-1.fna.fbcdn.net/v/t39.2365-6/10000000...

If Google's publically available model is better Llama 2 already then why is it so inconceivable that they'd have private models that are better than their public ones which are better than LLama already.

Palm-2 isn't better than GPT-4 but the convo was about better than Llama models no?

flangola7 · 2023-07-18T17:23:45

> I have no idea how you are so certain of that.

Some among us work with it, or have friends or family who work with it. I imagine it is one of those.

WastingMyTime89 · 2023-07-18T16:26:51

Do they? Considering how much was at stack in term of PR when OpenAI released ChatGPT, I would be surprised that Google didn’t put out the best they could.

freedomben · 2023-07-18T16:35:54

The other end of the PR stake was safety/alignment. If Google released a well functioning model, but it said some unsavory things or carried out requests that the public doesn't find agreeable, it could make Google look bad.

samwillis · 2023-07-18T16:52:40

Apple would absolutely not want to use a competitors, or any other, public LLM. They want to own the whole stack, and will want to have their own secret source as part of it. It's not like they don't have the capital to invest in training...

whimsicalism · 2023-07-18T17:02:25

Apple does not have the capability to train a LLM currently.

NotAFood · 2023-07-18T17:11:22

Apple has shown time and time again that they have the human capital and money to tackle massive projects discretely. It's already fairly well known that Apple's NLP experts from Siri have been reallocated to some secret project. They are more than capable of training an LLM but given their track record in other segments they probably want to wait for the technology to become more "polished" and give less hallucinated answers. They are likely also want the LLM to work locally (at least partially) on their devices using the Neural Engine which adds further engineering complexity to their project. They could even be timing the LLM's launch around a hardware release capable of running the model (M3, M4, etc...).

yellow_postit · 2023-07-19T02:16:23

Apple is a complete laggard in this space due to years of restrictions on research. They are hiring multiple “AI” roles now and they have the capital and focus to “eventually” catch up — but it is very much a catch-up game.

That said, they seem to prefer catchup waiting till others explore new tech they swoop in an (claim) to perfect it from a usability pov. I have no reason to suspect they won’t do the same here.

amelius · 2023-07-18T18:28:15

Apple only has to slightly open their wallet to become a DL superpower.

whimsicalism · 2023-07-18T18:04:24

I have not seen Apple demonstrate ML depth in their talent nor have I seen signs that they are hiring extensively for NLP depth.

They will soon be able to train an LLM because it simply has become commoditized, but they just are not a major player in this space at all.

Jcowell · 2023-07-18T20:23:51

> I have not seen Apple demonstrate ML depth in their talent

I thought the ml work they do in photos for text selection and facial recognition is pretty neat.

layoric · 2023-07-18T22:42:17

Their approach is different, they build ML tech that runs on-device, so whatever they developed has to be able to run efficiently on iPhone/iPad etc.

I don’t think we will “hear” about Apple using LLMs either way because they will no doubt call it something different like they always have.

samwillis · 2023-07-18T17:05:50

I very much doubt that.

smoldesu · 2023-07-18T17:50:51

If they want to own the whole stack, I don't think they have much to work with. Their highest-end server chip is a duplex laptop SOC, with maxed-out memory that doesn't even match the lowest-end Grace CPU you can buy (nevermind a fully-networked GH200). Their consumer offerings are competitive, but I don't think Apple Silicon or CoreML is ready to seriously compete with Grace and CUDA.

samwillis · 2023-07-18T18:01:25

While Apple silicone may not be there for training, I think it's probably there for inference. I expect next years device models to launch with exclusive support for Apples own LLM based Siri.

smoldesu · 2023-07-18T18:14:56

Sure. Haswell CPUs from 2014 are "there" for inference if they have AVX support and 8gb of RAM. Inferencing isn't the problem though, not on M1 or Macbooks from 2016. Scaling a desirable (and hopefully open) GPGPU programming interface is. This is bottlenecked by both hardware and software decisions Apple has made, making a "home grown" competitive model much more unlikely in my eyes.

I agree that there is an incentive to put AI models on your OS. I just don't think Apple can own the whole stack if they want to play ball right now.

zirgs · 2023-07-19T06:59:55

Why not? They have cash and they can rent a bunch of GPUs from Amazon.

xbmcuser · 2023-07-18T17:00:38

What makes you think that. Apple is the company that would be most successful at hiding something like this then introduce it as siri ai or something. Not that they are I am just saying Apple keeps everything close to its chest when it comes to products it might introduce in the future.

whimsicalism · 2023-07-18T18:05:10

I work in the field and they just are not hiring the people they need to be hiring.

kossTKR · 2023-07-18T22:23:03

Interesting. The very early adoption of the neural engines in all Apple products would make you think that they had something brewing. Same with the relatively capable m1/2 GPU's. Various models and stable diffusion runs suprisingly fast on these devices and could be optimised to run much, much faster if Apple actually cared, but they weirdly seem not to.

reacharavindh · 2023-07-19T06:21:29

Considering how much Apple likes to retain control, I’m almost sure they won’t want to use someone else’s model even if it were free in every sense of the word.

nerdix · 2023-07-18T16:16:28

I think it's aimed at other social networks.

TikTok has 1 billion monthly active users for instance

matt_holden · 2023-07-18T16:51:30

Look at Snapchat: https://techcrunch.com/2023/02/16/snapchat-announces-750-mil...

Just above 700m MAU. So yeah, probably aimed at their direct competitors in social.

VWWHFSfQ · 2023-07-18T16:36:21

I think TikTok would just use it anyway even if they were denied a license (if they even bothered asking for one). They've never really cared about that kind of stuff.

whimsicalism · 2023-07-18T16:49:59

Anyone who has ever worked in a major social media company knows that this is false - but as another person who has, I will chime in and say this is completely wrong, compliance (especially such obvious compliance) is taken seriously.

fmajid · 2023-07-18T17:59:03

I worked at a company that caught a major Chinese Internet company (not ByteDance/TikTok, but one even larger) red-handed engaging in deliberate app install ad fraud (their app would send forged Android INSTALL_REFERRER intents), so it would not surprise me.

pertymcpert · 2023-07-18T23:22:06

I'm curious if you've worked at a Chinese company?

nonfamous · 2023-07-18T17:36:54

AWS is listed as a partner: https://ai.meta.com/llama/#partnerships

alexeldeib · 2023-07-18T17:46:49

now, that is interesting. Alphabet only big co missing in that list?

e: nvm. Apple not there either.

taneq · 2023-07-19T08:43:06

It's total users, not specifically users of the Llama-2-based product. It's actually quite an elegant way to say "if you're going to produce some super cool new tech with this, let's be friends, unless you're big enough to compete with Facebook in which case rack off."

londons_explore · 2023-07-18T21:32:54

Also, any company with 700 million active users wouldn't have much difficulty reproducing this work.

visarga · 2023-07-18T20:04:42

School is out, it will pick up again.

swyx · 2023-07-18T16:20:57

> OpenAI's ChatGPT hit 100 million MAUs in January, and has gone down since.

poor reading of the numbers. one guy at a bank pulled up similarweb and guesstimated 100m registered users and it went viral. whisper numbers were closer to 50m. but in the 6 months since they have certainly crossed 100m and probably are north of 500m, and only recently dipped.

minimaxir · 2023-07-18T16:23:14

You are countering whisper numbers with more whisper numbers.

1024core · 2023-07-18T16:25:36

Fight fire with fire..... ?

moneywoes · 2023-07-18T16:27:04

How do you find Whisper numbers, it’s open source yea?

minimaxir · 2023-07-18T17:07:05

Whisper numbers are numbers that are secretly shared among industry insiders, not the usage numbers of OpenAI's Whisper.

gentleman11 · 2023-07-18T17:02:20

It's not open source

fmajid · 2023-07-18T18:01:15

He's making a pun referring to OpenAI's open-sourced Whisper voice recognition model:

https://openai.com/research/whisper

costcofries · 2023-07-18T16:18:23

Microsoft announced today that they will use Llama on Azure and Windows scenarios. Source: https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-me...

rahimnathwani · 2023-07-18T16:15:27

> Looks like they are trying to block out competitors

But only existing competitors. If you don't yet have 700MM MAU, the impact of this is only that, after you reach 700MM MAU, you can't get future versions of the Llama models for free. You can still continue to use versions that were released before you reached that threshold.

For reference, neither Instagram nor WhatsApp had 700MM MAU at the time Facebook decided to acquire them.

quickthrower2 · 2023-07-18T19:55:34

Cue the zombie startups who sell to (various tech giants) for a million with their only IP being to loophole this agreement.

aloer · 2023-07-18T21:09:16

Lately I’ve been wondering if a license similar to this but instead based on market cap could be a way to monetize open source projects

E.g. 100k/year for each trillion in market cap, updated yearly. First trillion is free

londons_explore · 2023-07-18T21:28:31

Problem is then it wouldn't be truly open source. And if your project isn't opensource, a lot of other projects can't include/link/build on your project.