Things I don't know about AI (eladgil.com)
248 points by todsacerdoti 71 days ago | 92 comments



I would add to the list a few questions about the evolution of cost dynamics going forward, given the advent of new sequence modeling architectures.

As we all know, Transformers are very expensive to train and run because their compute cost is quadratic in context length: O(N²).

If newer model architectures like RWKV, Mamba, and various others, which incur cost that is linear in context length, O(N), prove as successful as Transformers, the demand for compute for a query of N tokens would decline from O(N²) to O(N).

For a sequence with N = 100,000 tokens, it would mean cost dropping by a factor of 100,000×. That's not peanuts!
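As a rough sketch of that arithmetic (toy numbers only, and only for the attention-like term; the replies below argue about how much of total cost that term really is):

    # Toy comparison of the attention-like term under quadratic vs. linear scaling.
    # Constants and costs are illustrative only, not real model numbers.
    n = 100_000
    quadratic_cost = n ** 2   # Transformer-style attention: O(N^2)
    linear_cost = n           # RWKV/Mamba-style recurrence:  O(N)
    print(quadratic_cost / linear_cost)   # 100000.0 -> the claimed factor of N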

The implications for all market participants would be significant.


> The implications [of a bunch of different architectures] for all market participants would be significant.

I don't know, who cares. How much good training data is there that correctly exercises 100k context windows?

All the costs and complexity are tied up in authoring good training data, not compute. One person can invent an alternative, easier-to-train architecture. One person cannot author 100k-context-window instruct examples with the same quality as the 4k ones. It will still take thousands of people for 100k, just like it took thousands of people for 4k windows, image labeling, etc.


>All the costs and complexity are tied up in authoring good training data, not compute.

No, it's compute. Post-training by human reinforcement is not necessary. Anthropic employs RLAIF and it works just fine. Some don't bother with reinforcement learning at all and just leave it at fine-tuning on instruction-response pairs.

The work being done to pre or post-training data is insignificant in comparison.

You don't need 100k-context instruct-tuning examples. The vast majority of instruct-tuning data is nowhere near maxing out even a 4k context.

100k-context pre-training runs would probably be very helpful, but the thing stopping that from happening, even in domains that regularly match or exceed that context (fiction, law, code, etc.), is the ridiculous compute it would require to train with that much context.


I don't think it's fair to say compute is the blanket bottleneck. Meta has shown that they have plenty of compute to throw at problems but they're still resorting to generating synthetic data to try and improve the quality of their models, i.e. data is the bottleneck and they can afford to burn compute to get ahead there.

Phi-2 has shown that a very reasonable compute budget can produce a phenomenal model if the data is pristine and well chosen, and the model is trained well.


Yes. But consider this scenario: You get some cloud provider to offer you $90m in credits that are convertible to equity. Then you go to an investor and say, you've raised $90m in cash so far, because the bottleneck is compute and compute is expensive. Then they're like, "Oh well a $10m cash investment is reasonable then. Compute is the bottleneck." You go out into the world and say you have a $100m raise on a $500m post for your idea to train an LLM on some niche.

Nevermind that if the cloud provider is willing to forgo $90m in hard cash to give somebody else those GPUs "for free," it must not be that expensive in actuality. I mean from their point of view, $90m in cloud credits might cost them only $1m to provide your first year, when you get around to using just a little of them.

I guess I am saying that there are a ton of people in startups right now, in this gold rush, for whom "compute is the bottleneck" is an essential narrative to their survival / reason for investing cash. It's not just the chip vendors and the cloud providers. My scenario illustrates how "compute is the bottleneck" turns bad napkin ideas into $10m slush funds. So 4/5 actors in Elad's chart benefit from this being the case.

It's the foundation-model people, like Meta as you're saying, who are not bottlenecked by compute in either fiction or reality. I'd argue that not only is hardly anyone bottlenecked by compute, but that the people claiming that they are bottlenecked by compute are either guilty of not being informed enough about their problems or guilty of making up a story that results in the easiest-to-obtain investment dollars. Like imagine the counterfactual: Anthropic goes and pitches a synthetic data thing. The tech community mocks that, even if it makes a ton of sense and is responsible for huge demos like Sora.


I'm not saying data isn't important or even that it's not the most important thing.

I'm making a statement of the current economics. Most of the cost of building an LLM comes from compute. The changes we make to the data, while meaningful, are dwarfed in cost by the compute required to train.


That's not true in the case of the phi-2 model though, and if the approach of phi-2 was scaled up to a class of hyper-efficient LLMs I think it would continue to not be true while also producing SOTA results.


>That's not true in the case of the phi-2 model though

Isn't it? Genuinely, what exactly did phi-2 do that is economically more expensive than compute?

From what I read, phi-2's training data came from:

1. Synthetic data generation from more performant models. phi-2 is essentially not possible without spending the compute on that better model in the first place. And inference is much cheaper than training.

2. Selecting for text data of a certain kind, e.g., textbooks.

Meanwhile, it still took 14 days to train this tiny model on 96 A100 GPUs.


I'm sure the man-hour cost of dataset curation on Phi-2 was higher than its training compute cost. Regarding synthetic data, I think that could technically be classified as an asset that is not consumed in the training process, and thus you could amortize its compute cost over quite a bit (or even monetize it). Given that, even if the compute for synthetic data put the total compute cost over curation costs I don't think it's on an order that would contradict my point.


> I'm sure the man-hour cost of dataset curation on Phi-2 was higher than its training compute cost.

There's no indication of human curation.

From the first paper

> We annotate the quality of a small subset of these files (about 100k samples) using GPT-4: given a code snippet, the model is prompted to “determine its educational value for a student whose goal is to learn basic coding concepts”. We then use this annotated dataset to train a random forest classifier that predicts the quality of a file/sample using its output embedding from a pretrained codegen model as features. We note that unlike GPT-3.5, which we use extensively to generate synthetic content (discussed below), we use GPT-4 minimally only for annotations on the quality of a small subset of The Stack and StackOverflow samples.

>Given that, even if the compute for synthetic data put the total compute cost over curation costs I don't think it's on an order that would contradict my point.

Don't worry, that's not even necessary. 96 A100 GPUs for 2 weeks straight is a ~50k USD endeavor. It'll be somewhat cheaper for the likes of Microsoft, but so will inference from GPT-4 and GPT-3.5.

Even if we forget about compute costs of those more powerful models, inference costs of the generated tokens won't be anywhere near that amount.
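For reference, a back-of-the-envelope version of that ~$50k figure; the per-GPU-hour rate is an assumed market-ish price, not anyone's actual internal cost:

    # Rough cost of the phi-2 run described above: 96 A100s for 14 days.
    gpus = 96
    days = 14
    gpu_hours = gpus * days * 24                  # 32,256 A100-hours
    assumed_usd_per_gpu_hour = 1.5                # assumption: rough A100 market rate
    print(gpu_hours * assumed_usd_per_gpu_hour)   # ~48,000 USD, i.e. roughly $50k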


> The vast majority of instruct-tuning data is nowhere near maxing out even a 4k context

This should indicate for you how much the problem really is the training data.


It shows you don't need an instruction-tuning dataset that wide to create a model that follows instructions over large contexts.

I don't know how else to tell you that you're completely wrong on what costs constitute the majority of LLM development/training.


> constitute the majority of LLM development/training.

Compute will constitute the majority of development and training costs of LLMs that perform poorly.


Compute being the majority of costs is the reality for every single LLM in existence. This isn't an opinion I'm trying to convince you of.


Compute isn't the majority of costs at all. It's primarily memory capacity and memory bandwidth. Tenstorrent's Grayskull chip has 384 TOPS and only a paltry 8GB memory capacity with roughly 120GB/s memory bandwidth. The compute vs. memory ratio is so heavily skewed towards compute it's ridiculous. 3200 OPS per byte/s of memory bandwidth. 48000 OPS per byte of memory capacity. How can compute possibly ever be a bottleneck?

The moment you start thinking about connecting these together to form a 100 TB system, you need 12.5k nodes or $10 million. A billion dollar budget means you can afford multiple one petabyte systems. Admittedly, I am ignoring the cost of the data center itself, but given enough money, there is no bottleneck on the compute or memory side at all!
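Spelling out the ratios above (this just reproduces the quoted spec numbers):

    # Tenstorrent Grayskull figures quoted above.
    tops = 384e12            # 384 TOPS of compute
    mem_bw = 120e9           # ~120 GB/s memory bandwidth
    mem_cap = 8e9            # 8 GB memory capacity

    print(tops / mem_bw)     # 3200.0  ops per byte/s of bandwidth
    print(tops / mem_cap)    # 48000.0 ops per byte of capacity
    print(100e12 / mem_cap)  # 12500.0 nodes for a 100 TB system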

How exactly are you going to find enough data to feed these monstrous petabyte scale models? You won't! Your models will stay small and become cheap over time!


Memory is part of compute. I have no idea why you've separated them.

>How exactly are you going to find enough data to feed these monstrous petabyte scale models? You won't! Your models will stay small and become cheap over time!

People have this idea that we're stretching the limit of tokens to feed to models. Not even close.

This is a 30T token dataset https://www.together.ai/blog/redpajama-data-v2

It's just web-scraped, which means all the writing largely inaccessible on the web (swaths of fiction, textbooks, papers, etc.) and in books is not a part of it. It doesn't contain content in some of the most-written languages on earth (e.g., Chinese; you can easily get trillions of tokens of Chinese alone).


This is extremely wrong.

The attention component that's quadratic is a relatively small portion of compute.


I'm the author of the parent comment, and just saw your response.

First of all, by "a factor of 100,000×" I mean a model-specific factor, call it c, of 100,000x, but looking at what I wrote I can see how it could be misinterpreted. Second, as sequences become longer, the quadratic component will become more important.


> For a sequence with N = 100,000 tokens, it would mean cost dropping by a factor of 100,000×

I'm not sure I understand the intended interpretation of this. Concretely speaking, if it cost CoolAI 100k seconds of compute to process a sequence of length 100k, it would not take them 1 second now.

I agree that as sequences become longer, the quadratic component will become more important. But as models get bigger, the attention component also becomes less important.

For example, to take a concrete model (say Llama-70B), it takes about 1.4e16 MLP FLOPs (70 billion * 100000 * 2) to process 100k tokens. The attention component takes about 6.5e15 FLOPS (80 [layers] * 100k [sequence length]^2 * 8192 [hidden dim]).

So even if attention turned constant it would reduce runtime by about 30% with today's model at 100k sequence length.
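A minimal sketch of that estimate, using the commonly cited Llama-2-70B figures (80 layers, 8192 hidden dim); it's the same napkin arithmetic as above, not a profiler measurement:

    # Rough FLOP split for one pass over a 100k-token sequence, Llama-70B-class model.
    params = 70e9
    seq_len = 100_000
    layers = 80
    hidden_dim = 8192

    mlp_flops = 2 * params * seq_len               # ~1.4e16 (params * tokens * 2)
    attn_flops = layers * seq_len**2 * hidden_dim  # ~6.5e15 (the quadratic term)

    print(mlp_flops, attn_flops)
    print(attn_flops / (mlp_flops + attn_flops))   # ~0.32 -> removing it saves ~30%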


> I'm not sure I understand the intended interpretation of this. Concretely speaking, if it cost CoolAI 100k seconds of compute to process a sequence of length 100k, it would not take them 1 second now.

Cost would be lower by a factor of sequence length N = 100,000. If you call the factor C, cost would be lower by C × N. For most models and context lengths today, C is below 1. As we continue to increase sequence length N -- say, as we go from 100K to 100M tokens, incorporating multiple modalities -- factor C will increase toward 1 for all Transformers.

Again, I see how what I wrote could be misinterpreted. Sorry about that!


To anyone doubting this, note that llama.cpp does not slow down by a factor of 16 when you pass -c 2048 instead of -c 512.


I think they are correct; do you have a source? From my knowledge, the only other components are the fully connected networks, which are not big contributors.


It's quadratic, because of the dot product in the attention mechanism.

You can use K-V caching to get rid of a lot of the quadratic runtime that comes from redundant matrix multiplications, but after you have cached everything, you still need to calculate the dot product k_i * q_j, with i, j being the indices of the tokens. With n tokens, you will get O(n*n).

But you have to remember that this is only n^2 multiplications. It's not exactly the end of the world at context sizes of 32k, for example. It only gets nasty in the hundred thousands to millions.

Here is the source I used: https://sebastianraschka.com/blog/2023/self-attention-from-s...
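A minimal NumPy sketch of where the n² shows up (single head, no masking, no caching):

    import numpy as np

    def naive_attention(Q, K, V):
        # Q, K, V have shape (n, d). The score matrix is (n, n): every query is
        # dotted with every key, which is the k_i * q_j term described above.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V      # also O(n^2 * d)

    n, d = 1024, 64
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    out = naive_attention(Q, K, V)   # compute and memory both grow with n^2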


https://news.ycombinator.com/item?id=39475528

My source is me :) I work at PyTorch on ML compilers.

If you don't believe me, perhaps you'll believe Karpathy's diagram (and the general discussion in the thread): https://twitter.com/karpathy/status/1658161721251602432


For small values of N, the linear terms of the transformer dominate. At the end of the day, a double layer of 764*2048 is still north of 3.1 MM flops/token/layer.


Huh? I thought the issue before ringattention is the memory requirement of the softmax layer, since you have to load the whole matrix in at once? It's O(s^2) no?

Also hi horace.


Who is this :think:

But no, FlashAttention already solved the memory requirements of attention. RingAttention is primarily useful for parallelizing across the sequence component.


It's camel.

How do you do matrix-vector attention without keeping the full matrix in cache? Surely you don't just load/unload it a million times.


> architectures like RWKV, Mamba, and various others, which incur cost that is linear in context length, O(N)

These are actually O(N*M) where M is independent of inference time context length, but it controls the actual amount of information that gets retained during inference. Ignoring that factor of M is misleading.


All models are wrong. Some are useful.

The "linear" models I'm aware of have a cost of O(N * M), where M is the amount of information retained _per token_ (with some exceptions that store less state but still have intermediate computations for some M about the same size as the others, and with less state you necessarily have forgetting issues and can't scale to unbounded sequences anyway).

Contrast that with a transformer, which has a cost of O(N^2 * M), where M is still the amount of information retained _per token_. Dividing the two you still find a multiplicative speedup of O(N).

For accurate cost estimates or any number of other use cases, yes, the M is critical. When comparing architectures (or when talking about a single architecture with the implicit understanding that whatever you say is meant to be taken with respect to other architectures), it's conventional to note that the M term always appears as a multiplicative factor on top of the rest of the costs. You drop it because it's obviously there, but it's just visual noise that doesn't help you understand where the models sit with respect to each other. You only include it if it's relevant, like if you found a way to get similar performance with a smaller M.

It's kind of like how we always drop the log term on a hash table (equivalently, we work in the "word RAM" model of computation), but bignum algorithm papers carefully consider every log(log(log(N))) that crops up. It's relevant in some contexts and not in others. You can't ever include enough context to fully represent your thoughts to another person, so you truncate somewhere, hopefully capturing the essence of what matters.


In the general case that would be the same as hierarchy collapse of PH, which would prove P=NP


As in the case of constraint solvers, there are tons of heuristics that have been applied over the years to make performance look like it scales somewhat linearly on many or most problems. This doesn't mean P == NP.


Most of those have had practical decider functions.

If you consider NP-complete as the intersection of NP, which is a decision problem, and NP-hard, for which a decider function isn't possible, look at the success of heuristics between those two sets.

Existential quantifiers on the second order terms in NP is the same thing.

Overparameterization can be thought of as less lossy compression, and attention can be thought of as improved retrieval, or as additional expressiveness by not missing out on as many outputs using binary output.

You can consider how LLMs tend to append, resulting in a telescoping reduction, as one of those approximation reductions.

But this is due to the combinatorial optimization, which is far more difficult.

Another way to think about it is that PH is equal to the set of boolean queries on a concurrent random access machine using exponentially many processors and constant time.

If your datasets have the Markovian property and are close to ergodic, there are options, but you probably wouldn't need to resort to attention and overparameterization in that case.

The word sense disambiguation problem is probably another lens. That is harder yet but may be a way to think about what I am trying to explain.


Thanks for this reply. HN paid for itself again today


Well, it would depend on how many tokens from the vocabulary are required, in the worst case, to solve NP problems.

Think of the tokens as symbols, and of the vocabulary as a finite alphabet of symbols, in a formal system.

In other words, no, it would NOT necessarily prove P = NP.


The Singleton case is NP-complete

If you consider NP as second-order existential logic, meaning that given an x, you will get a y. While there may be many true values that aren't y, it will help.

Attention requires feed-forward networks, which approximate a DAG; you can think of attention as run-time reweighting to change which y value is returned.

PH is a query expressible in second-order logic, which would allow for 'for any' at the second level.

The reduction in the general case to just log-linear time, O(n log n), would cause issues; there will not be any movement from quadratic time to linear time without massive new discoveries.

It is inherent from the restrictions of ANNs having binary output.

Considering FP and FNP vs P and NP may help as they make that property clearer than the concept of second-order existential logic does in some contexts.


Thank you. Right now, my puny little brain is searching through what I would describe as "those distant and slightly vague memories" that make up my poor man's knowledge of computational complexity theory.

> Attention requires feed-forward networks, which approximate a DAG; you can think of attention as run-time reweighting to change which y value is returned.

Isn't that true for linear RNN mechanisms too? Note that linear RNNs like RWKV and Mamba are linear in the number of tokens in the sequence, but not in the number of features per token.


You don't typically get to the Sigma_2^P level unless you are a sadist or forced to.

Attention works differently in RNNs, and it may be able to find different tractable forms, and may solve different problems, but increasing generalization is a reduction of compression in the general case.

A many to one reduction to a recursively enumerable set gets you to finite time, a reduction to NP gets you to quadratic time.

Obviously those are upper limits and specific instances or classes of instances may do better. P being inside NP is a good example.

All ANNs are binary output, which helps with that. SNNs, or spiking neural networks, which better model cortical neurons and have continuous output, have problems with being uncomputable without much complexity, as an example.

That can be viewed as the problems with computable numbers in place of the reals. But there are lots of interesting and practical things to solve. It is just those solutions will be more domain specific IMHO.


Hmm...

You're right: It's quite possible that the new crop of linear RNNs may not be able to solve the same problems as Transformers.

It could also be that we will have to increase the depth of these linear RNNs (possibly even dynamically) to enable them to perform comparably well on all problems. Right now it's hard to tell for sure. I don't know.

What I do know is that recently proposed linear RNNs are performing comparably to Transformers at similar model sizes, but have not yet been evaluated at state-of-the-art scales.


My hunch is it will be horses for courses.


On the one hand, "no". On the other, that's an interesting observation. There are known Θ(N^2) problems, which a linear transformer substitute couldn't possibly solve (in a single pass), and which current transformers might be able to. It's an open question how many of the current LLM capabilities rely on quadratic compute, but linear transformers might be a pipe dream for some current use cases.

Generally though, I don't think that's quite the same. Collapsing all O(N^2) problems down to O(N) would collapse the hierarchy and prove P=NP, but it's blatantly impossible and would prove a lot of falsehoods too.


The known lower bound for self-attention is n^(2-epsilon).

PH, or the polynomial hierarchy, is the collapse I was talking about, but note I was talking about the global case and not individual instances or families of instances.

Schaefer's Dichotomy Theorem shows that HORNSAT is solvable, given those constraints.

I found this paper useful as it reframes it as topology, which is helpful for building intuition, at least for me.

https://arxiv.org/abs/2307.03446

Of the proposed linear self-attention-like solutions I have seen, they tend to approach it more from what can be reduced to the clustering problem.

A lens for that is FTA vs FMEA, being thought of as top down vs bottom up.

Top down is cheaper but may miss things, bottom up doesn't miss things but can result in combinatorial explosion.

This paper covers some optimization problems that are NP-hard to approximate to within a given approximation ratio.

Being able to do so would show P=NP

https://dl.acm.org/doi/pdf/10.1145/321958.321975

But that is under the assumption that increasing the self attention window was to improve accuracy, not to trade accuracy for efficiency.

While this paper is about causal inference and not PAC learning, it is the reason most 'causal AI' is just stochastic search and event trees.

https://www.cs.cornell.edu/home/halpern/papers/newcause.pdf


With the big-O notation, are there also possibilities for either the whole process to go to O(n log n) or possibly subportions to the O(log n) range? There are also a lot of math operations over the years that have gone to O(N^a) where 1 < a < 2.


From the article:

"It is important to note that the scale of investments being made by these cloud providers is dwarfed by actual cloud revenue. For example, Azure from Microsoft generates $25B in revenue a quarter. The ~$10B OpenAI investment by Microsoft is roughly 6 weeks of Azure revenue. This suggests the cloud business (at least for now) is more important than any one model set for Azure."

The most interesting implication in the short term is that the impact of model availability on cloud provider choice is more important than the models themselves. Will organizations choose Azure because it offers GPT-4, or AWS because it offers Claude?

This also explains reciprocal VC structures whereby e.g. Amazon invests billions in Anthropic while Anthropic in turn promises to spend billions on AWS cloud resources. Anthropic gets a higher valuation and broader distribution, while AWS gets more customers. It would be a fascinating outcome if the predicted oligopoly in fact ends up as several exclusive partnerships.


> The ~$10B OpenAI investment by Microsoft is roughly 6 weeks of Azure revenue.

Revenue isn't what is important. Profit is, since profit is the money you have left after costs to invest. But since cloud is highly profitable as it is, the point still stands, I suppose.

> It would be a fascinating outcome if the predicted oligopoly in fact ends up as several exclusive partnerships.

I thought people were predicting independent silos? Where highly profitable and cash rich companies like apple, microsoft, google, facebook, etc own their entire stack or exist as their own ecosystem. Where the likes of apple does everything in-house - AI, cloud, OS, chips, etc. The whole she-bang.


The relative scale still stands - Microsoft's net income in October, November, and December 2023 was over $21 billion, according to https://www.microsoft.com/en-us/Investor/earnings/FY-2024-Q2


Stupid question, but where would their competitive advantages be in these silos? Sounds like at that point cloud+AI is just a commodity.


Opened the comments for exactly this. This person seems very nice but also convinced that they’re a business maven, which I think leads them to making “clear” observations like that. Microsoft hasn’t invested a full year of revenue into OpenAI, so it’s not that invested? Meh


Doubtful, at the end of the day the models are just like any other API. No reason not to sign up for all the big players (GCP/Azure/AWS) and have access to the various models. It should have no weight as to where you build out your infra.


Would love to get any views / counterpoints on these topics.

Eg which AI infra approach thrives relative to the big clouds?

What new architectures are needed for agents vs transformers?

Etc...

Lots of stuff to figure out....


There is a decent (<50%, >20%) chance that frontier foundation models are less oligopoly like than it seems. The reason is that there are so many levers to pull, so much low hanging fruit.

For example:

* Read the Bloomberg GPT paper - they create their own tokenizer. For specialized domains (finance, law, medicine, etc.) the vocabulary is very different and there is likely a lot to do here, where individual tokens really need to map to specific concepts, and having a concept captured in several tokens makes it too hard to learn on limited domain data.

* Data - so many ways to do different data - more/less, cleaner, "better" on some dimension.

* Read the recent papers on different decoding strategies - there seems to be a lot to do here.

* Model architecture (SSM etc.). If you speak to people who aren't even researchers, they have 10 ideas around architecture and some of them are decent-sounding ideas - lots of low-hanging fruit.

* System architecture - i.e. likely to see more and more "models" served via API which are actually systems of several model calls, and there is a lot to do here.

* Hardware - lower precision etc. is likely to make training much cheaper.

It's reasonably likely (again, guessing < 50% > 20%) that this large set of levers to pull become ways to see constant leap-frogging for years and years. Or, at least they become choices/trade-offs rather than strictly "better".


I agree this is a potential outcome. One big question is generalizability versus niche models. For example, is the best legal model a frontier model + a giant context window + RAG? Or is it a niche model trained or fine tuned for law?

Right now at least people seem to decouple some measures of how smart the model is from the knowledge base, and at least for now the really big models seem smartest. So part of the question as well is how insightful / synthesis-centric the model needs to be versus effectively doing regressions....


Frontier model + rag is good when you need cross-discipline abilities and general knowledge, niche models are best when the domain is somewhat self contained (for instance, if you wanted a model that is amazing at role playing certain types of characters).

The future is model graphs with networked mixtures of experts, where models know about other models and can call them as part of recursive prompts, with some sort of online training to tune the weights of the model graph.
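Purely as a hypothetical illustration of that idea (the names and the keyword routing here are invented; in this vision the routing itself would be learned and tuned online, not a keyword check):

    # Toy "model graph": a router that dispatches prompts to specialist models,
    # which could themselves call other models recursively. Hypothetical sketch only.
    from typing import Callable, Dict

    Model = Callable[[str], str]

    def roleplay_model(prompt: str) -> str:
        return f"[niche roleplay model handles: {prompt}]"

    def general_model(prompt: str) -> str:
        return f"[general frontier model handles: {prompt}]"

    EXPERTS: Dict[str, Model] = {"roleplay": roleplay_model, "general": general_model}

    def route(prompt: str) -> str:
        # Stand-in for learned, online-tuned routing weights over the model graph.
        key = "roleplay" if "in character" in prompt.lower() else "general"
        return EXPERTS[key](prompt)

    print(route("Stay in character as a ship's AI and greet the crew."))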


> The future is model graphs with networked mixtures of experts, where models know about other models and can call them as part of recursive prompts, with some sort of online training to tune the weights of the model graph.

What's the difference between that and combining all of the models into a single model? Aren't you just introducing limitations in communication and training between different parts of that über-model, limitations that may as well be encoded into the single model if they're useful? Are you just partitioning for training performance? Which is a big deal, of course, but it just seems like guessing the right partitioning and communication limitations is not going to be straightforward compared to the usual stupid "throw it all in one big pile and let it work itself out" approach.


The limitation is the amount of model you can fit on your hardware, and also sometimes information about one domain can incorrectly introduce biases in another which are very hard to fix, so training on one domain only will produce much better results.


Yup, it's unclear. The current ~consensus is "general purpose frontier model + very sophisticated RAG/system architecture" for legal as an example. I'm building something here using this idea and think it's 50/50 (at best) that I'm on the right path. It's quite easy to build very clever-sounding but often wrong insights into various legal agreements (M&A docs for example). When looking at the tokenization, the training data, decode, architecture (lots of guesses) of the big models, there are a lot of things where the knobs seem turned slightly incorrectly for the domain.

Some of the domains are so large that a specialized model might seem niche but the value prop is potentially astronomical.


I haven't listened to your great podcasts, so it's hard to say what is not covered.

Architectures matter a lot less than data. "Knowledge" and "reasoning" in LLMs is a manifestation of instruct-style data. It won't matter how much cheaper training gets if there is limited instruct data for use cases.

How do you make 100k context window data for example? Still need thousands of people. Same with so-called niches.

Maybe it turns out to be a complex coordination problem to share the data. That's bad for equity investors and giant companies. Anyway, all of this would cost less than the moon landing so it's practicable, you don't need cheaper, you just need risk-taking.

The obviousness of the path from here to there means it's not about innovation. It's all about strategy.

If Google could marshal $10b for Stadia it could spend $10b on generating 100k context window instruct style data and have the best model. It could also synthesize videos from Unity/Unreal for Sora-style generation. It would just be very hard in an org with 100,000+ people to spend $10b on 10 developers and 10,000 writers compared to 400 developers and 3,600 product managers and other egos. At the end of the day you are revisiting the weaknesses that brought Google and other big companies to this mess in the first place.

Anyway I personally think the biggest weakness with ChatGPT and the chat-style UX is that it feels like work. Netflix, TikTok, etc. don't feel like work. Nobody at Google (or OpenAI for that matter) knows how to make stuff that doesn't feel like work. And you can't measure "fun." So the biggest thing to figure out is how much technical stuff matters in a world where people can be poached here and there and walk out with the whole architecture in their heads, versus the non-technical stuff that takes decades of hard-won personal experiences and strongly held opinions, like answers to questions such as "How do you make AI fun?"


> "How do you make AI fun?"

Bring back text-based adventure games.


People go for this level of obviousness and it doesn't work. I have no doubt that a meme text game will find some level of literal objective success. But it will still suck. Meme games are a terrible business, both in terms of profits and equity.

This also speaks to why OpenAI, Google and the other developers will struggle to create anything that feels like fun: they will chase obvious stuff like this, and they will think it's similar to all problems. And in reality, you don't need any testing or data or whatever to know that people hate reading in video games, that the best video game writing is worse than the average movie's screenplay, and that most dialogue is extremely tedious, so why are you going to try to make it even worse by making it generated by an AI?


Arguably one of the earliest consumer use cases that found footing was AI girlfriend/boyfriend. Large amounts of revenue spread across many small players are generated here but it's glossed over due to the category.


Weird how close your user name is to Joaquin Phoenix, star of the film "Her" centered around an AI girlfriend.


Given how widespread romance scam schemes already are (the "market" is at least $0.5 billion/year), I would expect any reasonably functioning AI girlfriend/boyfriend model to be massively (ab)used also against unwilling/unwitting "partners".


I think one related area we'll start seeing more of in the future is "resurrected" companions. You have a terminally ill family member, so you train a model on a bunch of video recordings of them, then you can talk to "them" after they've shuffled off this mortal coil.


Licensing my soul as closed source now ..


DNR: Do Not Retrain


Be right back.


Do you think there is the possibility of consumer or end-user apps collecting enough specialized data to move downwards on your graph to infra and foundational models?


I think this is already true for things based on diffusion models - e.g. Midjourney, Pika etc

For LLMs, Character and ChatGPT are arguably two vertically integrated consumer apps (with some B2B applications for ChatGPT as well)


Agents and transformers are kinda different things, built on different layers.


Surely we are in an AI bubble not completely unlike the dotcom bubble, right? The promise and potential for AI to be a long-term growth driver exists but the current market euphoria can’t be sustained in the short-term… at least that’s what I’ve told myself every time NVDA jumps and I feel like a loser.


If it's a bubble, it's early days and not likely to pop soon. If it's a dotcom bubble, that means a road bump for the tech as it continues to take over the world.


The Google CEO said that this (unexpected!) invention is more important than fire or electricity. In that light, I think it’s arrogant to try to soothsay like this. Which is tough advice cause this is otherwise great analysis, and acknowledging the great unknown feels scary and of dubious value. But I think it’s the only honest way forward


Or this one: How is AI not outsourcing all over again? People pay almost as much for work that needs to be checked before it can be relied on. Sure, there is value, but there is also a lot of risk and questionable margins depending on the context.


One difference is that we can expect models to improve, become more reliable, cheaper and need less hand-holding.


Would it be possible to start an open-source AI project that works like SETI@home, running on people's PCs in their spare time, where they collaborate to train an AI model? People could also help by providing feedback to the models.



The confusion that everybody feels right now about "what do I do in this rapidly evolving space" is likely more an artifact of one of the most vertical leading edges of a hype curve in a generation. There is so much piling on, at absolutely tremendous investment amounts, that it's not really even possible to pick a launch point for a business plan that would survive even months at this point.

The second problem is more obvious, but not talked about in concrete terms yet. As these technologies mature, the legal, policy, and social constructs around them will also mature, and will shape the future of what is coming as much as any technical or business decision. Some examples:

- Imagine if courts really find that these models are in gross copyright violation and that model builders will need to work through a scheme similar to sampling in music in order to do their work into the future. Not only will it dramatically increase the cost of model development, but will drive some of the work to alternative sources of data if it's cheaper. What effect will this have on these models?

- Suppose these models are found to be reliable enough for certain mission critical work (healthcare, defense, etc.) and they lead a user down a path that results in harm or death? Who's at fault? Suppose it becomes a liability of the company offering the model? How does the insurance market respond?

- What if these models become good enough, and the automation around them as well, that we start turning them into fully automated agents that make up a significant part of our economy? Do people still have jobs? Or will these things just become the equivalent of hiring more people into an economy and we work in a blended AI/Human Intelligence world? What happens when these things decide to tank capital markets or choke supply chains?

- What about the long-term use of these models by individuals with severe mental impairments or psychological issues? What happens when those models go down, or produce a response that's not liked, or there's a loss of historical conversational data? What about otherwise normal people who end up down a path of social isolation?

- There's an entire legal market about to be created for remote-work fraud where somebody just replaces themself with one of these models for 90% of their work. It's sci-fi utopia, but fraudulent if you misrepresent who's doing the work in many cases.

- Some company, somewhere, will crack the nut on figuring out how to improve the speed and performance of these models, and build custom hardware that offloads the entire effort onto rapidly commoditizing systems. When multi-modal LLMs and diffusion models come preloaded on disposable $.10 SOCs that are put on cereal boxes and in Birthday cards, we'll be well beyond every one of these questions. But we'll be firmly in a different future.

We have yet to encounter most of these things to the point that as a society we have to build responses, and that itself points to the immaturity of both these technologies and the use cases for which they will be adopted.

Which cloud to pick, and why consumers aren't spending thousands of dollars getting the complex systems in place to use this stuff, aren't even the problems that are going to become interesting in the next 10 years.


>Suppose these models are found to be reliable enough for certain mission critical work (healthcare, defense, etc.) and they lead a user down a path that results in harm or death? Who's at fault? Suppose it becomes a liability of the company offering the model? How does the insurance market respond?

So much this. I think a lot of people overlook this aspect so much. I wonder if we'll see a corporate push to make people think of individual instantiations of LLMs as being their own entities, for scapegoating purposes. It's completely ludicrous, but you know they'd just love to point a finger at Machine #73583 and make that take all the blame.


> corporate push to make people think of individual instantiations of LLMs as being their own entities, for scapegoating purposes

an airline already tried this.

"Air Canada argues it cannot be held liable for information provided by one of its agents, servants, or representatives – including a chatbot. It does not explain why it believes that is the case. In effect, Air Canada suggests the chatbot is a separate legal entity that is responsible for its own actions."

https://www.forbes.com/sites/marisagarcia/2024/02/19/what-ai...



Also, the original incident happened in November 2022, which is before the release of ChatGPT, so I'm not sure the chatbot was even an LLM. It might just've been programmed wrong.


- What happens if everyone realises that the areas where LLMs can be usefully applied are few and not particularly lucrative?


That's the far right of the hype curve.


What happens when AI-generated content organically becomes a large percentage of the user-generated corpus that is used for training, with no way to differentiate it from human-generated content?

We may look back on 2022/3 as the last time we had a training set that was clean. The fact that it was already polluted with SEO garbage will be a quaint problem.


I've been thinking about this a lot lately. I guess the 4D chess move is to think about what will be the next market that will benefit from whatever this effect causes.

I have a completely unworked thought that the music industry might have some clues. As music becomes more and more entrenched in codified process, i.e. pop-culture music, it all starts to become very generic.

I'm starting to notice that music is returning to the old way of discovery, where you hear music from a third party that is playing it and ask them who that is. There is no longer a good bubble-to-the-top source like radio, because the corps have cannibalized themselves with production-line pop music that has to play on their vertical stack of companies or be licensed. Seems the stuff I find now is some artist with fewer than 100k followers, like (I'm guessing) back in the pre-internet days.


The current thinking is that it becomes a negative feedback loop, a kind of lossy "compression" of our cultural output. Combined with normal bitrot we'll eventually start losing components of our shared culture until civilization ends.


Well, hopefully we’ll still have toast and jam, so there’s that.

This suggests a sort of cultural event horizon beyond which AI cannot see. Five years from now, the way actual people talk and think about a new topic will be drowned in AI content, creating a gulf between reality as people actually see it and the AI universe. Unfortunately for many people their reality is already slated to become AI-mediated.


It doesn't really seem to be much of a problem. Data quality is important, but there is plenty of incorrect information in the training data whether or not it's AI generated. And training on synthetic data works fine.

I see a lot of uninformed people claiming this is going to doom AI though.


I'm no AI expert, but I'm not uninformed. Can you explain what "informed" means in this context? I'm aware of the use of synthetic data for training in the context of a curated training effort, with HITL checking and other controls.

What we're talking about here is a world where 1) the corpus is polluted with an unknown quantity of unlabeled AI-generated content, and 2) reputational indicators (link counts, social media sharing/likes) may amplify AI-generated content, and lead to similar content being intentionally AI-generated.

At that point, can the incorrect info in the training set really be controlled for using noise reduction or other controls?


I like to think of it like mad cow: that’s what you get when you feed animals to animals. Same with AI.


Not trying to make a point but just throwing another layer in the game.

There is a tribe with a long history of ritual cannibalism that has developed a genetic modification (mutation) that prevents them from getting this disease.


This is one of those moments where Peter Thiel's "last mover advantage" makes a whole lot of sense.



