
Yes, could be. Not sure how or even if anyone could prove it, though.



This should be pretty much true by default. Remember that your dataset is a proxy for some real (but almost surely intractable) distribution.

Now let's think about filling the space with p-balls, each bounded by its nearest points so that no data point lies inside any ball. We've turned this into a sphere-packing problem, and we can talk about the sizes and volumes of those spheres.

So if we uniformly fill our real distribution with data, the average volume of those spheres decreases. If we fill it non-uniformly, the average ball still shrinks, but the largest ball shrinks more slowly (that's the case where we aren't properly covering the data in some region). Either way, the more data you add, the more the balls shrink, which essentially means the distance between data points decreases. The harder question is the under-represented regions: finding them and figuring out how to sample them properly.
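
If you want to see this numerically, here's a minimal sketch (just numpy/scipy, with a uniform cube standing in for the "real" distribution): as n grows, the average nearest-neighbor distance, i.e. the radius of those balls, keeps shrinking.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)

    def mean_nn_distance(n, d=5):
        # n uniform samples in the unit cube [0,1]^d
        X = rng.uniform(size=(n, d))
        tree = cKDTree(X)
        # k=2 because each point's nearest neighbor at distance 0 is itself
        dists, _ = tree.query(X, k=2)
        return dists[:, 1].mean()

    for n in (100, 1_000, 10_000):
        print(n, mean_nn_distance(n))
    # The average nearest-neighbor distance falls roughly like n**(-1/d):
    # more data means smaller "empty" balls between points.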

Another quick trick you can use to convince yourself is thinking about basis vectors (this won't be robust, btw, but it's a good starting point). In high dimensions, two randomly sampled vectors are almost certainly close to orthogonal. So think of drawing basis vectors (independent vectors that span our space). As we fill in data, the first vectors (or data points) we draw are very likely to be independent in some way, but as we add more, the likelihood that a new one is orthogonal to the rest decreases. Of course your basis vectors don't need to be orthogonal, but that's mostly semantics, since we can always work in a space where that's true.
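
A quick numerical sanity check of that claim (just numpy, not a proof): draw pairs of random Gaussian vectors and look at their cosine similarity as the dimension grows.

    import numpy as np

    rng = np.random.default_rng(0)

    def mean_abs_cosine(d, n_pairs=1_000):
        # cosine similarity of random Gaussian vector pairs in R^d
        a = rng.normal(size=(n_pairs, d))
        b = rng.normal(size=(n_pairs, d))
        cos = np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
        return np.abs(cos).mean()

    for d in (3, 30, 300, 3_000):
        print(d, mean_abs_cosine(d))
    # |cos| concentrates around 1/sqrt(d): in high dimensions two random
    # vectors are almost surely nearly orthogonal, so the first data points
    # tend to look independent, while each new one is less and less likely
    # to be independent of everything already drawn.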


I agree, but my question was not whether distance between data points tends to decrease as dataset size grows, but whether that is the reason why the number of training tokens required per parameter declines. It could be, but proving it would require a better understanding of how and why these giant AI models work.


Wasn't your question about how *independent* the data is?

We could talk about this in different ways, like variance. But I'm failing to see how I didn't answer your question. Did I miscommunicate? Did I misunderstand?

The model is learning off of statistics, so most of your information gain comes from more independent data. Think of the space we are talking about as "knowledge," and our "intelligence" as how easy it is to get to any point in that space. The vector view above might help here: you can step in the direction of any of the vectors you have, and the question is how you combine them to get to your final point, how many you have to use (how many "steps" away you are), and, of course, how close you can get to your final destination. As you can imagine from my previous comment, for any given point you'll need fewer steps to reach it if you have more vectors, but the utility of each vector also decreases as you add more. (Generally. Of course if you have a gap in knowledge you can get a big help from a single vector that goes into that area, but let's leave that aside.)
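
A toy way to put numbers on that diminishing utility (purely illustrative, nothing to do with how a real model is trained): project a random target point onto the span of k random "knowledge" vectors and watch how much of the target is still out of reach as k grows.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 200
    target = rng.normal(size=d)

    def residual_norm(k):
        # span of k random vectors, via an orthonormal basis
        V = rng.normal(size=(d, k))
        Q, _ = np.linalg.qr(V)
        proj = Q @ (Q.T @ target)   # closest point to the target in that span
        return np.linalg.norm(target - proj)

    for k in (10, 50, 100, 150, 190):
        print(k, residual_norm(k))
    # Every extra vector gets you closer to the target on average, but the
    # marginal gain per added vector keeps shrinking.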

Does this help clarify? If not I might need you to clarify your question a bit more. (I am a ML researcher fwiw)


> Wasn't your question about how independent the data is?

No. My original (top) comment was about how the number of training tokens required per parameter slowly declines as models become larger. dzdt suggested it could be because the independence of training points declines as the dataset size grows. I said it could be, but I'm not sure how one would go about proving it, given how little we know about the inner workings of giant models. Makes sense?

Otherwise, I agree with everything you wrote!


Oh, I see. Yes, we expect this to happen once we reach sufficient coverage: as we linearly increase the number of parameters, the number of configurations increases superlinearly, and so does the amount of information we can compress.
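
As a back-of-the-envelope version of that claim (toy assumption: each parameter is quantized to k discrete levels), the configuration count grows exponentially while the parameter count only grows linearly:

    # Toy counting argument: with k levels per parameter,
    # p parameters admit k**p distinct configurations.
    k = 16
    for p in (10, 20, 40, 80):
        print(p, k ** p)
    # p grows linearly (10, 20, 40, 80); k**p grows exponentially,
    # i.e. far faster than linearly.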

There's a lot we don't know, but it isn't nothing. There's a big push for the idea that ML doesn't need math, and it's true you can do a lot without it, especially if you have compute. But the math helps you understand what's going on and what your limits are. We can't explain everything yet, but it's not nothing.


I guess you could artificially limit the training data (e.g. by removing languages or categories) and see if the utility of extra tokens drops off as a result.



