The Friedman functions are sufficiently well known to be in the training set, likely in the form of some CSV file from somebody testing other methods on them. Though any such dataset would likely not share the same x-values. Still, it's surprising performance when we're used to LLMs getting confused by straightforward math problems, with arithmetic often crippled by the tokenization scheme.
They did invent their own functions to test whether the results were due to these functions being in the training data. See the section on data contamination in the paper.
Agree it both kind of makes sense (regression is the best way to predict the next token in this context) and kind of ironic (LLMs can do high school regression but can't do elementary-school multi-digit arithmetic).
It's surprising, but in a weird way it makes sense if you think about it as "learning a pattern" instead of "doing math". The model doesn't have to understand logic as we know it (and this class of models doesn't seem to be able to do that in general), but it does have to be able to learn the pattern somehow, and we already know LLMs can learn patterns from the input context without any fine-tuning.
I'd like to try it on a dataset like Titanic. That might be a more interesting experiment.
Or to avoid training data bias, maybe a completely random dataset generated by some more-sophisticated means, like the scikit-learn data generator which can introduce clusters, irrelevant features, etc.
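Something along these lines with scikit-learn's generators would do it. This is just a minimal sketch; the sample size, noise level, and the "Feature i" / "Output" column names are arbitrary choices for illustration, not taken from the paper:

```python
# Minimal sketch: generate a synthetic regression dataset with irrelevant
# features, so the "pattern" can't have been memorized from public data.
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

# 1 informative feature out of 3, plus noise (all sizes here are arbitrary).
X, y = make_regression(
    n_samples=50,
    n_features=3,
    n_informative=1,
    noise=5.0,
    random_state=0,
)

df = pd.DataFrame(X, columns=[f"Feature {i}" for i in range(X.shape[1])])
df["Output"] = np.round(y, 2)

# Rows like these could then be pasted into the prompt as in-context examples.
print(df.head().to_string(index=False))
```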
Initially I experimented with house price prediction. But the models have probably seen this data, so I ended up experimenting with random functions.
I tried linear regression with irrelevant features (e.g., NI 1/2, i.e. 1 informative variable out of 2 total). The models still perform reasonably well.
I'm a little skeptical. I grabbed a small dataset, asked ChatGPT to create a linear regression equation, and then did the same in Excel. Perhaps they estimate values for beta differently (maybe ChatGPT is secretly a Bayesian), but the output I got was different. It was close, which I will say is impressive. Also, maybe I missed it, but I think it would be valuable to add to the prompt a request for the parameter estimates themselves. It seemed as if they just took the predicted values, computed the MAE, and left the estimate at that. As I type this, I do wonder if you could use it to compute estimates for the parameters iteratively, average them, and essentially compute a bootstrapped estimate for the population parameters.
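Roughly what I have in mind, as a sketch. The `estimate_slope` helper here is a hypothetical stand-in: it could wrap an LLM query instead of the plain least-squares fit used below, and the toy data are made up:

```python
# Rough sketch of the bootstrap idea: repeatedly resample the data,
# get a parameter estimate from each resample, then average.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=40)
y = 2.5 * x + 1.0 + rng.normal(0, 2, size=40)   # toy data, true slope 2.5

def estimate_slope(xs, ys):
    # Ordinary least squares via polyfit; swap in an LLM call if desired.
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return slope

estimates = []
for _ in range(1000):
    idx = rng.integers(0, len(x), size=len(x))   # resample with replacement
    estimates.append(estimate_slope(x[idx], y[idx]))

print(f"bootstrap mean slope: {np.mean(estimates):.3f}")
print(f"bootstrap std error : {np.std(estimates):.3f}")
```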
I'm not surprised that a linear regression estimated by a method that knows it should be estimating a linear regression would beat everything else.
The traditional OLS estimator in Excel has all sorts of classical optimality properties when its assumptions (normality, linearity) hold, so no fancy neural net can outperform it even in principle (the only way to outperform it would be to have an informative prior over how the dataset's parameters were generated). So judging the LLMs by whether they beat, or even match, Excel in that case would be thinking too narrowly.
Same with any other models where we know the form of the answer up to some unknown parameters.
If we want parameter estimates, that means we already have a functional form in mind. In that case, we get to use established statistical theory to design optimal estimators, whether by Bayesian or other methods. Black box neural magic wouldn't help (or it might help indirectly, in computationally intractable cases).
What we would want the LLMs to do, ideally, is explore the space of known/possible 'patterns' and perform well in situations where the underlying relationship exists, is not known in advance, and is known not to have a simple form we can describe. Much like they (and we!) produce text without being able to describe why they are producing that particular text, we would expect them to make those predictions without being able to explain them in terms of parameters and functions - not without a whole other layer of explainability machinery.
> In that case, we get to use established statistical theory to design optimal estimators, whether by Bayesian or other methods. Black box neural magic wouldn't help (or it might help indirectly, in computationally intractable cases).
Using 'established statistical theory to design optimal estimators' isn't trivial for most people. Black box magic might still be useful for them.
Let the LLMs magically write code for the optimal estimators, using existing theory; but that code will implement interpretable math with provable characteristics, as opposed to magic.
Maybe the LLM is just doing interpolation rather than regression?
I've played with math examples (a while back, not with recent models) where they make errors but seem to get the magnitude right, so perhaps it's easy for them to find the closest (or roughly closest) points to interpolate between.
Interpolation is a legitimate technique for regression problems, a special case of the k-nearest-neighbor estimator (which is one of the methods they test against). Lots of related regression techniques involve a tradeoff between global trends and nearby points: KNN, kernel regressions, Gaussian process regressions, generalized additive models, local regressions, mixed models / generalized linear models with certain covariance structures - all of them more or less manifestations of the same underlying math. The tree-based techniques - random forests, GBMs and the like - don't look like interpolations, but the more you zoom in on the leaf nodes of those trees, the more they look like averages of one or more local y-values.
Literal interpolation would be used a lot more often, except you can't practically do it in higher dimensions (even 2 isn't trivial).
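To make the comparison concrete, here's a minimal 1-D sketch (toy data; the target function and noise level are arbitrary) showing literal interpolation next to a KNN regressor:

```python
# Literal 1-D interpolation vs. k-nearest-neighbor regression on toy data.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 10, size=30))
y_train = np.sin(x_train) + rng.normal(0, 0.1, size=30)
x_test = np.linspace(0.5, 9.5, 20)

# Literal linear interpolation between neighboring training points (1-D only).
y_interp = np.interp(x_test, x_train, y_train)

# KNN regression: average the k nearest y-values; works in any dimension.
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(x_train.reshape(-1, 1), y_train)
y_knn = knn.predict(x_test.reshape(-1, 1))

print(np.c_[x_test, y_interp, y_knn][:5])
```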
Ok I tried to get this to work for a long time ~6 months ago. It doesn’t work. You can find datasets where it will work, but if you select random datasets you generally see 0 correlation between the predicted and actual values.
Interesting, thanks for sharing! I noticed that when the LLMs were trying to explain the prediction, they would sometimes erroneously claim that the relationship was linear when it was not. This happened when I removed the following part of the prompt:
`The task is to provide your best estimate for "Output". Please provide that and only that, without any additional text.`
Examples of this are available in Appendix J.
Side note, but all the experiments were run with the API. There are some differences between the Chat and the API; for example, the Chat can generate and execute code. I shared Chats since they are easy to look at and to try.
If you have access to an API key, I made some Google Colabs:
The GitHub repo now contains (1) Links to Google Colab for examples + small-scale evaluation; (2) Jupyter Notebooks for small example + small-scale evaluation.
Very interesting indeed. My current master's research is doing related work, and I can't get this to work on GPT-4 (random forest specifically). My as-yet-untested assumption is that my dataset needs to be bigger, and/or I need more feature scores. So it's interesting to see someone getting this to work.
I am suspicious of the use of MAE as a metric. I would like to see a range of metrics, especially RMSE, which penalizes large differences more. If we can pick the metric to judge the models by, surely some metrics will make the results look better than others.
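To illustrate the difference, a minimal sketch (the numbers are made up) of how a single large miss moves RMSE much more than MAE:

```python
# Same predictions scored with MAE vs. RMSE; one big miss dominates RMSE.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 9.0, 11.0, 10.0])
y_pred = np.array([10.5, 11.5, 9.5, 11.5, 20.0])   # one large error

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```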
Missing prompts are a severe minus in terms of reproducibility - and hence for how seriously I take the paper. I would urgently recommend that the authors add everything needed to reproduce the results, ideally as a well-coded GitHub repo.
The GitHub repo now contains (1) Links to Google Colab for examples + small-scale evaluation; (2) Jupyter Notebooks for small example + small-scale evaluation.