We investigate the potential constraints on LLM scaling posed by the availability of public human-generated text data. We forecast the growing demand for training data based on current trends and estimate the total stock of public human text data. Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained. We explore how progress in language modeling can continue when human-generated text datasets cannot be scaled any further. We argue that synthetic data generation, transfer learning from data-rich domains, and data efficiency improvements might support further progress.
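For a rough sense of the timeline, here is a back-of-the-envelope projection of when the largest training sets would catch up with the stock of public text. All numbers below are illustrative assumptions, not the paper's estimates:

    import math

    # Illustrative projection (assumed figures, not the paper's estimates):
    # when does the largest training dataset reach the stock of public text?
    stock_tokens = 3e14           # assumed effective stock of public human text, in tokens
    dataset_tokens_2024 = 1.5e13  # assumed largest training set in 2024, in tokens
    annual_growth = 2.0           # assumed ~2x growth in dataset size per year

    years_left = math.log(stock_tokens / dataset_tokens_2024, annual_growth)
    print(f"Stock reached around {2024 + years_left:.1f}")  # ~2028 under these assumptions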
This work presents AstroPT, an autoregressive pretrained transformer developed with astronomical use-cases in mind. The AstroPT models presented here have been pretrained on 8.6 million 512 × 512 pixel grz-band galaxy postage stamp observations from the DESI Legacy Survey DR8. We train a selection of foundation models of increasing size, from 1 million to 2.1 billion parameters, and find that AstroPT follows a similar saturating log-log scaling law to textual models. We also find that the models' performance on downstream tasks, as measured by linear probing, improves with model size up to the model parameter saturation point. We believe that collaborative community development paves the best route towards realising an open source 'Large Observation Model': a model trained on data taken from the observational sciences at the scale seen in natural language processing. To this end, we release the source code, weights, and dataset for AstroPT under the MIT license, and invite potential collaborators to join us in collectively building and researching these models.
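As a rough illustration of the kind of fit described above, one could fit a saturating power law to loss versus parameter count along these lines. The functional form and the data points are assumptions for illustration, not AstroPT's actual results:

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical sketch: fit a saturating power law L(N) = L_inf + (N0 / N)^alpha
    # to validation loss vs. parameter count. The data points are made up.
    def saturating_law(N, L_inf, N0, alpha):
        return L_inf + (N0 / N) ** alpha

    params = np.array([1e6, 1e7, 1e8, 1e9, 2.1e9])    # model sizes (illustrative)
    losses = np.array([1.20, 0.95, 0.80, 0.72, 0.71])  # losses (illustrative)

    (L_inf, N0, alpha), _ = curve_fit(
        saturating_law, params, losses,
        p0=[0.7, 1e6, 0.3],
        bounds=([0.0, 1.0, 0.0], [2.0, 1e12, 2.0]),
    )
    print(f"Fitted loss floor L_inf={L_inf:.3f}, exponent alpha={alpha:.3f}")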
Which is not unreasonable for that amount of hardware.
You have to ask yourself if you want to drop that kind of money on consumer GPUs that launched in late 2022. But then again, with that kind of money you are stuck with consumer GPUs either way, unless you want to buy Ada workstation cards for $6k each, and those are just 4090s with P2P memory enabled. Hardly worth the premium if you don't absolutely have to have it.
The beefy workstation cards are two slots, but yeah, the 4090 cards are usually 3.something slots, which is ridiculous. The few dual-slot ones are water cooled.
I find it challenging to get my 4090s to consume more than 300 watts. There are also a lot of articles, benchmarks, etc. showing that you can dramatically limit power while reducing performance by insignificant amounts (single-digit percentages).
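If anyone wants to try it programmatically, something like this with the NVML Python bindings should work; reading the limit is unprivileged, setting it usually needs root, and it is roughly the same as running sudo nvidia-smi -pl 300:

    import pynvml

    # Sketch using the NVML Python bindings (pip install nvidia-ml-py).
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    current_mw = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
    print(f"Current power limit: {current_mw / 1000:.0f} W")

    # Cap the card at 300 W (value is in milliwatts); typically requires root.
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, 300_000)

    pynvml.nvmlShutdown()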
If you are interested in this also check out EarthPT, which is also a time series decoding transformer (and has the code and weights released under the MIT licence): https://arxiv.org/abs/2309.07207
Wanted to share the code release of EarthPT, a model that predicts future satellite observations in a zero-shot setting! I'm the first author, so please shoot any questions you have at me.
EarthPT is a 700 million parameter decoding transformer foundation model trained in an autoregressive, self-supervised manner and developed specifically with EO use-cases in mind. It can accurately predict satellite observations across the 400-2300 nm range well into the future (we found six months ahead still works!).
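For anyone curious how the forecasting works mechanically, it is a standard autoregressive rollout: predict the next timestep, append it to the context, and repeat. A minimal sketch of that idea (this is not the released EarthPT API, just an illustration):

    import torch

    # `model` stands in for any decoder-only transformer that maps a sequence of
    # per-timestep spectral vectors to a prediction for the next timestep.
    @torch.no_grad()
    def rollout(model, observed: torch.Tensor, n_future: int) -> torch.Tensor:
        """observed: (batch, time, bands) past observations; returns predicted future steps."""
        sequence = observed
        predictions = []
        for _ in range(n_future):
            next_step = model(sequence)[:, -1:, :]               # prediction for the next timestep
            predictions.append(next_step)
            sequence = torch.cat([sequence, next_step], dim=1)   # feed the prediction back in
        return torch.cat(predictions, dim=1)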
The embeddings learnt by EarthPT hold semantically meaningful information and could be exploited for downstream tasks such as highly granular, dynamic land use classification.
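A linear probe on those frozen embeddings is straightforward to set up; here is a minimal sketch where embeddings and labels are placeholders for EarthPT features and land use classes (not code from our release):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # `embeddings` would be frozen EarthPT features per pixel/patch,
    # `labels` the corresponding land use classes; both are placeholders here.
    def linear_probe(embeddings, labels):
        X_train, X_test, y_train, y_test = train_test_split(
            embeddings, labels, test_size=0.2, random_state=0
        )
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        return accuracy_score(y_test, clf.predict(X_test))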
The coolest takeaway for me is that EO data provides us with -- in theory -- quadrillions of training tokens. Therefore, if we assume that EarthPT follows neural scaling laws akin to those derived for Large Language Models (LLMs), there is currently no data-imposed limit to scaling EarthPT and other similar ‘Large Observation Models.’(!)
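To give a feel for where "quadrillions" comes from, here is the rough arithmetic with purely illustrative figures (not the exact numbers from our paper):

    # All figures below are assumptions for illustration only.
    land_area_km2 = 1.5e8          # Earth's land surface
    pixels_per_km2 = 1e4           # ~10 m resolution imagery
    revisits_per_year = 70         # ~5-day revisit cadence
    years_of_archive = 5
    bands_per_observation = 10     # one token per band measurement

    tokens = (land_area_km2 * pixels_per_km2 * revisits_per_year
              * years_of_archive * bands_per_observation)
    print(f"{tokens:.1e} tokens")  # ~5e15, i.e. a few quadrillion, under these assumptions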