Assuming power consumption is not an issue, can these workloads be effectively run on farms of systems using older cards with smaller VRAM sizes?
How about for training? Are the hardware requirements fundamentally different apart from scale?
Certain models may require numeric formats like bfloat16 to run efficiently, and those formats may not be supported on older hardware.
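As a quick illustration of what's different about the format (a pure-Python sketch of the bit layout, not how any framework actually implements it): bfloat16 keeps float32's 8-bit exponent but only 7 mantissa bits, so it trades precision for fp32-like dynamic range, which is why hardware without native support can't just substitute fp16.

```python
import struct

# bfloat16 is effectively the top 16 bits of an IEEE-754 float32: the same
# 8-bit exponent (so the same dynamic range), but only 7 mantissa bits.
def to_bf16_bits(x: float) -> int:
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def from_bf16_bits(b: int) -> float:
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

# Large magnitudes survive (fp16 overflows past ~65504):
print(from_bf16_bits(to_bf16_bits(1e30)))   # still roughly 1e30
# ...at the cost of precision:
print(from_bf16_bits(to_bf16_bits(1.001)))  # truncates to 1.0
```

Real implementations round rather than truncate, but the range/precision trade-off is the same.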
Parallelism is also supported (and in fact more necessary) for training, but since backprop is expensive, training typically requires many multiples of the memory needed for inference.
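A back-of-envelope sketch of that multiplier (assuming fp16 weights and an Adam-style optimizer with fp32 master weights and moment buffers, the common ~16 bytes/parameter rule of thumb for training vs ~2 for fp16 inference; exact numbers vary by framework and optimizer):

```python
# Rough per-parameter memory in bytes (assumptions, not exact figures):
#   inference: fp16 weights only (activations / KV cache excluded)
#   training:  fp16 weights (2) + fp16 gradients (2)
#              + Adam fp32 master weights and two moment buffers (12)
def inference_gb(params_billion, bytes_per_param=2):
    return params_billion * bytes_per_param  # billions of params x bytes = GB

def training_gb(params_billion, bytes_per_param=16):
    return params_billion * bytes_per_param

for size in (2.7, 6.7):
    print(f"{size}B params: ~{inference_gb(size):.0f} GB to run, "
          f"~{training_gb(size):.0f} GB to train")
```

Under these assumptions training needs about 8x the weight memory of fp16 inference, before even counting activations, which is why small-VRAM farms are much harder to use for training than for serving.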
>Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.
> These models can be applied on:
> - Text, for tasks like text classification, information extraction, question answering, summarization, translation, text generation, in over 100 languages.
> - Images, for tasks like image classification, object detection, and segmentation.
> - Audio, for tasks like speech recognition and audio classification.
Also I read about GPT-J, whose capability is comparable with GPT-3.
But I believe it requires buying or renting GPUs.
These models can be self-hosted but they will require advanced hardware and some specific skills related to AI model deployment.
By fine-tuning you should theoretically be able to get better niche-specific results than default ChatGPT, if you're lucky and thorough I guess.
ChatGPT in a nutshell.
(I hadn't heard the slang before, but it seems descriptive for a false air of sophistication, or [possibly over-]sophisticated complexity that nevertheless leads to bad results.)
Someone recently put this well:
> Don’t think of GPT chat as one really smart friend; instead think of it as an army of dumb pawns.
The goal is different so it's not a direct alternative, but it's the closest that can be run on consumer hardware. In some ways models that run via Kobold are a lot more practically useful because they have a semblance of memory, but they don't have the breadth of knowledge that ChatGPT does. You can go back and edit a response and models will be able to help generate pages and pages of writing. It's also not limited to chat mode.
I run it locally, but I have heard a lot of success stories about running on Colab as well, and there are GPU and TPU notebooks maintained in the repo. It can use a lot of different models like OPT, Fairseq Dense, and even older models like GPT-J. There are fine-tuned models for NSFW content as well; my understanding is that those models were motivated by a move away from AI Dungeon, which censored and read the text of users' stories.
The models on Hugging Face come in all sizes; 8GB of VRAM is enough for a 2.7B model without taking the time penalty of splitting compute across GPU and CPU+RAM. There are options ranging from 350M to OPT 66B in the FB/Meta AI repo released on May 3rd; the 66B parameter one is openly available and the full 175B parameter model is available on request. GPT-2 and GPT-Neo are supported too. I found 2.7B and 6.7B impressive personally. A 66B model would take hundreds of gigabytes of VRAM to run at a speed similar to how ChatGPT works. I think that's part of the reason why it's not as popular: very few people can run even the smaller models at a reasonable speed.
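To put rough numbers on those sizes (a sketch assuming fp16/bf16 weights at 2 bytes per parameter; activations and the KV cache add more on top, so the real requirement is higher):

```python
# Weight-only VRAM footprint at 2 bytes/param (fp16/bf16).
def weights_gb(params_billion, bytes_per_param=2):
    return params_billion * bytes_per_param

for name, size in [("OPT-350M", 0.35), ("OPT-2.7B", 2.7),
                   ("OPT-6.7B", 6.7), ("OPT-66B", 66), ("OPT-175B", 175)]:
    print(f"{name}: ~{weights_gb(size):.1f} GB of VRAM just for weights")
```

This lines up with the claims above: a 2.7B model's weights fit comfortably in 8GB, while 66B needs well over 100GB before overhead, out of reach of any single consumer card.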
OPT is motivated by training only on open-access material. The example bias demonstrated shows how the model completes "the man works as a" vs "the woman works as a". I thought having more modern works to train on would help avoid strong biases because culture has shifted toward being more moderate, but that doesn't seem to be the case.
"The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents:
BookCorpus, which consists of more than 10K unpublished books,
CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas,
The Pile, from which * Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews* were included.
Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and processed in Roller et al. (2021)
CCNewsV2 containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b)
The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset’s size in the pretraining corpus.
The dataset might contains offensive content as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety.“