I've seen a lot of discussion about key aspects of LLMs like ML (research, architecture), Infrastructure (GPUs, cloud), and Product (ChatGPT et al.), but not much on the data engineering side. There's a lot of hand-waving, like you "just" train on the entire public Internet. There must be a ton of complexity there as well.
What is the difference between web scraping and crawling? They are not simply indexing websites; these systems must be extracting and storing vast amounts of data from the crawled sites (hence Reddit, Twitter, etc. crying foul). Do these systems rely on tons of proxy IPs?
There's probably not too much going on after ingestion beyond storing all this data as text or images in a format optimized for the training system(s) to consume.
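For intuition only, here's a toy Python sketch of the crawl/scrape distinction as I understand it: a crawler follows links to discover pages, a scraper extracts and stores their content. This assumes the `requests` and `beautifulsoup4` packages and bears no resemblance to the scale or robustness of whatever the labs actually run.

```python
# Toy illustration: "crawling" = discovering pages by following links,
# "scraping" = extracting and storing content from each page.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl_and_scrape(seed_url: str, max_pages: int = 10) -> dict[str, str]:
    """Breadth-first crawl from a seed URL, storing extracted text per page."""
    seen, queue, corpus = {seed_url}, deque([seed_url]), {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        # Scraping: strip markup and keep the plain text for later processing.
        corpus[url] = soup.get_text(separator=" ", strip=True)
        # Crawling: enqueue newly discovered links.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus
```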
Comprehensive analysis paper (the 'what' not the 'how'):
https://lifearchitect.ai/whats-in-my-ai/
Holly wrote very eloquently on how/why they tokenize words:
A token is a way of dealing with rare words by breaking a word up into 50k unique subword units using byte pair encoding (BPE) (Neural Machine Translation of Rare Words with Subword Units [arxiv.org], Sennrich et al., 2015). This is particularly helpful with agglutinative or polysynthetic words, where an infinite number of words can be created by combining morphemes. For example, the Yup’ik word tuntussuqatarniksaitengqiggtuq is composed of many morphemes that translate to “He had not yet said again that he was going to hunt reindeer” (Describing Morphosyntax: A Guide for Field Linguists [cambridge.org], Payne, 1997). Rather than training GPT-3 on tuntussuqatarniksaitengqiggtuq, it is more efficient to train on the BPEs: "t", "unt", "uss", "u", "q", "at", "arn", "i", "ks", "ait", "eng", "q", "igg", "tu", "q". Breaking up words like this has some strange side effects. For instance, GPT-3 performs better at addition when you use a comma as a separator (GPT-3 Prompts: Math [wikidot.com], Brockman, 2020). BPE encoding may also confuse GPT-3 by obscuring what it needs to understand in the text.
https://hollygrimm.com/gpt3musings
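If you want to poke at this yourself, here's a minimal sketch using the open-source tiktoken library and its "gpt2" BPE vocabulary (the ~50k-token vocabulary GPT-3 also uses). The exact subword splits it produces may differ from the hand-picked ones in the quote above.

```python
# Minimal sketch of BPE tokenization with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # ~50k-entry BPE vocabulary used by GPT-2/GPT-3

word = "tuntussuqatarniksaitengqiggtuq"
token_ids = enc.encode(word)
subwords = [enc.decode([t]) for t in token_ids]

print(token_ids)  # integer token ids the model actually sees
print(subwords)   # the subword pieces the rare word was broken into

# The comma-separator effect on arithmetic comes from the same mechanism:
# "1234" and "1,234" tokenize into different subword pieces.
print(enc.encode("1234"), enc.encode("1,234"))
```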