I've seen a lot of discussion about key aspects of LLMs like ML (research, architecture), Infrastructure (GPUs, cloud), and Product (ChatGPT et al.), but not much on the data engineering side. There's a lot of hand-waving, like you "just" train on the entire public Internet. There must be a ton of complexity there as well.
What is the difference between web scraping and crawling? They are not simply indexing websites; these systems must be extracting and storing vast amounts of data from the crawled sites (hence Reddit, Twitter, etc. crying foul). Do these systems rely on tons of proxy IPs?
There's probably not too much going on after ingestion beyond storing all this data as text or images in a format optimized for the training system(s) to consume.
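For intuition only, here's a toy Python sketch of the crawl/scrape distinction as I understand it: a crawler follows links to discover pages, a scraper extracts and stores their content. This assumes the `requests` and `beautifulsoup4` packages and bears no resemblance to the scale or robustness of whatever the labs actually run.

```python
# Toy illustration: "crawling" = discovering pages by following links,
# "scraping" = extracting and storing content from each page.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl_and_scrape(seed_url: str, max_pages: int = 10) -> dict[str, str]:
    """Breadth-first crawl from a seed URL, storing extracted text per page."""
    seen, queue, corpus = {seed_url}, deque([seed_url]), {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        # Scraping: strip markup and keep the plain text for later processing.
        corpus[url] = soup.get_text(separator=" ", strip=True)
        # Crawling: enqueue newly discovered links.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus
```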
Comprehensive analysis paper (the 'what' not the 'how'):
https://lifearchitect.ai/whats-in-my-ai/
Holly wrote very eloquently on how/why they tokenize words:
A token is a way of dealing with rare words by breaking a word up into 50k unique subword units using byte pair encoding (BPE) (Neural Machine Translation of Rare Words with Subword Units [arxiv.org], Sennrich et al., 2015). This is particularly helpful with agglutinative or polysynthetic words, where an infinite number of words can be created by combining morphemes. For example, the Yup’ik word tuntussuqatarniksaitengqiggtuq is composed of many morphemes that translate to “He had not yet said again that he was going to hunt reindeer” (Describing Morphosyntax: A Guide for Field Linguists [cambridge.org], Payne, 1997). Rather than training GPT-3 on tuntussuqatarniksaitengqiggtuq, it is more efficient to train on the BPEs: "t", "unt", "uss", "u", "q", "at", "arn", "i", "ks", "ait", "eng", "q", "igg", "tu", "q". Breaking up words like this has some strange side effects. For instance, GPT-3 performs better at addition when you use a comma as a separator (GPT-3 Prompts: Math [wikidot.com], Brockman, 2020). BPE encoding may also confuse GPT-3 by obscuring what it needs to understand in the text.
https://hollygrimm.com/gpt3musings
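If you want to poke at this yourself, here's a minimal sketch using the open-source tiktoken library and its "gpt2" BPE vocabulary (the ~50k-token vocabulary GPT-3 also uses). The exact subword splits it produces may differ from the hand-picked ones in the quote above.

```python
# Minimal sketch of BPE tokenization with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # ~50k-entry BPE vocabulary used by GPT-2/GPT-3

word = "tuntussuqatarniksaitengqiggtuq"
token_ids = enc.encode(word)
subwords = [enc.decode([t]) for t in token_ids]

print(token_ids)  # integer token ids the model actually sees
print(subwords)   # the subword pieces the rare word was broken into

# The comma-separator effect on arithmetic comes from the same mechanism:
# "1234" and "1,234" tokenize into different subword pieces.
print(enc.encode("1234"), enc.encode("1,234"))
```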