Ask HN: What does the data engineering behind LLMs look like?
19 points by lostpharoah on April 19, 2023 | 5 comments
I've seen a lot of discussion about key aspects of LLMs like ML (research, architecture), Infrastructure (GPUs, Cloud), and Product (ChatGPT et al.), but not much on the data engineering side. There's a lot of hand waving, as if you "just" train on the entire public Internet. There must be a ton of complexity here as well.

What is the difference between web scraping and crawling? They are not simply indexing websites; these systems must be extracting and storing vast amounts of data from those crawled sites (hence Reddit, Twitter, etc. crying foul). Do these systems rely on tons of proxy IPs?

There's probably not too much going on after ingestion beyond storing all this data as text or images in an optimal format for the training system(s) to use.
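To make the question concrete, I imagine the naive version of "extract and store" looks something like the sketch below (warcio + BeautifulSoup + JSONL are just my guesses at a plausible pipeline, not anything the vendors have confirmed):

    # Naive sketch: read a (Common Crawl style) WARC file, extract plain text,
    # and append it to a JSONL file for a downstream training pipeline.
    import json

    from bs4 import BeautifulSoup
    from warcio.archiveiterator import ArchiveIterator

    def warc_to_jsonl(warc_path: str, out_path: str) -> None:
        with open(warc_path, "rb") as stream, open(out_path, "a") as out:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue  # skip request/metadata records
                html = record.content_stream().read()
                text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
                out.write(json.dumps({
                    "url": record.rec_headers.get_header("WARC-Target-URI"),
                    "text": text,
                }) + "\n")

    warc_to_jsonl("example.warc.gz", "corpus.jsonl")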




If they don't already own the data (Alphabet/YouTube, Meta/FB, etc.), they do scrape it or use ready-made datasets.

Comprehensive analysis paper (the 'what' not the 'how'):

https://lifearchitect.ai/whats-in-my-ai/

Holly wrote very eloquently on how/why they tokenize words:

A token is a way of dealing with rare words by breaking a word up into subword units, drawn from a vocabulary of about 50k unique subwords built with byte pair encoding (BPE); see Neural Machine Translation of Rare Words with Subword Units [arxiv.org] (Sennrich et al., 2015). This is particularly helpful with agglutinative or polysynthetic languages, where an effectively infinite number of words can be created by combining morphemes. For example, the Yup’ik word tuntussuqatarniksaitengqiggtuq is composed of many morphemes that translate to “He had not yet said again that he was going to hunt reindeer”; see Describing Morphosyntax: A Guide for Field Linguists [cambridge.org] (Payne, 1997). Rather than training GPT-3 on tuntussuqatarniksaitengqiggtuq, it is more efficient to train on the BPEs: "t", "unt", "uss", "u", "q", "at", "arn", "i", "ks", "ait", "eng", "q", "igg", "tu", "q". Breaking up words like this has some strange side effects. For instance, GPT-3 performs better at addition when you use a comma as a separator; see GPT-3 Prompts: Math [wikidot.com] (Brockman, 2020). BPE encoding may also confuse GPT-3 by obscuring what it needs to understand in the text.

https://hollygrimm.com/gpt3musings
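You can see the splitting for yourself with OpenAI's tiktoken library (this uses the GPT-2 encoding; the exact pieces depend on the tokenizer version, so they may not match the list quoted above):

    # Show how a GPT-2/GPT-3 style BPE tokenizer splits a rare word.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")  # BPE with a ~50k-token vocabulary
    word = "tuntussuqatarniksaitengqiggtuq"
    token_ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
              for t in token_ids]
    print(len(token_ids), pieces)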



The other element here is quality content: for commercial LLMs you probably aren't just training on public internet data; ideally you can also train on scanned books, closed academic journals, radio transcripts, photograph stores, map data, codebases, technical documentation...


I suspect a lot of low-quality content still makes it into the training process, and the RLHF layer is specifically designed to improve the quality of the output despite the noisy input.

I.e., without RLHF, these models would for sure be vulgar, explicit, etc.


I don’t have any insider insight on this, but the GPT-3 paper discusses some of their datasets and curation techniques (https://arxiv.org/pdf/2005.14165.pdf).

The recent DINOv2 paper is also interesting reading (https://arxiv.org/pdf/2304.07193.pdf), as it focuses particularly on techniques for improving the training set.
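On the curation side, the GPT-3 paper describes training a simple classifier to score Common Crawl documents by how similar they look to a higher-quality reference corpus, keeping the higher-scoring ones (plus fuzzy deduplication). A rough sketch of that idea, with feature choices and threshold that are mine rather than theirs:

    # Toy quality filter: train a linear classifier to tell "reference-like"
    # text from raw crawl text, then keep crawl docs the classifier likes.
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import LogisticRegression

    def fit_quality_filter(reference_docs, crawl_docs):
        """Label 1 = looks like the curated reference corpus."""
        vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
        X = vec.transform(list(reference_docs) + list(crawl_docs))
        y = [1] * len(reference_docs) + [0] * len(crawl_docs)
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        return vec, clf

    def keep(doc, vec, clf, threshold=0.5):
        """Keep a crawled document if it scores as reference-like enough."""
        return clf.predict_proba(vec.transform([doc]))[0, 1] > threshold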

OpenAI have also been open about making heavy use of RL (via PPO) to fine-tune the models.

For the RL part, it seems they’ve basically developed a second model that scores the quality of responses based on the encoded preferences of human evaluators. I.e., you build a ranking of different responses based on desired characteristics (e.g. polite, helpful, etc.) and use those rankings to train a second model that approximates the RL reward function. This reward model can then be used to fine-tune the main model.
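A stripped-down version of that reward-model training step might look like this (PyTorch; the scoring head and random "features" are stand-ins, since the real model scores full token sequences with an LLM backbone):

    # Minimal pairwise (Bradley-Terry style) reward-model training sketch:
    # the model should score the human-preferred response above the rejected one.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        def __init__(self, hidden_dim=768):
            super().__init__()
            # Stand-in for "LLM backbone + scalar head".
            self.score = nn.Linear(hidden_dim, 1)

        def forward(self, features):                  # (batch, hidden_dim)
            return self.score(features).squeeze(-1)   # (batch,) scalar rewards

    reward_model = RewardModel()
    optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

    # Fake batch: embeddings of a preferred and a rejected response.
    chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)

    r_chosen, r_rejected = reward_model(chosen), reward_model(rejected)

    # Pairwise loss: push the chosen score above the rejected score.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    loss.backward()
    optimizer.step()
    # The trained reward model then supplies the reward signal for PPO
    # fine-tuning of the main model.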



