Large language model data pipelines and Common Crawl (christianperone.com)
139 points by sonabinu 10 months ago | 12 comments



Nicely written, thanks for posting!

I was reminded of recent LLM wins coming from training-data improvements (e.g. FineWeb).


  After that it removes (or replaces) Unicode punctuation, performs a SHA1 hashing, and uses the first 8 bytes for deduplication comparisons (paragraph level)
Is taking the first few bytes that much faster than comparing the entire hash? Or something else? That is one of those performance optimizations I would go back and forth on endlessly, wondering whether I had lost something by trying to shave off a few cycles.
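
For concreteness, here is a sketch of what paragraph-level dedup on a truncated SHA-1 might look like (my own reconstruction, not the article's actual code; the function name and details are made up):

    import hashlib

    def dedup_paragraphs(paragraphs):
        """Keep only the first occurrence of each paragraph, keyed on the
        first 8 bytes of its SHA-1 digest (a 64-bit fingerprint)."""
        seen = set()
        kept = []
        for p in paragraphs:
            # Truncate the 20-byte SHA-1 digest to 8 bytes and treat it as a uint64.
            key = int.from_bytes(hashlib.sha1(p.encode("utf-8")).digest()[:8], "big")
            if key not in seen:
                seen.add(key)
                kept.append(p)
        return kept

If I had to guess, the win is less about comparison speed and more that an 8-byte prefix fits in a machine word, so the seen-set holds plain 64-bit integers instead of 20-byte digests.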


The OP’s paper reads:

> Document Deduplication … We use the MinHash algorithm to compute 13-gram Jaccard similarities to determine which documents are near-duplicates of each other (Lee et al., 2021a). … We define two documents to be too similar when their Jaccard similarity exceeds 0.8, and randomly remove one of them.

The inline referenced [Lee et al](https://arxiv.org/pdf/2107.06499) goes into detail, and notes that one of the algorithms is based on:

> 4.2 Approximate Matching with MinHash … perform approximate deduplication based on matching entire examples. This method, which we call NEARDUP, is a good complement to the exact substring matching, especially for web crawl text, as it handles the very common case of documents being identical except for interspersed templated fields.

I’ve searched to my Peter Principle limit here, and was able to find [another implementation of MinHash](https://web.eecs.utk.edu/~jplank/plank/classes/cs494/494/not...) that used the first eight bytes of the hash, and so I am just going to assume that’s what they needed in order to meet their similarity KPI. Hopefully somebody here knows better/is smarter.
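
For what it's worth, here is a bare-bones sketch of the MinHash scheme as described in the quotes above (13-gram shingles, near-duplicate if the estimated Jaccard similarity exceeds 0.8). The number of hash functions and the salting trick are my own assumptions, not taken from the paper:

    import hashlib
    import re

    NUM_PERM = 128  # number of hash functions; an assumption, the quote doesn't specify

    def ngrams(text, n=13):
        # Word-level 13-grams, matching the "13-gram Jaccard similarities" in the quote.
        words = re.findall(r"\w+", text.lower())
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def minhash_signature(shingles):
        # One minimum per "permutation", simulated here by salting the hash input with a seed.
        if not shingles:
            return [0] * NUM_PERM
        return [
            min(int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
                for s in shingles)
            for seed in range(NUM_PERM)
        ]

    def estimated_jaccard(sig_a, sig_b):
        # The fraction of matching signature slots approximates the true Jaccard similarity.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

Two documents whose estimate exceeds 0.8 would then be treated as near-duplicates, with one of them removed at random, as in the quoted passage.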


You want to store the hashed value in the database as efficiently as possible, hence the first 8 bytes.

Though I think a 64-bit hash algorithm might be more suitable than SHA-1. Personally I use FNV-1a for hashing instead (not the fastest, but trivial to implement).
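
For reference, a minimal 64-bit FNV-1a in Python (the constants are the standard offset basis and prime); how the key gets stored downstream is just my assumption:

    FNV64_OFFSET = 0xcbf29ce484222325
    FNV64_PRIME = 0x100000001b3

    def fnv1a_64(data: bytes) -> int:
        # 64-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
        h = FNV64_OFFSET
        for b in data:
            h ^= b
            h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
        return h

    # The result already fits in one 64-bit column, so there is no digest to truncate.
    key = fnv1a_64("some paragraph text".encode("utf-8"))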


Does anyone know of a maintained alternative to fasttext? It is mentioned here for language identification, but clicking through to the GitHub project, it looks to be recently archived.

I usually use a BERT model for text classification these days, but would like to have a less CPU-heavy alternative like fasttext at hand for high-volume use cases.
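
In case it helps, this is roughly how fasttext language ID is used with the pretrained lid.176.bin model from fasttext.cc; the local file path here is an assumption:

    import fasttext

    # Pretrained 176-language identification model downloaded from fasttext.cc;
    # the local path is an assumption for this sketch.
    model = fasttext.load_model("lid.176.bin")

    labels, probs = model.predict("Ceci n'est pas une pipe.", k=1)
    lang = labels[0].replace("__label__", "")   # e.g. "fr"
    confidence = float(probs[0])
    print(lang, confidence)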


fasttext appears to be up and fine [0], with fairly recent activity [1]. If you are looking for something similar, word2vec maybe?

[0] https://fasttext.cc/

[1] https://github.com/facebookresearch/fastText


"This repository has been archived by the owner on Mar 19, 2024. It is now read-only."

I would assume that since it has been archived, no future work will happen on it.


The list of forks (https://github.com/facebookresearch/fastText/forks) has floret (https://github.com/explosion/floret), from explosion.ai (the makers of spaCy), as the most-starred one.


fasttext failed for me with numpy 2.0.

Reverted to numpy 1.26.4 and it was fine.


The section on deduplication was very useful, thanks for posting.


This is a great blog btw.


(2023)

Still very useful, but it should probably have a date in the title.



