After that it removes (or replaces) Unicode punctuation, computes a SHA-1 hash, and uses the first 8 bytes for deduplication comparisons (at the paragraph level).
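A minimal sketch of how I read that step (my interpretation only; the punctuation stripping via unicodedata and the in-memory set are assumptions, not their actual implementation):

```python
import hashlib
import unicodedata

def paragraph_key(paragraph: str) -> bytes:
    """Strip Unicode punctuation, SHA-1 the result, keep only the first 8 bytes."""
    cleaned = "".join(
        ch for ch in paragraph
        if not unicodedata.category(ch).startswith("P")  # drop punctuation categories
    )
    return hashlib.sha1(cleaned.encode("utf-8")).digest()[:8]

seen: set[bytes] = set()

def is_duplicate_paragraph(paragraph: str) -> bool:
    key = paragraph_key(paragraph)
    if key in seen:
        return True
    seen.add(key)
    return False
```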
Is comparing only the first few bytes that much faster than comparing the entire hash? Or is something else going on? That is one of those performance optimizations I would go back and forth on endlessly, wondering whether I had lost something by trying to shave off a few cycles.
> Document Deduplication … We use the MinHash algorithm to compute 13-gram Jaccard similarities to determine which documents are near-duplicates of each other (Lee et al., 2021a). … We define two documents to be too similar when their Jaccard similarity exceeds 0.8, and randomly remove one of them.
The inline reference, [Lee et al.](https://arxiv.org/pdf/2107.06499), goes into detail and notes that one of the algorithms is based on:
> 4.2 Approximate Matching with MinHash … perform approximate deduplication based on matching entire examples. This method, which we call NEARDUP, is a good complement to the exact substring matching, especially for web crawl text, as it handles the very common case of documents being identical except for interspersed templated fields.
I’ve searched to my Peter Principle limit here, and was able to find [another implementation of MinHash](https://web.eecs.utk.edu/~jplank/plank/classes/cs494/494/not...) that used the first eight bytes of the hash, and so I am just going to assume that’s what they needed in order to meet their similarity KPI. Hopefully somebody here knows better/is smarter.
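For reference, a minimal near-dedup sketch along those lines using the datasketch library (my choice of tooling, not necessarily theirs; 13-gram word shingles and a 0.8 threshold as in the quote):

```python
from datasketch import MinHash, MinHashLSH

def shingles(text: str, n: int = 13):
    """Word-level 13-grams, per the paper's description."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

docs = {"doc1": "some web page text ...", "doc2": "some web page text ..."}

# LSH index so candidates above ~0.8 estimated Jaccard are found without all-pairs comparison
lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for doc_id, text in docs.items():
    m = minhash_of(text)
    if lsh.query(m):   # a near-duplicate was already kept; drop this one
        continue       # (the paper removes one of the pair at random; this keeps first-seen)
    lsh.insert(doc_id, m)
    kept.append(doc_id)
```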
You want to store the hashed value in the database as efficiently as possible, hence the first 8 bytes.
Though I think a 64-bit hash algorithm might be more suitable than SHA-1. Personally I use FNV-1a for hashing instead (not the fastest, but trivial to implement).
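For reference, 64-bit FNV-1a really is just a few lines (constants from the FNV spec):

```python
def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime."""
    h = 0xcbf29ce484222325                                # FNV-1a 64-bit offset basis
    for b in data:
        h ^= b
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF      # FNV prime, truncated to 64 bits
    return h

print(hex(fnv1a_64(b"hello world")))
```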
Does anyone know of a maintained alternative to fasttext? It is mentioned here for language identification, but clicking through to the GitHub project, it looks to have been archived recently.
I usually use a BERT model for text classification these days, but would like to have a less CPU-heavy alternative like fasttext at hand for high-volume use cases.
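For context, this is the kind of lightweight call I'd like to keep (standard fasttext Python bindings; assumes the pretrained lid.176.bin language-ID model has already been downloaded):

```python
import fasttext

# Path to the pretrained language-identification model (downloaded separately)
model = fasttext.load_model("lid.176.bin")

labels, probs = model.predict("Ceci est une phrase en français.", k=1)
print(labels[0], probs[0])  # e.g. __label__fr with a high probability
```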
I was reminded of recent LLM wins coming from training-data improvements (e.g., FineWeb).