
Simhash is an extremely fast and simple algorithm for detecting near-duplicate text at scale, which makes it particularly useful for deduplicating AI training datasets.
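
For anyone curious, a minimal hand-rolled sketch of the idea in Python (the 64-bit fingerprint, MD5 token hashing, and the whitespace tokenisation are my own simplifications, not any particular library's implementation):

  import hashlib

  def simhash(text, bits=64):
      # Each token votes +1/-1 on every bit position of its hash.
      v = [0] * bits
      for token in text.lower().split():
          h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
          for i in range(bits):
              v[i] += 1 if (h >> i) & 1 else -1
      # The fingerprint keeps the sign of each position's vote.
      return sum(1 << i for i in range(bits) if v[i] > 0)

  def hamming(a, b):
      return bin(a ^ b).count("1")

  a = simhash("the quick brown fox jumps over the lazy dog")
  b = simhash("the quick brown fox jumped over the lazy dog")
  print(hamming(a, b))  # near-duplicates end up a small Hamming distance apart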


Is it “this couple reign” or “this couple reigns”? The former just feels wrong despite it being the title.


Looks terrible to me but apparently both are acceptable:

https://www.merriam-webster.com/grammar/is-couple-singular-o...

> When writing of a couple getting married, it is more common to use the plural form ("the couple are to be wed"). When writing of an established couple, it is more common to use a singular verb ("the couple has six puppies, each more destructive than the next").

So according to MW, we're a bit more right than the NYT.


I'd expect the former in British English, because in that standard, grammatically singular nouns that refer to multiple people are conjugated in the plural. The latter would be typical for American English. It's a bit surprising to see the former used by an American author in an American publication, however.


> It's a bit surprising to see the former used by an American author in an American publication, however.

Strange things happen to American writers who read lots of Russian novels translated by Brits.


I think "reigns" sounds better. A couple is actually a singular noun.



Yep. And as that Wikipedia article explains, collective nouns are normally taken to be singular in American English:

> In American English, collective nouns almost always take singular verb forms (formal agreement). In cases that a metonymic shift would be revealed nearby, the whole sentence should be recast to avoid the metonymy. (For example, "The team are fighting among themselves" may become "the team members are fighting among themselves" or simply "The team is infighting.") Collective proper nouns are usually taken as singular ("Apple is expected to release a new phone this year"), unless the plural is explicit in the proper noun itself, in which case it is taken as plural ("The Green Bay Packers are scheduled to play the Minnesota Vikings this weekend").

I guess "couple" may be one of the exceptions?


Translation issue from the original English.

(joke)


* without already labelled training data (assuming you're referring to causal LLMs).

If you have labelled training data (or semi-labelled data), BERT takes the cake, both in terms of accuracy and efficiency. In fact, you can often have luck getting a CLM to generate noisy labels and then training BERT/RoBERTa on them to get a strong, robust classifier.
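
A minimal sketch of that noisy-label route, assuming Hugging Face transformers/datasets (the model name, toy data, and hyperparameters are placeholders; the "label" column would really come from prompting a CLM):

  from datasets import Dataset
  from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                            Trainer, TrainingArguments)

  # Toy stand-in: imagine these labels were produced by a CLM, not a human.
  data = Dataset.from_dict({
      "text": ["great paper", "spam advert", "useful library", "clickbait"],
      "label": [1, 0, 1, 0],
  })

  tok = AutoTokenizer.from_pretrained("roberta-base")
  data = data.map(lambda x: tok(x["text"], truncation=True,
                                padding="max_length", max_length=128),
                  batched=True)

  model = AutoModelForSequenceClassification.from_pretrained("roberta-base",
                                                             num_labels=2)
  trainer = Trainer(
      model=model,
      args=TrainingArguments(output_dir="clf", num_train_epochs=3,
                             per_device_train_batch_size=8),
      train_dataset=data,
  )
  trainer.train()  # the resulting classifier is small and cheap to serve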


Between my own experience and the arXiv papers I've read, I'd say this:

Personally I am willing to label 1000-2000 documents to create a training set. It's reasonable to make about 2000 simple judgements in an 8-hour "day" so it is something you could do at work or in your spare time in a few days if you want.

You can compute an embedding and then use classical ML algorithms from scikit-learn such as the support vector machine. My recommender can train a new model in about 3 minutes, and that includes building about 20 models and testing them against each other to produce a model which is well tested and probability-calibrated. This process is completely industrial, can run unattended, and always makes a good model if it gets good inputs. Running it every day is no sweat.
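
Not the parent's exact pipeline, but a sketch of that general shape with sentence-transformers and scikit-learn (the encoder choice, the C grid, and the loader function are all assumptions):

  from sentence_transformers import SentenceTransformer
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.model_selection import GridSearchCV, train_test_split
  from sklearn.svm import LinearSVC

  # Hypothetical loader for your ~1,000-2,000 hand-labelled documents.
  texts, labels = load_my_labelled_docs()

  # Frozen embeddings: the deep model is only a feature transform here.
  encoder = SentenceTransformer("all-MiniLM-L6-v2")
  X = encoder.encode(texts)
  X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

  # The grid search plays the role of building ~20 candidate models and
  # testing them against each other; calibration gives usable probabilities.
  search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
  search.fit(X_train, y_train)
  clf = CalibratedClassifierCV(search.best_estimator_, cv=5)
  clf.fit(X_train, y_train)

  print(clf.score(X_test, y_test))
  print(clf.predict_proba(X_test[:5]))  # calibrated probabilities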

You can also "fine-tune" a model, actually changing the parameters of the deep model. I've fine-tuned BERT family models for my recommender; it takes at least 30 minutes and the training process is not completely reliable. A reliable model builder would probably do that 20 or so times with different parameters (learning rate, how long to train the model, etc.) and pick out the best model. As it is, the best model it produces is about as good as my simpler models, and a bad one is worse. I can picture industrializing it but I'm not sure it's worth it. In a lot of papers people just seem to copy a recipe from another paper and don't do any model selection.

My problem is fuzzy: the answer to "Do I like this article?" could vary from day to day. If I had a more precise problem, "fine-tuning" might pull ahead. Some people do get a significant improvement, which would make it worth it, particularly if you don't expect to retrain frequently.

I see papers where somebody does the embedding transformation but, instead of pooling over the tokens (averaging the vectors), they feed the token vectors into an LSTM, GRU or something like that and train the recurrent model. This kind of model does great when word order matters, as in sentiment analysis. I found that kind of model was easy to train in a repeatable way a decade ago, so that's an option I'd consider.
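
A rough PyTorch sketch of that shape of model, with frozen BERT token embeddings feeding a GRU head (everything here is illustrative, not taken from any particular paper):

  import torch.nn as nn
  from transformers import AutoModel, AutoTokenizer

  class RecurrentOverTokens(nn.Module):
      # Instead of mean-pooling the token vectors, run a GRU over them
      # and classify from its final hidden state.
      def __init__(self, encoder_name="bert-base-uncased", hidden=128, classes=2):
          super().__init__()
          self.encoder = AutoModel.from_pretrained(encoder_name)
          for p in self.encoder.parameters():
              p.requires_grad = False  # frozen encoder; only the GRU head trains
          self.gru = nn.GRU(self.encoder.config.hidden_size, hidden,
                            batch_first=True)
          self.head = nn.Linear(hidden, classes)

      def forward(self, input_ids, attention_mask):
          tokens = self.encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
          _, h = self.gru(tokens)      # h: (1, batch, hidden)
          return self.head(h.squeeze(0))

  tok = AutoTokenizer.from_pretrained("bert-base-uncased")
  batch = tok(["I loved it", "I did not love it"], return_tensors="pt",
              padding=True)
  model = RecurrentOverTokens()
  logits = model(batch["input_ids"], batch["attention_mask"])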


You're better off trying to figure out which features of the articles you like are less ambiguous as a signal. You could then fine-tune models to classify whether those features are present, whether that's classification of chunks, sentences, or tokens. For these, a BERT model could be fine-tuned to detect them efficiently.


I misinterpreted the title as Apple approving development of iOS apps on PC. That would’ve been an exciting development.


> and indeed, much of what used to be fertile farmland in that part of the world is now desert from over-farming

Do you have a source for this?


Probably because the title of the article uses 402,000,000 Mbps and I think there’s a rule(?) on HN about not misrepresenting titles or rewording them too much, though I’d wager 402 Tbps is a fair alteration.


Although it's worth noting that more recent techniques (e.g., Transformers) need stop words for context.


Does it actually help, though? I would think the embeddings of "cat" and "the cat" would be functionally similar.
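
A quick way to sanity-check that intuition, assuming sentence-transformers (the model choice is arbitrary and this only tests one embedding model):

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("all-MiniLM-L6-v2")
  a, b = model.encode(["cat", "the cat"])
  # A similarity near 1 would suggest the stop word adds little here.
  print(util.cos_sim(a, b))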


There are two reasons I can think of why someone might reuse a tokeniser:

1. They want to continue pretraining a model instead of starting from scratch. But actually people might not know that you can pretty easily reuse model weights even when training with a new tokeniser (I’ve got a blog post on how to do that: https://umarbutler.com/how-to-reuse-model-weights-when-train... ).

2. Because it’s convenient for end users. Tokenising and chunking really large corpora can take a long time and it’s nice that I can use the GPT2 tokeniser and then train a bunch of different models on that data without having to retokenise everything (rough sketch of that workflow below).
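
For point 2, a minimal tokenise-once sketch assuming Hugging Face datasets (the file paths and glob are placeholders):

  from datasets import load_dataset, load_from_disk
  from transformers import AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")

  # Tokenise the corpus once and cache the token IDs to disk...
  corpus = load_dataset("text", data_files={"train": "corpus/*.txt"})
  tokenised = corpus.map(lambda x: tok(x["text"]), batched=True,
                         remove_columns=["text"])
  tokenised.save_to_disk("corpus_gpt2_tokenised")

  # ...then every later training run reuses the cache instead of
  # re-tokenising gigabytes of text.
  ready = load_from_disk("corpus_gpt2_tokenised")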


I’ve trained tokenizers on medium-sized datasets (5+ GB of text, although that could be considered small or large depending on who you ask) and have always found training quite fast. As in, it takes a couple of minutes.

Maybe if we’re talking terabytes it might not scale as well but so far in my experience training tokenizers has never been an issue. It’s training models that takes ages.
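
For reference, the sort of thing I mean, using the Hugging Face tokenizers library (the corpus path, vocab size, and special token are placeholders):

  from tokenizers import ByteLevelBPETokenizer

  # Train a byte-level BPE tokenizer from plain-text files.
  tokenizer = ByteLevelBPETokenizer()
  tokenizer.train(files=["corpus.txt"], vocab_size=32_000, min_frequency=2,
                  special_tokens=["<|endoftext|>"])
  tokenizer.save("my_tokenizer.json")

  print(tokenizer.encode("Training tokenizers is fast.").tokens)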


It surprises me that this is a surprise; I thought this was a given.


Yeah, I'm very certain that I've read about this phenomenon in the not-so-recent past. To the point where I've intentionally listened for the difference on many occasions since then.

