float-trip's comments

float-trip · on April 24, 2024

Reddit's caches are set up to only ever return the last 1,000 of anything. So for example - you can't scroll past 1k items on /new, and if you save more than 1k posts then you'll have to unsave some to retrieve the others.

If this extension only edits comments, it'll only touch the most recent 1k. You would need to retrieve the older ones with a Pushshift replacement like this: https://pullpush.io/. But that also shows how ineffective this is. We still have public reddit archives (like Pullpush and https://github.com/ArthurHeitmann/arctic_shift) which contain comments as they were originally posted. This isn't gonna be a problem for Google.

float-trip · on Dec 5, 2023

I tried adding special tokens for a reddit-style dataset once. The format was: `<|post_author|>username<|post_title|>title here...`

The resulting model was so much worse than just formatting everything plaintext. This was with MPT-30B, 15 special tokens, 300M training tokens, and a full finetune.

I may have made a mistake, but I haven't seen any open source finetunes successfully add a large number of tokens yet either.

Tostino · on Dec 5, 2023

Try doing the same thing in your dataset, but don't actually add them as "special tokens", and just let them just be multiple tokens.

Adding new tokens needs a ton of data to train what the token means. Reusing existing tokens, will allow you to easily teach that a sequence of tokens now has a new meaning after fine tuning.

float-trip · on Dec 5, 2023

That's what I ended up doing (`[Author] username [Title] post title...`)

> Adding new tokens needs a ton of data to train what the token means.

But how much? 300M tokens is fine for a simple version of ChatML with ~4 tokens. Not for 15, at least in my case. How's this relationship scale?

Just trying to offer one datapoint for what doesn't work, with the hedge that I might have just had a bug

Tostino · on Dec 6, 2023

I don't know how many tokens are required to get good results, because I simply didn't mark mine as "special_tokens" due to the issues that I had read about. I got great results, whereas others who have tried special tokens got pretty poor results. I'm sure there is a magic number, but it's just not been worth it for me to explore that area yet.

tayo42 · on Dec 5, 2023

I don't mean add special tokens, but make the vocab only the set of possible cards. each card is a token.

a simple input might be <cards you hold> 1 14 56</end><cards to pick> 5 64 2</end> -> predicted token is the draft pick.

Then train a transformer based network from scratch.

float-trip · on Dec 5, 2023

Thanks for writing up. Rather than zeroing out the loss for the prompt, did you also try using weighted loss with Axolotl? At one point, Microsoft's GPT 3 docs suggested this was beneficial when the responses are short (like you have with "Cut in.") Domain adaptation over subreddits/forums before finetuning may help as well.

dmakian · on Dec 5, 2023

> did you also try using weighted loss with Axolotl

This is really smart, I didn't think about this! Will add it to my list of things to try, great idea!

> Domain adaptation over subreddits/forums before finetuning may help as well.

I was thinking about this too (along with transcribing draft youtube videos), I'd definitely be curious how much this helps.

float-trip · on Dec 6, 2023

Related comment from gwern: https://news.ycombinator.com/item?id=38438859. Can't find the docs now - I think they were the old GPT 3 ones - but they suggested a low value somewhere around 0.01 and 0.1.

Also - why qlora rather than a full finetune? Using LambdaLabs, it'd cost roughly the same as your quote. Cheaper I think if you're willing to gamble with fp8: https://github.com/mosaicml/llm-foundry/tree/main/scripts/tr.... And fewer hyperparameters to tune as well

float-trip · on Aug 16, 2023

There's a breakdown here for anyone interested (ctrl+f "weight flops for")

https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-la...

float-trip · on May 13, 2023

The prompt for Bing Chat was previously reproduced by the same person as here, using the same trick. The Bing lead disclaimed it as inaccurate, though: https://twitter.com/MParakhin/status/1627491603731423232

float-trip · on April 17, 2023

Two other recent literature reviews worth reading:

"Transformer Taxonomy" - https://kipp.ly/blog/transformer-taxonomy/

"Five years of progress in GPTs" - https://finbarrtimbers.substack.com/p/five-years-of-progress...