Learnings from fine-tuning LLM on my Telegram messages (asmirnov.xyz)
216 points by furiousteabag on Nov 27, 2023 | 69 comments



This part caught my eye:

"Using a half-precision FSDP full shard with a 1024 sequence length and a micro batch size of 2 required 63GB of VRAM on each of the eight A100 80 GB GPUs. The training, lasting three epochs, took just 20 minutes. The total cost for the VM was $8.88 per hour, resulting in $3, not including the time for experiments and bug fixes."

I wondered where you could rent cycles on a machine like that. A quick Google found that p4d.24xlarge on AWS is available; the on-demand cost is $20.1755 per hour, while the Spot price is only $8.99 (I guess it's gone up?).
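A quick sanity check of the ~$3 figure using just the quoted numbers (back-of-envelope, not from the article):

    hourly_rate = 8.88        # $/hr for the whole 8xA100 VM
    training_hours = 20 / 60  # the run reportedly took ~20 minutes
    print(round(hourly_rate * training_hours, 2))  # 2.96, i.e. roughly $3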

Cool to know I could fine-tune for only ~$3.


I've been using vast.ai for a very long time. It is like a GPU marketplace, where people rent and lease GPUs. There are a lot of VMs with 4090s, and beasts like 8xA100 80GB are also available from time to time.


I used vast.ai to do some fine-tuning just a few days ago. It is indeed pretty great, though some servers fail to start up properly or have weird performance issues. I also wish they had more templates to try.


Yeah, it works pretty well for the price - you just need to be comfortable with running code and putting data on random people's computers (which I am for certain things). Someone on HN posted a script or snippet of output for mass-testing vast.ai servers for connectivity and configuration and auto-labeling them using their API. Wish I could find it now... maybe with the search?


There are all these 8x 4090 machines on Vast.ai running in ASRock EPYC servers, and I just want to know where the hell all of those are coming from. I want to see pictures of these setups, since there are no off-the-shelf 4090s with blower coolers, and watercooling that many cards together is a lot of custom hardware. And the backstories: given that they are 4090s and not datacenter cards, are these hobbyists just building octo-GPU $18k EPYC rigs for fun? (I even saw one with 9x 4090s! Gotta use up those OCuLink PCIe lanes.) It's not ex-mining hardware, since the 4090 landed after the Ethereum proof-of-stake changeover.

I've been looking for an answer to this every time I check out the current vast.ai console.


There were some posts recently about 4090s being mass imported into China and the chips being desoldered/converted to lower height reference boards and blower fans.


I think TensorDock and vast.ai are cheaper than AWS. Lambda Labs can be as well, but they seem to only have reserved instances now.


We are building dstack.ai, an open-source tool that helps run anything on vast.ai and TensorDock. Happy to hear your feedback.


Happy user of dstack.ai. The simplicity of just spinning up a machine with my required set of GPUs and memory, from my vendor of choice, and having an endpoint to easily access it via SSH and VS Code has been game-changing for me.

I once had some trouble setting it up and the founder literally went on a Zoom chat with me to help navigate through things. Couldn't recommend them enough!!


runpod.io is another good-and-cheap option


It caught mine too. I'm weighing several alternatives for "fine-tuning model fine-tuning", meaning the back-and-forth, trial-and-error phase before massively running the full training set.

My goal is to fine-tune a model on our codebase. I find RAG to be too orthopedic; I'd really like to train the model on what each part of the code is and how we do things, and see how it responds with a more complete perspective that goes beyond context.

The options I've considered for pre-fine-tuning:

- using a service like vast.ai, RunPod, Gradient, or similar

- using Google Colab

- getting a more powerful MacBook, an M3 Max with plenty of RAM


Excuse the ignorance, but are you using these instances to fine-tune a “fresh install” of a model, and then when you've finished fine-tuning it, do you download the whole model from the instance for use somewhere else?


First I download the weights of the base pre-trained model to the VM instance. Then I upload my data there. Afterward, I fine-tune with either LoRA or a full fine-tune, and when training finishes I download the adapters (in the LoRA case) or the full weights (in the full fine-tune case) from the VM instance and run inference on a much less expensive instance (usually a 3090).
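For anyone wondering what that looks like in code, here's a minimal sketch of the LoRA case with HuggingFace transformers + peft (the model name, adapter path, and prompt are illustrative, not necessarily what the author used):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # On the cheap inference box (e.g. a 3090): load the base weights,
    # then attach the LoRA adapters downloaded from the training VM.
    base_name = "mistralai/Mistral-7B-v0.1"
    base = AutoModelForCausalLM.from_pretrained(
        base_name, torch_dtype="auto", device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(base_name)
    model = PeftModel.from_pretrained(base, "out/lora-adapters")

    prompt = "Alex: how was your day?\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))

For a full fine-tune there are no adapters to attach; you just point from_pretrained at the downloaded weights directory.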


Check out other prices on https://gpumonger.com/

Disclosure: I collected the data and built the site, but it has a ton of comparison data for GPU clouds.


This post and its findings are really interesting!

Back when the "I forced a bot to watch 1000 hours..." memes were popular (https://knowyourmeme.com/memes/i-forced-a-bot), ages ago in AI/ML time, I tried to do something similar by fine-tuning GPT-2 on messages from a group chat of my friends. Since there were years of chat data, it seemed like a really good opportunity to test whether the language model would capture everyone's personality and generate funny, uncanny-valley versions of our banter.

Turns out that the group chat was used nearly exclusively for sending funny pictures and videos (that the language model obviously couldn't see), and for making plans to meet up. The generated conversations almost exclusively consisted of a random group chat member starting with "there is a party tonight, who wants to go?" and others saying "I'm down" or "when?" or "where?" It was 0% banter, and 100% logistics.

It was pretty hilarious in its own way, but not for the reasons anyone expected! I didn't learn very much about language models with that experiment, but I did learn that my friends' group chat is actually pretty boring.

I guess the best banter happens in real life. Glad to see it worked out somewhat more interestingly for this person, even if they did allude to some similar results in their closing thoughts section.


This is an interesting insight into model training because it shows how hidden bias arises - think of the "left-leaning" stance of GPT-3 etc. being a side-effect of the cohort that trained the model, not any deliberate action on their part.

Additionally, it's why it's important to think ahead about what you're training your model on, because the model will always regress to the training data itself, even if that means going backwards in ability.


> My data collator ensures that the loss is only calculated based on someone’s response. Predicting who will speak next is relatively straightforward, and we don’t want the model to focus on learning that. Therefore, parts of the conversation where the loss is calculated are highlighted in bold.

If it's so easy, then you don't need to remove it. The model will solve it easily and focus on everything else. At best, you save some parameters and compute; at worst, you damage its ability to learn important things like conversational skills or modeling people. When it comes to LLMs, more is more, and trying to hand-engineer the dataset or think for the LLM can backfire in very subtle and difficult-to-diagnose ways.

> Ok, it is capable of forming coherent sentences. The most noticeable problem is its lack of awareness regarding the context of the conversations which leads to bland and generic replies. The messages lacked any distinct style, feeling quite basic...

> Conversations have become more interesting and engaging, although there’s still a risk of losing context. Russian language performance has improved, but errors still occur. I believe that before fine-tuning for a specific task with limited data, like mine, it would be beneficial to first fine-tune the model unsupervised on a large corpus of Russian texts. Additionally, incorporating common conversation partners’ names as separate tokens might enhance the quality. I wouldn’t say it has turned out to be significantly better than LoRA. It might be more effective to focus solely on a single person and calculate the loss based only on my responses (or someone else’s), instead of trying to learn about each and every conversational partner.


I agree that usually 'more is more' for training LLMs. However, for fine-tuning with limited data, it seems crucial to focus the task as much as possible. Since the model still encounters these masked sentences in the data, it effectively learns to respond based on the speaker's name. So, complicating the task might not be necessary. Also, I'm concerned about interpreting the loss value. If the model quickly reduces loss by picking up predictable phrases, it's hard to tell if it's genuinely learning or just echoing these predictable elements.
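In case it helps other readers, the usual way this kind of masking is implemented with HuggingFace-style training is to copy input_ids into labels and set every token outside the target response to -100, which the cross-entropy loss ignores. A tiny illustration (the span boundaries here are made up):

    import torch

    IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

    def build_labels(input_ids, response_start, response_end):
        # Keep the whole conversation as model input, but compute the loss
        # only on the target speaker's response span.
        labels = input_ids.clone()
        labels[:response_start] = IGNORE_INDEX  # context before the response
        labels[response_end:] = IGNORE_INDEX    # anything after it
        return labels

    ids = torch.arange(10)  # pretend 10-token conversation; tokens 6..8 are the response
    print(build_labels(ids, response_start=6, response_end=9))
    # tensor([-100, -100, -100, -100, -100, -100,    6,    7,    8, -100])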


> However, for fine-tuning with limited data, it seems crucial to focus the task as much as possible.

That doesn't make any sense when you're dealing with a model which is so hugely over-parameterized. The model will learn the easy data that you are removing just fine. There's no 'limited data' there.

> If the model quickly reduces loss by picking up predictable phrases, it's hard to tell if it's genuinely learning or just echoing these predictable elements.

You can't interpret the loss qualitatively anyway. It's totally dependent on the details of tokenization, formatting, corpus size, etc. You still have to look at the samples or a downstream task to see if it's working well. Even quantitatively, the loss is only meaningful if you're comparing to a heldout sample or something, and then it doesn't matter if you were screwing with it like OP.


We're probably quite some time off from the bio-mimetic android part, but we're feeling closer and closer to the AI replacement avatar from the Black Mirror episode "Be Right Back"[0]

[0] https://en.wikipedia.org/wiki/Be_Right_Back


Off-topic but I think it's important: OP, in the article you say you don't want a company to have your private messages, but you are using Telegram? I also use Telegram, but I am under no illusion of privacy!

Except for encrypted chats (which have bad UI and only work on one device) your messages are stored unencrypted on their servers (handed over to authorities, etc.).


In IM, there's a balance between total privacy and widespread use. Apps like Signal offer high privacy but have fewer users, while popular ones like WhatsApp are less secure. Telegram lies somewhere in between, offering a level of privacy that most users find comfortable. It's widely used and there haven't been significant incidents of legal issues arising from its messages. Ultimately, it boils down to whom you trust and which app has more of your contacts.


> like WhatsApp are less secure

WhatsApp uses end-to-end encryption by default. In fact, it uses the library that Signal developed. It is much more secure than Telegram, unless proven otherwise (which would require some backdoor in the application code to change its behavior).


It doesn't really matter if the app claims to use E2E when it actually discloses message content [0] [1]. WhatsApp is also filled with backdoors [2].

[0] https://therecord.media/fbi-document-shows-what-data-can-be-...

[1] https://www.rollingstone.com/politics/politics-features/what...

[2] https://telegra.ph/Why-Using-WhatsApp-Is-Dangerous-01-30-4


[0] and [1] don't support your claim that WhatsApp discloses message content. Those two links explicitly say WhatsApp doesn't give messages to the FBI.


Also, [2] claims backdoors, which is impossible to prove. They could just be bugs that were exploited (and fixed).


The fact that the Iranian regime blocks Telegram, Messenger, Instagram, Signal, etc. but has no problem with WhatsApp makes me worried that the app probably cooperates with the regime and complies with their data requests.


>you don't want a company to have your private messages

But that's not what he says.

>I don’t want to use any third-party fine-tuning services

He might be okay with a third party storing his messages but not using them in their models, etc.


This may sound stupid, but from my perspective renting random VMs on vast.ai is safe in general, and might even be safer than using traditional cloud providers. Consider this: on your VM a new image starts several times a day, each time with a new volume. It downloads tens of GBs of data and weights for training. Once training is done, everything gets cleaned up and the process starts again for a new tenant. This constant cycle makes it kind of difficult to track and extract any meaningful data from it.


Vast didn't work for me despite installing their certificate on macOS.


Great post. I wonder how much this can improve if you RAG-ify a diverse set of contextual data, for example calendar, meals, recent conversations from the real world, etc.

It's also interesting that бля was translated to 'damn'. :)


I think incorporating knowledge from other apps is a good next step because the model definitely lacks the context of what is going on right now. The nature of instant messaging is that most of the messages are about what is happening right now or what will happen in the near future, so past communication history does not help much.


Fascinating. A few years ago a friend and I fine-tuned GPT-2 on our WhatsApp chat. So it was just a long text file of:

Mark: wassup

Andy: just chilling

It simulated our conversational style and topics quite well, though GPT-2 reads like a glorified Markov chain. Sometimes the outputs were absolutely hilarious and inappropriate. GPT-2 was peak comedy.

My friend described GPT-2 as "like watching a toddler learning how to walk. When it stumbles it's cute and funny." GPT-3, not so much...

Also, it was oddly (painfully) accurate as far as personality goes... like looking into a mirror. For one thing, I talk way more, and this was reflected in the model's output. For another, I am constantly trying to turn my life around and failing, but ever optimistic... and talking about creative plans endlessly without much execution. (So GPT-2-andai ended up the same way...)


GPT-2 is surprisingly good when fine-tuned on such conversations, even now. I gave a talk recently on "Sparks of Digital Immortality" that covers a bit about how we did it - https://www.youtube.com/watch?v=F9-Qk86QyMM


Really interesting - as another person who's used Telegram for several long-standing group chats, I imagine a tool to simplify this would be well-received.

I've wondered, since fine-tuning started being a thing, how long it'd be before somebody makes a utility where you can dump in a giant chat export and an API key, and it fine-tunes a Telegram bot that can imitate any of your friends. It would be fun to play with, and you could even create a group chat with multiple friend-bots talking to each other to see how long until it goes off the rails.
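If anyone feels like hacking on that, the bot half is the easy part. A bare-bones sketch with python-telegram-bot, where generate_reply() is a hypothetical stand-in for whatever fine-tuned model you'd actually call, and the token is a placeholder:

    from telegram import Update
    from telegram.ext import ApplicationBuilder, ContextTypes, MessageHandler, filters

    def generate_reply(message: str) -> str:
        # Hypothetical: call your fine-tuned model here (local inference
        # or an API), conditioned on the recent chat history.
        return "lol same"

    async def on_message(update: Update, context: ContextTypes.DEFAULT_TYPE):
        await update.message.reply_text(generate_reply(update.message.text))

    app = ApplicationBuilder().token("YOUR_BOT_TOKEN").build()
    app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, on_message))
    app.run_polling()

The harder part is the fine-tuning pipeline per friend, and deciding which bot speaks when in a group chat.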


It's true that fine-tuning models on personal messages could be simplified, but many, like myself, can't use third-party services due to sensitive data in our messages. I'm curious if others face this trust issue and how it might be resolved.


Great example of immortal digital avatars. This is just a simple personal avatar but it is possible to make technological gods with the same techniques. All that's needed is scale and $80B.


We do this at https://meraGPT.com


This was a fun read. Wondering if it will work for Hindi. TIL that Mistral > Llama 2


A meta-comment, but, what is the difference between "learnings" and "lessons"? Why use the former when we have the latter?


Learnings implies a report of your own experience; lessons implies something prepared as teaching material for the audience. (In the context of the title sentence anyway.)


‘Lessons’ to me also seems to carry a sense of regret, as in ‘things (we) got wrong’. ‘Learnings’ is a more obscure word that I would take to mean something more neutral: literally ‘things (I’ve) learnt’.


Perhaps "findings" over "learnings", based on your description?


https://en.m.wiktionary.org/wiki/learnings

Beyond what's noted there (contemporary business jargon), English is diffused across the globe and has many regional variations that differ from class-signalling/formal American and British usage. As we all encounter each other online, it's not always worth over-analyzing word choice when you can understand the intent.


This usage of "learnings", while certainly more common in "business jargon" today, was used by Shakespeare:

https://www.opensourceshakespeare.org/views/plays/play_view....


Some words in Shakespeare have different meanings today or have simply left standard usage. I don't think the presence of a word in Shakespeare means it is de facto good style to use today.

From a correctness standpoint, I think a descriptivist would be satisfied with an attested usage, especially from such a source. From a style point of view, I still find myself feeling embarrassed for the author when I encounter this usage (which is my own problem).


I think when you ask what the difference between two phrases is, people will really dig down to try and find a difference.

IMO in this context it is basically shorthand for “things I learned/lessons learned while tuning LLM…,” and either would be fine. It is sort of an informal list of stuff the author learned.

In my experience (nothing special, just another native speaker) “lessons from <event>” is the more typical American (at least) English phrase. But it is sort of close to “Lessons on.” “Lessons on” would imply more refined material that is more narrowly focused on teaching. So I wonder if the author decided they just didn’t want to worry about any confusion, or the possibility that they might misuse a phrase.


I've always associated it with Indian English, possibly it's a dialect thing that's spread from that community.



Gotta earn those fat management consultant fees somehow. I’m sure there’s a whole team at McKinsey doing nothing but inventing new ways to say the same things.


In Swedish, there's a commonly used word "lärdomar" which is a direct match for "learnings".

But where the Swedish word sounds natural in that language, "learnings" just sounds wrong in English, even though it apparently is technically correct.


Lessons may be given, but are not necessarily learned.


I think it's new. I've only heard it in the last few years.


learnings = lessons learned


I assumed the author was a non-native English speaker


[flagged]


I'm a native English speaker from the US, and a pedant who hates "ask" as a noun, "workshop" as a verb, and "performant" as a word. But I don't get the hate for "learnings" here. What's wrong with it? "Lessons" connotes negativity, "stuff I learned" doesn't naturally fit into many sentences, and "useful information gleaned" can be shoved right back up the tightly puckered ass it came out of.

What's the problem? That title is exactly the way I would have written it.


Out of curiosity, what's your take on how to write "this item requires repair"?

"It needs repaired" is something I've seen, which to me is confusing, because it seems like "to be" is missing. When did "needs" run away from the words it's been associated with before?


They wrote this for you, I think:

https://ygdp.yale.edu/phenomena/needs-washed


Oh wow, reading this makes me deeply uncomfortable in an odd way. Like it could pollute my brain and somehow it would become normal to say things like that. I didn't dare read it all.


This is a regionalism in parts of the US, which I’ve seen described as Pittsburgh and its surroundings.

I come across it often and struggle with cognitive dissonance every time - I know of the regionalism but it feels so strongly like a glaring grammatical error.

I see/hear the specific phrase “needs fixed” most often.


"This shit's broke."

Seriously, though, I'm with you on "It needs to be repaired."

Then again, I went to school in the land of the yinzers, so "it needs repaired" doesn't even sound all that off to me anymore even though I'd never use it myself. (I kind of mentally map it to "it needs repairing".) But I think of that as a dialect of English, with no bearing on "standard" English.

For standard English, it has to be "it needs to be repaired", "it requires repair", "it needs repairing", or "excuse me, my good sir, but I do believe that this shit here is most definitely in need of repair".


So you're saying "needs" is not doing the needful.


I always thought of "learning" as an uncountable noun.


This is a ridiculous, arbitrary judgment that has nothing to do with anything even remotely related to this post. This type of pedantry is low-brow and annoying.


The fact that it is tangentially related to the post doesn't make it any less valuable.

As a non-native speaker I find this discussion as interesting as the original post. I like to hear native speakers give their "opinions" on all matters language, as it illuminates some dark alleys of the language that are otherwise inaccessible.

Today's lesson: "it needs repaired" is equally puzzling for some native speakers as it seems to be a regional thing :)

Also: yinzers.


It's an incorrect judgement that's ignorant of the entire field of linguistics. People should have fewer opinions of this type, as they are founded on very wrong ideas of how language works. In short, it is not valuable at all in any context.


It's also plainly wrong, because "learnings" is perfectly commonplace.



“Learnings.” Horrible word



