Hi all, I'm the main maintainer from the OpenLLM team here. I'm actively developing the fine-tuning feature and will release a PR soon. Stay tuned. In the meantime, the best way to track development is on our Discord, so feel free to join!!
Thanks for the great project! Any chance your team might consider a more open platform than Discord for posting updates? I personally find Discord hard to use, and there's no sensible way to subscribe to it (like RSS). Discord is usually muted.
Discord is a black hole where information goes to die. Its search and scrollback are poor, and it's terrible as an archive: finding anything that was asked more than a day or two ago is impractical.
To use Discord in good faith and with open eyes, you have to prioritize communication in the present, and give up hope of archiving anything that was said for people who might need the information in the future.
Discord is just a rich IRC replacement. You can log and search in IRC too, but nobody seriously tries to archive information for later research. And the big difference is that it's all closed and operated by one entity that can change its terms at will. Don't even try to use it for anything other than real-time chat.
That's only half true. Yes, Discord does allow a "rich" chat experience, with channels and servers, but there the similarities end.
IRC is based on an open protocol, with many open source clients available for it, and a decentralized server infrastructure.
Discord is closed and centralized, with only a single client available for it.
You can easily log IRC channels, but there is no easy way to do that on Discord, if it can be done at all.
I've logged every channel I've ever visited on IRC, and I can use powerful text tools to regex search through all of my conversations on IRC and have the results appear instantly. Nothing remotely like that is possible with Discord.
Paging through IRC logs is virtually instant on a modern terminal, while Discord makes you wait a long time between every other page load, so if you need to look through more than a handful of pages it's incredibly slow and painful.
Some IRC channels have their logs published on the web, making them fully searchable through web search engines, but to my knowledge no Discord channels do that.
For gaming communities (where you'd use voice chat), Discord was great: easy to set up, free as in beer, and it runs in the cloud. The alternatives back in the day didn't have all of these features. They were either expensive (Ventrilo), had poor latency/quality (Ventrilo, Skype), were proprietary (TeamSpeak, Ventrilo, etc.; only Mumble wasn't), lacked community features (Ventrilo), felt very archaic (TeamSpeak, Mumble), or required self-hosting (all but Ventrilo). It was also before GDPR existed. So Discord happily used and abused that unique position.
It's a shame it's being used for general communities that don't use or need the voice chat feature, especially when it's the official community for a project, given Discord's stance on third-party clients and its privacy issues.
If you don't need voice chat, Zulip, Mattermost, Revolt, Discourse, and many others would suffice (Linen recently got featured on HN). If you do, I think even Signal would suffice these days.
For Discord search, Answer Overflow was recently featured on HN [1].
They stem words aggressively, so searching for "repeater", which is a less common, specific term, gives you results including "repeat", a commonly used word.
And there's no way to do an exact word search.
FYI, specifying the nfpr=1 query string parameter will disable Google's idiotic attempt to be helpful by searching for something other than what you want to link to.
When I click this, Google gives me results for "how to use openlm", a commercial product. They literally change your search term if there's a product that fits.
Related: As an operator/mod/admin it's fairly straightforward to bridge a Discord channel to Matrix (and, if one so desires, from there to IRC), allowing users not on Discord to participate. Conservative mods concerned about spam can start with an allowlist for which servers can join.
What an insulting interview question. I hope it was just in jest, or asked at the end to pad the time.
However, it did make me realize hidden therein is an actual interesting interview question, similar to the "describe what happens when you type an address into the browser's URL bar and hit enter": describe what happens after you type `:s/foo/bar` and hit enter. Followup version: what about `:%s/foo/bar`? The kind of thing that can be interesting to watch them reason through even if they don't know the answer, or even know what those syntaxes do.
I used to play EVE Online a fair bit, and always thought it interesting how some of the groups used Discord but only for text communications. Voice was still done over Teamspeak or Mumble.
When I played EVE, Mumble was the de facto voice comms, since it supported hundreds of pilots (which happened many times during joint ops), and XMPP was used for text chat and pings.
My understanding from asking several people, since I hate discord and want to know why people insist on using it, is that it’s a free alternative to Slack. Simple as that.
But it’s crazy, people are aggressive about Discord for some reason. I maintain an OpenAI SDK package for .Net, and I had some random person decide they wanted it to be a Discord community, so they created a Discord claiming it was the official community discord for my library, and submitted a PR updating my readme to say that it’s my project’s official Discord. They also replied to several issues and pull requests telling people to discuss it on that discord. If Discord isn’t paying this person in some guerrilla marketing tactic, they should be...
People didn't put documentation in IRC channels because they didn't want to answer the same questions over and over. Info went into a wiki, and you would get flamed for asking a question on IRC that was answered on the wiki. Discord is not a good place to stash documentation.
In my experience, I don’t think I’ve ever seen an IRC log in a search result.
#haskell on Libera is publicly logged, but I couldn't get Google to return a quoted phrase from a message a few weeks ago.
Many people on IRC don’t enjoy being in logged channels. I’ve also heard that there are GDPR implications to publicly logging people’s messages without their consent.
Discussion of the difficulty and downsides of IRC logging, from a couple of years ago:
No, it's not. If you work on an open source / open development project, it makes total sense to avoid walled gardens for the community chat/forum (a few years ago it was a public Slack instance; nowadays it's Discord servers).
Can you show me how to access the archives of the ask-for-help channel on the openllm Discord server? Right now they're discussing "loading models on CPU vs GPU". No matter how explicit I got, google did not find the discussion.
Isn't there something really nice about it though?
It seems to me that almost every community gradually evolves into one where every new message from a new-ish member is answered with something like "Duplicate, please search first!". And this in turn makes those newcomers either go away, become passive lurkers, or become part of the "hive mind" (as only like-minded questions get answered).
On the other hand, if people have to actually converse to get an answer to their questions (like back in the real world), newcomers can more rapidly become part of the community, and help make it more diverse.
LLMs could indeed address the first part, but not the second, of bringing the newcomers in via actual conversation with the older members.
The only good solution I encountered to this is of having some (preferably not too experienced) member(s) actively take upon themselves the role of welcoming newcomers and answering their questions, whether that's in an official or unofficial capacity.
This to me is the real way through this "Eternal September", where in every "cohort" of newcomers, one or more choose to stay close to the doorway to welcome and guide the next cohort.
The best of both worlds - a friendly community that welcomes newbies, with a searchable archive - is possible. Limiting to only chat-based support means that support is bottle-necked by the folks who are available and engaged at the time of the question, and that knowledge will "drop out" of the community as people forget it.
Apologies for my skepticism, but is it just "possible", or do you actually have an example of a long-lived community that remained fully welcoming to newbies while utilizing a searchable archive?
In any case, I'm not arguing that it's impossible, but rather that the more comprehensive the archive, the less welcoming the community would tend to be, all other things being equal. To take it to the extreme, I'll posit the following law: "A well-curated archive is the grave of a community"
Hard disagree. If anything you'll find that the most knowledgeable members get burned out answering the same questions over and over again, so they begin to simplify their answers until they just become copy pasta.
You can still have channels open to welcoming new people while at the same time having a large archive of answered questions so that over time a reservoir gets built.
Saying that the same questions getting asked over and over again by new people is somehow a more welcoming community, is like saying that there's any meaningful interaction happening when two people say "What's up?" followed by the response "not much". It's a handshake protocol equivalent without actual depth.
A very reasonable question, and I'll admit that I'm not entrenched deeply enough in technical communities to give you an actual example. But yeah, intuitively I do agree with the sibling commenter - a well-curated archive is a tool that allows skilled respondents to preserve their time and energy for new and interesting questions. A pointer to search is not necessarily dismissive - there is a world of difference between the following _technically_ equivalent responses:
* FFS, read the fuckin' archive noob and stop wasting our time
* Hey there, thanks for asking! This is actually a pretty common question, and we have guides written up for just this case. Try entering some of your search terms here [link], and come back with a follow-up question if that doesn't help you!
But yes, in fairness, I'll certainly agree that a community which _chooses_ to respond as the former will stagnate and die.
Short answer: because it's one of the options with the least friction to get running.
A lot of people who are into tech already have a Discord account, making joining a community a one-click process; the instant nature of it seems to appeal to younger users more than async forums; and it's a fairly mature platform, so it has a bunch of moderation/customization/integration features you might want; etc.
> are discord conversations persisted and indexed on search engines ?
I've used IRC for a long time and still do, but I do think Discord has a nicer UX for most use cases. In particular, building communities around clusters of channels ("servers") and support for rich media (yes, some old people might call that a downside) increase the appeal for most people. It's also a lot more work to have a persistent connection on IRC (bouncers).
My main problem with Discord is that it's someone else's centralized, for-profit company and has no apparent barriers to enshittification[0]. As Reddit recently demonstrated, it's probably a mistake to build communities on top of something like that.
Matrix is a good candidate for a modern successor to IRC. It's not quite as slick a UX as Discord, but it addresses the main advantages Discord has over IRC.
Practically all of my friends grew up with IRC, we are in our late 30s, 40s, early 50s.
We might reminisce about irc but we all prefer discord.
Even the searchability of indexed IRC has been surpassed by other knowledge sites. It would have to be something extremely niche these days for the only source of info to be an IRC chat log.
IRC doesn't even have history, one of the most basic requirements for a modern chat app. It's ridiculous to suggest using it in 2023 when it lacks features that a freshman's homework-assignment chat app would have.
Very cool. BTW, it's not mentioned in the readme, so I have to ask: is it only for running full-precision models, or do quantized GGML/GPTQ/etc. models also work with it?
GPTQ support would be amazing (AutoGPTQ is an easy way to integrate GPTQ support - it's basically just importing autogptq and switching out 1 line in the model loading code).
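If it helps, here's roughly what that swap looks like when using AutoGPTQ directly. This is a sketch rather than OpenLLM's actual integration, and the checkpoint name and arguments are illustrative assumptions:

    # Hypothetical sketch: load a GPTQ-quantized checkpoint with AutoGPTQ
    # instead of the usual AutoModelForCausalLM.from_pretrained(...) call.
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM

    model_id = "TheBloke/vicuna-13B-v1.3-GPTQ"  # example quantized checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

    # This is the "one line" that changes relative to a full-precision load.
    model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0", use_safetensors=True)

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))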
Stray thought: it would be better to specify NNs in terms of their training-data size to weight size, in bytes, rather than "No. Parameters" alone, or at least to report this ratio alongside the number of parameters.
So, eg., I'd imagine ChatGPT would be, say: 100s PB in 0.5TB.
The number of parameters is a nearly meaningless metric, consider, eg., that if all the parameters covary then there's one "functional" parameter.
The compression ratio tells you the real reason why an NN performs. I.e., you can have hundreds of billions of parameters, but without that 100s PB -> 0.5TB compression (which you can't afford to do yourself), it's all rather pointless.
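To make the proposed metric concrete, a back-of-the-envelope sketch (all numbers are illustrative assumptions, not published figures for any real model):

    # Hypothetical numbers purely to illustrate the "training bytes : weight bytes" ratio.
    training_bytes = 100e15          # ~100 PB of (assumed) training data
    n_params = 175e9                 # assumed parameter count
    bytes_per_param = 2              # fp16 weights
    weight_bytes = n_params * bytes_per_param   # ~0.35 TB of weights

    ratio = training_bytes / weight_bytes
    print(f"compression ratio ~ {ratio:,.0f} : 1")   # ~ 285,714 : 1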
I’m not sure ML researchers would agree that the number of (compressed) bytes is more meaningful than the number of parameters. Parameters have mathematical meaning; bytes don't.
My point is that they don't have a mathematical meaning. They have a training-time impact, but that's more relevant when creating than using.
You'd need to know how many parameters were independently covarying for any given class of predictions. It certainly isn't all of them.
You could cite the "average dropout percent to random-level accuracy on a given class of problems" (my guess is that this would show 5-20% of parameters could be dropped).
My point, I suppose, is that users of NNs aren't interested in the architecture characteristics which affect training -- they're interested in how capable any given model will be.
For this we really want to know how large the training data was, and how compressed it has been. If it's 1PB -> 1MB then we can easily say that's much less useful (but much faster to use) than 1PB -> 0.5TB.
Likewise we can say, if it's a video generator, that 1PB is far too small to be generally useful -- so at best it'll be domain-specific.
> My point, I suppose, is that users of NNs aren't interested in the architecture characteristics which affect training -- they're interested in how capable any given model will be.
Yes.
A more helpful bit of information could be what the model was pre-trained on. Assuming they’re trying to refine it for a more specific task.
Size is helpful for "what can I run on my machine" (or how much it would cost to run on a server). Not all models of a given byte size are created equal for a given task.
If you don’t believe me you can try training an LLM with just a single parameter, specified to an incredible precision using e.g a trillion bytes. Hint: it won’t perform very well.
Allocating more bits for each parameter increases precision, by definition. But that doesn’t come for free.* So it is useful to optimize network performance for a given number of total parameter bytes.
I haven’t done a recent literature review, but my hand-wavy guessplanation is that a NN (as a whole) can adapt to relatively low precision parameters. Up to a point.
* In general. Given actual hardware designs, there are places where you have slack in the system. So adding some extra parameters, e.g. to fully utilize a GPU’s core’s threads (e.g. 32), might actually cost you nothing.
Unrelated, and I know it's just a representative number, but I have seen the training data assumed to be in this range a few times. The entire training set of ChatGPT is almost surely less than a TB or two with compression, which is 5 orders of magnitude lower. I believe that such an efficient representation of text is one of the biggest reasons why text models work so well but image-understanding models do not.
Perhaps the process was an initial c. 1PB then sampled down to the TBs.
Text is extremely lightweight, so I suppose everything ever written is at most 1-10PB.
This is one of the illusions of text-generative NNs: a 0.5TB weight set is basically enough to store every book, making claims of "out-of-sample generalisation" extremely suspicious and, indeed, fairly obviously false.
E.g., ask ChatGPT to write tic-tac-toe in JavaScript and you get a working game; ask it to write Duck Hunt and you don't.
Well then do explain a bit further; I still don't fully grasp what “100s PT in 0.5T” means exactly. 100 petatokens in half a trillion? Half a terabyte? 100 seconds?
Plus afaik base model training tokens don't have the same effect as fine tuning tokens, so there would need to be a way to specify each of those separately.
FWIW I easily interpreted these as '100s of petabytes' and '0.5 terabytes' without having to give it too much thought. The original comment explicitly specified 'bytes' as the unit being suggested.
For Falcon 40B you probably need an A100 40GB or so.
Every model is drastically different.
If you want to run something on consumer hardware, your best bet is using anything ported to the ggml framework, especially if you're on Apple silicon.
To add: usually when you go to download a GGML model, you want a quantized version. People like TheBloke will usually list the RAM requirements for running it, e.g.: https://huggingface.co/TheBloke/vicuna-13b-v1.3-GGML
The number after "q" indicates how many bits the weights use, e.g. q4 means 4-bit.
If you use something like KoboldCPP, you can offload only some of the layers onto the GPU and run larger models that way.
E.g. the above-linked Vicuna model requires about 10GB of memory at q4, but I have less VRAM than that. I can still run it, though.
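A rough sanity check of those numbers, treating bits per weight as the main driver (this is a ballpark sketch; GGML formats store slightly more than this per weight, plus scales):

    # Ballpark weight memory for a 13B-parameter model at various quantization levels.
    n_params = 13e9
    for name, bits in [("f16", 16), ("q8", 8), ("q5", 5), ("q4", 4)]:
        weight_gb = n_params * bits / 8 / 1e9
        print(f"{name}: ~{weight_gb:.1f} GB of weights")
    # q4 -> ~6.5 GB of weights; the ~10 GB figure above additionally covers
    # context, quantization scales, and runtime overhead.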
Sorry, I don't know. I suspect that it's not possible (yet?). OpenLLM lists a bunch of models in the Github Readme. I think the best way would be to use those for now.
Model size in GB = VRAM required for uncompressed inference (16-bit, aka "half precision"), plus ~1-4GB for context. For 8-bit you need half that. For GPTQ/4-bit you need a quarter of that.
For example, assuming GPTQ 4-bit, a 100GB 16-bit model needs 26-30GB of VRAM. The smallest video cards that meet this requirement are 32GB or 40GB cards (two 24GB cards in parallel work as well, e.g. 2x3090).
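Spelling that rule of thumb out as arithmetic (a rough sketch of the estimate above, using the same ~1-4GB context allowance):

    # Rule-of-thumb VRAM estimate for a quantized model.
    model_size_fp16_gb = 100               # download size of the 16-bit model
    vram_8bit = model_size_fp16_gb / 2     # ~50 GB
    vram_4bit = model_size_fp16_gb / 4     # ~25 GB
    low, high = vram_4bit + 1, vram_4bit + 4
    print(f"GPTQ 4-bit: roughly {low:.0f}-{high:.0f} GB of VRAM")
    # -> 26-29 GB, hence the suggestion of 32 GB / 40 GB cards (or 2x24 GB).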
You want the entire model to fit in memory. So if you’re looking at a download size of, say, 100GB then don’t run on less than that. Your machine will swap to/from disk constantly, which will be slow and wear out an SSD.
If you want to train a model, that’s a different story.
Do we know how LLMs available in OpenLLM and other open source LLMs compare to different versions of GPT models? I know there’s a leaderboard on huggingface: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb... but it doesn’t contain GPT models.
Agreed; in real life we mostly have specialized experts. You go to a doctor to ask about health-related stuff; you ask your colleague who is an expert in ML about ML, and probably not much about mobile development.
Instead of having one model that is an expert at coding in all languages, it would probably be better to be able to switch context.
Native mobile dev LLM? - e.g. train only on swift, objc, kotlin, java, c, c++ code
Python dev LLM? - train only on python, c, c++, rust code
Would the final model this way be smaller, faster, and maybe even better?
In my experience it doesn't work like that. It's like if you take an idiot and spend a bunch of time training them, they'll still perform much worse than a moderately intelligent person with much less training. And smaller models can be pretty idiotic.
Provided that the problem is suited to the strengths of an LLM at all. An example might be a small ai custom trained on documentation for libraries. You ask it a question like "how do I make the background move with parallax effect when you move the cursor". It's a little ambiguous, high-level concept, and probably not a single function.
Small AI: likely makes up a function, or suggests a single function that isn't sufficient; refuses to budge from its answer, or apologizes and gets confused.
Large LLM: able to actually understand the question and combine several functions. If it doesn't work you can tell it why, and it fixes it.
Because there’s a world of difference between a reinforcement learning trained special purpose model and asking a general purpose large language model to have a go at something.
Because they do completely different things? They literally have nothing to do with each other. Why do planes fly better than ships if ChatGPT can't do math?
Why not? Chess notation is text. But the problem is that LLMs are not that good at problems which require evaluating a search tree. Also, leading chess engines such as lc0 are, even without search, better than 90+% of all humans.
Unfortunately, for transformer-based LLMs the magic only starts when they are trained with more than 10^22 FLOPs (preferably 10^24), so smaller models might not cut it even for fine-tuned tasks.
Fine-tuning is the most important part, but it is under intense research today and things change fast. I hope they can streamline this process, because these smaller models can only compete with big models when they are fine-tuned.
Yes, but fine-tuning requires a lot more GPU memory and is thus much more expensive, complicated, and out of reach for most people. To fine-tune a >10B model you still need multiple A100s / H100s. Let's hope that changes with quantized fine-tuning, forward-pass-only approaches, etc.
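For what it's worth, "quantized fine-tuning" here means something like QLoRA: load the base model in 4-bit and train only small adapter weights. A minimal sketch with transformers + peft + bitsandbytes; the base model and hyperparameters are just illustrative assumptions:

    # Hypothetical QLoRA-style setup: 4-bit base weights, trainable LoRA adapters only.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b",                                   # example base model
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
        device_map="auto",
    )
    lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                      target_modules=["query_key_value"])     # Falcon's attention projection
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()   # only a tiny fraction of weights get gradients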
What exactly do you mean here, that the smaller models can compete with the larger ones once they are fine-tuned? What about once the larger models are fine-tuned? Are they then out of reach of the fine-tuned smaller models?
They're probably referring to fine-tuning on private/proprietary data that is specific to a use case. Say a history of conversation transcripts in a call center.
Larger models, like OpenAI's GPT, don't have access to this by default.
Smaller models are likely more efficient for inference and don't necessarily need the latest GPU. Larger language models tend to perform better across more types of tasks. But for a specific enterprise use case, either distilling a large model or using a large model to help train a smaller one can be quite helpful in getting things to production, where you may need cost-efficiency and lower latency.
What kind of hardware do I need to run something small scale (1 user concurrently) and get reasonable result? Are we talking about Raspberry Pi, Core i5, Geforce 4090?
Anecdote: I can tell you that Vicuna-13B, which is a pretty decent model, runs at about 4 to 5 tokens/second on an Apple M1 with 16GB. It takes about 10GB of memory when loaded. A friend of mine with an RTX 2070 Super gets comparable results.
I'm upgrading my M1 laptop on Thursday to a newer model, an M2 Max with 96GB of memory. I'm totally going to try Falcon-40B on it, though I do not expect it to run that well. But I do expect it (the M2, I mean) to be snappier on the smaller models than my original M1 is.
I think that strongly depends on what you count as "reasonable": smaller models take less memory, so there's a trade-off between quality and speed depending on if you can fit it all in graphics VRAM, or system RAM, or virtual memory…
Just keep in mind the speed differences between the types of memory. If you've got a 170bn-parameter 4-bit model and you're swapping over a 400 Mbps link, a naive calculation says it will take at best 28 minutes per token, unless your architecture lets you skip loading parts of the model. It might take longer if the network has an internal feedback loop in its structure.
If your model is 170bn 4-bit parameters, you have 85 Gbytes that has to be loaded into the CPU or GPU; if those parameters are on the other side of a 400 Mbps port, that takes 85 Gbyte / 400 Mbps = 1700 seconds = 28 minutes and 20 seconds.
If you don't have sufficient real RAM or VRAM, the entire model has to be re-loaded for each step of the process.
That assumes no looping (looping makes it longer) and no compartmentalisation in the network structure. (If you could arrange for an entire fragment to be not-activated, you'd have the possibility of skipping loading that section; I've not heard of any architecture that does this. It would be analogous to dark silicon, or to humans not using every brain cell at the same time, outside of seizures.)
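Reproducing that back-of-the-envelope calculation:

    # 170bn parameters at 4 bits each, streamed over a 400 Mbps link per token.
    params = 170e9
    bits_per_param = 4
    model_bytes = params * bits_per_param / 8        # 85 GB
    link_bits_per_sec = 400e6                        # 400 Mbps
    seconds_per_token = model_bytes * 8 / link_bits_per_sec
    print(f"{model_bytes / 1e9:.0f} GB; {seconds_per_token / 60:.1f} minutes per token")
    # -> 85 GB and ~28.3 minutes per token, matching the figures above.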
OpenLLM, in comparison, focuses more on building LLM apps for production. For example, the integration with LangChain + BentoML makes it easy to run multiple LLMs in parallel across multiple GPUs/nodes, or chain LLMs with other types of AI/ML models, and deploy the entire pipeline on Kubernetes (via Yatai or BentoCloud).
Suppose I’ve written code that calls the OpenAI API. Is there some library that helps me easily switch to a local/other LLM? I.e., a library that (ideally) provides the same OpenAI interface for several models, or if not, then at least a single consistent interface.
OpenLLM plans to provide an OpenAI-compatible API, which would let you use OpenAI's Python client to talk to OpenLLM; users would just need to change the base URL to point to their OpenLLM server. This feature is a work in progress.
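A sketch of what that could look like once the endpoint ships; the port, path, and model name below are assumptions on my part, not the final API:

    # Hypothetical usage of OpenAI's Python client pointed at a local OpenLLM server.
    import openai

    openai.api_base = "http://localhost:3000/v1"   # assumed local OpenLLM address
    openai.api_key = "not-needed-locally"          # placeholder; no real key required

    resp = openai.ChatCompletion.create(
        model="falcon-7b",                         # illustrative model name
        messages=[{"role": "user", "content": "Summarize what OpenLLM does."}],
    )
    print(resp["choices"][0]["message"]["content"])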
I like the idea of having a standard API for interacting with LLMs over the network. Many models need to run on beefy hardware and would benefit from offloading to a remote (possibly self-hosted) server, and I think it makes logical sense to separate the code for running LLMs from the UI for accessing them.
It would be great to have this, but the space is moving rapidly and hasn't converged on a set of uniformly accepted practices yet. For example, I'm not aware of a single open source LLM that has something similar to OpenAI's function calls.
Looks great! I'm planning to integrate it into my new project (to-chatgpt: https://github.com/SimFG/to-chatgpt), which will provide users of ChatGPT applications with a wider range of LLM service options.
OpenLLM is adding an OpenAI-compatible API layer, which will make it even easier to migrate LLM apps built around OpenAI's API spec. Feel free to join our Discord community to discuss more!
Only tangentially on-topic, but it concerns LLMs: they seem to occasionally mix up CJK vocabularies and also generate invalid UTF-8 sequences, because CJK texts have overlapping code points and inputs are processed by the tokenizer. Are there developments in that direction? Aren't CJK ideograms essentially tokens?
What is the license like for this? Correct me if I'm wrong, but I think the official LLaMA has a license that only allows research use. Would this have a similar restriction if it had the same model architecture but different parameters?
OpenLLM itself is under Apache 2 license, which does NOT restrict commercial use. However, OpenLLM as a framework can be extended to support other LLMs which may come with additional restrictions.
Question: for someone that wants to play around with self-hosted text generation but has a crap laptop – are there any hosting providers (like a VPS) where I can run open source models?
Check out BentoML, which is the underlying serving framework used by OpenLLM; it supports other types of models and modalities, such as images and videos.
Think of the model as a gigantic compiled binary where you send in strings in a certain format and get back a response. This is a web API wrapper for that so you only need an HTTP client instead of having to run something like llama.cpp yourself.
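For example, hitting such a server with a plain HTTP client might look roughly like this (the endpoint path and payload shape are assumptions for illustration, not the documented API):

    # Hypothetical request against a locally running model server.
    import requests

    resp = requests.post(
        "http://localhost:3000/v1/generate",      # assumed local endpoint
        json={"prompt": "Explain what a tokenizer does in one sentence."},
        timeout=120,
    )
    print(resp.json())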