OpenLLM (github.com/bentoml)
659 points by fzliu on June 19, 2023 | 169 comments



Hi all, I'm the main maintainer from the OpenLLM team here. I'm actively developing the fine-tuning feature and will open a PR soon. Stay tuned. In the meantime, the best way to track development is on our Discord, so feel free to join!!


Thanks for the great project! Any chance your team might consider a more open platform than Discord for posting updates? I personally find Discord hard to use, and there's no way to subscribe sensibly (like RSS). Discord is usually muted.


Discord is a black hole where information goes to die. Its search and scrollback are awful. It's awful at being an archive, as finding anything that was asked more than a day or two ago is impractical.

To use Discord in good faith and with open eyes, you have to prioritize communication in the present, and give up hope of archiving anything that was said for people who might need the information in the future.


Discord is just a rich IRC replacement. You can log and search in IRC too, but nobody seriously tries to archive information for research later. And the big difference is that it's all closed and operated by one entity that can change conditions at will. Don't even try to use it for anything other than real-time chat.


"Discord is just a rich IRC replacement"

That's only half true. Yes, Discord does allow a "rich" chat experience, with channels and servers, but there the similarities end.

IRC is based on an open protocol, with many open source clients available for it, and a decentralized server infrastructure.

Discord is closed and centralized, with only a single client available for it.

You can easily log IRC channels, but there is no easy way to do that on Discord, if it can be done at all.

I've logged every channel I've ever visited on IRC, and I can use powerful text tools to regex search through all of my conversations on IRC and have the results appear instantly. Nothing remotely like that is possible with Discord.

Paging through IRC logs is virtually instant on a modern terminal, while Discord makes you wait a long time between every other page load, so if you need to look through more than a handful of pages it's incredibly slow and painful.

Some IRC channels have their logs published on the web, making them fully searchable through web search engines, but to my knowledge no Discord channels do that.

What happens in Discord stays in Discord.


Grepping through IRC logs has a 10x better UX


Furthermore, you risk getting banned for deleting messages you wrote in the past


For gaming communities (where you'd use voice chat), Discord was great. Easy to set up, free as in beer, runs in the cloud. The alternatives back in the day did not have these features. They were either expensive (Ventrilo), bad quality (Ventrilo and Skype latency/quality), proprietary (only Mumble wasn't; TeamSpeak, Ventrilo, etc. were), lacking community features (Ventrilo), very archaic (TeamSpeak, Mumble), or required self-hosting (all but Ventrilo). It was also before GDPR existed. So Discord happily used and abused that unique position.

It's a shame it's being used for general communities that don't use or need the voice chat feature, especially when it's an official community for a project, given their stance on third-party clients and privacy issues.

If you don't need voice chat, Zulip, Mattermost, Revolt, Discourse, and many others would suffice (Linen recently got featured on HN). If you do, I think even Signal would suffice these days.

For Discord search, Answer Overflow was recently featured on HN [1].

[1] https://news.ycombinator.com/item?id=36383773


agreed.


I find their search amazing. What's your issue with it?


Here's just one issue:

They stem words aggressively, so searching for "repeater", which is a less common, specific term, gives you results including "repeat", a commonly used word. And there's no way to do an exact word search.


The issue is it's not indexed by Google


There was a recent post about an open source tool for indexing Discord content and making it available for Google search:

https://news.ycombinator.com/item?id=36383773


have you used google lately? might as well not be indexed with all the seo spam you get as top results


> have you used google lately? might as well not be indexed with all the seo spam you get as top results

I just googled "how to use openllm" as an example to test your thesis, and the results look very relevant to me.

https://www.google.com/search?client=safari&rls=en&q=how+to+...


You might want to glance again because all of those results are for a different product.


Top of the results page says:

"Showing results for how to use openlm

Search instead for how to use openllm"


FYI, specifying the nfpr=1 query string parameter will disable Google's idiotic attempt to be helpful by searching for something other than what you want to link to.


When I click this, Google gives me results for "how to use openlm", a commercial product; they literally change your search term if there's a product that fits.


Related: As an operator/mod/admin it's fairly straightforward to bridge a Discord channel to Matrix (and, if one so desires, from there to IRC), allowing users not on Discord to participate. Conservative mods concerned about spam can start with an allowlist of which servers can join.

https://github.com/matrix-org/matrix-appservice-discord


I know this isn't a great time for reddit, but I just made this on your behalf:

https://www.reddit.com/r/OpenLLM/

I much prefer the HN/Reddit discussion format to Discord and even Stack Overflow.


Plugging the open source and self-hostable https://revolt.chat, which I've found to have great UX and to be very performant compared to Discord.


I'm liking revolt. Thanks for the suggestion.


good alternative: https://www.linen.dev/


s/rd/urse/g


HAHA this was one of my panel interview questions at Goooog'

Q: "How do you do a search and replace for a string in VI"

Me: "I can't recall right now, I'd just google it"


What an insulting interview question. I hope it was just in jest, or at the end, looking to pad the time.

However, it did make me realize hidden therein is an actual interesting interview question, similar to the "describe what happens when you type an address into the browser's URL bar and hit enter": describe what happens after you type `:s/foo/bar` and hit enter. Followup version: what about `:%s/foo/bar`? The kind of thing that can be interesting to watch them reason through even if they don't know the answer, or even know what those syntaxes do.


Alt proposed answer "I'd install emacs".


Side question: why are people working on open source projects communicating through Discord a lot nowadays?

Are Discord conversations persisted and indexed on search engines?


I find Discord quite versatile and a bit overwhelming at the same time. As to SEO, see https://news.ycombinator.com/item?id=36383773

AFAIK most gamers choose it for voice chat (anyone remember TeamSpeak?)


In Europe, TeamSpeak is still very popular.


I used to play EVE Online a fair bit, and always thought it interesting how some of the groups used Discord but only for text communications. Voice was still done over Teamspeak or Mumble.


When I played EVE, Mumble was the de facto voice comms since it supported hundreds of pilots, which happened many times during joint ops, with XMPP for text chat and pings.


My understanding from asking several people, since I hate discord and want to know why people insist on using it, is that it’s a free alternative to Slack. Simple as that.

But it’s crazy, people are aggressive about Discord for some reason. I maintain an OpenAI SDK package for .Net, and I had some random person decide they wanted it to be a Discord community, so they created a Discord claiming it was the official community discord for my library, and submitted a PR updating my readme to say that it’s my project’s official Discord. They also replied to several issues and pull requests telling people to discuss it on that discord. If Discord isn’t paying this person in some guerrilla marketing tactic, they should be...


Because it's easy, free and it just works.

Very few people actually care about indexing the conversations.


So all knowledge is lost and questions have to be asked and answered again and again?


That didn't stop IRC being popular in the 1990s.

There has long been a place in the ecosystem for ephemeral chat. Often alongside non-ephemeral things like written documentation.


People didn't put documentation in IRC channels because they didn't want to answer the same questions over and over. Info went into a wiki, and you would get flamed for asking a question on IRC that was answered on the wiki. Discord is not a good place to stash documentation.


It's OK, you get scolded for asking an FAQ in many Discord "servers" as well.


> That didn't stop IRC being popular in the 1990s.

IRC chats, especially in open source project channels, could and would be archived, published on the web, and indexed by search engines.


In my experience, I don’t think I’ve ever seen an IRC log in a search result.

#haskell on Libera is publicly logged, but I couldn't get Google to return a quoted phrase from a message a few weeks ago.

Many people on IRC don’t enjoy being in logged channels. I’ve also heard that there are GDPR implications to publicly logging people’s messages without their consent.

Discussion of the difficulty and downsides of IRC logging, from a couple of years ago:

=> https://news.ycombinator.com/item?id=22892015

=> https://web.archive.org/web/20200417001532/https://echelog.c...

The HN blowback to developers choosing to use Discord is just wildly out of proportion.


No, it's not. If you work on an open source / open development project, it totally makes sense to avoid walled gardens for the community chat/forum (a few years ago it was a public Slack instance, nowadays it's Discord servers).


So just like Discord then..


I wasn't aware that was being done.

Can you show me how to access the archives of the ask-for-help channel on the openllm Discord server? Right now they're discussing "loading models on CPU vs GPU". No matter how explicit I got, google did not find the discussion.


It's up to the server owners/admins to configure archiving, same as IRC.


No


Also, monks being the only ones who could read and write didn't stop religion from being popular in the Middle Ages.

/s

C'mon


Isn't there something really nice about it though? It seems to me that most every community gradually evolves into one where every new message from a new-ish member is answered by something like "Duplicate, please search first!". And this in turn makes those newcomers either go away, become passive lurkers, or become part of the "hive-mind" (as only likeminded questions get answered).

On the other hand, if people have to actually converse to get an answer to their questions (like back in the real world), newcomers can more rapidly become part of the community, and help make it more diverse.


I just recently saw a post where someone said something similar about Reddit versus traditional forums.

There's a balance between engaging with new members and not turning it into a time sink for older members. This is probably a good use case for LLMs.


LLMs could indeed address the first part, but not the second, of bringing the newcomers in via actual conversation with the older members. The only good solution I encountered to this is of having some (preferably not too experienced) member(s) actively take upon themselves the role of welcoming newcomers and answering their questions, whether that's in an official or unofficial capacity.

This to me is the real way through this "Eternal September", where in every "cohort" of newcomers, one or more choose to stay close to the doorway to welcome and guide the next cohort.


Newcomers are also different. Some are actually experienced, while some are real newbies.

I'm wondering how we could learn from games, making the content also adaptive to user levels/experience.

It's probably also the key agenda in education.


The best of both worlds - a friendly community that welcomes newbies, with a searchable archive - is possible. Limiting to only chat-based support means that support is bottle-necked by the folks who are available and engaged at the time of the question, and that knowledge will "drop out" of the community as people forget it.


Apologies for my skepticism, but is it just "possible", or do you actually have an example of a long-lived community that remained fully welcoming to newbies while utilizing a searchable archive?

In any case, I'm not arguing that it's impossible, but rather that the more comprehensive the archive, the less welcoming the community would tend to be, all other things being equal. To take it to the extreme, I'll posit the following law: "A well-curated archive is the grave of a community"


Hard disagree. If anything you'll find that the most knowledgeable members get burned out answering the same questions over and over again, so they begin to simplify their answers until they just become copy pasta.

You can still have channels open to welcoming new people while at the same time having a large archive of answered questions so that over time a reservoir gets built.

Saying that the same questions getting asked over and over again by new people is somehow a more welcoming community, is like saying that there's any meaningful interaction happening when two people say "What's up?" followed by the response "not much". It's a handshake protocol equivalent without actual depth.


    I see friends shaking hands
    Saying, "How do you do?"
    They're really saying
    I love you.


A very reasonable question, and I'll admit that I'm not deeply-entrenched in enough technical communities to give you an actual example. But yeah, intuitively I do agree with the sibling commenter - a well-curated archive is a tool of technology which allows skilled respondents to preserve their time and energy for new and interesting questions. A pointer to search is not necessarily dismissive - there is a world of difference between the following _technically_ equivalent responses:

* FFS, read the fuckin' archive noob and stop wasting our time

* Hey there, thanks for asking! This is actually a pretty common question, and we have guides written up for just this case. Try entering some of your search terms here [link], and come back with a follow-up question if that doesn't help you!

But yes, in fairness, I'll certainly agree that a community which _chooses_ to respond as the former will stagnate and die.


No, questions don't have to be asked and answered again and again, because all the knowledge is lost, full stop. No one would know anything.


Not lost enough to use as a transient space for sharing secret intelligence reports.


Maybe it doesn't matter because this type of knowledge is relevant for current week only?


Indexing conversations is secondary for gaming but primary for FOSS projects, and Discord sucks at that. It's like wiping your ass with a fork.


short answer: because it's one of the options with least friction to get running

a lot of people who are into tech stuff already have a discord account making joining the community a one click process, the instant nature of it seems to appeal to younger users more than async forums, it's a fairly mature platform so it has a bunch of moderation/customization/integration features you might want, etc.

> Are Discord conversations persisted and indexed on search engines?

nope (and that is a drawback many point out)


Isn't it a generation thing? If I had the choice everyone would be on IRC still.


I've used IRC for a long time and still do, but I do think Discord has a nicer UX for most use cases. In particular, building communities around clusters of channels ("servers") and support for rich media (yes, some old people might call that a downside) increase the appeal for most people. It's also a lot more work to have a persistent connection on IRC (bouncers).

My main problem with Discord is that it's someone else's centralized, for-profit company and has no apparent barriers to enshittification[0]. As Reddit recently demonstrated, it's probably a mistake to build communities on top of something like that.

Matrix is a good candidate for a modern successor to IRC. It's not quite as slick a UX as Discord, but it addresses the main advantages Discord has over IRC.

[0] https://pluralistic.net/2023/01/21/potemkin-ai/#hey-guys


Matrix would be my choice as well but good luck getting people to use the uglier alternative. Discord is great.


The problem with IRC is that it's crucial to have really robust read-state synchronization across desktop and mobile these days.

Slack was the first to really get that right, and Discord effectively emulated them and made it available for free.

IRC users could get there with bouncers, but those were always a lot harder to get going with.


Practically all of my friends grew up with IRC, we are in our late 30s, 40s, early 50s.

We might reminisce about irc but we all prefer discord.

Even the searchability of indexed irc has been surpassed by other knowledge sites. It would have to be something extremely niche these days where the only source of info is in an irc chat log


IRC doesn't even have history, one of the most basic requirements for a modern rudimentary chat app. It's ridiculous to suggest using it in 2023 when it doesn't have features a freshman homework assignment chat app has.


1. People like talking to each other on Discord. 2. No. :/



What's the rationale for telemetry tracking?

https://github.com/bentoml/OpenLLM/blob/main/src/openllm/uti...


They have a section about it in the README:

https://github.com/bentoml/OpenLLM#-telemetry


Very cool. Btw, it's not mentioned in the readme, so I assume it's only for running full-precision models, or do quantized GGML/GPTQ/etc. models also work with it?


Hi there, 8-bit and 4-bit are currently supported on main. GPTQ is a work in progress, as is GGML.


GPTQ support would be amazing (AutoGPTQ is an easy way to integrate GPTQ support - it's basically just importing autogptq and switching out 1 line in the model loading code).
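For reference, the AutoGPTQ loading path looks roughly like this (a minimal, hedged sketch; the checkpoint name is just an example, and this is not wired into OpenLLM yet):

    # Hedged sketch: loading a GPTQ-quantized checkpoint with AutoGPTQ.
    # The repo name is an example; adjust to whichever quantized model you use.
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM

    repo = "TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g"  # example quantized checkpoint
    tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(repo, device="cuda:0", use_safetensors=True)

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))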


How can we stay tuned if we can't do tuning? :P


Fine-tuning is coming up in the next release!

You can actually try it out on the main branch :P


Stray thought: it would be better to specify NNs in terms of their training-size to weight-size ratio in bytes, rather than "no. of parameters", or at least, to give this ratio alongside the number of parameters.

So, e.g., I'd imagine ChatGPT would be, say: 100s of PB in 0.5TB.

The number of parameters is a nearly meaningless metric: consider, e.g., that if all the parameters covary, then there's one "functional" parameter.

The compression ratio tells you the real reason why a NN performs. I.e., you can have 100s of billions of parameters, but without that 100s of PB -> 0.5TB, which you can't afford, it's all rather pointless.


I'm not sure ML researchers would agree that the number of (compressed) bytes is more meaningful than the number of parameters. Parameters have mathematical meaning; bytes don't.


My point is that they don't have a mathematical meaning. They have a training-time impact, but that's more relevant when creating than using.

You'd need to know how many parameters were independently covarying for any given class of predictions. It certainly isn't all of them.

You could cite the "average dropout percent to random-level accuracy on a given class of problems" (my guess is that this would show 5-20% of parameters could be dropped).

My point, I suppose, is that users of NNs aren't interested in the architecture characteristics which affect training -- they're interested in how capable any given model will be.

For this we really want to know how large the training data was, and how compressed it has been. If it's 1PB -> 1MB then we can easily say that's much less useful (but much faster to use) than 1PB -> 0.5TB.

Likewise we can say, if it's a video generator, that 1PB is far too small to be generally useful -- so at best it'll be domain-specific.


> My point, I suppose, is that users of NNs aren't interested in the architecture characteristics which affect training -- they're interested in how capable any given model will be.

Yes.

A more helpful bit of information could be what the model was pre-trained on. Assuming they’re trying to refine it for a more specific task.

Size is helpful for “what can I run on my machine” (or how much would it cost to run on a server.) Not all models are created equal, given a byte size, for a given task.


If you don’t believe me you can try training an LLM with just a single parameter, specified to an incredible precision using e.g a trillion bytes. Hint: it won’t perform very well.


Bytes does imply a level of precision, however, which affects the mathematical meaning. Perhaps there’s a metric that captures both.


Allocating more bits for each parameter increases precision, by definition. But that doesn’t come for free.* So it is useful to optimize network performance for a given number of total parameter bytes.

I haven’t done a recent literature review, but my hand-wavy guessplanation is that a NN (as a whole) can adapt to relatively low precision parameters. Up to a point.

* In general. Given actual hardware designs, there are places where you have slack in the system. So adding some extra parameters, e.g. to fully utilize a GPU’s core’s threads (e.g. 32), might actually cost you nothing.


> 100s PB

Unrelated, and I know it is just a representative number, but I have seen the training data assumed to be in this range a few times. The entire training set of ChatGPT is almost surely less than a TB or two with compression, which is 5 orders of magnitude lower. I believe that such an efficient representation of text is one of the biggest reasons why text models are working so well but image understanding models are not.


Perhaps the process was an initial c. 1PB then sampled down to the TBs.

Text is extremely lightweight, so I suppose everything ever written is at most 1-10PB.

This is one of the illusions of text-generative NNs: a 0.5TB weight set is basically enough to store every book, making claims of "out-of-sample generalisation" extremely suspicious, and indeed, fairly obviously false.

E.g., ask ChatGPT to write tic-tac-toe in JavaScript and you get a working game; ask it to write Duck Hunt and you don't.


Well it's not a completely meaningless metric as it immediately tells you roughly how much memory you need to load it, which is kind of important?


If you look at my suggestion, it's to state exactly that memory -- rather than to estimate based on bits/parameter.


Well then, do explain a bit further; I still don't fully grasp what "100s PT in 0.5T" means exactly. 100 petatokens in half a trillion? Half a terabyte? 100 seconds?

Plus afaik base model training tokens don't have the same effect as fine tuning tokens, so there would need to be a way to specify each of those separately.


FWIW I easily interpreted these as '100s of petabytes' and '0.5 terabytes' without having to give it too much thought. The original comment explicitly specified 'bytes' as the unit being suggested.


I edited it to be TB, PB; I was thinking of these as prefixes on bytes.


The project seems great!

However, newcomers (like me) are pretty blind about minimum system requirements.

Could you please add them to the models list?

For example: what minimum hardware do I need to run Falcon-40b?

PS: If you only have a few setups "known to work" (or just one), listing that would be helpful too.


For Falcon 40B you probably need an A100 40GB or so.

Every model is drastically different.

If you want to run something on consumer hardware, your best bet is using anything ported to the ggml framework, especially if you're on Apple silicon.


To add: usually when you go to download a ggml model you want a quantized version. People like TheBloke will usually have some RAM requirements for running it, eg: https://huggingface.co/TheBloke/vicuna-13b-v1.3-GGML

The number after q determines how many bits the weights are. E.g. q4 means it is 4-bit.

If you use something like KoboldCPP you can put only some of the layers onto the GPU, and you can run larger models that way.

Eg the above linked Vicuna model requires about 10GB of memory at q4, but I have less VRAM than that. I can still run it though.
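If you'd rather drive one of those GGML files from Python instead of KoboldCPP, a minimal sketch with llama-cpp-python looks something like this (the file name and layer count are just examples, and this is separate from OpenLLM):

    # Hedged sketch: running a quantized GGML model via llama-cpp-python.
    # The model file is an example download from TheBloke; adjust paths/params to taste.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./vicuna-13b-v1.3.ggmlv3.q4_0.bin",  # ~8GB q4 file
        n_ctx=2048,        # context window
        n_gpu_layers=20,   # offload only part of the model if VRAM is limited
    )
    out = llm("Q: Why is the sky blue? A:", max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"])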


Let's say I wanted to use one of their quantized models with this OpenLLM project. How would I do that?


Sorry, I don't know. I suspect that it's not possible (yet?). OpenLLM lists a bunch of models in the Github Readme. I think the best way would be to use those for now.


Do you know any "standard" way or measures to determine the approximate hardware requirements of a model?


Model size in GB = VRAM required for uncompressed inference (16-bit, aka "half precision"), plus ~1-4GB for context. For 8-bit you need half that. For GPTQ/4-bit you need a quarter of that.

For example, assuming GPTQ 4-bit, a 100GB 16-bit model needs 26-30GB of VRAM. The smallest video cards which meet this requirement will be 32GB or 40GB cards. (Two 24GB cards in parallel work as well, e.g. 2x3090.)
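Roughly, in code (just a back-of-the-envelope sketch of the rule above, not an exact requirement):

    # Rule-of-thumb VRAM estimate: scale the 16-bit size by the bit width,
    # then add a few GB for context. Numbers are rough estimates only.
    def vram_estimate_gb(fp16_size_gb, bits, context_gb=2.0):
        return fp16_size_gb * bits / 16 + context_gb

    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit: ~{vram_estimate_gb(100, bits):.0f} GB")
    # 16-bit: ~102 GB, 8-bit: ~52 GB, 4-bit: ~27 GB (in the 26-30GB range above)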


You want the entire model to fit in memory. So if you’re looking at a download size of, say, 100GB then don’t run on less than that. Your machine will swap to/from disk constantly, which will be slow and wear out an SSD.

If you want to train a model, that’s a different story.


From the HuggingFace page:

> You will need at least 85-100GB of memory to swiftly run inference with Falcon-40B.

So it may be possible with less (via swapping), though less efficiently and slower.


For Falcon 40b, the 8-bit version would probably need about 48GB of VRAM while the 4-bit would need something closer to 28GB.


Currently on main, 8-bit and 4-bit quantization are supported.

One can simply do

```openllm start falcon --model-id tiiuae/falcon-40b-instruct --quantize int4```

Beware that there is no free lunch, meaning the quality of inference will degrade by a lot when using int4 quantization.


Do we know how LLMs available in OpenLLM and other open source LLMs compare to different versions of GPT models? I know there’s a leaderboard on huggingface: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb... but it doesn’t contain GPT models.



The GPT4/ChatGPT ones are visible in the source: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...


This sounds promising. Smaller but custom trained/tuned models would be ideal - works for the task without the overhead


Agreed. Mostly, in real life we have specialized experts. You go to a doctor to ask about health-related stuff; you ask your colleague who is an expert in ML only about ML, and probably not much about mobile development.

Instead of having an ML model that is an expert at coding in all languages, it would probably be better to allow switching contexts.

Native mobile dev LLM? E.g. train only on Swift, ObjC, Kotlin, Java, C, C++ code.

Python dev LLM? Train only on Python, C, C++, Rust code.

Would the final model this way be smaller, faster, and maybe even better?


In my experience it doesn't work like that. It's like if you take an idiot and spend a bunch of time training them, they'll still perform much worse than a moderately intelligent person with much less training. And smaller models can be pretty idiotic.


Then why do chess AIs perform much better than LLMs trying to play chess?


Provided that the problem is suited to the strengths of an LLM at all. An example might be a small AI custom-trained on documentation for libraries. You ask it a question like "how do I make the background move with a parallax effect when you move the cursor?" It's a little ambiguous, a high-level concept, and probably not a single function.

Small AI: likely makes up a function or suggests a single function which isn't sufficient. Refuses to budge from its answer, or apologizes and gets confused.

Large LLM: able to actually understand the question and combine several functions. If it doesn't work you can tell it why, and it fixes it.


Because there’s a world of difference between a reinforcement learning trained special purpose model and asking a general purpose large language model to have a go at something.


Because they have an explicit model of chess and specific heuristics for learning chess.

An LLM could have picked up some chess patterns through osmosis, but it can not reason explicitly in the domain.


No"they" (lc0) don't have specific heuristics


Because they do completely different things? They literally have nothing to do with each other. Why do planes fly better than ships if ChatGPT can't do math?


Why would a language model be good at playing chess?


Why not? Chess notation is text. But the problem is that LLMs are not that good for problems which require evaluating a search tree. Also, leading chess engines such as lc0 are, even without search, better than 90+% of all humans.


Unfortunately, for transformer-based LLMs the magic starts only when they are trained with more than 10^22 FLOPs (preferably 10^24), so smaller models might not cut it even for fine-tuned tasks.


At medium size (13B), Microsoft Orca demonstrated you can trade off size with a larger fine-tuning dataset.


Any references on this?



Agreed.


Fine-tuning is the most important part, but it is under intense research today and things change fast. I hope they can streamline this process, because these smaller models can only compete with big models when they are fine-tuned.


Yes, but fine-tuning requires a lot more GPU memory and is thus much more expensive, complicated, and out of reach of most people. To fine-tune a >10B model you still need multiple A100s/H100s. Let's hope that changes with quantized fine-tuning, forward-pass-only methods, etc.


The OpenLLM team is actively exploring those techniques for streamlining the fine-tuning process and making it accessible!


You can fine-tune medium models (3-60B) on a single GPU with QLoRA.
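A minimal sketch of what that looks like with the Hugging Face stack (transformers + peft + bitsandbytes); the model id, target modules, and hyperparameters are illustrative assumptions, and this is separate from OpenLLM's own upcoming fine-tuning API:

    # Hedged QLoRA sketch: 4-bit base weights plus small trainable LoRA adapters.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model_id = "tiiuae/falcon-7b"  # example base model
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,                       # the "Q" in QLoRA
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)

    lora = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["query_key_value"],      # Falcon's fused attention projection
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only the adapters receive gradients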


What is the $ cost of a fine tune though? $500?


Can you fine tune on an M2 with adequate memory?


What exactly do you mean here that the smaller models can compete with the larger once they are fine-tuned? What about once the larger models are fine-tuned? Are they then out of reach of the fine-tuned smaller models?


They're probably referring to fine-tuning on private/proprietary data that is specific to a use case. Say a history of conversation transcripts in a call center.

Larger models, like OpenAI's GPT, don't have access to this by default.


OpenAI’s API has fine tuning options for older GPT models: davinci, curie, babbage, and ada


It doesn't for the newer (relevant) ones. Fine-tuning them is expensive and slow, because they are large


Smaller models are likely more efficient to run inference with and don't necessarily need the latest GPU. Larger language models tend to have better performance over more different types of tasks. But for a specific enterprise use case, either distilling a large model or using a large model to help train a smaller model can be quite helpful in getting things to production, where you may need cost-efficiency and lower latency.


Is there a good checklist or framework for fine-tuning vs using a vector DB to increase the context size?


What kind of hardware do I need to run something small-scale (1 concurrent user) and get reasonable results? Are we talking about a Raspberry Pi, a Core i5, a GeForce 4090?


Anecdote: I can tell you that Vicuna-13B, which is a pretty decent model, runs about 4 to 5 tokens/second on an Apple M1 with 16GB. Takes about 10GB of memory when loaded. Friend of mine with a RTX 2070 Super gets comparable results.

I'm upgrading my M1 laptop on Thursday to a newer model, an M2 Max with 96GB of memory. I'm totally going to try Falcon-40B on it, though I do not expect it to run that well. But I do expect it (the M2, I mean) will be snappier on the smaller models than my original M1 is.


I think that strongly depends on what you count as "reasonable": smaller models take less memory, so there's a trade-off between quality and speed depending on if you can fit it all in graphics VRAM, or system RAM, or virtual memory…

Just keep in mind the speed differences between the types of memory. If you've got a 170bn-parameter 4-bit model and you're running from virtual memory over a 400 Mbps port, a naive calculation says it will take at best 28 minutes per token, unless your architecture lets you skip loading parts of the model. It might take longer if the network has an internal feedback loop in its structure.


Would you mind sharing the calculation that arrives at 28 minutes (!) per token?


If your model is 170bn 4-bit parameters, you have 85 Gbytes that has to be loaded into the CPU or GPU; if those parameters are on the other side of a 400 Mbps port, that takes 85 Gbyte / 400 Mbps = 1700 seconds = 28 minutes and 20 seconds.

If you don't have sufficient real RAM or VRAM, the entire model has to be re-loaded for each step of the process.

Assuming no looping (looping makes it longer) and no compartmentalisation in the network structure (if you can make it so an entire fragment might be not-activated, you have the possibility of skipping loading that section; I've not heard of any architecture that does this, it would be analogous to dark silicon or to humans not using every brain cell at the same time (outside of seizures)).
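Or, spelled out as a few lines of Python (same numbers as above):

    # Naive reload-everything estimate: 170bn params at 4 bits over a 400 Mbps link.
    params = 170e9
    model_bytes = params * 4 / 8          # 85 GB of weights
    link_bits_per_s = 400e6               # 400 Mbps
    seconds = model_bytes * 8 / link_bits_per_s
    print(seconds / 60)                   # ~28.3 minutes per token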


Cool stuff! How does this compare with Fastchat, which seems like another open source project that helps run LLM models?

At a glance, it seems like it's going for lots of similar goals (run LLMs with interoperable APIs):

https://github.com/lm-sys/FastChat


OpenLLM, in comparison, focuses more on building LLM apps for production. For example, the integration with LangChain + BentoML makes it easy to run multiple LLMs in parallel across multiple GPUs/nodes, or chain LLMs with other types of AI/ML models, and deploy the entire pipeline on Kubernetes (via Yatai or BentoCloud).
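For a rough idea of the LangChain side, it can look something like this (a hedged sketch; exact class and parameter names may differ between versions):

    # Hedged sketch of the LangChain integration; parameter names are illustrative.
    from langchain.llms import OpenLLM

    # Start a model in-process...
    llm = OpenLLM(model_name="dolly-v2", model_id="databricks/dolly-v2-3b")
    # ...or point at an already running `openllm start` server instead:
    # llm = OpenLLM(server_url="http://localhost:3000")

    print(llm("What is the difference between a llama and an alpaca?"))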

Disclaimer: I helped build BentoML and OpenLLM.


Suppose I've written code that calls the OpenAI API. Is there some library that helps me easily switch to a local/other LLM? I.e. a library that (ideally) provides the same OpenAI interface for several models, or if not, then at least a single common interface.



There's the OpenedAI-API extension for text-generation-webui: https://github.com/oobabooga/text-generation-webui/tree/main...


Found a couple others that do something like this. Turns out I had bookmarked them a while ago.

https://github.com/r2d4/openlm

https://github.com/hyperonym/basaran


OpenLLM plans to provide an OpenAI-compatible API, which will allow you to even use OpenAI's Python client to talk to OpenLLM; users just need to change the base URL to point to their OpenLLM server. This feature is a work in progress.
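Once that lands, usage might look roughly like this (a hedged sketch; the port, path, and model name are placeholders, since the feature isn't released yet):

    # Hypothetical sketch, assuming the planned OpenAI-compatible endpoint.
    import openai

    openai.api_base = "http://localhost:3000/v1"  # point the client at the OpenLLM server
    openai.api_key = "not-needed-locally"

    resp = openai.Completion.create(
        model="falcon-7b",   # whichever model the server was started with
        prompt="Explain quantization in one sentence.",
        max_tokens=64,
    )
    print(resp["choices"][0]["text"])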


Langchain offers abstraction across many LLMs, including OpenAI’s.


Langchain might be what you need.


I like the idea of having a standard API for interacting with LLMs over the network. Many models need to run on beefy hardware and would benefit from offloading to a remote (possibly self-hosted) server, and I think it makes logical sense to separate the code for running LLMs from the UI for accessing them.


It would be great to have this, but the space is moving rapidly and hasn't converged on a set of uniformly accepted practices yet. For example, I'm not aware of a single open source LLM that has something similar to OpenAI's function calls.


Looks great! I'm planning to integrate it into my new project (to-chatgpt: https://github.com/SimFG/to-chatgpt), which will provide users of ChatGPT applications with a wider range of LLM service options.


Looking forward to it!

OpenLLM is adding an OpenAI-compatible API layer, which will make it even easier to migrate LLM apps built around OpenAI's API spec. Feel free to join our Discord community and discuss more!


Why doesn't it support llama?


Though only tangentially related to the topic (wrt LLMs): it seems LLMs occasionally mix up CJK vocabularies and also generate invalid UTF-8 sequences, due to CJK texts having overlapping code points and inputs being processed by the tokenizer. Are there developments in that direction? Aren't CJK ideograms essentially tokens?


What is the license like for this? Correct me if I'm wrong, but I think the official LLaMA has a license that only allows research use. Would this have a similar restriction if it had the same model architecture but different parameters?


OpenLLM itself is under the Apache 2.0 license, which does NOT restrict commercial use. However, OpenLLM as a framework can be extended to support other LLMs, which may come with additional restrictions.


Question: for someone that wants to play around with self-hosted text generation but has a crap laptop – are there any hosting providers (like a VPS) where I can run open source models?


You can rent a server with a GPU pretty easily using RunPod, vast.ai, or DataCrunch. Or maybe even use something like Google Colab.


Thanks!


Does it work only with text? Or image/video processing too?


Check out BentoML, which is the underlying serving framework used by OpenLLM; it supports other types of models and modalities, such as images and videos.


From the description of the repo: "An open platform for operating large language models (LLMs) in production."


Hello, can someone provide a use case for me as a user of this? Is it better because it is cheaper than commercially available APIs?


Imo it's not cheaper (or better quality) and is only worth it:

i) if you want to toy around with it

ii) if you don't want to depend on the API being available (or don't want to be censored or share sensitive information with a third party)

iii) if you want to fine-tune your own model that you need to deploy yourself


What does it mean to “serve a model”? Where exactly does the request go and how does it interface with the model?


Think of the model as a gigantic compiled binary where you send in strings in a certain format and get back a response. This is a web API wrapper for that so you only need an HTTP client instead of having to run something like llama.cpp yourself.
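As a concrete (hedged) sketch, the round trip is just an HTTP request; the endpoint path and payload shape below are assumptions, so check the server's own API docs for the real schema:

    # Hedged sketch of calling a locally served model over HTTP.
    import requests

    resp = requests.post(
        "http://localhost:3000/v1/generate",           # assumed endpoint of the local server
        json={"prompt": "Write a haiku about GPUs."},  # assumed request payload
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json())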


Is this the best? I'm seeing a lot of these projects and am somewhat confused about what to go with.


Anyone know what terminal/theme they are using? haha...


The theme looks like rose-pine-dawn.


This looks like a very cool project and much needed


very cool.




