Ask HN: Who is using small OS LLMs in production?
50 points by alaeddine-13 on Aug 2, 2023 | 55 comments
It is not clear whether relatively small open source LLMs are being used in production or not, for instance Replit Code 3B, Llama 2 7B, CodeGen, ... What are the motivations for using those models over a prompted GPT API?


I'm integrating Llama 2 7B with an application I'm currently building. One of the biggest reasons was privacy, followed closely by price, and lastly the fact that I got it working locally in a few minutes.

I built a now-abandoned project using the GPT API. It was fine and not terribly expensive for my use case, but customers didn't like the pay-for-usage model, and the alternative was weird UX to limit prompt abuse to something I could afford while bootstrapping a side project.


Can you elaborate on the pricing difference?


The long-term price difference can be hard to estimate. Suppose that your application is heavily dependent on GPT-4. OpenAI can double your API prices the next term. Or they can decide to stop supporting the model that you wrote all your custom prompts for. Or they could decide to disable your account because they feel like it.

Unless it's for some fringe feature, building your business on OpenAI is probably a considerable (financial) risk in the future.


These are definitely very significant risks. But some of those risks are hard to avoid unless you train your own model, which can be prohibitively expensive.

Say you're building on top of Llama and Facebook decides not to update it any more, or changes the licensing terms (again). Say you're building on some other "open source" model and that project dies.

At least you can keep running the existing model rather than getting locked out overnight. That's definitely much better; at least it's survivable. But you would still have to find alternatives and review/scrap all your custom prompts.


Llama 2 appears to cost nothing because it can be run locally. The license does mention that if you have more than 700 million monthly active users you have to negotiate a different license, or something like that, but for most people's uses it would seem that Llama 2 is basically "free".


Well it's free as in free hops. You still gotta buy/rent the brewery to make the free beer.


Exactly. It isn't free as in the FSF's definition of freedom; you still can't do certain things with it, etc. But it is unencumbered by external costs unless you exceed a certain usage threshold.


Right, but can you scale GPUs for cheaper than OpenAI charges to use their APIs? 3.5 is _cheap_, and perfectly good for many use cases.


Thanks for elaborating. Yeah, I was curious how much you'd pay running it on a cloud server in a production-type scenario.


OpenAI costs money, and I'm able to run Llama 2 on my own GPU, so at least for development purposes it's ""free"" for me to experiment with at the moment. Mind you, this is a side project with zero funding outside of myself; YMMV if you have access to funding to the point of making OpenAI tokens totally disposable.


Ah okay, yeah I was curious how much you'd pay running it on a cloud server in a production-type scenario. Thanks


I have a Vultr account I just logged into and checked: from a very quick look, I could rent enough GPU memory to run Llama 2 for ~$180 a month. So if I make more than 27,000 requests with a 4k-token payload I will break even; otherwise I'd be better off using OpenAI's API.

EDIT: sorry, that was GPT-3.5. With GPT-4 it would be ~1,000 requests before I broke even at $180.
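
For reference, here's a rough sketch of that break-even math in Python. The $180/month GPU figure is from the comment above; the per-1K-token API prices are assumptions based on mid-2023 list prices, not figures from this thread:

    # Rough break-even: rented GPU for self-hosted Llama 2 vs. pay-per-token API.
    gpu_monthly_cost = 180.00        # USD/month for a rented GPU (from the comment)
    tokens_per_request = 4_000       # assumed payload size per request

    api_price_per_1k = {             # assumed blended USD per 1K tokens (mid-2023)
        "gpt-3.5-turbo": 0.00175,
        "gpt-4": 0.045,
    }

    for model, price in api_price_per_1k.items():
        cost_per_request = price * tokens_per_request / 1_000
        break_even = gpu_monthly_cost / cost_per_request
        print(f"{model}: ~{break_even:,.0f} requests/month to break even")

With those assumed prices this lands in the same ballpark as the numbers above: tens of thousands of requests for GPT-3.5, roughly a thousand for GPT-4.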


I see comments here about running Llama on 4090s, which is fine for local development and testing, but getting into production is a significant leap and a significant cost.

The thing that I keep running into in my SLA plans is concurrency. Yes, you can have a Llama 2 model running on an A100 somewhere - but that will support 1 concurrent prompt. Anything at a higher concurrency needs another GPU, or your end users will be waiting a while. Want to rent an 8 GPU machine in the cloud for inference? Be prepared to pay a lot of money for it.


You need an inference server. I am doing ~400 tokens/sec on 7B with a 4090 with multiple concurrent (streaming!) requests.

It's reasonably straightforward for me to host this and serve public requests, but it would likely just be a base model -- not sure if hosting (e.g.) 13B chat can serve people's use cases.
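
For anyone curious what that looks like, here's a minimal batched-inference sketch with vLLM, one possible inference server (the comment doesn't say which server is actually in use; the model id and sampling settings are placeholders):

    # Minimal batched-inference sketch with vLLM (one possible inference server).
    # Continuous batching lets a single GPU serve several requests at once.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf")      # assumed model id
    params = SamplingParams(temperature=0.7, max_tokens=256)

    prompts = [
        "Summarize: the quick brown fox jumps over the lazy dog.",
        "Extract the city from: 'Shipped from Berlin on Monday.'",
    ]
    for out in llm.generate(prompts, params):        # processed as one batch
        print(out.outputs[0].text)

For streaming to multiple clients you would typically run its OpenAI-compatible HTTP server instead of the offline API shown here.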


But is the 7B model any good and actually production worthy for things like RAG?


I'm writing a blog post with some more reasoning but my view is that it can be useful for certain simpler tasks (eg unstructured -> structured, basic summarization) and not more complex things (eg generation).

The tricky thing is that finetuning makes a big difference, and while it should be possible to hotswap LoRA adapters (at some cost to performance), I haven't figured that out yet.
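
For what it's worth, hot-swapping adapters with the peft library looks roughly like the sketch below. This is just one way to do it; the model id and adapter paths are placeholders, not anything from the thread:

    # Sketch: two LoRA adapters sharing one base model, switched in place with peft.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = PeftModel.from_pretrained(base, "adapters/summarize",
                                      adapter_name="summarize")
    model.load_adapter("adapters/extract", adapter_name="extract")

    model.set_adapter("summarize")   # route a summarization request
    # ... generate ...
    model.set_adapter("extract")     # swap to the extraction adapter without reloading

The swap is cheap because only the small adapter weights change; the base weights stay loaded.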


Not as good as GPT-4, of course, but not far from 3.5 if you just need to reword whatever is returned by the retrieval. It's like losing 20 IQ points, which might still be better than most support interactions I've had.


LLaMA 2 7B 8-bit can run pretty well on 64-core EPYCs, which are cheaper than GPU instances. Moreover, you can periodically batch multiple users rather than running a single inference for a single user.


Facebook is working very hard to make the main dividing line in generative AI not company vs. company but commercial vs. free. Having started from way behind, they are trying to make the company-vs.-company race irrelevant.


“Way behind” seems harsh when they have one of the best models available.


Not that impressed with llama 2 70b so far tbh. It’s a GPT3-level bullshit machine imo. But huge advantages in running privately and at the edge, so that’s going to be the dividing line imo. Commercial v free. Small v big. H100s in the cloud v edge


Price.

Data privacy.

Controlled latency.

Plenty of reasons to not send arbitrary data to a third party service.


There's also the availability factor. OpenAI has been known to go down on occasion and without warning. If a product relies on an LLM, I wouldn't feel great about the observed uptime of OpenAI APIs.


FWIW, OpenAI's availability seems to have gotten significantly better since May when we launched with them. I monitor our availability Service Level Objective and we keep needing to increase the success rate because they keep improving things.

This doesn't take away from high availability being a legitimate need to host your own LLM, though.


Another side of availability is that they'll make changes to the model without warning, which alters the results of the prompts you already have written. Developing against their API is developing against a moving target.
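
One partial mitigation (my addition, not from the comment) is to pin a dated model snapshot instead of the rolling alias, so the target at least moves on a schedule you can see coming. A sketch with the 2023-era openai Python client; the snapshot name is just an example:

    # Pin a dated snapshot rather than the moving "gpt-3.5-turbo" alias.
    # Assumes OPENAI_API_KEY is set in the environment.
    import openai

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",   # pinned snapshot (example), not the alias
        messages=[{"role": "user", "content": "Classify the sentiment: 'great product!'"}],
    )
    print(resp.choices[0].message.content)

Snapshots still get deprecated eventually, so this delays the problem rather than removing it.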


Can you elaborate on the pricing difference?


Free vs monthly cost. What is there to elaborate on?


Hosting your own LLM is anything but free. Aside from the constant operational expense of people monitoring and fixing issues, you need to provision enough resources and run your own inference server, which is both nontrivial and likely to perform far worse than OpenAI. There are legitimate reasons to host an LLM yourself, but it's not a "make this cheaper" button.


There may be a tipping point where you're burning XXM/year in API costs and the maintenance cost of rolling your own can be justified.

In the short term I agree, and one thing to consider is how rapidly the space is evolving and whether your team can even keep up with the latest advancements.

However, there will come a time after launch when the bill comes due, and it will be very tempting to hire people to reduce the spend on the API.


It’s gonna have to be hosted and run from somewhere…


Running llama-2-7b-chat at 8-bit quantization, completions are essentially at GPT-3.5 level (and instant) on a single RTX 4090 using 15 GB of VRAM. I don't think most people realize just how small and efficient these models are going to become.


7B or 70B?


>7B or 70B?

7B 8bit GGML running on a single 4090 with llama.cpp. It's hard to overstate the massive jump in capability between llama 1 and 2.
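
For context, running an 8-bit GGML quantization of the 7B chat model with GPU offload via llama-cpp-python looks roughly like this; the file name and layer count are placeholders, not the commenter's exact setup:

    # Sketch: Llama 2 7B chat, 8-bit GGML quantization, offloaded to a single GPU.
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-2-7b-chat.ggmlv3.q8_0.bin",  # assumed local quantized file
        n_gpu_layers=35,     # enough to offload the whole 7B model to the 4090
        n_ctx=4096,
    )
    out = llm("[INST] Summarize the benefits of local inference. [/INST]",
              max_tokens=200)
    print(out["choices"][0]["text"])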


Are you hosting that somewhere? If so, how much does that cost and do you have concurrent users?


>Are you hosting that somewhere?

Tensordock. RTX 4090 instances are ~$0.50/hr and can handle 3-4 concurrent users each.


Data security and privacy. Our clients (in aviation, finance, etc.) need this due to legal and regulatory reasons. Also, the new Llama 2 models are very powerful. In my testing, Llama 2 70b is comparable to GPT-3.5 in capability.

(Shameless plug: here's our website: https://www.amw.ai/)


I'm in the same boat. We have customers that need to run models in an environment with no access to the public internet, even if they did trust OpenAI et al.

More importantly for me, I don't want to be beholden to a model provider and have to take what they give me. I'd rather host my own model even if an API were an option, because then I have control over it and can hack on it as I want. I don't want to be just a wrapper around GPT, which is sort of what you're stuck with if you just use their APIs.


Agreed. Worse, you don't want to be in a position where any single company can end your business at the flip of a switch.


I am also interested in your open-source-first approach, and am a bit confused by the landing page. Could you explain what exactly your product does? Is it an LLM-enhanced document parser?


Have you considered Azure's GPT, or is that not private enough?


We have. This is acceptable for some clients, but not for others. Both groups, however, prefer maintaining complete control over their data, given the chance.

Edit: plus, my personal view is that local LLMs are the future. They've already caught up to GPT-3.5 (based on my testing); and they continue to evolve rapidly. Makes sense to focus our limited resources on riding that wave.

OpenAI won't go away, but neither will they remain the first choice (or only choice!) for most use-cases.


This assumes you trust Microsoft.


There's some . . entrepreneurs . . who have been promising NIST/ITAR-compliant LLM frameworks on Azure, but when you ask around, they have not done all the legwork (AG/AGS). They're working off Azure Public, with "waivers" that they won't show anyone. Also, the history of their leadership is . . questionable. It all feels just a little hinky. Until that's cleared up, I advise anyone fooling with LLMs to do it on-prem, at least for the moment. One thing I'm worried about: doing LLMs with something like GovCloud is going to be absolutely bananas in terms of price-per-compute.


Looks great my friend


My friend trained his own GPT-2 because it is faster and cheaper to tune.


Although we haven't gone down the path of deploying a fine-tuned model on our own infrastructure, we do see that as an eventual reality. Our current feature is disabled for any customer who signs a BAA with us because we can't get a DPA signed with OpenAI, and not for lack of trying. Maybe that resolves itself over time, but the most reliable option available is to fine-tune a model and run it ourselves. It's also likely a more expensive and challenging one, though, hence we're not doing it yet.


I am using one of the uncensored versions of LLaMA 2 to allow chatbot roleplays without constant moralizing and replying to every other request with "I am just an AI, I don't have any opinion, emotion, feelings, don't like anything" etc.


We have several customers who aren't using the OpenAI / Anthropic APIs for privacy reasons. We are spinning up infrastructure and making the features that rely on those APIs also work with OS LLMs.


Exploring a few options in my off-time. Main motivation for an OS LLM is to get it to do things which GPT-3.5/4 are somewhat promising at - but not good enough for applications.


To me, the simple models might not cross the boundary where LLMs start to be useful versus, say, a fixed menu with choices in a helpdesk app.

It's a paradox because, to really feel human-like and not make huge mistakes, we need these huge LLMs and they are expensive... and the alternative is not-so-smart traditional code.

So what I'm trying to say is that I think the small LLMs might not be that useful before they cross some arbitrary quality threshold (which they may never do, considering that more parameters generally means a better model).


Tangential question - how well does Llama 2 do on coding tasks on less-mainstream languages like Rust?


Well, I'm not too familiar with Rust so I can't gauge correctness, but I do have the Llama 2 13B NewHope fine-tune loaded (which is AFAIK tuned for Python coding), so I gave it and 3.5-turbo the same random POST-request question.

3.5's result: https://chat.openai.com/share/9e1aafd3-631c-4c13-80f6-f99c88...

NewHope's result: https://i.imgur.com/dfACQC3.png

If you have any ideas for a more comprehensive test let me know and I'll try to run it. Giving it some existing code to fix up or change is usually more of a typical use case for me anyway.


Interestingly, they have withdrawn their model because they discovered that test data leaked into the training data. Quantised versions are still available on huggingface from others though.

https://github.com/SLAM-group/newhope


Yeah it seemed suspiciously high for HumanEval and it only ranks 14th for JS and 7th for Python on other benchmarks now: https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul...

WizardCoder is a bit of a problem since it's not llama 1/2 based but is its own 15B model and as such the support for it in anything practical is near nonexistent. WizardLM v1.2 looks like it may be worth testing out.

All of the Llama 2 fine-tunes I've tried out so far have weird issues, though: saying unrelated things at times, ignoring parts of the conversation, and such. Could be fine-tuning or prompt-template goofs, or Llama 1 may actually be a more self-consistent base model overall.


Customizing logits.



