Ask HN: Who is using small OS LLMs in production?
50 points by alaeddine-13 on Aug 2, 2023 | 55 comments
It is not clear whether relatively small open source LLMs are being used in production or not, for instance Replit Code 3B, Llama 2 7B, CodeGen, ... What are the motivations for using those models over a prompted GPT API?


I'm integrating Llama 2 7B with an application I'm currently building. One of the biggest reasons was privacy, followed closely by price, and lastly the fact that I got it working locally in a few minutes.

I built a now-abandoned project using the GPT API. It was fine and not terribly expensive for my use case, but customers didn't like the pay-for-usage model, and the alternative was weird UX to limit prompt abuse to something I could afford while bootstrapping a side project.


Can you elaborate on the pricing difference?


The long-term price difference can be hard to estimate. Suppose that your application is heavily dependent on GPT-4. OpenAI can double your API prices the next term. Or they can decide to stop supporting the model that you wrote all your custom prompts for. Or they could decide to disable your account because they feel like it.

Unless it's for some fringe feature, building your business on OpenAI is probably a considerable (financial) risk in the future.


These are definitely very significant risks. But some of those risks are hard to avoid unless you train your own model, which can be prohibitively expensive.

Say you're building on top of Llama and Facebook decides not to update it any more, or changes the licensing terms (again). Say you're building on some other "open source" model and that project dies.

At least you can keep running the existing model rather than getting locked out overnight. That's definitely much better; at least it's survivable. But you would still have to find alternatives and review/scrap all your custom prompts.


Llama 2 appears to cost nothing because it can be run locally. The license does mention that if you have more than 700 million monthly active users you have to negotiate a different license, or something like that, but for most people's uses it would seem that Llama 2 is basically "free".


Well it's free as in free hops. You still gotta buy/rent the brewery to make the free beer.


Exactly. It isn't free as in the FSF's definition of freedom; you still can't do certain things with it, etc. But it is unencumbered by external costs unless you exceed a certain usage threshold.


Right, but can you scale GPUs for cheaper than OpenAI charges to use their APIs? 3.5 is _cheap_, and perfectly good for many use cases.


Thanks for elaborating. Yeah, I was curious how much you'd pay running it on a cloud server in a production-type scenario.


OpenAI costs money, and I'm able to run Llama 2 on my own GPU, so at least for development purposes it's ""free"" for me to experiment with at the moment. Mind you, this is a side project with zero funding outside of myself; YMMV if you have access to funding to the point of making OpenAI tokens totally disposable.


Ah okay, yeah I was curious how much you'd pay running it on a cloud server in a production-type scenario. Thanks


I have a Vultr account I just logged into and checked: from a very quick look, I could rent enough GPU memory to run Llama 2 for ~$180 a month. So if I make more than 27,000 requests with a 4k-token payload I will break even; otherwise I'd be better off using OpenAI's API.

EDIT: sorry, that was GPT-3.5. With GPT-4 it would be ~1,000 requests before I broke even at $180.
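
For reference, here's a rough sketch of that break-even math in Python. The $180/month GPU figure is from the comment above; the per-1K-token API prices are assumptions based on mid-2023 list prices, not figures from this thread:

    # Rough break-even: rented GPU for self-hosted Llama 2 vs. pay-per-token API.
    gpu_monthly_cost = 180.00        # USD/month for a rented GPU (from the comment)
    tokens_per_request = 4_000       # assumed payload size per request

    api_price_per_1k = {             # assumed blended USD per 1K tokens (mid-2023)
        "gpt-3.5-turbo": 0.00175,
        "gpt-4": 0.045,
    }

    for model, price in api_price_per_1k.items():
        cost_per_request = price * tokens_per_request / 1_000
        break_even = gpu_monthly_cost / cost_per_request
        print(f"{model}: ~{break_even:,.0f} requests/month to break even")

With those assumed prices this lands in the same ballpark as the numbers above: tens of thousands of requests for GPT-3.5, roughly a thousand for GPT-4.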


I see comments here about running Llama on 4090s, which is fine for local development and testing, but getting into production is a significant leap and a significant cost.

The thing that I keep running into in my SLA plans is concurrency. Yes, you can have a Llama 2 model running on an A100 somewhere - but that will support 1 concurrent prompt. Anything at a higher concurrency needs another GPU, or your end users will be waiting a while. Want to rent an 8 GPU machine in the cloud for inference? Be prepared to pay a lot of money for it.


You need an inference server. I am doing ~400 tokens/sec on 7B with a 4090 with multiple concurrent (streaming!) requests.

It's reasonably straightforward for me to host this and serve public requests, but it would likely just be a base model -- not sure if hosting (e.g.) 13B chat can serve people's use cases.
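
For anyone curious what that looks like, here's a minimal batched-inference sketch with vLLM, one possible inference server (the comment doesn't say which server is actually in use; the model id and sampling settings are placeholders):

    # Minimal batched-inference sketch with vLLM (one possible inference server).
    # Continuous batching lets a single GPU serve several requests at once.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf")      # assumed model id
    params = SamplingParams(temperature=0.7, max_tokens=256)

    prompts = [
        "Summarize: the quick brown fox jumps over the lazy dog.",
        "Extract the city from: 'Shipped from Berlin on Monday.'",
    ]
    for out in llm.generate(prompts, params):        # processed as one batch
        print(out.outputs[0].text)

For streaming to multiple clients you would typically run its OpenAI-compatible HTTP server instead of the offline API shown here.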


But is the 7B model any good and actually production worthy for things like RAG?


I'm writing a blog post with some more reasoning but my view is that it can be useful for certain simpler tasks (eg unstructured -> structured, basic summarization) and not more complex things (eg generation).

The tricky thing is that finetuning makes a big difference, and while it should be possible to hotswap LoRA adapters (at some cost to performance), I haven't figured that out yet.
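
For what it's worth, hot-swapping adapters with the peft library looks roughly like the sketch below. This is just one way to do it; the model id and adapter paths are placeholders, not anything from the thread:

    # Sketch: two LoRA adapters sharing one base model, switched in place with peft.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = PeftModel.from_pretrained(base, "adapters/summarize",
                                      adapter_name="summarize")
    model.load_adapter("adapters/extract", adapter_name="extract")

    model.set_adapter("summarize")   # route a summarization request
    # ... generate ...
    model.set_adapter("extract")     # swap to the extraction adapter without reloading

The swap is cheap because only the small adapter weights change; the base weights stay loaded.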


Not as good as GPT-4, of course, but not far from 3.5 if you just need to reword whatever is returned by the retrieval. It's like losing 20 IQ points, which might still be better than most support interactions I've had.


LLaMA 2 7B 8-bit can run pretty well on 64-core EPYCs, which are cheaper than GPU instances. Moreover, you can periodically batch multiple users rather than running a single inference for a single user.


Facebook is working very hard to make the main dividing line in generative AI not company vs. company but commercial vs. free. Having started from way behind, they are trying to make the company-vs.-company race irrelevant.


“Way behind” seems harsh when they have one of the best models available.


Not that impressed with llama 2 70b so far tbh. It’s a GPT3-level bullshit machine imo. But huge advantages in running privately and at the edge, so that’s going to be the dividing line imo. Commercial v free. Small v big. H100s in the cloud v edge


Price.

Data privacy.

Controlled latency.

Plenty of reasons to not send arbitrary data to a third party service.


There's also the availability factor. OpenAI has been known to go down on occasion and without warning. If a product relies on an LLM, I wouldn't feel great about the observed uptime of OpenAI APIs.


FWIW, OpenAI's availability seems to have gotten significantly better since May when we launched with them. I monitor our availability Service Level Objective and we keep needing to increase the success rate because they keep improving things.

This doesn't take away from high availability being a legitimate need to host your own LLM, though.


Another side of availability is that they'll make changes to the model without warning, which alters the results of the prompts you already have written. Developing against their API is developing against a moving target.
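
One partial mitigation (my addition, not from the comment) is to pin a dated model snapshot instead of the rolling alias, so the target at least moves on a schedule you can see coming. A sketch with the 2023-era openai Python client; the snapshot name is just an example:

    # Pin a dated snapshot rather than the moving "gpt-3.5-turbo" alias.
    # Assumes OPENAI_API_KEY is set in the environment.
    import openai

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",   # pinned snapshot (example), not the alias
        messages=[{"role": "user", "content": "Classify the sentiment: 'great product!'"}],
    )
    print(resp.choices[0].message.content)

Snapshots still get deprecated eventually, so this delays the problem rather than removing it.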


Can you elaborate on the pricing difference?


Free vs monthly cost. What is there to elaborate on?


Hosting your own LLM is anything but free. Aside from the constant operational expense of people monitoring and fixing issues, you need to provision enough resources and run your own inference server, which is both nontrivial and likely to perform far worse than OpenAI. There are legitimate reasons to host an LLM yourself, but it's not a "make this cheaper" button.


There may be a tipping point where you're burning XXM/year in API costs and the maintenance cost of rolling your own can be justified.

In the short term I agree, and one thing to consider is how rapidly the space is evolving and whether your team can even keep up with the latest advancements.

However, there will come a time after launch when the bill comes due, and it will be very tempting to hire people to reduce the spend on the API.


It’s gonna have to be hosted and run from somewhere…


Running llama-2-7b-chat at 8-bit quantization, completions are essentially at GPT-3.5 level (and instant) on a single RTX 4090 using 15 GB of VRAM. I don't think most people realize just how small and efficient these models are going to become.


7B or 70B?


>7B or 70B?

7B 8bit GGML running on a single 4090 with llama.cpp. It's hard to overstate the massive jump in capability between llama 1 and 2.
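
For context, running an 8-bit GGML quantization of the 7B chat model with GPU offload via llama-cpp-python looks roughly like this; the file name and layer count are placeholders, not the commenter's exact setup:

    # Sketch: Llama 2 7B chat, 8-bit GGML quantization, offloaded to a single GPU.
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-2-7b-chat.ggmlv3.q8_0.bin",  # assumed local quantized file
        n_gpu_layers=35,     # enough to offload the whole 7B model to the 4090
        n_ctx=4096,
    )
    out = llm("[INST] Summarize the benefits of local inference. [/INST]",
              max_tokens=200)
    print(out["choices"][0]["text"])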


Are you hosting that somewhere? If so, how much does that cost and do you have concurrent users?


>Are you hosting that somewhere?

Tensordock. RTX 4090 instances are ~$0.50/hr and can handle 3-4 concurrent users each.


Data security and privacy. Our clients (in aviation, finance, etc.) need this due to legal and regulatory reasons. Also, the new Llama 2 models are very powerful. In my testing, Llama 2 70b is comparable to GPT-3.5 in capability.

(Shameless plug: here's our website: https://www.amw.ai/)


I'm in the same boat. We have customers that need to run models in an environment with no access to the public internet, even if they did trust OpenAI et al.

More importantly for me, I don't want to be beholden to a model provider and have to take what they give me. I'd rather host my own model even if an API were an option, because then I have control over it and can hack on it as I want. I don't want to be just a wrapper around GPT, which is sort of what you're stuck with if you just use their APIs.


Agreed. Worse, you don't want to be in a position where any single company can end your business at the flip of a switch.


I am also interested in your open-source-first approach, and am a bit confused by the landing page. Could you explain what exactly your product does? Is it an LLM-enhanced document parser?


Have you considered Azure's GPT, or is that not private enough?


We have. This is acceptable for some clients, but not for others. Both groups, however, prefer maintaining complete control over their data, given the chance.

Edit: plus, my personal view is that local LLMs are the future. They've already caught up to GPT-3.5 (based on my testing); and they continue to evolve rapidly. Makes sense to focus our limited resources on riding that wave.

OpenAI won't go away, but neither will they remain the first choice (or only choice!) for most use-cases.


This assumes you trust Microsoft.


There's some . . entrepreneurs . . who have been promising NIST/ITAR-compliant LLM frameworks on Azure, but when you ask around, they have not done all the legwork (AG/AGS). They're working off Azure Public, with "waivers" that they won't show anyone. Also, the history of their leadership is . . questionable. It all feels just a little hinky. Until that's cleared up, I advise anyone fooling with LLMs to do it on-prem, at least for the moment. One thing I'm worried about: doing LLMs with something like GovCloud is going to be absolutely bananas in terms of price-per-compute.


Looks great my friend


My friend trained his own GPT-2 because it is faster and cheaper to tune.


Although we haven't gone down the path of deploying a fine-tuned model on our own infrastructure, we do see that as an eventual reality. Our current feature is disabled for any customer who signs a BAA with us because we can't get a DPA signed with OpenAI, and not for lack of trying. Maybe that resolves itself over time, but the most reliable option available is to fine-tune a model and run it ourselves. It's also likely a more expensive and challenging one, though, hence we're not doing it yet.


I am using one of the uncensored versions of LLaMA 2 to allow chatbot roleplays without constant moralizing and replying to every other request with "I am just an AI, I don't have any opinion, emotion, feelings, don't like anything" etc.


We have several customers who aren't using the OpenAI / Anthropic APIs for privacy reasons. We are spinning up infrastructure and making the features that rely on those APIs also work with OS LLMs.


Exploring a few options in my off-time. Main motivation for an OS LLM is to get it to do things which GPT-3.5/4 are somewhat promising at - but not good enough for applications.


To me, the simple models might not cross the boundary where LLMs start to be useful versus, say, a fixed menu with choices in a helpdesk app.

It's a paradox because, to really feel human-like and not make huge mistakes, we need these huge LLMs and they are expensive... and the alternative is not-so-smart traditional code.

So what I'm trying to say is that I think the small LLMs might not be that useful before they cross some arbitrary quality threshold (which they may never do, considering that more parameters generally means a better model).


Tangential question - how well does Llama 2 do on coding tasks on less-mainstream languages like Rust?


Well, I'm not too familiar with Rust so I can't gauge correctness, but I do have the Llama 2 13B NewHope fine-tune loaded (which is AFAIK tuned for Python coding), so I gave it and 3.5-turbo the same random POST-request question.

3.5's result: https://chat.openai.com/share/9e1aafd3-631c-4c13-80f6-f99c88...

NewHope's result: https://i.imgur.com/dfACQC3.png

If you have any ideas for a more comprehensive test let me know and I'll try to run it. Giving it some existing code to fix up or change is usually more of a typical use case for me anyway.


Interestingly, they have withdrawn their model because they discovered that test data leaked into the training data. Quantised versions are still available on huggingface from others though.

https://github.com/SLAM-group/newhope


Yeah it seemed suspiciously high for HumanEval and it only ranks 14th for JS and 7th for Python on other benchmarks now: https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul...

WizardCoder is a bit of a problem since it's not llama 1/2 based but is its own 15B model and as such the support for it in anything practical is near nonexistent. WizardLM v1.2 looks like it may be worth testing out.

All of the Llama 2 fine-tunes I've tried out so far have weird issues, though: saying unrelated things at times, ignoring parts of the conversation, and such. Could be fine-tuning or prompt-template goofs, or Llama 1 may actually be a more self-consistent base model overall.


Customizing logits.



