Ask HN: What's the best self hosted/local alternative to GPT-4?
328 points by surrTurr 3 months ago | 194 comments
Constant outages and the model seemingly getting nerfed[^1] are driving me insane. Which viable alternatives to GPT-4 exist? Preferably self-hosted (I'm okay with paying for it) and with an API that's compatible with the OpenAI API.

[^1]: https://news.ycombinator.com/item?id=36134249

There is literally no alternative.

You’re stuck with openai, and you’re stuck with whatever rules, limitations or changes they give you.

There are other models, but specifically if you’re actively using gpt-4 and find gpt-3.5 to be below the quality you require…

Too bad. You’re out of luck.

Wait for better open source models or wait patiently for someone to release a meaningful competitor, or wait for openai to release a better version.

That’s it. Right now, there’s no one else letting people have access to their models which are equivalent to gpt-4.

This point is understated. So many people are going around like "I'm building an AI app!" when the reality is, OpenAI built an AI app, you're just designing a front end for it.

OpenAI builds an LLM and an api-interface to that model.

The design of abstractions, prompt engineering, custom fine-tunes and software engineering required to ship a valuable application on top of that interface counts as "building an app" in my book.

The LLM _is_ the product. Everything else in the stack is window dressing.

I don’t think OpenAI’s contribution should be understated: they built a technology that was considered science fiction just a few years ago. They deserve all the credit for the “AI”.

Good luck building a moat when 95% of your app is just calling the API everyone else has access to.

Paul English (founder of Kayak, exited at $2 billion) said this in response to the exact comment you are making: "Kayak was just a thin layer above ITA, and ITA sold for $700m and Kayak sold for $2b. So don’t dismiss thin layers. The ultimate power is the app and UX."[0]

I would also add "distribution" to the "app and UX" part, but you can certainly build a valuable business upon an API "everyone else has access to" - plenty of companies out there that do that.


I can appreciate this, but there's a reason I said 95%. The stuff I see being built so far is really an extremely thin veneer on top of the API, but that could change with time.

Kayak would never have succeeded today. The competition is far too hungry.

100% of all existing code is just calling an OS-API that everyone else has access to. That's just nonsense

Anyone can build an OS without needing billions of dollars, and in fact there are state of the art open source OSes. Not quite the same.

Yes, but also, come on now.

I mean, it refutes the argument. Everyone has access to the internet, to conv nets, to C/python/pandas/TCP, you name it. Yet nobody would seriously argue that one cannot build a moat with products based on those stacks just because everyone else has access to the same stack. It's just not an argument, at all.

We can now write software that interprets language under the hood (to some degree). The value propositions enabled by this change in the world are so vast, and partly so complex - to make absolute statements like "yeah but you don't control the model, so anyone can copy your solution" seems out of touch to me. What subset of technology doesn't get reverse engineered? Either this applies almost nowhere (because every piece of tech that an engineer can get their hands on is effectively open), or everywhere.

Moat does not come from compile time but runtime. The company, the operation, and the accumulated data, the brand, the trust and the reputation.

There are technology moats as well, but for the most part, you're correct. Everything else usually matters much more.

By this logic, you could just copy Facebook, Twitter, and pretty much all software, since they're just calling some high-level API (HTTP, iOS/Android APIs, etc.).

There are plenty of FB, Twitter, and Reddit alternatives; they just don't have the data of the original. But OpenAI provides both the API and the data.

more like utilizing prompts that anyone can figure out

What does "moat" even mean in this context? I see it thrown around but cannot infer the meaning.

I think that people who believe this don't realise how complex and nuanced the code surrounding LLMs has gotten in the last half year.

I wasn't trying to understate OpenAI's contribution, far from it. I can hardly express my appreciation for their work.

That doesn't mean that everything else in the stack is window dressing though - custom, domain specific wrangling with the different api endpoints, finding a satisfying prompt, temperature param etc. for specific tasks - the entire process of designing systems around an LLM-api has many intricacies to it, lots of which are completely uncharted territory.

I can assure you: very smart people are knees deep in this process, and they deserve the credit for their share of the value that is being created.

I am not convinced it won’t all be obviated with improvements in the foundation model. Seems to be carving out a very very temporary space for yourself.

Re: smart people avoiding wrapper work - sure! Not just because the result might be short lived, but also because there is a lot more prestige in building something from scratch rather than wrapping.

However, the "very very temporary space" might as well lead to momentum and a moat in a subdomain, and anyone who met a sufficiently large number of smart people knows that lots of them are very pragmatic, don't chase prestige, and enjoy laying ground work for future iterations.

> The engine _is_ the product. Everything else in the car is window dressing.

Interesting analogy, since the engine manufacturer also manufactures the car.

> required to ship a valuable application on top of that interface counts as "building an app" in my book.

It's certainly building an app. It's not building an AI app, though. It's building a front-end to an existing AI application.

How can you fine-tune GPT-4 or 3.5? The shell/wrapper around a request to OpenAI isn't a worthy piece of software on its own; it'd be more impressive to vend those services. Otherwise it's basically just a few scripts.

Google has recently announced infrastructure for fine-tuning and RLHF on their pre-trained foundation models. I imagine all cloud providers will follow suit.


As your wording implies, finetuning is restricted to the smaller models, i.e. babbage, curie etc.

You can generate the training data for this with 3.5 and 4 and tune smaller models on the resulting data. For lots of tasks this yields robust results, which btw are also faster than 3.5-turbo.
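The distillation loop described above mostly reduces to collecting prompt/completion pairs and writing them out. A sketch of the JSONL format the legacy OpenAI fine-tuning endpoint expected for the smaller (babbage/curie-class) models — double-check field names against the current docs before relying on them:

```python
import json

def to_finetune_jsonl(pairs, path):
    """Write (prompt, completion) pairs as JSONL for fine-tuning a smaller
    model on outputs collected from a stronger one. The leading space on the
    completion follows the old OpenAI data-preparation guidance."""
    with open(path, "w") as f:
        for prompt, completion in pairs:
            record = {"prompt": prompt, "completion": " " + completion}
            f.write(json.dumps(record) + "\n")
```

Each line is one training example; the upload/tuning steps themselves are vendor-specific.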

It's just like building a Facebook app back in the day

Replace social media / graph APIs with the ones from OpenAI

While this is technically correct, the difference is that your product won't survive without OpenAI at this point. If you need the model quality OpenAI provides, you're stuck, and your product can just disappear, because the LLM is the core building block — an irreplaceable one.

Building a product that relies 100% on a single external vendor is taking a huge risk. So many companies have been burned by this in the past that it's amazing anyone doesn't see it as a risky thing.

So I have to believe that the people making these products are intending to make as much cash as possible up front and aren't aiming for a long-term thing.

Use the facade pattern so it’s easier to swap out if needed.
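That swap point can be tiny. A minimal sketch of the idea in Python (the backend class names here are hypothetical, not any real library's API):

```python
from typing import Protocol

class Completion(Protocol):
    def complete(self, prompt: str) -> str: ...

class OpenAIBackend:
    """Calls the OpenAI API (stub; wire up the real client here)."""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class LocalBackend:
    """Calls a self-hosted model, e.g. via llama.cpp or LocalAI (stub)."""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class LLMFacade:
    """Single entry point for the app; swapping vendors is one constructor
    argument instead of a change at every call site."""
    def __init__(self, backend: Completion):
        self._backend = backend

    def ask(self, prompt: str) -> str:
        return self._backend.complete(prompt)
```

Call sites depend only on `LLMFacade.ask`, so replacing the vendor later doesn't ripple through the codebase.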

Of course it is an app, but ultimately it is just a frontend.

There is a lot of engineering work and reasoning to make sufficiently complex prompts.

Depending on your business case, the 4096 tokens given to you have to go quite far. Vector embeddings are not "easy" to work with. Trying to splat together a range of techniques to craft a good prompt is hard™.
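One of those techniques is simply greedy packing of retrieved chunks into the token budget. A sketch using the rough ~4-characters-per-token heuristic (a real tokenizer is more accurate; the numbers here are illustrative):

```python
def pack_prompt(question, chunks, budget_tokens=4096, reserve=512):
    """Greedily add retrieved context chunks until the (approximate) token
    budget is hit, reserving room for the model's answer.
    Uses the ~4 chars/token rule of thumb rather than a real tokenizer."""
    budget_chars = (budget_tokens - reserve) * 4
    used = len(question)
    kept = []
    for chunk in chunks:
        if used + len(chunk) > budget_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return "\n\n".join(kept + [question])
```

In practice the chunks would come ranked by embedding similarity; the packing step is the same either way.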

Adding in Actions (e.g. using headless browsers to open pages etc) is also pioneering territory.

Sucks that OpenAI currently has the market, but there are still plenty of reasons to develop on top of it.

Wait so when I'm using postgres I'm just designing a front end for it? Maybe I'm just designing a front end for my CPU?

So, ChatGPT isn't a technology, it is a service.

That was never in question. ChatGPT is OpenAI's web frontend for several of their GPT models.

The technology is just GPT and transformers, the open source alternatives are just not as advanced as the GPT4 model yet. It might change with current trajectory, mainly because OpenAI keeps nerfing it.

The technology is in the hardware/software synergy.

Anthropic's Claude is on par with or better than GPT-4 for many tasks. It's not self-hosted, but it is competitive with OpenAI's best model.

What tasks do you use Claude for?

I tend to use Claude primarily for creative writing (at which it's better than GPT4 and even Claude+, in my experience), and for explanations, which are more thorough than GPT4's or ChatGPT's (GPT 3.5) without any special prompting.

Claude should be more well known, imo. ChatGPT/GPT4 gets all the hype, but Claude is really good too... sometimes even better.

I thought Anthropic has Claude and Claude Instant. When you say Claude sometimes feels better than Claude+, do you mean that you feel Claude Instant sometimes gives better result than Claude?

This is what I referred to as Claude: https://poe.com/Claude-instant

And this is Claude+: https://poe.com/Claude%2B

It would be better known if they provided an interface as easy to use as ChatGPT's.

Guess there really is a moat.

Just to nitpick that a bit, they have a lead. To me, a moat is some other axis of advantage that makes it hard for a rival to compete directly.

So for example Tesla has/had a lead in EV tech, but their supercharger network is a moat because it’s very difficult for a rival to compete with, even if they have equivalent EV tech.

I know there isn’t established terminology for this. It is a nitpick, but I think "a lead" is already a term we have, and "moat" has connotations of being in some way ‘unfair’. To me, just creating a better product isn’t unfair in any meaningful way, while moats in some circumstances can represent antitrust issues.

So like, would that not suggest that Google has a moat with AI, as it is already intertwined with our email, calendars, etc... (if they ever make their AI public)

It makes it all the more interesting that Google is currently losing to an upstart. They have a MASSIVE advantage in data and moats, but cannot execute on novel products to take advantage of it, despite also hiring a significant chunk of the talent in the market as well.

But perhaps at this stage Google's not meant to / need to build cutting edge products, but rather focus on commoditizing the ones that prove profitable.

> It makes it all the more interesting that Google is currently losing to an upstart

How is Google "losing" when they are not even in the same market? Are you expecting Google to develop and sell access to an LLM API, rather than use AI internally to enhance its other products (which it has been doing to great effect)?

Or maybe they like their Chinese wall between their internal AI and the public at large!

Yes, no one can build a car from scratch in their garage, but there are plenty of companies with enough capital to build their own cars and sell them to people.

OpenAI has a lead on Bard, LLaMA and friends, but I would expect that to close in the next few months or years.

It's not that there isn't a moat, but rather that it isn't looking particularly durable.

Compute moat. aka cash moat. It's a deep one, too.

Company moats can grow and later shrink.

There is no "AI"; there are LLMs that people hardly understand. Modern solutions are to AI what alchemy was to chemistry: sometimes they get results, but nobody is really sure why. The little OpenAI has said about how they achieved GPT-4's performance suggests it was by trying a bunch of things and dismissing others. They themselves are probably not sure how they managed to pull it off — meaning that if, for example, an increase in parameter size or other settings kicks in negative effects that actually reduce model performance, they will be at a loss for where to go next.

Is this opinion based on some benchmarking you (or someone else) did?

Nothing that you can self-host seems to come close to GPT-3.5, let alone GPT-4. r/LocalLlama is a good subreddit to lurk in to get a pulse on local LLMs. The current leader seems to be Guanaco-65B.

I believe there are benchmarks, but I can informally second that opinion. I'm building a writing app (chiseleditor.com) and there is nothing as good as the ChatGPT models right now.

Since you have your hands in the mess, let me ask you this — because I think this is what people mean when they ask "what's an alternative to X": how can an industry-specific or company-specific AI be created? Meaning, you take the LLM engine and ingest company data, or, if you want to be bold, industry datasets. ChatGPT is marketed as being a doctor/architect/lawyer/professor/etc. But what if all you want to build is an Ask Jeeves-type AI lawyer?

I would distrust the currently available benchmarks, as recent research (gah, can't remember the paper title) indicates that for many benchmarks at least some of the data splits have leaked into model training data; and there's some experience with the open source models which match an OpenAI model on the benchmark scores but subjectively feel much worse than that model on random questions.

I’m telling you from looking at this closely that there is substantial evidence, solely from new, never-seen-before prompts, that GPT4 is by far the best, ChatGPT/Claude second, with the other Anthropic models, Vicuña, etc. bringing up the rear.

Have you tried Anthropic, specifically Claude? I have no doubt GPT-4 is still king, I'm just curious how much of a lead it has.

I've played around a lot with Claude, and find it much better than GPT4 and even Claude+ at creative writing.

I also generally prefer Claude or Claude+ over GPT4 or ChatGPT (GPT 3.5) for explanations too, which tend to be more thorough without any special prompting.

No, I would love to try it out but unfortunately I don't have early access yet.

That seems consistent with what many who are using OpenAI and self-hosting are finding.

There’s a gap, it’s closing, likely faster than anticipated.

Huggingface awaits :)

The model is the spice then basically?

Fitting analogy. The Kenyan data transcribers that OpenAI used are the Fremen, and transcripts of erotic Harry Potter fanfiction are the Shai-Hulud.

Are the Kenyans really going to rise up like the Fremen? Most likely not. If anything, they'll be used for cheap labor when the next new thing comes along.

Haha, was thinking the same. The other day OpenAI's API hiccup gave me a small panic attack.

Techies repeat what they chastise artists for about Adobe

I don't know the licensing and all that jazz (even if you self-host for your personal use, it shouldn't matter). But this paper[0], released a week ago, claims "99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU" (QLoRA).

A quick test of the huggingface demo gives reasonable results[1]. The actual model behind the space is here[2], and should be self-hostable with reasonable effort.

0. https://arxiv.org/abs/2305.14314
1. https://huggingface.co/spaces/uwnlp/guanaco-playground-tgi
2. https://huggingface.co/timdettmers/guanaco-33b-merged

You link to Guanaco-33B, but Guanaco-65B is much more capable.

CPU Version: https://huggingface.co/TheBloke/guanaco-65B-GGML

GPU Version: https://huggingface.co/TheBloke/guanaco-65B-HF

4bit GPU Version: https://huggingface.co/TheBloke/guanaco-65B-GPTQ

It irritates me to no end that people don't list the system requirements of various models.

How much ram and vram does one need to run 4,13,33,65B models at a reasonable speed?

edit: instead I'll ask this, what's the best model to run on a system with a 24gb 4090 and 64gb of ram?

A 4-bit quantized 33B parameter model will fit on your GPU and you'll be able to use a 2048 token context too. (4-bit quantized larger models are better than smaller 8bit/16bit models)

You can run 4-bit quantized 65B models on your CPU, but it is slow: 1-2 tokens a second instead of the 8-15 people typically get with a GPU. To load them on GPU you need two 24GB cards, or an enterprise card with 48GB of VRAM.

https://old.reddit.com/r/LocalLLaMA/wiki/models has the information you are irritated about not being listed.

You can just barely fit a 33B GPTQ model in 24GB VRAM. It will be in 4-bit mode, and without maximum context size, but it will be quite fast. Or you can run from RAM+VRAM in GGML format with llama.cpp (or a derivative), which will easily fit 65B models even at 5 or 8 bits, but at much lower speed.

It's fairly simple to estimate ram requirements based on parameter count: https://blog.eleuther.ai/transformer-math/
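The zeroth-order version of that math is just parameters × bytes per parameter. A rough sketch covering weights only — KV cache and activations add overhead on top, which is why a 33B 4-bit model only "barely" fits in 24GB:

```python
def est_weights_gb(n_params: float, bits: int) -> float:
    """Rough memory needed just to hold the weights at a given quantization.
    Excludes KV cache, activations, and framework overhead."""
    return n_params * bits / 8 / 1e9

# 33B at 4-bit: ~16.5 GB of weights -> fits a 24 GB card with room for context
# 65B at 4-bit: ~32.5 GB -> needs two 24 GB cards or one 48 GB card
for n, label in [(33e9, "33B"), (65e9, "65B")]:
    print(f"{label} 4-bit: {est_weights_gb(n, 4):.1f} GB")
```

This matches the sibling comments' numbers for the 33B and 65B Guanaco variants.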

The first link there gives very specific RAM requirements for each of the models.

"Reasonable speed" is subjective, though.

Incredible how people are underestimating ChatGPT or overestimating open-source models.

A basic question: How can i join with SQL a column to a string separated with comma

GPT-4: PostgreSQL: SELECT STRING_AGG(columnName, ', ') FROM tableName;

Guanaco: Here is an example of how you could use the CONCAT function in MySQL to concatenate a string to a column value in a single-line query: SELECT CONCAT('The total price for ', product_name, ’ is ', SUM(price)) AS total FROM products; This will result in output like “The total price for Chocolate Bar is 10” or similar depending on your data.

Yeah, no...

> How can i join with SQL a column to a string separated with comma

I think I might give this one to Guanaco. I’m moderately familiar with SQL, and I can’t really understand your question. You’re faking the reader out by using “join” to mean something which is not JOIN.

So do you mean to concatenate a string in a column (coming from a single row) to a “string separated with a comma”? If so (and assuming “separated” means starting or ending with), then Guanaco nailed it. If you meant to join (in the Python sense) the values from a given column in multiple rows, then GPT-4 is doing better.

But GPT-4 gave no explanation, and my general experience with it is that it’s happy to write code that does something vaguely related to the prompt. As the prompt gets more complex or unusual, the degree to which the code doesn’t actually do something useful increases. And I have had quite poor results getting ChatGPT to generate code along with some explanation such that the code actually matches the explanation.

> You’re faking the reader out by using “join” to mean something which is not JOIN.

Yeah and not having to do precise prompt engineering is a major advantage of GPT-4 vs other models.

I don’t think that “prompt engineering” is quite the same thing as clearly specifying the problem. If I’m using a magic AI code-writing companion or asking an actual human to write code for me, I need to define the problem well enough if I want any expectation of getting _correct_ code.

A skilled human will distinguish themselves from a less skilled human by also thinking about the overall problem and helping make sure the specification is correct. This far, I have not seen an AI meet that bar unless the problem is something well-covered already on the web.

I have seen ChatGPT (3.5 and 4) output code that is syntactically correct, executes without errors, and does the wrong things in ways that are not entirely obvious to me until I read the docs myself.

> helping make sure the specification is correct. This far, I have not seen an AI meet that bar

I've only used Bard so far, but indeed, it has never replied to my question with a clarifying question. Only with a statement. This seems like an obvious evolution of the chatbot concept...

To be fair, I had to read the PostgreSQL docs and interpret the answer to figure out what _laiq was actually asking.

edit: I guess GPT-4 has better reading comprehension than me!

I wish there was a model out there that asked me "what do you mean by 'join'" but I guess we're somewhat far from there :)

I think you’ve just quantified the difference between a text generator (LLM) and AGI.

I have a hard time understanding the question as you phrased it. Tested with guanaco-33B and a cleaner question, looks like a much better response:

> How can I join a column with SQL in Postgres to a string separated with a comma?

You can use the `string_agg` function in PostgreSQL to concatenate columns into a string. For example, to concatenate all values from a column named "column1" into a single string separated by commas, you can run the following query:


SELECT string_agg(column1, ',') FROM table;


The `string_agg` function takes two arguments: the first argument is the expression to concatenate (in this case column1), and the second argument is the separator between each value (in this case a comma). The query above will return a single string containing all values from column1 separated by commas.

I generally find WizardLM 1.0, Guanaco and Manticore to be good enough for most GPT cases.

> join with SQL a column to a string

As a human, I wouldn't know what to give you for this either. Not surprised the models give unexpected results. "Guanaco" gives something more in line with what I understand you to be asking for, but clearly you're asking for something else that I don't understand.

GPT4All-J-v1.3 Groovy [1] gave me the following answer (no idea if this is good or not, but keep in mind that the model comes in a 3.8Gb file and is released under an Apache 2 license, freely available for use and distribution):

    To join a column with SQL in Postgres to a string separated by a comma, you can use the STRING_AGG function. Here's an example query that demonstrates this:
    SELECT STRING_AGG(column_name, ', ') WITHIN GROUP (ORDER BY id) AS joined_string FROM table;
    In this query, we're selecting the joined_string column from a table called table. The STRING_AGG function is used to concatenate all values in the column_name column into a single string separated by commas. We use the WITHIN GROUP (ORDER BY id) clause to specify that we want to group the results by an integer value called id. This ensures that each row of data is treated as a separate entity, even if it has the same column_name.
    The resulting output will be a single column with all values from the joined_string column concatenated into a string separated by commas. You can then use this query in your application to join multiple columns together and store them as a single string value.
[1] https://gpt4all.io/reports/GPT4All_Technical_Report_3.pdf

Your question is a bit confusing and ambiguous. I think there are multiple valid answers. Here's a local test of a less ambiguous programming question with "Wizard-Vicuna-30B-Uncensored.ggmlv3.q8_0.bin" on llama.cpp on an M1 Max laptop with 64GiB of RAM.

This model seems roughly on par with GPT-3, maybe GPT-3.5 in some cases.

> Write a function to compute the maximum, minimum, and average of an array of doubles in C.

Here is an example function in C that computes the maximum, minimum, and average of an array of doubles:

```c
void compute_statistics(double arr[], int n) {
    double max = arr[0];
    double min = arr[0];
    double sum = 0.0;
    for (int i = 1; i < n; i++) {
        if (arr[i] > max) {
            max = arr[i];
        } else if (arr[i] < min) {
            min = arr[i];
        }
        sum += arr[i];
    }
    double avg = sum / n;
    printf("Maximum: %.2f", max);
    printf("Minimum: %.2f", min);
    printf("Average: %.2f", avg);
}
```

To use this function, simply call it with the array of doubles and its length as arguments. For example:

```c
double arr[] = {1.0, 2.5, 3.0, 4.0, 5.0};
int n = sizeof(arr) / sizeof(arr[0]);
compute_statistics(arr, n);
```

This will output:

```
Maximum: 5.00
Minimum: 1.00
Average: 3.00
```

Here are some simple logic questions:

> Compute the sum of an array of 512 elements with each element having the value 2.

The sum of an array of 512 elements with each element having the value 2 is 1024.

> Compute the sum of an array of 512 elements with each element having the value 3.

The sum of an array of 512 elements with each element having the value 3 is 1536.

Not bad.

My test for open models is surprisingly simple.

I just ask "What is the capital of France?" and I haven't had a correct answer yet in any model I tried. They often get Paris right at least, but most other details are wrong.

Guanaco says:

> The current capital of France is Paris. It has been so since 1982 when it replaced the previous one which was Vichy.

What model size? 7B? 13B? 33B? When people say "Guanaco" or "Alpaca" or "Manticore" they are referring to a whole family of models. The really small ones like 7B often are very weird.

I just got "The capital of France is Paris." from vicuna-v1-7B running entirely on my iPhone (using the MLC Chat app).

>They often get Paris right at least, but most other details are wrong.

What other detail in that statement is wrong?

Guanaco 7B gives the same exact answer.

I also prefer the first but your point would be stronger if the example from Guanaco was actually wrong or internally inconsistent

One of the issues I've observed with OpenAI is that things the model got wrong a week ago it suddenly starts getting right, while very similar examples — not exactly the same question, but another instance of the same thing — it still gets wrong; then a few weeks pass and it gets those right too.

My conclusion is that they are using data from users to improve the model, or some other prompt-injection tricks.

So the comparison is not even fair.

What's the result for: How can I aggregate with SQL a column to a string separated with comma ?

> GPT-4: PostgreSQL: SELECT STRING_AGG(columnName, ', ') FROM tableName;

Well, you asked for SQL, and the ANSI SQL function is LISTAGG(), so technically this is not the right answer — and definitely a completely useless one for MySQL or Oracle users.

"Thanks ChatGPT. Could you please write that for MySQL and additionally for Oracle?"


Oracle: SELECT LISTAGG(columnName, ', ') WITHIN GROUP (ORDER BY columnName) FROM tableName;

I mean, it's not a big deal... you just have to give it a little nudge.
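If you want to sanity-check these dialect differences locally without a database server, SQLite (bundled with Python) has its own spelling of the same aggregate, group_concat:

```python
import sqlite3

# In-memory database: SQLite's equivalent of STRING_AGG/LISTAGG is group_concat
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("b",), ("c",)])
row = conn.execute("SELECT group_concat(name, ', ') FROM t").fetchone()
print(row[0])
```

Note that without an ORDER BY the concatenation order is not guaranteed, in SQLite or Postgres.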

I have ChatGPT with the GPT-4 model. It never gives such concise answers. Is it because you used the API?

The default is rather wordy. You could instruct ChatGPT-4 to be less wordy, or, in the API, put instructions to be less wordy in the system prompt or limit answers to a sentence.

I would assume the poster just edited this though.

Guanaco is indeed very capable and can replace GPT 3.5 in almost all scenarios, based on my tests.

Easy way to self-host it is to use text-generation-webui[1] and 33B 4-bit quantized GGML model from TheBloke[2].

[1] https://github.com/oobabooga/text-generation-webui

[2] https://huggingface.co/TheBloke/guanaco-33B-GGML

Table 1 in your link (https://arxiv.org/pdf/2305.14314.pdf) is a good way to compare models.

I'd really like someone to make a big leaderboard/ranking engine which pits all these engines against each other and publishes the resulting Elo score.

As people note, you cannot substitute locally for the Azure GPU cloud that GPT-4 runs on. But I believe that will change, and maybe quickly. After years of explosive exponential growth in model size, all of a sudden, small is beautiful.

The precipitating factor is that running large models for research is very expensive, but pales in comparison to putting these things into production. Expenses rise exponentially with model size. Everyone is looking for ways to make the models smaller and run at the edge. I will note that PaLM 2 is smaller than PaLM, the first time I can remember something like that happening. The smallest version of PaLM 2 can run at the edge. Small is beautiful.


Works on all platforms, but runs much better on Linux.

Running this in Docker on my 2080Ti, I can barely fit 13B-4bit models into 11G of VRAM, but it works fine and produces around 10-15 tokens/second most of the time. It also has an API that you can use with something like LangChain.

Supports multiple ways to run the models, purely with CUDA (I think AMD support is coming too) or on CPU with llama.cpp (also possible to offload part of the model to GPU VRAM, but the performance is still nowhere near CUDA).

Don't expect open-source models to perform as well as ChatGPT though; they're still pretty limited in comparison. A good place to get the models is TheBloke's page — https://huggingface.co/TheBloke. Tom converts popular LLM builds into multiple formats that you can use with textgen, and he's a pillar of the local LLM community.

I'm still learning how to fine-tune/train LoRAs, it's pretty finicky, but promising, I'd like to be able to feed personal data into the model and have it reliably answer questions.

In my opinion, these developments are way more exciting than whatever OpenAI is doing. No way I'm pushing my chatlogs into some corp datacenter, but running locally and storing checkpoints safely would achieve my end-goal of having it "impersonate" myself on the web.

There are no viable self-hostable alternatives to GPT-4 or even to GPT3.5.

The “best” self-hostable model is a moving target. As of this writing it’s probably one of Vicuña 13B, Wizard 30B, or maybe Guanaco 65B. I’d like to say that Guanaco is wildly better than Vicuña, what with its 5x larger size. But… that seems very task dependent.

As anecdata: my experience is that none of these is as good as even GPT3.5 for summarization, extraction, sentiment analysis, or assistance with writing code. Figuring out how to run them is painful. The speed at which their unquantized variants run on any hardware I have access to is painful. Sorting through licensing is… also painful.

And again: they’re nowhere close to GPT-4.

How much GPU memory do you have access to? If you can run it, Guanaco-65B is probably as close as you can get in terms of something publicly available. https://github.com/artidoro/qlora. But as other comments mention, it's still noticeably worse in my experience.

Another vote from me for guanaco 65b. Here's a link to the model and weights: https://huggingface.co/TheBloke/guanaco-65B-GPTQ

I use it all the time at home. It's decent at things like summaries and writing content. It follows instructions as well as GPT and isn't nearly as pretentious but it's still a bit pretentious.

It's not very good at code.

I was wondering how much GPU memory Guanaco-65B needs; from the docs, it's "48GB".

I'm running it on Serge with 51 GB of RAM. I don't think it requires any GPU memory (llama.cpp runs on the CPU), I have a RTX 2060 in the system I'm running it on. Correct me if I'm wrong.

How many tokens per second are you getting with your setup?

About 0.5 tokens per second, if I had to guess.

Is this with the 4-bit quantization? The only issue with it is that inference is incredibly slow with it on right now, but that should be fixed up in the next few weeks I think.

LLM Leaderboard:


The short answer is that nothing self hosted can come close to GPT-4. The only thing that comes close period is Anthropic's Claude.

In our experimentation, we've found that it really depends what you're looking for. That is, you really need to break down evaluation by task. Local models don't have the power yet to just "do it all well" like GPT4.

There are open source models that are fine tuned for different tasks, and if you're able to pick a specific model for a specific use case you'll get better results.


For example, for chat there are models like `mpt-7b-chat` or `GPT4All-13B-snoozy` or `vicuna` that do okay for chat, but are not great at reasoning or code.

Other models are designed for just direct instruction following, but are worse at chat `mpt-7b-instruct`

Meanwhile, there are models designed for code completion like from replit and HuggingFace (`starcoder`) that do decently for programming but not other tasks.


For UI the easiest way to get a feel for quality of each of the models (or, chat models at least) is probably https://gpt4all.io/.

And as others have mentioned, for providing an API that's compatible with OpenAI, https://github.com/go-skynet/LocalAI seems to be the frontrunner at the moment.


For the project I'm working on (in bio) we're currently struggling with this problem too since we want a nice UI, good performance, and the ability for people to keep their data local.

So at least for the moment, there's no single drop-in replacement for all tasks. But things are changing every week and every day, and I believe that open-source and local can be competitive in the end.

For personal use, check out https://github.com/imartinez/privateGPT. It's lightweight and has lots of momentum from the OS community. There's even an open PR to support huggingface LLMs. For business use, here's some shameless self promotion: https://mirage-studio.io/private_chatgpt. We offer a version that can be hosted on your own GPU cluster.

The answer to this question changes every week.

For compatibility with the OpenAI API one project to consider is https://github.com/go-skynet/LocalAI
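To make "compatible with the OpenAI API" concrete: the client just swaps the base URL and speaks the same chat-completions wire format. A minimal stdlib-only sketch (the host, port, endpoint path, and model name below are assumptions; match them to your LocalAI config):

```python
import json
import urllib.request

# LocalAI mimics the OpenAI REST API, so any client can target it by
# swapping the base URL. Host/port and model name here are assumptions.
BASE_URL = "http://localhost:8080/v1"

def build_chat_request(prompt, model="ggml-gpt4all-j"):
    # Same JSON shape the OpenAI chat-completions endpoint expects.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# urllib.request.urlopen(build_chat_request("Hello")) would send it,
# assuming a LocalAI server is actually listening on that port.
```

Existing ChatGPT front-ends that let you override the OpenAI endpoint URL work the same way.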

None of the open models are close to GPT-4 yet, but some of the LLaMA derivatives feel similar to GPT3.5.

Licenses are a big question though: if you want something you can use for commercial purposes your options are much more limited.

Have you had a chance to look at the Direct Preference Optimization [1] pre-print? It seems like it might help get around RLHF which (as far as I can tell) is the really hard/expensive part of training the best of these models. They said they will release the code soon, so I guess we will find out soon enough.

[1] https://arxiv.org/abs/2305.18290

> Preferably self-hosted (I'm okay with paying for it)

I'm the founder of Mirage Studio and we created https://www.mirage-studio.io/private_chatgpt. A privacy-first ChatGPT alternative that can be hosted on-premise or on a leading EU cloud provider.

Haha, beat me to it by a minute! But indeed, do check out our self-hosted option :D

Nothing self hosted is even remotely close to gpt 3.5, let alone gpt4.

Wizardlm-uncensored-30B is fun to play with.

How much fun? Do you have any guidance for me to follow?

Guanaco-65B[0] using Basaran[1] for your OpenAI compatible API.

(You can use any ChatGPT front-end which lets you change the OpenAI endpoint URL.)

[0] https://huggingface.co/TheBloke/guanaco-65B-HF A QLoRA finetune of LLaMA-65B by Tim Dettmers from the paper here: https://arxiv.org/abs/2305.14314

[1] https://github.com/hyperonym/basaran

What's the best self-hosted option for ingesting a local codebase and wiki to ask questions of it? Some of the projects linked here have ingest scripts for doc and pdf files, but it'd be cool to ingest a whole git repo and wiki and have a little chat interface to ask questions about the code.

Not self-hosted/local but Claude by Anthropic from what I've heard is really good but the API is not publicly available. It's apparently accessible via Poe (https://poe.com)

As for open models, HuggingFace has a nice leaderboard to see which ones are decent: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

"Okay with paying for it" gives you a wide range of options.

Most of the open source stuff people are talking about is things like running a quantized 33B parameter LLaMA model on a 3090. That can be done on consumer hardware, but isn't quite as good at general purpose queries as GPT-4. Depending on your use case and your ability to fine tune it, that might be sufficient for a number of applications. Particularly if you've got a very specific task.

However, if you're willing to spend, there are bigger models available (e.g. Falcon 40B, LLaMA 65B) that can be run on data server class machines, if you're willing to spend $15-20K.

Will that get you GPT-4 level inference? Probably not (though it is difficult to quantify); will it get you a high-quality model that can be further fine-tuned on your own data? Yes.

For the smaller models, the fine-tunes for various tasks can be fairly effective; in a few more weeks I expect that they'll have continued to improve significantly. There's new capabilities being added every week.

The biggest weakness that's been highlighted in research is that the open source models aren't as good at the wide range of tasks that OpenAI's RLHF has covered; that's partly a data issue and partly a training issue.

Nothing open source is quite as good as GPT-4 yet but the community continues to edge closer.

For general use Falcon seems to be the current best:


For code specifically Replit's model seems to be the best:


There is a model that was just released called falcon-40B that is available for commercial use. It outperforms every other open LLM model available today. Buyer beware, however, because the license is custom[1] and has restrictions for "attributable revenues" over $1M/year. I'll leave that for you to interpret as you will.

[0]: https://huggingface.co/tiiuae/falcon-40b-instruct [1]: https://huggingface.co/tiiuae/falcon-40b-instruct/blob/main/...

EDIT: I just realized you seem to be asking for a fully realized, turn-key commercial solution. Yeah, refer to others who say there's no alternative. It's true. Something like this gives you a lot more power and flexibility, but at the cost of a lot more work building the solution as you try to apply it.

They changed the license for Falcon 40B today (a day later) to Apache 2.0 https://www.tii.ae/news/uaes-falcon-40b-now-royalty-free

>Buyer beware, however, because the license is custom and has restrictions for "attributable revenues" over $1M/year.

It's more than that, it requires permission in advance and royalty payments (art. 8). See also attribution requirements in art. 5. Arguably, any license could be waived - it should be possible to write to Meta for permission as well, so it's not in a much better state than Llama itself for commercial use.

I think you have to distinguish between self-hosted to run on CPU (like LLaMA), on consumer GPU, or on big GPUs. I find the market currently very confusing.

I'm especially interested since the data center I'm working for is sitting on a bunch of A100 and I get daily requests of people asking for LLMs tuned to specific cases, who can't or won't use OpenAI for various reasons.

Here you can try Vicuna (and quite a few others) easily: https://chat.lmsys.org/

They also have A/B testing with a leaderboard where Vicuna wins among the self-hostable ones: https://chat.lmsys.org/?leaderboard

I would monitor and research each of these top models to determine which best fits your use case.





Also keep up to date with r/LocalLLaMA where new best open models are posted all the time.

You can check out this leaderboard to see a current state of LLM alternatives to GPT4


But unfortunately for now it seems there aren't any viable self-hosted options...

https://gpt4all.io/ works fairly well on my 16 GB M1 Pro MacBook. It's certainly not on a level with ChatGPT, but what is?

It's a simple app download and allows you to select from multiple available models. No hacking required.

> It's certainly not on a level with ChatGPT, but what is?

Guanaco-65B is per https://arxiv.org/abs/2305.14314

CPU Version: https://huggingface.co/TheBloke/guanaco-65B-GGML

GPU Version: https://huggingface.co/TheBloke/guanaco-65B-HF

4bit GPU Version: https://huggingface.co/TheBloke/guanaco-65B-GPTQ

How can we load this 4-bit GPU version (https://huggingface.co/TheBloke/guanaco-65B-GPTQ), given that it's in safetensors format?

How much ram/vram required?

At least 48GB of RAM to run with llama.cpp, which now has CUDA GPU Offloading support.

The more you can fit on your GPU (in VRAM) the better (for speed), but no GPU is strictly required.
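For reference, a llama.cpp invocation with partial GPU offload might look like this (the model filename and layer count are assumptions; tune --n-gpu-layers to however many layers fit in your VRAM):

```shell
# Run a 4-bit GGML quant of Guanaco-65B, offloading 40 layers to the GPU.
# Everything that doesn't fit stays in system RAM and runs on the CPU.
./main -m guanaco-65B.ggmlv3.q4_0.bin \
  --n-gpu-layers 40 \
  -n 256 \
  -p "Summarize the following article: ..."
```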

Why hasn't the community created a distributed Folding@home style GPT4/LLM software, where anyone can ask a question, and all participants machine's are contributing to the computation of the answer? It seems like the ideal way for the open source community. Is it not possible for some reason with the way LLMs work and compute? Or has it simply just not been done yet?

Distributing very large matrix multiplications doesn't scale across networks well.

While this is an MNIST classifier (and can be run in the browser), you can get an idea of the math behind it: https://www.3blue1brown.com/lessons/gradient-descent and https://www.3blue1brown.com/lessons/neural-network-analysis

When dealing with a LLM, it's being run again and again - token by token. Pick the best next token, append it to the input, run it again.

If you want to generate 100 tokens (rather small amount of data when compared to much of the GPT-4 conversations), that means running it 100 times.

The network traffic makes the entire system much slower than running it all in one spot.
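The loop being described is roughly this (a toy sketch; the "model" here is a fake stand-in so the shape of the loop is visible, not a real LLM):

```python
def toy_model(tokens):
    # Pretend forward pass: returns a "logit" score per candidate token.
    # A real LLM would run billions of matrix multiplications here.
    vocab = ["the", "cat", "sat", "<eos>"]
    return {tok: -abs(len(tokens) - i) for i, tok in enumerate(vocab)}

def generate(prompt_tokens, max_new_tokens=100):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):          # one full model run per token
        logits = toy_model(tokens)
        next_tok = max(logits, key=logits.get)  # greedy: pick best token
        if next_tok == "<eos>":
            break
        tokens.append(next_tok)                 # append and run again
    return tokens

print(generate(["the"]))  # ["the", "cat", "sat"]
```

Every pass through that loop is a full forward pass of the network, which is why per-token network round-trips would dominate in a distributed setup.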

Consider also the question "is your input visible to everyone?" and the privacy implications of that.

Interesting thought!

I think the problem is that it's hard to parallelize this effectively across a network with Internet-scale latency - individual matrix multiplications parallelize very well, but there would need to be coordination of each result, which would be much slower than just doing the computation locally.

In the case of Folding@home, evaluating possible folds could be done completely in parallel and only need to be coordinated on discovery of a plausible fold (which is rare), so distributing over a high-latency network is still beneficial.

The model can run across multiple machines but it's so slow that way that nobody bothered to ship that.

There's a bunch of stuff on this area. See https://www.fedai.org/ for one.

I also remember something more directly analogous to @home, but I'm having a hard time finding it.

Edit: Petals! Thank you sibling commenter!

If you want/need to go cpu only then llama.cpp, and the assorted front ends people are building for it, is looking like a good project: https://github.com/ggerganov/llama.cpp

I'll append this reference too, as it's related and interesting:


Your statement makes it sound like llama.cpp doesn't support GPUs; it supports GPU offloading as well. However, llama.cpp is just code to run models. OP needs to find a model that works for them; there are lots of models out there.

IMO, the best out right now is https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored... (in llama.cpp ggmlv3 format). A 30B llama fine-tuned on a mix of WizardLM and Vicuna training data. It does conversational interactions pretty well (the wizard) and handles instruction fine too (vicuna).

But it is definitely no GPT3.5 or GPT4 replacement. It will not be good for getting "work" done or helping do tasks. It's for recreation. If you want a GPT3.5 level LLM, something akin to text-davinci-002, you'll need to do the SFT and RLHF fine tuning of the 65B llama yourself. And that's no small task. Neither is running a 65B model (even at 4 bits).

I've gone down this rabbit hole and I want to reaffirm what the other commenters are saying: even if you use a massive model and have the compute to back it up at a reasonable pace (you likely don't), it sucks, can't even hold a candle to GPT 3.5

It depends what you mean by "viable alternatives" and how much money you are prepared to spend on hardware to self-host. As others have mentioned, you can try llama.cpp and LocalAI, but for most ChatGPT-like applications, you won't get anything like as good results. I've found that using GPT-4 via the OpenAI API is somewhat more reliable than ChatGPT, either via the Playground or via a local chat interface like https://github.com/mckaywrigley/chatbot-ui

I often worry about a "The Machine Stops" scenario.

GPT AI actually gives me hope. What if we can store and run an AI in a phone-sized-device that is superior to a similarly sized library of books? Can we have a rugged, solar-powered device that could survive the fall of Civilization and help us rebuild?

It would certainly have military applications in warfare. Imagine being the 21st-century equivalent of a 1940s US Marine on Guadalcanal who needs to know some survival skills. ChatGPT-on-a-phone would be handy if you could keep the battery charged.

"Imagine being the 21ct century equivalent of a 1940's US Marine on Guadal Canal who need to know some survival skills. ChatGPT-on-a-phone would be handy if you could keep the battery charged."

Would it not make sense to learn survival skills...beforehand?

I would much rather learn survival skills in a controlled environment instead of waiting until I need them and then whipping out the ChatGPT-enabled phone for help...

Also, I don't need to have an entire library of knowledge with me, when a simple book of survival skills would be more than adequate...

I think pretty much every military in the world agrees with this.

I'll +1 the votes for Guanaco and Vicuna running with the Oobabooga text-generation-webui.

With a 4090, you can get ChatGPT 3.5 level results from Guanaco 33B. Vicuna 13B is a solid performer on more resource-constrained systems.

I'd urge the naysayers who tried the OPT and LLaMA models only to give up to note that the LLM field is moving very quickly - the current set of models are already vastly superior to the LLaMA models from just two months ago. And there is no sign the progress is slowing - in fact, it seems to be accelerating.

You can find more details here - https://old.reddit.com/r/LocalGPT/

The best self hosted/local alternative to GPT-4 is a (self hosted) GPT-X variant by OpenAI.

No kidding, and I am calling it on the record right here.

OpenAI will release an 'open source' model to try and reclaim their moat in the self-hosted/local space.


Does it benefit OpenAI if people are using an "open source" OpenAI model versus any other "open source" model?

Great question.

Yes it does benefit OpenAI because Sam is pushing for and betting on regulatory capture. Meaning AI models that are compliant with AI safety principles would be allowed and considered safe by law should regulations be put in place.

Current open source models wouldn't be compliant and would require work to make them compliant, thus creating a moat for OpenAI in the self hosted space.

Because there is nothing better than a free, regulation-compliant model from OpenAI that you can self host.

Enterprises would love this, since OpenAI has the mindshare already, they can self host a licensed model in their orgs or air gapped environment that they know is regulated.

OpenAI may release different model sizes of this GPT-X variant, like they did quietly for Whisper; bigger sizes may require an enterprise license.

This is a good candidate: https://github.com/imartinez/privateGPT

This is like an artist getting used to Adobe's products before they're put behind a wall. And borrowing HN's attitude to that: you apparently deserve it.

You can at least fine-tune an open source model for your task and achieve better results, instead of just using one directly. But they are still not close to the OpenAI models in generality. Hugging Face is the place for exploring models; I recently went through a lot of them for my use case, and they are simply not good enough, yet.

There is so much parallel progress happening left and right, but at the same time they are not there yet. With things like SparseGPT, and models fine-tuned on data with tool-use ability (not just instruct data), maybe we'll get there soon; as long as there is progress I am hopeful. Some sort of inference-optimized hardware would also help.

> Preferably self-hosted (I'm okay with paying for it)

The big models, if even available, need >100GB of graphics memory to run and would likely take minutes to warm up.

The pricing available via OpenAI/GCP/etc is only effective when you can multi-tenant many users. The cost to run one of these systems for private use would be ~$250k per year.

... strange. I'm running 30B models on a 10yr old PC with a $400 RTX 3060. Folks can run the 65B models with 4090 or dual 3090. Usually for about a cost of $2500.

GPT-3.5 is up to 175B parameters, GPT-4 (which is what OP is asking for) has been speculated as having 1T parameters, although that seems a little high to me.

It's easy to run a much worse model on much worse hardware, but there's a reason why it's only companies with huge datacenter investments running the top models.

The 1T number is wild hype. It's likely GPT4 is actually a mixture-of-experts with smaller models under the hood.

Don't 4090 and 3090 have the same amount of vram?

How? 24GB VRAM can only handle 30B 4bit. Are you offloading some layers to the CPU? That kills performance AFAIK.
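The back-of-the-envelope math behind those numbers, as a sketch (the 20% headroom factor for activations and KV cache is a loose assumption, not a measured figure):

```python
def weight_memory_gb(n_params_billions, bits_per_weight, overhead=1.2):
    # Memory for the weights alone, times a rough headroom factor
    # for activations and the KV cache.
    bytes_for_weights = n_params_billions * 1e9 * bits_per_weight / 8
    return bytes_for_weights * overhead / 1e9

# 30B at 4-bit: 15 GB of weights, ~18 GB with headroom (fits in 24 GB).
# 65B at 4-bit: ~32.5 GB of weights, ~39 GB with headroom (doesn't fit).
print(weight_memory_gb(30, 4))  # 18.0
print(weight_memory_gb(65, 4))  # 39.0
```

That's why 30B 4-bit is about the ceiling for a single 24 GB card, and 65B needs dual 3090s/4090s or CPU offload.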

30B isn't big anymore. GPT-4 is rumored to have 1T parameters.

I admittedly haven't used GPT-4 yet, but I've replaced several uses of GPT-3 with RWKV on the Raven dataset. I can load it onto my RTX 2060 with 12GB of mem (quantized of course), and use it to whittle down or summarize data for GPT.

OpenAssistant is pretty good. It still has some censorship but nowhere near the levels of commercial models.

It’s actually impressive how good it is considering the limited resources they have.

Have you tried using GPT-4 via Azure? My understanding is that it's faster and more reliable.

There really do not exist any alternatives, self-hosted or not. But more importantly, there may never be, what with the rising tide of AI risks and regulations discourse. It seems that soon training and opensourcing or otherwise making accessible a model of that class will be impossible, even as the cost of its production falls.

The UAE's apparently planning to release a Falcon 180B, which would be the most powerful model you can run at home by far. I don't think they'll bow down to western pressure to lobotomise the models given the country's whole business model is bypassing sanctions and the like.

That's assuming it's substantially superior to stuff like BLOOM. I sure hope that's the correct assumption to make.

Of course, running a 180B dense transformer at home for personal use is utterly impractical.

Is anyone using a self hosted thing to assist with parsing?

Buy a tinybox from tiny corp https://tinygrad.org/

Falcon 40B

openai not so open. should rebrand to closedai


Worth noting that Llama and derivatives most likely can be used for commercial use, despite what the licensing terms say. There seems to be growing consensus that model weights are not copyrightable, and therefore adhering to the terms of the license isn't required if you downloaded them from a third party.

Obviously being legally right doesn't necessarily save you from having Meta's legal team breathing down your neck...

So if I embed a custom "special sauce" model in my phone app, to do inference faster on the device, my bigger competitors could simply extract it from the bundle and use it too?

Perhaps that's okay, I don't know, but it seems strange that the little pieces of client-side javascript are so copyrightable yet this other work isn't.

> I lay out the whole LLM landscape in this article:

No, you don't.

Agreed. Information sparse, lays out almost nothing. One would do better by far to just go to /r/LocalLlama and sort all by Top.

Have you come across any resource that lists hardware requirements for open source models (GPU/VRAM, CPU, RAM)?


I'm all for promoting own solutions if it fits the context but OP is asking for self-hosted alternatives and this is not. This fits the definition of spam.

Tbf, the OP explicitly said "Preferably self-hosted" which is why I wanted to offer this along with the explicit disclaimer that I am the founder

Just stop, you are spamming on an unrelated post and now arguing about it. I’m sure people will want to sign up for your service now.

You could hire a human to manually respond to the queries

Confidently stated responses that are often incorrect or miss the point? Just post it on /r/AskReddit and wait a little bit.

Thank you for calling Movie Phone! Why dont you just tell me which movie you want to see!

