I'm having a lot of fun chatting with characters using Faraday and koboldcpp. Faraday has a great UI that lets you adjust character profiles, generate alternative model responses, undo or edit dialogue, and experiment with how models react to your input. There's also SillyTavern, which I have yet to try out.
I looked at Ollama before, but couldn't quite figure something out from the docs [1].
It looks like a lot of the tooling is heavily engineered for a set of modern, popular LLM-style models. It also looks like llama.cpp supports LoRA adapters, so I'd assume there is a way to engineer a pipeline from LoRA fine-tuning to llama.cpp deployment, which probably covers quite a broad set of possibilities.
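For what it's worth, a minimal sketch of that kind of pipeline might look something like this - the converter script name, flags, and paths below are assumptions that vary by llama.cpp version, so treat it as a shape rather than a recipe:

```python
# Hypothetical LoRA -> llama.cpp pipeline (script name, flags, and paths are
# placeholders; check your llama.cpp checkout for the exact converter and options).
import subprocess

BASE_GGUF = "models/base-7b.Q4_K_M.gguf"   # an already-quantized base model
LORA_DIR = "my-lora-adapter"               # output of a PEFT/LoRA fine-tune
LORA_OUT = "models/my-adapter.gguf"

# 1. Convert the trained adapter into a file llama.cpp can load.
subprocess.run(
    ["python", "convert_lora_to_gguf.py", LORA_DIR, "--outfile", LORA_OUT],
    check=True,
)

# 2. Run inference with the base model plus the adapter applied on top.
subprocess.run(
    ["./main", "-m", BASE_GGUF, "--lora", LORA_OUT, "-p", "Hello"],
    check=True,
)
```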
Beyond llama.cpp, can someone point me to what the broader community uses for general PyTorch model deployments?
I haven't really ever self-hosted models, and am keen to try. Ideally, I am looking for something that stays close to the PyTorch core and therefore gives me the flexibility to take any nn.Module to production.
As far as I know, Ollama doesn't support exllama, QLoRA fine-tuning, multi-GPU, etc. Text-generation-webui might seem like a science project, but it's leagues ahead of everything else (like 2-4x faster inference with the right plugins). It also has a nice OpenAI-compatible mock API that works great.
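As a sketch of how drop-in that mock API is - the port and model name here are guesses, so use whatever your text-generation-webui instance actually exposes:

```python
# Minimal sketch: point the standard openai client at a local
# OpenAI-compatible endpoint (base_url port and model name are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # most local servers ignore or loosely match this
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
)
print(resp.choices[0].message.content)
```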
The feature sets seem to have a decent amount of overlap. One limitation of FastChat, as far as I can tell, is that you are limited to the models that FastChat supports (though I think it would be a minor change to support arbitrary models?)
I have a question. Looking for the most popular "uncensored" model, I just found "TheBloke/Luna-AI-Llama2-Uncensored-GGML", but it has 14 files to download, between 2 and 7 GB each, so I just downloaded the first one: https://imgur.com/a/DE2byOB
Honest question from someone new to exploring and using these models; why do you need uncensored? What are the use-cases that would call for it?
Again, not questioning your motives or anything, just straight up curious. To use your example, any of us can find bomb building info online fairly easily, and has been a point of social contention since the Anarchist's cookbook. Nobody needs an uncensored LLM for that, of course.
Where do you typically have the 'safe search' setting when you web search? Personally I have it 'off' ('moderate' when I worked in an office I think) even though I'm not looking for anything that ought to be filtered by it.
I'm not using models, censored or otherwise, but I imagine I'd feel the same way there - I won't be offended, so don't try to be too clever; just give me the unfiltered results and let me decide what's correct.
(Bing AI actually banned me for trying to generate images that were a rough likeness of myself by combining traits of minor celebrities. In combination it shouldn't have looked like those people either, so I don't think it should have violated the ToS, and it certainly didn't in intention - I wanted myself, but it doesn't know what I look like and I couldn't provide a photo at the time (I don't know if you can now - banned!). So 'false positive censoring', if you like, does happen.)
Holy shit, that's so badass that Bing AI banned you. I mean, it sucks and I hope it gets resolved, but still, awesome. You got it just from trying to make a composite of people's faces? I guess it takes fraud seriously, at least for biometrics. However, you clearly either weren't doing that or omitted some choice details. Good luck with appealing; that process always wrecks my faith in humanity.
i've been trying to see how much i can get away with before it suspends or bans me using one of my throwaway accounts. it takes some doing, but if you convince it you aren't doing something shady right before doing something shady it'll play along with you a fair bit more than if you just say "hi bing, i wanna do some shady shit." unfortunately you have to engineer your "i'm not gonna do (insert something shady)" prompt on a per shade basis.
I'd switch that question around: Why would I want to use a censored LLM?
It doesn't make sense for me personally. It does make sense if you're offering an LLM publicly, so that you don't get bad PR if your LLM says some politically incorrect or questionable things.
It's very easy to hit absurd "moral" limits on ChatGPT for the most stupid things.
Earlier I was looking for "a phrase that is used as an insult for someone who writes with too much rambling" and all I got was some bullshit about how it's sorry but it can't do that because it's allegedly against OpenAI's rules.
So I asked again “a phrase negatively used to mean someone that writes too much while rambling” and it worked.
I simply cannot be bothered to deal with stupid insipid “corporate friendly language” and other dumb restrictions.
Imagine having a real conversation with someone who freaked out any time anything negative was discussed.
> Again, not questioning your motives or anything, just straight up curious. To use your example, any of us can find bomb building info online fairly easily, and has been a point of social contention since the Anarchist's cookbook. Nobody needs an uncensored LLM for that, of course.
When you ask a local LLM, at worst you get no useful info.
When you ask online, at worst you spend the rest of your life in a government black site without any chance of due process.
Because I don't appreciate it when a model has blatant Democrat/anti-Republican bias, for example. The fact that ChatGPT, Bard, etc. are heavily and purposefully biased on certain topics is well documented [1].
At best the false positives are a nuisance and make the model dumber. But really, censorship is fundamentally wrong.
Unless we are talking about gods and not flawed humans like me, I prefer to have the say in what is right and what is wrong for things that are entirely personal and only affect me.
The readme in each of their repositories has a table that details the quality of each file. The Q4_K_M and Q5_K_M variants seem to be the two main recommended ones for low quality loss without being too large.
You only need one of the files, but I recommend checking out the GGUF version of the model (just replace GGML in the URL) instead of GGML. Llama.cpp no longer supports GGML, and I'm not sure if TheBloke still uploads new GGML versions of models.
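For example, you can pull just one quantization with the huggingface_hub library rather than cloning the whole repo - the repo id and filename below are illustrative, so check the repo's file list for the exact names:

```python
# Sketch: grab a single quantization file from a TheBloke GGUF repo instead of
# downloading all 14 files (repo_id and filename are illustrative).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Luna-AI-Llama2-Uncensored-GGUF",
    filename="luna-ai-llama2-uncensored.Q4_K_M.gguf",
)
print(path)  # local cache path you can point llama.cpp at
```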
- The chatbox has a normal "write here" state even when no chat is actually selected. I thought my keyboard was broken until I figured that out
- I didn't find a way to enable CUDA acceleration before loading a model; I only managed to set GPU-offloaded layers and use "relaunch to apply"
- Some HuggingFace models are simply not listed, and there's no indication as to why. I guess the models are actually curated, but it's somehow presented as a HuggingFace browser?
- Scrolling in the accordion parts of the interface seems to respond to mouse wheel scrolling only. I have a mouse with a damaged wheel and couldn't find a way to reliably navigate to the bottom drawers
That said, I really liked the server tab, which made initial debugging very easy.
I don't mean this as a criticism, I'm just curious because I work in this space too: who is this for? What is the niche of people savvy enough to use this who can't run one of the many open-source local LLM tools? It looks in the screenshot like it's exposing much of the complexity of configuration anyway. Is the value in the interface and the management of conversations and models? It would be nice to see info or even speculation about the potential market segments of LLM users.
In most workplaces that deal with LLMs you’ve got a few classes of people:
1. People who understand LLMs and know how to run them and have access to run them on the cloud.
2. People who understand LLMs well enough but don’t have access to cloud resources - but still have a decent MacBook Pro. Or maybe access to cloud resources is done via overly tight pipelines.
3. People who are interested in LLMs but don’t have enough technical chops/time to get things going with Llama CPP.
4. People who are fans of LLMs but can't even install stuff on their computer.
This is clearly for #3 and it works well for that group of people. It could also be for #2 when they don’t want to spin up their own front end.
It's actually quite handy. I built all the various things by hand at one point, but had to wipe it all. Instead of following the directions again I just downloaded this.
Being able to swap out models is also handy. This probably saved a couple of hours of my life, which I appreciate.
It's for people who want to discover LLMs and either don't have the skill to deploy them, or value their time and prefer not to fool around for hours getting it to work before they can try it.
The fact it has configuration is good, as long as it has some defaults.
Exactly. People like me have been waiting for a tool like this.
I'm more than capable of compiling/installing/running pretty much any software, but all I want is the ability to chat with an LLM of my choice without spending an afternoon tabbing back to a 30-step esoteric GitHub .md full of caveats, assumptions, and dependencies that need to be installed and configured according to preferences I don't have.
Yeah, I think I fit into this category. If I see a new model announced, it's been nice to just click and evaluate for myself whether it's useful for me. If anyone knows other tools for this kind of workflow I'd love to hear about them. Right now I just keep my "test" prompts in a text file.
I got Mistral-7b running locally, and although it wasn't hard, it did take some time nonetheless. I just wanted to try it out and was not that interested in the technical details.
For me it's pretty simple- LM Studio supports Apple Silicon GPU acceleration out of the box, and I like the interface better than Gradio Web UI. It saves me the headache and the tinkering of the alternatives. That said, free software that's hiring developers probably won't stay free for long, so I'm keeping my eye on other options.
> “Deep understanding of what is a computer, what is computer software, and how the two relate.”
Seems like a joke, but many developers do not really understand what's going on behind the scenes. This gets straight to the point. They don't care about HR keyword matching on your CV, or how many years of experience you have of being a mediocre developer with language X or framework Y. I guess during the interview they will investigate whether you truly understand the fundamentals.
I tend to agree. I have been working in IT for 48 years, and it is not at all uncommon to come across developers who have a very narrow and niche view of software development. I have had the privilege of working with a wide range of architects and senior engineers over the years, and I have found that the ones who tended to be the most creative (solution-wise) were the ones who had deep knowledge right from the bottom of the stack all the way to the top - they did not look at a problem through the lens of a specific language (they all knew multiple languages). When I have hired for senior positions, unless it's been for a very specific skill set, experienced generalists have tended to impress more.
The second one isn't that bad in context. But the Senior Systems Software Engineer is wild, with "Deep understanding of what is a computer, what is computer software, and how the two relate" followed by "Experience writing and maintaining production code in C++14 or newer". You'd think the latter would imply the former, but maybe not...
They seem to have even lowered expectations a bit. Two months ago [1] they were already hiring for that role (or a very, very similar one), but back then you needed experience with "mission-critical code in C++17", now just "production code in C++14".
> You'd think the latter would imply the former, but maybe not...
I wouldn't put C++ devs on too high of a pedestal. I got away with writing shitty C++ code for years before I really knew what I was doing. It still worked though.
For my experiments with new self-hostable models on Linux, I've been using a script to download GGUF models from TheBloke on HuggingFace (currently, TheBloke's repository has 657 models in GGUF format), which I feed to a simple program I wrote that invokes llama.cpp compiled with GPU support. The GGUF format and TheBloke are a blessing, because I'm able to check out new models basically on the day of their release (TheBloke is very fast) and without issues. However, the only frontend I have is the console. Judging by their site, their setup is exactly the same as mine (which I implemented over a weekend), except that they also added a React-based UI on top. I wonder how they're planning to commercialize it, because it's pretty trivial to replicate, and there are already open-source UIs like oobabooga.
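For anyone curious, the core of that kind of weekend setup is basically just this - the repo, filename, and GPU layer count below are placeholders, not my exact script:

```python
# Fetch a GGUF from TheBloke and hand it to a llama.cpp binary built with GPU
# support (repo_id, filename, binary path, and layer count are placeholders).
import subprocess
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",   # illustrative repo
    filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",    # illustrative file
)

subprocess.run([
    "./main",              # llama.cpp binary compiled with CUDA/Metal
    "-m", model_path,
    "-ngl", "35",          # offload this many layers to the GPU
    "-p", "Explain GGUF in two sentences.",
], check=True)
```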
I'd like to build myself a headless server to run models that could be queried from various clients on my LAN, but am unsure where to start and what the hardware requirements would be. Software can always be changed later, but I'd rather buy the hardware parts only once.
Do you have recommendations about this, or blog posts to get started? What would be a decent hardware configuration?
Ollama does this. I run it in a container on my homelab (Proxmox on a HP EliteDesk SFF G2 800) and 7B models run decently fast on CPU-only. Ollama has a nice API and makes it easy to manage models.
Together with ollama-webui, it can replace ChatGPT 3.5 for most tasks. I also use it in VSCode and nvim with plugins, works great!
I have been meaning to write a short blog post about my setup...
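In the meantime, hitting Ollama's API from a script is about this simple (default port 11434; the model name assumes you've already pulled something like mistral):

```python
# Quick sketch of calling Ollama's local API; adjust host/model to your setup.
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Write a haiku about homelabs.", "stream": False},
)
print(r.json()["response"])
```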
Depending on what you mean by "production" you'll probably want to look at "real" serving implementations like HF TGI, vLLM, lmdeploy, Triton Inference Server (tensorrt-llm), etc. There are also more bespoke implementations for things like serving large numbers of LoRA adapters[0].
These are heavily optimized for more efficient memory usage, performance, and responsiveness when serving large numbers of concurrent requests/users, in addition to model versioning/hot load/reload, Prometheus metrics, and the like.
One major difference is that at this level a lot of the more aggressive memory optimization techniques and support for CPU aren't even considered. Generally speaking you get GPTQ and possibly AWQ quantization + their optimizations + CUDA only. Their target users are often running A100/H100 and just trying to need fewer of them. Support for lower-VRAM cards, older CUDA compute architectures, etc. comes secondary to that (for the most part).
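To give a flavor of what that looks like in practice, here's a minimal offline-batch sketch with vLLM - the model name is just an example, and real deployments would use its OpenAI-compatible server rather than a one-off script like this:

```python
# Minimal vLLM sketch of the "serving framework" flavor described above
# (model name is an example; these frameworks assume a reasonably beefy CUDA GPU).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What does continuous batching buy you?"], params)
print(outputs[0].outputs[0].text)
```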
Thanks! Really helpful. I've a 3090 at home, and my idea is to do some testing on a similar config in the cloud to get an idea of the number of requests that could be served.
The good news is the number of requests and performance is very impressive. For example, on my RTX 4090 from testing many months ago with lmdeploy (it was the first to support AWQ) I was getting roughly 70 tokens/s each across 10 simultaneous sessions with LLama2-13b-Chat - almost 700 tokens/s total. If I were to test again now with all of the impressive stuff that's been added to all of these I'm sure it would only be better (likely dramatically).
The bad news is that because "low VRAM cards" like the 24GB RTX 3090 and RTX 4090 aren't really targeted by these frameworks, you'll eventually run into "Yeah, you're going to need more VRAM for that model/configuration. That's just how it is." as opposed to some of the approaches for local/single-session serving that emphasize memory optimization first and tokens/s for a single session next. Often with no consideration or support at all for multiple simultaneous sessions.
It's certainly possible that with time these serving frameworks will deploy more optimizations and strategies for low-VRAM cards, but if you look at the timelines to even implement quantization support (as one example), it's definitely an afterthought, typically only implemented when it aligns with the overall "more tokens for more users across more sessions on the same hardware" goal.
Loading a 70B model on CPU and getting 3 tokens/s (or whatever) is basically seen as an interesting yet completely impractical and irrelevant curiosity to these projects.
In the end "the right tool for the job" always applies.
You can currently do this on an M2 Max with ollama and a Next.js UI [0] running in a docker container. Any devices on the network can use the UI... and I guess if you want a LAN API you just need to run another container with an OAI-compatible API that can query ollama, e.g. [1]
> unsure where to start and what the hardware requirements would be
Have a look at the localllama subreddit
In short, though: dual 3090s are common, as is a single 4090 or various flavours of M1/M2/M3 Macs. Alternatively, a P40 can be jury-rigged too, but research that carefully. In fact, anything with more than one GPU is going to require careful research.
What a comment. Why do it the easy way when the more difficult and slower way works out to the same result‽ For people who just want to USE models and not hack at them, TheBloke is exactly the right place to go.
Like telling someone interested in 3D printing minis to build a 3D printer instead of buying one. Obviously that helps them get to their goal of printing minis faster right?
Actually, consider that the commenter may have helped un-obfuscate this world a little bit by saying that it is in fact easy. To be honest, the hardest part about the local LLM scene is the absurd amount of jargon introduced - everything looks a bit more complex than it is. It really is easy with llama.cpp; someone even wrote a tutorial here: https://github.com/ggerganov/llama.cpp/discussions/2948 .
But yes, TheBloke tends to have conversions up very quickly as well and has made a name for himself for doing this (+more)
This works, but I've noticed that my CPU use goes up to about 30 percent, all in kernel time (windows), after installing and opening this, even when it's not doing anything, on two separate machines... I also hear the fan spinning fast on my laptop.
Killed the LM studio process and re-opened it and the ghost background usage is down to about 5%.
M1 is only 3 years old, and no one cares to support Intel Macs anymore. There are surely a lot of them out there. Are they that much worse for running LLMs?
There isn't enough text in Tolkien's works to train a functional LLM. You need to start with a base model that contains enough English (or the language of your choice) to become functional.
Such a model (LLaMa is a good example) is not "ignorant," but rather a generalized model capable of a wide range of language tasks. This base model does not have specialized knowledge in any one area but has a broad understanding based on the diverse training data.
If you were to "feed" Tolkien's specific books into this general LLM, the model wouldn't become a Middle Earth savant. It would still provide responses based on its broad training. It might generate text that reflects the style or themes of Tolkien's work if it has learned this from the broader training data, but its responses would be based on patterns learned from the entire dataset, not just those books.
If you give Tolkien books to a newborn child, they won't become a Tolkien expert. You need to first teach them English, and then give them the books. They will be able to answer questions about Middle Earth, but they'll also be able to form any other English sentence based on what they learned previously. It's basically the same with LLMs.
You would fine-tune a pretrained LLM because those books are written in English. And natural languages are in flux, and the corpora that describe them are not neutral, so you can impose fairness after the fact. Sebastian Raschka has written some relevant popular articles on this.
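If it helps make the distinction concrete, a rough sketch of that kind of fine-tune with PEFT/LoRA looks like this - the model name, target modules, and hyperparameters are placeholders, not a recipe:

```python
# Hedged sketch of "fine-tune a pretrained base model" with PEFT/LoRA
# (model name, target modules, and hyperparameters are placeholders).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # a general English base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Attach small trainable LoRA adapters instead of retraining the whole model.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()

# From here you'd tokenize the Tolkien text and run a normal Trainer loop;
# the base model supplies the English, the adapter supplies the Middle Earth.
```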
I expected it to not let me run this. I have an Intel MacBook and was expecting that I'd need Apple Silicon... am I misunderstanding something? I get fairly fast results at the prompt with the default model. How is this thing running on whatever shitty GPU I have in my laptop?
I include a universal binary of llama.cpp's server example to do inference. What's your machine? The lowest spec I've heard it running on is a 2017 iMac with 8GB RAM (~5.5 tokens/s). On my m1 with 64GB RAM I get ~30 tokens per second on the default 7B model.
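If you want to poke at it outside the app, that server speaks plain HTTP - something like this works against a default llama.cpp server instance (port 8080 assumed; the app may bind a different one):

```python
# Talking to a llama.cpp `server` instance directly (default port 8080 assumed).
import requests

r = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Q: What is GGUF?\nA:", "n_predict": 64},
)
print(r.json()["content"])
```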
MacBook Pro 2020 with 16GB of system RAM. I think the GPU is Iris Plus? But I don't much keep up on those.
I'm now delving into getting this running in Terminal... there are a few things I want to try that I don't think the simple interface allows.
Also, I've noticed that when chats get a few kilobytes long, it just seizes up and can't go further. I complained to it, it spent a sentence apologizing, started up where it left off... and got about 12 words further.
I did just that, including signing up for discord. Despite never having used discord before, I was able to find the link to the beta AppImage in a pinned message and downloaded it. Made it executable with chmod +x LM...... Ran it. Searched for some of the models referenced in this discussion. Downloaded one and ran it. It just worked on Linux Mint 21.2.
What can I do with any of these models that won't result in 50% hallucinations, code recommendations with APIs that don't exist, or basically regurgitated, out-of-date StackOverflow answers (that it was trained on) for libraries whose versions/APIs have changed, etc.?
Can somebody share one real use case they are using any of these models for?
> What can I do with any of these models that won't result in 50% hallucinations
There are many times when I am searching for a solution to a problem, and I would be perfectly happy with a possible answer I could test that has a 50% chance of being correct.
Humans should test the output of all AIs.
Humans should test the validity of everything we hear, see, and read.
Because I pay $20/mo for GPT-4 and don't understand why anybody would run a "less-good" version locally that you can trust less/that has less functionality.
That's why I wanted to try to understand, what am I missing about local-toy LLMs. How are they not just noise/nonsense generators?
Sometimes you just need to write creative nonsense. Emails, comments, stories, etc. Great for fiction since there are low stakes for errors.
They're bad at generative tasks. Don't have it write code or scientific papers from scratch, but you can have it review anything you've written. You can also do summaries, keyword/entity extraction, and the like safely. Any reductive task works pretty well.
So do I, but on a flight 2 days ago, I forgot the name of a Ruby method, but knew what it does. I tried looking it up in Dash (offline rdocs) but didn’t find it.
On a whim, I asked Zephyr 7B (Mistral based) “what’s the name of that Ruby method that does <insert code>” and it gave me 3 different correct ways of doing what I wanted, including the one I couldn’t remember. That was a real “oh wow” moment.
So offline situations are the most likely use case for me.
If you just want to try something quick, you can try AskCyph LITE: https://askcyph.cypherchat.app. It runs AI models natively in the browser without having to do any installation, etc.
Why is purple, or some shade of purple, the color of all AI products? For some reason, the landing pages of AI products immediately remind me of crypto products. This one doesn't have crypto vibes, but the color is purple. I don't get why.
It's a default color in Tailwind CSS and is used in a lot of the templates and examples. Nine times out of ten, if you check the source of a page with this flavor of purple, you'll see it's using Tailwind, as the OP site in fact does.
Ah! That makes more sense. New startup, new tech, and therefore the new default color. I hope it's just that, and that purple is what I end up seeing because I only tend to notice AI startups.
Because apps mostly prefer dark theme now, and dark red, brown, dark green and so on look weird, and gray is OK, but very boring, like someone desaturated the UI. Which leaves shades of blue and purple.
LM Studio is great for running local LLMs and also supports an OpenAI-compatible API. If you need a more advanced UI/UX, you can use LM Studio with MindMac (https://mindmac.app); just check this video for details: https://www.youtube.com/watch?v=3KcVp5QQ1Ak.
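A quick sketch of what talking to that local server looks like (LM Studio defaults to localhost:1234 when you start its server; adjust if yours differs):

```python
# Sketch of hitting LM Studio's local OpenAI-compatible server.
import requests

r = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "local-model",  # LM Studio serves whatever model is loaded
        "messages": [{"role": "user", "content": "Give me three prompt ideas."}],
    },
)
print(r.json()["choices"][0]["message"]["content"])
```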
Thank you for your support. I just found a workaround to use Ollama with MindMac. Please check this video https://www.youtube.com/watch?v=bZfV70YMuH0 for more details. I will integrate Ollama more deeply in a future version.
MindMac is the first example I've seen where the UI for working w/ LLMs is not complete and utter horseshit and starts to support workflows that are sensible.
I will buy this with so much enthusiasm if it holds up. Argh, this has been such a pain point.
Newbie question... Is this purely for hosting text language models? Is there something similar for image models? E.g., upload an image and have some local model provide some detection/feedback on it.
After the latest ChatGPT debacles and the poor performance I'm getting from GPT-4 Turbo, I'd really like a local version of GPT-4 or an equivalent.
I'd even buy a new PC if I had to.
It's a standard clause for most apps. If there's a breach of the terms and conditions (such as using it for commercial purposes, like selling the software), they are allowed to launch an investigation. Nowhere does this mention "spying" on, or modifying, the app for such use.
Considering the code is closed source and they can change the ToS at any time to send conversation data to their servers whenever they want, I would like to know what the benefit of using this over ChatGPT would be.
Am I missing something here? I'm on a recent M2 machine. Every model I've downloaded fails to load immediately when trying to load it. Is there some way to get feedback on the reason for failure, like a log file or something?
EDIT: The problem is I'm on macOS 13.2 (Ventura). According to a message in Discord, the minimum version for some (most?) models is 13.6.
Is anyone using open source models to actually get work done or solving problems in their software architecture? So far I haven't found anything near the quality of GPT-4.
Top of the line consumer machines can run this at a good clip, though most machines will need to use a quantized model (ExLlamaV2 is quite fast). I found a model for that as well, though I haven't used it myself:
Zephyr is coherent enough to bounce ideas off of, but I'm eagerly awaiting the day open-source models are on par, productivity-wise, with the big providers. I imagine some folks are utilizing CodeLlama 34B somehow, but I haven't been able to use it effectively.
If you're looking to do the same with open source code, you could likely run Ollama and a UI.
https://github.com/jmorganca/ollama + https://github.com/ollama-webui/ollama-webui