Note that they don't compare with deepseek coder 6.7b, which is vastly superior to much bigger coding models. Surpassing codellama 7b is not that big of a deal today.
The most impressive thing about these results is how good the 1.3B deepseek coder is.
Deepseek Coder Instruct 6.7b has been my local LLM (M1 series MBP) for a while now and that was my first thought… They selectively chose benchmark results to look impressive (which is typical).
I tested out StableLM Zephyr 3B when that came out and it was extremely underwhelming/unusable.
Based on this, Stable Code 3B doesn't look to be worth trying out. Guessing that if they could have put out a 7B model which beat Deepseek Coder 6.7B, they would have.
Do you know how Deepseek 33b compares to 6.7b? I'm trying 33b on my (96GB) MacBook just because I have plenty of spare (V)RAM. But I'll run the smaller model if the benefits are marginal in other peoples' experience.
The smaller model is great at trivial day-to-day tasks.
However, when you ask hard things, it struggles; you can ask the same question 10 times, and only get 1 answer that actually answers the question.
...but the larger model is a lot slower.
Generally, if you don't want to mess around swapping models, stick with the bigger one. It's better.
However, if you are heavily using it, you'll find the speed is a pain in the ass, and when you want a trivial hint like 'how do I do a map statement in kotlin again?', you really don't need it.
What I have set up personally is a little thumbs-up / thumbs-down on the suggestions via a custom IntelliJ plugin; if I 'thumbs-down' a result, it generates a new solution for it.
If I 'thumbs-down' it twice, it swaps to the larger model to generate a solution for it.
This kind of 'use an OK model for most things and step up to the larger model when you start asking hard stuff' approach scales very nicely for my personal workflow... but I admit that setting it up was a pain, and I'm forever pissing around with the plugin code to fix tiny bugs, time I would prefer to spend doing actual work.
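The routing rule itself is trivial; a rough sketch of the idea (Python for brevity, all names made up, nothing like the actual plugin code):

```python
# Sketch of the thumbs-down escalation: small model by default,
# step up to the big one after two rejected answers.
SMALL_MODEL = "deepseek-coder-6.7b"   # fast, fine for trivial hints
LARGE_MODEL = "deepseek-coder-33b"    # slow, better on hard questions

def answer(prompt: str, run_model, rejections: int = 0) -> str:
    """run_model(model_name, prompt) -> str is whatever backend you use."""
    model = SMALL_MODEL if rejections < 2 else LARGE_MODEL
    return run_model(model, prompt)
```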
So... there's not really much tooling out there at the moment to support it, but the best solution really is to use both.
If you don't want to and just want 'use the best model for everything', stick with the bigger one.
The larger model is more capable of turning 'here is a description of what I want' into 'here is code that does it that actually compiles'.
The smaller model is much better at 'I want a code fragment that does X' -> 'rephrased stack overflow answer'.
I'm not sure what to say; responsive, fast output is ideal, and the larger model is distinctly slower for me, particularly for long completions (2k tokens) if you're using a restricted grammar like JSON output.
I’m using an M2 not an M3 though; maybe it’s better for you.
I was under the impression quantised results were generally slower too, but I’ve never dug into it (or particularly noticed a difference between q4/q5/q6).
ollama is actually not a great way to run these models as it makes it difficult to change server parameters and doesn't use `mlock` to keep the models in memory.
The 1.3b model is amazing for real time code complete, it's fast enough to be a better intellisense.
Another model you should try is magicoder 6.7b ds (based on deepseek coder). After playing with it for a couple weeks, I think it gives slightly better results than the equivalent deepseek model.
I run tabby [0] which uses llama.cpp under the hood and they ship a vscode extension [1]. Going above 1.3b, I find the latency too distracting (but the highest end gpu I have nearby is some 16gb rtx quadro card that's a couple years old, and usually I'm running a consumer 8gb card instead).
There are many workflows, with hardware-dependent requirements. Three which work for my MacBook:
1. Clone & make llama.cpp. It's a CLI program that runs models, e.g. `./main -m <local-model-file.gguf> -p <prompt>`.
2. Another CLI option is `ollama`, which I believe can download/cache models for you.
3. A GUI like LM Studio provides a wonderful interface for configuring, and interacting with, your models. LM Studio also provides a model catalog for you to pick from.
Assuming that your hardware is sufficient, options 1 & 2 should satisfy your terminal needs. Option 3 is an excellent playground for trying new models/configurations/etc.
Models are heavy. To fit one on your silicon and run it quickly, you'll want to use a quantized model: a compressed version of the original weights -- say 80% smaller for a small accuracy loss. TheBloke on HuggingFace specializes in publishing these quantized builds. After finding a model you like, you can download some flavor of quantization he made, e.g.: `huggingface-cli download TheBloke/neural-chat-7B-v3-3-GGUF neural-chat-7b-v3-3.Q4_K_M.gguf --local-dir .`; then use your favorite model runner (e.g. llama.cpp) to run it.
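If you'd rather drive the same GGUF from Python instead of the llama.cpp CLI, the llama-cpp-python bindings are a thin wrapper over it. A minimal sketch (the model path just assumes the download command above, and the prompt format is illustrative):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load the quantized GGUF downloaded above.
llm = Llama(
    model_path="./neural-chat-7b-v3-3.Q4_K_M.gguf",
    n_ctx=4096,       # context window
    use_mlock=True,   # pin the weights in RAM so they don't get paged out
)

out = llm("### User:\nExplain what Q4_K_M quantization means.\n### Assistant:\n",
          max_tokens=200)
print(out["choices"][0]["text"])
```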
Don’t entirely understand Stability’s business model. They’ve been putting out a lot of models recently and Stable Diffusion was novel at the time, but now their models consistently seem to be somewhat second rate compared to other things out there. For example Midjourney now seems to have far surpassed them in the image generation front. After raising a ton of funding Stability seems to just be throwing a bunch of stuff out there that’s OK but no longer ground breaking. What am I missing?
Many other startups in the space will likely face similar issues given the rapid commoditization of these models and the underlying tech. It's very easy to spend a fortune building a model that offers a short-lived incremental improvement at best, before one can just swap it out for something else someone else paid to train.
Vouch, I finally tried Stable LM 3B Zephyr today and I'm stunned this slipped by. It's the only model I've tried that's not Mistral 7B that can do RAG. And it can run on ~any consumer-grade hardware released in the last 3 years. I'm literally stunned it's been sitting out since December 8th. I've heard 10x more about Phi-2 than it, and I'm not sure why.
(Official ONNX version, please!! Then you get Transformers.js / web / I can deploy on every platform from Web to iOS to Windows)
Re: art, Dalle-3 costs significantly more. XL costs are 1/5th of what they were at launch: $0.0002/image versus Dalle-3's $0.04. And you'd be surprised how often people are happy with XL -- Dalle-3's marginal advantage is mostly text, especially given the excessive filtering of stylistic stuff and the forced prompt rewrites.
I use Stable Diffusion family models for innovative art products.
On a small scale, you have to professionalize ComfyUI’s development. My PR to make it installable and to make a plugin ecosystem that makes sense should not be sitting unmerged (https://github.com/comfyanonymous/ComfyUI/pull/298).
On a medium scale, CLIP is holding you back. I would eagerly buy a 48GB card to accommodate a batch-size-1, gradient-checkpointed, LoRA-trainable model with T5 for conditioning. I want PixArt-α or DeepFloyd/IF with the SDXL dataset and training. I get that I can achieve so much with SDXL on 24GB, including (just barely) a fine-tune, and I understand the engineering decisions here, but it's too weak on prompts.
On a large scale, I’m willing to spend a little money up front. In those conditions you can be far more innovative, you don’t have to make everything for $0. Shane Carruth didn’t make Primer for $0. I’m sure you’ve seen this movie, you get how astoundingly good it is. But he still spent something. He spent only slightly more than an RTX 6000 Ada.
Innovators have budgets. It’s still worth releasing the most powerful possible model for expensive hardware, this is why everyone is talking about Mixtral, but it’s especially true of visual art.
Setting aside that I've tried both (we'll bore each other to death if we just assert one is better):
From first principles, Phi-2 is extremely unlikely to be better: it's a base model and doesn't know how to chat (see the README on the HF repo, and also "Responses by phi-2 are off, it's depressed and insults me for no reason whatsover?", https://huggingface.co/microsoft/phi-2/discussions/61).
Phi-2's license just changed and you still need to finetune it yourself. $20/month is more than reasonable for commercial use IMHO, it's a game changer.
Until I can use a truly* chat finetuned Phi-2, StableLM remains a clear winner in my experience. It can do RAG, the only other small model I've seen do that is Mistral 7B, and Phi-2 acts like PaLM acted when I would play around with it internally at Google, when it was just a base model. Impossible to use but fun toy.
* there are a couple of others out there, but they don't seem to have enough fine-tuning... yet
Yeah, Phi-2 is weird on chat, StableLM beats it on some metrics, Phi-2 does on others but also doesn't really have system integration yet.
The base model of StableLM 3B Zephyr is actually under an even more permissive license (we didn't change it retroactively) and is the best base to train on for MacBooks with 8GB RAM, edge devices, etc.
With LLM Farm, quantised, you can run it faster than you can read on an iPhone or whatever.
Midjourney is decidedly underwhelming if you've spent any time using the expansive tooling and control nets of Stable Diffusion. Yes, it's easy to get impressive first gens with MJ, but all of the coolest work and integration happening is using SD.
It depends. For great looking pics that you need to get out quickly MJ does a great job. Especially with its image + text feature. Dalle is also an interesting choice.
SDXL and ControlNet are odd a lot of the time. SD 1.5 + ControlNet still seems to give quicker and better results.
Basically, SD at least seems to be for when you want unique content; MJ/Dalle for everything else.
Fooocus is definitely more beginner friendly. It does a lot of the prompt engineering for you. Automatic1111 has a ton of plugins, most notably ControlNet, which gives you fine-grained control over the images, but there is a learning curve.
> For example Midjourney now seems to have far surpassed them in the image generation front
Nope. Stable Diffusion with alternative models offers far more customization and control than Midjourney. Midjourney is good for beginners but sucks for experts.
We use SD at work because we need more control over the image generation pipeline (and to a lesser extent don’t want extra latency from web APIs).
Believe it or not, generating a full image from a prompt is a small slice of the image generation pie. Highly tuned in-painting is key to a number of budding startups.
Midjourney has better quality but does not offer any control. The community has done, and is still doing, a lot with SD models because they can be played with and tinkered with in any way anyone wants.
The open space on small models is a whole other developing angle, but I was referring to the general commoditization of a lot of these models. With rare exceptions, it seems the post-launch lifespan of any of these models is rather limited. From a business standpoint that sort of scenario is generally very unattractive, and thus I was trying to understand whether they have some other angle they're trying to play here to make a viable business out of this. Or the business model can just be: get acquired before that matters and let it be someone else's problem to figure out.
>Stable Diffusion was novel at the time, but now their models consistently seem to be somewhat second rate compared to other things out there. For example Midjourney now seems to have far surpassed them in the image generation front.
This isn't entirely true; after the fumble that was SD2 they shipped SDXL and SDXL Turbo, which are both excellent. And in real-world results Midjourney doesn't just straight-out outperform them; it's a lot more complex, and ultimately SDXL is the more powerful tool.
Definitely found the LLMs underwhelming, and Stable Audio's launch was poor, but I don't think Midjourney has outright surpassed them on image gen.
For what I'd call "art", or at least artsy works, or anything I want to iterate on (using inpainting and redraws), I use Stable Diffusion. But if I just want some dumb silly picture to send to a friend, I've found myself using DALL-E almost every single day. It's just so easy, and in 4 images it'll almost always get pretty close to what I'm describing. I'm constantly sending my friends dumb pictures because it's really funny and gets a laugh out of people.
That said it was super cool the time I trained a model on my friends selfies and made her into her D&D character. She was super excited about it, made me feel like a real life wizard.
>Midjourney now seems to have far surpassed them in the image generation front.
What? Have you actually used either? MJ is just an ultra-fine-tuned model with a few layers to prevent stuff from looking bad. Stable Diffusion has their own 'single shot' version, maybe someone remembers it; I played with it for 1-2 hours. Everything looks great, but I want hyper-specific stuff in my art and I'm never getting that with one-shots.
Heck, I did a few flyers and used some icons I made with img2img + inpainting + controlnet. The work is completely stunning and scalable. That is never happening even at an individual level with MJ.
> This model is included in our new Stability AI Membership. Visit our Membership page to take advantage of our commercial Core Model offerings, including SDXL Turbo & Stable Video Diffusion.
what exactly is the license lol. can people use this, or is this "see, don't touch"?
It's free for noncommercial use. If you use it in your company, your company should pay the membership fee. AFAIK most OpenAI competitors also use similar usage restrictions (e.g. free for noncommercial or research use, contact us for a commercial license).
That's not the way legal cases went. Indeed, they went all over the place.
That's the reason you see "IANAL" disclaimers all over the internet. Legal advice from non-lawyers can be problematic in many ways. In some jurisdictions, although not where I live, you can even go to jail for giving bad legal advice without being a licensed lawyer.
There are a growing number of open source options out there. I was playing with Simon Willison's excellent llm CLI tool this morning and tried out some models from the gpt4all project. Some of the better ones come from companies like Mistral, which release their models under the Apache license.
Gpt4all has a UI as well that you can use with models running locally on your laptop.
That is fantastic. I'm building a small macOS SwiftUI client with llama.cpp built in, no server-client model, and it's already so useful (and fast) with models like OpenHermes chat 7B.
If this opens it to smaller laptops, wow!
We truly live in crazy time. The rate of improvement in this field is off the walls.
Not sure if this is where your head is, but I think there's a lot of value in integrating LLMs directly into complex software. Jira, Salesforce, maybe K8s - should all have an integrated LLMs that can walk you through how to perform a nuanced task in the software.
IMO, for many real business use cases, the hallucinations are still a big deal. Once we have models that are more reliable, I think it makes sense to go down that path - the AI is the interface to the software.
But until we're there, a system that just provides guidance that the user can validate is a good stepping stone - and one I suspect is immediately feasible!
A beginner tutorial is also not used frequently by users, but that doesn't make it a bad investment. If an LLM can help a lot with getting familiar with the tool, it could be pretty valuable, especially after a UI rework etc.
That sounds awesome! Can you share any details about how you're working with llama.cpp? Is it just via the Swift <> C bridge? I've toyed with the idea of doing this, and wonder if you have any pointers before I get started.
I've got a machine with 4 3090s-- Anyone know which model would perform the best for programming? It's great this can run on a machine w/out a graphics card and is only 3B params, but I have the hardware. Might as well use it.
Try Mixtral 8x7B, which some human evals place above GPT-3.5. You also have enough VRAM and compute to make training a LoRA worthwhile, or at least interesting, either on your own dataset or on one of the freely available datasets on Hugging Face.
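If you do go the LoRA route, the adapter setup with Hugging Face's peft library is only a few lines. A hedged sketch (ranks and target modules are common defaults, not something tuned for Mixtral specifically):

```python
# pip install transformers accelerate bitsandbytes peft
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit so it fits more comfortably across 4x 24GB cards.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # QLoRA-style prep for 4-bit weights

lora = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a tiny fraction of the 8x7B weights train
```

From there you hand `model` to your usual training loop with whichever dataset you pick.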
AFAIK the deepseek coder family are the best open coding models.
I haven't tested it, but I think deepseek coder 33b can run on a single RTX 3090 when 4-bit quantized. In your case you might be able to run the non-quantized version.
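If you try that, a minimal 4-bit loading sketch with transformers + bitsandbytes (the repo id is the official instruct model; settings are illustrative, untested on your exact box):

```python
# pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/deepseek-coder-33b-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit weights, roughly 17-18 GB
    device_map="auto",   # shards layers across however many GPUs you have
)

prompt = "Write a Python function that parses an ISO 8601 timestamp."
inputs = tok(prompt, return_tensors="pt").to("cuda")
print(tok.decode(model.generate(**inputs, max_new_tokens=256)[0], skip_special_tokens=True))
```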
I've seen the CanAiCode leaderboard several times (and used many of the models listed), but I wouldn't use it to pick a model. It's not a bad list, but the benchmark is too limited. The results are not accurately ranked from best to worst.
For example the deepseek 33b model is ranked 5 spots lower than the 6.7b model, but the 33b model is definitely better. WizardCoder 15b is near the top while WizardCoder 33b is ranked 26 spots lower, which is a wildly inaccurate ranking.
It's worth noting that those 33b models score in the 70s for HumanEval and HumanEval+ while the 15b model scores in the 50s.
I did! I started by going to vast.ai, where I was able to look at the specs of the top-scoring machines. I started with the motherboard (as I knew it could support my 3090s, since some PCIe buses can't handle all that data), then of course copied everything else that I could. I ended up using PCIe extenders and zip-tying the cards (with plastic ties; I should use metal zip ties instead) to a rack I got from Lowes. I'm not too pleased with how it looks, but it works!
BTW, depending on where you're at in your ML journey, Jeremy Howard from FastAI says you should focus on using hosted instances like Paperspace until you really need your own machine. Unless, of course, you enjoy Linux sysadmin tasks. :) It can get really annoying trying to match the right version of CUDA with the version of PyTorch you need for the newest model you're trying to run.
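A quick sanity check worth running after every environment rebuild, just to confirm the PyTorch wheel and the CUDA runtime actually agree (nothing clever):

```python
import torch

print(torch.__version__)            # e.g. "2.1.2+cu121"; the +cuXXX suffix is the wheel's CUDA build
print(torch.version.cuda)           # CUDA version the wheel was compiled against
print(torch.cuda.is_available())    # False usually means a driver/wheel mismatch
print(torch.cuda.device_count())    # should report all of your GPUs
```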
afaik "edge" nearly always means taking place on the device a user is interacting with; no server involved, except perhaps for authentication etc. But there is probably some other situation where "edge" could mean local infra or caching.
I've definitely seen edge computing used in the context of IoT to refer to compute done on sensing devices. The less narrow meaning at least isn't really "new".
I was able to run this model in LM Studio (http://lmstudio.ai) as well. Just remove "Compatibility Guess" in Filters so you can see all the models; LM Studio can then load it and run requests against it.
I've been experimenting with code-llama extensively on my laptop, and from my experience, it seems that these models are still in their early stages. I primarily utilize them through a Web UI, where they can successfully refactor code given an existing snippet. However, it's worth noting that they cannot currently analyze entire codebases or packages, refining them based on the most suitable solutions using the most appropriate algorithms. While these models offer assistance to some extent, there is room for improvement in their ability to handle more complex and comprehensive coding scenarios.
I think there is a decent chance SourceGraph will figure this all out. The most important thing at this point is figuring out what context to feed. They can build up a nice graph of a codebase, and I expect from there they can put in the best context and then, boom.
They might also be able to train a model more intelligently by generating training data from said graphs.
I'm honestly failing to see the utility for LLMs, because the context for any given problem is far too small, and we're already at 33B parameter models. They just don't seem to be a technology that scales to an interesting problem size.
A 3B tiny model is not going to compare to GitHub copilot. However, there are plenty of nice 7B models that are excellent at code and I encourage you to try them out.
If you just want to get stuff done, use the best tools, like a Milwaukee drill - and right now, that's Copilot/GPT-4.
If you don't want to be tied to a company and like open source, feel free to connect a toy motor to an AA battery to drill your holes... or to use Llama/Stable Code 3B.
OpenAI just invisibly dropped my API requests to a lower model with a 4k context limit, and my commit scripts started failing for being over the context limit. It's buried in the docs somewhere that low-tier API users will be served by lower models during peak times.
So, I guess they're like a Milwaukee drill that will sometimes refuse to work unless you buy more drill credits.
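One stopgap is to count tokens before sending so the script truncates the diff instead of erroring out. A rough sketch with tiktoken (the 4,096 figure is just whatever limit you're actually being served; function name is made up):

```python
# pip install tiktoken
import tiktoken

def fit_to_context(text: str, limit: int = 4096, reserve: int = 512) -> str:
    """Truncate text so prompt + completion stays under the context limit.

    `reserve` leaves room for the model's reply; the numbers are illustrative.
    """
    enc = tiktoken.get_encoding("cl100k_base")   # encoding used by the gpt-3.5/4 chat models
    tokens = enc.encode(text)
    budget = limit - reserve
    return text if len(tokens) <= budget else enc.decode(tokens[:budget])
```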
You clearly have never used these other tools. Mixtral / Deepseek perform very well on coding challenges. I've used them against local code without issues; sometimes they are a bit optimistic and produce too much, but that's far better than producing too little (like GPT-4 does).
Given the complete failure of the first StableLM, I'm interested to try this one out. I haven't really seen a small language model, except Mistral 7B, that's really useful for much.
I also hope stability comes out with a competitor to the new midjourney and dalle models! That's what put them on the map in the first place
All of the Mistral versions have been excellent, including the OpenHermes versions. I encourage you to check out Phi-2 as well, it's the only 3b model I've found really quite interesting outside of Replit's code model built into Replit Core.
It's amazing to see more smaller models being released. This creates opportunities for more developers to run it on their local computers, and makes it easier to fine-tune for specific needs.
Seems like they caught the Apple Marketing bug and are chasing things no one cares about. Great 3B model; everyone is already running 7B models over here.
Maybe one day when I need to do offline coding on my cellphone, it will be really useful.
Does anyone have recommendations for add-ins to integrate these 'smaller' LLMs into an IDE like VSCode? I'm pretty embedded with GH Copilot, but curious to explore other options.
> This model is included in our new Stability AI Membership. Visit our Membership page to take advantage of our commercial Core Model offerings, including SDXL Turbo & Stable Video Diffusion.
A hypothetical Stable Code 13B/70B could be hosted only, with more languages or specialized use-cases (Stable Code 3B iOS-Swift-Turbo)
Agreed, and not only do they not compare their model to Phi-2 directly, the benchmarks they report don't overlap with the ones in the Phi-2 post[1], making it hard for a third party to compare without running benchmarks themselves.
(In turn, in the Phi-2 post they compare Phi-2 to Llama-2 instead of CodeLlama, making it even harder)
I just tried this model with Koboldcpp on my LLM box. I got gibberish back.
My prompt - "please show me how to write a web scraper in Python"
The response?
> I've written my first ever python script about 5 months ago and I really don't remember anything except for the fact that I used Selenium in order to scrape websites (in this case, Google). So you can probably just copy/paste all of these lines from your own Python code which contains logic to determine what value should be returned when called by another piece of software or program.
It's quite amazing - I often read quite positive comments about LLM tools for coding. Yet an "Ask HN" I posted a while ago (which admittedly didn't gain much traction) drew mostly negative/pessimistic responses.
You got two positive and two negative responses. You replied only to the negative responses. Now you think that the responses were mostly negative. I blame salience bias.
Anyways, there's also a difference between "are you excited about this new thing becoming available" and "now that you've used it, do you like the experience". The former is more likely to feature rosy expectations and the latter bitter disappointment. (Though it could also be the other way around, with people dismissing it at first and then discovering that it's kind of nice actually.)
If somebody can show me a coding task that LLMs have successfully done that isn't an interview question or a documentation snippet, I might start to value it.
Spending a huge amount of resources to be a bit better at autocompleting code doesn't have value to me. I want it to solve significant problems, and it looks like it can't, and scaling it up until it can is totally impractical.
> In aggregate, training all 9 Code Llama models required 400K GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W).
That is:
* 45⅔ GPU years
* 160 MWh or...
* 45 average UK homes' annual electricity consumption
* 18 average US homes'
* 64 average drivers' annual mileage in an EV.
...and that's just the GPUs. Add on all the rest of the system(s).
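The arithmetic, for anyone who wants to poke at the assumptions (the per-home figures are rough averages, back-solved to match the numbers above):

```python
gpu_hours = 400_000
tdp_watts = 400                              # top of the quoted 350-400 W range

gpu_years = gpu_hours / (24 * 365)           # ~45.7 GPU-years
energy_mwh = gpu_hours * tdp_watts / 1e6     # Wh -> MWh, ~160 MWh

uk_home_kwh = 3_550                          # rough annual figure implied by the "45 homes" line
us_home_kwh = 8_900                          # rough annual figure implied by the "18 homes" line

print(round(gpu_years, 1), round(energy_mwh))     # 45.7, 160
print(round(energy_mwh * 1000 / uk_home_kwh))     # ~45 UK homes
print(round(energy_mwh * 1000 / us_home_kwh))     # ~18 US homes
```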
In the grand scheme of things it's ancient history, but https://code-as-policies.github.io/ works by generating code then executing it. That's worth running at. The code generation in that paper was done on code-davinci-002, which is (or rather was - it's deprecated) a 15B GPT-3 model. I've not done it yet, but I'd expect the open source 7B code completion models to be able to replicate it by now.
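The core loop in that paper is conceptually tiny; a very hedged sketch (nothing like their actual prompting setup, and `complete` stands in for whatever code model you run):

```python
def code_as_policy(instruction: str, complete, api_namespace: dict) -> None:
    """Generate code with an LLM, then execute it against a provided API.

    `complete(prompt) -> str` is any completion backend (local 7B, hosted, ...);
    `api_namespace` maps names to the robot/environment primitives the code may call.
    """
    prompt = (
        f"# You may call: {', '.join(api_namespace)}\n"
        f"# Write Python code to: {instruction}\n"
    )
    generated = complete(prompt)
    exec(generated, dict(api_namespace))   # the generated code itself is the "policy"
```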