Stable Code 3B: Coding on the Edge (stability.ai)
315 points by egnehots on Jan 16, 2024 | 140 comments



Note that they don't compare with deepseek coder 6.7b, which is vastly superior to much bigger coding models. Surpassing codellama 7b is not that big of a deal today.

The most impressive thing about these results is how good the 1.3B deepseek coder is.


Deepseek Coder Instruct 6.7b has been my local LLM (M1 series MBP) for a while now and that was my first thought… They selectively chose benchmark results to look impressive (which is typical).

I tested out StableLM Zephyr 3B when that came out and it was extremely underwhelming/unusable.

Based on this, Stable Code 3B doesn’t look to be worth trying out. Guessing if they could have put out a 7B model which beat Deepseek Coder 6.7B, they would have.


Do you know how Deepseek 33b compares to 6.7b? I'm trying 33b on my (96GB) MacBook just because I have plenty of spare (V)RAM. But I'll run the smaller model if the benefits are marginal in other peoples' experience.


The smaller model is great at trivial day-to-day tasks.

However, when you ask hard things, it struggles; you can ask the same question 10 times, and only get 1 answer that actually answers the question.

...but the larger model is a lot slower.

Generally, if you don't want to mess around swapping models, stick with the bigger one. It's better.

However, if you are heavily using it, you'll find the speed is a pain in the ass, and when you want a trivial hint like 'how do I do a map statement in kotlin again?', you really don't need it.

What I have setup personally is a little thumbs-up / thumbs-down on the suggestions via a custom intellij plugin; if I 'thumbs-down' a result, it generates a new solution for it.

If I 'thumbs-down' it twice, it swaps to the larger model to generate a solution for it.

This kind of 'use ok model for most things and step up to larger model when you start asking hard stuff' approach scales very nicely for my personal workflow... but, I admit that setting it up was a pain, and I'm forever pissing around with the plugin code to fix tiny bugs, time I would prefer to spend doing actual work.
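Roughly, the escalation rule is just a counter check; something like this sketch (the model names and the `ask_model` helper are placeholders, the real thing lives inside the IntelliJ plugin):

  # Sketch of the two-tier idea: regenerate with the small model,
  # escalate to the big one after a second thumbs-down.
  SMALL, LARGE = "deepseek-coder:6.7b", "deepseek-coder:33b"  # placeholder names

  def next_suggestion(prompt, thumbs_downs, ask_model):
      # ask_model is a stand-in for whatever client talks to the local models
      model = SMALL if thumbs_downs < 2 else LARGE
      return ask_model(model, prompt)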

So... there's not really much tooling out there at the moment to support it, but the best solution really is to use both.

If you don't want to and just want 'use the best model for everything', stick with the bigger one.

The larger model is more capable of turning 'here is a description of what I want' into 'here is code that does it that actually compiles'.

The smaller model is much better at 'I want a code fragment that does X' -> 'rephrased stack overflow answer'.


> but the larger model is a lot slower.

I found the performance to be very acceptable for 33b 4 bit on a m3 max with 36gb ram (much faster than reading speed)


I’m not sure what to say; responsive fast output is ideal, and the larger model is distinctly slower for me, particularly for long completions (2k tokens) if you’re using a restricted grammar like json output.

I’m using an M2 not an M3 though; maybe it’s better for you.

I was under the impression quantised results were generally slower too, but I’ve never dug into it (or particularly noticed a difference between q4/q5/q6).

If you find it fast enough to use then go for it~


Do you mind sharing your plugin as a gist?

How do you run both models in memory? Two separate processes?


You would want to test it out manually day to day. That’s always the best. Some models can outscore but not actually be “better” when you use them.

But there is also the benchmarking: https://github.com/deepseek-ai/deepseek-coder

33B Instruct doesn’t beat 6.7B Instruct by much but maybe those % improvements mean more for your usage.

I run 6.7B since I have 16GB RAM.

Quantization of the model also makes a difference.


Do you use it inside vscode or how do you integrate an LLM into your IDE?


How do you make use of it? Do you have it integrated directly into an ide?


What do you use it for?


A Golang coding buddy for grinding Leetcode, and a general coding buddy / PR reviewer.

The results are on par or better than ChatGPT 3.5.

I often use it to delve deeper, asking things like “is there an alternative way to write this?” or “how does this code look?”

If you have an M-series Mac I recommend trying out LM Studio. Really eye opening and I’m excited to see how things progress.


I have GitHub copilot. Is it better than that? And if so, in which way?

Offline would be one for sure. Cost is another. What else?


I’ve never used GH Copilot so can’t comment on that.

But having everything locally means no privacy or data leak issues.


Deepseek-coder-6.7B really is a quite surprisingly capable model. It's easy to give it a spin with ollama via `ollama run deepseek-coder:6.7b`.


Thanks for the tip with ollama


If you do:

1. ollama run deepseek-coder:6.7b

2. pip install litellm

3. litellm --model deepseek-coder:6.7b

You will have a local OpenAI compatible API for it.
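Then you can point any OpenAI client at it; something like this (the port is whatever litellm prints on startup, 8000 here is just an assumption, and the API key can be any string):

  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000", api_key="anything")
  resp = client.chat.completions.create(
      model="deepseek-coder:6.7b",
      messages=[{"role": "user", "content": "Write a function that reverses a string."}],
  )
  print(resp.choices[0].message.content)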


ollama is actually not a great way to run these models as it makes it difficult to change server parameters and doesn't use `mlock` to keep the models in memory.


What do you suggest?


vanilla llama.cpp (run its `./server` binary)
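The server keeps the model loaded (and can `--mlock` it) and exposes a small HTTP API; querying it looks roughly like this, assuming the default port 8080 and a plain completion-style request:

  import requests

  # llama.cpp server started with something like:
  #   ./server -m deepseek-coder-6.7b.Q4_K_M.gguf --mlock --port 8080
  resp = requests.post(
      "http://localhost:8080/completion",
      json={"prompt": "# python function to reverse a string\ndef ", "n_predict": 128},
  )
  print(resp.json()["content"])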


The 1.3b model is amazing for real-time code completion; it's fast enough to be a better intellisense.

Another model you should try is magicoder 6.7b ds (based on deepseek coder). After playing with it for a couple weeks, I think it gives slightly better results than the equivalent deepseek model.

Repo https://github.com/ise-uiuc/magicoder

Models https://huggingface.co/models?search=Magicoder-s-ds


How do you use these models with your editor? (e.g. vscode or Emacs etc)


I run tabby [0] which uses llama.cpp under the hood and they ship a vscode extension [1]. Going above 1.3b, I find the latency too distracting (but the highest end gpu I have nearby is some 16gb rtx quadro card that's a couple years old, and usually I'm running a consumer 8gb card instead).

[0] https://tabby.tabbyml.com/

[1] https://marketplace.visualstudio.com/items?itemName=TabbyML....


Would you mind sharing your tabby invocation to launch this model?


An easy way is to use an OpenAI compatible server, which means you can use any GPT plugin to integrate with your editor


Not sure what you guys are doing with it but even at 33B it's laughably dumb compared to Mixtral.


This is phenomenal. And runs fast! The 33b version might be my MacBook's new coding daily driver.


4-bit quantized 33b runs great on an MBP with the M3 Max chip


I'm using the 5-bit quant with llama.cpp and it's excellent on my M2 96GB MacBook! Running this model + Mixtral will be fun.


How are you using it? I need to find some sane way to use this stuff from Helix/terminal..


There are many workflows, with hardware-dependent requirements. Three which work for my MacBook:

1. Clone & make llama.cpp. It's a CLI program that runs models, e.g. `./main -m <local-model-file.gguf> -p <prompt>`.

2. Another CLI option is `ollama`, which I believe can download/cache models for you.

3. A GUI like LM Studio provides a wonderful interface for configuring, and interacting with, your models. LM Studio also provides a model catalog for you to pick from.

Assuming that your hardware is sufficient, options 1 & 2 should satisfy your terminal needs. Option 3 is an excellent playground for trying new models/configurations/etc.

Models are heavy. To fit one in your silicon and run it quickly, you'll want to use a quantized model. It's a model's "distilled" version -- say 80% smaller for a 0.1% accuracy loss. TheBloke on HuggingFace is one specialist in distilling. After finding a model you like, you can download some flavor of quantization he made, e.g: `huggingface-cli download TheBloke/neural-chat-7B-v3-3-GGUF neural-chat-7b-v3-3.Q4_K_M.gguf --local-dir .`; then use your favorite model runner (e.g. llama.cpp) to run it.
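If you'd rather drive the downloaded GGUF from Python instead of the llama.cpp CLI, the llama-cpp-python bindings work too; a minimal sketch (assumes `pip install llama-cpp-python` and the file from the download example above):

  from llama_cpp import Llama

  llm = Llama(model_path="./neural-chat-7b-v3-3.Q4_K_M.gguf", n_ctx=2048)
  out = llm("Q: What is a quantized model good for?\nA:", max_tokens=128)
  print(out["choices"][0]["text"])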

Hope that gets you started. Cheers!


Don’t entirely understand Stability’s business model. They’ve been putting out a lot of models recently and Stable Diffusion was novel at the time, but now their models consistently seem to be somewhat second rate compared to other things out there. For example Midjourney now seems to have far surpassed them in the image generation front. After raising a ton of funding Stability seems to just be throwing a bunch of stuff out there that’s OK but no longer ground breaking. What am I missing?

Many other startups in the space will likely face similar issues given the rapid commoditization of these models and the underlying tech. It’s very easy to spend a fortune building a model that offers a short lived incremental improvement at best before one can just quickly swap it out for something else someone else paid to train.


Business model is bundling so you have a one stop shop for good quality models of every modality and cultural variants of them.

These go on bedrock, on chip, on prem etc and our consulting partners take them to the end user.

On the innovation side stable diffusion turbo does like 100 cats with hats per second and the video model outperforms runway, pika etc on blind tests.

Stable Audio was one of TIME's innovation of the year winners for music, and we released a SOTA 3d model.

Stable LM Zephyr is the best 3b chat model and works great on a MacBook Air.

Most of the pixels in the world will be generated, so fast, high-quality image/video are the core and these other models are there to support them.

It’s really hard to build good solid models and we are the only company that can build a model of any type for anyone.


Vouch, I finally tried Stable LM 3b Zephyr today and I'm stunned this slipped by. It's the only model I've tried that's not Mistral 7B that can do RAG. And it can run on ~any consumer-grade hardware released in the last 3 years. I'm literally stunned it's been sitting out since December 8th. I've heard 10x more about Phi-2 than it, and I'm not sure why.

(Official ONNX version, please!! Then you get Transformers.js / web / I can deploy on every platform from Web to iOS to Windows)

re: art, Dalle-3 costs significantly more. XL costs are 1/5th of what they were at launch, $0.0002/image versus Dalle-3's $0.04. And you'd be surprised how often people are happy with XL -- Dalle-3's marginal advantage is mostly text, especially with the excessive filtering of stylistic stuff, and forced prompt rewrites


I use Stable Diffusion family models for innovative art products.

On a small scale, you have to professionalize ComfyUI’s development. My PR to make it installable and to make a plugin ecosystem that makes sense should not be sitting unmerged (https://github.com/comfyanonymous/ComfyUI/pull/298).

On a medium scale, CLIP is holding you back. I would eagerly buy a 48GB card to accommodate a batch size 1, gradient checkpointed LoRA-trainable model with T5 for conditioning. I want PixArt-a or DeepFloyd/IF with the SDXL dataset and training. I get I can achieve so much with SDXL on 24GB, including just barely a fine tuning, I understand the engineering decisions here, but it’s too weak on prompts.

On a large scale, I’m willing to spend a little money up front. In those conditions you can be far more innovative, you don’t have to make everything for $0. Shane Carruth didn’t make Primer for $0. I’m sure you’ve seen this movie, you get how astoundingly good it is. But he still spent something. He spent only slightly more than an RTX 6000 Ada.

Innovators have budgets. It’s still worth releasing the most powerful possible model for expensive hardware, this is why everyone is talking about Mixtral, but it’s especially true of visual art.


Yeah there is going to be a big push into Comfy and some very interesting new models coming ^_^


(parent commenter is founder/CEO of Stability AI, Emad Mostaque, I assume)


> Stable LM zephyr is the best 3b chat model

By what measure? Phi 2 seems better as far as I can tell from benchmarks and usage, and has a much more permissive license.


Setting aside I've tried both, we'll bore each other to death if we just assert one is better:

From first principles, Phi 2 is extremely unlikely to be better, it's a base model and doesn't know how to chat. (see README on HF repo and also "Responses by phi-2 are off, it's depressed and insults me for no reason whatsover?", https://huggingface.co/microsoft/phi-2/discussions/61)

re: Benchmarks, see https://huggingface.co/stabilityai/stablelm-zephyr-3b. Phi-2 wins on some, StableLM on others. For some reason the HF and Lmsys leaderboards don't show it, and I don't know why.

Phi-2's license just changed and you still need to finetune it yourself. $20/month is more than reasonable for commercial use IMHO, it's a game changer.

Until I can use a truly* chat finetuned Phi-2, StableLM remains a clear winner in my experience. It can do RAG, the only other small model I've seen do that is Mistral 7B, and Phi-2 acts like PaLM acted when I would play around with it internally at Google, when it was just a base model. Impossible to use but fun toy.

* there are a couple of others out there, but they don't seem to have enough fine-tuning...yet


Yeah, Phi-2 is weird on chat, StableLM beats it on some metrics, Phi-2 does on others but also doesn't really have system integration yet.

The base model of StableLM 3b zephyr is actually under an even more permissive license (we didn't change it retroactively) and is the best base to train on for MacBooks with 8gb RAM, edge devices etc.

With LLM Farm, quantised, you can run it faster than you can read on an iPhone or whatever.

https://huggingface.co/stabilityai/stablelm-3b-4e1t

It's also one of the only models with full dataset, training and other transparency: https://stability.wandb.io/stability-llm/stable-lm/reports/S...


> On the innovation side stable diffusion turbo does like 100 cats with hats per second

2028: Energy use on hat-cat generation exceeds energy use on bitcoin.


There are not enough cats on the internet. We are working to fix this.


"100 cats with hats per second"

AI has peaked


Midjourney is decidedly underwhelming if you've spent any time using the expansive tooling and control nets of Stable Diffusion. Yes, it's easy to get impressive first gens with MJ, but all of the coolest work and integration happening is using SD.


It depends. For great looking pics that you need to get out quickly MJ does a great job. Especially with its image + text feature. Dalle is also an interesting choice.

SDXL and controlnet is odd a lot of the time. 1.5 + controlnet still seem to give quicker and better results.

Basically SD at least seems to be for when you want unique content. MJ/Dalle for everything else.


Forgive me for not remembering the name. Stable Diffusion also has an MJ style where you just say what you want and it makes beautiful pictures.

You can't customize basically anything but they look great.


You might be thinking of Fooocus: https://github.com/lllyasviel/Fooocus

The Stable Diffusion web interface that got a lot of people's attention originally was Automatic1111: https://github.com/AUTOMATIC1111/stable-diffusion-webui

Fooocus is definitely more beginner friendly. It does a lot of the prompt engineering for you. Automatic1111 has a ton of plugins, most notably ControlNet which gives you fine grained control over the images, but there is a learning curve.


> For example Midjourney now seems to have far surpassed them in the image generation front

Nope. Stable Diffusion with alternative models offers far more customization and control than Midjourney. Midjourney is good for beginners but sucks for experts.


We use SD at work because we need more control over the image generation pipeline (and to a lesser extent don’t want extra latency from web APIs).

Believe it or not, generating a full image from a prompt is a small slice of the image generation pie. Highly tuned in-painting is key to a number of budding startups.


Midjourney has better quality but does not offer any control. The community has done and is still doing a lot with SD models because they can be played with and tinkered with in any way anyone wants to.


> one can just quickly swap it out for something else someone else paid to train.

That doesn't seem to be the case. There are very limited open-source models outside of the small-LLM bubble.


The open space on small models is a whole other developing angle, but I was referring to the general commoditization of a lot of these models. With rare exception, after launch it seems the lifespan of any of these models is rather limited. From a business standpoint that sort of scenario is generally very unattractive, and thus I was trying to understand if they have some other angle they’re trying to play here to make a viable business out of this. Or the business model can just be to get acquired before that matters and let that be someone else’s problem to figure out.


The commoditization of models outside of LLMs is delusional. There are no comparable open source image / video models to the private ones.


>Stable Diffusion was novel at the time, but now their models consistently seem to be somewhat second rate compared to other things out there. For example Midjourney now seems to have far surpassed them in the image generation front.

This isn’t entirely true. After the fumble that was SD2 they shipped SDXL and SDXL Turbo, which are both excellent. And in real-world results Midjourney doesn’t just straight-out outperform them; it’s a lot more complex, and ultimately SDXL is the more powerful tool.

Definitely found the LLMs underwhelming and Stable Audio's launch was poor, but don't think Midjourney has outright surpassed them on image gen.


For what I’d call “art”, or at least artsy works, or anything I want to iterate on (using inpainting and redraws), I use Stable Diffusion. But if I just want a dumb silly picture to send to my friends I’ve found myself using DALL-E almost every single day. It’s just so easy and in 4 images it’ll almost always get pretty close to what I’m describing. I’m constantly sending my friends dumb pictures because it’s really funny and gets a laugh out of people.

That said it was super cool the time I trained a model on my friends selfies and made her into her D&D character. She was super excited about it, made me feel like a real life wizard.


>Midjourney now seems to have far surpassed them in the image generation front.

What? Have you actually used either? MJ is just an ultra-fine-tuned model with a few layers to prevent stuff from looking bad. Stable Diffusion has their own 'single shot' version, maybe someone remembers it, I played with it for 1-2 hours. Everything looks great, but I want hyper-specific stuff in my art and I'm never getting that with one-shots.

Heck, I did a few flyers and used some icons I made with img2img + inpainting + controlnet. The work is completely stunning and scalable. That is never happening even at an individual level with MJ.


> License: Other

> Commercial Applications

> This model is included in our new Stability AI Membership. Visit our Membership page to take advantage of our commercial Core Model offerings, including SDXL Turbo & Stable Video Diffusion.

what exactly is the license lol. can people use this or is this "see dont touch"


It's free for noncommercial use. If you use it in your company, your company should pay the membership fee. afaik most openai competitors also use similar usage restriction (e.g. free for noncommercial or research use, contact us for commercial license).


This basically means "Get sued."

There is no clear legal definition of "noncommercial," and courts have gone all sorts of different ways on what constitutes commercial use.

This is where CC NC licenses imploded. A lot of places (hello, MIT!) intentionally use CC NC licenses to make things appear more open than they are.


If you make money using it, pay them. If you're using it for free, don't worry


That's not the way legal cases went. Indeed, they went all over the place.

That's the reason you see "IANAL" disclaimers all over the internet. Legal advice from non-lawyers can be problematic in many ways. In some jurisdictions, although not where I live, you can even go to jail for giving bad legal advice without being a licensed lawyer.


There are a growing number of open source options out there. I was playing with Simon Willison's excellent llm cli tool this morning and tried out some models from the gpt4all project. One of the better ones comes from companies like Mistral, which release their models under the Apache license.

Gpt4all has a UI as well that you can use with models running locally on your laptop.


That is fantastic. I'm building a small macOS SwiftUI client with llama cpp built in, no server-client model, and it's already so useful with models like openhermes chat 7B, and fast.

If this opens it to smaller laptops, wow!

We truly live in crazy times. The rate of improvement in this field is off the walls.


Not sure if this is where your head is, but I think there's a lot of value in integrating LLMs directly into complex software. Jira, Salesforce, maybe K8s - should all have an integrated LLMs that can walk you through how to perform a nuanced task in the software.


Imagine good error messages, with hints for mitigation and maybe smart retry w/ mitigations applied.


Why would the LLM walk you through and not just do the nuanced task on its own?


I assume the human maintains some of the necessary context in their meat memory.


IMO, for many real business use cases, the hallucinations are still a big deal. Once we have models that are more reliable, I think it makes sense to go down that path - the AI is the interface to the software.

But until we're there, a system that just provides guidance that the user can validate is a good stepping stone - and one I suspect is immediately feasible!


A walkthrough is generally performed once or not very frequently. It would be a bad investment if you used it for just this use case.


A beginner tutorial is also not used frequently by users, but that doesn't make it a bad investment. If an LLM can help a lot with getting familiar with the tool it could be pretty valuable, especially after a UI rework etc.


3b is good for 8gb MacBook Air etc. 7b is slightly too big.

Sure these will continue to improve, phi2 is a good base as well


That sounds awesome! Can you share any details about how you're working with llama cpp? Is it just via the Swift <> C bridge? I've toyed with the idea of doing this, and wonder if you have any pointers before I get started.


I've got a machine with 4 3090s-- Anyone know which model would perform the best for programming? It's great this can run on a machine w/out a graphics card and is only 3B params, but I have the hardware. Might as well use it.


Try Mixtral 8x7B, which some human evals place above GPT-3.5. You also have enough VRAM and compute to make training a LoRA worthwhile, or at least interesting, either on your own dataset or on one of the freely available datasets on huggingface.
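If you go the LoRA route, a rough sketch with transformers + peft looks like the below; the hyperparameters are illustrative, and in bf16 the full Mixtral won't fit across 4x24 GB for training, so in practice you'd load it quantized (e.g. 4-bit):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import LoraConfig, get_peft_model

  model_id = "mistralai/Mixtral-8x7B-v0.1"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  # device_map="auto" shards the layers across the four GPUs
  model = AutoModelForCausalLM.from_pretrained(
      model_id, torch_dtype=torch.bfloat16, device_map="auto"
  )

  lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
  model = get_peft_model(model, lora)
  model.print_trainable_parameters()  # only the adapter weights are trainable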


AFAIK deepseek coder family are the best open coding models.

I haven't tested, but I think deepseek coder 33b can run on a single RTX 3090 when 4-bit quantized. In your case you might be able to run the non-quantized version
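Back-of-envelope on the 4-bit claim (weights only; the KV cache and runtime overhead come on top, so treat it as a lower bound):

  params = 33e9
  bytes_per_weight = 0.5                     # ~4 bits per weight for a 4-bit quant
  print(params * bytes_per_weight / 2**30)   # ~15.4 GiB of weights, vs. 24 GB on an RTX 3090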


The coding models are all small because speed is crucial. If you need to wait 2 seconds for an autocomplete it becomes near useless.


Here is a leaderboard of some models

https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul...

Don't know how biased this leaderboard is, but I guess you could just give some of them a try and see for yourself.


This is a much better leaderboard: https://evalplus.github.io/leaderboard.html

I've seen the CanAiCode leaderboard several times (and used many of the models listed), but I wouldn't use it to pick a model. It's not a bad list, but the benchmark is too limited. The results are not accurately ranked from best to worst.

For example the deepseek 33b model is ranked 5 spots lower than the 6.7b model, but the 33b model is definitely better. WizardCoder 15b is near the top while WizardCoder 33b is ranked 26 spots lower, which is a wildly inaccurate ranking.

It's worth noting that those 33b models score in the 70s for HumanEval and HumanEval+ while the 15b model scores in the 50s.


Thanks


Did you build a machine with 4x 3090? I'm looking for a way to build such a machine for ML training.


I did! I started by going to vast.ai. I was able to look at the specs of the top-scoring machines. I started with the motherboard (as I knew it could support my 3090s, because some PCIe busses can't handle all that data). Then of course I copied everything else that I could. I ended up using PCIe extenders and zip-tieing (plastic, I should use metal zip ties instead) the cards to a rack I got from Lowes. I'm not too pleased with how it looks, but it works!

BTW, depending on where you're at in your ML journey, Jeremy Howard from FastAI says you should focus more on using hosted instances like Paperspace until you really need to get your own machine. Unless, of course, you enjoy linux sysadmin tasks. :) It can get really annoying trying to match the right version of CUDA with the version of PyTorch you're trying to get running for the newest model you're trying to run.
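A quick sanity check that the installed PyTorch build actually matches the CUDA setup and sees every card:

  import torch

  print(torch.__version__, torch.version.cuda)    # the CUDA version this build was compiled against
  print(torch.cuda.is_available(), torch.cuda.device_count())
  for i in range(torch.cuda.device_count()):
      print(torch.cuda.get_device_name(i))        # should list all four 3090s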


Here are the parts I can find:

CPU: https://www.newegg.com/amd-epyc-7252-socket-sp3/p/N82E168191...
Motherboard: https://www.newegg.com/asrock-rack-romed8-2t/p/N82E168131400...
SSD: https://www.newegg.com/samsung-2tb-980-pro/p/N82E16820147796...
Computer Power Supply (the GPUs have their own power supplies, run apart from the computer, which I'm told is bad, but... seems to work for years now for me): https://www.newegg.com/corsair-hx-series-hx1200-cp-9020140-n...
PCIe 4.0 x16 Risers: https://www.newegg.com/p/N82E16812987068?Item=N82E1681298706...
Tower: https://www.newegg.com/black-fractal-design-meshify-2-compac...
RAM: https://www.newegg.com/nemix-ram-16gb-288-pin-ddr4-sdram/p/1...

Total, without GPUs and their power supplies: $2900


Thanks man! Super helpful.


Wondering the same here


How are people using codellama and this in their workflows?

I found one option: https://github.com/xNul/code-llama-for-vscode

But I'm guessing there are others, and they might differ in how they provide context to the model.


Ellama for Emacs looks promising, but I only tried to install it this morning.


Jargon naivete question: doesn't "on the edge" normally imply the server side with minimal router hops to the client, not the client side?


afaik "edge" nearly always means taking place on the device a user is interacting with. no server involved except perhaps as authentication etc. but there is probably some other situation where "edge" could mean local infra or caching.


I think the etymology of “edge computing” is derived from “network edge”, ie the outer shell of some network/autonomous system.

The closest point within your control that interfaces with devices outside of your control.

Seeing the term get used to describe client devices themselves kinda muddies the terminology.


I've definitely seen edge computing used in the context of IoT to refer to compute done on sensing devices. The less narrow meaning at least isn't really "new".


Agreed.

(autocorrect style typo above: etymology)


fixed


+1 yes, for a service using network caching like using Cloudflare. I would've referred to their CDN as the Edge of our network.


I was able to run this model in http://lmstudio.ai as well. Just remove Compatibility Guess in Filters, so you can see all the models. LM Studio can load it and run requests against it.


I've been experimenting with code-llama extensively on my laptop, and from my experience, it seems that these models are still in their early stages. I primarily utilize them through a Web UI, where they can successfully refactor code given an existing snippet. However, it's worth noting that they cannot currently analyze entire codebases or packages, refining them based on the most suitable solutions using the most appropriate algorithms. While these models offer assistance to some extent, there is room for improvement in their ability to handle more complex and comprehensive coding scenarios.


I think there is a decent chance SourceGraph will figure this all out. The most important thing at this point is figuring what context to feed. They can build up a nice graph of a codebase and I expect from there they can put in the best context and then boom.

They might also be able to train a model more intelligently by generating training data from said graphs.


> I've been experimenting with code-llama extensively on my laptop

You can use/try code-llama with Cody https://sourcegraph.com/blog/cody-vscode-1.1.0-release#:~:te...


I'm honestly failing to see the utility for LLMs, because the context for any given problem is far too small, and we're already at 33B parameter models. They just don't seem to be a technology that scales to an interesting problem size.


How is this compared to the current GitHub Copilot?


A 3B tiny model is not going to compare to GitHub copilot. However, there are plenty of nice 7B models that are excellent at code and I encourage you to try them out.


If you just want to get stuff done, use the best tools like a Milwaukee Drill - and right now, thats copilot/gpt-4.

If you don't want to be tied to a company and like opensource, feel free to connect a toy motor to an AA battery to drill your holes... Or to use Llama/Stable Code 3B.


OpenAI just invisibly dropped my API requests to a lower model with a 4k context limit. And my commit scripts started failing for being over the context limit. It's buried in the docs somewhere that low-tier API users will be served on lower models during peak times.

So, I guess they're like a Milwaukee Drill that will sometimes refuse to work unless you buy more drill credits.


WTF? Do you have a link? I was not aware of this, it would be crazy if true.


More like a Milwaukee drill you have on loan that can be swapped out for a manual screwdriver without warning.


You clearly have never used these other tools. Mixtral / Deepseek perform very well on coding challenges. I've used them against local code without issues, sometimes they are a bit optimistic and produce too much, but thats far better than producing too little (like GPT4 does).


A self-hosted solution is a common requirement for security reasons.


it’s going to be real hard to pry the carburetors out of this guy’s cold dead hands!


FYI: This model is already available on Ollama.


How do you check that?



Given the complete failure of the first stable lm, I'm interested to try this one out. Haven't really seen a small language model, except mixtral 7b that's really useful for much.

I also hope stability comes out with a competitor to the new midjourney and dalle models! That's what put them on the map in the first place


We released a competitor to Runway recently that beat it on blind tests, plus way faster image generation in SDXL Turbo

We have been working on ComfyUI for the next step and new image models

Midjourney and others are pipelines versus models so we have a higher bar to jump but the og stable diffusion team are working hard!


Deepseek coder 6.7B is very useful for coding and can run in consumer GPUs.

I use the 6bit GGUF quantized version on a laptop RTX 3070


All of the Mistral versions have been excellent, including the OpenHermes versions. I encourage you to check out Phi-2 as well, it's the only 3b model I've found really quite interesting outside of Replit's code model built into Replit Core.


It's amazing to see more smaller models being released. This creates opportunities for more developers to run it on their local computers, and makes it easier to fine-tune for specific needs.


Has anyone tried starting with a smaller model, then RLing until it matches the bigger model?


Seems like they caught the Apple Marketing bug and are chasing things no one cares about. Great 3B model, but everyone is already running 7B models over here.

Maybe one day when I need to do offline coding on my cellphone, it will be really useful.


does anyone have recommendations for addins to integrate these 'smaller' llms into an IDE like VSCode? I'm pretty embedded with GH copilot, but curious to explore other options.


Can anyone explain what’s Stability’s business model (or plan for one)?

I get why Meta releases tons of models, but still can’t quite understand what stability is trying to achieve


Seems like the standard open-core playbook:

> This model is included in our new Stability AI Membership. Visit our Membership page to take advantage of our commercial Core Model offerings, including SDXL Turbo & Stable Video Diffusion.

A hypothetical Stable Code 13B/70B could be hosted only, with more languages or specialized use-cases (Stable Code 3B iOS-Swift-Turbo)


Membership with upsell to support, custom models and more

Plus licensed variant models like Stable Audio and on-chip installation (e.g. Arm) for specialist models, e.g. Japanese law or Indonesian accounting


to be bought by meta


This is all an elaborate mating ritual


Why did the authors not compare with Phi-2?


Agreed, and not only do they not compare their model to Phi-2 directly, the benchmarks they report don't overlap with the ones in the Phi-2 post[1], making it hard for a third party to compare without running benchmarks themselves.

(In turn, in the Phi-2 post they compare Phi-2 to Llama-2 instead of CodeLlama, making it even harder)

[1]: https://www.microsoft.com/en-us/research/blog/phi-2-the-surp...


How reliable are these benchmarks?


I think the trick is that they are just comparing to other tiny models.

None of the little models, including this one, are comparable to the performance of the larger models for any significant coding problem.

I think what these are useful for is mostly giving people hints inside of a code editor. Occasionally filling in the blank.


Terrible model


I just tried this model with Koboldcpp on my LLM box. I got gibberish back.

My prompt - "please show me how to write a web scraper in Python"

The response?

> I've written my first ever python script about 5 months ago and I really don't remember anything except for the fact that I used Selenium in order to scrape websites (in this case, Google). So you can probably just copy/paste all of these lines from your own Python code which contains logic to determine what value should be returned when called by another piece of software or program.


It's very likely a "completion model" and not instruct/chat fine-tuned.

So you'd need to prompt it through comments or by starting with a function name, basically the same as one would prompt GitHub copilot.

e.g.

  # the following code implements a webscraper in python
  class WebScraper:

(I didn't try this, and I'm not good at prompting, but something along the lines of this example should yield better results)


But it's a code completion model, not a chat/instruct one.


It is weird that it is not mentioned in the model card but I'm pretty sure it is a completion model, not tuned as an instruct model.

edit: the webpage does call it "Stable Code Completion"


This doesn't seem like gibberish though?


Same thing with Ollama.


It's quite amazing - I often find that I read quite positive comments towards LLM tools for coding. Yet, an "Ask HN" I posted a while ago (and which admittedly didn't gain much traction) seemed to mirror mostly negative/pessimistic responses.

https://news.ycombinator.com/item?id=38803836

Was it just that my submission didn't find enough / more balanced commenters?


You got two positive and two negative responses. You replied only to the negative responses. Now you think that the responses were mostly negative. I blame salience bias.

Anyways, there's also a difference between "are you excited about this new thing becoming available" and "now that you've used it, do you like the experience". The former is more likely to feature rosy expectations and the latter bitter disappointment. (Though it could also be the other way around, with people dismissing it at first and then discovering that it's kind of nice actually.)


If somebody can show me a coding task that LLMs have successfully done that isn't an interview question or a documentation snippet, I might start to value it.

Spending huge amount of resource to be a bit better at autocompleting code doesn't have value to me. I want it to solve significant problems, and it's looking like it can't do it and scaling it to be able to is totally impractical.

> In aggregate, training all 9 Code Llama models required 400K GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W).

That is:

* 45⅔ GPU-years
* 160 MWh, or...
* 45 average UK homes' annual electricity consumption
* 18 average US homes
* 64 average drivers' annual mileage in an EV.

...and that's just the GPUs. Add on all the rest of the system(s).
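The arithmetic behind those figures, taking 400W as the per-GPU draw and rough averages for the baselines:

  gpu_hours = 400_000
  kwh = gpu_hours * 0.400            # 160,000 kWh = 160 MWh
  print(gpu_hours / (24 * 365))      # ~45.7 GPU-years
  print(kwh / 3_500)                 # ~45 UK homes at ~3,500 kWh/year each
  print(kwh / 2_500)                 # ~64 EV drivers at ~2,500 kWh/year of charging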


In the grand scheme of things it's ancient history, but https://code-as-policies.github.io/ works by generating code then executing it. That's worth running at. The code generation in that paper was done on code-davinci-002, which is (or rather was - it's deprecated) a 15B GPT-3 model. I've not done it yet, but I'd expect the open source 7B code completion models to be able to replicate it by now.


The precise wording matters.

How has it changed your work life leads people down the rabbit hole of will coding jobs be safe.

This one is a lot more neutral/technical.


You only got comments from six people so yeah, definitely not representative.



