The demos I see for these types of tools are always some toy project and don't reflect the day-to-day work I do at all. Do you have any example PRs on larger, more complex projects that have been written with Codebuff, and how much of that was human-interactive?
The real problem I want someone to solve is helping me with the real niche/challenging portion of a PR, ex: new tiptap extension that can do notebook code eval, migrate legacy auth service off auth0, record and replay API GET requests and replay a % of them as unit tests, etc.
So many of these tools get stuck trying to help me "start" rather than help me "finish" or unblock the current problem I'm at.
I hear you. This is actually a foundational idea for Codebuff. I made it to work within the large-ish codebase of my previous startup, Manifold Markets.
I want the demos to be of real work, but somehow they never seem as cool unless it's a neat front end toy example.
Historically, Pepsi won taste tests and people chose Coke. Because Pepsi is sweeter, so that first sip tastes better. But it's less satisfying—too sweet—to drink a whole can.
The sexy demos don't, in my opinion and experience, win over the engineers and leaders you need. Lil startups, maybe, and engineers that love the flavor of the week. But for solving real, unsexy problems—that's where you'll pull in organizations.
> The sexy demos don't, in my opinion and experience, win over the engineers and leaders you need.
Great point, we're in talks with a company and this exact issue came up. An engineer used Codebuff over a weekend to build a demo app, but the CEO wasn't particularly interested even after he enthusiastically explained what he made. It was only when the engineer later used Codebuff to connect the demo app to their systems that the CEO saw the potential. Figuring out how to help these two stakeholders align with one another will be a key challenge for us as we grow. Thanks for the thought!
> Historically, Pepsi won taste tests and people chose Coke. Because Pepsi is sweeter, so that first sip tastes better. But it's less satisfying—too sweet—to drink a whole can.
As a Pepsi drinker (Though Pepsi Max/Zero), I disagree with this. That's one interpretation, the other is the one Pepsi was gesturing at - that people prefer Coke when knowing it's Coke, because of branding, but with branding removed, prefer Pepsi.
I personally drank Coke Zero for years, always being "unhappy" when a restaurant only had Pepsi, until one day I realized I was actually enjoying the Pepsi more when not thinking about it, and that the only reason I "preferred" Coke was the brand. So I know that this story can also be true, at least on n=1 examples.
Watching the demo it seems like it would be more effective to learn the skills you need rather than using this for a decade.
It takes 5+ seconds just to change one field to dark mode. I don't even want to imagine a situation where I have two fields and I need to explain that I want to change this field and not that one.
I'm not sure who the target audience for this is. People who want to be programmers without learning programming?
> it seems like it would be more effective to learn the skills you need rather than using this for a decade.
Think of it as a calculator. You do want to be able to do addition, but not necessarily to manually add 4-digit numbers in your head.
> It takes 5+ seconds just to change one field to dark mode
Our current LLMs are way too slow for this. I am chuckling every time someone says "we don't need LLMs to be faster because people can't read faster". Imagine this using Groq with a future model with similar capability level, and taking 0.5 seconds to do this small change.
People need to remember we're at the very beginning of using AI for coding. Of course it's suboptimal for the majority of cases. Unless you believe we're way past half the sigmoid curve on AI improvements (which I don't), consider that this is the worst the AI is ever going to be for coding.
A year ago people were incredulous when told that AI could code. A year before that people would laugh you out of the room. Now we're at the stage where it kinda works, barely, sometimes. I'm bullish on the future.
In every experience I have had with LLMs generating code, they tend to follow the prompt much too closely and produce large amounts of convoluted code that in the end proves not only unnecessary but quite toxic.
Where LLMs shine is in being a personal Stack Overflow: asking a question and having a personalized, specific answer immediately, that uses one's data.
But solving actual, real problems still seems out of reach. And letting them touch my files sounds crazy.
(And yes, ok, maybe I just suck at prompting. But I would need detailed examples to be convinced this approach can work.)
I'm sure your prompting is great! It's just hard because LLMs tend to be very wordy by default. This was something we struggled with for a while, but I think we've done a good job at making Codebuff take a more minimal approach to code edits. Feel free to try it, let me know if it's still too wordy/convoluted for you.
> Do you have any example PRs on larger more complex projects that have been written with codebuff and how much of that was human interactive?
We have a lot of code in production that is AI-written. The important thing is that you need to consciously make a module or project AI-ready. This means that things like modularity and smaller files are even more important than they usually are.
I can't share those PRs, but projects on my profile page are almost entirely AI written (except the https://bashojs.org/ link). Some of them might meet your definition of niche based on the example you provided.
Kind of like "please describe the solution and I will write code to do it".
That's not how programming works.
Writing code and testing it against expectations to get to the solution, that's programming.
FWIW I don't find that I'm losing good engineering habits/thought processes. Codebuff is not at the stage where I'm comfortable accepting its work without reviewing, so I catch bugs it introduces or edge cases it's missed. The main difference for me is the speed at which I can build now. Instead of fussing over exact syntax or which package does what, I can keep my focus on the broader implications of a particular architecture or nuances of components, etc.
I will admit, however, that my context switching has increased a ton, and that's probably not great. I often tell Codebuff to do something, inevitably get distracted with something else, and then come back later barely remembering the original task.
Language is important here. Programming, at its basic definition, is just writing code that programs a machine. Software development or even design/engineering are closer to what you’re referring to.
+1; Ideally I want a tool I don't have to specify the context for. If I can point it via config files at my medium-sized codebase once (~2000 py files; 300k LOC according to `cloc`) then it starts to get actually usable.
Cursor Composer doesn't handle that and seems geared towards a small handful of handpicked files.
Would codebuff be able to handle a proper sized codebase? Or do the models fundamentally not handle that much context?
Yes. Natively, the models are limited to 200k tokens which is on the order of dozens of files, which is way too small.
But Codebuff has a whole preliminary step where it searches your codebase to find relevant files to your query, and only those get added to the coding agent's context.
That's why I think it should work up to medium-large codebases. If the codebase is too large, then our file-finding step will also start to fail.
I would give it a shot on your codebase. I think it should work.
RAG is a well-known technique now, and to paraphrase Emily Bender[1], here are some reasons why it's not a solution.
The code extruded from the LLM is still synthetic code, and likely to contain errors both in the form of extra tokens motivated by the pre-training data for the LLM rather than the input texts AND in the form of omission. It's difficult to detect when the summary you are relying on is actually missing critical information.
Even if the setup includes the links to the retrieved documents, the presence of the generated code discourages users from actually drilling down and reading them.
This is still a framing that says: Your question has an answer, and the computer can give it to you.
We actually don't use RAG! It's not that good, as you say.
We build a description of the codebase including the file tree and parsed function names and class names, and then just ask Haiku which files are relevant!
This works much better and doesn't require slowly creating an index. You can just run Codebuff in any directory and it works.
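To make the idea concrete, here's a rough sketch of that kind of file-finding step (purely illustrative, not Codebuff's actual code): walk the tree, pull top-level names with `ast`, and the resulting summary plus the user's request would then go to a small, fast model like Haiku to pick relevant files.

```python
import ast
import os

def summarize_codebase(root: str) -> str:
    """Compact repo description: file paths plus top-level function/class names."""
    lines = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):  # assumption: a Python repo; adapt per language
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    tree = ast.parse(f.read())
            except (SyntaxError, UnicodeDecodeError):
                continue
            names = [node.name for node in tree.body
                     if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
            lines.append(f"{path}: {', '.join(names) or '(no top-level defs)'}")
    return "\n".join(lines)

# This summary plus the user's request then goes to a small, fast model
# ("which of these files matter for this change?"), and only the files it
# names get loaded into the coding model's context.
print(summarize_codebase("."))
```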
It sounds like it's arguably still a form of RAG, just where the retrieval is very different. I'm not saying that to knock your approach, just saying that it sounds like it's still the case where you're retrieving some context and then using that context to augment further generation. (I get that's definitely not what people think of when you say RAG though.)
Genuine question: at what point does the term RAG lose its meaning? Seems like LLMs work best when they have the right context, and that context must be pulled from somewhere for the LLM. But if that's RAG, then what isn't? Do you have a take on this? Been struggling to frame all this in my head, so would love some insight.
RAG is a search step in an attempt to put relevant context into a prompt before performing inference. You are “augmenting” the prompt by “retrieving” information from a data set before giving it to an LLM to “generate” a response. The data set may be the internet, or a code base, or text files. The typical examples online uses an embedding model and a vector database for the search step, but doing a web query before inference is also RAG. Perplexity.ai is a RAG (but fairly good quality). I would argue that Codebuff’s directory tree search to find relevant files is a search step. It’s not the same as a similarity search on vector embeddings, and it’s not PageRank, but it is a search step.
Things that aren’t RAG, but are also ways to get a LLM to “know” things that it didn’t know prior:
1. Fine-tuning with your custom training data, since it modifies the model weights instead of adding context.
2. LoRA with your custom training data, since it adds a few layers on top of a foundation model.
3. Stuffing all your context into the prompt, since there is no search step being performed.
Gotcha – so it broadly encompasses how we give external context to the LLM. Appreciate the extra note about vector databases; that's where I've heard this term used most, but I'm glad to know it extends beyond that. Thanks for explaining!
I think parsimo2010 gave a good definition. If you're pulling context from somewhere using some search process to include as input to the LLM, I would call that RAG.
So I would not consider something like using a system prompt (which does add context, but does not involve search) to be RAG. Also, using an LLM to generate search terms before returning query results would not be RAG, because the output of the search is not input to the LLM.
I would also probably not categorize a system similar to Codebuff that just adds the entire repository as context to be RAG since there's not really a search process involved. I could see that being a bit of a grey area though.
> We build a description of the codebase including the file tree and parsed function names and class names
This sounds like RAG and also that you’re building an index? Did you just mean that you’re not using vector search over embeddings for the retrieval part, or have I missed something fundamental here?
I'm currently working on a demonstration/POC system using Elasticsearch as my content source, generating embeddings from that content, and passing them to my local LLM.
It would be cool to be talking to other people about the RAG systems they’re building. I’m working in a silo at the moment, and pretty sure that I’m reinventing a lot of techniques
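For comparison, here's a minimal sketch of that kind of pipeline, assuming an Elasticsearch index called `content` with a `body` text field (those names are placeholders) and sentence-transformers for the embeddings; the final call to the local LLM is left as a comment since clients vary.

```python
import numpy as np
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")       # assumed local cluster
model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose embedder

def retrieve(question: str, k: int = 5) -> list[str]:
    # Pull candidate documents out of Elasticsearch (index/field names are placeholders).
    hits = es.search(index="content", query={"match": {"body": question}}, size=50)["hits"]["hits"]
    docs = [h["_source"]["body"] for h in hits]
    if not docs:
        return []
    # Re-rank candidates by embedding similarity to the question.
    doc_vecs = model.encode(docs, normalize_embeddings=True)
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    order = np.argsort(doc_vecs @ q_vec)[::-1][:k]
    return [docs[i] for i in order]

question = "How do I rotate the API keys?"
context = "\n---\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# prompt then goes to the local LLM (llama.cpp, Ollama, etc.); client code omitted.
```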
I didn't mean to be down on it, and I'm really glad it's working well! If you start to reach the limits of what you can achieve with your current approach, there are lots of cute tricks you can steal from RAG, eg nothing stopping you doing a fuzzy keyword search for interesting-looking identifiers on larger codebases rather than giving the LLM the whole thing in-prompt, for example
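For example, a crude version of that keyword trick needs nothing beyond the standard library (a hypothetical helper for illustration, not anything Codebuff actually does):

```python
import os
import re

def shortlist_files(root: str, request: str, top_n: int = 10) -> list[str]:
    """Score files by how often identifier-looking words from the request appear in them."""
    # Pull camelCase / snake_case-ish tokens out of the user's request.
    keywords = {w.lower() for w in re.findall(r"[A-Za-z_][A-Za-z0-9_]{3,}", request)}
    scores = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".py", ".ts", ".go")):  # adjust per codebase
                continue
            path = os.path.join(dirpath, name)
            try:
                text = open(path, encoding="utf-8", errors="ignore").read().lower()
            except OSError:
                continue
            score = sum(text.count(k) for k in keywords)
            if score:
                scores.append((score, path))
    # Highest-scoring files are the ones worth putting in the prompt.
    return [p for _, p in sorted(scores, reverse=True)[:top_n]]
```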
I'll need to get approval to use this on that codebase. I've tried it out on a smaller open-source codebase as a first step.
For anyone interested:
- here's the Codebuff session: https://gist.github.com/craigds/b51bbd1aa19f2725c8276c5ad36947e2
- The result was this PR: https://github.com/koordinates/kart/pull/1011
It required a bit of back-and-forth to produce a relatively small change, and I think it was a bit too narrow with the files it selected (it missed updating the implementations of a method in some subclasses, since it didn't look at those files).
So I'm not sure if this saved me time, but it's nevertheless promising! I'm looking forward to what it will be capable of in 6mo.
What's the fundamental limitation to context size here? Why can't a model be fine-tuned per codebase, taking the entire code into context (and be continuously trained as it's updated)?
Forgive my naivety, I don't know anything about LLMs.
It's pretty good for complex projects imo because codebuff can understand the structure of your codebase and which files to change to implement changes. It still struggles when there isn't good documentation, but it has helped me finish a number of projects
One cool thing you can do is a ask Codebuff to create these docs. In fact, we recommend it.
Codebuff natively reads any files ending in "knowledge.md", so you can add any extra info you want it to know to these files.
For example, to make sure Codebuff creates new endpoints properly, I wrote a short guide with an example of the three files you need to update, and put it in backend/api/knowledge.md. After that, Codebuff always creates new endpoints correctly!
you can put the information into knowledge.md or [description].knowledge.md, but sometimes I can't find documentation and we're both learning as we go lmao
Absolutely! Imagine setting a bunch of CSS styles through a long-winded AI conversation, when you could have an IDE do it in a few seconds. I don't need that.
The long tail of niche engineering problems is the time consuming bit now. That's not being solved at all, IMHO.
"On the checkout page at the very bottom there are two buttons that are visible when the user chooses to select fast shipping. The right one of those buttons should be a tiny bit more round and it seems like it's not 100% vertically aligned with the other button."
Takes a lot longer to write than just diving into the code. I think that's what they meant.
Great question – we struggled for a long time to put our demo together precisely for this reason. Codebuff is so useful in a practical setting, but we can't bore the audience with a ton of background on a codebase when we do demos, so we have to pick a toy project. Maybe in the future, we could start our demo with a half-built project?
Hopefully the demo on our homepage shows a little bit more of your day-to-day workflows than other codegen tools show, but we're all ears on ways to improve this!
To give a concrete example of usefulness, I was implementing a referrals feature in Drizzle a few weeks ago, and Codebuff was able to build out the cli app, frontend, backend, and set up db schema (under my supervision, of course!) because of its deep understanding of our codebase. Building the feature properly requires knowing how our systems intersect with one another and the right abstraction at each point. I was able to bounce back and forth with it to build this out. It felt akin to working with a great junior engineer, tbh!
If you're not worried about showing off little hints of your own codebase, record running it on one of your day to day engineering tasks. It's perfect dog fooding and would be a fun meta example.
> To give a concrete example of usefulness, I was implementing a referrals feature in Drizzle a few weeks ago, and Codebuff was able to build out the cli app, frontend, backend, and set up db schema
Record this!
Better yet, stream it on Twitch and/or YouTube and/or Discord and build a small community of followers.
Great idea! We were kicking around something like this, but couldn't get it ready in time for today's launch – but keep your eyes peeled! Our Discord community is a great place to stay up to date.
Yup, I had the same thought. I just ran into an issue during today's launch and used Codebuff to help me resolve it: https://www.tella.tv/video/solving-website-slowdown-with-ai-.... Next time, I'll try to record before I start working, but it's hard to remember sometimes.
My favorite example is the asana loader[0] for llama-index. It's literally just the most basic wrapper around the Asana SDK to concatenate some strings.
GPT-3.5 Turbo is (most likely) Curie, which is (most likely) 6.7B params. So, yeah, it makes perfect sense that it can't compete with a 70B model on cost.
Curious to know what value you've seen out of these clusters. In my experience k means clustering was very lackluster. Having to define the number of clusters was a big pain point too.
You almost certainly want a graph like structure (overlapping communities rather than clusters).
But unsupervised clustering was almost entirely ineffective for every use case I had :/
I only got the clustering working this morning, so aside from playing around with it a bit I've not had any results that have convinced me it's a tool I should throw at lots of different problems.
I mainly like it as another example of the kind of things you can use embeddings for.
There are iterative methods for optimizing the number of clusters in k-means (silhouette and knee/elbow are common), but in practice I prefer density-based methods like HDBSCAN and OPTICS. There's a very basic visual comparison at https://scikit-learn.org/stable/auto_examples/cluster/plot_c....
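If anyone wants to try the density-based route on embeddings, a minimal sklearn sketch looks like this (sklearn 1.3+ ships HDBSCAN; older versions need the separate `hdbscan` package; `make_blobs` stands in for real embedding vectors):

```python
from sklearn.cluster import HDBSCAN, OPTICS
from sklearn.datasets import make_blobs

# Synthetic blobs as a stand-in for real embedding vectors.
X, _ = make_blobs(n_samples=1000, centers=5, n_features=50, random_state=0)

# No need to pick k up front; points that fit nowhere get the noise label -1.
labels = HDBSCAN(min_cluster_size=10).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}), "| noise points:", (labels == -1).sum())

# OPTICS is a similar density-based option, useful when cluster densities vary.
labels_optics = OPTICS(min_samples=10).fit_predict(X)
print("OPTICS clusters:", len(set(labels_optics) - {-1}))
```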
You could also use a Bayesian version of kmeans. It applies a Dirichlet process as a prior to an infinite (truncated) set of clusters such that the most probable number k is automatically found.
I found one implementation here: https://github.com/vsmolyakov/DP_means
Alternatively, there is a Bayesian GMM in sklearn. When you restrict it to diagonal covariance matrices, you should be fine in high dimensions.
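For reference, a minimal sketch of that sklearn route: `n_components` is only an upper bound, and the Dirichlet process prior pushes the weights of unneeded components toward zero (toy blobs stand in for real embeddings):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_blobs(n_samples=2000, centers=7, n_features=64, random_state=0)

# n_components is an upper bound; the Dirichlet process prior switches off
# unneeded components. Diagonal covariances keep the parameter count sane
# in high dimensions.
bgmm = BayesianGaussianMixture(
    n_components=30,
    covariance_type="diag",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

effective_k = (bgmm.weights_ > 1e-2).sum()
print("effective number of clusters:", effective_k)
labels = bgmm.predict(X)
```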
Definitely a difficult problem you're taking on here, but I don't see anything specific to LLMs. How or why are you marketing this towards LLMs?
How do you compare to the larger players already here, Nango[0] and Merge[1]?
I'm curious how you're thinking about data access / staleness? It's great that you're handling the oauth dance, but does that mean every end user of the product has to auth every product they interface with or are you handling this all at the super admin / enterprise level?
Right now I think there's too much emphasis on the "data loading" aspect of LLMs. I expect to see a swing back into using 3rd party API's SDKs. Interested to hear your thoughts on the Google API, it's absolutely massive and trying to shoehorn that into a unified API scares me.
The only real player that I could see to launch something like this and be successful is Okta.
Hey, I'm one of the co-founders of Poozle. Thanks for asking great questions; let me take them one by one.
<Why LLMs>
Our goal is to provide context for LLMs. Our first step is to normalize data and offload syncing, similar to other Unified API providers like Merge. In the future, we also plan to assist with vector embeddings or storing data directly in Vector DB for a search context API. We are exploring the best solution and believe building in the community will be a big help.
<Competition with large Players>
Nango doesn't offer a pre-built Unified API. Merge focuses on B2B SAAS companies looking to build customer-facing integrations. Our goal is to develop tools and infrastructure to support LLMs. This is similar to how Plaid bet on the Fintech industry and built infrastructure and tools around it, starting with a Unified API for banking data.
I don't think the comparison to Plaid is helping as much as you think. You and Plaid are in completely different verticals and as a result have completely different goals and users.
Currently every user of the product has to do the auth. However in future for our enterprise customers, we plan to support SSO and SAML.
<Google API>
You're absolutely right, the array of Google APIs is vast. However, if we approach it from a category perspective, there are typically a couple of key APIs that we need to manage: for instance, in the documentation category we take Google Docs, and for email we take the Gmail APIs.
I had interviewed with Uber's ATG years ago, and their pitch, even then, was that they were building a platform for whoever won the autonomous game to be available on Uber, not just their own cars.
Interesting they kept that strategy even after spinning out that group. Curious if they managed to keep anyone from that team to help with this product.
I played with building out a graphql mesh [0] of a few different APIs as I was curious to see if I could build one schema (and subsets of it) and have GPT interface over that. Turns out, it did a pretty good job if you can provide it the right portions of the schema it needs.
It also helped out when I was struggling with how large the JSON payloads were. The REST endpoints were just killing the prompt size, but having the model choose the fields it needed from GraphQL really helped out there.
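To illustrate that point, here's a rough sketch of what it looks like in practice; the endpoint and schema below are made up, and the only point is that the query names exactly the fields the model needs:

```python
import json
import requests

# Hypothetical mesh endpoint and schema; the query asks for a handful of
# fields instead of the full object graph a REST endpoint would return.
GRAPHQL_URL = "http://localhost:4000/graphql"

query = """
query ($id: ID!) {
  order(id: $id) {
    id
    status
    customer { email }
    items { sku quantity }
  }
}
"""

resp = requests.post(GRAPHQL_URL, json={"query": query, "variables": {"id": "123"}})
slim_payload = json.dumps(resp.json()["data"], indent=2)
# slim_payload is what goes into the prompt: a few fields, not the whole REST response.
print(slim_payload)
```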
Put it down for a while until I can get access to the plugin fine-tuned version of chatgpt and see if there's still a need or if it is additive still.
> Reasonable to assume that in 1-2 years it will also come down in cost.
Definitely. I'm guessing they used something like quantization to bring the VRAM usage down to 4-bit weights. The thing is that if you can't fit the weights in memory, then you have to chunk them, and that's slow = more GPU time = more cost. And even if you can fit it in GPU memory, less memory = fewer GPUs needed.
But we know you _can_ use fewer parameters, and that the training data + RLHF make a massive difference in quality. And the model size relates linearly to the VRAM requirements/cost.
So if you can get a 60B model to run at 175B quality, then you've cut your memory requirements to almost a third, and can now run (with 4-bit quantization) on a single A100 80GB, which is 1/8th of the 8x A100s GPT-3.5 was previously known to run on (and still half of what GPT-3.5 at 4-bit would need).
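Back-of-the-envelope, counting the weights only (activations and KV cache add more on top):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone; activations and KV cache come on top."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_vram_gb(175, 16))  # ~350 GB: needs several 80 GB A100s
print(weight_vram_gb(175, 4))   # ~87.5 GB: still just over one A100
print(weight_vram_gb(60, 4))    # ~30 GB: fits on a single 80 GB A100 with room to spare
```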
Also, while OpenAI likely doesn't want this, we really want these models to run on our own devices, and LLaMA + fine-tuning has shown promising improvements (not there just yet) at the 7B size, which can run on consumer devices.