Chunking 2M files a day for code search using syntax trees

kevinlu1248 · on Aug 1, 2023

Update: this algo is now publicly accessible in LlamaIndex at https://github.com/jerryjliu/llama_index/blob/e567e6a20cf89b...

d4rkp4ttern · on Aug 1, 2023

curious why it is under "langchain_helpers"... I assume there is nothing specific to langchain here?

yding · on Aug 1, 2023

Currently all of LlamaIndex's text splitters were/are Langchain compatible.

intalentive · on July 31, 2023

Next step is to train models directly on syntax trees. Higher probability of correct output.

kevinlu1248 · on July 31, 2023

That's interesting, I've seen a few papers about this. I'm personally curious about editing syntax trees using language models, since it would prevent syntax errors altogether.

madeofpalk · on Aug 1, 2023

In my limited use, I've never seen these models (chatgpt, and github copilot) generate invalid syntax. I don't see much to improve there

I do see them generate code that fails the type checker though.

kevinlu1248 · on Aug 1, 2023

For editing code there's a decent chance of syntax errors or undefined variables since it's only modifying a subset of the code.

vladf · on Aug 1, 2023

Which papers?

kevinlu1248 · on Aug 1, 2023

Here's some of the papers: https://arxiv.org/abs/1911.09983 and https://aclanthology.org/2021.findings-acl.384.pdf

karmasimida · on Aug 1, 2023

Programming languages are artificial languages. LLM are able to synthesize human languages with almost perfect grammatical quality, they are in fact, very unlikely to make obvious syntactic errors on programming languages.

Also, syntax level information are local or short sighted, it is called context-free grammar for a reason. My own observation with playing with those coding LLMs all day, is that they most likely had acquired the grammar themselves implicitly. Providing explicit regularization by enforcing grammar, is going to provide at best modest benefits, and that is dependent on good that parser is written, in many cases, it is not a given.

kevinlu1248 · on Aug 1, 2023

Ya I think forcing correct syntax at the generation level likely will not be extremely beneficial. At Sweep, we iterate the language models on linters and type-checkers using GitHub Actions and it yields better results.

eldenring · on July 31, 2023

I'd guess these model's understand works more closely to people so encoding in text is more token efficient and things like comments help.

Also syntax seems a lot easier to understand for them than semantics/logic. If you've used GPT-4 it almost never makes syntax errors. Logical errors on the other hand...

kevinlu1248 · on July 31, 2023

From my experience, GPT-4 never makes syntax errors directly but when making edits to existing code it's harder to prevent these syntax errors from appearing. We used to add a second pass to check for these syntax errors.

It also frequently makes undefined variables and the like, however.

reitzensteinm · on July 31, 2023

Did you get rid of the second pass? I'm working on something quite similar and find a pass that inspects and rejects erroneous code to be a big boost to correctness.

kevinlu1248 · on July 31, 2023

We got rid of it. Our new edit framework works around search-and-replace pairs with an example at https://github.com/sweepai/sweep/blob/d37dda3a626f09dea3b322...

anotherpaulg · on Aug 1, 2023

That edit format looks familiar!

https://github.com/paul-gauthier/aider/blob/d5a7aac560d4584d...

https://aider.chat/docs/benchmarks.html

kevinlu1248 · on Aug 1, 2023

Yup it's based on the aider blogs. They're perfect for our use case and are very reliable compared to our old attempts.

_boffin_ · on July 31, 2023

I’ve built out an end to end automated fix pipeline, it’s getting the bug fixes right, but been having trouble with line number errors.

Looking forward to reading through your docs and repo later tonight to see how you’re addressing issues like this.

kevinlu1248 · on Aug 1, 2023

We used to use line numbers but it became problematic so we switched over to search-and-replace pairs, which works significantly better. The only potential problems are with setting up a fuzzy search system since sometimes the search doesn't match exactly with the code (missing comments, etc.). We're going to write about our core algo and diff managing system soon.

reitzensteinm · on July 31, 2023

Ah, interesting. I completely abandoned diff style updates in favour of AST substitution, but that was only possible because my tradeoffs are different to yours.

I'm building a bot that's building itself, so it doesn't have to support large legacy code bases with different languages.

kevinlu1248 · on Aug 1, 2023

This is interesting, I'm wondering what you mean by AST substitution. Is this like an agent that traverses the tree and picks what to edit? Is this language model based? Also, thankfully we don't support too many uncommon languages. The most recent ones we added support for are embedded templates (ERB/EJS for flask and Ruby) and mustache. Fortunately many uncommon languages are subsets of other languages.

reitzensteinm · on Aug 1, 2023

The agent specifies what function, class, method etc to replace, along with its full source. It's more costly, but I believe it leads to fewer hallucinations as it is generating a coherent piece of code.

But it requires parsing AST and language specific instructions. And things like metaprogramming or macros could cause some hairy confusion.

All of these factors don't hurt my use case.

kevinlu1248 · on Aug 1, 2023

We have a similar method under the hood, except it's purely text-based search-and-replace. The model decides what to replace. It seems to be consistent and is easy to implement.

reitzensteinm · on Aug 1, 2023

My gut feeling based on my experience over the last couple of months is that substitution of an entire function is more reliable than some lines of a function. The surrounding context reduces the chance of hallucinations.

Gut feeling doesn't account for much though - I'm working on an evals system to be able to quantify system performance. It won't be cheap to run.

It could easily be that your method is superior.

kevinlu1248 · on Aug 1, 2023

From our experience, single or few line replacements generally fine, since many of the changes are many few-line changes in multiple spots accross multiple files. We also provide surrounding context for the search-and-replace pairs, which helps with the model. Beyond 10 lines and the model also usually add the function headers in there which helps with the code generation.

I'm also curious, how are you guys evaluating the performance of your models?

reitzensteinm · on Aug 1, 2023

There's no systematic evaluation yet which is the next step. It's successfully bootstrapping itself which is a fairly high bar, but quantitative performance measurements are getting more and more important as the project progresses.

kevinlu1248 · on Aug 1, 2023

I feel the same, benchmarking in general is a pain but a good benchmark for us could go a long way.

woadwarrior01 · on Aug 1, 2023

Indeed, it's intuitively more efficient for LLMs to operate on ASTs instead of raw source code. I came across a recent paper[1] that takes this approach.

[1]: https://arxiv.org/abs/2305.00909

kevinlu1248 · on Aug 1, 2023

This is interesting, I'll take a look. My main concern with running this in production is that there is more text data in the world than code. Further, a pure tree-manipulation model has less explainabilities since you can always ask GPT-4 what it's thinking.

woadwarrior01 · on Aug 6, 2023

IIUC, doc-strings and comments in code will still be processed as text.

awwaiid · on Aug 3, 2023

Also git diffs and execution traces.

Zambyte · on July 31, 2023

John McCarthy was right

kevinlu1248 · on July 31, 2023

This is interesting. I'm taking a read on this.

kartoolOz · on Aug 1, 2023

Would see great improvments in retrieval accuracy by finetuning e5-base-v2 or the newer leaders on mteb benchmark.

kevinlu1248 · on Aug 1, 2023

Definitely. I prefer the sentence-transformers ones since they have been fine-tuned on codesearchnet. I'm also really excited about the latest gte models by Alibaba, their smallest model is the size of MiniLM L6 but beats MPNet.

yding · on July 31, 2023

This is really cool and a much needed contribution to helping LLMs run better on large code bases.

kevinlu1248 · on July 31, 2023

Thanks! Would love to see this algorithm in LlamaIndex.

kevinlu1248 · on Aug 1, 2023

For others in this thread, this algo is now publicly accessible in LlamaIndex at https://github.com/jerryjliu/llama_index/blob/e567e6a20cf89b...

thomasfromcdnjs · on July 31, 2023

Also thank you! I had an idea of how this all worked but this definitely concreted my thoughts.

kevinlu1248 · on July 31, 2023

I'm glad to hear! Let me know if you need more questions or are interested in implementing this chunking algo.

sandGorgon · on Aug 1, 2023

oh really ? Thats awfully kind. I'll take that in for EdgeChains as well.

https://github.com/arakoodev/EdgeChains/issues/172

kevinlu1248 · on Aug 1, 2023

Feel free to use the algo! Happy to help and reach out if you need help with implementation!

mellosouls · on Aug 1, 2023

OT: I'm not clear on whether or not Sweep is fully open source - I mean, can you run it fully self-hosted (apart from the GPT4 engine obv), or is the repo essentially a client to a Sweep API/binary?

Cool project btw!

kevinlu1248 · on Aug 1, 2023

Thanks! The repo is just the backend that runs the GitHub webhooks. We used to have a "chat with your code" client but stopped supporting it. Now it's only the GitHub interface with creating tickets and comments.

mellosouls · on Aug 1, 2023

Thanks for the clarification.

sdesol · on July 31, 2023

Congrats! Your project is off to a very good start as shown at https://devboard.gitsense.com/sweepai

What is interesting to me is the sharp increase in forks, which is a good indicator that others will contribute code in the near future.

Full Disclosure: This is my tool

kevinlu1248 · on July 31, 2023

Hey thanks for showing this dashboard. There's some crazy analytics in here, the tool looks awesome!

sdesol · on July 31, 2023

Thanks. For privacy reasons, I'm not showing a lot, but in the future when auth is in place, I can show deeper insights for repo members.

kevinlu1248 · on July 31, 2023

That would be awesome. Let me know when it's out and I'd love to try it out.

sdesol · on Aug 1, 2023

Sure. Update your profile for a way to reach out or send me an email (in my profile).

kevinlu1248 · on Aug 1, 2023

Shot an email. Thanks!

alj032 · on Aug 1, 2023

I just wanted to say the site looks a little odd on mobile

Edit: I guess it is just the app I am using, it looks fine on my mobile browser but odd in the app

kevinlu1248 · on Aug 1, 2023

Which app? We did notice a few visual errors a while back.

wanderingmind · on Aug 1, 2023

Tangent, are there any other similar alternatives to sweep that is not restricted to github but can be installed in other places like gitlab, bitbucket or even self hosted.

kevinlu1248 · on Aug 1, 2023

Not a great answer but we are open-source so forking us is an option.

wanderingmind · on Aug 1, 2023

Yes I had that thought, but your license looked custom so was not sure if that was allowed.

alchemist1e9 · on Aug 1, 2023

Awesome first step. Next is to figure out how to apply syntax trees for diffs and then train the LLM on code and diffs but all in syntax trees somehow.

kevinlu1248 · on Aug 1, 2023

Yup, saw a few papers about this over the past two years, using graph neural networks for code generation. There's also another thread below on this topic.

Edit: Here's some of the papers: https://arxiv.org/abs/1911.09983 and https://aclanthology.org/2021.findings-acl.384.pdf

alchemist1e9 · on Aug 1, 2023

Very interesting! Thank you.

I’ll try to explain something I’m thinking, it comes down to a type of agglutination.

https://en.wikipedia.org/wiki/Agglutination

The ASTs need to become sequences of tokens for LLMs to work well. Also the embedding space is related, this should all be specifically optimized for code, not based on general human language .

Of course the description of the code would be english. But my point is that a subspace of the encoding should be specifically designed for an agglutination based sequential expression of ASTs.

Not sure any of that makes sense to someone with more expertise in this space than me.

kevinlu1248 · on Aug 1, 2023

This is really interesting, will take a look. So is this basically English to code translation?

alchemist1e9 · on Aug 1, 2023

Yeah what I’m thinking about is that and how an LLM could use ASTs expressed in some type of compact encoding and with that code tied to english descriptions, perhaps it will learn really well the semantic cross space.

kevinlu1248 · on Aug 1, 2023

Like stochastically generate into an intermediate representation that can be procedurally compiled to code right?

alchemist1e9 · on Aug 1, 2023