Searching a Codebase in English

tonyoconnell · 2024-08-20T01:54:43.000000Z

Summary "Semantic search on codebases works better if you first translate the code to natural language, before generating embedding vectors. It also works better if you chunk more “tightly” - on a per-function level rather than a per-file level. This is because noise negatively impacts retrieval quality in a huge way."

This makes a lot of sense. You should also embed information about how the code is related to other functions/code and where it is located in the codebase. One approach is to add really wonderful comments to the code so that when humans and machines read it they are brought on a step-by-step journey of how the code fulfills a goal. I tell the LLM to explain step by step to junior developers and to inspire senior engineers with a glimpse of the profound beauty of the code and its architecture.
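As a rough illustration of per-function chunking with location metadata, here is a minimal sketch that uses Python's standard `ast` module. The function name `chunk_functions` and the chunk fields are hypothetical, and the natural-language translation step (which would need an LLM call) is left out. This is not the article's implementation, only one way the idea could look for Python sources.

```python
import ast


def chunk_functions(source: str, path: str = "<memory>") -> list:
    """Split Python source into one chunk per function definition.

    Each chunk keeps the file path and line range so that the location
    can be embedded or stored alongside the code text.
    Note: nested functions are returned as their own chunks in addition
    to appearing inside the enclosing function's chunk.
    """
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "path": path,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "code": ast.get_source_segment(source, node),
            })
    # Sort by position so the output order matches the file order.
    return sorted(chunks, key=lambda c: c["start_line"])


SAMPLE = (
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "def greet(name):\n"
    "    return 'hi ' + name\n"
)

if __name__ == "__main__":
    for chunk in chunk_functions(SAMPLE, "sample.py"):
        print(chunk["path"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk could then be described in English and embedded on its own, instead of embedding the whole file as one noisy vector.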

dakshgupta · 2024-08-20T02:47:51.000000Z

We use tree-sitter right now and are currently experimenting with call graphs. Will write about the results when we have data on that.

tonyoconnell · 2024-08-20T07:46:20.000000Z

I've been using tree-sitter also. Looking forward to reading about your results with call graphs.
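For readers unfamiliar with call graphs, the sketch below builds a very small one. It is an assumption-heavy stand-in: it uses Python's built-in `ast` module rather than tree-sitter, handles only Python source, and records only direct calls by plain name (method calls such as `obj.run()` are ignored). The name `call_graph` is hypothetical.

```python
import ast


def call_graph(source: str) -> dict:
    """Map each function name to the set of plain names it calls.

    Simplifications: only `name(...)` calls are recorded, and calls made
    inside a nested function are also attributed to the enclosing one.
    """
    tree = ast.parse(source)
    graph = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            callees = set()
            for child in ast.walk(node):
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    callees.add(child.func.id)
            graph[node.name] = callees
    return graph


EXAMPLE = (
    "def helper():\n"
    "    return 1\n"
    "\n"
    "def main():\n"
    "    return helper() + helper()\n"
)

if __name__ == "__main__":
    print(call_graph(EXAMPLE))
```

Edges like `main -> helper` are the kind of relationship information that could be stored next to each chunk.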

byearthithatius · 2024-08-20T18:22:21.000000Z

I think I found a mistake. In the article you write: "We then compare that against our database of vectors and find the one(s) that match the closest, i.e., have the lowest dot product and highest similarity."

We want to maximize the normalized dot product (or cosine similarity) to find semantically similar text chunks.

byearthithatius · 2024-08-20T18:22:59.000000Z

So this would be the highest dot product. Finding the lowest (closest to zero) would mean the two vectors are closer to orthogonal and thus very _not_ similar in direction (semantic meaning). If negative, they are semantically opposite.
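The point can be checked with a few toy vectors. This is a minimal sketch in plain Python; the vectors and the helper names `cosine_similarity` and `best_match` are made up for illustration and do not come from the article.

```python
import math


def cosine_similarity(a, b):
    """Normalized dot product of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query, vectors):
    """Return the key whose vector has the HIGHEST cosine similarity."""
    return max(vectors, key=lambda name: cosine_similarity(query, vectors[name]))


QUERY = [1.0, 0.0]
VECTORS = {
    "same_direction": [2.0, 0.0],   # similarity 1.0
    "orthogonal": [0.0, 3.0],       # similarity 0.0
    "opposite": [-1.0, 0.0],        # similarity -1.0
}

if __name__ == "__main__":
    for name, vec in VECTORS.items():
        print(name, cosine_similarity(QUERY, vec))
    print("best:", best_match(QUERY, VECTORS))
```

Retrieval picks the maximum, so `same_direction` wins; a score near zero means unrelated, and a negative score means opposite in direction.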

dakshgupta · 2024-08-21T00:01:15.000000Z

Good catch! This is a mistake, fixing now.

oshams · 2024-08-20T03:58:09.000000Z

Interesting direction. We also have a codebase chat (example here https://wiki.mutable.ai/ollama/ollama) that HN might find appealing. It uses a wiki as a living artifact owned by your team to power the chat, which gives us increased context length and reasoning capabilities. We didn't really like the results we got with embeddings. We have been pretty thrilled with the results on Q&A, search, and even codegen (more on that soon).

deisteve · 2024-08-20T01:56:11.000000Z

Is there a free version of Greptile?

dakshgupta · 2024-08-20T02:46:29.000000Z

Free for small open source repos + 300 large open source repos.

Code MISTRAL100 at checkout for a free month of Pro.

elashri · 2024-08-20T04:02:01.000000Z

So does this mean that you can use it for all your small open source repos, or do you need to apply for each one? The only mention of a free offering is on the pricing page, where you fill in a form with a repo link to apply for it.

Edit: found the information in docs here [2]

[2] https://docs.greptile.com/pricing

dakshgupta · 2024-08-20T04:23:23.000000Z

That form is if you want to install the code review bot or other GitHub integrations for free on your repo.

If you want to chat with your repo for free you can do that on the website, assuming it’s a small public repo.

Zambyte · 2024-08-15T23:59:27.000000Z

The page is unreadable on Firefox Focus

jesse__ · 2024-08-19T22:56:46.000000Z

Also unreadable on Android/Chrome. The whole left side of the page is cropped off

dakshgupta · 2024-08-20T03:10:40.000000Z

Just fixed, sorry

bavent · 2024-08-19T22:58:57.000000Z

Also unreadable on Firefox on iOS.

dakshgupta · 2024-08-20T03:10:51.000000Z

Pushed a fix, thanks

tonyoconnell · 2024-08-20T01:38:43.000000Z

Works well in Arc and Chrome in OSX. Arc is a lovely browser BTW.

tanvibhakta · 2024-08-20T02:20:44.000000Z

Unreadable on Firefox on Android also. I think maybe this site just wasn't tested on a small screen, i.e. where most people get their reading done.

dakshgupta · 2024-08-20T03:10:24.000000Z

Sorry, fixed

dakshgupta · 2024-08-20T03:10:32.000000Z

Sorry, just fixed

Zambyte · 2024-08-20T14:19:57.000000Z

Cool, works now. Thanks!