Searching a Codebase in English (greptile.com)
56 points by dakshgupta 31 days ago | 21 comments



Summary "Semantic search on codebases works better if you first translate the code to natural language, before generating embedding vectors. It also works better if you chunk more “tightly” - on a per-function level rather than a per-file level. This is because noise negatively impacts retrieval quality in a huge way."
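To make the "per-function rather than per-file" chunking concrete, here is a minimal sketch. It assumes Python source and uses the standard-library `ast` module as a stand-in for whatever parser the article actually uses; the `chunk_by_function` helper and the sample snippet are hypothetical. The natural-language translation step would run on each chunk's code afterwards (for example with an LLM) and is not shown.

```python
import ast


def chunk_by_function(source: str) -> list[dict]:
    """Split Python source into one chunk per function definition."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Exact source text of this function only, without the rest of the file.
                "code": ast.get_source_segment(source, node),
            })
    # Keep chunks in file order regardless of traversal order.
    return sorted(chunks, key=lambda c: c["start_line"])


# Hypothetical two-function file used only for illustration.
CHUNK_SAMPLE = '''
def add(a, b):
    return a + b

def greet(name):
    return "hello " + name
'''

chunks = chunk_by_function(CHUNK_SAMPLE)
for chunk in chunks:
    print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk then carries only one function's worth of text, so unrelated code in the same file does not end up in the same embedding.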

This makes a lot of sense. You should also embed information about how the code relates to other functions and where it is located in the codebase. One approach is to add really wonderful comments to the code so that when humans and machines read it, they are brought on a step-by-step journey of how the code fulfills a goal. I tell the LLM to explain step by step to junior developers and to inspire senior engineers with a glimpse of the profound beauty of the code and its architecture.
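One way to read "embed information about how the code is related and where it is located" is to prepend that context to the text that gets embedded. A rough sketch, where the `Chunk` fields, the `to_embedding_text` helper, and the example path and function names are all made up for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    path: str          # where the function lives in the repo
    name: str          # function name
    description: str   # natural-language explanation of the function
    calls: list[str] = field(default_factory=list)
    called_by: list[str] = field(default_factory=list)


def to_embedding_text(chunk: Chunk) -> str:
    """Build the text to embed: location and relationships first, then the description."""
    lines = [
        f"File: {chunk.path}",
        f"Function: {chunk.name}",
        f"Calls: {', '.join(chunk.calls) or 'none'}",
        f"Called by: {', '.join(chunk.called_by) or 'none'}",
        "",
        chunk.description,
    ]
    return "\n".join(lines)


example = Chunk(
    path="auth/session.py",
    name="refresh_token",
    description="Renews an expired session token and stores the new one.",
    calls=["load_session"],
)
text = to_embedding_text(example)
print(text)
```

Whether this extra header helps or just adds noise is something you would have to measure on your own retrieval set.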


We use tree-sitter right now and are experimenting with call graphs. Will write about the results when we have data on that.
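For readers unfamiliar with call graphs, a minimal sketch of the idea follows. It is not the tree-sitter setup mentioned above: it assumes Python source, uses the standard-library `ast` module instead, and only records direct calls by plain name. The `build_call_graph` helper and the sample code are hypothetical.

```python
import ast


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function name to the plain names it calls."""
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for func in tree.body:
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        callees = graph.setdefault(func.name, set())
        for node in ast.walk(func):
            # Only direct calls like foo() are recorded;
            # method calls such as obj.foo() are skipped in this sketch.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callees.add(node.func.id)
    return graph


# Hypothetical three-function module used only for illustration.
GRAPH_SAMPLE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def main(path):
    return parse(load(path))
'''

graph = build_call_graph(GRAPH_SAMPLE)
for caller, callees in graph.items():
    print(caller, "->", sorted(callees))
```

The edges could then be attached to each function's chunk so retrieval can pull in callers and callees alongside a match.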


I've been using tree-sitter too. Looking forward to reading about your results with call graphs.


I think I found a mistake. In the article you write: "We then compare that against our database of vectors and find the one(s) that match the closest, i.e., have the lowest dot product and highest similarity."

We want to maximize the normalized dot product (or cosine similarity) to find semantically similar text chunks.


So this would be the highest dot product. A dot product close to zero would mean the two vectors are nearly orthogonal and thus very _not_ similar in direction (semantic meaning). If it is negative, they point in opposing directions.
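A small worked example of the point above, using toy 2-D vectors rather than real embeddings (the vectors and labels are invented for illustration):

```python
import math


def cosine_similarity(a, b):
    """Normalized dot product: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
candidates = {
    "same_direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

scores = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}

# Retrieval picks the HIGHEST score, not the lowest.
best = max(scores, key=scores.get)
print(scores)
print(best)
```

Here the scores come out as 1.0, 0.0 and -1.0 respectively, so the closest match is the one with the highest value.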


Good catch! This is a mistake, fixing now.


Interesting direction. We also have a codebase chat (example here: https://wiki.mutable.ai/ollama/ollama) that HN might find appealing. It uses a wiki as a living artifact owned by your team to power the chat, which gives us increased context length and reasoning capabilities. We didn't really like the results we got with embeddings. We have been pretty thrilled with the results on Q&A, search, and even codegen (more on that soon).


Is there a free version of Greptile?


Free for small open source repos + 300 large open source repos.

Code MISTRAL100 at checkout for a free month of Pro.


So does this mean that you can use it for all your small open source repos, or do you need to apply for each one? The only mention of the free offering is on the pricing page, where you fill in a form with a repo link to apply for it.

Edit: found the information in docs here [2]

[2] https://docs.greptile.com/pricing


That form is for when you want to install the code review bot or other GitHub integrations for free on your repo.

If you want to chat with your repo for free you can do that on the website, assuming it’s a small public repo.


The page is unreadable on Firefox Focus


Also unreadable on Android/Chrome. The whole left side of the page is cropped off


Just fixed, sorry


Also unreadable on Firefox on iOS.


Pushed a fix, thanks


Works well in Arc and Chrome in OSX. Arc is a lovely browser BTW.


Unreadable on Firefox on Android also. I think maybe this site just wasn't tested on a small screen, i.e. where most people get their reading done.


Sorry, fixed


Sorry, just fixed


Cool, works now. Thanks!



