Summary
"Semantic search on codebases works better if you first translate the code to natural language, before generating embedding vectors.
It also works better if you chunk more “tightly” - on a per-function level rather than a per-file level. This is because noise negatively impacts retrieval quality in a huge way."
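To make the "per-function rather than per-file" point concrete, here is a minimal sketch of function-level chunking for Python source using the standard `ast` module. The helper name `chunk_by_function` and the shape of the returned dicts are assumptions for illustration, not the article's implementation, and the natural-language translation step (which would need an LLM) is left out.

```python
import ast


def chunk_by_function(source: str) -> list[dict]:
    """Split Python source into one chunk per function or method.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                # Exact source text of this function only (decorators excluded)
                "code": ast.get_source_segment(source, node),
            })
    # ast.walk does not guarantee source order, so sort by line number
    chunks.sort(key=lambda c: c["start_line"])
    return chunks


example = '''
def add(a, b):
    return a + b

def greet(name):
    return "hello " + name
'''

chunks = chunk_by_function(example)
for chunk in chunks:
    print(chunk["name"], "starts at line", chunk["start_line"])
```

Each chunk could then be described in natural language and embedded on its own, so a query only has to match one function's worth of text rather than a whole file.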
This makes a lot of sense. You should also embed information about how the code is related to other functions/code and where it is located in the codebase. One approach is to add really wonderful comments to the code, so that when humans and machines read it they are taken on a step-by-step journey through how the code fulfills its goal. I tell the LLM to explain step by step to junior developers and to inspire senior engineers with a glimpse of the profound beauty of the code and its architecture.
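One way to picture "embed where the code lives and what it relates to" is to build the text you embed from more than the function body. This is only a rough sketch under my own assumptions: `describe_functions` is a made-up helper, the file path is passed in by hand, "related code" is approximated by the plain function names called inside each function, and the docstring stands in for an LLM-written explanation.

```python
import ast


def describe_functions(path: str, source: str) -> list[str]:
    """Build one context-rich text per top-level function (hypothetical helper)."""
    tree = ast.parse(source)
    docs = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Plain-name calls made inside this function, e.g. load(...), len(...).
        # Method calls such as obj.read() are attribute calls and are skipped.
        calls = sorted({
            n.func.id
            for n in ast.walk(node)
            if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
        })
        summary = ast.get_docstring(node) or "(no description)"
        docs.append(
            f"File: {path}\n"
            f"Function: {node.name}\n"
            f"Calls: {', '.join(calls) or 'none'}\n"
            f"Description: {summary}"
        )
    return docs


example = '''
def load(path):
    """Read a file and return its text."""
    return open(path).read()

def count_words(path):
    """Count whitespace-separated words in a file."""
    return len(load(path).split())
'''

docs = describe_functions("utils/text.py", example)
print(docs[1])
```

The resulting strings, rather than the raw code, would be what gets passed to the embedding model.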
I think I found a mistake. In the article you write: "We then compare that against our database of vectors and find the one(s) that match the closest, i.e., have the lowest dot product and highest similarity."
We want to maximize the normalized dot product (or cosine similarity) to find semantically similar text chunks.
So this would be the highest dot product. A dot product close to zero would mean the two vectors are nearly orthogonal and thus very _not_ similar in direction (semantic meaning). If it is negative, they point in opposite directions.
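A tiny worked example of the point, using hand-picked 2D vectors rather than real embeddings (the vectors and labels are made up for illustration):

```python
import math


def cosine_similarity(a, b):
    """Normalized dot product of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

scores = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}

# Retrieval picks the HIGHEST score, not the lowest
best = max(scores, key=scores.get)
print(scores)
print("best match:", best)
```

The scores come out as 1.0, 0.0 and -1.0 respectively, so picking the lowest value would return the least similar chunk.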
Interesting direction. We also have a codebase chat (example here https://wiki.mutable.ai/ollama/ollama) that HN might find appealing. It uses a wiki as a living artifact owned by your team to power the chat, which gives us increased context length and reasoning capabilities. We didn't really like the results we got with embeddings. We have been pretty thrilled with the results on Q&A, search, and even codegen (more on that soon).
So does this mean that you can use it for all your small open source repos, or do you need to apply for each one? The only mention of a free offering is on the pricing page, where you fill in a form with a repo link to apply for it.