Even 100k still seems very limiting for many applications naturally suited to LLMs.

What if I want an AI assistant that is specifically trained on a large codebase? Or all my product’s docs (which might easily exceed 100k characters for a big project). Or one that knows the exact details of the entire tax code? Or one that knows every line of Dostoyevsky’s novels so I can have angsty existential conversations with it? Or that can fully remember conversations with me that stretch on for years?

It seems like you’d need fine tuning for these kinds of use cases? Or am I missing something?




You use a database (ideally a vector database) and a framework of behind-the-scenes prompting that runs a "research loop" against it to gather the info needed to answer questions. You might also fine tune on the source you want it to incorporate, but training (including fine tuning) is most useful, as I understand it, for providing "understanding", not exact recall. A searchable auxiliary memory, plus a framework that consults it as part of generating user responses, is what provides exact recall over a larger dataset.
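A minimal sketch of that loop in Python. embed(), llm(), cosine() and everything else here are my own hypothetical helpers, not any particular library's API: embed() is a toy bag-of-words stand-in for a real embedding model, and llm() is a placeholder you'd wire to your chat API of choice.

    import numpy as np

    def embed(text: str, dim: int = 256) -> np.ndarray:
        """Toy stand-in for a real embedding model: hash words into a vector."""
        v = np.zeros(dim)
        for word in text.lower().split():
            v[hash(word) % dim] += 1.0
        return v

    def llm(prompt: str) -> str:
        raise NotImplementedError("wire this up to your chat model of choice")

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # The "vector database": every chunk of the source material, embedded up front.
    chunks = ["...chunk 1 of your docs...", "...chunk 2..."]
    index = [(c, embed(c)) for c in chunks]

    def search(query: str, k: int = 5) -> list[str]:
        q = embed(query)
        return [c for c, v in sorted(index, key=lambda cv: -cosine(cv[1], q))[:k]]

    def research_loop(question: str, max_rounds: int = 3) -> str:
        notes, query = [], question
        for _ in range(max_rounds):
            notes += search(query)
            # The model decides: answer now, or issue another search query.
            reply = llm("Question: " + question + "\nNotes:\n" + "\n".join(notes) +
                        "\nReply 'ANSWER: <answer>' if the notes suffice, else 'SEARCH: <query>'.")
            if reply.startswith("ANSWER:"):
                return reply[7:].strip()
            query = reply[7:].strip()
        return llm("Answer from these notes:\n" + "\n".join(notes) + "\nQuestion: " + question)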


Ok, I see what you're saying, but at least for the big codebase example, and I'd imagine for many other applications as well, are you really going to get good output by just loading in bits and pieces, even if they are intelligently selected? It seems like holistic "understanding" is exactly what you want in that case, so that the model can take into account the full architecture and know how every component and sub-component fits together.

I wouldn't look forward to a PR for a significant feature from a human developer that just skimmed through all of a project's files and only read a handful of them in depth, so I guess I'm also skeptical this would lead to good results from an LLM.


I think the suggested process is to ingest all of the code base into a vector database and provide prompts and APIs for the model to search and focus on relevant areas while developing a feature, which is analogous (identical, even) to how a human would approach the process - with long-term and short-term memory, applying general and specific experience.
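A rough sketch of that ingestion step, reusing the hypothetical embed() helper from upthread. Chunking by fixed line windows is just one arbitrary choice; real setups split on functions or classes.

    from pathlib import Path

    def chunk_file(path: Path, window: int = 40):
        """Split a source file into line windows, each tagged with its origin."""
        lines = path.read_text(errors="ignore").splitlines()
        for start in range(0, len(lines), window):
            yield f"# {path}:{start + 1}\n" + "\n".join(lines[start:start + window])

    def ingest(repo_root: str):
        """Build the (chunk, vector) rows the model's search tool will query."""
        return [(chunk, embed(chunk))
                for path in Path(repo_root).rglob("*.py")  # add other extensions as needed
                for chunk in chunk_file(path)]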


You can also create hierarchical summaries that help the model select the right starting place in the code base.
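For instance (a sketch, again leaning on the hypothetical llm() placeholder from upthread): summarize bottom-up, each file first, then each directory from its children's summaries, so the model gets a tree it can navigate down from the root.

    from pathlib import Path

    def summarize_tree(root: Path) -> dict[str, str]:
        """Bottom-up summaries: deepest paths first, so files precede their directories."""
        summaries: dict[str, str] = {}
        for path in sorted(root.rglob("*"), key=lambda p: -len(p.parts)):
            if path.is_file() and path.suffix == ".py":
                summaries[str(path)] = llm("Summarize this file in two sentences:\n" +
                                           path.read_text(errors="ignore")[:8000])
            elif path.is_dir():
                children = "\n".join(f"{p}: {s}" for p, s in summaries.items()
                                     if Path(p).parent == path)
                summaries[str(path)] = llm("Summarize this directory from its contents:\n" + children)
        return summaries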

Additionally, fine tuning isn't a great fit for mutable knowledge like a code base.


But finetuning wouldn't give the LLM a holistic understanding. LLMs are finetuned to predict the next token. At best you give them a full file at a time; more likely, you create batches of similar-length sequences. So while finetuning means the LLM sees the whole codebase, it won't necessarily learn how all the files fit together (for example, I'm pretty sure it won't see the folder structure unless you somehow include it via preprocessing).
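One hypothetical way that preprocessing could look (not a standard recipe, just an illustration): prepend the repo layout and the file's own path to every training example, so the model at least sees where each file sits.

    from pathlib import Path

    def training_examples(root: Path):
        """Each example carries the full repo layout plus its own location."""
        tree = "\n".join(str(p.relative_to(root)) for p in sorted(root.rglob("*.py")))
        for path in sorted(root.rglob("*.py")):
            yield ("Repository layout:\n" + tree +
                   f"\n\nFile: {path.relative_to(root)}\n" +
                   path.read_text(errors="ignore"))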


While I grant that it doesn't have the ability to go into single-file or line-level granularity, GPT-4 can have a highly informed discussion on the architectures of big projects in its training set like PostgreSQL or the Linux kernel. That's what I mean by "holistic understanding". If you're a contributor for one of those projects (or want to be), it could definitely help to evaluate the tradeoffs of various design approaches for new features or modifications.

Getting into the nitty gritty of files and functions would be a step further, but it seems to me that just having this general knowledge of how a big project is structured would vastly improve any output related to that project compared to just loading in comparatively tiny pieces of it. Even if those tiny pieces are the most relevant to the prompt, it will be missing all the background context that it has for projects in its training set.


I expect that for Linux or PostgreSQL it was trained not just on the codebase, but also on mailing lists and tons of other documents that discuss various things about it, and that's what gives it these reasoning powers.


You basically can't. We haven't developed that kind of AI.

Some of us hope that just-in-time retrieval of data from an external source (usually a vector db) is going to work. Like, say, "retrieve chapter 3 of Dostoyevsky's CP and make an interesting comment about it". For the record, you can have plenty of interesting angsty existential conversations with both GPTs about D right now if you so desire. Just preface your dialogue with something slightly more sophisticated than "let us have an angsty existential conversation about Dostoyevsky".

This JIT retrieval might work for some cases, but my guess is it won't for large holistic pieces of work where you have to integrate and "understand" the entire edifice at once. I'm not completely sure if codebases merit such a distinction, but you can imagine a domain like law might. We can push the context limit, and this might bring some temporary relief, but I'm not convinced. The recent pushes to 100K seem to rely on "dropping" attention in an intelligent way. It sounds like cheating, and while it might work OK for some cases, I think it'll drop the ball eventually.

The integrated understanding that results from, somehow, internalizing the relation between all the hundreds of thousands of seemingly unrelated datapoints is what makes us interesting. That's also what makes an LLM interesting. It's just that an LLM only has access to that level of development when it's training.

If I were to guess, I think we somehow need to enable "always-training" and I'm not sure anyone has the faintest idea how. You can only go so far if you step out of school and never learn anything ever again, no matter how brilliant you are. GPT-4 is quite the scholar, but we're stretching it.


> I think we somehow need to enable "always-training" and I'm not sure anyone has the faintest idea how.

To ask the naive question, why can't we just keep continually training it with new stuff? If it takes, say, a year to ingest approximately the whole internet, tacking on a single large codebase should probably take something like an additional few seconds if we consider what fraction that codebase is of the full training set. I know it's not that simple for a lot of reasons, but it doesn't seem obvious that fundamentally new methods are needed to do this.


This new piece of information can impact all previous information in unknown ways.

To give a dramatic and superficial example, let's say you just found out you have been living inside the Truman Show. This impacts, if not everything, then a whole lot of what you know about your world. Instantly and completely. This cannot be "tacked on". This has to be integrated somehow.

Current AI doesn't work like that. Instead of building a tower of knowledge it sort of creates abstract cognitive landscapes filled with predetermined averaged out paths that, when followed, give good answers. Useful for a lot of tasks, but it doesn't strike me as the type of structure that can be manipulated with any kind of precision. Maybe that will change and/or I am wrong. Certainly the last one is very probable.


You use a vector database with query similarity search to find portions of documents that might be relevant to your prompt, rank them, take the top n, put them into the context, and run that through your LLM to generate an answer to your question.
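The whole pipeline is a few lines (reusing the hypothetical embed()/llm()/cosine() helpers sketched upthread; nothing here is a real library API):

    def answer(question: str, chunks: list[str], n: int = 5) -> str:
        q = embed(question)
        # Rank every chunk by similarity to the question, keep the top n.
        top = sorted(chunks, key=lambda c: -cosine(embed(c), q))[:n]
        return llm("Context:\n" + "\n---\n".join(top) +
                   "\n\nQuestion: " + question +
                   "\nAnswer using only the context above.")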


This paradigm applies to much more than just answering questions - I've found the in-context learning capabilities to be very useful as well. For example, you can query a vector database, feed the results into an LLM, and ask it to make a prediction across that data.

https://zilliz.com/blog/ChatGPT-VectorDB-Prompt-as-code
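A sketch of that prediction pattern, with the same hypothetical helpers as upthread: the nearest labeled rows from the vector store become few-shot demonstrations, and the model extrapolates to the new input.

    def predict(item: str, labeled_rows: list[tuple[str, str]], k: int = 3) -> str:
        """Retrieve the k most similar labeled rows, use them as few-shot examples."""
        q = embed(item)
        nearest = sorted(labeled_rows, key=lambda r: -cosine(embed(r[0]), q))[:k]
        shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in nearest)
        return llm(shots + f"\nInput: {item}\nLabel:")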


Can you elaborate? I don’t understand how the article you linked is different from normal question answering with context.


That's pretty cool, thanks for sharing!


Finetuning is not useful for teaching new facts; you need RAG (retrieval-augmented generation) for that: https://zzbbyy.substack.com/p/why-you-need-rag-not-finetunin...



