Do you simple paste the entire content of a webpage inside a prompt? You prepend the question from an enduser with that content?:i.e. dear chatgpt api, could you answer the following question [user question] based on the following text [website contents]
I guess not. Probably an offline process where they scrape the websites into chunks and build embeddings. At query time first search for the relevant chunks and then put those chunks into the prompt?
Yes, you are right. It's not possible to give the entire content in a prompt. A users' site can have a lot of pages and each page can potentially be super long.
You need to split content into chunk but also need to retain semantic meaning. I have been building same thing: https://chatfast.io. This also support scanned pdf and image. (I built my own algorithm, backend in c# not using llama_index or langchain)
Not exactly. Given one of my plans has 5000 pages, it is not possible to just paste the entire content in a prompt. Open AI API has a max tokens limit.
I will first do some pre-processing on the content and fetch the relevant pieces of content before giving it as a prompt to the API.
It's pretty straightforward forward with LangChain and GPT-Index. There are lot of tutorials on the Internet for the same like this one https://youtu.be/9TxEQQyv9cE
I don't think chunking + embedding based retrieval is good enough. It's a good first draft for a solution, but the chunks are out of context, so the LLM could combine them in an unintended way.
Better to question each document separately and then combine the answers into one last LLM round. Even so, there might be inter-document context that is lost - for example looking at one document that depends on details in another one. Large collections of documents should be loaded up in multiple passes, as the interpretation of a document can change when encountering information in another document. Adding one single piece of information to a collection of documents could slightly change the interpretation of everything, that's how humans incorporate new information.
One interesting application of document-collection based chat bots is finding inconsistencies and contradictions in the source text. If they can do that, they can correctly incorporate new information.