Do you simply paste the entire content of a webpage inside a prompt? You prepend...

tracyhenry · on April 1, 2023

I guess not. It's probably an offline process where they scrape the websites into chunks and build embeddings. Then, at query time, they first search for the relevant chunks and put those chunks into the prompt?

Would love more details though from the author!
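For readers following along, the flow guessed at above (embed chunks offline, search them at query time, put the hits into the prompt) might look roughly like this. It is a minimal sketch, not the author's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and `build_index`, `retrieve`, and `build_prompt` are hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in (assumption) for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Offline step: embed every scraped chunk once and keep the vectors."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Query-time step: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(context_chunks: list[str], question: str) -> str:
    """Put only the retrieved chunks into the prompt, not the whole site."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    index = build_index([
        "Our pricing starts at ten dollars per month.",
        "The office is located in Berlin.",
        "Refunds are available within thirty days.",
    ])
    question = "What is the pricing per month?"
    print(build_prompt(retrieve(index, question, k=1), question))
```

In a real system the count vectors would be replaced by model embeddings stored in a vector index, but the offline/query-time split stays the same.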

pbteja1998 · on April 1, 2023

Yes, you are right. It's not possible to give the entire content in a prompt. A user's site can have a lot of pages, and each page can potentially be very long.

robertd7 · on April 2, 2023

You need to split content into chunks but also retain semantic meaning. I have been building the same thing: https://chatfast.io. It also supports scanned PDFs and images. (I built my own algorithm, with a backend in C#, not using llama_index or langchain.)
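One simple way to split content while keeping related sentences together is to pack whole paragraphs into chunks instead of cutting at a fixed character offset. The sketch below illustrates that general idea only; it is not the custom algorithm mentioned above, and `chunk_by_paragraph` and its `max_words` parameter are hypothetical.

```python
import re


def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A paragraph longer than max_words becomes its own oversized chunk
    rather than being cut mid-thought.
    """
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_words + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on headings or sentence boundaries follows the same pattern; the point is that chunk borders fall where the text already pauses.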

battybro0034 · on April 16, 2023

Why did you need to create your own algorithm? And how much time did it take?

pbteja1998 · on April 1, 2023

Not exactly. Given that one of my plans has 5000 pages, it is not possible to just paste the entire content into a prompt. The OpenAI API has a maximum token limit.

I will first do some pre-processing on the content and fetch the relevant pieces of content before giving it as a prompt to the API.
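To make the token-limit constraint concrete, here is a rough sketch of trimming already-ranked pieces of content to a budget before they go into the prompt. The 4-characters-per-token estimate is a crude assumption for English text (a real tokenizer would be more accurate), and `estimate_tokens` and `fit_to_budget` are hypothetical names, not part of any API.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def fit_to_budget(ranked_chunks: list[str], max_tokens: int) -> list[str]:
    """Keep chunks in ranked order, skipping any that would exceed the budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; try the next chunk
        selected.append(chunk)
        used += cost
    return selected
```

The budget passed in should leave room for the question and for the model's answer, since both count against the same limit.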

tracyhenry · on April 1, 2023

How do you decide what content on the page to index, and how do you split it to fit the context window?

Amazing concept btw - would love to see more examples (like a chatbot for a more well-known site).

amrrs · on April 2, 2023

It's pretty straightforward with LangChain and GPT-Index. There are a lot of tutorials on the Internet for this, like this one: https://youtu.be/9TxEQQyv9cE

visarga · on April 2, 2023

I don't think chunking + embedding based retrieval is good enough. It's a good first draft for a solution, but the chunks are out of context, so the LLM could combine them in an unintended way.

Better to question each document separately and then combine the answers in one last LLM round. Even so, there might be inter-document context that is lost, for example when looking at one document that depends on details in another one. Large collections of documents should be loaded up in multiple passes, as the interpretation of a document can change when encountering information in another document. Adding one single piece of information to a collection of documents could slightly change the interpretation of everything; that's how humans incorporate new information.
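The "question each document separately, then combine" idea can be sketched as a two-step map and reduce. This is only an illustration under assumptions: `llm` is any caller-supplied function that takes a prompt string and returns text, and `answer_per_document` and `combine_answers` are hypothetical names. It does not address the multi-pass, cross-document interpretation problem raised above.

```python
from typing import Callable


def answer_per_document(
    documents: dict[str, str], question: str, llm: Callable[[str], str]
) -> dict[str, str]:
    """Map step: ask the same question against each document on its own."""
    return {
        name: llm(f"Document:\n{text}\n\nQuestion: {question}")
        for name, text in documents.items()
    }


def combine_answers(
    partial: dict[str, str], question: str, llm: Callable[[str], str]
) -> str:
    """Reduce step: one final LLM round over the per-document answers."""
    notes = "\n".join(f"- {name}: {answer}" for name, answer in partial.items())
    return llm(f"Per-document answers:\n{notes}\n\nCombine them to answer: {question}")
```

This costs one LLM call per document plus one final call, so in practice it would likely be applied to a retrieved shortlist rather than to every page of a site.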

One interesting application of document-collection based chat bots is finding inconsistencies and contradictions in the source text. If they can do that, they can correctly incorporate new information.

pbteja1998 · on April 1, 2023

I index everything. I don't pick and choose. Like I said, I do pre-processing to scrape the entire website content.

When the user asks, I try to get the relevant bits and answer the question based on that.