Could anyone point me towards a relatively beginner-friendly guide to do somethi...

diego · on March 30, 2023

That's exactly what I did here. https://github.com/dbasch/semantic-search-tweets

mustacheemperor · on March 30, 2023

Thank you! Comparing this and the link the other commenter posted, what handles the actual search querying? Does instructor-xl include an LLM in addition to the embeddings? The other commenter's repo uses Pinecone for the embeddings and OpenAI for the LLM.

My apologies if I am completely mangling the vocabulary here - I have an, at best, rudimentary understanding of this stuff that I am trying to hack my education on.

Edit: If you're at the SF meetup tomorrow, I'd happily buy you a beverage in return for this explanation :)

eternalban · on March 30, 2023

It's in the repo:

You first create embeddings. What is this? It's an n-dimensional vector space with your tweets 'embedded' in that space. Each word is an n-dimensional vector in this space. The vectorization is supposed to maintain 'semantic distance'. Basically, if two words are very close in meaning or related (by say frequently appearing next to each other in corpus) they should be 'close' in some of those n-dimensions as well. The result at the end is the '.bin' file, the 'semantic model' of your corpus.

https://github.com/dbasch/semantic-search-tweets/blob/main/e...

For semantic search, you run the same embedding algorithm against the query and take the resultant vectors and do similarity search via matrix ops, resulting in a set of results, with probabilities. These point back to the original source, here the tweets, and you just print the tweet(s) that you select from that result set (here the top 10).

https://github.com/dbasch/semantic-search-tweets/blob/main/s...

Experts can chime in here but there are knobs such as 'batch size' and the functions you use to index. (cosine was used here.)

So the various performance dimensions of the process should also be clear. There is a fixed cost of making the embeddings of your data. There is a per-op embedding of your query, and then running the similarity algorithm to find the result set.

mustacheemperor · on March 30, 2023

Thank you for this walkthrough, and for citing the code alongside!

eternalban · on March 30, 2023

celestialcheese · on March 30, 2023

langchain and llama-index are two big opensource projects which are great for buildign this type of thing.

https://github.com/mayooear/gpt4-pdf-chatbot-langchain for example

mustacheemperor · on March 30, 2023

Cheers, thank you!

sroussey · on March 30, 2023

I have a system that download all my data from Google, Facebook, Twitter, and others. Geo data is fun to look at, but now the text and images have some more meaning to gleam. I’m thinking about going back to it. Not sure how to package a bunch of python stuff in an app though.

devxpy · on March 30, 2023

https://gooey.ai/doc-search/?example_id=8ls7dpf6

No code needed :)

jeadie · on March 30, 2023

You could try https://github.com/marqo-ai/marqo