Retro on Viberary (vickiboykis.com)
170 points by alexmolas on Jan 6, 2024 | 32 comments



Very interesting! Coincidentally, I just released my own semantic search engine personal project for Steam games in December.

It’s interesting to compare notes and implementation details; I made many of the same observations and faced very similar challenges in my own work, haha.

The most interesting challenge I found was touched on in this post: semantic search presented with the typical text-box input generally misleads users into treating it like a Google search.

Searches like “beautiful video game” and “sci-fi thriller” technically work, but they don’t perform as well as simply describing a beautiful video game or giving a plot overview of a sci-fi thriller.

Semantic search requires a different kind of query that many users don’t quite understand yet.

Maybe there’s an opportunity for an intermediary model to “translate” queries, like how some recent image-generation models take your prompt and rewrite it into a more detailed prompt that’s actually used for image generation.


This is a problem we're hitting as well. HyDE (Hypothetical Document Embeddings) is a zero-shot approach to it, where from the query you generate documents that would have a very small distance to the actual documents you're looking for. For question answering, this would mean generating hypothetical answers. In some cases, though, it can yield better results to generate hypothetical queries for each document, and question answering is actually one of them. That requires a lot of pre-processing, though, which is not always possible.
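
A minimal sketch of the HyDE flow, assuming an OpenAI-style chat model for the hypothetical document and a generic sentence-transformers encoder (both are placeholders, not a specific production setup):

```python
# HyDE, sketched:
# 1) ask an LLM to write a document that *would* answer the query,
# 2) embed that hypothetical document,
# 3) use its embedding (not the raw query's) for nearest-neighbor search.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

llm = OpenAI()  # assumes OPENAI_API_KEY is set
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def hyde_search(query: str, doc_embeddings, top_k: int = 10):
    # Generate a hypothetical answer/document for the query.
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Write a short passage that would answer: {query}",
        }],
    )
    hypothetical_doc = resp.choices[0].message.content

    # Embed the hypothetical document and search against the real corpus.
    query_emb = encoder.encode(hypothetical_doc, convert_to_tensor=True)
    return util.semantic_search(query_emb, doc_embeddings, top_k=top_k)[0]
```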


The Sentence Transformers documentation describes this problem as “asymmetric search”. Some of their models are fine-tuned to better handle short queries matched against longer documents. (https://www.sbert.net/examples/applications/semantic-search/...)
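
For example, a minimal sketch with one of the multi-qa models, which are trained for exactly this short-query-to-long-passage setup (the model choice is just an illustration):

```python
from sentence_transformers import SentenceTransformer, util

# multi-qa models are trained for asymmetric search: short queries
# matched against longer passages.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

passages = [
    "A sweeping sci-fi thriller about a generation ship losing contact with Earth.",
    "A cozy mystery set in a seaside bookshop.",
]
passage_embs = model.encode(passages, convert_to_tensor=True)

query_emb = model.encode("sci-fi thriller", convert_to_tensor=True)
hits = util.semantic_search(query_emb, passage_embs, top_k=5)[0]
for hit in hits:
    print(passages[hit["corpus_id"]], hit["score"])
```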

Alternatively, you could mean pool (average) the document embeddings for the most popular click results for a given query and use that as the query embedding.
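
A sketch of that second idea, assuming you already have per-document embeddings and a click log keyed by query (all the names here are hypothetical):

```python
import numpy as np

def query_embedding_from_clicks(query: str, click_log: dict, doc_embs: dict):
    """Mean-pool the embeddings of the documents users clicked for this query."""
    clicked_ids = click_log.get(query, [])
    if not clicked_ids:
        return None  # fall back to embedding the raw query text
    pooled = np.mean([doc_embs[doc_id] for doc_id in clicked_ids], axis=0)
    return pooled / np.linalg.norm(pooled)  # renormalize for cosine search
```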


By definition, semantic search works best by similarity. Thus, the interface you are looking for is one that facilitates selecting one or multiple objects (i.e., video games).


Yes, I agree. This is something I implemented in my project UI (a “more like this game” button, and the ability to specify a Steam appid to use as the query).

It works if there actually is a game out there you want similar games to, but the text input is useful for finding something when you don’t have an example offhand.


A semantic search engine for Steam games sounds like my jam. You should put a link to it in your HN profile!


I don't like to put current projects on my profile because I jump to new ones fairly regularly, so it would get out of date quickly. I'll go and put a link to my personal site on there, though.

You can see the project here: https://azstatic.danieltperry.me/steamvibes/build

I’m not 100% satisfied with the performance of the search, but it is what it is. I wanted to get what I had out there instead of perfecting it all, since there's a ton of different things I could still do, all with uncertain impact on the results.

I might switch to using a different embedding model in the future since the current one I'm using seems to be fairly dated by this point.


We also built a similar semantic search engine for retrieving open-source projects. We experimented with HyDE and with generating likely user queries from the corpus; both were very successful.

For embeddings, we chose BGE, but found it seemed to overfit on BEIR. CohereV3 and Voyage appeared to perform better in practice.

For retrieval, we used OpenSearch + Zilliz Cloud Serverless + Cohere reranking, which gave us maximum flexibility and search effectiveness.
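
To be concrete about the reranking stage, here is roughly how a Cohere rerank call slots in after first-stage retrieval (a minimal sketch; the model name is a placeholder, and the candidates would come from the keyword and vector retrievers):

```python
import cohere

co = cohere.Client("<api-key>")

def rerank(query: str, candidates: list[str], top_n: int = 10):
    # candidates = union of first-stage results (keyword + vector retrieval)
    resp = co.rerank(
        model="rerank-english-v3.0",  # placeholder model name
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [(candidates[r.index], r.relevance_score) for r in resp.results]
```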

Worth mentioning: building an evaluation dataset is important. It helped us improve quality.


I'd be curious to hear what people think is state of the art for this kind of problem. I think Cohere's v3 embedding model is very good, as it handles queries and documents differently to embed them in the same vector space. Otherwise, for a specific use case, I guess dense retrieval (which is basically a problem-specific, fine-tuned version of this approach) is the best way to go about it?
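
For context, that query/document asymmetry shows up in Cohere's v3 embed API as an input_type flag; a minimal sketch (not production code):

```python
import cohere

co = cohere.Client("<api-key>")

# Documents and queries are embedded with different input_type flags,
# but end up in the same vector space.
doc_embs = co.embed(
    texts=["A sprawling space opera about first contact with an alien probe."],
    model="embed-english-v3.0",
    input_type="search_document",
).embeddings

query_emb = co.embed(
    texts=["sci-fi thriller"],
    model="embed-english-v3.0",
    input_type="search_query",
).embeddings[0]
```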


> it handles queries and documents differently to embed them in the same vector space

I assume this might be behind the state of the art, then. As of a year ago, OpenAI already had an embedding model that can do both without any special flag or a dual model to treat queries and documents differently.


Handling queries and documents differently leads to an increase in overall retrieval quality.


If you're talking about embeddings model metrics, the MTEB Leaderboard is a good resource: https://huggingface.co/spaces/mteb/leaderboard


Yes, I had a look at it. I haven't tried e5-mistral-7b-instruct yet, but I'll definitely give it a go. Is there such a leaderboard focused only on retrieval, by any chance? I haven't found one so far.


The BEIR project might be what you're looking for: https://github.com/beir-cellar/beir/wiki/Leaderboard



It's interesting how a massive model like e5-mistral has only marginal performance gains over bge-base and similar ones. It could still be useful for its longer sequence length, though.


e5-mistral is essentially a distillation from GPT-4 into a smaller model. You can see here: https://github.com/microsoft/unilm/blob/16da2f193b9c1dab0a69...

They actually have custom prompts for each dataset being tested.

The question would be: if you haven't seen the task before, what is a good prompt to prepend for your task?
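
(For reference, the model card's suggested format, as I understand it, is an instruction prefix on the query side only, with documents embedded as-is; a generic instruction like this is the usual fallback for unseen tasks:)

```python
# e5-mistral-7b-instruct convention (per the model card): only queries get an
# instruction prefix; documents are embedded without one.
task = "Given a web search query, retrieve relevant passages that answer the query"
query = "sci-fi thriller with a twist ending"

prompted_query = f"Instruct: {task}\nQuery: {query}"
```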

IMO e5-mistral is overfit to MTEB


> I'd be curious to hear what people think is state of the art for this kind of problem.

I would assume Amazon's product suggestions are SotA for recommending a book based on another. It's a recommendation system, and while it uses semantic search, it also relies on many more ranking signals.


The problem with systems like that is that they're going to be optimized for making Amazon the most money, not necessarily for what's best for the user (unless the two happen to coincide).


Money is a proxy for user value and it is in Amazon's best interest for the customer to continually act upon these recommendations. If Amazon fails to deliver user value customers will be less willing to continue paying.


> Money is a proxy for user value and it is in Amazon's best interest for the customer to continually act upon these recommendations. If Amazon fails to deliver user value customers will be less willing to continue paying.

You're talking about the same Amazon that will blindly give me recommendations for other products that fill exactly the same niche as the one I just bought.

"Oh, you just bought a generator? You probably need a second one too, right?"


lol. lmao.


Very nice, comprehensive article by the author. Lots of great lessons shared on building a search system, DevOps, and hosting an application publicly.


One issue I always run into when implementing these approaches is the embedding model's context window being too small to represent what I need.

For example, on this project, looking at the generation of training data [1], it seems like what's actually being generated are embeddings of a string concatenated from the reviews, title, description, etc. [2]. With max_seq_length set to 200, wouldn't lengthy book reviews result in the book description text never being encoded? And wouldn't that result in queries not matching potentially similar descriptions if the reviews are topically dissimilar (e.g., discussing the author's style, the book's flow, etc., instead of the plot)?

[1] https://github.com/veekaybee/viberary/blob/main/src/model/ge... [2] https://github.com/veekaybee/viberary/blob/main/src/model/ge...
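
A quick way to see the issue, sketched with a generic sentence-transformers model (not necessarily the one Viberary uses): anything past max_seq_length is silently truncated at encode time, so the field order in the concatenated string decides what survives.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
model.max_seq_length = 200  # tokens beyond this are dropped at encode time

long_reviews = "lots of review text about pacing and prose style " * 50
description = "A locked-room mystery aboard a generation ship."

# With reviews first, the description likely falls past token 200
# and never makes it into the embedding.
text = " ".join([long_reviews, description])
n_tokens = len(model.tokenizer(text)["input_ids"])
print(n_tokens)  # far more than 200 -> the tail (the description) is truncated
```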


I have the same problem with a project I'm working on. In my case, I'm chunking the documents and encoding the chunks, then doing semantic search over the embeddings of the chunked documents. It has some drawbacks, but it's the best approach I could think of.
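
A bare-bones version of that approach, assuming fixed-size word chunks and cosine search over the chunk embeddings (mapping the best chunk back to its parent document is where the drawbacks show up):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model

def chunk(text: str, size: int = 150, overlap: int = 30):
    # Overlapping fixed-size word windows; real systems often chunk by sentences.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

docs = {"doc1": "long document text ...", "doc2": "another long document ..."}
chunks, owners = [], []
for doc_id, text in docs.items():
    for c in chunk(text):
        chunks.append(c)
        owners.append(doc_id)  # remember which document each chunk came from

chunk_embs = model.encode(chunks, convert_to_tensor=True)
query_emb = model.encode("sci-fi thriller", convert_to_tensor=True)
hits = util.semantic_search(query_emb, chunk_embs, top_k=5)[0]
results = [(owners[h["corpus_id"]], h["score"]) for h in hits]  # chunk -> parent doc
```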


How would you build a semantic search engine that can also take rules, e.g. "AI papers with the word 'transformer' in them."


You would probably first tag the papers (AI papers being tagged “AI”), then generate, for instance, an Elasticsearch query with an LLM that matches papers with the AI tag and the word “transformer”.


You can first do a keyword-based search (inverted index, etc.) and then do semantic search over the results of the first step.
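
A rough sketch of that two-stage idea: a cheap keyword filter first, then semantic ranking over only the survivors (the in-memory substring filter here is a stand-in for a real inverted index):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model

def hybrid_search(query: str, keyword: str, papers: list[str], top_k: int = 10):
    # Stage 1: hard keyword filter (a real system would use Elasticsearch etc.).
    candidates = [p for p in papers if keyword.lower() in p.lower()]
    if not candidates:
        return []
    # Stage 2: semantic ranking of the filtered candidates.
    query_emb = model.encode(query, convert_to_tensor=True)
    cand_embs = model.encode(candidates, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, cand_embs, top_k=top_k)[0]
    return [(candidates[h["corpus_id"]], h["score"]) for h in hits]

# e.g. hybrid_search("papers about attention-based architectures", "transformer", corpus)
```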


Is there a similar search model that can find books based not on a vibe, but on a few descriptions of their content, from specific scenes to a more general description of the plot?


Well, after leaving behind the failed Web 3.0 semantic hype [1], I find stuff such as ChatGPT and LLMs already working. For example, today I started looking into buying a good-quality TV stand. I began with the classic series of Google searches, with the typical ads and compiled lists on top. After half an hour of feeling like I was running in circles (the results were all about the same companies), I fired up ChatGPT, asked for a list of places, asked it to expand the list three times, and voila! All the concepts I remembered from Web 3.0, RDF, triples, Freebase [2], etc. turned out to be unnecessary for reaching Web 3.0.

[1] https://en.wikipedia.org/wiki/Web_3.0

[2] https://en.wikipedia.org/wiki/Freebase_(database)


Semantic search as described in the article is very dissimilar to querying ChatGPT.


You should really try Deft:

https://shopdeft.com/



