
Not sure if I can answer those questions.

I'm quite new to the field myself. Many more experienced people, including the OP, I believe, would suggest that you use sentence transformers for this task. I personally don't understand how those are trained or fine-tuned, and I have never done it myself so far. What I do know is that sentence transformer results are horrible if you use them out of the box on anything other than the domain in which they were trained.

There's also the question of the compute and RAM necessary to generate the embeddings. In one of my own projects, vectorizing 40M text snippets with sentence transformers would have taken 30 days on my desktop. So all I have worked with so far is fastText. It's several hundred times faster than any BERT model, and it can be trained quite easily, from scratch, on your own domain corpus. Again, this may be an outdated technique, but it's the one thing where I have some practical experience. And it does work quite well for building a semantic search engine that is fast and does not require a GPU.
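For illustration, training fastText from scratch with the official Python bindings looks roughly like this (file name and hyperparameters are placeholders; tune them for your corpus):

    import fasttext

    # corpus.txt: plain text, one preprocessed snippet per line (hypothetical file)
    model = fasttext.train_unsupervised(
        "corpus.txt",
        model="skipgram",  # or "cbow"
        dim=100,           # embedding size
        epoch=5,
        minCount=5,        # ignore very rare tokens
    )
    model.save_model("domain_vectors.bin")

    # quick sanity check that the vectors picked up your domain vocabulary
    print(model.get_nearest_neighbors("enzyme"))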

The problem with fastText is that it only creates embeddings for individual words. You can use it to generate an embedding for a whole sentence or paragraph, but internally it will just generate embeddings for all the words in the text and average them. That doesn't give you a very good representation of what the snippet is about, because common words like prepositions get just as much weight as the actual keywords in the sentence.
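To make that concrete, here is a minimal sketch of such equal-weight averaging (fastText's built-in sentence vectors are similar in spirit), reusing the hypothetical model from above:

    import numpy as np
    import fasttext

    model = fasttext.load_model("domain_vectors.bin")

    def naive_snippet_vector(snippet):
        # every token gets the same weight, so "the" and "of" dilute the keywords
        words = snippet.lower().split()
        return np.mean([model.get_word_vector(w) for w in words], axis=0)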

Some smart people, however, figured out that you can pair BM25 with fastText: with BM25, you essentially build a document-frequency dictionary over your corpus that tells you which words are rare, and therefore especially meaningful, and which are commonplace. Then you let fastText generate embeddings for all the words in your snippet. But instead of averaging them with equal weights, you use the BM25 scores as your weights. Words that are rare and special thus get greater influence on the vector than commonplace words.
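A rough sketch of that weighting scheme, assuming a tokenized corpus and the fastText model from above (the exact BM25/IDF variant and smoothing are up to you):

    import math
    from collections import Counter
    import numpy as np
    import fasttext

    model = fasttext.load_model("domain_vectors.bin")
    corpus = [s.lower().split() for s in snippets]  # snippets: your list of text snippets

    # document frequency: in how many snippets does each word occur?
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    N = len(corpus)

    def idf(word):
        # BM25-style IDF: rare words get large weights, common words near zero
        n = df.get(word, 0)
        return math.log(1.0 + (N - n + 0.5) / (n + 0.5))

    def weighted_snippet_vector(tokens):
        # weighted average instead of a plain mean
        vecs = np.array([model.get_word_vector(w) for w in tokens])
        weights = np.array([idf(w) for w in tokens])
        return (weights[:, None] * vecs).sum(axis=0) / (weights.sum() + 1e-9)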

FastText understands no context, and it does not understand affirmation or negation. All it will do is find you the snippets that use either the same words as your query, or words that are often used in place of your query words within your corpus. But I find that is already a big improvement over mere keyword search, or even BM25, because it will find snippets that use different words but talk about the same concept.

Since fastText is computationally "cheap", you can afford to split your documents into several overlapping sets of snippets: whole paragraphs, 3-sentence, 2-sentence, and 1-sentence windows, for instance. At query time, if two results overlap, you just display the one that ranks highest.
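A sketch of that multi-granularity splitting; I'm assuming NLTK's sentence tokenizer here, but any sentence splitter will do:

    from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt") once

    def sentence_windows(sentences, size):
        # overlapping windows of `size` consecutive sentences
        if len(sentences) < size:
            return []
        return [" ".join(sentences[i:i + size]) for i in range(len(sentences) - size + 1)]

    def snippets_for(paragraph):
        sentences = sent_tokenize(paragraph)
        snippets = [paragraph]  # the whole paragraph as one snippet
        for size in (3, 2, 1):
            snippets.extend(sentence_windows(sentences, size))
        return snippets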

Personally, I would imagine that document-level search wouldn't be very satisfying for the user. We're so used to Google finding us not only the page we're looking for, but also the position within the page where the content we're after can be found. With a scientific article, it would be painful to have to scroll through the entire thing and skim it just to find out whether it actually answers our query or not.

And with sentence-level search, you'd be missing out on all the context in which the sentence appears.

As the low-hanging fruit, why not go with the units chosen by the authors of the articles, i.e. the paragraphs? Even if those vary wildly in length, it would be a starting point. If you then find that the results are too long or too short, you can make adjustments: for snippets that are too short, you could "pull in" the next one, and those that are too long you could split, perhaps even with overlap. I think for both the human end user and most NLP models, anything the length of a tweet, or around 200 characters, is about the sweet spot. If you can find a way to split your documents into such units, you'll probably do well, regardless of which technology you end up using.
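One possible way to normalize paragraph lengths along those lines; the thresholds and the merge/split logic here are just my own hypothetical starting values:

    def normalize_snippets(paragraphs, min_len=100, max_len=300, overlap=50):
        # merge too-short paragraphs into the next one; split too-long ones in half with overlap
        out, i = [], 0
        while i < len(paragraphs):
            p = paragraphs[i]
            if len(p) < min_len and i + 1 < len(paragraphs):
                out.append(p + " " + paragraphs[i + 1])
                i += 2
            elif len(p) > max_len:
                mid = len(p) // 2
                out.append(p[:mid + overlap])
                out.append(p[mid - overlap:])
                i += 1
            else:
                out.append(p)
                i += 1
        return out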

You can also check out Weaviate. If you use Weaviate, you don't have to worry about creating embeddings at all; you just focus on how you split and structure your material. You could have, for instance, an index called "documents" and an index called "paragraphs". The former would contain things like the publication date, the names of the authors, etc. The latter would contain the text of each paragraph, along with its position in the document. Then you can ask Weaviate to find you the paragraphs that are semantically closest to query XYZ, and to also tell you which articles they belong to.
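Roughly what that looks like with the v3 Weaviate Python client; the class and property names are just examples, and the exact calls may differ in newer client versions:

    import weaviate

    client = weaviate.Client("http://localhost:8080")

    client.schema.create_class({
        "class": "Document",
        "properties": [
            {"name": "title", "dataType": ["text"]},
            {"name": "authors", "dataType": ["text[]"]},
        ],
    })
    client.schema.create_class({
        "class": "Paragraph",
        "properties": [
            {"name": "text", "dataType": ["text"]},
            {"name": "position", "dataType": ["int"]},
            {"name": "inDocument", "dataType": ["Document"]},  # cross-reference to the parent article
        ],
    })

    # semantic query: paragraphs closest to "XYZ"
    result = (
        client.query
        .get("Paragraph", ["text", "position"])
        .with_near_text({"concepts": ["XYZ"]})
        .with_limit(10)
        .do()
    )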

You can also add negative "weights", i.e. search for paragraphs talking about "apple", but not in the context of "fruit" (useful if you're searching for Apple, the company).
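That negative weighting maps onto the nearText "moveAwayFrom" option; a sketch with the same client, where the force value is an arbitrary example:

    result = (
        client.query
        .get("Paragraph", ["text"])
        .with_near_text({
            "concepts": ["apple"],
            "moveAwayFrom": {"concepts": ["fruit"], "force": 0.45},  # push results away from the fruit sense
        })
        .with_limit(10)
        .do()
    )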

These things work out of the box in Weaviate. Also, you can set up Weaviate with GloVe (a static word-embedding approach similar in spirit to fastText), or you can set it up with transformers. So you can try both approaches without actually having to train or fine-tune the models yourself. With the transformers setup, there is also a Q&A module that will search within your snippets and find the substring it thinks is the answer to your question.
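If the qna-transformers module is enabled, the question-answering part looks roughly like this; this is my sketch from memory, so check the Weaviate docs for the exact response shape:

    result = (
        client.query
        .get("Paragraph", ["text", "_additional { answer { result certainty } }"])
        .with_ask({"question": "What year was the company founded?", "properties": ["text"]})
        .with_limit(1)
        .do()
    )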

I have successfully run Weaviate on a $50 DigitalOcean droplet for sub-second semantic queries on a corpus of 20+M text snippets. Only for getting the data into the system did I need a more powerful server. But you can run the ingestion on a bigger machine, and once it's done, just move things over to the smaller one.




Thank you for the write-up. As far as my research has gone, that is the best description I have seen of how to go about planning for vectorization. Undoubtedly, experimentation on our corpus is required, but it helps to have an overview so we don't wildly run down the wrong paths early on.


Exactly. Training models, generating embeddings, and building databases can all take days to run and cost hundreds of dollars in server fees. It's painful to have done all that only to realize that you've gone down the wrong path. It pays to test your pipeline on a smaller batch first; most likely you will discover some issues in your process, and that iterative cycle is much faster if you work with a small test set and do not switch over to your main corpus prematurely.



