Hacker News
Show HN: Semantic Grep – A Word2Vec-powered search tool (github.com/arunsupe)
356 points by arunsupe 50 days ago | 57 comments
Much improved new version. Search for words similar to the query. For example, "death" will find "death", "dying", "dead", "killing"... Incredibly useful for exploring large text datasets where exact matches are too restrictive.



Some small tips from superficially reading the code:

https://github.com/arunsupe/semantic-grep/blob/b7dcc82a7cbab...

You can read the vectors all at once. See e.g.:

https://github.com/danieldk/go2vec/blob/ee0e8720a8f518315f35...

---

https://github.com/arunsupe/semantic-grep/blob/b7dcc82a7cbab...

You can compute the similarity much faster by using BLAS. Good BLAS libraries have SIMD-optimized implementations. Or if you process multiple tokens at once, you can do a matrix-vector multiplication (sgemv), which will be even faster in many implementations. Alternatively, there are probably SIMD implementations for Go written in assembly (it has been 7 years since I looked at anything in the Go ecosystem).

You could also normalize the vectors while loading. Then during runtime the cosine similarity is just the dot product of the vectors (whether it pays off depends on the size of your embedding matrix and the size of the haystack that you are going to search).


SIMD situation in Go is still rather abysmal, it’s likely easier to just FFI (though FFI is very slow so I guess you're stuck with ugly go asm if you are using short vectors). As usual, there's a particular high-level language that does it very well, and has standard vector similarity function nowadays...


Go ahead and share the language, it is good etiquette for HN :)


Actually, there are three:

C#: https://github.com/dotnet/runtime/blob/main/docs/coding-guid... (and full family of other types: Vector2/3/4, Matrix3x2/4x4 and upcoming Tensor<T>), vector similarity I was talking about is this: https://learn.microsoft.com/en-us/dotnet/api/system.numerics..., it uses a highly optimized SIMD kernel, for DotProduct just use an adjacent method

Swift: https://developer.apple.com/documentation/swift/simd-vector-... it is also a competent language at portable SIMD by virtue of using LLVM and offering almost the same operator-based API (e.g. masked = vec1 & ~vec2) as C#

Mojo: https://docs.modular.com/mojo/stdlib/builtin/simd which follows the above two; it too targets LLVM, so expect good SIMD codegen as long as the lowering strategy does it in an LLVM-friendly way, which I have not looked at yet.


Not Julia?


Julia is less "general-purpose" and I know little about the quality of its codegen (it does target LLVM but I haven't seen numbers that place it exactly next to C or C++ which is the case with C#).

Mojo team's blog posts do indicate they care about optimal compiler output, and it seems to have ambitions for a wider domain of application which is why it is mentioned.


That's totally clever and sounds really useful. And it's one of those ideas where you go "Why didn't I think of that?" when stumbling over the materials, word2vec in this case.


> Why didn't I think of that

Well, Apple did experiment with embeddings for next word prediction on iOS in 2018 (apparently, they also used it for many other features)! https://machinelearning.apple.com/research/can-global-semant... / https://archive.is/1oCwr


This is a good idea. I'm going to offer some unsolicited feedback here:

The configuration thing is unclear to me. I think that "current directory" means "same directory as the binary", but it could mean pwd.

Neither of those is good: configuration doesn't belong where the binaries go, and it's obviously wrong to look for configs in the working directory.

I suggest checking $XDG_CONFIG_HOME, and defaulting to `~/.config/sgrep/config.toml`.

That extension is not a typo, btw. JSON is unpleasant to edit for configuration purposes, TOML is not.

Or you could use an ENV variable directly, if the only thing that needs configuring is the model's location, that would be fine as well.

If that were the on ramp, I'd be giving feedback on the program instead. I do think it's a clever idea and I'd like to try it out.


I take a Show HN to be a solicitation for feedback.

It took me a few tries to get this to run. I tried passing -model_path, but it complained that it couldn't find config.json. I then made a config.json in the pwd (which contained the executable), but it couldn't expand "~". I then tried searching *.txt, but it couldn't find anything unless I specified just one file.

In terms of offering an experience similar to grep, it's a lot slower. It's not quite as slow with the SLIM file, but I wonder if there's some low-hanging fruit for optimization. Perhaps caching similarity for input tokens?
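That caching idea is cheap to try: each distinct token's similarity to the query only needs computing once per query. A hypothetical illustration, where `sim` stands in for the real model lookup plus cosine:

```go
package main

import (
	"fmt"
	"strings"
)

// simCache memoizes query-to-token similarity, so repeated words
// in the haystack cost one map lookup instead of a full vector
// comparison.
type simCache struct {
	sim   func(query, token string) float64 // stand-in for model lookup + cosine
	query string
	seen  map[string]float64
}

func (c *simCache) get(token string) float64 {
	if s, ok := c.seen[token]; ok {
		return s
	}
	s := c.sim(c.query, token)
	c.seen[token] = s
	return s
}

func main() {
	calls := 0
	c := &simCache{
		// Toy similarity: exact match scores 1, everything else 0.
		sim: func(q, t string) float64 {
			calls++
			if q == t {
				return 1
			}
			return 0
		},
		query: "death",
		seen:  map[string]float64{},
	}
	for _, tok := range strings.Fields("dying death dying death death") {
		c.get(tok)
	}
	fmt.Println("model calls:", calls) // 2 distinct tokens, not 5
}
```

Since natural-language text repeats a small vocabulary heavily, the hit rate on a cache like this should be high for large inputs.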


The model of a word to a vector breaks down really quickly once you introduce the context and complexity of human language. That's why we moved to contextual embeddings, but even they have issues.

Curious whether it would handle negation of trained keywords, e.g. "not urgent"?


For a lot of cases, word2vec/glove still work plenty well. It also runs much faster and lighter when doing development -- the FSE library [1] does 0.5M sentences/sec on CPU, whereas the fastest sentence_transformers [2] do something like 20k sentences/sec on a V100 (!!).

For the drawbacks:

Word embeddings are only good at similarity search style queries - stuff like paraphrasing.

Negation they'll necessarily struggle with. Since word embeddings are generally summed or averaged into a sentence embedding, a negation won't shift the sentence vector around the way it would in a LM embedding.

Also things like homonyms are issues, but this is massively overblown as a reason to use LM embeddings (at least for Latin/Germanic languages).

Most people use LM embeddings because they've been told it's the best thing by other people rather than benchmarking accuracy and performance for their usecase.

1. https://github.com/oborchers/Fast_Sentence_Embeddings

2. https://www.sbert.net/docs/sentence_transformer/pretrained_m...


This implementation does not, because the query has to be a word. One way to extend it to phrases is to average the vectors of each word in the phrase. Another way is to have a word2vec model that embeds phrases. (The large Google News model, I think, does have phrases, with spaces converted to underscores.) But going from words to phrases opens up a whole new can of worms - how large a phrase to consider from the input stream, etc. Plus, I don't think averaging words in the phrase is the same as learning the embedding for the phrase. Sentence embedding models are necessary for that, but they are far too slow for this use case, as pointed out by others.

To summarize, this is a simple implementation that works for the simplest use case - semantic matching of words.
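A hypothetical sketch of that averaging extension, over a toy in-memory model (a real version would look words up in the loaded word2vec matrix):

```go
package main

import "fmt"

// phraseVector averages the embeddings of the known words in a
// phrase; words missing from the model are skipped.
func phraseVector(model map[string][]float32, words []string) []float32 {
	var out []float32
	n := 0
	for _, w := range words {
		v, ok := model[w]
		if !ok {
			continue // out-of-vocabulary word, e.g. "the" below
		}
		if out == nil {
			out = make([]float32, len(v))
		}
		for i, x := range v {
			out[i] += x
		}
		n++
	}
	for i := range out {
		out[i] /= float32(n)
	}
	return out
}

func main() {
	model := map[string][]float32{ // toy 2-d "embeddings"
		"kick":   {1, 0},
		"bucket": {0, 1},
	}
	fmt.Println(phraseVector(model, []string{"kick", "the", "bucket"})) // prints: [0.5 0.5]
}
```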


Yep, it would almost be easier to:

- check to see if the search query contains more than one word

- spin up a more modestly sized LLM, such as mistral 7b

- ask the LLM to try to condense / find single word synonym for the user query

- send to sgrep


Definitely a limitation of word2vec as applied here.

Something like SBERT addresses this.


I wonder if it would be possible to easily add support for multiple CPUs? It seems to be using at most 150% CPU, so on my workstation it could be (assuming high parallelism) 10 times as fast.
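A worker-pool sketch of what that could look like; this is hypothetical, not how semantic-grep is actually structured, and `score` stands in for the per-line tokenize-and-compare step:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// scoreLines fans line-scoring work out across all CPUs. Each
// worker writes only its own output index, so no further locking
// is needed and input order is preserved.
func scoreLines(lines []string, score func(string) float64) []float64 {
	out := make([]float64, len(lines))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out[i] = score(lines[i])
			}
		}()
	}
	for i := range lines {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	lines := []string{"a", "bb", "ccc"}
	got := scoreLines(lines, func(s string) float64 { return float64(len(s)) })
	fmt.Println(got) // order matches the input, regardless of worker scheduling
}
```

Preserving input order matters for a grep-like tool, which is why each worker writes into a fixed slot rather than appending to a shared slice.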

Alas the word2vec repository has reached its quota:

    fetch: Fetching reference refs/heads/master
    batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
    error: failed to fetch some objects from 'https://github.com/mmihaltz/word2vec-GoogleNews-vectors.git/info/lfs'

So here is another source I found for it: https://stackoverflow.com/a/43423646

I also found https://huggingface.co/fse/word2vec-google-news-300/tree/mai... but I'm unsure if that's the correct format for this tool. The first source, from Google Drive, seems to work, and there's little chance of it being malicious.


The model in Google drive is the official model from Google and will work.

Haven't tried the huggingface model, but it looks very different. Unlikely to work.


This would be really useful if it could take a descriptive phrase or a compound phrase (like SQL 'select X and Y and Z') and match against the semantic cluster(s) that the query forms. IMO that's the greatest failing of today's search engines -- they're all one hit wonders.


Fyi, there is already a widely used tool along with a company called semgrep, which stems from semantic grep: https://semgrep.dev/.


The similarities between semgrep and the linked tool start and end with the name.


Yes, you are right.


God, the amount of enterprise speak in there is overwhelming.


Because of how it is marketed. But it is real and very much works.


And you can just go to its github to avoid the product side.


Similar: https://github.com/moritztng/fltr

Like grep but for natural language questions. Based on Mistral LLMs.


Very cool!

Do I understand correctly that this works by splitting each line into words, and using the embedding for each word?

I wonder whether it might be feasible to search by semantics of longer sequences of text, using some language model (like, one of the smaller ones, like GPT2-small or something?). Like, so that if you were searching for “die”, then “kick the bucket” and “buy the farm”, could also match somehow? Though, I’m not sure what vector you would use to do the dot product with, when there is a sequence of tokens, each with associated key vectors for each head at each layer, rather than a single vector associated with a word.. Maybe one of the encoder-decoder models rather than the decoder only models?

Though, for things like grep, one probably wants things to be very fast and as lightweight as feasible, which I imagine is much more the case with word vectors (as you have here) than it would be using a whole transformer model to produce the vectors.

Maybe if one wanted to catch words that aren’t separated correctly, one could detect if the line isn’t comprised of well-separated words, and if so, find all words that appear as a substring of that line? Though maybe that would be too slow?


I wanna meet the person who greps die, kick the bucket and buy the farm lol

Are models like Mistral there yet in terms of tokens per second to run a grep over millions of files?


Mistral has published large language models, not embedding models? sgrep uses Google's Word2Vec to generate embeddings of the corpus and perform similarity searches on it, given a user query.


No, I got that. I asked because wouldn't embeddings generated by fine-tuned transformer-based LLMs be more context aware? Idk much about the internals, so apologies if this was a dumb thing to say.


embeddings come in handy to augment LLMs [0], but as you suspect, some try LLMs themselves as an outright embedding model with varying degrees of success: https://www.reddit.com/r/LocalLLaMA/comments/12y3stx/embeddi... / https://huggingface.co/spaces/mteb/leaderboard

[0] https://simonwillison.net/2023/Oct/23/embeddings/


Document search, an LLM use case almost all companies want for their white-collar workers, currently follows the creation of a RAG pipeline.

Vector search is the first step of the process, and I argue that getting the top-n results and letting the user process them is probably the best of both worlds without involving "AI".

This can be enhanced with multilingual support depending on the encoder used. But building the index is still an expensive process, and I wonder how that could be done fast for a local user.


You might be interested in Semantra: https://github.com/freedmand/semantra


This tool seems really cool, want to play with it for sure.

Just in general, semantic search across text seems like a much better ux for many applications. Wish it was more prevalent.


Really cool. Often I just want to fuzzy search for a word, and this would be useful. Can it do filenames as well? Or do I need to pipe in something like ls first?


I might have a work use case for which this would be perfect.

Having no experience with word2vec, some reference performance numbers would be great. If I have one million PDF pages, how long is that going to take to encode? How long will it take to search? Is it CPU only or will I get a huge performance benefit if I have a GPU?


As someone working extensively with word2vec: I would recommend setting up Elasticsearch. It has support for vector embeddings, so you can process your PDF documents once, write the word2vec embeddings and PDF metadata into an index, and search that in milliseconds later on. Doing live vectorisation is neat for exploring data, but using Elasticsearch will be much more convenient in actual products!


I would personally vote for Postgres and one of the many vector indexing extensions over Elasticsearch. I think Elasticsearch can be more challenging to maintain. Certainly a matter of opinion though. Elasticsearch is a very reasonable choice.


Elasticsearch is a pain to maintain, the docs are all over the place, and the API is what you end up with if developers run free and implement everything that jumps to their mind.

But there just isn’t anything comparable when it comes to building your own search engine. Postgres with a vector extension is good if all you want to do is a vector search and some SQL (not dismissing it here, I love PG); but if you want more complex search cases, Elasticsearch is the way to go.


Vector stuff doesn’t take much. It’s essentially the same computational time as regular old text search.


No. Look at the implementation. A GPU isn't going to give you huge gains in this one.


Very cool, led me to find https://www.cs.helsinki.fi/u/jjaakkol/sgrep.html and semgrep is taken, so another symlink it is: w2vgrep?


Very nice! How might one go about adapting this to other languages? Does a downloadable version of the model exist somewhere?


I would think translation to other languages would be trivial. The model is just a map of word to vector. Every word is converted to its vector representation; then the query word is compared to the input words using cosine similarity.


I assumed you meant computer languages :-) If you mean human languages, yes Google publishes word2vec embeddings in many different human languages. Not sure though how easy it is to download.


Added support for multiple languages, using fasttext's embeddings


How fast is it?


Hello Arun. Does it use cosine similarity for the matching?


Correct.


Just played around with it; not very smart per se. To get the full power of semantic grep, an LLM still might be needed. How to do it? Is RAG the only way?


Your post might have been flagged because of your example?


Oh. Please explain. The example was entirely arbitrary. Should I change it?


IMO no. There are enough people who will vouch.


I really like how simple the implementation is!


Why not use GPT-4 embeddings instead of word2vec? Won't that be more effective?


I would assume "can be run locally", "w2v can be trained faster than other models", and "doesn't cost per call to use" are some arguments for why someone would try this.


not to be confused with https://github.com/semgrep/semgrep


damn, great thinking.



