Show HN: Llama2 Embeddings FastAPI Server (github.com/dicklesworthstone)
178 points by eigenvalue on Aug 15, 2023 | 31 comments
Author here. I just wanted a quick and easy way to submit strings to a REST API and get back the embedding vectors in JSON using Llama2 and other similar LLMs, so I put this together over the past couple of days. It's very quick to set up, totally self-contained, and self-hosted. You can add new models to it by simply adding the HuggingFace URL to the GGML-format model weights. Two models are included by default, and these are automatically downloaded the first time it's run.
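For example, a request might look something like the following (the endpoint path and field names here are illustrative guesses, not necessarily the exact ones in the repo; the server's live FastAPI docs at /docs show the real schema):

    import requests

    # Hypothetical endpoint/field names, for illustration only.
    resp = requests.post(
        "http://localhost:8000/get_embedding_vector_for_string/",
        json={"text": "The quick brown fox.", "model_name": "llama2_7b"},
    )
    vector = resp.json()["embedding"]  # list of floats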

It lets you not only submit text strings and get back the embeddings, but also compare two strings and get back their similarity score (i.e., the cosine similarity of their embedding vectors). You can also upload a plaintext file or PDF and get back the embeddings for every sentence in the file as a zipped JSON file (and you can specify the layout of this JSON file).
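The similarity score itself is just the normalized dot product of the two embedding vectors, along these lines:

    import numpy as np

    def cosine_similarity(a, b):
        a, b = np.asarray(a, dtype=np.float32), np.asarray(b, dtype=np.float32)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))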

Each time an embedding is computed for a given string with a given LLM, that vector is stored in the SQLite database and can be returned immediately on subsequent requests. You can also search across all stored vectors easily using a query string; this uses the integrated FAISS index.
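Roughly speaking, the FAISS part works like this (a simplified sketch; the actual index type and wiring in the repo may differ):

    import faiss
    import numpy as np

    # Assume `stored_vectors` is an (n, 4096) float32 array loaded from SQLite
    # and `query_vector` is the embedding of the query string.
    index = faiss.IndexFlatIP(stored_vectors.shape[1])  # inner-product index
    faiss.normalize_L2(stored_vectors)                   # normalized IP == cosine similarity
    index.add(stored_vectors)

    query = query_vector.astype("float32").reshape(1, -1)
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)                 # top 5 most similar stored strings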

There are lots of nice performance enhancements, including parallel inference, a DB write queue, fully async everything, and even a RAM disk feature to speed up model loading.

I’m working now on adding additional API endpoints for easily generating sentiment scores using presets for different focus areas, but that’s still work-in-progress (the code for this so far is in the repo though).




It seems unlikely that raw Llama2 will perform better than purpose-made encoder models like bge [1], gte [2], e5 [3], or Instructor, despite its much larger size (for the tasks people usually need embeddings for).

You can probably get it to behave well with fine-tuning like this: https://arxiv.org/pdf/2202.08904.pdf

[1] https://huggingface.co/BAAI/bge-large-en

[2] https://huggingface.co/thenlper/gte-large

[3] https://huggingface.co/intfloat/e5-large-v2


This is my sense as well. Text generation LLMs haven't been the best source of embeddings for other downstream use cases. If you're optimizing for token embeddings (e.g., for NER, span detection, or token classification tasks), then a token-level training objective is important. If you need text-level embeddings (e.g., for semantic search or text classification), then a text-level training objective is required (e.g., what Sentence BERT did to optimize BERT embeddings for semantic search).

That's a great list of existing embeddings models (in addition to the SentenceBERT models: https://www.sbert.net/docs/pretrained_models.html).
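For reference, getting text-level embeddings out of one of those SentenceBERT models is only a couple of lines:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(["How do I reset my password?",
                               "Steps to change your account password"])
    # embeddings is an (n, 384) numpy array, ready for cosine similarity / search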


The SGPT model is a very high performing text embeddings model adapted from a decoder. Using the same techniques with Llama-2 might perform better than you expect. I think someone will need to try these things before we know for certain. I believe there is still room for significant improvement with embedding models.


Thanks for pointing out those models. I see from a quick HuggingFace search that the bge model is available in GGML format. You can add new GGML-format models to the code by simply adding the direct download link to this line:

https://github.com/Dicklesworthstone/llama_embeddings_fastap...

So to add the base bge model, you could just add this URL to the list:

https://huggingface.co/maikaarda/bge-base-en-ggml/resolve/ma...

I will add that as an additional default.
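In other words, the server just keeps a plain list of direct GGML download URLs and fetches anything in it that isn't already on disk; roughly like this (the actual variable name and existing entries are at the linked line above):

    # Illustrative sketch only -- see the linked line in the repo for the real list.
    MODEL_DOWNLOAD_URLS = [
        # ...existing default GGML model URLs...
        "<direct download URL for the bge-base-en GGML weights>",
    ]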


I think it's still overkill for semantic embeddings, though: SBERT is on the order of ~250M parameters, while the smallest Llama is 7B parameters.


If all you want to do is basic semantic search, that's probably true. But I strongly suspect we are only just now starting to scratch the surface of what's possible with embeddings that come from much more powerful LLMs like Llama2, which can clearly demonstrate much greater "understanding" of the sentences they are shown (whatever that means, but intuitively, it seems obvious to me). That's partly why I made this tool: to aid in my investigations of LLM embeddings in a convenient and performant way.


I'm really curious to see where that investigation leads. Have you done any comparisons between Llama 2 and the embedding-focused models? I wonder if it'll be able to provide more 'intuitively correct' similarities.


Yeah, I'd be surprised if embeddings derived from decoder-only models are competitive in common embedding tasks without some extra training work. There's a good benchmark page on huggingface for the MTEB tasks (Massive Text Embedding Benchmark) that's kept up to date here: https://huggingface.co/spaces/mteb/leaderboard


I did not know about BGE, learned about it from your comment. Seems to be the new SoTA for semantic embeddings, even for small models. Very cool!


It's exciting/terrifying to see how fast all this moves, it feels like every day there are new discoveries, techniques and models.

When I talk with people about ChatGPT-esque things (ChatGPT is pretty much what most people know now), I say that it's crazy that you can run this on your own consumer-level hardware if you just want to mess with it. You don't need "prosumer"/enthusiast hardware (unless you want to train models, but then I see people are using Google Colab etc).

It's a crazy world we live in.


This looks really cool. One thing I've wondered about with, e.g., the OpenAI API is whether JSON is really a good format for passing embeddings back and forth. I'd think that passing floats as text over the wire wastes a ton of space that could add up, and might even sacrifice some precision. Would it be better to encode at least the vectors as binary blobs, or else use something like protobuf to more efficiently handle sending tons of floats around?


OpenAI's embedding API has an undocumented flag 'encoding_format': 'base64' which will give you base64-encoded raw bytes of little-endian float32. Since it is used by the official Python client, it is unlikely to go away.
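Decoding that on the client side is essentially a one-liner, e.g.:

    import base64
    import numpy as np

    # stand-in for the base64 string the API returns in place of the float list
    b64 = base64.b64encode(np.array([0.1, 0.2, 0.3], dtype="<f4").tobytes()).decode()
    vector = np.frombuffer(base64.b64decode(b64), dtype="<f4")  # little-endian float32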


I totally agree when you're talking about a bunch of embeddings at once; that's why the document-level endpoint (and the token-level embedding endpoint) can optionally return a link to a zip file containing the JSON. For a single embedding, I'm not sure it matters that much, and the extra convenience is nice.

Edit: One other thing is that you can store the JSON in SQLite using the JSON data type and then use the nice querying constructs directly at the database level, which is handy for the token-level embeddings and document embeddings. This is built into my project.
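For example, with SQLite's JSON1 functions you can pull individual components straight out of the stored JSON (the table and column names below are just illustrative, not the ones used in the repo):

    import sqlite3

    conn = sqlite3.connect("embeddings.sqlite")  # illustrative filename
    # Grab each stored text and the first component of its embedding for one model
    rows = conn.execute(
        """
        SELECT text, json_extract(embedding_json, '$[0]')
        FROM embeddings
        WHERE model_name = ?
        """,
        ("llama2_7b",),
    ).fetchall()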


Here's an example of Instructor Embeddings w/FeatureBase: https://gist.github.com/kordless/aae99946e7e2a5afccc83f3c4ee...

Instructor Embeddings rank high on various leaderboards for embeddings and can be run locally, regardless of how they are stored. It takes about half a second to embed 20 strings and 2.2 seconds to embed 80 strings. I haven't tested this with different batch sizes or GPU acceleration (I don't know if that's possible). It is possible to quantize the vectors to 8-bit floats.
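If anyone wants to reproduce those timings, the basic usage is roughly this (the instruction wording is up to you):

    from InstructorEmbedding import INSTRUCTOR

    strings = ["first document text...", "second document text..."]
    model = INSTRUCTOR("hkunlp/instructor-large")
    pairs = [["Represent the document for retrieval:", s] for s in strings]
    embeddings = model.encode(pairs)  # one 768-dimensional vector per input string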

I'm using FeatureBase to store these because a) I work there and b) it will store and search the vectors by euclidian_distance and cosine_distance. Right now this is a cloud-only feature, but we'll work on getting it into the community release at some point.

Combined with our current support for sets, set-intersection operations like tanimoto(), and filtering and aggregation (all done using roaring bitmaps for in-memory operations), this presents an interesting offering for storing training data and reporting. Being able to filter the vectors compared by distance makes nearest-neighbor search algorithms almost unnecessary, except for extreme use cases. In those cases, it might be better to consider switching to knowledge graphs (to filter the vector space) instead of storing tens of millions of dense vectors and doing approximate search on them.


Cool. I was wondering what Tanimoto meant, since I've tried to make myself familiar with all the useful similarity measures, and apparently it's just another name for Jaccard Index.

I do think there is a lot of potential in exploring more sensitive measures of similarity or statistical dependence. It seems like the ML community has basically decided that all the heavy lifting should be done at the model embedding level, and then you can just use cosine similarity for speed and the answers just "fall out". Which is definitely nice because then you can search across millions of records per second.

But there are some lesser-known measures of similarity/dependence that can pick up on more subtle relationships-- the big drawback is that they are slow. I included a couple of these exotic ones in my project, Hoeffding's D and HSIC, mostly out of curiosity.


> apparently it's just another name for Jaccard Index

Yes, it produces the same values for set comparisons: https://www.featurebase.com/blog/tanimoto-similarity-in-feat...
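For sets, that's just intersection over union:

    def tanimoto(a: set, b: set) -> float:
        # Identical to the Jaccard index for sets: |A ∩ B| / |A ∪ B|
        return len(a & b) / len(a | b) if (a or b) else 0.0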


It's much better to use something like bge-large-en - https://github.com/arguflow/openembeddings


I read the doubts about your approach compared to bge, gte, e5; however, I am wondering if the advantages of your approach could be:

- 2048-token context
- multilingual support
- in-context learning or other discrete prompt tuning, which might beat bge/gte/e5 on some tasks
- optimization with quantized models, fine-tuned models...

I am just wondering about speed vs bge/gte/e5


What's wrong with just using Torchserve[1]? We've been using it to serve embedding models in production.

[1] https://pytorch.org/serve/


I wanted something that works natively with llama-cpp and langchain and is also small and easy to hack on. I also wanted everything to be seamlessly cached in SQLite, plus built-in semantic search with FAISS, string similarity using multiple measures beyond just cosine similarity, and token-level embedding support. There is much more flexibility when you do it yourself to quickly add whatever functionality you want.


Hi,

Looks quite clean, congrats.

Two questions:

1. Starting from this, what would be the proper way to create embeddings for a complete document (i.e., a long paragraph)? My goal is to directly compare two PDFs according to their contents. It seems that `compute_similarity_between_strings` could be used, but then what is `get_all_embedding_vectors_for_document` useful for?

2. Using your API, does the inference run directly on the VPS? Does it need any special kind of hardware (GPU, TPU, or whatever)?

Sorry if my questions are dumb, but I really appreciate your project simplicity, and I want to know if it could suit my needs.

Thanks for sharing this piece of work.

;-)


Sure, the difference is that the first endpoint would give you back a single embedding vector for the entire paragraph, while the second endpoint would give you a separate embedding vector for each sentence in the paragraph.

And yes, everything in this code is designed to run well on the CPU on a modest machine and is 100% self-hosted, with no API keys needed at all. But if you do have a GPU installed and configured, it will automatically use that, since it's powered by llama-cpp, which now supports CUDA.


Thank you very much!

And when is it useful to compute an embedding for each sentence of a document/paragraph then?


Honestly, I'm not exactly sure about that myself. Someone asked for that feature on reddit and I realized that it wouldn't be too hard to add it, so I did. Another poster here mentioned that it might be useful for doing word-level highlighting of what was most relevant to a semantic query string within a sentence. I do think it is able to capture a lot more nuance simply because it's giving you so much more data. The question is how best to leverage that incremental data and take advantage of it, and I'm still trying to figure that out.


Nice work and starred! I'm curious how you compute the embeddings, the number of dimensions for the embeddings, and whether you have run any benchmarks against OpenAI's offering.

Cheers!


The embeddings are computed using llama-cpp, but langchain provides a nice convenience wrapper to get them directly, so I use that. The embeddings are 4096-dimensional vectors.
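(Roughly, the langchain wrapper usage looks like this; the exact kwargs depend on your llama-cpp build and model file:)

    from langchain.embeddings import LlamaCppEmbeddings

    embedder = LlamaCppEmbeddings(model_path="path/to/model.ggmlv3.q4_0.bin")
    vector = embedder.embed_query("Some sentence to embed")          # 4096 floats for Llama2-7B
    doc_vectors = embedder.embed_documents(["sentence one", "sentence two"])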

And no, I haven’t benchmarked them against OpenAI’s embeddings. I should point out that this code will work for any model in GGML format, so if there are fine-tuned Llama2 versions that are optimized for embedding, you could use those instead very easily (or any other model). This project is more about making it easy to go from model to embeddings on demand via an API and then letting you do useful things with those embeddings easily.


I’d love to see some examples of how the token-in-context embeddings stack up against sentence-level. What new use-cases are unlocked?

Perhaps semantic search with word-highlighting?

Any advantage to using the full context window to maximize context around the embedded token?


Great idea, I’ll see if I can make a new endpoint that adds word level highlighting annotations that could be parsed and used to control the brightness or color of each word based on semantic relevance to a query term.

To be honest, I hadn’t even thought about token-level embeddings until someone on Reddit asked about it and I realized it was possible to do with llama-cpp, so I just quickly added the functionality without closely examining the best use cases.

It’s a LOT more data and compute than using the normal sentence-level embeddings, so it would really have to unlock some useful new functionality to be worth it. But I do think the “combined feature vector” concept that at least makes them fixed length is helpful.


I'd rather stick with InstructorEmbedding instead of pandering to the flavour-of-the-month LLM. That way I keep my key components insulated from drastic changes.


LLM inputs are the worst candidates for caching. The only place where caching might make sense is if you have a public-facing service and have coupled it with a vector cache instead of typical word-for-word caching of the prompt.


It’s useful if you might submit the same document or edited versions of the same document with a lot of overlap.





