Author here. I just wanted a quick and easy way to submit strings to a REST API and get back the embedding vectors as JSON using Llama2 and other similar LLMs, so I put this together over the past couple of days. It's quick to set up and completely self-contained and self-hosted. You can add new models simply by adding the HuggingFace URL for the GGML-format model weights. Two models are included by default, and these are downloaded automatically the first time the service runs.
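To give a sense of the basic workflow from the client side, here's a minimal sketch using Python requests. The endpoint path, model name, and response field are placeholders of my own, not the service's actual API; check the repo's docs for the real names.

    # Hypothetical client call; endpoint, payload fields, and model name are assumptions.
    import requests

    BASE_URL = "http://localhost:8000"  # assumed default host/port

    resp = requests.post(
        f"{BASE_URL}/compute_embedding",           # hypothetical endpoint name
        json={
            "text": "The quick brown fox jumps over the lazy dog.",
            "model": "llama2_7b",                  # hypothetical model identifier
        },
        timeout=60,
    )
    resp.raise_for_status()
    embedding = resp.json()["embedding"]           # assumed response field
    print(len(embedding), embedding[:5])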
It lets you not only submit text strings and get back the embeddings, but also compare two strings and get back their similarity score (i.e., the cosine similarity of their embedding vectors). You can also upload a plaintext file or PDF and get back the embeddings for every sentence in the file as a zipped JSON file (and you can specify the layout of that JSON).
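The similarity score is just the cosine similarity of the two embedding vectors; here's a quick NumPy illustration of the computation (not the service's internal code):

    import numpy as np

    def cosine_similarity(a, b):
        """Cosine similarity of two embedding vectors (1.0 = same direction)."""
        a = np.asarray(a, dtype=np.float32)
        b = np.asarray(b, dtype=np.float32)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # e.g. with two embeddings returned by the API:
    # score = cosine_similarity(embedding_1, embedding_2)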
Each time an embedding is computed for a given string with a given LLM, that vector is stored in the SQLite database, so repeated requests for the same string are returned immediately from the cache. You can also easily search across all stored vectors using a query string; this uses the integrated FAISS index.
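The search works roughly along these lines (a sketch under my own assumptions, not the repo's actual code): the embeddings cached in SQLite are loaded into a FAISS index, and the query string's embedding is matched against it.

    import faiss
    import numpy as np

    # Suppose `stored_vectors` is an (n, d) float32 array of embeddings pulled from
    # the SQLite cache, and `query_vector` is the embedding of the query string.

    def build_index(stored_vectors: np.ndarray) -> faiss.IndexFlatIP:
        d = stored_vectors.shape[1]
        faiss.normalize_L2(stored_vectors)      # normalize so inner product = cosine similarity
        index = faiss.IndexFlatIP(d)
        index.add(stored_vectors)
        return index

    def search(index: faiss.IndexFlatIP, query_vector: np.ndarray, k: int = 5):
        q = query_vector.astype(np.float32).reshape(1, -1)
        faiss.normalize_L2(q)
        scores, ids = index.search(q, k)        # top-k most similar stored strings
        return list(zip(ids[0].tolist(), scores[0].tolist()))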
There are lots of nice performance enhancements, including parallel inference, a database write queue, fully asynchronous request handling, and even a RAM-disk option to speed up model loading.
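The write-queue idea, roughly: embedding results get pushed onto an asyncio queue, and a single consumer task flushes them to SQLite, so the inference workers never block on database writes. A simplified sketch of that pattern; the table schema and the use of aiosqlite are my assumptions, not necessarily what the repo does:

    import asyncio
    import aiosqlite  # assuming an async SQLite driver; the repo may use something else

    write_queue: asyncio.Queue = asyncio.Queue()

    async def db_writer(db_path: str = "embeddings.sqlite"):
        """Single consumer: drains the queue and writes embeddings to SQLite."""
        async with aiosqlite.connect(db_path) as db:
            await db.execute(
                "CREATE TABLE IF NOT EXISTS embeddings (text TEXT, model TEXT, vector BLOB)"
            )
            while True:
                text, model, vector_blob = await write_queue.get()
                await db.execute(
                    "INSERT INTO embeddings (text, model, vector) VALUES (?, ?, ?)",
                    (text, model, vector_blob),
                )
                await db.commit()
                write_queue.task_done()

    async def enqueue_result(text: str, model: str, vector_blob: bytes):
        # Inference workers call this and return immediately; no waiting on the DB.
        await write_queue.put((text, model, vector_blob))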
I'm now working on adding API endpoints for easily generating sentiment scores using presets for different focus areas, but that's still a work in progress (the code for it so far is in the repo, though).
You can probably get it to behave well with a fine-tuning approach like this one: https://arxiv.org/pdf/2202.08904.pdf