At Metal (http://getmetal.io/) we're currently building a fine-tuning platform. We host, index, and version embeddings, and provide an easy way to manage fine-tuning jobs as well.
Tokens, I think? There's an implication that there's a per-token cost, but I have no idea how they were architected before, so it's hard to understand how it fits. If they were running their own model on GPUs I'd expect to see things in terms of tokens/sec, so I assume instead they were using some hosted model where they have to pay per token?
The blog post really suffers from being written by someone who knows what they're talking about; it could have done with a pre-publication review by someone who doesn't know the space. I go through that exercise any time I'm writing something for consumption outside of my service teams.
Great post -- one of our core offerings at https://getmetal.io is managing the process of using custom embeddings, very similar to your approach here! We'd love to connect and talk through some of the pain points you came across -- I'll send you a message in the W23 Slack!
This doesn’t make any sense to me: how does fine-tuning the embeddings save money? It seemed like the problem was having to make too many API calls to generate the embeddings in the first place.
Embeddings are often used as features for these LLMs, so before, they were paying to generate embeddings and to do inference with these large models. Now they pay to generate embeddings, fine-tune them, and do semantic search (probably approximate k-nearest neighbors). The hardware requirements of most LLMs make them much more expensive than approximate KNN against a vector database.
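For a sense of what the cheaper path looks like, here's a minimal sketch of approximate KNN over precomputed embeddings using FAISS's HNSW index. The library, index type, and parameters are my own illustration, not necessarily what the post's authors use:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 1536                                               # embedding dimension (e.g. OpenAI ada-002)
corpus = np.random.rand(10_000, d).astype("float32")   # stand-in for stored embeddings
faiss.normalize_L2(corpus)                             # unit-normalize so inner product == cosine similarity

# HNSW graph index: approximate nearest neighbors, no GPU required.
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.add(corpus)

query = np.random.rand(1, d).astype("float32")         # stand-in for a query embedding
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)                   # top-5 nearest neighbors
print(ids, scores)
```

Compared with running every query through a multi-billion-parameter model, this is a small in-memory lookup, which is where the cost savings come from.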
Maybe this helps people understand what they are doing at index time.
* Version 1. Ask the LLM to describe the code snippet. Create an embedding of the description. LLM generation + embeddings required.
* Version 2. Run the code snippet directly through the embedding API, skipping the LLM text-generation step. Then run the resulting embedding through the bias matrix and index the result.
I assume this only works because they fine-tuned a bias matrix on code snippet and text pairs. Feels more like a light version of transfer learning to me.
The article was a little unclear about the actual approach for V2, so if I have anything wrong please correct me.
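If my reading of V2 is right, the "bias matrix" is just a learned linear transform applied to the raw embedding before indexing, and again to queries at search time. A minimal sketch of that idea; the matrix, shapes, and the stand-in API call are my assumptions, not the article's code:

```python
import numpy as np

d = 1536  # raw embedding dimension (e.g. OpenAI ada-002)

# Hypothetical learned matrix; in practice it would be fit on (code snippet, text) pairs.
rng = np.random.default_rng(0)
bias_matrix = rng.normal(size=(d, d)).astype("float32")

def embed_with_api(text: str) -> np.ndarray:
    """Stand-in for the embeddings API call; returns a random unit vector here."""
    v = rng.normal(size=d).astype("float32")
    return v / np.linalg.norm(v)

def customize(raw_embedding: np.ndarray) -> np.ndarray:
    """Project a raw embedding through the learned matrix and re-normalize."""
    v = raw_embedding @ bias_matrix
    return v / np.linalg.norm(v)

# Index time (V2): embed the code snippet directly, then project it -- no LLM description step.
code_embedding = customize(embed_with_api("def parse_csv(path): ..."))

# Query time: project the query the same way so both live in the same fine-tuned space.
query_embedding = customize(embed_with_api("how do I read a CSV file?"))
similarity = float(code_embedding @ query_embedding)  # cosine similarity (both unit-normalized)
```

The appeal is that the expensive LLM call drops out of the indexing path entirely; only the cheap embedding call and a matrix multiply remain.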
> We considered trying to use a self-hosted LLM as an alternative, but the costs would also have been extremely high for the amount of traffic we were processing.
Is it realistic to self-host an LLM that outperforms OpenAI's offerings cost-wise? When I looked at the alternatives (self-hosted, alternate hosted LLM providers, or cloud compute options) you generally ended up with a subjectively worse model AND lower inference speed, which led me to can my idea as it was simply too expensive.
Flan-T5 is much smaller than GPT-3 but was trained on significantly more data, resulting in competitive accuracy. It is also Apache-licensed. I wonder if that model is fast enough for enough use cases to make it cost-effective?
In my Colab Pro it's running on an A100 (which is a very beefy GPU) and inference is very fast, definitely suitable for interactive use. On a T4 GPU (which is much cheaper) inference is still alright and probably OK for interactive use.
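For anyone who wants to try it themselves, here's a minimal Flan-T5 inference sketch with Hugging Face transformers; the checkpoint size and generation settings are just illustrative, not a tuned setup:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# flan-t5-xl fits comfortably on an A100; the base/large checkpoints run fine on a T4.
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto")

prompt = "Summarize: FAISS is a library for efficient similarity search over dense vectors."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```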
I think Flan-T5 is fast enough, but I don't think it generates text or handles abstract reasoning at nearly the same level as current GPT-3 models. That suggests a deficiency in the benchmarks and metrics we use to evaluate LLMs. For generating embeddings it might work well enough, though.
It's certainly not quite as good out of the box, at least with the open-sourced checkpoints. However, so far I've found it can achieve similar accuracy with enough examples and/or fine-tuning for my use cases. Like everything, it depends on what you are doing too.
For embeddings, it may be overkill. Smaller BERT-type models can provide good embeddings when fine-tuned with a contrastive learning objective, e.g. https://sbert.net.
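As an illustration of how lightweight the SBERT route is, here's a short sketch with the sentence-transformers library; the model name is just one common small checkpoint, not a recommendation from the thread:

```python
from sentence_transformers import SentenceTransformer, util

# A small BERT-style bi-encoder (~80 MB), fast enough to run on CPU.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I read a CSV file in Python?",
    "def parse_csv(path): return list(csv.reader(open(path)))",
    "The weather is nice today.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity between the query and the two candidates.
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # the code snippet should score higher than the unrelated sentence
```

For domain adaptation, the same library exposes contrastive losses (e.g. MultipleNegativesRankingLoss) so you can fine-tune on your own (query, relevant passage) pairs.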
I think you should have put a bit more thought into planning for scale. There's a difference between not over-engineering and not doing the basic maths to figure out the unit economics of your business model before opening it up to the whole world.
But the advice to essentially fine-tune your embeddings with a custom matrix is good.
There are also other embeddings platforms (other than OpenAI's) that have built-in fine-tuning functionality.