On the flip side, this is extremely fast and works pretty well on my M1 Pro MacBook running Chrome. The only downside is that it's a little blurry (as it's rendering at 1x pixel scale).
Congrats on launching – we just spent quite a bit of time replicating/transforming our primary database into ClickHouse for OLAP use cases, and it would have been way easier if there were a Postgres-native solution. Hoping the managed hosting providers catch on.
I’m actually in the same boat right now (primary DB -> ClickHouse). We’re currently trialing Airbyte, but it appears they’re removing normalization support for ClickHouse this year. Did you land on a custom solution, or some other ETL/ELT tool?
Vector databases are oriented around search and retrieval. The usual approach to generating vectors is to fine-tune a large pretrained model and extract the inner representations. Because the dataset contains successful queries and retrieval results, all you need to do is optimize a loss function (usually a contrastive loss, or an approximation of a common ranking function) over raw inputs on the similarity objective supported by the vector database. For common modalities like tabular/text/image/audio data, there is basically no human judgement involved in feature selection - just apply attention.
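A minimal sketch of that training step, assuming you already have logged (query, clicked result) pairs - the in-batch-negatives loss and the names here are illustrative, not specific to any particular vector database:

```python
# Rough sketch of "optimize a loss function contrastively" on logged
# (query, clicked_result) pairs; the encoder producing these vectors is
# whatever pretrained model you are fine-tuning.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    # Normalize so dot products become cosine similarities, i.e. the same
    # similarity objective the vector database will use at query time.
    q = F.normalize(query_vecs, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # i-th query matches i-th doc
    return F.cross_entropy(logits, labels)              # other docs act as negatives
```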
Note: current state-of-the-art text-to-vector models like E5-Mistral (note: very different from the original E5) don’t even require human curation of the dataset.
This is also just eliding the work that humans still have to do.
I give you an audio file (let's say just regular old PCM WAV format). You cannot do anything with this without making some decisions about what happens next to the data. For audio, at a very minimum you're faced with the question of whether to transform into the frequency domain. If you don't do that, there's a ton of feature classification that can never be done. No audio-to-vector model can make that sort of decision for itself - humans have to make that possible.
Raw inputs are suitable for some things, but E5 is essentially just a model that already has a large number of assumptions built into it - assumptions that happen to give pretty good results. Nevertheless, were you interested, for some weird reason, in a very strange metric of text similarity, nothing prebuilt - not even E5 - is going to give that to you. Let's look at what E5 does:
> The primary purpose of embedding models is to convert discrete symbols, such as words, into continuous-valued vectors. These vectors are designed in such a way that similar words or entities have vectors that are close to each other in the vector space, reflecting their semantic similarity.
This is great, but useful for only a particular type of textual similarity consideration.
And oh, what's this:
> However, the innovation doesn’t stop there. To further elevate the model’s performance, supervised fine-tuning was introduced. This involved training the E5 embeddings with labeled data, effectively incorporating human knowledge into the learning process. The outcome was a consistent improvement in performance, making E5 a promising approach for advancing the field of embeddings and natural language understanding.
Hmmm ....
Anyway, my point still stands: choosing how to transform raw data into "features" is a human activity, even if the actual transformation itself is automated.
I agree with your point at the highest (pretrained model architect) level, but tokenization/encoding things into the frequency domain are decisions that typically aren’t made (or thought of) by the model consumer. They’re also not strictly theoretically necessary and are artifacts of current compute limitations.
Btw E5 != E5 Mistral, the latter achieves SOTA performance without any labeled data - all you need is a prompt to generate synthetic data for your particular similarity metric.
> Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets… We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across nearly 100 languages.
It’s true that ultimately there’s a judgement call (what does “distance” mean?), but I think the original post far overcomplicates what’s standard practice today.
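For example, a rough sketch of that synthetic-data step - the prompt wording and the model name below are my own placeholders, not taken from the paper:

```python
# Illustrative only: ask an LLM to invent training triples for a custom
# similarity metric, then fine-tune an embedding model on them.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = """Generate 5 training examples for a text embedding model where
"similar" means: the two passages make the same legal claim.
Return a JSON array of objects with keys: query, positive, hard_negative."""

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": PROMPT}],
)
# Assumes the model returns bare JSON; a real pipeline validates and retries.
triples = json.loads(resp.choices[0].message.content)
```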
Sorry, I just don't believe this generalizes in any meaningful sense for arbitrary data.
You cannot directly read frequencies off PCM audio data. If you want to build a vector database of audio, with frequency/frequencies as one of the features, at the very least you will have to arrange for a transform into the frequency domain. Unless you claim that a system is somehow capable of discovering Fourier's theorem and implementing the transform for itself, this is a necessary precursor to any system being able to embed using a vector that includes frequency considerations.
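For concreteness, this is the kind of human-chosen precursor step I mean - a minimal numpy/scipy sketch, with a made-up file name:

```python
# A human decides to move PCM samples into the frequency domain; only after
# this transform does a question like "what is the dominant frequency?" exist.
import numpy as np
from scipy.io import wavfile

rate, pcm = wavfile.read("some_track.wav")   # hypothetical input file
if pcm.ndim > 1:
    pcm = pcm.mean(axis=1)                   # mix down to mono
spectrum = np.abs(np.fft.rfft(pcm))          # magnitude spectrum
freqs = np.fft.rfftfreq(len(pcm), d=1.0 / rate)
dominant_hz = freqs[np.argmax(spectrum)]
```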
But ... that's a human decision, because humans think that frequencies are important to their experience of music. A person who is totally deaf, however, and thus has extremely limited frequency perception, can (often) still detect rhythmic structure due to bone conduction. Such a person who was interested in similarity analysis of audio would have no reason to perform a domain transform, and would be more interested in timing correlations that probably could be fully automated into various models - as long as someone remembers to ensure that the system is time-aware, which is, again, just another particular human judgement about which aspects of the audio have significance.
I just read the E5 Mistral paper. I don't see anything that contradicts my point, which wasn't about the need for human labelling, but about the need for human identification of significant features. In the case of text, because of the way languages evolve, we know that a semantic-free character-based analysis will likely bump into lots of interesting syntactic and semantic features. Doing that for arbitrary data (images, sound, air pressure, temperature) lacks any such pre-existing reason to treat the data in any particular way.
Consider, for example, if the "true meaning" of text were encoded in a somewhat Kabbalah-esque scheme, in which far-distant words and even phonemes create tangled loops of reference and meaning. Even a system like E5 Mistral isn't going to find that, because that's absolutely not how we consider language to work, and thus it's not part of the fundamentals of how even E5 Mistral operates.
Understanding audio with inputs in the frequency domain isn’t required for understanding frequencies in audio.
A large enough system with sufficient training data would definitely be able to come up with a Fourier transform (or something resembling one), if encoding it helped the loss go down.
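One way to see why that's plausible: the discrete Fourier transform is just a fixed linear map, so a single dense layer already has the capacity to express it exactly - whether training actually recovers those weights is an empirical question. A minimal sketch:

```python
# The DFT written as a plain weight matrix (split into real/imaginary parts),
# i.e. something a linear layer could in principle learn.
import numpy as np

N = 256
n = np.arange(N)
k = n.reshape(-1, 1)
W_real = np.cos(-2 * np.pi * k * n / N)
W_imag = np.sin(-2 * np.pi * k * n / N)

x = np.random.randn(N)                       # raw time-domain samples
spectrum = W_real @ x + 1j * (W_imag @ x)
assert np.allclose(spectrum, np.fft.fft(x))  # identical to the FFT result
```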
> In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
Today’s diffusion models learn representations from raw pixels, without even the concept of convolutions.
Ditto for language - as long as the architecture is 1) capable of modeling long-range dependencies and 2) can be scaled reasonably, whether you pass in tokens, individual characters, or raw ASCII bytes is irrelevant. Character-based models perform just as well as (or better than) token/word-level models at a given parameter count/training corpus size - the main reason they aren’t common (yet) is memory limitations, not anything fundamental.
I got SMILE instead of LASIK for that exact reason (avoiding dry eye), and have perfect vision with moist eyes. Highly recommend avoiding LASIK if you qualify for newer procedures.
I think “take responsibility for fucking up the lives of a whole bunch of people, because he didn’t agree with the hiring strategy of the CEO he dumped” is the angle here.
That fine specimen of humanity goes on to say, in the same breath, “we are fine, we have money” - so clearly they were not facing bankruptcy, and would have been in a position to help out those they screwed over so brutally.
I’m making it my mission to figure out the combined business interests of their board and C-suite, and to make sure I avoid those products if at all possible.
Agree, imo this is a big, somewhat underrated advantage - GPT's summarising capabilities seem quite a bit more impressive to me than its generative ability.
Can you elaborate on the problem? I've successfully used LLMs to format unstructured data into JSON based on a predefined schema in a number of different scenarios, yet somehow missed whatever issue you're describing in your comment.
Can you share your techniques or resources you used to get this working? We've got something that works maybe 90% of the time and occasionally get malformed JSON back from OpenAI.
Just last week I went over 8k lines of data, doing a first applicability analysis, meaning deciding which lines to consider for further analysis. The information I needed to do so was hidden in manually created comments, because of course it was - I have never ever seen predefined classifications used consistently by people. And those predefined classes never cover whatever need one has years later anyway.
Thing is, when I started I didn't even know what to look for. I knew once I was done, so it would have been almost impossible to explain that to an LLM beforehand. Added benefit: I found a lot of other stuff in the dataset that will be very useful in the future. Had I used an LLM for that, I wouldn't know half of what I now know about that data.
That's the risk I see with LLMs: already my pet peeve is data scientists with no domain knowledge or understanding of the data they analyze, but at least they know the maths. If part of that is outsourced to a blackbox AI that hallucinates half the time, I am afraid most of those analyses will be utterly useless, or worse, misleading in a very confident way...
TLDR: In my opinion LLMs take away the curious discovery of going over data or text or whatever. Which is lazy and prevents us from casually learning new things. And we cannot even be sure we can trust the results. Oh, and we end up thinking more about the tool - LLMs and prompts - than about doing the job. Again, lazy and superficial, and a dead-sure way to get mediocre, at best, results.
What's the plan/timeline for offering cosine similarity support, given that most OSS embedding models are fine-tuned on a contrastive cosine distance objective?
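In the meantime, a workaround sketch (assuming nothing about this particular product's API): L2-normalize embeddings at index time and query time, and inner-product or L2 search then orders results the same way cosine similarity would (||a-b||^2 = 2 - 2*cos for unit vectors).

```python
# Toy demonstration with random "embeddings"; after normalization the
# inner product equals cosine similarity.
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

docs = normalize(np.random.randn(1000, 384))   # placeholder document vectors
query = normalize(np.random.randn(384))
scores = docs @ query                          # cosine similarities
top_k = np.argsort(-scores)[:10]
```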
You simply sample tokens starting with the allowed characters and truncate if needed. It’s pretty efficient; there’s an implementation here: https://github.com/1rgs/jsonformer
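A minimal usage sketch, with a placeholder model and schema (check the repo README for the exact API):

```python
# Constrained decoding: the model can only emit tokens that keep the output
# valid under the schema, so malformed JSON never comes back.
from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

model_name = "databricks/dolly-v2-3b"          # placeholder; any local causal LM
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
    },
}

result = Jsonformer(model, tokenizer, schema, "Describe a person:")()
print(result)                                  # plain dict matching the schema
```

Note that this needs logit-level access during decoding, so it applies to local/open-weight models; against the OpenAI API you're limited to prompting, function calling/JSON mode, and validate-and-retry.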