I've used it for hybrid search and it works quite well.
Overall I'm really happy to see Typesense mentioned here.
A lot of the smaller-scale RAG projects, etc. you see around would be well served by Typesense, but it seems to be relatively unknown for whatever reason. It's probably one of the easiest solutions to deploy, has reasonable defaults, good docs, easy clustering, etc. while still being very capable, performant, and powerful if you need to dig in further.
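As a rough illustration of the hybrid search use case, here's what a keyword + vector query can look like with the Typesense Python client. The collection name, fields, and built-in embedding model ID are assumptions for the sketch, not a definitive setup - check the docs for your server version (server-side auto-embedding needs a reasonably recent release).

```python
# Minimal sketch of Typesense hybrid (keyword + vector) search.
# Collection/field names and the embedding model ID are illustrative assumptions.
import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 5,
})

client.collections.create({
    "name": "docs",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "body", "type": "string"},
        # Typesense can generate embeddings server-side from other fields.
        {"name": "embedding", "type": "float[]",
         "embed": {"from": ["title", "body"],
                   "model_config": {"model_name": "ts/all-MiniLM-L12-v2"}}},
    ],
})

client.collections["docs"].documents.create({
    "title": "Retrieval-augmented generation",
    "body": "Combine keyword and vector search to ground LLM answers.",
})

# Listing both a text field and the vector field in query_by makes the
# query hybrid: results are fused from keyword and semantic matches.
results = client.collections["docs"].documents.search({
    "q": "grounding language models",
    "query_by": "title,embedding",
})
print(results["hits"][0]["document"]["title"])
```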
> Our engineers are fucking morons. And this guy was the dumbest of the bunch.
Very indicative of a toxic culture you seem to have been pulled into, and likely have contributed to by this point given your language and broad generalizations.
Describing a wide group of people you're also responsible for as "fucking morons" says more about you than them.
- R2 users are at least paying for something (storage).
- Their network is massively overbuilt to be able to absorb DDoS attacks.
- They offer free bandwidth with their CDN - including for completely free users. Those resources need to be fetched from origin and/or cached. R2 doesn't have to fetch from origin, which eliminates the bandwidth required for the fetch.
Large providers typically pay very little/nothing for bandwidth other than equipment costs (ports, etc.). As a large provider they have free peering to most of the "last mile" ISP/eyeball networks in the world. This benefits all parties: these ISPs don't have to pay transit providers, neither does Cloudflare, and it's faster. Same goes for all of the big clouds.
People who think AWS/GCP/Azure/etc bandwidth pricing of $0.12/GB (or whatever) is fair/reasonable have no idea what bandwidth actually costs these operators. As noted, it's effectively nothing, and the big clouds capitalize on this ignorance by charging insane markups for bandwidth.
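To make the markup concrete, here's a back-of-envelope sketch. The transit price is an assumed round number purely for illustration (real bulk pricing varies, and peered traffic is effectively free); the point is the ratio, not the exact figures.

```python
# Back-of-envelope comparison of cloud egress pricing vs. assumed bulk transit cost.
egress_price_per_gb = 0.12             # typical big-cloud egress list price, USD
assumed_transit_per_mbps_month = 0.30  # assumed bulk transit price, USD per Mbps/month

# A 1 Gbps port running flat out for a month moves roughly this much data:
seconds_per_month = 30 * 24 * 3600
gb_per_month = 1_000 / 8 * seconds_per_month / 1_000  # ~324,000 GB

cloud_bill = gb_per_month * egress_price_per_gb        # what the customer pays
transit_bill = 1_000 * assumed_transit_per_mbps_month  # what 1 Gbps of transit costs

print(f"Customer egress bill: ${cloud_bill:,.0f}/month")   # ~$38,880
print(f"Assumed transit cost: ${transit_bill:,.0f}/month")  # ~$300
print(f"Markup: ~{cloud_bill / transit_bill:,.0f}x")
```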
> The US voting machines are just waiting to be hacked, just a matter of when, not if.
The US election system is very distributed and fragmented - there is virtually no standardization.
Even in the tightest margins for something like President you'd need to have seriously good data to figure out which random municipality voting system(s) you'd need to target to actually affect the outcome.
> to figure out which random municipality voting system(s) you'd need to target to actually affect the outcome.
As you said, no standardization, which means all precincts report on wildly different time intervals. If you can interfere with just the tallying during or after the fact, and you can get information on other precincts before any other outlets, you could easily take advantage of this.
It's essentially the Superman II version of interfering with an election. Just put your thumb on the scale a little bit everywhere on late precincts all at once.
The fact that so many states let a simple majority of their state take _all_ electors actually makes this possible. If more states removed the Unit Rule and followed the lead of Nebraska and Maine, this would be far less effective.
> As you said, no standardization, which means all precincts report on wildly different time intervals
There is standardization within all precincts of a county. And from my past experience as a poll worker, I can tell you why precinct reporting times can vary wildly within a county.
(Note things I say here are specific to the county where I worked.)
Anyone in line to vote by 8PM is allowed to vote. We (the other poll workers and I) could not start closing the polls until every voter had voted. If the local community doesn't trust vote-by-mail, then that polling place will likely see delays in closing due to lines.
One polling place often covered multiple precincts, so you'll see multiple precincts delayed simultaneously.
After that, boxes go from one queue to another, with multiple queues consolidating into one or two. So a one-minute delay in dropping off your box at a collection point may mean a two-hour delay in that box being processed.
> if you can interfere with just tallying
First off, that would require a remarkable amount of fraud. Second, that's why there are observers. It doesn't matter if it's 2AM on the Wednesday after election day: If tabulating is happening, you are allowed to observe.
The question is how to know which county you need to do that in. The more you try, the greater the odds of being caught, but with margins that small you'd need to attempt it in multiple states and predict rather accurately how many votes you need to win without attracting too much scrutiny.
Decentralised could be each counter putting 5,000 or so ballots into piles, with people wandering around witnessing for various parties, all working a rigid process across the nation. Each count publicly announced in the room before witnesses.
Totally standardised, coordinated, and decentralised, but not structurally fragmented or incoherent.
But I agree it would be a million times worse with a single electronic system.
> word Wood dominated the embedding values, but these were supposed to go into 2 different categories
When faced with a similar challenge we developed a custom tokenizer, pretrained a BERT base model[0], and finally trained a SPLADE-esque sparse embedding model[1] on top of that.
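For anyone curious what the tokenizer step can look like in general, below is a minimal sketch of training a domain-specific WordPiece tokenizer with the HuggingFace `tokenizers` library. The corpus path, vocab size, and example text are illustrative assumptions, not the actual setup behind [0]/[1].

```python
# Sketch: train a domain-specific WordPiece tokenizer so domain terms stop
# fragmenting into generic sub-words that dominate the embedding.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["domain_corpus.txt"],  # assumed plain-text dump of domain documents
    vocab_size=30_522,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("custom-tokenizer")

# Domain jargon now tokenizes as whole units instead of noisy fragments.
print(tokenizer.encode("emergency core cooling system surveillance").tokens)
```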
Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I have been working on embeddings for a while.
For different reasons I have recently become very interested in learned sparse embeddings. So I am curious what led you to choose them for your application, and why?
> Do you mind sharing why you chose SPLADE-esque sparse embeddings?
I'll share what I can publicly. The first thing we always do is develop benchmarks, given the uniqueness of the nuclear energy space and our application. In this case that's FermiBench[0].
When working with operating nuclear power plants there are some fairly unique challenges:
1. Document collections tend to be in the billions of pages. When you have regulatory requirements to extensively document EVERYTHING and plants that have been operating for several decades you end up with a lot of data...
2. There are very strict security requirements - generally speaking everything is on-prem and hard air-gapped. We don't have the luxury of cloud elasticity. Sparse embeddings are very efficient, especially in terms of RAM and storage, which is important when factoring in budgetary requirements. We're already dropping in eight H100s (minimum) so it starts to creep up fast...
3. Existing document/record management systems in the nuclear space are keyword search based if they have search at all. This has led to substantial user conditioning - they're not exactly used to what we'd call "semantic search". Sparse embeddings in combination with other techniques bridge that well.
4. Interpretability. It's nice to be able to peek at the embedding and be able to get something out of it at a glance.
So it's basically a combination of efficiency, performance, and meeting users where they are. Our Fermi model series is still v1 but we've found performance (in every sense of the word) to be very good based on benchmarking and initial user testing.
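As a generic illustration of points 2 and 4 above (efficiency and interpretability), here's what inspecting a SPLADE-style sparse embedding looks like with an off-the-shelf public checkpoint. This is not the Fermi models; the model name and example query are assumptions for the sketch.

```python
# Sketch: produce a SPLADE-style sparse embedding and peek at its top terms.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"  # assumed public SPLADE checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "reactor coolant pump seal inspection"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# SPLADE pooling: log-saturated ReLU, max over the sequence dimension.
weights = torch.max(
    torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1),
    dim=1,
).values.squeeze(0)  # (vocab_size,), mostly zeros

# Interpretability: non-zero dimensions map straight back to vocabulary terms.
top = torch.topk(weights, k=10)
for score, idx in zip(top.values, top.indices):
    print(f"{tokenizer.convert_ids_to_tokens(int(idx)):>15s}  {score:.2f}")
```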
I should also add that some aspects of this (like the BERT pretraining) are fairly compute-intensive. Fortunately we work with the Department of Energy's Oak Ridge National Laboratory and developed all of this on Frontier[1] (for free).
My startup (Atomic Canyon) developed embedding models for the nuclear energy space[0].
Let's just say that if you think off-the-shelf embedding models are going to work well with this kind of highly specialized content you're going to have a rough time.