kkielhofner's comments (Hacker News)

I've used it for hybrid search and it works quite well.

Overall I'm really happy to see Typesense mentioned here.

A lot of the smaller scale RAG projects, etc you see around would be well served by Typesense, but it seems to be relatively unknown for whatever reason. It's probably one of the easiest solutions to deploy, has reasonable defaults, good docs, easy clustering, etc while still being very capable, performant, and powerful if you need to dig in further.
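
A hybrid search request in Typesense is roughly this shape (a minimal sketch with the Python client; the collection and field names here are hypothetical). Listing an embedding field alongside text fields in `query_by` asks Typesense to blend keyword and vector scores, with `alpha` weighting the vector side:

```python
# Sketch of a Typesense hybrid search request (collection/field names assumed).
search_params = {
    "q": "battery fire suppression procedures",
    "query_by": "title,content,embedding",         # text fields + embedding field
    "vector_query": "embedding:([], alpha: 0.3)",  # empty vector -> embed the query text
    "per_page": 10,
}

# Against a real server this would be something like:
#   client.collections["docs"].documents.search(search_params)
print(sorted(search_params))
```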


> I'm an engineering manager

How are you involved in the hiring process?

> Our engineers are fucking morons. And this guy was the dumbest of the bunch.

Very indicative of a toxic culture you seem to have been pulled into, and likely have contributed to by this point, given your language and broad generalizations.

Describing a wide group of people you're also responsible for as "fucking morons" says more about you than them.


They will.

- It's a cornerstone of their brand.

- R2 users are at least paying for something (storage).

- Their network is massively overbuilt to be able to absorb DDoS attacks.

- They offer free bandwidth with their CDN - including completely free users. These resources need to get fetched from origin and/or cached. R2 doesn't have to fetch from origin which eliminates the bandwidth required for the fetch.

Large providers typically pay very little/nothing for bandwidth other than equipment costs (ports, etc). As a large provider they have free peering to most of the "last mile" ISP/eyeball networks in the world. This benefits all parties because these ISPs don't have to pay transit providers and neither does Cloudflare and it's faster. Same goes for all of the big clouds.

People who think AWS/GCP/Azure/etc bandwidth pricing of $0.12/GB (or whatever) is fair/reasonable have no idea what bandwidth actually costs these operators. As noted it's effectively nothing and the big clouds capitalize on this ignorance by charging insane markups for bandwidth.
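
A back-of-envelope version of that markup, using assumed prices (not anyone's quoted rates): suppose a provider commits to a transit port at a typical wholesale rate, and remember the big clouds peer for free with most eyeball networks anyway.

```python
# All prices assumed for illustration, not actual quotes.
port_mbps = 10_000                     # a committed 10 Gbps transit port
port_cost = port_mbps * 0.50           # at $0.50/Mbps/month -> $5,000/month

# A fully utilized 10 Gbps port moves ~3.24M GB in a 30-day month:
gb_per_month = (port_mbps / 8 / 1000) * 30 * 24 * 3600  # Mbps -> GB/s -> GB/month

cost_per_gb = port_cost / gb_per_month
cloud_egress = 0.09                    # $/GB, typical hyperscaler list price
print(f"transit: ${cost_per_gb:.4f}/GB, markup: {cloud_egress / cost_per_gb:.0f}x")
```

Even at full list transit pricing the underlying cost is a fraction of a cent per GB; free peering pushes it lower still.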


ICC, IPP, QAT, etc are definitely an edge.

In AI world they have OpenVINO, Intel Neural Compressor, and a slew of other implementations that typically offer dramatic performance improvements.

As we see with AMD trying to compete with Nvidia, software matters - a lot.


> The US voting machines are just waiting to be hacked, just a matter of when, not if.

The US election system is very distributed and fragmented - there is virtually no standardization.

Even in the tightest margins for something like President you'd need to have seriously good data to figure out which random municipality voting system(s) you'd need to target to actually affect the outcome.


> to figure out which random municipality voting system(s) you'd need to target to actually affect the outcome.

As you said, there's no standardization, which means precincts report on wildly different time intervals. If you can interfere with just the tallying during or after the fact, and you can get the information on other precincts before any other outlets, you could easily take advantage of this.

It's essentially the Superman III version of interfering with an election. Just put your thumb on the scale a little bit everywhere on late precincts all at once.

The fact that so many states let a simple majority of their state take _all_ electors actually makes this possible. If more states removed the Unit Rule and allocated electors like Nebraska and Maine, this would be far less effective.


> As you said, no standardization, which means precincts report on wildly different time intervals

There is standardization within all precincts of a county. And from my past experience as a poll worker, I can tell you why precinct reporting times can vary wildly within a county.

(Note things I say here are specific to the county where I worked.)

Anyone in line to vote by 8PM is allowed to vote. We (the other poll workers and I) could not start closing the polls until every voter had voted. If the local community did not trust vote-by-mail, then that polling place will likely see delays in closing due to lines.

One polling place often covered multiple precincts, so you'll see multiple precincts delayed simultaneously.

After that, boxes go from one queue to another, with multiple queues consolidating into one or two. So, a one-minute delay in dropping off your box to a collection point may mean a two-hour delay in that box being processed.

> if you can interfere with just tallying

First off, that would require a remarkable amount of fraud. Second, that's why there are observers. It doesn't matter if it's 2AM on the Wednesday after election day: If tabulating is happening, you are allowed to observe.


Ironically America's fragmentary and incoherent electoral system makes it extremely hard to steal an election there.


The 2000 election was decided by 500 votes. You think it would be unfeasible to flip 500 votes in a critical swing state with such a system?


The question is how to know which county you need to do that in. The more you try, the greater the odds of being caught, and with margins that small you'd need to attempt multiple states and predict rather accurately how many votes you need: enough to win, but not so many as to attract too much scrutiny.


Moreover the problem there isn't the distributed/local control of voting, but the College.


You say "fragmentary and incoherent", I say "decentralized".


Decentralised could be each counter putting 5000 or so ballots into piles, with people wandering around witnessing for various parties, all working a rigid process across the nation. Each count publicly announced in the room before witnesses.

Totally standardised, coordinated, and decentralised. But not fragmented (structurally) or incoherent.

But I agree it would be a million times worse with a single electronic system.


Wouldn't you only need to target a handful of battleground districts/states? No point in trying to turn Vermont red or Wyoming blue.


> word Wood dominated the embedding values, but these were supposed to go into 2 different categories

When faced with a similar challenge we developed a custom tokenizer, pretrained BERT base model[0], and finally a SPLADE-esque sparse embedding model[1] on top of that.

[0] - https://huggingface.co/atomic-canyon/fermi-bert-1024

[1] - https://huggingface.co/atomic-canyon/fermi-1024
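
The SPLADE-style pooling mentioned above can be sketched in a few lines (toy shapes and random values as stand-ins, not the actual fermi-1024 pipeline): per-token MLM logits become one vector over the vocabulary via log(1 + relu(logit)), max-pooled over token positions.

```python
import numpy as np

# Toy stand-in for per-token MLM logits: (seq_len, vocab_size).
# Shifted negative so most activations are zero, as a trained SPLADE model's
# sparsity regularizer would enforce; real logits come from a BERT-style head.
rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 30_000)) - 2.0

weights = np.log1p(np.maximum(logits, 0.0))  # log(1 + relu(x)), zeroes negatives
doc_vec = weights.max(axis=0)                # max-pool over token positions

# Most vocab entries are exactly zero -> cheap to store as (index, weight) pairs
nonzero = np.flatnonzero(doc_vec)
print(len(nonzero), doc_vec.shape)
```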


Do you mind sharing why you chose SPLADE-esque sparse embeddings?

I have been working on embeddings for a while.

For different reasons I have recently become very interested in learned sparse embeddings. So I am curious what led you to choose them for your application, and why?


> Do you mind sharing why you chose SPLADE-esque sparse embeddings?

I can share what we're able to discuss publicly. The first thing we ever do is develop benchmarks, given the uniqueness of the nuclear energy space and our application. In this case it's FermiBench[0].

When working with operating nuclear power plants there are some fairly unique challenges:

1. Document collections tend to be in the billions of pages. When you have regulatory requirements to extensively document EVERYTHING and plants that have been operating for several decades you end up with a lot of data...

2. There are very strict security requirements - generally speaking everything is on-prem and hard air-gapped. We don't have the luxury of cloud elasticity. Sparse embeddings are very efficient especially in terms of RAM and storage. Especially important when factoring in budgetary requirements. We're already dropping in eight H100s (minimum) so it starts to creep up fast...

3. Existing document/record management systems in the nuclear space are keyword search based if they have search at all. This has led to substantial user conditioning - they're not exactly used to what we'd call "semantic search". Sparse embeddings in combination with other techniques bridge that well.

4. Interpretability. It's nice to be able to peek at the embedding and be able to get something out of it at a glance.

So it's basically a combination of efficiency, performance, and meeting users where they are. Our Fermi model series is still v1 but we've found performance (in every sense of the word) to be very good based on benchmarking and initial user testing.

I should also add that some aspects of this (like pretraining BERT) are fairly compute-intensive to train. Fortunately we work with the Department of Energy's Oak Ridge National Laboratory and developed all of this on Frontier[1] (for free).

[0] - https://huggingface.co/datasets/atomic-canyon/FermiBench

[1] - https://en.wikipedia.org/wiki/Frontier_(supercomputer)


This is an excellent comment; can someone add it to the highlights?


> we don't want to hurt performance on other real-world tasks just to do well on MTEB

Nice!

Fortunately MTEB lets you sort by model parameter size because using 7B parameter LLMs for embeddings is just... Yuck.


LLMs have nearly completely sucked the oxygen out of the room when it comes to machine learning or "AI".

I'm shocked at the number of startups, etc you see trying to do RAG, etc that basically have no idea what they are, how they actually work, etc.

The "R" in RAG stands for retrieval - as in the entire field of information retrieval. But let's ignore that and skip right to the "G" (generative)...

Garbage in, garbage out people!


My startup (Atomic Canyon) developed embedding models for the nuclear energy space[0].

Let's just say that if you think off-the-shelf embedding models are going to work well with this kind of highly specialized content you're going to have a rough time.

[0] - https://huggingface.co/atomic-canyon/fermi-1024


> they're not a complete replacement for simpler methods like BM25

There are embedding approaches that balance "semantic understanding" with BM25-ish.

They're still pretty obscure outside of the information retrieval space but sparse embeddings[0] are the "most" widely used.

[0] - https://zilliz.com/learn/sparse-and-dense-embeddings
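
The reason sparse embeddings bridge the gap: they score documents the same way BM25 does, as a dot product over shared term weights, except the weights come from a model instead of tf-idf statistics. A minimal sketch with made-up weights:

```python
# Learned sparse vectors as {term: weight} maps (weights invented for illustration).
query = {"reactor": 1.8, "coolant": 1.2, "leak": 0.9}
doc   = {"reactor": 1.5, "coolant": 0.7, "maintenance": 1.1}

# Only overlapping terms contribute -- the same inverted-index trick BM25 uses.
score = sum(w * doc.get(term, 0.0) for term, w in query.items())
print(round(score, 2))  # -> 3.54
```

Because scoring is just a sparse dot product, these models can ride on existing inverted-index infrastructure while still capturing learned term expansions.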

