Thank you for the comment! Compared to you, I have only scratched the surface of this quite complex domain, and would love to get more of your input!

> building HNSW indices in Postgres is still extremely slow (even with parallel index builds), so it is difficult to experiment with index hyperparameters at scale.

Yes, I experienced this too. I went from 1536 to 256 dimensions and did not try as many values as I'd have liked, because spinning up a new database and recreating the embeddings simply took too long. I'm glad it worked well enough for me, but without a quick way to experiment with these hyperparameters, who knows whether I've struck the right tradeoff.
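
For anyone curious, shortening a Matryoshka-style embedding is typically just truncation plus re-normalization. A minimal TypeScript sketch (the helper name is mine, not SemHub code):

```typescript
// Shorten a Matryoshka-style embedding: keep the first `dims` components,
// then L2-normalize so cosine similarity still behaves sensibly.
function truncateEmbedding(embedding: number[], dims: number): number[] {
  const truncated = embedding.slice(0, dims);
  const norm = Math.sqrt(truncated.reduce((sum, x) => sum + x * x, 0));
  return truncated.map((x) => x / norm);
}

// e.g. shrink a 1536-dim embedding down to 256 dims before indexing
// const short = truncateEmbedding(fullEmbedding, 256);
```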

Someone on Twitter reached out and pointed out that one could quantize the embeddings to bit vectors and search with Hamming distance; supposedly the performance hit is negligible, especially if you add a quick rescore step: https://huggingface.co/blog/embedding-quantization
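
I haven't shipped this, but the core trick is simple enough to sketch: threshold each dimension to a bit, compare candidates with Hamming distance (XOR + popcount), then rescore. Roughly (hypothetical helpers, not SemHub code):

```typescript
// Quantize a float embedding to a bit vector: 1 where the component
// is positive, 0 otherwise.
function toBits(embedding: number[]): Uint8Array {
  const bits = new Uint8Array(Math.ceil(embedding.length / 8));
  embedding.forEach((x, i) => {
    if (x > 0) bits[i >> 3] |= 1 << (i & 7);
  });
  return bits;
}

// Hamming distance: count the bits where two quantized vectors differ.
function hamming(a: Uint8Array, b: Uint8Array): number {
  let dist = 0;
  for (let i = 0; i < a.length; i++) {
    let xor = a[i] ^ b[i];
    while (xor) {
      dist += xor & 1;
      xor >>= 1;
    }
  }
  return dist;
}
```

You'd then take, say, the top 100 candidates by Hamming distance and rescore them with cosine similarity on the full float vectors, which is apparently where most of the lost recall comes back. (I believe recent pgvector versions also support bit vectors and Hamming distance natively, but I haven't tried.)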

> But (as mentioned in other comments) keeping your data in sync is a huge issue.

Curious if you have any good solutions in this respect.

> The other challenge I’ve found is that filtering is often the “special sauce” that vector store providers bring to the table, so it’s pretty difficult to reason about the performance and recall of various types of filters.

I realize they market heavily on this, but for open source databases, wouldn't the fact that you can see the source code make it easier to reason about this? Or is your point that their implementations here are all custom and require much more specialized knowledge to evaluate effectively?


> And overall, the fewer people use CF (or another provider of their size) the better.

I understand your sentiment, but I vehemently disagree.

The cloud provider space has rapidly become an oligopoly, and Cloudflare is one of the few new entrants that (1) has sufficient scale to compete with the incumbents; and (2) has new ideas that the incumbents cannot easily match (Region Earth, Durable Objects, etc.).

For most production workloads, I would not even consider the newer cloud providers, but I sincerely meant it when I said I hope Cloudflare will succeed. They've also been very responsive to the feedback raised in the blog post when I DMed them.

(On a side note re: difficulty for newcomers in this market, I used to be part of a team that would run e.g. staging and testing environments on a new serverless db provider, but would run prod on AWS Aurora. In retrospect, this did not make much sense either as you want your environments to be as similar as possible, which means new cloud providers have an even tougher time getting started.)


After this failed experiment with SemHub, I am actually thinking of building something like this; open source maintainers like you are definitely the ICP! (nuqs seems really cool btw, storing state in URL params is definitely the way to go)

To elaborate, I was thinking of the following (roughly sketched after this list):

- running a cron that checks repos every X minutes

- for every new issue someone opens, running an agent that (1) checks e.g. SemHub for similar issues; (2) checks the project's Discord server or Slack channel to see if anyone has raised something similar; and (3) runs a general search

- using an LLM to compose a helpful reply pointing the OP to that other issue/Discord discussion etc.
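
Very roughly, the shape I have in mind (every name here is hypothetical; nothing is built yet):

```typescript
// Hypothetical shape of the pipeline; all of these names are made up.
interface Issue {
  repo: string;
  number: number;
  title: string;
  body: string;
}

interface Match {
  url: string;
  score: number;
}

interface Deps {
  findSimilarIssues(issue: Issue): Promise<Match[]>; // e.g. SemHub-style semantic search
  searchCommunity(issue: Issue): Promise<Match[]>; // Discord/Slack history
  composeReply(issue: Issue, matches: Match[]): Promise<string>; // LLM call
  postComment(issue: Issue, body: string): Promise<void>; // GitHub API
}

// Called from a cron for every issue opened since the last tick.
async function handleNewIssue(issue: Issue, deps: Deps): Promise<void> {
  const matches = [
    ...(await deps.findSimilarIssues(issue)),
    ...(await deps.searchCommunity(issue)),
  ].sort((a, b) => b.score - a.score);

  // Only reply when there is genuinely something worth pointing to;
  // otherwise the bot is just more noise.
  if (matches.length > 0) {
    const reply = await deps.composeReply(issue, matches.slice(0, 3));
    await deps.postComment(issue, reply);
  }
}
```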

From other OSS maintainers, I've heard that being able to reliably identify duplicates would be a huge plus. Does this sound like something you'd be interested in trying? Let me know how I can reach you if/when I have built something like this!

I am personally quite annoyed by all the AI slop being created on social media and even GitHub PRs and would love to use the same technology to do something pro-social.


While having a bot that auto-replies with "similar issues" pointers might make sense at a large scale (to relieve maintainers), I usually prefer to do this manually at my current scale: I know there's one particular instance where I pointed someone in a given direction, and I want to either reuse/modify a code example block, or stitch together semantically unrelated but relevant comments and discussions.

You might want to talk to Jovi [1] about that, he's doing something very similar.

[1] https://bsky.app/profile/jovidecroock.com/post/3lh6hkcxnqc2v


Ah I was doing semantic search of GitHub _issues_, not the actual code on GitHub.

For code search, I have used grep.app, which works reasonably well.


> There’s so much complexity that comes with keeping your vector db in sync with your main db (especially once you start filtering with metadata)

Ohh, do you speak from experience? I know I will likely never do this, but I'm curious how you did it. When I looked into this, I found that Airbyte has something to connect the vector db with the main db, but I never bit that bullet (thankfully).


Oh wow, that's super cool. I tried it and it's very fast indeed. Thanks for sharing! Will spend more time understanding how it's implemented.



Thanks for the feedback! To be honest, my own experience is actually very similar to yours.

The original pain point probably only exists for a small minority of open source maintainers who manage multiple repos and actually search across them regularly. Most devs are probably like you and me, and the mediocre GitHub search experience is more than compensated for by Google.

In its current iteration, it's quite hard to get regular devs to change their search behavior; even for those who experience this pain point, it probably isn't large enough for them to change their behavior.

If I continue to work on this, I would want to (1) solve a bigger + more frequent pain point; (2) build something that requires a smaller change in user behavior.


Any chance you live in SF? If so, we should meet up - I'm working on something similar. You can reach out at my username @gmail.com

Totally fair point. Thanks for taking the time to read through it! I didn't want to use a VPS and then have to switch to something else if the product really took off, but I guess that rhymes with premature optimization.

Some other clarifications:

- I was also surprised by how expensive Supabase turned out to be, and only got there because I was trying to sync very big repos ahead of time. I could see an alternative product where the cost here would be minimal too

- I did see this project as an opportunity to try out Cloudflare. As mentioned in the post, as a full stack TypeScript developer, I thought Cloudflare could be a good fit, and I still really want it to succeed as a cloud platform

- deploying separate API and auth servers is actually simpler than it sounds, since each is a Cloudflare Worker! Will try to open source this project so this is clearer

- the durable objects rate limiter was wholly experimental and didn't make it into production

> All that being said though, maybe all it would've done is prolong the inevitable death due to the product gap the author concludes with.

Very true :(


I am using a VPS and it is dead simple and cheap. If my projects actually gained traction, switching from a VPS to more scalable infra would not be a big challenge. The biggest challenge is to find PMF as fast and as efficiently as possible.

Author here. Over the last few months, I built and launched a free semantic search tool for GitHub called SemHub (https://semhub.dev/). In this blog post, I share what I've learned and why I failed, so that other builders can learn from my experience. The blog post runs long and I have sign-posted each section, marking the sections I consider particularly insightful with an asterisk (*).

I have also summarized my key lessons here:

1. Default to pgvector, avoid premature optimization.

2. You can probably get away with shorter embeddings if you're using Matryoshka embedding models.

3. Filtering with vector search may be harder than you expect (see the sketch after this list).

4. If you love full stack TypeScript and use AWS, you'll love SST. One day, I hope I can recommend Cloudflare in equally strong terms too.

5. Building is only half the battle. You have to solve a big enough problem and meet your users where they’re at.
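
To make lesson 3 concrete, here's the kind of query where things get subtle: a filtered pgvector search, sketched with the postgres.js client and a made-up schema. With an HNSW index, Postgres walks the index in distance order and applies the filter afterwards, so a selective filter can quietly return fewer rows than your LIMIT, or hurt recall:

```typescript
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

// Hypothetical schema: issues(id, title, repo_id, embedding vector(256)).
// The WHERE clause is applied after the approximate index scan, so a
// selective repo_id filter can leave you with far fewer than 10 rows.
async function searchIssues(embedding: number[], repoId: string) {
  const vec = JSON.stringify(embedding); // pgvector accepts '[0.1,0.2,...]'
  return sql`
    SELECT id, title, embedding <=> ${vec}::vector AS distance
    FROM issues
    WHERE repo_id = ${repoId}
    ORDER BY embedding <=> ${vec}::vector
    LIMIT 10
  `;
}
```

(I believe newer pgvector versions add iterative index scans to mitigate this, but it's worth measuring recall with your real filters either way.)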


It's weird you consider this a failure. You spent a few months and learned how to work with embedding models to build an efficient search; the fact that your search works well is a successful outcome. If your goal was to turn a few months' effort into a thriving business, that was never going to happen, period. It only seems possible because, when it does happen for people, we completely discount the luck factor.

If you want to turn your search into a business, now that's a new and different effort, mostly marketing and stuff that most self-respecting engineers give zero shits about. But if that's your real goal, don't call it a failure yet, because you haven't even tried.


> It's weird you consider this a failure. You spent a few months and learned how to work with embedding models to build an efficient search; the fact that your search works well is a successful outcome.

Thank you for your encouragement! I take your point that it was not a technical failure, but I think it's still a product failure in the sense that SemHub was not solving a big enough pain point for sufficiently many people.

> If you want to turn your search into a business, now that's a new and different effort, mostly marketing and stuff that most self-respecting engineers give zero shits about. But if that's your real goal, don't call it a failure yet, because you haven't even tried.

Haha, to be honest, my goal was even more modest: SemHub is intended to be a free tool for people to use; we don't intend to monetize it. I also did try to market it (DMing people, Show HN), but the initial users who tried it did not stick around.

Sure, I could've marketed SemHub more, but I think the best ideas carry within themselves a certain virality and I don't think this is it.


Hi, thanks for building a great tool and a great write-up! I was trying to add a number of repos under oslc/, oslc-op/, and eclipse-lyo/* orgs but no joy - internal server error. Hopefully, you will reconsider shutting down the project (just heard about it and am quite excited)!

I think a project like yours is going to be helpful to OSS library maintainers to see which features are used in downstream projects and which have issues. Especially when, as in my case, the project attempts to advance an open standard and just checking issues in the main repo will not give you the full picture. For this use case, I deployed my own instance to index all OSS repos implementing OSLC REST or using our Lyo SDK - https://oslc-sourcebot.berezovskyi.me/ . I think your tool is a great complement to the code search.


Ohh, apologies! I think there was a bug that led to the Internal Server Error. Please try again; I _think_ it should be working now!

> I think a project like yours is going to be helpful to OSS library maintainers to see which features are used in downstream projects and which have issues.

That was indeed the original motivation! Will see if I can convince Ammar to reconsider shutting down the project, but no promises.

> For this use case, I deployed my own instance to index all OSS repos implementing OSLC REST or using our Lyo SDK

Ohh, in case it's not clear from the UI, you could create an account and index your own "collection" of repos and search from within that interface. I had originally wanted to build out this "collection" concept a lot more (e.g. mixing private and public repos), but I thought it was more important to see if there's traction for the public search idea at all


SST: https://github.com/sst/sst - vaguely similar to CDK but can also manage some non-AWS resources and seems TypeScript-only

Apparently they started on top of CDK, then migrated to Pulumi, adding support for Terraform providers.

Looks like one of the more interesting deploy toolkits I've seen in a while.


Thanks for posting this, very timely as I'm also playing around with pgvector for semantic search. I saw that you ended up trimming inputs longer than 8K tokens. Have you looked into chunking (breaking input into smaller chunks and doing vector search on the chunks)? Embedding models I'm playing with have a max of 512 tokens, so chunking is pretty much a must. Choosing a chunking strategy seems to be a deep rabbit hole of its own.

> Have you looked into chunking (breaking input into smaller chunks and doing vector search on the chunks)?

Ohh, I had not seriously considered this until reading your comment. I could have multiple embeddings per issue and search across those embeddings; if the same issue is matched multiple times, I would probably take the strongest match and dedupe, as sketched below.
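
Something like this sketch, assuming a hypothetical issue_chunks table (one embedding per chunk) and the postgres.js client: over-fetch chunks, then keep each issue's strongest match in application code:

```typescript
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

// Hypothetical schema: issue_chunks(issue_id, embedding vector(256)),
// several rows per issue. Over-fetch chunks, then keep each issue's
// strongest (smallest-distance) match so no issue appears twice.
async function searchChunkedIssues(embedding: number[]) {
  const vec = JSON.stringify(embedding);
  const chunks = await sql`
    SELECT issue_id, embedding <=> ${vec}::vector AS distance
    FROM issue_chunks
    ORDER BY embedding <=> ${vec}::vector
    LIMIT 50
  `;
  const best = new Map<string, number>();
  for (const row of chunks) {
    const prev = best.get(row.issue_id);
    if (prev === undefined || row.distance < prev) {
      best.set(row.issue_id, row.distance);
    }
  }
  return [...best.entries()].sort((a, b) => a[1] - b[1]).slice(0, 10);
}
```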

I could create embeddings for comments too and search across those.

Thanks for the suggestion, it would be a good thing to try!

> Choosing a chunking strategy seems to be a deep rabbit hole of its own.

Yes this is true. In my case, I think the metadata fields like Title and Labels are probably doing a lot of the work (which would be duplicated across chunks?) and, within an issue body, off the top of my head, I can't see any intuitive ways to chunk it.

I have heard that for standard RAG, chunking goes a surprisingly long way!


Having built a failed semantic search engine for life sciences (bioask, when it existed), I think the last point should be the first: not finding product-market fit quickly enough is what killed mine.

Thanks for writing this up!

> Filtering with vector search may be harder than you expect.

I've only ever used it for a small proof of concept, but Qdrant is great at categorical filtering with HNSW.

https://qdrant.tech/articles/filtrable-hnsw/


Thanks for sharing! Do you have more details, e.g. did you have just a vector db, or did you have a main db as well?

In my research, Qdrant was also the top contender and I even created an account with them, but the need to sync two dbs put me off


Fantastic writeup — thank you for taking the time to do this!

I'm glad you found it helpful :)

By 5, do you mean promoting the app? It is by far the biggest problem, yes. In many cases even bigger than building the app itself.

It's always a little dubious when modern people pretend to have high confidence about the behaviors of long-dead people to serve their modern purposes. (Another example: oh you're an INFJ, just like Moses from the Bible!)


Our ancient ancestors never had to deal with such deep thoughts, so we can't help but make dubious comparisons because our brains haven't evolved enough.

Yes, it's so lazy. Modern people ought to pretend to greater laziness, and then things would go better.

