Thanks for the breakdown — this is exactly the kind of feedback I needed.

You’re right about my query flow: I’m still doing separate LLM calls for the router, analyzer, and rewriter. Merging that into one should cut latency a lot, especially since Qwen2.5-7B-AWQ on an RTX 4000 Ada only gives me ~15–25 tok/s.
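Roughly what I have in mind for the merged call: one prompt that asks for routing, analysis, and the rewrite as a single JSON object, so I pay one generation instead of three. This is a sketch, not my actual code; `llm` is a stand-in for whatever callable wraps the Qwen endpoint, and the route labels are made up for illustration:

```python
import json

# One combined prompt replacing three separate LLM calls (router,
# analyzer, rewriter). The model returns a single JSON object.
COMBINED_PROMPT = """Given the user question, return ONLY a JSON object
with three keys:
  "route":    one of "search", "chitchat", "clarify"
  "analysis": a one-sentence summary of the information need
  "rewrite":  the question rewritten for retrieval
Question: {question}"""

def route_analyze_rewrite(question: str, llm) -> dict:
    raw = llm(COMBINED_PROMPT.format(question=question))
    result = json.loads(raw)
    # Fail loudly if the model omitted a key rather than continuing
    # with a partial pipeline state.
    for key in ("route", "analysis", "rewrite"):
        if key not in result:
            raise ValueError(f"missing key: {key}")
    return result
```

At 15–25 tok/s the win is less about output tokens (the three answers are short anyway) and more about skipping two extra rounds of prompt processing and scheduling.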

The BM25 point is spot-on too. I’ve been running pure vector search (BGE-base-en-v1.5 + FAISS, reranked with bge-reranker-v2-m3). Adding BM25 with dynamic weighting — especially for exact-match queries like titles/authors — is something I really shouldn’t keep putting off.
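The fusion step itself is simple once both retrievers return scores; the part I need to get right is the dynamic weight. A sketch of what I'm planning (pure Python over score dicts, so no FAISS/BM25 deps here; the alpha heuristic and thresholds are guesses I'd tune, not anything from a paper):

```python
def fuse_scores(bm25: dict, dense: dict, alpha: float) -> dict:
    """Min-max normalize each retriever's scores, then blend.
    alpha weights BM25; (1 - alpha) weights the dense scores."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {k: (v - lo) / (hi - lo) if hi > lo else 0.0
                for k, v in scores.items()}
    nb, nd = norm(bm25), norm(dense)
    docs = set(nb) | set(nd)
    return {d: alpha * nb.get(d, 0.0) + (1 - alpha) * nd.get(d, 0.0)
            for d in docs}

def pick_alpha(query: str) -> float:
    # Heuristic: quoted phrases or runs of Title Case words suggest
    # exact-match intent (titles/authors), so lean on BM25.
    if '"' in query or sum(w.istitle() for w in query.split()) >= 2:
        return 0.7
    return 0.3
```

The normalization matters because BM25 scores and cosine similarities live on completely different scales; blending raw values would silently let one side dominate.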

Using the cross-encoder as the evaluator is probably the easiest fix. My current GOOD/UNSURE/BAD scoring uses the same Qwen model, which is the circular issue I mentioned. Since I’m already running the cross-encoder, letting it handle the thresholding would let me drop the LLM evaluator entirely.
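The replacement evaluator would just be a threshold map over the reranker's top score. bge-reranker-v2-m3 emits raw logits, so I'd squash with a sigmoid first; the two cutoffs below are placeholders I'd calibrate on held-out queries, not values from the model card:

```python
import math

# Assumed thresholds on sigmoid(logit) -- tune on real traffic.
GOOD_T, BAD_T = 0.6, 0.2

def verdict(logit: float) -> str:
    """Map a cross-encoder relevance logit to GOOD / UNSURE / BAD."""
    p = 1.0 / (1.0 + math.exp(-logit))
    if p >= GOOD_T:
        return "GOOD"
    if p <= BAD_T:
        return "BAD"
    return "UNSURE"
```

Since the reranker already scored every candidate, this evaluator is effectively free, and it's a different model family from Qwen, which breaks the circularity.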

No caching yet, but I’ll start with exact-match hashing and layer semantic caching later.
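The exact-match layer is small enough to do right away: normalize the query (lowercase, collapse whitespace) before hashing so trivial variants hit the same entry. Sketch with an in-process dict; in practice this would sit in front of the whole pipeline:

```python
import hashlib

_cache: dict[str, str] = {}

def _key(query: str) -> str:
    # Normalize before hashing so "  Hello World " and "hello world"
    # share one cache entry.
    canonical = " ".join(query.lower().split())
    return hashlib.sha256(canonical.encode()).hexdigest()

def cached_answer(query: str, answer_fn):
    k = _key(query)
    if k not in _cache:
        _cache[k] = answer_fn(query)
    return _cache[k]
```

The semantic layer later would reuse the same shape, just swapping the hash lookup for a nearest-neighbor check against cached query embeddings above some similarity cutoff.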

Model-wise: Qwen2.5-7B-AWQ on GPU, with Qwen2.5-14B on CPU as a slow fallback. AWQ is what makes the 8GB VRAM setup workable.

Really appreciate you taking the time — I’ll open issues for hybrid search + caching this week.
