Hacker News | yugoru's comments

All my services have lost their databases. The only thing their AI agent offers is to delete all the data (thousands of records) and create a new database. It insists there's no other way. But I have clients, and I have hundreds of courses on the platform!

It's really interesting. Running agents close to the user’s data solves a lot of latency and privacy issues. The hard part seems to be balancing autonomy with predictability: once agents start chaining actions locally, debugging behavior becomes tricky. Curious how you're approaching that.


It's harder than it first appears. Even with good embeddings, semantic similarity across languages often breaks when articles include local context or idioms. Curious whether you found a threshold strategy that works reliably across languages, or if it still needs manual tuning.


Good question. The short answer: a single global threshold (cosine similarity ≥ 0.7) works surprisingly well for news, but it's not because embeddings handle idioms perfectly — it's because news articles are structurally constrained.

News articles about the same event tend to share named entities (people, places, organizations), numbers, and factual structure even across languages. "EU approves AI regulation" is a factual statement that embeds similarly regardless of language. This is very different from, say, opinion pieces or cultural commentary where idioms and local framing would diverge more.

That said, similarity alone isn't enough. The real reliability comes from non-semantic constraints layered on top:

- Time gap ≤ 18 hours between article and story — prevents "same topic, different month" false merges

- Story age ≤ 36 hours — old stories stop absorbing new articles

- Two-pass design — matching against refined story embeddings (average of recent articles) is more stable than raw article-to-article comparison
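To make the constraints above concrete, here is a rough Python sketch of the story-matching step only. It is an illustration under stated assumptions, not the actual pipeline: the names `Story`, `assign`, and `RECENT_WINDOW` are hypothetical, "time gap" is read as the distance to the story's most recent article, "story age" as time since the story's first article, and the window of 5 recent articles is an arbitrary choice. Embeddings are plain lists of floats here; a real system would take them from the embedding model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
import math

SIM_THRESHOLD = 0.7                  # global cosine cutoff mentioned above
MAX_TIME_GAP = timedelta(hours=18)   # article vs. story's latest article (assumed reading)
MAX_STORY_AGE = timedelta(hours=36)  # article vs. story's first article (assumed reading)
RECENT_WINDOW = 5                    # assumption: articles averaged into the story embedding


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Story:
    created: datetime
    last_seen: datetime
    embeddings: list = field(default_factory=list)

    def centroid(self):
        """Average of the most recent article embeddings."""
        recent = self.embeddings[-RECENT_WINDOW:]
        return [sum(col) / len(recent) for col in zip(*recent)]


def assign(stories, embedding, published):
    """Attach an article to the best eligible story, or start a new one."""
    best, best_sim = None, SIM_THRESHOLD
    for story in stories:
        if published - story.created > MAX_STORY_AGE:
            continue  # old stories stop absorbing new articles
        if abs(published - story.last_seen) > MAX_TIME_GAP:
            continue  # guards against "same topic, different month" merges
        sim = cosine(embedding, story.centroid())
        if sim >= best_sim:
            best, best_sim = story, sim
    if best is None:
        best = Story(created=published, last_seen=published)
        stories.append(best)
    best.embeddings.append(embedding)
    best.last_seen = max(best.last_seen, published)
    return best
```

Comparing against the averaged story embedding rather than individual articles is what the "refined story embeddings" bullet describes; the time checks run first so that an ineligible story is never considered, however similar it is.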

Where it does break: regional stories with heavy local context. A Japanese domestic politics article and an English wire service summary of the same event sometimes land just below threshold because the framing is so different. I accept some missed merges there rather than lowering the threshold and getting false positives.

No per-language thresholds so far — the embedding model (Qwen3) seems to normalize well across the languages I cover. But I wouldn't be surprised if that changes when adding languages with less training data representation.


Interesting perspective. One thing that keeps surprising me is how many modern systems still end up rediscovering OOP ideas in different forms, especially when you start modeling complex real-world systems (hardware pipelines, robotics control layers, etc.). The terminology changes, but the need for encapsulating behavior around state never really disappears.

