Show HN: I built Haystack – your own google for scattered workplace knowledge (haystack.it)
96 points by _vxw6 on Jan 2, 2023 | 67 comments
Hi all!

A few weeks ago I spent 40 minutes straight scrolling through Confluence pages trying to find the SSH connection details for our integration machine. Later I discovered that my co-worker had Slacked me the SSH connection string two months earlier.

So that same weekend I started working on Haystack, a search engine for workplace apps that lets you search Slack, Confluence, Jira, Teams, SharePoint, GitHub, and email in one place.

I wanted it to support natural language queries so a query like: "how to connect to integ2 machine?" yields:

  ssh -i private.pem ubuntu@ec2-integration2.eu-est-1.compute.amazonaws.com
I decided that user data should be stored locally, so all logic is completely client-side (including the NLP model). I don't want access to your internal docs, thanks.

I rolled it out to my co-workers a week ago and they thought it was a hit, so I'm planning on releasing it publicly in March 2023. If you want to try it out before then, it's available here: https://haystack.it. Thanks!




Interesting idea and approach!

Just a thought on the design: I really feel like the gold gradients make the whole site feel 'cheap', not trustworthy and not very 'professional'. Actually makes me not want to use it. Replacing them with a simple warm yellow improves this a lot. That might be just me, but maybe it's something for you to consider.

Good luck!


I'm not a designer, so I appreciate this advice immensely. I'll try!


Yep, specifically it looks to me like the kind of styling for a gambling or raffle website.


Or one of those pamphlets that come with packages from random companies on Amazon that tries to show how it's high quality.


Note taken lol wow


Honestly it's not that bad. For someone who's not a designer it looks rather polished.


I’m adding some technical details!

Haystack runs entirely client-side in the browser, so it has a unique tech stack:

Storage

Using IndexedDB, Haystack stores user indexes locally, along with a compressed 90 MB NLP model (t5-small).

Indexing

Indexing happens locally in the browser using a t5-small bi-encoder; some document parsing happens in WASM.

Search

The query is converted to an embedding and searched against the index. Finally, the results are reranked with a t5-small-based cross-encoder, and the top results go through a seq-to-seq transformer to produce a nice, concise textual answer.
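
For anyone who wants the shape of that pipeline, here is a minimal Python sketch of two-stage search: a cheap embedding lookup builds a shortlist, then a more expensive scorer reranks only that shortlist. The `embed` and `cross_score` functions are toy stand-ins (hashed bag-of-words and word overlap), not the t5-small models, and every name here is hypothetical. The final seq-to-seq answer step is left out.

```python
import math
import re
import zlib

DIM = 64  # toy embedding size; an assumption, not Haystack's real dimension


def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for a bi-encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for tok in tokens(text):
        vec[zlib.crc32(tok.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Dot product of two unit vectors equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def cross_score(query, passage):
    """Toy stand-in for a cross-encoder: fraction of query words in the passage."""
    q = set(tokens(query))
    p = set(tokens(passage))
    return len(q & p) / len(q) if q else 0.0


def search(query, index, shortlist=3, top_k=1):
    """Stage 1: embedding retrieval. Stage 2: rerank only the shortlist."""
    q_vec = embed(query)
    candidates = sorted(
        index, key=lambda item: cosine(q_vec, item[1]), reverse=True
    )[:shortlist]
    reranked = sorted(
        candidates, key=lambda item: cross_score(query, item[0]), reverse=True
    )
    return [text for text, _ in reranked[:top_k]]


paragraphs = [
    "To connect to the integ2 machine use ssh with the private key",
    "Lunch menu for the office cafeteria this week",
    "Jira ticket workflow for the release process",
]
index = [(p, embed(p)) for p in paragraphs]
results = search("how to connect to integ2 machine", index)
```

The point of the split is cost: the embedding comparison is cheap enough to run over the whole index, while the reranker only ever sees a handful of candidates.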


Very cool tech stack. How do you run t5-small in the browser?


I load the 90 MB model into memory from IndexedDB, and then some Rust code compiled down to WASM runs the computations with the model.


Unless there is some way to try it, this post may be against the Show HN guidelines (1), specifically:

> Show HN is for something you've made that other people can play with. HN users can try it out, give you feedback, and ask questions in the thread.

> Off topic: Those can't be tried out, so can't be Show HNs. Make a regular submission instead.

I'd suggest you change the title asap.

(1) https://news.ycombinator.com/showhn.html


I seriously hope they don't submit this thing every 5 days until March https://news.ycombinator.com/item?id=34161085


Well, at least it got new text in the submission!


Hi, didn’t intend to repost this, it’s just that HN is hard to figure out.

And most first posts don’t get the intended traction for various reasons (too long, bad wording, unclear).

Reposting and changing the post is totally allowed :)


Oh, I meant that in the spirit of encouraging a reworked text!


Sorry, I didn't assume otherwise; I accidentally replied to the wrong thread.


I can see how this has potential. I've worked on a similar thing at work, and there are some nuances as to the level the embedding is performed at (e.g. sentence level) and the kinds of queries that your search engine will be good at (i.e. good ranking of results). Also, depending on how heterogeneous the data is, other factors (datedness, or the colleague whose writing/instructions/tutorials you prefer) can also be incorporated into the ranking algorithm.
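
To make the last point concrete, here is a rough Python sketch of blending semantic similarity with recency and author preference. The formula, the half-life, and the boost are made-up tuning knobs for illustration, not anything measured or taken from a real system.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    similarity: float  # semantic similarity in [0, 1] from the embedding search
    age_days: float    # how old the document is
    author: str


def rank_score(hit, preferred_authors, half_life_days=180.0, author_boost=0.1):
    """Blend similarity with a recency decay and a small author bonus."""
    # Recency is 1.0 for a document written today and 0.5 after one half-life.
    recency = 0.5 ** (hit.age_days / half_life_days)
    bonus = author_boost if hit.author in preferred_authors else 0.0
    # Old documents keep at least half of their similarity score.
    return hit.similarity * (0.5 + 0.5 * recency) + bonus


hits = [
    Hit("Old wiki page on deploys", similarity=0.80, age_days=720, author="sam"),
    Hit("Recent Slack message on deploys", similarity=0.75, age_days=10, author="alex"),
]
ranked = sorted(
    hits, key=lambda h: rank_score(h, preferred_authors={"alex"}), reverse=True
)
```

Here the slightly less similar but much fresher message from a preferred author ends up first.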


You hit the nail on the head. Some of these are things I'm dealing with right now (e.g. dates).


Here's a slightly different use case. I want something that can index all open tabs in the browser so that I do not have to leave hundreds of tabs open.


There are semantic embedding libraries that are fast enough to run on entire webpages in ~hundreds of millis or low single-digit seconds. I've been thinking a lot recently about making a browser plugin that simply does semantic embedding at the paragraph level for every web page I ever visit, and stores the results in a vector database.

This would enable querying my little private search engine like "the HN story a few weeks ago that talked about ancient greek mining techniques" or "the reddit comment that had an analysis comparing Orwell's 1984 to the bible".

For those not familiar, semantic embeddings take a chunk of text and embed it in a high dimensional vector space (~hundreds of dimensions) where semantically similar texts are closer together.
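
A tiny Python sketch of the idea, assuming a brute-force in-memory store instead of a real vector database. The 3-dimensional vectors are written by hand purely for illustration; a real embedding model would produce them from the text, with hundreds of dimensions. The class name and URLs are hypothetical.

```python
import math


class TinyVectorStore:
    """In-memory stand-in for a vector database: brute-force cosine search."""

    def __init__(self):
        self.rows = []  # (url, paragraph, unit vector)

    @staticmethod
    def _unit(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    def add(self, url, paragraph, vector):
        self.rows.append((url, paragraph, self._unit(vector)))

    def query(self, vector, k=1):
        unit = self._unit(vector)
        scored = [
            (sum(a * b for a, b in zip(unit, row_vec)), url, paragraph)
            for url, paragraph, row_vec in self.rows
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
# Hand-written toy vectors; similar topics point in similar directions.
store.add("https://example.com/mining", "A paragraph on ancient Greek mining", [0.9, 0.1, 0.0])
store.add("https://example.com/orwell", "A paragraph comparing 1984 to the Bible", [0.0, 0.2, 0.9])

# Pretend this vector is the embedding of a query about Greek mining.
best = store.query([0.8, 0.2, 0.1], k=1)[0]
```

`best` is a `(similarity, url, paragraph)` tuple, so the plugin could jump straight back to the page the paragraph came from.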


Very good explanation!

Hundreds of millis is what I experienced.


That's extremely interesting. I would argue that the reason for keeping tabs open varies, but it is something along the lines of: re-reaching the page in the tab is too slow.


More specifically, I do a lot of context switching and still try to maintain a reasonable amount of open tabs.

The problem I have is that days or months later I can't remember whether something I read is in an open tab or was closed a while ago, which results in some frantic searches.


Tab grouping is one way to tame the madness. I've tried it with the default tab groups feature in Chrome but keep losing the groupings whenever I restart Chrome. Does anyone know a better grouping extension?



This is awesome! Is any of it open-source? I’d love to learn more about how the LLM works in the browser.


It will be very soon: https://github.com/haystackoss/haystack

Some Rust code that compiles to WASM loads the LLM from memory and uses a custom Rust alternative to transformer.py that we wrote.


Knowledge base for dev teams -- a problem we all have.

I almost built a potential solution to this problem years ago but backed out. I'd love to see a solution that sticks, and to be wrong about this, but it feels very much like a Tarpit problem to me:

https://www.youtube.com/watch?v=GMIawSAygO4


How do you compare to Glean? https://www.glean.com/

Not affiliated, just a happy user of their product. It searches Slack, Confluence, Jira, Gmail, GDrive, GitHub, and source code all at once, with extras like Go links, verification, and some knowledge base features.


Haystack is open and free for the self-hosted version; the current alpha version is client-side (runs in the browser).


By open do you mean open source?


How much does glean cost?


Why does your emoji (exploding head) on your landing page lead to an external page describing that emoji? https://emojipedia.org/exploding-head/


Haha, I copied it from that page, hilarious!

I'll keep it haha


Awesome idea and interesting approach!

I am wondering what the search latency will be with your approach, especially for cases with more than a few hundred documents. Do you have any insights about that?


Actually, a few hundred documents is really no biggie. My current benchmarks are in the range of <250ms (instant feeling) for hundreds of thousands of paragraphs.

I'm testing this on a large knowledge base.


Does the app need to be open in the browser for the indexer to run?


Yes, it needs to stay open. If that's a problem, I've thought of building an extension for continuous indexing.


As someone who uses a multitude of workplace apps myself, this is amazing. What kind of model are you using, and do you have plans to provide a service based off this?

Good stuff!


I'm using a fine-tuned t5-small model. I fine-tuned it for two tasks: question answering from a paragraph, and highlighting the relevant text in search results.


I’m planning on releasing an open source version of this.

It will also have paid features that managers would like to use.


Sounds like a Searchable Library of All Corporate Knowledge


What determines relevance? It must do something other than page rank. Will it recognize synonyms and more subtle kinds of nearness in word-space?


Actually, the page rank is really based on semantic similarity and the relevance of the matched paragraph.

Under the hood, that is based on a T5 encoder.


Does it work well for question answering from books ?


The tricky part is deciding which units you are indexing: paragraphs vs. sentences vs. pages.

But yeah
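
A quick Python sketch of what that choice looks like in practice, assuming plain text where a blank line ends a paragraph and a form feed ends a page. The splitting rules are deliberately naive and the function name is hypothetical.

```python
import re


def chunk(text, level="paragraph"):
    """Split text into indexable units at the chosen granularity."""
    if level == "page":
        parts = text.split("\f")  # form feed as a page break; an assumption
    elif level == "paragraph":
        parts = re.split(r"\n\s*\n", text)  # a blank line separates paragraphs
    elif level == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)  # naive sentence boundary
    else:
        raise ValueError(f"unknown level: {level}")
    return [part.strip() for part in parts if part.strip()]


book = "The ship left at dawn. The crew was tired.\n\nBy noon the wind had died."
paragraphs = chunk(book, "paragraph")
sentences = chunk(book, "sentence")
```

Smaller units give more precise matches but lose surrounding context, so a question whose answer spans two sentences may only match at the paragraph level.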


Does the index contain only info I have access to? How is authentication for all the knowledge sources handled?


In the setup process you sign in via SSO to all integrations, and the token is saved in local storage. That token is used for indexing, so if you don't have access to some info, the index doesn't have it either.


Makes me wonder, what kind of information sources do you use at work?

Slack, Teams, Confluence, or Notion? Airtable? Jira?


Interesting product, you should list all the integrations on the landing page.


Really looking forward to trying out your app, it looks amazing.


Thanks


> Browser based LLM

Very cool - any more info on that?


It's the same bundle of weights, but it is run by some Rust code that is compiled down to WebAssembly.


How long does the initial indexing take?


Seems quite skinny on info about indexing. It nabs your login and stores tokens to the sites you want to index then... does it just spend CPU days running in your browser downloading and indexing all that data? How much storage do you need locally to index what can potentially be massive amounts of data in most corporate information sources?


Gigabytes of storage, potentially 10+ GB for very large datasets.

Minutes, not days. Very big data sets might take 30+ minutes (or even a couple of hours), but usefulness starts in the first few minutes (because of the priority algorithm)
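
As an illustration of how prioritised indexing can make results useful early, here is a minimal Python sketch that indexes the most recently modified documents first. Using recency as the priority is purely an assumption for the example; the actual priority algorithm is not described here, and the document ids are made up.

```python
import heapq


def indexing_order(documents):
    """Yield document ids, most recently modified first.

    `documents` is a list of (doc_id, modified_timestamp) pairs.
    """
    # heapq is a min-heap, so negate the timestamp to pop the newest first.
    heap = [(-modified, doc_id) for doc_id, modified in documents]
    heapq.heapify(heap)
    while heap:
        _, doc_id = heapq.heappop(heap)
        yield doc_id


docs = [
    ("wiki-2019", 1560000000),
    ("slack-today", 1672600000),
    ("jira-2022", 1660000000),
]
order = list(indexing_order(docs))
```

Because the generator yields one document at a time, search can start serving the freshest material while older documents are still waiting in the queue.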


Where exactly does this store the data? And is it manageable?


It stores it in browser local storage using IndexedDB. If you have access to years of documents, it might take up 10+ GB of persistent storage.
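
For a feel of where a number like that can come from, here is a back-of-the-envelope Python sketch. Every figure in it is an assumption: 512 float32 values per paragraph (512 is t5-small's hidden size, but whether vectors are stored at that size is a guess), about 1 KB of raw text per paragraph, and a hypothetical corpus of three million paragraphs.

```python
def index_size_bytes(num_paragraphs, dims=512, bytes_per_value=4, avg_text_bytes=1000):
    """Rough index size: one embedding vector plus the raw text per paragraph."""
    return num_paragraphs * (dims * bytes_per_value + avg_text_bytes)


# Three million paragraphs is an illustrative figure, not a measured one.
size_gb = index_size_bytes(3_000_000) / 1e9
```

Under those assumptions the vectors account for roughly two thirds of the total, so storing them at lower precision would be the first thing to try if space gets tight.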



you forgot the most important one: haystack.it!

I would argue that every known noun is among the first domains to get registered. My goal is to associate workplace search engines with Haystack.



Forgive me if I don't understand, but I don't think there's a problem with multiple companies using the same common noun as the base for their domain.

Let the best product be remembered for the name.


Certainly not, presuming it doesn't escalate to the level of trademark/service mark infringement (to be fair, IANAL). Just a risk consideration...your product, your call.

But I think there's value in at least recognizing that the namespace is quite crowded given the collisions that two interweb randoms were able to identify in short order.



Not really, gethaystack searches your tabs


Their beta seems so


Hey.com was haystack.com in development I think



