Hey HN! Chris and Yuhong here from Danswer (
https://github.com/danswer-ai/danswer). We’re building an open source and self-hostable ChatGPT-style system that can access your team’s unique knowledge by connecting to 25 of the most common workplace tools (Slack, Google Drive, Jira, etc.). You ask questions in natural language and get back answers based on your team’s documents. Where relevant, answers are backed by citations and links to the exact documents used to generate them.
Quick Demo: https://youtu.be/hqSouur2FXw
Originally, Danswer was a side project motivated by a challenge we experienced at work. We noticed that as teams scale, finding the right information becomes more and more challenging. I recall being on call, helping a customer recover from a mission-critical failure, but the error was related to some obscure legacy feature I had never used. For most problems, a simple question to ChatGPT would have solved it; but in this moment, ChatGPT was completely clueless without additional context (which I also couldn’t find).
We believe that within a few years, every org will be using team-specific knowledge assistants. We also understand that teams don’t want to tell us their secrets and not every team has the budget for yet another SaaS solution, so we open-sourced the project. It is just a set of containers that can be deployed on any cloud or on-premise. All of the data is processed and persisted on that same instance. Some teams have even opted to self-host open-source LLMs to truly airgap the system.
I also want to share a bit about the actual design of the system (https://docs.danswer.dev/system_overview). If you have questions about any parts of the flow such as the model choice, hyperparameters, prompting, etc. we’re happy to go into more depth in the comments.
The system revolves around a custom Retrieval-Augmented Generation (RAG) pipeline we’ve built. At indexing time (we pull documents from connected sources every 10 minutes), documents are chunked and indexed into hybrid keyword + vector indices (https://github.com/danswer-ai/danswer/blob/main/backend/dans...).
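To make that concrete, here is a minimal sketch of the chunk-then-index-twice flow. The chunk size, overlap, and index interfaces are illustrative assumptions, not our actual implementation:

    # Hypothetical sketch of the indexing flow; sizes and interfaces are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Chunk:
        document_id: str
        text: str

    def chunk_document(doc_id: str, text: str, chunk_size: int = 512, overlap: int = 64) -> list[Chunk]:
        # Split a document into overlapping word-based chunks so context
        # isn't lost at chunk boundaries.
        words = text.split()
        step = chunk_size - overlap
        return [
            Chunk(doc_id, " ".join(words[start:start + chunk_size]))
            for start in range(0, len(words), step)
        ]

    def index_document(doc_id: str, text: str, keyword_index, vector_index, embed) -> None:
        # Every chunk is written to both indices: BM25 for exact keyword
        # matches, embeddings for natural-language similarity.
        for chunk in chunk_document(doc_id, text):
            keyword_index.add(chunk.document_id, chunk.text)
            vector_index.add(chunk.document_id, embed(chunk.text))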
For the vector index (which gives the system the flexibility to understand natural language queries), we use state-of-the-art prefix-aware embedding models trained with contrastive loss. Optionally, the system can be configured to make multiple passes over each document at different granularities to capture broad context as well as fine details. We also supplement the vector search with a keyword-based BM25 index + N-grams so that the system performs well even in low-data domains. Additionally, we’ve added learning from feedback and time-based decay; see our custom ranking function (https://github.com/danswer-ai/danswer/blob/main/backend/dans... – this flexibility is why we love Vespa as a vector DB).
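To show what “prefix-aware” means in practice: E5-style embedding models are trained so that queries and passages carry different text prefixes, which keeps the two embedding spaces aligned. A minimal sketch, assuming an off-the-shelf E5 model (the exact model we ship may differ):

    from sentence_transformers import SentenceTransformer

    # E5-family models expect a task prefix on every input.
    model = SentenceTransformer("intfloat/e5-base-v2")

    chunk_texts = [
        "API keys can be rotated from the admin panel under Settings.",
        "The on-call rotation is documented in the infra runbook.",
    ]

    query_emb = model.encode("query: how do I rotate an API key?", normalize_embeddings=True)
    chunk_embs = model.encode(["passage: " + t for t in chunk_texts], normalize_embeddings=True)

    # With normalized vectors, cosine similarity reduces to a dot product.
    scores = chunk_embs @ query_emb
    print(scores)  # the first chunk should score highest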
At query time, we preprocess the query with query augmentation and contextual rephrasing, as well as standard techniques like stopword removal and lemmatization. Once the top documents are retrieved, we ask a smaller LLM to decide which of the chunks are “useful for answering the query” (this is something we haven’t seen much of elsewhere, but our tests have shown it to be one of the biggest drivers of both precision and recall). Finally, the most relevant passages are passed to the LLM along with the user query and chat history to produce the final answer. We post-process by checking guardrails and extracting citations to link the user to relevant documents. (https://github.com/danswer-ai/danswer/blob/main/backend/dans...)
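Here’s a rough sketch of that chunk-usefulness filter. The prompt and model choice below are placeholder assumptions meant to show the shape of the call, not our production prompt:

    from openai import OpenAI

    client = OpenAI()

    def chunk_is_useful(query: str, chunk_text: str) -> bool:
        # Ask a small, cheap LLM for a binary judgment on each retrieved chunk.
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[{
                "role": "user",
                "content": (
                    f"Query: {query}\n\nDocument section:\n{chunk_text}\n\n"
                    "Is this section useful for answering the query? Answer YES or NO."
                ),
            }],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

    # Only the approved chunks make it into the final prompt:
    # useful_chunks = [c for c in retrieved_chunks if chunk_is_useful(user_query, c.text)]

Since each judgment is independent, the calls can be parallelized across chunks.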
The vector and keyword indices are both stored locally, and the NLP models run on the same instance (we’ve chosen ones that can run without a GPU). The only exception is that the default generative model is OpenAI’s GPT; however, this can also be swapped out (https://docs.danswer.dev/gen_ai_configs/overview).
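As an illustration of the swap, here is a sketch using litellm’s unified interface; whether requests are routed this way and the exact model strings are assumptions on my part (the linked docs show the real configuration):

    import litellm

    # Default: a hosted OpenAI model.
    resp = litellm.completion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Summarize our refund policy."}],
    )

    # Swapped out: a self-hosted model served via Ollama, so prompts and
    # documents never leave the instance.
    resp = litellm.completion(
        model="ollama/llama2",
        api_base="http://localhost:11434",
        messages=[{"role": "user", "content": "Summarize our refund policy."}],
    )
    print(resp.choices[0].message.content)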
We’ve seen teams use Danswer on problems like: Improving support turnaround times by reducing the time it takes to find relevant documentation; Helping sales teams get customer context instantly by combing through calls and notes; Reducing engineering time lost to answering cross-team questions and to building duplicate features because old tickets or merged code couldn’t be surfaced; Helping on-call engineers resolve critical issues faster by providing the complete history of an error in one place; Self-serve onboarding for new team members who don’t know where to find information.
If you’d like to play around with things locally, check out the quickstart guide here: https://docs.danswer.dev/quickstart. If you already have Docker, you should be able to get things up and running in <15 minutes. And for folks who want a zero-effort way of trying it out or don’t want to self-host, please visit our Cloud: https://www.danswer.ai/
I guess the main problem is the "private" aspect, if I've understood your goals correctly, since most SaaS products lock down private data unless you pay enterprise fees for compliance tooling.
For instance, if you want to ingest data from private Slack channels or Notion groups, you have to get the users in those groups to add your bot to them, otherwise there's no way for your service to get access to the data. It's possible, just bad UX for the users.
That said, built-in search for most SaaS products built after 2015 is generally quite good (e.g. Slack has had an internal learning-to-rank service for a while now, which makes their search excellent: https://slack.engineering/search-at-slack/), so you'd be solving for products like Webex and Confluence, where the internal search is not great. Companies like Google have internal search across products, which is the ideal end state, but they have the benefit of owning the source code for most of their internal products.