Launch HN: Danswer (YC W24) – Open-source AI search and chat over private data
231 points by yuhongsun on Feb 22, 2024 | 129 comments
Hey HN! Chris and Yuhong here from Danswer (https://github.com/danswer-ai/danswer). We’re building an open source and self-hostable ChatGPT-style system that can access your team’s unique knowledge by connecting to 25 of the most common workplace tools (Slack, Google Drive, Jira, etc.). You ask questions in natural language and get back answers based on your team’s documents. Where relevant, answers are backed by citations and links to the exact documents used to generate them.

Quick Demo: https://youtu.be/hqSouur2FXw

Originally Danswer was a side project motivated by a challenge we experienced at work. We noticed that as teams scale, finding the right information becomes more and more challenging. I recall being on call and helping a customer recover from a mission critical failure but the error was related to some obscure legacy feature I had never used. For most projects, a simple question to ChatGPT would have solved it; but in this moment, ChatGPT was completely clueless without additional context (which I also couldn’t find).

We believe that within a few years, every org will be using team-specific knowledge assistants. We also understand that teams don’t want to tell us their secrets and not every team has the budget for yet another SaaS solution, so we open-sourced the project. It is just a set of containers that can be deployed on any cloud or on-premise. All of the data is processed and persisted on that same instance. Some teams have even opted to self-host open-source LLMs to truly airgap the system.

I also want to share a bit about the actual design of the system (https://docs.danswer.dev/system_overview). If you have questions about any parts of the flow such as the model choice, hyperparameters, prompting, etc. we’re happy to go into more depth in the comments.

The system revolves around a custom Retrieval Augmented Generation (RAG) pipeline we’ve built. During indexing time (we pull documents from connected sources every 10 minutes), documents are chunked and indexed into hybrid keyword+vector indices (https://github.com/danswer-ai/danswer/blob/main/backend/dans...).

For the vector index (which gives the system the flexibility to understand natural language queries), we use state-of-the-art prefix-aware embedding models trained with contrastive loss. Optionally the system can be configured to go over each doc with multiple passes of different granularity to capture wide context vs fine details. We also supplement the vector search with a keyword-based BM25 index + N-grams so that the system performs well even in low-data domains. Additionally we’ve added in learning from feedback and time-based decay—see our custom ranking function (https://github.com/danswer-ai/danswer/blob/main/backend/dans... – this flexibility is why we love Vespa as a Vector DB).
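
To make the ranking idea concrete, here's a toy sketch of what a hybrid score with time decay and feedback boosting could look like (the blend weight, half-life, and boost handling are illustrative placeholders; the real function is expressed as a Vespa ranking profile):

  import time

  def hybrid_score(vector_sim: float, bm25_score: float, doc_updated_at: float,
                   feedback_boost: float, alpha: float = 0.6,
                   half_life_days: float = 180.0) -> float:
      # Blend the semantic (vector) and keyword (BM25) signals.
      base = alpha * vector_sim + (1 - alpha) * bm25_score
      # Time-based decay: a document loses half its weight every half_life_days.
      age_days = max(0.0, (time.time() - doc_updated_at) / 86400)
      decay = 0.5 ** (age_days / half_life_days)
      # Learning from feedback: upvoted docs get a positive boost, downvoted a negative one.
      return base * decay * (1.0 + feedback_boost)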

At query time, we preprocess the query with query augmentation and contextual rephrasing, as well as standard techniques like stopword removal and lemmatization. Once the top documents are retrieved, we ask a smaller LLM to decide which of the chunks are “useful for answering the query” (this is something we haven’t seen much of elsewhere, but our tests have shown it to be one of the biggest drivers for both precision and recall). Finally, the most relevant passages are passed to the LLM along with the user query and chat history to produce the final answer. We post-process by checking guardrails and extracting citations to link the user to relevant documents. (https://github.com/danswer-ai/danswer/blob/main/backend/dans...)
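
For the keyword side, the standard preprocessing can be as simple as the sketch below (toy code; the query augmentation and rephrasing steps use an LLM and aren't shown):

  import re
  import nltk
  from nltk.corpus import stopwords
  from nltk.stem import WordNetLemmatizer

  nltk.download("stopwords", quiet=True)
  nltk.download("wordnet", quiet=True)

  STOPWORDS = set(stopwords.words("english"))
  LEMMATIZER = WordNetLemmatizer()

  def preprocess_keyword_query(query: str) -> list[str]:
      # Lowercase, tokenize, drop stopwords, and lemmatize for the BM25 index.
      # The untouched natural-language query still goes to the embedding model.
      tokens = re.findall(r"[a-z0-9]+", query.lower())
      return [LEMMATIZER.lemmatize(tok) for tok in tokens if tok not in STOPWORDS]

  preprocess_keyword_query("Why are the staging deployments failing?")
  # -> ['staging', 'deployment', 'failing']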

The Vector and Keyword indices are both stored locally and the NLP models run on the same instance (we’ve chosen ones that can run without GPU). The only exception is that the default Generative model is OpenAI’s GPT, however this can also be swapped out (https://docs.danswer.dev/gen_ai_configs/overview).

We’ve seen teams use Danswer on problems like: Improving turnaround times for support by reducing time taken to find relevant documentation; Helping sales teams get customer context instantly by combing through calls and notes; Reducing lost engineering time from answering cross-team questions, building duplicate features due to inability to surface old tickets or code merges, and helping on-calls resolve critical issues faster by providing the complete history on an error in one place; Self-serve onboarding for new members who don’t know where to find information.

If you’d like to play around with things locally, check out the quickstart guide here: https://docs.danswer.dev/quickstart. If you already have Docker, you should be able to get things up and running in <15 minutes. And for folks who want a zero-effort way of trying it out or don’t want to self-host, please visit our Cloud: https://www.danswer.ai/




Good luck folks! I'm glad there are projects trying to solve enterprise search.

I guess the main problem is the "private" aspect, if I've understood your goals correctly, since most SaaS products lock down private data unless you pay enterprise fees for compliance tooling.

For instance, if you want to ingest data from private Slack channels or Notion groups, you have to get the users in those groups to add your bot to them, otherwise there's no way of your service getting access to the data. It's possible, just a bad UX for users.

That said, built-in search for most SaaS products built after 2015 is generally quite good (e.g. Slack has had an internal Learning to Rank service for a while now, which makes their search excellent: https://slack.engineering/search-at-slack/), so you'd be solving for companies like Webex and Confluence where their internal search is not great. At companies like Google they have internal search across products, which is the ideal end state, but they have the benefit that they own the source code for most of their internal products.


I think search is the wrong lens to look at it through. Yes, finding relevant information quickly is important, but the key to enterprise search tools is getting a holistic view of any topic. A typical enterprise has a lot of silos (12 of the top 15 enterprise apps on G2 are addressing this problem) and the flow of information across them doesn't exist. Any enterprise search tool helps in aggregating and triangulating conversations/knowledge from various sources, helping any user get a sense of 1/ what is the current state of this topic? 2/ how did they get to this state? (you would have guessed it, you need a knowledge graph for the product to work well)

(Disclaimer: we started with enterprise search too, and now we think a custom model is a better way to get to those goals.) Also, the output needs to be integrated into their own workflows. E.g. seeing a conversation on Slack and using a bot to fetch all the supplemental information needed to understand the context from, say, mail, docs etc. is mighty useful. Search using RAG is the easy part; the hard part is contextualizing it in a way that is immediately useful. That depends on understanding the company/domain lingo, understanding users, etc.

The private aspect is bad UX but not a bottleneck. Remember, this is targeted towards power users looking to use it on a daily basis. Connecting the bot once to get started is fine as long as it adds value. If anyone offers data governance along with it (updating access on a daily basis and answering only from what the user has access to), it could be a huge hit.


Exactly, there is a huge amount of value in being able to quickly get a holistic view of topics. Most topics don't exist in one isolated tool - most often the official discussions/designs live in one place, customer interactions with the topic happen in a separate channel, and then there are one-off small conversations about the topic in chats like Slack. So isolated, tool-specific searches are great for finding specific documents, but less useful for getting actionable insights.

Regarding contextualizing: we're currently working on organizational understanding and we're very excited about this one! We're embedding users based on the documents they authored or interacted with, the questions they have asked, descriptions of projects they worked on, and the org chart. The thinking is that there will always be questions that can't fully be answered via documentation alone. But in those cases we'll be able to recommend someone who might know. It also has the benefit of contextualizing the user asking so that we can surface more relevant results for them.


>> the hard part is contextualizing it in a way it is immediately useful. That depends on understanding the company/domain lingo, understanding users, etc.

How do you control this deterministically? It sounds like the "hard part" is variation in prompting & selectively choosing the right data to include, both of which I could see being good enough right now but hard to deliver definitively.


Being able to filter down the data deterministically is a big value add, especially as the number of documents scale into the range of multiple millions. We have filters by document-set, tags, time range, source type (ie. only include Slack + Google Drive, or Confluence + Jira + Gong, etc.)

The challenge is with the non-deterministic portions of the flow as you pointed out. Ensuring retrieval quality in out-of-domain datasets, guardrailing the LLM generation, working with conflicting or deprecated information are some of the interesting areas we're addressing. Happy to dive deeper on any aspect you're curious about, and I'm sure we can learn from the discussion as well.


I think I understand your concern but if I miss the point, please follow up!

So regarding getting access to read knowledge from the different tools, it depends tool by tool but a lot of them have API keys or options for app integrations available in the free tier (GitHub, Google Drive, Confluence come to mind). Other tools don't have a free tier and you just get access to the API keys as a part of paying for the service. I think there are probably tools that require a premium fee to get integration access but I'm not aware of any personally.

For the SlackBot, it can add itself to public channels but for private channels someone needs to add it. It is what it is sadly.

About search being available for most SaaS products: SaaS tools are definitely improving their own searches. But I still think a single place to search and aggregate data has significant value. For example, as an engineer by training, often getting the full picture for some customer escalation includes reading Slack threads, Confluence Design docs, old Pull Requests on GitHub. Would be nice to get it all in one place.


> It is what it is sadly.

This is what I mean -- previously I built a similar search engine on top of slack, notion, etc., but didn't launch the product because I thought that requiring users to constantly add bots to private channels would be a subpar experience. I thought this would be a blocker for good UX, so didn't go further, but maybe you'll find a nice solution!

Searching over public internal data is addressed by a few existing tools, but it's the private aspect which is pretty difficult to handle and disastrous to get wrong when managed ad-hoc - e.g. someone accidentally adds the bot to a private slack group called #layoffs :) so you'd want this handled properly and centrally.

I guess you'll also need to handle privacy well, ~maybe it's OK when run as a SaaS for db admins to have access to ingested data, but if it's OSS then the people that run it probably shouldn't be able to read the private data that's ingested, so now you need to handle search over encrypted data, which is a fun problem :D


Access control is a non-glamorous but critical piece of what we're building. We're currently implementing automatic access syncing for a few sources like Google Drive, Confluence, Jira, and Notion to start. By matching document access in the source to users and groups, and then to emails, we can finally map Danswer users to document-level access. So someone searching in Danswer will only get results from the set of documents they have access to in the source tool.

For Slack it would look something like: get the users in the Slack channel, map those Slack users to users in Danswer. Then only those users in Danswer will be able to get results from that channel.
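
Roughly, the sketch below shows what that Slack mapping could look like (simplified: no pagination handling, and the chunk filtering function is a stand-in for how results are actually filtered inside the index):

  from slack_sdk import WebClient

  def channel_member_emails(client: WebClient, channel_id: str) -> set[str]:
      # Resolve the members of a Slack channel to their email addresses.
      member_ids = client.conversations_members(channel=channel_id)["members"]
      emails = set()
      for user_id in member_ids:
          profile = client.users_info(user=user_id)["user"]["profile"]
          if profile.get("email"):
              emails.add(profile["email"])
      return emails

  def filter_chunks_for_user(chunks: list[dict], user_email: str,
                             channel_acl: dict[str, set[str]]) -> list[dict]:
      # Only surface chunks whose source channel includes the asking user.
      return [c for c in chunks if user_email in channel_acl.get(c["channel_id"], set())]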


> ~maybe it's OK when run as a SaaS for db admins to have access to ingested data, but if it's OSS then the people that run it probably shouldn't be able to read the private data that's ingested

I don't understand the distinction here. If Danswer runs a SaaS version then yes I agree they can have a license agreement that lets their DB Admins see data in some cases which is fine. That seems an orthogonal issue to if a company is running the OSS version internally, in which case presumably their administrator can see all docs (but software administrators usually can do this anyway).


Yep, this is exactly correct! For our SaaS version, we do have an agreement which allows us to look at data if needed to debug issues and/or improve search performance.

For self-hosted deployments, usually a select few admins who have set up the plumbing on AWS do have access (but as nl has mentioned, these people usually have superuser access on the tools we connect to anyway, so this is a noop).


I like open source software. I like what you are doing and keeping development open.

I have been closely watching AI development. There are 10k+ apps now using AI. Every major company - FAANG through Tier-2/3/4/5 - now has AI as a top priority. However, something real has got to come out of all this wrapper software. I have not read the docs entirely yet. I have a few questions for you that might give us an idea whether this fits our use case.

1. Which models are you using for this? Can I switch models to open source?

2. When you say connect to Apps, how often are you pulling data from these apps? For example, you connect to confluence where tens of wikis get updated. How much of that ends up in your vector DB?

3. Most important, what separates you from tens of other providers out there? Glean, as someone commented, is very similar to what you are doing.

4. How do you plan to convince SMBs and mid-size companies to use you over say in-house development?

5. OpenAI, Mistral, Claude and other LLM model developers can build this functionality natively into their offerings. Are you concerned about becoming obsolete or losing competitive ground? If not, why?

Either way, this is a good direction. I will try it out tonight. Feel free to respond when you get a chance.


Hello, thanks for the kind words! With regards to your questions:

1. Are you referring to the local NLP models or the LLM? The local models are already open source models or ones we've trained ourselves. If you're talking about the LLM, the default is OpenAI but it's easy to configure other ones without any code changes.

2. Most sources are pulled from every 10 minutes. They have incremental updates, so if you have a Confluence with a million pages, probably only a dozen or so have been updated in the last 10 minutes. The only exception is websites, which are crawled once a day (they are crawled recursively, so we don't know which pages have been updated before we try).
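
A connector run is conceptually just the loop below (a simplified sketch; fetch_updated_since stands in for whatever "list documents modified between X and Y" call each source exposes):

  import time

  POLL_INTERVAL_SECONDS = 600  # most sources are re-polled roughly every 10 minutes

  def run_connector(fetch_updated_since, index_documents, last_successful_run=None):
      # Incremental polling: the first run pulls everything; every later run only
      # pulls documents modified since the previous successful run.
      while True:
          window_start = last_successful_run or 0.0
          window_end = time.time()
          for doc_batch in fetch_updated_since(window_start, window_end):
              index_documents(doc_batch)
          last_successful_run = window_end
          time.sleep(POLL_INTERVAL_SECONDS)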

3. Glean is indeed similar. Without going into the features in detail, we are an open source Glean with more of an emphasis on LLMs and Chat.

4. There's generally not a great reason to build from scratch if an open source alternative with 75%+ alignment exists. They can always build on top of us if they want. A lot of teams reach out to us because they were looking to switch from their in-house solution to Danswer. Generally though these are larger teams; we haven't seen many SMBs building RAG for their own usage - usually the smaller teams building RAG are looking to productize.

5. Currently there is no cheap and fast way to fine-tune LLMs every time a document is updated. If you want an LLM to remember the document that was just updated, you'd have to augment it with at least dozens of similar (but all correct) examples. RAG is still the only viable option. Then there is the problem of security etc., since you can't enforce user roles at the LLM level. So companies that focus on building LLMs don't really compete in this specific space, and they don't want to either, as they're trying to build AGI. There is more of a threat from teams like Microsoft and Google, who are indeed trying to build knowledge assistants for their product lines, but we think there is a world where open source ends up winning against the giants!


how is it "chat over private data" if you are exposing my data to more parties like openai? I thought you were using a stack of self hosted open weight LLMs etc. If I can send it elsewhere, it is not private data.


So private refers to two things here, sorry for any confusion.

When we say "chat over private data" we mean that this data isn't publicly available and no LLMs have this knowledge in their training. Meaning that with our system you can now ask questions about team specific knowledge. For example, you can ask questions like "What features did customer X ask about in our last call". Obviously if you ask ChatGPT this, it will have no idea.

The other part is data privacy when using the system. The software can be plugged into most LLM providers or locally running LLMs. So if your team doesn't trust OpenAI but instead has a relationship with, say, Azure or GCP, you can just plug into one of those instead. Alternatively, a lot of users recently have been setting up Danswer with locally running LLMs with tools like Ollama. In that case, you now have a truly airgapped system where no data is ever going outwards.


This is really nice. Congratulations for launching.

Just in the last 2-3 weeks I have had talks with enterprise companies regarding this topic. It seems to be on every CEO's agenda. I have talked to a couple of startups who wanted to do a similar thing to you. But they all feared Microsoft Copilot is not beatable. So they don't even try.


>But they all feared Microsoft Copilot is not beatable. So they don't even try.

The thing is, you don't need to beat copilot. Copilot may well be the worse system and Microsoft can still win because they offer the most enterprise-y solution. I wouldn't be worried about competing with them on functionality. But I also would never even try to outdo them on a business level.


I definitely see what you mean. They also have advantages in bundling Copilot with their other offerings. This will by no means be an easy battle for us, but we have hope that we'll be able to build something people love and end up using!


This is also another reason why we think OSS is the way to go here. Taking on the tech giants alone is definitely a daunting task (maybe even impossible for a small isolated team).

The hope is that by working with the community, we'll be able to incorporate the best ideas and contributions from a large pool of like-minded people to build something everyone can benefit from!

OSS has absolutely taken off in the NLP space and the excitement has bled over to developer platforms, resulting in some outstanding projects; hopefully the same will happen with LLM applications.


What are the most exciting projects you have seen in this space so far?


I think the explosion in interest and all the new software around GenAI owe their success to NLP advancements (coming from an NLP background myself, I may be biased though). I think the best projects are those pushing the frontier of NLP and sharing the learnings with the world:

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 - Mixture of Experts model like GPT-4, but this one is open source and doesn't cost an arm and a leg to run.

https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1 - This is a very recent reranker, we will probably try it out. It's really nice that it's open source and small enough to run on CPU. I think an overall trend of the space in the next years will be to maintain/improve performance while shrinking the cost and hardware requirements.


Does the GitHub connector imply the ability to ask questions of an entire code base, such as: “help me write an endpoint in the style of our codebase”?

I believe there are a variety of projects focused on that problem but would be good to latch onto one that handles and integrates externalities.

Speaking of which I did not see sentry in the list of connections or mentioned in an issue. Any plans there?


Code search is something we have in our sights in the next couple months. Currently the GitHub connector pulls in PRs and Issues but not the whole code base. We wanted to have the best RAG pipeline so we deep dived on that. Code search uses a combination of graph based traversals and a different type of embedding so it's a separate effort, but we will definitely build it out since it's immensely useful to engineering teams!


I would hope you just partner with the open source Bloop AI for that :)


Ah thanks for pointing it out, it looks like an awesome project! We love open source, so it's nice to see others going in this direction as well.


So the intent of the GitHub/GitLab connector in the current iteration is to help people easily find implementations that have been done before. For example, if a new bug comes up with some feature, I can easily search for the PRs relating to that feature in natural language and filter down the code changes that may have caused it.


How do you prevent "how much do my colleagues make?" questions from being answered to the wrong people? I know you mention citations and the ability to backtrack a "fact" to the source document. How robust is this?


So we're leaning on access control to do this! Right now we support manually configured group-based access at the connector level (e.g. Users X, Y, and Z make up the `Engineering` group and that group should have access to Folders A, B, C in Google Drive + Github).

We're also in the process of adding the ability to sync permissions from sources. For example, with this in place you would only be able to chat with / search over documents in Google Drive that you have access to. Since everything is RAG based rather than any fine-tuning, this will guarantee that someone asking "how much do my colleagues make?" will not get an answer UNLESS they already have access to the document that has this info (in which case, it shouldn't be a problem :D)


Just wanted to drop in and share some love for Danswer. We've been using it as our go-to doc repository for a bit now, and it's been a game-changer for us. Not naming names, but let's just say it's powering some pretty key projects.

What really blows my mind is the LLM capabilities. We're pulling out some seriously amazing answers from our knowledge base, making info retrieval a breeze. And yeah, I've been hands-on with it for the last 3 months, even tinkered around developing some custom connectors. It's been a fun ride.

The team behind it? Absolute champs. Super responsive, and the product itself is rock solid. It's not every day you come across a tool that genuinely makes your workflow smoother and smarter. Thanks to them I also discovered Vespa, and it's now our de facto embeddings DB.

Big shoutout to the Danswer team. Keep up the awesome work!


Thanks for the kind words! There's nothing more motivating in the world than hearing that our users are loving what we've built!


Your methodology is nice.

Is there any work or planned work around enterprise authentication and access? For instance, indexing SharePoint in such a way that a user of Danswer isn't exposed to SharePoint information they wouldn't otherwise have access to?


Thank you!

Yes, there are several options for user authentication (Basic Auth, Google OAuth, OIDC, SAML).

Currently the RBAC is managed via Danswer and this controls who has access to which documents (it's done at the connector level, as it would be untenable to assign access to documents individually).

We're also working on automatically syncing permissions from the sources. Basically seeing which emails have access to each doc and mapping that to the Danswer users.


I noticed the Google Drive connector includes Sheets. It looks like for the time being, these just get indexed as CSV files. That seems like it would miss a lot of context since a good number of spreadsheets aren't structured as a simple table. I'm wondering if you have any plans to make spreadsheet indexing more useful going forward.


Ya, handling spreadsheets is a beast of its own. We have this simple implementation to cover easy cases for folks but likely it will need its own more involved pipeline for indexing, retrieval, and interactions with the LLM.

Currently with large tables, it's not handled very well either. The more complete approach would be to pass the headers to the LLM and ask it to generate a formula to parse the data rather than feeding the whole table(s) to the LLM directly.

Some of the bigger items we want to target that require special flows are code search, SQL tables, and Excel/CSVs/TSVs


For most of the files I have, I'm most interested in finding graphs and then updating the relevant data.

It looks like the best way to do that is to understand the OOXML format in xlsx. It's all fairly easy to understand.


Ya, parsing the file is generally not bad at all. The problem comes with the fact that LLMs are notoriously bad with numbers and formatted data. So the current approach of passing relevant information to the LLM and asking it to generate answers will produce misleading information when larger tables are passed in.

By asking the LLM to generate a formula though, it doesn't actually need to do any number crunching of its own which makes solving the challenge a bit more reliable when it comes to LLMs.
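
As a rough illustration of the formula-generation idea (hypothetical code, not what ships in Danswer today; the unsandboxed eval is only acceptable for a local experiment):

  import pandas as pd
  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  def answer_table_question(df: pd.DataFrame, question: str) -> str:
      # Show the LLM only the headers and a few sample rows, then ask it for a
      # pandas expression; the actual number crunching happens locally.
      preview = df.head(5).to_csv(index=False)
      prompt = (
          f"Columns and sample rows:\n{preview}\n"
          f"Question: {question}\n"
          "Reply with a single pandas expression over a DataFrame named df, nothing else."
      )
      response = client.chat.completions.create(
          model="gpt-3.5-turbo",
          messages=[{"role": "user", "content": prompt}],
      )
      expression = response.choices[0].message.content.strip()
      # Evaluating model output is unsafe outside a sandbox; demo use only.
      return str(eval(expression, {"df": df, "pd": pd}))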


I've limited my expectations for LLM support to file management: locating relevant datasets or filing things away. The interactive QA just doesn't seem salient beyond some high level.


A lot of people are very bullish on AI, it's very interesting to hear the opposite side. My opinion is that LLMs are very powerful at digesting and distilling knowledge which is why we built this project. I also think that LLMs are terrible reasoning engines and so agent-flows are not quite ready for primetime.

Would love to hear your perspective on the space!


I certainly see the value of large document retrieval and various forms of search.

However, what seems to be the business proposition is giving managers shallow access to documents, which won't lead to rigorous information.

There are a few middle grounds where it can yield insights, like regulatory scenarios where you want to understand how public orgs satisfy permits with written plans.

However, the thing I don't believe will be solved is context size. When I want to explore my knowledge base, I need far more than 128k, and there are several orders of structure that language itself is not going to bridge.


Ya, a lot of knowledge is locked away in metadata, how documents are structured, in non-textual representations, or sometimes even just in the head of experts. We definitely want to be more than a shallow access to documents and we're building with that in mind. We're currently working to include more metadata and organizational understanding, with plans to tackle OCR, NL-to-SQL, code search and knowledge graphs in the future. While we don't have concrete dates for all of these, hopefully this gives some visibility into our vision at least!

Regarding the context size, the research community is doing some stunning work. If you're interested, you should check out the new Mamba (Transformers replacement) architecture: https://arxiv.org/abs/2312.00752


I think the biggest problem I've run into with company documentation is that the relevant docs either don't exist at all or are woefully out of date. Sure, there might be a procedure doc that spelled out how to handle a particular type of issue, but it has probably been updated three times in Slack DMs and twice on Zoom calls since the doc was actually written. And maybe at least once, the company has "declared bankruptcy" on, say, Confluence, and half-converted a few things over to Notion.

If Danswer could surface the "official" answer, such as it is, but then also note how old the authoritative doc is, and which teams have been messaging about it and/or scheduling calls with related names more recently, and then tell me who's responsible on that team for knowing the actual procedure, I'd never need another tool!


The truth will be in the code!


We definitely recommend trying it out!

As far as how outdated information is handled - we pass the most relevant documents along with metadata to the LLM. So in the case you mentioned, the LLM will be provided the procedure doc and the time it was updated, the relevant Slack messages and the times they were sent, and the call transcripts along with when the call happened. The LLM tends to handle this pretty well.

Additionally, during the retrieval phase, there's a time based decay applied based on the last time the document was updated. Also there is learning from feedback so users can upvote documents that are useful and downvote documents that are deprecated.


Hello, congratulations. Danswer looks really interesting and the name is simply great. We are building something similar (internal enterprise search using LLMs) and I am wondering whether we should jump to the Danswer codebase. I have a couple of questions, if you could answer them:

- How would you compare Danswer and privateGPT? Do you see it as a direct competitor?

- You posted below that you have not used LlamaHub connectors because they do not allow incremental updates? Can you maybe elaborate on that, with examples?

- Instead of pulling data from different sources, did you consider a knowledge graph approach (push) where data and the vector index would live together in a single graph database? What would be the advantages of your approach?

- You posted below that you had to implement a custom search. Could that possibly be avoided (with a different architecture)?

Thanks and best


Thanks for the kind words! Sorry for the delayed response, this post drew a lot more interest than anticipated and we've been swamped working with new folks coming in.

Regarding:

- PrivateGPT: they're for individual use where you ingest your own data. We're for teams to use, with access controls, connectors to typical business SaaS tools, working at scale (incremental updates and scalable container architecture), different user roles, etc. So basically I see very little overlap between the two projects, at least in terms of "competition"; we're just different.

- LlamaHub: Our connectors pull all of the documents in the first run, then every following run, it only pulls in documents that have changed since the last run. For large teams, the first run may take many hours but following that, it will only take seconds each time. Without this, it becomes untenable to keep all information up to date. Also we pull in additional metadata and permissions, which not all LlamaHub connectors support.

- Push flow for indexing: Yes, we also have APIs that you can push to for indexing documents. For event based "push", we didn't go that route because most tools we connect to don't support this.

- Knowledge Graph: We will certainly be building this, it is only a matter of prioritization and timelines.

- Replacing the custom search: We think our search is much better than a basic RAG pipeline out of the box that someone can get from Langchain/LlamaIndex. What's the motivation for wanting to remove/replace it?


Great stuff!

I've spent the last 6 months doing fullstack development for a very similar app at my work. Me and the ML engineer on my team are always joking that something like Danswer is going to come around sooner or later to replace what we're building. sad lols

The concept of team-specific knowledge assistants is very hot in our org (which is a gov org). We have HEAPS of legacy and current data that employees and consultants need to comb through to write up documents.


"We were building something similar in house, but then we found Danswer". That's something we keep hearing all the time :D

(1) We handle chunking, embedding, building the keyword search index, etc. all on our end! We use Vespa as both our Vector DB and Search Engine (it allows a custom hybrid search, which we've found to perform really well). So no, you would not need to bring your own Vector DB - everything needed to run the system is dockerized and managed by Docker Compose / Kubernetes (whichever you choose).

(2) It would be straightforward! We both offer an API to ingest documents into the system as well as a pretty simple connector interface that you can implement to add your own custom connector (https://github.com/danswer-ai/danswer/blob/main/backend/dans...)


My only other concern is that we want to control role-based access, so our users can log in with org Azure AD accounts. And we want to have project / document context lenses for the AI chatbot which are available only to specified users.


We do have support for IdPs like Azure AD + role based access control (these aren't in the MIT version of Danswer though :sweat:).

If you're interested in learning more, would love to chat through the details on Slack / Discord.


Wow! you jumped onto those questions before I read more and deleted them. Thanks :)


Congrats on launching guys.

When I was in consulting, being able to search our internal documents was absurdly painful.

We had all this data from client teams spanning decades and continents, but I could never find what I needed. It was all locked up in these silly PowerPoint files -- not even necessarily PDFs. I'd literally spend all night sometimes clicking through pages by hand.

Corporate knowledge management is an absolutely immense problem, and I'm thrilled to see you tackling it.


Ya, there's definitely a common thread there! A lot of consulting firms have given us very positive feedback from using Danswer. I think the nature of the time-framed projects and frequent scope changes mean that people have to always take in new information and a lot of documented knowledge is lost or at least difficult to find.

A great use case for sure!


You’re amazing for doing such a detailed, OSS-first announcement — you do an honor to ycombinator and HN by living up to the old school ideals IMO. My only complaint: more whimsy and tell us what Danswer means! Literally this whole time I was thinking “huh maybe it’s Docker… or just some dude named Dan…”. For the slow like me, it’s dance + answer

Will definitely be checking it out, Dan or no :)

  Optionally the system can be configured to go over each doc with multiple passes of different granularity to capture wide context vs fine details.
Just curious, is the granularity able to itself be controlled by LLMs, or does that refer to traditional content-blind sliding windows? I’m big on Minsky’s idea of Frames[1] which emphasize understanding the same input in multiple instrumental contexts/frameworks/backgrounds. Something you’ve thought of?

[1] https://web.media.mit.edu/~minsky/papers/Frames/frames.html


The name was intended to be (1) Danswer -> Deep (learning) Answer and (2) Danswer -> Dancer (thus the logo). Although we've heard quite a ton of different interpretations (the "dude named Dan" is quite popular, as is "Danswer -> The Answer").

The granularity is currently hard-coded (e.g. 512 tokens default chunk size, augmented with passes of 128 tokens). I have not heard of the idea of Frames, but it's interesting. Thanks for sharing - I'll probably do a deep dive sometime this weekend.
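
In rough Python terms, the multi-pass idea looks something like this (simplified; the real chunking is token-aware and keeps section/metadata context):

  def split(tokens: list[str], size: int) -> list[list[str]]:
      return [tokens[i:i + size] for i in range(0, len(tokens), size)]

  def multi_granularity_chunks(tokens: list[str], coarse: int = 512, fine: int = 128):
      # Each coarse chunk keeps wide context; its fine "mini-chunks" capture details.
      # Both are embedded and point back to the same coarse chunk so results can be
      # deduplicated at query time.
      entries = []
      for idx, chunk in enumerate(split(tokens, coarse)):
          entries.append({"chunk_id": idx, "granularity": "coarse", "tokens": chunk})
          for mini in split(chunk, fine):
              entries.append({"chunk_id": idx, "granularity": "fine", "tokens": mini})
      return entries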


lol great answer, thanks. I think it’s a powerful, simple idea that is too little used - for example, temporarily transforming a markdown document into an outline, an amateur summary, and a detailed outline, or transforming a source file into comments, explanations, code, specific classes/functions, etc.

I say this with a hint of irony, knowing that I’m just some kid and you’re launching an awesome AI product right now, but I highly recommend looking back at mainstream ai from before it all started actually working, namely the classic book Artificial Intelligence: A Modern Approach. Just because they lacked the tools doesn’t mean there wasn’t serious, detailed thought put into what we should do in the technological situation we now find ourselves in. For that reason, this is (AFAIK) the main textbook taught in US grad schools for AI survey courses. Frames would be part of chapter 10, I’m guessing.

https://aima.cs.berkeley.edu/


Thanks so much for sharing! Excited to look into it!


Really like the idea! The company I work for has recently contracted with Glean (glean.com), which seems to serve the same purpose, but imo the killer feature they lack is being able to work collaboratively with the AI to produce an answer by enabling the human operator to explicitly scope down the context of a chat to specific documents and then converse with the document in question.

Sometimes you know roughly where the data you're looking for exists but the artifact containing the information is extremely dense to interpret. For example, a runbook for a system could span 10s to 100s of pages and to actually accomplish what you want means interpreting and joining information from different sections of the same document. It seems like there's potential here to allow an expert to define explicit scope of what it needs to search and then include information in the context as wide or as narrow as the question requires.


Sounds like Danswer might just fit your needs. Also ya, we came up with this idea of chatting with documents that you can select on the fly; I think we're still the only ones who do this. People have really been liking that one!

If you happen to want to talk to us about Danswer, we'd love to welcome you to our Slack: https://join.slack.com/t/danswer/shared_invite/zt-2afut44lv-...


Can I connect it to any OpenAI REST-compatible LLM? E.g. having my own LLM on premises, served via an Ollama OpenAI-compatible REST endpoint?


Yes absolutely! We actually have a doc specifically for Ollama: https://docs.danswer.dev/gen_ai_configs/ollama


The integration part (connectors) is the key here. I can see how beneficial this would be for companies as they can plug and play.

Adding the vectorisation locally is superb, I've played around with sbert models before and ability to run without GPU is going to simplify the process a lot.


Ah yes, this reminds me! I forgot to mention it but for the local NLP models that we run, they're in the range of 100 million parameters so they're able to be run on CPU (no GPU required!) with pretty low latency.

Also a fun tidbit on the connectors, more than half of them now are built by open source contributors! We just have an interface that needs to be implemented and people have been able to figure it out generally.


Congrats on the launch! Finally got some time to try this out. Tried this on a couple of personal documents and compared to asking on chatGPT (both 4 and 3.5), so far the results weren't great. For context, the questions needed a little inference and danswer said the info isn't present in the document versus chatGPT which inferred the answer from a related statement.

I do intend to perform a much bigger test around documents but curious to hear thoughts on why this might be the case.


Hi, it may be an indexing issue. There's a precanned "Information not found" message that we show if document retrieval failed. A couple common causes for this are:

- Not provisioning enough resources, so processes are dying (we run NLP models locally so the system isn't totally lightweight)

- Access is not correctly configured, either while pulling in from the source or at the user level during query time.

Assuming nothing is wrong with the setup, it is most likely because of how we prompt the LLM. Questions that require reasoning or are more open ended, generally aren't "safe" for the LLM to answer. So the system is constrained on purpose. For example if someone were to ask: "How can I increase revenue by 30% next quarter". It's not safe for the system to just propose some actions and it's likely better to just search the documents and say there wasn't any answer in the docs (unless of course some doc explicitly states plans for increasing revenue).


Do we have options for developing our own custom connectors? For example, lesser-known apps like ERPNext etc. In other words, for any app not in Danswer, we should be able to create a custom connector and use it.


Yes! And we'd love it if you contribute them back to the project! More than half of the connectors are community contributed at this point and it's by far the most common area of contribution.

There's a simple Document interface that needs to be implemented to provide stuff like title, content, link, etc. From there the rest of the Danswer code takes it and handles the indexing and making it available for querying.
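
Very roughly, a custom connector ends up with a shape like this (names simplified and hypothetical; the guides linked below have the real interface):

  from dataclasses import dataclass
  from collections.abc import Iterator

  @dataclass
  class Document:                      # simplified stand-in for Danswer's document model
      id: str
      title: str
      link: str
      content: str
      updated_at: float

  class ErpNextConnector:              # hypothetical connector for a not-yet-supported app
      def __init__(self, base_url: str, api_key: str):
          self.base_url = base_url
          self.api_key = api_key

      def load_documents(self) -> Iterator[list[Document]]:
          # Yield batches of Documents pulled from the source's API; connectors that
          # support incremental updates also accept a "modified since" timestamp.
          raise NotImplementedError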

There's a contributing guide here: https://github.com/danswer-ai/danswer/blob/main/CONTRIBUTING...

And a connector contributing guide here: https://github.com/danswer-ai/danswer/blob/main/backend/dans...


Congrats on the launch!

In what way is "prefix-aware embedding models trained with contrastive loss" better than the standard embedding model provided by OpenAI?

"added in learning from feedback and time based decay" => Sounds interesting! Have you seen significant gains in precision and recall here?

It looks like you are using NextJS app dir + external backend. Why did you decide against NextJs for frontend and backend? Are you happy with your choice?


OpenAI's models may fit that description as well under the hood. Specifically, for `prefix-aware`, this is useful when you have short passages (e.g. Slack messages) that you are trying to match against short queries (e.g. user questions). Without being prefix-aware, the model can get confused, think both are queries, and cause any short passages to match very strongly with short queries.
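
As a concrete sketch of what prefix-awareness means in practice (the E5 family, which we default to, is trained with explicit "query: " / "passage: " prefixes; the specific checkpoint below is just an example):

  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("intfloat/e5-base-v2")

  # The prefixes tell the encoder whether it's looking at a query or a passage.
  query_emb = model.encode("query: how do I rotate my API key?",
                           normalize_embeddings=True)
  passage_embs = model.encode(
      ["passage: API keys can be rotated from the admin panel under Settings.",
       "passage: Reminder: standup moved to 10am tomorrow."],
      normalize_embeddings=True)

  similarities = passage_embs @ query_emb  # cosine similarity, since vectors are normalized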

For learning from feedback for sure! No exact benchmarks, but we've heard from quite a few users about how useful this is to push high quality docs up and reduce the prevalence of poor docs. This is all very hard to evaluate since there aren't readily available, real-world "corporate tool / knowledge base" datasets out there. We're actually building our own in house right now, so we should have more concrete numbers around these things soon.

For the backend, we do a lot of stuff with local embedding models / cross encoders / tokenization / stemming / stop word removal etc. Python has the most mature ecosystem for this kinda stuff (and the retrieval pipeline is the core of our product), so we don't regret it at all!


I've been planning on building some of this for an internal tool, but now it looks like I don't have to. I'm impressed by the demo, it looks really polished.

I'm particularly surprised by the speed considering all of the pre- and post-processing. I am doing some similar things and that is one of the bottlenecks. I'll dig in, but I'm curious what models you are using for each of these steps.


A lot of teams we talk to switched from an in-house solution to either directly using Danswer or building on top of Danswer. Glad you liked the demo!

We're using E5 base by default but there's an embedding model admin page to choose alternatives. There's also an API for it if you know what you're doing, you can even set one of the billion+ parameter LLM bi-encoders if you want (but you'd need a GPU for sure).


I’m actually giving a presentation tomorrow to my team about a tool I was building to leverage our runbooks (came out of a hackathon) and this just blows my app out of the water. I’m really stoked to give this a try and possibly contribute back by creating a connector for our messaging app. Thank you so much for making this available and for explaining so much about the architecture.


This is one of those posts that makes me feel like I'm in the right place at the right time. Thank you, this is a fantastic piece of tech.


Thanks so much! It's always rewarding to hear people sharing our excitement!


What an amazing tool, I can't wait to implement this in our workflow. How well does this work with documents in languages other than English?


Glad you asked! We are actually the only project (open source or closed), as far as I know, that handles multilingual quite well. We have options to do multilingual query-expansion and also multilingual embedding models.

Without dropping any names, a big French company with 5000 people is actually doing this with Danswer and has found great success.

There is some info here: https://docs.danswer.dev/configuration_guide. You'll also want to change the embedding model in the admin UI.


Are you using anything like LlamaIndex internally or did you write it from scratch without the assistance of a helping wrapper like this?


We use LlamaIndex very sparingly. Specifically the context aware document chunking functionality is via LlamaIndex.

We couldn't use the more involved pipelines because we needed significant custom logic to enforce permissions, filters (like time filter, source filter, document-set filters), and other complexities. At that point, it's easier to write from scratch rather than conform to expectations of these third party libraries.


I was considering starting a project just like this using Llamaindex but I think I'll give yours a try first before going that route. Looks good. Thank you.


I think these developer platforms like LlamaIndex and Langchain are super great for prototyping and understanding the crowdsourced best approaches in solving these new LLM related challenges.

Depending on how custom the pipelines need to be, you'll either find that you've saved a huge amount of time using these libraries, or you'll find that you have no option but to switch off and build from scratch.


That makes sense. It was my fear that I'd start with a wrapper that helps but end with a wrapper that hurts. Good to know that it's still worth experimenting with these for prototyping and discovery. I'm guessing that there will still be components of them that you can use in a product if you don't want to implement something quite specific yourself.


Hi guys! I recently deployed your open source product for a client - great code.

One thing I personally see expanding is document generation. This involves a tailored Q and A generation pipeline, specific RAG and the use of knowledge graphs and ontologies down the line.

I’m curious as to how you are going to move towards that very clear future and therefore stay ahead of Copilot, as an example.


I would suggest putting real time ingestion on the roadmap; that would unlock a lot of new use cases.

Did you use a library to implement the integrations?


Do you mean directly uploading files in the UI and chatting against those files? This one will be done within the next few weeks, it's a very high value item for sure.

Alternatively, if you're talking about real-time indexing to make documents available to everyone immediately, there's an Ingestion API where users can send documents in the expected format directly to the system - is this what you're thinking?

The integrations are built in house, some using client libraries of the particular tool (like the Atlassian python client library for example). We considered using Airbyte, LlamaHub etc. but we found that they don't support the full flexibility that we need, including pulling incremental updates and access permissions.


I mean streaming insertions and indexing using pub/sub or something.


Got it, we considered it but most connectors that we pull from don't support this. Might bring it back for the ones that do, thanks for the suggestion!

It's actually still in the code, just none of the connectors implement it atm: https://github.com/danswer-ai/danswer/blob/main/backend/dans...


Feels like the big missing piece is being able to index an s3 bucket. Is s3 API compatibility on the roadmap?


We have a connector interface and build guide for contributors: https://github.com/danswer-ai/danswer/blob/main/backend/dans...

Should be not too bad to build one out! Fun fact, more than half the connectors were built entirely by community members who needed them for their own teams and we're super grateful when they contribute it back to the repo.


Quick question: do you offer API? I'm hoping to integrate this with an existing chat UI that I have.


Yes, we have an API and a way of accessing it with a generated API key which you can find in the admin panel.

Two things to note though.

The APIs are intended for serving the Danswer frontend. The functionality is generally complete for similar use cases but it's not documented so you have to look at the code.

If you're overusing the API on the cloud without providing your own OpenAI key, we will likely have to shut down the instance to prevent losing too much on inference fees.


This will kill a lot of low-effort, so-called AI startups that just prompt-engineered ChatGPT.

Good work guys!


I have just set this up, pointed it at an internal Google Sites-based knowledge base, and made it available to my team through a Slack bot. It's really nice and easy to get started with and I love the self-hosting support.


Interesting!

I’m curious about the business model. I see more and more YC companies that are FOSS, which is nice.

Why did you choose a FOSS license instead of proprietary with a license fee?

What are your plans for securing funding for further development?

Would be very interesting to hear your thoughts on this.

Congrats on the launch!


We chose FOSS because we think this type of tool will be universal in the near future. We likely won't be able to directly serve millions of teams ourselves by the time it happens but by open sourcing, if teams want it, they can just set it up.

Especially with small teams that have no budget, they can get value and never need to talk to us. We hope to grow with them.

In terms of funding and the economics of it, we have a cloud currently which is paid and we're developing additional features that are not free. For example, identifying experts on the team when the system is unable to find the information to answer the question directly. This is an example of a feature that large teams are likely willing to pay for but small teams won't need (as they know each other intimately).


Ok, cool. Best of luck with Danswer!


Another question. If I host this publicly for people that I work with on their data, how can I make sure that it's only them that can access the service? Do you have any form of auth?


Yes! There is Basic Auth (email + password with email verification) and Google OAuth available in the free version.

We also do OIDC and SAML for integration with Identity Providers (IdPs) like Okta, but that's part of the paid features. Ahhh please don't hate us!


With the free version, can I constrain the emails to be from one domain? i.e. the company domain


Chris here (the other founder) - yes you can! We have a `VALID_EMAIL_DOMAINS` env variable which controls this.

For example, for us we have `VALID_EMAIL_DOMAINS=danswer.ai`.


Awesome, thank you!


Nice to see yet another open source approach to LLM/RAG. For those who do not want to meddle with the complexity of do-it-youself, Vectara (https://vectara.com) provides a RAG-as-a-service approach - pretty helpful if you want to stay away from having to worry about all the details, scalability, security, etc - and just focus on building your RAG application.


What's your moat? FAANG+ are all working on similar products.


We're leaning heavily into the open source aspect. We think that a solution like this will be useful to even smaller teams (10-50) that other companies won't want to target. There is some non-trivial amount of setup required, specifically chasing down the API keys etc. The SaaS alternatives rely heavily on their sales orgs being very hands-on, so it makes no sense for them to target small teams. In fact for many of them, if you try to sign up for a demo and you say you're fewer than 50 people, they straight up ignore you.

So hopefully teams will self-adopt Danswer and as they grow, they will keep using us!

For larger teams, the transparency and peace of mind of self-hosting an open source solution is also a major benefit. We've often heard from large teams that have adopted Danswer that the customizability has been a driving factor in their adoption. They want to own the solution and they want to customize it specifically for their needs. At the very basic level, a lot of teams have swapped in domain-specific embedding models and prompts, but we've seen some significantly more involved customizations as well.


> Once the top documents are retrieved, we ask a smaller LLM to decide which of the chunks are “useful for answering the query”

This sounds like normal re-ranking. How is it different?


Well the most standard approach is to use cross-encoders (e.g. something like Cohere Rerank) to give similarity scores between the query and the chunk, and then use these scores to update the ranking.

Our approach is to use an LLM (gpt-3.5-turbo for example), and to ask it explicitly "Is this chunk <CHUNK> useful for answering this query <QUERY>". We've found that, while certainly a bit more expensive, the larger model size and greater understanding of the world allow this approach to yield significantly better results than the SOTA cross-encoders. It also allows us to ask the model to explain why a chunk is useful, which can be really helpful for the user when deciding whether to look deeper into a document (as opposed to standard keyword-based highlighting, which often isn't very useful for determining if a document actually has useful information for your query).
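
A stripped-down sketch of that step (the prompt wording here is illustrative, not the exact prompt we use in production):

  from openai import OpenAI

  client = OpenAI()

  PROMPT = (
      "Query: {query}\n\nChunk:\n{chunk}\n\n"
      "Briefly explain whether this chunk is useful for answering the query, "
      "then end with a single line containing only USEFUL or NOT USEFUL."
  )

  def filter_useful_chunks(query: str, chunks: list[str], model: str = "gpt-3.5-turbo"):
      # Keep only the chunks the LLM judges useful, along with its explanation,
      # which can be surfaced to the user next to the citation.
      kept = []
      for chunk in chunks:
          response = client.chat.completions.create(
              model=model,
              messages=[{"role": "user", "content": PROMPT.format(query=query, chunk=chunk)}],
          )
          text = response.choices[0].message.content
          verdict = text.strip().splitlines()[-1].strip().upper()
          if verdict == "USEFUL":
              kept.append((chunk, text))
      return kept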


Interesting! Thanks for the explanation.


Two other tidbits on this:

1. There's a difference between relevance and usefulness that cross-encoders cannot capture. Imagine a thread with a bunch of people complaining about an exception and each comment is another mention of the exception. Now imagine another thread with one mention of the exception at the top, and a bunch of people offering solutions. If you query for the exception, LLMs will find the second thread more useful, but cross encoders will find the first one more relevant.

2. LLMs/GenAI models don't output a single value. They can use the tokens they output to "reason" about the usefulness of a doc. Eg. Rerankers are like tiny LLMs that are only allowed to output "yes" or "no", but instead you can use an LLM to do chain-of-thought and finally decide at the end.


We’re building something similar. How do I contact you?


Email: founders@danswer.ai

Slack: https://join.slack.com/t/danswer/shared_invite/zt-2afut44lv-...

Discord: https://discord.gg/TDJ59cGV2X

We hear this a lot - we take it to mean we've found something that people like and need!


How does this compare to something like OpenWebUI?


We have a strong emphasis on the retrieval half of RAG. A big part of the value of Danswer is in connecting to sources like Notion, Linear, GitLab, etc. We do incremental updates to keep data fresh, pull in metadata, etc. We also have features to manage access to documents in Danswer, like RBAC.

Basically, (as I understand it) OpenWebUI is more like a ChatGPT frontend and we're more like a unified search with an emphasis on LLMs.


This is incredible. Congrats on the launch!


Thank you for the kind words!


Congrats on the launch! Don't think anyone has mentioned this yet but that is a fire name! :) Love the pun.


Thank you for the kind words!

Yea, the name has many meanings. It's a fun little puzzle to find them all :)


How does this compare to simply rolling your own OpenAI Assistant (apart from direct integration of Slack etc.)?


So one of the main things we do is automatically sync documents from your team's different sources of knowledge. So all of the data connectors, as well as the user authentication and access systems, would have to be built from scratch if you did your own. Also, if you have more than a few documents, you would have to recreate the RAG pipeline (and ours is fairly involved, so it would be quite some work). Finally there's the UI and other features like learning from feedback, usage analytics, chat history, etc.

If you're just looking to upload a few personal docs into a chat assistant for your own use, probably Danswer is overkill and more complex than the effort is worth. If you're thinking of a team wide use case, then using Danswer makes sense.


That makes a lot of sense. Thank you!


Won't it be better to transform documents into Q&A style instead of "chunking"?


Are you referring to the approach of creating hypothetical questions for embedding along with the document? Or more along the lines of creating summaries of the documents during indexing and embedding those as well?

Either case, the reason we don't do LLM based transformations during indexing time is that we hope our tool can be used by teams with a lot of documents. It becomes prohibitively expensive to run these transformations at indexing time when the scale can be multiple millions of documents.


Congrats on the launch! Curious, what is your business model? You only charge for the cloud option, right?


My question also. I looked through a lot of the posts here and can’t seem to find it.


There's a question by cpach further up on this page which is essentially asking this but also with some additional questions. Hopefully you find that thread useful!

TLDR: There's the cloud version and also there are a set of nice to have features more relevant for larger teams. Those features are not MIT.


I think dust.tt is open source as well. Is there a difference here?


Ya, I haven't dug too deep into their project, to be honest. But we do think it's great more teams are going for open source.

I tried to look up their RAG pipeline to see if they've invested effort into building a strong retrieval. It looks like they have a more basic vector only search, not sure if it's running local models or using a cloud service. On the flip side, it looks like they have assistants that can take actions (which we don't do), so maybe that's their focus.

I can only speculate as I've never spoken with their team but I imagine they're working closely with a few specific teams judging based on their number of connectors. Most likely it was built for specific customers. We are leaning more into the open source and working with our community and letting people self-host easily to find value with no friction.


Congrats on the launch! Does it search within pdf files?


Thanks! Yes, it does do PDFs. We don't do anything fancy with it though like Optical Character Recognition (OCR). So pictures of text, as well as images and graphs will be lost. This is something we will work on though.

Is this something that you would find a lot of value in or is simple text processing of PDFs sufficient?


Not OP, but I would definitely find a lot of value from processing PDFs in such a way that it could eg understand tables and images. I work in mining and having it digest a 43-101 technical report with images and tables would be supremely valuable.

I know that might be a niche case tho.

Absolutely incredible work you’re doing tho wow, I’m very impressed by what you’re doing and the way you’re doing it. Even if you stopped now this is a masterpiece, so while yes I would definitely find a lot of value from being able to process images and graphs/tables, simply being able to process the text and cite it is already a superpower. Thank you for your amazing work!!!


I'd benefit from OCR too. Not just PDFs, but OCR on images could be super useful too.

For a personal use case, I'm thinking things like receipts. For work, I'm thinking OCR on architecture diagrams/etc.


:)


Another RAG chat bot ... YC is pretty unoriginal huh



