Show HN: RAGstack – private ChatGPT for enterprise VPCs, built with Llama 2 (github.com/psychic-api)
84 points by ayanb9440 on July 20, 2023 | 30 comments
Hey hacker news,

We’re the cofounders at Psychic.dev (http://psychic.dev) where we help companies connect LLMs to private data. With the launch of Llama 2, we think it’s finally viable to self-host an internal application that’s on-par with ChatGPT, so we did exactly that and made it an open source project.

We also included a vector DB and API server so you can upload files and connect Llama 2 to your own data.

The RAG in RAGstack stands for Retrieval Augmented Generation, a technique where the capabilities of a large language model (LLM) are augmented by retrieving information from other systems and inserting it into the LLM’s context window via a prompt. This gives LLMs information beyond what was provided in their training data, which is necessary for almost every enterprise application. Examples include data from current web pages, data from SaaS apps like Confluence or Salesforce, and data from documents like sales contracts and PDFs.
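As a rough illustration of the pattern (not RAGstack's actual code), a minimal retrieve-then-prompt loop might look like this; the documents, the `llm` callable, and the choice of the all-MiniLM-L6-v2 embedder are assumptions for the sketch:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Toy in-memory corpus; in RAGstack the documents live in Qdrant instead.
    docs = [
        "Our enterprise support SLA is four business hours.",
        "The Acme sales contract renews every January.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def answer(question, llm):
        # 1. Retrieve: embed the question and find the most similar document.
        q_vec = embedder.encode([question], normalize_embeddings=True)[0]
        context = docs[int(np.argmax(doc_vecs @ q_vec))]
        # 2. Augment: insert the retrieved text into the prompt.
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        # 3. Generate: `llm` is whatever model is deployed (Llama 2, Falcon, GPT4All, ...).
        return llm(prompt)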

RAG works better than fine-tuning the model because it’s cheaper, it’s faster, and it’s more reliable since the provenance of information is attached to each response.

While there are quite a few “chat with your data” apps at this point, most have external dependencies on APIs like OpenAI or Pinecone. RAGstack, on the other hand, only has open-source dependencies and lets you run the entire stack locally or on your cloud provider. This includes:

- Containerizing LLMs like Falcon, Llama 2, and GPT4All with Truss
- Vector search with Qdrant
- File parsing and ingestion with LangChain, PyMuPDF, and Unstructured.io
- Cloud deployment with Terraform
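As a rough sketch of how the ingestion pieces could fit together (the file name and page-level chunking are illustrative assumptions, not RAGstack's exact implementation):

    import fitz  # PyMuPDF
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional vectors
    client = QdrantClient(":memory:")                    # or point at a local Qdrant container

    client.recreate_collection(
        collection_name="docs",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

    # Parse a PDF page by page and upsert each page as one searchable point.
    pdf = fitz.open("sales_contract.pdf")  # illustrative file name
    points = []
    for i, page in enumerate(pdf):
        text = page.get_text()
        points.append(PointStruct(id=i, vector=embedder.encode(text).tolist(), payload={"text": text}))
    client.upsert(collection_name="docs", points=points)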

If you want to dive into it yourself, we also published a couple of tutorials on how to deploy open source LLMs for your organization, and optionally give it access to internal documents without any data ever leaving your VPC.

- How to deploy Llama 2 to Google Cloud (GCP): https://www.psychic.dev/post/how-to-deploy-llama-2-to-google...
- How to connect Llama 2 to your own data using RAGstack: https://www.psychic.dev/post/how-to-self-host-llama-2-and-co...

Let a thousand private corporate oracles bloom!




- Do you have plans to support other connectors, specifically OneDrive?
- Do you have a demo somewhere? From the website and screenshots, it's not clear what functionality you offer. A few-minute screencast would help.
- How do you differentiate yourselves from Quivr? It seems like another open source alternative and has some nice features.

Thanks for this. I will try it out and see how well it works for my use case.


We have about 10 other connectors in a separate project at https://github.com/psychic-api/psychic

Thanks for the feedback! We’ll include a demo soon.


Approximately, what would the hourly cost of running this be on Google Cloud?

>In the default-pool > Nodes tab, set:

>Machine Configuration from General Purpose to GPU

>GPU type: Nvidia T4

>Number of GPUs: 1

>Enable GPU time sharing

>Max shared clients per GPU: 8

>Machine type: n1-standard-4

>Boot disk size: 50 GB

>Enable nodes on spot VMs

Not familiar with GCP, but I see the n1-standard-4 is an instance type that costs about $0.19/hr. Are there any other significant costs to take into account?


Just ran our deployed cluster through GCP's pricing calculator and it's about $300 USD per month with Llama 2
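Rough arithmetic from the numbers in this thread: $300/month over roughly 730 hours works out to about $0.41/hour, i.e. a bit over twice the ~$0.19/hour base n1-standard-4 rate once the GPU is attached; the exact figure depends on region, GPU type, and whether spot VMs are used.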


I see a few nice connectors, but it doesn't seem to support network shares of unstructured data. Not all enterprises host on cloud. Not all use Google Cloud, or will be willing to. Some want it local to their network. This also becomes a lot more interesting with M365 integration (OneDrive, SharePoint, Teams, Loop, etc.) and boring old network share paths or local filesystem sources. I'd love to try it now on at least the two latter, without connectors to SaaS things. I have dozens of terabytes of data to test with. Any plans for these more common sources?


This looks like a great project. Given the costs, I imagine many might want to run on dedicated hardware with GPU - yet:

> GPT4All: When you run locally, RAGstack will download and deploy Nomic AI's gpt4all model, which runs on consumer CPUs.

> Falcon-7b: On the cloud, RAGstack deploys Technology Innovation Institute's falcon-7b model onto a GPU-enabled GKE cluster.

> Llama 2: On the cloud, RAGstack can also deploy the 7B parameter version of Meta's Llama 2 model onto a GPU-enabled GKE cluster.

Why not llama2 on dedicated/local hardware? Memory and download size requirements?

Ed: After reading the linked tutorial - it looks like the built docker container will run fine on local/dedicated hardware?

https://www.psychic.dev/post/how-to-deploy-llama-2-to-google...


Yep, the Docker containers should run fine on local hardware, but the Terraform config only supports GCP right now.

In terms of cost - just ran our deployed cluster through GCP's pricing calculator and it's about $300 USD per month. Definitely not cheap for individual use, but pretty affordable for enterprise use. Running the 40B parameter version will be significantly more.


Out of curiosity, how does that GCP instance compare to my modest gaming rig (Nvidia 3080 with 24(?) GB VRAM / Ryzen 7 / 64 GB RAM)? (Since I'm paying $0/month for it...)


What is the capacity for that price?


> only has open-source dependencies and lets you run the entire stack locally or

Open source and on-prem are two different things. Llama 2 doesn't seem to be open source.


I don't think we've collectively figured out how to describe what "weights openly available" means, so open-source is probably a reasonable descriptor.


I disagree. Open source involves the "source" being available, not just the "compiled".


The concept of “source” is nebulous for ML models. If you have the weights you can recreate a model without access to the source code originally used to train it, and similarly just having the source code without the training data won’t allow you to recreate the model.

While it would be nice to have the data set Meta used, I think open sourcing the weights is good enough.


I think some marketers are using the term "open source" to ride on the goodwill and perceived benefits of open source, without actually doing it.

Also, people who just want to be able to run something on their computer without paying money for it shouldn't call it "open source", unless it actually is.

These distinctions have been going on for decades, for very good reasons. No need to throw away that progress now.


No. The weights encode the recorded parameters; they don’t encode essential components like hyperparameters or modules without recorded parameters.


You're right. Either way, it's impossible to recreate Llama 2 without the data set, so perhaps "free to use model" is a better description than "open source model".


Maybe we could call it... Open Weights™.


More like Weights Available in the case of llama2 (and Bring Your Own Pirate Treasure in case of llama1?).


This is wonderful. I have been struggling to find a viable way to deploy private LLMs, and this seems like the perfect option. Thanks for sharing!


Is there a version of this set up to be cpu-only, as in something that can use ggml tech? I'd love to deploy this on some servers with lots of ram and cpu horsepower, but no gpus.


Not yet, but we can definitely add it. Created an issue: https://github.com/psychic-api/rag-stack/issues/2

In the meantime it uses GPT4all when running locally so you can technically deploy it as well, but it's not very good.
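For reference, a minimal sketch of running a GGML-format model on CPU with the gpt4all Python bindings (the model file name is illustrative; this isn't RAGstack's code):

    # pip install gpt4all  -- runs quantized GGML models on consumer CPUs, no GPU needed
    from gpt4all import GPT4All

    model = GPT4All("ggml-gpt4all-j-v1.3-groovy.bin")  # downloaded on first use
    prompt = "Answer using only this context:\n<retrieved chunks here>\n\nQuestion: ..."
    print(model.generate(prompt, max_tokens=200, temp=0.2))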


So this dumps the documents returned from the vector store into a prompt to the LLM. How does it work when there are many documents returned? What's the upper limit there?


Yep. We use LangChain's basic text splitter to chunk the documents and the QA chain to stuff them into the prompt. But AFAIK it doesn't check for context length, so that's a piece that's still missing.

The upper limit depends on the model; Llama 2's context window is 4k tokens, including the prompt.
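A sketch of that flow with LangChain's 2023-era API, plus the kind of naive length guard the parent comment notes is still missing (the file path, chunk sizes, and GPT4All model path are assumptions):

    from langchain.llms import GPT4All
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.chains.question_answering import load_qa_chain
    from langchain.docstore.document import Document

    MAX_CONTEXT_CHARS = 4096 * 3            # crude guard: ~3 chars/token vs Llama 2's 4k window
    raw_text = open("contract.txt").read()  # illustrative input

    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = [Document(page_content=c) for c in splitter.split_text(raw_text)]

    # Naive guard: stop adding chunks before the stuffed prompt overflows the context window.
    kept, total = [], 0
    for doc in chunks:
        if total + len(doc.page_content) > MAX_CONTEXT_CHARS:
            break
        kept.append(doc)
        total += len(doc.page_content)

    llm = GPT4All(model="./ggml-gpt4all-j-v1.3-groovy.bin")  # path is illustrative
    chain = load_qa_chain(llm, chain_type="stuff")
    print(chain.run(input_documents=kept, question="What does the contract say about renewal?"))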


Does it use OpenAI embeddings or other free ones?


It uses all-MiniLM-L6-v2 from Hugging Face by default

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...

You can also specify a different SentenceTransformers embeddings model to use in /server/.env
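For example (the query string is illustrative):

    from sentence_transformers import SentenceTransformer

    # The default; any other SentenceTransformers model name can be swapped in
    # (in RAGstack this would be set via /server/.env rather than in code).
    model = SentenceTransformer("all-MiniLM-L6-v2")

    vectors = model.encode(["How long is the enterprise support SLA?"])
    print(vectors.shape)  # (1, 384) -- all-MiniLM-L6-v2 produces 384-dimensional embeddings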


Trying to run this locally. Once I got past a few gotchas (local.env needing to be renamed to .env, and needing to `pip3 install poetry`), I started getting back responses like

"D<D,8H8,H<,,DH8DHH,,<<,DH<,<DHD<<,<<D,D,HD88<<H8<<D8D88,,8D,DH<,8,D<D,D,D8,D8<D8H,DHH8,D8H<,8D,,H8DHD88DD8H8<,8,HD<8D<,8D,<<888D<H,8<HD<HHD<8<<D8DD<DD<HHHH,,DDD<<DHDH,88HDH8,8DHD<<,D8,<8<H8<8H<,,<,,,D,88,<,<<8D,8<8,,H8,,D888D8<HD8<D,D8,<8<<H8D,,D<D,8<DD,<8"

I'm sure I'm doing something wrong :)


Thanks for the callout! We'll add the local.env instructions to the readme.

Are you using it with input docs or without? Locally it uses GPT4all which isn't nearly as good as Llama or Falcon. I saw a project that is docker for Llama 2 so we might use that instead!


I tried both. I'll certainly try it remotely as well. Was just pottering through HN before bed.


What's the best repo to create your own vector DB but then query OpenAI with the context?


One of the open source vector DBs is probably your best bet: Chroma, Weaviate, Qdrant, and a few others.
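A minimal sketch of that pattern with Chroma and the 2023-era OpenAI SDK (the collection name, documents, and model choice are illustrative):

    import chromadb
    import openai

    client = chromadb.Client()  # in-memory; use a persistent client for real data
    collection = client.create_collection("my_docs")
    collection.add(
        ids=["1", "2"],
        documents=[
            "Invoices are due within 30 days of receipt.",
            "The on-call rotation changes every Monday.",
        ],
    )

    question = "When are invoices due?"
    hits = collection.query(query_texts=[question], n_results=1)
    context = hits["documents"][0][0]

    # openai<1.0 interface; the newer client uses a different call style.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    print(response["choices"][0]["message"]["content"])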



