Show HN: RAGstack – private ChatGPT for enterprise VPCs, built with Llama 2 (github.com/psychic-api)
84 points by ayanb9440 on July 20, 2023 | 30 comments
Hey hacker news,

We’re the cofounders at Psychic.dev (http://psychic.dev) where we help companies connect LLMs to private data. With the launch of Llama 2, we think it’s finally viable to self-host an internal application that’s on-par with ChatGPT, so we did exactly that and made it an open source project.

We also included a vector DB and API server so you can upload files and connect Llama 2 to your own data.

The RAG in RAGstack stands for Retrieval Augmented Generation, a technique where the capabilities of a large language model (LLM) are augmented by retrieving information from other systems and inserting it into the LLM’s context window via a prompt. This gives LLMs information beyond what was provided in their training data, which is necessary for almost every enterprise application. Examples include data from current web pages, data from SaaS apps like Confluence or Salesforce, and data from documents like sales contracts and PDFs.
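As a rough illustration of the pattern (not RAGstack's actual code), a minimal retrieve-then-prompt loop might look like this; the documents, the `llm` callable, and the choice of the all-MiniLM-L6-v2 embedder are assumptions for the sketch:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Toy in-memory corpus; in RAGstack the documents live in Qdrant instead.
    docs = [
        "Our enterprise support SLA is four business hours.",
        "The Acme sales contract renews every January.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def answer(question, llm):
        # 1. Retrieve: embed the question and find the most similar document.
        q_vec = embedder.encode([question], normalize_embeddings=True)[0]
        context = docs[int(np.argmax(doc_vecs @ q_vec))]
        # 2. Augment: insert the retrieved text into the prompt.
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        # 3. Generate: `llm` is whatever model is deployed (Llama 2, Falcon, GPT4All, ...).
        return llm(prompt)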

RAG works better than fine-tuning the model because it’s cheaper, it’s faster, and it’s more reliable since the provenance of information is attached to each response.

While there are quite a few “chat with your data” apps at this point, most have external dependencies on APIs like OpenAI or Pinecone. RAGstack, on the other hand, only has open-source dependencies and lets you run the entire stack locally or on your cloud provider. This includes:

- Containerizing LLMs like Falcon, Llama 2, and GPT4All with Truss
- Vector search with Qdrant
- File parsing and ingestion with LangChain, PyMuPDF, and Unstructured.io
- Cloud deployment with Terraform
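As a rough sketch of how the ingestion pieces could fit together (the file name and page-level chunking are illustrative assumptions, not RAGstack's exact implementation):

    import fitz  # PyMuPDF
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional vectors
    client = QdrantClient(":memory:")                    # or point at a local Qdrant container

    client.recreate_collection(
        collection_name="docs",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

    # Parse a PDF page by page and upsert each page as one searchable point.
    pdf = fitz.open("sales_contract.pdf")  # illustrative file name
    points = []
    for i, page in enumerate(pdf):
        text = page.get_text()
        points.append(PointStruct(id=i, vector=embedder.encode(text).tolist(), payload={"text": text}))
    client.upsert(collection_name="docs", points=points)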

If you want to dive into it yourself, we also published a couple of tutorials on how to deploy open source LLMs for your organization, and optionally give it access to internal documents without any data ever leaving your VPC.

- How to deploy Llama 2 to Google Cloud (GCP): https://www.psychic.dev/post/how-to-deploy-llama-2-to-google...
- How to connect Llama 2 to your own data using RAGstack: https://www.psychic.dev/post/how-to-self-host-llama-2-and-co...

Let a thousand private corporate oracles bloom!




- Do you have plans to support other connectors, specifically OneDrive?
- Do you have a demo somewhere? From the website and screenshots, it's not clear what functionality you offer. A few-minute screencast would help.
- How do you differentiate yourselves from Quivr? It seems like another open source alternative and has some nice features.

Thanks for this. I will try it out and see how well it works for my use case.


We have about 10 other connectors in a separate project at https://github.com/psychic-api/psychic

Thanks for the feedback! We’ll include a demo soon.


Approximately, what would the hourly cost of running this be on Google Cloud?

>In the default-pool > Nodes tab, set:

>Machine Configuration from General Purpose to GPU

>GPU type: Nvidia T4

>Number of GPUs: 1

>Enable GPU time sharing

>Max shared clients per GPU: 8

>Machine type: n1-standard-4

>Boot disk size: 50 GB

>Enable nodes on spot VMs

Not familiar with GCP, but I see the n1-standard-4 is an instance type that costs about $0.19/hr. Are there any other significant costs to take into account?


Just ran our deployed cluster through GCP's pricing calculator and it's about $300 USD per month with Llama 2
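Rough arithmetic from the numbers in this thread: $300/month over roughly 730 hours works out to about $0.41/hour, i.e. a bit over twice the ~$0.19/hour base n1-standard-4 rate once the GPU is attached; the exact figure depends on region, GPU type, and whether spot VMs are used.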


I see a few nice connectors, but it doesn't seem to support network shares of unstructured data. Not all enterprises host on cloud. Not all use Google Cloud, or will be willing to. Some want it local to their network. This also becomes a lot more interesting with M365 integration (OneDrive, SharePoint, Teams, Loop, etc.) and boring old network share paths or local filesystem sources. I'd love to try it now on at least the two latter, without connectors to SaaS things. I have dozens of terabytes of data to test with. Any plans for these more common sources?


This looks like a great project. Given the costs, I imagine many might want to run on dedicated hardware with GPU - yet:

> GPT4All: When you run locally, RAGstack will download and deploy Nomic AI's gpt4all model, which runs on consumer CPUs.

> Falcon-7b: On the cloud, RAGstack deploys Technology Innovation Institute's falcon-7b model onto a GPU-enabled GKE cluster.

> Llama 2: On the cloud, RAGstack can also deploy the 7B parameter version of Meta's Llama 2 model onto a GPU-enabled GKE cluster.

Why not llama2 on dedicated/local hardware? Memory and download size requirements?

Ed: After reading the linked tutorial - it looks like the built docker container will run fine on local/dedicated hardware?

https://www.psychic.dev/post/how-to-deploy-llama-2-to-google...


Yep, the Docker containers should run fine on local hardware, but the Terraform config only supports GCP right now.

In terms of cost - just ran our deployed cluster through GCP's pricing calculator and it's about $300 USD per month. Definitely not cheap for individual use, but pretty affordable for enterprise use. Running the 40B parameter version will be significantly more.


Out of curiosity, how does that GCP instance compare to my modest gaming rig (Nvidia 3080 with 24(?) GB VRAM / Ryzen 7 / 64 GB RAM)? (Since I'm paying $0/month for it...)


What is the capacity for that price?


> only has open-source dependencies and lets you run the entire stack locally or

Open source and on-prem are two different things. Llama 2 doesn't seem to be open source.


I don't think we've collectively figured out how to describe what "weights openly available" means, so open-source is probably a reasonable descriptor.


I disagree. Open source involves the "source" being available, not just the "compiled".


The concept of “source” is nebulous for ML models. If you have the weights you can recreate a model without access to the source code originally used to train it, and similarly just having the source code without the training data won’t allow you to recreate the model.

While it would be nice to have the data set Meta used, I think open sourcing the weights is good enough.


I think some marketers are using the term "open source" to ride on the goodwill and perceived benefits of open source, without actually doing it.

Also, people who just want to be able to run something on their computer without paying money for it shouldn't call it "open source", unless it actually is.

These distinctions have been going on for decades, for very good reasons. No need to throw away that progress now.


No. The weights encode the recorded parameters; they don’t encode essential components like hyperparameters or modules without recorded parameters.


You're right. Either way, it's impossible to recreate Llama 2 without the data set, so perhaps "free to use model" is a better description than "open source model".


Maybe we could call it... Open Weights™.


More like Weights Available in the case of llama2 (and Bring Your Own Pirate Treasure in case of llama1?).


This is wonderful. I have been struggling to find a viable way to deploy private LLMs, and this seems like the perfect option. Thanks for sharing!


Is there a version of this set up to be cpu-only, as in something that can use ggml tech? I'd love to deploy this on some servers with lots of ram and cpu horsepower, but no gpus.


Not yet, but we can definitely add it. Created an issue: https://github.com/psychic-api/rag-stack/issues/2

In the meantime it uses GPT4all when running locally so you can technically deploy it as well, but it's not very good.
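For reference, a minimal sketch of running a GGML-format model on CPU with the gpt4all Python bindings (the model file name is illustrative; this isn't RAGstack's code):

    # pip install gpt4all  -- runs quantized GGML models on consumer CPUs, no GPU needed
    from gpt4all import GPT4All

    model = GPT4All("ggml-gpt4all-j-v1.3-groovy.bin")  # downloaded on first use
    prompt = "Answer using only this context:\n<retrieved chunks here>\n\nQuestion: ..."
    print(model.generate(prompt, max_tokens=200, temp=0.2))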


So this dumps the documents returned from the vector store into a prompt to the LLM. How does it work when there are many documents returned? What's the upper limit there?


Yep. We use LangChain's basic text splitter to chunk the documents and the QA chain to stuff them into the prompt. But AFAIK it doesn't check for context length, so that's a piece that's still missing.

The upper limit depends on the model; Llama 2's context window is 4k tokens, including the prompt.
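A sketch of that flow with LangChain's 2023-era API, plus the kind of naive length guard the parent comment notes is still missing (the file path, chunk sizes, and GPT4All model path are assumptions):

    from langchain.llms import GPT4All
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.chains.question_answering import load_qa_chain
    from langchain.docstore.document import Document

    MAX_CONTEXT_CHARS = 4096 * 3            # crude guard: ~3 chars/token vs Llama 2's 4k window
    raw_text = open("contract.txt").read()  # illustrative input

    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = [Document(page_content=c) for c in splitter.split_text(raw_text)]

    # Naive guard: stop adding chunks before the stuffed prompt overflows the context window.
    kept, total = [], 0
    for doc in chunks:
        if total + len(doc.page_content) > MAX_CONTEXT_CHARS:
            break
        kept.append(doc)
        total += len(doc.page_content)

    llm = GPT4All(model="./ggml-gpt4all-j-v1.3-groovy.bin")  # path is illustrative
    chain = load_qa_chain(llm, chain_type="stuff")
    print(chain.run(input_documents=kept, question="What does the contract say about renewal?"))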


Does it use OpenAI embeddings or other free ones?


It uses all-MiniLM-L6-v2 from Hugging Face by default

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...

You can also specify a different SentenceTransformers embeddings model to use in /server/.env
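For example (the query string is illustrative):

    from sentence_transformers import SentenceTransformer

    # The default; any other SentenceTransformers model name can be swapped in
    # (in RAGstack this would be set via /server/.env rather than in code).
    model = SentenceTransformer("all-MiniLM-L6-v2")

    vectors = model.encode(["How long is the enterprise support SLA?"])
    print(vectors.shape)  # (1, 384) -- all-MiniLM-L6-v2 produces 384-dimensional embeddings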


Trying to run this locally. Once I got past a few gotchas (local.env needing to be renamed to .env, and needing to `pip3 install poetry`), I started getting back responses like

"D<D,8H8,H<,,DH8DHH,,<<,DH<,<DHD<<,<<D,D,HD88<<H8<<D8D88,,8D,DH<,8,D<D,D,D8,D8<D8H,DHH8,D8H<,8D,,H8DHD88DD8H8<,8,HD<8D<,8D,<<888D<H,8<HD<HHD<8<<D8DD<DD<HHHH,,DDD<<DHDH,88HDH8,8DHD<<,D8,<8<H8<8H<,,<,,,D,88,<,<<8D,8<8,,H8,,D888D8<HD8<D,D8,<8<<H8D,,D<D,8<DD,<8"

I'm sure I'm doing something wrong :)


Thanks for the callout! We'll add the local.env instructions to the readme.

Are you using it with input docs or without? Locally it uses GPT4all which isn't nearly as good as Llama or Falcon. I saw a project that is docker for Llama 2 so we might use that instead!


I tried both. I'll certainly try it remotely as well. Was just pottering through HN before bed.


What's the best repo to create your own vector DB but then query OpenAI with the context?


One of the open source vector DBs is probably your best bet: Chroma, Weaviate, Qdrant, and a few others.
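A minimal sketch of that pattern with Chroma and the 2023-era OpenAI SDK (the collection name, documents, and model choice are illustrative):

    import chromadb
    import openai

    client = chromadb.Client()  # in-memory; use a persistent client for real data
    collection = client.create_collection("my_docs")
    collection.add(
        ids=["1", "2"],
        documents=[
            "Invoices are due within 30 days of receipt.",
            "The on-call rotation changes every Monday.",
        ],
    )

    question = "When are invoices due?"
    hits = collection.query(query_texts=[question], n_results=1)
    context = hits["documents"][0][0]

    # openai<1.0 interface; the newer client uses a different call style.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    print(response["choices"][0]["message"]["content"])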



