Running Dolly 2.0 on Paperspace (simonwillison.net)
73 points by l2dy on April 13, 2023 | 13 comments



Simon, I opened an issue on your TIL repo with the pip incantation that I think will get the GPU working.

https://github.com/simonw/til/issues/69

I ran into that previously


I read "Paperspace" as "paper space" so it reminded of this great article: http://www.righto.com/2014/09/mining-bitcoin-with-pencil-and...

Could someone do the same with some LLM to demonstrate a very simple example?


We'd love to help you all deploy this!

1. We just released a couple models that are much smaller (https://huggingface.co/databricks/dolly-v2-6-9b), and these should be much easier to run on commodity hardware in a reasonable amount of time.

2. Regarding this particular issue, I suspect something is wrong with the setup. The example generates a little over 100 words, which is probably something like 250 tokens; 12 minutes makes no sense for that on a modern GPU. I'd love to see details on which GPU was selected - I'm not aware of a modern GPU with 30GB of memory (the A10 is 24GB, the T4 is 16GB, and the A100 is 40/80GB). Are you sure you're using a PyTorch build that was installed with CUDA support? (There's a quick check for that in the sketch below.)

3. We have seen single-GPU inference work in 8-bit on the A10, so I'd suggest that as a follow-up; a rough sketch is below.
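
Roughly, and assuming accelerate and bitsandbytes are installed (the model name and prompt are just the ones used elsewhere in this thread), a minimal sketch of that follow-up, including a quick check that the PyTorch build actually sees the GPU:

    import torch
    from transformers import pipeline

    # Sanity check: a CPU-only PyTorch build is a common cause of extremely slow generation.
    print(torch.cuda.is_available(), torch.version.cuda)

    # 8-bit loading; needs `pip install accelerate bitsandbytes`.
    generate_text = pipeline(
        model="databricks/dolly-v2-6-9b",
        trust_remote_code=True,
        device_map="auto",                    # let accelerate place the weights on the GPU
        model_kwargs={"load_in_8bit": True},  # quantize to int8 at load time
    )
    print(generate_text("Explain to me the difference between nuclear fission and fusion."))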


I've also been struggling to run anything but the smallest model you have shared on Paperspace:

    import torch
    from transformers import pipeline

    generate_text = pipeline(model="databricks/dolly-v2-6-9b", torch_dtype=torch.bfloat16,
                             trust_remote_code=True, device=0)
    generate_text("Explain to me the difference between nuclear fission and fusion.")

This causes the kernel to crash; the GPU should be plenty:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Quadro P6000        Off  | 00000000:00:05.0 Off |                  Off |
    | 26%   45C    P8    10W / 250W |   6589MiB / 24576MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
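
If it turns out to be memory pressure while the weights load (just a guess on my part), one variant worth trying is float16 instead of bfloat16, since the P6000 is a Pascal card without native bfloat16 support, plus low_cpu_mem_usage (which needs the accelerate package) to avoid a full extra copy in system RAM:

    import torch
    from transformers import pipeline

    # Untested: fp16 weights, and low_cpu_mem_usage to cut peak system RAM while loading
    # (low_cpu_mem_usage requires the accelerate package).
    generate_text = pipeline(
        model="databricks/dolly-v2-6-9b",
        torch_dtype=torch.float16,
        trust_remote_code=True,
        device=0,
        model_kwargs={"low_cpu_mem_usage": True},
    )
    generate_text("Explain to me the difference between nuclear fission and fusion.")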

I'm extremely excited to try these models but they are by far the most difficult experience I've ever had trying to do basic inference.


I’ve never used Paperspace, so I’ll give it a try this weekend. How much RAM do you have attached to the compute? We don’t think it should be any harder to run this via HF pipelines than other similarly sized models, but I’ll look into it.


I wrote a small POC of getting this model working on my box (I felt inspired after reading this). If anybody else wants to try this out, give it a shot here:

https://github.com/lunabrain-ai/dolly-v2-12b-8bit-example

(It's garbage code and this should really just be used as a starting POC. I hope it helps!)


Could someone give a breakdown on why Dolly 2 is so much more difficult to run than llama.cpp?


Could just be timing. Dolly was announced yesterday; Llama was announced by FB, and it took perhaps a week or two for Llama.cpp to appear.


Will Llama.cpp give the same results as Llama? And how is it so much easier to run?


Llama.cpp is just a redistributable C++ program. The original Llama depended on a Python machine learning toolchain IIRC, so a lot more dependencies to install.


Hey there! I'm one of the folks working on Dolly - Dolly-V2 is based on the GPT-NeoX architecture. llama.cpp is a really cool library that was built to optimize the execution of the Llama architecture from Facebook on CPUs, and as such, it doesn't really support this other architecture at this time from what I understand. Llama also features most of the tricks used in GPT-NeoX (and probably more), so I can't imagine it's a super heavy lift to add support for NeoX and GPT-J in the library.

We couldn't use Llama because we wanted a model that can be used commercially, and the Llama weights aren't available for that kind of use.


llama.cpp is just a frontend for the GGML tensor library, which requires models to be converted (and optionally quantized) to GGML format.

Of course people are working on that for Dolly as well: https://huggingface.co/snphs/dolly-v2-12b-q4


Besides converting Dolly's weights to ggml it's also necessary to implement the model using ggml operators, right? Or does the ggml format also carry with it the architecture of the model?



