
Prior to LLaMA 2, I would have agreed with you, but LLaMA 2 is a game changer. The 70B model's performance is probably somewhere between GPT-3.5 and GPT-4. But running it personally isn't cheap: the cheapest I found is about $4/hr to run the whole thing, while I only spend around $3 a month on average on the GPT-3.5 API for my personal stuff.



For what tasks do you consider 70B beyond GPT-3.5 performance? There are some I’m aware of, but they are very much the exception and not the rule, even with the best 70B fine-tunes currently available.


I mainly use 70B for "text QA" on files I consider sensitive, like personal documents. The answers have been very close to what I get from GPT-3.5 (LangChain makes it easy to switch between them). Do you use the quantized version? If so, try running the full one on an A100.
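
For anyone curious, the switch is basically one constructor argument. A minimal sketch against the mid-2023 LangChain API; the local endpoint URL, model name, and file are placeholders, not my actual setup:

    from langchain.chat_models import ChatOpenAI
    from langchain.chains.question_answering import load_qa_chain
    from langchain.document_loaders import TextLoader

    # Hosted: GPT-3.5 through the OpenAI API
    gpt35 = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

    # Local: Llama-2-70B behind any OpenAI-compatible server
    # (URL and model name are placeholders)
    llama70b = ChatOpenAI(
        model_name="llama-2-70b-chat",
        openai_api_base="http://localhost:5001/v1",
        openai_api_key="unused-locally",
    )

    # The QA chain is identical either way; just swap the llm argument
    docs = TextLoader("personal_doc.txt").load()  # hypothetical file
    chain = load_qa_chain(llama70b, chain_type="stuff")
    print(chain.run(input_documents=docs, question="What is the renewal date?"))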


I run 70B very cheaply using serverless GPUs. I've had the best experience with Runpod, but there are a few other options out there for it as well.


Out of curiosity and if you are happy to share, what is your 'personal stuff'?


I use it a lot for personal coding projects, grammar correction/sentence rewording, and translation (it works better than Google Translate for longer text). I explicitly call out personal stuff since my job provides an in-house front end that uses the GPT API (I'm actually not sure which version it is, but judging from the response quality, it's probably GPT-4). My work one has made me noticeably more productive. It helps me with a lot of the "boring" work that I tend to procrastinate on, which gets my momentum going and lets me focus on the complex stuff. I'm not sure how much money I use since there is no limit at work, but if I had to guess, it's probably north of $100 a month in credits.


Can you talk about how you integrate the GPT API at work, and why not just use ChatGPT-4?


The server is provided by my employer, so I can't go into implementation details. But overall, most companies provide access to the API endpoint instead of using ChatGPT itself, since OpenAI uses your ChatGPT conversations for training (hence why 3.5 is free). The API endpoint supposedly doesn't use your data for training, which is why I use it for my personal stuff as well.


As a counter-reference: for my work I use it to code (GPT-4), and it has cost between $70 and $200 per month depending on how heavily I use it.


GPT-4 is significantly more expensive, so I can definitely see you spending that amount. For really complex stuff, I switch over to GPT-4, and it can cost me almost $3 a "question" (as in going from the beginning of a problem to solving it); at GPT-4's launch pricing of $0.03/1K prompt and $0.06/1K completion tokens for the 8K model, a long multi-turn session gets there fast. Honestly it's worth it since it solves my problem, but it adds up quickly, so I try to stick with 3.5 when I can.


Can’t you get by with ChatGPT-4 for these personal-assistant-type questions? That’s what I do, and my $20 a month goes a long way. I’d be interested to know if I am missing out on anything by using GPT this way, in contrast to the API.


I actually used to use ChatGPT but switched to the API once I had GPT-4 access. Mainly it’s because I simply didn’t use the $20 worth of GPT-4 at the time: it was extremely slow, and the questions-per-hour limit was annoying and stressful. I would always worry I would need it for something unexpected, so I never used more than 15 questions at a time (though this has probably changed over the past couple of months). In addition, the API has better privacy implications, since its terms for how they handle your data are better. I also like how I can tie GPT in anywhere. For example, I run a Matrix bridge, so I can give access to people like my parents, who are not tech-literate enough to sign up for and get used to the ChatGPT interface; they talk to it as a bot through the WhatsApp bridge.
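
The bot side of that is simple: the relay pattern looks roughly like this (a minimal sketch assuming matrix-nio and the 2023-era openai client, not necessarily my actual stack; homeserver, user, and credentials are all placeholders):

    import asyncio
    import openai
    from nio import AsyncClient, MatrixRoom, RoomMessageText

    openai.api_key = "sk-..."  # placeholder

    async def main():
        client = AsyncClient("https://matrix.example.org", "@gptbot:example.org")
        await client.login("bot-password")  # placeholder credentials

        async def on_message(room: MatrixRoom, event: RoomMessageText):
            if event.sender == client.user_id:
                return  # don't reply to our own messages
            # Blocking call; fine for a sketch. A real bot would also
            # skip the backlog replayed by the initial sync.
            reply = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": event.body}],
            )["choices"][0]["message"]["content"]
            await client.room_send(
                room.room_id,
                message_type="m.room.message",
                content={"msgtype": "m.text", "body": reply},
            )

        client.add_event_callback(on_message, RoomMessageText)
        await client.sync_forever(timeout=30000)

    asyncio.run(main())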


I use it with a tool that is wired into my terminal and changes my files for me [1]. That alone makes me several times more productive compared to copy-pasting back and forth with the chat window. If the chat window makes me twice as productive, the command-line tool probably makes me 5x as productive. At that kind of output, on a developer salary, $70-200 a month is absolute peanuts compared to what you get in return.

1: https://github.com/paul-gauthier/aider


This tool looks splendid. Personally, it evokes memories of MUDding back in the early 90s. What a concept it would be to MUD to build apps via an LLM -- or even to MUD to build the MUD itself in real time, outside of the OLC and scripting. That sounds like a passion project for when I can find the time.


Is your code subject to code review? If so, have you done anything to improve that bottleneck, or was it never an issue at your previous level of productivity?


It is subject to code review, but I typically spend much more time writing code than having it reviewed (I am very methodical and slow at writing code).


How are you currently hosting your LLaMA 2? Any tips, tricks or advice?


It depends on your needs. For instance, do you want to host an API, or do you want a front end like ChatGPT? Chances are, text-generation-webui [1] will get you pretty close to hosting it yourself. You simply clone the repo, download the model from Hugging Face using the included helper (download-model.py), and fire up the server with server.py. You can then connect to it by SSH port tunneling on port 7860 (there are other ways, like ngrok, but SSH tunneling is the easiest and most secure).
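
Concretely, the tunnel is a one-liner along the lines of ssh -L 7860:localhost:7860 user@your-pod-ip (the host is a placeholder for wherever your GPU lives), after which the web UI is reachable at http://localhost:7860 on your local machine.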

As for hosting, I found that Runpod [2] has been the cheapest (not affiliated, just a user). All the other services tend to add up to more once you include bandwidth and storage. There are some tutorials online [3], but a lot of them use the quantized version. You should be able to fit the original 70B with "load_in_8bit" on one A100 80GB.

[1] https://github.com/oobabooga/text-generation-webui [2] https://www.runpod.io/ [3] https://gpus.llm-utils.org/running-llama-2-on-runpod-with-oo...
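
For reference, loading the full model in 8-bit with transformers + bitsandbytes looks roughly like this (a minimal sketch; assumes you've been granted access to the official weights on Hugging Face, and the prompt is just an example):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-70b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # load_in_8bit quantizes on the fly via bitsandbytes; the 70B weights
    # come to roughly 70GB, which fits on a single A100 80GB
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        load_in_8bit=True,
        device_map="auto",
    )

    inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))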


If you want to query the Llama-2 models, you can use Anyscale Endpoints [1]. Note: I work on this :)

Llama-2-70B is $1 / million tokens, which is the most cost-efficient on the market that I'm aware of.

[1] https://app.endpoints.anyscale.com/
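
The API is OpenAI-compatible, so querying it looks roughly like this (a sketch with the 2023-era openai Python client; double-check the exact model name and base URL against the docs):

    import openai

    openai.api_base = "https://api.endpoints.anyscale.com/v1"
    openai.api_key = "..."  # your Anyscale Endpoints key (placeholder)

    resp = openai.ChatCompletion.create(
        model="meta-llama/Llama-2-70b-chat-hf",
        messages=[{"role": "user", "content": "Summarize the Llama 2 paper in two sentences."}],
    )
    print(resp["choices"][0]["message"]["content"])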


Can we supply our own fine-tuned models?

Edit: I'm sure it's answered on your site, but sometimes it's better to include it right here! :)


Tried to plug it into my favorite chat frontend (TypingMind), but bounced off CORS. Is this something you can do something about?


How do you keep the cost down?



