
I love projects like this. It shows the true potential of what LLMs and RAG can unlock. Imagine applying the same method to the actual content within the threads to extract sentiment and summarize the key points of a particular thread -- the options are limitless.

My only piece of advice, though: try doing the reranking with a dedicated reranker instead of an LLM -- you'll save on both the latency AND the cost.

Other than that, good job.


Thanks! I tried a few other approaches and found the LLM results were overall better (latency and cost aside). Maybe that should be an option made available to users though...


Cohere has a very cheap, fast and effective reranking API!

https://cohere.com/rerank
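
A minimal sketch of calling it from Python (the API key is a placeholder, and the model name may have changed since -- check their docs):

    import cohere

    co = cohere.Client("YOUR_API_KEY")  # placeholder key

    query = "semantic search over Hacker News comments"
    docs = [
        "Ask HN: How do you deploy fine-tuned models?",
        "Show HN: I built semantic search for HN comments",
        "Tell HN: Evernote is laying off staff",
    ]

    # Rerank scores every candidate document against the query.
    response = co.rerank(query=query, documents=docs, top_n=2,
                         model="rerank-english-v2.0")
    for r in response.results:
        print(r.index, r.relevance_score)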


i think not, better results >>> better latency + cost


Maybe a combined approach beats either? Let some non-LLM reranker quickly spit out two results, and fill in the rest with the LLM.
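
A rough sketch of that hybrid, using a sentence-transformers cross-encoder as the fast pass and a hypothetical llm_rerank callable standing in for the LLM pass:

    from sentence_transformers import CrossEncoder

    fast_reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def hybrid_rerank(query, docs, quick_k=2, llm_rerank=None):
        # Fast pass: score every candidate with the lightweight cross-encoder.
        scores = fast_reranker.predict([(query, d) for d in docs])
        ranked = [d for _, d in sorted(zip(scores, docs),
                                       key=lambda t: t[0], reverse=True)]
        top, rest = ranked[:quick_k], ranked[quick_k:]
        # Slow pass: let the LLM order the remainder (stand-in callable).
        if llm_rerank is not None:
            rest = llm_rerank(query, rest)
        return top + rest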


This is exactly what https://www.perplexity.ai/ is trying to do. Maybe not "RAGing" the entire internet, but certainly mapping natural language queries onto their own (probably) vector database, which contains the "source of truth" from the internet.

How they build that database, and which models they use for text tokenization, embedding generation, and ranking at "internet" scale, is the secret sauce that has enabled them to raise more than $165M to date.

This is surely where internet search will be in a couple of years, and that's why Google got really concerned when the original ChatGPT was released. That said, don't assume Google isn't already working on something similar. In fact, the main theme of their Google Next conference was LLMs and RAG.


A lot of the answers to your question focus solely on the infra piece of the deployment process, which is just one, albeit important, piece of the puzzle.

Each model is built on some predefined architecture, and the majority of today's LLMs are implementations of the Transformer architecture from the "Attention Is All You Need" paper (2017). When you fine-tune a model, you usually start from a checkpoint and then compute new weights using techniques like LoRA or QLoRA. You do this in your training/fine-tuning script using PyTorch, or some other framework.
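
As a hedged sketch of what that step can look like with the Hugging Face peft library (the base model name and the hyperparameters here are illustrative):

    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, TaskType, get_peft_model

    # Start from a checkpoint (model name is illustrative).
    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)

    # LoRA: instead of updating all the weights, train small low-rank
    # adapter matrices on top of selected attention projections.
    config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                 # rank of the adapter matrices
        lora_alpha=16,       # scaling factor for the adapters
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # typically well under 1% of the base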

Once training is done you get the final weights -- a binary blob of floats. Now you need to load those weights back into the inference architecture of the model. You do that by using the same framework you trained with (PyTorch) to construct the inference pipeline. You can build your own framework/inference engine too if you want and try to beat PyTorch :) The pipeline (sketched after this list) will consist of things like:

- loading the model weights

- doing pre-processing on your input

- building the inference graph

- running your input (embeddings/vectors) through the graph

- generating predictions/results
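
Here is a minimal sketch of those steps using PyTorch via the Hugging Face transformers API (the model name is a placeholder for your own fine-tuned checkpoint):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "my-org/my-finetuned-model"  # placeholder

    # 1. Load the model weights and tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16)
    model.to("cuda" if torch.cuda.is_available() else "cpu")
    model.eval()

    # 2. Pre-process the input into token ids.
    inputs = tokenizer("What is RAG?", return_tensors="pt").to(model.device)

    # 3./4. Run the input through the model's (implicit) inference graph.
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64)

    # 5. Decode the generated result.
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))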

Now, the execution of this pipeline can be done on GPU(s) so all the computations (matrix multiplications) are super fast and the results are generated quickly, or it can still run on good old CPUs, but much slower. Tricks like quantization of model weights can be used here to reduce the model size and speed up the execution by trading off precision/recall.
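
With the same Hugging Face stack, a quantized load is roughly a one-liner (a hedged sketch; the model name is a placeholder, and this path requires the bitsandbytes package):

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Load the weights in 4-bit precision to cut memory use roughly 4x vs fp16.
    quant = BitsAndBytesConfig(load_in_4bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        "my-org/my-finetuned-model",  # placeholder
        quantization_config=quant,
        device_map="auto")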

Services like Ollama or vLLM abstract away all the above steps, which is why they are so popular -- they may even allow you to bring your own (fine-tuned) model.

On top of the pure model execution, you can create a web service that serves your model via an HTTP or gRPC endpoint. It could accept a user query/input and return JSON with the results. Then it can be incorporated into any application, become part of another service, etc.
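
For instance, a minimal FastAPI wrapper might look like this (a sketch only; run_model is a hypothetical stand-in for the inference pipeline above):

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Query(BaseModel):
        text: str

    @app.post("/generate")
    def generate(q: Query):
        # run_model is a stand-in for the pipeline sketched earlier.
        return {"result": run_model(q.text)}

    # Serve with: uvicorn server:app --host 0.0.0.0 --port 8000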

So, the answer is much more than "get the GPU and run with it", and I think it's important to be aware of all the steps required if you want to really understand what goes into deploying custom ML models and putting them to good use.


Thanks for the insightful response. This is exactly the type of answer I was looking for. What's the best way to educate myself on the end-to-end process of deploying a production-grade model in a cost-efficient manner?


This might be asking for too much, but is there a guide that explains each part of this process? Your comment made the higher-level picture way clearer for me and I'd like to go into the weeds a bit on each of these steps.


download llama.cpp

convert the fine tuned model into gguf format. choose a number of quantization bits such that the final gguf will fit in your free ram + vram

run the llama.cpp server binary. choose the -ngl number of graphics layers which is the max number that will not overflow your vram (i just determine it experimentally, i start with the full number of layers, divide by two if it runs out of vram, multiply by 1.5 if there is enough vram, etc)

make sure to set the temperature to 0 if you are doing facts based language conversion and not creative tasks

if it's too slow, get more vram

ollama, kobold.cpp, and just running the model yourself with a python script as described by the original commenter are also options, but the above is what i have been enjoying lately.

everyone else in this thread is saying you need gpus but this really isn't true. what you need is ram. if you are trying to get a model that can reason you really want the biggest model possible. the more ram you have the less quantized you have to make your production model. if you can batch your requests and get the result a day later, you just need as much ram as you can get and it doesn't matter how many tokens per second you get. if you are doing creative generation then this doesn't matter nearly as much. if you need realtime then it gets extremely expensive fast to get enough vram to host your whole model (assuming you want as large a model as possible for better reasoning capability)
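
for reference, here's roughly the same recipe driven from python via the llama-cpp-python bindings instead of the server binary (paths and layer counts are illustrative; tune n_gpu_layers experimentally as described above):

    from llama_cpp import Llama

    # model path and gpu layer count are illustrative; halve n_gpu_layers
    # on out-of-memory, grow it by 1.5x while vram allows.
    llm = Llama(model_path="./model-q4_K_M.gguf", n_gpu_layers=32, n_ctx=4096)

    out = llm(
        "convert this comment into a docstring: ...",
        max_tokens=256,
        temperature=0,  # deterministic output for facts-based conversion tasks
    )
    print(out["choices"][0]["text"])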


Interesting. Thanks for the response. Do you have any resources where I can educate myself about this? How did you learn what you know about LLMs?


Well, when Llama 1 came out I signed up and downloaded it, and that led me to llama.cpp. I followed the instructions to quantize the model to fit in my graphics card. Then later when more models like llama2 and mixtral came out I would download and evaluate them.

I kept up with Hacker News posts and dug into any comments about things I didn't understand. I've also found the localllama subreddit to be a great way to learn.

Any time I saw a comment mention something I hadn't tried, I would try it, like ollama, kobold.cpp, sillytavern, textgen-webui, and more.

I also have a friend who has been into ai for many years and we always exchange links to new things. I developed a retrieval augmented generation (rag) app with him and a "transformation engine" pipeline.

So following ai stories on hn and reddit, learning through doing, and applying what I learned to real projects.


Thanks. Very cool. Have you ever tried to implement a transformer from scratch, like in the "Attention Is All You Need" paper? Can a first/second-year college student do it?


Andrej Karpathy's course is a good resource: https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThs...


I haven't tried it yet, but I do intend to. I think the code for llm inference is quite straightforward. The complexity lies in collecting the training corpus and doing good rlhf. That's just my intuition.
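
The core attention math from the paper is genuinely small, though. Here's a minimal single-head scaled dot-product attention in PyTorch as an illustrative sketch (not the full transformer -- no multi-head, masking, or positional encodings):

    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        # project token embeddings to queries, keys, and values
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        d_k = q.size(-1)
        # similarity of every token pair, scaled by sqrt(d_k)
        scores = q @ k.transpose(-2, -1) / d_k ** 0.5
        weights = F.softmax(scores, dim=-1)  # attention distribution
        return weights @ v                   # weighted sum of values

    x = torch.randn(10, 64)  # 10 tokens, 64-dim embeddings
    w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
    print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([10, 64])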


Hi, I work at a startup where we train / fine-tune / run inference on models on a GCP Kubernetes cluster with some A100s.

There isn't really that much information about how to do this properly, because everyone is working it out and it changes month by month. It requires a bunch of DevOps and infrastructure knowledge above and beyond the raw ML knowledge.

Your best bet is probably just to tool around and see what you can do.



Thanks!! This is really cool


If you don't care about the details of how those model servers work, then something that abstracts out the whole process like LM Studio or Ollama is all you need.

However, if you want to get into the weeds of how this actually works, I recommend you look up model quantization and some libraries like ggml[1] that actually do that for you.

[1] https://github.com/ggerganov/ggml


I've tried code-llama with Ollama, along with Continue.dev, and found it to be pretty good. The only downside is that I couldn't "productively" run the 70B version, even on my MBP with an M3 Max and 36GB of RAM (which, interestingly, should be enough to hold the quantized model weights). It was simply painfully slow. The 34B one works well enough for most of my use cases, so I am happy.


I tried to use codellama 34B and I think it is pretty bad. For example, I asked it to convert a comment into a docstring and it would hallucinate a whole function around it.


What quantization were you using? I've been getting some weird results with 34b quantized to 4 bits -- glitching, dropped tokens, generating Java rather than Python as requested. But 7b, even at 4 bits, works OK. I posted about it earlier this evening: https://www.gilesthomas.com/2024/02/llm-quantisation-weirdne...


Same, CodeLlama 70B is known to suck. DeepSeek is the best for coding so far in my experience; Mixtral 8x7B is another great contender (to be frank, for most tasks). Miqu is making a buzz, but I haven't tested it personally yet.


Galaksija was truly a "masterpiece" for its time, made by a single person who stitched together various parts smuggled in from the West. I have huge admiration and respect for Voja, especially after he decided to give up everything in Serbia, move to the US, and start from scratch on his own in his late sixties!

He's a very humble man despite his remarkable impact and influence on the early tech industry in Yugoslavia. He and Dejan Ristanovic [1] started one of the first PC magazines in the '80s, which was a bastion of progress, filled with ingenious articles and insights collected from all over the world, mostly by word of mouth (remember, there was no internet back then). They and a few others actually founded the first ISP and BBS in Yugoslavia in the late eighties.

Anyway, I am glad to see this article on HN and would suggest you all watch Voja's interview [2] with the Computer History Museum in Mountain View, where Galaksija rightfully got its own piece of history.

[1] https://en.wikipedia.org/wiki/Dejan_Ristanovi%C4%87

[2] https://www.youtube.com/watch?v=nPLyzOEobw8&ab_channel=Compu...


It's hard to overstate how important the magazines "Računari u kući", "Svet kompjutera", and "Moj Mikro" were during the 80s. There was a very limited amount of computer literature to be found, so we all learnt 90% of what we knew from these magazines. Hats off to Dejan Ristanović, Voja Antonić, and all the others who wrote for them. They were the light that guided many of us to our future careers.


I will always remember when the "Takedown" movie came out. I loved the original "Hackers" and couldn't wait for "Hackers 2", which was Takedown.

I had learned about Mitnick a few years prior to the movie and was fascinated by his life story and what he had done up to that point (including his "takedown" by the FBI). It's an understatement to say that his work, character, and a sort of positive social manipulation had a great influence on my upbringing and, later, my professional career. Back then I enjoyed playing pranks on my friends and "hacking" them with all sorts of trojans, ejecting their CD-ROM trays :)

I am very sad to hear that he's gone. RIP Legend.


I've been using Threads since late yesterday and it's been very slow... sometimes it wouldn't even load content when you browse specific profiles (especially the active ones with tens of thousands of followers). So I am not surprised the backend is in Python :)


I remember when Evernote launched and everyone was super hyped about it -- even some of the biggest VC names promoted them. Not to mention the funding they raised ($290M). It was literally iOS Notes on steroids. I even used it for a while, but somehow it didn't stick.

They hired a bunch of great people and had some good backend tech -- sad to see this happen to them.


Didn't Evernote precede Notes? My understanding is that Evernote was the first "notes" app, but I'm old; I could be making shit up in my head.


Evernote preceded the iPhone; the product was originally a non-cloud small business that I think was sold/spun off into a venture-capital-funded company.

Evernote is older than the iPhone.

I just checked and found a review of Evernote 1.0 from 2007, but I think the product predated that review. I found a website saying the first Evernote beta was released in 2004. I remember using Evernote in grad school in 2008.


Older than...Lotus Notes?


Short story about Lotus Notes: Back in high school, a friend of mine started writing "code" for Lotus Notes; I can't remember what the "code" was called, though -- templates, plugins, something like that.

Anyway, he started selling his stuff; first to accountants and small businesses (his dad was an accountant), then to local banks (in Modesto). His dad used to do all the sales, and my friend would go along to meetings and just sit there. Then he sold the "code" through magazine ads. Finally, he somehow got connected to corporations in San Francisco and started making a ton of money (for a high school kid). Last I heard, just after we all graduated high school, he moved to San Francisco to start a full-time company. I have no idea what happened after that. Steve, I hope you did well.

His situation taught me two things: (1) you could use computers for more than games, and (2) a "little guy" could start a company and actually make money with software.


Yes, it preceded iOS Notes. Evernote started in 2004.


Great job. I must say the speech synthesis sounds pretty realistic. I talked with Jobs, Musk, and Obama and liked how they sounded and, more importantly, how they handled the questions. Do you mind sharing the entire stack you used to build this? Very well done!


Thanks, much appreciated! It was a mixture of some of the latest TTS models, Azure speech-to-text, GPT of course, and some other tools for handling conversational stuff (like interruptions).


Nicely done. Does Azure Speech-to-Text also handle speech synthesis and provide out-of-the-box voices for different characters, or did you have to build your own model for that? It's impressive if their service can do it all -- speech recognition/speech-to-text and text-to-speech -- in near real-time. I should take a closer look at the Azure ML stack :)


I've been using the Azure Cognitive Services speech recognition and text-to-speech for my own locally run 'speech-to-speech' GPT assistant application.

I found the Azure speech recognition to be fantastic, almost never making mistakes. The latency is also at a level that only the big cloud providers can reach. A locally run alternative I use is Vosk [0] but this is nowhere near as polished as Azure speech recognition and limits conversation to simple topics. (Running whisper.cpp locally is not an option for me, too heavy and slow on my machine for a proper conversation)

The default Azure models available for text-to-speech are great too. There are around 500 voices in a wide variety of languages. Using SSML [1] can also really improve the quality of interactions. A subset of these voices has extra capabilities (like responding with emotions; see 'Speaking styles and roles').
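
For anyone curious, a minimal text-to-speech call with the Azure Speech SDK looks roughly like this (key, region, and voice name are placeholders):

    import azure.cognitiveservices.speech as speechsdk

    # Subscription key and region are placeholders.
    config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")
    config.speech_synthesis_voice_name = "en-US-JennyNeural"  # a stock voice

    # With no audio config given, audio plays on the default output device.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)
    result = synthesizer.speak_text_async("Hello from Azure TTS.").get()
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("synthesis finished")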

Though in my opinion the default Azure voice models have nothing on what OP is providing. The Scarlett Johansson voice is really really good, especially combined with the personality they have given it. I would love to be able to run this model locally on my machine if OP is willing to share some information about it!

Maybe OP could improve the latency of Banterai by dynamically setting the Azure region for speech recognition based on the incoming IP. I see that 'eastus' is used even though I'm in West Europe.

But other than that I think this is the best 'speech-to-speech AI' demo I've seen so far. Fantastic job!

[0] https://github.com/alphacep/vosk-api/

[1] https://learn.microsoft.com/en-us/azure/cognitive-services/s...

