OpenLLaMA: An Open Reproduction of LLaMA (github.com/openlm-research)
484 points by sadiq 10 months ago | 180 comments

To use with llama.cpp on CPU and 8GB RAM

  git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && cmake -B build && cmake --build build
  python3 -m pip install -r requirements.txt

  cd models && git clone https://huggingface.co/openlm-research/open_llama_7b_preview_200bt/ && cd -
  python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
  ./build/bin/quantize models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/ggml-model-f16.bin models/open_llama_7b_preview_200bt_q5_0.ggml q5_0
  ./build/bin/main -m models/open_llama_7b_preview_200bt_q5_0.ggml --ignore-eos -n 1280 -p "Building a website can be done in 10 simple steps:" --mlock
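As a rough check on the 8GB figure, here is a back-of-the-envelope estimate of the weight size at different precisions. The bits-per-weight values assume ggml's block layout as I understand it (32 weights per block plus one fp16 scale), and the 6.7e9 parameter count is an approximation for a "7B" model; both are assumptions, and the KV cache and runtime overhead are ignored.

```python
# Approximate size of the weights alone (no KV cache, no runtime overhead).
# Bits per weight are assumptions based on ggml block layouts:
#   q4_0: 32 x 4-bit values + one fp16 scale = 18 bytes per 32 weights
#   q5_0: 32 x 5-bit values + one fp16 scale = 22 bytes per 32 weights
BITS_PER_WEIGHT = {
    "f32": 32.0,
    "f16": 16.0,
    "q5_0": 22 * 8 / 32,
    "q4_0": 18 * 8 / 32,
}

N_PARAMS_7B = 6.7e9  # assumed parameter count for a "7B" model


def weights_size_gb(n_params: float, fmt: str) -> float:
    """Approximate size in GB (10^9 bytes) of n_params weights stored as fmt."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    for fmt in ("f32", "f16", "q5_0", "q4_0"):
        print(f"{fmt:>5}: {weights_size_gb(N_PARAMS_7B, fmt):5.1f} GB")
```

Under these assumptions the q5_0 file comes out well under 8GB, while the f16 intermediate file does not, which is why the quantize step matters on a small machine.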

You the real MVP!

Though I'm getting this error on an Intel MacBook (Monterey); it works fine on a Windows 11 box:

   python3 convert-pth-to-ggml.py models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights 1
   Loading model file models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin
   Traceback (most recent call last):
    File "/l/llama.cpp/convert-pth-to-ggml.py", line 11, in <module>
      convert.main(['--outtype', 'f16' if args.ftype == 1 else 'f32', '--', args.dir_model])
    File "/l/llama.cpp/convert.py", line 1129, in main
       model_plus = load_some_model(args.model)
     File "/l/llama.cpp/convert.py", line 1055, in load_some_model
     File "/l/llama.cpp/convert.py", line 857, in lazy_load_file
       raise ValueError(f"unknown format: {path}")
   ValueError: unknown format: models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin

I had the same issue and then noticed that I need git lfs - otherwise just cloning the repo will not download the weights.
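A quick way to tell whether the clone contains real weights or only Git LFS pointer files is to look at the first bytes of each file. This is a minimal sketch: the helper names `is_lfs_pointer` and `find_lfs_pointers` are made up for illustration, and it relies only on the fact that LFS pointer files are small text files starting with a `version https://git-lfs.github.com/spec/v1` line.

```python
from pathlib import Path
from typing import List

# Git LFS pointer files are small text files that begin with this line.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file looks like a Git LFS pointer rather than real data."""
    with path.open("rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER


def find_lfs_pointers(model_dir: Path) -> List[Path]:
    """List *.bin files under model_dir that are still LFS pointers."""
    return sorted(p for p in model_dir.rglob("*.bin") if is_lfs_pointer(p))
```

If `find_lfs_pointers(Path("models/open_llama_7b_preview_200bt"))` returns anything, the weights were not downloaded; installing git-lfs and running `git lfs pull` inside the model directory should fetch them.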

After getting the model with git lfs I get:

Loading model file models/open_llama_7b_preview_200bt/open_llama_7b_preview_200bt_transformers_weights/pytorch_model-00001-of-00002.bin

Traceback (most recent call last):

  File "convert-pth-to-ggml.py", line 11, in <module>
    convert.main(['--outtype', 'f16' if args.ftype == 1 else 'f32', '--', args.dir_model])
  File "/Volumes/mac/Dev/llama.cpp/convert.py", line 1145, in main
    model_plus = load_some_model(args.model)
  File "/Volumes/mac/Dev/llama.cpp/convert.py", line 1071, in load_some_model
  File "/Volumes/mac/Dev/llama.cpp/convert.py", line 865, in lazy_load_file
    return lazy_load_torch_file(fp, path)
  File "/Volumes/mac/Dev/llama.cpp/convert.py", line 737, in lazy_load_torch_file
    model = unpickler.load()
TypeError: 'staticmethod' object is not callable

Thanks for the tip! After running `brew install git-lfs && git lfs install` on my MacBook, I was able to run the model.

I get the same error on an M series MacBook (Ventura). However from the repo README.md it looks like make should work instead of cmake, I’ll give that a try.

It's not clear from the GitHub; are there any plans to eventually train the 30 or 65 billion weight LLaMA models? The 65B model seems comparable to GPT3.5 for many things, and can run fine on a beefy desktop just on CPU (CPU ram is much cheaper than GPU ram). It'd be amazing to have an open source version.

There’s a lot of controversy about “7B is good enough and small enough for consumer hardware so it’s good enough, full stop”

…but, although it is true that for a fixed compute budget these small models can have impressive results with good training data, it is also true that smaller models (7B) appear to have an upper performance bound that is easily beaten by larger, well-trained models.

It’s just way more expensive to train larger models.

They specifically note they are training a smaller 3B model in the future.

So… it seems reasonable to assume that this is a proof of concept, and that no, the Berkeley AI lab will not be footing the bill for training a larger model.

This is probably more about exploring the “can we make a cheap good-enough model?” than “here is your GPT4 replacement”.

Agreed. With some work, 13B runs on consumer hardware at this point. That redefines consumer to a 3090 (but hey, some depressed crypto guys are selling them. I recently got another GPU for my homelab this way).

30B is within reach, with compression techniques that seem to lose very little information of the overall network. Many argue that machine learning IS fundamentally a compression technique, but the topology of the trained network turns out to be more important. Assuming an appropriate activation function after this transformation.

No… definitely not your GPT4 replacement. However this is the kind of PoC I keep following… every… 18 hours or so? Amazing.

> That redefines consumer to a 3090

Or a beefy MacBook Pro. I recently bought one with 64gb of memory and Llama 65B infers very promptly as long as I'm using quantized weights (and the Mac's GPU).

This is very impressive. I think everyone should pay very close attention to what M1/M2 have given us.

But I’m waiting until my friends can afford it. Right now (which in this pace might mean I change my mind tonight)

…I am earnestly studying how to make this a thing anyone can install as a part of a product they can use without a subscription.

And beam size 1?

Do you know of any research that tries to take a large pre-trained model and make it smaller by cutting out the least activated neurons and training it a bit so it doesn't lose performance?

The entire field of ML distillation.

> They specifically note they are training a smaller 3B model In the future.

They're kidding right, there's no way that thing will be more useful than one of those flan models.

Given inference costs and the ability to run on devices, there's an argument to be made for training models that are smaller than Chinchilla-optimal though, especially if you can still eke out improved performance with longer training times.

I ran the 30B and 65B Q4 on a laptop with 64 GB of RAM (8/16 CPU). It worked, but tokens/s was too low for it to be practically useful.

That's unfortunate. Running the 65B Q4 on an AMD Epyc with 32 1.5 GHz cores and 256 GB of RAM I get around 3 tokens/sec, which is usable if not ideal. I wonder if the difference is related to the RAM or the number of CPUs?

Although there are multiple bottlenecks, my understanding (and the reason why, at a certain point, throwing more threads at it doesn't work) is that inference for dense LLMs is largely limited by memory bandwidth. Most desktop computers will have dual channel DDR4/DDR5 memory which will be hard pressed to get >60GB/s. A last-gen Epyc/Threadripper Pro should have 8 channel memory DDR4-3200 support, which should get you a theoretical max of 204.8 GB/s (benchmarking ends up more around 150GB/s in AIDA64).

The latest Genoa has 12 channel DDR5-4800 support (and boosted AVX-512) and I'd imagine should perform quite well, but if you primarily want to run inference on a quantized 65B model, I think your best bang/buck (for local hardware) would be 2 x RTX 3090s (each of those has 24GB of GDDR6X w/ just shy of 1TB/s of memory bandwidth).
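The bandwidth figures above follow from channels x transfer rate x 8 bytes per 64-bit transfer. A rough sketch of that arithmetic, plus a crude upper bound on tokens/s, is below. The bound assumes every generated token has to stream the full set of weights from memory once, and the ~4.5 bits per weight for a 4-bit 65B model is an assumption; the function names are made up for illustration.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (a 64-bit channel moves 8 bytes per transfer)."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def max_tokens_per_s(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound, assuming each token requires reading all weights from memory once."""
    return bandwidth_gbs / model_gb


if __name__ == "__main__":
    desktop = peak_bandwidth_gbs(2, 3200)  # dual-channel DDR4-3200
    epyc = peak_bandwidth_gbs(8, 3200)     # 8-channel DDR4-3200
    model_gb = 65e9 * 4.5 / 8 / 1e9        # assumed ~4.5 bits per weight

    print(f"desktop: {desktop:.1f} GB/s, at most {max_tokens_per_s(desktop, model_gb):.1f} tokens/s")
    print(f"epyc:    {epyc:.1f} GB/s, at most {max_tokens_per_s(epyc, model_gb):.1f} tokens/s")
```

Real throughput will be lower than these bounds, since measured bandwidth falls short of the theoretical peak and compute is not free, but the ratio between the two machines is the point.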

Yeah, it's really so bad on desktops.

With my LLaMA AVX implementation on 32bit floats [0] there is no performance gain after 2 threads, so the remaining 14 threads available are of no use; there is no memory bandwidth to load them with work :)

[0] https://github.com/gotzmann/llama.go

To the extent that you're memory bandwidth limited you should be able to do multiple inferences at once --- latency stays high but getting multiple samplings can be extremely useful for many uses and can cover up somewhat for high latency.

To an extent, but memory bandwidth soon becomes a bottleneck there too. The hidden state and the KV cache are large so it becomes a matter of how fast you can move data in and out of your L2 cache. If you don’t have a unified memory pool it gets even worse.

Thank you, that makes sense. I had no idea that there was such a dramatic difference in memory bandwidth between desktop and server CPUs.

The two-channel DDR5 in desktops can't even do two channels very well -- if you try to put 64GB RAM in (two dual-rank 32GB DIMMs) then you lose around 50% of the bandwidth compared to a single rank DIMM (e.g. from 8GHz to 4GHz speeds, and increased latency).

I'm following the discussions on GitHub as well as their PRs closely.

The primary bottleneck for now is compute.

They've recently made a big improvement to performance by introducing partial gpu acceleration if you compile with a gpu accelerated variant of BLAS. Either cublas (Nvidia) or CLBlast (slightly slower but supports almost everything: Nvidia, Apple, AMD, mobile, raspberry pi etc)

3 tokens/sec is a lot faster than what I experienced. Even though your CPU has a lot more cores, I think llama.cpp was not able to make good use of more than 8 threads.

When did you test this? Maybe llama.cpp had some improvements since I used it (which was at the start of the project).

It's not about the number of threads, it's about the memory bottleneck. The sweet spot for my M1 Pro laptop is around 6 threads and a 4-bit model - I've managed to get 20 tokens per sec, really impressive

I tested this on the latest master. Llama.cpp has had some performance improvements, although I don't know if that'd be enough to make it 3x faster.

That's just a bit faster than my MacBook Pro, for what it's worth. Which was quite expensive but I don't think AMD Epyc expensive ...

Is it Zen1 architecture? It should be much better on Zen2 and newer Epycs

Slow could still be useful if you do not want to chat with it and instead code it to do a long-running job, like reviewing your entire project the way a code analysis tool would, or summarizing a lot of content.

How low? I think everybody has different requirements there.

I ran it on a modern desktop and was getting sub 1 token/s

Could it parallelize across multiple PCs?

No since it’s stateful in the sense that inferencing is dependent on the past generated tokens.
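The dependence on previously generated tokens can be seen in a toy generation loop. This is a minimal sketch that uses a bigram counter as a stand-in for a real model; `train`, `generate`, and the tiny corpus are hypothetical and only there to show why step t cannot start before step t-1 has finished.

```python
from collections import Counter, defaultdict


def train(tokens):
    """Count which token follows which; the counts stand in for model weights."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows


def generate(follows, prompt, n_new):
    """Greedy autoregressive loop: predict one token, append it, repeat."""
    out = list(prompt)
    for _ in range(n_new):
        # Each step needs the token produced by the previous step,
        # so the steps cannot be run in parallel along the time axis.
        candidates = follows.get(out[-1])
        if not candidates:
            break
        out.append(candidates.most_common(1)[0][0])
    return out


if __name__ == "__main__":
    corpus = "the cat sat on the mat and the cat slept".split()
    model = train(corpus)
    print(" ".join(generate(model, ["the"], 4)))
```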

That's why it's not parallelized along the time axis but rather along the dimension of the embedding axis.

You split the big matrices into smaller matrices to dispatch the workload. But this means you have to add some communication overhead (roughly nblayers sequential synchronisation points per token). In the official LLaMA implementation this is done transparently using RowParallelLinear, ColumnParallelLinear, ParallelEmbedding; see https://github.com/facebookresearch/llama/blob/main/llama/mo...

Transformers have multiple attention heads that can be computed independently and then summed together to produce the output of the layer. This allows the parameter space to be split among machines without having to transfer the parameters at each iteration.
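Splitting a matrix multiply across workers can be sketched in plain Python. This is an illustration of the idea only, not the code used in the LLaMA repository: `column_parallel` and `row_parallel` are hypothetical helpers, the "workers" are just loop iterations, and the matrix dimensions are assumed to divide evenly by the number of workers.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain single-worker matrix product, used as the reference result."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_parallel(x: Matrix, w: Matrix, n_workers: int) -> Matrix:
    """Each worker holds a slice of W's columns; the outputs are concatenated."""
    step = len(w[0]) // n_workers
    parts = []
    for i in range(n_workers):
        w_shard = [row[i * step:(i + 1) * step] for row in w]
        parts.append(matmul(x, w_shard))
    # Concatenate along columns (an all-gather on a real cluster).
    return [sum((p[r] for p in parts), []) for r in range(len(x))]


def row_parallel(x: Matrix, w: Matrix, n_workers: int) -> Matrix:
    """Each worker holds a slice of W's rows and the matching columns of x; partial outputs are summed."""
    step = len(w) // n_workers
    partials = []
    for i in range(n_workers):
        x_shard = [row[i * step:(i + 1) * step] for row in x]
        w_shard = w[i * step:(i + 1) * step]
        partials.append(matmul(x_shard, w_shard))
    # Element-wise sum of the partial results (an all-reduce on a real cluster).
    return [[sum(p[r][c] for p in partials) for c in range(len(w[0]))] for r in range(len(x))]
```

Both variants return the same result as `matmul(x, w)`; the concatenation and the sum are where the per-layer synchronisation cost comes from.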

I'm really curious how Meta, DeepMind and OpenAI make the big models work. The biggest A100 you can buy is just 80GB. And I assume the big companies use single precision floating point during training. Are they actually partitioning the big model across multiple GPU instances? If one had the hardware, how many GPUs does the biggest LLAMA take? These are systems issues and I have not read papers or blog posts on how this works. To me, this infra is very non-trivial.

The "standard" machine for these things has 8x80GB = 640GB memory (p4de instances here: https://aws.amazon.com/ec2/instance-types/p4/), with _very_ fast connections between GPUs. This fits even a large model comfortably. Nowadays probably most training uses half precision ("bf16", not exactly float16, but still 2 bytes per parameter). However, during training you easily get a 10-20x factor between the number of parameters and the bytes of memory needed, due to additional things you have to store in memory (activations, gradients, etc.). So in practice the largest models (70-175B parameters) can't be trained even on one of these beefy machines. And even if you could, it would be awfully slow.

In practice, they typically use servers with clusters of these machines, up to about 1000 GPUs in total (so around 80TB of memory, give or take a few?). This allows even the biggest models to be trained on large batches of several hundreds, or even thousands, of elements (the total memory usage is _not_ proportional to the product of number of parameters and the batch size, but it does increase as a function of both of them, a term of which being indeed the product of the two). It makes for some very tricky engineering choices to make just the right data travel across connections, trying to avoid as much as possible that you have to sync large amount of data between different machines (so "chunking" things to stay on the 640GB range) with strategies such as ZeRO being published every now and then. Plus of course the practical effort to make physical connections as fast as possible...

To get an idea of how hard these things are, take a look at how long the list of names in the published paper about BLOOM language model is :-)

In case anyone's interested, on this page[1], a P4DE 24xlarge is listed as costing $25 per hour for a reserved instance.

[1] - https://instances.vantage.sh/

($25)*(24hours)*(30 days)*(12 months) = $216,000

That's absolutely nuts. That's basically the entire capital cost of an 8x A100 hyperplane from LambdaLabs [1] plus power for a year plus administration! What's the point of cloud hardware if you're paying for everything reserved anyway?

Roughly the same setup costs $12/hour at Lambda if you're lucky enough to snag one so it looks like demand for 8x A100 is so high that you basically have to pay AWS for an entire pod to get access to one, unless you want to pay $40 per hour (!!!)

[1] https://shop.lambdalabs.com/deep-learning/servers/hyperplane...
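For comparison, the same yearly arithmetic applied to the three hourly rates quoted above. This is only a sketch of the calculation: it keeps the 24 x 30 x 12 (360-day) approximation from the comment, and the labels are shorthand for the prices mentioned in the thread, not verified price list entries.

```python
HOURS_PER_YEAR = 24 * 30 * 12  # the 360-day approximation used in the comment


def yearly_cost(usd_per_hour: float, hours: int = HOURS_PER_YEAR) -> float:
    """Cost of keeping one machine running for a full (approximate) year."""
    return usd_per_hour * hours


if __name__ == "__main__":
    quoted_rates = [("reserved, $25/h", 25), ("Lambda, $12/h", 12), ("quoted $40/h", 40)]
    for label, rate in quoted_rates:
        print(f"{label}: ${yearly_cost(rate):,.0f} per year")
```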

Very insightful!! A 175B parameter model with 2 bytes per weight, and say 2 bytes per gradient (not sure if single precision gradients makes sense?) comes in at 700GB, which is beyond a single 8x80GB beefy machine!! I recall reading with tech such as RDMA, you can communicate really fast between machines .. I assume if you add a switch in there, you are toast (from a latency perspective). Perhaps using 2 such beefy machines in a pair would do the trick .. after all .. model weights aren't the only thing that needs to be on the GPU.
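The 700GB figure is easy to reproduce. The sketch below counts only weights and gradients at 2 bytes each, as in the comment, and ignores optimizer state and activations, so it is a lower bound; the function names are made up for illustration.

```python
import math

GPU_GB = 80        # memory per A100 in the machine discussed above
GPUS_PER_NODE = 8  # 8x80GB = 640GB per machine


def model_state_gb(n_params: float, bytes_per_weight: int = 2, bytes_per_grad: int = 2) -> float:
    """Memory for weights plus gradients only, in GB (10^9 bytes)."""
    return n_params * (bytes_per_weight + bytes_per_grad) / 1e9


def nodes_needed(total_gb: float) -> int:
    """Minimum number of 8-GPU machines needed just to hold total_gb."""
    return math.ceil(total_gb / (GPU_GB * GPUS_PER_NODE))


if __name__ == "__main__":
    total = model_state_gb(175e9)
    print(f"{total:.0f} GB of weights and gradients, at least {nodes_needed(total)} machines")
```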

I saw a reference that said GPT-3, with 96 decoder layers, was trained on a 400 GPU cluster, so that seems like the ballpark for a 175B parameter model. That's 50 of the hypothetical machines we talked about (well .. really 100 for GPT-3 since back in those days, max was 40 or 48 GB per GPU).

I also wonder why NVIDIA (or Cerebras) isn't beefing up GPU memory. If someone sold a 1TB GPU, they could charge 100 grand easy. As I understood it, NVIDIA's GPU memory is just HBM-6 .. so they'd make a profit?

Looking here: https://huggingface.co/docs/transformers/perf_train_gpu_one#... It looks like the most standard optimizer (AdamW) uses a whopping 18 bytes per parameter during training. Using bf16 should reduce that somewhat, but it wasn't really considered in that section; I'm not sure if that part of the guide is a bit outdated (before A10 / A100 this wasn't an option) or if it still has some instability issues ("normal" float16 can't be used for training because, multiplying gradients through the hundreds of layers, you'd get 0 or infinity values that would kill your learning). You can switch to different optimizers (Adafactor) and modify a few other things, but that typically comes at the cost of either lower accuracy or slower training, or both.
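The per-parameter figure for AdamW can be broken down into its parts. The split below (6 bytes for mixed-precision weights, 8 for the two optimizer moments, 4 for gradients) is my recollection of the linked guide and should be treated as an assumption; activations and temporary buffers are not counted, and the function names are hypothetical.

```python
import math

# Assumed per-parameter memory for mixed-precision AdamW training, in bytes.
BYTES_PER_PARAM = {
    "weights (fp32 master copy + half-precision copy)": 6,
    "AdamW optimizer states (two fp32 moments)": 8,
    "gradients (fp32)": 4,
}


def training_state_gb(n_params: float) -> float:
    """Weights + optimizer states + gradients, in GB (10^9 bytes)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


def min_gpus(n_params: float, gpu_gb: float = 80) -> int:
    """Minimum GPU count just to hold the training state, ignoring activations."""
    return math.ceil(training_state_gb(n_params) / gpu_gb)


if __name__ == "__main__":
    for name, n in [("7B", 7e9), ("65B", 65e9), ("175B", 175e9)]:
        print(f"{name}: {training_state_gb(n):,.0f} GB, at least {min_gpus(n)} x 80GB GPUs")
```

Even before activations, a 175B model needs dozens of 80GB GPUs under these assumptions, which is why sharding strategies such as ZeRO matter.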

For multiple GPUs there are quite a few ways to improve memory footprint and speed: https://huggingface.co/docs/transformers/perf_train_gpu_many Although I'm not sure if the implementations in HuggingFace are really on par with the SOTA methods (they shouldn't be far away in any case). I guess they should be at least on par, if not better, with whatever OpenAI used for GPT-3 back then, things evolving so quickly in this realm...

On the last point, I can only assume there are some hard thresholds which are difficult to overcome in order to add more memory, otherwise they would. Just an 80GB memory GPU was something unthinkable a dozen years ago; before the deep learning explosion, around 2GB was the norm. A couple of years ago, when 16GB or 32GB was the best you'd get from Nvidia, AMD did come out with consumer grade GPUs having significantly larger memory (maybe 48GB back then? I can't remember), which could have stirred the market a bit I guess, but it didn't pick up for deep learning (I suspect mostly due to the lack of an equivalent to cudnn / cuda, which makes it possible to "easily" build deep learning frameworks on top of the GPUs).

My take on this is, if there's a competitor who fights hard to regain market share, and bets big on offering more memory, and still the best it comes up with is just a couple of times more than what the others have, it must be not as easy as "let's stick another bank of memory here and sell it", or they would have...?

GPU memory is also useful to load large detailed scenes for rendering (.usd). It is a bit surprising that 80GB is the limit. It was obvious for years that GPU compute is ahead of GPU memory size by 10x-100x. And loading larger models and scenes into memory was always a struggle. This must be a hardware or yields issue.


Depends on your application; if getting many completions is useful to you, then it's embarrassingly parallel.

I didn't measure, but IIRC it was lower than 1 token/sec

If I rent an A100 what kind of speed could I expect?

While I do not have any A100 handy right now I have an instance running on Genesis Cloud with 4x RTX 3090.

A quick, very unscientific, test using the oobabooga/text-generation-webui with some models I tried earlier gives me:

* oasst-sft-7-llama-30b (spread over 4x GPU): Output generated in 28.26 seconds (5.77 tokens/s, 163 tokens, context 55, seed 1589698825)

* llama-30b-4bit-128g (only using 1 GPU as it is so small): Output generated in 12.88 seconds (6.29 tokens/s, 81 tokens, context 308, seed 1374806153)

* llama-65b-4bit-128g (only using 2 GPU): Output generated in 33.36 seconds (3.81 tokens/s, 127 tokens, context 94, seed 512503086)

* llama (vanilla, using 4x GPU): Output generated in 5.75 seconds (4.69 tokens/s, 27 tokens, context 160, seed 1561420693)

They all feel fast enough for interactive use. If you do not have an interface that streams the output (so you can see it progressing) it might feel a bit weird if you often have to wait ~30s to get the whole output chunk.

At least for now they are focused on 7B and then 3B[1].


I'm not sure whether the number of parameters serves as a reliable measure of quality. I believe that these models have a lot of redundant computation and could be a lot smaller without losing quality.

The Chinchilla scaling law describes, apart from the training data size, the optimal number of parameters for a given amount of computing power for training. See


For training, yes, but these models are optimized for inference, since inference will be run many more times than training. The original Llama models were run way past chinchilla-optimal amounts of data.
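To put "way past Chinchilla-optimal" into numbers, here is a rough sketch. It uses the common rule-of-thumb reading of the Chinchilla result (about 20 training tokens per parameter) and the usual 6 x parameters x tokens approximation for training FLOPs; both are approximations, not exact laws. The 200B figure is my reading of the "200bt" in the preview checkpoint name, and 1T tokens is the figure reported for the original LLaMA 7B.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb reading of the Chinchilla result (assumption)


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens


if __name__ == "__main__":
    n = 7e9
    optimal = chinchilla_optimal_tokens(n)
    runs = [
        ("Chinchilla-optimal", optimal),
        ("200B-token preview", 200e9),
        ("1T tokens", 1e12),
    ]
    for label, tokens in runs:
        print(f"{label}: {tokens / optimal:.1f}x optimal, ~{training_flops(n, tokens):.2e} FLOPs")
```

The extra training compute is paid once, while the smaller model is cheaper on every inference call afterwards, which is the trade-off the comment describes.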

Does anyone have any resources they recommend for just understanding the base terminology of models like this? I always see the terms "weights", "tokens", "model", etc. I feel like I understand what these mean, but I have no idea what I need to care about them for in open models like this? If I were to download an open model to run on my machine, would I download the weights? I'm just ignorant in the ML space I guess but not sure where to start.

Psst ... why don't you spend 30 minutes of quality time with chatGPT and get to the bottom of this? Get those personalised explanations and enjoy its unlimited patience.

I have felt the same in the past, about a completely different topic. I know how it feels: it's like people are not calling things what they are, just using weird words.

"weights" - synapses in the AI brain

"tokens" - word fragments

"model" - of course, the model is the AI brain

"context" - the model can only handle a piece of text, can't put whole books in, so this limited window is the context

"GPT" - predicts the next word, trained on everything; if you feed its last predicted word back in, it can write long texts

"LoRA" - a lightweight plug-in model for tweaking the big model

"loss" - a score telling how bad is the output

"training" - change the model until it fits the data

"quantisation" - making a low precision version of the model because it still works, but now is much faster and needs less compute

"embedding" - just a vector, it stands for the meaning of a word token or a piece of image; these embeddings are learned

But isn't this a bad idea when you don't even know the basics? You wouldn't be able to separate genuine information from subtle or not-so-subtle hallucinations.

It's like generating code in a language that you know nothing about. You should check for bugs, but you can't.

The first thing to learn is you can’t trust the internet. From that you’ll know not to trust gpt. If you are prone to trusting things blindly, without doing your own research or verification, you have far bigger problems than gpt “hallucinations” (frankly a terrible terminology).

I find "hallucinations" to be pretty apt. What works better in your opinion?

The neurological term for it is "Confabulation", which is a lot better than "Hallucination" as used in AI.

Confabulation is the unintended generation of false memories.

Hallucination is false perception.

Clearly, the phenomenon that LLM researchers call Hallucination better fits Confabulation.

Sometimes it helps when the audience gets the meaning of a word. Confabulation is not really popular among non-native English speakers, I am sure.

It's also not popular among native English speakers, I can assure you.

I don't actually think either term is more precise than the other when we're talking about LLMs, which aren't human brains. It doesn't have either memory or perception in a way that we do.

I think the horse has left the barn on this one.

“Confidently presented bullshit” is probably much more accurate. Added benefit no new vocabulary terms :-)

Lies. Bullshit. Con artistry.

It's not perceiving reality incorrectly, it's presenting wholesale fiction as fact both coherently and with absolute confidence. It even forges supporting documentation ad-hoc.

GPT is not a poor schizophrenic suffering from delusions or innocuous "hallucinations." It is the world's most advanced liar.

> Lies. Bullshit. Con artistry.

These are worse as they imply the thing generating the words knows the truth and purposely says something else.

An LLM is just doing next token prediction. It's a mathematical process. It's not trying to "hide" the truth from you.

For me, hallucination is better.

Lies, BS, and Con artistry all require conscious motive and intent. That's a bridge too far, for me, in ascribing ‘intelligence’ to these models.

Hallucination, to me, conveys ‘seeing things (facts) that are not there’. To the extent the models are ‘perceiving’, they ARE perceiving reality incorrectly. Granted, I expect many times it’s because the source of the model training data are, at best, just wrong or are lying.

Those are very inaccurate descriptors. A lie is an intentional deception, which is impossible for GPT. It "believes" that it "knows" something about the world, which happens to have been made up wholesale by its "subconscious" (obviously I know it's not a human brain). That is pretty much a hallucination by definition, applied to a non-human "intelligence".


> it's presenting wholesale fiction as fact both coherently and with absolute confidence

That is not in any way distinct from perceiving reality incorrectly. It is a symptom common to both skilled lying and hallucination.

In my opinion people are way more afraid of hallucinations than they should be. You are not asking it to solve world hunger, this is basically like asking it to summarize Wikipedia articles. At least with GPT4 it doesn't hallucinate on basic things. I am learning typescript with it, and it hasn't given me wrong answers to direct questions yet. If you are too worried about hallucinations use something like phind.com which will give some sources.

Anyone can evaluate whether it's giving you a self-consistent set of statements, and the additional words it spits out are helpful for a traditional search for alternative sources.

IMO, so long as you're aware the information is often subtly wrong, it's not that different from, e.g., physics classes progressively lying to you less to allow your brain to build a framework to house the incoming ideas.

I think one of the good things to get a sense of with ChatGPT is the types of areas where it is most and least likely to confabulate. If I asked it for an ELI5 about key concepts relating to how LLMs work, I would be highly confident it would be accurate. When you start asking about truly esoteric topics, that's when it often starts completely making things up.

I like the term "confabulation". A hallucination is an artifact of an intoxicated or malfunctioning brain. In my experience, confabulation is a common occurrence in normal brains, and can occur without intention. It's why humans make such poor witnesses. It's how the brain fills in the blanks in its senses and experience.

> Psst ... why don't you spend 30 minutes of quality time with chatGPT and get to the bottom of this?

I do not use ChatGPT as a search engine. Its ability to confidently hallucinate consistently places it much below a human expert on any topic that I care to understand correctly.

That attitude is going to cost you. You'll have no choice but to abandon it at some point, as the LLM implementations get better. The improvements in GPT4 over 3.5 alone are enough to dispel a lot of my own initial skepticism.

> That attitude is going to cost you.

I don’t think it will cost me much to not use the explicitly-not-a-search-engine thing as a search engine.

Which LLM will you use to verify that ChatGPT is more knowledgeable than human experts on a given topic?

The thing is, your mistake isn't just distrusting the language model, it's trusting the search engine. No matter what tool you use, the responsibility for ensuring accuracy is ultimately yours. Similar degrees of caution and skepticism must be applied to results from both ML and traditional search engines.

They are both insanely powerful tools, and like most insanely powerful tools, the hazards are considerable.

Without a search engine, how am I supposed to weigh the accuracy of an LLM? How am I supposed to take responsibility for ensuring accuracy?

I also think people who say that search engines lie are seriously overestimating the number of lies returned in search results. Social media is one thing but the broader internet is filled with articles from relatively reputable sources. When I Google "what is a large language model" my top results (there aren't even ads on this particular query to really muddle things) are:

1. Wikipedia

Sure this is the most obvious place for lies but we already understand that. Moreover, the people writing the text have some notion of what is true and false unlike an LLM. I can always also use the links it provides.

2. Nvidia

Sure they have a financial motive to promote LLMs but I don't see a reason they have to outright mislead me. They also happen to publish a significant amount of ML research so probably a good source.

3. TechTarget

I don't know this source well but their description seems to agree deeply with the other two so I can be relatively sure on both this and the others' accuracy. It's a really similar story with Bing. I can also look for sources that cite specific people like a sourced Forbes article that interviews people from an LLM company.

With multiple sources, I can also build a consensus on what an LLM is and reach out further. If I really want to be sure I can type a site:edu to just double check. When I have the source and the text I can test both agreement with consensus and weigh the strength of a source. I can't do that with an LLM since it's the same model when you reprompt. I get that LLMs can give a good place to begin by giving you keywords and phrases to search but it's a really, really poor replacement for search or for learning stuff you don't have experience in.

> The thing is, your mistake isn't just distrusting the language model, it's trusting the search engine.

There is a rather substantial difference between a search engine, which suggests sources which the reader can evaluate based on their merits, and a language model, whose output may or may not be based on any sources at all, and which cannot (accurately) cite sources for statements it makes.

> Similar degrees of caution and skepticism must be applied to results from both ML and traditional search engines.

This is a fairly ridiculous statement.

> This is a fairly ridiculous statement.

Really? Have you used Google lately -- say, in the past 6-12 months?

I personally use search engines on a daily basis. They link me to external websites that I can trust or distrust to varying degrees depending on my prior experience with them and the amount of further research I put in.

If a person is in the habit of using a search engine like a chat bot by typing in questions AskJeeves-style and then believing what text pops up in the info cards above the ads (which are themselves above the search results), I could see how the distinction between chat bots and search engines could seem trivial.

The similarity between chat bots and search engines breaks down significantly if the user scrolls down past the info cards and ads and then clicks on a link to an external website. At that point in the user experience it is no longer like chatting with a confident NPC.

> The thing is, your mistake…

This is a weird thing to write to a stranger. I suppose there will be no need to caution people about rudeness or making strange assumptions in the utopian future where humans only talk to chatbots, though.

We're starting to be able to tell the humans from the bots because the bots can consistently demonstrate better social skills.

Of course, it will be trivial for such bots to emulate humans if they find that useful.

Fun times.

It will be a wondrous day that we can finally see a computer capture the distinctly-human Urge to Post. The je ne sais quoi that makes us all donate our takes to the needy is an organic phenomenon so far.

> The je ne sais quoi that makes us all donate our takes to the needy is an organic phenomenon so far.

"I do not use ChatGPT as a search engine. Its ability to confidently hallucinate consistently places it much below a human expert on any topic that I care to understand correctly."

> The je ne sais quoi that makes us all donate our takes to the needy is an organic phenomenon so far.


Exactly. Just pointing out that it's not "weird" to answer an opinion disguised as an axiom with another just like it. You shared your position in no uncertain terms and I did the same. It's all good, welcome to HN.

Yes. It would have been a very strange joke about posters if I somehow tried to say that I am not myself a poster, in a post. That would have been a weird thing to imply.

Thank goodness that I didn’t do that, I’d certainly have egg on my face if I hadn’t included myself in the joke and somebody called me out on it!


These are explanations that make sense to people who already know how deep learning works but don't really explain much to beginners beyond giving them a grossly oversimplified misrepresentation of what is being discussed (while not actually explaining anything).

My advice to folks is, if you actually want to know how this stuff works at some basic level, put in some time learning how basic linear and logistic regression work, including how to train them using backpropagation. From there you'll have a solid foundation that gives enough context to understand most deep learning concepts at a high level.
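For anyone who wants a concrete starting point for that advice, here is a minimal sketch of logistic regression trained with plain gradient descent in pure Python. The toy dataset, learning rate, and step count are made up for illustration; backpropagation in a deep network applies the same chain-rule gradient, just layer by layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.3, steps=3000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y   # derivative of the loss w.r.t. the logit
            grad_w += err * x / n
            grad_b += err / n
        w -= lr * grad_w                   # step against the gradient
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: the label is 1 when x is "large".
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
predictions = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]
print(predictions)
```

The two numbers `w` and `b` are the "weights" here; everything bigger is the same idea with many more of them.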

It was intended as a demystification, not a total explanation. There are millions of places explaining with technical details.

> why don't you spend 30 minutes of quality time with chatGPT and get to the bottom of this?

when it can hallucinate content, why do that instead of reading a blog post from an expert?

Oh no, it will hallucinate an obscure fact, but not basics. It's pretty good at reciting theory, it would pass many ML engineering theoretical interviews.

If you don't trust its memory, copy a piece of high quality text in the topic of interest inside the context, as reference.

it's repeatedly made up entire quotes and research papers?

Not the OP, but I'm still hesitant because it infuriates me that I have to give them my identity, which they will then log every prompt against. You think they aren't building profiles on people? AI Moties (a "The Mote in God's Eye" reference) is what they are.

I think this is the right answer, ChatGPT is an excellent 1-1 tutor.

Andrej Karpathy's Zero to Hero video series [1] is a good middle ground. It isn't super low-level but it also isn't super high-level. I think seeing how the pieces actually fit together in a working project is valuable to get a real understanding.

After going through this series I can say I basically understand weights, tokens, back-propagation, layers, embeddings, etc.

1. https://karpathy.ai/zero-to-hero.html

I'm working my way through that series now. He really is a good teacher -- I keep waiting for the inevitable "Next, draw the rest of the fucking owl" moment, but so far he does seem to be sticking to his commitment to a from-scratch approach.

When was this published? Is this an older tutorial by Karpathy?

Just curious, didn't see any date...

The first class is 8 months old and the latest one is 3 months old. If you click on the links, they'll direct you to YouTube videos.

You can on YouTube. The first video is from 8 months ago.

Weights are basically number/float variables. In neural networks, vectors of values are multiplied (or math'd in some way) by weights to get new vectors of values. A 500 billion weight model has 500 billion variables, all carefully chosen via training.

A model is some architecture of how data will flow through these weight matrices, along with the values of each weight.

Tokens are sort of "words" in a sentence, but the ML may be translating the word itself into a more abstract concept in 'word space': eg, a bunch of floating point values.

At least some of what I just said is probably wrong, but now someone will correct me and we'll both be more right!
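To make the "weights are just numbers" point concrete, here is a toy sketch of a single layer in pure Python: an input vector is multiplied by a weight matrix to get a new vector. The sizes and values are invented for illustration; a real model has billions of these numbers, set by training rather than by hand.

```python
def apply_layer(weights, bias, inputs):
    """One dense layer: out[i] = sum_j weights[i][j] * inputs[j] + bias[i]."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]

# Hypothetical 2x3 weight matrix: 6 weights plus 2 biases = 8 parameters.
weights = [[1.0, 0.0, -1.0],
           [0.5, 0.5, 0.5]]
bias = [0.0, 1.0]

output = apply_layer(weights, bias, [2.0, 4.0, 6.0])
param_count = sum(len(row) for row in weights) + len(bias)
print(output, param_count)  # [-4.0, 7.0] 8
```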

At a first approximation this is pretty good. I wouldn't say this exactly:

> A model is some architecture of how data will flow through these weight matrices, along with the values of each weight.

Because data doesn't really flow through weight matrices, though perhaps this is true if you squint at very simple models. Deep learning architectures are generally more complicated than multiplying values by weights and pushing the results to the next layer, though which architecture to use depends heavily on context.

> Tokens are sort of "words" in a sentence

Tokens are funny. What a token is depends on the context of the model you're using, but generally a token is a portion of a word. (Why? Efficiency is one reason; handling unknown words is another.)
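A toy sketch of the "portion of a word" idea, assuming a tiny hand-written vocabulary. Real tokenizers learn their vocabulary from data and use more careful algorithms; the greedy longest-match below is a simplification, not what LLaMA does. It does show why unknown words are not a problem: they fall back to smaller pieces.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of a word into known pieces.

    Falls back to single characters, so unknown words still get tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest piece first
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

# Hypothetical vocabulary, invented for this example.
vocab = {"token", "iz", "ation", "un", "believ", "able"}
print(tokenize("tokenization", vocab))   # ['token', 'iz', 'ation']
print(tokenize("unbelievable", vocab))   # ['un', 'believ', 'able']
print(tokenize("xyz", vocab))            # ['x', 'y', 'z']
```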

> What a token is depends on the context of the model you're using, but generally a token is a portion of a word.

When doing quick estimates, I just assume every syllable is a token. It tends to overestimate, which is fine for my OOM mitigation purposes.
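A rough sketch of that rule of thumb, assuming English text and counting each run of vowels as one syllable. The vowel-group heuristic and the choice to count every digit and punctuation mark as one token are my own assumptions, so treat the result as a ballpark only.

```python
import re

def estimate_tokens(text):
    """Rough token estimate: one per vowel group, one per digit or symbol."""
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", word.lower())))  # at least 1 per word
        for word in words
    )
    others = len(re.findall(r"[^A-Za-z\s]", text))           # digits, punctuation
    return syllables + others

prompt = "Building a website can be done in 10 simple steps"
print(estimate_tokens(prompt))
```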

Probably not the answer you would like but I think your approach to download them and figure out how to run them on your machine is a good one. You don't need to understand everything to get something working. It can be overwhelming and unproductive to know everything before getting started.

To learn more deeply though, get started with getting it to work and when you are curious or something doesn't work, try to understand why and recursively go back to fill in the foundational details.

For example, download the code and try to get it to work. Why is it not working? Oh, it's trying to look for the model. Search for how to get the model and set it up. Then the key step: recursively look up every single thing in the guide or setup. Don't try to set something up or fix something without truly understanding what it is you are doing (e.g. by copying and pasting). This gives you a structured way to fill in the foundations of what you are trying to get to work, in a more focused and productive manner. At the end you might realize that their approach or yours is not optimal: "oh, it was telling me to download the 65B model when I can only run 7B on my machine because ..."

For a good general non-technical introduction I recommend the Computerphile series on YouTube covering language models, transformers and other general concepts. If you are interested in actually doing stuff, there's an overabundance of material out there already, if you try looking.

I haven't watched it yet, but the Practical Deep Learning for Coders course that's available on YouTube is often recommended


A book about AI. (Norvig and Russell come to mind.)

I'm always curious about the cost of these training runs. Some back of the envelope calculations:

> Overall we reach a throughput of over 1900 tokens / second / TPU-v4 chip in our training run

1 trillion / 1900 = 526315789 chip seconds ~= 150000 chip hours.

Assuming "on-demand" pricing [1] that's about $500,000 training cost.

[1] https://cloud.google.com/tpu/pricing
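The same back-of-the-envelope arithmetic as a small script, so the inputs are easy to change. The hourly rate below is a placeholder I picked to land near the ~$500,000 figure above, not a quoted price; check the pricing page for the real number.

```python
def training_cost(total_tokens, tokens_per_chip_second, usd_per_chip_hour):
    """Return (chip_hours, cost_usd) for a run at constant throughput."""
    chip_seconds = total_tokens / tokens_per_chip_second
    chip_hours = chip_seconds / 3600
    return chip_hours, chip_hours * usd_per_chip_hour

# 1 trillion tokens at the quoted 1900 tokens/second/chip.
ASSUMED_USD_PER_CHIP_HOUR = 3.40   # assumption, not an official rate
hours, cost = training_cost(1e12, 1900, ASSUMED_USD_PER_CHIP_HOUR)
print(f"{hours:,.0f} chip-hours, about ${cost:,.0f}")
```

Swapping in a negotiated rate or a different token count is a one-line change.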

At these levels of spending the actual cost is heavily negotiated and is usually far below the advertised on-demand pricing.

Considering I could negotiate A100 for under a dollar/hr - 8 months ago, when they were in high demand, I wouldn't be surprised if the cost was close to 100k for this training run.

Nobody in their right mind is using GCE for training. Take a look at real prices: https://vast.ai/

I got the impression that kind of thing (buying time on GPUs hosted in people's homes) isn't useful for training large models, because model training requires extremely high bandwidth connections between the GPUs such that you effectively need them in the same rack.

I suspect most A100s on vast.ai are actually in a datacenter, and might even be on other public clouds, such as AWS. I don't see why either vast.ai or AWS would care if this were the case.

Is there a good resource that describes the impact of bandwidth and latency between GPUs?

I assume that it's completely impractical to train on distributed systems?

Anyone training this size of model is almost certainly using AWS/GCE.

The GPU marketplaces are nice for people who need smaller/single GPU setups, don't have huge reliability or SLA concerns, and where data privacy risks aren't an issue.

Well, or Azure.

Ha yes of course. But actually has anyone been able to get instances on Azure? Thought OpenAI had them all reserved.

Aren't they explicitly using TPUs in their training? Vast AI are only offering GPUs.

These nodes typically have slow downstream, and thus are hard to use when training requires pulling a huge dataset.

Only 19 GPUs with 30+G of VRAM in the entire North America.

I might be misreading it. It might be just 12 GPUs.

They haven't trained a 1 trillion token model yet. They have only done 200bn so far

Google is generous about giving out TPUs for free for research, so this run is likely using that. A more representative number is the one from Meta, which required 87k A100 hours, or roughly $100-200k for 7B model training.

I am quite new to this, I would like to get it running. Would the process roughly be:

1. Get a machine with decent GPU, probably rent cloud GPU.

2. On that machine download the weights/model/vocab files from https://huggingface.co/openlm-research/open_llama_7b_preview...

3. Install Anaconda. Clone https://github.com/young-geng/EasyLM/.

4. Install EasyLM:

    conda env create -f scripts/gpu_environment.yml
    conda activate EasyLM

5. Run this command, as per https://github.com/young-geng/EasyLM/blob/main/docs/llama.md:

    python -m EasyLM.models.llama.llama_serve \
         --mesh_dim='1,1,-1' \
         --load_llama_config='13B' \
         --load_checkpoint='params::path/to/easylm/llama/checkpoint'

Am I even close?

I think llama.cpp might be easier to set up and get running.


I second this recommendation to start with llama.cpp. It can run on a regular laptop and it gives a sense of what's possible.

If you want access to a serious GPU or TPU, then the sensible solution is to rent one in the cloud. If you just want to run smaller versions of these models, you can achieve impressive results at home on consumer grade gaming hardware.

The FastChat framework supports the Vicuna LLM, along with several others: https://github.com/lm-sys/FastChat

The Oobabooga web interface aims to become the standard interface for chat models: https://github.com/oobabooga/text-generation-webui

I don't see any indication that OpenLLaMa will run on either of those without modification. But one of those, or some other framework may emerge as a de-facto standard for running these models.

Yes, I can clone this and get into a prompt in less than 5 minutes on an M2 MBA.

might try it first. seems to be only CPU?

It has partial GPU acceleration if you compile it with LLAMA_CUBLAS or LLAMA_CLBLAST.

They really have come a long way since... A few weeks ago.

Using cuBLAS with my 1080 Ti results in a 52% speedup compared to CPU-only. VRAM usage is very minimal.

I'd see that as a benefit of llama.cpp - it's specifically designed to be usable on consumer hardware such as laptops, without professional GPUs.

You can get it running with one Python script on Modal.com :)


Ok you lot! Will try out modal.

Yeah, it is pretty nice. Not sure how long it took, but less than the time to make a sandwich (2 minutes). It cost 2-3c a pop, so sadly more expensive than GPT-3.5. However, maybe it can be optimised. Or maybe there is some init cost that could be stored in state.

    (modal) fme:/mnt/c/temp/modal$ modal run openllama.py
    ✓ Initialized. View app at https://modal.com/apps/ap-9...
    ✓ Created objects.
    +-- 🔨 Created download_models.
    +-- 🔨 Created mount /mnt/c/temp/modal/openllama.py
    +-- 🔨 Created OpenLlamaModel.generate.
    +-- 🔨 Created mount /mnt/c/temp/modal/openllama.py
    Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]
    Downloading shards: 100%|██████████| 2/2 [00:00<00:00, 1733.54it/s]
    Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00,  5.70s/it]
    Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00,  6.23s/it]
    Building a website can be done in 10 simple steps:
    1. Choose a domain name. 2. Choose a web hosting service. 3. Choose a web hosting package. 4. Choose a web hosting plan. 5. Choose a web hosting package. 6. Choose a web hosting plan. 7. Choose a web hosting package. 8. Choose a web hosting plan. 9. Choose a web hosting package. 10. Choose a web hosting plan. 11. Choose a web hosting package. 12. Choose a web hosting package. 13. Choose a web hosting package. 14. Choose a web hosting
    ✓ App completed.

Thanks for trying it out!

2-3c per run seems very high. That's probably just the cost if you have to spin up a new container. You can shorten the idle timeout on a container if it's typically going to serve just one request. If it's going to serve more requests, then the startup and idle shutdown cost is amortized over more requests :)

I found this was the cost per call to a web function. I used deploy to deploy it. The function just does what the main did in the example repo (earlier in this thread).

How is this model performing better than LLaMA in a lot of tasks [1] even though it's trained on a fifth of the data (200 billion vs 1 trillion tokens)?


They are likely doing some interpolation for 200B or benchmarking it the wrong way. E.g. HellaSwag accuracy for LLaMA 7B is 0.76 [1], but the repo lists it as 0.56. Looking at the charts, LLaMA is above 0.56 even at 200B tokens.

[1]: https://arxiv.org/pdf/2302.13971.pdf

They ran lm-evaluation-harness on both this model and the original llama weights, which is the correct way to do it.

Many people have been struggling to reproduce the benchmark numbers included in the original llama paper.

Nobody knows :^)

Maybe it uses a higher quality dataset

Would be very interesting to see https://github.com/BlinkDL/RWKV-LM trained on the same data

Interesting. Have you done anything with RWKV?

I evaluated RWKV recently, and it's interesting for sure. It's undertrained, and has a quirky architecture, so some parts of it are different from playing with the llama ecosystem. The huge context length is super appealing, and in my tests, long prompts do seem to work and get coherent results.

Where it's slow is in tokenization -- it can be very, very slow to make an initial tokenization of a prompt. I think this has to do with how the network actually functions, like there's a forward loop that feeds each token in to the network sequentially.

I would guess if it had the same level of attention and work that the Llama stack is getting it would be pretty fantastic, but that's just a guess, I'm a hobbyist only.

Nope, not yet, the current 14B version is much worse than LLaMA 65B. But there are apparently plans to train a RWKV-65B by the end of the year, and if including the LLaMA training dataset results in something like LLaMA-65B but with infinite context then that'd be really amazing.

How is this different from what RedPajama is doing?

Also, most people don't much mind running LLaMA 7B at home, since the license is hard to enforce there, but a lot of commercial businesses would love to run a 65B-parameter model if possible and can't, because the license is more meaningfully prohibitive in a business context. Open versions of the larger models are a lot more meaningful to society at this point.

RedPajama is creating a dataset. This is a permissively licensed model trained on that dataset.

RedPajama is also training both foundation and instruct-tuned models

Source: https://twitter.com/togethercompute/status/16527350961501757...

I agree with this. For a lot of companies hundreds of thousands of dollars or single digit millions on fine tuning, inference, and so on is entirely feasible but using model weights with clouded legal status isn’t.

Really exciting how fast fully pre-trained new models are appearing.

Here's another repo (with the same "open-llama" name) that has also been available on Hugging Face for a few weeks (different training dataset).

https://github.com/s-JoL/Open-Llama https://huggingface.co/s-JoL/Open-Llama-V1

Is anyone familiar with the BOINC-style grid computing scene for ML and, specifically, LLM? Is there something interesting going on, or is it infeasible? Will things like OpenLLaMA help it?

They seem to scale up, not out, so grids don't really work.

What everyone is using are HPC grade low latency interconnects to make the cluster look as close as possible to a single big TPU.

"They seem to scale up, not out, so grids don't really work."

Can someone explain what this means? I don't understand.


In a typical fully connected hidden layer, each neuron needs the values of all the neurons in the previous layer to compute its own, so you need all the data in one place. Obviously you can distribute the actual calculations, which is what a GPU does, but distributing that over networked CPUs will be incredibly slow and require the whole thing to be loaded into memory on all instances.
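A small pure-Python sketch of why the data has to come together. If the previous layer's values are split across two hypothetical workers, each worker can only produce partial sums, and those have to be exchanged and added before any neuron's value is known. The layer size and numbers are made up.

```python
def partial_sums(weights, inputs, columns):
    """Contribution to every output neuron from one slice of the inputs."""
    return [sum(row[j] * inputs[j] for j in columns) for row in weights]

# Hypothetical layer: 2 output neurons, 4 input neurons.
weights = [[1.0, 2.0, 3.0, 4.0],
           [0.0, 1.0, 0.0, 1.0]]
inputs = [1.0, 1.0, 2.0, 2.0]

# Worker A holds inputs 0-1, worker B holds inputs 2-3.
from_a = partial_sums(weights, inputs, [0, 1])
from_b = partial_sums(weights, inputs, [2, 3])

# The "network" step: every output needs a number from every worker.
outputs = [a + b for a, b in zip(from_a, from_b)]
print(from_a, from_b, outputs)  # [3.0, 1.0] [14.0, 2.0] [17.0, 3.0]
```

This exchange happens at every layer, for every batch, which is why the speed of the link between workers matters so much.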

My bet is on some kind of light based or analog electric accelerator PCIE card to be the next best thing for this sort of inference, since it should be able to calculate multiple layers at once. FPGAs also work but only for fixed weights.

Further than that, with big models and training rounds that want to update potentially all the values, you can't even split the work by saying "report the fitness of this model against this cost function and report back in however much time your CPU needs" because shipping around the model and data is impractical.

I mean yeah, even just doing regular inference is borderline impossible on a normal machine given that we're even having this discussion. Training is just completely unfeasible.

Up=bigger machine

Out=lots of machines through network

The more you split it up outwards (across more nodes), the more communication among nodes that is required, which doesn’t lend itself well to regular Internet connections, which means it would prefer to scale upwards with more GPU/CPU/memory capacity per node.
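To put a rough number on that, here is a sketch of how long it takes just to ship one full copy of a 7B model's weights or gradients over a single link. The 16-bit precision and both link speeds are assumptions for illustration, and real training overlaps and compresses communication, so this is an order-of-magnitude feel only.

```python
def sync_seconds(num_params, bytes_per_param, link_bits_per_second):
    """Time to send one full copy of the weights (or gradients) over a link."""
    total_bits = num_params * bytes_per_param * 8
    return total_bits / link_bits_per_second

PARAMS = 7e9          # a 7B-parameter model
BYTES_PER_PARAM = 2   # assuming 16-bit values

# Assumed link speeds, for illustration only.
home = sync_seconds(PARAMS, BYTES_PER_PARAM, 100e6)     # 100 Mbit/s home connection
cluster = sync_seconds(PARAMS, BYTES_PER_PARAM, 400e9)  # 400 Gbit/s datacenter interconnect
print(f"home: {home:.0f} s, cluster: {cluster:.2f} s")  # home: 1120 s, cluster: 0.28 s
```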

I haven't looked into it or tried it yet, but there is https://petals.ml/

Can someone explain how to tell if a model doesn't require a GPU and can run on a CPU?

After setting up dalai, OpenAssistant, gpt4all and a bunch of other (albeit nonworking) LLM thingies, my current hunch is:

if the model somewhere has "GGML" in its name, it doesn't require a GPU.

Technically anything that's based on pytorch can run on CPU, you just need to tell it to do so. For example, in textgen add '--cpu' and you're done. It will be super slow though.

GGML format is meant to be executed through llama.cpp, which doesn't use the GPU by default. You can often find these models in a quantized form as well, which helps performance (at a cost of accuracy). Look for q4_0 for the fastest performance and lowest RAM requirements, and look for q5_1 for the best quality right now (well, among quantized models).
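For a feel of what "quantized" means and where the accuracy cost comes from, here is a simplified sketch: a block of float weights is stored as small integers plus one shared scale. This is a generic symmetric scheme I'm using for illustration, not the actual q4_0 or q5_1 layout used by llama.cpp, and the weights are invented.

```python
def quantize_block(values, bits=4):
    """Map floats to small signed integers sharing one scale per block."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    quants = [round(v / scale) for v in values]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quants]

# Hypothetical weights, invented for this example.
block = [0.7, -0.33, 0.12, 0.0]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
print(quants)    # small integers in [-7, 7]
print(restored)  # close to the originals, but not identical
```

Fewer bits means less RAM and faster loading, but a coarser grid, so the restored weights drift further from the originals.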

Oh yeah, textgen supports llama.cpp, and also provides an API, so it looks like a clear winner. You might want to manually pull newer dependencies for torch and llama.cpp though:

    pip install -U --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.h...
    pip install -U llama-cpp-python

Has anyone successfully used embeddings with anything other than OpenAI's APIs? I've seen lots of debates on using embeddings vs fine-tuning for things like chatbots on private data, but is there a reason why you can't use both? I.e., fine-tune LLaMA on your data, then run the same embeddings approach on top of your own fine-tuned model?

> We are currently focused on completing the training process on the entire RedPajama dataset.

So that's 1.2 trillion tokens. Nice.

Forgive my ignorance, but can a model be refined by training on a specific codebase, after, say, training on all the standard docs for the language, third-party libs, and so on?

I have no formal idea how this is done, but my assumption is that "something like that" should work.

Please disabuse me of any silly ideas.

Hi Jason! I have a few thoughts on this!

Refined training (fine-tuning) usually means updating the weights of what's called a foundation model with plentiful, well-structured data. It's very expensive and can erode the usefulness of all the generalizations baked in from the original training data [1].

While LLMs can generate text based on a wide range of inputs, they're not designed to retrieve specific pieces of information in the same way that a database or a search engine would. But I do think they hold a lot of promise in reasoning.

Small corollary: LLMs do not know ahead of time what they are generating. Secondly, they use the input from you and from themselves to drive the next message.

This sets us up for a strategy called in-context learning [1]. We take advantage of the above corollary and prime the model with context to drive the next message. In your case, a query about some specific code base with knowledge about standard docs etc.

Only there is a big problem, context sizes. Damn. 4k tokens?

We can be clever about this but there is still a lot of work and research needed. We can take all that code and standard docs and create embeddings of them [2]. Embeddings are mathematical representations of words or phrases that capture some of their semantic meaning. Basically the state of a trained neural network given inputs.

This will allow us to group similar words and concepts together closer in what is called a vector space. We can then do the same for our query and iterate over each pair finding the top-k or whatever most similar pairs. Many ways to find the most similar pairs but what's nice is cosine similarity search. Basically a fancy dot product of the pairs with a higher score indicating greater similarity. This will allow us to prime our model with the most "relevant" information to deal with the context limit. We can hope that the LLM would reason about the information just right and voila.

So yeah, basically create a fancy information retrieval system that picks the most relevant information to give your model to reason about (basically this [3]). That also skirts around the context limitations, and avoids overfitting and narrowing the training information that allows the model to reason (controversial).
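The retrieval step described above can be sketched in a few lines of pure Python. The snippets, the 3-dimensional "embeddings", and the query vector are all made up; in practice an embedding model would produce the vectors, and a vector index would replace the full sort.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, documents, k=2):
    """documents: list of (text, embedding). Returns the k most similar texts."""
    ranked = sorted(documents,
                    key=lambda doc: cosine_similarity(query_vec, doc[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
documents = [
    ("def parse_config(path): ...", [0.9, 0.1, 0.0]),
    ("README: installation steps",  [0.1, 0.9, 0.1]),
    ("def load_config(path): ...",  [0.8, 0.2, 0.1]),
]
query_embedding = [1.0, 0.0, 0.0]   # stands in for "how is config loaded?"

# Only the best matches go into the prompt, to stay under the context limit.
context = top_k(query_embedding, documents, k=2)
prompt = "\n".join(context) + "\n\nQuestion: how is config loaded?"
print(prompt)
```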

1: "Language Models are Few-Shot Learners" Brown et al. https://arxiv.org/pdf/2005.14165.pdf

2: Embeddings https://arxiv.org/pdf/2201.10005.pdf

3: https://twitter.com/marktenenholtz/status/165156810719298355...

Much appreciated, Sun fearing dude, much appreciated.

You can train the model on more training data after it has been released.

So is this free as in “do what you f’ing like with it”?

I made a YouTube video on how to run OpenLLaMa on Google Colab with Hugging Face Transformers (using a T4 GPU): https://www.youtube.com/watch?v=1NOPciKuQb8

Hope that helps!

Has anyone actually used this? I poked around and it's so poorly documented that I don't see how one can readily, short of trying to go through the code, understand how to do a minimal run.

I've used it with llama.cpp; results are not great, but not entirely terrible (I'd say somewhere between GPT-2 and GPT-3). Still, totally free and open source is great and I'm looking forward to more development from them (and others building on top like an RLHF / alpaca / chat kind of thing).

Thanks for answering! In my skim of the thread I only saw people mention trying it with llama.cpp. I tried to get his EasyLM framework going but could not figure out the parameters I needed. Definitely agree it's great to see real open source models being built.


Happily, licensing.

why the hell will you be happy about duplicate work?

Actually, replication is very important. If no one can make new llamas, that would mean that Facebook used some secret sauce in their training. Understanding publicly how to train these 'enhanced' models that show the performance of much larger models is a very strong motive.

And getting rid of the NC clause of the original llamas too, of course.

As of right now, there's trouble replicating the eval results of the paper, for example.

Yeah but that wasn't the reason, was it? They didn't do it because they wanted to replicate work, they did it because they didn't want the Meta lawyers to be big mad at them.

Good luck convincing Meta to release their models with a proper licence.

That is why it's, sadly, licensing.

Sadly, licensing.
