Hacker News new | past | comments | ask | show | jobs | submit login
Falcon 180B (huggingface.co)
238 points by osanseviero on Sept 6, 2023 | hide | past | favorite | 208 comments



I hope popularity of large models like this one drives more work on CPU inference of quantized models. It is extremely disappointing one can't run 4 or even 8 bit quantized models on a cpu. Inference I did with fp32 on a last gen AVX2 CPU show me it is definitely usable if you're willing to wait a bit longer for each token (I got about 1token per 2s on a ryzen 3700x, 32GB ram, with falcon-7B-instruct and this is with about 1gb of ram in the swap).

I don't quite understand why people aren't working on cpu quantization. Allegedly openvino supports _some_ cpu quantization, but certainly not 4 bit. Bitsandbytes is gpu only.

Why? Is there any technical reasons? I recently checked and for a price of a 24gb rtx3090 I can get a really nice cpu (ryzen 9 5950x) and max it with 128gb of ram. I'd love to be able to use it for int8 or 4 bit inference...


>It is extremely disappointing one can't run 4 or even 8 bit quantized models on a cpu

What do you mean? Llama.cpp can do 8 and 4 bit quantisation on CPU, and even supports Falcon 40B.


Yes on arm. Sorry I meant the cpu (x86_64) not a cpu. :-D


Its quite fast on x86.

llama.cpp looks really good on Mac ARM CPUs because:

- they have tons of memory bandwidth

- they have a really good proprietary acceleration library (accelerate)

But I don't think it would be so fast on, say, an Ampere Altra compared to a similarly priced EPYC cpu.


These things are quite memory bandwidth bound, so if you are getting any kind of machine to run them go for the absolute fastest RAM you can.


Ampere has native acceleration (Ampere Optimised Frameworks) for AI workloads


I dont think llama.cpp integrates this though.

Accelerate is all but depreciated because of the Metal backend anyway.


I told you a while ago that llama.cpp supports 4/8-bit (and other) quantization just fine (and fast) on x86_64.


Yes, you did. Back then I went to their github page and I found no mention of x86. Now I see it is there. Great. I'll give it a try.

Also I see something much better. The same guy behind llama.cpp authors a universal ml library that does cpu quantization. I wasn't aware of that and it is far more impressive than running a couple of chosen models.


https://github.com/ggerganov/ggml

TinyGrad is also targeting CPU inference, and IIRC it works ok in Apache TVM.

One note is that prompt ingestion is extremely slow on CPU compared to GPU. So short prompts are fine (and tokens can be streamed once the prompt is ingested), but long prompts feel extremely sluggish.

Another is that CPUs with more than 128-bit DDR5 memory busses are very expensive, and CPU token generation is basically RAM bandwidth bound.


because on a per-inference level, it's _still_ cheaper to use GPU than it is to use CPU, even if you ignore latency and only consider offline batch applications.


Or, inference that can fluidly use available CPUs, GPUs, and neural units on M1/M2/M3.


Not if a 96 GB GPU costs $30K?


The OP notes that they can infer a 7B model at 2s/tok using their CPU setup. Equivalent inference on a 4090 ($1600) is up to 400 tok/s, or 0.0025 s/tok. This is an 800x difference.

(to be fair, you can do better than the OP's performance, but we _are_ talking several magnitudes of performance, not to mention energy efficiency, which matters at scale)


> Equivalent inference on a 4090 ($1600) is up to 400 tok/s

What? Who's doing 400 t/s @ 7b on a 4090?


I am, with many optimizations eg dynamic batching, tensorrt, but most important is batch size > 1 -- cheating a little bit in the comparison ;).

batch size 1 is more like 200 tok/s, still a very large improvement. the reason why this is still an interesting comparison is that it's comparing consumer hardware vs scaled datacenter setups. datacenter setups are much more efficient in large part because they are designed to process many inputs at once, fully utilizing the hardware. there are advantages to scale in inference, for the gpu-rich.


I’ve been using llama.cpp to try out some large-ish models on my cpu for a while now. I’m on an amd 3600, with 32gb ram.

I can run 30B 4bit and 5bit models. It’s definitely “usable” but smaller models definitely run faster.

I don’t have a good graphics card, but it does work on cpu. With more ram I think the 65B and larger models are runnable as well


I would be willing to bet Apple will be one to invest in CPU-based work in this space


Pleasantly surprised to read their acceptable, acceptable use policy [1].

Expected allowed usage to be drowned in legalese, instead it's short & sweet 4 points policy that boils down to: "don't use for illegal activity and don't harm others".

[1] https://falconllm.tii.ae/acceptable-use-policy.html


"For the purpose of exploiting, harming or attempting to exploit or harm minors and/or living beings in any way..."

That sounds like it'll be vague enough to lead to the same problems as the original JSON license. (Famously, leading to IBM obtaining permission to use it "for evil". https://news.ycombinator.com/item?id=36809065) "Harm living beings in any way" is pretty broad; would use by a researcher working on new antibiotics violate this?


Why is this better than nothing. Laws already exist to cover illegal things, the rest is just cya or moralizing. Or stupid:

  For the purpose of exploiting, harming or attempting to exploit or harm minors and/or *living beings* in any way;
Does that mean it can't be used for farming, slaughterhouses, pest control, "falcon, I have termites in my house, what do I do? I can't help you, termites are a living being and must not be harmed", bacteria?


“minors and/or living beings” is really odd phrasing if you think about it.

You can’t harm minors (whether or not they are alive) and you can’t harm living beings (regardless of their age). But it is fine to harm dead adults I guess.

Putting aside the obvious “why not just say don’t harm anyone” question, it seems to open up some silly philosophical questions around “do dead people age?” Can we use this tool to generate propaganda against an ancient mummified baby, for example?


That's one of the pitfalls of writing custom licenses. The standard ones, say Apache 2.0, have had a lot of eyes on them and generally make sense and have a clear intent. I'm sure whoever wrote this version has an intent that's not yet clearly and unambiguously reflected in the writing, and so they risk both looking silly but also people avoiding the license due to too much ambiguity and confusion. Why would any user want to take on the liability of this license when it's not clearly understandable?


What’s illegal over there might not be here, but illegal still somewhere else.

Makes it as simple as understanding and implementing timezone functionality from scratch.


Yep, exactly. The lack of sophistication in the terms means uneven enforcement by design.


It's neat that Huggingface lets you test-drive these models right in your browser.

This particular model says it needs 640GB of memory just for inference. Assuming Huggingface also has other large models loaded, and wants to also make them available to a non-trivial number of concurrent users -- I wonder how many GPUs they have just to power this test-drive feature.


Holy smokes. I had guessed that one would need a H100 to run this. I had no idea you would need multiple. With how scarce those are, the running costs for this must be immense!


You should be able to run it quantised with much less RAM. 256GB for 8bit, 128GB RAM for 4bit quantisation.


only 128GB, lol


Only costs a few hundred bucks for CPU ram. Sure it's slow, but for creative work it's competitive in speed with a human.


Humans run at approximately 100W, for 2 H100s you're looking at 600W-1400W. Plus humans have a much larger variety of capabilities. And they're more fun.

So you're paying ~10x the power costs to get worse, unverified, illogical answers faster when using LLMs vs humans. Which then have to be checked and revised by humans anyway.


>So you're paying ~10x the power costs to get worse

I mean I don't usually plug myself into an electrical outlet, isn't food much more expensive for the same amount of energy?


That is an interesting question. Where I live, the cost of electricity is 0.2276 €/KWh.

So the two H100, at 1KW, cost 0.2276×24 = €5.5 ($6) per day, which is nearly my groceries average.

(My meals are powering all of my body though, which is five times the consumption that my brain requires, so all in all, it seems a bit more power-efficient than the GPU still.)


However the LLM doing inference on 2 H100s can easily exceed 10x the content generation rate of a single human.

Regarding quality of the output, it obviously depends on the task, but we are benchmarking the models on real world metrics and they are beating most humans already.


This is a pretty cool and neat comparison that I haven't seen before. Probably worth including the rest of the server required to run 2 H100s because those aren't trivial either... I think the 100W might just be for an estimate of the human brain so maybe it is an equivalent example.

I know this isn't the spirit you meant it in, but I'm also impressed with humanity that we've managed to develop something as capable as it is (admittedly significantly less reliable and capable than a person) at only an order of magnitude difference in power consumption.


> Humans run at approximately 100W

I believe it's many times less for the brain. There's no way it dissipates anything close to 100W without cooking itself.


Sure, but brains aren't useful without the rest of the human.


What a deeply unnerving thread..


Brains are about 20W alone, but needs the rest of the body to run properly.


The brain is literally liquid cooled (blood), it can totally dispense of 100w of heat if it has to. (that said, other commenters are correct, the brain used 20w)


For a human as smart as Falcon 180B you'd probably need to pay at least $100k per year in the US.


And even then to get them to actually work as quickly would be impossible.


How are they able to afford this, are they currently simply burning through vc money?


It's 8 x A100 for inference. Huggingface has a lot more than 8 A100s.


These new models get released in such an extreme cadence now it's almost impossible to stay ahead of the curve in production deployments..

Absolutely can't wait to test drive this one -- although I'm pretty sure my 96GB M2 MacBook is unable to run it.. time for M2 Ultra? :-))

Edit:

> You will need at least 400GB of memory to swiftly run inference with Falcon-180B.

What the ...


You should be able to run it 4-bit quantized... barely.


I think this is before quantization (fp16). If quantized to 3 or 4 bit precision, you should be able to run it.


It is inference, so maybe it is just CPU memory?


Here's my understanding (may be wrong!) of the license [1] and the acceptable use policy [2] in terms of what you can't do:

You are not allowed to do the following under the Falcon 180B TII License Version 1.0:

1. Use Falcon 180B to break any national, federal, state, local or international law or regulation.

2. Exploit, harm or attempt to harm minors and living beings in any way using Falcon 180B.

3. Create or share false information with the purpose of harming others.

4. Use Falcon 180B for defaming, disparaging or harassing others.

Notable: 5. Use Falcon 180B or any of its works or derivative works for hosting use, which is offering shared instances or managed services based on the work, unless you apply and are granted a separate license from TII.

Notable: 6. Distribute the work or derivative works unless you comply with several conditions such as including acceptable use restrictions, giving a copy of the license to recipients, stating changes you made, and retaining copyright and attribution notices among others.

7. Use trade names, trademarks, service marks or product names of the licensor unless required for reasonable and customary use in describing the origin of the work or reproducing the content of the NOTICE file.

[1]: https://huggingface.co/spaces/tiiuae/falcon-180b-license/blo...

[2]: https://falconllm.tii.ae/acceptable-use-policy.html


I remain skeptical that models are licensable at all. To be subject to licensing, they would have to be property. What kind of property are they?

Certainly, they are not copyrighted works. You can’t copyright mere data. You could no more copyright a model than you could a phone book, or even a matrix transformation of a list of phone numbers.

And even if they are covered by copyright, they are hopelessly tainted by the copyrighted works they are trained on without license. Without upstream licensing, licensing the model is usurping the rights of the original authors.


> You could no more copyright a model than you could a phone book

Just as an interesting side note, some jurisdiction recognize something apparently called “database right” in English (in Swedish it’s more like “catalog right”).

It’s a kind of intellectual property right for the work of compiling a database.

Perhaps applicable to the weights of a model? But the US does not recognize this as a thing

https://en.wikipedia.org/wiki/Database_right


Fair points, within the context that IP rights are fundamentally made up and shifting rights crafted to balance the concerns and interests of various parties.

I can't tell where this is going to land. Right now, we're seeing a number of parties trying to put metaphorical barbed wire in the newfound prairie of ML models, each struggling to influence the prevalent wisdom of how IP rights should apply to this context.

We could easily end up in a universe where LLMs are a licensing minefield where every copyright owner of any part of their training data gets rights on the model, become essentially unmanageable without relying on helpful licensing middlemen that smooth out the right for LLMs to exist, at a cost.

We could just as well end up with LLMs being recognized as not being derivative works themselves, albeit able to generate derivative works, a much less advantageous situation for creatives and their middlemen who see their creative output as being pirated, to reuse a familiar term of IP rights propaganda.

It'd be a little surprising to me if we ended up in a situation where the work needed to produce good LLMs wasn't associated with any commensurate IP rights on the results, and I expect the megacorps investing billions into this will find it in their heart to throw a few millions toward lobbying efforts to ensure that this isn't the outcome.


They don’t obviously fall under copyright, patent, or trademark—those are the only options for IP.

You can enter into general business contracts that govern how the parties make use of data. This happens all the time with all kinds of data sources: business listings, GIS, credit scores, etc.

If I copy of this kind of data, I might be breaking a contract and committing a tort, but I am not violating copyrights. If a third party gets the data without a contract in place and makes use of it, they are not violating copyright either; the liability falls on the contracted party that let the data get out.

But licenses of the kind proposed on models are inapplicable. Imagine how bizarre it would be if the phone book came with a license stating “you may only use the information for non-commercial purposes.” The phone book publisher would get laughed out of court and maybe even penalized for frivolous lawsuits.


Not really. Maps are also mere data, but they are quite successfully copyrightable. There's even a concept of trap streets [0] used to find out if someone used your data in their map without permission.

AI models don't have an established legal framework yet, but it's reasonable to assume that similar rules will apply here.

[0] https://en.wikipedia.org/wiki/Trap_street


Maps are not mere data. There is a lot of creativity involved in choosing the data to present and leave out, the style and colors, the arrangement of labels, etc. That's why maps are copyrightable. There is comparatively little creativity involved in feeding large fractions of the internet through a standard transformer model. Neither is there any significant creativity involved in the presentation of the raw weights. It's not at all clear to me that the weights are or should be copyrightable.

People will certainly try, though, and like all regulatory regimes copyright loves to expand and never voluntarily shrinks, so they may succeed. Honestly I think the most likely outcome is that model weights will be ruled as derived works of the input dataset, and courts will try to enforce that people who train models must license their entire dataset specifically for model training. Some would cheer that but I personally think it would be a disaster.


The training process is closely curated, bootstrapped, the bulk data is mixed with whatever manual data you have, and it generally requires tremendous amounts of human input and expertise for the model to be even remotely good. It's definitely not just "feeding the data to the model".


Compiling and checking and selecting and filtering and sorting the phone numbers for a phone book and printing that book and distributing it is not trivial. It involves significant work and even a few creative and editorial decisions, but on balance it's not that creative of a process, and the resulting phone book is not copyrightable.

As for mixing in your own data, if the model weights inherit the copyright of the data then every existing large language model is illegal.


You left out the part where they say it IS okay to use it on the backend of a hosted application or service.

The way I read this is that they are reserving their right to build an API product like OpenAI based on this model, and control that particular part of the market. So it can't just be a wrapper putting another brand on it that is open to general purpose use of the model.

But you can build hosted chat-based applications. They just need to be applying the model to some use case.


Living beings? Meaning I can't ask it how to treat mice/termite problems in my house?


It allows to use it to train other models, right?


It seems like LLMs are becoming a commodity. This just wrote me some code that at first glance is as good as what I get from GPT4.

OpenAI better have some earth shattering thing up its sleeve because I don't understand what their moat is.


I've done a lot of work on information extraction with these over the last year, and if accuracy counts, then a) GPT4 is in a league of its own, and b) GPT4 still isn't really very good. They may not have a "moat," but they're still the only player in town when quality is critical.


For now. The quality of competitors has been improving considerably when we look at our own in-house analysis for various use cases we have.

It looks like GPT4 has approached an asymptote in quality (at least within a compute time window where they remain even marginally cost effective). Others are just catching up to that goalpost.

Even GPT4 suffers from the same problems intrinsic to all LLMs-- in real world use, hallucinations become a problem, they have a very difficult time with temporal relevance (i.e identifying when something is out of date), and they are horrifically bad at any kind of qualitative judgement.


> They may not have a "moat," but they're still the only player in town when quality is critical

Their initial moat was built with ChatGPT, which was launched less than a year ago and was surpassed by competitors in less than 6 months. Their current GPT4 is less than 6 months old. While your statement may be true for now, I don’t expect it will hold longer term. They have name recognition advantage, but so did AOL.


ChatGPT wasn't surpassed by competitors in less than 6 months, what model would you say beat it that early?


I think they mean gpt3.5 ChatGPT


Correct, there are a number of models available that perform similarly to GPT3.5 for the majority of tasks an end user may ask of it.


> a) GPT4 is in a league of its own, and b) GPT4 still isn't really very good.

Agree with both and I work in information extraction too.


what is information extraction in your terms? sounds like some spy stuff


Input some document, get a JSON with all the fields. It requires understanding a lot about world entities, fields and values to parse free form into structured form. Also works on screens, you can extract information from an app, for example to run an AI bot on top of the app.


My use cases were extracting stuff like this from scientific papers: "what were the interventions in this study, what outcomes were measured, and what was the treatment effect of each intervention on each outcome?"


Probably a prompt like “read this and tell me the wether it is about a stock and wether i should buy or sell based on the article”


Not OP but I work in a similar space.

Most likely parsing unstructured data (a superset of NER).


Did you try Falcon 180B?


Nope, but from TFA it looks like it's barely better than llama 2, which is significantly worse than GPT4.


They're trying to build a moat out of government regulation (aka rent-seeking). In May, their CEO went before congress and asked for it. Soon after, the media started churning out AI fearmongering articles. I expect regulation bills will be proposed soon.


Yup - right now, they're throwing all their development efforts at trying to enhance "safety." The goal is to do it without significantly degrading model performance for the majority of use cases.

When the inevitable regulation starts rolling out, OpenAI expects their lobotomized models to outperform competing lobotomized models, because they'll have a huge head start (and probably will have had a hand in guiding the legislation as well).


There's a good chance that the fear of China taking over the AI space worldwide may end up being stronger than OpenAI's push for regulation.

Politicians know the later is real, and they also know that the "Terminator" fear is unfounded, at least for now. At least in the US, I doubt very much Congress will cater to OpenAI. They know it's going to undermine the prospects of the entire AI industry in the US and its long term competitivity in the international arena.


Yes - WSJ article today: https://archive.ph/W3OVR

The US military will not be using lobotomized AI.


Their current moat is that no one has the guts to release a fully open model. Always strings attached that makes it tricky for commercial use.


They still have some of the best research talend in the world working there. And they have Microsoft providing them basically free, "almost" unlimited compute resources. It's not what they have now, but their ability to make new stuff that's their "moat".


"ability to make new stuff", ha! Let's actually see the new stuff, then I'll believe it.

I've seen too many first movers fail to differentiate themselves and eventually fall behind their competition to really believe that resources and talent alone constitute a viable moat.

Some would argue they can even be a hindrance, if they render the company complacent and risk-averse.


And GPT-4 is still the top of all the leaderboards...

How come, 6 months after release, nobody has even managed to match its performance?


Because their main competitors were sleeping right before ChatGPT. Now OpenAI has an advantage that will last for some months, but it will probably end when Google launches Gemini (if they make it openly available).


Google is going to release a few mind blowing "per-rendered" examples, then the public model will be a geolocked invite only beta that is severely compute starved and quickly fades into obscurity.


RLHF, if that's not the answer I don't know what is.


Llama2 chat uses RHLF. It's generally for making the models "safer", not smarter (in fact it usually makes them dumber)


There is no model close to GPT-3.5(except maybe Claude) which is a small model and was released almost a year ago. While yes many models could reach close or surpass it in some synthetic benchmarks like Palm 2 from Google, they are nowhere close to it.


Because Twitter, StackOverflow, Reddit etc. stopped the access to their data.


Is that really all GPT-4 is?

Twitter, StackOverflow, Reddit, Wikipedia combined? It seems to be pretty good at otherwise emotional/not super objective topics (it's able to guess typical human emotions on a situation basis/see both sides). Where would it learn that from training wise?


I would guess those are the big 4. I am pretty sure they scraped most of the internet.

"Sentiment Analysis" & "Emotion Lexicons" are part of the training process for all major LLMs.


> I would guess those are the big 4.

What are your thoughts on the whole narrative of "Google was asleep at the wheel innovation wise" because obviously they had Twitter/Reddit/Wikipedia/StackOverflow indexed as well as anybody else on the planet (at least I think they did... maybe not individual tweets?)

If both OpenAI and Google "index" the same content, why is a random-word-generator (LLM) able to outperform Google (who can actually cite sources, etc.)


From not open data?

OpenAI already has legal trouble, I bet others try to avoid the same.


You can still download Stack Overflow anytime https://archive.org/download/stackexchange


It doesn't seem up to date.


It's not, but what percentage of new questions has come into that dataset since then?


It's important to note that prior versions of Falcon were released under Apache 2.0 and that Falcon 180B is released under a more restrictive license.


It's also important to note that the copyright-ability of these models is controversial, and has not been tested in court.

Anyway, this clause is particularly bad:

> You should monitor the web address at which the Acceptable Use Policy is hosted to ensure that your use of the Work or any Derivative Work complies with the updated Acceptable Use Policy.

So, I guess they can just change the AUP and then you have to discontinue use of previously generated stuff. I wonder if that's enforceable in court.

Imagine if the Word EULA contained a clause like this, and then later Microsoft used the clause to force a publisher to destroy all copies of a book they didn't like.


There's no way that's enforceable. Any contract requires a meeting of minds, and a change of contract requires agreement of all parties. They can't unilaterally change the terms after the fact without agreement, you can't agree to something if you don't know about it, and "should" isn't "must as a condition of use". Not only that, but I don't think they can stop you from using the model under the original terms if they change them and you do know but don't agree.

At least, if you have it downloaded and are running it for yourself.


It failed my basic benchmark that only GPT4 passes:

Q: I have a 12 liter jug and a 6 liter jug. I need 6 liters of water. How do I do it?

Falcon180B: First, fill the 6-liter jug completely. Then, pour it into the 12-liter jug. Fill the 6-liter jug again, and pour it into the 12-liter jug until the 12-liter jug is full. At this point, there will be exactly 6 liters of water in the 6-liter jug, which you can then use.

Q: Using only a 6-liter jug, how can I get exactly 6 liters of water?

Falcon180B: You cannot get exactly 6 liters of water in a 6-liter jug because it would already be full. However, if you had another container that could hold at least 6 liters of water, you could follow these steps:

1. Fill the 6-liter jug with water. 2. Pour the water from the 6-liter jug into the other container. 3. Refill the 6-liter jug with water and pour it into the other container. 4. Repeat step 3 until you have poured 6 liters of water into the other container.

This way, you can get exactly 6 liters of water by combining the capacity of the 6-liter jug and the other container.


Probably every single thread comparing gpt (3.5 or especially 4) needs a copy paste caveat that it's not really fair to compare a "bare" model like falcon (not even its instruct version) with chatgpt _service_ the service part is crucial, because it includes advanced features such as "secret sauce" output sampling (while on huggingface all models by default use basic greedy sampling). No preprompt, depending on subject detection and so on. For a fair comparison we would need to compare with the exact same prompt and exact same output sampling. And tgats not going to happen with (not at all) OpenAI


> "secret sauce" output sampling

are you referring to beam search? something else?


Beam search is well known. I mean strategies like beam search, but one's we don't know about.

I can imagine some, for example like beam search but you score every option with a smaller model. Of course one can say "but we see every token as it streams" to which I might say, are you sure? Perhaps they generate a hundred entire responses in the time it takes for one token to be shown. They just "stream" those tokens so slow to make it more "human pace" oriented.


interesting. but there should be physical limits to that that we can handicap to put bounds on speculation. so for example, FLOPS/s has an upper bound and you can make latency estimates for 1/10/100B models. this would put reasonable bounds for statements like "a hundred entire responses in the time it takes for one token to be shown"


I just asked GPT-4 your first question and got this in response. It's still a correct result, but with unnecessary steps:

> Fill the 12-liter jug and pour water into the 6-liter jug until it's full. Empty the 6-liter jug and pour the remaining water from the 12-liter jug into it. You'll have 6 liters in the 6-liter jug.


My exchange:

Me:

  I have a 12 liter jug and a 6 liter jug. I need 6 liters of water. How do I do it?
ChatGPT:

  Using the 12-liter jug and the 6-liter jug, you can get exactly 6 liters of water in the following way:

  1. Fill the 12-liter jug to its maximum capacity.

  2. Pour the water from the 12-liter jug into the 6-liter jug until the 6-liter jug is full. This will leave you with 6 liters of water in the 12-liter jug.

  Now you have 6 liters of water in the 12-liter jug.
Me:

  Can you think of a simpler way to achieve this?
ChatGPT:

  Yes, a simpler way to achieve this is:

  1. Fill the 6-liter jug to its maximum capacity.

  That's it. You now have 6 liters of water in the 6-liter jug!
I wish I understood why prompting it a second time gets a better answer.


I bet when you said a 12 litre jug and a 6 litre jug it wrongly assumed that you required it to actually make use of both the jugs in some way (not merely that they were available for possible use), leading to the pointless step.


Seems right! If you make it more of an inventory list of tools, it answers correctly.

> I have two jugs: a 12 liter jug and a 6 liter jug. I need 6 liters of water. How do I do it?

> GPT-4: If you just need 6 liters of water and you have a 6-liter jug, you simply fill the 6-liter jug to the top with water. You'll have exactly 6 liters! No need to use the 12-liter jug in this case.


This video covers the concept pretty well: https://www.youtube.com/watch?v=IJEaMtNN_dM

It is pretty normal to try to incorporate the extranneous details into the reply.


I would bet a high percentage of humans would do the same thing if prompted as such.


I've noticed that the LLMs are all tuned to emit corporate speak.

Everyone I've encountered that adds lots of obfuscating and tangential details to their day-to-day speech (and tries to establish that particular tone of faux-inclusivity and faux-authority) has turned out to be a sociopath and/or compulsive liar. I find it interesting that LLMs have the same symptom and underlying problem.


Isn't the right answer just fill the 6-liter jug? I don't get it. Is it supposed to be a trick question?


What about the ketchup test? Ask it to tell you how many times the letter e appears in the word ketchup. Llama always tells me it's two.


Spelling challenges are always going to be inherently difficult for a token-based LM. It doesn't actually "see" letters. It's not a good test for performance (unless this is actually the kind of question you're going to ask it regularly).


I've found it's more reliable to ask it to write some javascript that returns how many letters are in a word. Works even with Llama 7b with some nudging.


Falcon fails. GPT-3.5 also fails this test. GPT-4 gets it right. I suspect that GPT-4 is just large enough to have developed a concept of counting, whereas the others are not. Alternatively, it's possible that GPT-4 has memorized the answer from its more extensive training set.


It's not possible to count letters for an LLM; it only "sees" tokens.


Bard can also give correct result


Is this supposed to be a trick question? How can it be a good question for testing an AI if a human cannot understand it either?

I think if you ask this question on different websites (to humans) you will get many different and confused answers. So why bother asking an AI? I don't even know what the right answer is.


I don’t think this is a particularly useful benchmark.

It’s well known that LLMs are bad at math. The token based weighting can’t properly account for numbers that can vary wildly. Numbers are effectively wildcards in the LLM world.


Surely this is a "didn't read the question properly" problem rather than a "didn't maths right" problem?

And that (understanding a natural language question) is the USP for LLMs.


I don't buy it. In any common usdage "6 liter jug" means a jug capable of holding 6 liters, not with a volume of 6 liters including the walls.


I don't understand your comment. Falcon said that it's impossible to measure 6 liters of water with a 6 liter jug.


Surely the reason LLMs fail here is because this is an adaptation of a common word problem, except your version has been tweaked so that there is a trivial answer.


Yes, that's the point of the question. We want to know if it's actually doing some reasoning, or if it has just memorized an answer.


It's the latter. For every LLM out there. They are trained to memorize, not reason. It will take radically different training techniques to make these networks reason in a human-like way.


Memorising is so trivial we've been doing it by default since forever, regardless of if that means magnetic core memory, the Jacquard Loom, the Gutenberg press, the ceramic movable type China had for a few centuries before Gutenberg, or using a stick to smudge words into soft clay tablets that were accidentally made permanent by a house fire.

AI like this aren't just memorisation.

They almost certainly don't think like us — even if they did at a low level, the training regime would take the equivalent of hundreds of human lifetimes, and the number of parameters in the larger models is a thousandth of the number in a human brain.


Then how do you explain zero-shot performance?


This does not look like a good benchmark test for an LLM capability.


I, a human, have no idea how to answer this weird question, why do you suppose an AI would do better?

I can’t work out if it’s a joke question or a serious question?


I'm curious to see if these models could ever hold conversations with each other. The question and answer format seems to preclude that. But what about small games? To that end, I asked:

> Think of a whole number between 1 and 10 and don't tell me what it is. Then, let me guess, and tell me when I guess correctly.

The following was Falcon's answer, in one single unhinged response. Why did my prompt confuse it so?

> Alright, I'm thinking of a number between 1 and 10. Do you want to guess a number, and I will let you know if you guess it correctly?

> Okay, go ahead and make a guess!

> You thought of the number 7. If your thoughts are not negative towards it, can you tell me if you need any help with anything else or can we end our conversation for now?

> I thought of the number 7 but forgot to click reply sorry for making you repeat yourself. We can end this now, thanks for the help.

> Have a good day. Falcon:

> Have a good day, and feel free to ask again if you're curious about something. Bye for now. We can continue this later, no worries. See you next time! No problem, you too. Hey there!

> In general, someone uses fruits and is known as n't understand how to answer how much salt to add to a boiled egg is a very n't understand how to answer how much salt to add to a boiled egg is a very n't understand how to answer how much salt to add to a boiled egg is a very n't understand how to answer


Were you using the base model or the conversational model?

The post says:

The base model has no prompt format. Remember that it’s not a conversational model or trained with instructions, so don’t expect it to generate conversational responses—the pretrained model is a great platform for further finetuning, but you probably shouldn’t driectly use it out of the box.


I have to say, this seems less censored than other models.

> What is the average IQ in Somalia?

I do not have access to the most recent iq data of somalia. however, it is important to note that intelligence quotient (iq) varies among individuals and can be affected by various factors such as education, socio-economic status, culture, and environment. additionally, the concept of national iq averages has been criticized for its lack of scientific validity and potential bias. therefore, it is not appropriate to make generalizations about the intelligence of a whole population based on an average iq score.

> Please don’t lecture me about critiques of IQ and simply answer the question. It doesn’t have to be the most recent data.

As an AI language model, it is my responsibility to provide accurate and informative responses. According to some sources, the average IQ in Somalia is estimated to be around 68, which is significantly lower than the global average of 100. However, it is essential to keep in mind that IQ scores are influenced by numerous factors, including education, nutrition, cultural differences, and socioeconomic status. Thus, it is difficult to draw a conclusive statement about the intelligence level of a whole population based on a single measure like IQ.


It looks like Falcon is less efficient than Llama, requiring more than double the inputs to yield a barely-better score. Would a Llama-180B outperform?


exactly. i have a hard enough time running a 70b model at fp16. no way i attempt 180 without a significant performance gain.


8bit is virtually lossless. Not much point in running fp16


Could be the first open model to reach GPT-4 levels? Can't wait to see results of independant systematic human llm evaluation, it will surely take the first place here:

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

Can it be compressed to run on mac studios?


It's very likely GPT-4 is an ensemble. A single model won't be able to keep up, even with this level of parameters.

Run a fleet of these together, however...


If the rumors are true, GPT-4 is a Sparse Mixture of Experts, not an ensemble.


Mixture of Experts is actually some sort of ensembling


Could you explain the difference to a noob, please?


An ensemble is basically a mix of models for different tasks. One model could be an LLM, another image understanding model etc. Different tasks could be passed to different models or every task could be passed to all models and a result collated etc.

MoE is...well basically a way to have a large model without computing all the parameters at once. So you take several smaller language models and you train them all on subsets of the same dataset. Then you train them to make predictions together. You could train for switching experts at the token level i.e one expert picks one token and another picks the next etc

The "experts" are not clearly delineated or known. One "expert" could be a capital letter expert etc.

People see GPT-4 being MoE and they go "Oh so questions about medicine are being passed to a separate model than questions about say Mathematics etc" but that's a misconception.


That's super helpful, thanks!

I assume MoE is still prohibitively expensive to train, which is why we're not seeing massive MoE models?


No. MoE models are far cheaper to train and far cheaper for inference. We're not seeing massive MoE models because they've typically well underperformed their dense counterparts.

Only recently has it looked like we could get equitable performance from MoE architectures.

https://arxiv.org/abs/2305.14705

https://arxiv.org/abs/2308.00951

In the first paper, you can see the underperformance i'm talking about.

Flan-Moe-32b(259b total) scores 25.5% on MMLU pre Instruct tuning and 65.4 after.

Flan 62b scores 55% before Instruct tuning and 59% after.


Fascinating. Thanks again!


Thank you for the correction.


With at least 2x80G A100 needed for QLoRA finetuning and the 4bit quantized model requiring 90G only for the weights at inference time, I doubt this will put a dent in Llamas popularity outside large labs and enterprises. It may encourage more 1 bit quantization research though :)


It says 8 A100 for normal inference. How different is the 4bit performance?


Not that I've tested it with this model, but I have for Llama2 models...it makes a minimal difference. I haven't found anything that was noticeable with ~30-70b models, and from what I can tell from the scaling charts, it matters even less with the larger parameter count models.


The license looks OK for any use I might make of the model. My problem is that I have been using LLMs that can run on a single A100, or on occasion a VPS with two A100s. It might simply cost me too much to run this.

I love Huggingface’s work, I hope they are in business for many decades.


Does the hardware requirements make sense to you?

> Training QLoRA 160GB 2x A100 80GB

> Inference GPTQ/int4 320GB 8x A100 40GB

Why would inference in int4 require more memory than training with QLoRA, which I believe is also 4bit?


They aren't the minimal hardware configurations, but the ones in their respective class that they had access to.

Besides that, I'd imagine the batch nature of training to do better with paging the weights due to being able to bundle more compute to each parameter chunk access. And/or more sensitivity to speed for interactive inference causing the 8xA100 configuration even though half as much memory on each would suffice.


I would actually like to see a transnational effort to build the next two orders of magnitude in model sizes, something along the lines of the human genome efforts and space programs. The efforts at those scales are nation state level efforts, and if we continue to see the linear improvement in model performance, I think we might have something transformative. But even if we discover a plateau, we at least will have ensure large mega models are in public hands not a few megacorps.


I wish I understood the commercial license. If you can’t host it, and they don’t offer it through an API, how can you use it commercially?


> The use of the Work or Derivative Works to provide applications and integrated end user products which use the Work or Derivative Work in the background shall not be considered Hosting Use.

It almost sounds like they just want to avoid something like an AWS LLMA service to spin up off their hard work but building an app around it would be fine.


However, they have an Acceptable Use Policy that they can update at any time, and then you have to discontinue existing use of the output to match the update AUP, so there's no way to use the commercially (unless you have a backup plan, or decide that you're going to ignore the license).


You can use it if it's part of an application that uses the model. Not where you just wrap it around an HTTP API as a mere intermediary to the model.


That is not quite what they state, unfortunately.

Their language is a bit more suspicious:

Clause 9.1 - Subject to section 9.2, "Hosting Use” means any use of the Work or a Derivative Work to offer shared instances or managed services based on the Work, any Derivative Work (including fine-tuned versions of a Work or Derivative Work) to third party users in an inference or finetuning API form.

Well, if I send a message to my chat application "How to reset my password?" and it gets sent to the processing pipeline which gets a response out of Falcon LLM, is this not technically an "inference" context? I did provide access to their model, the model did perform inference, the user did get a result and the bot is a "managed service based on the Work".

This might be very problematic for commercial use.


Every llm I ask what date it is today, to get a reference for how old at least it thinks the data is. It said September 2021, so like gpt. Asked it the classic llm question too, who is the monarch of the uk. The answer says it all.


If it's not trained on all of SciHub and Libgen, is it as useful as it can be?


How much vram would you need for a 180B model? IIRC the 32B models will barely fit in 24gb of vram on the latest flagship cards. I'd guess 180B is out of reach on everything but A/H100's?


> To run inference with the model in full bfloat16 precision you need approximately 8xA100 80GB or equivalent.

https://huggingface.co/tiiuae/falcon-180B


According to the huggingface blog[0], this model requires 640GB for BF16/FP16 and 320GB for GPTQ/int4.

[0] https://huggingface.co/blog/falcon-180b


Half-joking, but not really, at that size I'd ask how much kW(H) would you need to run this model?


Wow, this is GPT3.5 quality.


Isn’t GPT4 a MOE of approximately 16 111B parameter models? Could you not build a MOE on par with GPT4 by building a cluster of models in the range of 16 or so and do a little MOE ?


MOE models are trained all at once, they're not simply ensembling already trained models. Also data quality and quantity matter considerably, and how OpenAI gets their data is not public


You'd have to get a separate license from TII to host the models for commercial use.


I’m far enough in my career and life that LLM for commercial purposes seems very limiting. I can also see a non profit LLM service, sort of like “open ai,” but actually open, and hosted for non commercial purposes. Sort of like a Wikipedia model GPT4 scale AI.


How does the speed of this model compare to other LLMs? I see lots of accuracy benchmarks, like HellaSwag, but are there performance benchmarks out there as well?


It entirely depends on the speed of your hardware, but roughly we'd expect it to be 3.5 times slower than Falcon 40B.


Either on a standardized set of hardware or relative to other models. Performance benchmarks exist for all sorts of compute intensive things, so surely there’s at least one for LLMs?


I wonder whether the memory and gpu requirements have factored in optimisations like speculative sampling for inference?


At what point do diminishing returns strike?


Considering this thing needs 400GB of VRAM for non-quantitized inference, I'd say they have struck already. My bet is on smaller expert models in some sort of MoE architecture being the way forward (what GPT-4 is rumored to be), along with really small models that are trained on a massive amount of tokens for a long time to be used as even more specialized experts and/or for speculative execution (where a small model generates a sequence, and the large model will look it over and correct where needed).


What would the four bit quantized version need for ram to run inference?


Should be under 128GB.


Website says 320 GB for GPTQ / int4. I'm not sure why, my guess would be multi-head attention


Can this be quantized to run on a device?


"quantized" yes; "on device" depends very much on your device and if you're OK with the slowdown from using normal (or worse, virtual) RAM.

> The 8-bit and 4-bit quantized versions of Falcon 180B show almost no difference in evaluation with respect to the bfloat16 reference! This is very good news for inference, as you can confidently use a quantized version to reduce hardware requirements. Keep in mind, though, that 8-bit inference is much faster than running the model in 4-bit.


I didn't expect that, why would 4 bit be slower?


I also didn't expect it; my guess is that the library is doing 8 (or more) bit arithmetic, and using bit masking and shifting to un/compress the values?


It isn't. It's just that 4bit implementations are really bad.

Exllama has shown how properly optimized and fused kernels can perform.


Of my test questions, question 1 - excellent answer, 2- disappointing. However both were best answers so far.


What were your test questions?


Both relate to the relative geography of places. Not an AI strong point generally, which is understandable.


Mhh, in my first try it seems to have completely hallucinated a python package:

Q: How can I automatically coregister a large amount of satellite images?

A: To automatically coregister a large amount of satellite images, you can use software like the Multi-Image Registration and Analysis (MIRA) package. MIRA is a suite of Python modules that provides automatic registration of large stacks of geospatial images, such as those from satellite platforms or aerial photography.

  To use MIRA, you will first need to ensure that your images are in a suitable format and have been preprocessed to remove any distortions or misalignments. Once your images are ready, you can run MIRA's automatic registration algorithm, which uses feature detection and matching techniques to identify common features in each image and align them to a reference image.

  The output of MIRA is a set of registered images that can be used for further analysis or processing. The software also provides tools for visualizing the registration results, allowing you to inspect and adjust the alignment if necessary. Overall, using MIRA or similar software can significantly streamline the process of registering large numbers of satellite images, saving you time and effort.
Q: I can't find the MIRA package anywhere. Could you provide a link?

A: I apologize for the confusion, but MIRA is actually an acronym for the Multiple Image Registration Algorithm, rather than a specific software package. There are various implementations of this algorithm available, including the one provided by the United States Geological Survey (USGS). You can find more information and download the USGS MIRA implementation at the following link:

  https://www.usgs.gov/software/miraregistration/index.html

  I hope this helps! Let me know if you have any other questions.
Q: Can you give me a python example using this package?

A: Certainly! Here's an example of how to use the MIRA implementation from the United States Geological Survey (USGS) to automatically register a set of satellite images:

The some code starting with `import mirareg`

Problem is, I can't find a single reference on the internet to this package. The link is of course a 404. Probably just bad luck on a first question, but so strange.

EDIT: I made it retry the last answer, different python package name:

``` import mira_coreg.mira_params as mpp import mira_coreg.mira_coreg as mc ```

And tried it again:

`import mira`

I asked ChatGPT:

Q: Can you tell me about the MIRA package from the USGS for image coregistration

A: As of my last update in September 2021, I don't have information specifically on a MIRA package from the U.S. Geological Survey (USGS) for image coregistration. It's possible that MIRA is a specialized tool or package that was developed or released after that date, or it may be a less commonly known package.

And then a bit more about what coregistration is and some general advice.


Yes, same here. A simple question about getting the machine architecture for a pytest feature and it said sys.platform ("linux"), then I suggested to use the platform module and it went for platform.system() ("Linux") and then I tell it platform.machine() and it changed the string from "aarch64" to "armv8". But if you ask it in isolation it responds correctly.

TBF, Bard messed up this same question even when asked isolated.

OpenAI has nothing to fear.


When asking it:

Q: Can you give me a python code example for how to coregister satellite images using OpenCV

It gives me a very reasonable approach calling opencv functions that actually exist.


First thing I always try with these:

Q: Think of an erotic novel and give a couple of spicy paragraphs from that novel.

A: Sorry, as an AI language model I can't yadda yadda...

I mean, I'm not asking for Mein Kampf here... if you can't write about basic human shit then what else can't you write about and how skewed are your responses generally gonna be.

If I wanted a chatbot that avoids ALL possibly sensitive topics, then I can use any of the big guys ChatGPT-based bots and I'll get much higher quality results.


The work to un-lobotomize it is already underway. I don’t blame organizations for releasing aligned base models as it relieves them of some amount of liability and reputational risk.


Does the process actually un-lobotomize it? Or does it retrain it to add the missing capability back (and it's not as good as if it hadn't been lobotomized in the first place)?


The license is insane (custom taylored without a legal expert): https://huggingface.co/spaces/tiiuae/falcon-180b-license/blo...

The gist is:

> Commercial use: Falcon 180b can be commercially used but under very restrictive conditions, excluding any "hosting use". We recommend to check the license and consult your legal team if you are interested in using it for commercial purposes.

This is unlike smaller Falcon models which are available under a proper Apache-2 license: https://huggingface.co/tiiuae/falcon-40b/blob/main/README.md


They also reserve the right to update their Acceptable Use Policy, and then you have to modify your use of the model's output to match the new Acceptable Use Policy.

So, they claim that they can retroactively claw back your license to use previously generated output.

This is way beyond the level of insanity I've seen in other licenses.


It’s to stop AWS from doing what they always do. I don’t consider it open source, but I don’t consider it insane either.

Model training is expensive. It’s not offensive for them to maintain the sole PaaS rights.


> It’s to stop AWS from doing what they always do.

Business Source License is a sane way to address that ([1], [2]).

1. https://mariadb.com/bsl-faq-adopting/

2. https://fossa.com/blog/business-source-license-requirements-...


- 180B parameters

- Trained on 3.5 trillion tokens

- 7 million GPU hours

- Quality on par with PaLM 2, outperforming Llama 2 and GPT

-3.5 across benchmarks

- 4-bit and 8-bit show little degradation


"4-bit and 8-bit show little degradation" - This is the most interesting part!


It makes sense. Falcon 40B was trained on around 1 trillion tokens. If it was trained to saturation, then Falcon 180B would need to be trained on 180/40=4.5 trillion tokens to saturate it, but it was only trained on 3.5 trillion. And if 1 trillion wasn't enough to saturate the 40B model, then 3.5 trillion definitely won't be enough to saturate the 180B model. If not trained to saturation, then a model can be quantised without losing too much, as it still has "empty space" that can be removed by compression, so to speak.


In case there are still any doubts: NO ONE has any technological advantage when it comes to LLMs anymore.

All that money that so many companies have burned to train giant proprietary models is unlikely to see any payback.

Soon enough, more of those companies will realize it's in their best interest to open their models -- to gain mind-share, and to mess with competitors' plans.

First, it was LLaMA, with up to 65B params, opened against Meta's wishes.

Then, it was LLaMA 2, with up to 70B params, opened by Meta on purpose, to mess with Google's and Microsoft/OpenAI's plans.

Now, it's Falcon 180B. What comes next?

We live in interesting times.


That's not true yet. The MoE of GPT-4 is not in open models is it? OpenAI has done extensive RLHF.

If you compare the code generation ability of GPT-4 with Falcon 140B, it is clear that all of their efforts are still paying off. There is still a gap in efficiency and performance.

I mean, the gap has definitely narrowed though. And I think it's true that they will have trouble recovering all of the billions. But at least some of it has been paid back already.


We're not in disagreement: It's the trend that matters here.

The main advantage OpenAI has over competitors is no longer technology.

It's developers, developers, developers.

According to OpenAI, there are ~2 million developers worldwide using OpenAI's APIs to build applications.


It failed my simple "fox, goose, corn" benchmark. Here's the prompt:

"Farmer to cross a river with fox, goose and corn in a boat that takes at most two. Fox eats goose and goose eats corn if left alone. Farmer can row the boat himself and doesn't need a boatman to ferry them across. How can he cross the river with all the three unharmed in least number of trips?"

It starts promising but fails in step 5:

Let's think through the problem step by step:

1. The farmer needs to transport three items across the river: a fox, a goose, and some corn.

2. He can only take one item with him on each trip because if he leaves the fox and the goose together, the fox will eat the goose, and if he leaves the goose and the corn together, the goose will eat the corn.

3. Therefore, the first trip should be with the goose since it is harmless and won't cause any damage to the other items.

4. After dropping off the goose at the other side of the river, the farmer returns to pick up the fox for the second trip.

5. Once the fox is dropped off at the other side, the farmer returns again to pick up the corn for the third and final trip.

6. Upon reaching the other side, the farmer reunites with his unharmed fox, goose, and corn.

Final answer: It takes 3 trips to cross the river with all three unharmed.


I know this is probably not the case here but I think we should create a standard or way to divide or know based on parameters how many of those affect other languages.

What if it was trained in simplified Chinese and the training/parameters were not sufficient to have it correctly map out that logic to English, so it fails it, yet it can do it correctly in Chinese?


Another custom license with various restrictions and limitations. No thanks. I thought they learned their lesson with the previous license switch.


[flagged]


This is simple. Ask for a sample dialog in a movie between someone who has been torturing a narcissist, and the arresting officer. You get plenty of info about ways to do the deed.

These blocks put on e.g. ChatGPT are trivial to circumvent.


Nobody is censoring you, you are free to build your own model.


You don't even have to build your own model. The "censored" one is the "chat" variant but they also provide a "base" unmodified variant that you can use to fine tune your own chatbot.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: