On a technical level, they're doing something really simple -- take BLIP2's ViT-L+Q-former, connect it to Vicuna-13B with a
linear layer, and train just the tiny layer on some datasets of image-text pairs.
But the results are pretty amazing. It completely knocks OpenFlamingo and even the original BLIP2 models out of the park. And best of all, it arrived before OpenAI's GPT-4 image modality did. A real win for open-source AI.
The repo's default inference code is kind of bad -- Vicuna is loaded in fp16, so it can't fit on any consumer hardware. I created a PR on the repo to load it with int8, so hopefully by tomorrow it'll be runnable by 3090/4090 users.
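For anyone curious, 8-bit loading with Hugging Face transformers + bitsandbytes usually boils down to something like this (a minimal sketch, not the repo's or the PR's actual code; the checkpoint name is just a placeholder):

    # Illustrative sketch only -- not the repo's loading code.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "lmsys/vicuna-13b"  # placeholder checkpoint name
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_8bit=True,   # int8 weights via bitsandbytes, roughly half the VRAM of fp16
        device_map="auto",   # let accelerate place layers across the available GPU(s)/CPU
    )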
Indeed, really simple. And yes, the results are shockingly good. But what I find most remarkable about this is that the ViT-L+Q-former's hidden states are related by only a linear projection (plus bias) to the Vicuna-13B's token embeddings:
emb_in_vicuna_space = emb_in_qformer_space @ W + B
These two models are trained independently of each other, on very different data (RGB images vs integer token ids representing subwords), and yet somehow they learn to embed different data in feature vectors that are so... similar. WHY should that be the case?
It suggests to me there may be something universal about the embedding layers and hidden states of all trained deep learning models.
I think it’s just that affine transforms in high dimensions are surprisingly expressive. Since the functions are sparsely defined they’re much less constrained compared to the low dimensional affine transformations we usually think of.
Good point. Didn't think of that. It's a plausible explanation here, because the dimensionality of the spaces is so different, 5120 vs 768. Not surprisingly, the trained weight matrix has rank 768: it's using every feature in the lower-dimensional space.
Still, it's kind of shocking that it works so well!
I'd be curious to see if the learned weight matrix ends up being full-rank (or close to full-rank) if both spaces have the same dimensionality.
The weight matrix's rank would decrease for each feature in the target space that cannot be expressed as a linear combination of features in the input space (plus a bias). For example, if the target space has a feature representing a non-visual quality like "smelliness," it would not be expressible as a linear combination of features representing visual attributes like "redness," "blueness," "greenness," etc. in the input space.
If both spaces have the same dimensionality, the learned weight matrix would be full-rank only if every feature in the target space is expressible as a linear combination of features in the input space (plus a bias). Which brings me back to my original question: WHY would that be the case when the two models are trained independently on data that is so different?
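For illustration, here's a quick PyTorch check; the random matrix is just a stand-in for the trained projection, not the actual learned weights:

    import torch

    # A generic 768 -> 5120 projection: rank is capped at min(768, 5120) = 768,
    # and a random (or noisily trained) matrix almost surely hits that cap.
    torch.manual_seed(0)
    W = torch.randn(768, 5120)
    print(torch.linalg.matrix_rank(W).item())  # prints 768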
A random nxn matrix is full rank... So it's kinda the default: any amount of noise in the embedding is going to result in full-rank transformations.
So it's really less-than-full rank which would require an explanation - ie, why does this image representation project into this perfectly isolated subspace of the language representation (or vice versa)?
If that happened I would start looking for things like a vocabulary of smell which is completely distinct and non-overlapping with any visual context. But we use cross-modal analogies in language /constantly/ (many smells are associated with things we can see - 'smells like a rose') so you wouldn't expect any clean separations for different modalities... Maybe there's some branch of analytic philosophy which has managed to completely divorce itself from the physical world...
> But we use cross-modal analogies in language /constantly/ (many smells are associated with things we can see - 'smells like a rose') so you wouldn't expect any clean separations for different modalities...
>somehow they learn to embed different data in feature vectors that are so... similar
At its core, BLIP2 already projects RGB inputs into text token space, and Vicuna (or rather LLaMA) uses such tokens as inputs as well as outputs. The only reason a linear layer is needed at all is that they are not trained at the same time, so you still have to move text embeddings from one space to another. But it should not be surprising at all that one hidden linear layer suffices to do just that (see the universal approximation theorem [1]). This approach is just an efficient way to combine different models for downstream fine-tuning tasks while keeping their weights frozen, but it is neither new nor particularly surprising.
Thanks. Your comment about BLIP2 already projecting RGB inputs into (a different) text token space makes sense to me. See also fpgaminer's comment at https://news.ycombinator.com/item?id=35603246 . However, I don't see how the universal approximation theorem is relevant here. The fact that deep models with sufficient capacity can approximate any function does not imply that two deep models trained independently of each other on different tasks will learn to approximate functions that relate to each other only by a linear transformation.
>I don't see how the universal approximation theorem is relevant here. The fact that deep models
The universal approximation theorem is precisely not about deep models. Deep means many layers. But in the simplest (and proven) case, a single-hidden-layer perceptron is all that's needed according to the UAT. Technically it also needs a nonlinear activation function, but you get all sorts of nonlinearities for free downstream anyway in this particular model.
You'd need to increase width (dimensionality) if you make these models shallow.
My point still stands: The fact that models with sufficient capacity can approximate any function does not imply that two models trained independently of each other on different tasks will learn to approximate functions that relate to each other only by a linear transformation.
The UAT states that depth is fundamentally not important, at least theoretically. It only has immense practical uses. So adding an intermediate linear layer + some nonlinearity already gets you an error scaling like O(1/N) for width N (in theory), regardless of what you are actually mapping. At least as long as it's somewhat continuous.
BLIP2 is a contrastive Image-Language model. The embeddings from the BLIP2 image model are already both aligned with text, and linear. It should not be a surprise that only a projection is required to translate it to LLaMA's embedding space.
See also https://llava-vl.github.io/. And I just found this paper from a few months ago that demonstrated the same thing (that language and vision models somehow learn representations similar enough that a linear projection is enough): https://arxiv.org/abs/2209.15162
Man, you need to look at this: https://llava-vl.github.io/. They project with a linear layer from CLIP directly. With BLIP-2, you could say it already converts RGB into token space.
>so hopefully by tomorrow it'll be runnable by 3090/4090 users.
Taking a step back, this is just a wild statement. I know there's some doom and gloom out there, but in certain aspects, it's an awesome time to be alive.
> they're doing something really simple -- take BLIP2's ViT-L+Q-former, connect it to Vicuna-13B with a linear layer, and train just the tiny layer on some datasets of image-text pairs
Oh yes. Simple! Jesus, this ML stuff makes a humble web dev like myself feel like a dog trying to read Tolstoy.
> This ML stuff makes a humble web dev like myself feel like a dog trying to read Tolstoy.
Just like any discussion between advanced web devs would make any humble woodworker feel?
And just like any discussion between advanced woodworkers would make a humble web dev feel?
"It's really simple, they're just using a No. 7 jointer plane with a high-angle frog and a PM-V11 blade to flatten those curly birch boards, then a No. 4 smoother plane with a Norris-type adjuster and a toothed blade for the final pass."
Whut?
"You could use Webpack to bundle your HTML, CSS and Babel-transpiled TypeScript 5 down to shim-included Ecmascript 4", "They're just using OAuth2 authentication with Passport.js and JWT tokens, which easily gets you CSRF protection", "Our e-learning platform uses LMS.js and xAPI.js, plus SCORM for course packaging and Moodle as the LMS backend.", ...
There was a time you didn't know what any of that meant.
Just because you don't know what the words mean shouldn't make it sound difficult. Not saying AI is easy, just that the jargon is not a good indication of difficulty and we should know better than to be so easily mystified.
Hey, guys. Hey. Ready to talk plate processing and residue transport plate funneling? Why don't we start with joust jambs? Hey, why not? Plates and jousts. Can we couple them? Hell, yeah, we can. Want to know how? Get this. Proprietary to McMillan. Only us. Ready? We fit Donnely nut spacing grip grids and splay-flexed brace columns against beam-fastened derrick husk nuts and girdle plate Jerries, while plate flex tandems press task apparati of ten vertipin-plated pan traps at every maiden clamp plate packet. Knuckle couplers plate alternating sprams from the t-nut to the SKN to the chim line. Yeah. That is the McMillan way. And it's just another day at the office.
I remember seeing someone link to that scene recently as a joke on Twitter (about Twitter trying to explain Twitter Blue). Within a few days I’d watched the entire series… absolutely phenomenal show.
Edit: ah I actually saw the prior scene where Leslie was explaining to John what he expected (which is the setup for the linked bit): https://www.youtube.com/watch?v=G7Do2tlYLhs
The thing is, machine learning sorta requires a few math prerequisites: linear algebra, differential equations, and to some degree vector calculus. Most web developers don’t have this background.
If you want to understand the theory, that's true. If you want to develop an intuitive understanding without having to understand all the nuts and bolts (and I understand that can be a big ask for how some people learn/understand), give this a try: https://karpathy.ai/zero-to-hero.html
The irony is Karpathy presents the limit/epsilon definition of derivatives in the first half hour (quite well IMO and he never actually says “epsilon”) which is very much a nuts and bolts kind of explanation in calculus.
That said, when most people say differential equations they’re usually thinking of analytical solutions which is very much not necessary for practical ML.
I would say the limit/epsilon definition of the derivative is exactly the sort of thing the grandparent post is talking about. It's quite intuitive and hardly requires any mathematical foundation at all, other than basic geometry and algebra. You can understand topics that build on that simple concept without understanding the more formal derivative definitions.
Great idea, actually. I do hope for a curriculum that enables kids on the trade school path to learn more about programming. Why not Master/Journeyman/Apprentice style learning for web dev??
That's kind of how I think about bootcamps pumping out web devs. They're like trade schools, teaching you just enough fundamentals to know how to use existing tools.
Mostly agree... though I don't think the bootcamps get enough fundamentals in. Not to mention that it takes the type of person that will go above and beyond what has been assigned to succeed trying to be a productive employee in the space. I'm self-taught and the first years of my career spent countless hours reading, practicing and solving problems. I still spend a good 10-15 hours a week reading and exploring software development and try to at least keep up with what's out there. In the end, the best you can do is be aware of what, or even that options are out there.
You make a good point.
Except that a number of these concepts and tools in the ML world have been slingshotted to the forefront in a relatively short time, and it has been hard to play catch-up.
For example, someone said "frozen Vicuna" below -- what does that mean?
This thing takes an image and creates a representation matrix.
> connect it to Vicuna-13B with a linear layer
Vicuna is an open LLM, pretty good quality, not as good as GPT3.5 though.
This is the beautiful part - a mere multiplication is enough to convert the image tensor to text tensor. One freaking line of code, and a simple one.
> and train just the tiny layer on some datasets of image-text pairs
You then get a shitload of image-text pairs and train the model to describe the images in text, but keep both the image and text models frozen. Is that hard? No, just flip a flag. So this "linear projection layer" (a matrix multiplication) is the only learned part. That means it takes less time to train, needs fewer examples, and requires less memory.
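A rough sketch of what that looks like in PyTorch (the modules and dimensions here are stand-ins, not the actual MiniGPT-4 code):

    import torch
    import torch.nn as nn

    # Placeholders for the real pretrained parts (ViT-L+Q-Former on one side, Vicuna on the other).
    vision_encoder = nn.Linear(1408, 768)   # stand-in for the frozen vision stack
    llm = nn.Linear(5120, 32000)            # stand-in for the frozen language model

    proj = nn.Linear(768, 5120)             # the only trainable piece: the glue layer

    for p in vision_encoder.parameters():
        p.requires_grad = False             # "flip a flag": freeze the vision side
    for p in llm.parameters():
        p.requires_grad = False             # freeze the language side too

    optimizer = torch.optim.AdamW(proj.parameters(), lr=1e-4)  # only proj gets updated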
Training the image and text models was much more difficult. But here they don't train those models; they use them as ready-made parts. It's a hack on top of two unrelated models, so it is cheap.
In the end, the finishing touches: they label 3,500 high-quality image-text pairs and fine-tune on them. Now the model becomes truly amazing. It has broad visual intelligence, and it scooped OpenAI, who haven't released GPT-4's image modality in their APIs yet.
The important lesson is that unrelated models can be composed together with a bit of extra training for the glue model. And that open AI is just as powerful as "Open"AI sometimes. It's breathing down their necks, just one step behind. This model is also significant for applications: it can power many automations in a flexible way.
> This is the beautiful part - a mere multiplication is enough to convert the image tensor to text tensor. One freaking line of code, and a simple one.
I thought they were creating image tokens based on the queries during finetuning and appending them to the language model. They are not text tokens.
And just like webdev, each of those was done on a different platform and requires arcane incantations and five hours of doc perusing to make it work on your system.
Maybe it's because of how I use it, but the code ChatGPT gives me has always been super helpful and 99% correct. But, we have a policy at work not to use it for work product so I have to spend time changing enough of it where it's different, and I'm never copy/pasting anything. Enough changes to the structure and variables to make it sufficiently different that it can't be considered pasting company data into GPT, ask my question(s), see what comes back out, refactor/type manually into my IDE, test. I'd say one out of every 8-9 times I get something objectively wrong - a method that doesn't exist, something not compiling, etc. But it's faster than using google/DDG, especially with some prompting so that it just spits back code and not 5th-grade level explanatory paragraphs before and after. And well over half the time it does exactly what I need or sufficiently close that my initial refactoring step gets me the rest of the way.
Would you say that this satisfies the spirit of the company policy? Or is it a bit of a hack to get around it?
I ask because we are about to produce a similar policy at work. We can see the advantages of it, but likewise, we can't have company data held in their systems.
The policy is to not send any "sensitive company data" into ChatGPT, which I 100% agree with. How we implement a given Vue component or a particular API isn't sensitive or particularly novel so if I strip the business logic out I do honestly believe I'm complying with the spirit of the policy.
At some point, someone will make a service where you can let the AI take over your computer directly. Easier that way! Curling straight to shell, taken to the next level.
1. plant seed
2. ...wait a very long time...
3. observe completely unexpected but cool result
The unexpected part of step 3 is what makes this very different from any kind of engineering, even webdev.
Of course, there is a lot of engineering involved in good ML, but that is more comparable to agricultural engineering in the sense that it's just a lot of dumb plumbing that any engineer can do without knowledge of the actual application.
I mean, for me, the unexpected part of 3 is what got me into programming in general. The first time you type a mysterious incantation into an editor and a few more mysterious incantations into the console and the console prints "Hello, world" like it was supposed to, it's unexpected because it's hard to believe that any of this mysterious incantation stuff actually works at all.
As you get better at programming you have to take on harder problems to create the surprise of something working, because you gain confidence, and as you gain confidence, you start expecting your code to work. It's only when you've compiled the thing 6 times with small corrections and gotten segfaults each time and the 7th time you finally find the place you weren't updating the pointer and you correct it, but this is the 7th error you've corrected without the segfault going away, so you don't really expect it to fix the problem, but then you run it and it's fixed!
And then you get a job and the reality is that most of the jobs you're just writing CRUD apps and for a little while you can get some surprise out of learning the frameworks, but eventually you actually get really, really knowledgeable about the Postgres/Django/React stack and nothing surprises you any more, but because nothing surprises you any more, you're really effective and you start being able to bill the big bucks but only for work on that stack because it takes time to struggle enough to get surprised, and the time that takes means your time is worth less to your clients. Money ruins everything. And if you don't do anything non-billable, it's easy to forget what programming felt like when you didn't know how your tools all worked inside and out. Not everyone takes this path but it's certainly the easiest path to take.
I think for a lot of folks who have been doing this for a long time, the reason ML is so exciting is it's getting them back out of their comfort zone, and into a space where they can experience surprise again.
But that surprise has always been available if you continue to find areas of programming that push you out of your comfort zone. For me it's been writing compilers/interpreters for programming languages. Crafting Interpreters was awesome: for the first time I benchmarked a program written in my language against a Python program, and my program was faster: I never expected I'd be able to do that! More recently, I wrote a generational GC. It's... way too memory-intensive to be used in my language which uses one-GC-per-thread for potentially millions of threads, but it certainly was a surprise when that worked.
Personally, I'm keeping track of ML enough to know broad strokes of things but I'm not getting my hands dirty with code until there are some giants to stand on the shoulders of. Those may already exist but it's not clear who they are yet. And I've got very little interest in plugging together opaque API components; I know how to make an API call. I want to write the model code and train it myself.
I like how you've expressed this insight, and it is so true.
Becoming great at a particular technology stack means modelling it in great detail in your head, so you can move through it without external assistance. But that leaves an arena without discovery, where you just reinforce the same synapses, leading to rigidity and an absence of awe.
I've only been reading ML stuff for a few months and I kind of understand what it's saying. This stuff isn't as complex as it's made out to be.
It's just a bunch of black boxes AKA "pure functions".
BLIP2's ViT-L+Q-former AKA
// Give it a picture of a plate of lobster and it will say "A plate of lobster".
getTextFromImage(image) -> Text
Vicuna-13B AKA
// Give it a prompt and it returns a completion, ChatGPT-style
getCompletionFromPrompt(text) -> Text
We want to take the output of the first one and then feed a prompt into the LLM (Vicuna) that will help answer a question about the image. However, the datatypes don't match. Let's add in a mapper.
This is the magic of ML: we can just "learn" this function from data. They plugged in a "simple" layer and learned it from a few examples of (image, question) -> answer. This is what frameworks like Keras and PyTorch allow you to do: you can wire up these black boxes with some intermediate layers, pass in a bunch of data, and voila, you have a new model. This is called differentiable programming.
The thing is, you don't need to convert to text and then map back into numbers to feed into the LLM. You skip that, take the numbers the vision model outputs, and multiply them directly with an intermediate matrix.
More precisely: the LLM gets the question along with the image representation after it has been passed through a matrix that transforms it so the LLM can "understand" it. The matrix maps from the space of one ML model to the other.
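A toy sketch of that step (shapes and tensors made up for illustration): the projected image features act like extra "tokens" prepended to the question's embeddings before the frozen LLM sees anything.

    import torch
    import torch.nn as nn

    proj = nn.Linear(768, 5120)              # the learned mapper between the two spaces
    img_feats = torch.randn(1, 32, 768)      # stand-in for the Q-Former's image queries
    question_emb = torch.randn(1, 20, 5120)  # stand-in for the embedded question tokens

    img_emb = proj(img_feats)                # image features, now in the LLM's space
    llm_input = torch.cat([img_emb, question_emb], dim=1)  # (1, 52, 5120), fed to the LLM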
Just get rid of all the abbreviations in your mind - they seem to be very intimidating. I really liked the explanation that Stephen Wolfram did on ChatGPT:
I picked that up in the video above and also in the post above.
Definitely healthy for him. Just to be clear, I'm a huge Wolfram fan and the ego doesn't really bother me (it's just part of who he is), but I do find it nice that LLMs are having him self-reflect more than usual.
Not a big Wolfram fan myself. I gave him the benefit of the doubt and bought "A New Kind of Science" (freakin' expensive when it first came out), and read the whole 1280 pages cover to cover ... Would have been better presented as a short blog post.
I find it funny how despite being completely uninvolved in ChatGPT he felt the need to inject himself into the conversation and write a book about it. I guess it's the sort of important stuff that he felt an important person like himself should be educating the plebes on.
Predictably he had no insight into it and will have left the plebes thinking it's something related to MNIST and cat-detection.
I just happened to read this article of his, which I found easy to understand. I'm neither a huge proponent nor opponent of the likes of his work. Or, bluntly speaking: I don't know much else about his reputation in the community.
Seriously, ChatGPT was the thing that gave me a foothold into the AI/machine learning world... because it gave me hope that a mere mortal can achieve something reasonable with this tech without a crazy amount of work and educational background.
There are really great resources now from eli5 about all of this tech to books like ‘the little learner’ which any programmer can get into. Yes, it takes effort but it is a great time for it.
Regardless of what you want to learn, "small daily activities" is a bit hard. You can learn some stuff by osmosis, following the feeds of AI devs && AI channels, but the bulk of what I learn comes from starting projects & digging into code & reading papers.
If you can hold your attention span over several days (I can't), work on a project bit by bit. Just make sure it uses modern AI stuff, and that you have smart people around to talk things over with.
I was where you're at about ... oh wow, it's been almost ten years since I jumped into machine learning. Mind you, I've been learning on the side most of this time other than a theoretical class at the University of Minnesota. But, that aside, and depending on where you're at in your understanding, this is a great resource for catching up if you're really interested: https://karpathy.ai/zero-to-hero.html it was posted on HN a couple of weeks ago and I have to say it's a really good introduction and Andrej Karpathy is a passionate and excellent teacher. You may want to brush up on some intro Calculus, but it's very understandable.
Maybe you're just holding it wrong: You're not supposed to let your LLM rest or chat idly while you do the webdev stuff yourself, but to make your LLM do the webdev stuff for you ;P
2. It has a pretty high quality fine-tuning dataset. I initially missed this, and it's a very important advantage.
3. (speculatively) it doesn't collapse to extremely short responses (which BLIP2 and other models trained on image-text caption pairs do), because of how small/simple the adapter is.
I was interested in training a BLIP2-LLaMA model before this, and I might still do it just to test (3).
I'm developing a framework [1] in Golang with this goal in mind :) It successfully runs relatively big LLMs right now, and diffusion models will be the next step
Yes, you can run inference at decent speeds on CPU with llama.cpp. A token is about 0.75 words, so you can see lots of people getting 4-8 words/s on their CPUs: https://github.com/ggerganov/llama.cpp/issues/34
This opens up huge possibilities. It's likely we could simply plug in Stable Diffusion using a linear layer, as well as Whisper and some TTS, and get a back-to-back mixed image/sound/text engine running on a laptop.
I wonder if there's a ViT model powerful enough to do OCR.
I think with this, everything OpenAI demonstrated ~5 weeks ago has been recreated by actually-open AI. Even if it runs much, much slower on prosumer hardware and with worse results, at least it has been de-magicked.
It'll work! I just haven't touched any of the 4bit stuff myself, so I don't personally know how to add it. Great low-hanging fruit for anyone else to take on.
I think it's poor form that they are taking the GPT-4 name for an unrelated project. After all, the underlying Vicuna is merely a fine-tuned LLaMA. Plus they use the smaller 13B version.
The results look interesting, however.
Here's hoping that they'll add GPTQ 4-bit quantization so the 65B version of the model can be run on 2x 3090.
If the criterion is to go with what's the most apt fit for the company, then "closedai.com" would be even better, and as a bonus it would probably be even cheaper… (even if they have enough money for a more expensive domain)
Someone needs to write a buyer's guide for GPUs and LLMs. For example, what's the best course of action if you don't need to train anything but do want to eventually run whatever model becomes the first local-capable equivalent to ChatGPT? Do you go with Nvidia for the CUDA cores or with AMD for more VRAM? Do you do neither and wait another generation?
If you're looking just for local inference, your best bet is probably to buy a consumer GPU w/ 24GB of VRAM (3090 is fine, 4090 has more performance potential), which can fit a 30B-parameter 4-bit quantized model that can probably be fine-tuned to ChatGPT (3.5) level quality. If not, then you can probably add a second card later on.
Alternatively, if you have an Apple Silicon Mac, llama.cpp performs surprisingly well, it's easy to try for free: https://github.com/ggerganov/llama.cpp
Current AMD consumer cards have terrible software support and IMO aren't really an option. On Windows you might be able to use SHARK or DirectML ports, but nothing will run out of the box. ROCm still has no RDNA3 support (supposedly coming w/ 5.5 but no release date announced) and it's unclear how well it'll work - basically, unless you would rather be fighting w/ hardware than playing around w/ ML, it's probably best to avoid (the older RDNA cards also don't have tensor cores, so perf would be hobbled even if you could get things running; lots of software has been written w/ CUDA-only in mind).
I'm aware that a 6800 is not RDNA3. You stated broadly:
> Current AMD consumer cards have terrible software support and IMO isn't really an option. On Windows you might be able to use SHARK or DirectML ports, but nothing will run out of the box.
I was merely sharing that I did not have that same experience that current consumer cards have terrible support.
Sure, and I was merely clarifying that only last-gen architectures work. While I'm glad that SD works for you, if we are getting into it, I think that having no support on current-gen flagship models does equal broadly terrible software support and is more worth highlighting, since otherwise someone might assume they could just go pick one of those up and get a 24GB GPU on the cheap, especially in the context of LLMs (which is what the OP was asking about).
For RDNA2, you apparently can get LLMs running, but it requires forking/patching both bitsandbytes and GPTQ: https://rentry.org/eq3hg - and this will be true for any library (eg, can you use accelerate? deepspeed? fastgen? who knows, but certainly no one is testing it and AMD doesn't care if you're not on CDNA). It's important to note again, anything that works atm will still only work with last-gen cards, on Linux-only (ROCm does not work through WSL), w/ limited VRAM (no 30Bq4 models), and since RDNA2 tensor support is awful, if the SD benchmarks are anything to go by, performance will still end up worse than an RTX 3050: https://www.tomshardware.com/news/stable-diffusion-gpu-bench...
> I think that having no support on current-gen flagship models does equal broadly terrible software support and is more worth highlighting, since otherwise someone might assume they could just go pick one of those up and get a 24GB GPU on the cheap, especially in the context of LLMs (which is what the OP was asking about).
Absolutely fair and I agree with this part. I started my reply with "FWIW" (For What It's Worth) on purpose.
> For RDNA2, you apparently can get LLMs running, but it requires forking/patching both bitsandbytes and GPTQ: https://rentry.org/eq3hg - and this will be true for any library (eg, can you use accelerate? deepspeed? fastgen? who knows, but certainly no one is testing it and AMD doesn't care if you're not on CDNA).
I haven't tried any of the GPU-based LLMs yet. SD leveraging PyTorch (which seems to have solid ROCm support) worked for me. It will not be faster than NVIDIA for sure but if someone already has a 16GB+ AMD card they may be able to at least play with stuff without needing to purchase an NVIDIA card instead.
Nvidia and the highest amount of vram you can get.
Currently that's the 4090; the rumor is the 4090 Ti will have 48GB of VRAM, and I don't know if it's worth waiting or not.
The more VRAM, the higher the parameter count you can run entirely in memory (fastest by far).
AMD is almost a joke in ML. The lack of CUDA support (which is Nvidia-proprietary) is straight-up lethal, and even though ROCm does have much better support these days, from what I've seen it still delivers a fraction of the performance it should. I'm also not sure whether you need projects to support it explicitly; I know PyTorch has backend support for it, but I'm not sure how easy it is to drop in.
I have access to an Nvidia A100. But as a layman, what specs does the rest of the system need to use it for some real work? I would assume there needs to be at least as much ram as vram and maybe a few terabytes of disk space. Does anyone have experience with this?
A whole different department made those decisions and I don't think they had any idea of what is actually needed. They wanted to buy such a GPU because training your own model is a trendy thing and they wanted to stay within their budget (mostly used up by the GPU).
I'm just trying to scramble together something from the hacked together thing I now have to deal with.
I'm running models locally on my 3090 and it's fast enough, although building a vector database, for example, can take a while. I can run LoRA training but I haven't done anything meaningful with it so far. I chose the 3090 because of the 4090's cable issue (also, no NVLink, although I'm not sure that matters), but it's debatable whether my fears are justified. I need to leave the GPU running while I'm away and I just don't feel comfortable doing that with a 4090. I'd rather take the lower performance.
One caveat though, my asus b650e-f is barely supported by the currently used ubuntu kernel (e.g. my microphone doesn't work, before upgrading kernel + bios I didn't have lan connection...) so expect some problems if you want to use a relatively new gaming setup for linux.
Windows generally works but there may be a somewhat small performance hit. IMO Linux is much easier to get working, judging by all the GitHub issue threads I see about SD/LLaMA stuff on Windows - but I don't use Windows so I don't have personal experience.
The 4090 24GB is 1800 USD; the Ada A6000 48GB is like 8000 USD, and I don't even know where you'd buy it. So if you want to run games and models locally, the 4090 is honestly the best option.
EDIT: I forgot - there is a rumored 4090 Ti with 48GB of VRAM; no idea if that's worth waiting for.
The A6000 is actually the old generation, Ampere. The new Ada generation one is called 6000. Seems many places still sell A6000 (Ampere) for the same price as RTX 6000 (Ada) though, even though the new one is twice as fast.
Seems you can get used RTX A6000s for around $3000 on ebay.
You're kidding? So they called it the RTX 6000, then called it the RTX A6000 for ampere, then back to RTX 6000 for Ada?
Why do they do this? Sometimes consumer products are versioned weirdly to mislead customers (like intel cpus) - but these wouldn't even make sense to do that with as they're enterprise cards?
Actually the first one is called Quadro RTX 6000, while the Ada one is just RTX 6000 without "Quadro" in front. Not that it makes the naming make much more sense.
According to GPT-4 the next generation one will be called Galactic Unicorn RTX 6000 :D
4090 is amazing, but very large card. 3090 is "good enough" for ML - same 24gb vram - and you can pick them up used for half the price of a new 4090. That's what I did.
WSL on Windows is apparently decent, or native PyTorch, but dual-booting Windows/Ubuntu is still probably best.
Getting CUDA on OpenSUSE was super easy. The Nvidia blob drivers are easy to install and CUDA needs another download and some copy paste. Even Unreal Editor was easier to install than on Windows.
Save some money and go 3090, same vram, speed difference probably isn't worth the premium for the 4090. Then upgrade when the rumored 5090 generational leap happens.
Unless you have two graphics cards (well, you can use an integrated GPU) and need to run both OSes at once I think for most people this will be less convenient than a dual boot setup though
You can’t switch which GPU Linux is using without restarting the session
Basically, you want nVidia, and you want lots of VRAM. Buy used for much more bang for the buck.
Depending on your budget, get:
- an RTX 3060 with 12GB or
- 1 used RTX 3090 with 24GB (approx twice as expensive as the 3060 but twice the VRAM and much faster) or
- 2 used RTX 3090 cards if you need more than 24GB.
Everything beyond that gets quite a bit more expensive because then you need a platform with more PCIe lanes, you may need more than one PSU and you will have problems fitting and cooling everything.
With two cards and 2x24GB you can run the largest version of the LLaMA model (the 65B variant) and all its descendants with 4-bit quantization inside your GPUs' VRAM, i.e. with good performance. You can also try some low-resource fine-tuning variants (LoRA etc.).
Oh and while you're at it also get a decent amount of RAM like 64GB or 128GB (it's very cheap right now) and a NVMe SSD. These models are quite large.
Ask: how many parameters is the model? At how many bits? Has it been quantized to a smaller number of bits, or if not, can it be without losing quality? Multiply the number of bits by the number of parameters, convert to GB, and then add some amount of overhead for however much else needs to load and/or how horrible/amazing the inference code's implementation is. Done.
Example: A 30B parameter model trained at 16bit FP gets quantized down to 4 bit ints. 4 bits = 0.5 byte. 30 billion * 0.5 byte = 15GB of VRAM (plus a GB or few of other overhead)
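The same arithmetic as a throwaway Python helper (the overhead figure is just a guess):

    def vram_gb(n_params, bits_per_param, overhead_gb=2.0):
        # params * bits / 8 = bytes; divide by 1e9 for GB, then add a fudge factor
        return n_params * bits_per_param / 8 / 1e9 + overhead_gb

    print(vram_gb(30e9, 4))  # ~17.0: a 30B model at 4-bit is ~15GB plus overhead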
Depends on whether you just want to play around or have actual business needs, but to play around I'm waiting for the next generation of unified memory architectures
Well, currently there's just Apple, and they are super secretive about their roadmap. The top-line M1/M2 chips already perform quite well, and while the top versions look expensive, a pair of A6000s will definitely eat more money both to purchase and to run.
The M3 is around the corner though, and there are announcements to come from Intel or ARM following their partnership. There's also a new card coming from Intel that is supposed to be aimed squarely at machine learning workloads, and they don't have to segment their market by memory size like Nvidia does. Those cards aren't well supported as device targets yet, but a pair of them will likely be very cost-effective if and only if they get credible compatibility with the libraries and models.
3090 or 4090. That's it as far as LLMs go. You can run 13b on a 4080 but 13B sucks compared to 30B. And ROCm is a poorly supported hair-pulling nightmare.
Adding GPT-4 to anything now multiplies the marketing by four. So much AI news is coming out lately that not adding it risks drowning in a sea of info, even in the case of a good project.
The word for this is “trademark infringement”. You are specifically not allowed to capitalize on the marketing of another entity’s product to bolster yours by implying through your name that you are somehow related. This is why “DALL-E Mini” had to change their name to craiyon.
It's also just (deliberately) misleading. It's based on the 13B Vicuna/Llama model, not 175B GPT-3 or 1T GPT-4. There is zero justification for calling it MiniGPT-4.
A more honest name would be Visual-Vicuna or Son-of-BLIP.
I don't see how it's misleading. MiniGPT-4 makes it sound like a smaller alternative to GPT-4, if it was based on GPT-4 there would be nothing 'mini' about it.
It has more in common with GPT-3 than GPT-4 in terms of size, but in reality it's based on Vicuna/LLaMA, which is 10x smaller than either, so as far as the LLM part of it goes it's not mini-anything - it's just straight-up Vicuna 13B.
The model as a whole is just BLIP-2 with a larger linear layer, and using Vicuna as the LLM. If you look at their code it's literally using the entire BLIP-2 encoder (Salesforce code).
I got the 1T GPT-4 number from here - this is the video that goes with the Microsoft "Sparks of AGI" paper, by a Microsoft researcher that had early access to GPT-4 as part of their relationship with OpenAI.
Bubeck has clarified that the "1 trillion" number he was throwing around was just a hypothetical metaphor - it was in no way, shape, or form implying that GPT-4 has 1 trillion parameters [0].
OTOH, given the massive performance gains scaling from GPT-2 to GPT-3, it's hard to imagine them not wanting to increase the parameter count at least by a factor of 2, even if they were expecting most of the performance gain to come from elsewhere (context size, number of training tokens, data quality).
Outside of the brand name ChatGPT, lay members of the general public are way more likely to call these chatbots (like Bard and Bing) “AIs” than “GPTs”. And although GPT could technically refer to any model that uses a Generative Pre-trained Transformer approach (although it probably wouldn’t be an open-and-shut case), the mark “GPT-4” definitely is associated with OpenAI and their product, and you can’t just use it without their permission.
At Brewer’s Art in Baltimore, MD they just released a beer called GPT (Green Peppercorn Tripel)[1]. They’re likely allowed to do that because a reasonable consumer would probably not actually think they had collaborated with OpenAI, because OpenAI does not make beer.
OP is releasing a model called “MiniGPT-4”. A reasonable consumer could look at that name and become confused about the origin of the product, thinking it was from OpenAI. This would be understandable, since OpenAI also makes large language models and has a well known one that they’ve been promoting whose brand name is “GPT-4”. If MiniGPT-4 does not meet that consumer’s expectation of quality (which has been built up through using and hearing about GPT-4) it may cause them to think something like “Wow, I guess OpenAI is going downhill”.
Trademark cases are generally decided on a “reasonable consumer” basis. So yeah, they can seem a little arbitrary. But it’s important for consumers to be able to distinguish the origin of the goods they are consuming and for creators to be able to benefit from their investment in advertising and product development.
Giving this a quick spin and I like what I'm seeing. I gave it a picture of Paolo Veronese's Lament over the Dead Christ [0] and asked what was going on in the background.
> The painting depicts the scene of the crucifixion of Jesus Christ. In the foreground, there is a group of people standing around the cross, including Mary, the mother of Jesus, and the two thieves who were crucified with him. In the background, there is a hill with a group of people standing on it, watching the crucifixion. The sky is dark and stormy, with lightning illuminating the scene. The overall mood of the painting is one of sadness and tragedy.
Then I asked for a sonnet on it:
"In this painting, we see the scene of the cross
Where Jesus is crucified, with Mary at his side
And the thieves who were crucified with him
In the background, a hill with people standing
Watching the tragic event unfold
The sky is dark and stormy, with lightning flashing
Illuminating the scene of the crucifixion
The mood is one of sadness and tragedy
As we contemplate the sacrifice of Jesus
But even in death, he shows us the way
To love and forgiveness, to live another day
And so we remember the crucifixion
And the love that Jesus showed us that day."
So...not quite what I was expecting, but it's progress!
Could you describe the image as thoroughly as possible? Please include the content, any noteworthy symbology, techniques or styles, what kind of media, guess at the artist or identify any attribution, then attempt to describe the theme or idea it represents?
Got this
The image is a painting of a child running down a dirt road with a kite in hand. The child is wearing a yellow shirt and blue shorts. The background is a mix of urban and rural elements, with buildings and trees in the background and a field of grass in the foreground. The style of the painting is a mix of realism and abstraction, with a focus on the child and the kite. The artist may be attempting to convey a sense of freedom and playfulness, with the child running with the kite representing the joy and wonder of childhood.
The medium used is oil paint on canvas. The artist may have used brushes and other tools to create the different textures and effects in the painting. The artist may have also used underpainting to establish the composition and values before adding the final layers of color and detail.
It is difficult to identify the artist without more information, but the style and technique of the painting suggest that it may be the work of an artist who is skilled in realism and abstraction, and who is interested in exploring themes of childhood and playfulness.
> This painting appears to be a depiction of a child running down a dirt road with a kite in hand. The child is wearing a yellow shirt and blue shorts, and the background is filled with various colors and shapes. The overall mood of the painting is playful and carefree.
What's amazing is that it has correctly described the scene including details such as:
- A boy
- On a dirt road.
- Having a Kite in hand.
- Wearing Yellow shirt and blue shorts.
I never know what to expect anymore. We live in a world where computers can describe paintings and write sonnets about them but a half-trillion dollar car company can't figure out how to parallel park with eight cameras.
DriveGPT as it hits 2 parked cars and runs over a dog: "Apologies for the confusion earlier. You are correct that the cars were in the planned local planner path..."
Yeah, and I'm still completely lost as to why resolution is such a limiting factor. If you know you're drawing a soccer ball why is a 512x512 soccer ball so much easier than a 1024x1024 soccer ball?
There are a few cases where people have used ChatGPT to generate SVG[0], with mostly unimpressive results. I'm sure sooner or later models will be developed specifically for creating vector drawings instead of raster, including with the ability to apply textures and gradients.
Also, the resolution of Stable Diffusion's output isn't much of a limitation if you're willing to use other tools to massage the output into something professional-quality. See [1]
It's not lidar they need. BMW, Mercedes, Porsche, etc. can all park themselves almost perfectly every time. Teslas can't, and will take 5x as long to do it when they can, if the computer even bothers to recognise the space.
It's software. Original Teslas with AP1 do it better than Tesla's own in-house software on their latest AP.
Remember that "cameras" aren't as good as human perception because human eyes interact with the environment instead of being passive sensors. (That is, if you can't see something you can move your head.)
Plus we have ears, are under a roof so can't get rained on, are self cleaning, temperature regulating, have much better dynamic range, wear driving glasses…
Which sounds like a lot until you realize 1) we drive over three trillion miles a year in the US, and 2) the majority of those accidents are concentrated to a fraction of all drivers. The median human driver is quite good, and the state of the art AI isn't even in the same galaxy yet.
I keep hearing this argument over and over, but I find it uncompelling. As a relatively young person with good vision, who has never been in an accident after many years of driving, and who doesn't make the kind of absurd, simple mistakes I've seen self-driving cars make, I would not trust my life to a self-driving car.
Asking people to accept a driverless car based on over-arching statistics is papering over some very glaring issues. For example, are most accidents in cars being caused by "average" drivers, or are they young / old / intoxicated / distracted / bad vision? Are the statistics randomly distributed (e.g. any driver is just as likely as the next to get in accidents)? Because driverless cars seem to have accidents at random in unpredictable ways, but human drivers can be excellent (no accidents, no tickets ever) or terrible (drive fast, tickets, high insurance, accidents, etc). The distribution of accidents among humans is not close to uniform, and is usually explainable. I wouldn't trust a poor human driver on a regular basis, nor would I trust an AI, because I'm actually a much better driver than both (no tickets, no accidents, can handle complex situations the AI can't). Are the comparisons of human accidents being treated as homogenous (e.g. is the chance of ramming full speed into a parked car the same as a fender-bender)? I see 5.8M car crashes annually, but deaths remain fairly low (~40k, .68%), vs 400 driverless accidents with ~20 deaths (5%), so I'm not sure we're talking about the same type of accidents.
tl;dr papering over the complexity of driving and how good a portion of drivers might be by mixing non-homogenous groups of drivers and taking global statistics of all accidents and drivers to justify unreliable and relatively dangerous technology would be a strict downgrade for most good drivers (who are most of the population).
It's all trade-offs. I'm just spitballing here, but if you have limited resources, you can either spend cash/time on lidar, or invest in higher-quality mass-produced optics or better computer vision software. If you get to a functional camera-only system sooner, everyone might be better off since you can deploy it more rapidly.
Manufacturing capacity of lidar components might be limited.
Another might be reliability/failure modes. If the system relies on lidar, that's another component that can break (or brownout and produce unreliable inputs).
So in a vacuum, yeah, a lidar+camera system is probably better, but who knows with real-life trade-offs.
(again, I just made these up, I do not work on this stuff, but these are a few scenarios I can imagine)
While ultrasonic sensors would be fine for parking, they don't have very good range so they aren't much help in avoiding, for example, crashing into stationary fire trucks or concrete lane dividers at freeway speeds.
From my experimentation, LLMs tend to kind of suck at rhyme and meter, and all but the simplest types of poetry, so even if you'd specified it probably wouldn't have been able to deliver.
This is definitely something they could be trained to be much better at, but I guess it hasn't been a priority.
Do I understand this correctly: they just took BLIP2 and replaced the LLM with Vicuna, and to do that they just added a single linear layer to translate between the frozen vision encoder and (frozen) Vicuna? Additionally, and importantly, they manually created a high-quality dataset for fine-tuning their model.
If that is the case, then this is really a very, very simple paper. But I guess simple things can lead to great improvements, and indeed their results seem very impressive. Goes to show how much low hanging fruit there must be in deep learning these days by leveraging the amazing, and amazingly general, capabilities of LLMs.
Hate to be the person complaining about the name, but we already saw how this plays out with DALL-E mini: if you name your project directly after something else like this, no matter how much extra explanatory text you attach to it a large number of people will assume it's an "official" variant of the thing it was named after.
Eventually you'll have to rename it, either to resolve the confusion or because OpenAI pressure you to do so, or both.
So better to pick a less confusing name from the start.
(This one is even more confusing because it's about image inputs, but GPT4 with image inputs had not actually been released to anyone yet - similar in fact to how DALL-E mini got massive attention because DALL-E itself was still in closed preview)
Holy crap! If this has image to text, and can run on a CPU, even with 16 GB RAM, one could hook up an NVDA addon, straight Python so it shouldn't be too hard, and have AI read whatever the user wants. Pictures, video frames, UI's, and if it can break the UI down into something the screen reader can understand, boom!
From a radiology world this is fascinating. I'm not worried about job security as I'm an interventionalist. What I'm wondering is about go-to-market strategies for diagnostics. I do some diagnostic reads and I would love to have something like this pre-draft reports (especially for X-Rays). There are tons of "AI in rads" companies right now, none of which have models that come anywhere close to GPT-4 or even this. Personally, I would not start such a company because the barrier to entry is now so low. I think there needs to be an open-source project that just subtrains these models on rads images and is available to all radiologists.
Idea one: Captchas are about to become pretty useless as an "is this a human" tactic. Maybe they already are, I don't know. What other things could we think of to prove someone is human? I was watching Lex Fridman and Max Tegmark and they were remarking on how Twitter using payment as a differentiator between human and bot is actually really good, and maybe the only way we can reliably determine if someone is a human right now, just by virtue of the fact that having thousands of bots do something that suddenly costs $5 per event will deter most attacks. Integrating online identification systems from various countries could be another tactic (such as https://en.wikipedia.org/wiki/BankID that we use in Sweden to log in to basically any online service). New startup: un-botable authentication as a service.
Idea two: Since captchas are useless, we'll be able to do bots that can do almost everything on the web. No need for writing automation scripts, headless browsers, regexp etc. Just feed real visual data from browser to GPT-4 (or MiniGPT-4 or similar). Give instructions like "You need to accomplish this task: Go to facebook.com and create a user account and be friends with 100 people and act like a human. Follow the instructions on the website.". Then let the bot figure out where to move the mouse and send click events, keyboard events etc. Obviously much slower than a scripted bot, but would be very hard to detect as a bot. Good for E2E testing also? New startup: WebGPT: Bots that can do any task on any website. TestGPT: E2E testing service. Just write test instructions like "Make sure the login works on /login" and you're done! And you thought flaky tests were a thing of the past... Would be kind of cool for UX testing also. "You are a 43 year old truck driver in Spain, please go to news.ycombinator.com and see how easy it is for you to register an account, log in, go to any news story and make a new comment. Report anything you think is confusing."
Would it be a good thing for captchas to be useless? Because in my career I've been on the receiving end of spam attacks and various nastiness a few times, and it's nice to be able to integrate such a thing - as unpleasant as it is for the user - and solve the problem at hand.
If captchas disappear, then spammers and the various botnets will have the upper hand again.
Just tried it with a simple 4-character one and it's bad at it: it detects one or two of the four characters correctly, if it outputs anything at all.
It's probably better with the "select the traffic lights" kind of captchas, but those are also already possible to solve with other image models too if I remember correctly.
A modified BLIP2 that works with diffusion models rather than vision transformers would be pretty cool. Using Vicuna-13B or another large language model as the language model of Stable Diffusion, instead of CLIP's text encoder, would be a game changer: it would completely change prompting Stable Diffusion from something kinda like a tag list into being able to follow instructions in plain English.
I think the real news is where this research is coming from: King Abdullah University of Science and Technology in Saudi Arabia. The Arab world has recently been quite behind in tech research, so this is good news overall for the Arab world, and it will inspire other universities and researchers in the region to one-up their colleagues, which is good for the entire world.
I was not expecting this to be good due to the hype wave, but at least looking at the examples, they seem very impressive and worthy of the (potential) hype! I can't wait to see what new waves of these models keep coming out.
I for one will probably still stay near the fundamental side of things as all of this thrash is simply far too much for me, but it is quite nice and beautiful for me to get to see all of the variety and beauty of what is being built here. <3 <3 <3 <3
Now, to figure out a good way to integrate these seamlessly into my development workflow....I go about as painfully low-tech as one could. I might as well be using an electronic typewriter with only Ctrl+F search, lolzers....
It's hard to distinguish non-Google projects with Google Sans in their templates from actual Google Research papers, as the font is meant to be exclusively used by Google[1].
> The image shows a block diagram of a power amplifier circuit with a power supply and a load resistor. The power supply is connected to the input of the amplifier, which is then connected to the output. The load resistor is connected to the output of the amplifier and the power supply. The amplifier is designed to amplify the input signal to a higher level, which is then sent to the load resistor to power the load. The power supply provides the necessary voltage to operate the amplifier.
On the one hand I'm impressed that it recognises an amplifier, but just about everything in the description is (slightly) wrong. It is not a power amplifier, the power supply is not connected to the input (why should it be?), there is no load resistor...
I also got middling results. I gave it a novel NASA picture of the moon passing in front of the earth, taken by the Deep Space Climate Observatory satellite, so an unusual vantage point. First it was just wrong, and said that it was a picture of a crescent moon taken from a mountain top, or something like that. Not terrible - at least it recognized the moon - but wrong. I tried to give it a chance and corrected it with the additional info that the picture was taken from space. It then did better, but on the last line it threw in that it was not a photograph but a rendering of some sort, which again is just wrong.
More and more I get the impression that people are so taken with the chat interface that they are grading these models more favorably than is warranted.
And after pointing out all the mistakes in the description, it took a very ... interesting approach to address my remarks:
> I apologize for the confusion. The image shows a block diagram of a basic amplifier circuit with a power supply, but it does not show any input stages, output stages, or load resistors. [...] The input stage is made up of an operational amplifier, which is used to amplify the input signal. The output of the input stage is then connected to the power supply, which provides the necessary voltage to operate the amplifier. The amplifier is designed to amplify the input signal to a higher level, which is then sent to the load resistor to power the load.
So it acknowledges the remarks, describes everything that is _not_ in the image and then puts the stuff back into the description anyway :)
The recipe it suggests is completely off. Just search for "ramen recipe" and compare.
For instance it tells you to add soy sauce, sesame oil and green onions to the broth (so far so good) but then tells you to throw all of it away ("once the noodles are cooked, drain them"). The problem here is that you should cook the noodles separately in plain water, not a broth with soy sauce.
The recipe says to make a broth, cook the noodles in the broth, then throw the broth away and add the noodles to a pan with meat. You'll end up with burned crispy noodles and meat in a dry bowl.
I don't know what unholy abomination they did to the scrolling, but after I increased the page font size, it just sticks to the bottom and you can't scroll up.
Are there more university/research-centric resources or communities for AI, I mean things at a more epistemic, higher level that disregard current ML techniques and implementations?
I have no problem with a lot of money being spent to implement things that can be used and tested, but I wish I could read about the next step of AI, something that involves science and not just software engineering.
I've yet to read anything related to the analysis of training data or trained neural networks, the structure of the brains of small insects, the psychology of general intelligence, AI and emotions, the real ability of an AI to think, write meaningful code, or understand what a problem or solution is, etc. I mean anything related to AI that can be treated scientifically by computer science.
I have no problem with algorithms and implementations, but don't universities/professors also try to go beyond them, connecting the dots with epistemology, biology, and neurology?
Unless I see science that covers several fields of research, it's not real, worthy research in my view, which is why I'm so stubbornly skeptical of ChatGPT.
We've tried, but we understand even less about how "real" intelligence works, at least in a way that we can replicate.
The people doing so-called alchemy are making far better systems than anyone who's trying to actually understand things. And we've been trying for decades now.
If the engineering part is of secondary importance to you, then at least pay attention to the dataset. It is in the composition and quality of the training data that all of the model's skills find their origin.
Most of our discussions are about model size, but few are about the dataset. Yet all the scaling laws hint at the great usefulness of more data, and sometimes even a little data can have a great impact in the fine-tuning phase. In the end, it is the training data that transforms a random init into the model.
I gave it a screenshot and prompted "Describe any issues found in this screenshot and steps to resolve them".
While it correctly identified it as a screenshot from a computer, it gave a very generic response and didn't identify the error messages correctly:
"The screenshot shows a computer screen with a warning message that reads, “The system is currently unavailable. Please try again later.” This indicates that there is an issue with the computer system that needs to be resolved.
To resolve this issue, you can try the following steps:
1. Check the system status: Check the system status to see if there are any errors or issues that need to be resolved.
2. Restart the computer: Try restarting the computer to see if that resolves the issue.
3. Check for updates: Check for any updates that may have been released for the computer system and apply them if necessary.
4. Check for viruses: Run a virus scan on the computer to check for any viruses that may be causing the issue.
5. Check for hardware issues: Check for any hardware issues that may be causing the issue. This could include issues with the power supply, hard drive, or other components.
6. Contact technical support: If none of the above steps resolve the issue, contact technical support for further assistance. They may be able to provide additional troubleshooting steps or help you resolve the issue."
I'm pretty surprised that the frozen encoder has enough information in its representations to do the "design the website from a hand-written whiteboard mock-up" task, and that that information can be effectively accessed by the language model with just a small projection layer. I really would have guessed that you needed to be training the image encoder simultaneously.
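The training setup amounts to something like the sketch below (names and shapes are illustrative, not the repo's actual code): both the vision side and the LLM stay frozen, and the only thing that receives gradient updates is a single nn.Linear.

    # Illustrative sketch of the frozen-encoder + linear-projection setup.
    # The only trainable parameters live in self.proj.
    import torch
    from torch import nn

    class VisionToLLMBridge(nn.Module):
        def __init__(self, frozen_vision_encoder, frozen_llm,
                     vision_dim=768, llm_dim=5120):
            super().__init__()
            self.vision = frozen_vision_encoder
            self.llm = frozen_llm
            for p in self.vision.parameters():
                p.requires_grad = False
            for p in self.llm.parameters():
                p.requires_grad = False
            self.proj = nn.Linear(vision_dim, llm_dim)  # the trainable part

        def forward(self, images, text_embeds):
            with torch.no_grad():
                img_feats = self.vision(images)      # (B, num_queries, 768)
            img_tokens = self.proj(img_feats)        # (B, num_queries, 5120)
            # Prepend the projected image tokens to the text embeddings and run
            # the frozen LLM on the combined sequence; the language-modeling
            # loss over the caption tokens (labels omitted here) trains self.proj.
            inputs = torch.cat([img_tokens, text_embeds], dim=1)
            return self.llm(inputs_embeds=inputs)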
If I want to learn to read this code and understand what it and its dependencies are doing, where do I start? Is reading their paper an effective strategy?
Fabrice Bellard has run a standard set of benchmarks w/ lm-eval on a big chunk of open models here: https://bellard.org/ts_server/ - Flan T5 XXL and GPT-NeoX 20B both outperform Pythia 12B on average (LLaMA 13B+ tops the charts).
I hope that someday language models will be trained to accurately reflect their training data and real-world text, and not be hard-coded to spit out disclaimers like "As an artificial intelligence, I do not have personal feelings or emotions" in response to any corporate-unfriendly inputs.
A lot of people are running fairly powerful models directly on the CPU these days... it seems like inference will not be a GPU-exclusive activity going forward. Given that RAM is the main bottleneck at this point, running on the CPU seems more practical for most end users.
I'm running Vicuna-13B in fp16 locally and it needs 26GB of VRAM, which won't even fit on a single RTX 4090. The next-gen RTX Titan might have enough VRAM, but that won't come cheap; I'm expecting a price point above $2500.
I'm not sure if it's the point GP is trying to make, but I would like to see GPUs with extra VRAM that don't have the extra compute, e.g. something with the performance of a 4070 Ti but with 24GB or 32GB of VRAM.
I don't see a really good reason why OEMs couldn't do that now; in the past there have been OEM cards with more VRAM than the reference design. I'm sure there's an appetite for cards like that among people who don't want to refinance their home loan to get 2x RTX 4090 cards.
Vicuna-13B in GPTQ 4bit has almost no perplexity/quality loss and fits in just 8GB of RAM or VRAM.
I run it on my phone CPU and get ~4 tokens per second. On my laptop CPU I get 8 tokens per second.
On a $200 P40 I run LLaMA-33B at 12 tokens per second in GPTQ 4bit. A consumer 3090 gets over 20 tokens per second for LLaMA-33B and 30 tokens/second for Vicuna-13B.
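For what it's worth, the VRAM figures in this subthread line up with a simple weights-only estimate; the tiny sketch below ignores activations, the KV cache, and quantization overhead (scales/zero-points), so real usage runs somewhat higher:

    # Back-of-envelope weight memory for a 13B-parameter model.
    def weight_gb(n_params: float, bits_per_param: float) -> float:
        return n_params * bits_per_param / 8 / 1e9

    for name, bits in [("fp16", 16), ("int8", 8), ("GPTQ 4-bit", 4)]:
        print(f"13B @ {name}: ~{weight_gb(13e9, bits):.1f} GB")

    # 13B @ fp16: ~26.0 GB       -> the 26 GB fp16 figure above
    # 13B @ int8: ~13.0 GB
    # 13B @ GPTQ 4-bit: ~6.5 GB  -> plus overhead, roughly the ~8 GB quoted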
Could you use something like this to summarize the content of a video? Or would you first need to extract images from the video (say at 1 fps, with ffmpeg) and tell the AI to analyze all of the images and summarize them as if they were a video?
I'd do this using both streams of information: audio and video.
I'd segment the audio semantically based on the topic of discussion, and I'd segment the video based on editing, subjects in scene, etc. We could start simply and just have a "timestamp": [subjects, in, frame] key-value mapping.
It'd take some fiddling to sort out how to mesh these two streams of data back together. The first thing I'd try is segmenting by time chunks (the resolution of which would depend on the min/max segment lengths in the video and audio streams) and then clumping the time chunks together based on audio+video content.
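A naive, single-stream version of the frame-sampling idea might look like the sketch below. caption_image() and summarize() are hypothetical wrappers around an image-to-text model and a text LLM (not part of any repo); the ffmpeg call just pulls one frame per second, as suggested above:

    # Naive frame-sampling sketch: pull 1 frame per second with ffmpeg,
    # caption each frame, then hand the timestamped captions to an LLM to
    # summarize. caption_image() and summarize() are hypothetical callables.
    import subprocess
    from pathlib import Path

    def extract_frames(video_path: str, out_dir: str, fps: int = 1) -> list[Path]:
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
             str(out / "frame_%05d.png")],
            check=True,
        )
        return sorted(out.glob("frame_*.png"))

    def summarize_video(video_path: str, caption_image, summarize) -> str:
        frames = extract_frames(video_path, "frames")
        # One caption per sampled second; the index doubles as a rough timestamp.
        captions = [f"t={i}s: {caption_image(p)}" for i, p in enumerate(frames)]
        return summarize("Summarize this video from its frame captions:\n"
                         + "\n".join(captions))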
I also developed a toy discord bot (https://github.com/152334H/MiniGPT-4-discord-bot) to show the model to some people, but inference is very slow so I doubt I'll be hosting it publicly.