The emphasis here is Single Image, but can this model generate with multiple images too?
We know that a single image of an object physically can't cover all of its sides, so the AI has to guess the rest. That's totally fine for certain scenarios, but in lots of other cases it's trivial to capture multiple images of the same object, and if that offers higher fidelity, it's totally worth it.
I'm aware there are many algorithms or AI models that already do that. I'm asking about Stability's specifically, because if their single-image results are this impressive, surely their multi-image results would also be much better than the current state of the art?
Just tried to run this using their sample script on my 4090 (which has 24GB of VRAM). It ran for a little over 1 minute and crashed with an out-of-memory error. I tried both SV3D_u and SV3D_p models.
[edit]Managed to generate by tweaking the script to generate fewer frames simultaneously. 19.5GB peak VRAM usage, 1 min 25 secs to generate, at 225 watts.[/edit]
I managed to get it working with a 4090. You need to adjust the parameter decoding_t of the sample function in simple_video_sample.py to a lower value (decoding_t = 5 works fine for me).
I also needed to install imageio==2.19.3 and imageio-ffmpeg
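For anyone else hitting the OOM, here's roughly what that change looks like. This is only a sketch based on the comments above: the real sample() in scripts/sampling/simple_video_sample.py has many more parameters, and the names and defaults shown here are assumptions rather than the actual code.

    # Sketch of the VRAM workaround: lower decoding_t so the VAE decodes fewer
    # frames per batch, trading a little speed for a much lower peak VRAM footprint.
    # Parameter names/defaults are assumed from the comments above, not copied from the repo.
    def sample(
        input_path: str = "assets/test_image.png",  # placeholder input image
        version: str = "sv3d_p",                    # or "sv3d_u"
        decoding_t: int = 5,                        # lowered from the repo default; reportedly fits in 24GB
    ):
        # ... the rest of the repo's sampling logic goes here ...
        pass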
I made a Manifold market[0] on the amount of RAM a 5090 will have, and while pretty much nobody has participated, I just checked and the market is amusingly at the 32GB you've also quoted. Just like you, I hope it will be more, but I fear it will be even less.
Even 32GB would be great for a gaming card; any more and you'd never see it on sale, since it would be bought up by the truckload for AI, so of course they're not going to balloon the VRAM. I suspect we'd still be at 16GB if they hadn't launched the 3090 in Sep 2020 with 24GB, before all this craze really started, and lowering it now would be bad optics.
Meanwhile Apple will sell you a chip with 96GB of unified memory for the price of two 4090s ... and that is with the Apple tax ... it's ridiculous. I know the memory bandwidth of the M2 Max is about half that of a 4090, but still, the artificial kneecapping Nvidia does is absurd.
You can add multiple GPUs, but practically speaking you're better off with used 3090s, which you can get two of for the price of one 4090.
I have a 3090 Ti and I can run Q4-quantized 33B models at 30 t/s with 8k context. A 4090 would let me do the same at ~45 t/s; both inference speeds are more than fast enough for most people, so the 3090 is the usual choice. In my tests on RunPod, an H100 with 80GB of memory is around the same speed as a 3090, so slower than a 4090.
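For reference, a setup along those lines looks something like this with llama-cpp-python; the GGUF filename is a placeholder, and actual speeds will depend on your quant and hardware.

    from llama_cpp import Llama

    # Rough sketch: a 4-bit (Q4) quantized ~33B model with an 8k context on a 24GB card.
    llm = Llama(
        model_path="models/some-33b-model.Q4_K_M.gguf",  # placeholder; use whatever Q4 quant you have
        n_ctx=8192,        # 8k context as mentioned above
        n_gpu_layers=-1,   # offload all layers to the GPU
    )

    out = llm("Explain KV caching in one paragraph.", max_tokens=256)
    print(out["choices"][0]["text"])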
Odd statement. I don't really know what you mean by that. Perhaps 'math _works_, code should too' ?
I would definitely agree that it _should_ work.
I'm of the belief that no one should _have to_ publish (e.g. to graduate, get promoted, etc.) in academia, and that publications should only occur if they're believed to be near Nobel Prize worthy, and fully reproducible by code with packaging that will still work in 10 years, from data archives that will still exist in 10 years.
But it seems I have been outvoted by the administration in academia.
Hence, we get this "AI that doesn't run" phenomenon.
It already is effectively just for industry benefit. It's been like that since the start. Work that is too expensive for industry to do (research and discovery) was put into the public sphere such that the role of industry was to take that innovation and optimize it. That's at least how it is intentionally constructed.
My main point was that there is a lot of noise in scientific journals caused by pressures in academia, namely the requirement to publish. If those pressures were removed, the quality of published work would increase and the quantity would decrease.
There are other places to post derivative, non-novel work, like blogs. The field of biology has an immense amount of work that is mostly observational, without strong conclusions or predictivity. A tabulation of observations should definitely be put out by a lab, and much sooner and with far less pressure than today, rather than the typical dance of only releasing the data at publication time. The SRA is one example of a place to share data. The typical workflow could be: put all data immediately onto a public repo, comment on it informally in venues below scientific journals such as blogs, and then, if something truly substantial comes out of it (a novel analytical model that is highly predictive of cell behavior in all situations, for example), publish.
That would help separate the signal from the noise. LLMs are one case where the noise is very strong, in that many papers are simply 'we fine-tuned an LLM'.
So how should knowledge be shared in academia without publishing? Any work worthy of a Nobel Prize (or more likely, a Turing Award) is built on top of significant amounts of other research that itself wasn't so groundbreaking.
That said, I certainly think that researchers can do more to make their code and data more accessible. We have the tools to do so already but the incentives are often misaligned.
Yeah, I'm still debating whether I go with a Mac Studio with the RAM maxed out (approx $7500 for 192 GB) or a PC with a 4090. Is there a better value path with the Nvidia A series or something else? (I'm not sure about tinygrad.)
I have an M1 Max with 64GB and a 3090 Ti. The M1 Max is ~4x slower at inference for the same models than the 3090 (i.e. 7 t/s vs 30 t/s), which depending on the task can be very annoying. As a plus you get to run really large models, albeit very slowly; think about whether that will bother you or not. I will not give up my 3090 Ti and am instead waiting for the 5090 to see what it can do, because when programming, the Mac is too slow to shoot off questions. I use the Mac mostly to better understand book topics now, and the 3090 Ti for fast chat sessions.
You can get a previous gen RTX A6000 with 48GB of gddr6 for about $5000 (1). Disclosure: I run that website. Is anyone using the pro cards for inference?
Perhaps NVIDIA or somebody could invent a RAM upgrade via NVLink? Seems plausible, and not every workload wants another full GPU when extra memory alone is all it needs.
We need AMD to compete, but from what I know their software is inferior to NVIDIA's offering, and most of the current ML stacks are built around CUDA. Still, there's a lot of money to be made in this area now, so competition big and small should pop up.
The memory is inherent to the GPU architecture; you cannot just add VRAM and expect no other bottlenecks to pop up. Yes, they can reduce the VRAM to create budget models and save a bit here and there, but adding VRAM to a top model is a tricky endeavour.
There's a lot of open weights activity around 7B/13B models, which the 4090 will run with ease. But you can run those OK on much cheaper cards like the 4070 Ti (which is of course why they're popular).
And there's a lot of open weights activity around 70B and 8x7B models which are state-of-the-art - but too big to fit on a 4090. There's not much activity around 30B models, which are too big to be mainstream and too small to be cutting edge.
If you're specifically looking to QLoRA fine-tune a 7B/13B model a 4090 can do that - but if you want to go bigger than that you'll end up using a cloud multi-gpu machine anyway.
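To make that concrete, a QLoRA fine-tune of a 7B model on a single 24GB card looks roughly like this with the Hugging Face stack; the base model and LoRA hyperparameters below are just illustrative choices, not a recommendation.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Load the base model in 4-bit; this is what keeps a 7B model comfortably inside 24GB.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",          # illustrative 7B base model
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    # Attach small trainable LoRA adapters; only these receive gradients during training.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],  # typical attention projections to adapt
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()        # a tiny fraction of the full parameter count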
The 4090 has more VRAM than most computers have system RAM. I'm surprised this is considered "low RAM" in any way except relative to datacenter cards and top-spec Apple Silicon.
You're comparing RAM amounts to other RAM amounts without considering requirements. 24GB is more than (most) current games would ever require, but is considered an uncomfortably constrictive minimum for most industrial work.
Traditional CPU-bound physics/simulation models have typically wanted all the RAM they could get; the more RAM the more accurate the model. The same is true for AI models.
I can max out 24GB just using spreadsheets and databases, let alone my 3D work or anything computational.
Good to know. I've only been running LLMs at home and most of the open-source ones have been more than small enough to fit in my measly 12GB. But I guess most workloads want so much that 24GB won't fit them at all.
For AI that's either a very fat SDXL model at its max native resolution, or a quantized 34B-parameter model, so it's on the low side. Compare that with the Blackwell AI "superchip" announced yesterday, which appears to the programmer as a single GPU with 30TB of RAM.
Maybe give me lots of money to give Nvidia for a card with more memory then?
Nvidia have held back the majority of their cards from going over 24GB for years now. It's 2024 and my laptop has 96GB of RAM available to the GPU but desktop GPUs that cost several thousands just by themselves are stuck at 24GB.
They don’t get their absurd profit margins by cannibalising their data centre chips.
This is like Intel and their refusal to support ECC memory on consumer parts, when AMD supports it on nearly all Ryzens.
—
Note: your laptop is probably using a 64-bit memory bus for system RAM. For GPUs, the 4090 is 384-bit. That takes up a lot more die area for the bus and memory controller.
But GP's laptop with 96GB of unified memory would be a M2 Max Macbook or better. The M2 Max has a 4 x 128-bit memory bus (410GB/s) and the M2 Ultra is 8 x 128bit (819GB/s), versus a 4090 at 1008GB/s. But see here for caveats about Mac bandwidth: https://news.ycombinator.com/item?id=38811290
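The numbers line up if you multiply bus width by per-pin data rate; the data rates below (GDDR6X at ~21 Gbps, LPDDR5-6400) are my assumptions for the back-of-envelope math.

    # Peak bandwidth ≈ bus width in bytes × per-pin data rate (Gbps)
    rtx_4090 = (384 / 8) * 21.0   # 384-bit GDDR6X at ~21 Gbps/pin  -> ~1008 GB/s
    m2_max   = (512 / 8) * 6.4    # 4 x 128-bit LPDDR5-6400         -> ~410 GB/s
    m2_ultra = (1024 / 8) * 6.4   # 8 x 128-bit LPDDR5-6400         -> ~819 GB/s
    print(rtx_4090, m2_max, m2_ultra)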
Isn't there the risk that if they give the gaming cards enough RAM for such tasks then they'll get bought up for that purpose and the second-hand price will go even higher?
I guess my point is, rather than give the cards more RAM, the gaming cards should just be priced cheaper.
This is unfairly downvoted. They launched the 3090 in Sep 2020 with 24GB, which was more than AMD's 16GB 6900 XT from the same generation. Maybe before blaming Nvidia, blame AMD for not really trying to compete with them? Of course they're not going to release a gaming card with loads more VRAM, because a) the competition doesn't exist, nor does it have gaming cards with more VRAM, b) it would all be bought up for AI workloads, and c) games don't really need more, as the parent said.
Dunno why the defaults for this stuff aren't set for baseline hardware; I feel like I always have to tweak the batch size down in the sample scripts, even with 24GB, because everything assumes 48GB.
Yeah, this is to be expected with early adoption. This stuff comes out of the lab and it's not perfect. The key thing to evaluate is the trajectory and pace of development. Much of what folks challenged ChatGPT with a year ago is long lost in the dust. Go look at stable diffusion this time last year. Dall-E couldn't do words and hands, it nails that 90% of the time in my experience today.
About words: DALL-E is not even close to nailing it 90% of the time. Not even 50%. Maybe they nerf it when you request a logo, but that was my experience over the last few days.
With previous attempts at this problem the shaded examples could be quite misleading because details that appeared to be geometric were actually just painted over the surface as part of the texture, so when you took that texture away you just had a melted looking blob with nowhere near as much detail as you thought. I'd reserve judgement until we see some unshaded meshes.
Seems like a tougher nut to crack than image generation was, since there isn't a bajillion high-quality 3D models lying around on the internet to use as training data; everyone is trying to do 3D model generation as a second-order system, using images as the training data again. The things that make 3D assets good (the tiny geometric details that are hard to infer without many views of the same object, the quality of the mesh topology and UV mapping, rigging and skinning for animation, reducing materials down to PBR channels that can be fed into a renderer, and so on) aren't represented in the input training data, so the model is expected to make far more logical leaps than image generators do.
I know where I could get several hundred terabytes (maybe an exabyte? It’s constantly growing) of ultra high quality STL files designed for 3D printing. I just don’t have the storage or the knowledge of how to turn those into a model that outputs new STL files.
I'd imagine it'd require a ton of tagging, although I have a good idea of how I could leverage existing APIs to tag it mostly automatically: generate three still-image thumbnails of the content, feed those through CLIP, verify that all three views agree on what the STL depicts, and manually tag the ones that fail that test.
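The CLIP check itself could be a pretty small script, assuming the thumbnails are already rendered; the model checkpoint and the candidate labels here are just examples.

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    labels = ["a dragon figurine", "a phone stand", "a gear", "a vase"]  # example label set
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def best_label(thumbnail_path: str) -> str:
        # Zero-shot classify one rendered view of the STL against the label set.
        image = Image.open(thumbnail_path)
        inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
        return labels[probs.argmax().item()]

    # Tag only if all three rendered views agree, as described above.
    views = ["front.png", "side.png", "top.png"]   # placeholder thumbnail paths
    guesses = {best_label(v) for v in views}
    tag = guesses.pop() if len(guesses) == 1 else None  # None -> send to manual tagging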
> since there isn't a bajillion high quality 3D models lying around on the internet to use as training data
There aren't a bajillion high-quality 3D models of everything, but there are an unbounded number of high-quality 3D models of some things, due to the existence of procedural mesh systems for things like foliage.
You could, at the very least, train an ML model to translate images of jungles into 3D meshes of the trees composing them right now.
Although I wonder if having a few very-well-understood object types like these, to serve as a base, would be enough to allow such a model to deduce more generalized rules of optics, such that it could then be trained on other object categories with much smaller training sets...
It almost seems easier, in that you have an arbitrary # of real-world objects to scan and the hardware is heavily commoditized (IIRC iPhones have this built in at high res now?)
In context, the conversation was beyond a dichotomy - thankfully. Having only 2 choices leaves conversation at people insisting one is better, and becomes an argument about definitions where people take turns alternating being "right" from the viewpoint of a neutral observer.
It's proposing a solution to the author's observation that everyone is doing it in second order fashion and missing a significant amount of necessary data.
The implication is that rather than doing it the hard way via the already-obtained second-order dataset, it'll be easier to get a new dataset, and getting that dataset will be significantly easier than it was to get the second-order one, since you don't need to worry about aesthetic variety so much as teaching what level of detail is needed in the mesh for it to be "real".
I don't think they have a specific use case for this model; they're throwing ideas at the wall again in the hopes some of them stick and eventually turn into another product. The paper doesn't discuss any of the problems that would need to be solved in order to easily generate game-ready assets, so I think it's safe to assume it currently can't.
For games at the very least you need to consider polygon budget, getting reasonably good UVs, and generating materials which fit into a PBR shader pipeline, at least if it's going to work with rendering pipelines as we know them today (as opposed to rendering neural representations directly, which is a thing people are trying to do but is totally unproven in production).
I'd be willing to bet you could create a diffusion model to map unrefined meshes to UV-fixed and remeshed surfaces. If you had a large enough library of good meshes you just programmatically mess 'em up and use that as the dataset.
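A crude version of the "mess 'em up" step could look like this with trimesh; the paths are made up, and a real corruption pipeline would go well beyond vertex jitter (scrambled UVs, decimation, merged islands, and so on).

    import numpy as np
    import trimesh

    # Pair each clean mesh with a programmatically degraded copy, so a model can be
    # trained to map degraded -> clean. The degradation here is just vertex noise.
    clean = trimesh.load("library/clean_asset.obj", force="mesh")   # hypothetical path
    noisy = clean.copy()
    noisy.vertices = noisy.vertices + np.random.normal(
        scale=0.01 * noisy.scale,          # noise proportional to the mesh's overall size
        size=noisy.vertices.shape,
    )
    noisy.export("dataset/degraded/clean_asset.obj")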
That's assuming your generator produces a normal map; the ones I've seen do not. The only texture channel they output is color, which is the one channel a model trained on images is naturally equipped to produce.
You can generate pretty reliable texture depth maps from just an image. It's going to be trash if you're trying to generate the depth for the entire 3D model, but I presume it will do a good job with just the texture. Then you just use a displacement based on the depth map.
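For instance, a monocular depth model already gets you a usable displacement source from a single photo. A rough sketch with the Hugging Face depth-estimation pipeline (the model choice and filenames are just examples):

    from PIL import Image
    from transformers import pipeline

    # Estimate per-pixel depth from a single photo and save it as a grayscale map,
    # which can then drive a displacement/bump channel on the textured surface.
    depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
    result = depth_estimator(Image.open("texture_photo.png"))   # placeholder filename
    result["depth"].save("displacement_map.png")                # normalized depth as a PIL image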
Only if you have multiple images of the same areas so that you can extract actual position. And there is no guarantee that multiple pictures of the same model have the same detail, much less in a manner that can be triangulated with accuracy. A lot of the photogrammetry algorithms discard points that don't match certain error-bars.
So yes, there might be a wooden frame in the middle of that window, but does it match the math on both angles of it? Doubt it.
I don't know much about 3D printing, would be very interested in learning more about this idea if you'd be so kind as to expand on it. Could I have AI spend all day auto scanning what teens are doing on instagram, auto generate toys based on it, auto generate advertisements for the toys, auto 3D print on demand?
I think their suggestion was more "I have a photo of a cool horse, and now I would like a 3D model of that same horse."
Another way of looking at it, 3D artists often begin projects by taking reference images of their subject from multiple angles, then very manually turning that into a 3D model. That step could potentially be greatly sped up with an algorithm like this one. The artist could (hopefully) then focus on cleanup, rigging, etc, and have a quality asset in significantly less time.
The question is whether this actually "creates a 3D model based on the picture", or whether it "finds an existing model that looks similar to the picture and texture-maps it".
Hypothetically, sure, assuming the parent comment that these meshes are sufficient for modelling is correct and that you can find any teens who want a non-digital toy.
I think a good hobbyist application for this would be something like modelling figurines for games, which is already a pretty popular 3D printing application. This would allow people with limited modelling skills to bring fantastical, unique characters to life “easily”.
OP is suggesting that this (AI model? I honestly am behind on the terminology) could replace one of the common steps of 3D printing - specifically, the step where you create a digital representation of the physical object you would want to end up with.
There are other steps to 3D printing in general, though; a super rough outline:
- Model generation
- "Slicing" - processing the 3D model into instructions that the 3D printer can handle, as well as adding any support structures or other modifications to make it printable
- Printing - the actual printing process
- Post-processing - depending on the 3D printing technology used, the desired resulting product, and the specific model/slicing settings, this can be as simple as "remove from bed and use" to "carefully snip off support structures, let cure in a UV chamber for X minutes, sand and fill, then paint"
As I said before, this AI model specifically would cover 3D model generation. If you were to use a printing technology that doesn't require support structures, and handles color directly in the printing process (I think powder bed fusion is the only real option here?), the entire process should be fairly automatable - a human might be needed to remove the part from the printer, but there might not be much post-processing to do.
The rest of your desired workflow is a bit more nebulous - I don't know how you would handle "scanning what teens are doing on instagram", at least in a way that would let you generate toys from the information; generating and posting the advertisement shouldn't be too hard - have a standardish template that you fill in with a render from the model, and the description; printing on demand again is possible, though you'll likely need a human to remove the part, check it for quality and ship it. You could automate the latter, but that would probably be more trouble than it's worth.
Interesting; to be clear, I don't think this is a good idea, and it's kinda my nightmare post-capitalism hell. I just think it's interesting that this could be done now.
On finding out what teens want, that part is somewhat easy-ish. I guess you'd need a couple of agents: one that scans teen blogs for stories and converts them to keywords, then another agent that takes the keywords (#taylorswift #HaileyBieberChiaPudding #latestkdrama etc.) into Instagram; after a while your recommended page will turn into a pretty accurate representation of what teens are into. Then just have an agent look at those images and generate diffs of them. I doubt it would work for a bunch of reasons, but it's an interesting thought experiment! Thanks!
I'd like to play around with something like this, but from my understanding my machine (Macbook, 2021 M1) isn't nearly powerful enough (right?). Are there remote/cloud environments where I can run models like this?
I'm sorry for the dumb, lazy question, but would the input require more than one image? Is there a demo URL to test this? I think it might just be time to buy a 3D printer.
EDIT> Does "single image inputs" mean more than one image?
TBH I always look at the worst-case scenario. I was worried it meant it needed 3 images, input as a single image at different steps of the process, so requiring different angles. I wasn't sure, but thought it best to check. I feel like it would have been clearer to have said something like "generates a 3D model from a single image" (not exact wording, but you catch my drift). Sorry, I am over-analysing, but all feedback is good, right?
> Stable Video 3D (SV3D) is a generative model based on Stable Video Diffusion that takes in a still image of an object as a conditioning frame, and generates an orbital video of that object.
So can it actually output a 3d model? Or just images of what it thinks the object would look like from other angles?
The reference video (https://youtu.be/Zqw4-1LcfWg) says they use a NeRF / structure-from-motion step and then create a mesh from the generated radiance field with marching cubes. This is how most state-of-the-art text-to-object generators work now as well.
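The marching-cubes step at the end is standard; here's a toy sketch with scikit-image where a synthetic sphere stands in for the density field you'd actually sample from the NeRF.

    import numpy as np
    import trimesh
    from skimage import measure

    # Build a toy density grid (positive inside a unit sphere) in place of a NeRF's
    # sampled densities, then extract the zero level set as a triangle mesh.
    grid = np.linspace(-1.0, 1.0, 128)
    x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
    density = 1.0 - np.sqrt(x**2 + y**2 + z**2)

    verts, faces, normals, _ = measure.marching_cubes(density, level=0.0)
    mesh = trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)
    mesh.export("object.obj")   # ready for cleanup / retopology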
It crashes with an out-of-memory error on my 24GB 4090, so at least when it comes to their sample script the answer is "a lot". Maybe it's just an inefficient implementation though.
Pretty much every initial Stability release has been inefficient, and resource requirements have dropped a lot once community engines optimized for real consumer hardware appeared for running the model.
OTOH, with their shift to a less open licensing structure, community tooling probably won’t emerge with the same level of energy.
In the repo the model weights file is 9.37GB, whereas sdxl turbo is 13.9GB, and I don't see any mention of huge context windows, so probably it just needs a decent graphics card.
Extracting an object from an image, transforming it (rotating it, for example), and re-blending it into the original image.
Or making 3D game assets from objects you have around. Imagine: take your phone, go around town, into shops, into churches, come back, press a button, and get a huge library of 3D assets to populate your game.
Or, something like this, for IKEA: couple of photos of a room --> extract objects --> let user re-arrange furniture. The room could be either the user's room or an IKEA showroom.
You can do it with existing tools, but this kind of technology reduces it to pressing a couple of buttons.
There are many direct consequences of people being able to directly transform text into textured 3D models, and even vaster indirect consequences if one pauses to reflect. A tight feedback system with good cohesion could revolutionize art, design, mechanical engineering, video games, etc.
Avant-garde/experimental filmmaking is the main beneficiary of all this.
Basically, cool-looking video no one watches. I say that as a huge fanboy/artist myself. It is like Christmas every other day right now. All this VC money being set on fire to make better avant-garde film tools is just wonderful. A dream come true.
It's hard to get camera position tracking for random objects, so it looks like they used simulations. There's probably a lot more plastic children's toy models in Blender than people, fabrics, buildings, &c.
They compare against Zero123-XL, but they should compare against MVDream instead. MVDream is quite good. If you fiddle with the loss you can get even better results.