Running Stable Diffusion XL 1.0 in 298MB of RAM (github.com/vitoplantamura)
510 points by Robin89 on Oct 3, 2023 | 154 comments



Fascinating. The money quote:

"OnnxStream can consume even 55x less memory than OnnxRuntime while being only 0.5-2x slower"

The trade-off between (V)RAM use and inference time sounds like it could be advantageous in some scenarios, and not just when RAM is constrained like in the RPi case.

I actually wonder if this weight unloading approach can be used to handle larger batch sizes in the same amount of RAM, in effect increasing throughput massively at the cost of latency.


I want this for LLMs. Having that much less of a memory footprint would allow us to put more models on a GPU at a time, and assuming the clock could keep up it could more than make up for the loss in inference speed per individual model


"0.5-2x slower" must be a typo on their part right? If something is 0.5x slower, then it is 2x faster.

I assume they meant to say "1.5-2x slower".


Maybe they meant 50%-200% slower, in which case the x-factor range would really be 1.5x to 3x?


I think it's still wrong, although one may understand what the other person means.

When we say "faster" or "slower", what we usually mean is that we add/remove the percentage to/from the original amount, which is often a cause of misunderstanding.

"Y is 10% faster than X" means that Y goes at 110% the speed of X

"Y is 10% slower than X" means that Y goes at 90% the speed of X

In particular, "Y is N% slower than X" doesn't mean that "X is N% faster than Y"! (110% of 90% is not 100%)

For example "Y is 100% slower than X" doesn't mean that "X is double as fast as Y", but that Y is not moving at all.

and "Y is 200% slower than X" means... that Y goes in the other direction ? (Maybe back in time, in this case ?)


200% slower is 3x as long, yes.


I love making fun of people that don’t understand percentages… wait, wat?


117.472% of grade school students are unable to readily convert between fractions and percentages.

38.157% of informally provided statistics are made up on the spot under the assumption nobody will actually check.


You have a typo - it’s actually 83.157%


9% of people consist entirely of shoulders, elbows and knees


Who is the intended butt of your joke here?

And explain to me why it isn't you?


I am pretty sure they intended the butt of the joke to be themselves


He’s being self deprecating, yes


But 50% slower is either 1.5x or 2x as long, depending on whether the 50% refers to an increase in runtime or a decrease in throughput. 200% slower is only unambiguous because throughput cannot decrease by more than 100%.


Usually that’s “half as fast”. English has always had the terrible twos. Is biweekly twice a week or every two weeks?

I’ve been doing capacity planning lately and the whole deal with how we are bad at fractions came up again. If you have to spool up 10% more servers and then cut costs by 10%, you’re still slightly ahead. If you cut 10% and then later another 10%, you have cut 19% of the original, not 20%.
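
A quick way to convince yourself of that compounding (a minimal Python sketch; the 10% figures are just the ones from the example above):

    base = 100.0                       # arbitrary starting cost / server count
    up_then_down = base * 1.10 * 0.90  # +10%, then -10% of the new total -> ~99, slightly below where you started
    two_cuts = base * 0.90 * 0.90      # -10% twice -> ~81, i.e. a 19% cut, not 20%
    print(round(up_then_down, 2), round(two_cuts, 2))   # 99.0 81.0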


Communication is hard. If they said "Takes 50% to 200% more time" it would have been clearer


What? No, that's confusing enough that it's almost hostile. The fact that your math is wrong is proof enough. 50% to 200% more time is 1.5x to 3x slower.

I don't like how it was worded by the author. But all you've done is essentially invert the wording while making the math MORE difficult in the process.


Umm what? I hear “50% more time” all the time.

50% to 200% is 0.5x slower to 2x slower.

People seem to be confusing “% slower/more time” vs “% of current time”


Yes people saying “50% more” is very common, and usually easily understood to mean 1.5x.

I think something like “0.5x slower” is just conventionally not used, even if it’s understandable. It’s a difference in common usage between percentages and x-factors. One reason might be that x-factor implies multiplication, not addition; that’s what the “x” stands for. X-factors are typically used to say something like the ‘the new run time was 1.7x the old one’. Whether it was slower or faster is implied by whether the x-factor is below or above 1.0. Because x-factors are commonly used as multipliers, and not commonly used to say “1.5x more than” (which actually means 2.5x), it’s pretty easy for people to misunderstand when someone says “0.5x slower” because it looks like an x-factor.

Now, the same argument could apply to percentages. A percentage is also a factor. But in actual usage, “50% more” is common and “0.5x more” is not; and “100x faster” is common (usually to mean 100x not 101x) while “10000% faster” is not common at all. So language is inconsistent. ;)

All that said, using an x-factor as a pure factor, and not a multiply-add, is less confusing and more clear. Saying “The new runtime is 1.5 times the old one” leaves no room for error, where “The new runtime is 90% slower than the old one” is actually pretty easy to miscalculate, easy to mistake, and easy to misinterpret. The percentage-add is also asymmetric: 90% slower means 0.1x, while 90% faster means 1.9x. Stating a metric as percentage-add makes sense for small percentages, and makes more sense for add than subtract once the numbers are double-digit and larger.


"1x slower" would be 2x as slow; I think that's how people are interpreting it.

I think they are correct but it is easy enough to misinterpret that it is not a good way to phrase things.


It depends. What do you think if I say it's 1x slower?

Is it as fast as the original or does it take twice as long?


Twice as long. As fast as the original would be "0x slower" or "1x as fast".

People should just use duration instead of speed as you did at the end: "takes twice as long", "takes 1/3 of the time..."


> As fast as the original would be "0x slower" or "1x as fast".

I have played a fair number of incremental games and this quickly became a pet peeve of mine. So many will say things like "2x more" and it will actually be "2x as much". Fortunately, I don't recall any which actually switch between the meanings but it's so commonly a guessing game until I figure it out.


Off by one is baked into the language design.


hi,

I'm the author.

I have never questioned the clarity of that sentence, at least until today :-)

By "0.5x" I mean "0.5 times or 0.5 multiplied by the reference time" where "reference time" is the inference time of OnnxRuntime. So I'm actually meaning "50%".

I think the expression "0.5x slower", taken by itself, could be misleading, but in the context of the original sentence it becomes clearer ("while being only 0.5-2x slower", the "only" is important here!!!).

But I think the general context of that sentence defines its meaning. I am referring to the fact that under no circumstances could my project, given its premises, ever be even a single millisecond faster than OnnxRuntime. The second paragraph of the README then states the goal of the project, which is precisely to trade off inference time for RAM usage! That is obviously combined with the fact that the performance data is clearly reported and that I repeat several times that generating a single image takes hours or even dozens of hours.

However, given the possible misunderstanding, I will correct the sentence in the next few days.


I still don't understand what you mean. 0.5 multiplied by the reference runtime is faster.

Do you mean it increases the runtime by 50%-200%?


By "0.5x slower" I mean "50% slower" :-)



This is an issue that irritates the hell out of me when I see it and I'm so glad this person wrote it out so plainly.


They don't talk about 50% cookie though. They only go up, not down in cookie.


It is actually correct: it is slower by 0.5-2x in terms of runtime, resulting in a runtime of 1.5-3x. Admittedly, it would probably have been less confusing to say that the runtime increased by 0.5-2x, to 1.5-3x, because "slower" is more intuitively associated with less, while the runtime of course increases.

In terms of throughput, it is a decrease of 0.33-0.67x, down to 0.67-0.33x of the original. That these are the same numbers in reverse order is of course just a coincidence: had the runtime increased by only 0.2-0.5x, to 1.2-1.5x, the throughput would have decreased by 0.17-0.33x, to 0.83-0.67x.
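
For anyone who wants to sanity-check those conversions, a minimal Python sketch (the 0.5x/2x and 0.2x/0.5x figures are the ones discussed above):

    def slower_by(frac):
        """Interpret 'frac x slower' as a runtime increase of frac times the original."""
        runtime = 1.0 + frac            # e.g. 0.5 -> 1.5x the original runtime
        throughput = 1.0 / runtime      # e.g. 1.5x runtime -> ~0.67x the throughput
        return runtime, throughput, 1.0 - throughput   # last value: throughput decrease

    for f in (0.5, 2.0, 0.2):
        r, t, d = slower_by(f)
        print(f"{f}x slower -> runtime {r:.2f}x, throughput {t:.2f}x (a {d:.2f}x decrease)")
    # 0.5x slower -> runtime 1.50x, throughput 0.67x (a 0.33x decrease)
    # 2.0x slower -> runtime 3.00x, throughput 0.33x (a 0.67x decrease)
    # 0.2x slower -> runtime 1.20x, throughput 0.83x (a 0.17x decrease)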


From my (albeit naive) reading, it doesn't appear that they've reduced the amount of memory bandwidth required, simply the size of the working set required.

Since inference is generally memory-bandwidth-bound once you reach the level of 'does this model even fit in the given system', I'd imagine that this technique wouldn't help much for greater throughput via larger batch sizes. Just one instance is probably already saturating the memory controller.

Maybe it'd help on the training side though?


That's true. But assuming the required memory bandwidth is not already maxed out by this, there might still be a narrow but workable "Goldilocks zone" for this technique to be useful.


11 hours reminds me of doing raytracing on my Amiga 500 back in the day. It was definitely an overnight job for the "final" render.


Heh, sometimes I am still doing that. Modern bidirectional raytracers can do some interesting tricks, and I wanted to see caustics (the bright lines in pools). But caustics, despite being bright, are actually statistically rare. To get good caustics you have to remove the bounds on the render engine and just let it cook overnight.

And the end result: a single image of a mediocre scene by a poor artist with amazing caustics. I won't be quitting my day job.


Doing that low quality render first because you'd rather waste an hour being right than all night being wrong.

That was about when I decided I needed other hobbies. Right before that happened some brilliant soul put out a tool that would render your scene in OpenGL so you could look at it first. I don't think that would run on your Amiga but it (barely) ran on my machine.


Ha. Same on my 286. Set up povray, go to bed, see image before school in the morning.


Same (albeit a little later), with a dodgy copy of 3DS MAX on a 386.


It reminds me of doing Mandelbrot fractals on my C64. Debugging my code was really hard.


I am still amazed by seeing fractals rendered in real time. My Core 2 Duo can do the initial renders at about 1080p resolution in a second or two. That's something that would have taken hours on an Amiga in the '80s, IF you even had the memory for that kind of storage.


I did 640x480 fractals at crazy speeds with XaoS on an AMD Athlon, which was and is pretty well optimized.


I've been using Stable Diffusion on an MBP via invoke.ai. Are there recommendations for better parameterization of SD? I can never match the quality of the images I find on the internet, even when using the same prompt and (seemingly) the same knobs (e.g., the same sampler, like Euler A, etc.). [edited for clarification]


This is the best I've tried so far, but I don't think it has Mac support. It's a feature-packed fork of Fooocus, which was developed by the original ControlNet dev. The quality you can get from small prompts is mind boggling:

https://github.com/MoonRide303/Fooocus-MRE

For base SD 1.5, I use Volta, because it's fast: https://github.com/VoltaML/voltaML-fast-stable-diffusion/com...

Really good SD 1.5 image quality comes from gratuitous use of finetunes, LoRAs, ControlNet and other augmentations. So you can, say, trace a base image for structure, specify prompting in certain areas of the image, and so on. InvokeAI is actually quite feature-packed, and has lots of these augmentations hidden in the nodes UI, but Volta and other UIs also expose them more directly.


Fooocus does quite a bit of prompt massaging for you - there are models that take a few words and turn them into “prompt engineer” level prompts. Makes a huge difference.


Yeah, and InvokeAI has a similar "IP-adapter" model.

Still, even with it turned off, the quality is quite remarkable.


IP-Adapter is a bit different from what Fooocus and Midjourney do.

IP-Adapter uses an image to guide denoising.

Fooocus and MJ take a prompt and expand it in a variety of ways (e.g. a language model or more simplistic text manipulation). The actual prompt that creates the conditioning is not what you typed in. That's what I mean by prompt massaging


Are you using custom weights? I'm assuming you are but there is a major difference between using the default RunwayML 1.5 weights and using a model finetuned for a specific purpose.

Generally the trade-off is that the impressive finetuned models are far less generalizable than the default weights, but in practice this is not a big deal and the results can be a substantial improvement.


I have the same experience with Invoke.ai or MochiDiffusion on the MBP M1. I can only match the quality of other images with Automatic1111 (https://github.com/AUTOMATIC1111/stable-diffusion-webui).

You’ll need more time and memory compared to Invoke or an Nvidia graphics card, but it’s not that bad: 1-2 s/it for an image in standard 512x768px quality, 14-20 s/it for an image in high 1024x1536px quality (Hires Fix).


Do they specify it's straight from the generator? The process videos I've seen start with "a girl standing in a green field" and then an hour plus of inpainting to fix hands, pose, etc.


Draw Things added a CUDA-compatible seed mode that allows you to match Nvidia-card-generated images on a Mac.


This would be really cool to have running embedded in a digital photo frame or wall painting.


I built this a while back, using a previous version that runs Stable Diffusion on a Raspberry Pi Zero 2 W:

https://hackaday.com/2023/09/19/e-paper-news-feed-illustrate...

https://github.com/rvdveen/epaper-slow-generative-art/


I'm building exactly that with an e-ink display atm. Sadly, I can't seem to build the XNNPACK stuff in the repo on my Pi Zero 2 W...


That's an awesome idea, do you have a link to more information?


I'll write it up once it's done and post it here; if it gains traction you might see it, haha.

In all seriousness I can give a brief overview:

- I'll probably offload the image generation to the 5-year-old Intel NUC I already have as a home automation server; ComfyUI in CPU mode takes 20-30 mins for a generation. Ideally it's all self-contained on the Pi, but that might be beyond me, skill-wise.

- prompts are composed by taking time of day, season, and special occasions (birthdays, Xmas, etc.); adding random subjects from a long, manually curated list; then asking GPT-4 to creatively remix the prompt for variety (rough sketch after this list)

- I have an Inky Impression 7.3-inch 7-color e-ink display with a Raspberry Pi Zero stuck onto it. Right now it'll simply download new images from the NUC every once in a while

- I like wood and I dislike the jagged 3D-printed aesthetic, so I'll create a frame from laser-cut plywood by designing some stackable SVG shapes in Inkscape and sending those to a laser cutter
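
Roughly what that prompt-composition step could look like (a minimal sketch; the subject list, occasion table, and the remix_with_gpt4 helper are placeholders I made up, not the actual code):

    import datetime
    import random

    SUBJECTS = ["a lighthouse", "a fox in the snow", "a street market"]  # stand-in for the long curated list
    OCCASIONS = {(12, 25): "Christmas", (10, 31): "Halloween"}           # plus birthdays, etc.

    def season(month):
        return ["winter", "spring", "summer", "autumn"][(month % 12) // 3]

    def compose_prompt(now=None):
        now = now or datetime.datetime.now()
        time_of_day = "morning" if now.hour < 12 else "evening" if now.hour >= 18 else "afternoon"
        parts = [random.choice(SUBJECTS), season(now.month), time_of_day]
        occasion = OCCASIONS.get((now.month, now.day))
        if occasion:
            parts.append(occasion)
        return ", ".join(parts)

    base = compose_prompt()
    # final = remix_with_gpt4(base)   # hypothetical helper wrapping the GPT-4 API call
    print(base)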

It works right now, functionally.

Considering that I'm painstakingly writing this on a phone with a sleeping 3-week-old baby on my chest, it'll be a while before I have the energy to make it look like something you'd hang on your wall


Sounds awesome. What's your Twitter?


Great idea, where every 10 hours or so it would refresh with a new image it created itself (perhaps based on a theme supplied by the user).


Not very environment-friendly, though.


Dude, we are literally using single-use plastic bottles to store water in for a few weeks.


If we're doing one bad thing already, we may as well do a hundred!


We are literally using LED lighting because it saves energy over conventional light bulbs.

And now we're going to put a screen on the wall that we don't even look at 99% of the time?


Nothing is stopping you from buying a reusable bottle.


Compared to what though?

I think it might be friendlier in some aspects than fetching an image from a server running the big models.

And you don't have to worry about service disruptions or api keys

An e-ink display doing it should only use energy when it refreshes. And you could minimize refreshes to once a day, week, etc.

Less friendly than a photo of course.


Watch 5 seconds of a TV show on your big TV and you've spent that environmental cost


Why do you say that? The energy usage of inference? I would guess that the embodied energy of the digital photo frame is probably higher.


That's where the $1999 color E Ink display comes into play.


There are cheaper ones... but it might require dithering.


2.5A 5V is not much power, and it would use considerably less when idling.


125 watt-hours for a Raspberry Pi to generate an image in 10 hours, compared to 7 watt-hours for a 440W PC running for 1 minute.
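
Spelled out (assuming the Pi actually draws its full 2.5 A at 5 V budget for the whole run, which is a worst case; a Zero 2 typically draws less):

    pi_wh = (5 * 2.5) * 10      # 12.5 W for a 10-hour generation -> 125 Wh
    pc_wh = 440 * (1 / 60)      # 440 W for 1 minute -> ~7.3 Wh
    print(pi_wh, round(pc_wh, 1))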


Is this for a regular Pi? The OP is using a Pi Zero 2.

That is big though


Amazing feat, but of course it takes forever to generate an image (the README states 11 hours)


Yep. I'll never need or use this implementation, but the tricks used will make it to other tools, which will be great.


It'd be interesting to see what the cost and power equivalence would be compared to a higher-end method, i.e. the time, cost (including all hardware required) and power taken to generate 100 images using 100 individual Pi Zero 2s (doesn't even need to be a W) vs something like an average mid-tier PC.

I'd assume the PC would still likely win.

Something like a Pi 4 or 5 may be a better benchmark than the Zero 2, as I get the impression it's been used more for the challenge than practicality.


A GPU can produce an image in about 1 second.


That depends on the GPU; I'm not talking mid-high end (e.g. RTX level), just your average 'basic' GPU.


On a Raspberry Pi. Zero 2.


Impressive!

Verily, the era is nigh wherein even lamps and toasters shall brim with surpassing sagacity.

After exposure to this field for many years, the last decade was stunning.

I say “was”, because the speedup in the last 6-18 months has been another thing altogether.

I am not concerned with what we will be able to do two years hence, but with how much faster progress will be. And then again, and again.


Ooh! A toaster that takes a prompt and generates that image on your toast! The GPU heat could be harnessed to actually toast the toast.

Let's make a startup!


We're extremely proud to announce that ToasterDream has raised $323M in Series A funds, and we look forward to many years of exciting developments ahead. As the CEO I'd like to personally assure our loyal customers that taking this funding will not compromise the quality of our goods and services — rather, quite the opposite! In fact, in just the first month since we secured this funding with our investors, we have already managed to create an entirely new product! It's called ToasterDream Ultra, and allows users to toast up to 8 images simultaneously for just $5.99/month...


$5.99! But please make those toast dream cartridges bigger so we don't have to replace them each week; my trashcan has been complaining about it!


So this should be it for trying to regulate stable diffusion type tech, right? If these models and their inference infra can be shrunk down to be runnable on a PS2, it doesn't seem like it's possible to stop this tech without a totalitarian surveillance state (and barely even then!).


The war on general computing has been ongoing but not made enough inroads to stop people from owning general computing devices (yet)


Indeed, the death knell could be tolling not for regulation of AI but for general-purpose computers. In AI we have four horsemen: copyright infringement, illegal pornography, fake news generation, and democratization of capabilities that large companies would rather monetize.


Given the proliferation of illegal downloads (I can get a bad cam rip of the Barbie movie on release weekend just fine, plus a VPN would protect me from DMCA takedowns), of illegal pornography (just ask a torrent tracker for the Fappening), and of fake news (esp. on e.g. Facebook) despite none of it needing to be ML-model-generated, and with companies and OSS in the space doing the democratizing and releasing complete model weights (aka stability.ai), not just lone individuals trying to do the work in isolation, are they really four horsemen, or four kids on miniature ponies?


I try to bring up as often as possible in conversation that nearly all the progress we're seeing in terms of usability and performance is precisely because of the open source support for these models.

Especially because these tools are so popular outside of the developer community, I think it's worth really beating into people's minds that without open source, AI would be in a much worse place overall.


This is more than a little melodramatic.

https://frame.work/ and the MNT Reform (https://mntre.com/) exist.


If my country decides to ban the ownership of general-purpose computers for individual persons, they would order the customs service to stop the import of any computer hardware that enabled general-purpose computing. Now I would not be able to have any computer shipped to me from outside my country, so I could no longer buy from either of those vendors you linked.

Furthermore, it also would mean that I would not be able to bring any personal computers with me when I travel to other countries. I like to travel, and I like to bring my computers when I do.

Next, it would also be dangerous to try to buy computers locally within the borders of the country. The seller might be an informant of the police, or even a LEO doing a sting operation.

And then next you have to worry about the computers you already have. If you decide to keep the computers that you had since before, after it is made illegal to own them, you will have problems even if you keep them hidden and only use them at home. Other people know about your computers. Some of those people will definitely tip off the authorities about the fact that you are known to have computers.

Let's hope it never goes as far as this :(


People would take the CPUs out of other devices and use them. A consumer grade router has most of the hardware you need to make a general purpose computer.


This is a slippery slope to the extreme.

What country outside of North Korea has banned the ownership of general purpose computers, or even considered/tried to?


Banning the import of personal computers would be absolutely disastrous for any possible economy anywhere.


That is virtually impossible because Turing-complete systems are everywhere


Just like how making weed illegal is virtually impossible because anybody can grow marijuana in their backyard.

How many regular people would risk owning Turing-complete devices that can run unauthorized software if it would net you jail time if caught? Lots of countries are already inching towards banning VPNs, corporate needs be damned.

Especially now that the iPhone has shown that having a device that can only run approved legal software covers a lot of people's everyday needs.


I'm more referring to the fact that stuff like PowerPoint and Minecraft and who knows what are Turing-complete, albeit with awful performance.

Theoretically, you can have a totally owned device managed by Big Brother, yet generate AI smut with a general-purpose CPU built in PowerPoint.

How do you possibly regulate that?


> How do you possibly regulate that?

The government could send an order to the software developer to patch out that Turing completeness, and ban the software if they don't comply.

I get what you mean; it's never possible to 100% limit things. But if you limit things 98%, so that the general public does not have access, that's more than enough for authoritarian purposes.


I wonder if there's an analogy to be made here to DRM. In theory, yes, DRM shouldn't be possible, but in practice, manufacturers have been able to hobble hardware acceleration behind a trusted computing model. Often they do a poor job and it gets cracked (as with HDCP [1] and UWP [2]).

The question in my head is whether the failures in their approaches are due to a flaw in the implementation (in which case it's practically possible to do what they're trying to do although they haven't figured out a way to do it), or whether it's fundamentally impossible. With DRM and content, there's always the analog hole, and if you have physical control over the device, there's always a way to crack the software and the hardware if need be. My questions are whether:

a) this is a workable analogy (I think it's imperfect because Gen AI and DRM are kinda different beasts)

b) even if it was, is there a real way to limit Gen AI at a hardware level (I think that's also hard because as long as you can do hardware-accelerated matmul, it's basically opening up the equivalent of the analog hole towards semi-Turing completeness, which is also hardware accelerated)

I imagine someone has thought through this more deeply than me and would be curious what they think.

[1] https://en.wikipedia.org/wiki/High-bandwidth_Digital_Content...

[2] https://techaeris.com/2018/02/18/microsoft-uwp-protection-cr...


Yeah I think it's fair to assume DRM will be a never-ending cat and mouse between developers and end-users.

Netflix for example can implement any DRM tech they want -- ultimately they're putting a picture on my screen, and it's impossible to stop me from extracting it.


Can you explain that Turing-completeness context a little bit?


You can’t regulate the ownership of computing devices.

It’s too generic. There are too many of them.


They could ban and phase out systems with insecure bootloaders. That would go a long way. Many vendors have already locked down their boot process.


So this should be it for trying to regulate theft, right? If you can open a window without any tool other than your own body, it doesn't seem like it's possible to stop theft without a totalitarian surveillance state (and barely even then!).

Or the same can be said about media "piracy". Or ransomware.

States have forever regulated things that are not possible to enforce purely technically.


But theft is quite a different thing, is it not? It's a physical act that someone can be caught engaging in - be it by another person, a guard or a security camera. Sure, the "barrier for entry" to commit it is low, but retailers et al. are doing as much as they can to raise it.

Piracy most often isn't treated as a criminal matter, but a civil one - few countries punish piracy severely, but companies are allowed to sue the pirate.

I agree with OP in principle - regulating generative AI use would be way harder than piracy or whatever, especially since all of it can be done purely locally and millions of people already have the software downloaded. And that's not getting into the reasoning behind a ban - piracy and similar "digital crimes" are banned because they directly harm someone, while someone launching Stable Diffusion on their PC doesn't do much of anything.


> few countries punish piracy severely, but companies are allowed to sue the pirate.

UNCLOS, Part VII, Section 1, Article 100 https://www.un.org/depts/los/convention_agreements/texts/unc...

>> Duty to cooperate in the repression of piracy

>> All States shall cooperate to the fullest possible extent in the repression of piracy on the high seas or in any other place outside the jurisdiction of any State.

We could have just added "private computer" to the definition of piracy, and it largely would have applied.

>> Definition of piracy

>> Piracy consists of any of the following acts:

>> (a) any illegal acts of violence or detention, or any act of depredation, committed for private ends by the crew or the passengers of a private ship or a private aircraft, and directed [...] on the high seas, against another ship or aircraft, or against persons or property on board such ship or aircraft;


..What? Digital piracy has absolutely no logical or legal connections to naval piracy, except for sharing the same name.

No sane person could ever implement anything like this. This is like saying that we could "just" add the word "digital" to the laws prohibiting murder to make playing GTA illegal.


An extra-territorial crime

Mostly committed by private citizens in pursuit of profit

That all nations of the world have an interest in suppressing to encourage free trade that economically benefits them

But which some countries at various times have a geopolitical interest in supporting

... you're right, they have no logical or legal connections at all.


You could tie essentially any two crimes together by assigning broader descriptors to them that'd boil down to "this is what countries want to discourage". Not to mention, half of this is just wrong: digital piracy most often isn't extraterritorial (it very much falls under the jurisdiction of where the piracy took place), and most individuals pirate for personal needs, not profit.

The point stands - no jurisdiction that I know of treats digital piracy similarly to naval piracy, and there is no strong argument in favor of doing so.


> digital piracy most often isn't extraterritorial (it very much falls under the jurisdiction of where the piracy took place)

The canonical eBay/PayPal fraud from eastern Europe example?

> most individuals pirate for personal needs, not profit.

But most piracy is done by individuals in pursuit of profit, not for personal need.


No, this is a lousy analogy because there is a clear harm to others in the case of theft. We've tried regulating other difficult-to-regulate things where the harm is unclear or indirect (drugs being a good example) to no avail.

Your piracy example is better. Consider that it's the rise of more convenient options (Netflix and Spotify), not some effective policy, that curtailed the prevalence of piracy.


> Consider that it's the rise of more convenient options (Netflix and Spotify), not some effective policy, that curtailed the prevalence of piracy.

The turning point was earlier than Netflix or Spotify – it was the iTunes Store. It was such a dramatic shift, people labelled Steve Jobs as “the man who persuaded the world to pay for content”.

https://www.theguardian.com/media/organgrinder/2011/aug/28/s...


Theft has a clearance rate of only 15%. Sounds like we already stopped trying to regulate most theft, in practice.


“Trying to regulate” and “succeeding in enforcing regulations” aren't the same thing.

In fact, a low clearance rate can be evidence of trying to regulate far beyond one's capacity to consistently enforce; if you weren't trying to regulate very hard, it would be much easier to have a high clearance rate for violations of what regulations you do have.


Yes, it is impossible to stop theft.


> If these models and their inference infra can be shrunk down to be runnable on a PS2, it doesn't seem like it's possible to stop this tech without a totalitarian surveillance state (and barely even then!).

The original requirement for these is 16GB of RAM, which can be had for less than $20. They run much faster on a GPU, which can be had for less than $200. Millions of ordinary people already have both of these things.


The PS2 only had 32 MB of RAM. Even the PS3 only had 256 MB.

I know it was a bit of a funny hyperbolic example, but you'd need to shrink this down way further to run it on a PS2.


I thought most of the regulatory efforts were focused on training runs getting bigger and bigger rather than generation with existing models. Is there regulation you’re aware of around use of models?


Copyright infringement is quite cheap as well. Ease and illegality are tangential. You'd still stop commercial acts even if it's impossible to fully stop something.

That said, I don't think blanket regulation is all that likely anyhow.


What sort of regulations on the tech are you talking about? It really depends on what you are trying to do whether you can or not.


Not a surveillance state, but a stop on producing new, highly performant chips should be enough.


I can't wait for Stable Diffusion for Windows 3.1


./lifts eyebrows suggestively


This is insane! 11 hours or not, I didn't expect SD could ever run on hardware like Pi Zero.


Is there any summary of the minimum requirements to run and generate with the key open source models? I wonder whether to get a 24GB MacBook Air M2, or whether I still have to use Intel and N… GPUs for this kind of text and image AI… mainly for learning the technology.


Can anyone recommend a project that uses similar optimisations but is targeted at PCs without a dedicated GPU or lots of RAM?

I do not mind waiting for my images, but I still have more to offer than a RasPi


The trade-off between memory usage and inference time uncovers a potential flaw in prioritizing resource efficiency over performance.

This would deter real-time or near real-time applications where latency is a critical factor.

Also, the confusion over the phrase "0.5-2x slower" highlights a possible lack of clarity in communication within the community, which would hinder the accurate assessment and adoption of such optimizations in practice.


You might be making some good points, but it took me about 3 attempts to understand your comment.

For example:

> Also, the confusion over the phrase "0.5-2x slower" highlights a possible lack of clarity in communication within the community, which would hinder the accurate assessment and adoption of such optimizations in practice.

Maybe instead:

> The phrase "0.5-2x slower" is confusing. You might get more adoption if the language was more clear.


This is so cool!!! Nice job on it!


I wonder if this could be accelerated with the Pi's onboard GPU somehow.


Is it not already using it? I thought ONNX had a GPU runtime the Pi could use.


When is someone going to run this on the blockchain...

(ducks)


> This is another image generated by my RPI Zero 2 in about 11 hours

So pointless. I love it


Calculator next


[flagged]


It looks like your account has been using HN primarily for promotion. This is against HN's rules - see https://news.ycombinator.com/newsguidelines.html:

"Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity."


A bit of advice... stop. Blatant self-promotion of commercial products is a hard no here. We don't want it, and it's against the rules. Delete this and the other posts before they get deleted for you, along with account closure.


Before you waste your time, this is a commercial product and you need to pay $30 to buy their model to run it.


What does it mean to be "partially uncensored"?


Okay... what's the downside?


In terms of, what's the tradeoff for the time decrease?

Apples to oranges, they're comparing 11 hours on a Raspberry Pi Zero to:

- 10 seconds on Intel i7-13700

- 3 seconds on Intel i9-9990XE

- 5 seconds on Ryzen 9-5900X

Additionally, the 2048 output is accomplished by using RealESRGAN to upscale 2x, which isn't close to what a native 2048 diffuser's quality would be.

It does look interesting and is an achievement, in the sense that it's hard to write this stuff from scratch, much less in pure C++ without relying on a GPU.


Ah. I use RealESRGAN (or one of its descendants, rather) as a first pass upscaler before high-resolution diffusion. If you skip the diffusion step, of course it'll be faster.


Unrelated, but now I'm curious how long it would take on an RPi 4 or 5.


yeah me too...I've been very negative about the edge, it got overhyped with the romanticization of local LLMs, but there's a bunch of stuff coming together at the same time...Raspberry Pi 5...Mistral 7B Orca is my 20th try of a local LLM...and the first time it handled simple conversation with RAG. And new diffusion, even every 2 hours, is a credible product, arguing about power consumption aside...


Also, it's $29 to get the pre-trained model assets needed to run the code.


Why does this one need pretrained models? Can't we use any of the thousands of already available ones?


These are mostly Stable Diffusion-architecture models, but it's not the only game in town.


Hard to tell since there is zero documentation in regard to models.


I see you're opting for AGPL on a codebase that is designed to be embedded as a library. Genuine question, what kind of user did you have in mind when you decided on this license?


Are those 2048 x 2048 images still sensible? SD 1.5 is best used at 512x512 and may produce sensible images up to 768. It generates monstrosities above that. Similarly, SD XL is good up to 1024.


These are limitations of a single text-to-image generation, which is the least interesting way to use those models. When guided by a previous low-res generation, it won't fall apart at arbitrary resolutions; that's how all diffusion upscalers work. Just don't expect to be able to fit every detail in one pass; use multiple passes (that's how detailers work).


> Are those 2048 x 2048 images still sensible? SD 1.5 is best used at 512x512 and may produce sensible images up to 768. It generates monstrosities above that. Similarly, SD XL is good up to 1024.

You can do significantly higher resolutions with various tricks like tiled diffusion, which is also a memory efficiency hack. (The stable-diffusion-webui tiled diffusion extension uses 2560×1280 direct [no upscale step] generation with an SD 1.5-based model as one of its examples.)
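
The core idea behind the tiled approach, stripped of any particular UI or model (a minimal sketch; denoise_tile is a placeholder standing in for the actual diffusion step on one tile, not the extension's real code):

    import numpy as np

    def denoise_tile(tile, prompt):
        # Placeholder: run the diffusion model on a single tile,
        # conditioned on the low-res image and the prompt.
        return tile

    def tiled_pass(image, prompt, tile=512, overlap=64):
        """Process a big image in overlapping tiles so only one tile has to be
        resident at a time; cross-fade the overlaps so the seams don't show."""
        h, w, _ = image.shape
        out = np.zeros_like(image, dtype=np.float32)
        weight = np.zeros((h, w, 1), dtype=np.float32)
        step = tile - overlap
        for y in range(0, h, step):
            for x in range(0, w, step):
                y1, x1 = min(y + tile, h), min(x + tile, w)
                patch = denoise_tile(image[y:y1, x:x1], prompt)
                # Linear ramps at the tile borders so overlapping tiles blend smoothly.
                wy = np.minimum(np.arange(y1 - y) + 1, overlap) / overlap
                wx = np.minimum(np.arange(x1 - x) + 1, overlap) / overlap
                mask = np.minimum.outer(np.minimum(wy, wy[::-1]), np.minimum(wx, wx[::-1]))[..., None]
                out[y:y1, x:x1] += patch * mask
                weight[y:y1, x:x1] += mask
        return out / np.maximum(weight, 1e-8)

The memory saving comes from the tile size bounding the working set; the global consistency comes from the per-tile step being conditioned on the already-composed low-res image.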


Upscaling the image in chunks creates loads of semantic issues. For example, the bottom of a tree might look far away in the mountains but its top will be near you. You don't see problems like these in non-scaled images.


> Upscaling the image in chunks creates loads of semantic issues.

No, tiled upscaling generally does not have that problem significantly (compared to direct generation at the native model-supported size, which doesn't completely avoid that kind of issue), since the composition on that level is set before the upscale (direct tiled generation does, if you aren't using something like ControlNet to avoid it).

> You don't see problems like these in non-scaled images.

You actually occasionally do, but it's fairly rare.


It's conditioned on the low-res input, so if that doesn't have semantic discontinuities, it doesn't happen. It will eventually happen if you continue doing this indefinitely, but with a reasonable size-to-tile ratio (say <6x) it works well. With manual or object-detection-assisted tiling and proper conditioning (a ControlNet side channel, especially if it's a custom-trained ControlNet/T2I) it can be pushed further.


> Similarly SD XL is good upto 1024.

I don't think that's right. SDXL is good starting from 1024. Anything lower generates a useless mess.


SDXL's native trained resolution for a 1:1 aspect ratio is 1024x1024, like SD 1.5's is 512x512. Like SD 1.5, you can go a bit below or above that without too much problem; unlike SD 1.5, SDXL also has significant training on a fairly wide set of other resolutions (ranging from 2048x512 to 512x2048) at approximately 1 mebipixel, and they can be treated as starting points as easily as 1024x1024 can. I think SDXL has a narrower (proportionate) range of viable resolutions around its starting points, but that's offset by having more than one "starting point".
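
To make "approximately 1 mebipixel" concrete, here is one rough way to enumerate such resolutions (an illustration only, not the exact bucket list from the SDXL paper):

    # Walk widths from 512 to 2048 in steps of 64 and pick the height (also a
    # multiple of 64) that keeps the pixel count closest to 1024*1024.
    target = 1024 * 1024
    for w in range(512, 2048 + 1, 64):
        h = max(512, min(2048, round(target / w / 64) * 64))
        print(f"{w}x{h}  ({w * h / target:.2f} of target)")
    # 512x2048 and 2048x512 at the extremes, 1024x1024 for 1:1, and so on.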


Maybe next time shamelessly mention that you sell models for $29 and that there are no instructions to convert from vanilla SD.


I can't believe this is still the top comment. I wish I hadn't edited down my reply; shoulda just said "this is stupid, you're comparing your desktop to a Raspberry Pi"

OnnxStream is way cooler and more impressive than another commercial wrapper around SD. Doesn't deserve this.



