Run Stable Diffusion on Your M1 Mac’s GPU (replicate.com)
1007 points by bfirsh on Sept 1, 2022 | 401 comments

Magnusviri[0], the original author of the SD M1 repo credited in this article, has merged his fork into the Lstein Stable Diffusion fork.

You can now run the Lstein fork[1] with M1 as of a few hours ago.

This adds a ton of functionality - GUI, Upscaling & Facial improvements, weighted subprompts etc.

This has been a big undertaking over the last few days, and I highly recommend checking it out. See the Mac M1 README [2].

[0] https://github.com/magnusviri/stable-diffusion

[1] https://github.com/lstein/stable-diffusion

[2] https://github.com/lstein/stable-diffusion/blob/main/README-...

Brilliant, thank you! I just got OP's setup working, but this seems much more user-friendly. Giving it a try now...

EDIT: Got it working, with a couple of pre-requisite steps:

0. `rm` the existing `stable-diffusion` repo (assuming you followed OP's original setup)

1. Install `conda`, if you don't already have it:

    brew install --cask miniconda

2. Install the other build requirements referenced in OP's setup:

    brew install cmake protobuf rust

3. Follow the main installation instructions here: https://github.com/lstein/stable-diffusion/blob/main/README-...

Then you should be good to go!
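Before starting the main install, it can help to confirm the prerequisites from the steps above are actually on your PATH. A minimal sketch, assuming the Homebrew packages expose the executables `conda`, `cmake`, `protoc` and `rustc` (those executable names are my assumption, not something the steps state):

```python
import shutil

# Executable names assumed to be provided by the brew packages above
# (miniconda -> conda, cmake -> cmake, protobuf -> protoc, rust -> rustc).
REQUIRED_TOOLS = ["conda", "cmake", "protoc", "rustc"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All prerequisite tools found.")
```

The `which` parameter only exists so the lookup can be swapped out when testing; in normal use you would call `missing_tools(REQUIRED_TOOLS)`.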

EDIT 2: After playing around with this repo, I've found:

- It offers better UX for interacting with Stable Diffusion, and seems to be a promising project.

- Running txt2img.py from lstein's repo seems to run about 30% faster than OP's. Not sure if that's a coincidence, or if they've included extra optimisations.

- I couldn't get the web UI to work. It kept throwing the "leaked semaphore objects" error someone else reported (even when rendering at 64x64).

- Sometimes it rendered images just as a black canvas, other times it worked. This is apparently a known issue and a fix is being tested.

I've reached the limits of my knowledge on this, but will be following closely as new PRs are merged in over the coming days. Exciting!

I followed all these steps, but I got this error:

> User specified autocast device_type must be 'cuda' or 'cpu'

> Are you sure your system has an adequate NVIDIA GPU?

I found the solution here: https://github.com/lstein/stable-diffusion/issues/293#issuec...

I had to manually install pytorch for the preload_models.py step to work, because ReduceOp wasn't found. Why even use anaconda if all the dependencies aren't included? Every time I touch an ML project, there's always a python dependency issue. How can people use a tool that's impossible to provide a consistent environment for?

You are completely correct that there are a lot of dependency bugs here, I would just like to pedantically complain that the issue in question is PyTorch supporting MPS, which is basically entirely a C++ dependency issue rather than a Python one. (PyTorch being mostly written in C++ despite having "py" in the name.) And yeah the state of C++ dependency management is pretty bad.

FYI: black images are not just from the safety checker.

Yes, the safety checker will zero out images, but you can just turn it off with an “if False:”. Mostly, black images are due to a bug, which is especially frustrating because it turns up on high step counts and means you’ve wasted time running it.

My experience has been roughly 2-4/32 of an image batch comes back black at the default settings, regardless of the prompt.

Just stamp out images in batches and discard the black ones.
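Discarding the black ones can be automated. A minimal sketch, assuming you have already loaded each image's pixels into a list of (R, G, B) tuples with whatever imaging library you use; the function names and the dict-of-pixels layout are hypothetical:

```python
def is_black(pixels, threshold=0):
    """True if every channel of every pixel is <= threshold.

    `pixels` is a sequence of (R, G, B) tuples; how you load them from
    an image file is up to your imaging library.
    """
    return all(channel <= threshold for pixel in pixels for channel in pixel)


def keep_non_black(batch):
    """Filter a dict of {name: pixels}, dropping all-black entries."""
    return {name: px for name, px in batch.items() if not is_black(px)}
```

Raising `threshold` a little would also catch near-black outputs, at the risk of dropping legitimately dark images.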

I was able not to have black images by using a different sampler

--sampler k_euler

full command:

"photography of a cat on the moon" -s 20 -n 3 --sampler k_euler -W 384 -H 384

I tried that as well but resulted in an error:

AttributeError: module 'torch._C' has no attribute '_cuda_resetPeakMemoryStats'


Hi jastanto. I'm on an Intel Mac and running into the same problem. Did you find a workaround?

To get past `pip install -r requirements` I had to muck around with CFLAGS/LDFLAGS because I guess maybe on your system /opt/homebrew/opt/openssl is a symlink to something? On mine it doesn't exist, I just have /opt/homebrew/opt/openssl@1.1 symlinked to /opt/Cellar/somewhere.

The command that finally worked for me:

  python3 -m venv venv
  . venv/bin/activate
  CFLAGS="-I /opt/homebrew/opt/openssl@1.1/include" LDFLAGS="-L /opt/homebrew/opt/openssl@1.1/lib -L/opt/homebrew/Cellar/openssl@1.1/1.1.1q/lib -lssl -lcrypto" PKG_CONFIG_PATH="/usr/local/opt/openssl@1.1/lib/pkgconfig" GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1 GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1 pip install -r requirements.txt

Thank you, with those extra steps I got it working myself now. At least, I think thank you. My work productivity for the next few days might not agree.

Instructions don't work here, dead ends at

  FileNotFoundError: [Errno 2] No such file or directory: 'models/ldm/stable-diffusion-v1/model.ckpt'

Looks like there's a step missing or broken at downloading the actual weights.

Going up to the parent repo points at a bunch of dead links or Hugging Face pages.

You have to download the model from the huggingface[0] site first (requires a free account). The exact steps on how to link the file are then detailed here[1].

[0] https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...

[1] https://github.com/lstein/stable-diffusion/blob/main/README-...

I did this but then moved the directory. When re-linking and checking with ls for the path I thought "oh, alright, it's already there". Oh well, better check with ls -l earlier next time.
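A plain existence check hides this failure mode, because a symlink whose target has moved still shows up in a directory listing. A small sketch that tells the cases apart, using the path from the error message above (the function name is hypothetical):

```python
import os


def describe_path(path):
    """Classify a path as 'ok', 'dangling symlink', or 'missing'."""
    if os.path.exists(path):       # follows symlinks
        return "ok"
    if os.path.islink(path):       # link exists, target does not
        return "dangling symlink"
    return "missing"


if __name__ == "__main__":
    # Path taken from the FileNotFoundError above.
    print(describe_path("models/ldm/stable-diffusion-v1/model.ckpt"))
```

Run it from the repository root so the relative path resolves the same way it does for the scripts.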

Can you describe how you did (or are doing) this? Do you now need to use conda (as opposed to OP's pip-only version)?

See my edit for more info. (Just ironing out a couple of other issues I've found, so might update it again shortly)

I only get black images.

You have to disable the safety checker after creating the pipe

Nice. We'll get this guide updated for this fork. Everything's moving so fast it's hard to keep track!

We struggled to get Conda working reliably for people, which it looks like lstein's fork recommends. I'll see if we can get it working with plain pip.

I really appreciate the use of pip > conda. Looking forward to the update for the repo!

Running lstein's fork with these requirements[0] but seeing this output[1]. Same steps as original guide otherwise.

Anyone got any ideas?

[0] https://github.com/bfirsh/stable-diffusion/blob/392cda328a69...

[1] https://gist.github.com/bfirsh/594c50fd9b2e6b173e31de753a842...

Same output for me also.

EDIT: https://github.com/lstein/stable-diffusion/issues/293#issuec... fixed it for me.

Boom - nice. Here's a fork with that: https://github.com/bfirsh/stable-diffusion/tree/lstein

Requirements are "requirements-mac.txt" which'll need subbing in the guide.

We're testing this out with a few people in Discord before shipping to the blog post.

Thank you for these guides!

Which Discord?

Check my comment alongside yours, I got Conda to work but it did require the pre-requisite Homebrew packages you originally recommended before it would cooperate :)

I couldn't get the setup process working until I switched the Python version to 3.10, as the scripts were relying on typing features that were added in 3.10 even though the yml file specified 3.9. Was strange.

Conda is recommended because it starts from a clean environment so you're not debugging 13 other experiments the user has going on.

are there benchmarks?

I was following the GitHub issue and the CPU-bound one was at 4-5 minutes, the MPS one was at 30 seconds, then 18 seconds, and people were still calling that slow.

What is it currently at now?

and I don't know what "fast" is, to compare

What are the Windows 10 with nice Nvidia chips w/ CUDA getting? Just curious what's comprehensive

> What are the Windows 10 with nice Nvidia chips w/ CUDA getting?

Are you referring to single iteration step times, or whole images? Because obviously it depends on the number of iteration steps used.

Windows 10, RTX 2070 (laptop model), lstein repo. I get about 3.2 iter/sec. A 50 step 512x512 image takes me 15 seconds.

I’m referring to there being a community effort to normalize performance metrics and results at all, with the M1 devices being in that list as well, so that we don’t have to ask these questions to begin with

Are you aware of any wiki or table like that?

Huh, that’s the same speed I get on Colab. Pretty good.

I only run 1 sample at a time (batch size 1), forgot to mention that, and that affects the step time.

It looks like each additional image in a batch is cheaper than the 1st image. For example if I reduce my resolution so I can generate more in a single batch

1 image, 50 steps, 320x320: 5s

2 images, 50 steps, 320x320: 8s

3 images, 50 steps, 320x320: 11s

4 images, 50 steps, 320x320: 14s

And the trend continues, and my reported iteration/sec goes down as well. It's not accounting for the fact that with steps=50 and batch size=4 it's actually running 200 steps, just in 4 parallel parts.
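Worked through with the four timings above and nothing else assumed: each extra image costs about 3 s on top of roughly 2 s of fixed overhead, so total throughput rises even though the reported per-batch iteration rate falls.

```python
# Timings reported above: batch size -> seconds, 50 steps at 320x320.
timings = {1: 5, 2: 8, 3: 11, 4: 14}
STEPS = 50

# Extra seconds for each image added to the batch.
marginal = [timings[n] - timings[n - 1] for n in range(2, 5)]

# Total denoising steps per second once the batch size is counted.
effective_rate = {n: n * STEPS / t for n, t in timings.items()}

if __name__ == "__main__":
    print("seconds per extra image:", marginal)
    for n, rate in effective_rate.items():
        print(f"batch {n}: {timings[n] / n:.2f} s/image, {rate:.1f} steps/s")
```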

Wow, that is over twice as fast as my Windows 11, RTX 3080ti

I just commented on another sibling comment (too late to edit the first one), but I forgot to mention my batch size is only 1. I think most people use batch size 4, so basically multiply my time by your batch size for a real comparison.

It was my bad, my script was still running a different fork. Seeing <10 second times with those parameters now. 13.6 seconds for a 3072 × 2048 upscaled image, which I'm particularly happy about.

Wait, what? On my M1 iMac I’m getting about 25 minutes. What am I doing wrong?

It's falling back to CPU. Follow the instructions to use a GPU version - sometimes it's even a completely different repo, depending on whose instructions you're following.
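One way to see whether a CPU fallback is likely is to ask PyTorch which backends it reports. A minimal sketch, assuming a PyTorch build recent enough to expose `torch.backends.mps`; this only shows what PyTorch reports and is not how any particular fork picks its device:

```python
def pick_device(cuda_available, mps_available):
    """Prefer CUDA, then Apple's MPS backend, else fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Query PyTorch if it is installed; report 'cpu' otherwise."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent in older builds
    return pick_device(
        torch.cuda.is_available(),
        mps is not None and mps.is_available(),
    )


if __name__ == "__main__":
    print("Would run on:", detect_device())
```

If this prints `cpu` on an M1 machine, the installed PyTorch build most likely lacks MPS support, which would match the 25-minute timings.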

Around 6 seconds.

I ran into:

ImportError: cannot import name 'TypeAlias' from 'typing' (/opt/homebrew/Caskroom/miniconda/base/envs/ldm/lib/python3.9/typing.py)

I followed the conda instruction which uses Python 3.9 and ran into the same issue. The workaround is to import TypeAlias from typing_extensions:



  # Before (fails on Python 3.9):
  #   from typing import Optional, Callable, TypeAlias
  # After:
  from typing import Optional, Callable
  from typing_extensions import TypeAlias

This issue is tracked in https://github.com/lstein/stable-diffusion/issues/302
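If you would rather have one import that works on both 3.9 and 3.10, a fallback import is a common pattern. A minimal sketch, assuming `typing_extensions` is installed on the older interpreter; the `FloatOperator` alias is made up for illustration and is not the one in the repo:

```python
from typing import Callable

try:
    from typing import TypeAlias  # available from Python 3.10
except ImportError:
    from typing_extensions import TypeAlias  # backport for 3.9 and older

# Hypothetical alias for illustration; the real one lives in sampling.py.
FloatOperator: TypeAlias = Callable[[float], float]
```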

you can also just change the python version in the yml file to 3.10.4 and it'll work

I ran into this. You need Python 3.10. I had to edit environment-mac.yaml and set python==3.10.6 ...

I changed the dependency to 3.10.4 (tried 3.10.6 as well), installed python 3.10.4, deactivated and activated ldm environment, but it still uses python 3.9

Can you delete your environment and try again?

Since I don't know how to use conda, I had to struggle a bit to learn how to recreate the environment. Here are the commands that worked for me, for future reference:

  conda deactivate
  conda env remove -n ldm
Then, again:

  CONDA_SUBDIR=osx-arm64 conda env create -f environment-mac.yaml
  conda activate ldm

Thanks, it worked

This worked for me too.

TypeAlias is only used once, you can open sampling.py and remove the import on line 10 and the usage on line 14:

  from typing import Optional, Callable

  from . import utils

  TensorOperator = Callable[[Tensor], Tensor]

What do I need for inpainting? Is there a source for the models/ldm/inpainting_big/last.ckpt file?

I used this:

  wget -O models/ldm/inpainting_big/last.ckpt https://heibox.uni-heidelberg.de/f/4d9ac7ea40c64582b7c9/?dl=...

Found it here: https://huggingface.co/spaces/multimodalart/latentdiffusion/...

This worked afterwards: python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results

Everything works except it only generates black images.

did you run

python scripts/preload_models.py

python scripts/dream.py --full_precision ?

Disable safety check

What's the performance of these models, and how much PC spec is required for sane operation?


Everyone posting their pip/build/runtime errors is everything that's wrong with tooling built on top of python and its ecosystem.

It would be nice to see the ML community move on to something that's actually easily reproducible and buildable without "oh install this version of conda", "run pip install for this package", "edit this line in this python script".

An interesting question to me is whether it's actually in part fundamental to the success of Python. There are plenty of "clean" ecosystems that avoid at least some of these issues. But in general they fail to thrive in these types of spaces, and people keep coming back to ecosystems that are messier.

Is it possible that creativity and innovation actually require a level of chaos to succeed? Or alternatively, that chaos is an inevitable byproduct of creativity and innovation, and any ecosystem where these are heavily frowned on deters the type of people who actually drive the cutting edge forward?

Just putting it out there as food for thought. In practice this quite often means it actually makes a lot of sense to do your prototyping and experimenting in one ecosystem, but then leave that behind to deploy your production workloads where possible.

JavaScript is a much cleaner ecosystem but for some reason there's a long running stigma against it. It would work fine for this use case.

You can have an ecosystem that's chaotic (and vibrant) without its core pillars being chaotic (and shaky)

It's not just Python, this is the experience practically everywhere and that's why people create containers etc. It's excruciatingly hard to set up the environment to start doing anything productive these days; you can't just start coding unless you use an IDE like Xcode or PyCharm.

JS isn't perfect, but it's so much easier to deal with than Python in these regards.

I've had the exact same issues with the JS ecosystem (ran into a problem where npm wouldn't work but yarn did, still haven't figured out why).

Both are easy and reliable with a few months of experience. Both are terrible if you rarely ever use them.

That's probably fair. I've found pyenv/venv/poetry/pipenv/pip/conda/etc more frustrating than ESM/CJS/yarn/npm/PNP/etc, but then I just do a lot less Python stuff than JS these days.

Js libs don't need to care for things like system packages and drivers as much as Python ML does.

Ever tried running JS libs with C bindings? You're going to run into the exact same problems as you have running python code with C bindings.

CommonJS vs ES6 module loading is already a nightmare.

Not sure about that. Golang or Rust have no such problem in my experience.

Maybe the difference is the legacy and the community? Languages like Go and Rust are geared toward software engineers and are pretty new. Meanwhile Python, JS, or anything very popular gets used as a tool by people who are great at things outside software engineering. They created very useful libraries and tools regardless, which usually means piles and layers of very useful code with poor engineering.

I don't think this is particularly fair. This is literally hours old, and people installing now are really debugging rather than installing a "finished" build.

The forks are weird mashups of bits of repos, and running on a M1 GPU is something that barely works itself.

Give it maybe 3 months and it will be much smoother.

I think it is fair, actually. This is not unique to hours-old python projects, it's a common theme among almost all python tools I've used.

I have a suspicion that something written in Julia, Go, Rust, or possibly even C wouldn't have nearly this many issues. I'm not talking about debugging the actual functionality of the software, but rather the environment and tooling surrounding the language and software built with it.

This project in particular should be an easy case because you know the hardware you'll be running on ahead of time.

I'm ranting a bit, but I've tried so many tools based on python and almost _none_ of them built/installed/ran correctly on the happy path laid out in each project's readme.

Anyway, sorry, rant over.

Yes there is a fair amount of truth in that.

I do think that experience helps here. I have a recipe for installing Python that works on most python projects most of the time.

  git clone <project>
  python3 -m venv ./venv
  source ./venv/bin/activate
  pip install -r requirements.txt
  deactivate # need to do this to include the correct command line tools in path (eg Jupyter)
  source ./venv/bin/activate

On a Linux or Intel Mac system this works with pretty much every reasonable Python project.

On M1 Macs the situation isn't great at the moment, though.

As if compiling any given C program wasn't also a crap shoot.

People also tend to forget that these ML packages are ridiculously complicated, and have a lot of dependencies not just on other libraries but on particularities of your system.

That, and ML researchers can't also be expected to be good at everything. They are busy doing ML research and waiting to be given a recipe to follow.

Meanwhile I can put together a Python package in my sleep that works perfectly on pretty much any system, but I don't know a damn thing about Autotools and would probably make a total mess if I tried use it. Or CMake. Or whatever Java uses.

This "look Python bad!!" stuff has some merit (if only historical), but it mostly amounts to FUD and does a big disservice to the people who have worked hard over the past few years to get everything fixed up.

Plus at least one of the issues I see here is because people disregarded the instructions and used Python 3.9 even though it says to use 3.10 because the code assumes it's running under 3.10.

Also, sorry to double post, but you don't actually need to activate the venv. You can just invoke `venv/bin/pip`. That should hopefully save you a bit of time and annoyance.

That's bad advice for two reasons:

Firstly, inexperienced Python users (the kind who need recipes to follow) will also blindly follow installation instructions on websites. These all say "pip install xxx" and users will inevitably forget they need to change the path to make that work for their environment. By getting into the habit of activating the environment this problem goes away.

Secondly, you do need to activate the venv so that command line tools in ./venv/bin are used (instead of whatever is in your default path).

This includes both the correct version of python (if you set it when you setup the virtual environment) and importantly jupyter. From personal experience there's a whole set of pain when you don't realize your jupyter isn't the one inside the virtual environment and so is using other libraries.

Because many shells cache the location of executables, if you don't deactivate and reactivate your virtual environment after installing Jupyter into it then it may (in some circumstances) use a globally installed version.
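A quick way to check which Jupyter you would actually get is to compare where the executable resolves with the running interpreter's environment. A minimal sketch; the helper names are hypothetical, and note that this looks at PATH only, so it does not see the shell's own command cache:

```python
import os
import shutil
import sys


def is_inside(path, prefix):
    """True if `path` is located under the directory `prefix`."""
    path, prefix = os.path.abspath(path), os.path.abspath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


def tool_from_this_env(tool):
    """Check whether `tool` on PATH belongs to the running interpreter's env."""
    found = shutil.which(tool)
    return found is not None and is_inside(found, sys.prefix)


if __name__ == "__main__":
    print("jupyter comes from this env:", tool_from_this_env("jupyter"))
```

Run it with the virtual environment's own `python`; a `False` result suggests a globally installed Jupyter is shadowing the one in the environment, or that none is installed there.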

Noted! I'll try this out next time I have to use a python project and see how it goes.

Notably, it works perfectly on the linked repo for this story (unlike the conda version which is the first comment and is a complete mess).

My anecdotal experience is that it briefly gets better, then much worse. Build scripts downloading tarballs from dead URLs and dependencies on unspecified versions of libraries that have since had breaking API changes are frequent issues.

…it’s still awesome that people put in the effort to do these things at all, but the tools often make me feel like an archaeologist trying to piece together which ancient artifacts are missing and how they were all supposed to fit.
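The "unspecified versions" problem is at least easy to spot ahead of time. A rough sketch that flags lines in a requirements file lacking an exact `==` pin; it is deliberately simple and does not understand every requirement syntax (URLs, environment markers and so on):

```python
def unpinned(requirement_lines):
    """Return requirement lines that lack an exact `==` version pin."""
    flagged = []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            flagged.append(line)
    return flagged
```

For a real file you could pass `open("requirements.txt").read().splitlines()`; anything it returns is a dependency that may resolve to a different version a year from now.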

I ran into many of these same problems when trying to replicate ROS environments five years ago. It's not Python's fault, it's just academics being bad at releasing software. After all, why would they?

In a university, ten steps to replicate the PI's personal environment is perfectly fine. How would releasing a single binary make their life any easier? Why would they bother?

I found a pretty great Docker image for the SD webui. I was forced to extreme measures since NixOS isn't super friendly with Conda (and I was particularly lazy).

Worked out fine in the end, though. Highly recommended if you're on an Nvidia rig: https://github.com/AbdBarho/stable-diffusion-webui-docker

Docker the silver bullet /s

Looking through the top level comments, there are only 2 like that and they are about build errors in dependent native libraries written in other languages, which is not really the Python ecosystem's fault (as the author has chosen not to distribute prebuilt stuff).

I disagree, dependencies are sort of a universal problem. Ever had to set LD_LIBRARY_PATH?

Python is pretty much innocent here.

> Python is pretty much innocent here.

Could not possibly disagree more. https://xkcd.com/1987/

The problem with dependencies is that for very bad reasons people don’t ship them.

Someone needs to package SD with a full copy of the Python runtime and every dependency. This should be the default method of distribution.


There are working CUDA Docker images that work both on Windows and Linux for Stable Diffusion, so the package including dependencies already exists. It's just that the standard packaging method doesn't work well on Apple hardware.

Just don't mess up your system. The tutorial linked works perfectly fine in a clean python environment.

That XKCD is more about messing up your system by not knowing what you're doing but randomly following shitty tutorials which suggest stuff that collide with each other... Python's only fault in this is that it's a simple language and thus attracts people who aren't software engineers (students and math majors) and thus mostly don't know or care how to keep your system clean but love writing tutorials.

It's pretty easy to keep your pythons clean, don't use conda, never run pip with sudo, never run pip with --user..., never run pip outside a virtualenv (a good safety measure for that is to have pip point to nothing in your user shell, you can access system python with python3/pip3 if needed)... To check your python is clean, create a new environment and run pip freeze, it should output nothing.

pyenv is a non-destructive system for managing multiple pythons and virtualenvs (all pythons and envs get installed into ~/.pyenv), pip is a good system for distributing dependencies (when library authors don't skip out on providing binary wheels, and software authors use pip freeze to generate requirements.txt files).
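The "fresh environment should be empty" check can also be done from inside Python. A minimal sketch using the standard library's `importlib.metadata`; the baseline set of packages treated as harmless is my assumption:

```python
from importlib import metadata

# Packages a fresh virtualenv typically ships with (assumed baseline).
BASELINE = {"pip", "setuptools", "wheel"}


def extra_packages(names, baseline=BASELINE):
    """Return installed names beyond the baseline, sorted, case-insensitive."""
    return sorted({n.lower() for n in names if n} - baseline)


def installed_names():
    """Names of all distributions visible to this interpreter."""
    return [dist.metadata["Name"] for dist in metadata.distributions()]


if __name__ == "__main__":
    extras = extra_packages(installed_names())
    print("clean" if not extras else f"{len(extras)} extra packages: {extras}")
```

Run with the interpreter of a newly created environment, anything other than "clean" means packages are leaking in from somewhere else.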

If your solution is “spend a lot of time learning a lot of rules and make sure you never accidentally do something you shouldn’t” then your solution is fragile and shitty.

You know what’s radically simpler? Shipping your damn dependencies so all any user needs to do is double-click the launcher and it will always work no matter what weird and bad things your user has done. The fact that “keep your system clean” is a thing anyone thinks about ever is a catastrophic failure.

stable diffusion is a work of researchers, not engineers, it's not a released product, it's an attachment to a research paper... researchers shouldn't need to know anything about releasing a software product...

you see it here on hacker news, because it's open source and thus hackers can and do already play with it..

as it's an interesting piece of software, I'm pretty sure someone will eventually make an easy to use GUI app based on it that ships all dependencies, but it's not yet at that stage,

this article is literally about the first easy (for a software engineer who knows python) way to run it on M1 macs, these people literally just discovered which versions of their dependencies work and share their work to invite other people to help them build something on top of it....

It's insane to me how fast this is moving. I jumped through a bunch of hoops 2-3 days ago to get this running on my M1 Mac's GPU and now it's way easier. I imagine we will have a nice GUI (I'm aware of the web-ui, I haven't set it up yet) packaged in a Mac .app by the end of next week. Really cool stuff.

I hope this kickstarts some kind of M1 migration. There are so many ML projects I'd like to try, but they all depend on CUDA.

Yep, I was just thinking the same thing. M1/M2 appears to be a huge untapped resource for ML stuff as this proves. I maxed out my MBP Max and this is probably the first time I'm actually fully using the GPU cores and it's pretty freaking cool. Creating landscapes or fictional characters (think D&D) is already super fun, I look forward to playing with img2img some more as well.

The performance gap to the top-end Nvidia cards will get much larger as they release new cards later this year, though.

Maybe, but I can buy a Mac, you just order one from Apple.

An RTX 3090ti with 24GB of VRAM is widely available now that the crypto markets have crashed for $1150 or so. They were $2500 a year ago if you could find them.

A twist on the above comment: I _already own_ an M2 Mac, but I'm never gonna buy a high-end GPU to play around with this sort of tech. If the things people (who aren't gamers, crypto miners, or ML researchers) already own can be useful for some hobby-level work in the space, we'll see a lot more work and experimentation in the space. It's super exciting stuff.

To be fair, many people now have a gaming PC. Perhaps more than have their own M1 Pro/Max.

Perhaps, you think? ^^

The gaming PC market is huge. Have a look at https://www.businesswire.com/news/home/20210329005150/en/Glo... to get some numbers. There is a list of shipments in a year. Apple sells a lot of units, but not nearly enough to match the accumulated household supplies of gaming PCs - in how much, 2 years, while gaming PCs and laptops are still being sold?

Don't take that comment personally please, but this "perhaps" is a perfect example of being in a complete Apple bubble. It's so far from reality it is frankly unfathomable.

It’s different kinds of people. I know more people with an apple mx computer than with a gaming computer (people have a console to game or nothing at all).

Arguably the cost effective solution is to use cloud services, since we're talking just a few seconds difference (or you might be lucky like one HN reader who got allocated an A100 today.)

But to play devil's advocate there are clear strengths available to the different platforms. PCs can readily upgrade into high end GPUs, but the compromise is that this becomes a requirement as basic GPUs don't feature enough VRAM and CPU-only mode is woeful.

On the mac side of things, the GPU is not going to be the latest and greatest, but the M-series features unified memory, so a relatively normal M-series mac is going to have the necessary hardware to load the models. Not the fastest (but still fast), and ready to go. (Also as it stands the M-series can offer additional pathways to optimisation.)

> Arguably the cost effective solution is to use cloud services

And running it on a fresh setup might help with the ‘works on my machine’ type of bugs that are being reported.

It's not clear if the shortages will happen with this new release as they did last time. Ethereum mining is going away and not as many people are stuck at home because of Covid. On the other hand, the performance increase looks to be substantial, increasing the demand.

Same here. My M1 Max's GPUs were basically idling until this came along!

Do they depend on CUDA, or are they just much better tuned for NVIDIA cards? I thought the whole ML ecosystem was based on training models and then running them on frameworks, where model was sorta like data and the framework handles the hardware? (albeit with models that can be tweaked to run more efficiently on different hardware) (I don't really know the ecosystem so it is definitely possible that they are more closely tied together than I thought).

The latter. The major frameworks, at least, can be run in CPU-only mode, with a hardware abstraction layer for other devices (like CUDA-capable cards, TPUs etc). So practically it means you need an Nvidia GPU to get anywhere in a reasonable amount of time, but if you're not super dependent on latency (for inference) then CPU is an option. In principle, CPUs can run much bigger model inputs (at the expense of even more latency) because RAM is an order of magnitude more available typically.

I was thinking (as someone who knows nothing about this really) that the Apple chips might be interesting because, while they obviously don't have the GPGPU grunt to compete with NVIDIA, they might have a more practical memory:compute ratio... depending on the application of course.

Is there any blocker to having VRAM swap (to RAM or SSD)? It would make processing much slower, but it should be better than nothing (hitting OOM) or, alternatively, running on the CPU (even slower).

Not sure. I suspect the issue would be lots of memory transfer between the GPU and the CPU, because downstream layers usually need previous layer outputs. It would probably depend on the receptive field of the network? Also on how expensive memory transfer is, maybe it's worth it in some cases. But there's no reason why you couldn't run say the first big layers on the CPU and then treat deeper layers (which may take a smaller input) as a separate network to run on the GPU. I suppose you want the largest subgraph in your model that can fit in available VRAM. Certainly the Coral/EdgeTPU will dispatch unsupported operations to the CPU but that affects all ops beyond that point in the computation graph.

From my experience the bigger frameworks may have support for non-CUDA devices (that is not just the CPU fallback) but many smaller libraries and models will not, and will only have a CUDA kernel for some specialized operation.

I encounter this all the time in computer vision models.

I'd rather see something more platform agnostic. I'm sad OpenCL isn't a bigger success.

OpenCL still works amazingly well on all platforms (e.g. two commercial programs of mine), it's just that everyone keeps saying it's dead and refusing to use it :(

Gods, yes. I've been trying to get various kinds of deepfake-related projects to work on M1 (or even just on Mac—the Intel Macs haven't come with Nvidia cards for years now) for some time now, as we're trying to generate video stimuli that present different people doing the same things, or the same person doing several different things, for psych research.

It's been an exercise in frustration.

It's not really better to move from one closed eco system to another. We should collectively agree to strengthen more open platforms, shouldn't we?

Just yesterday I read another comment on HN saying we will have to wait another decade before being able to train it in someone's "basement" (https://news.ycombinator.com/item?id=32658941). I made a bookmark for myself (https://datum.alwaysdata.net/?explorer_view=quest&quest_id=q...) to look for data that helps estimate when it will be feasible to run Stable Diffusion "at home". I guess it's already outdated!

To run stable diffusion at home you have to download the model file, which took the equivalent of tens of thousands of hours spread across cloud provided GPUs.

If the model file just vanished from everyone's hard drive one day, and cloud providers installed heuristics to detect and ban image dataset training, retraining the model file would actually take decades for any consumer, even an enthusiast with a dozen powerful GPUs. The image dataset alone is 240TB.
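How long a retrain would take depends heavily on the compute budget and GPU count you plug in. A back-of-envelope sketch, where the 50,000 GPU-hour budget, the dozen GPUs and the 3x slowdown are all placeholder assumptions rather than figures from this thread, and where downloading or storing the dataset is ignored:

```python
def wall_clock_days(gpu_hours, n_gpus, slowdown=1.0):
    """Days needed to spend `gpu_hours` of compute on `n_gpus` devices.

    `slowdown` scales for home GPUs being slower than the cloud GPUs the
    budget was measured on; perfect parallel scaling is assumed.
    """
    return gpu_hours * slowdown / n_gpus / 24


if __name__ == "__main__":
    # All three inputs are placeholders, not measured figures.
    print(f"{wall_clock_days(50_000, 12, slowdown=3.0):.0f} days")
```

Swap in whatever budget and hardware you believe; the result scales linearly with each input.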

You forget how much mark-up cloud providers charge.

I trained StyleGAN 2 from scratch using 8x 3090s at home and it took 3 months. It's fine.

240TB is small fish, my homelab is a petabyte and I consider it small.

It would take you way longer to train something like GPT-3 with such a setup.

Just to build my intuition, how long do you think it would take you to train Stable Diffusion in your homelab if you dedicated it to that task? 10 years? 20 years? What about GPT-3?

Umm training is not the same as running it.

Is there a good set of benchmarks available for Stable Diffusion? I was able to run a custom Stable Diffusion build on a GCE A100 instance (~$1/hour) at around 1Mpix per 10 seconds. I.e, I could create a 512x512 image in 2.5 seconds with some batching optimizations. A consumer GPU like a 3090 runs at ~1Mpix per 20 seconds.

I'm wondering what the price floor of stock art will be when someone can use https://lexica.art/ as a starting point, generate variations of a prompt locally, and then spend a few minutes sifting through the results. It should be possible to get most stock art or concept art at a price of <$1 per image.

It can be even cheaper.

Midjourney, in case you appreciate their output, has an unlimited plan for 30$ a month. The only limitation is that if you're an extremely heavy user, they may "relax" you, which means results come in a bit slower.

Note that they've been also experimenting with a --beta parameter which basically means the algorithm uses StableDiffusion's algorithm behind the scenes, or you can use any of 4 versions of MidJourney's more stylistic algorithms.

So if you don't want to tinker or don't have a high-end GPU, it's a cheap way to play around. I have StableDiffusion running locally but still prefer MidJourney. I enjoy the stylistic output but it's also a highly social way to generate art. Everybody is doing it in the open.

Anyway, the stock art part is a hairy subject. You should assume that your AI image is not copyrighted. Which begs the question why they would pay at all.

>The only limitation is that if you're an extremely heavy user, they may "relax" you, which means results come in a bit slower.

You don't have to be an extremely heavy user. I used it for about an hour every evening and it took 11 days out of a month subscription for them to put me on relax mode.

The relax mode is based on how busy the service is. If usage is low, it's the same as fast mode. But other times it's really slow.

That makes it unpredictable enough that it stopped being fun for me to use it. I've barely used midjourney since I got put on relaxed mode - it stopped feeling like I can jump on and play because I might hit a busy period and then it'll take 5 minutes to generate a prompt

That said, I could buy more hours of fast mode and I think it's still way cheaper than Dall-E or Dreamstudio

Related: I wrote up instructions for running Stable Diffusion on GCE. I used a Tesla T4, which is probably the cheapest that can handle the original code. If you're spinning up an instance to play with, rather than to batch-process, then cheaper makes more sense because most of the machine's time is spent waiting for you to type stuff and look at the results.


So you’re estimating over a thousand generated images an hour and less than a tenth of a cent per image using the A100. If that turns out to be accurate, it seems like some online image generation will be included in the price of the stock art.

(DreamStudio is charging a bit over one cent per generated image at default settings, depending on exchange rates.)
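For anyone who wants to redo that arithmetic with their own numbers, here is a rough sketch. The throughput (1 Mpix per 10 seconds) and the ~$1/hour price are the figures quoted upthread; treating 1 Mpix as 1024x1024 pixels is my assumption, chosen because it reproduces the quoted 2.5 seconds for 512x512. The function names are made up for illustration.

```python
# Back-of-the-envelope cost per image from a throughput figure.
# Assumption: 1 Mpix = 1024 * 1024 pixels (this reproduces the 2.5 s quoted upthread).
PIXELS_PER_MPIX = 1024 * 1024


def seconds_per_image(width: int, height: int, seconds_per_mpix: float) -> float:
    """Time for one image, given how long one megapixel takes."""
    return width * height / PIXELS_PER_MPIX * seconds_per_mpix


def images_per_hour(seconds: float) -> float:
    """How many images fit in one hour at a fixed time per image."""
    return 3600 / seconds


def dollars_per_image(seconds: float, dollars_per_hour: float) -> float:
    """Instance price spread over the images generated in an hour."""
    return dollars_per_hour / images_per_hour(seconds)


a100_seconds = seconds_per_image(512, 512, seconds_per_mpix=10)
a100_rate = images_per_hour(a100_seconds)
a100_cost = dollars_per_image(a100_seconds, dollars_per_hour=1.0)

print(f"{a100_seconds:.2f} s/image, {a100_rate:.0f} images/hour, ${a100_cost:.5f}/image")
```

This ignores model load time, idle time and failed generations, so the real cost per usable image will be higher.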

My 3170Ti will create a 512x512 image in about 5-6 seconds with 50 inference steps

Bananas. Thanks so much... to everyone involved. It works.

14 seconds to generate an image on an M1 Max with the given instructions (`--n_samples 1 --n_iter 1`)

Also, interesting/curious small note: images generated with this script are "invisibly watermarked" i.e. steganographied!

See https://github.com/bfirsh/stable-diffusion/blob/main/scripts...

> Also, interesting/curious small note: images generated with this script are "invisibly watermarked" i.e. steganographied!


So that future iterations of StableDiffusion (or similar models) don't end up getting trained on their own outputs.

Oh wow, I didn't even think of that. I am pretty sure a few of the repos have turned off the invisible watermark, I wonder if that will have consequences down the line for training data.

... so this means that watermarking an image you own is probably the only way to avoid it being used for training further models? :-)

After playing around with all of these ML image generators I've found myself surprisingly disenchanted. The tech is extremely impressive but I think it's just human psychology that when you have an unlimited supply of something you tend to value each instance of it less.

Turns out I don't really want thousands of good images. I want a handful of excellent ones.

Human curation will likely remain valuable into the future.

I've been playing with Stable Diffusion a lot the past few days on a Dell R620 CPU (24 cores, 96 GB of RAM). With a little fiddling (not knowing any python or anything about machine learning) I was able to get img2img.py working by simply comparing that script to the txt2img.py CPU patch. Was only a few lines of tweaking. img2img takes ~2 minutes to generate an image with 1 sample and 50 iterations, txt2img takes about 10 minutes for 1 sample and 50 generations.

The real bummer is that I can only get ddim and plms to run using a CPU. All of the other diffusions crash and burn. ddim and plms don't seem to do a great job of converging for hyper-realistic scenes involving humans. I've seen other algorithms "shape up" after 10 or so iterations from explorations people do online - where increasing the step count just gives you a higher fidelity and/or more realistic image. With ddim/plms on a CPU, every step seems to give me a wildly different image. You wouldn't know that steps 10 and steps 15 came from the same seed/sample they change so much.

I'm not sure if this is just because I'm running it on a CPU or if ddim and plms are just inferior to the other diffusion models - but I've mostly given up on generating anything worthwhile until I can get my hands on an NVIDIA GPU and experiment more with faster turnarounds.

> You wouldn't know that steps 10 and steps 15 came from the same seed/sample they change so much.

I don't think this is CPU specific, this happens at these very low number of samples, even on the GPU. Most guides recommend starting with 45 steps as a useful minimum for quickly trialing prompt and setting changes, and then increasing that number once you've found values you like for your prompt and other parameters.

I've also noticed another big change sometimes happens between 70-90 steps. It's not all the time and it doesn't drastically change your image, but orientations may get rotated, colors will change, the background may change completely.

> img2img takes ~2 minutes to generate an image with 1 sample and 50 iterations

If you check the console logs you'll notice img2img doesn't actually run the real number of steps. It's number of steps multiplied by the Denoising Strength factor. So with a denoising strength of 0.5 and 50 steps, you're actually running 25 steps.
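A tiny sketch of that relationship, in case it helps when comparing img2img timings. The helper name is hypothetical, and truncating to an integer is my assumption about how the script rounds.

```python
def effective_img2img_steps(steps: int, strength: float) -> int:
    """Steps img2img actually runs: the requested steps scaled by denoising strength.

    Assumption: the product is truncated to an integer.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength is expected to be between 0 and 1")
    return int(steps * strength)


print(effective_img2img_steps(50, 0.5))   # the example from the comment above
print(effective_img2img_steps(50, 0.75))
```

So a "50 step" img2img run at low strength is not directly comparable to a 50 step txt2img run.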

Later edit: Oh and if you do end up liking an image from step 10 or whatever, but iterating further completely changes the image, one thing you can do is save your output at 10 steps, and use that as your base image for the img2img script to do further work.

https://github.com/Birch-san/stable-diffusion has altered txt2img to support img2img and added other samplers, see:


That branch (birch-mps-waifu) runs on M1 macs no problem.

With the 1.4 checkpoint, everything under 40 steps can't be used basically and you only get good fidelity with >75 steps. I usually use 100, that's a good middleground.

How do you change these steps in the given script? Is it the --ddim_steps parameter? Or --n_iter? Or ... ?

With --ddim_steps

I found I got quite decent results with 15-30 steps when generating children’s book illustrations (of course, no expectation for hyperrealism there)

Are we being pranked? I just followed the steps but the image output from my prompt is just a single frame of Rick Astley...

EDIT: It was a false-positive (honest!) on the NSFW filter. To disable it, edit txt2img.py around line 325.

Comment this line out:

    x_checked_image, has_nsfw_concept = check_safety(x_samples_ddim)
And replace it with:

    x_checked_image = x_samples_ddim

That means the NSFW filter kicked in IIRC from reading the code.

Change your prompt, or remove the filter from the code.

Haha, busted!

To be fair, the reason the filter is there is that if you ask for a picture of a woman, stable diffusion is pretty likely to generate a naked one!

If you tweak the prompt to explicitly mention clothing, you should be OK though.

Wow, is that true? I’ve never heard a more textbook ethical problem with a model.

Safari blocking searches for "asian" probably had more impact: https://9to5mac.com/2021/03/30/ios-14-5-no-longer-blocks-web...

It's an ethical problem with our society, not the model.

If you consider that the training set includes all western art from the last few centuries it’s not too surprising. There’s an awful lot of nudes in that set & most of them are female.

If you open up the txt2img and img2img scripts, there is a content filter. If your prompt generated anything that gets detected as "inappropriate" the image is replaced with Rick Astley.

Removing the censor should be pretty straightforward, just comment out those lines.

It bothers me that this isn't just configurable. Why would they not want to expose this as a feature?

Plausible deniability

When the model detects NSFW content it replaces the output with the frame of Rick Astley.

It's kind of amazing that ML can now intelligently rick roll people.

I think it would be awesome to update the rickroll feature to the following:

Auto Re-run the img2img with some text prompt: "all of the people are now Rick Astley" with low strength so it can adjust the faces, but not change the nudity!!!1

Hah, it would be hilarious if it generated all the nudity you wanted - but with Rick Astley's face on every naked person!

To be fair, the developers added this "feature" and it can easily be disabled in the code. The ML just says "this might be NSFW".

Same thing happened to me which is especially odd as I literally just pasted the example command.

It has a lot of false positives. A lot of my portraits of faces were marked as NSFW. Possibly detecting proportion of the image that's skin color?

Unrelated to stable diffusion, but I was showing DALL-E to my sister last night and a prompt with > Huge rubber tree set off the TOS violation filter.

AI alignment concerns are definitely overblown...

For those as keen as I am to try this out, I ran these steps, only to run into an error during the pip install phase:

> ERROR: Failed building wheel for onnx

I was able to resolve it by doing this:

> brew install protobuf

Then I ran pip install again, and it worked!

In the troubleshooting section it mentions running:

    brew install Cmake protobuf rust
To fix onnx build errors. I had the same issue.

What kind of speed does this run at? Eg. How long to make a 512x512 image at standard settings?

I haven't installed from this link specifically, but I used one of the branches on which this is based a few days ago, so the results should be similar.

On a first-gen M1 Mac mini with 8GB RAM, it takes 70-90 minutes for each image.

Still feels like magic, but old-school magic.

On an M1 Pro 16GB it is taking a couple minutes for each image.

Is that the difference in graphics performance between the M1 and M1 Pro or did the other person do something wrong? 70-90 minutes seems nuts

I have the M1 8GB I mentioned in my first comment, and the M1 Pro 16GB I mentioned in my second comment, side-by-side. However, the first one was running a Stable Diffusion branch from earlier in the week, so I reinstalled it using the same instructions. The only difference now is the physical hardware.

The thing to understand is that the 8GB M1 has 8GB. When I run txt2img.py, my Activity Monitor shows a Python process with 9.42GB of memory, and the "Memory Pressure" graph spends time in the red zone as the machine is swapping. While the 16GB M1 Pro immediately shows PLMS Sampler progress, and consistently spends around 3 seconds per iteration (e.g. "3.29s/it" and "2.97s/it"), the 8GB M1 takes several minutes before it jumps from 0% to 2% progress, and it accurately reports "326.24s/it"

So yes, whether it's M1 vs M1 Pro, or 8GB vs 16GB, it really is that stark a difference.

Update: after the second iteration it is 208.44s/it, so it is speeding up. It should drop to less than 120s/it before it finishes, if it runs as quickly as my previous install. And yes, 186.04s/it after the third iteration, and 159.22s/it after the fourth.

Sounds entirely like a swap-constrained operation. You need ~8gb of VRAM to load the uncompressed model into memory, which obviously won't work well on a Macbook with 8gb of memory.

My first-gen M1 MacBook Air with 16GB takes just under 4 minutes per image. Running top while it's generating shows memory usage fluctuating between 10GB and 13GB, so if you're running on 8GB it's probably swapping a lot.

Might be the RAM difference. RAM is shared between CPU and GPU on the M1 series processors.

My 16gb M1 Air was initially taking 13 minutes for a 50 step generation. But when I closed all the open tabs and apps it went down to 3 minutes.

Looks like RAM drastically affects the speed.

A little over three minutes on a first-gen M1 iMac with 16GB.

It looks like memory is super-important for this (which isn't all that surprising, really...).

Installed from this link on a MacBook Pro (16-inch, 2021) with Apple M1 Pro and 16GB. First run downloads stuff, so I omit that result.

I had a YouTube video playing while I kicked off the exact command in the install docs, and got: 16.84s user 99.43s system 61% cpu 3:08.51 total

Next attempt, python aborted 78 seconds in! Weird.

Next attempt, with YouTube paused: 16.31s user 95.48s system 65% cpu 2:49.45 total

So around three minutes, I'd say.

For 512x512 on M1 MAX (32 core) with 64 GB RAM I'm getting 1.67it/s so 30.59s with the default ddim_steps=50.

I've gotten 1.35it/s that corresponds to 38s, but I've the M1 Max with the 24 cores GPU (the "lower end" one).

I just set up a similar thing on my own a bit differently than OP did.

Apple M1 Max with 10-core CPU, 32-core GPU, 16-core Neural Engine - Takes 38 seconds as well up to 46 when it gets hotter.

Can anyone give a comparison with an Nvidia GPU in terms of performance?

On my M1 Pro MBP with 16GB RAM, it takes ~3 minutes.

Looks like I'm getting around 4s per iteration on my M1 Max. At 50 iterations, that's 200 seconds.
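Since people in this thread are quoting both "it/s" and "s/it", here is a small sketch for turning either into a total sampling time. The function name is made up, and it only covers the sampling loop, not model loading or image decoding.

```python
def sampling_seconds(steps: int, rate: float, unit: str) -> float:
    """Convert a progress-bar rate into total sampling time.

    unit is "s/it" (seconds per iteration) or "it/s" (iterations per second).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if unit == "s/it":
        return steps * rate
    if unit == "it/s":
        return steps / rate
    raise ValueError("unit must be 's/it' or 'it/s'")


# Figures reported in this thread, all with the default 50 steps.
print(sampling_seconds(50, 4.0, "s/it"))    # M1 Max at ~4 s/it
print(sampling_seconds(50, 1.67, "it/s"))   # M1 Max (32-core GPU)
print(sampling_seconds(50, 1.35, "it/s"))   # M1 Max (24-core GPU)
```

Wall-clock totals reported upthread are a bit higher than these because they include everything outside the sampling loop.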

On my M2 Air, 16G, 10 CPU cores, the default command as in the installing instructions takes like 2m20s.

MacBook Air M2 8 CPU 8GB the example apple image took 35mins. Guess I'll wait for now.

You're clearly doing something wrong, as I get about 3 minutes per image on an M1 Mac mini.

But yeah, at this stage most of the guides are early hacks and require individual tweaking. It is quite expected that people get varying results. I assume in a week or a month the situation will get much better and much more user-friendly.

Getting around 4 minutes per image on M1 MacBook Air 16GB

Hm, taking 2 hours on my M1 MacBook Air 16GB and it's clearly swapping. Are you using model v1.4? Or any other memory optimization that you applied?

M1 Max (32gb) is around 35 seconds per image.

I just had to:

> brew link protobuf --overwrite

Don't blindly run this command unless you understand what you're doing.

Python dependency hell in a nutshell. Impossible to distribute ML projects that can easily be run.

Is there any way to keep up with this stuff / beginners guide? I really want to play around with it but it's kinda confusing to me.

I don't have an M1 Mac, I have an Intel one with an AMD GPU, not sure if I can run it? Don't mind if it's a bit slow, or what is the best way of running it in the cloud? Anything that can produce high res for free?

Yes, you can run it on your Intel CPU: https://github.com/bes-dev/stable_diffusion.openvino

And this should work on an AMD GPU (I haven't tried it, I only have NVIDIA): https://github.com/AshleyYakeley/stable-diffusion-rocm

There are also many ways to run it in the cloud (and even more coming every hour!) I think this one is the most popular: https://colab.research.google.com/github/altryne/sd-webui-co...


It's not free but I've played with it a lot over the last two days for around $10, generating the most complex photos I can (1024x1024, 150 steps, 9 images, etc)

follow this guide: https://github.com/lstein/stable-diffusion/blob/main/README-...

i am running it on my 2019 intel macbook pro. 10 minutes per picture

I wrote a guide for AMD.


it's for, but it could be adapted to Linux as long as you install the right drivers and such.

Have you managed to set it up? I might have the same computer as you.

Not yet, I haven't had much time to look into it all yet.

Looks like it's going to be a lot of fun though.

I'd rather see someone implemented glue that allows you to run arbitrary (deep learning) code on any platform.

I mean, are we going to see X on M1 Mac, for any X now in the future?

Also, weren't torch and tensorflow supposed to be this glue?

Broadly speaking, it looks like they are. The implementation of Stable Diffusion doesn't appear to be using all of those features correctly (i.e. device selection fails if you don't have CUDA enabled even though MPS (https://pytorch.org/docs/stable/notes/mps.html) is supported by PyTorch).

Similar goes for quirks of Tensorflow that weren't taken advantage of. That's largely the work that is on-going in the OSX and M1 forks.
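To make the device-selection point concrete, here is a minimal sketch of a fallback order (CUDA, then MPS, then CPU). It is not what any of the forks actually do, just one way to write it; `pick_device` and `detect_device` are names I made up. The only PyTorch calls used are `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the import is guarded so the snippet also runs where PyTorch is missing.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch what is available; report "cpu" if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    mps_available = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_available)


print(detect_device())
```

The returned string can be passed to `torch.device(...)` or `.to(...)` instead of a hard-coded "cuda".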

I got stuck on this roadblock, couldn’t get CUDA to work on my Mac, was very confusing

That's because CUDA is only for Nvidia GPUs and Apple doesn't support Nvidia GPUs, it has its own now.

Didn’t apple stop supporting Nvidia cards like 5 years ago? How could it be confusing that Cuda wouldn’t run?

Ah, I didn't realize. It's not very obvious what GPU you have in your Macbook, I couldn't actually find out where to find that in my System settings. On Windows it's inside the "Display" settings but on MacOS... where is it? :)

lol presumably the OP didn't know that... hence the confusion.

    (base)   stable-diffusion git:(main) conda env create -f environment.yaml
    Collecting package metadata (repodata.json): done
    Solving environment: failed
      - cudatoolkit=11.3
oh i was following the github fork readme, there is a special macos blog post


If you look at the substance of the changes being made to support Apple Silicon, they're essentially detecting an M* mac and switching to PyTorch's Metal backend.

So, yeah PyTorch is correctly serving as a 'glue'.


As mentioned in sibling comments, Torch is indeed the glue in this implementation. Other glues are TVM[0] and ONNX[1]

These just cover the neural net though, and there is lots of surrounding code and pre-/post-processing that isn't covered by these systems.

For models on Replicate, we use Docker, packaged with Cog for this stuff.[2] Unfortunately Docker doesn't run natively on Mac, so if we want to use the Mac's GPU, we can't use Docker.

I wish there was a good container system for Mac. Even better if it were something that spanned both Mac and Linux. (Not as far-fetched as it seems... I used to work at Docker and spent a bit of time looking into this...)

[0] https://tvm.apache.org/ [1] https://onnx.ai/ [2] https://github.com/replicate/cog

Without k-diffusion support, I don't think this replicates Stable Diffusion experience:


Yes, running on M1/M2 (MPS device) was possible with modifications. img2img and inpainting also works.

However you'll run into problems when you want k-diffusion sampling or textual inversion support.

stable-diffusion supports k-diffusion just fine on M1. You just have to detach a tensor in to_d() to stop the values exploding to infinity. https://twitter.com/Birchlabs/status/1563622002581184517?s=2...

I've been following your MPS branch and have run it but couldn't address the issue without this explanation. Thank you!

Note that once you run the python script for the first time it seems to download a further ~2GB of data

Including a rick astley image for the first thing you gen -_-

That’s the NSFW filter :D

How long does it take to generate a single image? Is it in the 30 min type range or a few mins? It's hypothetically "possible" to run e.g. OPT175B on a consumer GPU via Huggingface Accelerate, but in practice it takes like 30 mins to generate a single token.

About 10 secs per image on M1 Max with the right noise schedule and sampler. https://twitter.com/Birchlabs/status/1565029734865584143?s=2...

On my late 2019 intel macbook pro with 32gb and an AMD 5550m it takes about 7-10 minutes to generate an image.

Runs on my 2070S at 12s/image (no batch optimization) and on my GTX1050 4GB at 90s/image

I was able to run YaLM 100B in about 5min per iteration, NVMe being the bottleneck.

I'm using a 2021 Macbook Pro with the base tier M1 Pro and it generates images in about 1 minute per image.

Has anybody had success getting newer AMD cards working?

ROCm support seems spotty at best, I have a 5700xt and I haven't had much luck getting it working.

I have it working on an RX 6800, used the scripts from this repo[0] to build a docker image that has ROCm drivers and PyTorch installed.

I'm running Ubuntu 22.04 LTS as the host OS, didn't have to touch anything beyond the basic Docker install. Next step is build a new Dockerfile that adds in the Stable Diffusion WebUI.[1]

[0] https://github.com/AshleyYakeley/stable-diffusion-rocm [1] https://github.com/hlky/stable-diffusion-webui

The RX6800 seems like a great card for this - 16GB of relatively fast VRAM for a good price.

How long does it take to do 50 iterations on a 512x512?

I've tried using this set of steps [1], but have so far not had luck, mostly because the ROCm driver setup is throwing me for a loop. Tried it with an RX 6700 XT and first was going to test on Ubuntu 22.04 but realized ROCm doesn't support that OS yet, so tried again on 20.04 and ended up breaking my GPU driver!

[1] https://gist.github.com/geerlingguy/ff3c3cbcf4416be2c0c1e0f8...

Yes. That's expected.

AMD market segmented their RDNA2 support in ROCm to the Navi21 set only (6800/6800 XT/6900 XT).

It is not officially supported in any way on other RDNA2 GPUs. (Or even on the desktop RDNA2 range at all, that only works because their top end Pro cards share the same die)

Oh... had no clue! Thanks for letting me know so I wouldn't have to spend hours banging my head against the wall.

As an aside, a totally unsupported hack to make it somewhat work on Navi2x smaller dies which you use:

HSA_OVERRIDE_GFX_VERSION=10.3.0 to force using the Navi21 binary slice.

This is totally unsupported and please don't complain if something doesn't work when using that trick.

But basic PyTorch use works using this, so you might get away with it for this scenario.
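If you'd rather not remember to export the variable in every shell, a sketch of setting it from Python follows. The variable name and value are the ones given above; the helper name is made up, and the idea that it has to be set before `torch` is imported is my assumption about when the ROCm runtime reads it.

```python
import os


def force_navi21_slice(version: str = "10.3.0") -> str:
    """Set HSA_OVERRIDE_GFX_VERSION unless the user has already chosen a value.

    Returns whichever value ends up in the environment.
    """
    return os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", version)


# Assumption: this has to run before `import torch`, so the ROCm runtime
# sees the override when it starts up.
active = force_navi21_slice()
print("HSA_OVERRIDE_GFX_VERSION =", active)
```

Same caveat as above: unsupported, and it only helps on cards close enough to Navi21 for that binary slice to run.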

(TL;DR: AMD just doesn't care about GPGPU on the mainstream, better to switch to another GPU vendor that does next time...)

Looks like I may be out of luck with NAVI 10.

you can try my guide. got it working on a 6750XT


I tried getting pytorch vulkan inference working with radv, it gives me a missing dtype error in vkformat. Fp16 or normal precision have the same error. I think it's some bf16 thing.

6600XT reporting in. Spent a few hours on Windows and WSL2 setup attempts, got nowhere. I don't run Ubuntu at home and don't want to dual boot just for this. From looking around I think I'd have a better chance on native Ubuntu.

Buy an NVIDIA card. ROCm isn't supported in any way on WSL2, but CUDA is.

AMD just doesn't invest in their developer ecosystem. Also as you use a 6600 XT, no official ROCm support for the die that you use. Only for navi21.

Or wait: if it's just about Stable Diffusion, multiple people are trying to create ONNX and DirectML forks of the models/scripts, which at least in theory can work for AMD GPUs on Windows and WSL2.

The difference between an M2 air (8gb/512gb) versus an M1 pro (16gb/1tb) is much more than I expected.

  * M1 pro (16gb/1tb) can run the model in around 3 minutes.
  * M2 air (8gb/512gb) takes ~60 minutes for the same model.
I knew there would be some throttling due to the m2 air's fanless model, but I had no idea it would be a 20x difference (albeit, the m1 pro does have double the RAM. I don't have any other macbooks to test this on).

That's probably due to swapping due to the 8GB of RAM. People who have run Stable Diffusion on M2 airs with 16 GB of RAM seem to get performance that is in line with their GPU core count.

Correct. We've been seeing 8GB is super slow, >=16GB is fast. We'll add that to the prerequisites.

Unscientifically that puts the M1 Pro GPU at about 25% of the performance of a RTX 3080.

Not too shabby...

EDIT - this comment implies it's much faster: https://news.ycombinator.com/item?id=32679518

If that's correct then it's close to matching my 3080 (mobile).

It's likely that a significant fraction of the perf difference between Apple's GPUs and NVIDIA GPUs is due to NVIDIA's CUDA being highly optimized, and PyTorch being tuned to work with CUDA.

If PyTorch's Metal support improves and Apple's Metal drivers improve (big ifs), it's likely that Apple's GPUs will perform better relative to NVIDIA than they currently do.

img2img runs in 6 seconds on my GeForce 3080 12 GB. 6+ it/s depending on how much GPU memory is available. If I have any electron apps running it slows down dramatically.

Curious about:

1. Image size

2. Steps

3. What your numbers are for text2img

4. (most importantly) are you including the 30 seconds or so it takes to load the model initially? i.e. if you were to run 10 prompts and then divide the total time by 10, what are your numbers?
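On point 4, a quick sketch of why the load time matters so much for single-image timings. The 30 second load figure is from the question above and the 6 second generation time is from the parent comment; the function name is made up.

```python
def amortized_seconds(load_seconds: float, per_image_seconds: float, n_images: int) -> float:
    """Average time per image when the model is loaded once for n_images prompts."""
    if n_images < 1:
        raise ValueError("n_images must be at least 1")
    return (load_seconds + per_image_seconds * n_images) / n_images


for n in (1, 10, 100):
    print(n, amortized_seconds(load_seconds=30, per_image_seconds=6, n_images=n))
```

With those numbers a single run averages 36 seconds per image and a batch of ten averages 9, which is why timings from one-shot scripts and from a long-running session are hard to compare.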

Re 4 the lstein repo gives you an interactive repl, so you don't have to reload the model on every prompt.

I also have a 3080 and as far as I remember (not at my pc right now) it was 3-10 secs for img2img 512px cfg13 50 steps batch size 1 ddim sampler.

what args are you passing to img2img?

Neither of these should take minutes. Try Heun sampler, 8 steps, Karras noise schedule. Should be possible to get good images in 11 secs (or 10 secs if you go down to 7 steps). Measurements admittedly from M1 Max. https://twitter.com/Birchlabs/status/1565029734865584143?s=2...

I would assume it is the memory. The test command from the discussed link runs in slightly over 2 minutes on my M2 Air (16gb). How long does it take for yours?

I suspect the lack of RAM is the issue here.

I suspect that the M2 air is thrashing the disk pretty aggressively. Diffusion models rerun the same model once per step, so for a generation with 50 steps, you copy the entire model in and out of memory 50 times. That's going to kill performance.

I believe the model is copied into ram once upon calling StableDiffusionPipeline, unless the mac implementation partially loads the model due to only having 8G of ram.

It's only copied to VRAM once when implemented correctly.

M1 is a unified memory system and doesn't have VRAM.

I know, I was just adding context.

A few suggested changes to the instructions:

    /opt/homebrew/bin/python3 -m venv venv  # [1, 2]
    venv/bin/python -m pip install -r requirements.txt  # [3]
    venv/bin/python scripts/txt2img.py ...
1. Using /opt/homebrew/bin/python3 allows you to remove the suggestion about "You might need to reopen your console to make it work" and ensures folks are using the just installed via homebrew python3, as opposed to Apple's /usr/bin/python3 which is currently 3.8. It also works regardless of the user's PATH. We can be fairly confident /opt/homebrew/bin is correct since that's the standard homebrew location on Apple Silicon and folks who've installed it elsewhere will likely know how to modify the instructions.

2. No need to install virtualenv: since Python 3.6, Python ships with a built-in venv module, which covers most use cases.

3. No need to source an activate script. Call the python inside the virtual environment and it will use the virtual environment's packages.
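If you want to convince yourself that point 3 holds, a small check like this can be run with `venv/bin/python`. It only uses the standard library; the function name is made up. Inside a venv created by the `venv` module, `sys.prefix` points at the environment while `sys.base_prefix` points at the interpreter it was created from.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv-style environment."""
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("prefix:     ", sys.prefix)
print("inside venv:", in_virtualenv())
```

Run it once with `venv/bin/python` and once with `/opt/homebrew/bin/python3` and the last line should differ, with no activate script involved.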

Running into this error `RuntimeError: expected scalar type BFloat16 but found Float` when I run `txt2img.py`

SOLUTION - append the command with `--precision full`

Awesome, that works

For reference the full command:

    python scripts/txt2img.py \
      --prompt "a red juicy apple floating in outer space, like a planet" \
      --n_samples 1 --n_iter 1 --plms --precision full

Confirming I'm stuck on the same error when running the tutorial-instructed python scripts/txt2img.py command

    RuntimeError: expected scalar type BFloat16 but found Float

Yes, me too! Please post here if you find a solution for all the other people that come and find this by command-F'ing this error

I'm stuck on 'RuntimeError: expected scalar type BFloat16 but found Float' too. The most relevant link seems to be https://github.com/CompVis/stable-diffusion/pull/47 but I'm not sure. Please post when there is a solution.

That might have to do with your Mac OS version. Pre-12.4 Mac OS does not allow the Torch backend to use the M1 GPU, and so the script attempts to use the cpu, but then the cpu does not support half-precision numbers.

Yep---that was it in my case. I had the same error but it went away after upgrading to MacOS 12.5. You should actually check if your PyTorch installation can detect the mps backend: `torch.backends.mps.is_available()` must be equal to True.
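For anyone who wants a slightly fuller check than the one-liner, here is a sketch that reports whether PyTorch is installed, whether it was built with MPS support, and whether MPS is usable right now. `mps_report` is a made-up name; the PyTorch calls are `torch.backends.mps.is_built()` and `torch.backends.mps.is_available()`, guarded so it also runs on installs that predate the MPS backend.

```python
import importlib.util


def mps_report() -> dict:
    """Collect the facts that decide whether the script can use the M1 GPU."""
    report = {"torch_installed": False, "mps_built": False, "mps_available": False}
    if importlib.util.find_spec("torch") is None:
        return report
    try:
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True
    # torch.backends.mps is missing on older PyTorch versions.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None:
        report["mps_built"] = bool(mps.is_built())
        report["mps_available"] = bool(mps.is_available())
    return report


for key, value in mps_report().items():
    print(f"{key}: {value}")
```

If `mps_built` is True but `mps_available` is False, the macOS version is the likely culprit, which matches the explanation above; in that case the script ends up on the CPU and `--precision full` is the workaround mentioned earlier.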
