Show HN: New AI edits images based on text instructions (github.com/brycedrennan)
1098 points by bryced 8 months ago | hide | past | favorite | 227 comments
This works surprisingly well. Just give it instructions like "make it winter" or "remove the cars" and the photo is altered.

Here are some examples of transformations it can make: Golden gate bridge: https://raw.githubusercontent.com/brycedrennan/imaginAIry/ma... Girl with a pearl earring: https://raw.githubusercontent.com/brycedrennan/imaginAIry/ma...

I integrated this new InstructPix2Pix model into imaginAIry (python library) so it's easy to use for python developers.

Fireworks. These AI tools seem very good at replacing textures, less so at inserting objects. They can all "add fireworks" to a picture. They know what fireworks look like and diligently insert them into the "sky" part of pictures. But they don't know that fireworks are large objects far away rather than small objects up close (see the Father Ted bit on that one). So they add tiny fireworks into pictures that don't have a far-away portion (portraits), or above distant mountain ridges as if they were stars. Also trees: the AI doesn't know how big trees are, so it inserts monster trees under the Golden Gate bridge and tiny bonsais into portraits. Adding objects into complex images is totally hit and miss.

Another thing was "Bald" for the girl with the pearl earring: it seems the model doesn't know about things like ponytails under headdresses.

The new models that take depth estimation into account will probably solve this.

Perhaps stereoscopic video should be part of the training data?

Human stereoscopy is only good out to a few meters (and presumably people aren't going out with giant WWII stereoscopic rangefinders to generate training data). So it wouldn't help them for things like fireworks or trees.

Found out I couldn't see in stereo. Got prism glasses. Completely blew my mind to see "depth" for the 1st time. Had no idea I couldn't. Never had any trouble.

Apparently without prism glasses my vision just switches from one eye to the next every 30 seconds. Completely seamlessly.

Wow! Any brand suggestions? And are these the same prism glasses as those that let you watch TV lying in bed?

No. They are for double vision. Had a series of small strokes. Caused high cranial pressure, which caused sixth nerve palsy. I take a pill now and it fixes the double vision. I lose stereo that way. But prism glasses are very distracting to wear.

Given the appropriate training and test set, we can build a model that can overcome these issues, right?

I've played with several of these Stable Diffusion frameworks and followed many tutorials and imaginAIry fit my workflow the best. I actually wrote Bryce a thank you email in December after I made an advent calendar for my wife. Super excited to see continued development here to make this approachable to people who are familiar with Python, but don't want to deal with a lot of the overhead of building and configuring SD pipelines.

Thanks Paul!

Whoa. Another Bryce D. So when do we fight?

When the narwhal bacons of course. ugh.

Can it make it pop? Because that was the #1 request I remember dealing with.

I tried it out :-)

`aimg edit assets/girl_with_a_pearl_earring.jpg "make it pop" --prompt-strength 40 --gif`


I think it tried to make it pop art? Which is not a bad response to be fair.

I was expecting something balloon-like to appear, and was not disappointed.

I don’t know why people use this “AI” thing, I have been using make my logo bigger cream (tm) for ages with success.



Hum... Is that whitespace eliminator still on sale?

It’s funny that the bottle doesn’t look more like a Dr Bronner’s bottle [0]

[0]: https://www.drbronner.com/products/peppermint-pure-castile-l...

Well if you check the web, not a lot :D

Try these prompts:

"Add lens flare"

"Increase saturation"

"Add sparkles and gleam"

#1 request of what, for what, requested by whom?

That is a common request when working with clients. They have a hard time describing what they want so end up asking to “make it pop”

Maybe but it could put their business logo anywhere!

this should do it:

>> aimg edit input.jpg "make it pop" --prompt-strength 25

A similar tool: InstructPix2Pix, which alters images by describing the changes required: https://huggingface.co/timbrooks/instruct-pix2pix#example

Edit: Just noticed it is the same thing but wrapped, nevermind, pretty cool project!

Here is a colab you can try it in. It crashed for me the first time but worked the second time. https://colab.research.google.com/drive/1rOvQNs0Cmn_yU1bKWjC...

I could not get the first cell to run.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

    tensorflow 2.9.2 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.
    tensorboard 2.9.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.

I believe you can just ignore that error. The cell ran.

EDIT - it's free of charge: https://research.google.com/colaboratory/faq.html


First time I've used "colab" - looks great. Out of interest, who pays for the compute used by this?

Is it freely offerred by Google? Or is it charged to my Google API account when I use it? Or your account? It wasn't clear in the UI.

Freely offered by Google. They offer a subscription model if you want to run your collab notebooks on a beefier machine.

Huh, I'm trying it now and the results seem so weak compared to any other model I've seen since dall-e.

Hmm, that's true for me too. Not sure if it is due to resource constraints. I had a picture of a car in an indoor parking lot with walls and pillars. When I prompted "Color the car blue", the whole image was drenched in a tint of blue. Similarly, when prompted "make a hummingbird hover", the hummingbirds were a patch of shiny colors with a shape that sort of looked like a hummingbird, but not like a real one.

Try "turn the car blue"

Does dalle do prompt based photo edits now?

But yeah sometimes it doesn't follow directions well. I haven't noticed a pattern yet for why that is.

How would I upgrade to 2.1 if running locally?

If you're wanting to use Stable Diffusion 2.1 with imaginairy you just specify the model with `--model SD-2.1` when running the `aimg imagine` command.

Sorry for the offtopicness but could you please email me at hn@ycombinator.com? (I have an idea/possibility for you. Nothing that couldn't be posted here in principle but I'd rather not take the thread off topic.)

This was an uncanny comment, somehow.

Hope y'all's brainstorming session is fruitful.

Sorry for the uncanning! My thought was simply to connect the OP with a YC partner in case they wanted to explore doing this as a startup.

I send such emails all the time but on semi-rare occasions have to resort to offtopic pleas like the GP.

I hope that helps clear things up!

It does, thank you!

The language of high-level art-direction can be way more complex than one might assume. I wonder how this model might cope with the following:

‘Decrease high-frequency features of background.’

‘Increase intra-contrast of middle ground to foreground.’

‘Increase global saturation contrast.’

‘Increase hue spread of greens.’

They behave quite poorly, because the keywords used by the models are layman language not technical art or color correction/color grading-speak

Hopefully in a couple of years when things have matured more there will be more models capable of handling said requests

The most precise models are actually anime models because the users have got high standards for telling the machine what they expect of it and the databases are quite well annotated (booru tags)

When I was training Dreambooth on images of myself, then trying to get tags out of the images it generated of me to write better prompts, I clicked “booru tags” on automatic1111 not knowing what it was. It thought I was some sort of Yaoi manga and generated lots of tags that made me both uncomfortable and confused.

Around how many samples are required for an effective training set?

Leonardo.ai can probably make a model that handles one of those prompts well with 40 images.

To handle them all you would need a larger sample.

What are the most affordable GPUs that will run this? (it said it needs CUDA, min 11GB VRAM, so I guess my relatively puny 4GB 570RX isn't going to cut it!)

Can you imagine only being able to cook a hamburger on one brand of grill? But you can make something kinda similar in the toaster oven you can afford?

I want to be productive on this comment… but the crypto/cuda nexus of GPU work is simply not rational. Why are we still here?

You want to work in this field? Step 1. Buy an NVIDIA gpu. Step 2. CUDA. Step 3. Haha good luck, not available to purchase.

This situation is so crazy. My crappiest computer is way better at AI, just because I did an intel/nvidia build.

I don’t hate NVIDIA for innovating. The stagnation and risk of monopoly setting us back for unnecessary generations makes me a bit miffed.

So. To attempt to be productive here, what am I not seeing?

> Can you imagine only being able to cook a hamburger on one brand of grill? [...] the crypto/cuda nexus of GPU work is simply not rational. Why are we still here?

Because nvidia spent a long time chasing the market and spending resources, like they wanted it.

You wanted to learn about GPU compute in 2012? Here's a free udacity course sponsored by nvidia, complete with an online runtime backed by cloud GPUs so you can run the exercises straight from your browser.

You're building a deep learning framework? Here's a free library of accelerated primitives, and a developer who'll integrate it into your open source framework and update it as needed.

OpenCL, in contrast, behaves as if every member of the consortium is hoping some other member will bear these costs - as if they don't really want to be in the GPU compute business, except out of a begrudging need for feature parity.

And in terms of being rational - if you're skilled enough to be able to add support for a new GPU vendor into an ML library, you're probably paid enough that the price of a midrange nvidia GPU is trivial in comparison.

All is not lost, though - vendors like Intel are increasingly offering ML acceleration libraries [1] and most neural networks can be exported from one framework and imported into another.

[1] https://www.intel.com/content/www/us/en/developer/tools/onea...

Because innovating in the hardware space is just a lot more expensive and slow.

Also the vast majority of ML researchers and engineers are not system programmers. They don't care about vendor lock because they're not the ones writing the drivers.


1. It's just not a huge deal to many people. Most people who want to do local ML training and inference can just buy a NVIDIA GPU.

2. AMD only has a skeleton team working on their solution. It's clear it's not a focus.

The cheapest NVidia GPU with 11+GB VRAM is probably the 2060 12GB, although the 3060 12GB would be a better choice.

The setup.py file seems to indicate that PyTorch is used, which I think can also run on AMD GPUs, provided you are on Linux.

I really want these ML libraries to get smarter with use of VRAM. Just cos I don't have enough VRAM shouldn't stop me computing the answer. It should just transfer in the necessary data from system ram as needed. Sure, it'll be slower, but I'd prefer to wait 1 minute for my answer rather than getting an error.

And if I don't have enough system RAM, bring it in from a file on disk.

The majority of the RAM is consumed by big weight matrices, so the framework knows exactly which bits of data are needed, when, and in what order, and should be able to do efficient streaming data transfers to have all the data in the right place at the right time. That would be far more efficient than swap files, which don't know ahead of time what data will be needed and so impact performance severely.
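The streaming idea described above can be sketched as a sliding window of "resident" layers. This is purely illustrative Python (no real GPU transfers happen, and the name `stream_layers` is made up for the example): layers are loaded just before they are needed and the oldest ones are evicted to stay under a fixed budget.

```python
from collections import OrderedDict

def stream_layers(layer_sizes, vram_budget):
    """Yield layers in execution order, keeping only a sliding window
    of them "resident" (standing in for VRAM).

    layer_sizes: ordered mapping of layer name -> size in bytes.
    """
    resident = OrderedDict()  # insertion order = load order
    used = 0
    for name, size in layer_sizes.items():
        # Evict oldest layers until the next one fits (a single layer
        # larger than the budget would still be loaded, over-committed).
        while resident and used + size > vram_budget:
            _, freed = resident.popitem(last=False)
            used -= freed
        resident[name] = size  # "transfer" the layer in
        used += size
        yield name, list(resident)  # run the layer; report what is loaded

# Three 4-unit layers through an 8-unit budget: "a" is evicted to make
# room for "c".
steps = list(stream_layers({"a": 4, "b": 4, "c": 4}, vram_budget=8))
```

Because the execution order is known ahead of time, the "prefetch" here is trivially exact, which is the commenter's point about why this beats a generic swap file.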

If it doesn't fit in the GPU memory, it will often be faster to just compute everything on CPU. So the framework doesn't really need to be smart. And the framework can easily already compute everything on CPU. So it can do what you want already.

Using RAM to hold unneeded layers was one of the first optimizations made for Stable Diffusion. The AUTOMATIC1111 WebUI implements it, not sure about others.

It works fine on CPUs. Takes about a minute to generate images on my 8 core i7 desktop.

How do I get it to use CPU rather than GPU, please? I have 128GB of RAM but a lowly 2GB of VRAM in my PC.

Try setting the CUDA_VISIBLE_DEVICES environment variable to ''.

That worked! Thanks.
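For reference, the same effect can be had from inside Python, as long as it runs before anything touches the GPU. This is a minimal sketch; the variable is a standard CUDA mechanism, not something specific to imaginAIry:

```python
import os

# Hiding every CUDA device makes frameworks such as PyTorch fall back
# to the CPU. It must be set before the GPU library is first imported,
# since device discovery happens at import/initialization time.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```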

Hopefully one can utilize a highly Intel- (or AMD-) optimized stack, such as the Intel version of PyTorch, to make this run even faster.

I'm running on a 2080 Ti and an edit runs in 2 seconds. On my Apple M1 Max 32GB, edits take about 60 seconds.

If this was all packaged into a desktop app (e.g. Tauri or electron) how big would the app be? I'd imagine you could get it down to < 500MB (even if you packaged miniconda with it).

> I'd imagine you could get it down to < 500MB (even if you packaged miniconda with it).

I don't know where that imagined number came from. This tool appears to be using Stable Diffusion, and the base Stable Diffusion model is 4 or 5 gigabytes by itself. I think there are some other models that are necessary to use the base Stable Diffusion model, and while they are smaller, they still add to the total size.

Whatever 3060 variant that has the most VRAM is probably your best shot these days.

On the strength of this HN submission I just ordered an RTX 3060 12GB card for £381 on Amazon so I can run this and future AI models.

This stuff is fascinating, and @bryced's imaginAIry project made it accessible to people like me who never had any formal training in machine learning.

For what it's worth, it ran fine on my 2070 (8GB of VRAM), even with the GPU being used to render my desktop (Windows), which used another ~800MB of VRAM. I was running it under WSL, which also worked fine.

Note the level of investment that NVIDIA's software team has here: they have a separate WSL-Ubuntu installation method that takes care not to overwrite Windows drivers but installs the CUDA toolkit anyway. I expected this to be a niche, brittle process, but it was very well supported.

Google Colab free account: you get access to a 15GB VRAM T4 GPU. Or Kaggle, which gives you access to 2x T4s or one P100 GPU.

“Add a dog in my arms”

I’ll keep you posted how well this works for dating apps

I am not a fan of software such as this putting in an arbitrary "safety" feature which can only be disabled via an undocumented environment variable. At least make it a documented flag for people who don't have an issue with nudity. There isn't even an indication that there is a "safety" issue; you just get a blank image and are left wondering if your GPU, model, or install is corrupted.

This isn't running on a website that is open to everyone or can be easily run by a novice.

Anyone capable of installing and running this is also able to read code and remove such a feature. There is no reason to hide this nor to not document it.

The amount of nudity you get is also highly dependent on which model you use.

Nudity isn't really the core issue. It is about illegal content. Nudity filtering is the over-protective, over-inclusive bandaid solution to prevent this thing from being used to generate the very illegal material that will trigger authorities.

Then instead of just presenting me with a blank image, tell me why it's blank. Or add the word "nude" to the default negative prompt, which by default doesn't include it.

The default is:

   negative-prompt:"Ugly, duplication, duplicates, mutilation, deformed, mutilated, mutation, twisted body, disfigured, bad anatomy, out of frame, extra fingers, mutated hands, poorly drawn hands, extra limbs, malformed limbs, missing arms, extra arms, missing legs, extra legs, mutated hands, extra hands, fused fingers, missing fingers, extra fingers, long neck, small head, closed eyes, rolling eyes, weird eyes, smudged face, blurred face, poorly drawn face, mutation, mutilation, cloned face, strange mouth, grainy, blurred, blurry, writing, calligraphy, signature, text, watermark, bad art,"

Sure, if this were a commercial product I would also scream about feedback. But as this is a free side project, I'm not going to criticize decisions on what they think they need to do to avoid unnecessary drama. I credit them for making the bypass relatively simple.

Nobody is screaming

Then society shouldn't tolerate knives either.

Banning an entire class of activity because someone somewhere might abuse it is a ridiculous irrational way to reason about things.

The question is simple: Will the thing mostly be used for bad or good? Unless you think the vast majority of humanity are pedophiles then these features should be allowed.

"Think of the children" was never a valid argument and isn't a valid one today. I truly hope this utterly stupid and brain-dead way of thinking never escapes the AI community and infects other fields.

Agreed. This was an inevitable use case for AI imagery from the get-go. No way around it. Even if one dev/trainer goes out of their way to make certain that it can't be used for that purpose, another model will be trained that can. So the only three solutions are:

A. Get over it.

B. Ban the tool in its entirety.

C. Waste tons of governmental resources, spy on people more, and operate in legal grey-area to hunt down people that produce that kind of stuff with these models.

The correct answer seems obvious to me.

Genuinely curious what type of content would be considered illegal here.

- the tool is drawing original content.
- the tool is executing on my laptop.
- the output image is not shared with anyone.

In a way, whatever this tool can do, MS Paint could (theoretically) do.

Or am I misunderstanding the whole thing?

There's a great wikipedia page on this very topic [1]. In some countries like Germany fictional porn isn't considered porn, while other countries like Australia or France consider the possession of drawings of naked minors a crime worthy of jail sentences. And then there's the US where having the images on a computer is fine, as long as they aren't sent over the internet and the computer never crosses state lines.

In general it's a topic people are careful about because the legal situation is complex, and in many countries associated with the potential of heavy jail sentences.

1: https://en.wikipedia.org/wiki/Legal_status_of_fictional_porn...

When you stop and think about it, it is kind of insane that a fictional drawing in private possession, not even communicated to anyone, is already sufficient to land someone in prison in so many jurisdictions. Isn't this a clear case of legislating moral purity (as opposed to actual harm)?

Nope. It is illegal if you do it in MS Paint too. Images of naked kids (real or fictional) + transmitted over the internet (i.e. state lines) = bad bad day for all involved.

> + transmitted over internet (ie state lines)

But OP specifically said: "the tool is executing on my laptop. - the output image is not shared with anyone."

The state lines thing is only about federal crimes. Possession is most likely illegal as a state crime on its own. The repercussions of even being suspected of crimes in this area means that every rational person would take precautions far above what is legally necessary to prevent any association with such material.

Maybe use a desktop instead, because crossing state lines with that laptop might carry a jail sentence with a minimum term of 5 years.

I would assume that running an AI image generation software is something you'd do on a desktop anyway.

But it requires transmission over the internet, right? This is actually very interesting. Am I legally allowed to draw naked kids for my own enjoyment in the comfort of my own home?

Simple possession is also a crime, just a state crime. The internet transmission triggers federal jurisdiction and federal charges. There is no safe/legal way to possess such images.

Is any AI generated image even illegal, assuming originality and difference from training set?

The console output tells you what happened.

Can you tell which env variable to set?

Slightly off topic.

I’ve been looking for an easier way to replace the text in these ai generated images. I found Facebook is working on it with their TextStyleBrush - https://ai.facebook.com/blog/ai-can-now-emulate-text-style-i... but have been unable to find something released or usable yet. Anyone aware of other efforts?

The authors of TextStyleBrush cite SRNet, which is available at https://github.com/youdao-ai/SRNet but probably has worse quality. I don't know of others, but I have not looked very hard either.

> Here are some examples of transformations it can make: Golden gate bridge:

I'm on mobile so can't try this myself now. Can it add a Klingon bird of prey flying under the Golden Gate Bridge, and will "add a Klingon bird of prey flying under the Golden Gate Bridge" prompt/command be enough?

No. At least not with the Stable Diffusion 1.5 checkpoint used in the colab notebook. It seems to only have a very vague idea of what a Klingon bird of prey is. The best I could get in ~30 images was [1], and that's with slight prompt tweaks and a negative prompt to discourage falcons and eagles.

1: https://i.imgur.com/gDj2Kn4.png

A CUDA supported graphics card with >= 11gb VRAM (and CUDA installed) or an M1 processor.

/Sighs in Intel iMac

Has anyone managed to get an eGPU running under MacOS? I guess I could use Colab but I like the feeling of running things locally.

I'm told imaginAIry secretly does run on Intel Macs. Very slowly. I just don't want to be on the hook for support, so the reqs are written that way.

Ooh, I'll give it a whirl - performance isn't a priority right now. Thanks Bryce.

How does this work? When I run it on a machine with a GPU (pytorch, CUDA etc installed) I still see it downloading files for each prompt. Is the image being generated on the cloud somewhere or on my local machine? Why the downloads?

Shouldn't be downloads per prompt. Processing happens on your machine. It does download models as needed. A network call per prompt would be a bug.

OK. I noticed that the images are not accurate when I give my own descriptions. Not sure if this is a limitation of Stable Diffusion. For example, for the text "cat and mouse samurai fight in a forest, watched by a porcupine" I got a cat and a mouse (with a cat's face and tail!!) in a forest, sort of fighting. But no porcupine.

Thank you for creating this.

Yes, Stable Diffusion is not great at handling multiple ideas. New image models are coming out soon though.

I keep seeing this even when the prompt is unchanged

    Downloading https://huggingface.co/runwayml/stable-diffusion-v1-5/resolv... from huggingface
    Loading model /home/hrishi/.cache/huggingface/hub/models--runwayml--stable-diffusion-v1-5/snapshots/889b629140e71758e1e0006e355c331a5744b4bf/v1-5-pruned-emaonly.ckpt onto cuda backend...

followed by a download

That is strange. I'm not sure what would cause that unless it was running in some ephemeral environment. What OS? Can you open a github issue with a screenshot?

Ubuntu 22.04.1 LTS. I'll file an issue.

Oh I see what you're saying. It's not actually downloading but there is a bug where it's not using the cache properly. will fix

Is there a way to pre-download all models? I want to create a docker image and cache the models.

Also, is there any way to configure the generated file path beyond the directory, or to pipe the image directly from the CLI?

So, Deckard can ask it to enhance, finally :)

Many thanks to the OP, can't wait to try this out! I have a question I'm hoping to slide in here: I remember there were also solutions for doing things like "take this character and now make it do various things". Does anyone remember what the general term for that was, and some solutions (pretty sure I've seen this on here, apparently forgot to bookmark).

PS: I'm not trying to make a comic book, I'm trying to help a friend solve a far more basic business problem (trying to get clients to pay their bills on time).

dreambooth perhaps?

Dreambooth is what I'm using now, but I think I remember the concept had a specific name, something like 'context transfer' or so (pretty sure that was not the term) and tools that were pretty good at it that came out before Dreambooth. If I could at least remember the term it might be easier to search for them.

Dreambooth is ok at it, but it requires multiple images (you often read 30, but I've actually had decent results with as few as five) to recognize what it's supposed to replicate. I remember there were tools that were more adapted to the workflow "create a humanoid cartoon character with a bunny face", pick one image that you like, and then "now show that same character in scene X, eg teaching in a classroom" or "wearing a cowboy outfit".

"textual inversion"

I think that's it, thank you!

Thanks a lot!!!

Works perfectly for me (Gentoo Linux + nVidia RTX 3060 12GiB VRAM). I installed your package last week and it just worked; I've been experimenting with it since then and telling parents & colleagues about it.

The results (especially in relation to "people's faces") can vary a lot between ok/scary/great (I still have to understand how the options/parameters work), all in all it's a great package that's easy to handle & use.

In general, if I don't specify a higher output resolution setting than the default (512x386 or something similar), with e.g. "-w 1024 -h 768", then faces get garbled/deformed like straight from a Stephen King novel => is this expected?

Cheers :)

I've been toying with SD for a while, and I do want to make a nice and clean business out of it. It's more of a side-projecty thing so to speak.

Our "cluster" is running on a ASUS ROG 2080Ti external GPU in the razer core-x housing, and that actually works just fine in my flat.

We went through several iterations of how this could work at scale. The initial premise was basically the google homepage, but for images.

That's when we realised that scaling this to serve the planet was going to be a hell of a lot more work. But not really: conceptualising the concurrent compute requirements, as well as the ever-changing landscape and pace of innovation here, is what's absolutely necessary.

The quick fix is to use a message queue (we're using Bull) and make everything asynchronous.

So essentially, we solved the scaling factor using just one GPU. You'll get your requested image, but it's in a queue, we'll let you know when it's done. With that compute model in place, we can just add more GPUs, and tickets will take less time to serve if the scale engineering is proper.

I'm no expert on GPU/Machine learning/GAN stuff but Stable Diffusion actually prompted me to imagine how to build and scale such a service, and I did so. It is not live yet, but when it does become so the name reserved is dreamcreator dot ai, and I can't say when it will be animated. Hopefully this year.

A ticketing/scheduler system is how 99% of systems that involve long-running, CPU/GPU-intensive work that cannot be run in parallel are implemented. It's how I built my Stable Diffusion Discord bot, which is backed by a single RTX 2060.

I'm glad you have this working but I wouldn't exactly call this "solving the scaling problem", you're just running it in a blocking "serial fashion". With enough concurrent users it could still take somebody until the heat death of the universe for their image to finally be generated.

Agreed. The scaling model/queueing system was implemented as a POC which can be scaled by plugging in more cards/hosts. I hope we can find the time to animate this soon enough.

Queues are everywhere - at all levels - in the end a single transistor is either on or off - doing one thing at once.

Your queue decouples demand from supply - though you now have another problem - if demand far exceeds supply will your queue overflow?

In that scenario you might need to push the queue back to the requester - ie refuse the job and tell them to resubmit later.
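The ticket-queue-with-backpressure pattern discussed in this subthread can be sketched with the Python standard library (Bull is a Node.js library; this is just the same idea transposed, and all names here are invented for the example). A bounded queue accepts tickets while there is room and pushes the queue back to the requester when saturated:

```python
import queue
import threading

jobs = queue.Queue(maxsize=2)   # bounded: demand beyond this is refused
results = {}

def submit(ticket, prompt):
    """Accept a job if there is room, otherwise tell the caller to retry."""
    try:
        jobs.put_nowait((ticket, prompt))
        return "queued"
    except queue.Full:
        return "busy, retry later"  # push the queue back to the requester

def worker():
    # Stand-in for the single GPU: drains tickets one at a time.
    while True:
        ticket, prompt = jobs.get()
        results[ticket] = f"image for {prompt!r}"  # slow GPU work goes here
        jobs.task_done()

# Submit three jobs before any worker runs: the third overflows the bound.
statuses = [submit(i, f"prompt {i}") for i in range(3)]
# statuses == ["queued", "queued", "busy, retry later"]

threading.Thread(target=worker, daemon=True).start()
jobs.join()  # block until every accepted ticket has been processed
```

Adding capacity is then just starting more `worker` threads (or processes on other hosts) against the same queue, which matches the "plug in more cards/hosts" plan above.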

This is really cool. Haven't seen something like this yet. Going to be very interesting when you start to see E2E generation => animation/video/static => post editing => repeat. Have this feeling that movie studios are going to look into this kind of stuff. We went from real to CGI and this could take it to new levels in cost savings or possibilities.

Played around for a bit. Definitely a cool tool. Wish I had an M1 though. Taking me quite a bit to generate and fans running at full blast. Haha

It's very interesting, thanks! I've noticed (on the Spock example) that "make him smile" didn't produce a very... "comely" result (he basically becomes a vampire).

I was thinking of deploying something like that in one of our app features, but I'm scared of making our Users look like vampires :-)

Is it your experience that the model struggles more with faces than with other changes?

Yes if you're not careful it can ruin the face. You can play with the strength factor to see if something can be worked out. Bigger faces are safer.

Wow that's really impressive (I've seen similar things in research papers for a while now, but having it usable so easily and generic is great).

A few questions:

- would it be possible to use this tool to make automatic mask for editing in something like GIMP (for instance, if I want to automatically mask the hair)?

- would it be possible to have a REPL or something else to make several prompt on the same image? Loading the model takes time, and it would be great to be able to just do it once.

- how about a small GUI or webui to have the preview immediately? Maybe it's not the goal of this project and using `instruct-pix2pix` directly with its webui is more appropriate?

Thanks for the work (including upstream people for the research paper and pix2pix), and for sharing.

> would it be possible to use this tool to make automatic mask for editing in something like GIMP

probably but GIMP plugins are not something I've looked into


already done. just type `aimg` and you're good to go


GUIs add a lot of complexity. Can your file manager do thumbnails and quick previews?

> GUIs add a lot of complexity. Can your file manager do thumbnails and quick previews?

Somewhat OT, but I find this really funny. It says a lot about the difficulty of using various ecosystems and where communities spend time polishing things.

"Yeah, I made something that takes natural language and can do things like change seasons in an image. But a GUI? That's complicated!"

It's not a criticism of you, but the different ecosystems and what programmers like to focus on nowadays.

Fair but I'd point out I also didn't make the algorithm that changes photos. I'm wrapping a bunch of algorithms that other people made in a way that makes them easy to use.

It's not just that GUIs are hard; it's that the "customer" base will inevitably be much less technical and I'd receive a lot more difficult-to-resolve bug reports. So no-GUI is also a way of staying focused on the more interesting parts of the project.

Thanks for the quick answer, and cool about the REPL. Yeah, sure, I can just launch Gwenview on the output directory.

> probably but GIMP plugins are not something I've looked into

I was just thinking about a black-and-white or grey-level output image with the desired area; no need to integrate it into GIMP or whatever. I've tried a prompt like "keep only the face", but no luck so far.

There is a smart mask feature. Add `--mask-prompt face --mask-mode keep`. I believe it outputs the masks as well

I'm getting mixed results, and for a given topic it seems to invariably give a better result first time you ask, then not so good if you ask again.

It could be random and my imagination, but seems that way.

Looks really interesting, although my immediate thought with "-fix-faces" is how long before someone manages to do something inappropriate and whip up a storm about this.


Aaarrrgghh let me know when it's down to 4GB like Stable Diffusion

The prompt-based masking sounds incredible, with either pixel +/- or Prompt Relevance +/-

VERY impressive img2img capabilities!

It is Stable Diffusion, but yes, my fork does not have the memory optimizations needed to run it on only 4GB

You can get a used 2080Ti for under $300 on eBay

That's a lot of money for most people. It also means they have to have a PC to put it in.

Thank you for your sensible response... Very happy with my 6GB VRAM card and don't have $300 lying around to spend on a git repo that will probably be slimmed down in a month or two

Or, if there is a Colab version, I’d happy to pay Google for premium GPU.

Well, just open a new GPU Colab, create a cell with `!pip install imaginairy`, and you should be good to go...

It does work in non-pro colab apparently. Here you go: https://colab.research.google.com/drive/1rOvQNs0Cmn_yU1bKWjC...

How do I make it use my GPU (I have an RTX 3070)? It complains about using the sloooow CPU, but I don't see an option to switch to the GPU, which I think should be sufficient...? I'm running it on Windows 10.

Note that I have CUDA installed. Still, imaginAIry runs on the CPU :(

Two things

1. It actually makes me insecure.

2. Don't we already have apps that do such things? Yes, they were more specialized, but it's the same thing as the Prisma app.

Does anyone know if there is something like Google Cloud for GPUs but with an easy way to suspend the VM or container when not in use? Maybe I am just looking for container hosting with GPUs.

I am just trying to avoid some of the basic VM admin stuff like creating, starting, and stopping for SaaS, if someone already has a way to do it. Maybe this is something like what Elastic Beanstalk does.

I think there's https://brev.dev and https://banana.dev

Hey, founder of Brev.dev here. Brev lets you suspend the instances when not in use, and also auto-stops it after 3 hours of inactivity to avoid expensive surprises. Would love for you to give it a shot

Maybe not quite what you’re looking for, but I’ve seen some people mention banana.dev


Never used it myself, but looks like AWS Lambda/GCP Cloud Functions tailored to ML models.

"Log in with Github". No thanks.

vast.ai, paperspace.com

I see it's able to generate politician faces. I recall this wasn't possible on DALL·E 2 due to safety restrictions.

I run a friendly caption contest https://caption.me so imaginAIry is going to be absolute gold for generating funny and topical content. Thank you @bryced!

This is amazing! It's only a matter of time until video...

Video would be very useful too, but I expect running such models locally would be prohibitively expensive for most folks. (I am not talking about those $300k/year people here.)

anyone know how to use this? kind of confusing install instructions in the readme

If you don't care what exact tool in particular, https://github.com/AUTOMATIC1111/stable-diffusion-webui is the easiest to install I think and gives lots of toys to play with (txt2img, img2img, teach it your likeness, etc.)

If you're used to installing python packages it should be relatively easy. There are other projects with nice UIs but that's not what this library is for.

Are you telling me I can finally ENHANCE!?

Great stuff man, thanks!

Hoping that this is quickly implemented into the automatic1111 webUI.

Does anyone know of any tool like this for UI design? I'd love something that'd help creatively impaired people like myself communicate more visually.

Where can I find more data about the work you did to create this?

I did the work to wrap it up and make it "easy" to install. The researchers who did the real work can be found here: https://www.timothybrooks.com/instruct-pix2pix

Super nice. Would this work if I have my own version of fine-tuned SD? Also, curious how / whether this is different from img2img released by SD. Thanks!

This is itself a fine-tuned version of SD, so no, it won't work with alternative versions. img2img works by just running normal Stable Diffusion img2img on a noised starting image. As such, it destroys information in all parts of the image equally. This new model uses attention mechanisms to decide which parts of the image are important to modify. It can leave parts of the image untouched while making drastic changes to other parts.
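
The difference described above can be illustrated with a toy numpy sketch (purely illustrative, not the actual model code): plain img2img blends noise into every pixel equally, while a masked or attention-weighted edit leaves the unselected regions untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
image = np.ones((4, 4))            # toy "image"
noise = rng.normal(size=(4, 4))
strength = 0.8

# img2img-style: noise is blended into every pixel equally,
# so information is destroyed uniformly across the image
img2img_start = (1 - strength) * image + strength * noise

# mask/attention-style: only the selected region is disturbed
mask = np.zeros((4, 4))
mask[:2, :] = 1.0                  # edit only the top half
masked_start = image * (1 - mask) + img2img_start * mask

# the bottom half survives exactly under the masked edit
assert np.allclose(masked_start[2:], image[2:])
assert not np.allclose(img2img_start[2:], image[2:])
```

In the real model the "mask" is soft and learned via attention rather than a hard binary region, but the information-preservation argument is the same.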

Well, to be fair you can use feathered bitmap masks for img2img with some UIs (automatic1111)

Perfect explanation, thank you!

I hope there is a James Fridman version of this kind of AI.

Is there a link to how this works - in terms of nn architecture to combine the embedding of the existing image with the edit instruction?

How can I try this?

Can this be run on a Digitalocean VM?

I looked around on DO's products, but none seems to advertise that it has a GPU. So maybe it is not possible?

Try paperspace, they have GPUs and you can set billing limits to stop accidental overusage

(no affiliation other than being a happy customer)

Here is a google colab you can try it in: https://colab.research.google.com/drive/1rOvQNs0Cmn_yU1bKWjC...

This is a lot of fun!

And they aren't kidding that on a CPU backend it is slooooow :)

Wow this is cool I think I am going to make a site so people can use this

How about “fix the hands”?

The hands issue is going to be an awesome story for all of us in 10-20 years. The younger generation just won't fathom how hard it was to get proper hands. I wonder what a comparable example today would be? Something where the slightly technical general public just can't wrap their heads around why it was complicated "back then".

Maybe today's example is a smart voice assistant like Alexa.

Or maybe it will never be fixed, and in the future when they are trying to determine if someone is a human or an artificial replica, they will simply ask them to draw a set of human hands as a test.

Most humans would struggle to draw human hands as well.

Someone needs to redo the blade runner scene with SD with the hands question :)

Well, this already uses a default negative prompt https://github.com/brycedrennan/imaginAIry/blob/master/imagi... so it may fix the hands automatically.

This is cool! Makes me want to pull the trigger on an M2



how about telling cars where to go?



It's a little premature, fine, but I want to start liquidating my rhetorical swaps here: I've been saying since last summer (sometimes on HN, sometimes elsewhere) that "prompt engineering" is BS and that in a world where AI gets better and better, expecting to develop lasting competency in an area of AI-adjacent performance (a.k.a. telling an AI what to do in exactly the right way to get the right result) is akin to expecting to develop a long-lasting business around hand-cranking people's cars for them when they fail to start.

Like, come on. We're now seeing AIs take on tasks many people thought would never be doable by machine. And granted, many people (myself included to some extent) have adjusted their priors properly. And yet so many people act like AI is going to stall in its current lane and leave room for human work as opposed to developing orders of magnitudes better intelligence and obliterating all of its current flaws.

Been doing a lot with prompts lately. What people are calling "prompt engineering" I'd call "knowing what to even ask for and also asking for it in a clear manner". That was a valuable skill before computers and will continue to be one as AI progresses.

I've been pretty disappointed to introduce ChatGPT to people in jobs where it would be a game changer and they just don't know what to do with it. They ask it for not-useful things, or useful things in a non-productive way: "here is some ad copy I wrote, write it better". Whether you're instructing a human, ChatGPT, or an AI god... that's just too vague an instruction.

> I'd call "knowing what to even ask for and also asking for it in a clear manner".

It was a very important skill for searching. Nowadays, with Google's "I know what you want better than you" search, it's not that useful anymore (not useless; I get better search results by not using Google and knowing what I want, just less required).

Most people struggle with deliberate logical thought.

IMHO it stems from lack of imagination. Impressive as the results may sometimes be, the user interfaces for AI are still extremely crude.

Soon we will see AI being used to define semantic operations on images that are hard to define exactly (imagine a knob to make an image more or less "cyberpunk", for example).

I also expect AI-powered inpainting to become a ubiquitous piece of functionality in drawing and editing tools (there are already Photoshop plugins).

Furthermore, my hunch is that many of the use cases around image creation will gradually move towards direct manipulation. Somewhat like painting, but without a physical model. AI components will be probably applied to interpreting the user's touch input in a similar way to how they are currently deployed to understand text input.

Prompt engineering already exists, it's called management.

If Asimov had robopsychologists in his stories, why can't we in real life? Who wants to be the first Susan Calvin?

Haven't we been here before? - see self driving cars.

LLMs and Image AIs are the opposite of self-driving cars. "Everybody" had concrete expectations for at least half a decade now that the moment where self-driving cars would surpass human ability was imminent, yet the tech hasn't lived up to it (yet). While practically nobody was expecting AI to be able to do the jobs of artists, programmers or poets anywhere near human level anytime soon, yet here we are.

Still bad at poetry due to the tokenizer though. I wrote a whole paper on how to fix it: https://paperswithcode.com/paper/most-language-models-can-be...

Great work, congratulations! One question, if I understood it right you based your demo on GPT-2 - what is your experience working with those open-source language AIs. In terms of computational requirements and performance?

I'm really fascinated by all the tools the open-source community is building on top of Stable Diffusion (like OP's), which compare favourably with the latest closed-source models like DALL·E or Midjourney and can run reasonably well on a high-end home computer or a very reasonably-sized cloud instance. For language models, it seems the requirements are substantially higher, and it's hard to match the latest GPT versions in terms of quality.

If LLMs (etc.) had the same requirements and business models as AV cars, they'd still be considered a failure. Nobody expects Stable Diffusion to have a 6-sigma accuracy rate*, nor do we expect ChatGPT to seamlessly integrate into a human community. The AV business model discourages individual or small-scale participation, so we wouldn't even have SD (would anyone allow a single OSS developer to drive or develop an AV car? OK, there's Comma; that's all there is on the OSS side).

* The number of times I've seen an 'impressive' selection of AI images that I consider a critical failure deserves its own word. The AIs are impressive for even getting that far; it's just that some people have bad taste and pick the bad outputs.

I've certainly seen this argument before..

Yes, it's true that not all technology evolves as fast as predicted (by some, at some point), but first of all I still believe we will see self driving cars in the future and secondly, it's one anti-example in a forest of examples of tech that evolves beyond anyone's expectations. I don't find it very convincing.

Autonomous driving as it currently exists came unexpectedly for most people. Now many look at it with the power of hindsight, but back in the day the majority never thought we'd have cars (partly) driving on their own within a few years. The case of AI art seems exactly the same to me: now that many are working on it there's lots of progress, but it's still nowhere near what an experienced human can do. And that seems to be the general rule, not an exception.

We might need to create real intelligence for that to become true. A machine that can think and is aware of its purpose.


Large language models are stateless. The apps and fine-tuned models are doing prompt engineering on users' behalf. It's very much a thing for developers, with the goal of making it invisible for end users.

I think that "prompt engineering" stuff went away when ChatGPT came out.

Has it? I mean, maybe the idea of people doing this as a long-term career has, but practically, I still find it a challenge to get those AIs to do exactly what I want. I've played around with Dreambooth-style extensions now, and that goes some way for some applications, and I'm excited to try OP's solution, but in my experience, it is still a bit of a limitation for working with those AIs right now.

Oh yeah it's definitely still an issue right now! But I think the power of ChatGPT's ability to understand and execute instructions has convinced most people that "prompt engineering" isn't going to be a career path in the future.

Absolutely. I briefly thought about asking ChatGPT to write a prompt, but then I remembered that the training corpus is probably older than those tools (I heard that if you ask it the right way, it will tell you that its corpus ended in '21 - whether that's true or not, it sounds plausible). But that's a truly temporary issue; the respective subreddits probably have enough information to train an AI for prompt engineering already (if you start from a strong foundation like the latest GPT versions).

Plus, who knows whether future models won't be able to integrate those different modes much better (along those lines https://www.deepmind.com/publications/a-generalist-agent).

In the near future you can totally imagine a dialogue like the one you'd have with a real designer: "can you make it pop a bit more?" or "can you move that logo to the right side?". It might take some trial and error, but it's only going to improve.

Making the AI truly creative (which means going beyond what the client asks for, towards things the client doesn't even know they want) would be a much larger leap and potentially take a lot longer.

I don't get it. Pre-ChatGPT prompt engineering was a BS exercise in guessing how a given model's front-end tokenizes and processes the prompt. ChatGPT made it only more BS. But I saw a paper the other day implementing a more structured, formal prompt language, with working logic operators implemented one layer below - instead of adding more cleverly structured English, they were stepping the language model with variations of the prompt (as determined by the operators), and did math on the probability distributions of next tokens the model returned. That, to me, sounds like a valid, non-BS approach, and strictly better than doubling down on natural language.
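
The idea of doing math on next-token distributions can be sketched in a few lines. This is a hypothetical illustration (the operator names and exact math are assumptions, not taken from the paper mentioned above): an AND-like operator could multiply the distributions produced by two prompt variants and renormalize, while an OR-like operator could average them.

```python
import numpy as np

def and_op(p, q):
    """Hypothetical AND: a token must be likely under both prompt
    variants. Multiply element-wise, then renormalize."""
    combined = p * q
    return combined / combined.sum()

def or_op(p, q):
    """Hypothetical OR: a token may come from either variant.
    Average the two distributions."""
    return 0.5 * (p + q)

# toy next-token distributions returned for two prompt variants
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

# both operators yield valid probability distributions
for dist in (and_op(p, q), or_op(p, q)):
    assert np.isclose(dist.sum(), 1.0)
```

The point is that the operators live below the tokenizer, in probability space, rather than being encoded as extra English in the prompt.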

Think about the problem in an end-to-end fashion: the user has an idea of what sort of image they want, they just need an interface to tell the machine. A combination of natural language plus optional image/video input is probably the most intuitive interface we can provide (at least until we've made far more progress on reading brain signals more directly).

How exactly we get there, by adding layers on top like language models, or adding layers below like what you described, doesn't seem like such a fundamental difference. It's engineering: you try different approaches, vary your parameters, and see what works best. And from the outset, natural language does seem like a good candidate for encoding nuances like "make it pink, but not cheesy" or "has the vibes of a 50s Soviet propaganda poster, but with friendlier colors".

The headline and the heavy promotional verbiage on the site seems to be claiming this is some new functionality we didn’t have before. Image2image with text instructions isn’t new as the headline implies.

InvokeAI (and a few other projects as well) already does all this stuff much better unless I’m missing something. There are plenty of stable diffusion wrappers. Why not help improve them instead of copying them?

I’m not against having enthusiasm for one’s project, but tell us why this is different and please don’t pretend the other projects don’t have this stuff.

I'm not aware of any pre-existing open-source model that selectively edits images (leaving some parts untouched) based on instructions. This new method is much better than the image2image that shipped with the original stable diffusion. I'm looking at the InvokeAI docs right now and don't see anything like this feature. We previously had smart-masks, but InstructPix2Pix mostly does away with the need for those as well.

If I am mistaken please provide links to these prior features.

If only Stable Diffusion wasn't already populated with a host of copyrighted images already.

Make your own art, dammit. This is the equivalent of running some Photoshop filters through someone else's work.

Doesn't work if any people are in the photos: https://twitter.com/kumardexati/status/1616972740728356867/p...

Did you have to tweet that it wasn't working, versus just not making it a public "omg it's not working, it's no good"?

Works fine for me, you just need to adjust the strength of the edit.

You mean steps?

No. In imaginAIry it's called `--prompt-strength`. In other libraries it's called CFG, or "classifier-free guidance". For the image edits I vary the strength of the effect between 3 and 25.

For the specific example you provide you could also use a prompt-based mask to prevent it from editing the person.
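
Classifier-free guidance, mentioned above, reduces to a simple formula: extrapolate from the unconditional prediction toward the conditional one by a scale factor. A toy numpy sketch (illustrative only; the real operation runs on the model's noise predictions at each diffusion step):

```python
import numpy as np

def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional output, in the direction of the conditional one,
    by `scale` (the "prompt strength")."""
    return uncond + scale * (cond - uncond)

# toy noise predictions for a 2x2 "image"
uncond = np.array([[0.1, 0.2], [0.3, 0.4]])
cond = np.array([[0.5, 0.2], [0.1, 0.4]])

# scale 1.0 reproduces the conditional prediction exactly
assert np.allclose(cfg_combine(uncond, cond, 1.0), cond)
# scale 0.0 ignores the prompt entirely
assert np.allclose(cfg_combine(uncond, cond, 0.0), uncond)
```

Scales above 1 (like the 3-25 range mentioned above) over-emphasize the prompt direction, which is why higher values make edits more aggressive.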

It does work on some things with people. I colorized a black and white photo of myself and then turned the colorized version into me as a Dwarven king.

These look awful! They are very displeasing aesthetically. They look like they were done by someone with absolutely no artistic ability. Clearly there is some technical interest here, but I just felt the need to point out the elephant in the room. They are very ugly.

I'm not an artist, but they look fine to me. I am not the kind of person who spends hours in a gallery admiring the nuances of paintings or photos. However, at the level of detail at which I usually admire these things, the clown one looked interesting and the Mona Lisa one was funny. The strawberry seemed a bit weird, but I don't think I'd care for it even if it was perfect anyway. The wintry landscape I thought was pretty good, and the red dog delivered what was asked. Not sure how it could be much different than that.

"I'm not qualified to have a nuanced opinion about this, but let me confidently tell you what I think..."

Well, most people who consume art are not professional artists. That was my main point. From the point of view of a lay person (such as I), it looks pretty good.

I guess we should lower our standards, then...

I love the cognitive dissonance between "you're just stealing people's art and modifying them slightly!" versus "AI art sucks and has no artistic value"

I mean, the two are obviously not mutually exclusive.

Wow, it's really impressive to see how advanced AI image generators have become! The ability to create stable diffusion images with a "just works" approach on multiple operating systems is a huge step forward in this technology. We've deployed similar tech and APIs for our customers and are contemplating using this library as part of our pipeline for https://88stacks.com

Dark patterns are frowned upon here on HN.

Letting the user upload dozens of images and only after that telling them they need an account. Not good.

it's not a dark pattern. what would happen to the images after uploading?

> it's not a dark pattern

You are entitled to your opinion, no matter how wrong many of us think it is :)

> what would happen to the images after uploading?

I have no idea, as there is no such information given going as far as the upload form, nor in the FAQ. This is information you should provide.

Though that isn't the key problem IMO. For someone who backs out because of the sign-up requirement, you've wasted their time (and the service now has their images with no obvious pre-agreed policy covering re-use or other licensing issues).
