Open-source rival for OpenAI’s DALL-E runs on your graphics card (mixed-news.com)
345 points by hardmaru on Aug 15, 2022 | 181 comments



Stable Diffusion is mind-blowingly good at some things. If you are looking for modern artistic illustrations (like the stuff you would find on the front page of Artstation), it's state of the art - better, in my opinion, than DALL-E 2 and Midjourney.

But the interesting thing is that while it is so good at producing detailed artworks and matching the styles of popular artists, it's surprisingly weak at other things, like interpreting complex original prompts. We've all seen the meme pictures made in Craiyon (previously Dalle-mini) of photoshop-collage-like visual jokes. Stable Diffusion, with all its sophistication, is much worse at those, and struggles to interpret a lot of prompts that the free and public Craiyon handles well. The compositions are worse, and it misses a lot of requested objects or even misses the idea entirely.

Also, as good as it is at complex artistic illustrations, it is equally bad at minimalistic and simple ones, like logos and icons. I am a logo designer and I am already using AI a lot to produce sketches and ideas for commercial logos, and right now the free and publicly available Craiyon is head and shoulders better at that than Stable Diffusion.

Maybe in the future we will have a universal winner AI that is the best at any style of picture you can imagine. But right now we have an interesting competition in which different AIs have surprising strengths and weaknesses, and there's a lot of reason to try them all.


Just think where we'll be two more papers down the line


For those unaware, this is a catchphrase of Dr. Károly Zsolnai-Fehér from the absolutely wonderful YouTube channel "Two Minute Papers", which focuses on advances in computer graphics and AI.


Random rant: it feels like over time Two Minute Papers has started to lean more and more into its catchphrases and gimmicks, while the density of interesting content keeps decreasing.

The whole "we're all fellow scholars here" bit feels like I'm watching a kid's show about science popularization, patting me on the head for being here.

"Look how smart you are, we're doing science!"

I dunno. I like the channel for what it is (a popularization newsletter for cool ML developments), but sometimes the author feels really patronizing / full of himself.


I agree, and I like it for what it is - something more along the lines of Popular Science or Wired than Scientific American, if you want to compare it to magazines. However, the content, while surface level, is always accurate - something that can't be said for other content creators in the field.


I agree that it can be a lot at times, especially if you watch several in a row, but I dunno, I kind of love that he's keeping that enthusiasm (real or not). I think the world is a brighter place because of it. Just a tiny bit, but still.


I think the biggest benefit is the curation aspect. After all, how much can you actually learn in two minutes? Once I see something interesting, I go and read through the actual paper. Having said that, you're lucky if you can find a paper with enough details to actually reproduce the work.



You're mistaking earnest for patronizing. He's a genuinely positive dude.


He stopped summarizing methods at some point - now it's just results.


> Now squeeeze those papers!


It's surprisingly weak at interpreting complex original prompts because the model is really small; the text encoder is just 183M parameters. Craiyon is much larger.


I have a penchant for wanting to make technically "bad" or heavily stylized photos - and Stable Diffusion is pretty poor at those. There's very little good bokeh or tilt-shift stuff, and CCTV/trailcam doesn't come out too well.

In fact, Dall-E isn't as impressive for some styles as "older" models (Jax / Latent Diffusion, etc.).


My hunch is that this is a result of the training regime: https://github.com/CompVis/stable-diffusion#weights

> 515k steps at resolution 512x512 on "laion-improved-aesthetics" (a subset of laion2B-en, filtered to images with an original size >= 512x512, estimated aesthetics score > 5.0)

See https://github.com/LAION-AI/laion-datasets/blob/main/laion-a... for more details.

What's remarkable is this: https://github.com/LAION-AI/laion-datasets/blob/main/laion-a...

That aesthetic predictor was apparently trained on only 4000 images. If my thinking is correct, imagine the impact those 4000 ratings have had on all of the output of this model.

You can see samples (some NSFW) of different images from the original training set in different rating buckets here, to get an idea of what was included or not in those training steps. http://3080.rom1504.fr/aesthetic/aesthetic_viz.html


That is really a shame, because all I really want is a version of Craiyon that I can modify and run on my own hardware.

The amount of enjoyment I have derived from playing with Craiyon over the last two months is ridiculous.


IIRC Craiyon runs Dalle-mega. https://huggingface.co/dalle-mini/dalle-mega

Note: I think you need 16 GB of VRAM to run it.


You can run Craiyon / dalle-mini on a card with 8 GB of VRAM if you decrease the batch size to 1 and skip the CLIP ranking step. It takes about 7 seconds to generate an image on a 3070.

I started with https://github.com/borisdayma/dalle-mini/blob/main/tools/inf... and pared it down.
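For anyone curious, the pared-down loop looks roughly like this - a minimal sketch assuming the dalle-mini package layout from around that time (the model IDs and exact generate() arguments are from memory and may have drifted):

    import jax
    import jax.numpy as jnp
    from dalle_mini import DalleBart, DalleBartProcessor
    from vqgan_jax.modeling_flax_vqgan import VQModel

    # Load the model, its text processor, and the VQGAN image decoder.
    model, params = DalleBart.from_pretrained(
        "dalle-mini/dalle-mega", dtype=jnp.float16, _do_init=False)
    processor = DalleBartProcessor.from_pretrained("dalle-mini/dalle-mega")
    vqgan, vqgan_params = VQModel.from_pretrained(
        "dalle-mini/vqgan_imagenet_f16_16384", _do_init=False)

    # Batch size 1: a single prompt, a single sampled image.
    tokens = processor(["an armchair in the shape of an avocado"])
    encoded = model.generate(**tokens, prng_key=jax.random.PRNGKey(0),
                             params=params)

    # Decode the image tokens with the VQGAN. The CLIP step only ranks
    # multiple candidates, so with one image per prompt it can be skipped.
    images = vqgan.decode_code(encoded.sequences[..., 1:], params=vqgan_params)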


Have you checked out MidJourney? Makes Craiyon look like crayons :P


Craiyon is free, whereas Midjourney is not. If you want MJ level quality, check out Disco Diffusion or go straight to Visions of Chaos, which runs just about every AI diffusion script in existence. The dev is very active and adds new features every couple days, such as recently the ability to train your own diffusion models, which I've been doing the last 3 days nonstop on my little 3060 Ti (8GB VRAM, which is barely sufficient to run at mostly default settings).


MidJourney does give you 25 minutes of free compute time though, which is enough to try it at least ~40 times.

I've checked out Disco Diffusion but hadn't heard of Visions of Chaos, thanks. The biggest shortcoming of DD is that there's simply no sufficiently trained model yet to produce results at the level of MidJourney or Craiyon.


Are they giving out trials without an invite now? I was invited by someone who was paying, became addicted by the end of my trial, subscribed for a month, then gave out invites to friends, some of whom ended up also paying - not a bad business model!

It was bad timing though: the day after I joined, I discovered Disco Diffusion and haven't stopped rendering since (roughly 10k images rendered, mostly for animations). It takes longer, and the results are often less realistic compared to Midjourney, Dall-E (1/2) or Stable Diffusion (which I've been toying with for a few weeks), but it's somehow much more satisfying having to wait xx minutes for a render to complete, running on your own local PC, not having to use bots with 1000 other people in a channel spamming their prompts, and having TONS of parameters to play around with. I have a Google Drive full of docs from my own studies, comparing parameter values, models, prompts, etc.

I'm really looking forward to Stable Diffusion releasing their models; I know VoC will add those models as soon as they are available. On top of that, VoC has been adding support for diffusion models (of which I'm training my own), and there are new ones added constantly as more and more people build models for e.g. pixel art, medieval style, monochromatic, etc.

Also, the results vary drastically if you have enough VRAM to load more models, e.g. a 3090 (24 GB) or an A6000 (48 GB). I've been saving money and waiting impatiently for 4090s to drop. Check out the Disco Diffusion or VoC Discord - people post their works in there and often you will see results that make you wonder if they're cheating ;)


Which are the best models the 3090 enables you to load?


I'd start with ViTL14 and add one or more RN types. I personally like to use multiple ViT and RN models, just to fill up VRAM. What exactly will fit depends on your output resolution and requires a lot of trial and error; I always have Task Manager open to monitor VRAM usage. In general, ViT = more realistic, RN = more artistic. It can take a lot of experimentation to find what exactly tickles your fancy, and I constantly change which models I'm using depending on what I'm going for. This redditor did a nice comparison[0], and there are many more "studies" on which models to use - you can google around, there are new articles/studies being posted daily.

You can also try disabling use_checkpoint if you have extra VRAM, since it will render a bit faster (but uses more VRAM, since checkpointing saves memory by recomputing intermediate activations instead of keeping them all around).

When/if you get bored, try disabling use_secondary_model, which will use a lot more VRAM but can deliver results on a completely different level. You will likely struggle for a few days figuring out which parameters to tweak to get good results (e.g. tv_scale, sat_scale, etc., which are otherwise AFAIK ignored).
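For reference, here's roughly what those knobs look like in the notebook's settings cells - illustrative values only; exact flag names vary a bit between forks and versions:

    # CLIP model selection: enable several to fill up VRAM.
    # Rough rule of thumb: ViT* = more realistic, RN* = more artistic.
    ViTB32 = True
    ViTB16 = True
    ViTL14 = True      # the big one; wants a 24 GB card alongside RN models
    RN50 = True
    RN101 = False

    # Gradient checkpointing: True saves VRAM by recomputing activations;
    # False renders a bit faster but needs the extra memory.
    use_checkpoint = True

    # The cheap secondary model approximates the denoised image for CLIP
    # guidance. Disabling it uses far more VRAM, and makes parameters like
    # tv_scale / sat_scale actually matter.
    use_secondary_model = True
    tv_scale = 0       # smoothness penalty
    sat_scale = 0      # saturation penalty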

In any case, I recommend reading A Traveler's Guide to the Latent Space[1], which I call "The Bible" since it covers so many topics, has links to various studies, and will keep you busy reading for months ;)

Also check out the Discord for Disco Diffusion and Visions of Chaos, as you can read endless tips and tricks to getting amazing results.

Have fun! :)

[0] https://www.reddit.com/r/DiscoDiffusion/comments/t7p4bi/seas...

[1] https://sweet-hall-e72.notion.site/A-Traveler-s-Guide-to-the...


I think I might disagree with your assessment of DD.

I can't use it, but check out this guy's work. Incredible detail:

https://instagram.com/textrnr


> Of course, with open access and the ability to run the model on a widely available GPU, the opportunity for abuse increases dramatically.

> “A percentage of people are simply unpleasant and weird, but that’s humanity,” Mostaque said. “Indeed, it is our belief this technology will be prevalent, and the paternalistic and somewhat condescending attitude of many AI aficionados is misguided in not trusting society.”

Holy shit.

On the one hand, I'm super excited by this technology, and the novel applications that will become possible with these open-source models (stuff that would never be usable if Google and OpenAI had a monopoly on image generation).

On the other hand, I really really really hope Bostrom's urn[0] has no black ball in it, because we as a society seem to be rushing to extract as many balls as possible over increasingly short timescales.

[0] https://nickbostrom.com/papers/vulnerable.pdf


I don't see why this is incorrect. Ever since DALL-E and Midjourney caught on, it seems like we've got more and more people trying to 'filter out' incorrect uses of their software, under the assumption that people cannot be trusted to just use it for whatever they want.

And it depresses me, because well... imagine if other pieces of tech were treated this way. If the internet or crypto or computers or whatever were heavily limited/restricted so the 'wrong people' couldn't use them for bad things. We'd consider it ridiculous, yet it's somehow accepted for these image generation systems.


Imagine if image editors worked like this: only over the internet, with rules attached, and if a human moderator finds you breaking them, you get banned.

It sounds ridiculous, yet Photoshop is more dangerous than DALL-E in all regards.


Nuclear tech is treated this way (much stricter even).


I think there might be at least a small difference between nuclear tech and image generation, at least as far as the effects that could happen if it goes wrong.


Get back to me when you can vaporise a city with an AI-generated image.


You just need the right AI generated image.


Part of me kinda wants to find that image.


Good cryptography was treated this way for a while, at least by the US government.


Yes, and look how many amazing uses of cryptography have proliferated after it was made available, unrestricted, to the masses.


The length of the democratization cycle we're seeing - months or even weeks between a breakthrough model and a competent open-source alternative that runs on commodity hardware - really highlights the genie-stuffing posture of Google and OpenAI. All the thoughtful, if highly paternalistic, guardrails they build in amount to little more than fig leaves over the possible applications they intend to close off.

I'm personally in the "AI risk is overstated" camp. But if I'm wrong, all the top-down AI safety in the world is going to be meaningless in the face of a global network of researchers, enthusiasts, and tinkerers.


I wonder how it would be possible to stop people from publishing and spreading research that will doom some subsegment of humanity/culture. If it turns out that proliferating something like DALL-E 5 causes serious irreversible effects to human culture, what would the researchers conclude was the correct thing to do at this exact point in time? Stop publishing AI research? How would we get all 8 billion people on Earth to agree?

It's the reason why the rapid pace of progress in this field scares me at times. It feels like gradually being crushed under a wall - the inevitable march of progress. It could be the case that stopping ourselves before it's too late isn't possible.

Sometimes I get the feeling that the laws of nature will eventually destroy or severely impact any lifeform that gains too much of an understanding about the world. For example, in some other universe it might be possible to survive X more decades if the laws of physics weakened the effects of nuclear war just enough for civilization to recover in a relatively short period of time, but we're stuck with what we have, and that doesn't necessarily map to the long-term survival of a highly intelligent lifeform.

It would be a shame if unbounded curiosity would be humanity's undoing. That curiosity is also a part of me, and my family, and my neighbors down the street, and those in the situation room.


> How would we get all 8 billion people on Earth to agree?

You don’t, so the question is mostly academic.


They claim the guardrails are for public good, but they're pretty clearly using them to try to establish a competitive moat.

It's similar to the "we don't sell personal information" claim. Sure, but that's because they make money renting malicious actors access to a black box that contains your personal information. Selling the contents of the box would reduce their overall revenue.


To me it seems like an obvious case of reputational risk being much larger for more prominent organizations than for smaller ones.

It makes sense for Google to wait for some startup to "go first" in releasing a model largely without controls. That way, some random startup takes the initial heat of "people are using AI for bad things!!" headlines plastering tech blogs. Then Google can do basically the same thing a little bit later, and any attack pieces will sound old hat.


I was using AI Dungeon with the full power GPT-3 model before they crippled it. That thing had a very uninhibited mind for erotica! Imagine what would happen when that power comes to image models!


It's only a matter of time before someone trains a model on stills from the major adult sites. Actually surprised it hasn't been done / made public yet.


Yes, I feel much safer if OpenAI and Google are the sole keepers of such technology. They have my and the public's best interests at heart.


Is this satire?


Yes


Googlers and OpenAI legitimately believe this


Sarcasm.


Let me put it this way: it's not great to live in a world where the immense majority of nukes are controlled by Donald Trump and Vladimir Putin.

But it's arguably better than living in a world where every single citizen has a nuke.

(Though the potential for harm of diffusion models is far below that of nukes; it's not "kill millions of people", it's "produce cheap disinformation and very convincing fake evidence to ruin someone's life".)


I think that would be a tough argument to make (in regards to image generation). The same could be said of just about any computing technology. The problem is we lose out on a lot of potential good.

Either way it doesn't matter; you can't control bits like you can enriched uranium. It's just a matter of time. In the grand scheme of things, OpenAI will be irrelevant.


People will generate creepy porn and fake pictures. Humanity will survive this.


You're not addressing my broader point, though, just the easy-to-snide-at version of my point.

Yes, it's pretty obvious that Dall-E and similar models won't destroy humanity.

My point isn't that Dall-E is a black ball. My point is we better hope a black ball doesn't exist at all, because the way this is going, if it exists, we are going to pick it, we clearly won't be able to stop ourselves.

(For the sake of discussion, we can imagine a black ball could be "a ML model running on a laptop that can tell you how to produce an undetectable ultra-transmissible deadly virus from easily purchased items")


> we can imagine a black ball could be "a ML model running on a laptop that can tell you how to produce an undetectable ultra-transmissible deadly virus from easily purchased items"

I think we’re already past the point where we could have done something about this. In fact, we’ve probably been past that point since humanity was born.

I think it’s probably more valuable if we think about how we’ll deal with it if we do draw something that could be/is a black ball.

That said, so far all evidence points to extreme destruction just being really hard, which leads me to believe that truly black ball technologies may not exist.


What do you mean by 'black ball'?

Like 'black balling', eg shunning?

Or 'black box', eg poorly understood technology?


https://nickbostrom.com/papers/vulnerable.pdf

> black ball: a technology that invariably or by default destroys the civilization that invents it.


If you know that image generators are not it, then why talk about it here? Do you get this kind of angst at every technological increment?

Apart from nuclear scientists I don't know a field where participants are as conscious of the risks as AI research.


> Apart from nuclear scientists I don't know a field where participants are as conscious of the risks as AI research.

Great. Now, some of these researchers perceived some risk with this technology - not human-extinction-level risk, but risk - so they attempted to control the technology. To be specific: OpenAI was worried about deepfakes, so they engineered guard rails into their implementation. OpenAI was worried about misinformation, so they did not release the bigger GPT models.

Note: I'm not arguing either way about whether OpenAI was right, or honest about their motivations; I'm just observing that they expressed this opinion and acted on it to guard against the risk.

Got this so far? Keep this in mind because I’m going to use this information to answer your question:

> If you know that image generators are not it, then why talk about it here?

Because it is a technology which was deemed risky by some practitioners, they attempted to control it, and those attempts to control the spread of the technology failed. This does not bode well for our ability to restrain ourselves from picking up a real black ball, if we ever come across one. And that is why it is worth talking about black balls in this context.

Note that it is unlikely a black ball event will completely blindside us. It is unlikely that someone develops a clone of Pac-Man with improved graphics and boom, that alone leads to the inevitable death of humanity. It is much more likely that when the new and dangerous tech appears on our horizon, there will be people talking about the potential dangers. What remains to be answered: what can we do then? This experience has shown us that if we ever encounter a black ball technology, the steps taken by OpenAI don't seem to be enough.

This is why it is worth talking about black ball technologies here. I hope this answers your question?


Social media is being used to undermine societies globally, arguably it has a fair chance of destroying humanity.

So I think that horse may have bolted already.


Who says a computer would be able to invent a virus by thinking about it really hard?


I hear you. The concepts we are dealing with here are very abstract.

It sounds like you are getting hung up on the details of a particular example. That is not useful. We can't give you exact details for a particular black ball because we haven't encountered one yet. Sadly, the fact that we haven't encountered one yet doesn't mean that they don't exist.

Think about it like this: There are technologies which are easier to stop spreading and there are technologies which are harder to stop spreading.

An example of a technology which is easier to stop: imagine that a despotic government wants to stop people from doing space launches. All the known tech to reach orbit is big and heavy and requires a lot of people. It is comparatively easy to send out agents who look at all the big industrial installations and dismantle the ones used for space launches. There will be only a handful of them and they are hard to hide.

Now an example of a technology which is harder to stop: imagine that the fictional despotic government has it in for cryptography. That is a lot harder to stop. One can do it alone in the privacy of their own home! All you need is pen and paper, and that can be hidden anywhere! A lot, lot harder for the agents to find and disrupt.

We talked about how easy it is to stop the spread of a given technology. Now let's think about something else: the potential of a given tech to cause harm.

An example of a risky technology: nuclear weapons. If you have them you can level a city. That is a lot of harm in one pile.

An example of a less risky technology: ergonomic tool handles - those rubbery overmoldings which make it nicer to use a tool long term. There is no risk-free technology, but I hope you agree that these are a lot less dangerous than a nuclear bomb.

Do I have you so far? Good, because this was the easy part. We talked about things which already exist. Now comes the hard part. This requires some imagination: we have seen tech which was easy to control and tech which was harder. We have seen tech which was risky and tech which was less risky. Can these properties come in all combinations? In particular: are there technologies which are both risky and hard to control? Something, for example, where any able human can accidentally or intentionally level a city or kill all humans? I can't give you an example; we don't have technology like that yet.

The example you are asking about is an example of this kind of technology: high risk, hard to control.

Nobody says that you can download software today from GitHub which can help you engineer a deadly virus from household chemicals. This does not exist. It is a stand-in for the kind of tech which, if it were possible, would mean that we have a high-risk, hard-to-control technology.

Does this help explain the context better? Let me know if you still have questions.


Sorry to cut out your long post, but:

> Does this help explain the context better? Let me know if you still have questions.

That's the problem. The thing you're scared about (dangerous technology) has nothing to do with the context (AGI) because there's no reason to think AGI is especially capable of creating any of it or is going to. Humans create general intelligences (babies) all the time and you aren't capable of, nor are you putting any effort into, "aligning" babies or stopping them from existing.

AGI being superintelligent won't give it superhuman creation powers, because creating things involves patience, real-life experimentation and research funds, and while I'll grant you computers have the first they won't have the other two.


Sorry, where did I mention anything about AGI? Why is that the "context"?

Some form of AGI under some circumstances might be black ball tech. There can be other black balls which have nothing to do with AI let alone AGI.

> The thing you're scared about

I'm scared about many things, but black ball tech is not one of them.

One can discuss existential risks without being scared about it.

> they won't have the other two

If you say so? I don't agree with you on this, but it feels like this would mislead the conversation, since AGI and black ball tech have at most some overlap.


> Sorry. Where did i mention anything about AGI? Why is that the “context”?

Well, that's what the article's about. (With some generalization, since I don't think anyone expects art AI to be civilization-ending.)


Arthouse cinema has been doing that and more for decades and we're still here.


Exactly, and there are many pros for humanity to this too: people will be able to make funny pictures and things like that, so it's not like it's a bad deal.


What if the black ball was a red herring all along, and the usual suspects' (tech CEOs') hands rushing to control said crystal ball are the real hazard?



Wouldn’t nuclear weapons or even plastics be a black ball already?

Humanity is not homogeneous; we will always react to new inventions or tools differently - many will use them positively, some won't. Short of weapons of mass destruction, I am not sure anything else will destroy civilization itself.


No, the black ball is a technology that, once invented, humanity cannot survive. Nuclear weapons have been invented and humanity is surviving. Same with plastics.

A black ball would be like - suppose nuclear weapons ignited the atmosphere. We test the first nuke, it ignites the atmosphere, a global fire storm consumes all breathable oxygen, kills all plants and everyone on the surface and everything else suffocates shortly after. Plastics aren't even close to this level of harm.


>> Wouldn’t nuclear weapons or even plastics be a black ball already?

> No, the black ball is a technology that, once invented, humanity cannot survive. Nuclear weapons have been invented and humanity is surviving. Same with plastics.

I don't think that definition is a good one. Technological civilization [1] has survived nuclear weapons for ~80 years, but there's no guarantee it will survive them for another 80 years, let alone forever. It seems like these "black balls" should be thought of like time bombs; there are at least two variables: how much destruction one will cause when it goes off AND the delay time before that happens. We shouldn't confuse a dangerous technology with a long delay time for a safe technology. My intuition tells me that there will probably be nuclear war at some point over the next 1,000+ years.

[1] I don't think nuclear weapons can make humanity extinct, so long as there are still little poorly-connected subsistence communities in remote areas. However, if The Market manages to extend its tentacles into every human community, we're probably fucked.


The term comes from Nick Bostrom's article on The Vulnerable World [1]. In it he defines a black ball like this: "a black ball: a technology that invariably or by default destroys the civilization that invents it". Nuclear weapons don't invariably destroy civilization, because we can imagine that we keep using them as-is - that's not impossible. Also, Bostrom considers nuclear weapons explicitly and calls them a gray ball.

1 - https://nickbostrom.com/papers/vulnerable.pdf


> "a black ball: a technology that invariably or by default destroys the civilization that invents it"

I think I see the issue, and I think that summary leaves an important facet out that the paper talks about. It would probably be better as "a black ball: a civilization-destroying technology that cannot be regulated, so invariably or by default destroys the civilization that invents it"


I guess people are getting confused because in terms of risk-of-destroying-humanity, nuclear weapons seem higher risk than DALL-E.


Nuclear weapons alone aren't an existential risk. There are far fewer nuclear weapons in the world than there are major cities, among other things.

Dall-E isn't an x-risk, but an advanced AI might be (though a lot of people have their opinion on that part).


This is news to me. A "modern" nuclear weapon is much more powerful than the ones we recall. The difference between what was used in WWII and what was developed after is astounding. I was under the impression the US nuclear arsenal alone could wipe out humanity. It's not the bomb itself that would kill most people; it's what comes after. And that's just ONE bomb. There's a reason nuclear-armed countries show restraint, as do aggressors against those countries. It's not a war, it's suicide.


> A "modern" nuclear weapon is much more powerful than the ones we recall.

The nuclear weapons deployed during the 60s and 70s were far more powerful than the ones of today. Instead of multi-megaton yields being the default, most warheads are now in the 100-300 kt range (largely because improved accuracy reduced the size of warhead required to take out a target). That means 2-3x the damage radius of the Hiroshima and Nagasaki bombs at 10-15 kt (radius does not scale linearly with yield).
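That 2-3x figure is consistent with the usual cube-root scaling rule of thumb for blast effects, which you can sanity-check in a couple of lines:

    # Blast damage radius scales roughly as yield^(1/3).
    ratio = (300 / 15) ** (1 / 3)   # 300 kt warhead vs. ~15 kt Hiroshima bomb
    print(f"~{ratio:.1f}x the damage radius")   # ~2.7x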

> I was under the impression the US nuclear arsenal alone can wipe out humanity.

Even if you assume 1 nuke = 1 city, the US only has ~5k warheads. According to [0], city number 3000 has 141k inhabitants. Now, the indirect effects are going to kill a LOT more (no more industrialized agriculture or global supply chains), but that'll still only get you to 99% at most. (And that's in this extremely contrived scenario where it only takes one nuke to kill Beijing, and not a few dozen.)

0: https://data.mongabay.com/cities_pop_03.htm


There's also nuclear winter to contend with. 100 or more cities being destroyed, throwing dust and soot into the air, is enough to cause problems with food production, leading to widespread famine. Still not enough to eliminate human life entirely, but severely unpleasant to undergo. For that, you'd need to look into a significantly larger explosion, enough to crack the Earth's crust and unbalance its orbit. It might be easier to deorbit the moon and crash it into the Earth.


> Same with plastics

Plastics are playing the long game. They have to turn into micro- and nanoplastics first and may then enact undesired, unforeseen biological functions, just like BPA [1].

Not even talking about weaponizing this stuff...

[1] https://pubmed.ncbi.nlm.nih.gov/21605673/


It seems implausible to me that plastics are going to kill all humanity. If the paper you linked makes that case, then I will read it, but I didn't get that from skimming the abstract.


>Nuclear weapons have been invented and humanity is surviving.

Isn't the point of the discussion started by PoignardAzur about how we deal with such technology after it is pulled out of the bag?

If you define black ball technology as fundamentally uncontainable, then there is no point in talking about our practices of restricting access to new tech.


Either we equalise chaos or we reduce chaos, there exist no other options for entropy incarnate.


Yet another "open" model that isn't open. We shall see if they actually do release to the public. We keep seeing promises from various orgs but it never pans out.


Their plan seems less hand-wavy; they're being explicit with "first we release it like this, then like that, then freely to everyone".

You're right that they could always change their minds, and that would suck, but so far they seem to be up front.


Zero of their flagship models have been released AFAIK. E.g. the pre-DALL-E model, CLIP, still hasn't had the weights for its ViT-H/16 variant released, from a paper 19 months ago.

This was before DALL-E and way before DALL-E 2.

Not that I think they need to. It’s just weird people defend them as being open just because they release crippled versions of their model.


They've said that they intend to release the model soon-ish for anyone to run on their own machine, with no censorship*. I don't think the other major players, like OpenAI, have committed to doing the same.

* They're actually working on a filter right now, but IIRC it's an optional one, for when you don't want to accidentally generate NSFW output.


I'm so sorry I thought the discussion was about OpenAI, I'm glad you clarified it was about StableDiffusion. Hope I didn't muddy the waters too much for others.


No biggie.

You may be interested to know that the code for SD has already been released on Github, and they've given the weights to a bunch of researchers in preparation for full release. I've also heard that one of the researchers leaked the weights earlier today, and 4chan has been using them for, uh, stuff.

https://github.com/CompVis/stable-diffusion


OpenAI should probably rebrand lol the "open" part is basically parody at this point


Agreed, and they are influencing others, like the company in this article, which is following their model of "opening" things.


Recent and related:

Stable Diffusion launch announcement - https://news.ycombinator.com/item?id=32414811 - Aug 2022 (37 comments)


I’m excited for the coming race to improve and miniaturise this tech. Apple has a great track record of making ML models light enough to run locally. There will come a day when photorealistic image generation can run on an iPhone.


Maybe this is their long term plan for getting rid of the camera bump.


3 days from launch to getting your Twitter account suspended.

https://twitter.com/DiffusionPics/


Context?


Can someone tell me how this compares to the guide and repo shared a few days ago on HN: https://news.ycombinator.com/item?id=32384646


This version is a bit more optimized, and better packaged. Also the model has been trained longer, so when the weights become publicly available the resulting quality should be much higher.


There's also Disco Diffusion: https://www.reddit.com/r/DiscoDiffusion/

Not sure how they compare. DD seems to be quite popular. I'm currently setting up DD locally.


I've been running DD for a few months now. I tend to just edit the Python script, or use e.g. entmike's fork, which can read config files, to make changes to the 50+ parameters (basically anything is better than having to use Jupyter Notebooks IMO). Granted, if you don't have a GPU with 6+ GB of VRAM, you can often get a decent enough GPU for free from Google Colab.

For running locally, I can also highly recommend Visions of Chaos, which includes multiple versions/forks of Disco Diffusion, as well as a ton of other latent diffusion scripts, not to mention many many many other generation features such as fractals and even music. They also recently added the ability to train your own diffusion models, which I've been doing the last few days using thousands of my own photographs. It also has a pretty nice GUI, and the dev is extremely responsive on Discord. Also, after you do the setup for VoC, it handles all the Python venv setup otherwise necessary with local DD installs.

In any case, check out the DD Discord and/or VoC Discord for lots of info, tips, help, examples, and support.


Thanks for the info. Is it possible to do something like transfer learning on top of existing models, or do you train your own models from scratch? I'll check out that Visions of Chaos thing. I'm just beginning my journey into this generative art stuff and am basically trying to get it running right now.


Well, there are two types of models: CLIP models and diffusion models. With VoC, Disco, and other latent diffusion scripts, you pick multiple CLIP models and a single diffusion model. The CLIP models are the big gigabyte ones like ViT and RN, and you can use CLIP search engines that search the LAION datasets to give you a rough idea of what will happen when you use particular words in your prompts: https://rom1504.github.io/clip-retrieval

I will otherwise refer you to the "Bible" of latent diffusion: https://sweet-hall-e72.notion.site/A-Traveler-s-Guide-to-the...

Whatever isn't covered in there is probably in the Disco Diffusion cheatsheet: https://botbox.dev/disco-diffusion-cheatsheet/

There are tons of resources out there, and it's a nonstop learning and experimenting process to try to achieve what you want.


Thanks again. Now I got my first image out and it ended up being a complete failure. :) I'll keep experimenting / learning.


Welcome to the party! My first image was also a total failure, it can only get better from here ;) Prepare to spend a lot of time reading before you start to make sense of things.


If you want to see more examples of what this AI is capable of, check out the subreddit:

https://reddit.com/r/stablediffusion


If anyone from Stability is reading, the confirmation e-mail to sign up is sending a broken link:

"We couldn't process your request at this time. Please try again later. If you are seeing this message repeatedly, please contact Support with the following information:

ip: XXXX

date: Mon Aug 15 2022 XX:XX:XX GMT-0700 (Pacific Daylight Time)

url: https://stability.us18.list-manage.com/subscribe/confirm"


It worked for me just now, so maybe it was temporary, or they already fixed it?


I forwarded this thread to a member of the project.


I also had this response.


The site shows a notification in German, after the first paragraph, saying that I need to enable JavaScript to use the site. After that is the full article, including images, which would be almost perfectly readable, except it's at 5% opacity (or maybe the JavaScript popup is overlaid at 95% opacity), which makes it impossible to read. :'(


The article says it needs 5.1 GB of graphics RAM.

Does anyone know how much data download and disk storage it needs?


The v1.3 model weighs in at 4.3 GB. There's an additional download of 1.6 GB of other models due to the use of Hugging Face's transformers (only once, on startup), and the conda env takes another 6 GB due to PyTorch and CUDA.

Larger images will require (much) more than 5.1 GB. In my case, a target resolution of 768x384 (landscape) with a batch size of 1 will max out my 12 GB card, an RTX 3080 Ti.
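The jump in memory with resolution is worse than linear, largely because of the self-attention in the UNet. A rough sketch of why, assuming SD's 8x-downsampling VAE (one attention "token" per latent position):

    def latent_tokens(width, height, factor=8):
        # SD denoises in a latent space downsampled 8x per side; attention
        # memory grows with the square of the number of latent positions.
        return (width // factor) * (height // factor)

    base = latent_tokens(512, 512)    # 4096 tokens
    mine = latent_tokens(768, 384)    # 4608 tokens
    print((mine / base) ** 2)         # ~1.27x the attention memory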


I think this is a good time to ask: is anyone working on parallelizing machine learning compute anymore? For at-home computation like this, it seems like it would be a lot better to let people stack a few cheaper GPUs rather than having to pony up thousands of dollars for ML-oriented beast cards to be able to do things like generate large images.


AI upscaling will solve everything ;)

I've generated some remarkably good-looking print-quality images by upscaling 512x512 sources.


What do you use for upscaling? Standard software like Photoshop or Affinity, etc. or more dedicated software? Any recommendations for options, etc.?


Recently impressed by https://replicate.com/nightmareai/latent-sr but otherwise - Cupscale


Here's a particularly impressive result I got from the former (considering it's not especially optimized for "vector art" enlargement): https://twitter.com/andybak/status/1558737805546749953?s=20&...


For videos in particular, if you don't mind shelling out cash, the go-to for AI animation nerds (at least according to various AI Discord servers I'm on) is currently the Topaz upscaler. There are free alternatives, but I've yet to see any of them work as well as Topaz, though I'm sure that will change soon. For interpolating frames, Flowframes is "free" (new features if you join the Patreon) and is IMO very good.

I've seen a number of 80s/90s VHS recordings of concerts being uploaded to YouTube in 4K (using Topaz) and they look like they were recorded that way, truly amazing. I do hear it can be a bit of work though getting the settings right.


If you read directly from the site, the requirement for the graphics card is 10 GB of VRAM as a minimum. Because it runs locally, you don't need to download anything apart from the initial model; the same applies to the disk space.


Does this work on Apple silicon processors? They have plenty of RAM accessible to the GPU.


The article says it will, but that it does not use the GPU, unfortunately.


Has anyone made a pixel art generator that can create animation sprites?


@KaliYuga did - she got hired by StabilityAI just a few days ago. Here is a link to the Pixel Art Diffusion notebook:

https://colab.research.google.com/github/KaliYuga-ai/Pixel-A...


Check out NUWA-Infinity[0][1], submitted to arXiv Jul 20, 2022. It captures artistic style very well (though I can't speak to the quality of the pixel art it would generate) and can do image-to-video.

[0] https://nuwa-infinity.microsoft.com/#/ [1] https://arxiv.org/abs/2207.09814


You can use DALL-E and other models to make pixel art ("as pixel art"), although it can both be overkill and hard to get consistent results that you'd put into animation. I'm guessing that starting from more of a video model and then converting to pixel art could be better. Although it's also non-trivial to turn "realistic" video into convincing animation.


I'd pay good money for a specialized machine learning algorithm that can take a pixel art character and then generate all the animated sprites for it.

I actually tried to get Dall-E to do this, and it made like three good sprites and the rest were just broken. But it was so strange, because you could see it was still organized as a sprite sheet; it's just that the sprites were useless.

I think the practical applications of this technology will be hyper specialized models for specific purposes.


Hi there, we're working on this - been working on a model for months now. Hope to release something soon. How best to get in touch with you?


That's a really good idea.


This is exactly the type of application I am interested in as well. As a hobby game dev with only mediocre pixel art skills, having a generator to finish the busy work would be an absolute lifesaver. I'm also interested in using it for fleshing out artistic vision through generating variations of an initial concept.

Hopefully we aren't more than a few years away from something practical like this.


This article says both that it's "open-source" and that it's "available for [only] research purposes upon request". These can't both be correct. Where is the error?


They jumped the gun with this announcement. I get wanting to share the excitement of AI doing something cool with the world (I've been there) but they should've waited until it's accessible to the public.


The code is open source, the models are not.


My friend wants to know when she can use this to generate porn. Are we close?


I did a Show HN about this https://news.ycombinator.com/item?id=31900095 a month ago, to experiment with the technology. The training was done in a weekend only, with two old GPUs (1080 Ti).

Currently waiting to scale up to improve quality, mainly for economic reasons; I'm not quite sure I could recoup the training costs yet. Even more so if I go with cloud training.

Nvidia will release the 4090 in September, and Ethereum may do "the merge", which will make GPUs useless for mining, so GPU prices could come down and I can update my home cluster with affordable 3090s. (But electricity prices are also up.)

Also, there are new algorithms every month, like Stable Diffusion, that would obsolete your previous training.

The video generation cost is probably still too expensive compared to just paying a cam girl in a low wage country. But it will probably go down soon.

This is also sensitive data, plagued with copyright issues, so it's quite troublesome to legally share training datasets to share costs.

It also has its own challenges with respect to custom dataset creation with text description, so it's probably a better idea to adapt the algorithm to the currently available data to keep the costs low.

Finally once someone releases a model, in the next month there will be at least 3 clones.

There is also the problem to find an adult friendly payment processor.

And the multitude of potential legal issues.

But it's probably inevitable.


The data used to train those models is specifically filtered to remove sexual content, so the model can't generate porn, because it has no idea what it looks like beyond the few samples that made it past the filter.

So no, your "friend" can't use it for that.


Why is it that sexual content is so frowned upon in this space? If it were a content publishing platform, I would understand that advertisers don't want that, but this is literally dictating to people what is bad and good. I just don't understand this Puritan outrage at text-to-image porn generation.


Because you can't control what the model is going to output in response to a query. The model is trained to respond in a way that is aligned but there is no guarantee.

Since we certainly don't want to show generated image of porn or violence to someone that didn't specifically ask for that, the easiest way to ensure that's not going to happen is to just not train on that kind of data in the first place. The worst that can happen with a model trained on "safe" images is that the image is irrelevant or makes no sense, meaning you could deploy systems with no human curator on the other end, and nothing bad is going to happen. You lose that ability as soon as you integrate porn.

Also with techniques like in-painting, the potential for misuse of a model trained on porn/violence would be pretty terrifying.

So the benefits of training on porn seem very small compared to the inconvenience. I don't think it has anything to do with puritanism; it's just that if I am the one putting dollars and time into training such a model, I am certainly not going to take on the added complexity and implications of dealing with porn just to let a few people realize their fetishes, at the risk of my entire model being undeployable because it's outputting too much porn or violence.


> porn/violence would be pretty terrifying.

Uh, have you seen American/European mainstream pornography? It's already pretty violent (e.g. face slapping, choking, kicking, extreme BDSM).

I just don't see why this stuff is allowed and protected by the law (if it's not recorded and published it's illegal) and then we are suddenly concerned about what text can do.

Just one of the many double standards I see in Western society.


> uh have you seen American/European mainstream pornography? it's already pretty violent

That's not at all what I am talking about. What I am saying is that such a model would give everyone the ability to create extremely realistic fake images of someone else within a sexual/violent context, in one click, thanks to inpainting. This can become a hate/blackmail machine very fast.

Even though Dalle-2 is not trained on violence/porn it still forbids inpainting pictures with realistic faces that have been uploaded by users to prevent abuse, so now imagine the potential with a model trained on porn/violence.

Someone is eventually going to do it, but back to your initial question about why it's still not done yet, I believe it's because most people would rather not be that someone.


One example risk is someone using computer-generated content to extort money, demand ransom, etc. The cheaper and easier this becomes, the more likely it is to be weaponized at scale.


But wouldn't the ability to auto-generate blackmail material mean the value of blackmail would fall? Just from a supply and demand perspective, it makes sense to me that deepfaked kompromat would put a serious discount on such material, especially if everybody knows it could have been generated by an AI.

Someone like Trump would just shrug and say the pee tapes are deepfaked. I don't think it's possible for AI to bypass forensics either. So again, this narrative that "deepfake blackmail" would be dangerous makes no sense.


I think it's less for Trump level people and more for basic scams. Imagine just automating a Facebook bot to take someone's picture, edit it into a compromising scene, and message them saying you'll share it with their friends if they don't send you some Bitcoin. This gives you a scalable blackmail tool.

Of course, after a while it'll probably stop working, but there will be a period of time where it can be done profitably and a longer period where it will be obnoxious.

And, of course, you could probably always use the tool to scare children, who, even in the future, might not know that everyone would shrug off the generated pictures.


Seems like cryptocurrency is the problem.


You mean like gift cards are used in the same manner?


> The cheaper and easier this becomes, the more likely it is to be weaponized at scale.

...and the more people will be aware of and stop believing in the "fake reality".

Ensuring this technology is only available to a tiny subset of the population is to essentially give all the power of distorting reality to that tiny group of people.

In fact, I suspect that is precisely the reason.


Because it's a lot more annoying for your innocuous content to be rendered as porn when the AI happens to interpret it that way than it is for you to be unable to render your pervy desires intentionally.

A porn model should really be its own thing.


I imagine a large part of it is that it could generate photorealistic child porn (also "deepfake" porn of real people) and there's not really a good way to prevent it entirely while also allowing generalized sexual content AFAIK. There's probably some debate on how big a problem this really is, but no one wants their system to be the one with news stories about how it's popular with pedophiles. It was the issue they had with AI Dungeon.


Correct me if I'm wrong, but in many countries even simulated child porn is illegal. A model spitting that out could be legally problematic.


Do they remove certain political or religious ideas which are considered illegal somewhere as well?


Craiyon certainly doesn't.


Because if the model generates anything problematic the New York Times will ruin your life.


I'd guess that, for general purpose companies, it's an area full of legal ambiguity and potential for media outrage, so just not worth the risk. However, given the evidence of human history, it's certain that someone with an appetite for exploiting this niche will develop exactly that kind of tool.


Because the law makes it very difficult to provide such services in the spirit of preventing the exploitation of minors.

Make no mistake, this is indirectly a legal hurdle.


This article hints that Stable Diffusion can at least generate normal looking nude women: https://techcrunch.com/2022/08/12/a-startup-wants-to-democra...

There are attempts to gather porn images and train or fine-tune existing networks on it, here's a recent attempt by an art student mentioned in the article above (NSFW!!): https://www.vice.com/en/article/m7ggqq/this-furry-porn-ai-ge...


Jesus H Christ those are some seriously cursed hindquarters


I was a better person this morning for not knowing that furries had the term "hindquarters". I mean, that's fine for other people, you do you, but for me, I was better this morning.


You'll have to train it on your own data. As others have mentioned, the training data for Dall-E, Stable Diffusion, etc. was cleaned prior to training.

However, if it is possible to restart the training process from the weights of a non-sexually-aware model, this finetuning might not take all that long!
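In PyTorch terms, that warm start is just loading the released checkpoint and continuing training at a low learning rate on your own data. A generic sketch - the model class, checkpoint path, training_step method, and data loader below are all hypothetical placeholders, not any project's real API:

    import torch

    model = DiffusionUNet()                # hypothetical model class
    ckpt = torch.load("base.ckpt", map_location="cpu")  # released weights
    model.load_state_dict(ckpt["state_dict"], strict=False)
    model.train().cuda()

    # Low learning rate: nudge the pretrained weights, don't wreck them.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for images, captions in finetune_loader:   # hypothetical loader over your data
        loss = model.training_step(images.cuda(), captions)  # hypothetical method
        opt.zero_grad()
        loss.backward()
        opt.step()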


Is there a way to make money selling the model to people who want to use it to make porn? If so, it will trickle down relatively quickly. If not, it'll still eventually trickle down but will take longer.


Surely you could sell custom prompt runs for porn for a great deal more than OpenAI is charging for generalist custom prompts.

Making money at it should be easy, and places like PornHub wouldn't care about any outrage. The real challenge would be limiting criminal and civil liability, at least to my not-in-the-business thinking.


This is already a thing in the kpop fake porn industry. I don't know how Patreon/OnlyFans are allowing this to happen; I mean, it's a travesty that highly suggestive lyrics and stripper dance moves by scantily clad kpop idols are being used for sexual gratification.


Not their fault their country banned actual porn.


Once training can be done on a beefy home rig folks will be all over it.


You'll need to train your own model, though I'm sure if someone manages to crowdsource it there's a very obvious economic incentive.


Props for just coming out and saying it


It is possible to generate nudes, but not pornographic ones.


Wait, so the closed source generator known as DALL-E is owned by a company called OpenAI?


It's a bit of a dead horse at this point but yes. See the previous discussion: https://news.ycombinator.com/item?id=28416997


As Elon said "OpenAI should be more Open IMO"


Curiously it seemed to lock down even more after they "partnered" with Microsoft.


This is pretty amazing. Anyone have any tips on building a PC for machine learning with a RAID device?


It hasn't been updated since 2020, but Tim Dettmers's guide [0] is pretty much the gold standard for optimizing what to buy for whichever area of DL/ML you're interested in. The pricing has changed thanks to GPU prices coming back down to earth a bit, but what to look out for and how much RAM you need for which task hasn't. Check out the "TL;DR advice" section, then scroll back up for detailed info on the why and common misconceptions. For tips on a RAID/NAS setup alongside it, just head to the datahoarders subreddit and their FAQ.

[0] https://timdettmers.com/2020/09/07/which-gpu-for-deep-learni...


Look into building an ethereum mining machine... it can double as an ML workstation. That's what I did.


If you just want to try it out, consider using a remote CAD workstation from a company like Paperspace.

(No affiliation.)


Doesn't build on my Mac Studio due to a dependency whose Mac version is two major versions behind.


Unfortunately it's a commercial license and the model isn't available to the public so it isn't very useful.


It's going to be MIT from what I have heard. On phone atm so can't provide sources.



That’s just a restricted interim release. The proper public release isn’t ready yet. No timescale but sounds like days/weeks rather than months/years.


like OpenAI?


Not sure I follow. OpenAI are not claiming they are going to release their model at all. The team behind Stable Diffusion have currently kept every promise they've made.

(And if you're insinuating something, just come out and say it so people can engage appropriately)


I think he meant: are they releasing a form like OpenAI in the interim release? Someone's a bit aggro.


Yeah. Too long on Reddit...

Still not sure I understand. It's already available in two forms: a Discord bot for a wide group of beta users, and the "researchers-only" source release.

OpenAI only have a paid SaaS version of Dall-E


Isn't that just temporary until the public release? Or is the article misleading by calling it open source?


It would be a blast if the cloud were upended by RISC and GPUs powerful enough to crunch "big data" at home.

Would love to see FAANG and SV crash and burn, margins chipped away to nothing.


We are heading into uncharted territory :(



