Imagen: An AI system that creates photorealistic images from input text (research.google)
293 points by fagnerbrack on Aug 25, 2022 | 226 comments



Imagen, a text-to-image diffusion model - https://news.ycombinator.com/item?id=31484562 - May 2022 (634 comments)


Every single time I see an article about a new AI model that has a section called "societal impact" I know immediately they are not releasing the model, the training set, nothing...

It seems to be the kind of bullshit statement that those companies put in place of "we paid $500k training this model and we're not giving it for free to anyone".


Fortunately we have Stability.AI, and they have already released their image generation model. Hopefully they'll follow with other projects too.

https://stability.ai/blog/stable-diffusion-public-release


There's also a fork that requires a lot less VRAM. I was able to get this working with an Nvidia GTX 1070.

https://github.com/basujindal/stable-diffusion

You'd clone the fork, then download Stability AI's checkpoint, sd-v1-4.ckpt, from https://huggingface.co/CompVis/stable-diffusion-v-1-4-origin...

Follow the instructions in the forked repo, and you should be good to go in a matter of minutes.


The only issue I had was WSL 2 was not using my GPU, so obviously it failed. Couldn't get it working on Windows either, failed at installing dependencies. I'm thinking of dual-booting Linux just to try out stable-diffusion.

I normally dev on Linux but don't do GPU work, only my Windows gaming PC has the horsepower to run stable-diffusion and unfortunately I couldn't get it working on Windows. Shame cause I'd love to play around with it! Does anyone have any tips?


Hm, what issues did you come across on Windows? I have it running on both a Windows PC and a Fedora PC. I uninstalled python, then installed the full installation of Anaconda. Then I followed the "Requirements" section of the readme and I was good to go.

Edit: Just noticed they removed the instructions from the readme for some reason. The requirements section can be found in this diff: https://github.com/basujindal/stable-diffusion/commit/487a0f...


Thanks for the input. I can return with error logs later, but essentially when doing the `conda create env -f environment.yaml` line it would error out partway through. It would download dependencies for a few minutes then fail on one. I tried multiple times and got the same error. I just assumed it was an issue of a dependency not properly supporting windows.

Are you saying you could run `conda create env -f environment.yaml` fine? This was my first conda install on Windows, so perhaps I missed something else.


Yeah, my Windows machine was able to run that command just fine.

Maybe one of the newer commits broke something? I think I cloned it when b91816301fa62df239a45336b381ee918ff52b2a was the latest, but I'll have to check when I get home.


Yes it works fine for me on Windows with conda.

    conda env create --name envname -f environment.yaml
Try reinstalling conda. Do you have a firewall on or anything like that?


Try out an Ubuntu live image. You can probably get set up with everything without even having to touch your HDD! I just don't know how the native Nvidia drivers work in that case, but I think you can get them installed and restart the window manager, even with a live CD.


That's a much better idea than finding another HD to install on. Thanks.


Same. I've spent a number of nights trying variations on the install instructions. Got the Windows CUDA drivers installed (older graphics card), but no matter what I try, conda or WSL, PyTorch refuses to see CUDA as available.


I was able to make it work on my GTX 1650 LP. Had to use full precision though, or else it generates green images. I also had to reduce the resolution to 320x320 and close every application that uses VRAM, because the GPU only has 4GB of VRAM. Took about 2 minutes to generate a single image with DDIM sampling steps set to 50.


Does it make images of lesser quality or does it just take longer for the same quality?


Anyone built a Docker image for this? I don't want to duplicate efforts.


Much respect to Stability - they followed through on their promise.


It's probably a lot more than that, $500k is like one Google ML engineer's salary.


Right? I feel like novel development of these models looks more like GPT's $10-20M numbers.


>Every single time I see an article about a new AI model that has a section called "societal impact" I know immediately they are not releasing the model, the training set, nothing...

This section is just a requirement for some of the big ML conferences, like NeurIPS.


And especially so because this looks like a straight up marketing website. This is all about their own product.


It’s not new, it was created shortly after dall-e. To my knowledge, there is no product atm. I’d pay to get access to this api, but you can’t.


I for one am glad that, for once in my life, an obviously huge advancement is taking into account the human impact of releasing the technology responsibly. Maybe it’s just a “BS statement” but given the major strides they’re making at removing racial/gender biases[1] from similar projects, I don’t think it’s just hot air. Especially given the phenomenon of “bias amplification”.

Maybe I’m being too optimistic but either way, given the pace of progress we won’t have to wait very long to play with this magic.

[1] https://openai.com/blog/reducing-bias-and-improving-safety-i...


>the major strides they’re making at removing racial/gender biases

They're literally just appending a race/gender string at the end [1]. In what world is that not just hot air?

[1] https://twitter.com/jd_pressman/status/1549523790060605440


That’s a pretty uncharitable interpretation of my post. You shared an example of a single mitigation that you personally find ludicrous (without explaining why). And then I’m supposed to throw up my hands and go “I guess it’s pointless to try and be less racist”?

Bias amplification is a real issue. https://www.theverge.com/2016/3/24/11297050/tay-microsoft-ch...

Tay might still be around if Microsoft gave a thought to potential issues before release. I’d prefer not to have this awesome technology tainted out of the gate as a tool for racists and pornographers. They’ll get their hands on it eventually but it’d be nice if they don’t get all the up-front press.


> a single mitigation

That is the only mitigation used in DALL-E 2, which up until recently was the only publicly available text to image model.

> I’d prefer not to have this awesome technology tainted out of the gate as a tool for racists and pornographers

Why is it your business what people do with the model? If people want to be racist they can already do so, they don't need a shitty model that doesn't work half as well as paying some guy in the third world $2/h to shitpost online. And I don't see the problem with pornography.


I just voiced my support for their decision. Just as you are voicing your displeasure with it. I don't see how we are different.


Everything has a “bias” including reality. This type of talk is not about removing biases, it is about imposing biases in line with the wokeligion and attempting to manipulate public view away from the facts of reality. But reality always wins in the end.


Funny how it stopped mattering when OpenAI decided to charge for it.


right, because I can't be trusted to not be racist


Indeed, you can't be trusted not to do anything bad with it. Who's going to vet each and every user? Who's going to check every time it's used for the coming 10, 20, 30 years?


If he's racist he can already hire black people to take a photo for an "upcoming action movie" called evil baby. They will be asked to hold guns, aiming them at a crib. Then he can release it on the internet and say he saw 3 black people about to shoot a baby.

It would be called out as fake or staged, just like an Imagen/DALL-E 2/OpenAI image would be called out as fake. The thing that makes stories real is real people: actual events and backed-up testimonies.


If he wants a picture of a dog with sunglasses in a boat, he can hire a dog, a handler, and a boat. He doesn't need this model for anything.

> The thing that makes stories real

Since when do the hordes care about stories being real? People get harassed over photoshopped images. People get killed over false rumors. These tools make it easier to trigger such reactions.


>hire black people to take a photo for an "upcoming action movie" called evil baby. They will be asked to hold guns aiming it at a crib.

As a side note, I'd watch that. I imagine the next scene shows a bloody room and a baby escaping.


If someone is racist to me online, I'll just turn off my screen


> “While a subset of our training data was filtered to removed noise and undesirable content, such as pornographic imagery and toxic language, we also utilized LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.”


Of course that training set was scraped off of everyone else's web sites, which is fair use. But the model is copyrighted, and I bet you'll hear an argument that its outputs are copyrighted.

The only AI ethics I’m worried about are the lack of anything ethical going on w.r.t intellectual property in AI.


Yes, this is how fair use works. Once copyright law agrees that you were in the right to make something, you have exclusive ownership over it; it does not get locked open like GPL software does.

Scraping other people's content is already established fair use thanks to Google prevailing against the Author's Guild in front of SCOTUS. I find it difficult to understand how it could be illegal to scrape a bunch of creative works to create a system that generates new works, but legal to scrape a bunch of creative works to let people search two-page excerpts out of them.

Output copyright is very much up in the air. The Copyright Office has rejected copyright registrations claiming the software itself created the work; but presumably this wouldn't apply to a human taking ownership over something they used ML to create. The amount of prompt engineering you have to do to make these kinds of systems alone would count as some kind of creativity. The only real complaint I could see is if the system regurgitated its training data, which would be bog-standard copyright infringement.

Also, related note: I really hate how the whole AI thing is making the FSF sound like the Author's Guild did a decade and change ago. The law is already very clear that you cannot launder an infringement (copying GPL code) through a fair use (trained ML weights). Please do not adopt the arguments of copyright maximalists.


They will still use it and make a lot of money from it, I guarantee it.

Google is in the business of ads, this might have useful applications in the ad space.


It feels like the wind has shifted because of Stability AI, and now companies doing this are going to look like those that have “call to get a demo” links on their homepage in a SaaS world.


in the case of imagen, I suppose the cost is at least two orders of magnitude over 500k.


... no it definitely wasn't. that's $50m. read the paper, they tell you how long it took on a v4-256, which you know the public rental price for.


Pretty much everybody who trains models trains tens to hundreds of models before their final run. But a better way to think about this is: Google spent money to develop the TPUv4, then paid money to have a bunch built, and they're sitting in data centers, either consuming electricity or not. Google has clearly decided the value of these results exceeds the amortized cost of running the TPU fleet (which is an enormous expense).


While I don't agree with gp's price estimate of $50M, you are also forgetting that to train the final model, they had to iterate on the model for research and development. This imposes a significant multiple over the cost to just train the final model.


they're talking about the total cost of the project, including the salaries of the humans, their health care, their office space, etc.


And where did you think the tagged dataset and software came from?


Marginal cost of using that is basically $0. The internal dataset is a sunk cost that's already been paid for (presumably for their other, revenue generating products like Google Images). Half of their dataset is a publicly available one.


One order of magnitude over $500k is $5M; two orders is $50M.


Is Google going to share or do something other than better advertising with this? What do they plan to do with it? I'm frankly tired of these show-off blog posts by Google that neither make it into general hands nor are used for anything positive.

This is just a link to their previous posted site btw, nothing new.


My guess is they'll use Imagen and LaMDA to build a "conversational" search experience of some kind. So, instead of providing a list of websites to go to, they'll synthesize an answer to the search query, with imagery to go along with it, and so on.


I don't know about the market at large, but I do not want that. I want a search engine to just have a huge database of websites and look up stuff in that database based on my query, spitting out a link to the page that matches best, then the one that matches second best, etc.

Using ML to determine the order of matches is absolutely fine, but to "digest" the internet and cook up an answer to what I'm looking for without proper sourcing? I do not want that. I don't want to try and guess what biases the language model might have. It's way easier for me to gauge the bias of another human, and for that, I need to be sent to a page a human has written.

(Of course, I realize that the "blog written by a bot" genre of writing is also becoming more convincing, making this whole thing harder...)


Yah I would love someone to convince me this isn't like touchscreen interfaces becoming prevalent in cars over physical control surfaces. It is sold as technological advancement that is good for everyone but in reality it is just corporations figuring out how to make something worse for cheaper and force consumers into it.

I already feel like Google search sucks. It is partially Google's fault, partially the internet's fault. The best source of information was and will always be forums full of knowledgeable people having genuine conversations, and everybody moved to Discord because... because... um, I don't know. And besides, Google doesn't get paid to elevate tiny-ass forum websites in the search results.


Plans to build conversational agents always seem to ignore Moravec’s paradox. The hard bit of building such an agent is in building a tolerably human-like conversational partner, not in searching and sorting data.

It’s like the “AI” projects that spend all their time building a fancy sci-fi-looking robot body, ignoring the fact that how much of a “person” a robot seems has almost nothing to do with how physically anthropomorphic it is. Johnny 5 and Wall-e are proof enough that the mind, not the geometry of the body, is what’s important.



These tools are amazing for prototyping. I had an idea for a promotional poster, and seeing my idea just by writing it felt like magic. The generated image had too many artifacts to use, but gave me a guideline to follow when creating the real thing in Pixlr.

AI content generation (text, image, source code, video, music) will be a huge boon for prototyping where applied judiciously.


Google hasn't released squat.

Google's product is vaporware and we shouldn't afford them any airtime until they release something usable. They're just trying to butt in and get press off of the backs of the teams actually working in the open, and that's super lame.

Release your model, Google, or stop bragging and talking over the others here. You're greedily sucking oxygen out of the conversation, and as a trillion dollar monopoly you don't deserve anything for free off of the backs of others. Not when you're not contributing. Stop being the rich kid talking over everyone else about how awesome your toys are.

Anyhow, the real story is Stable Diffusion. They're actively demonstrating the correct way to run this as opposed to the entirely closed OpenAI DALL-E or the (again vaporware) Google non-product.

Even MidJourney uses Stable Diffusion under the hood, using sophisticated prompt engineering to make their product distinct and powerful.


I feel there's a strong argument to be made that these organizations should be required to release these models publicly. These are built on the works of the public at large, and the public should get the full benefit of them.

Whatever effort Google has put into building the model is infinitesimally small compared to the work of the creators they're harvesting.

I don't expect this to happen easily, if at all, but I'm strongly in favor of it, and would even support legislation to that effect.


They are afraid of being sued, because they are using all the images they have scraped from every website ever created. They are probably even using images that are not publicly available.


Well… Midjourney used Stable Diffusion (with an additional guidance model I believe, not just prompt engineering) for their beta model, which they've already closed down again… it's back to their old, far inferior model.


Why did they close it down?


The rumours are that it was too good at generating nudity for their comfort, and in particular that some users may have combined that with younger subjects.


I kind of get the sentiment about openness but I think it's way more nuanced than you are making out.

There are very good reasons for withholding SOTA models, primarily from the info hazard angle and avoiding escalating the capabilities race which is basically the biggest risk we have right now.

Google / Deepmind have actually made some good decisions to try and slow down the race (such as waiting to publish).


They're not slowing down anything. The cat's out of the bag.

What good does a few months lag do when nobody is bracing for impact?


I'm not saying they are doing a good enough job, but that doesn't mean their approach is entirely without merit.

Even ignoring the infohazard angle if they published everything immediately that would escalate the race. By sitting on their capabilities and waiting for others to publish (e.g. PaLM, Imagen vs GPT-3, DALL-E) they are at least only playing catch up.


Capabilities race, seriously? This is not nuclear warfare my guy. It's mathematics.


Nuclear warfare is much less concerning than misaligned AI.

Take a look into scaling laws and alignment concerns, this is a very real challenge and existential risk not some crackpot theory.


In the same sense that deep learning is just linear regression with a steroid problem.


Information warfare is pretty dangerous too!


Can you talk more about the prompt augmentation that Midjourney is doing behind the scenes? It's certainly true that you can put in a two-word phrase like "Time travelers" and get an amazing result back, which reveals just how much your prompt is getting dropped into a prompt soup that also gives it that Midjourney look by default.


Yep, I feel that exact way about Nvidia Canvas [1]. It does not produce anything even close to usable as a final product, but it can produce an amazing start to a concept.

[1] https://www.nvidia.com/en-us/studio/canvas/


wow, that's kinda insane and almost looks more fun for those of us short on words.


This was the first thing I tried with DALL-E. Took some photos of my house where I'm renovating, wiped out the construction debris and told it to fill it in with what I wanted.

It worked okay - one issue was DALL-E wants to keep "style" consistent so any stray bit of debris greatly affected the interpretation, but I did in fact get 1 design idea out of it which changed how I think we'll do a bit of it.

These things in many ways are just extremely enhanced search tools - "describe what you want to see"


> Creating the real thing in Pixlr

is it that good now? note that Pixlr was bought by Google


Pixlr was not bought by Google. So many people spreading lies on the internet :(


I like Pixlr, I use pixlr.com/e. It's free and web based and has always worked well for me.

Did that happen recently? It says on Pixlr's site they're owned by INMAGINE.


I'm surprised how little this forum and the public in general know or understand about the current capabilities and availability of these AI art generators.

You have midjourney.com, beta.dreamstudio.ai, craiyon.com (a real quick version, no fuss, low quality), creator.nightcafe.studio

and those are just some of the entry-level ones. I make ai art all day long! Check out my media feed on twitter @Sheilaaliens


> I make ai art all day long!

I feel like this is the epitome of modern "content creation"

Typing a few sentences into a piece of software you barely understand, said software shits 15 jpegs out, 2 are good, "hey I make art". What a sad state of affairs; tech is consuming everything and people are cheering, one more step on the path to being completely useless key pressers.


Photography is well known to have killed art.

Press a button on a box you barely understand and the camera shits out a pic. "Hey I make art". What a sad state of affairs; tech is consuming everything and people are cheering, one more step on the path to being completely useless key pressers.


You still have to decide what you take a pic of, that's the artistic part. You have to choose a subject, lights, &c. It's still incredibly more complex than "hey google draw me a horse with a tuxedo".

You can tell that right away seeing how many camera users never produce anything of quality. A 6-year-old can press the button of a camera but no 6-year-old will produce meaningful work. In the case of AI art tools you just need some basic English skills.


> "hey google draw me a horse with a tuxedo"

I think this is an oversimplification.

There was a recent write-up by a guy who used DALL-E to create the logo for his open source project. What was clear from that write-up is that it is still a process to get exactly the look that one is aiming to achieve. Even with an AI, there are different styles, decisions, choices, and visual representations that have to be made.

Your position that with photography, you have to "choose a subject, lights, &c" doesn't change with AI-generated artwork; one still has to describe the subject, color scheme, visual style, composition details, etc. for the AI to generate the image. Except that instead of composing a scene with makeup, props, and subjects, you do it textually.

I'd say that in some respect, it is far more "creative" than photography because it removes physical and real-world constraints from the artist which would otherwise require knowledge of CGI and digital tools.

> A 6-year-old can press the button of a camera but no 6-year-old will produce meaningful work

This is also true of even painting. Even a 6-year-old can grab a paintbrush and paint without producing meaningful work. So that does not change with AI-generated artwork. Yes, a 6-year-old can describe a scene to an AI that generates some image -- just as a 6-year-old can pick up a brush and apply paint to a canvas, but the likelihood of a 6-year-old presenting the seed/input that the AI needs to generate something unique and of visual interest/originality is low, just as it is with a paintbrush.


> Except that instead of composing a scene with makeup, props, and subjects, you do it textually.

One of the truly fascinating things with stable diffusion is that you can use a starting image. So you can start with a vague sketch to control the composition. It's quite incredible.
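
For anyone curious what that looks like in practice, here is a rough sketch using Hugging Face's diffusers library (the file names are made up, and exact class and argument names have changed between diffusers versions, so treat this as illustrative rather than authoritative):

    # Hedged sketch: img2img with Stable Diffusion via the diffusers library.
    # Argument names vary by version (e.g. "image" was "init_image" in older releases).
    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16,
    ).to("cuda")

    # A rough sketch you drew yourself, resized to the model's working resolution.
    init = Image.open("rough_sketch.png").convert("RGB").resize((512, 512))

    # strength controls how far the model may deviate from the starting image:
    # low values keep your composition, high values let the prompt dominate.
    result = pipe(
        prompt="a castle on a cliff at sunset, oil painting",
        image=init,
        strength=0.6,
        guidance_scale=7.5,
    ).images[0]
    result.save("out.png")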


You still have to pick styles, tweaks, lighting, subject. Then iterate, set composition, guide the results. You need to pick the right models, params and samplers to get the style you want just like choosing your film/camera.

> it's still incredibly more complex than "hey google draw me a horse with a tuxedo"

And just like taking a photo of a random horse in a field you'll mostly get a pretty bland result.

You can argue it's simpler to create art with, which is an odd complaint, but if you're not working to make something great you generally won't get it - just like photography is much simpler than painting as it's "just press a button".


>You still have to decide what you take a pic of, that's the artistic part.

You still have to select the 2 AI jpegs out of the 15, that's the artistic part.


Photography didn't replace drawing and painting.

Also if you are doing something interesting with digital photography it is definitely not just pressing a button.


Exactly, this won't replace all other forms of art and if you're doing the "just press the button" equivalent you won't get things that are that great.

At worst the complaint seems to be "with care you can more easily create good art" which is a very odd complaint.


Also true


It's only sad if you think art or humans sacrificing their time has some kind of fundamental moral value rather than being just another type of product. As a materialist I don't really subscribe to that. I see work as a negative thing that society progresses to reduce.

The keypressing is clearly, objectively, not useless. It is useful, just less costly and more accessible to way more people. I'm generally happy with this.


I’m not crazy about the tone you’re striking, but the bit about “software you barely understand” does ring true. Lots of folks who think hitting play in a notebook makes them some sort of AI-art-engineer these days.

On the other hand, the whole point of automating creation is that you don’t need to understand the underlying mechanisms.


It's division of labour. The toolmaker has a different set of skills to the user of the tools.


The user, me included, has less and less knowledge and skills though.

When your tool becomes a megacorp-owned, subscription-based service you barely understand, is it really a tool? It doesn't produce art in the way a brush and paint produce art; you barely have any control over what it does. You just become an image filter: you press a few keys, look at the image for 5 seconds, and decide if it triggers the right part of your brain or not.

It's not so different from the artist shitting on a canvas, but at least he had the creativity to do something new and daring.

The end result might be called art but the whole process is completely devoid of what makes art "art"


> you barely have any control on what it does,

You have quite a lot of control, from style and colour to composition, by feeding in initial images. You can do this iteratively, selecting parts you want to keep and others you want to adjust.

Photography hasn't killed art, yet you can describe it as "point at something and press a button". Sure you'll get something out of it, and it might be alright. But the great outputs take more work and care, just like with the AI art now.


I can sort of see the "barely understand" part being relevant if I squint. Really don't know how "megacorp subscription" relates to a tool's toolhood.


> Really don't know how "megacorp subscription" relates to a tool's toolhood.

Owning the tool vs being owned by the tool, yadda yadda..


Oh ok. That makes some sense.


Time spent designing a tool is time spent not using the tool to make art, and honestly the types of brains that are good at designing art tools... often have very formulaic and structurally rigid ways of thinking about creating art (they tend to use tools for what they are intended for, which almost no breakthrough in art has come from). It is a good thing if artists can use tools without having to understand the nitty gritty.


Somewhat related, but that made me think of pioneers, settlers and town planners:

https://medium.com/org-hacking/pioneers-settlers-town-planne...


> software shits 15 jpegs out, 2 are good

Not really. This very much depends on the prompt. Just look at these; I made them yesterday in a batch run. Same prompts, different seeds. This is all it made on the seeds it chose.

https://imgur.com/a/NGSvK48


Is anyone aware of people doing the same for NSFW images? They can easily wipe out an entire industry.

No models to pay and to check for legal ages, infinite possibilities: just write the pic you want, the massive body part you want, how many genitals are involved and boom. You have your image.


[0] Was front page recently.

[0] (OBVIOUSLY NSFW) https://news.ycombinator.com/item?id=32572770


This certainly raises the question: when does DALL-E 2 become viable for generating movie clips of up to 10 minutes in length?

10 min * 60 s/min * 24 FPS = 14,400 images ≈ 2^14

So maybe 10 years?


Looks like there's a lot of road ahead for this, but it's definitely a starting point.


Thanks!


I agree, it does seem like this makes a lot of sense for the adult industry. But seemingly all of the models that get released have a built in censor for that sort of thing from what I understand.


[flagged]


Andrew Huberman - How to Cherry Pick Neuroscience "Studies" to Pretend Something is True So You Can Make Lots of Money as a Health Guru Personality


I've listened to a number of his podcasts and I've never gotten that impression at all. He comes across as a well-versed, legit scientist. And there is lots of other evidence to back that up, not just my impression from a few podcasts: publications, a tenured professorship at Stanford, etc.

But in this case, does it really need extensive scientific study? Just think about it and look around.

The thing our entire biological systems are largely oriented around, reproduction, and human connection/contact.

And then we have this super-stimulus version of that always on tap at a moment's notice, but it is a trap, a mirage.

It might be best to minimize exposure to that since in the end it doesn’t/cannot produce the same results.

I've not even listened to that link yet; I just caught a glimpse of it the other day. I think it's just starting to seem obvious to more and more people. The evidence seems to accrue daily as to the social detriment it causes.

Are there any specific examples of cherry picking you have?


Almost all of his diet/neuroscience/health claims (outside of commonly known things, like "exercise is good for you") are at best a gray area, and at worst just flat wrong. That is because we really don't know much about the brain and its relationship to human experience and behavior - and so anyone claiming to know how to improve your life with "neuroscience" is a con artist. Same with diet, which he also harps on constantly.

For almost every non-obvious claim he makes about something, there exist studies that directly contradict him. And, being a good con artist, he conveniently ignores them. Here are some things he has claimed that are not at all settled science, although he talks as if they are and leaves out the contradictions:

- Artificial sweeteners cause insulin resistance

- Light alcohol consumption is detrimental

- You can "hack your brain chemicals" with supplements to achieve some desired effect on your mind

- We know what role neuromodulators play on really high level cognitive concepts like "creativity"

- Increasing the diversity of your gut microbiome has positive health effects

Frankly, no one's life will be any different from listening to Huberman or from doing anything he tells you to do, aside from those things everyone already knows about (like exercising, sleeping an appropriate amount, and not eating a bunch of refined sugar). If there IS some change outside of those, it is just as likely to be the result of random chance as of taking Huberman's advice, because there is no evidence that anything he says (outside of the obvious) is actually beneficial or worthwhile.

To your question - yes...it absolutely DOES need a scientific study. Huberman's claim is porn literally DESTROYS YOUR BRAIN. But again, the studies are terrible and there are many contradicting studies that he would never point out - he's trying to make you think he knows what he's talking about and has something worthwhile to say (when he actually doesn't).


Is there something personal here as well? “Con artist” seems like a strong choice of words.

It’s interesting that I don’t really see that at all.

Maybe you’re a more frequent listener, I’ve just listened to a few where he is a guest and a few more than that of his solo episodes on his podcast which I particularly like because they just come across as a professor giving an extended lecture.

I don’t get any hype man, wannabe guru, snake oil vibes at all. He does use some of the Tim Ferriss type verbiage of like, optimize, and what not. (Not saying Ferriss is snake oil, btw) But I’d say that’s just par for the course in that genre of podcasting, it never seems to impede on valid information. He speaks like a scientist.

I’d imagine he’d be the first to say we know very little about the brain, but what do we mean by that? Things like qualia and consciousness, sure. But that doesn’t mean we can’t begin to parse out some things, like the things in his wheelhouse that he talks about, the effects of light exposure, positioning of the eyes, different types of breathing, etc.

It’s often easier to tear down than put up. Do you or anybody else you know do better at avoiding the criticisms you have here about the communication of research?

To some points: I've heard him talk about chemicals affecting the brain, which is a given, right? Chemicals affect the brain in various ways. I've also heard him emphasize extreme caution and prudence in experimenting with chemicals at levels outside those one typically encounters day to day, and often outright discourage it.

> Light alcohol consumption is detrimental

I'm no expert here, but it is easy to imagine wishful thinking leading to the pro column.

I do notice marked difference in quality of sleep with even light consumption.

Do the pros outweigh the cons? Best left to a personal calculus I suppose.

> You can "hack your brain chemicals" with supplements to achieve some desired effect on your mind

Is this a controversial statement? Have you conversed with anyone who has had a cup of coffee, smoked a joint, or ingested psilocybin or LSD?

The way he talks about the dopamine system, and how too much redlining can throw the system out of whack, seems well in line with what one observes in many other systems, biological and otherwise.

My life has been changed, as everything changes your life to some degree, however big or small. I've gained knowledge condensed from someone else's life's work, in the same way that occurs when reading a book.

This is a “force multiplier”(genre term) I should remind myself of more often.

I should spend more time reading books.


I speak from a place of compassion. Let’s make it past the Great Filter, our descendants at least, and not wank ourselves to death.


i only have a passing curiosity in these projects personally.

can someone in the field explain why this has exploded recently? there seems to be a lot of these tools released lately (text to image). was there a major breakthrough? a new idea that pushed everyone forward? a recent sharing of talent between groups?

edit: just another thought, are they just being posted to HN now? I don't see a date on the page for when it was released. I also don't know the general term to search for to find a list of all of these and their release dates.


The previous models were either 1. Limited in their capacity to create something that looked very cool, or 2. Gigantic models that needed clusters of GPUs and lots of infrastructure to generate a single image.

One major thing that happened recently (2ish weeks ago) was the release of an algorithm (with weights) called stable diffusion, which runs on consumer grade hardware and requires about 8GB of GPU RAM to generate something that looks cool. This has opened up usage of these models for a lot of people.

example outputs with prompts for the curious: https://lexica.art/


Is Lexica finding results previously computed? Or generating them? I could only work with very simple queries like "photo of a cat".


It's just a database of submitted works I think. You can try scrolling down on the opening page to see random prompts and outputs.


It's ~1.5 million entries inputted by users during the beta period on Discord.


There are a lot of prompts and results that aren't being included. Not sure what the criteria were.


The first major projects were OpenAI's DALL-E, then DALL-E 2 a year later. DALL-E 2 was much much better. After that a few new projects have been released in rapid succession including the open source stable diffusion.

Here are some of the projects on GitHub: https://github.com/topics/text-to-image

Another good source is https://paperswithcode.com/task/text-to-image-generation


Like anyone deep in a field, I know maybe several thousand people who could probably give a better answer, but I figure I'll give an effort to provide one since I don't see any good ones posted yet.

The moment everyone knew this was going to be big was in 2019 when StyleGAN came out. They used a lot of tricks, like aligning face features (such as eyes), and had all their pictures from a single domain (the most famous being faces), but nonetheless that was the moment everyone in the AI field realized this was going to be big, and so three years ago a lot of big people shifted to this line of research.

The four main innovations since then have been:

1. Transformers

Generalized computation kernels which allow a model to consider non-localised relationships between the pixels of an image. Released in 2017, and originally used for language.

2. Pixel Patch Encodings

Different resolution semantic and geometric image information encodings which allow for better representations of relationships between image areas than pixels are able to achieve given the same compute. Allows using Transformers on high resolution images.

3. CLIP

Contrastive Language-Image Pre-training. Before, the only way we knew to classify an image was as a "face" or "cat" or "ramen". When the genius idea of labeling images as semantically meaningful vectors rather than one-hot encoded classes was revealed, it changed everything in computer vision very quickly, and problems that used to be hard became trivial. Released in 2021.

4. Diffusion Models

GANs penalise you for making an image which does not seem to be part of an existing dataset. This encourages one to make the worst-quality image that still looks like a member of that dataset. Diffusion learns to denoise an image; removing noise is perceptually similar to increasing resolution, and people like images that look that way. People with better intuition about diffusion models may be able to add more on why they're superior (a minimal sketch of the denoising objective follows at the end of this comment). I've read all the papers leading up to the latest unCLIP (Dalle2) but it's complicated. Released in 2020, with major improvements to the training process continuously being made since then.

Hope this was helpful. All of the above were only implemented for images in any real way in the last three years. Putting them all together is something many people only just this year did, resulting in DallE, Stable Diffusion, and Imagen.

I'm working on doing this for 3D and later for use cases in AR. 3D generation still hasn't been cracked the same way image generation has, but the above will likely contribute to the solution to that as well. Anyone who's interested in working on that, feel free to message me.
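
Since the diffusion point above is the least intuitive, here is a minimal, self-contained sketch of the standard noise-prediction training step (my own illustration of the DDPM-style objective, not code from any of the systems discussed; `model` is a placeholder for whatever denoising network you use):

    # Minimal sketch of the DDPM training objective: the network learns to predict
    # the noise that was mixed into a clean image at a randomly chosen timestep.
    import torch
    import torch.nn.functional as F

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)              # noise schedule
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    def diffusion_loss(model, x0):
        """x0: batch of clean images, shape (B, C, H, W)."""
        b = x0.shape[0]
        t = torch.randint(0, T, (b,), device=x0.device)        # random timesteps
        noise = torch.randn_like(x0)
        a_bar = alphas_cumprod.to(x0.device)[t].view(b, 1, 1, 1)
        # Closed-form forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
        x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
        # The model is trained to recover the noise from the noisy image and timestep.
        return F.mse_loss(model(x_t, t), noise)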


> I've read all the papers leading up to the latest unCLIP (Dalle2) but it's complicated. Released in 2020, with major improvements to the training process continuously being made since then.

The models behind Imagen and StableDiffusion are actually simpler than DALLE2, and both are higher quality (SD of course isn’t always since it’s much smaller). That suggests DALLE3 will also be simpler again.

There’s also been very recent work with generalized diffusion models (that use problems other than noise removal and still work) and Google researchers have been tweeting results from a merged Imagen/Parti in the last few days.


Thanks for answering. Since you mentioned your work on text-to-3D, what are the ways to enhance the image/3D model to actually be photo- (or rather reality-) realistic? Even the (presumably) hand-picked examples from Google on the linked page lack the support bars of the sunglasses, and include floating cups of wine with a base-less Eiffel Tower in the background.

P.S. It seems raccoons are unimaginable (even for AI) with any sunglasses: if photo-realistic mode is selected for a raccoon, changing to "wearing a sunglasses and" makes no difference :)


I know as much about how to get the best image outputs from text inputs as the person who designed an airport knows the best place to eat in it. The emergent properties of the system are a result of the data put into it, so I can only discuss the system itself, not what it ended up doing with the data in that system.

The models are a product of their datasets, specifically the relationship of the images and prompts via CLIP. CLIP puts both images and text into a coordinate space; imagine just a 2D graph. It tries to ensure that for any real image and its caption, they will each be each other's closest neighbor in that coordinate space.

So if you want a certain image, you have to ask "what caption would be most likely and most uniquely given to the image I'm imagining".

I'm sure this advice is way less helpful than what you find in prompt engineering discord channels and guides I've seen.
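
For readers who want the "closest neighbor in coordinate space" idea in code, here is a stripped-down sketch of the contrastive objective that CLIP-style models are trained with (my own illustration; image_encoder and text_encoder are placeholders for real networks):

    # Sketch of a CLIP-style contrastive loss: matching image/caption pairs are
    # pushed to be each other's nearest neighbors in a shared embedding space.
    import torch
    import torch.nn.functional as F

    def clip_loss(image_encoder, text_encoder, images, captions, temperature=0.07):
        img_emb = F.normalize(image_encoder(images), dim=-1)    # (B, D)
        txt_emb = F.normalize(text_encoder(captions), dim=-1)   # (B, D)

        # Cosine similarity between every image and every caption in the batch.
        logits = img_emb @ txt_emb.t() / temperature            # (B, B)

        # The "correct" caption for image i is caption i, and vice versa.
        targets = torch.arange(logits.shape[0], device=logits.device)
        loss_i = F.cross_entropy(logits, targets)        # image -> text direction
        loss_t = F.cross_entropy(logits.t(), targets)    # text -> image direction
        return (loss_i + loss_t) / 2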


Is 3d a different problem, or a similar one but considerably harder? I'd expect the data encoding (vertices vs pixels) to change a bit about it but I'm not familiar enough to know.


Pixel values are discrete (length x width x r256 x g256 x b256) and vertex values are continuous, so that is one major difference.

Secondly, there's vastly more labeled image data in the world than 3D data, so creating a CLMP (contrastive language and mesh pairing) model is harder.

It's very late but I may be able to give a much better answer on more of the nuances of 3D generation tomorrow.


Would voxels be easier than vertex-based meshes?

I can imagine you'd have the problem of stray floating voxels then, which isn't as noticeable when it happens with 2D pixels.


The “hot new thing” is NeRF, neural radiance fields, which can take into account the way light interacts with the object (and hence you can correlate data from pictures taken at different angles)


Interesting!

I knew about transformers, CLIP and diffusion, but pixel patch encodings are new to me.

Can you give me more details / point me towards an explainer? A quick duckduckgo search didn't help.


I don't quite remember whether it was first used in the ViT paper [1], but it's a fairly straightforward idea. You take the patches of an image as if they were words in a sentence, reduce the size of each patch (num_of_pixel x num_of_pixel) with a linear projection so that we can actually process it and get rid of sparse pixel information, add in positional encodings to put in location information of the patch, and from that point on treat them the way words are treated in language models, with transformers. Essentially, words are a human-constructed but information-dense representation of language, whereas images have quite a lot of sparsity in them, because individual pixel values don't really change much of an image.

1: https://arxiv.org/pdf/2010.11929.pdf


> reduce the size of the patch(num_of_pixel x num_of_pixel) with a linear projection

What does that mean?

(Thanks for the explanation)


The flattened image patch of width and height PxP pixels gets multiplied with a learnable matrix of dimension (P^2·C)xD, where C is the number of channels and D is the size of the patch embedding. In other words, it's a linear transformation that reduces the dimensionality of the image patch.
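
A tiny sketch of that projection, roughly following the ViT recipe (my own illustration; shapes assume an RGB image, and the numbers are just typical defaults):

    # ViT-style patch embedding: cut the image into PxP patches, flatten each
    # patch, project it to a D-dimensional token, and add positional encodings.
    import torch
    import torch.nn as nn

    B, C, H, W = 2, 3, 224, 224   # batch, channels, height, width
    P, D = 16, 768                # patch size, embedding dimension

    x = torch.randn(B, C, H, W)

    # (B, C, H, W) -> (B, num_patches, P*P*C)
    patches = x.unfold(2, P, P).unfold(3, P, P)              # (B, C, H/P, W/P, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)

    proj = nn.Linear(C * P * P, D)   # the learnable (P^2 * C) x D matrix
    pos = nn.Parameter(torch.zeros(1, patches.shape[1], D))  # positional encodings

    tokens = proj(patches) + pos     # (B, num_patches, D), ready for a Transformer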


Googling the author list gives me a preprint dated May this year: https://arxiv.org/abs/2205.11487


It is now available to lay people by just typing into a website. Months ago it was rather "use this Jupyter notebook". So people are now using it for more serious stuff.

For example, here is an RPG designer using Midjourney for illustrations: https://www.bastionland.com/2022/07/primeval-bastionland-pla...


A coworker and I were playing with DALL-E 2 yesterday, and I pointed out that while I don't think any major RPGs are going to be moving away from artists anytime soon, the quality of Ashcan Editions just jumped way, way up.


Not in the field, but maybe diffusion models? They seem to be used by a lot of different image generation techniques.


I don't know. This 'explosion' in the public space may just be the technology crossing a threshold where big $$$ are on the table and competitors are rushing in to get hold of good chunks of the emerging market. No tech breakthrough, but a marketing & sales assault.


This is amazing, but Google/OpenAI haven't released their models and don't seem to plan to. There is Stable Diffusion, which was released and is probably slightly less good, but still good if anyone wants to mess with it.

There are a Hugging Face instance, Colab notebooks, and locally runnable notebooks listed here [1] on the Stable Diffusion subreddit.

Also, someone has packaged an exe that runs it with no fuss on computers with Nvidia GPUs, which they posted on the media synthesis subreddit [0].

In my limited testing this compares OK to DALL-E 2. Style shifting works slightly less well and it's hard to force it away from normal images, but with a little work it tends to be more accurate to your prompt.

[0] https://grisk.itch.io/stable-diffusion-gui

[1] https://www.reddit.com/r/StableDiffusion/comments/wqaizj/lis...


Smart for Dalle, Midjourney, and Stable Diffusion to capitalize on this quickly. It looks like the technology is being commoditized at rapid speed. I wonder what’s next.


People are getting into "prompt-craft". People have also started pasting different AI tools together like putting these images into CogVideo[0] to generate actual videos (takes ages and usually the length of a gif currently). Others have realized you can run these images through GANs to make faces look extremely convincing

I think "what's next" is fitting these tools together in a larger system

[0] https://replicate.com/nightmareai/cogvideo


Ok I’m just going to list ideas, feel free to add more

* Text to 3d model

* Text to video clip

* Illustrations for newspapers

* Generating pictures for food menus

* Generating a music video from a song

* Generating pictures or even an entire movie for a book

* Interior design ideas

* Product design ideas

I have noticed that leveraging these tools isn’t easy. It requires a fair bit of creativity to come up with the prompts to create an image to really wow somebody.


> Generating pictures for food menus

This is what I've been doing for any recipes [0] that don't have pictures, with pretty good results.

[0] www.reciped.io ex: https://www.reciped.io/recipes/mushroom-and-onion-pizza/


Someone on the midjourney discord is making a comic with images generated by the bot. Another person is making a Magic The Gathering card pack with generated art (it looks good too!). People are already using it for stock photos

One murky area we're still far away from, but I'm curious to follow the developments on: AI-generated movies. Once generated clips get good, what if some movie buff could just generate a movie, scene by scene, using these tools? What about "casting" certain celebrities? The comic I mentioned uses Zendaya (probably because she plays a character in the Dune movies) as a character.


Lots of potential for movies: using AI to up-res, AI to turn a 2D movie 3D, and, more advanced, new or edited scenes. Or how about translating a live action movie to a cartoon and vice versa. Or a different style or tone. Run The Lord of the Rings through a cyberpunk filter.


I'm eager for someone to start recreating the missing episodes of Doctor Who. The audio still exists and there are publicity shots from many of the missing episodes


I wonder how good AI will be at recreating the look of the props made out of broom sticks, gaffer tape and paper mache.


How do they get consistent looks? I mean if you are using it for a comic book your characters should look the same throughout the story. From my experience, the results between prompts are completely different even with minor variations.


If the seed remains the same, the difference between similar prompts is actually continuous.

Here's a slew of images (1 through 5) I generated all from the same seed and same prompt sans a word or two: https://www.instagram.com/p/Chg60Fou6xB/
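
To illustrate the seed point, here is roughly what that looks like with the Hugging Face diffusers library (a sketch only; API details such as argument names differ between versions):

    # Keeping the seed fixed while nudging the prompt tends to give closely
    # related images, since the initial latent noise stays identical.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16,
    ).to("cuda")

    prompts = [
        "portrait of a knight in silver armor, dramatic lighting",
        "portrait of a knight in golden armor, dramatic lighting",
    ]

    for i, prompt in enumerate(prompts):
        # Re-seed with the same value for every prompt.
        generator = torch.Generator("cuda").manual_seed(1234)
        image = pipe(prompt, generator=generator).images[0]
        image.save(f"knight_{i}.png")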


Thanks. Those are amazing by the way


Thank you mate. Spent a lot of time, even though the AI did all the _real_ work :-)


Is there a good central source that lists out all of the currently available tools?


The AI is moving up the "content ladder" generating text -> now images -> next videos

Smart for Google to invest in this because their business relies on third party content to exist (blogs, webpages, youtube videos)

If they can vertically integrate their business to create the content AND own the discovery algorithms then they officially win the internet


Generating videos and 3D models is _much_ more difficult than images. You can’t just train off videos from the internet in the same way, because they don’t have sufficient text labels to understand them like CLIP does.


Oh, but they have sound, which can be annotated much faster/more efficiently. You also potentially have the screenplay, but the amount of training data is probably too small and sparse.

FWIW, I don’t think the AI systems will generate a whole video by itself - it’ll be some form of image to image generation where an artist will render a rough sketch of the scene and the AI will fill in the details, frame by frame.


I mean at this point it’s really not a question of if, but when - right?


I wonder if subtitles could be used, so rather than describing the video, you just write a script and it generates video for you. I'm certainly no expert, but it does seem like there's a lot more data there.


Stability.ai, and probably others too, are already working on video and audio models, and I think I also heard about a service to train/finetune the model with your own dataset.


From the cherry-picked example-images on that page, it seems like Imagen more closely follows the prompt than the open Stable Diffusion model[0]. Stable Diffusion needs a lot of hints before it makes out of the ordinary pictures.

In general, I think these models are a great and funny toy, but not a threat to stock-photos yet. This may change within a year or three years though.

[0]:https://stability.ai/blog/stable-diffusion-announcement



The davinci ones are incredibly intricate


those were generated? those look dope


Generated and paying for MidJourney gives you the copyright to them so you can use them for whatever projects you want


Which feels a bit sketchy(?) to me, seeing as all models are built on imagery scraped from the net without anyone's permission. It's one reason why I've spent the last month training my own models using my own imagery. If these stock photo sites had any brains, they would also start training models on images in their databases, especially since they already have everything sorted into categories based on keywords (which I'll spend the next year doing, until I can get img2text tools working in recursive batch mode).


I totally agree. Also, that sounds like a wonderful project :) You should post project updates somewhere to keep track of your model as it evolves!


Thanks! I have a g-doc where I've been documenting settings and progress here [0], although I just now realized I might not have been using my diffusion model correctly for most of my tests. At iteration 509 of my model I seem to have finally nailed it, though! :) I partially "blame" Visions of Chaos, since the amazing dev (or devs?) drops updates almost every day with new machine learning features; model training was only added recently. I must have reset something on accident.

Also I realize there's a lot of image prep work required, not to mention I have a less than ideal amount of VRAM (3060 Ti w/8GB but no monitors attached i.e. 8GB free) so I have to lower some settings. The source images have to be in 1:1 format (which none of my photos are) so I'm using a script to batch call ImageMagick's 'convert' to add white borders to the top/bottom, which results in my renders also having white borders.

[0] https://docs.google.com/document/d/1CnC5SaqpeJiQS-TlDS4trzJR...
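
For what it's worth, the same padding can be done in Python with Pillow instead of shelling out to ImageMagick; a rough sketch (the folder names are made up):

    # Pad every image in a folder to a 1:1 aspect ratio with white borders,
    # similar to batch-calling ImageMagick's convert.
    from pathlib import Path
    from PIL import Image

    src = Path("source_photos")      # hypothetical input folder
    dst = Path("square_photos")      # hypothetical output folder
    dst.mkdir(exist_ok=True)

    for path in src.glob("*.jpg"):
        img = Image.open(path).convert("RGB")
        side = max(img.size)
        canvas = Image.new("RGB", (side, side), "white")
        # Center the original photo on the square white canvas.
        canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
        canvas.save(dst / path.name)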


They all look great! But the usual customers of stock photos are not looking for dragons or cavemen taking a group selfie. And none of the images are the result of a beginner trying their first prompt; they all took many tries to generate.


I've used MidJourney and I really don't think it takes that many tries to get what you want. When you enter in a prompt you get 4 results and then you can either upscale one of them or you can variate on one of them and produce 4 variations of that particular one.

Here's a couple "first results" that I personally tried and you can judge for yourself:

"the government is putting violence in our water"

https://mj-gallery.com/87f5a54d-7d59-44d3-aab4-1dd3ef34902e/...

"cherry monkey"

https://mj-gallery.com/5d2e14ba-8ea1-4797-ab6b-4a6807cfffa8/...

"permaculture garden city"

https://mj-gallery.com/088c18c1-8e61-44da-b109-edfbd32967ac/...

"Acmella oleracea"

https://mj-gallery.com/9d1bc9f3-3cdc-44ec-8577-0791c69aa942/...

"mondrian banana cloud forest"

https://mj-gallery.com/25a8ef06-1a07-4a40-97e5-bebbd2ee925e/...

"lonely neon rainforest at night"

https://mj-gallery.com/8e7b73c0-f519-4727-bf16-30e0a42ab412/...

Obviously these prompts are a bit more artsy than stock photos are meant to be but the point is just to give you an idea of how it does on the first try. All of these took less than a minute to produce



Is this being reported as "new news" because it is available to the public?


AFAIK nothing is released yet? They are probably still trying to, like DALL-E 1/2, remove every image and ban every word that might generate even the slightest hint of controversial imagery before releasing it to the public. To be fair, Stable Diffusion spent months basically doing the same on Discord with thousands of beta users, and had an army of moderators flagging images, which were subsequently removed in order to prevent people from generating porn. It didn't help though, as people got access to the beta models (for "research" purposes) a few weeks before the official release and are using those to e.g. create "porn" [0].

[0] pornpen.ai: https://news.ycombinator.com/item?id=32572770


Even if you don’t want to block generating porn, you still want to know if you’re getting it, because nobody wants to get porn when they didn’t ask for it. (and it may be illegal depending on your country)

It’s easy to remove the filter from the SD scripts and that’s intentional.


Fair enough and props to Stability for giving people the benefit of the doubt, not to mention actually open sourcing almost all of their work, unlike other "open" projects. Speaking of which, Dall-E bans a decent chunk of the English dictionary in the hopes of preventing people from generating anything even remotely offensive, which can be somewhat frustrating.


someone told me bloody mary was banned from dalle


This list is already a month old but it should give you an idea of what kind of words are banned: https://www.reddit.com/r/dalle2/comments/wa3jt6/banned_words...


A bit off topic, but looking at the Stable Diffusion site, there's barely any mention of the people behind the project, which I find odd yet refreshing.


Not a single human face in these samples. I wonder how well it does on faces.


Saul Goodman if he was a character in Twin Peaks:

https://media.discordapp.net/attachments/999426920376717513/...

Generated with Midjourney Beta


Which is now Stable Diffusion under the hood, sprinkled with a little prompt parsing "magic".


This may be misleading.

As far as I could tell (using it before, during, and after this Beta option was available), it was the upscaler using that, not the original 4-image generation.


The samples I was shown definitely had a theme: robots, raccoons, hats, and raccoons wearing hats.

Our brains are very thoroughly wired to detect faces in general, and flaws in faces. So we have a very, very high standard for what passes muster.

We are apparently much more forgiving with regards to what raccoons and corgis look like.


I’m also curious. The animal faces and eyes look good, less creepy than most.


Have you seen https://thispersondoesnotexist.com/ ? That's old already. Different tech, though.


"A photo of a Shiba Inu dog wearing sunglasses and a black leather jacket, playing a guitar in a garden" does not work correctly.


There are some obvious mistakes in tools like this: human faces are wrong, writing is usually scrambled, fingers look weird, etc. Do you know if we need a major breakthrough similar to what happened 6 months ago to fix this, or could these be fixed with incremental improvements in current techniques/datasets?


I wouldn't necessarily say a major breakthrough as such, but I do think some architectural change is needed. There are concepts in images that we don't rely on purely visual understanding for - like words: we have a language model that we rely on when we see text in images. I think we need the same thing in our models to reach the next level of capabilities, by combining models across different domains. I don't know if this manifests as pre-training with a language model and then expanding and updating the tensors as part of image training, or some more complicated merger of the models.

Learning logical concepts just from images seems entirely impractical; we can't rely on having enough images for models to understand words coherently as language. You could draw a picture of a sign that says "children crossing" not because you remember exactly what an image of such a sign looks like, but because you have an understanding of English and the character set that lets you reproduce it. If you tried to learn to create the same sign in Arabic, you'd either need to see a huge number of signs to learn from or (more likely) build a language model for Arabic.

The kind of abstract understandings that we know we can train in language models just aren't learned by image transformers at this scale (or likely any practical scale). A language model could easily understand "A red cube is stacked on top of a blue plate, a green pyramid is balanced on the red cube" and infer things like the position of the pyramid relative to the blue plate; image models quickly fall over with such examples.

An interesting nascent (and hacky) example of the benefits of combining models is that people are already using language models like GPT-3 to write better prompts for image models.
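
A rough sketch of that hack, assuming the (era-appropriate) openai Completion API and the diffusers StableDiffusionPipeline; the model names, prompt template, and settings are illustrative, not a recommendation:

    # Sketch: let a language model expand a terse idea into a detailed
    # prompt, then hand that prompt to an image model.
    import openai
    import torch
    from diffusers import StableDiffusionPipeline

    openai.api_key = "sk-..."  # your API key

    completion = openai.Completion.create(
        model="text-davinci-002",
        prompt='Rewrite this as a detailed, vivid text-to-image prompt: '
               '"permaculture garden city"\n\nPrompt:',
        max_tokens=80,
        temperature=0.7,
    )
    detailed_prompt = completion.choices[0].text.strip()

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")
    pipe(detailed_prompt).images[0].save("out.png")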


The writing issue has demonstrably been solved without any breakthroughs by simply making a larger model (DALL-E 2 vs. the publicly available DALL-E), and the same appears to be true for the other main issues as well - it's just that the publicly available versions are based on smaller/weaker models than the state of the art because they're significantly cheaper to run.

Also, I seem to recall that at least some models deliberately harmed generation of human faces (e.g. by selection of training data) to draw attention away from the deepfake/fake-news use cases and the related ethical, political and PR issues. I would assume that if any of them actually wanted to make faces specifically look good, it would be purely a matter of some engineering work without any breakthroughs needed - we have evidence from face-specific models that the same technical architecture can do decent faces.


Neither DALL-E 1 nor 2 was completely released. Further, DALL-E 2 definitely still has issues with generating text, although latent diffusion can do an okay job of it.


It’s Imagen that can competently generate text, and their ablation studies show it gains the ability at a size somewhat above DALL-E 2’s, IIRC.


Apologies I was referencing personal experience w.r.t. latent diffusion.

https://replicate.com/laion-ai/erlich

It still has issues of course, but it's a lot better at spelling than DALL-E 2.


> similar to what happened 6 months ago

The major breakthrough that happened 6 months ago is that someone put their API behind a website for people to play with.


> Human faces are wrong, writing is usually scrambled, fingers look weird

I think a lot of this has been solved in DALL-E already. It's pretty good at right-looking faces and fingers. Text, not so much... but it does appear that whatever OpenAI is doing behind the scenes, that's getting better too.


On the other hand, those are all things humans have trouble with when drawing, unless they have extraordinary talent or a lot of experience.


These things keep showing up lately in various forms. A day or two ago it was one creating images of a porn star. But it's always a sometimes-good, sometimes-crappy web app where you can test it out, and usually it's very limited or not available at all.

The porn star one said it had used up its resources, and for this one I can't figure out whether I can run it. I get a spreadsheet from the link, but I am unsure how it helps me.

As a guy who does not know a lot about "AI": are these things just a way of saying "Hey, I trained this thing to do ...... / I have a big data model"?

Is there a standard way to share it?

If you start out with 10,000 penguin images to teach a model about penguins, once the model has worked through them, do you have any further use for the images separately?

How large are the models?


Why is this text-based image research advancing so rapidly? Is there a market application they're aiming for? Seems like multiple teams have been on multiple models and I see a new one every week.


It’s basically the first time we’ve automated visual perception as it relates semantically to language.

In terms of theory, such systems are candidates for a generic perception engine you might use in say, a robot with cameras, speaker and a microphone.

Perception is just one aspect of intelligence, but this research ultimately makes it possible for a machine to encode data semantically.


It's like a love child of Photoshop and Polaroids, with the simplicity of writing text. It would have been amazing if it hadn't become popular.


2014: I made a new javascript framework!

2022: I made a new AI image generator!


This will be nice when a person wants a custom item tailor-made: describe visually what it should look like and verify the picture before sending it over. Taking that idea further into the future, it's easy to see where speech-to-text and dynamic adjustments coupled with 3D printing could lead.


I feel like Imagen gets all the noise and people forget about Parti - https://parti.research.google/

It's another Google project using a different set of techniques.


I just ignore Google projects since they don’t get released… and when they are released it’s with such severe limitations that they don’t work well.


That’s because nobody cares that they got a slightly worse result with a different model architecture. In particular, the pictures aren’t as sharp, so they’re not as fun to look at.

Parti+Imagen is in development and is competitive again.


Did anyone think illustrators would lose their jobs to AI before e.g. cab drivers? I wouldn't have guessed that in a million years. Singularity upon us or just quirky development in an already weird field?


Is the next thing going to be image generation trained on your own photos? It seems like personalized generation would be a huge market.

Ex. Show me walking on a beach. Show my dog wearing sunglasses, etc.


You might want to look at this:

https://textual-inversion.github.io/

It allows you to find a string that corresponds to some "thing" you have pictures of. Then you can use a text-to-image model with a prompt containing a reference to your "thing".

Looks pretty amazing! However, it needs 20 GB of VRAM, so I couldn't try it out.
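
For a sense of what the result looks like once an embedding has been trained (training is the part that needs the heavy VRAM), here's a minimal inference-time sketch, assuming a recent version of the diffusers library and one of the publicly shared concept embeddings; the sd-concepts-library repo and its <cat-toy> placeholder token are just an example:

    # Sketch: load a learned textual-inversion embedding, which registers
    # a new pseudo-word the text encoder understands, then drop that
    # pseudo-word into an ordinary prompt.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_textual_inversion("sd-concepts-library/cat-toy")

    image = pipe("a photo of <cat-toy> wearing sunglasses on a beach").images[0]
    image.save("cat_toy_beach.png")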


Woah, it’s certainly going to be trippy when the AI is animating this in VR for us and maybe even listening to us direct the game/world.


GPT-3(4), DALL-E, Imagen, ...

I wonder what's next?


Hopefully something that actually matters for humanity, and less "game playing" and "stock image market disrupting".


Sound and moving images. Put it all together and fan fiction will never be the same again!

But seriously, it may usher in a new cottage industry of content creation, from graphic novels to animation to movies.


The indignation of still having to go to work every day when this could all be automated


Diminishing returns.


Does anyone know if or when Google intends to make this available for beta testing or public use?


They're probably afraid that people will generate images that don't conform to the current American Progressive Vision.


Towards the bottom of the page they say: “The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo. In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access.”


Are they afraid of lawsuits, or are they painfully regressive, prudish moralists?

Stable Diffusion was un-neutered within 24 hours of its public release, and the worst thing people do with it is Emma Watson porn.


In a nutshell, they're afraid of people using prompts that generate Black people looking like monkeys/gorillas, and other such sensitive examples that others have posted/mentioned.


Or they just remember Microsoft Tay: https://en.wikipedia.org/wiki/Tay_(bot)


It's not about prudishness, although generating pornography is one of the concerns. In the paper there's a full page on ethical concerns; some of the issues they mention are misinformation and perpetuating harmful cultural stereotypes around race and/or gender roles.


While I understand that view, Pandora's box is already open; by their own measurements, other publicly available technology is comparable, so the only thing they're preventing is direct comparison of their tech to others'.

A more cynical side of me thinks Google is rushing out the PR (including the hand-picked sample images) because they can see that the state of the tech is progressing rapidly, and perhaps by the time their tech is release-ready a competitor will already have something better (the new round of betas certainly looks promising).

It is a little bit on-brand for Google to announce they have the best, only for those claims to fall over later.


By the time they figure out the "responsible externalization" portion, competitors (including open source) will surpass their results.

The technology isn't special to Google. They don't control its proliferation.


The angle and framing are the same in all the images. They all have a stock-photo vibe.


Model weights or shut up.


Is there an AI that creates realistic text from photos?


Look into image-to-text or CLIP captioning. There are several tools out there, although, similar to running text-to-image tools locally, they may require some time to set up via conda. The easiest way that I know of (currently) would be to install Visions of Chaos. You will still need to follow the TensorFlow setup guide[0], which will take 1-2 hours; then, running Visions of Chaos > Mode > Machine Learning will download 300-400 GB of models in order to run all the tools they have integrated, even if you aren't interested in most of them. The setup may require several retries and several hours due to some servers being extremely slow (or overloaded).

EDIT: Just found this[1] as well, though setup might also be a pain.

[0] https://softology.pro/tutorials/tensorflow/tensorflow.htm
[1] https://replicate.com/methexis-inc/img2prompt
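
If you'd rather skip a heavyweight all-in-one install, here is a minimal image-to-text sketch, assuming the Hugging Face transformers library and one publicly available captioning checkpoint (the model name is just one example of the genre, not the only option):

    # Sketch: caption a local image with an off-the-shelf
    # vision-encoder/text-decoder checkpoint.
    from PIL import Image
    from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer

    model_id = "nlpconnect/vit-gpt2-image-captioning"
    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    feature_extractor = ViTFeatureExtractor.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    image = Image.open("photo.jpg").convert("RGB")
    pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))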



iOS is pretty good at it if you enable VoiceOver.

Microsoft has an app called Seeing AI that’s not as good at it.

Both of those use older pre-CLIP technology.


What is the purpose of these new AI photo generators besides crime, impersonation, or advertising? People seem really excited by them, but I don't get it.


Check out this artist on YouTube who uses Midjourney to draw something in 20 minutes that would usually take him hours: https://www.youtube.com/watch?v=EsQgD9yNMxU


It's a creative outlet, it's art, it's fun.


Coming from Google? No thanks. If we are already being screwed by simple services like YouTube, I cannot imagine how screwed we'll be when we become addicted to services that rely on the research described in the article. Coming from Google, it's probably only good for Google, not for its potential customers/users/slaves.


For all those who are practically demanding this be turned over to them, here's a quote from a recent article (about Telegram):

Filing a charge is pointless, says Ezra. For two years she has been harassed on Telegram. It started when she was sixteen: photoshopped nudes tied to her Snapchat account were circulated. They had taken selfies from her social media, and from her family's, and combined them with porn fragments. She doesn't know the perpetrator, but that person goes to great lengths to ruin her. "Nowadays, the boys have so many ways to make it look real." [1]

If you read that, and think all these tools should be released, you're part of the problem.

[1] de Groene Amsterdammer, 146/33, p. 21.


> If you read that, and think all these tools should be released, you're part of the problem.

Such comments make me wonder whether a social credit system is actually a good idea. Yes, it can be abused to deny my rights, but how can it be worse than being assumed to be a sexual predator by default?


Nobody assumes that. But if you give it away to literally everybody who asks for it, you'd be worse than naive to assume that nobody is going to abuse it.

But your comment makes it sound as if you'd rather give up your rights than not have access to this system. I don't think it's that interesting, is it?



