If people just want to run text-to-image models locally, by far the easiest way I've found on Windows is to install Visions of Chaos.
It was originally a fractal image generation app but it's expanded over time and now has a fairly foolproof installer for all the models you're likely to have heard of (those that have been released anyway).
Thank you, I've been looking for something like that and it looks very cool, judging by this tutorial that shows it in action, creating an image and displaying the results in progress:
https://softology.pro/tutorials/tensorflow/tensorflow.htm
It's only a matter of time before we get a good NSFW image generation algorithm. Text erotica generation is already a not-insignificant part of the public AI world (remember AI Dungeon? Most people used it for porn). The question is whether it's going to come out of the established adult industry or not. There are clear benefits (no real humans needed anymore, personalized fetish material forever, etc.); the only issue is whether they're willing to deal with the inevitable bad press (e.g. fake images of real people, taboos like underage content or disturbing imagery). I wouldn't be surprised if any models from the adult industry end up heavily gimped from the start to avoid liability.
I feel like anybody, especially women, who has high-res photos/videos of themselves (nude or not) on the internet is going to wake up one day to find they have been turned into a pornstar.
For example, kpop idols. There are high-res 60fps fancams of them dancing, there are high-res broadcasts, images from all angles.
There's already a huge market for deep fake kpop and I believe that they are in for a reckoning.
Your last sentence combined with your introductory visual of somebody “waking up” one day as a pornstar is pretty disturbing. Anybody waking up to discover they are the subject of pornography would be experiencing a traumatic public humiliation. Check your head.
BTW, basically every “female”, along with every male, out there has high-resolution photos on social media.
I agree with the other commenter: the connotations of your post, the hints at your enjoyment of the prospect of this potential future, are disturbing.
I'm not sure whether you actually think this is good or bad but honestly I bet it would actually be a net positive, in a "if everybody's a pornstar, nobody is" kind of way. If you can generate nsfw video from a picture of a face with the press of a button, it will probably cease to be a thing that matters to anybody within a generation or two, in the same way we think about Photoshop today. It might be a bit of an awkward transition though since it definitely offends most modern people's sensibilities.
Let's be real though, it's the anime-style generation models that are going to be the pioneers in this field.
Possibly related: I was at Cornell when there was a guy who was an excellent artist and was making money by selling pornographic art of other actual students. Needless to say, a stop was put to this once word got out...
wandb forbids reusing the models and other information, regardless of how they are used, so they should find another source for their models.
EDIT: since I am being accused of inventing this, I will quote the terms of agreement and license. Maybe the founder himself has not read them, or someone without training in writing proper terms and agreements drafted them for him, but the restrictive usage of "Materials" does apply to the hosted software.
Note that there is no formal definition of "Materials" or "Service", so it applies to all the contents of the webpage, including the software stored there:
https://wandb.ai/site/terms
I quote it:
2. Use License
Whether you are accessing the Services for personal, non-commercial transitory viewing only (our free license for individuals only), for academic use, or for commercial purposes (our subscription package for businesses), permission is granted to temporarily download one copy of the information or software (the “Materials”) from our website. This is the grant of a license, not a transfer of title, and under this license, you may not:
a. Modify or copy the Materials;
b. Use the Materials for any commercial purpose, or for any public display (commercial or non-commercial);
c. Attempt to decompile or reverse engineer any software contained in the Materials;
d. Remove any copyright or other proprietary notations from the Materials; or
e. Transfer the Materials to another person or "mirror" the Materials on any other server.
This license shall automatically terminate if you violate any of these restrictions and may be terminated by us at any time. Upon terminating your viewing of these materials or upon the termination of this license, you must destroy any downloaded materials in your possession whether in electronic or printed format.
f. Utilize our personal license for individuals for commercial purposes and any such use of our personal license for commercial purposes (e.g. using your corporate email) may result in immediate termination of your license.
Founder of Weights & Biases here (wandb). We don’t forbid anything; models are the property of the people who created them.
Why do you think that?
EDIT: I'll respond via an edit, since you did. Look at sections 3b and 3c in the terms; they cover Models and other user content specifically. Those are user property, not our property. But I can see how this is confusing. We will clarify it.
Did you write your own terms of agreement rather than having a lawyer do it? I signed up today, and this is explicitly written in the license agreement, point 2. Note that there is no formal definition of "Materials" or "Service", so it applies to all the contents of the webpage, including the software stored there, and as soon as anything is ambiguous the interpretation is free (or random).
2. Use License
Whether you are accessing the Services for personal, non-commercial transitory
viewing only (our free license for individuals only), for academic use, or for commercial purposes (our subscription package for businesses), permission is granted to temporarily download one copy of the information or software (the “Materials”) from our website. This is the grant of a license, not a transfer of title, and under this license, you may not:
a. Modify or copy the Materials;
b. Use the Materials for any commercial purpose, or for any public display (commercial or non-commercial);
c. Attempt to decompile or reverse engineer any software contained in the Materials;
d. Remove any copyright or other proprietary notations from the Materials; or
e. Transfer the Materials to another person or "mirror" the Materials on any other server.
This license shall automatically terminate if you violate any of these restrictions and may be terminated by us at any time. Upon terminating your viewing of these materials or upon the termination of this license, you must destroy any downloaded materials in your possession whether in electronic or printed format.
f. Utilize our personal license for individuals for commercial purposes and any such use of our personal license for commercial purposes (e.g. using your corporate email) may result in immediate termination of your license.
I am certainly not a lawyer; however, 3b and 3c (from the terms link you posted) state that user content, specifically including Models, is the property of the user.
Are you saying you think there is a conflict between 2 and 3b, 3c, or did you miss section 3?
```
3. Intellectual Property & Subscriber Content
a. All right, title, and interest in and to the Services, the Platform, the Usage Data, the Aggregate Data, and the Customizations, including all modifications, improvements, adaptations, enhancements, or translations made thereto, and all proprietary rights therein, will be and remain the sole and exclusive property of us and our licensors.
b. All right, title, and interest in and to the Subscriber Content, including all modifications, improvements, adaptations, enhancements, or translations made thereto, and all proprietary rights therein, will be and remain Subscriber’s sole and exclusive property, other than rights granted to us to enable (i) Subscriber to process its data on the Platform, and (ii) us to aggregate and anonymize Subscriber Content solely to improve Subscriber's user experience.
c. Subscriber Content means any data, media, and other materials that Subscriber and its Authorized Users submit to the Platform pursuant to this Agreement, including, without limitation, all Models and Projects, and any and all reproductions, visualizations, analyses, automations, scales, and other reports output by the Platform based on such Models and Projects.
```
Sorry Slewis, I cannot reply to your other comment with another subcomment.
This is an ambiguity, and it is an issue that, as founder, you should address: it could be interpreted as a self-contradictory agreement and then invalidate part of the agreement (as I've seen with informal open source licenses, ill-formed patents that ended up bypassed, etc.).
A way to solve that might be a glossary of what you mean by each term; however, I do recommend using a lawyer for such things, as this sort of mistake can become expensive later on.
Go buy a copy of https://www.survivingiso9001.com/ [No affiliation] for your lawyers. Not directly for the subject matter, but as a cautionary tale about consequences of word choice.
Please refrain from writing derogatory comments that don't contribute any constructive argument to the conversation; just downvote the comment, or even better, follow where the conversation goes.
I also recommend reading the guidelines on how to post in the HN community; check the section regarding how and what to comment at the link here:
I spun up an AWS Ubuntu EC2 instance with 2 Tesla M60s. When I run
python3 image_from_text.py --text='alien life' --seed=7
I get this error
detokenizing image
Traceback (most recent call last):
File "/home/ubuntu/work/min-dalle/image_from_text.py", line 44, in <module>
image = generate_image_from_text(
File "/home/ubuntu/work/min-dalle/min_dalle/generate_image.py", line 74, in generate_image_from_text
image = detokenize_torch(image_tokens)
File "/home/ubuntu/work/min-dalle/min_dalle/min_dalle_torch.py", line 107, in detokenize_torch
params = load_vqgan_torch_params(model_path)
File "/home/ubuntu/work/min-dalle/min_dalle/load_params.py", line 11, in load_vqgan_torch_params
params: Dict[str, numpy.ndarray] = serialization.msgpack_restore(f.read())
File "/usr/local/lib/python3.10/dist-packages/flax/serialization.py", line 350, in msgpack_restore
state_dict = msgpack.unpackb(
File "msgpack/_unpacker.pyx", line 201, in msgpack._cmsgpack.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.
I get a similar error running it locally (not sure if related, but it also can't find my GPU, which is a 3080ti and should be sufficient):
Traceback (most recent call last):
File "/home/pmarreck/Documents/min-dalle/image_from_text.py", line 44, in <module>
image = generate_image_from_text(
File "/home/pmarreck/Documents/min-dalle/min_dalle/generate_image.py", line 75, in generate_image_from_text
image = detokenize_torch(torch.tensor(image_tokens))
File "/home/pmarreck/Documents/min-dalle/min_dalle/min_dalle_torch.py", line 108, in detokenize_torch
params = load_vqgan_torch_params(model_path)
File "/home/pmarreck/Documents/min-dalle/min_dalle/load_params.py", line 12, in load_vqgan_torch_params
params: Dict[str, numpy.ndarray] = serialization.msgpack_restore(f.read())
File "/home/pmarreck/anaconda3/lib/python3.9/site-packages/flax/serialization.py", line 350, in msgpack_restore
state_dict = msgpack.unpackb(
File "msgpack/_unpacker.pyx", line 202, in msgpack._cmsgpack.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.
Can anyone give instructions for M1 Max MBP? I had a compilation issue in building the wheel for psutil that looks like "gcc-11: error: this compiler does not support 'root:xnu-8020.121.3~4/RELEASE_ARM64_T6000'" (gcc doesn't support ARM yet?)
Running the "alien life" example from the README took 30 seconds on my M1 Max. I don't think it uses the GPU at all.
I couldn't get the "mega" option to work, I got an error "TypeError: lax.dynamic_update_slice requires arguments to have the same dtypes, got float32, float16" (looks like a known issue https://github.com/kuprel/min-dalle/issues/2)
Edit: installing flax 0.4.2 fixes this issue, thanks all!
Founder of Weights & Biases here (wandb). There are a couple of issues raised in this thread: an API key shouldn’t be required to download a public model, and the cache in the home directory is annoying for this case. We will fix them.
I'd prefer to download it myself and choose where I put it too.
It now uses some hashed filename in a config directory in your home dir for this. I dislike this and want control over where I put models: make it more self-contained instead of spreading random directories all over your OS, and let models be given as input by file path.
This feedback is about the dalle mini playground instead, but it does the same thing. If this one is stripped to the bare essentials, I'd expect this type of dependency to be stripped too.
Edit: I don't want to seem like complaining too much though and am very happy with these open models and tooling for them. Thanks!
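If I remember the wandb client API right, something like this lets you choose the download location yourself instead of the hashed cache path (the artifact path below is a placeholder, not the real DALL-E mini artifact, and it still assumes you're logged in with an API key):
```
# Hedged sketch: downloading a wandb artifact into a directory you choose.
# "some-entity/some-project/model-weights:latest" is a placeholder name.
import wandb

api = wandb.Api()
artifact = api.artifact("some-entity/some-project/model-weights:latest")
artifact.download(root="./models/dalle-mini")  # files land in the path you pick
```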
Does it just download pre-trained DALL-E Mini models and generate images using them? Because I can't seem to find any logic in that repo other than that. I'm not into that field, just curious if I'm missing something.
To add to the sibling comment: the challenge is not converting the weights as such. Pre-trained model weights are just arrays of numbers with labels that identify which layer/operation they correspond to in the source model. The challenge is expressing the model in code identically between two frameworks and programmatically loading the original weights in, since these models can have hundreds of individual ops. That's why you can't just load a PyTorch model in TensorFlow or vice versa.
There are tools to convert to intermediate formats, like ONNX, but they are limited and don't work all the time. The automatic conversion tools usually assume that you can trace execution of the model for a dummy input and usually only work well if there isn't any complex logic (e.g. conditions can be problematic). Some operations aren't supported well, etc.
This isn't always technically difficult, but it's tedious because it usually involves double-checking at every step that the model produces identical outputs for a given input. An additional challenge when transferring weights is that models are fragile and minor differences might have large effects on the predictions (even though, if you trained from scratch, you might get similar results).
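As a minimal sketch of that "port a layer, then check the outputs match" workflow, here is a single linear layer ported from PyTorch to Keras and verified on a random input (purely illustrative, not code from any of the repos discussed here):
```
# Illustrative only: port one PyTorch Linear layer to Keras and verify outputs.
import numpy as np
import torch
import tensorflow as tf

torch_layer = torch.nn.Linear(in_features=8, out_features=4)

keras_layer = tf.keras.layers.Dense(4)
keras_layer.build((None, 8))
# Keras stores the kernel as (in, out), PyTorch as (out, in), hence the transpose.
keras_layer.set_weights([
    torch_layer.weight.detach().numpy().T,  # kernel
    torch_layer.bias.detach().numpy(),      # bias
])

# Verify both implementations agree on the same random input.
x = np.random.randn(2, 8).astype(np.float32)
out_torch = torch_layer(torch.from_numpy(x)).detach().numpy()
out_keras = keras_layer(tf.constant(x)).numpy()
print("max abs diff:", np.abs(out_torch - out_keras).max())  # expect ~1e-6
```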
Also for deployment, the less cruft in the repository the better. A lot of research repositories end up pulling in all kinds of crazy dependencies to do evaluation, multiple big frameworks etc.
I don't understand why execution of a model with the same layers and weights would be different between PyTorch and Tensorflow.
Is it a problem of accumulation of floating-point errors in operations that are done in a different order and with different kinds of arithmetic optimisations (so that they would be identical if they used un-optimised symbolic operations), or is there something else in the implementation of a neural network that I'm missing?
In principle you can directly replace function calls with their equivalents between frameworks, this works fine for common layers. I've done this for models that were trained in PyTorch that we needed to run on an EdgeTPU. Re-writing in Keras and writing a weight loader was much easier than PyTorch > ONNX > TF > TFLite.
Arithmetic differences do happen between equivalent ops, but I've not found that to be a significant issue. I was converting a UNet and the difference in outputs for a random input was at most O(1e-4), which was fine for what we were doing. It's more tedious than anything else. Occasionally you'll run into something that seems like it should be a find-and-replace, but it doesn't work because some operation doesn't exist, or some operation doesn't work quite the same way.
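For what it's worth, the ONNX leg of that pipeline is typically a trace-based export, which is exactly where the dummy-input limitation mentioned above comes from; roughly (the model here is a stand-in, not anything from this thread):
```
# Rough sketch of a trace-based PyTorch -> ONNX export. The exporter only sees
# the ops executed for this particular dummy input, which is why data-dependent
# control flow tends to break it.
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 10),
).eval()

dummy = torch.randn(1, 3, 64, 64)  # placeholder input shape
torch.onnx.export(model, dummy, "model.onnx", opset_version=13)
```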
It's just that expressing those "layers and weights" in code is different in TensorFlow and PyTorch. I think a good parallel would be expressing some algorithm in two programming languages. The algorithm might be identical, but JS uses `list.map(...)` and Python uses `map(f, list)`, JS doesn't have priority queues in the "standard lib" while Python does, etc. Similarly, the low-level ops and higher-level abstractions are (slightly) different in PyTorch and TensorFlow.
I'm not too familiar with Tensorflow, so I can't give an example there, but a similar issue I recently faced when converting a model from Pytorch to ONNX is that Pytorch has a builtin discrete fourier transform (DFT) operation, while ONNX doesn't (yet. They're adding it). So I had to express a DFT in terms of other ONNX primitives, which took time.
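To give a flavour of that kind of workaround (a generic illustration, not the actual rewrite I used): a 1-D DFT can be written as a plain matrix multiplication, which every framework and ONNX opset supports.
```
# A DFT expressed as two real-valued matmuls, compared against torch.fft.
import numpy as np
import torch

n = 16
k = np.arange(n)
omega = np.exp(-2j * np.pi * np.outer(k, k) / n)  # dense n x n DFT matrix

signal = torch.randn(n)
re = torch.from_numpy(omega.real).float() @ signal
im = torch.from_numpy(omega.imag).float() @ signal

ref = torch.fft.fft(signal)
print(torch.allclose(re, ref.real, atol=1e-4), torch.allclose(im, ref.imag, atol=1e-4))
```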
In principle all operations can be translated between frameworks, even if some ops aren't implemented in one or the other. This, however, depends on whether the translation software supports graph rewriting for such nodes.
Lambdas and other custom code are also problematic, as their code isn't necessarily stored within the graph.
Seems like it'll be a serious issue for people hoping we can someday upload human brains into machines if we can't even transfer models from TensorFlow to PyTorch reliably.
Unrelated problems, really. Having written such a translation library, I can say with confidence that the only reason for this is lack of interest in it.
Graph to graph conversion can be tricky due to subtle differences in implementation (even between different versions of the same framework), but it's perfectly possible, though not many utilities go all the way to graph rewriting if required.
Problems arise with custom layers and lambdas, which are not always serialised with the graph depending on the format.
Human brains have high degrees of plasticity -- our brain is much more generic than its usual functional organization would suggest. I don't think we'd be able to upload brains ("state vectors" was the sci-fi buzzword) before digital supports are able to emulate that.
They converted the original JAX weights to the format that Pytorch uses. Because JAX is still fairly new, it can be a lot easier to get Pytorch to run on e.g. CPU. I do find the number of upvotes interesting and I imagine many people just upvote things that have DALLE in the title, to a degree.
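Roughly, such a conversion amounts to flattening the nested flax params tree into name/tensor pairs for a PyTorch state_dict. A toy sketch (the real params come from the downloaded msgpack checkpoint; the names and kernel-transpose convention here are only illustrative):
```
# Toy sketch: flatten a flax-style params dict into a PyTorch state_dict.
import numpy as np
import torch

def flatten_params(tree, prefix=""):
    """Flatten a nested dict of arrays into {dotted.name: array}."""
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = np.asarray(value)
    return flat

# Stand-in for the real checkpoint contents.
jax_params = {"encoder": {"dense": {"kernel": np.ones((8, 4)), "bias": np.zeros(4)}}}

state_dict = {
    # flax Dense kernels are (in, out); PyTorch Linear weights are (out, in).
    name: torch.from_numpy(np.ascontiguousarray(arr.T if name.endswith("kernel") else arr))
    for name, arr in flatten_params(jax_params).items()
}
print(list(state_dict.keys()))
```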
Look how much easier it is to install and run; people are interested in, and upvoting, the result, not how much work was (or wasn't) required to achieve it.
> RuntimeError: This version of jaxlib was built using AVX instructions, which your CPU and/or operating system do not support. You may be able work around this issue by building jaxlib from source.
Unfortunately, following the instructions to build jaxlib from source (https://jax.readthedocs.io/en/latest/developer.html#building...) results in several 404 Not Found errors, which later cause the build to stop when it tries to do something with the non-existent files.
Unfortunately, it looks like I won't be running this today.
I have a 2019 MacBook Pro, 2.4GHz quad-core i5, 8GB RAM, with an Intel graphics card.
python3 image_from_text.py --text='a happy giraffe eating the world' --seed=7 154.61s user 22.18s system 262% cpu 1:07.40 total
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
As you can see, it took 1 minute 7 seconds to complete.
I assume it would be much faster with a grunty graphics card.
Interesting when testing with inputs like "Oscar Wilde photo" or "marilyn monroe photo" and comparing to a Google image search. After some iterations we can have quite similar images but the faces are always blurry.
Transformers output fixed-length sequences. For this model the sequence is 256 "image tokens", arranged in a 16-by-16 grid, each of which decodes to a 16-by-16 pixel patch, giving the 256x256 output.
You can technically increase or decrease this, or use a different aspect ratio, by using more or fewer image tokens, but this is fixed once you start training. It will also require more "decodes" from the backbone VQGAN model (responsible for converting between pixels and image tokens), and thus take longer to run inference on.
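As back-of-the-envelope arithmetic for that trade-off (the grid and patch sizes match the figures above, but treat them as illustrative):
```
grid = 16   # image tokens per side
patch = 16  # pixels each token decodes to
print(grid * grid, "tokens ->", grid * patch, "x", grid * patch, "pixels")    # 256 tokens -> 256 x 256
# Doubling the resolution at the same patch size quadruples the sequence length
# the transformer has to generate, hence slower inference.
print((2 * grid) ** 2, "tokens ->", 2 * grid * patch, "x", 2 * grid * patch)  # 1024 tokens -> 512 x 512
```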
CLIP-guided VQGAN can get around this by taking the average CLIP score over multiple "cutouts" of the whole image allowing for a broad range of resolutions and aspect ratios.
It's already being scaled up to 256x256 from something smaller anyway. You could add an extra upscaler to go further which I've tried with moderate success, but you're basically doing CSI style 'enhance' over and over.
The Google Colab link works if you replace the computed path to flax_model.msgpack on line 10 in load_params.py with ‘/content/pretrained/vqgan/flax_model.msgpack’.
Edit: actually it's easier to open a terminal and move /content/pretrained/vqgan to /content/min-dalle/pretrained/vqgan
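If it helps, the same move can be done from a Colab cell (paths taken from the comment above; adjust them if your clone lives elsewhere):
```
import os
import shutil

os.makedirs("/content/min-dalle/pretrained", exist_ok=True)
shutil.move("/content/pretrained/vqgan", "/content/min-dalle/pretrained/vqgan")
```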
I couldn't get this to pick up my graphics card when running it with WSL 2; it just says no CUDA devices found or something, so I gave up. Not sure if anyone has had any luck.
Not sure why this is downvoted: seems like the inevitable endpoint of AI vision/image generation research and it warrants consideration and discussion.
I tried a few non-descriptive statements from random tweets. As it turns out, nobody's made a random tweet since 2016, but for the few that exist, the results are great. E.g. "Good Morning Everyone , Happy Nice Day :D" generates something that can only be described as bored-ape meets Picasso in kindergarten. Probably the next-gen 1M$ NFT. If anybody needs proof that these models don't think, this is it.
Alternatively, you could try to hide the fact that something was created by an AI, and probably get away with it, but that isn't relevant to the legal question of whether something can be copyrighted when it is known to be created by an AI.
> But copyright law only protects “the fruits of intellectual labor” that “are founded in the creative powers of the [human] mind.” COMPENDIUM (THIRD) § 306 (quoting Trade-Mark Cases, 100 U.S. 82, 94 (1879)); see also COMPENDIUM (THIRD) § 313.2 (the Office will not register works “produced by a machine or mere mechanical process” that operates “without any creative input or intervention from a human author” because, under the statute, “a work must be created by a human being”). So Thaler must either provide evidence that the Work is the product of human authorship or convince the Office to depart from a century of copyright jurisprudence.
Please cite a source that confirms your claim, rather than stating it as a fact without evidence. You may be right, but I have no way of confirming that given the content of your comment.
The report I’ve cited makes a compelling argument against your claim, and several prominent organization’s copyright policies align with it.
Your quote says “without any creative input or intervention from a human author”.
Just hitting a "Generate random artwork" button certainly doesn't seem to qualify as "creative input or intervention", but as for how DALL-E and consorts currently work, I'd say that coming up with a suitable prompt text, potentially refining it to get the output closer to what you want, curating the output, maybe using one of the output pictures as input for further processing, etc., arguably all count as at least some amount of "creative input or intervention from a human author".
Is Dalle significantly different from Adobe Photoshop in the eyes of the law? In both cases you will use a software agent to create art. In fact CGI art has existed for decades. Surely this is a settled question
This stuff is so cool and it makes me happy that we're democratizing artistic ability. But I can't help but think RIP to all of the freelance artists out there. As these models become more mainstream and more advanced, that industry is going to be decimated.
Translators at least have official documents (aka the only times I see a translator in my life across 3 countries) because the government is retarded and needs someone with a title to translate "Name" and "Surname" on a birth certificate.
There is no equivalent for illustrators.
My friends who studied some specific language are all unemployed or doing unqualified jobs. Their peers from a generation before are teachers or work in some embassy.
That said, before some unicorn really starts doing some serious polishing, you'll still want some illustrators to piece art together. Taking the output of these models won't deliver a ready-made product easily.
I will rephrase it: if people today are willing to eat dirt instead of nourishment (and so on, for innumerable instances), to be contented with a lack of quality (with the accompanying acceptance of a consequent decline in the general perception of quality), the fault lies more in decadence than in the instruments.
You need well-cultivated intelligence to obtain a good product: if "anything goes" is the motto, if "cheap" is the "mandate", there lies the issue.
Only in the same manner that GPT-3 eliminates the need for writers. Or influencers remove the need for advertising.
That is, a surface-level view might show these things as equivalent, but the skills required to produce a decent result are not encapsulated in the averages that models contain.
I'm sure a lot of "content writers" for SEO spam will become obsolete. The content level is already rock bottom, so it is easily replaced by brainless machines.
But I'm more bothered by the societal effects where art is automated. I believe it'll expedite the effects we saw when the internet short-circuited the feedback loop for creators, killing any gaps where a non-revenue-optimizing, humane creative force could thrive. Not to mention the crazy mimetic positive feedback loops tearing the discourse apart.
I dunno. I've read a lot of GPT output. It lacks a certain consistency over medium scales. The big picture checks out, and the word-by-word grammar checks out, but the sentence-by-sentence information often isn't cohesive, or certain entity references change over time.
Text-to-image algos did the same thing for a while, but you look at the latest full-size DALL-E and it's pretty much flawless.
If I were considering art school, I'd certainly be reconsidering my options. Maybe there are some defects in the output, but nothing photoshop can't fix.
I think where humans win out (for now) is where a high degree of specificity/precision is needed (e.g. graphic design). Or certain legal requirements are present - AI art can't be copyrighted at this time - such as logo design.
Or most places where art is displayed and/or sold, because those places generally disparage purely digital art, and method is part of what goes into the valuation of the piece. "Oil on canvas" is worth more than "AI-assisted digital print", especially because duplicating it requires considerable effort.
I really don’t think that this will be the case anytime soon. Images can be generated from zany prompts, but making a coherent, fits-together-well set of images for a product like a web page or an illustrated book is far off.
Further, artists have a host of skills that DALL-E doesn’t, like “take that image, but change the colors a bit to make it more acceptable to the client, and move the cartoon bird a little further down”. Or “make an image that will look as good in a print as it does on a small screen”.
"an illustrated book is far off".
Hi, just to mention that I'm using DALL-E Mini for graphic novel experiments... Indeed not really human quality, but ... https://twitter.com/Dbddv01