This looks great! I've been looking for a Python library to use with Phantasmagoria[1] for ages, but everyone is doing web UIs. You even packaged it up in a Docker container, very nice, thank you!
The progress in the AI space is absolutely astounding.
In less than a year, we went from no AI photo generation (from prompts), to DALL-E 2 as a commercial service, then competitors started popping up like Midjourney, and now we have Stable Diffusion, which is a source-available AI you can run yourself, unlocking implementations like this.
There are other companies now hyping AI video generation, like Runway(1).
Totally disagree. The whole AI business space seems totally focused on pushing the boundaries of what is possible, while completely ignoring delivering something consistently useful. I played a bit with image generation recently and most results were abysmal. Sure, it can create great things, and prompt hacking will be a thing for a while. It’s however very far from “for each prompt I get a working (as in not broken with artifacts) and matching image”. IMO business usability depends mostly on the average case, and that hasn’t impressed me yet.
The elephant in the room is the “black box” nature of all neural networks. They are not interpretable for humans, therefore we cannot know when they can royally screw up. That means unless we keep a human in the loop, it’s hard to really integrate it into anything critical. And keeping humans out of the loop is what most AI companies promised as an end goal.
Basically, I am embracing the incoming AI winter. Not because no great progress has been made, but because what was promised will never be delivered (as has been the case previous times). At the same time, AI is here to stay and AI-based tools are going to become commonplace. It will just be less of a big deal than everybody expected.
I think image generation is an interesting case, because even if a human is always in the loop, and you need to try several times before you get a good image for your prompt of interest, that's likely still faster and cheaper than photoshopping exactly what you want (or certainly faster than hiring an illustrator). And the images produced are sometimes really quite good. A model which produces some amount of really messed up images can still be 'useful'.
_However_ the kinds of failures it makes highlight that these models still lack basic background knowledge. I'm willing to let the stuff about compositionality slide -- that's asking kind of a lot. But I do draw a very straight line from DeepDream in 2015 producing worm-dogs with too many legs and eyes, through StyleGAN artifacts where the physical relationship between a person's face or clothing and their surroundings was messed up, to the freakish broken bodies that Stable Diffusion sometimes creates. Knowing about the structure of _images_ only tells you a limited amount about the structure of things that occupy 3D space, apparently. It knows what a John Singer Sargent portrait looks like, but it's not totally sure that humans have the same number of arms when they're hugging as when they're not.
In the same way, large language models know what text looks like, but not facticity.
So I don't know that an AI winter is called for. But maybe we should lean away from the AI optimism that we can keep getting better models by training against the kinds of data that are easiest to scrape?
>Totally disagree. The whole AI business space seems totally focused on pushing the boundaries of what is possible, while completely ignoring delivering something consistently useful.
Interestingly, Midjourney is taking an approach you might be interested in, where they're fine-tuning their model to prioritize consistent, visually-appealing outputs with even the most vague prompts (e.g. "a man").
And... it's really making me appreciate its competitors more. This always-good-enough consistency is very much a double-edged sword, IMO, because it also results in a very same-y feel for most Midjourney images (and kind of makes me appreciate instantly-recognizable MJ images a little less, in a way not unlike how I used to be impressed by starry-sky spraypaint pieces and then realized they're basically SP101). You almost always get something good out (at a rate I'd feel comfortable wrapping a production-quality app around) but it has become harder and harder to produce new visuals/aesthetics as Midjourney has progressed closer to their desired consistency levels.
Back when I started on it, I'd get good/interesting images every 5-10 generations that I'd then tweak and get even more interesting images. Now I'm lucky to see something new/interesting every 5-10 generations, although everything in between is _fine_.
My background here, FWIW: according to the site, I've been using Midjourney for 4 months straight and generated almost 10,000 images. I also have ~700GB of generations on disks from other models in the meantime and run a few sites that basically do wrap these kind of generation models, like novelgens.com, that try to find a good ratio between consistency and divergence.
In the grand scheme of things, I think the AI generation space needs both ends of the spectrum: consistent results like Midjourney lower the barrier of entry for new people to explore the space, but prompt-dependent powerhouses like Stable Diffusion enable artists to push the tooling further and have significantly better control over the art they're trying to create.
What’s being delivered is useful. I agree that you still need a human in the loop, but that’s true of any creative tool -- having Adobe Illustrator doesn’t make me an artist. The current generation of tools has made certain design tasks easier, the main thing missing still is not ML advances as much as nice UIs that put it in the hands of creative professionals.
It depends on how you measure "progress". For you, progress seems to be only about business. I think the majority of people (including me) are just delighted about this new toy that brings them joy. In my life, it is great progress if I get new innovative toys that haven't been available before.
How did you get to the conclusion that I think progress is mostly about business?
My point is that the promises which funded the current wave of AI craziness do not seem to be getting fulfilled. The precise moment that becomes obvious to everybody, funding will stop and move to the next horse, whatever that will be.
It seems that autonomous driving got stuck on “driver must be ready to take control”. If that doesn’t change, Uber is just a glorified taxi corporation. The Tesla car revolution ain’t happening if my car can’t drive me to a spot without assistance (allowing me to sleep or be drunk or whatever). They become just another car company with a head start on electric.
And the rest of the AI industry seems to follow the same pattern for me - great results (cars can mostly drive themselves nowadays, that’s insane!), but always a notch less than expected. Because what was expected were human replacements, and what we got is human augmenters. It’s probably better for humanity, as productivity will rise, but humans won’t be cut from the loop. I just don’t think this particular result is what Big Money had in mind when they poured money over it.
Stable Diffusion only cost $600k to train. It's hardly Big Money, and for those $600k you have a revolution in digital art, image processing, stock photo generation, FUN, etc. taking place right now that is insane, and it has only been around for a few weeks!
Btw neither OpenAI nor Stability have at any point suggested that this technology was intended to "cut humans from the loop". These are just new tools. What we do with them is up to us but they are certainly useful and inspiring.
You are expecting delivery of your dream. Some companies get quite close, close enough for a lot of people, and they still get dismissed.
This vastly improves the process of generating artwork. Like it or not, there is still a human in the loop. But given a job of delivering artwork, somebody with these AI tools is going to look like a superhuman - if time is important (and in business it usually is).
Now, funnily enough, you bring AI cars into this: afaik George Hotz is running a very successful product (comma.ai) that kinda gets you there, almost. Yet people are OK with it, even if they have to hold the wheel every time they need to change direction. It still helps them 60-95% of the time.
If I can get a tool that is helpful 20% of the time and is accessible, I would rather have it. Especially if I can easily determine which 20% that is. I think Big Money would think similarly.
> what was expected were humans replacements and what we got is human augmenters
The advantages of AI even without full human replacement are: volume, consistency, testability, cost, scalability, and the upgrade story. You can process more data, processing is more consistent as enforced by the model, you can test the model extensively, it costs less, it can handle sudden scaling, and it is easier to control the upgrade process when you want to update the system.
But even if the AI ability to remove the human in the loop is low today - that's not necessarily a bad thing. We need some time to adapt and transition as a society. If the transition is too fast, much more social displacement will happen. If we can match the automation rate to the natural job attrition rate then we can have a much smoother transition.
> Enhance 34 to 36. Pan right and pull back. Stop. Enhance 34 to 46. Pull back. Wait a minute, go right, stop. Enhance 57 to 19. Track 45 left. Stop. Enhance 15 to 23. Give me a hard copy right there.
Surely it's possible to have a full alpha mask, such that 50% alpha means "push the diffusion process towards this value, but don't force it to generate this value".
You'd effectively be trying to do a pixel-by-pixel diffusion strength parameter, which I'm not sure has a coherent interpretation to the algorithm (because it's currently a scalar applied to the run settings).
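That said, one way to give it a concrete meaning would be a RePaint-style blend: at each denoising step, re-noise the original latents to the current timestep and mix them back in, weighted by the soft mask. A minimal sketch, assuming a diffusers-style scheduler/UNet interface (all the names here are hypothetical stand-ins, not imaginAIry's actual API):

    import torch

    def denoise_with_soft_mask(unet, scheduler, latents, orig_latents, soft_mask, cond):
        # soft_mask is a float tensor in [0, 1] at latent resolution:
        # 1.0 = fully regenerate, 0.0 = keep the original,
        # 0.5 = "push toward the original but don't force it".
        for t in scheduler.timesteps:
            noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
            latents = scheduler.step(noise_pred, t, latents).prev_sample
            # Re-noise the original latents to the current timestep and blend.
            noised_orig = scheduler.add_noise(orig_latents, torch.randn_like(orig_latents), t)
            latents = soft_mask * latents + (1.0 - soft_mask) * noised_orig
        return latents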
the "alpha" is already built into the tooling and specified independent of the mask. IE. the stable diffusion inpainting takes hints from what you leave and "decides" what to keep
https://github.com/brycedrennan/imaginAIry#automated-replace...
https://news.ycombinator.com/item?id=32887385
And I got the idea from here:
https://github.com/ThereforeGames/txt2mask
Which is using the model here:
https://github.com/timojl/clipseg
Clipseg is doing the hard part!
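For anyone curious what that hard part looks like, here's a rough sketch of the text-prompted mask step, using the Hugging Face transformers port of CLIPSeg (the CIDAS/clipseg-rd64-refined checkpoint) rather than the repo above; txt2mask and imaginAIry wire this into the inpainting pipeline themselves, so treat it as illustrative only:

    import torch
    from PIL import Image
    from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

    processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
    model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

    image = Image.open("photo.jpg")  # any input image
    inputs = processor(text=["a dog"], images=[image], return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # low-res relevance heatmap for the prompt

    # Threshold the heatmap into a binary mask to hand off to inpainting.
    mask = (torch.sigmoid(logits) > 0.35).float()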