Hacker News
Break-a-Scene: Extracting Multiple Concepts from a Single Image (omriavrahami.com)
192 points by breakascene 8 months ago | 39 comments



This is wild to me as a portrait/wedding/commercial photographer. Part of me is incredibly excited for the ways tech like this can speed up/dramatically improve my workflow and artistic capabilities. The other part of me is terrified - I shifted away from graphic design with the AI art explosion, but this makes me feel like it won’t be long until the average person no longer needs to hire a photographer either.


This is a harbinger of dramatic change. Scan your whole video and photo library for people, pets, things, and environments.

Construct new scenes anytime and algorithmically.

"Generate pictures of me with each of my friends, making crazy faces, while drinking mai tais at my favorite tropical resort bar."


My takeaway from this type of tech is that it stands to change human culture drastically.

We won't be able to value photography in the same way, so the place photography has in society will naturally change as a result.

If all photography becomes meaning-neutral, does that mean we'll value the medium the same way we value stock photography?


Wow, you are right.

Time to collect ideas for that avant-garde fine art photoshoot I can now do without a camera, props, site searches, or expensive supermodels.

I have also been collecting and developing story ideas. We are not that far away from one person being able to create an entire movie or TV series.

This may lead to some really great movies. Having every detail coming from and integrated in one mind, with one vision, iterating without friction, will be unprecedented. No need for actors and their schedules, retakes, wardrobe, sets, caterers or VIP RVs.

Or if models have long term memory at that stage, two "minds"!

Obviously, video is going to take more processing power. But it is inevitable.

Two components I need: (1) scenes that stay consistent, and true to natural lighting and physics, and (2) the ability to create characters whose proportions, look, facial expressions, voice and style of movement are consistent by default, but easily adjusted.

We are entering the singularity. This isn't stopping.


>Having every detail coming from and integrated in one mind, with one vision, iterating without friction, will be unprecedented. No need for actors

Actors don't just mindlessly repeat the lines some singular genius wrote. A lot of famous movie lines/scenes were improvised by the actors.


I have great respect for what actors bring to a story.

The benefits of isolation do not negate the benefits of collaboration.

The difference will be that both ends of the spectrum will be possible.

Individuals will be able to do it all.

But people will continue to be able to collaborate with anyone. Co-writers, actors, set designers, etc.

An actor who brings a lot could now inhabit several characters in the same story. Or every character!

A creative set designer is no longer limited to found or built sets.

Starting with one writer, there will be no limits on tasks they can do, or who they collaborate with, and how they divide up the work.

Collaborations of all kinds will continue.

But teams will generally be much tighter.


I'm not sure... I think multiple people on a project can be a really good thing. It will definitely be interesting to see if new art forms develop off the back of this, though.


And pay for storage of those scans and generated images, and for the creation of those images, forever(*)

(*) or until the VC money dries up, at which point you'll have two weeks to download it all.


But why


I can't actually think of a serious reason, but I can think of infinite non-serious uses.

Make real life, "I am so rich and happy with my material wealth and vast travel budget" influencers redundant.

However, I am not looking forward to Facebook spamming friends with ads posing as personal recommendations, showing me at my house happily using sketchy products or butt bleach.

They will be sneaky: Show me, but with my face obscured in some way. But recognizable. But deniable. But recognizable.


The use case for me is: "take these five pictures and give me one where everybody is smiling and has their eyes open."


And put all the shorter people in front!


Edit it again, but this time get rid of my ex-son-in-law.

Actually, if the software is really smart (and maybe supplied by Google) you could ask, "remove all of my kids' spouses who will be divorced at some point in the future."


Same tech, different concept though.


> average person no longer needs to hire a photographer either.

Maybe you can become an independent filmmaker. It seems like creatives won't need Hollywood tools or capital in the future, either.

Musings aside, I'm not so certain people won't need a photographer to be present at weddings anyway. They can't exactly trust that to an appliance. They get one shot to take the pictures, and they need to make it work.


How many people actually remember their wedding poses? I remember being bustled off to the side by the photographer to take our post-vows shots, but the actual poses? No way.

So if you had taken a few of the low-quality smartphone pictures that guests took (had smartphones been as ubiquitous when I was married), plus a beautiful background of our wedding location, and asked the AI to create some gorgeous portraits of us, with just the right "golden hour" light and the right look of love in our eyes, would the final portraits tucked away in our album be any different? Would they even be better?

It reminds me of the crowds of people in front of the Mona Lisa with their cameras out, so they can own their own personal, blurry picture of the world's most photographed piece of art, instead of buying the perfectly lit postcard in the gift shop. What's actually different about that much-ridiculed desire vs. wanting to have the "authentic" photographs that the photographer actually took (and then photoshopped, etc.)?


You named it yourself at the end. The difference is that the people took that photo. What's important is not the image itself, the arrangement of pixels, but what it means: for some people it means achieving the dream of traveling to Paris, for others it means "I was here, I saw it live," etc. People don't take pictures because they are beautiful or well composed (well, most people don't), but because of the meaning that particular event has in their lives.

That's why I don't think people would want AI to improve their wedding photos, and so on. People don't hold a wedding to have the perfect photo at sunset, but to have a nice event with the people they love; the photos are there to capture that and nothing else. People's favourite photos (and not only from weddings) are blurry, caught with friends, etc., because those pictures capture the feeling of a good moment.


Back when I was working on a photo-memories app in the early 2000s, a PM said: you know, the photo of the outside of the dreary hotel near the beach isn't about the hotel, but about the hilarious hotel bartender you and your friend were entertained by the night before the photo was taken.


I mean, why doesn't anyone do that now? Instead of hiring a photographer to come onsite and spend several tedious hours with you, just email them a single photo and let them Photoshop something good. It's not as cheap as AI, but it should be cheaper than paying for an in-person photographer, and the results can have any background you want.

There's a difference between discovering an old photo and thinking "haha I don't remember taking this one, we looked so good" and seeing it and thinking "haha I don't remember taking this, oh right because it didn't happen". In the second case, there's no reason to even have a photo album or generate anything at wedding-time. Just store the single base photo, and then 50 years later when you want to show the grandkids you can just have the computer generate whatever they want to see.


I'm 100% part of the group taking blurry pictures, but that's because I don't care at all about posting. I don't use Instagram or Facebook, all the pictures I take are for browsing my photo library later.


I'm not sure that the photographer present will need to be especially skilled, though. One can imagine an "Uber for Photography" service to get raw material, and you get the final "shots" out of a guided generative ML model of some sort, possibly including some moments that participants remember but weren't actually captured.


Depends what you think the skill of a wedding photographer is.

When I have seen an excellent wedding photographer, the skill seems to be 20% photography and 80% people management. Great wedding photographers seem to be some sort of strange combination of entertainer, project-manager and therapist (who also takes photos).


The average person still needs someone with a camera around. Clicks, flash, and posing are all part of the ceremony. Besides, you have to choose the time and angle, arrange people for a good shot, and so on. AI makes your post-processing easier and the results more impressive if you want, but AI alone is soulless fakery. Also, you can add a small drone to your toolset or cooperate with a pilot; this way you can take shots and videos that were impossible before.


That is quite something.

Add a ChatGPT-like interface where you can communicate in sentences, sprinkle in a bit of speech recognition, and you have the computers from '90s sci-fi movies. "Computer, give this cat a Hawaiian shirt, and make it surf."

The future is now.


"Give me a printout of Oyster smiling" - https://www.youtube.com/watch?v=maAFcEU6atk

Life imitating art imitating life. Exciting times indeed.


That's really impressive! Are there any limitations or ways to further improve this work? Are the samples shown on the homepage selectively chosen to highlight better performance?


As an artist, I can observe that (generally) aesthetic images, such as paintings or photos, are visually unified. This unity is similar to that of a signature: complex yet uniquely expressive.

This unity is composed of many elements (themes, forms, subjects, etc.) and sub-elements. What I would have loved to see in this paper is a means by which the hierarchies of these elements can be changed, to produce new hierarchies (and therefore new images).


I may be misunderstanding what you're saying, but it looks like they may do that in the Image Variations section.


At first I thought this was more of a 'pipeline' paper that chains together existing image-extraction models like SAM with LoRAs, but I'm impressed by the improvement in visual fidelity they get from doing intermediate end-to-end training. The union sampling is also not something I would have thought of.

Impressive! Thanks for the release.


One use: AR meetings, where a person can be extracted into another scene.


Extremely impressive. I'm confident this will become integrated in Photoshop in a few years just like Generative AI was added recently.


I don't know how someone can look at this and not conclude that it's very probable the human brain follows a very similar sequence of steps when we use our own imagination to picture, say, a certain shirt being worn by a cat.

It's either that, or else there are multiple totally-unrelated methods of achieving essentially the same outcome.


I’m fairly certain there aren’t any matrix multiplications happening in my brain right now


If it can be said that a lens can perform a Fourier transform, then it can certainly also be said that the brain can perform matrix multiplications.

Both might not do the computation using sequential IEEE 754 floating point operations, but they perform the computation nonetheless.


A lens and the Fourier transform are directly related because the underlying physical process is being modeled. Matrix multiplication and the way our brain works are not close in the same way. The result is similar enough, but there’s no reason (that I know of) to believe that the underlying mechanism in our brains is matrix multiplication


A lens performs a Fourier transform just because any process that groups light waves by spatial frequency is necessarily performing a Fourier transform. That's because a Fourier transform is defined by how it relates input and output, not by whether it happens via a piece of glass, or a pen and paper, or a FFT software routine.
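That "defined by how it relates input and output" point is easy to demonstrate. A toy sketch (illustrative only, not from the thread): a discrete Fourier transform written out as its defining sum, with no FFT library, still groups a signal's energy by frequency, which is exactly what the lens does with spatial frequencies.

```python
import cmath
import math

def dft(signal):
    """DFT written as its defining sum: X_k = (1/N) * sum_n x_n e^{-2*pi*i*k*n/N}.
    Anything realizing this input->output relation -- software, pen and
    paper, or a piece of glass -- is 'performing a Fourier transform'."""
    N = len(signal)
    return [sum(signal[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N)) / N
            for k in range(N)]

N = 64
# A signal mixing spatial frequencies 3 and 7:
signal = [math.cos(2 * math.pi * 3 * n / N)
          + 0.5 * math.cos(2 * math.pi * 7 * n / N)
          for n in range(N)]

magnitudes = [abs(c) for c in dft(signal)]
# The transform groups the energy into bins 3 and 7 (plus their mirror
# bins 57 and 61, since the input is real-valued):
peaks = sorted(k for k, m in enumerate(magnitudes) if m > 0.1)
print(peaks)  # [3, 7, 57, 61]
```

The definition doesn't care about the substrate; only the relation between input and output matters.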

So this might be just a philosophical matter, but as far as I am concerned, there is no distinction between "performing a matrix multiplication" and "performing the operation that a matrix multiplication performs", such as projection, mapping, rotation, scaling. If you can mentally picture a plane being stretched, skewed, or spun, along with all the points on that plane, then you are mentally executing (in that case) 2x2 matrix operations. But we do this so effortlessly that we don't notice it.

As a kid learning to read, you might start by recognizing individual letters and sounding out the words. But at some point in fluency, you hardly even notice the letters anymore -- recognizing them disappears into an effortless task that is executed so quickly you might not even notice it's happening. That doesn't mean that your nervous system is no longer performing some kind of best-fit-comparison between an optical image on your retina and a set of learned characters; it certainly must be, but it's just happening (basically) with hardware acceleration.

You might think you aren't doing matrix multiplication because you aren't consciously iterating through [A1B1 + A2B2 + ...] like a child sounding out vowels, but the operation is necessarily happening somewhere along the line; it's just happening at the hardware level, below your conscious perception.

I can do digital matrix multiplication with Numpy, but I can also make a circuit that does analog matrix multiplication using op-amps, where the addition and multiplication happens in voltages; I can make a mechanical device that does matrix multiplication using gears and levers, and I can certainly coax neurons to multiply matrices. It's hard to imagine a better structure to implement a matrix multiplication than the dendrites of a neuron.
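The equivalence between the "mental" picture and the matrix formalism can be checked directly. A minimal sketch (my own illustration, not from the comment): spinning and stretching a point, written as plain arithmetic, gives the same answer as composing the corresponding 2x2 matrices and applying the product.

```python
import math

def matmul(A, B):
    """Explicit 2x2 product: C[i][j] = A[i][0]*B[0][j] + A[i][1]*B[1][j]."""
    return [[A[i][0] * B[0][j] + A[i][1] * B[1][j] for j in range(2)]
            for i in range(2)]

def apply(M, point):
    """Apply a 2x2 matrix to a point in the plane."""
    x, y = point
    return (M[0][0] * x + M[0][1] * y, M[1][0] * x + M[1][1] * y)

def rotate_then_scale(point, theta, s):
    """The 'mental' route: picture the plane spun by theta, then
    stretched by s, written out as ordinary arithmetic."""
    x, y = point
    return (s * (x * math.cos(theta) - y * math.sin(theta)),
            s * (x * math.sin(theta) + y * math.cos(theta)))

theta, s = math.pi / 2, 2.0
M = matmul([[s, 0.0], [0.0, s]],                    # scaling matrix
           [[math.cos(theta), -math.sin(theta)],    # rotation matrix
            [math.sin(theta),  math.cos(theta)]])

p = (1.0, 0.0)
print(rotate_then_scale(p, theta, s))  # the mental route
print(apply(M, p))                     # the matrix route: same point
```

Both routes land the point at (0, 2) up to floating-point noise: rotating (1, 0) a quarter turn gives (0, 1), and doubling gives (0, 2).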


With cherry-picked results, as always, I presume.


It's so wild to me that the code is just... on GitHub. Like, there's something to be said for all the effort going into these technologies and how freely available it all is. A few years ago we would have thought this was almost indistinguishable from magic, and you'd be excused for trying to raise millions of dollars in VC money to turn it into a company.

Now I can just go run that on my computer.

Insane.


The most surprising thing here is a Google Research paper with code release



