Btw, I did this in pixel space for simplicity, cool animations, and compute costs. Would be really interesting to do this as an LDM (though of course you can't really do the LAB color space thing, unless you maybe train an AE specifically for that color space. )
I was really interested in how color was represented in latent space and ran some experiments with VQGAN clip. You can actually do a (not great) colorization of an image by encoding it w/ VQGAN, and using a prompt like "a colorful image of a woman".
Would be fun to experiment with if anyone wants to try; I'd love to see any results if someone builds on it.
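In case anyone wants a starting point, here's roughly what that experiment looks like as a sketch (not tested code): the CLIP calls are the real openai/CLIP API, but `load_vqgan` is a hypothetical helper standing in for a taming-transformers VQModel loaded from a config + checkpoint, and its `encode`/`decode` are assumed to map image <-> latent tensor directly.

```python
# Rough sketch of "encode the B&W image with VQGAN, then nudge the latent toward
# a CLIP prompt". Illustrative only; load_vqgan() is a hypothetical loader.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float().eval()          # keep everything fp32 for the optimization

CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD  = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def prompt_colorize(bw_rgb, vqgan, prompt="a colorful photo of a woman", steps=200, lr=0.05):
    # bw_rgb: (1, 3, H, W) greyscale image replicated to 3 channels, values in [0, 1]
    with torch.no_grad():
        text_feat = clip_model.encode_text(clip.tokenize([prompt]).to(device))
        z = vqgan.encode(bw_rgb.to(device) * 2 - 1)   # start the latent from the B&W image
    z = z.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = (vqgan.decode(z).clamp(-1, 1) + 1) / 2  # back to [0, 1]
        small = F.interpolate(img, size=224, mode="bilinear", align_corners=False)
        img_feat = clip_model.encode_image((small - CLIP_MEAN) / CLIP_STD)
        loss = -torch.cosine_similarity(img_feat, text_feat, dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return img.detach()
```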
Depends. Given the low res, the 3x64x64 pixel-space image is smaller than the latents you would get from encoding a higher-res image with models like VQGAN or the Stable Diffusion VAE at their native resolutions.
It's easier to get a sense of what's going wrong with a pixel space model though. With latent space, there's always the question of how color is represented in latent space / how entangled it is with other structure / semantics.
Starting in pixel space removed a lot of variables from the equation, but latent diffusion is the obvious next step
Took a lot of failed experiments; the model would keep converging to greyscale / sepia images. Think one of the ways I fixed it was by adding a greyscale encoder to the arch and using its output embedding as additional conditioning. Can't remember if I only added it to the Unet input or injected it during various stages of the Unet down pass.
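For what it's worth, a minimal sketch of that idea (purely illustrative, not the actual ColorDiffusion architecture): a small CNN encodes the B&W image, and its pooled embedding gets added to the UNet's timestep embedding.

```python
# Illustrative sketch of "greyscale encoder as extra conditioning" -- names and
# sizes are made up, not taken from the repo.
import torch
import torch.nn as nn

class GreyscaleEncoder(nn.Module):
    def __init__(self, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, emb_dim, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1),              # (B, emb_dim, 1, 1)
        )

    def forward(self, L):                         # L: (B, 1, H, W) greyscale input
        return self.net(L).flatten(1)             # (B, emb_dim)

# Inside the UNet, the conditioning could then be merged as
#   cond = timestep_embedding(t) + grey_encoder(L)
# and injected at each down/up block the same way the timestep embedding is.
```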
I never knew that was a thing, today I learned. I was spoiled with our first VCR having SCART already :p. And an IR remote! We could put the antenna cable into the VCR then use the remote to change channels (all three) instead of having to walk up to the TV. (this was late 80's, maybe early 90's; I wonder if we were late with things like that)
I’m not a fan of B&W colorization. Often the colors are wrong, either outright color errors (like choices for clothing or cars) or from not taking into account lighting conditions (late-in-day shadows but midday brightness).
Then there is the issue of B&W movies. Using this kind of tech might not give pleasing results as the colors used for sets and outfits were chosen to work well for film contrast and not for story accuracy. That “blue” dress might really be green. (Please, just leave B&W movies the way they are.)
I think keeping the art as it was produced is important but there is also a good history of modifying art to produce new art too. In the digital age, we aren’t losing the original art so it seems even stranger to be against modification of the “original.”
However, just applying a simple filter (or single transform without effort) definitely feels derivative to me.
I have always viewed them as "more relatable". People may well be biased to think of more relatable things as more true, but I don't think that is the fault of the colourisation or how it is presented.
Maybe you're used to looking at B&W stuff and effortlessly figuring out what the scene is depicting, but for me at least it's very hard. Adding a little color makes it much easier. In that regard, it doesn't matter to me if the colors are wrong.
(Perhaps it just takes some getting used to. Back when I read a black and white comic for the first time (as a child), I had a hard time figuring out things at first but got used to it at some point.)
I think the point being made is that movies were made for the B&W end result, not just shooting color with B&W film.
For instance, fake blood in B&W was often produced with black liquid. Colorizing it correctly just doesn't make sense. Or a green or blue dress can be chosen because of the way it looks on film, not because it's supposed to BE a green or blue dress.
Not specific to colorization, but didn’t something like this happen with the Star Wars trilogy? Lucas made a re-release with a few edits (that were not universally liked) and it’s now impossible to purchase a new copy of the original version (or something like that, I only remember hearing about it).
I don’t see why it matters if the blue dress was really green. The result is either an enhanced experience or not, if it is then minor inaccuracies don’t seem relevant.
If there's a source that a blue dress was green, then that could be taken into consideration for recoloring, but as you said, it's to enhance the experience, not to be 100% accurate.
Quite often, colorized pics and movies have people wear blue-ish clothing, which is fairly unbelievable. It's a gimmick that produces an effect that's not quite right, for a goal it isn't suited to. Because what is it that colorizations try to achieve? To make people think "Oh, so that's how it looked back then"? Then there shouldn't be errors in the image. And if it makes the pictures more relatable, or whatever handwavy arguments are being thrown around, then non-colorized pictures will become even less relatable, in effect alienating people from recent history (if you believe such arguments).
I'd like to make one exception, though, for They Shall Not Grow Old. That was impressive.
I think colorization with some effort put in can be pretty decent. E.g. I prefer the 2007 colorization of It's a Wonderful Life to the original. It's never perfect but I don't think that's a prerequisite to being better. Some will always disagree though.
Just about every completely automated colorized video tends to be pretty bad though. Particularly the YouTube "8k colorized interpolated" kind of low-effort channels where they just pump them out without caring whether it's actually any good.
Yeah it's cool tech but I really don't appreciate how it is just straight up deceitful and spreading misinformation. A lot of hues are underdetermined and the result is more or less arbitrary in a historical context. If one were to research and fine-tune the model such that ambiguous shades are historically accurate I would be less annoyed by the sense that these images are just spreading misinformation. Compare this with Sergey Prokudin-Gorsky's photos of the Russian Empire or autochromes of Paris in 1910 which are actual windows into a lost world.
*for works of fiction these issues vanish, but for any historical or documentary photographs/films, I really hate that I am being lied to.
Nope - fundamentally different. First, multispectral imaging is a mapping of information into a form that can be easily seen and interpreted. It is not the wholesale synthesis of new information. The goal is to reveal the true structure to the eye based on actual measurements of the object. Second, scientific imaging and social imaging have fundamentally different functions in terms of what they reveal about the world. The color of cloth has different meanings than the color of a distant nebula.
Edit - technically, I suppose, the way Deoldify works is by rendering the color at a low resolution and then applying the filter to a higher resolution using OpenCV. I think the same sub-sampling approach could work here...
Technically yes, the encoder and unet are convolutional and support arbitrary input sizes, but the model was trained at 64x64px because of compute limitations. You could probably resume the training from a 64x64 resolution checkpoint and train at a higher resolution.
But like most diffusion models, they don't generalize very well to resolutions outside of their training dataset
basically the training works as follows:
Take a color image in RGB. Convert it to LAB. This is an alternative color space where the first channel (L) is a greyscale lightness image, and the other two channels (A and B) carry the color information.
In a traditional pixel-space (non latent) diffusion model, you noise all the RGB channels and train a Unet to predict the noise at a given timestep.
When colorizing an image, the Unet always "knows" the black and white image (i.e. the L channel).
This implementation only adds noise to the color channels, while keeping the L channel constant.
So to train the model, you need a dataset of colored images. They would be converted to LAB, and the color channels would be noised.
You can't train on decolorized images, because the neural network needs to learn how to predict color with a black and white image as context. Without color info, the model can't learn.
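To make that concrete, here's a minimal PyTorch sketch of one training step under that description. The noise schedule and `unet` are placeholders rather than the repo's actual code, and the LAB conversion uses kornia.

```python
# Minimal sketch of a ColorDiffusion-style training step: noise only the A/B
# channels, keep L clean as conditioning, and predict the added noise.
import torch
import torch.nn.functional as F
import kornia.color as kc

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)       # cumulative noise schedule

def training_step(unet, rgb):                       # rgb: (B, 3, H, W) in [0, 1]
    lab = kc.rgb_to_lab(rgb)                        # L in [0, 100], a/b roughly [-128, 127]
    L  = lab[:, :1] / 100.0                         # normalized lightness (stays clean)
    ab = lab[:, 1:] / 128.0                         # normalized color channels

    t     = torch.randint(0, T, (rgb.shape[0],))
    noise = torch.randn_like(ab)
    a     = alpha_bar[t].view(-1, 1, 1, 1)
    ab_t  = a.sqrt() * ab + (1 - a).sqrt() * noise  # noise ONLY the color channels

    x_in = torch.cat([L, ab_t], dim=1)              # clean L acts as conditioning
    pred = unet(x_in, t)                            # predict the noise added to a/b
    return F.mse_loss(pred, noise)
```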
But since you do not have access to colour originals of historical photos in almost every instance, you cannot possibly train the network to have any instinct for the colour sensitivity of the medium, can you?
Colourising old TV footage can only result in a misrepresentation, because the underlying colours were deliberately false in order to have any kind of usable representation on the medium itself.
And this caricatured example underpins the problem with colourisation: contemporary bias is unavoidable, and can be misleading. Can you take a black and white photo of an African-American woman in the 1930s and accurately colour her skin?
Yeah, the model is racist for sure. That's a limitation of the dataset though (CelebA is not known for its diversity, but it was easy for me to work with; I trained this model on Colab).
And plausibility is a feature, not a bug.
There are always many plausibly correct colorizations of an image, which you want the model to be able to capture in order to be versatile.
Many colorization models introduce additional losses (such as discriminator losses) that avoid constraining the model to a single "correct answer" when the solution space is actually considerably larger.
No more so than any other colorization method that isn’t dependent on out-of-band info about the particular image (and even that is just more constrained informed guesswork.)
That's what happens when you are filling in missing info that isn't in your source.
EDIT:
Of course, color photography can be “bullshit” rather than accurate in relation to the actual colors of things in the image, as is the case with the red, blue, and green (the actual colors of the physical items) uniforms in Star Trek: The Original Series. But there are also, fairly frequently, not-intentionally-distortive misreproductions of skin tones (often most politically sensitive in the US with racially non-White subjects, where there are also plenty of examples of deliberate manipulation).
Showing color X on TV by actually making the thing color Y in the studio (well, on film) isn't bullshit. It's an intentional choice playing out as intended. It is meant to communicate a particular thing and does so.
That particular thing was not intentional, and it's the reason why the command wrap uniform (same color in person, different material), which is supposed to be color-matched to the made-as-green uniforms, doesn't match them on screen.
But, yes, in general inaccurate color reproduction can be intentionally manipulated with planning to intentionally create appearances in photos that do not exist in reality.
Why are you so negative about it? Pretty sure many people would find it impressive to colorize old photos to look at them as if these were taken in color.
Should artists not put their bs in the world? Writers? Musicians? Most of it is made up but plausible to make you feel something subjective.
This is true, but if you have some reference images, you can probably adapt some of the recent diffusion adaptation work such as DreamBooth to tell the model "hey, this period looked like this", and finetune it.
>You can't train on decolorized images, because the neural network needs to learn how to predict color with a black and white image as context. Without color info, the model can't learn.
I think the parent means using decolorized images to test the success and guide the training (since they can readily be compared with the color images they were derived from, which would be the perfect result).
Not using decolorized images alone to train for coloring (which doesn't even make sense).
Is there a reason for using LAB as opposed to YCbCr? My understanding is that YCbCr is another model that separates luma (Y) from chroma (Cb and Cr), but JPEG uses YCbCr natively, so I wonder if there would be any advantage in using that instead of LAB?
The Y in YCbCr is linear, and is just a grayscale image. The L channel in LAB is non-linear (as are A and B), and is a complex transfer function designed to mimic the response of the human eye.
A YCbCr colorspace is directly mapped from RGB, and thus is limited to that gamut.
LAB can encode colors brighter than diffuse white (a la #ffffff), like an outdoor scene in direct sunlight.
Sorta HDR (LAB) vs non-HDR (YCbCr).
This image (https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Ex...) is a good demo: left side was processed in LAB, right in YCbCr. Even reduced back down to a JPEG, the left side is obviously more lifelike, since the highlights and tones were preserved until much later in the processing pipeline.
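If anyone wants to eyeball the two decompositions themselves, both are one call away in something like kornia (this is just for inspection, nothing model-specific):

```python
# Split an RGB image into LAB and YCbCr components for side-by-side comparison.
import torch
import kornia.color as kc

rgb = torch.rand(1, 3, 256, 256)        # stand-in for a loaded image in [0, 1]

lab   = kc.rgb_to_lab(rgb)              # L in [0, 100], a/b roughly [-128, 127]
ycbcr = kc.rgb_to_ycbcr(rgb)            # all three channels in [0, 1]

L, a, b   = lab.unbind(dim=1)           # lightness vs. two opponent-color channels
Y, Cb, Cr = ycbcr.unbind(dim=1)         # luma vs. two chroma-difference channels
```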
The description included with that image conflicts with your account:
> An example of color enhancement using LAB colorspace in Photoshop (CIELAB D50). Left side is enhanced, right side is not. Enhancement is "overdone" to show the effect better.
And per the original upload, the “enhancement” demonstrated is linear compression of the a* and b* channels.
Black and white film doesn't have one single colour sensitivity. Play around with something like DxO FilmPack sometime (it has excellent measurement-based representations of black and white film stocks).
It's a much more complex problem than it might seem on the surface.
The other day I was working on a mono photo to prove a point: that a model (a photographic artist's model!) with very striking pink hair was of little concern to a photographer who worked in black and white only, and might actually present some opportunities for choosing tonal separation that are not present in those with non-tinted hair.
In different circumstances (film and filter) her hair could appear (in black and white) to the viewer as if it was likely brunette or likely blonde, before any local (as opposed to image wide) adjustments were made.
The question you are asking, I think, is could you get the hair colour right based on the impact of those same circumstances on other known objects in the scene.
I think the answer is no, in the main, generally because those objects likely don't survive to make colour comparisons from (and there are known cases where the colourisation of a building has been completely wrong because it had simply been repainted). And also because it's sometimes not even obvious what a structure actually is, without its colour. People who colourise by hand make this mistake too.
But I concede that given that we have to work with contemporary images to have a colour source, randomising the tone curve is the only thing that could work.
Is that challenging? Humans have awful color resolution perception, so even if you have a huge black-and-white image, people would think it looks right even with very low-resolution color information. Or, if the AI hallucinates a lot of high-frequency color noise, it wouldn't be noticeable.
I see what you mean. I think that you can happily scale the B&W image down, run the model, and then scale the chroma information back up.
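Something like this, as a sketch: assume the colorizer outputs LAB at its native 64x64 resolution (`colorize_lab` is a stand-in for whatever model you use), and only the a/b channels get upscaled back to the original size.

```python
# "Colorize small, upscale only the chroma": keep the full-res lightness, take
# the a/b channels from a low-res colorization, and recombine.
import torch
import torch.nn.functional as F
import kornia.color as kc

def colorize_high_res(bw_rgb, colorize_lab, model_res=64):
    # bw_rgb: (1, 3, H, W) B&W image in [0, 1] at full resolution
    lab_full = kc.rgb_to_lab(bw_rgb)                         # keep the full-res L channel
    small = F.interpolate(bw_rgb, size=model_res, mode="bilinear", align_corners=False)
    lab_small = colorize_lab(small)                          # (1, 3, model_res, model_res) in LAB
    ab_up = F.interpolate(lab_small[:, 1:], size=bw_rgb.shape[-2:],
                          mode="bilinear", align_corners=False)
    lab_out = torch.cat([lab_full[:, :1], ab_up], dim=1)     # full-res L + upscaled chroma
    return kc.lab_to_rgb(lab_out)
```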
Something I was thinking about after writing the comment is that the model is probably trained on chroma-subsampled images. Digital cameras do it with the Bayer filter, and video cameras add 4:2:0 or similar subsampling as they compress the image. So the AI is probably biased towards "look like this photo was taken with a digital camera" versus "actually reconstruct the colors of the image". What effect this actually has, I don't know!
good point, I hadn’t realized that you only need to predict chroma! That actually greatly simplifies things
re. chroma subsampling in training data: this is actually a big problem and a good generative model will absolutely learn to predict chroma subsampled values (or JPEG artifacts even!). you can get around it by applying random downscaling with antialiasing during training.
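a sketch of that kind of augmentation, if it helps (the scale range is an arbitrary choice):

```python
# Random downscale-with-antialiasing, then resize back to the training resolution,
# to wash out chroma-subsampling / JPEG artifacts in the training images.
import random
import torch.nn.functional as F

def random_rescale(img, train_res=64, min_scale=0.25):
    # img: (B, 3, H, W) clean RGB training image in [0, 1]
    scale = random.uniform(min_scale, 1.0)
    small = F.interpolate(img, scale_factor=scale, mode="bilinear",
                          align_corners=False, antialias=True)
    return F.interpolate(small, size=train_res, mode="bilinear", align_corners=False)
```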
yeah, you can use SOTA super res, but that tends to be generative too (even diffusion based on its own, or more commonly based on GANs). it can be a challenge to synthesize the right high res details.
but that’s basically the stable diffusion paper (diffusion in latent space plus GAN superres)
Yeah, if you have a high res image, you can get color info at super low-res and then regenerate the colors at high res with another model. (though this isn't an efficient approach at all)
Is there anything that exists right now with diffusion models to improve poor VHS coloring? The coloring does exist so I would not want to replace a red shirt by a blue shirt for example but it's just not very accurate.
I think the bigger question is whether it would be stable enough. Many SD-like models struggle with consistency across multiple images (i.e. frames) even when content doesn't change much. Would be a cool problem to see tackled.
temporal coherence is def an issue with these types of models, though I haven't tested it out with ColorDiffusion. Assuming you're not doing anything autoregressive (from frame to frame) to do temporal coherence, you can also parallelize the colorization of each frame, which would affect cost.
Tbh most cost effective would be a conditional GAN though
24 frames per second * 60 seconds per minute * 90 minute movie length = 129600 frames
If you could get the cost to 10 cents per frame, that's about $13k. But I'd bet you could easily get it an order of magnitude less in terms of cost. So $1,300 or so?
And that's assuming you do 100% of frames and don't have any clever tricks there.
Seems like a pretty reasonable estimate: if it costs about $2 an hour to rent a decent GPU, that's 18s per penny, which sounds pretty doable to run one frame.
Think inference time was on the order of 4-5 seconds per image on a V100, which you can rent for like $0.80 an hour, though you can get way better GPUs like A100s for ~$1.10/h now. But ofc this is at 64px res in pixel space.
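For a rough sanity check with those numbers (all ballpark assumptions, not benchmarks):

```python
# Back-of-the-envelope cost for a 90-minute film at the numbers quoted above.
frames         = 24 * 60 * 90      # 129,600 frames
secs_per_frame = 4.5               # ~4-5 s per 64x64 frame on a V100
usd_per_hour   = 0.80              # rough V100 rental price

gpu_hours = frames * secs_per_frame / 3600
print(f"{gpu_hours:.0f} GPU-hours, ~${gpu_hours * usd_per_hour:.0f}")  # ~162 GPU-hours, ~$130
```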
If you wanted to do this at high res, you would definitely use a latent diffusion model. The autoencoder is almost free to run, and it reduces the dimensionality of high-res images significantly, which makes it a lot cheaper to run the diffusion model for its many denoising steps.
This is a cool party trick, but I don't see a need for this in any real applications. Black and white is its own art form, and a lot of really great black and white images would look like absolute garbage if you could convert them to color. This is because the things that make a great black and white image (dramatic contrasts, emphasis on shape/geometry, texture, etc) can lose a lot of their impact when you introduce color. Our aesthetic tolerance for contrast seems significantly reduced in color because our expectations for the image are more anchored in how things look in the real world. And colors which can be very pleasing in some images are just distracting in others.
So all this is to say.... I don't think there would be commercial demand to, say, "upgrade" classic movies with color. Those films were shot by cinematographers who were steeped in the black & white medium and made lighting and compositional choices that take greatest advantage of those creative limitations.
I've run colorization like this against historic photographs and it had a very real impact on me - I found myself able to imagine life when the photo or video was taken much more easily when it was no longer in black and white.
The issue I have is that the examples appear to be color images that were converted to black and white. In other words, they are modern images with plenty of images acquired in color for training. Converting an archival shot from early 20th century is totally different. Totally lowers credibility in my eyes
Counterexample: They Shall Not Grow Old, a WW1 documentary film with mostly colorized footage with recreated audio. The film was commercially successful and I found it to be a great watch.
> I don't think there would be commercial demand to, say, "upgrade" classic movies with color.
There was, and maybe there will be again once we get far enough from the consumer burnout from the absolute deluge of that in, mostly, the 1980s-1990s.
I really, really want something like this to actually reduce noise in images. Current raw photo denoising is often little more than a tuned-up Gaussian blur. Making it guess color based on larger context and external info would be a step up.
Extra happy if it were possible to tune the denoising using photos from the same series. Multiframe NLMeans right now is slow and mostly theoretical.
Some of the old Doctor Who stories that were filmed in colour only survive as black and white copies. The colourisations have been ... very good, better than I would have thought, but not perfect. Could be a good application.
Colourising old photographs is the banal apotheosis application of diffusion AI.
It's the pinnacle of the whole thing: "imagine it for me in a way that conforms to my contemporary expectations".
If you're going to colourise images, have the decency to do it by hand. If possible on a print with brushes.
Edit: didn't think this would be popular. Maybe it's the historical photography nerd in me, but colourising images without effort and thought is like smashing vintage glass windows for the fun of it: cultural vandalism.
> But since you do not have access to colour originals of historical photos in almost every instance, you cannot possibly train the network to have any instinct for the colour sensitivity of the medium, can you?
Plenty of people say that about colorization period, which, while I disagree, seems more sensible than your position to me, which just seems to be fetishizing suffering.
The point I am making is that colourisation is subjective art, and that alone.
Colourisation cannot fail to enforce contemporary biases based on poor understanding of the materials. It will darken or lighten skin inappropriately, and mislead in any number of ways.
Doing it by hand (in photoshop or on a print) acknowledges the inherent bias that is involved in colourisation.
Automating it is banal at best and dangerous at worst; colourised images risk distorting history.
> Colourisation cannot fail to enforce contemporary biases based on poor understanding of the materials. It will darken or lighten skin inappropriately, and mislead in any number of ways.
If anything, an AI trained on a large and diverse dataset is probably going to wind up being much more accurate with regards to skin color than a human colorist would be in most cases.
The problem here isn't whether colorization is done by man or machine; it's just ensuring that colorized photos are identified as such. Which they usually are -- that's not a new problem to be solved.
A diverse data set of black and white images doesn't have any kind of knowledge of the colour sensitivity of the medium in that moment.
What film was it? How was it processed? Is it a scan of a negative or a print? What was the colour of the lighting? Was a particular colour tint filter used on the lens? Was the subject wearing makeup optimised for black and white photography?
The black and white image, standing alone, cannot tell you this, I think. Sure, it might get a bit better at, say, identifying a 1950s TV show. But what is the "correct" accurate colour representation of that scene, when televisual makeup was wildly unnatural in colour?
But do people have any of that knowledge either? Most of the time, I don't think so -- they colorize stuff in a way that just "looks right" or "looks natural" or "looks nice" to their eye, that's all.
And the dataset an AI is going to train on should be using original color photos that are then converted to B&W across a wide variety of color curves. So it should be fairly robust to all sorts of film types. So again, I repeat that it's probably going to wind up being more accurate with regard to skin tone than a human (with their aesthetic biases) usually would.
> But do people have any of that knowledge either? Most of the time, I don't think so -- they colorize stuff in a way that just "looks right" or "looks natural" or "looks nice" to their eye, that's all.
No, indeed. Which is why doing it by hand is more respectful of the notion that it is subjective.
Automatic colourisation is and will be viewed differently, as more "scientific", when it's still absolutely beholden to the same biases and maybe misconceptions that we can't unpick because they come from poor training data.
Finally: "original colour photos" are also a problem. Not only for the part of the history where they don't exist. But also for the part of history (until the early 1960s) when the colour rendition of those photos was false or incomplete. You can get a little closer to understanding what that colour looked like, but it's important to understand that colour emulsions vary in the way they work: it's not black and white film with extra colour sensitivity.
So at best you will be colourising the black and white film to look like the colour film, which is not reality. And there are well-understood problems with correct representation of skin tones with colour film until the mid-eighties.
I can see your point; I just think there's a bigger picture here (pun not intended) that you're not seeing.
> Automatic colourisation is and will be viewed differently, as more "scientific"
Then the solution is to correct that misperception, not deny ourselves a useful tool.
> I can see your point; I just think there's a bigger picture here (pun not intended) that you're not seeing.
My overarching point is that this is a tool like any other. And the idea that "doing it by hand is more respectful of the notion that it is subjective" I will push back on 100%.
There is nothing disrespectful about colorizing a photo, automatically or by hand. But it should always be clearly communicated that it is subjective not objective, whether human or machine.
Again, if someone believes the colorization is somehow "real" or "scientific" because a computer did it, then correct their misbelief. Don't stop using the tool. That's the bigger picture here.
>Automating it is banal at best and dangerous at worst; colourised images risk distorting history
Well, faces still have a certain tint, the sky is mostly blue, the grass green, water is blue, mud pools are brown, the ground too, a lot of historical fabrics are certain inherent colors, known flowers have known colors, brownstones have a red/brown color. A lot of it is just not that subjective.
Besides, different color film stocks (or camera sensor "color science") can already result in dozens of widely different colorings of the exact same scene.
You cannot accurately colourise skin from photographic film without an _enormous_ amount of knowledge of the taking and processing of the film, and of the lighting and subject.
An AI can't do it any better than a painter. You can't take a scan of a print or a negative and get skin tones right.
Think about how weird the skin tones are from scans of wet-plate photography plates compared to the same process used in antiquity with the aim of producing a carbon print.
Yes. There's just not a single one across all faces - but I wasn't meaning that.
What I mean is, we know the kind of tints a face will have. A face is not suddenly going to be blue or green or poppy red. And by how light a black and white face appears, we can tell quite well if it's a darker one (oilish to brown) or lighter (pinkish towards more pale).
If we get it wrong within a range it's no big deal. Color film stocks would also vary it widely.
Hell, even actual people who met the person we colourise in real life will each remember (or even experience in real time) their face's hue somewhat differently.
Black and white films of different technologies and manufacturers and eras actually lighten or darken skin tones. Really very significantly.
And it's not going to be obvious from the final positive, unless there's _extensive_ data with those images about how the photography was done. And there never is.
Editing because I can no longer reply: the question of whether a skin tone is a dark one or a light one has had severe real life impacts on people whose lives are now only represented in photographs. You can't write this off as micromanagement; it's about the ethics of representation.
>But how brown? How pink? How light? How dark? This is an enormously important issue
Is it?
If 2 colour film stocks took the same image of them, it would show their hue a little (or a lot) different.
Even if two different people actually met the same person, they will probably describe their face as slightly different tones from memory. (And let's not even get into different types of color-blindness they could have had).
Hell, a person's hue will even look different to the same person looking at them, in real time, depending on the changes in lighting and the shade at the scene as they talk (e.g. sun behind clouds vs directly sun vs shade vs bulbs).
It's not really "enormously important" to micromanage the (non-existent) exact right brown or right pink.
I can now reply so I will say what I added in an edit: the question of whether a skin tone is a dark one or a light one has in the past had severe real life impacts on people whose lives are now only represented in photographs.
You shouldn’t write this off as micromanagement; it's about the ethics of representation. It is better to leave the original image uncoloured than to colour it automatically based on some fundamentally ill-informed model.
Hand colouring that image based on individual knowledge (for example that someone could or could not pass as white) is ethically better, if colourised images are needed.
Important nuances of culture and history, important and complex stories of discrimination and survival, are damaged by automatic colourisation by models that have no knowledge of the source of the mono image they are colourising.
I also remember reading articles about film stocks on the same subject. But in light of the actual tangible harm caused to Black people, I think worries about the exact shade in photos are basically overcompensating for things that should have been (or still should be) improved outside the realm of film stock / photo retouching...
Which makes it mostly American baggage. Other places that didn't have that history don't have much of an issue with whether a person is shown in this or that exact shade in a photo, as it doesn't change anything, the same way making a white guy a little pinker doesn't change anything.
> Doing it by hand (in photoshop or on a print) acknowledges the inherent bias that is involved in colourisation.
No, doing it by hand doesn't acknowledge that your interpretation is a fallible interpretation shaped by bias, just like translating a written work (e.g., the Bible, for a noted example where this has been done often without any such acknowledgement being conveyed) by human effort doesn’t do that.
Acknowledging bias in translation of either kind is an entirely separate action, orthogonal to the method of the translation itself.
> Automating it is banal at best and dangerous at worst; colourised images risk distorting history.
There's a lot of irony in acknowledging this but not acknowledging that each and every one of us has their own biases inherent to our perception and experiences.
Like the blue and white dress; we all perceive things differently even on identical images, monitors, screens, etc.
My post you are replying to contains the sentence "Doing it by hand (in photoshop or on a print) acknowledges the inherent bias that is involved in colourisation".
So because I didn't list out my argument with completeness, my argument is defective and thus it is laden with "huge" irony.
You're imagining this irony to suit your personal requirement that I am wrong or foolish.
But if you look at what I am talking about elsewhere in my comments on this topic, it is human bias that concerns me.
One of the things automated colourisation cannot get right is historical depictions of human skin. In a way that really matters.
Human biases will creep into automatic colourisation because they can't _not_ creep in: the test data cannot fully describe the subject matter so contemporary bias will take over.
One of the areas where this really matters is historical depictions of Black people. Automatically colourising a black and white photo of a Black person's skin will almost certainly get their skin tone wrong in a way that might well very significantly misrepresent their history.
The same is true in mixed cultures all around the world; colourism is as much an issue as racism.
People had different lives based on how their skin tone was perceived. Automatic colourisation will not (cannot!) automatically produce a colour image that fits that experience. Because dark skin can appear light (and light appear dark) depending on complexities of reproduction.
This stuff matters. Hence my position: if you wish to ethically colourise an image, first consider not doing it at all. Second, consider doing it by hand based on real knowledge of the subject (their lived history etc.).
I am fully cognisant of the bias issues here (well, as fully cognisant as a white amateur student of photographic history can be)
Fair enough. Honestly this was just a fun side project. I actually coded this up last October when I was doing a deep dive to learn about diffusion models, and saw that no one had ever applied them to colorization. This was just a fun opportunity to build a project that no one had done before.
The point is that the source black and white image is not truthful about skin colour. The film locks in a level of lightness, but that lightness may be very wrong (depending on the red and blue sensitivity of the film, the colour of the light, the time of day, the print, whether a filter was being used).
So if you colourise an image of someone who appears to be a light-skinned 1930s African-American with colours that appear to conform to our contemporary understanding of light-skinned Black people of our era, you might be getting it right, of course.
But you might be getting it quite, quite wrong, in a way that matters.
So we're using a color space that has two channels dedicated entirely to color, which is the only thing the model needs to learn.
The model doesn't need to touch the lightness channel at all, only predict the noise added to the color channels at train time.
At inference time, we start with a real lightness channel (b/w image), and initialize the color channels to random noise. The model iteratively denoises the color channels while keeping the lightness channel locked.
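As a minimal sketch of that loop (a plain DDPM update; `unet`, `betas`, and `alpha_bar` are placeholders, and the actual repo may use a different sampler):

```python
# Sampling sketch: the L channel stays fixed, the a/b channels start as noise
# and are iteratively denoised.
import torch

@torch.no_grad()
def colorize(unet, L, betas, alpha_bar, T=1000):
    # L: (B, 1, H, W) normalized lightness from the B&W input; it is never noised.
    ab = torch.randn(L.shape[0], 2, *L.shape[2:], device=L.device)  # color channels start as pure noise
    alphas = 1.0 - betas
    for t in reversed(range(T)):
        t_b = torch.full((L.shape[0],), t, dtype=torch.long, device=L.device)
        eps = unet(torch.cat([L, ab], dim=1), t_b)        # predict noise on a/b, conditioned on clean L
        ab  = (ab - (1 - alphas[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            ab = ab + betas[t].sqrt() * torch.randn_like(ab)
    return torch.cat([L, ab], dim=1)                      # LAB result; convert back to RGB to view
```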
I think a lot of it depends on what you are doing and why.
Yes, recolors can be inaccurate but they can make historical moments feel more alive and connected. At the same time one can imagine the issues of a recolor that is inaccurate and that is troubling with historical photographs.
At the same time I have a bunch of old family photos I'd love to recolorize. Maybe the colors won't be quite right but that's an OK failure mode for family photos!
I'd love to see a version where you can drop just a spot or two of the correct color and let the AI fill it out. My grandmother had stark red hair but most algorithms will color her as a blond. It'd be nice to fix that, using one of the color photos we do have.
Colourised images absolutely replace mono images in image searches, unfortunately; I've seen this again and again. It gets more difficult to find originals.
But also you have to consider that bias is being introduced in the colour rendition. That causes damage.
For example, you could see a photograph of an African American woman in the 20s or 30s, and your AI would say, this is an African American woman and colour her skin in some way.
But a lighter-skinned-looking African American woman in a pre/early-post-war photo is a challenge. She may have had darker skin -- been unable to "pass" -- and the film simply didn't get that across because of its colour sensitivity.
Or she may actually have been light-skinned and able to "pass" (or wearing makeup that helped).
Automatically colouring that image introduces risks to the reading of history; you can read that woman's entire life completely wrong.
It's also common with photos of men from that era who worked outdoors. Many of them will come across much darker-skinned in photos than they actually would have appeared in real life, because not-readily-visible sun damage can look odd in mono. But if you colourise all those sun-baked people the same way, what happens to those of mixed heritage among them? (A thing that is already rather "airbrushed out" of history.)
Without knowing about the lighting, the material, the processing and the source of the positive (is it a negative scan? was it a good one? or is it a scan of a print?) you cannot make accurate impressions of skin tone.
And given the power and importance of photography in the history of the USA in particular -- photography coincides with and actually helps define the modern unified US self-image -- this is not something to blaze through without care.
This is a far less tricky problem in more homogeneous societies, obviously. But even then, there is this perception from photographs that British women in the 1920s were all deathly pale; colourisation preserves that illusion that actually comes in part from photographic style.
There is a community of people who carefully recolor historical photos by hand. It's really beautiful time consuming work and often they invest heavily to get the colors to be correct.
That effort is obviously still not going to be fully accurate.
But it reflects the fact that an accurate colourisation of a black and white image without access to every possible detail about the scene and processing from the photographer's perspective is impossible.
Black and white film is substantially more complex and varied than people understand. Its sensitivities are complex and vary from processing run to processing run, and people at the time knew of the weaknesses of black and white and often used false colour to get an acceptable rendition.
Colourisation is a form of expression, not a form of recovery.
>But it reflects the fact that an accurate colourisation of a black and white image without access to every possible detail about the scene and processing from the photographer's perspective is impossible.
Accurate colourisation is impossible even in a color photograph. There is no "canonical" film stock that accurately represents all actual real-life colors.
The expectation from colourisation is not an accurate representation of the original colors, but a good application of color based on our knowledge (whether from historical facts a human colorist knows or from training with similar objects and materials a NN did) that matches a realistic representation of the scene.
If a human colourist draws a dress and doesn't know the color of it, nor have they any historical information about what the person depicted wore that day, they're going to take a guess. That's kind of what the NN will do as well.
Making music without actually knowing anything about it is the banal apotheosis application of Generative AI. - Music nerd in me
Creating art without actually knowing anything about it is the banal apotheosis application of Diffusion AI. - Artist in me
Using ChatGPT to write essays that are better than anyone could have ever written is the banal apotheosis application of LLMs - Teacher in me
It is already here. Better to use it, appreciate it, and try to understand how it works than to complain about it doing a better job.
In this instance, for example, the model can be made to generate multiple outputs or even better, generate output based on precise user input.
I work at krea.ai. We are for sure making art extremely accessible, but we consider it enhancing creativity rather than replacing it.
I fully agree that being able to generate an aesthetically pleasing image with an AI that has been optimized to do exactly that is a banal application of creativity.
I do think that AI has incredible potential to make (and become art).
The best AI artists don't just throw art into midjourney, they experiment, create their own secret sauce.
Training models has become an art form in and of itself: AI artists curate incredible datasets and devise recipes for training stunning models. Their workflows span multiple companies / tools / models.
AI just means that the goalposts for creativity are shifting. Boring people will use AI to make boring art, artists will find completely unexpected ways to use the tools we build to create art forms we've never imagined before.