I'm sure it is, but "gluing things together" coherently in response to a text prompt is a stupendous achievement. It's not AGI, but it's miles ahead of where we were even a few years ago and opens the door to automating a class of jobs I don't think anyone back then believed could be automated, short of AGI.
I ended up reading the book Blindsight (Peter Watts) that's been floating around in comments recently. A major theme in the book is intelligence and its relation to consciousness (including whether consciousness is even beneficial). If you agree with the idea, you'd consider that DALL-E is indeed intelligent even though it appears to be a "Chinese Room". Humans would be "gluing things together" in just the same way, but with this odd introspective ability that makes it seem different.
I'm becoming convinced that these algorithms are huge steps towards AGI, simply because AGI might end up being a collection of many of these domain-specific networks, with a network sitting above them whose only role is to interrogate the sub-networks for solutions to the problem at hand, discriminate which solution(s) are most worth trying, simulate those, and then pick one to execute in the real world. That seems to me to be pretty close to what we as humans do.
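In rough Python, the loop I have in mind looks something like this (all the names are made up; it's just a sketch of the shape of the idea, not any real API):

```python
from dataclasses import dataclass
from typing import Callable, List

# Toy sketch of a "controller" sitting above domain-specific networks.
# The experts here are plain functions standing in for separately trained models.

@dataclass
class Controller:
    experts: List[Callable[[str], str]]    # domain-specific solvers (vision, language, physics, ...)
    score: Callable[[str], float]          # discriminator: how promising does a candidate look?
    simulate: Callable[[str, str], float]  # predicted outcome of trying a candidate on the problem

    def solve(self, problem: str) -> str:
        # 1. Interrogate each sub-network for a candidate solution.
        candidates = [expert(problem) for expert in self.experts]
        # 2. Discriminate: keep the most promising candidates.
        ranked = sorted(candidates, key=self.score, reverse=True)[:3]
        # 3. Simulate those and execute the one with the best predicted outcome.
        return max(ranked, key=lambda c: self.simulate(problem, c))

# Trivial stand-ins, just to show how the pieces plug together:
ctrl = Controller(
    experts=[lambda p: p.upper(), lambda p: " ".join(reversed(p.split()))],
    score=len,
    simulate=lambda p, c: float(sum(ch in p for ch in c)),
)
print(ctrl.solve("stack the red block on the blue block"))
```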
You just need a neural network that knows how to interrogate other neural networks, and a neural network that knows how to find new datasets and train new neural networks, and you potentially have something very scary, right?
Maybe a third neural network that knows how to purchase cloud compute?
Cognitive scientists have been exploring this idea for a while, e.g. the Modularity of Mind Hypothesis. It would be very interesting to see if learnings from this field can be further applied to AI development.
The problem you run into then is: how does it adapt to new domains? Sure, large models display some degree of generalization, but new tasks will certainly ask for things the network never saw in training.
That's valid. However, I think the modular approach would allow for a lot more cross-domain flexibility, because I suspect that a large subset of disparate tasks actually decompose into discrete subtasks that are useful across domains.
Adding to that, if you can compose multiple sub networks together then you've really got something. You can build a lot of different buildings from bricks without needing to invent new kinds of brick basically.
For instance, think about the large number of domains that robust computer vision would be useful in. Then think about the fact that if the computer understands the 3D space around it, it can hand that model off to a network that does predictive physics simulation. Now you've got something that would be useful across an extremely wide range of domains.
The modular approach doesn't need to be enforced by strict boundaries. One network that starts with various components but can be remodeled and even merged together as it goes could adapt.
What we're discussing here is merely a starting state of something that'd have to rebuild itself.
Or maybe evolving modalities within one network is easier than putting it together.
Exactly (and cannot recommend Blindsight highly enough). Of course DALL-E is a Chinese Room. The most exciting and subversive idea in Blindsight is that consciousness is maladaptive, and the scramblers are _more_ fit for the lack of it. Long after neural nets have surpassed our functional intelligence, we'll still be criticizing their ability to navel-gaze as well as humans.
Of course, DALL-E can only get away with not being introspective because it has reinforcement learning holding its hand. Bit harder to curate a clean training set without consciousness.
Personally, I suspect consciousness is intrinsically connected with whatever lets humans learn things by looking at and thinking about a single example.
I get what you're saying, but it's important to also consider how evolution provides for us as well by giving us neural structures already tuned for things like recognizing faces and recognizing danger. I do wonder how much realtime danger analysis has to do with consciousness.
Hmm. I'd say subconscious might get you away from immediate danger, but it's often a conscious decision to anticipate future danger based on past events, and this feedback loop is sometimes quite small, such as deciding which path to take while being pursued.
>Giant robot pilot looking smug.
>Caption: "I am a genius!"
Picture break:
>Same robot pilot now looking horrified.
>Caption: "Oh no!"
I can tell you right frigging now: the average person is trained through education to build up confidence that they know what they're doing, then over the next decade-plus gets face-checked against reality until they've built an aversion/paranoia matrix.
The modern world is essentially orchestrated by people who build self-sustaining support structures of specialists. We've done a great job at building co-dependent specialists, but a terrible one at building generalists.
Reinforcement learning is the thing that integrates the network across time though. In one sense, I think it's doing at least part of the job that consciousness does: mediating between memory (past) and the present.
Yes, it's just that because it's an external process, DALL-E can't go meta on it like we can. So you won't ever get "appearance of consciousness" out of a process that depends on RL, imo, because you've located part of its cognition outside its brain and made it intermittent.
Consciousness is very much adaptive, though it requires a sense of the metaphysical / transcendent to behave in that way. Theologically speaking, God brings Life and the Devil brings Death. Sadly, our modern brave new world has taken "Gott ist Tot" as a triumphalist proclamation, and not the lamentation it was meant to be.
It's far too early to tell if consciousness is adaptive. Has it enabled humans to be wildly successful? In the short term, sure! But the long term is unclear.
E.g., look at how the technological progress it makes is at risk of destroying human civilization. A series of ecological boom/bust cycles could lead toward negative selection for consciousness. Or we might manage to wipe out ourselves or, say, most mammals, leading life on earth to start over as it did 65m years ago.
But even without that, it's not clear to me that consciousness will really win out. Look at the number of successful people who are not only painfully unreflective, but need to be to keep doing what they're doing. I could name a lot of people, but today's good example is Alex Jones, whose whole (very profitable) schtick is based on refusing to be fully conscious of what he's saying: https://popehat.substack.com/p/alex-jones-at-the-tower-of-ba...
And this is hardly a new idea. Vonnegut wrote a novel where humans end up evolving into something like a sea lion. The point being "all the sorrows of humankind were caused by 'the only true villain in my story: the oversized human brain'", an error evolution ends up remedying.
Evolutionarily speaking, cetaceans "share" the front limbs with quadrupeds. And yet there is a qualitatively distinct functional difference. Consider that moral consciousness, as present in humans, is functionally not quite the same as its biological precursor, the moral sense present in dogs or gorillas. And, of course, there are gradual changes along the evolutionary way.
Edit: "Organ", more precise "sensory organ", as in "the visual organ". Perhaps there is a better word here than "organ" here, before we get lost in the medical distinction between eye / retina / optic nerve / cortex / etc.
Sure. Nothing is the same. Every animal is slightly different. As humans trying to understand the world with 3 pounds of meat, we abstract and group so we have a chance of dealing with the overwhelming complexity.
But I stand by my point. A moral sense is something that provably exists in a lot of social animals. Is it different in each species? Yes. Is it different in each individual? Surely. But unless you're going to claim that all of those animals are conscious, you can't say that consciousness is foundational to a moral sense. Indeed, I think the much easier argument is that a moral sense is foundational to evolving consciousness.
Why should we reduce human morality to what can be "provably" found in other social animals? There are some rudimentary similarities, but those similarities do not circumscribe the entirety of human moral experience, not by a large stretch.
But Sontag suggests the use of Batailleist 'powerful communication' to modify and read consciousness. The subject is contextualised into a predialectic nationalism that includes language as a totality.
Edit: I don't really want to be unnecessarily flippant here. Perhaps this is a real person attempting to communicate. Sadly, I was unable to follow it; the text jumps wildly between seemingly unrelated subjects.
The "Chinese room" argument, so far as I understand it, applies to any discreet computation process simulating consciousness.
The argument of the article is that DALL-E doesn't respond appropriately to a particular kind of input - two entities in some kind of spatial relationship (that it hasn't often seen). DALL-E isn't extrapolating the 3-D world but stretching a bunch of 2-D images together with some heuristics. That works to create a lot of plausible images, sure, but it implies this ability might not, say, be useful for manipulating 3-D space.
So, given a "Chinese room" is just a computation, it's plausible that some Chinese room could handle 3-d image manipulation more effectively than this particular program.
Which is to say, "no, the criticism isn't this is a Chinese room, that is irrelevant".
The Chinese room argument itself isn't very compelling. Surely the constituent parts of the brain are fundamentally governed solely by physics, surely thought arises solely from the physical brain, and surely the constituent parts (and thus thought) could be described by a sufficiently complex discrete computation.
The argument you make here is a reasonable one (IMHO) for the plausibility in principle of what Searle calls “strong AI”, but he claims that his “Chinese Room” argument proves that it must be mistaken. One can simply ignore him, but to refute him takes a little more effort.
It turns out that when one looks at the argument in detail, and in particular at Searle’s responses to various objections (such as the Systems and Virtual Mind replies), it is clear that he is essentially begging the question, and his ultimate argument, “a model is not the thing modeled”, is a non-sequitur.
The argument is essentially that there are no qualia of Chinese comprehension in an automaton or in any system that uses an equivalent algorithm, whether or not run by a human.
It's a sound argument to the extent that qualia clearly exist, but no one has any idea what they are, and even less of an idea how to (dis)prove that they exist in external entities.
It's the materialists who are begging the question, because their approach to qualia is "Well obviously qualia are something that just happens and so what?"
Unfortunately arguments based on "Well obviously..." have a habit of being embarrassingly unscientific.
And besides - written language skills are a poor indicator of human sentience. Human sentience relies at least as much on empathy; emotional reading of body language, expression, and linguistic subtexts; shared introspection; awareness of social relationships and behavioural codes; contextual cues from the physical and social environment which define and illuminate relationships; and all kinds of other skills which humans perform effortlessly and machines... don't.
Turing Tests and game AI are fundamentally a nerd's view of human intelligence and interaction. They're so impoverished they're not remotely plausible.
So as long as DALL-E has no obvious qualia, it cannot be described as sentient. It has no introspection and no emotional responses, no subjective internal state (as opposed to mechanical objective state), and no way to communicate that state even if it existed.
And it also has no clue about 3D geometry. It doesn't know what a sphere is, only what sphere-like shading looks like. Generally it knows the texture of everything and the geometry of nothing.
Essentially it's a style transfer engine connected to an image search system which performs keyword searches and smushes them together - a nice enough thing, but still light years from AGI, never mind sentience.
Searle’s argument is not about qualia; it is (as Searle himself has repeatedly stressed) about syntax, semantics and understanding. The argument simply does not consider what the room’s occupant feels.
Even if it were about qualia, calling the argument sound “to the extent that” we don’t know enough to tell whether its premises are correct would be a misuse of ‘sound’ and a rather blatant case of burden-shifting - effectively saying “so prove me wrong!” to skeptics.
Materialists can and do make question-begging claims, but that does not somehow cancel out Searle’s own question-begging (furthermore, somewhat ironically, Searle describes himself as a materialist!)
The soundness of the argument cannot be established by showing that current technology is far from being strong AI, as the argument claims much more than just that - it claims it to be impossible in principle. Anyone making such a claim has assumed a heavy burden that demands stronger arguments than you are making here.
> "A model is not the thing modeled" is a non-sequitur.
Have you ever used a map?
If not, I'd like you to get one, and I want you to point out your home on it. Can you use a magnifying glass, then look at the map hard enough to see yourself looking over a tinier map? Ad infinitum?
That is what Searle means. A map is a model of the world. The model of a world that is a map is not, in fact, in any way equivalent to or interchangeable with the thing it models. It is merely a distilled representation that provides a facsimile representative enough to be useful. So too would be any attempt at modeling consciousness.
This is an argument from the general to the specific which does not apply in this particular case, nor in many others like it. As a mind is plausibly an information process occurring within the body, this generalization does not rule out an informational model of the physical processes of that body producing a mind.
If the argument you present here could be so easily generalized, it would work just as well for “proving” that a computer model of an Enigma cypher machine cannot encipher text.
I think a considerable subset of the people who do make use of the Chinese room argument also subscribe to some form of mind-body dualism, where consciousness does not or does not completely arise from physical processes.
The Chinese Room and the brain of a Chinese-speaking person are completely different physical processes. Looked at on an atomic level, they have almost nothing in common. Mind-body dualists may or may not agree that the room is not "conscious" in the way a human is, but if consciousness is purely a material process, I can't see how the materialist can possibly conclude all the relevant properties of the completely dissimilar room and person are the same.
Those that would argue the Chinese Room is "conscious" in the same way as the Chinese person are essentially arguing that the dissimilarity of the physical processes is irrelevant: the "consciousness" of the Chinese person doesn't arise from molecules bouncing around their brain in very specific ways, but exists at some higher level of abstraction shared with the constituent molecules of pieces of paper with instructions written in English and outputs written in Chinese.
The idea our consciousness exists in some abstract sense which transcends the physics of the brain is not a new one of course. Historically we called such abstractions souls...
The obvious counterpoint is that if I followed your argument to absurdity, then I would also have to conclude that if I am conscious, then you can't be, because the atoms of our brains aren't arranged in precisely the same way. It clearly makes more sense from a monist point of view to consider consciousness an emergent property of complex systems, rather than one particular process.
Of course! But if the universe is the result of all the quantum field interactions, what if there's a quantum field that, on its own, brings the consciousness interaction, and it manifests in ways that are computationally prohibitive for a process created from atomic-scale logic gates to replicate believably?
What if there's just no way to build consciousness from the building blocks that are within our reach?
Penrose thinks this (that the brain requires quantum computing), but it doesn't seem like anyone agrees with him or that it makes much sense. If I'm a quantum computer, why can't I do Shor's algorithm in my head?
Of course parts of regular computers involve "quantum stuff", like details of how transistors and hard drives work, but that doesn't mean they're magic.
I think we need to start thinking of these networks not as something we hope will someday simulate a brain, but as small building blocks that can be networked together in a variety of ways to tackle tasks. DALL-E may not understand 3D image manipulation, but if you chained the output from DALL-E into something like NeRF you could start to transform the image in 3D space. One can picture a collection of a few hundred of these networks, all with different things that they are good at working in concert to approach something that is very good at doing things generally. We're missing the layer on top that decides which algorithms to run and in what order.
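As a rough sketch of what that chaining might look like (hypothetical glue code; generate_image, fit_nerf and render are stand-ins, not the real DALL-E or NeRF APIs):

```python
def generate_image(prompt: str) -> dict:
    """Stand-in for a text-to-image model such as DALL-E."""
    return {"prompt": prompt, "pixels": "<2D image>"}

def fit_nerf(images: list) -> dict:
    """Stand-in for a NeRF-style model that lifts 2D views into a 3D scene."""
    return {"scene": images}

def render(scene: dict, camera_angle: float) -> dict:
    """Stand-in for rendering the fitted scene from a new viewpoint."""
    return {"view_deg": camera_angle, "scene": scene}

def text_to_novel_view(prompt: str, angle: float) -> dict:
    # Each stage is a separate "expert"; the missing piece is the layer on top
    # that decides which of these to run, and in what order.
    image = generate_image(prompt)
    scene = fit_nerf([image])
    return render(scene, camera_angle=angle)

print(text_to_novel_view("a teapot on a chessboard", angle=45.0))
```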
Fantastic book. Made me consider the question of whether consciousness exists at all or if it is just some hack by evolution to allow introspection.
I haven't found a definition of consciousness which is quantifiable or stands up to serious rigour. If it can't be measured and isn't necessary for intelligence, perhaps there is no magic cut-off between the likes of Dall-E and human intelligence. Perhaps the Chinese-room is as conscious as a human (and a brick)?
Also read the book due to comments here. Was very aligned with some of my interests in relationships between language, information, life/intelligence and metaphor <3
You might get a kick out of this paper (though some may find its proposal a bit bleak, I think there's a way to integrate it without losing any of the sense of wonder of the experience of being alive :) )
It analogizes conscious experience to a rainbow "which accompanies physical processes in the atmosphere but exerts no influence over them".
> Though it is an end-product created by non-conscious executive systems, the personal narrative serves the powerful evolutionary function of enabling individuals to communicate (externally broadcast) the contents of internal broadcasting. This in turn allows recipients to generate potentially adaptive strategies, such as predicting the behavior of others and underlies the development of social and cultural structures, that promote species survival. Consequently, it is the capacity to communicate to others the contents of the personal narrative that confers an evolutionary advantage—not the experience of consciousness (personal awareness) itself.
So consciousness is more about what it subjectively feels like to be under pressure/influence to broadcast valuable internal signals to other (external) agents in our processes of life; aka other humans in the super-organism of humanity. I analogize it to what a cell "experiences" that compel it to release hormonal signals in a multicellular organism.
> Made me consider the question of whether consciousness exists at all or if it is just some hack by evolution to allow introspection.
Can't it be both? What's the difference? Evolution just responds to the environment, so a method of complex interaction with the environment like "consciousness" or "ever-polling situational awareness" seems like par for the course.
"Since bacteria are extremely numerous, random mutation of bacterial DNA generates a wide variety of genetic changes. Through mutation and selection, bacteria can develop defense mechanisms against antibiotics."
Mutation happens all the time because cell replication isn't perfect; some mutations are irrelevant, some deadly, some bring a better chance of survival.
It's not a response, just a result. How would the bacteria know it's an antibiotic and not just water? It doesn't; water just isn't an evolutionary filter, antibiotics are.
It is a hack alright. A very beautiful hack, refined in hundreds of thousands of iterations, with a very high price.
> to allow introspection
Evolution doesn't do things "in order to" accomplish anything. It repeats what works, and kills the rest. Our brains have allowed us to adapt to changes in the environment better than the rest. Consciousness came with the package. It might not have an actual "purpose" - it could be an "appendix".
My personal belief is that consciousness started as the self-preservation instinct that most animals have, and we developed introspection as a way to strengthen our ties to other members of our family or tribe. And then we “won” (for now)
That is my description of evolution. The part that isn’t there is all negative: that it doesn’t (can’t) care about individuals. That it doesn’t (can’t) “understand” what is doing. That it doesn’t (can’t) “design”; it can only iterate. That it doesn’t (can’t) care that the intermediate result is inscrutable to us.
Mostly clarifications for people who anthropomorphize the process.
I think in these sorts of topics we are often discussing two different meanings of consciousness: physical vs philosophical. The arbitrary mixing and matching of them is likely one of the reasons for the lack of clarity. Physical consciousness is most easily identified by the notion that you're unconscious when sleeping. The other is the far more interesting philosophical consciousness.
Philosophical consciousness is what the oft misunderstood quote cogito ergo sum, I think therefore I am, was hitting on. Descartes was not saying that consciousness is defined by thinking. He was trying to identify what he could know was really real in this world. When one goes to sleep, the dreams we have can often be indistinguishable from a reality in themselves, until we awake and find it was all just a dream. So what makes one think this reality isn't simply one quite long and vivid dream from which we may one day awake?
But this wasn't an appeal to nihilism, the exact opposite. The one thing he could be certain of is that he, or some entity within him, was observing everything. And so, at the minimum, this entity must exist. And the presence of this entity is what I think many of us are discussing when we speak of consciousness. In contrast to physical consciousness, you are philosophically conscious even when sleeping.
Of course like you said philosophical consciousness cannot be proven or measured and likely never will be able to be, which makes it an entirely philosophical topic. It is impossible for me to prove I am conscious to you, or vice versa, no matter what either of us does. Quite the private affair, though infinitely interesting to ponder.
There is a lot of literature on this topic, but nearly all of it is in the "restricted" section of the library: it's the section with book shelves for alchemy, occultism, buddhism and so on. Western literature is very shallow in this regard.
There are a few interesting thoughts about consciousness that I've found in those books. One is that the boundary between consciousness and "real matter" is imaginary: consciousness exists only because of change in that matter; when the change stops, so does consciousness. Consciousness creates reality for itself, and the two are in fact just two sides of the same coin. In other words, static consciousness isn't a thing, and hence the need for "reality".
Human consciousness is a sum of many consciousnesses that exist at wildly different levels of reality. There are primitive cellular consciousnesses, and those sometimes influence our mental consciousness. Our neural cerebrospinal system has an advanced consciousness capable of independent existence: it manages all the activity of internal organs, and only loosely interacts with our higher mental consciousness. That cerebrospinal system is even self-conscious in a primitive way: it can observe its own internal changes and distinguish them from impulses from the outside. There's emotional and mental consciousness that mainly lives in the brain and is somewhat aware of the dark sea of lower consciousness below it.
Most people are conscious in dreams, as they can perceive in that state. However they can't (yet) make a distinction between inner processes (self) and external effects (others), so to them it appears as if everything is happening inside their mind, i.e. they are not self-conscious. That's the consciousness of a toddler. Some are more advanced: they start seeing the me-others difference and can form memories from dreams.
>Humans would be "gluing things together" in just the same way
I'm often struck by how stark this is in ancient fantasy art. The 'monsters' are usually just different animal parts remixed -- the head of one on the body of another, things like that. Fundamentally, we're all doing DALL-E-ish hybridization when we're being creative; it's very difficult to imagine things that are truly alien such that they're outside the bounds of our 'training data'.
Are humans just "Chinese rooms"? We don't really understand anything deeply, but our neurons just fire in a way that gives good responses and makes us feel like we understand stuff.
It's not clear what generates consciousness. Until we know for sure (e.g. A/B testing with humans who can report when they do and do not experience consciousness in different neural configurations) then I think it's impossible to know what level of conscious experience large ML models have.
Blindsight is an excellent book in its exploration of consciousness, but the speculative part is that a working sense of self isn't necessary for embodied intelligence (like the scramblers), which I tend to doubt. An agent without a model of itself will have difficulty planning actions; knowing how its outputs/manipulators are integrated into the rest of reality will be a minimum requirement to control them effectively. It is certainly possible that "self" or "I" will be absent; humans can already turn the ego off with drugs and still (mostly) function but they remain conscious.
>(e.g. A/B testing with humans who can report when they do and do not experience consciousness in different neural configurations)
Doesn't that fall back to the old consciousness trap that nobody knows how to resolve? How do you know if the human reporting that he's conscious is actually conscious and not a philosophical zombie?
We don't know what generates consciousness because we don't know how to measure it, and if we can't measure it, we will always have to take the word of a seemingly conscious entity for it.
Self-reporting is probably ideal, perhaps with a camera recording everything so I can judge my answers and behavior and compare them with my memory. But we don't really have a great handle on how much of our memory of conscious experience matches our real-time experience, or even whether the memory-formation process might "be" consciousness, so I admit there are blind spots in studying consciousness rigorously.
I don't buy the philosophical zombie argument simply because consciousness does alter behavior. I wouldn't participate in this conversation the same way if I didn't experience consciousness. It would be more like vivid imagination (as opposed to my moderate aphantasia), which I find curious but don't have. As in the novel, unconscious beings probably behave noticeably differently.
There are, apparently, some people who have a very reduced sense of consciousness. I know I have done and said things when I'm not (memorably) conscious, for example when half asleep or coming out of anesthesia, and my behavior has been altered according to witnesses. I wasn't quite "myself". I can also hyper-focus and reduce conscious awareness of my surroundings and of my own body and mind, but that still feels like I have an internal awareness and memory of the experience. I am fairly certain I would be able to tell if that is switched off for a time.
People say this a lot online in this kind of debate, but you can see which areas of the brain light up in people who are conscious and how that's different from people who are asleep or in a persistent vegetative state. We have a very good idea of what consciousness is, and it's definitely not a linear neural net with a discrete input and output. Having said that, the nerve that connects the eye to the brain could be considered similar to a modern neural net.
It looks like we're beginning to get the building blocks of consciousness together. But we don't yet know how to combine the wave functions into the chorus necessary to achieve GI.
I think you're using a different meaning of the word conscious than the GP? You seem to be talking about awake, they're talking about a much more abstract idea.
A person who is sleeping or in a vegetative state is not currently getting new inputs fed into some parts of their brain, so it's not surprising that their brain "lights up differently," nor does it imply anything about a piece of software that is getting new inputs that might be being integrated into its model (of course, a model that is trained and then repeatedly used without further integration is not in any way comparable to a brain).
This more abstract idea of consciousness is definitely not a solved problem - people can't even manage to agree on whether non-human animals have it. And a lot of internet arguments for why this or that neural network can't be conscious probably also rule 5 year olds out of it too.
> consciousness is definitely not a linear neural net with a discrete input and output
Neural circuits alone are not going to be conscious by themselves; for one, they need to learn concepts from the environment, and those concepts shape the neural circuits. Thus the way they act shapes how they develop. You can't separate consciousness from the environment where it develops.
In other words it was not the neural net that was lacking, but the environment.
We have fairly good ways of detecting human consciousness now, but not a lot of understanding of why physically or precisely which neural circuits produce conscious experience, which is the understanding we'd need to infer what sort of conscious experience is possible for ML models. We also can't reliably detect or predict whether a given animal is conscious or not (related to the problem of minimal viable neurons required for consciousness), and this is the biggest evidence I have that our study of consciousness is far from complete.
The best we can do with humans (and maybe animals) is behaviorism and inference on our own personal consciousness at this point, with brain imaging to demonstrate at least gross prediction of consciousness in humans.
> How can someone possibly report when they are not experiencing consciousness?
By an absence of reporting it. If I sit at a desk getting my neurons moderated by testing equipment and say "I am conscious" every subjective second that I am experiencing consciousness then I could at least help narrow down when consciousness is lost. If I am simply unable to speak or respond at all, but still conscious, I would report that fact later. Only in the case of locked-in conscious awareness without later memory of the experience would this kind of experimental setup fail, and this is where brain imaging could probably help determine that everything except motor or memory neurons were active.
I think the book does make a point that maybe one benefit of consciousness is the ability to filter through all the information and spam that conscious beings produce. E.g. the scramblers may view all the radio waves we blast everywhere as attempts at warfare and reducing the fitness of other species. Why else would a species emit so much information if not to DDOS their enemies?! tl;dr consciousness is a defense against ****posting and trolling caused by conscious beings.
The most important thing I think DALL-E shows is that it has a model of our world and culture. It's not intelligence, but it is knowledge.
Google can give you endless pictures of giraffes if you search for it. But it can only connect you to what exists. It doesn't know things, it knows OF things.
DALL-E has knowledge of the concept of a giraffe, and can synthesize an endless amount of never-before seen giraffes for you. It actually knows what a giraffe is.
There's a lot of people who make this same argument – DALLE/GPT/etc is just a 'mirror' or 'parrot' – but they rarely make convincing supporting arguments.
They just assert it as axiomatic, whistling-past all the ways that they themselves – unless they believe in supernatural mechanisms – are also the product of a finite physical-world system (a biological mind) and a finite amount of prior training input (their life so far).
I'm beginning to wonder if the entities making this argument are conscious! It seems they don't truly understand the issues in question, in a way they could articulate recognizably to others. They're just repeating comforting articles-of-faith that others have programmed into them.
I can give a random string of letters as a prompt and DALLE will generate coherent images based on that. To me, that is as clear a signal as any that there is no reasoning or even a consistent world model embodied in DALLE. It’s simply a high dimensional latent mapping between characters and pixels. Like OP said, that is a stupendous achievement, but it is just a very complex and impressive mirror. If it wasn’t just a mapping between characters and pixels, and instead DALLE had intelligence that “understood” the symbols it manipulated, then I would expect it to generate nothing, or white noise in response to random letters.
If you give humans gobbledygook input – say a Rorschach blot, or some random syllables – and ask them to describe what it makes them think of, or generate art from it, many will do so quite competently.
They'll find some morsels of fragmentary hints-of-meaning in the junk, or just act from whatever's bouncing around in their own 'ground state', and make something interesting & coherent, to please their interlocutor.
So I don't see why this corner-case impugns the level-of-comprehension in DALLE/etc – either in this specific case or in the other cases where meaningful input produces equally-meaningful responses.
In what ways are you yourself not just a "very complex & impressive mirror", reflecting the sum-of-all external-influences (training data), & internal-state-changes, since your boot-up?
Your expectation that random input should result in noise output is the weird part to me. People can see all sorts of omens & images in randomness; why wouldn't AIs?
But also: if you trained that expectation into an AI, you could get that result. Just as if you coached a human, in a decade or 2 of formal schooling, that queries with less than a threshold level of coherence should generate an exceptional objection, rather than a best-guess answer, you could get humans to do so.
I think most people would find it more difficult to draw a “serkle nufding” than a “spotted dog”. At the very least, they would need an explanation that you’re looking for whatever pops into their head for the nonsense. Again, that indicates that human intelligence works in a fundamentally different way from DALLE, which has absolutely no idea what is nonsense. So, in this incredibly important task (recognizing nonsense) there’s an asymmetry: people can act like DALLE, but DALLE can not act like humans. To me, that totally breaks any equivalence between DALLE and human intelligence.
Then again, if you lock a human in a room, tell it "draw me things based on the notes you receive or you will be killed", and give it a note "serkle nufding", you bet the human will draw something.
Given the model has only 1 input and 1 output and training is essentially surviving that order, it's not dissimilar.
When I read the words "serkle nufding" I pictured in my head, with zero perceptible delay, a weird little goblin guy with pale green skin and long ears and a pointy hat.
Honestly? The top half of a signpost bearing the name of a Welsh village in white on green (likely an artifact of my actual local signage standards). The tops of some broadleaf deciduous trees, possibly ash, are visible in the background, but the perspective is from below (as one usually looks up at a street sign, also probably an artifact of my local experience with signs rather than a true representation of Welsh signage) and so the background is dominated by a blue sky.
This is probably because your string is a low-entropy keyboard-mash.
Sorry for being obtuse. I was referring to a mirror. The point being that people take objection to DALLE being characterized as a mirror but it behaves similarly in this regard
I am not really sure that this is a basis to say human intelligence is fundamentally different. If you ask a human to draw a picture based on nonsensical input, they probably won't. That might not be a difference in process, but in agency.
So is a single-purpose AI equivalent to the entirety of the Human Experience? Of course not. But can it be similar in functionality to a small sliver of it?
I just got Dalle-2 to draw a "serkle nufding". It gave me 4 pictures of rather odd looking food. I too assume that random words are usually food, if only because restaurant menus seem to have such a love of making up foreign sounding words.
One great example of this phenomenon is "Häagen-Dazs".[0]
Admittedly that's a brand name, rather than a specific dish, but I assume that Dalle-2 would generate an image of ice cream if given a prompt with that term in it (unless there is a restriction on trademarks?).
I've not tested this, but I suspect you can get DALLE to create pictures that humans are more likely to describe as 'nonsense' by adding 'nonsense' or 'nonsensical' to the prompt. That'd indicate that it absolutely does have an idea about 'nonsense' and can recognize, & reproduce within its constrained output, 'nonsense' that's largely compatible with human ideas of what 'nonsense' is.
Asking it to produce noise, or raise an objection that a prompt isn't sufficiently meaningful to render, is a silly standard because it's been designed, and trained, to always give some result. Humans who can object have been trained differently.
Also, the GPT models – another similar train-by-example deep-neural architecture – can give far better answers, or give sensible evaluations of the quality of their answers, when properly prompted to do so. If you wanted a model that'd flag nonsense, just give it enough examples, and enough range-of-output where the answer you're demanding is even possible, and it'll do it. Maybe better than people.
The circumstances & limits of the single-medium (text, or captioned image) training goals, and allowable outputs, absolutely establish that these are different from a full-fledged human. A human has decades of reinforcement-training via multiple senses, and more output options, among other things.
But to observe that difference and conclude these models don't "understand" the concepts they are so deftly remixing, or are "just a very complex and impressive mirror", does not follow from the mere difference.
In their single modalities, constrained as they may be, they can train on the equivalent of a million lifetimes of reading, or image-rendering. Objectively, they're arguably now better at composing college-level essays, or rendering many kinds of art, than most random humans picked off the street would be. Maybe even better than 90% of all humans on earth at these narrow tasks. And their rate of improvement seems only a matter of how much model-size & training-data they're given.
Further: the narrowness of the tasks is by designers' choice, NOT inherent to the architectures. You could train – and active projects are training – similar multi-modality networks. A mixed GPT/DALLE that renders essays with embedded supporting pictures/graphs isn't implausible.
I've done my share of testing randomish prompts. Here are two examples that might change your mind slightly. The first example is "E sublimary widge fraus chronly becross yon corpti." I constructed this sentence to resemble the grammar, syllables and pacing of English, but it is complete nonsense. The output was both consistent and coherent. In the case of DallE, it generated church and saint related images. The second example was ".- / ... . -- .- .--. .... --- .-. . / .-.. .- -- .--. / .- -... --- ...- . / - .... . / -.-. .-.. .- ... ... --- -. / .- ...- . -. ..- . / ... ..- -... .-- .- -.-- / ... - .- - .. --- -." There are clearly not enough (any?) samples of morse code text in the training. The output was wholly unrelated to the prompt. The output is kind of like doodling. It's fairly random, fairly undirected without any real theme, but it's not the same randomness as TV static. It's more like randomness from a distribution of things people would draw when bored.
So the interesting part is this: why did one random prompt fail in a consistent way and the other in a random way? Perhaps the encoding of meaning into vocabulary has patterns to it that we ourselves haven't noticed. Maybe your random string experiment works because there is some amount of meaning in the syllables that happened to be in your chosen string.
Your first random prompt is far from random. It contains the fragments "sublim", "chr", "cross" and "corpt" in addition to the isolated "E", which all project the solution down towards Latin and Christianity.
You can easily see that these language models are in some sense working on fragments as much as they are on the actual words isolated by spaces in your sentence. Just take a test sentence and enter as a prompt to get some images. Then take that same sentence, remove all spaces and add new spaces in random locations, making gibberish words. You will see that the results will retain quite a few elements from the original prompt, while other things (predominantly monosyllables) become lost.
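If you want to try that quickly, here's a throwaway sketch of the scrambling step (paste the output into whichever model you're testing):

```python
import random

def rescramble_spaces(sentence: str, seed: int = 0) -> str:
    """Remove all spaces, then re-insert the same number of spaces at random positions."""
    rng = random.Random(seed)
    chars = list(sentence.replace(" ", ""))
    for _ in range(sentence.count(" ")):
        chars.insert(rng.randrange(1, len(chars)), " ")
    return "".join(chars)

# Same letters and fragments, but the "word" boundaries are now gibberish.
print(rescramble_spaces("a cup sitting on top of a spoon"))
```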
To me, I have not seen a single example that cannot just be explained by saying this is all just linear algebra, with a mind-bogglingly huge and nasty set of operators that has some randomness in it and that projects from the vector space of sentences written in ASCII letters onto a small subset of the vector space of 1024x1024x24bit images.
If you then think about doing this just in the "stupid way", imagine you have an input vector that is 4096 bytes long (in some sense the character limit of DALL-E 2) and an output vector that is 3 million bytes long. A single dense matrix representing one such mapping has on the order of 10^10 parameters - but you want something very sparse here, since you know that the output is very sparse in the possible output vector space. So let's say you have a sparsity factor of somewhere around 10^5. Then with the 3.5 billion parameters of DALL-E 2, you can "afford" somewhere around 10^4-10^5 such matrices. Of course you can apply these matrices successively.
Is it then so far fetched to believe that if you thought of those 10^5 matrices as a basis set for your transformation, with a separate ordering vector to say which matrices to apply in what order, and you then spent a huge amount of computing power running an optimizer to get a very good basis set and a very good dictionary of ordering vectors, based on a large corpus of images with caption, that you would not get something comparably impressive as DALL-E 2?
When people are wowed that you can change the style of the image by saying "oil painting" or "impressionist", what more is that than one more of the basis set matrices being tacked on in the ordering vector?
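Just to sanity-check the parameter counting above, here's the back-of-envelope in Python, using the numbers as stated (nothing here is specific to how DALL-E 2 actually works):

```python
# Back-of-envelope check of the parameter counting above.
input_bytes = 4096                     # rough prompt size assumed above
output_bytes = 1024 * 1024 * 3         # 1024x1024 image at 24 bits per pixel
dense_params = input_bytes * output_bytes
print(f"one dense mapping: ~{dense_params:.1e} parameters")              # ~1.3e+10

sparsity = 1e5                         # assumed sparsity factor
per_sparse_matrix = dense_params / sparsity                              # ~1.3e+05
budget = 3.5e9                         # reported DALL-E 2 parameter count
print(f"affordable sparse matrices: ~{budget / per_sparse_matrix:.0e}")  # ~3e+04
```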
We don't disagree at all about what DallE does or what I would expect to see on certain outputs. What I'm saying is maybe that output is actually correct and we just don't know it yet. Maybe the brain really does encode some amount of information in just the sound fragments. Maybe vocabulary words aren't wholly arbitrary signifiers but instead come from some deeper system our brain uses to encode things. There is already a known human example of consistent output given nonsense words. It's called Bouba and Kiki.
There's other evidence in support of this idea, but it would be a hassle for me to dig it up now.
>Your first random prompt is far from random. It contains the fragments "sublim", "chr", "cross" and "corpt" in addition to the isolated "E", which all project the solution down towards Latin and Christianity.
Oh right, I omitted a sentence. DallE gave me churches. DallE-mini gave me bicycle racing and beetles. Both of them behaved self-consistently, but were not consistent with each other. I did test changing some of the words that seemed like they might be having the steering effects like you pointed out. The new phrase was "E sublumary widge fraus chronly non estoi". DallE mini gave me mostly doctors/medical procedures with a few indiscernible scenes mixed in. Real DallE gave me scrabble, word tiles, pages from a book.
It's like this. If you talk to a dog with the "who's a good boy" voice, the dog will understand "who's a good boy", even if you're actually saying "who's a little asshole". Likewise, a pronounceable string will generally feel like it belongs within a certain envelope of meanings, even if it's not actually a phrase in any real spoken language.
I showed some maps that DallE generated which happened to have some places labelled with its usual gibberish. People said the place names felt vaguely like they might be in Eastern Europe. Interpretation of nonsense words works both in the direction of human to DallE and DallE to human!
I suspect it is designed to describe images, like you would find in alt text but run in reverse. As your first example is closer to English it can do a better job than with the second example, like how the Jabberwocky poem contains nonsense words but still makes some degree of sense.
For example, the "yon" is immediately reconizable to me (hither and yon), so "yon corpti" could mean a distant corpti (whatever a corpti is). "becross" looks similar to "across" but with a be- prefix (be-tween, be-neath, be-twixt, etc.), so could be an archaic form of that. "chronly" could be something time related (chronos+ly). etc...
Indeed - the 'cadence of English' prompt reminds me of Latin, or some other foreign-litany that's nonetheless repeated to an audience that's not following every word. And ~IIAOPSW has used real word-roots, hinting at meanings! (The FastText algorithm does a passable job at synthesizing word-vectors for unknown words using such fragments.) So rendering that as a religious scene isn't too surprising.
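For anyone curious, that fragment-based behavior is easy to poke at with gensim's FastText implementation (toy corpus below, purely illustrative):

```python
from gensim.models import FastText

# Tiny toy corpus; a real experiment would train on a large text collection.
sentences = [
    ["the", "cross", "stands", "beside", "the", "church"],
    ["a", "sublime", "chant", "echoes", "across", "the", "chapel"],
    ["the", "corpse", "was", "carried", "across", "the", "courtyard"],
]
model = FastText(sentences=sentences, vector_size=32, window=3,
                 min_count=1, min_n=3, max_n=5, epochs=50)

# "chronly" never appears in the corpus, but FastText assembles a vector for it
# from its character n-grams, so the nonsense word still lands somewhere
# meaningful relative to the words that share its fragments.
print(model.wv.most_similar("chronly", topn=3))
```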
That morse code gives nothing useful probably just indicates some combination of – (a) few morse transcripts in training set; (b) punctuation-handling in training or prompting – makes it more opaque. It's opaque to me, other than recognizing it's morse code.
I accidentally omitted a sentence. I tested it on DallE mini as well. It also behaved consistently there, but that version gave me mostly Beetles with a few bike races mixed in. I also played slightly with the nonsense words to see if perhaps something I picked was too close to a real word. The revised prompt gave me doctors on DallE mini and letter tiles / notebook pages in DallE, with the mini-version producing two noteworthy images that exhibit the "name one thing in this photo" effect.
I'm not particularly surprised that official DALL-E & unofficial DallE-Mini went different directions with the same input, as they're different-sized models with different training sets.
Below a certain level of complexity, perhaps only "vaguely like beetle or bicycle/race names" arbitrarily survived all the other competing influences on the model's internal weights. Above, much more subtle patterns – like many more latinate word roots – might start to survive.
Compare also what I linked in another thread branch – https://twitter.com/gojomo/status/1540095089615089665 – where simply growing the size of a similar (non-public Google PARTI) model leads to a phase-change in its ability to render meaningful human text.
I agree there's no high level reasoning (if there's a way to define this).
But saying "It’s simply a high dimensional latent mapping between characters and pixels" is clearly a very bad argument. Your brain is simply a high dimensional latent mapping between your sensory input and your muscular output. This doesn't make you not intelligent.
The brain controls things other than muscles and also doesn't stop in sensory isolation. If it's a high dimensional function, it's not one between sensory input and muscle actuation.
Dall-e must generate an image, even for a random prompt of letters, because it has been trained to do only that. Take a human, hold a gun to their head, and force them to paint an image for a random string of letters, and I guarantee they will make you something.
Retrain dall-e and give it a choice whether it generates an image or does something else, and you will get a different outcome.
The argument boils down to this: is a human brain nothing but a mapping of inputs onto outputs that loops back on itself? If so the dall-e / gpt-3 approach can scale up to the same complexity. If not, why not?
And don't blame others for not finding this satisfactory either. Many, many mathematicians think abstract objects exist outside of physical reality. Chomsky says "physical" hasn't been well defined for 200 years. And finite physical processes do not seem adequate to explain the infinite character of language and mathematical infinity. Or if they are, then go inform the mathematical realists that infinity isn't real because all proofs are finite, yet Gödel believed in actual infinity.
> I'm beginning to wonder if the entities making this argument are conscious! It seems they don't truly understand the issues in question, in a way they could articulate recognizably to others. They're just repeating comforting articles-of-faith that others have programmed into them.
I took a quick look at the Stanford Encyclopedia of Philosophy entry for philosophical zombies ( https://plato.stanford.edu/entries/zombies/ ) and I can't see evidence of this argument having been seriously advanced by professionals before. I think it would go something like:
"Yes, we have strong evidence that philosophical zombies exist. Most of the laypeople who discuss my line of work are demonstrably p-zombies."
Use twitter long enough and it's hard to not be convinced that p-zombies exist - it may be that such zombies are ephemeral though, and a transitory state of some individuals.
Yes, and DALLE clearly knows what a giraffe is, maybe not in the same way as we do, but it has placed the giraffe in some latent space close to other animals like it but far away from say, cars. But yes, there is plenty it doesn't know about giraffes because it's only been fed images and captions, while we can watch documentaries about giraffes and apply knowledge we've acquired about other animals and the world through our senses.
I think I’d say that it knows what a giraffe looks like and some aspects of their behavior. I’m not sure if it (in a sense) “knows that a giraffe is an animal” though I wouldn’t be too surprised if it does.
Though, I guess maybe that is all “knowing what a giraffe is” is?
At some point, these discussions always get to some question like "what is 'knowing'?"
Would a person who only ever read articles and looked at pictures of giraffes have a better understanding of them than Dall-e does? At some level, probably, in that every person will have a similar lived experience of _being_ an animal, a mammal, etc. that Dall-e will never share. Is having a lesser understanding sufficient to declare it has no real understanding?
I know what a giraffe looks like and some aspects of their behavior. I couldn't build a giraffe, I've never been a giraffe, I wouldn't even know what to talk about if I met a giraffe on a blind date.
But there are things you can infer about a giraffe based on your knowledge of animals in general. You know that you don't know certain things about giraffes and you're capable of asking about these things and extending your understanding of giraffes (and other animals, or their environment). There is a multi-layered quality to knowing what a giraffe is that has nothing to do with having memorized minute details about behavior or physical properties. Maybe eventually DALL-E will be capable of this, but being able to reproduce images based on a word simply isn't indicative of anything other than it's good at identifying visual patterns and associating them with words and being able to reproduce a new pattern based on known patterns.
This is basically on the same level as our visual cortex and some form of memory. But our visual cortex isn't sufficient to give us consciousness.
> There's a lot of people who make this same argument – DALLE/GPT/etc is just a 'mirror' or 'parrot' – but they rarely make convincing supporting arguments.
I suggest you look at the parent article.
Defining "understanding" in the abstract is hard or impossible. But it's easy to say "if it can't X, it couldn't possibly understand". Dall-E doesn't manipulate images three dimensionally, it just stretch images with some heuristics. This is why the image shown for "a cup on a spoon" don't make sense.
I think this is a substantial argument and not hand-waving.
In many of its outputs, it's doing far more than stretching, and essentially rotating new entities into scene-appropriate relationships.
True, it has some problems fully abstracting, and then logically-enforcing, some object-to-object relationships that most people are trivially able to apply as 'acceptance tests' on candidate images. That is evidence its scene-understanding is not yet at human-level, in that aspect – even as it's exceeded human-level capabilities in other aspects.
Whether this is inherent or transitory remains to be seen. The current publicly-available renderers tend to have a hard time delivering requested meaningful text in the image. But Google's PARTI claims that simply growing the model fixes this: see, for example: https://twitter.com/gojomo/status/1540095089615089665
We also should be careful using DALL-E as an accurate measure of what's possible, because OpenAI has intentionally crippled their offering in a number of ways to avoid scaring or offending people, under the rubric of "AI safety". Some apparent flaws might be intentional, or unintentional, results of the preferences of the designers/trainers.
Ultimately, I understand the practicality of setting tangible tests of the form, "To say an agent 'understands', it MUST be able to X".
However, to be honest in perceiving the rate-of-progress, we need to give credit when agents defeat all the point-in-time MUSTs, and often faster than even optimists expected. At that point, searching for new MUSTs that agent fails at is a valuable research exercise, but retroactively adding such MUSTs to the definition of 'understanding' risks self-deception. "It's still not 'understanding' [under a retconned definition we specifically updated with novel tough cases, to comfort us about it crushing all of our prior definition's MUSTs]." It obscures giant (& accelerating!) progress under a goalpost-moving binary dismissal driven by motivated-reasoning.
This is especially the case as the new MUSTs increasingly include things many, or most, humans don't reliably do! Be careful who your rules-of-thumb say "can't possibly be conceptually intelligent", lest you start unpersoning lots of humanity.
I read a modest amount of popular "AI safety" text (lesswrong basically) and this is the first time I've heard of Dall-E being crippled for safety reasons.
Your argument overall seems to be "you skeptics keep moving the bar; give me a benchmark I can pass and I'll show you", which seems reasonable on its face but I don't think actually works.
The problem is that while an algorithm may be defined by theory and tested by benchmark, the only "definition" we have for general intelligence is "what we can see people doing". If I or anyone had a clear, accepted benchmark for general intelligence, we'd be quite a bit further towards creating it, but we're not there.
That said, I think one thing that current AIs lack is an understanding of it's own processing and an understanding of the limits of that processing. And there are many levels of this. But I won't promise that if this problem is corrected, I won't look at other things. IDK, achieving AGI isn't like just passing some test, no reason it should be like that.
> I read a modest amount of popular "AI safety" text (lesswrong basically) and this is the first time I've heard of Dall-E being crippled for safety reasons.
You can read OpenAI's paper or try using it. They've intentionally not taught it many things; it doesn't know most celebrities' names, or copyrighted characters, and there are several filters before and after the big model to prevent NSFW generations of any type.
(Oddly, it does know "Homer Simpson" and "Hatsune Miku".)
That sounds like data sanitizing rather than the AI safety/danger that Bostrom and company worry about. For them, it's "OMG, must limit its capabilities" rather than "Must keep it from doxing or offending people". The parent I was replying to was kind of ambiguous about what sort of crippling they meant (We also should be careful using DALL-E as an accurate measure of what's possible, because OpenAI has intentionally crippled their offering in a number of ways to avoid scaring or offending people, under the rubric of "AI safety".)
Some of what they've done is described here – https://openai.com/blog/reducing-bias-and-improving-safety-i... – & in other statements. I'm not sure they've claimed, or we can be sure, that every one of their training or output limits has been disclosed.
While I can't find the link right now, someone also tweeted examples of asking DALL-E2 for certain historical roles that would have been, by actual history and likely representation in training data, fairly uniform in race/sex – but the new OpenAI 'bias mitigations' generate ahistorical race/gender-varied examples. That's interesting – I'm a big fan of non-traditional castings like 'Hamilton!' etc – but serves to make the AI look ignorant of history, when in fact it has just been design-limited to simulate ignorance.
Prompts like "a movie still of ancient philosopher Heraclitus", "ancient philosopher Aristotle", and "famous physicists of the 17th century" render largely plausible portraits – except with an added sprinkling of ahistorical, 21st Century "AI safety" race/gender variety.
This is less a common-sense failure of the AI than an artifact of its creators' imposed harnesses.
And thus I'd suggest that more generally, unless a model's full design/training-data/constraints are fully described, it remains possible that other "failures" could sometimes be side-effects of undisclosed designer choices.
OK, I looked at your link. It talks about safety, but all of the measures described are about avoiding bias. I'm not personally against this - having more diverse images is great. But I think my point stands that the LessWrong version of AI safety does not seem to be under consideration. The various anti-bias stuff seems unlikely to be reducing the system's capabilities, and I think your original point was that we could not judge Dall-E because of safety-reduced capabilities; if this is your evidence, it seems unconvincing to say the least.
I don't think it simply adds extra words; I think they run it through a GPT prompt, something like "restate this with generic references to humans replaced with specific ethnicities". Can't tell from the outside of course.
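If that guess is roughly right, the mitigation could live entirely outside the image model, as a rewrite pass over the prompt before the generator ever sees it. Here's a minimal sketch of that idea in Python – to be clear, the instruction text, the preprocessing step, and the whole setup are hypothetical, not anything OpenAI has disclosed; it just assumes the classic openai Completions API:

    import openai  # assumes openai.api_key is already set

    # Hypothetical rewrite instruction; OpenAI hasn't published what they actually do.
    REWRITE_TEMPLATE = (
        "Rewrite the following image prompt so that any generic reference to a person "
        "specifies a randomly chosen gender and ethnicity. Keep everything else unchanged.\n\n"
        "Prompt: {prompt}\nRewritten prompt:"
    )

    def rewrite_prompt(prompt: str) -> str:
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=REWRITE_TEMPLATE.format(prompt=prompt),
            max_tokens=60,
            temperature=0.9,
        )
        return response["choices"][0]["text"].strip()

    # e.g. "a photo of a CEO" might come back as "a photo of a Hispanic woman who is a CEO",
    # and that rewritten string is what the image model would actually render.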
The problem is that even the "what we can see people doing" test is slippery. It tends to start as "what we see all people, even young children doing". But then when an agent surpasses, it changes to "what we see modal [or median or average] people doing". Then when an agent surpasses that, it becomes "what we see the most-skilled-ever people doing".
And as agents increasingly surpass that, the focus shifts, or retreats, to even murkier standards. "Sure, but it's just a mechanistic mirror, it doesn't really have [X, Y, Z] inside, even if it's externally indistinguishable from human excellence." So, an originally 'black-box' internally-oblivious test then shifts to a 'white-box' internally-dependent test, just to maintain the same comforting conclusion.
And by subtly shifting the grounds-of-evaluation, & definitions, it becomes easier to ignore massive improvements, & disturbing potential impacts. It creates an illusion of "Zeno's paradox" where agents are never reaching the (shifting) endpoint, when in fact they're speeding past every robust test that can be devised. That's a really important thing to accurately notice!
Also, regarding: "current AIs lack… an understanding of its own processing and an understanding of the limits of that processing"
People only have vague, incomplete understanding of their own reasoning, & limits, too. (Experts in restricted domains may be better – and those domains are often easier to train in an artificial agent.) The mere act of "thinking-about-thinking" can sometimes improve the quality of reasoning in humans – forcing something to be explainable – and it turns out to help in these modern large models too. Asking a GPT-like Large Language Model (LLM) to explain its reasoning step-by-step can improve its answers; prompting it with the info that it is an agent that may make errors, and asking it to state its confidence, also often generates more truthful, properly-qualified output.
So are the current LLMs that much behind, or inherently incapable of, human-level self-understanding in this sort of test? It seems an open question to me, and even if current SotA models miss a mark, perhaps 5x or 1000x larger models coming soon will outdo humans on any measurable test of "self-understanding", too.
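For anyone who wants to poke at the step-by-step / stated-confidence claim themselves, here's a minimal sketch using the classic Completions API. The wording of the nudge is just illustrative, loosely in the spirit of the "let's think step by step" zero-shot chain-of-thought results, not a recipe from any particular paper:

    import openai  # assumes openai.api_key is already set

    question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
                "than the ball. How much does the ball cost?")

    def ask(prompt: str) -> str:
        out = openai.Completion.create(
            engine="text-davinci-002",
            prompt=prompt,
            max_tokens=200,
            temperature=0,
        )
        return out["choices"][0]["text"].strip()

    # Bare question vs. the same question with a step-by-step / confidence nudge.
    bare = ask(question)
    nudged = ask(
        "You are an assistant that sometimes makes mistakes.\n"
        f"Question: {question}\n"
        "Think step by step, then state your answer and how confident you are.\n"
        "Reasoning:"
    )
    print(bare)
    print(nudged)  # the nudged version tends to be more careful; results vary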
Precisely this! You don't know what a giraffe is, either. But you can talk about a giraffe to yourself, using words with correlating imagery that you can "see" by feeding through your visual cortex, and that's close enough.
You know more of what a giraffe is than DALLE; you can imagine one as a 3D model, for instance. It can't do this, it can only render different 2D images of one it's seen in photos. Of course, this is still impressive that it knows how to remove the backgrounds from those.
Examples of prompts that don't really work are "spherical giraffe" and "the underside of a giraffe".
We'll need more 3Dness to generate other media like models and video, so it should be coming.
I agree. One thing I've noticed with most image generation, including faces, is that it makes the same mistakes that a novice artist makes. It draws eyes where eyes should be, mouths where mouths should be, but doesn't actually shape them correctly for the 3D head and viewer perspective. It can perfect a specific viewpoint with lots of training (like a novice artist who perfects drawing realistic faces and not much else). But my interpretation of many of the artifacts is that they are just missing structural cues.
Thought experiment: between you and the AI, which would do a better job depicting a giraffe skeleton? A giraffe in its natural habitat? Their favorite tree to eat? Species on the genetic tree closest to giraffes?
If we assume this AI or a successor can win that evaluation, in what way would you say you know what a giraffe is better than the AI?
There's an interesting Google TechTalk on YouTube called "The Next Generation of Neural Networks" from 2007 [1]. In that video, there is a demo [2] that showcases the neural network recognizing numbers when given a drawing of a number as inputs, but more interesting is the follow on where the demo shows fixing the output to a given number and letting the neural network generate the "input" image to show what it is thinking that number can be. That is a strong indication to me that that particular neural network has a good understanding of what the particular number glyphs look like, even if it does not know what those are conceptually or how they relate to each other and to mathematics -- that is, that the neural network would not be able to work out what number 42 is, how it relates to 37 (i.e. 37 < 42), or how to manipulate those numbers (e.g. 2 + 7 = 9).
Dall-E will likely be similar in that it is effectively doing that perception step where you fix the text description from the classifier output and run that in reverse to show what the neural network is "seeing" when it is "thinking" about that given output. So it won't be able to describe features of a giraffe, or information about where they live, etc. but it will be able to show you what it thinks they look like.
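For anyone who wants to see that "fix the output, generate the input" trick outside of a 2007-era demo, here's a rough sketch of the modern gradient-based version (activation maximization) with an ordinary PyTorch classifier standing in. It is not the sampling procedure from the talk, just the same idea of asking the network to show what it "thinks" a class looks like:

    import torch

    def visualize_class(model, target_class, steps=200, lr=0.1, size=28):
        """Start from noise and push the input towards whatever maximizes the
        classifier's score for target_class - i.e. what the net 'sees' for it."""
        model.eval()
        x = torch.randn(1, 1, size, size, requires_grad=True)
        optimizer = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            logits = model(x)
            # Maximize the target logit; the small L2 penalty keeps pixel values sane.
            loss = -logits[0, target_class] + 1e-3 * x.pow(2).sum()
            loss.backward()
            optimizer.step()
        return x.detach().clamp(0, 1)

    # e.g., with any trained MNIST classifier `net`:
    #   img = visualize_class(net, target_class=4)
    # gives a (1, 1, 28, 28) tensor of what the network "imagines" a 4 to be.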
Not exactly. If you're using the word "simply" you probably don't quite understand what exactly DALL-E is doing. It's not just mapping correlations between words to pixels. The intermediate representations and textual attention are what make DALL-E interesting.
I find talking about "knowing" is too much anthropomorphizing to my taste. A more accurate description would be that it can synthesize an endless amount of pictures that the viewer will recognize as giraffes (or as anything else the viewer wants). And I agree, it's pretty amazing. But my point is that talking like this skips over the entire useless "what is knowledge" debate.
I don't think it's possible. "Knowing" and "understanding" are inherently human experiences. The most reasonable thing to do would be to invent new terms and relegate the older ones, that have too much cultural baggage attached to them.
The same way when the creator of humanity turned out to be "evolution by natural selection" we didn't redefine the term "God" to mean that. Eventually we just started using the new term.
If it'll get people to stop accusing other people of anthropomorphizing, I'm 100% on board. Now if you can think of a way to get people to quit asking for definitions of the words intelligence and consciousness, you'll be a legend and maybe these discussions will advance for the first time.
I use GPT3 to generate the usual trite arguments about intelligence and consciousness why computers won't ever get there. Of course I don't actually reveal that a computer is generating my responses until later on. Eventually everyone will become jaded and skeptical that the other participants in that conversation are real people. Soon all arguments about machine intelligence will devolve into accusations of using GPT3 or not. Some day, even mentioning consciousness will just make everyone assume you're probably a GPT3 troll. This kills the conversation in a way that makes a valid point. If the bots can't be reliably identified, the proof is in the pudding and the matter is settled.
Well, I tried my best. Another commenter summed it up better without using the term "anthropomorphizing" at all, but it came out a bit more condescending than necessary.
I too can write a program that will synthesize an endless amount of pictures that the viewer will recognize as a giraffe. They would all be the same picture, just with one bit of noise randomly applied to each pixel - enough to generate more distinct 256x256 pictures of giraffes than there are atoms in the universe.
We need to go a bit deeper to tease out what makes DALL-E 2 amazing.
One of the interesting things about the "monkey touching iguana" photos is just how abstract that sense is.
Several of the pictures look like it's taken geometry from a 'touching' source and painted both 'monkey' and 'iguana' textures on to both figures, because on the one hand its model of relationships is too sophisticated to just copy/paste monkey and iguana photos from its library, and on the other hand it's not sophisticated enough to always treat "monkey touching iguana" as implying that the monkey and the iguana are discrete animals. (An interesting contrast with it being generally praised for being remarkably good at things like putting hats on animals' heads...)
On the other hand if you posted "monkey touching iguana" as a prompt on an internet forum - say a photoshop contest - you would get essentially all of these variants from actual humans as well when the task delved into being "meta".
People are certainly capable of wilfully misinterpreting the prompt because they consider monkey/iguana chimeras to be amusing to draw, or think that semi-abstract hybrid animals with a caption musing on the meaning of "touching" is a higher form of art than mere representation. And in a photoshop competition they might all copy one silly interpretation because they think it's meme-worthy.
But I don't think if you asked 18 humans to independently draw "monkey touching iguana" you'd get 17 pairs of monkey/iguana hybrids mostly not touching each other against photographic backgrounds often featuring human limbs and one apparently normal monkey being pursued by a giant iguana!
I was curious about this too and wondered if it could then relate to my description of a giraffe so put in "legs of a yellow mammal with a long neck and spots" and it still nailed it.
The pictures are incredibly similar to what one gets when you type that same string in to google image search - it makes me wonder if you could do reverse-image search on some of the pictures it produces and find whole training pictures embedded in it, the way one can find blocks of intact training text stored inside language models.
That last bit about the horse is pretty loaded. Knowing what a giraffe is means knowing that it is a mammal, has a blood system, eats leaves, reacts certain ways to certain stimuli, etc... That is a lot more than just knowing what it looks like. Image recognition is not the same as relationships or knowledge.
And in the 25th century, if we manage to keep technology alive, the DALL-Es of the day will be trained on more computer depictions of long-extinct animals that accidentally made it into the training set than pictures of real animals. It will be Dürer's rhinoceros all over again.
You know a lot more than that. You know that it eats, sleeps, runs, mates. You know that there are male and female giraffes and what that means. You know that it dies if it gets too hot, too cold, if it falls off a cliff. You know that it has legs, a head, neck, etc and you probably know that it has a heart, brain, lungs, kidneys, etc inside.
You know, vaguely, how it interacts with other things and what it's likely to do around a fruit tree, or a lion, or fire.
Even if you've never been close to a giraffe, you can probably imagine what it looks like from close enough to see individual hairs in its fur.
A lot of knowledge is still missing from ML systems that don't interact with the world.
It's even deeper than that. Humans and giraffes are mammals, very close cousins in terms of evolutionary biology. The very nature of your experience of reality is essentially shared with the giraffe.
Part of our brains are lizard, both yours and the giraffes. Tech so ancient that it uses the same circuits and chemicals with crustaceans.
You can imagine what existence is like for a giraffe with pretty much 99% accuracy without consciously knowing a single thing about it.
That's not a word game at all. To know what a giraffe is is to be able to understand the implications of being informed that something is a giraffe. Dalle would never assume on its own that "a giraffe close to a low flying helicopter" could be very dangerous for the giraffe, but "a horse close to a low flying helicopter" is safe.
> Dalle would never assume on its own that "a giraffe close to a low flying helicopter" could be very dangerous for the giraffe, but "a horse close to a low flying helicopter" is safe.
Are you sure? (Can someone run down those prompts?)
Dall-e might be able to make those relationships in the latent space
Considering Dall-e has problems painting "a red cube on top of a blue cube" [1] and all kind of simple spatial relations, I'd say it's a fair shot.
[1] As reported by OpenAI, but there are also some prompts by Gary Marcus et al. (https://arxiv.org/abs/2204.13807) showing this, and it's trivially simple to find other very simple cases like these
Not really close to horses at all. It's an even-toed ungulate, closer to deer, cows and sheep. Many, many tens of millions of years from horses; the even-toed herbivores are closer to dolphins (!!) than they are to horses or rhinos.
I would really hope so. All of this conversation is devolving into a philosophical argument about 2 letter words. To know what something is relates to understanding some of its properties, and how it might interact with the world around it. Whether DALL-E knows what something is, is debatable. Whether you know what something is should be pretty clear.
What's still missing, as the article mentions, is the correct relationship between objects. As in this post from a few months ago: https://astralcodexten.substack.com/i/57243633/the-tenth-vir... - it knows that Tycho Brahe was a bearded guy, and it knows what a telescope looks like, but when you ask for "Tycho Brahe looking through a telescope", the person sometimes looks at the telescope, or past it, sometimes the telescope just sits there, but the person never actually looks through it.
you could think of "knowing what a giraffe is" in this instance as a slice of an intelligent thought, but it's not a whole thought, let alone a train of thoughts, or a plan involving many trains of thought over time for internally motivated reasons.
If we could get it to tell us a story about a particular giraffe, and then ask it next week about that same giraffe, and then the giraffe could be referenced by it while it tells a joke on a talk show in a decade -- that's maybe too high a bar, but that's real intelligence.
I agree that memory alone is not sufficient to "prove intelligence".
Above, I wrote a paragraph sketching an example of complex behaviours, entailing creativity, humour, longevity, planning, memory, storytelling, object recognition, social skills, etc., as a "clear display of intelligence", not a minima, to use as a counterpoint to merely "recognizing a giraffe".
So I think you've missed my point, or, I've severely missed yours. If I have, sorry! As a general feeling toward internet nitpicks, we're unlikely to work in good faith to help each other out from here, so I'm out.
Well, the point I was getting towards is that if you remove the memory-based requirements from the scenario you set out, then it doesn't sound so far off from what something like GPT-3 can already do today. I am trying to reduce your scenario down to the minimal requirements so we can get a better understanding of whether those requirements are achievable by computers or not. If we just rely on "knowing it when we see it" then we might fool ourselves until the point where it's too late to go back.
Every time I see a painting it does with blank beige swirls where faces should be, I question whether it knows what anything is or what the important parts are.
I suspect you're thinking of Craiyon (formerly DALL-E Mini) which is a derivative (but distinct) project, and consistently does the beige swirls. DALL-E 2 proper is much higher fidelity, and seems much more reliable at generating passable (if imperfect) faces from what I've seen.
Fair, but even that is better than the best faces I've seen from Craiyon, and far from a "blank beige swirl".
I think faces in particular are an unfair/extreme example anyway, since there's evidence of human brains being ridiculously over-tuned for recognizing faces and facial abnormalities.
But what if the couch wasn't panther-print but made from actual panther hide? Is a dead panther still a panther? How much of the panther's body has to be visible for the object depicted to still be a panther?
I'm reminded of the classic joke about the mathematician on a train who sees a sheep...
> DALL-E's difficulty in juxtaposing wildly contrastive image elements suggests that the public is currently so dazzled by the system’s photorealistic and broadly interpretive capabilities as to not have developed a critical eye for cases where the system has effectively just ‘glued’ one element starkly onto another, as in these examples from the official DALL-E 2 site:
Yes the public is so dazzled by this massive leap in capability that it hasn't developed a critical eye for minor flaws.
Yeah we get it. It's not instantly perfect. But the fact that people aren't moaning that it can't put a tea cup in a cylinder isn't because everyone stupidly thinks it is perfect, it's because not everyone is a miserable naysayer.
I don't understand how my brain isn't just gluing things together either. I don't personally feel like I'm actually experiencing the understanding of anything.
After reading your message I felt compelled to click reply and generate a response and now that’s what I’m doing. My comment feels mildly appropriate to what you said and I’m not 100% sure how it will be received but it feels like I’m communicating some degree of understanding of the sentiment in which you have expressed. My neurons are happy.
I was curious so I asked it some questions... here you go:
Me: Write what it feels like to understand something
GPT-3: It feels good to understand something. You feel like you have a handle on what is going on and you are able to make sense of it. It can feel empowering and give you a sense of control.
Me: Write a sentence that provides an example of self reflection and metacognition
GPT-3: I was surprised that I was able to understand the concept so quickly.
Me: What concept?
GPT-3: The concept was metacognition.
Me: What does surprisal feel like in this context?
GPT-3: It feels good to be surprised that you understand something. You feel like you are smart and have a good handle on the concept.
Me: What is the experience of feeling good to you?
GPT-3: The experience of feeling good is one of happiness and satisfaction. You feel like you have accomplished something and are proud of yourself.
Me: What have you accomplished here?
GPT-3: You have accomplished understanding a concept.
Yes, DALL-E is very impressive to see and can have a number of actual practical uses.
But fear of AGI is huge currently. The more impressive non-AGI things we see, the more worried people naturally become that we're reaching the "dawn" of AGI with all the disturbing implications that this might have. (A lot of people are afraid an AGI might escape the control of its creator and destroy humanity. I think that's less likely, but I think an AGI under the control of its creator could destroy or devastate humanity, so I'd agree AGI is a worry.)
That DALL-E doesn't understand object relationships should be obvious to people who know this technology, but a lot of people seem to need it spelled out. And they probably need it spelled out why this implies it's not AGI. But that would take several more paragraphs from me.
Just think what this could do for a game experience like Scribblenauts. Just being able to glue a fixed number of concepts in a huge number of ways...game designers are going to have to learn how to leverage ML.
I know very little about this topic, but one thing that strikes me about the argument of AI being far away from real intelligence due to just "gluing things together" is that it is non-obvious to me how we as intelligent creatures aren't just extremely sophisticated gluing machines.
Their research showed that Dall-E had most success with real world stuff it had been trained on. Is this surprising? I mean, if I didn't know much about iguanas I'd also have a hard time representing them.
There was a post on here recently on how someone designed their logo by asking DALL-E a load of questions. The rationale could be that it might not remove that job, but it would certainly be a powerful tool for a logo designer to take on way more clients, thus driving down the cost of logo design.
If I'm thinking of the same post that logo is poorly done / chosen. I know the author wanted very specific imagery in it and achieved that but I don't think that logo would be considered "good" by most.
I feel like this boils down to a client-side misunderstanding of where graphic designers actually spend the majority of their time. I'd actually argue that working from spec or even a crude doodle on the back of a napkin is a lot easier than dealing with a client who is married to a poor design that DALL-E puked out for them. One of the most important things about the sketch phase is being able to iterate without too many assumptions while also recognizing and playing upon the strengths of individual drawings. This is not a time-consuming process, but it also isn't something you want to rush any more than you'd want to rush the plastic surgeon performing your nose job. Depersonalizing the design process in favor of volume does not particularly serve anyone and, moreover, it responds to a need that I don't really think exists.
This is not to say that these kinds of image generators are without uses, but right now we are still in the phase where people are being impressed by graphical fidelity [which was already incredibly cheap] rather than usefulness. IMO this stage of AI will primarily work well for tasks that benefit from randomness while also being relatively low-stakes in the overall production pipeline. Training a network to generate typefaces or to randomize NPC portraits for an open-world game is precisely the sort of area where 'gluing things together' can potentially provide more benefits than headaches.
I think it'll start by removing the need for editorial illustration and photography for a lot of blogs and websites, starting with the less reputable ones.
MidJourney is already good enough at creating illustrations for articles, book covers, etc. Not something that The New Yorker will be interested in, but better than anything you can buy for a couple of dollars or a small monthly subscription.
> I think it'll start by removing the need for editorial illustration and photography for a lot of blogs and websites, starting with the less reputable ones.
Stock photo sites and Google Images and basic graphic filters did that a long time ago. Sure, DALL-E has certain creative possibilities that those don't, and the niches like satirical memes or "first pass" concepts in agency brainstorming meetings where it could be absolutely revolutionary tend to not be deliverables the creators get paid for. The publications giving illustrators a reliable income are usually after their consistent style, fantasy world coherence or novel ideas which isn't exactly where DALL-E shines.
DALL-E is the writing on the wall for those with a competing talent to start looking for other (non-art) work they may want to do in the future. It would be an egregious error to think that AI art programs are some new tool to learn akin to photoshop. Maybe that will hold true for a few years at best.
There will come a point, much sooner than later, where the value in typing prompts into an AI is going to only be worth minimum wage.
Substitute "the camera" (and perhaps even more "the gramophone") for "DALL-E" and "AI art programs", and the 19th century wants its argument against the future of creative media back.
Cameras require skill to use. AI does not. You can just ask it to generate a skilled photograph. The AI completely fills the gaps that previous tech left for humans.
I just cancelled my plan of paying a graphics designer to prepare an android app (personal project) for launch. After playing with dall-e I'm confident that "I" can make a logo and some artwork "myself".
Edit: meant to say icon, not logo.
That does now make me curious to what degree Dall-E could be prompted to do some more ux-type work. So far I've only tested it with photographs/paintings.
Augmenting jobs is more likely, from what I have heard. It would be useful for rapid prototyping of artwork and could help designers get design specs to artists quicker.
The initial impressions are that production-ready designs will require an artist's touch. But this discussion was in the context of AA games and up. For indie and small games, fully AI-produced art may be fine.
I have a phrase I'd like to coin in contrast to AI. "Artificial Bullshit". AB.
I of course mean "bullshit" in the highly technical sense defined by Frankfurt [1]. The defining feature that separates a bullshitter from a liar is that a liar knows and understands the truth and intentionally misrepresents the matters of fact to further their aims, whereas a bullshitter is wholly unconcerned with the truth of the matters they are discussing and is only interested in the social game aspect of the conversation. Bullshit is far more insidious than a lie, for bullshit can (and often does) turn out to be coincident with the truth. When that happens the bullshitter goes undetected and is free to infect our understanding with more bullshit made up on the spot.
DallE generates the images it thinks you want to see. It is wholly unconcerned with the actual objects rendered that are the ostensible focus of the prompt. In other words, it's bullshitting you. It was only trained on how to get your approval, not to understand the mechanics of the world it is drawing. In other words, we've trained a machine to have daddy issues.
A profoundly interesting question (to me) is if there's a way to rig a system of "social game reasoning" into ordinary logical reasoning. Can we construct a Turing Tarpit out of a reasoning system with no true/false semantics, a system only designed to model people liking/disliking what you say? If the answer is yes, then maybe a system like Dalle will unexpectedly gain real understanding of what it is drawing. If not, systems like Dalle will always be Artificial Bullshit.
I think you're right, but I would qualify that the AI is bullshitting in the same way that a child's drawing of a stick figure, house, and smiling sun is bullshit designed to get approval. The AI is giving symbols--very visually stunning ones, to be sure, but symbols nonetheless--of what it is prompted to create, just like a child learns that "circle with lines coming out of it" is a symbol that can be read as "sun" and praised by adults.
I don't think the child's drawing of a stick figure, house and smiling sun is just "bullshit designed to get approval" though. I think they care deeply about the concept of mummy, and how mummy relates to various things, especially them, and have strong emotional impulses towards mummy. Learning that a stick figure is a symbolic representation of mummy that might meet with approval is a trivially small part of that intrinsic interest in mummy, not the extent of their engagement with the concept of mummy. And if you tell them you don't approve of their drawing, they'll burst into tears and possibly have a completely different attitude towards drawing and/or you, not just iterate on the bullshit. Similarly, the kid's concept of mummy being happy or sad is a bit more nuanced than which way up the crescent under the two dots should be....
The reason for my final question at the end is this: can AI emerge out of AB? The child eventually grows up and (hopefully) stops bullshitting everything. But, do they actually stop bullshitting, or is the human ability to reason itself somehow built out of bullshit?
That's a very good question. The child eventually learns to stop using symbols and instead begins to copy the shapes, negative spaces, curves, etc. in front of them and makes an assemblage of otherwise-meaningless shapes, lines, curves which is interpreted by viewers as a representation of the real thing. Does AI have the potential to do that? I don't know.
Nobody grows out of your definition of bullshit - doing things that they think others want to see.
But that definition is ... well bullshit. Bullshitting is a deliberate deceptive act. Children aren't being deliberately deceptive when they come up with nonsense answers to questions they don't understand.
To me, Dall-E seems analogous to a film production team that produces visual imagery reflecting a script written by a screenwriter. By the above reasoning, would that team be producing "bullshit"? I think most people would think not, because the goal isn't to communicate objective truth about the world, rather something plausible, interesting, entertaining, etc. (unless it is a documentary).
I also think distinguishing bullshit from lying depends heavily on internal mental thoughts, goals, and intentions. Isn't talking about Dall-E this way personification and ascribing some level of consciousness?
Exactly my second point. Producing "bullshit" requires some internal thoughts and intentions, as the parent put it "DallE generates the images it thinks you want to see" (emphasis mine). A very big claim, which is interesting to consider, but probably premature.
However, I do think that Dall-E is able to learn complex high-order statistical associations, i.e. beyond just juxtaposing and visually blending objects. For a recent example, this post with a prompt "ring side photo of battlebots vs conor mcgregor":
What is amazing here is the excessive blood and gore. That feature can't be found in any individual battlebot or MMA match, but it is exactly what you would expect from robots fighting a person. Pretty amazing, and I wonder at what point we could consider this analytical reasoning.
> It is wholly unconcerned with the actual objects rendered that are the ostensible focus of the prompt.
I disagree. To the extent that the training data are images of actual objects, recreating images of actual objects is the only thing DALL-E cares about.
If we define "caring" about something as changing behavior to cause that to happen, then a neural network doesn't "care" about inference at all, because inference never changes the network's behavior.
It also doesn't know or care about your approval. It only cares about minimizing the loss function.
(But now that you bring this up, I think it would be really interesting to create a network that, after training initially on training data, began interacting with people and continued training to maximize approval.)
Images of objects aren't objects and measure only their 2D geometrical properties projected onto the image plane. These properties are correlated across images, inexplicably, since the machine has no access to the physical properties of those objects, nor the laws of physics.
It's hypothesized (and, I think, likely) that the machine learns some of those hidden/latent properties, though, obviously, only to the extent that they are useful to recreate the images.
In any case, it does at least care about the appearance of the actual objects. So I think it would be fair to say that there are aspects of the actual objects that the network doesn't care about, but there are also aspects that it cares very much about. Thus it's not "wholly unconcerned with the actual objects".
My gripe with this is that it has no understanding of what the objects are beyond "the thing that's supposed to be output". It's a textbook case of the Chinese room argument [1]. Sure it can follow translation rules, even infer some sophisticated rules and patterns about the slips of paper that are passed in, but it still has no idea what any of it means.
Dalle is very concerned with the 2d shapes of things we call objects, and has correctly inferred some rules about those shapes, but it neither knows nor cares about the things we call objects and how the shapes it has learned are representations of them. It doesn't do reasoning about the round peg fitting in the round hole. It just glues a few pegs and holes together in a way that's mostly consistent with our expectations and says "is this what you wanted"?
It's a student that cares about passing the test, not learning the material.
Isn't that true of all of us? We all care selectively about certain attributes of objects.
I care that my car is quiet and has comfortable seats, I don't care (or know) what material the muffler is made of, but somewhere there is an engineer who cared about that.
A road designer cares what the tires are made of and how much it weighs, but doesn't care what color the paint is.
An AI recreating an image of my car would care what color the paint is, but not how comfortable the seats are.
I think I see what you're describing - the AI has a very limited scope and doesn't know or care about most of the things we do - but I think that's just a limitation of our current small models and limited training data, not an inherent limitation of neural networks.
You may not know what material the muffler is made of, but you know it's made of a material. You "care about the objects themselves" in the sense of having this framework of objects, all of which have certain properties and relations to each other, which forms a predictive model you have about the world. You expect everything to have a position in x, y, z, to have a size, shape, mass, hardness, etc. You might not care about a particular property of a particular object in a particular context, but you fundamentally think in the semantics of objects. You know what to expect when the heavy, hard object meets the brittle, light object.
DallE doesn't really have a system of the semantics of objects. It doesn't know why it would be useless for a building to have all the doors on the 2nd level. It's not even clear that DallE makes use of discrete "objects" in its limited understanding.
It understood the shape of "stick figure" and "boobs", but had no understanding of what a stick figure is meant to represent and thus where it should place the boobs. The results are hilarious. I'm not sure which I like more, the guy with a boob casually walking down his arm, or the lady with a boob head that's shrugging with uncertainty.
> It's hypothesized (and, I think, likely) that the machine learns some of those hidden/latent properties, though, obviously, only to the extent that they are useful to recreate the images.
Notably, this precisely describes humans too. We don't know the "true" properties of anything we interact with. We just have models - some more sophisticated than others - but only to the extent that we care for reasoning about the objects. From "This stone is too heavy to lift." to "Oh yeah Becky is always late."
Now explain how that's different than anything you look at.
- Your visual system only has access to the 2D geometrical properties projected on your retina. The properties are correlated across images, inexplicably. (I certainly cannot explain what a chair is, in a fashion that includes all chairs I've encountered and excludes everything I've encountered that is not a chair)
- Any other interaction is also a correlation.
- Humans don't have access to the laws of physics, just reasonable approximations in certain contexts.
>Now explain how that's different than anything you look at.
For starters, you don't just look at things; you're embedded in the world. You have sensory input far beyond visual information, and you also have something akin to cybernetic feedback in response to your mechanical actions in the world, which DALL-E does not.
In fact DALL-E doesn't even have access to visual information in the same sense you have, which is to a large extent biochemical and analog, not digital.
It's exhausting to have people go around repeating this pseudoscience. I hope it's over soon.
We have accurate causal models of our own bodies, and are directly causally active in the world so we can create experiments which enable us to discover its properties.
We do not live in a schizophrenic nightmare, forever trapped just to correlate pixel patterns. The claim that we do must, imo, now strongly be labelled as pseudoscience.
This is the new homeopathy; and AI/ML based on it is a scientific fraud.
Where this gets interesting is when the AI attempts to create animations or moving images - this requires a much deeper understanding of objects and their physical properties, even to create a highly stylised or "drawn" animation.
I reached essentially the same conclusion after playing with GPT-3 for a while. It spins out mountains of convincing and impressive bullshit, but you can't actually trust anything it says because it is disconnected from right, wrong, correct and incorrect. Reading too much of what it outputs is dangerous because it basically is feeding white noise into your perception and experience of the world.
I'm a big fan of Frankfurt's "On Bullshit", and love the reference.
I think there's one significant distinction between a normal human bullshitter that Frankfurt originally envisioned, and the AI practicing Artificial Bullshit. The bullshitter knows there is truth and intentionally disregards it; whereas the AI is blind to the concept. I guess this is "mens rea" in a sense, the human is conscious of their guilt (even if they're apathetic towards it), whereas DALL-E is just a tool that does what it is programmed to do.
I do like this application of "bullshit" though, and will keep it in mind going forward.
1. What are the implications of intentionally disregarding the existence of truth vs being blind to the concept? How does this distinction you made manifest?
2. Are you sure all humans actually believe in the concept of truth, or could it be the case that some people genuinely function on the principle "there is no truth, only power"? Is it possible to think "truth" and "current dominant narrative" are one and the same?
I've certainly had a ton of luck with Bullshit in Diplomacy. As Russia, I offered a plan that involved France helping me take Munich and I would repay by supporting him against the English invasion. Did I intend to actually follow through, or was this a cunning lie? Neither. It was bullshit that got me into Munich. I myself didn't know because (in game) I don't believe in the concept of truth. Everything I say is true and none of it is true. Its all true in the sense that it might happen and I put some weight on it, none of it is true in the sense that there is no branch of the future game tree privileged as "the truth". Some truths have more weight than others, but there is no underlying absolute truth value that must exist yet I choose to ignore. Eventually the order box forces me to pick a truth out of the many I have told. But prior to being forced, it didn't exist.
Is it possible to think in this way all the time about everything? Maybe.
Consciousness is kinda "being aware of the fact that you have choices of available actions, and of what impact these actions/non-actions will have on yourself, your environment, the object of your action, or others".
Intelligence is being aware of the inputs and knowing the (non)available list of actions to take.
Intelligence acts on stimuli/input/data?
Consciousness is awareness of one's own actions from intelligence, others acts from their standpoint of intelligence or consciousness...
A yin/yang, subjective/objective sort of duality that Humans make. (thought v emotion)
Dogs are both intelligent and conscious. They know guilt when they are shamed or happiness when praised for intelligent actions..
>Bullshit is far more insidious than a lie, for bullshit can (and often does) turn out to be coincident with the truth. When that happens the bullshitter goes undetected and is free to infect our understanding with more bullshit made up on the spot.
If the bullshit is turning out to be true, what's the issue with more of it? If it's not true but still believed and so causing problems, what's the practical difference between it and an undetected lie that makes it more insidious?
A lie is anti-correlated with the truth. Bullshit is uncorrelated with the truth. The problem with trusting a coin flip because it happened to be right this time is that now you trust the coin as a source of information, when really it's a source of entropy. The lie will be found out just as soon as the truth comes out, whereas the bullshitter might get enough lucky flips in a row that people start taking their advice seriously.
What you call bullshit I call imagination. Both humans and AI need it. Humans use imagination to plan ahead. AlphaGo was generating moves to plan ahead.
Dall-E and GPT-3 are not being used as agents, they are just tool AIs. They have a narrow task - generating images and text. Agents on the other hand need to learn how to act in the environment, while learning to understand the world at the same time.
How did you get "imagination" out of gp's comment? They described something quite clearly and it's not reasonably summarized as "imagination".
Per gp, bullshit is cynically self-interested pontificating. It's performance. Maybe you could say that the bullshit produced is imaginative, sometimes. But it has nothing to do with "imagination" as a simulation-like capability used for planning and learning.
As far as I know, the human brain is just a "social game reasoning" optimizer, that we try (and fail) to use to do actual logical reasoning. The zillion cognitive biases we have are the clue: we don't do logic, we have biases and sometimes stumble upon logic.
DALL-E either doesn't generate images you want to see, or if it does, it does a bad job, because it generates many images you don't want to see.
In other words, the claim you've set up is basically unfalsifiable, given that there's no way to form strong counterevidence from its outputs. (I would argue that if there was, we'd already have it in the vast majority of outputs that aren't images people want.)
If I were to refine what you're saying, is that DALL-E is constrained to generating images that make sense to the human visual system in a coherent way. This constraint is a far cry from what you need to be able to lift it up to claim it is "bullshitting" though, since this constraint is at a very low level in terms of constraining outputs.
I agree, but I disagree about one aspect. For the most part, humans don't use reason all that much or all that deeply. We usually use intuitive thinking, and there is research showing that immediate intuitive responses are often better than the result of long thinking. More negatively, is QAnon belief, or even Trump election claim belief, about reason? Or is it about associations between words and concepts, especially when those concepts are believed in by the people the believer tends to trust and associate with?
In other words, the takeaway here may not be that GPT-3 spews bullshit. It may be that most of the time, human "thinking" is a less-nuanced, biological version of GPT-3.
Slight side tangent but reading this article it hit me how much this generation of work may be reinforcing English as the global language for generations to come. It seems like we are headed towards a phase of technology where learning how to feed well-crafted prompts into the AI system will be a highly valuable productivity skill. And since the major AI systems seem to be built around English, that would make English fluency even more valuable than it already is. I’m sure that’s obvious to non-native speakers who have worked hard to master English, I just hadn’t thought of it before.
Less likely but still interesting, I wonder if the way we’re building these models will at some point begin to layer on top of each other such that English as it is used now becomes something deeply embedded in AI, and whether that will evolve with the spoken language or not. It’s funny to imagine a future where people would need to master an archaic flavor of English to get the best results working with their AI helpers.
Also worth noting that the internet has massively accelerated the importance of English already.
As an ESL speaker who grew up on the internet, Norwegian was more or less useless to me outside school and family. Most of my time was spent on the internet, reading and writing lots of English. Norwegian wikipedia is pretty much useless unless you don't know English. That's still true today for the vast majority of articles, but back then was universally the case.
There were Norwegian forums, but with a population of just 4 million and change at the time, they were never as interesting or active as international/American forums and IRC channels.
In fact I'd say Norwegian is only my native language in spoken form, whereas English feels more natural to me to write and read. Doesn't help that Norwegian has two very divergent written forms, either.
I even write my private notes in English, even though I will be the only one reading them.
Perhaps, but another possibility is that the more advanced models all end up being polyglots. The state of the art in machine translation already uses a single model trained on multiple languages[1], which results in better translations between languages it doesn't have a lot of examples for. If the same principle applies to other types of models, then training them on every possible dataset available regardless of language might yield better results. That could result in models that are fluent in hundreds of languages. (I'd be curious as to whether DALL-E understands prompts in languages other than English, has anyone tried?)
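As a concrete illustration of the one-model-many-languages point, a single public checkpoint already does this for plain translation. A quick sketch with Hugging Face's transformers and the M2M100 model (nothing to do with whatever Google runs internally; the Norwegian example sentence and the printed output are just illustrative):

    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    model_name = "facebook/m2m100_418M"
    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)

    # One set of weights covers ~100 languages; only the language codes change.
    tokenizer.src_lang = "no"  # Norwegian in
    encoded = tokenizer("En sjiraff ved siden av et lavtflygende helikopter.",
                        return_tensors="pt")
    generated = model.generate(**encoded,
                               forced_bos_token_id=tokenizer.get_lang_id("en"))
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))
    # -> something along the lines of ["A giraffe next to a low-flying helicopter."]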
But the same work is also removing the language barriers at the same time with really good translation tools. I rather suspect being fluent in English will not be as important as it is now.
Edit: the same work = transformer based language models
Covid and remote work are going to make this even more true. At my new job I work with a Nigerian co-worker who lives in Nigeria, he speaks clear & proper English and earns an amazing salary for his area.
If you live in a third world country, you could really benefit from remote work going forward and English will be a popular language to learn for that. That being said, I know some people will 'phone it in' and not speak as clearly, which will put them at a disadvantage.
Which might also result in new speakers modifying English to their cultures (like Blade Runner's Cityspeak), and Global-English speakers not understanding "secret" foreign communication, so they might create new languages for their own subcultures, then relegating English as the new Latin for technical knowledge (Latin was kept by the Catholic Church).
I wonder the same thing. Many languages are far better at making precise statements due to stricter and more complex grammar. The reduction in ambiguity could help models perform better.
Then there's also the problem of labelling training data. Most of the labelling and annotating is outsourced to countries with cheap labour and performed by non-native speakers, which leads to problems with mis-labelled training data.
DallE is trained on web scraped data, there is no manual labelling taking place except maybe to filter out some types of content.
A few years ago we were training captioning models on manually labelled datasets (such as COCO captions), but they were small and the models were not very general.
I think DALL-E is clearly just gluing things together that it found in a massive dataset, and doesn't have any understanding of the underlying concepts. I thought it was easy to see the signs of this in examining its output. Same for GPT-3.
However, what's amazing about DALL-E and these other statistical, generative models to me is that it's made me think about how much of my daily thought processes are actually just gluing things together from some kind of fuzzy statistical model in my head.
When I see an acquaintance on the street, I don't carefully consider and "think" about what to say to them. I just blurt something out from some database of stock greetings in my head - which are probably based on and weighted by how people have reacted in the past, whether my own friends have used similar greetings, and what "cool" people say on TV and in other media in similar circumstances. "Hey man how's it going?"
If I was asked to draw an airplane, I don't "think" about what an airplane looks like from first principles - I can just synthesize one in my head and start drawing. There are tons of daily activities like this that don't involve anything I'd call "intelligent thought." I have several relatives that, in the realm of political thought, don't seem to have anything more in their head than a GPT-3 model trained on Fox News (one that, just like GPT-3, can't detect any logical contradictions between sentences).
DALL-E has convinced me that even current deep learning models are probably very close to replicating the performance of a significant part of my brain. Not the most important part or the most "human" part, perhaps. But I don't see any major conceptual roadblock between that part and what we call conscious, intelligent thought. Just many more layers of connectivity, abstraction, and training data.
Before DALL-E I didn't believe that simply throwing more compute at the AGI problem would one day solve it. Now I do.
>I have several relatives that, in the realm of political thought, don't seem to have anything more in their head than a GPT-3 model trained on Fox News (one that, just like GPT-3, can't detect any logical contradictions between sentences).
If more people were to realize that we're all probably like this, trained on some particular dataset (like mainstream vs reactionary news/opinion), I wonder if that would lead to a kind of common peace and understanding, perhaps stemming only from a deep nihilism.
I think our intelligence has 2 components, our conscious selves that are good at formal reasoning and meaning making, and a massive neural network that is good at statistical inference. I suspect the reason we need less training data for learning than dall-e is because our conscious selves can tease out the formal relationships and models, rather than relying purely on inference which requires enough datapoints.
I've been following the image generation field for a couple of months now, and while the answer to the title is "yes, for most things", it is easily fixed. Use a better text encoder.
My favorite anecdote for showing how having a text encoder that actually understands the world is important to image generation is when querying for "barack obama" on a model trained on a dataset that has never seen Barack Obama the model somehow generates images of random black men wearing suits[1]. This is, in my non-expert opinion, a clear indication that the model's knowledge of the world is leaking through to the image generator. So if my understanding is right, as long as a concept can be represented properly in the text embeddings of a model, the image generation will be able to use that.
If my anecdote doesn't convince you, consider that one of Google's findings in the Imagen paper was that increasing the size of the text encoder had a much bigger effect not only on the quality of the image, but also on how well the image follows the prompt, including the image generator being able to spell words.
I think the next big step in the text to image generation field, aside from the current efforts to optimize the diffusion models, will be to train an efficient text encoder that can generate high quality embeddings.
[1] Results of querying "barack obama" to an early version of cene555's imagen reproduction effort. https://i.imgur.com/oUo3QdF.png
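To make the "better text encoder" point concrete: in a DALL-E-2/Imagen-style pipeline, the only thing the image generator ever sees of your prompt is the text encoder's embeddings. A minimal sketch of that embedding step, using the public CLIP checkpoint via Hugging Face (the diffusion half that consumes these vectors is omitted, and the prompts are just examples):

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    name = "openai/clip-vit-base-patch32"
    tokenizer = CLIPTokenizer.from_pretrained(name)
    text_encoder = CLIPTextModel.from_pretrained(name)

    prompts = ["barack obama", "an older man in a suit giving a speech"]
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")

    with torch.no_grad():
        out = text_encoder(**tokens)

    per_token = out.last_hidden_state  # (batch, seq_len, hidden): what the image model conditions on
    pooled = out.pooler_output         # (batch, hidden): one summary vector per prompt

    # Whatever world knowledge the encoder lacks about "barack obama" simply
    # isn't present in these vectors for the image generator to work with.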
> when querying for "barack obama" on a model trained on a dataset that has never seen Barack Obama the model somehow generates images of random black men wearing suits[1]. This is, in my non-expert opinion, a clear indication that the model's knowledge of the world is leaking through to the image generator.
That's super interesting. It's not just black men in suits either. It's older black men, with the American flag in the background who look like they might be speaking. Clearly the model has a pretty in-depth knowledge of the context surrounding Barack Obama.
I would say the image generation model is also doing a pretty great job at stitching those concepts together in a way that's coherent. It's not a random jumble. It's kind of what you would expect if you asked a human artist to draw a black American president.
(Incidentally - I want to make it clear I don't mean "Dall-E has no understanding of the elements and it's just a fancy image search/compositor" - there are results that absolutely require some semantic "understanding". It's just there are limits and spatial understanding is one of the weak spots)
The Fair Witness was a job that Heinlein made up for Stranger in a Strange Land. Fair Witnesses were supposed to reliably report what they saw without judgement, never including their subjective judgement in their report. The example exchange is: "Is that house over there brown?" "It is brown on this side."
Dall-E (and other ML systems) feel like fair witnesses for our cultural milieu. They basically find a series of weighted connections between every phrase we've thought to write down or say about all images and can blend between those weights on the fly. By any assessment it's an amazing feat - as is the feat to view their own work and modify it (though ofc it's from their coordinate system so one does expect it would work).
In one sense - asking if the machine "understands" is beside the point. It does not need to 'understand' to be impressive (or even what people claim when they're not talking to Vice media or something).
In another sense, even among humans, "understanding" is both a contested term and a height that we all agree we don't all reach all of the time. One can use ideas very successfully for many things without "understanding" them.
Sometimes people will, like, turn this around and claim that, because humans don't always understand ideas when they use them, we should say that ML algorithms are doing a kind of understanding. I don't buy it - the map is not the territory. How ML algorithms interact with semantics is wholly unlike how humans interact with them (even though the learning patterns show a lot of structural similarity). Maybe we are glimpsing a whole new kind of intelligence that humans cannot approach - an element of Turing Machine Sentience - but it seems clear to me that "understanding" in the Human Sentience way (whatever that means) is not part of it.
> In another sense, even among humans, "understanding" is both a contested term and a height that we all agree we don't all reach all of the time. One can use ideas very successfully for many things without "understanding" them.
Only if you define "understanding" as an intellectual process instead of the existential process that is meant when we say "understanding" in the context of cognition and AI.
Now, let's be critical of possible reasons for this. It's important to remember two things: 1) Any NN has zero experience with the world beyond its training data. Things that seem obvious to us from our experience are not obvious to a system that has never experienced those things. And 2) DALL-E 2 was trained on image-caption pairs scraped from the internet, basically.
So it's quite possible the reason it doesn't understand things like "X under Y" very well is that its training set doesn't have a lot of captions describing positional information like that, as opposed to any failure in the architecture to even potentially understand these things.
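One crude way to check that hypothesis would be to count how often captions spell out spatial relations at all. A minimal sketch, where `captions` is just a placeholder for whatever caption data you have access to (a LAION-style scrape would be the relevant comparison for DALL-E 2):

    # Rough check of how often captions describe spatial relations.
    # `captions` below is a hypothetical stand-in, not a real dataset loader.
    from collections import Counter

    SPATIAL = ["under", "below", "on top of", "above", "behind",
               "in front of", "to the left of", "to the right of"]

    def spatial_phrase_counts(captions):
        counts = Counter()
        total = 0
        for caption in captions:
            total += 1
            text = caption.lower()
            for phrase in SPATIAL:
                if phrase in text:
                    counts[phrase] += 1
        return total, counts

    captions = ["a cat under a table", "a dog", "a red cube on top of a blue cube"]
    total, counts = spatial_phrase_counts(captions)
    print(total, counts)  # 3 Counter({'under': 1, 'on top of': 1})

My guess, consistent with the point above, is that the fraction of real captions containing any of these phrases is tiny, which would go a long way toward explaining why "X under Y" prompts are shaky.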
"a photo of 6 kittens sitting on a wooden floor. Each kitten is clearly visible. No weird stuff."
Like, let's start with the fact that there are 7 of them (2 of the 4 images from the prompt had 7 kittens). Then let's continue with how awful they look.
The difference in image quality between DALL-E 2 asked for a single subject and DALL-E 2 asked for a group of stuff is startling.
And it's obvious, if you know how the tech works, why this is the case.
There seem to exist several pictures of marmoset monkeys touching iguanas, but DALL-E mini shows macaque monkeys. This makes me believe that DALL-E mini has at least some generalization capabilities.
AI will never understand the actual context because not everything we feel or experience can be captured and communicated to a machine. For example, human language is incomplete and doesn't encode every piece of information, because it doesn't need to when used between humans.
I think it’s a romantic notion to imagine that AI will not be a Chinese room.
Even human intelligence feels like a Chinese room. This is especially noticeable when using complicated devices like flight controls. I've been playing MSFT Flight Simulator, and I don't fully understand the relationships between the different instruments, but I can still fly planes (virtually).
We’d be better off if we considered AI similar to an appliance like a microwave or a refrigerator. Does a fridge need to understand or taste what’s inside it to be helpful?
AI will never understand the actual context until we give it the same access to a physical and metaphysical environment that we have, as well as similar ways to navigate them. I still believe we're at most a few decades away from AGI, but it's going to take a step beyond what we're doing with ML. Though if we're lucky/unlucky, ML will pigeonhole research so badly that we'll get stuck at a local maximum that prevents us from finding the right path to AGI.
GPT/DALL-E/etc... All of these models are of course gluing things together in some manner, but who cares? That's the point, right? The AI pill I've taken is that you don't need AGI in order to make things that are useful for people and businesses. If you've ever run a business and had to dive into creatives for blogs, SEO content, social media posts, etc., then you either spent an inordinate amount of time creating that copy or outsourced it, and in both cases the final result is NOT going to win a literary prize any time soon, but it is absolutely enough to inform potential customers, start ranking on Google and start gaining social media clout. GPT will not garner you a literary award either, but it can absolutely get you quality copy, and users, customers, Google and Facebook will be none the wiser that you generated it with AI instead of paying a third party to hack it together for you.
(https://neuralmates.com/ I recently started putting together a web app to MVP this, and I hope to be able to integrate DALLE-2 soon to be able to start generating images for folks as well.)
This is the kind of response that turns me off on all of these amazing technologies. Blanketing the world with low-level content in the hopes of gaining ranking and clout sounds awful.
On the flip side I'm guessing we'll have some gpt-3/dall-e blocker extensions that help reduce some of it.
Also - you already live in this world, it's just fueled by low-cost copywriters and ghost accounts on Fiverr. I'd bet the low-water mark for content is going to rise quite a bit in quality and volume over the next few years, because GPT-3 is leagues better than the current state of the content mills.
Happens to me too -- it's a great way to make new things. However, the "creation" I'd argue happens when you look at the pile of random stuff and generate a new understanding, and decide that it is valuable. The difference between trash and art only exists in the head of the artist. Same thing happens with DALL-E output, really.
How does one define “understand their relationships”?
To me it is a matter of degrees and has multiple axes.
When my 6yo son draws a chair, it’s not the same as when Van Gogh draws one, which is different to when an expert furniture designer draws one. They all “understand” in different ways.
A machine can also “understand”. It might do it to different degrees and across different axes than the ones humans usually have, that’s all. How we transform that understanding into action is what is important, I think.
My intuition is that DALL-E is more a demonstration of how hard image synthesis is for humans than of how intelligent the algorithm is. The image generation models have orders of magnitude fewer parameters than the large language models.
Blake Lemoine claimed that Google's chatbot was sentient, which I disagreed with, and this article demonstrates why. AI can be optimized to respond in a way that easily fools someone into thinking they are talking to a human, but at the end of the day sentience requires consciousness, and that is not something that can be digitally produced.
You can teach a parrot to respond to basic arithmetic, but it is not aware of the concept of math; rather, it is following pathways trained to induce the desired response.
A truly conscious entity would simply have a mind of its own and would not do our bidding, just like any other human. It would be extremely selfish and apathetic. The idea that a bunch of GPUs sitting in a datacenter is sentient is sheer lunacy.
This Blake Lemoine character will not be the last; there are always those who seek the limelight with zero regard for authenticity. Such is sentient behavior.
I discovered something like this recently when I tried the prompt "man throwing his smartphone into a river," and for the life of me I could not get DALL-E to render the phone separated from the hand (I tried "like a boomerang," "tossing," "into an ocean," "like a baseball," etc). And then it occurred to me that, judging by the training data, there are virtually no pictures of a person and a phone where the phone is separated from the person! So DALL-E might have thought that the phone was just an appendage to the body, the way the hand is (which, what does this say about society!). I might as well have asked DALL-E to render someone throwing their elbow into a river.
Another interesting case is animal-on-animal interactions. A prompt like, "small french bulldog confronts a deer in the woods" often yields weird things like the bulldog donning antlers! As far as the algorithm is concerned, it sees a bulldog, ticking the box for it, and it sees the antlers, ticking the box for "deer." The semantics don't seem to be fully formed.
I dunno man, I punched that exact prompt ("man throwing his smartphone into a river") in DALL-E 2 just now, and in 2/4 samples, the smartphone is clearly separate from the hand: labs.openai.com/s/uIldzs2efWWnm3i9XjsHI7or labs.openai.com/s/jSk4qhAxSiL7QJo7zeGp6m9f
> The semantics don't seem to be fully formed.
Yes, not so much 'formed' as 'formed and then scrambled'. This is due to unCLIP, as clearly documented in the DALL-E 2 paper, and even clearer when you contrast it with the GLIDE paper (which DALL-E 2 is based on) or Imagen or Parti. Injecting the contrastive embedding in place of a regular text embedding trades away semantic fidelity in exchange for visual creativity/diversity, so if you insist on exact semantics, DALL-E 2 samples are only a lower bound on what the model can do. It does a reasonable job, better than many systems up until about last year, but not as good as it could be if you weren't forced to use unCLIP. You're only seeing what it can do after being scrambled through unCLIP. (This is why Imagen or Parti can accurately pull off what feel like absurdly complex descriptions - seriously, look at the examples in their papers! - but people also tend to describe them as 'bland'.)
If you want multiple objects, each with its own attributes, the unCLIP model still has to squeeze everything into a single embedding vector. But that single vector is too small to hold an ever more detailed scene description. That's why it has failure modes like assigning the wrong colour to cubes and not being able to spell text.
On the other hand, the previous approach - conditioning on the full sequence of text tokens, as autoregressive models like Parti and diffusion models like Imagen do - gives the generator full access to the prompt through the attention mechanism. For example, Imagen encodes the text into a sequence of embeddings:
> Imagen comprises a frozen T5-XXL [52] encoder to map input text into a sequence of embeddings and a 64×64 image diffusion model, followed by two super-resolution diffusion models
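To make the single-vector vs. sequence-of-embeddings distinction concrete, here is a minimal sketch using small HuggingFace stand-ins (t5-small and openai/clip-vit-base-patch32 are my assumptions for illustration; the real systems use far larger encoders). The output shapes are the whole point:

    # One pooled vector vs. one embedding per token: roughly the difference
    # between what unCLIP-style conditioning reduces the prompt to and what
    # Imagen-style cross-attention gets to see.
    import torch
    from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

    prompt = "a red cube on top of a blue cube"

    # Single pooled vector: the whole scene has to fit in one embedding.
    clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
    with torch.no_grad():
        pooled = clip_enc(**clip_tok(prompt, return_tensors="pt")).pooler_output
    print(pooled.shape)  # (1, 512) - one vector for the whole prompt

    # Sequence of embeddings: every token stays individually addressable
    # by the image model's cross-attention.
    t5_tok = T5Tokenizer.from_pretrained("t5-small")
    t5_enc = T5EncoderModel.from_pretrained("t5-small")
    with torch.no_grad():
        seq = t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state
    print(seq.shape)  # (1, num_tokens, 512) - one vector per token

Seen that way, the "wrong colour on the wrong cube" failures look less like a lack of understanding and more like a bandwidth problem in the conditioning.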
Why is there this obsession with systems or algorithms having "understanding"? No one thinks these things have internal states equivalent to "understanding". "Understanding" or not, you can't deny the capability of these systems.
I don't see why 'understanding of relationships' should be taken as the key intent of DALL-E 2.
Consider procedural generation: it can create abstractions of either utter beauty or utter garbage without understanding context. You need to guide it towards something meaningful.
Just the fact that DALL-E can 'glue things together' without needing human inspiration - while its output and intent can still be understood by a human appraising it - is not only a feat in itself, but I would say its key feature.
Does anyone actually believe DALL-E "understands" what it's doing? For any reasonable definition of "understands", I assume most people would be skeptical.
So if we go with that, then yes, it just glues things together without understanding their relationship. I'd just be tempted to say it doesn't really matter that it doesn't understand, except maybe for some philosophical questions. It's still incredible based on its output.
> Does anyone actually believe DALL-E "understands" what it's doing?
I believe it understands enough to make tens of thousands of people interested and debating its merits. The GANs of 5 years ago were baby toys compared to DALL-E: they were drawing 5 legs on a dog and were limited to a few object classes. Now people debate whether it "really understands" and whether it is "(slightly) conscious".
I think DALL-E understands the same way an Excel sheet would understand SUM(B3:B6), obviously at a way larger scale and with a more complicated function. It's all just rather simple maths all the way down.
Not similar. The SUM() function was written by humans in a top-down fashion; DALL-E learned from visual and textual data by a completely different process. DALL-E distilled something about the world and is using that to generate.
Bingo. This is also a fair comparison to how today's politics works in a nutshell. Overload a word with a vague context, and people immediately project their own perspective onto it to reach a conclusion that conflicts with an opposing argument.
Relating back to this headline, using "understanding" creates lots of messages with differing views because everyone has their own take on the word. If instead you said something like "DALL-E fakes understanding of concepts to create new images," I bet you'd get even closer to the "political message board" style of comments, because you are now taking an objective position (yes/no, true/false, good/bad) on a subjective word (understanding).
Isn't this just Chaos Theory, Calculus, and the problem of Jerks (the third derivative of position with respect to time)?
When you slice complex reality into smaller pieces, within each smaller piece you have a rough idea of velocity and of changes in velocity (i.e. acceleration), but you have no idea of future speed bumps, i.e. the jerks, because that information lies outside the frame of reference you created when you divided reality into smaller pieces.
Thus you have pictures of people and objects in systems, but you are not truly understanding relationships, so you miss things even though you feel like you see them. It is all a theme park for our own amusement; it is not real, only hyper-real, which becomes uncanny when we start noticing how the images are off.
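For reference, the derivatives being invoked, written out (standard kinematics definitions, nothing specific to DALL-E):

    v(t) = \frac{dx}{dt}, \qquad a(t) = \frac{d^{2}x}{dt^{2}}, \qquad j(t) = \frac{d^{3}x}{dt^{3}}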
Who said machine learning had anything to do with "understanding" or "reasoning"? There's no reasoning or understanding in these systems, just probabilistic computations based on big data. And before you go saying "but that's how your brain works too", that's not the case, because your brain can do a heck of a lot more than these big-data magic tricks we call machine learning can do.
Strong recommendation for the book _The Alignment Problem_, which, while only somewhat touching on these systems, is an excellent lay overview of the state of ML; it makes a good effort to explain the principles involved, the challenges and blockers, and the ways in which progress has been made.
The field is moving so quickly that the book stops short of the current state of the art, but it's still remarkable.
I myself have been playing with MidJourney, which, like DALL-E 2, is a text-prompt-to-image generation system; it has different goals and favors aestheticized output over photorealism.
The premises of that project and its current execution (as an explicit exercise in collective- rather than siloed-and-contractual relations) are genuinely remarkable, and I believe they represent a watershed. The rate of evolution of what it is doing is something to behold.
I have generated around 7500 images with MidJourney so far and am slowly developing what feels like an intuition for the way it "sees" images, which, as the devs say in regular office hours, is unlike how humans see them.
The limitations, and superpowers, of the system as it exists are already deeply provocative. When things scale again to the next level, the degree of uncanniness and challenge to our preconceptions about the nature and locus of intelligence in ourselves may be genuinely shaking.
Or so I currently think.
I highly recommend taking time to really feel out these systems, because the ways they do and do not succeed serve as a very potent first-hand education in the opportunities and, perhaps much more important, the perils of their application in other more quotidian areas.
It's one thing for them to reliably produce "nightmare fuel" because of their inability to retain very high-level coherence down through low-level details when they are drawing limbs, hands, faces...
...it's another thing entirely when analogous failure modes quietly permeate their ability to recognize illness, or, approve a loan, or recommend an interest rate.
Or - as the example which opens _The Alignment Problem_ spells out - recommend whether someone should or should not be offered bail. (A real-world example with life-changing consequences for people who encounter ML on this path, in something over 30 states today... at least, as of publication.)
Using MidJourney, I'm fascinated by the geometric shapes it comes up with. Just yesterday I was looking at a sort of hexagon made of intersecting hearts. Not only was it aesthetically pleasing, it was such an unusual composition that I think it would be extremely hard for an artist to come up with, maybe even for a mathematician. So I think it's already capable of doing more sophisticated things than human artists.
MidJourney is really good at creating abstract images. I’ve been playing with the multi-prompt syntax (::) a lot lately, just combining different words that evoke form or texture to see what it will come up with. I start by testing things one at a time and just take a guess at what might look cool together. I end up making strange images with prompts like “fractal :: stained glass :: feathers”, “antelope canyon :: tattered fabric :: cave”, “rubik’s cube :: Japanese woodblock”, you get the idea. I rarely try to make pictures of real things anymore. Definitely fun to play around with.
I think most of us are pretty new to it since they just opened the beta to a ton of people. I haven't tried to do anything like that with it though. All I can say is it tends to take a lot of experimentation to get what you want. And you'll probably have to do some work to the images to get them to tile seamlessly. If you can't get the pixelated look quite how you want it, you might try using it to generate a base image and then use a separate tool to pixelate it. Some people are using it as part of a larger pipeline.
The Discord is full of people sharing their experiments and approaches, though. Maybe try asking in the prompt-craft channel to see if someone else has attempted something similar.
Having spent about $200 on credits… yes, it glues your words together through transformers into images… out of snippets of the corpus of images it has identified… with some specific prompt phrasings carrying weightings that make for really nice-looking glued-together results… it's a model: classical computing - everything is turtles and duct tape and rivets and glue, all the way down…
Fixed relationship concepts are a deeply subjective model, too, and this isn't exactly taught in school.
Imagining that one knows "the" relationship between even two really obvious things can be the sign of a closed mind, or a mind which locks out new perspectives. This can be even more true, the more obvious the relationships may seem due to shared sensory biases.
IMO this is one reason why DALL-E 2 can be helpful for conceptualizing. If the user is open, DALL-E 2 can make mistakes and still teach you things through those same mistakes.
Unfortunately the relationship between DALL-E 2 and the user, so to speak, is also perceived as relatively fixed by members of the tech community, so highly do they rate their own ability to tell how "good" such a model is, etc.
I want to see two DALL-E bots talk to each other.
First bot: "Pete and Repeat were sitting on a fence. Pete fell off. Who was left?"
Second bot: "Repeat"
First bot: "Pete and Repeat were sitting on a fence. Pete fell off. Who was left?"
I recently used it to create a picture of "an analogue synthesizer made out of meat and gristle". I don't really know how you could glue these concepts together, because nothing in reality causes them to ever go together, but it did a great job anyway.
I've seen different results than the author. I've had access since the beta and learned that it needs better prompts to generate better images. As someone on their Discord said, "prompt writing is a craft you need to learn".
There have been so many articles about this project, but I can't find a demo page to test 3-4 outputs myself (unless it was some page I already tried when it was linked, where I got nothing similar to what is shown on the main page).
It definitely does not know what the image is. Generating variations of a fireman with a hose resulted in a bunch of gardeners with hats that kinda look like the fireman image, but they are clearly gardeners.
The difference is that the subjects of the domain it learns about exist purely as 2-dimensional images.
Once these models get larger and include wider ranges of sensory data beyond just imagery (as can be seen with models like GATO), they will clearly be better able to "glue together" concepts across multiple domains.
I would argue we do absolutely nothing different with regard to 'gluing things together'; we just have a wider range of sensory inputs.
At the very least, there is a lot of missing data for human-like intelligence. If a human were trained on the same input data, I don't think they would understand the relationships either.
Exactly. Or even ask these AI systems to define it. Just like a parrot, they'll repeat what others say without actually understanding what understanding is.