Critical Topics: AI Images, Class 16
Is the AI an Artist? Is the Human a Machine?
Ultimately, there is no such thing as an AI image. There are many different ideas of what AI is, and there are many different systems involved, even when we narrow things down to GANs or Diffusion models. There are also arguments about what images are: where do images come from, what do they represent, as well as whether an image is art, when it becomes art, and why. And of course, the idea of art isn’t any simpler.
So to bring these threads together, I think it’s helpful to review four systems that connect together when we make an AI image.
First is Data: The images in the dataset and the way that dataset was assembled.
Second is the Interface, or Prompts: The way we access that data to make images from it.
Third is the Image: The result of the data, user, and interface in interaction.
Fourth is Social: Notions of creative agency and attribution.
Data
Let’s start with data. Data is the core of these generative systems. Without data, there’s no image. So this means we can start asking some questions right at the start about AI images. The first is, what is the data? Where does that data come from? How was it selected and processed?
With Generative Adversarial Networks, this was relatively straightforward, because GANs were trained on data that was compiled into specific datasets. They could only extend patterns that were found in that dataset. So if you wanted pictures of people, you had to build a dataset of people pictures. These datasets would then be used to “train” the generative adversarial network. We can look at a GAN and see its underlying dataset pretty easily. If it’s making pictures of faces, we know it was trained on images of faces. Pictures of cats means it was trained on pictures of cats.
If we know what the data is, we can go in and look at it. We can open up a dataset, if we have permission, and we can look at the pictures used to make the image. That gives us some transparency into the system. If it’s trained on people, we can look at the data and see what those people looked like. Was it diverse, for example? When datasets were assembled correctly, we could also identify the sources of those images. We know that the FFHQ dataset came from Flickr, for example, because it’s literally in the name.
These datasets were often assembled by people, oftentimes workers getting paid pennies on the dollar to identify, download, or crop these photos into a specific format.
Generative Adversarial Networks use this data as a training source. A generator creates an image, and a discriminator accepts or rejects that image by comparing it to images in the training dataset. If the generator sends something that looks like it could be in the training data, the discriminator allows that image to pass through. The discriminator then passes that success or failure information back to the generator, which replicates successful patterns but introduces new things, too, in the next round of training. The more training, the more closely the generator aligns with the training data. At a certain point, though, it learns a kind of perfect image and stops varying its output as much, because it stops learning and starts to just constrain what it generates. So that’s GANs.
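To make that back-and-forth concrete, here is a minimal sketch of one round of the generator and discriminator exchange. It is a toy under loud assumptions: the "images" are single numbers, the "training data" is just numbers near 4, and every name in it (`discriminator_step`, `generator_step`, `train`) is made up for illustration. Real GANs use deep networks and are much harder to train, so read this as the shape of the loop, not as how any particular GAN was built.

```python
import math
import random

def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def discriminator_score(d, x):
    """Probability (0..1) that x came from the training data."""
    return sigmoid(d["w"] * x + d["c"])

def generate(g, z):
    """Turn a random number z into a generated sample."""
    return g["a"] * z + g["b"]

def discriminator_step(d, real_x, fake_x, lr):
    """Nudge the discriminator to score real_x high and fake_x low."""
    p_real = discriminator_score(d, real_x)
    p_fake = discriminator_score(d, fake_x)
    grad_w = (p_real - 1.0) * real_x + p_fake * fake_x
    grad_c = (p_real - 1.0) + p_fake
    d["w"] -= lr * grad_w
    d["c"] -= lr * grad_c

def generator_step(g, d, z, lr):
    """Nudge the generator so its sample for z scores higher."""
    p_fake = discriminator_score(d, generate(g, z))
    grad_x = (p_fake - 1.0) * d["w"]  # gradient of -log(score) w.r.t. the sample
    g["a"] -= lr * grad_x * z
    g["b"] -= lr * grad_x

def train(steps=2000, lr=0.01, seed=0):
    """Alternate the two steps; returns the final generator and discriminator."""
    rng = random.Random(seed)
    g = {"a": 1.0, "b": 0.0}
    d = {"w": 0.0, "c": 0.0}
    for _ in range(steps):
        real_x = rng.gauss(4.0, 0.5)  # stand-in "training data" (assumed)
        z = rng.gauss(0.0, 1.0)
        discriminator_step(d, real_x, generate(g, z), lr)
        generator_step(g, d, z, lr)
    return g, d
```

The point to notice is that the generator never sees the training data directly. It only gets the discriminator's verdict, and it adjusts itself to earn a better one next time.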
When we talk about Diffusion models, there’s a twist on this process. First, it’s important to note that the datasets used for GANs are still around in training a Diffusion model. But a Diffusion model will process these a little differently. It’s no longer just images, it’s images and the text associated with those images. So you have the images and their captions, metadata, or alt-text from the web. If the datasets for GANs are on the web, then they are also getting pulled into these models.
So the question of “what is the data” becomes far more complex with Diffusion models. The answer on one level is that the data is something like LAION5B, which drives Stable Diffusion and Midjourney. LAION5B is information collected about 5 billion pictures alongside their 5 billion captions. It’s not the images themselves, it’s the information about how the subjects of those images break apart and how they might be reconstructed.
The way this data is collected is through images being broken down into noise. At the start, the model knows nothing. And the model sees nothing, and to be clear, it really doesn’t learn anything. Instead, it follows instructions: a recipe for turning an image into a series of numbers. The image is encountered, and categorized by the caption.
Information is stripped away from this image, and noise appears where that information used to be. But this information is stripped out in a way that allows the model to identify a pattern. Gaussian noise follows a known statistical pattern, and because it is added a little at a time, fine detail disappears first while the broad shapes of the image survive longest. This reduces that image to core shapes. But the model also walks backward to the starting image. So when the image is completely obliterated, it can trace a frame of pure static back to that picture.
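For anyone who wants to see the noising step as arithmetic, here is a minimal sketch of one common formulation (the one used in DDPM-style diffusion models). The assumptions are mine, not something from a specific product: a linear noise schedule with typical textbook values, a four-number list standing in for an image, and hypothetical function names. At each step the image is scaled down slightly and Gaussian noise is blended in, so by the last step almost nothing of the original remains.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left at each step (linear schedule, assumed values)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, alpha_bar, rng):
    """Blend the image with Gaussian noise; returns (noisy_pixels, noise)."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    keep = math.sqrt(alpha_bar)       # how much of the image survives
    mix = math.sqrt(1.0 - alpha_bar)  # how much noise takes its place
    noisy = [keep * p + mix * n for p, n in zip(pixels, noise)]
    return noisy, noise

rng = random.Random(0)
image = [0.9, 0.8, -0.7, -0.9]  # a toy four-pixel "image", values in [-1, 1]
alpha_bars = make_alpha_bars(1000)
slightly_noisy, _ = add_noise(image, alpha_bars[10], rng)
almost_static, _ = add_noise(image, alpha_bars[-1], rng)
```

Because the recipe for adding the noise is known exactly, whoever knows which noise was added can subtract it and recover the image. Training, in this formulation, amounts to asking a network to guess that noise from the noisy image and its caption.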
It then stores that information in a way that makes it accessible through texts. In other words, it’s categorized. If you label an image “flower,” the model would break your flower down into noise, learn to move from that noise back to the original image, and store that path in a kind of abstract category of flowers, where it joins hundreds of thousands of other paths from hundreds of thousands of other images.
This data was not selected by human beings in the sense that nobody moved it from the world wide web into the model, the way we had to with GANs. This was automated — meaning it was the result of a human decision to automate it. That human decision to automate this meant that they were willing to accept a number of compromises about what got into this model.
As we’ve seen, a lot of problematic content is present in the dataset, including pornographic images, child abuse images, white supremacist and Nazi imagery, and racist memes and captions. It includes misogynistic content and sexist content, such as associating the word “professor” with white men. This is a key result of the way the data was assembled and organized. But with 5 billion image and text pairs, it can be really hard to understand and unpack the information that goes into them.
There are also a lot of questions out there about whether this data was ethically collected. Much of it is being contested as a violation of copyright: Getty Images is suing Stability AI for using this information in their dataset without their permission. So are a handful of artists and illustrators whose work was in the dataset. And there is also something called Creative Commons licensing, which gives people the right to use your images under certain conditions. Some see the model and all of the information it learned about these images as copies of their images, the same way an mp3 file is a compressed copy of a musical performance.
Others see the data about how the image breaks down as the result of legitimate research, which is a fair use of that data under Creative Commons licenses. These people claim that the model is the result of the research, so it’s OK. They note that the dataset doesn’t contain images; it contains links to images with their descriptions. That dataset is open as research and free to use, but critically, the images themselves are not.
So some see the business of selling the use of that model as a whole other sphere of activity from making it. This is all very complicated, and it leads to questions about whether AI is stealing from artists, or built by theft from artists. Hopefully, based on what we’ve discussed in this class, you have your own ideas, or you feel informed enough to be uncertain about where you stand, and know how to follow this as more legal cases and use cases start to come out.
That is the data part of the systems.
Interface
Next up is the interface of the system, the prompts.
When we write a prompt, we are writing a caption for an image that doesn’t exist yet. And the Diffusion model is searching across all of its many abstractions to find clusters associated with the words in your prompts, each of which accesses a certain category of its dataset. The nice way to put this is that it is breaking objects down into a generalized ideal form, and when you are asking for an apple, it can pull information about all of the apples it has seen and how it reversed the process of all of those apples breaking down into noise.
The prompt triggers a new, random frame of noise to be generated, and the model starts walking this random noise backward toward an image. It’s noise reduction, for an image that never existed. The only information it has about that image is the information in your prompt. If the noise is random, it’s pretty unlikely that it will walk its way back to any specific apple that it has ever seen. But it will walk its way back to a kind of core concept of an apple, drawing from a kind of composite made from all possible apple shapes.
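Here is a toy sketch of that walk from noise back toward an image. Everything in it is an assumption made for illustration: the two-number "apples", the function names, and above all `predict_clean`, which stands in for the trained neural network and simply guesses the average of everything filed under the prompt. The update rule is a deterministic, DDIM-style step, which is one of several samplers in use, not necessarily the one any given product runs.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal left at each noising step (linear schedule, assumed values)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

# Hypothetical training examples filed under the caption "apple".
APPLES = [[1.0, 0.2], [0.8, 0.4], [0.6, 0.0]]

def predict_clean(noisy, prompt):
    """Stand-in for the trained network: always guesses the category average."""
    examples = APPLES if prompt == "apple" else [[0.0] * len(noisy)]
    return [sum(col) / len(examples) for col in zip(*examples)]

def sample(prompt, seed, alpha_bars):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(2)]  # a fresh frame of pure noise
    for t in range(len(alpha_bars) - 1, -1, -1):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        clean = predict_clean(x, prompt)
        # Noise implied by the current guess of the clean image.
        noise = [(xi - math.sqrt(ab_t) * ci) / math.sqrt(1.0 - ab_t)
                 for xi, ci in zip(x, clean)]
        # Step to a slightly less noisy frame.
        x = [math.sqrt(ab_prev) * ci + math.sqrt(1.0 - ab_prev) * ni
             for ci, ni in zip(clean, noise)]
    return x

schedule = make_alpha_bars()
result = sample("apple", seed=1, alpha_bars=schedule)
```

In this toy, every seed ends up at the same composite, which matches none of the three individual apples. A trained network's guess depends on the noisy frame it is looking at, which is why real systems give you a different apple each time while still staying close to that composite.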
Breaking objects and people down into core essences and then assigning categories to those core essences is a stereotype. And creating images that literally limit the range of photographs to align with those stereotypes is no big deal when you are making an image of an apple, but it can be a big deal when you are creating images of people.
Human captions often include things called invisible categories - for example, a wedding is the word used to describe a straight couple getting married, but when a gay couple get married we call it a gay wedding. There is a tendency among captions online to have a kind of normative shape to them, in which white men are not labeled as white men in photographs but everyone else is identified with some kind of modifier: a black president, a woman doctor, a gay wedding. That gets taken into these categories too, and they reinforce that bias when they generate new images that follow the same rules.
We have another issue with the prompts, which is that it’s very easy, on some level, to activate the styles of particular artists who appear in that dataset, many of whom are still working. There is the question of directly typing an artist’s name to get an imitation of their style, which we see often not just with illustrators but also with directors: a lot of re-imagining of certain movies in a particular film director’s visual style, such as Wes Anderson’s The Shining, or Alejandro Jodorowsky’s Tron. When it’s someone who is quite famous, then arguably they have more power and can withstand this kind of satirical, playful reimagining of their styles. But when it’s a relatively obscure illustrator or artist with a particular style that they have worked an entire lifetime cultivating, it can have the effect of something akin to abuse. If that visual style is part of their identity and livelihood, then flooding the internet with images that look like theirs, or like bad knock-offs that come up when people search their name, hurts their livelihood, but also infringes on their identity, memory, and vision as an artist.
At this stage, of course, we’re talking about a kind of middle ground between the prompt and the resulting image. So let’s go further into a conversation about what these images are. But before we do, I want to discuss this interface a bit further. We also talked this semester about the imagination, and how our interactions with technology exist in an imaginary sphere. We type a prompt that asks us to dream or imagine, which hides the reality of what’s happening behind these models. And the imagination of how these work relies on a kind of shorthand: that the machine is an artist, or that the machine is dreaming. These are metaphors, and they’re relatively harmless for simplifying the process. Nobody wants a button labeled something complicated. But what’s important is to remember that these buttons are labeled that way for a reason, and those reasons include hiding aspects of the model that might be objectionable to the user. Asking a machine to dream is much nicer than asking a machine to scour a dataset of other artists’ drawings and make a drawing for you. And these decisions about AI hide the flaws of these systems, and they can also make people mistake the metaphor for reality. Even very smart people become convinced that the AI is some kind of sentient actor, even when they know that it is a bunch of data that has been stripped away from images, or text, and assembled into a dataset.
Now we talked about theft in the dataset earlier, and it’s important to bring this up again, because there are two ways that people talk about these systems as theft. The first is in the data layer: that it took information without permission, and built a business around that information, and artists whose work was used were never compensated for their contribution of information or time. That is the suggestion that the companies selling these AI-generated images as a service have stolen from the artists. The companies argue that they are using a model created for research purposes, but that they had nothing to do with building that model and so shouldn’t be held accountable for how that research was conducted. There are a lot of layers of additional nuance here, including questions about how distinct that funding really was. All of that is connected to the collection of data.
The second layer of this theft accusation comes from the use of these images. And that is the idea that people prompting images in these systems are supporting the additional theft of images from artists. There are two takes on this, the first of which is, to be blunt, wrong. It stems from a misunderstanding of how these models work: the idea that they are somehow remixing artists’ work, or cutting and pasting pieces of artists’ work. As we’ve seen, that’s not quite true. The other argument, which is more compelling, is that even if you ask for an image that isn’t associated with a specific artist, specific artists contributed to the image that you are making.
For example, you could prompt the system with a reference to Garfield, the lasagna-eating cartoon cat by Jim Davis. That’s an obvious infringement of Jim Davis’ style. You could, however, prompt “cartoon cat.” You might get a cat that looks nothing at all like Garfield. But these images are pulling from an abstract space and all cartoon cats were part of that infrastructure. So the output may not look like Garfield, but it is, in part, informed by the existence of Garfield as image information sorted into the category of “cartoon cat.”
The argument isn’t necessarily that you are stealing from Jim Davis. The argument would be that you are benefitting from work created by Jim Davis, and Jim Davis is not getting compensated for that. Think about the work you did in putting your dataset together. If you found out that someone was using that dataset to make a product that made a ton of money, would you want to be compensated? This is especially true if you built that dataset out of your own artwork.
Now these aren’t the only images in the dataset: there are also public domain works, and works by people who just don’t care all that much. But the current interface doesn’t allow you to single those out and prompt only from those images: the status of public domain or not was never captured in the dataset, so it’s not a reliable category.
Finally, the interface is also a site of system level interventions. In other words, you type a prompt and the system intervenes. In the case of OpenAI’s DALL-E 2, there’s research that suggests they insert random, diversifying words into your prompt without you knowing about it. That way, you’re forced to confront these invisible categories in the dataset in ways you aren’t even aware of. DALL-E 2 will give you an image of a black professor, for example, even if the data most likely would not. Similar interventions prevent us from seeing pornographic or violent content, though that happens in a few places in the system. Stable Diffusion has a blur filter it applies when an image recognition algorithm determines something contains nudity, but as you can see from these examples, it’s quite unreliable: it censors things that don’t need censoring, but it can also pass through things that shouldn’t get through, which I am not going to show you for obvious reasons.
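As a rough illustration of what a prompt-level intervention could look like, here is a minimal sketch. To be clear, this is guesswork about the general shape of the idea: OpenAI has not published its mechanism, and the word lists, the function name `rewrite_prompt`, and the rule of appending a single term are all hypothetical.

```python
import random

# Hypothetical word lists; the real system's terms and rules are not public.
PERSON_WORDS = {"person", "professor", "doctor", "nurse", "ceo"}
DIVERSIFIERS = ["woman", "man", "Black", "Asian", "Hispanic"]

def rewrite_prompt(prompt, rng):
    """Quietly append a diversifying term when the prompt mentions a person."""
    words = {w.strip(".,!?").lower() for w in prompt.split()}
    if words & PERSON_WORDS:
        return prompt + ", " + rng.choice(DIVERSIFIERS)
    return prompt  # prompts without a person word pass through unchanged

rng = random.Random(0)
sent_to_model = rewrite_prompt("a portrait of a professor", rng)
untouched = rewrite_prompt("a bowl of apples", rng)
```

The user only ever sees what they typed. The model receives the rewritten version, so the intervention is invisible unless someone goes looking for it.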
So those are also interface level interventions that mediate your ability to explore the data. But ultimately, you are still able to steer your way through the dataset with your prompts.
Cultural & Social
This brings us to the fourth system that exists around AI images. And that system is human culture. Notions of creative agency and attribution are cultural and social. Each of us is moved or repulsed by different images. These images reflect who we are, what we grew up with, or our experiences as people.
Art can be made in 30 seconds or 30 years. Art can be created by pure luck or chance or through highly focused and sustained attention to detail. Art can be created by mistake, or with purpose. The value of an image as art is subjective. But your image exists in a context, and for AI images that context is inevitably shaped by how they engage with these ethical questions. You can make work that ignores them, or engages them: as artists have always done, with every tool.
Does an AI see the way a human sees? Is an AI inspired the way a human is inspired?
You’ll hear — and have heard — that they do. I’m told that AI systems see the world of images, and learn from them. That they create new images from what they have learned. I’m assured that this is precisely what human artists do. But just like the dream button, or the imagine command, these are a form of shorthand that reduces a complex social and cultural reality down into something that seems a little too simple. Machines do not understand the social or cultural context of images. That is a major break from how humans make and understand images.
I think reducing what artists or photographers do to what a machine does is tempting as shorthand. But ultimately, it obscures the actual mechanics of the machine, and it obscures the experienced reality of the artist. Every encounter a human has with an image occurs within a context. We see images on billboards. We see them in museums. We see them online or on TV. We see them when we are on our way to an exciting date and when we are coming home from a disappointing one. We see pictures of our dog, and we see pictures of dogs we haven’t met. We see medical records in a TV drama and we see an X-ray of our uncle’s lungs inside a doctor’s office.
The experience of an image matters. It creates associations and feelings. It creates emotional responses. When images are juxtaposed, or placed in a sequence, we might read them as a story. The story of those images may be a projection of our associated experiences with similar images: it reminds us of something. We connect them to previous lived experiences. We interpret them based on their alignment to those experiences.
Are the people in the photograph meant to be us, or someone else? Is it meant to be seen by us, or by someone else? Who has crafted this image and what do they want me to do with it? Do I want to read this image in the way it was intended? Would I like to read it differently?
We don’t just encode and decode images. We use them to tell our own stories. Without those stories, images are data: latent, unactivated visual information. But because we activate these stories, we negotiate what these images mean against the meaning that was written into making them.
That is the individual interpretation of these images. In our lecture on advertising, we talked about Stuart Hall, who notes there is something called a hegemonic understanding of the images: the ones shaped through frequent use and description by authorities, from the law to politics to media. These hegemonic understandings are naturalized: in the US, a red traffic light means “stop.” To function as a society, with a shared understanding of the world, this hegemonic reading matters. But we also naturalize it: we say “red means stop.”
Red, of course, has no “meaning” outside of how we’ve learned to read it.
We process images through our eyes and into our memories. But we also process images through our memories. We see emotionally. We see through lenses we’ve learned and unlearned. When artists “recreate” what they have seen, it’s been molded by those experiences. We might see an image, abstract it, and regenerate it in some new form. But there’s more than information in that mix. More than visual data.
There’s an old saw about a designer who could sit down and design a brilliant logo in 30 seconds. Someone asked how they could do that so fast. Her response was that it didn’t take 30 seconds — it took 30 seconds and 30 years. In other words, it was her experience, the way she learned to see, that led her to create the work so quickly.
Yes, AI generates images quickly. I don’t think that is a reason to discount it as art. What’s missing from this instantaneous generation of a picture is any lived experience. Humans bring that. And we may choose which images make sense to us through our own experience.
But that isn’t the AI’s experience, it’s ours.
I would even issue a caveat that this definition of experience is not in and of itself a requirement for art. Early in the semester we talked about George Brecht, and his desire to strip the human completely out of the process of art making. The avant-garde have played with methods of stripping emotion out, to make “objective” art, for decades. I am not suggesting that this is not art, or that an AI cannot make images that can become art. I am simply suggesting that an AI does not see the way we see.
Despite this easy option out of the “AI can’t be an artist” trap, most AI artists argue for the opposite. That AI art should be perceived as emotional, beautiful, profound, in explicitly human ways, for what these images depict. But there’s something far more interesting about how it depicts things and the ways that the categories and logic of image making get into these images.
Again, the art an AI system makes might be art. It might not be. But what shapes that answer is never as simple as what tool was used to make it.
The allure of saying that the machines see the way humans do is, I think, partly shaped by two simplifications.
The Two Simplifications
The first is the simplification of how humans see. When we spend too much time coding complex problems into programs and algorithms, we begin to see everything as ultimately reducible. This is why AGI mythologies are so pervasive: if you are trained to see things as potentially reducible to a set of instructions for a machine to follow, you can be tempted to see all things as equally eligible.
The second simplification is a refusal to accept that there are, in fact, different ways to see. In other words: Diffusion models see in their own way. And despite my comparison to vomit, Diffusion systems are fascinating — poetic, even. They are able to see information break down to absolute obliteration, and then reverse it. They are able to move backward from a frame of noise into an image.
To apply the metaphor of human vision to this system is a tempting shorthand. But it’s also reductive — of humans, but also the system it describes. It is also, ironically, incredibly human-centered: closing us off to the ways of seeing beyond our own senses. It is possible to say that an AI does not think or feel, and to accept that we might find interesting new tools for our own imagination by contemplating what AI actually does instead.
It seems to me that when it comes to new technology, we should aim to see things as clearly as possible. The benefits of this clarity also help us to avoid a really big risk.
That’s the risk of dehumanizing ourselves. Seeing human vision as a mechanistic system is literally dehumanizing: it says we are machines. It suggests that we are NPCs, those video game characters that play out a script on a loop, existing only for the sake of the one human player moving through the world. If we adopt the simplicity of thinking that people are machines, or that we function like neural networks or complex computers, that’s a way of simplifying all the messy entanglements of the world. Humans do not think like machines. We do not think like the LAION5B dataset, and the LAION5B dataset does not “think.” Humans, in and of themselves, do not even share one way of thinking. There is a wide range of diversity in the ways we see, the thoughts we think, the way we make sense of the world and the ways that we don’t make sense of it.
And while it’s important not to dehumanize people, we might also learn a lot by resisting the demechanization of machines. We might aim to understand their logics, structures, and patterns as contributing to the diversity of the world through complex mechanized processes. They can be embraced as part of the systems we’re entangled with, rather than a replacement for that entanglement. Saying that machines think and that humans must think like machines is saying there is one way to think. But there isn’t. And if we accept that a dog can think, but that it doesn’t think like a person; or that an octopus can think, even if it doesn’t think like a person or a dog, then we might look at a machine and say: it’s doing something, and what it’s doing doesn’t have to be categorized as intelligence, or thinking, or sentience at all.
To be super simple about it: maybe a machine is doing its own thing. And maybe we can aim to understand that thing for what it is, instead of by comparison to ourselves. If we do, we might identify better strategies for managing their co-existence with our social and ecological systems. We can have relationships with things even if they can’t have a relationship with us.
So on that note, I hope this class has given you some firm footing to better make sense not just of AI images but of artificial intelligence more generally. I hope when you come up against claims about what AI can and can’t do, that you can think critically about what’s being said and find your own conclusions about how you want that technology to be used in the world around you. And I hope you found some things you agree with in this course, but I also hope you found things to disagree with. Because the cool thing about this space is that it is still emerging, still changing, and all of us are still learning. I encourage you to think critically about everything that has been presented here, and to find your own way through it, but also to stay open and curious about other ways of seeing.
Looking for Something Else to Do?
You can read a series of essays I wrote as a Flickr Foundation Research Fellow. It explores a specific dataset built by Flickr, and how it became a piece of AI infrastructure embedded into Diffusion models today. It is in three parts, the first of which is here.
This is the end of the course! If you want to follow my newsletter, I write more or less every Sunday on topics just like these. It’s free!