Critical Topics: AI Images, Class Seven

Generative Adversarial Network Fever

Up until now we’ve looked at the history leading up to contemporary AI images. Today, we’re going to start looking at two of the contemporary tools people are using, starting with Generative Adversarial Networks, or GANs. GANs co-exist with another way to make AI images: diffusion models. I want to be clear that GANs pre-date diffusion models, and aren’t the same tools you see in things like DALL-E 2 and Stable Diffusion. Nonetheless, they’re important to look at for a few reasons.

First, they’re part of the history of AI images more generally. Second, they’re still out there, and until very recently, they were more accessible to artists and designers who want to create something unique. Diffusion models take tons of training data, and they’re proprietary: if you want to make something as an artist, you can only make things within the constraints of specific tools. With GANs, you can train your own models much more simply.

Abe Lincoln Image Classification Series CC-BY-SA, Tomas Smits and Melvin Wevers. The image is first broken down into shades or colors. Those colors match an index of numerical values. When certain sets of numerical values are consistently clustered, the system may associate them with facial features: for example, a series of very dark cells surrounded by gray cells would be recognized as an “eye”; if a string of dark pixels is clustered near that cluster, it may be determined to be an “eyebrow.”

So let’s start with the history. We last talked about technologies of image recognition and how they worked. Basically, you break an image down into color gradients, those gradients are converted into numerical values, and those values are arranged into a grid. Machine learning tools look for patterns in these numbers and in where they sit relative to the surrounding pixels. This means a system can detect where an eye is, for example, by finding a cluster of pixel values that matches an eye, sitting near the cluster that matches a nose.
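
To make that concrete, here is a minimal sketch in Python (using the Pillow and NumPy libraries, and a hypothetical file name) of what it means to break an image down into numbers: the photograph becomes a grid of values, and a facial feature is just a cluster of values that stands out from its surroundings.

```python
# A minimal sketch, not a real recognition system: it only shows that an
# image is a grid of numbers, and that "features" are clusters of values.
# "lincoln.jpg" is a hypothetical local file.
from PIL import Image
import numpy as np

img = Image.open("lincoln.jpg").convert("L")   # grayscale: one number (0-255) per pixel
pixels = np.array(img)                         # a 2D grid of those numbers

print("Image as numbers:", pixels.shape)       # e.g. (600, 450) rows and columns

# Very dark pixels (pupils, eyebrows) show up as low values.
dark = pixels < 50
rows_with_dark = np.where(dark.any(axis=1))[0]
if rows_with_dark.size:
    print("Dark clusters appear between rows",
          rows_with_dark.min(), "and", rows_with_dark.max())
```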

This is all great if you want to identify a face. But the question of generating a face was significantly more challenging. There was simply too much information that could cause weird rendering issues. For example, if the machine understood what an eye was, would it be able to tell that it was the left eye or the right eye?

The mathematical system worked, just barely, for identifying pictures of things out in the real world. But generating photorealistic images in this way was a massive problem. 

Progress in size, resolution and detail in GANs from 2014-2018.

Here we see the rapid progress in GANs since Ian Goodfellow introduced them in 2014. Goodfellow opted to work on this problem with what is, in hindsight, a pretty clever and simple idea. Since image recognition systems could do a decent job of recognizing faces, what if you created two layers and considered them opponents in a guessing game? One system would make an image from a random start point and pass it to a second layer that would try to recognize and evaluate that image.

This is the concept of a GAN: a generative adversarial network. The GAN has two functions:

  • Generator: Creates an image from random noise and tries to pass it off as real.

  • Discriminator: Tries to decide if the image it is handed is generated, or from the training dataset. If it’s real, it passes through. 

The generator’s output is compared to the accepted and rejected images from the discriminator. Eventually, it writes pixels based on what is accepted, and stops writing pixels in patterns that get rejected. In the end, it merges this feedback to assemble images more closely aligned to patterns found in the training data. This relies on the evaluation step, which is guided by the discriminator. The discriminator tries to decide if the image it is handed resembles the training data. If it looks real, it passes it along; if it looks fake, it’s rejected.

Here’s an easy way to remember this: it’s in the name. This relationship is generative — it makes new images from noise. It is adversarial because one neural network is trying to create images that ‘trick’ the other, and one is trying to reject those images.
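
To see those two roles side by side, here is a minimal toy sketch written with PyTorch. It is not the architecture of any model discussed here; it is a pair of tiny networks sized for 28-by-28 grayscale images, just to show that each half of a GAN is an ordinary neural network with a narrow job.

```python
# A toy sketch of a GAN's two halves in PyTorch (not StyleGAN or any real model).
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Turns a vector of random noise into a 28x28 image."""
    def __init__(self, noise_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Tanh(),   # pixel values squashed to [-1, 1]
        )

    def forward(self, noise):
        return self.net(noise).view(-1, 1, 28, 28)

class Discriminator(nn.Module):
    """Scores an image: near 1 means "looks like training data", near 0 means "generated"."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, image):
        return self.net(image)

# One exchange in the guessing game: noise in, images out, scores back.
g, d = Generator(), Discriminator()
fake = g(torch.randn(8, 64))       # 8 images generated from random noise
print(d(fake).shape)               # 8 scores, one guess per image
```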

From Google’s Machine Anatomy course.

The above illustration offers greater detail. At the top you see the training dataset — “Real Images” — a bunch of images of, say, real people. A sample of those images is pulled out and shown to the discriminator. Meanwhile, the generator is getting little bits of random noise and using that as the start point for writing out new pixel clusters based on the patterns it has learned. So the generator is making something new, and passing it to the discriminator. 

Now the discriminator has a real image and a fake image. It’s designed to compare them against the pixel arrangements it’s used to seeing in the training data and make a guess. We know that the generator is, initially, not very good at this. It’s just sending noise over there sometimes. But when the discriminator rejects a generated image, information about that image goes back to the generator. And the generator updates its model about what to create. In other words, every time it loses, it is basically given information about why it lost. So it learns to emphasize or de-emphasize certain elements of what it is generating. Eventually, the generator starts producing something that looks like a face, and the discriminator passes it through. At that point, the generator knows what it has done right, and updates its model to duplicate those patterns.
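
Here is that feedback loop as a toy sketch, again in PyTorch, assuming the Generator and Discriminator classes from the earlier sketch and a stand-in batch of “real” images. The important part is the last few lines: the generator’s update comes entirely from how the discriminator scored its fakes.

```python
# A toy sketch of one round of the game described above (assumes the
# Generator/Discriminator classes from the earlier sketch).
import torch
import torch.nn as nn

generator, discriminator = Generator(), Discriminator()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

real_images = torch.rand(16, 1, 28, 28) * 2 - 1        # stand-in for a real training batch
real_label = torch.ones(16, 1)
fake_label = torch.zeros(16, 1)

# 1. The discriminator sees real and generated images and is scored on its guesses.
fake_images = generator(torch.randn(16, 64))
d_loss = (loss_fn(discriminator(real_images), real_label) +
          loss_fn(discriminator(fake_images.detach()), fake_label))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 2. The generator is scored on whether its fakes fooled the discriminator.
#    The gradient that flows back is the "information about why it lost."
g_loss = loss_fn(discriminator(fake_images), real_label)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```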

Now, these images are being compared against training data. So here’s an important thing to note. The generator is not designed to create new images of faces. The generator is designed to create new images that resemble the parts of the training data that the discriminator is most familiar with. 

So let’s think about portrait photos. Here are three images of men found in a massive, 70,000-image dataset for training GANs, called the FFHQ dataset. All of these pictures were downloaded from Flickr by NVIDIA, and remote workers were paid to identify them as faces and crop the photos.

We see that the faces are consistently in the center of the frame, yeah? The eyes and noses are more or less in the same place. There are also some differences here. The hat, for example. Not everyone wears a hat and there are so many different kinds of hats. Look at the backgrounds. The backgrounds in almost every image are going to be different, too. 

What do you think this will mean for the output? Remember, the discriminator is looking to recognize images in the dataset. That means it is figuring out the patterns in the dataset, which are, by definition, the aspects that appear most consistently. If something isn’t appearing consistently in the training data, the discriminator is going to ignore it. It only looks for patterns.

And crucially, GANs are designed to predict an image that will eventually trick the discriminator. Think about it: it’s generating images from noise, and being compared against a dataset. The discriminator has to decide, is this part of the dataset or not? That’s why I suggest the generator learns to make something that looks like the dataset by looking at patterns and predicting what will pass through the discriminator. So GANs are about recognizing patterns, and learning to predict what will fit into those patterns. We do the same stuff when we try to predict things like credit scores, or parole, or how many toothpaste tubes to put on a shelf. Machine learning looks at the dataset, identifies patterns in that dataset, and tries to predict what would fit into that pattern. 

Here are some examples of what passes. What do you see? Does it match what you predicted?

So the generator makes pretty realistic faces, but everything else can be kind of goofy. If there’s not enough data, there won’t be enough information for the discriminator to distinguish something from noise. So it allows some weird hats to pass through to us. It also does something else kind of… horrifying.

What do you see in these pictures? What do you think happened here? What made this image look like this? Why didn’t the discriminator stop it?

What we’re seeing is a system that emphasizes patterns across these images. All of these images were cropped in a certain way so that eyes were in the usual place in a frame. So eyes, and noses, and then ears and hairlines — those are pretty typical across the training data, showing up in the same place consistently enough to find and reproduce a pattern.

Now, most of these images in the training set will focus on just one person. But what’s clear here is that some people in the dataset had buddies in the picture with them, while many did not. Because a small but significant slice of the training photos included people posing with a friend, the discriminator allows more vague, unstructured imagery to pass through it.

So when the generator sends noise to the discriminator, the discriminator does the algorithmic version of a shrug. The information about these buddies is pretty weak in the training data, but strong enough that it corresponds to something in the dataset, so the discriminator doesn’t stop it.

When we talked about bias in algorithms, we mentioned that black faces were underrepresented in this data. As a result, there’s not enough information to create really photorealistic black women from the data in this dataset. This is true of GANs, but also true of the latest batch of AI imaging tools. If something you ask for shows up and it looks strange, or noisy, there’s a reason. That reason is because these systems rely on identifying patterns, and if one set of patterns is stronger than another, the strongest patterns will appear most clearly in your images, and the weakest patterns will appear distorted or strange.  

Part Two: Datasets as Archives

I want to stop here and bring this to a broader topic about archives and the kind of power that archives carry. This is kind of a long quote from Jacques Derrida, speaking about the power that archives have over the stories that these archives contain. And I point to this because these distorted images tell us something about archives broadly, but also serve as a warning about the way we use archives in making images.

Derrida writes, (pg 3): 

Consignation aims to coordinate a single corpus, in a system or a synchrony in which all the elements articulate the unity of an ideal configuration. In an archive, there should not be any absolute disassociation, any heterogeneity or secret which could not separate or partition in an absolute manner. ... the archontic principle of an archive is also a principle of consignation, that is, of gathering together. ... Whatever secrets and heterogeneity would seem to menace even the possibility of consignation, this can only have grave consequences for the theory of the archive ... and its institutionalization. -Jacques Derrida, Archive Fever (1994)

In other words, an archive is often designed to impose a kind of order on the things that it contains — the unity of an ideal configuration — and anything that strays from that unity can create problems in the structure of the story that the archive is going to tell us. To be silly about it, an archive of gum packaging at the gum museum doesn’t make sense if it is half filled with cigarette packs.

But the principle can be more nefarious than that: it can speak to the inclusion and exclusion of evidence that betrays an official story the archive has embraced. For example, if you have a museum that ignores the history of people native to the land, your museum doesn’t tell the full story. But the story it does tell exerts a form of power over what it ignores, and elevates what it presents.

AI generated images are a manifestation of that, too. Here, it’s pretty visceral: the archive (the dataset) had things that didn’t fit the pattern needed for machine learning to tell this story of human portraits. And these images are a story: they are, in effect, an infographic about the dataset. When you look at the pictures above, they tell you: something inconsistent happened in the background of these images.

As you heard before, this takes on a deeper and more frustrating sense when we consider the failure to include people of color in this dataset, and the consequences that has for image recognition and generation systems. But the response to this is not all that easy, and is not always a matter of making the archive bigger. 

But all of this is just to say: if you have 400,000 images of people’s faces where they are alone in the picture, that’s going to be the thing the model renders really well. But if just a few hundred of those include a second person in the frame, those extra people are going to show up like weird ghosts asserting their presence in the dataset. Now, what’s cool, as artists, is that we can tell our own stories. But it’s important to understand how. And part of that is, we can use this idea of the archive, and this impossible idea of a “perfect archive,” to undermine, reveal, or elevate certain stories.

Let’s go back to the beginning: data. Every dataset is collected for a specific purpose, and every model is created to extend that purpose. In other words: if you want to generate pictures of full bodies, you need to go and collect pictures of full bodies, and make sure they look similar enough for the model to identify and duplicate patterns. If you want to create images of oil paintings, you need to go find thousands of oil paintings about a specific subject and then use those to train the model.
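
To make “look similar enough” concrete, here is a minimal Python sketch that takes a folder of photographs and forces them into one consistent, centered, square format, roughly the kind of cropping and alignment that face datasets like FFHQ depend on. The folder names are hypothetical.

```python
# A minimal sketch of dataset preparation: every image gets the same size and framing.
# "my_photos/" and "dataset/" are hypothetical folder names.
from pathlib import Path
from PIL import Image, ImageOps

SIZE = 256  # one consistent resolution across the whole dataset

src, dst = Path("my_photos"), Path("dataset")
dst.mkdir(exist_ok=True)

for i, path in enumerate(sorted(src.glob("*.jpg"))):
    img = Image.open(path).convert("RGB")
    img = ImageOps.fit(img, (SIZE, SIZE))    # center-crop to a square, then resize
    img.save(dst / f"{i:05d}.png")
```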

Look at the image above: a selection of images created for NVIDIA’s StyleGAN dataset. It’s designed to generate pictures of people’s full bodies. What patterns do you see in this small sample? Who is missing? What kinds of images do you think they would produce? What kinds of images of people would it be impossible to produce?

Aside from selection bias, there are other issues. GANs were not all trained with the permission of the people in the dataset. But as artists, you can build, or extend, datasets with your own work. You can use your own photographs or drawings, for example. Here you see a piece I generated — just a sample of work — by creating a dataset full of flowers and dancers, taken from public domain images off the internet. In other words, I used images that I had permission to use. I have also built datasets with my own data, taking photographs for example. The one on the left is the result of taking hundreds of photos of willow branches against a white wall. 

GAN Art Works, Eryk Salvaggio, 2019-2020

Once you have a dataset, you have to train the model. I want to talk about a couple of other problems with GANs, and with machine learning in general: overfitting and overtraining.

Overfitting is when your dataset is too similar. If you have 5,000 photographs of the same exact thing, you will get that thing. If you have 5,000 photographs of completely different things, the machine won’t find any patterns.

Training is where the two neural nets start playing their games of comparisons. If you don’t train enough, the generator model won’t learn enough about the patterns in your data. Very little of its output will be useful.

Overtraining is when you train too much. Two things can happen. The first is that the model simply stops finding new patterns in the data, and you spend money and electricity training a model with very little benefit. The second is that you actually send the system into collapse. That’s when the discriminator stops being able to tell the difference between the training data and what’s produced by the generator. The generator can then send almost anything it wants through to the discriminator, and it will pass. That information, even when it’s weird and wrong, goes back to the generator as feedback, and the generator starts making more and more noise.
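
There is no single test for overtraining, but one rough check, sketched below under my own assumptions rather than as any standard metric, is to track how often the discriminator guesses correctly each epoch. If it hovers near coin-flip accuracy while the generator is still producing noise, the game may have collapsed; if it is almost always right, the generator is getting very little useful feedback.

```python
# A rough, hypothetical health check for GAN training, not a standard library metric.
def discriminator_health(real_scores, fake_scores, threshold=0.5):
    """Scores are discriminator outputs in [0, 1] collected over one epoch."""
    correct = (sum(s > threshold for s in real_scores) +
               sum(s <= threshold for s in fake_scores))
    accuracy = correct / (len(real_scores) + len(fake_scores))
    if accuracy < 0.55:
        return accuracy, "warning: discriminator can barely tell real from fake (possible collapse)"
    if accuracy > 0.95:
        return accuracy, "warning: generator almost never wins; little useful feedback"
    return accuracy, "ok"

# Scores clustered around 0.5: the discriminator is doing the algorithmic shrug.
print(discriminator_health([0.48, 0.52, 0.49], [0.51, 0.50, 0.53]))
```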

With really massive datasets like we talk about in 2024 — billions of images — overfitting isn’t really a problem. But overtraining might be, and we’re all waiting to see if the improvements to AI image resolution and quality eventually plateau.

But the core idea is that with machine learning, you want to train your model at the right size and scope for what you have. 

Part Three: When AI Art was WikiArt

When you train a model, you might keep it private for your own uses, or you might make it public. One of the first AI art scandals to come into the art world came about because of a public model. An artists’ group called “Obvious” created the portrait on the bottom left using GANs that were trained on 15,000 portraits. They printed one of these images on canvas, naming it after a French variation on the name of Ian Goodfellow, the inventor of GANs. This work went on to be one of the first from an artificial intelligence system to be auctioned off at Christie’s, a major art house auctioneer. The work sold for $432,000.

Already we might ask: who made this work? The first answer could be, well, Obvious made it. The second answer is: well, Obvious made it, but they made it from 15,000 portraits by other artists, so who were those artists? Another question emerges, however, when we ask: where did the model trained on those 15,000 images come from?

You might suspect that Obvious built that model if they were the ones who sold its output. But that’s not the case. Obvious used a pre-built model by a guy named Robbie Barrat. Barrat trained and shared that model for others to use. Obvious would suggest that, well, they were the ones who put the output on a canvas and framed it as an artwork. 

I also want to point out that there was a lot of work on GAN art around this time that focused on oil paintings and portraits. You had Obvious and Robbie Barrat here, and you also had artists like Mario Klingemann, whose “Memories of Passerby” was trained on a deep history of European portrait paintings. And you had work from Gene Kogan, which moved through the entire WikiArt corpus, scanning through the most abstract paintings, which you see on the left, and the most traditional, on the right.

I’m pointing it out because it’s important to talk about why so much art of this time period relied on this style of exploring traditional art. The short answer is that GANs needed data. One of the largest sources of data at the time was public domain archives, like WikiArt. So a lot of GAN art from this period was sort of in dialogue with existing art archives. Another influence, I suspect, is benchmarking. When a company wanted to test whether a model could create art in the ways that a human could, it might turn to archives of art and see how well the model could match something from that archive. If you have a system that is aiming to fool a discriminator, and it can create a convincing oil painting in the style of a European master, then you can say something pretty powerful about your model.

Lots of artists are still putting AI into dialogue with “old masters,” or more traditional forms of painting. A very notable work by the artist Refik Anadol uses GANs to reimagine the entire history of the Museum of Modern Art in New York, in a style reminiscent of Gene Kogan’s “Generative Antipodes” from 2019. It blends in some other beautiful — though not particularly informative — data visualizations about the collection and the model it produced.

The accessibility of public domain art datasets oriented a lot of this early AI-generated art in a very traditional Western style of painting. There’s a lot of oil paintings and European names in the WikiArt archive. I point this out because it speaks to the ways that the data that’s available to us shapes the kind of work that we make. And if we think of AI art as “imagining” things, then it’s important to think about what that imagination has been built on. 

Now, as we’ve seen very recently, there’s a big controversy around building big, diverse datasets, too — which is that sometimes those datasets are built without anybody’s explicit consent. With the case of WikiArt and the public domain, on the other hand, you’re limited to the data that has been digitized. And what has been digitized tends to reflect a certain set of priorities. Museums may concentrate, for example, on traditional, often white, Western artists, at the expense of digitizing more contemporary works from a diverse set of cultures. Those kinds of works may not even be present in the archives that can afford to digitize their collections. And if they were digitized, who has the right to use those images? Who has proper ownership over art in a Dutch museum that was taken from the Congo? 

In 1978, the Director-General of UNESCO, Amadou-Mahtar M’Bow, spoke about returning artifacts taken by colonizers:

“Everything which has been taken away, from monuments to handicrafts—were more than decorations ... They bore witness to a history, the history of a culture and of a nation whose spirit they perpetuated and renewed.”

Who decides if that witness to history should be part of your training dataset? 

An archive is a major site of debate — what’s inside an archive and what is left out. The answer isn’t always to digitize it, or make it accessible. 

“[T]he technical structure of the ... archive also determines the structure of content even in its very coming into existence and in its relationship to the future.  The archivization produces as much as it records the event.”  (Derrida, 17)

Derrida was talking about archives — museums, libraries. But the warning is true for AI, too. The archive you gather and train on is the archive the GAN will produce. There are lots of important questions about where data comes from, how we use it, and what we make with it. The set of conclusions you draw about one set of images may have to be reconsidered entirely from scratch when you come to the next set of images.

So in the next class, when we talk about diffusion, I want to encourage you to keep this idea in mind. Because Diffusion, paired with something called CLIP, marked a shift away from this sort of bespoke collection of images to a vast sweep: a dataset that was, basically, the Internet.



Works Referenced:

  • Ian Goodfellow, et al. (2020) "Generative adversarial networks." Communications of the ACM 63.11, 139-144. (Link)

  • Tomas Smits and Melvin Wevers, Abe Lincoln Image Classification Series (Image).

  • Ian Goodfellow (2019) 4.5 years of GAN progress on face generation. (Image) Twitter.

  • Google, Machine Anatomy: Diagram of Generative Adversarial Network.

  • Tero Karras, NVIDIA Labs (2019) FFHQ Dataset. (Link)

  • Jacques Derrida (1994) Archive Fever: A Freudian Impression. University of Chicago Press. p. 3, p. 17

  • Jianglin Fu et al, NVIDIA Labs (2022) StyleGAN-Human: A Data-Centric Odyssey of Human Generation. (Link)

  • Hannah Höch (1971) On Collage. In The Ends of Collage, ed. Yuval Etgar (2017) Luxembourg & Dayan Press. p. 143.

  • Lorna Simpson (2018) Collages. Chronicle Books LLC.

  • Helena Sarin (Date Unknown) BashoGAN via AIartists.org.

  • Obvious (2018) Edmond de Belamy. Painting.

  • Robbie Barrat (2018) Output from Neural Network. Twitter.

  • Mario Klingemann (2018) Memories of Passerby. Interactive Video work.

  • Gene Kogan (2019) Generative Antipodes. AI/Video work.

  • Amadou-Mahtar M'Bow (1978) A Plea for the return of an irreplaceable cultural heritage to those who created it. UNESCO Courier. (Link)