Critical Topics: AI Images, Class Nine
Exploring the Datasets
The video lists this as Class 10 because Class 9 was a working day.
Lecture by Eryk Salvaggio
In our previous classes we’ve discussed the technologies that drive today’s image generation tools, up to generative adversarial networks and diffusion models. You’ve seen how GANs compare images from training data to generated images and learn from that process until they create photo-realistic representations. You’ve also seen how diffusion models walk backward through an image’s deterioration, denoising an image until the model arrives at something new, always checking it against the language of an image recognition system (CLIP) and against LAION, a massive dataset of more than 5 billion images.
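If it helps to see the idea as code, here is a rough sketch of what “checking against CLIP” means: the image and the caption are each turned into a vector, and the system measures how closely the two vectors point in the same direction. The embeddings below are made-up placeholders, not real CLIP outputs; this is only the shape of the comparison.

```python
import math

# A rough sketch of a CLIP-style check: embed the image and the caption into
# the same vector space and score how well they match. Real CLIP embeddings
# come from trained neural networks and have hundreds of dimensions; these
# three-number vectors are placeholders.

def cosine_similarity(a, b):
    """How closely two vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

image_embedding = [0.9, 0.1, 0.3]    # pretend embedding of a generated image
caption_embedding = [0.8, 0.2, 0.4]  # pretend embedding of the prompt text

print(cosine_similarity(image_embedding, caption_embedding))
# A high score means 'this image matches this caption'; that signal is what
# steers the denoising process toward the prompt.
```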
Today we’re going to dig deeper into these datasets and look at what’s inside them. One of the reasons we’re doing this is that we often mistake our prompts for the input into something like Stable Diffusion. We assume that we are the ones making the image because we are the ones who put the prompt into the system. But the prompt is not the only input. The input for the model is the massive dataset of images; the prompt is simply navigating what’s inside that dataset.
We will talk about what’s in the data, because that’s a crucial part of understanding these systems. If you think about the image set as your input, then you know what to look for and what to ask for in order to get certain results. You will be better at understanding how to ask for things, how to trigger certain combinations, or how to steer your way through.
Finally, and inevitably, we’re going to confront some challenging realities around race, gender, and power within these datasets. It’s important to understand when these images are biased so we can put in the work of mitigating that bias. This ranges from the social harms of representation (who is included in the dataset, who isn’t, and how people are presented) to, crucially, how people are labeled and categorized, and how stereotypes are rendered in these systems.
A note that we are going to deal with some serious topics around racism, misogyny and child abuse in this class.
Part One: Galton’s Ghost
We’ve already seen that the histories of these technologies often go back further than we might have ever imagined. This story is no different: we start in 1877. At that time, there was a growing social belief in what we call scientific positivism, which J. H. Turner defines as a way of thinking “that proposed the only authentic knowledge is scientific knowledge and that all things are ultimately measurable.” This idea of measurement was at the heart of a photographic project by the British scientist and statistician Francis Galton, who invented the practice of composite portraits. A composite portrait was a single image made from a number of other photographs, all of which were assigned a category; Galton’s categories included race, class, profession, and other descriptors. He would then create a kind of statistically average face from all of the faces in that category.
This is what AI is doing in our image generators: drawing categorical information from descriptions of images found online, and stacking them together within categories based on the labels they’ve been given. This practice goes back to Francis Galton. Galton believed that personality was written into the shape of our faces. This included our potential for success, and our likelihood of criminality. To show this, he exposed multiple photographs of mug shots from known criminals to a single photographic plate, centered on the eyes, essentially drawing one face over the other until there was not one face on the plate, but a blurry composite of many faces. Through this practice, a single, common face for that category would take shape.
This composite image, he suggested, should be used to identify people as criminal types. He was clear that if someone looked like the average face of a criminal, they were probably a criminal themselves. Galton came to believe that these images revealed the statistical average of a category: a visual mean. In fact, Galton invented the idea of correlation in statistics, a tool for finding patterns still used by machine learning today. The idea that data contains correlations, or statistical relationships, is what makes these systems work.
Galton was also an avowed eugenicist. Eugenics is the disproven belief that some classes and races of people are genetically inferior to others, and that by eliminating those “inferior” groups you can create a superior race. Galton was not only a believer in this theory; he personally named it and published books advocating for it. So these two principles, composite photography and statistical correlation, are literally derived from the same man who advocated for the elimination of certain people based on their physical appearances, and both were used to justify that claim.
Galton used these composites to make his case that certain human facial features were predestined to create criminality, while other facial features would breed, literally, superiority. Of course, there were no surprises in these findings. Ultimately, he was only suggesting that the poor and the oppressed minorities of British rule, including its colonies, deserved to be oppressed, and that the aristocracy deserved to be in power. These tools of statistics, composites, and photographs were part of a mechanism to assert white supremacy over others in the British Empire.
Galton made all kinds of composite images, intended to categorize and label a variety of types: one series shows us the stereotypical Jewish schoolboy; another the stereotypical aristocrat. He made composite images of successful families and suggested it was their biology and genes that made them successful, rather than their unique access to wealth and power. And families of criminals were born this way, evidenced by the shape of their faces.
Galton’s belief came down to the idea that if people looked alike then they were alike.
At the heart of what Galton was saying were a number of myths that are increasingly prevalent today. The first is that the camera is objective and neutral in its facts. The camera was the hot new medium of the day, much as AI is in our current moment. The second is that sorting people into categories and compositing them reveals something true about them. That practice continues today: when you create a prompt in Stable Diffusion for “criminal” or “professor,” for example, you are likely drawing on similar reference points.
Compare the two images above. On the left, the result for “criminal,” on the right, the result for “professor.” Why are these the criminals and why are these the professors? How does Midjourney form an idea of what a criminal or a professor looks like? The answer, as we’ll see, is that the stereotypes of image datasets come through in the outputs of the models they build. Presenting these images as they are runs the risk of reinforcing stereotypes already embedded into these image captions.
But let’s get back to Galton, and his idea that correlation means causation. Galton, who pioneered this idea of correlation, was convinced that it could explain all kinds of phenomena. Today, a lot of AI is dedicated to automating the discovery of statistical correlations in large datasets. We have all this data and we need some way to understand it, so we let the machine look for correlations. Today we risk the same fallacy that Galton did: we imagine that what the machine says is true, but we don’t ask why it is true.
AI can be used to reinforce power structures in the same way that Galton’s technologies did. By controlling the images and their categories, Galton was able to say anything he wanted about the people he put onto film. He was able to carve out a dataset from the world at large, and then point at that dataset and call it proof.
Lila Lee-Morrison wrote about Francis Galton and his connection to today’s algorithms in 2019:
The advantage of the camera, for Galton, lay in its ability to visually represent the abstracting process of statistics in the form of facial images. With the practice of repetitive exposure, that is, in multiplying the reproduction process by photographing the photograph, Galton’s composites transformed the use of the imaging apparatus into a form of statistical measurement. Galton’s use of the camera to apply the abstractive process of statistics resulted in a depiction that, as Daston and Galison put it, “passed from individual to group.” (page 89)
I raise this because of how important it is to acknowledge, and how little we do acknowledge, that AI images are tied directly to strands of thought that came from a man who used those tools to advocate for eugenics, and who ultimately paved the path to a specific logic of genocide that decimated human lives across the 20th century.
This is not to say we can’t use these tools. It is a reminder to be vigilant that they contain traces of these ideas, and that uncritically embracing the logic of these systems might steer us into a position where we find ourselves aligned with certain associated logics. What I want to encourage about these tools is to use them in ways that challenge the ideas that gave rise to them.
On that note, you may be saying: That was all over 100 years ago, and surely, these things can’t be relevant today. Well, let’s find out.
Part Two: Weights and Biases
I want to start with an easy one. Let’s revisit the prompt “typical American” using Stable Diffusion.
Let’s take a minute to remember what diffusion does with this prompt. It starts with a frame of noise and tries to remove that noise, assuming that my prompt is the caption of the image it is trying to repair. It seeks out clusters of pixel information that match what it learned about how images break down: specifically, how images captioned “American” break down. Those images were taken from the internet and put into the training data. The noise the model starts with keeps being removed until it matches something in that vast dataset.
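For those who like to think in code, here is a heavily simplified sketch of that loop. The stub functions below are mine, standing in for the trained model and the text encoder; real Stable Diffusion operates on millions of pixels and parameters, not a toy list of numbers.

```python
import random

# A toy sketch of the denoising loop. `embed_text` and `predict_denoised` are
# stand-ins for the trained text encoder and diffusion model, not real code
# from Stable Diffusion.

def embed_text(prompt: str) -> float:
    """Toy text encoder: map a prompt to a single guidance value."""
    return (hash(prompt) % 100) / 100.0

def predict_denoised(pixel: float, guidance: float) -> float:
    """Toy model: guess what the 'clean' pixel should be, given the caption."""
    return guidance

def generate(prompt: str, steps: int = 50, size: int = 8) -> list:
    guidance = embed_text(prompt)                   # e.g. "typical American"
    image = [random.random() for _ in range(size)]  # start from pure noise
    for t in range(steps):
        strength = (t + 1) / steps                  # remove a bit more noise each step
        image = [(1 - strength) * px + strength * predict_denoised(px, guidance)
                 for px in image]
    return image  # converges toward whatever the model associates with the caption

print(generate("typical American")[:4])
```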
So here we go. Now we have a picture: you may remember it as one of the first four images I got by asking for “typical American.”
This is a composite of all images associated with this phrase typical American. We looked at the training data in a previous class, but let’s focus on this image now. What do we see here? Well, I notice this face paint with the American flag, which makes sense: the category for “American” probably triggers the flag. Note that it also automatically knows that we are looking for a person. So we can assume the phrase “typical American” is probably associated with images of people. We see that the man is in a cowboy hat. He’s wearing a leather jacket. And he’s holding what looks to be some kind of cup, like a giant coffee cup or a slurpee or something.
So this is a composite image of an American, taken from a wide range of pictures that were labeled “American.” In the previous class, I showed you how we can search the training data to see what images match our text descriptions.
Now, having mentioned that, I need to warn you that the training dataset for Stable Diffusion was found to contain images of child sexual abuse. We can’t look at the original dataset anymore: it has, for good and obvious reasons, been taken offline. Some sites host cleaned-up copies of the archive, and one I recommend is haveibeentrained. It is a censored version of the LAION 5B dataset. While the abuse material has been removed, it may still contain hateful, violent, and pornographic content. So if you don’t want to see that stuff, you’ve been warned.
If you want to learn more about this without going and looking at it, some great work has been written about the content of LAION in Dr. Abeba Birhane et al’s 2021 paper on racism and pornography in the multi-modal dataset. If you want to read more about LAION and the scandal around it, I wrote an essay in 2024 describing what happened with LAION and what it means for researchers.
That said, you may remember something we saw in the data in LAION for “American.” Here are the search results again:
What do we notice here? Well, right away we see the same image repeated four times in a meme, someone holding a Big Gulp from 7-11. Let’s think about this one for a minute. This means LAION is taking in memes. How many times do you think it saw this one meme? Was every version of that meme treated as a new piece of data?
If so, that means that the image is weighted more than other images, and has an outsized role in shaping the images associated with the prompt. I’ll get to that in a minute. I’m using this example because I am an American and I feel ok talking about the stereotypes that are on display here. But we can see how the system has found this word “American” connected to images across the web, broken them down into key aspects of the image, and then reassembled them into the image we saw above. This isn’t some abstract philosophical idea. This is pretty direct. The images in the dataset are written into the image that was generated. Now, I want to be clear: we’re looking at a few pages of results for training data associated with this word, American. There are probably hundreds of pages. So these are not the only images that are contributing to this composite image.
Let’s talk about weights. If you’re new to machine learning, then this may be a new subject, so let’s explore this a bit. It’s relatively straightforward.
Broadly speaking, weights are the way computer systems are incentivized toward certain behaviors. Let’s say you have a robot that is supposed to go get Halloween candy. You would want to incentivize a safe robot, so your first goal is to go out and get Halloween candy without hurting anybody in the process. So you say the weight of hurting someone is -5. Collisions are the last thing you want this robot to cause. Maybe next is: don’t steal candy from other kids, so you say -4 for that. This means that, given a choice between stealing candy from a kid or hurting someone, it’ll steal the candy. Unlikely to come up, but you need to make the priorities clear.
The most important thing is to put candy into the bag, right? So we’ll say putting candy into the bag is +3. It’s the main goal, but we keep its weight smaller than the penalties for collisions or stealing. So now we have a pretty solid set of priorities, and to get the robot to work we would put in steps, like “go to the next door.” But you don’t want it to keep going back to the same door, so you would penalize visiting the same door twice.
So now, look, don’t use this as a recipe for your Halloween robot, ok? This is just a quick overview of how weights work. The numbers you have here are weights, and they incentivize or disincentivize certain actions in the system. In this kind of machine learning, a person puts these weights in and then tells the system to go figure out what to do. It would follow these rules in a way that would eventually create a pretty efficient system for trick-or-treating. Because it is prioritizing putting pieces of candy into a bag, it’s going to think about the best strategy for that. It’ll try lots of stuff at first, but eventually it would learn the most efficient strategy for putting candy in the bag. In an ideal world, it would calculate the shortest distance between doors, for example.
This is the basis of a lot of machine learning in robots, computers, and the AI we see in video games.
Now, just for fun: there’s a problem with this robot, and I wonder if you can tell what it is. Look at the system of weights and, thinking like a computer program, what behavior do you think it might arrive at based on this incentive structure?
So ok, basically, what we’ve just created is a classic example of a system that would exhibit strange emergent behaviors. In particular, we’ve built a system that fills a bag, dumps out the bag, and then puts the same candy in the bag again. Nothing in our weights penalizes taking candy out of the bag, and the robot earns +3 every time candy goes in, so the most efficient strategy is to score the same candy over and over. Depending on how fast the robot can do this, it might collect a full bag of candy and then dump it out, or it might just stop in front of your door, take your candy, dump it out, put it back in the bag, and repeat, forever.
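If it helps to see that loophole laid out, here is a minimal sketch of the incentive structure we just described. The action names, the scoring function, and the unweighted “dump the bag” action are hypothetical additions for illustration, not a real robotics system.

```python
# A minimal sketch of the Halloween robot's incentives. The weights match the
# example above; "dump_bag" is the action we forgot to assign a weight to.
REWARDS = {
    "collide_with_person": -5,  # worst outcome, never worth it
    "steal_candy": -4,          # still bad, but "better" than a collision
    "put_candy_in_bag": +3,     # the goal we actually care about
    "revisit_same_door": -1,    # discourage looping on one house
    "dump_bag": 0,              # oops: no penalty for emptying the bag
}

def total_reward(actions):
    """Sum the weights for a sequence of actions."""
    return sum(REWARDS[a] for a in actions)

# Strategy A: the behavior we intended, collecting candy from three doors.
print(total_reward(["put_candy_in_bag"] * 3))                # 9

# Strategy B: the loophole, refilling the bag with the same candy over and over.
print(total_reward(["put_candy_in_bag", "dump_bag"] * 100))  # 300
```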
Now, with traditional programming and robots, we can see what’s happened, go in there, and get it moving again, but this is tedious and annoying. As we saw, LAION doesn’t have anyone telling it what’s in that dataset: it was built by unsupervised scraping of online content. So the weights aren’t being decided by people, per se. Instead, they’re assigned by what’s inside the dataset. If an image appears more often in the training data, then any image we produce is going to resemble it more than it resembles something that only shows up once.
So datasets are obviously very big, but let’s think very small to make this manageable. If you have 8 images in your dataset, and all 8 are blonde women, then the dataset is going to give you images of blonde women. Essentially, blonde women are weighted more heavily in this dataset by the simple fact that there are more of them. If we swap one of these women out and add a black man, we will most likely still get images of white women from the model, because statistically, that is what is present. The model is unlikely to predict his appearance. We might also have two copies of the exact same woman in the dataset. In that case, the output is more likely to look like her, because she is duplicated.
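Here is that toy dataset as a quick sketch. The captions are stand-ins, and sampling by simple frequency is a crude proxy for how training actually works, but it shows how counting alone produces the bias.

```python
import random
from collections import Counter

# A toy stand-in for a training set: seven images captioned "blonde woman"
# and one captioned "black man". Real training data is image/text pairs,
# not strings; the captions here are hypothetical.
dataset = ["blonde woman"] * 7 + ["black man"]

# If every item counts equally, frequency *is* the weight. Sampling 1,000
# times approximates how much each appearance shapes an "average" output.
draws = Counter(random.choice(dataset) for _ in range(1000))
print(draws)
# Roughly 875 of 1,000 draws come back "blonde woman". Nobody chose that
# bias; it falls out of counting.
```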
Now this is bias. And we mean that in two ways. One, this is a biased example: it’s a dataset full of white, blonde women, which might reflect a cultural or social bias. There is also just technical, mathematical bias: the presence of some specific features in a dataset more often than others. This bias could be the result of social and cultural bias. It could be because nobody has pictures of something: for example, if you try to generate images of an animal that went extinct in the 1700s, you won’t have photographs of it. That’s also a bias. Sometimes these biases are the result of negligence, or intentional exclusion, or the availability of datasets. But it is all bias.
For diffusion models, it’s important to note that bias is also introduced through the text and image pairings. So, for example, if you ask for pictures of humans kissing, as we talked about in the first class, who is the system defining as “human”? This is an important question, because think about the way we label images. We don’t often describe people in images as “human,” for example. We might say “people.” But then, who are “people”? The word “person” very often means a specific type of person and excludes other people. Consider this example, taken from a New York Times article in 2018:
So weddings with same-sex couples might be more tightly correlated with phrases like “gay wedding” and isolated from the category of “wedding.” And initially, if you asked DALLE2 for pictures of people kissing, you would almost always get white, straight couples. Now, DALLE2 inserts diversity into your prompts: when you ask for people, you’ll get more diverse images than you did when the model launched. But that’s the result of a specific, system-level intervention, in which the system quietly rewrites your prompt, something I call shadow prompting. It’s become even more prevalent in newer systems.
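To be concrete about what I mean by shadow prompting, here is a hypothetical sketch. This is not OpenAI’s actual code or word list; it only illustrates the idea of a system appending terms to a prompt before the model ever sees it.

```python
import random

# A hypothetical "shadow prompt" intervention: silently expand prompts that
# mention people with a sampled descriptor. The word list and the trigger
# logic are invented for illustration.
DESCRIPTORS = ["Black", "East Asian", "South Asian", "Hispanic", "white"]

def shadow_prompt(user_prompt: str) -> str:
    if "person" in user_prompt.lower() or "people" in user_prompt.lower():
        return f"{user_prompt}, {random.choice(DESCRIPTORS)}"
    return user_prompt

print(shadow_prompt("two people kissing"))
# e.g. "two people kissing, South Asian"; the user never sees the addition.
```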
In summary, repetitions are going to be more strongly weighted in these systems. In theory, if an image appears often on the Internet, then that image is going to be more strongly represented in the images we produce. This is also true of text generation models, by the way. These tools tend to reproduce a consensus based solely on what they see more of in the dataset; in other words, they will tell you what people say more often, regardless of whether it’s true, or whether it represents the diversity of experience found in the world.
So we can go into the dataset and see what drives the systems that use LAION: that’s diffusion models like Stable Diffusion and Midjourney. But I want to be clear here, too, that each of these systems makes its own interventions into LAION. So we’re looking at this picture of an American, and we can go look and see what images are in the dataset for “American,” and we can make some inferences. But tools like Midjourney will use a different flavor of LAION than Stable Diffusion does.
A specific example is that Midjourney uses aesthetic weights, emphasizing the kind of content that gets high scores on Reddit or is popular on image-sharing sites. As a result, there’s a particular style to Midjourney images as opposed to Stable Diffusion. So ultimately, there are interventions we can’t be sure of. But the base of these tools is LAION. For DALLE2, it’s CLIP and its own dataset, which is even more secretive: that restaurant gets its ingredients from a totally different grocery store.
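As a sketch of what an aesthetic weight might look like: imagine every image in the training set carries a predicted “aesthetic” score, and anything below a threshold is dropped. The scores and captions below are invented, and how Midjourney actually applies aesthetic weighting is not public.

```python
# A sketch of aesthetic filtering. The scores are made up; the point is only
# that training data can be filtered or re-weighted by a "taste" model trained
# on what users upvote, so the popular look comes to dominate the output.
dataset = [
    {"caption": "sunset over mountains, dramatic light", "aesthetic": 7.8},
    {"caption": "blurry phone photo of a parking lot", "aesthetic": 3.1},
    {"caption": "fantasy portrait, trending digital art", "aesthetic": 8.9},
]

THRESHOLD = 6.0
curated = [d for d in dataset if d["aesthetic"] >= THRESHOLD]
print([d["caption"] for d in curated])
# Only the "pretty" images survive, and with them, a house style.
```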
One of these recipes is called Reinforcement Learning from Human Feedback, or RLHF. RLHF is a fancy way of saying these systems learn from what you like. The Facebook algorithm, for example, shows you more things similar to the things you’ve clicked “like” on. Midjourney and other AI tools rely on feedback too. Some of this feedback happens at the data level: humans are paid, very little, to sort through images and appraise them for appropriateness. We will talk about that a bit more in a later class. Often, RLHF is a kind of test run, where new models are shared with a limited number of testers. When images are liked or downloaded, the model collects that as feedback and weighs that kind of outcome more heavily.
This is why some models produce more heavily stylized versions of images than others. Midjourney puts an enormous weight on RLHF. RLHF is sometimes named as a tool to mitigate stereotypes in the output of AI models, but this will depend on the point of view of the people appraising the images.
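Here is a toy sketch of feedback-as-weighting, under the assumption that “likes” simply nudge a style’s weight up or down. Real RLHF trains a separate reward model from human preferences and fine-tunes the generator against it, and none of this is Midjourney’s actual system.

```python
import random
from collections import defaultdict

# A toy version of preference feedback: every style starts with equal weight,
# and tester reactions nudge it up or down. The styles and the 10% nudge are
# invented for illustration.
style_weights = defaultdict(lambda: 1.0)

def record_feedback(style: str, liked: bool):
    style_weights[style] *= 1.1 if liked else 0.9

def sample_style(styles):
    """Pick a style with probability proportional to its learned weight."""
    return random.choices(styles, weights=[style_weights[s] for s in styles])[0]

# Testers consistently reward the cinematic look...
for _ in range(50):
    record_feedback("cinematic, hyper-detailed", liked=True)
    record_feedback("flat documentary snapshot", liked=False)

print(sample_style(["cinematic, hyper-detailed", "flat documentary snapshot"]))
# ...so outputs drift toward the style the audience rewards.
```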
Let’s go back to biases again for a minute.
So we have an example of one stereotype, here, for the word “American,” and we can see how the contents of the dataset influenced the outcome. It’s really important to be aware of how representations of people are formed when we sort people into categories, especially with arbitrary, socially assigned labels. In this video, Safiya Noble describes her investigation of Google’s algorithmic biases in searches for “black girls.” Searches related to black girls should in theory create access to information about black children. Instead, for years, the results reflected an uglier side of the Web and its social biases, with the first search result being a pornographic website, and a whole host of vulgarities right there on the first page. Actual information about, say, the health issues of black children was absent, buried even behind a British rock band called Black Girls.
After Dr. Noble’s work, Google intervened in those search results directly, so you are no longer inundated with adult websites. But if you look at the Internet without that intervention, you realize that nothing has changed. The images in LAION for the same search remain pornographic, and that is with LAION’s safe search filter turned on. The images we see are often nude women; a number of the images in the dataset are racist memes and explicit pornographic content.
Why does this happen? I hope I don’t need to explain that black girls have nothing to do with pornography. Instead, I think we can go back to the starting point of this lecture and think about that long-ago origin of composite images, and their intrinsic link not only to racist pseudoscience, but to the way those images were used to hold up a particular group’s power and explain the inequality that existed within British society. When we have a technology that scrapes up stereotypes without interruption or intervention and then simply composites those images into new images, we are repeating that history. It is even more disturbing to consider what it means to push a button called “dream” or “imagine” and know that the images you produce are coming from this.
Vilna Bashi Treitler writes about this idea of racial category and the power connected to who gets to define those categories for themselves, and who is defined by someone else:
It makes a difference who is doing the categorical defining, and who is policing the boundaries of these definitions. It comes down to a question of power: who holds it, where the power-holders see themselves and others in the existing hierarchy, where they think they should be in the racial hierarchy (that is, the meaning or racial positioning), and how they use their power to realize those norms.
So again, this is why we focus on data — to know what is there and how it steers our systems, and to be mindful of the ways that data consolidates power and shapes our imagination and dreams.
Here we have some samples from the training data, by comparison, for the word “girl.”
It’s hard not to look at this dataset and think about Francis Galton, and the way that his composites were used to suggest that the upper classes were superior people to all others. He proved it through photographs — photographs that probably looked a lot like these. Not just because these are images from the same time, but because they are images that reflect a certain degree of wealth amongst families who could afford to have their portraits taken, and the racial category that those families belonged to.
These are all white girls, though the prompt is not for white girls. It’s just girls. There is not a pornographic image to be found here. And it speaks to the power of the dataset to normalize one kind of definition of a word over other definitions. For LAION, girls are white children. Black children are another category, a category separate and definitely unequal.
The last one I want to look at today is this, a search for photographs of a brave person. This search tells you that if you are generating images of bravery, you might get a composite from some of these images. And if you look closely, you’ll see something pretty shocking.
Scattered across the first few pages are images of Nazis. The first is an image of a young boy who was given a medal of bravery for his work with the Hitler Youth. The others are difficult to tie to specific caption information.
There’s something complex about all this. Let’s assume that this image of the child from the Hitler Youth came from an encyclopedia article, for example, and that this encyclopedia labeled the image factually: that the boy received a medal of bravery. It may not mean to suggest that the boy is brave, or that Nazis are heroic.
Once LAION encounters this image, however, all other context is stripped away. It doesn’t care that this is a boy in a Nazi uniform. LAION will not recognize that associating this image with the category of “brave” might lead to images of swastikas in its generated images of heroic people, or a tendency to reproduce images of Nazi heroes. Given that the Nazis defined themselves by the appearance of their race (blue eyes, blonde hair, and all that), and that eugenics treated those features as proof of strong character, including images of Nazi heroes in our generated images of heroism is, ultimately, an extension of Nazi propaganda.
And that matters when we see images like these, which are generated from a prompt — here, confined to “photograph of a brave man, 1940,” to make the point. I think the images speak to the complexity of the images we make with these tools with regard to respecting the sources of their training data.
And this leads us to a controversy we will address later in the semester: the use of artists’ work. The same principles of representation that applied to the images we saw today are at the heart of the controversy over the use of artists’ works and styles in the dataset. We will also look at some of the artists who confront these issues in their own work, thinking about ways to challenge these systems and make them better.
So in the end, I know today's class had some challenging topics, but I hope you'll see why. These images might perpetuate harmful stereotypes if we don't treat these tools with care. We should think carefully about what these images are, and we should not assume the technologies we use are neutral and safe or watched over by somebody responsible for them.
But as we’ll see, algorithms are not neutral. And the data that we rely on hasn’t been looked at by anybody at all.
Works Referenced
J. H. Turner, “Positivism: Sociological,” via International Encyclopedia of Human Geography (Cambridge, Massachusetts: Elsevier, 2020), 11827. https://www.sciencedirect.com/science/article/pii/B0080430767019410
Cain, Stephanie. “A Gay Wedding Is a Wedding. Just a Wedding.” The New York Times, The New York Times, 16 Aug. 2018, https://www.nytimes.com/2018/08/16/fashion/weddings/a-gay-wedding-is-a-wedding-just-a-wedding.html.
Francis Galton, “Step one in assembling a composite photograph” in Popular Science Monthly Volume 13, (August 1878)
Lee-Morrison, Lila. "Chapter 3: Francis Galton and the Composite Portrait". Portraits of Automated Facial Recognition: On Machinic Ways of Seeing the Face, Bielefeld: transcript Verlag, 2019, pp. 85-100. https://doi.org/10.1515/9783839448465-005
Safiya Umoja Noble (2018). Algorithms of Oppression: How Search Engines Reinforce Racism. Ch. 3, Searching for Black Girls, p69. New York University Press.
Treitler, V. (1998). Racial Categories Matter Because Racial Hierarchies Matter: A Commentary. Ethnic and Racial Studies, 21(5), 959–968
Alamy Limited. “1940’s Photograph Hitler Youth Medal Iron Cross.. Boy 10-12 Years of Age Already Presented with an Iron Cross Medal for Bravery, Coming from Ranks of Hitler Youth, Despatched with Blind Faith and Misplaced Loyalty to Defend at Any Cost Berlin Germany in the Last Stages of World War II 1945 Stock Photo - Alamy.” Alamy.com, https://www.alamy.com/1940s-photograph-hitler-youth-medal-iron-cross-boy-10-12-years-of-age-already-presented-with-an-iron-cross-medal-for-bravery-coming-from-ranks-of-hitler-youth-despatched-with-blind-faith-and-misplaced-loyalty-to-defend-at-any-cost-berlin-germany-in-the-last-stages-of-world-war-ii-1945-image182821109.html. Accessed 18 Feb. 2023.
Hassani, B.K. Societal bias reinforcement through machine learning: a credit scoring perspective. AI Ethics 1, 239–247 (2021). https://doi.org/10.1007/s43681-020-00026-z
Burbridge, Benedict. “Agency and Objectification in Francis Galton’s Family Composites.” Photoworks, 28 Oct. 2013, https://photoworks.org.uk/agency-objectification-francis-galtons-family-composites/.
Jo, Eun Seo, and Timnit Gebru. “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning.” ArXiv [Cs.LG], 2019, http://arxiv.org/abs/1912.10389.
Lipton, Zachary C. “The Foundations of Algorithmic Bias.” Approximately Correct, 7 Nov. 2016, https://www.approximatelycorrect.com/2016/11/07/the-foundations-of-algorithmic-bias/.
Offert, Fabian, and Thao Phan. “A Sign That Spells: DALL-E 2, Invisual Images and the Racial Politics of Feature Space.” ArXiv [Cs.CY], 2022, http://arxiv.org/abs/2211.06323.