Critical Topics: AI Images,
Class Eleven

DATASET DISSECTIONS

When engineers want to teach a machine how to see objects in the road, one way to do it is to have the system watch the ground. When an object interrupts that line of sight, a camera can, in theory, send a signal to the brakes of an autonomous vehicle (AV).

In 2017, Volvo trained its AVs to look for objects on the ground to avoid collisions. But it couldn't account for kangaroos. When a kangaroo jumped, the system read it as moving farther away, toward the horizon. The system recognized European land mammals, things like elk and caribou, because Volvo was a Swedish company, and things like moose and deer, because of the American market. Kangaroos were a different story. When they jumped, they may as well have been wizards: the machine saw them on the ground and then suddenly somewhere far off in the distance, and as a result, it wouldn't brake for them.
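To see why a jump confuses the system, it helps to know how distance is often estimated from where an object's bottom edge sits in the image. Here's a minimal sketch of that geometry, assuming a pinhole camera at a fixed height over flat ground. The numbers, function name, and parameters are my own illustration, not Volvo's actual system:

```python
# Minimal sketch: estimating distance from where an object's bottom edge
# sits in the image, assuming a pinhole camera over flat ground.
# All names and numbers here are illustrative, not from any real AV system.

def ground_plane_distance(bottom_row, horizon_row,
                          focal_px=1000.0, camera_height_m=1.4):
    """Distance to the point where the object touches the ground.

    bottom_row: pixel row of the object's lowest edge (larger = lower in frame)
    horizon_row: pixel row of the horizon
    """
    pixels_below_horizon = bottom_row - horizon_row
    if pixels_below_horizon <= 0:
        return float("inf")  # at or above the horizon: "infinitely" far away
    return focal_px * camera_height_m / pixels_below_horizon

# A kangaroo standing on the road: bottom edge well below the horizon.
print(ground_plane_distance(bottom_row=700, horizon_row=500))  # ~7 m

# The same kangaroo mid-jump: its bottom edge rises toward the horizon,
# so the flat-ground model reads it as suddenly much farther away.
print(ground_plane_distance(bottom_row=520, horizon_row=500))  # ~70 m
```

The flat-ground assumption baked into that formula is exactly the kind of modeling decision the next paragraph is about.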

Machines don't understand reality; they understand models of reality. Those models are described to them, and the way we describe the world to create those models is through data. To get a system working, you have to translate observations into data, limited by the constraints of your system. Someone decides what matters most to get that system working. The observations that relate to what you want to teach the machine become your data, while the rest gets discarded. Those decisions rely on subjective judgments and frames of reference. For example, there is no need to warn a Swedish system about kangaroos.
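As a toy illustration of that translation step, here is a hedged sketch of what reducing an observation to a record can look like. Every field name, class, and label in it is my invention; the point is just that someone chose which fields exist at all:

```python
# A hypothetical schema for road observations. The fields and label set are
# invented for illustration. The schema itself is a set of judgments:
# anything without a field (a kangaroo's hopping, say) can't be represented.

from dataclasses import dataclass

LABEL_SET = {"moose", "elk", "deer", "caribou", "pedestrian", "vehicle"}

@dataclass
class RoadObservation:
    object_class: str       # drawn from a fixed label set decided in advance
    distance_m: float       # assumes the object is on the ground
    lateral_offset_m: float

def encode(raw_detection: dict) -> RoadObservation:
    # Observations that don't fit the label set get flattened to "unknown";
    # this is where the "least important" information is discarded.
    label = raw_detection.get("label")
    if label not in LABEL_SET:
        label = "unknown"
    return RoadObservation(
        object_class=label,
        distance_m=raw_detection.get("distance_m", float("inf")),
        lateral_offset_m=raw_detection.get("lateral_offset_m", 0.0),
    )
```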

Lots of problems can worm their way in through gaps left by what gets discarded from a dataset. Even more come through when that dataset gets adapted to purposes it wasn't intended for. If you collect data on one population, or from just one place, it's very unlikely to be usable for others. Likewise, if you collect images of one type of person or one type of place, it's going to be harder to make pictures of other people or places with GANs or diffusion models.

“Most data arrives on our computational doorstep context-free and ripe for misinterpretation. And context becomes extra-complicated [with] poor data documentation.” — Catherine D’Ignazio & Lauren Klein, Data Feminism (2020)

We’ve talked a lot about data collection in image sets, and now we’re going to do a full-on dataset dissection. We’re going to pick a dataset, open it up, and see what’s inside. Here’s why.

First, from Kate Crawford and Trevor Paglen:

“Training sets ... are the foundation on which contemporary machine-learning systems are built. They are central to how AI systems recognize and interpret the world. These datasets shape the epistemic boundaries governing how AI systems operate, and thus are an essential part of understanding socially significant questions about AI.”

In other words, we study training sets because if we want to know what an AI is giving us, we have to understand the data it is using.

Second, the datasets used for making AI-generated images come from many different sources, created for many different purposes. Remember: if you want to generate images with a GAN, you need specific categories of images to train it with. The GAN can then make more images of that category. Diffusion models took in all kinds of images from all kinds of sources, but many of those sources were still these massive datasets built for GANs.
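In practice, "specific categories" often just means a folder of images per class. A minimal sketch using torchvision's ImageFolder convention, where the path and class names are placeholders rather than any real dataset:

```python
# Minimal sketch: the folder-per-category layout many GAN training sets use.
# torchvision.datasets.ImageFolder treats each subdirectory name as a label:
#
#   data/kangaroo/0001.jpg
#   data/moose/0001.jpg
#   ...
#
# The path "data/" and the class names are placeholders.

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(64),      # GANs typically train at one fixed size
    transforms.CenterCrop(64),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("data/", transform=transform)
print(dataset.classes)          # e.g. ['kangaroo', 'moose']
```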

These datasets were often collected for a variety of machine learning tasks, not just training GANs. For example, you might have a dataset of sign language gestures, built so that an image recognition model could read hand movements and translate them. You might have a dataset used to determine when someone was looking at or away from a camera, so a system could monitor students watching videos. A dataset might help detect a kangaroo in the road or in the air. It might be used for surveillance, as we’ve discussed, or for figuring out if someone is picking their nose, or whether a laundromat or parking lot has room for a new customer.

These image sets have been created for all kinds of specific purposes and then reused for making images with GANs or diffusion models. Other datasets were built specifically for image generation. Datasets are hard to make, and once you’re done with them, they sometimes don’t serve much further purpose, so you can find lots of image datasets online. They’re often recycled by people who don't know where the data came from or why it was collected. They don't know what decisions informed it, or what the limits of the data were. But making data available for others to use is a nice gesture, too. It helps people create meaningful work, or projects, without starting from scratch.

The question I like to ask is: what if we were tourists visiting a new place, tasked with describing what we saw to friends who might want to go there? Our role with datasets can be the same. If we are looking at datasets (not making them, though we’ll do that too) then we’re tourists in the data. How do we make sense of it? How might we decide whether to use a dataset or not?

Finally, looking at a dataset helps us understand what image-making tools see, and how its contents are shaped so that machines can make sense of them. We can get a sense of things like categories and labels, as well as how images have been cropped or otherwise prepared before being analyzed. We can see who and what is represented there, and think about how that translates into what we see.
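If you want a concrete starting point for that kind of looking, here is a short sketch of a first-pass dissection. It assumes only that the dataset is unpacked as folders of JPEG images; the path is a placeholder:

```python
# A first-pass dissection: tally the labels and image shapes in a dataset
# laid out as folders of images. The "dataset/" path is a placeholder.

from collections import Counter
from pathlib import Path
from PIL import Image

root = Path("dataset/")
label_counts = Counter()
size_counts = Counter()

for image_path in root.rglob("*.jpg"):
    label_counts[image_path.parent.name] += 1   # folder name as the label
    with Image.open(image_path) as img:
        size_counts[img.size] += 1              # (width, height)

# Who and what is over- or under-represented?
print(label_counts.most_common(10))
# Have the images already been cropped or resized to a uniform shape?
print(size_counts.most_common(5))
```

A tally like this won't tell you where the data came from or why, but it makes the categories and preparation choices visible enough to start asking.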

If you have never opened up a dataset and explored it before, the video above will walk you through the process of looking. It’s hard to replicate in text, so I encourage you to watch it for ideas on how to find, explore, analyze, and critique datasets.