Sounds Like Music: Toward a Multi-Modal Media Theory of Gaussian Pop

Diffusion Glitch (2024) Eryk Salvaggio. 

This is an edited and expanded transcript of a seminar delivered on October 23, 2024 at the Royal Melbourne Institute of Technology (RMIT) as part of a series of dialogues around This Hideous Replica, at RMIT Gallery, arranged by Joel Stern. It was presented in association with ADM+S, Music Industry Research Collective, and Design and Sonic Practice. 


I've been an artist for 27 years, working initially with the Internet in the 1990s, thinking about what this technology was, how it worked, and the social questions it raised. Art making has always been a research practice into technology, a means of exploring unstructured curiosities about what the technology is and how it works, but also zooming out and looking at its entanglements with culture and society. In essence, every medium had fuzzy borders, because I was never interested in any specific medium so much as how such media broke down. 

With technology, you could always combine all kinds of approaches. This is well-trodden territory. You can be a visual artist thinking about music. You can be a musician thinking about writing and video making. It’s possible to say that every medium could be melted down and rearranged in the service of this curiosity. 

Generative AI and its corresponding media formats seem as if they are not tied to any kind of specificity. It’s one big churning of media into a single media format: text becomes images, images become sound, sound can be video. Such a capacity for translation between media forms is called “multimodal.” It may be tempting to suggest that artists like myself, working in the loose in-between of formats and genres, work the way Generative AI does. But as we'll see, a lot of strings are attached to Gen AI’s breakdown and reassembly of media. In particular, the translation for many of these — video, images, and audio, though not LLMs — depends on a diffusion process, the addition and subtraction of noise as a mechanism for steering production. This is a talk about AI-generated music, but I will talk about generative visual AI because AI-generated audio is images. It is, literally, an image of sound.

What do we mean when we talk about Generative AI? When I talk about AI, there are different formats, some of which are more responsive, machine learning-based tools that can respond to data and allow you to improvise in real-time. But in this talk, even if I say “AI,” I really mean generative AI, and within generative AI, I mean these diffusion models that are at the heart of most commercial systems today. Diffusion often means the user is prompting a model — and as it is with images, the result is a fully formed piece of media — a picture, a song, or a video. Diffusion can be used to produce audio samples, but this shifts the definition of the medium. Generated loops assembled by a person into a song or various AI-generated images assembled into a collage are a pastiche of generated media.

What I want to look at today is not that but the generated media itself: the default status for most commercial AI tools is to produce whole pieces of media in response to — typically but not always — a human-authored prompt. By this I mean websites like Udio and Suno, where you can ask their audio models to produce, say, Yacht Rock. You do this by typing “Yacht Rock” into a prompt window. In a few seconds, two tracks will have been generated for you. 

What about multi-modality? Multi-modal media refers to the capacity to work with multiple media formats at once, and it goes way back — if you study rhetoric, you study gestures alongside language. But those are complementary streams, whereas now we’re referring to alchemical streams, transmutation through melting one media format to another after its reduction to data. In AI, multi-modality is defined according to data types, what goes in and what it becomes when it goes out. IBM says it this way: 

“Multimodal AI refers to machine learning models capable of processing and integrating information from multiple modalities or types of data. These modalities can include text, images, audio, video and other forms of sensory input.” 

Moving from text to image, image to text, and image to sound are all forms of multimodal media production. To get at a multimodal media theory of generative AI, we are looking at a subset of media theory and applying it, specifically, to the soup that all of this data becomes. In AI models, we are talking about diffusion: breaking media into noise, and then reconstructing it. So, let’s take this system apart. 

Interfaces 

We can start by looking at the interface of this tool. If you do, you’ll notice that it resembles Spotify. Udio and Suno and others like it are not digital audio workstations. This is not a platform like Kopi Su’s Koup, designed for a user to generate samples from their own stems and integrate them into a sampler-based system, such as creating a drum track to play guitar over. Udio and Suno are clones of Spotify, but the library of recorded music has been replaced with access to probability models that can generate recorded music that fits specific genre requests. 

Interfaces reflect priorities. We can see what those priorities are just by reading the interface. These priorities leave behind design artifacts, a result of decisions designers make about what to do, what to emphasize, what to allow, and what to dissuade us from.

Of course we can use these systems in all kinds of ways, but interfaces reflect priorities. We can see what those priorities are just by reading the interface. These priorities leave behind design artifacts, a result of decisions designers make about what to do, what to emphasize, what to allow, and what to dissuade us from — as reflections of a way of thinking about the world and about the product. Reading the products of these interfaces — their generative “multimodal media” outputs — means looking at the structure that produces them (the interface) and how it operationalizes certain positions and beliefs about the product. 

Udio, Suno and commercial generative AI music systems aren’t tools, they are jukeboxes — the name, by the way, of the original OpenAI audio generation tool, Jukebox (which was built on VQ-VAEs and autoregressive transformers rather than diffusion). Rather than replacing a digital audio workstation, they are designed to replace Spotify and streaming music services. They exist so that audiences can listen to playlists that reflect or blend genres — in this session, we generated opera and yacht rock at the same time — and though lawyers have made it so that we can't say Taylor Swift, we can still describe Taylor Swift-like music, and it will generate a bunch of similar songs. It will do so endlessly, and the company doesn't have to pay anyone.

This is a long tradition in Silicon Valley: Lyft and Uber were subsidized rides, paid against the future that they hoped their data was bringing into the world, the driverless vehicle. Spotify has flirted with royalty-free music for a long time, allegedly hiring session musicians to play songs they then put on their playlists. With those playlists, Spotify has created a window where you can listen, but they don’t have to pay anybody. They get your attention for free. 

So we generated a song here — we requested opera and yacht rock — but notice that it took my prompt and changed it. This is something called shadow prompting, and it shows up a lot in image and video generation. It just translates what you've described into terms the model can work with and figures out how to summarize your prompt into something more precise. You can turn this off, but it’s helpful to look at interfaces from their defaults because that tells you something about priorities, as well as shows you the average audience encounter with the system. In this case, as it often is, shadow prompting further alienates the user from their desires, translating them into language better suited to a machine’s capabilities.

Training Data 

AI-generated music is not musician-less: where else would the training data come from? If you talk about multi-modal media, you must talk about training data. Much of it comes from stock libraries — public domain music — but this gets boring fast. Some of it comes from free sample libraries, and frankly, the “clean” models, i.e., the models trained on appropriately licensed, free music, sound like it. What the better models do, unfortunately, is scrape YouTube, which hosts vast piles of music that is not licensed. Artists are getting penalized on both sides. Someone's uploading music that isn't theirs to upload, so the artist isn't getting those streaming revenues, and then the songs go into training data and the artists aren’t compensated. 

A book of pictures has a different sense of meaning if you look at it as your family’s photo album, as opposed to training data. But these distinctions can occupy the same space.

Let’s talk about memory. In my work, I think a lot about the relationship between archives, such as archives of cultural memory and training data. What is the difference between our relationships with data and with archives and cultural memory — even personal memories? A book of pictures has a different sense of meaning if you look at it as your family's photo album, as opposed to training data. But these distinctions can occupy the same space.

An essential part of AI conversations, one that's often missing, is where such training data comes from — we talk about this more often these days — but also what it means and what we're doing to it. That is, what is the context this memory sits in, and what does the machine do to that image and sound? What does that image or sound do to the originals? And frankly, who has the right to do that? Before the photo album becomes multimodal, it has to become data. To become data, it has to be datafied through a process of becoming noise. More on that in a moment. 

Let’s start with the archive. Richard Carter has an excellent paper on the subject of something he’s called critical image synthesis. He describes critical image synthesis as:

“...the prompting of imagery that variously interrogates and makes visible the structural biases and cultural imperatives encoded within their originating architectures.” 

He's talking about training data in AI and the process of preparing and using this training data, and writing with it. 

“In framing AI image synthesizers as an inverted form of machine vision, as generating rather than classifying imagery through text, an opportunity is afforded to consider how they reflexively characterize themselves within their own latent spaces of representational possibility.”

Generative AI is digital humanities run in reverse. Everything you use to classify something, everything that you gather into an archive, every taxonomy or metadata structure you use to describe what's in your archive, everything that Spotify has assigned as a genre label based on the waveform patterns of a piece of music, gets sucked out of the training data, and that association gets flipped.

So now, instead of taking a new song that's been uploaded, analyzing its wave patterns to determine its similarity to other songs that are, say, in the chillwave genre, and then tagging it chillwave, we do it backward. You type in chillwave, and the associations once used to define that piece of music are run in reverse: the system traces the conventions that align with that classification and renders the information as a waveform. It's a reversal.

AI becomes a system for producing approximations of human media that align with all the data swept together to describe that media. That can help us think about how we might probe and interrogate it. I do this with images. I'll talk about this a bit, because I have not done it as much with music. But today I want to think about how we might approach some of these models to do this with music. 

Critical Media Synthesis 

Carter asks us, specifically, “what kinds of imagery do these systems yield in response to prompts, centering on keywords associated with machine vision technologies?” In the walkthrough of the exhibition, just before we started, I had a conversation about using unspecified as a genre tag for music synthesis. A way of generating a sample of the things that come into the training data that are unclassifiable — or at least, nobody’s gotten around to annotating them, and they get this label of unspecified, itself, perhaps, appended by a lazy algorithmic default. 

What might unspecified music sound like, and what might it tell us about the data and the system that uses it? Asking that question is, I think, an example of what critical music synthesis might entail. 

It reminds me of this way of generating images using similarly autogenerated metadata. Some folks use prompts like “IMG_5929.jpg,” which is a way of referencing a category of photos that don’t have human captions—you can use any number. The point is to reference autogenerated metadata straight out of a camera rather than a human caption. “Untitled Document” might do a similar thing in an LLM. 

That type of interrogation seems like idle wordplay, and it is. But it comes out of a process of thinking critically about the system. Rather than aiming at automating media production, this approach relies on curiosity, prompting a system not to produce a media artifact but as a way of getting to its logic, exposing its training data, or finding fissures in the internal automation pipeline. It creates an object for analysis. These inner workings are increasingly challenging to access. Prompting offers us access by producing generalizations of the training data around that category, which we can then read. 

So, I think we can take Carter’s label — critical image synthesis — and broaden it to become: “critical audio synthesis, the prompting of audio that variously interrogates and makes audible the structural biases and cultural imperatives encoded within their originating structures,” etc. 

We can go a step further and call this critical media synthesis or perhaps some new kindling to enflame the embers of a multimodal media theory — because these systems used to produce images, sound and video are not as distinct as the media categories themselves. An image, a video, and a sonic experience all seem like very different experiences for us, with their own rules and ways of breaking them. This is, of course, true for the human who sees them. But it’s less accurate for the system that produces them. There, they are all the same until the last moment, when the file, raised like a spirit from the incantation of our prompt, is revealed to the world. The inner workings of the thing are shockingly similar, be it audio, video, or images. They’re numbers, without essential properties that define them as a specific media format.

Generated images, audio and video all start with noise.

Generated images and audio and video all start with and center around noise. When I say noise, sometimes people think too conceptually. Yes, there is a conceptual layer to the noise, which is fascinating and we may get there. But I mean a quite literal technical definition: a jpg full of randomized color values, a chaotic dispersal of arbitrary color information. It is a jpg — or a png or a gif, but always an image file — whether you are writing images or audio or video. The heart of 21st century media production is a noisy jpg. 

Noise at the start of a generation process in a diffusion model.

It is worth asking what traces of this process are left across these media formats when they are produced in a similar way. How might noise haunt the media we generate, how might the logic of diffusion shape our encounter with it?

When you start prompting and open up an AI synthesis tool, the first thing it does is throw random color values onto a JPEG. So I just want to be clear. I'm being very literal with this image. This is an example of the noise that forms the starting point of almost every generated image, an approximation of random clusters of pixels. The reason for this is that this noise is the end state of the training process, and it starts here because it has to reverse this noise in order to get to an image.
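
If you want to see that literal starting point for yourself, here is a minimal sketch (nothing specific to any model, just numpy and Pillow) that fills an image file with Gaussian-distributed random color values:

```python
import numpy as np
from PIL import Image

# A minimal sketch, not any particular model's code: fill an image with
# Gaussian-distributed random color values. This is the literal "noisy JPEG"
# that the generation process begins from.
rng = np.random.default_rng(0)
noise = rng.normal(loc=128, scale=64, size=(512, 512, 3))
noise = np.clip(noise, 0, 255).astype(np.uint8)

Image.fromarray(noise).save("noise.jpg")
```

That file, in the most literal sense, is where generation starts.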

It is worth asking what traces of this process are left across these media formats when they are produced in a similar way. How might noise haunt the media we generate, how might the logic of diffusion shape our encounter with it?

Training Diffusion 

So, just to walk through it, I like to do what I call a slow reading of the training process. Say I’ve uploaded a photograph in a Facebook post saying flowers for my sweetheart. Facebook, in the US, arguably has the legal right to take that image and its caption, and many places that don't have that right might take it anyway.

Now this image and my caption, flowers for my sweetheart, go through this training process. That training process looks like a quick removal of information, in stages. So, the first thing that happens is to remove some information and pull the background out. Then, it compares the noisier image to the original. The noise moves through the image in a predictable way because it follows a Gaussian distribution pattern: we can predict how the noise will spread through the image.

Because it's predictable, you can figure out how to get from an image of randomly colored pixels — a noisy JPEG — back to a perfect picture of that flower. It follows what is called a U-net structure, where higher levels get compressed, then the lower complexity remnants — fuzzy shapes of flowers — get compressed, and then it’s this tiny image with basically no detail at all. That’s one half of the “U,” the descent into noise. It does this to thousands or millions of images of flowers, cats, or whatever. Then, the noise is all clustered together, offering a vast explosion of possible content used for the second part of the U, where basic structures are filled in with increasing levels of detail and refinement. 

The U-Net means that what degrades first is the background, and what lingers longest are pure, general shapes. These general shapes are then the first things generated in the production process, which is the structuring of the composition. So, just to be clear, you have all this training data that's degraded into literal noise at every step. That noise is being peeled away, with the last thing that remains in the image being most strongly correlated to the word used for it. When you ask it to generate an image, all of that has been amalgamated into patterns to search for in the noise, by generating random noise and then reducing it based on those patterns.

I’ve written a lot about this, and if you’re interested in learning more about the diffusion process and its cultural and social implications, you can watch a short film I made about it, Flowers Blooming Backward Into Noise. But for now, the key is this: abstract the training data by reducing it all to noise, map out how it breaks apart into mathematical patterns, and then piece random noise back together by following those mathematical rules.
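
For those who want the arithmetic behind that slow reading, here is a hedged sketch of the forward, noise-adding half of the process, using the standard closed-form Gaussian diffusion step; the variable names and the noise schedule are illustrative rather than taken from any specific model:

```python
import numpy as np

def forward_diffuse(x0: np.ndarray, t: int, betas: np.ndarray, rng) -> np.ndarray:
    """Add Gaussian noise to a clean sample x0 up to timestep t.

    This is the 'descent into noise' described above: each step removes a
    predictable amount of information, so the whole trajectory can later be
    reversed. The closed form follows the common DDPM-style formulation.
    """
    alphas = 1.0 - betas
    alpha_bar = np.prod(alphas[: t + 1])           # cumulative signal retention
    noise = rng.normal(size=x0.shape)              # Gaussian noise, same shape as the data
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 64))                     # stand-in for an image (or audio) array
betas = np.linspace(1e-4, 0.02, 1000)              # a common linear noise schedule
x_noisy = forward_diffuse(x0, t=999, betas=betas, rng=rng)
# At t=999 almost nothing of x0 survives: the sample is effectively pure noise,
# which is exactly the end state that generation starts from.
```

Generation then runs this trajectory in reverse, predicting the noise and subtracting it step by step.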

Gaussian Distributions of Noise 

Let’s talk, then, about the Gaussian distribution. This quote is from an early computer artist, Leslie Mezei, because I like to quote artists instead of computer science researchers. I like to quote writing from the 1960s instead of today when I talk about AI, because I need to be clear that it is all just computers and that nothing we call AI today is new. Mezei writes: 

“Many computer applications involve the use of random numbers generated from within the computer. The two most common distributions of such random numbers are the so-called rectangular or uniform, where the likelihood of obtaining any particular number within the allowed range is the same as the likelihood of any other number. The gaussian distribution means that the nearer to the average a number is, the more likely it is to be chosen.”
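
In contemporary terms, the distinction Mezei is drawing looks something like this, with numpy's samplers standing in for the random number routines of the 1960s:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Rectangular" / uniform: every value in the range is equally likely.
uniform_draws = rng.uniform(low=0.0, high=1.0, size=100_000)

# Gaussian: values near the average are far more likely than values at the edges.
gaussian_draws = rng.normal(loc=0.5, scale=0.1, size=100_000)

# Count how many draws land within one tenth of the center (0.45 to 0.55).
near_center = lambda x: np.mean((x > 0.45) & (x < 0.55))
print(f"uniform near center:  {near_center(uniform_draws):.2f}")   # ~0.10
print(f"gaussian near center: {near_center(gaussian_draws):.2f}")  # ~0.38
```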

Contemporary AI-generated music uses a diffusion model to generate a waveform. Generation is a misnomer. It doesn’t create the sound. It generates an image of the sound. And even further, because this data is intended to be sonified, it generates an image of a sound file full of white noise. Of course, this sound file is also just a stream of numbers — it is the numbers that are malleable, but the numbers, at the outset, are intended to represent the closest a computer can get to chaos. 

To be very clear, if you're unfamiliar with music, most audio in electronic environments is represented as a waveform. You look at the shape of the sound wave, the shape of air vibrating, which in turn shakes our eardrum. Since we can understand the relationship between vibrating air and the human ear, we have long known that sound has a visual counterpart. This has been represented as a wave, rising and falling as dips and peaks over time. Sound is air moving in waves that rattle inside our ears. Noise is a solid wall rather than a wave: it is sound that exceeds the threshold of our capacity to find a signal within it. If you generate a set of random pixels, or numbers, into the form of a sound wave, you get noise because the wave does not flow. It represents an absence of signal by representing too many signals simultaneously. 
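
To make that literal, here are a few lines (assuming numpy and scipy are available) that arrange random numbers into the shape of a sound file and produce exactly the wall of noise described above:

```python
import numpy as np
from scipy.io import wavfile

# Random samples written out as a waveform: too many "signals" at once,
# which the ear hears as white noise rather than as a wave that flows.
rng = np.random.default_rng(0)
sample_rate = 44_100                                       # samples per second
samples = rng.uniform(-1.0, 1.0, size=sample_rate * 2)     # two seconds of random values

wavfile.write("white_noise.wav", sample_rate, (samples * 32767).astype(np.int16))
```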

When generating sounds, diffusion models simply draw these waves, looking at what shapes came before and predicting what might come next. They do so, at the outset, by taking such random noise and working it backward into a more recognizable sound. Diffusion trains by adding noise to training data, which is what we just saw with images. When we train a diffusion model with music, we're actually thinking about taking a piece of music and adding noise. Add fuzziness, hiss, and static to remove information from a sound recording with discernible patterns until it becomes indiscernible as anything other than white noise. A file full of white noise will then need to be sculpted in the shape of something we have described in a dataset. Often, this relies on genre tags, such as “Chillwave.”

White noise becoming a snare drum, from a paper by Simon Rouard and Gaëtan Hadjeres.

To the right is an illustration of that process for generating the sounds of a snare drum. This is from a white paper on audio synthesis by Simon Rouard and Gaëtan Hadjeres. What you have at first is a drawing of a block of white noise. If I were to play this file, it would sound like a burst of the most blaring, horrible rainstorm recorded through a bad mic.

But what the model does is figure out the shape of the waveform, which, as pure noise, is unsuitable for a snare drum. A snare drum waveform has a high peak that then trails off. So, a diffusion model will take this noise and then shape parts of it to make a sound. And if you do this with long enough periods of noise, you can move from generating a snare drum to a drum loop to a drum and bass loop and then longer and longer extrapolations of that until you get to a whole piece of music.
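
To be clear, the sketch below is not how a diffusion model works internally; it is a toy illustration of the observation above, that a snare is roughly noise under a sharp attack and a quick decay. Shaping white noise with such an envelope gets you surprisingly close to a snare hit, which is part of why a denoiser can find one inside a block of noise:

```python
import numpy as np
from scipy.io import wavfile

rng = np.random.default_rng(0)
sample_rate = 44_100
duration = 0.3                                        # seconds
n = int(sample_rate * duration)

noise = rng.uniform(-1.0, 1.0, size=n)                # the block of white noise
envelope = np.exp(-np.linspace(0, 8, n))              # high peak, then a quick trail-off
snare_ish = noise * envelope                          # noise "in the shape of" a snare

wavfile.write("snare_ish.wav", sample_rate, (snare_ish * 32767).astype(np.int16))
```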

That is noise in the shape of a sound, a data visualization that sounds like music. Audio synthesis produces a recording of an object that doesn’t exist in a place that doesn’t exist, producing a sound that was never produced. 

Now, with newer models, it seems that the tonality and timbre and whatnot come through spectrograms. If you take a spectrogram and try to play it back through an arbitrary spectrograph system, it doesn't work. It must be paired with a tool with a local reference to what created that spectrogram. A spectrogram is a map of music, not a music file. 

Demonstration of an algorithmically-generated audio track featuring bossa nova music accompanied by electric guitar, created using Riffusion, an open-source fine-tuned derivative of the Stable Diffusion image-generation diffusion model that has been retrained to generate images of audio spectrograms, which can then be converted into audio files. By Benlisquare - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=126736300

Generated audio is noise in the shape of a sound, a data visualization that sounds like music. It is a recording of an object that doesn’t exist in a place that doesn’t exist, producing a sound that was never produced. 

I'm sorry if you're not a computer music nerd, but basically, what we are looking at here is an image of sound that's a bit different and, at least for me, a bit less directly readable than the image we saw before, with its straightforward peaks and valleys of sound waves. Waveforms are direct representations of the shape of air — that snare drum is the shape of the waves emanating from a drum, right? It is the shape of air moving through the room and how it gets translated into vibrations in our eardrums. 

With spectrograms, you're getting at something that's more about timbre and tonality and texture and structure over time. And I have tried this: I've gone into Stable Diffusion or one of these image-making tools, generated a spectrogram, and played it back in a spectrogram player. It doesn't quite work, because the player has to have an understanding of how that information was encoded in the first place. The models generate spectrograms of sound from the training data, and then they have these spectrograms, and they analyze them as images — a visualization of sound that I could not directly translate back into sound for this room.
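
Here is a sketch of that limitation, assuming scipy: a magnitude spectrogram keeps the map (which frequencies are present, and when) but throws away phase, so playing it back requires a decoder that knows, or can estimate, how the map was encoded:

```python
import numpy as np
from scipy import signal

# Build a short test signal (a 440 Hz tone plus a little noise) so the sketch
# is self-contained; a real track would be loaded from a file instead.
sample_rate = 22_050
t = np.arange(sample_rate * 2) / sample_rate
audio = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)

# Short-time Fourier transform: frequency content over time.
freqs, times, stft = signal.stft(audio, fs=sample_rate, nperseg=1024)

magnitude = np.abs(stft)     # the spectrogram "map": which frequencies, when
phase = np.angle(stft)       # what a magnitude-only spectrogram throws away

# With the phase, the transform inverts cleanly back into audio...
_, recovered = signal.istft(magnitude * np.exp(1j * phase), fs=sample_rate, nperseg=1024)

# ...but from the magnitude image alone the audio can only be estimated
# (Griffin-Lim-style iteration, or a decoder trained on how the spectrogram
# was encoded), which is why the image is a map of music, not a music file.
```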

(At this stage François Tétaz, a composer and music producer, politely interjects). 

François Tétaz: I couldn't tell you what all of this visual information represents. I know that there are elements of frequency — it's just frequency over time. So if this is frequency, these things (points at bottom of the spectrogram) are fundamental. So this would be an instrument forming a chord (gestures up and down across the spectrogram). This is time — (gestures at blocks in the spectrogram moving left to right). So these are all beats in the top here (shows spiking blurs in the image). These are beats for me. So that's pitch you can hear. And then these are harmonics forming over the top. And up the top here would be noise, which you probably wouldn't hear. Those (gesturing below the upper section) would be the fundamentals you hear. (Points to a repeating, darker pattern) This is a bass drum or something like that, where it's constant. See this thing here? So it's constant. And it's in 4/4. You can see there's four beats in every bar. So there's a whole map in here.

And some things can convert images to spectrograms. But if I took this and put it in a random spectrogram player, I don't think I'd get the same sound. They aren’t playable, right?

François Tétaz: Right. They're not at all.

Otherwise, this would be the world's most efficient music compression software. So there are, unfortunately, no ways of getting to this inside most AI models, playing with it, and trying to generate something differently. So, audio models work with the spectrogram as an image, and/or they work with that sort of waveform image. Both are essentially being noised and denoised in the ways we saw with that image, and then the result is translated to sound. There is also, sometimes, an element of LLMs here: tokens, which are used to steer the direction of the track musically, such as, you know, if this note is here, the next note can be x, y, or z. So that's it in a nutshell. That's AI music synthesis.

Looking at this spectrogram is helpful because AI video works like this, in that you have this compression of multiple images over time. So, if you were to look at the AI video generation model's training artifact, you would see slices of time stretched across the image according to its duration. If I were to walk across the room, you would see my leg starting here and then being stretched as I moved across the room, according to the slices of time you were sampling. But it flattens it down to one image and then unpacks it by piecing the slices together, knowing that the width of each vertical slice in the image represents a specific time span.

This is why, if you watch an AI-generated video, you'll sometimes see people moving while their legs stay still: the time slice doesn't capture the full movement of the leg. I once generated a video of a metronome to test this, and it doesn't know what to do with the metronome needle because it's moving too fast. So it just sort of floats around, because it's an amalgamation of things that move in time in different ways, and then it's trying to render that.

Based on what was said about spectrograms, then, audio generation is working the same way that generated video works, in that it's slices of music that are then stretched out over time. 

The Averaging of Music

When talking about the Gaussian distribution, we're talking about the most likely thing to occur in what seems to be randomness. Like polling, for example, right? You sample 1,000 random people to figure out who will get elected president because, in theory, a random sample of a large enough population produces a kind of representative slice of the larger, hypothetical population you’re sampling. As with voters, so with notes, words, or pixels: you predict them based on a large enough dataset to produce something from this hypothetical population. 
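
The statistical intuition is just sampling: a small random slice stands in for a much larger hypothetical population. A toy sketch, with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical electorate of ten million voters, 52% of whom favor candidate A.
population = rng.random(10_000_000) < 0.52

# Poll 1,000 of them at random and use that slice to estimate the whole.
sample = rng.choice(population, size=1_000, replace=False)
print(f"sample estimate: {sample.mean():.3f}")   # close to 0.52; sampling error here is roughly 1.6 points
```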

That's the same principle that applies to data in terms of the decisions made from one slice of music to the next. I have talked about this before with images. When we were looking at the London cholera epidemic, we had rows of numbers and spreadsheets.

AI learns to predict based on what is predictable, and what it generates are predictions of predictable patterns.

It wasn't until later that someone decided to map that information to see what the data corresponded to. When they had just data, no one could figure out what the hell was going on and why people were getting sick. But when they mapped this with little black rectangles, they could see that they all clustered around a center. And they went to that center. It was a water pump. They took the handle off the water pump. People stopped getting sick.

A similar thing happens in image and sound generation. Everything clusters around the center because it relies on Gaussian distribution. It learns to predict based on what is predictable, and what it generates are predictions of those predictable patterns. This is data analysis, and if you wanted to predict where someone might get sick, you could do the same thing backward. You would say that around here, around the pump, there's much more likelihood that someone will get sick. 

Predicting a disease outbreak seems like an excellent use of data and predictive analytics, though, of course, there are all kinds of biases baked in, so we don’t rely on data exclusively to make such predictions. Predicting music, on the other hand, seems less like something that most people would want to do. But we are doing it. We’re running music classification backward. But there are ways of complicating this. 

Noise Music

I want to talk about how I play with noise as an artist in the system. If you think about what it's doing, you're passing it an image of noise and prompting it: “I don't want noise; I want a picture of (flowers).” The model readjusts and recalibrates that image of noise towards the direction of flowers until it meets an acceptable flower-like resemblance.

I was curious, then: what happens if I prompt the system for noise? Essentially, there's nothing it would say yes or no to. So, what does it start generating? What passes through? I discovered that if you do that, you produce these artifacts of the AI image generation process.

Gaussian Noise Glitch (2024) Eryk Salvaggio

This is one such result. They aren’t always as elegant, really. But I am asking: How do we get into these systems and play with their expectations? The designers of these tools hold expectations of user behavior — how do I become a problematic user? 

How do I become a problematic user? 

So, how might this logic of prompting transfer into sound? What does this sound like in the parlance of AI-generated music? What do you do with noise? I've been playing with these AI generative audio tools and trying to think about tactics we can use to make something like that image of noise, something that sounds different from what the system intends. And the thing that sounds different to me is playing with some of the errors that I can generate.

We will come back to the result of asking the models to do what they aren’t meant to do. But I also want to consider what happens if you do what it asks you to do. So, for now, we’re going to talk about defaults. 

Multi-Modal Media Theory

What if AI-generated media is not media but a hypothetical media that, rather than serving as a reference to the world as media otherwise does, instead represents likely outcomes in a hypothetical population, which are constrained by an analysis of available data, an analysis filtered through infrastructures of media technology and power?

What is referenced in music is quite different from what's referenced in our understanding of images. We've always had distinct categories of media theory about these things. We've never really had a multimodal media theory (focused on the common aspects of AI-generated media). And so this is what I'm throwing out there: what if AI-generated media is not media but a hypothetical media that, rather than serving as a reference to the world as media otherwise does, instead represents likely outcomes in a hypothetical population, which are constrained by an analysis of available data, an analysis filtered through infrastructures of media technology and power?

So, in other words, you start by generating noise. And if you think about what noise is, it's a lack of determination for any specific outcome, even technically, in the system. There's nowhere in noise that is pre-selected regarding what it could become. What shapes that noise is this infrastructure of media referentiality — what we have trained on, the politics of training on that data, of obtaining that data, of funding models, GPUs, and infrastructure, and then power, right? Who gets that money? Who gets to decide what's collected, how it's collected, and how that noise is shaped? 

I think it’s worth considering a multimodal media theory of generative AI because if it's all noise, everything can be reduced into noise and regenerated into anything else. There's so much potential in that. But many people, when they talk about AI, are just like — “Potential! The future is wide open! We can do anything!” 

Except that... the noise is being constrained. That's the critical part. You're running the noise and putting it into a tight center: this reference to other things we've already gathered and made. That is just a description of the tech, but in everything I just said, you're correcting that noise. You're structuring things according to previous structures in music. I mean this quite literally but also philosophically, and these distinctions collapse.

It doesn’t have to be music. It just has to sound like music.

Quite often, the structure of the song is determined by the structures of anything you've prompted. And so I say it doesn't have to be music. It just has to sound like music. That's an important thing. I've been talking about hypothetical images or hypothetical music: music that represents information about a dataset and pictures that are infographics.

AI music is a sonification of data in the same way. 

A Hypothetical Genre of Hypothetical Music

(We play a single AI-generated pop song, embedded below, three times and discuss what we hear. The following are excerpts from my contributions within a more extensive group discussion.)

You might approach it differently if you are trying to listen to this music as if a human wrote it. This isn't something that a human wrote. Part of the exercise, in listening to it three times, is trying to get away from that idea of approaching it like human music and to think about, for example, what sonic elements are emulating what a human would do — or are a hideous replica of what a human would do — rather than simply comparing it to human music.

For example, in the first play through we noted that the separation between audio tracks is not precise. Drums, bass, and other tones are tightly compressed. That’s because it is, literally, all one sound: a single waveform being reduced into a simulation of a multi-track recording. So we’ve developed an ear for a precise artifact of the generation process. We also noted that the tone changed at slightly regular intervals, roughly every 32 seconds — a residue of the generation window. That is, this model could only create 32 seconds at a time, with a 4-to-6 second transition that blends the new and old context windows.
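
As a sketch of why that seam is audible (this is not any vendor's actual stitching code, just a generic fixed-window scheme with a linear crossfade), a new chunk might be blended onto the existing audio like this, and the texture still resets on a regular grid:

```python
import numpy as np

def crossfade_extend(existing: np.ndarray, new_chunk: np.ndarray,
                     overlap_samples: int) -> np.ndarray:
    """Blend a newly generated chunk onto existing audio with a linear crossfade.

    Sketch only: the commercial models' stitching is not public, but any fixed
    window-plus-crossfade scheme leaves a faint, regularly spaced seam.
    """
    fade_out = np.linspace(1.0, 0.0, overlap_samples)
    fade_in = 1.0 - fade_out
    blended = existing[-overlap_samples:] * fade_out + new_chunk[:overlap_samples] * fade_in
    return np.concatenate([existing[:-overlap_samples], blended, new_chunk[overlap_samples:]])

sample_rate = 44_100
rng = np.random.default_rng(0)
track = rng.normal(size=sample_rate * 32)                   # a stand-in 32-second "window"
for _ in range(3):                                          # extend by three more windows
    chunk = rng.normal(size=sample_rate * 32)
    track = crossfade_extend(track, chunk, overlap_samples=sample_rate * 5)
```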

AI music is not meant to be expressive. It is meant to be plausible. That changes the experience of listening to it.

On the second listen, we discussed how it has processed the “human” into this. For example, it goes for what's plausible. No human is thinking about plausibility, asking if their sound passes for human or not. Maybe if you have imposter syndrome. A human might ask, is this good? But that’s a different question from, is this plausible? If you are asking that question, and that is the question motivating your decisions, you get a different result.

There is also a logical constraint on the music that sets up a specificity to it as a genre, perhaps, as a genre of hypothetical music.

An AI-generated pop song is a sonification of data, in the same way an image is a data visualization, or an infographic. This is a sonification of a training data set. We wouldn't ask, you know, if population data is the population. That is what I'm getting at with the idea of a hypothetical. Looking at polling data, we want to know if the data representation aligns with the larger, hypothetical population. I am curious how this data representation, in pop music format, aligns with the possibilities of pop music in the sampled population, which is what humans make. So we think about the human not because we want the music to be more human per se, but because we want to understand the differences and the contrasts between the prediction and the reality. 

AI-generated audio is meant and designed to represent data about music sonically rather than be the music itself. It is music that sounds like music.

Sounding like music means it references a vast collection of music, but it is not necessarily designed to be music. Of course, that's a problematic distinction. Everything could be music. I should have worn my John Cage t-shirt, but it’s blank... anyway, I wanted to go a level down from the “what is music?” conversation because it’s all music. I am more interested in what makes this music distinctive: AI-generated audio is meant and designed to represent data about music sonically rather than be the music itself. Now I know this is a contradiction. It’s still music. But that is only contradictory if you, you know, get caught up in the contradiction. It isn’t “not music.” It’s music that sounds like music. 

The distinction is not necessarily a condemnation. It's just a question, in the same way you would want to think about an analog synthesizer and how its patches are working, of how the data works in the system, what it is doing, and what it is driving. It's only when you go away from the technical specificity of how this music is produced and into the social specificity of how it's created that it becomes a criticism rather than a critique. If you are hearing a sonification of data, whose data is it? What's in there?

What’s the answer, then? Let's just pretend as a thought experiment that they're doing this on publicly available, non-commercial or appropriately licensed music. What is that? It's Creative Commons library music. It's music from before 1928 in the US. What you're looking at is a recipe for kitsch, the worst of nostalgia, which is like the hold music, or the music that people give away.

So, that's the acceptable material that forms the training data. But we know that they're scraping YouTube and other music websites. When scraping YouTube, they're getting all kinds of music churned in with all the stuff they actually have the right to use.

When you average all of that together into this amalgamation and give areas of it category definitions that segregate it, the only outcomes that I could imagine are either Deerhoof or kitsch. You can set these things to an extreme temperature where the boundaries are very imprecise, resulting in something wildly experimental in terms of genre and consistency (or lack thereof) contained within a pop structure. Or you can turn the temperature down — that is, the acceptable variety from the mean, the average — and get telephone hold music, right? I'm on hold, but maybe someone's singing opera or whatever.

Roland Meyer has done great work on this in the domain of images. 

Nothing against Deerhoof, by the way. It’s just what I hear from AI when I set the models to the least constrained settings.
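
That setting can be read quite literally as a multiplier on how far samples may stray from the center, the mean, of the distribution. A toy Gaussian version of the dial, not any product's actual parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(mean: float, std: float, temperature: float, n: int) -> np.ndarray:
    """Scale the spread of a Gaussian by a temperature factor.

    temperature < 1 hugs the mean (hold music); temperature > 1 lets samples
    roam far from it (the least constrained settings). Illustrative only.
    """
    return rng.normal(loc=mean, scale=std * temperature, size=n)

tame = sample_with_temperature(mean=0.0, std=1.0, temperature=0.2, n=10_000)
wild = sample_with_temperature(mean=0.0, std=1.0, temperature=3.0, n=10_000)
print(f"spread at low temperature:  {tame.std():.2f}")   # ~0.2
print(f"spread at high temperature: {wild.std():.2f}")   # ~3.0
```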

So this is a choice that people have. Still, because it’s buried in the interface or downright inaccessible, you can tell that the commercial models want music that sticks pretty close to the central tendencies of the genres. You get a different outcome if you shift the training data to more commercial material.

This idea of averaging is not just an issue of the training data. It’s the premise of the system. You could put only the harshest edges of experimental music into the training data, and what you would get is the banal central tendencies at the heart of the harshest edges.

However, this idea of averaging is not just an issue of the training data. It's the premise of the system. You could put only the harshest edges of experimental music into the training data, and what you would get is the banal central tendencies at the heart of the harshest edges. It may be interesting for 10 or 20 song generations. But you’d quickly learn to hear the standard tropes. That can be useful if you’re a musicologist. However, as serious music listeners, we would soon hit a constraint on the variety that makes music so interesting. Deerhoof is an excellent example because they constrain themselves to pop music structures but can go far abroad within them.

Deerhoof could make a pop song or a 30-minute jam session as a band. AI-generated music doesn’t really decide. It models. What it generates is limited by what can be predicted by those models. It is mechanistically constrained. Whereas artists work to break categories and genre, generative AI is limited to the conventions associated with categories and genres. It can only recombine them, and it can only recombine their averages.

When Models Collapse

You may have heard of model collapse. If you train an AI image generator on a training data set of AI-generated images, you're training on compression, and so you're compressing compression. And so the result is even worse images than you started with. A similar thing happens with audio, where model collapse essentially means there is nothing to work with in the data.
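
A classic toy demonstration of that dynamic, which assumes nothing about any particular audio model: fit a simple Gaussian to some data, sample a new dataset from the fit, re-fit, and repeat. The edges of the distribution disappear first:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=10)        # a small pool of "real" data

for generation in range(1, 101):
    mean, std = data.mean(), data.std()               # "train" a model on the current data
    data = rng.normal(loc=mean, scale=std, size=10)   # the next generation trains on its outputs
    if generation % 20 == 0:
        print(f"generation {generation:3d}: std = {std:.3f}")
# The spread tends to drift toward zero: compression of compression, until
# there is almost nothing left in the data for the next model to work with.
```

The printed spread tends to shrink generation after generation; only the center of the distribution survives.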

Because audio generation extends samples over time, if you're trying to extend that glitch, at a certain point you are just not relying on the training data. You only rely on what has previously been generated, so much of the texture becomes impossible for the machine to predict. So here is what I'm playing, which is, I think, going to be a little bit glitchy. I'll start from the beginning. This is a track generated through prompting for an atonal drone. I then extended it further and further, and you can hear it seek patterns and then lose them.

There's a drone. You give the context window a very small amount of that drone, and then it has to extrapolate 31 seconds further of that drone, and then you do it over and over again. It's essentially trying to find patterns where there isn't anything. And so it does this, or some variation of it, which I'm very excited about.

There is something sonically interesting in it as noise music, as a genre of noise, which is the most appropriate way of thinking about it. The most exciting way to think about AI-generated pop music is through the lens of noise music. If you are trying to get away from the pop, the centrality, and the averageness that the machine is putting into these tracks, you must steer it very hard into noise.

That forces it out of a pathway through data aggregation. And what it does, in the absence of a path, is what we're hearing. You're also hearing it identify patterns in the previous minutes of audio, try to evolve those patterns, and fail.

But it does interesting things, and I haven't really been able to interpret it the way I'd like. It'll find a sonic texture and then amplify that sonic texture. It will find something of a noise that repeats, and it will try to predict that pattern and extrapolate that pattern. Anyway, I have hours of this stuff, and someday, someone will let me do a live performance of it to nobody.

 
The most exciting way to think about AI-generated pop music is through the lens of noise music.

I think it’s helpful to end here, with noise: thinking about the idea of the blank page and all of the noise that goes into the decisions about how you will organize it. The work of creativity is a series of decisions, and every decision moves you away from this noise, this emptiness that needs to be arranged. You express yourself through those decisions and make one decision after the other. I may be misphrasing Cécile Malaspina here, but every decision about how we move through noise narrows the possibilities available to us. The challenge is to choose the constraints; this, perhaps, is one way of thinking about the creative process.

There’s overlap in that practice with diffusion models, but the range of creative decisions is far wider than those made through the constraints of a mathematical model. With generative AI, the decision is simply whether to use it. Once you use it, the decision is, what will you ask it to do? And there is the break from other forms of writing or art making: once you ask, everything else is out of your hands. The moment-to-moment decision-making that would go into a piece of music, the sense of what note follows the next, what sequence follows the next, what do I want to do here, is kind of taken away from you. You’re not being asked to make music so much as to listen to music and decide if you like it. 

The power of organizing blank space, the power of organizing noise, is handed away to a machine that organizes noise on your behalf, and then you decide whether it has done it right or not. That changes the relationship, I think, to the real pain in the ass work of creativity. Now, I am talking about defaults: the prompt window and the music that comes back. Of course, you can expand that number of decisions. I think that, once you do, the medium of AI isn’t the constraint. There are many issues with AI, and I’m not ignoring those.

Many artists are very present in the work they make with generative AI. Many aren’t, but that’s just art. Just as anything can be music, any tool can make music. Or art. One decision can open up a new series of interesting choices: for example, what happens if we aim to make noise with a noise reduction algorithm? The problem is the default, the way it encourages us to consume instead of transform. AI is not inherently limiting. However, it promotes self-limitation through how we discuss it, the ways the interfaces of commercial models are designed, and the role that generative music systems are most likely to play. Which is the jukebox, not the drum machine.

Listening

Whatever comes back, there’s nobody on the other end.

What we listen to is a choice, too: a cultivation of our own preferences and taste. When we use generative AI as a jukebox, we hand away our taste to the machine, automating all these decisions about what we will hear. It becomes very solitary, because you aren’t listening to anybody else. At the heart of it, one big commonality of all these generated media formats is that whatever comes back, there’s nobody on the other end. It’s a residue of artists' decisions from the past that echoes through what we hear. I think this is a very lonely idea of music, and art. It follows a pattern of atomization: the isolation of algorithmic curation.

In the music I make with AI, I try to write lyrics that point out the experience of listening to it. I come back to the same lyrics often. One of them is just: This is not a voice. And what’s interesting is that… it is music that is purely, literally, a reference to pop music, a map of what has previously been done. So in this project, The Absentees, the absentee is the musician. And it has to be self-referential because that is what generated music is.

the puncture that
does not bruise me
a memory that
does not linger

a pinprick but
it doesn't bleed
leaves no traces
leaves no marks

some harmony
without a choir
a puncture
that leaves no mark

A banality A trifle
A something else to do
on a lazy afternoon
without people around you
you want to hear a voice?
well this is not a voice
no this is not a voice, no

Expert Systems

In the 1990s, there was a form of AI called expert systems. They would ask people in a specific field, "What are the decisions you're making?" And constantly, they would try to write these into programs: if this, then that. They were trying to automate what you might assume is straightforward work with seemingly straightforward rules: math, legal stuff, chemistry, and safety rules. 

But there was no cultural context for those decisions. So, the machines could never make very informed decisions because experts operate in a genuinely multimodal space: They take in body language and office culture, and they know some paperwork takes longer than others. Yet, they likely could never articulate what was coming in to inform those decisions. That was the invisible stuff of culture. 

Those machines never really worked that well. Today, we are trying to build an expert system to automate musicians. But here it is: the same principle as an expert system, which failed because it didn't understand culture, and they want to use it to produce culture? There's an inherent tension there. So Gaussian Pop is not the production of music but a close enough resemblance to pass, to fill in the space where music is desired but not listened to. It’s a state of attention that seems particularly contemporary, a form of half-attention and half-distraction. Gaussian Pop is noise at its heart, but noise constrained toward the illusion of signal. Gaussian Pop is a hypothesis of what music ought to sound like. Gaussian Pop sounds like music. 


14 Theses on Gaussian Pop

(Read the original essay)

1. Gaussian Pop emerges from the era of scrolling: what would have been idle time, which should spawn a restlessness in search of creative production or cheap thrills, has become pacified into consuming an interface.

2. Gaussian Pop alienates us from our own preferences. When we scroll, brief moments of attention interrupt streaks of inattention, shaped by an algorithm responding to our previous engagements. We seek stochastic pleasures from the scroll, moments when our attention and preferences align.

3. Gaussian Pop is the generated music, or image, or text which satisfies the urge to scroll: it is tailored to inattentiveness.

4. Gaussian Pop lures us in but denies full immersion. It lacks detail, or contains strange tonalities, upon inspection: but this music was never meant to be inspected. Just as an AI-generated image holds our attention for 7 seconds, and AI-generated text is designed to be skimmed, an AI-generated song is only meant to be heard once.

5. Gaussian Pop sounds like music, just as the AI image has a passing resemblance to photographs. Regardless of detail, the way these songs are made — regardless of quality or complexity — will never allow them to transcend a mere resemblance.

6. Gaussian Pop can produce something profound just as the sublime can sometimes be achieved through coincidence. These coincidences will be pointed to as evidence of a capacity for it to transcend these limits: it isn’t.

7. Gaussian Pop flourishes in the space between our eyes or ears and the screen, where attention is slightly disembodied — loosely suspended to allow our attention mechanisms to be augmented by the algorithm. Gaussian Pop thrives in inattentiveness.

8. Gaussian Pop doesn’t have to be subtle or unobtrusive to be ignorable. It occupies the mind for exactly as long as idleness affords before seeking out something else. Likewise, Gaussian Pop art resembles the punctum of an image, but it’s a finger prick that doesn’t bruise us, a memory that dissolves with the next thought.

9. Gaussian Pop satisfies a demand for fleeting engagement. It is designed to dissolve seamlessly into the “next.” Being an extrapolation from latent space, Gaussian Pop boldly hovers in-between spaces: it is the standard deviation of culture, comfortably camouflaged within the consensus definitions we use to prompt it. Gaussian Pop is immediately outdated: it relies on the past to predict the next 7 seconds of culture.

10. Gaussian Pop is a highly democratized art form, an ultimate but politically eviscerated evolution of Fluxus music. It’s produced by consumers, who consume the joy of automated production. Consumption and production are fused into the same vehicle.

11. Gaussian Pop artists document the outcome of their idleness to share with others in the memetic sense of “sharing” a social media post: it is self-expression through curation. Few Gaussian Pop songs resonate beyond the listener who generated them: that follows a logic of customizable isolation that permeates the Big Data era.

12. Gaussian Pop is the literal aestheticization of Big Data and social media metrics, a response to algorithmic culture curated by algorithmic systems for algorithmic systems.

13. Gaussian Pop is always infused with a mourning for the absence of what it has erased: ghosts of those whose emotions have been trapped within the confines of a standard deviation.

14. Gaussian Grotesque is a response to Gaussian Pop: not a sincere enthusiasm for the mechanisms but a texturing of its mediocrity, a testing of the limits of Gaussian distribution, a bundling of mode collapse and glitches into banality as to become invisible.


Below is an AI-generated “hypothetical podcast” summarizing the text above. Listen with skepticism.


If you enjoyed this text you can follow my newsletter, Cybernetic Forests, or find me on Mastodon, Instagram, or Bluesky. You can find some of the music discussed in this post on Bandcamp.