Music for Generated Metronome (2024)

Compressing Time Within The Latent Space of Generative AI Video & Sound Models

AI-generated video is AI-generated images, plus time. That is, the images move — they’re animated. But time for a generative video model runs differently than it does for the rest of us. Internally, time is frozen: the movement of objects in a video, broken down into training data, becomes fixed like the paths of raindrops on a window.

When the model unpacks that, the video has a certain tempo, but it’s unclear to me how the speed of the video is determined. There’s a certain translation of movement as it gets frozen into place: the trail in any given frame represents a segment of time; the next snapshot represents another segment, and so forth. A trail that runs across the entire frame, unpacked over however long that frame was meant to last, implies faster motion than a trail that shifts only slightly within the same frame.

A visualization of how a video model might compress motion within a time span into a single frame.

For example, above you see a metronome, whose needle is meant to sway back and forth to keep a steady tempo; typically, the metronome clicks each time the needle passes the center. If we were to freeze a video of a metronome while training a video diffusion model (like Sora, or Gen 3, etc.), we might capture a moment in time that looks like what we see above. The needle is not moving, but we can see the trace of where it moved. That movement is frozen as a range in a single frame, suggesting that the needle moves from side to side within that specific area over the time span the frame represents.
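To make that compression concrete, here is a minimal sketch, assuming Python and NumPy (neither is part of the original piece), that renders a swinging needle across 24 frames and averages them into a single image, so a full sweep of motion survives only as a blurred range. It is a conceptual illustration of the visualization above, not a description of how any actual video model represents time.

```python
# Conceptual illustration only: many frames of a swinging needle averaged into
# one "long exposure" frame, so the motion survives only as a blurred range.
# This is not how Sora, Gen 3, or any particular video model handles time.
import numpy as np

def needle_frame(angle_rad, size=128):
    """Render one frame: a white needle at the given angle on a black background."""
    frame = np.zeros((size, size), dtype=np.float32)
    cx, cy = size // 2, size - 1              # pivot at bottom center
    for r in np.linspace(0, size * 0.9, size * 2):
        x = int(cx + r * np.sin(angle_rad))
        y = int(cy - r * np.cos(angle_rad))
        if 0 <= x < size and 0 <= y < size:
            frame[y, x] = 1.0
    return frame

# One full sweep of the needle, sampled at 24 frames.
angles = np.linspace(-0.6, 0.6, 24)           # radians, left to right
frames = np.stack([needle_frame(a) for a in angles])

# "Freezing" the motion: the average of all frames is a single image in which
# the needle appears only as a faint wedge-shaped range of possible positions.
compressed = frames.mean(axis=0)
print(frames.shape, "->", compressed.shape)   # (24, 128, 128) -> (128, 128)
```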

The needle would not appear beyond that range. But if the training data contains thousands of styles of metronome, and the model’s concept of a metronome is built from all of them, the needle will not move reliably over any specific span of time.

The result explains the extremely strange behavior, at least compared to real-world physics, of the needle in generated video of a metronome. The range of possible locations for the needle is compressed, but the behavior of that needle is lost because it moves too quickly: the model is making a blur of a blur. The blur becomes a site where the needle could spontaneously emerge at any moment, and in the resulting video, that is what we see. Pixels are drawn, and the model understands the rapid swaying of the needle between two points as potentially two needles, then animates two needles in that space.

This piece takes that at face value. I scored the ticking that this metronome would have produced at its tempo, matching clicks to its movement from side to side. That track was then loaded into Udio, a diffusion model for music. Udio offers a “remix” function, which is badly named because the original track can stay intact; instead, it adds layers of sound on top of the track or shifts its tonal qualities. So, for example, if I have a clicking metronome track, I can convert it into synthesizer noises, or a human vocalist.

In the video above, that’s exactly what I did. Using the strange time signature of the generated metronome, I built a manual click track with a few different tonalities of click. I uploaded that to Udio and generated a variety of tracks that altered the click track while maintaining its underlying structure.
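For illustration, here is a hypothetical sketch of how a click track like this could be assembled in code. The actual track was built manually, as described above; the tick times and click pitches below are placeholder values, not the timings from the piece.

```python
# Hypothetical sketch of assembling an irregular click track. The tick times
# here are placeholders standing in for timings read off the generated
# metronome by hand; the actual piece used a manually built track.
import wave
import numpy as np

SR = 44100  # sample rate in Hz

def click(freq_hz, dur_s=0.03):
    """A short decaying sine burst: one 'tonality' of click."""
    t = np.arange(int(SR * dur_s)) / SR
    return np.sin(2 * np.pi * freq_hz * t) * np.exp(-t * 80)

# Irregular tick times (seconds) and alternating click tonalities.
tick_times = [0.00, 0.41, 0.95, 1.32, 1.99, 2.37, 3.10]   # placeholder values
tones = [1000.0, 1500.0]                                   # two click pitches

track = np.zeros(int(SR * (tick_times[-1] + 0.5)), dtype=np.float32)
for i, t0 in enumerate(tick_times):
    burst = click(tones[i % len(tones)])
    start = int(t0 * SR)
    track[start:start + len(burst)] += burst

# Write a 16-bit mono WAV that could then be uploaded to a tool like Udio.
pcm = (np.clip(track, -1, 1) * 32767).astype(np.int16)
with wave.open("click_track.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(SR)
    f.writeframes(pcm.tobytes())
```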

It’s important to note that a metronome is a way of keeping time. It isn’t a cue for a sound. A musician who played a note on every click of a metronome would be highly repetitive. The metronome just sets the tempo; the notes themselves can occur anywhere within the beat.

The result is 11 audio tracks that respond to the tempo of the metronome. They add other elements, too; the video and the audio are not responsive to one another. The metronome just sets the tempo against which the other pieces of audio are generated. That tempo is so strange that Udio didn’t impose any kind of familiar structure: nothing in here sounds like pop music.

To be clear, I also hope people find the futility of this exercise kind of funny. I don’t know if I’d go so far as to call it “commentary” on relying on AI for tasks it wasn’t designed to do, but I think that’s a layer of it.

Stochastic Pareidolia

There’s an element of pareidolia within AI: the tendency to perceive a specific, often meaningful image in a random or ambiguous visual pattern. Pareidolia is the human ability to see shapes, or make pictures, out of randomness. That error is precisely what image models do by design: they identify patterns in noise that aren’t there.

In watching these videos there’s a strange sense of coming in and out of alignment with the metronome, even though no alignment is supposed to exist, per se. The video isn’t responding to the music; the music is responding to the video, awkwardly. The click track sets the tempo to which the other sounds are arranged, but not every click of the metronome has to produce a sound. It is disorienting when one doesn’t, and very satisfying when one does. This is a sonification of the hunt for quality images when we’re using AI: mostly misses, then a hit, and then we move on. Here I want every sound to land on time, on the beat, but the audio model doesn’t work that way. Music isn’t structured that way.

I’m also mindful of my role in the process. One thing I did in this piece was to operate as a stenographer. The video is generated, and I respond to the visual of the metronome to facilitate a pattern-analysis / improv session by the Udio audio model. It’s a complete inversion of what machines are supposed to do for artists. Of course, this is intentional — the thing I am doing is weird — but it still left me with a real feeling of “why did I stay up until 3 am doing this?” afterward. The machine led and I followed.

I’ve been wondering about generative art and why stripping away human bias was so intriguing to artists of the genre. I consider myself one of them, but AI feels different. Now I question what was so interesting to me about generative systems before AI.

It came down to building systems — environments in which sounds emerged. The work of art (work as a verb, as opposed to the noun in art work) was in setting up the system.

It did help me understand, and think about, the way time is processed by AI video: every frame compressed into a tick of some clock hand. I don’t know how long a second lasts once it’s in the data, or if the needle of a metronome could ever be rendered precisely. Whose time would it be keeping? What tempo might it be?

Eryk Salvaggio