Critical Topics: AI Images, Class Thirteen

Cinema Without Cameras

Lecture by Eryk Salvaggio

Today we’re going beyond AI images to talk about AI cinema. Right now there are a number of generative video tools out there, though they lag slightly behind the technology we use to generate still images. Many video tools simply treat video as a series of images, and then apply a style transfer to every frame. But a new wave of generative video tools is emerging, with Runway’s Gen-3 and Luma Labs’ tools already publicly available. OpenAI’s Sora is still not public, but like the others, it promises near-HD video quality and stable physics.

For this talk I want to think about the concept of a cinema without cameras. AI images don’t use cameras at all, and if we can generate images from text, that offers some new ways of thinking about how we tell visual stories. Now, of course, we know that these images are the result of pictures in a dataset: pictures taken by people with cameras, or a scanner, or some other optical input device. It’s clear that the next wave of AI video tools will have been trained on millions of hours of YouTube videos.

We have explored the subject matter of datasets, data provenance (where the images come from) and data rights (consent and control over the images you share online). So we’re not talking so much today about where the pictures come from, but what might be possible in terms of what we do with those images and how we make sense of them.

I want to start by mentioning Vilém Flusser’s “Towards a Philosophy of Photography,” written by the Czech-Brazilian philosopher in 1983. It is bafflingly prescient. As a philosopher with an interest in technology, Flusser wrote not just one but two entire books imagining a world where photorealistic images could be made by computers and machines. In those books he explored the philosophical question of what an image even is, and how this idea of the image would need to change if computers were making them. Well, 40 years later, that seems to be exactly where we are. And long before the invention of DeepDream, GANs, or DALL-E, Flusser described what he called “technical images.”

Roughly speaking, technical images are those produced by a technological apparatus, which he defines as a “play thing that simulates thought.” Any technological apparatus, from the camera to AI, is a product of applied science, and therefore, he suggests, it reflects a scientific way of thinking. In cameras you have the sciences of chemistry and optics; with AI-generated images, you have information and computer sciences, and so on. Now, we could, and some do, spend an entire semester on Flusser, but I want to highlight the connection he sees between images, magic, and myth.

It’s this relationship between images and technology that is interesting to us, because models may change over time, but the relationships that produce AI images will stay pretty stable. Whether it is a 4K video projected on an IMAX screen or a line drawing scribbled on the back of a napkin, images convey stories and myths, paint visions of the world, share information, and give us a way of seeing. Technology constrains the ways we can create those myths and tell those stories. For Flusser, this means that whenever we build a technology, we first build it as a model of the way people think. Soon enough, though, people start to think like the technology. You can look beyond photography to things like assembly lines: we built assembly lines based on the way people assembled things, but then the technology became so fast and efficient that we started demanding that people become faster and more efficient, too.

So let’s think about this entanglement of myths and technologies. When we talk about cinema, we usually think about the invention of the camera, but the movie camera or projector more or less emerged from a handful of other, often overlooked precursors. 

Consider the Magic Lantern, here illustrated in 1720. This was a box with a lantern inside it that could project images onto surfaces, and it was often used to add surreal effects to performances, such as introducing a demonic or angelic presence into a space. You would paint the image onto glass, insert it into the lantern, and the image would be cast outward. Earlier than that, we have surviving sketches created for a kind of mirror projection system, here a series of images in which a skeleton removes its head. In a weird historical footnote, the creator of this device found it silly and embarrassing; when it was slated to be shown to the King of France, he arranged for his brother to sabotage the device to avoid bringing shame to his family.

Christiaan Huygens, Sketches for Lantern, 1659

These magic lanterns were designed as a kind of augmenting technology for stories. They were dim, candle-lit projections that appeared on stage and were narrated by live performers. They gave rise to what eventually became known as projectors, where images could be cast forward with powerful light bulbs, flipping between images at 15, then 30, then 60 frames per second, bringing humans to life within the projection rather than merely beside it.

The glass paintings made way for film strips, and film strips gave us specific ways of shaping the content of the stories we told. Camera lenses let us do close ups or shoot scenes of big, enveloping landscapes. We could cut film — literally, slice it up with scissors and rearrange it — to traverse across time and space, seamlessly shifting where we were in the arc of a story. 

And if magic lanterns were projections with people standing beside the images, and cinema was projections of people within the image, well, what exactly is an AI image or AI cinema? It’s a projection without people at all: a cinema without cameras. The camera no longer has to be present at any scene in order to generate a compelling image of that scene. These images can still be coupled together in sequences that tell a story.

AI isn’t the first time we’ve explored the language of cinema without cameras, though. One famous example of no-camera cinema is the experimental filmmaker Stan Brakhage, who made a film called Mothlight in 1963. Mothlight is a film, run through a projector. But instead of capturing images on that film, Brakhage collected an assortment of flowers, leaves, and moth wings, pressed them between strips of clear editing tape, and printed the assemblage to film so it could be run through a projector.

Mothlight is still a film, dependent on the technologies of cinema, but used in absolutely the wrong way. It’s exciting to think about what this would mean for AI cinema: is there any way that this could be possible? Is each frame of film a piece of data? What if you cut out data from an AI system? What might it generate then? 

But what’s important here too is that the way we tell stories is shaped not just by the technology we have at our fingertips, but by the way people actually use that technology. When people first started to use film cameras, they didn’t know how to tell stories with them. People had to try things: somebody had to take a razor blade, cut the film up, and present what they’d shot out of order. Someone, somewhere, tried that to see if it worked. And that’s something fun about where AI images are at the moment: from a historical perspective, we simply don’t know how to use these tools yet.

So in this class we’re going to look at some of the ways we might use them. And we actually don’t have to start with super contemporary thinkers or artists. In 1989, Gene Youngblood wrote an essay called “Cinema and the Code” that was interested in what computer cinema would be. Of course computers could replicate the language and tools of cinema: you can cut with a scissors icon in Premiere, or you can cut with real scissors. You can move from 30 frames per second to 60. But was there anything that computers might do for movies that film could not do?

Youngblood looked at a number of experiments being done with computers and code at the time and drew out several qualities worth looking at today, because a lot of AI tools that are only now becoming widely accessible make use of them. He named four specifically: Image Transformation, Parallel Event-Streams, Temporal Perspectives, and Images as Objects. So let’s look at some of those.

Image Transformation

Youngblood noted that image transformation was distinct from image transitions. In film, transitions are constrained by the fact that film is a strip of pictures in a sequence. AI cinema can transcend that. Instead of flipping quickly between images, you can transform one image into the next. This is called interpolation. Youngblood wrote about it in 1989: “... In digital image synthesis, where the image is a database... One can begin to imagine a movie composed of thousands of scenes with no cuts, wipes or dissolves, each image metamorphosing into the next.”

So what does that look like? Well, right now RunwayML offers something called Interpolation, which allows us to upload images, or a sequence of images; the model analyzes the information in each frame and renders a transition that fuses the two images together instead of cutting between them. They metamorphose. This is a sample film made using interpolation. It’s just a few pictures of flowers, put together, and then the machine draws the transitions.

It isn’t perfect: sometimes it jumps, but sometimes it morphs, as if one yellow flower is reaching into the other in order to become it.
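To get a feel for what interpolation means at its simplest, here is a minimal sketch in Python. It is only a pixel-space crossfade between two images; tools like Runway’s interpolation work in a learned latent space and invent plausible in-between frames, so treat this as an illustration of the idea rather than their method. The filenames are placeholders.

import numpy as np
from PIL import Image

def crossfade(path_a, path_b, steps=30):
    """Generate in-between frames that blend image A into image B."""
    a_img = Image.open(path_a).convert("RGB")
    b_img = Image.open(path_b).convert("RGB").resize(a_img.size)
    a = np.asarray(a_img, dtype=np.float32)
    b = np.asarray(b_img, dtype=np.float32)

    frames = []
    for i in range(steps):
        t = i / (steps - 1)               # 0.0 at the first image, 1.0 at the second
        blend = (1.0 - t) * a + t * b     # linear mix of the two pictures
        frames.append(Image.fromarray(blend.astype(np.uint8)))
    return frames

# Placeholder filenames: two stills you want to morph between.
# for i, frame in enumerate(crossfade("flower_a.png", "flower_b.png")):
#     frame.save(f"frame_{i:03d}.png")

A crossfade like this just dissolves one picture into the other; a generative model instead invents the intermediate images, which is why the flowers appear to reach into one another rather than simply fade.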

Interpolation is an interesting aspect of animation with AI, and the word comes from the Latin interpolare. It used to mean a kind of forgery: to interpolate was to take an existing archive or book and create a fake page in its style, which you would then insert into the book to make it larger. The most notable early use of this word is connected to the work of scribes who deliberately inserted false laws into books in order to advance land claims and even allow some corrupt bishops to get away with crimes.

So interpolation is a useful concept when we think about AI. It’s adding false information to an existing body of work. In this case, it’s adding fake frames to connect two real ones, and then moving on to the next. But it’s also a perpetual flow of information: there’s no cut, like you might see in the cinema. And Youngblood suggests that this allows for interesting story dynamics to play out.

In my own work, I’m interested in how this reveals the shifting relationships between people and nature: by merging the two, I hope to suggest the interconnectedness and entanglements between people and the natural world. If you have to cut between a person and a mushroom, you’re forced to present them as two distinct things. But if you can move from one to another continuously, the viewer sees that relationship as blending, not distinct. 

Temporal Perspective

Next we have the idea of a shift in temporal perspective. Youngblood writes:

“We arrive at two possibilities: first, cinema looks from one point to infinity in a spherical point of view. That is one vector, we shall say. The other is the opposite: one looks from each point in space towards a single point. If all these points are in motion around one point, that is the space in which ideal cinema operates. But as long as we are talking about psychological realism we will be bound to an eye-level cinema.”

In terms of understanding this with today’s tools, we might think of it as two aspects of machine learning. Basically, you can look out from your own eyes at the stars, and you can see all of infinity from wherever you are standing. That’s the spherical point of view. One way to make sense of this in digital cinema today is the latent space walk. When you train a model, the latent space is that huge realm of possibilities: an algorithmically generated space where every possible permutation or variation of the images in your dataset can be found.

Looking at this space is a bit like standing on Earth and looking out at the stars. But instead of looking at a vast universe, you’re looking at the vast possibilities to be found between every point in the dataset. This is an interesting tool for telling stories, and many artists have tapped into this space to tell stories, though they’re still pretty constrained by the capability of our tools. 

But these grids are maps of that space, with every square representing a variant that emerges from our data. While you probably won’t find a ready-made narrative in this data, you can find stories, patterns, and connections. And when you put these images together in a sequence, you end up with a video that moves through all of the possibilities of a single moment in time. Now, this is a bit heady, I know. Maybe this example is useful.

Vadim Epstein’s Ghosts is from 2021. A key element of this video was created by moving through the latent space of a dataset of original images. The video is an exploration of the possible images that exist in the space between the points in the training data: it’s as if our eyes are peering into the latent space. The video animates this, so we experience those possibilities unfolding over the course of nine minutes. Whenever we look at a depiction of a GAN model’s possibilities, we move through many versions of an image, all laid out for us simultaneously. And that is, to say the least, not something that happens with traditional animation or cinema.
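For a sense of the mechanics behind a piece like this, here is a minimal sketch of a latent space walk, assuming a hypothetical pretrained generator G (a GAN or something similar) that turns a latent vector into an image. None of the names here refer to a real library; it is just the shape of the technique.

import torch

def latent_walk(G, z_dim=512, keyframes=5, steps=60):
    """Yield frames by sliding between random points in the model's latent space."""
    points = [torch.randn(1, z_dim) for _ in range(keyframes)]
    for a, b in zip(points[:-1], points[1:]):
        for i in range(steps):
            t = i / steps
            z = (1 - t) * a + t * b       # step a little further through latent space
            with torch.no_grad():
                yield G(z)                # the generator decodes each point into a frame

# `my_generator` and `save_frame` are stand-ins for whatever model and
# output pipeline you are actually using.
# for n, frame in enumerate(latent_walk(my_generator)):
#     save_frame(frame, n)

In practice many artists use spherical rather than linear interpolation, and walk between points derived from their own training images rather than pure random noise, but the principle is the same: the film is a path through the space of possible images.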

Youngblood suggests that this offers some very interesting ways of thinking about time, space, and possibility beyond the idea of showing us a single image from the perspective of a single camera. So that’s fun to think about and explore. 

That’s one aspect of this: one person looking out and seeing the whole world of possibilities that exists within a small series of images. The other possibility that Gene Youngblood gets at is the idea of fusing many different views of time and space into a single film. In essence, this is what diffusion does: it takes all kinds of pictures that people have taken of apples, or the Eiffel Tower, and shows us a composite of those images. Whenever we look at an image from a diffusion model, we’re basically seeing the contributions of hundreds, if not millions, of people presented simultaneously in one image.

With cinema, this could go in a lot of different and interesting directions. What would a scene of a documentary film look like, for example, if you literally rendered it from the perspective of everyone involved? We don’t have the technology to do this quite yet, but think about the way that we tell stories today. They’re usually centered on the person telling the story: not even the person speaking to the camera, but the person deciding where the camera points. Could an AI cinema change this? And who would control the camera then? Again, it’s fun to think about what this could mean or look like, but it’s pretty early yet. 

Here’s a work called Critically Extant by the artist Sofia Crespo, who used this possibility space to imagine the bodies of non-existent animals and animate them. These animals are the result of training a model on images of animals that do exist in order to bring forward animals that do not. We can think of this as looking into the dataset, finding interesting permutations, and bringing them forward into the world: here, we see it literally on the screens of Times Square in New York.

Image as Object

The next point from Youngblood is the idea of the image as an object. It’s helpful to think about the relationship that we have with images and the way that technology allows us to tell stories in certain ways. Aspect ratios are, for example, a constraint. If your screen is widescreen, or horizontal like a phone, or square like an old TV from the 1980s, you tell stories in ways that fit that frame. When you change the shape of a screen, you have all kinds of ways to tell stories. 

The artist duo known as the Vasulkas are part of Youngblood’s thinking. These artists were using computers as early as the 1970s to break television images out of these constrictions. What you are about to see is a series of works they made. The TV sets aren’t showing these as videos or computer animations. Instead, the Vasulkas used computers to hack a TV and bend the light beam of its cathode ray tube into visually compelling shapes. So these are not computer-generated the way a video game might be generated from a program, and they aren’t recordings. They’re live pictures, created by twisting the way images are rendered on a screen.

I could only find this very long video, but it will give you a sense of it if you skim through. Here’s a clip.

So as I mentioned, Gene Youngblood wrote all of this in 1989, before data became fully integrated into the way we think about pictures and films. This new era of AI is happening all around us as we speak. The future of digital cinema as Vilém Flusser and Gene Youngblood envisioned it has arrived, and now we get to move beyond even those horizons.

What’s important to note is that in the early days of any new technology, so much creativity in these tools emerges from creative misuse. Sometimes that creative misuse makes something incomprehensible, or ugly at first. But it could also be part of an emerging language of images and cinema that we haven’t discovered yet. So I am always eager to see the way that people push boundaries on what these tools can do, and bend these technologies in ways that give stories new shapes. 

So on the topic of no-camera cinema, there is also some interesting potential in reinventing the camera itself using AI. One example is Ross Goodwin’s Word Camera from 2016, which pieced together various image and text models. If you pointed a camera at an object, it would classify that object, then send its classification to a text generator, which would then expand on that word in the form of a poem. Here’s one example. 

Part Two: How Does AI Video Work?

So how exactly does generative video work? 

Well, you might assume that it simply takes a video, breaks it down into frames, and then applies a style transformation to every frame. That is one way to make these videos, but the problem is that the consistency between frames fluctuates pretty wildly. If you’ve ever run the same prompt a few times through these systems, you’ve seen that they rarely reproduce the same image twice. You can constrain these variations, but, for example, rotating around a single object is incredibly challenging.
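To make the problem concrete, here is a sketch of that naive frame-by-frame approach, with stylize standing in for any image-to-image model call (it is a hypothetical function, not a real API). Because each frame is processed independently, small differences in the input plus a fresh random seed give a slightly different result every time, which the eye reads as flicker.

import random

def naive_video_stylize(frames, prompt, stylize, lock_seed=False):
    """Restyle a video one frame at a time, with no notion of the frames around it."""
    styled = []
    for frame in frames:
        # A fresh seed per frame means textures, colors, and details drift
        # between frames, which shows up as flicker during playback.
        seed = 42 if lock_seed else random.randint(0, 2**31 - 1)
        styled.append(stylize(frame, prompt=prompt, seed=seed))
    return styled

Locking the seed reduces the drift but does not remove it: the model still has no idea how the pixels in one frame relate to the pixels in the next, which is the problem the approach described below tries to solve.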

The researchers behind Gen-1 broke this process down into two components: one deals with structure, the other with content. They write:

By structure, we refer to characteristics describing its geometry and dynamics, e.g. shapes and locations of subjects as well as their temporal changes. We define content as features describing the appearance and semantics of the video, such as the colors and styles of objects and the lighting of the scene. The goal of our model is then to edit the content of a video while retaining its structure.

In the demo, then, you can imagine a person walking down the street and the things that happen as a result. The person, or rather the geometric shape of the person, along with the motion and location of the objects moving around them, is part of the video’s structure. The content, as defined by this paper, is the style element: in the case of a video of someone walking down the street, that style is photorealistic. But you can render the same structure in the style of claymation, or animation, or even a different kind of photorealism using Gen-1, which is what they call editing the content.

So there are some differences between the way we render video and the way we render still images. With still images, you don’t have to deal with models of space or their movements. With video, the model needs some sense of how clusters of pixels move between frames, though it doesn’t need to track every detail of them. If you’ve built models for video games, you know you have wireframes that represent the items in space, to which you then apply your images. That’s not exactly what happens here, but it’s a useful framework for thinking about it. So in essence, with Gen-1, one process compresses the information about spatial relationships, such as depth estimates and edge detection, which make up the structure of the video, while another process generates information about the content, and the model maps the two back together in the final product.
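Here is a rough sketch of that pipeline, with every function name hypothetical: it is a way to picture the structure/content split, not Runway’s code.

def edit_video(frames, prompt, estimate_structure, encode_text, generate):
    # 1. Structure: compress each frame into geometry and motion cues,
    #    such as a depth map and detected edges.
    structure = [estimate_structure(frame) for frame in frames]

    # 2. Content: encode the desired appearance, e.g. "a claymation street scene",
    #    or features pulled from a reference image.
    content = encode_text(prompt)

    # 3. Synthesis: generate new frames that keep the structure but take on the
    #    new content, with the model attending across frames for consistency.
    return generate(structure, content)

The important point is the division of labor: structure is extracted and preserved, content is swapped in, and the generator stitches the two back into a coherent sequence.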

In this way, as they write, “To correctly model a distribution over video frames, the architecture must take relationships between frames into account.”

Another distinct thing about training these models is that rather than looking at noise, as we saw with Stable Diffusion and DALL-E 2’s image-generating diffusion models, Gen-1 relied on a deblurring process. Remember that diffusion builds images around the breakdown of images into noise, and then walks pure noise backward toward the image described by your prompt.

For Gen-1, Runway’s researchers integrated video information into the training data starting with blurry abstractions rather than static. In their paper, they suggest that blurry movements are easier to map and track, and that object shapes stay general and broad. Once you know the general movement of the objects in space, the structure, you can de-blur it in the direction of the prompt, or map the video to the style of an existing image, cultivating content elements such as style and so on.
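A small sketch may help contrast the two corruption processes being described: classic diffusion training mixes an image with more and more noise, while a blur-based process smears it into a soft abstraction. In both cases the model learns to reverse the process, guided by a prompt. The numbers here are illustrative and not drawn from any particular paper.

import numpy as np
from scipy.ndimage import gaussian_filter

def corrupt_with_noise(image, t):
    """Mix the image with Gaussian noise; at t = 1 only static remains."""
    noise = np.random.randn(*image.shape)
    return np.sqrt(1 - t) * image + np.sqrt(t) * noise

def corrupt_with_blur(image, t):
    """Blur the image; at larger t only broad shapes and movements survive."""
    return gaussian_filter(image, sigma=(10.0 * t, 10.0 * t, 0))  # blur space, not color channels

# img = np.random.rand(64, 64, 3)            # stand-in for one video frame
# halfway_noise = corrupt_with_noise(img, 0.5)
# halfway_blur = corrupt_with_blur(img, 0.5)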

The next leap in generative video came with OpenAI’s Sora demo and the subsequent releases of Runway’s Gen-2 and Gen-3 models. These models flatten video in an interesting way, packing all kinds of movement into a single frame as a way of tracking how an object moves.

So in the end, we’ve seen how emerging tools like RunwayML’s Gen-1 can be used to map style and content to tell stories in ways that correspond to cinema as we now know it. Right now we’re watching one of the earliest 3D renderings ever made, created in 1972 at the University of Utah. Rendering wireframes of hands and human faces, these images were, at one point, a major leap forward in storytelling. We take them for granted today: every Pixar film, every video game you play, and most of the effects in most Hollywood films are a result of these hands and faces traced by scan lines on the surface of a cathode ray tube.

Our stories, from magic lanterns to Hollywood blockbusters, may seem like they follow clear patterns: there’s always a beginning, a middle, and an end, though, as Jean-Luc Godard would say, perhaps not always in that order. Technology is always pushing us to tell stories in new ways, and sometimes those new shapes are radical, or hold radical possibilities. Sometimes they augment and extend the stories we’ve already told.

Computers alone have already changed the way we tell stories: you’re streaming a video at home, edited on a computer, with clips pulled from all kinds of people who generously shared them. Afterward you might open up a phone that you’ve never used for a phone call to touch a screen and tell a story interactively. All of that has come about because somebody, somewhere, thought about doing something in a new way. 

There’s a short piece of cinema called Zen for Film by Nam June Paik. The author of a post about the film describes it this way: 

In 1964, Nam June Paik released an empty film: Zen for Film. This 16mm film is just a roll of transparent film, without sound, projected on a loop. The only thing you see on the screen—or wall, as it is usually shown at museums, not at film theatres—is the light passing through the film. It is like staring at nothing, like looking at a wall. This may sound like pure asceticism, non-cinema or the negation of cinema, but it is quite the opposite. Zen for Film is a film in constant evolution that captures its surroundings. You can see, among other things, the dust particles that have adhered to the film and the scratches provoked by the inner mechanisms of the film projector. The footsteps, voices, and coughs from the public create the soundtrack.

You may ask if this is a movie, if this is really a story at all. And I’m reminded of light and its role in the history of computers: light passes through a punch card or it does not. If there’s a hole in the punch card, light moves through it, representing a 1: a yes, the presence of information, whereas darkness, or zero, is its absence. So in watching this film, in which light just passes through a projector onto a wall, I want to see it as a giant yes, and to ask what more we can do to push and shape that yes with the tools we have on hand.

Looking for Something Else to Do?

Fabian Mosele has assembled a comprehensive list of films made with generative AI since 2021. Not all of them are masterpieces, but it is a very thorough list — whether you’re looking at this historical moment or seeking inspiration in your own work. You can find that list here.

I ran a workshop with Fabian as part of the AIxDesign Story&Code program in 2023. Fabian’s own short film was an early generative technical marvel, and you can see how it was made in the video below, though as always, technology ages and the workflow may not work for you as it did then.


Works Referenced

Youngblood, Gene. “Cinema and the Code.” Leonardo. Supplemental Issue, vol. 2, 1989, pp. 27–30. JSTOR, https://doi.org/10.2307/1557940. Accessed 26 Mar. 2023.

Esser, Patrick, et al. “Structure and Content-Guided Video Synthesis with Diffusion Models.” ArXiv [Cs.CV], 2023, http://arxiv.org/abs/2302.03011.

Davis, Jenny L. How Artifacts Afford: The Power and Politics of Everyday Things. MIT Press, 2020.

Flusser, Vilém. Towards a Philosophy of Photography. Reaktion Books, 2013.

_blank. “Zen for Film.” _blank [at] Null66913, 1 Apr. 2023, https://null66913.substack.com/p/zen-for-film.

Films Referenced (Video Lecture Version)

Lee Harrison (1968) Mr. Computer Image for ABC News (Proposal, Never Aired). USA.

Nikolai Konstantinov (1968) Koshechka. USSR.

Eryk Salvaggio (2023) The Salt and the Women. Short Film. USA

Deniz Kurt (2023) Cybernetic Genesis. Short Film. Denizekurt.design

Refik Anadol (2019) Latent History Stockholm. 

Sofia Crespo (2022), Critically Extant.

Steina and Woody Vasulka (1968-1975) E-Object, Time/Energy Objects. 

Ed Emshwiller (1979) Sunstone.

Memo Akten (2018) Deep Meditations: A brief history of almost everything.

Ed Catmull and Fred Parke (1972). Halftone Animation Demos.

Nam June Paik (1964) Zen for Film.