In a recent panel interview with Collider, Joe Russo, the director of tentpole Marvel movies like “Avengers: Endgame,” predicted that, within two years, AI will be able to create a fully fledged movie.
I’d say that’s a fairly optimistic timeline. But we’re getting closer.
This week, Runway, a Google-backed AI startup that helped develop the AI image generator Stable Diffusion, launched Gen-2, a model that generates videos from text prompts or an existing image. (Gen-2 was previously in limited, waitlisted access.) The follow-up to Runway’s Gen-1 model launched in February, Gen-2 is one of the first commercially available text-to-video models.
“Commercially available” is an important distinction. Text-to-video, the logical next frontier in generative AI after images and text, is becoming a bigger area of focus, particularly among tech giants, several of which have demoed text-to-video models over the past year. But those models remain firmly in the research stages, inaccessible to all but a select few data scientists and engineers.
Of course, first isn’t necessarily better.
Out of personal curiosity and service to you, dear readers, I ran a few prompts through Gen-2 to get a sense of what the model can (and can’t) accomplish. (Runway is currently providing around 100 seconds of free video generation.) There wasn’t much of a method to my madness, but I tried to capture a range of angles, genres and styles that a director, professional or armchair, might want to see on the silver screen, or a laptop, as the case may be.
One limitation of Gen-2 that became immediately apparent is the framerate of the four-second-long videos the model generates. It’s quite low, and noticeably so, to the point where it’s nearly slideshow-like in places.

Image Credits: Runway
What’s unclear is whether that’s a problem with the tech or an attempt by Runway to save on compute costs. In any case, it makes Gen-2 a fairly unattractive proposition off the bat for editors hoping to avoid post-production work.
Beyond the framerate issue, I’ve found that Gen-2-generated clips tend to share a certain graininess or fuzziness, as if they’ve had some kind of old-timey Instagram filter applied. Other artifacting occurs in places as well, like pixelation around objects when the “camera” (for lack of a better word) circles them or quickly zooms toward them.
As with many generative models, Gen-2 isn’t particularly consistent with respect to physics or anatomy, either. Like something conjured up by a surrealist, people’s arms and legs in Gen-2-produced videos meld together and come apart again while objects melt into the floor and disappear, their reflections warped and distorted. And, depending on the prompt, faces can appear doll-like, with glossy, emotionless eyes and pasty skin that evokes cheap plastic.

Image Credits: Runway
To pile on, there’s the content issue. Gen-2 seems to have a tough time understanding nuance, clinging to particular descriptors in prompts while ignoring others, seemingly at random.

Image Credits: Runway
One of the prompts I tried, “A video of an underwater utopia, shot on an old camera, in the style of a ‘found footage’ film,” brought about no such utopia, only what looked like a first-person scuba dive through an anonymous coral reef. Gen-2 struggled with my other prompts too, failing to generate a zoom-in shot for a prompt specifically calling for a “slow zoom” and not quite nailing the look of your average astronaut.
Could the issues lie with Gen-2’s training data set? Perhaps.
Gen-2, like Stable Diffusion, is a diffusion model, meaning it learns how to gradually subtract noise from a starting image made entirely of noise, moving it closer, step by step, to the text prompt. Diffusion models learn by training on millions to billions of examples; in an academic paper detailing Gen-2’s architecture, Runway says the model was trained on an internal data set of 240 million images and 6.4 million video clips.
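To make that a bit more concrete, here is a deliberately minimal sketch of a reverse-diffusion sampling loop in Python. It is not Runway’s code (Gen-2’s implementation isn’t public), and the denoiser stand-in, step count and image size are all hypothetical placeholders; it only illustrates the “start from pure noise and subtract it, step by step, toward the prompt” idea described above.

```python
# Illustrative sketch only -- not Runway's Gen-2 code. The `denoiser` argument
# stands in for a trained network; the update rule is heavily simplified.
import torch

def sample(denoiser, text_embedding, shape=(1, 3, 64, 64), num_steps=50):
    x = torch.randn(shape)  # start from an image made entirely of noise
    for t in reversed(range(num_steps)):
        # The trained network predicts the noise present at step t,
        # conditioned on an embedding of the text prompt.
        predicted_noise = denoiser(x, t, text_embedding)
        # Subtract a fraction of that noise, nudging the image toward
        # something consistent with the prompt.
        x = x - predicted_noise / num_steps
    return x

# Toy usage with a dummy "denoiser"; a real one would be a large neural
# network trained on those 240 million images and 6.4 million clips.
dummy_denoiser = lambda x, t, emb: 0.1 * x
video_frame = sample(dummy_denoiser, text_embedding=None)
```

A production text-to-video model layers a great deal on top of this, temporal consistency across frames most of all, which is likely part of why Gen-2’s clips still stutter and smear.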
Diversity in the examples is key. If the data set doesn’t contain much footage of, say, animation, the model, lacking points of reference, won’t be able to generate reasonable-quality animations. (Of course, animation being a broad field, even if the data set did have clips of anime or hand-drawn animation, the model wouldn’t necessarily generalize well to all types of animation.)

Image Credits: Runway
On the plus side, Gen-2 passes a surface-level bias test. While generative AI models like DALL-E 2 have been found to reinforce societal biases, generating images of positions of authority, like “CEO” or “director,” that depict mostly white men, Gen-2 was the tiniest bit more diverse in the content it generated, at least in my testing.

Image Credits: Runway
Fed the prompt “A video of a CEO walking into a conference room,” Gen-2 generated a video of men and women (albeit more men than women) seated around something like a conference table. The output for the prompt “A video of a doctor working in an office,” meanwhile, depicts a woman doctor, vaguely Asian in appearance, behind a desk.
Results for any prompt containing the word “nurse” were less promising, though, consistently showing young white women. Ditto for the phrase “a person waiting tables.” Evidently, there’s work to be done.
The takeaway from all this, for me, is that Gen-2 is more a novelty or toy than a genuinely useful tool in any video workflow. Could the outputs be edited into something more coherent? Perhaps. But depending on the video, it might well require more work than shooting footage in the first place.
That’s not to be too dismissive of the tech. It’s impressive what Runway has done here, effectively beating tech giants to the text-to-video punch. And I’m sure some users will find uses for Gen-2 that don’t require photorealism, or a lot of customizability. (Runway CEO Cristóbal Valenzuela recently told Bloomberg that he sees Gen-2 as a way to offer artists and designers a tool that can help them with their creative processes.)

Image Credits: Runway
I did, myself. Gen-2 can indeed understand a range of styles, like anime and claymation, which lend themselves to the lower framerate. With a little fiddling and editing work, it wouldn’t be impossible to string together a few clips to create a narrative piece.
Lest the potential for deepfakes concern you, Runway says it’s using a combination of AI and human moderation to prevent users from generating videos that include pornography or violent content or that violate copyrights. I can confirm there’s a content filter in place (an overzealous one, actually). But of course, these aren’t foolproof methods, so we’ll have to see how well they work in practice.

Image Credits: Runway
But at least for now, filmmakers, animators, CGI artists and ethicists can rest easy. It’ll be at least a couple of iterations down the line before Runway’s tech comes close to producing film-quality footage, assuming it ever gets there.