
Everything to Know About OpenAI’s New Text-to-Video Generator, Sora

A machine-learning tool that transforms text prompts into detailed video has generated excitement—and skepticism

A person holding a smart phone playing video generated by Sora AI of a woman walking down a city street at night

NurPhoto/Getty Images

At first glance, the clip looks like footage from a music video or an ad for a stylish car: a woman in sunglasses strides down a city street at night, surrounded by pedestrians and brightly lit signs. Her dress and gold hoop earrings sway with each step. But it's not a recording at all. In fact, it's not footage of anything real: beyond the screen, the woman doesn't exist, and neither does the street.

Everything in the video was created by OpenAI's new text-to-video tool, Sora, the latest generative artificial intelligence (GAI) offering from the company behind DALL·E and ChatGPT. Give Sora a simple still image or a brief written prompt and it can produce up to a minute of startlingly realistic video—reportedly in about the time it takes to go out for a burrito.

OpenAI announced Sora on February 15 but hasn't yet released it to the public. The company says it's currently limiting access to a select group of artists and "red-team" hackers who are testing the generator for beneficial uses and harmful applications, respectively. But OpenAI has shared a few dozen sample videos generated by the new tool in an announcement blog post, a brief technical report and CEO and co-founder Sam Altman's profile on X (formerly Twitter).

In terms of the duration and realism of its output, Sora represents the latest in what's possible in AI-generated video. "[My colleagues and I] are very surprised to see the level of quality shown by Sora," says Jeong Joon Park, an assistant professor of electrical engineering and computer science at the University of Michigan, who develops generative three-dimensional modeling techniques using machine-learning methods. Seven months ago Park told Scientific American that he thought AI models capable of producing photorealistic video from text alone were far off and would require a major technological leap. "I didn't expect video generators to improve this fast, and the quality of Sora completely exceeded my expectations," he says now. He's not alone.

Ruslan Salakhutdinov, a computer science professor at Carnegie Mellon University, was also “a bit surprised” by Sora’s quality and capabilities. Salakhutdinov has previously developed other methods of machine-learning-based video generation. Sora, he says, is “certainly pretty impressive.”

Sora’s emergence signals just how rapidly certain AI advances are being made, fueled by billions of dollars in investment—and this breakneck pace is also accelerating concerns about societal consequences. Sora and similar tools threaten millions of people’s livelihoods in many creative fields. And they loom as probable amplifiers of digital disinformation.

What Sora Can Do

Sora generates videos up to 60 seconds long, and OpenAI says users can extend that by asking the tool to create additional clips in sequence. This is no mean feat; previous GAI tools have struggled to maintain consistency between video frames, let alone between prompts. But despite its capabilities, Sora does not represent a significant leap in machine-learning technique as such. “Their algorithm is almost identical to existing methods. They just scaled it up on larger data and models,” Park says. It’s “not necessarily novel,” Salakhutdinov agrees. “It’s a brute-force approach.”

In basic terms, Sora is a very large computer program trained to associate text captions with corresponding video content. More technically, Sora is a diffusion model (like many other image-generating AI tools), with a transformer encoding system resembling ChatGPT's. Using an iterative process of removing visual noise from video clips, developers trained Sora to produce outputs from text prompts. The main difference between Sora and an image generator is that instead of encoding text into still pixels, it translates words into blocks that span both space and time—what OpenAI's report calls "spacetime patches"—which together compose a complete clip. Google's Lumiere and many other models work in a similar way.
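To make that iterative denoising concrete, the sketch below shows the core loop of a diffusion sampler in miniature: it starts from pure noise arranged as spacetime patches and repeatedly subtracts predicted noise. Everything here—the ToyDenoiser stand-in, the patch dimensions, the simplified update rule—is a hypothetical illustration of the general technique, not OpenAI's actual model, which conditions a far larger transformer on text prompts.

```python
import torch

# Hypothetical stand-in for Sora's denoising network. The real system is a
# large transformer that attends over "spacetime patches" of a video latent
# and conditions on a text prompt; a single linear layer keeps this runnable.
class ToyDenoiser(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        return self.net(x)  # predict the noise present in x at step t

# Represent a tiny video latent as (frames x height x width) patches,
# each a `dim`-dimensional vector. These sizes are arbitrary.
frames, height, width, dim = 8, 4, 4, 64
x = torch.randn(frames * height * width, dim)  # start from pure noise

model = ToyDenoiser(dim)
steps = 50
for t in reversed(range(steps)):
    with torch.no_grad():
        predicted_noise = model(x, t)
    # Strip away a fraction of the predicted noise at each step
    # (a crude, DDPM-flavored update; real samplers use learned schedules).
    x = x - predicted_noise / steps

print(x.shape)  # torch.Size([128, 64]): denoised patches, ready to decode into frames
```

An image diffusion model works the same way over a single frame's worth of patches; extending the patch grid along the time axis is what lets one network treat a whole clip as a single object, which is why Sora can keep frames consistent with one another.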

OpenAI hasn't released much information about Sora's development or training, and the company declined to respond to most of Scientific American's questions. But experts including Park and Salakhutdinov agree that the model's capabilities result from massive amounts of training data and many billions of parameters running on enormous computing power. OpenAI says it relied on licensed and publicly available video content for training; some computer scientists speculate that OpenAI may have also used synthetic data generated by video game design programs such as Unreal Engine. Salakhutdinov says that's likely, based on the unusually smooth appearance of the output and on some of the generated "camera" angles. This resemblance to video games' artificiality is just one reason why Sora, though "remarkable," is far from perfect, he says.

A closer inspection of the video of the woman walking reveals that certain details are off. The bottom of her dress moves a little too stiffly for fabric, and the camera pans feel uncannily smooth. In a cut to a close-up, the dress has a splotchy pattern that wasn't there before. A necklace is missing in some shots, the fasteners on the leather jacket's lapels have moved and the jacket itself has grown longer. These sorts of inconsistencies pop up throughout the videos OpenAI has shared so far, even though many of these have likely been cherry-picked to cultivate hype. In some clips, entire people or furniture items disappear or suddenly multiply within a scene.

Possibilities and Peril

If AI video progresses the same way image generation has, all these flaws will soon become much less common and much harder to spot, says Hany Farid, a computer science professor at the University of California, Berkeley, who is enthusiastic about Sora and other text-to-video tools. He sees the potential for “really cool applications” that allow creators to tap into their imagination more easily. Such technology could also lower the barrier to entry for filmmaking and other often-expensive artistic endeavors, he adds.

This is “something we have been dreaming of, as AI researchers,” says Siwei Lyu, a computer science professor at the University at Buffalo. “It’s a great achievement, scientifically.”

But where computer scientists might see accomplishment and potential, many artists are likely to see theft. Sora, like its image-producing precursors, almost certainly contains some copyrighted material within its trove of training data. And it's liable to replicate or closely mimic those copyrighted works and present them as its own original, generated content. Brian Merchant, a technology journalist and the author of the book Blood in the Machine, has identified at least one instance in which a Sora clip appears very similar to a video likely contained within its training dataset. In both videos, a striking blue bird with a feathered head crest and red eyes turns in profile against a green, leafy background.

And then, of course, there are the broader fears of a future where fact becomes increasingly hard to separate from fiction.

Fuel for the Fake-News Fire

Through his work in detecting deepfakes, Farid is keenly aware of how generative AI can be used for nefarious ends. As with every new quick and simple content-generation tool, Sora is poised to further escalate the persistent problem of online misinformation and disinformation. Currently, making fake videos involves working with a combination of AI alterations and real footage. Text-to-video platforms eliminate a user’s need for source material, accelerating and expanding potential abuses. Tools such as Sora may be “an amplifying factor” for harmful content that includes deepfake pornography and political propaganda, Farid warns.

Lyu, who is also a digital forensics expert, has concerns too—particularly for the casual social media user who might scroll by a short clip and absorb it without careful analysis. "For unaware users, AI-generated videos will be very deceptive," he warns. And new analysis tools will be required to suss out fake content. Lyu and his colleagues tried out some existing detection algorithms on Sora's videos; "it didn't work very well," he says. Such tools were only slightly better than chance at recognizing the clips as AI-generated.

OpenAI says that it's taking steps to make Sora safer, including the platform's measured release as well as internal testing, content guardrails and the use of a protocol known as the Coalition for Content Provenance and Authenticity (C2PA) standard, which uses metadata to make it easier to tell where a piece of content originates. Farid and Lyu both agree that these steps are important but that they're not enough to prevent all potential harms. For every safety feature, they say, there is also a work-around.
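As a rough illustration of what provenance metadata buys you, here is a deliberately simplified sketch. The real C2PA standard embeds cryptographically signed manifests inside the media file itself; the function names and fields below are invented for illustration and are not the actual C2PA format.

```python
import hashlib

# Toy provenance record, loosely inspired by C2PA's idea of a "manifest".
# The actual standard uses signed, embedded binary structures; these
# dictionary fields are illustrative only.
def make_manifest(video_bytes: bytes, generator: str) -> dict:
    return {
        "claim_generator": generator,
        "content_sha256": hashlib.sha256(video_bytes).hexdigest(),
    }

def verify(video_bytes: bytes, manifest: dict) -> bool:
    # Any edit to the video changes its hash, breaking the provenance link.
    return manifest["content_sha256"] == hashlib.sha256(video_bytes).hexdigest()

clip = b"\x00\x01fake-video-bytes"
manifest = make_manifest(clip, "Sora")
print(verify(clip, manifest))              # True: content matches its record
print(verify(clip + b"tamper", manifest))  # False: the link is broken
```

The weakness the researchers point to is visible even in this toy version: metadata that travels alongside content can simply be stripped, or never attached in the first place, which is why they see provenance as necessary but not sufficient.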

Unreality’s Reality Check

Yet disinformation exists beyond Sora, and tackling that problem is ultimately a social question, not a technical one, says Irene Pasquetto, an assistant professor at the University of Maryland who researches misinformation and disinformation. She warns that overstating Sora's risks or possible harms can easily contribute to the cloud of hype around AI. Companies have a financial incentive to promote ideas of how powerful their models are, Pasquetto adds—even if some people believe that these products pose existential threats to society.

It’s important, she says, to keep the harms in context and to focus on root causes: although Sora makes it easier and quicker to produce short videos—currently the dominant content on social media—it doesn’t, in itself, pose a new problem. There are already numerous ways to manipulate online videos. Even posting a real recording with the wrong caption can lead to new conspiracy theories, Pasquetto says.

While Pasquetto notes that social, legislative and educational solutions are necessary to stanch the flow of harmful online content, she acknowledges that there’s no quick fix. In the meantime, be aware that objects, places and people in videos may be less real than they appear.