OpenAI’s Sora is designed to be a “world simulator.” Right now it’s having trouble breaking a glass.
A.W. Ohlheiser is a senior technology reporter at Vox, writing about the impact of technology on humans and society. They have also covered online culture and misinformation at the Washington Post, Slate, and the Columbia Journalism Review, among other places. They have an MA in religious studies and journalism from NYU.
A tiny fluffy monster kneels in wonder beside a lit candle. Two small pirate ships battle inside a churning cup of coffee. An octopus crawls along the sandy floor of the ocean. A Dalmatian puppy leaps from one windowsill to another. These are among a series of demo videos of OpenAI’s Sora, revealed last week, which can turn a short text prompt into up to a minute of video.
The artificial intelligence model is not yet open to the public, but OpenAI has released the videos, along with the prompts that generated them. This was quickly followed by headlines calling Sora “eye-popping” and “terrifying” and “jaw-dropping.”
OpenAI researchers Tim Brooks and Bill Peebles told the New York Times that they picked “sora,” Japanese for “sky,” to emphasize the “idea of limitless creative potential.” There is another term, though, that OpenAI uses to describe Sora: a potential “world simulator,” one that, over time, could create “highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.”
It’s not there yet. While the available demo videos of Sora at work can feel uncanny and realistic, OpenAI’s technical paper on the model notes its many “limitations.” While Sora can sometimes accurately represent the changes on a canvas when a paint-laden brush sweeps across it or create bite marks in a sandwich after showing a man taking a bite, Sora “does not accurately model the physics of many basic interactions,” such as a glass breaking. People and objects can spontaneously appear and disappear, and like many AI models, Sora can “hallucinate.”
Some AI experts, like Gary Marcus, have raised doubts about whether a model like Sora could ever learn to faithfully represent the laws of physics. But just as DALL-E and ChatGPT improved over time, so could Sora. And if its goal is to become a “world simulator,” it’s worth asking: What is the world that Sora thinks it’s simulating?
Unknown worlds
OpenAI has made that question kind of tough to answer, as the company has not disclosed much about what data was used to train Sora. But there are a couple of things we can infer. First, though, let’s look at how Sora works.
Sora is a “diffusion transformer,” which is a fancy way of saying that it combines a couple of different AI approaches in order to work. Like many AI image generators (think DALL-E or Midjourney), Sora creates order from chaos based on the text prompt it receives, gradually learning how to turn a bunch of visual noise into an image that represents that prompt. That’s diffusion. The transformer bit handles how those images hang together across time, keeping the frames coherent enough to read as a single moving video. And Sora, OpenAI says, is designed to be a video-generating generalist.
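To make the “order from chaos” idea concrete, here is a toy sketch of the diffusion half of that process. This is not Sora’s actual code or architecture; real diffusion models use a trained neural network to predict the noise to remove at each step, and the `toy_denoise` function and its closed-form stand-in predictor below are invented purely for illustration.

```python
# Toy illustration of the diffusion idea: start from pure noise and
# iteratively denoise toward a target image. A real model uses a learned
# neural network as the denoiser; here a closed-form stand-in plays that
# role so the example stays self-contained.
import numpy as np

def toy_denoise(target, steps=50, seed=0):
    """Walk a noisy image toward `target` -- a crude stand-in for
    the reverse diffusion process."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # begin with pure visual noise
    for t in range(steps):
        # A trained model would *predict* the noise to remove at step t;
        # our stand-in "predictor" simply points toward the target.
        predicted_noise = x - target
        x = x - predicted_noise / (steps - t)  # strip away a fraction of the noise
    return x

target = np.full((4, 4), 0.5)  # a pretend "image" the prompt describes
result = toy_denoise(target)
print(np.abs(result - target).max())  # ends essentially at the target
```

The point of the sketch is only the shape of the loop: start from randomness, repeatedly subtract predicted noise, end with a coherent image. A video model then has to do this while also keeping hundreds of such frames consistent with one another, which is where the transformer comes in.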
In order to do this, Sora would need a lot of data to learn from, reflecting a wide variety of styles, topics, durations, qualities, and aspect ratios. OpenAI said in its technical paper that its development “takes inspiration from large language models which acquire generalist capabilities by training on internet-scale data.” While OpenAI doesn’t say so directly, it’s probably safe to guess that Sora, too, learned from some training data that was taken from the internet.
It’s also possible, argued Nvidia AI researcher Jim Fan, that Sora was trained on a data set that incorporates a large amount of “synthetic” data from the latest version of Unreal Engine, a 3D graphics creation tool that is best known for powering the visuals in video games. OpenAI also has some agreements with companies that could provide large amounts of data for training purposes, like Shutterstock. As for the data that OpenAI did not, in the past, use with the agreement of its creator or publisher, well, there are some pending copyright lawsuits.
Biased worlds
AI bias is not new, and as Vox has explained before, it can be tough to combat. It creeps into training data and algorithms that power AI models in a lot of different ways. Since we don’t know what data Sora was trained on, and the tool is not available for the public to test, it’s hard to speak in much detail about how biases might be reflected in the videos it creates.
Sam Altman, OpenAI’s CEO, has said that he believes AI will eventually learn to rid itself of bias.
“I’m optimistic that we will get to a world where these models can be a force to reduce bias in society, not reinforce it,” he said to Rest of World last year. “Even though the early systems before people figured out these techniques certainly reinforced bias, I think we can now explain that we want a model to be unbiased, and it’s pretty good at that.”
AI bias and ethics experts like Timnit Gebru have argued that this is exactly what people should not trust AI companies to do, telling the Guardian last year that we shouldn’t simply trust AI systems, or the people behind them, to self-regulate harms and bias.
Made-up worlds
A lot of the praise for Sora’s demo videos stems from their realism. And that’s exactly why disinformation experts are concerned here.
A new study indicates that AI-generated propaganda created by GPT-3 (i.e., not even the newest GPT model powering the current generation of AI tools) can be just as persuasive as human-written content and takes a lot less effort to produce. Now apply that to video. Even without being able to faithfully replicate Earth physics, there are plenty of ways that a tool like Sora could be used, right now, to hurt and mislead people.
“This is definitely slick, but I see two main uses: 1) to sell people more stuff (via ads) 2) to make non-consensual/misleading content to manipulate or harass people online,” wrote Sasha Luccioni, an AI research scientist at HuggingFace, on X. “Genuine question – why is everyone so excited?”
OpenAI announced Sora a couple of weeks after a wave of explicit, nonconsensual deepfakes of Taylor Swift circulated on social media. The images, as 404 Media reported, were created with AI by exploiting loopholes in the systems that are designed to prevent exactly this from happening.
To address potential biases and misuses of Sora, OpenAI is allowing only a small group of testers to evaluate its safety risks: “We are working with red teamers — domain experts in areas like misinformation, hateful content, and bias — who are adversarially testing the model,” the company said in a statement on X.
A world with podcasting AI dogs, I guess
Underneath all this are concerns about what Sora and other tools like it will do to the livelihoods of creative professionals, whose work has been used — often without payment — to train AI tools in order to approximate their jobs.
Altman, on X, was taking follower suggestions for new Sora videos in order to show off glimpses of our glorious future, which will evidently be these AI-generated podcasting dogs:
— Sam Altman (@sama) February 15, 2024
Source: vox.com