Everyone is talking about OpenAI’s new text-to-video model, Sora. And it is really quite exciting. As someone who spends their 9-5 (and beyond) creating video content, I wanted to see if it’s really all it’s cracked up to be. So, let’s take a deeper dive into its strengths and weaknesses to see what this new technology really offers.
Strength #1: Simulating Drone/Aerial Shots
Of all the videos Sora has generated, the most impressive ones are its drone and aerial footage. These are fairly realistic as long as you don’t look at them for too long – we’ll talk more about this in an upcoming section, but for now I’ll point out that if you look closely at any aerial video featuring architecture, there will often be stairwells or steps that lead to nothing. However, the quality of these videos, when compared to the high costs and dangers of flying manned helicopters or unmanned drones, seems to highlight a possible future for this technology.
Strength #2: Creating 3D Animations
Let’s face it: creating full 3D scenes and characters is time-consuming and difficult, not to mention animating all of those individual assets. There’s also lighting, post-processing, accounting for multiple camera angles…it’s a whole process. So being able to type a 3D animation into existence is extremely compelling and has the potential to shake up the entire field. A lot of Sora’s sample animations are bland and lack a strong, cohesive sense of visual style, and it remains to be seen how well these animations hold up to minor tweaks in post. Still, there is a captivating future here for anyone who has ever pulled stock animations for their projects from sites like GettyImages.
Strength #3: Mixed Media Potential
While it’s easy to look at Sora’s output and judge those videos as finished, final products, there’s also great potential in mixing outputs created by Sora – like a 3D character on a greenscreen – with traditional compositing to add that character to pre-existing footage. Compositing, for those who don’t know, is a VFX term for combining media from various sources into a single, final video. For instance, you could add a Sora-generated background to interview footage, or use a simple prompt-generated animation to liven up an otherwise boring talking-head tutorial. Sora doesn’t need to be the end of the creative process; rather, it can be another tool for creatives to incorporate into their arsenal.
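To make that concrete, here’s a minimal sketch of the kind of chroma-key composite I’m describing, written in Python with OpenCV. The file names are placeholders, and the green hue bounds are a rough assumption – a real pipeline would run this per frame with proper spill suppression and edge refinement – but the core idea is just building an alpha matte from the green pixels and blending the two images.

```python
import cv2
import numpy as np

def composite_greenscreen(fg_path: str, bg_path: str, out_path: str) -> None:
    """Key a green-screen foreground frame over a background plate."""
    fg = cv2.imread(fg_path)  # e.g. a Sora-generated character on green
    bg = cv2.imread(bg_path)  # pre-existing footage to composite onto
    bg = cv2.resize(bg, (fg.shape[1], fg.shape[0]))  # match frame sizes

    # Find "green" pixels in HSV space; these hue/saturation bounds are a
    # rough starting point and would be tuned per shot in practice.
    hsv = cv2.cvtColor(fg, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array([35, 80, 80]), np.array([85, 255, 255]))
    mask = cv2.GaussianBlur(mask, (5, 5), 0)  # soften the matte edge

    # Alpha is 1 where we keep the foreground, 0 where the green screen was.
    alpha = 1.0 - mask.astype(np.float32) / 255.0
    alpha = alpha[:, :, None]  # broadcast across the color channels

    out = fg * alpha + bg * (1.0 - alpha)
    cv2.imwrite(out_path, out.astype(np.uint8))

# Hypothetical inputs: a keyed character frame and an interview plate.
composite_greenscreen("sora_character.png", "interview_plate.png", "composite.png")
```

This is the simplest possible version of the technique – professional compositing software does far more sophisticated matting – but it shows how little glue is needed to fold a generated element into real footage.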
Now that I’ve elaborated on some of the positives about Sora, I don’t want to make it seem like the technology doesn’t have its downsides. While there is great potential for this text-to-video model, there are also some glaring flaws with it, which shouldn’t come as a surprise given that the technology is in its infancy. Here are some of Sora’s weaknesses that I’ve found so far.
Weakness #1: Details
Take a magnifying glass to any Sora-generated video and you’ll start to see problems with the budding technology. Pay close attention and you’ll witness objects disappearing after being obscured, assets being duplicated unintentionally, or things melting unnaturally – think a spoon melting into food. There’s a worrying amount of melting in Sora-generated videos, actually. And Sora doesn’t quite understand how many fingers human hands have. Or how many feet. Or how many heads. Even the simulated drone/aerial shots I mentioned previously fall apart when you notice the ocean waves in the background flowing the wrong way. And while it’s possible these issues could be resolved by adjusting the text prompts used to generate them, the fact that OpenAI chose these videos to promote the technology speaks volumes.
Weakness #2: Generating Text
While there are some examples of Sora incorporating text successfully into a scene, any time text is a background element of a shot, Sora creates some kind of weird mashed-potato language that’s more of a dead giveaway than the OpenAI logo at the bottom-right. I’ll have to look at more Sora-generated video samples to see how deep this issue runs, but considering how few of the already-released Sora videos incorporate text prominently, I’d wager that this is an area OpenAI isn’t too eager to feature at the moment.
Weakness #3: Simulating Humans
A good portion of Sora’s attempts at generating humans unfortunately falls directly into the uncanny valley. There are many instances where Sora doesn’t understand why and how humans make gestures like waving or clapping. Sora also gets tripped up by simple things like walking – in one example, a woman’s right leg strangely morphs into her left leg mid-stride. Sora also doesn’t understand which direction people run on treadmills. My point in highlighting these flaws is that relying on a generative video model to simulate reality is going to come with some quirks, so anyone adopting this technology at the outset should be aware of that fact. Like any new technology there will be a learning curve, but the things Sora still needs to learn are so simple that it’s downright comedic.
So was it a good idea for Tyler Perry to put his $800 million studio expansion on hold because of Sora? Probably not. From what I’m seeing, Sora’s faults are severe enough to outweigh its advantages, especially when it comes to making real humans look and act like real humans. So unless you’re David Lynch, I don’t think Sora is for you. But could Sora replace stock animations on GettyImages? I can see that happening, actually – just remember to count the number of hands and feet.