World models and world simulators have become one of the most-discussed concepts among AI experts. Two heavyweights are already betting on them: World Labs, the startup led by AI pioneer Fei-Fei Li, raised $230 million in funding to build "large world models," while Google-owned DeepMind hired one of the creators of the video generator Sora to work on "world simulators."
But what the heck are those?
World models draw inspiration from the mental models of the world that humans naturally create. Our brains take abstract representations from our senses and make them into more concrete understandings of the world around us, producing what we called "models" long before AI adopted the phrase. The predictions our brains make based on these models influence how we perceive the world.
A paper by AI researchers David Ha and Jürgen Schmidhuber gives the example of a baseball batter. Batters have milliseconds to decide how to swing their bat, less than the time it takes for visual signals to reach the brain. They are able to hit a 100-mile-per-hour fastball because they can instinctively predict where the ball will go, Ha and Schmidhuber say.
For professional players, this all happens subconsciously, the research duo writes. Their muscles reflexively swing the bat at the right time and location in line with their internal models' predictions. They can quickly act on their predictions of the future without the need to consciously roll out possible future scenarios to form a plan.
It's these subconscious reasoning aspects of world models that some believe are prerequisites for human-level intelligence.
Modeling the world
Although the idea has existed in theory for decades, world models have gained attention only recently, partly because of their promising applications in generative video.
Most, if not all, AI-generated videos veer into the uncanny valley. Watch them for long enough and something strange will happen, such as limbs twisting and merging into each other.
While a generative model trained on years of video might correctly predict that a basketball bounces, it doesn't actually have any idea why, just as language models don't really understand the concepts behind words and phrases. But a world model with even a basic understanding of why the basketball bounces the way it does will be better at showing it bounce.
To gain this kind of understanding, world models are trained on many kinds of data, including images, audio, video, and text. The goal is for them to build an internal representation of how the world works and to use it to reason about the consequences of actions.
"A viewer expects that the world they are watching behaves in a way quite similar to their own," Mashrabov said. "If a feather drops with the weight of an anvil or a bowling ball shoots up hundreds of feet into the air, it's jarring and takes the viewer out of the moment. With a strong world model, instead of a creator defining how each object is expected to move — which is tedious, cumbersome, and a poor use of time — the model will understand this."
Better video generation is only one of the things world models could do. Someday, the models could be used for sophisticated forecasting and planning, not only in the digital domain but also in the real world. World models could one day reach a given goal through reasoning, said Yann LeCun, Meta's chief AI scientist.
In a lecture on the topic earlier this year, LeCun described how a world model could help attain a desired goal. Given a base representation of a "world" (for instance, a video of a dirty room) and an objective (a clean room), the model could come up with an appropriate sequence of actions to reach that objective: deploy vacuums to sweep, clean the dishes, take out the garbage. It would do so not because this is a pattern it has seen before, but because it knows, on some deeper level, how to get from dirty to clean.
"We need machines that understand the world; [machines] that can remember things, that have intuition, have common sense, things that can reason and plan to the same level as humans," LeCun said. "This is not current technology today. Despite what you may have heard from some of the most enthusiastic people, none of this is going to be possible with today's AI systems."
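LeCun's room-cleaning example can be read as a search over predicted states. The sketch below is a toy illustration only, not anything LeCun or Meta has published: the room facts, the action names, and the hand-written `predict` function are hypothetical stand-ins for what a real world model would have to learn from data, and a plain breadth-first search is swapped in for whatever planner such a system would actually use.

```python
from collections import deque

# Hypothetical toy "world": a room described by three facts.
# A real world model would learn its state representation from video
# and other data; this hand-written version is only a stand-in.
DIRTY_ROOM = frozenset({"floor_dirty", "dishes_dirty", "trash_full"})
CLEAN_ROOM = frozenset()

# Hypothetical actions, each mapped to the fact it removes.
ACTIONS = {
    "vacuum_floor": "floor_dirty",
    "wash_dishes": "dishes_dirty",
    "empty_trash": "trash_full",
}


def predict(state, action):
    """Stand-in for the world model: predict the state after an action."""
    return state - {ACTIONS[action]}


def plan(start, goal):
    """Breadth-first search over predicted states; returns a shortest action list."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if state == goal:
            return steps
        for action in ACTIONS:
            nxt = predict(state, action)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [action]))
    return None  # goal unreachable under this model


if __name__ == "__main__":
    print(plan(DIRTY_ROOM, CLEAN_ROOM))
```

The point of the sketch is the division of labor: `predict` answers "what happens if I do this?", and the planner uses those answers to work backward from a goal rather than replaying a sequence it has seen before.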
While LeCun estimates we are at least a decade away from the world models he envisions, the world models of today already show promise as elementary physics simulators.
OpenAI notes in a blog post that Sora, which it considers to be a world model, can simulate actions like a painter leaving brush strokes on a canvas. Models like Sora, and Sora itself, can also effectively simulate video games. For example, Sora can render a Minecraft-like UI and game world.
Eventually, world models could generate 3D worlds on demand for gaming, virtual photography, and other applications, Justin Johnson, co-founder of World Labs, said on an a16z podcast.
"We can already create virtual, interactive worlds, but it [costs] hundreds and hundreds of millions of dollars and a lot of development time," said Johnson. "[World models] will let you get not only an image or a clip out but a fully simulated, vibrant, and interactive 3D world."
High hurdles
The concept is enticing, but many technical challenges stand in the way.
Training and running world models require massive compute power, even compared to the amount currently used by generative models. While some of the latest language models can run on a modern smartphone, Sora (arguably an early world model) would require thousands of GPUs to train and run, especially if the use of such models becomes commonplace.
World models, like any other AI model, hallucinate and internalize the biases in their training data. A world model trained mostly on videos of sunny weather in European cities might struggle to understand or depict Korean cities in snowy conditions, for instance, or might simply depict them incorrectly.
A general lack of training data threatens to make these problems worse, according to Mashrabov.
"We have seen models being really limited with generations of people of a certain type or race," he said. "Training data for a world model must be broad enough to cover a diverse set of scenarios, but also highly specific to where the AI can deeply understand the nuances of those scenarios."
In a recent blog post, Cristóbal Valenzuela, CEO of AI startup Runway, wrote that data and engineering problems prevent today's models from accurately capturing the behavior of a world's inhabitants, including humans and animals. "Models will need to generate consistent maps of the environment," he said, "and the ability to navigate and interact in those environments."
If all the major hurdles are overcome, however, Mashrabov believes world models could "more robustly" bridge AI with the real world, leading to breakthroughs not only in virtual world generation but also in robotics and AI decision-making.
They could also spawn more capable robots.
Today, robots are limited in what they can do because they don't have an awareness of the world around them (or their own bodies). World models could give them that awareness, Mashrabov said, at least to a point.
"With an advanced world model, an AI could develop a personal understanding of whatever scenario it's placed in," he said, "and start to reason out possible solutions."