Can generative AI create worlds, just like it generates text, images, and videos? Ever since Google launched the Genie series and Microsoft introduced WHAMM, the possibilities of AI in world modeling have expanded. Many people immediately think, "Could this be used to create games? Is this the prototype for the next generation of virtual world platforms?" Such thoughts aren't surprising, as these models resemble interactive 3D space generators.
However, the real focus of these models might not be creating entertaining spaces, but providing a training ground where AI agents can repeatedly experiment, reason, and learn. The problem with past models was a lack of consistency. The upgraded Genie 3 not only improves visual quality and real-time responsiveness but also achieves long-horizon consistency for the first time, making such training truly effective and laying a more concrete foundation for the development of Artificial General Intelligence (AGI).
Genie 3 can generate highly realistic 3D virtual spaces. (Source: Google DeepMind)
According to Google DeepMind, Genie 3 is a general-purpose world model that can generate interactive environments from text prompts in real time, at 720p and 24 frames per second. Compared to its predecessor, Genie 2, it has three major advancements:
Older models, like Genie 2 or WHAMM, either had delayed interactions or could only maintain scenes briefly. Genie 3 updates the screen and responds instantly after user input, making the exploration feel as real-time as a game engine.
During continuous interaction, Genie 3 can remember and maintain the state of previously generated scenes—even if a player returns to the same spot a minute later, the objects, lighting, and even weather remain consistent, reducing the jarring feeling of physical disarray. For example, if you walk past a tree and circle back, it will still be there, unchanged.
Genie 3 achieves spatial consistency through "autoregressive generation": the model conditions on the frames it has already produced to generate the next one, continuously building on its own output.
The challenge is that the amount of history the model must keep track of grows the longer the interaction lasts. For Genie 3 to remember details from a minute ago is quite a feat: it must balance short-term memory (where you walked a minute ago, where objects are) against response speed (reflecting new actions immediately) to deliver a smooth and consistent interactive experience.
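To make this concrete, here is a minimal sketch of an autoregressive generation loop with a sliding context window. The `WorldModel` class and its `predict_frame` method are illustrative stand-ins, not Genie 3's actual interface, which has not been published:

```python
from collections import deque

# Hypothetical stand-in for an autoregressive world model. Genie 3's real
# architecture and API are not public; this only illustrates the loop.
class WorldModel:
    def __init__(self, context_frames: int = 64):
        # Sliding window of recent frames: the model's "short-term memory".
        # A larger window means more consistency but slower generation.
        self.context = deque(maxlen=context_frames)

    def predict_frame(self, action: str) -> str:
        # A real model would run a neural network here, conditioned on
        # every frame in the context plus the user's latest action.
        frame = f"frame_{len(self.context)}({action})"
        self.context.append(frame)
        return frame

world = WorldModel()
for action in ["move_forward", "turn_left", "move_forward"]:
    frame = world.predict_frame(action)  # each frame builds on the last
```

The tension described above lives in the `context_frames` parameter: a longer window preserves more of the scene, but every extra frame the model must attend to makes each generation step slower.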
Beyond basic operations like observing and moving, Genie 3 introduces "promptable events." Users can change world conditions in real-time through text commands, almost like playing god. For instance, you can switch the weather, add new characters or objects, or even trigger story-like changes, expanding interaction from simple space exploration to dynamic story and scenario generation. In the demo, a brown bear suddenly appears in a meadow scene, and a dragon swoops down over the Thames in London.
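Conceptually, a promptable event is just one more conditioning signal fed into the same autoregressive loop, alongside the player's actions. Extending the hypothetical sketch above:

```python
# Extends the hypothetical WorldModel above: a promptable event is queued
# as extra conditioning text that shapes the next generated frame.
class PromptableWorldModel(WorldModel):
    def __init__(self, context_frames: int = 64):
        super().__init__(context_frames)
        self.pending_events: list[str] = []

    def prompt_event(self, description: str) -> None:
        # e.g. "a brown bear wanders into the meadow"
        self.pending_events.append(description)

    def predict_frame(self, action: str) -> str:
        # Fold queued events into the conditioning, then clear the queue.
        conditioning = "; ".join([action, *self.pending_events])
        self.pending_events.clear()
        return super().predict_frame(conditioning)

world = PromptableWorldModel()
world.prompt_event("storm clouds roll in over the meadow")
frame = world.predict_frame("look_up")  # the next frame reflects the event
```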
While Genie 3 visually resembles an "on-the-fly interactive game," its core value lies in providing a simulated world where AI agents can repeatedly experiment and reason.
These AI agents aren't just chatbots; they're virtual "actors" with perception, decision-making, and action capabilities. Google DeepMind's previous release, SIMA (Scalable Instructable Multiworld Agent), is a prime example. SIMA is designed to receive instructions in various 3D virtual environments, observe, plan, and execute step by step. In the demo, it can be instructed to buy specific items at a market, find an exhibit in a museum, or complete complex tasks requiring multiple steps.
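The control loop such an agent runs can be summarized in a few lines. This is a generic observe-plan-act skeleton, not SIMA's actual code; `choose_action` is a placeholder for the agent's learned policy:

```python
# Generic observe-plan-act loop of the kind an agent like SIMA runs.
# `env` is any interactive world, e.g. the hypothetical WorldModel above.
def run_agent(env, instruction: str, max_steps: int = 50) -> bool:
    observation = env.predict_frame("look_around")
    for _ in range(max_steps):
        # Plan: pick the next action from the instruction and the latest view.
        action = choose_action(instruction, observation)
        if action == "done":
            return True  # the agent believes the task is complete
        # Act, then observe how the world responds on the next frame.
        observation = env.predict_frame(action)
    return False

def choose_action(instruction: str, observation: str) -> str:
    # Placeholder policy; a real agent would query a trained model here.
    return "done" if "market_stall" in observation else "move_forward"
```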
(Source: Google DeepMind)
In the past, agents like SIMA were often limited by environmental consistency and predictability: if a scene changed illogically in a short time, the agent's decision chain would be disrupted, preventing it from truly "learning" to handle long-term situations. Genie 3's long-horizon consistency solves this issue. Now, AI agents can perform dozens of actions in a continuously existing world and remember their processes and outcomes.
More crucially, Genie 3's Promptable World Events allow researchers to introduce new variables in real-time, such as sudden weather changes, the addition of unfamiliar characters, or even completely altering mission conditions, forcing agents to reassess strategies in uncertain scenarios. These "counterfactual scenarios" are essential for achieving AGI, as they require AI to not just follow a set script but to adapt flexibly to any possible event.
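In practice, this suggests a simple stress test: run the same task twice, once with a surprise event injected mid-episode, and check whether the agent still succeeds. A sketch, reusing the hypothetical classes and policy above:

```python
# Counterfactual stress test: identical task, with and without a surprise
# event injected mid-episode. Reuses the hypothetical sketches above.
def stress_test(instruction: str, surprise: str | None, steps: int = 50) -> bool:
    env = PromptableWorldModel()
    observation = env.predict_frame("look_around")
    for step in range(steps):
        if surprise is not None and step == steps // 2:
            env.prompt_event(surprise)  # e.g. "a sudden thunderstorm begins"
        action = choose_action(instruction, observation)
        if action == "done":
            return True
        observation = env.predict_frame(action)
    return False

baseline = stress_test("find the exhibit", surprise=None)
perturbed = stress_test("find the exhibit", surprise="the lights go out")
# A robust agent should succeed in both runs, not only the unperturbed one.
```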
In the future, whether it's to develop control systems for self-driving cars or collaborative robots, or digital assistants capable of completing tasks autonomously, these "world models" will be their starting point and testing ground.
While Genie 3 achieves long-horizon consistency, many limitations remain: continuous interaction lasts only a few minutes rather than hours, the range of actions available to an agent is still narrow, multiple agents cannot yet act in the same world, and generated scenes cannot faithfully reproduce real-world locations.
These limitations mean that Genie 3 is still a "closed testing ground" in which an AI cannot yet reside long-term and gradually accumulate experience. DeepMind has chosen to open it as a "limited research preview," letting a small number of academic institutions and creators test it, collecting feedback and gradually stress-testing how much complexity and change this world can handle. After all, to achieve AGI, the worlds these models generate will need longer memory, more organic actions, and the ability to host multiple agents interacting at once.
From Genie and Genie 2, through WHAMM, to today's Genie 3, world models have evolved from generating videos and 3D scenes to maintaining consistent 3D spaces. Perhaps we're not too far from an AI virtual town with organic interactions.