After teaching AI to play Minecraft and aiming to make Copilot your gaming guide, Microsoft has taken another leap forward with the WHAMM project. This initiative explores how an AI model can act as the game maker itself, generating interactive game content in real time without relying on a traditional game engine. Microsoft chose the classic 1997 first-person shooter Quake II as its simulation target. The goal is not just to create game visuals but to teach AI how to craft a playable experience. The underlying research has been published in Nature under the title "World and Human Action Models Towards Gameplay Ideation," highlighting both WHAMM's potential and its challenges in creative, interactive generation. You can try it out in your browser now.
WHAMM stands for "World and Human Action MaskGIT Model," and it's part of Microsoft's Muse family of generative models. Rather than being a "game system," it's more of an AI engine designed to simulate games: it attempts to infer, in real time, what players should see next as they interact with the environment.
The goal of this model is to train an AI system that can understand environmental states and human actions simultaneously, generating a fully interactive game experience within the model itself. Unlike a traditionally developed game, which runs on an engine's logic, WHAMM doesn't execute any game code. Instead, it predicts each frame from the data relationships it learned during training.
In simple terms, when you move, shoot, or jump, WHAMM doesn't just see these as game commands. It uses them as contextual clues to predict what the next frame will look like. This operation is akin to how language models generate sentences or image models complete pictures, but this time, it's generating an entire playable game.
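To make that analogy concrete, here is a minimal, runnable sketch of the idea in Python. The DummyWorldModel class and its predict method are stand-ins invented for illustration; WHAMM's real interface is not published in this form, and the context length shown is arbitrary. The point is the control flow: recent frames and actions form the context, and the output is a predicted frame rather than the result of executing game logic.

```python
from collections import deque

class DummyWorldModel:
    """Stand-in for a trained world model. A real model would return image
    tokens or pixels; this one returns a text description so the example
    runs anywhere."""
    def predict(self, context, action):
        return f"frame after '{action}' (conditioned on {len(context)} earlier steps)"

context = deque(maxlen=9)        # a short sliding window of (frame, action) pairs
model = DummyWorldModel()

for action in ["move_forward", "turn_left", "shoot"]:
    # The action is not executed as a game command; it is a conditioning
    # clue alongside the recent frames.
    frame = model.predict(list(context), action)
    context.append((frame, action))
    print(frame)
```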
WHAMM significantly improves upon its predecessor, WHAM-1.6B, which could only run games at one frame per second, making interaction nearly impossible. WHAMM, however, has been upgraded in both technology and strategy: it now generates over 10 frames per second, meeting the basic threshold for real-time interaction. The resolution increased from 300×180 to 640×360. It uses the MaskGIT (Masked Generative Image Transformer) parallel generation strategy, replacing the previous autoregressive method (predicting token by token) to enhance frame generation efficiency.
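To give a feel for why the MaskGIT approach is faster, here is an illustrative Python sketch, not WHAMM's actual code: DummyModel is an invented stand-in that returns random scores, and the unmasking schedule is simplified. It contrasts token-by-token autoregressive decoding, which needs one forward pass per token, with MaskGIT-style parallel decoding, which fills in a whole frame's tokens over a handful of passes.

```python
import torch

VOCAB, MASK_ID = 512, -1

class DummyModel:
    """Stand-in for a trained transformer: returns one random score vector per position."""
    def __call__(self, tokens):
        return torch.randn(len(tokens), VOCAB)

def autoregressive_decode(model, num_tokens):
    # One forward pass per token: each new token conditions on all previous ones.
    tokens = torch.full((num_tokens,), MASK_ID, dtype=torch.long)
    for i in range(num_tokens):
        logits = model(tokens)
        tokens[i] = logits[i].argmax()
    return tokens

def maskgit_decode(model, num_tokens, steps=8):
    # Start fully masked; each step commits the most confident predictions in
    # parallel, so the whole frame needs only `steps` forward passes.
    tokens = torch.full((num_tokens,), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens)
        probs, preds = logits.softmax(-1).max(-1)
        probs[tokens != MASK_ID] = -1.0            # never re-commit known tokens
        remaining = int((tokens == MASK_ID).sum())
        commit = max(1, remaining // (steps - step))   # simple unmask schedule
        idx = probs.topk(commit).indices
        tokens[idx] = preds[idx]
    return tokens

print(autoregressive_decode(DummyModel(), 64))    # 64 forward passes
print(maskgit_decode(DummyModel(), 64, steps=8))  # 8 forward passes
```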
What's truly impressive is WHAMM's minimal training data requirement. It was trained on just a week's worth of Quake II gameplay data focused on a single level, recorded by professional testers who deliberately used a wide range of actions to cover different behaviors and scene transitions. This gave the model the cues it needed to learn game logic and player interaction. In contrast, the previous WHAM-1.6B required seven years of Bleeding Edge gameplay records just to establish basic frame prediction and interaction relationships.
When people hear about AI generating game scenes in real-time, they might immediately think of Roguelike games like Spelunky, Noita, or Dead Cells, where each level is different, seemingly offering a "real-time changing" experience.
However, WHAMM operates on a fundamentally different logic than Roguelikes:
・Roguelikes derive randomness from "preset generation rules." The game assembles the entire map, enemy positions, and mechanics using algorithms before the player enters, following set rules. While these random elements are rich, the game's reactions and logic are hardcoded into the engine.
・WHAMM, on the other hand, predicts what will happen next and generates the frame "as you play." It doesn't pre-design levels or execute game engine logic. Instead, it infers the next frame based on what you've done in the past few seconds.
This real-time generation is more like AI "performing a game" in front of you: every action you take prompts an immediate frame response, and the environmental changes you cause are carried forward to drive subsequent interaction. In other words, you're not playing a pre-designed random level; you're participating in a simulated world the AI generates around your actions, as the sketch below contrasts. It sounds revolutionary, but it also presents several challenges.
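The toy contrast below is purely illustrative: generate_level and predict_next_frame are invented stand-ins, not real Quake II or WHAMM code. The roguelike-style level exists in full before anyone plays it; the WHAMM-style frames exist only as the player acts.

```python
import random

# Roguelike-style: the whole level is assembled from preset rules *before*
# the player enters; hard-coded engine logic then reacts to the player.
def generate_level(seed, size=8):
    rng = random.Random(seed)
    return [[rng.choice([".", "#", "E"]) for _ in range(size)] for _ in range(size)]

# WHAMM-style: nothing is pre-built. Each tick, the predicted frame depends on
# what the player just did; predict_next_frame stands in for the model.
def predict_next_frame(context, action):
    return f"t={len(context)}: view after {action}"

def play(actions):
    context = []
    for action in actions:
        frame = predict_next_frame(context[-10:], action)  # only recent history matters
        context.append((frame, action))
        yield frame

level = generate_level(seed=42)                                 # exists before anyone plays
frames = list(play(["move_forward", "strafe_left", "shoot"]))   # exists only as you play
```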
According to the research, WHAMM's design focuses on three "creative ideation capabilities": consistency (visual and logical coherence), diversity (ensuring environments and content aren't always the same), and continuity of user modifications (scene changes should be remembered and persist). These are considered fundamental conditions for applying generative AI in interactive creation. So, how does it fare in achieving these capabilities?
After trying out this AI-generated version of Quake II in a browser, I must say, it's still quite a distance from being a "playable" game. Although WHAMM has significantly improved from the previous one frame per second to over 10 frames per second, it still falls short of the basic acceptable standard for modern games. Even at the lowest quality settings, mainstream games typically maintain at least 30 frames per second.
During gameplay, I also noticed some technical limitations. For instance, when firing at an enemy, if you shift your view slightly, the enemy may suddenly disappear from the screen. This shows that WHAMM still struggles to maintain object continuity and state memory within a scene. The model can predict the next frame from context, but whether an element is retained when the focus or the view changes depends on what it "learned" from the training data, not on any built-in logic that records it.
This highlights a fundamental difference from traditional game design: a traditional game engine explicitly tracks the state and position of every object, whereas WHAMM predicts the future from past visuals and actions; it is not a true physics or state simulator.
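A toy illustration of that difference, using invented classes rather than any real engine or model code: a traditional engine keeps an authoritative record of every object, so an enemy persists even when it is off-screen, while a world model has no such record to fall back on.

```python
from dataclasses import dataclass

@dataclass
class Enemy:
    x: float
    y: float
    health: int

class EngineWorld:
    """A traditional engine's authoritative state: objects exist independently
    of whether the camera currently shows them."""
    def __init__(self):
        self.enemies = [Enemy(x=10.0, y=4.0, health=100)]

    def render(self, camera_x):
        # Visibility depends on the camera, but the object itself persists.
        return [e for e in self.enemies if abs(e.x - camera_x) < 8]

world = EngineWorld()
print(world.render(camera_x=12.0))   # enemy in view, drawn
print(world.render(camera_x=40.0))   # off-screen, not drawn, but still tracked
print(world.enemies)                 # state persists either way

# A world model like WHAMM keeps no such list: if the enemy drops out of the
# recent frames it conditions on, nothing forces it to reappear when you look
# back -- persistence has to be learned from data, not stored as state.
```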
WHAMM showcases the potential of generative AI in interactive content, but it also reveals current limitations: whether it's generation speed, visual continuity, or logical consistency, it still falls far short of the implementation standards of typical games.
The real question might not be whether WHAMM can create a complete game, but whether such a model can serve as a tool for sparking inspiration in the early stages of game development. Can it help developers quickly simulate the player's perspective during narrative creation? Will future AI be able to take commands and also co-create scenes, respond in real time, and offer choices? The definitions of game design, and even of creative tools, are being rewritten right before our eyes.