This Is Doom Running on a Diffusion Model

GameNGen is an interesting proof-of-concept for a diffusion model-based “game engine.”
You’ve probably seen some version of the video below hundreds of times. Maybe it was a smart toothbrush display, an Ikea lamp, or a Roomba.

Doom’s E1M1: Hangar, the iconic first-person shooter’s first level, is often used to showcase how the open source game can run on almost any device you can think of. The video below is novel not because of what device it’s running on, but because of how it’s running at all. What you’re looking at is not the Doom game engine but a diffusion model, a type of generative AI model most commonly used to generate media, responding to player input in real time.

This is “GameNGen” (pronounced “game engine”), the work of researchers from Google, DeepMind, and Tel Aviv University. They call it “the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.” Without getting too deep into the weeds, the way it works is that the diffusion model is trained on gameplay footage of Doom to produce the next frame based on the frames that came before it and the player’s input.
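To make that loop concrete, here is a heavily simplified sketch in Python. None of this is GameNGen’s actual code, which isn’t public; the frame size, context length, number of denoising steps, and the `denoise_step` stub are all illustrative stand-ins for the trained diffusion model the paper describes.

```python
import numpy as np
from collections import deque

FRAME_SHAPE = (240, 320, 3)  # illustrative; not the paper's resolution
CONTEXT_LEN = 8              # hypothetical history length
DENOISE_STEPS = 4            # real-time use favors few sampling steps

def denoise_step(noisy_frame, past_frames, past_actions, t):
    """Stand-in for the trained diffusion network: one denoising step
    conditioned on the frame/action history. A real model predicts the
    noise to remove; this toy just blends toward the last frame."""
    prev = past_frames[-1]
    alpha = (t + 1) / DENOISE_STEPS
    return alpha * prev + (1 - alpha) * noisy_frame

def next_frame(past_frames, past_actions):
    """Autoregressive generation: start from pure noise and run a few
    denoising steps to produce the predicted next frame."""
    x = np.random.randn(*FRAME_SHAPE)
    for t in range(DENOISE_STEPS):
        x = denoise_step(x, past_frames, past_actions, t)
    return x

# The "game loop": every tick, read input, generate one frame, then push
# that frame back into the context so it conditions the frame after it.
frames = deque([np.zeros(FRAME_SHAPE)] * CONTEXT_LEN, maxlen=CONTEXT_LEN)
actions = deque([0] * CONTEXT_LEN, maxlen=CONTEXT_LEN)
for tick in range(60):
    action = np.random.randint(8)  # stand-in for the player's keypress
    frame = next_frame(list(frames), list(actions))
    frames.append(frame)
    actions.append(action)
```

The key point the sketch captures is that there is no game state anywhere: each frame is generated from noise, conditioned only on recent frames and inputs, and then fed back in as history.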

All generative AI models essentially work like this: they are trained on massive amounts of data in order to predict what the next word, frame, or pixel should be, and that prediction becomes the output. GameNGen has impressively extended this method to a somewhat functional, real-time interactive video game. At the moment, GameNGen runs at about 20 frames per second, which is incredibly slow, especially for an old video game, but it does look like Doom. According to the GameNGen paper, 10 human raters presented with 130 random short gameplay clips had only a slightly better than random chance of telling the difference between a GameNGen-generated clip and a “real” Doom gameplay clip. I think that I, a Doom scholar, would do a lot better than that, but that’s neither here nor there.
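As a toy illustration of that shared structure (every name here is hypothetical, and the “model” is a stub rather than anything trained), the autoregressive loop looks the same whether the elements are words or frames:

```python
class EchoModel:
    """Toy stand-in for a trained generative model: it 'predicts' the
    next element by repeating the last one. A real model would be a
    transformer (for text) or a diffusion net (for frames)."""
    def predict_next(self, context):
        return context[-1]

def generate(model, context, steps):
    """The generic autoregressive loop: feed in the history, append the
    model's prediction, and repeat with the grown history."""
    for _ in range(steps):
        context.append(model.predict_next(context))
    return context

print(generate(EchoModel(), ["DOOM"], 3))
# ['DOOM', 'DOOM', 'DOOM', 'DOOM']
```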

It’s kind of hard to tell the difference just by looking at the video, and if you look closely all you’ll see is a kind of crappy version of Doom. (Interestingly, the only real “hallucination” I can see in the video pops up when the player shoots an enemy, which results in some blurry feedback animations.) But the future implied by the researchers and the project’s name is that the technology could get to a point where it completely changes how games are made.

“Today, video games are programmed by humans. GameNGen is a proof-of-concept for one part of a new paradigm where games are weights of a neural model, not lines of code,” the researchers write. “GameNGen shows that an architecture and model weights exist such that a neural model can effectively run a complex game (DOOM) interactively on existing hardware. While many important questions remain, we are hopeful that this paradigm could have important benefits. For example, the development process for video games under this new paradigm might be less costly and more accessible, whereby games could be developed and edited via textual descriptions or example images. A small part of this vision, namely creating modifications or novel behaviors for existing games, might be achievable in the shorter term.”

Is it possible that in the future all video game engines will just be different diffusion models? Maybe, I don’t know. As the researchers note, “important questions remain,” such as: how do you make a diffusion model version of Doom without training on an already existing version of Doom? Or, as is the problem with all generative AI, how do you make games that are not directly derived from existing games? And if you do, are you just stealing from all the game developers who created that training data?
