Wednesday, February 3, 2010

A Data-driven Discourse

I am currently TAing CS428 at UIUC. This class is the second in a sequence of software engineering classes. In this class, students form groups of 6-8 and work on projects for an entire semester. Many students choose to work on games, which is serendipitous, as I am interested in game architecture.

In an email I sent out to some of the students, I suggested they look into data-driven design. Unfortunately, there isn't a lot of concrete information on data-driven design. Most of the documentation is along the lines of “put as much of the game in data as you can, it's a good idea” and not a lot of specifics. There was much confusion. The truth of the matter is, I did not fully understand data-driven design until I used it in a real game. I think, in general, data-driven design is one of those topics that people do not fully understand until they've done it. There are lots of experienced people evangelizing a data-driven approach, and a lot of inexperienced people simply not understanding why the evangelists are so fervent. This is my attempt to bridge that gap.

What Does it Look Like?


Data-driven systems can look like pretty much anything. Data can be sucked from virtually any source: .ini files, .xml files, or a custom configuration format. This is part of the reason everyone waves their hands when they talk about data-driven systems: they share little in common other than a high-level focus on data. Certain types of data may have their own formats, such as maps or images. Yes, images. I see you raising your hand in the back of the room. Let me guess, your game was already loading images? The truth of the matter is that most games are partially data-driven, as they usually load images and sounds. The fundamental idea of being data-centric is extremely simple, but unanticipated benefits emerge when the scale is increased. Images and sounds are just the tip of the iceberg.

Isolating Change


One of the reasons data-driven design works is that it helps isolate changes. For example, when tuning a game, you will fiddle with a lot of parameters, and keep testing until the gameplay feels right. With data-driven design, you can centralize these parameters in a single configuration file. Tuning becomes much easier if you are not forced to comb through your entire codebase to make changes. As a related issue, data-driven systems do not need to be recompiled when the data changes. (Well... if you did everything right.) This is a huge advantage for large C programs, where linking can take five minutes or longer. In data-driven systems, you can see the changes as fast as you can load the data.
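To make the "single configuration file" idea concrete, here is a minimal sketch in Python using the standard configparser module. The section names, the parameter names, and the load_tuning helper are all made up for illustration, and the file contents are inlined as a string only so the example is self-contained. In a real game the text would live on disk.

```python
import configparser

# Hypothetical tuning data. In a real project this text would sit in a file
# (say, tuning.ini) so it can be edited without touching any code.
TUNING_INI = """
[player]
speed = 4.5
jump_height = 2.0

[ghost]
speed = 3.0
chase_radius = 12
"""


def load_tuning(text):
    """Parse INI text into a nested dict of floats: tuning[section][name]."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        section: {name: float(value) for name, value in parser[section].items()}
        for section in parser.sections()
    }


tuning = load_tuning(TUNING_INI)
print(tuning["player"]["speed"])
```

Game code then asks for tuning["ghost"]["chase_radius"] instead of burying a literal 12 somewhere in the AI code, so every tunable number lives in one place.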

Isolating change is also handy if you're working with dedicated artists or designers. If everything they edit is data, there is no need for them to recompile the game. The artist I worked with on Scamper Ghost loved this freedom. Not only is this faster and simpler for them, but there is no need to keep them synced up with the current version of the source, or for them to install a compiler or IDE.

From a software engineering perspective, data-driven design cleanly orthogonalizes the system: the game definition gets separated from the implementation issues.

Online Reloading


One of the holy grails of data-driven systems is online loading of data. The idea is that as soon as data changes on disk, it should be possible to load it into the game, even while the game is running. An image doesn't look right? Edit it, hit a button, and it changes in-game. Or, if you're particularly good with OS API Kung Fu, monitor the data directory for changes, and automatically reload it.
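The low-tech version of directory monitoring is to poll each file's modification time and reload when it changes. The sketch below assumes that approach; the ReloadableFile class and its parse callback are hypothetical names, and a real game would probably call poll() once per frame or on a key press rather than use OS change notifications.

```python
import os


class ReloadableFile:
    """Caches a file's parsed contents and re-reads it when its mtime changes."""

    def __init__(self, path, parse):
        self.path = path
        self.parse = parse  # function: raw bytes -> game data
        self.mtime = None   # mtime of the version currently loaded
        self.data = None

    def poll(self):
        """Reload if the file changed on disk. Returns True when reloaded."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self.mtime:
            return False
        with open(self.path, "rb") as f:
            self.data = self.parse(f.read())
        self.mtime = mtime
        return True
```

The first call to poll() always loads the file, so the same code path handles both the initial load and every later reload.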

Personally, I have never reached this data-driven nirvana. There are too many sticky issues about reloading arbitrary data. What kind of data can be reloaded, and what kind of data cannot? For example, reloading a level that is currently being played could cause problems, as the player may find themselves stuck inside a solid object, or similar. If you change the number of hitpoints for a certain kind of monster, do you attempt to adjust the hitpoints of already instantiated monsters, or just the ones instantiated in the future? There are likely many heuristics to address these issues, but I have not had a chance to explore them.
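To show what the hitpoint question looks like in code, here is a sketch of two possible policies. This is not something I have battle-tested. The Monster class, the policy names, and the rescaling rule are all assumptions made up for illustration, and it assumes the old maximum is greater than zero.

```python
class Monster:
    def __init__(self, kind, max_hp):
        self.kind = kind
        self.max_hp = max_hp
        self.hp = max_hp


def on_max_hp_reloaded(live_monsters, kind, new_max_hp, policy):
    """Apply a reloaded max-hitpoint value to monsters that already exist.

    "future_only":  leave live monsters alone; only new spawns see the change.
    "rescale_live": keep each live monster at the same fraction of its health.
    """
    if policy == "future_only":
        return
    if policy != "rescale_live":
        raise ValueError(f"unknown reload policy: {policy}")
    for monster in live_monsters:
        if monster.kind == kind:
            # Never let rescaling kill a monster that was still alive.
            monster.hp = max(1, round(monster.hp * new_max_hp / monster.max_hp))
            monster.max_hp = new_max_hp
```

Neither policy is obviously right, which is exactly the sticky part: the choice depends on what the designer expects to see after hitting reload.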

I Declare!


My personal favorite reason to go data-driven is that configuration files are declarative, not imperative. The data is not coupled with how it is used. This means the same data can be used in multiple ways. For example, in Flash, data can either be loaded from external sources or embedded in the binary. Embedding data makes it much faster to load (the fixed overhead of transferring a file across the network is comparatively huge if there is only a small amount of data), and simplifies distribution, as it reduces the number of files that need to be distributed. Unfortunately, embedded data cannot be changed without recompiling the binary. Ideally a game would not embed data during development, but embed it for release. Sadly, Flash does not provide an easy way to switch between these two configurations. File embedding is specified by an explicit directive in the source code for each embedded file, and loading an embedded file requires slightly different code than loading an external file.

In Scamper Ghost, the game loaded data off the disk by default. When creating a release build, a Python script parsed the config file and found all of the images and sounds the game used. The script then generated an additional source file that embedded all of the resources and provided hooks for loading them. In general, the declarative nature of data allows tools to analyze and manipulate the game data in arbitrary ways.
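This is not the actual Scamper Ghost script, just a sketch of the same shape. It assumes a JSON config (the real format is not shown here), treats any string ending in a known extension as a resource, and emits an ActionScript class with one [Embed] declaration per file. The names EmbeddedResources and BY_PATH, and the exact layout of the generated source, are invented for illustration.

```python
import json
import re

# Hypothetical game config; the real format and keys may differ.
CONFIG_JSON = """
{
  "sprites": {"ghost": "images/ghost.png", "wall": "images/wall.png"},
  "sounds": {"pickup": "sounds/pickup.mp3"}
}
"""

RESOURCE_EXTENSIONS = (".png", ".jpg", ".mp3")


def find_resources(node):
    """Recursively yield every string in the config that looks like a resource path."""
    if isinstance(node, dict):
        for value in node.values():
            yield from find_resources(value)
    elif isinstance(node, list):
        for value in node:
            yield from find_resources(value)
    elif isinstance(node, str) and node.lower().endswith(RESOURCE_EXTENSIONS):
        yield node


def generate_embed_source(paths):
    """Return ActionScript source embedding each path, plus a path -> class table."""
    unique = sorted(set(paths))
    names = {p: re.sub(r"\W", "_", p) for p in unique}  # path -> identifier
    out = ["package {", "    public class EmbeddedResources {"]
    for p in unique:
        out.append(f'        [Embed(source="{p}")]')
        out.append(f"        private static const {names[p]}:Class;")
    out.append("        public static const BY_PATH:Object = {")
    out.append(",\n".join(f'            "{p}": {names[p]}' for p in unique))
    out.append("        };")
    out += ["    }", "}"]
    return "\n".join(out) + "\n"


resources = list(find_resources(json.loads(CONFIG_JSON)))
source = generate_embed_source(resources)
print(source)
```

The game's loader can then check the generated table first and fall back to disk, so the rest of the code never needs to know which build it is running in.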

So What About Game Logic?


Most evangelists insist that a game should contain no game-specific code; rather, everything should be defined in data. At first glance this is a little scary and quite impossible, but it is mitigated by three unspoken assumptions.

First, quite a bit of “game specific” code can be generalized into a parametrization of generic code. For example, if a game has particle effects, there will likely be some code, somewhere that implements particle effects. It is patently absurd to imagine that particle effects could or should be defined from first principles using data. At that point, the configuration language would have become a programming language. It should be possible to share the same particle effect code between games, but each game may have different looking particle effects due to how they're parametrized. In addition, a game may require a novel parameter, such as controlling how much the particle effect is affected by wind, and adding support for that parameter is not considered game-specific code (by the evangelists), even though that parameter may only be used in one game.
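As a rough sketch of what "parametrization of generic code" means here, the snippet below defines one generic parameter block for particle effects and adds a wind parameter with a default that leaves existing effects unchanged. The field names and the very simple velocity update are assumptions for illustration, not a real particle system.

```python
from dataclasses import dataclass, fields


@dataclass
class ParticleParams:
    count: int = 50
    lifetime: float = 1.0
    speed: float = 2.0
    # The "novel" parameter: 0.0 means wind is ignored, so effects written
    # before this parameter existed behave exactly as they did.
    wind_influence: float = 0.0


def params_from_data(data):
    """Build ParticleParams from a dict loaded from a config file."""
    known = {f.name for f in fields(ParticleParams)}
    unknown = set(data) - known
    if unknown:
        raise KeyError(f"unknown particle parameters: {sorted(unknown)}")
    return ParticleParams(**data)


def step_velocity(vx, wind_x, params, dt):
    """Pull horizontal velocity toward the wind speed, scaled by wind_influence."""
    return vx + (wind_x - vx) * params.wind_influence * dt
```

Two games can share this code and still look completely different, because everything that distinguishes their effects lives in the dicts passed to params_from_data.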

Second, the evangelists assume there is a scripting language, and that the scripting language does not count as code. The number of scripts needed by a game should be minimized, however, by clever and liberal parametrization. If you are not using a scripting language, your goal should be to keep the “code” decoupled from the game logic. (The reverse will likely be untrue: the game logic will likely need to understand the code unless it is isolated with an interface.) In any case, best practice is that game logic be attached to the game using data. For example, a configuration file could indicate that a certain type of monster will call a given script when hit.
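Here is a small sketch of the monster example. Plain Python functions stand in for scripts, and a dict stands in for the configuration file. The monster kinds, the script names, and the on_hit key are all hypothetical.

```python
# "Scripts": in a real game these might be written in a scripting language.
def split_in_two(kind, events):
    events.append(f"{kind} splits in two")


def play_clang(kind, events):
    events.append(f"{kind} goes clang")


# Registry mapping the names used in data to callable scripts.
SCRIPTS = {"split_in_two": split_in_two, "play_clang": play_clang}

# Stand-in for a config file: which script each monster type runs when hit.
MONSTER_DATA = {
    "slime": {"on_hit": "split_in_two"},
    "knight": {"on_hit": "play_clang"},
    "bat": {},  # no special behavior when hit
}


def handle_hit(kind, events):
    """Generic engine code: look up the on_hit script in data and run it."""
    script_name = MONSTER_DATA[kind].get("on_hit")
    if script_name is not None:
        SCRIPTS[script_name](kind, events)
```

Note that handle_hit knows nothing about slimes or knights. Changing what happens when a monster is hit means editing MONSTER_DATA, not the engine.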

Third, the evangelists assume that you're using some kind of component system to define game object structure. Component systems allow you to define game object structure via the composition of known substructures. Without a component system, you would need to hardcode the structures for various game objects. Of course, best practice would dictate that these structures should be part of the “game logic” and not the main code.
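A bare-bones sketch of the idea, assuming three made-up component types and a dict standing in for the data file: each game object is just a named collection of components, and which components it gets is decided entirely by data.

```python
from dataclasses import dataclass


@dataclass
class Position:
    x: float = 0.0
    y: float = 0.0


@dataclass
class Health:
    hp: int = 1


@dataclass
class Sprite:
    image: str = ""


# The known substructures, keyed by the name used in data.
COMPONENT_TYPES = {"position": Position, "health": Health, "sprite": Sprite}

# Stand-in for a config file: each object type lists its components and their
# parameters. The ghost has no health component, so it cannot be damaged.
OBJECT_DATA = {
    "ghost": {"position": {"x": 3, "y": 4}, "sprite": {"image": "images/ghost.png"}},
    "crate": {"position": {}, "sprite": {"image": "images/crate.png"}, "health": {"hp": 3}},
}


def build_object(kind):
    """Compose a game object from the components its data entry lists."""
    return {
        name: COMPONENT_TYPES[name](**args)
        for name, args in OBJECT_DATA[kind].items()
    }
```

Adding a new kind of object that reuses existing components is then a data change only. New code is needed only when a genuinely new component type appears.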

Bigger Than You'd Think


Another aspect of data-driven systems that is easy to underestimate is the magnitude of data most games contain. Scamper Ghost, a small but polished game, contained over 500 images. Modern commercial games contain megabytes to gigabytes of data. Techniques that work at smaller scales, such as embedding image names in code, simply do not scale. With smaller amounts of data, many of the benefits of data-driven design can still be realized, but for “normal” amounts of data, being data-centric is a virtual necessity.

Permission to Sin


Data-driven systems are hard to bootstrap. It is often tempting to hardcode something, just to get the game up and running. My advice is this: do it. The true value of data-driven systems is only realized when almost everything is data, however. You will likely come to regret hardcoding whatever it was, and go back to generalize it later. Hardcoding is simply part of how data-driven systems are bootstrapped; attempting to be strictly orthodox from the beginning will simply slow down the bootstrapping process. Just be sure that you don't paint yourself into a corner.
