GDC 2010: Streaming Massive Environments from 0 to 200 MPH
Posted by Ben Zeigler on March 17, 2010
Here’s my notes for the talk Streaming Massive Environments from 0 to 200 MPH presented by Chris Tector from Turn 10 Studios. He’s listed as a Software Architect there, and obviously has a deep understanding of the streaming system they used on Forza 3. This talk was nice and deep technically, and touches all parts of the spectrum. I did get a bit lost when it got down to the deep GPU tricks, so I may have missed a bit. Anyway, here’s things that I think were probably said:
- The requirements are to render the game at a constant 60 FPS, which includes tracks, cars, the UI, crowds particles, etc. Lots of things to do, so very little cpu available at runtime for streaming.
- It has to look good at 0 mph, because that’s where marketing takes screenshots, and where the photo mode is enabled. It also has to look good at 200 mph, because that’s how people play the game.
- As an example, the LeMans track is 8.4 miles long, has 6k models and 3k textures. To have entire track loaded would take 200% of the consoles memory JUST for models and textures.
- Much information can be gathered from the academic area of “Massive Model Visualization”. However, beware that academic “real time” does not equal game real time, because of all the other things a game has to do.
- First, the tracks are stored on disk in modified .zip files, using the lzx data format. Tracks take from 90MB to 300MB of space compressed. This data is read off disk in cache-sized blocks. The only actual I/O that is performed is done strictly in order, to avoid the horrible seek times of a DVD.
- The next stage is the in-memory compressed data cache. The track data is stored in this format in the same format as on disk. Forza 3 uses 56MB for this cache, and uses a simple Least Recently Used algorithm to push blocks out of this cache. Each block is 1MB large.
- The next stage is a decompressed heap in memory. There’s a 360-specific LZX decompresser that runs at 20 MB/Sec. They had to optimize the heap heavily to get really fast alloc and free operations. Forza 3 uses 194 MB for this heap and allocates everything aligned and contiguous
- The next stage is the GPU/CPU caching layer. They do something semi tricky for textures. Textures can be present in either Mip 0 (full res), Mip chain (Mip 1 down to 32×32), or Small Texture (a single 32×32 texture) form. There is special 360 support to allow the Mip chain to be split up in different memory locations, so they can stream the Mip 0 in after the rest of the chain and it will display correctly.
- A few special things happen in the GPU/CPU itself. First, there is NO runtime LOD calculation, as the streaming data gives the correct LOD to show, and they are seperate objects in the stream. They did add a basic instancing system to allow a single shader variable. They spent a lot of time optimizing the GPU/CPU for the 360. He mentioned using Command Buffers as much as possible. Spent time right sizing assets to fit optimal shader use. 360 has special controls to reduce MIP memory access (Note: This got a bit too deep for me)
- Many projects use conservative occlusion to determine visibility, often because it can run real time. However, Forza does per-pixel occlusion in an extensive pre process step. Uses depth buffer rejection to figure out what’s occluded. It also does the LoD calculation at this point, and will exclude any objects that aren’t big enough to be visible (contribution rejection). Many games do LoD and contribution rejection runtime, but the data set is huge so they end up having horrible cache performance. (Note: I asked later, and this whole process takes up to 8 hours offline, for a very large track)
- First step in the process is to Sample the visibility information. The tracks have an inner and outer spline that defines the “Active” area, so the sampler picks a set of points inside those splines (and maps them to a grid relative to the center spline). At this point it creates “zones” which are chunks of track.
- To actually sample at each point, it uses a constant height and 4 angle views, relative to track direction. Visibility for each point are automatically placed at adjacent point, because the object came to exist at some undefined spot between two sample points.
- The engine then renders all of the models that are plausible (without textures). It then runs a D3D occlusion query to see what and how much is visible. Each model keeps track of it’s object ID, camera location, and pixels visible. The LoD calculation happens at this point, as it uses the distance info. It can do LoD, Occlusion, and Contribution in a single pass, after 2 renders. So fairly quick individual operation. It then keeps track of the pixel count of each object in a zone, as opposed to just a binary yes/no for visibility.
- After sampling a Splitting process takes place. Many of the artist placed objects are extremely large in their source data, to avoid seams and such. So, it will break these large objects up and cluster smaller objects together into single draw calls. Instancing breaks the clustering, so artists have to be careful
- The next step is the Building process. At this point it maps the textures on to the models. There’s a pass that aggressively removes duplicate textures. It looks for renamed textures, exact copies, and MIP parent/child relationships and will combine them as necessary. It also computes the 32×32 “small textures” at this point. The small textures for an entire track are put into a separate chunk and are preloaded for the entire track. This chunk is from 20-60 MB depending on track and is the only track data that is preloaded. This is so when the low LoD for an object is up and running, it will at least be colored correctly.
- Optimization is the next phase and is somewhat complicated. For each zone it finds the models and textures used in that zone as well as the two adjacent zones. It finds the models and then the textures, and sorts them by number of pixels visible. For any textures that are unneeded (if it’s < 32×32 strip it entirely, if it’s < MIP 0 strip MIP 0) it does the trivial reduction.
- It then does two memory reduction passes. First, it has to lower the total number of models/textures loaded in a zone to be < the decompressed heap. It removes models/textures as required, starting with the least pixels visible. After that it computes a delta of models/textures relative to the zone before and after it. The delta has to be lower than the available streaming bandwidth, so it strips for that reason.
- Once it computes the set of assets in a chunk, it has to package them. It places them in a cache efficient order and places the objects in “first seen” order. Objects that are frequently used end up near the front of the package and will stay in memory throughout, while objects for later in the track are farther back.
- The last step is Runtime. The runtime code is responsible for keeping track of what individual objects to create and destroy, based on the zone descriptions. It could do a reference count but does a simple consolidate where it frees everything first. This reduces fragmentation
- The keys to the Forza system are work ordering, heap efficiency, decompression efficiency, and disk efficiency. Each level of the data pipeline is critical, and anything that can be done to improve a level is worth doing. Don’t over specialize on a particular aspect of the pipeline
- System isn’t perfect. Popping can happen either due to late arrival caused by memory/disk bandwidth not keeping up, or it can be caused by visibility errors. They did have to relax some of their visibility constraints eventually, because certain types of textures threw off the calculation. They provided artists with a manual knob that can tweak an individual object to be more visible at the expense of possibly showing up late. Finally, you have to deal with unrealistic expectations.
- For future games, Chris had a few ideas. First, he would like to expand the system to work for non-linear environments. This would entail replacing linear zones with 3d zones, but would allow open world racing. There are probably more efficient forms of domain specific decompression that could up the decompression bandwidth. The system could do texture transcoding. It should be expanded to add another layer on top of the disk cache: network streaming (Note: Trust me when I say that’s a whole other lecture by itself)
3 Responses to “GDC 2010: Streaming Massive Environments from 0 to 200 MPH”
Sorry, the comment form is closed at this time.