#######################################################################

Overview of game file formats and archives
by Luigi Auriemma
e-mail: aluigi@autistici.org
web:    aluigi.org

Date:
13 Apr 2013   collected statistics and writing
13 Dec 2018   revision and release


#######################################################################


------------------------
Introduction and formats
------------------------


Games use a lot of different file types: textures, sounds, music, 3D
models, AI scripts, animation scripts, configuration files, images,
videos and so on.

Instead of leaving those files scattered in the game's folder, the
developers prefer to store them in one or more archives for the
following main reasons:

- performance
  accessing one single file (the whole archive) requires fewer
  resources than opening and closing every single file, which results
  in shorter loading times and less memory and disk usage (no disk
  allocation unit alignment and no continuous opening of different
  files).

- content protection
  often these archives contain encrypted content: game developers and
  publishers try to prevent its usage for modding or personal use (for
  example listening to a soundtrack) and obviously its embedding in
  other commercial projects.
  In this case the adopted solutions range from the simple obfuscation
  of the content by XORing the data with a fixed byte or key to
  customized encryption algorithms.

- saving space
  many archives use compression algorithms and other mechanisms for
  saving space on disk; it was quite common in the past and it's
  necessary nowadays, when games occupy gigabytes of space.

These solutions can be used alone or, often, combined, so it's not
rare to see an archive containing compressed and encrypted content.

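The simple XOR obfuscation mentioned above can be sketched in a few lines. This is only an illustrative sketch: the function name `xor_bytes`, the sample data and the keys are assumptions made for the example and are not taken from any specific game.

```python
from itertools import cycle

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the repeating key.

    A one-byte key is the "fixed byte" case, a longer key is the
    "XOR with key" case.
    """
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

plain = b"OggS example payload"            # made-up content
obfuscated = xor_bytes(plain, b"\x5a")     # fixed byte
restored = xor_bytes(obfuscated, b"\x5a")  # same operation undoes it
keyed = xor_bytes(plain, b"key")           # multi-byte key
```

Since XOR is its own inverse, the same function works for both obfuscating and restoring the content.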
When we are in front of an archive or an encrypted/compressed file, our
goal is just dumping its content and later understanding how to use
the dumped files: for example a 3D model in software like 3ds Max, an
Ogg file in a media player, or customized formats that must be
converted to other formats and so on.

The part of the procedure covered by this document is just the first
step: understanding a file format for extracting its content "as is".
Usually only the following parameters are necessary:

- offset of the resource, the location of the file inside the archive
  (where it begins)
- size of the resource
- optional compressed/uncompressed size, if the file has been shrunk
  with a compression algorithm
- optional name of the resource, often the original name of the
  archived file

This information is usually stored in an index table called TOC
(table of contents). In some games it may be encrypted to prevent the
correct dumping of the resources, while in other formats the resources
are stored sequentially, which avoids the need for an offset field for
each file.

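As a minimal sketch of how these parameters are used, the following code builds and then extracts a tiny archive with an index table. The layout is hypothetical and chosen only for the example: a 32-bit little-endian number of files, followed by one entry per file (32-bit offset, 32-bit size, 16-byte zero-padded name), followed by the data. The names `NAME_LEN`, `build_archive` and `extract` are assumptions, not part of any real format or tool.

```python
import struct

NAME_LEN = 16  # hypothetical fixed-size, zero-padded name field

def build_archive(files):
    """Pack {name: data} as: count, (offset, size, name) entries, data."""
    toc_size = 4 + len(files) * (4 + 4 + NAME_LEN)
    toc = struct.pack("<I", len(files))
    body = b""
    for name, data in files.items():
        # absolute offset = end of the TOC + data already written
        toc += struct.pack(f"<II{NAME_LEN}s",
                           toc_size + len(body), len(data), name.encode())
        body += data
    return toc + body

def extract(archive):
    """Read the TOC and return {name: data} using offset, size and name."""
    (count,) = struct.unpack_from("<I", archive, 0)
    out = {}
    pos = 4
    for _ in range(count):
        offset, size, raw_name = struct.unpack_from(
            f"<II{NAME_LEN}s", archive, pos)
        pos += 8 + NAME_LEN
        name = raw_name.rstrip(b"\0").decode()
        out[name] = archive[offset:offset + size]
    return out

sample = {"test.txt": b"A" * 41, "blah.dat": b"B" * 20}
archive = build_archive(sample)
dumped = extract(archive)
```

The extractor never needs to know what the files contain: offset and size are enough to dump them "as is", and the name is only used for saving them.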
In the next examples I will use some words to identify some common
fields:

- OFFSET   location of the file in hexadecimal (0x22 = 34)
- ZSIZE    compressed size
- SIZE     normal and uncompressed size
- FILES    amount of files stored in the archive
- NAME     name of the stored file
- FILE     the content (data) of the stored file

Example of index table:

      +-----------------+
      | FILES 2         |
      +-----------------+
    /-| OFFSET 00000022 |
    | +-----------------+
    | | SIZE 41         |
    | +-----------------+
    | | NAME test.txt   |
    | +-----------------+
  /-+-| OFFSET 0000004b |
  | | +-----------------+
  | | | SIZE 20         |
  | | +-----------------+
  | | | NAME blah.dat   |
  | | +-----------------+-------------------------+
  | \>| FILE 1                                    |
  |   +----------------------+--------------------+
  \-->| FILE 2               |
      +----------------------+

Example of sequential files:

      +-----------------+
      | SIZE 41         |
      +-----------------+
      | NAME test.txt   |
      +-----------------+-------------------------+
      | FILE 1                                    |
      +-----------------+-------------------------+
      | SIZE 20         |
      +-----------------+
      | NAME blah.dat   |
      +-----------------+----+
      | FILE 2               |
      +----------------------+

Example of sequential table followed by sequential files:

      +-----------------+
      | FILES 2         |
      +-----------------+
      | SIZE 41         |
      +-----------------+
      | NAME test.txt   |
      +-----------------+
      | SIZE 20         |
      +-----------------+
      | NAME blah.dat   |
      +-----------------+-------------------------+
      | FILE 1                                    |
      +----------------------+--------------------+
      | FILE 2               |
      +----------------------+

Some variants and customizations:

- relative file offsets: the absolute offset from which the relative
  file offsets are calculated is usually specified directly at the
  beginning of the archive or calculated before or after having read
  the whole TOC:
  - before: it can be accomplished only with fixed size file entries,
    for example with filenames having a maximum length:
      BASE_OFF = offset_first_entry + (entries * sizeof(entry))
  - after: it's necessary to parse all the entries before knowing this
    offset

- sector offset: quite common on PlayStation games where the specified
  offsets must be multiplied by 2048 (size of disk sector)

- TOC at the end: the TOC is often located at the beginning of the
  archive but some games prefer to put it at the end so that the
  archive can be updated in the future with new content, usual
  methods:
  - header at the beginning telling the offset where the TOC is
    located
  - few bytes of information at the end containing the TOC offset or
    just the size of the TOC, from which the offset can be retrieved

- nested tree: usually the filenames already include the full path
  like models\character\chara_1.mdl but sometimes the whole directory
  tree is stored in the archive (folders and files) and it must be
  parsed recursively

- compressed TOC: sometimes the TOC may be compressed

- chunked files: see later

- TOC in a separate file: usually called "index file", a small file
  that contains all the information about the files archived in a
  "data file"; usually they share the same name with a different
  extension, for example: archive.idx and archive.dat

- ZIP format: sometimes games use just a ZIP archive for containing
  their files. Some games implement a custom version of the ZIP
  format, as it happens with those that add a new compression
  algorithm (Forza Motorsport and Dark Sector) or those that use some
  different fields or don't use the classical "PK" magic values for
  the various sections of the ZIP archive.

There are even archives in which the format is really complex because
they don't store the original files but use them as direct "resources"
ready to be used in the game engine, so there are more steps to
accomplish our target.

If an archive uses a block cipher like AES or Blowfish there is also a
third size component to take into consideration: the block aligned
size of the resource.
If this value is missing, usually it's automatically calculated, or
the game uses EVP_CipherFinal of OpenSSL or stream modes like CTR.

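When the aligned size is not stored, it can be derived from the size of the data and the block size of the cipher. The sketch below shows this calculation together with the sector offset conversion described in the variants; the function names `aligned_size` and `sector_to_offset` are assumptions made for the example.

```python
def aligned_size(size: int, block: int = 16) -> int:
    """Round size up to the next multiple of the cipher block size.

    16 bytes is the AES block size, Blowfish uses 8-byte blocks.
    """
    return (size + block - 1) // block * block

def sector_to_offset(sector: int, sector_size: int = 2048) -> int:
    """Convert an offset expressed in disk sectors to a byte offset."""
    return sector * sector_size

xsize = aligned_size(41)   # 41 bytes of data occupy 48 bytes with AES
padding = xsize - 41       # 7 bytes of padding at the end
```

With a 41-byte compressed file and a 16-byte block the aligned size is 48, so the extractor has to read and decrypt 48 bytes and then keep only the first 41.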
Example of stored file encrypted with a block cipher:

      +-----------------+
      | OFFSET 00000022 |
      +-----------------+
      | ZSIZE 41        | compressed size
      +-----------------+
      | SIZE 180        | uncompressed size
      +-----------------+
      | XSIZE 48        | archive size (aligned)
      +-----------------+
      | NAME test.txt   |
      +-----------------+--------------------------------+
      | FILE 1 (compressed and encrypted)        PADDING |
      +--------------------------------------------------+

A solution that is often used to save space is dividing the archived
files in small parts called "chunks".
The advantage of this technique is that the chunks are compressed only
if the compressed size is lower than the uncompressed one, but the
disadvantage is that small chunks can't take full advantage of the
most advanced compression algorithms because the dictionary/window
doesn't have enough data to be filled and used.
Usually the decompressed size of the chunks is not specified because
it's hardcoded in the game.
A compressed chunk with size zero or equal to the chunk decompressed
size means it's stored "as-is" without compression.

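The chunk logic described above can be sketched as follows. This is an illustration under stated assumptions: zlib is used as the compression algorithm, the chunk size of 64 bytes is an arbitrary value standing in for the one hardcoded in a game, and the names `CHUNK_SIZE`, `pack_chunks` and `unpack_chunks` are made up for the example.

```python
import zlib

CHUNK_SIZE = 64  # hypothetical value, normally hardcoded in the game

def pack_chunks(data):
    """Split data in chunks, compressing each one only if it shrinks."""
    zsizes, blob = [], b""
    for i in range(0, len(data), CHUNK_SIZE):
        raw = data[i:i + CHUNK_SIZE]
        comp = zlib.compress(raw)
        stored = comp if len(comp) < len(raw) else raw
        zsizes.append(len(stored))
        blob += stored
    return zsizes, blob

def unpack_chunks(zsizes, blob, size):
    """Rebuild the file from the list of chunk ZSIZEs and the total SIZE."""
    out = bytearray()
    pos = 0
    for zsize in zsizes:
        # the last chunk may be shorter than CHUNK_SIZE
        expected = min(CHUNK_SIZE, size - len(out))
        if zsize in (0, expected):
            # stored "as-is": the chunk occupies its uncompressed size
            out += blob[pos:pos + expected]
            pos += expected
        else:
            out += zlib.decompress(blob[pos:pos + zsize])
            pos += zsize
    return bytes(out)

original = bytes(range(64)) + b"A" * 64 + b"B" * 52   # 180 bytes
zsizes, blob = pack_chunks(original)
rebuilt = unpack_chunks(zsizes, blob, len(original))
```

In this sample the first chunk contains 64 different bytes, so compression would make it bigger and it gets stored "as-is", while the other two chunks are compressed.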
Example of chunk based file:

      +-----------------+
      | OFFSET 00000022 |
      +-----------------+
      | SIZE 180        |
      +-----------------+
      | NAME test.txt   |
      +-----------------+
      | CHUNKS 3        |
      +-----------------+
      | CHUNK ZSIZE 30  | * let's say CHUNK SIZE is 64
      +-----------------+
      | CHUNK ZSIZE 42  |
      +-----------------+
      | CHUNK ZSIZE 35  |
      +-----------------+--------------+
      | CHUNK 1                        |
      +--------------------------------+-----------+
      | CHUNK 2                                    |
      +-------------------------------------+------+
      | CHUNK 3                             |
      +-------------------------------------+


#######################################################################


-----------------------------------
The modding perspective: rebuilders
-----------------------------------


Usually the purposes of obtaining a resource from an archive are the
following:

- using the resource
  a typical example is the music of a game to listen to on your own
  computer or images to use as wallpaper

- modding the same game
  editing the extracted content and reinjecting it back in the archive
  or just rebuilding the whole archive from scratch

- using the resource obtained from a game in another different game
  reinjecting the resource or rebuilding the archive of another game

In the last two cases the user needs a way to force the game to load a
non archived file, or to rebuild the archive, or to reinject the file
in the original archive:

- Usage of non archived resources
  in some cases it's possible to use the extracted resources in the
  game by default because the developers left this feature enabled for
  debugging or because the usage of archives was meant only to improve
  loading performance.
  In some other cases it's necessary to activate a specific option
  from a configuration file or command-line (like in Need for Speed
  Shift), while in other situations there is no way to force the game
  to read the extracted files.

- Archive rebuilding
  this is the best solution but unfortunately it's also the most
  expensive because extracting a file is completely different from
  rebuilding the whole archive.
  For rebuilding an archive it's necessary to know "all" the fields
  used in the TOC and it's not possible to ignore most of them as we
  did with extraction; additionally, creating a rebuilder requires
  more effort and programming work than writing an extractor.

- Reinjecting/Reimporting
  this is the way that requires the minimal effort and in most cases
  it can even be implemented automatically, just like I do in my
  QuickBMS tool that allows an extraction script to be used also in
  reimport mode without any change.
  The downsides of this method are:
  - no CRC/checksum/hash recalculation if used in the archive. Some
    work-arounds exist, like automatically recalculating and
    overwriting the CRC field, but this is not possible if the
    algorithm is not a common one. Some games ignore the different
    CRC, others will reject the edited file
  - in the past there was a limitation on the size of the new files,
    which has been bypassed by a new reimport method (reimport2), but
    some archives are still incompatible if they use sequential
    offsets
  - in case of custom encryption and compression algorithms it's
    possible that no code exists to re-encrypt or re-compress the
    data (this is valid for the rebuilding solution too)
  - in some cases it's possible that the new version of the archive is
    not fully compatible with the game, maybe the game checks the hash
    of the archive before using it or something else
  Anyway it's worth noting that the benefits of this solution are
  incredible for both the writer of the script and the modder, and
  many mods, cheats and customizations have been created in this way.

If the archive uses asymmetric cryptography and/or digital signatures
it's not possible to perform rebuilding or reimporting due to the lack
of the private key. An example is the GameGuard files.
In these cases the only solution is modifying the game executable to
remove the check of the signature or to use a known private/public key
pair generated by us.


#######################################################################


-----------------------------
Sources used by this document
-----------------------------


All the material that has been evaluated for creating this document
comes from my personal research available on my personal website.

The main source consists of the scripts for my QuickBMS program,
started in 2009:

  http://quickbms.aluigi.org

The secondary source is the set of stand-alone tools available in my
Research page:

  http://papers.aluigi.org

The last source used is my collection of archive passwords:

  http://aluigi.org/papers.htm#info

The scripts and tools selected for the statistics are those that work
on the files of the games, so any tool related to the encryption of
network data, the decryption of content generated by the user
(savegames) or non-game related stuff has not been included.

Evaluated scripts:
  about 810 (this document was originally created in 2013), these
  scripts are too many to be listed here.
  They cover many types of games of big and small vendors, of any
  platform like Xbox, Xbox 360, PC, PS3, PS2, PSP, Wii and others.
  They even cover multiple versions of the same file format.
  So it's possible to see the script for Crysis 2 and at the same time
  the one for games whose name I have never heard.

Evaluated tools:
  rfactorgmdec, rfactordec, wtcced, hldlldec, halomus, rdbigext,
  scfdec, umodext, unxwb, uniginex, mmviewer_dumper, osrwdec,
  molebox2ext, sdgundamext, tdudec, partydec, ttarchext, asurauncmp,
  ssaext, canhelpaczip, sgpdec, uodemoext, egoxext, cauldronext,
  bsrdec, motorm4xdec, pyroblazerext, worldshiftext, ssnam67ext,
  msmixext, xsoext, ysext, orkdec, ps2ext, vitalext, hedwadext,
  borpak, ccftfext, fsbext, nexusext, tnt2zip, cbfext, virtdec,
  unvirt, zanzapak, gguardfile, rtwsndext, manext, lin2ed.

Note that many scripts/tools work on multiple games and in some cases
two or more scripts may overlap (different script but same game), so
for producing these statistics I counted just the scripts/tools and
not each single game they cover, because it's hard if not impossible
to know what games are covered by a specific engine or if a file
format is used in other games.

Note also that some scripts use more than one algorithm, that's why
the sum of the entries is bigger than the number of scripts and tools
which have been evaluated.

All the information was collected on 13 Apr 2013 with the manual and
automatic checking of each source.

If you are interested in other external sources (to which I contribute
too) take a look at the ZenHAX forum:

  https://zenhax.com

Regarding the results shown below, please note that they have been
obtained automatically by running a program over all the scripts
available on my website, so some results may be redundant (for example
an algorithm used multiple times in the same script or two versions of
the same script) and some information may be missing (some scripts are
difficult to parse automatically).
So PLEASE do not take these results too seriously.


#######################################################################


-----------------------------------
Results: Encryption and Obfuscation
-----------------------------------


+-----------------------------------------------------------+---------+
| no encryption                                             | 676     |
+-----------------------------------------------------------+---------+
| XOR with one byte                                         | 44      |
+-----------------------------------------------------------+---------+
| XOR with key (multiple bytes)                             | 53      |
+-----------------------------------------------------------+---------+
| rotate (add/sub) with one byte                            | 4       |
+-----------------------------------------------------------+---------+
| rotate (add/sub) with key (multiple bytes)                | 12      |
+-----------------------------------------------------------+---------+
| AES                                                       | 18      |
+-----------------------------------------------------------+---------+
| Blowfish                                                  | 10      |
+-----------------------------------------------------------+---------+
| DES/3DES                                                  | 3       |
+-----------------------------------------------------------+---------+
| charset / substitution table                              | 3       |
+-----------------------------------------------------------+---------+
| incremental XOR                                           | 9       |
+-----------------------------------------------------------+---------+
| RC4                                                       | 12      |
+-----------------------------------------------------------+---------+
| TEA/XTEA/XXTEA                                            | 4       |
+-----------------------------------------------------------+---------+
| custom encryption / obfuscation                           | 48      |
+-----------------------------------------------------------+---------+

+-----------------------------------------------------------+---------+
| password protected archives (mainly ZIP, RAR and FSB)     | 53      |
+-----------------------------------------------------------+---------+


#######################################################################


--------------------
Results: Compression
--------------------


+-----------------------------------------------------------+---------+
| no compression                                            | 500     |
+-----------------------------------------------------------+---------+
| zlib                                                      | 188     |
+-----------------------------------------------------------+---------+
| LZO                                                       | 20      |
+-----------------------------------------------------------+---------+
| deflate                                                   | 36      |
+-----------------------------------------------------------+---------+
| LZMA                                                      | 20      |
+-----------------------------------------------------------+---------+
| Microsoft XMem (LZX)                                      | 27      |
+-----------------------------------------------------------+---------+
| LZSS                                                      | 13      |
+-----------------------------------------------------------+---------+
| gzip                                                      | 10      |
+-----------------------------------------------------------+---------+
| bzip2                                                     | 9       |
+-----------------------------------------------------------+---------+
| custom / proprietary / less known                         | 41      |
+-----------------------------------------------------------+---------+


#######################################################################


------------------
Results: Structure
------------------


Sorry, not available yet.

+-----------------------------------------------------------+---------+
| Index table                                               | ?       |
+-----------------------------------------------------------+---------+
| Sequential files                                          | ?       |
+-----------------------------------------------------------+---------+
| Chunks                                                    | ?       |
+-----------------------------------------------------------+---------+


#######################################################################


---------------------
Notes and information
---------------------


During the reverse engineering of these file formats some interesting
things have been noticed.

In some cases the target platform makes the difference due to possible
in-hardware optimizations or the endianness of the CPU.
For example, on Xbox 360 it's quite common to see the Microsoft LZX
algorithm (XMemCompress) in use in place of the zlib used for the same
games on other platforms, and it's also common to see the archives
packed in big endian instead of the little endian of the PC versions.

Another interesting point is about the version of the file formats
because some of them (like the MAS one for the ISI Gmotor engine) have
existed for many years and have been used in many games, with the
result of creating many versions very different from each other.
This is caused not only by the enhancement of the format over the
years but mainly by the desire of different developers to customize
the format they adopted.

Games like those developed by Simbin use common archives (like the MAS
one mentioned above) with an additional layer of encryption that has
been updated game after game, trying to make life harder for the
maintainers of the decryption tools.
This is valid also for the Telltale Games archives, in which these
continuous changes lasted various years for various versions.

In other cases a more complex and custom encryption algorithm has been
added after the developers became aware of the existence of tools for
decrypting and extracting the content of the archives, a recent
example is Farming Simulator 2013 1.4 beta.

The most common compression algorithms are the zlib and deflate ones;
note that zlib is just a deflate stream with a header and an Adler-32
checksum, so basically they are the same thing.
This algorithm is used in a lot of games and it's also the easiest to
identify because all the job can be performed with programs like
offzip that scan the whole archive finding the zlib data (thanks to
its checksum that avoids false positives) and returning the offset
plus the compressed and uncompressed size, which can be used to
identify the index table in the archive.

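The scanning idea can be sketched with the zlib module of Python. This is not the code of offzip, just a minimal illustration of the same concept: the function name `scan_zlib` and the sample data are assumptions made for the example.

```python
import zlib

def scan_zlib(data):
    """Return (offset, zsize, size) for every zlib stream found in data."""
    found = []
    pos = 0
    while pos < len(data) - 1:
        cmf, flg = data[pos], data[pos + 1]
        # cheap filter on the 2-byte zlib header: deflate method (8)
        # and header value divisible by 31
        if (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0:
            d = zlib.decompressobj()
            try:
                out = d.decompress(data[pos:])
            except zlib.error:
                out = None   # corrupted data or wrong checksum
            if out is not None and d.eof:
                # the bytes after the end of the stream are left in
                # unused_data, so the compressed size can be derived
                zsize = len(data) - pos - len(d.unused_data)
                found.append((pos, zsize, len(out)))
                pos += zsize
                continue
        pos += 1
    return found

first = zlib.compress(b"hello world" * 10)
second = zlib.compress(b"abc" * 50)
sample = b"HDR!" + first + b"\x00" * 7 + second + b"END"
for offset, zsize, size in scan_zlib(sample):
    print(f"offset {offset:08x}  zsize {zsize}  size {size}")
```

Searching the archive for the offsets and sizes returned by the scan is often the quickest way to locate the index table and to understand the layout of its entries.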
On the encryption and obfuscation side the most used is without doubt
the classical and simple XOR solution, followed by the custom and
proprietary solutions that go from simple obfuscations to the
customization of known algorithms and even the implementation of
algorithms never seen online.

The password protected archives are a lot but they rely on known file
formats like ZIP, RAR and FMOD FSB, so I have preferred to keep them
out of the final considerations.
Why do developers opt for this solution?
Because there are libraries already available to handle these known
archives, and a simple password is enough to try to keep modders out.

When a researcher encounters a custom encryption or compression
algorithm there are usually the following ways to solve the puzzle:

- try to reverse engineer the compiled algorithm into a higher level
  language like C or others
- use binary to C/pseudo code converters like IDA Pro or REC and then
  fix the resulting code (it may be a painful process)
- dump the whole function and fix it where necessary; depending on the
  interest in the game and the complexity of the algorithm, usually
  this is a very good compromise
- if you are very lucky the game uses an external dll that can be used
  to perform the same tasks from any custom tool

As already said, remember that this document is based ONLY on the work
publicly available on my website, so it doesn't cover other game
extractors written by other people or the scripts for QuickBMS written
by users in the community (whom I personally thank for their feedback
and support).


#######################################################################