#######################################################################


Overview of game file formats and archives
by Luigi Auriemma
e-mail: aluigi@autistici.org
web:    aluigi.org

Date: 13 Apr 2013   collected statistics and writing
      13 Dec 2018   revision and release


#######################################################################

------------------------
Introduction and formats
------------------------


Games use a lot of different file types: textures, sounds, music, 3D
models, AI scripts, animation scripts, configuration files, images,
videos and so on.

Instead of leaving those files scattered in the game's folder,
developers prefer to store them in one or more archives, mainly for
the following reasons:

- performance
  accessing one single file (the whole archive) requires fewer
  resources than opening and closing every single resource file, which
  results in shorter loading times and lower memory and disk usage (no
  padding to the disk allocation unit for each file and no continuous
  opening of different files).

- content protection
  these archives often contain encrypted content because game
  developers and publishers want to prevent its use for modding or
  for personal purposes (for example listening to the soundtrack) and,
  obviously, its embedding in other commercial projects.
  In this case the adopted solutions range from the simple obfuscation
  of the content by XORing the data with a fixed byte or key to
  customized encryption algorithms.

- saving space
  many archives use compression algorithms and other mechanisms to
  reduce the disk space used by the game; this was quite common in the
  past and it's still necessary nowadays, when games occupy gigabytes
  of space.

These solutions can be used alone or, as often happens, combined, so
it's not rare to see an archive containing compressed and encrypted
content.
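
As a minimal sketch of the simplest obfuscation mentioned above, XOR
with a fixed byte or with a multi-byte key can be written in Python as
follows. The function name and the sample key are only illustrative
assumptions, they don't come from any specific game:

```python
from itertools import cycle


def xor_data(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the key, repeating the key as needed."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# Hypothetical sample: a one-byte key and a multi-byte key.
obfuscated = xor_data(b"OggS", b"\x5a")
restored = xor_data(obfuscated, b"\x5a")
print(restored)  # same bytes as the original input
```

Since XOR is its own inverse, the same function both obfuscates and
restores the data.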


When we are dealing with an archive or an encrypted/compressed file,
our goal is just to dump its content and only later to understand how
to use the dumped files: for example a 3D model may be opened in
software like 3ds Max and an Ogg file in a media player, while
customized formats must be converted to other formats, and so on.

The part of the procedure covered by this document is just the first
step: understanding a file format in order to extract its content
"as is".

Usually only the following parameters are necessary:

- offset of the resource, the location of the file inside the archive
  (where it begins)

- size of the resource

- optional compressed/uncompressed size, if the file has been shrunk
  with a compression algorithm

- optional name of the resource, often the original name of the
  archived file

Usually this information is stored in an index table commonly called
TOC (table of contents), which in some games is encrypted to prevent
the correct dumping of the resources, while in other formats the
resources are stored sequentially, which removes the need for an
offset field for each file.

In the next examples I will use the following words to identify some
common fields:
- OFFSET    location of the file, shown in hexadecimal (0x22 = 34)
- ZSIZE     compressed size
- SIZE      normal and uncompressed size
- FILES     number of files stored in the archive
- NAME      name of the stored file
- FILE      the content (data) of the stored file

  Example of Index table:

        +-----------------+
        | FILES         2 |
        +-----------------+
      /-| OFFSET 00000022 |
      | +-----------------+
      | | SIZE         41 |
      | +-----------------+
      | | NAME   test.txt |
      | +-----------------+
    /-+-| OFFSET 0000004b |
    | | +-----------------+
    | | | SIZE         20 |
    | | +-----------------+
    | | | NAME   blah.dat |
    | | +-----------------+-------------------------+
    | \>| FILE 1                                    |
    |   +----------------------+--------------------+
    \-->| FILE 2               |
        +----------------------+
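
The index table above can be parsed with a few lines of Python. This
is only a sketch under assumed field widths, which are NOT part of the
diagram: FILES as 16 bit, OFFSET and SIZE as 32 bit little endian, and
NAME as a fixed field of 8 bytes. These widths were chosen just
because they make the first file begin at 0x22 like in the example;
the helper names are hypothetical:

```python
import struct

ENTRY = struct.Struct("<II8s")  # OFFSET, SIZE, NAME (assumed widths)


def read_index_table(archive: bytes):
    """Return a list of (name, offset, size) read from the TOC."""
    (files,) = struct.unpack_from("<H", archive, 0)  # FILES (assumed 16 bit)
    entries = []
    pos = 2
    for _ in range(files):
        offset, size, raw_name = ENTRY.unpack_from(archive, pos)
        pos += ENTRY.size
        entries.append((raw_name.rstrip(b"\0").decode("ascii"), offset, size))
    return entries


def extract_all(archive: bytes):
    """Dump every file "as is" using only OFFSET and SIZE."""
    return {name: archive[offset:offset + size]
            for name, offset, size in read_index_table(archive)}


def build_sample() -> bytes:
    """Build a tiny archive with the same values as the diagram."""
    toc = struct.pack("<H", 2)
    toc += ENTRY.pack(0x22, 41, b"test.txt")
    toc += ENTRY.pack(0x4B, 20, b"blah.dat")
    return toc + b"A" * 41 + b"B" * 20


for name, offset, size in read_index_table(build_sample()):
    print(f"{offset:08x} {size:3d} {name}")
```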


  Example of sequential files:

        +-----------------+
        | SIZE         41 |
        +-----------------+
        | NAME   test.txt |
        +-----------------+-------------------------+
        | FILE 1                                    |
        +-----------------+-------------------------+
        | SIZE         20 |
        +-----------------+
        | NAME   blah.dat |
        +-----------------+----+
        | FILE 2               |
        +----------------------+
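
A sequential archive like the one above has no offsets, so the parser
simply walks the data from the beginning to the end. The following
Python sketch assumes a 32 bit little endian SIZE and a fixed NAME
field of 8 bytes; both widths are assumptions made only for this
example:

```python
import struct

HEADER = struct.Struct("<I8s")  # SIZE, NAME (assumed field widths)


def read_sequential(archive: bytes):
    """Return a list of (name, data), one header + file at a time."""
    files = []
    pos = 0
    while pos < len(archive):
        size, raw_name = HEADER.unpack_from(archive, pos)
        pos += HEADER.size
        data = archive[pos:pos + size]
        if len(data) != size:
            raise ValueError("truncated archive")
        files.append((raw_name.rstrip(b"\0").decode("ascii"), data))
        pos += size  # the next header starts right after this file
    return files


# Sample archive with the same values as the diagram.
sample = (HEADER.pack(41, b"test.txt") + b"A" * 41 +
          HEADER.pack(20, b"blah.dat") + b"B" * 20)
for name, data in read_sequential(sample):
    print(name, len(data))
```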


  Example of sequential table followed by sequential files:

        +-----------------+
        | FILES         2 |
        +-----------------+
        | SIZE         41 |
        +-----------------+
        | NAME   test.txt |
        +-----------------+
        | SIZE         20 |
        +-----------------+
        | NAME   blah.dat |
        +-----------------+-------------------------+
        | FILE 1                                    |
        +----------------------+--------------------+
        | FILE 2               |
        +----------------------+
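
With a sequential table followed by sequential files the offsets are
not stored, so they must be derived: the first file begins right after
the table and each following file begins where the previous one ends.
A Python sketch, assuming a 16 bit FILES field, a 32 bit SIZE and a
fixed NAME field of 8 bytes (hypothetical widths, little endian):

```python
import struct

ENTRY = struct.Struct("<I8s")  # SIZE, NAME (assumed field widths)


def read_sequential_table(archive: bytes):
    """Return a list of (name, offset, size) with calculated offsets."""
    (files,) = struct.unpack_from("<H", archive, 0)  # FILES (assumed 16 bit)
    pos = 2
    table = []
    for _ in range(files):
        size, raw_name = ENTRY.unpack_from(archive, pos)
        pos += ENTRY.size
        table.append((raw_name.rstrip(b"\0").decode("ascii"), size))

    offset = pos  # the data area starts right after the last entry
    result = []
    for name, size in table:
        result.append((name, offset, size))
        offset += size
    return result


sample = (struct.pack("<H", 2) +
          ENTRY.pack(41, b"test.txt") + ENTRY.pack(20, b"blah.dat") +
          b"A" * 41 + b"B" * 20)
for name, offset, size in read_sequential_table(sample):
    print(f"{offset:08x} {size:3d} {name}")
```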


Some variants and customizations:

- relative file offsets: the absolute offset (base) to which the
  relative file offsets must be added is usually specified directly
  at the beginning of the archive, or it is calculated before or after
  having read the whole TOC:
  - before: this is possible only with fixed-size file entries, for
    example with filenames having a maximum length:
    BASE_OFF = offset_first_entry + (entries * sizeof(entry))
  - after: it's necessary to parse all the entries before knowing
    this offset

- sector offset: quite common in PlayStation games, where the
  specified offsets must be multiplied by 2048 (size of a disc sector)

- TOC at the end: the TOC is often located at the beginning of the
  archive but some games prefer to put it at the end so that the
  archive can be updated with new content in the future; the usual
  methods are:
  - a header at the beginning containing the offset of the TOC
  - a few bytes of information at the end containing the TOC offset,
    or just the size of the TOC from which the offset can be derived

- nested tree: usually the filenames already include the full path,
  like models\character\chara_1.mdl, but sometimes the whole directory
  tree is stored in the archive (folders and files) and it must be
  parsed recursively

- compressed TOC: sometimes the TOC itself is compressed

- chunked files: see later

- TOC in a separate file: usually called "index file", a small file
  that contains all the information about the files archived in a
  "data file"; usually they share the same name with a different
  extension, for example: archive.idx and archive.dat

- ZIP format: sometimes games just use a ZIP archive for containing
  their files; some games implement a custom version of the ZIP
  format, as happens with those that add a new compression algorithm
  (Forza Motorsport and Dark Sector), those that use some different
  fields or those that don't use the classical "PK" magic values for
  the various sections of the ZIP archive.
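
The offset arithmetic behind some of the variants listed above can be
sketched in Python as follows. All the function names are
hypothetical, and the footer layout used for the "TOC at the end" case
(TOC size stored in the last 4 bytes, little endian, with the TOC
located just before it) is only an assumption for the example:

```python
import struct

SECTOR_SIZE = 2048  # size of a disc sector


def base_offset(offset_first_entry: int, entries: int, entry_size: int) -> int:
    """BASE_OFF for fixed-size entries, known before parsing the TOC."""
    return offset_first_entry + entries * entry_size


def absolute_offset(base_off: int, relative_off: int) -> int:
    """Turn a relative file offset into an absolute one."""
    return base_off + relative_off


def sector_to_offset(sector: int) -> int:
    """Sector offsets must be multiplied by the sector size."""
    return sector * SECTOR_SIZE


def toc_offset_from_footer(archive: bytes) -> int:
    """Derive the TOC offset from the TOC size stored at the very end."""
    footer_pos = len(archive) - 4
    (toc_size,) = struct.unpack_from("<I", archive, footer_pos)
    return footer_pos - toc_size


# Hypothetical archive: 100 bytes of data, 24 bytes of TOC, 4 bytes footer.
sample = b"D" * 100 + b"T" * 24 + struct.pack("<I", 24)
print(toc_offset_from_footer(sample))
```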


There are even archives whose format is really complex because they
don't store the original files: their content consists of "resources"
ready to be used directly by the game engine, so more steps are needed
to reach our goal.

If an archive uses a block cipher like AES or Blowfish there is also
a third size component to take into consideration: the block-aligned
size of the resource.
If this value is missing, usually it's calculated automatically or the
game uses EVP_CipherFinal of OpenSSL (which takes care of the padding
of the last block) or stream modes like CTR.

  Example of stored file encrypted with a block cipher:

        +-----------------+
        | OFFSET 00000022 |
        +-----------------+
        | ZSIZE        41 |     compressed size
        +-----------------+
        | SIZE        180 |     uncompressed size
        +-----------------+
        | XSIZE        48 |     archive size (aligned)
        +-----------------+
        | NAME   test.txt |
        +-----------------+--------------------------------+
        | FILE 1 (compressed and encrypted)        PADDING |
        +--------------------------------------------------+
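
When the aligned size is not stored, it can be calculated by rounding
the size up to the next multiple of the cipher block size (16 bytes
for AES, 8 bytes for Blowfish). A small Python sketch with the values
of the example above; the function name is hypothetical:

```python
AES_BLOCK = 16      # AES block size in bytes
BLOWFISH_BLOCK = 8  # Blowfish block size in bytes


def aligned_size(size: int, block_size: int) -> int:
    """Round size up to the next multiple of block_size."""
    return (size + block_size - 1) // block_size * block_size


zsize = 41                              # compressed size from the TOC
xsize = aligned_size(zsize, AES_BLOCK)  # bytes to read and decrypt
padding = xsize - zsize                 # bytes to discard after decryption
print(xsize, padding)
```

The extraction order in this example is: read and decrypt XSIZE bytes,
keep only the first ZSIZE bytes, then decompress them to SIZE bytes.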


A solution that is often used to save space is dividing the archived
files into small parts called "chunks".
The advantage of this technique is that each chunk is compressed only
if its compressed size is lower than the uncompressed one; the
disadvantage is that small chunks can't take advantage of the most
advanced compression algorithms because there is not enough data to
fill and exploit the dictionary/window.
Usually the decompressed size of the chunks is not specified because
it's hardcoded in the game.
A chunk whose compressed size is zero or equal to the decompressed
chunk size is stored "as-is" without compression.

  Example of chunk based file:

        +-----------------+
        | OFFSET 00000022 |
        +-----------------+
        | SIZE        180 |
        +-----------------+
        | NAME   test.txt |
        +-----------------+
        | CHUNKS        3 |
        +-----------------+
        | CHUNK ZSIZE  30 |     * let's say CHUNK SIZE is 64
        +-----------------+
        | CHUNK ZSIZE  42 |
        +-----------------+
        | CHUNK ZSIZE  35 |
        +-----------------+--------------+
        | CHUNK 1                        |
        +--------------------------------+-----------+
        | CHUNK 2                                    |
        +-------------------------------------+------+
        | CHUNK 3                             |
        +-------------------------------------+
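
A chunk based file can be rebuilt with a loop like the following
Python sketch. It assumes zlib as compression algorithm and a
hardcoded CHUNK SIZE of 64 like in the example; the pack_chunks helper
is hypothetical and exists only to create test data:

```python
import zlib

CHUNK_SIZE = 64  # hardcoded in our hypothetical game


def unpack_chunks(data: bytes, chunk_zsizes, size: int) -> bytes:
    """Rebuild a file of `size` bytes from its consecutive chunks."""
    out = bytearray()
    pos = 0
    for zsize in chunk_zsizes:
        # The last chunk may be shorter than CHUNK_SIZE.
        expected = min(CHUNK_SIZE, size - len(out))
        if zsize == 0 or zsize == expected:
            out += data[pos:pos + expected]  # stored "as-is"
            pos += expected
        else:
            out += zlib.decompress(data[pos:pos + zsize])
            pos += zsize
    if len(out) != size:
        raise ValueError("unexpected decompressed size")
    return bytes(out)


def pack_chunks(payload: bytes):
    """Compress each chunk only if that makes it smaller."""
    zsizes, body = [], bytearray()
    for i in range(0, len(payload), CHUNK_SIZE):
        raw = payload[i:i + CHUNK_SIZE]
        packed = zlib.compress(raw)
        chosen = packed if len(packed) < len(raw) else raw
        zsizes.append(len(chosen))
        body += chosen
    return zsizes, bytes(body)


payload = b"abc" * 60  # 180 bytes, so 3 chunks like in the diagram
zsizes, body = pack_chunks(payload)
print(len(zsizes), unpack_chunks(body, zsizes, len(payload)) == payload)
```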


#######################################################################

-----------------------------------
The modding perspective: rebuilders
-----------------------------------


Usually the purposes of obtaining a resource from an archive are the
following:

- using the resource
  a typical example is the music of a game, to listen to on one's own
  computer, or images to use as wallpaper

- modding the same game
  editing the extracted content and reinjecting it into the archive,
  or just rebuilding the whole archive from scratch

- using the resource obtained from a game in a different game
  reinjecting the resource or rebuilding the archive of another game

In the last two cases the user needs a way to force the game to load a
non-archived file, to rebuild the archive or to reinject the file into
the original archive:

- Usage of non-archived resources
  in some cases it's possible to use the extracted resources in the
  game by default, because the developers left this feature enabled
  for debugging or because the usage of archives was meant only to
  improve loading performance.
  In other cases it's necessary to activate a specific option from
  a configuration file or the command-line (like in Need for Speed
  Shift), while in other situations there is no way to force the game
  to read the extracted files.

- Archive rebuilding
  this is the best solution but unfortunately it's also the most
  expensive, because rebuilding the whole archive is a completely
  different job from extracting a file.
  For rebuilding an archive it's necessary to know "all" the fields
  used in the TOC and it's not possible to ignore most of them as we
  did with extraction; additionally, creating a rebuilder requires
  more effort and programming work than writing an extractor.

- Reinjecting/Reimporting
  this is the way that requires the least effort and in most cases
  it can even be implemented automatically, just like I do in my
  QuickBMS tool that allows an extraction script to be used also in
  reimport mode without any change.
  The downsides of this method are:

  - no recalculation of the CRC/checksum/hash, if used in the archive;
    there are some work-arounds, like automatically recalculating and
    overwriting the CRC field, but this is not possible if the
    algorithm is not a common one; some games ignore a mismatching
    CRC, others will reject the edited file

  - in the past there was a limitation on the size of the new files,
    which has been bypassed by a new reimport method (reimport2), but
    some archives are still incompatible if they use sequential
    offsets

  - in case of custom encryption and compression algorithms it's
    possible that no code exists to re-encrypt or re-compress the
    data (this is valid for the rebuilding solution too)

  - in some cases it's possible that the new version of the archive is
    not fully compatible with the game, maybe because the game checks
    the hash of the archive before using it or for other reasons

  Anyway it's worth noting that the benefits of this solution are
  incredible for both the writer of the script and the modder, and
  many mods, cheats and customizations have been created in this way.
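
The general idea of reinjecting can be sketched in Python as follows.
This is NOT the code of QuickBMS, just an illustration under assumed
conditions: a hypothetical TOC entry made of OFFSET, SIZE and CRC32
(32 bit little endian each), uncompressed data, and a new file that
must fit in the space of the original one:

```python
import struct
import zlib

ENTRY = struct.Struct("<III")  # OFFSET, SIZE, CRC32 (assumed layout)


def reinject(archive: bytearray, entry_pos: int, new_data: bytes) -> None:
    """Overwrite a stored file in place and update its TOC entry."""
    offset, old_size, _old_crc = ENTRY.unpack_from(archive, entry_pos)
    if len(new_data) > old_size:
        raise ValueError("new file is bigger than the original one")
    # Keep the archive layout unchanged: pad the unused space with zeroes.
    padding = b"\0" * (old_size - len(new_data))
    archive[offset:offset + old_size] = new_data + padding
    # Update SIZE and recalculate the (common) CRC32 of the new content.
    ENTRY.pack_into(archive, entry_pos, offset, len(new_data),
                    zlib.crc32(new_data))


original = b"HELLOWORLD"
archive = bytearray(ENTRY.pack(ENTRY.size, len(original),
                               zlib.crc32(original)) + original)
reinject(archive, 0, b"hi")
print(ENTRY.unpack_from(archive, 0))
```

If the checksum algorithm were a custom one, the last step would not
be possible without reimplementing it.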

If the archive uses asymmetric cryptography and/or a digital signature
it's not possible to perform rebuilding or reimporting due to the lack
of the private key. An example is the GameGuard files.
In these cases the only solution is modifying the game executable to
remove the signature check or to make it use the public key of a
private/public key pair generated by us.


#######################################################################

-----------------------------
Sources used by this document
-----------------------------


All the material that has been evaluated for creating this document
comes from my personal research available on my personal website.

The main source is the collection of scripts for my QuickBMS program,
started in 2009: http://quickbms.aluigi.org

The secondary source is the set of stand-alone tools available in my
Research page: http://papers.aluigi.org

The last source used is my collection of archive passwords:
http://aluigi.org/papers.htm#info

The scripts and tools selected for the statistics are those that work
on the files of the games, so any tool related to the encryption of
network data, the decryption of content generated by the user
(savegames) or non-game related stuff has not been included.

Evaluated scripts:
    about 810 (this document was originally created in 2013), these
    scripts are too many to be listed here.
    They cover many types of games from big and small vendors, on any
    platform like Xbox, Xbox 360, PC, PS3, PS2, PSP, Wii and others.
    They even cover multiple versions of the same file format.
    So it's possible to find the script for Crysis 2 next to the one
    for games whose name I had never heard before.

Evaluated tools:
    rfactorgmdec, rfactordec, wtcced, hldlldec, halomus, rdbigext,
    scfdec, umodext, unxwb, uniginex, mmviewer_dumper, osrwdec,
    molebox2ext, sdgundamext, tdudec, partydec, ttarchext, asurauncmp,
    ssaext, canhelpaczip, sgpdec, uodemoext, egoxext, cauldronext,
    bsrdec, motorm4xdec, pyroblazerext, worldshiftext, ssnam67ext,
    msmixext, xsoext, ysext, orkdec, ps2ext, vitalext, hedwadext,
    borpak, ccftfext, fsbext, nexusext, tnt2zip, cbfext, virtdec,
    unvirt, zanzapak, gguardfile, rtwsndext, manext, lin2ed.

Note that many scripts/tools work on multiple games and in some cases
two or more scripts may overlap (different scripts but same game), so
for these statistics I counted just the scripts/tools and not each
single game they cover, because it's hard if not impossible to know
which games are covered by a specific engine or if a file format is
used in other games.

Note also that some scripts use more than one algorithm, that's why the
sum of the entries is bigger than the number of scripts and tools
which have been evaluated.

All the information was collected on 13 Apr 2013 with the manual and
automatic checking of each source.

If you are interested in other external sources (to which I contribute
too) take a look at the ZenHAX forum: https://zenhax.com

Regarding the results shown below, please note that they have been
obtained automatically by running a program over all the scripts
available on my website, so some results may be redundant (for example
an algorithm used multiple times in the same script, or two versions
of the same script) and some information may be missing (some scripts
are difficult to parse automatically).
So PLEASE do not take these results too seriously.



#######################################################################

-----------------------------------
Results: Encryption and Obfuscation
-----------------------------------


+-----------------------------------------------------------+---------+
| no encryption                                             |     676 |
+-----------------------------------------------------------+---------+
| XOR with one byte                                         |      44 |
+-----------------------------------------------------------+---------+
| XOR with key (multiple bytes)                             |      53 |
+-----------------------------------------------------------+---------+
| rotate (add/sub) with one byte                            |       4 |
+-----------------------------------------------------------+---------+
| rotate (add/sub) with key (multiple bytes)                |      12 |
+-----------------------------------------------------------+---------+
| AES                                                       |      18 |
+-----------------------------------------------------------+---------+
| Blowfish                                                  |      10 |
+-----------------------------------------------------------+---------+
| DES/3DES                                                  |       3 |
+-----------------------------------------------------------+---------+
| charset / substitution table                              |       3 |
+-----------------------------------------------------------+---------+
| incremental XOR                                           |       9 |
+-----------------------------------------------------------+---------+
| RC4                                                       |      12 |
+-----------------------------------------------------------+---------+
| TEA/XTEA/XXTEA                                            |       4 |
+-----------------------------------------------------------+---------+
| custom encryption / obfuscation                           |      48 |
+-----------------------------------------------------------+---------+

+-----------------------------------------------------------+---------+
| password protected archives (mainly ZIP, RAR and FSB)     |      53 |
+-----------------------------------------------------------+---------+


#######################################################################

--------------------
Results: Compression
--------------------


+-----------------------------------------------------------+---------+
| no compression                                            |     500 |
+-----------------------------------------------------------+---------+
| zlib                                                      |     188 |
+-----------------------------------------------------------+---------+
| LZO                                                       |      20 |
+-----------------------------------------------------------+---------+
| deflate                                                   |      36 |
+-----------------------------------------------------------+---------+
| LZMA                                                      |      20 |
+-----------------------------------------------------------+---------+
| Microsoft XMem (LZX)                                      |      27 |
+-----------------------------------------------------------+---------+
| LZSS                                                      |      13 |
+-----------------------------------------------------------+---------+
| gzip                                                      |      10 |
+-----------------------------------------------------------+---------+
| bzip2                                                     |       9 |
+-----------------------------------------------------------+---------+
| custom / proprietary / less known                         |      41 |
+-----------------------------------------------------------+---------+


#######################################################################

------------------
Results: Structure
------------------


Sorry, not available yet.

+-----------------------------------------------------------+---------+
| Index table                                               |       ? |
+-----------------------------------------------------------+---------+
| Sequential files                                          |       ? |
+-----------------------------------------------------------+---------+
| Chunks                                                    |       ? |
+-----------------------------------------------------------+---------+


#######################################################################

---------------------
Notes and information
---------------------


During the reverse engineering of these file formats some interesting
things have been noticed.

In some cases the target platform makes the difference due to possible
in-hardware optimizations or the endianness of the CPU.
For example, on Xbox 360 it's quite common to see the Microsoft LZX
algorithm (XMemCompress) in place of the zlib used for the same games
on other platforms, and it's also common to see the archives packed in
big endian instead of the little endian of the PC versions.
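
In practice the same field must be read with a different byte order
depending on the platform. The following Python sketch shows the
difference on an OFFSET field; the looks_big_endian heuristic (choose
the byte order that gives an offset inside the archive) is only an
assumption for the example and not a general rule:

```python
import struct


def read_u32(data: bytes, pos: int, big_endian: bool) -> int:
    """Read a 32 bit unsigned field with the requested byte order."""
    fmt = ">I" if big_endian else "<I"
    return struct.unpack_from(fmt, data, pos)[0]


def looks_big_endian(data: bytes, pos: int, archive_size: int) -> bool:
    """True if only the big endian value is a plausible offset."""
    little = read_u32(data, pos, big_endian=False)
    big = read_u32(data, pos, big_endian=True)
    return big < archive_size <= little


field = bytes.fromhex("00000022")  # OFFSET as stored on a big endian console
print(hex(read_u32(field, 0, big_endian=True)))
print(hex(read_u32(field, 0, big_endian=False)))
```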

Another interesting point is about the versions of the file formats,
because some of them (like the MAS one for the ISI Gmotor engine) have
existed for many years and have been used in many games, with the
result of creating many versions very different from each other.
This is caused not only by the enhancement of the format over the
years but mainly by the desire of different developers to customize
the adopted format.

Games like those developed by Simbin use common archives (like the MAS
one mentioned above) with an additional layer of encryption that has
been updated game after game, trying to make life harder for the
maintainers of the decryption tools.
This is valid also for the Telltale Games archives, in which these
continuous changes went on for years across various versions.

In other cases a more complex and custom encryption algorithm has been
added after the developers became aware of the existence of tools
for decrypting and extracting the content of the archives; a recent
example is Farming Simulator 2013 1.4 beta.

The most common compression algorithms are zlib and deflate; note
that zlib is just a deflate stream with a 2-byte header and a
checksum (Adler-32), so basically they are the same thing.
This algorithm is used in a lot of games and it's also the easiest
to identify, because the whole job can be performed with programs
like offzip that scan the whole archive finding the zlib data (thanks
to its checksum that avoids false positives) and returning the offset
plus the compressed and uncompressed size, which can be used to
identify the index table in the archive.
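
The scanning idea can be sketched in a few lines of Python. This is a
rough illustration and not the actual code of offzip: it tries every
position that has a plausible zlib header and accepts it only if the
whole stream decompresses correctly (the checksum at the end of the
stream is verified by the zlib library itself). The function name and
the min_size parameter are hypothetical:

```python
import zlib


def scan_zlib(data: bytes, min_size: int = 1):
    """Return a list of (offset, zsize, size) for each zlib stream found."""
    results = []
    pos = 0
    while pos + 2 <= len(data):
        cmf, flg = data[pos], data[pos + 1]
        plausible = ((cmf & 0x0F) == 8               # deflate method
                     and (cmf >> 4) <= 7             # window up to 32 KB
                     and ((cmf << 8) | flg) % 31 == 0  # header check bits
                     and not (flg & 0x20))           # no preset dictionary
        if plausible:
            decoder = zlib.decompressobj()
            try:
                out = decoder.decompress(data[pos:])
            except zlib.error:
                out = None
            if out is not None and decoder.eof and len(out) >= min_size:
                zsize = len(data) - pos - len(decoder.unused_data)
                results.append((pos, zsize, len(out)))
                pos += zsize  # continue after the stream just found
                continue
        pos += 1
    return results


first = zlib.compress(b"hello world" * 10)
second = zlib.compress(b"x" * 100)
blob = b"\0" * 5 + first + b"JUNK" + second
for offset, zsize, size in scan_zlib(blob):
    print(f"{offset:08x} {zsize} -> {size}")
```

The offsets and sizes found in this way can then be searched in the
archive to locate the fields of the index table.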

On the encryption and obfuscation side the most used is without doubt
the classical and simple XOR solution, followed by the custom and
proprietary solutions that range from simple obfuscations to the
customization of known algorithms and even the implementation of
algorithms never seen online.

There are a lot of password protected archives but they rely on known
file formats like ZIP, RAR and FMOD FSB, so I have preferred to keep
them out of the final considerations.
Why do developers opt for this solution? Because there are libraries
already available to handle these known archives and a simple
password is enough to try to keep modders out.

When a researcher encounters a custom encryption or compression
algorithm there are usually the following ways to solve the puzzle:

- try to reverse engineer the compiled algorithm and rewrite it in a
  higher level language like C or others

- use binary to C/pseudo code converters like IDA Pro or REC and then
  fix the resulting code (it may be a painful process)

- dump the whole function and fix it where necessary; depending on
  the interest in the game and the complexity of the algorithm,
  usually this is a very good compromise

- if you are very lucky the game uses an external dll that can be
  used to perform the same tasks from any custom tool

As already said, remember that this document is based ONLY on the work
publicly available on my website, so it doesn't cover other game
extractors written by other people or the scripts for QuickBMS written
by users in the community (whom I personally thank for their feedback
and support).


#######################################################################