Automatic Serialization

Sep 07, 2023

For every project, you need to store some data to reload it at a later time. The most common case, at least for a game engine, is assets: fbx files for your meshes, ttf files for your text, png, jpeg, or bmp for your images, wav for your music, etc. This is the easy case, as the formats for these won’t change, at least not very often, so you can simply write read/write functions for them and be done.

But in games, one kind of file may change drastically during development, and even during the life cycle of the product: the data for saves, levels, settings... A typical save file could change its format multiple times a day, which means new load and save functions every time, and that’s always hard to get right.

Quick problem overview

Adding a new field

Let’s imagine a very small struct that represents all we need to save for a player early in development:

#define PLAYER_DATA_VERSION 0

struct PlayerData
{
     u32 version;
     u32 gold;
     u32 gamesPlayed;
     r32 totalTimePlayed;
};

We assume here that the first field of any serialized struct has to be a version of type u32, as we have to know which version we are dealing with when loading from memory.

The most straightforward way to save and load it is as a binary file. But what if you want to add a field to it:

#define PLAYER_DATA_VERSION 1

struct PlayerData
{
     u32 version;
     u32 gold;
     u32 gamesPlayed;
     r32 totalTimePlayed;
     u16 myNewAwesomeField; //@Note new field
};

If we add it at the end, it still works fine: you copy the old part of the structure and leave the other fields at their default values. But if you want to add a field anywhere else in the structure, it just won’t work, as the data layout changed.

This is the first issue, and you can mitigate it with a rule to always add new fields at the end. That works for simple structures, but it may be difficult to enforce, and a wrong load is not easy to detect. But this is not the only issue when adding a field. Imagine this scenario instead:
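To make the difference concrete, here is a minimal sketch of a naive binary load. The struct names PlayerDataV0, PlayerDataV1, PlayerDataV1Inserted and the loadRaw helper are hypothetical and only exist for this illustration; the typedefs stand in for the u16/u32/r32 types used above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

using u16 = std::uint16_t;
using u32 = std::uint32_t;
using r32 = float;

// Layout as it was saved at version 0.
struct PlayerDataV0
{
    u32 version;
    u32 gold;
    u32 gamesPlayed;
    r32 totalTimePlayed;
};

// Version 1 with the new field appended at the end.
struct PlayerDataV1
{
    u32 version;
    u32 gold;
    u32 gamesPlayed;
    r32 totalTimePlayed;
    u16 myNewAwesomeField;
};

// Hypothetical variant of version 1 with the new field inserted in the middle.
struct PlayerDataV1Inserted
{
    u32 version;
    u16 myNewAwesomeField;
    u32 gold;
    u32 gamesPlayed;
    r32 totalTimePlayed;
};

// Naive loader (assumption: not the article's loader). It zero-initializes the
// target, then copies as many bytes as both the save and the struct can hold.
template <typename T>
T loadRaw(const void* savedBytes, std::size_t savedSize)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "raw loading only makes sense for trivially copyable structs");
    assert(savedBytes != nullptr);

    T result = {};
    std::size_t byteCount = savedSize < sizeof(T) ? savedSize : sizeof(T);
    std::memcpy(&result, savedBytes, byteCount);
    return result;
}
```

Loading a PlayerDataV0 buffer with loadRaw<PlayerDataV1> keeps gold and gamesPlayed intact and leaves myNewAwesomeField at zero. Loading the same buffer with loadRaw<PlayerDataV1Inserted> silently reads fields from the wrong offsets; on common ABIs, gold ends up holding the old gamesPlayed value.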

#define PLAYER_DATA_VERSION 1

enum CurrencyType
{
    CurrencyType_None,
    CurrencyType_Gold,
    CurrencyType_Blood,
};

struct Currency
{
     CurrencyType type;
     u32          value;
};

struct PlayerData 
{
     u32 version;
     u32 gamesPlayed;
     Currency myGold;
     r64 myNewtotalTime;
};

What happens if you modify Currency, even at the end:

#define PLAYER_DATA_VERSION 2

enum CurrencyType
{
    CurrencyType_None,
    CurrencyType_Gold,
    CurrencyType_Blood,
};

struct Currency
{
     CurrencyType type;
     u32          value;
     u64          lastTimeGained; // @Note new field   
};

struct PlayerData 
{
     u32      version;
     u32      gamesPlayed;
     Currency myGold;
     r64      myNewtotalTime;
};

You guessed it: you can’t load old versions directly anymore, as the data layout changed. You can mitigate this by not allowing other structs in your serialized struct, but it is more work, as your everyday code won’t be able to just plain copy its data into the saved structure. You’ll need a save function for it, and you’ll need to add the new fields to both structures:

#define PLAYER_DATA_VERSION 2

enum CurrencyType
{
    CurrencyType_None,
    CurrencyType_Gold,
    CurrencyType_Blood,
};

struct Currency
{
     CurrencyType type;
     u32          value;
     u64          lastTimeGained; // @Note new field   
};

struct PlayerData 
{
     u32          version;
     u32          gamesPlayed;
     CurrencyType type;
     u32          value;
     u64          lastTimeGained; // @Note new field
};

void saveCurrency(PlayerData& in, Currency& currency)
{
    in.type           = currency.type;
    in.value          = currency.value;
    in.lastTimeGained = currency.lastTimeGained;        
}

Any operation that changes the data layout will break the load function, like removing a field or just moving a field around in the struct.
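A small sketch can show the shift directly. The V1/V2 suffixes and the totalTimeShift helper are hypothetical names added for this illustration; the fields mirror the Currency example above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using u32 = std::uint32_t;
using u64 = std::uint64_t;
using r64 = double;

enum CurrencyType
{
    CurrencyType_None,
    CurrencyType_Gold,
    CurrencyType_Blood,
};

struct CurrencyV1 { CurrencyType type; u32 value; };
struct CurrencyV2 { CurrencyType type; u32 value; u64 lastTimeGained; };

struct PlayerDataV1
{
    u32        version;
    u32        gamesPlayed;
    CurrencyV1 myGold;
    r64        myNewtotalTime;
};

struct PlayerDataV2
{
    u32        version;
    u32        gamesPlayed;
    CurrencyV2 myGold;
    r64        myNewtotalTime;
};

// Number of bytes myNewtotalTime moved by when Currency grew, even though
// PlayerData itself was not edited.
std::size_t totalTimeShift()
{
    const std::size_t before = offsetof(PlayerDataV1, myNewtotalTime);
    const std::size_t after  = offsetof(PlayerDataV2, myNewtotalTime);
    assert(after >= before); // a bigger nested struct can only push the field further
    return after - before;
}
```

A non-zero result from totalTimeShift() means a raw copy of an old save would put the wrong bytes into myNewtotalTime. The exact number of bytes depends on the compiler's padding rules.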

The most common mitigation I’ve seen in game engines is to store these files in a text format (xml, json, custom). It solves the layout problem for sure, but you take a speed hit for parsing those formats (if you want to be convinced of this, go take Casey Muratori’s course on Computer Enhance). And even if you can now add, move, and remove fields anywhere in the structure, which is good, you still need to update your save/load functions any time you touch one of them. So in my experience, engines that used a text-based format still did not allow other structs inside serialized structs, to avoid this issue.
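As an illustration of that trade-off, here is a minimal sketch of a hand-written key=value text format. The format and the saveAsText/loadFromText names are assumptions made up for this example, not taken from any particular engine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

using u32 = std::uint32_t;

struct PlayerData
{
    u32 version     = 0;
    u32 gold        = 0;
    u32 gamesPlayed = 0;
};

// One "key=value" line per field. Every new field needs a new line here.
std::string saveAsText(const PlayerData& data)
{
    std::ostringstream out;
    out << "version="     << data.version     << '\n'
        << "gold="        << data.gold        << '\n'
        << "gamesPlayed=" << data.gamesPlayed << '\n';
    return out.str();
}

// Fields are matched by name, so their order in the file does not matter and
// unknown keys (removed fields) are skipped. Every new field needs a new
// branch here too. Values are assumed to be well-formed unsigned integers.
PlayerData loadFromText(const std::string& text)
{
    PlayerData data;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line))
    {
        const std::size_t separator = line.find('=');
        if (separator == std::string::npos) continue;

        const std::string key = line.substr(0, separator);
        const u32 value = static_cast<u32>(std::stoul(line.substr(separator + 1)));

        if      (key == "version")     data.version     = value;
        else if (key == "gold")        data.gold        = value;
        else if (key == "gamesPlayed") data.gamesPlayed = value;
    }
    return data;
}
```

The layout problem is gone, since a file with reordered or obsolete keys still loads, but both functions have to be edited by hand for every field, which is exactly the maintenance cost described above.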

Another issue that is not solved by the text format is versioning, which you have to handle yourself. Any time you change anything touching the serialized structure, you need to bump the version number associated with it and update the load function.

So, as a solo developer with no QA team, I wondered if I could have it all: the speed of the binary format with the flexibility of the text format, without having to do anything manually when a change happens.

So, what could help?

What I want is a system that can load any version of the struct into the newest version, ideally without adding any code when a change happens.

For this to be possible, I need some things:

The ability to iterate over my structure programmatically, also known as reflection, a language feature that is not present in C++.
Some kind of history of the structure, so you can load any previous version.
A way to detect a change to a structure, to enable automated versioning.

What I went with

Our first problem is that C++ does not give us reflection, but luckily I had already written a C++ parser for my code that generates what is needed for reflection (iterating over struct members) and enum iteration. It reads the whole source code and outputs helpers, named StructWalker, that enable iteration over the type, name, offset, … of each field. So as a matter of fact, pain point number one was already solved in my codebase. Good! On to the next.
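For readers without such a parser, here is a rough sketch of what generated reflection data could look like. The field names (name, type, offset, arrayCount, isPtr, other, size, count, members) are inferred from how the load function further down uses them; the exact layout, the trimmed MetaType enum, and the findMember helper are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using u32 = std::uint32_t;
using r32 = float;

// Trimmed-down type tag; a real generator would list every basic type.
enum MetaType { MetaType_u32, MetaType_r32, MetaType_struct, MetaType_enum };

struct StructWalker;

struct StructMember
{
    const char*         name;
    MetaType            type;
    u32                 offset;     // byte offset inside the owning struct
    u32                 arrayCount; // 1 for a plain field
    bool                isPtr;
    const StructWalker* other;      // walker of a nested struct, else nullptr
};

struct StructWalker
{
    const char*         name;
    u32                 size;
    u32                 count;
    const StructMember* members;
};

struct PlayerData
{
    u32 version;
    u32 gold;
    u32 gamesPlayed;
    r32 totalTimePlayed;
};

// What a code generator could emit for PlayerData (written by hand here).
static const StructMember playerDataMembers[] = {
    {"version",         MetaType_u32, static_cast<u32>(offsetof(PlayerData, version)),         1, false, nullptr},
    {"gold",            MetaType_u32, static_cast<u32>(offsetof(PlayerData, gold)),            1, false, nullptr},
    {"gamesPlayed",     MetaType_u32, static_cast<u32>(offsetof(PlayerData, gamesPlayed)),     1, false, nullptr},
    {"totalTimePlayed", MetaType_r32, static_cast<u32>(offsetof(PlayerData, totalTimePlayed)), 1, false, nullptr},
};

static const StructWalker playerDataWalker = {
    "PlayerData", static_cast<u32>(sizeof(PlayerData)), 4, playerDataMembers
};

// Linear search by name; returns nullptr when the struct has no such member.
const StructMember* findMember(const StructWalker& walker, const char* name)
{
    assert(name != nullptr);
    for (u32 i = 0; i < walker.count; ++i)
    {
        if (std::strcmp(walker.members[i].name, name) == 0)
        {
            return &walker.members[i];
        }
    }
    return nullptr;
}
```

With this data available at runtime, code can walk playerDataWalker.members and reach each field through its offset without knowing the struct at compile time.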

The second problem is much easier once the first one is tackled. All I have to do is write a .h file that contains the StructWalker of each version of the structure. Any time a change is detected, the new StructWalkers for this version are appended at the end of the file. In reality, we save every StructWalker that is necessary: the ones included directly, the ones included in the included ones, etc.

The third one can be done once the second is in place. You load the structs of the last version from the generated file, compare them to the ones generated for this build, and do this recursively for each field. If anything changed, bump the version and trigger the write of the new version. You need to parse the code on every build, but I was already doing that, so there is no additional cost.

Here is how it looks in the code:
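The comparison step could be sketched as follows. This is a simplified guess at the idea, not the parser's actual code: the StructMember/StructWalker layouts are trimmed down, the walkersDiffer and nextVersion names are hypothetical, and enum contents (which also trigger a new version in the real system) are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

using u32 = std::uint32_t;

enum MetaType { MetaType_u32, MetaType_u64, MetaType_struct };

struct StructWalker;

struct StructMember
{
    const char*         name;
    MetaType            type;
    u32                 offset;
    u32                 arrayCount;
    const StructWalker* other; // nested struct description, else nullptr
};

struct StructWalker
{
    u32                 size;
    u32                 count;
    const StructMember* members;
};

// True when the two descriptions do not match member for member, looking
// inside nested structs as well.
bool walkersDiffer(const StructWalker& saved, const StructWalker& current)
{
    if (saved.size != current.size || saved.count != current.count)
    {
        return true;
    }
    for (u32 i = 0; i < saved.count; ++i)
    {
        const StructMember& a = saved.members[i];
        const StructMember& b = current.members[i];
        if (std::strcmp(a.name, b.name) != 0 || a.type != b.type ||
            a.offset != b.offset || a.arrayCount != b.arrayCount)
        {
            return true;
        }
        if (a.type == MetaType_struct)
        {
            assert(a.other != nullptr && b.other != nullptr);
            if (walkersDiffer(*a.other, *b.other))
            {
                return true;
            }
        }
    }
    return false;
}

// Bump the version only when something actually changed.
u32 nextVersion(u32 lastVersion, const StructWalker& saved, const StructWalker& current)
{
    return walkersDiffer(saved, current) ? lastVersion + 1 : lastVersion;
}
```

Running nextVersion on every build with the last stored walker and the freshly generated one gives the number to write into the version #define; when it differs from the stored one, the new walkers get appended to the generated file.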

struct PlayerData // #Versionned CURRENT_DATA_VERSION
{
     u32          version;
     u32          gamesPlayed;
     CurrencyType type;
     u32          value;
};

The Versionned keyword tells my parser to generate a version file for this struct. It also generates the current version number in a #define with the identifier that follows, CURRENT_DATA_VERSION in our case. Then I have a load function that gets both StructWalkers, the current one and the one for the version of the saved file, iterates over the members, and copies every one that exists in both the old and the current structure into the new struct. A member here is mainly just an offset and a size, which means everything can just be memcpy’d into place. All this information is stored in a single .h file, Versions_<NameOfStruct>.h, which you need to include before your load function.

And my entire load function is this:

#include "Versions_PlayerData.h"

// @Note u8* dataBuffer is the saved file loaded into memory

PlayerData data = {}; // @Note destination struct; fields missing from the save stay zeroed
u32 version = *(u32*)dataBuffer; // @Robustness version better be the first member
Assert(version <= CURRENT_DATA_VERSION);
StructWalker* dataWalker    = getPlayerDataWalker(version);
StructWalker* currentWalker = &getWalker(PlayerData);

for(u32 i = 0; i < dataWalker->count; ++i)
{
    StructMember* member = dataWalker->members + i;
    for(u32 j = 0; j < currentWalker->count; ++j)
    {
        StructMember* c = currentWalker->members + j;

        if(stringsAreEqual(member->name, c->name)) // @Speed store the nameHash too, to just compare ints
        {
            u32 size = 0;
            Assert(c->type == member->type); // @Improve we should be able to change some types like u32 to r32
            Assert(c->isPtr == false);
            Assert(member->isPtr == false);
            switch(c->type)
            {
                case MetaType_bool:   size = sizeof(bool); break;
                case MetaType_u8:     size = 1; break;
                case MetaType_s8:     size = 1; break;
                case MetaType_u16:    size = 2; break;
                case MetaType_s16:    size = 2; break;
                case MetaType_u32:    size = 4; break;
                case MetaType_s32:    size = 4; break;
                case MetaType_u64:    size = 8; break;
                case MetaType_s64:    size = 8; break;
                case MetaType_r32:    size = 4; break;
                case MetaType_r64:    size = 8; break;
                case MetaType_struct: size = member->other->size; break;
                case MetaType_enum:   size = 4; break;
            }

            size *= Minimum(c->arrayCount, member->arrayCount);
            Intrinsics::memcpy((u8*)&data + c->offset, dataBuffer + member->offset, size);
            break;
        }
    }
}

As each StructWalker stores the offset of each member, you can move members around with no problem, and remove fields too, as they will just not get copied into the new struct. And everything is automatic, meaning I do not need to think about it or bump any version number. For example, adding a value to an enum included in the PlayerData struct will, at compile time, increase the version number and append all the new StructWalkers to the generated file.

Example of compilation output generating a new version
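To try the idea without the parser, here is a self-contained sketch of the same name-matching load, with hand-written walkers for two versions. Everything here is a stand-in: the simplified StructMember (name, offset, size only), the WALKER_MEMBER macro, the PlayerDataV0 layout with its legacyScore field, and the choice to stamp the loaded struct with the current version are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using u8  = std::uint8_t;
using u32 = std::uint32_t;

#define CURRENT_DATA_VERSION 1u

struct StructMember { const char* name; u32 offset; u32 size; };
struct StructWalker { u32 count; const StructMember* members; };

// Version 0, as found in an old save file.
struct PlayerDataV0 { u32 version; u32 gold; u32 legacyScore; u32 gamesPlayed; };
// Current version: legacyScore removed, bonus added, fields reordered.
struct PlayerData   { u32 version; u32 gamesPlayed; u32 bonus; u32 gold; };

#define WALKER_MEMBER(Type, field) \
    { #field, static_cast<u32>(offsetof(Type, field)), static_cast<u32>(sizeof(Type::field)) }

static const StructMember membersV0[] = {
    WALKER_MEMBER(PlayerDataV0, version),     WALKER_MEMBER(PlayerDataV0, gold),
    WALKER_MEMBER(PlayerDataV0, legacyScore), WALKER_MEMBER(PlayerDataV0, gamesPlayed),
};
static const StructMember membersV1[] = {
    WALKER_MEMBER(PlayerData, version), WALKER_MEMBER(PlayerData, gamesPlayed),
    WALKER_MEMBER(PlayerData, bonus),   WALKER_MEMBER(PlayerData, gold),
};

// Stand-in for the generated lookup; returns nullptr for unknown versions.
const StructWalker* getPlayerDataWalker(u32 version)
{
    static const StructWalker walkers[] = { {4, membersV0}, {4, membersV1} };
    return version <= CURRENT_DATA_VERSION ? &walkers[version] : nullptr;
}

bool loadPlayerData(const u8* dataBuffer, PlayerData& data)
{
    assert(dataBuffer != nullptr);
    u32 version = 0;
    std::memcpy(&version, dataBuffer, sizeof(version)); // version is the first member

    const StructWalker* saved = getPlayerDataWalker(version);
    if (saved == nullptr) return false;
    const StructWalker* current = getPlayerDataWalker(CURRENT_DATA_VERSION);

    data = PlayerData{}; // members absent from the save keep their zero default
    for (u32 i = 0; i < saved->count; ++i)
    {
        const StructMember& from = saved->members[i];
        for (u32 j = 0; j < current->count; ++j)
        {
            const StructMember& to = current->members[j];
            if (std::strcmp(from.name, to.name) != 0) continue;

            const u32 size = from.size < to.size ? from.size : to.size;
            std::memcpy(reinterpret_cast<u8*>(&data) + to.offset, dataBuffer + from.offset, size);
            break;
        }
    }
    data.version = CURRENT_DATA_VERSION; // the in-memory struct is now the current layout
    return true;
}
```

Feeding a PlayerDataV0 buffer through loadPlayerData moves gold and gamesPlayed to their new offsets, drops legacyScore, and leaves bonus at zero, with no per-version code in the loader.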

Drawbacks and Final Thoughts

This solution works well for me, as it does everything automatically and removes the mental burden of having your load break any time you change a structure. It also allows any struct to be included in the serialized one. It is a very specific solution, tailored to me, as you need to parse your own code for this to work.

It does not handle everything though. Renaming is not supported in this version: it will think the renamed field is a new one and will not copy the old field over. This could be fixed with an annotation, like a // #Rename <oldname> syntax. The second gap is changing a field’s type, meaning the old type does not match the new type. I could add automatic conversions for basic types. I did not implement either of those two improvements though, as it is not something that happens in my codebase (for now).
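Neither improvement exists in the system described here, but as a rough idea of the shape they could take, here is a hypothetical sketch. The oldName field (which a // #Rename annotation might fill in), the namesMatch and copyScalar helpers, and the choice to support only u32 to r32 conversion are all assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

using u8  = std::uint8_t;
using u32 = std::uint32_t;
using r32 = float;

enum MetaType { MetaType_u32, MetaType_r32 };

struct StructMember
{
    const char* name;
    const char* oldName; // previous name from a rename annotation, else nullptr
    MetaType    type;
    u32         offset;
};

// A saved member matches when it has the current name or the recorded old name.
bool namesMatch(const StructMember& saved, const StructMember& current)
{
    if (std::strcmp(saved.name, current.name) == 0)
    {
        return true;
    }
    return current.oldName != nullptr && std::strcmp(saved.name, current.oldName) == 0;
}

// Copies one 4-byte scalar, converting u32 to r32 when the type changed.
// Returns false for conversions this sketch does not know about.
bool copyScalar(const u8* src, MetaType srcType, u8* dst, MetaType dstType)
{
    assert(src != nullptr && dst != nullptr);
    if (srcType == dstType)
    {
        std::memcpy(dst, src, 4);
        return true;
    }
    if (srcType == MetaType_u32 && dstType == MetaType_r32)
    {
        u32 oldValue = 0;
        std::memcpy(&oldValue, src, sizeof(oldValue));
        const r32 newValue = static_cast<r32>(oldValue);
        std::memcpy(dst, &newValue, sizeof(newValue));
        return true;
    }
    return false;
}
```

In the load loop, namesMatch would replace the plain string comparison and copyScalar would replace the memcpy for basic types, turning the type assertion into a conversion attempt.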

Another issue may be that I copy every StructWalker any time anything changes, which is a lot more than really necessary. This is an addressable problem, but in my case, this struct is not changing rapidly enough for it to matter. I can see it being a problem for a bigger team, though.

My real question while implementing this was: why does no version control system that I’m aware of do this automatically? It knows all versions of the code and could create this for the programmer. I understand they all want to be language agnostic, which is an important point, but it would be really baller if a version control system could spit out a file that handles struct versioning for you, giving you a function that could load any previous version of a structure into its most recent one. If anyone reading this wants to do this, that would be awesome.

Thank you for reading,

Guillaume

Kiroxas