Using ProtoBuf for Huge Object Serialization

Development with ProtoBuf

I want to share my experience of using ProtoBuf to reduce the size of transferred and cached objects. At Distillery, we faced a problem: our Memcached objects were huge. We came close to exceeding the Memcached server storage limit, which can trigger the eviction of objects and put more load on the databases. The goal was to reduce traffic from IIS to the Memcached servers by shrinking the Memcached objects.

By default we use BeIT.Memcached (a Memcached client in C#), which uses BinaryFormatter to convert our objects to byte arrays that are then compressed (using the built-in DeflateStream) and pushed to the Memcached server.
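The compression step can be sketched with nothing but the .NET base class library. This is a minimal illustration, not the BeIT.Memcached implementation; the DeflateHelper class and its method names are hypothetical.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper showing DeflateStream compression of a byte array.
public static class DeflateHelper
{
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // Dispose the DeflateStream before reading the buffer so the
            // final compressed block is flushed into the MemoryStream.
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
            {
                deflate.Write(data, 0, data.Length);
            }
            return output.ToArray();
        }
    }

    public static byte[] Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Compression helps most on repetitive payloads, so the smaller the serializer output is to begin with, the less work this step has to do.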

To decrease the size of the objects stored in the Memcached servers, the client implements a custom serialization scheme. Objects are serialized as follows:

  • bool, byte, short, ushort, int, uint, long, ulong, float, double are serialized into their native byte representation;
  • DateTime is serialized into a long containing the Ticks value;
  • string is encoded as UTF8;
  • byte[] is stored straight without any conversion;
  • all other objects go through the regular BinaryFormatter runtime serializer.
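The dispatch described in the list above could look roughly like the following. This is a sketch under assumptions: CacheValueEncoder and TryEncode are hypothetical names, not part of BeIT.Memcached, and the real client also stores a type flag alongside the bytes, which is omitted here.

```csharp
using System;
using System.Text;

// Hypothetical encoder mirroring the type-by-type rules listed above.
public static class CacheValueEncoder
{
    // Returns the payload for natively handled types, or null when the
    // caller should fall back to a general-purpose serializer.
    public static byte[] TryEncode(object value)
    {
        switch (value)
        {
            case bool flag:      return BitConverter.GetBytes(flag);
            case byte single:    return new[] { single };
            case short s:        return BitConverter.GetBytes(s);
            case ushort us:      return BitConverter.GetBytes(us);
            case int i:          return BitConverter.GetBytes(i);
            case uint ui:        return BitConverter.GetBytes(ui);
            case long l:         return BitConverter.GetBytes(l);
            case ulong ul:       return BitConverter.GetBytes(ul);
            case float f:        return BitConverter.GetBytes(f);
            case double d:       return BitConverter.GetBytes(d);
            case DateTime stamp: return BitConverter.GetBytes(stamp.Ticks); // stored as a long
            case string text:    return Encoding.UTF8.GetBytes(text);
            case byte[] raw:     return raw;                                // stored as-is
            default:             return null;                               // "all other objects"
        }
    }
}
```

Only the default branch is affected by the choice of serializer, which is why swapping BinaryFormatter for ProtoBuf touches the "other" objects and nothing else.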

Because BinaryFormatter is sub-optimal, we chose ProtoBuf (binary serialization for .NET using Protocol Buffers) to serialize these “other” objects, which should reduce traffic thanks to its compact encoding. The main idea is to run objects through the ProtoBuf serializer before putting them into the Memcached server. We assumed that ProtoBuf output would be much smaller than BinaryFormatter output; the test results confirmed it.

ProtoBuf

ProtoBuf (Protocol Buffers) is the name of the binary serialization format used by Google for much of their data exchange. It is designed to be:

  • small in size (efficient data storage, far smaller than XML);
  • cheap to process (at the client and server);
  • platform independent (portable between different programming architectures);
  • extensible (able to add new data to old messages).

A ProtoBuf contract can be declared in two ways:

  1. Using attributes like [ProtoContract] on each class that should be cached. Example:
    [ProtoContract]
    public class ProtoBuffClass
    {
       [ProtoMember(1)]
       public int ProtoMember1 { get; set; }
    
       [ProtoMember(2)]
       public string ProtoMember2 { get; set; }
    
       [ProtoMember(3)]
       public bool ProtoMember3 { get; set; }
    }
    
  2. Using a Meta implementation (based on Reflection). Example:
    var model = ProtoBuf.Meta.RuntimeTypeModel.Default;
    var serializableTypes = Assembly.GetExecutingAssembly().GetTypes();

Each type from the current assembly would pass through several methods to prepare the correct model for ProtoBuf. We didn’t pursue this further, as we believed that using reflection for all our Memcached objects might be too expensive.
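For a single type, the attribute-free approach can be sketched as follows, assuming the protobuf-net package (the library behind the ProtoBuf namespace used here). CachedItem and RuntimeModelDemo are hypothetical names, and a separate model is created instead of RuntimeTypeModel.Default only to keep the example isolated.

```csharp
using System.IO;
using ProtoBuf.Meta;

// Hypothetical plain class with no ProtoBuf attributes.
public class CachedItem
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class RuntimeModelDemo
{
    public static CachedItem RoundTrip(CachedItem item)
    {
        var model = RuntimeTypeModel.Create();

        // false = do not infer members from attributes; tags are assigned by hand.
        model.Add(typeof(CachedItem), false)
             .Add(1, nameof(CachedItem.Id))
             .Add(2, nameof(CachedItem.Name));

        using (var stream = new MemoryStream())
        {
            model.Serialize(stream, item);
            stream.Position = 0;
            return (CachedItem)model.Deserialize(stream, null, typeof(CachedItem));
        }
    }
}
```

Registering every type in an assembly this way means walking its members by reflection at startup, which is the cost that made us prefer attributes.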

Design

Base / Derived Classes

The base class must carry a [ProtoInclude(<num>, typeof(ProtoBuff-Derived-Class))] attribute for each of its derived classes. Otherwise, all values will be NULL.

[ProtoContract]
[ProtoInclude(100, typeof(HomeFolders))]
[ProtoInclude(200, typeof(PublicFolders))]
public class Folders
{
   [ProtoMember(1)]
   public int ProtoMember1 { get; set; }

   [ProtoMember(2)]
   public int ProtoMember2 { get; set; }
}

[ProtoContract]
public class HomeFolders : Folders
{
   [ProtoMember(1)]
   public int ProtoMember4 { get; set; }
}

[ProtoContract]
public class PublicFolders : Folders
{
   [ProtoMember(1)]
   public int ProtoMember5 { get; set; }
}
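A round trip through the base type shows what ProtoInclude buys. This is a minimal sketch assuming the protobuf-net package; FolderSketch, HomeFolderSketch and InheritanceDemo are hypothetical names chosen so they do not clash with the classes above.

```csharp
using System.IO;
using ProtoBuf;

[ProtoContract]
[ProtoInclude(100, typeof(HomeFolderSketch))]
public class FolderSketch
{
    [ProtoMember(1)]
    public int Id { get; set; }
}

[ProtoContract]
public class HomeFolderSketch : FolderSketch
{
    // Tag 1 is reused safely: each type in the hierarchy has its own tag space.
    [ProtoMember(1)]
    public int Quota { get; set; }
}

public static class InheritanceDemo
{
    // Serializes and deserializes through the base type only.
    public static FolderSketch RoundTrip(FolderSketch folder)
    {
        using (var stream = new MemoryStream())
        {
            Serializer.Serialize(stream, folder);
            stream.Position = 0;
            return Serializer.Deserialize<FolderSketch>(stream);
        }
    }
}
```

Because the base class names its subtype, passing a HomeFolderSketch to RoundTrip should return a HomeFolderSketch with both Id and Quota intact, even though the caller only mentions the base type.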

Avoid duplicate property tags

Using the same number for ProtoInclude and ProtoMember will generate an error about duplicate property tags. The example below is NOT correct.

[ProtoContract]
[ProtoInclude(1, typeof(PublicFolders))]
public class Folders
{
   [ProtoMember(1)]
   public int ProtoMember1 { get; set; }
}

So you need to use a different number for ProtoInclude. Corrected example:

[ProtoContract]
[ProtoInclude(100, typeof(PublicFolders))]
public class Folders
{
   [ProtoMember(1)]
   public int ProtoMember1 { get; set; }
}

Null vs. Empty Collections

ProtoBuf does not understand the difference between a collection (List, IEnumerable etc) being null versus empty (zero count). For example, if you put these objects into the cache,

List<int> list1 = new List<int>();
List<int> list2 = null;

after deserialization, both lists will have the same value: NULL. There are two ways to solve this:

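  1. Using a private field (we are using this):
    // Element type PublicFolder is assumed here for illustration.
    [ProtoMember(12, OverwriteList = true)]
    private List<PublicFolder> _publicFolders;
    public List<PublicFolder> publicFolders
    {
       get
       {
           // Replace a null list (as left by deserialization) with an empty one.
           if (_publicFolders == null)
           {
               _publicFolders = new List<PublicFolder>();
           }
           return _publicFolders;
       }
       set
       {
           _publicFolders = value;
       }
    }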
    
  2. Using the OnDeserialized attribute:
    [ProtoMember(2, OverwriteList = true)]
    private PublicFolder[] publicFolders;
    [ProtoMember(3, OverwriteList = true)]
    private PrivateFolder[] privateFolder;
    [ProtoMember(4, OverwriteList = true)]
    private SecureFolder[] secureFolder;
     
    [OnDeserialized]
    private void HandleSerializationMismatch(StreamingContext context)
    {
       publicFolders = publicFolders ?? new PublicFolder[0];
       privateFolder = privateFolder ?? new PrivateFolder[0];
       secureFolder = secureFolder ?? new SecureFolder[0];
    }
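The behavior and one possible fix can be reproduced in a few lines. This sketch assumes the protobuf-net package and uses its own [ProtoAfterDeserialization] callback in place of [OnDeserialized]; RawHolder, SafeHolder and EmptyListDemo are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class RawHolder
{
    [ProtoMember(1)]
    public List<int> Items { get; set; }
}

[ProtoContract]
public class SafeHolder
{
    [ProtoMember(1)]
    public List<int> Items { get; set; }

    // Runs after protobuf-net has populated the object.
    [ProtoAfterDeserialization]
    private void EnsureCollections()
    {
        Items = Items ?? new List<int>();
    }
}

public static class EmptyListDemo
{
    public static T RoundTrip<T>(T value)
    {
        using (var stream = new MemoryStream())
        {
            Serializer.Serialize(stream, value);
            stream.Position = 0;
            return Serializer.Deserialize<T>(stream);
        }
    }
}
```

Round-tripping a RawHolder with an empty list should give back a null Items property, as described above, while the same round trip on SafeHolder should yield an empty list. Note that the fix turns a genuinely null list into an empty one as well, so it only suits code that treats the two cases the same.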

Things to Remember

ProtoBuf ignores a class's own properties if the class inherits from a collection and that collection contains no items. Example:

public class Folders : List<int> // element type assumed for illustration
{
   public int value1 { get; set; }
 
   public int value2 { get; set; }
}

Folders folders = new Folders() { value1 = 5, value2 = 6 };

After deserialization, the value of the Folders object will be NULL, because its item count is 0.

Classes that inherit from special collections, such as ReadOnlyCollection, are also not supported.

public class Folders : ReadOnlyCollection<int> // element type assumed for illustration
{
   public int value1 { get; set; }
 
   public int value2 { get; set; }
}
 
Folders folders = new Folders() { value1 = 5, value2 = 6 };

AllowParseableTypes

AllowParseableTypes is a global switch that determines whether types with “.ToString()” and “Parse(string)” methods should be serialized as strings. We can use this setting for types that can’t be marked with [ProtoContract] but can be parsed from a string.

static ProtoBufClient()
{
   RuntimeTypeModel.Default.AllowParseableTypes = true;
}

For example, to solve the serialization problem with the Version type:

[Serializable]
[ProtoContract(SkipConstructor = true)]
[ProtoInclude(100, typeof(PrivateFolder))]
[ProtoInclude(200, typeof(PublicFolder))]
[ProtoInclude(300, typeof(SecureFolder))]
public abstract class FolderBase : Folder
{
   ...
 
   [ProtoMember(3)]
   private string name;
   [ProtoMember(4)]
   private Owner owner;
 
   ...
}
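The effect on a Version member can be sketched in isolation, assuming the protobuf-net package. ClientInfo and ParseableDemo are hypothetical names, and the switch is set on a separate model here instead of RuntimeTypeModel.Default so the example does not change global state.

```csharp
using System;
using System.IO;
using ProtoBuf;
using ProtoBuf.Meta;

// Hypothetical contract with a member of type System.Version, which cannot
// be annotated with [ProtoContract] because it is a framework type.
[ProtoContract]
public class ClientInfo
{
    [ProtoMember(1)]
    public Version AppVersion { get; set; }
}

public static class ParseableDemo
{
    public static ClientInfo RoundTrip(ClientInfo info)
    {
        var model = RuntimeTypeModel.Create();

        // Version exposes ToString() and a static Parse(string), so with this
        // switch on it is written to the stream as a string.
        model.AllowParseableTypes = true;

        using (var stream = new MemoryStream())
        {
            model.Serialize(stream, info);
            stream.Position = 0;
            return (ClientInfo)model.Deserialize(stream, null, typeof(ClientInfo));
        }
    }
}
```

The string form is less compact than a dedicated numeric encoding, but it avoids writing a surrogate type for every framework class that happens to be parseable.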

Testing

Setup:

  1. 1605 Virtual Users (devices) used to generate load.
  2. Tested on an environment with 5 web servers and 1 database server.
  3. Large object distribution.
  4. Throughput ~200k RPM

Database Metrics

[Figures: database metrics, original vs. ProtoBuf]

Memcached Usage

[Figures: Memcached usage, original vs. ProtoBuf (two pairs of charts)]

CPU, Memory and Network

[Figures: CPU, memory and network usage, original vs. ProtoBuf (two pairs of charts)]

Conclusion

After we applied the ProtoBuf optimization, we saw a huge impact on system performance: the RPM of our product increased by 25%, the average size of items decreased from 1 MB to 200 KB (a factor of 5), and network I/O decreased from 1 Gb/s to 500 Mb/s.