Thrift and protocol buffers

I've been experimenting with thrift and protocol buffers recently. For the most part, when I need to serialize something I've been using JSON or compressed JSON. Thrift and protocol buffers have a couple of advantages, such as typed schemas with optional fields and default values, and are also supposedly faster and produce smaller output.

The test data I've been using is a simple list of hashes, nothing too complicated. Here is the protocol buffers file; the thrift file is pretty much the same thing.

package passive_dns;

message DnsRecord {
  required string key = 1;
  required string value = 2;
  required string first = 3;
  required string last = 4;
  optional string type = 5 [default = "A"];
  optional int32  ttl = 6 [default = 86400];
}

message DnsResponse {
  repeated DnsRecord records = 1;
}

The optional and default values are one of the benefits of both serialization libraries. A field whose value is the default can be left unset, so it does not need to be included in the serialized output; the reader gets the default back when it accesses the field.
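To make the idea concrete, here is a minimal sketch in plain Python using JSON, not the generated protocol buffers or thrift classes. The helper names strip_defaults and apply_defaults and the sample record are made up for illustration; the defaults are copied from the DnsRecord definition above.

```python
import json

# Defaults copied from the optional fields of DnsRecord.
DEFAULTS = {"type": "A", "ttl": 86400}

def strip_defaults(record):
    """Return a copy of record without the fields that equal their default."""
    return {k: v for k, v in record.items()
            if k not in DEFAULTS or v != DEFAULTS[k]}

def apply_defaults(record):
    """Return a copy of record with any missing optional fields filled in."""
    full = dict(DEFAULTS)
    full.update(record)
    return full

# Hypothetical sample record, shaped like DnsRecord.
record = {"key": "example.com", "value": "192.0.2.1",
          "first": "2009-01-01 00:00:00", "last": "2009-02-01 00:00:00",
          "type": "A", "ttl": 86400}

compact = strip_defaults(record)

# The stripped record is shorter on the wire but decodes to the same thing.
print(len(json.dumps(record)), len(json.dumps(compact)))
print(apply_defaults(json.loads(json.dumps(compact))) == record)
```

The saving only shows up for records that actually use the defaults, which for this data means plain A records with a one day TTL.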

I wrote a simple test program to compare thrift, protocol buffers, JSON, and compressed JSON for size and speed. The results, at least for the type of data I use, are very interesting:

5000 total records (0.745s)

get_thrift          (0.044s)
get_pb              (0.608s)

ser_thrift          (0.474s) 554953 bytes
ser_pb              (3.087s) 414862 bytes
ser_json            (0.273s) 718191 bytes
ser_yaml            (13.121s) 623191 bytes

ser_thrift_compressed (0.545s) 287617 bytes
ser_pb_compressed     (3.150s) 284297 bytes
ser_json_compressed   (0.326s) 292904 bytes
ser_yaml_compressed   (13.665s) 290993 bytes

serde_thrift        (1.289s)
serde_pb            (5.411s)
serde_json          (1.474s)
serde_yaml          (45.637s)

EDIT: Updated to include yaml results

The get_* functions are the times needed to convert the Python data structure into the classes that the library needs.

The ser_* functions are the times needed to convert (the get_* step) and serialize the Python data structure to a string.

The ser_*_compressed functions are the times needed to convert, serialize, and compress the Python data structure.

The serde_* functions are the times needed to convert, serialize, and de-serialize the Python data structure to and from a string.
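For reference, the JSON variants of those measurements can be sketched roughly as below. This is not the code from sertest.tgz: the sample data generator, the timed helper, and the choice of zlib as the compressor are all assumptions made for illustration, and the timings it prints will differ from the ones above.

```python
import json
import time
import zlib

def make_records(n):
    """Build n hypothetical records shaped like DnsRecord."""
    return [{"key": "host%d.example.com" % i,
             "value": "192.0.2.%d" % (i % 256),
             "first": "2009-01-01 00:00:00",
             "last": "2009-02-01 00:00:00",
             "type": "A",
             "ttl": 86400} for i in range(n)]

def ser_json(records):
    """Serialize the list of dicts to bytes."""
    return json.dumps(records).encode("utf-8")

def ser_json_compressed(records):
    """Serialize, then compress (zlib is an assumption here)."""
    return zlib.compress(ser_json(records))

def serde_json(records):
    """Serialize and immediately de-serialize again."""
    return json.loads(ser_json(records).decode("utf-8"))

def timed(func, records):
    """Run func once and return (elapsed seconds, result)."""
    start = time.perf_counter()
    result = func(records)
    return time.perf_counter() - start, result

records = make_records(5000)
for func in (ser_json, ser_json_compressed, serde_json):
    elapsed, result = timed(func, records)
    # Only the serialized forms have a meaningful byte size.
    size = "%d bytes" % len(result) if isinstance(result, bytes) else ""
    print("%-22s (%.3fs) %s" % (func.__name__, elapsed, size))
```

A single run per function is noisy; repeating each call a few times and taking the best time would give steadier numbers.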

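The results show that compressed JSON is both smaller and faster to produce than uncompressed thrift (292904 bytes in 0.326s versus 554953 bytes in 0.474s), and the JSON round trip of serializing plus de-serializing is only slightly slower (1.474s versus 1.289s). Compressed thrift is marginally smaller than compressed JSON (287617 bytes) but takes longer to produce (0.545s). If I converted the Python data to be (header, rows) like a csv file, rather than a flat list of dicts, the JSON output would be smaller, and likely faster to serialize.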
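The (header, rows) idea could look something like the sketch below. The function names to_header_rows and from_header_rows and the sample data are hypothetical, and the sketch assumes a non-empty list in which every record has the same set of keys.

```python
import json

def to_header_rows(records):
    """Split a list of dicts into one list of field names plus value rows."""
    header = sorted(records[0])  # assumes all records share these keys
    rows = [[rec[name] for name in header] for rec in records]
    return header, rows

def from_header_rows(header, rows):
    """Rebuild the flat list of dicts from the header and rows."""
    return [dict(zip(header, row)) for row in rows]

# Hypothetical sample data shaped like DnsRecord.
records = [{"key": "host%d.example.com" % i,
            "value": "192.0.2.%d" % (i % 256),
            "first": "2009-01-01 00:00:00",
            "last": "2009-02-01 00:00:00",
            "type": "A",
            "ttl": 86400} for i in range(100)]

header, rows = to_header_rows(records)
flat = json.dumps(records)
packed = json.dumps([header, rows])

# The field names are written once instead of once per record.
print(len(flat), len(packed))
print(from_header_rows(*json.loads(packed)) == records)
```

Whether this is also faster to serialize would have to be measured; the sketch only shows the size difference.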

The totally unexpected result was that protocol buffers clocked in at over 4 times slower than thrift (5.411s versus 1.289s for the round trip, and 3.087s versus 0.474s for serializing alone). I find it hard to believe that protocol buffers could be that slow, so I will have to run some more tests to make sure that I am using the library correctly.

If you want to run my tests for yourself, the code is available from sertest.tgz