
Binary Encoding Comparison of Avro and Protocol Buffers

2022/07/11

Versions used: Apache Avro 1.11.1, Protocol Buffers 3

| data type | Avro | Protobuf |
| --- | --- | --- |
| null | zero bytes | |
| boolean | 1 byte | as int32, 1 byte |
| int, long | variable-length zig-zag | int32 (two's complement), sint32 (zig-zag), uint32, fixed32, sfixed32 |
| float | 4 bytes | 4 bytes |
| double | 8 bytes | 8 bytes |
| bytes | a long + bytes | varint + bytes |
| string | a long + UTF-8 | varint + UTF-8 |
| record/message | encoded fields, no length/separator | Tag-Length-Value, varint key: (field_number << 3) \| wire_type |
| enum | as int | as int32 |
| array/repeated | blocks (a long + items) | primitive numeric types are packed: varint + items; other types just repeat |
| map | blocks (a long + items) | as repeated nested tuple |
| fixed | fixed size bytes | |
| union | an int (position) + value | |
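To make the integer rows concrete, here is a minimal hand-rolled sketch of the two building blocks, zig-zag mapping and base-128 varints. It uses no Avro or Protobuf library; the function names (`zigzag`, `varint`, `avro_long`, `pb_sint`, `pb_int`) are made up for illustration.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def varint(u: int) -> bytes:
    """Encode an unsigned integer in 7-bit groups, least significant group first."""
    out = bytearray()
    while True:
        low7 = u & 0x7F
        u >>= 7
        if u:
            out.append(low7 | 0x80)  # more groups follow
        else:
            out.append(low7)         # last group, high bit clear
            return bytes(out)


def avro_long(n: int) -> bytes:
    """Avro int/long: zig-zag, then variable-length."""
    return varint(zigzag(n))


def pb_sint(n: int) -> bytes:
    """Protobuf sint32/sint64 value: zig-zag, then varint."""
    return varint(zigzag(n))


def pb_int(n: int) -> bytes:
    """Protobuf int32/int64 value: two's complement widened to 64 bits, then varint."""
    return varint(n & 0xFFFFFFFFFFFFFFFF)


print(avro_long(-1).hex())  # 01
print(pb_sint(-1).hex())    # 01
print(len(pb_int(-1)))      # 10
print(pb_int(150).hex())    # 9601
```

The practical takeaway is that a negative number costs 10 bytes as int32 but only 1 byte as sint32 or as an Avro int.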

For scalar types, Avro and Protobuf use similar encodings. For complex types (record/message, array/repeated, map), Avro uses a packed encoding, while Protobuf just writes the key-value pair multiple times.
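A rough sketch of what this looks like for arrays, written by hand rather than with either library. The helper names and the field numbers 1 and 2 are assumptions chosen for the example.

```python
def zigzag(n: int) -> int:
    return (n << 1) ^ (n >> 63)


def varint(u: int) -> bytes:
    out = bytearray()
    while True:
        low7 = u & 0x7F
        u >>= 7
        if u:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


LEN = 2  # Protobuf wire type for length-delimited values


def pb_key(field_number: int, wire_type: int) -> bytes:
    return varint((field_number << 3) | wire_type)


def avro_long_array(items: list) -> bytes:
    """One block (item count + items), then a zero count that ends the array."""
    if not items:
        return b"\x00"
    body = b"".join(varint(zigzag(i)) for i in items)
    return varint(zigzag(len(items))) + body + b"\x00"


def pb_packed_int32(field_number: int, items: list) -> bytes:
    """Packed: one key, one byte length, then all values back to back."""
    if not items:
        return b""  # an empty repeated field is not written at all
    body = b"".join(varint(i & 0xFFFFFFFFFFFFFFFF) for i in items)
    return pb_key(field_number, LEN) + varint(len(body)) + body


def pb_repeated_string(field_number: int, items: list) -> bytes:
    """Not packed: the key is written again for every element."""
    out = bytearray()
    for s in items:
        raw = s.encode("utf-8")
        out += pb_key(field_number, LEN) + varint(len(raw)) + raw
    return bytes(out)


print(avro_long_array([3, 27]).hex())             # 04063600
print(pb_packed_int32(1, [1, 2, 3]).hex())        # 0a03010203
print(pb_repeated_string(2, ["a", "b"]).hex())    # 120161120162
```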

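Most of the differences come from this design choice: Avro writes every field in schema order with no tag, while Protobuf writes a tagged key-value pair only for the fields that are actually set.

This makes Protobuf more suitable for sparse messages.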

Other differences:

Nullable

Avro uses a union to support nullable values. If a type is nullable, every encoded value starts with the position of its branch in the union's schema, which costs 1 byte for a typical two-branch union.

For Protobuf, we can use optional to mimic the null value: an unset optional field is simply not encoded.
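As an illustration, here is a hand-written sketch of a nullable string in both formats, assuming the Avro union schema `["null", "string"]` and a Protobuf `optional string` at field number 1 (both are assumptions for the example, as are the function names).

```python
from typing import Optional


def zigzag(n: int) -> int:
    return (n << 1) ^ (n >> 63)


def varint(u: int) -> bytes:
    out = bytearray()
    while True:
        low7 = u & 0x7F
        u >>= 7
        if u:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def avro_nullable_string(value: Optional[str]) -> bytes:
    """Union ["null", "string"]: branch position first, then the value."""
    if value is None:
        return varint(zigzag(0))  # branch 0 = null, which itself is zero bytes
    raw = value.encode("utf-8")
    return varint(zigzag(1)) + varint(zigzag(len(raw))) + raw


def pb_optional_string(value: Optional[str], field_number: int = 1) -> bytes:
    """optional string: nothing on the wire when the field is unset."""
    if value is None:
        return b""
    raw = value.encode("utf-8")
    return varint((field_number << 3) | 2) + varint(len(raw)) + raw


print(avro_nullable_string(None).hex())   # 00
print(avro_nullable_string("hi").hex())   # 02046869
print(pb_optional_string(None).hex())     # (empty)
print(pb_optional_string("hi").hex())     # 0a026869
```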

Default values

In Avro, a default value is only used when reading instances that lack the field for schema evolution purposes. The presence of a default value does not make the field optional at encoding time. Avro encodes a field even if its value is equal to its default.

Protobuf 3 only supports the built-in zero, false, and empty default values; a field cannot declare a custom default.

So if we have a very sparse record/message that defines 100 nullable fields, using union in Avro and optional in Protobuf, how is an instance encoded when all fields are null?

Avro: 100 union position indexes, one byte each, so 100 bytes.

Protobuf: no fields are encoded.
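The numbers can be checked with a small hand-rolled sketch. It assumes every Avro field has the schema `["null", "long"]` and every Protobuf field is an `optional int64` numbered 1 to 100; the function names are invented for the example.

```python
def zigzag(n: int) -> int:
    return (n << 1) ^ (n >> 63)


def varint(u: int) -> bytes:
    out = bytearray()
    while True:
        low7 = u & 0x7F
        u >>= 7
        if u:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def avro_sparse_record(values: list) -> bytes:
    """Every field is written, even null ones (union position 0)."""
    out = bytearray()
    for v in values:
        if v is None:
            out += varint(zigzag(0))
        else:
            out += varint(zigzag(1)) + varint(zigzag(v))
    return bytes(out)


def pb_sparse_message(values: list) -> bytes:
    """Only fields that are set are written, each with its own key."""
    out = bytearray()
    for number, v in enumerate(values, start=1):
        if v is not None:
            out += varint((number << 3) | 0) + varint(v & 0xFFFFFFFFFFFFFFFF)
    return bytes(out)


all_null = [None] * 100
print(len(avro_sparse_record(all_null)))  # 100
print(len(pb_sparse_message(all_null)))   # 0

one_set = [None] * 100
one_set[49] = 7  # field number 50
print(len(avro_sparse_record(one_set)))   # 101
print(pb_sparse_message(one_set).hex())   # 900307
```

Setting a single field adds one value byte to the Avro record, while the Protobuf message grows from nothing to a 2-byte key plus the value.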

Submessages

Because Protobuf uses Tag-Length-Value, nested submessage fields must use the LEN wire type so that the parser knows how long the encoded field is.

Avro, in contrast, writes the fields of a nested record inline, so no length is needed.
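A small sketch of the difference, assuming a hypothetical inner message with one integer field `a` (field number 1) that the outer message embeds at field number 3. The Avro side assumes the equivalent nested records with `a` as a long.

```python
def zigzag(n: int) -> int:
    return (n << 1) ^ (n >> 63)


def varint(u: int) -> bytes:
    out = bytearray()
    while True:
        low7 = u & 0x7F
        u >>= 7
        if u:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


VARINT, LEN = 0, 2  # Protobuf wire types


def pb_inner(a: int) -> bytes:
    return varint((1 << 3) | VARINT) + varint(a & 0xFFFFFFFFFFFFFFFF)


def pb_outer(a: int) -> bytes:
    """The submessage is wrapped as key + byte length + payload."""
    payload = pb_inner(a)
    return varint((3 << 3) | LEN) + varint(len(payload)) + payload


def avro_inner(a: int) -> bytes:
    return varint(zigzag(a))


def avro_outer(a: int) -> bytes:
    """The nested record's fields are written inline, with no wrapper."""
    return avro_inner(a)


print(pb_outer(150).hex())    # 1a03089601
print(avro_outer(150).hex())  # ac02
```

One consequence of the length prefix is that a Protobuf encoder has to know the size of the submessage before it can write it, whereas an Avro encoder can stream nested fields as it goes.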

Field presence

Avro record fields are always present in the wire format. Protobuf's field presence rules are more complicated:

| | singular | optional |
| --- | --- | --- |
| not write any value | not encoded | not encoded |
| write default value | not encoded | encoded |
| write non-default value | encoded | encoded |
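The table can be restated as two small predicates. This is only a model of the rules for an integer field, not how any Protobuf runtime is implemented; `None` stands for "no value was written" and the function names are hypothetical.

```python
from typing import Optional

DEFAULT = 0  # proto3 default for integer fields


def singular_is_encoded(value: Optional[int]) -> bool:
    """Singular field: only the value matters, so the default is never written."""
    effective = DEFAULT if value is None else value
    return effective != DEFAULT


def optional_is_encoded(value: Optional[int]) -> bool:
    """optional field: written whenever it was set, even to the default."""
    return value is not None


for label, value in [("not written", None), ("default", 0), ("non-default", 5)]:
    print(label, singular_is_encoded(value), optional_is_encoded(value))
# not written False False
# default False True
# non-default True True
```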

So when to use which?