Protobuf and Avro are both popular binary serialization formats, and they are often compared because they solve similar problems with fundamentally different philosophies. Their core design choices lead to nearly opposite strengths and weaknesses, so which one fits better depends on your exact needs.
Let’s see Avro in action. Imagine you have a Kafka topic user_events that receives a stream of user actions. We’ll use Python to produce and consume messages.
First, define an Avro schema for our user event:
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
Now, let’s produce a message:
from confluent_kafka import Producer
import avro.schema
from avro.io import DatumWriter, BinaryEncoder
import io

# Kafka producer configuration
conf = {'bootstrap.servers': 'localhost:9092'}
producer = Producer(conf)

# Avro schema and writer
schema_str = """
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""
schema = avro.schema.parse(schema_str)
writer = DatumWriter(schema)

# Create a message
user_event = {
    "user_id": 12345,
    "event_type": "page_view",
    "timestamp": 1678886400
}

# Serialize the message
buffer = io.BytesIO()
encoder = BinaryEncoder(buffer)
writer.write(user_event, encoder)
raw_bytes = buffer.getvalue()

# Produce to Kafka
producer.produce('user_events', value=raw_bytes)
producer.flush()
print(f"Produced Avro message: {user_event}")
And consume it:
from confluent_kafka import Consumer, KafkaError, KafkaException
import avro.schema
from avro.io import DatumReader, BinaryDecoder
import io

# Kafka consumer configuration
conf = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'my_avro_consumer',
    'auto.offset.reset': 'earliest',
}
consumer = Consumer(conf)
consumer.subscribe(['user_events'])

# Avro schema and reader (must match the producer's schema)
schema_str = """
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""
schema = avro.schema.parse(schema_str)
reader = DatumReader(schema)

print("Waiting for Avro messages...")
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            # The error codes live on KafkaError, not KafkaException
            if msg.error().code() == KafkaError._PARTITION_EOF:
                # End of partition event (only delivered when enable.partition.eof is set)
                print('%% %s [%d] reached %d' % (msg.topic(), msg.partition(), msg.offset()))
            else:
                # msg.error() returns a KafkaError, which is not an exception, so wrap it
                raise KafkaException(msg.error())
        else:
            # Deserialize the message
            raw_bytes = msg.value()
            buffer = io.BytesIO(raw_bytes)
            decoder = BinaryDecoder(buffer)
            user_event = reader.read(decoder)
            print(f"Consumed Avro message: {user_event}")
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
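This example shows Avro’s schema-centric approach. The schema is paramount, defining the data structure. Both producer and consumer must agree on this schema (or have compatible schemas) for deserialization to work. The serialized Avro data is compact precisely because it carries no schema information at all: there are no field names or tags in the payload, only the field values written in schema order. A reader therefore cannot decode the bytes without knowing the schema they were written with. (Avro object container files are different: they embed the schema once in the file header.)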
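To see what an Avro payload actually contains, here is a minimal sketch that hand-encodes the same UserEvent using only the standard library. It assumes the encoding rules from the Avro specification for the two types used here (a long is a zigzag varint; a string is a long length followed by UTF-8 bytes), and the helper names are made up for illustration. It is not how the avro library is implemented, just the same byte layout.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer the way Avro encodes int/long values."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def avro_string(s: str) -> bytes:
    """Encode a string as a long length prefix followed by UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_user_event(event: dict) -> bytes:
    """Concatenate field values in schema order; no names or tags are written."""
    return (
        zigzag_varint(event["user_id"])
        + avro_string(event["event_type"])
        + zigzag_varint(event["timestamp"])
    )


payload = encode_user_event(
    {"user_id": 12345, "event_type": "page_view", "timestamp": 1678886400}
)
print(len(payload), payload.hex())
```

The payload contains the text page_view but none of the field names, and nothing marks where one field ends and the next begins. Only the schema tells a reader how to split the bytes.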
Protobuf, on the other hand, focuses on generating code from a schema definition. You define your *.proto file, and the Protobuf compiler generates language-specific classes for serialization and deserialization.
Here’s the equivalent UserEvent in Protobuf:
// user_event.proto
syntax = "proto3";

message UserEvent {
  int64 user_id = 1;
  string event_type = 2;
  int64 timestamp = 3;
}
After compiling this with protoc --python_out=. user_event.proto, you get a user_event_pb2.py file.
Producing with Protobuf (using kafka-python for simplicity here, though confluent-kafka also supports Protobuf):
from kafka import KafkaProducer
import user_event_pb2 # Generated from user_event.proto
# Kafka producer configuration
producer = KafkaProducer(bootstrap_servers='localhost:9092')
# Create a Protobuf message
user_event = user_event_pb2.UserEvent()
user_event.user_id = 12345
user_event.event_type = "page_view"
user_event.timestamp = 1678886400
# Serialize the message
serialized_event = user_event.SerializeToString()
# Produce to Kafka
producer.send('user_events_pb', serialized_event)
producer.flush()
print(f"Produced Protobuf message: {user_event}")
Consuming with Protobuf:
from kafka import KafkaConsumer
import user_event_pb2  # Generated from user_event.proto

# Kafka consumer configuration
consumer = KafkaConsumer(
    'user_events_pb',
    bootstrap_servers='localhost:9092',
    group_id='my_protobuf_consumer',
    auto_offset_reset='earliest'
)

print("Waiting for Protobuf messages...")
for message in consumer:
    # Deserialize the message
    user_event = user_event_pb2.UserEvent()
    user_event.ParseFromString(message.value)
    print(f"Consumed Protobuf message: {user_event}")
Protobuf’s strength lies in its generated code, which provides a strongly typed, object-oriented interface. Serialization is fast, and the format is very compact because it uses field numbers, not field names, and employs efficient encoding for different data types.
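To make the "field numbers, not field names" point concrete, here is a rough sketch that builds the wire bytes for the same UserEvent by hand, using only the standard library. It assumes the Protobuf encoding rules for the two wire types this message needs (0 for varint, 2 for length-delimited); the helper names are hypothetical, and real code should use the generated classes.

```python
VARINT, LEN = 0, 2  # the two wire types this message needs


def varint(n: int) -> bytes:
    """Base-128 varint; int64 values are treated as 64-bit two's complement."""
    n &= 0xFFFF_FFFF_FFFF_FFFF
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """Each field is prefixed with (field_number << 3) | wire_type."""
    return varint((field_number << 3) | wire_type)


def encode_user_event(user_id: int, event_type: str, timestamp: int) -> bytes:
    text = event_type.encode("utf-8")
    return (
        tag(1, VARINT) + varint(user_id)
        + tag(2, LEN) + varint(len(text)) + text
        + tag(3, VARINT) + varint(timestamp)
    )


payload = encode_user_event(12345, "page_view", 1678886400)
print(len(payload), payload.hex())
```

Every field costs one extra tag byte here, and the names user_id, event_type, and timestamp never appear in the output.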
The core difference: Avro is schema-driven, meaning the schema is the primary artifact and data is serialized and deserialized directly against it at runtime. Protobuf is code-driven: the compiler generates code from the schema, and that code handles serialization. Avro’s approach gives it flexible schema evolution (a reader can resolve data written with an older or newer schema, as long as the two are compatible and the reader has access to the writer’s schema) and strong integration with systems like Kafka, where the schema can be stored and managed separately. Protobuf, with its generated code, offers better compile-time checks and often superior performance due to its compact encoding and optimized C++ implementation.
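The schema evolution idea can be sketched in a few lines of standard-library Python. The example below assumes a writer using the three-field schema from earlier and a reader whose newer schema adds a session_id field with a default; that field and the resolve helper are hypothetical. It handles only long and string fields, and it skips much of what real Avro schema resolution covers (type promotion, aliases, unions), which the avro library’s DatumReader performs when given both a writer’s and a reader’s schema.

```python
import io


def write_long(n: int) -> bytes:
    """Zigzag varint, as used for Avro long values."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte, z = z & 0x7F, z >> 7
        out.append(byte | 0x80 if z else byte)
        if not z:
            return bytes(out)


def write_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return write_long(len(data)) + data


def read_long(buf: io.BytesIO) -> int:
    shift, result = 0, 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zigzag


def read_string(buf: io.BytesIO) -> str:
    return buf.read(read_long(buf)).decode("utf-8")


READERS = {"long": read_long, "string": read_string}


def resolve(raw: bytes, writer_schema: dict, reader_schema: dict) -> dict:
    """Decode with the writer's schema, then project onto the reader's schema."""
    buf = io.BytesIO(raw)
    # The bytes can only be split into fields by following the writer's schema.
    written = {f["name"]: READERS[f["type"]](buf) for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        if field["name"] in written:
            result[field["name"]] = written[field["name"]]
        elif "default" in field:
            result[field["name"]] = field["default"]  # field the writer never knew about
        else:
            raise ValueError(f"no value or default for {field['name']!r}")
    return result


WRITER_SCHEMA = {"fields": [
    {"name": "user_id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "timestamp", "type": "long"},
]}
# Hypothetical newer schema: adds session_id with a default value.
READER_SCHEMA = {"fields": WRITER_SCHEMA["fields"] + [
    {"name": "session_id", "type": "string", "default": "unknown"},
]}

raw = write_long(12345) + write_string("page_view") + write_long(1678886400)
event = resolve(raw, WRITER_SCHEMA, READER_SCHEMA)
print(event)
```

Fields that exist only in the writer’s schema are decoded and then dropped, while fields that exist only in the reader’s schema are filled from their defaults. This is why the reader needs access to the schema the data was written with.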
The one thing most people don’t realize is how critical Protobuf’s field tags are (the numbers like 1, 2, 3 in the .proto file). They are not just identifiers; together with the wire type, they are the only way the deserializer knows which piece of data corresponds to which field. If you change a field’s number, it’s a breaking change for existing data: old messages will either be ignored as unknown fields or, worse, be decoded into the wrong field. This is why Protobuf is so strict about field numbers: a number should never change once it is assigned, and it should never be reused after its field is removed (the reserved keyword exists to enforce this).
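A small standard-library sketch shows why renumbering is dangerous. It parses the wire bytes of the UserEvent above (user_id 12345, event_type "page_view", timestamp 1678886400) into a map keyed by field number, then applies two different number-to-name tables. The parser is a simplified illustration that assumes only varint and length-delimited fields and treats every length-delimited value as a UTF-8 string; the renumbered table is a hypothetical bad schema change.

```python
def read_varint(data: bytes, pos: int):
    """Return (value, next_position) for the varint starting at pos."""
    shift, result = 0, 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def parse_fields(data: bytes) -> dict:
    """Map field number -> value; the bytes carry numbers, never names."""
    fields, pos = {}, 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited, assumed to be a string here
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length].decode("utf-8"), pos + length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        fields[number] = value
    return fields


def decode(data: bytes, numbering: dict) -> dict:
    """Attach names to parsed values using a number -> name table."""
    fields = parse_fields(data)
    return {name: fields[num] for num, name in numbering.items() if num in fields}


# Wire bytes for the UserEvent written with numbers 1, 2, 3.
payload = bytes.fromhex("08b960") + b"\x12\x09page_view" + bytes.fromhex("18808cc7a006")

original = {1: "user_id", 2: "event_type", 3: "timestamp"}
renumbered = {3: "user_id", 2: "event_type", 1: "timestamp"}  # hypothetical bad change

print(decode(payload, original))
print(decode(payload, renumbered))
```

With the renumbered table, decoding raises no error. The old data is silently read with user_id and timestamp swapped, which is often harder to detect than a parse failure.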
Ultimately, Avro shines when you need dynamic schema evolution, especially in distributed systems like Kafka where schemas can be managed independently. Protobuf is ideal for high-performance, low-latency RPC (like gRPC) or when you prefer compile-time safety and a more object-oriented approach to your data structures.
The next concept you’ll likely encounter is how to manage schema evolution effectively with each format in a real-world scenario.