Protobuf messages, when serialized, are significantly smaller than their JSON or Avro equivalents, making them ideal for high-throughput event streams.
Let’s watch this in action. Imagine we have a simple UserEvent protobuf message:
syntax = "proto3";
package com.example.events;
message UserEvent {
string user_id = 1;
string event_type = 2;
int64 timestamp = 3;
map<string, string> metadata = 4;
}
Now, let’s serialize this to bytes and send it to Kafka, and then deserialize it.
First, the producer side (Python example):
from google.protobuf.json_format import MessageToJson
from google.protobuf import json_format
import avro.schema
import avro.io
import io
import json
import time
from kafka import KafkaProducer
from user_event_pb2 import UserEvent # Assuming user_event_pb2 is generated from .proto
producer = KafkaProducer(
bootstrap_servers='localhost:9092',
value_serializer=lambda v: v.SerializeToString() # Crucial: serialize to bytes
)
user_event = UserEvent(
user_id="user-123",
event_type="LOGIN",
timestamp=int(time.time() * 1000),
metadata={"ip_address": "192.168.1.1", "user_agent": "Chrome"}
)
topic = 'user-events'
future = producer.send(topic, value=user_event)
try:
record_metadata = future.get(timeout=10)
print(f"Sent message to topic {record_metadata.topic} partition {record_metadata.partition} offset {record_metadata.offset}")
except Exception as e:
print(f"Failed to send message: {e}")
producer.flush()
producer.close()
On the consumer side (Python example):
from google.protobuf.json_format import MessageToJson
from google.protobuf import json_format
import avro.schema
import avro.io
import io
import json
import time
from kafka import KafkaConsumer
from user_event_pb2 import UserEvent # Assuming user_event_pb2 is generated from .proto
consumer = KafkaConsumer(
'user-events',
bootstrap_servers='localhost:9092',
auto_offset_reset='earliest',
enable_auto_commit=True,
group_id='user-event-consumers',
value_deserializer=lambda x: UserEvent.FromString(x) # Crucial: deserialize from bytes
)
print("Waiting for messages...")
for message in consumer:
user_event_proto = message.value # This is already a UserEvent object
print(f"Received message: {user_event_proto}")
print(f"User ID: {user_event_proto.user_id}")
print(f"Event Type: {user_event_proto.event_type}")
print(f"Timestamp: {user_event_proto.timestamp}")
print(f"Metadata: {user_event_proto.metadata}")
consumer.close()
The core idea is that Protobuf defines a strict schema, and the serialization process encodes the data in a compact binary format. When you send this to Kafka, you’re sending raw bytes. The value_serializer on the producer and value_deserializer on the consumer are the critical pieces. The producer tells Kafka "here’s the binary representation of my Protobuf object," and the consumer tells Kafka "when you give me bytes, treat them as a Protobuf object and deserialize them for me."
The problem this solves is the inefficiency of text-based formats like JSON or even Avro for high-volume data streams. JSON is verbose, and while Avro is binary and schema-driven, Protobuf often achieves even greater compression due to its wire format, especially with repeated fields and its efficient encoding of integers (varints). This means less network bandwidth, lower storage costs in Kafka, and faster processing because there’s less data to move and parse.
Internally, Protobuf uses a technique called "tag-value" encoding. Each field in your message has a unique number (the "tag," like 1 for user_id). The serializer writes the tag and then the encoded value. For example, an integer 123 might be encoded as a single byte 0x7B (varint encoding). A string is prefixed with its length. This is incredibly efficient. The schema itself isn’t transmitted with every message; consumers need to have the .proto definition to know how to interpret the bytes.
The exact levers you control are within your .proto file. The data types (string, int64, map), field numbers, and nesting of messages directly impact the resulting binary payload size and structure. For example, using bytes instead of string for binary data is more efficient, and choosing the correct integer type (int32 vs. int64) can save space if your values are small.
A common misconception is that Protobuf is self-describing. It’s not. Unlike JSON, where field names are part of the data, Protobuf relies on field numbers. If your consumer code doesn’t have the correct .proto definition to generate the corresponding classes (like user_event_pb2.py in Python), it won’t know what tag=1 means, and deserialization will fail or produce garbage. This strict schema dependency is both a strength (performance, type safety) and a potential point of friction.
The next step is often handling schema evolution in your Protobuf definitions and ensuring your Kafka consumers can gracefully handle older or newer versions of messages.