Protobuf schemas can evolve without breaking existing clients. That is the core of backward compatibility, and it rests on how parsers handle fields they do not recognize.
Let’s watch a User message evolve.
Initial schema (user.proto):
syntax = "proto3";

message User {
  string name = 1;
  int32 age = 2;
}
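A client populates the message and serializes it. The wire format is binary; the content is shown here as JSON for readability:
{
  "name": "Alice",
  "age": 30
}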
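To make the binary form concrete, here is a minimal sketch that hand-encodes this message without the official protobuf library. The helper names `encode_varint` and `encode_user_v1` are illustrative, not part of any Protobuf API, and the sketch only handles non-negative integers.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_user_v1(name, age):
    """Hand-encode User{name = 1, age = 2} (illustrative helper only)."""
    payload = name.encode("utf-8")
    out = bytearray()
    out += encode_varint((1 << 3) | 2)  # tag: field 1, wire type 2 (length-delimited)
    out += encode_varint(len(payload))
    out += payload
    out += encode_varint((2 << 3) | 0)  # tag: field 2, wire type 0 (varint)
    out += encode_varint(age)
    return bytes(out)


wire = encode_user_v1("Alice", 30)
print(wire.hex(" "))  # 0a 05 41 6c 69 63 65 10 1e
```

Note that the field names never appear in the bytes; only the field numbers 1 and 2 do.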
Now we add a new field, email, under a fresh field number (3). Crucially, the existing field numbers stay exactly as they were.
syntax = "proto3";

message User {
  string name = 1;
  int32 age = 2;
  string email = 3; // New field
}
A new client serializes a message with all three fields set (again shown as JSON for readability):
{
  "name": "Alice",
  "age": 30,
  "email": "alice@example.com"
}
An old client, which only understands name and age, receives the new message. It will happily parse name and age, and simply ignore the unknown field email because its compiled schema has no field numbered 3.
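The old client's behavior can be sketched with a small hand-written decoder. This is an approximation for illustration, not the real protobuf runtime: `OLD_SCHEMA` and `decode_with_schema` are hypothetical names, and only varint and length-delimited fields are handled.

```python
def decode_varint(buf, pos):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# The old client's view of the schema: field number -> field name.
OLD_SCHEMA = {1: "name", 2: "age"}


def decode_with_schema(buf, schema):
    """Decode known fields and skip any field number missing from schema."""
    message = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        if wire_type == 0:  # varint
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        name = schema.get(field_number)
        if name is None:
            continue  # unknown field: its bytes were consumed, nothing is stored
        message[name] = value.decode("utf-8") if wire_type == 2 else value
    return message


# Bytes a new client would emit for name="Alice", age=30, email="alice@example.com".
new_wire = bytes.fromhex("0a05416c696365101e") + b"\x1a\x11alice@example.com"
print(decode_with_schema(new_wire, OLD_SCHEMA))  # {'name': 'Alice', 'age': 30}
```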
The problem this solves is the bane of distributed systems: deploying new code without breaking old code. Imagine a microservice that uses Protobuf to communicate. If you deploy a new version of the service that adds a field, and old clients try to communicate with it, they’ll fail if the schema isn’t backward compatible. Protobuf’s design elegantly sidesteps this by treating unknown fields as ignorable data.
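Internally, Protobuf is a binary format. Each field is encoded as a tag followed by a value. The tag is a varint that packs the field number together with a wire type, computed as (field_number << 3) | wire_type. The wire type does not name the exact data type; it only says how the value is laid out (varint, fixed 32-bit, fixed 64-bit, or length-delimited). When an old client encounters a field number it doesn’t recognize (like 3 for email), the wire type tells it how to find the end of the value: for a length-delimited value such as a string it reads the length prefix, and for a varint it reads until the terminating byte. It then skips over that data without trying to interpret it, thus preserving its own operational integrity. This is why never reusing field numbers is paramount. If you reassign field number 3 to something else, a client still on the old schema will read the new field’s bytes as an email, and a new client will read old email bytes as the new field. When the wire types match (string and bytes, for example), the parse succeeds and silently yields garbage; when they differ, the parser may reject the message or set the field aside as unknown, depending on the implementation.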
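As a rough sketch of that skipping logic (hand-rolled, with illustrative names such as `skip_field` and `list_fields`; the deprecated group wire types 3 and 4 are deliberately left out), a parser can walk a message knowing nothing but the wire types:

```python
WIRE_VARINT, WIRE_I64, WIRE_LEN, WIRE_I32 = 0, 1, 2, 5


def decode_varint(buf, pos):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def skip_field(buf, pos, wire_type):
    """Return the position just past a value without interpreting it."""
    if wire_type == WIRE_VARINT:
        _, pos = decode_varint(buf, pos)
        return pos
    if wire_type == WIRE_I64:
        return pos + 8
    if wire_type == WIRE_LEN:
        length, pos = decode_varint(buf, pos)
        return pos + length
    if wire_type == WIRE_I32:
        return pos + 4
    raise ValueError(f"wire type {wire_type} not handled in this sketch")


def list_fields(buf):
    """Return (field_number, wire_type, value_size_in_bytes) for each field."""
    fields = []
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        end = skip_field(buf, pos, wire_type)
        fields.append((field_number, wire_type, end - pos))
        pos = end
    return fields


# name="Alice", age=30, email="alice@example.com" as emitted by a new client.
new_wire = bytes.fromhex("0a05416c696365101e") + b"\x1a\x11alice@example.com"
print(list_fields(new_wire))  # [(1, 2, 6), (2, 0, 1), (3, 2, 18)]
```

The sizes include the length prefix for length-delimited values, which is why the 5-byte name occupies 6 bytes and the 17-byte email occupies 18.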
The mental model is this: Protobuf is a schema-driven serialization format. The schema defines how to map structured data to a sequence of tag-value pairs. When serializing, it emits tag-value pairs based on the current schema. When deserializing, it reads tag-value pairs and matches them against its own schema. If a tag from the incoming data doesn’t exist in the deserializing schema, it’s ignored. The crucial invariant is that tag numbers are stable identifiers for fields across schema versions.
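One consequence of this model is that field names are free to change while field numbers are not. The sketch below, which assumes a toy schema representation of `{field_number: (name, kind)}` and hypothetical `encode`/`decode` helpers, writes a message with one schema and reads it back with a renamed and a renumbered variant.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode(message, schema):
    """Emit tag-value pairs for every schema field present in message."""
    out = bytearray()
    for number, (name, kind) in sorted(schema.items()):
        if name not in message:
            continue
        if kind == "string":
            payload = message[name].encode("utf-8")
            out += encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload
        else:  # "int": non-negative integers only in this sketch
            out += encode_varint((number << 3) | 0) + encode_varint(message[name])
    return bytes(out)


def decode(buf, schema):
    """Match incoming field numbers against schema; drop the ones not found."""
    message = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x7
        if wire_type == 2:
            length, pos = decode_varint(buf, pos)
            raw, pos = buf[pos:pos + length], pos + length
        elif wire_type == 0:
            raw, pos = decode_varint(buf, pos)
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if number in schema:
            name, kind = schema[number]
            message[name] = raw.decode("utf-8") if kind == "string" else raw
    return message


V1 = {1: ("name", "string"), 2: ("age", "int")}
RENAMED = {1: ("full_name", "string"), 2: ("age", "int")}   # same numbers, new name
RENUMBERED = {1: ("name", "string"), 3: ("age", "int")}     # age moved from 2 to 3

wire = encode({"name": "Alice", "age": 30}, V1)
print(decode(wire, RENAMED))     # {'full_name': 'Alice', 'age': 30}
print(decode(wire, RENUMBERED))  # {'name': 'Alice'}
```

The renamed schema reads everything, because names never travel on the wire. The renumbered schema loses age, because the reader looks for field 3 and treats the incoming field 2 as unknown.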
This backward compatibility is achieved by following two simple rules:
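- Never reassign field numbers: Once a field number is used, it’s permanent for that field. If you delete the field, mark its number as reserved so it cannot be reused by accident.
- Do not make fields required: proto3 has no required label at all, and in proto2 it is best avoided, because a required field added later makes every message from an older writer, which lacks that field, fail to parse in new readers. A newly added field is implicitly optional, so an old client that receives it will simply ignore it.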
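These two rules are mechanical enough to check automatically. The following is a minimal sketch, assuming schemas are described by hand as `{field_number: (name, type, label)}`; the function `check_evolution` is hypothetical and does not parse .proto files. It is deliberately strict: it cannot tell a harmless rename from a reused number, so it flags both for human review.

```python
def check_evolution(old, new):
    """Compare two schema versions; return a list of problem descriptions."""
    problems = []
    for number, (name, ftype, label) in sorted(new.items()):
        # Rule 2: a field that did not exist before must not be required.
        if label == "required" and number not in old:
            problems.append(f"field {number} ({name}): new field must not be required")
        # Rule 1: an existing number must keep its meaning.
        if number in old:
            old_name, old_type, _ = old[number]
            if (old_name, old_type) != (name, ftype):
                problems.append(
                    f"field {number}: was {old_type} {old_name}, now {ftype} {name}"
                )
    return problems


V1 = {1: ("name", "string", "optional"), 2: ("age", "int32", "optional")}
V2 = {**V1, 3: ("email", "string", "optional")}
BAD = {
    1: ("name", "string", "optional"),
    2: ("zip_code", "string", "optional"),  # number 2 reassigned
    3: ("email", "string", "required"),     # new field marked required
}

print(check_evolution(V1, V2))  # []
for problem in check_evolution(V1, BAD):
    print(problem)
# field 2: was int32 age, now string zip_code
# field 3 (email): new field must not be required
```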
The counterintuitive part is that while you’re adding new information, the old system treats it as if it’s not there, yet it doesn’t break. It’s like adding a new wing to a house; older residents might not even notice it’s there if they don’t have a reason to go into that part of the building, but the house itself remains functional. The key is that the structure for reading existing fields remains identical.
The next hurdle in schema evolution is forward compatibility.