Protobuf field numbers are not just arbitrary identifiers; they are the stable backbone of your data serialization, and violating their rules can silently corrupt your data.
Let’s see this in action. Imagine we have a Person message:
syntax = "proto3";
message Person {
  string name = 1;
  int32 age = 2;
}
Now, we want to add a new field, email. The intuitive thing is to just pick the next available number, 3:
syntax = "proto3";
message Person {
  string name = 1;
  int32 age = 2;
  string email = 3; // New field
}
This looks fine. But what if an older client, compiled with the first version of the schema, receives data serialized by a newer client that includes email? The older client will simply ignore field number 3 because it doesn’t know about it. No problem, right?
The real trouble (and potential disaster) starts when you reuse field numbers. Consider this:
syntax = "proto3";
message Person {
  string name = 1;
  int32 age = 2;
  // Oops, we deleted 'email' and are now adding 'city'
  string city = 3;
}
If a new client serializes a Person with city = "New York" under field number 3, and an old client (which still has email defined as field number 3) deserializes it, the old client will happily parse "New York" as the email field! This is where your data becomes corrupted. Both fields happen to be strings here, so parsing succeeds and nothing signals an error, but the meaning is completely wrong. Had the types differed (imagine the number once used for the int32 age being reused for a bytes field), the old client would not even find the kind of value it expects under that number.
This is why field numbers must be unique and stable within a message type. They are the keys that protobuf uses to identify fields during serialization and deserialization. When a parser encounters a field number, it looks it up in its schema to know what type of data to expect and how to interpret it. If you change a field number, or reuse one, you’re essentially changing the key for that data, leading to misinterpretation. When you delete a field, mark its number as reserved in the message (for example, reserved 3;) so the protobuf compiler rejects any later attempt to reuse it.
Protobuf exists to provide an efficient, language-neutral, platform-neutral, extensible mechanism for serializing structured data. It’s like XML, but smaller, faster, and simpler. You define your data structure once, then use generated source code to read and write your structured data to and from a variety of data streams. Your schema is the contract between everyone who reads and writes that data, and field numbers are the contract’s immutable identifiers.
When you send data, protobuf serializes your message into a sequence of key-value pairs. The "key" combines the field number with the wire type, which tells the parser how the value is encoded and therefore how many bytes to read (VARINT, 64-bit, length-delimited, etc.). The field name never appears on the wire. For example, field number 1 with a string value is serialized as the key (1, LENGTH_DELIMITED), followed by the length of the string and then the string bytes.
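The key layout described above can be sketched in a few lines of Python. This is a hand-written illustration of the wire format, not the protobuf library's API: the helper names (encode_varint, encode_key, encode_string_field, encode_uint_field) are made up for this example, and only non-negative integers and UTF-8 strings are handled.

```python
# Wire type numbers as defined by the protobuf encoding.
WIRE_VARINT = 0  # int32, int64, bool, enum, ...
WIRE_LEN = 2     # length-delimited: string, bytes, nested messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """The key is the varint of (field_number << 3) | wire_type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Key, then the byte length as a varint, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_key(field_number, WIRE_LEN) + encode_varint(len(data)) + data


def encode_uint_field(field_number: int, value: int) -> bytes:
    """Key, then the value itself as a varint (non-negative values only)."""
    return encode_key(field_number, WIRE_VARINT) + encode_varint(value)


# name = "Ada" under field 1, age = 36 under field 2
person_bytes = encode_string_field(1, "Ada") + encode_uint_field(2, 36)
print(person_bytes.hex(" "))
```

Note that the strings "name" and "age" appear nowhere in the output; only the numbers 1 and 2 do, packed into the key bytes.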
When a parser receives this stream, it reads the key, looks up the field number in its compiled schema to find the field’s declared type, and then reads the amount of data the wire type calls for. If field number 3 was email (a string) in one version and city (also a string) in another, the parser sees (3, LENGTH_DELIMITED), finds a string field under number 3, and happily assigns the value to whichever field its own schema associates with that number.
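To make the lookup step concrete, here is a minimal sketch of a decoder driven by a schema table. It assumes a toy schema represented as a dict from field number to field name (OLD_SCHEMA and NEW_SCHEMA are hypothetical stand-ins for the two versions of Person), and it handles only varint and length-delimited string fields. A real protobuf runtime does much more, but the dispatch on field number works the same way.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode(buf: bytes, schema: dict[int, str]) -> dict[str, object]:
    """Decode fields and name them according to the reader's own schema."""
    fields: dict[str, object] = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:  # VARINT
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited, treated as a string here
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        name = schema.get(field_number)
        if name is not None:  # numbers missing from the schema are dropped
            fields[name] = value
    return fields


# Bytes written by the newer client: name="Ada", age=36, field 3="New York"
PAYLOAD = b"\x0a\x03Ada" + b"\x10\x24" + b"\x1a\x08New York"

OLD_SCHEMA = {1: "name", 2: "age", 3: "email"}  # reader still has email = 3
NEW_SCHEMA = {1: "name", 2: "age", 3: "city"}   # writer reused 3 for city

old_view = decode(PAYLOAD, OLD_SCHEMA)
new_view = decode(PAYLOAD, NEW_SCHEMA)
print(old_view)
print(new_view)
```

The same bytes produce an email of "New York" for one reader and a city of "New York" for the other, and neither decode raises an error.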
The key takeaway is that field numbers are persistent identifiers for data fields. They are not just for human readability or internal compiler use; they are part of the on-the-wire format.
The most surprising aspect of protobuf’s schema evolution is that you can add new fields (with new, never-before-used field numbers) to a message, and old code will simply ignore them. Conversely, you can remove fields from a message, and new code will gracefully ignore those fields when old code still sends them. This backward and forward compatibility is a core design principle. The danger arises only when you reuse a field number that was previously used for a different field, or when you change the type associated with a field number to an incompatible one (less common than reusing numbers, but just as much a violation of the contract).
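A type change can go wrong even when the field number and the wire type both stay the same. As one illustrative case (chosen for this example, not taken from the Person schema above), suppose a field is written as sint32 but read as int32. Both travel as varints, but sint32 applies ZigZag encoding first, so the reader gets a different number with no error. The function names below are hypothetical.

```python
def zigzag_encode(n: int) -> int:
    """Map a signed 32-bit integer to the unsigned value sint32 puts on the wire."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


written = -1
on_wire = zigzag_encode(written)         # what a sint32 writer emits as a varint

read_as_sint32 = zigzag_decode(on_wire)  # reader whose schema still says sint32
read_as_int32 = on_wire                  # reader whose schema says int32 takes the varint as-is

print(written, on_wire, read_as_sint32, read_as_int32)
```

The writer stored -1, the sint32 reader recovers -1, and the int32 reader silently sees 1.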
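When a parser encounters a field number that’s not in its current schema, the protobuf runtime treats it as an unknown field and skips over it, using the wire type to work out how many bytes to skip. Early proto3 releases discarded unknown fields during deserialization; since version 3.5, proto3 retains them by default and writes them back out when the message is reserialized, matching proto2. None of this protects you from reuse, however: if a field was removed and its number reused, the number is no longer unknown, so data that old code sends under it will be interpreted by the new code as the new field.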
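Skipping an unknown field is possible only because the wire type says how long the value is. The following sketch, with the hypothetical helper split_unknown, separates the fields a reader knows from the ones it does not and keeps the unknown bytes intact, which is roughly what a runtime needs in order to write them back out later. It is an illustration of the idea under those assumptions, not the library's implementation.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def split_unknown(buf: bytes, known: set[int]) -> tuple[dict[int, bytes], bytes]:
    """Return ({field_number: raw value bytes} for known fields, raw unknown bytes)."""
    known_fields: dict[int, bytes] = {}
    unknown = bytearray()
    pos = 0
    while pos < len(buf):
        start = pos
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        value_start = pos
        if wire_type == 0:    # VARINT: read until a byte without the high bit
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:  # 64-bit: always 8 bytes
            pos += 8
        elif wire_type == 2:  # length-delimited: a length varint, then that many bytes
            length, pos = read_varint(buf, pos)
            value_start = pos
            pos += length
        elif wire_type == 5:  # 32-bit: always 4 bytes
            pos += 4
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known:
            known_fields[field_number] = buf[value_start:pos]
        else:
            unknown += buf[start:pos]  # keep key and value so they can be re-emitted
    return known_fields, bytes(unknown)


EMAIL = b"ada@example.com"
# Written with the schema that has email = 3
PAYLOAD = b"\x0a\x03Ada" + b"\x10\x24" + b"\x1a" + bytes([len(EMAIL)]) + EMAIL

# Read by the first schema version, which only knows fields 1 and 2
known_fields, unknown_bytes = split_unknown(PAYLOAD, known={1, 2})
print(known_fields)
print(unknown_bytes)
```

A reader that appends unknown_bytes when it reserializes the message passes the email through untouched, even though its schema has never heard of field 3.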
The next step is understanding how to manage optional vs. required fields in proto2 and how proto3 handles field presence and default values.