A protobuf message definition is more than just a data schema; it’s a blueprint for compact, language-agnostic binary serialization that typically produces smaller payloads and parses faster than JSON and other text-based formats.
Let’s see this in action. Imagine you’re building a simple user profile service. You need to store a user’s ID, name, and email, plus a list of tags and a shipping address.
syntax = "proto3";

message UserProfile {
  int64 user_id = 1;
  string name = 2;
  string email = 3;
  repeated string tags = 4;
  Address shipping_address = 5; // Nested message
}

message Address {
  string street = 1;
  string city = 2;
  string zip_code = 3;
}
Here, syntax = "proto3"; declares that the file uses the proto3 language syntax, which is simpler and has fewer edge cases than proto2. It must be the first non-empty, non-comment line of the file.
The message UserProfile block defines the structure. Each line within the message is a field. int64 user_id = 1; means we have a field named user_id of type int64 (a 64-bit integer), and crucially, it’s assigned a unique field number of 1. These field numbers are the keys used during serialization, not the field names. This is why you can rename fields in your .proto file without breaking compatibility with older serialized data, as long as the field numbers remain the same.
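To make the role of field numbers concrete, here is a minimal sketch in plain Python (no protobuf library) of how a single field such as user_id could be written by hand. The helper names encode_varint and encode_tag are illustrative, not part of any protobuf API; the layout follows the documented wire format, where each value is preceded by a key built from the field number and a wire type.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F          # low seven bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


WIRE_VARINT = 0  # wire type used by int32, int64, bool, and enum fields


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """Build the key written before each value: field number plus wire type."""
    return encode_varint((field_number << 3) | wire_type)


# user_id = 42, stored under field number 1
encoded_user_id = encode_tag(1, WIRE_VARINT) + encode_varint(42)
print(encoded_user_id.hex())  # 082a
```

Note that the string "user_id" never appears in the output; only the number 1 does, which is why a rename leaves the binary data untouched.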
string name = 2; and string email = 3; are straightforward string fields.
repeated string tags = 4; introduces a list. The repeated keyword means the field can hold zero or more elements of the specified type (string in this case), and their order is preserved. On the wire, each string element is written as its own tag-and-value record; only repeated numeric scalars get the more compact packed encoding, which proto3 uses by default.
Address shipping_address = 5; demonstrates nesting. Address is another message type defined separately. This allows for complex, hierarchical data structures. When UserProfile is serialized, the entire Address message is embedded within it, using its assigned field number 5.
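As a rough illustration of what embedding means on the wire, the sketch below hand-encodes an Address and places it inside a UserProfile as a length-delimited field. It uses plain Python rather than generated code, the helper names are made up for this example, and the sample values are invented.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


WIRE_LEN = 2  # wire type for strings, bytes, and embedded messages


def encode_len_field(field_number: int, payload: bytes) -> bytes:
    """Write tag, payload length, then the payload itself."""
    tag = encode_varint((field_number << 3) | WIRE_LEN)
    return tag + encode_varint(len(payload)) + payload


def encode_string(field_number: int, text: str) -> bytes:
    return encode_len_field(field_number, text.encode("utf-8"))


# Address { street = 1, city = 2, zip_code = 3 }
address = (
    encode_string(1, "1 Main St")
    + encode_string(2, "Springfield")
    + encode_string(3, "12345")
)

# UserProfile { name = 2, shipping_address = 5 }
profile = encode_string(2, "Ada") + encode_len_field(5, address)

print(len(address), len(profile))  # 31 38
```

The nested message is just another byte string from the parent's point of view: a tag for field 5, a length, and then the already-encoded Address.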
The Address message itself has three string fields: street, city, and zip_code, each with its own unique field number.
The problem protobuf solves is efficient data exchange between different services, languages, and platforms. Unlike JSON, which is human-readable but verbose and requires parsing text, protobufs are binary. The compiler generates efficient code for serializing (packing data into bytes) and deserializing (unpacking bytes back into objects) in various programming languages. This leads to smaller payloads, faster parsing, and reduced CPU usage.
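A small size comparison can make the payload claim tangible. The sketch below hand-encodes three UserProfile fields following the protobuf wire format and compares the result with compact JSON for the same data. It is an illustration with invented sample values, not a benchmark, and the helper names are hypothetical.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def varint_field(number: int, value: int) -> bytes:
    # Wire type 0: varint
    return encode_varint((number << 3) | 0) + encode_varint(value)


def string_field(number: int, text: str) -> bytes:
    # Wire type 2: length-delimited
    raw = text.encode("utf-8")
    return encode_varint((number << 3) | 2) + encode_varint(len(raw)) + raw


binary = (
    varint_field(1, 42)
    + string_field(2, "Ada")
    + string_field(3, "ada@example.com")
)

text = json.dumps(
    {"user_id": 42, "name": "Ada", "email": "ada@example.com"},
    separators=(",", ":"),  # most compact JSON form
).encode("utf-8")

print(len(binary), len(text))  # 24 53
```

Most of the difference comes from JSON repeating the field names and punctuation in every message; the actual ratio depends heavily on the data.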
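The main levers you control are the field types and field numbers. The scalar type you choose affects size and precision. int32 and int64 are both variable-length on the wire, so a small non-negative value costs the same in either, but a negative value always takes ten bytes; sint32 and sint64 are the better choice for fields that are often negative. A float takes four bytes and a double eight. Use bytes for raw binary data: a string field must hold valid UTF-8, so binary data stored in one would first need a text encoding such as base64, which inflates it. For a fixed set of values, an enum is preferable to a string or a bare integer because it is type-safe, self-documenting, and encoded as a small integer.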
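The effect of the integer type on encoded size can be checked with a short sketch. The functions below mimic, in plain Python, how int64 and sint64 values are turned into varints according to the wire format; they are illustrative helpers, not library calls.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_int64(value: int) -> bytes:
    """int64: negatives are written as 64-bit two's complement."""
    return encode_varint(value & MASK_64)


def encode_sint64(value: int) -> bytes:
    """sint64: ZigZag maps small negatives to small unsigned numbers."""
    zigzag = ((value << 1) ^ (value >> 63)) & MASK_64
    return encode_varint(zigzag)


for n in (1, 300, -1):
    print(n, len(encode_int64(n)), len(encode_sint64(n)))
# 1 1 1
# 300 2 2
# -1 10 1
```

For non-negative values the two types are nearly equivalent (sint64 spends one bit on the sign), so the choice only matters when negatives are common.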
The field numbers are critical for evolution. You can add new fields with new numbers, and old code will simply ignore them. You can also retire fields by removing them from the .proto file, but you should never reuse a field number that was previously used: old data or old clients would then have their bytes interpreted under the new field’s meaning. Listing the retired number (and name) in a reserved statement makes the compiler reject any later reuse. If you need to change a field’s type, you must be extremely careful, and it’s often safer to treat it as a new field with a new number and retire the old one.
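The "old code ignores new fields" behaviour follows directly from the wire format: every record announces its wire type, so a reader can skip values it has no name for. The sketch below is a simplified hand-written parser in plain Python, not the real library; parse_known and the field 6 "tier" are hypothetical and exist only for this example.

```python
def decode_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Read one varint starting at pos; return (value, new position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_known(data: bytes, known: dict[int, str]) -> dict[str, object]:
    """Decode top-level fields, keeping only field numbers listed in known."""
    fields: dict[str, object] = {}
    pos = 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint
            value, pos = decode_varint(data, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited
            length, pos = decode_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if number in known:   # unknown numbers are skipped, not errors
            fields[known[number]] = value
    return fields


# Written by a newer schema: user_id = 1, name = 2, plus a new field 6.
message = b"\x08\x2a" + b"\x12\x03Ada" + b"\x32\x04gold"

old_reader = parse_known(message, {1: "user_id", 2: "name"})
new_reader = parse_known(message, {1: "user_id", 2: "name", 6: "tier"})
print(old_reader)  # {'user_id': 42, 'name': b'Ada'}
print(new_reader["tier"])  # b'gold'
```

Real protobuf runtimes generally go one step further and retain unknown fields so they survive a parse-and-reserialize round trip; this sketch simply drops them.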
The way repeated fields are encoded depends on the element type. Strings, bytes, and nested messages are written as a series of records, one per element, each carrying the field’s tag followed by the length-prefixed value. Repeated numeric scalars (integers, floats, bools, and enums) are packed by default in proto3: the tag and a total byte length are written once, followed by the elements back to back. Packing saves one tag per element, which adds up for large lists.
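The two layouts used for repeated fields can be compared side by side. The sketch below hand-encodes the tags field from the schema above (strings, one record per element) and a hypothetical repeated int32 field numbered 6, which is not part of the original schema, in both packed and unpacked form. Helper names are illustrative.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    return encode_varint((field_number << 3) | wire_type)


def repeated_strings(field_number: int, items: list[str]) -> bytes:
    """One tag + length + bytes record per element (never packed)."""
    out = b""
    for item in items:
        raw = item.encode("utf-8")
        out += tag(field_number, 2) + encode_varint(len(raw)) + raw
    return out


def unpacked_ints(field_number: int, items: list[int]) -> bytes:
    """One tag + varint record per element."""
    return b"".join(tag(field_number, 0) + encode_varint(i) for i in items)


def packed_ints(field_number: int, items: list[int]) -> bytes:
    """One tag and one length, then all varints back to back."""
    body = b"".join(encode_varint(i) for i in items)
    return tag(field_number, 2) + encode_varint(len(body)) + body


print(repeated_strings(4, ["a", "b"]).hex())  # 220161220162
print(unpacked_ints(6, [1, 2, 3]).hex())      # 300130023003
print(packed_ints(6, [1, 2, 3]).hex())        # 3203010203
```

With three elements the saving is a single byte, but the unpacked form grows by one tag per element while the packed form pays for the tag and length only once.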
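When you compile your .proto file using the protoc compiler (e.g., protoc --go_out=. --go_opt=paths=source_relative user_profile.proto), it generates source code for your chosen language (Go in this example, which requires the protoc-gen-go plugin to be installed and a go_package option in the .proto file). The generated code provides a struct for each message, with typed fields and getter methods. In Go, serialization goes through the functions proto.Marshal() and proto.Unmarshal() in the google.golang.org/protobuf/proto package; other languages expose the same operations as methods on the generated classes, such as SerializeToString() and ParseFromString() in Python.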
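The most surprising thing about protobuf evolution is how much of a schema you can change without invalidating data that has already been written. You can rename a field, remove it, or change its type to a wire-compatible one (for example int32 to int64), and newer code will still read existing binary data, provided the field numbers remain constant and retired numbers are never reused. Data for a removed field is simply treated as an unknown field. This applies to the binary format; the JSON mapping uses field names, so a rename is a breaking change there. This "schema evolution" capability is a core strength that makes protobufs well suited to long-lived systems and APIs.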
The next concept you’ll likely encounter is how to manage multiple .proto files and use imports to create larger, more organized schema definitions.