The most surprising thing about Protocol Buffers is that they’re more than a serialization format: they are also a schema definition language, and the schema drives everything else.

Let’s see it in action. Imagine we’re building a simple contact management system. First, we define our data structure in a .proto file.

syntax = "proto3";

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    string number = 1;
    PhoneType type = 2;
  }

  repeated PhoneNumber phones = 4;
}

This Person message describes a contact with a name, ID, email, and a list of phone numbers, each having a number and a type. The syntax = "proto3"; line selects version 3 of the Protobuf language. The = 1, = 2, etc., are field numbers, not default values and not an ordering. These numbers are crucial because Protobuf identifies fields by number, not by name, in its binary encoding. Once a message is in use, its field numbers should not be changed.
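To make the role of field numbers concrete, here is a minimal sketch of how a field's tag is formed on the wire: the field number shifted left by three bits, combined with a wire type. The helper `field_tag` is a hypothetical name used only for illustration and is not part of the Protobuf library. The wire type values 0 and 2 come from the Protobuf encoding specification.

```python
# Wire types defined by the Protobuf encoding spec (subset).
WIRE_VARINT = 0            # int32, bool, enum, ...
WIRE_LENGTH_DELIMITED = 2  # string, bytes, nested messages


def field_tag(field_number: int, wire_type: int) -> int:
    """Return the tag value for a field (hypothetical helper)."""
    return (field_number << 3) | wire_type


# Tags for the fields of Person. The field *name* never appears on the wire.
tags = {
    "name": field_tag(1, WIRE_LENGTH_DELIMITED),
    "id": field_tag(2, WIRE_VARINT),
    "email": field_tag(3, WIRE_LENGTH_DELIMITED),
    "phones": field_tag(4, WIRE_LENGTH_DELIMITED),
}

for name, tag in tags.items():
    print(f"{name}: 0x{tag:02x}")
```

The tag is itself stored as a varint, so field numbers 1 through 15 fit in a single byte. That is why small numbers are usually reserved for the most frequently set fields.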

Now, we need to compile this schema into code for our programming language. Let’s use Python. Assuming you have the Protobuf compiler (protoc) installed, you’d run:

protoc --python_out=. person.proto

This command generates a person_pb2.py file. This file contains Python classes corresponding to our message definitions.

Here’s how you’d create a Person object and serialize it in Python:

import person_pb2

# Create a new Person object
person = person_pb2.Person()
person.name = "Alice"
person.id = 1234
person.email = "alice@example.com"

# Add phone numbers
phone_number = person.phones.add()
phone_number.number = "555-4321"
phone_number.type = person_pb2.Person.MOBILE

phone_number = person.phones.add()
phone_number.number = "555-8765"
phone_number.type = person_pb2.Person.HOME

# Serialize the Person object (in Python 3 this returns a bytes object)
serialized_person = person.SerializeToString()

print(f"Serialized data (bytes): {serialized_person}")

The SerializeToString() method takes our Python object and converts it into a compact binary representation, returned as a bytes object despite the method's name. This is the "serialization" part.
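As a rough illustration of what that binary representation contains, the sketch below hand-encodes the name and id fields following the wire format: a tag, then either a varint or a length-prefixed payload. The helpers `encode_varint`, `encode_int_field`, and `encode_string_field` are hypothetical and written only for this example; real code should rely on the generated classes. The sketch handles non-negative integers only.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Tag with wire type 0, followed by the varint value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Tag with wire type 2, then the byte length, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    tag = encode_varint((field_number << 3) | 2)
    return tag + encode_varint(len(payload)) + payload


name_bytes = encode_string_field(1, "Alice")
id_bytes = encode_int_field(2, 1234)

print(name_bytes.hex(" "))  # 0a 05 41 6c 69 63 65
print(id_bytes.hex(" "))    # 10 d2 09
```

The string costs two bytes of overhead and the integer 1234 takes two bytes plus a one-byte tag. Serializers typically write fields in field-number order, so these bytes would usually appear at the start of serialized_person, though the format does not guarantee any particular order.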

To deserialize, you read the binary data and parse it back into a Python object:

# Create a new Person object to hold the parsed data
new_person = person_pb2.Person()

# Parse the serialized bytes
new_person.ParseFromString(serialized_person)

# Access the data
print(f"Name: {new_person.name}")
print(f"ID: {new_person.id}")
print(f"Email: {new_person.email}")
for phone in new_person.phones:
    # phone.type is an int; person_pb2.Person.PhoneType.Name(phone.type) gives its label
    print(f"Phone: {phone.number} (Type: {phone.type})")

The beauty here is that Protobuf handles the encoding and decoding for you, using those field numbers to map binary data back to the correct fields. As long as the numbers and types stay the same, reordering or renaming fields in the schema does not affect data that has already been serialized. This is what makes Protobuf robust against schema evolution.
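The mapping from numbers back to fields can be sketched with a tiny hand-written decoder. This is not how the generated code is implemented; `read_varint` and `decode_fields` are hypothetical helpers that handle only wire types 0 and 2, and field 9 is an invented field standing in for something added by a newer schema.

```python
def read_varint(data: bytes, pos: int) -> tuple:
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_fields(data: bytes, known: dict) -> dict:
    """Map field numbers to names, ignoring numbers the reader does not know."""
    fields = {}
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known:
            fields[known[field_number]] = value
    return fields


# name = "Alice" (field 1), id = 1234 (field 2), plus a hypothetical
# field 9 that an older reader has never heard of.
payload = b"\x0a\x05Alice" + b"\x10\xd2\x09" + b"\x4a\x02hi"
old_reader = decode_fields(payload, {1: "name", 2: "id"})
print(old_reader)  # {'name': b'Alice', 'id': 1234}
```

Because every record carries its own tag and length information, the reader can step over field 9 without knowing what it means. The real library goes further and preserves unknown fields rather than dropping them, which this sketch does not attempt.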

The core problem Protobuf solves is efficient and language-neutral data serialization. Unlike JSON or XML, which are text-based and repeat field names in every record, Protobuf produces a compact binary format that identifies fields by number. This is especially beneficial for network communication where bandwidth is a concern, or for storing large amounts of data. It’s also typically faster to serialize and deserialize, because there is no text to tokenize and parse.

The schema definition (.proto file) acts as a contract. Anyone can take this .proto file, compile it for their language, and be able to read and write data conforming to that schema. This interoperability is a massive advantage in distributed systems.

When you define a repeated field, the encoding depends on the element type. For scalar numeric types (like int32 or bool), proto3 uses packed encoding by default: all the values are concatenated into a single length-delimited record, so the field's tag is written only once. Strings, bytes, and message types are never packed. For a repeated message field like phones in our example, each element is written as its own record: the field's tag, a length, and then the encoded PhoneNumber. Packing saves one tag per element for numeric lists, which is one of several reasons Protobuf output stays small.
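The difference between packed and unpacked encoding of a repeated numeric field can be sketched by hand. The example assumes a hypothetical field declared as repeated int32 scores = 5, which is not part of the Person schema above. The helper names are invented for illustration and only non-negative values are handled.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # continuation bit set
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_packed(field_number: int, values: list) -> bytes:
    """One tag (wire type 2), one length, then all values back to back."""
    payload = b"".join(encode_varint(v) for v in values)
    tag = encode_varint((field_number << 3) | 2)
    return tag + encode_varint(len(payload)) + payload


def encode_unpacked(field_number: int, values: list) -> bytes:
    """A separate tag (wire type 0) in front of every value."""
    tag = encode_varint((field_number << 3) | 0)
    return b"".join(tag + encode_varint(v) for v in values)


scores = [1, 2, 300]  # hypothetical: repeated int32 scores = 5;
packed = encode_packed(5, scores)
unpacked = encode_unpacked(5, scores)

print(packed.hex(" "))    # 2a 04 01 02 ac 02
print(unpacked.hex(" "))  # 28 01 28 02 28 ac 02
```

With three elements the saving is a single byte, but the packed form pays for the tag and length once while the unpacked form pays for a tag on every element, so the gap grows with the length of the list.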

The next concept you’ll likely encounter is managing schema evolution, specifically how to add new fields or deprecate old ones without breaking existing systems.
