Protobuf scalar types are more interchangeable on the wire than their names suggest: several of them share an encoding, so the type declared in a schema can mask how the data is actually represented.

Let’s see what this looks like in practice. Imagine we have a simple Protobuf definition:

syntax = "proto3";

message Person {
  string name = 1;
  int32 age = 2;
  bool is_active = 3;
}

Now, let’s serialize and deserialize this using Python:

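# Assumes the definition above was saved as person.proto and compiled with
# `protoc --python_out=. person.proto`, which generates person_pb2.py.
from person_pb2 import Person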

# Create a Person message
person = Person()
person.name = "Alice"
person.age = 30
person.is_active = True

# Serialize to bytes
serialized_data = person.SerializeToString()
print(f"Serialized: {serialized_data}")

# Deserialize from bytes
new_person = Person()
new_person.ParseFromString(serialized_data)

print(f"Deserialized Name: {new_person.name}")
print(f"Deserialized Age: {new_person.age}")
print(f"Deserialized Is Active: {new_person.is_active}")

Running this will produce output similar to:

Serialized: b'\n\x05Alice\x10\x1e\x18\x01'
Deserialized Name: Alice
Deserialized Age: 30
Deserialized Is Active: True

The \x10\x1e might look like gibberish, but it’s a one-byte field tag followed by Protobuf’s compact binary encoding of the integer 30. \n\x05Alice is the tag, length prefix, and bytes of the string "Alice", and \x18\x01 is the tag and value for the boolean True.

The core problem Protobuf solves is efficient, language-agnostic serialization. Before Protobuf, you’d often deal with verbose formats like XML or JSON, or custom binary protocols that were hard to maintain across different languages. Protobuf provides a schema-driven approach: you define your data structure once in a .proto file, and Protobuf tools generate code for various languages that can serialize and deserialize that structure.

Internally, Protobuf serializes messages using a compact binary wire format. For scalar types, it’s quite clever:

  • Varints: Integers (int32, int64, uint32, uint64, sint32, sint64) and bool are encoded using a variable-length encoding called Varints. Each byte carries seven bits of the value, and its most significant bit signals whether more bytes follow, so smaller numbers use fewer bytes. For example, 0 is one byte, 150 is two bytes (\x96\x01), and 300,000 is three bytes (\xe0\xa7\x12). bool is also encoded as a Varint: 0 for False and 1 for True. sint32 and sint64 apply ZigZag encoding first so that small negative numbers stay small.
  • Fixed-width: Floating-point numbers (float, double) and fixed32, fixed64, sfixed32, sfixed64 use fixed-width encodings (4 or 8 bytes), ensuring consistent size regardless of the value.
  • Length-delimited: Strings and bytes are encoded as a Varint representing their length, followed by the raw bytes of the string or byte sequence.
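The three encoding families above can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation; the helper names encode_varint and encode_length_delimited are made up for this example, and the Varint helper only handles non-negative integers.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 Varint (hypothetical helper)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest seven bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # set the MSB: more bytes follow
        else:
            out.append(low7)         # MSB clear: this is the last byte
            return bytes(out)


def encode_length_delimited(payload: bytes) -> bytes:
    """Prefix a payload with its length as a Varint (hypothetical helper)."""
    return encode_varint(len(payload)) + payload


print(encode_varint(0))                                   # b'\x00'
print(encode_varint(30))                                  # b'\x1e'
print(encode_varint(150))                                 # b'\x96\x01'
print(struct.pack("<f", 1.0).hex())                       # 0000803f (4 bytes, little-endian)
print(len(struct.pack("<d", 1.0)))                        # 8
print(encode_length_delimited("Alice".encode("utf-8")))   # b'\x05Alice'
```

Fixed-width values are shown with struct.pack because they are plain little-endian IEEE 754 or integer bytes with no Protobuf-specific transformation.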

The age = 2 field in our example, with the value 30, uses wire type 0 (Varint). Every field on the wire starts with a tag, itself a Varint, that packs the field number and the wire type together as (field_number << 3) | wire_type. For field number 2 with wire type 0, that is (2 << 3) | 0 = 16, which is 0x10 in hex. The value 30 is 00011110 in binary; it fits in seven bits, so its Varint is the single byte 0x1e, with the most significant bit (the "more bytes follow" flag) left clear. Together, the field is encoded as \x10\x1e.
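The tag arithmetic is easy to check directly. The sketch below uses two illustrative helpers, make_tag and split_tag (names chosen for this example, not part of the Protobuf API).

```python
WIRE_VARINT = 0  # int32, int64, uint32, uint64, sint32, sint64, bool
WIRE_LEN = 2     # string, bytes, nested messages


def make_tag(field_number: int, wire_type: int) -> int:
    """Pack a field number and wire type into a tag value (hypothetical helper)."""
    return (field_number << 3) | wire_type


def split_tag(tag: int):
    """Recover (field_number, wire_type) from a tag value (hypothetical helper)."""
    return tag >> 3, tag & 0x07


print(hex(make_tag(2, WIRE_VARINT)))  # 0x10
print(hex(make_tag(3, WIRE_VARINT)))  # 0x18
print(hex(make_tag(1, WIRE_LEN)))     # 0xa
print(split_tag(0x10))                # (2, 0)
```

Because the tag is itself a Varint, field numbers from 1 to 15 fit in a single tag byte; field number 16 already yields a tag value of 128, which needs two bytes.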

The is_active = 3 field, with value True (which is 1 for Varint encoding), has field number 3 and wire type 0, so its tag is (3 << 3) | 0 = 24, which is 0x18 in hex. So, True (encoded as 0x01) becomes \x18\x01.

The name = 1 field, with value "Alice", is a string. Strings have wire type 2 (length-delimited). The field number is 1, so the tag is (1 << 3) | 2 = 8 | 2 = 10, which is 0x0a in hex (printed by Python as \n). The string "Alice" is 5 bytes long in UTF-8. So, the encoding is \x0a followed by the length 5 (encoded as a Varint, which is just \x05), followed by the bytes of "Alice". This results in \x0a\x05Alice.
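As a cross-check, the whole message can be assembled by hand without the Protobuf library. This is a sketch under the assumption that fields are written in field-number order and that all values are non-negative; the helper names are invented for the example.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 Varint for non-negative integers (hypothetical helper)."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Tag with wire type 0, then the value as a Varint."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Tag with wire type 2, then the byte length, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


encoded = (
    encode_string_field(1, "Alice")       # name = 1
    + encode_varint_field(2, 30)          # age = 2
    + encode_varint_field(3, int(True))   # is_active = 3
)
print(encoded)  # b'\n\x05Alice\x10\x1e\x18\x01'
```

The result matches the bytes printed by SerializeToString() earlier.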

Putting it all together, the serialized data b'\n\x05Alice\x10\x1e\x18\x01' is actually:

  • \n (0x0a): Tag for field number 1, wire type 2 (length-delimited)
  • \x05 (0x05): Length of the string "Alice"
  • Alice: The string data
  • \x10 (0x10): Tag for field number 2, wire type 0 (Varint)
  • \x1e (0x1e): The integer 30, encoded as a Varint
  • \x18 (0x18): Tag for field number 3, wire type 0 (Varint)
  • \x01 (0x01): The boolean True, encoded as a Varint
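The same breakdown can be produced mechanically by walking the bytes. The following is a minimal schema-less decoder sketch that handles only wire types 0 and 2, which is all this message uses; decode_varint and walk_fields are illustrative names, not library functions.

```python
def decode_varint(buf: bytes, pos: int):
    """Read one Varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low seven bits carry the payload
        if not byte & 0x80:               # MSB clear: last byte of this Varint
            return result, pos
        shift += 7


def walk_fields(buf: bytes):
    """Return a list of (field_number, wire_type, raw_value) tuples."""
    pos = 0
    fields = []
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:                # Varint
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:              # length-delimited
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((field_number, wire_type, value))
    return fields


fields = walk_fields(b"\n\x05Alice\x10\x1e\x18\x01")
for field in fields:
    print(field)
# (1, 2, b'Alice')
# (2, 0, 30)
# (3, 0, 1)
```

Note that the decoder recovers 1 rather than True for field 3: without the schema, the wire format cannot tell a bool from any other Varint.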

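One subtle point is how signedness interacts with Varints. int32 and uint32 share wire type 0, and non-negative values up to 2^31 - 1 are encoded identically in both. They diverge once the top bit of the 32-bit value is set: a negative int32 is sign-extended to 64 bits and always occupies ten bytes on the wire, whereas a uint32 with the top bit set needs only five. The wire carries no signedness information, so the same bytes can come out as a different number if the field’s declared type differs between writer and reader. For fields that are frequently negative, sint32 and sint64 are more compact because they apply ZigZag encoding before the Varint step.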
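To make the effect of signedness concrete, here is a small sketch comparing the encoded size of the same 32-bit pattern under int32, uint32, and sint32 rules. The encode_* helpers are hypothetical names written for this example; they mimic the wire rules (sign extension to 64 bits for int32, ZigZag for sint32) rather than calling the Protobuf library.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 Varint for non-negative integers (hypothetical helper)."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_int32(value: int) -> bytes:
    # Negative values are sign-extended to 64 bits before Varint encoding.
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


def encode_uint32(value: int) -> bytes:
    return encode_varint(value & 0xFFFFFFFF)


def encode_sint32(value: int) -> bytes:
    # ZigZag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ...
    zigzag = ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF
    return encode_varint(zigzag)


print(encode_int32(30) == encode_uint32(30))  # True
print(len(encode_int32(-1)))                  # 10
print(len(encode_uint32(0xFFFFFFFF)))         # 5
print(encode_sint32(-1))                      # b'\x01'
```

The first line shows why the two types look interchangeable for everyday values; the next two show where they stop being so.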

The next step in understanding serialization is exploring how Protobuf handles repeated fields and nested messages.
