Protobuf’s JSON encoding can be surprisingly lossy, not because of incompatible types, but because of how it handles field presence and default values.
Let’s see it in action. Imagine we have a simple Protobuf definition:
syntax = "proto3";

message User {
  string name = 1;
  int32 age = 2;
  bool is_active = 3;
}
We’ll create a User message and then serialize it to JSON using Python’s protobuf library.
# user_pb2 is assumed to be the module protoc generates from the definition
# above (for example: protoc --python_out=. user.proto).
from google.protobuf.json_format import MessageToJson, Parse
from user_pb2 import User

# Create a user message with non-default values
user_message = User()
user_message.name = "Alice"
user_message.age = 30
user_message.is_active = True

# Serialize to JSON
json_output = MessageToJson(user_message)
print(f"Serialized JSON: {json_output}")

# Now, let's create a message whose fields are explicitly set to their defaults
user_default = User()
user_default.name = ""  # Default string
user_default.age = 0  # Default int32
user_default.is_active = False  # Default bool
json_default_output = MessageToJson(user_default)
print(f"Serialized JSON (defaults): {json_default_output}")

# A message that was never touched serializes to exactly the same JSON
user_untouched = User()
assert MessageToJson(user_untouched) == json_default_output

# Parse the JSON string back; the omitted fields come back as implicit defaults
parsed_user = User()
Parse(json_default_output, parsed_user)
print(f"Parsed user name: '{parsed_user.name}', age: {parsed_user.age}, is_active: {parsed_user.is_active}")
When you run this, you’ll see:
Serialized JSON: {
  "name": "Alice",
  "age": 30,
  "isActive": true
}
Serialized JSON (defaults): {}
Parsed user name: '', age: 0, is_active: False
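Notice that the JSON for the message with default values is completely empty: {}. This is the core of the surprise. Scalar fields in proto3 have implicit presence, meaning the runtime does not track whether they were set. It cannot tell a field that was set to its default value (0 for numbers, empty string for strings, false for booleans, and so on) from one that was never set. MessageToJson therefore omits such fields unless you explicitly ask for them (always_print_fields_with_no_presence=True in recent releases of the Python library, including_default_value_fields=True in older ones).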
The goal of Protobuf JSON encoding is to provide a human-readable and interoperable format, and omitting fields that hold their default values keeps the payload small. When you parse this JSON back into a Protobuf message, the Protobuf runtime populates the omitted fields with their respective default values. This saves bandwidth and aids readability, but it means that the JSON itself doesn’t tell you whether a field was explicitly set to its default or was never set and therefore took on its default.
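The round trip described above can be sketched without a protoc step. This is a minimal sketch, assuming the protobuf Python package is installed. It uses google.protobuf.Type's Field message, a proto3 message bundled with the library, as a stand-in for User: its name, number, and packed fields play the roles of name, age, and is_active. Because the keyword that forces default values into the output was renamed between releases, the sketch checks which one the installed version accepts.

```python
import inspect

from google.protobuf import type_pb2
from google.protobuf.json_format import MessageToDict, ParseDict

# Stand-in for User: name/number/packed are a string, an int32, and a bool.
untouched = type_pb2.Field()
explicit = type_pb2.Field(name="", number=0, packed=False)

# Both messages serialize to the same empty object.
assert MessageToDict(untouched) == {}
assert MessageToDict(explicit) == {}

# Parsing the empty object restores the defaults, not the history of the fields.
restored = ParseDict({}, type_pb2.Field())
assert (restored.name, restored.number, restored.packed) == ("", 0, False)

# The opt-in keyword differs between releases, so use whichever one exists.
params = inspect.signature(MessageToDict).parameters
if "always_print_fields_with_no_presence" in params:
    print_defaults = {"always_print_fields_with_no_presence": True}
else:
    print_defaults = {"including_default_value_fields": True}

verbose = MessageToDict(explicit, **print_defaults)
print(verbose["name"], verbose["number"], verbose["packed"])
```

Even with the opt-in, the output only says what the values are. It still cannot say whether they were set explicitly.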
The MessageToJson function also has a preserving_proto_field_name option, which keeps the field names from the .proto file (is_active) instead of converting them to lowerCamelCase (isActive), but it doesn’t change the default value omission behavior. For more fine-grained control over what gets serialized, you might need custom serialization logic or google.protobuf.Struct, which stores explicit key-value pairs.
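To see what preserving_proto_field_name does and does not change, here is a small sketch. It assumes the protobuf Python package is installed and uses the bundled google.protobuf.Method message purely as a stand-in, because it has a snake_case bool field (request_streaming) comparable to is_active.

```python
from google.protobuf import api_pb2
from google.protobuf.json_format import MessageToDict

method = api_pb2.Method(name="GetUser", request_streaming=True)

# Default: field names are converted to lowerCamelCase.
camel = MessageToDict(method)
assert camel == {"name": "GetUser", "requestStreaming": True}

# With the option: names appear exactly as written in the .proto file.
snake = MessageToDict(method, preserving_proto_field_name=True)
assert snake == {"name": "GetUser", "request_streaming": True}

# The option only renames keys; default-valued fields are still omitted.
empty = MessageToDict(
    api_pb2.Method(request_streaming=False),
    preserving_proto_field_name=True,
)
assert empty == {}
```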
The exact mapping from Protobuf types to JSON types is also something to be aware of:
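- 32-bit integers, floats, and doubles map to JSON numbers; 64-bit integers (int64, uint64, fixed64, and so on) map to JSON strings so that they survive parsers limited to 53 bits of integer precision.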
- Strings map to JSON strings.
- Booleans map to JSON booleans (true/false).
- Enums map to their string names by default, but can be configured to map to their integer values (use_integers_for_enums=True in Python).
- Nested messages map to JSON objects.
- Repeated fields map to JSON arrays.
- Maps map to JSON objects; keys of every type are converted to strings.
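Several of these mappings can be checked directly. The sketch below assumes the protobuf Python package is installed and borrows messages that ship with it instead of a custom schema: google.protobuf.Type and Field for the enum, nested-message, and repeated cases, and descriptor.proto's UninterpretedOption for an int64 field.

```python
from google.protobuf import descriptor_pb2, type_pb2
from google.protobuf.json_format import MessageToDict

age_field = type_pb2.Field(name="age", number=2, kind=type_pb2.Field.TYPE_INT32)
user_type = type_pb2.Type(name="User", fields=[age_field], oneofs=["contact"])

# Nested messages become objects, repeated fields become arrays,
# int32 becomes a number, and the enum appears by name.
as_names = MessageToDict(user_type)
assert as_names == {
    "name": "User",
    "fields": [{"name": "age", "number": 2, "kind": "TYPE_INT32"}],
    "oneofs": ["contact"],
}

# Enums can be switched to their integer values.
as_ints = MessageToDict(user_type, use_integers_for_enums=True)
assert as_ints["fields"][0]["kind"] == 5

# A 64-bit integer is emitted as a decimal string, not a JSON number.
big = descriptor_pb2.UninterpretedOption(negative_int_value=-(2**53) - 1)
assert MessageToDict(big) == {"negativeIntValue": "-9007199254740993"}
```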
The most surprising aspect for many is how google.protobuf.Struct interacts with this. You can construct Struct messages in Protobuf, which are essentially arbitrary key-value maps. When you serialize a Struct to JSON, it behaves much more like a standard JSON object, including serializing keys with empty or default-like values, because Struct itself doesn’t have the same concept of "default" as a generated Protobuf message type.
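A short sketch of that difference, assuming the protobuf Python package is installed. The keys mirror the User fields and are chosen only for illustration. Note that Struct stores every number as a double, so it trades type information for explicit keys.

```python
from google.protobuf.json_format import MessageToDict
from google.protobuf.struct_pb2 import Struct

# Every key placed in a Struct is an explicit entry, whatever its value.
explicit = Struct()
explicit.update({"name": "", "age": 0, "is_active": False})

never_set = Struct()

explicit_json = MessageToDict(explicit)

# The "default-like" values survive serialization, under their original keys.
assert set(explicit_json) == {"name", "age", "is_active"}
assert explicit_json["name"] == ""
assert explicit_json["age"] == 0  # stored as a double inside the Struct
assert explicit_json["is_active"] is False

# An empty Struct is distinguishable from one holding default-like values.
assert MessageToDict(never_set) == {}
assert "age" in explicit.fields
assert "age" not in never_set.fields
```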
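When you need to distinguish between a field that was explicitly set to false or 0 and a field that was never set, Protobuf’s JSON encoding can be problematic. Declaring the field with proto3’s optional keyword, or using a wrapper type such as google.protobuf.BoolValue, gives it explicit presence, so a value that was set to its default is still serialized. Alternatively, you can add a separate boolean field to your Protobuf message to track explicit setting, or rely on a different serialization format if this distinction is critical.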
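Message-typed fields behave differently from scalars here: they track presence, so a sub-message that was explicitly set but left empty survives the JSON round trip. This is the mechanism the wrapper types in google/protobuf/wrappers.proto rely on. The sketch below assumes the protobuf Python package is installed and illustrates the behavior with the bundled google.protobuf.Type message. Its source_context field is only a stand-in for a message-typed field in your own schema.

```python
from google.protobuf import type_pb2
from google.protobuf.json_format import MessageToDict, ParseDict

never_set = type_pb2.Type()

explicitly_empty = type_pb2.Type()
# Mark the sub-message as present without giving it any content.
explicitly_empty.source_context.SetInParent()

# Unlike proto3 scalars, message fields answer HasField().
assert not never_set.HasField("source_context")
assert explicitly_empty.HasField("source_context")

# The distinction is visible in the JSON output...
assert MessageToDict(never_set) == {}
assert MessageToDict(explicitly_empty) == {"sourceContext": {}}

# ...and it is restored when the JSON is parsed back.
restored = ParseDict({"sourceContext": {}}, type_pb2.Type())
assert restored.HasField("source_context")
```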
The next challenge you’ll likely face is understanding how to handle enums, which appear in JSON as their string names by default, especially when you want their integer values instead.