Protobuf
Protocol Buffers (a.k.a. Protobuf) are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.
One common use is to define gRPC service specifications, which are essentially a form of remote procedure call. With gRPC service definitions, you create a “service” that has RPC methods. These RPC methods take a request “message” and return a response “message”.
Similar to Apache Avro, Protobuf is a method of serializing structured data. A message format is defined in a .proto file, and you can generate code from it in many languages including Java, Python, C++, C#, Go and Ruby. Unlike Avro, however, Protobuf does not serialize the schema with the message. To deserialize a message, the consumer therefore needs the schema, either locally or from a Protobuf schema registry.
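To illustrate why the consumer needs the schema, here is a minimal sketch in Python (standard library only) that hand-encodes a hypothetical message with `string name = 1;` and `int32 age = 2;`, then decodes the bytes without any schema. The message, field names and helper functions are assumptions made for this illustration; real applications would use code generated from the .proto file.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_person(name: str, age: int) -> bytes:
    """Hypothetical message: string name = 1; int32 age = 2;"""
    name_bytes = name.encode("utf-8")
    return (
        encode_varint((1 << 3) | 2)        # field 1, wire type 2 (length-delimited)
        + encode_varint(len(name_bytes))
        + name_bytes
        + encode_varint((2 << 3) | 0)      # field 2, wire type 0 (varint)
        + encode_varint(age)
    )


def decode_without_schema(buf: bytes):
    """Return (field_number, wire_type, raw_value) tuples; no names or types."""
    fields, pos = [], 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type not covered by this sketch")
        fields.append((field_number, wire_type, value))
    return fields


payload = encode_person("Ada", 36)
print(payload.hex())
print(decode_without_schema(payload))
```

The decoded result contains only field numbers, wire types and raw values. Nothing in the payload says that field 1 is called name or that its bytes are a string rather than a nested message, which is the information the schema supplies.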
Confluent Platform has supported Protobuf and JSON Schema since version 5.5. Pulsar Schema Registry also allows Protobuf structures.
Hackolade is a visual editor for Protobuf schema for non-programmers. To perform data modeling for Protobuf schema with Hackolade, you must first download the Protobuf plugin.
Hackolade was specially adapted to support the design of Protobuf schemas. The application closely follows the Protobuf terminology.
Protobuf Schema
Protocol buffer style has evolved over time. You may want to read this article about the differences between Proto2 and Proto3. Hackolade supports both.
Protobuf message schemas are defined in a .proto file which can be easily generated from Hackolade.
All files should be ordered in the following manner:
- License header (if applicable)
- File overview
- Syntax
- Packages
- Imports
- File options
- Everything else
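The ordering above can be sketched with a small Python helper that assembles the text of a .proto file section by section. This is only an illustration of the layout, not how Hackolade generates its scripts; the function name, package, message and license text are all hypothetical, and the sketch handles string-valued file options only.

```python
def render_proto_file(overview, package, body, imports=(), options=None,
                      license_header=None, syntax="proto3"):
    """Assemble .proto text in the order: license header, file overview,
    syntax, package, imports, file options, everything else."""
    sections = []
    if license_header:  # optional, hence "if applicable"
        sections.append(
            "\n".join("// " + line for line in license_header.splitlines())
        )
    sections.append("// " + overview)
    sections.append('syntax = "%s";' % syntax)
    sections.append("package %s;" % package)
    if imports:
        sections.append("\n".join('import "%s";' % path for path in imports))
    if options:
        # String-valued options only in this sketch.
        sections.append(
            "\n".join('option %s = "%s";' % (k, v) for k, v in options.items())
        )
    sections.append(body.strip())
    return "\n\n".join(sections) + "\n"


# Hypothetical message definitions ("everything else").
BODY = """
message Book {
  string title = 1;
  google.protobuf.Timestamp published_at = 2;
}
"""

proto_text = render_proto_file(
    overview="Messages for a hypothetical library catalog.",
    package="example.library",
    body=BODY,
    imports=["google/protobuf/timestamp.proto"],
    options={"java_package": "com.example.library"},
    license_header="Copyright (hypothetical) Example Corp.",
)
print(proto_text)
```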
Multiple message types can be defined in a single .proto file. This is useful if you are defining multiple related messages.
Each field is assigned a field number, which must be unique within its message type. These numbers identify the fields when the message is serialized to the Protobuf binary format. Google suggests using numbers 1 through 15 for the most frequently used fields because they take one byte to encode. The application auto-generates the numbers during script generation, so you can influence the numbering simply by reordering the fields.
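The one-byte figure follows from how the binary format encodes a field's tag: the field number is shifted left by three bits, combined with a three-bit wire type, and written as a varint (seven payload bits per byte). The minimal Python sketch below computes tag sizes for a few field numbers; the helper names are chosen for this illustration only.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A field's tag is (field_number << 3) | wire_type, written as a varint."""
    return encode_varint((field_number << 3) | wire_type)


WIRE_TYPE_VARINT = 0

for number in (1, 15, 16, 2047, 2048):
    size = len(encode_tag(number, WIRE_TYPE_VARINT))
    print("field number %d: tag takes %d byte(s)" % (number, size))
```

Field numbers 1 through 15 fit in a single tag byte, 16 through 2047 need two bytes, and larger numbers need more, which is why placing the most frequently used fields first keeps messages slightly smaller.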
Field rules
In Proto2, you could specify that message fields are one of the following:
- required: a well-formed message must have exactly one of this field.
- optional: a well-formed message can have zero or one of this field (but not more than one).
- repeated: this field can be repeated any number of times (including zero) in a well-formed message. The order of the repeated values will be preserved.
You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead.
A second issue with required fields appears when someone adds a value to an enum. In this case, the unrecognized enum value is treated as if it were missing, which also causes the required value check to fail.
Proto3 originally dropped the possibility of declaring optional fields. It was reinstated in release 3.15.0 (https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0), but implemented differently.
In Proto3, message fields can be one of the following:
- singular: a well-formed message can have zero or one of this field (but not more than one). This is the default field rule for proto3 syntax.
- repeated: this field can be repeated any number of times (including zero) in a well-formed message. The order of the repeated values will be preserved.
As a result, null is not supported directly as a data type. It is supported implicitly, through the absence of a field: via optional in proto2 or singular in proto3. You may want to read this article on Protobuf and Null support.
Data Types
Protobuf supports different families of data types: scalar (string, bytes, int, fixed, double, float, bool, and any) and nested (message, enumeration, map).
Fields may also consist of sub-messages that are defined in the same proto file, or any imported proto files.
Forward-Engineering
Hackolade dynamically generates Protobuf schema for the structure created with the application.
The script can also be exported to the file system via the menu Tools > Forward-Engineering, or via the Command-Line Interface.
This structure can be forward-engineered to a file with the .proto extension or copied and pasted into code. It can also be forward-engineered to a Confluent or Pulsar Schema Registry instance.
Reverse-Engineering
Hackolade easily imports the schema from .proto files to represent the corresponding Entity Relationship Diagram and schema structure. You may also import and convert from JSON Schema and documents.
For more information on Protobuf in general, please consult the website and documentation.
Cloud Object Storage
In the context of large-scale distributed systems like data lakes, data is often stored in object storage solutions like Amazon S3, Azure ADLS, or Google Cloud Storage. Protobuf is designed to be a compact and efficient binary format for serializing data.
With Hackolade Studio, you can reverse-engineer Protobuf files located on:
- Amazon S3
- Azure Blob Storage
- Azure Data Lake Storage (ADLS) Gen 1 and Gen 2
- Google Cloud Storage
Schema Registries
A key component of event streaming is to enable broad compatibility between applications connecting to Kafka. In large organizations, trying to ensure data compatibility can be difficult and ultimately ineffective, so schemas should be handled as “contracts” between producers and consumers.
The main benefit of using a Schema Registry is that it provides a centralized way to manage and version Protobuf schemas, which can be critical for maintaining data compatibility and ensuring data quality in a Kafka ecosystem.
Hackolade Studio supports Protobuf schema maintenance in:
- Azure EventHubs Schema Registry
Schemas can be published to the registry via forward-engineering, or reverse-engineered from these schema registries.