Apache Avro

Apache Avro is an open-source data serialization system that is part of the Apache Hadoop ecosystem. It provides:

  1. Rich Data Structures: Supports complex data types including primitives (null, boolean, int, long, float, double, bytes, string) and complex types (record, enum, array, map, union, fixed).
  2. Compact Binary Data Format: Serializes data into a dense, efficient binary format, reducing storage size and network bandwidth usage compared to text-based formats like JSON.
  3. Schema Definition: Uses JSON to define data schemas ('.avsc' files) or an Interface Definition Language (IDL) ('.avdl' files). Schemas are crucial for both serialization and deserialization.
  4. Schema Evolution: Provides robust support for evolving schemas over time (adding fields, removing fields) in a way that allows old code to read new data and new code to read old data.
  5. RPC Framework: Includes support for defining Remote Procedure Call (RPC) protocols.

Avro is widely used in Big Data environments, particularly within the Apache Kafka ecosystem, often in conjunction with a Schema Registry.

Context: Why Avro?

When transmitting data between systems or storing it efficiently, especially in high-volume scenarios like event streaming, choosing the right serialization format is important. Text-based formats like JSON are human-readable but can be verbose and slower to parse. Binary formats offer compactness and performance but require a schema to interpret the data.

Avro emerged as a solution offering:

  • Efficiency: Compact binary representation.
  • Flexibility: Supports complex nested data structures.
  • Robustness: Strong schema definition and evolution capabilities prevent data interpretation errors when schemas change.
  • Dynamic Typing Integration: Unlike some other binary formats, Avro typically packages the schema with the data (or makes it easily accessible via a registry), so a reader can handle the data even without having the exact schema version compiled in beforehand.

How Avro Works

  1. Schema Definition: You define the structure of your data using a JSON schema; a small example is sketched after this list.

  2. Serialization (Writing): When writing data, the Avro library uses the writer's schema to encode the data into a compact binary format. Crucially, the schema itself (or a reference to it, like a schema ID from a registry) is often included or associated with the data.

  3. Deserialization (Reading): When reading data, the Avro library requires both the writer's schema (used to encode the data, often retrieved alongside the data) and the reader's schema (the schema the reading application expects). Avro uses these two schemas to resolve any differences (schema evolution) and correctly decode the binary data into the reader's expected data structure. This schema resolution logic is key to Avro's flexibility.
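
For instance, the schema mentioned in step 1 might be a minimal '.avsc' file along these lines (the record and field names are purely illustrative):

    {
      "type": "record",
      "name": "User",
      "namespace": "example.avro",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["null", "int"], "default": null}
      ]
    }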
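
Steps 2 and 3 can be sketched with the Python 'fastavro' library (one of several Avro implementations; the file name, record, and schemas below are illustrative):

    from fastavro import parse_schema, reader, writer

    # Writer's schema: describes how the records are encoded.
    writer_schema = parse_schema({
        "type": "record",
        "name": "User",
        "namespace": "example.avro",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "favorite_number", "type": ["null", "int"], "default": None},
        ],
    })

    records = [{"name": "Alice", "favorite_number": 7}]

    # Serialization: the writer's schema is embedded in the Avro container file.
    with open("users.avro", "wb") as out:
        writer(out, writer_schema, records)

    # Reader's schema: here it has gained an extra field with a default,
    # which Avro's schema resolution fills in for older records.
    reader_schema = parse_schema({
        "type": "record",
        "name": "User",
        "namespace": "example.avro",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "favorite_number", "type": ["null", "int"], "default": None},
            {"name": "email", "type": ["null", "string"], "default": None},
        ],
    })

    # Deserialization: fastavro reads the embedded writer's schema and resolves
    # it against the supplied reader's schema.
    with open("users.avro", "rb") as fo:
        for record in reader(fo, reader_schema=reader_schema):
            print(record)  # {'name': 'Alice', 'favorite_number': 7, 'email': None}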

Schema Evolution Rules

Avro defines specific rules for how schemas can evolve while maintaining compatibility:

  • Backward Compatibility: New code (using the new schema) can read old data (written with the old schema). Achieved by:
    • Adding fields with default values.
    • Removing fields (readers using the new schema simply ignore that field when it appears in old data).
  • Forward Compatibility: Old code (using the old schema) can read new data (written with the new schema). Achieved by:
    • Adding fields (old readers simply ignore the new fields).
    • Removing fields, but only if the old schema defines a default value for that field (so old readers can fill in a value when it is absent).
  • Full Compatibility: Both backward and forward compatible. Achieved by only adding or removing fields that have default values.
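
As a concrete sketch of these rules, adding an optional field with a default value keeps a schema both backward and forward compatible (names are illustrative):

    {
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": null}
      ]
    }

Old readers simply ignore 'email' in new data, while new readers fill in the default when the field is absent from old data.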

A Schema Registry is highly recommended when using Avro with streaming platforms like Kafka. The registry stores schema versions, assigns each a unique ID, and helps producers and consumers coordinate schema usage and evolution safely.
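
With Kafka, that coordination typically looks something like the following sketch, shown here with the 'confluent-kafka' Python client; the broker address, registry URL, topic, and schema are placeholders:

    from confluent_kafka import Producer
    from confluent_kafka.schema_registry import SchemaRegistryClient
    from confluent_kafka.schema_registry.avro import AvroSerializer
    from confluent_kafka.serialization import MessageField, SerializationContext

    # Placeholder registry and broker addresses.
    registry = SchemaRegistryClient({"url": "http://localhost:8081"})

    user_schema = """
    {
      "type": "record",
      "name": "User",
      "fields": [{"name": "name", "type": "string"}]
    }
    """

    # The serializer registers the schema with the registry if needed and
    # prefixes each encoded message with the assigned schema ID.
    serialize = AvroSerializer(registry, user_schema)

    producer = Producer({"bootstrap.servers": "localhost:9092"})
    producer.produce(
        "users",
        value=serialize({"name": "Alice"}, SerializationContext("users", MessageField.VALUE)),
    )
    producer.flush()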

Key Benefits

  • Compact & Fast: Efficient binary serialization reduces data size and speeds up processing.
  • Rich Data Types: Supports complex, nested structures.
  • Strong Schema Enforcement: Reduces runtime errors due to data format mismatches.
  • Excellent Schema Evolution: Handles changes in data structure gracefully without breaking consumers.
  • Language Interoperability: Libraries available for many programming languages.

Common Use Cases

  • Serializing events in Apache Kafka topics.
  • Storing data efficiently in Hadoop HDFS or data lakes.
  • Defining RPC protocols between services.
  • Anywhere compact, schema-driven data serialization is needed.

Avro and RisingWave

RisingWave provides robust support for consuming data serialized in Avro format, especially from Apache Kafka sources:

  • 'CREATE SOURCE': When defining a Kafka source in RisingWave, you can specify 'FORMAT AVRO'.
  • Schema Registry Integration: RisingWave integrates with Schema Registries (like Confluent Schema Registry) to automatically fetch Avro schemas based on schema IDs embedded in Kafka messages. This eliminates the need to manually define the full schema in RisingWave's SQL DDL and ensures compatibility as schemas evolve.
  • Data Deserialization: RisingWave uses the fetched schema to correctly deserialize the binary Avro messages into rows that can be processed by its SQL engine.

This makes Avro a popular and well-supported choice for getting strongly-typed, efficiently serialized data into RisingWave from Kafka.
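
For illustration, such a source might be declared roughly as follows; the connection details are placeholders, and the exact clauses depend on the RisingWave version, so consult the RisingWave documentation for the authoritative syntax:

    CREATE SOURCE user_events
    WITH (
        connector = 'kafka',
        topic = 'users',
        properties.bootstrap.server = 'broker:9092'
    )
    FORMAT PLAIN ENCODE AVRO (
        schema.registry = 'http://localhost:8081'
    );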

Related Glossary Terms

  • Serialization Format
  • Protobuf (Protocol Buffers) (Alternative)
  • JSON (Alternative)
  • Schema Registry
  • Schema (Streaming)
  • Apache Kafka
  • RisingWave Source