Apache Avro is an open source project providing a data serialization framework and data exchange services, often used with Apache Kafka and Apache Hadoop to facilitate the exchange of big data between applications. It is also used for efficient storage in Apache Hive or Oracle NoSQL Database, and as a data source in Apache Spark or Apache NiFi.
It is a language-independent, row-oriented object container storage format. Avro uses JSON to define schemas and data types, allowing for convenient schema evolution. The storage is compact and efficient, and both the data and its definition are stored together in one message or file, so a serialized item can be read without the schema being known ahead of time.
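Part of what makes the storage compact is how Avro encodes primitive values: per the Avro specification, int and long values are written as variable-length zigzag-encoded integers, so small magnitudes (positive or negative) take only one byte. A minimal Python sketch of that encoding:

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one (Avro int/long encoding)."""
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    """Encode a long in Avro's variable-length zigzag format."""
    z = zigzag(n)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)
```

Values near zero, such as 1 or -1, encode to a single byte; larger magnitudes grow by one byte per seven bits of payload.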
An Avro container file consists of a header and one or more data blocks. The header contains the file metadata, including the schema definition for the blocks that follow. Avro defines its own standard for expressing schemas, written in JSON.
Avro schemas provide robustness in streaming architectures such as Kafka, where producers and an unknown number of consumers evolve on different timelines. Because the schema metadata supports controlled evolution and is self-documenting, the data remains readable as it changes, making the architecture more future-proof.
Producers and consumers are further decoupled by defining constraints on how schemas are allowed to evolve over time. These evolution rules can be enforced by a schema registry, as provided by Confluent, Hortonworks, or NiFi. With a schema registry, you can also improve memory and network efficiency by sending a reference ID for the schema in the registry instead of repeating the full schema with each message.
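Confluent's serializers, for example, prefix each message with a small frame: one magic byte (0) followed by the schema's 4-byte big-endian registry ID, then the Avro-encoded payload. A stdlib-only sketch of that framing (the helper names are illustrative, not a registry client API):

```python
import struct

MAGIC_BYTE = 0  # Confluent wire format: 1 magic byte, then a 4-byte schema ID

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro-encoded payload with its registry schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes) -> tuple[int, bytes]:
    """Split a framed message back into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a registry-framed message")
    return schema_id, message[5:]
```

Five bytes of overhead per message replace the full schema text, which is what makes the registry approach cheaper on the wire.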