Apache Avro is a data serialization and remote procedure call (RPC) framework developed within the Apache Hadoop project. It provides both a serialization format for persistent data and a wire format for communication between Hadoop nodes and between client programs and Hadoop services.
Avro uses JSON to define data types and protocols, and serializes data into a compact binary format.
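As an illustration of a JSON-defined data type, here is a minimal sketch of a record schema parsed with Python's standard `json` module. The record name `User`, the namespace `example.avro`, and the field names are hypothetical, chosen only for this example.

```python
import json

# A hypothetical Avro record schema, written as plain JSON.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "int"], "default": null}
  ]
}
"""

# Because the schema is ordinary JSON, any JSON parser can load and inspect it.
schema = json.loads(SCHEMA_JSON)
field_names = [field["name"] for field in schema["fields"]]
print(schema["name"], field_names)
```

The union type `["null", "int"]` marks `favorite_number` as optional, with `null` as its default.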
Avro is widely used for big data serialization, and its compact binary format can be read and written without code generation or proxy objects.
It is used as a data serialization component for Apache Hadoop. Avro relies on schemas: when Avro data are read, the schema that was used to write them is always present.
This allows each datum to be written with no per-value overheads, which makes serialization both fast and compact. And because the data, together with their schema, are fully self-describing, Avro is easy to use from dynamic scripting languages.
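To make the "no per-value overheads" point concrete, the following is a simplified sketch of how a record could be encoded in Avro's binary format using only the Python standard library. It assumes the encoding rules from the Avro specification (integers as zigzag variable-length values, strings as a length followed by UTF-8 bytes, record fields concatenated in schema order) and handles only the `int`, `long`, and `string` types. The schema and function names are hypothetical, and this is not the Avro library's own implementation.

```python
def encode_long(n: int) -> bytes:
    """Encode an int/long: zigzag first, then a base-128 variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negative numbers to small unsigned ones
    out = bytearray()
    while z & ~0x7F:          # more than 7 bits left: emit 7 bits and set the continuation bit
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string: byte length as a long, followed by the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw


ENCODERS = {"int": encode_long, "long": encode_long, "string": encode_string}


def encode_record(schema: dict, datum: dict) -> bytes:
    """Concatenate field values in schema order; no field names or type tags are written."""
    return b"".join(
        ENCODERS[field["type"]](datum[field["name"]]) for field in schema["fields"]
    )


# Hypothetical schema and datum for illustration.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

encoded = encode_record(USER_SCHEMA, {"name": "Ann", "age": 30})
print(encoded)  # b'\x06Ann<'
```

The five output bytes carry only the values; the field names and types live in the schema, which is why a reader needs the writer's schema to decode them.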
When Avro data are stored in a file, the schema is stored with them, so that the file can be processed later by any program. If the program reading the data expects a different schema, the difference can easily be resolved, since both schemas are present.
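The resolution step can be sketched as follows. This is a simplified illustration that works on an already decoded record and covers only two rules for record fields: a field the reader does not know is ignored, and a field the writer did not supply takes the reader's default. Type promotion, aliases, and nested types are left out, and the function and schema names are hypothetical.

```python
def resolve_record(writer_schema: dict, reader_schema: dict, record: dict) -> dict:
    """Project a record decoded with the writer's schema onto the reader's schema."""
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field present in both schemas: keep the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field only the reader knows: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for reader field {name!r}")
    # Fields that appear only in the writer's schema are dropped.
    return resolved


# Hypothetical schemas: the reader dropped "age" and added "email".
WRITER = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
]}
READER = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string", "default": ""},
]}

print(resolve_record(WRITER, READER, {"name": "Ann", "age": 30}))
# {'name': 'Ann', 'email': ''}
```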
Avro provides:
- A compact and fast binary data format
- Rich data structures
- A container file for storing persistent data
- Remote procedure call (RPC)
- Integration with dynamic languages
Code generation is not required to read or write data files, nor to use or implement RPC protocols.