Apache Avro: Schema Evolution and Serialization

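Avro is a row-based data serialization format that embeds a JSON schema in the file. Its strength is schema evolution: fields can be added, removed, or renamed without breaking existing readers, provided that added or removed fields have default values and renames are declared as aliases.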
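As a rough illustration of how a newer reader can consume older data, here is a minimal sketch of schema resolution on an already-decoded record. It is a simplified model for flat records only (no type promotion, no nested types), and the schemas, field names, and the resolve_record helper are assumptions made up for this example, not part of any Avro library.

```python
import json

# Writer schema: what the producer used when the data was written.
WRITER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "username", "type": "string"},
    {"name": "legacy_flag", "type": "boolean"}
  ]
}
""")

# Reader schema: a newer version that renames a field (via an alias),
# adds a field with a default, and no longer lists "legacy_flag".
READER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "login", "type": "string", "aliases": ["username"]},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")


def resolve_record(record, writer_schema, reader_schema):
    """Project a decoded record (a dict) onto the reader schema.

    Hypothetical helper: a simplified model of Avro schema resolution.
    """
    writer_names = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        # A reader field matches a writer field by name or by alias.
        candidates = [field["name"]] + field.get("aliases", [])
        source = next((n for n in candidates if n in writer_names), None)
        if source is not None:
            resolved[field["name"]] = record[source]
        elif "default" in field:
            # Field is new in the reader schema: fall back to its default.
            resolved[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for field {field['name']!r}")
    # Writer fields the reader does not list are dropped.
    return resolved


old_record = {"id": 7, "username": "ada", "legacy_flag": True}
print(resolve_record(old_record, WRITER_SCHEMA, READER_SCHEMA))
```

The old record comes back with "username" exposed as "login", "email" filled from its default, and "legacy_flag" dropped. A reader field with neither a matching writer field nor a default raises an error, which is why defaults matter for evolution.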

MIME type

application/avro

Type

Binary

Compression

Lossless

Pros

+ Schema evolution: add or remove fields that have default values without breaking readers
+ Compact binary encoding with efficient compression
+ Self-describing: the schema is embedded in the file
+ Widely used in Kafka and the Hadoop ecosystem
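Part of the compactness comes from how Avro writes integers: int and long values are zig-zag encoded and then written as variable-length bytes, so values of small magnitude take a single byte. The following is a minimal pure-Python sketch of that encoding; the names encode_long and decode_long are illustrative and do not come from an Avro library.

```python
def encode_long(value: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag varint."""
    if not -(1 << 63) <= value < (1 << 63):
        raise ValueError("value does not fit in a signed 64-bit long")
    # Zig-zag: map signed to unsigned so small magnitudes stay small
    # (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...).
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    # Varint: 7 payload bits per byte, high bit set while more bytes follow.
    while zigzag & ~0x7F:
        out.append((zigzag & 0x7F) | 0x80)
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def decode_long(data: bytes, pos: int = 0):
    """Decode one zig-zag varint; return (value, position after it)."""
    shift = 0
    zigzag = 0
    while True:
        byte = data[pos]
        pos += 1
        zigzag |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    # Undo the zig-zag mapping.
    return (zigzag >> 1) ^ -(zigzag & 1), pos


for n in (0, -1, 1, -64, 64):
    print(n, encode_long(n).hex())
```

Here 0, -1, 1, and -64 each encode to one byte, while 64 needs two (80 01).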

Cons

− Row-based: less efficient than columnar formats such as Parquet for analytical queries
− Not human-readable in binary form
− Avro's JSON-based schema language has a learning curve

When to use .AVRO

Use Avro for Kafka message schemas, Hadoop/Spark data pipelines, and systems that need schema evolution together with compact row-based serialization.

Technical details

An Avro file consists of a header that carries the JSON schema, followed by binary-encoded data blocks that can be compressed with a codec such as DEFLATE or Snappy. Schema evolution allows fields that have default values to be added or removed.
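To make the layout concrete, here is a pure-Python sketch that writes and reads a tiny container file: magic bytes, a metadata map holding the schema and codec, a 16-byte sync marker, then one DEFLATE-compressed block. It is a sketch under assumptions rather than a full implementation: the helper names and the User schema are invented for the example, only the deflate and null codecs are handled, and real code would normally use an Avro library instead.

```python
import io
import json
import os
import zlib

MAGIC = b"Obj\x01"  # 4-byte magic at the start of an Avro container file


def write_long(buf, value):
    """Avro long: zig-zag mapping, then base-128 varint."""
    zigzag = (value << 1) ^ (value >> 63)
    while zigzag & ~0x7F:
        buf.write(bytes([(zigzag & 0x7F) | 0x80]))
        zigzag >>= 7
    buf.write(bytes([zigzag]))


def read_long(buf):
    shift = zigzag = 0
    while True:
        byte = buf.read(1)[0]
        zigzag |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (zigzag >> 1) ^ -(zigzag & 1)
        shift += 7


def write_bytes(buf, data):
    write_long(buf, len(data))  # length prefix, then the raw bytes
    buf.write(data)


def read_bytes(buf):
    return buf.read(read_long(buf))


def encode_user(user):
    """Record encoding: field values concatenated in schema order."""
    buf = io.BytesIO()
    write_long(buf, user["id"])
    write_bytes(buf, user["name"].encode())  # string = length + UTF-8
    return buf.getvalue()


def write_container(schema, encoded_records, sync):
    buf = io.BytesIO()
    buf.write(MAGIC)
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"deflate"}
    write_long(buf, len(meta))          # metadata map: entry count
    for key, value in meta.items():
        write_bytes(buf, key.encode())
        write_bytes(buf, value)
    write_long(buf, 0)                  # end of map
    buf.write(sync)                     # end of header
    # The deflate codec is raw DEFLATE (no zlib header or checksum).
    comp = zlib.compressobj(wbits=-15)
    payload = comp.compress(b"".join(encoded_records)) + comp.flush()
    write_long(buf, len(encoded_records))  # objects in this block
    write_bytes(buf, payload)              # byte size, then the data
    buf.write(sync)                        # block terminator
    return buf.getvalue()


def read_container(data):
    buf = io.BytesIO(data)
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while (count := read_long(buf)) != 0:
        if count < 0:                   # negative count: a byte size follows
            count = -count
            read_long(buf)
        for _ in range(count):
            key = read_bytes(buf).decode()
            meta[key] = read_bytes(buf)
    sync = buf.read(16)
    codec = meta.get("avro.codec", b"null").decode()
    blocks = []
    while buf.tell() < len(data):
        count = read_long(buf)
        payload = read_bytes(buf)
        if codec == "deflate":
            payload = zlib.decompress(payload, -15)
        if buf.read(16) != sync:
            raise ValueError("sync marker mismatch")
        blocks.append((count, payload))
    return json.loads(meta["avro.schema"]), codec, blocks


schema = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"}, {"name": "name", "type": "string"}]}
file_bytes = write_container(schema, [encode_user({"id": 1, "name": "ada"})], os.urandom(16))
print(read_container(file_bytes))
```

Because the schema sits in the header, the reader needs nothing but the file itself to learn how the block payload is laid out. The sync marker repeated after each block is what lets tools split large files at block boundaries.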

History

Doug Cutting created Avro in 2009 as part of the Hadoop ecosystem. Unlike Thrift and Protocol Buffers, Avro was designed so that the schema travels with the data, which is what makes its support for schema evolution strong.

Convert from .AVRO

.avro → .arrow, .avro → .csv, .avro → .json, .avro → .ndjson, .avro → .parquet, .avro → .xlsx

Convert to .AVRO

.arrow → .avro, .csv → .avro, .json → .avro, .ndjson → .avro, .parquet → .avro, .xlsx → .avro

Related formats

.arrow .bson .geojson .hdf5 .msgpack .ndjson .parquet .protobuf .sqlite

Categories

Apache Avro (row-based serialization)