Parquet, Avro or ORC
Optimised file formats for use in Hadoop clusters
Data may arrive in a human-readable format such as JSON or CSV, but that doesn't mean it's the best way to actually store it: text formats are verbose and comparatively expensive to parse at scale.
There are three optimised file formats commonly used in Hadoop clusters: Parquet, Avro and ORC.
Parquet is an open-source columnar storage format developed by the Apache Software Foundation. It is designed to optimise the storage and processing of large datasets, particularly in big data processing systems such as Apache Hadoop, Apache Spark and Apache Hive.

Parquet stores data in columns rather than rows, which provides several benefits: it allows for efficient compression and encoding of data, which reduces storage costs and improves query performance. Parquet is a binary format, optimised for machine processing rather than human readability.
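As an illustration, here is a minimal sketch in Python using pandas with the pyarrow engine (both assumed to be installed; the file name example.parquet and the column names are made up for the example):

    import pandas as pd

    df = pd.DataFrame({
        "user_id": [1, 2, 3],
        "country": ["ES", "DE", "FR"],
    })

    # Write the frame to a Parquet file (pyarrow is the default engine).
    df.to_parquet("example.parquet")

    # Reading back a single column exploits the columnar layout:
    # the reader can skip the other columns on disk entirely.
    countries = pd.read_parquet("example.parquet", columns=["country"])
    print(countries)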
Avro is a data serialization system developed by the Apache Software Foundation. It provides a compact, fast binary data format that can be used for efficient data exchange between applications written in different programming languages; the schema is stored alongside the data, so any Avro implementation can read a file back.
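As a sketch, using the third-party fastavro library (an assumed dependency; the schema and file name are illustrative):

    from fastavro import parse_schema, reader, writer

    schema = parse_schema({
        "name": "User",
        "type": "record",
        "fields": [
            {"name": "user_id", "type": "long"},
            {"name": "country", "type": "string"},
        ],
    })

    records = [
        {"user_id": 1, "country": "ES"},
        {"user_id": 2, "country": "DE"},
    ]

    # Serialise: the schema is embedded in the file, so readers in
    # other languages can decode the records without extra metadata.
    with open("users.avro", "wb") as out:
        writer(out, schema, records)

    with open("users.avro", "rb") as f:
        for record in reader(f):
            print(record)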
Like Parquet, ORC (Optimized Row Columnar) is a binary format that is optimised for efficient storage and processing of large datasets. It uses compression and encoding techniques to reduce the amount of storage space required and improve query performance.

ORC stores data in a columnar format, similar to Parquet, but it also supports row-oriented storage for efficient write operations. ORC also includes advanced features such as predicate pushdown, which allows for more efficient filtering of data, and bloom filters, which enable faster query execution by reducing the number of disk reads required.
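A minimal sketch using pyarrow's ORC module (an assumed dependency; the file and column names are illustrative):

    import pyarrow as pa
    import pyarrow.orc as orc

    table = pa.table({
        "user_id": [1, 2, 3],
        "country": ["ES", "DE", "FR"],
    })

    # Write the table to an ORC file.
    orc.write_table(table, "example.orc")

    # As with Parquet, individual columns can be read in isolation;
    # engines such as Hive additionally use ORC's indexes and bloom
    # filters to skip stripes that cannot match a filter predicate.
    subset = orc.read_table("example.orc", columns=["country"])
    print(subset)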
Finally, once the parquet-tools package is installed, we can use the parquet-tools command to view the contents of Parquet files in a readable form. For example, to view the contents of a Parquet file called parquet.parquet, we can use the following command:
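Assuming the Apache parquet-tools CLI, the cat subcommand prints the file's records to standard output, and schema prints its schema:

    parquet-tools cat parquet.parquet

    # Inspect the schema instead of the data:
    parquet-tools schema parquet.parquet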