Working with Parquet files in Java using Carpet

Published at
6/19/2024
Categories
parquet
java
bigdata
dataengineering
Author
Jerónimo López

After some time working with Parquet files in Java using the Parquet Avro library, and studying how it worked, I concluded that although the format is very useful in many use cases and has great potential, the documentation and ecosystem needed for its adoption in the Java world were very poor.

Many people are using suboptimal solutions (CSV or JSON files), applying more complex solutions (Spark), or using languages they are not familiar with (Python) because they don't know how to work with Parquet files easily. That's why I decided to write this series of articles.

Once you understand it and have the examples, everything is easier. But can it be even easier? Can we avoid the hassle of relying on libraries built for other serialization formats? Yes, it should be even easier.

That's why I decided to implement an Open Source library that makes working with Parquet from Java extremely simple, something that covers it: Carpet.

Carpet is a Java library that serializes and deserializes Parquet files to and from Java 17 records, shielding you (if you want) from the particularities of Parquet and Hadoop, and minimizing the number of required dependencies, because it works directly with the Parquet code. It is available on Maven Central and you can find its source code on GitHub.

Hello world

Carpet works by reflection: it inspects your class model, so there is no need to define an IDL, implement interfaces, or use annotations. Carpet is based on Java records, the construct the JDK introduced for data-oriented programming.

Continuing with the same examples from previous articles, we will have a collection of Organization objects, which have a list of Attributes:

record Org(String name, String category, String country, Type type, List<Attr> attributes) { }

record Attr(String id, byte quantity, byte amount, boolean active, double percent, short size) { }

enum Type { FOO, BAR, BAZ }

With Carpet, it is not necessary to create special classes or perform transformations. Carpet works directly with your model, as long as it fits the Parquet schema you need.

Serialization

With Carpet, you don't need to use Parquet writers or Hadoop classes:

try (OutputStream outputStream = new FileOutputStream(filePath)) {
    try (CarpetWriter<Org> writer = new CarpetWriter<>(outputStream, Org.class)) {
        writer.write(organizations);
    }
}

The code can be found on GitHub.

If your records match the required Parquet schema, class conversion is not necessary. If you don't need special Parquet configuration, you don't have to create builders, and you can use a Java OutputStream directly.

Using reflection, Carpet creates the Parquet schema, taking the names and types of the fields in your records as column names and types.

Carpet supports complex data structures, as long as every object is a record, a collection (List, Set, etc.), or a map.
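To make the idea concrete, here is a minimal sketch of how a schema can be derived from records using only the JDK reflection API. It is not Carpet's implementation: the SchemaSketch class and its output format are hypothetical, it handles only primitives, strings, enums, nested records and collections of those, and it assumes (as seems natural) that Java primitives map to required columns while object types map to optional ones.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.RecordComponent;
import java.util.Collection;
import java.util.List;

enum Type { FOO, BAR, BAZ }

record Attr(String id, byte quantity, byte amount, boolean active, double percent, short size) { }

record Org(String name, String category, String country, Type type, List<Attr> attributes) { }

// Hypothetical helper, not part of Carpet or Parquet
class SchemaSketch {

    // Returns one line per column: repetition, Parquet physical type and column name
    static String describe(Class<?> recordClass) {
        StringBuilder out = new StringBuilder();
        appendFields(recordClass, "", out);
        return out.toString();
    }

    private static void appendFields(Class<?> recordClass, String indent, StringBuilder out) {
        for (RecordComponent component : recordClass.getRecordComponents()) {
            Class<?> type = component.getType();
            // Assumption: a Java primitive can never be null, so it becomes a required column
            String prefix = indent + (type.isPrimitive() ? "required " : "optional ");
            if (Collection.class.isAssignableFrom(type)) {
                out.append(prefix).append("list ").append(component.getName()).append('\n');
                Class<?> element = elementType(component);
                if (element.isRecord()) {
                    appendFields(element, indent + "  ", out);
                } else {
                    out.append(indent).append("  ").append(physicalType(element)).append(" element\n");
                }
            } else if (type.isRecord()) {
                out.append(prefix).append("group ").append(component.getName()).append('\n');
                appendFields(type, indent + "  ", out);
            } else {
                out.append(prefix).append(physicalType(type)).append(' ')
                        .append(component.getName()).append('\n');
            }
        }
    }

    // The element type of List<Attr> is kept in the component's generic signature.
    // Simplification: nested generics such as List<List<String>> are not handled.
    private static Class<?> elementType(RecordComponent component) {
        ParameterizedType generic = (ParameterizedType) component.getGenericType();
        return (Class<?>) generic.getActualTypeArguments()[0];
    }

    // Parquet physical types: 8 and 16 bit integers are stored as int32, text and enums as binary
    private static String physicalType(Class<?> type) {
        if (type == boolean.class || type == Boolean.class) return "boolean";
        if (type == byte.class || type == Byte.class || type == short.class || type == Short.class
                || type == int.class || type == Integer.class) return "int32";
        if (type == long.class || type == Long.class) return "int64";
        if (type == float.class || type == Float.class) return "float";
        if (type == double.class || type == Double.class) return "double";
        if (type == String.class || type.isEnum()) return "binary";
        throw new IllegalArgumentException("Type not covered by this sketch: " + type);
    }

    public static void main(String[] args) {
        System.out.print(describe(Org.class));
    }
}
```

Running it on Org lists the four top-level text columns followed by the attributes list and, indented, the columns of Attr. The point is only that a record already carries everything needed to name and type the columns.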

Deserialization

Deserialization is equally simple, or even simpler.

List<Org> organizations = new CarpetReader<>(new File(filePath), Org.class).toList();

You can also iterate through the file with a stream:

List<Org> organizations = new CarpetReader<>(new File(filePath), Org.class).stream()
    .filter(this::somePredicate)
    .toList();

The code can be found on GitHub.

Since Carpet uses reflection, by convention it expects the names and types of the record fields to match those of the columns in the Parquet file.

None of the Parquet or Hadoop classes are imported into your code.
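The name-matching convention can be illustrated with plain JDK reflection. The following is a sketch under stated assumptions, not Carpet's code: RecordFactorySketch is a hypothetical helper, and a row is modeled as a simple Map from column name to value rather than Parquet's internal structures.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.RecordComponent;
import java.util.Arrays;
import java.util.Map;

record Attr(String id, byte quantity, byte amount, boolean active, double percent, short size) { }

// Hypothetical helper, not part of Carpet or Parquet
class RecordFactorySketch {

    // Builds a record from one row of column values, matching columns to components by name
    static <T extends Record> T fromRow(Class<T> recordClass, Map<String, Object> row) {
        RecordComponent[] components = recordClass.getRecordComponents();
        // The canonical constructor takes one parameter per component, in declaration order
        Class<?>[] parameterTypes = Arrays.stream(components)
                .map(RecordComponent::getType)
                .toArray(Class<?>[]::new);
        Object[] values = new Object[components.length];
        for (int i = 0; i < components.length; i++) {
            String column = components[i].getName();
            if (!row.containsKey(column)) {
                throw new IllegalArgumentException("No column named " + column);
            }
            values[i] = row.get(column);
        }
        try {
            Constructor<T> constructor = recordClass.getDeclaredConstructor(parameterTypes);
            constructor.setAccessible(true);
            // Boxed values (Byte, Short, ...) are unboxed to match primitive parameters
            return constructor.newInstance(values);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + recordClass.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = Map.of("id", "a1", "quantity", (byte) 3, "amount", (byte) 7,
                "active", true, "percent", 0.5, "size", (short) 10);
        System.out.println(fromRow(Attr.class, row));
    }
}
```

If a name does not match, the sketch fails immediately, which is why keeping record fields and column names aligned matters.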

Deserialization using a projection

Carpet reads only the columns that are defined in the records and ignores any other columns that exist in the file. Defining a projection with a subset of attributes is as simple as defining a record in Java:

record OrgProjection(String name, String category, String country, Type type) { }

var organizations = new CarpetReader<>(new File(filePath), OrgProjection.class).toList();

Because fewer columns have to be read and decoded, reading time in this case drops to hundreds of milliseconds.

The code can be found on GitHub.
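Why a projection is cheap follows from Parquet being columnar: a column that is never requested does not have to be read or decoded. As a rough sketch of the selection step (ProjectionSketch is a hypothetical helper, and the list of file columns is passed in by hand instead of being read from a file footer), the set of columns to read is simply the intersection of the file's columns and the record's components:

```java
import java.lang.reflect.RecordComponent;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

enum Type { FOO, BAR, BAZ }

record OrgProjection(String name, String category, String country, Type type) { }

// Hypothetical helper, not part of Carpet or Parquet
class ProjectionSketch {

    // Returns, in file order, the columns that the record declares; the rest can be skipped
    static List<String> columnsToRead(Class<? extends Record> recordClass, List<String> fileColumns) {
        Set<String> wanted = Arrays.stream(recordClass.getRecordComponents())
                .map(RecordComponent::getName)
                .collect(Collectors.toSet());
        return fileColumns.stream().filter(wanted::contains).toList();
    }

    public static void main(String[] args) {
        List<String> fileColumns = List.of("name", "category", "country", "type", "attributes");
        // The attributes column, the largest one in this model, is never requested
        System.out.println(columnsToRead(OrgProjection.class, fileColumns));
    }
}
```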

The Parquet way

If for any reason you need to customize some parameter of file generation or use it with Hadoop, Carpet provides an implementation of the ParquetWriter and ParquetReader builders. This way, all Parquet configurations are exposed.

Serialization

We will need to instantiate a Parquet writer:

OutputFile outputFile = new FileSystemOutputFile(new File(filePath));
try (ParquetWriter<Org> writer = CarpetParquetWriter.<Org>builder(outputFile, Org.class)
        .withCompressionCodec(CompressionCodecName.GZIP)
        .withWriteMode(Mode.OVERWRITE)
        .build()) {
    for (Org org : organizations) {
        writer.write(org);
    }
}

The code can be found on GitHub.

Carpet implements a ParquetWriter<T> builder with all the logic to convert Java records to Parquet API calls.

To avoid using Hadoop classes (and importing all their dependencies), Carpet implements the InputFile and OutputFile interfaces using regular files.

Therefore:

  • OutputFile and ParquetWriter are classes defined by the Parquet API
  • CarpetParquetWriter and FileSystemOutputFile are classes implemented by Carpet
  • Org and Attr are Java records from your domain, unrelated to Parquet or Carpet

Carpet implicitly generates the Parquet schema from the fields of your records.

Deserialization

We will need to instantiate a Parquet reader using the CarpetParquetReader builder:

InputFile inputFile = new FileSystemInputFile(new File(filePath));
try (ParquetReader<Org> reader = CarpetParquetReader.builder(inputFile, Org.class).build()) {
    List<Org> organizations = new ArrayList<>();
    Org next = null;
    while ((next = reader.read()) != null) {
        organizations.add(next);
    }
    return organizations;
}

You can find the code on GitHub.

Parquet defines a class called ParquetReader<T>, and Carpet implements it with CarpetParquetReader, handling the logic to convert internal data structures of Parquet to your Java records.

In this case:

  • InputFile and ParquetReader are classes defined by the Parquet API
  • CarpetParquetReader and FileSystemInputFile are classes implemented by Carpet
  • Org (and Attr) are Java records from your domain, unrelated to Parquet
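The read loop above follows a pull pattern: read() is called until it returns null. If you prefer the Stream style shown earlier while still using the Parquet builder, that pattern can be adapted with the JDK alone. This is a sketch, with PullReaderStreams as a hypothetical helper and a plain Supplier standing in for the reader; with a real ParquetReader, read() throws IOException, so the call would need to be wrapped (for example in an UncheckedIOException).

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Supplier;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper, not part of Carpet or Parquet
class PullReaderStreams {

    // Adapts a source whose read operation returns null at the end of input to a Stream.
    // The first element is fetched when the stream is created, the rest on demand.
    static <T> Stream<T> stream(Supplier<T> read) {
        Iterator<T> iterator = new Iterator<>() {
            private T next = read.get();

            @Override
            public boolean hasNext() {
                return next != null;
            }

            @Override
            public T next() {
                if (next == null) {
                    throw new NoSuchElementException();
                }
                T current = next;
                next = read.get();
                return current;
            }
        };
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED | Spliterator.NONNULL),
                false);
    }

    public static void main(String[] args) {
        // Stand-in for a reader: yields three values and then null
        Iterator<String> source = List.of("a", "b", "c").iterator();
        stream(() -> source.hasNext() ? source.next() : null)
                .filter(value -> !value.equals("b"))
                .forEach(System.out::println);
    }
}
```

The stream must still be consumed inside the try-with-resources block that owns the reader, since closing the reader ends the input.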

The instantiation of the ParquetReader class is also done with a Builder to maintain the pattern followed by Parquet.

Carpet validates that the schema of the Parquet file is compatible with the Java records. If not, it throws an exception.
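The exact rules Carpet applies are documented in the project's README, not here. As a simplified illustration of the kind of check involved, the sketch below compares a record against a file schema modeled as a Map from column name to Java type. SchemaValidationSketch and the Measurement record are hypothetical names, and a real check works on Parquet types rather than Java classes.

```java
import java.lang.reflect.RecordComponent;
import java.util.Map;

// Hypothetical record used only for this example
record Measurement(String sensor, double value) { }

// Hypothetical helper, not part of Carpet or Parquet
class SchemaValidationSketch {

    // Fails fast when a record component has no matching column or the column type differs.
    // Columns in the file that the record does not declare are ignored.
    static void validate(Class<? extends Record> recordClass, Map<String, Class<?>> fileColumns) {
        for (RecordComponent component : recordClass.getRecordComponents()) {
            String name = component.getName();
            Class<?> columnType = fileColumns.get(name);
            if (columnType == null) {
                throw new IllegalStateException("Column not found in file: " + name);
            }
            if (!columnType.equals(component.getType())) {
                throw new IllegalStateException("Column " + name + " is " + columnType.getSimpleName()
                        + " in the file but " + component.getType().getSimpleName() + " in the record");
            }
        }
    }

    public static void main(String[] args) {
        // Compatible: the extra column is simply not read
        validate(Measurement.class, Map.of("sensor", String.class, "value", double.class, "unit", String.class));
        // Incompatible: value is stored as long, so this call throws
        validate(Measurement.class, Map.of("sensor", String.class, "value", long.class));
    }
}
```

Validating before reading any row means a mismatch surfaces as one clear error instead of a failure halfway through the file.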

Performance

With identical schemas and data, Carpet produces files of the same size as parquet-avro and parquet-protobuf. But what is the overhead cost of using reflection?

Library                    Serialization   Deserialization
Parquet Avro               15,381 ms       7,665 ms
Parquet Protocol Buffers   16,174 ms       11,025 ms
Carpet                     12,769 ms       8,881 ms

When writing, Carpet takes about 17% less time than Avro and about 21% less than Protocol Buffers. The overhead of reflection is smaller than the work required to create Avro or Protocol Buffers objects.

When reading, Carpet is slightly slower (about 16%) than the fastest version of Parquet Avro, and about 19% faster than Protocol Buffers. The use of reflection does not significantly penalize performance and, in return, we avoid depending on each library's custom data types.
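The relative differences can be checked directly from the table. This small sketch (BenchmarkComparison is just an illustrative name) computes the change of Carpet's times against each baseline, where a negative value means Carpet took less time:

```java
class BenchmarkComparison {

    // Relative change of a measured time against a baseline, as a percentage
    static double relativeChangePercent(double measuredMs, double baselineMs) {
        return (measuredMs - baselineMs) / baselineMs * 100.0;
    }

    public static void main(String[] args) {
        // Times in milliseconds, taken from the table above
        System.out.printf("Write vs Avro:     %.1f%%%n", relativeChangePercent(12_769, 15_381));
        System.out.printf("Write vs Protobuf: %.1f%%%n", relativeChangePercent(12_769, 16_174));
        System.out.printf("Read vs Avro:      %.1f%%%n", relativeChangePercent(8_881, 7_665));
        System.out.printf("Read vs Protobuf:  %.1f%%%n", relativeChangePercent(8_881, 11_025));
    }
}
```

The four values come out at roughly -17%, -21%, +16% and -19%.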

Conclusion

Parquet is a very powerful format, yet it is underutilized in the Java ecosystem. This is partly due to lack of awareness and partly because, being a binary format, it is not very comfortable to work with.

Even if you're not into Big Data, Parquet can still be useful in scenarios involving large datasets. Often, out of unfamiliarity with the format, teams adopt complex or inefficient solutions and architectures.

The format, with its schema, guarantees that every value matches its declared type and that columns declared as non-null never contain nulls. How many times have you struggled parsing a CSV file?

Carpet provides a very simple API, making it extremely easy to write and process Parquet files in 99% of use cases. For me, working with Parquet files is now more convenient than CSVs.

Carpet is an open-source library under the Apache 2.0 license. You can find its source code on GitHub and it's available on Maven Central.

The README.md of the project provides a detailed explanation of its various functionalities, customization options, and how to use its API. I encourage you to use Carpet and share your feedback or tell me about your use cases working with Parquet.
