Logo

dev-resources.site

for different kinds of informations.

Integration of Apache Iceberg in S3, Glue, Athena, Matillion, and Snowflake – Part 1

Published at
7/29/2024
Categories
aws
cdk
data
governance
Author
zirkonium88
Categories
4 categories in total
aws
open
cdk
open
data
open
governance
open
Author
11 person written this
zirkonium88
open
Integration of Apache Iceberg in S3, Glue, Athena, Matillion, and Snowflake – Part 1

TL;DR

If you aim to establish transactional object control in your Data Lake, you should definitely consider Apache Iceberg. However, the path is somewhat rocky and involves manual efforts. AWS Glue, in particular, needs to offer more feature sets. Once you overcome these hurdles, the integration of Iceberg in Snowflake, Glue, and Matillion is as comfortable as dealing with other file formats. This article is less of a technical guide and more of a narrative about my journey from S3 to Glue. This part focuses on infrastructure deployment, while the second part will focus on the Matillion pipeline and Snowflake integration.

Transactional Data Lakes – Why and How?

When working with a Data Lake, the question of how to handle changes and data updates inevitably arises. In Europe, explicit deletion concepts are required. In a database, a data point can be relatively easily identified and updated by deleting or anonymizing it.

Apache Iceberg – Why and How?

There are three frameworks that meet the requirements of a transactional Data Lake: Apache Hudi, Apache Iceberg, and Delta Lake. AWS has published a blog post offering a purely technical comparison: Choosing an open table format for your transactional data lake on AWS.

From my perspective, the relevant aspect was that Apache Iceberg has the most extensive integration with Amazon Athena. Both read and write operations can occur here. In combination with Glue, it is possible to update objects via SQL from AWS without maintaining complex systems.

Automation is Key – But Not Given at the Crucial Point

The following section outlines my project process.

Data Integration: From S3 to Glue to Apache Iceberg

When we started, I quickly realized that we couldn't use CSV or JSON files for Apache Iceberg but rather ORC, Avro, and Parquet. In Matillion, I often use Pandas, so migrating from CSV to Parquet wasn't a big problem. My DataFrame only needed to use a different method and exchange a few parameters.

df = pd.DataFrame.from_dict(response_dict["data"])
df["Upload Date"] = file_date
df.columns = [replace_no_ascii(unicode_string=col) for col in df.columns]
df.to_parquet(
    f"s3://{s3_bucket}/{directory}/{year}/{file_date}.snappy.parquet",
    engine="auto",
    compression="snappy",
    storage_options={
        "key": access_key,
        "secret": secret_access_key,
        "token": sessions_token,
    },
)
Enter fullscreen mode Exit fullscreen mode

The storage options are only needed if you want to write the data cross-account.

In the next step, I thought Glue could easily create Iceberg tables from Parquet files – wrong. So, what can Glue do here? You can build Iceberg tables from existing Glue tables. But here's the problem: you have to provide the table definition, even though the crawler has already created it for the "simple" files. This is a significant lack of automation when I want to build my data pipelines with Infrastructure as Code (IaC).

I then wondered about the Python library PyIceberg. It supports Python and AWS but cannot use S3 as a catalog, only Glue: Creating an Iceberg table on S3 using PyIceberg and Glue catalog.

The attempt to let Glue read Iceberg tables directly also failed. Unfortunately, our ELT tool Matillion cannot directly handle Apache Iceberg as of July 2024.

Data Integration: From S3 to Glue to Apache Iceberg – With the Help of Athena

Technically, I always thought it had to continue with the Glue tables that my crawler had already created. I then discovered that Athena can not only create Apache Iceberg tables via SQL but also generate Iceberg tables using DDL statements without giving a table definition.

Thus, you can indeed create Iceberg tables in AWS via custom resources without table definitions and using "IaC":

CREATE TABLE "glue-database-rexx"."datalake_api_iceberg" 
WITH (table_type = 'ICEBERG', format = 'PARQUET', location = 's3://mybucket/datalake-iceberg/', is_external = false) 
AS SELECT * FROM "glue-database-iceberg"."datalake";
Enter fullscreen mode Exit fullscreen mode

The only downside: you create a copy of the existing Parqute files. So cleanup process and data INSERT mechanism is needed. You will find this in part two of this series.

Closing Words

Infrastructure as Code (IaC) for Apache Iceberg is possible – but with detours. In summary, you can use IaC to create S3 buckets, Glue crawlers, the database, and the Athena query for the initial statement to create the Apache Iceberg catalogs. The path is not very explicit, which is why I wrote it down here. Once I understood this, I could then codify my journey with the Cloud Development Kit.

governance Article's
30 articles in total
Favicon
Implementing Kubernetes Security with Kyverno: A Journey Through Resource Management
Favicon
Cloud Cost Optimization
Favicon
Data Governance… when there's no Data nor Governance
Favicon
Gobierno de Datos… cuando no hay Gobierno ni Datos
Favicon
5 Signs You’ve Built a Secretly Bad Architecture (And How to Fix It)
Favicon
Building a Robust Data Governance Framework: Best Practices and Key Considerations 
Favicon
The Pitfalls of Philosopher King -Why Morality Alone Isn’t Enough for Effective Leadership
Favicon
Understanding Data Governance in the Digital Age
Favicon
Identifying the Risks Organizations Face - Key Considerations for Risk Governance
Favicon
Crafting a Balanced Governance Strategy for a Growing Fintech Startup
Favicon
Conventional Use of AWS CDK
Favicon
The AI Alliance: Shaping the Future of Artificial Intelligence Together
Favicon
The Power of Storytelling in Digital Business Case Development
Favicon
Navigating the Future of Business in the Digital Age
Favicon
Improving Cloud governance using an automated naming generation tool.
Favicon
Data Governance in Modern Data Engineering
Favicon
Integration of Apache Iceberg in S3, Glue, Athena, Matillion, and Snowflake – Part 2
Favicon
Integration of Apache Iceberg in S3, Glue, Athena, Matillion, and Snowflake – Part 1
Favicon
Rapidly Ship Industry-Standard APIs
Favicon
Why your performance work is not seen
Favicon
API Security Best Practices: Enable Good Governance
Favicon
Understanding the governance of Cilium.
Favicon
Open source at Fastly is getting opener
Favicon
The 5 Disciplines of Cloud Governance
Favicon
Top GRC Trends for 2024 and Beyond
Favicon
New Features: Snowflake International Tags & Shared Tags
Favicon
The Dawn of Technocracy
Favicon
Cybersecurity frameworks
Favicon
Risk management frameworks
Favicon
Why your DevOps Toolchain Needs a Governance Platform

Featured ones: