Understanding COW and MOR in Apache Hudi: Choosing the Right Storage Strategy

Published at: 11/19/2024
Categories: devops, technology, software, trending
Author: anshul_kichara

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a powerful framework designed for managing large datasets on cloud storage systems, enabling efficient data ingestion, storage, and retrieval. One of the key features of Hudi is its support for two distinct storage types: Copy-On-Write (COW) and Merge-On-Read (MOR). Each of these storage strategies has unique characteristics and serves different use cases. In this blog, we will explore COW and MOR.

Prerequisites

Before you begin, ensure you have the following installed on your local machine:

Docker
Docker Compose

Local Setup

To set up Apache Hudi locally, follow these steps:

Clone the Repository:
git clone https://github.com/dnisha/hudi-on-localhost.git
cd hudi-on-localhost
Start Docker Compose:
docker-compose up -d
Access the Notebooks:
Open your browser and navigate to http://localhost:8888 for the Jupyter Notebook.
Also, open http://localhost:9001/login for MinIO.
Username: minioadmin
Password: minioadmin
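
Once the containers are running, a quick way to confirm that both services are listening is to probe their ports. The following is a minimal sketch using only the Python standard library. It assumes the ports from the URLs above (8888 and 9001); the SERVICES dictionary and the port_open helper are illustrative names, not part of the repository.

```python
import socket

# Ports taken from the URLs in the setup steps above.
SERVICES = {"Jupyter Notebook": 8888, "MinIO": 9001}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if port_open("localhost", port) else "not reachable"
        print(f"{name} (port {port}): {state}")
```

An open port only shows that something is listening. Logging in through the browser is still the real check.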

What is Copy-On-Write (COW)?

Copy-On-Write (COW) is a storage type in Apache Hudi that allows for atomic write operations. When data is updated or inserted:

Hudi writes a new version of each data file that contains affected records, rewriting that file in full.
The existing data file remains unchanged until the new file is successfully written.
This ensures that the operation is atomic, meaning it either completely succeeds or fails without partial updates.
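
The behaviour described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Hudi is implemented: CowFileGroup and its methods are hypothetical names, and a dictionary stands in for a Parquet file.

```python
class CowFileGroup:
    """Toy model of copy-on-write: every write produces a full new file version."""

    def __init__(self, records):
        # Each entry in `versions` is a complete snapshot keyed by record id.
        self.versions = [{r["id"]: r for r in records}]

    def upsert(self, changes):
        # Copy the latest version and apply the changes to the copy only.
        new_version = dict(self.versions[-1])
        for rec in changes:
            new_version[rec["id"]] = rec
        # Publishing is a single append, so readers see either the old
        # version or the new one, never a partially updated file.
        self.versions.append(new_version)

    def read(self):
        # Readers always use the latest complete version.
        return sorted(self.versions[-1].values(), key=lambda r: r["id"])


fg = CowFileGroup([{"id": "a", "amount": 10}, {"id": "b", "amount": 20}])
fg.upsert([{"id": "b", "amount": 25}])

print(len(fg.versions))               # 2: the old file plus the rewritten one
print(fg.versions[0]["b"]["amount"])  # 20: the old version is untouched
print(fg.read()[1]["amount"])         # 25: readers see the new version
```

Note that the single-record update copied record "a" as well. That is the cost of COW: the whole file is rewritten even when only one record changes.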

Steps to Evaluate COW

1. Open the Notebook:
In your browser, navigate to hudi_cow_evaluation.ipynb.

2. Run Configuration Code:
Execute all configuration-related code in the notebook.
Ensure you specify the COPY_ON_WRITE table type, as shown in the provided image.

3. Update a Record:
a. Update a record in the document=34 partition of the COW bucket.

b. Because the table type is COPY_ON_WRITE, Hudi creates a new Parquet file for this update. You can find this file in the bucket at warehouse/cow/transactions/document=34.
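
For reference, the table-type setting in step 2 might look roughly like the options below. This is a hedged sketch, not the notebook's actual code. The option keys are standard Hudi write configs, the table name and partition field are inferred from the path warehouse/cow/transactions/document=34, and the record key and precombine field names (transaction_id, updated_at) are hypothetical placeholders.

```python
VALID_TABLE_TYPES = ("COPY_ON_WRITE", "MERGE_ON_READ")


def hudi_write_options(table_type: str) -> dict:
    """Build Hudi write options for the given table type."""
    if table_type not in VALID_TABLE_TYPES:
        raise ValueError(f"table_type must be one of {VALID_TABLE_TYPES}")
    return {
        "hoodie.table.name": "transactions",  # inferred from the path
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
        # Hypothetical field names; use the ones from the notebook's schema.
        "hoodie.datasource.write.recordkey.field": "transaction_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        # Inferred from the document=34 folder name.
        "hoodie.datasource.write.partitionpath.field": "document",
        "hoodie.datasource.write.hive_style_partitioning": "true",
    }


cow_options = hudi_write_options("COPY_ON_WRITE")

# With a Spark DataFrame `df` and a Hudi-enabled session (not created here),
# the write would look like this; the target path is an assumption:
# df.write.format("hudi").options(**cow_options).mode("append").save(
#     "s3a://warehouse/cow/transactions")
```

Passing "MERGE_ON_READ" instead changes only the table type entry. The other options stay the same.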

What is Merge-On-Read (MOR)?

Merge-On-Read (MOR) is an alternative storage type in Apache Hudi that employs a different approach to data management. Here’s how it works:

Base Parquet Files and Log Files: In MOR, Hudi maintains a combination of base Parquet files alongside log files that capture incremental changes.
On-the-Fly Merging: When a read operation is executed, Hudi merges the base files and log files in real-time, providing the most up-to-date view of the data.
This approach handles updates and inserts efficiently and makes writes faster, because Hudi does not need to rewrite entire files for every change. The trade-off is that reads do extra work to merge the base files with the log files.
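
The same toy model used to picture COW can be adapted to show the MOR idea: writes append to a log, and the merge happens when data is read. Again, this is a simplified sketch with hypothetical names (MorFileGroup, base, log), not Hudi's actual file layout.

```python
class MorFileGroup:
    """Toy model of merge-on-read: a base file plus an append-only log."""

    def __init__(self, records):
        # Stands in for the base Parquet file.
        self.base = {r["id"]: r for r in records}
        # Stands in for the log files that capture incremental changes.
        self.log = []

    def upsert(self, changes):
        # Writes are cheap: append the changes, leave the base file alone.
        self.log.extend(changes)

    def read(self):
        # Reads pay the cost: overlay log entries on the base, oldest first,
        # so the most recent change to each record wins.
        merged = dict(self.base)
        for rec in self.log:
            merged[rec["id"]] = rec
        return sorted(merged.values(), key=lambda r: r["id"])


fg = MorFileGroup([{"id": "a", "amount": 10}, {"id": "b", "amount": 20}])
fg.upsert([{"id": "b", "amount": 25}])

print(fg.base["b"]["amount"])  # 20: the base file was not rewritten
print(len(fg.log))             # 1: only the change itself was written
print(fg.read()[1]["amount"])  # 25: the merged view is up to date
```

Compared with the COW case, the update wrote one record instead of a whole file, but every read now loops over the log.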

For more information, see: COW and MOR in Apache Hudi.
