
Unraveling the Secrets of Hadoop Sorting

Published at 9/24/2024
Categories: labex, hadoop, coding, programming
Author: labby

Introduction


In a mysterious night market, a captivating figure adorned in an ornate mask gracefully moves through the bustling crowd. This enigmatic mask dancer seems to possess a secret power, effortlessly sorting the chaotic stalls into an orderly arrangement with each twirl and sway. Your goal is to unravel the mystery behind this remarkable talent by mastering the art of Hadoop Shuffle Comparable.

Implement the Mapper

In this step, we will create a custom Mapper class to process input data and emit key-value pairs. The key will be a composite key comprising two fields: the first character of each word and the length of the word. The value will be the word itself.

First, switch to the hadoop user; the login shell will also place you in that user's home directory:

su - hadoop

Then, create a Java file for the Mapper class:

touch /home/hadoop/WordLengthMapper.java

Add the following code to the WordLengthMapper.java file:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordLengthMapper extends Mapper<LongWritable, Text, CompositeKey, Text> {

    private final CompositeKey compositeKey = new CompositeKey();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split("\\s+");

        for (String word : words) {
            // Skip empty tokens, e.g. when a line starts with whitespace
            if (word.isEmpty()) {
                continue;
            }
            compositeKey.setFirstChar(word.charAt(0));
            compositeKey.setLength(word.length());
            context.write(compositeKey, new Text(word));
        }
    }
}

In the above code, we create a WordLengthMapper class that extends the Mapper class from the Hadoop MapReduce API. The map method takes a LongWritable key (representing the byte offset of the input line) and a Text value (the input line itself). It then splits the input line into individual words, creates a CompositeKey object for each word (containing the first character and length of the word), and emits the CompositeKey as the key and the word as the value.
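
For example, if an input line read "Hadoop sorts data" (a made-up line, not from the lab's dataset), the mapper would emit the pairs (H:6, Hadoop), (s:5, sorts), and (d:4, data), with each key written here in the firstChar:length form produced by CompositeKey's toString method.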

Implement the CompositeKey

In this step, we will create a custom CompositeKey class that implements the WritableComparable interface from the Hadoop MapReduce API. This class will be used as the key in our MapReduce job, allowing us to sort and group the data based on the first character and length of each word.

First, create a Java file for the CompositeKey class:

touch /home/hadoop/CompositeKey.java

Then, add the following code to the CompositeKey.java file:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class CompositeKey implements WritableComparable<CompositeKey> {

    private char firstChar;
    private int length;

    public CompositeKey() {
    }

    public void setFirstChar(char firstChar) {
        this.firstChar = firstChar;
    }

    public char getFirstChar() {
        return firstChar;
    }

    public void setLength(int length) {
        this.length = length;
    }

    public int getLength() {
        return length;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeChar(firstChar);
        out.writeInt(length);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        firstChar = in.readChar();
        length = in.readInt();
    }

    @Override
    public int compareTo(CompositeKey other) {
        int cmp = Character.compare(firstChar, other.firstChar);
        if (cmp != 0) {
            return cmp;
        }
        return Integer.compare(length, other.length);
    }

    @Override
    public int hashCode() {
        return firstChar + length;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof CompositeKey) {
            CompositeKey other = (CompositeKey) obj;
            return firstChar == other.firstChar && length == other.length;
        }
        return false;
    }

    @Override
    public String toString() {
        return firstChar + ":" + length;
    }
}

In the above code, we create a CompositeKey class that implements the WritableComparable interface. It has two fields: firstChar (the first character of a word) and length (the length of the word). The class provides getter and setter methods for these fields and implements the write, readFields, and compareTo methods required by WritableComparable. It also overrides hashCode, equals, and toString; the hashCode override matters because Hadoop's default HashPartitioner uses it to decide which reducer each key is routed to.

The compareTo method is particularly important, as it defines how the keys will be sorted in the MapReduce job. In our implementation, we first compare the firstChar fields of the two keys. If they are different, we return the result of that comparison. If the firstChar fields are the same, we then compare the length fields.
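
As a quick illustration of that ordering, here is a small, hypothetical snippet (not part of the lab code) that exercises compareTo directly; it assumes CompositeKey.java has already been compiled and is on the classpath:

public class CompositeKeySortDemo {

    public static void main(String[] args) {
        CompositeKey a = new CompositeKey();
        a.setFirstChar('h');
        a.setLength(3);

        CompositeKey b = new CompositeKey();
        b.setFirstChar('h');
        b.setLength(8);

        CompositeKey c = new CompositeKey();
        c.setFirstChar('z');
        c.setLength(2);

        // Same first character, so the shorter word sorts first
        System.out.println(a.compareTo(b) < 0); // true
        // Different first characters, so length is never consulted
        System.out.println(a.compareTo(c) < 0); // true ('h' sorts before 'z')
        System.out.println(c.compareTo(b) > 0); // true ('z' sorts after 'h')
    }
}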

Implement the Reducer

In this step, we will create a custom Reducer class to process the key-value pairs emitted by the Mapper and generate the final output.

First, create a Java file for the Reducer class:

touch /home/hadoop/WordLengthReducer.java

Then, add the following code to the WordLengthReducer.java file:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordLengthReducer extends Reducer<CompositeKey, Text, CompositeKey, Text> {

    @Override
    public void reduce(CompositeKey key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Collect every word that shares this composite key into one comma-separated string
        StringBuilder sb = new StringBuilder();
        for (Text value : values) {
            sb.append(value.toString()).append(", ");
        }
        // Remove the trailing ", " separator before writing the result
        sb.setLength(sb.length() - 2);
        context.write(key, new Text(sb.toString()));
    }
}

In the above code, we create a WordLengthReducer class that extends the Reducer class from the Hadoop MapReduce API. The reduce method takes a CompositeKey key (containing the first character and length of a word) and an Iterable of Text values (the words that match the key).

Inside the reduce method, we concatenate all the words that match the key into a comma-separated string. We use a StringBuilder to efficiently build the output string, and we remove the trailing comma and space before writing the key-value pair to the output.
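
For instance, using the sample output shown later, a single reduce call for the key h:8 with the values [hyrjMGbY, hSElGKux] would write the one output line "h:8	hyrjMGbY, hSElGKux", where the key and value are tab-separated by the default TextOutputFormat.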

Implement the Driver

In this step, we will create a Driver class to configure and run the MapReduce job.

First, create a Java file for the Driver class:

touch /home/hadoop/WordLengthDriver.java

Then, add the following code to the WordLengthDriver.java file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordLengthDriver {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordLengthDriver <input> <output>");
            System.exit(1);
        }

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Word Length");

        job.setJarByClass(WordLengthDriver.class);
        job.setMapperClass(WordLengthMapper.class);
        job.setReducerClass(WordLengthReducer.class);
        job.setOutputKeyClass(CompositeKey.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In the above code, we create a WordLengthDriver class that serves as the entry point for our MapReduce job. The main method takes two command-line arguments: the input path and the output path for the job.

Inside the main method, we create a new Configuration object and a new Job object. We configure the job by setting the mapper and reducer classes, the output key and value classes, and the input and output paths. Because the mapper and the reducer emit the same key and value types (CompositeKey and Text), setting the job-level output classes is sufficient; if they differed, we would also need to call setMapOutputKeyClass and setMapOutputValueClass.

Finally, we submit the job and wait for its completion. If the job completes successfully, we exit with a status code of 0; otherwise, we exit with a status code of 1.
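
Before running the job, make sure the input directory exists in HDFS and contains at least one text file to sort. The file name below is only a placeholder; substitute whatever data you want to process:

hadoop fs -mkdir -p /input
hadoop fs -put /home/hadoop/words.txt /input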

To compile the classes, package them into a JAR, and run the job, use the following commands:

javac -source 8 -target 8 -classpath "/home/hadoop/:/home/hadoop/hadoop/share/hadoop/common/hadoop-common-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/common/lib/*" -d /home/hadoop /home/hadoop/WordLengthMapper.java /home/hadoop/CompositeKey.java /home/hadoop/WordLengthReducer.java /home/hadoop/WordLengthDriver.java
jar cvf word-length.jar *.class
hadoop jar word-length.jar WordLengthDriver /input /output

Finally, we can check the results by running the following command:

hadoop fs -cat /output/*

Example output:

A:3 Amr
A:6 AADzCv
A:10    AlGyQumgIl
...
h:7 hgQUIhA
h:8 hyrjMGbY, hSElGKux
h:10    hmfHJjCkwB
...
z:6 zkpRCN
z:8 zfMHRbtk
z:9 zXyUuLHma

Summary

In this lab, we explored the concept of Hadoop Shuffle Comparable by implementing a MapReduce job that groups words based on their first character and length. We created a custom Mapper to emit key-value pairs with a composite key, a custom CompositeKey class that implements the WritableComparable interface, a Reducer to concatenate words with the same key, and a Driver class to configure and run the job.

Through this lab, we gained a deeper understanding of the Hadoop MapReduce framework and the importance of custom data types and sorting in distributed computing. By mastering Hadoop Shuffle Comparable, we can design efficient algorithms for data processing and analysis, unlocking the power of big data like the enigmatic mask dancer sorting the chaotic night market stalls.


🚀 Practice Now: Mystical Hadoop Sorting Secrets

