dev-resources.site
for different kinds of informations.
Unraveling the Secrets of Hadoop Sorting
Introduction
In a mysterious night market, a captivating figure adorned in an ornate mask gracefully moves through the bustling crowd. This enigmatic mask dancer seems to possess a secret power, effortlessly sorting the chaotic stalls into an orderly arrangement with each twirl and sway. Your goal is to unravel the mystery behind this remarkable talent by mastering the art of Hadoop Shuffle Comparable.
Implement the Mapper
In this step, we will create a custom Mapper class to process input data and emit key-value pairs. The key will be a composite key comprising two fields: the first character of each word and the length of the word. The value will be the word itself.
First, change the user to hadoop
and then switch to the home directory of the hadoop
user:
su - hadoop
Then, create a Java file for the Mapper class:
touch /home/hadoop/WordLengthMapper.java
Add the following code to the WordLengthMapper.java
file:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordLengthMapper extends Mapper<LongWritable, Text, CompositeKey, Text> {
private CompositeKey compositeKey = new CompositeKey();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String[] words = line.split("\\s+");
for (String word : words) {
compositeKey.setFirstChar(word.charAt(0));
compositeKey.setLength(word.length());
context.write(compositeKey, new Text(word));
}
}
}
In the above code, we create a WordLengthMapper
class that extends the Mapper
class from the Hadoop MapReduce API. The map
method takes a LongWritable
key (representing the byte offset of the input line) and a Text
value (the input line itself). It then splits the input line into individual words, creates a CompositeKey
object for each word (containing the first character and length of the word), and emits the CompositeKey
as the key and the word as the value.
Implement the CompositeKey
In this step, we will create a custom CompositeKey
class that implements the WritableComparable
interface from the Hadoop MapReduce API. This class will be used as the key in our MapReduce job, allowing us to sort and group the data based on the first character and length of each word.
First, create a Java file for the CompositeKey
class:
touch /home/hadoop/CompositeKey.java
Then, add the following code to the CompositeKey.java
file:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
public class CompositeKey implements WritableComparable<CompositeKey> {
private char firstChar;
private int length;
public CompositeKey() {
}
public void setFirstChar(char firstChar) {
this.firstChar = firstChar;
}
public char getFirstChar() {
return firstChar;
}
public void setLength(int length) {
this.length = length;
}
public int getLength() {
return length;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeChar(firstChar);
out.writeInt(length);
}
@Override
public void readFields(DataInput in) throws IOException {
firstChar = in.readChar();
length = in.readInt();
}
@Override
public int compareTo(CompositeKey other) {
int cmp = Character.compare(firstChar, other.firstChar);
if (cmp != 0) {
return cmp;
}
return Integer.compare(length, other.length);
}
@Override
public int hashCode() {
return firstChar + length;
}
@Override
public boolean equals(Object obj) {
if (obj instanceof CompositeKey) {
CompositeKey other = (CompositeKey) obj;
return firstChar == other.firstChar && length == other.length;
}
return false;
}
@Override
public String toString() {
return firstChar + ":" + length;
}
}
In the above code, we create a CompositeKey
class that implements the WritableComparable
interface. It has two fields: firstChar
(the first character of a word) and length
(the length of the word). The class provides getter and setter methods for these fields, as well as implementations of the write
, readFields
, compareTo
, hashCode
, equals
, and toString
methods required by the WritableComparable
interface.
The compareTo
method is particularly important, as it defines how the keys will be sorted in the MapReduce job. In our implementation, we first compare the firstChar
fields of the two keys. If they are different, we return the result of that comparison. If the firstChar
fields are the same, we then compare the length
fields.
Implement the Reducer
In this step, we will create a custom Reducer class to process the key-value pairs emitted by the Mapper and generate the final output.
First, create a Java file for the Reducer class:
touch /home/hadoop/WordLengthReducer.java
Then, add the following code to the WordLengthReducer.java
file:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordLengthReducer extends Reducer<CompositeKey, Text, CompositeKey, Text> {
public void reduce(CompositeKey key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
StringBuilder sb = new StringBuilder();
for (Text value : values) {
sb.append(value.toString()).append(", ");
}
sb.setLength(sb.length() - 2);
context.write(key, new Text(sb.toString()));
}
}
In the above code, we create a WordLengthReducer
class that extends the Reducer
class from the Hadoop MapReduce API. The reduce
method takes a CompositeKey
key (containing the first character and length of a word) and an Iterable
of Text
values (the words that match the key).
Inside the reduce
method, we concatenate all the words that match the key into a comma-separated string. We use a StringBuilder
to efficiently build the output string, and we remove the trailing comma and space before writing the key-value pair to the output.
Implement the Driver
In this step, we will create a Driver class to configure and run the MapReduce job.
First, create a Java file for the Driver class:
touch /home/hadoop/WordLengthDriver.java
Then, add the following code to the WordLengthDriver.java
file:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordLengthDriver {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: WordLengthDriver <input> <output>");
System.exit(1);
}
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Word Length");
job.setJarByClass(WordLengthDriver.class);
job.setMapperClass(WordLengthMapper.class);
job.setReducerClass(WordLengthReducer.class);
job.setOutputKeyClass(CompositeKey.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
In the above code, we create a WordLengthDriver
class that serves as the entry point for our MapReduce job. The main
method takes two command-line arguments: the input path and the output path for the job.
Inside the main
method, we create a new Configuration
object and a new Job
object. We configure the job by setting the mapper and reducer classes, the output key and value classes, and the input and output paths.
Finally, we submit the job and wait for its completion. If the job completes successfully, we exit with a status code of 0; otherwise, we exit with a status code of 1.
To run the job, you can use the following command:
javac -source 8 -target 8 -classpath "/home/hadoop/:/home/hadoop/hadoop/share/hadoop/common/hadoop-common-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:/home/hadoop/hadoop/share/hadoop/common/lib/*" -d /home/hadoop /home/hadoop/WordLengthMapper.java /home/hadoop/CompositeKey.java /home/hadoop/WordLengthReducer.java /home/hadoop/WordLengthDriver.java
jar cvf word-length.jar *.class
hadoop jar word-length.jar WordLengthDriver /input /output
Finally, we can check the results by running the following command:
hadoop fs -cat /output/*
Example output:
A:3 Amr
A:6 AADzCv
A:10 AlGyQumgIl
...
h:7 hgQUIhA
h:8 hyrjMGbY, hSElGKux
h:10 hmfHJjCkwB
...
z:6 zkpRCN
z:8 zfMHRbtk
z:9 zXyUuLHma
Summary
In this lab, we explored the concept of Hadoop Shuffle Comparable by implementing a MapReduce job that groups words based on their first character and length. We created a custom Mapper to emit key-value pairs with a composite key, a custom CompositeKey class that implements the WritableComparable interface, a Reducer to concatenate words with the same key, and a Driver class to configure and run the job.
Through this lab, I gained a deeper understanding of the Hadoop MapReduce framework and the importance of custom data types and sorting in distributed computing. By mastering Hadoop Shuffle Comparable, we can design efficient algorithms for data processing and analysis, unlocking the power of big data like the enigmatic mask dancer sorting the chaotic night market stalls.
π Practice Now: Mystical Hadoop Sorting Secrets
Want to Learn More?
- π³ Learn the latest Hadoop Skill Trees
- π Read More Hadoop Tutorials
- π¬ Join our Discord or tweet us @WeAreLabEx
Featured ones: