For a long while, I used the command line to run MapReduce programs against my Hadoop cluster with 'hadoop jar ${jar name} ${class name} -libjars ${lib jars}'. It's kind of terrible because I have to package the MapReduce program into a jar (I usually write it in an IDE like Eclipse, who doesn't?) and then switch to a terminal to run that long, long 'hadoop jar ...' command.
The reason I did so is that whenever I tried to run a MapReduce job from Eclipse, it always ran locally, meaning LocalJobRunner was invoked. My Hadoop cluster was not used (and neither was YARN). I searched online for a way around this, but was only told that MapReduce programs launched from Eclipse can only run in local mode. I kept doing it the old way until yesterday, when I decided to figure out MYSELF how to configure Eclipse to run a MapReduce program against a cluster. So here it is.
The environment I'm using (although I don't think it matters much):
Hadoop Client Package: Apache Hadoop 2.6.0
Hadoop Server Package: I tried both HDP 2.2 and Cloudera 5.4
Java -version: 1.7.0_21
System: Fedora 20
Eclipse: Luna Service Release 2 (4.4.2) (I'm not sure which of those is the 'version number')
Let's assume:
Your cluster runs here: hdp-mycluster.com
You have the Hadoop client configuration files under a directory: /tmp/conf (e.g. hadoop-env.sh, core-site.xml, ...)
You have environment variable: export HADOOP_CONF_DIR=/tmp/conf
You have a MapReduce program, say wordCount (the classic example).
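By the way, it is the settings inside /tmp/conf that steer the job to the cluster rather than LocalJobRunner. Boiled down to equivalent Java conf.set(...) calls, the key properties look roughly like this (the hostname is the one assumed above; the ports are common defaults and may differ on your cluster, so treat this as a sketch):

// Rough Java equivalent of the key entries in core-site.xml,
// mapred-site.xml and yarn-site.xml; ports may differ on your cluster.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://hdp-mycluster.com:8020");          // core-site.xml
conf.set("mapreduce.framework.name", "yarn");                        // mapred-site.xml; the default "local" means LocalJobRunner
conf.set("yarn.resourcemanager.address", "hdp-mycluster.com:8032"); // yarn-site.xml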
OK. Then here we go.
STEP 1: Eclipse -> Preferences -> Java -> Build Path -> User Libraries -> New (on the right) -> type a name (e.g. Hadoop) -> OK -> Add External JARs... -> choose all the Hadoop distribution jars (all jars under common, hdfs, yarn, mapreduce, tools, ...) -> OK.
Explain: When you run the 'hadoop jar ...' command in a terminal, Hadoop looks up all those jars on the Hadoop classpath and adds them to your classpath. With Eclipse, we have to do that manually. In my case (Apache Hadoop), the jars live under hadoop/share/hadoop/, and most jars in these subfolders are needed:
common/ hdfs/ httpfs/ kms/ mapreduce/ tools/ yarn/
STEP 2: Eclipse -> Preferences -> Java -> Build Path -> Classpath Variables -> New (on the right) -> type the name 'HADOOP_CONF_DIR' -> type or choose the path /tmp/conf (or wherever your Hadoop client configuration files are) -> OK
Explain: This classpath variable is critical for running a MapReduce job in distributed mode (against the cluster) from Eclipse. Without this step, your program will use LocalJobRunner.
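To double-check that the configuration really is picked up at runtime, a tiny sanity check like this helps (not one of the original steps, just a debugging aid):

import org.apache.hadoop.conf.Configuration;

public class ConfCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // With /tmp/conf on the classpath, these print your cluster's values.
        // If you still see "file:///" and "local", the config is NOT being found.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        System.out.println("mapreduce.framework.name = " + conf.get("mapreduce.framework.name"));
    }
}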
STEP 3: Right-click the WordCount project -> Build Path -> Configure Build Path -> 'Libraries' tab -> Add Library -> User Library -> Next -> select 'Hadoop' (or whatever User Library you created in STEP 1) -> Finish
Explain: This step makes the Hadoop library you created in STEP 1 available to this specific MapReduce project.
STEP 4: Right-click the WordCount project -> Run As -> Run Configurations -> Classpath tab -> select the WordCount entry -> Advanced (on the right) -> Add Classpath Variables -> OK -> select the HADOOP_CONF_DIR variable you created in STEP 2 -> OK
Explain: This step puts the HADOOP_CONF_DIR you created in STEP 2 on this specific run configuration's classpath.
STEP 5: Right-click the WordCount project -> Export -> Java -> JAR file -> export it wherever you want -> Finish
Explain: This jar is required to run the MapReduce program on the cluster, just as you have to specify a jar name when running the 'hadoop jar ...' command.
STEP 6: In your MapReduce job configuration (in the driver code), add the following line:
conf.set("mapreduce.job.jar", "/tmp/jar/wordCount.jar");
Replace the path with the path to the jar you exported in the last step.
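If you use the Job API rather than setting the property on the Configuration directly, job.setJar(...) should do the same thing (it just sets mapreduce.job.jar under the hood):

// Same effect as conf.set("mapreduce.job.jar", ...), via the Job object.
Job job = Job.getInstance(conf, "word count");
job.setJar("/tmp/jar/wordCount.jar"); // illustrative path; point it at your exported jar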
Then YOU ARE DONE!
JUST RUN IT.
Here is the logging I saw when running the wordCount program (the giveaway that it ran on the cluster is a tracking URL on the ResourceManager and a job ID that does not start with job_local):
And for your reference, here is my wordCount program code:
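A minimal WordCount along the lines of the classic example, with the STEP 6 line added in the driver (class name and jar path are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the counts for each word.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up /tmp/conf from the classpath (STEPs 2 & 4)
        conf.set("mapreduce.job.jar", "/tmp/jar/wordCount.jar"); // STEP 6; use your exported jar's path
        Job job = Job.getInstance(conf, "word count");
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}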
Lastly, ENJOY DEBUGGING in IDE.
[KEY WORD] DEBUGGING ECLIPSE HADOOP YARN MAPREDUCE DISTRIBUTED LOCAL JOB PATH CLUSTER CLIENT SERVER