The purpose of this post is to help Java developers who are new to Hadoop – especially those who have read about Hadoop and Big Data but are unsure how to get started. The problem is that the Hadoop-related APIs require a complicated infrastructure to be set up before you can run even a “Hello World” type of program.
Thankfully, Hadoop distribution vendors like Cloudera and Hortonworks provide ready-made virtual machines to help newcomers get started quickly.
What are we doing here
The Cloudera Quickstart VM for Hadoop comes as a single huge file (about 2.5 GB) that can be downloaded from the vendor’s site. The file is then opened with the appropriate virtualization player; CentOS Linux boots and opens a web page with an option to start Cloudera Manager. Once that is started, the various Hadoop services – HDFS, MapReduce, Hive, HBase, Solr, Flume and ZooKeeper – come to life. You are now good to run all your custom Hadoop jobs!
What do we need
- 64-bit host operating system (Windows)
- A laptop with at least 4 GB of RAM and a quad-core processor (4 CPUs)
- Cloudera Quickstart VM file for Oracle VirtualBox
- Oracle VirtualBox player
Step 1 – Launch the VM
Before we can do anything, the VM must be downloaded and launched. Download the file cloudera-quickstart-vm-4.3.0-virtualbox.tar.gz from the Cloudera site and extract it with any standard tool – WinZip, 7-Zip or similar (since it is a .tar.gz, you may need to extract twice: once for the .gz and once for the .tar). You will eventually see a file called cloudera-quickstart-vm-4.3.0-virtualbox.vmdk.
The above screenshots will help in making the various choices required while creating a new VM. Next, select the newly added VM in the VirtualBox Manager, go to Settings -> System -> Processor, and set the processor count to 2 CPUs (I have 4 CPUs and hence decided to give 2 to the VM; if you have more CPUs at your disposal, you can be a little more generous!).
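If you prefer the command line over the VirtualBox GUI, the same VM can be created with VBoxManage. This is only a sketch: the VM name "Cloudera-Quickstart" and the memory size are my own choices, and the commands assume the .vmdk sits in the current directory.

```shell
# Create and register a new 64-bit Red Hat-family VM (the name is arbitrary)
VBoxManage createvm --name "Cloudera-Quickstart" --ostype RedHat_64 --register

# Give it 4 GB of RAM and 2 CPUs (match this to your host's resources)
VBoxManage modifyvm "Cloudera-Quickstart" --memory 4096 --cpus 2

# Add a SATA controller and attach the downloaded .vmdk as the boot disk
VBoxManage storagectl "Cloudera-Quickstart" --name "SATA" --add sata
VBoxManage storageattach "Cloudera-Quickstart" --storagectl "SATA" \
  --port 0 --device 0 --type hdd --medium cloudera-quickstart-vm-4.3.0-virtualbox.vmdk
```

Either way, the end result is the same VM you would get from the screenshots above.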
You are all set now to start the VM. I received the error “This kernel requires an x86-64 CPU, but only detected an i686 CPU. Unable to boot – please use a kernel appropriate for your CPU” when the VM was trying to start. The solution is to enable Intel VT-x/AMD-V in the BIOS (courtesy – http://hereirestinremorse.wordpress.com/virtualbox/this-kernel-requires-an-x86-64-cpu-but-only-detected-an-i686-cpu-unable-to-boot-please-use-a-kernel-appropriate-for-your-cpu/). And that’s it – my fully loaded 64-bit CentOS Linux VM with the Cloudera distribution of Hadoop is ready to be played with!
Step 2 – Getting started with Hadoop Services
Once the VM is launched, CentOS loads and the Cloudera Hadoop Quickstart page opens up. Click ‘Cloudera Manager’ and you will be redirected to the admin console. The admin page does not load immediately, though, as the VM is configured to launch a basic set of Hadoop services on startup. Because these services are all launched at once, it took about 5 minutes on my laptop for the CM console to appear. Log in with admin/admin. To make future startups faster, I stopped the already running services manually (Actions -> Stop). The next time you launch CM after a VM restart, the admin page comes up quickly, with all services in a stopped state.
Now, the only two services you need to run a basic MapReduce program are HDFS and MapReduce. The rest of the services can be explored as you gain confidence with HDFS and MapReduce.
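With those two services started in Cloudera Manager, a quick sanity check from a terminal inside the VM confirms that HDFS is actually answering. A sketch – these commands rely on the Hadoop client the VM ships preconfigured:

```shell
# List the HDFS root directory - this succeeds only if the NameNode is up
hadoop fs -ls /

# List the running Hadoop daemon processes (NameNode, DataNode, JobTracker, TaskTracker, ...)
sudo jps
```

If `hadoop fs -ls /` returns a directory listing instead of a connection error, you are ready for the next step.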
Step 3 – Running the ubiquitous Word Count MapReduce program
The desktop of the VM has an Eclipse shortcut. Launch the Eclipse IDE from it and you will see a stub project created by Cloudera for building MapReduce programs. Under the same project, create your own package and include the WordCount.java class (http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html).
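Before running the Java version on the cluster, it helps to see what WordCount actually computes. The classic map -> shuffle/sort -> reduce flow can be mimicked with a plain Unix pipeline (an analogy only, not Hadoop itself; sample.txt and its contents are made up for illustration):

```shell
# A tiny, made-up input file
printf 'hello world\nhello hadoop\n' > sample.txt

# "map": emit one word per line; "shuffle/sort": group identical words together;
# "reduce": count each group
tr -s ' ' '\n' < sample.txt | sort | uniq -c
# ->   1 hadoop
#      2 hello
#      1 world   (counts are left-padded by uniq)
```

In the real job, each mapper plays the role of `tr`, the framework’s shuffle plays the role of `sort`, and each reducer plays the role of `uniq -c`.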
– Create the input and output directories in HDFS as listed in the Cloudera Hadoop Tutorial.
– Create the WordCount.java class in the already given sample project.
– Export the new package into a separate jar file.
– Submit the MapReduce job as shown in the Cloudera tutorial. The status of the job can be tracked via the Hadoop JobTracker link in the Mozilla Firefox bookmarks.
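Put together, the steps above look roughly like this from a terminal inside the VM. This is a sketch only: the directory names, jar name and main class name are my own placeholders – follow the Cloudera tutorial for the exact class and paths.

```shell
# Create an input directory in HDFS and load a sample file into it
hadoop fs -mkdir input
hadoop fs -put sample.txt input

# Package your compiled classes into a jar
# (in Eclipse: File -> Export -> JAR file; "bin/" is a placeholder output folder)
jar cf wordcount.jar -C bin/ .

# Submit the job: <jar> <main class> <input dir> <output dir>
# "WordCount" stands in for the fully qualified class name from the tutorial
hadoop jar wordcount.jar WordCount input output

# Inspect the results once the job finishes
hadoop fs -cat output/part-*
```

Note that the output directory must not exist before the job runs – Hadoop creates it and will refuse to overwrite an existing one.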