Hadoop for HPCers
Overview
This is a ~3 hour class that will introduce Hadoop to HPC users with a background in numerical simulation. We will walk through a brief overview of:
- The Hadoop Distributed File System (HDFS)
- MapReduce
- Pig
- Spark
Most examples will be written in Python.
VM Instructions
This course will feature hands-on work with a 1-node Hadoop cluster running in a virtual machine (VM) on your laptop. The VMs are created with Vagrant. Before the course, make sure the VM is up and running:
- Install VirtualBox on your laptop
- Download the virtual machine image you want to use:
- Start VirtualBox
- Choose "Import Appliance" and select the downloaded image; VirtualBox will uncompress the image, which takes a few minutes.
- Start the new virtual machine.
The GUI VM will start up a console with a full desktop environment; you can open a terminal and begin working. For the text VM, you will have to log in at the console; the username/password is vagrant/vagrant. For either machine, you can also ssh into the VM from a terminal on your laptop:
ssh vagrant@192.168.33.10
or ssh into the laptop from the VM with:
ssh [username]@192.168.33.1
Then make sure everything is working:
- From a terminal, start up the hadoop cluster by typing
~/bin/init.sh
You may have to answer "yes" a few times to start up some servers.
- Go to one of the example directories by typing
cd ~/examples/wordcount/streaming
- Then start the example by typing
make
You've now run what may be your first Hadoop job!
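To give a feel for what a streaming word count does, here is a minimal sketch in Python. It is not the code in ~/examples/wordcount/streaming; the function names (`mapper`, `reducer`, `run_locally`) and the sample text are hypothetical, chosen only for illustration. It assumes the usual Hadoop Streaming convention of tab-separated `key<TAB>value` records, with the reducer receiving its input sorted by key.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each word.

    The input must already be sorted by key, which is how the shuffle
    stage hands records to a streaming reducer.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process (no Hadoop needed)."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for a file stored in HDFS.
    sample = ["the quick brown fox", "jumps over the lazy dog"]
    for record in run_locally(sample):
        print(record)
```

In a real streaming job the mapper and reducer would typically be two separate scripts, each reading records from standard input and writing to standard output, with Hadoop doing the sort in between. Running both in one process, as above, is only a convenient way to check the logic before submitting a job.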