Hadoop for HPCers
Overview
This is a ~3-hour class that introduces Hadoop to HPC users with a background in numerical simulation. We will walk through a brief overview of:
- The Hadoop File System (HDFS)
- MapReduce
- Pig
- Spark
Most examples will be written in Python.
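As a taste of what the Python examples look like, here is a minimal PySpark word count. This is only an illustrative sketch, not one of the course examples; the input and output paths ("input.txt", "counts") are placeholders.

from pyspark import SparkContext

# connect to the (local, one-node) Spark runtime
sc = SparkContext(appName="wordcount")

# split lines into words, pair each word with a count of 1, and sum per word
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("counts")

The wordcount example in the VM, used below to check your installation, computes the same thing via Hadoop Streaming.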
VM Instructions
This course will feature hands-on work with a 1-node Hadoop cluster running on your laptop. The VMs are created with Vagrant. Before the course, make sure the VM is installed and running:
- Install VirtualBox on your laptop and start it.
- Under Settings or Preferences, go to Network, then Host-only networks, and add/create two host-only networks.
- Then download the virtual machine image you want to use (GUI or text).
- Choose "Import Appliance" and select the downloaded image; this will uncompress the image, which will take a few minutes.
- Start the new virtual machine.
The GUI VM will start up a console with a full desktop environment; you can open a terminal and begin working. For the text VM, you will have to log in at the console; the username/password is vagrant/vagrant. For either machine, you can also ssh into the VM from a terminal on your laptop:
ssh vagrant@192.168.33.10
or to the laptop from the VM with
ssh [username]@192.168.33.1
(If that particular address pair doesn't work, from a terminal within the VM, type "ifconfig | grep 192" to find a line like "inet addr: 192.168...."; that's the VM's IP address.)
Then make sure everything is working:
- From a terminal, start up the Hadoop cluster by typing
~/bin/init.sh
You may have to answer "yes" a few times to start up some servers.
- Go to one of the example directories by typing
cd ~/examples/wordcount/streaming
- Then start the example by typing
make
You've now run what may well be your first Hadoop job!
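For the curious: a Hadoop Streaming word count like the one in that directory is usually just two small Python scripts, a mapper and a reducer, glued together by the Hadoop streaming jar (which the Makefile invokes for you). The sketch below is illustrative only and may not match the VM's files exactly; the names mapper.py and reducer.py are assumptions.

#!/usr/bin/env python
# mapper.py (illustrative): emit one "word <TAB> 1" pair per word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py (illustrative): input arrives sorted by word, so sum each run of counts
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

A handy way to debug streaming scripts like these is to run them without Hadoop at all, piping a text file through them with a sort in between (cat file | ./mapper.py | sort | ./reducer.py).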
If you'd like, you can also build the virtual machine image yourself by installing Vagrant, downloading the Vagrantfile for the GUI or text image, and running "vagrant up".
If you can't get the VM working for whatever reason, please contact us and we will make alternate arrangements.