Hadoop for HPCers

From oldwiki.scinet.utoronto.ca

Overview

This is a ~3 hour class that will introduce Hadoop to HPC users with a background in numerical simulation. We will walk through a brief overview of:

  • The Hadoop File System (HDFS)
  • MapReduce
  • Pig
  • Spark

Most examples will be written in Python.

VM Instructions

This course will feature hands-on work with a single-node Hadoop cluster running in a virtual machine (VM) on your laptop. The VMs are created with Vagrant. Before the course, make sure your VM is up and running.

The GUI VM starts up a console with a full desktop environment; you can open a terminal and begin working. For the text VM, you will have to log in at the console; the username/password is vagrant/vagrant. For either machine, you can also ssh into the VM from a terminal on your laptop:

ssh vagrant@192.168.33.10

or to the laptop from the VM with

ssh [username]@192.168.33.1

(If that particular address pair doesn't work, open a terminal within the VM and type "ifconfig | grep 192" to find a line like "inet addr: 192.168...."; that's the VM's IP address.)
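
If you would rather pull the address out programmatically, the same lookup can be sketched in Python. This is only an illustration: the function name find_inet_addresses and the sample text are made up for this example, and the exact ifconfig output format on your VM may differ.

```python
import re


def find_inet_addresses(ifconfig_output, prefix="192."):
    """Return the IPv4 'inet' addresses in ifconfig-style text that start with prefix.

    Hypothetical helper: handles both the older 'inet addr:1.2.3.4' layout
    and the newer 'inet 1.2.3.4' layout. 'inet6' lines are skipped because
    the pattern requires whitespace right after 'inet'.
    """
    pattern = re.compile(r"inet\s+(?:addr:\s*)?(\d{1,3}(?:\.\d{1,3}){3})")
    return [addr for addr in pattern.findall(ifconfig_output)
            if addr.startswith(prefix)]


# Made-up sample text, shaped like typical ifconfig output.
sample_output = """\
eth1      Link encap:Ethernet
          inet addr:192.168.33.10  Bcast:192.168.33.255  Mask:255.255.255.0
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
"""

print(find_inet_addresses(sample_output))
```

Inside the VM you could feed it real output, for example the text returned by subprocess.run(["ifconfig"], capture_output=True, text=True).stdout.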

Then make sure everything is working:

  • From a terminal, start up the Hadoop cluster by typing
    ~/bin/init.sh
    You may have to answer "yes" a few times to start up some servers.
  • Go to one of the example directories by typing
    cd ~/examples/wordcount/streaming
  • Then start the example by typing
    make

You've now run what may be your first Hadoop job!
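
To give a feel for what a streaming word count does, here is a minimal sketch in Python. It is not the code in ~/examples/wordcount/streaming; the function names (map_lines, reduce_records, run_locally) are made up for illustration, and the whole map, sort, reduce pipeline is simulated in one process rather than run by Hadoop. In a real Hadoop Streaming job, the mapper and reducer would be separate scripts that read lines from standard input and print tab-separated key/value records.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_records(sorted_records):
    """Reducer: sum the counts for each run of identical keys.

    The records must already be sorted by key, which is what Hadoop's
    shuffle/sort stage guarantees, so all records for one word are adjacent.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process, without Hadoop."""
    return list(reduce_records(sorted(map_lines(lines))))


# Made-up sample input, just to exercise the pipeline.
sample = ["the quick brown fox", "The lazy dog"]
for record in run_locally(sample):
    print(record)
```

The sorted() call stands in for the shuffle step: it is what lets the reducer get away with a single pass using groupby instead of holding a dictionary of every word in memory.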

If you'd like, you can also create the virtual machine image yourself by downloading Vagrant and the Vagrantfile for the GUI or text image and running "vagrant up".

If you can't get the VM working for whatever reason, please contact us and we will make alternate arrangements.