SOSCIP GPU

Installed: September 2017
Operating System: Ubuntu 16.04 LE
Number of Nodes: 14x Power 8 with 4x NVIDIA P100
Interconnect: InfiniBand EDR
RAM/Node: 512 GB
Cores/Node: 2 x 10-core (20 physical, 160 SMT threads)
Login/Devel Node: sgc01
Vendor Compilers: xlc/xlf, nvcc

SOSCIP

The SOSCIP GPU Cluster is a Southern Ontario Smart Computing Innovation Platform (SOSCIP) resource located at the University of Toronto's SciNet HPC facility. The SOSCIP multi-university/industry consortium is funded by the Ontario Government and the Federal Economic Development Agency for Southern Ontario [1].

Support Email

Please use <soscip-support@scinet.utoronto.ca> for SOSCIP GPU specific inquiries.

Specifications

The SOSCIP GPU Cluster consists of 14 IBM Power 822LC "Minsky" servers, each with 2x 10-core 3.25GHz Power8 CPUs and 512GB of RAM. Like the Power 7, the Power 8 utilizes Simultaneous MultiThreading (SMT), but extends the design to 8 threads per core, allowing the 20 physical cores to support up to 160 threads. Each node has 4x NVIDIA Tesla P100 GPUs, each with 16GB of RAM and CUDA Compute Capability 6.0 (Pascal), connected using NVLink.

Compile/Devel/Test

Access is provided through the BGQ login node, bgqdev.scinet.utoronto.ca, using ssh; from there you can proceed to the GPU development node sgc01-ib0.
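
For example, assuming your SciNet username is USERNAME (substitute your own), a typical login sequence looks like:

# first hop: BGQ login node
ssh USERNAME@bgqdev.scinet.utoronto.ca

# second hop: GPU development node
ssh sgc01-ib0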

Filesystem

The filesystem is shared with the BGQ system; see the BGQ documentation for details.

Job Submission

The SOSCIP GPU cluster uses SLURM as its job scheduler, and jobs are scheduled by node, i.e., 20 cores and 4 GPUs per node. Jobs are submitted from the development node sgc01. The maximum walltime per job is 12 hours (except in the 'long' queue, see below), with up to 8 nodes per job.

$ sbatch myjob.script

where myjob.script contains:

#!/bin/bash
#SBATCH --nodes=1 
#SBATCH --ntasks=20  # MPI tasks (needed for srun) 
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

cd $SLURM_SUBMIT_DIR

hostname
nvidia-smi

You can query job information using

squeue

To cancel a job use

scancel $JOBID
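
To list only your own jobs, the standard SLURM user filter can be used, for example:

squeue -u $USER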

Longer jobs

If your job takes more than 12 hours, the sbatch command will not let you submit it. There is, however, a way to run jobs of up to 24 hours by specifying "-p long" as an option (i.e., add #SBATCH -p long to your job script, as in the sketch below). The priority of such jobs may be throttled in the future if we see that the 'long' queue is having a negative effect on turnaround time in the queue.
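
A minimal sketch of a job script header requesting the 'long' queue (the walltime and resources shown are illustrative):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=20       # MPI tasks (needed for srun)
#SBATCH --time=24:00:00   # up to 24 hours in the 'long' queue
#SBATCH --gres=gpu:4      # ask for 4 GPUs per node
#SBATCH -p long           # request the 'long' queue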

Interactive

For an interactive session use

salloc --gres=gpu:4
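
A fuller interactive request, for example one full node (20 cores, 4 GPUs) for one hour, could look like the following sketch (the time limit shown is illustrative):

salloc --nodes=1 --ntasks=20 --time=1:00:00 --gres=gpu:4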

Automatic Re-submission and Job Dependencies

Commonly you may have a job that you know will take longer to run than the queue's walltime limit allows. As long as your program has checkpoint/restart capability, you can have one job automatically submit the next. In the following example it is assumed that the program finishes before the requested time limit; the script then resubmits itself by logging into the development node. Job dependencies and a maximum number of job re-submissions are used to ensure sequential operation.

#!/bin/bash 

#SBATCH --nodes=1 
#SBATCH --ntasks=20  # MPI tasks (needed for srun) 
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

cd $SLURM_SUBMIT_DIR

: ${job_number:="1"}           # set job_number to 1 if it is undefined
job_number_max=3

echo "hi from ${SLURM_JOB_ID}"

#RUN JOB HERE


# SUBMIT NEXT JOB
if [[ ${job_number} -lt ${job_number_max} ]]
then
  (( job_number++ ))
  next_jobid=$(ssh sgc01-ib0 "cd $SLURM_SUBMIT_DIR; /opt/slurm/bin/sbatch --export=job_number=${job_number} -d afterok:${SLURM_JOB_ID} thisscript.sh | awk '{print \$4}'")
  echo "submitted ${next_jobid}"
fi
 
sleep 15

echo "${SLURM_JOB_ID} done"

Software

GNU Compilers

To load the newer GCC from the IBM Advance Toolchain, use:

module load gcc/6.3.1
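
A minimal compile sketch using the Advance Toolchain GCC (the source file name and optimization flags are illustrative):

module load gcc/6.3.1
gcc -O3 -mcpu=power8 -o hello hello.c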

IBM Compilers

To load the native IBM XL compilers (xlc/xlc++ and xlf), use:

module load xlc/13.1.5
module load xlf/15.1.5
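
A minimal compile sketch with the XL compilers (source file names and flags are illustrative):

module load xlc/13.1.5 xlf/15.1.5
xlc -O3 -o hello_c hello.c       # C
xlf90 -O3 -o hello_f hello.f90   # Fortran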

Driver Version

The current NVIDIA driver version is 384.66

CUDA

The currently installed CUDA Toolkit is 8.0

module load cuda/8.0

The CUDA driver is installed locally; the CUDA Toolkit is installed in:

/usr/local/cuda-8.0
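
A minimal CUDA compile sketch for the P100 GPUs (sm_60 corresponds to the Pascal compute capability 6.0 noted above; the source file name is illustrative):

module load cuda/8.0
nvcc -arch=sm_60 -o saxpy saxpy.cu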

OpenMPI

OpenMPI is currently set up on the 14 nodes, connected over EDR InfiniBand. Load one of the following builds:

$ module load openmpi/2.1.1-gcc-5.4.0
$ module load openmpi/2.1.1-XL-13_15.1.5
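
A minimal sketch of building and launching an MPI program inside a job script (the source file name is illustrative; srun launches the 20 tasks requested via --ntasks):

module load openmpi/2.1.1-gcc-5.4.0
mpicc -O2 -o hello_mpi hello_mpi.c

# inside the SLURM job script
srun ./hello_mpi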

IBM PowerAI

The PowerAI platform contains popular open machine learning frameworks such as Caffe, Tensorflow, and Torch. Run the module avail command for a complete listing. More information is available at https://developer.ibm.com/linuxonpower/deep-learning-powerai/releases/. Release 4.0 is currently installed.
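
To see which framework modules the PowerAI release provides, query the module system; the module name in the commented line below is purely illustrative, so use the exact names and versions reported by module avail:

module avail                # lists all modules, including the PowerAI frameworks
# module load tensorflow    # hypothetical module name; check module avail for the real one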