Revision as of 14:35, 7 September 2017

SOSCIP GPU
Installed September 2017
Operating System Ubuntu 16.04 LE
Number of Nodes 14x Power 8 with 4x NVIDIA P100
Interconnect Infiniband EDR
Ram/Node 512 GB
Cores/Node 2 x 10core (20 physical, 160 SMT)
Login/Devel Node sgc01
Vendor Compilers xlc/xlf, nvcc

SOSCIP

The SOSCIP GPU Cluster is a Southern Ontario Smart Computing Innovation Platform (SOSCIP) resource located at the University of Toronto's SciNet HPC facility. The SOSCIP multi-university/industry consortium is funded by the Ontario Government and the Federal Economic Development Agency for Southern Ontario [1].

Specifications

The SOSCIP GPU Cluster consists of 14 IBM Power 822LC "Minsky" servers, each with 2x 10-core 3.25GHz Power8 CPUs and 512GB of RAM. Similar to the Power 7, the Power 8 utilizes Simultaneous MultiThreading (SMT), but extends the design to 8 threads per core, allowing the 20 physical cores to support up to 160 threads. Each node has 4x NVIDIA Tesla P100 GPUs, each with 16GB of RAM and CUDA Capability 6.0 (Pascal), connected using NVLink.
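The SMT configuration can be inspected on a node with the ppc64_cpu utility from powerpc-utils; this is only an illustration, and the exact output depends on the node's configuration:

```shell
# Show the current SMT mode (e.g. SMT=8 on these nodes)
ppc64_cpu --smt

# Count the logical CPUs visible to the OS (up to 160 with SMT-8)
nproc
```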

Compile/Devel/Test

Access is provided through the BGQ login node, bgqdev.scinet.utoronto.ca using ssh, and from there you can proceed to the GPU development node sgc01.

Filesystem

The filesystem is shared with the BGQ system. See here for details.

Job Submission

The SOSCIP GPU cluster uses SLURM as its job scheduler, and jobs are scheduled by node, i.e., 20 cores and 4 GPUs each. Jobs are submitted from the development node sgc01.

$ sbatch myjob.script

Where myjob.script is

#!/bin/bash
#SBATCH --nodes=1 
#SBATCH --ntasks=20  # MPI tasks (needed for srun) 
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

cd $SLURM_SUBMIT_DIR

hostname
nvidia-smi
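When the job is submitted, sbatch prints the assigned job ID, and by default the job's output is written to slurm-<jobid>.out in the submission directory. A sketch (the job ID shown is only illustrative):

```shell
$ sbatch myjob.script
Submitted batch job 123456

# After the job completes, inspect its output:
$ cat slurm-123456.out
```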

You can query job information using

squeue
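To restrict the listing to your own jobs, squeue accepts a user filter:

```shell
# Show only your own jobs in the queue
squeue -u $USER
```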

To cancel a job use

scancel $JOBID

Software

GNU Compilers

To load the newer IBM Advance Toolchain version of GCC, use:

module load gcc/6.3.1
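With the GCC module loaded, code can be compiled and tuned for the Power8 CPUs; a minimal sketch (hello.c is a hypothetical source file):

```shell
# Compile, optimizing for the Power8 architecture
gcc -O3 -mcpu=power8 hello.c -o hello
```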

IBM Compilers

To load the native IBM xlc/xlc++ and xlf compilers:

module load xlc/13.1.5
module load xlf/15.1.5
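With the XL modules loaded, compilation might look as follows (source file names are hypothetical; the thread-safe _r variants are generally recommended):

```shell
# C with the IBM XL compiler, targeting Power8
xlc_r -O3 -qarch=pwr8 -qtune=pwr8 hello.c -o hello_c

# Fortran with the IBM XL Fortran compiler
xlf_r -O3 -qarch=pwr8 -qtune=pwr8 hello.f90 -o hello_f
```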

Driver Version

The current NVIDIA driver version is 384.66

CUDA

The currently installed CUDA Toolkit is 8.0

module load cuda/8.0
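Since the P100 GPUs have CUDA Capability 6.0, kernels can be compiled for the sm_60 target; a minimal sketch (saxpy.cu is a hypothetical source file):

```shell
# Compile for the Pascal (compute capability 6.0) GPUs
nvcc -arch=sm_60 -O3 saxpy.cu -o saxpy
```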

The CUDA driver is installed locally; however, the CUDA Toolkit is installed in:

/usr/local/cuda-8.0

OpenMPI

Currently, OpenMPI has been set up on the 14 nodes, connected over EDR Infiniband. Load one of the following modules, matching your compiler:

$ module load openmpi/2.1.1-gcc-5.4.0
$ module load openmpi/2.1.1-XL-13_15.1.5
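With one of the OpenMPI modules loaded, MPI codes are built with the usual wrapper compilers and launched through srun inside a batch script (file names are hypothetical):

```shell
# Build an MPI program with the OpenMPI wrapper compiler
mpicc -O2 hello_mpi.c -o hello_mpi

# Inside a job script, launch one task per requested --ntasks
srun ./hello_mpi
```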

PowerAI

The IBM PowerAI framework is installed.