Revision as of 14:35, 7 September 2017

SOSCIP GPU
Installed September 2017
Operating System Ubuntu 16.04 LE
Number of Nodes 14x Power 8 with 4x NVIDIA P100
Interconnect Infiniband EDR
Ram/Node 512 GB
Cores/Node 2 x 10core (20 physical, 160 SMT)
Login/Devel Node sgc01
Vendor Compilers xlc/xlf, nvcc

SOSCIP

The SOSCIP GPU Cluster is a Southern Ontario Smart Computing Innovation Platform (SOSCIP) resource located at the University of Toronto's SciNet HPC facility. The SOSCIP multi-university/industry consortium is funded by the Ontario Government and the Federal Economic Development Agency for Southern Ontario [1].

Specifications

The SOSCIP GPU Cluster consists of 14 IBM Power 822LC "Minsky" servers, each with 2x 10-core 3.25GHz Power8 CPUs and 512GB of RAM. Similar to the Power 7, the Power 8 utilizes Simultaneous MultiThreading (SMT), but extends the design to 8 threads per core, allowing the 20 physical cores to support up to 160 threads. Each node has 4x NVIDIA Tesla P100 GPUs, each with 16GB of RAM and CUDA Capability 6.0 (Pascal), connected using NVLink.
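The SMT configuration can be inspected on a node with the ppc64_cpu utility from powerpc-utils; this is only an illustration, and the exact output depends on the node's configuration:

```shell
# Show the current SMT mode (e.g. SMT=8 on these nodes)
ppc64_cpu --smt

# Count the logical CPUs visible to the OS (up to 160 with SMT-8)
nproc
```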

Compile/Devel/Test

Access is provided through the BGQ login node, bgqdev.scinet.utoronto.ca using ssh, and from there you can proceed to the GPU development node sgc01.

Filesystem

The filesystem is shared with the BGQ system. See here for details.

Job Submission

The SOSCIP GPU cluster uses SLURM as its job scheduler, and jobs are scheduled by node, i.e., 20 cores and 4 GPUs each. Jobs are submitted from the development node sgc01.

$ sbatch myjob.script

Where myjob.script is

#!/bin/bash
#SBATCH --nodes=1 
#SBATCH --ntasks=20  # MPI tasks (needed for srun) 
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

cd $SLURM_SUBMIT_DIR

hostname
nvidia-smi
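When the job is submitted, sbatch prints the assigned job ID, and by default the job's output is written to slurm-<jobid>.out in the submission directory. A sketch (the job ID shown is only illustrative):

```shell
$ sbatch myjob.script
Submitted batch job 123456

# After the job completes, inspect its output:
$ cat slurm-123456.out
```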

You can query job information using

squeue
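To restrict the listing to your own jobs, squeue accepts a user filter:

```shell
# Show only your own jobs in the queue
squeue -u $USER
```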

To cancel a job use

scancel $JOBID

Software

GNU Compilers

To load the newer IBM Advance Toolchain version of GCC, use:

module load gcc/6.3.1
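With the GCC module loaded, code can be compiled and tuned for the Power8 CPUs; a minimal sketch (hello.c is a hypothetical source file):

```shell
# Compile, optimizing for the Power8 architecture
gcc -O3 -mcpu=power8 hello.c -o hello
```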

IBM Compilers

To load the native IBM xlc/xlc++ and xlf compilers:

module load xlc/13.1.5
module load xlf/15.1.5
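With the XL modules loaded, compilation might look as follows (source file names are hypothetical; the thread-safe _r variants are generally recommended):

```shell
# C with the IBM XL compiler, targeting Power8
xlc_r -O3 -qarch=pwr8 -qtune=pwr8 hello.c -o hello_c

# Fortran with the IBM XL Fortran compiler
xlf_r -O3 -qarch=pwr8 -qtune=pwr8 hello.f90 -o hello_f
```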

Driver Version

The current NVIDIA driver version is 384.66

CUDA

The currently installed CUDA Toolkit is 8.0

module load cuda/8.0
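Since the P100 GPUs have CUDA Capability 6.0, kernels can be compiled for the sm_60 target; a minimal sketch (saxpy.cu is a hypothetical source file):

```shell
# Compile for the Pascal (compute capability 6.0) GPUs
nvcc -arch=sm_60 -O3 saxpy.cu -o saxpy
```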

The CUDA driver is installed locally; however, the CUDA Toolkit is installed in:

/usr/local/cuda-8.0

OpenMPI

Currently, OpenMPI has been set up on the 14 nodes, connected over EDR Infiniband. Load one of the following modules, matching your compiler:

$ module load openmpi/2.1.1-gcc-5.4.0
$ module load openmpi/2.1.1-XL-13_15.1.5
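With one of the OpenMPI modules loaded, MPI codes are built with the usual wrapper compilers and launched through srun inside a batch script (file names are hypothetical):

```shell
# Build an MPI program with the OpenMPI wrapper compiler
mpicc -O2 hello_mpi.c -o hello_mpi

# Inside a job script, launch one task per requested --ntasks
srun ./hello_mpi
```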

PowerAI

The IBM PowerAI framework is installed.