__NOTOC__

{| style="border-spacing: 8px; width:100%"
| valign="top" style="cellpadding:1em; padding:1em; border:2px solid; background-color:#f6f674; border-radius:5px"|
'''WARNING: SciNet is in the process of replacing this wiki with a new documentation site. For current information, please go to [https://docs.scinet.utoronto.ca https://docs.scinet.utoronto.ca]'''
|}
 
{{Infobox Computer
|image=[[Image:S882lc.png|center|300px|thumb]]
|name=SOSCIP GPU
|installed=September 2017
|operatingsystem=Ubuntu 16.04 le
|numberofnodes=14x Power 8 with 4x NVIDIA P100
|interconnect=Infiniband EDR
|rampernode=512 GB
|corespernode=2 x 10core (20 physical, 160 SMT)
|loginnode=sgc01
|vendorcompilers=xlc/xlf, nvcc
}}
== New Documentation Site ==

Please visit the new documentation site: [https://docs.scinet.utoronto.ca/index.php/SOSCIP_GPU https://docs.scinet.utoronto.ca/index.php/SOSCIP_GPU] for updated information.
  
 
== SOSCIP ==

The SOSCIP GPU Cluster is a Southern Ontario Smart Computing Innovation Platform (SOSCIP) resource located at the University of Toronto's SciNet HPC facility. The SOSCIP multi-university/industry consortium is funded by the Ontario Government and the Federal Economic Development Agency for Southern Ontario [1].

== Support Email ==

Please use [mailto:soscip-support@scinet.utoronto.ca <soscip-support@scinet.utoronto.ca>] for SOSCIP GPU specific inquiries.
  
<!--
 
== Specifications ==

=== Packing single-GPU jobs within one SLURM job submission ===

Jobs are scheduled by node (4 GPUs per node) on the SOSCIP GPU cluster. If a code cannot utilize all 4 GPUs, the GNU Parallel tool can be used to pack 4 or more single-GPU jobs into one SLURM job. Below is an example of submitting 4 single-GPU Python codes within one job. (When using GNU Parallel for a publication, please cite it as indicated by '''''parallel --citation'''''.)
 
<pre>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=20      # MPI tasks (needed for srun)
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

module load gnu-parallel/20180422

cd $SLURM_SUBMIT_DIR

parallel -a jobname-params.input --colsep ' ' -j 4 'CUDA_VISIBLE_DEVICES=$(( {%} - 1 )) numactl -N $(( ({%} - 1) / 2 )) python {1} {2} {3} &> jobname-{#}.out'
</pre>
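Assuming the script above is saved under a file name such as <tt>job-packing.sh</tt> (the name is only illustrative), it is submitted like any other SLURM batch script:
<pre>
sbatch job-packing.sh
</pre>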
The jobname-params.input file contains:
<pre>
code-1.py --param1=a --param2=b
code-2.py --param1=c --param2=d
code-3.py --param1=e --param2=f
code-4.py --param1=g --param2=h
</pre>
*In the above example, the GNU Parallel tool reads the '''jobname-params.input''' file and splits each row into parameters. Each row in the input file has to contain exactly 3 parameters to '''python'''; code-N.py is itself counted as a parameter. The number of parameters can be changed in the '''parallel''' command ({1} {2} {3}...).
*The '''"-j 4"''' flag limits the maximum number of simultaneous jobs to 4. The input file can contain more rows, but GNU Parallel will only execute at most 4 of them at the same time.
*'''"CUDA_VISIBLE_DEVICES=$(( {%} - 1 ))"''' assigns one GPU to each job. '''"numactl -N $(( ({%} - 1) / 2 ))"''' binds 2 jobs to CPU socket 0 and the other 2 jobs to socket 1. {%} is the job slot, which translates to 1, 2, 3, or 4 in this case.
*Outputs will be jobname-1.out, jobname-2.out, jobname-3.out, jobname-4.out, ... {#} is the job number, which translates to the row number in the input file. A minimal sketch of what one of these single-GPU Python codes might look like is shown below.
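The code-N.py scripts are user-supplied and are not part of this page; the following is only a minimal sketch of a hypothetical single-GPU script that accepts the two parameters passed by the '''parallel''' command above and reports which GPU was made visible to it:
<pre>
#!/usr/bin/env python
# Hypothetical single-GPU script (e.g. code-1.py) -- an illustration only,
# not part of the original instructions.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--param1', required=True)
parser.add_argument('--param2', required=True)
args = parser.parse_args()

# GNU Parallel exported CUDA_VISIBLE_DEVICES, so this process sees exactly one GPU.
gpu = os.environ.get('CUDA_VISIBLE_DEVICES', 'none')
print('param1=%s param2=%s visible GPU(s)=%s' % (args.param1, args.param2, gpu))

# ... the actual single-GPU workload (e.g. a TensorFlow or Caffe model) would go here ...
</pre>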
  
 
== Software Installed ==

The PowerAI platform contains popular open machine learning frameworks such as '''Caffe, TensorFlow, and Torch'''. Run the <tt>module avail</tt> command for a complete listing. More information is available at this link: https://developer.ibm.com/linuxonpower/deep-learning-powerai/releases/. Release 4.0 is currently installed.
  
=== GNU Compilers ===

The system default compiler is GCC 5.4.0. More recent versions of the GNU Compiler Collection (C/C++/Fortran) are provided by the IBM Advance Toolchain, with enhancements for the POWER8 CPU. To load a newer Advance Toolchain version use:
  
Advance Toolchain V10.0
<pre>
module load gcc/6.4.1
</pre>
  
Advance Toolchain V11.0
<pre>
module load gcc/7.3.1
</pre>
  
More information about the IBM Advance Toolchain can be found here: [https://developer.ibm.com/linuxonpower/advance-toolchain/ https://developer.ibm.com/linuxonpower/advance-toolchain/]
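As a sketch only (assuming a simple C source file <tt>hello.c</tt>, which is not part of this page), the Advance Toolchain compilers can target the POWER8 CPU with the usual GCC machine flags:
<pre>
module load gcc/7.3.1
gcc -O3 -mcpu=power8 -mtune=power8 -o hello hello.c
</pre>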
  
=== IBM XL Compilers ===

To load the native IBM xlc/xlc++ and xlf (Fortran) compilers, run

[https://www.ibm.com/support/knowledgecenter/SSAT4T_15.1.5/com.ibm.compilers.linux.doc/welcome.html IBM XL Fortran]
  
=== NVIDIA GPU Driver ===

The current NVIDIA driver version is 396.26.
  
=== CUDA ===

The currently installed CUDA Toolkits are versions 8.0, 9.0, 9.1, and 9.2.
  
 
<pre>
module load cuda/8.0
</pre>
or
<pre>
module load cuda/9.0
</pre>
or
<pre>
module load cuda/9.1
</pre>
or
<pre>
module load cuda/9.2
</pre>
  
<pre>
/usr/local/cuda-9.0
/usr/local/cuda-9.1
/usr/local/cuda-9.2
</pre>

Note that the <tt>/usr/local/cuda</tt> directory is linked to the <tt>/usr/local/cuda-9.2</tt> directory.
  
 
Documentation and API reference information for the CUDA Toolkit can be found here: [http://docs.nvidia.com/cuda/index.html http://docs.nvidia.com/cuda/index.html]
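As a quick sanity check (a sketch only; the exact version strings depend on which module is loaded), the toolkit the environment points to can be confirmed with:
<pre>
module load cuda/9.2
which nvcc
nvcc --version
</pre>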
  
=== OpenMPI ===

Currently, OpenMPI has been set up on the 14 nodes connected over EDR InfiniBand.
 
=== TensorFlow ===

* Activate the Python 2 virtual environment:
<pre>
source tensorflow-1.8-py2/bin/activate
</pre>
* Install TensorFlow into the virtual environment (a custom NumPy built with the OpenBLAS library can be installed first):
 
<pre>
pip install --upgrade --force-reinstall /scinet/sgc/Libraries/numpy/numpy-1.14.3-cp27-cp27mu-linux_ppc64le.whl
</pre>
 
* Activate the Python 3 virtual environment:
<pre>
source tensorflow-1.8-py3/bin/activate
</pre>
* Install TensorFlow into the virtual environment (a custom NumPy built with the OpenBLAS library can be installed first):
 
<pre>
pip3 install --upgrade --force-reinstall /scinet/sgc/Libraries/numpy/numpy-1.14.3-cp35-cp35m-linux_ppc64le.whl
</pre>
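After TensorFlow itself has been installed into the virtual environment (that step is not shown above), a short sanity check such as the following sketch (the file name <tt>check_gpu.py</tt> is only illustrative) confirms that TensorFlow can see the node's GPUs:
<pre>
# check_gpu.py -- illustrative sanity check, not part of the original instructions
import tensorflow as tf
from tensorflow.python.client import device_lib

print('TensorFlow version: ' + tf.__version__)
# The four P100 GPUs of a node should appear as /device:GPU:0 ... /device:GPU:3
print([d.name for d in device_lib.list_local_devices()])
</pre>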
  
 
# GPU Cluster Introduction: [[Media:GPU_Training_01.pdf‎|SOSCIP GPU Platform]]
-->
