__NOTOC__

{| style="border-spacing: 8px; width:100%"
| valign="top" style="cellpadding:1em; padding:1em; border:2px solid; background-color:#f6f674; border-radius:5px"|
'''WARNING: SciNet is in the process of replacing this wiki with a new documentation site. For current information, please go to [https://docs.scinet.utoronto.ca https://docs.scinet.utoronto.ca]'''
|}
 
{{Infobox Computer
|image=[[Image:S882lc.png|center|300px|thumb]]
|name=SOSCIP GPU
|installed=September 2017
|operatingsystem=Ubuntu 16.04 le
|numberofnodes=14x Power 8 with 4x NVIDIA P100
|interconnect=Infiniband EDR
|rampernode=512 GB
|corespernode=2 x 10core (20 physical, 160 SMT)
|loginnode=sgc01
|vendorcompilers=xlc/xlf, nvcc
}}
== New Documentation Site ==

Please visit the new documentation site: [https://docs.scinet.utoronto.ca/index.php/SOSCIP_GPU https://docs.scinet.utoronto.ca/index.php/SOSCIP_GPU] for updated information.
  
 
== SOSCIP ==

The SOSCIP GPU Cluster is a Southern Ontario Smart Computing Innovation Platform (SOSCIP) resource located at the University of Toronto's SciNet HPC facility. The SOSCIP multi-university/industry consortium is funded by the Ontario Government and the Federal Economic Development Agency for Southern Ontario [1].

== Support Email ==

Please use [mailto:soscip-support@scinet.utoronto.ca <soscip-support@scinet.utoronto.ca>] for SOSCIP GPU specific inquiries.
  
<!--
 
== Specifications ==

=== Packing single-GPU jobs within one SLURM job submission ===

Jobs are scheduled by node (4 GPUs per node) on the SOSCIP GPU cluster. If a code cannot utilize all 4 GPUs, the GNU Parallel tool can be used to pack 4 or more single-GPU jobs into one SLURM job. Below is an example of submitting 4 single-GPU Python codes within one job. (When using GNU Parallel for a publication, please cite it as indicated by '''''parallel --citation'''''.)
 
<pre>
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=20      # MPI tasks (needed for srun)
#SBATCH --time=00:10:00  # H:M:S
#SBATCH --gres=gpu:4     # Ask for 4 GPUs per node

module load gnu-parallel/20180422

cd $SLURM_SUBMIT_DIR

parallel -a jobname-params.input --colsep ' ' -j 4 'CUDA_VISIBLE_DEVICES=$(( {%} - 1 )) numactl -N $(( ({%} - 1) / 2 )) python {1} {2} {3} &> jobname-{#}.out'
</pre>
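Assuming the script above is saved under a file name such as <tt>job-packing.sh</tt> (the name is only illustrative), it is submitted like any other SLURM batch script:
<pre>
sbatch job-packing.sh
</pre>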
The jobname-params.input file contains:
<pre>
code-1.py --param1=a --param2=b
code-2.py --param1=c --param2=d
code-3.py --param1=e --param2=f
code-4.py --param1=g --param2=h
</pre>
*In the above example, the GNU Parallel tool reads the '''jobname-params.input''' file and splits each row into parameters. Each row in the input file has to contain exactly 3 parameters to '''python'''; code-N.py is itself counted as a parameter. The number of parameters can be changed in the '''parallel''' command ({1} {2} {3}...).
*The '''"-j 4"''' flag limits the maximum number of simultaneous jobs to 4. The input file can contain more rows, but GNU Parallel will only execute at most 4 of them at the same time.
*'''"CUDA_VISIBLE_DEVICES=$(( {%} - 1 ))"''' assigns one GPU to each job. '''"numactl -N $(( ({%} - 1) / 2 ))"''' binds 2 jobs to CPU socket 0 and the other 2 jobs to socket 1. {%} is the job slot, which translates to 1, 2, 3, or 4 in this case.
*Outputs will be jobname-1.out, jobname-2.out, jobname-3.out, jobname-4.out, ... {#} is the job number, which translates to the row number in the input file. A minimal sketch of what one of these single-GPU Python codes might look like is shown below.
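The code-N.py scripts are user-supplied and are not part of this page; the following is only a minimal sketch of a hypothetical single-GPU script that accepts the two parameters passed by the '''parallel''' command above and reports which GPU was made visible to it:
<pre>
#!/usr/bin/env python
# Hypothetical single-GPU script (e.g. code-1.py) -- an illustration only,
# not part of the original instructions.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--param1', required=True)
parser.add_argument('--param2', required=True)
args = parser.parse_args()

# GNU Parallel exported CUDA_VISIBLE_DEVICES, so this process sees exactly one GPU.
gpu = os.environ.get('CUDA_VISIBLE_DEVICES', 'none')
print('param1=%s param2=%s visible GPU(s)=%s' % (args.param1, args.param2, gpu))

# ... the actual single-GPU workload (e.g. a TensorFlow or Caffe model) would go here ...
</pre>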
  
 
== Software Installed ==

The PowerAI platform contains popular open machine learning frameworks such as '''Caffe, TensorFlow, and Torch'''. Run the <tt>module avail</tt> command for a complete listing. More information is available at this link: https://developer.ibm.com/linuxonpower/deep-learning-powerai/releases/. Release 4.0 is currently installed.
  
=== GNU Compilers ===

The system default compiler is GCC 5.4.0. More recent versions of the GNU Compiler Collection (C/C++/Fortran) are provided by the IBM Advance Toolchain, with enhancements for the POWER8 CPU. To load a newer Advance Toolchain version use:
  
Advance Toolchain V10.0
<pre>
module load gcc/6.4.1
</pre>
  
Advance Toolchain V11.0
<pre>
module load gcc/7.3.1
</pre>
  
More information about the IBM Advance Toolchain can be found here: [https://developer.ibm.com/linuxonpower/advance-toolchain/ https://developer.ibm.com/linuxonpower/advance-toolchain/]
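As a sketch only (assuming a simple C source file <tt>hello.c</tt>, which is not part of this page), the Advance Toolchain compilers can target the POWER8 CPU with the usual GCC machine flags:
<pre>
module load gcc/7.3.1
gcc -O3 -mcpu=power8 -mtune=power8 -o hello hello.c
</pre>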
  
=== IBM XL Compilers ===

To load the native IBM xlc/xlc++ and xlf (Fortran) compilers, run

[https://www.ibm.com/support/knowledgecenter/SSAT4T_15.1.5/com.ibm.compilers.linux.doc/welcome.html IBM XL Fortran]
  
=== NVIDIA GPU Driver ===

The current NVIDIA driver version is 396.26.
  
=== CUDA ===

The currently installed CUDA Toolkits are versions 8.0, 9.0, 9.1, and 9.2.
  
 
<pre>
module load cuda/8.0
</pre>
or
<pre>
module load cuda/9.0
</pre>
or
<pre>
module load cuda/9.1
</pre>
or
<pre>
module load cuda/9.2
</pre>
  
<pre>
/usr/local/cuda-9.0
/usr/local/cuda-9.1
/usr/local/cuda-9.2
</pre>

Note that the <tt>/usr/local/cuda</tt> directory is linked to the <tt>/usr/local/cuda-9.2</tt> directory.
  
 
Documentation and API reference information for the CUDA Toolkit can be found here: [http://docs.nvidia.com/cuda/index.html http://docs.nvidia.com/cuda/index.html]
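As a quick sanity check (a sketch only; the exact version strings depend on which module is loaded), the toolkit the environment points to can be confirmed with:
<pre>
module load cuda/9.2
which nvcc
nvcc --version
</pre>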
  
=== OpenMPI ===

Currently, OpenMPI has been set up on the 14 nodes connected over EDR InfiniBand.
 
=== TensorFlow ===

* Activate the Python 2 virtual environment:
<pre>
source tensorflow-1.8-py2/bin/activate
</pre>
* Install TensorFlow into the virtual environment (a custom NumPy built with the OpenBLAS library can be installed first):
 
<pre>
pip install --upgrade --force-reinstall /scinet/sgc/Libraries/numpy/numpy-1.14.3-cp27-cp27mu-linux_ppc64le.whl
</pre>
 
* Activate the Python 3 virtual environment:
<pre>
source tensorflow-1.8-py3/bin/activate
</pre>
* Install TensorFlow into the virtual environment (a custom NumPy built with the OpenBLAS library can be installed first):
 
<pre>
pip3 install --upgrade --force-reinstall /scinet/sgc/Libraries/numpy/numpy-1.14.3-cp35-cp35m-linux_ppc64le.whl
</pre>
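After TensorFlow itself has been installed into the virtual environment (that step is not shown above), a short sanity check such as the following sketch (the file name <tt>check_gpu.py</tt> is only illustrative) confirms that TensorFlow can see the node's GPUs:
<pre>
# check_gpu.py -- illustrative sanity check, not part of the original instructions
import tensorflow as tf
from tensorflow.python.client import device_lib

print('TensorFlow version: ' + tf.__version__)
# The four P100 GPUs of a node should appear as /device:GPU:0 ... /device:GPU:3
print([d.name for d in device_lib.list_local_devices()])
</pre>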
  
 
# GPU Cluster Introduction: [[Media:GPU_Training_01.pdf‎|SOSCIP GPU Platform]]
-->
