Difference between revisions of "SOSCIP GPU"

From oldwiki.scinet.utoronto.ca
 
(52 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 +
__NOTOC__
 +
 +
{| style="border-spacing: 8px; width:100%"
 +
| valign="top" style="cellpadding:1em; padding:1em; border:2px solid; background-color:#f6f674; border-radius:5px"|
 +
'''WARNING: SciNet is in the process of replacing this wiki with a new documentation site. For current information, please go to [https://docs.scinet.utoronto.ca https://docs.scinet.utoronto.ca]'''
 +
|}
 +
 
{{Infobox Computer
 
{{Infobox Computer
 
|image=[[Image:S882lc.png|center|300px|thumb]]
 
|image=[[Image:S882lc.png|center|300px|thumb]]
Line 11: Line 18:
 
|vendorcompilers=xlc/xlf, nvcc
 
|vendorcompilers=xlc/xlf, nvcc
 
}}
 
}}
 +
 +
== New Documentation Site ==
 +
Please visit the new documentation site: [https://docs.scinet.utoronto.ca/index.php/SOSCIP_GPU https://docs.scinet.utoronto.ca/index.php/SOSCIP_GPU] for updated information.
  
 
== SOSCIP ==
 
== SOSCIP ==
Line 20: Line 30:
 
Please use [mailto:soscip-support@scinet.utoronto.ca <soscip-support@scinet.utoronto.ca>] for SOSCIP GPU specific inquiries.
 
Please use [mailto:soscip-support@scinet.utoronto.ca <soscip-support@scinet.utoronto.ca>] for SOSCIP GPU specific inquiries.
  
 +
 +
<!--
 
== Specifications==
 
== Specifications==
  
Line 57: Line 69:
 
</pre>
 
</pre>
  
You can queury job information using
+
More information about the <tt>sbatch</tt> command is found [https://slurm.schedmd.com/sbatch.html here].
 +
 
 +
 
 +
You can query job information using
  
 
<pre>
 
<pre>
 
squeue
 
squeue
 
</pre>
 
</pre>
 +
 +
To see only your own jobs, run
 +
 +
<pre>
 +
squeue -u <userid>
 +
</pre>
 +
 +
Once your job is running, SLURM creates a file, usually named <tt>slurm-<jobid>.out</tt>, in the directory from which you issued the <tt>sbatch</tt> command. This file contains the console output of your job. You can follow the output as it is written by using the <tt>tail -f <file></tt> command.
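As a minimal sketch of the monitoring step, the helper below builds the output file name for a job ID. It assumes SLURM's default output pattern (<tt>slurm-%j.out</tt>) and that the job script does not override it with <tt>--output</tt>; the function name <tt>slurm_out_file</tt> is hypothetical.

```shell
#!/bin/bash
# Print the default SLURM output file name for a given job ID.
# Assumption: the job script does not set its own --output pattern.
slurm_out_file() {
    printf 'slurm-%s.out\n' "$1"
}

# Possible usage on the cluster (not run here):
#   jobid=$(sbatch --parsable myjob.script)   # --parsable prints only the job ID on a single cluster
#   tail -f "$(slurm_out_file "$jobid")"
slurm_out_file 12345
```

The output file only appears once the job has started, so <tt>tail -f</tt> may report a missing file while the job is still waiting in the queue.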
 +
  
 
To cancel a job use
 
To cancel a job use
Line 80: Line 104:
 
salloc --gres=gpu:4
 
salloc --gres=gpu:4
 
</pre>
 
</pre>
 +
 +
After executing this command, you may have to wait in the queue until a system is available.
 +
 +
More information about the <tt>salloc</tt> command is [https://slurm.schedmd.com/salloc.html here].
  
 
=== Automatic Re-submission and Job Dependencies ===
 
=== Automatic Re-submission and Job Dependencies ===
Line 116: Line 144:
  
 
</pre>
 
</pre>
 +
===Packing single-GPU jobs within one SLURM job submission===
 +
Jobs are scheduled by node (4 GPUs) on the SOSCIP GPU cluster. If your program cannot use all 4 GPUs, you can use the GNU Parallel tool to pack 4 or more single-GPU jobs into one SLURM job. Below is an example that runs 4 single-GPU Python programs within one job. (When using GNU Parallel for a publication, please cite it as per '''''parallel --citation'''''.)
 +
<pre>
 +
#!/bin/bash
 +
#SBATCH --nodes=1
 +
#SBATCH --ntasks=20  # MPI tasks (needed for srun)
 +
#SBATCH --time=00:10:00  # H:M:S
 +
#SBATCH --gres=gpu:4    # Ask for 4 GPUs per node
 +
 +
module load gnu-parallel/20180422
 +
cd $SLURM_SUBMIT_DIR
 +
 +
parallel -a jobname-params.input --colsep ' ' -j 4 'CUDA_VISIBLE_DEVICES=$(( {%} - 1 )) numactl -N $(( ({%} -1) / 2 )) python {1} {2} {3} &> jobname-{#}.out'
 +
</pre>
 +
The jobname-params.input file contains:
 +
<pre>
 +
code-1.py --param1=a --param2=b
 +
code-2.py --param1=c --param2=d
 +
code-3.py --param1=e --param2=f
 +
code-4.py --param1=g --param2=h
 +
</pre>
 +
*In the above example, the GNU Parallel tool reads the '''jobname-params.input''' file and splits each row into parameters. Each row in the input file must contain exactly 3 parameters for '''python'''; code-N.py also counts as a parameter. To use a different number of parameters, change the placeholders in the '''parallel''' command ({1} {2} {3}...).
 +
*The '''"-j 4"''' flag limits the number of simultaneous jobs to 4. The input file can have more rows, but GNU Parallel runs at most 4 of them at the same time.
 +
*'''"CUDA_VISIBLE_DEVICES=$(( {%} - 1 ))"''' assigns one GPU to each job. '''"numactl -N $(( ({%} -1) / 2 ))"''' binds 2 jobs to CPU socket 0 and the other 2 jobs to socket 1. {%} is the job slot number, which is 1, 2, 3 or 4 in this case.
 +
*Outputs are written to jobname-1.out, jobname-2.out, jobname-3.out, jobname-4.out, and so on. {#} is the job sequence number, which corresponds to the row number in the input file.
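The slot arithmetic in the notes above can be checked without a GPU node. The following sketch only replays the two shell expressions from the '''parallel''' command for job slots 1 to 4; the function name <tt>show_slot_mapping</tt> is hypothetical and nothing is launched.

```shell
#!/bin/bash
# For each GNU Parallel job slot (1..4), print the GPU index and CPU socket
# that the expressions in the parallel command would evaluate to.
show_slot_mapping() {
    for slot in 1 2 3 4; do
        gpu=$(( slot - 1 ))           # value given to CUDA_VISIBLE_DEVICES
        socket=$(( (slot - 1) / 2 ))  # value given to numactl -N (integer division)
        echo "slot=${slot} gpu=${gpu} socket=${socket}"
    done
}

show_slot_mapping
```

This prints GPUs 0 to 3 for slots 1 to 4, with slots 1 and 2 on socket 0 and slots 3 and 4 on socket 1.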
  
 
== Software Installed ==
 
== Software Installed ==
Line 123: Line 176:
 
The PowerAI platform contains popular open machine learning frameworks such as '''Caffe, TensorFlow, and Torch'''. Run the <tt>module avail</tt> command for a complete listing. More information is available at this link: https://developer.ibm.com/linuxonpower/deep-learning-powerai/releases/. Release 4.0 is currently installed.
 
The PowerAI platform contains popular open machine learning frameworks such as '''Caffe, TensorFlow, and Torch'''. Run the <tt>module avail</tt> command for a complete listing. More information is available at this link: https://developer.ibm.com/linuxonpower/deep-learning-powerai/releases/. Release 4.0 is currently installed.
  
==== GNU Compilers ====
+
===GNU Compilers ===
  
More recent versions of the GNU Compiler Collection (C/C++/Fortran) are provided in the IBM Advanced Toolchain with enhancements for the POWER8 CPU. To load the newer advance toolchain version use:
+
The system default compiler is GCC 5.4.0. More recent versions of the GNU Compiler Collection (C/C++/Fortran) are provided in the IBM Advance Toolchain, with enhancements for the POWER8 CPU. To load a newer Advance Toolchain version, use:
  
Advanced Toolchain V10.0
+
Advance Toolchain V10.0
 
<pre>
 
<pre>
module load gcc/6.3.1
+
module load gcc/6.4.1
 
</pre>
 
</pre>
  
Advanced Toolchain V11.0
+
Advance Toolchain V11.0
 
<pre>
 
<pre>
module load gcc/7.2.1
+
module load gcc/7.3.1
 
</pre>
 
</pre>
  
More information about the IBM Advanced Toolchain can be found here: [https://developer.ibm.com/linuxonpower/advance-toolchain/ https://developer.ibm.com/linuxonpower/advance-toolchain/]
+
More information about the IBM Advance Toolchain can be found here: [https://developer.ibm.com/linuxonpower/advance-toolchain/ https://developer.ibm.com/linuxonpower/advance-toolchain/]
  
==== IBM XL Compilers ====
+
=== IBM XL Compilers ===
  
 
To load the native IBM xlc/xlc++ and xlf (Fortran) compilers, run
 
To load the native IBM xlc/xlc++ and xlf (Fortran) compilers, run
Line 156: Line 209:
 
[https://www.ibm.com/support/knowledgecenter/SSAT4T_15.1.5/com.ibm.compilers.linux.doc/welcome.html IBM XL Fortran]
 
[https://www.ibm.com/support/knowledgecenter/SSAT4T_15.1.5/com.ibm.compilers.linux.doc/welcome.html IBM XL Fortran]
  
==== NVIDIA GPU Driver Version ====
+
=== NVIDIA GPU Driver ===
  
The current NVIDIA driver version is 384.66
+
The current NVIDIA driver version is 396.26
  
==== CUDA ====
+
=== CUDA ===
  
The current installed CUDA Tookits is are version 8.0 and version 9.0
+
The currently installed CUDA Toolkit versions are 8.0, 9.0, 9.1 and 9.2.
  
 
<pre>
 
<pre>
 
module load cuda/8.0
 
module load cuda/8.0
</pre>
 
 
 
or  
 
or  
 
<pre>
 
 
module load cuda/9.0
 
module load cuda/9.0
 +
or
 +
module load cuda/9.1
 +
or
 +
module load cuda/9.2
 
</pre>
 
</pre>
  
Line 180: Line 233:
 
/usr/local/cuda-8.0
 
/usr/local/cuda-8.0
 
/usr/local/cuda-9.0
 
/usr/local/cuda-9.0
 +
/usr/local/cuda-9.1
 +
/usr/local/cuda-9.2
 
</pre>
 
</pre>
  
Note that the <tt>/usr/local/cuda</tt> directory is linked to the <tt>/usr/local/cuda-9.0</tt> directory.
+
Note that the <tt>/usr/local/cuda</tt> directory is linked to the <tt>/usr/local/cuda-9.2</tt> directory.
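To confirm which toolkit a path such as <tt>/usr/local/cuda</tt> currently points at, the symbolic link can be resolved directly. This is a minimal sketch, assuming <tt>readlink -f</tt> is available (it is part of GNU coreutils) and that the toolkit directories are named <tt>cuda-<version></tt> as listed above; the function name <tt>cuda_toolkit_version</tt> is hypothetical.

```shell
#!/bin/bash
# Resolve a CUDA path through any symbolic links and print the version
# suffix of the directory it points at (for example 9.2 for cuda-9.2).
cuda_toolkit_version() {
    target=$(readlink -f "$1")
    basename "$target" | sed 's/^cuda-//'
}

# Only query the default path if it exists on this machine.
if [ -e /usr/local/cuda ]; then
    cuda_toolkit_version /usr/local/cuda
fi
```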
  
 
Documentation and API reference information for the CUDA Toolkit can be found here: [http://docs.nvidia.com/cuda/index.html http://docs.nvidia.com/cuda/index.html]
 
Documentation and API reference information for the CUDA Toolkit can be found here: [http://docs.nvidia.com/cuda/index.html http://docs.nvidia.com/cuda/index.html]
  
==== OpenMPI ====
+
=== OpenMPI ===
  
 
Currently OpenMPI has been setup on the 14 nodes connected over EDR Infiniband.
 
Currently OpenMPI has been setup on the 14 nodes connected over EDR Infiniband.
Line 210: Line 265:
  
 
TIP: If you plan to use Tensorflow within Anaconda, download the Python 2.7 version of Anaconda
 
TIP: If you plan to use Tensorflow within Anaconda, download the Python 2.7 version of Anaconda
 +
 +
=== cuDNN ===
 +
The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN accelerates widely used deep learning frameworks, including Caffe2, MATLAB, Microsoft Cognitive Toolkit, TensorFlow, Theano, and PyTorch. If a specific version of cuDNN is needed, users can download it from https://developer.nvidia.com/cudnn by choosing '''"cuDNN [VERSION] Library for Linux (Power8/Power9)"'''.
 +
 +
The default cuDNN installed on the system is version 6 with CUDA-8 from IBM PowerAI. More recent cuDNN versions are installed as modules:
 +
<pre>
 +
cudnn/cuda9.0/7.0.5
 +
</pre>
  
 
=== Keras ===
 
=== Keras ===
Line 215: Line 278:
 
Keras ([https://keras.io/ https://keras.io/]) is a popular high-level deep learning software development framework. It runs on top of other deep-learning frameworks such as TensorFlow.
 
Keras ([https://keras.io/ https://keras.io/]) is a popular high-level deep learning software development framework. It runs on top of other deep-learning frameworks such as TensorFlow.
  
The easiest way to install Keras is to install Anaconda first, then install Keras by using using the pip command.
+
*The easiest way to install Keras is to install Anaconda first, then install Keras using the pip command. Keras uses TensorFlow underneath to run neural network models. Before running code using Keras, be sure to load the PowerAI TensorFlow module and the cuda module.
  
Keras uses TensorFlow underneath to run neural network models. Before running code using Keras, be sure to load the PowerAI TensorFlow module and the cuda module.
+
*Keras can also be installed into a Python virtual environment by using '''pip'''. Users can install the optimized SciPy (built with OpenBLAS) before installing Keras.
 +
In a virtual environment (python2.7 as example):
 +
<pre>
 +
pip install /scinet/sgc/Libraries/scipy/scipy-1.1.0-cp27-cp27mu-linux_ppc64le.whl
 +
pip install keras
 +
</pre>
 +
 
 +
=== NumPy/SciPy (built with OpenBLAS) ===
 +
 
 +
Optimized NumPy and SciPy are provided as Python wheels located in '''/scinet/sgc/Libraries/numpy''' and '''/scinet/sgc/Libraries/scipy''' and can be installed by '''pip'''. Please uninstall old numpy/scipy before installing the new ones.
  
 
=== PyTorch ===
 
=== PyTorch ===
Line 236: Line 308:
  
 
NOTE: Do not have the gcc modules loaded when building PyTorch. Use the default version of gcc (currently v5.4.0) included with the operating system. Build will fail with later versions of gcc.
 
NOTE: Do not have the gcc modules loaded when building PyTorch. Use the default version of gcc (currently v5.4.0) included with the operating system. Build will fail with later versions of gcc.
 +
 +
=== TensorFlow (new versions and python3) ===
 +
 +
The TensorFlow version included in PowerAI may not be the most recent. Newer versions of TensorFlow are provided as prebuilt Python wheels that users can install in user space with '''pip'''. The custom Python wheels are stored in '''/scinet/sgc/Applications/TensorFlow_wheels'''. It is highly recommended to install custom TensorFlow wheels into a Python virtual environment.
 +
 +
====Installing with Python2.7:====
 +
<div class="toccolours mw-collapsible mw-collapsed" style="overflow:auto;">
 +
* Create a virtual environment '''tensorflow-1.8-py2''' that can also use the packages installed with the system:
 +
<pre>
 +
virtualenv --python=python2.7 --system-site-packages tensorflow-1.8-py2
 +
</pre>
 +
* Activate virtual environment:
 +
<pre>
 +
source tensorflow-1.8-py2/bin/activate
 +
</pre>
 +
* Install TensorFlow into the virtual environment: (A custom Numpy built with OpenBLAS library can be installed)
 +
<pre>
 +
pip install --upgrade --force-reinstall /scinet/sgc/Libraries/numpy/numpy-1.14.3-cp27-cp27mu-linux_ppc64le.whl
 +
pip install /scinet/sgc/Applications/TensorFlow_wheels/tensorflow-1.8.0-cp27-cp27mu-linux_ppc64le.whl
 +
</pre>
 +
</div>
 +
 +
====Installing with Python3.5:====
 +
<div class="toccolours mw-collapsible mw-collapsed" style="overflow:auto;">
 +
* Create a virtual environment '''tensorflow-1.8-py3''' that can also use the packages installed with the system:
 +
<pre>
 +
virtualenv --python=python3.5 --system-site-packages tensorflow-1.8-py3
 +
</pre>
 +
* Activate virtual environment:
 +
<pre>
 +
source tensorflow-1.8-py3/bin/activate
 +
</pre>
 +
* Install TensorFlow into the virtual environment: (A custom Numpy built with OpenBLAS library can be installed)
 +
<pre>
 +
pip3 install --upgrade --force-reinstall /scinet/sgc/Libraries/numpy/numpy-1.14.3-cp35-cp35m-linux_ppc64le.whl
 +
pip3 install /scinet/sgc/Applications/TensorFlow_wheels/tensorflow-1.8.0-cp35-cp35m-linux_ppc64le.whl
 +
</pre>
 +
</div>
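The wheel file names used above encode the Python version they were built for (for example <tt>cp27</tt> or <tt>cp35</tt>), following the standard wheel naming scheme <tt>name-version-pythontag-abitag-platform.whl</tt>. The sketch below extracts that tag so it can be compared with the interpreter in the virtual environment. It assumes file names without the optional build-tag field and a CPython interpreter; both function names are hypothetical.

```shell
#!/bin/bash
# Print the Python tag (third dash-separated field) of a wheel file name.
# Assumption: the name has no optional build-tag field.
wheel_python_tag() {
    basename "$1" .whl | cut -d- -f3
}

# Print the corresponding tag for a given interpreter, e.g. cp27 or cp35.
# Assumption: the interpreter is CPython.
interpreter_tag() {
    "$1" -c 'import sys; print("cp%d%d" % sys.version_info[:2])'
}

wheel_python_tag /scinet/sgc/Applications/TensorFlow_wheels/tensorflow-1.8.0-cp35-cp35m-linux_ppc64le.whl
```

Inside an activated virtual environment, comparing <tt>wheel_python_tag <wheel></tt> with <tt>interpreter_tag python</tt> before running '''pip''' shows whether the wheel matches the interpreter.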
 +
 +
====Submitting jobs====
 +
<div class="toccolours mw-collapsible mw-collapsed" style="overflow:auto;">
 +
The myjob.script file above needs to be modified to run the custom TensorFlow: the '''cuda/9.0''' and '''cudnn/cuda9.0/7.0.5''' modules must be loaded, and the virtual environment must be activated.
 +
<pre>
 +
#!/bin/bash
 +
#SBATCH --nodes=1
 +
#SBATCH --ntasks=20  # MPI tasks (needed for srun)
 +
#SBATCH --time=00:10:00  # H:M:S
 +
#SBATCH --gres=gpu:4    # Ask for 4 GPUs per node
 +
 +
module purge
 +
module load cuda/9.0 cudnn/cuda9.0/7.0.5
 +
source tensorflow-1.8-py2/bin/activate #change this to the location where virtual environment is created
 +
 +
cd $SLURM_SUBMIT_DIR
 +
python code.py
 +
</pre>
 +
</div>
  
 
== LINKS ==
 
== LINKS ==
Line 243: Line 372:
 
== DOCUMENTATION ==
 
== DOCUMENTATION ==
  
# GPU Cluster Introduction: [[Media:SOSCIP_GPU_Platform.pdf|SOSCIP GPU Platform]]
+
# GPU Cluster Introduction: [[Media:GPU_Training_01.pdf‎|SOSCIP GPU Platform]]
 +
-->

Latest revision as of 15:17, 5 October 2018


WARNING: SciNet is in the process of replacing this wiki with a new documentation site. For current information, please go to https://docs.scinet.utoronto.ca

SOSCIP GPU
Installed: September 2017
Operating System: Ubuntu 16.04 le
Number of Nodes: 14x Power 8 with 4x NVIDIA P100
Interconnect: Infiniband EDR
Ram/Node: 512 GB
Cores/Node: 2 x 10core (20 physical, 160 SMT)
Login/Devel Node: sgc01
Vendor Compilers: xlc/xlf, nvcc

New Documentation Site

Please visit the new documentation site: https://docs.scinet.utoronto.ca/index.php/SOSCIP_GPU for updated information.

SOSCIP

The SOSCIP GPU Cluster is a Southern Ontario Smart Computing Innovation Platform (SOSCIP) resource located at the University of Toronto's SciNet HPC facility. The SOSCIP multi-university/industry consortium is funded by the Ontario Government and the Federal Economic Development Agency for Southern Ontario [1].

Support Email

Please use <soscip-support@scinet.utoronto.ca> for SOSCIP GPU specific inquiries.