R Statistical Package

From oldwiki.scinet.utoronto.ca
Revision as of 11:20, 11 February 2015 by Ejspence

Running R on the GPC

R is a powerful statistical and plotting package, available on the GPC through the R module. Six versions are currently installed: 2.13.1, 2.14.1, 2.15.1, 3.0.0, 3.0.1 and 3.1.1. While 2.15.1 is the default, we recommend moving to a newer version, which you load by specifying the version number explicitly:

$ module load intel R/3.0.1

(The intel module is a prerequisite for the R module.) If you will be using Rmpi, you will need to load the openmpi module as well.

Many optional packages are available for R which add functionality for specific domains; they are available through the Comprehensive R Archive Network (CRAN).

R provides an easy way for users to install the packages they need in their home directories rather than having them installed system-wide. Because there are so many optional packages that users might want, we recommend that anyone who needs additional packages proceed this way. This is almost certainly the easiest way to deal with the wide range of packages, keep them up to date, and ensure that users' package choices don't conflict.

In general, you can install the packages you need yourself in your home directory; e.g.,

$ R 
> install.packages("package-name", dependencies = TRUE)

will download and compile the source for the packages you need in your home directory, under a directory such as ${HOME}/R/x86_64-unknown-linux-gnu-library/3.0/ (the last component is the major.minor version of the R you are running, here 3.0 for R 3.0.1). You can specify another directory with the lib= option. Then take a look at help(".libPaths") to make sure that R knows where to look for the packages you've compiled.

Note that during the installation you may get warnings that the packages cannot be installed in e.g. /scinet/gpc/Applications/R/3.0.1/lib64/R/bin/. But after those messages, R should have succeeded in installing the package into your home directory.
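
If you would rather keep your packages in a directory of your own choosing, one possible approach is to set the R_LIBS_USER environment variable before starting R. R adds the directories listed there to its library search path at startup, but only if they already exist. The following is a minimal sketch, assuming a hypothetical location ${HOME}/R/library (the directory name is only an illustration, not a SciNet convention):

```shell
# Hypothetical directory for personal R packages; any writable path will do.
export R_LIBS_USER="${HOME}/R/library"

# R silently ignores R_LIBS_USER entries that do not exist, so create it first.
mkdir -p "${R_LIBS_USER}"

echo "Personal R library: ${R_LIBS_USER}"
```

With these lines in your job scripts (and in the shell where you install packages), .libPaths() inside R should list this directory ahead of the system-wide library, and install.packages() should then install into it by default.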

Running serial R jobs

As with all serial jobs, if your R computations do not use multiple cores, you should bundle them so that all 8 cores of a node are performing work. Examples of this can be found on the User_Serial page.
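
The basic idea behind bundling can be sketched as follows: start eight tasks in the background, one per core, and wait for all of them before the job script ends. This is only an illustration and not a replacement for the recipes on the User_Serial page. TASK_CMD and the output file names are hypothetical placeholders; TASK_CMD defaults to echo so that the sketch can be tried anywhere. In a real job you would load the R module first and set it to something like "Rscript myscript.R".

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=8

# Hypothetical placeholder for the real work, e.g. TASK_CMD="Rscript myscript.R".
# It defaults to echo so that this sketch can be tried without R.
TASK_CMD=${TASK_CMD:-echo}

# Start one background task per core; each receives its index as an argument.
# TASK_CMD is deliberately unquoted so that "Rscript myscript.R" splits into words.
for i in 1 2 3 4 5 6 7 8; do
  $TASK_CMD "$i" > "task_${i}.out" 2>&1 &
done

# Block until every background task has finished; otherwise the job script
# would end as soon as the loop is done, before the tasks complete.
wait
```

Each task writes to its own output file, so the eight runs do not overwrite one another's results.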

Saving images from R in compute jobs

To make use of its graphics capabilities, R insists on having an X server running, even if you're just writing to a file. There is no X server on the compute nodes, so you would get a message like

unable to open connection to X11 display 

To get around this issue, you can run a 'virtual' X server on the compute nodes by adding the following commands at the start of your job script:

# Make virtual X server command called Xvfb available:
module load Xlibraries
# Select a unique display number:
export DISPLAY=":$UID"
# Start the virtual X server
Xvfb $DISPLAY -ac 2>/dev/null &

After this, run R or Rscript as usual. The virtual X server will be running in the background and will get killed when your job is done. Alternatively, you may want to kill it explicitly at the end of your job script using

# Kill any remaining Xvfb server
pkill -u $UID Xvfb

Rmpi (R with MPI)

All the newer R installations on the GPC have Rmpi installed by default using OpenMPI. Be sure to load the OpenMPI module if you wish to use Rmpi.

Installing Rmpi for R version 2.13.1

R version 2.13.1 does not have the Rmpi library as a standard package, which means you have to install it yourself if you are using that version. The same is true if you want to use IntelMPI instead of OpenMPI.

Installing the Rmpi package can be a bit challenging, since the installation needs some additional parameters which contain the paths to various header files and libraries. These paths differ depending on which MPI implementation you are using.

The various MPI versions on the GPC are loaded with the module command. So the first thing to do is to decide which MPI implementation to use (openmpi or intelmpi), and to type the corresponding "module load" command on the command line (as well as in your job scripts).

Because the MPI modules define all the paths in environment variables, the following command seems to work for all openmpi versions.

> install.packages("Rmpi",
                   configure.args =
                   c(paste("--with-Rmpi-include=",Sys.getenv("SCINET_MPI_INC"),sep=""),
                     paste("--with-Rmpi-libpath=",Sys.getenv("SCINET_MPI_LIB"),sep=""),
                     "--with-Rmpi-type=OPENMPI"))

For intelmpi, you only need to change OPENMPI to MPICH2 in the last line.

Running Rmpi

To start using R with Rmpi, make sure you have all required modules loaded (e.g. module load intel openmpi R/2.14.1), then launch it with

$ mpirun -np 1 R --no-save

which starts one master MPI process and sets up the infrastructure needed to spawn additional processes.

Creating an R cluster

The 'parallel' package allows you to use R to launch individual serial subjobs across multiple nodes. This section describes how this is accomplished.

Creating your Rscript wrapper

The first thing to do is create a wrapper for Rscript. This needs to be done because the R module needs to be loaded on all nodes, but the submission script only loads modules on the head node of the job. The wrapper script, let's call it MyRscript.sh, is short:

#!/bin/bash

module load intel/13.1.1 R/3.0.1
${SCINET_R_BIN}/Rscript --no-restore "$@"

The "--no-restore" flag prevents Rscript from loading your "workspace image", if you have one saved. Loading the image causes problems for the cluster.

Once you've created your wrapper, make it executable:

$ chmod u+x MyRscript.sh

Your wrapper is now ready to be used.

The cluster R code

The R code which we will run consists of two parts: the code which launches the cluster and does pre- and post-analysis, and the code which is run on the individual cluster "nodes". Here is some code which demonstrates this functionality. Let's call it MyClusterCode.R.


######################################################
#
#  worker code
#

# first define the function which will be run on all the cluster nodes.  This is just a test function.  
# Put your real worker code here.
testfunc <- function(a) {

  # this part is just to waste time
  b <- 0
  for (i in 1:10000) {
      b <- b + 1
  }

  s <- Sys.info()['nodename']
  return(paste0(s, " ", a[1], " ", a[2]))

}


######################################################
#
#  head node code
#

# Create a bunch of index pairs to feed to the worker function.  These could be parameters,
# or whatever your code needs to vary across jobs.  Note that the worker function only
# takes a single argument; each entry in the list must contain all the information
# that the function needs to run.  In this example, each entry is a vector which
# contains two pieces of information, a pair of indices.
indexlist <- list()
index <- 1
for (i in 1:10) {
  for (j in 1:10) {
    indexlist[[index]] <- c(i, j)
    index <- index + 1
  }
}
 

# Now set up the cluster.

# First load the parallel library.
library(parallel)

# Next find all the nodes which the scheduler has given to us.
# These are listed in the file which is indicated by the PBS_NODEFILE
# environment variable.
nodefile <- Sys.getenv("PBS_NODEFILE")
hostnames <- readLines(nodefile)

# Now launch the cluster, using the list of nodes and our Rscript
# wrapper.
cl <- makePSOCKcluster(names = hostnames, rscript = "/path/to/your/MyRscript.sh")

# Now run the worker code, using the parameter list we created above.
result <- clusterApplyLB(cl, indexlist, testfunc)

# The results of all the jobs are now in the 'result' variable,
# in the order they were specified in the 'indexlist' variable.

# Don't forget to stop the cluster when you're finished.
stopCluster(cl)

You can, of course, add any post-processing code you need to the above code.

Submitting an R cluster job

You are now ready to submit your job to the GPC queue. The submission script is like most others:

#!/bin/bash
#PBS -l nodes=3:ppn=8
#PBS -l walltime=5:00:00
#PBS -N MyRCluster

cd $PBS_O_WORKDIR
module load intel/13.1.1 R/3.0.1
${SCINET_R_BIN}/Rscript --no-restore MyClusterCode.R

Be sure to use whatever number of nodes, length of time, etc., is appropriate for your job.
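
When choosing the number of nodes, it can help to know how many workers the cluster will contain. Under Torque, the file named by PBS_NODEFILE normally lists each host once per requested core, so the request above would typically give 24 lines, and makePSOCKcluster would start one worker per line, i.e. 8 per node. The sketch below imitates such a file to show how the workers and nodes can be counted. The host names are made up; the real file is written by the scheduler.

```shell
#!/bin/bash
# Build a stand-in for $PBS_NODEFILE: 3 hypothetical hosts, 8 entries each,
# mimicking what a nodes=3:ppn=8 request would typically produce.
nodefile=$(mktemp)
for host in nodeA nodeB nodeC; do
  for core in 1 2 3 4 5 6 7 8; do
    echo "$host"
  done
done > "$nodefile"

# Total lines = number of workers R would launch from this host list.
total_workers=$(wc -l < "$nodefile" | tr -d ' ')

# Distinct host names = number of nodes in the job.
unique_nodes=$(sort -u "$nodefile" | wc -l | tr -d ' ')

echo "workers: $total_workers, nodes: $unique_nodes"

rm -f "$nodefile"
```

In a real job script, the same two counting commands can be run directly on "$PBS_NODEFILE" instead of the stand-in file.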