#### Intro to GPGPU



## Your graphics card is probably faster than your computer.

- Graphics
   performance has
   grown by leaps and
   bounds
- Driven by gamers





#### ...but it's not magic

- CPU very flexible, easy to program
- GPU almost all transistors go to cores and mathematics.

| Control |     | ALU | ALU |
|---------|-----|-----|-----|
|         |     | ALU | ALU |
| Cache   |     |     |     |
|         |     |     |     |
| DRAM    |     |     |     |
|         | CPU | J   |     |
|         |     |     |     |
|         |     |     |     |
|         |     |     |     |
|         |     |     |     |
|         |     |     |     |
|         |     |     |     |
| DRAM    |     |     |     |
|         | GPU | J   |     |

#### ...but it's not magic

- All cores in a "multiprocessor unit" have same control, cache
- Act in lock step
- Do same computations on different data
- "Data parallel"



#### ...but it's not magic

- All cores in a "multiprocessor unit" have same control, cache
- Act in lock step
- Do same computations on different data
- "Data parallel"



#### lf it works, it's great..

• GPU: ~448

compute cores, into ~14 streaming multiprocessors (SM)

- ~32 threads
   operate at once
- Very small cache (48KB/SM)





#### ...and it often does.

- Much of scientific computing is "data parallel"
- Same operation on each cell of grid, on each particle in domain.







## Let's get straight to it

- From login node, ssh to arc01 (devel node of accelerator research cluster)
- •cd ~ppp; cp -r ~ljdursi/ppp/cuda .; cd cuda
- source ../setup
- module load cuda
- make
- •./saxpy





#### Question: How would we OpenMP this? MPI this?



$$\vec{y} = a\vec{x} + b$$

saxpy.cu

#### Very fine-grained parallelism. Each core does one (or few) tasks.



$$\vec{y} = a\vec{x} + b$$

cuda\_saxpb<<<1, n>>>(xd, a, b, yd, n);

saxpy.cu

For loop over elements is implied by the call; n in the <<<>>'s invokes n of these kernels in parallel.



```
/* run GPU code */
CHK_CUDA( cudaMalloc(&xd, n*sizeof(float)) );
CHK_CUDA( cudaMalloc(&yd, n*sizeof(float)) );
tick(&gputimer);
CHK_CUDA( cudaMemcpy(xd, x, n*sizeof(float), cudaMemcpyHostToDevice) );
for (i=0; i<niters; i++) {</pre>
    cuda_saxpb<<<1, n>>>(xd, a, b, yd, n);
7
CHK_CUDA( cudaMemcpy(ycuda, yd, n*sizeof(float), cudaMemcpyDeviceToHost) );
gputime = tock(&gputimer);
CHK_CUDA( cudaFree(xd) );
CHK_CUDA( cudaFree(yd) );
```

saxpy.cu

GPU Memory is separate from system memory (on card). Have to allocate/free it, and copy data GPU↔CPU



```
/* run GPU code */
CHK_CUDA( cudaMalloc(&xd, n*sizeof(float)) );
CHK_CUDA( cudaMalloc(&yd, n*sizeof(float)) );
tick(&gputimer);
CHK_CUDA( cudaMemcpy(xd, x, n*sizeof(float), cudaMemcpyHostToDevice) );
for (i=0; i<niters; i++) {
    cuda_saxpb<<<1, n>>>(xd, a, b, yd, n);
}
CHK_CUDA( cudaMemcpy(ycuda, yd, n*sizeof(float), cudaMemcpyDeviceToHost) );
gputime = tock(&gputimer);
CHK_CUDA( cudaFree(xd) );
CHK_CUDA( cudaFree(yd) );
```

#### saxpy.cu



#### Notes:

- CHK\_CUDA -- test for error cord. More later.
- Allocating, copying to GPU memory: SLOW compared to computing capability of GPU. Avoid wherever possible.
- What happens if you try

   ./saxpb --nvals=200 ? ./saxpb --nvals=2048 ?



#### Threads, Blocks, Grids

- CUDA threads are organized into blocks
- Threads operate in SIMD (ish) manner -- each executing same instructions in lockstep.
- Only difference are thread ids
- Can have a grid of multiple blocks



CUDA Thread



Block of CUDA Threads



Grid of CUDA Blocks



## CUDA - H/W mapping

- Blocks are assigned to a particular SM
  - Executed there one 'warp' at a time (typically 32 threads)
- Multiple blocks may be on SM concurrently
  - Good; latency hiding
  - Bad SM resources must be divided between blocks
- If only use I Block I SM





## Multi-block y=ax+b

- Break input, output vectors into blocks
- Within each block, thread index specifies which item to work on
- Each thread does one update, puts results in y[i]





#### Multi-block y=ax+b



#### block-saxpb.cu



#### Hands on -- do multi-block saxpb



# More blocks $\rightarrow$ more SMs $\rightarrow$ more FLOPs

• We can use 1024 threads/ block:

Multiple calcs, so timing not dominated by memory copy

```
$ ./block-saxpb --nblocks=1 --nvals=1024 --niters=100
CPU time = 0.455 millisec.
GPU time = 0.511 millisec.
CUDA and CPU results differ by 0.0000000
$
$ ./block-saxpb --nblocks=8 --nvals=8192 --niters=100
CPU time = 3.62 millisec.
GPU time = 0.546 millisec.
CUDA and CPU results differ by 0.000000
```





#### Multi-block y=ax+b





#### Multi-block y=ax+b







i = 10 + 2\*100 = 210 yd[210] = a\*xd[210] + b



## How many threads/ block?

- Should be integral multiple of warp (32)
- No more than max allowed by scheduling hardware
- Can get last number from hardware specs
- But what if will be needed on several machines?
- API can return it:



#### cudaGetDeviceProperty

```
int i, count;
cudaDeviceProp prop;
CHK_CUDA( cudaGetDeviceCount( &count ));
for (i=0; i<count; i++) {
   CHK_CUDA( cudaGetDeviceProperties( &prop, i ));
   printf("Device %d has:\n",i);
   printf("Device %d has:\n",i);
   printf("\tName %s,\n",prop.name);
   printf("\tNumber of SMs %d,\n",prop.multiProcessorCount);
   printf("\tWarp Size %d,\n",prop.warpSize);
   printf("\tMax Threads/block %d,\n",prop.maxThreadsPerBlock);
```

#### querydevs.cu



#### cudaGetDeviceProperty

```
#define CHK_CUDA(e) {if (e != cudaSuccess) { \
    fprintf(stderr,"Error: %s\n", cudaGetErrorString(e)); \
    exit(-1);}\
}
```

All CUDA calls return cudaSuccess on successful completion.

GPU hardware does not try very hard to catch errors/notify you; testing return codes important!

Common to see simple automation like this wrapping all CUDA calls; bare minimum for sensible operation.

Test early, fail often.



## Why the .xs?

- For convenience, CUDA allows thread, block indicies to be multidimensional
- Thread blocks can be 3 dimensional (512,512,64)
- Grids of blocks can be 2 dimensional (64k, 64k, 1)
- These variables are of type dim3 or uint3
- CUDA has int1, int2, int3, int4, float1, float2, float3, float4, etc.

```
yd[i] = a*xd[i]+b;
}
```

```
return;
```



## Why the .xs?

- threadIdx.{x,y,z} thread index
- blockDim.{x,y,z} size of block (# of threads in each dim)
- blockIdx.{x,y,z} block index
- gridDim.{x,y,z} size of grid (# of blocks in each dim)
- warpsize size of warp (int)

```
yd[i] = a*xd[i]+b;
}
return;
```



## Why the .xs?

- \_global\_\_ device code that can be seen (invoked) from host.
- host default. Not usually interesting.
- <u>device</u> device code.
   Can be called only from other device code.
- <u>host</u> <u>device</u> compiled for both host and device.

```
yd[i] = a*xd[i]+b;
}
return;
```





#### Restrictions

return;

- global functions can't recurse, neither can device on non-Fermis
- No function pointers to \_\_\_\_\_\_device\_\_\_\_ functions on non-fermis, can't take address of \_\_\_\_\_\_ device\_\_\_ function
- Can't have static variables in global\_\_\_, \_\_device\_\_\_\_
   functions
- Can't use varargs with device code



- Use of 2/3d thread blocks, or 2d grids, never strictly necessary...
- But can make code clearer, shorter.
- Matrix multiplication



$$C_{i,j} = \sum_{k} A_{i,k} B_{k,j}$$



```
void cpu_sgemm(const float *a, const float *b,
                const int n, float *c) {
    /* this, of course, is a
       terrible implementation */
    int i, j, k;
    double sum;
    for (i=0;i<n;i++) {</pre>
                                                                               *
        for (j=0;j<n;j++) {</pre>
            sum = 0.;
            for (k=0;k<n;k++) {</pre>
                 sum += a[i*n + k]*b[k*n + j];
            c[i*n + j] = sum;
                                                       C_{i,j} = \sum A_{i,k} B_{k,j}
        }
    }
    return;
}
             matmult.cu
```







```
global
void cuda_sgemm_reg(const float *ad, const float *bd,
                    const int n, float *cd) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    int j = threadIdx.y + blockIdx.y*blockDim.y;
    int k;
                                                                                    *
    double sum;
    if (i<n && j<n) {
        sum = 0.;
        for (k=0; k<n; k++) {</pre>
            sum += ad[i*n + k]*bd[k*n + j];
        }
        cd[i*n + j] = sum;
                                                            C_{i,j} = \sum A_{i,k} B_{k,j}
    }
    return;
}
```



### Timings:

Orig \$ ./matmult --matsize=160 --nblocks=10 Matrix size = 160, Number of blocks = 10. CPU time = 14.093 millisec. GPU time = 4.416 millisec. CUDA and CPU results differ by 0.162872

#### **Double Prec. sum**

\$ ./matmult --matsize=160 --nblocks=10 Matrix size = 160, Number of blocks = 10. CPU time = 14.047 millisec. GPU time = 2.219 millisec. CUDA and CPU results differ by 0.000000

Faster, even with double precision sums - why?



#### **CUDA Memories**

- All HPC, but especially GPU, all about planning memory access to be fast
- Global mem is off the GPU chip (but on the card); ~100 cycle latency
- Thread-local variables get put into registers on each SM - fast (~I cycle) but small





#### **CUDA Memories**

| Memory   | On<br>Chip? | Cached? | R/W   | Scope        |                  |      |      |
|----------|-------------|---------|-------|--------------|------------------|------|------|
| Register | On          | No      | R/W   | Thread       | Global           |      |      |
| Shared   | On          | No      | R/W   | Block        | Mem<br>(On Card) |      |      |
| Global   | Off         | No      | R/W   | Kernel, Host | Registers        |      |      |
| Constant | Off         | Yes     | R     | Kernel, Host | (On Chip)        | SM#1 | SM#2 |
| Texture  | Off         | Yes     | R(W?) | Kernel, Host |                  |      |      |
| 'Local'* | Off         | No      | R/W   | Thread       |                  |      |      |

\* if you run out of registers, will put 'local' mem in global.



# Memory usage in SGEMM

- How can we exploit this?
- N<sup>3</sup> multiplies, adds
- 2N<sup>2</sup> data
- Regular access
- Opportunity for high memory re-use
- Need to find ways to bring data into shared memory (incurring global mem overhead once), use it several times



$$C_{i,j} = \sum_{k} A_{i,k} B_{k,j}$$



# Memory usage in SGEMM

- One nice thing about matrix multiplication same as block multiplication, each subblock is a matrix mult
- Neighbouring threads within block all see nearby rows, columns
- Pull whole block in
- If b blocks in each dim, each data only pulled in 2b times, not 2n times



$$C_{bi,bj} = \sum_{k} A_{bi,bk} B_{bk,bj}$$



# Memory usage in SGEMM

```
int i = threadIdx.x + blockIdx.x*blockDim.x;
int i = threadIdx.x + blockIdx.x*blockDim.x;
int locj = threadIdx.y;
int locj = threadIdx.y;
int locn = blockDim.x;
__shared___atile[TILESIZE][TILESIZE];
__shared___btile[TILESIZE][TILESIZE];
//...
```



```
double sum = 0;
```

```
for (each tile) {
    //..load in tiles
    for (k=0; k<locn; k++) {
        sum += atile[loci*locn + k]*
            btile[k*locn + locj];
    }
}
c[i*n + j] = sum;</pre>
```

$$C_{bi,bj} = \sum_{k} A_{bi,bk} B_{bk,bj}$$



### \_\_\_\_\_syncthreads()

- Computation must wait until all threads have brought in their data
- Not all memory accesses may take same length of time
- <u>syncthreads()</u> waits until all threads in block are at same point.
- No equivalent between blocks
- Loop must similarly wait for computation





#### shared

arrays

- If declared in device code, must be sized at compile time.
- No sharedMalloc (all threads in block would have to agree)
- can use consts or #defines to size array, but we want to maintain flexibility





#### extern \_\_\_\_\_shared \_\_\_

```
global ____ void cuda_sgemm_shared(const float *ad, const float *bd,
                           const int n, float *cd) {
    extern __shared__float shared_data[];
    //...
    float *atile = &(shared_data[0]);
    float *btile = &(shared_data[tilesize*tilesize]);
void main() {
    //..
    cuda_sgemm_shared<<<gridsize, blocksize,
              [2*blocksize.x*blocksize.y*sizeof(float)>>>(ad, bd, n, cd);
     Optional 3rd argument - size (in bytes)
     of shared memory to allocate per block
```



#### extern \_\_\_\_\_shared \_\_\_





#### extern \_\_\_\_\_shared \_\_\_



#### Orig \$ ./matmult --matsize=160 --nblocks=10 Matrix size = 160, Number of blocks = 10. CPU time = 14.093 millisec. GPU time = 4.416 millisec. CUDA and CPU results differ by 0.162872

#### **Double Prec. sum**

```
$ ./matmult --matsize=160 --nblocks=10
Matrix size = 160, Number of blocks = 10.
CPU time = 14.047 millisec.
GPU time = 2.219 millisec.
CUDA and CPU results differ by 0.000000
```

#### Shared

```
$ ./matmult --matsize=160 --nblocks=10
Matrix size = 160, Number of blocks = 10.
CPU time = 14.041 millisec.
GPU time = 0.998 millisec.
CUDA and CPU results differ by 0.000000
```



#### Hands On

- Using matmult.cu as a template, look at laplacian.c implements 2d laplacian.
- Implement a CUDA version using shared memory, and make sure it gets same answer as CPU version.
- How would we do multiple iterations? Does entire memory have to be copied back each time?



# Making effective use of CUDA memories

- Preload data wherever possible
- Global memory -
  - Coalesced access
  - Make use of 128B (or, maybe, 32B) at a time
- Profiler to see what's happening
- Shared memory
  - Bank conflicts

| Memory   | On<br>Chip? | Cached? | R/W   | Scope        |
|----------|-------------|---------|-------|--------------|
| Register | On          | No      | R/W   | Thread       |
| Shared   | On          | No      | R/W   | Block        |
| Global   | Off         | No      | R/W   | Kernel, Host |
| Constant | Off         | Yes     | R     | Kernel, Host |
| Texture  | Off         | Yes     | R(W?) | Kernel, Host |
| 'Local'* | Off         | No      | R/W   | Thread       |



#### Stalling on Memory Access

- Graphics card schedules by the warp on an SM
- All warps that are ready to execute get scheduled
- Not ready to execute stalled on memory access
- Nothing ready SM sits idle.





#### Stalling on Memory Access

- Two ways to ensure no idle SM:
  - Lots of warps (=blocks\*threads/32); hide latency with other threads.
  - Little or no stalling on memory access; hide latency within threads.
- Sometimes work to counter purposes! Must experiment to see what works best for your algorithm.





### Stalling happens on use.

- Kernel does not stall on loading data
- Stalls when data not yet ready needs to be used
- Can "preload" data that you will need at beginning of kernel
- Hide latency by doing as much work as possible before need bulk of data.





### Keep memory accesses going

- Make maximum use of memory bandwidth hardware provides
- To fully use a pipe, must have bandwidth x latency memory accesses 'in flight'.
- Little's Law, Queueing theory - <u>http://en.wikipedia.org/</u> wiki/Little%27s\_law





- Global memory is slow
- Get as much out of it per access as possible
- HW reads 128 byte lines from global memory (Fermi: can turn off caching and read 4x 32byte segments)
- Want to make the most of this







- Corresponds to 4B for each thread in a warp
- If each thread in warp reads consecutive float, aligned w/ boundary, can be coalesced into 1 read: high bandwidth
- Warp can continue after
   I global read cycle





- If each thread in warp reads consecutive float, but offset, can be coalesced into 2 read: reduced bandwidth
- Warp can continue after
   2 global read cycle (and 128B of bandwidth wasted)





- Random access is a nightmare
- Can potentially take 32 times as long, wasting 97% of available global memory bandwidth





#### List reversal

- Imagine having to reverse a list
- (Sounds dumb, but matrix transpose, partial pivoting, various graph algorithms require data reordering)
- Obvious way to do this, particularly on older (pre cc 1.2) hardware, doesn't work well:





#### List reversal

```
__global__ void cuda_reverse(const float *xd,
float *yd,
const int n) {
```

```
int i = threadIdx.x + blockIdx.x*blockDim.x;
if (i<n) {
    yd[n-(i+1)] = xd[i];
}
return;
}
Read - coalesced</pre>
```



#### List reversal









- Sometimes we'd like to see more detail than just integrated timings
- Cuda/OpenCL profiler comes with NVidia SDK
- run with computeprof (/scinet/arc/cuda-3.2/ computeprof)
- From there, you can run an application and look at timings





 Click 'Profile application' to begin getting data,

| <u>File Session View Options Window</u>     | v <u>H</u> elp |  |  |  |  |  |
|---------------------------------------------|----------------|--|--|--|--|--|
|                                             | <b>B B B</b>   |  |  |  |  |  |
| Sessions Management of A                    |                |  |  |  |  |  |
| 😑 🔿 🔿 🛛 Welcome to Compute V                | ∕is            |  |  |  |  |  |
| Project———————————————————————————————————— |                |  |  |  |  |  |
| Recent Profile applica                      | tion           |  |  |  |  |  |
| Open Import CS                              | Import CSV     |  |  |  |  |  |
| Create                                      | Help           |  |  |  |  |  |
|                                             |                |  |  |  |  |  |
| 🕱 Show this dialog on startup               |                |  |  |  |  |  |
|                                             | Close          |  |  |  |  |  |
|                                             |                |  |  |  |  |  |



- Click 'Profile application' to begin getting data,
- Enter directory, executable, and arguments of program to profile,

| 00       |                | X                   | Session settings                    |  |  |  |
|----------|----------------|---------------------|-------------------------------------|--|--|--|
| ession ( | Profiler Cou   | inters              | Other Options                       |  |  |  |
| Session  | Name:          | Sessio              | onl                                 |  |  |  |
| Launch:  |                | /home/              | /ljdursi/gpuclass/class4/reverse" 💌 |  |  |  |
| Working  | Directory:     | /home               | /ljdursi/gpuclass/class4            |  |  |  |
| Argumer  | nts:           | nvals=960nblocks=30 |                                     |  |  |  |
| Max Exe  | cution Time:   |                     |                                     |  |  |  |
| 🗶 Enabl  | e profiling at | applica             | ation launch                        |  |  |  |
|          | API trace      |                     |                                     |  |  |  |
|          | separate wi    | ndow                |                                     |  |  |  |



- Click 'Profile application' to begin getting data,
- Enter directory, executable, and arguments of program to profile,
- and then run the program. Program runs several times to get all counter information.





- Summary table shows lots of good stuff
- Here we see overall kernel time is about 12% faster, presumably because of roughly ~12% better global memory throughput.

| Method                   | #Calls | GPU time 🛛 🗸 | %GPU time | glob mem read throughpu | glob mem write | glob mem overall thro |
|--------------------------|--------|--------------|-----------|-------------------------|----------------|-----------------------|
| 1 cuda_reverse           | 1      | 2.88         | 6.95      | 1.33333                 | 1.33333        | 2.66667               |
| 2 cuda_reverse_coalesced | 1      | 2.56         | 6.18      | 1.5                     | 1.5            | 3                     |
| 3 memcpyHtoD             | 4      | 23.712       | 57.26     |                         |                |                       |
| 4 memcpyDtoH             | 2      | 12.256       | 29.59     |                         |                |                       |



#### Another Example: Multi-block y=ax+b

- Break input, output vectors into blocks
- Within each block, thread index specifies which item to work on
- Each thread does one update, puts results in y[i]





#### Another Example: Multi-block y=ax+b

- Break input, output vectors into blocks
- Within each block, thread index specifies which item to work on
- Each thread does one update, puts results in y[i]
- But now with a stride:
- Can coalesce reads, writes, but not both.





#### Another Example: Multi-block y=ax+b

• Break input, output

vectors into hlocks

Profiler Output

| Γ | Method             | #Calls | GPU time 🛛 🗸 | %GPU time | glob mem read throughput | glob mem write | glob mem overall | gld efficiency | gst efficiency | instr |
|---|--------------------|--------|--------------|-----------|--------------------------|----------------|------------------|----------------|----------------|-------|
| 1 | cuda_saxpb_strided | 1      | 4.608        | 7.61      | 18.6806                  | 18.6806        | 37.3611          | 0.307692       | 0.307692       | 0.14  |
| 2 | cuda_saxpb         | 1      | 3.008        | 4.97      | 4.78723                  | 4.78723        | 9.57447          | 1              | 1              | 0.04  |
| 3 | memcpyHtoD         | 4      | 37.088       | 61.32     |                          |                |                  |                |                |       |
| 4 | memcpyDtoH         | 2      | 15.776       | 26.08     |                          |                |                  |                |                |       |

- Each thread does one update, puts results in y[i]
- But now with a stride:
- Can coalesce reads, writes, but not both.





- Rewriting algorithm to ensure coalesced memory access probably most important optimization.
- Try to rearrange data before transfer to device to be in order needed;
- Reorder in shared mem if necessary.





 Each thread in warp accesses different bank: no problem.

| Ŏ |  |  |
|---|--|--|
|   |  |  |
|   |  |  |

SM#1



 Each thread in warp accesses different bank: no problem.







- Each thread in warp accesses different bank: no problem.
- Each thread accesses same one value: 'broadcast', no problem.



SM#1



- Each thread in warp accesses different bank: no problem.
- Each thread accesses same one value: 'broadcast', no problem.
- Multiple threads need data from same bank: conflict. Accesses are serialized.



SM#I



 Imagine 8 banks, and working on an 8xN matrix

|    |    | •  | •  | •  | •  | •  | •  |
|----|----|----|----|----|----|----|----|
| 0  | I  | 2  | 3  | 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
| 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
| 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 |
| 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |



- Imagine 8 banks, and working on an 8xN matrix
- Row operations are great

| 0  |    | 2  | 3  | 4  | 5  | 6  | 7  |
|----|----|----|----|----|----|----|----|
| 8  | 9  | 10 |    | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
| 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
| 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 |
| 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |



- Imagine 8 banks, and working on an 8xN matrix
- Row operations are great
- Column operations maximally bad

| 0  | Ι  | 2  | 3  | 4  | 5  | 6  | 7  |
|----|----|----|----|----|----|----|----|
| 8  | 9  | 10 |    | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
| 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
| 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 |
| 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |



- Imagine 8 banks, and working on an 8xN matrix
- Row operations are great
- Column operations maximally bad
- Solutions
  - Row ops if possible

|    | -  | •  | •  | •  | •  | •  | •  |
|----|----|----|----|----|----|----|----|
| 0  |    | 2  | 3  | 4  | 5  | 6  | 7  |
| 8  | 9  | 10 |    | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
| 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
| 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 |
| 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |



- Imagine 8 banks, and working on an 8xN matrix
- Row operations are great
- Column operations maximally bad
- Solutions
  - Row ops if possible
  - Pad matrix with extra column to stride across banks

| •  | •  | •  | •  | •  | •  | •  | •  |
|----|----|----|----|----|----|----|----|
| 0  |    | 2  | 3  | 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
| 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
| 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 |
| 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |



### Warps in multi-d blocks

- Easy to see how warps are assigned in I-d block:
  - First 32 = warp0
  - Next 32 = warp1..
- How done in 2d block?
- C ordering: x first, then y
- blockDim.x = 32:
  - warp 0 : blockDim.y = 0
  - warp I: blockDim.y = I..





```
global____void cuda_sgemm_shared(const float *ad, const float *bd,
                                                                   Striding through matrix
                         const int n, float *cd)
                                                                    w/ slow moving index;
 extern ____shared___ float shared_data[];
                                                                   Massive bank conflicts if
  int loci = threadIdx.x;
                                                                     blocksize = warpsize
 int locj = threadIdx.y;
 int tilesize = blockDim.x;
 int bx = blockIdx.x;
  int by = blockIdx.y;
 int i = threadIdx.x + blockIdx.x*blockDim.x;
 int j = threadIdx.y + blockIdx.y*blockDim.y;
 int k;
 int blockk;
 float *atile = &(shared_data[0]);
  float *btile = &(shared_data[tilesize*tilesize]);
 double sum;
 if (i<n && j<n) {
     sum = 0.;
     for (blockk=0; blockk<gridDim.x; blockk++) {</pre>
         /* read in shared data */
          atile[loci*tilesize + locj] = ad[(tilesize*bx+loci)*n + (tilesize*blockk+locj)];
          btile[loci*tilesize + locj] = bd[(tilesize*blockk+loci)*n + (tilesize*by+locj)];
          _____syncthreads();
          for (k=0; k<tilesize; k++)</pre>
              sum += atile[loci*tilesize + k]*btile[k*tilesize + locj];
          ____syncthreads();
      }
      cd[i*n + j] = sum;
  }
                                                         matmult.cu
  return;
```



| 🥪 <u>F</u> ile <u>S</u> ession <u>V</u> iew <u>O</u> ptions <u>W</u> indow <u>H</u> elp |   |                             |        |              |           |                |
|-----------------------------------------------------------------------------------------|---|-----------------------------|--------|--------------|-----------|----------------|
|                                                                                         |   |                             |        |              |           |                |
| nonnonnon Sessions nonnonnon 🗗 🗵 🛛 Profiler Output 🖾 Summary Table 🔯                    |   |                             |        |              |           |                |
| ⊡… Session1<br>⊡… Device 0                                                              | Γ | Method                      | #Calls | GPU time 🛛 🗸 | %GPU time | warp serialize |
| Context_0 [CUDA]                                                                        |   | cuda_sgemm_shared           | 1      | 112289       | 63.09     | 58021046       |
| ⊡ Session2<br>⊡ Device_0<br>Context_0 [CUDA]                                            | 2 | cuda_sgemm_shared_transpose | 1      | 53739.4      | 30.19     | 0              |
|                                                                                         | 3 | memcpyHtoD                  | 4      | 6673.89      | 3.74      |                |
| ⊡… Session3<br>⊡… Device_0                                                              |   | memcpyDtoH                  | 2      | 5268.99      | 2.96      |                |

blocksize = 32 marten\$ ./matmult --matsize=1536 --nblocks=48 = warpsize Matrix size = 1536, Number of blocks = 48. CPU time = 29466.5 millisec, GFLOPS=0.245966 GPU time = 522.71 millisec, GFLOPS=13.865733, diff = 0.000000. GPU2 time = 128.905 millisec, GFLOPS=56.225572, diff = 0.000000.

4x performance



## Memory structure informs block sizes:

- By choosing block size in such a way to maximize global, shared memory bandwidth and preloading data into shared, can extract significant performance
- Get your code working first, then use these considerations to get them working fast

```
$ ./matmult --matsize=1536 --nblocks=24
Matrix size = 1536, Number of blocks = 24.
CPU time = 29467.4 millisec, GFLOPS=0.245958
GPU time = 8.203 millisec, GFLOPS=883.549593, diff = 0.000000.
GPU2 time = 8.122 millisec, GFLOPS=892.361156, diff = 0.000000.
```

 Use tuned code where available (this is still much slower than CUBLAS, MAGMA!)

