PyCUDA, Python, C/C++, Nvidia CUDA, GPU Computing

**GPU:** Geforce GTX 970 4G

**CPU: **AMD FX 6300

**Memory:** 8G

**Operating System:** Ubuntu Linux 16.04 64-bit

We often heard about GPU computing being faster than CPU. There is no doubt for most Deep Learning algorithms GPU outperforms CPU on efficiency. But what about other daily computations? Now exploiting the advantage of GPU becomes more and more easier, due to the improved development platform and API access interface, and a more intriguing paradigm is to combine the power of CPU cluster with GPU-powered node for very complex computation. When data is parallelized and assigned to each node, the algorithm should be able to determine the most powerful computational device. The answer of this question depends on many factors, but one of them is surely empirical: How big is the data? Is it big enough to gain advantage from GPU? In this blog, I will setup a PyCuda experiment to explore the best CPU-GPU switching point on my old desktop for matrix mulitplication problem.

# 0. Install NVIDIA CUDA

I have discussed the step of installing NVIDIA CUDA in my previous post (https://wordpress.com/post/eilianyu.wordpress.com/177)

# 1. Compile and Install Numpy and OpenBLAS

To fairly compare performance between GPU and CPU, we have to ensure Numpy is linked to OpenBLAS. Otherwise, the comparison would be “flawed” because we haven’t fully exploit opportunities provided by the underlying hardware, as suggested by this interesting article (https://people.eecs.berkeley.edu/~sangjin/2013/02/12/CPU-GPU-comparison.html).

Please follow https://hunseblog.wordpress.com/2014/09/15/installing-numpy-and-openblas/ to install Numpy and OpenBLAS.

*(I had an error “Program received signal SIGILL: Illegal instruction.” when upgrading from AMD Phenom to FX because the architectural change. Please make sure download the latest version of openBLAS and set the Target correctly as suggested in this post https://github.com/xianyi/OpenBLAS/issues/382 ).*

After installation, please double check that Numpy is correctly linked to openBLAS:

$python
>>import numpy as np
>>np.__config__.show()
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/openblas/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/openblas/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/openblas/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/openblas/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
blas_mkl_info:
NOT AVAILABLE

And

$locate lapack_lite.so
/home/cyberyu/openblas/numpy/build/lib.linux-x86_64-2.7/numpy/linalg/lapack_lite.so
/usr/local/lib/python2.7/dist-packages/numpy-1.11.2-py2.7-linux-x86_64.egg/numpy/linalg/lapack_lite.so
$ldd /usr/local/lib/python2.7/dist-packages/numpy-1.11.2-py2.7-linux-x86_64.egg/numpy/linalg/lapack_lite.so
cyberyu@cyberyugpu:~$ ldd /usr/local/lib/python2.7/dist-packages/numpy-1.11.2-py2.7-linux-x86_64.egg/numpy/linalg/lapack_lite.so
linux-vdso.so.1 => (0x00007ffceef67000)
libopenblas.so.0 => /opt/openblas/lib/libopenblas.so.0 (0x00007f8ea1b45000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f8ea18fd000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8ea152d000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8ea121d000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f8ea0eed000)
/lib64/ld-linux-x86-64.so.2 (0x00005578bbd00000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f8ea0cad000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f8ea0a95000)

# 2. Install PyCuda

Obtain and install PyCuda from Andreas Klöckner’s web page.

PyCUDA provides an interface to access Nvidia‘s CUDA parallel computation API from Python. Please notice it is just an interface, so the core algorithm should still be written in CUDA’s programming model based on C and C++.

PyCUDA has a series of dependencies, so please make sure installing them all.

# 3. Install OpenCV and OpenMP

To fully exploit the platform, I install OpenMP to enhance OpenCV with multithreading. OpenCV is available at http://docs.opencv.org/2.4/doc/tutorials/introduction/linux_install/linux_install.html. OpenMP is available at http://bisqwit.iki.fi/story/howto/openmp/.

To keep the discussion concise, I skip all the trouble-shootings for OpenCV and OpenMP setup. Once they are correctly installed, I use the following command to compile the C++ code:

$g++ -fpic rbf_kernel_simple.cpp -o rbf_kernel_simple.o

# 4. A simple test comparing CPU and GPU

My experiment started with a very simple (naive) function. The goal is to take a large single dimensional float array x[i], where i is the index, as input and update its elements as new values equal to x[i]*x[i] – 0.25 and repeat some stupid loops for 1000 times. The overall Python code is illustrated as follows:

import os
os.environ['LD_LIBRARY_PATH'] = '/opt/openblas/lib/'
import numpy
import time
from pycuda.compiler import SourceModule
import pycuda.autoinit
from numpy import linalg as LA
mod = SourceModule("""
__global__ void cudakernel(float * buf)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
buf[i] = 1.0f * i / 10000;
for (int j = 0; j &amp;amp;amp;lt; 1000; j++)
buf[i] = buf[i] * buf[i] - 0.25f;
}
""")
a = numpy.ones(1000) # construct an array of 1000 elements
a = a.astype(numpy.float32)
import pycuda.driver as cuda
t1 = time.time()
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
func = mod.get_function("cudakernel")
func(a_gpu, block = (250, 1, 1), grid=(4,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
t2 = time.time()-t1
print "GPU time used:", t2*1000, " ms "
t3 = time.time()
import ctypes as C
my_test_lib = C.CDLL('/home/cyberyu/PycharmProjects/mpi_pycuda_test/cudakernelmp.o')
outdata = numpy.zeros(a.size, dtype=numpy.double)
a_cpued = my_test_lib.serial_add(C.c_void_p(a.ctypes.data), C.c_int(1000), C.c_int(2000), C.c_void_p(outdata.ctypes.data))
t4 = time.time()-t3
print "CPU time nedded:", t2*1000, " ms "
print "Error is " + str(LA.norm(a_doubled - outdata))

The code above uses the PyCUDA SourceModule system to pass the cuda C code to the cubin file compiler, and load that cubin file into the current CUDA context. The Cuda C code is wrapped in mod=SourceModule(“”” … “””). Notice that the first line defines the index of the array using threadIdx, blockIdx, and blockDim variables, which are delicately controlled by Cuda’s block and grid parallelism mechanism. Understanding the concept of threads, blocks, and grids, more importantly, to design heuristics that automatically scales with data is critical to, also the beauty of, CUDA development. Some nice references online:

http://users.wfu.edu/choss/CUDA/docs/Lecture%205.pdf

http://cuda-programming.blogspot.com/2013/01/thread-and-block-heuristics-in-cuda.html

The relationship of block and thread in single dimensional assignment is illustrated in the first reference document by Samuel S. Cho:

In my code, I choose the block size as (250,1,1) , and grid size as (4,1) because we are processing a single dimensional array. So in the first dimension, the multiplication of block size with grid size should be equal to the array length 1000. Otherwise, the parallelism will not lead to correct result (can be checked via the error variable). Other settings are also possible, such as

func(a_gpu, block = (100, 1, 1), grid=(10,1))
or
func(a_gpu, block = (500, 1, 1), grid=(2,1))

The first part of the code is to compute the processing time for GPU. In the second part, I calculate the processing time of CPU by writing a separate C code cudakernelmp.c :

void serial_add(float * indatav, int n_count, int m_count, float *outdatav){
const double * indata = (double *) indatav;
double * outdata = (double *) outdatav;
#pragma omp parallel for
for(int i = 0; i &amp;amp;amp;lt; n_count; i++)
{
outdata[i] = 1.0f * i / n_count;
#pragma omp parallel for
for(int j = 0; j &amp;amp;amp;lt; m_count; j++)
{
outdata[i] =outdata[i] * outdata[i] - 0.25f;
}
}
}

Then I compile it to linked library:

gcc -shared cudakernelmp.c -o cudakernelmp.o

And it is called in line 38 of the python code.

The output of the python code is:

GPU time used: 1.57809257507 ms
CPU time nedded: 21.1148262024 ms
Error is 8.85293575516e-08

It shows GPU is 20 faster than CPU on this naive task and the difference (L2 norm) of the results is almost negligible.

# 5. Comparing CPU and GPU for Matrix Multiplication

The real advantage of GPU computing lies in the complicated matrix computation. In the next experiment, I extract the matrix multiplication code from Andreas Klöckner’s PyCuda library. The original algorithm is listed at:

https://wiki.tiker.net/PyCuda/Examples/MatrixmulTiled

Unlike the previous single dimensional experiment, now we need to exploit the assignment of threads w.r.t. blocks and grids in 2D space. The relationship is illustrated as follows (taken from Samuel S. Cho’s lecture slide):

I made small changes on the dynamic assignment of block and grid size. To make it simple, I consider square matrix multiplication only. I haven’t tested it on non-square matrices.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import division
"""
Multiples two square matrices together using multiple blocks and shared memory.
Each thread block is assigned a "tile" of the resulting matrix and is responsible
for generating the elements in that tile. Each thread in a block computes one element
of the tile.
"""
import numpy as np
from numpy import linalg as la
import time
import matplotlib.pylab as plt
from pycuda import driver, compiler, gpuarray, tools
# -- initialize the device
import pycuda.autoinit
kernel_code_template = """
__global__ void MatrixMulKernel(float *A, float *B, float *C)
{
const uint wA = %(MATRIX_SIZE)s;
const uint wB = %(MATRIX_SIZE)s;
// Block index
const uint bx = blockIdx.x;
const uint by = blockIdx.y;
// Thread index
const uint tx = threadIdx.x;
const uint ty = threadIdx.y;
// Index of the first sub-matrix of A processed by the block
const uint aBegin = wA * %(BLOCK_SIZE)s * by;
// Index of the last sub-matrix of A processed by the block
const uint aEnd = aBegin + wA - 1;
// Step size used to iterate through the sub-matrices of A
const uint aStep = %(BLOCK_SIZE)s;
// Index of the first sub-matrix of B processed by the block
const uint bBegin = %(BLOCK_SIZE)s * bx;
// Step size used to iterate through the sub-matrices of B
const uint bStep = %(BLOCK_SIZE)s * wB;
// The element of the block sub-matrix that is computed
// by the thread
float Csub = 0;
// Loop over all the sub-matrices of A and B required to
// compute the block sub-matrix
for (int a = aBegin, b = bBegin;
a &amp;amp;lt;= aEnd;
a += aStep, b += bStep)
{
// Shared memory for the sub-matrix of A
__shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
// Shared memory for the sub-matrix of B
__shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
// Load the matrices from global memory to shared memory
// each thread loads one element of each matrix
As[ty][tx] = A[a + wA * ty + tx];
Bs[ty][tx] = B[b + wB * ty + tx];
// Synchronize to make sure the matrices are loaded
__syncthreads();
// Multiply the two matrices together;
// each thread computes one element
// of the block sub-matrix
for (int k = 0; k &amp;amp;lt; %(BLOCK_SIZE)s; ++k)
Csub += As[ty][k] * Bs[k][tx];
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();
}
// Write the block sub-matrix to global memory;
// each thread writes one element
const uint c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
C[c + wB * ty + tx] = Csub;
}
"""
def benchmarkCPU(scale):
rsCPU = []
print 'Start CPU processing'
for scaleFactor in xrange(scale):
# load the matrices
a_cpu = np.load('testmat_{}.npz'.format(scaleFactor))['arr_0']
b_cpu = np.load('testmat_{}.npz'.format(scaleFactor))['arr_1']
MATRIX_SIZE = 2**(scaleFactor) * 16
print "==" * 100
print 'Loading matrix size of ' + str(MATRIX_SIZE)
# compute reference on the CPU to verify GPU computation
at1 = time.time()
c_cpu = np.dot(a_cpu, b_cpu)
at2 = time.time()
dt12 = (at2 - at1)*1000
print "CPU time used:", dt12, " ms "
# save the results in npz
np.savez('cpu_res_{}.npz'.format(scaleFactor), c_cpu)
rsCPU.append(dt12)
return rsCPU
def benchmarkGPU(scale):
rsGPU = []
rsCOPY= []
print 'Start GPU processing'
# define size of blocks and tiles sub-matrix
# (we assume that the block size is same as tile size)
TILE_SIZE = 16
BLOCK_SIZE = TILE_SIZE
for scaleFactor in xrange(scale):
MATRIX_SIZE = 2 ** (scaleFactor) * 16
print "==" * 100
print 'Loading Matrix size of ' + str(MATRIX_SIZE)
# load the matrices
a_cpu = np.load('testmat_{}.npz'.format(scaleFactor))['arr_0']
b_cpu = np.load('testmat_{}.npz'.format(scaleFactor))['arr_1']
at1 = time.time()
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
at2 = time.time()
dt12= (at2-at1)*1000
print "COPY time used:", dt12, " ms "
# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)
# get the kernel code from the template
# by specifying the constants MATRIX_SIZE and BLOCK_SIZE
kernel_code = kernel_code_template % {
'MATRIX_SIZE': MATRIX_SIZE,
'BLOCK_SIZE': BLOCK_SIZE,
}
# compile the kernel code
mod = compiler.SourceModule(kernel_code)
# get the kernel function from the compiled module
matrixmul = mod.get_function("MatrixMulKernel")
# call the kernel on the card
matrixmul(
# inputs
a_gpu, b_gpu,
# output
c_gpu,
# grid of multiple blocks
# Andreas' original code is: grid = (MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE),
grid=( (MATRIX_SIZE + TILE_SIZE -1) // TILE_SIZE, (MATRIX_SIZE + TILE_SIZE -1) // TILE_SIZE),
# block of multiple threads
block=(TILE_SIZE, TILE_SIZE, 1),
)
# copy result from GPU
re = c_gpu.get()
at3 = time.time()
dt23 = (at3 - at2)*1000
print "GPU time used:", dt23, " ms "
np.savez('gpu_res_{}.npz'.format(scaleFactor), re)
rsGPU.append(dt23)
rsCOPY.append(dt12)
return [rsGPU, rsCOPY]
def calErr(scale):
rsErr=[]
print 'Comparing Error'
for scaleFactor in xrange(scale):
res_cpu = np.load('cpu_res_{}.npz'.format(scaleFactor))['arr_0']
res_gpu = np.load('gpu_res_{}.npz'.format(scaleFactor))['arr_0']
err = la.norm(res_cpu - res_gpu)
rsErr.append(err)
return rsErr
def generate_mat(scale):
# generate some large matrices and store them as npz files
# I can only try scaleFactor = 9 because of the memory limit of my GPU card.
print 'Generating Matrices'
for scaleFactor in xrange(scale):
MATRIX_SIZE = 2 ** (scaleFactor) * 16
a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
np.savez('testmat_{}.npz'.format(scaleFactor), a_cpu, b_cpu)
def main():
GSCALE = 10
generate_mat(GSCALE)
rsCPU = benchmarkCPU(GSCALE)
rs = benchmarkGPU(GSCALE)
rsGPU = rs[0]
rsCopy = rs[1]
rsErr= calErr(GSCALE)
labels = [2**(x)*16 for x in xrange(GSCALE)]
plt.plot(xrange(GSCALE), rsCPU,'b-', label="CPU processing time")
plt.plot(xrange(GSCALE), rsGPU,'r-', label="GPU processing time")
plt.plot(xrange(GSCALE), rsCopy, 'y-', label="Copy processing time")
plt.xticks(xrange(GSCALE), labels, rotation='vertical')
plt.grid(True, which="major", linestyle="dotted")
plt.yscale("log")
plt.ylabel("Logrithm Response time (msec)", fontsize=9)
plt.xlabel("Matrix Size ", fontsize=9)
plt.xticks(fontsize=9)
plt.yticks(fontsize=9)
plt.legend(loc='upper left', fancybox=True, shadow=True, prop=dict(size=8))
ax2 = plt.twinx()
ax2.set_ylabel('Error', color='g')
ax2.plot(xrange(GSCALE), rsErr, 'g-', label="Norm difference")
ax2.legend(loc=0)
plt.show()
if __name__ == "__main__":
main()

The program generates 10 pair of matrices from 16×16 to 8192×8192, and calculates their mutiplications respectively using CUDA tiled matrix multiplication algorithm and numpy dotproduct. The matrices are pregenerated and saved as numpy npz files, and loaded back for benchmarking. The CPU processing time and GPU procesing time are compared on multiplication matrices of the same sizes. The Copy time from CPU to GPU is separately tracked. Finally, the difference of the CPU and GPU results are evaluated as the L2-norm of production matrix.

Matrix Size |
16 |
32 |
64 |
128 |
256 |
512 |
1024 |
2048 |
4096 |
8192 |

GPU (ms) |
170.27 |
1.76 |
1.36 |
1.39 |
0.98 |
2.59 |
9.24 |
45.35 |
303.45 |
2134.40 |

Copy from CPU to GPU (ms) |
0.38 |
0.71 |
0.51 |
0.53 |
0.52 |
1.23 |
3.23 |
8.64 |
31.01 |
116.77 |

CPU (ms) |
0.42 |
0.01 |
0.03 |
0.74 |
0.55 |
3.21 |
50.41 |
199.14 |
1675.36 |
11114.58 |

Error(L2 norm) |
0.00 |
0.00 |
0.00 |
0.00 |
0.00 |
0.00 |
0.02 |
0.08 |
0.31 |
1.22 |

As shown in the figure above (notice that the timing is represented in log scale), GPU computing does not bring any advantage on multiplication of smaller matrices, and the overhead of Copy time is significant. When the matrix size becomes larger than 512, GPU computing outperforms CPU on efficiency.