This web site provides a summary of the Parallel Programming Seminar, held August 5-7, 2009 at the Interdisciplinary Mathematics Institute, University of South Carolina. The three days of lectures focused on GPU programming, primarily using CUDA. This was a hands-on event, where participants had the opportunity to use CUDA on Apple iMac computers.
Contact: Emil Dotchevski (emil@revergestudios.com).
Special thanks to Matt Hielsberg, Peter Binev and Robert Sharpley for organizing and hosting this seminar.
This web site contains links for downloading the source code from the hands-on sessions. The source code is portable, but since it was presented on Apple iMacs, only Makefiles for building under Mac OS X are included.
To build and run the source code, you will need to download and install a CUDA-enabled driver from NVIDIA. You also need to ensure that the LD_LIBRARY_PATH environment variable refers to /usr/local/cuda/lib. To set it, open a Terminal window and type:
export LD_LIBRARY_PATH=/usr/local/cuda/lib
The first day includes a two-part presentation introducing the basic principles of GPU programming, followed by a simple "Hello World!"-style CUDA program.
Part 1 (PDF) focuses on the design and evolution of the GPU, to help understand the core principles in its operation:
Part 2 (PDF) presents CUDA as a specific interface for programming NVIDIA GPUs:
Part 3 introduces a simple CUDA program that fills a memory buffer with the value 42.
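As a rough illustration of what such a program does, the sketch below emulates in plain C++ the indexing that a one-dimensional CUDA launch would use: each (block, thread) pair computes one global index and writes the value to that element. This is not the seminar's source code. The name fill_emulated, the block_dim parameter, and the sequential loops are assumptions made for illustration; on a GPU the innermost loop body would run in parallel as a kernel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU emulation of a 1D kernel launch; not the seminar's source.
// Each (block, thread) pair handles one element, as a CUDA thread would.
std::vector<int> fill_emulated( int n, int block_dim, int value )
{
    assert(n >= 0 && block_dim > 0);
    std::vector<int> buffer(n, 0);
    // Round up so that every element is covered by some block.
    int grid_dim = (n + block_dim - 1) / block_dim;
    for( int block = 0; block != grid_dim; ++block )
        for( int thread = 0; thread != block_dim; ++thread )
        {
            // Same formula a kernel would write as
            // blockIdx.x * blockDim.x + threadIdx.x.
            int i = block * block_dim + thread;
            // The last block may extend past the end of the buffer.
            if( i < n )
                buffer[i] = value;
        }
    return buffer;
}
```

Calling fill_emulated(1000, 256, 42) would produce a 1000-element buffer in which every element is 42, using four emulated blocks of 256 threads each.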
The second day is a fully hands-on session designed to introduce various programming techniques in CUDA. For this purpose, we focus on a single trivial problem: transposing a matrix. We further require that the matrix be square and that its dimension be a power of two. While unrealistic, this restriction keeps the problem exceptionally simple, which makes the differences between the five implementations we will discuss easier to see.
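To see why the power-of-two restriction helps, consider a transpose that proceeds tile by tile, a pattern commonly used for GPU transposes. The plain C++ sketch below is an illustration under stated assumptions, not one of the five seminar implementations: transpose_tiled, is_power_of_two, and the tile parameter are hypothetical names. When dim and tile are both powers of two and tile does not exceed dim, the tiles cover the matrix exactly, so no bounds checks for partial tiles are needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if x is a positive power of two.
bool is_power_of_two( int x )
{
    return x > 0 && (x & (x - 1)) == 0;
}

// Hypothetical tile-by-tile transpose of a dim x dim row-major matrix.
std::vector<float> transpose_tiled( std::vector<float> const & m, int dim, int tile )
{
    assert(is_power_of_two(dim) && is_power_of_two(tile) && tile <= dim);
    assert(m.size() == static_cast<std::size_t>(dim) * dim);
    std::vector<float> result(m.size());
    for( int ty = 0; ty != dim; ty += tile )         // tile row; on a GPU, one block per tile
        for( int tx = 0; tx != dim; tx += tile )     // tile column
            for( int y = ty; y != ty + tile; ++y )   // elements inside the tile;
                for( int x = tx; x != tx + tile; ++x ) // on a GPU, one thread per element
                    result[x * dim + y] = m[y * dim + x];
    return result;
}
```

Because tile always divides dim here, the inner loops never need to test whether x or y falls outside the matrix, which is the simplification the restriction buys.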
To streamline the presentation, we have created a small set of utilities common to all five implementations: two classes that use RAII to allocate and free CUDA memory buffers, and functions to parse a simple command line, fill a matrix with random elements, transpose a matrix on the CPU, and verify that a given CUDA transpose implementation is correct. These are declared in a single header file, common.h:
template <class T>
class
cuda_buffer //non-copyable
{
private:
....
public:
cuda_buffer();
explicit cuda_buffer( int size );
~cuda_buffer();
int size() const;
T * ptr() const;
void swap( cuda_buffer & other );
};
template <class T>
class
cuda_buffer_2d //non-copyable
{
private:
....
public:
cuda_buffer_2d();
explicit cuda_buffer_2d( int width, int height );
~cuda_buffer_2d();
int width() const;
int height() const;
T * ptr() const;
int pitch() const;
void swap( cuda_buffer_2d & other );
};
int cmd_num_runs( int argc, char const * argv[] );
int cmd_matrix_dim( int argc, char const * argv[] );
void make_random_matrix( std::vector<float> &, int dim );
void make_cuda_matrix( cuda_buffer<float> &, std::vector<float> const & m, int dim );
void transpose( std::vector<float> & result, std::vector<float> const & matrix, int dim );
void cout_matrix( std::vector<float> const & matrix, int dim );
int check_error( std::vector<float> const & m1, cuda_buffer<float> const & m2, int dim );
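The RAII idea behind cuda_buffer can be sketched without a GPU. The class below is a host-side stand-in, not the contents of common.h: host_buffer is a hypothetical name, and std::malloc and std::free take the place of the device allocation and deallocation calls (presumably cudaMalloc and cudaFree in the real header, which is an assumption). It mirrors the interface listed above: non-copyable, sized on construction, freed in the destructor, with swap to exchange ownership.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>
#include <utility>

// Hypothetical host-memory analogue of cuda_buffer<T>; not the seminar's code.
template <class T>
class host_buffer
{
    T * ptr_;
    int size_;
public:
    host_buffer(): ptr_(0), size_(0) { }
    explicit host_buffer( int size ): ptr_(0), size_(size)
    {
        assert(size >= 0);
        // Stand-in for a device allocation.
        ptr_ = static_cast<T *>(std::malloc(sizeof(T) * size));
        if( !ptr_ && size > 0 )
            throw std::bad_alloc();
    }
    // Released automatically when the object goes out of scope,
    // including during stack unwinding after an exception.
    ~host_buffer() { std::free(ptr_); }

    // Non-copyable: copying would lead to the memory being freed twice.
    host_buffer( host_buffer const & ) = delete;
    host_buffer & operator=( host_buffer const & ) = delete;

    int size() const { return size_; }
    T * ptr() const { return ptr_; }
    void swap( host_buffer & other )
    {
        std::swap(ptr_, other.ptr_);
        std::swap(size_, other.size_);
    }
};
```

A typical use would be to declare host_buffer<float> buf(dim * dim) at the top of a function and let the destructor release the memory on every exit path, which is the convenience the RAII classes provide for CUDA buffers.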
Utilizing this interface, we have five different solutions for transposing a matrix in CUDA:
The complete source code from Day 2 is available for download.
The last day focuses on image processing and introduces a CUDA program for applying a 3x3 convolution filter to an arbitrary image file. It demonstrates the use of textures, filtering, texture addressing modes, and coalesced memory writes.
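A CPU sketch can make the filter and the role of an addressing mode concrete. The code below is not the seminar's program: convolve3x3 and clamp_index are hypothetical names, the image is assumed to be a single channel of floats in row-major order, and reads outside the image are clamped to the nearest edge pixel, which is analogous to the clamp addressing mode of a texture unit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Clamp i into [0, n-1]; mimics clamp-to-edge texture addressing.
int clamp_index( int i, int n )
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

// Hypothetical 3x3 filter over a width x height single-channel image.
// kernel holds 9 weights in row-major order; they are applied without
// flipping, as is common in image-processing code.
std::vector<float> convolve3x3( std::vector<float> const & img, int width, int height,
                                float const (&kernel)[9] )
{
    assert(width > 0 && height > 0);
    assert(img.size() == static_cast<std::size_t>(width) * height);
    std::vector<float> out(img.size());
    for( int y = 0; y != height; ++y )
        for( int x = 0; x != width; ++x ) // on a GPU, one thread per output pixel
        {
            float sum = 0;
            for( int ky = -1; ky <= 1; ++ky )
                for( int kx = -1; kx <= 1; ++kx )
                {
                    // Neighbors outside the image resolve to the nearest edge pixel.
                    int sx = clamp_index(x + kx, width);
                    int sy = clamp_index(y + ky, height);
                    sum += kernel[(ky + 1) * 3 + (kx + 1)] * img[sy * width + sx];
                }
            out[y * width + x] = sum;
        }
    return out;
}
```

With a kernel whose only nonzero weight is a 1 in the center, the output equals the input, which is a convenient sanity check before trying blur or edge-detection weights.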