Parallel Programming Seminar

August 5-7 2009

Interdisciplinary Mathematics Institute
University of South Carolina


This is a simple CUDA program for transposing a square matrix as part of the Parallel Programming Seminar at IMI. To compile it, enter the following command line:

make -f

To run, type:


Below is the kernel source code for this matrix_transpose variant. It uses 1D texture cache combined with "manual" swizzling of the texture coordinates to achieve 2D spacial proximity caching. This technique is somewhat obsolete as of version 2.2 of the CUDA SDK, which allows plain device memory to be directly mapped to 2D texture references. Previously, 2D texture references could only be mapped to cudaArrays, which are opaque regions of device memory that can not be written directly into by kernel code.

However, swizzling is still a valid technique for achieving 2D spacial locality in 1D hardware caches.

Click here for the complete source code from Day 2.

#include "matrix_transpose_swizzle_texture_1d_kernel.h"
#define inline __device__
#include "swizzle.h"
#include "unswizzle.h"
#undef inline
#include "bit_count_01.h"
#include <assert.h>

#define BLOCK_SIZE 16

texture<float,1,cudaReadModeElementType> matrix_transpose_swizzle_texture_1d_texref;

matrix_transpose_swizzle_texture_1d_kernel( float * output, int dim, int swizzle_bits )
    unsigned offset = (threadIdx.x + threadIdx.y*blockDim.x) + (blockIdx.x + blockIdx.y*gridDim.x)*blockDim.x*blockDim.y;
    common::unswizzle_xy const unsw(swizzle_bits);
    common::unswizzle_xy::xy xy=unsw(offset);
    int y=xy.major;
    int x=xy.minor;
    common::swizzle_xy const sw(swizzle_bits);
    float v=tex1Dfetch(matrix_transpose_swizzle_texture_1d_texref,sw(x,y));
    output[offset] = v;

cuda_matrix_transpose_swizzle_texture_1d( float * output, float const * input, int dim )
    dim3 block(BLOCK_SIZE,BLOCK_SIZE);
    dim3 grid(dim/block.x,dim/block.y);

Other matrix_transpose variants used: