ECE408 Sample Quiz questions – Fall 2012

These are meant to serve as sample questions that you should ask and answer when you study for the
exam.
Set 1:
1. If we use each thread to calculate one output element of a vector addition, what would
be the expression for mapping the thread/block indices to the data index?
(A) i=threadIdx.x + threadIdx.y;
(B) i=blockIdx.x + threadIdx.x;
(C) i=blockIdx.x*blockDim.x + threadIdx.x;
(D) i=blockIdx.x * threadIdx.x;
Answer: (C)
2. We want to use each thread to calculate two adjacent elements of a vector addition. Assume
that variable i is the index of the first element to be processed by a thread. What would
be the expression for mapping the thread/block indices to the data index?
(A) i=blockIdx.x*blockDim.x + threadIdx.x +2;
(B) i=blockIdx.x*threadIdx.x*2
(C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2
(D) i=blockIdx.x*blockDim.x*2 + threadIdx.x
Answer: (C)
3. If a CUDA device's SM (streaming multiprocessor) can take up to 1536 threads and up to 4
thread blocks, which of the following block configurations would result in the largest number of
threads in the SM?
(A) 128 threads per block
(B) 256 threads per block
(C) 512 threads per block
(D) 1024 threads per block
Answer: (C)
4. For a vector addition, assume that the vector length is 2000, each thread calculates one output
element, and the thread block size is 512 threads. How many threads will be in the grid?
(A) 2000
(B) 2024
(C) 2048
(D) 2096
Answer: (C)
5. In the previous question, how many warps do you expect to have divergence due to the
boundary check on vector length?
(A) 1
(B) 2
(C) 3
(D) 6
Answer: (A)
Set 2:
1. For our tiled matrix-matrix multiplication kernel, if we use a 32×32 tile, what is the reduction in
memory bandwidth usage for input matrices M and N?
(A) 1/8 of the original usage
(B) 1/16 of the original usage
(C) 1/32 of the original usage
(D) 1/64 of the original usage
Answer: (C)
2. Assume that a kernel is launched with 1000 thread blocks each of which has 512 threads. If a
variable is declared as a local variable in the kernel, how many versions of the variable will be
created through the lifetime of the execution of the kernel?
(A) 1
(B) 1000
(C) 512
(D) 512000
Answer: (D)
3. In the previous question, if a variable is declared as a shared memory variable, how many
versions of the variable will be created through the lifetime of the execution of the kernel?
(A) 1
(B) 1000
(C) 512
(D) 51200
Answer: (B)
4. For the simple matrix-matrix multiplication (MxN) based on row-major layout, which input
matrix will have coalesced accesses?
(A) M
(B) N
(C) M, N
(D) Neither
Answer: (B)
5. For the tiled matrix-matrix multiplication (MxN) based on row-major layout, which input matrix
will have coalesced accesses?
(A) M
(B) N
(C) M, N
(D) Neither
Answer: (C)
Set 3:
1. For the following reduction kernel code fragment, if the block size is 1024 and warp size is 32,
how many warps in a block will have divergence during the 5th iteration?
unsigned int t = threadIdx.x;
unsigned int start = 2*blockIdx.x*blockDim.x;
partialSum[t] = input[start + t];
partialSum[blockDim.x + t] = input[start + blockDim.x + t];
for (unsigned int stride = 1;
     stride <= blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % stride == 0)
        partialSum[2*t] += partialSum[2*t+stride];
}
(A) 0
(B) 1
(C) 16
(D) 32
Answer: (D). In the 5th iteration the stride is 16, so each 32-thread warp contains exactly two active threads (those with t % 16 == 0); every warp therefore mixes active and inactive threads and diverges.
2. For the following reduction kernel, if the block size is 1024 and warp size is 32, how many warps
will have divergence during the 5th iteration?
unsigned int t = threadIdx.x;
unsigned int start = 2*blockIdx.x*blockDim.x;
partialSum[t] = input[start + t];
partialSum[blockDim.x + t] = input[start + blockDim.x + t];
for (unsigned int stride = blockDim.x;
     stride > 0; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}
(A) 0
(B) 1
(C) 16
(D) 32
Answer: (A). In the 5th iteration the stride is 64, so the 64 consecutive active threads (t < 64) fill two full warps, a multiple of the warp size; no warp mixes active and inactive threads.
3. For the work-efficient exclusive scan kernel, assume that we have 2048 elements. How many
add operations will be performed in the reduction tree phase and the inverse reduction
tree phase combined?
(A) (2048-1)*2
(B) (1024-1)*2
(C) 1024*1024
(D) 10*1024
Answer: (A)
4. For the work-inefficient scan kernel based on reduction trees, assume that we have 2048
elements. Which of the following gives the closest approximation of how many add operations
will be performed?
(A) (2048-1)*2
(B) (1024-1)*2
(C) 1024*1024
(D) 10*1024
Answer: (D)
5. For the vector addition example where input vectors are read from disk, if the GPU kernel runs
at 190 GFLOPS and the PCIe bus can deliver a bandwidth of 6 GB/s, which of the following is
the closest approximation of the minimum time it would take to add two 190 mega-element
vectors stored in the host memory and get the result back to the host memory?
(A) 190 / 190 ms
(B) 190 / 6 ms
(C) 8 * 190 / 6 ms
(D) 2 * 190 / 6 ms
Answer: (C)
Set 4:
1. What is the CUDA API call that makes sure that all previous kernel executions and memory
copies have been completed?
(A) __syncthreads()
(B) cudaDeviceSynchronize()
(C) cudaStreamSynchronize()
(D) __barrier()
Answer: (B)
2. Which of the following statements is true?
(A) The data transfer between device and host is done by DMA hardware using virtual
addresses.
(B) The OS automatically guarantees that any memory being used by a DMA device is not
swapped out.
(C) If a swapped-out page is to be transferred by cudaMemcpy(), it needs to be first copied to a
pinned memory buffer before being transferred.
(D) Pinned memory is allocated with cudaMalloc() function.
Answer: (C)
3. If we want to allocate an array of v integer elements in CUDA device global memory, what would
be an appropriate expression for the second argument of the cudaMalloc() call?
(A) n
(B) v
(C) n * sizeof(int)
(D) v * sizeof(int)
Answer: (D)
4. If we want to allocate an array of n floating-point elements and have a floating-point pointer
variable d_A to point to the allocated memory, what would be an appropriate expression for the
first argument of the cudaMalloc() call?
(A) n
(B) (void *) d_A
(C) *d_A
(D) (void **) &d_A
Answer: (D)
5. If we want to copy 3000 bytes of data from host array h_A (h_A is a pointer to element 0 of the
source array) to device array d_A (d_A is a pointer to element 0 of the destination array), what
would be an appropriate API call for this in CUDA?
(A) cudaMemcpy(3000, h_A, d_A, cudaMemcpyHostToDevice);
(B) cudaMemcpy(h_A, d_A, 3000, cudaMemcpyDeviceToHost);
(C) cudaMemcpy(d_A, h_A, 3000, cudaMemcpyHostToDevice);
(D) cudaMemcpy(3000, d_A, h_A, cudaMemcpyHostToDevice);
Answer: (C)
Set 5:
1. How would one declare a variable err that can appropriately receive the return value of a CUDA
API call?
(A) int err;
(B) cudaError err;
(C) cudaError_t err;
(D) cudaSuccess_t err;
Answer: (C)
2. If the MPI call MPI_Send(ptr_a, 1000, MPI_FLOAT, 2000, 4, MPI_COMM_WORLD) resulted in a
data transfer of 4000 bytes, what is the size of each data element being sent?
(A) 1 byte
(B) 2 bytes
(C) 4 bytes
(D) 8 bytes
Answer: (C)
3. Which of the following statements is true?
(A) MPI_Send() is blocking by default.
(B) MPI_Recv() is blocking by default.
(C) MPI messages must be at least 128 bytes.
(D) MPI processes can access the same variable through shared memory.
Answer: (B)