ECE408 Sample Quiz Questions – Fall 2012

These are meant to serve as sample questions that you should ask and answer when you study for the exam.

Set 1:

1. If we need to use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to the data index?
(A) i = threadIdx.x + threadIdx.y;
(B) i = blockIdx.x + threadIdx.x;
(C) i = blockIdx.x*blockDim.x + threadIdx.x;
(D) i = blockIdx.x * threadIdx.x;
Answer: (C)

2. We want to use each thread to calculate two (adjacent) elements of a vector addition. Assume that variable i should be the index of the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index?
(A) i = blockIdx.x*blockDim.x + threadIdx.x + 2;
(B) i = blockIdx.x*threadIdx.x*2;
(C) i = (blockIdx.x*blockDim.x + threadIdx.x)*2;
(D) i = blockIdx.x*blockDim.x*2 + threadIdx.x;
Answer: (C)

3. Suppose a CUDA device's SM (streaming multiprocessor) can take up to 1536 threads and up to 4 thread blocks. Which of the following block configurations would result in the largest number of threads in the SM?
(A) 128 threads per block
(B) 256 threads per block
(C) 512 threads per block
(D) 1024 threads per block
Answer: (C)

4. For a vector addition, assume that the vector length is 2000, each thread calculates one output element, and the thread block size is 512 threads. How many threads will be in the grid?
(A) 2000
(B) 2024
(C) 2048
(D) 2096
Answer: (C)

5. In the previous question, how many warps do you expect to have divergence due to the boundary check on the vector length?
(A) 1
(B) 2
(C) 3
(D) 6
Answer: (A)

Set 2:

1. For our tiled matrix-matrix multiplication kernel, if we use a 32x32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
(A) 1/8 of the original usage
(B) 1/16 of the original usage
(C) 1/32 of the original usage
(D) 1/64 of the original usage
Answer: (C)

2. Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?
(A) 1
(B) 1000
(C) 512
(D) 512000
Answer: (D)

3. In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?
(A) 1
(B) 1000
(C) 512
(D) 512000
Answer: (B)

4. For the simple matrix-matrix multiplication (MxN) based on row-major layout, which input matrix will have coalesced accesses?
(A) M
(B) N
(C) M, N
(D) Neither
Answer: (B)

5. For the tiled matrix-matrix multiplication (MxN) based on row-major layout, which input matrix will have coalesced accesses?
(A) M
(B) N
(C) M, N
(D) Neither
Answer: (C)

Set 3:

1. For the following reduction kernel code fragment, if the block size is 1024 and the warp size is 32, how many warps in a block will have divergence during the 5th iteration?

    unsigned int t = threadIdx.x;
    unsigned int start = 2*blockIdx.x*blockDim.x;
    partialSum[t] = input[start + t];
    partialSum[blockDim.x + t] = input[start + blockDim.x + t];
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        if (t % stride == 0)
            partialSum[2*t] += partialSum[2*t + stride];
    }

(A) 0
(B) 1
(C) 16
(D) 32
Answer: (D). In the 5th iteration the stride is 16, so only two threads in each warp pass the test. All warps will have divergence for at least part of the execution.

2. For the following reduction kernel, if the block size is 1024 and the warp size is 32, how many warps will have divergence during the 5th iteration?

    unsigned int t = threadIdx.x;
    unsigned int start = 2*blockIdx.x*blockDim.x;
    partialSum[t] = input[start + t];
    partialSum[blockDim.x + t] = input[start + blockDim.x + t];
    for (unsigned int stride = blockDim.x; stride > 0; stride /= 2) {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

(A) 0
(B) 1
(C) 16
(D) 32
Answer: (A). In the 5th iteration the stride is 64, so there are 64 consecutive active threads, which is a multiple of the warp size.

3. For the work-efficient exclusive scan kernel, assume that we have 2048 elements. How many add operations will be performed in both the reduction tree phase and the inverse reduction tree phase?
(A) (2048-1)*2
(B) (1024-1)*2
(C) 1024*1024
(D) 10*1024
Answer: (A)

4. For the work-inefficient scan kernel based on reduction trees, assume that we have 2048 elements. Which of the following gives the closest approximation of how many add operations will be performed?
(A) (2048-1)*2
(B) (1024-1)*2
(C) 1024*1024
(D) 10*1024
Answer: (D)

5. For the vector addition example where input vectors are read from disk, if the GPU kernel runs at 190 GFLOPS and the PCIe bus is able to deliver a bandwidth of 6 GB/s, which of the following is the closest approximation of the minimum time it would take to add two 190 mega-element vectors stored in the host memory and get the result back to the host memory?
(A) 190 / 190 ms
(B) 190 / 6 ms
(C) 8 * 190 / 6 ms
(D) 2 * 190 / 6 ms
Answer: (C)

Set 4:

1. What is the CUDA API call that makes sure that all previous kernel executions and memory copies have been completed?
(A) __syncthreads()
(B) cudaDeviceSynchronize()
(C) cudaStreamSynchronize()
(D) __barrier()
Answer: (B)

2. Which of the following statements is true?
(A) The data transfer between device and host is done by DMA hardware using virtual addresses.
(B) The OS automatically guarantees that any memory being used by a DMA device is not swapped out.
(C) If a swapped page is to be transferred by cudaMemcpy(), it needs to be first copied to a pinned memory buffer before being transferred.
(D) Pinned memory is allocated with the cudaMalloc() function.
Answer: (C)

3. If we want to allocate an array of v integer elements in CUDA device global memory, what would be an appropriate expression for the second argument of the cudaMalloc() call?
(A) n
(B) v
(C) n * sizeof(int)
(D) v * sizeof(int)
Answer: (D)

4. If we want to allocate an array of n floating-point elements and have a floating-point pointer variable d_A point to the allocated memory, what would be an appropriate expression for the first argument of the cudaMalloc() call?
(A) n
(B) (void *) d_A
(C) *d_A
(D) (void **) &d_A
Answer: (D)

5. If we want to copy 3000 bytes of data from host array h_A (h_A is a pointer to element 0 of the source array) to device array d_A (d_A is a pointer to element 0 of the destination array), what would be an appropriate API call for this in CUDA?
(A) cudaMemcpy(3000, h_A, d_A, cudaMemcpyHostToDevice);
(B) cudaMemcpy(h_A, d_A, 3000, cudaMemcpyDeviceToHost);
(C) cudaMemcpy(d_A, h_A, 3000, cudaMemcpyHostToDevice);
(D) cudaMemcpy(3000, d_A, h_A, cudaMemcpyHostToDevice);
Answer: (C)

Set 5:

1. How would one declare a variable err that can appropriately receive the returned value of a CUDA API call?
(A) int err;
(B) cudaError err;
(C) cudaError_t err;
(D) cudaSuccess_t err;
Answer: (C)

2. If the MPI call MPI_Send(ptr_a, 1000, MPI_FLOAT, 2000, 4, MPI_COMM_WORLD) resulted in a data transfer of 4000 bytes, what is the size of each data element being sent?
(A) 1 byte
(B) 2 bytes
(C) 4 bytes
(D) 8 bytes
Answer: (C)

3. Which of the following statements is true?
(A) MPI_Send() is blocking by default.
(B) MPI_Recv() is blocking by default.
(C) MPI messages must be at least 128 bytes.
(D) MPI processes can access the same variable through shared memory.
Answer: (B)
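
Several of the answers above (Set 1 questions 3 to 5 and Set 3 questions 1 and 2) come down to small integer calculations. The following is a minimal host-side sketch in plain C that redoes that arithmetic by brute force, which may help when checking similar questions by hand. The function names are hypothetical helpers introduced here for illustration; they are not part of the quiz, the course material, or the CUDA API.

```c
/* Hypothetical helpers (assumed names, not from the quiz) that redo the
   counting arithmetic on the host. No GPU is needed to run them. */

/* Threads launched when each thread handles one element:
   round n up to a whole number of blocks. */
int grid_threads(int n, int block_size) {
    int blocks = (n + block_size - 1) / block_size;
    return blocks * block_size;
}

/* Threads resident on one SM, limited by both the per-SM thread cap
   and the per-SM block cap. */
int resident_threads(int block_size, int max_threads, int max_blocks) {
    int blocks = max_threads / block_size;
    if (blocks > max_blocks) blocks = max_blocks;
    return blocks * block_size;
}

/* Warps in which some, but not all, threads pass the check (i < n). */
int boundary_divergent_warps(int n, int total_threads, int warp_size) {
    int count = 0;
    for (int w = 0; w < total_threads / warp_size; ++w) {
        int first = w * warp_size;
        int last = first + warp_size - 1;
        if (first < n && last >= n) ++count;
    }
    return count;
}

/* Divergent warps in one block for a single reduction step.
   interleaved != 0: active test is (t % stride == 0), as in Set 3 question 1
   interleaved == 0: active test is (t < stride),      as in Set 3 question 2 */
int reduction_divergent_warps(int block_size, int warp_size,
                              int stride, int interleaved) {
    int count = 0;
    for (int w = 0; w < block_size / warp_size; ++w) {
        int active = 0;
        for (int lane = 0; lane < warp_size; ++lane) {
            int t = w * warp_size + lane;
            if (interleaved ? (t % stride == 0) : (t < stride)) ++active;
        }
        /* A warp diverges only when it is partly active. */
        if (active > 0 && active < warp_size) ++count;
    }
    return count;
}
```

For the 5th iteration, the stride is 16 in the first reduction kernel (1, 2, 4, 8, 16) and 64 in the second (1024, 512, 256, 128, 64), so the relevant calls would be reduction_divergent_warps(1024, 32, 16, 1) and reduction_divergent_warps(1024, 32, 64, 0). Likewise, grid_threads(2000, 512) and boundary_divergent_warps(2000, 2048, 32) correspond to Set 1 questions 4 and 5.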
© Copyright 2025