
Dim3 engine

One of the most important measures of a processor's computation ability is its flops rating. The following example will show you why matching data-transfer speed to data-processing speed is so important in GPU computation. We assume that, in order to perform one floating-point operation, the runtime needs to transfer one single-precision floating-point datum from global memory to the computational kernel. The NVIDIA Tesla C2075 companion processor supports 144 gigabytes per second (GB/s) of global memory access bandwidth. With 4 bytes in each single-precision floating-point datum, we can load no more than 36 (144/4) giga single-precision data per second. Since the computational kernel cannot compute on more floating-point data than global memory has loaded, it will execute no more than 36 gigaflops per second. The reason the CUDA architecture has many memory types is to increase memory access speed, so that the data-transfer speed can match the data-processing speed.

Matrix Multiplication with Global Memory source file:

If you have learned linear algebra before, you will know that the product of two square matrices is a square matrix of the same size. For example, to calculate entry (A,B) in the output matrix, we need row A in one input matrix and column B in the other. We first take the leftmost element in row A and multiply it by the top element in column B. Then we take the second element in row A and multiply it by the second element in column B. We do this for all the elements in row A and column B, and the sum of the products is the value at entry (A,B) in the output matrix. As you can see, this kind of operation is highly parallel, making it a perfect fit for CUDA. We exploit that by assigning each entry in the output matrix a thread of its own: the thread fetches the data it needs, does all the calculations, and writes the result back to the output matrix. The two input matrices, M and N, are of size Width x Width; the output matrix P has the same size.
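The thread-per-entry scheme above can be sketched as a simple global-memory kernel. This is a minimal illustration, not the tutorial's actual source file; the names MatrixMulKernel, M, N, P, and Width are our own, and the matrices are assumed to be stored row-major as flat float arrays:

```cuda
// Sketch: one thread computes one entry of the output matrix P.
// M, N, P are Width x Width, row-major, in global memory.
__global__ void MatrixMulKernel(const float *M, const float *N,
                                float *P, int Width)
{
    // The 2D block/thread indices map directly onto the 2D output matrix.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < Width && col < Width) {
        float sum = 0.0f;
        // Walk along row `row` of M and column `col` of N,
        // accumulating the sum of products.
        for (int k = 0; k < Width; ++k)
            sum += M[row * Width + k] * N[k * Width + col];
        // Write the finished dot product back to global memory.
        P[row * Width + col] = sum;
    }
}
```

Note that every operand here comes straight from global memory, which is exactly why the bandwidth ceiling discussed above limits this version's throughput.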


Starting from this example, we will look at how to solve problems in a two-dimensional domain using two-dimensional grids and blocks. As we know, threads can be organized into a multi-dimensional block, and blocks can in turn be organized into a multi-dimensional grid. This feature of the CUDA architecture enables us to create a two-dimensional or even three-dimensional thread hierarchy, so that solving two- or three-dimensional problems becomes easier and more efficient. In this example, we will do Square Matrix Multiplication.
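A two-dimensional grid and block are declared with CUDA's built-in dim3 type. The following is a hedged sketch of what the launch might look like; TILE_WIDTH, the launch() wrapper, and the kernel name MatrixMulKernel are illustrative assumptions, not names fixed by the tutorial or by CUDA:

```cuda
// Illustrative tile size: each block holds a 16 x 16 patch of threads.
#define TILE_WIDTH 16

// Hypothetical matrix-multiplication kernel declared elsewhere.
__global__ void MatrixMulKernel(const float *M, const float *N,
                                float *P, int Width);

void launch(const float *M, const float *N, float *P, int Width)
{
    // A 2D block of TILE_WIDTH x TILE_WIDTH threads ...
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    // ... and a 2D grid with enough blocks to cover the whole
    // Width x Width output matrix (rounding up on both axes).
    dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
                 (Width + TILE_WIDTH - 1) / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(M, N, P, Width);
}
```

Because both the grid and the block are two-dimensional, each thread can recover its (row, column) position in the output matrix directly from blockIdx, blockDim, and threadIdx.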








