The purpose of this benchmark tool is to evaluate the performance bounds of GPUs (or CPUs) on mixed operational intensity kernels. The executed kernel is customized over a range of operational intensity values. Modern GPUs are able to hide memory latency by switching execution to threads that are ready to perform compute operations, so this tool can be used to assess the practical optimum balance between the two types of operations for a compute device.

CUDA, HIP, OpenCL and SYCL implementations have been developed for targeting GPUs, along with an OpenMP implementation for using a CPU as a target. Since each implementation resides in a separate folder, please check the documentation available within each sub-project's folder. To build a particular implementation, use the proper CMakeLists.txt residing in its subdirectory, e.g. the one in the OpenCL subdirectory for the OpenCL implementation.

Kernel types

The experiments combine compute operations with global memory accesses:

- Single precision Flops (multiply-additions)
- Double precision Flops (multiply-additions)
- Half precision Flops (multiply-additions, for GPUs only)

When executed, the tool reports the device specifications and the selected trade-off type, e.g.:

Device specifications
- Compute throughput: 7464.96 GFlops (theoretical single precision FMAs)
Trade-off type: compute with global memory (block strided)