3. Samples List

3.1. Introduction

3.1.1. simpleStreams

This sample illustrates the use of TOPS streams to overlap kernel execution with device/host memcopies. The kernel initializes an array to a specific value, after which the array is copied to host (CPU) memory. To increase performance, multiple kernel/memcopy pairs are launched asynchronously, each pair in its own stream. Kernels are serialized, so if n pairs are launched, the streamed approach can reduce the memcopy cost to 1/n of the cost of a single copy of the entire data set. Additionally, this sample uses TOPS events to measure elapsed time. Elapsed times are averaged over nreps repetitions (10 by default).
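The 1/n figure can be checked with a small timing model. The sketch below is plain host C++ and not TOPS code: the function name streamedTime and its parameters are hypothetical, and it assumes that kernels run one after another, that copies run one at a time, and that a copy may overlap with the kernels of later pairs.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical timing model, not part of the TOPS API.
// kernelTotal: time for one kernel to initialize the whole array.
// copyTotal:   time for one memcopy of the whole array to the host.
// n:           number of kernel/memcopy pairs, one pair per stream.
double streamedTime(double kernelTotal, double copyTotal, int n) {
    assert(n > 0);
    const double kernelSlice = kernelTotal / n;
    const double copySlice = copyTotal / n;
    double kernelEnd = 0.0;  // time at which the most recent kernel finished
    double copyEnd = 0.0;    // time at which the most recent copy finished
    for (int i = 0; i < n; ++i) {
        kernelEnd += kernelSlice;  // kernels are serialized
        // A copy waits for its own kernel and for the previous copy.
        const double copyStart = std::max(kernelEnd, copyEnd);
        copyEnd = copyStart + copySlice;
    }
    return copyEnd;
}
```

Under these assumptions, streamedTime(8.0, 4.0, 1) gives 12 time units, while streamedTime(8.0, 4.0, 4) gives 9: only the last copy slice (4/4 = 1 unit) is not hidden behind kernel execution. When copies take longer than kernels, the model shows a smaller gain.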

3.1.2. simpleSharedLibrary

This sample demonstrates how to build and use a shared library.

3.1.3. simplePrintf

This is a simple example of using printf() inside a kernel.

3.1.4. simpleAssert

This is a simple example of using assert() inside a kernel.

3.1.5. simpleTemplates

This sample is a templatized version of the template project.

3.1.6. simpleMultiThread

This sample demonstrates how to launch topscc kernels from multiple threads.

3.1.7. simpleZeroCopy

This sample illustrates how to use Zero MemCopy, in which kernels read and write pinned system memory directly.

3.1.8. simpleMultiGCU

This sample illustrates how to use the TOPS API to work with multiple GCUs.

3.1.9. simpleMultiCopy

This sample illustrates how to use TOPS streams to overlap kernel execution with data copies to and from the device.

3.1.10. simpleVectorAdd

This sample implements element-by-element vector addition.

3.1.11. asyncAPI

This sample illustrates the use of TOPS events for both GCU timing and overlapping CPU and GCU execution. Events are inserted into a stream of calls. Since stream calls are asynchronous, the CPU can perform computations while the GCU is executing. The CPU can query events to determine whether the GCU has completed its tasks.
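The launch, work, and query pattern can be sketched in plain host C++. This is only an analogy and uses no TOPS calls: a std::future stands in for the event, a sleeping host task stands in for the GCU work, and the names simulatedDeviceWork and overlapCpuWithDevice are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Stand-in for device work: a host task that takes delayMs to finish.
// In the sample itself this role is played by work queued on a stream.
inline int simulatedDeviceWork(int delayMs) {
    std::this_thread::sleep_for(std::chrono::milliseconds(delayMs));
    return 42;
}

// Starts the stand-in work asynchronously, then keeps the CPU busy until a
// non-blocking readiness check (the analogue of querying an event) succeeds.
// Returns {result of the work, number of CPU iterations done in the meantime}.
inline std::pair<int, unsigned long> overlapCpuWithDevice(int delayMs) {
    assert(delayMs >= 0);
    std::future<int> done =
        std::async(std::launch::async, simulatedDeviceWork, delayMs);
    unsigned long cpuIterations = 0;
    while (done.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
        ++cpuIterations;  // useful CPU work would go here
    }
    return {done.get(), cpuIterations};
}
```

The iteration count returned by overlapCpuWithDevice(20) is the work the CPU fit in while the simulated device was busy. A blocking wait would have left that time unused.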

3.1.12. simpleP2P

This sample demonstrates how to copy device memory from Peer to Peer (P2P) directly, or how to use remote device memory in a kernel, and measures the bandwidth of a P2P memory copy.

3.1.13. simpleIPC

This is a very basic sample that demonstrates Inter Process Communication with one process per GCU for computation.

3.1.14. simpleRTC

This sample demonstrates how to use the topsrtc mode. To run the RTC cases, set the CAPS_HOME environment variable to the TopsCC installation location:

    export CAPS_HOME=/topscc/location

By default, this is /opt/tops.

3.2. Utilities

3.2.1. deviceQuery

This sample queries the properties of the GCU devices present in the system via the Host Runtime API.

3.2.2. kernelEfficiency

This sample measures the elapsed time to launch an empty kernel.

3.3. Concepts and Techniques

3.3.1. simpleElemwiseAdd

This sample illustrates multiple implementations of element-by-element addition, including scalar and vector algorithms, and measures the performance of each implementation.
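The difference between the two styles can be sketched on the host in plain C++. This is not the sample's kernel code: the function names are hypothetical, and the chunk width kLanes = 8 is an arbitrary value chosen for illustration, not a property of the GCU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar version: one element per loop iteration.
std::vector<float> addScalar(const std::vector<float>& a,
                             const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// Chunked version: handles kLanes elements per outer iteration, the way a
// vector unit would, then finishes the leftover tail element by element.
std::vector<float> addChunked(const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    constexpr std::size_t kLanes = 8;  // illustrative width (assumption)
    std::vector<float> out(a.size());
    std::size_t i = 0;
    for (; i + kLanes <= a.size(); i += kLanes)
        for (std::size_t lane = 0; lane < kLanes; ++lane)
            out[i + lane] = a[i + lane] + b[i + lane];
    for (; i < a.size(); ++i) out[i] = a[i] + b[i];  // tail
    return out;
}
```

Both functions produce identical results for any input length. On real hardware the inner lane loop would map to a single vector instruction, which is where a performance difference would come from.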

3.3.2. simpleReductionAdd

This sample illustrates multiple implementations of reduction addition (scalar, vector, and vector_async) and measures the performance of each implementation.
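As a rough host-side illustration of the scalar and vector styles, the sketch below contrasts a single running sum with lane-wise partial sums that are combined at the end. It is plain C++ rather than kernel code, the names and the width kLanes = 8 are hypothetical, and the vector_async variant is not covered.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reduction: a single running sum.
long long reduceScalar(const std::vector<int>& data) {
    long long sum = 0;
    for (int v : data) sum += v;
    return sum;
}

// Lane-wise reduction: kLanes independent partial sums, combined at the end.
long long reduceLanes(const std::vector<int>& data) {
    constexpr std::size_t kLanes = 8;         // illustrative width (assumption)
    std::array<long long, kLanes> partial{};  // zero-initialized
    std::size_t i = 0;
    for (; i + kLanes <= data.size(); i += kLanes)
        for (std::size_t lane = 0; lane < kLanes; ++lane)
            partial[lane] += data[i + lane];
    long long sum = 0;
    for (long long p : partial) sum += p;         // combine the lanes
    for (; i < data.size(); ++i) sum += data[i];  // leftover tail
    assert(sum == reduceScalar(data));  // debug-only cross-check
    return sum;
}
```

Integer data is used so that both orderings give exactly the same result. With floating-point data the two versions can differ slightly because the additions are grouped differently.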

3.4. TOPS Features

3.4.1. memoryUsage

This sample demonstrates how to get the memory usage of a specific device.

3.4.2. sharedMemory

This sample demonstrates how to allocate an L2 Buffer directly and use it in a kernel function.

3.4.3. scatterMemory

This sample demonstrates how to use scatter memory to slice/deslice memories and better utilize device memory on different MCs.

3.4.4. resourceBundle

This sample demonstrates how to create resource bundles and use them in multiple threads.

3.4.5. graphMemoryNodes

This sample demonstrates how to create a graph by stream capture and launch the kernel with shared memory.

3.4.6. setLimit

This sample demonstrates how to use topsDeviceSetLimit() to set thread-specific topsLimitMultiProcessorCount and topsLimitMaxThreadsPerBlock limits, and how to use topsStreamGetLaunchLimit() with a grid/block limit hint for a stream. The worker threads can read the device limit and the stream hint, and then decide how many blocks and threads to use when launching kernels.

3.4.7. executableDump

This sample demonstrates how to create an executable from a prebuilt binary. The executable can be created directly from a file or from a preloaded binary buffer.

4. References

  • TopsCC Programming Guide