3. Samples List
3.1. Introduction
3.1.1. simpleStreams
This sample illustrates the use of TOPS streams to overlap kernel execution with device/host memory copies. The kernel initializes an array to a specific value, after which the array is copied to host (CPU) memory. To increase performance, multiple kernel/memcopy pairs are launched asynchronously, each pair in its own stream. Kernels are serialized, but each copy can overlap with the next kernel. Thus, if n pairs are launched, the streamed approach can reduce the memcopy cost to 1/n of a single copy of the entire data set. Additionally, this sample uses TOPS events to measure elapsed time. Elapsed times are averaged over nreps repetitions (10 by default).
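The 1/n figure can be checked with a small host-only cost model. This is a sketch under stated assumptions, not code from the sample: the function name `streamedTime` is hypothetical, no TOPS API is called, kernels are assumed to run back to back, and copies are assumed to share a single copy engine that can run concurrently with kernels.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical cost model (not part of the TOPS API).
// kernelTotal: time to run the kernel over the whole data set.
// copyTotal:   time to copy the whole data set in one piece.
// n:           number of kernel/memcopy pairs (one pair per stream).
// Returns the time at which the last copy finishes.
double streamedTime(double kernelTotal, double copyTotal, int n) {
    assert(n > 0);
    const double k = kernelTotal / n;  // one kernel chunk
    const double c = copyTotal / n;    // one copy chunk
    double kernelEnd = 0.0;
    double copyEnd = 0.0;
    for (int i = 0; i < n; ++i) {
        kernelEnd += k;  // kernels are serialized
        // Copy i starts once kernel i and copy i-1 have both finished.
        copyEnd = std::max(kernelEnd, copyEnd) + c;
    }
    return copyEnd;
}
```

With a kernel time of 8 and a copy time of 4, a single pair takes 8 + 4 = 12, while four pairs take 8 + 4/4 = 9: only the last quarter of the copy is left exposed. When the copy is slower than the kernel, the model gives kernelTotal/n + copyTotal instead, so the 1/n saving applies to whichever side is hidden.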
3.1.3. simplePrintf
This is a simple example of using printf() inside a kernel.
3.1.4. simpleAssert
This is a simple example of using assert() inside a kernel.
3.1.5. simpleTemplates
This sample is a templatized version of the template project.
3.1.6. simpleMultiThread
This sample demonstrates how to launch topscc kernels from multiple threads.
3.1.7. simpleZeroCopy
This sample illustrates how to use zero-copy memory: kernels can read and write pinned system memory directly.
3.1.8. simpleMultiGCU
This sample illustrates how to use the TOPS API to work with multiple GCUs.
3.1.9. simpleMultiCopy
This sample illustrates how to use TOPS streams to overlap kernel execution with data copies to and from the device.
3.1.10. simpleVectorAdd
This sample implements element by element vector addition.
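A host-side reference is a common way to check the result of such a kernel. The following is a minimal sketch in plain C++, assuming float data; `vectorAddRef` and `allClose` are hypothetical helper names and are not taken from the sample or from the TOPS API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference: c[i] = a[i] + b[i]. Plain C++, no TOPS calls.
std::vector<float> vectorAddRef(const std::vector<float>& a,
                                const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Returns true when every element of `got` is within `tol` of `want`.
bool allClose(const std::vector<float>& got,
              const std::vector<float>& want,
              float tol = 1e-5f) {
    if (got.size() != want.size()) return false;
    for (std::size_t i = 0; i < got.size(); ++i) {
        if (std::fabs(got[i] - want[i]) > tol) return false;
    }
    return true;
}
```

A device result copied back to the host could then be compared against `vectorAddRef(a, b)` with `allClose`.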
3.1.11. asyncAPI
This sample illustrates the use of TOPS events for both GCU timing and overlapping CPU and GCU execution. Events are inserted into a stream of calls. Since stream calls are asynchronous, the CPU can perform computations while the GCU is executing. The CPU can query events to determine whether the GCU has completed its tasks.
3.1.12. simpleP2P
This sample demonstrates how to copy device memory directly from peer to peer (P2P) and how to use remote device memory in a kernel, and it measures the bandwidth of a P2P memory copy.
3.1.13. simpleIPC
This basic sample demonstrates inter-process communication (IPC) with one process per GCU for computation.
3.1.14. simpleRTC
This sample demonstrates how to use the topsrtc mode.
To run the RTC cases, set CAPS_HOME to the TopsCC installation location:
export CAPS_HOME=/topscc/location
The default location is /opt/tops.
3.2. Utilities
3.2.1. deviceQuery
This sample queries the properties of the GCU devices present in the system via the Host Runtime API.
3.2.2. kernelEfficiency
This sample measures the elapsed time to launch an empty kernel.
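The measurement pattern can be sketched on the host with `std::chrono`. This is an illustration only: `averageLaunchMicros` is a hypothetical helper, and `launch` stands in for whatever call launches the empty kernel and waits for it; the sample itself may time the launch differently.

```cpp
#include <cassert>
#include <chrono>

// Average wall-clock time per call, in microseconds.
// `launch` is a hypothetical stand-in for "launch the empty kernel
// and wait for completion"; no TOPS API is used here.
template <typename F>
double averageLaunchMicros(F&& launch, int reps) {
    assert(reps > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < reps; ++i) {
        launch();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / reps;
}
```

Averaging over many repetitions reduces the effect of timer resolution, which matters when a single launch takes only a few microseconds.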
3.3. Concepts and Techniques
3.3.1. simpleElemwiseAdd
This sample illustrates multiple implementations of element-by-element addition, including scalar and vector algorithms, and measures the performance of each implementation.
3.3.2. simpleReductionAdd
This sample illustrates multiple implementations of reduction addition (scalar, vector, and vector_async) and measures the performance of each implementation.
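The difference between a scalar and a vector-style reduction can be sketched on the host. The code below is an analogy in plain C++, not the sample's kernels: the lane count `kLanes` and all function names are hypothetical, and the asynchronous variant is not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reduction: one running sum, one element per step.
long long reduceScalar(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v) sum += x;
    return sum;
}

// Hypothetical lane count, chosen only for illustration.
constexpr std::size_t kLanes = 8;

// Vector-style reduction: kLanes independent partial sums are updated
// per step and combined at the end; leftover elements are added last.
long long reduceLanes(const std::vector<int>& v) {
    std::array<long long, kLanes> partial{};
    const std::size_t full = v.size() / kLanes * kLanes;
    for (std::size_t i = 0; i < full; i += kLanes) {
        for (std::size_t l = 0; l < kLanes; ++l) {
            partial[l] += v[i + l];
        }
    }
    long long sum = 0;
    for (long long p : partial) sum += p;
    for (std::size_t i = full; i < v.size(); ++i) sum += v[i];
    return sum;
}

// Both variants must agree on integer input.
void checkAgainstScalar(const std::vector<int>& v) {
    assert(reduceLanes(v) == reduceScalar(v));
}
```

Integer input is used so that both variants give exactly the same result; with floating-point data the different summation order can change the last bits of the sum.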
3.4. TOPS Features
3.4.1. memoryUsage
This sample demonstrates how to get the memory usage of a specific device.
3.4.3. scatterMemory
This sample demonstrates how to use scatter memory to slice/deslice memory and make better use of device memory on different MCs.
3.4.4. resourceBundle
This sample demonstrates how to create resource bundles and use them in multiple threads.
3.4.5. graphMemoryNodes
This sample demonstrates how to create a graph by stream capture and launch the kernel with shared memory.
3.4.6. setLimit
This sample demonstrates how to use topsDeviceSetLimit() to set thread-specific topsLimitMultiProcessorCount and topsLimitMaxThreadsPerBlock limits, and how to use topsStreamGetLaunchLimit() for the grid/block limit hint of a stream. Worker threads can read the device limit and the stream hint, then decide how many blocks and threads to use when launching kernels.
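The last step, turning limits into a launch shape, is plain arithmetic and can be sketched without any TOPS calls. Everything here is an assumption made for illustration: `LaunchConfig` and `chooseLaunch` are hypothetical names, and treating the two limits as a cap on blocks and a cap on threads per block is one possible reading, not the sample's confirmed logic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical launch shape; not a TOPS type.
struct LaunchConfig {
    std::size_t blocks;
    std::size_t threadsPerBlock;
    std::size_t elemsPerThread;  // work each thread must cover
};

// Picks a launch shape for n elements that respects both caps.
// maxBlocks and maxThreadsPerBlock are assumed to come from the
// device limit and the stream hint described above.
LaunchConfig chooseLaunch(std::size_t n,
                          std::size_t maxBlocks,
                          std::size_t maxThreadsPerBlock) {
    assert(maxBlocks > 0 && maxThreadsPerBlock > 0);
    const std::size_t threads =
        std::max<std::size_t>(1, std::min(n, maxThreadsPerBlock));
    const std::size_t wantedBlocks = (n + threads - 1) / threads;  // ceil
    const std::size_t blocks =
        std::max<std::size_t>(1, std::min(wantedBlocks, maxBlocks));
    const std::size_t total = blocks * threads;
    const std::size_t perThread = (n + total - 1) / total;  // ceil
    return {blocks, threads, perThread};
}
```

For 10000 elements with at most 4 blocks of 256 threads, this yields 4 blocks, 256 threads per block, and 10 elements per thread, so the 1024 threads cover 10240 slots.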
3.4.7. executableDump
This sample demonstrates how to create an executable from a prebuilt binary. It can be created directly from a file or from a preloaded binary buffer.
4. References
TopsCC Programming Guide