Wilt, Nicholas. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education, Inc., 2013. 522 p.
Our Approach
Code
Administrative Items
Road Map
Hardware Architecture
CPU Configurations
Integrated GPUs
Multiple GPUs
Address Spaces in CUDA
CPU/GPU Interactions
GPU Architecture
Further Reading
Software Architecture
Software Layers
Devices and Initialization
Contexts
Modules and Functions
Kernels (Functions)
Device Memory
Streams and Events
Host Memory
CUDA Arrays and Texturing
Graphics Interoperability
The CUDA Runtime and CUDA Driver API
Software Environment
nvcc — CUDA Compiler Driver
ptxas — the PTX Assembler
cuobjdump
nvidia-smi
Amazon Web Services
Memory
Host Memory
Global Memory
Constant Memory
Local Memory
Texture Memory
Shared Memory
Memory Copy
Streams and Events
CPU/GPU Concurrency: Covering Driver Overhead
Asynchronous Memcpy
CUDA Events: CPU/GPU Synchronization
CUDA Events: Timing
Concurrent Copying and Kernel Processing
Mapped Pinned Memory
Concurrent Kernel Processing
GPU/GPU Synchronization: cudaStreamWaitEvent()
Source Code Reference
Kernel Execution
Overview
Syntax
Blocks, Threads, Warps, and Lanes
Occupancy
Dynamic Parallelism
Streaming Multiprocessors
Memory
Integer Support
Floating-Point Support
Conditional Code
Textures and Surfaces
Miscellaneous Instructions
Instruction Sets
Multiple GPUs
Overview
Peer-to-Peer
UVA: Inferring Device from Address
Inter-GPU Synchronization
Single-Threaded Multi-GPU
Multithreaded Multi-GPU
Texturing
Overview
Texture Memory
1D Texturing
Texture Setup
Texture as a Read Path
Increasing Effective Address Coverage
Texturing from Host Memory
Texturing with Unnormalized Coordinates
Texturing with Normalized Coordinates
1D Surface Read/Write
2D Texturing
2D Texturing: Copy Avoidance
3D Texturing
Layered Textures
Optimal Block Sizing and Performance
Texturing Quick References
Streaming Workloads
Device Memory
Asynchronous Memcpy
Streams
Mapped Pinned Memory
Performance and Summary
Reduction
Overview
Two-Pass Reduction
Single-Pass Reduction
Reduction with Atomics
Arbitrary Block Sizes
Reduction Using Arbitrary Data Types
Predicate Reduction
Warp Reduction with Shuffle
Scan
Definition and Variations
Overview
Scan and Circuit Design
CUDA Implementations
Warp Scans
Stream Compaction
References (Parallel Scan Algorithms)
Further Reading (Parallel Prefix Sum Circuits)
N-Body
Naïve Implementation
Shared Memory
Constant Memory
Warp Shuffle
Multiple GPUs and Scalability
CPU Optimizations
References and Further Reading
Image Processing: Normalized Correlation
Overview
Naïve Texture-Texture Implementation
Template in Constant Memory
Image in Shared Memory
Further Optimizations
Source Code
Performance and Further Reading
Further Reading
Appendix A: The CUDA Handbook Library
Timing
Threading
Driver API Facilities
Shmoos
Command Line Parsing
Error Handling