# **CUDA Deep Dive: From Fundamentals to Advanced Techniques**

### Nitin Shukla

HPC Application Engineer

October 27th 2024











## Contents: topics explored



### Performance consideration

Memory management, analysis Nsight and Nvidia

### Streams and Concurrency

Overlapping kernel execution & data transfer on Single/Multi GPU



### Computer architecture drives parallelism at the core level



Most modern processors implement

Memory (instruction memory and data memory) Central processing unit (control unit and arithmetic logic unit) Input/Output interfaces

#### Parallel computing two core technologies

Computer architecture i.e Hardware aspect Parallel programming i.e Software aspect





### Computer architecture drives parallelism at the core level

Fundamentals types of parallelism

- Task parallelisms: multiple independent tasks can run simultaneously, distributing functions across multiple cores
- Data parallelisms: multiple data items can be processed simultaneously, distributing the data across multiple cores

#### Heterogeneous computing

• CUDA programming: well-suited to address problems that can be expressed as data-parallel computations





### How GPUs are different than CPUs?



### CPU (host): minimize latency

### GPU(Device): maximize throughput



## Why computing perf/Watt matters?



2.3 PFlops



### 7.0 Megawatts

### 7000 homes



7.0 Megawatts GPU-accelerated computing started a new era





### GPU architecture

GPU architecture is built around a scalable array of SM

- CUDA cores
- Shared Memory/L1 Cache
- Register File
- Load/Store Units
- Special Function Units
- Warp Scheduler

| М                                                                                                                                                                                                                                                                                                                                                                                                                                   |                                                                                                                                                                           |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |                                                                                         | L1 Instr. | uction | n Cache                                                                                                                                                                                                                                                                                                                                                                                                                            |                                                                                         |                                                                                                         |                                                                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                                      |                               |               |  |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|-----------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------|-------------------------------|---------------|--|
|                                                                                                                                                                                                                                                                                                                                                                                                                                     | L0 In<br>Warp Sch<br>Dispatch                                                                                                                                             |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | LU Instruction Cache<br>Warp Scheduler (32 thread/clk)<br>Dispatch Unit (32 thread/clk) |           |        |                                                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                         |                                                                                                         |                                                                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                                      |                               |               |  |
| Register File (16,384 x 32-bit)                                                                                                                                                                                                                                                                                                                                                                                                     |                                                                                                                                                                           |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |                                                                                         |           |        |                                                                                                                                                                                                                                                                                                                                                                                                                                    | Reg                                                                                     | jister I                                                                                                | File (1                                                                    | 16,38                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | 4 x 32                               | l-bit)                        |               |  |
| INT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                         | FP32 FP32                                                                                                                                                                 | FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |                                                                                         |           |        | NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                         | FP32                                                                                    | FP32                                                                                                    | FP                                                                         | 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                      | _                             |               |  |
| INT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                         | FP32 FP32                                                                                                                                                                 | FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 1                                                                                       |           |        | NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                         | FP32                                                                                    | FP32                                                                                                    | FP                                                                         | 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                      |                               |               |  |
| NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                          | FP32 FP32                                                                                                                                                                 | FP04                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |                                                                                         |           |        | NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                         | FP32                                                                                    | FP32                                                                                                    | FP                                                                         | 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                      |                               |               |  |
| NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                          | FP32 FP32                                                                                                                                                                 | FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | TENEO                                                                                   | RCORE     |        | NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                         | FP32                                                                                    | FP32                                                                                                    | FP                                                                         | 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | п                                    | NRO                           |               |  |
| N T 32 IN T 32                                                                                                                                                                                                                                                                                                                                                                                                                      | FP32 FP32                                                                                                                                                                 | FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | TENSC                                                                                   | ROOKE     |        | N 132 IN 132                                                                                                                                                                                                                                                                                                                                                                                                                       | EP:12                                                                                   | FP32                                                                                                    | FP                                                                         | 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                      | -1130                         | CORE          |  |
| NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                          | FP32 FP32                                                                                                                                                                 | FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |                                                                                         |           |        |                                                                                                                                                                                                                                                                                                                                                                                                                                    | FP32                                                                                    | FP32                                                                                                    | FP                                                                         | FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |                                      |                               |               |  |
| NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                          | FP32 FP32                                                                                                                                                                 | <b>ЕРБИ</b>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                                                                                         |           |        | NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                         | FP32                                                                                    | FP32                                                                                                    | FP                                                                         | 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                      |                               |               |  |
| NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                          | FP32 FP32                                                                                                                                                                 | FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |                                                                                         |           |        |                                                                                                                                                                                                                                                                                                                                                                                                                                    | FP32                                                                                    | FP32                                                                                                    | FP                                                                         | 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                      |                               |               |  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                                                                                                                                                           |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |                                                                                         |           |        |                                                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                         |                                                                                                         |                                                                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                                      |                               |               |  |
| LD/ LD/<br>ST ST                                                                                                                                                                                                                                                                                                                                                                                                                    | LDV LDV<br>ST ST                                                                                                                                                          | LD) LD/<br>ST ST                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | LD/ LD/<br>ST ST                                                                        | SFU       |        | LD/ LD/<br>ST ST                                                                                                                                                                                                                                                                                                                                                                                                                   | LD/<br>ST                                                                               | LD'<br>ST                                                                                               | LD/<br>ST                                                                  | LDi<br>ST                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | LD/<br>ST                            | LDV<br>ST                     | SFU           |  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                     | ST ST<br>LO In<br>Warp Sch<br>Dispatch                                                                                                                                    | ST ST<br>Istruction C<br>eduler (32<br>D Unit [32 th                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | ST ST<br>Jache<br>thread/clk)<br>hread/clk)                                             | SFU       |        |                                                                                                                                                                                                                                                                                                                                                                                                                                    | ST<br>War<br>Di                                                                         | ST<br>L0 in<br>rp Sch<br>spatch                                                                         | struct<br>eduler<br>Unit (                                                 | 5T<br>tion C<br>r (32 t<br>(32 th                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | ST<br>ache<br>hread<br>read/o        | ST<br>/clk)<br>:lk)           | SFU           |  |
| ST ST                                                                                                                                                                                                                                                                                                                                                                                                                               | ST ST<br>LO In<br>Warp Sch<br>Dispatch<br>Register I                                                                                                                      | ST ST<br>Istruction C<br>eduler (32<br>D Unit [32 th<br>File (16, 38                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | st st<br>ache<br>thread/clk)                                                            | SFU       |        | ST ST                                                                                                                                                                                                                                                                                                                                                                                                                              | ST<br>War<br>Dit                                                                        | ST<br>L0 in<br>rp Sch<br>spatch<br>jister i                                                             | struct<br>eduler<br>i Unit (<br>File (1                                    | 5T<br>ion C<br>(32 t<br>(32 th<br>16, 384                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | ST<br>ache<br>hread<br>read/o        | ST<br>/clk)<br>:lk)           | SFU           |  |
| 5T 5T                                                                                                                                                                                                                                                                                                                                                                                                                               | ST ST<br>L0 In<br>Warp Sch<br>Dispatch<br>Register I<br>FP32 FP32                                                                                                         | ST ST<br>Istruction (<br>eduler (32 th<br>Unit [32 th<br>File (16, 38<br>FP54                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | ST ST<br>Jache<br>thread/clk)<br>hread/clk)                                             | SFU       |        | ST ST                                                                                                                                                                                                                                                                                                                                                                                                                              | ST<br>War<br>Di<br>Reg                                                                  | 5T<br>L0 in<br>p Sch<br>spatch<br>jister i<br>FP32                                                      | ST<br>eduler<br>Unit (<br>File (1                                          | 5T<br>ion C<br>(32 th<br>(32 th<br>16, 384<br>54                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | ST<br>ache<br>hread<br>read/o        | ST<br>/clk)<br>:lk)           | SFU           |  |
| ST ST<br>NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                                 | ST ST<br>L0 In<br>Warp Sch<br>Dispatch<br>Register I<br>FP32 FP32<br>FP32 FP32                                                                                            | ST ST<br>Instruction C<br>eduler (32<br>Unit (32 th<br>File (16, 38<br>FP64<br>FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | ST ST<br>Jache<br>thread/clk)<br>hread/clk)                                             | SFU       |        | ST ST<br>NT32 INT32<br>NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                                  | ST<br>War<br>Dif<br>Reg<br>FP32<br>FP32                                                 | 5T<br>L0 in<br>p Sch<br>spatch<br>ister i<br>FP32<br>FP32                                               | ST<br>eduler<br>Unit (<br>FIIe (1<br>FP                                    | 5T<br>ion C<br>(32 th<br>(32 th<br>(5, 384<br>64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | ST<br>ache<br>hread<br>read/o        | ST<br>/clk)<br>:lk)           | SFU           |  |
| ST ST<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                     | ST ST<br>L0 in<br>Warp Sch<br>Dispatch<br>Register i<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32                                                                               | ST ST<br>Istruction C<br>eduler (32<br>Unit [32 th<br>File (16, 38<br>FP64<br>FP64<br>FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | ST ST<br>Jache<br>thread/clk)<br>hread/clk)                                             | SFU       |        | ST ST<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32                                                                                                                                                                                                                                                                                                                                                                                    | ST<br>War<br>Di<br>Reg<br>FP32<br>FP32<br>FP32                                          | 5T<br>L0 in<br>rp Sch<br>spatch<br>ister i<br>FP32<br>FP32                                              | ST<br>eduler<br>Unit (<br>File (1<br>FP<br>FP                              | 5T<br>(021<br>(32 th<br>(32 th<br>(54)<br>64<br>64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | ST<br>ache<br>hread<br>read/o        | ST<br>/clk)<br>:lk)           | SFU           |  |
| ST ST<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32                                                                                                                                                                                                                                                                                                                                                                       | ST ST<br>L0 In<br>Werp Sch<br>Dispatch<br>Register I<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32                                                                  | ST ST<br>Istruction C<br>eduler (32<br>Unit [32 th<br>File (16, 38<br>FP64<br>FP64<br>FP64<br>FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | st st<br>ache<br>thread/clk)<br>nread/clk)<br>4 x 32-bit)                               |           |        | ST ST<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32                                                                                                                                                                                                                                                                                                                                                                      | ST<br>War<br>Di<br>Reg<br>FP32<br>FP32<br>FP32<br>FP32                                  | ST<br>L0 in<br>rp Sch<br>spatch<br>ister<br>FP32<br>FP32<br>FP32                                        | ST<br>eduler<br>Unit<br>File (1<br>FP<br>FP                                | 5T<br>(32 th<br>(32 th<br>(538-<br>64<br>64<br>64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | st<br>hread<br>read/d                | ST<br>/cik)<br>:ik)<br>2-bit) |               |  |
| ST ST<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32                                                                                                                                                                                                                                                                                                                                                         | ST ST<br>L0 In<br>Warp Sch<br>Dispatch<br>Register 1<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32                                                     | ST ST<br>Istruction C<br>eduler (32<br>Unit [32 th<br>File (16, 38<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | st st<br>ache<br>thread/clk)<br>nread/clk)<br>4 x 32-bit)                               |           |        | ST ST<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32                                                                                                                                                                                                                                                                                                                                                        | ST<br>War<br>Dif<br>Reg<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                         | ST<br>L0 in<br>p Sch<br>spatch<br>ister i<br>FP32<br>FP32<br>FP32<br>FP32                               | ST<br>eduler<br>Unit i<br>File (1<br>FP<br>FP<br>FP                        | 5T<br>(021<br>(32 th<br>(32 th<br>(32 th<br>(32 th<br>(32 th<br>(32 th<br>(32 th<br>(32 th)<br>(32 t                                                                                                             | st<br>hread<br>read/d                | ST<br>/cik)<br>:ik)<br>2-bit) |               |  |
| ST ST<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32                                                                                                                                                                                                                                                                                                                                           | ST ST<br>L0 In<br>Werp Sch<br>Dispatch<br>Register I<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32                                                                  | ST ST<br>Istruction C<br>eduler (32<br>Unit [32 th<br>File (16, 38<br>FP64<br>FP64<br>FP64<br>FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | st st<br>ache<br>thread/clk)<br>nread/clk)<br>4 x 32-bit)                               |           |        | ST ST<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32                                                                                                                                                                                                                                                                                                                                                                      | ST<br>War<br>Dif<br>Reg<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                         | 5T<br>L0 in<br>p Sch<br>spatch<br>jister i<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                      | ST<br>eduler<br>Unit<br>File (1<br>FP<br>FP                                | 5T<br>(021<br>(32 th<br>(32 th<br>(32 th<br>(32 th<br>(32 th<br>(32 th<br>(32 th<br>(32 th<br>(32 th)<br>(32 th                                                                                                             | st<br>hread<br>read/d                | ST<br>/cik)<br>:ik)<br>2-bit) |               |  |
| ST ST<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32<br>NT32 INT32                                                                                                                                                                                                                                                                                                                                           | ST ST<br>L0 In<br>Warp Sch<br>Dispatch<br>Register I<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32                                        | ST ST<br>Istruction (<br>eduler (32)<br>Unit [32 th<br>File (16, 38)<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | st st<br>ache<br>thread/clk)<br>nread/clk)<br>4 x 32-bit)                               |           |        | ST         ST           NT32         INT32                                                                                                                                                                              | ST<br>War<br>Dif<br>Reg<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                 | 5T<br>L0 in<br>p Sch<br>spatch<br>ister<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                 | ST<br>struct<br>eduler<br>Unit i<br>File (1<br>FP<br>FP<br>FP              | 5T<br>(02 f<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh)<br>(32 fh                                                                                                             | st<br>hread<br>read/d                | ST<br>/cik)<br>:ik)<br>2-bit) |               |  |
| ST         ST           NT32         NT32           NT32         INT32                                                                                         | ST ST<br>L0 In<br>Warp Sch<br>Dispatch<br>Register I<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32                           | ST ST<br>Istruction C<br>eduler (32<br>Unit [32 tr<br>File (16, 38<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | st st<br>ache<br>thread/clk)<br>nread/clk)<br>4 x 32-bit)                               |           |        | ST         ST           NT32         NT32           NT32         INT32                                                                                                                     | ST<br>War<br>Dif<br>Reg<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                 | 5T<br>L0 in<br>p Sch<br>spatch<br>ister<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32                 | ST<br>struct<br>Unit<br>File (1<br>FP<br>FP<br>FP<br>FP                    | 5T<br>(02 f<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh<br>(32 fh)<br>(32 fh                                                                                                             | st<br>hread<br>read/d                | ST<br>/cik)<br>:ik)<br>2-bit) | RCORE         |  |
| ST         ST           NT32         INT32           NT32         INT32 | ST ST<br>L0 In<br>Warp Sch<br>Dispatch<br>Register I<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32<br>FP32 FP32 | 5T 5T<br>Istruction (<br>eduler (32)<br>Unit [32 tf<br>File (16, 38<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64 | ST ST                                                                                   |           |        | ST         ST           NT32         NT32           NT32         INT32           NT32         INT32 | ST<br>War<br>Dir<br>Reg<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32 | ST<br>L0 in<br>p Sch<br>spatch<br>ister<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32 | ST<br>Struct<br>eduler<br>Unit (<br>FP)<br>FP)<br>FP)<br>FP)<br>FP)<br>FP) | 5T<br>(021<br>(321h<br>(321h<br>(321h<br>(321h<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h)<br>(321h | ST<br>auchu<br>hread<br>4 x 32<br>TE | ST<br>(cik)<br>(ik)<br>(-bit) | SFU<br>R CORE |  |



### Latency Hiding





### GPU acceleration for data-parallel tasks

Two important features that describe GPU capability

- Number of CUDA cores
- Memory size

GPU Performance Metrics: Throughput vs. Latency

- Peak computational performance measures in Tflops or Pflops, reflects a device's ability to perform floatingpoint calculations rapidly and efficiently
- Memory bandwidth
  - the rate at which data can be transferred between the CPU and memory, measured in gigabytes per second (GB/s). It directly impacts the speed of data-intensive applications.



### NVIDIA Tesla A100 with 54 Billion Transistor



- With 7nm technologies
- 19.5 teraflops of FP32 performance
- 6912 CUDA cores, 40GB of graphics memory, and 1.6TB/s of graphics memory bandwidth
- The A100 80GB model announced in Nov 2020, has 2.0TB/s graphics memory bandwidth

phics memory bandwidth hics memory bandwidth





### TOP10 System – November 2023



70 % of FLOP/s by GPUs, > 100 000 GPUs in Frontier+Aurora





### GPUs serve as a co-processor, not a standalone platform





## Ways to parallels an applications on Nvidia GPUs





# SYCL/ONEAP HACKATHON **OCINECA**

Empowering the Future of High-Performance Computing with SYCL

# CINECA Intel.

### **Register now!**







For further info / questions: a.masini@cineca.it



Follow the link: https://hpcortal.eu/node/2190



... or scan the QR code



# Why CUDA?

#### Performance

- Massive Parallelism: scale to 1000's of cores, 10000000's of parallel thread
- Massive Gain: substantial performance improvements in tasks that can be divided into smaller, concurrent operations

### Scalability

- Efficiently maps to the GPU architecture: well-suited for leveraging GPU capabilities
- Wide Range of Hardware: applications can scale from small embedded devices to large supercomputers

### Flexibility

- Programming Languages: supports various programming languages
- Easy to use: let programmers strip away complexity associated with parallel computing and focus on parallel algorithms



### What is CUDA?

#### CUDA : Compute Unified Device Architecture

- Enable heterogeneous systems (i.e., CPU+GPU)
- A new architecture instruction set called PTX (Parallel Thread eXecution) to match GPU typical hardware
- Parallelism allows developers to use GPUs for general purpose processing (GPGPU)

### The SDK includes

- A Drivers, runtimes and API
- Compiler wrappers for complain coda code (nvcc)
- Libraries (cuBLAS, cuFFT, cuSolver) debuggers (cuda-gdb, cuda-memcheck), profilers (nvprof, nView), etc
- CUDA-aware languages C/C++, Fortran, PyCUDA, CUDA.JI





## CUDA execution model

#### **CUDA programmer perspective**

- Heterogenous computing: combination of CPU and GPU
  - Host: The CPU and its memory
  - **Device:** The GPU and its memory
- **Execution:** Programs run a on the host and launch parallel code (kernels) on the device by many threads

#### Programming model view

- Kernels: A function written in CUDA C/C++ and executed on the GPU
- Launch configurations:
  - Threads: Smallest unit of execution in CUDA
  - **Block:** A collection of threads
  - Grid: A collection of blocks
- Memory management: Allocate and transfer data between host (CPU) and device (GPU)





## Compiling and running CUDA enable application

CUDA enhances your control over memory and thread hierarchies, optimizing execution and scheduling with:

#### Thread hierarchy structure



#### Memory hierarchy structure





### Embarrassing parallel code

Vector Addition

- Simple operation: a memory-bound operation
- Natural Fit for GPUs: Each element of a vector are independent
- Scalability: Larger vectors benefit from GPU or multi-core CPU parallelism, offering faster computation than serial processing.

### // CPU function

```
sumArraysOnHost(float *A, float *B, float *C, const int N)
 { for (int idx=0; idx<N; idx++)</pre>
       C[idx] = A[idx] + B[idx];
int main(int argc, char **argv)
      • •
     Start = cpuSecond();
     sumArrayOnCPU(h_A, h_B, h_C, N);
     Double cpuTime = cpuSecond() - start;
     printf("CPU Execution Time: %f second \n", cpuTime);
      • •
```





### Declaring Host–Called, Device–Executed Functions

- CUDA differentiates between these functions by using one of the following function type qualifiers as a prefix • \_\_\_\_\_global\_\_\_ qualifier for kernels that can be invoked globally
- <u>host</u> functions called from host and executed on the host
- <u>device</u> functions called from device and execute on the device (a function that is called from a kernel needs the \_\_\_\_device\_\_\_ qualifier)







## Step to Launching a CUDA Kernel

\_\_global\_\_ void()

Defines a kernel can be invoked globally either from CPU or GPU

#### **Execution configuration**

Kernel\_name <<<numBlocks, numThreads>>> (arguments); Specifies grid and block dimensions

#### Synchronization

Launching kernel is asynchronous cudaDeviceSynchronize(): wait until device code completeness

```
// Kernel
    _gl#ballude <stdio.h>
    sumArraysOnDevice(float *A, float *B, float *C, const int N)
                  ____global___ void onGPU()
            int idx = threadIdx.x + (blockIdx.x * blockDim.x)
            if{(idx<N)
                  pfiels("FhAd if the block on GPU\n");
}
                  }
 int main(int argc, char **argv)
 ł
                  int main()
         •• {
        start = cpuSecond();
        sumArgage@n&PU<<rpre>sumArgage@n&PU<<rpre>sumArgage@n&PU<<rpre>sumArgage@n&PU<<rpre>sumArgage@n&PU<<rpre>sumArgage@n&PU<<rpre>sumArgage@n&PU<<rpre>sumArgage@n&PU<<rpre>sumArgage@n&PU<<rpre>sumArgage@n&PU<<rpre>sumArgage@n&PU<<rpre>sumArgage@n&PU<</pre>sumArgage@n&PU<</pre>sumArgage@n&PU<</pre>sumArgage@n&PU<</pre>sumArgage@n&PU<</pre>sumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@n&PUsumArgage@nsumArgage@n
        cudaDeviceSynchronize();
       doubleugputime Sympusedond() - start;
        printf("GPU Execution Time: %f seconds\n", gpuTime);
         . }
 J
```



## CUDA launches arrays of parallel threads



A block has a fixed number of threads which are guaranteed to be running simultaneously on the same SM



## CUDA launches arrays of parallel threads

For fully utilisation of the parallel processing power of the GPU

A CUDA kernel is executed as a grid (array) of threads

- All threads in a grid run the same kernel code
- Each thread has a unique ID: threadIdx
- Threads are similar to data-parallel tasks. •
- Threads independently execute the same operation on a data subset
- Follows SPMD model i.e the Single Program Multiple Data => SIMT Single Instructions Multiple threads







### SIMT VS. SIMD execution model

Both SIMD and SIMT achieve parallelism by broadcasting a single instruction to multiple execution units

Consider how computations will be distributed between threads for the following loop (N >> threads count):



float \*A, \*B, \*C = ...; for (int I = 0; I <N; I++ ) A[I] = B[I] + C[I]</pre>



### SIMT VS. SIMD execution model

Both SIMD and SIMT achieve parallelism by broadcasting a single instruction to multiple execution units

A loose extension of SIMD which is what CUDA's computational model is, although there is key differences

- Single instruction, multiple registers
- Single instructions multiple addresses i.e. parallel memory access!
- Single instruction, multiple flow paths if statements are allowed!

#### SIMT allows

- CUDA GPU to perform "vector" computations on scalar cores
- Much easier to vectorise than getting compiler to autovectorize on CPU

https://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html

|      | SIMT thre | ead registe | ers    |
|------|-----------|-------------|--------|
| a[I] | a[I+1]    | a[I+2]      | a[I+3] |
| b[I] | a[I+1]    | b[I+2]      | b[I+3] |
| а    | а         | а           | а      |
| b    | b         | b           | b      |
| I    | I+1       | I+2         | I+3    |
| •••  | •••       | •••         | •••    |



### SIMT VS. SIMD execution model

Both SIMD and SIMT achieve parallelism by broadcasting a single instruction to multiple execution units

| Feature         | SIMD                        | SIMT                              |
|-----------------|-----------------------------|-----------------------------------|
| Architecture    | Traditional CPUs            | Utilized by NVIDIA GPUs           |
| Execution Unit  | Multiple data lanes         | Multiple threads (warps)          |
| Flexibility     | Low                         | High                              |
| Branch Handling | No support for divergence   | Supports thread divergence        |
| Best Suited For | Homogeneous data operations | Dynamic control flow applications |
| Common Usage    | CPU computing               | Vector processing on GPUs         |



### CUDA launches arrays of parallel threads





What is warp, and why is it important?



### What is WARP?

### Hardware Multithreading

- NVIDIA SM schedules threads in warps (groups of 32 threads)
- Warp simply means a group of threads that are scheduled together to execute the same instructions in lockstep.
- Execution contest stays on chip
- No overhead for switching warps
- Volta SM has 4 warp schedulers, each one is responsible for
  - feeding 32 CUDA cores
  - 8 load/store units
  - 8 special functions unit





## Warps as Scheduling Units



Groups (vectors) of 32 consecutive threads of a block that are executed in parallel in hardware

- An implementation technique, not part of the CUDA programming model
- basic unit of execution in an SM

| Warp | 0: | thread | Ο,  | thread | 1,  | thread | 2,  | • • • | thread | 31  |
|------|----|--------|-----|--------|-----|--------|-----|-------|--------|-----|
| Warp | 1: | thread | 32, | thread | 33, | thread | 34, | • • • | thread | 63  |
| Warp | 3: | thread | 64, | thread | 65, | thread | 66, | • • • | thread | 95  |
| Warp | 4: | thread | 96, | thread | 97, | thread | 98, | • • • | thread | 127 |



## Why do we need to have so many warps in an SM?

#### Latency hiding

#### **Resource Utilisation**

- Maximizing Throughput: More warps allow for better utilization of SM resources (ALUs, memory bandwidth)
- Load Balancing: Distributes the workload evenly across the available execution units

#### Parallelism

- Enhancing Parallel execution: Multiple warps increase the parallelism, enabling more threads to be processed concurrently
- Improved Performance: Higher parallelism leads to better performance and throughput for data-intensive applications

• Memory Access Latency: Multiple warps can hide memory access latency by switching to another ready warp when one warp is waiting for data • Instruction Pipeline Latency: Keeps the execution units busy while other warps are stalled due to dependencies or resource constraints



### GPU Thread hierarchy



## GPU Thread hierarchy

|              |              |              |              |              |              |              |              | GP | U con        | sists        | ofH          | und        |
|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|----|--------------|--------------|--------------|------------|
| Thread Block |    | Thread Block | Thread Block | Thread Block | Thread Blo |
|              |              | ЩЩ           |              |              |              | ШШ           |              |    |              |              | ЩЩ           | 111111     |
| Thread Block |    | Thread Block | Thread Block | Thread Block | Thread Blo |
|              |              | ЩЩ           |              |              |              | ЩЩ           |              |    |              |              | ЩЩ           |            |

#### Multi-proces

| Thread Block | Thread Block | Thread Block | Thread Block |
|--------------|--------------|--------------|--------------|
|              |              | ЩЩ           | ЩЩ           |
| Thread Block | Thread Block | Thread Block | Thread Block |
|              |              | ЩЩ           |              |

#### Block 1024 threads







#### CINECA



م م م م م م م

🔆 Warp







| ssors: | tens | of t | housa | nds |
|--------|------|------|-------|-----|





## Kernel execution across Thread, Block, and Grid



- In order to compute N elements on the GPU in parallel, at least N concurrent threads must be created on the device
- GPU threads are grouped together in teams or blocks of threads
- Threads belonging to the same block or team can cooperate togheter exchanging data through a shared memory cache area
- Each block of threads will be executed independently
- No assumption is made on the blocks execution order



### Kernel execution across Thread, Block, and Grid

gridDim.x: number of blocks in the grid, in this case 2





### Kernel execution across Thread, Block, and Grid

blockIdx.x: index of a blocks in a grid
blockDim.x: number of threads per block



blockDim.x = 4

blockIdx.x = 0

#### blockIdx.x = 1



# Kernel execution across Thread, Block, and Grid

**threadIdx.x:** index of the thread with a block







# Kernel execution across Thread, Block, and Grid

## Choose the optimal block size

- A limited number of threads (1024) can fit inside a thread block
- To increase parallelism, we need to **coordinate** work **among thread blocks**.
- This is achieved by **mapping** element of data vector to threads using **global index = threadIdx.x + blockIdx.x\*blockDim.x**





# Grid size larger than data set









Code must check that the **dataIndex** calculated by threadIdx.x + blockIdx.x \* blockDim.x is less than  $\mathbf{N}$ , the number of data elements.





# Choosing the optimal grid size

## Choose the optimal block size

- Write an execution configuration that creates more threads than necessary
- Pass a value as an argument into the kernel (N) that represents that total size if the data set to be processed/total threads needed to complete the work
- Calculate the global index and if it does not exceed N perform the kernel work

```
// Coalesced access example
__global__ vectorSum(int N)
int idx = threadIdx.x + blockIdx.x * blockDim.x;
{
    if(idx < N){ // only do work if it does}
}</pre>
```

## Know your limitations

Maximum size at each level of the thread hierarchy is device dependent. On A100 typical you get :

- Maximum number of threads per block : 1024
- Maximum sizes of x-, y-, and -z dimensions of threads block 1024 x 1024 x 64
- Maximum sizes of each dimension of grid of thread blocks: 65535 x
   65535 x 65535 (about 280,000 billion blocks)





# Every thread runs exactly the same program



A limited number of threads (1024) can fit inside a thread block To increase parallelism, we need to **coordinate** work **among thread blocks** 

This is achieved by **mapping** element of data vector to threads using **global index** 

All about this one line code

int index = threadIdx.x + (blockIdx.x \* blockDim.x)



## Transparent scalability



GPU with 2 SM

# Mapping to hardware

## CUDA invokes kernel grid

Host kicks off the execution of a kernel grid which contains blocks of threads

## 2 Execute concurrently

Each SM runs multiple thread blocks Each SP runs on thread from a thread blocks

3 Grid blocks distributed to SMs Shared cache, register and memory Global memory shared by all SMs





# Compiling and running CUDA enable application

CUDA enhances your control over memory and thread hierarchies, optimizing execution and scheduling with:

Thread hierarchy structure



Memory hierarchy structure





# Three simple processing steps





Copy input data from CPU memory to GPU





# Three simple processing steps



| 1 | Copy input data from CPU memory to GPU                            |
|---|-------------------------------------------------------------------|
| 2 | Load GPU program and execute caching data on chip for performance |





# Three simple processing steps



| 1 | Copy input data from CPU memory to GPU                            |
|---|-------------------------------------------------------------------|
| 2 | Load GPU program and execute caching data on chip for performance |

| 3 Copy results From GPU to CPU | memory |
|--------------------------------|--------|
|--------------------------------|--------|





## Data movement



2 Copy Device to host

3 Clean up memory for host and device

## // Copy data from host to device

checkCuda( cudaMemcpy(d\_A, h\_A, size, cudaMemcpyHostToDevice) ); checkCuda( cudaMemcpy(d\_B, h\_B, size, cudaMemcpyHostToDevice) );

### // Copy result from device to host

checkCuda( cudaMemcpy(h\_C\_ref, d\_C, size, cudaMemcpyDeviceToHost) );

## // Clean up memory

checkCuda( cudaFree(d\_A) ); checkCuda( cudaFree(d\_B) ); checkCuda( cudaFree(d\_C) ); cleanup(h\_A, h\_B, h\_C, h\_C\_ref);



## 1 How to compile CUDA enable application?



# CUDA components

## CUDA Driver

A critical piece of software that acts as the interface between your application and the NVIDIA GPU hardware

## 2 The CUDA Toolkit

NVHPC Compiler: translate CUDA into optimised machine instructions for NVIDIA GPUs

Libraries: Comprehensive libraries like cuBLAS and cuDNN are provided

Debugging tools: robust debugging tools





# CUDA components

## Compilation process

Code for host and device in some.cu file CUDA compiler separates source code into host and device components

Based LLVM open source compiler infrastructure



### nvcc –arch=sm\_70 –o out some–CUDA.cu –run

- arch: indicates for which architecture the files must be compiled (sm\_80 is for TESLA A100 GPU)
- run: execute the successfully compiled binary
- Information on CUDA device: nvidia-smi, deviceQuery



## 2 Measuring performance and Error handling



# Validate GPU results by comparing with CPU results

```
// Validate results
bool validateResults(float *hostRef, float *gpuRef, int nElem) {
  bool correct = true;
  for (int i = 0; i < nElem; i++) {</pre>
  if (fabs(hostRef[i] - gpuRef[i]) > 1e-5) {
  correct = false;
  printf("Mismatch at index %d: CPU = %f, GPU = %f\n", i, hostRef[i], gpuRef[I]);
  break;
}
  if (correct) {
  printf("Results match!\n");
  return correct;
}
```





# Kernel Launch Errors

- Error handling in accelerated CUDA code is essential.
- All CUDA API returns an error code of type cudaError t
  - Special value cudaSuccess means that no error occurred
- An error message can be printed with cudaGetErrorString

```
cudaError_t err;
err = cudaMallocManaged(&a, N);
if(err != cudaSuccess) { printf("Error: %s \n", cudaGetErrorString(err)); }
```

• To check for errors occurring at the time of kernel launch, CUDA provides the cudaGetLastError function, which does return a value of type cudaError t

```
someKernel <<<1, -1 >>>();
                              // - 1 is not a valid number of threads
cudaError_t err;
err = cudaGetLastError();
if(err != cudaSuccess) { printf("Error: %s \n", cudaGetErrorString(err)); }
```



# **CUDA Error Handling Function**

- A macro that wraps CUDA function calls for checking errors could be useful
- Can be wrapped around any function that returns a cudaError t

```
#include <stdio.h>
#include <assert.h>
inline cudaError_t checkCuda(cudaError_t result) {
 if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    assert(result == cudaSuccess); }
 return result; }
int main() {
/* The macro can be wrapped around any function returning
* a value of type `cudaError_t`.
*/
checkCuda( cudaDeviceSynchronize() )
```



## Asynchronous errors

To catch errors that occur in asynchronous part of the code (for example during the execution of an asynchronous kernel), check the status returned by a subsequent synchronizing CUDA runtime API call, such as cudaDeviceSynchronize.

```
cudaError_t asynchErr;
asynchErr = cudaDeviceSynchronize(); if (asynchErr != cudaSuccess)
   printf("Error: %s\n", cudaGetErrorString(err));
```



# Timing your kernel

```
double cpuSecond() {
    struct timespec ts;
   timespec_get(&ts, TIME_UTC);
    return ((double)ts.tv_sec + (double)ts.tv_nsec * 1.e-9);
}
/* Measure time for CPU execution */
   double start = cpuSecond();
   sumArraysOnCPU(h_A, h_B, hostRef, nElem);
   double cpuTime = cpuSecond() - start;
   printf("CPU Execution Time: %f seconds\n", cpuTime);
/* Measure time for GPU execution
   double start = cpuSecond();
   sumArraysOnGPU<<<gridSize, blockSize>>>(d_A, d_B, d_C, nElem);
   checkCuda( cudaDeviceSynchronize() ); // Ensure GPU kernel finishes
   double gpuTime = cpuSecond() - start;
   printf("GPU Execution Time: %f seconds\n", gpuTime);
```









# Measuring performance with events

An event in CUDA is essentially a GPU time stamp that is recorded at a user-specified point in time. The API calls that create and destroy events, record events and convert timestamp difference into a floating-point value in milliseconds

## How to time code using CUDA events

```
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventRecord(&stop);
cudaEventRecord( start, 0 );
kernel << < grid, threads >>> ( d_odata, d_idata, size_x, size_y, NUM_REPS);
// do some work on the GPU
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
cudaEventElapsedTime( &time, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );
```



# Time your kernels

| Ν       | Elapsed Time on Host | Kernel Configuration | Elapsed Time on Device | Speed up [Second] |
|---------|----------------------|----------------------|------------------------|-------------------|
| 1 <<20  | 0.000757             | (4096, 256)          | 0.000206               | 3.67              |
| 1 << 24 | 0.00013451           | (4096, 256)          | 0.000447               | 30.12             |
| 1 << 26 | 0.052383             | (524288, 128)        | 0.001013               | 51.72             |
| 1 << 29 | 0.424363             | (524288, 128)        | 0.008173               | 51.92             |





## 3 Are there hardware constraints on threads per block and blocks per grid?



# When the data set is larger than grid size?

## Advantages of Grid-stride loops

- Scalability: handles any size of input data regardless of hardware contains. It ensures all the data is processed
- Efficient resource utilisations: It allows the kernels to utilise all available threads efficiently by feeding more jobs
- Simplicity: straight forward implementation, without needing any complex logic to manage the devision of the work

```
// Coalesced access example
__global__ vectorSum(int N)
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int gridStride = gridDim.x * blockDim.x;
{
    if(idx < N){ // only do work if it does}
}</pre>
```





## Ways to improve your code

### Types of Data transfer

Pageable and Pinned memory Unified memory and Asynchronous Prefetching Global memory reads/writes

Aligned and coalesced memory accesses that reduce wasted

bandwidth

Array of Structure versus Structure of Array

Overlapping Kernel and Data movement by using non-default

streams

### Performance tuning

Parallelising higher dimensions-2D Unrolling techniques Matrix Transpose Problem Shared memory

3



## Data transfer impacts on performance





# Measuring performance with events



Important to minimise the transfer between the host and device



# **Application Performance constraints**

Roofline Model

- Key Concept: Computational Intensity:
  - Defined as FLOP (floating-point operations) per byte of memory transferred
- Latency Hiding:
  - Utilizing multiple warps on a Streaming Multiprocessor (SM) enables concurrent computation.
  - While some warps wait for memory transfers, others can continue executing
- Combined Performance:
  - The model illustrates how computation and memory transfer can overlap, represented as:
    - Performance = max(compute, memory transfer)





# GPU vs. CPU: Understanding Performance Trade-offs

Impact of data transfer on overall application performance







# Understanding CUDA Memory Hierarchy



Shared Memory

Medium speed, shared among threads

**Global Memory** 

Slowest, largest, off-chip



# GPU memory breakdown

Device code can

- R/W per-thread registers
- R/W per-thread Local Memory
- R/W per-block Shared Memory
- R/W per-grid global Memory
- Read only per-grid Constant Memory
- Read only per-grid Texture Memory

Host code can

- Transfer data to/from per-grid global and constant memories





# CUDA Variable Declaration Summary

| QUALIFIER | VARIABLE NAME     | MEMORY   | SCOPE  | LIFESPAN    |
|-----------|-------------------|----------|--------|-------------|
|           | float var         | Register | Thread | Thread      |
|           | float<br>var[100] | Local    | Thread | Thread      |
| shared    | float var†        | Shared   | Block  | Block       |
| device    | float var†        | Global   | Global | Application |
| constant  | float var†        | Constant | Global | Application |
|           |                   |          |        |             |



## CUDA memory management

1. Memory allocation

allocation can be performed using different memory types, such as global, shared and constant memory

2. Memory transfer

Process of copying data from one memory location to another Memory copy can be performed using different memory types, such as host memory and device memory

3. Memory synchronization

Process of reserving memory space for a variable or data structure Memory

Process of coordinating the access of multiple threads to shared memory or global memory Synchronization primitives: atomic operations, barriers, and locks



# Data transfer between host and device

Pageable Data Transfer



Pageable data transfer is default method

- Allocated host memory is pageable
- GPU cannot safely access data in pageable host memory
- When transferring data between the host and device, the CUDA driver first copies data from pageable host memory to a page locked or pinned memory buffer before sending it to the device
- Pageable memory in CUDA is used for memory allocation when data transfers between the CPU and GPU are infrequent



# Data memory allocation/release

- cudaMemcpy (void\* dst, void \*src, size\_t nbytes, cudaMemcpyKind kind)
  - Direction specifies locations (host or device) of src and dst
  - Blocks CPU thread (returns after the copy is complete)
  - Does not start copy until previous CUDA calls complete
- Kind: specifies the direction of the memory copy
  - cudaMemcpyHostToHost
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost

- CudaFree( devPtr )
  - Free memory from device Global memory
  - Pointer to free object



## Data memory allocation/release

Refers to the coordination of threads accessing global memory or shared memory

- Device synchronization
  - In CUDA, the CPU and the GPU operate asynchronously
  - Synchronization is necessary to ensure that the GPU has finished executing before continuing with the CPU code

cudaMemcpy(d\_data, h\_data, size \* sizeof(float), cudaMemcpyHostToDevice); cudaMemcpyAsync(h\_result, d\_result, size \* sizeof(float), cudaMemcpyDeviceToHost); cudaDeviceSynchronize();

- Thread synchronization
  - Threads within a block can access shared memory, which is a memory space shared among all threads in a block
  - Ensure that threads accessing shared memory do not interfere with each other

```
____syncthreads();
// compute using shared memory
```

### CINECA



## Data transfer between host and device

Pageable Data Transfer

Pinned Data Transfer



| M |               |
|---|---------------|
|   | $\mathcal{I}$ |

| d   |  |
|-----|--|
| ory |  |
|     |  |

Pinned data transfer is pinned or locked

- Memory cannot be moved by the operating system
- Pinned memory is memory that is locked in physical memory and is accessible to both the CPU and the GPU
- Allocation and deallocation is expensive than pageable memory
- Provides higher transfer throughput for large data transfers

### CINECA



## Pageable and pinned memory transfer

Pageable Data Transfer

// allocate and initialize int \*h\_a, \*d\_a; // host and device specific arrays h\_a = (float\*)malloc(nbytes); cudaMalloc( &d\_a, nbytes);

// memcpy H->D
cudaMemcpy( d\_a, h\_a, nbytes, cudaMemcpyHostToDevice);

// kernel compute
kernelGPU<<<>>>(..., d\_a, ...);

//cudaMemcpy D->H
cudaMemcpy( h\_a, d\_a, nbytes, cudaMemcpyDeviceToHost);
verifyOnHost(host\_a, N);

//Free host and device memory
cudaFree(device\_a); Free(host\_a)

Pinned Data Transfer

// allocate and initialize
cudaMallocHost(nbytes);
cudaMalloc( &d\_a, nbytes);

// memcpy H->D
cudaMemcpy( d\_a, h\_a, nbytes, cudaMemcpyHostToDevice);

// kernel compute
kernelGPU<<<>>>(..., d\_a, ...);

//cudaMemcpy D->H

cudaMemcpy( h\_a, d\_a, nbytes, cudaMemcpyDeviceToHost); verifyOnHost(host\_a, N);

//Free host and device memory

cudaFree(device\_a); cudaFreeHost(host\_a)



## Vector sum pageable memory transfer

Pageable Data Transfer

```
/* Host memory allocation */
```

float \*h\_A, \*h\_B, \*hostRef, \*gpuRef;

```
h_A = (float*)malloc(size);
```

```
h_B = (float*)malloc(size);
```

hostRef = (float\*)malloc(size);

```
// Result from CPU
gpuRef = (float *)malloc(size);
                               // Result from GPU
```

```
/* malloc device global memory */
  *float *d_A, *d_B, *d_C;
  checkCuda( cudaMalloc(&d_A, size) );
  checkCuda( cudaMalloc(&d_B, size) );
  checkCuda( cudaMalloc(&d_C, size) );
```

/\* Copy data from host to device\*/ IcudaMemcpy(d\_A, h\_A, size, cudaMemcpyHostToDevice); cudaMemcpy(d\_B, h\_B, size, cudaMemcpyHostToDevice);

- /\* Define block and grid sizes \*/ int blockSize = 256; int gridSize = (nElem + blockSize - 1) / blockSize;
- /\* Measure time for GPU execution \*/ start = cpuSecond(); sumArraysOnGPU<<<gridSize, blockSize>>>(d\_A, d\_B, d\_C, nElem); checkCuda( cudaDeviceSynchronize() ); // Ensure GPU kernel finishes double gpuTime = cpuSecond() - start; printf("GPU Execution Time: %f seconds\n", gpuTime);
- /\* Copy result from device to host \*/ checkCuda( cudaMemcpy(gpuRef, d\_C, size, cudaMemcpyDeviceToHost) );

### CINECA



## Vector sum pinned memory transfer

### Pinned Data Transfer

- /\* malloc device global memory \*/
  - float \*h\_A, \*h\_B, \*hostRef, \*gpuRef;
  - cudaMallocHost((void\*\*)&h\_A, size); // Use cudaMallocHost for pinned memory cudaMallocHost((void\*\*)&h\_B, size); // Use cudaMallocHost for pinned memory cudaMallocHost((void\*\*)&hostRef, size); // Result from CPU cudaMallocHost((void\*\*)&gpuRef, size); // Result from GPU
- /\* Copy data from host to device \*/
   checkCuda( cudaMemcpy(d\_A, h\_A, size, cudaMemcpyHostToDevice) );
   checkCuda( cudaMemcpy(d\_B, h\_B, size, cudaMemcpyHostToDevice) );
- /\* malloc device global memory \*/

float \*d\_A, \*d\_B, \*d\_C; checkCuda( cudaMalloc(&d\_A, size) ); checkCuda( cudaMalloc(&d\_B, size) ); checkCuda( cudaMalloc(&d\_C, size) );

/\* Copy data from host to device \*/
 checkCuda( cudaMemcpy(d\_A, h\_A, size, cudaMemcpyHostToDevice) );
 checkCuda( cudaMemcpy(d\_B, h\_B, size, cudaMemcpyHostToDevice) );

- /\* Define block and grid sizes \*/
  int blockSize = 256;
  int gridSize = (nElem + blockSize 1) / blockSize;
- /\* Measure time for GPU execution \*/
  start = cpuSecond();
  sumArraysOnGPU<<<gridSize, blockSize>>>(d\_A, d\_B, d\_C, nElem);
  checkCuda( cudaDeviceSynchronize() ); // Ensure GPU kernel finishes
  double gpuTime = cpuSecond() start;
  printf("GPU Execution Time: %f seconds\n", gpuTime);

/\* Copy result from device to host \*/
 checkCuda( cudaMemcpy(gpuRef, d\_C, size, cudaMemcpyDeviceToHost) );

### CINECA



## Vector sum pageable and pinned memory transfer

| Ν           | Pageable mem transfer | Pinned mem transfer | SlowDown          |
|-------------|-----------------------|---------------------|-------------------|
| <br>1 << 20 | 0.000500              | 0.00036             | 0.72              |
| 1 << 22     | 0.000486              | 0.000225            | 0.462962962963    |
| 1 << 24     | 0.001842              | 0.002379            | 1.29153094462541  |
| 1 << 26     | 0.003168              | 0.001021            | 0.322285353535354 |
| 1 << 28     | 0.004015              | 0.007195            | 1.7920298879203   |
| 1 << 30     | 0.029974              | 0.019884            | 0.663374924934944 |
|             |                       |                     |                   |



## Zero-copy memory

Host cannot access device variables and device cannot access host variables directly, one exception rule to this : zero copy memory



- Leveraging host memory when there is insufficient device memory
- Avoiding explicit data transfer between the host and device
- Improving PCIe transfer rates

### CUDA API call

2

- cudaHostAlloc(void \*\*ptr, size\_t size, unsigned int flags);
- flags = cudaHostAllocMapped, cudaHostAllocDefault, cudaHostAllocPortable

• When using zero-copy memory to share data between the host and device, you must synchronise memory access across the host and device

• Most relevant flag to zero-copy memory is cudaHostAllocMapped, which returns host memory that is mapped into the device address space



## Vector sum Zero copy transfer

### Zero Data Transfer

- /\* Allocate and initialize host memory for zero-copy\*/
  cudaHostAlloc((void\*\*)&h\_A, size, cudaHostAllocMapped);
  cudaHostAlloc((void\*\*)&h\_B, size, cudaHostAllocMapped);
  cudaHostAlloc((void\*\*)&h\_C, size, cudaHostAllocMapped);
- /\* Get device pointers for zero-copy memory\*/
   cudaHostGetDevicePointer(&d\_A, h\_A, 0);
   cudaHostGetDevicePointer(&d\_B, h\_B, 0);
   cudaHostGetDevicePointer(&d\_C, h\_C, 0);
- /\* malloc device global memory \*/
  float \*d\_A, \*d\_B, \*d\_C;
  checkCuda( cudaMalloc(&d\_A, size) );
  checkCuda( cudaMalloc(&d\_B, size) );
  checkCuda( cudaMalloc(&d\_C, size) );
- /\* Copy data from host to device \*/
   checkCuda( cudaMemcpy(d\_A, h\_A, size, cudaMemcpyHostToDevice) );
   checkCuda( cudaMemcpy(d\_B, h\_B, size, cudaMemcpyHostToDevice) );

- /\* Define block and grid sizes \*/
  int blockSize = 256;
  int gridSize = (nElem + blockSize 1) / blockSize;
- /\* Measure time for GPU execution \*/
  start = cpuSecond();
  sumArraysOnGPU<<<gridSize, blockSize>>>(d\_A, d\_B, d\_C, nElem);
  checkCuda( cudaDeviceSynchronize() ); // Ensure GPU kernel finishes
  double gpuTime = cpuSecond() start;
  printf("GPU Execution Time: %f seconds\n", gpuTime);
- /\* Copy result from device to host \*/
   checkCuda( cudaMemcpy(gpuRef, d\_C, size, cudaMemcpyDeviceToHost) );

### CINECA



## Comparison of Zero-copy Memory vs Device Memory

| SIZE   | <b>Device memory</b><br>(ELAPSED TIME [s]) |
|--------|--------------------------------------------|
| 1 KB   | 0.000033                                   |
| 4 KB   | 0.007286                                   |
| 16 KB  | 0.007289                                   |
| 64 KB  | 0.001673                                   |
| 256 kB | 0.002434                                   |
| 1 MB   | 0.002446                                   |
| 4 MB   | 0.000849                                   |
| 16 MB  | 0.004292                                   |
| 64 MB  | 0.012136                                   |
| 256 MB | 0.051559                                   |

| <b>Zero-copy Memory</b><br>(ELAPSED TIME [s]) | SlowDown          |
|-----------------------------------------------|-------------------|
| 0.00014                                       | 0.4242424242424   |
| 0.002334                                      | 0.320340378808674 |
| 0.002335                                      | 0.320345726437097 |
| 0.002342                                      | 1.39988045427376  |
| 0.002358                                      | 0.968775677896467 |
| 0.002524                                      | 1.03188879803761  |
| 0.000454                                      | 0.534746760895171 |
| 0.004123                                      | 0.960624417520969 |
| 0.007024                                      | 0.578773895847067 |
| 0.029347                                      | 0.569192575496034 |

------------------. . . . . . . . . . . . . . . . . . . . . . . . -----. . . . . . . . . . . . .

## Unified virtual memory (UVM)

Increased memory latency

- Single allocation, single pointer, accessible everywhere eliminate the need of explicit copy and simplify code porting
- Enables the sharing of memory which reduces overall usage

Limited control over memory placement

UVM automatically manages memory placement, which may not always be optimal for a given application







When **UM** is allocated, it may not be resident initially on the CPU or the GPU

Time



DEEP





When some work asks for the memory for the first time, a **page fault** will occur

Time









The page fault will trigger the migration of the demanded memory

Time







This process repeats anytime the memory is requested somewhere in the system where it is not resident

**?** work<<<>>> ()

Time





This process repeats anytime the memory is requested somewhere in the system where it is not resident





## Simplified memory management code

Allow to **allocate** and **free memory** 



### **CUDA Code with UM**

int N = 10000;size\_t size = N\*sizeof(int);

int \*a; cudaMallocManaged(&a, size);

cudaFree(a);



## Vector sum Unified memory transfer

Unified memory Transfer

/\* Unified Memory allocation \*/

float \*a, \*b, \*hostRef, \*gpuRef; checkCuda(cudaMallocManaged(&a, size)); checkCuda(cudaMallocManaged(&b, size)); checkCuda(cudaMallocManaged(&hostRef, size)); checkCuda(cudaMallocManaged(&gpuRef, size));

- /\* Define block and grid sizes \*/
  int blockSize = 256;
  int gridSize = (nElem + blockSize 1) / blockSize;
- /\* Measure time for GPU execution \*/
  start = cpuSecond();
  sumArraysOnGPU<<<gridSize, blockSize>>>(d\_A, d\_B, d\_C, nElem);
  checkCuda( cudaDeviceSynchronize() );
  double gpuTime = cpuSecond() start;
  printf("GPU Execution Time: %f seconds\n", gpuTime);
- /\* Copy result from device to host \*/
   checkCuda( cudaMemcpy(gpuRef, d\_C, size,
   cudaMemcpyDeviceToHost) );

### CINECA



### 4 Performance consideration



# Best Practices for porting a code

Understand the application

Mini app, Understand if the kernel is memory or compute bound

Identify Hot Spots

2

3

4

Analyze your application's memory access patterns and identify the critical data that should be prefetched

Time Prefetching

Carefully time the prefetch operations to overlap with kernel execution and minimize latency

### **Monitor Performance**

Use profiling tools to measure the impact and fine-tune its usage: profiler the code with Nsight-system + NVTX, Nsight compute





## It's all about memory access patterns

Depending on how you access memory bandwidth can very greatly!





## Memory access patterns

M33



| Loa | ading i | teratio        | on O           | Loading iteration 1 |                |                |                |
|-----|---------|----------------|----------------|---------------------|----------------|----------------|----------------|
| To  | T       | T <sub>2</sub> | T <sub>3</sub> | Τo                  | T <sub>1</sub> | T <sub>2</sub> | T <sub>3</sub> |

- For blocks that consist of multiple dimensions of threads, the dimensions will be projected into a linear order before partitioning into warps
- Each thread is shown as M(x,y), with x being the threadIdx.x and y being threadIdx.y for the thread
- Cooperatively, the 32 threads in a warp present a single memory access request comprised of the requested addresses, which is serviced by one or more device memory transactions.



## Memory access patterns



| Loa | ading i | teratio        | on O           | Loading iteration 1 |                |                |                |
|-----|---------|----------------|----------------|---------------------|----------------|----------------|----------------|
| To  | T       | T <sub>2</sub> | T <sub>3</sub> | Τo                  | T <sub>1</sub> | T <sub>2</sub> | T <sub>3</sub> |

### Aligned direction in kernel code

| Moo             | M10             | M <sub>20</sub> | M30             |
|-----------------|-----------------|-----------------|-----------------|
| M10             | Μη              | M <sub>12</sub> | M <sub>13</sub> |
| M <sub>20</sub> | M <sub>21</sub> | M <sub>22</sub> | M <sub>23</sub> |
| M <sub>30</sub> | M <sub>31</sub> | M <sub>32</sub> | M33             |





## Memory bandwidth limits GPU-enabled applications

- Memory operations are issued per warp, with each thread providing its own memory address
- Global memory loads/stores are staged through L2 and sometimes L1 caches
- Global memory accesses go through L2 cache, with optional L1 cache usage based on architecture
- Memory transactions use 128-byte or 32-byte segments, depending on cache involvement
- L1 cache lines are 128 bytes and map to 128-byte aligned segments in device memory
- Perfect mapping occurs when each thread in a warp requests one 4-byte value, matching the 128byte cache line size







# Efficient memory access is crucial

Aligned Memory Access

accessed by threads are arranged such that each thread accesses data in consecutive memory locations L1 and L2 cache granularity: 32 bytes 128 byte



Misaligned Memory Access accessed by threads are not consecutive or not aligned to memory transaction boundaries







# Efficient memory access is crucial

\_\_global\_\_ void sumAddalignedacces(float \*a, float \*b, float \*c, int n, int offset) { for (int idx = offset, k = 0; idx < n; idx++, k++) C[k] = A[idx] + B[idx];}

\_\_global\_\_ void missedAlignedAccessed(float \*a, float \*b, float \*c, int n) { int index = blockIdx.x \* blockDim.x + threadIdx.x; int k = i + offset; if (int i < k) { c[i] = a[i] + b[I]; }



# Time your kernels

| Offset | SIMULATION TIME (SECONDS) |
|--------|---------------------------|
| 0      | 0.003968                  |
| 12     | 0.004011                  |
| 33     | 0.004024                  |



# Array of Structure versus Structure of Arrays





```
Structure of Arrays (SOA)
```

```
struct innerStruct {
  float x[N];
  float y[N];
};
```

struct innerArray moa;







# Sample code: EPIC in a predefined electric field

**Basic assumptions** 

Only compute the force from electric field Neglect magnetic field

Main function

Particle position Particle velocity Electric field









# AOS: EPIC in a predefined electric field

```
Struct for ParticleList
struct ParticleList {
  // An array of particles Structures
   struct Particle* parts;
   // This represents the number of particles in the array
  int n;
};
pl->parts = (struct Particle*)malloc(n * sizeof(struct Particle));
// Set the electric field for each particle
void setE(struct ParticleList* pl, int DIM) {
   for (int i = 0; i < pl->n; ++i) {
       for (int j = 0; j < DIM; ++j) {</pre>
           pl->parts[i].E[j] = sin(M_PI * pl->parts[i].pos[j]);
}
// Accelerate particles by updating their velocity
void accel(struct ParticleList* pl, double dt, int DIM) {
   for (int i = 0; i < pl ->n; ++i) {
       for (int j = 0; j < DIM; ++j) {</pre>
           pl->parts[i].vel[j] += dt * pl->parts[i].q / pl->parts[i].m * pl->parts[i].E[j];
}
```

```
// Move particles by updating their position
void move(struct ParticleList* pl, double dt, int DIM) {
    for (int i = 0; i < pl->n; ++i) {
       for (int j = 0; j < DIM; ++j) {</pre>
           pl->parts[i].pos[j] += dt * pl->parts[i].vel[j];
           // Apply periodic boundary conditions
           if (pl->parts[i].pos[j] > 1.0) {
               pl->parts[i].pos[j] -= 1.0;
           if (pl->parts[i].pos[j] < 0.0) {</pre>
               pl->parts[i].pos[j] += 1.0;
    // Main simulation loop
    int step = 0;
    for (double t = 0; t < 1; t += dt, ++step) {
        nvtxRangePush("Time Step");
        nvtxRangePush("setE");
        setE(&p, DIM); // Update electric field for all particles
        nvtxRangePop(); //SetE
        nvtxRangePush("accel");
        accel(&p, dt, DIM); // Update velocities of all particles
        nvtxRangePop(); // Accel
        nvtxRangePush("move");
        move(&p, dt, DIM); // Update positions of all particles
        nvtxRangePop();
        nvtxRangePop(); // Time Step
        // Save data every ndumps steps
        if (step % ndumps == 0) {
            printData(&p, t, outFile, DIM); // Save particle data
```



# SOA: EPIC in a predefined electric field

```
Struct for ParticleList
struct ParticleList {
   double *pos[MAX_DIM]; // Array of pointers for position
   double *vel[MAX_DIM]; // Array of pointers for velocity
   double *E[MAX DIM]; // Array of pointers for electric field
   double *q;
                        // Array for charges
   double *m;
                        // Array for masses
                        // Number of particles
   int n;
};
for (int i = 0; i < DIM; ++i) {</pre>
   pl->pos[i] = (double*)malloc(n * sizeof(double));
   pl->vel[i] = (double*)malloc(n * sizeof(double));
   pl->E[i] = (double*)malloc(n * sizeof(double));
}
pl->q = (double*)malloc(n * sizeof(double));
pl->m = (double*)malloc(n * sizeof(double));
```

```
Data access pattern in functions like `setE`, `accel`, and `move`:
for (int j = 0; j < DIM; ++j) {</pre>
    for (int i = 0; i < pl->n; ++i) {
        pl->E[j][i] = sin(M_PI * pl->pos[j][i]);
// Accelerate particles by updating their velocity
void accel(struct ParticleList* pl, double dt, int DIM) {
    for (int j = 0; j < DIM; ++j) {</pre>
        for (int i = 0; i < pl->n; ++i) {
            pl->vel[j][i] += dt * pl->q[i] / pl->m[i] * pl->E[j][i];
// Move particles by updating their position
void move(struct ParticleList* pl, double dt, int DIM) {
    for (int j = 0; j < DIM; ++j) {</pre>
        for (int i = 0; i < pl->n; ++i) {
            pl->pos[j][i] += dt * pl->vel[j][i];
            // Apply periodic boundary conditions
            if (pl->pos[j][i] > 1.0) {
                pl->pos[j][i] -= 1.0;
            if (pl->pos[j][i] < 0.0) {
                pl->pos[j][i] += 1.0;
```



# Time your kernels

Input parameters

number of Particles = 4000000

dimensions = 2

dt = 0.1

ndumps = 1000

| <ul> <li>CPU (32)</li> </ul>    | to 100% |  |
|---------------------------------|---------|--|
| <ul> <li>Threads (4)</li> </ul> |         |  |
| ✓ [116674] picaos.x →           | to 100% |  |
| NVTX                            |         |  |
| Profiler overhead               |         |  |
| 3 threads hidden+               | to 100% |  |

| RUNS | SIMULATION TIME (SECONDS) |
|------|---------------------------|
| AOS  | 38.33                     |
| SOA  | 35.93                     |
|      |                           |

|                   |                        | •             |                |         |
|-------------------|------------------------|---------------|----------------|---------|
|                   | Time Step [477.385 ms] |               |                |         |
| setE [323.025 ms] |                        | accel [72.516 | ms] move [81.8 | 815 ms) |
|                   |                        |               |                |         |
|                   |                        |               |                |         |
|                   |                        |               |                |         |
|                   |                        |               |                |         |





# Time your kernels

| Runs              | Ν       | Kernel Configuration | Elapsed Time on Device |
|-------------------|---------|----------------------|------------------------|
| Pageable memory   | 4000000 | (156250, 256)        | 19.93                  |
| Pinned memory     | 4000000 | (156250, 256)        | 19.21                  |
| CudaMallocManaged | 4000000 | (156250, 256)        | 19.59                  |





## Nsight Compute

|                                                     | Report                                                            | Result                            |                    | Size                                                               | Time                                     | Cycles                                                      | GPU                                                          | SM Frequen                                                        |
|-----------------------------------------------------|-------------------------------------------------------------------|-----------------------------------|--------------------|--------------------------------------------------------------------|------------------------------------------|-------------------------------------------------------------|--------------------------------------------------------------|-------------------------------------------------------------------|
| Baseline 2                                          | report_202414_110809_n                                            | cu 897 - setEKernel               | - V-               | (156250, 1, 1)x(256, 1, 1)                                         | 1.45 ms                                  | 1,802,928                                                   | 0 - NVIDIA A100-SXM-64GB                                     | 1.25 Ghz                                                          |
| Baseline 1                                          | report_202413_111730_n                                            | cu 901 - setEKernel               |                    | (78125, 1, 1)x(256, 1, 1)                                          | 729.57 us                                | 906,411                                                     | 0 - NVIDIA A100-SXM-64GB                                     | 1.24 Ghz                                                          |
| Summary                                             | Details Source                                                    | Context                           | Comments           | Raw .                                                              | Session                                  |                                                             |                                                              |                                                                   |
| <ul> <li>GPU Speed Of Light</li> </ul>              | t Throughput                                                      |                                   |                    |                                                                    |                                          |                                                             |                                                              |                                                                   |
|                                                     | ne throughput for compute and<br>tributor. High-level overview of |                                   |                    |                                                                    |                                          |                                                             | ilization with respect to the th                             | eoretical maxi                                                    |
| Compute (SM) Through                                | 1put [%]                                                          |                                   |                    |                                                                    | 78.58                                    | 3 (+0.57%)                                                  | Duration [ms]                                                |                                                                   |
| Memory Throughput [%                                |                                                                   |                                   |                    |                                                                    |                                          | 9 (+2.87%)                                                  | Elapsed Cycles [cycle]                                       |                                                                   |
| L1/TEX Cache Through                                | -<br>iput [%]                                                     |                                   |                    |                                                                    | 11.92                                    | 2 (+0.48%)                                                  | SM Active Cycles [cycle]                                     |                                                                   |
| L2 Cache Throughput [                               | %]                                                                |                                   |                    |                                                                    | 42.0                                     | 5 (+0.51%)                                                  | SM Frequency [Ghz]                                           |                                                                   |
| DRAM Throughput [%]                                 |                                                                   |                                   |                    |                                                                    | 26.89                                    | 9 (+2.87%)                                                  | DRAM Frequency [Ghz]                                         |                                                                   |
| High Compute                                        | e mrougnput look-up table                                         | 5.                                |                    |                                                                    |                                          |                                                             | e compute pipelines are spen<br>932 peak performance and clo |                                                                   |
| _                                                   | e mrougnput look-up table                                         | 5.                                |                    |                                                                    |                                          | s device's fp                                               |                                                              |                                                                   |
| <ol> <li>Roofline Analy</li> </ol>                  | e mrougnput look-up table                                         | 5.                                |                    |                                                                    |                                          | s device's fp                                               | 932 peak performance and clo                                 |                                                                   |
| _                                                   | e mrougnput look-up table                                         | 5.                                |                    |                                                                    |                                          | s device's fp                                               | 932 peak performance and clo                                 |                                                                   |
| Roofline Analy     Compute (SM) [%]                 | e mrougnput look-up table                                         | 5.                                |                    |                                                                    |                                          | s device's fp                                               | 932 peak performance and clo                                 |                                                                   |
| <ol> <li>Roofline Analy</li> </ol>                  | e mrougnput look-up table                                         | 5.                                |                    |                                                                    |                                          | s device's fp                                               | 932 peak performance and clo                                 |                                                                   |
| Roofline Analy     Compute (SM) [%]                 | ysis The ratio of peak float                                      | 5.                                | formance on this d |                                                                    |                                          | s device's fp                                               | 932 peak performance and clo                                 |                                                                   |
| () Roofline Analy<br>Compute (SM) [%]<br>Memory [%] | ysis The ratio of peak float                                      | s.<br>(fp32) to double (fp64) per | formance on this d | evice is 2:1. The kernel achie                                     | eved 5% of thi                           | s device's fp<br>GPU T                                      | 32 peak performance and clo<br>hroughput                     | se to 1% of its                                                   |
| () Roofline Analy<br>Compute (SM) [%]<br>Memory [%] | ysis The ratio of peak float                                      | s.<br>(fp32) to double (fp64) per | formance on this d | evice is 2:1. The kernel achie                                     | eved 5% of thi                           | s device's fp<br>GPU T                                      | 50.0                                                         | se to 1% of its                                                   |
| () Roofline Analy<br>Compute (SM) [%]<br>Memory [%] | ysis The ratio of peak float                                      | s.<br>(fp32) to double (fp64) per | formance on this d | evice is 2:1. The kernel achie<br>30.0<br>Warning: The target syst | eved 5% of thi<br>40.0                   | s device's fp<br>GPU T<br>S<br>NVLink supp                  | 50.0<br>peed Of Light (SOL) [%]                              | se to 1% of its<br>60.0                                           |
| () Roofline Analy<br>Compute (SM) [%]<br>Memory [%] | ysis The ratio of peak float                                      | s.<br>(fp32) to double (fp64) per | formance on this d | evice is 2:1. The kernel achie<br>30.0<br>Warning: The target syst | eved 5% of thi<br>40.0<br>tem provides I | s device's fp<br>GPU T<br>S<br>NVLink supp<br>hight want to | 50.0<br>peed Of Light (SOL) [%]                              | se to 1% of its<br>60.0<br>any <u>NVLink se</u><br>and writing yo |
| () Roofline Analy<br>Compute (SM) [%]<br>Memory [%] | ysis The ratio of peak float                                      | s.<br>(fp32) to double (fp64) per | formance on this d | evice is 2:1. The kernel achie<br>30.0<br>Warning: The target syst | eved 5% of thi<br>40.0<br>tem provides I | s device's fp<br>GPU T<br>S<br>NVLink supp<br>hight want to | 50.0<br>peed Of Light (SOL) [%]                              | se to 1% of its<br>60.0<br>any <u>NVLink se</u><br>and writing yo |
| () Roofline Analy<br>Compute (SM) [%]<br>Memory [%] | ysis The ratio of peak float                                      | s.<br>(fp32) to double (fp64) per | formance on this d | evice is 2:1. The kernel achie<br>30.0<br>Warning: The target syst | eved 5% of thi<br>40.0<br>tem provides I | s device's fp<br>GPU T<br>S<br>NVLink supp<br>hight want to | 50.0<br>peed Of Light (SOL) [%]                              | se to 1% of its<br>60.0<br>any <u>NVLink se</u><br>and writing yo |
| () Roofline Analy<br>Compute (SM) [%]<br>Memory [%] | ysis The ratio of peak float                                      | s.<br>(fp32) to double (fp64) per | formance on this d | evice is 2:1. The kernel achie<br>30.0<br>Warning: The target syst | eved 5% of thi<br>40.0<br>tem provides I | s device's fp<br>GPU T<br>S<br>NVLink supp<br>hight want to | 50.0<br>peed Of Light (SOL) [%]                              | se to 1% of its<br>60.0<br>any <u>NVLink se</u><br>and writing yo |
| () Roofline Analy<br>Compute (SM) [%]<br>Memory [%] | ysis The ratio of peak float                                      | s.<br>(fp32) to double (fp64) per | formance on this d | evice is 2:1. The kernel achie<br>30.0<br>Warning: The target syst | eved 5% of thi<br>40.0<br>tem provides I | s device's fp<br>GPU T<br>S<br>NVLink supp<br>hight want to | 50.0<br>peed Of Light (SOL) [%]                              | se to 1% of its<br>60.0<br>any <u>NVLink se</u><br>and writing yo |
| () Roofline Analy<br>Compute (SM) [%]<br>Memory [%] | ysis The ratio of peak float                                      | s.<br>(fp32) to double (fp64) per | formance on this d | evice is 2:1. The kernel achie<br>30.0<br>Warning: The target syst | eved 5% of thi<br>40.0<br>tem provides I | s device's fp<br>GPU T<br>S<br>NVLink supp<br>hight want to | 50.0<br>peed Of Light (SOL) [%]                              | se to 1% of its<br>60.0<br>any <u>NVLink se</u><br>and writing yo |
| () Roofline Analy<br>Compute (SM) [%]<br>Memory [%] | ysis The ratio of peak float                                      | s.<br>(fp32) to double (fp64) per | formance on this d | evice is 2:1. The kernel achie<br>30.0<br>Warning: The target syst | eved 5% of thi<br>40.0<br>tem provides I | s device's fp<br>GPU T<br>S<br>NVLink supp<br>hight want to | 50.0<br>peed Of Light (SOL) [%]                              | se to 1% of its<br>60.0<br>any NVLink se<br>and writing w         |



1799449.76 (+99.06%) 1.25 (+0.23%) 1.59 (+0.26%) doing. Also, consider whether any computation is redundant and could be reduced or moved to  $\odot$ 

fp64 peak performance. See the 😔 Kernel Profiling Guid e for more details on roofline analysis.

| 70 | .0 80 | 0.0 90 | .0 100 | 0.0 |
|----|-------|--------|--------|-----|

### Peageable Memory

### Unified Memory



#### Optimising memory transfers: cudaMemPrefetchAsync 5



## What is cudaMemPrefetchAsync?



CUDA function that allows you to explicitly move data to a specific memory location before it is actually needed



### CINECA



## How to use cudaMemPrefetchAsync?

### Before Kernel Launch

1

Call cudaMemPrefetchAsync to prefetch data into the cache before the kernel that will use it runs

2 Syntax cudaMemPrefetchAsync(particles.pos, N \* DIM \* sizeof(float), device\_id);



## When use cudaMemPrefetchAsync?





# Time your kernels

| Runs              | Ν       | Kernel Configuration | Elapsed Time on Device |
|-------------------|---------|----------------------|------------------------|
| Pageable memory   | 4000000 | (156250, 256)        | 19.93                  |
| Pinned memory     | 4000000 | (156250, 256)        | 19.21                  |
| CudaMallocManaged | 4000000 | (156250, 256)        | 19.59                  |
| Prefetching       | 4000000 | (524288, 128)        | 19.33                  |





## Nsight Compute

|                        | Report          |                    | Result                 |                      | Size                                                       | Time                           | Cycles                        | GPU                                                             | SM Frequency            | Process              | Attributes                                   |                       |                     |                       |         |
|------------------------|-----------------|--------------------|------------------------|----------------------|------------------------------------------------------------|--------------------------------|-------------------------------|-----------------------------------------------------------------|-------------------------|----------------------|----------------------------------------------|-----------------------|---------------------|-----------------------|---------|
| Current                |                 | .4_111312_ncu      | 903 - setEKernel       | - 7-                 | (156250, 1, 1)x(256, 1, 1)                                 | 1.45 ms                        |                               | 0 - NVIDIA A100-SXM-64GB                                        |                         | [2800488] picaos.x   | Ö                                            |                       |                     |                       |         |
|                        |                 |                    |                        |                      |                                                            |                                |                               |                                                                 |                         |                      |                                              |                       |                     |                       |         |
| Baseline 1             | report_20241    | .3_111730_ncu      | 901 - setEKernel       |                      | (78125, 1, 1)x(256, 1, 1)                                  | 729.57 us                      | 906,411                       | 0 - NVIDIA A100-SXM-64GB                                        | 1.24 Ghz                | [118627] picaos.x    |                                              |                       |                     |                       |         |
| Baseline 2             | report_20241    | .4_110809_ncu      | 897 - setEKernel       |                      | (156250, 1, 1)x(256, 1, 1)                                 | 1.45 ms                        | 1,802,928                     | 0 - NVIDIA A100-SXM-64GB                                        | 1.25 Ghz                | [2799376] picaos.x   | ۲                                            |                       |                     |                       |         |
| Summary                | Details         | Source             | Context                | Comments             | Raw                                                        | Sessio                         | n                             |                                                                 |                         |                      |                                              | 🕻 Compare 🖵           | X Tools 🔵 🥝         | View 🖵 🕒 Export       | J I     |
|                        | Throughput      |                    |                        |                      |                                                            |                                |                               |                                                                 |                         |                      |                                              |                       | GPU Thre            | oughput Chart 🛛 👻     | Q       |
|                        |                 |                    |                        |                      | the throughput reports the a<br>ces of the GPU presented a |                                |                               | tilization with respect to the th                               | eoretical maximu        | um. Breakdowns show  | the throughpu                                | t for each individual | sub-metric of Corr  | pute and Memory to ck | arly    |
| Compute (SM) Through   | put [%]         |                    |                        |                      | 7                                                          | 78.60 (+0.30                   | 0%, z=+0.74)                  | Duration [ms]                                                   |                         |                      |                                              |                       |                     | 1.45 (+32.93%,        | =+0.71) |
| Memory Throughput [%]  |                 |                    |                        |                      | 2                                                          | 26.90 (+1.44                   | \$%, z=+0.72)                 | Elapsed Cycles (cycle)                                          |                         |                      |                                              |                       |                     | 1802718 (+33.07%,     | =+0.71) |
| L1/TEX Cache Through   | out [%]         |                    |                        |                      | 1                                                          | 1.92 (+0.27                    | 7%, z=+0.77)                  | SM Active Cycles [cycle]                                        |                         |                      |                                              |                       | 1                   | 798855.16 (+33.08%,   | =+0.71) |
| L2 Cache Throughput [% | -)              |                    |                        |                      | 4                                                          | 2.06 (+0.29                    | 9%, z=+0.79)                  | SM Frequency [Ghz]                                              |                         |                      |                                              |                       |                     | 1.25 (+0.14%,         | =+0.79) |
| DRAM Throughput [%]    |                 |                    |                        |                      | 2                                                          | 26.90 (+1.44                   | \$%, z=+0.72)                 | DRAM Frequency [Ghz]                                            |                         |                      |                                              |                       |                     | 1.59 (+0.16%,         | =+0.82) |
| Roofline Analy         | sis The ratio o | of peak float (fp3 | 2) to double (fp64) pe | rformance on this de | wice is 2:1. The kernel achie                              | wed 5% of th                   |                               | o32 peak performance and clo<br>hroughput                       | se to 1% of its fp      | 64 peak performance. | See the 🤔 Ker                                | nel Profiling Guide f | for more details on | roofline analysis.    |         |
| The second second      |                 |                    |                        |                      |                                                            |                                |                               |                                                                 |                         |                      |                                              |                       |                     | 1                     |         |
| Compute (SM) [%]       |                 |                    |                        |                      |                                                            |                                |                               |                                                                 |                         |                      |                                              |                       |                     |                       |         |
|                        |                 |                    |                        |                      |                                                            |                                |                               |                                                                 |                         |                      |                                              |                       |                     |                       |         |
| Memory [%]             |                 |                    |                        |                      |                                                            |                                |                               |                                                                 |                         | I                    |                                              |                       |                     |                       |         |
|                        |                 | 10.0               | 20                     | 0                    | 30.0                                                       | 40.0                           |                               | 50.0                                                            | 60.0                    | 70.0                 | <u>,                                    </u> | 80.0                  |                     | 90.0                  | 100.0   |
| 0.0                    |                 | 10.0               | 20                     |                      | 00.0                                                       | 40.0                           | S                             | peed Of Light (SOL) [%]                                         | 66.6                    |                      |                                              | 00.0                  |                     |                       | 100.0   |
|                        |                 |                    |                        |                      | Warning: The target syst                                   | em provides                    | NVLink supp                   | port, but you did not collected                                 | any <u>NVLink secti</u> | iona.                |                                              |                       |                     |                       |         |
|                        |                 |                    |                        | То                   | customize your report even<br>You                          | further, you r<br>1 might also | might want to<br>want to cons | o learn about custom sections<br>ider adding individual metrics | and <u>writing your</u> |                      |                                              |                       |                     |                       |         |
|                        |                 |                    |                        |                      |                                                            |                                |                               |                                                                 |                         |                      |                                              |                       |                     |                       |         |
|                        |                 |                    |                        |                      |                                                            |                                |                               |                                                                 |                         |                      |                                              |                       |                     |                       |         |
|                        |                 |                    |                        |                      |                                                            |                                |                               |                                                                 |                         |                      |                                              |                       |                     |                       |         |
|                        |                 |                    |                        |                      |                                                            |                                |                               |                                                                 |                         |                      |                                              |                       |                     |                       |         |
|                        |                 |                    |                        |                      |                                                            |                                |                               |                                                                 |                         |                      |                                              |                       |                     |                       |         |
|                        |                 |                    |                        |                      |                                                            |                                |                               |                                                                 |                         |                      |                                              |                       |                     |                       |         |



### Peageable Memory

### Unified Memory

Prefetch Memory



# Nsight system report

| 23s <del>-</del>                         |                  | +940ms | +960ms      | +980ms                    | 24s                          |
|------------------------------------------|------------------|--------|-------------|---------------------------|------------------------------|
| <ul> <li>CPU (32)</li> </ul>             | to 100%          |        |             |                           |                              |
| - CUDA HW (0000:1d:00.0 - NVID           | kernel<br>memory |        |             |                           |                              |
| <ul> <li>55.0% Context 1</li> </ul>      |                  |        |             |                           |                              |
| <ul> <li>100.0% Kernels</li> </ul>       |                  |        |             |                           |                              |
| 86.5% setEKernel                         |                  |        |             |                           |                              |
| 8.0% accelKernel                         |                  |        |             |                           |                              |
| 5.6% moveKernel                          |                  |        |             |                           |                              |
| NVTX                                     |                  |        |             |                           |                              |
| <ul> <li>45.0% Unified memory</li> </ul> |                  |        |             |                           |                              |
| <ul> <li>100.0% Memory</li> </ul>        |                  |        | الالفقة الق | a da bi sa sa sa sa sa sa | n a le ital à ant a          |
| 81.7% HtoD transfer                      |                  |        |             |                           | ilitaik iki kitakainaitai ai |
|                                          | 1                |        |             |                           |                              |



Look at this pattern



### 6 How can we overlap kernel and data transfer?



### What is a STREAM?



Sequence of CUDA operations

kernel execution, memory transfer that execute in issue-order on the GPU By default, CUDA kernels are executed in a default stream Instructions are excited in order (in any stream): an instruction must be completed before the next one can begin





### Non-default Stream behaviour

2

Rules of governing the behaviour of streams Kernels, with any single STREAM must execute in order



Time

- Multiple streams or Non-default streams can be created and utilise by CUDA programmers
- However, kernels in different, non-default streams, can interact concurrently, have no fixed order of execution



#### Understanding CUDA Non-Streams behaviour 7





### Where it can be useful?

Kernel Enqueuing

Kernels are enqueued into a specific stream for execution on the GPU.

Memory Transfer

Data transfers between host and device can be enqueued asynchronously into streams.

### Asynchronous Execution with Streams



2

### 3

#### **Overlapped Execution**

The GPU can execute kernels and memory transfers concurrently in different streams.





## When use cudaMemPrefetchAsync?



// Note that a pointer must be passed to `cudaCreateStream`.



## **CUDA Stream Synchronization**

### • Explicit

- cudaDeviceSynchronize()
  - Blocks until all CUDA operations are finished
- cudaStreamSynchronize(stream))
  - Blocks until all CUDA operations are finished within given stream
- cudaEvenRecord(event, stream1), cudaStreamWaitEvent(stream2, event)
  - Blocks until all CUDA operations are finished within given stream
- Implicit
  - Page-locked memory allocation
    - cudaMallocHost, cudaHostAlloc
  - Device memory allocation
    - cudaMalloc
  - Blocking version of memory operations
    - cudaMemcpy, cudaMemset
  - Implicit synchronize all CUDA operations

m !, event) m



# Nsight system report



Look at this pattern



# Time your kernels

| Runs              | Ν       | Kernel Configuration | Elapsed Time on Device |
|-------------------|---------|----------------------|------------------------|
| Pageable memory   | 4000000 | (156250, 256)        | 19.93                  |
| Pinned memory     | 4000000 | (156250, 256)        | 19.21                  |
| CudaMallocManaged | 4000000 | (156250, 256)        | 19.59                  |
| Prefetching       | 4000000 | (524288, 128)        | 19.33                  |
| Streams           | 4000000 | (524288, 128)        | 20.06                  |





## Nsight Compute

|                        | Barant                                                                                                                                                                                                | D                                                                                     | <b>0</b> !                     | Time Autor            | 001                                              | 011 5                   | <b>D</b>              |                      |                        |                                      |                 |
|------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------|--------------------------------|-----------------------|--------------------------------------------------|-------------------------|-----------------------|----------------------|------------------------|--------------------------------------|-----------------|
|                        | Report                                                                                                                                                                                                | Result                                                                                | Size                           | Time Cycles           | GPU                                              | SM Frequency            |                       | Attributes           |                        |                                      |                 |
| Current                | report_202414_111312_ncu                                                                                                                                                                              | 903 - setEKernel 🔻 🗸 👻                                                                | (156250, 1, 1)x(256, 1, 1)     | 1.45 ms 1,802,718     | 0 - NVIDIA A100-SXM-64GB                         | 1.25 Ghz                | [2800488] picaos.x    | 0                    |                        |                                      |                 |
| Baseline 1             | report_202414_110809_ncu                                                                                                                                                                              | 897 - setEKernel                                                                      | (156250, 1, 1)x(256, 1, 1)     | 1.45 ms 1,802,928     | 0 - NVIDIA A100-SXM-64GB                         | 1.25 Ghz                | [2799376] picaos.x    | ۲                    |                        |                                      |                 |
| Baseline 2             | report_202413_114546_ncu                                                                                                                                                                              | 903 - setEKernel                                                                      | (39063, 1, 1)x(1024, 1, 1)     | 2.13 ms 2,653,289     | 0 - NVIDIA A100-SXM-64GB                         | 1.25 Ghz                | [122720] picaos.x     |                      |                        |                                      |                 |
| Summary                | Details Source                                                                                                                                                                                        | Context Comments                                                                      | Raw                            | Session               |                                                  |                         |                       |                      | 💭 Compare 🖵            | 🎗 Tools 🕽 🔘 View 🕽 🕒                 | Export 🔵 📃 🖵    |
| ➡ GPU Speed Of Light   | Throughput                                                                                                                                                                                            |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        | GPU Throughput Chart                 | v Q             |
|                        |                                                                                                                                                                                                       | mory resources of the GPU. For each unit,<br>utilization for compute and memory resou |                                |                       | utilization with respect to the                  | theoretical maxin       | num. Breakdowns shov  | v the throughpu      | rt for each individual | sub-metric of Compute and Memo       | y to clearly    |
| Compute (SM) Throughp  | out [%]                                                                                                                                                                                               |                                                                                       | 78                             | .60 (+19.12%, z=+0.71 | ) Duration [ms]                                  |                         |                       |                      |                        | 1.45 (-1                             | 9.13%, z=-0.71) |
| Memory Throughput [%]  |                                                                                                                                                                                                       |                                                                                       |                                | .90 (+19.45%, z=+0.71 |                                                  |                         |                       |                      |                        |                                      | 9.09%, z=-0.71) |
| L1/TEX Cache Throughp  | ut [%]                                                                                                                                                                                                |                                                                                       | 11                             | .92 (+18.36%, z=+0.71 | ) SM Active Cycles [cycle]                       |                         |                       |                      |                        | 1798855.16 (-1                       | 7.89%, z=-0.71) |
| L2 Cache Throughput [% | ]                                                                                                                                                                                                     |                                                                                       | 42                             | .06 (+20.58%, z=+0.71 | ) SM Frequency [Ghz]                             |                         |                       |                      |                        | 1.25 (+0                             | 0.02%, z=+1.40) |
| DRAM Throughput [%]    |                                                                                                                                                                                                       |                                                                                       | 26                             | .90 (+19.45%, z=+0.71 | ) DRAM Frequency [Ghz]                           |                         |                       |                      |                        | 1.59 (+0                             | 0.05%, z=+1.15) |
| Roofline Analys        | ais The ratio of peak float (fp3                                                                                                                                                                      | 2) to double (fp64) performance on this d                                             | evice is 2:1. The kernel achie |                       | fp32 peak performance and c<br><b>Throughput</b> | lose to 1% of its f     | fp64 peak performance | . See the <u> Ke</u> | mel Profiling Guide f  | or more details on roofline analysis |                 |
|                        |                                                                                                                                                                                                       |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |
| Compute (SM) [%]       |                                                                                                                                                                                                       |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |
| Memory [%]             |                                                                                                                                                                                                       |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |
| Memory [36]            |                                                                                                                                                                                                       |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |
| 0.0                    | 10.0                                                                                                                                                                                                  | 20.0                                                                                  | 30.0                           | 40.0                  | 50.0                                             | 60.0                    | 70                    | .0                   | 80.0                   | 90.0                                 | 100.0           |
|                        |                                                                                                                                                                                                       |                                                                                       |                                |                       | Speed Of Light (SOL) [%]                         |                         |                       |                      |                        |                                      |                 |
|                        |                                                                                                                                                                                                       |                                                                                       | Warning: The target syst       | em provides NVLink su | pport, but you did not collecte                  | d any <u>NVLink sec</u> | tions.                |                      |                        |                                      |                 |
|                        | To customize your report even further, you might want to learn about <u>custom sections</u> and <u>writing your own rules</u> .<br>You might also want to consider <u>adding individual metrics</u> . |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |
|                        |                                                                                                                                                                                                       |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |
|                        |                                                                                                                                                                                                       |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |
|                        |                                                                                                                                                                                                       |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |
|                        |                                                                                                                                                                                                       |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |
|                        |                                                                                                                                                                                                       |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |
|                        |                                                                                                                                                                                                       |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |
|                        |                                                                                                                                                                                                       |                                                                                       |                                |                       |                                                  |                         |                       |                      |                        |                                      |                 |



### Peageable Memory

### Unified Memory

Stream



# Multiple streaming-GPU







## When use cudaMemPrefetchAsync?



```
// Non-coalesced access example
for (int gpu = 0; gpu < numGPUs; ++gpu) {</pre>
  cudaSetDevice(gpu); // Set the GPU
  int numBlocks = (particles[gpu].n + BLOCK_SIZE - 1) /
  BLOCK_SIZE;
  setEKernel<<<numBlocks, BLOCK_SIZE, 0,</pre>
  streams1[gpu]>>>(particles[gpu].d_pos, particles[gpu].d_E,
  particles[gpu].n, DIM); }
  cudaStreamSynchronize(streams2[gpu]); // Ensure all data is
  transferred
```



# Time your kernels

| Runs                 | Ν       | Kernel Configuration | Elapsed Time on Device |
|----------------------|---------|----------------------|------------------------|
| Peag-able memory     | 4000000 | (156250, 256)        | 19.93                  |
| Pinned memory        | 4000000 | (156250, 256)        | 19.21                  |
| CudaMallocManaged    | 4000000 | (156250, 256)        | 19.59                  |
| Prefetching          | 4000000 | (524288, 128)        | 19.33                  |
| Multiple Streams     | 4000000 | (524288, 256)        | 20.06                  |
| Multiple Streams-GPU | 4000000 | (39063, 256)         | 20.23                  |





# Nsight system report

| 4s -                             |                  | +44ms | +45ms | +46ms               | 4s 47.11ms | +48ms |
|----------------------------------|------------------|-------|-------|---------------------|------------|-------|
| <ul> <li>CPU (32)</li> </ul>     | to 100%          | ,     |       |                     |            |       |
| - CUDA HW (0000:1d:00.0 - NVIDI  | kernel<br>memory |       |       |                     |            |       |
| ▶ [All Streams]                  | ,                |       |       |                     |            |       |
| ▶ 67.7% Stream 17                |                  |       |       |                     |            |       |
| > 22.6% Default stream 7         |                  |       |       |                     |            |       |
| ▶ 9.7% Stream 16                 |                  |       |       |                     |            |       |
| - CUDA HW (0000:56:00.0 - NVIDI  | kernel<br>memory |       |       |                     |            |       |
| [All Streams]                    | ,                |       |       |                     |            |       |
| 67.3% Stream 29                  |                  |       |       |                     |            |       |
| 22.9% Default stream 19          |                  |       |       |                     |            |       |
| 9.8% Stream 28                   |                  |       |       |                     |            |       |
| - CUDA HW (0000:8f:00.0 - NVIDI. | kernel<br>memory |       |       |                     |            |       |
| [All Streams]                    |                  |       |       |                     |            |       |
| 64.4% Stream 41                  |                  |       |       |                     |            |       |
| 22.0% Default stream 31          |                  |       |       |                     |            |       |
| 13.6% Stream 40                  |                  |       |       |                     |            |       |
| - CUDA HW (0000:c8:00.0 - NVIDI  | kernel<br>memory |       |       |                     |            |       |
| [All Streams]                    |                  | }     | Me    | mcpy HtoD (Pageable | )          |       |
| 67.3% Stream 53                  |                  |       |       |                     |            |       |
| 22.9% Default stream 43          |                  | •     | Me    | mcpy HtoD (Pageable | )          |       |
| 9.8% Stream 52                   |                  |       |       |                     |            |       |



### Look at this pattern



## Nsight system report

|                                 | Report            |               | Result              |          |          | Size                                                     | Time      | Cycles                  | GPU                                                           | SM Freq     | μ  |
|---------------------------------|-------------------|---------------|---------------------|----------|----------|----------------------------------------------------------|-----------|-------------------------|---------------------------------------------------------------|-------------|----|
| Current                         | report_202414     | _141720_ncu   | 997 - setEKerne     | <b>•</b> | <b>\</b> | (39063, 1, 1)x(256, 1, 1)                                | 369.09 us | 457,553                 | 0 - NVIDIA A100-SXM-64GB                                      | 1.24 Gh     | z  |
| Baseline 1                      | report_202413     | _114546_ncu   | 903 - setEKernel    |          |          | (39063, 1, 1)x(1024, 1, 1)                               | 2.13 ms   | 2,653,289               | 0 - NVIDIA A100-SXM-64GB                                      | 1.25 Gh     | z  |
| Summary                         | Details           | Source        | Context             |          | Comments | Raw                                                      | Session   | 1                       |                                                               |             |    |
| GPU Speed Of Light              | Throughput        |               |                     |          |          |                                                          |           |                         |                                                               |             |    |
|                                 |                   |               |                     |          |          | the throughput reports the<br>ces of the GPU presented : |           |                         | ilization with respect to the th                              | eoretical n | na |
| ompute (SM) Through             |                   |               |                     |          |          |                                                          |           | (+45.04%)               | Duration [us]                                                 |             |    |
| emory Throughput [%]            |                   |               |                     |          |          |                                                          |           | (+39.44%)               | Elapsed Cycles [cycle]                                        |             |    |
| /TEX Cache Through              |                   |               |                     |          |          | _                                                        |           | (+43.60%)               | SM Active Cycles [cycle]                                      |             |    |
| Cache Throughput [%             |                   |               |                     |          |          |                                                          |           | (+46.13%)               | SM Frequency [Ghz]                                            |             |    |
| RAM Throughput [%]              | -1                |               |                     |          |          |                                                          |           | (+35.09%)               | DRAM Frequency [Ghz]                                          |             |    |
|                                 | i nrougnput lo    | ok-up tables. |                     |          |          |                                                          |           | is device's fp          | e compute pipelines are spend<br>032 peak performance and clo |             |    |
|                                 | i nrougnput lo    | ok-up tables. |                     |          |          |                                                          |           | is device's fp          |                                                               |             |    |
| Roofline Analy                  | i nrougnput lo    | ok-up tables. |                     |          |          |                                                          |           | is device's fp          | 032 peak performance and clo                                  |             |    |
| Roofline Analy Compute (SM) [%] | rsis The ratio of | ok-up tables. | 2) to double (fp64) |          |          |                                                          |           | is device's fp<br>GPU T | 032 peak performance and clo                                  |             |    |

#### Attributes cy Process [2812692] picaos.x 🛛 ወ 0 [122720] picaos.x 🛱 Compare 🛛 💥 Tools 🖉 🞯 View 🖉 🕒 Export 🖉 🚍 Ω GPU Throughput Chart mum. Breakdowns show the throughput for each individual sub-metric of Compute and Memory to clearly 369.09 (-82.68%) 457553 (-82.76%) 453809.80 (-82.42%) 1.24 (-0.48%) 1.59 (-0.40%)

 $\odot$ 

ne doing. Also, consider whether any computation is redundant and could be reduced or moved to

its fp64 peak performance. See the 🜐 Kernel Profiling Guide for more details on roofline analysis.

| 0 70 | .0 80 | .o 90 | j.o 100 | .0 |
|------|-------|-------|---------|----|

### Multiple stream single gpu

### Multiple stream-gpu



### Implementing higher dimensional grid in CUDA





## Multidimensional Blocks and Grids

Host program specifies "grid-block-threads" configuration for kernel at run time

- All threads spawned by a single kernel launch are collectively called a *grid*
- All threads in a grid share the same global memory space
- A grid is made up of many thread blocks
- Kernel needs to know run-time configuration
- Built-in-three-dimensional type for threads (uint3) and blocks (dim3)

- threadIdx.x, threadIdx.y, threadIdx.z

- blockIdx.x, blockIdx.y, blockIdx.z

- blockDim.x, blockDim.y, blockDim.z

#### Grid Dimension: 3x2 = 6 Blocks





## **Device Run-time Configuration**

| Туре  | Variable  | Description                  |
|-------|-----------|------------------------------|
| dim3  | gridDim   | Dimensions of grid           |
| uint3 | blockIdx  | Index of block within grid   |
| dim3  | blockDim  | Dimensions of block          |
| uint3 | ThreadIdx | Index of thread within block |

| Dimension | Variable     | ID                |
|-----------|--------------|-------------------|
| 1D        | (Dx)         | X                 |
| 2D        | (Dx, Dy)     | y + y*Dx          |
| 3D        | (Dx, Dy, Dz) | z + y*Dx + z*DxDy |



## CUDA compute grid

### CUDA compute grid supports 1–3 dimensions

gpu\_kernel<<<4,2>>>(...)

gpu\_kernel<<<dim3(8, 4, 1), dim3(4,2,1) >>>(...)

gpu\_kernel<<<dim3(16, 8, 4), dim3(8, 4, 2) >>>(...)

### Useful for when

Dealing with multidimensional data CUDA's dim3 type for both 2D and 3D grids and blocks CUDA variables: gridDim.x, gridDim.y, gridDim.z, gridBlock.z,...





## Two matrix multiplication

$$P_{ij} = \sum_{k=1}^{n} M_{ik} \cdot N_{kj}$$

$$\mathsf{P}_{10} = M_{00} * N_{10} + M_{10} * N_{11} + M_{20} * N_{12} + M_{30} * N_{13}$$

$$\mathsf{P}_{00} = M_{00} * N_{00} + M_{10} * N_{10} + M_{20} * N_{20} + M_{30} * N_{30}$$





### Two matrix multiplication

```
void matrixMultOnHost(float* M, float* N, float* P, int Width){
  for (int row = 0; row < Width; ++row) {</pre>
    for (int col = 0; col < Width; ++col){</pre>
    // accumulate element-wise products
    float pval = 0;
     for (int k = 0; k < Width; ++k) {
      float a = M[row*Width + k];
      float b = M[k*Width + col];
      pval += a*b;
      }
      P[row*width + col] = pval;
```







## CUDA compute grid supports 1–3 dimensions

### 2D

int i = blockIdx.x \* blockDim.x + threadIdx.x; int j = blockIdx.y \* blockDim.y + threadIdx.z;



3D



- CUDA "hides" loop headers into kernel launch parameters
- Ranges are distributed between threads and blocks of threads
- Blocks number is rounded up to handle the remainder



global linear memory index: idx = iy\*nx + ix



## Two matrix multiplication on GPU

```
// Kernel for matrix multiplication
__global__
void matrixMultiplicationKernel(float* M, float* N, float* Pd, int Width)
{
       int row = blockIdx.y * blockDim.y + threadIdx.y;
       int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float sum = 0;
        for (int k = 0; k < Width; ++k) {
            sum += M[row * Width + k] * N[k * Width + col];
        }
        Pd[row * Width + col] = sum;
```







## Two matrix multiplication on GPU

| Ν         | Methods | Time execution | Speedup |
|-----------|---------|----------------|---------|
|           |         |                |         |
| 2048×2048 | Serial  | 25.18          |         |
| 2040X2040 | CUDA    | 0.063          | 398.29  |







### Unrolling loops

9





## Unrolling loops

```
__global__ void unrolledMatrixMultiplicationKernel(float *A, float *B, float *C, int n, int m, int p) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // Row index of C
  int j = blockIdx.y * blockDim.y + threadIdx.y; // Column index of C
```

```
if (i < n && j < p) {
              float sum = 0; // Changed to float
              for (int k = 0; k < m - 3; k += 4) {
                               sum += A[i * m + k] * B[k * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[(k + 1) * p + j] + A[i * m + k + 1] * B[i * m + 
                                                                                A[i * m + k + 2] * B[(k + 2) * p + j] + A[i * m + k + 3] * B[(k + 3) * p + j];
              // Handle remaining elements
              for (int k = (m / 4) * 4; k < m; k++) {
                              sum += A[i * m + k] * B[k * p + j];
              C[i * p + j] = sum;
```







## Two matrix multiplication on GPU

| Ν         | Methods       | Time execution | Speedup |
|-----------|---------------|----------------|---------|
|           |               |                |         |
| 2048x2048 | Serial        | 25.18          | ]       |
|           | CUDA          | 0.063          | 398.29  |
|           | Unrolled loop | 0.055491       | 453.92  |







### What Bandwidth can a kernel achieve?

10





## Theoretical Bandwidth vs. Effective Bandwidth

### Theoretical Bandwidth

The absolute maximum bandwidth achievable with the hardware.

### Effective Bandwidth

The measured bandwidth that a kernel actually achieves

effective bandwidth (GB/s) =  $\frac{\text{(bytes read+bytes written)} \times 10^{-9}}{\text{time elapsed}}$ 

### Performance Gap

Effective bandwidth is often lower than theoretical bandwidth due to various factors.

### **Optimization Importance**

Bridging the gap between theoretical and effective bandwidth is a key optimization goal.





### Matrix transpose problem

| 0 | 1 | 2  | 3  |
|---|---|----|----|
| 4 | 5 | 6  | 7  |
| 8 | 9 | 10 | 11 |

for (int iy = 0; iy < ny; ++iy) { for (int ix = 0; ix < nx; ++ix) { out[ix\*ny+iy] = in[iy\*nx+ix];

#### data layout of original matrix

void transposeHost(float \*out, float \*in, const int nx, const int ny) {

| 0 | 4 | 8  |
|---|---|----|
| 1 | 5 | 9  |
| 2 | 6 | 10 |
| 3 | 7 | 11 |

### transposed

#### data layout of transposed matrix

| 0 | 4 | 8 | 1 | 5 | 9 | 2 | 6 | 10 | 3 | 7 | 11 |
|---|---|---|---|---|---|---|---|----|---|---|----|
|---|---|---|---|---|---|---|---|----|---|---|----|



## CUDD Matrix transpose

### \_\_global\_\_

void tranposeRow(float \*out, float \*in, const int nx, const int ny) { unsigned int ix = blockDim.x \* blockIdx.x + threadIdx.x; unsigned int iy = blockDim.y \* blockIdx.y + threadIdx.y;

```
if (ix < nx && iy < ny) { out[iy*nx + ix] = in[iy*nx + ix]; }
```

### \_\_global\_\_

void tranposeCol(float \*out, float \*in, const int nx, const int ny) { unsigned int ix = blockDim.x \* blockIdx.x + threadIdx.x; unsigned int iy = blockDim.y \* blockIdx.y + threadIdx.y;

```
if (ix < nx && iy < ny) { out[ix*ny + iy] = in[ix*ny + iy]; }
```







## Effective Bandwidth of Kernels

| BLOCKSIZE | KERNEL                         | BANDWIDTH [GB/s] | RATIO TO PEAK BANDWITDH (%) |  |
|-----------|--------------------------------|------------------|-----------------------------|--|
|           | Theoretical peak bandwidth     | 900.0            |                             |  |
| 16 X16    | copyRow: Load/store using rows | 626.60           | 69.62                       |  |
|           | copyCol: Load/store using cols | 275.42           | 30.60                       |  |
| 32X32     | copyRow: Load/store using rows | 376.32           | 41.81                       |  |
|           | copyCol: Load/store using cols | 170.14           | 18.90                       |  |





## Naive Transpose: Reading Rows versus Reading Columns

```
__global__
void tranposeNRow(float *out, float *in, const int nx, const int ny) {
  unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;
  unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
   if (ix < nx && iy < ny) { out[ix * ny + iy] = in[iy * nx + ix]; }
```

| BLOCKSIZE | KERNEL                         | BANDWIDTH [GB/s] | RATIO TO PEAK BANDWITDH (%) |
|-----------|--------------------------------|------------------|-----------------------------|
|           | Theoretical peak bandwidth     | 900.0            |                             |
| 16 X16    | copyRow: Load/store using rows | 273.09           | 30.34                       |
|           | copyCol: Load/store using rows | 296.09           | 32.90                       |

| global                                                                 |
|------------------------------------------------------------------------|
| void tranposeNCol(float *out, float *in, const int nx, const int ny) { |
| unsigned int ix = blockDim.x * blockIdx.x + threadIdx.x;               |
| unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;               |
| if (ix < nx && iy < ny) {    out[iy*nx + ix] = in[ix*ny + iy];    }    |
| }                                                                      |
|                                                                        |



### Unrolling Transpose: Reading Rows versus Reading Columns

```
__global__ void transposeUnroll4Row(float *out, float *in, const int nx,
const int ny) {
  unsigned int ix = blockDim.x * blockIdx.x*4 + threadIdx.x;
  unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
  unsigned int ti = iy^*nx + ix; unsigned int to = ix^*ny + iy;
  // access in columns
  if (ix+3*blockDim.x < nx && iy < ny) {
     out[to] = in[ti];
     out[to + ny*blockDim.x] = in[ti+blockDim.x];
     out[to + ny*2*blockDim.x] = in[ti+2*blockDim.x];
     out[to + ny*3*blockDim.x] = in[ti+3*blockDim.x];
```

```
__global___ void transposeUnroll4Col(float *out, float *in, const int nx,
const int ny) {
    unsigned int ix = blockDim.x * blockIdx.x*4 + threadIdx.x;
    unsigned int iy = blockDim.y * blockIdx.y + threadIdx.y;
    unsigned int ti = iy*nx + ix; unsigned int to = ix*ny + iy;
    // access in columns
    if (ix+3*blockDim.x < nx && iy < ny) {
        out[ti] = in[to];
        out[ti + blockDim.x] = in[to+ blockDim.x*ny];
        out[ti + blockDim.x] = in[to+ 2*blockDim.x*ny];
        out[ti + 3*blockDim.x] = in[to+ 3*blockDim.x*ny];
    }
}
```



## Effective Bandwidth of Kernels

| BLOCKSIZE | KERNEL                          | BANDWIDTH [GB/s] | RATIO TO PEAK BANDWITDH (%) |  |
|-----------|---------------------------------|------------------|-----------------------------|--|
|           | Theoretical peak bandwidth      | 900.0            |                             |  |
| 16 X16    | NaiveRow: Load/store using rows | 317.29           | 35.25                       |  |
|           | NaiveCol: Load/store using rows | 742.74           | 82.53                       |  |
| 32X32     | NaiveRow: Load/store using rows | 160.73           | 17.86                       |  |
|           | NaiveCol: Load/store using rows | 492.21           | 54.69                       |  |





## Take away message

GPU is throughput Horsepower

Offer fast memory access and significant computing power Importance of compute intensity and memory access patterns

#### Minimize the available data

Wasting bandwidth can severely impact performance Use structured arrays and maintain proper data order

#### **Optimizing Performance**

2

3

4

About 75% of issues in code adaptation stem from memory access problems Techniques for improving occupancy and latency hiding

#### Advanced Techniques

Efficient use of shared memory Utilizing CUDA streams for concurrent execution





### Extra-Slide





# **GPU Memory Hierarchy**

Global Memory

Large, off-chip memory with high latency and lower bandwidth compared to shared memory.

Shared Memory

Small, on-chip memory shared by all threads within a thread block, offering low latency and high bandwidth.

### **Register Memory**

Private memory for each individual thread, with the fastest access but limited capacity.



## Shared Memory Basics

Low Latency

Shared memory has much lower access latency compared to global memory, allowing for faster data processing.

Limited Capacity

Shared memory is limited in size, typically ranging from 16KB to 96KB per Streaming Multiprocessor (SM).

### High Bandwidth

Shared memory offers significantly higher bandwidth, enabling more efficient data transfer between threads.

### Thread Block Scope

Shared memory is shared among all threads within a thread block, allowing for efficient inter-thread communication.



## The \_\_\_shared\_\_ Qualifier

Declaration

The \_\_\_\_\_shared\_\_\_ qualifier is used to declare shared memory variables in CUDA kernels

Thread Sync

Threads in a thread block can synchronize using the \_\_\_\_\_syncthreads() intrinsic Synchronization enables safe data exchange between threads within a block.

### Scope

Shared memory variables are only accessible to threads within the same thread block





## Shared memory matrix multiplication kernel

```
__global__ void sharedMemoryMatrixMultiplicationKernel(float* M, float* N, float* P, int Width) {
   ___shared__ float sharedM[BLOCK_SIZE][BLOCK_SIZE]; __shared__ float sharedN[BLOCK_SIZE][BLOCK_SIZE];
    int row = blockIdx.y * blockDim.y + threadIdx.y; int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int m = 0; m < (Width + BLOCK_SIZE - 1) / BLOCK_SIZE; ++m) {</pre>
       // Load elements into shared memory
       if (m * BLOCK_SIZE + threadIdx.x < Width && row < Width) {</pre>
            sharedM[threadIdx.y][threadIdx.x] = M[row * Width + m * BLOCK_SIZE + threadIdx.x];
        } else {
            sharedM[threadIdx.y][threadIdx.x] = 0.0f; // Fill with zero if out of bounds
        }
        if (m * BLOCK_SIZE + threadIdx.y < Width && col < Width) {</pre>
            sharedN[threadIdx.y][threadIdx.x] = N[(m * BLOCK_SIZE + threadIdx.y) * Width + col];
       } else {
            sharedN[threadIdx.y][threadIdx.x] = 0.0f; // Fill with zero if out of bounds
        }
        ____syncthreads(); // Synchronize to make sure all threads have loaded their data
       // Perform the multiplication
        for (int k = 0; k < BLOCK_SIZE; ++k) {</pre>
            sum += sharedM[threadIdx.y][k] * sharedN[k][threadIdx.x];
        __syncthreads(); // Synchronize before loading the next tile
    }
    // Write the result to global memory
    if (row < Width && col < Width) {</pre>
        P[row * Width + col] = sum;
}
```







## Two matrix multiplication on GPU

| Ν         | Methods       | Time execution | Speedup |
|-----------|---------------|----------------|---------|
|           |               |                |         |
| 2048x2048 | Serial        | 25.18          | ]       |
|           | CUDA          | 0.063          | 398.29  |
|           | Shared memory | 0.055491       | 453.92  |





