On the hardware side, there is the following hierarchy (fine to coarse):

- work items (threads)
- wavefronts
- work groups
- compute units (CU)
All OpenMP and OpenACC parallelism levels are used, i.e.:

- OpenMP's simd and OpenACC's vector map to work items (threads)
- OpenMP's threads ("parallel") and OpenACC's workers map to wavefronts
- OpenMP's teams and OpenACC's gangs map to work groups
The sizes used are:

- The number of teams is the value of num_teams (OpenMP) or num_gangs
  (OpenACC) if specified, or otherwise the number of CUs; it is limited
  to twice the number of CUs.
- The default number of wavefronts per work group is hardware
  dependent; num_threads (OpenMP) and num_workers (OpenACC) override it
  if smaller.
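For example, a minimal sketch (the clause values 8 and 4 are
illustrative only, not tuned recommendations):

    #include <omp.h>
    #include <stdio.h>

    int
    main (void)
    {
      int teams = 0, threads = 0;
      /* num_teams requests work groups (capped at twice the CU count);
         num_threads requests wavefronts per work group.  */
      #pragma omp target teams map(from: teams, threads) num_teams(8)
      #pragma omp parallel num_threads(4)
      if (omp_get_team_num () == 0 && omp_get_thread_num () == 0)
        {
          teams = omp_get_num_teams ();
          threads = omp_get_num_threads ();
        }
      printf ("launched %d teams x %d threads\n", teams, threads);
      return 0;
    }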
Implementation remarks:
I/O within OpenMP target regions and OpenACC compute regions is
supported using the C library printf functions and the Fortran
print/write statements.
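For instance:

    #include <stdio.h>

    int
    main (void)
    {
      #pragma omp target
      printf ("Hello from the device\n");  /* printed from GPU code */
      return 0;
    }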
Reverse offload regions (i.e. target regions with
device(ancestor:1)) are processed serially per target region, such
that the next reverse offload region is only executed after the
previous one has returned.
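A minimal sketch of this pattern (note that reverse offload also
requires a requires reverse_offload declaration):

    #include <stdio.h>

    #pragma omp requires reverse_offload

    int
    main (void)
    {
      #pragma omp target
      {
        int x = 42;
        /* Runs on the host; the device resumes only after it returns.  */
        #pragma omp target device(ancestor: 1) map(to: x)
        printf ("host-side output, x = %d\n", x);
      }
      return 0;
    }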
Code using the requires directive with self_maps or
unified_shared_memory is only supported if all AMD GPUs present have
the HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT property. Some
systems require the "xnack" feature to be enabled for this to be true;
in that case the runtime attempts to set the HSA_XNACK environment
variable to ‘1’ automatically (user-set values are not overridden, and
the setting only affects the executable itself and any child
processes). If any AMD GPU device is not supported, all AMD GPUs are
removed from the list of available devices (“host fallback”).
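As a sketch, with unified shared memory ordinary host allocations can
be used directly in target regions:

    #include <stdlib.h>

    #pragma omp requires unified_shared_memory

    int
    main (void)
    {
      int *a = malloc (1024 * sizeof (int));
      #pragma omp target teams distribute parallel for
      for (int i = 0; i < 1024; i++)
        a[i] = i;  /* plain host pointer, no map clause needed */
      free (a);
      return 0;
    }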
The available stack size can be changed using the GCN_STACK_SIZE
environment variable; the default is 32 kiB per thread.
Low-latency memory (omp_low_lat_mem_space) is supported when the
access trait is set to cgroup. The default pool size is automatically
scaled to share the 64 kiB LDS memory between the number of teams
configured to run on each compute unit, but it may be adjusted at
runtime by setting the environment variable
GOMP_GCN_LOWLAT_POOL=bytes.
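A sketch of requesting this pool from device code via an allocator
whose access trait is set to cgroup (the helper name and sizes are
hypothetical):

    #include <omp.h>

    void
    use_low_latency_pool (void)  /* hypothetical helper */
    {
      #pragma omp target
      {
        omp_alloctrait_t traits[]
          = { { omp_atk_access, omp_atv_cgroup } };
        omp_allocator_handle_t al
          = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
        int *p = (int *) omp_alloc (64 * sizeof (int), al);
        /* ... scratch space shared within the contention group ... */
        omp_free (p, al);
        omp_destroy_allocator (al);
      }
    }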
omp_low_lat_mem_alloc cannot be used with true low-latency memory
because the definition implies the omp_atv_all trait; main
graphics memory is used instead.
omp_cgroup_mem_alloc, omp_pteam_mem_alloc, and
omp_thread_mem_alloc all use low-latency memory as first
preference, and fall back to main graphics memory when the low-latency
pool is exhausted.
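For example, a sketch using one of these predefined allocators on the
device (the helper name and size are hypothetical):

    #include <omp.h>

    void
    team_scratch_example (void)  /* hypothetical helper */
    {
      #pragma omp target
      #pragma omp teams
      {
        /* Low-latency pool first; main graphics memory if exhausted.  */
        int *scratch
          = (int *) omp_alloc (256 * sizeof (int), omp_cgroup_mem_alloc);
        /* ... */
        omp_free (scratch, omp_cgroup_mem_alloc);
      }
    }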
Memory allocated via omp_alloc with the
ompx_gnu_pinned_mem_alloc allocator or the pinned trait is
obtained via the CUDA API when an NVPTX device is present. This provides
a performance boost for NVPTX offload code and also allows unlimited use
of pinned memory regardless of the OS ulimit/rlimit settings.
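A sketch of allocating pinned host memory this way (the 1 MiB size is
arbitrary):

    #include <omp.h>

    int
    main (void)
    {
      double *buf
        = (double *) omp_alloc (1 << 20, ompx_gnu_pinned_mem_alloc);
      /* ... transfer-heavy host/device traffic benefits from pinning ... */
      omp_free (buf, ompx_gnu_pinned_mem_alloc);
      return 0;
    }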
Memory allocated via the ompx_gnu_managed_mem_alloc allocator or in the
ompx_gnu_managed_mem_space (both GNU extensions) is equivalent
to HIP Managed Memory, although it is not actually allocated using
hipMallocManaged. This memory is accessible by both the
host and the device at the same address, so it need not be mapped with
map clauses. Instead, use the is_device_ptr clause or
has_device_addr clause to indicate that the pointer is already
accessible on the device. The ROCm runtime will automatically handle
data migration between host and device as needed. Not all AMD GPU
devices support this feature, and many that do require that
-mxnack=on is configured at compile time. If managed memory is
not supported by the default device, as configured at the moment the
allocator is called, then the allocator will use the fall-back setting.
If the default device is configured differently when the memory is freed,
via omp_free or omp_realloc, the result may be undefined.
If the current device does not support Unified Shared Memory (or it is
not enabled with HSA_XNACK=1), then Managed Memory might still
work, but allocations may only be visible to a single device (whichever
was the default device when the first allocation was made).
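A sketch of the intended usage, with the same pointer valid on host
and device:

    #include <omp.h>

    int
    main (void)
    {
      int *data
        = (int *) omp_alloc (100 * sizeof (int), ompx_gnu_managed_mem_alloc);
      for (int i = 0; i < 100; i++)
        data[i] = i;                  /* touched on the host */
      #pragma omp target is_device_ptr(data)
      for (int i = 0; i < 100; i++)
        data[i] *= 2;                 /* same pointer on the device */
      omp_free (data, ompx_gnu_managed_mem_alloc);
      return 0;
    }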
omp_target_memcpy_rect, omp_target_memcpy_rect_async, and the
target update directive for non-contiguous list items use the 3D
memory-copy function of the HSA library. Higher dimensions call this
function in a loop and are therefore supported.
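For illustration, a sketch copying a 4x4 sub-block of an 8x8 host
matrix to a device buffer of the same shape:

    #include <omp.h>
    #include <stddef.h>

    int
    main (void)
    {
      enum { N = 8 };
      int host[N][N] = { { 0 } };
      int dev = omp_get_default_device ();
      int *dmem = (int *) omp_target_alloc (sizeof (host), dev);

      size_t volume[2]  = { 4, 4 };     /* block extents */
      size_t offsets[2] = { 2, 2 };     /* block origin at (2,2) */
      size_t dims[2]    = { N, N };     /* full matrix extents */

      omp_target_memcpy_rect (dmem, host, sizeof (int), 2, volume,
                              offsets, offsets, dims, dims,
                              dev, omp_get_initial_device ());
      omp_target_free (dmem, dev);
      return 0;
    }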
The device's unique identifier (UID) is the value returned by the HSA
runtime library for HSA_AMD_AGENT_INFO_UUID.
For GPUs, it is currently ‘GPU-’ followed by 16 lower-case hex digits,
yielding a string like GPU-f914a2142fc3413a. The output matches
the one used by rocminfo.
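Assuming a libgomp that provides the OpenMP UID routines (such as
omp_get_uid_from_device, an OpenMP 6.0 addition), a sketch printing
each device's UID:

    #include <omp.h>
    #include <stdio.h>

    int
    main (void)
    {
      /* omp_get_uid_from_device is assumed available here.  */
      for (int d = 0; d < omp_get_num_devices (); d++)
        printf ("device %d: %s\n", d, omp_get_uid_from_device (d));
      return 0;
    }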