On the hardware side, there is the following hierarchy (fine to coarse):

- work items (threads)
- wavefronts
- work groups
- compute units (CU)
All OpenMP and OpenACC parallelism levels are used, i.e.:

- OpenMP's simd and OpenACC's vector map to work items (threads)
- OpenMP's threads ("parallel") and OpenACC's workers map to wavefronts
- OpenMP's teams and OpenACC's gangs map to work groups
The sizes used are:

- The number of teams is the value of num_teams (OpenMP) or num_gangs
  (OpenACC) if specified, or otherwise the number of CUs; it is limited
  to twice the number of CUs.
- The default number of wavefronts per work group is hardware
  dependent; num_threads (OpenMP) and num_workers (OpenACC) override it
  if smaller.
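For example, a minimal sketch (the clause values 8 and 4 are
illustrative only, not tuned recommendations):

    #include <omp.h>
    #include <stdio.h>

    int
    main (void)
    {
      int teams = 0, threads = 0;
      /* num_teams requests work groups (capped at twice the CU count);
         num_threads requests wavefronts per work group.  */
      #pragma omp target teams map(from: teams, threads) num_teams(8)
      #pragma omp parallel num_threads(4)
      if (omp_get_team_num () == 0 && omp_get_thread_num () == 0)
        {
          teams = omp_get_num_teams ();
          threads = omp_get_num_threads ();
        }
      printf ("launched %d teams x %d threads\n", teams, threads);
      return 0;
    }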
Implementation remarks:
I/O within OpenMP target regions and OpenACC compute regions is
supported using the C library printf functions and the Fortran
print/write statements.
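For instance:

    #include <stdio.h>

    int
    main (void)
    {
      #pragma omp target
      printf ("Hello from the device\n");  /* printed from GPU code */
      return 0;
    }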
Reverse offload regions (i.e. target regions with
device(ancestor:1)) are processed serially per target region, such
that the next reverse offload region is only executed after the
previous one has returned.
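A minimal sketch of this pattern (note that reverse offload also
requires a requires reverse_offload declaration):

    #include <stdio.h>

    #pragma omp requires reverse_offload

    int
    main (void)
    {
      #pragma omp target
      {
        int x = 42;
        /* Runs on the host; the device resumes only after it returns.  */
        #pragma omp target device(ancestor: 1) map(to: x)
        printf ("host-side output, x = %d\n", x);
      }
      return 0;
    }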
Code using the requires directive with self_maps or
unified_shared_memory is only supported if all AMD GPUs present have
the HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT property. Some
systems require the "xnack" feature to be enabled for this to be true;
in that case the runtime attempts to set the HSA_XNACK environment
variable to ‘1’ automatically (user-set values are not overridden, and
the setting only affects the executable itself and any child
processes). If any AMD GPU device is not supported, all AMD GPUs are
removed from the list of available devices (“host fallback”).
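As a sketch, with unified shared memory ordinary host allocations can
be used directly in target regions:

    #include <stdlib.h>

    #pragma omp requires unified_shared_memory

    int
    main (void)
    {
      int *a = malloc (1024 * sizeof (int));
      #pragma omp target teams distribute parallel for
      for (int i = 0; i < 1024; i++)
        a[i] = i;  /* plain host pointer, no map clause needed */
      free (a);
      return 0;
    }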
The available stack size can be changed using the GCN_STACK_SIZE
environment variable; the default is 32 kiB per thread.
Low-latency memory (omp_low_lat_mem_space) is supported when the
access trait is set to cgroup. The default pool size is automatically
scaled to share the 64 kiB LDS memory between the number of teams
configured to run on each compute unit, but it may be adjusted at
runtime by setting the environment variable
GOMP_GCN_LOWLAT_POOL=bytes.
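A sketch of requesting this pool from device code via an allocator
whose access trait is set to cgroup (the helper name and sizes are
hypothetical):

    #include <omp.h>

    void
    use_low_latency_pool (void)  /* hypothetical helper */
    {
      #pragma omp target
      {
        omp_alloctrait_t traits[]
          = { { omp_atk_access, omp_atv_cgroup } };
        omp_allocator_handle_t al
          = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
        int *p = (int *) omp_alloc (64 * sizeof (int), al);
        /* ... scratch space shared within the contention group ... */
        omp_free (p, al);
        omp_destroy_allocator (al);
      }
    }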
omp_low_lat_mem_alloc cannot be used with true low-latency memory
because the definition implies the omp_atv_all trait; main
graphics memory is used instead.
omp_cgroup_mem_alloc, omp_pteam_mem_alloc, and
omp_thread_mem_alloc all use low-latency memory as first
preference, and fall back to main graphics memory when the low-latency
pool is exhausted.
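For example, a sketch using one of these predefined allocators on the
device (the helper name and size are hypothetical):

    #include <omp.h>

    void
    team_scratch_example (void)  /* hypothetical helper */
    {
      #pragma omp target
      #pragma omp teams
      {
        /* Low-latency pool first; main graphics memory if exhausted.  */
        int *scratch
          = (int *) omp_alloc (256 * sizeof (int), omp_cgroup_mem_alloc);
        /* ... */
        omp_free (scratch, omp_cgroup_mem_alloc);
      }
    }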
Memory allocated via omp_alloc with the
ompx_gnu_pinned_mem_alloc allocator or the pinned trait is
obtained via the CUDA API when an NVPTX device is present. This provides
a performance boost for NVPTX offload code and also allows unlimited use
of pinned memory regardless of the OS ulimit/rlimit settings.
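A sketch of allocating pinned host memory this way (the 1 MiB size is
arbitrary):

    #include <omp.h>

    int
    main (void)
    {
      double *buf
        = (double *) omp_alloc (1 << 20, ompx_gnu_pinned_mem_alloc);
      /* ... transfer-heavy host/device traffic benefits from pinning ... */
      omp_free (buf, ompx_gnu_pinned_mem_alloc);
      return 0;
    }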
Memory allocated via the ompx_gnu_managed_mem_alloc allocator or in the
ompx_gnu_managed_mem_space (both GNU extensions) is equivalent
to HIP Managed Memory, although it is not actually allocated using
hipMallocManaged. This memory is accessible by both the
host and the device at the same address, so it need not be mapped with
map clauses. Instead, use the is_device_ptr clause or
has_device_addr clause to indicate that the pointer is already
accessible on the device. The ROCm runtime will automatically handle
data migration between host and device as needed. Not all AMD GPU
devices support this feature, and many that do require that
-mxnack=on is configured at compile time. If managed memory is
not supported by the default device, as configured at the moment the
allocator is called, then the allocator will use the fall-back setting.
If the default device is configured differently when the memory is freed,
via omp_free or omp_realloc, the result may be undefined.
If the current device does not support Unified Shared Memory (or it is
not enabled with HSA_XNACK=1), then Managed Memory might still
work, but allocations may only be visible to a single device (whichever
was the default device when the first allocation was made).
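A sketch of the intended usage, with the same pointer valid on host
and device:

    #include <omp.h>

    int
    main (void)
    {
      int *data
        = (int *) omp_alloc (100 * sizeof (int), ompx_gnu_managed_mem_alloc);
      for (int i = 0; i < 100; i++)
        data[i] = i;                  /* touched on the host */
      #pragma omp target is_device_ptr(data)
      for (int i = 0; i < 100; i++)
        data[i] *= 2;                 /* same pointer on the device */
      omp_free (data, ompx_gnu_managed_mem_alloc);
      return 0;
    }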
omp_target_memcpy_rect, omp_target_memcpy_rect_async, and the
target update directive for non-contiguous list items use the 3D
memory-copy function of the HSA library. Higher dimensions call this
function in a loop and are therefore supported.
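For illustration, a sketch copying a 4x4 sub-block of an 8x8 host
matrix to a device buffer of the same shape:

    #include <omp.h>
    #include <stddef.h>

    int
    main (void)
    {
      enum { N = 8 };
      int host[N][N] = { { 0 } };
      int dev = omp_get_default_device ();
      int *dmem = (int *) omp_target_alloc (sizeof (host), dev);

      size_t volume[2]  = { 4, 4 };     /* block extents */
      size_t offsets[2] = { 2, 2 };     /* block origin at (2,2) */
      size_t dims[2]    = { N, N };     /* full matrix extents */

      omp_target_memcpy_rect (dmem, host, sizeof (int), 2, volume,
                              offsets, offsets, dims, dims,
                              dev, omp_get_initial_device ());
      omp_target_free (dmem, dev);
      return 0;
    }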
The device's unique identifier (UID) is the value returned by the HSA
runtime library for HSA_AMD_AGENT_INFO_UUID.
For GPUs, it is currently ‘GPU-’ followed by 16 lower-case hex digits,
yielding a string like GPU-f914a2142fc3413a. The output matches
the one used by rocminfo.
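Assuming a libgomp that provides the OpenMP UID routines (such as
omp_get_uid_from_device, an OpenMP 6.0 addition), a sketch printing
each device's UID:

    #include <omp.h>
    #include <stdio.h>

    int
    main (void)
    {
      /* omp_get_uid_from_device is assumed available here.  */
      for (int d = 0; d < omp_get_num_devices (); d++)
        printf ("device %d: %s\n", d, omp_get_uid_from_device (d));
      return 0;
    }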