On the hardware side, there is a hierarchy (fine to coarse) of work items (threads), wavefronts, workgroups, and compute units (CUs). All OpenMP and OpenACC parallelism levels are mapped onto this hierarchy.
The used sizes are (see the example after this list):
- The number of teams is num_teams (OpenMP) or num_gangs (OpenACC) if specified, otherwise the number of CUs; it is limited to at most two times the number of CUs.
- The default number of wavefronts per team depends on the GPU model; num_threads (OpenMP) and num_workers (OpenACC) override it if smaller.
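As an illustration, the following minimal sketch (not taken from the implementation) caps the launch sizes with num_teams and num_threads; the values 8 and 4 are arbitrary, and without these clauses the defaults described above apply:

  #include <stdio.h>
  #include <omp.h>

  int main (void)
  {
    int n_teams = 0, n_threads = 0;

    /* num_teams and num_threads cap what the runtime would otherwise
       choose from the CU count and the default wavefront count.  */
    #pragma omp target teams num_teams(8) map(tofrom: n_teams, n_threads)
    {
      if (omp_get_team_num () == 0)
        {
          n_teams = omp_get_num_teams ();
          #pragma omp parallel num_threads(4)
          #pragma omp single
          n_threads = omp_get_num_threads ();
        }
    }
    printf ("ran with %d teams and %d threads per team\n", n_teams, n_threads);
    return 0;
  }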
Implementation remarks:
I/O within OpenMP target regions and OpenACC compute regions is supported using the C library printf functions and the Fortran print/write statements.
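For example, the following minimal sketch prints directly from the device:

  #include <stdio.h>

  int main (void)
  {
    /* printf is supported inside offloaded regions.  */
    #pragma omp target
    printf ("hello from the offload device\n");
    return 0;
  }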
Reverse offload regions (i.e. target regions with device(ancestor: 1)) are processed serially per target region, such that the next reverse offload region is only executed after the previous one has returned.
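A minimal sketch of a reverse-offload region (illustrative only; it assumes the usual requires reverse_offload declaration):

  #include <stdio.h>

  #pragma omp requires reverse_offload

  int main (void)
  {
    #pragma omp target
    for (int i = 0; i < 4; i++)
      {
        /* Runs back on the host; successive reverse-offload regions are
           executed strictly one after another.  */
        #pragma omp target device (ancestor : 1) map(to: i)
        printf ("host handles step %d\n", i);
      }
    return 0;
  }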
OpenMP code using a requires directive with self_maps or unified_shared_memory is only supported if all AMD GPUs have the HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT property. For discrete GPUs, this may require setting the HSA_XNACK environment variable to ‘1’; for systems with both an APU and a discrete GPU that does not support XNACK, consider using ROCR_VISIBLE_DEVICES to enable only the APU. If this is not supported, all AMD GPU devices are removed from the list of available devices (“host fallback”).
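A minimal sketch of code relying on this requirement (assuming a system where unified shared memory is actually available; on a discrete GPU, HSA_XNACK=1 may additionally be needed as described above):

  #include <stdio.h>
  #include <stdlib.h>

  #pragma omp requires unified_shared_memory

  int main (void)
  {
    int n = 1000;
    int *data = malloc (n * sizeof (int));

    /* With unified shared memory, the host allocation is accessed
       directly on the device; no map clauses are needed.  */
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
      data[i] = 2 * i;

    printf ("data[%d] = %d\n", n - 1, data[n - 1]);
    free (data);
    return 0;
  }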
The stack size available to offloaded code can be changed using the GCN_STACK_SIZE environment variable; the default is 32 kiB per thread.
Low-latency memory (omp_low_lat_mem_space) is supported when the access trait is set to cgroup. The default pool size is automatically scaled to share the 64 kiB LDS memory between the number of teams configured to run on each compute unit, but it may be adjusted at runtime by setting the environment variable GOMP_GCN_LOWLAT_POOL=bytes.
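For illustration, a minimal sketch of a device-side allocator using this space (a sketch only; the size and the fallback trait are arbitrary choices):

  #include <omp.h>

  int main (void)
  {
    #pragma omp target
    {
      /* Request the low-latency space with cgroup access, falling back
         to normal device memory if the pool cannot satisfy a request.  */
      omp_alloctrait_t traits[] = {
        { omp_atk_access,   omp_atv_cgroup },
        { omp_atk_fallback, omp_atv_default_mem_fb }
      };
      omp_allocator_handle_t lds
        = omp_init_allocator (omp_low_lat_mem_space, 2, traits);

      int *buf = omp_alloc (64 * sizeof (int), lds);
      if (buf)
        buf[0] = 42;
      omp_free (buf, lds);
      omp_destroy_allocator (lds);
    }
    return 0;
  }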
omp_low_lat_mem_alloc cannot be used with true low-latency memory
because the definition implies the omp_atv_all trait; main
graphics memory is used instead.
omp_cgroup_mem_alloc, omp_pteam_mem_alloc, and omp_thread_mem_alloc all use low-latency memory as their first preference and fall back to main graphics memory when the low-latency pool is exhausted.
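For example, a minimal sketch using one of these predefined allocators on the device:

  #include <omp.h>

  int main (void)
  {
    #pragma omp target
    {
      /* Prefers the low-latency pool; falls back to main graphics memory
         once the pool is exhausted.  */
      int *scratch = omp_alloc (128 * sizeof (int), omp_cgroup_mem_alloc);
      if (scratch)
        scratch[0] = 1;
      omp_free (scratch, omp_cgroup_mem_alloc);
    }
    return 0;
  }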
Memory allocated via omp_alloc with the ompx_gnu_pinned_mem_alloc allocator or the pinned trait is obtained via the CUDA API when an NVPTX device is present. This provides a performance boost for NVPTX offload code and also allows unlimited use of pinned memory regardless of the OS ulimit/rlimit settings.
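A minimal sketch of allocating pinned host memory this way (the buffer size is arbitrary):

  #include <omp.h>
  #include <string.h>

  int main (void)
  {
    size_t n = 1 << 20;

    /* Page-locked host buffer; transfers to and from the device can be
       faster than from pageable memory.  */
    char *buf = omp_alloc (n, ompx_gnu_pinned_mem_alloc);
    if (buf)
      {
        memset (buf, 0, n);
        #pragma omp target map(tofrom: buf[0:n])
        buf[0] = 1;
        omp_free (buf, ompx_gnu_pinned_mem_alloc);
      }
    return 0;
  }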
The routines omp_target_memcpy_rect and omp_target_memcpy_rect_async and the target update directive for non-contiguous list items use the 3D memory-copy function of the HSA library. Higher dimensions call this function in a loop and are therefore supported.
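A minimal sketch of a 2D strided copy with omp_target_memcpy_rect (array shapes and offsets are arbitrary):

  #include <stdio.h>
  #include <omp.h>

  int main (void)
  {
    enum { ROWS = 8, COLS = 8 };
    double host[ROWS][COLS];
    for (int i = 0; i < ROWS; i++)
      for (int j = 0; j < COLS; j++)
        host[i][j] = i * COLS + j;

    int dev = omp_get_default_device ();
    int host_dev = omp_get_initial_device ();
    double *dptr = omp_target_alloc (ROWS * COLS * sizeof (double), dev);

    /* Copy the 4x4 block starting at (2,2) of the host array into the
       top-left corner of the device buffer.  */
    size_t volume[2]      = { 4, 4 };
    size_t dst_offsets[2] = { 0, 0 };
    size_t src_offsets[2] = { 2, 2 };
    size_t dst_dims[2]    = { ROWS, COLS };
    size_t src_dims[2]    = { ROWS, COLS };

    int ret = omp_target_memcpy_rect (dptr, &host[0][0], sizeof (double), 2,
                                      volume, dst_offsets, src_offsets,
                                      dst_dims, src_dims, dev, host_dev);
    printf ("omp_target_memcpy_rect returned %d\n", ret);
    omp_target_free (dptr, dev);
    return 0;
  }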
The device's unique identifier (UID) is the value of HSA_AMD_AGENT_INFO_UUID. For GPUs, it is currently ‘GPU-’ followed by 16 lower-case hex digits, yielding a string like GPU-f914a2142fc3413a; the output matches the one used by rocminfo.
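For example, assuming the OpenMP 6.0 omp_get_uid_from_device routine is available, the UID of each device can be printed with a sketch like:

  #include <stdio.h>
  #include <omp.h>

  int main (void)
  {
    for (int dev = 0; dev < omp_get_num_devices (); dev++)
      printf ("device %d: UID %s\n", dev, omp_get_uid_from_device (dev));
    return 0;
  }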