On the hardware side, there is the hierarchy (fine to coarse): thread, warp, thread block, and streaming multiprocessor.
All OpenMP and OpenACC levels are used, i.e. OpenMP's simd and OpenACC's vector map to threads, OpenMP's threads ("parallel") and OpenACC's workers map to warps, and OpenMP's teams and OpenACC's gangs map to thread blocks.
The used sizes are: warp_size is always 32, and the CUDA kernel is launched with
dim={#teams,1,1}, blocks={#threads,warp_size,1}.
Additional information can be obtained by setting the environment variable
GOMP_DEBUG=1 (very verbose; grep for "kernel.*launch" for the launch
parameters).
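For illustration, the launch parameters of a simple offloaded loop can be
observed by running a program such as the following sketch with GOMP_DEBUG=1
set in the environment; the num_teams and thread_limit values are examples
only and influence the #teams and #threads of the kernel launch.

    #include <stdio.h>

    int main (void)
    {
      int sum = 0;

      /* Offloaded reduction; num_teams and thread_limit influence the
         #teams and #threads used for the CUDA kernel launch.  */
      #pragma omp target teams distribute parallel for num_teams(4) thread_limit(64) reduction(+:sum)
      for (int i = 0; i < 1024; i++)
        sum += i;

      printf ("sum = %d\n", sum);
      return 0;
    }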
GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA;
CUDA caches the JIT result in the user's directory (see the CUDA
documentation; this can be tuned by the environment variables
CUDA_CACHE_{DISABLE,MAXSIZE,PATH}).
Note: While the PTX ISA is generic, the -mptx= and -march= command-line
options still affect the generated PTX ISA code and, thus, the requirements on
CUDA version and hardware.
Implementation remarks:
I/O within OpenMP target regions and OpenACC compute regions is supported
using the C library printf functions. Additionally, the Fortran print/write
statements are supported within OpenMP target regions, but not yet within
OpenACC compute regions.
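For example, a target region can print diagnostics directly from the device
(illustrative sketch):

    #include <stdio.h>

    int main (void)
    {
      #pragma omp target
      {
        /* The C library printf is usable inside target regions.  */
        printf ("hello from the device\n");
      }
      return 0;
    }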
Compiling OpenMP code that contains requires reverse_offload requires at
least -march=sm_35; compiling for -march=sm_30 is not supported.
For code containing reverse offload (i.e. target regions with
device(ancestor:1)), there is a slight performance penalty
for all target regions, consisting mostly of shutdown delay.
Per device, reverse offload regions are processed serially such that
the next reverse offload region is only executed after the previous
one returned.
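A reverse-offload region looks as in the following sketch; the names and
values are illustrative only:

    #include <stdio.h>

    #pragma omp requires reverse_offload

    int main (void)
    {
      #pragma omp target
      {
        int x = 42;
        /* Reverse offload: this nested region runs back on the host.  */
        #pragma omp target device(ancestor: 1) firstprivate(x)
        printf ("printed on the host: x = %d\n", x);
      }
      return 0;
    }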
OpenMP code that has a requires directive with self_maps or
unified_shared_memory runs on nvptx devices if and only if
all of those devices support the pageableMemoryAccess property;
otherwise, all nvptx devices are removed from the list of available
devices ("host fallback").
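A sketch of such code, which only runs on an nvptx device when pageable
memory access is supported and otherwise falls back to the host:

    #include <stdio.h>
    #include <stdlib.h>

    #pragma omp requires unified_shared_memory

    int main (void)
    {
      int n = 1000;
      double *a = malloc (n * sizeof *a);  /* ordinary, pageable host memory */

      /* With unified shared memory no map clauses are required; the device
         accesses the host allocation directly.  */
      #pragma omp target teams distribute parallel for
      for (int i = 0; i < n; i++)
        a[i] = 2.0 * i;

      printf ("a[10] = %g\n", a[10]);
      free (a);
      return 0;
    }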
For the stack handling of the generated code, see the -msoft-stack option
in the GCC manual.
Low-latency memory (omp_low_lat_mem_space) is supported when the
access trait is set to cgroup, and libgomp has
been built for PTX ISA version 4.1 or higher (such as in GCC's
default configuration). The default pool size
is 8 kiB per team, but may be adjusted at runtime by setting the environment
variable GOMP_NVPTX_LOWLAT_POOL=bytes. The maximum value is
limited by the available hardware, and care should be taken that the
selected pool size does not unduly limit the number of teams that can
run simultaneously.
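A sketch of requesting memory from this space with the access trait set to
cgroup, using an allocator constructed inside the target region (sizes and
names are illustrative):

    #include <omp.h>

    int main (void)
    {
      #pragma omp target teams num_teams(1)
      {
        /* Traits select the contention-group access required for the
           low-latency pool and a fallback to default memory.  */
        omp_alloctrait_t traits[]
          = { { omp_atk_access,   omp_atv_cgroup },
              { omp_atk_fallback, omp_atv_default_mem_fb } };
        omp_allocator_handle_t lowlat
          = omp_init_allocator (omp_low_lat_mem_space, 2, traits);

        int *buf = (int *) omp_alloc (32 * sizeof (int), lowlat);
        if (buf)
          {
            buf[0] = 123;
            omp_free (buf, lowlat);
          }
        omp_destroy_allocator (lowlat);
      }
      return 0;
    }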
omp_low_lat_mem_alloc cannot be used with true low-latency memory
because the definition implies the omp_atv_all trait; main
graphics memory is used instead.

The predefined allocators omp_cgroup_mem_alloc, omp_pteam_mem_alloc, and
omp_thread_mem_alloc all use low-latency memory as first
preference, and fall back to main graphics memory when the low-latency
pool is exhausted.
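For example, per-thread scratch space can be requested with the predefined
allocator (illustrative sketch):

    #include <omp.h>

    int main (void)
    {
      #pragma omp target
      #pragma omp parallel num_threads(8)
      {
        /* Prefers the low-latency pool; falls back to main graphics
           memory when the pool is exhausted.  */
        int *scratch = (int *) omp_alloc (64 * sizeof (int),
                                          omp_thread_mem_alloc);
        if (scratch)
          {
            scratch[0] = omp_get_thread_num ();
            omp_free (scratch, omp_thread_mem_alloc);
          }
      }
      return 0;
    }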
Allocations made via the ompx_gnu_managed_mem_alloc allocator or in the
ompx_gnu_managed_mem_space (both GNU extensions) allocate memory
in the CUDA Managed Memory space using cuMemAllocManaged. This
memory is accessible by both the host and the device at the same address,
so it need not be mapped with map clauses. Instead, use the
is_device_ptr clause or has_device_addr clause to indicate
that the pointer is already accessible on the device. The CUDA runtime
will automatically handle data migration between host and device as
needed. If managed memory is not supported by the default device, as
configured at the moment the allocator is called, then the allocator will
use the fall-back setting. If the default device is configured
differently when the memory is freed, via omp_free or
omp_realloc, the result may be undefined.
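A sketch of using the managed-memory allocator together with is_device_ptr
(values are illustrative):

    #include <omp.h>
    #include <stdio.h>

    int main (void)
    {
      int n = 1000;
      /* Managed memory: the same address is valid on host and device.  */
      double *a = (double *) omp_alloc (n * sizeof (double),
                                        ompx_gnu_managed_mem_alloc);

      for (int i = 0; i < n; i++)
        a[i] = i;

      /* No map clause; only declare that the pointer is device accessible.  */
      #pragma omp target is_device_ptr(a)
      for (int i = 0; i < n; i++)
        a[i] *= 2.0;

      printf ("a[3] = %g\n", a[3]);
      omp_free (a, ompx_gnu_managed_mem_alloc);
      return 0;
    }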
The omp_target_memcpy_rect and omp_target_memcpy_rect_async routines and the
target update directive for non-contiguous list items use the 2D and 3D
memory-copy functions of the CUDA library. Higher dimensions call those
functions in a loop and are therefore supported.
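For instance, copying a 4x4 sub-rectangle of an 8x8 host array to a device
buffer maps onto a single 2D copy (sizes and offsets are illustrative):

    #include <omp.h>

    int main (void)
    {
      enum { ROWS = 8, COLS = 8 };
      double host_buf[ROWS][COLS] = { { 0 } };
      int dev = omp_get_default_device ();
      int host_dev = omp_get_initial_device ();

      double *dev_buf = (double *) omp_target_alloc (sizeof host_buf, dev);

      size_t volume[2]  = { 4, 4 };   /* rows x columns to copy */
      size_t offsets[2] = { 2, 2 };   /* start at row 2, column 2 */
      size_t dims[2]    = { ROWS, COLS };

      omp_target_memcpy_rect (dev_buf, &host_buf[0][0], sizeof (double), 2,
                              volume, offsets, offsets, dims, dims,
                              dev, host_dev);

      omp_target_free (dev_buf, dev);
      return 0;
    }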
The unique identifier (UID) of an nvptx device consists of the GPU- prefix
followed by the device's UUID, e.g. GPU-a8081c9e-f03e-18eb-1827-bf5ba95afa5d.
The output matches the format used by nvidia-smi.
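Assuming a libgomp that provides the OpenMP 6.0 routine
omp_get_uid_from_device, the UIDs can be listed as in this sketch:

    #include <omp.h>
    #include <stdio.h>

    int main (void)
    {
      /* omp_get_uid_from_device is an OpenMP 6.0 addition; availability
         depends on the libgomp version.  */
      for (int dev = 0; dev < omp_get_num_devices (); dev++)
        printf ("device %d: %s\n", dev, omp_get_uid_from_device (dev));
      return 0;
    }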