13.8 Memory-mapped I/O

On modern operating systems, it is possible to mmap (pronounced “em-map”) a file to a region of memory. When this is done, the file can be accessed just like an array in the program.

This is more efficient than read or write, as only the regions of the file that a program actually accesses are loaded. Accesses to not-yet-loaded parts of the mmapped region are handled in the same way as swapped out pages.

Since mmapped pages can be stored back to their file when physical memory is low, it is possible to mmap files orders of magnitude larger than both the physical memory and swap space. The only limit is address space. The theoretical limit is 4GB on a 32-bit machine; however, the actual limit will be smaller since some areas will be reserved for other purposes. If the LFS interface is used, the file size on 32-bit systems is not limited to 2GB (offsets are signed, which reduces the addressable area of 4GB by half); the full 64 bits are available.

Memory mapping only works on entire pages of memory. Thus, addresses for mapping must be page-aligned, and length values will be rounded up. To determine the default page size of the machine, use:

size_t page_size = (size_t) sysconf (_SC_PAGESIZE);

On some systems, mappings can use larger page sizes for certain files, and applications can request larger page sizes for anonymous mappings as well (see the MAP_HUGETLB flag below).
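The alignment rules above can be sketched as two small helpers. This is only an illustration: the names page_align_down and page_round_up are hypothetical and not part of any library, and the code assumes the default page size reported by sysconf applies to the mapping.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Round a file offset down to the nearest page boundary, so that it
   can be used as the offset argument of mmap.  Hypothetical helper.  */
off_t
page_align_down (off_t offset)
{
  off_t page_size = (off_t) sysconf (_SC_PAGESIZE);
  return offset - (offset % page_size);
}

/* Round a length up to a whole number of pages, which is what the
   system does internally with the length argument.  Hypothetical
   helper.  */
size_t
page_round_up (size_t length)
{
  size_t page_size = (size_t) sysconf (_SC_PAGESIZE);
  return (length + page_size - 1) / page_size * page_size;
}
```

A caller that wants to map bytes starting at an arbitrary file position would map from page_align_down (position) and then skip the remainder inside the mapping.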

The following functions are declared in sys/mman.h:

Function: void * mmap (void *address, size_t length, int protect, int flags, int filedes, off_t offset)

Preliminary: | MT-Safe | AS-Safe | AC-Safe | See POSIX Safety Concepts.

The mmap function creates a new mapping, connected to bytes (offset) to (offset + length - 1) in the file open on filedes. A new reference for the file specified by filedes is created, which is not removed by closing the file.

address gives a preferred starting address for the mapping. NULL expresses no preference. Any previous mapping at that address is automatically removed. The address you give may still be changed, unless you use the MAP_FIXED flag.

protect contains flags that control what kind of access is permitted. They include PROT_READ, PROT_WRITE, and PROT_EXEC. The special flag PROT_NONE reserves a region of address space for future use. The mprotect function can be used to change the protection flags. See Memory Protection.

The flags parameter contains flags that control the nature of the map. One of MAP_SHARED, MAP_SHARED_VALIDATE, or MAP_PRIVATE must be specified. Additional flags may be bitwise OR’d to further define the mapping.

Note that, aside from MAP_PRIVATE and MAP_SHARED, not all flags are supported on all versions of all operating systems. Consult the kernel-specific documentation for details. The flags include:

MAP_PRIVATE

This specifies that writes to the region should never be written back to the attached file. Instead, a copy is made for the process, and the region will be swapped normally if memory runs low. No other process will see the changes.

Since private mappings effectively revert to ordinary memory when written to, you must have enough virtual memory for a copy of the entire mmapped region if you use this mode with PROT_WRITE.

MAP_SHARED

This specifies that writes to the region will be written back to the file. Changes made will be shared immediately with other processes mmapping the same file.

Note that actual writing may take place at any time. You need to use msync, described below, if it is important that other processes using conventional I/O get a consistent view of the file.

MAP_SHARED_VALIDATE

Similar to MAP_SHARED except that additional flags will be validated by the kernel, and the call will fail if an unrecognized flag is provided. With MAP_SHARED, using a flag on a kernel that doesn’t support it causes the flag to be ignored. MAP_SHARED_VALIDATE should be used when the behavior of all flags is required.

MAP_FIXED

This forces the system to use the exact mapping address specified in address and fail if it can’t. Note that if the new mapping would overlap an existing mapping, the overlapping portion of the existing map is unmapped.

MAP_ANONYMOUS
MAP_ANON

This flag tells the system to create an anonymous mapping, not connected to a file. filedes and offset are ignored, and the region is initialized with zeros.

Anonymous maps are used as the basic primitive to extend the heap on some systems. They are also useful to share data between multiple tasks without creating a file.

On some systems using private anonymous mmaps is more efficient than using malloc for large blocks. This is not an issue with the GNU C Library, as the included malloc automatically uses mmap where appropriate.

MAP_HUGETLB

This requests that the system uses an alternative page size which is larger than the default page size for the mapping. For some workloads, increasing the page size for large mappings improves performance because the system needs to handle far fewer pages. For other workloads which require frequent transfer of pages between storage or different nodes, the decreased page granularity may cause performance problems due to the increased page size and larger transfers.

In order to create the mapping, the system needs physically contiguous memory of the size of the increased page size. As a result, MAP_HUGETLB mappings are affected by memory fragmentation, and their creation can fail even if plenty of memory is available in the system.

Not all file systems support mappings with an increased page size.

The MAP_HUGETLB flag is specific to Linux.

MAP_32BIT

Require addresses that can be accessed with a signed 32 bit pointer, i.e., within the first 2 GiB. Ignored if MAP_FIXED is specified.

MAP_DENYWRITE
MAP_EXECUTABLE
MAP_FILE

Provided for compatibility. Ignored by the Linux kernel.

MAP_FIXED_NOREPLACE

Similar to MAP_FIXED except the call will fail with EEXIST if the new mapping would overwrite an existing mapping. To test for support for this flag, specify MAP_FIXED_NOREPLACE without MAP_FIXED, and (if the call was successful) check the actual address returned. If it does not match the address passed, then this flag is not supported.

MAP_GROWSDOWN

This flag is used to make stacks, and is typically only needed inside the program loader to set up the main stack for the running process. The mapping is created according to the other flags, except an additional page just prior to the mapping is marked as a “guard page”. If a write is attempted inside this guard page, that page is mapped, the mapping is extended, and a new guard page is created. Thus, the mapping continues to grow towards lower addresses until it encounters some other mapping.

Note that accessing memory beyond the guard page will not trigger this feature. In gcc, use -fstack-clash-protection to ensure the guard page is always touched.

MAP_LOCKED

A hint that requests that mapped pages are locked in memory (i.e. not paged out). Note that this is a request and not a requirement; use mlock if locking is required.

MAP_POPULATE
MAP_NONBLOCK

MAP_POPULATE is a hint that requests that the kernel read ahead on a file-backed mapping, causing pages to be mapped before they’re needed. MAP_NONBLOCK is a hint that requests that the kernel not attempt such read-ahead except for pages that are already in memory. Note that neither of these hints affects future paging activity; use mlock if that needs to be controlled.

MAP_NORESERVE

Asks the kernel to not reserve physical backing (i.e. space in a swap device) for a mapping. This is useful, for example, for a very large but sparsely used mapping, which need not be limited in total length by available RAM but has very few mapped pages. Note that writes to such a mapping may cause a SIGSEGV if the system is unable to map a page due to lack of resources.

On Linux, this flag’s behavior may be overridden by /proc/sys/vm/overcommit_memory as documented in the proc(5) man page.

MAP_STACK

Ensures that the resulting mapping is suitable for use as a program stack. For example, the use of huge pages might be precluded.

MAP_SYNC

This is a special flag for DAX devices, which tells the kernel to write dirty metadata out whenever dirty data is written out. Unlike most other flags, this one will fail unless MAP_SHARED_VALIDATE is also given.

MAP_DROPPABLE

Request that the pages never be written out to swap. Instead, they are zeroed under memory pressure (so the kernel can simply drop them). The mapping is inherited by fork, is not counted against the mlock budget, and if there is not enough memory to service a page fault, there is no fatal error (so no signal is sent).

The MAP_DROPPABLE flag is specific to Linux.
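As an illustration of how the flags combine, the following minimal sketch obtains zero-filled memory from a private anonymous mapping. The helper names anon_alloc and anon_free are hypothetical, and defining _DEFAULT_SOURCE is an assumption about how MAP_ANONYMOUS is exposed when compiling in a strict standard mode.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Allocate LENGTH bytes of zero-filled memory using a private
   anonymous mapping.  Returns NULL on failure.  Hypothetical
   helper.  */
void *
anon_alloc (size_t length)
{
  /* filedes and offset are ignored for anonymous mappings; -1 and 0
     are the conventional values.  */
  void *p = mmap (NULL, length, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? NULL : p;
}

/* Release a region obtained from anon_alloc.  Returns 0 on success
   and -1 on error.  */
int
anon_free (void *p, size_t length)
{
  return munmap (p, length);
}
```

Note that the result must be compared against MAP_FAILED rather than NULL; the sketch converts it to NULL only for the convenience of its callers.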

mmap returns the address of the new mapping, or MAP_FAILED for an error.

Possible errors include:

EACCES

filedes was not open for the type of access specified in protect.

EAGAIN

The system has temporarily run out of resources.

EBADF

filedes is invalid, and a valid file descriptor is required (i.e. MAP_ANONYMOUS was not specified).

EEXIST

MAP_FIXED_NOREPLACE was specified and an existing mapping was found overlapping the requested address range.

EINVAL

Either address was unusable (because it is not a multiple of the applicable page size), or inconsistent flags were given.

If MAP_HUGETLB was specified, the file or system does not support large page sizes.

ENODEV

This file is of a type that doesn’t support mapping, the process has exceeded its data space limit, or the map request would exceed the process’s virtual address space.

ENOMEM

There is not enough memory for the operation, the process is out of address space, or there are too many mappings. On Linux, the maximum number of mappings can be controlled via /proc/sys/vm/max_map_count or, if your OS supports it, via the vm.max_map_count sysctl setting.

ENOEXEC

The file is on a filesystem that doesn’t support mapping.

EPERM

PROT_EXEC was requested but the file is on a filesystem that was mounted with execution denied, a file seal prevented the mapping, or the caller set MAP_HUGETLB but does not have the required privileges.

EOVERFLOW

Either the offset into the file plus the length of the mapping causes internal page counts to overflow, or the offset requested exceeds the length of the file.
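A typical file-backed use of mmap might look like the following sketch, which maps a whole file read-only and scans it like an array. The function name count_byte_in_file is hypothetical; the sketch assumes the file fits in the address space and treats every failure the same way.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count the occurrences of byte C in the file PATH by mapping it
   read-only.  Returns -1 on error.  Hypothetical helper.  */
long
count_byte_in_file (const char *path, unsigned char c)
{
  int fd = open (path, O_RDONLY);
  if (fd < 0)
    return -1;

  struct stat st;
  if (fstat (fd, &st) < 0)
    {
      close (fd);
      return -1;
    }
  if (st.st_size == 0)
    {
      /* A zero-length mapping would be rejected, so handle the
         empty file without calling mmap.  */
      close (fd);
      return 0;
    }

  size_t length = (size_t) st.st_size;
  const unsigned char *data = mmap (NULL, length, PROT_READ,
                                    MAP_PRIVATE, fd, 0);
  /* The mapping keeps its own reference to the file, so the
     descriptor can be closed right away.  */
  close (fd);
  if (data == MAP_FAILED)
    return -1;

  long count = 0;
  for (size_t i = 0; i < length; i++)
    if (data[i] == c)
      count++;

  munmap ((void *) data, length);
  return count;
}
```

For example, count_byte_in_file (path, '\n') counts the lines of a text file without an explicit read loop.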

Function: void * mmap64 (void *address, size_t length, int protect, int flags, int filedes, off64_t offset)

Preliminary: | MT-Safe | AS-Safe | AC-Safe | See POSIX Safety Concepts.

The mmap64 function is equivalent to the mmap function but the offset parameter is of type off64_t. On 32-bit systems this allows the file associated with the filedes descriptor to be larger than 2GB. filedes must be a descriptor returned from a call to open64 or fopen64 and freopen64 where the descriptor is retrieved with fileno.

When the sources are translated with _FILE_OFFSET_BITS == 64 this function is actually available under the name mmap. I.e., the new, extended API using 64 bit file sizes and offsets transparently replaces the old API.

Function: int munmap (void *addr, size_t length)

Preliminary: | MT-Safe | AS-Safe | AC-Safe | See POSIX Safety Concepts.

munmap removes any memory maps from (addr) to (addr + length). length should be the length of the mapping.

It is safe to unmap multiple mappings in one command, or include unmapped space in the range. It is also possible to unmap only part of an existing mapping. However, only entire pages can be removed. If length is not a whole number of pages, it will be rounded up.

It returns 0 for success and -1 for an error.

One error is possible:

EINVAL

The memory range given was outside the user mmap range or wasn’t page aligned.
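Partial unmapping, as described above, can be sketched as follows. The function name map_then_trim is hypothetical; the example maps three pages of anonymous memory and then releases the last two, leaving only the first page mapped.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map three pages, then unmap the last two.  Returns the address of
   the single page that remains mapped, or NULL on failure.
   Hypothetical helper.  */
void *
map_then_trim (void)
{
  size_t page = (size_t) sysconf (_SC_PAGESIZE);
  char *p = mmap (NULL, 3 * page, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return NULL;

  p[0] = 'x';

  /* The start of the range must be page-aligned; p + page is,
     because p itself is page-aligned.  */
  if (munmap (p + page, 2 * page) != 0)
    {
      munmap (p, 3 * page);
      return NULL;
    }
  return p;
}
```

The caller remains responsible for unmapping the remaining page with munmap once it is no longer needed.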

Function: int msync (void *address, size_t length, int flags)

Preliminary: | MT-Safe | AS-Safe | AC-Safe | See POSIX Safety Concepts.

When using shared mappings, the kernel can write the file at any time before the mapping is removed. To be certain data has actually been written to the file and will be accessible to non-memory-mapped I/O, it is necessary to use this function.

It operates on the region address to (address + length). It may be used on part of a mapping or multiple mappings, however the region given should not contain any unmapped space.

flags can contain some options:

MS_SYNC

This flag makes sure the data is actually written to disk. Normally msync only makes sure that accesses to a file with conventional I/O reflect the recent changes.

MS_ASYNC

This tells msync to begin the synchronization, but not to wait for it to complete.

msync returns 0 for success and -1 for error. Errors include:

EINVAL

An invalid region was given, or the flags were invalid.

EFAULT

There is no existing mapping in at least part of the given region.
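The following sketch shows one way msync might be used: it modifies a file through a shared mapping and calls msync with MS_SYNC before unmapping, so that a later read sees the new contents. The function name write_via_mapping is hypothetical, and the sketch assumes the file already exists and is at least len bytes long.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Overwrite the first LEN bytes of the file PATH with TEXT through a
   shared mapping, then synchronize with MS_SYNC.  The file must
   already be at least LEN bytes long.  Returns 0 on success and -1
   on error.  Hypothetical helper.  */
int
write_via_mapping (const char *path, const char *text, size_t len)
{
  int fd = open (path, O_RDWR);
  if (fd < 0)
    return -1;

  char *p = mmap (NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close (fd);
  if (p == MAP_FAILED)
    return -1;

  memcpy (p, text, len);

  /* Wait until the modified pages have been written back.  */
  int rc = msync (p, len, MS_SYNC);
  munmap (p, len);
  return rc;
}
```

Replacing MS_SYNC with MS_ASYNC would start the write-back without waiting for it to finish.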

Function: void * mremap (void *address, size_t length, size_t new_length, int flag, ... /* void *new_address */)

Preliminary: | MT-Safe | AS-Safe | AC-Safe | See POSIX Safety Concepts.

This function can be used to change the size of an existing memory area. address and length must cover a region entirely mapped by the same mmap call. A new mapping with the same characteristics will be returned with the length new_length.

Possible flags are

MREMAP_MAYMOVE

If it is given in flags, the system may remove the existing mapping and create a new one of the desired length in another location.

MREMAP_FIXED

If it is given in flags, mremap accepts a fifth argument, void *new_address, which specifies a page-aligned address to which the mapping must be moved. Any previous mapping at the address range specified by new_address and new_length is unmapped.

MREMAP_FIXED must be used together with MREMAP_MAYMOVE.

MREMAP_DONTUNMAP

If it is given in flags, mremap accepts a fifth argument, void *new_address, which specifies a page-aligned address. Any previous mapping at the address range specified by new_address and new_length is unmapped. If new_address is NULL, the kernel chooses the page-aligned address at which to create the mapping. Otherwise, the kernel takes it as a hint about where to place the mapping. The mapping at the address range specified by address and length isn’t unmapped.

MREMAP_DONTUNMAP must be used together with MREMAP_MAYMOVE. length must be the same as new_length. This flag bit is Linux-specific.

The address of the resulting mapping is returned, or MAP_FAILED. Possible error codes include:

EFAULT

There is no existing mapping in at least part of the original region, or the region covers two or more distinct mappings.

EINVAL

One or more arguments are inappropriate, including unknown flag values.

EAGAIN

The region has pages locked, and if extended it would exceed the process’s resource limit for locked pages. See Limiting Resource Usage.

ENOMEM

The region is private writable, and insufficient virtual memory is available to extend it. Also, this error will occur if MREMAP_MAYMOVE is not given and the extension would collide with another mapped region.

This function is only available on a few systems. Except for performing optional optimizations one should not rely on this function.
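Where mremap is available, growing a mapping might look like the sketch below. The function name grow_mapping is hypothetical, and the example assumes a system with a GNU-style mremap, which requires _GNU_SOURCE to be defined before any header is included.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Resize a mapping from OLD_LEN to NEW_LEN bytes, allowing the
   system to move it.  Returns the new address, or NULL on failure,
   in which case the original mapping is left untouched.
   Hypothetical helper.  */
void *
grow_mapping (void *old, size_t old_len, size_t new_len)
{
  void *p = mremap (old, old_len, new_len, MREMAP_MAYMOVE);
  return p == MAP_FAILED ? NULL : p;
}
```

Because MREMAP_MAYMOVE is given, the returned address may differ from the original one, so any pointers into the old region must be recomputed after a successful call.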

Not all file descriptors may be mapped. Sockets, pipes, and most devices only allow sequential access and do not fit into the mapping abstraction. In addition, some regular files may not be mmapable, and older kernels may not support mapping at all. Thus, programs using mmap should have a fallback method to use should it fail. See Mmap in GNU Coding Standards.

Function: int madvise (void *addr, size_t length, int advice)

Preliminary: | MT-Safe | AS-Safe | AC-Safe | See POSIX Safety Concepts.

This function can be used to provide the system with advice about the intended usage patterns of the memory region starting at addr and extending length bytes.

The valid BSD values for advice are:

MADV_NORMAL

The region should receive no further special treatment.

MADV_RANDOM

The region will be accessed via random page references. The kernel should page-in the minimal number of pages for each page fault.

MADV_SEQUENTIAL

The region will be accessed via sequential page references. This may cause the kernel to aggressively read-ahead, expecting further sequential references after any page fault within this region.

MADV_WILLNEED

The region will be needed. The pages within this region may be pre-faulted in by the kernel.

MADV_DONTNEED

The region is no longer needed. The kernel may free these pages, causing any changes to the pages to be lost, as well as swapped out pages to be discarded.

MADV_HUGEPAGE

Indicate that it is beneficial to increase the page size for this mapping. This can improve performance for larger mappings because the system needs to handle far fewer pages. However, if parts of the mapping are frequently transferred between storage or different nodes, performance may suffer because individual transfers can become substantially larger due to the increased page size.

This flag is specific to Linux.

MADV_NOHUGEPAGE

Undo the effect of a previous MADV_HUGEPAGE advice. This flag is specific to Linux.

The POSIX names are slightly different, but with the same meanings:

POSIX_MADV_NORMAL

This corresponds with BSD’s MADV_NORMAL.

POSIX_MADV_RANDOM

This corresponds with BSD’s MADV_RANDOM.

POSIX_MADV_SEQUENTIAL

This corresponds with BSD’s MADV_SEQUENTIAL.

POSIX_MADV_WILLNEED

This corresponds with BSD’s MADV_WILLNEED.

POSIX_MADV_DONTNEED

This corresponds with BSD’s MADV_DONTNEED.

madvise returns 0 for success and -1 for error. Errors include:

EINVAL

An invalid region was given, or the advice was invalid.

EFAULT

There is no existing mapping in at least part of the given region.
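Since the advice is only a hint, a program can usually ignore a failure of madvise. The sketch below announces sequential access before scanning a mapped region; the function name sum_sequential is hypothetical, and defining _DEFAULT_SOURCE is an assumption about how the BSD names are exposed in strict standard modes.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Sum all bytes of a mapped region, first telling the system that
   the region will be read sequentially.  DATA should be the
   page-aligned start of a mapping.  Hypothetical helper.  */
unsigned long
sum_sequential (const unsigned char *data, size_t length)
{
  /* The advice only affects performance, so a failure here is
     deliberately ignored.  */
  (void) madvise ((void *) data, length, MADV_SEQUENTIAL);

  unsigned long sum = 0;
  for (size_t i = 0; i < length; i++)
    sum += data[i];
  return sum;
}
```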

Function: int shm_open (const char *name, int oflag, mode_t mode)

Preliminary: | MT-Safe locale | AS-Unsafe init heap lock | AC-Unsafe lock mem fd | See POSIX Safety Concepts.

This function returns a file descriptor that can be used to allocate shared memory via mmap. Unrelated processes can use the same name to create or open existing shared memory objects.

The name argument specifies the shared memory object to be opened. In the GNU C Library it must be a string smaller than NAME_MAX bytes starting with an optional slash but containing no other slashes.

The semantics of the oflag and mode arguments are the same as in open.

shm_open returns the file descriptor on success or -1 on error. On failure errno is set.

Function: int shm_unlink (const char *name)

Preliminary: | MT-Safe locale | AS-Unsafe init heap lock | AC-Unsafe lock mem fd | See POSIX Safety Concepts.

This function is the inverse of shm_open and removes the object with the given name previously created by shm_open.

shm_unlink returns 0 on success or -1 on error. On failure errno is set.
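Taken together, shm_open, ftruncate, and mmap can be combined roughly as in the following sketch. The function name shm_map and the permission bits 0600 are illustrative assumptions; on older versions of the GNU C Library, linking with -lrt may also be required.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create or open the shared memory object NAME, set its size to
   LENGTH bytes, and map it read-write.  Returns the mapping, or NULL
   on failure.  Hypothetical helper.  */
void *
shm_map (const char *name, size_t length)
{
  int fd = shm_open (name, O_CREAT | O_RDWR, 0600);
  if (fd < 0)
    return NULL;

  /* A newly created object has length zero, so it must be sized
     before its contents can be accessed through a mapping.  */
  if (ftruncate (fd, (off_t) length) < 0)
    {
      close (fd);
      return NULL;
    }

  void *p = mmap (NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close (fd);
  return p == MAP_FAILED ? NULL : p;
}
```

Two processes that call shm_map with the same name and length see the same memory. Once the object is no longer needed, one of them should remove the name with shm_unlink; existing mappings stay valid until they are unmapped.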

Function: int memfd_create (const char *name, unsigned int flags)

Preliminary: | MT-Safe | AS-Safe | AC-Safe fd | See POSIX Safety Concepts.

The memfd_create function returns a file descriptor which can be used to create memory mappings using the mmap function. It is similar to the shm_open function in the sense that these mappings are not backed by actual files. However, the descriptor returned by memfd_create does not correspond to a named object; the name argument is used for debugging purposes only (e.g., will appear in /proc), and separate invocations of memfd_create with the same name will not return descriptors for the same region of memory. The descriptor can also be used to create alias mappings within the same process.

The descriptor initially refers to a zero-length file. Before mappings can be created which are backed by memory, the file size needs to be increased with the ftruncate function. See File Size.

The flags argument can be a combination of the following flags:

MFD_CLOEXEC

The descriptor is created with the O_CLOEXEC flag.

MFD_ALLOW_SEALING

The descriptor supports the addition of seals using the fcntl function.

MFD_HUGETLB

This requests that mappings created using the returned file descriptor use a larger page size. See MAP_HUGETLB above for details.

This flag is incompatible with MFD_ALLOW_SEALING.

memfd_create returns a file descriptor on success, and -1 on failure.

The following errno error conditions are defined for this function:

EINVAL

An invalid combination is specified in flags, or name is too long.

EFAULT

The name argument does not point to a string.

EMFILE

The operation would exceed the file descriptor limit for this process.

ENFILE

The operation would exceed the system-wide file descriptor limit.

ENOMEM

There is not enough memory for the operation.
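The alias mappings mentioned above might be created as in this sketch, which maps the same memfd_create descriptor twice within one process. The function name map_aliases and the debugging name "alias-demo" are illustrative assumptions; memfd_create requires _GNU_SOURCE and a Linux system.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create two mappings of the same LENGTH bytes of memory and store
   their addresses in *A and *B.  Returns 0 on success and -1 on
   error.  Hypothetical helper.  */
int
map_aliases (size_t length, void **a, void **b)
{
  int fd = memfd_create ("alias-demo", MFD_CLOEXEC);
  if (fd < 0)
    return -1;

  /* The descriptor starts out as a zero-length file.  */
  if (ftruncate (fd, (off_t) length) < 0)
    {
      close (fd);
      return -1;
    }

  void *first = mmap (NULL, length, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
  void *second = mmap (NULL, length, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
  close (fd);

  if (first == MAP_FAILED || second == MAP_FAILED)
    {
      if (first != MAP_FAILED)
        munmap (first, length);
      if (second != MAP_FAILED)
        munmap (second, length);
      return -1;
    }

  *a = first;
  *b = second;
  return 0;
}
```

A store through one of the two addresses is visible through the other, because both mappings are MAP_SHARED views of the same memory.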