Problem statement:
Application code allocates shared memory (via the shm_open/shmget IPC calls) that acts as a shadow copy in RAM of the device's registers, so that any application process in the system can read them at any time.
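As a rough user-space sketch of such a shadow region, here is the shmget variant. The key SHADOW_KEY, the register count SHADOW_NREGS and the helper names are made up for illustration, and 32-bit registers are assumed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical key and size, chosen only for this illustration. */
#define SHADOW_KEY   ((key_t)0x53484457)
#define SHADOW_NREGS 4096u              /* number of 32-bit registers shadowed */

/* Create (or look up) the shared segment and attach it to this process.
 * Returns NULL on failure; on success *shmid_out holds the segment id. */
uint32_t *shadow_attach(int *shmid_out)
{
    assert(shmid_out != NULL);
    size_t len = SHADOW_NREGS * sizeof(uint32_t);
    int id = shmget(SHADOW_KEY, len, IPC_CREAT | 0600);
    if (id < 0)
        return NULL;
    void *p = shmat(id, NULL, 0);
    if (p == (void *)-1)
        return NULL;
    *shmid_out = id;
    return (uint32_t *)p;
}

/* Detach and mark the segment for removal once every process has detached. */
void shadow_release(uint32_t *regs, int shmid)
{
    assert(regs != NULL);
    shmdt(regs);
    shmctl(shmid, IPC_RMID, NULL);
}
```

Every process that calls shadow_attach() with the same key sees the same register values, which is all the shadow memory needs to provide.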
The initial configuration of the device involves thousands of register writes, each written one 32-bit word at a time to both the shadow memory and the device. This is time-consuming for a PCIe device. The goal, therefore, is to reduce this initial configuration time (the boot time of the device). Ideally, we would copy the configuration to RAM and then DMA the RAM contents to the device in the background (while the application carries on with the rest of its initialization).
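To make the contrast concrete, here is a small model of the two write paths, assuming 32-bit registers. The function names are hypothetical, `dev` is a stand-in for the memory-mapped device registers, and bulk_flush() is only a CPU-loop placeholder for the transfer that DMA would perform:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Current scheme: every configuration write goes to both places. */
void reg_write_both(uint32_t *shadow, volatile uint32_t *dev,
                    size_t idx, uint32_t val)
{
    assert(shadow != NULL && dev != NULL);
    shadow[idx] = val;
    dev[idx] = val;       /* one bus write per register */
}

/* Desired scheme: stage the value in RAM only. */
void reg_stage(uint32_t *shadow, size_t idx, uint32_t val)
{
    assert(shadow != NULL);
    shadow[idx] = val;
}

/* Placeholder for the later bulk transfer of the staged contents.
 * On real hardware this copy is what the DMA engine would do. */
void bulk_flush(const uint32_t *shadow, volatile uint32_t *dev, size_t nregs)
{
    assert(shadow != NULL && dev != NULL);
    for (size_t i = 0; i < nregs; i++)
        dev[i] = shadow[i];
}
```

With the second scheme the application pays only for RAM writes during configuration; the device sees the values once the bulk transfer has run.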
The challenge:
DMA'ing the RAM contents to the device gives us hardware acceleration. But the RAM contents live in user-space shared memory, which is physically non-contiguous, whereas a simple single-region DMA needs the data to be physically contiguous in RAM.
To elaborate, the shared memory allocated to application processes may be scattered across physical RAM (much like a vmalloc allocation), i.e., the pages are virtually contiguous but not necessarily physically contiguous. We could still use the device's scatter-gather capability to gather the scattered physical blocks, but that requires walking the page tables of the block to find the physical address of each page, which is again time-consuming.
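The scatter-gather alternative can be modelled in plain C. This is only an illustration, not kernel code: struct sg_seg, sg_build() and the 4 KB page size are assumptions, and the per-page frame numbers are taken as given (obtaining them is the expensive page-table walk mentioned above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SIZE 4096u   /* assumed page size for this illustration */

/* One scatter-gather segment: a physically contiguous run of pages. */
struct sg_seg {
    uint64_t phys;   /* physical start address */
    uint64_t len;    /* length in bytes */
};

/* Coalesce per-page frame numbers into contiguous segments.
 * pfn[i] is the physical frame backing virtual page i of the buffer.
 * Returns the number of segments written to out, or 0 if out
 * (capacity max_out) is too small to hold them all. */
size_t sg_build(const uint64_t *pfn, size_t npages,
                struct sg_seg *out, size_t max_out)
{
    size_t nseg = 0;
    assert(pfn != NULL && out != NULL);
    for (size_t i = 0; i < npages; i++) {
        uint64_t phys = pfn[i] * EX_PAGE_SIZE;
        /* Extend the previous segment if this page follows it physically. */
        if (nseg > 0 && out[nseg - 1].phys + out[nseg - 1].len == phys) {
            out[nseg - 1].len += EX_PAGE_SIZE;
            continue;
        }
        if (nseg == max_out)
            return 0;
        out[nseg].phys = phys;
        out[nseg].len = EX_PAGE_SIZE;
        nseg++;
    }
    return nseg;
}
```

A badly fragmented buffer degenerates to one segment per page, so the descriptor list grows with the buffer, whereas a physically contiguous buffer always needs exactly one segment.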
The Linux kernel provides the 'kmalloc' call to allocate physically contiguous memory, but a single allocation is limited in size (traditionally up to 128 KB; the exact limit depends on the kernel version and configuration).
Background on the idea:
In a Linux kernel with the bigphysarea patch applied, a chunk of physical RAM can be set aside at boot time by specifying the 'bigphysarea=<size>' option on the kernel boot line (the bigphysarea patch documents the size as a number of pages, not bytes). This reserves that much memory in low-mem during boot. Low-mem is normally DMA-able, and although modern DMA engines are capable of addressing high-mem, code that uses low-mem is more easily portable.
Shared memory obtained via shmget/shm_open is backed by a RAM-backed file in tmpfs.
Our solution, in simplest terms, is to grab a bigphysarea chunk and map it to the RAM-backed file for the applications to use. The DMA is then performed in the background.
Implementation:
Write a kernel module :)
I'll highlight the main APIs to be implemented:
1. ioctl:
command: ALLOC_BIGPHYS - Allocate the bigphysarea and return its physical start address to the caller for later use (in mmap).
command: STRT_DMA - Use the streaming DMA API (pci_map_single, or dma_map_single on newer kernels) to map the area as a single region.
2. mmap:
After some checks, map the application's VMA (using remap_pfn_range) onto the bigphysarea we obtained in the previous ioctl command (ALLOC_BIGPHYS). Set the VMA flags so that the pages are not swapped out while they are mapped! Take care of caching issues (how?)
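From the application's side, the two entry points could be used roughly as follows. This is a sketch under several assumptions: the magic number 'B', the command numbers, the uint64_t argument type and the helper names are all hypothetical, and the mmap offset of 0 assumes the module remembers the area it handed out in ALLOC_BIGPHYS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/ioctl.h>

/* Hypothetical magic and command numbers; a real module picks its own. */
#define BIGPHYS_IOC_MAGIC 'B'
#define ALLOC_BIGPHYS _IOR(BIGPHYS_IOC_MAGIC, 1, uint64_t) /* returns phys start */
#define STRT_DMA      _IO(BIGPHYS_IOC_MAGIC, 2)            /* kick off the DMA */

/* Ask the module for the bigphysarea, then map it into this process.
 * fd is an open descriptor on the module's device node.
 * Returns NULL on failure; on success *phys_out holds the physical start. */
void *bigphys_map(int fd, size_t len, uint64_t *phys_out)
{
    assert(phys_out != NULL);
    if (ioctl(fd, ALLOC_BIGPHYS, phys_out) < 0)
        return NULL;
    /* Offset 0 is an assumption: the module maps the area it allocated. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Once the configuration has been staged in the mapping, start the DMA. */
int bigphys_start_dma(int fd)
{
    return ioctl(fd, STRT_DMA);
}
```

The application writes its configuration through the pointer returned by bigphys_map() exactly as it would into ordinary shared memory, then calls bigphys_start_dma() and continues with the rest of its initialization.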