NVIDIA FreeBSD kernel feature requests

Thu Jun 29 11:12:35 UTC 2006

Hi all,

NVIDIA has been looking at ways to improve its graphics driver for the
FreeBSD i386 platform, as well as investigating the possibility of adding
support for the FreeBSD amd64 platform, and identified a number of
obstacles. Some progress has been made to resolve them, and NVIDIA would
like to summarize the current status. We would also like to thank John
Baldwin and Doug Rabson for their valuable help.

This summary makes an attempt to describe the kernel interfaces needed by
the NVIDIA FreeBSD i386 graphics driver to achieve feature parity with
the Linux/Solaris graphics drivers, and/or required to make support for
the FreeBSD amd64 platform feasible. It also describes some of the
technical difficulties encountered by NVIDIA during the FreeBSD i386
graphics driver's development, how these problems have been worked around
and what could be done to solve them better.

While the following is focused on the NVIDIA FreeBSD graphics drivers, we
believe the interfaces discussed below are generally applicable to any
modern high performance graphics driver.

The interfaces in question can be loosely grouped into three classes:
reliability, compatibility and performance.

Reliability:

   The NVIDIA graphics driver needs to be able to create uncached kernel
   and user mappings of I/O memory, such as NVIDIA GPU registers. The
   FreeBSD kernel does not currently provide the interfaces necessary to
   specify the memory type when creating such mappings, which makes it
   difficult for the NVIDIA graphics driver to guarantee that the correct
   memory type is selected.

   Kernel mappings of I/O memory can be created with the pmap_mapdev()
   interface; user mappings are created with mmap(2). On FreeBSD i386 and
   on FreeBSD amd64, the effective memory type of mappings created with
   either interface is determined by a given system's MTRR configuration
   by default, which will specify the correct UC memory type in most, but
   not in all cases.

   MTRR configurations with non-UC memory ranges overlapping I/O memory
   mapped via pmap_mapdev() or mmap(2) can result in the incorrect memory
   type being selected, which can impair reliability.

   To reduce the likelihood of problems, the FreeBSD i386 driver updates
   the mappings returned by pmap_mapdev() with the PCD/PWT flags to force
   use of the UC memory type. On FreeBSD amd64, the presence of a large
   static mapping using 2MB pages makes this approach unfeasible.
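
   As a rough illustration of this workaround (a userland sketch, not
   the driver's code), forcing UC comes down to setting two bits in each
   4KB page-table entry. The bit values are the architectural x86 ones;
   the helper names are made up for this example, and the entry is
   treated as a plain integer rather than a live kernel PTE. The
   large-page check hints at why the trick does not carry over to a
   mapping built from 2MB pages: changing the flags there would affect
   the whole 2MB region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-table entry bits (architectural values). */
#define PTE_P   0x001u /* present */
#define PTE_PWT 0x008u /* page-level write-through */
#define PTE_PCD 0x010u /* page-level cache disable */
#define PTE_PS  0x080u /* in a page-directory entry: maps a large page */

/* Hypothetical helper: true if a directory entry maps a large page. */
static bool
pde_is_large(uint64_t pde)
{
	return (pde & PTE_PS) != 0;
}

/*
 * Hypothetical helper: return a 4KB entry with PCD and PWT set, which
 * selects UC under the default PAT settings. Only meaningful for
 * entries that are present.
 */
static uint64_t
pte_force_uc(uint64_t pte)
{
	assert((pte & PTE_P) != 0);
	return pte | PTE_PCD | PTE_PWT;
}
```

   A real kernel implementation would additionally have to flush the
   affected TLB entries and CPU caches after changing the entry.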

   In the case of user mappings, limited control over the memory type can
   be exerted with the help of MTRRs, but their lack of flexibility
   greatly reduces the feasibility of this approach.

1) The NVIDIA FreeBSD graphics driver is in need of a new interface that
   supports the creation of UC kernel mappings on FreeBSD i386 and on
   FreeBSD amd64.

   John Baldwin is working on a new interface, pmap_mapdev_attr(), which
   will allow the NVIDIA graphics driver to create UC kernel mappings
   on FreeBSD i386 and on FreeBSD amd64; the implementation on the latter
   platform will handle the direct mapping transparently.

2) As described above, user mappings of I/O memory are created via the
   mmap(2) interface and the FreeBSD device pager; unfortunately, drivers
   do not currently have control over the memory type used.

   The NVIDIA FreeBSD graphics driver needs to be able to specify the
   memory type used for user mappings created via mmap(2). This interface
   is also important for high performance graphics (see 'Performance'
   below).

Compatibility:

1) The NVIDIA graphics driver needs to be able to set the memory type of
   the kernel mapping of memory allocated with malloc()/contigmalloc()
   to UC, which presents essentially the same problems as those outlined
   above for I/O memory mappings.

   The ability to change the memory type is necessary to avoid aliasing
   problems when the memory is mapped into the AGP aperture, which is
   accessed via WC user mappings. If the creation of UC/WC user mappings
   becomes possible for system memory in the future (see below), the
   ability to change the memory type of the associated kernel mappings to
   UC will be important for the same reason.

   Newer NVIDIA FreeBSD i386 graphics drivers manually update the memory
   type of the kernel mappings of malloc() allocated memory using the
   approach described for kernel mappings above. This is not feasible on
   FreeBSD amd64 due to the static direct mapping (see above).

   The NVIDIA FreeBSD graphics driver needs an interface that allows it
   to change the memory type of the kernel mapping(s) of system memory
   allocated with malloc()/contigmalloc(). The interface should flush CPU
   caches and TLBs when necessary.

   John Baldwin is working on pmap_change_attr() for FreeBSD i386 and for
   FreeBSD amd64, which will allow specifying the desired memory types
   for kernel mappings created with e.g. malloc()/contigmalloc().

2) The NVIDIA graphics driver needs to map different types of memory into
   the address spaces of user clients, most commonly:

    a) NVIDIA graphics device registers
    b) NVIDIA graphics device frame buffer memory
    c) AGP memory allocations (mapped via the AGP aperture)
    d) DMA system memory allocations

   This is currently done via mmap(2) and the device pager, i.e. the user
   client performs a private ioctl(2) to allocate memory (this step is
   specific to the b) - d) memory types), then calls mmap(2) to obtain a
   user mapping of the memory. The NVIDIA graphics driver's d_mmap()
   callback is invoked first to check the logical mmap(2) offset(s), then
   again to return the associated page frame number(s) when the mapping
   is accessed for the first time.

   The device pager mechanism works well for a) - c), but not for d). The
   system memory allocations are frequently very large (several MB) and
   need to be allocated physically non-contiguous. This leads to problems
   with the d_mmap() interface:

    - d_mmap() is called per page with logical offsets computed based on
      the mmap(2) base offset provided by the client and the current
      page's position within the allocation, but no context information
      is provided to d_mmap(). The NVIDIA FreeBSD graphics driver can
      look up the associated system memory allocation and determine the
      page frame number(s) for a given logical offset only if a linear
      address range is associated with each system memory allocation, in
      which case the start address can serve as the mmap(2) offset used
      by the client and the logical offsets can be compared with each
      allocation's linear address range.

      Since the memory itself is not physically contiguous, the physical
      addresses of pages in the allocation can not be used as mmap(2)
      offsets; a different address range needs to be used. The FreeBSD
      i386 driver currently allocates its system memory with malloc() and
      derives the address range used with mmap(2) from the allocation's
      kernel virtual address range.

      This allocation of DMA system memory with malloc() is problematic
      on FreeBSD i386 PAE and FreeBSD amd64 systems with more than 4GB of
      RAM and older NVIDIA GPUs limited to 32-bit DMA, since malloc()
      doesn't currently allow drivers to specify allocation constraints,
      like contigmalloc() does, i.e. it may allocate physical memory that
      can not be addressed by such GPUs.

      Further, since the physical addresses of non-contiguous allocations
      can not be used as mmap(2) offsets for system memory, but need to
      be used for a) - c), the logical and physical addresses used as
      mmap(2) offsets can potentially be confused by d_mmap(). The NVIDIA
      graphics driver tries to minimize this risk, but can not avoid it
      completely without a significant performance penalty.

    - The device pager was designed for I/O memory regions and it assumes
      that d_mmap() will always return the same page frame number for a
      given logical offset. As a result, d_mmap() is invoked exactly once
      for any given logical offset by default. In case of system memory
      allocations, however, the physical page backing a given offset may
      change as the malloc()'d memory is freed/reallocated.

      The NVIDIA FreeBSD graphics driver needs to manually invalidate the
      translation cache to work around this problem. It does so with the
      msync() system call, which was extended for this purpose in FreeBSD
      4.7 and again in FreeBSD 4.9 and 5.2.1. This leads to performance
      problems on some configurations.
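
   The lookup described in the first item above might be sketched as
   follows. This is a userland model under stated assumptions: the
   structure and function names are hypothetical, the real d_mmap()
   callback has a different signature and runs in the kernel, and a
   4KB page size is assumed. It only shows how a logical offset is
   matched against each allocation's linear address range to find the
   page frame number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Hypothetical record for one non-contiguous system memory allocation. */
struct sysmem_alloc {
	uint64_t base;        /* linear start address, used as mmap(2) offset */
	size_t npages;        /* number of pages in the allocation */
	const uint64_t *pfns; /* page frame number of each page */
};

/*
 * Find the allocation whose linear range contains 'offset' and store
 * the page frame number backing it. Returns 0 on success, -1 if no
 * allocation matches.
 */
static int
lookup_pfn(const struct sysmem_alloc *allocs, size_t nallocs,
    uint64_t offset, uint64_t *pfn)
{
	assert(pfn != NULL);
	for (size_t i = 0; i < nallocs; i++) {
		const struct sysmem_alloc *a = &allocs[i];
		uint64_t end = a->base + a->npages * SKETCH_PAGE_SIZE;

		if (offset >= a->base && offset < end) {
			*pfn = a->pfns[(offset - a->base) >> SKETCH_PAGE_SHIFT];
			return 0;
		}
	}
	return -1;
}
```

   Since the callback receives no context, every call has to repeat
   this search, and an offset that happens to fall inside one of these
   ranges is indistinguishable from a physical address used for a) - c).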

   The NVIDIA FreeBSD graphics driver needs a different interface to make
   the mapping of system memory allocations via mmap(2) simpler. If the
   d_mmap() callback was extended to be called with the base offset in
   addition to the current offset, the first two of the problems detailed
   above would no longer be an issue; the NVIDIA graphics driver would
   then be able to use physical addresses as mmap(2) offsets for a) - d).

   The new interface must not require a FreeBSD-specific ioctl(2), as this
   would break compatibility with the NVIDIA Linux OpenGL library used
   in the FreeBSD Linux ABI compatibility environment.

3) To be able to support FreeBSD i386 PAE and FreeBSD amd64 systems with
   more than 4GB of physical memory and NVIDIA GPUs that are limited to
   32-bit DMA, the NVIDIA FreeBSD graphics driver will need to be updated
   to allocate memory from within the first 4GB of memory.

   Unfortunately, this is not feasible with the current interfaces. The
   malloc() interface does not allow the caller to specify allocation
   constraints and while contigmalloc() does, its usefulness is currently
   limited. This is because DMA memory can't realistically be allocated
   contiguously, except if the allocations are very small, and because
   a contiguous address range is needed for mmap(2), as described above,
   which would need to be maintained separately for contigmalloc() memory
   allocations.
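
   To make the addressing constraint concrete, here is a minimal
   sketch of the check a driver would have to apply to each physical
   range it hands to a DMA-limited GPU. The function name and the
   'dma_bits' parameter are assumptions made for this example; they do
   not correspond to an existing FreeBSD interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the physical range [paddr, paddr + size)
 * can be reached by a device that drives 'dma_bits' address bits.
 */
static bool
dma_range_ok(uint64_t paddr, uint64_t size, unsigned dma_bits)
{
	assert(size > 0 && dma_bits >= 1 && dma_bits <= 64);

	uint64_t limit = (dma_bits == 64) ?
	    UINT64_MAX : (((uint64_t)1 << dma_bits) - 1);
	uint64_t last = paddr + (size - 1);

	if (last < paddr)
		return false; /* range wraps around the address space */
	return last <= limit;
}
```

   With 32-bit DMA, a page at physical address 0x100000000 fails this
   check, which is exactly what an unconstrained malloc() may return on
   a system with more than 4GB of RAM.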

   The introduction of a malloc() variant that allows the specification
   of allocation constraints would solve the addressing problem, but
   due to the problems caused by using logical and physical addresses for
   mmap(2), a different solution would be preferred. By making it
   possible to use physical addresses exclusively as mmap(2) offsets, as
   described above, the NVIDIA FreeBSD graphics driver could use the
   contigmalloc() interface to allocate the individual pages in the larger
   non-contiguous allocations.

   If contigmalloc() were used, the NVIDIA FreeBSD graphics driver would
   need to be able to create contiguous virtual mappings spanning more
   than one page within larger virtually non-contiguous allocations; this
   functionality had best be implemented in the FreeBSD kernel.

   The 'vmap()' kernel interface does this on Linux. It takes an array of
   pages and maps them into a single contiguous address range.
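
   The idea can be modeled in userland with a toy address space, as in
   the sketch below. Everything here is hypothetical (the 'toy_' names,
   the fixed-size slot table standing in for page tables); it is not
   the Linux implementation, only an illustration of turning an array
   of scattered page frames into one contiguous virtual range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_VSLOTS   64         /* size of the toy virtual range, in pages */
#define TOY_UNMAPPED UINT64_MAX /* marker for a free virtual page slot */

/* Toy stand-in for page tables: slot i holds the pfn mapped at page i. */
struct toy_vspace {
	uint64_t pfn[TOY_VSLOTS];
};

static void
toy_vspace_init(struct toy_vspace *vs)
{
	for (size_t i = 0; i < TOY_VSLOTS; i++)
		vs->pfn[i] = TOY_UNMAPPED;
}

/*
 * Map 'count' page frames into the first run of 'count' free,
 * consecutive virtual slots. Returns the first slot index, or -1 if no
 * such run exists.
 */
static long
toy_vmap(struct toy_vspace *vs, const uint64_t *pfns, size_t count)
{
	assert(count > 0);

	size_t run = 0;
	for (size_t i = 0; i < TOY_VSLOTS; i++) {
		run = (vs->pfn[i] == TOY_UNMAPPED) ? run + 1 : 0;
		if (run == count) {
			size_t first = i + 1 - count;

			for (size_t j = 0; j < count; j++)
				vs->pfn[first + j] = pfns[j];
			return (long)first;
		}
	}
	return -1;
}
```

   The page frames passed in may be arbitrarily scattered; only the
   virtual slots they end up in are consecutive.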

Performance:

1) For optimal PCI-E performance and improved compatibility with systems
   where MTRR memory ranges do not provide sufficient flexibility, the
   NVIDIA FreeBSD graphics driver needs to be able to specify the memory
   type used for user mappings created with mmap(2).

   John Baldwin is working on PAT support for FreeBSD, which will be used
   by the pmap_mapdev_attr() and pmap_change_attr() kernel interfaces
   referred to above. This support can provide the desired flexibility if
   the d_mmap() interface is extended or complemented with a new one,
   allowing drivers to take advantage of the PAT support.

   In order to provide optimal PCI-E performance, NVIDIA FreeBSD graphics
   drivers need to be able to create WC system memory mappings.
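
   For readers unfamiliar with PAT, the sketch below shows how the
   three attribute bits of a 4KB page-table entry select one of eight
   PAT entries. The bit positions and the power-on default table are
   the architectural ones; the helper names, the enum, and the choice
   of reprogramming entry 1 to WC in the usage note are assumptions
   made for this example. The sketch also ignores the step in which
   the CPU combines the result with the MTRR memory type.

```c
#include <assert.h>
#include <stdint.h>

enum memtype { MT_UC, MT_UC_MINUS, MT_WC, MT_WT, MT_WP, MT_WB };

/* Power-on default contents of the eight PAT entries. */
static const enum memtype pat_default[8] = {
	MT_WB, MT_WT, MT_UC_MINUS, MT_UC,
	MT_WB, MT_WT, MT_UC_MINUS, MT_UC
};

#define PTE_PWT 0x008u /* page-level write-through */
#define PTE_PCD 0x010u /* page-level cache disable */
#define PTE_PAT 0x080u /* PAT bit position in a 4KB page-table entry */

/* PAT index = PAT:PCD:PWT, read as a three-bit number. */
static unsigned
pte_pat_index(uint64_t pte)
{
	return ((pte & PTE_PAT) ? 4u : 0u) |
	    ((pte & PTE_PCD) ? 2u : 0u) |
	    ((pte & PTE_PWT) ? 1u : 0u);
}

/* Memory type a 4KB entry selects from the given PAT contents. */
static enum memtype
pte_memtype(uint64_t pte, const enum memtype pat[8])
{
	unsigned idx = pte_pat_index(pte);

	assert(idx < 8);
	return pat[idx];
}
```

   None of the default entries is WC. If the kernel reprogrammed, say,
   entry 1 from WT to WC, a mapping with only PWT set would become
   write-combining, independently of the MTRR layout.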

2) The device pager mechanism is page fault based, which incurs noticeable
   overhead due to the large number of user/kernel context switches.
   This can result in significant performance penalties with very large
   or numerous user mappings. It also currently requires the use of the
   msync() workaround (see above), which incurs additional overhead.

   Performance with the NVIDIA FreeBSD graphics driver would benefit from
   an mmap(2) interface that is independent of the device pager and
   allows the mappings' page tables to be prebuilt. The Linux and Solaris
   operating systems support such interfaces.

3) On Linux and Solaris, the NVIDIA graphics driver can maintain per open
   instance data, i.e. data that is specific to the processes' file
   descriptors associated with NVIDIA character special files. This is
   useful primarily to achieve optimal results with the driver's internal
   notification mechanism, which is used to implement Sync-to-VBlank
   functionality, among other things. On these two operating systems, the
   NVIDIA graphics driver can selectively wake threads select(2)'ing the
   device files (/dev/nvidia0..N).

   The NVIDIA FreeBSD graphics driver can only maintain per device state
   at the moment. It wakes all processes waiting on /dev/nvidiaX, and
   needs to traverse a per device event list for each of these processes
   to check whether an event was delivered for each one of them, which
   incurs some overhead. The logic also can't currently guarantee correct
   delivery of events to different threads in the same process.
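
   The difference between the two models can be shown with a small
   userland simulation. It is only a sketch: the 'toy_' types, the
   integer open-instance ids and the counters are invented for this
   example and do not reflect the driver's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_OPENS 8

/* Hypothetical event record: names the open instance it is meant for. */
struct toy_event {
	int open_id;
};

/* Hypothetical per-open state, as available on Linux and Solaris. */
struct toy_open {
	int pending; /* events queued for this open instance */
};

/*
 * Per-device model: every waiter is woken and scans the shared event
 * list. Returns the number of waiters that found an event; '*checks'
 * receives the number of list entries examined in total.
 */
static size_t
wake_all(const struct toy_event *ev, size_t nev,
    const int *waiters, size_t nwait, size_t *checks)
{
	size_t hits = 0;

	assert(checks != NULL);
	*checks = 0;
	for (size_t w = 0; w < nwait; w++) {
		for (size_t e = 0; e < nev; e++) {
			(*checks)++;
			if (ev[e].open_id == waiters[w]) {
				hits++;
				break;
			}
		}
	}
	return hits;
}

/*
 * Per-open model: each event is queued on the open instance it names,
 * and only that instance's waiter is woken. Returns the wakeup count.
 */
static size_t
deliver_per_open(struct toy_open *opens, const struct toy_event *ev,
    size_t nev)
{
	size_t wakeups = 0;

	for (size_t e = 0; e < nev; e++) {
		assert(ev[e].open_id >= 0 && ev[e].open_id < TOY_MAX_OPENS);
		if (opens[ev[e].open_id].pending++ == 0)
			wakeups++; /* wake only this instance's waiter */
	}
	return wakeups;
}
```

   With one event and four waiters, the per-device model wakes all
   four and examines the list four times; the per-open model performs
   a single wakeup.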

   Future versions of the NVIDIA FreeBSD graphics driver are likely to
   employ the notification mechanism more aggressively, to better support
   composited X desktop functionality.

Summary of Tasks:

 # Task:        implement pmap_mapdev_attr() on FreeBSD i386 and on
                FreeBSD amd64.
   Motivation:  allows reliable creation of kernel mappings of I/O
                memory with specific cache attributes (with per-page
                granularity).
   Priority:    gates FreeBSD amd64 support.
   Status:      is being implemented for i386 and amd64 (work is being
                done to allow easily breaking down 2MB pages).

 # Task:        design/implement better mmap(2) mechanism for mapping
                memory to user space (context information, cache
                attributes).
   Motivation:  allows reliable creation of user mappings of DMA and
                I/O memory and support for systems with more than
                4GB of RAM.
   Priority:    gates improved FreeBSD i386 support (PCI-E performance,
                SLI support, improved reliability); gates FreeBSD
                amd64 support.
   Status:      has not been started, pending.

 # Task:        implement pmap_change_attr() on FreeBSD i386 and on
                FreeBSD amd64.
   Motivation:  allows prevention of cache coherency problems.
   Priority:    gates FreeBSD amd64 support.
   Status:      is being implemented for i386 and amd64.

 # Task:        implement vmap()-like kernel interface.
   Motivation:  allows creation of contiguous kernel mappings of
                parts of or complete non-contiguous DMA/system memory
                allocations.
   Priority:    gates support for systems with more than 4GB of RAM.
   Status:      has not been started.

 # Task:        implement mechanism to allow character drivers to
                maintain per-open instance data (e.g. like the Linux
                kernel's 'struct file *').
   Motivation:  allows per thread NVIDIA notification delivery; also
                reduces CPU overhead for notification delivery
                from the NVIDIA kernel module to the X driver and to
                OpenGL.
   Priority:    should translate to improved X/OpenGL performance.
   Status:      has not been started.

Thanks,

-- 
christian zander
ch?zander at nvidia.com