Maximum Swapsize

Tue Apr 11 21:18:28 UTC 2006

    From 'man tuning' (I think I wrote this, a long time ago):

     You should typically size your swap space to approximately 2x main mem-
     ory.  If you do not have a lot of RAM, though, you will generally want a
     lot more swap.  It is not recommended that you configure any less than
     256M of swap on a system and you should keep in mind future memory expan-
     sion when sizing the swap partition.  The kernel's VM paging algorithms
     are tuned to perform best when there is at least 2x swap versus main mem-
     ory.  Configuring too little swap can lead to inefficiencies in the VM
     page scanning code as well as create issues later on if you add more mem-
     ory to your machine.  Finally, on larger systems with multiple SCSI disks
     (or multiple IDE disks operating on different controllers), we strongly
     recommend that you configure swap on each drive (up to four drives).  The
     swap partitions on the drives should be approximately the same size.  The
     kernel can handle arbitrary sizes but internal data structures scale to 4
     times the largest swap partition.  Keeping the swap partitions near the
     same size will allow the kernel to optimally stripe swap space across the
     N disks.  Do not worry about overdoing it a little, swap space is the
     saving grace of UNIX and even if you do not normally use much swap, it
     can give you more time to recover from a runaway program before being
     forced to reboot.
					--

    The last sentence is probably the most important.  The primary reason
    to configure a fairly large amount of swap has less to do with
    performance and more to do with giving the system admin a long runway:
    enough time to deal with unexpected situations before the machine
    blows itself to bits.

    The swap subsystem has the following limitation:

        /*
         * If we go beyond this, we get overflows in the radix
         * tree bitmap code.
         */
        if (nblks > 0x40000000 / BLIST_META_RADIX / nswdev) {
                printf("exceeded maximum of %d blocks per swap unit\n",
                        0x40000000 / BLIST_META_RADIX / nswdev);
                VOP_CLOSE(vp, FREAD | FWRITE, td);
                return (ENXIO);
        }

    By default, BLIST_META_RADIX is 16 and nswdev is 4, so the maximum
    number of blocks *PER* swap device is 0x40000000 / 16 / 4 = 16M
    (16,777,216).  If PAGE_SIZE is 4K, the limit is 64 GB per swap device,
    with up to 4 swap devices (256 GB total swap).
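
    The arithmetic above can be sketched as a pair of helper functions.
    This is only an illustration: the constants are the defaults quoted in
    this message, and the names SWAP_LIMIT_BASE, swap_max_blocks_per_dev()
    and swap_max_bytes_per_dev() are made up for the example; they are not
    kernel symbols.

```c
#include <stdint.h>

/* Defaults quoted in the text; other kernels or configs may differ. */
#define BLIST_META_RADIX 16
#define SWAP_LIMIT_BASE  0x40000000ULL  /* hypothetical name for the constant in the check */

/* Maximum number of blocks (pages) accepted on one swap device. */
static uint64_t
swap_max_blocks_per_dev(uint64_t nswdev)
{
        return (SWAP_LIMIT_BASE / BLIST_META_RADIX / nswdev);
}

/* Maximum bytes per swap device for a given page size. */
static uint64_t
swap_max_bytes_per_dev(uint64_t nswdev, uint64_t page_size)
{
        return (swap_max_blocks_per_dev(nswdev) * page_size);
}
```

    With nswdev = 4 and a 4096-byte page, swap_max_blocks_per_dev(4)
    gives 16777216 blocks and swap_max_bytes_per_dev(4, 4096) gives
    64 GB, so four devices top out at 256 GB.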

    The kernel has to allocate memory to track the swap space.  This memory
    is allocated and managed by kern/subr_blist.c (assuming you haven't
    changed things since I wrote it).  This is basically implemented as a
    flattened radix tree using a fixed radix of 16.  The memory overhead is
    fixed (based on the amount of swap configured) and comes to
    approximately 2 bits per VM page.  Performance is approximately O(log N).
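
    As a rough illustration of the "2 bits per VM page" figure, the
    tracking overhead can be estimated like this.  The function name is
    hypothetical and the result is only as accurate as the approximate
    2-bit figure itself; it is not how subr_blist.c sizes its tree.

```c
#include <stdint.h>

/*
 * Estimate the memory (in bytes) needed to track swap_bytes of
 * configured swap, assuming roughly 2 bits of bookkeeping per page.
 */
static uint64_t
blist_overhead_bytes(uint64_t swap_bytes, uint64_t page_size)
{
        uint64_t pages = swap_bytes / page_size;

        return (pages * 2 / 8);         /* 2 bits per page, 8 bits per byte */
}
```

    For 256 GB of configured swap and 4K pages this comes to about 16 MB,
    paid up front whether or not the swap is ever used.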

    Additionally, once pages are actually swapped out, the VM object must
    record the swap index for each page.  This costs around 4 bytes per
    swapped-out page and is probably the greatest limiting factor in the
    amount of swap you can actually use.  256 GB of 100% used swap would
    eat 256 MB of kernel RAM.
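
    The same kind of estimate works for the per-page swap index.  Again
    a sketch only: the 4-byte figure is the approximate cost quoted above,
    and swap_index_overhead_bytes() is a made-up name.

```c
#include <stdint.h>

/*
 * Estimate the kernel memory (in bytes) consumed by swap indices when
 * swapped_bytes of swap are actually in use, assuming about 4 bytes
 * recorded per swapped-out page.
 */
static uint64_t
swap_index_overhead_bytes(uint64_t swapped_bytes, uint64_t page_size)
{
        uint64_t pages = swapped_bytes / page_size;

        return (pages * 4);
}
```

    Unlike the fixed tracking overhead, this cost grows with the amount
    of swap in use: 256 GB fully used with 4K pages works out to 256 MB.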

    I believe that large linear chunks of reserved swap, such as those used
    by MD, currently still require the per-page overhead.  However, theoretically,
    since the reservation model uses a radix tree, it *IS* possible to
    reserve huge swaths of linear-addressed swap space with no per-page
    storage requirements in the VM object.  It is even possible to do away
    with the 2 bits per page that the radix tree uses if the radix tree
    were allocated dynamically.  I decided against doing that because I
    did not want the swap subsystem to be reliant on malloc() during 
    critical low-memory paging situations.

						-Matt