32GB limit per swap device?

Sat Aug 20 18:46:13 UTC 2011

Alan Cox wrote:
> On 08/20/2011 12:41, Kostik Belousov wrote:
>> On Sat, Aug 20, 2011 at 12:33:29PM -0500, Alan Cox wrote:
>>> On Thu, Aug 18, 2011 at 3:16 AM, Alexander V.
>>> Chernikov<melifaro at ipfw.ru>wrote:
>>>
>>>> On 10.08.2011 19:16, perryh at pluto.rain.com wrote:
>>>>
>>>>> Chuck Swiger<cswiger at mac.com>   wrote:
>>>>>
>>>>>   On Aug 9, 2011, at 7:26 AM, Daniel Kalchev wrote:
>>>>>>> I am trying to set up 64GB partitions for swap for a system that
>>>>>>> has 64GB of RAM (with the idea to dump kernel core etc). But, on
>>>>>>> 8-stable as of today I get:
>>>>>>>
>>>>>>> WARNING: reducing size to maximum of 67108864 blocks per swap unit
>>>>>>>
>>>>>>> Is there workaround for this limitation?
>>>>>>>
>>>> Another interesting question:
>>>>
>>>> The swap pager operates in page-sized blocks (PAGE_SIZE=4k on common
>>>> architectures).
>>>>
>>>> The block device size is passed to swaponsomething() as a number of
>>>> _disk_ blocks (i.e. in units of DEV_BSIZE=512). After that, the
>>>> maximum-objects check of the kernel b-lists (on top of which the swap
>>>> pager is built) is enforced.
>>>>
>>>> The (possible) problem is that the real object count we will operate
>>>> on is not the value passed to swaponsomething(), so the check is made
>>>> in the wrong units.
>>>>
>>>> We should check the b-list limit against the (X * DEV_BSIZE / PAGE_SIZE)
>>>> value, which is roughly (X / 8), so we should be able to address
>>>> 32*8=256G.
>>>>
>>>> The code should look like this:
>>>>
>>>> Index: vm/swap_pager.c
>>>> ===================================================================
>>>> --- vm/swap_pager.c     (revision 223877)
>>>> +++ vm/swap_pager.c     (working copy)
>>>> @@ -2129,6 +2129,15 @@ swaponsomething(struct vnode *vp, void *id, u_long
>>>>         u_long mblocks;
>>>>
>>>>         /*
>>>> +        * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
>>>> +        * First chop nblks off to page-align it, then convert.
>>>> +        *
>>>> +        * sw->sw_nblks is in page-sized chunks now too.
>>>> +        */
>>>> +       nblks &= ~(ctodb(1) - 1);
>>>> +       nblks = dbtoc(nblks);
>>>> +
>>>> +       /*
>>>>          * If we go beyond this, we get overflows in the radix
>>>>          * tree bitmap code.
>>>>          */
>>>> @@ -2138,14 +2147,6 @@ swaponsomething(struct vnode *vp, void *id, u_long
>>>>                         mblocks);
>>>>                 nblks = mblocks;
>>>>         }
>>>> -       /*
>>>> -        * nblks is in DEV_BSIZE'd chunks, convert to PAGE_SIZE'd chunks.
>>>> -        * First chop nblks off to page-align it, then convert.
>>>> -        *
>>>> -        * sw->sw_nblks is in page-sized chunks now too.
>>>> -        */
>>>> -       nblks &= ~(ctodb(1) - 1);
>>>> -       nblks = dbtoc(nblks);
>>>>
>>>>         sp = malloc(sizeof *sp, M_VMPGDATA, M_WAITOK | M_ZERO);
>>>>         sp->sw_vp = vp;
>>>>
>>>>
>>>> (move pages recalculation before b-list check)
>>>>
>>>>
>>>> Can someone comment on this?
>>>>
>>>>
>>> I believe that you are correct.  Have you tried testing this change on a
>>> large swap device?
I will try tomorrow.
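
For reference, the arithmetic behind the 32G and 256G figures quoted above
can be sketched in C. This is only an illustration, not kernel code: the
helper names max_blocks() and max_swap_bytes() are made up for this
example, and the constants (DEV_BSIZE=512, PAGE_SIZE=4096,
BLIST_META_RADIX=16) are the values assumed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Values assumed in this thread; the real ones come from kernel headers. */
#define DEV_BSIZE_ASSUMED	512u	/* bytes per disk block */
#define PAGE_SIZE_ASSUMED	4096u	/* bytes per page */
#define BLIST_META_RADIX	16u

/* Block limit discussed in this thread: 0x40000000 / BLIST_META_RADIX. */
static uint64_t
max_blocks(void)
{
	return (0x40000000u / BLIST_META_RADIX);
}

/*
 * Largest swap device, in bytes, when the block limit is applied to
 * units of unit_size bytes (hypothetical helper).
 */
static uint64_t
max_swap_bytes(uint32_t unit_size)
{
	assert(unit_size > 0);
	return (max_blocks() * (uint64_t)unit_size);
}
```

With these assumptions, max_swap_bytes(DEV_BSIZE_ASSUMED) gives the
current 32G limit, and max_swap_bytes(PAGE_SIZE_ASSUMED) gives the 256G
that should be reachable once the check is done in page units.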

>> I probably agree too, but I am in the process of re-reading the swap
>> code,
>> and I do not quite believe in the limit.
>>
> 
> I'm uncertain whether the current limit, "0x40000000 /
> BLIST_META_RADIX", is exact or not, but I doubt that it is too large.

It is not exact.  It is a rough estimate derived from
sizeof(blmeta_t) * X < 4G (blist_create() assumes that malloc() cannot
allocate more than 4G; I'm not sure whether that is still true these days).
X is the number of blocks we need to store. The actual number, however, is X
/ (1 + 1/BLIST_META_RADIX + 1/BLIST_META_RADIX^2 + ...), but it does not
differ from X by much.

A blist can be seen as a tree of radix trees, with the metainformation for
all those radix trees allocated in a single allocation, which imposes this
limit. The metainformation is used to find free blocks more quickly.

A single linear allocation is required so that advancing to the next radix
tree on the same level is very fast:

*   *   *   *   *
**  **  **  **  **
********************
^^^
A rough schema with 3 levels in the tree and BLIST_META_RADIX=2 (instead
of 16).
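
To show why one linear allocation makes moving between trees on the same
level cheap, here is a toy model in C. It is not the subr_blist.c code: it
assumes a complete tree stored in preorder in a single array, and the
names subtree_nodes() and next_sibling() are made up for this example.
Under that assumption, the next node on the same level is found by adding
the size of the current subtree to the current index, with no pointers to
follow.

```c
#include <stddef.h>

/*
 * Number of nodes in a complete subtree of the given height
 * (height 0 is a single leaf), for a tree with the given radix.
 */
static size_t
subtree_nodes(size_t radix, unsigned height)
{
	size_t level = 1;	/* nodes on the current level */
	size_t total = 1;	/* the subtree root itself */
	unsigned h;

	for (h = 0; h < height; h++) {
		level *= radix;
		total += level;
	}
	return (total);
}

/*
 * Index of the next node on the same level, for a node at index `node`
 * whose subtree has the given height.  In a preorder layout the whole
 * subtree is stored contiguously, so we just skip over it.
 */
static size_t
next_sibling(size_t node, size_t radix, unsigned height)
{
	return (node + subtree_nodes(radix, height));
}
```

For example, with radix 2 a subtree of height 1 occupies 3 slots, so the
sibling of the node at index 1 is at index 4. With radix 16 and height 2
one subtree occupies 1 + 16 + 256 = 273 slots.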

> 
>> When the initial code was committed, our daddr_t was 32bit, I checked
>> the RELENG_4 sources. Current code uses int64_t for daddr_t. My
>> impression
>> right now is that we only utilize the low 32bits of daddr_t.
>>
>> Esp. interesting looks the following typedef:
>> typedef    uint32_t    u_daddr_t;    /* unsigned disk address */
>> which (correctly) means that typical mask (u_daddr_t)-1 is 0xffffffff.
>>
>> I wonder whether we could just use full 64bit and de-facto remove the
>> limitation on the swap partition size.

This will double the size of struct blmeta_t and cause 2*X memory usage
for every swap configuration.

> 
> I would rather argue first that the subr_blist code should not be using
> daddr_t at all.  The code is abusing daddr_t and defining u_daddr_t to
> represent things that are not disk addresses.  Instead, it should either
> define its own type or directly use (u)int*_t.  Then, as for choosing
> between 32 and 64 bits, I'm skeptical of using this structure for
> managing more than 32 bits worth of blocks, given the amount of RAM it
> will use.
> 
> 
> 
