RPI3 swap experiments (r338342 with vm.pageout_oom_seq="1024" and 6 GB swap)

Thu Sep 6 02:05:25 UTC 2018

[I've dropped Kirk McKusick from the recipients, as my notes are largely
off subject for the testing he asked about, which is specific to his changes.]

On 2018-Sep-5, at 5:38 PM, bob prohaska <fbsd at www.zefox.net> wrote:

> On Sat, Sep 01, 2018 at 04:02:33PM -0700, bob prohaska wrote:
>> 
>> With r338342  and
>> vm.pageout_oom_seq="1024"
>> in /boot/loader.conf the RPI3 is a bit closer to a Mars Rover.
>> No panics, crashes or USB errors, -j4 buildworld runs to completion.
>> When swap usage goes over about 50% the system slows, but doesn't give up.
>> There are six 1 GB swap partitions available, 3 on USB and 3 on microSD.
>> 
>> Log files are at
>> http://www.zefox.net/~fbsd/rpi3/swaptests/r338342/
>> for the combinations tried so far.
>> 
> 
> It looks as if using all six GB of swap doesn't cause any immediate problem,
> at least so long as swap usage stays relatively low, say 1.5 GB. In a final
> test, TRIM was turned on without catastrophe, though it had little to do
> given that all the busy filesystems were on USB. The penalty was about one
> hour extra (25 vs 24 hours) to run -j4 buildworld from a clean start.

Which UFS file systems with TRIM enabled were on /dev/mmcsd0* partitions?
Did you first use "fsck_ffs -E" on any of the file systems where
TRIM would work?

If I gather right, the "clean start" was on USB where TRIM during the
clean would not be available.

Might the extra swap space have contributed to the extra time? If I
understand right, having more swap uses more kernel memory for keeping
track of the swap. That leaves less for other things, which could
have consequences other than outright failure.

Quoting "man 8 loader" related to kern.maxswzone :

                  Note that swap metadata can be fragmented, which means that
                  the system can run out of space before it reaches the
                  theoretical limit.  Therefore, care should be taken to not
                  configure more swap than approximately half of the
                  theoretical maximum.

                  Running out of space for swap metadata can leave the system
                  in an unrecoverable state.

This wording suggests not allocating 6 GiBytes of swap when 3.5 GiBytes
is approximately half the theoretical maximum --even if the system does
still operate with 6 GiBytes.
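
As a rough sketch of the arithmetic (assuming the 3.5 GiByte figure really
is half of the theoretical maximum, so that the maximum is about 7 GiBytes;
the function name is just mine for illustration):

```python
# Assumption: 3.5 GiBytes is "approximately half the theoretical maximum",
# so the theoretical maximum is taken to be roughly 7 GiBytes.
GIB = 1024 ** 3


def swap_fraction_of_max(configured_bytes, half_of_max_bytes):
    """Return the configured swap as a fraction of the theoretical maximum."""
    theoretical_max = 2 * half_of_max_bytes
    return configured_bytes / theoretical_max


fraction = swap_fraction_of_max(6 * GIB, 3.5 * GIB)
print(f"6 GiBytes is about {fraction:.0%} of the theoretical maximum")
```

Under that assumption, 6 GiBytes sits well above the half-way mark that the
man page recommends staying near or below.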

(Note: The man page's reference to "eight times the amount of physical memory"
and such does not seem to apply to all platforms. An rpi2 V1.1 and an rpi3
with the same amount of RAM get rather different recommended figures
according to the messages generated.)

> One chance observation caught my attention, however. I'd always thought
> the VM system would favor fast swap devices over slow, but the gstat log
> recorded this, visible at
> http://www.zefox.net/~fbsd/rpi3/swaptests/r338342/3gbsd_3gbusb/trim_on/swapscript.log
> 
> 
> 
> dT: 10.004s  w: 10.000s
> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
>    3    175     91    673    4.0     84    701    4.0      0      0    0.0   24.4  mmcsd0
>    4    173     88    693  106.6     86    723  176.5      0      0    0.0  103.4  da0
>    1     58     30    224    4.5     28    220    4.1      0      0    0.0   14.5  mmcsd0s2b
>    3    175     91    673    4.0     84    701    4.0      0      0    0.0   24.7  mmcsd0s2
>    1     58     30    223    4.0     28    244    3.8      0      0    0.0   14.0  mmcsd0s2d
>    1     59     31    227    3.7     28    237    4.3      0      0    0.0   14.9  mmcsd0s2e
>    2     57     28    235  140.2     28    236  103.8      0      0    0.0  186.1  da0a
>    0     56     28    224  178.4     28    222   35.9      0      0    0.0  131.5  da0b
>    2     59     31    234    9.4     28    240   59.1      0      0    0.0   99.5  da0d
>    0      0      0      0    0.0      0      3  15011      0      0    0.0  150.1  da0e
>    0      1      0      0    0.0      1     22  13376      0      0    0.0  147.8  da0g

Are there any examples of "d/s kBps ms/d" being non-zero? If they are
always zero then likely no TRIMing happened. That in turn would make
TRIM an unlikely cause of the extra hour.
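
One way to check the whole log rather than eyeballing it might be something
like the following sketch. It assumes the 13-column gstat layout shown in the
quoted sample (delete columns in positions 9 through 11, Name last); the
function name and the SAMPLE list are mine, with SAMPLE copied from the
quoted output:

```python
SAMPLE = [
    "L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name",
    "   3    175     91    673    4.0     84    701    4.0      0      0    0.0   24.4  mmcsd0",
    "   4    173     88    693  106.6     86    723  176.5      0      0    0.0  103.4  da0",
]


def rows_with_deletes(lines):
    """Return (name, d/s, kBps, ms/d) for rows with any non-zero delete column."""
    hits = []
    for line in lines:
        # Tolerate mail-style "> " quoting in front of each line.
        fields = line.lstrip("> ").split()
        if len(fields) != 13:
            continue  # not a data row in the assumed layout
        try:
            d_per_s, d_kbps, ms_per_d = (float(f) for f in fields[8:11])
        except ValueError:
            continue  # column-header line
        if d_per_s or d_kbps or ms_per_d:
            hits.append((fields[12], d_per_s, d_kbps, ms_per_d))
    return hits


print(rows_with_deletes(SAMPLE))
```

Feeding it every line of swapscript.log (for example via open(...) on a local
copy) would list any intervals where deletes were actually issued. An empty
result would suggest TRIM never did anything during the run.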

> Tue Sep  4 15:07:39 PDT 2018
> Device          1K-blocks     Used    Avail Capacity
> /dev/da0b         1048576   236872   811704    23%
> /dev/mmcsd0s2b    1048576   221568   827008    21%
> /dev/da0d         1048576   218636   829940    21%
> /dev/da0a         1048576   222028   826548    21%
> /dev/mmcsd0s2d    1048576   221660   826916    21%
> /dev/mmcsd0s2e    1048576   221392   827184    21%
> Total             6291456  1342156  4949300    21%

As I understand it, the normal use of multiple swap partitions
is to split the load across channels that can operate
independently in parallel. Having 3 such partitions on
the same channel/device may only add overhead vs. one
full-size partition per channel/device.

I also do not know if mmcsd0 and da0 can have independent,
parallel I/O activity in the rpi3 context.
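
To look at the split per device rather than per partition, the swapinfo
figures can be summed by underlying device. This is only a sketch: the
regular expression that maps a partition name to its device (leading
letters plus the first number) is my assumption and fits the da0* and
mmcsd0* names here, not device naming in general.

```python
import re
from collections import defaultdict

# Copied from the quoted swapinfo output.
SWAPINFO = """\
/dev/da0b         1048576   236872   811704    23%
/dev/mmcsd0s2b    1048576   221568   827008    21%
/dev/da0d         1048576   218636   829940    21%
/dev/da0a         1048576   222028   826548    21%
/dev/mmcsd0s2d    1048576   221660   826916    21%
/dev/mmcsd0s2e    1048576   221392   827184    21%
Total             6291456  1342156  4949300    21%
"""


def used_by_device(text):
    """Sum the Used column (1K-blocks) per underlying device."""
    totals = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        # Assumed naming: /dev/<letters><unit>..., e.g. da0b -> da0.
        match = re.match(r"/dev/([a-z]+\d+)", fields[0]) if fields else None
        if match is None:
            continue  # skips the "Total" line and anything unexpected
        totals[match.group(1)] += int(fields[2])
    return dict(totals)


for device, used in sorted(used_by_device(SWAPINFO).items()):
    print(device, used)
```

For the figures quoted above this gives 677536 for da0 and 664620 for
mmcsd0, so the slow and fast devices are each carrying about half of the
swapped-out data.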

> Sep  4 14:57:52 www sshd[41673]: error: Received disconnect from 103.207.39.197 port 64499:3: com.jcraft.jsch.JSchException: Auth cancel [preauth]
> Sep  4 15:04:19 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2217840, size: 12288

Note: my context is very different from yours and I get no console
messages about I/O or waits during buildworld buildkernel or other
such build/install tests.

> The system has lots of fast swap available on microSD, but is seemingly choking 
> trying to use the slow swap on da0 _and_ run traffic to /usr and /var. Buildworld
> doesn't run any faster with less swap, so I don't think the oversupply is the problem.

If I understand right, your only 6 GiByte swap experiment was slower,
but you attributed all of the time variation to TRIM being enabled
(was it ever actually active or used?). You might want to vary the two
(swap size and TRIM) separately. For all I know something else may also
have contributed.

I've no clue if having so many swap partitions on the same channel/device
has consequences that having only one per channel/device would avoid.

> Is this expected behavior?  

As I understand it, an approximately even split across the in-use swap
partitions is the normal way things are distributed. It is the placement
of the partitions themselves that determines how effective that
split is at improving the swap/paging I/O, if I understand right.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)