RPI3 swap experiments ["was killed: out of swap space" with: "v_free_count: 5439, v_inactive_count: 1"]

Warner Losh imp at bsdimp.com
Mon Aug 20 14:59:43 UTC 2018


On Mon, Aug 20, 2018 at 8:00 AM, Mark Millard via freebsd-arm <freebsd-arm at freebsd.org> wrote:

> There is a way to explore Mark Johnston's swap information
> reports (from his patches for such).
>
> Taking a Pine64+ 2GB as an example (4 cores with 1
> HW-thread per core, 2 GiBytes of RAM, USB device for
> root file system and swap partition):
>
> In one login:
> # nice -20 gstat -pd
>
> In another login:
> # stress -d 2 -m 4 --vm-keep --vm-bytes 536870912
>
> That "4" and "536870912" total to the 2 GiBytes so
> swapping is induced for the context in question.
> (Scale --vm-bytes appropriately to context.)
>
> [stress is from sysutils/stress .]
>
> gstat tends to show things such as:
>
> dT: 1.006s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
>     0      0      0      0    0.0      0      0    0.0      0      0    0.0    0.0| mmcsd0
>    56    312      0      0    0.0    312  19985  142.6      0      0    0.0   99.6| da0
>
> where the ms/w and kBps are fairly stable but the length of
> the queue (L(q)) varies widely. For the above, that suggests
> something like 56 writes queued * 142.6 ms/write (mean)
> [as an estimate], a rather large total time for the last
> of the queued writes to complete. (If I understand how
> to interpret the above.)
>

No. 142.6 ms/write is the average time that the operations which completed
during the polling interval took to complete. There's no estimating here.

So, at 6 or 7 operations completing per second, coupled with a parallelism
factor of 1 (typical for low-end junk flash), we wind up with the 56
operations in the queue taking 8-10s to complete.
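
A quick sanity check of that figure (this is just the arithmetic implied
above, run through bc(1) on the numbers from the gstat sample; nothing
here is an additional measurement):

# echo "scale=1; 1000 / 142.6" | bc
7.0
# echo "scale=1; 56 / 7" | bc
8.0

That is, roughly 7 writes complete per second, so a queue of 56 takes on
the order of 8 seconds to drain when only one operation is in flight at
a time.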


> It appears to me that, compared to a observed capacity of
> roughly around 20 MiBytes/sec for writes, large amounts of
> bytes are being queued up to be written in a short time,
> for which it just takes a while for the backlog to be
> finished.
>

Yes. That matches my expectation as well. On other devices, I've found that
I needed to rate-limit things to more like 50-75% of the maximum value to
keep variance in performance low. It's the whole reason I wrote the CAM I/O
scheduler.
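
To put numbers on that for this device: 50-75% of the observed ~20
MiBytes/sec ceiling works out to roughly 10-15 MiBytes/sec of sustained
writes. A minimal sketch of where the knobs live, assuming a kernel built
with the dynamic CAM I/O scheduler (the per-device sysctl leaf names vary
between FreeBSD versions, so treat the node below as something to inspect
rather than a literal recipe). In the kernel config:

  options CAM_IOSCHED_DYNAMIC

and then, on the running system, list the per-device knobs it exposes for
da0:

# sysctl kern.cam.da.0.iosched

The write-rate limiting discussed above is configured through that
per-device iosched sysctl tree once the option is compiled in.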


> The following is from multiple such runs, several stopped
> manually but some killed because of sustained low free memory.
> I had left vm.pageout_oom_seq=12 in place for this, making the
> kills easier to get than the 120 figure would. It generally
> does not take very long for some sort of message to show up.
>
>
>
> waited 9s for async swap write
> waited 9s for swap buffer
> waited 9s for async swap write
> waited 9s for async swap write
> waited 9s for async swap write
> v_free_count: 1357, v_inactive_count: 1
> Aug 20 06:04:27 pine64 kernel: pid 1010 (stress), uid 0, was killed: out
> of swap space
> waited 5s for async swap write
> waited 5s for swap buffer
> waited 5s for async swap write
> waited 5s for async swap write
> waited 5s for async swap write
> waited 13s for async swap write
> waited 12s for swap buffer
> waited 13s for async swap write
> waited 12s for async swap write
> waited 12s for async swap write
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 161177, size: 131072
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 164766, size: 65536
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 164064, size: 12288
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 161630, size: 131072
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 161177, size: 131072
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 164766, size: 65536
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 161630, size: 131072
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 164064, size: 12288
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 161177, size: 131072
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 164766, size: 65536
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 161630, size: 131072
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 164064, size: 12288
> waited 65s for async swap write
> waited 65s for swap buffer
> waited 65s for async swap write
> waited 65s for async swap write
> waited 65s for async swap write
> v_free_count: 955, v_inactive_count: 1
> Aug 20 06:11:49 pine64 kernel: pid 1047 (stress), uid 0, was killed: out
> of swap space
> waited 5s for async swap write
> waited 5s for swap buffer
> waited 5s for async swap write
> waited 5s for async swap write
> waited 5s for async swap write
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 314021, size: 12288
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 314084, size: 32768
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 314856, size: 32768
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 314638, size: 131072
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312518, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312416, size: 16384
> waited 39s for async swap write
> waited 39s for swap buffer
> waited 39s for async swap write
> waited 39s for async swap write
> waited 39s for async swap write
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 314802, size: 24576
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 314934, size: 40960
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 315617, size: 32768
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 312484, size: 4096
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 328616, size: 131072
> waited 31s for async swap write
> waited 31s for swap buffer
> waited 31s for async swap write
> waited 31s for async swap write
> waited 31s for async swap write
> waited 5s for async swap write
> waited 5s for swap buffer
> waited 5s for async swap write
> waited 5s for async swap write
> waited 5s for async swap write


These numbers are consistent with the theory that the swap device becomes
overwhelmed, spiking latency and causing crappy downstream effects. You
can use the I/O scheduler to limit the write rates at the low end. You
might also be able to schedule a lower write queue depth at the top end as
well, but I've not seen good ways to do that.
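
As a side note for anyone reproducing this: the kill behaviour in the
quoted log is influenced by the vm.pageout_oom_seq sysctl mentioned above,
which is roughly the number of low-memory passes the pagedaemon makes
without progress before it starts killing processes. Raising it to the 120
figure contrasted in the quoted text (shown here only as an example, not a
recommendation) makes the "killed: out of swap space" messages harder to
trigger while the slow swap device catches up:

# sysctl vm.pageout_oom_seq=120

The same setting can be made persistent via /etc/sysctl.conf.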

Warner

