Belated out of swap kill on rpi3 at r359216
Mark Millard
marklmi at yahoo.com
Wed Mar 25 08:19:05 UTC 2020
On 2020-Mar-24, at 18:56, bob prohaska <fbsd at www.zefox.net> wrote:
> [this thread needs a haircut, new comments start near line 288]
> On Tue, Mar 24, 2020 at 05:21:47PM -0700, Mark Millard wrote:
>> On 2020-Mar-24, at 15:46, bob prohaska <fbsd at www.zefox.net> wrote:
>>
>>> . . .
>>
>> . . .
>>
>>
>> The text before "Mar 24 04:20:50" messages, with notes
>> and blank lines mixed in to group things:
>>
>> . . .
>> Tue Mar 24 04:20:38 PDT 2020
Note the time above.
>> Device 1K-blocks Used Avail Capacity
>> /dev/mmcsd0s2b 4404252 403784 4000468 9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
>> 3/773/776/19177 mbuf clusters in use (current/cache/total/max)
>>
>> procs memory page disks faults cpu
>> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
>> 0 0 30 2034696 54180 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
>> dT: 1.031s w: 1.000s
>> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
>> 18 1 0 0 0.0 1 31 26686 0 0 0.0 287.1 mmcsd0
>> 18 1 0 0 0.0 1 31 26686 0 0 0.0 287.1 mmcsd0s2
>> 10 1 0 0 0.0 1 31 26686 0 0 0.0 287.1 mmcsd0s2a
>> 10 1 0 0 0.0 1 31 26686 0 0 0.0 287.1 ufs/rootfs
>> Tue Mar 24 04:20:40 PDT 2020
>> Device 1K-blocks Used Avail Capacity
>> /dev/mmcsd0s2b 4404252 403784 4000468 9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
>> 6/770/776/19177 mbuf clusters in use (current/cache/total/max)
>>
>> Note that the above has 26sec+ ms/w figures. No reads.
>> No mmcsd0s2b.
>>
>>
>> procs memory page disks faults cpu
>> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
>> 0 0 30 2034696 54252 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
>> dT: 1.064s w: 1.000s
>> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
>> Tue Mar 24 04:20:42 PDT 2020
>> Device 1K-blocks Used Avail Capacity
>> /dev/mmcsd0s2b 4404252 403784 4000468 9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
>> 9/1023/1032/19177 mbuf clusters in use (current/cache/total/max)
>>
>> For the above, note the lack of any lines with ms/r or ms/w figures,
>> just the title rows show.
>>
>
> Same thing happens when gstat -a is run on an idle machine.
> I thought it normal, the CPU had all the data it needed.
>
>
>
>>
>> procs memory page disks faults cpu
>> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
>> 0 0 30 2034696 53668 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
>> dT: 1.046s w: 1.000s
>> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
>> 17 1 0 0 0.0 1 31 30042 0 0 0.0 320.7 mmcsd0
>> 17 1 0 0 0.0 1 31 30042 0 0 0.0 320.7 mmcsd0s2
>> 9 1 0 0 0.0 1 31 30042 0 0 0.0 320.7 mmcsd0s2a
>> 9 1 0 0 0.0 1 31 30042 0 0 0.0 320.7 ufs/rootfs
>> Tue Mar 24 04:20:43 PDT 2020
Note the time above compared to the earlier one that
I referenced: around 5sec later (== 43sec-38sec),
despite the ms/w figures involved (over 26sec/w and
over 30sec/w with a nothing-listed in the middle).
>> Device 1K-blocks Used Avail Capacity
>> /dev/mmcsd0s2b 4404252 403784 4000468 9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
>> 9/1023/1032/19177 mbuf clusters in use (current/cache/total/max)
>>
>> Note that the above has 30sec+ ms/w figures. No reads.
>>
>> Also note that the L(q) figures dropped by only 1 over
>> the about 04:20:40 to 04:20:43 interval (18->17 and
>> 10->9). (I only take these deltas as suggestive. They
>> might be accidental near matches.) No mmcsd0s2b.
>>
>>
>> procs memory page disks faults cpu
>> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
>> 0 0 30 2034696 53624 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
>> dT: 1.003s w: 1.000s
>> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
>> Tue Mar 24 04:20:45 PDT 2020
>> Device 1K-blocks Used Avail Capacity
>> /dev/mmcsd0s2b 4404252 403784 4000468 9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
>> 12/1020/1032/19177 mbuf clusters in use (current/cache/total/max)
>>
>> For the above, note the lack of any lines with ms/r or ms/w figures,
>> just the title rows show.
>>
>
> It's tempting, but maybe not correct, the think the I/O caught up.
On this side of the 30sec+ example, things are less clear
than the above note, so maybe. (Same below.)
>>
>> procs memory page disks faults cpu
>> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
>> 0 0 30 2034696 53600 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
>> dT: 1.063s w: 1.000s
>> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
>> Tue Mar 24 04:20:47 PDT 2020
>> Device 1K-blocks Used Avail Capacity
>> /dev/mmcsd0s2b 4404252 403784 4000468 9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
>> 12/1020/1032/19177 mbuf clusters in use (current/cache/total/max)
>>
>> For the above, note the lack of any lines with ms/r or ms/w figures,
>> just the title rows show.
>>
>
> Can't that be interpreted as gstat -a having nothing to report?
> Just _why_ it might have nothing to report never crossed my mind....
>
>> procs memory page disks faults cpu
>> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
>> 0 0 30 2034696 53592 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
>> dT: 1.006s w: 1.000s
>> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
>> Tue Mar 24 04:20:48 PDT 2020
>> Device 1K-blocks Used Avail Capacity
>> /dev/mmcsd0s2b 4404252 403784 4000468 9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
>> 15/1017/1032/19177 mbuf clusters in use (current/cache/total/max)
>>
>> For the above, note the lack of any lines with ms/r or ms/w figures,
>> just the title rows show.
>>
>>
>>
>> procs memory page disks faults cpu
>> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
>> 0 0 30 2034696 53672 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
>> dT: 1.002s w: 1.000s
>> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
>> Tue Mar 24 04:20:50 PDT 2020
>> Device 1K-blocks Used Avail Capacity
>> /dev/mmcsd0s2b 4404252 403784 4000468 9%
>> Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
>> Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
>> 15/1017/1032/19177 mbuf clusters in use (current/cache/total/max)
>>
>> For the above, note the list of any lines with ms/r or ms/w figures,
>> just the title rows.
>>
>>
>>
>> procs memory page disks faults cpu
>> r b w avm fre flt re pi po fr sr mm0 da0 in sy cs us sy id
>> 2 0 30 2045028 54160 12942 7 2 4 13432 1374 0 0 14074 8102 2849 65 29 5
>> dT: 1.001s w: 1.000s
>> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d %busy Name
>> 2 329 236 3839 13.0 93 1262 18.6 0 0 0.0 99.4 mmcsd0
>> 2 329 236 3839 13.2 93 1262 18.7 0 0 0.0 101.1 mmcsd0s2
>> 1 153 152 2816 12.5 1 32 366.5 0 0 0.0 95.2 mmcsd0s2a
>> 1 176 84 1023 14.5 92 1230 14.9 0 0 0.0 66.8 mmcsd0s2b
>> 1 153 152 2816 12.5 1 32 366.6 0 0 0.0 95.3 ufs/rootfs
>> Tue Mar 24 04:20:51 PDT 2020
>> Device 1K-blocks Used Avail Capacity
>> /dev/mmcsd0s2b 4404252 296976 4107276 7%
>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
>> 0/1032/1032/19177 mbuf clusters in use (current/cache/total/max)
>>
>> For the above, the lines with ms/r and ms/w figures are back
>> but the indefinite wait buffer for "Mar 24 04:20:50" happened
>> too, before the "Tue Mar 24 04:20:51 PDT 2020" above. I'd guess
>> they are from before the data that the gstat output is based on.
>>
>> Also there are lots of reads and some writes in the above
>> gstat output. mmcsd0s2b is again showing as well.
>>
>> Note: It looks like ufs/rootfs might be a label
>> identifying mmcsd0s2a, so they track up to slight
>> time-of-information differences. I'm guessing that
>> mmcsd0s2b is just a swap/paging partition.
>>
>
> That's correct.
>
>> So it looks like activity for ufs/rootfs may have
>> blocked/delayed activity for paging/swaping
>> (mmcsd0s2b), at least for a time, even if it is
>> not directly the cause of the specific OOM
>> activity.
>>
>
> How would one distinguish slow swap i/o from slow
> filesystem i/o ?
The 26sec+ and 30sec+ examples do not list mmcsd0s2b
but do list mmcsd0s2a (and ufs/rootfs). Presumably
that means no reas/writes to mmcsd0s2b in the time
frame spanned. That is what I was referring to. I
was not comparing against mmcsd0s2b I/O activity.
> Would moving swap to a USB device avoid blocking
> by writes to microSD? That's not hard to try, but
> it puts a whole new kettle of fish on the fire.
Presuming the USB device is well performing,
splitting the load across buses might well
help. A good USB SSD might handle both
activities better than involving the microsd
card at all(?).
It is possible to have only /boot/ materials
on ufs on the microsd card and so to have the
(ufs) root file system on the USB.
For example, in the Pine64+2GB's /boot/loader.conf
on the microsd card I have:
vfs.root.mountfrom="ufs:/dev/gpt/PINE642Groot"
That label identifies the USB SSD's ufs partition
holding the root file system. On that file system
is an empty directory /microsd_ufs used as a mount
point to get to the microsd card's ufs-based /boot/ .
# gpart show -p
=> 63 249737153 mmcsd0 MBR (119G)
63 16380 - free - (8.0M)
16443 131040 mmcsd0s1 fat32lba [active] (64M)
147483 997 - free - (499K)
148480 241172480 mmcsd0s2 freebsd (115G)
241320960 8416256 - free - (4.0G)
=> 0 241172480 mmcsd0s2 BSD (115G)
0 230686720 mmcsd0s2a freebsd-ufs (110G)
230686720 10485760 - free - (5.0G)
=> 40 468862048 da0 GPT (224G)
40 2008 - free - (1.0M)
2048 413138944 da0p1 freebsd-ufs (197G)
413140992 6291456 da0p2 freebsd-swap (3.0G)
419432448 6291456 da0p4 freebsd-swap (3.0G)
425723904 43138184 - free - (21G)
# gpart show -pl
=> 63 249737153 mmcsd0 MBR (119G)
63 16380 - free - (8.0M)
16443 131040 mmcsd0s1 (null) [active] (64M)
147483 997 - free - (499K)
148480 241172480 mmcsd0s2 (null) (115G)
241320960 8416256 - free - (4.0G)
=> 0 241172480 mmcsd0s2 BSD (115G)
0 230686720 mmcsd0s2a (null) (110G)
230686720 10485760 - free - (5.0G)
=> 40 468862048 da0 GPT (224G)
40 2008 - free - (1.0M)
2048 413138944 da0p1 PINE642Groot (197G)
413140992 6291456 da0p2 PINE642Gswap (3.0G)
419432448 6291456 da0p4 PINE642Gswp2 (3.0G)
425723904 43138184 - free - (21G)
After booting it looks like:
# df -m
Filesystem 1M-blocks Used Avail Capacity Mounted on
/dev/gpt/PINE642Groot 195378 34613 145135 19% /
devfs 0 0 0 100% /dev
/dev/label/PINE64P2Groot 109101 75 100297 0% /microsd_ufs
/dev/label/PINE642GAboot 63 43 20 69% /boot/efi
(The Pine64+2GB uses a /boot/efi/ instead of a /boot/msdos/ .)
# swapinfo
Device 1K-blocks Used Avail Capacity
/dev/gpt/PINE642Gswap 3145728 0 3145728 0%
/dev/gpt/PINE642Gswp2 3145728 0 3145728 0%
Total 6291456 0 6291456 0%
(I'm not suggesting that having 2 swap partitions
is important, it just is what I happen to have.)
# dumpon -l
gpt/PINE642Gswp2,gpt/PINE642Gswp2
(The duplication is a result of something I
report later.)
I also have in the USB SSD's ufs file system's
/etc/fstab :
/dev/label/PINE64P2Groot /microsd_ufs ufs rw,noatime 1 1
(Again, I used labeling, in this case glabel
based on the microsd card. Having a /boot/ is
a subset of having a full root file system and
I still used "root" terminology in the label.)
I keep a /boot/ on the USB SSD and update the
microsd card's copy from the USB copy:
# more /boot/copy_boot_to_microsd.sh
rsync -axHh --info=progress2 --exclude=/boot/entropy --delete -r /boot /microsd_ufs/
(I do not have /boot/entropy on the USB SSD,
just on the microsd card at its /boot/entropy .
The exclude prevents the --delete from touching
the /boot/entropy file on the microsd card.)
I also have in /etc/rc.conf (so: on the USB SSD):
entropy_boot_file="/microsd_ufs/boot/entropy"
For reference:
# ls -ldT /microsd_ufs/*
-r--r--r-- 1 root wheel 6170 Feb 1 04:48:34 2020 /microsd_ufs/COPYRIGHT
drwxr-xr-x 11 root wheel 1024 Mar 24 19:06:27 2020 /microsd_ufs/boot
(The COPYRIGHT file is, by content, the normal one
that is on a root filesystem.)
# ls -ldT /microsd_ufs/*/*
-r--r--r-- 1 root wheel 3547 Mar 14 15:05:54 2020 /microsd_ufs/boot/beastie.4th
. . .
-rw------- 1 root wheel 4096 Dec 31 16:00:25 2009 /microsd_ufs/boot/entropy
. . .
drwxr-xr-x 2 root wheel 512 Dec 23 22:17:51 2019 /microsd_ufs/boot/zfs
(The date/time for /microsd_ufs/boot/entropy and /entropy
are from before time is adjusted via ntp, so the dates
can be odd.)
I also use:
dumpdev="/dev/gpt/PINE642Gswp2"
in /boot/loader.conf (both media) and in /etc/rc.conf .
(This leads to the duplication noted earlier.)
The same media-pair (microsd card plus USB SSD) can
boot the RPi3. For the Pine64+2GB and RPi3, these days
I use a small powered USB hub instead of directly
plugging in the USB SSD to the arm-based board. I
had power problems otherwise, at least on the
Pine64+2GB in recent times.
Technically, I use a USB3 hub and a USB3 SSD, both of
which happen to allow USB2 use as well.
The OPi+2e (armv7) and Rock64 (aarch64) are set up
similarly. In all cases the USB SSDs used happen to
perform better than the microsd cards used. (On the
Rock64, the USB SSD also out performs the e.MMC that
is in use --and the e.MMC outperforms the microsd
card.)
(The Rock64 USB3 is not operational yet. As stands,
it too is a USB2 based context in my use.)
I do not have other external USB devices involved.
Mulitple external USB devices may share an internal
hub on some arm boards, leading to a lack of
multi-channel performance gains for multiple drives.
> It should be added that the original buildworld has been restarted with
> vm.pfault_oom_attempts: -1 and so far, though it still generates the
> "indefinite wait" warnings, the ms/w times look better. It's not done yet,
> though. How much overhead comes with OOMA?
I doubt the ms/w figures span execution of the code
that deals with OOM kill decisions or retries logic.
But vm.pfault_oom_attempts==-1 would avoid any
retries, if I understand right. That in turn might
put less of a load on the microsd card? (Guess
work on my part.)
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
More information about the freebsd-arm
mailing list