FYI: various 11.0-CURRENT -r293227 (and older) hangs on arm (rpi2): a description of sorts

Fri Jan 8 05:49:56 UTC 2016

Top post of a major conclusion:

I have isolated a working vs. failing context for the hangs issue:
(Note that everything for world and ports and such is on the SSD root partition in my examples.)

A) Using a swap file on the root partition as the swap space leads to hangs when the space gets sufficient activity

vs.

B) Using a swap partition as the swap space works without hangs

It is the same SSD as before both ways. (I had to dump, repartition, restore since I'd not provided space for a swap partition earlier.)

A swap partition on the sdcard as the swap space also works.

So it appears there is a problem with using swapfiles, at least when they sit on file systems that are sometimes busy with other activity, but possibly more generally. (Since the SSD is attached to the RPI2 through a USB interface, swapfiles get no more trim support than a swap partition would.)

The SSD is likely noticeably faster in various respects and so may be more of a challenge for swapfile handling in some way (via extra file-system/IO/resource load with less time between various activities), at least on rpi2's.
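
For anyone reproducing the two cases, here is a sketch of what the corresponding /etc/fstab swap entries might look like. The swap file path /usr/swap0 and the md unit number 99 are assumptions for illustration, not taken from the configuration used here (the thread below refers to md0). The gpt/RPI2swap label is the one shown in the swapinfo output further down. Only one of the two entries would be active at a time.

```
# Case A: file-backed swap through an md(4) device (the variant that hangs).
# /usr/swap0 and unit 99 are placeholders, not the actual setup.
md99	none	swap	sw,file=/usr/swap0,late	0	0

# Case B: dedicated swap partition referenced by GPT label (no hangs seen).
/dev/gpt/RPI2swap	none	swap	sw	0	0
```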

I now have for the SSD context:

$ df -m
Filesystem          1M-blocks  Used  Avail Capacity  Mounted on
/dev/ufs/RPI2rootfs    440365 11179 393957     3%    /
devfs                       0     0      0   100%    /dev
/dev/mmcsd0s1              49     7     42    15%    /boot/msdos

$ swapinfo
Device          1K-blocks     Used    Avail Capacity
/dev/gpt/RPI2swap   3282756   905876  2376880    28%
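
As a cross-check of the Capacity column, here is a small sh/awk sketch that recomputes usage from swapinfo-style output. The sample text is copied from the output above; the function name swap_usage is made up for illustration, and on a live system the input would come from swapinfo itself.

```shell
#!/bin/sh
# Sample copied from the swapinfo output above (header plus one device line).
swapinfo_sample='Device          1K-blocks     Used    Avail Capacity
/dev/gpt/RPI2swap   3282756   905876  2376880    28%'

# swap_usage (hypothetical helper): read swapinfo-style text on stdin and
# print device, used MiB (truncated) and used percent for each device line.
swap_usage() {
    awk 'NR > 1 && $2 > 0 {
        printf "%s %d MiB used (%.1f%%)\n", $1, int($3 / 1024), 100 * $3 / $2
    }'
}

printf '%s\n' "$swapinfo_sample" | swap_usage
```

On the real system the last line would be replaced by "swapinfo | swap_usage".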

===
Mark Millard
markmi at dsl-only.net

On 2016-Jan-7, at 3:24 PM, Mark Millard <markmi at dsl-only.net> wrote:
> 
> 
> On 2016-Jan-7, at 2:28 PM, Warner Losh <imp at bsdimp.com> wrote:
>> 
>> 4 page requests shouldn't hang the whole system. That should be more like hundreds or thousands depending on the tuning you've done.
>> 
>> Warner
>> 
> 
> FYI: I do not remember doing any explicit tuning. Other than having an SSD for the root file system (via fstab content) and using cortex-a7 related compile options, things are default, with ssh and little else enabled, as I remember. I'm even currently running KERNCONF=RPI2 instead of my RPI2-NODBG variant.
> 
> For my note about L(q)==4 for md0: "SWAP/swap/md0" showed 0. The only "name" showing a non-zero value was "md0" --and only for L(q).
> 
> 
> 
> It does look like the latest hang finally produced some messages: 3 copies of
> 
> smsc0: warning: failed to create new mbuf
> 
> but these messages do not normally appear.
> 
> 
> 
>> On Thu, Jan 7, 2016 at 3:16 PM, Mark Millard <markmi at dsl-only.net> wrote:
>> I'm top posting this change of information about the hang status seen via gstat:
>> 
>> After a long time the gstat -cod is showing a non-zero value in one place:
>> 
>> L(q) for md0 is showing 4 now.
>> 
>> (I've no clue when it changed. I do not expect that I missed the 4 before.)
>> 
>> md0 is for the file-system based page file. That file is on the SSD, not the sdcard.
>> 
>> 
>> ===
>> Mark Millard
>> markmi at dsl-only.net
>> 
>> On 2016-Jan-7, at 2:04 PM, Mark Millard <markmi at dsl-only.net> wrote:
>> 
>>> 
>>> On 2016-Jan-7, at 1:31 PM, Hans Petter Selasky <hps at selasky.org> wrote:
>>>> 
>>>> On 01/07/16 22:26, Hans Petter Selasky wrote:
>>>>> On 01/07/16 21:20, Mark Millard wrote:
>>>>>> 
>>>>>> On 2016-Jan-7, at 12:04 PM, Hans Petter Selasky <hps at selasky.org>
>>>>>> wrote:
>>>>>>> 
>>>>>>> On 01/07/16 20:48, Ian Lepore wrote:
>>>>>>>> If the filesystems and swap space are on a usb drive, then maybe it's
>>>>>>>> the usb subsystem that's hanging.  The wait states you showed for those
>>>>>>>> processes are consistent with what I've seen when all buffers get
>>>>>>>> backed up in a queue on one non-responsive or slow device.  It may be
>>>>>>>> that there's a way to get the system deadlocked when it's low on
>>>>>>>> buffers and there is memory pressure causing the swap to be used (I
>>>>>>>> generally run arm systems without any swap configured).
>>>>>>>> 
>>>>>>>> Running gstat in another window while this is going on may give you
>>>>>>>> some insight into the situation.  Beyond that I don't know what to look
>>>>>>>> at, especially since you generally can't launch any new tools once the
>>>>>>>> system gets into this kind of state.
>>>>>>>> 
>>>>>>>> -- Ian
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> All USB transfers towards disk devices have timeouts, so if something
>>>>>>> is hanging at USB level, you'll get a printout eventually.
>>>>>> 
>>>>>> How long after the deadlock/live-lock appears to have started does
>>>>>> one have to wait before concluding that any such timeouts would
>>>>>> already have happened, and so that they do not explain the
>>>>>> deadlock/live-lock?
>>>>>> 
>>>>>>> The USB kernel processes needed for doing I/O transfers are not
>>>>>>> pinned to RAM. Can it happen if a USB process is swapped to disk,
>>>>>>> that the system cannot wakeup a swapped out process to get more swap?
>>>>>>> 
>>>>>>> --HPS
>>>>>> 
>>>>> 
>>>>> Hi,
>>>>> 
>>>>>> Wow. Could I use ddb to somehow check on the "USB kernel processes"
>>>>>> swap status when the overall context is deadlocked/live-locked?
>>>>> 
>>>>> Are you able to run something like:
>>>>> 
>>>>> ps auxwwH | grep usb
>>>>> 
>>>>>> If yes, how? Otherwise something in top or some such display that I'd
>>>>>> left running over the serial console would have to present useful
>>>>>> information on the subject. Is there anything that would?
>>>>> 
>>>> 
>>>> Are you able to SSH into the box or ping it?
>>>> 
>>>> --HPS
>>> 
>>> Once the live-lock condition is reached no new processes can be created as far as I can tell: the attempt will hang any process that attempts the creation.
>>> 
>>> I'd need "ps auxwwH" to be internally repeating to even get that much: I'd have to start it before the live-lock happened and it would have to be still running when the hang occurs, no on-going process creations involved.
>>> 
>>> I'm not so sure that two communicating processes (ps and grep over a pipe) would work but I can not get to even one new process so far.
>>> 
>>> ssh sessions also hang: input and output stop for them fairly generally. (Sometimes the context is such that ^t still works but shows no progress in what it reports.) No new ssh connections are possible: "Operation timed out".
>>> 
>>> ping does respond normally: it is more of a live-lock status than a true deadlock one overall.
>>> 
>>> The serial console still outputs what it was already running if that process does nothing that locks up. Changing what it is doing generally locks it up too.
>>> 
>>> Doing something like unplugging a usb keyboard or mouse or plugging one in does show the expected messages via the console: it is more of a live-lock status than a true deadlock one overall.
>>> 
>>> I can get to ddb after the hang. But I do not know what I'd do with it to find any useful information.
>>> 
>>> 
>>> As noted in another message: I used gstat instead of top on the serial console:
>>> 
>>>> gstat shows everything zero during a hang, even L(q) column. (Length of queue?)
>>>> 
>>>> I used:
>>>> 
>>>> gstat -cod
>>>> 
>>>> and had it running over the serial console port during the attempted portmaster activity.
>>> 
>>> 
>> ===
>> Mark Millard
>> markmi at dsl-only.net
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> freebsd-arm at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-arm
>> To unsubscribe, send any mail to "freebsd-arm-unsubscribe at freebsd.org"
>> 
> 