[Bug 224292] processes are hanging in state ufs / possible deadlock in file system

Mon Feb 8 06:45:52 UTC 2021

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224292

sigsys at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sigsys at gmail.com

--- Comment #15 from sigsys at gmail.com ---
I'm getting occasional UFS "hangs" of sorts on a -CURRENT VM too. This is
probably unrelated to the problem in the original bug report, but I figured
I'd dump it here in case it helps with figuring out more recent problems.

The last time it happened and I saved some info was a little while ago
(FreeBSD 13.0-CURRENT #32 main-c529869-g4f597837d531: Tue Jan 12 14:41:03 EST
2021). I'll update and try to get more info if it happens again.

`pkg upgrade` and running kyua tests are what usually trigger it, though it
happens only rarely.

load: 1.71  cmd: pkg 11799 [biowr] 331.89r 20.33u 66.48s 52% 764496k
mi_switch+0x155 sleepq_switch+0x109 _sleep+0x2b4 bufwait+0xc4 bufwrite+0x25a
ffs_update+0x2ed ffs_syncvnode+0x4da ffs_fsync+0x1f softdep_prerename+0x21a
ufs_rename+0x3ee VOP_RENAME_APV+0x40 kern_renameat+0x3fd amd64_syscall+0x149
fast_syscall_common+0xf8 

root at vm2:[~] # procstat -kk 11799
  PID    TID COMM                TDNAME              KSTACK                     
11799 100255 pkg                 -                   __mtx_lock_sleep+0xe8
__mtx_lock_flags+0xe5 process_worklist_item+0x63 softdep_prerename+0x4bd
ufs_rename+0x3ee VOP_RENAME_APV+0x40 kern_renameat+0x3fd amd64_syscall+0x149
fast_syscall_common+0xf8 

root at vm2:[~] # procstat -kk 11799
  PID    TID COMM                TDNAME              KSTACK                     
11799 100255 pkg                 -                   mi_switch+0x155
sleepq_switch+0x109 _sleep+0x2b4 bufwait+0xc4 bufwrite+0x25a ffs_update+0x2ed
ffs_syncvnode+0x4da ffs_fsync+0x1f softdep_prerename+0x21a ufs_rename+0x3ee
VOP_RENAME_APV+0x40 kern_renameat+0x3fd amd64_syscall+0x149
fast_syscall_common+0xf8 

root at vm2:[~] # ps -lp 11799
UID   PID  PPID C PRI NI    VSZ    RSS MWCHAN STAT TT     TIME COMMAND
  0 11799 11798 4  52  0 828024 764512 -      R+    1  2:38.77 pkg upgrade

root at vm2:[~] # iostat -w 1 -d
           vtbd0            vtbd1            vtbd2 
 KB/t  tps  MB/s  KB/t  tps  MB/s  KB/t  tps  MB/s 
 24.4   10   0.2  33.3    2   0.1  38.2    0   0.0 
 32.0 25091 784.1   0.0    0   0.0   0.0    0   0.0 
 32.0 23305 728.3   0.0    0   0.0   0.0    0   0.0 
 32.0 23539 735.6   0.0    0   0.0   0.0    0   0.0 
 32.0 22151 692.2   0.0    0   0.0   0.0    0   0.0 
 32.0 19310 603.4   0.0    0   0.0   0.0    0   0.0 
 32.0 22848 714.0   0.0    0   0.0   0.0    0   0.0 
 32.0 24287 759.0   0.0    0   0.0   0.0    0   0.0 
 32.0 23392 731.0   0.0    0   0.0   0.0    0   0.0 
 32.0 24586 768.3   0.0    0   0.0   0.0    0   0.0 
 32.0 23980 749.4   0.0    0   0.0   0.0    0   0.0 
 32.0 23549 735.9   0.0    0   0.0   0.0    0   0.0 
 32.0 23328 729.0   0.0    0   0.0   0.0    0   0.0 
 32.0 23173 724.2   0.0    0   0.0   0.0    0   0.0 
 32.0 24906 778.3   0.0    0   0.0   0.0    0   0.0 
 32.0 23534 735.4   0.0    0   0.0   0.0    0   0.0 
 32.0 24242 757.6   0.0    0   0.0   0.0    0   0.0 
 32.0 21295 665.5   0.0    0   0.0   0.0    0   0.0 
 32.0 19002 593.8   0.0    0   0.0   0.0    0   0.0 
 32.0 18702 584.4   0.0    0   0.0   0.0    0   0.0 
 32.0 19285 602.7   0.0    0   0.0   0.0    0   0.0 
 32.0 18171 567.8   0.0    0   0.0   0.0    0   0.0 
 32.0 18603 581.3   0.0    0   0.0   0.0    0   0.0 

I think there's always rmdir/mkdir or rename in the kernel call stack of the
hung processes when it happens.
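
A rough way to check a captured stack for those operations, as a sketch only:
the helper names (`stack_frames`, `dir_op_frames`) and the substring list are
my own assumptions, and the sample is the second `procstat -kk` capture above
pasted in as text. It does not call procstat itself.

```python
import re

# Matches one "symbol+0xoffset" frame as printed by `procstat -kk`.
FRAME_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_.]*)\+0x[0-9a-fA-F]+")

# Substrings to look for. Assumption: taken from the operations named in this
# comment, not an exhaustive list of affected code paths.
DIR_OPS = ("rename", "rmdir", "mkdir")


def stack_frames(text):
    """Return the symbol names of all frames found in the text, in order."""
    return FRAME_RE.findall(text)


def dir_op_frames(text):
    """Return only the frames whose name mentions rename, rmdir or mkdir."""
    return [f for f in stack_frames(text)
            if any(op in f.lower() for op in DIR_OPS)]


# Kernel stack of pid 11799, copied from the procstat output in this comment.
SAMPLE = """\
11799 100255 pkg                 -                   mi_switch+0x155
sleepq_switch+0x109 _sleep+0x2b4 bufwait+0xc4 bufwrite+0x25a ffs_update+0x2ed
ffs_syncvnode+0x4da ffs_fsync+0x1f softdep_prerename+0x21a ufs_rename+0x3ee
VOP_RENAME_APV+0x40 kern_renameat+0x3fd amd64_syscall+0x149
fast_syscall_common+0xf8
"""

if __name__ == "__main__":
    # Prints the rename-related frames of the sample stack.
    print(dir_op_frames(SAMPLE))
```

Matching on substrings is crude, but it is enough to sort a pile of saved
stacks into "has a directory operation in it" and "does not".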

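And it's really going nuts with the I/O: vtbd0 sustains roughly 18,000 to
25,000 tps at 32 KB each, about 570 to 780 MB/s, while the other two disks sit
idle.  That makes me think it must be some kind of livelock within soft
updates.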
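
As a sanity check on those iostat numbers, here is a small sketch that
recomputes the MB/s column from KB/t and tps for a few vtbd0 rows copied from
the output above. The function name is made up for illustration, and the
division by 1024 is an assumption that happens to be consistent with the
reported values.

```python
# (KB/t, tps, MB/s) for vtbd0, copied from three rows of the iostat output.
ROWS = [
    (32.0, 25091, 784.1),
    (32.0, 23305, 728.3),
    (32.0, 18171, 567.8),
]


def throughput_mb_s(kb_per_transfer, tps):
    """Transfer size times transfers per second, converted from KB/s to MB/s."""
    return kb_per_transfer * tps / 1024.0


if __name__ == "__main__":
    for kb_t, tps, reported in ROWS:
        computed = throughput_mb_s(kb_t, tps)
        print(f"{tps:6d} tps x {kb_t} KB/t = {computed:.1f} MB/s "
              f"(iostat reported {reported})")
```

The columns agree with each other, so the figures look like real sustained
writes of fixed 32 KB transfers rather than a reporting glitch.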

When some processes get stuck in this state, more and more processes eventually
get stuck until the VM is unusable.  But running `sync` unwedges the whole
thing and everything seems to be running fine after that.

I'll try to get a core dump next time.
