FS hang with suspfs when creating snapshot on a UFS + GJOURNAL setup
Andreas Longwitz
longwitz at incore.de
Thu Dec 27 11:35:02 UTC 2012
On a FreeBSD 8-Stable machine with UFS + GJOURNAL (no SU) I observed the
same behaviour as described for UFS+SU+J in
lists.freebsd.org/pipermail/freebsd-current/2012-January/030937.html.
The snapshot was initiated by amanda with
dump 3ubLshf 64 1048576 0 - /dev/mirror/gm0p10.journal (pid 50347)
and the process creating the snapshot is
/sbin/mksnap_ffs /home /home/.snap/dump_snapshot (pid 50350).
Process 50350 hangs, and so do all subsequent processes that try to
access the home partition; some of them run, but never finish,
e.g. a shell (after Ctrl-T):
load: 0.61 cmd: sh 52670 [suspfs] 43.46r 0.00u 0.00s 0% 2272k.
All write I/O to all gjournaled partitions is stopped. Under normal
circumstances the snapshot is taken in about five seconds, so this is
definitely not the problem I described in
lists.freebsd.org/pipermail/freebsd-geom/2012-May/005246.html.
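For reference, dump's -L flag is what brings mksnap_ffs into play: on a
live filesystem dump first takes a snapshot and then dumps the snapshot
instead of the raw device. The hang can therefore be provoked outside
amanda by taking a snapshot by hand; the snapshot file name below is
only an example:

```
# Reproduce by hand the snapshot step that dump -L performs
# (run as root on the live filesystem; file name is an example):
mksnap_ffs /home /home/.snap/testsnap

# While mksnap_ffs runs, writes to /home are suspended, so any new
# write blocks in state "suspfs" -- exactly what Ctrl-T shows above.

# Afterwards the snapshot can simply be removed again:
rm /home/.snap/testsnap
```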
My system disk, holding root, var, usr and home, is completely mirrored
and gjournaled, with the journals in separate partitions:
gpart show mirror/gm0 -->
=>        34  286749420  mirror/gm0  GPT  (136G)
          34        128           1  freebsd-boot  (64k)
         162    8601600           2  freebsd-swap  (4.1G)
     8601762    2097152           3  freebsd-swap  (1.0G)
    10698914    4194304           4  freebsd-swap  (2.0G)
    14893218    4194304           5  freebsd-swap  (2.0G)
    19087522    4194304           6  freebsd-swap  (2.0G)
    23281826    2097152           7  freebsd-ufs   (1.0G)
    25378978    8388608           8  freebsd-ufs   (4.0G)
    33767586   67108864           9  freebsd-ufs   (32G)
   100876450  185873004          10  freebsd-ufs   (88G)
df -h -t ufs -->
Filesystem                   Size    Used   Avail  Capacity  Mounted on
/dev/mirror/gm0p7.journal    989M    313M    596M    34%     /
/dev/mirror/gm0p8.journal    3.9G    2.2G    1.4G    61%     /var
/dev/mirror/gm0p9.journal     31G    8.6G     19G    30%     /usr
/dev/mirror/gm0p10.journal    85G     17G     62G    22%     /home
gmirror status -->
Name Status Components
mirror/gm0 COMPLETE da6 (ACTIVE)
da7 (ACTIVE)
gjournal status -->
Name Status Components
mirror/gm0p7.journal N/A mirror/gm0p3
mirror/gm0p7
mirror/gm0p8.journal N/A mirror/gm0p4
mirror/gm0p8
mirror/gm0p9.journal N/A mirror/gm0p5
mirror/gm0p9
mirror/gm0p10.journal N/A mirror/gm0p6
mirror/gm0p10
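For completeness, a layout like this (data partition plus a separate
journal provider) is roughly built as sketched below; the commands show
the /home case and are an illustration, not a transcript of what was
actually run here:

```
# Sketch: attach a separate journal provider to the data partition;
# this creates /dev/mirror/gm0p10.journal:
gjournal label mirror/gm0p10 mirror/gm0p6

# newfs with the gjournal flag, then mount async -- gjournal supplies
# the ordering guarantees instead of synchronous metadata writes:
newfs -J /dev/mirror/gm0p10.journal
mount -o async /dev/mirror/gm0p10.journal /home
```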
I got some information from the hanging system with DDB:
KDB: enter: Break to debugger
[thread pid 11 tid 100004 ]
Stopped at kdb_enter+0x3b: movq $0,0x483332(%rip)
db> show pcpu
cpuid = 2
dynamic pcpu = 0xffffff807f7d0080
curthread = 0xffffff000235c000: pid 11 "idle: cpu2"
curpcb = 0xffffff8000051d00
fpcurthread = none
idlethread = 0xffffff000235c000: tid 100004 "idle: cpu2"
curpmap = 0xffffffff80889170
tssp = 0xffffffff808f65d0
commontssp = 0xffffffff808f65d0
rsp0 = 0xffffff8000051d00
gs32p = 0xffffffff808f5408
ldt = 0xffffffff808f5448
tss = 0xffffffff808f5438
db> show allpcpu
Current CPU: 2
cpuid = 0
dynamic pcpu = 0x449080
curthread = 0xffffff0002368470: pid 11 "idle: cpu0"
curpcb = 0xffffff800005bd00
fpcurthread = none
idlethread = 0xffffff0002368470: tid 100006 "idle: cpu0"
curpmap = 0xffffffff80889170
tssp = 0xffffffff808f6500
commontssp = 0xffffffff808f6500
rsp0 = 0xffffff800005bd00
gs32p = 0xffffffff808f5338
ldt = 0xffffffff808f5378
tss = 0xffffffff808f5368
cpuid = 1
dynamic pcpu = 0xffffff807f7c9080
curthread = 0xffffff00023688e0: pid 11 "idle: cpu1"
curpcb = 0xffffff8000056d00
fpcurthread = none
idlethread = 0xffffff00023688e0: tid 100005 "idle: cpu1"
curpmap = 0xffffffff80889170
tssp = 0xffffffff808f6568
commontssp = 0xffffffff808f6568
rsp0 = 0xffffff8000056d00
gs32p = 0xffffffff808f53a0
ldt = 0xffffffff808f53e0
tss = 0xffffffff808f53d0
cpuid = 2
dynamic pcpu = 0xffffff807f7d0080
curthread = 0xffffff000235c000: pid 11 "idle: cpu2"
curpcb = 0xffffff8000051d00
fpcurthread = none
idlethread = 0xffffff000235c000: tid 100004 "idle: cpu2"
curpmap = 0xffffffff80889170
tssp = 0xffffffff808f65d0
commontssp = 0xffffffff808f65d0
rsp0 = 0xffffff8000051d00
gs32p = 0xffffffff808f5408
ldt = 0xffffffff808f5448
tss = 0xffffffff808f5438
cpuid = 3
dynamic pcpu = 0xffffff807f7d7080
curthread = 0xffffff000235c470: pid 11 "idle: cpu3"
curpcb = 0xffffff800004cd00
fpcurthread = none
idlethread = 0xffffff000235c470: tid 100003 "idle: cpu3"
curpmap = 0xffffffff80889170
tssp = 0xffffffff808f6638
commontssp = 0xffffffff808f6638
rsp0 = 0xffffff800004cd00
gs32p = 0xffffffff808f5470
ldt = 0xffffffff808f54b0
tss = 0xffffffff808f54a0
db> trace
Tracing pid 11 tid 100004 td 0xffffff000235c000
kdb_enter() at kdb_enter+0x3b
kdb_alt_break_internal() at kdb_alt_break_internal+0x8a
uart_intr() at uart_intr+0x2cd
intr_event_handle() at intr_event_handle+0x62
intr_execute_handlers() at intr_execute_handlers+0x5f
lapic_handle_intr() at lapic_handle_intr+0x37
Xapic_isr1() at Xapic_isr1+0xa5
--- interrupt, rip = 0xffffffff805ae2c6, rsp = 0xffffff8000051b10, rbp =
0xffffff8000051b20 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6
acpi_cpu_idle() at acpi_cpu_idle+0x20a
sched_idletd() at sched_idletd+0x11f
fork_exit() at fork_exit+0x11f
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff8000051cf0, rbp = 0 ---
db> show lock Giant
class: sleep mutex
name: Giant
flags: {DEF, RECURSE}
state: {UNOWNED}
db> show lockedvnods
Locked vnodes
0xffffff012a730588: tag ufs, type VREG
usecount 1, writecount 0, refcount 2 mountedhere 0
flags (VV_SYSTEM)
lock type snaplk: EXCL by thread 0xffffff00847cc000 (pid 50350)
with exclusive waiters pending
ino 23552, on dev mirror/gm0p10.journal
db> show lockchain
thread 100004 (pid 11, idle: cpu2) running on CPU 2
db> show sleepchain
thread 100004 (pid 11, idle: cpu2) running on CPU 2
db> show thread
Thread 100004 at 0xffffff000235c000:
proc (pid 11): 0xffffff000235a470
name: idle: cpu2
stack: 0xffffff800004e000-0xffffff8000051fff
flags: 0x50024 pflags: 0x200000
state: RUNNING (CPU 2)
priority: 255
container lock: sched lock 2 (0xffffffff80894d80)
db> show proc 50350
Process 50350 (mksnap_ffs) at 0xffffff00847b68e0:
state: NORMAL
uid: 0 gids: 5
parent: pid 50347 at 0xffffff00842d48e0
ABI: FreeBSD ELF64
arguments: /sbin/mksnap_ffs
threads: 1
100215 D suspfs 0xffffff0002f2907c mksnap_ffs
db> show threads
100111 (0xffffff0002d85000) (stack 0xffffff8245057000) sched_switch()
at sched_switch+0xde
100166 (0xffffff0002f6b000) (stack 0xffffff824516a000) sched_switch()
at sched_switch+0xde
100622 (0xffffff006a8d2000) (stack 0xffffff8245a52000) sched_switch()
at sched_switch+0xde
....
100215 (0xffffff00847cc000) (stack 0xffffff824525f000) sched_switch()
at sched_switch+0xde
....
It seems there is a deadlock around the filesystem write suspension
(suspfs), but I could not figure out who holds the lock.
Any hints on how to gather better diagnostic information the next time
the error occurs are welcome.
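A possible starting point for the next occurrence, assuming DDB is
compiled in (and the kernel has options WITNESS, which show alllocks
requires), might be:

```
# From DDB at the next hang:
db> alltrace          # stack traces of all threads, not just curthread
db> show alllocks     # lock owners per thread (needs options WITNESS)
db> ps                # wait channels (suspfs, snaplk, ...) per process
db> trace 50350       # stack of the stuck mksnap_ffs itself

# From userland, if a shell still responds:
procstat -kk -a       # kernel stack of every thread
```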
--
Andreas Longwitz