panic: sbuf_vprintf called with a NULL sbuf pointer
John Baldwin
jhb at freebsd.org
Thu Jan 21 01:15:36 UTC 2016
On Wednesday, January 20, 2016 03:15:06 PM Don Lewis wrote:
> On 8 Dec, John Baldwin wrote:
> > On Tuesday, December 08, 2015 11:39:56 AM Don Lewis wrote:
> >> On 8 Dec, John Baldwin wrote:
> >> > On Monday, December 07, 2015 10:10:51 PM Don Lewis wrote:
> >> >> On 2 Dec, John Baldwin wrote:
> >> >> > On Wednesday, December 02, 2015 01:25:56 PM Don Lewis wrote:
> >> >> >> > If you want to look at this further, try going to frame 16 and dissassembling the
> >> >> >> > instructions before the call to see if you can spot which register the first
> >> >> >> > parameter (saved in %rdi IIRC) comes from.
> >> >> >>
> >> >> >> Dump of assembler code for function sbuf_printf:
> >> >> >> 0xffffffff80a673e0 <+0>: push %rbp
> >> >> >> 0xffffffff80a673e1 <+1>: mov %rsp,%rbp
> >> >> >> 0xffffffff80a673e4 <+4>: push %r14
> >> >> >> 0xffffffff80a673e6 <+6>: push %rbx
> >> >> >> 0xffffffff80a673e7 <+7>: sub $0x50,%rsp
> >> >> >> 0xffffffff80a673eb <+11>: mov %rsi,%r14
> >> >> >> 0xffffffff80a673ee <+14>: mov %rdi,%rbx
> >> >> >> 0xffffffff80a673f1 <+17>: mov %r9,-0x38(%rbp)
> >> >> >> 0xffffffff80a673f5 <+21>: mov %r8,-0x40(%rbp)
> >> >> >> 0xffffffff80a673f9 <+25>: mov %rcx,-0x48(%rbp)
> >> >> >> 0xffffffff80a673fd <+29>: mov %rdx,-0x50(%rbp)
> >> >> >> 0xffffffff80a67401 <+33>: lea -0x60(%rbp),%rax
> >> >> >> 0xffffffff80a67405 <+37>: mov %rax,-0x20(%rbp)
> >> >> >> 0xffffffff80a67409 <+41>: lea 0x10(%rbp),%rax
> >> >> >> 0xffffffff80a6740d <+45>: mov %rax,-0x28(%rbp)
> >> >> >> 0xffffffff80a67411 <+49>: movl $0x30,-0x2c(%rbp)
> >> >> >> 0xffffffff80a67418 <+56>: movl $0x10,-0x30(%rbp)
> >> >> >> 0xffffffff80a6741f <+63>: mov $0xffffffff8137bdf8,%rdi
> >> >> >> 0xffffffff80a67426 <+70>: mov %rbx,%rsi
> >> >> >> 0xffffffff80a67429 <+73>: callq 0xffffffff80a66c00 <_assert_sbuf_integrity>
> >> >> >>
> >> >> >>
> >> >> >> 0xffffffff80a237b9 <+825>: jmpq 0xffffffff80a236fd <sigexit+637>
> >> >> >> 0xffffffff80a237be <+830>: mov $0xffffffff80fd8ad3,%rsi
> >> >> >> 0xffffffff80a237c5 <+837>: xor %eax,%eax
> >> >> >> 0xffffffff80a237c7 <+839>: mov %r12,%rdi
> >> >> >> 0xffffffff80a237ca <+842>: mov -0x228(%rbp),%rdx
> >> >> >> 0xffffffff80a237d1 <+849>: callq 0xffffffff80a673e0 <sbuf_printf>
> >> >> >> => 0xffffffff80a237d6 <+854>: inc %r14d
> >> >> >> 0xffffffff80a237d9 <+857>: jmpq 0xffffffff80a236fd <sigexit+637>
> >> >> >
> >> >> > So maybe try 'p $r12' in the corefile_open() frame.
> >> >>
> >> >> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=<optimized out>,
> >> >> uid=<optimized out>, pid=<optimized out>, td=<optimized out>,
> >> >> vpp=<optimized out>, namep=<optimized out>)
> >> >> at /usr/src/sys/kern/kern_sig.c:3188
> >> >> 3188 sbuf_printf(&sb, "%s", comm);
> >> >> (kgdb) p $r12
> >> >> $1 = 0
> >> >
> >> > So it's definitely zero. :( The next step is probably to disassemble the
> >> > corefile_open function (ugh) and walk backwards to find where %r12 is read
> >> > from. It might be from a local variable on the stack, so then you would
> >> > want to examine that memory in the stack and the surrounding memory to see
> >> > if there is memory corruption and perhaps if there is anything recognizable
> >> > about it (e.g. if the corruption contains some sort of data you recognize,
> >> > or if the corruption is bounded by a certain length, etc.). It's a bit of
> >> > a shot in the dark though.
> >> >
> >> > Is this reproducible?
> >>
> >> No it's not. The only time it happened was when there was a swap
> >> timeout, probably because of a lengthy deep recovery on one of the
> >> mirrored swap devices.
> >>
> >> The code in question is:
> >> struct sbuf sb;
> >> [snip]
> >> (void)sbuf_new(&sb, name, MAXPATHLEN, SBUF_FIXEDLEN);
> >> [snip]
> >> for (i = 0; format[i] != '\0'; i++) {
> >> switch (format[i]) {
> >> case '%': /* Format character */
> >> i++;
> >> switch (format[i]) {
> >> [snip]
> >> case 'N': /* process name */
> >> sbuf_printf(&sb, "%s", comm);
> >> break;
> >>
> >>
> >> &sb is used in a bunch of places, so the compiler probably computes its
> >> value once by adding the proper offset to the stack pointer and stashing
> >> the result in a register. Since kern.corefile is "%N.core", the failure
> >> is happening on the first interation of the loop, so there isn't much
> >> opportunity for things to get corrupted. Also, the control flow in this
> >> function only depends on the format, so there shouldn't be anything
> >> special about a swap timeout vs. a segfault generated core.
> >
> > Yes, r12 is call-safe (IIRC), so I expect it only computes it once as well,
> > but I've sometimes seen the compiler spill local vars onto the stack due to
> > register pressure. That said, I think it is unlikely it would have to spill
> > &sb during the early part of the function. :(
>
> I had some more time to look at this. It's a bit messy because
> corefile_open() and coredump() both get inlined into sig_exit(). Here's
> the section of code from the malloc() call and sbuf_new() to the failing
> sbuf_printf() call:
>
> 0xffffffff80a23685 <+517>: callq 0xffffffff80a00680 <malloc>
> 0xffffffff80a2368a <+522>: mov %rax,-0x238(%rbp)
> 0xffffffff80a23691 <+529>: lea -0x138(%rbp),%r12
> ....
> 0xffffffff80a237b9 <+825>: jmpq 0xffffffff80a236fd <sigexit+637>
> 0xffffffff80a237be <+830>: mov $0xffffffff80fd8ad3,%rsi
> 0xffffffff80a237c5 <+837>: xor %eax,%eax
> 0xffffffff80a237c7 <+839>: mov %r12,%rdi
> 0xffffffff80a237ca <+842>: mov -0x228(%rbp),%rdx
> 0xffffffff80a237d1 <+849>: callq 0xffffffff80a673e0 <sbuf_printf>
> => 0xffffffff80a237d6 <+854>: inc %r14d
> 0xffffffff80a237d9 <+857>: jmpq 0xffffffff80a236fd <sigexit+637>
> 0xffffffff80a237de <+862>: mov $0xffffffff81361469,%rsi
>
> Basically it's:
> lea -0x138(%rbp),%r12
> [ snip ]
> mov %r12,%rdi
> callq sbuf_printf
Yeah, at this point I can only think of bizarre-o things like a trap
clobbering %r12 in the saved trapframe or the like. I can't imagine why
that would happen though. If we had a bug that clobbered saved registers
in the trapframe you would expect it to happen far more often. OTOH, if
some code had a stale pointer that was later reused by a kthread's stack
and the thread was interrupted (e.g. by a device interrupt) while in
this routine, _and_ the stale pointer was indirected and trashed the
saved %r12, then that would explain this. I just feel like I have better
odds of seeing a unicorn in my driveway tonight. :-/
--
John Baldwin
More information about the freebsd-current
mailing list