FreeBSD 9: fdisk -It crashes kernel

Thu Apr 25 17:57:30 UTC 2013

On Thu, Apr 25, 2013 at 11:58:42AM -0500, Guy Helmer wrote:
> On Apr 25, 2013, at 10:58 AM, Jeremy Chadwick <jdc at koitsu.org> wrote:
> 
> > On Thu, Apr 25, 2013 at 09:06:49AM -0500, Guy Helmer wrote:
> >> Encountered a surprise when my disk resizing rc.d script caused FreeBSD 9.1-STABLE to crash. I used "fdisk -It ada0" to determine the available size of the disk (which happened to be the root disk), and on FreeBSD 9.1 the kernel comes crashing down:
> >> 
> >> + fdisk -It ada0
> >> + /rescue/sed -En 's,.*start ([0-9]+).*size ([0-9]+).*,\1 + \2,p'
> >> vnode_pager_getpages: I/O read error
> >> vm_fault: pager read error, pid 65 (fdisk)
> >> pid 65 (fdisk), uid 0: exited on signal 11
> >> eval: arithmetic expression: expecting primary: ""
> >> Entropy harvesting: point_to_pointeval: date: Device not configured
> >> eval: df: Device not configured
> >> eval: dmesg: Device not configured
> >> cat: /bin/ls: Device not configured
> >> kickstart.
> >> eval: cannot open /etc/fstab: Device not configured
> >> eval: cannot open /etc/fstab: Device not configured
> >> eval: swapon: Device not configured
> >> Warning! No /etc/fstab: skipping disk checks
> >> fstab: /etc/fstab:0: Device not configured
> >> 
> >> Fatal trap 12: page fault while in kernel mode
> >> cpuid = 1; apic id = 01
> >> fault virtual address   = 0x0
> >> fault code                     = supervisor read, page not present
> >> instruction pointer      = 0x20:0xc0825fc4
> >> stack pointer               = 0x28:0xc5a088c8
> >> frame pointer              = 0x28:0xc5a08914
> >> code segment            = base 0x0, limit 0xfffff, type 0x1b
> >>                                     = DPL 0, pres 1, def32 1, gran 1
> >> processor eflags       = interrupt enabled, resume, IOPL = 0
> >> current process         = 91 (mount)
> >> [ thread pid 91 tid 100056 ]
> >> Stopped at  g_access+0x24: movl 0(%ebx),%eax
> >> db> where
> >> Tracing pid 91 tid 100056 td 0xc84c42f0
> >> g_access(c8481d34,0,1,1,0,…) at g_access+0x24/frame 0xc5a08914
> >> ffs_mount(c8481d34,c0d78380,2,c5a08c00,c829ae6c,…) at ffs_mount+0xf74/frame 0xc5a08a34
> >> vfs_donmount(c84c42f0,10000,0,c84cf200,c84cf200,…) at vfs_donmount+0x1423/frame 0xc5a08c24
> >> sys_nmount(c84c42f0,c5a08ccc,c5a08cc4,1010006,c5a08d08,…) at sys_nmount+0x7f/frame 0xc5a08c48
> >> syscall(c5a08d08) at syscall+0x443/frame 0xc508cfc
> >> Xint0x80_syscall() at Xint0x80_syscall+0x21/frame 0xc5a08cfc
> >> --- syscall (378, FreeBSD ELF32, sys_nmount), eip = 0x480d5feb, esp = 0xbfbfce1c, ebp = 0xbfbfd378 ---
> >> 
> >> I'll fix my script to not do this, but it seems odd that fdisk -It can make the disk "go away".
> > 
> > Please provide a full, unmodified copy of your script.
> > 
> > What's confusing to me is that after your sed call (which I don't even
> > understand, because it doesn't appear to be operating on anything except
> > stdin/stdout, and we don't know what that is -- again, show the script),
> > the kernel starts outputting indications that the root disk/filesystem
> > or its related metadata disappeared:
> > 
> >> vnode_pager_getpages: I/O read error
> >> vm_fault: pager read error, pid 65 (fdisk)
> >> pid 65 (fdisk), uid 0: exited on signal 11
> > 
> > Except the kernel stack trace indicates something called sys_nmount(),
> > which called vfs_donmount(), which called ffs_mount(), which calls
> > g_access().  All of those scream to me "someone tried to mount
> > something".  fdisk does not do mounting.
> 
> Right, which is why I copied the entire screen output -- it appears to me that the rc scripts had stumbled on until the kernel panicked.
> 
> > 
> > fdisk also shouldn't be writing to LBA 0 (the MBR) if you used -I -t.
> > I've been staring at fdisk.c for about 20 minutes now and I can't work
> > out a situation where -I -t would cause the MBR to be rewritten
> > actively.
> > 
> > The only GEOM calls I see in fdisk.c that would get called are
> > g_device_path(), g_open(), and g_close().  Actual device I/O uses read()
> > and write() (only in write_s0() which shouldn't be called).
> > 
> > Furthermore, GEOM has foot-shooting-prevention mechanisms in place (I'm
> > talking about kern.geom.debugflags) to keep LBA 0 from being modified.
> > Is your script setting that sysctl to 16/0x10 blindly?  Ahem.
> 
> No. The script is intended only to work for drives other than the one containing the boot partition.
> 
> > 
> > It would also help if you could state exactly what 9.1-STABLE source
> > you're using; if using svn provide revision (rXXXXXX), else provide
> > uname -a output.
> 
> rev 249788
> 
> > 
> > Finally: I would suggest using gpart(8) instead going forward.  This is
> > a separate recommendation though; if somehow I'm overlooking something
> > in fdisk.c where writes to LBA 0 really do happen, then that needs to
> > get fixed.  But gpart(8) is what you should use in general these days
> > anyway.
> > 
> 
> Seems like gpart was giving me some frustration with earlier versions of FreeBSD (7, I think) so I went with fdisk instead. Might work OK now...
> 
> I have included the full script below.
>
> { snipping for brevity; for reference, see this url: }
> { http://lists.freebsd.org/pipermail/freebsd-stable/2013-April/073234.html }

Thanks for this.

I could practically write a book on what's going on here.  Rather than
me spend hours of time reverse-engineering this, you're going to need to
step up to the plate and see if you can figure out what exactly triggers
the issue.

I will give you this analysis about fdisk -I -t:

When -I is specified, I_flag=1.

When -t is specified, t_flag=1, and also v_flag=1.

Function open_disk(), when fdisk is used with the -I option, will call
g_open() with the read-write flag set to 1.  Whether or not this
succeeds I don't know (and if it fails, but only with EPERM, then it
retries in read-only mode silently).  The -I flag correlates with the
I_flag variable (do not confuse this with i_flag):

    static int
    open_disk(int flag)
    {
            int rwmode;

            /* Write mode if one of these flags are set. */
            rwmode = (a_flag || I_flag || B_flag || flag);
            fd = g_open(disk, rwmode);
            /* If the mode fails, try read-only if we didn't. */
            if (fd == -1 && errno == EPERM && rwmode)
                    fd = g_open(disk, 0);
            if (fd == -1 && errno == ENXIO)
                    return -2;
            if (fd == -1) {
                    warnx("can't open device %s", disk);
                    return -1;
            }
            if (get_params() == -1) {
                    warnx("can't get disk parameters on %s", disk);
                    return -1;
            }
            return fd;
    }

Variable fd is global.

After this call to open_disk(), read_disk() is used, but that's only
doing read operations on fd.

After this, the if (I_flag) code gets run.  This calls read_s0(),
reset_boot() (sounds ominous but isn't), and dos().

read_s0() does not issue any write I/O to fd, or call any functions that
issue write I/O.

reset_boot() just resets the in-memory-copy of the partition table.
It does not modify anything on disk.

dos() does not do any I/O at all.

At this point, if v_flag is set (which it is), print_s0() gets run.

print_s0() calls print_params(), which simply prints out the
in-memory-copy of C/H/S from the disk label and so on.  No file I/O is
done.  Once that's done, it calls print_part() on each partition,
which just outputs all the details -- again, no file I/O is done.

Finally, at this stage, if t_flag ISN'T set, then write_s0() gets run.
In this case write_s0() does not get called because t_flag=1.  FYI,
write_s0() is what does the actual write I/O to LBA 0/MBR.  After that,
exit(0) is called.

So even though -I -t calls g_open() with the read-write flag set, I
don't see anything that indicates writing to LBA 0/MBR happens.

So I do not see how fdisk -I -t could cause this situation.
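
The flag logic walked through above can be condensed into a toy model. This
is only a sketch of the -I path as described in this message, not fdisk's
actual source; the function name fdisk_I_path_writes_mbr is made up for
illustration, and it takes the flags as separate arguments.

```shell
# Toy model of the -I/-t/-v handling described above.
# This is NOT fdisk's source; the function name is hypothetical.
fdisk_I_path_writes_mbr() {
    I_flag=0; t_flag=0; v_flag=0
    for opt in "$@"; do
        case "$opt" in
            -I) I_flag=1 ;;
            -t) t_flag=1; v_flag=1 ;;   # -t also turns on verbose output
            -v) v_flag=1 ;;
        esac
    done
    # write_s0() is only reached on the -I path, and only when -t is absent.
    if [ "$I_flag" -eq 1 ] && [ "$t_flag" -eq 0 ]; then
        echo "write_s0 called"
    else
        echo "write_s0 skipped"
    fi
}

fdisk_I_path_writes_mbr -I -t   # prints "write_s0 skipped"
fdisk_I_path_writes_mbr -I      # prints "write_s0 called"
```

The model only covers the options discussed here; other fdisk options that
write to the disk are deliberately left out.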

fdisk -v, maybe, but again, you'll need to do the testing.

Now I have a question for you: how did you manage to get this output?

> >> + fdisk -It ada0
> >> + /rescue/sed -En 's,.*start ([0-9]+).*size ([0-9]+).*,\1 + \2,p'

Because this looks like /bin/sh -x output, but I need to know if that's
the case or not.

/bin/sh -x claims to echo commands to stderr ***before*** they're
executed.
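
A small sketch of that tracing behaviour, assuming the default PS4 prompt
("+ ") and a throwaway command in place of the real script. Both stderr and
stdout are captured together so the ordering is visible.

```shell
# Run a child shell with -x and capture stderr and stdout together.
# The trace line is written to stderr before the command produces output.
out=$(sh -x -c 'echo hello' 2>&1)
printf '%s\n' "$out"
```

The first captured line is the trace of the command and the second is the
command's own output, which matches the "+ fdisk -It ada0" style of the
lines quoted below.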

So I'm then left wondering why we don't see output that equates to the
equivalent of this line:

    eval $(fdisk -v $DISK | $SED -En 's,.*start ([0-9]+).*size ([0-9]+).*,curroff=\1 currsize=\2,p')

Instead, we start seeing this:

> >> vnode_pager_getpages: I/O read error
> >> vm_fault: pager read error, pid 65 (fdisk)
> >> pid 65 (fdisk), uid 0: exited on signal 11
> >> eval: arithmetic expression: expecting primary: ""
> >> Entropy harvesting: point_to_pointeval: date: Device not configured
> >> eval: df: Device not configured
> >> eval: dmesg: Device not configured
> >> cat: /bin/ls: Device not configured

Your script has only 1 eval statement (and eval is very very dangerous.
I cannot stress this enough.  If you ever think you need eval in shell
scripts, you probably don't.)
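
One possible way to get the two values without eval is to have sed emit
bare numbers and let read assign them. This is a minimal sketch, not a
drop-in replacement for the script: the sample line is a hypothetical
stand-in for fdisk output, and only the variable names curroff and currsize
come from the original eval line.

```shell
# Hypothetical stand-in for one line of fdisk output.
sample='    start 63, size 1000 (0 Meg), flag 80 (active)'

# Reduce the line to two bare numbers; input that does not match
# the pattern produces no output at all.
fields=$(printf '%s\n' "$sample" |
    sed -En 's,.*start ([0-9]+).*size ([0-9]+).*,\1 \2,p' | head -n 1)

# read only assigns words to the named variables; it never executes them.
curroff=''
currsize=''
read -r curroff currsize <<EOF
$fields
EOF

echo "curroff=$curroff currsize=$currsize"
```

If the input contains nothing that matches, both variables simply stay
empty, and the script can test for that instead of executing whatever text
happened to arrive.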

Your script does not call df, dmesg, date, or /bin/ls.  So why are these
mentioned?  And "Entropy harvesting" comes from dmesg/the kernel message
buffer too, how is that ending up there?

Possibly the eval: error line only gets output by sh ***after*** all the
preceding [broken] stuff gets run.

But I'm also confused, because there isn't anything arithmetic-oriented
in your eval line, so why is it talking about arithmetic expressions?
You don't use expr either, so the only math operation comes BEFORE all
of that, specifically here:

    physsize=$(($(fdisk -It $DISK | $SED -En 's,.*start ([0-9]+).*size ([0-9]+).*,\1 + \2,p')))
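
The `expecting primary: ""` message in the log suggests the arithmetic
expansion received an empty string, which is what would happen if the
pipeline produced no output (for instance because fdisk died before
printing anything). That is an inference from the log, not something
confirmed in this thread. A minimal sketch of guarding against it, using a
hypothetical helper named sum_start_size that reads fdisk-style text on
stdin:

```shell
# Hypothetical helper: reads fdisk-style text on stdin and prints start + size.
# Returns 1 instead of attempting arithmetic when no line matched.
sum_start_size() {
    expr_text=$(sed -En 's,.*start ([0-9]+).*size ([0-9]+).*,\1 + \2,p' | head -n 1)
    if [ -z "$expr_text" ]; then
        echo "sum_start_size: no start/size line in input" >&2
        return 1
    fi
    # $expr_text is expanded first, so the shell evaluates e.g. "63 + 1000".
    echo "$(($expr_text))"
}

printf 'start 63, size 1000\n' | sum_start_size          # prints 1063
printf '' | sum_start_size || echo "no usable output"    # takes the error path
```

Checking for empty output before doing the math lets the script stop at
the first sign of trouble rather than carrying on with an undefined size.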

My gut feeling here is that something "unexpected" happened and your
script went totally haywire as a result (probably some unexpected output
that got turned into something you didn't expect).  My favourite is
seeing asterisk/wildcards expanded to pull in all the filenames in $cwd.
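
The wildcard case is easy to reproduce. The sketch below uses a scratch
directory and made-up file names, with a variable standing in for
unexpected command output.

```shell
# Scratch directory so the glob has known files to match.
dir=$(mktemp -d)
touch "$dir/a.txt" "$dir/b.txt"

value='*'   # stands in for unexpected output captured from a command

unquoted=$(cd "$dir" && echo $value)     # the shell expands * to file names
quoted=$(cd "$dir" && echo "$value")     # quoting keeps the literal asterisk

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

rm -rf "$dir"
```

The unquoted form yields the file names in the directory, while the quoted
form yields the asterisk itself.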

I'm sorry to tell you, but there is a point when writing shell scripts
becomes unreliable/unmanageable/results in too much risk, and it is time to
consider writing such things in an actual programming language
(preferably one without reliance on CLI tools, but real APIs).  I know
you don't need to hear that right now, but it's true.

See if you can work out exactly what line begins causing problems for
you.  My guess is that it's the result of fdisk segfaulting, but I'm
honestly not sure because the above output doesn't entirely make sense.

Let us know what you determine/find out.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |