Panic when removing a SCSI device entry

Joerg Wunsch freebsd-scsi at uriah.heep.sax.de
Sun May 8 09:23:01 UTC 2011


I've got a setup where a tape library is attached through a
computer-controllable power switch, so it is only turned on while
backups (or restores) are running.  This is mainly to reduce the
noise level, but also to reduce the overall power consumption while
the library is not needed.

Every now and then, the kernel panics with a page fault during the
(unattended, it happens at night) power cycling and the surrounding
actions.  The current process when the page fault happens is always
mt(1), which is used inside the powerup/down script to ensure the
drive is properly rewound.  The page fault happens in
destroy_devl(), at this location:

        /* If we are a child, remove us from the parents list */
        if (dev->si_flags & SI_CHILD) {
here --->>>     LIST_REMOVE(dev, si_siblings);
                dev->si_flags &= ~SI_CHILD;
        }

The preprocessed code of that looks like:

 if (dev->si_flags & 0x0010) {
  if ((((dev))->si_siblings.le_next) != ((void *)0))
        (((dev))->si_siblings.le_next)->si_siblings.le_prev =
             (dev)->si_siblings.le_prev;
  *(dev)->si_siblings.le_prev = (((dev))->si_siblings.le_next);
  dev->si_flags &= ~0x0010;
 }

and it's the dereference of (dev)->si_siblings.le_prev that hits a
NULL pointer.  Obviously, LIST_REMOVE doesn't anticipate that
dev->si_siblings.le_prev might be a NULL pointer, so this is a usage
error of some kind.  Could it be that destroy_devl() is called twice
for the same device?

This used to happen on an earlier system (some version of 7.x-stable),
and I eventually managed to tweak the powerup/down scripts of the
library so as to avoid the critical sequence of actions triggering
this situation.  Now that I have finally upgraded the machine to
8.2-STABLE, it is triggered very frequently again, though.

Any ideas how to fix it, or at least apply a workaround, other than
turning

        *(elm)->field.le_prev = LIST_NEXT((elm), field);         \

in the LIST_REMOVE macro into

        if ((elm)->field.le_prev != NULL) \
          *(elm)->field.le_prev = LIST_NEXT((elm), field);       \

which affects the entire system, not just the SCSI subsystem part?
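
To illustrate what a more local guard could look like, here is a
minimal user-space sketch.  It is not the kernel code: struct
demo_dev, DEMO_SI_CHILD, demo_insert_head() and demo_remove_child()
are hypothetical stand-ins that only mirror the field layout seen in
the preprocessed code above.  It assumes that a NULL le_prev means
"not linked", and it would only mask the underlying bug rather than
explain it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the device structure; only the fields used here. */
struct demo_dev {
    unsigned si_flags;
    struct {
        struct demo_dev *le_next;   /* next sibling, or NULL */
        struct demo_dev **le_prev;  /* address of the pointer that points at us */
    } si_siblings;
};

#define DEMO_SI_CHILD 0x0010u       /* same bit value as in the expansion above */

/* Link dev at the head of a parent's child list and mark it as a child. */
static void demo_insert_head(struct demo_dev **head, struct demo_dev *dev)
{
    assert(head != NULL && dev != NULL);
    dev->si_siblings.le_next = *head;
    if (*head != NULL)
        (*head)->si_siblings.le_prev = &dev->si_siblings.le_next;
    *head = dev;
    dev->si_siblings.le_prev = head;
    dev->si_flags |= DEMO_SI_CHILD;
}

/*
 * Guarded unlink: the same steps as the expanded LIST_REMOVE, but skipped
 * when le_prev is NULL.  The link fields are cleared afterwards so that a
 * second call on the same element becomes a no-op.
 * Returns 1 if the element was unlinked, 0 if the unlink was skipped.
 */
static int demo_remove_child(struct demo_dev *dev)
{
    int unlinked = 0;

    assert(dev != NULL);
    if ((dev->si_flags & DEMO_SI_CHILD) &&
        dev->si_siblings.le_prev != NULL) {
        if (dev->si_siblings.le_next != NULL)
            dev->si_siblings.le_next->si_siblings.le_prev =
                dev->si_siblings.le_prev;
        *dev->si_siblings.le_prev = dev->si_siblings.le_next;
        dev->si_siblings.le_next = NULL;
        dev->si_siblings.le_prev = NULL;
        unlinked = 1;
    }
    dev->si_flags &= ~DEMO_SI_CHILD;
    return unlinked;
}
```

With a guard like this placed at the call site, the check stays
confined to one spot instead of changing LIST_REMOVE for every
user of the macro.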

-- 
cheers, J"org               .-.-.   --... ...--   -.. .  DL8DTL

http://www.sax.de/~joerg/                        NIC: JW11-RIPE
Never trust an operating system you don't have sources for. ;-)

