strange problem with FreeBSD 7.3 64bit
free.bsd at webstyle.ch
Fri Sep 10 09:04:07 UTC 2010
we upgraded some 20 boxes from 7.1 and 7.2 to 7.3-RELEASE-p2 (all amd64)
and now are experiencing some weird behaviour on 6 of them with rsnapshot:
after a few days/several weeks (seems to be completely random),
rsnapshot reports that it can't start due to its lockfile and process
still being present. on such boxes either a zombie rm or find process
(which presumably were launched by rsnapshot) can be found.
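
A minimal sketch of how the lockfile state described above could be checked from a script. It assumes the lockfile holds the PID of the rsnapshot process as plain text; the function name `lock_holder_state` is hypothetical and not part of rsnapshot. A zombie or a process blocked inside the kernel still exists as far as `kill(pid, 0)` is concerned, so it is reported as running, which matches what we see on the affected boxes.

```python
import os


def lock_holder_state(lockfile):
    """Classify a PID lockfile (assumed format: one PID as plain text).

    Returns "no-lock", "unreadable", "stale" (PID no longer exists)
    or "running" (PID exists, including zombies and blocked processes).
    """
    try:
        with open(lockfile) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return "no-lock"
    except ValueError:
        return "unreadable"
    if pid <= 0:
        # 0 and negative values address process groups, not a single process
        return "unreadable"
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return "stale"
    except PermissionError:
        return "running"  # process exists but belongs to another user
    return "running"
```

It would be called with the path configured as `lockfile` in rsnapshot.conf; a result of "running" long after the backup window points at a hung rm or find rather than a leftover file.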
if the backup was done to a separate partition (physical disks or RAIDs),
any access (ls, stat, fsck, etc) to the partition would kill the current
SSH session and leave the process one had just started as a new zombie.
unmounting the affected partition would render the server completely
unresponsive and require a hardware reset.
when trying to restart, the machines wouldn't even shut down completely
but hung somewhere after syncing buffers; only a hardware reset
worked. after the reboot, those partitions were unmounted and fscked,
after which the backups would work again until the next error happened.
the hardware of affected and unaffected systems is:
HP ProLiant DL380 G4
HP ProLiant DL380 G5
HP ProLiant DL360 G5
there is no visible pattern between affected and unaffected boxes. also
those machines were upgraded the exact same way, running identical
kernels (more or less GENERIC, with QUOTA activated).
we upgraded the most critical boxes, which showed that behaviour on a
daily basis, to 8.0-RELEASE, and the behaviour has not reappeared
for nearly 3 months now.
we installed a debug-kernel on an affected box, but the machine wouldn't
panic when the error occurred. when trying to unmount the affected
partition it just went completely unresponsive, as mentioned above.
before trying to unmount, procstat -ak showed some processes with
stacks like these:
55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep acquire
_lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup
vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat lstat syscall
70923 100146 rsync - mi_switch sleepq_switch sleepq_wait _sleep acquire
_lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget vfs_hash_get ffs_vgetf
ufs_lookup_ vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat
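
To pick threads like the two above out of a longer listing, something along these lines might help. It is only a sketch: it assumes each record starts with a numeric PID and TID followed by single-token COMM and TDNAME fields, and that wrapped lines (as in this mail) continue the previous stack. The function names and the choice of `_lockmgr` and `ffs_lock` as markers are our own.

```python
import re

# A record starts with "PID TID COMM TDNAME"; anything after that is stack frames.
RECORD_START = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s*(.*)$")


def parse_kstacks(text):
    """Parse procstat -ak style text into (pid, tid, comm, frames) tuples."""
    records = []
    for line in text.splitlines():
        if line.split()[:2] == ["PID", "TID"]:
            continue  # column header
        m = RECORD_START.match(line)
        if m:
            pid, tid, comm, _tdname, rest = m.groups()
            records.append((int(pid), int(tid), comm, rest.split()))
        elif records and line.strip():
            # wrapped continuation line: more frames for the last record
            records[-1][3].extend(line.split())
    return records


def blocked_on_vnode_lock(records, markers=("_lockmgr", "ffs_lock")):
    """Return the records whose stack contains every marker frame."""
    return [r for r in records if all(mk in r[3] for mk in markers)]
```

Feeding it the saved output of procstat -ak and printing the PID and command of each returned record gives a quick list of processes sleeping on a UFS vnode lock.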
since this hardware worked before 7.3 and, as we assume, would work
again with 8.*, we would be grateful for any hints as to what could
be the cause of all this.
More information about the freebsd-stable