Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Mon Oct 1 20:00:42 UTC 2012

On Wednesday, June 6, 2012 8:36:04 PM UTC-5, Mark Felder wrote:
> Hi guys I'm excitedly posting this from my phone. Good news for you guys, bad news for us -- we were building HA storage on vmware for a client and can now replicate the crash on demand. I'll be posting details when I get home to my PC tonight, but this hopefully is enough to replicate the crash for any curious followers:
> 
> ESXi 5
> 9 or 9-STABLE
> HAST
> 1 cpu is fine
> 1GB of ram
> UFS SUJ on HAST device
> No special loader.conf, sysctl, etc
> No need for VMWare tools
> Run Bonnie++ on the HAST device
> 
> We can get the crash to happen on the first run of bonnie++ right now. I'll post the exact specs and precise command run in the PR. We found an old post from 2004 when we looked up the process state obtained from CTRL+T -- flswai -- which describes the symptoms nearly perfectly.
> 
> http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2004-02/0250.html
> 
> Hopefully this gets us closer to a fix...

Is this a crash or a hang? For the past couple of weeks I've been running a FreeBSD 9.1-RC1 system under VMware ESXi 5.0 with a 64GB UFS root filesystem and a 2TB ZFS filesystem attached through a virtual LSI SAS controller (mpt0). Sometimes, under heavy I/O load on the ZFS filesystem (rsync from other servers), this shows up in /var/log/messages:

Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 5 ee 60 16 0 1 0 0 
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): Retrying command
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 3 ef 42 51 0 1 0 0 
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 02:18:44 backups kernel: (da1:mpt0:0:1:0): Retrying command
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 3 ef 64 51 0 1 0 0 
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 02:18:48 backups kernel: (da1:mpt0:0:1:0): Retrying command
Sep 21 02:18:49 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 3 ef 66 51 0 1 0 0 
Sep 21 02:18:49 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:18:49 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
...
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 41 f3 94 99 0 1 0 0 
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 05:06:18 backups kernel: (da1:mpt0:0:1:0): Retrying command

These have been happening roughly every other day.
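
For anyone who wants to tally these or see where on the disk the retried writes land, here is a minimal Python sketch that decodes the WRITE(10) CDB bytes and counts the Busy statuses per day. It assumes the log lines look exactly like the excerpt above; the names decode_write10 and summarize are just mine for illustration, and the sample is the first entry from the log. In a WRITE(10) CDB, bytes 2-5 are the big-endian LBA and bytes 7-8 are the transfer length in blocks.

```python
import re
from collections import Counter

# Matches the CAM "WRITE(10). CDB:" lines in the format shown in the excerpt.
CDB_RE = re.compile(
    r"\((?P<dev>da\d+):[^)]*\): WRITE\(10\)\. CDB: (?P<cdb>[0-9a-f ]+)"
)

def decode_write10(cdb_text):
    """Return (lba, blocks) from a space-separated WRITE(10) CDB string."""
    cdb = bytes(int(tok, 16) for tok in cdb_text.split())
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks

def summarize(lines):
    """Collect decoded writes and count 'SCSI status: Busy' lines per day."""
    writes = []
    busy_per_day = Counter()
    for line in lines:
        m = CDB_RE.search(line)
        if m:
            writes.append((m.group("dev"),) + decode_write10(m.group("cdb")))
        elif "SCSI status: Busy" in line:
            busy_per_day[line[:6]] += 1       # syslog prefix, e.g. "Sep 21"
    return writes, busy_per_day

# First entry from the log above, used as sample input.
SAMPLE = """\
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): WRITE(10). CDB: 2a 0 5 ee 60 16 0 1 0 0 
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): CAM status: SCSI Status Error
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): SCSI status: Busy
Sep 21 02:14:55 backups kernel: (da1:mpt0:0:1:0): Retrying command
"""

if __name__ == "__main__":
    writes, busy_per_day = summarize(SAMPLE.splitlines())
    for dev, lba, blocks in writes:
        print(f"{dev}: lba=0x{lba:08x} blocks={blocks}")
    print(dict(busy_per_day))
```

Every CDB in the excerpt has a transfer length of 0x0100, i.e. 256 blocks, which would be 128 KiB per write if the virtual disk uses 512-byte sectors (an assumption on my part). To run it against the real log, pass open("/var/log/messages") to summarize() instead of the sample.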

mpt0 and em0 were sharing IRQ 18, so today I put
hint.mpt.0.msi_enable="1"
into /boot/device.hints and rebooted; now mpt0 is using IRQ 256. I'll see if it helps.

Guy