Multiple errors on server -- Where do I start looking?

Mon Feb 6 16:35:57 UTC 2012

I've run into some error messages on my server that are beyond my ability to interpret, so I'm hoping some of you can help me out. I've already posted this on the forums at http://forums.freebsd.org/showthread.php?p=165258#post165258 but since this is affecting our business, I'm trying to reach a broader audience and hopefully get this resolved.

We have an Intel modular blade server. The chassis has two 3-disk RAID 5 arrays. Volume 1 holds the OS (FreeBSD 7.2) and Volume 2 is mounted at /usr. These two volumes are da0 and da1.

I got email notifications saying the web host I run in a jail on this server was down. I tried to SSH into it, but that failed. I pinged it and got a 50% return rate. So I logged in to the management blade and started a virtual KVM session to get into the blade. Once I was on the base host blade, I ran cat on dmesg.today and got a slew of errors. Here is the first block:
(da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
(da3:mpt0:0:6:1): Retrying Command (per Sense Data)
(da3:mpt0:0:6:1): READ(10). CDB: 28 0 0 0 0 0 0 0 1 0
(da3:mpt0:0:6:1): CAM Status: SCSI Status Error
(da3:mpt0:0:6:1): SCSI Status: Check Condition
(da3:mpt0:0:6:1): ILLEGAL REQUEST asc:4,b
(da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
(da3:mpt0:0:6:1): Retrying Command (per Sense Data)
(da3:mpt0:0:6:1): READ(10). CDB: 28 0 0 0 0 0 0 0 1 0
(da3:mpt0:0:6:1): CAM Status: SCSI Status Error
(da3:mpt0:0:6:1): SCSI Status: Check Condition
(da3:mpt0:0:6:1): ILLEGAL REQUEST asc:4,b
(da3:mpt0:0:6:1): Logical unit not accessible, target port in standby state
(da3:mpt0:0:6:1): Retries Exhausted

As mentioned before, our two volumes are da0 and da1. /dev lists da2 and da3 as well, but I have no idea what they are. How do I figure out what da3 is, and what do the above error messages say about it? Someone on the forum asked me whether the two volumes are on the same controller, and the answer is yes, they are.
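
For reading the READ(10) and ILLEGAL REQUEST lines above, here is a minimal Python sketch that decodes the CDB bytes and the asc:4,b pair. It assumes the kernel prints the CDB as space-separated hex bytes and that the two values after asc: are the additional sense code and its qualifier in hex. The function names decode_read10 and parse_asc are made up for illustration.

```python
import re


def decode_read10(cdb_text):
    """Decode a READ(10) CDB given as space-separated hex bytes."""
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # READ(10) layout: byte 0 opcode, bytes 2-5 LBA, bytes 7-8 transfer length,
    # both multi-byte fields big-endian.
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return {"opcode": cdb[0], "lba": lba, "blocks": blocks}


def parse_asc(line):
    """Extract (ASC, ASCQ) as integers from a line containing 'asc:X,Y' (assumed hex)."""
    m = re.search(r"asc:([0-9a-fA-F]+),([0-9a-fA-F]+)", line)
    if m is None:
        raise ValueError("no asc:X,Y pair found")
    return int(m.group(1), 16), int(m.group(2), 16)


if __name__ == "__main__":
    print(decode_read10("28 0 0 0 0 0 0 0 1 0"))
    print(parse_asc("(da3:mpt0:0:6:1): ILLEGAL REQUEST asc:4,b"))
```

On the lines quoted above this yields opcode 0x28, LBA 0 and a transfer length of 1 block, so the failing command is a read of the first block of da3. The sense pair decodes to ASC 0x04, ASCQ 0x0B, which is the code behind the "target port in standby state" text the kernel printed.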

GEOM_LABEL: Label for provider da0s1a is ufsid/4aeb03874c64d9f1.
GEOM_LABEL: Label for provider da0s1d is ufsid/4aeb038ae8ae24cf.
GEOM_LABEL: Label for provider da0s1e is ufsid/4aeb0387d999941a.
GEOM_LABEL: Label for provider da0s1f is ufsid/4aeb038766c4c807.
Trying to mount root from ufs:/dev/da0s1a
GEOM_LABEL: Label ufsid/4aeb03874c64d9f1 removed.
GEOM_LABEL: Label for provider da0s1a is ufsid/4aeb03874c64d9f1.
GEOM_LABEL: Label ufsid/4aeb0387d999941a removed.
GEOM_LABEL: Label ufsid/4bd2077f23a6cc93 removed.
GEOM_LABEL: Label for provider da0s1e is ufsid/4aeb0387d999941a.
GEOM_LABEL: Label for provider da1s1 is ufsid/4bd2077f23a6cc93.
GEOM_LABEL: Label ufsid/4aeb038766c4c807 removed.
GEOM_LABEL: Label for provider da0s1f is ufsid/4aeb038766c4c807.
GEOM_LABEL: Label ufsid/4aeb038ae8ae24cf removed.
GEOM_LABEL: Label for provider da0s1d is ufsid/4aeb038ae8ae24cf.
GEOM_LABEL: Label ufsid/4aeb03874c64d9f1 removed.
GEOM_LABEL: Label ufsid/4aeb0387d999941a removed.
GEOM_LABEL: Label ufsid/4aeb038766c4c807 removed.
GEOM_LABEL: Label ufsid/4aeb038ae8ae24cf removed.
GEOM_LABEL: Label ufsid/4bd2077f23a6cc93 removed.

Was root unmounted? What's going on here? Obviously there's some issue with da0, which is mounted at /. The server has been up and running fine, so why am I seeing "Trying to mount root from ufs:/dev/da0s1a"?

pid 93248 (httpd), uid 80: exited on signal 10
pid 95624 (httpd), uid 80: exited on signal 10
pid 97956 (httpd), uid 80: exited on signal 10
pid 97935 (httpd), uid 80: exited on signal 10
pid 96603 (httpd), uid 80: exited on signal 10
pid 93210 (httpd), uid 80: exited on signal 10
pid 98246 (httpd), uid 80: exited on signal 10

This is apparently what's killing our web server: Apache receives a signal 10 (SIGBUS) and quits. Everything I've read says it's caused by Apache trying to access memory that it shouldn't or that doesn't exist. Is there something in the above da0 or da3 errors that would cause a SIGBUS on httpd?
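
To see how many processes died and from which signal, the "exited on signal" lines above can be tallied with a short script. This is only a sketch: the tally_exits name and the small signal table are my own, and the table covers just a few FreeBSD signal numbers (10 is SIGBUS and 11 is SIGSEGV on FreeBSD; other systems number them differently).

```python
import re
from collections import Counter

# Partial FreeBSD signal table, hard-coded so the result does not depend
# on the numbering of the machine running this script.
FREEBSD_SIGNALS = {6: "SIGABRT", 10: "SIGBUS", 11: "SIGSEGV"}

EXIT_RE = re.compile(r"pid (\d+) \(([^)]+)\), uid (\d+): exited on signal (\d+)")


def tally_exits(lines):
    """Count (program, signal name) pairs from kernel 'exited on signal' lines."""
    counts = Counter()
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            prog = m.group(2)
            sig = int(m.group(4))
            # Fall back to the raw number for signals not in the table.
            counts[(prog, FREEBSD_SIGNALS.get(sig, str(sig)))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "pid 93248 (httpd), uid 80: exited on signal 10",
        "pid 95624 (httpd), uid 80: exited on signal 10",
    ]
    for (prog, sig), n in tally_exits(sample).items():
        print(prog, sig, n)
```

Feeding it the full dmesg.today file (for example via open("dmesg.today")) would show whether anything other than httpd was hit and whether every exit was a SIGBUS.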

After that, the log repeats that first block of da3 errors many more times. The server was down for about 10 minutes and then recovered on its own. It's weird because it seems the Apache child processes all get killed by the SIGBUS but the parent process doesn't, so once the problem works itself out, it continues operating as normal without me having to restart the daemon or anything.

The management blade in the server chassis is reporting that all the hardware is fine. We have a second blade that boots off of a second partition in Volume 1 and it doesn't have any problems at all.

I'm at a loss here!

Ryan Merrell
