VIA Sata Problem maybe?

Paul-Kenji Cahier deathwolf at gmail.com
Fri Sep 29 12:40:22 PDT 2006


Hello,

I am currently experiencing some technical problems with one of my remote servers running 6.1-RELEASE(#0: Sun May  7 04:32:43 UTC 2006     root at opus.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  i386) with the default kernel.
Configuration is the following: http://83.149.102.206/lspci.txt

The problem is basically the following:
Every few days, the server crashes. Sometimes it reboots, sometimes it doesnt. In all cases, there's not a single trace of anything in logs(not even a single word in /var/log/messages).
The problem usually worsens after it auto-reboots itself as it runs an fsck at the end of the boot time. That fsck runs, and most of the time it crashes again, causing a reboot or a complete freeze, then another reboot, etc. Once the system freezes, my only option
is(since i dont have local access to the server) to remote reboot on a rescue os that allows me to set my rescue freebsd partition as main, and get a rescue freebsd. Once in, i just fsck the file system until it is marked clean and reboot. And everything goes fine...
for a few days.

Now what i tried was, after an unwanted reboot, to monitor the logs of fsck running to see what made it crash.

Here is what i got from tail -f /var/log/messages(ad4s2f is the /usr partition that has most of the data):
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=2615093 (20 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=2615094 (20 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=2615095 (20 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=2615096 (20 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=2615097 (20 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=2615098 (20 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=2615099 (20 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=2615100 (16 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=3674361 (12 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=14994399 (4 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=16016109 (6624 should be 320) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=16016110 (12 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=16016111 (4352 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: INCORRECT BLOCK COUNT I=16016112 (8 should be 0) (CORRECTED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: UNREF FILE I=6  OWNER=root MODE=100400
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: SIZE=151445575864 MTIME=Sep 29 19:02 2006  (CLEARED)
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: ALLOCATED FRAGS 431136-431167 MARKED FREE
Sep 29 19:31:54 arcueid fsck:
Sep 29 19:31:54 arcueid fsck: /dev/ad4s2f: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY.
And then the server went more or less haywire and crashed(no other input in /var/log/messages though)


I also monitored the fsck process to see how it was running:
<19:15:57> root at arcueid:~$ while (true); do echo `date` `ps aux | grep fsck | grep ad4`; sleep 2; done
<...>
Fri Sep 29 19:29:36 CEST 2006 root 777 0.7 1.8 19092 18612 ?? DN 7:11PM 0:21.07 fsck_ufs: /dev/ad4s2f p2 92% (fsck_ufs)
Fri Sep 29 19:29:38 CEST 2006 root 777 0.6 1.8 19092 18612 ?? DN 7:11PM 0:21.09 fsck_ufs: /dev/ad4s2f p2 92% (fsck_ufs)
Fri Sep 29 19:29:40 CEST 2006 root 777 0.6 1.8 19092 18612 ?? RN 7:11PM 0:21.12 fsck_ufs: /dev/ad4s2f p2 92% (fsck_ufs)
Fri Sep 29 19:29:42 CEST 2006 root 777 0.6 1.8 19092 18612 ?? RN 7:11PM 0:21.14 fsck_ufs: /dev/ad4s2f p2 94% (fsck_ufs)
Fri Sep 29 19:29:44 CEST 2006 root 777 0.6 1.8 19092 18612 ?? SN 7:11PM 0:21.17 fsck_ufs: /dev/ad4s2f p2 94% (fsck_ufs)
Fri Sep 29 19:29:46 CEST 2006 root 777 0.6 1.8 19092 18612 ?? DN 7:11PM 0:21.19 fsck_ufs: /dev/ad4s2f p2 95% (fsck_ufs)
Fri Sep 29 19:29:48 CEST 2006 root 777 0.6 1.8 19092 18612 ?? DN 7:11PM 0:21.21 fsck_ufs: /dev/ad4s2f p2 95% (fsck_ufs)
Fri Sep 29 19:29:50 CEST 2006 root 777 0.6 1.8 19092 18612 ?? SN 7:11PM 0:21.25 fsck_ufs: /dev/ad4s2f p2 95% (fsck_ufs)
Fri Sep 29 19:29:52 CEST 2006 root 777 0.5 1.8 19092 18612 ?? DN 7:11PM 0:21.29 fsck_ufs: /dev/ad4s2f p2 96% (fsck_ufs)
Fri Sep 29 19:29:54 CEST 2006 root 777 0.8 1.8 19092 18612 ?? SN 7:11PM 0:21.35 fsck_ufs: /dev/ad4s2f p2 96% (fsck_ufs)
Fri Sep 29 19:29:56 CEST 2006 root 777 0.8 1.8 19092 18612 ?? SN 7:11PM 0:21.38 fsck_ufs: /dev/ad4s2f p2 96% (fsck_ufs)
Fri Sep 29 19:29:58 CEST 2006 root 777 0.8 1.8 19092 18612 ?? DN 7:11PM 0:21.40 fsck_ufs: /dev/ad4s2f p2 96% (fsck_ufs)
Fri Sep 29 19:30:00 CEST 2006 root 777 0.7 1.8 19092 18612 ?? DN 7:11PM 0:21.43 fsck_ufs: /dev/ad4s2f p2 96% (fsck_ufs)
Fri Sep 29 19:30:02 CEST 2006 root 777 0.7 1.8 19092 18612 ?? SN 7:11PM 0:21.47 fsck_ufs: /dev/ad4s2f p2 97% (fsck_ufs)
Fri Sep 29 19:30:04 CEST 2006 root 777 0.6 1.8 19092 18612 ?? SN 7:11PM 0:21.49 fsck_ufs: /dev/ad4s2f p2 97% (fsck_ufs)
Fri Sep 29 19:30:06 CEST 2006 root 777 0.5 1.8 19092 18612 ?? DN 7:11PM 0:21.51 fsck_ufs: /dev/ad4s2f p2 97% (fsck_ufs)
Fri Sep 29 19:30:08 CEST 2006 root 777 0.4 1.8 19092 18612 ?? DN 7:11PM 0:21.53 fsck_ufs: /dev/ad4s2f p2 97% (fsck_ufs)
Fri Sep 29 19:30:10 CEST 2006 root 777 0.6 1.8 19092 18612 ?? DN 7:11PM 0:21.57 fsck_ufs: /dev/ad4s2f p2 97% (fsck_ufs)
Fri Sep 29 19:30:12 CEST 2006 root 777 0.7 1.8 19092 18612 ?? SN 7:11PM 0:21.59 fsck_ufs: /dev/ad4s2f p2 98% (fsck_ufs)
Fri Sep 29 19:30:14 CEST 2006 root 777 0.6 1.8 19092 18612 ?? SN 7:11PM 0:21.59 fsck_ufs: /dev/ad4s2f p2 98% (fsck_ufs)
Fri Sep 29 19:30:16 CEST 2006 root 777 0.5 1.8 19092 18612 ?? DN 7:11PM 0:21.61 fsck_ufs: /dev/ad4s2f p2 98% (fsck_ufs)
Fri Sep 29 19:30:18 CEST 2006 root 777 0.4 1.8 19092 18612 ?? SN 7:11PM 0:21.62 fsck_ufs: /dev/ad4s2f p2 98% (fsck_ufs)
Fri Sep 29 19:30:20 CEST 2006 root 777 0.4 1.8 19092 18612 ?? SN 7:11PM 0:21.63 fsck_ufs: /dev/ad4s2f p2 99% (fsck_ufs)
Fri Sep 29 19:30:23 CEST 2006 root 777 0.3 1.8 19092 18612 ?? DN 7:11PM 0:21.64 fsck_ufs: /dev/ad4s2f p2 99% (fsck_ufs)
Fri Sep 29 19:30:25 CEST 2006 root 777 0.2 1.8 19092 18612 ?? DN 7:11PM 0:21.66 fsck_ufs: /dev/ad4s2f p2 99% (fsck_ufs)
Fri Sep 29 19:30:27 CEST 2006 root 777 0.1 1.8 19092 18612 ?? SN 7:11PM 0:21.68 fsck_ufs: /dev/ad4s2f p2 99% (fsck_ufs)
Fri Sep 29 19:30:29 CEST 2006 root 777 0.0 1.8 19092 18612 ?? SN 7:11PM 0:21.70 fsck_ufs: /dev/ad4s2f p2 99% (fsck_ufs)
Fri Sep 29 19:31:12 CEST 2006 root 777 0.0 1.8 19092 18724 ?? RN 7:11PM 0:22.17 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:14 CEST 2006 root 777 0.9 1.8 19092 18748 ?? DN 7:11PM 0:22.34 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:16 CEST 2006 root 777 0.9 1.8 19092 18748 ?? DN 7:11PM 0:22.35 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:18 CEST 2006 root 777 0.7 1.8 19092 18748 ?? DN 7:11PM 0:22.37 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:20 CEST 2006 root 777 0.7 1.8 19092 18748 ?? DN 7:11PM 0:22.40 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:22 CEST 2006 root 777 0.6 1.8 19092 18748 ?? DN 7:11PM 0:22.40 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:25 CEST 2006 root 777 0.5 1.8 19092 18748 ?? DN 7:11PM 0:22.42 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:27 CEST 2006 root 777 0.5 1.8 19092 18748 ?? DN 7:11PM 0:22.45 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:29 CEST 2006 root 777 0.4 1.8 19092 18748 ?? DN 7:11PM 0:22.46 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:31 CEST 2006 root 777 0.4 1.8 19092 18748 ?? DN 7:11PM 0:22.48 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:33 CEST 2006 root 777 0.4 1.8 19092 18748 ?? DN 7:11PM 0:22.50 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:35 CEST 2006 root 777 0.3 1.8 19092 18748 ?? DN 7:11PM 0:22.51 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:37 CEST 2006 root 777 0.2 1.8 19092 18748 ?? DN 7:11PM 0:22.53 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:39 CEST 2006 root 777 0.1 1.8 19092 18748 ?? DN 7:11PM 0:22.55 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:41 CEST 2006 root 777 0.2 1.8 19092 18748 ?? DN 7:11PM 0:22.57 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:43 CEST 2006 root 777 0.1 1.8 19092 18748 ?? DN 7:11PM 0:22.57 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:45 CEST 2006 root 777 0.0 1.8 19092 18748 ?? DN 7:11PM 0:22.58 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:47 CEST 2006 root 777 0.0 1.8 19092 18748 ?? DN 7:11PM 0:22.59 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:49 CEST 2006 root 777 0.0 1.8 19092 18748 ?? DN 7:11PM 0:22.60 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:51 CEST 2006 root 777 0.0 1.8 19092 18748 ?? DN 7:11PM 0:22.61 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:53 CEST 2006 root 777 0.0 1.8 19092 18748 ?? DN 7:11PM 0:22.61 fsck_ufs: /dev/ad4s2f p4 0% (fsck_ufs)
Fri Sep 29 19:31:55 CEST 2006
Fri Sep 29 19:31:57 CEST 2006
Fri Sep 29 19:31:59 CEST 2006
Fri Sep 29 19:32:01 CEST 2006
Fri Sep 29 19:32:03 CEST 2006
Fri Sep 29 19:32:05 CEST 2006
Fri Sep 29 19:32:07 CEST 2006
Fri Sep 29 19:32:09 CEST 2006
Fri Sep 29 19:32:11 CEST 2006
Fri Sep 29 19:32:13 CEST 2006
Fri Sep 29 19:32:15 CEST 2006
Fri Sep 29 19:32:17 CEST 2006
Fri Sep 29 19:32:19 CEST 2006
Fri Sep 29 19:32:21 CEST 2006
Fri Sep 29 19:32:23 CEST 2006
Fri Sep 29 19:32:25 CEST 2006
<boom> 
As you can see, at the same time as softupdate reported a problem, the whole system wouldnt work anymore, process wouldnt display properly anymore either.


RAM was extensively tested and is perfectly fine.
The power supply unit seems to be running really fine too, with no particular events.
The hard drive itself was tested extensively using both smart values and writing/reading from a linux setup and had not a single fault or alarming value.


Another quite surprising event is the following:
After a crash, i went into the rescue freebsd and as usual started an fsck. Result was:
arcueid-rescue# fsck -y /dev/ad4s2f
** /dev/ad4s2f
** Last Mounted on /usr
** Phase 1 - Check Blocks and Sizes

** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
1362084 files, 47944480 used, 23676944 free (1752392 frags, 2740569 blocks, 2.4% fragmentation)

***** FILE SYSTEM MARKED CLEAN *****


Just for the fun i ran it a second time just after:
arcueid-rescue# fsck -y /dev/ad4s2f
** /dev/ad4s2f
** Last Mounted on /usr
** Phase 1 - Check Blocks and Sizes


CANNOT READ BLK: 190083360
UNEXPECTED SOFT UPDATE INCONSISTENCY

CONTINUE? yes

THE FOLLOWING DISK SECTORS COULD NOT BE READ: 190083360, 190083361, 190083362, 190083363, 190083364, 190083365, 190083366, 190083367, 190083368, 190083369, 190083370, 190083371, 190083372, 190083373, 190083374, 190083375, 190083376, 190083377, 190083378, 190083379,
190083380, 190083381, 190083382, 190083383, 190083384, 190083385, 190083386, 190083387, 190083388, 190083389, 190083390, 190083391,
INCORRECT BLOCK COUNT I=18375172 (640 should be 416)
CORRECT? yes


CANNOT READ BLK: 293557088
UNEXPECTED SOFT UPDATE INCONSISTENCY

CONTINUE? yes

THE FOLLOWING DISK SECTORS COULD NOT BE READ: 293557088, 293557089, 293557090, 293557091, 293557092, 293557093, 293557094, 293557095, 293557096, 293557097, 293557098, 293557099, 293557100, 293557101, 293557102, 293557103, 293557104, 293557105, 293557106, 293557107,
293557108, 293557109, 293557110, 293557111, 293557112, 293557113, 293557114, 293557115, 293557116, 293557117, 293557118, 293557119,

CANNOT READ BLK: 190090784
UNEXPECTED SOFT UPDATE INCONSISTENCY

CONTINUE? yes

THE FOLLOWING DISK SECTORS COULD NOT BE READ: 190090784, 190090785, 190090786, 190090787, 190090788, 190090789, 190090790, 190090791, 190090792, 190090793, 190090794, 190090795, 190090796, 190090797, 190090798, 190090799, 190090800, 190090801, 190090802, 190090803,
190090804, 190090805, 190090806, 190090807, 190090808, 190090809, 190090810, 190090811, 190090812, 190090813, 190090814, 190090815,
INCORRECT BLOCK COUNT I=18375174 (480 should be 416)
CORRECT? yes


CANNOT READ BLK: 190122624
UNEXPECTED SOFT UPDATE INCONSISTENCY

CONTINUE? yes

THE FOLLOWING DISK SECTORS COULD NOT BE READ: 190122624, 190122625, 190122626, 190122627, 190122628, 190122629, 190122630, 190122631, 190122632, 190122633, 190122634, 190122635, 190122636, 190122637, 190122638, 190122639, 190122640, 190122641, 190122642, 190122643,
190122644, 190122645, 190122646, 190122647, 190122648, 190122649, 190122650, 190122651, 190122652, 190122653, 190122654, 190122655,
INCORRECT BLOCK COUNT I=18375178 (512 should be 416)
CORRECT? yes


CANNOT READ BLK: 190422528
UNEXPECTED SOFT UPDATE INCONSISTENCY

CONTINUE? yes
[etc]

And then it crashed....
Again i did the sata tests numerous times and the drive seems
perfectly healthy... So i have to think it might be driver related
rather than hardware related.


Any help would be appreciated,

-- 
Best regards,
 Paul-Kenji  



More information about the freebsd-hackers mailing list