Problems since 5.3-RELEASE-p15

Tuc at T-B-O-H ml at t-b-o-h.net
Tue Jul 12 17:16:10 GMT 2005


> However, if you run ktrace on a simple program like ls: ktrace ls
> then do a kdump | less, you will see that after finding ls, 
> /libexec/ld-elf.so.1 is the first thing accessed.  So, when things start 
> working again, /libexec/ld-elf.so.1 is magically fixed, which just makes 
> no sense.
>
	Imagine if it makes no sense to you how *I* must be feeling!
> 
> I assume there are no other obvious error messages in /var/log/messages?
>
	Nope, I have debug turned all the way up... And just out of the blue
on the 9th at 3am I see :

Jul  9 03:01:30 himinbjorg kernel: pid 49967 (mailwrapper), uid 0: exited on signal 11 (core dumped)
Jul  9 03:01:30 himinbjorg kernel: Jul  9 03:01:30 himinbjorg kernel: pid 49967 (mailwrapper), uid 0: exited on signal 11 (core dumped)

	and then it goes downhill from there:

Jul  9 03:01:31 himinbjorg kernel: pid 49969 (sendmail), uid 25: exited on signal 11
Jul  9 03:01:31 himinbjorg kernel: Jul  9 03:01:31 himinbjorg kernel: pid 49969 (sendmail), uid 25: exited on signal 11
Jul  9 03:10:02 himinbjorg kernel: pid 50135 (rmail), uid 66: exited on signal 11 (core dumped)
Jul  9 03:10:02 himinbjorg kernel: Jul  9 03:10:02 himinbjorg kernel: pid 50135 (rmail), uid 66: exited on signal 11 (core dumped)
Jul  9 03:10:04 himinbjorg kernel: pid 50151 (uuxqt), uid 66: exited on signal 11
Jul  9 03:10:04 himinbjorg kernel: Jul  9 03:10:04 himinbjorg kernel: pid 50151 (uuxqt), uid 66: exited on signal 11
Jul  9 03:10:05 himinbjorg kernel: pid 50152 (sh), uid 0: exited on signal 11 (core dumped)
Jul  9 03:10:05 himinbjorg kernel: Jul  9 03:10:05 himinbjorg kernel: pid 50152 (sh), uid 0: exited on signal 11 (core dumped)
Jul  9 03:12:01 himinbjorg kernel: pid 50191 (sleep), uid 0: exited on signal 11 (core dumped)
Jul  9 03:12:01 himinbjorg kernel: pid 50192 (mailwrapper), uid 0: exited on signal 11 (core dumped)
Jul  9 03:12:01 himinbjorg kernel: pid 50193 (stunnel), uid 0: exited on signal 11 (core dumped)
Jul  9 03:12:01 himinbjorg kernel: pid 50194 (uuxqt), uid 66: exited on signal 11
Jul  9 03:12:01 himinbjorg kernel: Jul  9 03:12:01 himinbjorg kernel: pid 50191 (sleep), uid 0: exited on signal 11 (core dumped)
Jul  9 03:12:01 himinbjorg kernel: Jul  9 03:12:01 himinbjorg kernel: pid 50192 (mailwrapper), uid 0: exited on signal 11 (core dumped)
Jul  9 03:12:01 himinbjorg kernel: Jul  9 03:12:01 himinbjorg kernel: pid 50193 (stunnel), uid 0: exited on signal 11 (core dumped)
Jul  9 03:12:01 himinbjorg kernel: Jul  9 03:12:01 himinbjorg kernel: pid 50194 (uuxqt), uid 66: exited on signal 11
Jul  9 03:12:01 himinbjorg kernel: pid 50195 (sleep), uid 0: exited on signal 11 (core dumped)
Jul  9 03:12:01 himinbjorg kernel: pid 50196 (mailwrapper), uid 0: exited on signal 11 (core dumped)
Jul  9 03:12:01 himinbjorg kernel: Jul  9 03:12:01 himinbjorg kernel: pid 50195 (sleep), uid 0: exited on signal 11 (core dumped)
Jul  9 03:12:01 himinbjorg kernel: Jul  9 03:12:01 himinbjorg kernel: pid 50196 (mailwrapper), uid 0: exited on signal 11 (core dumped)
Jul  9 03:15:04 himinbjorg kernel: pid 50229 (sendmail), uid 0: exited on signal 11

	and it keeps happening until...........

Jul  9 03:55:00 himinbjorg kernel: Jul  9 03:55:00 himinbjorg kernel: pid 50481 (sh), uid 0: exited on signal 11 (core dumped)
Jul  9 03:55:00 himinbjorg kernel: Jul  9 03:55:00 himinbjorg kernel: pid 50482 (sh), uid 0: exited on signal 11 (core dumped)
Jul  9 03:57:00 himinbjorg kernel: pid 50484 (sh), uid 0: exited on signal 11 (core dumped)
Jul  9 03:57:00 himinbjorg kernel: Jul  9 03:57:00 himinbjorg kernel: pid 50484 (sh), uid 0: exited on signal 11 (core dumped)


	and then everything is fine again.
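
	For anyone who wants to boil a burst like this down to "which programs
died, and between what times", here is a rough Python sketch. It is only an
illustration, not something that was run on this machine. It assumes each
record sits on a single line, as in the log file itself, and it counts each
(timestamp, pid, program) once because every event shows up twice here. The
names CRASH_RE and summarize_crashes are made up for the example.

```python
import re
from collections import Counter

# Matches kernel records such as:
#   "Jul  9 03:01:30 host kernel: pid 49967 (mailwrapper), uid 0: exited on signal 11 (core dumped)"
# The lazy ".*?" also skips the repeated "date host kernel:" prefix seen on
# the doubled lines.
CRASH_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:.*?"
    r"pid (?P<pid>\d+) \((?P<prog>[^)]+)\), uid (?P<uid>\d+): "
    r"exited on signal (?P<sig>\d+)"
)


def summarize_crashes(lines, signal=11):
    """Return (crashes per program, first timestamp, last timestamp).

    Assumption: the same (timestamp, pid, program) seen twice is one event
    that syslog happened to record twice.
    """
    seen = {}  # (stamp, pid, prog) -> True; a dict keeps first-seen order
    for line in lines:
        m = CRASH_RE.search(line)
        if m is None or int(m.group("sig")) != signal:
            continue
        seen[(m.group("stamp"), m.group("pid"), m.group("prog"))] = True
    counts = Counter(prog for _, _, prog in seen)
    stamps = [stamp for stamp, _, _ in seen]
    first = stamps[0] if stamps else None
    last = stamps[-1] if stamps else None
    return counts, first, last
```

	Feeding it the lines of /var/log/messages (for example
summarize_crashes(open("/var/log/messages"))) gives a per-program count plus
the first and last timestamp of the burst, which makes it easier to line the
window up against cron jobs or anything else scheduled around that time.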

	I hadn't been home since the morning of the 8th, and if I'm up at 3am
it's usually NOT a good thing.... So it wasn't like I was here doing anything
at the time.
>
> One final thought is that it could be the disk.  You could try 
> installing smartmontools and see if the disk thinks it is OK -- though 
> of course it could be the controller.  But in such a case I might expect 
> other errors.
>
	Already had that installed....

smartctl version 5.33 [i386-portbld-freebsd5.3] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     FUJITSU MHT2060AH
Serial Number:    NP0DT512J16R
Firmware Version: 006C
User Capacity:    60,011,642,880 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 3a
Local Time is:    Tue Jul 12 13:14:56 2005 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 440) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					No General Purpose Logging support.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  60) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   046    Pre-fail  Always       -       6586
  2 Throughput_Performance  0x0005   100   100   030    Pre-fail  Offline      -       19464192
  3 Spin_Up_Time            0x0003   100   100   025    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       89
  5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always       -       8589934592000
  7 Seek_Error_Rate         0x000f   100   100   047    Pre-fail  Always       -       1434
  8 Seek_Time_Performance   0x0005   100   100   019    Pre-fail  Offline      -       0
  9 Power_On_Seconds        0x0032   094   094   000    Old_age   Always       -       3468h+21m+46s
 10 Spin_Retry_Count        0x0013   100   100   020    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       78
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       57
193 Load_Cycle_Count        0x0032   092   092   000    Old_age   Always       -       80308
194 Temperature_Celsius     0x0022   100   065   000    Old_age   Always       -       56 (Lifetime Min/Max 15/67)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       283049984
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000f   100   100   060    Pre-fail  Always       -       29362
203 Run_Out_Cancel          0x0002   100   100   000    Old_age   Always       -       1529016811656

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3457         -
# 2  Short offline       Completed without error       00%      3433         -
# 3  Short offline       Completed without error       00%      3409         -
# 4  Extended offline    Completed without error       00%      3386         -
# 5  Short offline       Completed without error       00%      3385         -
# 6  Short offline       Completed without error       00%      3361         -
# 7  Short offline       Completed without error       00%      3312         -
# 8  Short offline       Completed without error       00%      3302         -
# 9  Short offline       Completed without error       00%      3278         -
#10  Extended offline    Completed without error       00%      3256         -
#11  Short offline       Completed without error       00%      3254         -
#12  Short offline       Completed without error       00%      3233         -
#13  Short offline       Completed without error       00%      3216         -
#14  Short offline       Completed without error       00%      3192         -
#15  Short offline       Completed without error       00%      3168         -
#16  Short offline       Completed without error       00%      3144         -
#17  Short offline       Completed without error       00%      3120         -
#18  Extended offline    Completed without error       00%      3098         -
#19  Short offline       Completed without error       00%      3096         -
#20  Short offline       Completed without error       00%      3072         -
#21  Short offline       Completed without error       00%      3048         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
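
	As a quick way to read an attribute table like the one above: the
normalized VALUE and WORST columns are what get compared against THRESH,
while the RAW_VALUE column is vendor-specific and can look alarming without
meaning much. The Python sketch below parses rows in this layout and lists
any attribute whose VALUE or WORST has reached its threshold. It is an
illustration under assumptions: the function names are made up, and rows with
a threshold of 0 are skipped on the assumption that they are informational
only.

```python
import re

# One attribute row of `smartctl -A` style output, e.g.
#   "  1 Raw_Read_Error_Rate     0x000f   100   100   046    Pre-fail  Always       -       6586"
ROW_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>.+)$"
)


def parse_attributes(text):
    """Parse attribute rows into dicts; lines in any other format are ignored."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m is None:
            continue
        row = m.groupdict()
        for key in ("id", "value", "worst", "thresh"):
            row[key] = int(row[key])  # "046" -> 46
        rows.append(row)
    return rows


def at_or_below_threshold(rows):
    """Names of attributes whose VALUE or WORST is at or below THRESH.

    Assumption: a threshold of 0 means the attribute never trips, so those
    rows are skipped.
    """
    return [
        row["name"]
        for row in rows
        if row["thresh"] > 0 and min(row["value"], row["worst"]) <= row["thresh"]
    ]
```

	Run against the table above, this should come back with an empty list,
since every VALUE and WORST there is well above its threshold, which matches
the PASSED overall assessment.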
>
> The classic bad memory symptoms are periodic reboots, and random 
> segmentation faults.  The latter is particular to bad memory, I would 
> say.  The former can have other causes.
> 
> It's the intermittent nature of the fault that really makes me think 
> hardware.  Am I right in remembering that you upgraded to 5.4 and still 
> had the same problems?
> 
	Correct... 

FreeBSD himinbjorg.tucs-beachin-obx-house.com 5.4-RELEASE-p2 FreeBSD 5.4-RELEASE-p2 #2: Tue Jun 21 00:52:02 EDT 2005     root at himinbjorg.tucs-beachin-obx-house.com:/usr/obj/usr/src/sys/HIMINBJORG53  i386


		Tuc

