Problems since 5.3-RELEASE-p15
Tuc at T-B-O-H
ml at t-b-o-h.net
Tue Jul 12 17:16:10 GMT 2005
> However, if you run ktrace on a simple program like ls: ktrace ls
> then do a kdump | less, you will see that after finding ls,
> /libexec/ld-elf.so.1 is the first thing accessed. So, when things start
> working again, /libexec/ld-elf.so.1 is magically fixed, which just makes
> no sense.
>
Imagine if it makes no sense to you how *I* must be feeling!
>
> I assume there are no other obvious error messages in /var/log/messages?
>
Nope, I have debug turned all the way up... And just out of the blue
on the 9th at 3am I see :
Jul 9 03:01:30 himinbjorg kernel: pid 49967 (mailwrapper), uid 0: exited on signal 11 (core dumped)
Jul 9 03:01:30 himinbjorg kernel: Jul 9 03:01:30 himinbjorg kernel: pid 49967 (mailwrapper), uid 0: exited on signal 11 (core dumped)
and then it goes downhill from there:
Jul 9 03:01:31 himinbjorg kernel: pid 49969 (sendmail), uid 25: exited on signal 11
Jul 9 03:01:31 himinbjorg kernel: Jul 9 03:01:31 himinbjorg kernel: pid 49969 (sendmail), uid 25: exited on signal 11
Jul 9 03:10:02 himinbjorg kernel: pid 50135 (rmail), uid 66: exited on signal 11 (core dumped)
Jul 9 03:10:02 himinbjorg kernel: Jul 9 03:10:02 himinbjorg kernel: pid 50135 (rmail), uid 66: exited on signal 11 (core dumped)
Jul 9 03:10:04 himinbjorg kernel: pid 50151 (uuxqt), uid 66: exited on signal 11
Jul 9 03:10:04 himinbjorg kernel: Jul 9 03:10:04 himinbjorg kernel: pid 50151 (uuxqt), uid 66: exited on signal 11
Jul 9 03:10:05 himinbjorg kernel: pid 50152 (sh), uid 0: exited on signal 11 (core dumped)
Jul 9 03:10:05 himinbjorg kernel: Jul 9 03:10:05 himinbjorg kernel: pid 50152 (sh), uid 0: exited on signal 11 (core dumped)
Jul 9 03:12:01 himinbjorg kernel: pid 50191 (sleep), uid 0: exited on signal 11 (core dumped)
Jul 9 03:12:01 himinbjorg kernel: pid 50192 (mailwrapper), uid 0: exited on signal 11 (core dumped)
Jul 9 03:12:01 himinbjorg kernel: pid 50193 (stunnel), uid 0: exited on signal 11 (core dumped)
Jul 9 03:12:01 himinbjorg kernel: pid 50194 (uuxqt), uid 66: exited on signal 11
Jul 9 03:12:01 himinbjorg kernel: Jul 9 03:12:01 himinbjorg kernel: pid 50191 (sleep), uid 0: exited on signal 11 (core dumped)
Jul 9 03:12:01 himinbjorg kernel: Jul 9 03:12:01 himinbjorg kernel: pid 50192 (mailwrapper), uid 0: exited on signal 11 (core dumped)
Jul 9 03:12:01 himinbjorg kernel: Jul 9 03:12:01 himinbjorg kernel: pid 50193 (stunnel), uid 0: exited on signal 11 (core dumped)
Jul 9 03:12:01 himinbjorg kernel: Jul 9 03:12:01 himinbjorg kernel: pid 50194 (uuxqt), uid 66: exited on signal 11
Jul 9 03:12:01 himinbjorg kernel: pid 50195 (sleep), uid 0: exited on signal 11 (core dumped)
Jul 9 03:12:01 himinbjorg kernel: pid 50196 (mailwrapper), uid 0: exited on signal 11 (core dumped)
Jul 9 03:12:01 himinbjorg kernel: Jul 9 03:12:01 himinbjorg kernel: pid 50195 (sleep), uid 0: exited on signal 11 (core dumped)
Jul 9 03:12:01 himinbjorg kernel: Jul 9 03:12:01 himinbjorg kernel: pid 50196 (mailwrapper), uid 0: exited on signal 11 (core dumped)
Jul 9 03:15:04 himinbjorg kernel: pid 50229 (sendmail), uid 0: exited on signal 11
and it keeps happening until...
Jul 9 03:55:00 himinbjorg kernel: Jul 9 03:55:00 himinbjorg kernel: pid 50481 (sh), uid 0: exited on signal 11 (core dumped)
Jul 9 03:55:00 himinbjorg kernel: Jul 9 03:55:00 himinbjorg kernel: pid 50482 (sh), uid 0: exited on signal 11 (core dumped)
Jul 9 03:57:00 himinbjorg kernel: pid 50484 (sh), uid 0: exited on signal 11 (core dumped)
Jul 9 03:57:00 himinbjorg kernel: Jul 9 03:57:00 himinbjorg kernel: pid 50484 (sh), uid 0: exited on signal 11 (core dumped)
and then everything is fine again.
I hadn't been home since the morning of the 8th, and if I'm up at 3am
it's usually NOT a good thing... So it wasn't like I was here doing
anything at the time.
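
To pin down exactly when the bad window starts and ends, something like the following could be run over /var/log/messages. This is only a rough sketch: it assumes each entry sits on one unwrapped line in the raw log, that the doubled "himinbjorg kernel:" prefix is just the same event logged twice, and the names summarize and SIG_RE are made up for illustration.

```python
import re
from collections import Counter

# Matches the kernel's "exited on signal" message. The lazy ".*?" skips a
# repeated "<date> <host> kernel:" prefix when the entry was logged twice.
SIG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:.*?"
    r"pid (?P<pid>\d+) \((?P<prog>[^)]+)\), uid (?P<uid>\d+): "
    r"exited on signal (?P<sig>\d+)"
)


def summarize(lines):
    """Return (crash count per program, first timestamp, last timestamp)."""
    seen = set()        # (timestamp, pid) pairs already counted
    counts = Counter()
    stamps = []
    for line in lines:
        m = SIG_RE.search(line)
        if m is None:
            continue
        key = (m["stamp"], m["pid"])
        if key in seen:  # assumed duplicate of an entry already counted
            continue
        seen.add(key)
        counts[m["prog"]] += 1
        stamps.append(m["stamp"])
    first = stamps[0] if stamps else None
    last = stamps[-1] if stamps else None
    return counts, first, last


if __name__ == "__main__":
    with open("/var/log/messages") as fh:
        counts, first, last = summarize(fh)
    print("window:", first, "to", last)
    for prog, n in counts.most_common():
        print(f"{n:4d}  {prog}")
```

If the window lines up with a cron job or a periodic script, that would at least narrow down what was running when things went bad.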
>
> One final thought is that it could be the disk. You could try
> installing smartmontools and see if the disk thinks it is OK -- though
> of course it could be the controller. But in such a case I might expect
> other errors.
>
Already had that installed....
smartctl version 5.33 [i386-portbld-freebsd5.3] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: FUJITSU MHT2060AH
Serial Number: NP0DT512J16R
Firmware Version: 006C
User Capacity: 60,011,642,880 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 3a
Local Time is: Tue Jul 12 13:14:56 2005 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 440) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  60) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   046    Pre-fail  Always       -       6586
  2 Throughput_Performance  0x0005   100   100   030    Pre-fail  Offline      -       19464192
  3 Spin_Up_Time            0x0003   100   100   025    Pre-fail  Always       -       1
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       89
  5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always       -       8589934592000
  7 Seek_Error_Rate         0x000f   100   100   047    Pre-fail  Always       -       1434
  8 Seek_Time_Performance   0x0005   100   100   019    Pre-fail  Offline      -       0
  9 Power_On_Seconds        0x0032   094   094   000    Old_age   Always       -       3468h+21m+46s
 10 Spin_Retry_Count        0x0013   100   100   020    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       78
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       57
193 Load_Cycle_Count        0x0032   092   092   000    Old_age   Always       -       80308
194 Temperature_Celsius     0x0022   100   065   000    Old_age   Always       -       56 (Lifetime Min/Max 15/67)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       283049984
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000f   100   100   060    Pre-fail  Always       -       29362
203 Run_Out_Cancel          0x0002   100   100   000    Old_age   Always       -       1529016811656
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3457         -
# 2  Short offline       Completed without error       00%      3433         -
# 3  Short offline       Completed without error       00%      3409         -
# 4  Extended offline    Completed without error       00%      3386         -
# 5  Short offline       Completed without error       00%      3385         -
# 6  Short offline       Completed without error       00%      3361         -
# 7  Short offline       Completed without error       00%      3312         -
# 8  Short offline       Completed without error       00%      3302         -
# 9  Short offline       Completed without error       00%      3278         -
#10  Extended offline    Completed without error       00%      3256         -
#11  Short offline       Completed without error       00%      3254         -
#12  Short offline       Completed without error       00%      3233         -
#13  Short offline       Completed without error       00%      3216         -
#14  Short offline       Completed without error       00%      3192         -
#15  Short offline       Completed without error       00%      3168         -
#16  Short offline       Completed without error       00%      3144         -
#17  Short offline       Completed without error       00%      3120         -
#18  Extended offline    Completed without error       00%      3098         -
#19  Short offline       Completed without error       00%      3096         -
#20  Short offline       Completed without error       00%      3072         -
#21  Short offline       Completed without error       00%      3048         -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
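
For anyone who would rather not eyeball the attribute table, here is a minimal sketch that reads saved smartctl output and lists any attribute whose normalized WORST value has reached its THRESH. It assumes the ten-column row layout shown above and the usual SMART convention that an attribute counts as failed once its normalized value is at or below a non-zero threshold. The function names are hypothetical, and the raw values are not interpreted at all since they are vendor specific.

```python
import sys


def parse_attributes(text):
    """Pull the attribute rows out of saved smartctl output."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        # Attribute rows start with a numeric ID and carry a 0x.... flag.
        if len(f) >= 10 and f[0].isdigit() and f[2].startswith("0x"):
            rows.append({
                "id": int(f[0]),
                "name": f[1],
                "value": int(f[3]),
                "worst": int(f[4]),
                "thresh": int(f[5]),
                "type": f[6],
                "raw": " ".join(f[9:]),   # raw value may contain spaces
            })
    return rows


def at_or_below_threshold(rows):
    """Attributes whose worst normalized value has reached the threshold.

    A threshold of 0 is skipped, on the assumption that such an
    attribute is informational only.
    """
    return [r for r in rows if r["thresh"] > 0 and r["worst"] <= r["thresh"]]


if __name__ == "__main__":
    rows = parse_attributes(sys.stdin.read())
    bad = at_or_below_threshold(rows)
    for r in bad:
        print(f'{r["id"]:3d} {r["name"]}: worst {r["worst"]} <= thresh {r["thresh"]}')
    if not bad:
        print(f"{len(rows)} attributes parsed, none at or below threshold")
```

The script reads from standard input, so the smartctl output can be saved to a file and fed to it.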
>
> The classic bad memory symptoms are periodic reboots, and random
> segmentation faults. The latter is particular to bad memory, I would
> say. The former can have other causes.
>
> It's the intermittent nature of the fault that really makes me think
> hardware. Am I right in remembering that you upgraded to 5.4 and still
> had the same problems?
>
Correct...
FreeBSD himinbjorg.tucs-beachin-obx-house.com 5.4-RELEASE-p2 FreeBSD 5.4-RELEASE-p2 #2: Tue Jun 21 00:52:02 EDT 2005 root at himinbjorg.tucs-beachin-obx-house.com:/usr/obj/usr/src/sys/HIMINBJORG53 i386
Tuc