kern/125382: ENOSPC may be misleading, consider EIO
Andrew Hammond
andrew.george.hammond at gmail.com
Mon Jul 7 22:00:02 UTC 2008
>Number: 125382
>Category: kern
>Synopsis: ENOSPC may be misleading, consider EIO
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Mon Jul 07 22:00:01 UTC 2008
>Closed-Date:
>Last-Modified:
>Originator: Andrew Hammond
>Release: 6.2 amd64
>Organization:
AdECN, a Microsoft Company
>Environment:
FreeBSD db1.sjc.adecn.com 6.2-RELEASE-p6 FreeBSD 6.2-RELEASE-p6 #1: Thu Jul 19 09:21:10 PDT 2007 root at qaipc1.qa1.adecn.com:/usr/obj/usr/src/sys/ADECNDB amd64
>Description:
Found the following error message in PostgreSQL logs:
vacuumdb: vacuuming of database "adecndb" failed: ERROR: could not write block 209610 of relation 1663/16386/236356665: No space left on device
This didn't make sense, since the device is only at 18% usage. I asked on the pgsql-hackers mailing list (subject "the un-vacuumable table", thread starts at http://archives.postgresql.org/pgsql-hackers/2008-06/msg00922.php) and got the following replies.
> Have you looked into the machine's kernel log to see if there is any
> evidence of low-level distress (hardware or filesystem level)? I'm
> wondering if ENOSPC is being reported because it is the closest
> available errno code, but the real problem is something different than
> the error message text suggests. Other than the errno the symptoms
> all look quite a bit like a bad-sector problem ...
Uhm, just for the record, FileWrite returns error messages which get printed
this way for two reasons other than write(2) returning ENOSPC:
1) If FileAccess has to reopen the file, then open(2) could return an error. I
don't see how open returns ENOSPC without O_CREAT (and that's cleared for
reopening).
2) If write(2) returns < 0 but doesn't set errno. That also seems like a
strange case that shouldn't happen, but perhaps there's some reason it can.
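
The two cases above both come down to whether the errno a caller reports was really set by the failing call. A minimal sketch of one way to check that, assuming a hypothetical helper named classify_write (this is not PostgreSQL's FileWrite, and the enum names are made up for illustration): clear errno before write(2), read it only after a failure, and report a short write or a missing errno as its own case rather than as ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical outcome categories; these names are not from PostgreSQL. */
enum write_outcome {
    WRITE_OK,             /* every byte was written */
    WRITE_SHORT,          /* fewer bytes written, no error reported */
    WRITE_FAILED,         /* write(2) returned -1 and set errno */
    WRITE_FAILED_NOERRNO  /* write(2) returned -1 but errno stayed 0 */
};

/*
 * Call write(2) once and classify the result. *saved_errno receives the
 * errno set by a failed call, or 0 when the call did not fail.
 */
enum write_outcome
classify_write(int fd, const void *buf, size_t len, int *saved_errno)
{
    ssize_t n;

    assert(saved_errno != NULL);

    errno = 0;                  /* a stale errno must not look like a new one */
    n = write(fd, buf, len);

    if (n < 0) {
        *saved_errno = errno;
        return (*saved_errno != 0) ? WRITE_FAILED : WRITE_FAILED_NOERRNO;
    }

    *saved_errno = 0;           /* errno is not meaningful after success */
    if ((size_t)n < len)
        return WRITE_SHORT;
    return WRITE_OK;
}
```

With a wrapper like this, a log message can say "short write" or "failed without errno" instead of printing whatever strerror() text happens to match a leftover or assumed errno value.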
On Thu, Jul 3, 2008 at 10:57 PM, Andrew Hammond
<andrew.george.hammond at gmail.com> wrote:
> On Thu, Jul 3, 2008 at 3:47 PM, Tom Lane <tgl at sss.pgh.pa.us> wrote:
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted:
>> Have you looked into the machine's kernel log to see if there is any
>> evidence of low-level distress (hardware or filesystem level)? I'm
>> wondering if ENOSPC is being reported because it is the closest
>> available errno code, but the real problem is something different than
>> the error message text suggests. Other than the errno the symptoms
>> all look quite a bit like a bad-sector problem ...
Yes. da1 is the storage device where PGDATA lives; the kernel log shows:
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929ba560:6810 timed out for ccb 0xffffff0000e20000 (req->ccb 0xffffff0000e20000)
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929b90c0:6811 timed out for ccb 0xffffff0001081000 (req->ccb 0xffffff0001081000)
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929b9f88:6812 timed out for ccb 0xffffff0000d93800 (req->ccb 0xffffff0000d93800)
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929ba560:6810 function 0
Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929bcc90:6813 timed out for ccb 0xffffff03e132dc00 (req->ccb 0xffffff03e132dc00)
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929ba560:6810
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929ba560:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929b90c0:6811 function 0
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929b90c0:6811
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929b90c0:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929b9f88:6812 function 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): WRITE(16). CDB: 8a 0 0 0 0 1 6c 99 9 c0 0 0 0 20 0 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): CAM Status: SCSI Status Error
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): SCSI Status: Check Condition
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): UNIT ATTENTION asc:29,0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Power on, reset, or bus device reset occurred
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Retrying Command (per Sense Data)
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929b9f88:6812
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929b9f88:0 completed
Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929bcc90:6813 function 0
Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929bcc90:6813
Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929bcc90:0 completed
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): WRITE(16). CDB: 8a 0 0 0 0 1 65 1b 71 a0 0 0 0 20 0 0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): CAM Status: SCSI Status Error
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): SCSI Status: Check Condition
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): UNIT ATTENTION asc:29,0
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Power on, reset, or bus device reset occurred
Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Retrying Command (per Sense Data)
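
For anyone who wants to know which part of the disk the two failed WRITE(16) commands in the log above were aimed at, the CDB bytes can be decoded by hand. In the standard WRITE(16) layout, byte 0 is the opcode (0x8a), bytes 2 through 9 are the logical block address, and bytes 10 through 13 are the transfer length in blocks, both big-endian. A small sketch of that decoding follows; the struct and function names are hypothetical and are not part of the FreeBSD CAM code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WRITE16_OPCODE 0x8a   /* SCSI WRITE(16) operation code */

/* Hypothetical holder for the two fields of interest. */
struct write16_fields {
    uint64_t lba;     /* starting logical block address */
    uint32_t blocks;  /* transfer length in logical blocks */
};

/*
 * Decode a 16-byte WRITE(16) CDB.
 * Returns 0 on success, -1 if the opcode is not WRITE(16).
 */
int
decode_write16(const uint8_t cdb[16], struct write16_fields *out)
{
    int i;

    assert(cdb != NULL && out != NULL);

    if (cdb[0] != WRITE16_OPCODE)
        return -1;

    out->lba = 0;
    for (i = 2; i <= 9; i++)            /* 8-byte big-endian LBA */
        out->lba = (out->lba << 8) | cdb[i];

    out->blocks = 0;
    for (i = 10; i <= 13; i++)          /* 4-byte big-endian length */
        out->blocks = (out->blocks << 8) | cdb[i];

    return 0;
}
```

Fed the first CDB from the log (8a 0 0 0 0 1 6c 99 9 c0 0 0 0 20 0 0), this gives LBA 0x16c9909c0 and a length of 0x20 = 32 blocks; the second CDB gives LBA 0x1651b71a0, also 32 blocks. If the device uses 512-byte sectors (an assumption, the log does not say), each write is 16 KiB.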
Tom Lane writes:
Also, I suggest filing a bug with your kernel distributor --- ENOSPC was
a totally misleading error code here. Seems like EIO would be more
appropriate. They'll probably want to see the kernel log.
regards, tom lane