kern/125382: ENOSPC may be misleading, consider EIO

Mon Jul 7 22:00:02 UTC 2008

>Number:         125382
>Category:       kern
>Synopsis:       ENOSPC may be misleading, consider EIO
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Jul 07 22:00:01 UTC 2008
>Closed-Date:
>Last-Modified:
>Originator:     Andrew Hammond
>Release:        6.2 amd64
>Organization:
AdECN, a Microsoft Company
>Environment:
FreeBSD db1.sjc.adecn.com 6.2-RELEASE-p6 FreeBSD 6.2-RELEASE-p6 #1: Thu Jul 19 09:21:10 PDT 2007     root at qaipc1.qa1.adecn.com:/usr/obj/usr/src/sys/ADECNDB  amd64
>Description:
Found the following error message in PostgreSQL logs:

vacuumdb: vacuuming of database "adecndb" failed: ERROR:  could not write block 209610 of relation 1663/16386/236356665: No space left on device

This didn't make sense, since the device is only at 18% usage. I took it to the pgsql-hackers mailing list (subject "the un-vacuumable table"; the thread starts at http://archives.postgresql.org/pgsql-hackers/2008-06/msg00922.php).
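
For anyone wanting to rule out a genuinely full filesystem first, here is a minimal sketch using statvfs(3). The helper names are made up for illustration and are not part of any tool mentioned in this report. It checks inodes as well as blocks, because a filesystem that has run out of inodes can also return ENOSPC while df still shows free space.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/* Hypothetical helper: percentage of data blocks in use on the
 * filesystem containing `path`, or -1 if statvfs(3) fails or the
 * filesystem reports no blocks at all. */
static int fs_block_usage_pct(const char *path)
{
    struct statvfs vfs;

    assert(path != NULL);
    if (statvfs(path, &vfs) != 0 || vfs.f_blocks == 0)
        return -1;
    return (int)(100.0 * (double)(vfs.f_blocks - vfs.f_bfree)
                 / (double)vfs.f_blocks);
}

/* Hypothetical helper: the same calculation for inodes. */
static int fs_inode_usage_pct(const char *path)
{
    struct statvfs vfs;

    assert(path != NULL);
    if (statvfs(path, &vfs) != 0 || vfs.f_files == 0)
        return -1;
    return (int)(100.0 * (double)(vfs.f_files - vfs.f_ffree)
                 / (double)vfs.f_files);
}
```

The figures may differ slightly from df(1), which typically accounts for space reserved for root. If both come back well below 100, as they did here, the ENOSPC is unlikely to mean what its message text says.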

> Have you looked into the machine's kernel log to see if there is any
> evidence of low-level distress (hardware or filesystem level)?  I'm
> wondering if ENOSPC is being reported because it is the closest
> available errno code, but the real problem is something different than
> the error message text suggests.  Other than the errno the symptoms
> all look quite a bit like a bad-sector problem ...

Uhm, just for the record FileWrite returns error messages which get printed
this way for two reasons other than write(2) returning ENOSPC:

1) if FileAccess has to reopen the file then open(2) could return an error. I
don't see how open returns ENOSPC without O_CREAT (and that's cleared for
reopening)

2) If write(2) returns < 0 but doesn't set errno. That also seems like a
strange case that shouldn't happen, but perhaps there's some reason it can.
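
To make case 2 concrete, here is a minimal sketch of a write wrapper that falls back to ENOSPC when the underlying call fails without setting errno. It is a hypothetical illustration, not PostgreSQL's FileWrite. The name write_all_or_enospc and the injectable write_fn parameter are assumptions made so the behaviour can be exercised without a real device, and treating a short write the same as a negative return is also an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Same signature as write(2), so a test double can be substituted. */
typedef ssize_t (*write_fn)(int fd, const void *buf, size_t n);

/*
 * Hypothetical wrapper: returns 0 if all n bytes were written, -1
 * otherwise.  If the call came back short or negative but left errno
 * untouched, ENOSPC is reported as a best guess.
 */
static int write_all_or_enospc(write_fn do_write, int fd,
                               const void *buf, size_t n)
{
    ssize_t rc;

    assert(do_write != NULL && buf != NULL);
    errno = 0;
    rc = do_write(fd, buf, n);
    if (rc == (ssize_t)n)
        return 0;
    if (errno == 0)
        errno = ENOSPC;   /* guessed, not reported by the kernel */
    return -1;
}

/* Test double: writes only half the buffer and sets no errno. */
static ssize_t short_write(int fd, const void *buf, size_t n)
{
    (void)fd; (void)buf;
    return (ssize_t)(n / 2);
}

/* Test double: fails the way a real I/O error would. */
static ssize_t eio_write(int fd, const void *buf, size_t n)
{
    (void)fd; (void)buf; (void)n;
    errno = EIO;
    return -1;
}
```

With a wrapper like this, an EIO set by the kernel passes through unchanged. An ENOSPC in the log therefore means either that the kernel really returned it or that the call failed without setting errno at all.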

On Thu, Jul 3, 2008 at 10:57 PM, Andrew Hammond
<andrew.george.hammond at gmail.com> wrote:
> On Thu, Jul 3, 2008 at 3:47 PM, Tom Lane <tgl at sss.pgh.pa.us> wrote:
>How-To-Repeat:

>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted:
 >> Have you looked into the machine's kernel log to see if there is any
 >> evidence of low-level distress (hardware or filesystem level)?  I'm
 >> wondering if ENOSPC is being reported because it is the closest
 >> available errno code, but the real problem is something different than
 >> the error message text suggests.  Other than the errno the symptoms
 >> all look quite a bit like a bad-sector problem ...

 da1 is the storage device where PGDATA lives.

 Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929ba560:6810 timed out for ccb 0xffffff0000e20000 (req->ccb 0xffffff0000e20000)
 Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929b90c0:6811 timed out for ccb 0xffffff0001081000 (req->ccb 0xffffff0001081000)
 Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929b9f88:6812 timed out for ccb 0xffffff0000d93800 (req->ccb 0xffffff0000d93800)
 Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929ba560:6810 function 0
 Jun 19 03:06:14 db1 kernel: mpt1: request 0xffffffff929bcc90:6813 timed out for ccb 0xffffff03e132dc00 (req->ccb 0xffffff03e132dc00)
 Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929ba560:6810
 Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929ba560:0 completed
 Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929b90c0:6811 function 0
 Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929b90c0:6811
 Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929b90c0:0 completed
 Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929b9f88:6812 function 0
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): WRITE(16). CDB: 8a 0 0 0 0 1 6c 99 9 c0 0 0 0 20 0 0
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): CAM Status: SCSI Status Error
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): SCSI Status: Check Condition
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): UNIT ATTENTION asc:29,0
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Power on, reset, or bus device reset occurred
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Retrying Command (per Sense Data)
 Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929b9f88:6812
 Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929b9f88:0 completed
 Jun 19 03:06:14 db1 kernel: mpt1: attempting to abort req 0xffffffff929bcc90:6813 function 0
 Jun 19 03:06:14 db1 kernel: mpt1: completing timedout/aborted req 0xffffffff929bcc90:6813
 Jun 19 03:06:14 db1 kernel: mpt1: abort of req 0xffffffff929bcc90:0 completed
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): WRITE(16). CDB: 8a 0 0 0 0 1 65 1b 71 a0 0 0 0 20 0 0
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): CAM Status: SCSI Status Error
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): SCSI Status: Check Condition
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): UNIT ATTENTION asc:29,0
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Power on, reset, or bus device reset occurred
 Jun 19 03:06:14 db1 kernel: (da1:mpt1:0:0:0): Retrying Command (per Sense Data)
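
 The two WRITE(16) CDBs in the log can be decoded to see which blocks the failed writes targeted. The sketch below assumes the standard 16-byte WRITE(16) layout (opcode 0x8a in byte 0, a big-endian 64-bit LBA in bytes 2-9, a big-endian 32-bit transfer length in bytes 10-13) and that the kernel prints the CDB bytes in hexadecimal. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of a 16-byte WRITE(16) command descriptor block. */
struct write16 {
    uint64_t lba;     /* starting logical block: bytes 2..9, big-endian */
    uint32_t blocks;  /* transfer length in blocks: bytes 10..13 */
};

/* Hypothetical helper: extract LBA and transfer length from a CDB. */
static struct write16 decode_write16(const uint8_t cdb[16])
{
    struct write16 w = {0, 0};
    int i;

    assert(cdb[0] == 0x8a);  /* WRITE(16) opcode */
    for (i = 2; i <= 9; i++)
        w.lba = (w.lba << 8) | cdb[i];
    for (i = 10; i <= 13; i++)
        w.blocks = (w.blocks << 8) | cdb[i];
    return w;
}
```

 Applied to the log above, the first CDB decodes to LBA 0x16c9909c0 and the second to LBA 0x1651b71a0, each with a transfer length of 0x20 (32) blocks.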

 Tom Lane writes:

 Also, I suggest filing a bug with your kernel distributor --- ENOSPC was
 a totally misleading error code here.  Seems like EIO would be more
 appropriate.  They'll probably want to see the kernel log.

                        regards, tom lane