From: Michael Proto <mike@jellydonut.org>
Date: Fri, 10 Sep 2021 12:10:05 -0400
Subject: Re: Constant state-changes in a ZFS array
To: FreeBSD-STABLE Mailing List, freebsd-hardware@freebsd.org
List-Id: General discussion of FreeBSD hardware
List-Archive: https://lists.freebsd.org/archives/freebsd-hardware
Just realized I neglected version info. This is on FreeBSD 11.4-RELEASE-p3.

On Fri, Sep 10, 2021 at 11:45 AM Michael Proto wrote:
>
> Hey all,
>
> I had a server go dark earlier this week, and after several hardware
> swaps I'm left scratching my head. The server is an HP DL380p Gen8 with
> a D3600 shelf attached, using 2 EF0600FARNA HP 600G disks in a ZFS
> mirror (da0 and da1) and another 22 8TB Ultrastar disks in a ZFS
> RAID10 for data (da2 through da23, though da23 has been removed in
> this situation). They're all attached to an LSI SAS2308 operating in
> HBA mode.
>
> The large array threw a disk shortly before the outage, which we would
> normally handle online as we've done dozens of times before. In this
> case there's a bigger problem I'm struggling with. In addition to the
> thrown disk, I'm now unable to bring the larger ZFS array online.
> Commands issued during boot to check array status or bring it online
> stall. The 2-disk zroot mirror is recognized on boot and loads, so I
> can get into the OS as normal while the larger tank array fails to
> come online.
>
> Looking at syslog, I'm seeing a regular stream of messages from devd
> reporting media and state-change events from ZFS, GEOM, and DEVFS.
Sample below:
>
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=8915574321583737794'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da6'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=8915574321583737794'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da6'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da6'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=7024987654522270730'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da6p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da6p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da9'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da9'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=4207599288564790488'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da9p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da9p1'
>
>
> The disk devices appearing in these messages are all disks in the
> RAID10 array. They appear as a group, every 5 seconds. Furthermore,
> the state changes seem to be happening evenly across all affected
> devices, with the exception of da15, which logs exactly half as many
> events as the others.
> Here's a count from /var/log/messages piped to sort and uniq (count,
> then message):
>
> 32100 cdev=da10'
> 32100 cdev=da10p1'
> 32100 cdev=da11'
> 32100 cdev=da11p1'
> 32100 cdev=da12'
> 32100 cdev=da12p1'
> 32100 cdev=da13'
> 32100 cdev=da13p1'
> 32100 cdev=da14'
> 32100 cdev=da14p1'
> 16050 cdev=da15'
> 16050 cdev=da15p1'
> 32100 cdev=da16'
> 32100 cdev=da16p1'
> 32100 cdev=da17'
> 32100 cdev=da17p1'
> 32100 cdev=da18'
> 32100 cdev=da18p1'
> 32100 cdev=da19'
> 32100 cdev=da19p1'
> 32100 cdev=da2'
> 32100 cdev=da20'
> 32100 cdev=da20p1'
> 32100 cdev=da21'
> 32100 cdev=da21p1'
> 32100 cdev=da22'
> 32100 cdev=da22p1'
> 32100 cdev=da2p1'
> 32100 cdev=da3'
> 32100 cdev=da3p1'
> 32100 cdev=da4'
> 32100 cdev=da4p1'
> 32100 cdev=da5'
> 32100 cdev=da5p1'
> 32100 cdev=da6'
> 32100 cdev=da6p1'
> 32100 cdev=da7'
> 32100 cdev=da7p1'
> 32100 cdev=da8'
> 32100 cdev=da8p1'
> 32100 cdev=da9'
> 32100 cdev=da9p1'
>
> I can run diskinfo against all the listed disks with no problem, and I
> see them via camcontrol. I can also issue a reset via camcontrol to
> both the chassis and the D3600 without issue. sesutil map sees the
> chassis, shelf, and all disk devices.
>
> So far I've swapped the LSI controller, the D3600 shelf (twice), and
> the cabling; same behavior. Previously, when a collection of disks
> went problematic like this, we swapped the D3600 shelf or occasionally
> just reseated the external cabling and everything came back to normal.
> Not this time. I'm scheduling a chassis swap for next week, but
> figured I'd throw this out here to see if anyone has seen this before.
>
> Thanks!
> Mike Proto
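For reference, the per-device tally quoted above can be produced with a
pipeline along these lines (a sketch only: the grep pattern and the
function name are inferred from the quoted log excerpt, not taken from
the original post):

```shell
# Tally devd MEDIACHANGE events per device node from a syslog file,
# producing "count, then message" output like the table quoted above.
# The cdev=... pattern is an assumption based on the sample log lines.
count_mediachange() {
    grep "type=MEDIACHANGE" "$1" \
        | grep -o "cdev=[a-z0-9]*'" \
        | sort | uniq -c | sort -k2
}

# Usage: count_mediachange /var/log/messages
```

Counting per device node (rather than per whole line) is what makes an
outlier like da15 stand out immediately, since every other member shows
an identical count.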