Re: Replacing a REMOVED drive in DEGRADED zpool
Date: Thu, 21 Aug 2025 06:14:35 UTC
On 8/20/25 17:55, Robert wrote:
> I have my first zpool degraded on a FreeBSD 13.5 server and looking for
> advice on the steps I'll be taking to successfully replace the REMOVED
> drive in a 4 disk 2 mirror zpool. It is scrubbed monthly with last scrub
> August 3rd...
>
> root@db1:~ # zpool list
> NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
> zdb1   262G   102G   160G        -         -    53%    38%  1.00x  DEGRADED  -
>
> root@db1:~ # zpool status
>   pool: zdb1
>  state: DEGRADED
> status: One or more devices has been removed by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: scrub repaired 0B in 00:27:57 with 0 errors on Sun Aug  3 04:43:48 2025
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         zdb1        DEGRADED     0     0     0
>           mirror-0  DEGRADED     0     0     0
>             ada0p3  REMOVED      0     0     0
>             ada1p3  ONLINE       0     0     0
>           mirror-1  ONLINE       0     0     0
>             ada2p3  ONLINE       0     0     0
>             ada3p3  ONLINE       0     0     0

I am used to seeing non-zero numbers in the READ, WRITE, and/or CKSUM
columns for a bad disk.  Did you do something to reset the numbers?

> I have data backups of any important data. I also use zfs-autobackup from
> an old remote FreeNAS server to take hourly snapshots ...
>
> zfs-autobackup -v --keep-source 72 --keep-target 168 --ssh-source db1
> offsite1 DATA/backups/db1
>
> So, I have the last 72 hours of snapshots on the local server with 7
> days worth on the remote NAS.

Good.

> The disk entered the REMOVED state at 6am this morning, a little over
> 14 hours ago

`zpool status`, above, says "One or more devices has been removed by
the administrator".  Did a human remove the disk, or did ZFS?  If a
human, what command did they use?

> and I plan to replace on Friday night to give myself some time in case
> a restore needs to happen. Perhaps I should bump the local snapshot
> storage up to 168 hours (1 week) as well at this point or hold what is
> there, can I hold all snapshots with one command? Here is the disk info
> for the 3 drives remaining in the zpool ...
>
> root@db1:~ # camcontrol devlist
> <WDC WD1500ADFD-00NLR5 21.07QR5>  at scbus1 target 0 lun 0 (ada1,pass1)
> <WDC WD1500HLFS-01G6U3 04.04V05>  at scbus2 target 0 lun 0 (ada2,pass2)
> <WDC WD1500ADFD-00NLR5 21.07QR5>  at scbus3 target 0 lun 0 (ada3,pass3)

ZFS RAID10 with Raptors and a VelociRaptor -- a blast from the past! :-)

The rule with vintage systems is "do not throw good money after bad".
If you already have a spare Raptor or VelociRaptor and all four disks
test and report good with smartctl(8), then perhaps replacing the
failed disk with another such disk is a good idea.  Otherwise, I would
consider other options (e.g. a pair of SSDs).

> root@db1:~ # gpart show ada1
> =>       40  293046688  ada1  GPT  (140G)
>          40       1024     1  freebsd-boot  (512K)
>        1064        984        - free -  (492K)
>        2048   16777216     2  freebsd-swap  (8.0G)
>    16779264  276267008     3  freebsd-zfs  (132G)
>   293046272        456        - free -  (228K)

What partition scheme is on the disks?  I do not see an EFI system
partition.  Is the motherboard firmware BIOS/Legacy or UEFI?

How is ada0p1 freebsd-boot configured into the system?  ZFS
stripe-of-mirrors?  UFS gmirror/gstripe RAID10?

How is ada0p2 freebsd-swap configured into the system?  One of four
swap devices?
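
If you do not know the answers off-hand, a few commands will show
them.  This is only a sketch of where to look; the output on your
machine is what matters:

  # BIOS/Legacy or UEFI firmware boot?
  sysctl machdep.bootmethod

  # partition scheme and layout of every disk, not just ada1
  gpart show -p

  # which devices are actually providing swap right now?
  swapinfo
  grep -i swap /etc/fstab

  # any GEOM mirror/stripe devices configured outside of ZFS?
  gmirror status
  gstripe status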
> All the drives report layouts identical to ada1. I've used camcontrol
> with identify to get all the serial numbers of these drives, so I plan
> to shut the server down, pull the bad drive and insert the replacement,
> boot up and replace. Would these be the steps I need to take assuming
> the replacement drive shows up as the same ada0 device?
>
> 1. Run `zpool offline zdb1 ada0p3`

I would use zpool-detach(8) to remove the failed disk from the pool.
You will also need to disconnect ada0p1 freebsd-boot and ada0p2
freebsd-swap according to how they are configured into your system.

> 2. Shut down and pull/insert replacement
> 3. Boot up and run `gpart backup ada1 > gpart.ada1` then
>    `gpart restore ada0 < gpart.ada1`
> 4. Run `zpool replace zdb1 ada0p3 ada0p3`

Cloning ada1's GPTs (primary and secondary) to ada0 will result in
duplicate identifiers -- UUIDs, labels, etc. -- on two disks.  Having
two disks with matching identifiers in the same computer is asking for
trouble.  I would not do that.

If anything, clone the failed disk's GPTs to the replacement disk --
but only if the failed disk's GPTs are good.  If the failed disk is
still mostly operational, with bad blocks all within the middle data
portion of ada0p3 (NOT in the metadata), cloning the failed disk to the
replacement disk could save effort.  ddrescue(1) may be required to get
past bad blocks.

Otherwise, I would zero the replacement disk and build it manually.  I
would use zpool-attach(8) to add the replacement ada0p3 as a mirror of
ada1p3.  You will need to build and connect the replacement ada0p1
freebsd-boot and replacement ada0p2 freebsd-swap according to how they
are to be configured into your system.

> I'm just not sure if this is all that is needed with a ROOT zpool or if
> all correct. I appreciate any guidance. Here is the full zfs list...
>
> root@db1:~ # zfs list
> NAME                                         USED  AVAIL  REFER  MOUNTPOINT
> zdb1                                         103G   152G    96K  /zdb1
> zdb1/ROOT                                    101G   151G    96K  none
> zdb1/ROOT/13.1-RELEASE-p7_2023-05-04_200035    8K   151G  15.8G  /
> zdb1/ROOT/13.2-RELEASE-p1_2023-08-07_124053    8K   151G  25.4G  /
> zdb1/ROOT/13.2-RELEASE-p2_2023-09-09_111305    8K   151G  29.0G  /
> zdb1/ROOT/13.2-RELEASE-p3_2023-12-31_111612    8K   151G  33.1G  /
> zdb1/ROOT/13.2-RELEASE-p9_2024-04-14_121449    8K   151G  34.9G  /
> zdb1/ROOT/13.2-RELEASE_2023-05-04_200614       8K   151G  15.9G  /
> zdb1/ROOT/13.2-RELEASE_2023-08-01_151806       8K   151G  25.1G  /
> zdb1/ROOT/13.3-RELEASE-p1_2024-04-14_121907    8K   151G  34.9G  /
> zdb1/ROOT/13.3-RELEASE-p1_2024-08-04_122937    8K   151G  36.2G  /
> zdb1/ROOT/13.3-RELEASE-p4_2025-01-04_162341    8K   151G  36.3G  /
> zdb1/ROOT/13.3-RELEASE-p8_2025-01-04_164203    8K   151G  36.9G  /
> zdb1/ROOT/13.4-RELEASE-p1_2025-01-04_164619    8K   151G  37.0G  /
> zdb1/ROOT/13.4-RELEASE-p2_2025-05-10_133828    8K   151G  39.4G  /
> zdb1/ROOT/13.5-RELEASE-p1_2025-07-04_113332    8K   151G  39.6G  /
> zdb1/ROOT/13.5-RELEASE_2025-05-10_134206       8K   151G  39.4G  /
> zdb1/ROOT/default                            101G   151G  39.5G  /
> zdb1/tmp                                    1.12M   151G   200K  /tmp
> zdb1/usr                                    1.17G   151G    96K  /usr
> zdb1/usr/home                               1.31M   151G  1.30M  /usr/home
> zdb1/usr/ports                              1.17G   151G  1.17G  /usr/ports
> zdb1/usr/src                                  96K   151G    96K  /usr/src
> zdb1/var                                    5.35M   151G    96K  /var
> zdb1/var/audit                                96K   151G    96K  /var/audit
> zdb1/var/crash                                96K   151G    96K  /var/crash
> zdb1/var/log                                4.78M   151G   660K  /var/log
> zdb1/var/mail                                200K   151G   144K  /var/mail
> zdb1/var/tmp                                  96K   151G    96K  /var/tmp
>
> Thank you.
>
> --
> Robert

Finally, ZFS, ZFS stripe-of-mirrors, root-on-ZFS, and gmirror/gstripe
RAID10 are all non-trivial.  Replacing such a disk correctly is going
to require a lot of knowledge.  If you like learning adventures, go for
it.  But if you want 24x7 operations, I do better with
backup/wipe/install/restore.  It is simpler, I can estimate how long it
will take, I can roll it back, and I have confidence in the results.
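
To make the "build it manually" path concrete: assuming the answers to
my questions above are BIOS/Legacy boot with gptzfsboot (no ESP is
visible in your gpart output), that the old ada0p3 has already been
detached with zpool-detach(8), and that the replacement really does
come up as ada0, the rebuild would look roughly like this.  This is
only a sketch -- adjust sizes, alignment, and the bootcode step to
match how the surviving disks are actually set up:

  # new GPT on the blank replacement disk
  gpart create -s gpt ada0

  # same layout as ada1: 512K boot, 8G swap, remainder ZFS
  gpart add -t freebsd-boot -s 512k -a 4k ada0
  gpart add -t freebsd-swap -s 8g -a 1m ada0
  gpart add -t freebsd-zfs -a 1m ada0

  # legacy boot blocks for a root-on-ZFS pool
  gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

  # attach the new partition as a mirror of the surviving half
  zpool attach zdb1 ada1p3 ada0p3

Then watch `zpool status` until the resilver completes, and re-enable
swap on ada0p2 however the other swap partitions are configured (an
fstab entry plus swapon(8), for example).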
If you go this route, I would put FreeBSD on UFS on a single small SSD
and put the data on ZFS with redundant disks.  Back up the OS disk with
rsync(1) and take images regularly.  Restoring an OS disk from an image
is the fastest way to recover from an OS disk disaster.
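
A minimal sketch of that kind of OS-disk backup, with example
destination paths (substitute your own backup storage):

  # file-level copy of the running UFS system, one filesystem only
  rsync -aHx --delete / /backup/db1-os/

  # raw image of the OS disk, taken while booted from other media
  dd if=/dev/ada0 of=/backup/db1-os.img bs=1m conv=sync,noerror

David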