Proxmox: Replace a failed bootable ZFS disk in rpool
It’s what every sysadmin dreads and prepares for: a failed hard drive.
Replacing a failed SSD in a Proxmox boot pool (rpool) can be a daunting task, especially considering the critical role it plays in your server’s operation. A mirrored ZFS boot pool provides redundancy, but when one of the drives fails, it’s imperative to address the issue promptly to maintain system integrity and prevent data loss. This article will walk you through the process of replacing a failed SSD in your Proxmox rpool, ensuring that your system remains robust and reliable.
In the world of system administration, hardware failures are an inevitable reality. Fortunately, ZFS, the filesystem utilized by Proxmox, offers features that simplify the replacement process. By following the correct procedures, you can remove the failed drive, install and prepare the new SSD, and reintegrate it into the rpool with minimal downtime. This guide aims to provide clear, step-by-step instructions to help you navigate this process confidently, ensuring your Proxmox environment continues to operate smoothly.
The problem
In my main Proxmox/Storage server I have a ZFS-mirror rpool. This is where the system boots from and where all OS files reside. One of the disks died.
root@treebeard:~# zpool status rpool
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:18:07 with 0 errors on Sun Mar  9 00:42:18 2025
config:

        NAME                                       STATE     READ WRITE CKSUM
        rpool                                      DEGRADED     0     0     0
          mirror-0                                 DEGRADED     0     0     0
            ata-CT250MX500SSD1_2208E60D104A-part3  ONLINE       0     0     0
            ata-CT250MX500SSD1_1949E22CA8C7-part3  FAULTED      6   300     0  too many errors
Luckily, ZFS caught the problem and removed the faulty disk from rpool to avoid further damage. The other disk kept going, so no further issues presented themselves.
There is a slight complication that does not allow us to simply zpool replace the entire disk:
Device Start End Sectors Size Type
/dev/sds1 34 2047 2014 1007K BIOS boot
/dev/sds2 2048 1050623 1048576 512M EFI System
/dev/sds3 1050624 488397134 487346511 232.4G Solaris /usr & Apple ZFS
Each disk has its own BIOS and EFI partitions to boot the system (and load up a kernel with ZFS support). ZFS does not only work with whole disks; it also works with partitions. In the example above, /dev/sds1 and /dev/sds2 are the same on each disk in rpool but are not part of the ZFS pool itself. /dev/sds3, however, is a ZFS partition and is mirrored to /dev/sdt3 in rpool.
The replacement
Replacing this disk is straightforward, once you know the steps to do it…
In my case, I have maxed out my hot-swap drive bays, so I'll have to remove the faulted disk to make room for the new drive. If I were replacing the disks in rpool as a precaution, or to upgrade to larger storage, i.e. with rpool still fully online, I would prefer to add the new disk first and let the original ZFS disks handle the resilver. Because one disk is already dead, there will be extra load on the single remaining disk during the resilvering process. Avoid that if possible, but here I have no choice.
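For reference, the attach-then-detach flow I would have preferred looks roughly like the sketch below. The new-disk ID is hypothetical, and the script only prints the commands so you can review them before running anything against your own pool:

```shell
# Attach-then-detach sketch for a still-healthy mirror. The NEW id is
# hypothetical; substitute real /dev/disk/by-id names before running.
GOOD=ata-CT250MX500SSD1_2208E60D104A-part3  # healthy mirror member
OLD=ata-CT250MX500SSD1_1949E22CA8C7-part3   # member being retired
NEW=ata-NEW_SSD_SERIAL-part3                # hypothetical replacement

# Print the plan instead of executing it, so it can be reviewed first.
echo "zpool attach rpool $GOOD $NEW"  # resilver onto the new partition
echo "zpool detach rpool $OLD"        # remove the old one once resilvered
```

With attach, the pool briefly becomes a three-way mirror, so redundancy never drops below two copies during the resilver.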
So, I popped out the failed drive and put in a new WD Blue 500GB drive. It's way too big for an rpool disk, but it was the cheapest price/performance option I could find at the time; replacing it with a 250GB version would actually have been more expensive. I bought two new drives, because the other SSD is of a similar vintage as the failed disk. Replacing both also has the benefit of being able to use the full 500GB capacity (although I don't know what for).
1. Note the ID and device name of the failed harddrive
zpool status -v rpool shows that ata-CT250MX500SSD1_1949E22CA8C7-part3 is the FAULTED partition. Then find out the device name for easy reference:
$ ls -alh /dev/disk/by-id | grep ata-CT250MX500SSD1_1949E22CA8C7
lrwxrwxrwx 1 root root 9 Nov 20 19:20 ata-CT250MX500SSD1_1949E22CA8C7 -> ../../sdt
lrwxrwxrwx 1 root root 10 Nov 20 19:21 ata-CT250MX500SSD1_1949E22CA8C7-part1 -> ../../sdt1
lrwxrwxrwx 1 root root 10 Nov 20 19:21 ata-CT250MX500SSD1_1949E22CA8C7-part2 -> ../../sdt2
lrwxrwxrwx 1 root root 10 Nov 20 19:21 ata-CT250MX500SSD1_1949E22CA8C7-part3 -> ../../sdt3
So, /dev/sdt is the faulted drive.
2. Out with the old, and in with the new
Next I open up another terminal and run sudo dmesg -w. This will tail all kernel output while I remove the old drive and insert the new one. You'll also see which device name the new drive is assigned; this may be the same as before, but it could also differ. In my case it got assigned /dev/sdt (the same as the old drive).
[11449978.939798] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[11449978.947370] ata6.00: ATA-11: WD Blue SA510 2.5 500GB, 52046100, max UDMA/133
[11449978.951552] ata6.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 32), AA
[11449978.989819] ata6.00: Features: Dev-Sleep
[11449979.016204] ata6.00: configured for UDMA/133
[11449979.035403] scsi 6:0:0:0: Direct-Access ATA WD Blue SA510 2. 6100 PQ: 0 ANSI: 5
[11449979.036341] sd 6:0:0:0: [sdt] 976773168 512-byte logical blocks: (500 GB/466 GiB)
[11449979.036642] sd 6:0:0:0: Attached scsi generic sg20 type 0
[11449979.036955] sd 6:0:0:0: [sdt] Write Protect is off
[11449979.038080] sd 6:0:0:0: [sdt] Mode Sense: 00 3a 00 00
[11449979.038832] sd 6:0:0:0: [sdt] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[11449979.040161] sd 6:0:0:0: [sdt] Preferred minimum I/O size 512 bytes
[11449979.042429] sd 6:0:0:0: [sdt] Attached SCSI disk
After this, I recommend you do any testing you feel is necessary. I run an extended SMART test with smartctl -t long /dev/sdt to make sure the drive is behaving correctly.
3. Replicate the partition table
The first thing we'll do with our new disk is get the partition table onto it. It should be identical to that of /dev/sds, the other disk in the pool.
$ sgdisk /dev/sds -R /dev/sdt
The operation has completed successfully.
The -R flag replicates the partition table from /dev/sds to /dev/sdt.
4. Randomize partition GUIDs
Each partition has its own GUID, and the replicate action from the previous step also copied those GUIDs. Because we want GUIDs to be unique, we'll randomize them on the new disk.
$ sgdisk -G /dev/sdt
The operation has completed successfully.
5. Find the new label for the new disk
We already know the label for the ZFS partition on the old disk: ata-CT250MX500SSD1_1949E22CA8C7-part3. Now, let's find the label for the new disk.
$ ls -alh /dev/disk/by-id | grep -i sdt
lrwxrwxrwx 1 root root 9 Apr 2 09:10 ata-WD_Blue_SA510_2.5_500GB_24420V801694 -> ../../sdt
lrwxrwxrwx 1 root root 10 Apr 2 09:10 ata-WD_Blue_SA510_2.5_500GB_24420V801694-part1 -> ../../sdt1
lrwxrwxrwx 1 root root 10 Apr 2 09:24 ata-WD_Blue_SA510_2.5_500GB_24420V801694-part2 -> ../../sdt2
lrwxrwxrwx 1 root root 10 Apr 2 09:10 ata-WD_Blue_SA510_2.5_500GB_24420V801694-part3 -> ../../sdt3
lrwxrwxrwx 1 root root 9 Apr 2 09:10 wwn-0x5001b448c9ab9aa3 -> ../../sdt
lrwxrwxrwx 1 root root 10 Apr 2 09:10 wwn-0x5001b448c9ab9aa3-part1 -> ../../sdt1
lrwxrwxrwx 1 root root 10 Apr 2 09:24 wwn-0x5001b448c9ab9aa3-part2 -> ../../sdt2
lrwxrwxrwx 1 root root 10 Apr 2 09:10 wwn-0x5001b448c9ab9aa3-part3 -> ../../sdt3
Note that the WD Blues provide a LU WWN Device ID (Logical Unit World Wide Name) in addition to a more human-readable ID that includes the drive's serial number. I prefer the latter for easy reference.
6. Replace the faulted ZFS partition
This is it. We can now replace the faulted partition with the new one.
$ zpool replace -f rpool ata-CT250MX500SSD1_1949E22CA8C7-part3 ata-WD_Blue_SA510_2.5_500GB_24420V801694-part3
This command should finish quickly and start the resilvering process.
$ zpool status rpool
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Apr  2 08:51:45 2025
        138G / 138G scanned, 12.5G / 138G issued at 145M/s
        12.6G resilvered, 9.03% done, 00:14:46 to go
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 DEGRADED     0     0     0
          mirror-0                                            DEGRADED     0     0     0
            ata-CT250MX500SSD1_2208E60D104A-part3             ONLINE       0     0     0
            replacing-1                                       DEGRADED     0     0     0
              ata-CT250MX500SSD1_1949E22CA8C7-part3           REMOVED      0     0     0
              ata-WD_Blue_SA510_2.5_500GB_24420V801694-part3  ONLINE       0     0     0  (resilvering)
7. Wait for the resilvering to complete
This may take a while. Here it took about 15 minutes.
$ zpool status rpool
  pool: rpool
 state: ONLINE
  scan: resilvered 140G in 00:16:42 with 0 errors on Wed Apr  2 09:08:27 2025
config:

        NAME                                                STATE     READ WRITE CKSUM
        rpool                                               ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            ata-CT250MX500SSD1_2208E60D104A-part3           ONLINE       0     0     0
            ata-WD_Blue_SA510_2.5_500GB_24420V801694-part3  ONLINE       0     0     0
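Rather than re-running zpool status by hand, you can let ZFS block until the work is done. OpenZFS 2.0 and later (which current Proxmox releases ship) provide a wait subcommand; this obviously only works on the host with the pool itself:

```shell
# Returns once all resilvering activity on rpool has completed.
zpool wait -t resilver rpool
```

This is handy in scripts that should only continue (e.g. detach or offline a disk) after the pool is fully redundant again.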
8. Refresh the boot partitions
As stated earlier, each disk in rpool is a bootable disk (you can boot from either one). For that purpose, each disk has boot partitions, but those are not part of any ZFS replication. To make sure the data on each drive is correct and up to date, Proxmox provides a tool aptly named proxmox-boot-tool. My system boots using UEFI, so let's make sure that all works:
$ proxmox-boot-tool format /dev/sdt2
$ proxmox-boot-tool init /dev/sdt2
Running hook script 'proxmox-auto-removal'..
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
Copying and configuring kernels on /dev/disk/by-uuid/621A-90AF
Copying kernel and creating boot-entry for 6.5.13-6-pve
Copying kernel and creating boot-entry for 6.8.12-1-pve
Copying kernel and creating boot-entry for 6.8.12-7-pve
Copying and configuring kernels on /dev/disk/by-uuid/7875-951A
Copying kernel and creating boot-entry for 6.5.13-6-pve
Copying kernel and creating boot-entry for 6.8.12-1-pve
Copying kernel and creating boot-entry for 6.8.12-7-pve
WARN: /dev/disk/by-uuid/D7AD-EC73 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping
This will copy everything needed to boot the Proxmox box to each disk.
Notice we get a warning here that /dev/disk/by-uuid/D7AD-EC73 does not exist. This is the failed disk we removed earlier. We can simply edit /etc/kernel/proxmox-boot-uuids and remove the entry for D7AD-EC73.
If you want, you can run proxmox-boot-tool refresh to verify that the warning is now gone.
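To double-check which ESPs are registered and in sync, proxmox-boot-tool also has a status subcommand (again, run on the Proxmox host itself):

```shell
# Show the ESP UUIDs proxmox-boot-tool manages and the boot mode in use.
proxmox-boot-tool status
```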
9. (Optional) Replace the other device
You can repeat the above steps to replace the other drive as well. I did this because both devices were purchased around the same time. Before I remove the good working drive, I offline it first, so ZFS is not sending any writes to it as I remove it.
# Offline the drive first
$ zpool offline rpool ata-CT250MX500SSD1_2208E60D104A-part3
# Replicate the partition table and randomize GUIDs
$ sgdisk /dev/sdt -R /dev/sds
$ sgdisk -G /dev/sds
# Initialize the resilver
$ zpool replace -f rpool ata-CT250MX500SSD1_2208E60D104A-part3 ata-WD_Blue_SA510_2.5_500GB_242602A00249-part3
# Configure the boot partitions and clean up the UUIDs in the configuration
$ proxmox-boot-tool format /dev/sds2
$ proxmox-boot-tool init /dev/sds2
$ vim /etc/kernel/proxmox-boot-uuids
10a. (Optional) SSD Over-provisioning
So, I have now upgraded my 250GB ZFS mirror to two 500GB drives. Any good Proxmox admin knows not to use rpool for anything other than the OS, so I do not need that additional 250GB of storage.
Enter SSD over-provisioning: over-provisioning refers to setting aside extra space on an SSD that is not visible to the user or OS. This space is used by the SSD firmware for:
- Wear leveling
- Garbage collection
- Bad block management
- TRIM operations
Long story short, because the partition table still has the 250GB size from the original disks, there is 250GB of free space.
$ parted /dev/sdt print free
Model: ATA WD Blue SA510 2. (scsi)
Disk /dev/sdt: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 17.4kB 1049kB 1031kB bios_grub
2 1049kB 538MB 537MB fat32 boot, esp
3 538MB 250GB 250GB zfs
250GB 500GB 250GB Free Space
If I needed the space, I could expand the ZFS partition to make use of it. But because I do not need the space, I can leave the partition table as-is and give the SSD's firmware a lot of room to handle wear leveling and bad block management. Most SSDs have some over-provisioning built in. If you do it manually, 10-20% of free space is recommended. I'm going to leave it as-is for now and hope I don't have to replace these disks for a very long time.
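As a quick sanity check on the numbers (my own arithmetic, not any tool's output): leaving the original 250GB partition table on a 500GB drive reserves about half the flash, far above the usual recommendation.

```shell
# Back-of-the-envelope over-provisioning math (decimal GB, as parted reports).
DISK_GB=500   # new WD Blue capacity
PART_GB=250   # size kept from the original partition table
OP_PCT=$(( (DISK_GB - PART_GB) * 100 / DISK_GB ))
echo "${OP_PCT}% of the disk is unallocated over-provisioning headroom"
```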
10b. (Optional) Expand rpool size
If you do decide to expand your pool storage, either to use the full capacity of your disks or a smaller percentage for over-provisioning, here's how. Use parted, because it's easy to use.
# Use the entire disk
$ parted -s -a opt /dev/sds "resizepart 3 100%"
# Leave space for over-provisioning
$ parted -s -a opt /dev/sdt "resizepart 3 80%"
Note: whichever option you choose, do it for both drives (the two commands above show the two alternatives). With the partition table set, let's tell ZFS to make use of the new space.
$ zpool online -e rpool /dev/disk/by-id/ata-WD_Blue_SA510_2.5_500GB_242602A00249-part3
$ zpool online -e rpool /dev/disk/by-id/ata-WD_Blue_SA510_2.5_500GB_24420V801694-part3
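To sanity-check the 80% example (my arithmetic, using the sector count the kernel reported for this drive in dmesg):

```shell
# The WD Blue reports 976773168 logical sectors of 512 bytes.
SECTORS=976773168
BYTES=$(( SECTORS * 512 ))
USED_GB=$(( BYTES * 80 / 100 / 1000000000 ))   # roughly where "resizepart 3 80%" ends
FREE_GB=$(( BYTES / 1000000000 - USED_GB ))    # headroom left for the firmware
echo "~${USED_GB} GB ZFS partition, ~${FREE_GB} GB over-provisioning headroom"
```

So the 80% variant keeps about 100GB back, comfortably above the 10-20% rule of thumb.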