

Proxmox: Replace a failed bootable ZFS disk in rpool

It’s what every sysadmin dreads and prepares for: a failed hard drive.

Replacing a failed SSD in a Proxmox boot pool (rpool) can be a daunting task, especially considering the critical role it plays in your server’s operation. A mirrored ZFS boot pool provides redundancy, but when one of the drives fails, it’s imperative to address the issue promptly to maintain system integrity and prevent data loss. This article will walk you through the process of replacing a failed SSD in your Proxmox rpool, ensuring that your system remains robust and reliable.

In the world of system administration, hardware failures are an inevitable reality. Fortunately, ZFS, the filesystem utilized by Proxmox, offers features that simplify the replacement process. By following the correct procedures, you can remove the failed drive, install and prepare the new SSD, and reintegrate it into the rpool with minimal downtime. This guide aims to provide clear, step-by-step instructions to help you navigate this process confidently, ensuring your Proxmox environment continues to operate smoothly.

The problem

In my main Proxmox/storage server I have a ZFS-mirrored rpool. This is where the system boots from and where all OS files reside. One of the disks died.

root@treebeard:~# zpool status rpool
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:18:07 with 0 errors on Sun Mar  9 00:42:18 2025
config:

        NAME                                       STATE     READ WRITE CKSUM
        rpool                                      DEGRADED     0     0     0
          mirror-0                                 DEGRADED     0     0     0
            ata-CT250MX500SSD1_2208E60D104A-part3  ONLINE       0     0     0
            ata-CT250MX500SSD1_1949E22CA8C7-part3  FAULTED      6   300     0  too many errors

Luckily, ZFS caught the problem and faulted the disk, taking it out of active duty in rpool to avoid further damage. The other disk kept going, so no other issues presented themselves.

There is a slight complication that prevents us from simply running zpool replace on the entire disk:

Device       Start       End   Sectors   Size Type
/dev/sds1       34      2047      2014  1007K BIOS boot
/dev/sds2     2048   1050623   1048576   512M EFI System
/dev/sds3  1050624 488397134 487346511 232.4G Solaris /usr & Apple ZFS

Each disk has its own BIOS boot and EFI partitions to boot the system (and load a kernel with ZFS support).

ZFS works not only with whole disks, but also with partitions. In the example above, /dev/sds1 and /dev/sds2 exist identically on each disk in rpool but are not part of the ZFS pool itself. /dev/sds3, however, is a ZFS partition and is mirrored with /dev/sdt3 in rpool.
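For reference, you can inspect the layout of a pool member yourself. The table above is fdisk-style output, and a listing like the following (assuming /dev/sds is the healthy disk) produces it:

$ fdisk -l /dev/sds

# sgdisk prints the same GPT information
$ sgdisk -p /dev/sds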

The replacement

Replacing this disk is straightforward, once you know the steps to do it…

In my case, I have maxed out my hot-swap drive bays, so I'll have to remove the faulted disk before I can put in the new drive. If I were replacing the disks in rpool as a precaution or to upgrade capacity, i.e. with rpool still fully online, I would prefer to add the new disk first and let both original ZFS disks handle the resilver. Because one disk is already dead, all resilvering reads will hit the single remaining disk, putting extra load on it. Avoid that if possible, but here I have no choice.

So, I popped out the failed drive and put in a new WD Blue 500GB drive. It's way too big for an rpool disk, but it was the best price/performance option I could find at this time; a 250GB version would actually have been more expensive. I bought two new drives, because the other SSD is of a similar vintage to the failed disk. Replacing both also has the benefit of making the full 500GB capacity usable (although I don't know what for).

1. Note the ID and device name of the failed hard drive

zpool status -v rpool shows that ata-CT250MX500SSD1_1949E22CA8C7-part3 is the FAULTED partition. Then find out the device name for easy reference:

$ ls -alh /dev/disk/by-id | grep ata-CT250MX500SSD1_1949E22CA8C7
lrwxrwxrwx 1 root root  9 Nov 20 19:20 ata-CT250MX500SSD1_1949E22CA8C7 -> ../../sdt
lrwxrwxrwx 1 root root 10 Nov 20 19:21 ata-CT250MX500SSD1_1949E22CA8C7-part1 -> ../../sdt1
lrwxrwxrwx 1 root root 10 Nov 20 19:21 ata-CT250MX500SSD1_1949E22CA8C7-part2 -> ../../sdt2
lrwxrwxrwx 1 root root 10 Nov 20 19:21 ata-CT250MX500SSD1_1949E22CA8C7-part3 -> ../../sdt3

So, /dev/sdt is the faulted drive.

2. Out with the old, and in with the new

Next, I open another terminal and run sudo dmesg -w. This tails all kernel output while I remove the old drive and insert the new one. Here you'll also see which device name the new drive is assigned. This may be the same as before, but it could also differ. In my case it was assigned /dev/sdt (the same as the old drive).

[11449978.939798] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[11449978.947370] ata6.00: ATA-11: WD Blue SA510 2.5 500GB, 52046100, max UDMA/133
[11449978.951552] ata6.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 32), AA
[11449978.989819] ata6.00: Features: Dev-Sleep
[11449979.016204] ata6.00: configured for UDMA/133
[11449979.035403] scsi 6:0:0:0: Direct-Access     ATA      WD Blue SA510 2. 6100 PQ: 0 ANSI: 5
[11449979.036341] sd 6:0:0:0: [sdt] 976773168 512-byte logical blocks: (500 GB/466 GiB)
[11449979.036642] sd 6:0:0:0: Attached scsi generic sg20 type 0
[11449979.036955] sd 6:0:0:0: [sdt] Write Protect is off
[11449979.038080] sd 6:0:0:0: [sdt] Mode Sense: 00 3a 00 00
[11449979.038832] sd 6:0:0:0: [sdt] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[11449979.040161] sd 6:0:0:0: [sdt] Preferred minimum I/O size 512 bytes
[11449979.042429] sd 6:0:0:0: [sdt] Attached SCSI disk

After this, I recommend you do any testing you feel is necessary. I run an extended SMART test: smartctl -t long /dev/sdt to make sure the drive is behaving correctly.
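Once the long test finishes (it can take a while), check the result before trusting the drive; a quick check, assuming the new drive is still /dev/sdt:

# Show the self-test log; the latest entry should read 'Completed without error'
$ smartctl -l selftest /dev/sdt

# Or review the full SMART report, attributes included
$ smartctl -a /dev/sdt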

3. Replicate the partition table

The first thing we'll do with our new disk is put a partition table on it. It should be identical to the one on /dev/sds, the other disk in the pool.

$ sgdisk /dev/sds -R /dev/sdt
The operation has completed successfully.

The -R option replicates the partition table from /dev/sds onto /dev/sdt. Be careful with the argument order: the positional device is the source and the disk after -R is the target, so swapping them would overwrite the good disk's partition table.

4. Randomize partition GUIDs

Each partition has its own GUID, and the replicate action from the previous step also copied those GUIDs. Because we want the GUIDs to be unique, we'll randomize them on our new disk.

$ sgdisk -G /dev/sdt
The operation has completed successfully.
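To double-check the result, you can print both partition tables and compare them; sizes and type codes should match, while the disk GUID and per-partition GUIDs should now differ:

$ sgdisk -p /dev/sds
$ sgdisk -p /dev/sdt

# Show the unique GUID of partition 3 on the new disk
$ sgdisk -i 3 /dev/sdt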

5. Find the new label for the new disk

We already know the label for the ZFS partition on the old disk: ata-CT250MX500SSD1_1949E22CA8C7-part3. Now, let’s find the label for the new disk.

$ ls -alh /dev/disk/by-id | grep -i sdt
lrwxrwxrwx 1 root root    9 Apr  2 09:10 ata-WD_Blue_SA510_2.5_500GB_24420V801694 -> ../../sdt
lrwxrwxrwx 1 root root   10 Apr  2 09:10 ata-WD_Blue_SA510_2.5_500GB_24420V801694-part1 -> ../../sdt1
lrwxrwxrwx 1 root root   10 Apr  2 09:24 ata-WD_Blue_SA510_2.5_500GB_24420V801694-part2 -> ../../sdt2
lrwxrwxrwx 1 root root   10 Apr  2 09:10 ata-WD_Blue_SA510_2.5_500GB_24420V801694-part3 -> ../../sdt3
lrwxrwxrwx 1 root root    9 Apr  2 09:10 wwn-0x5001b448c9ab9aa3 -> ../../sdt
lrwxrwxrwx 1 root root   10 Apr  2 09:10 wwn-0x5001b448c9ab9aa3-part1 -> ../../sdt1
lrwxrwxrwx 1 root root   10 Apr  2 09:24 wwn-0x5001b448c9ab9aa3-part2 -> ../../sdt2
lrwxrwxrwx 1 root root   10 Apr  2 09:10 wwn-0x5001b448c9ab9aa3-part3 -> ../../sdt3

It's worth noting here that the WD Blues provide an LU WWN Device ID (Logical Unit World Wide Name) in addition to a more human-readable ID that includes the serial number. I prefer to use the latter for easy reference.
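If you want to verify which physical drive a label points to, the identity data can also be read straight from the device; for example:

# Prints the model, serial number and LU WWN Device Id, which match the by-id labels
$ smartctl -i /dev/sdt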

6. Replace the faulted ZFS partition

This is it. We can now replace the faulted partition with the new one.

$ zpool replace -f rpool ata-CT250MX500SSD1_1949E22CA8C7-part3 ata-WD_Blue_SA510_2.5_500GB_24420V801694-part3

This command should finish quickly and start the resilvering process.

$ zpool status rpool
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Apr  2 08:51:45 2025
        138G / 138G scanned, 12.5G / 138G issued at 145M/s
        12.6G resilvered, 9.03% done, 00:14:46 to go
config:

        NAME                                                  STATE     READ WRITE CKSUM
        rpool                                                 DEGRADED     0     0     0
          mirror-0                                            DEGRADED     0     0     0
            ata-CT250MX500SSD1_2208E60D104A-part3             ONLINE       0     0     0
            replacing-1                                       DEGRADED     0     0     0
              ata-CT250MX500SSD1_1949E22CA8C7-part3           REMOVED      0     0     0
              ata-WD_Blue_SA510_2.5_500GB_24420V801694-part3  ONLINE       0     0     0  (resilvering)

7. Wait for the resilvering to complete

This may take a while. Here it took about 15 minutes.
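If you don't want to re-run the status command by hand while waiting, something like this keeps an eye on it:

# Refresh the pool status every 30 seconds until the resilver completes
$ watch -n 30 zpool status rpool

Once it's done, the pool reports healthy again: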

$ zpool status rpool
  pool: rpool
 state: ONLINE
  scan: resilvered 140G in 00:16:42 with 0 errors on Wed Apr  2 09:08:27 2025
config:

        NAME                                                STATE     READ WRITE CKSUM
        rpool                                               ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            ata-CT250MX500SSD1_2208E60D104A-part3           ONLINE       0     0     0
            ata-WD_Blue_SA510_2.5_500GB_24420V801694-part3  ONLINE       0     0     0

8. Refresh the boot partitions

As stated earlier, each disk in rpool is bootable (you can boot from either one). For that purpose, each disk has its own boot partitions, but those are not part of any ZFS replication. To keep the boot data on each drive correct and up to date, Proxmox provides a tool aptly named proxmox-boot-tool. My system boots using UEFI, so let's make sure that all works:

$ proxmox-boot-tool format /dev/sdt2
$ proxmox-boot-tool init /dev/sdt2
Running hook script 'proxmox-auto-removal'..
Running hook script 'zz-proxmox-boot'..
Re-executing '/etc/kernel/postinst.d/zz-proxmox-boot' in new private mount namespace..
Copying and configuring kernels on /dev/disk/by-uuid/621A-90AF
        Copying kernel and creating boot-entry for 6.5.13-6-pve
        Copying kernel and creating boot-entry for 6.8.12-1-pve
        Copying kernel and creating boot-entry for 6.8.12-7-pve
Copying and configuring kernels on /dev/disk/by-uuid/7875-951A
        Copying kernel and creating boot-entry for 6.5.13-6-pve
        Copying kernel and creating boot-entry for 6.8.12-1-pve
        Copying kernel and creating boot-entry for 6.8.12-7-pve
WARN: /dev/disk/by-uuid/D7AD-EC73 does not exist - clean '/etc/kernel/proxmox-boot-uuids'! - skipping

This copies everything needed to boot the Proxmox box onto each disk.

Notice we get a warning here about the fact that /dev/disk/by-uuid/D7AD-EC73 does not exist. This is the failed disk we removed earlier. We can simply edit /etc/kernel/proxmox-boot-uuids and remove the entry for D7AD-EC73.

If you want, you can run proxmox-boot-tool refresh to confirm the warning is gone.
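proxmox-boot-tool also has subcommands for this kind of housekeeping; as far as I know, these two cover the cleanup and the check:

# Remove UUIDs from /etc/kernel/proxmox-boot-uuids whose partitions no longer exist
$ proxmox-boot-tool clean

# Show which ESPs are configured and kept in sync
$ proxmox-boot-tool status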

9. (Optional) Replace the other device

You can repeat the above steps to replace the other drive as well. I did this because both devices were purchased around the same time. Before I remove the working drive, I offline it first, so ZFS is not sending any writes to it while I pull it out.

# Offline the drive first
$ zpool offline rpool ata-CT250MX500SSD1_2208E60D104A-part3

# Replicate the partition table and randomize GUIDs
$ sgdisk /dev/sdt -R /dev/sds
$ sgdisk -G /dev/sds

# Initialize the resilver
$ zpool replace -f rpool ata-CT250MX500SSD1_2208E60D104A-part3 ata-WD_Blue_SA510_2.5_500GB_242602A00249-part3

# Configure the boot partitions and update the UUIDs in the configuration
$ proxmox-boot-tool format /dev/sds2
$ proxmox-boot-tool init /dev/sds2
$ vim /etc/kernel/proxmox-boot-uuids

10a. (Optional) SSD Over-provisioning

So, I have now upgraded my 250GB ZFS mirror with two 500GB drives. Any good Proxmox admin knows not to use rpool for anything other than the OS. I do not need that additional 250GB of storage.

Enter SSD over-provisioning: over-provisioning refers to setting aside extra space on an SSD that is not visible to the user or OS. This space is used by the SSD firmware for:

  • Wear leveling
  • Garbage collection
  • Bad block management
  • TRIM operations

Long story short: because the partition table still has the 250GB layout from the original disks, there is about 250GB of unpartitioned free space left on each new drive.

$ parted /dev/sdt print free
Model: ATA WD Blue SA510 2. (scsi)
Disk /dev/sdt: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name  Flags
 1      17.4kB  1049kB  1031kB                     bios_grub
 2      1049kB  538MB   537MB   fat32              boot, esp
 3      538MB   250GB   250GB   zfs
        250GB   500GB   250GB   Free Space

If I needed the space, I could expand the ZFS partition to make use of it. But because I don't, I can leave the partition table as-is and give the SSD's firmware plenty of room for wear leveling and bad block management. Most SSDs have some over-provisioning built in; if you do it manually, leaving 10-20% of the capacity free is recommended. I'm going to leave it as-is for now and hope I don't have to replace these disks for a very long time.

10b. (Optional) Expand rpool size

If you do decide to expand your pool storage, either to use the full capacity of your disks or to keep a smaller percentage free for over-provisioning, here's how. I use parted, because it's easy.

# Use the entire disk
$ parted -s -a opt /dev/sds "resizepart 3 100%"

# Leave space for over-provisioning
$ parted -s -a opt /dev/sdt "resizepart 3 80%"

Note: do this for both drives. With the partition tables updated, let's tell ZFS to make use of the new space.

$ zpool online -e rpool /dev/disk/by-id/ata-WD_Blue_SA510_2.5_500GB_242602A00249-part3
$ zpool online -e rpool /dev/disk/by-id/ata-WD_Blue_SA510_2.5_500GB_24420V801694-part3
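As an aside, ZFS can also pick up grown devices automatically if the pool's autoexpand property is enabled; with it off (the default), the explicit zpool online -e above is what triggers the expansion. A quick way to check:

# Check or enable automatic expansion for the pool
$ zpool get autoexpand rpool
$ zpool set autoexpand=on rpool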
