Ceph Revisited With NVMe
I recently rebuilt my server cluster. I finally ditched the SATADOMs I was using as /, and installed a pair of budget NVMe drives. Yes, I know, I’m sacrificing performance. I was buying six drives at the same time, OK? Anyway. Let’s review the hardware and benchmark it.
The Setup
- Hardware
  - 2x HDDs (a mix of Seagate IronWolf and WD Red, all 7200 RPM and CMR)
  - 2x Teamgroup MP44L, plugged into a bifurcated PCIe Gen3 x8 slot via a PCIe Gen4 card
    - Yes, a Gen4 drive in a Gen3 slot. It's an oopsie, but I don't hate it.
  - Plenty of CPU to go around
  - 64 GB of RAM per node
  - 3 nodes
- Proxmox Root is on the NVMe drives, mirrored via ZFS on some nodes and BTRFS on others.
- Proxmox-deployed Ceph, with some quirks
  - Two pools
  - Each HDD has one partition for an OSD
  - Each NVMe drive is partitioned as follows:
    - 1 MB - BIOS boot
    - 1 GB - EFI
    - 160 GB - root
    - 215 GB - DB for the HDD OSD
    - 648 GB - SSD OSD
  - The SSD OSDs are the plain one-drive OSDs you can create in Proxmox
  - The HDD OSDs are the data-drive-plus-DB-drive OSDs you can create in Proxmox
    - Proxmox defaults the DB to 10% of the data drive
    - The Ceph docs claim it only needs 4%
    - I had to set the DB size to 150 GB myself (see the OSD creation sketch after this list)
  - The pools use custom CRUSH rules to control which OSDs they land on
    - This can only be done with Ceph tooling, outside of Proxmox
    - I used the Ceph Manager Web UI (a CLI equivalent is sketched after this list)
    - It doesn't seem to matter whether the custom rules are created before or after the pools
- Proxmox container (Debian) used for testing
  - 20 GB drive
  - 4 cores
  - 4 GB RAM
  - Similar fio tests to before
  - To test the pools, I just moved the container's disk from one pool to the other (see the move-volume sketch after this list)
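The DB sizing was the part I had to override by hand. For reference, the CLI equivalent looks roughly like the sketch below. The device paths are placeholders and I'm assuming pveceph's --db_dev/--db_dev_size options here, so treat it as illustrative rather than the exact command I ran:

```
# HDD OSD with its RocksDB on an NVMe partition, with the DB sized
# explicitly (in GiB) instead of the 10%-of-data-drive default.
pveceph osd create /dev/sdb --db_dev /dev/nvme0n1p4 --db_dev_size 150
```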
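The custom rules themselves are ordinary device-class CRUSH rules. I made mine in the Ceph Manager Web UI, but the CLI version is short. The rule and pool names below are made up, and the device classes need to match whatever Ceph actually assigned to your OSDs:

```
# One replicated rule per device class, with host as the failure domain.
# Check `ceph osd tree` to see which class each OSD was given.
ceph osd crush rule create-replicated hdd-only default host hdd
ceph osd crush rule create-replicated ssd-only default host ssd

# Point each pool at its rule.
ceph osd pool set hdd-pool crush_rule hdd-only
ceph osd pool set ssd-pool crush_rule ssd-only
```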
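Moving the container between pools is just a storage move of its root volume. From the shell it's something like the following, assuming pct's move_volume subcommand, with a placeholder VMID and storage name, and with the container stopped:

```
# Move the container's root volume onto the other pool's storage,
# deleting the old copy, then run the same fio test again.
pct move_volume 201 rootfs ceph-ssd --delete 1
```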
The Test
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75 --max-jobs=4
This is the same command I’ve been running.
Test Results
HDD Pool:
# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75 --max-jobs=4
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.33
Starting 1 process
test: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [m(1)][99.9%][r=1021KiB/s,w=356KiB/s][r=255,w=89 IOPS][eta 00m:01s]
test: (groupid=0, jobs=1): err= 0: pid=4282: Thu Aug 15 13:54:39 2024
read: IOPS=117, BW=471KiB/s (482kB/s)(768MiB/1668855msec)
bw ( KiB/s): min= 8, max= 1688, per=100.00%, avg=473.95, stdev=149.44, samples=3317
iops : min= 2, max= 422, avg=118.44, stdev=37.33, samples=3317
write: IOPS=39, BW=157KiB/s (161kB/s)(256MiB/1668855msec); 0 zone resets
bw ( KiB/s): min= 8, max= 529, per=100.00%, avg=162.07, stdev=75.30, samples=3239
iops : min= 2, max= 132, avg=40.52, stdev=18.82, samples=3239
cpu : usr=0.14%, sys=0.40%, ctx=183750, majf=0, minf=7
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=196498,65646,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=471KiB/s (482kB/s), 471KiB/s-471KiB/s (482kB/s-482kB/s), io=768MiB (805MB), run=1668855-1668855msec
WRITE: bw=157KiB/s (161kB/s), 157KiB/s-157KiB/s (161kB/s-161kB/s), io=256MiB (269MB), run=1668855-1668855msec
Disk stats (read/write):
rbd0: ios=196479/66667, merge=0/365, ticks=73520559/28809961, in_queue=102330520, util=100.00%
Odd. This doesn’t seem to be an improvement.
SSD Pool:
# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75 --max-jobs=4
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=20.3MiB/s,w=7155KiB/s][r=5209,w=1788 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=357: Thu Aug 15 14:10:48 2024
read: IOPS=6008, BW=23.5MiB/s (24.6MB/s)(768MiB/32705msec)
bw ( KiB/s): min= 9720, max=26984, per=100.00%, avg=24146.58, stdev=3537.42, samples=65
iops : min= 2430, max= 6746, avg=6036.65, stdev=884.36, samples=65
write: IOPS=2007, BW=8029KiB/s (8222kB/s)(256MiB/32705msec); 0 zone resets
bw ( KiB/s): min= 3048, max= 9328, per=100.00%, avg=8065.23, stdev=1188.85, samples=65
iops : min= 762, max= 2332, avg=2016.31, stdev=297.21, samples=65
cpu : usr=2.85%, sys=7.11%, ctx=217272, majf=0, minf=7
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=196498,65646,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=23.5MiB/s (24.6MB/s), 23.5MiB/s-23.5MiB/s (24.6MB/s-24.6MB/s), io=768MiB (805MB), run=32705-32705msec
WRITE: bw=8029KiB/s (8222kB/s), 8029KiB/s-8029KiB/s (8222kB/s-8222kB/s), io=256MiB (269MB), run=32705-32705msec
Disk stats (read/write):
rbd0: ios=196439/65657, merge=0/57, ticks=938303/1140798, in_queue=2079101, util=99.74%
NVMe should be a huge gain over a SATA SSD. So, these results are also odd. If the flash cells were a downgrade from the SATA SSD I was using before, that might explain the decrease in performance. But, aside from calling this a failed upgrade, I’m not going to dig into that.
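If I ever do circle back to it, the first check would probably be whether the raw OSDs themselves are slow or whether the overhead is further up the stack. Ceph ships a per-OSD write benchmark that bypasses the pool and RBD layers; a rough sketch, with a placeholder OSD ID:

```
# Built-in OSD write benchmark (defaults to 1 GiB of 4 MiB writes).
ceph tell osd.0 bench

# Smaller 4k writes, closer to the fio workload above; the total is kept
# small because the OSD caps small-block bench runs by default.
ceph tell osd.0 bench 12288000 4096
```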