Suggestion for hardware for ZFS fileserver
Willem Jan Withagen
wjw at digiware.nl
Thu Dec 27 10:54:26 UTC 2018
On 22/12/2018 15:49, Sami Halabi wrote:
> Hi,
>
> What sas hba card do you recommend for 16/24 internal ports and 2 external
> that are recognized and work well with freebsd ZFS.
There is no real advice here, but what I have seen is that it is
relatively easy to overload a lot of the busses involved in this.
I got this when building Ceph clusters on FreeBSD, where each disk has
its own daemon to hammer away on the platters.
The first bottleneck is the disk "backplane". If you do not wire every
disk with a dedicated HBA-disk cable, then you are sharing the bandwidth
on the backplane between all the disks, and depending on the
architecture of the backplane, several disks share one expander. The
feed into that expander will be shared by all the disks attached to it.
Some expanders will have multiple inputs from the HBA, but I have seen
cases where 4 SAS lanes go in and only 2 get used.
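To see what that sharing costs, here is a minimal sketch (not from the
original post; the lane speed, lane counts, and disk counts are assumed
example values) of the per-disk bandwidth left when many disks sit
behind one expander uplink:

```python
# Illustrative only: raw expander-uplink bandwidth divided evenly over
# the disks behind it. Ignores protocol/encoding overhead.

SAS3_LANE_GBIT = 12.0  # assumed: one SAS-3 lane, Gbit/s

def per_disk_bandwidth_mbyte(uplink_lanes: int, n_disks: int,
                             lane_gbit: float = SAS3_LANE_GBIT) -> float:
    """Uplink bandwidth (Mbyte/s) shared evenly across n_disks."""
    total_gbit = uplink_lanes * lane_gbit
    total_mbyte = total_gbit * 1000 / 8  # Gbit/s -> Mbyte/s
    return total_mbyte / n_disks

# A 4-lane wide port feeding 24 disks:
print(per_disk_bandwidth_mbyte(4, 24))  # 250.0 Mbyte/s per disk
# The same expander with only 2 of the 4 lanes actually in use:
print(per_disk_bandwidth_mbyte(2, 24))  # 125.0 Mbyte/s per disk
```

With only half the lanes wired up, each disk's share drops to roughly
what a single modern spindle can stream, so any concurrency saturates
the uplink.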
The second bottleneck appears once you have all these nice disks
connected to your HBA, but the HBA sits in only a PCIe x4 slot.....
You will need a PCIe x8 or x16 slot for that, and preferably PCIe 3.0.
Total bandwidth (x16 link, both directions combined): PCIe 3.0 = 32
GB/s, PCIe 2.0 = 16 GB/s, PCIe 1.1 = 8 GB/s.
So let's say that your 24-port HBA has 24 disks connected, each doing
100 Mbyte/s; that adds up to 19.2 Gbit/s, which will very likely
saturate that PCIe bus. Note that I'm only talking 100 Mbyte/s, since
that is what I see spinning rust do under Ceph. I'm not even talking
about the SSDs used for journals and cache.
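The arithmetic above can be checked back-of-the-envelope. The per-lane
PCIe figures below are assumed approximations (one direction, before
protocol overhead), not exact numbers from the post:

```python
# Rough check: aggregate disk demand vs. approximate PCIe slot bandwidth.
# Per-lane figures are assumed one-direction approximations.

PCIE_GBYTE_PER_LANE = {"1.1": 0.25, "2.0": 0.5, "3.0": 1.0}  # ~GB/s per lane

def slot_gbyte(gen: str, lanes: int) -> float:
    """Approximate one-direction bandwidth of a PCIe slot in GB/s."""
    return PCIE_GBYTE_PER_LANE[gen] * lanes

disks = 24
mbyte_per_disk = 100                          # sustained spinning-rust rate
demand_gbyte = disks * mbyte_per_disk / 1000  # 2.4 GB/s
demand_gbit = demand_gbyte * 8                # 19.2 Gbit/s

print(f"aggregate disk demand: {demand_gbyte} GB/s ({demand_gbit} Gbit/s)")
for gen, lanes in [("2.0", 4), ("2.0", 8), ("3.0", 8)]:
    cap = slot_gbyte(gen, lanes)
    print(f"PCIe {gen} x{lanes}: {cap} GB/s  saturated: {demand_gbyte > cap}")
```

A PCIe 2.0 x4 slot (~2 GB/s) is already over its limit with just the
spinning disks, while a 3.0 x8 slot still has headroom.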
For ZFS the bus challenge is a bit more of a problem, because you cannot
scale out.
But I've seen designs where an extra disk cabinet with 96 disks is
attached over something like 4*12 Gbit/s to a controller in a PCIe x16
slot, with people wondering why it doesn't do what they thought it
would.
For Ceph there is a "nice" way out, because it is able to scale out over
more, smaller servers with fewer disks per chassis. So we tend to use
16-drive chassis with two 8-port HBAs that have a dedicated connection
per disk. It is a bit more expensive, but it seems to work much better.
Note that you will then run into network problems, which are more of the
same, only a bit further up the scale.
With Ceph that only plays a role during recovery of lost nodes, which
hopefully is not too often. A dead/replaced disk will be able to restore
at the maximum speed a single disk can take. A lost/replaced node will
recover at a speed limited by the disk infrastructure of the recovering
node, since the data will come from a lot of other disks on other
servers. The local busses will saturate if the HW design was poor.
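A rough illustration (all numbers below are assumed, not from the post)
of why the recovering node's local disk path is the cap: the data
streams in from many peers at once, so only the local busses matter.

```python
# Hypothetical node-recovery time, capped by the local disk path.
# Sizes and path speeds are illustrative assumptions.

def recovery_hours(data_tbyte: float, local_path_gbyte_s: float) -> float:
    """Hours to pull data_tbyte through a local path of the given GB/s."""
    seconds = data_tbyte * 1000 / local_path_gbyte_s
    return seconds / 3600

# Assumed node: 16 x 8 TB drives, half full = 64 TB to re-replicate.
print(round(recovery_hours(64, 2.0), 1))  # via a 2 GB/s local path: 8.9 h
print(round(recovery_hours(64, 0.5), 1))  # via a starved 0.5 GB/s path: 35.6 h
```

The same lost node takes four times as long to heal on the starved bus,
which is exactly the window in which the cluster runs with reduced
redundancy.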
--WjW