Massive Problems with 10G, NFS, ZFS, and iSCSI
Bob Healey
healer at rpi.edu
Fri Jul 12 16:45:48 UTC 2013
I've been beating my head against a brick wall for a week with this and
5 similar systems.
My current major headache:
Dell PowerEdge R610, dual quad-core Xeon E5530 @ 2.4GHz, 24GB RAM, 4
onboard bce NICs, 1 mxge NIC, a pair of 10K SAS drives on mpt (Dell MB SAS
controller), and a pair of 15-drive 1TB RAID 6 arrays on mfi (PERC 6).
The machine was originally installed with FreeBSD 7.2 and has been
upgraded through the years to 9.1. None of the issues I'm currently
seeing manifested themselves under 9.0. When under heavy NFS load, the
server currently becomes non-responsive on the network, unless the
packet payload is very small (ICMP ping packets with > 124 bytes payload
get dropped).
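For reference, the 124-byte threshold can be turned into packet and frame sizes. This is only a minimal sketch in Python, assuming IPv4 without IP options and standard Ethernet framing. The function name frame_sizes and the option to count a 4-byte 802.1Q tag (mxge0 carries VLANs) are illustrative assumptions, not measurements from this machine.

```python
ICMP_HEADER = 8   # ICMP echo header, bytes
IPV4_HEADER = 20  # IPv4 header without options, bytes
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
VLAN_TAG = 4      # 802.1Q tag, present only on tagged links
ETH_FCS = 4       # frame check sequence


def frame_sizes(icmp_payload, tagged=False):
    """Return (ip_packet_len, ethernet_frame_len) for an ICMP echo payload.

    Padding of very short frames up to the Ethernet minimum is ignored,
    since the payload sizes of interest here are well above it.
    """
    ip_len = IPV4_HEADER + ICMP_HEADER + icmp_payload
    frame_len = ETH_HEADER + (VLAN_TAG if tagged else 0) + ip_len + ETH_FCS
    return ip_len, frame_len


if __name__ == "__main__":
    for payload in (56, 124, 125, 1472):
        print(payload, frame_sizes(payload), frame_sizes(payload, tagged=True))
```

Under these assumptions a 124-byte payload is a 152-byte IP packet, which may help when comparing against buffer or threshold sizes in the driver or switch.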
Current network config:
bce0: management network, connected to the 37 IPMI controllers in the
rack, has conserver running SOL connections to each
bce1: link to outside world, everything in rack trying to reach outside
is NATed through here
bce2: used for a direct host-to-host iSCSI link to another host in the
rack to provide a hard drive for a virtual machine. This machine is the
iSCSI target, and an 80GB zvol is the backing store.
mxge0/vlan1: connected to first 25 machines in rack
mxge0/vlan2: connected to remaining 12 machines in rack, plus a vm on
host #25 on vlan 1
This is an HPC cluster, with all nodes running RHEL 5. The landing pads
(1 real, 1 virtual) are multihomed to both the internal and external
networks, so the only traffic that crosses the NAT is software updates
and job accounting information.
PF is used for firewalling and NAT. skip is enabled on all internal
interfaces.
Stuff I've tried: setting vfs.zfs.arc_max="20480M", disabling flow
control on the 10G NIC, moving the ZIL to some unused space on the boot
drive (RAID 1, mostly UFS).
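As a quick sanity check on the arc_max value, the arithmetic can be sketched in Python. This assumes that "24GB RAM" means 24 GiB and that the M suffix is a binary (1024-based) multiplier. The helper parse_size is hypothetical and exists only for this illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def parse_size(text):
    """Convert a '20480M'-style size string to bytes (binary K/M/G suffixes)."""
    units = {"K": 1024, "M": MIB, "G": GIB}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)  # no suffix: already a byte count


arc_max = parse_size("20480M")  # value from vfs.zfs.arc_max above
ram = 24 * GIB                  # assumption: 24GB RAM is 24 GiB
headroom = ram - arc_max        # memory left outside the ARC cap

if __name__ == "__main__":
    print("arc_max:", arc_max // GIB, "GiB")
    print("left for everything else:", headroom // GIB, "GiB")
```

With these assumptions the ARC cap works out to 20 GiB, leaving about 4 GiB for the kernel, network buffers, and userland. Whether that is enough under heavy NFS load is an open question, not something this calculation settles.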
I'm getting lots of "Limiting open port RST response from 32325 to 200
packets/sec" messages in the logs, iSCSI timeouts on the client, and "NFS
server not responding" errors. netstat -i is showing lots of input errors on
mxge, but I'm not seeing any errors on the switch (Dell PowerConnect
6248). Myricom (the NIC vendor) is at a loss too.
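To put a number on "lots of input errors", the Ierrs and Ipkts counters from netstat -i can be compared per interface. The following is a rough Python sketch, assuming the FreeBSD 9.x column layout ending in Ipkts, Ierrs, Idrop, Opkts, Oerrs, Coll. The function name input_error_ratios is hypothetical, and the SAMPLE text uses made-up addresses and counters purely for illustration; it is not output from this server.

```python
def input_error_ratios(netstat_text):
    """Map interface name -> (Ipkts, Ierrs, Ierrs/Ipkts) from `netstat -i` text."""
    results = {}
    for line in netstat_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 7:
            continue
        name = fields[0]
        if name in results:
            continue  # keep only the first (link-level) row per interface
        try:
            # Count from the right, because the Address column may be empty.
            ipkts, ierrs = int(fields[-6]), int(fields[-5])
        except ValueError:
            continue  # rows that use '-' placeholders instead of counters
        ratio = ierrs / ipkts if ipkts else 0.0
        results[name] = (ipkts, ierrs, ratio)
    return results


# Made-up sample, NOT real data from the machine described in this post.
SAMPLE = """\
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
mxge0  1500 <Link#5>      00:00:00:00:00:01  1000000  2500     0   900000     0     0
bce1   1500 <Link#2>      00:00:00:00:00:02   500000     0     0   400000     0     0
"""

if __name__ == "__main__":
    for name, (ipkts, ierrs, ratio) in input_error_ratios(SAMPLE).items():
        print(f"{name}: {ierrs} errors in {ipkts} packets ({ratio:.4%})")
```

Running this against two snapshots taken a known interval apart would give an error rate under load rather than a lifetime total.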
Any ideas on what I should try next? I'm at the point of throwing darts
blindfolded.
I've got 5 more similar misbehaving machines, 4 of which behave just
fine when using igb instead of mxge.
--
Bob Healey
Systems Administrator
Biocomputation and Bioinformatics Constellation
and Molecularium
healer at rpi.edu
(518) 276-4407
More information about the freebsd-fs
mailing list