[Fwd: Strange networking behaviour in storage server]

Sun Jun 14 13:30:39 UTC 2015

On 13.06.2015 at 11:31, Edward Tomasz Napierała wrote:
> On 0601T0902, Karli Sjöberg wrote:
>> On Mon 2015-06-01 at 10:33 +0200, Andreas Nilsson wrote:
>>>
>>>
>>> On Mon, Jun 1, 2015 at 10:14 AM, Karli Sjöberg <karli.sjoberg at slu.se>
>>> wrote:
>>>         -------- Forwarded message --------
>>>         > From: Karli Sjöberg <karli.sjoberg at slu.se>
>>>         > To: freebsd-fs at freebsd.org <freebsd-fs at freebsd.org>
>>>         > Subject: Strange networking behaviour in storage server
>>>         > Date: Mon, 1 Jun 2015 07:49:56 +0000
>>>         >
>>>         > Hey!
>>>         >
>>>         > So we have this ZFS storage server upgraded from 9.3-RELEASE to
>>>         > 10.1-STABLE to overcome not being able to 1) use SSD drives as
>>>         > L2ARC[1] and 2) hotswap SATA drives[2].
>>>         >
>>>         > After the upgrade we've noticed very odd networking behaviour:
>>>         > it sends/receives at full speed for a while, then there are a
>>>         > couple of minutes of complete silence where even terminal
>>>         > commands like "ls" just wait until they are executed, and then
>>>         > it starts sending at full speed again. I've linked to a
>>>         > screenshot showing this send-and-pause behaviour. The blue line
>>>         > is the total, green is SMB and turquoise is NFS over jumbo
>>>         > frames. It behaves this way regardless of the protocol.
>>>         >
>>>         > http://oi62.tinypic.com/33xvjb6.jpg
>>>         >
>>>         > The problem is that these pauses can sometimes be so long that
>>>         > connections drop. For example, someone is copying files over
>>>         > SMB or iSCSI and suddenly gets an error message saying that the
>>>         > transfer failed, and they have to start over with the file(s).
>>>         > That's horrible!
>>>         >
>>>         > So far NFS has proven to be the most resilient; its
>>>         > stupid-simple nature just waits and resumes the transfer when
>>>         > the pause is over. Kudos for that.
>>>         >
>>>         > The server is driven by a Supermicro X9SRL-F, a Xeon 1620v2 and
>>>         > 64GB ECC RAM. The hardware has been ruled out: we happened to
>>>         > have an identical MB and CPU lying around and that didn't
>>>         > improve things. We have also installed an Intel PRO 100/1000
>>>         > Quad-port ethernet adapter to test if that would change things,
>>>         > but it hasn't; it still behaves this way.
>>>         >
>>>         > The two built-in NICs are Intel 82574L and the Quad-port NICs
>>>         > are Intel 82571EB, so both are em(4) driven. I happen to know
>>>         > that the em driver has been updated between 9.3 and 10.1.
>>>         > Perhaps that is to blame, but I have no idea.
>>>         >
>>>         > Is there anyone that can make sense of this?
>>>         >
>>>         > [1]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197164
>>>         >
>>>         > [2]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191348
>>>         >
>>>         > /K
>>>         >
>>>         >
>>>         
>>>         
>>>         Another observation I've made is that during these pauses, the
>>>         entire system is put on hold; even ZFS scrub stops and then
>>>         resumes after a while. Looking in top, the system is completely
>>>         idle.
>>>         
>>>         Normally during scrub, the kernel eats 20-30% CPU, but during a
>>>         pause, even the [kernel] goes down to 0.00%. Makes me think the
>>>         networking has nothing to do with it.
>>>         
>>>         What's to blame then? ZFS?
>>>         
>>>         /K
>>>         _______________________________________________
>>>         freebsd-fs at freebsd.org mailing list
>>>         http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>>         To unsubscribe, send any mail to
>>>         "freebsd-fs-unsubscribe at freebsd.org"
>>>
>>>
>>> Hello,
>>>
>>>
>>> does this happen when clients are only reading from the server?
>>
>> Yes it happens when clients are only reading from the server.
>>
>>> Otherwise I would suspect that it could be caused by ZFS writing out a
>>> large chunk of data sitting in its caches, and until that is complete
>>> I/O is stalled.
>>
>> That's what's so strange: we have three more systems set up at about the
>> same size and none of the others are acting this way.
>>
>> The only difference I can think of that we haven't ruled out yet is
>> ctld; the other systems are still running istgt as their iSCSI daemon.
> 
> So, were you able to rule out ctld?
> 
> Do you have local, or terminal, access to the machine?  When the problem
> manifests, do local commands work?  In other words, is the whole machine
> wedged, or just the network?  If it's just the network, it might be
> caused by ctld consuming all available mbufs.  You could run "netstat -m"
> before and after to check that.
> 
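The "before and after" comparison suggested above can be sketched as a small logger. This is only a sketch under assumptions: the helper name snapshot and the log path are hypothetical, and the loop in the comment is one possible way to drive it. Since the thread reports that even local commands stall during a pause, a gap between timestamps in the log would itself show how long a stall lasted.

```shell
#!/bin/sh
# Hypothetical helper: append a UTC-timestamped snapshot of any command's
# output to a log file, so states before and after a stall can be compared.
snapshot() {
    # $1 = log file; remaining arguments = command to run (e.g. netstat -m)
    log=$1
    shift
    {
        printf '=== %s ===\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
        "$@" 2>&1
    } >> "$log"
}

# Possible use on the storage server (path and interval are assumptions):
#   while :; do snapshot /var/tmp/mbufs.log netstat -m; sleep 10; done
```

Comparing the mbuf counters in the entries just before and just after a gap should show whether mbuf usage climbs towards its limit ahead of a pause.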

Have you already checked (and double-checked) the HBA firmware etc.? Is the cabling fine?

I expect you have already disabled TSO, LRO (FreeBSD's counterpart of
Linux GRO), rxcsum and txcsum on your NIC(s). I had similar effects with
all those fancy uberfeatures enabled.

Give it a try... ifconfig foo0 -rxcsum -txcsum -tso -lro
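A hedged sketch of that test follows. It assumes a FreeBSD host, where the ifconfig(8) keyword for receive-side coalescing is lro (gro is the Linux name). The function name disable_offloads, the DRY_RUN switch and the interface name em0 are hypothetical. By default it only prints the command, so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical wrapper: print (default) or run the ifconfig call that
# turns off checksum offload, TSO and LRO on one interface.
disable_offloads() {
    # $1 = interface name, e.g. em0 (an assumption; use your own)
    set -- ifconfig "$1" -rxcsum -txcsum -tso -lro
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"      # dry run: show the command only
    else
        "$@"                    # DRY_RUN=0: apply it (needs root)
    fi
}

disable_offloads em0
```

Running plain "ifconfig em0" before and after shows the options= line, which lists the capabilities still enabled. The change does not survive a reboot unless it is also added to the interface's line in /etc/rc.conf.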

Capturing a few MB of traffic before/after could also be very helpful to
see if...
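One way to keep "a few MB" of traffic on hand around a stall is a small tcpdump ring buffer. This is a sketch under assumptions: the function name capture_ring, the interface em0, the output path and the sizes (4 files of about 5 MB each) are all hypothetical choices. It prints the command by default rather than starting a capture.

```shell
#!/bin/sh
# Hypothetical wrapper: print (default) or start a tcpdump ring-buffer
# capture, so the most recent ~20 MB of traffic is always on disk.
capture_ring() {
    # $1 = interface, $2 = output file name used as the ring prefix
    # -s 0: full packets, -C 5: rotate every ~5 million bytes,
    # -W 4: keep at most 4 files, overwriting the oldest
    set -- tcpdump -i "$1" -s 0 -C 5 -W 4 -w "$2"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"      # dry run: show the command only
    else
        "$@"                    # DRY_RUN=0: start capturing (needs root)
    fi
}

capture_ring em0 /var/tmp/stall.pcap
```

Stopping the capture shortly after a pause ends leaves the packets from just before and just after the stall in the ring files.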
