[Fwd: Strange networking behaviour in storage server]

Mon Jun 1 10:28:31 UTC 2015

Mon 2015-06-01 at 02:56 -0700, Mehmet Erol Sanliturk wrote:
> 
> 
> On Mon, Jun 1, 2015 at 2:02 AM, Karli Sjöberg <karli.sjoberg at slu.se>
> wrote:
>         Mon 2015-06-01 at 10:33 +0200, Andreas Nilsson wrote:
>         >
>         >
>         > On Mon, Jun 1, 2015 at 10:14 AM, Karli Sjöberg <karli.sjoberg at slu.se>
>         > wrote:
>         >         -------- Forwarded message --------
>         >         > From: Karli Sjöberg <karli.sjoberg at slu.se>
>         >         > To: freebsd-fs at freebsd.org <freebsd-fs at freebsd.org>
>         >         > Subject: Strange networking behaviour in storage server
>         >         > Date: Mon, 1 Jun 2015 07:49:56 +0000
>         >         >
>         >         > Hey!
>         >         >
>         >         > So we have this ZFS storage server upgraded from 9.3-RELEASE to
>         >         > 10.1-STABLE to overcome 1) not being able to use SSD drives as
>         >         > L2ARC[1] and 2) not being able to hotswap SATA drives[2].
>         >         >
>         >         > After the upgrade we've noticed very odd networking behaviour: it
>         >         > sends/receives at full speed for a while, then there are a couple
>         >         > of minutes of complete silence where even terminal commands like
>         >         > an "ls" just wait until they are executed, and then it starts
>         >         > sending at full speed again. I've linked to a screenshot showing
>         >         > this send-and-pause behaviour. The blue line is the total, green
>         >         > is SMB and turquoise is NFS over jumbo frames. It behaves this way
>         >         > regardless of the protocol.
>         >         >
>         >         > http://oi62.tinypic.com/33xvjb6.jpg
>         >         >
>         >         > The problem is that these pauses can sometimes be so long that
>         >         > connections drop. For example, someone is copying files over SMB
>         >         > or iSCSI and suddenly they get an error message saying that the
>         >         > transfer failed and they have to start over with the file(s).
>         >         > That's horrible!
>         >         >
>         >         > So far NFS has proven to be the most resilient; its stupid-simple
>         >         > nature just waits and resumes the transfer when the pause is over.
>         >         > Kudos for that.
>         >         >
>         >         > The server is driven by a Supermicro X9SRL-F, a Xeon 1620v2 and
>         >         > 64GB ECC RAM. The hardware has been ruled out: we happened to have
>         >         > an identical MB and CPU lying around and that didn't improve
>         >         > things. We have also installed an Intel PRO 100/1000 Quad-port
>         >         > ethernet adapter to test if that would change things, but it
>         >         > hasn't; it still behaves this way.
>         >         >
>         >         > The two built-in NICs are Intel 82574L and the Quad-port NICs are
>         >         > Intel 82571EB, so both are em(4) driven. I happen to know that the
>         >         > em driver has been updated between 9.3 and 10.1. Perhaps that is
>         >         > to blame, but I have no idea.
>         >         >
>         >         > Is there anyone that can make sense of this?
>         >         >
>         >         > [1]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197164
>         >         >
>         >         > [2]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191348
>         >         >
>         >         > /K
>         >         >
>         >         >
>         >
>         >
>         >         Another observation I've made is that during these pauses, the
>         >         entire system is put on hold; even the ZFS scrub stops and then
>         >         resumes after a while. Looking in top, the system is completely
>         >         idle.
>         >
>         >         Normally during scrub, the kernel eats 20-30% CPU, but during a
>         >         pause, even the [kernel] goes down to 0.00%. Makes me think the
>         >         networking has nothing to do with it.
>         >
>         >         What's to blame then? ZFS?
>         >
>         >         /K
>         >         _______________________________________________
>         >         freebsd-fs at freebsd.org mailing list
>         >         http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>         >         To unsubscribe, send any mail to
>         >         "freebsd-fs-unsubscribe at freebsd.org"
>         >
>         >
>         > Hello,
>         >
>         >
>         > Does this happen when clients are only reading from the server?
>         
>         Yes, it happens when clients are only reading from the server.
>         
>         > Otherwise I would suspect that it could be caused by ZFS writing out
>         > a large chunk of data sitting in its caches, and until that is
>         > complete I/O is stalled.
>         
>         That's what's so strange: we have three more systems set up at about
>         the same size and none of the others are acting this way.
>         
>         The only thing I can think of that differs, and that we haven't yet
>         ruled out, is ctld; the other systems are still running istgt as
>         their iSCSI daemon.
>         
>         /K
>         
> 
> 
> 
> If there are three other similar systems and they are installed with
> exactly the same structure, the first possibility I would consider is
> a slowly progressing hardware failure:
> 
> 
> A circuit does not respond within the expected time, but it does
> respond after an abnormally long delay. Such behaviour may be caused by
> a faulty solder joint or a cracked trace in the circuit: when it is
> hot, it disconnects; when it is cold, it connects.

As initially stated, both motherboard and processor have been replaced
with identical hardware that went through a day of memtest before being
installed. Then there's an external Supermicro JBOD[*], but I haven't
seen any disk timeouts or SES errors logged. With a delay as long as
five minutes, there should at least have been timeouts at the driver
level.
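
A minimal sketch of such a log check, for anyone who wants to repeat it
on their own system. The function name and the keyword pattern are
illustrative assumptions, not the exact wording of any FreeBSD driver
message, so the pattern would need adjusting to what the drivers in
question actually log.

```python
import re
from typing import Iterable, List

# Hypothetical keywords; real driver messages may be worded differently.
DEFAULT_PATTERN = re.compile(r"timeout|timed out", re.IGNORECASE)


def find_timeout_lines(lines: Iterable[str],
                       pattern: "re.Pattern[str]" = DEFAULT_PATTERN) -> List[str]:
    """Return every log line matching the pattern, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

It takes any iterable of lines, so an open file handle for the system
log (by default /var/log/messages on FreeBSD) can be passed in directly.
An empty result only means nothing matched the chosen keywords; it does
not prove the disks and enclosure are healthy.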

/K

[*]: http://www.supermicro.nl/products/chassis/3U/837/SC837E26-RJBOD1.cfm

> 
> 
> 
> 
> 
> Thank you very much.
> 
> 
> 
> Mehmet Erol Sanliturk
> 
> 
> 
> 
>  
>         >
>         >
>         > Have you tried what is suggested in
>         > https://wiki.freebsd.org/ZFSTuningGuide ? In particular, setting
>         > vfs.zfs.write_limit_override to something appropriate for your
>         > site. The timeout seems to be defaulting to 5 now.
>         >
>         >
>         > Best regards
>         >
>         > Andreas
>         >
>         >
>         >
>         
> 
>