[Fwd: Strange networking behaviour in storage server]

Karli Sjöberg karli.sjoberg at slu.se
Mon Jun 1 10:18:42 UTC 2015


Mon 2015-06-01 at 12:11 +0200, Andreas Nilsson wrote:
> 
> 
> On Mon, Jun 1, 2015 at 11:56 AM, Mehmet Erol Sanliturk
> <m.e.sanliturk at gmail.com> wrote:
>         
>         
>         On Mon, Jun 1, 2015 at 2:02 AM, Karli Sjöberg
>         <karli.sjoberg at slu.se> wrote:
>                 Mon 2015-06-01 at 10:33 +0200, Andreas
>                 Nilsson wrote:
>                 >
>                 >
>                 > On Mon, Jun 1, 2015 at 10:14 AM, Karli Sjöberg
>                 <karli.sjoberg at slu.se>
>                 > wrote:
>                 >         -------- Forwarded message --------
>                 >         > From: Karli Sjöberg <karli.sjoberg at slu.se>
>                 >         > To: freebsd-fs at freebsd.org
>                 <freebsd-fs at freebsd.org>
>                 >         > Subject: Strange networking behaviour in
>                 storage server
>                 >         > Date: Mon, 1 Jun 2015 07:49:56 +0000
>                 >         >
>                 >         > Hey!
>                 >         >
>                 >         > So we have this ZFS storage server
>                 upgraded from 9.3-RELEASE
>                 >         to
>                 >         > 10.1-STABLE to overcome not being able to
>                 1) use SSD drives
>                 >         as
>                 >         > L2ARC[1]
>                 >         > and 2) not being able to hotswap SATA
>                 drives[2].
>                 >         >
>                 >         > After the upgrade we've noticed a very odd
>                 networking
>                 >         behaviour: it
>                 >         > sends/receives at full speed for a while,
>                 then there are a
>                 >         couple of
>                 >         > minutes
>                 >         > of complete silence where even terminal
>                 commands like an
>                 >         "ls" just
>                 >         > wait
>                 >         > until they are executed, and then it starts
>                 sending at full
>                 >         speed again.
>                 >         > I've linked to a screenshot showing this
>                 send and pause
>                 >         behaviour. The
>                 >         > blue line is the total, green is SMB and
>                 turquoise is NFS
>                 >         over jumbo
>                 >         > frames. It behaves this way regardless of
>                 the protocol.
>                 >         >
>                 >         > http://oi62.tinypic.com/33xvjb6.jpg
>                 >         >
>                 >         > The problem is that these pauses can
>                 sometimes be so long
>                 >         that
>                 >         > connections drop. Like someone is copying
>                 files over SMB or
>                 >         iSCSI and
>                 >         > suddenly they get an error message saying
>                 that the transfer
>                 >         failed and
>                 >         > they have to start over with the file(s).
>                 That's horrible!
>                 >         >
>                 >         > So far NFS has proven to be the most
>                 resilient: its stupid
>                 >         simple
>                 >         > nature just waits and resumes the transfer
>                 when the pause is over.
>                 >         Kudos for
>                 >         > that.
>                 >         >
>                 >         > The server is driven by a Supermicro
>                 X9SRL-F, a Xeon 1620v2
>                 >         and 64GB
>                 >         > ECC
>                 >         > RAM. The hardware has been ruled out, we
>                 happened to have an
>                 >         identical
>                 >         > MB
>                 >         > and CPU lying around and that didn't
>                 improve things. We have
>                 >         also
>                 >         > installed an Intel PRO 100/1000 Quad-port
>                 ethernet adapter to
>                 >         test if
>                 >         > that would change things, but it hasn't,
>                 it still behaves
>                 >         this way.
>                 >         >
>                 >         > The two built-in NICs are Intel 82574L
>                 and the Quad-port
>                 >         NICs are
>                 >         > Intel 82571EB, so both em(4) driven. I
>                 happen to know that
>                 >         the em
>                 >         > driver
>                 >         > was updated between 9.3 and 10.1. Perhaps
>                 that is to blame,
>                 >         but I have
>                 >         > no idea.
>                 >         >
>                 >         > Is there anyone that can make sense of
>                 this?
>                 >         >
>                 >         > [1]:
>                 >         >
>                 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197164
>                 >         >
>                 >         > [2]:
>                 >         >
>                 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191348
>                 >         >
>                 >         > /K
>                 >         >
>                 >         >
>                 >
>                 >
>                 >         Another observation I've made is that during
>                 these pauses, the
>                 >         entire
>                 >         system is put on hold, even ZFS scrub stops
>                 and then resumes
>                 >         after a
>                 >         while. Looking in top, the system is
>                 completely idle.
>                 >
>                 >         Normally during scrub, the kernel eats
>                 20-30% CPU, but during
>                 >         a pause,
>                 >         even the [kernel] goes down to 0.00%. Makes
>                 me think the
>                 >         networking has
>                 >         nothing to do with it.
>                 >
>                 >         What's then to blame? ZFS?
>                 >
>                 >         /K
>                 >
>                  _______________________________________________
>                 >         freebsd-fs at freebsd.org mailing list
>                 >
>                  http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>                 >         To unsubscribe, send any mail to
>                 >         "freebsd-fs-unsubscribe at freebsd.org"
>                 >
>                 >
>                 > Hello,
>                 >
>                 >
>                 > Does this happen when clients are only reading from
>                 the server?
>                 
>                 Yes it happens when clients are only reading from the
>                 server.
>                 
>                 > Otherwise I would suspect that it could be caused by
>                 ZFS writing out a
>                 > large chunk of data sitting in its caches, and
>                 until that is complete
>                 > I/O is stalled.
>                 
>                 That's what's so strange: we have three more systems set
>                 up at about the same
>                 size and none of the others are acting this way.
>                 
>                 The only difference I can think of that we
>                 haven't ruled
>                 out yet is ctld; the other systems are still running
>                 istgt as their
>                 iSCSI daemon.
>                 
>                 /K
>                 
> What does a zpool status say? Could very well be disks starting to
> fail.
> 
> 
> Anything in dmesg concerning cam timeouts?
> 
> 
> Best regards
> 
> Andreas
> 

Pool status is fine, scrubbed multiple times without errors. No storage
related errors, we're using LSI HBAs, so no cam, but nothing mps-related
either.

/K
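
For anyone wanting to repeat the dmesg check asked about above, here is a
minimal sketch in Python. It only filters kernel message text for lines that
name a storage device and also mention a timeout or error. The function name
storage_warnings and both patterns are illustrative assumptions, not anything
taken from this thread, and real driver messages vary by release, so treat
the patterns as a starting point rather than a complete list.

```python
import re

# Assumed patterns: device names such as mps0, da3, ada1, or the word "cam".
DEVICE = re.compile(r"\b(?:mps|da|ada)\d+\b|\bcam\b", re.IGNORECASE)
# Assumed keywords that usually accompany a storage problem in kernel logs.
PROBLEM = re.compile(r"timeout|timed out|error", re.IGNORECASE)


def storage_warnings(dmesg_text):
    """Return the lines that mention both a storage device and a problem keyword."""
    return [
        line
        for line in dmesg_text.splitlines()
        if DEVICE.search(line) and PROBLEM.search(line)
    ]
```

Passing it the saved output of dmesg returns the matching lines. An empty
result would be consistent with the "nothing mps-related" observation, but it
does not by itself prove that the disks or the HBA are healthy.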

