[Fwd: Strange networking behaviour in storage server]
Karli Sjöberg
karli.sjoberg at slu.se
Mon Jun 1 10:28:31 UTC 2015
On Mon, 2015-06-01 at 02:56 -0700, Mehmet Erol Sanliturk wrote:
>
>
> On Mon, Jun 1, 2015 at 2:02 AM, Karli Sjöberg <karli.sjoberg at slu.se>
> wrote:
> On Mon, 2015-06-01 at 10:33 +0200, Andreas Nilsson wrote:
> >
> >
> > On Mon, Jun 1, 2015 at 10:14 AM, Karli Sjöberg <karli.sjoberg at slu.se>
> > wrote:
> > -------- Forwarded message --------
> > > From: Karli Sjöberg <karli.sjoberg at slu.se>
> > > To: freebsd-fs at freebsd.org <freebsd-fs at freebsd.org>
> > > Subject: Strange networking behaviour in storage server
> > > Date: Mon, 1 Jun 2015 07:49:56 +0000
> > >
> > > Hey!
> > >
> > > So we have this ZFS storage server that was upgraded from
> > > 9.3-RELEASE to 10.1-STABLE to get around two problems: 1) not
> > > being able to use SSD drives as L2ARC[1] and 2) not being able
> > > to hot-swap SATA drives[2].
> > >
> > > After the upgrade we've noticed a very odd networking behaviour:
> > > it sends/receives at full speed for a while, then there are a
> > > couple of minutes of complete silence where even a terminal
> > > command like "ls" just waits before it is executed, and then it
> > > starts sending at full speed again. I've linked to a screenshot
> > > showing this send-and-pause behaviour. The blue line is the
> > > total, green is SMB and turquoise is NFS over jumbo frames. It
> > > behaves this way regardless of the protocol.
> > >
> > > http://oi62.tinypic.com/33xvjb6.jpg
> > >
> > > The problem is that these pauses can sometimes be so long that
> > > connections drop. For example, someone is copying files over SMB
> > > or iSCSI and suddenly gets an error message saying that the
> > > transfer failed, and has to start over with the file(s). That's
> > > horrible!
> > >
> > > So far NFS has proven to be the most resilient; its stupid-simple
> > > nature just waits and resumes the transfer when the pause is
> > > over. Kudos for that.
> > >
> > > The server is driven by a Supermicro X9SRL-F, a Xeon 1620v2 and
> > > 64GB ECC RAM. The hardware has been ruled out: we happened to
> > > have an identical MB and CPU lying around, and swapping them in
> > > didn't improve things. We have also installed an Intel PRO
> > > 100/1000 quad-port Ethernet adapter to test if that would change
> > > things, but it hasn't; it still behaves this way.
> > >
> > > The two built-in NICs are Intel 82574L and the quad-port NICs
> > > are Intel 82571EB, so both are em(4) driven. I happen to know
> > > that the em driver was updated between 9.3 and 10.1. Perhaps
> > > that is to blame, but I have no idea.
> > >
> > > Is there anyone that can make sense of this?
> > >
> > > [1]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197164
> > >
> > > [2]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191348
> > >
> > > /K
> > >
> > >
> >
> >
> > Another observation I've made is that during these pauses, the
> > entire system is put on hold; even a ZFS scrub stops and then
> > resumes after a while. Looking in top, the system is completely
> > idle.
> >
> > Normally during a scrub, the kernel eats 20-30% CPU, but during a
> > pause even the [kernel] goes down to 0.00%. Makes me think the
> > networking has nothing to do with it.
> >
> > What's to blame, then? ZFS?
> >
> > /K
> > _______________________________________________
> > freebsd-fs at freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> > To unsubscribe, send any mail to
> > "freebsd-fs-unsubscribe at freebsd.org"
> >
> >
> > Hello,
> >
> >
> > does this happen when clients are only reading from the server?
>
> Yes, it happens when clients are only reading from the server.
>
> > Otherwise I would suspect that it could be caused by ZFS writing
> > out a large chunk of data sitting in its caches, and until that is
> > complete I/O is stalled.
>
> That's what's so strange: we have three more systems set up at about
> the same size and none of the others are acting this way.
>
> The only thing I can think of that differs, and that we haven't yet
> ruled out, is ctld; the other systems are still running istgt as
> their iSCSI daemon.
>
> /K
>
>
>
>
> If there are three other similar systems and they are installed with
> exactly the same structure, the first possibility I would consider
> is a slowly progressing hardware failure:
>
>
> A circuit does not respond within the expected time, but it does
> respond after a delay that is not normal. Such behaviour may be
> caused by a badly soldered or cracked connection point in the
> circuit: when it is hot, it disconnects; when it is cold, it
> connects.
As initially stated, both motherboard and processor have been replaced
with identical hardware that went through a day of memtest before being
installed. Then there's an external Supermicro JBOD[*], but I haven't
seen any disk timeouts or SES errors logged. With a delay as long as
five minutes, there should have been timeouts at least at the driver
level.
/K
[*]: http://www.supermicro.nl/products/chassis/3U/837/SC837E26-RJBOD1.cfm
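
To pin down exactly when and for how long the pauses occur, a small
probe could be left running on the server. What follows is only a
minimal sketch in Python and not something that was run in this thread:
the function names sample() and find_stalls(), the one-second interval
and the five-second threshold are all assumptions made for
illustration, and it presumes the monotonic clock keeps advancing while
userland is frozen.

```python
import time


def sample(duration, interval=1.0):
    """Record a monotonic timestamp roughly every `interval` seconds.

    Hypothetical helper: if the whole system stalls, the sleep returns
    late and the next timestamp lands far after the previous one.
    """
    start = time.monotonic()
    stamps = [start]
    while stamps[-1] - start < duration:
        time.sleep(interval)
        stamps.append(time.monotonic())
    return stamps


def find_stalls(stamps, interval=1.0, threshold=5.0):
    """Return (start, gap) pairs for samples that arrived too late.

    A gap counts as a stall when it exceeds the expected interval by
    more than `threshold` seconds (both values are assumed defaults).
    """
    stalls = []
    for prev, cur in zip(stamps, stamps[1:]):
        gap = cur - prev
        if gap > interval + threshold:
            stalls.append((prev, gap))
    return stalls
```

Something like find_stalls(sample(3600)) would then list every gap seen
during an hour. The timestamps are monotonic-clock readings, so they
would need to be paired with wall-clock time before comparing them
against logs or the traffic graph.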
>
>
>
>
>
> Thank you very much.
>
>
>
> Mehmet Erol Sanliturk
>
>
>
>
>
> >
> >
> > Have you tried what is suggested in
> > https://wiki.freebsd.org/ZFSTuningGuide ? In particular, setting
> > vfs.zfs.write_limit_override to something appropriate for your site.
> > The timeout seems to be defaulting to 5 now.
> >
> >
> > Best regards
> >
> > Andreas
> >
> >
> >
>
>
>