[Fwd: Strange networking behaviour in storage server]

Karli Sjöberg karli.sjoberg at slu.se
Tue Jun 2 08:48:38 UTC 2015


Not sure why you're writing in English when it's just between us... Maybe
you forgot to reply to all?

On Tue, 2015-06-02 at 10:10 +0200, Andreas Nilsson wrote:
> No, mbufs should not affect a scrub.
> 
> 
> You can get some stats from vmstat -z

Hmm, what would I be looking for exactly?
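
One concrete thing to look for, offered as a sketch rather than a diagnosis: each row of `vmstat -z` output is a kernel memory zone, and a non-zero FAIL count on an mbuf zone means allocation requests from that zone were refused. The Python below picks such rows out of captured output. The sample text and its numbers are made up for illustration, and the column layout (SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP) is assumed to match what FreeBSD 10.x prints.

```python
COLUMNS = ["size", "limit", "used", "free", "req", "fail", "sleep"]


def parse_vmstat_z(text):
    """Parse `vmstat -z` style output into {zone_name: {column: int}}."""
    zones = {}
    for line in text.splitlines():
        # Data rows look like "mbuf: 256, 6517485, 1024, 1510, 99, 0, 0".
        name, sep, rest = line.rpartition(":")
        if not sep:
            continue  # header or blank line, no "name:" part
        fields = [f.strip() for f in rest.split(",")]
        # Accept 6 columns (no SLEEP) or 7 columns; skip anything else.
        if len(fields) not in (6, 7) or not all(f.isdigit() for f in fields):
            continue
        zones[name.strip()] = dict(zip(COLUMNS, map(int, fields)))
    return zones


def failing_zones(zones):
    """Return only the zones whose FAIL counter is non-zero."""
    return {name: z for name, z in zones.items() if z["fail"] > 0}


# Illustrative sample only; these numbers are NOT from the server in question.
SAMPLE = """\
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP

UMA Kegs:               384,      0,     210,      0,     210,   0,   0
mbuf_packet:            256, 6517485,  1024,   1510,  987654,   0,   0
mbuf_jumbo_9k:         9216, 603470,  603470,     0, 1234567, 4821,   0
"""

if __name__ == "__main__":
    for name, z in failing_zones(parse_vmstat_z(SAMPLE)).items():
        print(f"{name}: used={z['used']} limit={z['limit']} fail={z['fail']}")
```

On the server itself, the real output of `vmstat -z` would be fed to `parse_vmstat_z` in place of the sample. `netstat -m` reports the same mbuf counters in summary form, as requests "denied".
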

> 
> 
> Have you had systat running while I/O stalls?

Same as above.

> 
> 
> Also, zpool has a tunable for failmode, which defaults to wait, but
> since, as you say, scrub/zpool status indicates no errors, this is
> unlikely to be the cause.
> 
> 
> 
> Other than that I'm out of ideas :(

We have that in common:)

/K

> 
> 
> Best regards
> 
> Andreas
> 
> 
> On Mon, Jun 1, 2015 at 1:01 PM, Karli Sjöberg <karli.sjoberg at slu.se> wrote:
>         On Mon, 2015-06-01 at 12:53 +0200, Andreas Nilsson wrote:
>         > Interesting.
>         >
>         >
>         > Out of mbufs perhaps?
>         
>         Hmm, why would depleted mbufs stall even a scrub?
>         
>         How would I verify that?
>         
>         /K
>         
>         >
>         >
>         > /A
>         >
>         >
>         > On Mon, Jun 1, 2015 at 12:28 PM, Karli Sjöberg <karli.sjoberg at slu.se> wrote:
>         >         On Mon, 2015-06-01 at 02:56 -0700, Mehmet Erol Sanliturk wrote:
>         >         >
>         >         >
>         >         > On Mon, Jun 1, 2015 at 2:02 AM, Karli Sjöberg <karli.sjoberg at slu.se> wrote:
>         >         >         On Mon, 2015-06-01 at 10:33 +0200, Andreas Nilsson wrote:
>         >         >         >
>         >         >         >
>         >         >         > On Mon, Jun 1, 2015 at 10:14 AM, Karli Sjöberg <karli.sjoberg at slu.se> wrote:
>         >         >         >         -------- Forwarded message --------
>         >         >         >         > From: Karli Sjöberg <karli.sjoberg at slu.se>
>         >         >         >         > To: freebsd-fs at freebsd.org <freebsd-fs at freebsd.org>
>         >         >         >         > Subject: Strange networking behaviour in storage server
>         >         >         >         > Date: Mon, 1 Jun 2015 07:49:56 +0000
>         >         >         >         >
>         >         >         >         > Hey!
>         >         >         >         >
>         >         >         >         > So we have this ZFS storage server upgraded from 9.3-RELEASE to
>         >         >         >         > 10.1-STABLE to overcome 1) not being able to use SSD drives as
>         >         >         >         > L2ARC[1] and 2) not being able to hotswap SATA drives[2].
>         >         >         >         >
>         >         >         >         > After the upgrade we've noticed a very odd networking behaviour:
>         >         >         >         > it sends/receives at full speed for a while, then there are a
>         >         >         >         > couple of minutes of complete silence where even terminal commands
>         >         >         >         > like an "ls" just wait until they are executed, and then it starts
>         >         >         >         > sending at full speed again. I've linked to a screenshot showing
>         >         >         >         > this send and pause behaviour. The blue line is the total, green
>         >         >         >         > is SMB and turquoise is NFS over jumbo frames. It behaves this way
>         >         >         >         > regardless of the protocol.
>         >         >         >         >
>         >         >         >         > http://oi62.tinypic.com/33xvjb6.jpg
>         >         >         >         >
>         >         >         >         > The problem is that these pauses can sometimes be so long that
>         >         >         >         > connections drop. Like someone is copying files over SMB or iSCSI
>         >         >         >         > and suddenly they get an error message saying that the transfer
>         >         >         >         > failed and they have to start over with the file(s). That's
>         >         >         >         > horrible!
>         >         >         >         >
>         >         >         >         > So far NFS has proven to be the most resilient; its stupid simple
>         >         >         >         > nature just waits and resumes the transfer when the pause is over.
>         >         >         >         > Kudos for that.
>         >         >         >         >
>         >         >         >         > The server is driven by a Supermicro X9SRL-F, a Xeon 1620v2 and
>         >         >         >         > 64GB ECC RAM. The hardware has been ruled out; we happened to have
>         >         >         >         > an identical MB and CPU lying around and that didn't improve
>         >         >         >         > things. We have also installed an Intel PRO 100/1000 Quad-port
>         >         >         >         > ethernet adapter to test if that would change things, but it
>         >         >         >         > hasn't, it still behaves this way.
>         >         >         >         >
>         >         >         >         > The two built-in NICs are Intel 82574L and the Quad-port NICs are
>         >         >         >         > Intel 82571EB, so both em(4) driven. I happen to know that the em
>         >         >         >         > driver has been updated between 9.3 and 10.1. Perhaps that is to
>         >         >         >         > blame, but I have no idea.
>         >         >         >         >
>         >         >         >         > Is there anyone that can make sense of this?
>         >         >         >         >
>         >         >         >         > [1]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197164
>         >         >         >         >
>         >         >         >         > [2]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191348
>         >         >         >         >
>         >         >         >         > /K
>         >         >         >         >
>         >         >         >         >
>         >         >         >
>         >         >         >
>         >         >         >         Another observation I've made is that during these pauses, the
>         >         >         >         entire system is put on hold; even ZFS scrub stops and then
>         >         >         >         resumes after a while. Looking in top, the system is completely
>         >         >         >         idle.
>         >         >         >
>         >         >         >         Normally during scrub, the kernel eats 20-30% CPU, but during a
>         >         >         >         pause, even the [kernel] goes down to 0.00%. Makes me think the
>         >         >         >         networking has nothing to do with it.
>         >         >         >
>         >         >         >         What's then to blame? ZFS?
>         >         >         >
>         >         >         >         /K
>         >         >         >
>         >         >         >         _______________________________________________
>         >         >         >         freebsd-fs at freebsd.org mailing list
>         >         >         >         http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>         >         >         >         To unsubscribe, send any mail to
>         >         >         >         "freebsd-fs-unsubscribe at freebsd.org"
>         >         >         >
>         >         >         >
>         >         >         > Hello,
>         >         >         >
>         >         >         >
>         >         >         > does this happen when clients are only reading from server?
>         >         >
>         >         >         Yes, it happens when clients are only reading from the server.
>         >         >
>         >         >         > Otherwise I would suspect that it could be caused by ZFS writing
>         >         >         > out a large chunk of data sitting in its caches, and until that
>         >         >         > is complete I/O is stalled.
>         >         >
>         >         >         That's what's so strange: we have three more systems set up at
>         >         >         about the same size and none of the others are acting this way.
>         >         >
>         >         >         The only thing I can think of that differs, and that we haven't
>         >         >         ruled out yet, is ctld; the other systems are still running
>         >         >         istgt as their iSCSI daemon.
>         >         >
>         >         >         /K
>         >         >
>         >         >
>         >         >
>         >         >
>         >         > If there are three other similar systems and they are installed
>         >         > with exactly the same structure, the first possibility I would
>         >         > consider is a slowly progressing hardware failure:
>         >         >
>         >         >
>         >         > A circuit fails to respond within the expected time, but does
>         >         > respond after an abnormally long delay. Such behaviour may be
>         >         > caused by a faulty solder joint or a cracked trace in the
>         >         > circuit: when it is hot, it disconnects; when it is cold, it
>         >         > connects.
>         >
>         >
>         >         As initially stated, both motherboard and processor have been
>         >         replaced with identical hardware that went through a day of
>         >         memtest before being installed. Then there's an external
>         >         Supermicro JBOD[*] but I haven't seen any disk timeouts or SES
>         >         errors logged. At least at a driver level there should have been
>         >         timeouts with a delay as long as five minutes.
>         >
>         >         /K
>         >
>         >         [*]: http://www.supermicro.nl/products/chassis/3U/837/SC837E26-RJBOD1.cfm
>         >
>         >         >
>         >         >
>         >         >
>         >         >
>         >         >
>         >         > Thank you very much .
>         >         >
>         >         >
>         >         >
>         >         > Mehmet Erol Sanliturk
>         >         >
>         >         >
>         >         >
>         >         >
>         >         >
>         >         >         >
>         >         >         >
>         >         >         > Have you tried what is suggested in
>         >         >         > https://wiki.freebsd.org/ZFSTuningGuide ? In particular setting
>         >         >         > vfs.zfs.write_limit_override to something appropriate for your
>         >         >         > site. The timeout seems to be defaulting to 5 now.
>         >         >         >
>         >         >         >
>         >         >         > Best regards
>         >         >         >
>         >         >         > Andreas
>         >         >         >
>         >         >         >
>         >         >         >
>         >         >
>         >         >
>         >         >
>         >         >
>         >
>         >
>         >
>         >
>         
>         
> 
> 


