CURRENT slow and shaky network stability
Cy Schubert
Cy.Schubert at komquats.com
Tue Apr 5 06:46:20 UTC 2016
In message <20160405082047.670d7241 at freyja.zeit4.iv.bundesimmobilien.de>,
"O. Hartmann" writes:
> On Sat, 02 Apr 2016 16:14:57 -0700
> Cy Schubert <Cy.Schubert at komquats.com> wrote:
>
> > In message <20160402231955.41b05526.ohartman at zedat.fu-berlin.de>,
> > "O. Hartmann" writes:
> > >
> > > On Sat, 2 Apr 2016 11:39:10 +0200
> > > "O. Hartmann" <ohartman at zedat.fu-berlin.de> wrote:
> > >
> > > > On Sat, 2 Apr 2016 10:55:03 +0200
> > > > "O. Hartmann" <ohartman at zedat.fu-berlin.de> wrote:
> > > >
> > > > > On Sat, 02 Apr 2016 01:07:55 -0700
> > > > > Cy Schubert <Cy.Schubert at komquats.com> wrote:
> > > > >
> > > > > > In message <56F6C6B0.6010103 at protected-networks.net>,
> > > > > > Michael Butler writes:
> > > > > > > -current is not great for interactive use at all. The strategy of
> > > > > > > pre-emptively dropping idle processes to swap is hurting .. big
> > > > > > > time.
> > > > > >
> > > > > > FreeBSD doesn't "preemptively" or arbitrarily push pages out to
> > > > > > disk. LRU doesn't do this.
> > > > > >
> > > > > > >
> > > > > > > Compare inactive memory to swap in this example ..
> > > > > > >
> > > > > > > 110 processes: 1 running, 108 sleeping, 1 zombie
> > > > > > > CPU: 1.2% user, 0.0% nice, 4.3% system, 0.0% interrupt, 94.5% idle
> > > > > > > Mem: 474M Active, 1609M Inact, 764M Wired, 281M Buf, 119M Free
> > > > > > > Swap: 4096M Total, 917M Used, 3178M Free, 22% Inuse
> > > > > >
> > > > > > To analyze this you need to capture vmstat output. You'll see the
> > > > > > free pool dip below a threshold and pages go out to disk in
> > > > > > response. If you have daemons with small working sets, pages that
> > > > > > are not part of the working sets for daemons or applications will
> > > > > > eventually be paged out. This is not a bad thing. In your example
> > > > > > above, the 281 MB of UFS buffers are more active than the 917 MB
> > > > > > paged out. If it's paged out and never used again, then it doesn't
> > > > > > hurt. However the 281 MB of buffers saves you I/O. The inactive
> > > > > > pages are part of your free pool that were active at one time but
> > > > > > now are not. They may be reclaimed and if they are, you've just
> > > > > > saved more I/O.
> > > > > >
> > > > > > Top is a poor tool to analyze memory use. Vmstat is the better tool
> > > > > > to help understand memory use. Inactive memory isn't a bad thing
> > > > > > per se. Monitor page outs, scan rate and page reclaims.
> > > > > >
> > > > >
> > > > > I give up! Tried to check via ssh/vmstat what is going on. Last lines
> > > > > before broken pipe:
> > > > >
> > > > > [...]
> > > > > procs   memory    page                       disks   faults            cpu
> > > > >  r b  w  avm  fre   flt re pi po     fr   sr ad0 ad1  in     sy     cs us sy id
> > > > > 22 0 22 5.8G 1.0G 46319  0  0  0  55721 1297   0   4 219  23907   5400 95  5  0
> > > > > 22 0 22 5.4G 1.3G 51733  0  0  0  72436 1162   0   0 108  40869   3459 93  7  0
> > > > > 15 0 22  12G 1.2G 54400  0 27  0  52188 1160   0  42 148  52192   4366 91  9  0
> > > > > 14 0 22  12G 1.0G 44954  0 37  0  37550 1179   0  39 141  86209   4368 88 12  0
> > > > > 26 0 22  12G 1.1G 60258  0 81  0  69459 1119   0  27 123 779569 704359 87 13  0
> > > > > 29 3 22  13G 774M 50576  0 68  0  32204 1304   0   2 102 507337 484861 93  7  0
> > > > > 27 0 22  13G 937M 47477  0 48  0  59458 1264   3   2 112  68131  44407 95  5  0
> > > > > 36 0 22  13G 829M 83164  0  2  0  82575 1225   1   0 126  99366  38060 89 11  0
> > > > > 35 0 22 6.2G 1.1G 98803  0 13  0 121375 1217   2   8 112  99371   4999 85 15  0
> > > > > 34 0 22  13G 723M 54436  0 20  0  36952 1276   0  17 153  29142   4431 95  5  0
> > > > > Fssh_packet_write_wait: Connection to 192.168.0.1 port 22: Broken pipe
> > > > >
> > > > > This makes this crap system completely unusable. The server (FreeBSD
> > > > > 11.0-CURRENT #20 r297503: Sat Apr 2 09:02:41 CEST 2016 amd64) in
> > > > > question was running a poudriere bulk job. I cannot even determine
> > > > > which terminal goes down first - another one, idle for much longer
> > > > > than the one showing the "vmstat 5" output, is still alive!
> > > > >
> > > > > I consider this a serious bug, and what has happened since this
> > > > > "fancy" update is no benefit. :-(
> > > >
> > > > By the way - this might be of interest and give some hint.
> > > >
> > > > One of my boxes is acting as server and gateway. It utilises NAT and
> > > > IPFW. When it is under high load, as it was today, passing the network
> > > > flow from the ISP into the network for clients is sometimes extremely
> > > > slow. I do not consider this the reason for the collapsing ssh
> > > > sessions, since this incident also happens under no load, but in the
> > > > overall view of the problem, this could be a hint - I hope.
> > >
> > > I just checked on one box, which "broke pipe" very quickly after I started
> > > poudriere, while it did well for a couple of hours before until the pipe
> > > broke. It seems to be load dependent when the ssh session gets wrecked,
> > > but more important, after the long-haul poudriere run, I rebooted the box
> > > and tried again, and got the mentioned broken pipe a couple of minutes
> > > after poudriere ran. Then I left the box for several hours and logged in
> > > again and checked the swap. Although there was no load or other pressure
> > > for hours, 31% of swap was still in use (box has 16 GB of RAM and is
> > > propelled by a XEON E3-1245 V2).
> > >
> >
> > 31%! Is it *actively* paging or is the 31% previously paged out and no
> > paging is *currently* being experienced? 31% of how much swap space in total?
> >
> > Also, what does ps aumx or ps aumxww say? Pipe it to head -40 or similar.
> >
> >
>
> On FreeBSD 11.0-CURRENT #4 r297573: Tue Apr 5 07:01:19 CEST 2016 amd64, local
> network, no NAT. Stuck ssh session in the middle of administering and leaving
> the console/ssh session for a couple of minutes:
>
> root     2064  0.0 0.1 91416 8492 - Is 07:18 0:00.03 sshd: hartmann [priv] (sshd)
> hartmann 2108  0.0 0.1 91416 8664 - I  07:18 0:07.33 sshd: hartmann@pts/0 (sshd)
> root     72961 0.0 0.1 91416 8496 - Is 08:11 0:00.03 sshd: hartmann [priv] (sshd)
> hartmann 72970 0.0 0.1 91416 8564 - S  08:11 0:00.02 sshd: hartmann@pts/1 (sshd)
>
> The situation is worse and I consider this a serious bug.
>
There's not a lot to go on here. Do you have physical access to the machine
to pop into DDB and take a look? You did say you're using a lot of swap,
IIRC 30%. You didn't answer what that 30% is a percentage of. Without more
data I can't help you; at best I can make wild guesses, and those won't help
you. Try to answer the questions I asked last week and we can go further.
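
To make the request concrete, here is a rough sketch of how saved "vmstat 5"
lines could be boiled down to the figures worth watching (page-ins, page-outs,
scan rate), plus the arithmetic for turning a swap percentage into megabytes.
It is only an illustration: the function names are made up, the column order is
assumed to match the header quoted earlier in this thread, and the paging
counters are assumed to be plain integers.

```python
# Sketch only. Column order is assumed to match the vmstat header quoted
# earlier in this thread; adjust HEADER if your output differs.
HEADER = "r b w avm fre flt re pi po fr sr ad0 ad1 in sy cs us sy id".split()


def parse_row(line, header=HEADER):
    """Map one vmstat data line onto the header names.

    'sy' appears twice in the header (faults and cpu); the first one wins.
    """
    fields = line.split()
    if len(fields) != len(header):
        raise ValueError(f"expected {len(header)} fields, got {len(fields)}")
    row = {}
    for name, value in zip(header, fields):
        row.setdefault(name, value)
    return row


def paging_activity(lines):
    """Summarise page-ins, page-outs and the peak scan rate over some samples."""
    rows = [parse_row(line) for line in lines]
    return {
        "page_ins": sum(int(r["pi"]) for r in rows),
        "page_outs": sum(int(r["po"]) for r in rows),
        "max_scan_rate": max(int(r["sr"]) for r in rows),
    }


def swap_used_mb(total_mb, percent_used):
    """Convert 'N% of swap in use' into megabytes, given the total swap size."""
    return total_mb * percent_used / 100.0


# Three of the sample lines posted earlier in this thread.
SAMPLE = [
    "22 0 22 5.8G 1.0G 46319 0 0 0 55721 1297 0 4 219 23907 5400 95 5 0",
    "15 0 22 12G 1.2G 54400 0 27 0 52188 1160 0 42 148 52192 4366 91 9 0",
    "29 3 22 13G 774M 50576 0 68 0 32204 1304 0 2 102 507337 484861 93 7 0",
]

if __name__ == "__main__":
    print(paging_activity(SAMPLE))
    # top's percentage is rounded, so this only approximates the 917M it showed.
    print(swap_used_mb(4096, 22))
```

For those three sample lines the summary comes to 95 page-ins, 0 page-outs and
a peak scan rate of 1304. The swap helper shows why the total matters: a
percentage on its own says nothing about how many megabytes are paged out.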
--
Cheers,
Cy Schubert <Cy.Schubert at komquats.com> or <Cy.Schubert at cschubert.com>
FreeBSD UNIX: <cy at FreeBSD.org> Web: http://www.FreeBSD.org
The need of the many outweighs the greed of the few.