CURRENT slow and shaky network stability
O. Hartmann
ohartman at zedat.fu-berlin.de
Wed Apr 13 08:12:45 UTC 2016
On Sun, 10 Apr 2016 07:16:56 -0700
Cy Schubert <Cy.Schubert at komquats.com> wrote:
> In message <20160409105444.7020f2f1.ohartman at zedat.fu-berlin.de>,
> "O. Hartmann" writes:
> >
> > On Mon, 04 Apr 2016 23:46:08 -0700
> > Cy Schubert <Cy.Schubert at komquats.com> wrote:
> >
> > > In message
> > > <20160405082047.670d7241 at freyja.zeit4.iv.bundesimmobilien.de>,
> > > "O. Hartmann" writes:
> > > > On Sat, 02 Apr 2016 16:14:57 -0700
> > > > Cy Schubert <Cy.Schubert at komquats.com> wrote:
> > > >
> > > > > In message <20160402231955.41b05526.ohartman at zedat.fu-berlin.de>,
> > > > > "O. Hartmann" writes:
> > > > > > On Sat, 2 Apr 2016 11:39:10 +0200
> > > > > > "O. Hartmann" <ohartman at zedat.fu-berlin.de> wrote:
> > > > > >
> > > > > > > On Sat, 2 Apr 2016 10:55:03 +0200
> > > > > > > "O. Hartmann" <ohartman at zedat.fu-berlin.de> wrote:
> > > > > > >
> > > > > > > > On Sat, 02 Apr 2016 01:07:55 -0700
> > > > > > > > Cy Schubert <Cy.Schubert at komquats.com> wrote:
> > > > > > > >
> > > > > > > > > In message <56F6C6B0.6010103 at protected-networks.net>,
> > > > > > > > > Michael Butler writes:
> > > > > > > > > > -current is not great for interactive use at all. The
> > > > > > > > > > strategy of pre-emptively dropping idle processes to
> > > > > > > > > > swap is hurting .. big time.
> > > > > > > > >
> > > > > > > > > FreeBSD doesn't "preemptively" or arbitrarily push pages
> > > > > > > > > out to disk. LRU doesn't do this.
> > > > > > > > >
> > > > > > > > > > Compare inactive memory to swap in this example ..
> > > > > > > > > >
> > > > > > > > > > 110 processes: 1 running, 108 sleeping, 1 zombie
> > > > > > > > > > CPU:  1.2% user,  0.0% nice,  4.3% system,  0.0% interrupt, 94.5% idle
> > > > > > > > > > Mem: 474M Active, 1609M Inact, 764M Wired, 281M Buf, 119M Free
> > > > > > > > > > Swap: 4096M Total, 917M Used, 3178M Free, 22% Inuse
> > > > > > > > >
> > > > > > > > > To analyze this you need to capture vmstat output. You'll
> > > > > > > > > see the free pool dip below a threshold and pages go out
> > > > > > > > > to disk in response. If you have daemons with small
> > > > > > > > > working sets, pages that are not part of the working sets
> > > > > > > > > for daemons or applications will eventually be paged out.
> > > > > > > > > This is not a bad thing. In your example above, the 281 MB
> > > > > > > > > of UFS buffers are more active than the 917 MB paged out.
> > > > > > > > > If it's paged out and never used again, then it doesn't
> > > > > > > > > hurt. However the 281 MB of buffers saves you I/O. The
> > > > > > > > > inactive pages are part of your free pool that were active
> > > > > > > > > at one time but now are not. They may be reclaimed, and if
> > > > > > > > > they are, you've just saved more I/O.
> > > > > > > > >
> > > > > > > > > Top is a poor tool to analyze memory use. Vmstat is the
> > > > > > > > > better tool to help understand memory use. Inactive memory
> > > > > > > > > isn't a bad thing per se. Monitor page outs, scan rate and
> > > > > > > > > page reclaims.
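> > > > > > > > >
> > > > > > > > > For example, one minimal way to capture this over time - a
> > > > > > > > > sketch, with the interval, log path and grep patterns
> > > > > > > > > chosen arbitrarily - would be:
> > > > > > > > >
> > > > > > > > >   # sample VM statistics every 5 seconds; nohup keeps the
> > > > > > > > >   # log going even if the ssh session drops
> > > > > > > > >   nohup vmstat 5 > /var/tmp/vmstat.log 2>&1 &
> > > > > > > > >
> > > > > > > > >   # cumulative counters for pageouts and page-daemon scans
> > > > > > > > >   vmstat -s | egrep 'paged out|examined|freed'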
> > > > > > > >
> > > > > > > > I give up! I tried to check via ssh/vmstat what is going on.
> > > > > > > > Last lines before the broken pipe:
> > > > > > > >
> > > > > > > > [...]
> > > > > > > > procs      memory        page                     disks     faults          cpu
> > > > > > > >  r  b  w   avm    fre    flt  re  pi  po     fr    sr ad0 ad1  in     sy     cs us sy id
> > > > > > > > 22  0 22  5.8G   1.0G  46319   0   0   0  55721  1297   0   4 219  23907   5400 95  5  0
> > > > > > > > 22  0 22  5.4G   1.3G  51733   0   0   0  72436  1162   0   0 108  40869   3459 93  7  0
> > > > > > > > 15  0 22   12G   1.2G  54400   0  27   0  52188  1160   0  42 148  52192   4366 91  9  0
> > > > > > > > 14  0 22   12G   1.0G  44954   0  37   0  37550  1179   0  39 141  86209   4368 88 12  0
> > > > > > > > 26  0 22   12G   1.1G  60258   0  81   0  69459  1119   0  27 123 779569 704359 87 13  0
> > > > > > > > 29  3 22   13G   774M  50576   0  68   0  32204  1304   0   2 102 507337 484861 93  7  0
> > > > > > > > 27  0 22   13G   937M  47477   0  48   0  59458  1264   3   2 112  68131  44407 95  5  0
> > > > > > > > 36  0 22   13G   829M  83164   0   2   0  82575  1225   1   0 126  99366  38060 89 11  0
> > > > > > > > 35  0 22  6.2G   1.1G  98803   0  13   0 121375  1217   2   8 112  99371   4999 85 15  0
> > > > > > > > 34  0 22   13G   723M  54436   0  20   0  36952  1276   0  17 153  29142   4431 95  5  0
> > > > > > > > Fssh_packet_write_wait: Connection to 192.168.0.1 port 22: Broken pipe
> > > > > > > >
> > > > > > > > This makes this crap system completely unusable. The server
> > > > > > > > (FreeBSD 11.0-CURRENT #20 r297503: Sat Apr 2 09:02:41 CEST
> > > > > > > > 2016 amd64) in question did a poudriere bulk job. I can not
> > > > > > > > even determine which terminal goes down first - another one,
> > > > > > > > idle for much longer than the one showing the "vmstat 5"
> > > > > > > > output, is still alive!
> > > > > > > >
> > > > > > > > I consider this a serious bug, and what has happened since
> > > > > > > > this "fancy" update is no benefit. :-(
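> > > > > > > >
> > > > > > > > (As a stopgap I will also try ssh keepalives on the client -
> > > > > > > > only a guess on my part that they keep sessions from being
> > > > > > > > torn down during stalls - e.g. in ~/.ssh/config:
> > > > > > > >
> > > > > > > >   Host *
> > > > > > > >       # probe the server every 15 seconds, give up only
> > > > > > > >       # after 8 missed replies
> > > > > > > >       ServerAliveInterval 15
> > > > > > > >       ServerAliveCountMax 8
> > > > > > > >
> > > > > > > > No idea yet whether it helps.)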
> > > > > > >
> > > > > > > By the way - it might be of interest and some hint.
> > > > > > >
> > > > > > > One of my boxes is acting as server and gateway. It utilises
> > > > > > > NAT and IPFW; when it is under high load, as it was today,
> > > > > > > passing the network flow from the ISP to the clients in the
> > > > > > > network is sometimes extremely slow. I do not consider this
> > > > > > > the reason for the collapsing ssh sessions, since the incident
> > > > > > > also happens under no load, but in the overall view of the
> > > > > > > problem it could be a hint - I hope.
> > > > > >
> > > > > > I just checked on one box that "broke pipe" very quickly after
> > > > > > I started poudriere, while it had done well for a couple of
> > > > > > hours before the pipe broke. It seems to be load dependent when
> > > > > > the ssh session gets wrecked. More importantly, after the
> > > > > > long-haul poudriere run I rebooted the box and tried again, with
> > > > > > the mentioned broken pipe occurring a couple of minutes after
> > > > > > poudriere started. Then I left the box alone for several hours,
> > > > > > logged in again and checked the swap. Although there had been no
> > > > > > load or other pressure for hours, 31% of swap was still in use
> > > > > > (the box has 16 GB of RAM and is propelled by a XEON E3-1245
> > > > > > V2).
> > > > >
> > > > > 31%! Is it *actively* paging, or is the 31% previously paged out
> > > > > and no paging is *currently* being experienced? 31% of how much
> > > > > swap space in total?
> > > > >
> > > > > Also, what does ps aumx or ps aumxww say? Pipe it to head -40 or
> > > > > similar.
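> > > > >
> > > > > Concretely, something like this - the head count is arbitrary:
> > > > >
> > > > >   # swap devices, totals and current usage
> > > > >   swapinfo -h
> > > > >
> > > > >   # wide process listing; the top of it is enough
> > > > >   ps aumxww | head -40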
> > > >
> > > > On FreeBSD 11.0-CURRENT #4 r297573: Tue Apr 5 07:01:19 CEST 2016
> > > > amd64, local network, no NAT. A stuck ssh session in the middle of
> > > > administering, after leaving the console/ssh session alone for a
> > > > couple of minutes:
> > > >
> > > > root      2064  0.0  0.1  91416  8492  -  Is  07:18  0:00.03 sshd: hartmann [priv] (sshd)
> > > > hartmann  2108  0.0  0.1  91416  8664  -  I   07:18  0:07.33 sshd: hartmann at pts/0 (sshd)
> > > > root     72961  0.0  0.1  91416  8496  -  Is  08:11  0:00.03 sshd: hartmann [priv] (sshd)
> > > > hartmann 72970  0.0  0.1  91416  8564  -  S   08:11  0:00.02 sshd: hartmann at pts/1 (sshd)
> > > >
> > > > The situation is worse, and I consider this a serious bug.
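> > > >
> > > > Next time I will also check whether the TCP connection itself is
> > > > still established when the session hangs - a guess at useful
> > > > commands:
> > > >
> > > >   # connected IPv4 sockets on the ssh port
> > > >   sockstat -4c -p 22
> > > >
> > > >   # TCP state as seen by the stack
> > > >   netstat -an -p tcp | grep '\.22 '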
> > >
> > > There's not a lot to go on here. Do you have physical access to the
> > > machine to pop into DDB and take a look? You did say you're using a
> > > lot of swap, IIRC 30%. You didn't answer how much 30% was of.
> > > Without more data I can't help you. At best I can take wild guesses,
> > > but that won't help you. Try to answer the questions I asked last
> > > week and we can go further. Until then all we can do is wildly
> > > guess.
> > >
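> > > If you do have console access and the kernel has KDB/DDB compiled in
> > > (GENERIC on CURRENT does), one way to drop into the debugger is:
> > >
> > >   # from the physical console only - this halts the system in DDB
> > >   sysctl debug.kdb.enter=1
> > >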
> >
> > Apologies for the late answer, I'm busy.
>
> That happens.
>
> >
> > Well, The "homebox" is physical accessible as well as the systems at work, =
> > but at work
> > they are heavily used right now.
> >
> > As you stated in your Email prior to this one, I "overload" the boxes.
> > Yes, I do this by intention, and FreeBSD CURRENT withstood those
> > attacks - approximately until 3 or 4 weeks ago, when these problems
> > occurred.
> >
> > The 30% swap was the remainder after I started poudriere; poudriere
> > "died" due to a lost/broken-pipe ssh session, and the swap did not
> > relax after hours! The box didn't do anything in the time after the
> > pipe broke. That is why I mentioned it.
> >
> > You also mentioned UFS and ZFS concurrency. Yes, I use a mixed system:
> > UFS for the system's partitions and ZFS for the data volumes. UFS is
> > "faster" on SSDs, but this is only a subjective impression of mine.
> > Having /usr/ports on both UFS and ZFS with enough memory (32 GB RAM)
> > shows significant differences on the very same HDD drive: while UFS
> > has already finished updating a "matured" svn tree, the ZFS-based tree
> > can take up to 5 or 6 minutes until finished. I think this is due to
> > the growing .svn folder. But on ZFS this occurs only the first time
> > the update of /usr/ports is done.
> >
> > Just to say: if UFS and ZFS coexistence is critical, this is
> > definitely a must for the handbook!
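> >
> > (On the mixed UFS/ZFS boxes I am now experimenting with capping the
> > ZFS ARC so it does not compete with the UFS buffer cache - purely an
> > assumption on my part that this is related, and the 8G value is just a
> > guess for a 32 GB machine. In /boot/loader.conf:
> >
> >   # cap the ZFS ARC at 8 GB; takes effect at the next boot
> >   vfs.zfs.arc_max="8G"
> >
> > No idea yet whether it helps.)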
>
> I don't think so. Otherwise we should also write that running too many
> applications will cause paging. It's like saying, when running large Oracle
> databases don't make the SGA larger than physical memory. It's common sense.
>
> >
> > But on the other hand, what I complain about is a dramatic change in
> > the stability of CURRENT since the reported problems first occurred.
> > Before that, the very same hardware, the very same setup and the very
> > same jobs performed well. I pushed the boxes to their limits with
> > poudriere and several scientific jobs, and they took it like a German
> > tank.
> >
> > By the way, I use csh in all scenarios - I do not know whether this
> > helps.
>
> I think I read somewhere that csh had an issue where it died under
> certain circumstances. For the life of me I can't find the email any
> more; it was a commit log email. Try /bin/sh as a test.
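>
> For example - taking the username from your ps output above, and with
> <yourbox> standing in for the host:
>
>   # one-off interactive test session under sh
>   ssh -t <yourbox> /bin/sh
>
>   # or switch the login shell persistently
>   chsh -s /bin/sh hartmann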
>
>
By the way, I tried /bin/sh. The same issue!