11.x deadlocking during pfault (was Re: FreeBSD 11.x grinds to a halt after about 48h of uptime)
Ulrich Spörlein
uqs at FreeBSD.org
Wed Oct 26 16:43:48 UTC 2016
On Mon, 2016-10-24 at 19:43:27 +0200, Ulrich Spörlein wrote:
> On Sat, 2016-10-15 at 09:36:27 -0700, Kevin Oberman wrote:
> > On Sat, Oct 15, 2016 at 9:26 AM, Hans Petter Selasky <hps at selasky.org>
> > wrote:
> >
> > > On 10/15/16 18:18, Ulrich Spörlein wrote:
> > >
> > >> Hey all, while 11.x is -STABLE now, this happens to my machine ever
> > >> since I upgraded it to 11-CURRENT years ago. I have no idea when this
> > >> started, actually, but what always happens is this:
> > >>
> > >> - System and X11 is up and running, I keep it running over night as I'm
> > >> too lazy to reboot and restart everthing.
> > >> - There's a bunch of xterms, Chrome, Clementine-Player and some other
> > >> programs running
> > >> - Coming back to the machine the next day (or the day after) it will
> > >> exit the screensaver just fine and then either I can use it for a couple
> > >> of seconds before it freezes, or it's pretty much dead already. The
> > >> mouse cursor still moves for a bit, but the also freezes (so it this a
> > >> GPU problem??)
> > >>
> > >> Now what I currently see on the screen is a clock widget stuck at 18:04
> > >> but conky itself has last updated at 18:00:18 ...
> > >>
> > >> This time I had some SSH sessions from another machine to see some more
> > >> useful things. There was nothing in various logs under /var/log (I also
> > >> can't run dmesg anymore ...)
> > >> I had top(1) running in a loop, this is the last output:
> > >>
> > >> last pid: 25633; load averages: 0.27, 0.39, 0.36 up 1+23:03:28
> > >> 18:00:12
> > >> 202 processes: 2 running, 188 sleeping, 11 zombie, 1 waiting
> > >>
> > >> Mem: 8873M Active, 1783M Inact, 5072M Wired, 567M Buf, 132M Free
> > >> ARC: 1844M Total, 469M MFU, 268M MRU, 16K Anon, 96M Header, 1012M Other
> > >> Swap: 4096M Total, 2395M Used, 1701M Free, 58% Inuse
> > >>
> > >>
> > >> PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU
> > >> COMMAND
> > >> 11 root 8 155 ki31 0K 128K CPU0 0 364.6H 772.95%
> > >> idle
> > >> 3122 uqs 15 28 0 7113M 5861M uwait 0
> > >> 94:44 13.96% chrome
> > >> 2887 uqs 28 22 0 1394M 237M
> > >> select 2 172:53 6.98% chrome
> > >> 2890 uqs 11 21 0
> > >> 1034M 178M select 5 231:21 1.95% chrome
> > >> 1062 root 9
> > >> 21 0 440M 47220K select 0 67:09 0.98% Xorg
> > >> 3002 uqs
> > >> 15 25 5 1159M 172M uwait 2 19:09 0.00% chrome
> > >> 3139 uqs 17 25 5 1163M 156M uwait 2 16:15 0.00%
> > >> chrome
> > >> 3001 uqs 18 25 5 1639M 575M uwait 0 16:05 0.00%
> > >> chrome
> > >> 12 root 24 -64 - 0K 384K WAIT -1 10:53 0.00%
> > >> intr
> > >> 3129 uqs 12 20 0 2820M 1746M uwait 6 8:36 0.00%
> > >> chrome
> > >> 2822 uqs 9 20 0 217M 81300K select 0 5:10 0.00%
> > >> conky
> > >> 3174 root 1 20 0 21532K 3188K select 0 4:20 0.00%
> > >> systat
> > >> 3130 uqs 16 20 0 1058M 131M uwait 4 3:03 0.00%
> > >> chrome
> > >> 2998 uqs 16 20 0 1110M 123M uwait 2 2:53 0.00%
> > >> chrome
> > >> 3165 uqs 10 20 0 1209M 215M uwait 6 2:52 0.00%
> > >> chrome
> > >> 3142 uqs 11 25 5 1344M 195M uwait 2 2:46 0.00%
> > >> chrome
> > >> 2876 uqs 19 20 0 580M 37164K select 3 2:42 0.00%
> > >> clementine-player
> > >> 20 root 2 -16 - 0K 32K psleep 6 2:25 0.00%
> > >> pagedaemon
> > >>
> > >> I also had systat -vm running and it continued to update its screen ...
> > >> for a short while, this is the last update before SSH died:
> > >>
> > >>
> > >> Mem usage: 0k%Phy 5%Kmem
> > >> Mem: KB REAL VIRTUAL VN PAGER SWAP
> > >> PAGER
> > >> Tot Share Tot Share Free in out in
> > >> out
> > >> Act 11051k 67868 71051992 255448 61840 count
> > >> All 11051k 67924 71058776 262100 pages
> > >> Proc:
> > >> Interrupts
> > >> r p d s w Csw Trp Sys Int Sof Flt ioflt 224
> > >> total
> > >> 25 730 11 724 109 404 101 13 cow 2
> > >> ehci0 16
> > >> zfod 3
> > >> ehci1 23
> > >> 0.0%Sys 0.1%Intr 0.0%User 0.0%Nice 99.9%Idle ozfod 16
> > >> cpu0:timer
> > >> | | | | | | | | | | %ozfod
> > >> xhci0 264
> > >> daefr 3 em0
> > >> 265
> > >> 50 dtbuf prcfr 94
> > >> hdac1 266
> > >> Namei Name-cache Dir-cache 349167 desvn totfr
> > >> ahci0 270
> > >> Calls hits % hits % 349155 numvn react 5
> > >> cpu1:timer
> > >> 121 121 100 253501 frevn pdwak 1
> > >> cpu2:timer
> > >> pdpgs 29
> > >> cpu7:timer
> > >> Disks md0 ada0 ada1 pass0 pass1 pass2 intrn 12
> > >> cpu3:timer
> > >> KB/t 0.00 0.00 0.00 0.00 0.00 0.00 5318892 wire 41
> > >> cpu6:timer
> > >> tps 0 0 0 0 0 0 9261404 act 12
> > >> cpu5:timer
> > >> MB/s 0.00 0.00 0.00 0.00 0.00 0.00 1598184 inact 6
> > >> cpu4:timer
> > >> %busy 0 0 0 0 0 0 cache
> > >> vgapci0
> > >> 61840 free
> > >> 712304 buf
> > >>
> > >>
> > >> Why do I have a Chrome tab using about 6G? What other sort of debugging
> > >> output can be helpful to get to the bottom of this? The machine still
> > >> responds to pings just fine, TCP connections get set up but the SSH
> > >> handshake never completes.
> > >>
> > >> This always happens between 30-50h and is super annoying and has been
> > >> going on for >1year. Help?
> > >>
> > >> Note, I cut the power to the monitor overnight to save electricity, can
> > >> this mess up something in the Radeon card or X server? What combinations
> > >> would be most useful to try next?
> > >>
> > >>
> > > Hi,
> > >
> > > Sounds like a memory leak. Can you track the memory use over time?
> > >
> > > Did you look at the output from:
> > >
> > > vmstat -m ?
> > >
> > > --HPS
> >
> >
> > I have noted significant memory leakage in chromium for some time. If I
> > leave it running overnight, my system is essentially frozen. If I terminate
> > the chromium process, it slowly comes back to life. I always keep a gkrellm
> > session on-screen where the memory and swap utilization is continuously
> > displayed and that clearly shows resources declining.
>
> That is not what is happening to my system though, it actually
> deadlocks. There's no way to recover from it, it seems.
>
> So I killed Chromium overnight each day, and I'm at this:
>
> % top -Sbores
> last pid: 44526; load averages: 0.10, 0.11, 0.56 up 7+09:53:30 19:33:25
> 156 processes: 2 running, 153 sleeping, 1 waiting
>
> Mem: 315M Active, 550M Inact, 5671M Wired, 515M Buf, 9324M Free
> ARC: 1852M Total, 541M MFU, 196M MRU, 16K Anon, 93M Header, 1022M Other
> Swap: 4096M Total, 2186M Used, 1910M Free, 53% Inuse
>
>
> PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
> 2755 uqs 10 20 0 1697M 311M select 1 47:23 0.00% conky
> 2736 uqs 32 20 0 699M 116M select 7 94:29 0.00% clementine-player
> 3000 uqs 12 20 0 1126M 69380K select 5 9:48 0.00% digikam
> 960 root 9 20 0 448M 59076K select 0 250:22 0.00% Xorg
> 72608 uqs 8 20 0 939M 55432K uwait 5 0:01 0.00% chrome
> 72599 uqs 9 52 0 929M 55116K uwait 0 0:00 0.00% chrome
> 2567 root 1 20 0 89948K 42964K select 1 1:51 0.00% bsnmpd
> 70476 uqs 1 20 0 93656K 25712K select 2 0:05 0.00% xterm
> 2730 uqs 5 20 0 208M 14988K select 1 0:22 0.00% clock-applet
> 880 root 1 20 0 22628K 12500K select 3 0:20 0.00% ntpd
> 2726 uqs 4 20 0 206M 12456K select 6 0:09 0.00% mateweather-applet
> 44352 uqs 1 20 0 75224K 12348K select 4 0:00 0.00% xterm
> 43049 uqs 1 20 0 75224K 11792K select 5 0:00 0.00% xterm
> 3074 uqs 2 20 0 308M 9692K select 1 0:02 0.00% kdeinit4
> 2671 uqs 1 20 0 144M 9488K select 1 0:13 0.00% openbox
> 3072 uqs 1 20 0 210M 8284K select 3 0:00 0.00% kdeinit4
> 2724 uqs 4 20 0 154M 8256K select 2 0:19 0.00% wnck-applet
> 2701 uqs 5 20 0 177M 8144K select 2 0:01 0.00% mate-panel
>
>
> 7d running, pretty good. But look closer, the system is doing pretty
> much nothing but did swap out 2G. What?
>
> > Try closing your chromium at night and see if that fixes the problem.
>
> It's better, but I'm not sure it's a real fix. I've now turned off
> "hardware acceleration" in Chromium, though chrome://gpu didn't real
> inspire confidence that it was actually using any h/w accel at all.
>
>
> > If you have never tried gkrellm (sysutils/gkrellm2), it is a the best
> > system monitor I have found. though pulls in a lot of dependencies. It also
> > can run as a server with remote systems displaying the data. Handy to
> > monitor servers.
>
> I had a cacti-setup that would also monitor my workstation (through a
> OpenVPN tunnel), but that has bit-rotted and Apache only gives me 500s
> on that cacti URL and nothing in the logs, oh well ...)
>
> Hooking up a serial console and testing whether DDB works is probably
> the next best step to take ...
Sigh, I forgot to shut down Chrome overnight, and my system is
deadlocked again. I can still switch virtual desktops in X11 and a
running xterm accepts keyboard input, but it is 18:35 now and I see that
my X11 clock has last updated at 17:18 and conky is stuck at 17:14:11
my top-in-a-loop is stuck here and no longer loops:
last pid: 73731; load averages: 0.23, 0.24, 0.23 up 9+07:34:15 17:14:10
160 processes: 3 running, 146 sleeping, 11 zombie
Mem: 9302M Active, 763M Inact, 5682M Wired, 752M Buf, 113M Free
ARC: 1731M Total, 549M MFU, 129M MRU, 16K Anon, 91M Header, 963M Other
Swap:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
960 root 9 21 0 451M 55796K CPU7 7 299:15 2.98% Xorg
53884 uqs 27 22 0 904M 250M CPU4 4 105:05 2.98% chrome
54081 uqs 15 21 0 3601M 2527M uwait 6 38:11 0.98% chrome
2736 uqs 33 20 0 697M 84620K select 5 95:38 0.00% clementine-player
2755 uqs 10 20 0 2111M 721M select 0 60:47 0.00% conky
78943 uqs 1 20 0 21532K 3328K select 7 13:12 0.00% systat
54048 uqs 15 25 5 1093M 159M uwait 4 9:51 0.00% chrome
3000 uqs 12 20 0 1126M 53248K select 7 9:50 0.00% digikam
53999 uqs 19 25 5 1525M 514M uwait 5 6:28 0.00% chrome
2703 uqs 1 20 0 169M 5948K select 0 5:40 0.00% mate-volume-control
2707 uqs 1 20 0 60240K 2516K select 7 5:21 0.00% autocutsel
54077 uqs 15 20 0 1803M 821M uwait 6 3:51 0.00% chrome
78869 uqs 1 20 0 93480K 6636K select 7 3:38 0.00% sshd
38396 uqs 1 20 0 75224K 3896K select 0 3:18 0.00% xterm
2567 root 1 20 0 89948K 43892K select 6 2:25 0.00% bsnmpd
876 root 9 52 0 30120K 3468K uwait 7 2:20 0.00% nscd
883 root 1 20 0 10444K 1708K select 0 2:16 0.00% powerd
2586 haldaemon 2 20 0 56620K 3920K select 1 2:04 0.00% hald
systat -vm:
10 users Load 0.21 0.23 0.23 Oct 26 17:14
Mem usage: 0k%Phy 5%Kmem
Mem: KB REAL VIRTUAL VN PAGER SWAP PAGER
Tot Share Tot Share Free in out in out
Act 10545k 51036 67451248 249344 61912 count 120 3
All 10550k 55504 67480212 275720 pages 807 3
Proc: Interrupts
r p d s w Csw Trp Sys Int Sof Flt 149 ioflt 845 total
6 781 12 9327 6923 24k 318 18 6649 20 cow 2 ehci0 16
5064 zfod 3 ehci1 23
4.7%Sys 0.0%Intr 1.7%User 0.0%Nice 93.6%Idle 16 ozfod 106 cpu0:timer
| | | | | | | | | | %ozfod xhci0 264
==> 927 daefr 45 em0 265
51 dtbuf 545 prcfr 94 hdac1 266
Namei Name-cache Dir-cache 349167 desvn 1787 totfr 175 ahci0 270
Calls hits % hits % 337762 numvn react 50 cpu1:timer
935 901 96 170078 frevn pdwak 48 cpu3:timer
1435985 pdpgs 66 cpu2:timer
Disks md0 ada0 ada1 pass0 pass1 pass2 intrn 77 cpu7:timer
KB/t 0.00 22.59 0.00 0.00 0.00 0.00 5819100 wire 75 cpu4:timer
tps 17 168 0 0 0 0 9723008 act 51 cpu6:timer
MB/s 0.00 3.70 0.00 0.00 0.00 0.00 636300 inact 53 cpu5:timer
%busy 0 3 0 0 0 0 cache vgapci0
61912 free
770732 buf
I'm unable to scroll back the vmstat -m output in my tmux pane (running on a
different system, this is super strange), so all I can show is this:
inodedep 16 12293K - 2770872 512
bmsafemap 9 49K - 2136243 256,8192
newblk 28 24582K - 4111401 256
indirdep 4 1K - 865983 128,16384
freefrag 6 1K - 682027 128
freeblks 3 1K - 1318985 256
freefile 3 1K - 1272832 64
diradd 2 1K - 1306741 128
mkdir 0 0K - 5120 128
dirrem 0 0K - 1305358 128
newdirblk 0 0K - 2610 64
freework 9 1K - 1566489 64,128
sbdep 0 0K - 45661 64
savedino 0 0K - 186280 256
softdep 6 3K - 6 512
ufs_dirhash 1533 767K - 109513 16,32,64,128,256,512,1024,2048,4096
ufs_quota 1 2048K - 1
ufs_mount 18 55K - 18 512,2048,4096,8192
vm_pgdata 1 2048K - 2 128
UMAHash 23 5385K - 108 512,1024,2048,4096,8192,16384,32768,65536
memdesc 1 4K - 1 4096
atkbddev 2 1K - 2 64
entropy 1 1K - 894489 32,4096
apmdev 1 1K - 1 128
madt_table 0 0K - 1 4096
SCSI ENC 25 100K - 120744 16,64,256,2048,32768
io_apic 1 2K - 1 2048
MCA 18 3K - 18 64,128
msi 16 2K - 16 128
nexusdev 5 1K - 5 16
hdaa 5 54K - 5 2048,16384,32768
hdac 1 1K - 1 1024
hdacc 1 1K - 1 32
linux 29 2K - 29 64
solaris 295750 228696K - 140520323 16,32,64,128,256,512,1024,2048,4096,8192,32768
kstat_data 6 1K - 6 64
eli data 22 4K - 6218901 64,256,512,1024,2048,4096,8192,16384,32768,65536
ksem 1 8K - 1 8192
nullfs_hash 1 2048K - 1
nullfs_node 9 1K - 41 64
nullfs_mount 9 1K - 9 32
fdesc_mount 1 1K - 1 16
gem_name 46 14K - 122 32,4096
drm_global 2 1K - 2 128,256
drm_dma 1 1K - 1 32
drm_sarea 1 1K - 1 16
drm_driver 91 2278K - 125 16,32,64,128,256,512,1024,2048,4096,8192,32768
drm_minor 2 1K - 2 128
drm_files 1 1K - 3 512
drm_ctxbitmap 1 4K - 1 4096
drm_sman 41 6K - 42 128
drm_hashtab 3 4129K - 4 128,32768
drm_kms 69 19K - 163 16,32,64,128,256,512,2048,4096,8192
drm_vblank 7 1K - 7 32,256
ttm_pd 16 17K - 18 16,128,2048,65536
ttm_rman 2 1K - 2 256
ttm_zone 2 1K - 2 64
ttm_poolmgr 1 1K - 1 512
Now what?
The xterm I have running locally with a stuck top is showing the top 3 chrome
processes in pfault state and it has "Swap: 11M In" in the header, so clearly
11.x is prone to deadlock during page faults and or swapping. It has last
updated on 17:14:13 (compare to the other top at 17:14:10 not showing pfault
state yet).
More information about the freebsd-current
mailing list