ssh sbwait indfinite timeout on SMP machines
Cyril Elkaim
celkaim at dental-on-line.fr
Thu Feb 24 18:11:40 GMT 2005
Hi Robert,
Thanks for your answer.
Unhopefully I can't restart the machines or touch the kernel, these are
production machines and under heavy use. And I don't have another SMP
machine available right now.
The only thing I can assert for the moment is that if you abort an
authentication sequence it leaves ssh in CLOSE_WAIT state indefinitely
and it always reproductible.
The sshd_config file is the default one and the machines have no special
configuration in the sysctl.conf for example.
Doing the same thing on an uniprocessor do not trigger the problem.
Here a top before an ssh connection:
last pid: 80505; load averages: 0.06, 0.02,
0.00
up 49+04:25:13 19:00:58
57 processes: 1 running, 56 sleeping
CPU states: % user, % nice, % system, % interrupt, %
idle
Mem: 69M Active, 1606M Inact, 239M Wired, 84M Cache, 112M Buf, 4816K Free
Swap: 4071M Total, 340K Used, 4070M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
76291 admin 96 0 2684K 1908K select 1 1:22 0.00% 0.00% rsync
69111 root 8 0 7796K 4560K nanslp 2 0:47 0.00% 0.00% httpd
971 root 96 0 3460K 2484K select 0 0:44 0.00% 0.00% sendmail
303 root 96 0 1316K 788K select 1 0:38 0.00% 0.00% syslogd
446 root 8 0 1356K 900K nanslp 1 0:32 0.00% 0.00% cron
473 pgsql 96 0 15728K 2084K select 1 0:17 0.00% 0.00% postgres
80080 admin 96 0 2164K 1464K select 1 0:07 0.00% 0.00% rsync
29845 ldap 20 0 40008K 16824K kserel 3 0:03 0.00% 0.00% slapd
39987 bind 20 0 5692K 3980K kserel 3 0:02 0.00% 0.00% named
38045 smmsp 20 0 3324K 2032K pause 0 0:02 0.00% 0.00% sendmail
975 smmsp 20 0 3340K 2292K pause 3 0:01 0.00% 0.00% sendmail
80072 admin 96 0 6092K 2688K select 0 0:01 0.00% 0.00% sshd
69112 www 96 0 7736K 4480K select 3 0:00 0.00% 0.00% httpd
76290 admin 96 0 2408K 1640K select 1 0:00 0.00% 0.00% rsync
77762 root 96 0 3360K 2340K select 1 0:00 0.00% 0.00% sshd
72572 admin 8 0 7844K 4900K wait 1 0:00 0.00% 0.00% python
77672 root 96 0 3360K 2340K select 1 0:00 0.00% 0.00% sshd
80077 admin 8 0 7844K 5008K wait 2 0:00 0.00% 0.00% python
80459 admin 96 0 6088K 2692K select 1 0:00 0.00% 0.00% sshd
80079 admin 96 0 1808K 1124K select 3 0:00 0.00% 0.00% rsync
58472 root 96 0 1424K 876K select 0 0:00 0.00% 0.00% rsync
80073 admin 8 0 2256K 1776K wait 2 0:00 0.00% 0.00% bash
80457 root 4 0 6080K 2624K sbwait 2 0:00 0.00% 0.00% sshd
80069 root 4 0 6084K 2628K sbwait 3 0:00 0.00% 0.00% sshd
72584 pgsql 4 0 16376K 5416K sbwait 0 0:00 0.00% 0.00% postgres
80460 admin 8 0 2252K 1772K wait 2 0:00 0.00% 0.00% bash
76289 admin -8 0 3356K 2408K piperd 1 0:00 0.00% 0.00% sendmail
72567 admin 8 0 1636K 1008K wait 0 0:00 0.00% 0.00% sh
558 root 5 0 1280K 792K ttyin 1 0:00 0.00% 0.00% getty
556 root 5 0 1280K 792K ttyin 0 0:00 0.00% 0.00% getty
559 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty
557 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty
561 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty
563 root 5 0 1280K 792K ttyin 1 0:00 0.00% 0.00% getty
560 root 5 0 1280K 792K ttyin 3 0:00 0.00% 0.00% getty
562 root 5 0 1280K 792K ttyin 0 0:00 0.00% 0.00% getty
80078 pgsql 4 0 16044K 3436K sbwait 1 0:00 0.00% 0.00% postgres
80399 www 4 0 7796K 4708K accept 0 0:00 0.00% 0.00% httpd
80364 www 4 0 7796K 4708K accept 2 0:00 0.00% 0.00% httpd
80425 www 4 0 7796K 4708K accept 2 0:00 0.00% 0.00% httpd
80388 www 4 0 7796K 4708K accept 1 0:00 0.00% 0.00% httpd
80353 www 4 0 7796K 4708K accept 3 0:00 0.00% 0.00% httpd
72562 root -8 0 1400K 1080K piperd 1 0:00 0.00% 0.00% cron
474 pgsql 110 0 7476K 1924K select 3 0:00 0.00% 0.00% postgres
475 pgsql 110 0 6488K 1928K select 2 0:00 0.00% 0.00% postgres
80505 admin 96 0 2288K 1404K CPU0 0 0:00 0.00% 0.00% top
279 root 110 0 516K 224K select 2 0:00 0.00% 0.00% devd
---------------------------------------------------------------------------------------------------------------------
_after_ the aborted ssh connection, look a at the end of the top the
aborted ssh
is the one with the id 80508 in sbwait state. We had such processes
waiting since two months
before killing them manually.
last pid: 80509; load averages: 0.00, 0.00,
0.00
up 49+04:28:53 19:04:38
58 processes: 1 running, 57 sleeping
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100%
idle
Mem: 70M Active, 1612M Inact, 239M Wired, 78M Cache, 112M Buf, 3024K Free
Swap: 4071M Total, 340K Used, 4070M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
76291 admin 96 0 2684K 1908K select 1 1:22 0.00% 0.00% rsync
69111 root 8 0 7796K 4560K nanslp 0 0:47 0.00% 0.00% httpd
971 root 96 0 3460K 2484K select 2 0:44 0.00% 0.00% sendmail
303 root 96 0 1316K 788K select 2 0:38 0.00% 0.00% syslogd
446 root 8 0 1356K 900K nanslp 3 0:32 0.00% 0.00% cron
473 pgsql 96 0 15728K 2084K select 3 0:17 0.00% 0.00% postgres
80080 admin 96 0 2164K 1464K select 1 0:07 0.00% 0.00% rsync
29845 ldap 20 0 40008K 16824K kserel 0 0:03 0.00% 0.00% slapd
39987 bind 20 0 5692K 3980K kserel 1 0:02 0.00% 0.00% named
38045 smmsp 20 0 3324K 2032K pause 0 0:02 0.00% 0.00% sendmail
975 smmsp 20 0 3340K 2292K pause 3 0:01 0.00% 0.00% sendmail
80072 admin 96 0 6092K 2688K select 0 0:01 0.00% 0.00% sshd
69112 www 96 0 7736K 4480K select 3 0:00 0.00% 0.00% httpd
76290 admin 96 0 2408K 1640K select 0 0:00 0.00% 0.00% rsync
77762 root 96 0 3360K 2340K select 1 0:00 0.00% 0.00% sshd
72572 admin 8 0 7844K 4900K wait 1 0:00 0.00% 0.00% python
77672 root 96 0 3360K 2340K select 1 0:00 0.00% 0.00% sshd
80077 admin 8 0 7844K 5008K wait 2 0:00 0.00% 0.00% python
80459 admin 96 0 6088K 2692K select 3 0:00 0.00% 0.00% sshd
80079 admin 96 0 1808K 1124K select 0 0:00 0.00% 0.00% rsync
80505 admin 96 0 2440K 1584K CPU0 1 0:00 0.00% 0.00% top
58472 root 96 0 1424K 876K select 0 0:00 0.00% 0.00% rsync
80073 admin 8 0 2256K 1776K wait 2 0:00 0.00% 0.00% bash
80457 root 4 0 6080K 2624K sbwait 2 0:00 0.00% 0.00% sshd
80069 root 4 0 6084K 2628K sbwait 3 0:00 0.00% 0.00% sshd
72584 pgsql 4 0 16376K 5416K sbwait 0 0:00 0.00% 0.00% postgres
80460 admin 8 0 2252K 1772K wait 2 0:00 0.00% 0.00% bash
76289 admin -8 0 3356K 2408K piperd 1 0:00 0.00% 0.00% sendmail
72567 admin 8 0 1636K 1008K wait 0 0:00 0.00% 0.00% sh
558 root 5 0 1280K 792K ttyin 1 0:00 0.00% 0.00% getty
556 root 5 0 1280K 792K ttyin 0 0:00 0.00% 0.00% getty
559 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty
557 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty
561 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty
563 root 5 0 1280K 792K ttyin 1 0:00 0.00% 0.00% getty
560 root 5 0 1280K 792K ttyin 3 0:00 0.00% 0.00% getty
562 root 5 0 1280K 792K ttyin 0 0:00 0.00% 0.00% getty
80078 pgsql 4 0 16044K 3436K sbwait 1 0:00 0.00% 0.00% postgres
80399 www 4 0 7796K 4708K accept 0 0:00 0.00% 0.00% httpd
80364 www 4 0 7796K 4708K accept 2 0:00 0.00% 0.00% httpd
80425 www 4 0 7796K 4708K accept 2 0:00 0.00% 0.00% httpd
80388 www 4 0 7796K 4708K accept 1 0:00 0.00% 0.00% httpd
80353 www 4 0 7796K 4708K accept 3 0:00 0.00% 0.00% httpd
72562 root -8 0 1400K 1080K piperd 1 0:00 0.00% 0.00% cron
474 pgsql 110 0 7476K 1924K select 3 0:00 0.00% 0.00% postgres
80508 root 4 0 4852K 2592K sbwait 0 0:00 0.00% 0.00% sshd
475 pgsql 110 0 6488K 1928K select 2 0:00 0.00% 0.00% postgres
279 root 110 0 516K 224K select 2 0:00 0.00% 0.00% devd
and here a dump using sockstat _after_ the aborted connection:
USER COMMAND PID FD PROTO LOCAL ADDRESS FOREIGN ADDRESS
root sshd 80508 3 stream -> ??
root sshd 80508 4 stream (not connected)
root sshd 80508 5 tcp4 192.168.254.2:22
192.168.254.1:49023
root sshd 80508 6 stream -> ??
Hope that helps,
Cyril Elkaim
Robert Watson wrote:
>On Thu, 24 Feb 2005, Dental-on-line wrote:
>
>
>
>>We are using two SMP machines (Dell 1750) with 5.3_p2 RELENG and have
>>timeout problems with SSH.
>>
>>This problem happens ONLY on SMP machines not uniprocessor ones.
>>
>>
>
>Could you try the following things:
>
>- Boot the box with the debug.mpsafenet=0 tunable set in loader.conf and
> see if the problem recurs?
>
>- Update to 5-STABLE and try with and without debug.mpsafenet=0? A number
> of TCP locking changes have been made there, and one may help.
>
>Does it happen for every SSH session, or only some?
>
>Thanks,
>
>Robert N M Watson
>
>
>
More information about the freebsd-bugs
mailing list