ssh sbwait indfinite timeout on SMP machines

Cyril Elkaim celkaim at dental-on-line.fr
Thu Feb 24 18:11:40 GMT 2005


Hi Robert,

Thanks for your answer.

Unhopefully I can't restart the machines or touch the kernel, these are 
production machines and under heavy use. And I don't have another SMP 
machine available right now.

The only thing I can assert for the moment is that if you abort an 
authentication sequence it leaves ssh in CLOSE_WAIT state indefinitely 
and it always reproductible.

The sshd_config file is the default one and the machines have no special 
configuration in the  sysctl.conf for example.

Doing the same thing on an uniprocessor do not trigger the problem.

Here a top before an ssh connection:

last pid: 80505;  load averages:  0.06,  0.02,  
0.00                                                                                                     
up 49+04:25:13  19:00:58
57 processes:  1 running, 56 sleeping
CPU states:     % user,     % nice,     % system,     % interrupt,     % 
idle
Mem: 69M Active, 1606M Inact, 239M Wired, 84M Cache, 112M Buf, 4816K Free
Swap: 4071M Total, 340K Used, 4070M Free

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
76291 admin     96    0  2684K  1908K select 1   1:22  0.00%  0.00% rsync
69111 root       8    0  7796K  4560K nanslp 2   0:47  0.00%  0.00% httpd
  971 root      96    0  3460K  2484K select 0   0:44  0.00%  0.00% sendmail
  303 root      96    0  1316K   788K select 1   0:38  0.00%  0.00% syslogd
  446 root       8    0  1356K   900K nanslp 1   0:32  0.00%  0.00% cron
  473 pgsql     96    0 15728K  2084K select 1   0:17  0.00%  0.00% postgres
80080 admin     96    0  2164K  1464K select 1   0:07  0.00%  0.00% rsync
29845 ldap      20    0 40008K 16824K kserel 3   0:03  0.00%  0.00% slapd
39987 bind      20    0  5692K  3980K kserel 3   0:02  0.00%  0.00% named
38045 smmsp     20    0  3324K  2032K pause  0   0:02  0.00%  0.00% sendmail
  975 smmsp     20    0  3340K  2292K pause  3   0:01  0.00%  0.00% sendmail
80072 admin     96    0  6092K  2688K select 0   0:01  0.00%  0.00% sshd
69112 www       96    0  7736K  4480K select 3   0:00  0.00%  0.00% httpd
76290 admin     96    0  2408K  1640K select 1   0:00  0.00%  0.00% rsync
77762 root      96    0  3360K  2340K select 1   0:00  0.00%  0.00% sshd
72572 admin      8    0  7844K  4900K wait   1   0:00  0.00%  0.00% python
77672 root      96    0  3360K  2340K select 1   0:00  0.00%  0.00% sshd
80077 admin      8    0  7844K  5008K wait   2   0:00  0.00%  0.00% python
80459 admin     96    0  6088K  2692K select 1   0:00  0.00%  0.00% sshd
80079 admin     96    0  1808K  1124K select 3   0:00  0.00%  0.00% rsync
58472 root      96    0  1424K   876K select 0   0:00  0.00%  0.00% rsync
80073 admin      8    0  2256K  1776K wait   2   0:00  0.00%  0.00% bash
80457 root       4    0  6080K  2624K sbwait 2   0:00  0.00%  0.00% sshd
80069 root       4    0  6084K  2628K sbwait 3   0:00  0.00%  0.00% sshd
72584 pgsql      4    0 16376K  5416K sbwait 0   0:00  0.00%  0.00% postgres
80460 admin      8    0  2252K  1772K wait   2   0:00  0.00%  0.00% bash
76289 admin     -8    0  3356K  2408K piperd 1   0:00  0.00%  0.00% sendmail
72567 admin      8    0  1636K  1008K wait   0   0:00  0.00%  0.00% sh
  558 root       5    0  1280K   792K ttyin  1   0:00  0.00%  0.00% getty
  556 root       5    0  1280K   792K ttyin  0   0:00  0.00%  0.00% getty
  559 root       5    0  1280K   792K ttyin  2   0:00  0.00%  0.00% getty
  557 root       5    0  1280K   792K ttyin  2   0:00  0.00%  0.00% getty
  561 root       5    0  1280K   792K ttyin  2   0:00  0.00%  0.00% getty
  563 root       5    0  1280K   792K ttyin  1   0:00  0.00%  0.00% getty
  560 root       5    0  1280K   792K ttyin  3   0:00  0.00%  0.00% getty
  562 root       5    0  1280K   792K ttyin  0   0:00  0.00%  0.00% getty
80078 pgsql      4    0 16044K  3436K sbwait 1   0:00  0.00%  0.00% postgres
80399 www        4    0  7796K  4708K accept 0   0:00  0.00%  0.00% httpd
80364 www        4    0  7796K  4708K accept 2   0:00  0.00%  0.00% httpd
80425 www        4    0  7796K  4708K accept 2   0:00  0.00%  0.00% httpd
80388 www        4    0  7796K  4708K accept 1   0:00  0.00%  0.00% httpd
80353 www        4    0  7796K  4708K accept 3   0:00  0.00%  0.00% httpd
72562 root      -8    0  1400K  1080K piperd 1   0:00  0.00%  0.00% cron
  474 pgsql    110    0  7476K  1924K select 3   0:00  0.00%  0.00% postgres
  475 pgsql    110    0  6488K  1928K select 2   0:00  0.00%  0.00% postgres
80505 admin     96    0  2288K  1404K CPU0   0   0:00  0.00%  0.00% top
  279 root     110    0   516K   224K select 2   0:00  0.00%  0.00% devd

---------------------------------------------------------------------------------------------------------------------
_after_ the aborted ssh connection, look a at the end of the top the 
aborted ssh
is the one with the id 80508 in sbwait state. We had such processes 
waiting since two months
before killing them manually.

last pid: 80509;  load averages:  0.00,  0.00,  
0.00                                                                                                     
up 49+04:28:53  19:04:38
58 processes:  1 running, 57 sleeping
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% 
idle
Mem: 70M Active, 1612M Inact, 239M Wired, 78M Cache, 112M Buf, 3024K Free
Swap: 4071M Total, 340K Used, 4070M Free

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
76291 admin     96    0  2684K  1908K select 1   1:22  0.00%  0.00% rsync
69111 root       8    0  7796K  4560K nanslp 0   0:47  0.00%  0.00% httpd
  971 root      96    0  3460K  2484K select 2   0:44  0.00%  0.00% sendmail
  303 root      96    0  1316K   788K select 2   0:38  0.00%  0.00% syslogd
  446 root       8    0  1356K   900K nanslp 3   0:32  0.00%  0.00% cron
  473 pgsql     96    0 15728K  2084K select 3   0:17  0.00%  0.00% postgres
80080 admin     96    0  2164K  1464K select 1   0:07  0.00%  0.00% rsync
29845 ldap      20    0 40008K 16824K kserel 0   0:03  0.00%  0.00% slapd
39987 bind      20    0  5692K  3980K kserel 1   0:02  0.00%  0.00% named
38045 smmsp     20    0  3324K  2032K pause  0   0:02  0.00%  0.00% sendmail
  975 smmsp     20    0  3340K  2292K pause  3   0:01  0.00%  0.00% sendmail
80072 admin     96    0  6092K  2688K select 0   0:01  0.00%  0.00% sshd
69112 www       96    0  7736K  4480K select 3   0:00  0.00%  0.00% httpd
76290 admin     96    0  2408K  1640K select 0   0:00  0.00%  0.00% rsync
77762 root      96    0  3360K  2340K select 1   0:00  0.00%  0.00% sshd
72572 admin      8    0  7844K  4900K wait   1   0:00  0.00%  0.00% python
77672 root      96    0  3360K  2340K select 1   0:00  0.00%  0.00% sshd
80077 admin      8    0  7844K  5008K wait   2   0:00  0.00%  0.00% python
80459 admin     96    0  6088K  2692K select 3   0:00  0.00%  0.00% sshd
80079 admin     96    0  1808K  1124K select 0   0:00  0.00%  0.00% rsync
80505 admin     96    0  2440K  1584K CPU0   1   0:00  0.00%  0.00% top
58472 root      96    0  1424K   876K select 0   0:00  0.00%  0.00% rsync
80073 admin      8    0  2256K  1776K wait   2   0:00  0.00%  0.00% bash
80457 root       4    0  6080K  2624K sbwait 2   0:00  0.00%  0.00% sshd
80069 root       4    0  6084K  2628K sbwait 3   0:00  0.00%  0.00% sshd
72584 pgsql      4    0 16376K  5416K sbwait 0   0:00  0.00%  0.00% postgres
80460 admin      8    0  2252K  1772K wait   2   0:00  0.00%  0.00% bash
76289 admin     -8    0  3356K  2408K piperd 1   0:00  0.00%  0.00% sendmail
72567 admin      8    0  1636K  1008K wait   0   0:00  0.00%  0.00% sh
  558 root       5    0  1280K   792K ttyin  1   0:00  0.00%  0.00% getty
  556 root       5    0  1280K   792K ttyin  0   0:00  0.00%  0.00% getty
  559 root       5    0  1280K   792K ttyin  2   0:00  0.00%  0.00% getty
  557 root       5    0  1280K   792K ttyin  2   0:00  0.00%  0.00% getty
  561 root       5    0  1280K   792K ttyin  2   0:00  0.00%  0.00% getty
  563 root       5    0  1280K   792K ttyin  1   0:00  0.00%  0.00% getty
  560 root       5    0  1280K   792K ttyin  3   0:00  0.00%  0.00% getty
  562 root       5    0  1280K   792K ttyin  0   0:00  0.00%  0.00% getty
80078 pgsql      4    0 16044K  3436K sbwait 1   0:00  0.00%  0.00% postgres
80399 www        4    0  7796K  4708K accept 0   0:00  0.00%  0.00% httpd
80364 www        4    0  7796K  4708K accept 2   0:00  0.00%  0.00% httpd
80425 www        4    0  7796K  4708K accept 2   0:00  0.00%  0.00% httpd
80388 www        4    0  7796K  4708K accept 1   0:00  0.00%  0.00% httpd
80353 www        4    0  7796K  4708K accept 3   0:00  0.00%  0.00% httpd
72562 root      -8    0  1400K  1080K piperd 1   0:00  0.00%  0.00% cron
  474 pgsql    110    0  7476K  1924K select 3   0:00  0.00%  0.00% postgres
80508 root       4    0  4852K  2592K sbwait 0   0:00  0.00%  0.00% sshd
  475 pgsql    110    0  6488K  1928K select 2   0:00  0.00%  0.00% postgres
  279 root     110    0   516K   224K select 2   0:00  0.00%  0.00% devd


and here a dump using sockstat _after_ the aborted connection:

USER     COMMAND    PID   FD PROTO  LOCAL ADDRESS         FOREIGN ADDRESS
root     sshd       80508 3  stream -> ??
root     sshd       80508 4  stream (not connected)
root     sshd       80508 5  tcp4   192.168.254.2:22      
192.168.254.1:49023
root     sshd       80508 6  stream -> ??


Hope that helps,
Cyril Elkaim

Robert Watson wrote:

>On Thu, 24 Feb 2005, Dental-on-line wrote:
>
>  
>
>>We are using two SMP machines (Dell 1750) with 5.3_p2 RELENG and have
>>timeout problems with SSH.
>>
>>This problem happens ONLY on SMP machines not uniprocessor ones. 
>>    
>>
>
>Could you try the following things:
>
>- Boot the box with the debug.mpsafenet=0 tunable set in loader.conf and
>  see if the problem recurs?
>
>- Update to 5-STABLE and try with and without debug.mpsafenet=0?  A number
>  of TCP locking changes have been made there, and one may help.
>
>Does it happen for every SSH session, or only some? 
>
>Thanks,
>
>Robert N M Watson
>
>  
>



More information about the freebsd-bugs mailing list