6.1-R ? 6-Stable ? 5.5-R ?
Francisco Reyes
lists at stringsutils.com
Thu Jun 29 17:39:03 UTC 2006
Kostik Belousov writes:
>> > Approved by: pjd (mentor)
>> > Revision Changes Path
>> > 1.156.2.3 +16 -0 src/sys/nfsserver/nfs_serv.c
>> > 1.136.2.3 +4 -0 src/sys/nfsserver/nfs_srvsubs.c
>>
>> The above files are what I have.
Yes, from a 6.1-STABLE built around 6-25-06.
> What this means ? That you have _this_ revisions of the files,
> and your LA skyrocketed ?
LA = load average?
Our problem is that vmstat's 'b' (blocked) column keeps growing, with NFS
causing lockups on the server side. When the machine locked up it was running a
background fsck, and I saw "Giant" a lot in the status of the nfsd processes.
I am really wondering whether 6.1 is ready for production under heavy load. And
the NFS client in the whole 6.X line certainly seems problematic (see my post
on the stable list under the subject: NFS clients freeze and can not
disconnect).
As for the vmstat output, about the only thing that appears to be doing any
work is NFS.
For instance, I saw this in another thread:
ps ax -O ppid,flags,mwchan | awk '($6 ~ /^D/ || $6 == "STAT") && $3 !~ /^20.$/'
On the machine in question it shows:
PID PPID F MWCHAN TT STAT TIME COMMAND
16124 16123 0 biowr ?? D 46:24.76 nfsd: server (nfsd)
16125 16123 0 biowr ?? D 16:05.58 nfsd: server (nfsd)
16126 16123 0 biowr ?? D 11:05.53 nfsd: server (nfsd)
16127 16123 0 biowr ?? D 8:01.21 nfsd: server (nfsd)
16128 16123 0 biowr ?? D 6:19.15 nfsd: server (nfsd)
16129 16123 0 biowr ?? D 5:01.27 nfsd: server (nfsd)
16130 16123 0 biowr ?? D 3:55.56 nfsd: server (nfsd)
16131 16123 0 biowr ?? D 3:13.11 nfsd: server (nfsd)
16132 16123 0 biowr ?? D 2:43.26 nfsd: server (nfsd)
16133 16123 0 biowr ?? D 2:16.40 nfsd: server (nfsd)
16134 16123 0 biowr ?? D 1:57.00 nfsd: server (nfsd)
16135 16123 0 biowr ?? D 1:41.02 nfsd: server (nfsd)
16136 16123 0 biowr ?? D 1:27.07 nfsd: server (nfsd)
16137 16123 0 biowr ?? D 1:15.25 nfsd: server (nfsd)
16138 16123 0 biowr ?? D 1:06.54 nfsd: server (nfsd)
16139 16123 0 biowr ?? D 0:57.57 nfsd: server (nfsd)
16140 16123 0 biowr ?? D 0:50.65 nfsd: server (nfsd)
16141 16123 0 biowr ?? D 0:44.60 nfsd: server (nfsd)
16142 16123 0 biowr ?? D 0:38.29 nfsd: server (nfsd)
16143 16123 0 biowr ?? D 0:34.21 nfsd: server (nfsd)
16144 16123 0 biowr ?? D 0:29.34 nfsd: server (nfsd)
16145 16123 0 biowr ?? D 0:26.35 nfsd: server (nfsd)
16146 16123 0 biowr ?? D 0:22.25 nfsd: server (nfsd)
16147 16123 0 biowr ?? D 0:18.17 nfsd: server (nfsd)
16148 16123 0 biowr ?? D 0:15.95 nfsd: server (nfsd)
16149 16123 0 biowr ?? D 0:13.66 nfsd: server (nfsd)
16150 16123 0 biowr ?? D 0:10.81 nfsd: server (nfsd)
16151 16123 0 biowr ?? D 0:08.92 nfsd: server (nfsd)
16152 16123 0 biowr ?? D 0:06.82 nfsd: server (nfsd)
16153 16123 0 biowr ?? D 0:05.16 nfsd: server (nfsd)
84338 10043 4100 ufs ?? D 0:02.00 qmgr -l -t fifo -u
91632 10043 4100 biowr ?? D 0:00.02 cleanup -z -t unix -u
91650 10043 4100 ufs ?? D 0:00.04 [smtpd]
91912 86635 4100 biowr ?? Ds 0:00.01 /usr/local/bin/maildrop -d cathy at sitescape.com
91916 90579 4100 biowr ?? Ds 0:00.01 /usr/local/bin/maildrop -d jobs at sitescape.com
71677 71672 4002 ppwait p1 D 0:00.15 -su (csh)
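To unpack the awk filter used above: field 6 is STAT and field 3 is the flags
column, so it keeps the header row plus any process in 'D' (uninterruptible
disk wait) while dropping rows whose flags match /^20.$/. A self-contained demo
on canned sample rows (not live ps output):

```shell
# Demo of the filter on canned rows standing in for `ps ax -O ppid,flags,mwchan`.
# $6 is STAT, $3 is the flags column; keep the header and 'D' (disk-wait) rows.
printf '%s\n' \
  'PID PPID F MWCHAN TT STAT TIME COMMAND' \
  '16124 16123 0 biowr ?? D 46:24.76 nfsd: server (nfsd)' \
  '10043 1 4100 select ?? Ss 0:01.00 master' |
awk '($6 ~ /^D/ || $6 == "STAT") && $3 !~ /^20.$/'
```

Here the nfsd row survives (STAT is D) and the master row is dropped (STAT is
Ss), which is why the listing above is almost all nfsd processes.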
The iostat for that machine shows:
iostat 5
tty da0 pass0 cpu
tin tout KB/t tps MB/s KB/t tps MB/s us ni sy in id
0 130 15.35 109 1.63 0.00 0 0.00 6 0 6 1 87
0 36 10.43 230 2.34 0.00 0 0.00 3 0 2 1 93
0 12 10.81 280 2.96 0.00 0 0.00 6 0 2 0 92
0 12 13.03 259 3.30 0.00 0 0.00 0 0 1 1 98
0 12 12.87 259 3.26 0.00 0 0.00 5 0 2 1 91
0 12 17.17 228 3.82 0.00 0 0.00 8 0 3 1 87
0 12 18.38 306 5.49 0.00 0 0.00 3 0 2 1 94
0 12 14.53 284 4.04 0.00 0 0.00 6 0 3 1 89
0 12 26.03 213 5.41 0.00 0 0.00 5 0 3 2 91
Before that machine went into production, during stress testing I saw it do
700+ tps and substantially more MB/s.
We also have another machine with identical hardware; although its tps is 50 to
100 lower than this one's, that machine is always very low in the 'b' column.
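A simple filter over vmstat output pulls out just that 'b' column (field 2 in
vmstat's default layout) for side-by-side comparison of the two machines. A
sketch using canned sample lines in place of a live `vmstat 5`:

```shell
# Extract the 'b' (processes blocked on I/O) column, field 2 in vmstat output.
# Canned sample lines stand in for live `vmstat 5` output here:
printf '%s\n' \
  ' 1  9  0  51914  24779' \
  ' 0 12  0  51914  24779' |
awk '{ print "blocked:", $2 }'
```

On the live boxes, piping `vmstat 5` through the same awk would give a running
count to compare.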
I am now reading up on vmstat to see whether I can spot anything wrong in the vmstat -s output:
1660720108 cpu context switches
736683712 device interrupts
46973243 software interrupts
99310719 traps
3405487756 system calls
46 kernel threads created
385149 fork() calls
7785 vfork() calls
0 rfork() calls
2809 swap pager pageins
4449 swap pager pages paged in
2027 swap pager pageouts
4609 swap pager pages paged out
5068 vnode pager pageins
20399 vnode pager pages paged in
0 vnode pager pageouts
0 vnode pager pages paged out
2156 page daemon wakeups
58310018 pages examined by the page daemon
12161 pages reactivated
21541481 copy-on-write faults
3659 copy-on-write optimized faults
38628563 zero fill pages zeroed
30430314 zero fill pages prezeroed
5780 intransit blocking page faults
79476476 total VM faults taken
0 pages affected by kernel thread creation
30747781 pages affected by fork()
3054182 pages affected by vfork()
0 pages affected by rfork()
152627514 pages freed
6 pages freed by daemon
35726176 pages freed by exiting processes
51914 pages active
810514 pages inactive
47456 pages in VM cache
56444 pages wired down
24779 pages free
4096 bytes per page
184453449 total name lookups
cache hits (67% pos + 6% neg) system 2% per-directory
deletions 6%, falsehits 0%, toolong 0%
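The swap/vnode pager counters above are the easiest thing to compare between
the two machines; a small filter pulls just those lines out. Shown here on
canned sample lines; on the live box the same grep over `vmstat -s` would do
it:

```shell
# Keep only the pager counters from vmstat -s style output.
# Canned lines stand in for the live `vmstat -s` counters on the server:
printf '%s\n' \
  '2809 swap pager pageins' \
  '2027 swap pager pageouts' \
  '0 vnode pager pageouts' \
  '2156 page daemon wakeups' |
grep -i 'pager'
```

Snapshotting that output on both boxes at intervals would show whether the
busy machine is paging noticeably harder than the quiet one.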
root at mailstore12.simplicato.com:~/bin# uptime
1:35PM up 3 days, 14:48, 3 users, load averages: 0.26, 0.36, 0.29