High traffic NFS performance and availability problems
David Rice
drice at globat.com
Mon Feb 21 12:34:53 PST 2005
Here are the snapshots of the output you requested. These are from the NFS
server. We have just upgraded them to 5.3-RELEASE, as so many have recommended;
hopefully that makes them more stable. The performance still needs some attention.
Thank you
--------------------------------------------------------------------------------------------------
4 users Load 5.28 19.37 28.00 Feb 21 12:18
Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER
Tot Share Tot Share Free in out in out
Act 19404 2056 90696 3344 45216 count
All 1020204 4280 4015204 7424 pages
zfod Interrupts
Proc:r p d s w Csw Trp Sys Int Sof Flt cow 7226 total
5128 5 60861 3 14021584 9 152732 wire 4: sio0
23228 act 6: fdc0
30.2%Sys 11.8%Intr 0.0%User 0.0%Nice 58.0%Idl 803616 inact 128 8: rtc
| | | | | | | | | | 43556 cache 13: npx
===============++++++ 1660 free 15: ata
daefr 6358 16: bge
Namei Name-cache Dir-cache prcfr 1 17: bge
Calls hits % hits % react 18: mpt
1704 971 57 11 1 pdwak 19: mpt
5342 pdpgs 639 24: amr
Disks amrd0 da0 pass0 pass1 pass2 intrn 100 0: clk
KB/t 22.41 0.00 0.00 0.00 0.00 114288 buf
tps 602 0 0 0 0 510 dirtybuf
MB/s 13.16 0.00 0.00 0.00 0.00 70235 desiredvnodes
% busy 100 0 0 0 0 20543 numvnodes
7883 freevnodes
-----------------------------------------------------------------------------------------
last pid: 10330; load averages: 14.69, 11.81, 18.62
up 0+09:01:13 12:32:57
226 processes: 5 running, 153 sleeping, 57 waiting, 11 lock
CPU states: 0.1% user, 0.0% nice, 66.0% system, 24.3% interrupt, 9.6% idle
Mem: 23M Active, 774M Inact, 150M Wired, 52M Cache, 112M Buf, 1660K Free
Swap: 1024M Total, 124K Used, 1024M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
63 root -44 -163 0K 12K WAIT 0 147:05 45.07% 45.07% swi1: net
30 root -68 -187 0K 12K WAIT 0 101:39 32.32% 32.32% irq16: bge0
12 root 117 0 0K 12K CPU2 2 329:09 19.58% 19.58% idle: cpu2
11 root 116 0 0K 12K CPU3 3 327:29 19.24% 19.24% idle: cpu3
13 root 114 0 0K 12K RUN 1 263:39 16.89% 16.89% idle: cpu1
14 root 109 0 0K 12K CPU0 0 228:50 12.06% 12.06% idle: cpu0
368 root 4 0 1220K 740K *Giant 3 45:27 7.52% 7.52% nfsd
366 root 4 0 1220K 740K *Giant 0 48:52 7.28% 7.28% nfsd
364 root 4 0 1220K 740K *Giant 3 53:01 7.13% 7.13% nfsd
367 root -8 0 1220K 740K biord 3 41:22 7.08% 7.08% nfsd
372 root 4 0 1220K 740K *Giant 0 28:54 7.08% 7.08% nfsd
365 root -1 0 1220K 740K *Giant 3 51:53 6.93% 6.93% nfsd
370 root -1 0 1220K 740K nfsslp 0 32:49 6.84% 6.84% nfsd
369 root -8 0 1220K 740K biord 1 36:40 6.49% 6.49% nfsd
371 root 4 0 1220K 740K *Giant 0 25:14 6.45% 6.45% nfsd
374 root -1 0 1220K 740K nfsslp 2 22:31 6.45% 6.45% nfsd
377 root 4 0 1220K 740K *Giant 2 17:21 5.52% 5.52% nfsd
376 root -4 0 1220K 740K *Giant 2 15:45 5.37% 5.37% nfsd
373 root -4 0 1220K 740K ufs 3 19:38 5.18% 5.18% nfsd
378 root 4 0 1220K 740K *Giant 2 13:55 4.54% 4.54% nfsd
379 root -8 0 1220K 740K biord 3 12:41 4.49% 4.49% nfsd
380 root 4 0 1220K 740K - 2 11:26 4.20% 4.20% nfsd
3 root -8 0 0K 12K - 1 21:21 4.05% 4.05% g_up
4 root -8 0 0K 12K - 0 20:05 3.96% 3.96% g_down
381 root 4 0 1220K 740K - 3 9:28 3.66% 3.66% nfsd
382 root 4 0 1220K 740K - 1 10:13 3.47% 3.47% nfsd
385 root -1 0 1220K 740K nfsslp 3 7:21 3.17% 3.17% nfsd
38 root -64 -183 0K 12K *Giant 0 14:45 3.12% 3.12% irq24: amr0
384 root 4 0 1220K 740K - 3 8:40 3.12% 3.12% nfsd
72 root -24 -143 0K 12K WAIT 2 16:50 2.98% 2.98% swi6:+
383 root -8 0 1220K 740K biord 2 7:57 2.93% 2.93% nfsd
389 root 4 0 1220K 740K - 2 5:31 2.64% 2.64% nfsd
390 root -8 0 1220K 740K biord 3 5:54 2.59% 2.59% nfsd
387 root -8 0 1220K 740K biord 0 6:40 2.54% 2.54% nfsd
386 root -8 0 1220K 740K biord 1 6:22 2.44% 2.44% nfsd
392 root 4 0 1220K 740K - 3 4:27 2.10% 2.10% nfsd
388 root -4 0 1220K 740K *Giant 2 4:45 2.05% 2.05% nfsd
395 root 4 0 1220K 740K - 0 3:59 2.05% 2.05% nfsd
391 root 4 0 1220K 740K - 2 5:10 1.95% 1.95% nfsd
393 root 4 0 1220K 740K sbwait 1 4:13 1.56% 1.56% nfsd
398 root 4 0 1220K 740K - 2 3:31 1.56% 1.56% nfsd
399 root 4 0 1220K 740K - 3 3:12 1.56% 1.56% nfsd
401 root 4 0 1220K 740K - 1 2:57 1.51% 1.51% nfsd
403 root 4 0 1220K 740K - 0 3:04 1.42% 1.42% nfsd
406 root 4 0 1220K 740K - 1 2:27 1.37% 1.37% nfsd
397 root 4 0 1220K 740K - 3 3:16 1.27% 1.27% nfsd
396 root 4 0 1220K 740K - 2 3:42 1.22% 1.22% nfsd
On Saturday 19 February 2005 04:23 am, Robert Watson wrote:
> On Thu, 17 Feb 2005, David Rice wrote:
> > Typically we have 7 client boxes mounting storage from a single file
> > server. Each client box serves 1000 web sites and the associated email. We
> > have done the basic NFS tuning (i.e., read/write size optimization and
> > kernel tuning).
>
> How many nfsd's are you running with?
>
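[The nfsd thread count is set via nfs_server_flags in /etc/rc.conf; for
reference, a setting along these lines — the -n value here is only an
example, not necessarily what we run:]

```shell
# /etc/rc.conf fragment (sketch; the -n value is an example, not a recommendation)
nfs_server_enable="YES"
nfs_server_flags="-u -t -n 20"   # serve UDP and TCP with 20 nfsd threads
```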
> If you run systat -vmstat 1 on your server under high load, could you send
> us the output? In particular, I'm interested in knowing how the system is
> spending its time, the paging level, I/O throughput on devices, and the
> systat -vmstat summary screen provides a good summary of this and more. A
> few snapshots of "gstat" output would also be very helpful. As would a
> snapshot or two of "top -S" output. This will give us a picture of how
> the system is spending its time.
>
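[For collecting those snapshots under load, a small helper like this can
append timestamped captures of any command's output to a log file. This is
only a sketch; on the server you would pair it with batch-mode commands such
as `gstat -b` or `top -Sb`, and the log path below is hypothetical:]

```shell
# snapshot: append a timestamped capture of a command's output to a log file.
snapshot() {
    log=$1; shift
    {
        date        # timestamp for this capture
        "$@"        # the command whose output we want
        echo        # blank line separating captures
    } >> "$log"
}

# usage (hypothetical log path):
# snapshot /var/tmp/nfs-diag.log top -Sb
# snapshot /var/tmp/nfs-diag.log gstat -b
```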
> > 2. Client boxes have high load averages and sometimes crash due to
> > slow NFS performance.
>
> Could you be more specific about the crash failure mode?
>
> > 3. File servers that randomly crash with "Fatal trap 12: page fault
> > while in kernel mode"
>
> Could you make sure you're running with at least the latest 5.3 patch
> level on the server, which includes some NFS server stability fixes, and
> also look at sliding to the head of 5-STABLE? There are a number of
> performance and stability improvements that may be relevant there.
>
> Could you provide serial console output of the full panic message, trap
> details, compile the kernel with KDB+DDB, and include a full stack trace?
> I'm happy to try to help debug these problems.
>
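[For a 5.3 kernel, the debugger hooks mentioned above correspond to custom
kernel config options roughly like the following — a sketch, with the option
names as I understand them for the 5.x kernel:]

```shell
# Custom kernel config fragment (FreeBSD 5.x; sketch)
options KDB                  # kernel debugger framework
options DDB                  # interactive kernel debugger
options BREAK_TO_DEBUGGER    # serial-console BREAK drops into the debugger
makeoptions DEBUG=-g         # build with debug symbols for crash dump analysis
```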
> > 4. With soft updates enabled, during fsck the file server will freeze with
> > all NFS processes in the "snaplck" state. We disabled soft updates
> > because of this.
>
> If it's possible to get some more information, it would be quite
> helpful. In particular, if you could compile the server kernel with
> DDB+KDB+BREAK_TO_DEBUGGER, break into the serial debugger when it appears
> wedged, and capture the contents of "show lockedvnods", "ps", and "trace
> <pid>" for any processes listed in the "show lockedvnods" output, that
> would be great. A crash dump would also be very helpful. For some hints on
> the information that is needed here, take a look at the Handbook chapter on
> kernel debugging and reporting kernel bugs, and my recent post to current@
> diagnosing a similar bug.
>
> If you re-enable soft updates but leave bgfsck disabled, does that correct
> this stability problem?
>
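[For the record, toggling these is done with tunefs and rc.conf; roughly as
follows — the mount point and device name are hypothetical, and the
filesystem must be unmounted (or mounted read-only) for tunefs to change it:]

```shell
# Re-enable soft updates on a filesystem (sketch; names are hypothetical)
umount /storage
tunefs -n enable /dev/amrd0s1e
mount /storage

# Keep background fsck off so fsck runs to completion in the foreground at boot
echo 'background_fsck="NO"' >> /etc/rc.conf
```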
> In any case, I'm happy to help try to figure out what's going on -- some
> of the above information for stability and performance problems would be
> quite helpful in tracking it down.
>
> Robert N M Watson
More information about the freebsd-performance mailing list