5.2-rel NFS lockup and networking performance

Mon Jan 26 10:59:59 PST 2004

On Monday 26 January 2004 19:08, Robert Watson wrote:
*SNIP*
>
> On the console, since it sounds like that's still running, it would be
> very interesting to see the output of:
>
> while (1)
>   vmstat -i
>   sleep 1
> end
>

This gives the following lines. (Note the very constant rate values; they really
were constant (vr0 stayed at 72 for at least 50 loops), no matter whether the
network was hung or not.)
interrupt                          total       rate
irq0: clk                        6144766        999
irq1: atkbd0                           2          0
irq4: sio0                         17539          2
irq7: ppc0                             1          0
irq8: rtc                         786435        127
irq10: vr0                        447325         72
irq11: atapci1                    754942        122
irq13: npx0                            1          0
irq15: ata1                           32          0
Total                            8151043       1326
interrupt                          total       rate
irq0: clk                        6145839        999
irq1: atkbd0                           2          0
irq4: sio0                         17577          2
irq7: ppc0                             1          0
irq8: rtc                         786572        127
irq10: vr0                        447336         72
irq11: atapci1                    755510        122
irq13: npx0                            1          0
irq15: ata1                           32          0
Total                            8152870       1326
interrupt                          total       rate
irq0: clk                        6146915        999
irq1: atkbd0                           2          0
irq4: sio0                         17615          2
irq7: ppc0                             1          0
irq8: rtc                         786710        127
irq10: vr0                        447342         72
irq11: atapci1                    756160        123
irq13: npx0                            1          0
irq15: ata1                           32          0
Total                            8154778^C
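
A note on reading these numbers: the rate column of vmstat -i is an average
since boot (total divided by uptime), so it is expected to stay nearly constant
from one loop to the next. Per-interval activity has to be derived from the
differences between successive totals. Below is a minimal Python sketch of that
calculation, using totals copied from the three samples above. The function
name is made up for illustration, and the samples are assumed to be roughly one
second apart (the clk deltas suggest slightly more).

```python
def per_interval_deltas(totals):
    """Return the differences between successive cumulative counters."""
    return [later - earlier for earlier, later in zip(totals, totals[1:])]


# Cumulative totals copied from the three vmstat -i samples above.
samples = {
    "irq0: clk": [6144766, 6145839, 6146915],
    "irq10: vr0": [447325, 447336, 447342],
    "irq11: atapci1": [754942, 755510, 756160],
}

for name, totals in samples.items():
    # Each delta is the number of interrupts seen between two samples.
    print(name, per_interval_deltas(totals))
```

For vr0 this gives 11 and 6 interrupts per interval, well below the
since-boot average of 72 shown in the rate column.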

> And see what's going on with interrupts.  Also, when it's "hung", could
> you do a stack trace of any nfsd processes lying around?  It's as though

If only I knew how to do that. But as mentioned, I still can't break to the
debugger.

> we're hitting some edge case that causes the server to spin quite hard
> doing some sort of work, it's just not clear what it is.  Speaking of
> spinning, general vmstat -w 1 would be interesting during the hang also to
> see what's going on with CPU.

Here are some lines from vmstat -w 1. (Note the lines with the zeros: that's
when the ssh connection was locked!)

 0 0 0   44260 104876  121   0   0   0  28   0   0   0 1435    0 649  1  2 97
 0 0 0   44260 104876    0   0   0   0   0   0   0   0 1423    0 611  0  2 98
 0 0 0   44260 104876   12   0   0   0   0   0   0   0 1606    0 1040  0  3 97
 0 0 0   44260 199816    0   0   0   0 25020   0   0   0 7863    0 10519  0 59 41
 0 0 0   44260 191416   12   0   0   0   8   0   0   0 12194    0 16161  0 75 25
 0 0 0   44260 181388    0   0   0   0   1   0   0   0 14104    0 17630  0 66 34
 1 0 0   42364 173152    0   0   0   0  93   0   0   0 12191    0 14839  0 58 42
 0 0 0   42364 163248   94   0   0   0 106   0   0   0 13564    0 16778  1 70 29
 0 0 0   42364 153232    0   0   0   0   0   0   0   0 13939    0 16858  0 71 29
 0 0 0   42364 145568   94   0   0   0 114   0   0   0 11151    0 14119  3 54 43
 procs      memory      page                    disks     faults      cpu
 r b w     avm    fre  flt  re  pi  po  fr  sr ad4 ad6   in   sy  cs us sy id
 0 0 0   42364 136160   94   0   0   0 106   0   0   0 13192    0 16682  2 65 32
 0 4 0   42364 129808   94   0   0   0 110   0   0   0 9715    0 11998  2 52 46
 0 4 0   44260 129436  121   0   0   0  28   0   0   0 2655    0 3754  1 88 11
 0 4 0   42364 129808    0   0   0   0  93   0   0   0 2646    0 3760  0 81 19
 0 5 0   43696 129536   93   0   0   0  25   0   0   0 2620    0 3729  2 78 20
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2628    0 3807  0 76 24
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2640    0 3833  0 81 19
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2645    0 3829  0 81 19
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2647    0 3791  0 81 19
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2627    0 3790  0 84 16
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2633    0 3817  0 81 19
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2657    0 3843  0 81 19
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2637    0 3809  0 82 18
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2630    0 3790  0 82 18
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2644    0 3797  0 78 22
 0 5 0   43696 129536    2   0   0   0   0   0   0   0 2631    0 3813  0 84 16
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2636    0 3805  0 82 18
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2623    0 3772  0 81 19
 procs      memory      page                    disks     faults      cpu
 r b w     avm    fre  flt  re  pi  po  fr  sr ad4 ad6   in   sy  cs us sy id
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2634    0 3816  0 82 18
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2650    0 3836  0 84 16
 0 5 0   43696 129536    0   0   0   0   0   0   0   0 2628    0 3794  0 84 16

>
> Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> robert at fledge.watson.org      Senior Research Scientist, McAfee Research
>
> > Thanks,
> >
> > -Harry
> >
> > > > Breaking to debugger is working on the console.  (Which crashes my
> > > > /home each time)  Is there a possibility to shutdown the machine
> > > > "clean" after the ddb? Like I mentioned before, this is my production
> > > > Fileserver :(
> > >
> > > Normally, assuming your machine isn't already hung, you can type in
> > > "cont" to continue, which should allow the system to continue normally.
> > >  If the system was hung/generally broken when you entered DDB, you can
> > > try "call boot(0)" to see if you can get it to cleanly sync to disk,
> > > but whether that succeeds depends a lot on how hung/broken the kernel
> > > was already.
> > >
> > > Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> > > robert at fledge.watson.org      Senior Research Scientist, McAfee
> > > Research