CPU affinity with ULE scheduler

John Baldwin jhb at freebsd.org
Thu Nov 13 11:46:33 PST 2008


On Thursday 13 November 2008 06:55:01 am Archimedes Gaviola wrote:
> On Wed, Nov 12, 2008 at 1:16 AM, John Baldwin <jhb at freebsd.org> wrote:
> > On Monday 10 November 2008 11:32:55 pm Archimedes Gaviola wrote:
> >> On Tue, Nov 11, 2008 at 6:33 AM, John Baldwin <jhb at freebsd.org> wrote:
> >> > On Monday 10 November 2008 03:33:23 am Archimedes Gaviola wrote:
> >> >> To Whom It May Concern:
> >> >>
> >> >> Can someone explain or share information about the ULE scheduler (the
> >> >> latest version 2, if I'm not mistaken) and how it deals with CPU
> >> >> affinity? Are there any existing benchmarks on this with FreeBSD? I am
> >> >> currently using the 4BSD scheduler, and what I have observed,
> >> >> especially when processing high network traffic load on multiple CPU
> >> >> cores, is that only one CPU gets stressed with network interrupts while
> >> >> the rest remain mostly idle. This is an AMD-64 (4x) dual-core IBM
> >> >> system with GigE Broadcom network interface cards (bce0 and bce1).
> >> >> Below is a snapshot of the case.
> >> >
> >> > Interrupts are routed to a single CPU.  Since bce0 and bce1 are both on
> >> > the same interrupt (irq 23), the CPU that interrupt is routed to is going
> >> > to end up handling all the interrupts for bce0 and bce1.  This is not
> >> > something ULE or 4BSD have any control over.
> >> >
> >> > --
> >> > John Baldwin
> >> >
> >>
> >> Hi John,
> >>
> >> I'm sorry for the wrong snapshot. Here's the right one showing my concern.
> >>
> >>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> >>    17 root        1 171   52     0K    16K CPU0   0  54:28 95.17% idle: cpu0
> >>    15 root        1 171   52     0K    16K CPU2   2  55:55 93.65% idle: cpu2
> >>    14 root        1 171   52     0K    16K CPU3   3  58:53 93.55% idle: cpu3
> >>    13 root        1 171   52     0K    16K RUN    4  59:14 82.47% idle: cpu4
> >>    12 root        1 171   52     0K    16K RUN    5  55:42 82.23% idle: cpu5
> >>    16 root        1 171   52     0K    16K CPU1   1  58:13 77.78% idle: cpu1
> >>    11 root        1 171   52     0K    16K CPU6   6  54:08 76.17% idle: cpu6
> >>    36 root        1 -68 -187     0K    16K WAIT   7   8:50 65.53% irq23: bce0 bce1
> >>    10 root        1 171   52     0K    16K CPU7   7  48:19 29.79% idle: cpu7
> >>    43 root        1 171   52     0K    16K pgzero 2   0:35  1.51% pagezero
> >>  1372 root       10  20    0 16716K  5764K kserel 6  58:42  0.00% kmd
> >>  4488 root        1  96    0 30676K  4236K select 2   1:51  0.00% sshd
> >>    18 root        1 -32 -151     0K    16K WAIT   0   1:14  0.00% swi4: clock s
> >>    20 root        1 -44 -163     0K    16K WAIT   0   0:30  0.00% swi1: net
> >>   218 root        1  96    0  3852K  1376K select 0   0:23  0.00% syslogd
> >>  2171 root        1  96    0 30676K  4224K select 6   0:19  0.00% sshd
> >>
> >> Actually, I was doing network performance testing on this system with
> >> FreeBSD-6.2 RELEASE using its default 4BSD scheduler, and I used a tool
> >> to generate a large amount of traffic, around 600-700 Mbps, traversing
> >> the FreeBSD system in both directions, meaning both network interfaces
> >> were receiving traffic. What happened was that the CPU (cpu7) handling
> >> irq 23 for both interfaces reached around 65.53% utilization, which
> >> affected other running applications and services like sshd and httpd;
> >> the system is no longer accessible while it is being bombarded with
> >> traffic. With only one CPU being stressed in this situation, I was
> >> thinking of moving to FreeBSD-7.0 RELEASE with the ULE scheduler,
> >> because I thought my concern had something to do with how the scheduler
> >> distributes load across multiple CPU cores, especially when processing
> >> network load. So, if this is more a matter of interrupt handling and
> >> not the scheduler, is there a way we can optimize it? If everything is
> >> still routed to only one CPU, then to me it is still inefficient. What
> >> handles interrupt scheduling and CPU binding in order to prevent shared
> >> IRQs? Are there any improvements in FreeBSD-7.0 with regard to
> >> interrupt handling?
> >
> > It depends.  In all likelihood, the interrupts from bce0 and bce1 are both
> > hardwired to the same interrupt pin and so they will always share the same
> > ithread when using the legacy INTx interrupts.  However, bce(4) parts do
> > support MSI, and if you try a newer OS snap (6.3 or later) these devices
> > should use MSI in which case each NIC would be assigned to a separate CPU.
> > I would suggest trying 7.0 or a 7.1 release candidate and see if it does
> > better.
> >
> > --
> > John Baldwin
> >
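
Once MSI kicks in, the same vmstat -i check should show each NIC on its own
message-signaled vector (irq256 and up) rather than a shared line; the figures
below are again purely illustrative. If you ever want to force the legacy INTx
path for comparison, MSI can be disabled globally with a loader tunable
(assuming a 6.3/7.x kernel, where hw.pci.enable_msi exists):

  # vmstat -i
  interrupt                          total       rate
  irq256: bce0                    98765432       8123
  irq257: bce1                    45678901       3456

  # /boot/loader.conf -- disable MSI globally, falling back to INTx
  hw.pci.enable_msi="0"
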
> 
> Hi John,
> 
> I tried the 7.0 release, and each network interface is now allocated to a
> separate CPU. Here, MSI is already working.
> 
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
>    12 root        1 171 ki31     0K    16K CPU6   6 123:55 100.00% idle: cpu6
>    15 root        1 171 ki31     0K    16K CPU3   3 123:54 100.00% idle: cpu3
>    14 root        1 171 ki31     0K    16K CPU4   4 123:26 100.00% idle: cpu4
>    16 root        1 171 ki31     0K    16K CPU2   2 123:15 100.00% idle: cpu2
>    17 root        1 171 ki31     0K    16K CPU1   1 123:15 100.00% idle: cpu1
>    37 root        1 -68    -     0K    16K CPU7   7   9:09 100.00% irq256: bce0
>    13 root        1 171 ki31     0K    16K CPU5   5 123:49 99.07% idle: cpu5
>    40 root        1 -68    -     0K    16K WAIT   0   4:40 51.17% irq257: bce1
>    18 root        1 171 ki31     0K    16K RUN    0 117:48 49.37% idle: cpu0
>    11 root        1 171 ki31     0K    16K RUN    7 115:25  0.00% idle: cpu7
>    19 root        1 -32    -     0K    16K WAIT   0   0:39  0.00% swi4: clock s
> 14367 root        1  44    0  5176K  3104K select 2   0:01  0.00% dhcpd
>    22 root        1 -16    -     0K    16K -      3   0:01  0.00% yarrow
>    25 root        1 -24    -     0K    16K WAIT   0   0:00  0.00% swi6: Giant t
> 11658 root        1  44    0 32936K  4540K select 1   0:00  0.00% sshd
> 14224 root        1  44    0 32936K  4540K select 5   0:00  0.00% sshd
>    41 root        1 -60    -     0K    16K WAIT   0   0:00  0.00% irq1: atkbd0
>     4 root        1  -8    -     0K    16K -      2   0:00  0.00% g_down
>     4 root        1  -8    -     0K    16K -      2   0:00  0.00% g_down
> 
> The bce0 interface interrupt (irq256) is getting stressed: it already
> consumes 100% of CPU7, while irq257 (bce1) on CPU0 is at around 51.17%.
> Any more recommendations? Is there anything we can do to optimize things
> with MSI?

Well, on 7.x you can try turning net.isr.direct off (sysctl).  However, it 
seems you are hammering your bce0 interface.  You might want to try using 
polling on bce0 and seeing if it keeps up with the traffic better.
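
Concretely, that would look something like the following; treat it as a
sketch, since whether bce(4) supports polling depends on the exact driver
revision in your release:

  # queue inbound packets to the netisr thread instead of processing
  # them directly in the interrupt thread's context
  sysctl net.isr.direct=0

  # polling needs a kernel built with "options DEVICE_POLLING" (and
  # typically "options HZ=1000"); it is then toggled per interface
  ifconfig bce0 polling
  ifconfig bce0 -polling    # turn it back off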

-- 
John Baldwin

