CPU affinity with ULE scheduler
John Baldwin
jhb at freebsd.org
Mon Nov 17 13:13:50 PST 2008
On Monday 17 November 2008 06:11:00 am Archimedes Gaviola wrote:
> On Fri, Nov 14, 2008 at 12:28 AM, John Baldwin <jhb at freebsd.org> wrote:
> > On Thursday 13 November 2008 06:55:01 am Archimedes Gaviola wrote:
> >> On Wed, Nov 12, 2008 at 1:16 AM, John Baldwin <jhb at freebsd.org> wrote:
> >> > On Monday 10 November 2008 11:32:55 pm Archimedes Gaviola wrote:
> >> >> On Tue, Nov 11, 2008 at 6:33 AM, John Baldwin <jhb at freebsd.org> wrote:
> >> >> > On Monday 10 November 2008 03:33:23 am Archimedes Gaviola wrote:
> >> >> >> To Whom It May Concern:
> >> >> >>
> >> >> >> Can someone explain or share how the ULE scheduler (the latest
> >> >> >> version 2, if I'm not mistaken) deals with CPU affinity? Are there
> >> >> >> any existing benchmarks on this for FreeBSD? I am currently using
> >> >> >> the 4BSD scheduler, and what I have observed, especially when
> >> >> >> processing high network load across multiple CPU cores, is that
> >> >> >> only one CPU was being stressed with network interrupts while the
> >> >> >> rest were mostly idle. This is an AMD64 (4x) dual-core IBM system
> >> >> >> with GigE Broadcom network interface cards (bce0 and bce1). Below
> >> >> >> is a snapshot of the case.
> >> >> >
> >> >> > Interrupts are routed to a single CPU. Since bce0 and bce1 are both
> >> >> > on the same interrupt (irq 23), the CPU that interrupt is routed to
> >> >> > is going to end up handling all the interrupts for both bce0 and
> >> >> > bce1. This is not something ULE or 4BSD has any control over.
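> >> >> >
> >> >> > To illustrate (a rough sketch, not the actual FreeBSD code; all
> >> >> > names here are made up): the single ithread bound to the shared
> >> >> > line walks every handler registered on it, so bce0's and bce1's
> >> >> > work is serialized on that one CPU.
> >> >> >
> >> >> > /* Sketch: one ithread services every handler on a shared line. */
> >> >> > struct ih {
> >> >> >     void (*func)(void *);   /* e.g. bce0's or bce1's handler */
> >> >> >     void *arg;
> >> >> >     struct ih *next;
> >> >> > };
> >> >> >
> >> >> > void wait_for_interrupt(void);  /* placeholder */
> >> >> >
> >> >> > void
> >> >> > ithread_loop_sketch(struct ih *handlers)
> >> >> > {
> >> >> >     for (;;) {
> >> >> >         wait_for_interrupt();   /* sleep until irq 23 fires */
> >> >> >         for (struct ih *h = handlers; h != NULL; h = h->next)
> >> >> >             h->func(h->arg);    /* bce0 and bce1 both run here */
> >> >> >     }
> >> >> > }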
> >> >> >
> >> >> > --
> >> >> > John Baldwin
> >> >> >
> >> >>
> >> >> Hi John,
> >> >>
> >> >> I'm sorry for the wrong snapshot. Here's the right one showing my
> >> >> concern.
> >> >>
> >> >>   PID USERNAME THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> >> >>    17 root       1 171   52     0K    16K CPU0   0  54:28 95.17% idle: cpu0
> >> >>    15 root       1 171   52     0K    16K CPU2   2  55:55 93.65% idle: cpu2
> >> >>    14 root       1 171   52     0K    16K CPU3   3  58:53 93.55% idle: cpu3
> >> >>    13 root       1 171   52     0K    16K RUN    4  59:14 82.47% idle: cpu4
> >> >>    12 root       1 171   52     0K    16K RUN    5  55:42 82.23% idle: cpu5
> >> >>    16 root       1 171   52     0K    16K CPU1   1  58:13 77.78% idle: cpu1
> >> >>    11 root       1 171   52     0K    16K CPU6   6  54:08 76.17% idle: cpu6
> >> >>    36 root       1 -68 -187     0K    16K WAIT   7   8:50 65.53% irq23: bce0 bce1
> >> >>    10 root       1 171   52     0K    16K CPU7   7  48:19 29.79% idle: cpu7
> >> >>    43 root       1 171   52     0K    16K pgzero 2   0:35  1.51% pagezero
> >> >>  1372 root      10  20    0 16716K  5764K kserel 6  58:42  0.00% kmd
> >> >>  4488 root       1  96    0 30676K  4236K select 2   1:51  0.00% sshd
> >> >>    18 root       1 -32 -151     0K    16K WAIT   0   1:14  0.00% swi4: clock s
> >> >>    20 root       1 -44 -163     0K    16K WAIT   0   0:30  0.00% swi1: net
> >> >>   218 root       1  96    0  3852K  1376K select 0   0:23  0.00% syslogd
> >> >>  2171 root       1  96    0 30676K  4224K select 6   0:19  0.00% sshd
> >> >>
> >> >> Actually, I was doing network performance testing on this system
> >> >> with FreeBSD-6.2 RELEASE using its default 4BSD scheduler. I used a
> >> >> tool to generate a large amount of traffic, around 600-700 Mbps,
> >> >> traversing the FreeBSD system in both directions, meaning both
> >> >> network interfaces were receiving traffic. What happened was that
> >> >> the CPU (cpu7) handling irq 23 for both interfaces reached high
> >> >> utilization, around 65.53%, which affected other running
> >> >> applications and services like sshd and httpd. The system is no
> >> >> longer accessible while it is being bombarded with traffic. With
> >> >> only one CPU being stressed in this situation, I was thinking of
> >> >> moving to FreeBSD-7.0 RELEASE with the ULE scheduler, because I
> >> >> thought my problem had something to do with how the scheduler
> >> >> distributes load across multiple CPU cores, especially when
> >> >> processing network load. So, if this is more a matter of interrupt
> >> >> handling than of the scheduler, is there a way to optimize it? If
> >> >> interrupts are still routed to only one CPU, then to me it is still
> >> >> inefficient. Who handles interrupt scheduling and CPU binding, so
> >> >> that shared IRQs can be avoided? Are there any improvements in
> >> >> FreeBSD-7.0 with regard to interrupt handling?
> >> >
> >> > It depends. In all likelihood, the interrupts from bce0 and bce1 are
> >> > both hardwired to the same interrupt pin, so they will always share
> >> > the same ithread when using legacy INTx interrupts. However, bce(4)
> >> > parts do support MSI, and if you try a newer OS snap (6.3 or later)
> >> > these devices should use MSI, in which case each NIC would be
> >> > assigned to a separate CPU. I would suggest trying 7.0 or a 7.1
> >> > release candidate and seeing if it does better.
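> >> >
> >> > As a rough illustration (simplified, not the actual bce(4) attach
> >> > code; the 'sc->irq_rid' / 'sc->irq_res' fields are made-up names), a
> >> > driver asks for MSI and falls back to legacy INTx roughly like this:
> >> >
> >> > int count = 1;
> >> >
> >> > /* MSI vectors are allocated at rid 1 and up; the legacy INTx
> >> >  * resource is rid 0 and has to be shareable. */
> >> > if (pci_alloc_msi(dev, &count) == 0)
> >> >     sc->irq_rid = 1;    /* got a private MSI vector */
> >> > else
> >> >     sc->irq_rid = 0;    /* fall back to shared INTx */
> >> >
> >> > sc->irq_res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &sc->irq_rid,
> >> >     RF_ACTIVE | (sc->irq_rid == 0 ? RF_SHAREABLE : 0));
> >> >
> >> > On FreeBSD, MSI interrupts show up with vector numbers of 256 and
> >> > above (e.g. irq256), which is an easy way to confirm MSI is in use.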
> >> >
> >> > --
> >> > John Baldwin
> >> >
> >>
> >> Hi John,
> >>
> >> I tried the 7.0 release, and each network interface interrupt is now
> >> assigned to a separate CPU. MSI is already working here.
> >>
> >>   PID USERNAME THR PRI NICE   SIZE    RES STATE  C   TIME    WCPU COMMAND
> >>    12 root       1 171 ki31     0K    16K CPU6   6 123:55 100.00% idle: cpu6
> >>    15 root       1 171 ki31     0K    16K CPU3   3 123:54 100.00% idle: cpu3
> >>    14 root       1 171 ki31     0K    16K CPU4   4 123:26 100.00% idle: cpu4
> >>    16 root       1 171 ki31     0K    16K CPU2   2 123:15 100.00% idle: cpu2
> >>    17 root       1 171 ki31     0K    16K CPU1   1 123:15 100.00% idle: cpu1
> >>    37 root       1 -68    -     0K    16K CPU7   7   9:09 100.00% irq256: bce0
> >>    13 root       1 171 ki31     0K    16K CPU5   5 123:49  99.07% idle: cpu5
> >>    40 root       1 -68    -     0K    16K WAIT   0   4:40  51.17% irq257: bce1
> >>    18 root       1 171 ki31     0K    16K RUN    0 117:48  49.37% idle: cpu0
> >>    11 root       1 171 ki31     0K    16K RUN    7 115:25   0.00% idle: cpu7
> >>    19 root       1 -32    -     0K    16K WAIT   0   0:39   0.00% swi4: clock s
> >> 14367 root       1  44    0  5176K  3104K select 2   0:01   0.00% dhcpd
> >>    22 root       1 -16    -     0K    16K -      3   0:01   0.00% yarrow
> >>    25 root       1 -24    -     0K    16K WAIT   0   0:00   0.00% swi6: Giant t
> >> 11658 root       1  44    0 32936K  4540K select 1   0:00   0.00% sshd
> >> 14224 root       1  44    0 32936K  4540K select 5   0:00   0.00% sshd
> >>    41 root       1 -60    -     0K    16K WAIT   0   0:00   0.00% irq1: atkbd0
> >>     4 root       1  -8    -     0K    16K -      2   0:00   0.00% g_down
> >>
> >> The bce0 interface interrupt (irq256) is now saturating CPU7 at 100%,
> >> while bce1 (irq257) puts CPU0 at around 51.17%. Any more
> >> recommendations? Is there anything we can do to optimize further with
> >> MSI?
> >
> > Well, on 7.x you can try turning net.isr.direct off (sysctl). However, it
> > seems you are hammering your bce0 interface. You might want to try using
> > polling on bce0 and seeing if it keeps up with the traffic better.
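> >
> > For reference, the knobs I mean would look roughly like this (assuming
> > your kernel and the bce(4) driver support polling; not all drivers do):
> >
> > # move protocol processing out of the ithread into swi1: net
> > sysctl net.isr.direct=0
> >
> > # polling needs "options DEVICE_POLLING" (and typically "options HZ=1000")
> > # compiled into the kernel, then it is enabled per interface:
> > ifconfig bce0 polling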
> >
> > --
> > John Baldwin
> >
>
> With net.isr.direct=0, my IBM system shows lower CPU utilization per
> interface (bce0 and bce1), but swi1: net increases its utilization.
> Can you explain what is happening here? What does net.isr.direct do
> that decreases the CPU utilization of the interfaces? I would really
> like to know what happens internally from the packets being received
> by the interfaces, to the device interrupt, up to the software
> interrupt level, because I am confused by enabling/disabling
> net.isr.direct in sysctl. Is there a tool we can use to trace this
> path, just to see which part of the kernel internals is the
> bottleneck, especially when net.isr.direct=1? By the way, with device
> polling enabled the system experienced packet errors and the interface
> throughput was worse, so I have avoided using it.
>
>   PID USERNAME THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
>
>    16 root       1 171 ki31     0K    16K CPU10  a  86:06 89.06% idle: cpu10
>    27 root       1 -44    -     0K    16K CPU1   1  34:37 82.67% swi1: net
>    52 root       1 -68    -     0K    16K WAIT   b  51:59 59.77% irq32: bce1
>    15 root       1 171 ki31     0K    16K RUN    b  69:28 43.16% idle: cpu11
>    25 root       1 171 ki31     0K    16K RUN    1 115:35 24.27% idle: cpu1
>    51 root       1 -68    -     0K    16K CPU10  a  35:21 13.48% irq31: bce0
With net.isr.direct=1, the ithread tries to pass the received packets up
to IP/UDP/TCP/socket directly. With net.isr.direct=0, the ithread places
received packets on a queue and sends a signal to 'swi1: net'. The swi
thread wakes up, pulls the packets off of the queue, and sends them to
IP/UDP/TCP/socket.
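
In other words (a minimal C sketch of the decision, not the literal
netisr code; the *_sketch names are placeholders):

struct mbuf;                            /* opaque packet buffer */
extern int net_isr_direct;              /* mirrors the net.isr.direct sysctl */

void ip_input_sketch(struct mbuf *m);   /* stand-in for the protocol path */
void queue_packet_sketch(struct mbuf *m);
void wake_swi_net_sketch(void);

/* Called from the NIC's ithread for each received packet. */
void
netisr_dispatch_sketch(struct mbuf *m)
{
    if (net_isr_direct) {
        /* net.isr.direct=1: the protocol path (IP/UDP/TCP up to the
         * socket) runs right here, in the ithread, on the CPU that
         * took the interrupt. */
        ip_input_sketch(m);
    } else {
        /* net.isr.direct=0: just queue the packet and wake 'swi1: net';
         * the protocol work is then billed to the swi thread, which is
         * why its CPU usage rises while the per-interface ithreads
         * drop. */
        queue_packet_sketch(m);
        wake_swi_net_sketch();
    }
}
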
--
John Baldwin