console stops with 9.1-RELEASE when under forwarding load

YongHyeon PYUN pyunyh at gmail.com
Wed Feb 13 01:47:54 UTC 2013


On Tue, Feb 05, 2013 at 09:35:03PM +0100, Marius Strobl wrote:
> On Tue, Feb 05, 2013 at 04:25:53PM +0900, YongHyeon PYUN wrote:
> > On Tue, Feb 05, 2013 at 01:19:56AM -0500, Kurt Lidl wrote:
> > > On Wed, Jan 23, 2013 at 11:30:09PM +0100, Marius Strobl wrote:
> > > > On Mon, Jan 21, 2013 at 11:35:41PM -0500, Kurt Lidl wrote:
> > > > > I'm not sure if this is better directed at freebsd-sparc64@
> > > > > or freebsd-net@ but I'm going to guess here...
> > > > > 
> > > > > Anyways.  In all cases, I'm using an absolutely stock
> > > > > FreeBSD 9.1-release installation.
> > > > > 
> > > > > I got several SunFire V120 machines recently, and have been testing
> > > > > them out to verify their operation.  They all started out identically
> > > > > configured -- 1 GB of memory, 2x36GB disks, DVD-rom, 650MHz processor.
> > > > > The V120 has two on-board "gem" network interfaces.  And the machine
> > > > > can take a single, 32-bit PCI card.
> > > > > 
> > > > > I've benchmarked the gem interfaces being able to source or sink
> > > > > about 90mbit/sec of TCP traffic.  This is comparable to the speed
> > > > > of "hme" interfaces that I've tested in my slower Netra-T1-105
> > > > > machines.
> > > > > 
> > > > > So.  I put an Intel 32-bit gig-e interface (a "GT" desktop
> > > > > Gig-E interface) into the machine, and it comes up like this:
> > > > > 
> > > > > em0: <Intel(R) PRO/1000 Legacy Network Connection 1.0.4> port 0xc00200-0xc0023f mem 0x20000-0x3ffff,0x40000-0x5ffff at device 5.0 on pci2
> > > > > em0: Memory Access and/or Bus Master bits were not set!
> > > > > em0: Ethernet address: 00:1b:21:<redacted>
> > > > > 
> > > > > That interface can source or sink TCP traffic at about
> > > > > 248 mbit/sec.
> > > > > 
> > > > > Since I really want to make one of these machines my firewall/router,
> > > > > I took a different, dual-port Intel Gig-E server adaptor (a 64bit
> > > > > PCI card) and put it into one of the machines so I could look at
> > > > > the forwarding performance.  It probes like this:
> > > > > 
> > > > > em0: <Intel(R) PRO/1000 Legacy Network Connection 1.0.4> port 0xc00200-0xc0023f mem 0x20000-0x3ffff,0x40000-0x7ffff at device 5.0 on pci2
> > > > > em0: Memory Access and/or Bus Master bits were not set!
> > > > > em0: Ethernet address: 00:04:23:<redacted>
> > > > > em1: <Intel(R) PRO/1000 Legacy Network Connection 1.0.4> port 0xc00240-0xc0027f mem 0xc0000-0xdffff,0x100000-0x13ffff at device 5.1 on pci2
> > > > > em1: Memory Access and/or Bus Master bits were not set!
> > > > > em1: Ethernet address: 00:04:23:<redacted>
> > > > > 
> > > > > Now this card can source traffic at about 250 mbit/sec and can sink
> > > > > traffic around 204 mbit/sec.
> > > > > 
> > > > > But the real question is - how is the forwarding performance?
> > > > > 
> > > > > So I setup a test between some machines:
> > > > > 
> > > > > A --tcp data--> em0-sparc64-em1 --tcp data--> B
> > > > > |                                             |
> > > > > \---------<--------tcp acks-------<-----------/
> > > > > 
> > > > > So, A sends to interface em0 on the sparc64, the sparc64
> > > > > forwards out em1 to host B, and the ack traffic flows out
> > > > > a different interface from B to A.  (A and B are amd64
> > > > > machines, with Gig-E interfaces that are considerably
> > > > > faster than the sparc64 machines.)
> > > > > 
> > > > > This test works surprisingly well -- 270 mbit/sec of forwarding
> > > > > traffic, at around 29500 packets/second.
> > > > > 
> > > > > The problem is when I change the test to send the tcp ack traffic
> > > > > back through the sparc64 (so, ack traffic goes from B into em1,
> > > > > then forwarded out em0 to A), while doing the data in the same way.
> > > > > 
> > > > > The console of the sparc64 becomes completely unresponsive during
> > > > > the running of this test.  The 'netstat 1' that I've been running just
> > > > > stops.  When the data finishes transmitting, the netstat output
> > > > > gives one giant jump, counting all the packets that were sent during
> > > > > the test as if they happened in a single second.
> > > > > 
> > > > > It's pretty clear that the process I'm running on the console isn't
> > > > > receiving any cycles at all.  This is true for whatever I have
> > > > > running on the console of the machine -- a shell, vmstat, iostat,
> > > > > whatever.  It just hangs until the forwarding test is over.
> > > > > Then the console input/output resumes normally.
> > > > > 
> > > > > Has anybody else seen this type of problem?
> > > > > 
> > > > 
> > > > I don't see what could be a sparc64-specific problem in this case.
> > > > You are certainly pushing the hardware beyond its limits though and
> > > > it would be interesting to know how a similarly "powerful" i386
> > > > machine behaves in this case.
> > > > In any case, in order to not burn any CPU cycles needlessly, you
> > > > should use a kernel built from a config stripped down to your
> > > > requirements and with options SMP removed to get the maximum out
> > > > of a UP machine. It could also be that SCHED_ULE actually helps
> > > > in this case (there's a bug in 9.1-RELEASE causing problems with
> > > > SCHED_ULE and SMP on sparc64, but for UP it should be fine).
> > > 
> > > I updated the kernel tree on one of my sparc64 machines to the
> > > latest version of 9-STABLE, and gave the following combinations a
> > > try:
> > > 	SMP+ULE
> > > 	SMP+4BSD
> > > 	non-SMP+ULE
> > > 	non-SMP+4BSD
> > > They all performed about the same, in terms of throughput,
> > > and about the same in terms of user-responsiveness when under load.
> > > None were responsive when forwarding ~214mbit/sec of traffic.
> > > 
> > > I played around a bit with tuning of the rx/tx queue depths for the
> > > em0/em1 devices, but none of that had any perceptible difference in
> > > the level of throughput or responsiveness of the machine.
> > 
> > If my memory serves me right, em(4) requires a considerably fast
> > machine to offset the overhead of taskqueue(9). Because the
> > taskqueue handler is enqueued again and again under heavy RX
> > network load, most system cycles end up being consumed in the
> > taskqueue handler.
> > Try polling(4) and see whether it makes any difference. I'm not
> > sure whether polling(4) works on sparc64 though.
> > 
> 
> This might or might not work, or may even cause ill effects. In general,
> Sun PCI bridges synchronize DMA on interrupts and polling(4) bypasses
> that mechanism. For the host-PCI-bridges found in v210, psycho(4)
> additionally synchronizes DMA manually when bus_dmamap_sync(9) is called
> with BUS_DMASYNC_POSTREAD (as suggested in the datasheet). I'm not sure
> whether this is also sufficient for polling(4). In any case, sun4u
> hardware certainly wasn't built with something like polling(4) in mind.
> Hrm, according to my reading of the lem(4) source, it shouldn't use
> taskqueue(9) when setting the loader tunable hw.em.use_legacy_irq to
> 1 for the MACs in question. In any case, the latter certainly is easier
> to test than rebuilding a kernel with polling(4) support.
> 

Right. If the driver is lem(4), using use_legacy_irq would be a
better way to eliminate the taskqueue(9) overhead on slow boxes.
You may also want to tune several interrupt delay tunables.
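For reference, a minimal /boot/loader.conf sketch of the tunables discussed
above. The knob names match the lem(4)/em(4) loader tunables; the delay
values are purely illustrative starting points, not recommendations:

```shell
# /boot/loader.conf -- illustrative values only; tune for your workload.

# Have lem(4) use a plain legacy interrupt handler instead of the
# taskqueue(9)-based path (only honoured for the older MACs that
# lem(4) drives).
hw.em.use_legacy_irq=1

# Interrupt moderation: delay RX/TX interrupt delivery so that more
# packets are processed per interrupt on a slow CPU.
hw.em.rx_int_delay=100
hw.em.tx_int_delay=100
hw.em.rx_abs_int_delay=1000
hw.em.tx_abs_int_delay=1000
```

These are read at boot, so a reboot (or setting them from the loader
prompt) is needed for them to take effect.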

> Marius
> 
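For anyone who still wants to try the polling(4) route despite the
DMA-synchronization caveat above, the usual 9.x recipe is roughly the
following. It needs a custom kernel, is untested on sparc64, and the
user_frac value is just an example:

```shell
# Kernel config additions (then rebuild and install the kernel):
#   options DEVICE_POLLING
#   options HZ=1000
#
# At runtime, enable polling per interface:
ifconfig em0 polling
ifconfig em1 polling

# Optionally reserve a share of CPU time for userland so the console
# stays responsive under forwarding load:
sysctl kern.polling.user_frac=20

# Verify: the interface flags should now include POLLING.
ifconfig em0
```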


More information about the freebsd-sparc64 mailing list