8.0-RELEASE-p3: 4k jumbo mbuf cluster exhaustion

Mon Aug 23 19:16:38 UTC 2010

On Mon, Aug 23, 2010 at 09:04:02PM +0200, Andre Oppermann wrote:
> On 23.08.2010 19:52, Pyun YongHyeon wrote:
> >On Mon, Aug 23, 2010 at 12:18:01PM +0200, Andre Oppermann wrote:
> >>On 23.08.2010 11:26, Adrian Chadd wrote:
> >>>On 23 August 2010 06:27, Pyun YongHyeon<pyunyh at gmail.com>   wrote:
> >>>
> >>>>I recall there was SIOCSIFCAP ioctl handling bug in bce(4) on 8.0 so
> >>>>it might also disable IFCAP_TSO4/IFCAP_TXCSUM/IFCAP_RXCSUM when you
> >>>>disabled RX checksum offloading. But I can't explain how checksum
> >>>>offloading could be related with the growth of 4k jumbo buffers.
> >>>
> >>>Neither can I!
> >>>
> >>>I'm trying to come up with a reproduction method that doesn't involve
> >>>"put box on the internet, push clients through it, wait."
> >>
> >>Network drivers use 2k sized mbuf clusters on receive.  So the problem
> >>doesn't seem to be RX related.
> >>
> >
> >bce(4) is special in this regard. The controller allocates jumbo
> >clusters on RX if jumbo frames are used. If header splitting is
> >used, the driver will use normal mbuf clusters.
> 
> Didn't know that.
> 
> >>The function that is called on a socket write is sosend_generic() which
> >>makes use of m_getm2().  This function allocates mbuf chains with the
> >>tightest packing it can achieve.  It will make use of 4k (page size) mbufs
> >>as much as it can.  This is where they come from.
> >>
> >>It seems the 4k clusters do not get freed back to the pool after they've
> >>been sent by the NIC and dropped from the socket buffer after the ACK has
> >>arrived.  The leak must occur in one of these two places.  The socket
> >>buffer is unlikely as it would affect not just you but everyone else too.
> >>Thus the mbuf freeing after DMA/tx in the bce(4) driver is the prime
> >>suspect.
> >>
> >
> >I know bce(4) has a couple of bugs in the TX path (wrong DMA tag, lack
> >of bus_dmamap_sync(9), etc.) but this is the same code path with/without
> >TX checksum offloading. This is one of the reasons why I still do not
> >understand what's really happening here. TX checksum offloading may
> >introduce additional frame processing time to fill internal FIFO to
> >compute checksum before transmitting the frame to wire such that it
> >can change timing of TX path. This timing change might trigger the
> >TX path bug. It's just vague guessing though.
> 
> Had a chat with Claudio at OpenBSD and he said that the bce(4) DMA engine
> can only access the first 1GB of physical RAM and has to use bounce
> buffers all the time.  Maybe this is related.
> 

Really? I don't remember seeing such a DMA address space limitation
in the data sheet. And I don't think Broadcom would do such a horrible
thing for controllers targeted at servers. The only limitation I
know of is that the BCM5708 is not able to handle DMA addresses wider
than 40 bits, so bce(4) limits the DMA address space in DMA tag creation.
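
To put the two claims side by side, here is a small userland C sketch
(not driver code) comparing a 1GB (30-bit) limit with a 40-bit limit.
The helper names dma_maxaddr() and dma_reachable() are hypothetical and
exist only for this illustration; they are not part of bus_dma(9) or
bce(4). The sketch assumes a buffer needs bouncing whenever any byte
of it lies above the highest address the device can reach.

```c
#include <stdint.h>

/*
 * Hypothetical helper: highest physical address reachable by a device
 * that can drive `bits` address bits.
 */
uint64_t
dma_maxaddr(unsigned bits)
{
	if (bits >= 64)
		return (UINT64_MAX);
	return ((UINT64_C(1) << bits) - 1);
}

/*
 * Hypothetical helper: returns 1 if the buffer [paddr, paddr + len)
 * lies entirely at or below maxaddr, 0 if it would need a bounce
 * buffer under the assumption stated above.
 */
int
dma_reachable(uint64_t paddr, uint64_t len, uint64_t maxaddr)
{
	if (paddr > maxaddr)
		return (0);
	if (len == 0)
		return (1);
	/* Compare without computing paddr + len, which could overflow. */
	return ((len - 1) <= (maxaddr - paddr));
}
```

With these helpers, a 4k cluster at physical address 6GB
(UINT64_C(6) << 30) is reachable under dma_maxaddr(40) but not under
dma_maxaddr(30). So a 1GB limit would force bouncing for most memory
on any box with more than 1GB of RAM, while a 40-bit limit only matters
above 1TB.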