panic: m_copym, length > size of mbuf chain
Don Lewis
truckman at FreeBSD.org
Sat Jul 10 16:24:57 PDT 2004
On 10 Jul, Robert Watson wrote:
>
> On Sat, 10 Jul 2004, Daniel Lang wrote:
>
>> So I come back to the issue. As I already wrote, I guess I can rule out
>> hardware problems now. I did a very thorough test with the Dell
>> diagnosis utilities which showed no problems.
>
> Thanks!
>
>> Also, after John's patch I did not see any WITNESS related problems (so
>> far) again. But I had the m_copy panic again (see subject). This time I
>> did file a PR and did some more detailed gdb analysis. It is all
>> documented at:
>>
>> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/68889
>>
>> I am puzzled, because the stack frame on entering m_copym has 0x0 as
>> first argument (m), however in the previous frame when m_copy() is
>> called, the struct mbuf* argument is valid.
>>
>> Ok, I just realized that there is a difference m_copy() and m_copym()
>> are apparently different functions. Is this a makro/#define discrepancy
>> it seems that that m_copym() is the function which is called in this
>> line of code.
>>
>> Ah, I found it:
>>
>> sys/mbuf.h:#define m_copy(m, o, l) m_copym((m), (o), (l), M_DONTWAIT)
>>
>> so, the puzzle remains, since the arguments passed are kept, except that
>> M_DONTWAIT flag is added.
>>
>> Is this a trashed stack?
>
> Possibly, but notice that the m_copym() function modifies its copy of 'm'
> in the stack as part of its work -- it uses 'm' to iterate the mbuf chain
> passed in in order to move to the necessary starting offset for the copy.
> Note that the requested offset ('off0') is 737, and the requested 'len' is
> at least 1222, so the loop starting at line 369 will walk until it either
> gets far enough or the "offset > size" assertion triggers:
>
> while (off > 0) {
> KASSERT(m != NULL, ("m_copym, offset > size of mbuf chain"));
> if (off < m->m_len)
> break;
> off -= m->m_len;
> m = m->m_next;
> }
>
> Since that assertion didn't trigger, we can assume m_copym() successfully
> walked at least 'off0' (737) bytes. The problem appears to be that it ran
> out of mbufs in which to find data to copy, as it hit the end of the chain
> (m == NULL):
>
> while (len > 0) {
> if (m == NULL) {
> KASSERT(len == M_COPYALL,
> ("m_copym, length > size of mbuf chain"));
> break;
> }
>
> So the initial conclusion is that the caller requested that more data be
> copied from the chain than is actually present in the chain. This
> suggests a bug in socket buffer management or the TCP code. It's
> interesting to note that the socket buffer believes it contains less than
> the requested length -- 'so_snd.sb_mbcnt' is 1536, which is arguably less
> than 737 + 1222 (although we don't know, I think, if it's iterated or not
> and therefore decreased the value of 'len'). Could you print the value of
> 'top' in the m_copym() stack? That will tell us if it's on the first mbuf
> or not.
>
> It sounds like the socket buffer state may be inconsistent with the TCP
> PCB state, or that the expectations in tcp_offset() are wrong. I've CC'd
> Paul because he's had his hands in the new SACK code that was merged, and
> it has its hands in that bit of the output code. Here are some things you
> might want to try:
>
> (1) Try running with TCP SACK disabled. Set the
> 'net.inet.tcp.sack.enable' sysctl to 0 to try this.
>
> (2) Try adding some assertions just before the copy to m_copy() in
> tcp_output(). I'd suggest something like the following:
I'm very suspicious of the SACK code. In the non-SACK case, len gets
set here:
if (!sack_rxmit)
len = ((long)ulmin(so->so_snd.sb_cc, sendwin) - off);
but when the system panics len+off > sb_cc.
It would be interesting to look at *tp and *p in the tcp_output stack
frame.
If I had to guess, I'd say that either tp->snd_recover-tp->snd_una or
p->end-tp->snd_una is greater than so->so_snd.sb_cc.
More information about the freebsd-current
mailing list