sshd crash

Sun Nov 3 16:57:21 UTC 2013

On Nov 2, 2013, at 8:39 AM, Diane Bruce <db at db.net> wrote:
> On Sat, Nov 02, 2013 at 07:33:40AM -0600, Ian Lepore wrote:
>> 
>> I'm not sure it's a mundane stray-write either.  The routine that's
>> asserting is checking to see if the contents of a page are all-zero
>> because a jemalloc internal flag is set that says it should be.  I had
>> the routine print the non-zero data it found, and it looks like this:
>> 
>> not-zero at 0 0x20c99000 = 0x20800a00
>> not-zero at 1 0x20c99004 = 0x00000001
>> not-zero at 2 0x20c99008 = 0x0000002f
>> not-zero at 3 0x20c9900c = 0xffffffff
>> not-zero at 4 0x20c99010 = 0x00007fff
>> not-zero at 5 0x20c99014 = 0x00000003
>> not-zero at 96 0x20c99180 = 0x5a5a5a5a
>> not-zero at 97 0x20c99184 = 0x5a5a5a5a
>> not-zero at 98 0x20c99188 = 0x5a5a5a5a
>> 
>> The 0x5a continues to the end of the page.  So jemalloc has metadata
>> that says it thinks the page is all-zeroes, and the page is a mix of
>> data and some zeroes and the 5a junk-fill byte.  It seems more like the
>> metadata is in error somehow.  (Maybe a stray write hit the metadata.)

This looks to me like the sort of thing that would happen if the chunk page map were corrupted.  This could happen due to a double free, freeing an interior pointer of a multi-page allocation, or a variety of more complicated errors.  The page contains 0x5a junk-fill bytes, yet jemalloc thinks the page should contain only 0x00 bytes, which implies that the chunk page map claims this is the first use of the page since it was mapped.
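
For anyone who wants to reproduce the diagnostic output quoted above, here is a minimal sketch in C of that kind of check.  It is not the jemalloc code: the function names, the 4096-byte page size, and the use of 32-bit words are all assumptions made for illustration.  The 0x5a5a5a5a pattern is the junk-fill word seen in the quoted output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed page size: 4096 bytes viewed as 32-bit words. */
#define PAGE_WORDS 1024
/* Junk-fill word as seen in the quoted output (byte 0x5a repeated). */
#define JUNK_WORD 0x5a5a5a5aU

/*
 * Hypothetical helper: print every non-zero 32-bit word in a page that
 * the allocator metadata claims is zeroed.  Returns the number of
 * non-zero words found.
 */
static size_t
report_nonzero_words(const uint32_t *page, size_t nwords)
{
	size_t i, nonzero = 0;

	assert(page != NULL);
	for (i = 0; i < nwords; i++) {
		if (page[i] != 0) {
			printf("not-zero at %zu %p = 0x%08x\n", i,
			    (const void *)&page[i], (unsigned)page[i]);
			nonzero++;
		}
	}
	return (nonzero);
}

/*
 * Hypothetical helper: count words that hold the junk-fill pattern.
 * Many junk words suggest the page was allocated and freed before,
 * so the "zeroed" flag is what is wrong, not the page contents.
 */
static size_t
count_junk_words(const uint32_t *page, size_t nwords)
{
	size_t i, junk = 0;

	assert(page != NULL);
	for (i = 0; i < nwords; i++) {
		if (page[i] == JUNK_WORD)
			junk++;
	}
	return (junk);
}
```

If most of the non-zero words match the junk pattern, as they do in the output above, that points at the metadata rather than at a stray write into the page itself.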

Does this problem reproduce on amd64?  If so, I'll dig in and figure out if jemalloc is to blame.  If not on amd64, given enough hand holding re: hardware acquisition and configuration I can probably be convinced to set up an ARM system.

Thanks,
Jason