The arm64 fork-then-swap-out-then-swap-in failures: a program source for exploring them

Mark Millard markmi at dsl-only.net
Sun Apr 9 20:25:20 UTC 2017


[I've not tried building the kernel with
your patch yet.]

Top post of new, independent information.

Jordan Gordeev made a testing suggestion that got me to look
at kdumps of runs with jemalloc allocation sizes that fail
(14*1024) vs. work (14*1024+1).

Example comparison:

 2258 swaptesting6 0.000169 CALL  mmap(0,0x200000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
 2258 swaptesting6 0.000047 RET   mmap 1080033280/0x40600000
vs.
 2325 swaptesting7 0.000091 CALL  mmap(0,0x200000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0)
 2325 swaptesting7 0.000024 RET   mmap 1080033280/0x40600000

No difference. And so it goes.

What varies is the number of mmap calls: the larger jemalloc allocation size
gets more mmap calls for the same number of jemalloc allocations. (All the
mmap calls from my program's explicit allocations are together, back to back,
with no other traced activity between them.)

Varying the number of jemalloc allocations in the program varies the number
of mmap calls, yet the size of the individual jemalloc allocations still makes
the difference between failure (zeroed pages after fork-then-swap) and success.

This problem is a complicated one to classify/isolate.

After the allocations there is not much activity visible in the
kdump output. I traced with "-t +" and so avoided page-fault
tracing but captured almost everything else.

I may have to ktrace the page faults for the two jemalloc
allocation sizes and see if anything stands out.

On 2017-Apr-9, at 11:24 AM, Mark Millard <markmi at dsl-only.net> wrote:

> On 2017-Apr-9, at 10:24 AM, Mark Millard <markmi at dsl-only.net> wrote:
> 
>> On 2017-Apr-9, at 5:27 AM, Konstantin Belousov <kostikbel at gmail.com> wrote:
>> 
>>> On Sat, Apr 08, 2017 at 06:02:00PM -0700, Mark Millard wrote:
>>>> [I've identified the code path involved in the arm64 small allocations
>>>> turning into zeros on later fork-then-swap-out-then-back-in,
>>>> specifically the ongoing RES(ident memory) size decrease that
>>>> "top -PCwaopid" shows before the fork/swap sequence. Hopefully
>>>> I've also exposed enough related information for someone that
>>>> knows what they are doing to get started with a specific
>>>> investigation, looking for a fix. I'd like for a pine64+
>>>> 2GB to have buildworld complete despite the forking and
>>>> swapping involved (yep: for a time zero RES(ident memory) for
>>>> some processes involved in the build).]
>>> 
>>> I was not able to follow the walls of text, but I do not think that
>>> pmap_ts_reference() is the real culprit there.
>>> 
>>> Is my impression right that the issue occurs on fork, and looks like
>>> memory corruption, where some page suddenly becomes zero-filled?
>>> And swapping seems to be involved?  It is somewhat interesting to see
>>> if the problem is reproducible on non-arm64 machines, e.g. armv7 or amd64.
>> 
>> Yes, yes, non-arm64 that I've tried works.
>> 
>> But I think that the following extra detail may be of use: what top
>> shows for RES over time is also odd on arm64 (only), and the number
>> of pages that are zeroed is proportional to the decrease in RES.
>> 
>> In the test sequence:
>> 
>> A) Allocate lots of 14 KiByte allocations and initialize the content of each
>> to non-zero. The example ends up with RES of about 265M.
> 
> I did forget to list one important property: why I picked 14 KiBytes.
> 
> A) Any allocation size <= 14 KiBytes that I've tried
>   gets the zeroed-pages problem in my arm64 contexts (bpim3 and rpi3).
> 
> B) Any allocation size >= 14 KiBytes + 1 Byte that I've
>   tried works in those contexts.
> 
> For the arm64 contexts that I use this happens to match with
> the jemalloc SMALL_MAXCLASS size boundary. When I looked it
> appeared that 14 Ki was the smallest SMALL_MAXCLASS value
> in jemalloc so it would always fit the category.
> 
>> B) sleep some amount of time; I've been using well over 30 seconds here.
>> 
>> C) fork
>> 
>> D) sleep again (parent and child), also forcing swapping during the sleep
>>  (I used stress, manually run.)
>> 
>> E) Test the memory pattern in the parent and child process, passing over
>>  all the bytes, failed and good.
>> 
>> Both the parent and the child in (E) see the first pages allocated as zero,
>> with the number of pages being zero increasing as the sleep time in (B)
>> increases (as long as the sleep is over 30 sec or so). The parent and child
>> match for which pages are zero vs. not.
>> 
>> It fails with (B) being a no-op as well. But the proportionality with
>> the time for the sleep is interesting.
>> 
>> During (B) "top -PCwaopid" shows RES decreasing, starting after 30 sec
>> or so. The fork in (C) produces a child that does not have the same RES
>> as the parent but instead a tiny RES (80K as I remember). During (E)
>> the child's RES increases to full size.
>> 
>> My powerpc64, armv7, and amd64 tests of such do not fail, nor does RES
>> decrease during (B). The child process gets the same RES as the parent
>> as well, unlike for arm64.
>> 
>> In the failing context (arm64) RES in the parent decreases during (D)
>> before the swap-out as well.
>> 
>>> If answers to my two questions are yes, there is probably some bug with
>>> arm64 pmap handling of the dirty bit emulation.  ARMv8.0 does not provide
>>> hardware dirty bit, and pmap interprets an accessed writeable page as
>>> unconditionally dirty.  Moreover, the accessed bit is also not maintained
>>> by hardware; instead it should be set by pmap.  And arm64 pmap sets the
>>> AF bit unconditionally when creating valid pte.
>> 
>> fork-then-swap-out/in is required to see the problem. Neither fork
>> by itself nor swapping (zero RES as shown in top) by itself has
>> shown the problem so far.
>> 
>>> Hmm, could you try the following patch?  I have not even compiled it.
>> 
>> I'll try it later today.
>> 
>>> diff --git a/sys/arm64/arm64/pmap.c b/sys/arm64/arm64/pmap.c
>>> index 3d5756ba891..55aa402eb1c 100644
>>> --- a/sys/arm64/arm64/pmap.c
>>> +++ b/sys/arm64/arm64/pmap.c
>>> @@ -2481,6 +2481,11 @@ pmap_protect(pmap_t pmap, vm_offset_t sva, vm_offset_t eva, vm_prot_t prot)
>>> 		    sva += L3_SIZE) {
>>> 			l3 = pmap_load(l3p);
>>> 			if (pmap_l3_valid(l3)) {
>>> +				if ((l3 & ATTR_SW_MANAGED) &&
>>> +				    pmap_page_dirty(l3)) {
>>> +					vm_page_dirty(PHYS_TO_VM_PAGE(l3 &
>>> +					    ~ATTR_MASK));
>>> +				}
>>> 				pmap_set(l3p, ATTR_AP(ATTR_AP_RO));
>>> 				PTE_SYNC(l3p);
>>> 				/* XXX: Use pmap_invalidate_range */

===
Mark Millard
markmi at dsl-only.net



More information about the freebsd-hackers mailing list