The arm64 fork-then-swap-out-then-swap-in failures: a program source for exploring them

Mark Millard markmi at dsl-only.net
Fri Apr 7 08:43:33 UTC 2017


[I now can: (A) crudely control the number of allocated
pages that get zeroed (and should not be). (B) Watch a
"top -PCwaopid" display and predict whether the test
will fail before the fork() or swap-out happens.]

On 2017-Apr-4, at 8:00 PM, Mark Millard <markmi at dsl-only.net> wrote:

> Uncommenting/commenting parts of the program below allows
> exploring the problems with fork-then-swap-out-then-swap-in
> on arm64.
> 
> Note: By swap-out I mean that zero RES(ident memory) results,
>       for the process(es) of interest, as shown by
>       "top -PCwaopid".
> 
> I discovered recently that swapping-out just before the
> fork() prevents the failure from the swapping after the
> fork().
> 
> Note:
> Without the fork() no problem happens. Without the later
> swap-out no problem happens. Both are required. But some
> activities before the fork() or between fork() and the
> swap-out prevent the failures.
> 
> Some of the comments are based on a pine64+ 2GB context.
> I use stress to force swap-outs during some sleeps in
> the program. See also Bugzilla 217239 and 217138. (I now
> expect that they have the same cause.)
> 
> In my environment I've seen the fork-then-swap-out/swap-in
> failures on a pine64+ 2GB and a rpi3. They are repeatable
> on both. I do not have access to server-class machines, or
> any other arm64 machines.
> 
> 
> // swap_testing5.c
> 
> // Built via (cc was clang 4.0 in my case):
> //
> // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing5.c
> // -O0 and -O2 also get the problem.
> 
> // Note: jemalloc's tcache needs to be enabled to get the failure.
> //       But FreeBSD can get into a state where /etc/malloc.conf
> //       -> 'tcache:false' is ineffective. Also: the allocation
> //       size needs to be sufficiently small (<= SMALL_MAXCLASS)
> //       to see the problem. Other comments are based on a specific
> //       context (pine64+ 2GB). [A mallctl-based tcache check is
> //       sketched after the program.]
> 
> #include <signal.h>     // for raise(.), SIGABRT (induce core dump)
> #include <unistd.h>     // for fork(), sleep(.)
> #include <sys/types.h>  // for pid_t
> #include <sys/wait.h>   // for wait(.)
> 
> extern void test_setup(void);         // Sets up the memory byte patterns.
> extern void test_check(void);         // Tests the memory byte patterns.
> extern void memory_willneed(void); // For seeing if
>                                   // posix_madvise(.,.,POSIX_MADV_WILLNEED)
>                                   // makes a difference.
> 
> int main(void) {
>    sleep(30); // Potentially force swap-out here.
>               // [Swap-out here does not avoid later failures.]
> 
>    test_setup();
>    test_check(); // Before potential sleep(.)/swap-out or fork(.) [passes]
> 
>    sleep(30); // Potentially force swap-out here.
>               // [Everything below passes if swapped-out here,
>               //  no matter if there are later swap-outs
>               //  or not.]
> 
>    pid_t pid = fork(); // To test no-fork use: = 0; no-fork does not fail.
>    int wait_status = 0;
> 
>    // HERE: After fork; before sleep/swap-out/wait.
> 
>    // if (0 <  pid) memory_willneed(); // Does not prevent either parent or
>                                     // child failure if enabled.
> 
>    // if (0 == pid) memory_willneed(); // Prevents both the parent and the
>                                     // child failure. Disable to see
>                                     // failure of both parent and child.
>                                     // [Presuming no prior swap-out: that
>                                     // would make everything pass.]
> 
>    // During sleep/wait: manually force this process to
>    // swap out. I use something like:
>    //     stress -m 1 --vm-bytes 1800M
>    // in another shell and ^C'ing it after top shows the
>    // swapped status desired. 1800M just happened to work
>    // on the Pine64+ 2GB that I was using. I watch with
>    // top -PCwaopid [checking for zero RES(ident memory)].
> 
>    if (0 < pid) {
>        sleep(30);    // Intend to swap-out during sleep.
>        // test_check(); // Test in parent before child runs (longer sleep).
>                      // This test fails if run for a failing region_size
>                      // unless earlier preventing-activity happened.
>        wait(&wait_status); // Only if test_check above passes or is
>                            // disabled above.
>    }
>    if (-1 != wait_status && 0 <= pid) {
>        if (0 == pid) { sleep(90); } // Intend to swap-out during sleep.
>        test_check(); // Fails for small-enough region_size, both
>                      // parent and child processes, unless earlier
>                      // preventing-activity happened.
>    }
> }
> 
> // The memory and test code follows.
> 
> #include <stddef.h>     // for size_t, NULL
> #include <stdlib.h>     // for malloc(.), free(.)
> #include <sys/mman.h>   // for POSIX_MADV_WILLNEED, posix_madvise(.,.,.)
> 
> #define region_size (14u*1024u)
>        // Bad dyn_region pattern, parent and child processes examples:
>        // 256u, 2u*1024u, 4u*1024u, 8u*1024u, 9u*1024u, 12u*1024u, 14u*1024u
>        // No failure examples:
>        // 14u*1024u+1u, 15u*1024u, 16u*1024u, 32u*1024u, 256u*1024u*1024u
> #define num_regions (256u*1024u*1024u/region_size)
> 
> typedef volatile unsigned char value_type;
> struct region_struct { value_type array[region_size]; };
> typedef struct region_struct region;
> static region * volatile dyn_regions[num_regions] = {NULL,};
> 
> static value_type value(size_t v) { return (value_type)((v&0xFEu)|0x1u); }
>                  // value avoids zero values: the bad values are zeros.
> 
> void test_setup(void) {
>    for(size_t i=0u; i<num_regions; i++) {
>        dyn_regions[i] = malloc(sizeof(region));
>        if (!dyn_regions[i]) raise(SIGABRT);
> 
>        for(size_t j=0u; j<region_size; j++) {
>            (*dyn_regions[i]).array[j] = value(j);
>        }
>    }
> }
> 
> void memory_willneed(void) {
>    for(size_t i=0u; i<num_regions; i++) {
>        (void) posix_madvise(dyn_regions[i], region_size, POSIX_MADV_WILLNEED);
>    }
> }
> 
> static volatile size_t first_failure_idx = 0u; // dyn_regions index
> static volatile size_t first_failure_pos = 0u; //   sub-array index
> static volatile size_t after_bad_idx     = 0u; // dyn_regions index
> static volatile size_t after_bad_pos     = 0u; //   sub-array index
> static volatile size_t after_good_idx    = 0u; // dyn_regions index
> static volatile size_t after_good_pos    = 0u; //   sub-array index
> 
> // Note: Some failing cases get (conjunctive notation):
> //
> //    0 == first_failure_idx < after_bad_idx < after_good_idx == num_regions
> // && 0 == first_failure_pos && 0 <= after_bad_pos < region_size
> // && 0 == after_good_pos
> // && (after_bad_pos is a multiple of the page size in bytes, here:
> //     after_bad_pos==N*4096 for some non-negative integral value N)
> //
> // Other failing cases instead fail with:
> //
> //    0 == first_failure_idx && num_regions == after_bad_idx == after_good_idx
> // && 0 == first_failure_pos == after_bad_pos == after_good_pos
> //
> // after_bad_idx strongly tends to vary from failing run to failing run,
> // as does after_bad_pos.
> 
> // Note: The working cases get:
> //
> //    num_regions == first_failure_idx == after_bad_idx == after_good_idx
> // && 0 == first_failure_pos == after_bad_pos == after_good_pos
> 
> void test_check(void) {
>    first_failure_idx = first_failure_pos = 0u;
> 
>    while (first_failure_idx < num_regions) {
>        while (  first_failure_pos < region_size
>              && (  value(first_failure_pos)
>                 == (*dyn_regions[first_failure_idx]).array[first_failure_pos]
>                 )
>              ) {
>            first_failure_pos++;
>        }
> 
>        if (region_size != first_failure_pos) break;
> 
>        first_failure_idx++;
>        first_failure_pos = 0u;
>    }
> 
>    after_bad_idx = first_failure_idx;
>    after_bad_pos = first_failure_pos;
> 
>    while (after_bad_idx < num_regions) {
>        while (  after_bad_pos < region_size
>              && (  value(after_bad_pos)
>                 != (*dyn_regions[after_bad_idx]).array[after_bad_pos]
>                 )
>              ) {
>            after_bad_pos++;
>        }
> 
>        if (region_size != after_bad_pos) break;
> 
>        after_bad_idx++;
>        after_bad_pos = 0u;
>    }
> 
>    after_good_idx = after_bad_idx;
>    after_good_pos = after_bad_pos;
> 
>    while (after_good_idx < num_regions) {
>        while (  after_good_pos < region_size
>              && (  value(after_good_pos)
>                 == (*dyn_regions[after_good_idx]).array[after_good_pos]
>                 )
>              ) {
>            after_good_pos++;
>        }
> 
>        if (region_size != after_good_pos) break;
> 
>        after_good_idx++;
>        after_good_pos = 0u;
>    }
> 
>    if (num_regions != first_failure_idx) raise(SIGABRT);
> }
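
As an aside on the jemalloc tcache comment near the top of the
program: the tcache state can also be confirmed at run time via
jemalloc's mallctl(3) interface. A minimal sketch, assuming
FreeBSD's <malloc_np.h>; the helper name report_tcache is mine,
not part of swap_testing5.c:

// Report jemalloc's per-thread tcache state at run time.
#include <malloc_np.h>  // for mallctl(.) (FreeBSD's jemalloc)
#include <stdbool.h>
#include <stdio.h>

void report_tcache(void) {
    bool enabled = false;
    size_t len = sizeof(enabled);

    if (0 == mallctl("thread.tcache.enabled", &enabled, &len, NULL, 0u))
        printf("jemalloc tcache: %s\n", enabled ? "enabled" : "disabled");
    else
        printf("jemalloc tcache state: unavailable\n");
}

Calling this early in main() would show whether an /etc/malloc.conf
'tcache:false' setting actually took effect.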


I've found that, for the above swap_testing5.c, I can make
variations that change how much of the allocated regions'
prefix ends up zero vs. stays good.

I vary the sleep time between testing the initialized
allocations and doing the fork: the longer the sleep,
the more zero pages show up (be sure to read the
comments in the diff below):

# diff swap_testing[56].c
1c1
< // swap_testing5.c
---
> // swap_testing6.c
5c5
< // cc -g -std=c11 -Wpedantic -o swaptesting5 swap_testing5.c
---
> // cc -g -std=c11 -Wpedantic -o swaptesting6 swap_testing6.c
33c33
<     sleep(30); // Potentially force swap-out here.
---
>     sleep(150); // Potentially force swap-out here.
37a38,48
>                // For no-swap-out here cases:
>                //
>                // The longer the sleep here the more allocations
>                // that end up as zero.
>                //
>                // top's Mem Active, Inact, Wired, Buf, Free and
>                // Swap Total, Used, and Free stay unchanged.
>                // What does change is the process RES decreases
>                // while the process SIZE and SWAP stay unchanged
>                // during this sleep.
> 

NOTE: On other architectures that I've tried (such as armv6/v7)
      RES does not decrease during the sleep, and the problem
      does not happen even for the longest sleeps I've tried.

      (I use "stress -m 2 --vm-bytes 900M" on armv6/v7 instead
      of -m 1 --vm-bytes 1800M because a single process that
      large is not allowed there.)
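
For reference, a crude C stand-in for those stress invocations,
as a sketch: it just keeps one large anonymous allocation dirty
so that other processes get pushed out to swap. The file name
swap_forcer.c is mine, and the 1800 MiB figure is the
machine-specific Pine64+ 2GB value from above; adjust to fit:

// swap_forcer.c: rough equivalent of "stress -m 1 --vm-bytes 1800M".
// Run it in another shell and ^C it once top shows the desired
// swapped-out status for the test process.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    size_t const bytes = 1800u * 1024u * 1024u; // machine-specific amount

    unsigned char * const p = malloc(bytes);
    if (!p) { fprintf(stderr, "allocation failed\n"); return 1; }

    for (;;) {
        memset(p, 0xA5, bytes); // keep every page dirty and recently used
        sleep(1);
    }
}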

So watching top's RES during the sleep (longer than a few
seconds) just before the fork() predicts whether the later
checks will fail: if RES decreases (while the rest of the
process's status stays the same), there will be a failure.

At this point I've no clue why the sleeping process has
a decreasing RES(ident memory) size.

I infer that without the sleep there is still a small loss
of RES, but on too short a timescale to observe with
"top -PCwaopid" or the like: in other words, the same
behavior causes the failure then as well, possibly with a
loss of only one page of RES.
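
One way to observe that RES loss from inside the program, rather
than via top, might be mincore(2). A sketch, assuming it is
appended to swap_testing5.c (it uses the dyn_regions, region_size,
and num_regions definitions above); count_resident_pages and its
bookkeeping are mine:

#include <stdint.h>     // for uintptr_t
#include <unistd.h>     // for getpagesize()
// <sys/mman.h>, already included above, provides mincore(.) and
// (on FreeBSD) MINCORE_INCORE.

// Count how many pages backing the allocations are resident.
size_t count_resident_pages(void) {
    size_t const page = (size_t)getpagesize();
    size_t resident = 0u;

    for (size_t i = 0u; i < num_regions; i++) {
        uintptr_t const start
            = (uintptr_t)dyn_regions[i] & ~(uintptr_t)(page - 1u);
        size_t const span = ((uintptr_t)dyn_regions[i] + region_size) - start;
        size_t const npages = (span + page - 1u) / page;
        char vec[npages]; // VLA; small here (region_size is 14 KiBytes)

        if (0 == mincore((void *)start, span, vec)) {
            for (size_t k = 0u; k < npages; k++)
                if (vec[k] & MINCORE_INCORE) resident++;
        }
    }
    return resident;
}

Sampling this before and after the sleep that precedes the fork()
should show the decreasing residency directly, down to a loss of a
single page that top's RES column would not resolve.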

===
Mark Millard
markmi at dsl-only.net


