problems with mmap() and disk caching

Fri Apr 6 09:06:59 UTC 2012

On 04/06/2012 03:38, Konstantin Belousov wrote:
> On Thu, Apr 05, 2012 at 01:25:49PM -0500, Alan Cox wrote:
>> On 04/05/2012 12:31, Konstantin Belousov wrote:
>>> On Thu, Apr 05, 2012 at 10:54:31AM -0500, Alan Cox wrote:
>>>> On 04/04/2012 02:17, Konstantin Belousov wrote:
>>>>> On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I open the file, then call mmap() on the whole file and get a pointer,
>>>>>> then I work with this pointer.  I expect that a page should need to be
>>>>>> touched only once to get it into memory (disk cache?), but this doesn't work!
>>>>>>
>>>>>> I wrote a test (attached) and ran it on a 1G file generated from
>>>>>> /dev/random; the result is the following:
>>>>>>
>>>>>> Prepare file:
>>>>>> # swapoff -a
>>>>>> # newfs /dev/ada0b
>>>>>> # mount /dev/ada0b /mnt
>>>>>> # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024
>>>>>>
>>>>>> Purge cache:
>>>>>> # umount /mnt
>>>>>> # mount /dev/ada0b /mnt
>>>>>>
>>>>>> Run test:
>>>>>> $ ./mmap /mnt/random-1024 30
>>>>>> mmap:  1 pass took:   7.431046 (none: 262112; res:     32; super:      0; other:      0)
>>>>>> mmap:  2 pass took:   7.356670 (none: 261648; res:    496; super:      0; other:      0)
>>>>>> mmap:  3 pass took:   7.307094 (none: 260521; res:   1623; super:      0; other:      0)
>>>>>> mmap:  4 pass took:   7.350239 (none: 258904; res:   3240; super:      0; other:      0)
>>>>>> mmap:  5 pass took:   7.392480 (none: 257286; res:   4858; super:      0; other:      0)
>>>>>> mmap:  6 pass took:   7.292069 (none: 255584; res:   6560; super:      0; other:      0)
>>>>>> mmap:  7 pass took:   7.048980 (none: 251142; res:  11002; super:      0; other:      0)
>>>>>> mmap:  8 pass took:   6.899387 (none: 247584; res:  14560; super:      0; other:      0)
>>>>>> mmap:  9 pass took:   7.190579 (none: 242992; res:  19152; super:      0; other:      0)
>>>>>> mmap: 10 pass took:   6.915482 (none: 239308; res:  22836; super:      0; other:      0)
>>>>>> mmap: 11 pass took:   6.565909 (none: 232835; res:  29309; super:      0; other:      0)
>>>>>> mmap: 12 pass took:   6.423945 (none: 226160; res:  35984; super:      0; other:      0)
>>>>>> mmap: 13 pass took:   6.315385 (none: 208555; res:  53589; super:      0; other:      0)
>>>>>> mmap: 14 pass took:   6.760780 (none: 192805; res:  69339; super:      0; other:      0)
>>>>>> mmap: 15 pass took:   5.721513 (none: 174497; res:  87647; super:      0; other:      0)
>>>>>> mmap: 16 pass took:   5.004424 (none: 155938; res: 106206; super:      0; other:      0)
>>>>>> mmap: 17 pass took:   4.224926 (none: 135639; res: 126505; super:      0; other:      0)
>>>>>> mmap: 18 pass took:   3.749608 (none: 117952; res: 144192; super:      0; other:      0)
>>>>>> mmap: 19 pass took:   3.398084 (none:  99066; res: 163078; super:      0; other:      0)
>>>>>> mmap: 20 pass took:   3.029557 (none:  74994; res: 187150; super:      0; other:      0)
>>>>>> mmap: 21 pass took:   2.379430 (none:  55231; res: 206913; super:      0; other:      0)
>>>>>> mmap: 22 pass took:   2.046521 (none:  40786; res: 221358; super:      0; other:      0)
>>>>>> mmap: 23 pass took:   1.152797 (none:  30311; res: 231833; super:      0; other:      0)
>>>>>> mmap: 24 pass took:   0.972617 (none:  16196; res: 245948; super:      0; other:      0)
>>>>>> mmap: 25 pass took:   0.577515 (none:   8286; res: 253858; super:      0; other:      0)
>>>>>> mmap: 26 pass took:   0.380738 (none:   3712; res: 258432; super:      0; other:      0)
>>>>>> mmap: 27 pass took:   0.253583 (none:   1193; res: 260951; super:      0; other:      0)
>>>>>> mmap: 28 pass took:   0.157508 (none:      0; res: 262144; super:      0; other:      0)
>>>>>> mmap: 29 pass took:   0.156169 (none:      0; res: 262144; super:      0; other:      0)
>>>>>> mmap: 30 pass took:   0.156550 (none:      0; res: 262144; super:      0; other:      0)
>>>>>>
>>>>>> If I run this:
>>>>>> $ cat /mnt/random-1024 > /dev/null
>>>>>> before the test, then the result is the following:
>>>>>>
>>>>>> $ ./mmap /mnt/random-1024 5
>>>>>> mmap:  1 pass took:   0.337657 (none:      0; res: 262144; super:      0; other:      0)
>>>>>> mmap:  2 pass took:   0.186137 (none:      0; res: 262144; super:      0; other:      0)
>>>>>> mmap:  3 pass took:   0.186132 (none:      0; res: 262144; super:      0; other:      0)
>>>>>> mmap:  4 pass took:   0.186535 (none:      0; res: 262144; super:      0; other:      0)
>>>>>> mmap:  5 pass took:   0.190353 (none:      0; res: 262144; super:      0; other:      0)
>>>>>>
>>>>>> This is what I expect.  But why doesn't this work without reading the
>>>>>> file manually?
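
The attached test program is not reproduced in this message.  What follows
is only a minimal sketch of what such a test might look like, assuming it
touches one byte per page and then asks mincore(2) which pages are resident.
The function names (touch_pages, count_resident, run_passes) are made up for
illustration, and the "super" and "other" columns of the original output are
left out.

```c
#define _DEFAULT_SOURCE		/* exposes mincore(2) and MAP_ANON on glibc */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef MINCORE_INCORE
#define MINCORE_INCORE 0x1	/* assumption: low bit means "resident" */
#endif

/* Read one byte from every page so that each page is faulted in once. */
static unsigned long
touch_pages(const volatile unsigned char *p, size_t len, size_t pagesz)
{
	unsigned long sum = 0;

	assert(pagesz > 0);
	for (size_t off = 0; off < len; off += pagesz)
		sum += p[off];
	return (sum);
}

/* Count the pages that mincore(2) reports as resident; -1 on error. */
static long
count_resident(void *p, size_t len, size_t pagesz)
{
	size_t npages = (len + pagesz - 1) / pagesz;
	long res = 0;
	char *vec = malloc(npages);

	if (vec == NULL)
		return (-1);
	/* The vector type differs between systems, hence the void * cast. */
	if (mincore(p, len, (void *)vec) != 0) {
		free(vec);
		return (-1);
	}
	for (size_t i = 0; i < npages; i++)
		if (vec[i] & MINCORE_INCORE)
			res++;
	free(vec);
	return (res);
}

/* Map the whole file, make `passes` sequential passes, report each one. */
static int
run_passes(const char *path, int passes)
{
	struct stat st;
	struct timeval t0, t1;
	size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return (-1);
	if (fstat(fd, &st) != 0 || st.st_size == 0) {
		close(fd);
		return (-1);
	}
	size_t len = (size_t)st.st_size;
	long total = (long)((len + pagesz - 1) / pagesz);
	void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping stays valid after close */
	if (p == MAP_FAILED)
		return (-1);
	for (int i = 1; i <= passes; i++) {
		gettimeofday(&t0, NULL);
		(void)touch_pages(p, len, pagesz);
		gettimeofday(&t1, NULL);
		double secs = (double)(t1.tv_sec - t0.tv_sec) +
		    (double)(t1.tv_usec - t0.tv_usec) / 1e6;
		long res = count_resident(p, len, pagesz);
		if (res < 0)
			break;
		printf("mmap: %2d pass took: %10.6f (none: %6ld; res: %6ld)\n",
		    i, secs, total - res, res);
	}
	munmap(p, len);
	return (0);
}
```

A main() that calls run_passes(argv[1], atoi(argv[2])) would give roughly the
command line used above.  If every page stayed resident after being touched
once, "none" would drop to 0 after the first pass.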
>>>>> The issue seems to be some change in the behaviour of the reserv or
>>>>> phys allocator. I Cc:ed Alan.
>>>> I'm pretty sure that the behavior here hasn't significantly changed in
>>>> about twelve years.  Otherwise, I agree with your analysis.
>>>>
>>>> On more than one occasion, I've been tempted to change:
>>>>
>>>>                                          pmap_remove_all(mt);
>>>>                                          if (mt->dirty != 0)
>>>>                                                  vm_page_deactivate(mt);
>>>>                                          else
>>>>                                                  vm_page_cache(mt);
>>>>
>>>> to:
>>>>
>>>>                                          vm_page_dontneed(mt);
>>>>
>>>> because I suspect that the current code does more harm than good.  In
>>>> theory, it saves activations of the page daemon.  However, more often
>>>> than not, I suspect that we are spending more on page reactivations than
>>>> we are saving on page daemon activations.  The sequential access
>>>> detection heuristic is just too easily triggered.  For example, I've
>>>> seen it triggered by demand paging of the gcc text segment.  Also, I
>>>> think that pmap_remove_all() and especially vm_page_cache() are too
>>>> severe for a detection heuristic that is so easily triggered.
>>> Yes, I agree that such a change would be an improvement, and I expect
>>> that Andrey will test it.
>>>
>>> On the other hand, I do think that the allocator should prefer unnamed
>>> pages to pages which still have valid content. On my 12G desktop,
>>> I never saw more than 100MB of cached pages, and similar numbers
>>> are observed on the 32-48GB servers. I suppose that this is related.
>> On allocation, the system does prefer free pages over cached pages.
>> When cached pages are added to the physical memory allocator, they are
>> added to VM_FREEPOOL_CACHE.  When pages are allocated, they are taken
>> from VM_FREEPOOL_DEFAULT.  Generally, pages only move from the CACHE
>> pool to the DEFAULT pool when the DEFAULT pool is depleted.  (However,
>> occasionally, they do move because of coalescing.)  When I redid the
>> physical memory allocator, I looked at the rate of cached page
>> reactivation under the old and the new allocators.  At least for the
>> tests that I did the rates weren't that different.  It was low,
>> single-digit percentages.  I think the highest likelihood of
>> reactivation comes from the pages that are cached by the sequential
>> access heuristic because it is so overzealous.
>>
>> I don't think it's related.  You see modest numbers of cached pages
>> simply because the page daemon met its target for the sum of free and
>> cached pages.  So, it just stopped moving pages from the inactive queue
>> into the physical memory allocator's cache/free queues.
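
The pool preference described above can be sketched as a toy model.  This is
not the vm_phys code; the struct and function names are invented for
illustration, and coalescing between the pools is ignored.  It only shows
the stated policy: allocate from the default pool, and fall back to cached
pages when the default pool is empty.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the two free pools; the counts are in pages. */
struct toy_pools {
	size_t free_default;	/* free pages with no valid contents */
	size_t free_cache;	/* clean pages that still hold file data */
	size_t reclaimed;	/* cached pages given up to allocations */
};

/* A clean page is handed to the allocator but keeps its contents. */
static void
toy_cache_page(struct toy_pools *tp)
{
	tp->free_cache++;
}

/* Allocate one page, preferring free pages over cached pages. */
static bool
toy_alloc_page(struct toy_pools *tp)
{
	if (tp->free_default > 0) {
		tp->free_default--;
		return (true);
	}
	if (tp->free_cache > 0) {
		/* Default pool depleted: a cached page loses its identity. */
		tp->free_cache--;
		tp->reclaimed++;
		return (true);
	}
	return (false);
}
```

Under this model the cached count only shrinks once free_default reaches
zero, which is the behaviour the paragraph above expects from the real
allocator.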
> No, I mean something else. Specifically, I mean that somehow the
> preference for non-named pages does not work. At least, I cannot give
> any other explanation for the following experiment.
>
> Let's take stock HEAD without the change in vm_fault.c. The initial
> state of the 8GB machine is as follows; the test file has not even
> been stat(2)-ed yet.
> Mem: 37M Active, 18M Inact, 150M Wired, 236K Cache, 27M Buf, 7612M Free
>
> Now, run Andrey's original, unmodified test with only one pass,
> making a sequential read of the mmap of a 5GB file from a UFS volume.
> After the run
> Mem: 38M Active, 18M Inact, 153M Wired, 21M Cache, 30M Buf, 7586M Free
>
> Please note that the cached count increased by only 20M, and this is
> for calls to vm_page_cache() worth 5GB. In other words, it seems
> that the allocator almost never touches free memory, always preferring
> cache. This mostly coincides with what I saw when I profiled the
> original problem reported by Andrey.

Ah, I understand.
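
To put rough numbers on the experiment quoted above, here is a small sketch
of the arithmetic.  It assumes 4KB pages (consistent with the 262144 pages
reported for the 1G file) and takes the rounded Cache figures from the two
"Mem:" lines at face value, so the result is only approximate.  The helper
names are made up.

```c
#include <stdio.h>

#define PAGE_KB	4UL	/* assumption: 4KB pages */

/* Convert a size in kilobytes to a whole number of pages. */
static unsigned long
kb_to_pages(unsigned long kb)
{
	return (kb / PAGE_KB);
}

/* Fraction of the touched pages that ended up as net new cached pages. */
static double
retained_fraction(unsigned long cached_pages, unsigned long touched_pages)
{
	return ((double)cached_pages / (double)touched_pages);
}

/* Print the figures for the 5GB run: Cache went from 236K to 21M. */
static void
print_summary(void)
{
	unsigned long touched = kb_to_pages(5UL * 1024 * 1024);
	unsigned long cached = kb_to_pages(21UL * 1024 - 236);

	printf("touched %lu pages, net cached %lu pages (%.2f%%)\n",
	    touched, cached, 100.0 * retained_fraction(cached, touched));
}
```

That is about 1.3 million pages passed through vm_page_cache() against a net
gain of roughly five thousand cached pages, well under one percent.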