mmap MAP_NOSYNC regression in 10.x
Alan Cox
alc at rice.edu
Wed Sep 10 06:10:28 UTC 2014
On 09/09/2014 03:38, Konstantin Belousov wrote:
> On Mon, Sep 08, 2014 at 12:59:43PM -0500, Alan Cox wrote:
>> On 09/08/2014 03:41, Konstantin Belousov wrote:
>>> On Sat, Sep 06, 2014 at 07:17:26PM -0500, Alan Cox wrote:
>>>> On 09/05/2014 07:38, Konstantin Belousov wrote:
>>>>> On Fri, Sep 05, 2014 at 01:56:40PM +0200, Pieter de Goeje wrote:
>>>>>> Thanks, works for me!
>>>>> I realized that the patch contains yet another bug. The oflags page
>>>>> flags update is protected by the exclusive vm object lock, which is only
>>>>> held in shared mode on the fast path. Below is the fixed patch, where
>>>>> I take the page lock around setting VPO_NOSYNC (exclusive lock owners
>>>>> cannot race with fast path since we own shared lock, and two parallel
>>>>> fast path execution would be handled by the page lock).
>>>> Suppose that the page is clean and two threads are executing this code
>>>> concurrently. One's map entry has MAP_NOSYNC set, and the other's
>>>> doesn't. Let's call these threads NOSYNC and SYNC, respectively.
>>>>
>>>> Suppose that the thread SYNC is slightly ahead. It has already
>>>> performed "m->oflags &= ~VPO_NOSYNC;" and now it's about to perform
>>>> "vm_page_dirty(fs.m);". However, just before the thread SYNC calls
>>>> vm_page_dirty(), the thread NOSYNC evaluates "m->dirty == 0", which is
>>>> still true, and it performs "m->oflags |= VPO_NOSYNC; "
>>>>
>>>> This can't happen on the slow path. That is, a fault by a thread
>>>> without MAP_NOSYNC set on its map entry will reliably clear VPO_NOSYNC.
>>> As I understand things, it is indeed not possible on the slow path, due
>>> to PG_RW only set from pmap_enter(), am I right ? I.e. this is another
>>> place where the rule 'no PG_RW without PG_M' is important.
>>
>> Yes, it's not possible, but I'm a little confused by the rest of your
>> question, specifically, the statement "no PG_RW without PG_M". Did you
>> actually mean "no PG_M without PG_RW"?
> No, I mean what I did wrote, and I was wrong.
>
>>
>>> Let me formulate my question another way: what are the guarantees we
>>> provide to the applications when the same page is mapped with and
>>> without MAP_NOSYNC simultaneously ? Is it contractually guaranteed that
>>> any write from !MAP_NOSYNC entry triggers write in the syncer activity
>>> period ?
>>
>> Yes, that is the intent. However, I can think of at least one case
>> where the existing code doesn't work as intended. Suppose that the
>> first fault on a !MAP_NOSYNC entry is triggered by a read access. Then,
>> vm_fault() won't call vm_page_dirty(), but it will nonetheless install a
>> mapping in the pmap that allows write access. Now, suppose this same
>> process writes to the page. Finally, suppose that the second fault
>> happens on a MAP_NOSYNC entry. That fault will see a clean page, i.e.,
>> m->dirty == 0, and set VPO_NOSYNC on the page, even though the first
>> faulting process that wants the page sync'ed has dirtied the page.
> Yes, you are right. I convinced myself that this cannot happen, due to
> the false assumption above, and the fact that page flushing removes
> write bit from pte.
There is another case that I've wondered about and not had the time to
check out. What happens when a page is mapped by an entry with
MAP_NOSYNC and a write(2) is performed? Is the page being flushed or not?
>>
>>
>>>> The best course of action may be to fall back to the slow path if you
>>>> actually need to change VPO_NOSYNC's state. Usually, you won't need to.
>>>>
>>> Let me first try to improve the original patch to handle
>>> MAP_ENTRY_NOSYNC on fast path as well. It seems to be one of the cases
>>> when the parallel faults are actually useful.
>>
>> I think it may be time to take a step back, decide what semantics we
>> really want, and see if there is a better way of implementing those
>> semantics. The current approach based on toggling VPO_NOSYNC only
>> really works for the simplest cases.
> Still, I believe that the current form of the patch completely repeats
> the existing semantic of the slow path in the fast path.
>
> I may propose to avoid putting vm_page_dirty() in the scope of page
> lock for vm_fault_dirty(), mostly to simplify the logic, not to
> provide any useful optimization.
I don't think that you can do that. It leads to the race that I
described in my first email.
>>
>>> One more note: the previous patch handled m->oflags inconsistency for
>>> setting VPO_NOSYNC operation, but missed the clear one line later.
>>> I think that increasing the page lock to cover also the vm_page_dirty()
>>> would fix the race you described, and the second manipulation with
>>> oflags.
>>>
>>> diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
>>> index 30b0456..944b479 100644
>>> --- a/sys/vm/vm_fault.c
>>> +++ b/sys/vm/vm_fault.c
>>> @@ -174,6 +174,70 @@ unlock_and_deallocate(struct faultstate *fs)
>>> }
>>> }
>>>
>>> +static void
>>> +vm_fault_dirty(vm_map_entry_t entry, vm_page_t m, vm_prot_t prot,
>>> + vm_prot_t fault_type, int fault_flags, boolean_t set_wd)
>>> +{
>>> + boolean_t need_dirty;
>>> +
>>> + if (((prot & VM_PROT_WRITE) == 0 &&
>>> + (fault_flags & VM_FAULT_DIRTY) == 0) ||
>>> + (m->oflags & VPO_UNMANAGED) != 0)
>>> + return;
>>> +
>>> + VM_OBJECT_ASSERT_LOCKED(m->object);
>>> +
>>> + need_dirty = ((fault_type & VM_PROT_WRITE) != 0 &&
>>> + (fault_flags & VM_FAULT_CHANGE_WIRING) == 0) ||
>>> + (fault_flags & VM_FAULT_DIRTY) != 0;
>>> +
>>> + if (set_wd)
>>> + vm_object_set_writeable_dirty(m->object);
>>> + else
>>> + /*
>>> + * If two callers of vm_fault_dirty() with set_wd ==
>>> + * FALSE, one for the map entry with MAP_ENTRY_NOSYNC
>>> + * flag set, other with flag clear, race, it is
>>> + * possible for the no-NOSYNC thread to see m->dirty
>>> + * != 0 and not clear VPO_NOSYNC. Take vm_page lock
>>> + * around manipulation of VPO_NOSYNC and
>>> + * vm_page_dirty() call, to avoid the race and keep
>>> + * m->oflags consistent.
>>> + */
>>> + vm_page_lock(m);
>>> +
>>> + /*
>>> + * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
>>> + * if the page is already dirty to prevent data written with
>>> + * the expectation of being synced from not being synced.
>>> + * Likewise if this entry does not request NOSYNC then make
>>> + * sure the page isn't marked NOSYNC. Applications sharing
>>> + * data should use the same flags to avoid ping ponging.
>>> + */
>>> + if ((entry->eflags & MAP_ENTRY_NOSYNC) != 0) {
>>> + if (m->dirty == 0) {
>>> + m->oflags |= VPO_NOSYNC;
>>> + }
>>> + } else {
>>> + m->oflags &= ~VPO_NOSYNC;
>>> + }
>>> +
>>> + /*
>>> + * If the fault is a write, we know that this page is being
>>> + * written NOW so dirty it explicitly to save on
>>> + * pmap_is_modified() calls later.
>>> + *
>>> + * Also tell the backing pager, if any, that it should remove
>>> + * any swap backing since the page is now dirty.
>>> + */
>>> + if (need_dirty)
>>> + vm_page_dirty(m);
>>> + if (!set_wd)
>>> + vm_page_unlock(m);
>>> + if (need_dirty)
>>> + vm_pager_page_unswapped(m);
>>> +}
>>> +
>>> /*
>>> * TRYPAGER - used by vm_fault to calculate whether the pager for the
>>> * current object *might* contain the page.
>>> @@ -321,11 +385,8 @@ RetryFault:;
>>> vm_page_hold(m);
>>> vm_page_unlock(m);
>>> }
>>> - if ((fault_type & VM_PROT_WRITE) != 0 &&
>>> - (m->oflags & VPO_UNMANAGED) == 0) {
>>> - vm_page_dirty(m);
>>> - vm_pager_page_unswapped(m);
>>> - }
>>> + vm_fault_dirty(fs.entry, m, prot, fault_type, fault_flags,
>>> + FALSE);
>>> VM_OBJECT_RUNLOCK(fs.first_object);
>>> if (!wired)
>>> vm_fault_prefault(&fs, vaddr, 0, 0);
>>> @@ -898,42 +959,7 @@ vnode_locked:
>>> if (hardfault)
>>> fs.entry->next_read = fs.pindex + faultcount - reqpage;
>>>
>>> - if (((prot & VM_PROT_WRITE) != 0 ||
>>> - (fault_flags & VM_FAULT_DIRTY) != 0) &&
>>> - (fs.m->oflags & VPO_UNMANAGED) == 0) {
>>> - vm_object_set_writeable_dirty(fs.object);
>>> -
>>> - /*
>>> - * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
>>> - * if the page is already dirty to prevent data written with
>>> - * the expectation of being synced from not being synced.
>>> - * Likewise if this entry does not request NOSYNC then make
>>> - * sure the page isn't marked NOSYNC. Applications sharing
>>> - * data should use the same flags to avoid ping ponging.
>>> - */
>>> - if (fs.entry->eflags & MAP_ENTRY_NOSYNC) {
>>> - if (fs.m->dirty == 0)
>>> - fs.m->oflags |= VPO_NOSYNC;
>>> - } else {
>>> - fs.m->oflags &= ~VPO_NOSYNC;
>>> - }
>>> -
>>> - /*
>>> - * If the fault is a write, we know that this page is being
>>> - * written NOW so dirty it explicitly to save on
>>> - * pmap_is_modified() calls later.
>>> - *
>>> - * Also tell the backing pager, if any, that it should remove
>>> - * any swap backing since the page is now dirty.
>>> - */
>>> - if (((fault_type & VM_PROT_WRITE) != 0 &&
>>> - (fault_flags & VM_FAULT_CHANGE_WIRING) == 0) ||
>>> - (fault_flags & VM_FAULT_DIRTY) != 0) {
>>> - vm_page_dirty(fs.m);
>>> - vm_pager_page_unswapped(fs.m);
>>> - }
>>> - }
>>> -
>>> + vm_fault_dirty(fs.entry, fs.m, prot, fault_type, fault_flags, TRUE);
>>> vm_page_assert_xbusied(fs.m);
>>>
>>> /*
>>> diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
>>> index f12b76c..a45648d 100644
>>> --- a/sys/vm/vm_page.h
>>> +++ b/sys/vm/vm_page.h
>>> @@ -147,7 +147,7 @@ struct vm_page {
>>> uint16_t hold_count; /* page hold count (P) */
>>> uint16_t flags; /* page PG_* flags (P) */
>>> uint8_t aflags; /* access is atomic */
>>> - uint8_t oflags; /* page VPO_* flags (O) */
>>> + uint8_t oflags; /* page VPO_* flags (OM) */
>>> uint8_t queue; /* page queue index (P,Q) */
>>> int8_t psind; /* pagesizes[] index (O) */
>>> int8_t segind;
>>> @@ -163,8 +163,9 @@ struct vm_page {
>>> /*
>>> * Page flags stored in oflags:
>>> *
>>> - * Access to these page flags is synchronized by the lock on the object
>>> - * containing the page (O).
>>> + * Access to these page flags is synchronized by the exclusive lock on
>>> + * the object containing the page, or combination of shared object
>>> + * lock and the page lock (OM).
>>> *
>>> * Note: VPO_UNMANAGED (used by OBJT_DEVICE, OBJT_PHYS and OBJT_SG)
>>> * indicates that the page is not under PV management but
More information about the freebsd-hackers
mailing list