UMA allocations from a specific physical range

Mon Sep 6 04:48:40 UTC 2010

On Mon, Sep 6, 2010 at 4:28 AM, Nathan Whitehorn <nwhitehorn at freebsd.org> wrote:
> On 09/05/10 22:51, mdf at FreeBSD.org wrote:
>> On Mon, Sep 6, 2010 at 1:38 AM, Nathan Whitehorn <nwhitehorn at freebsd.org> wrote:
>>
>>> PowerPC hypervisors typically provide a restricted range of memory when
>>> the MMU is disabled, as it is when initially handling exceptions. In
>>> order to restore virtual memory, the powerpc64 code needs to read a data
>>> structure called the SLB cache, which is currently allocated out of a
>>> UMA zone, and must be mapped into wired memory, ideally 1:1
>>> physical->virtual address. Since this must be accessible in real mode,
>>> it must have a physical address in a certain range. I am trying to
>>> figure out the best way to do this.
>>>
>>> My first run at this code uses a custom UMA allocator that calls
>>> vm_phys_alloc_contig() to get a memory page. The trouble I have run into
>>> is that I cannot figure out a way to free the page. Marking the zone
>>> NOFREE is a bad solution, vm_page_free() panics the kernel due to
>>> inconsistent tracking of page wiring, and vm_phys_free_pages() causes
>>> panics in vm_page_alloc() later on ("page is not free"). What is the
>>> correct way to deallocate these pages? Or is there a different approach
>>> I should adopt?
>>>
>> I assume this is for the SLB flih?
>>
>> What AIX did was to have a 1-1 simple esid to vsid translation for
>> kernel addresses, reserve the first 16 SLB entries for various uses,
>> including one for the current process's process private segment, and
>> if the slb miss was on a process address we'd turn on translation and
>> look up the answer, the tables holding the answer being in the process
>> private segment effective address space so we wouldn't take another
>> slb miss.  This required one level deep recursion in the slb slih, in
>> case there was a miss on kernel data with xlate on in the SLB slih.
>>
> Yes, that's correct. FreeBSD has the same 1-to-1 translation for the
> kernel, but the entire address space is switched out for user processes
> (no part of the kernel is mapped into user processes), so the code to
> load the user SLB entries has to be able to execute with the MMU off,
> lest it disappear underneath itself.

Okay.  For AIX the kernel text/data in esid 0 was always in slb entry
0 (so it wasn't affected by slbia) and also was mapped into the
process address space.  So we had to be careful with KsKp bits to
prevent access to anything the user couldn't see.  The code for memcpy
and friends was at fixed addresses in the kernel segment so the
compiler knew to jump there, and there was also a user-readable
_system_configuration struct.

Even with no address sharing, the SLB flih could load entries for the
kernel and turn on translation, but it would be trickier.

>> For historical reasons due to the per-process segment table for
>> POWER3, we also had a one-page hashed lookup table per process that we
>> stored the real address of in the process private segment, so the
>> assembly code in the flih looked here before turning on MSR_DR IIRC.
>> I was trying to find ways to kill this code when I left IBM, since
>> we'd ended support for POWER3 a few years earlier.
>>
>> I haven't had the time to look at FreeBSD ppc64 sources; how large are
>> the uma-allocated slb entries and what is stored in them?  The struct
>> and filename is sufficient, though I don't have convenient access to
>> sources until Tuesday.
>>
> The entries are each 1 KB, and there is one for each pmap. Each consists
> of 64 16-byte SLBE/SLBV pairs. These buffers are just a carbon copy of
> what should be in the SLB after a context switch to that map.

But if this is for the flih, the esid that was faulted on won't be in
that struct, right?  Aren't you trying to look up in some table to
load an slb entry?

>> V=R space is rather limited (well, depending on a lot of factors; for
>> AIX on Power5 and later the hypervisor only gave us 128M, though for
>> ppc64 on a Mac G4 I assume all of memory can be mapped V=R if desired)
>> so it was best to find a non V=R solution if possible.  Turning on
>> translation in the flih after some setup and recursion stopping is one
>> of the easier ways, and also has the advantage of not needing to
>> either have separate code or macro access to data structures used in
>> both V and R modes.
>>
> On the PS3 (the target in this case), the hypervisor also limits us to
> 128 MB. The one and only kernel data structure that needs to be used in
> this mode is this SLB cache object, so I was hoping for a simple
> solution to just put them all in the real-mode accessible region.

Well, I assume if you're willing to use 4k then it shouldn't be hard to
allocate a whole page in the V=R region.  Perhaps other useful data
for the process could be added to this page?
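
To make the sizes concrete, here is a rough sketch of what such a page
could look like, based only on the numbers quoted above (64 16-byte
SLBE/SLBV pairs, a 4k page, a 128 MB real-mode limit).  All struct,
field, and function names are hypothetical and are not taken from the
FreeBSD sources, and the check assumes the real-mode accessible region
starts at physical address 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; sizes come from the thread. */
#define SLB_CACHE_ENTRIES 64
#define PAGE_SIZE_4K      4096u
/* Assumption: real-mode region is [0, 128 MB). */
#define REALMODE_LIMIT    (128ull * 1024 * 1024)

/* One 16-byte SLBE/SLBV pair. */
struct slb_pair {
    uint64_t slbe;
    uint64_t slbv;
};

/* Per-pmap copy of what the SLB should hold: 64 * 16 = 1 KB. */
struct slb_cache {
    struct slb_pair entry[SLB_CACHE_ENTRIES];
};

/* A whole page per process: the SLB cache plus 3 KB left over
 * for other data that must be reachable with translation off. */
struct realmode_page {
    struct slb_cache slb;
    uint8_t spare[PAGE_SIZE_4K - sizeof(struct slb_cache)];
};

static_assert(sizeof(struct slb_pair) == 16, "pair is 16 bytes");
static_assert(sizeof(struct slb_cache) == 1024, "cache is 1 KB");
static_assert(sizeof(struct realmode_page) == PAGE_SIZE_4K,
    "structure fills exactly one 4k page");

/* Nonzero if the page-aligned physical address pa names a 4k page
 * lying entirely below the assumed real-mode limit. */
static int
page_in_realmode_range(uint64_t pa)
{
    return (pa % PAGE_SIZE_4K == 0 &&
        pa <= REALMODE_LIMIT - PAGE_SIZE_4K);
}
```

With this layout three quarters of each page are free for other
per-process data, at the cost of one V=R page per pmap.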

Admittedly, this is a bit of a digression.  The internals of UMA
always leave me confused, so I try to avoid thinking about them. :-)

IIRC the memory from vm_phys_alloc_contig() can be released like any
other page; the interface should just be a way of fetching a specific page.
How far off is the page wire count?  I'm assuming it's hitting the
assert that it's > 1?

I think vm_page_free() is the right interface to free the page again,
so the wire count being off presumably means someone else wired it on
you; do you know what code did it?  If no one else has a reference to
the page anymore then setting the wire count to 1, while a hack,
should be safe.

Cheers,
matthew