ia64/147772: [ia64] ptc_g causes MCA on McKinley & Madison CPUs

Thu Jun 10 18:20:06 UTC 2010

>Number:         147772
>Category:       ia64
>Synopsis:       [ia64] ptc_g causes MCA on McKinley & Madison CPUs
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-ia64
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Jun 10 18:20:01 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator:     Marcel Moolenaar
>Release:        9-CURRENT
>Organization:
>Environment:
FreeBSD pluto2.freebsd.org 9.0-CURRENT FreeBSD 9.0-CURRENT #19 r208970M: Thu Jun 10 04:12:20 UTC 2010     marcel at pluto2.freebsd.org:/usr/obj/tank/usr/src/sys/PLUTO2  ia64

>Description:
Background:
    The code following the exception_save_restart and exception_restore_restart labels runs with psr.ic disabled. A TLB miss while psr.ic is disabled triggers a Nested Data TLB Fault. The code has been designed so that a TLB miss, when it happens, happens in the first bundle after these labels, and the Nested Data TLB Fault handler knows how to insert the translation and restart the bundle. The underlying assumption is that once the translation is in the translation cache, the entire sequence will complete without a TLB miss until either psr.ic can be enabled or the rfi instruction is executed. The only translation that we need is the one for the kernel stack so that we can read or write the trapframe.

Problem:
    The ptc.g operation on the McKinley and Madison processors has the side-effect of purging more than the requested translation. While this is not a problem in general, it invalidates the assumption made for exception_save_restart and exception_restore_restart in SMP configurations. Since ptc.g purges the translation caches of all CPUs in the coherency domain, a ptc.g executed on one CPU can cause a purge on another CPU that is currently running the critical code sequences following the exception_save_restart and exception_restore_restart labels. While the purge address is never the translation for the trapframe that is being read or written, the behaviour of McKinley and Madison processors in purging more than the requested translation can result in an unexpected TLB miss. This miss is then mishandled by the Nested Data TLB Fault handler, which typically results in a machine check.

This problem is not observed on a Montecito processor. The problem was also never observed on Merced, FWIW.
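
The failure mode can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: the names tc_model, run_critical_sequence and purge_before are hypothetical, the translation cache is reduced to a single flag for the kernel-stack translation, and each loop iteration stands in for one trapframe access made with psr.ic disabled. A miss on the first access is handled by inserting the translation and restarting; a miss anywhere later is what the code was designed never to see.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one translation-cache entry for the kernel stack. */
struct tc_model {
    bool kstack_mapped;  /* is the kernel-stack translation cached? */
    int  restarts;       /* times the first bundle was restarted */
};

enum seq_result { SEQ_OK, SEQ_UNEXPECTED_MISS };

/*
 * Model a critical sequence of n_accesses trapframe accesses.
 * purge_before is the index of the access just before which a remote
 * purge drops the kernel-stack translation, or -1 for no purge.
 */
enum seq_result
run_critical_sequence(struct tc_model *tc, int n_accesses, int purge_before)
{
    assert(n_accesses > 0);
    for (int i = 0; i < n_accesses; i++) {
        if (i == purge_before)
            tc->kstack_mapped = false;   /* over-broad purge lands here */
        if (!tc->kstack_mapped) {
            if (i != 0)
                return SEQ_UNEXPECTED_MISS;  /* handler cannot recover */
            /* Nested Data TLB Fault: insert translation, restart bundle. */
            tc->kstack_mapped = true;
            tc->restarts++;
        }
    }
    return SEQ_OK;
}
```

In this model a purge that lands before the first access is harmless, because the restart logic covers it. A purge that lands before any later access returns SEQ_UNEXPECTED_MISS, which corresponds to the mishandled fault described above.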

>How-To-Repeat:
Run pho's stress2 test on McKinley or Madison with SMP enabled.
>Fix:
There are 2 possible fixes:
1.  serialize ptc.g with respect to exception_save_restart and exception_restore_restart so that ptc.g is never executed on one processor while some other processor is running the critical sequence.
2.  replace the use of ptc.g with an IPI mechanism and have all CPUs execute ptc.l locally. This guarantees that no purge will be visible to a CPU while it is executing the critical sequence.
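
Both options can be sketched as sequential user-space models. Neither is a proposed patch: every name below (ptc_gate, critical_try_enter, ptc_g_try_begin, cpu_model, shootdown_post, cpu_take_ipi, NCPU) is hypothetical, the state is plain variables rather than atomics, and the second model assumes that a CPU does not take the purge IPI while it is inside the critical sequence. The models show only the ordering rule each fix has to enforce.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 4   /* hypothetical CPU count for this model */

/* --- Fix 1: serialize ptc.g against the critical sequences --- */
struct ptc_gate {
    int  in_critical;      /* CPUs currently inside a critical sequence */
    bool purge_in_flight;  /* a global purge (ptc.g) is in progress */
};

/* A CPU may enter the critical sequence only when no ptc.g is in flight. */
bool
critical_try_enter(struct ptc_gate *g)
{
    if (g->purge_in_flight)
        return false;
    g->in_critical++;
    return true;
}

void
critical_exit(struct ptc_gate *g)
{
    assert(g->in_critical > 0);
    g->in_critical--;
}

/* ptc.g may start only when no CPU is inside a critical sequence. */
bool
ptc_g_try_begin(struct ptc_gate *g)
{
    if (g->in_critical > 0 || g->purge_in_flight)
        return false;
    g->purge_in_flight = true;
    return true;
}

void
ptc_g_end(struct ptc_gate *g)
{
    assert(g->purge_in_flight);
    g->purge_in_flight = false;
}

/* --- Fix 2: IPI plus a local purge (ptc.l) on every CPU --- */
struct cpu_model {
    bool in_critical;   /* in the critical sequence; IPI assumed masked */
    bool ipi_pending;   /* purge IPI posted but not yet taken */
    int  local_purges;  /* ptc.l-style purges this CPU has performed */
};

/* Initiator purges locally, then posts a purge IPI to every other CPU. */
void
shootdown_post(struct cpu_model cpus[NCPU], int self)
{
    assert(self >= 0 && self < NCPU);
    cpus[self].local_purges++;          /* stands in for ptc.l */
    for (int i = 0; i < NCPU; i++)
        if (i != self)
            cpus[i].ipi_pending = true;
}

/* Target takes the IPI, and purges, only outside the critical sequence. */
bool
cpu_take_ipi(struct cpu_model *cpu)
{
    if (cpu->in_critical || !cpu->ipi_pending)
        return false;
    cpu->ipi_pending = false;
    cpu->local_purges++;                /* stands in for ptc.l */
    return true;
}
```

In the first model the cost falls on every exception entry and exit, which must check the gate. In the second the cost falls on the purge path, which has to wait for each CPU to take the IPI, but the critical sequences themselves are left untouched.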

>Release-Note:
>Audit-Trail:
>Unformatted: