ThunderX Panic after r368370
Mark Millard
marklmi at yahoo.com
Sun Dec 6 21:31:03 UTC 2020
On 2020-Dec-6, at 03:51, Michal Meloun <meloun.michal at gmail.com> wrote:
> On 06.12.2020 10:47, Mark Millard wrote:
>> On 2020-Dec-6, at 00:17, Michal Meloun <meloun.michal at gmail.com> wrote:
>>> On 06.12.2020 3:21, Marcel Flores wrote:
>>>> Hi All,
>>>> Looks like the ThunderX started panicking at boot after r368370:
>>>> https://reviews.freebsd.org/rS368370
>>>> From a verbose boot, it looks like it bails in gic0 redistributor setup(?):
>>>> gic0: CPU29 Re-Distributor woke up
>>>> gic0: CPU24 enabled CPU interface via system registers
>>>> gic0: CPU17 enabled CPU interface via system registers
>>>> gic0: CPU29 enabled CPU interface via system registers
>>>> done
>>>> Full Verbose boot:
>>>> https://gist.github.com/mesflores/f026122495c8494d041bce04d30b15bb
>>>> I'm not really familiar with the details of the commit, but happy to test
>>>> anything if anyone has any ideas.
>>>
>>>
>>> Hi Marcel
>>> are you able to get crashdump and do backtrace?
>>> https://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html#kerneldebug-obtain
>>> and
>>> https://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-gdb.html
>>> If not, I'll make some debug patch.
>>>
>>> It's weird, even though GIC is potentially affected by my patch, in this case the cpuid numbering was not changed.
>> (I've no access to a ThunderX. I just looked for my own curiosity.
>> Sorry if this is obvious and so is noise.)
>> When I looked at the code it appeared to be the last "->" in
>> the following that was dereferencing the nullptr value (via [x8]
>> in assembler notation):
>> static uint64_t
>> its_cmd_prepare(struct its_cmd *cmd, struct its_cmd_desc *desc)
>> {
>>         uint64_t target;
>>         uint8_t cmd_type;
>>         u_int size;
>>
>>         cmd_type = desc->cmd_type;
>>         target = ITS_TARGET_NONE;
>>
>>         switch (cmd_type) {
>>         case ITS_CMD_MOVI: /* Move interrupt ID to another collection */
>>                 target = desc->cmd_desc_movi.col->col_target;
>> . . .
>> In other words: it appeared to me that the above desc->cmd_desc_movi.col
>> evaluated as 0 when used in what was reported.
> This is very probably right analysis. But problem is that cmd_desc_movi.col should not be NULL, is initialized in its_cmd_movi from sc->sc_its_cols which should be allocated in gicv3_its_attach().
>
The following is unlikely to contribute directly to solving the
specific problem, but it documents an oddity that took up my
time while I was looking around at code related to the problem.
One (comment?) oddity I ran into looking around:
/usr/src/sys/sys/cpuset.h:#define CPU_FFS(p) BIT_FFS(CPU_SETSIZE, p)
but in /usr/src/sys/sys/bitset.h :
#define BIT_FFS(_s, p) BIT_FFS_AT((_s), (p), 0)
and (comment wrong about start?):
/*
* Note that `start` and the returned value from BIT_FFS_AT are
* 1-based bit indices.
*/
#define BIT_FFS_AT(_s, p, start) __extension__ ({ \
. . .
In other words, BIT_FFS (and CPU_FFS) provide BIT_FFS_AT with start==0
but start is documented to be a 1-based bit index.
So, looking into what happens with start==0, showing BIT_FFS_AT:
#define BIT_FFS_AT(_s, p, start) __extension__ ({ \
        __size_t __i; \
        long __mask; \
        int __bit; \
        \
        __mask = ~0UL << ((start) % _BITSET_BITS); \
        __bit = 0; \
        for (__i = __bitset_word((_s), (start)); \
            __i < __bitset_words((_s)); \
            __i++) { \
                if (((p)->__bits[__i] & __mask) != 0) { \
                        __bit = ffsl((p)->__bits[__i] & __mask); \
                        __bit += __i * _BITSET_BITS; \
                        break; \
                } \
                __mask = ~0UL; \
        } \
        __bit; \
})
It looks like this traces to use of:
__mask = ~0UL << ((start) % _BITSET_BITS); \
and to use of:
#define __bitset_word(_s, n) \
(__constexpr_cond(__bitset_words((_s)) == 1) ? \
0 : ((n) / _BITSET_BITS))
So __mask==~0UL and __bitset_word((_s), (start))==0 . Then for
__i==0:
((p)->__bits[0] & __mask) != 0 evaluates like
((p)->__bits[0] & ~0UL) != 0 which in turn evaluates like
(p)->__bits[0] != 0.
From there __bit = ffsl((p)->__bits[0] & __mask) would involve
(p)->__bits[0] & __mask evaluating like (p)->__bits[0] & ~0UL and
that in turn evaluating like just (p)->__bits[0] . Presuming non-zero
as a context, effectively for such a context:
__bit = ffsl((p)->__bits[0]);
__bit += 0;
which would seem to set __bit correctly.
It looks to me like start is 0-based in BIT_FFS_AT, not 1-based. So
I expect that the comment is wrong about start.
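For anyone wanting to check the above outside the kernel, here is a
minimal userland sketch of the same loop. It is not the bitset.h code:
struct demo_bitset, demo_ffsl and demo_ffs_at are hypothetical names
made up for this note, the set is fixed at two words, and the mask is
unsigned long rather than long. It only mirrors the mask and word-index
arithmetic quoted above.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for _BITSET_BITS and a two-word bitset. */
#define DEMO_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NWORDS    2

struct demo_bitset {
        unsigned long bits[DEMO_NWORDS];
};

/* Same contract as ffsl(3): 1-based index of lowest set bit, 0 if none. */
static int
demo_ffsl(unsigned long w)
{
        int i;

        if (w == 0)
                return (0);
        for (i = 1; (w & 1UL) == 0; i++)
                w >>= 1;
        return (i);
}

/*
 * Mirrors the arithmetic of the quoted BIT_FFS_AT: bits below
 * (start % word size) are masked off in the first word examined,
 * and the scan begins at word (start / word size).
 */
static int
demo_ffs_at(const struct demo_bitset *p, size_t start)
{
        unsigned long mask = ~0UL << (start % DEMO_WORD_BITS);
        size_t i;

        for (i = start / DEMO_WORD_BITS; i < DEMO_NWORDS; i++) {
                if ((p->bits[i] & mask) != 0)
                        return (demo_ffsl(p->bits[i] & mask) +
                            (int)(i * DEMO_WORD_BITS));
                mask = ~0UL;    /* later words are scanned in full */
        }
        return (0);
}
```

With bits 0 and 2 set, start==0 finds bit 0 (returned as 1), while
start==1 already skips bit 0 and finds bit 2 (returned as 3). If start
were 1-based, start==1 would still have to find bit 0. So in this
sketch the returned value is 1-based but start behaves as 0-based.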
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)