USB [USB3 and USB2] problems when using UEFI v1.16 to boot RPi4: Evidence of a read-time problem being involved (contexts that avoid the issue)
Mark Millard
marklmi at yahoo.com
Wed Jul 15 10:35:48 UTC 2020
On 2020-Jun-25, at 20:40, Mark Millard <marklmi at yahoo.com> wrote:
> [Looks like it is a read-time failure in some
> new testing.]
>
> On 2020-Jun-25, at 17:52, Mark Millard <marklmi at yahoo.com> wrote:
>>
>> On 2020-Jun-25, at 15:40, Klaus Küchemann <maciphone2 at googlemail.com> wrote:
>>
>>> Am 25.06.2020 um 21:29 schrieb Mark Millard via freebsd-arm <freebsd-arm at freebsd.org>:
>>>> …
>>>> .
>>>> The test still failed to produce an accurate file copy
>>>> but the kernel did not report anything either. I'm
>>>> unsure how to get evidence of the context for the bad 4K
>>>> chunks.
>>>>
>>> No clue if it has effects but maybe : dd if=xxx of=xxx bs=4k ?
>>
>> Something interesting does result from dd testing,
>> even though doing file copies that way still gets
>> the problem. In fact a couple of interesting points
>> show up.
>>
>> Using dd to copy large files still gets corrupted copies.
>> (Large files are used only because the corruptions are
>> infrequent within a file, while a sufficiently large file
>> seems to always have some corruption.)
>>
>> Interestingly, dd if=/dev/zero based large file
>> generation has produced good files from what I
>> can tell. (Generate separate files and diff them
>> after a reboot.)
>>
>> The problem was originally discovered copying
>> from another machine to an RPi4. But the Ethernet
>> use involved USB in providing data (but not a
>> local USB drive), while /dev/zero does not
>> involve USB as a data source and only involves
>> copies of data in memory via file content buffering.
>> So the contrasting dd if=/dev/zero results may be
>> indicating something.
>>
>> Another interesting point is that the following
>> sequence seems repeatable for step (E)'s resultant
>> property below:
>>
>> A) first do a couple of large dd if=/dev/zero file generations
>> B) then do a (non-zero) large file copy (dd based or cp based)
>> C) reboot
>> D) diff the 2 files generated in (A): no differences
>> E) diff the original large file and the temporary copy
>> from (B): there are differences and the temporary copy
>> has zero in every byte that is different.
>>
>> (E) suggests that the bad file copies via cp or
>> via dd are sometimes picking up data from the wrong
>> memory pages. (A) just made large numbers of
>> pages zero, making it more likely that a zero page
>> would be used if the wrong page was referenced.
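
The (A) through (E) sequence above could be scripted roughly as follows. This is only a sketch under assumptions: the function names prepare and verify, the file names zero.1, zero.2 and copy.tmp, and the size argument are hypothetical, and the reboot of step (C) still has to be done by hand between the two phases.

```shell
#!/bin/sh
# Hypothetical helper and file names; not taken from the original report.

# Phase 1 (before the reboot): steps (A) and (B).
# usage: prepare <source-file> <work-dir> [size-in-MiB]
prepare() {
    src=$1; dir=$2; mib=${3:-4096}   # 4096 MiB = 4 GiByte, as in the report
    dd if=/dev/zero of="$dir/zero.1" bs=1048576 count="$mib" 2>/dev/null  # (A)
    dd if=/dev/zero of="$dir/zero.2" bs=1048576 count="$mib" 2>/dev/null  # (A)
    cp "$src" "$dir/copy.tmp"                                             # (B)
}

# Phase 2 (after the reboot of step (C)): steps (D) and (E).
# usage: verify <source-file> <work-dir>
verify() {
    src=$1; dir=$2
    if cmp -s "$dir/zero.1" "$dir/zero.2"; then                           # (D)
        echo "D: zero files match"
    else
        echo "D: zero files differ"
    fi
    if cmp -s "$src" "$dir/copy.tmp"; then                                # (E)
        echo "E: copy matches"
    else
        # cmp -l prints: byte number, octal byte in file 1, octal byte in
        # file 2. Count differing bytes that are NOT zero in the copy.
        nz=$(cmp -l "$src" "$dir/copy.tmp" |
            awk '$3 != 0 { n++ } END { print n + 0 }')
        echo "E: copy differs; non-zero differing bytes in copy: $nz"
    fi
}
```

A count of 0 in the last line would correspond to the observation in (E) that every differing byte in the temporary copy is zero.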
>>
>> An example of checking for (E) was:
>>
>> # diff clang-cortexA53-installworld-poud.tar mmjnk.other
>> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other differ
>>
>> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other | grep -v " 0$" | more
>> --More--(END)
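
As an alternative to paging through cmp -l output, the differing bytes could be grouped by 4 KiByte chunk. The sketch below does that and also counts how many differing bytes are zero in the second file. The function name summarize_chunks is hypothetical, and the 4096-byte chunk size is an assumption based on the "bad 4K chunks" description earlier in the thread.

```shell
#!/bin/sh
# Summarize `cmp -l` output per 4096-byte chunk. For each chunk containing
# differing bytes, print: 0-based chunk index, number of differing bytes,
# and number of differing bytes that are zero in the second file.
# usage: summarize_chunks <original> <copy>
summarize_chunks() {
    # cmp -l columns: 1-based byte number (decimal), byte value in file 1
    # (octal), byte value in file 2 (octal).
    cmp -l "$1" "$2" | awk '
        {
            c = int(($1 - 1) / 4096)   # convert 1-based byte number to chunk
            diff[c]++
            if ($3 == 0) zero[c]++
        }
        END {
            for (c in diff) printf "%d %d %d\n", c, diff[c], zero[c] + 0
        }
    ' | sort -n
}
```

For example, summarize_chunks clang-cortexA53-installworld-poud.tar mmjnk.other would print one line per affected chunk. When the last two numbers on a line are equal, every differing byte in that chunk is zero in the copy.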
>>
>>
>> Note about my example "large file" sizes:
>>
>> -rw-r--r-- 1 root wheel 4011026432 Apr 25 21:04:42 2020 clang-cortexA53-installworld-poud.tar
>>
>> and I've been mostly using 4 GiByte for the resultant size
>> of large files generated via dd.
>>
>> I have not tried to find a minimum size for reliably
>> getting corrupted file copies.
>>
>
> I continued after the above with (no additional reboot):
>
> # cpuset -l0 cp -aRx clang-cortexA53-installworld-poud.tar mmjnk.other2
>
> # diff clang-cortexA53-installworld-poud.tar mmjnk.other2
> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other2 differ
>
> # cpuset -l2 diff clang-cortexA53-installworld-poud.tar mmjnk.other2
> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other2 differ
>
> # cpuset -l3 cp -aRx clang-cortexA53-installworld-poud.tar mmjnk.other3
>
> # cpuset -l3 diff clang-cortexA53-installworld-poud.tar mmjnk.other3
> Binary files clang-cortexA53-installworld-poud.tar and mmjnk.other3 differ
>
> Note that the final mmjnk.other2 was via cpu 2.
> Note that the mmjnk.other3 was via cpu 3.
> Note that the original mmjnk.other was without limiting the cpu usage.
>
> Then I went back and compared files that had not been written
> since the reboot and that had shown zeros earlier above. First
> I show some of the output of a prior zeros-producing compare:
>
> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other | more
> 1795768321 264 0
> 1795768322 167 0
> 1795768323 272 0
> 1795768324 6 0
> 1795768325 3 0
> 1795768326 370 0
> 1795768327 10 0
> 1795768328 112 0
> . . .
>
> (Yes, I did not lock down what cpu was to be used for the cmp -l
> usage in this activity. In the future I probably should experiment
> with that too.)
>
> The new comparison looked like:
>
> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other | more
> 1442340865 15 0
> 1442340866 245 0
> 1442340867 1 30
> 1442340868 1 353
> 1442340869 0 11
> 1442340870 100 17
> 1442340871 226 271
> 1442340872 31 125
> . . .
>
> Not all-zeros being presented on the right any more! And not
> the same offset either (so different left hand side data).
> (Some bytes are a match to the left side and so do not show a
> line overall.)
>
> So I looked at the new copy made under cpuset -l2 :
>
> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other2 | more
> 1442340865 15 0
> 1442340866 245 0
> 1442340867 1 30
> 1442340868 1 353
> 1442340869 0 11
> 1442340870 100 17
> 1442340871 226 271
> 1442340872 31 125
> . . .
>
> Same offset in this file and *same* values on the left and right.
> (Not just those shown above.)
>
> So I looked at the new copy made under cpuset -l3 :
>
> # cmp -l clang-cortexA53-installworld-poud.tar mmjnk.other3 | more
> 981008385 62 0
> 981008386 111 0
> 981008387 157 30
> 981008388 65 353
> 981008389 123 11
> 981008390 145 17
> 981008391 164 271
> 981008393 160 0
> . . .
>
> Different offset in this file but the *same* values on the right.
> (Not just those shown above.) The left values are different,
> matching up with the offset difference.
>
> (Some bytes are a match to the different data on the left and so
> do not show a line but the right side values appear to match the
> prior 2 examples even where lines disappear differently because
> of left-side content.)
>
> So, apparently, the same page of content was used for the right
> side material but at a different point in the comparison. (Lack
> of controlling the cpu used for cmp -l might be contributing?)
>
> Note: 1795768321 % 4096 == 1
> Note: 1442340865 % 4096 == 1
> Note: 981008385 % 4096 == 1
>
> cmp -l numbers bytes starting from 1, so the above all
> start exactly on 4096-byte boundaries.
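
The alignment arithmetic can be spelled out with a small sketch. It assumes a 4096-byte page size (matching the "% 4096" notes above); the function name page_of is hypothetical. Subtracting 1 converts the 1-based byte number reported by cmp -l into a 0-based file offset.

```shell
#!/bin/sh
PAGE_SIZE=4096   # assumed page/chunk size, matching the "% 4096" notes

# Print "<0-based page index> <offset within page>" for a 1-based
# byte number as reported by cmp -l.
page_of() {
    off=$(( $1 - 1 ))
    echo "$(( off / PAGE_SIZE )) $(( off % PAGE_SIZE ))"
}

# The three starting byte numbers from the comparisons above.
for n in 1795768321 1442340865 981008385; do
    echo "$n -> $(page_of "$n")"
done
```

An in-page offset of 0 for each of the three numbers means each run of bad data begins at the first byte of a 4096-byte page of its file.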
>
>
> Overall this indicates that an unmodified file can have
> its content appear to change and that multiple files
> got the same block of bad data showing up in their
> respective comparisons, just not always at the same
> offset in the files.
>
> I've no clue if the roles of "left" and "right" could
> swap. So far the right seems to be the one that gets
> the bad data.
>
Turns out that the combination of enabling the 3 GiByte
limitation in UEFI and not having D25219 applied in
the kernel avoids the problem.

I only used this combination in order to use
artifacts.ci.freebsd.org kernels (that do not have
D25219) in some other testing.

So, putting back my non-debug kernel that has
D25219 in it but leaving the 3 GiByte limit
in place in UEFI . . . Turns out that also
avoids the problem.

This suggests that maybe D25219 by itself is not
keeping everything in the memory range(s) that the
UEFI 3 GiByte limitation enforces internally: With
the limitation enforced, the problem disappears.
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)