kern.sync_on_panic

Mon Jun 27 19:44:02 UTC 2011

On Mon, Jun 27, 2011 at 3:08 AM, Andriy Gapon <avg at freebsd.org> wrote:
> on 26/06/2011 08:51 Warner Losh said the following:
>>
>> On Jun 25, 2011, at 8:49 AM, Andriy Gapon wrote:
>>> Does anybody actually use the kern.sync_on_panic tunable/sysctl? If yes, then
>>> in what circumstances do you need it? That is, why don't any of the
>>> alternatives work for you? Such as: 1. remounting filesystems R/O before panic if
>>> you knowingly provoke it for testing 2. using netboot for your test system
>>> 3. using su+j, gjournal or a different filesystem altogether 4. using fsck
>>> after reboot
>>>
>>> It seems to me that syncing filesystems in panic context is an adventure.
>>> And it may become even more of an adventure if we introduce code that
>>> completely stops the scheduler in and after panic.
>>
>> I've used it in the past when I was developing a device driver that was in
>> the late stages of maturing.  Since all the panics in the system happened
>> when that driver dereferenced NULL, sync was safe because all the data
>> structures were sane except those of the aforementioned driver.
>>
>> (1) It was a production system, and everything that could be was already
>> mounted r/o.  However, some small, but very critical, amount of data was
>> still r/w and it was very important to not lose this data.  Production here
>> likely should be in quotes, because it was in the late stages of
>> testing/validation.  The problem was that, without this, sometimes the saved state
>> of the GPS receiver and other hardware would wind up being zero, which meant
>> that we'd have to do a cold start which cost us a few hours of time.  At the
>> time I was doing this, we saw zero files a couple times a day without this
>> turned on. (2) netbooting wasn't an option since we were qualifying a
>> non-netbooting system. (3) these weren't available at the time, but the goal
>> was to prevent data loss, not necessarily to avoid fsck on boot. (4)
>> Data loss without it.
>>
>> Now, I'll be the first to admit this has been a few years, and I haven't done
>> a fresh evaluation to see if things are still safe.  I'll also be the first
>> to admit that this was a useful debugging setting late in development, and
>> not in production.  I'm also the first to admit this isn't what I'd call a
>> very wide-spread case.  But it did come in very handy when chasing a few bugs
>> to be able to do 10 panic/reboot cycles an hour rather than 2 a day.
>
> A fine enough use-case for me.  I guess the problem ultimately boiled down to
> peculiarities of UFS behavior, but still...
> However, please be aware that sync_on_panic might get broken when/if we start
> stopping the scheduler in panic.

The entirety of the sync code should be a subroutine in vfs_bio.c so
the 'buf' variable is static to the file.  At that point it would be
reasonable to explicitly call it at the beginning of panic(9) for the
sync-on-panic case, either before IPIing the other CPUs, or at least
before entering the critical section that prevents the scheduler from
running.
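
Roughly, the ordering I have in mind looks like the userland model below.
This is only a sketch: every name in it (bufs_sync_for_panic, model_panic,
the step enum) is made up for illustration and is not an existing kernel
symbol, and the "steps" just record the order in which they ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userland model only: every name here is hypothetical, not a kernel symbol. */
enum panic_step { STEP_SYNC, STEP_STOP_CPUS, STEP_STOP_SCHED, STEP_HALT };

#define MAX_STEPS 8
static enum panic_step trace[MAX_STEPS];  /* order in which steps ran */
static int ntrace;
static bool sync_on_panic = true;         /* stands in for kern.sync_on_panic */

static void
record(enum panic_step s)
{
    assert(ntrace < MAX_STEPS);
    trace[ntrace++] = s;
}

/* Assumed to live next to the buffer cache, so 'buf' can stay file-static. */
static void bufs_sync_for_panic(void)   { record(STEP_SYNC); }
static void stop_other_cpus_model(void) { record(STEP_STOP_CPUS); }
static void stop_scheduler_model(void)  { record(STEP_STOP_SCHED); }
static void halt_model(void)            { record(STEP_HALT); }

static void
model_panic(const char *msg)
{
    printf("panic: %s\n", msg);

    /* Sync first, while the other CPUs and the scheduler still run. */
    if (sync_on_panic)
        bufs_sync_for_panic();

    stop_other_cpus_model();   /* models IPIing the other CPUs */
    stop_scheduler_model();    /* models the scheduler-stopping section */
    halt_model();
}
```

Calling model_panic("test") from a main() with sync_on_panic set leaves
STEP_SYNC as the first entry in trace[]; with it cleared, the sync step is
skipped and the rest of the sequence is unchanged.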

Cheers,
matthew