Panic in ZFS during zfs recv (while snapshots being destroyed)

Karl Denninger karl at denninger.net
Wed Dec 30 15:31:57 UTC 2015


I believe that the latest revision of my patch fixes this; the underlying
problem exists on both patched and unpatched systems.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594

The root cause appears to be that, while it should not be possible for
dirty >= dirty_data_max in either the patched or unpatched case, it can
in fact occur.  When it does, the delay computation either traps
immediately on an integer divide-by-zero (when the two values are equal)
or, when dirty exceeds the maximum, the unsigned subtraction underflows
and the computed delay becomes zero.  The divide-by-zero of course
produces an immediate panic, while the zero delay corrupts the running
system and results in a later panic.  Some other reported cases of what I
believe is the same problem show a divide-by-zero trap even though, when
the variables in question are examined, the two operands do not produce a
difference of zero.
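
To make the failure mode concrete, here is a small, self-contained sketch
of the arithmetic hazard.  It only mirrors the shape of the delay
calculation (a divisor of dirty_data_max - dirty, as in dmu_tx_delay());
the names, types and values are simplified for illustration and are not
the actual kernel source:

#include <stdio.h>
#include <stdint.h>

/*
 * Stand-in for the transaction-delay calculation, which divides by
 * (dirty_data_max - dirty) and therefore assumes dirty < dirty_data_max.
 */
static uint64_t
delay_sketch(uint64_t dirty, uint64_t dirty_data_max, uint64_t scale)
{
        /*
         * A sentinel of the sort described below would check here:
         * dirty == max  -> divisor is 0, immediate divide-by-zero trap
         * dirty  > max  -> unsigned subtraction wraps to a huge divisor,
         *                  so the computed delay silently collapses to 0
         */
        return (scale * dirty / (dirty_data_max - dirty));
}

int
main(void)
{
        uint64_t max = 4ULL << 30;      /* pretend 4GB dirty_data_max */

        /* normal case: a sane, non-zero delay */
        printf("%ju\n", (uintmax_t)delay_sketch(max - 1024, max, 1000));
        /* dirty > max: prints 0 -- no throttling, corruption shows up later */
        printf("%ju\n", (uintmax_t)delay_sketch(max + 1024, max, 1000));
        /* delay_sketch(max, max, 1000) would trap on a divide-by-zero */
        return (0);
}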

I am hesitant to declare that this change fixes it, since I cannot
provoke the panic "on demand"; in my case it only occurs under heavy
filesystem load that includes snapshot activity.  Another reported
instance occurred during /etc/periodic runs against a huge number of
jails at the same time, and is likely related.  But since putting the
change in on 10 December, the sentinels I added to flag the
previously-identified corrupt state have not fired, nor have I had a
panic occur.

If you've run into a panic in the zfs code during very heavy and unusual
use, I'd be very interested in whether the panics stop when you run the
last of the above patches; it should apply against 10-STABLE of
approximately r289078 or later.

Thanks;

On 10/23/2015 07:10, Karl Denninger wrote:
> While continuing to run this down I think I've managed to duplicate the
> state that produces the panic, but I don't know exactly how or why.
>
> The script (modified to check all returns) now produced this on a test case:
>
> ......
> receiving incremental stream of
> zsr/R/10.2-STABLE@zfs-auto-snap_frequent-2015-10-23-04h00 into
> backup/R/10.2-STABLE@zfs-auto-snap_frequent-2015-10-23-04h00
> received 559MB stream in 22 seconds (25.4MB/sec)
> receiving incremental stream of
> zsr/R/10.2-STABLE@zfs-auto-snap_hourly-2015-10-23-04h04 into
> backup/R/10.2-STABLE@zfs-auto-snap_hourly-2015-10-23-04h04
> received 4.25MB stream in 1 seconds (4.25MB/sec)
> receiving incremental stream of
> zsr/R/10.2-STABLE@zfs-auto-snap_frequent-2015-10-23-05h00 into
> backup/R/10.2-STABLE@zfs-auto-snap_frequent-2015-10-23-05h00
> received 12.5MB stream in 1 seconds (12.5MB/sec)
> receiving incremental stream of
> zsr/R/10.2-STABLE@zfs-auto-snap_frequent-2015-10-23-06h00 into
> backup/R/10.2-STABLE@zfs-auto-snap_frequent-2015-10-23-06h00
> received 13.4MB stream in 1 seconds (13.4MB/sec)
> receiving incremental stream of zsr/R/10.2-STABLE@zfs-base into
> backup/R/10.2-STABLE@zfs-base
> received 14.8MB stream in 1 seconds (14.8MB/sec)
> *will destroy zsr/R/10.2-STABLE@zfs-old*
> *will reclaim 6.50M*
> *cannot destroy snapshot zsr/R/10.2-STABLE@zfs-old: dataset is busy*
> *Snapshot destroy FAILED with code 1*
>
> And there it stopped, as I have it trapped.
>
> But, there's nothing holding that dataset open:
>
> root@NewFS:~ # zfs holds zsr/R/10.2-STABLE@zfs-old
> NAME                       TAG  TIMESTAMP
>
> There is also no send or receive command still running, and an attempt
> to force (or defer) the destroy fails too:
>
> root@NewFS:~ # zfs destroy zsr/R/10.2-STABLE@zfs-old
> cannot destroy snapshot zsr/R/10.2-STABLE@zfs-old: dataset is busy
> root@NewFS:~ # zfs destroy -d zsr/R/10.2-STABLE@zfs-old
> cannot destroy snapshot zsr/R/10.2-STABLE@zfs-old: dataset is busy
> root@NewFS:~ # zfs destroy -f zsr/R/10.2-STABLE@zfs-old
> cannot destroy snapshot zsr/R/10.2-STABLE@zfs-old: dataset is busy
>
> Parameters:
>
> root@NewFS:~ # zfs get all zsr/R/10.2-STABLE@zfs-old
> NAME                       PROPERTY               VALUE                  SOURCE
> zsr/R/10.2-STABLE@zfs-old  type                   snapshot               -
> zsr/R/10.2-STABLE@zfs-old  creation               Thu Oct 22 10:14 2015  -
> zsr/R/10.2-STABLE@zfs-old  used                   6.50M                  -
> zsr/R/10.2-STABLE@zfs-old  referenced             25.7G                  -
> zsr/R/10.2-STABLE@zfs-old  compressratio          1.86x                  -
> zsr/R/10.2-STABLE@zfs-old  devices                on                     default
> zsr/R/10.2-STABLE@zfs-old  exec                   on                     default
> zsr/R/10.2-STABLE@zfs-old  setuid                 on                     default
> zsr/R/10.2-STABLE@zfs-old  xattr                  off                    temporary
> zsr/R/10.2-STABLE@zfs-old  version                5                      -
> zsr/R/10.2-STABLE@zfs-old  utf8only               off                    -
> zsr/R/10.2-STABLE@zfs-old  normalization          none                   -
> zsr/R/10.2-STABLE@zfs-old  casesensitivity        sensitive              -
> zsr/R/10.2-STABLE@zfs-old  nbmand                 off                    default
> zsr/R/10.2-STABLE@zfs-old  primarycache           all                    default
> zsr/R/10.2-STABLE@zfs-old  secondarycache         all                    default
> zsr/R/10.2-STABLE@zfs-old  defer_destroy          off                    -
> zsr/R/10.2-STABLE@zfs-old  userrefs               0                      -
> zsr/R/10.2-STABLE@zfs-old  mlslabel                                      -
> zsr/R/10.2-STABLE@zfs-old  refcompressratio       1.86x                  -
> zsr/R/10.2-STABLE@zfs-old  written                169M                   -
> zsr/R/10.2-STABLE@zfs-old  clones                                        -
> zsr/R/10.2-STABLE@zfs-old  logicalused            0                      -
> zsr/R/10.2-STABLE@zfs-old  logicalreferenced      42.7G                  -
> zsr/R/10.2-STABLE@zfs-old  volmode                default                default
> zsr/R/10.2-STABLE@zfs-old  com.sun:auto-snapshot  true                   inherited from zsr/R/10.2-STABLE
>
> Nothing here that looks like a problem; specifically, no clones.
>
> If I run the script again and it attempts recovery (since zfs-old is
> present), the machine panics with the trace below.  Once the machine has
> panic'd I can remove the snapshot.
>
> This looks like some sort of bug in ZFS's internal state -- but the
> questions are: now that I'm in this state, how do I get out of it without
> a crash, and why did it happen in the first place, given that I can't find
> any reason for the snapshot to be non-removable (e.g. an active clone)?
>
> Ideas or anything that would help find the source of the problem using zdb?
>
> On 10/20/2015 10:51, Karl Denninger wrote:
>> More data on this crash from this morning; I caught it in-process this
>> time and know exactly where it was in the backup script when it detonated.
>>
>> Here's the section of the script that was running when it blew up:
>>
>> for i in $ZFS
>> do
>>         echo Processing $i
>>
>>         FILESYS=`echo $i|cut -d/ -f2`
>>
>>         # No zfs-base snapshot yet?  Take one and send a full copy.
>>         zfs list $i@zfs-base >/dev/null 2>&1
>>         result=$?
>>         if [ $result -eq 1 ];
>>         then
>>                 echo "Make and send zfs-base snapshot"
>>                 zfs snapshot -r $i@zfs-base
>>                 zfs send -R $i@zfs-base | zfs receive -Fuvd $BACKUP
>>         else
>>                 base_only=`zfs get -H com.backup:base-only $i|grep true`
>>                 result=$?
>>                 if [ $result -eq 1 ];
>>                 then
>>                         echo "Bring incremental backup up to date"
>>                         # If zfs-old is still present a prior run was interrupted; resume it
>>                         old_present=`zfs list $i@zfs-old >/dev/null 2>&1`
>>                         old=$?
>>                         if [ $old -eq 0 ];
>>                         then
>>                                 echo "Previous incremental sync was interrupted; resume"
>>                         else
>>                                 zfs rename -r $i@zfs-base $i@zfs-old
>>                                 zfs snapshot -r $i@zfs-base
>>                         fi
>>                         # Only destroy zfs-old on both sides if the incremental send succeeded
>>                         zfs send -RI $i@zfs-old $i@zfs-base | zfs receive -Fudv $BACKUP
>>                         result=$?
>>                         if [ $result -eq 0 ];
>>                         then
>>                                 zfs destroy -r $i@zfs-old
>>                                 zfs destroy -r $BACKUP/$FILESYS@zfs-old
>>                         else
>>                                 echo "STOP - - ERROR RUNNING COPY on $i!!!!"
>>                                 exit 1
>>                         fi
>>                 else
>>                         echo "Base-only backup configured for $i"
>>                 fi
>>         fi
>>         echo
>> done
>>
>> And the output on the console when it happened:
>>
>> Begin local ZFS backup by SEND
>> Run backups of default [zsr/R/10.2-STABLE zsr/ssd-txlog zs/archive
>> zs/colo-archive zs/disk zs/pgsql zs/pics dbms/ticker]
>> Tue Oct 20 10:14:09 CDT 2015
>>
>> Import backup pool
>> Imported; ready to proceed
>> Processing zsr/R/10.2-STABLE
>> Bring incremental backup up to date
>> _*Previous incremental sync was interrupted; resume*_
>> attempting destroy backup/R/10.2-STABLE@zfs-auto-snap_daily-2015-10-13-00h07
>> success
>> attempting destroy
>> backup/R/10.2-STABLE@zfs-auto-snap_hourly-2015-10-18-12h04
>> success
>> *local fs backup/R/10.2-STABLE does not have fromsnap (zfs-old in stream);*
>> *must have been deleted locally; ignoring*
>> receiving incremental stream of
>> zsr/R/10.2-STABLE@zfs-auto-snap_daily-2015-10-18-00h07 into
>> backup/R/10.2-STABLE@zfs-auto-snap_daily-2015-10-18-00h07
>> snap backup/R/10.2-STABLE@zfs-auto-snap_daily-2015-10-18-00h07 already
>> exists; ignoring
>> received 0B stream in 1 seconds (0B/sec)
>> receiving incremental stream of
>> zsr/R/10.2-STABLE@zfs-auto-snap_weekly-2015-10-18-00h14 into
>> backup/R/10.2-STABLE@zfs-auto-snap_weekly-2015-10-18-00h14
>> snap backup/R/10.2-STABLE@zfs-auto-snap_weekly-2015-10-18-00h14 already
>> exists; ignoring
>> received 0B stream in 1 seconds (0B/sec)
>> receiving incremental stream of zsr/R/10.2-STABLE@zfs-base into
>> backup/R/10.2-STABLE@zfs-base
>> snap backup/R/10.2-STABLE@zfs-base already exists; ignoring
>>
>> Those bolded lines should _*not*_ be there.  They mean that the snapshot
>> "zfs-old" is still on the source volume, but it shouldn't be, because
>> after the copy completes we destroy it on both the source and the
>> destination.  Further, although the snapshot name (zfs-old) was found by
>> zfs list, when the incremental send was attempted /the data allegedly
>> associated with the phantom snapshot that shouldn't exist was not there/
>> (second bolded line).
>>
>>                         zfs send -RI $i@zfs-old $i@zfs-base | zfs receive -Fudv $BACKUP
>>                         result=$?
>>                         if [ $result -eq 0 ];
>>                         then
>>                                 *zfs destroy -r $i@zfs-old*
>>                                 *zfs destroy -r $BACKUP/$FILESYS@zfs-old*
>>                         else
>>                                 echo "STOP - - ERROR RUNNING COPY on $i!!!!"
>>                                 exit 1
>>                         fi
>>
>> I don't have the trace from the last run (I didn't save it), *but it
>> returned no errors*; it was run by hand (from the console) while I was
>> watching it.
>>
>> This STRONGLY implies that the zfs destroy /allegedly/ succeeded (it ran
>> without an error and actually removed the snapshot data) but left the
>> snapshot _*name*_ on the volume, where the script was able to find it;
>> then, when the backup script referenced that name in the incremental
>> send, the dataset was invalid, producing the resulting panic.
>>
>> The bad news is that the pool shows no faults, and a scrub which took
>> place a few days ago says the pool is fine.  If I re-run the backup I
>> suspect it will complete properly (it always has in the past, as a
>> resume of the interrupted one).  But clearly, whatever is going on here
>> looks like some sort of race in the destroy commands (which _*are*_
>> being run recursively) that leaves the snapshot name on the volume while
>> releasing the data stored in it -- hence the panic when the name is then
>> referenced.
>>
>> I have NOT run a scrub on the pool in an attempt to preserve whatever
>> evidence may be there; if there is something that I can look at with zdb
>> or similar I'll leave this alone for a bit -- the machine came back up
>> and is running fine.
>>
>> This is a production system but I can take it down off-hours for a short
>> time if there is a need to run something in single-user mode using zdb
>> to try to figure out what's going on.  There have been no known hardware
>> issues with it and all the other pools (including the ones on the same
>> host adapter) are and have been running without incident.
>>
>> Ideas?
>>
>>
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 9; apic id = 21
>> fault virtual address   = 0x378
>> fault code              = supervisor read data, page not present
>> instruction pointer     = 0x20:0xffffffff80940ae0
>> stack pointer           = 0x28:0xfffffe0668018680
>> frame pointer           = 0x28:0xfffffe0668018700
>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>                         = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags        = interrupt enabled, resume, IOPL = 0
>> current process         = 70921 (zfs)
>> trap number             = 12
>> panic: page fault
>> cpuid = 9
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
>> 0xfffffe0668018160
>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0668018210
>> vpanic() at vpanic+0x126/frame 0xfffffe0668018250
>> panic() at panic+0x43/frame 0xfffffe06680182b0
>> trap_fatal() at trap_fatal+0x36b/frame 0xfffffe0668018310
>> trap_pfault() at trap_pfault+0x2ed/frame 0xfffffe06680183b0
>> trap() at trap+0x47a/frame 0xfffffe06680185c0
>> calltrap() at calltrap+0x8/frame 0xfffffe06680185c0
>> --- trap 0xc, rip = 0xffffffff80940ae0, rsp = 0xfffffe0668018680, rbp =
>> 0xfffffe0668018700 ---
>> *__mtx_lock_sleep() at __mtx_lock_sleep+0x1b0/frame 0xfffffe0668018700*
>> *dounmount() at dounmount+0x58/frame 0xfffffe0668018780*
>> *zfs_unmount_snap() at zfs_unmount_snap+0x114/frame 0xfffffe06680187a0*
>> *dsl_dataset_user_release_impl() at dsl_dataset_user_release_impl+0x103/frame 0xfffffe0668018920*
>> *dsl_dataset_user_release_onexit() at dsl_dataset_user_release_onexit+0x5c/frame 0xfffffe0668018940*
>> *zfs_onexit_destroy() at zfs_onexit_destroy+0x56/frame 0xfffffe0668018970*
>> *zfsdev_close() at zfsdev_close+0x52/frame 0xfffffe0668018990*
>> devfs_fpdrop() at devfs_fpdrop+0xa9/frame 0xfffffe06680189b0
>> devfs_close_f() at devfs_close_f+0x45/frame 0xfffffe06680189e0
>> _fdrop() at _fdrop+0x29/frame 0xfffffe0668018a00
>> closef() at closef+0x21e/frame 0xfffffe0668018a90
>> closefp() at closefp+0x98/frame 0xfffffe0668018ad0
>> amd64_syscall() at amd64_syscall+0x35d/frame 0xfffffe0668018bf0
>> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0668018bf0
>> --- syscall (6, FreeBSD ELF64, sys_close), rip = 0x801a01f5a, rsp =
>> 0x7fffffffd1e8, rbp = 0x7fffffffd200 ---
>> Uptime: 5d0h58m0s
>>
>>
>>
>> On 10/15/2015 09:27, Karl Denninger wrote:
>>> Got another one of these this morning, after a long absence...
>>>
>>> Same symptom -- it happened during a backup (send/receive) and it's in
>>> the same place -- when the snapshot is unmounted.  I have a clean dump,
>>> and this is against a quite recent checkout, so the most recently
>>> rolled-forward ZFS changes are in this kernel.
>>>
>>>
>>> Fatal trap 12: page fault while in kernel mode
>>> cpuid = 14; apic id = 34
>>> fault virtual address   = 0x378
>>> fault code              = supervisor read data, page not present
>>> instruction pointer     = 0x20:0xffffffff80940ae0
>>> stack pointer           = 0x28:0xfffffe066796c680
>>> frame pointer           = 0x28:0xfffffe066796c700
>>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>>                         = DPL 0, pres 1, long 1, def32 0, gran 1
>>> processor eflags        = interrupt enabled, resume, IOPL = 0
>>> current process         = 81187 (zfs)
>>> trap number             = 12
>>> panic: page fault
>>> cpuid = 14
>>> KDB: stack backtrace:
>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
>>> 0xfffffe066796c160
>>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe066796c210
>>> vpanic() at vpanic+0x126/frame 0xfffffe066796c250
>>> panic() at panic+0x43/frame 0xfffffe066796c2b0
>>> trap_fatal() at trap_fatal+0x36b/frame 0xfffffe066796c310
>>> trap_pfault() at trap_pfault+0x2ed/frame 0xfffffe066796c3b0
>>> trap() at trap+0x47a/frame 0xfffffe066796c5c0
>>> calltrap() at calltrap+0x8/frame 0xfffffe066796c5c0
>>> --- trap 0xc, rip = 0xffffffff80940ae0, rsp = 0xfffffe066796c680, rbp = 0xfffffe066796c700 ---
>>> __mtx_lock_sleep() at __mtx_lock_sleep+0x1b0/frame 0xfffffe066796c700
>>> dounmount() at dounmount+0x58/frame 0xfffffe066796c780
>>> zfs_unmount_snap() at zfs_unmount_snap+0x114/frame 0xfffffe066796c7a0
>>> dsl_dataset_user_release_impl() at dsl_dataset_user_release_impl+0x103/frame 0xfffffe066796c920
>>> dsl_dataset_user_release_onexit() at dsl_dataset_user_release_onexit+0x5c/frame 0xfffffe066796c940
>>> zfs_onexit_destroy() at zfs_onexit_destroy+0x56/frame 0xfffffe066796c970
>>> zfsdev_close() at zfsdev_close+0x52/frame 0xfffffe066796c990
>>> devfs_fpdrop() at devfs_fpdrop+0xa9/frame 0xfffffe066796c9b0
>>> devfs_close_f() at devfs_close_f+0x45/frame 0xfffffe066796c9e0
>>> _fdrop() at _fdrop+0x29/frame 0xfffffe066796ca00
>>> closef() at closef+0x21e/frame 0xfffffe066796ca90
>>> closefp() at closefp+0x98/frame 0xfffffe066796cad0
>>> amd64_syscall() at amd64_syscall+0x35d/frame 0xfffffe066796cbf0
>>> Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe066796cbf0
>>> --- syscall (6, FreeBSD ELF64, sys_close), rip = 0x801a01f5a, rsp =
>>> 0x7fffffffd728, rbp = 0x7fffffffd740 ---
>>> Uptime: 3d15h37m26s
>>>
>>>
>>> On 8/27/2015 15:44, Karl Denninger wrote:
>>>> No, but that does sound like it might be involved.....
>>>>
>>>> And yeah, this did start when I moved the root pool to a mirrored pair
>>>> of Intel 530s off a pair of spinning-rust WD RE4s.... (The 530s are darn
>>>> nice performance-wise, reasonably inexpensive and thus very suitable for
>>>> a root filesystem drive and they also pass the "pull the power cord"
>>>> test, incidentally.)
>>>>
>>>> You may be onto something -- I'll try shutting it off.  But since I
>>>> can't make this happen on demand and it's an "every week or two" panic
>>>> (though ALWAYS while the zfs send | zfs recv is running, and always on
>>>> the same filesystem), it will be a fair while before I know whether
>>>> it's fixed -- likely over a month, given the usual pattern here, as
>>>> that would be 4 "average" periods without a panic.....
>>>>
>>>> I also wonder if I could tune this out with some of the other TRIM
>>>> parameters instead of losing it entirely.
>>>>
>>>> vfs.zfs.trim.max_interval: 1
>>>> vfs.zfs.trim.timeout: 30
>>>> vfs.zfs.trim.txg_delay: 32
>>>> vfs.zfs.trim.enabled: 1
>>>> vfs.zfs.vdev.trim_max_pending: 10000
>>>> vfs.zfs.vdev.trim_max_active: 64
>>>> vfs.zfs.vdev.trim_min_active: 1
>>>>
>>>> That it's panic'ing in mtx_lock_sleep might point this way.... the
>>>> trace shows it coming from zfs_onexit_destroy(), which ends up calling
>>>> zfs_unmount_snap(), and it then blows up in dounmount() while executing
>>>> mtx_lock_sleep().
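>>>>
>>>> Worth noting: both traces fault on virtual address 0x378.  A small
>>>> fault address like that is what you see when code reads a member that
>>>> lives 0x378 bytes into a structure through a NULL (or otherwise bogus)
>>>> base pointer -- for example, if dounmount() were handed a struct mount
>>>> that is no longer valid and tried to take a mutex embedded in it.  A
>>>> tiny, purely illustrative sketch (the layout is made up, not the real
>>>> struct mount; only the offset arithmetic is the point):
>>>>
>>>> #include <stdio.h>
>>>> #include <stddef.h>
>>>>
>>>> /* Hypothetical layout: a member placed 0x378 bytes into the struct. */
>>>> struct fake_mount {
>>>>         char    mnt_pad[0x378];
>>>>         long    mnt_lock_word;  /* imagine the mutex lives here */
>>>> };
>>>>
>>>> int
>>>> main(void)
>>>> {
>>>>         /* Prints 0x378: with a NULL base pointer, this member's
>>>>          * offset is exactly the faulting address the trap reports. */
>>>>         printf("offset = %#zx\n",
>>>>             offsetof(struct fake_mount, mnt_lock_word));
>>>>         return (0);
>>>> }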
>>>>
>>>> I do wonder if I'm begging for new and innovative performance issues if
>>>> I run with TRIM off for an extended period of time, however..... :-)
>>>>
>>>>
>>>> On 8/27/2015 15:30, Sean Chittenden wrote:
>>>>> Have you tried disabling TRIM?  We recently ran into an issue where a `zfs delete` on a large dataset caused the host to panic because TRIM was tripping over the ZFS deadman timer.  Disabling TRIM worked as a valid workaround for us.  You mentioned a recent move to SSDs, so this can happen, especially after the drive has experienced a little bit of actual work.  -sc
>>>>>
>>>>>
>>>>> --
>>>>> Sean Chittenden
>>>>> sean at chittenden.org
>>>>>
>>>>>
>>>>>> On Aug 27, 2015, at 13:22, Karl Denninger <karl at denninger.net> wrote:
>>>>>>
>>>>>> On 8/15/2015 12:38, Karl Denninger wrote:
>>>>>>> Update:
>>>>>>>
>>>>>>> This /appears /to be related to attempting to send or receive a
>>>>>>> /cloned /snapshot.
>>>>>>>
>>>>>>> I use /beadm /to manage boot environments and the crashes have all
>>>>>>> come while send/recv-ing the root pool, which is the one where these
>>>>>>> clones get created.  It is /not /consistent within a given snapshot
>>>>>>> when it crashes and a second attempt (which does a "recovery"
>>>>>>> send/receive) succeeds every time -- I've yet to have it panic twice
>>>>>>> sequentially.
>>>>>>>
>>>>>>> I surmise that the problem comes about when a file in the cloned
>>>>>>> snapshot is modified, but this is a guess at this point.
>>>>>>>
>>>>>>> I'm going to try to force replication of the problem on my test system.
>>>>>>>
>>>>>>> On 7/31/2015 04:47, Karl Denninger wrote:
>>>>>>>> I have an automated script that runs zfs send/recv copies to bring a
>>>>>>>> backup data set into congruence with the running copies nightly.  The
>>>>>>>> source has automated snapshots running on a fairly frequent basis
>>>>>>>> through zfs-auto-snapshot.
>>>>>>>>
>>>>>>>> Recently I have started having a panic show up about once a week during
>>>>>>>> the backup run, but it's inconsistent.  It is in the same place, but I
>>>>>>>> cannot force it to repeat.
>>>>>>>>
>>>>>>>> The trap itself is a page fault in kernel mode in the zfs code at
>>>>>>>> zfs_unmount_snap(); here's the traceback from the kvm (sorry for the
>>>>>>>> image link but I don't have a better option right now.)
>>>>>>>>
>>>>>>>> I'll try to get a dump, this is a production machine with encrypted swap
>>>>>>>> so it's not normally turned on.
>>>>>>>>
>>>>>>>> Note that the pool that appears to be involved (the backup pool) has
>>>>>>>> passed a scrub and thus I would assume the on-disk structure is ok.....
>>>>>>>> but that might be an unfair assumption.  It is always occurring in the
>>>>>>>> same dataset although there are a half-dozen that are sync'd -- if this
>>>>>>>> one (the first one) successfully completes during the run then all the
>>>>>>>> rest will as well (that is, whenever I restart the process it has always
>>>>>>>> failed here.)  The source pool is also clean and passes a scrub.
>>>>>>>>
>>>>>>>> traceback is at http://www.denninger.net/kvmimage.png; apologies for the
>>>>>>>> image traceback but this is coming from a remote KVM.
>>>>>>>>
>>>>>>>> I first saw this on 10.1-STABLE and it is still happening on FreeBSD
>>>>>>>> 10.2-PRERELEASE #9 r285890M, which I updated to in an attempt to see if
>>>>>>>> the problem was something that had been addressed.
>>>>>>>>
>>>>>>>>
>>>>>>> -- 
>>>>>>> Karl Denninger
>>>>>>> karl at denninger.net <mailto:karl at denninger.net>
>>>>>>> /The Market Ticker/
>>>>>>> /[S/MIME encrypted email preferred]/
>>>>>> Second update: I have now taken another panic on 10.2-Stable, same deal,
>>>>>> but without any cloned snapshots in the source image. I had thought that
>>>>>> removing cloned snapshots might eliminate the issue; that is now out the
>>>>>> window.
>>>>>>
>>>>>> It ONLY happens on this one filesystem (the root one, incidentally)
>>>>>> which is fairly-recently created as I moved this machine from spinning
>>>>>> rust to SSDs for the OS and root pool -- and only when it is being
>>>>>> backed up by using zfs send | zfs recv (with the receive going to a
>>>>>> different pool in the same machine.)  I have yet to be able to provoke
>>>>>> it when using zfs send to copy to a different machine on the same LAN,
>>>>>> but given that it is not able to be reproduced on demand I can't be
>>>>>> certain it's timing related (e.g. performance between the two pools in
>>>>>> question) or just that I haven't hit the unlucky combination.
>>>>>>
>>>>>> This looks like some sort of race condition and I will continue to see
>>>>>> if I can craft a case to make it occur "on demand"
>>>>>>
>>>>>> -- 
>>>>>> Karl Denninger
>>>>>> karl at denninger.net <mailto:karl at denninger.net>
>>>>>> /The Market Ticker/
>>>>>> /[S/MIME encrypted email preferred]/
>>>>>
>>> -- 
>>> Karl Denninger
>>> karl at denninger.net <mailto:karl at denninger.net>
>>> /The Market Ticker/
>>> /[S/MIME encrypted email preferred]/

-- 
Karl Denninger
karl at denninger.net <mailto:karl at denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/