ZFS Snapshots Not able to be accessed under .zfs/snapshot/name
dweimer
dweimer at dweimer.net
Mon Sep 9 13:08:18 UTC 2013
On 08/16/2013 8:49 am, dweimer wrote:
> On 08/15/2013 10:00 am, dweimer wrote:
>> On 08/14/2013 9:43 pm, Shane Ambler wrote:
>>> On 14/08/2013 22:57, dweimer wrote:
>>>> I have a few systems running on ZFS with a backup script that
>>>> creates snapshots, then backs up the .zfs/snapshot/name directory
>>>> to make sure open files are not missed. This had been working
>>>> great, but all of a sudden one of my systems stopped working. It
>>>> takes the snapshots fine, and zfs list -t snapshot shows them, but
>>>> an ls on the .zfs/snapshot/ directory returns "not a directory".
>>>>
>>>> part of the zfs list output:
>>>>
>>>> NAME                       USED  AVAIL  REFER  MOUNTPOINT
>>>> zroot                     4.48G  29.7G    31K  none
>>>> zroot/ROOT                2.92G  29.7G    31K  none
>>>> zroot/ROOT/91p5-20130812  2.92G  29.7G  2.92G  legacy
>>>> zroot/home                 144K  29.7G   122K  /home
>>>>
>>>> part of the zfs list -t snapshot output:
>>>>
>>>> NAME                                            USED  AVAIL  REFER  MOUNTPOINT
>>>> zroot/ROOT/91p5-20130812@91p5-20130812--bsnap   340K      -  2.92G  -
>>>> zroot/home@home--bsnap                           22K      -   122K  -
>>>>
>>>> ls /.zfs/snapshot/91p5-20130812--bsnap/
>>>> does work right now, since the last reboot, but it wasn't always
>>>> working; this is my boot environment.
>>>>
>>>> If I do ls /home/.zfs/snapshot/, the result is:
>>>> ls: /home/.zfs/snapshot/: Not a directory
>>>>
>>>> If I do ls /home/.zfs, the result is:
>>>> ls: snapshot: Bad file descriptor
>>>> shares
>>>>
>>>> I have tried zpool scrub zroot and no errors were found. If I
>>>> reboot the system I can get one good backup, then I start having
>>>> problems. Has anyone else ever run into this? Any suggestions for
>>>> a fix?
>>>>
>>>> The system is running FreeBSD 9.1-RELEASE-p5 #1 r253764: Mon Jul
>>>> 29 15:07:35 CDT 2013; the zpool is version 28 and zfs is version 5.
>>>>
>>>
>>>
>>> I can say I've had this problem, though I'm not certain what fixed
>>> it. I do remember that I decided to stop snapshotting when I
>>> couldn't access the snapshots, and I deleted the existing ones. I
>>> later restarted the machine before I went back for another look,
>>> and they were working.
>>>
>>> So my guess is that a restart with no existing snapshots may be the
>>> key.
>>>
>>> Now if only we could find out what started the issue, so we can
>>> stop it happening again.
>>
>> I had actually rebooted it last night, prior to seeing this message,
>> and I know it didn't have any snapshots this time. Since I am
>> booting from ZFS using boot environments, I may still have had an
>> older boot environment on the system the last time it was rebooted.
>> Backups ran great last night after the reboot, and I was able to
>> kick off my pre-backup job and access all the snapshots today.
>> Hopefully it doesn't come back, but if it does I will see if I can
>> find anything else wrong.
>>
>> FYI, it didn't shut down cleanly, so in case this helps anyone find
>> the issue, this is from my system logs:
>> Aug 14 22:08:04 cblproxy1 kernel:
>> Aug 14 22:08:04 cblproxy1 kernel: Fatal trap 12: page fault while in kernel mode
>> Aug 14 22:08:04 cblproxy1 kernel: cpuid = 0; apic id = 00
>> Aug 14 22:08:04 cblproxy1 kernel: fault virtual address = 0xa8
>> Aug 14 22:08:04 cblproxy1 kernel: fault code = supervisor write data, page not present
>> Aug 14 22:08:04 cblproxy1 kernel: instruction pointer = 0x20:0xffffffff808b0562
>> Aug 14 22:08:04 cblproxy1 kernel: stack pointer = 0x28:0xffffff80002238f0
>> Aug 14 22:08:04 cblproxy1 kernel: frame pointer = 0x28:0xffffff8000223910
>> Aug 14 22:08:04 cblproxy1 kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
>> Aug 14 22:08:04 cblproxy1 kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
>> Aug 14 22:08:04 cblproxy1 kernel: processor eflags = interrupt enabled, resume, IOPL = 0
>> Aug 14 22:08:04 cblproxy1 kernel: current process = 1 (init)
>> Aug 14 22:08:04 cblproxy1 kernel: trap number = 12
>> Aug 14 22:08:04 cblproxy1 kernel: panic: page fault
>> Aug 14 22:08:04 cblproxy1 kernel: cpuid = 0
>> Aug 14 22:08:04 cblproxy1 kernel: KDB: stack backtrace:
>> Aug 14 22:08:04 cblproxy1 kernel: #0 0xffffffff808ddaf0 at kdb_backtrace+0x60
>> Aug 14 22:08:04 cblproxy1 kernel: #1 0xffffffff808a951d at panic+0x1fd
>> Aug 14 22:08:04 cblproxy1 kernel: #2 0xffffffff80b81578 at trap_fatal+0x388
>> Aug 14 22:08:04 cblproxy1 kernel: #3 0xffffffff80b81836 at trap_pfault+0x2a6
>> Aug 14 22:08:04 cblproxy1 kernel: #4 0xffffffff80b80ea1 at trap+0x2a1
>> Aug 14 22:08:04 cblproxy1 kernel: #5 0xffffffff80b6c7b3 at calltrap+0x8
>> Aug 14 22:08:04 cblproxy1 kernel: #6 0xffffffff815276da at zfsctl_umount_snapshots+0x8a
>> Aug 14 22:08:04 cblproxy1 kernel: #7 0xffffffff81536766 at zfs_umount+0x76
>> Aug 14 22:08:04 cblproxy1 kernel: #8 0xffffffff809340bc at dounmount+0x3cc
>> Aug 14 22:08:04 cblproxy1 kernel: #9 0xffffffff8093c101 at vfs_unmountall+0x71
>> Aug 14 22:08:04 cblproxy1 kernel: #10 0xffffffff808a8eae at kern_reboot+0x4ee
>> Aug 14 22:08:04 cblproxy1 kernel: #11 0xffffffff808a89c0 at kern_reboot+0
>> Aug 14 22:08:04 cblproxy1 kernel: #12 0xffffffff80b81dab at amd64_syscall+0x29b
>> Aug 14 22:08:04 cblproxy1 kernel: #13 0xffffffff80b6ca9b at Xfast_syscall+0xfb
>
> Well, it's back; 3 of the 8 file systems I am taking snapshots of
> failed in last night's backups.
>
> The only thing different on this system from the 4 others I have
> running is that it has a second disk volume with a UFS file system.
>
> The setup is 2 disks, both partitioned with gpart:
> => 34 83886013 da0 GPT (40G)
> 34 256 1 boot0 (128k)
> 290 10485760 2 swap0 (5.0G)
> 10486050 73399997 3 zroot0 (35G)
>
> => 34 41942973 da1 GPT (20G)
> 34 41942973 1 squid1 (20G)
>
> I didn't want the Squid cache directory on ZFS; the system runs on an
> ESX 4.1 server backed by an iSCSI SAN. I have 4 other servers running
> on the same group of ESX servers and SAN, booting from ZFS, without
> this problem. Two of the other 4 are also running Squid, but they
> forward to this one, so they run without a local disk cache.
A quick update on this, in case anyone else runs into it: on the 2nd of
this month I finally deleted my UFS volume and created a new ZFS volume
to replace it. I recreated the Squid cache directories and let Squid
start rebuilding its cache. So far there hasn't been a noticeable
performance impact from the switch, and the snapshot problem has not
recurred since making the change. It's only been a week of running this
way, but before, the problem started within 36-48 hours.
--
Thanks,
Dean E. Weimer
http://www.dweimer.net/