Debugging pseudo-disk driver on FreeBSD

Mon May 3 12:56:26 PDT 2004

On Sun, 2 May 2004, Allan Fields wrote:

> On Sun, May 02, 2004 at 12:41:56AM -0600, Siddharth Aggarwal wrote:
> >
> > Hi,
> >
> > I am working on a Copy on Write disk driver on FreeBSD where I try to save
> > the state of a filesystem (/dev/ad0s3) to another device (/dev/ad0s4) by
> > making a virtual device that sits on top of these two (/dev/shd0).
> >
> > 1. In the strategy routine, I get the block read/write calls to
> > /dev/shd0.
> > 2. For a write operation, I copy the previous contents of the block
> > (number corresponding to /dev/ad0s3) onto a free block on /dev/ad0s4.
> > 3. To restore the previous contents of the disk, I read the allocated
> > free block from /dev/ad0s4 and write it back to the original block
> > number on /dev/ad0s3.
> >
> > The virtual device /dev/shd0 is mounted on /mnt
> >
> > To test it out, my /dev/ad0s3 originally had a file "old1" of 13685
> > bytes containing a repeating string pattern (OLDOLD).
> > I then copied a file "new1" of 8211 bytes with the repeating pattern
> > (NEWNEW) to overwrite the old one,
> > i.e. cp new1 /mnt/old1
> >
> > A hexdump shows that a block of 8192 bytes containing "OLDOLD" was copied
> > over to /dev/ad0s4 and its place was taken by "NEWNEW" in /dev/ad0s3.
> > Also, the remaining bytes (beyond the 8192 bytes) are still in /dev/ad0s3.
> > So this shows that the copy on write was done correctly, and I correctly
> > see 8211 bytes of "NEWNEW" in /mnt/old1 (ls -l /mnt/old1).
>
> On closer reading, I see the advantage of your approach here: the
> originating device always has the latest changes, but old data is
> still stored on another device.  (But for how long?  Until the next
> overwrite?  Revisioning possibilities?)  This means that the original

Yes, I am doing some kind of versioning for these blocks, which are stored
away on the shadow device.
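
To make the scheme concrete, here is a minimal user-space sketch of the
write and restore paths in Python. It is not the driver code: the names
ShadowDisk, src, log and restore are made up for illustration, the shadow
device is modeled as a simple append-only list, and the 8192-byte block
size is just the size seen in the hexdump.

```python
BLOCK_SIZE = 8192  # assumption: block size observed in the hexdump test


class ShadowDisk:
    """Toy model of /dev/shd0; src plays /dev/ad0s3, log plays /dev/ad0s4."""

    def __init__(self, nblocks):
        self.src = [bytes(BLOCK_SIZE) for _ in range(nblocks)]
        self.log = []  # (block_number, old_contents) pairs, oldest first

    def read(self, blkno):
        return self.src[blkno]

    def write(self, blkno, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        # Copy on write: save the previous contents before overwriting.
        self.log.append((blkno, self.src[blkno]))
        self.src[blkno] = data

    def restore(self):
        # Undo newest first, so that for a block written several times
        # the oldest saved version is the one left in place.
        while self.log:
            blkno, old = self.log.pop()
            self.src[blkno] = old
```

Note that every write that reaches the strategy routine is saved this way,
whether it carries file data or filesystem metadata; the sketch makes no
distinction between the two.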

> disk is always consistent with the most recent changes but has a
> sort of log of old blocks?
>
> This is conceptually the opposite approach to the union filesystem,
> which traditionally keeps new changes to files on another filesystem
> (the overlay) and preserves the underlying filesystem contents.
>
> Your facility also allows devices containing arbitrary data which
> could be for example raw data streams as opposed to a filesystem
> which is accessible through the VFS.  But this carries with it the
> implications of device-level block-i/o.  Restoring any given file
> would involve translating the inode to physical blocks and restoring
> only those portions which were changed by the operation.  I'm unclear
> how this works.  Take undeleting a file:  Wouldn't you need to
> restore the inode, the direct blocks, any indirect blocks and
> dirents by referencing these blocks?  How do you know how to do
> this (at file granularity) at the device level in a
> filesystem-agnostic way?  (Could writes be processed atomically?)
>

Actually, the use case for what I am writing doesn't involve much rolling
back to a previous state. Instead, the idea is to get a fresh disk image
on another machine and then apply these log entries to the new disk in
chronological order to reach a similar state on the new machine. So some
of the concerns you expressed above may not apply.
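
Roughly, the replay step would look like the Python sketch below. The
entry layout (seq, blkno, data) and the function name replay are
assumptions made for illustration; the real log format is different and
is not shown here.

```python
def replay(image, entries):
    """Apply (seq, blkno, data) log entries to image, oldest first.

    image is a list of blocks standing in for the fresh disk image.
    Entries may arrive in any order; seq gives the chronological order.
    """
    for _seq, blkno, data in sorted(entries, key=lambda e: e[0]):
        image[blkno] = data  # a later entry for the same block wins
    return image
```

Because the entries are applied strictly in sequence order, a block that
was logged several times ends up holding its most recent logged version.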

> Alternatively, you can implement this copy-on-write scheme at the
> vnode layer.
>
> > I then send an IOCTL to my driver to restore to the previous state
> > (expecting it to give me 13685 bytes of "OLDOLD" back in /mnt/old1)
>
> So this is like a snapshot of the original state of the filesystem
> on the device in its entirety (sort of like snapshots but at the
> device level vs. the filesystem level)?  How do you ensure it's
> consistent, especially when the device backing the storage of old
> blocks becomes full?  Which blocks do you turf first?  (The problem is
> less significant if you have a 1:1 mapping of blocks like a RAID mirror
> w/ same partition size.)
>
> > After unmounting and remounting, I see that the contents of /mnt/old1 have
> > become OLDOLD, but there are only 8211 bytes instead of 13685. A hexdump of
> > /dev/ad0s3, however, shows that there are indeed 13685 consecutive bytes of
> > OLDOLD lying there.
> >
> > This has led me to believe that the inode of /mnt/old1 is not being
> > refreshed (or it was never saved off to /dev/ad0s4 in the first place).
> > Do inode reads/writes go through the strategy routine at all?
>
> Can you reboot the machine and see the same effects?  I know that
> sounds like an extreme measure, but that's a way to determine for
> sure if it's a caching issue.  You could also try doing a few large
> dd's from another filesystem between the unmount and the remount.
>

I tried the reboot option too, but no success :(. One thing, though: if
the old1 and new1 files are the same size, i.e. 8211 bytes, I do get the
correct behavior :). But obviously that is too ideal a case, and I guess
it works because filesystem metadata (particularly the inode) does not
change in that case.
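
To illustrate why I suspect the inode, here is a toy Python model (not
real UFS code; the helper names are made up). It assumes only that the
file length is kept in the inode rather than in the data blocks, so a
reader sees the raw data clipped to whatever size the inode records.

```python
OLD_SIZE, NEW_SIZE = 13685, 8211  # file sizes from the test above


def pattern(word, n):
    """Return n bytes of the repeating pattern word."""
    return (word * (n // len(word) + 1))[:n]


def visible_contents(inode_size, raw_data):
    """What a reader sees: raw on-disk data clipped to the inode's size."""
    return raw_data[:inode_size]


raw = pattern(b"OLD", OLD_SIZE)          # data blocks restored correctly
seen = visible_contents(NEW_SIZE, raw)   # but the inode still says 8211
print(len(seen), seen[:6])               # 8211 b'OLDOLD'
```

This matches the symptom (OLDOLD contents, 8211 bytes), but it is only a
hypothesis: it does not show whether the inode block was never saved to
/dev/ad0s4 or was saved and not written back.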

> > Any idea what could be going wrong?
>
> No clue. ;)
>
> --
>  Allan Fields
>  AFRSL - http://afields.ca
>  BSDCan: May 2004, Ottawa - http://www.bsdcan.org
>