good/best practices for gmirror and gjournal on a pair of disks?

Tue May 13 20:36:25 UTC 2008

George Hartzell wrote:
> I've been running many of my systems for some time now using gmirror
> on a pair of identical disks, as described by Ralf at:
>
>   http://people.freebsd.org/~rse/mirror/
>
> Each disk has a single slice that covers almost all of the disk.  These
> slices are combined into the gmirror device (gm0), which is then
> carved up by bsdlabel into gm0a (/), gm0b (swap), gm0d (/var), gm0e
> (/tmp), and gm0f (/usr).
>
> My latest machine is using Seagate 1TB disks so I thought I should add
> gjournal to the mix to avoid ugly fsck's if/when the machine doesn't
> shut down cleanly.  I ended up just creating a gm0f.journal and using
> it for /usr, which basically seems to be working.
>
> I'm left with a couple of questions though:
>
>   - I've read in the gjournal man page that when it is "... configured
>     on top of gmirror(8) or graid3(8) providers, it also keeps them in
>     a consistent state..."  I've been trying to figure out if this
>     simply falls out of how gjournal works or if there's explicit
>     collusion with gmirror/graid3 but can't come up with a
>     satisfactory explanation.  Can someone walk me through it?
>
>     Since I'm only gjournal'ing a portion of the underlying gmirror
>     device I assume that I don't get this benefit?
>
>   - I've also read in the gjournal man page "... that sync(2) and
>     fsync(2) system calls do not work as expected anymore."  Does this
>     invalidate any of the assumptions made by various database
>     packages such as postgresql, sqlite, berkeley db, etc.... about
>     if/when/whether their data is safely on the disk?
>
>   - What's the cleanest gjournal adaptation of rse's
>     two-disk-mirror-everything setup that would be able to avoid
>     tedious gmirror sync's.  The best I've come up with is to do two
>     slices per disk, combine the slices into a pair of gmirror
>     devices, bsdlabel the first into gm0a (/), gm0b (swap), gm0d
>     (/var) and gm0e (/tmp) and bsdlabel the second into a gm1f which
>     gets a gjournal device.
>
>     Alternatively, would it work and/or make sense to give each disk a
>     single slice, combine them into a gmirror, put a gjournal on top
>     of that, then use bsdlabel to slice it up into partitions?
>
> Is anyone using gjournal and gmirror for all of the system on a pair
> of disks in some other configuration?
>
> Thanks,
>
> g.
I am pasting below the instructions I would use to convert a recently 
installed system with only / (root) and swap to use gmirror+gjournal.  
They are in MediaWiki markup so they can be pasted into a wiki if 
desired.  I based my gmirror steps on the instructions from 
http://people.freebsd.org/~rse/mirror/ so that's why some of the 
wording sounds familiar.  I also have similar instructions for setting 
up a gmirrored da0s1a and da0s1b alongside a zfs mirror containing the 
rest.

I decided to journal /usr, /var and /tmp and leave / as a standard UFS 
partition: it is so small that fsck doesn't take long anyway, and it 
hopefully isn't written to often enough for an abrupt reboot to cause 
damage.  Because I'm not journaling the root partition, I chose to 
ignore the possibility of gjournal marking the mirror clean.  Sudden 
reboots don't happen often enough on servers for me to care.  And all 
my servers got abruptly rebooted this Sunday and they all came up fine :)

I believe gjournal uses 1G for the journal by default (2x512), which 
was sufficient on all of the systems where I used the default.  I 
quickly found, however, that a smaller journal is a bad idea and leads 
to panics that I was unable to avoid with tuning.  Since 1G was that 
close to the limit, I chose to go several times above the default 
journal size (disk is cheap and I want to be sure), but gjournal label 
-s (size) either rejected my sizes or wrapped the value around to 
something too low.  As a workaround I chose to use a separate partition 
for each journal.  I quickly ran out of partitions in a bsd disklabel, 
so I decided to partition each disk into two slices: the first for data 
and the second for journals.  This also made it easier to line up disk 
devices so they make sense as a pair, for example:  
gm0s1d(data) + gm0s2d(journal) = /usr.

I will note that if you accidentally put a gjournal label in the 'wrong' 
spot on your disk, you can make it tough to get rid of.  Several times 
I applied a gjournal label and found something not ideal about it, but 
every time I ran 'gjournal stop foo' the label was automatically 
detected again as a child of a different part of the disk, because it 
could still be seen there, and I could not unload it.  That is part of 
why I use -h for gjournal label, use slices+partitions, and start the 
first partition at offset 16; some of that may have been for gmirror's 
sake too.

==Software raid on 72G disks with gjournal==
5 min to set up, around 30 min to sync

===Prepare===
*Clear any old mirror config including old gmirror labels
 sysctl kern.geom.debugflags=16
 gmirror clear da0
 gmirror clear da1
 sysctl kern.geom.debugflags=0
 dd if=/dev/zero of=/dev/da1 bs=512 count=79

*place a GEOM mirror label onto second disk
 gmirror label -v -n -b round-robin gm0 /dev/da1

*activate GEOM mirror kernel layer
 gmirror load

===Partition===
*Place a PC MBR onto the second disk to make it bootable.  Also 
partition it, with the majority of the space as partition 1 and enough 
for your journal partitions as partition 2.
'''You might get an error, such as "fdisk: Geom not found".  If the next 
steps work, ignore the error.'''
 fdisk -v -B -I /dev/mirror/gm0

*Partition it into two slices.  I think there is an easier way but I 
cannot remember how.  I may have used fdisk differently and ignored the 
end cyl values, since they don't seem to make much sense anyway.  
sysinstall or sade could be used as an alternative.
 fdisk -i /dev/mirror/gm0

 Do you want to change our idea of what BIOS thinks ? '''[n]'''
 The data for partition 1 is:
 sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
 start 63, size 143363997 (70001 Meg), flag 80 (active)
               '' ^^^^^^^^^
                A = 143363997''

         beg: cyl 0/ head 1/ sector 1;
         end: cyl 731/ head 254/ sector 63
 Do you want to change it? [n] '''y'''<br>
 ''We want to make partitions approx 60G(data) and 10G(journals).''
 ''So take variable A, divide by 7 and multiply by 6 to get var B.''
 ''B = 122883426''<br>
 Supply a decimal value for "sysid (165=FreeBSD)" '''[165]'''
 Supply a decimal value for "start" '''[63]'''
 Supply a decimal value for "size" [143363997] '''122883426'''
                                               ''^^^^^^^^^''
                                               ''put B here''
 fdisk: WARNING: partition does not end on a cylinder boundary
 fdisk: WARNING: this may confuse the BIOS or some operating systems
 Correct this automatically? [n] '''y'''
 fdisk: WARNING: adjusting size of partition to 122881122
 Explicitly specify beg/end address ? '''[n]'''
 sysid 165 (0xa5),(FreeBSD/NetBSD/386BSD)
     start 63, size 122881122 (60000 Meg), flag 80 (active)
                    ''^^^^^^^^^''
                    ''C = 122881122''
 ''D = C + 63 = 122881122 + 63 = 122881185''
 ''E = A - D = 143363997 - 122881185 = 20482812''<br>
         beg: cyl 0/ head 1/ sector 1;
         end: cyl 480/ head 254/ sector 63
 Are we happy with this entry? [n] '''y'''

 The data for partition 2 is:
 <UNUSED>
 Do you want to change it? [n] '''y'''
 Supply a decimal value for "sysid (165=FreeBSD)" [0] '''165'''
 Supply a decimal value for "start" [0] '''122881185'''
                                        ''^^^^^^^^^''
                                        ''put D here ''
 Supply a decimal value for "size" [0] '''20482812'''
                                       ''^^^^^^^^''
                                       ''put E here''
 Explicitly specify beg/end address ? '''[n]'''
 Are we happy with this entry? [n] '''y'''

 The data for partition 3 is:
 <UNUSED>
 Do you want to change it? '''[n]'''

 The data for partition 4 is:
 <UNUSED>
 Do you want to change it? '''[n]'''

 Partition 1 is marked active
 Do you want to change the active partition? '''[n]'''
 Should we write new partition table? [n] '''y'''

'''You might get an error, such as "fdisk: Geom not found".  If the next 
steps work, ignore the error.'''
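
The slice arithmetic in the fdisk dialogue above (values A through E) can be sketched as a small shell function.  This is only an illustration: the name slice_sizes is made up, the 255 heads x 63 sectors geometry is read off the "head 254/ sector 63" lines in the dialogue, and rounding the end of slice 1 down to a cylinder boundary is an assumption that happens to reproduce the size fdisk printed here.

```sh
#!/bin/sh
# slice_sizes A [START]: print the values B, C, D and E from the dialogue.
#   A     = size in sectors that fdisk offers for partition 1
#   START = first sector of partition 1 (63 in the dialogue)
slice_sizes() {
    a=$1
    start=${2:-63}
    cyl=$((255 * 63))                      # sectors per cylinder (assumed geometry)
    b=$((a / 7 * 6))                       # roughly 6/7 of the disk for data
    c=$(( (start + b) / cyl * cyl - start )) # assumed rounding: end on a cylinder boundary
    d=$((c + start))                       # first sector of the journal slice
    e=$((a - d))                           # size of the journal slice
    echo "B=$b C=$c D=$d E=$e"
}
```

With the A value from the dialogue, slice_sizes 143363997 prints B=122883426 C=122881122 D=122881185 E=20482812, which matches the numbers typed in above.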

===Disklabel===
*place a BSD disklabel onto the mirrors
 bsdlabel -w -B /dev/mirror/gm0s1
 bsdlabel -w /dev/mirror/gm0s2

NOTICE: Figure out what partitions you want by referring to bsdlabel 
/dev/da0s1 and/or running bsdlabel /dev/mirror/gm0s1 on a different 
server that has already been mirrored, and partition to your liking.

Size can be specified with ##M, ##G or * for the remainder, and offset 
should be * to have it calculated.  Paste the output into the editor 
and make whatever changes you want, as long as the first partition 
starts at offset 16 and the "c" partition is at offset 0.

*Partition 1:
 bsdlabel -e /dev/mirror/gm0s1

Example:
 #        size   offset    fstype   [fsize bsize bps/cpg]
    a:  1G           16    4.2BSD    
    b:  4G           *     swap
    c:  *            0     unused       # "raw" part, don't edit
    d:  10G          *     4.2BSD    
    e:  *            *     4.2BSD    
    f:  4G           *     4.2BSD    

*Partition 2:
 bsdlabel -e /dev/mirror/gm0s2

Example:
 #        size   offset    fstype   [fsize bsize bps/cpg]
    c:  *             0    unused       # "raw" part, don't edit
    d:  4G            16   4.2BSD
    e:  4G            *    4.2BSD
    f:  *             *    4.2BSD

===Gjournal label===
*Label the data and journals so the journaled partition is available.
 gjournal label -f -h mirror/gm0s1d mirror/gm0s2d
 gjournal label -f -h mirror/gm0s1e mirror/gm0s2e
 gjournal label -f -h mirror/gm0s1f mirror/gm0s2f

*Load the kernel module so the journaled partitions are detected:
 gjournal load
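
The data/journal pairing used above (same partition letter on slice 1 and slice 2) can be written as a loop.  A minimal sketch that only prints the commands so they can be reviewed before running anything; the function name journal_label_cmds is hypothetical, and the flags are simply the ones used in the steps above.

```sh
#!/bin/sh
# journal_label_cmds MIRROR LETTER...
# Print one "gjournal label" command per partition letter, pairing the
# data partition on slice 1 with the journal partition on slice 2.
journal_label_cmds() {
    mirror=$1
    shift
    for part in "$@"; do
        echo "gjournal label -f -h mirror/${mirror}s1${part} mirror/${mirror}s2${part}"
    done
}
```

For example, journal_label_cmds gm0 d e f prints the three commands listed in this section; piping that output to sh would run them.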

===Newfs===
*Format the devices with journaling support in UFS:
 newfs /dev/mirror/gm0s1a
 newfs -J /dev/mirror/gm0s1d.journal
 newfs -J /dev/mirror/gm0s1e.journal
 newfs -J /dev/mirror/gm0s1f.journal

===Mount===
*Mount them temporarily:
 mount /dev/mirror/gm0s1a /mnt
 mkdir -p /mnt/usr /mnt/var /mnt/tmp
 mount -o async /dev/mirror/gm0s1d.journal /mnt/usr
 mount -o async /dev/mirror/gm0s1e.journal /mnt/var
 mount -o async /dev/mirror/gm0s1f.journal /mnt/tmp

===Copy Data===
*Install rsync, if not already:
 pkg_add -r rsync

*Copy the original boot drive to the new device:
 rehash
 rsync -avHSx --progress / /mnt/

 (This will take about 1 minute.)

===Prepare mirror for booting===
*Edit '''/mnt/etc/fstab''' replacing the following mountpoints:

 vi /mnt/etc/fstab
Old:
 # Device                Mountpoint      FStype  Options         Dump    Pass#
 /dev/da0s1b             none            swap    sw              0       0
 /dev/da0s1a             /               ufs     rw              1       1
 /dev/cd0                /cdrom          cd9660  ro,noauto       0       0
 /dev/acd0               /cdrom1         cd9660  ro,noauto       0       0
New:
 # Device                        Mountpoint      FStype  Options         Dump    Pass#
 /dev/mirror/gm0s1b              none            swap    sw              0       0
 /dev/mirror/gm0s1a              /               ufs     rw              1       1
 /dev/mirror/gm0s1d.journal      /usr            ufs     rw,async        2       2
 /dev/mirror/gm0s1e.journal      /var            ufs     rw,async        2       2
 /dev/mirror/gm0s1f.journal      /tmp            ufs     rw,async        2       2
 /dev/cd0                        /cdrom          cd9660  ro,noauto       0       0
 /dev/acd0                       /cdrom1         cd9660  ro,noauto       0       0
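
The device-name part of that edit can also be done with sed instead of by hand.  A rough sketch, assuming the old entries all start with /dev/da0s1 as in the "Old" listing; rewrite_fstab is a made-up name, column alignment is not preserved, and the three .journal lines for /usr, /var and /tmp still have to be added separately.

```sh
#!/bin/sh
# rewrite_fstab: read an fstab on stdin and point every /dev/da0s1X
# entry at the matching partition on the mirror, /dev/mirror/gm0s1X.
rewrite_fstab() {
    sed 's|^/dev/da0s1\([a-h]\)|/dev/mirror/gm0s1\1|'
}
```

Something like rewrite_fstab < /etc/fstab > /mnt/etc/fstab would then produce the swap and root lines shown under "New"; lines such as /dev/cd0 pass through unchanged.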

*Load necessary kernel modules at boot:
 echo 'geom_journal_load="YES"' >> /mnt/boot/loader.conf
 echo 'geom_mirror_load="YES"' >> /mnt/boot/loader.conf
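
The echo commands above append unconditionally, so running them twice leaves duplicate lines in loader.conf.  A minimal sketch of an append-only-if-missing helper; add_line is a hypothetical name, not an existing tool.

```sh
#!/bin/sh
# add_line FILE LINE: append LINE to FILE unless an identical line
# is already present (the file is created if it does not exist).
add_line() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

It would be used as add_line /mnt/boot/loader.conf 'geom_journal_load="YES"' and likewise for geom_mirror_load.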

*Instruct the boot stage 2 loader on the first disk to boot with the 
boot stage 3 loader from the second disk (mainly because the BIOS might 
not allow easy booting from the second ATA disk, or at least requires 
manual intervention on the console)
 echo "1:da(1,a)/boot/loader" >/boot.config

*We're done with the first stage, reboot:
 reboot

===Check results===
*Log in and run df.  It should look like this:
 Filesystem                 1K-blocks   Used    Avail Capacity  Mounted on
 /dev/mirror/gm0s1a           1012974 201898   730040    22%    /
 devfs                              1      1        0   100%    /dev
 /dev/mirror/gm0s1d.journal  10154156 144920  9196904     2%    /usr
 /dev/mirror/gm0s1e.journal  40209204    322 36992146     0%    /var
 /dev/mirror/gm0s1f.journal   4058060     12  3733404     0%    /tmp

===Configure second disk into mirror===
*Add the original boot disk to the mirror.  Make sure the first disk is 
treated as a really fresh one
 dd if=/dev/zero of=/dev/da0 bs=512 count=79

*switch GEOM mirror to auto-synchronization and add first disk (first 
disk is now immediately synchronized with the second disk content)
 gmirror configure -a gm0
 gmirror insert gm0 /dev/da0

*Wait for the GEOM mirror synchronization to complete, or check it 
manually with ''gmirror list''
 sh -c 'while [ ".`gmirror list | grep SYNCHRONIZING`" != . ]; do sleep 1; done'
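
The same wait loop, written out as a function for readability.  This is a sketch rather than a tested tool: wait_for_sync and the GMIRROR_LIST_CMD override are hypothetical names, and the override exists only so the loop can be tried on a machine without gmirror.

```sh
#!/bin/sh
# wait_for_sync [INTERVAL]: poll "gmirror list" every INTERVAL seconds
# (default 1) until no component reports SYNCHRONIZING any more.
wait_for_sync() {
    interval=${1:-1}
    # GMIRROR_LIST_CMD is left unquoted on purpose so that it splits
    # into a command and its arguments.
    while ${GMIRROR_LIST_CMD:-gmirror list} | grep -q SYNCHRONIZING; do
        sleep "$interval"
    done
}
```

Called as wait_for_sync 10, it returns once the synchronization has finished, at which point the final reboot can follow.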

*Reboot into the final two-disk GEOM mirror setup (now actually boots 
with the MBR and boot stages on first disk as it was synchronized from 
second disk)
 reboot

===Mirror check script===
*Enable daily_status_gmirror_enable in /etc/periodic.conf or write your 
own script to monitor gmirror status
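
If you go the write-your-own route, a starting point could look like the sketch below.  It assumes that gmirror status prints one line per mirror of the form "mirror/NAME  STATUS  COMPONENT", with COMPLETE meaning healthy; check the output on your own system before relying on it.  The name check_mirrors is made up.

```sh
#!/bin/sh
# check_mirrors: read "gmirror status" output on stdin, print every
# mirror whose status is not COMPLETE, and return 1 if there was any.
check_mirrors() {
    awk '$1 ~ /^mirror\// && $2 != "COMPLETE" { print $1 ": " $2; bad = 1 }
         END { exit bad + 0 }'
}
```

From cron this might be run as gmirror status | check_mirrors, so that any output (and a non-zero exit status) signals a mirror that needs attention.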