disk partitioning with gmirror + gpt + gjournal (RFC)

Alfred Bartsch bartsch at dssgmbh.de
Thu Oct 20 09:57:52 UTC 2011


On 19.10.2011 15:42, Miroslav Lachman wrote:
> Alfred Bartsch wrote:
>> On 18.10.2011 10:39, Miroslav Lachman wrote:
>>> Alfred Bartsch wrote:
>>>> I am going to use the following partitioning scheme on our
>>>> servers and programmers' workstations running FreeBSD 8
>>>> (system disk): physical drive - geom_mirror - geom_part_gpt -
>>>> journaled UFS with separate boot and swap partitions.
>>>> Partition names and sizes are taken from our environment -
>>>> your requirements may vary.
>>> 
>>> It is not a good idea to use GPT on top of gmirror, as was
>>> discussed recently on the freebsd-current list. You can read
>>> more in the thread "RFC: Project geom-events". In short:
>>> http://lists.freebsd.org/pipermail/freebsd-current/2011-October/028054.html
>>> http://lists.freebsd.org/pipermail/freebsd-current/2011-October/028109.html
>> 
>> 
>> I know this thread, but nobody there really mentions which
>> utilities / BIOSes would fail or destroy the gmirror metadata.
>> The only complaining utility I know of is gptboot (it only warns
>> during boot). If you know other applications which will fail due
>> to GPT problems, please tell me. Most of the problems shown in
>> this thread seem to have something to do with the combined usage
>> of gpt and glabel, which I am avoiding.
> 
> As is mentioned in the thread, the problem exists with any GEOM
> class storing its metadata at the end of the device (for example
> gmirror, graid3, glabel and others).
> 
>> IMHO the only dangerous code is a foreign UEFI which "repairs"
>> the last sector of the GPT disk without further inquiry. None of
>> our machines has acted this way so far. Once I get one of those
>> "unfriendly" machines, I will surely have to rethink my view of
>> disk partitioning. I expect that by that day either GEOM will be
>> able to handle this situation or ZFS will be production-ready.
> 
> UEFI will replace the old BIOS sooner or later, so what will you
> do then? Then you will need to rework your servers and change your
> setup routine. I think it is better to avoid a known possible
> problem than to hope "it will not bite me". You can't avoid
> Murphy's law ;)
> 

From my present point of view there are two alternatives: hardware
RAID and (matured) ZFS.

If I were a GEOM guru, I would try to improve the compatibility
between upcoming UEFI firmware and gmirror / graid3 etc. Just
guessing: what about adding a flag named "-gpt", "-efi" or just
"-offset" to these GEOM classes, making them reserve enough space
(at least 33 sectors) behind the metadata sector at the end of the
disk provider to hold whatever secondary GPT table is needed to
satisfy UEFI?
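
To make the idea more concrete, here is how I picture the end of the
disk (my own sketch, assuming 512-byte sectors; the flag shown is
hypothetical and does not exist in gmirror today):

  today (GPT on top of mirror/gm0):
    ... partitions ... | backup GPT of gm0 | gmirror metadata
                         (33 sectors)        (last sector of ad0)

  a "repairing" UEFI expects the backup GPT in the last 33 sectors
  of ad0 itself and may overwrite the gmirror metadata sector

  with the proposed flag:
    ... partitions ... | gmirror metadata | 33 reserved sectors
                                            (free for UEFI's copy)

  hypothetical usage (such a flag does not exist):
  # gmirror label -efi gm0 ad0 ad1    # reserve 33 trailing sectors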

>>> I am using gjournal on a few of our servers, but we are slowly
>>> removing it from our setups. Data writes to gjournaled disks
>>> are too slow and sometimes gjournal is not playing nice.
>> 
>> I'm heavily interested in more details.
> 
> When I did some tests in the past, gjournal could not be used in
> combination with iSCSI, and I was not able to stop gjournal
> tasting providers (I could not remove / disable gjournal on a
> device) until I stopped all of them and unloaded the gjournal
> kernel module. I don't know the current state.

Up to now I'm not using any iSCSI equipment. Good to know about such
weaknesses in advance.

> 
>>> Maybe ZFS or UFS+SUJ is a better option.
>> 
>> Yes, maybe. ZFS is mainly for future use. Do you use the second
>> option on large filesystems?
> 
> ZFS has been there for "a long time". I feel safe using it in
> production on a few of our servers. I didn't test UFS+SUJ because
> it will be released in the forthcoming 9.0 and we are not
> deploying CURRENT on our servers.
> 

Compared to UFS, the ZFS lifetime is relatively short. From my point
of view, ZFS in its present state is too immature to hold mission
critical data, YMMV. On the other hand, ZFS needs a lot of redundant
disk space and memory to work as expected, not to forget CPU cycles.
IMHO, ZFS is not 32-bit capable, so there is no way to use it on
older and smaller hardware.

>>>> create the (journaled) data partitions:
>>>> root partition
>>>> # gpart add -t freebsd-ufs -s 1G mirror/gm0
>>>> # gjournal label mirror/gm0p7 mirror/gm0p3
>>>> note: IMHO journal size doesn't need to exceed data size
>>> 
>>> I don't think gjournal is needed in such small partitions.
>>> Classic fsck will be fast.
>>> 
>> You are right. But IMHO I cannot mix journaled and non-journaled
>> R/W filesystems on a gmirror, or I lose the main advantage of
>> avoiding remirroring the whole disk after a power failure or
>> crash.
> 
> Yes, you are right, I forgot about this feature. I never used it
> this way.
> 
>>>> /etc/fstab could then look like
>>>> # Device             Mountpoint  FStype  Options           Dump  Pass#
>>>> /dev/mirror/gm0p2    none        swap    sw                0     0
>>>> /dev/ufs/fbsdroot    /           ufs     rw,noatime,async  1     1
>>>> /dev/ufs/fbsdhome    /home       ufs     rw,noatime,async  2     2
>>>> /dev/ufs/fbsdusr     /usr        ufs     rw,noatime,async  2     2
>>>> /dev/ufs/fbsdvar     /var        ufs     rw,noatime,async  2     2
>>> And there is one more problem which I am mentioning again and
>>> again - the main problem of labels and gmirror is that a "broken"
>>> (dropped) provider (for example disk ad0) publishes its
>>> partitioning and labels, so after a reboot with a degraded
>>> mirror, you can start the system with /dev/ad0p7 mounted
>>> (because it also has the label "fbsdroot") instead of the
>>> mirrored one. It depends on the order of tasting devices etc.,
>>> and unless something has changed, it is unpredictable to me
>>> which device will be chosen if two devices have the same label.
>> 
>> Thanks for clarifying this. As I'm looking for a robust
>> configuration, I will drop these labels. This leads to some minor
>> changes in my configuration:
>> 
>> # newfs -J mirror/gm0p7.journal
>> # newfs -J mirror/gm0p8.journal
>> # newfs -J mirror/gm0p9.journal
>> # newfs -J mirror/gm0p10.journal
>>
>> /etc/fstab could then look like
>> # Device                     Mountpoint  FStype  Options           Dump  Pass#
>> /dev/mirror/gm0p2            none        swap    sw                0     0
>> /dev/mirror/gm0p7.journal    /           ufs     rw,noatime,async  1     1
>> /dev/mirror/gm0p10.journal   /home       ufs     rw,noatime,async  2     2
>> /dev/mirror/gm0p9.journal    /usr        ufs     rw,noatime,async  2     2
>> /dev/mirror/gm0p8.journal    /var        ufs     rw,noatime,async  2     2
>> 
>>> 
>>>> Some questions: Is this disk configuration valid and robust?
>>>> (I've just started testing.) Are there any other proposals,
>>>> usable as "best known practice"? I didn't find a complete
>>>> setup so far.
>>> 
>>> We are using gmirror with good old mbr / fdisk / bsdlabel
>>> without mounting by labels and with gjournal only on the big
>>> data partitions. Not on root, var or partitions with databases
>>> (because gjournal is slow on writes)

Did you perform any benchmarks (UFS+softupdates vs. UFS+gjournal)? If
so, did you compare async mounts + write cache enabled (gjournal)
with sync mounts + write cache disabled (softupdates)?
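
If not, here is the kind of quick test I have in mind (only a
sketch: mirror/gm0p8, the sizes and the test file are illustrative,
and I assume the ATA write cache is toggled via the hw.ata.wc loader
tunable with a reboot in between):

# --- run 1: UFS + gjournal, async mount, write cache on (hw.ata.wc=1) ---
# mount -o async,noatime /dev/mirror/gm0p8.journal /mnt
# dd if=/dev/zero of=/mnt/testfile bs=1m count=4096   # sequential write
# umount /mnt

# --- run 2: UFS + softupdates, write cache off (hw.ata.wc=0, reboot) ---
# newfs -U mirror/gm0p8       # plain UFS2 with soft updates, destroys run 1 data
# mount -o noatime /dev/mirror/gm0p8 /mnt
# dd if=/dev/zero of=/mnt/testfile bs=1m count=4096
# umount /mnt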

If I understand you right, you prefer write speed to data consistency
in these cases. This may be the right choice in your environment.

From my point of view, I am happy to find all bits restored in /var
after an unclean shutdown for error analysis and recovery purposes,
and I hate the vision of having to restore databases from backup,
even after power failures. Furthermore, I am glad to only have to
wait for gmirror to synchronize to regain redundancy after replacing
a failed disk.
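
The replacement itself is a two-liner as far as I can see (gm0 and
ad1 are just the names from the examples above; see gmirror(8)):

# gmirror forget gm0        # drop the dead component from the mirror
# gmirror insert gm0 ad1    # add the new disk; gmirror resyncs automatically
# gmirror status gm0        # watch the synchronization progress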

>> 
>> With fdisk + bsdlabel there are not enough partitions in one
>> slice to hold all the journals, and as I already mentioned, I
>> really want to minimize recovery time. With gmirror + gjournal
>> I'm able to activate the disk write cache without losing data
>> consistency, which improves performance significantly.
> 
> According to the following commit message, bsdlabel was extended
> to 26 partitions 3 years ago.
> http://lists.freebsd.org/pipermail/cvs-all/2007-December/239719.html
> (I haven't tested it yet, because I don't need it - we are using
> two slices on our servers.)

I didn't know this, thanks for pointing it out. I'm not sure whether
all BSD utilities can deal with this.
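
If I ever need it, I would verify it roughly like this (a sketch;
ad0s1 and the extra partition letters are made up, and I haven't
checked which tools besides bsdlabel accept entries beyond 'h'):

# bsdlabel ad0s1 > /tmp/label    # dump the current label to a text file
# vi /tmp/label                  # add partition lines beyond 'h', e.g. i:, j:
# bsdlabel -R ad0s1 /tmp/label   # write the edited label back
# ls /dev/ad0s1*                 # see whether the new partitions show up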

> 
>>> I see what you are trying to do and it would be nice if "all
>>> works as one can expect", but the reality is different. So I
>>> don't think it is a good idea to do it as you described.
>>> 
>> I'm not yet fully convinced that my idea of disk partitioning is
>> a bad one, so please share your negative experiences with
>> gjournal. Thanks in advance.
> 
> I am not saying that your idea is bad. It just contains some
> things which I would rather avoid.

To summarize some of the pros and cons of this method of disk
partitioning:

pros:
  - IMHO easy to configure
  - easy to recover from a failed disk (just replace it with a
    suitable one and resync with gmirror; the new disk needs no
    preparation)
  - minimal downtime after unclean shutdowns (gjournal takes care of
    this; no long-running fsck on large file systems)
  - the disk write cache can and should be enabled (better
    performance)
  - all disk / partition sizes are supported (even > 2 TB)
  - the 32-bit version of FreeBSD (i386) is sufficient (small and
    old hardware remains usable)

cons:
  - danger of an "unfriendly" UEFI BIOS overwriting the gmirror
    metadata
  - to be continued ...

Feel free to add some topics here which I am missing.
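
For reference, here is the whole setup condensed into one sequence
(a sketch only: disk names, sizes and the p3/p7 journal/data pairing
follow the examples above and our environment, so adjust them to
yours):

# load the modules and create the mirror from two whole disks
# gmirror load
# gjournal load
# gmirror label -v gm0 ad0 ad1

# partition the mirror provider with GPT
# gpart create -s GPT mirror/gm0
# gpart add -t freebsd-boot -s 128 mirror/gm0   # gm0p1: boot code
# gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 mirror/gm0
# gpart add -t freebsd-swap -s 4G mirror/gm0    # gm0p2: swap
# gpart add -t freebsd-ufs -s 1G mirror/gm0     # gm0p3: journal for /
# ...                                           # gm0p4-p6: journals for /var, /usr, /home
# gpart add -t freebsd-ufs -s 1G mirror/gm0     # gm0p7: /
# ...                                           # gm0p8-p10: /var, /usr, /home

# pair each data partition with its journal and create the filesystems
# gjournal label mirror/gm0p7 mirror/gm0p3
# newfs -J mirror/gm0p7.journal
# ...                                           # repeat for gm0p8 - gm0p10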

> 
> PS: please use Reply All to post your reply to the mailing list as
> well

PS: Sorry, my bad. As your first reply was only off-list, I believed
that you were starting a private conversation.

-- 
Alfred Bartsch
Data-Service GmbH

