disk partitioning with gmirror + gpt + gjournal (RFC)

Alfred Bartsch bartsch at dssgmbh.de
Wed Oct 26 17:29:02 UTC 2011


On 24.10.2011 14:33, Miroslav Lachman wrote:
> Alfred Bartsch wrote:
>> On 19.10.2011 15:42, Miroslav Lachman wrote:
> 
> [...]
> 
>>> UEFI will replace the old BIOS sooner or later, so what will you
>>> do then? Then you will need to rework your servers and change
>>> your setup routine. And I think it is better to avoid a known
>>> possible problem than to hope "it will not bite me". You can't
>>> avoid Murphy's law ;)
>>> 
>> 
>> From my present point of view there are two alternatives:
>> Hardware RAID and (matured) ZFS.
>> 
>> If I were a GEOM guru, I would try to enhance the compatibility
>> between upcoming UEFI and GMIRROR / GRAID3 etc. Just guessing:
>> What about adding a flag named "-gpt", "-efi" or just "-offset"
>> to these geoms to reserve enough space (at least 33 sectors)
>> behind the metadata sector at the end of the disk provider to
>> hold whatever secondary GPT table is needed to satisfy UEFI?
> 
> In an ideal world, it would be "the right way", but I guess it will
> never happen in our real FreeBSD world. There is nobody with the time,
> skills and courage to rework "all" GEOM classes. It is not such an easy
> task (backward compatibility, compatibility with other OSes /
> tools, etc.)
> 
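To make the "at least 33 sectors" figure concrete, here is a rough
sketch of the arithmetic, assuming 512-byte sectors and the usual GPT
layout of 128 partition entries of 128 bytes each plus one header
sector. The offset flag is only the suggestion quoted above and does
not exist in gmirror; the function names are made up for illustration.

```python
GPT_ENTRIES = 128        # usual number of GPT partition entries
GPT_ENTRY_SIZE = 128     # bytes per entry


def backup_gpt_sectors(sector_size=512):
    """Sectors needed at the end of the disk for the secondary GPT."""
    entry_bytes = GPT_ENTRIES * GPT_ENTRY_SIZE
    entry_sectors = -(-entry_bytes // sector_size)  # ceiling division
    return entry_sectors + 1                        # plus the backup header


def proposed_layout(total_sectors, sector_size=512):
    """Hypothetical layout: GEOM metadata moved in front of the GPT area."""
    reserve = backup_gpt_sectors(sector_size)
    metadata_lba = total_sectors - reserve - 1
    return {
        "mirror_data": (0, metadata_lba - 1),
        "geom_metadata": metadata_lba,
        "reserved_for_backup_gpt": (metadata_lba + 1, total_sectors - 1),
    }


if __name__ == "__main__":
    print(backup_gpt_sectors())
    print(proposed_layout(1000))
```

With 512-byte sectors the reserved area comes out at 33 sectors, and
the usable size of the mirror shrinks by 34 sectors in total (the
reserved area plus the metadata sector).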
>>>>> I am using gjournal on a few of our servers, but we are
>>>>> slowly removing it from our setups. Data writes to
>>>>> gjournaled disks are too slow and sometimes gjournal does
>>>>> not play nice.
>>>> 
>>>> I'm heavily interested in more details.
>>> 
>>> When I did some tests in the past, gjournal could not be used in
>>> combination with iSCSI and I was not able to stop gjournal from
>>> tasting providers (I was not able to remove / disable gjournal
>>> on a device) until I stopped all of them and unloaded the gjournal
>>> kernel module. I don't know the current state.
>> 
>> Up to now I'm not using any iSCSI equipment. Good to know about
>> some weaknesses in advance.
>> 
>>> 
>>>>> Maybe ZFS or UFS+SUJ is a better option.
>>>> 
>>>> Yes, maybe. ZFS is mainly for future use. Do you use the
>>>> second option on large filesystems?
>>> 
>>> ZFS has been there for "a long time". I feel safe using it in
>>> production on a few of our servers. I didn't test UFS+SUJ because
>>> it will be released in the forthcoming 9.0 and we are not deploying
>>> current on our servers.
>>> 
>> 
>> Compared to UFS, ZFS lifetime is relatively short. From my point
>> of view ZFS in its present state is too immature to hold mission
>> critical data, YMMV.
> 
> UFS2 or UFS2+SU (Soft-Updates) has been around longer than ZFS,
> but UFS2+SUJ (journaled soft-updates) has been around only a short
> time and is not much tested in production. Even UFS2+gjournal is not
> widely deployed / tested.
> 
>> On the other hand ZFS needs a lot of redundant disk space and
>> memory to work as expected, not to forget cpu-cycles. IMHO, ZFS
>> is not 32-bit capable, so there is no way to use it on older and
>> small hardware.
> 
> Yes, you are right, ZFS cannot be used in some scenarios. But in
> some other scenarios, ZFS is the best possible choice, e.g. for large
> flexible storage I will use ZFS, for a database server I will use
> UFS2+SU without gjournal.
> 
> [...]
> 
>> 
>> Did you perform any benchmarks (UFS+Softupdates vs.
>> UFS+Gjournal)? If yes, did you compare async mounts + write cache
>> enabled (gjournal) to sync mounts + write cache disabled
>> (softupdates)?
> 
> I don't have fancy graphs or tables from benchmarking software, I just
> have real workload experience where the write cache was enabled in
> both cases.
> 
>> If I understand you right, you prefer write speed to data
>> consistency in these cases. This may be the right choice in your
>> environment.
>> 
>> From my point of view, I am happy to find all bits restored in
>> /var after an unclean shutdown for error analysis and recovery
>> purposes, and I hate the vision of having to restore databases
>> from backup, even after power failures. Furthermore I am glad
>> to only have to wait for gmirror synchronizing to regain
>> redundancy after replacing a failed disk.
> 
> I am not sure you can rely on data consistency with today's HDDs
> when the cache is enabled, even if you use gjournal. You always lose
> the content of the device cache, as rotating disks (and some flash
> devices - SSDs - too) are known to lie to the OS about "data is
> written", so you end up with lost or damaged data after an unclean
> shutdown.

AFAIK gjournal guarantees data consistency, even in such scenarios.
There is a thread in freebsd-geom (June 2006), subject: "Journaling
UFS with gjournal", started by Pawel Jakub Dawidek, which explains
some details and prerequisites.
As far as I understand it, in a gjournaled UFS file system there is a
potential data loss of at most one journal block. This depends only
on whether this last block has been fully written to the
journal or not. The underlying file system (UFS) is always consistent.
Gmirror/graid only has to step in if one disk fails due to hardware
reasons.
At present, I don't know of any other software RAID configuration
(other than ZFS) on FreeBSD which combines these strong features.
Please correct me if I'm wrong.
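
To illustrate what I mean by "at most one journal block", here is a
toy model of journal replay. It is only a sketch of the general idea
and not gjournal's actual on-disk format or code: the record layout,
the length and CRC check and all names are my own assumptions. Complete
records are applied in order, and replay stops at the first record
that was not fully written.

```python
import zlib


def make_record(seq, payload):
    """Build a journal record with length and checksum (toy format)."""
    return {
        "seq": seq,
        "length": len(payload),
        "crc": zlib.crc32(payload),
        "payload": payload,
    }


def is_complete(rec):
    """A record counts as complete only if length and checksum match."""
    payload = rec["payload"]
    return len(payload) == rec["length"] and zlib.crc32(payload) == rec["crc"]


def replay(journal, fs):
    """Apply complete records in order; stop at the first torn record."""
    applied = 0
    for rec in journal:
        if not is_complete(rec):
            break
        fs[rec["seq"]] = rec["payload"]
        applied += 1
    return applied


if __name__ == "__main__":
    journal = [make_record(i, f"block {i}".encode()) for i in range(3)]
    # Simulate a power failure while the last record was being written.
    journal[-1]["payload"] = journal[-1]["payload"][:3]
    fs = {}
    print(replay(journal, fs), sorted(fs))
```

In this model the two complete records are applied and the torn third
one is dropped, so the "file system" never sees a half-written block.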

> 
> Database engines handle it in their own way with their own journal
> log etc., because some of them can be installed on raw partitions
> without an underlying FS (MySQL can do this, too).

OK, but I'm not sure whether these proprietary (hidden) file systems
really do better than UFS+GJOURNAL in case of unclean shutdown(s).

> 
> I remember one case (about 3 years ago) where a server remained
> unbootable after a kernel panic and I spent a couple of hours
> playing with disabling gjournal, doing a full fsck on the given
> partition etc. It is a rare case, but it can happen.

Sorry to hear of this. Hopefully, Gjournal has improved since then. I
don't expect to run into similar problems very often.

> 
>>>> with fdisk + bsdlabel there are not enough partitions in one 
>>>> slice to hold all the journals, and as I already mentioned I 
>>>> really want to minimize recovery time. With gmirror +
>>>> gjournal I'm able to activate disk write cache without losing
>>>> data consistency, which improves performance significantly.
>>> 
>>> According to the following commit message, bsdlabel was extended to
>>> 26 partitions 3 years ago.
>>> http://lists.freebsd.org/pipermail/cvs-all/2007-December/239719.html
>>>
>>> (I haven't tested it yet, because I don't need it - we are using
>>> two slices on our servers)
>> 
>> I didn't know this, thanks for pointing it out. I'm not sure if all
>> BSD utilities can deal with this.
>> 
>>> 
>>>>> I see what you are trying to do and it would be nice if
>>>>> "all works as one can expect", but the reality is
>>>>> different. So I don't think it is a good idea to do it as
>>>>> you described.
>>>>> 
>>>> I'm not yet fully convinced that my idea of disk
>>>> partitioning is a bad one, so please share your negative
>>>> experiences with gjournal with me. Thanks in advance.
>>> 
>>> I am not saying that your idea is bad. It just contains some
>>> things which I would rather avoid.
>> 
>> To summarize some of the pros and cons of this method of disk
>> partitioning:
>>
>> pros:
>> - IMHO easy to configure
>> - easy to recover from a failed disk (just replace with a suitable
>>   one and resync with gmirror, needs no preparation of the new disk)
>> - minimal downtime after unclean shutdowns (gjournal is responsible
>>   for this, no sucking fsck on large file systems)
>> - disk write cache can and should be enabled (enhanced performance)
>> - all disk / partition sizes are supported (even > 2TB)
>> - 32 bit version of FreeBSD (i386) is sufficient (small and old
>>   hardware remains usable)
>>
>> cons:
>> - danger of overwriting gmirror metadata by an "unfriendly"
>>   UEFI-BIOS
> - somewhat complex initial setup or future changes in partitioning
>   (you must have prepared the right number of partitions for journals,
>   so adding more partitions is not so easy - with UFS2+SUJ or
>   ZFS, you just add another partition)

You're right, nothing's for free. In our environment, I can live with
that. As our system disks are all configured similarly, adding
partitions almost never happens.
Unfortunately, I will not be able to test UFS2+SUJ with a
realistic workload for the next few months, as we are now switching
our servers from FreeBSD-6-stable to FreeBSD-8-stable.

>> - to be continued ...
>> 
>> Feel free to add some topics here which I am missing.
> 
> One thing on my mind is a longstanding problem with gjournal on
> heavily loaded servers:
> 
> Aug 16 01:48:28 praha kernel: fsync: giving up on dirty
> Aug 16 01:48:30 praha kernel: 0xc44ba9b4: tag devfs, type VCHR
> Aug 16 01:48:30 praha kernel: usecount 1, writecount 0, refcount 6941 mountedhere 0xc445b700
> Aug 16 01:48:30 praha kernel: flags ()
> Aug 16 01:48:30 praha kernel: v_object 0xc1548c00 ref 0 pages 192023
> Aug 16 01:48:30 praha kernel: lock type devfs: EXCL (count 1) by thread 0xc42a7240 (pid 45)
> Aug 16 01:48:30 praha kernel: dev mirror/gm0s2e.journal
> Aug 16 01:48:30 praha kernel: GEOM_JOURNAL: Cannot suspend file system /vol0 (error=35).
> 
> Aug 16 02:32:34 praha kernel: fsync: giving up on dirty
> Aug 16 02:32:34 praha kernel: 0xc44ba9b4: tag devfs, type VCHR
> Aug 16 02:32:34 praha kernel: usecount 1, writecount 0, refcount 1418 mountedhere 0xc445b700
> Aug 16 02:32:34 praha kernel: flags ()
> Aug 16 02:32:34 praha kernel: v_object 0xc1548c00 ref 0 pages 128123
> Aug 16 02:32:34 praha kernel: lock type devfs: EXCL (count 1) by thread 0xc42a7240 (pid 45)
> Aug 16 02:32:34 praha kernel: dev mirror/gm0s2e.journal
> Aug 16 02:32:34 praha kernel: GEOM_JOURNAL: Cannot suspend file system /vol0 (error=35).
> 
> These error messages are seen on them almost every second day and
> nobody has given me a suitable explanation of what they really mean /
> what they cause. The only answer I got was something like "it is not
> harmful"... then why is it logged at all?

I haven't seen this kind of message yet. Perhaps our servers are not
loaded heavily enough? ;-)
BTW, which FreeBSD version are you using with gjournal?
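
To get a feeling for how often this happens, the messages could be
counted per file system. The following is only a rough sketch: the
regular expression is modelled on the two GEOM_JOURNAL lines quoted
above, and the function name is made up. If the error value is a plain
errno, 35 would be EAGAIN on FreeBSD; the table is hardcoded because
Python's errno module reflects the host it runs on.

```python
import re
from collections import Counter

# errno value as defined in FreeBSD's sys/errno.h (assumption: the
# logged number is a standard errno).
FREEBSD_ERRNO = {35: "EAGAIN (Resource temporarily unavailable)"}

PATTERN = re.compile(
    r"GEOM_JOURNAL: Cannot suspend file system (\S+) \(error=(\d+)\)"
)


def count_suspend_failures(lines):
    """Count 'Cannot suspend' messages per (file system, error) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Aug 16 01:48:30 praha kernel: GEOM_JOURNAL: "
        "Cannot suspend file system /vol0 (error=35).",
        "Aug 16 02:32:34 praha kernel: GEOM_JOURNAL: "
        "Cannot suspend file system /vol0 (error=35).",
    ]
    for (mountpoint, err), n in count_suspend_failures(sample).items():
        print(mountpoint, n, FREEBSD_ERRNO.get(err, f"errno {err}"))
```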

> 
> So today I removed gjournal from the next older server. I will try
> UFS2+SUJ with 9.0 as one of the possible ways for future setups
> where ZFS cannot be used.

I would be glad to hear about your hands-on experience with this topic.

> 
> Miroslav Lachman

-- 
Alfred Bartsch
Data-Service GmbH
