8.1-RELEASE: ZFS data errors
Mike Carlson
carlson39 at llnl.gov
Wed Nov 10 19:49:29 UTC 2010
On 11/10/2010 03:03 AM, Ivan Voras wrote:
> On 11/09/10 18:42, Mike Carlson wrote:
>
>>> write# gstripe label -v -s 16384 data /dev/da2 /dev/da3 /dev/da4
>>> /dev/da5 /dev/da6 /dev/da7 /dev/da8
>>> write# df -h
>>> Filesystem Size Used Avail Capacity Mounted on
>>> /dev/da0s1a 1.7T 22G 1.6T 1% /
>>> devfs 1.0K 1.0K 0B 100% /dev
>>> /dev/stripe/data 126T 4.0K 116T 0% /mnt
>>> write# fsck /mnt
>>> fsck: Could not determine filesystem type
>>> write# fsck_ufs /mnt
>>> ** /dev/stripe/data (NO WRITE)
>>> ** Last Mounted on /mnt
>>> ** Phase 1 - Check Blocks and Sizes
>>> Segmentation fault
>>> So, the data appears to be okay. I wanted to run an fsck over it just
>>> to be thorough, but that seg faulted. Otherwise, the data looks good.
> Hmm, probably it tried to allocate a gazillion internal structures to
> check it and didn't take no for an answer.
>
>>> Question, why did you recommend using a smaller stripe size? Is that to
>>> ensure a sample 1GB test file gets written across ALL disk members?
> Yes, it's the surest way, since MAXPHYS (128 KiB) / 8 = 16 KiB.
>
> Well, as far as I'm concerned this probably shows there isn't anything
> wrong with the hardware or GEOM, though more testing, like running a
> couple of bonnie++ rounds on the UFS-on-stripe volume for a few hours,
> would probably be better.
>
> Btw. what bandwidth do you get from this combination (gstripe + UFS)?
>
The bandwidth for geom_stripe + UFS2 was very nice:
write# mount
/dev/da0s1a on / (ufs, local, soft-updates)
devfs on /dev (devfs, local, multilabel)
filevol002 on /filevol002 (zfs, local)
/dev/stripe/data on /mnt (ufs, local, soft-updates)
Simple DD write:
write# dd if=/dev/zero of=/mnt/zero.dat bs=1m count=5000
5000+0 records in
5000+0 records out
5242880000 bytes transferred in 13.503850 secs (388250759 bytes/sec)
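As a sanity check on the dd figures above (byte count and elapsed time are from the transcript; the tolerance is mine):

```python
# Recompute dd's reported transfer rate from its own byte count and
# elapsed time. Figures come from the transcript above.
nbytes = 5_242_880_000        # 5000 x 1 MiB records
secs = 13.503850

rate = nbytes / secs
assert abs(rate - 388_250_759) < 10   # matches dd's bytes/sec figure
print(round(rate / 2**20))            # rate in MiB/s, roughly 370
```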
Running bonnie++:
write# bonnie++ -u 100 -s24576 -d. -n64
Using uid:100, gid:65533.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
write.llnl.gov  24G   730  99 343750  63 106157  26  1111  86 174698  26 219.2   3
Latency             11492us    149ms    227ms   70274us   66776us     766ms
Version  1.96       ------Sequential Create------ --------Random Create--------
write.llnl.gov      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 64 18681  47 +++++ +++ 99516  97 26297  40 +++++ +++ 113937  96
Latency               310ms     149us     152us   68841us     144us     146us
1.96,1.96,write.llnl.gov,1,1289416723,24G,,730,99,343750,63,106157,26,1111,86,174698,26,219.2,3,64,,,,,18681,47,+++++,+++,99516,97,26297,40,+++++,+++,113937,96,11492us,149ms,227ms,70274us,66776us,766ms,310ms,149us,152us,68841us,144us,146us
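The machine-readable line bonnie++ prints at the end encodes the same figures as the table; a quick way to pull individual fields back out (the field positions here are inferred by matching values against the human-readable output above, not taken from bonnie++ documentation):

```python
# Split bonnie++ 1.96's trailing CSV line and extract a few figures.
# Field positions are inferred by matching values against the table.
csv = ("1.96,1.96,write.llnl.gov,1,1289416723,24G,,730,99,343750,63,"
       "106157,26,1111,86,174698,26,219.2,3,64,,,,,18681,47,+++++,+++,"
       "99516,97,26297,40,+++++,+++,113937,96,11492us,149ms,227ms,"
       "70274us,66776us,766ms,310ms,149us,152us,68841us,144us,146us")

f = csv.split(",")
seq_block_write = f[9]     # K/sec, sequential block output
seq_block_read = f[15]     # K/sec, sequential block input
seeks = f[17]              # random seeks per second

assert (seq_block_write, seq_block_read, seeks) == ("343750", "174698", "219.2")
print(seq_block_write, seq_block_read, seeks)
```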
The system rebooted immediately and mysteriously after running bonnie++,
though; that doesn't seem like a good sign...
I've got an iozone benchmark, gstripe + multipath + UFS vs. multipath +
ZFS. I can email the gzip'd file to you, as I don't want to clutter the
mailing list with file attachments.
Another question, for anyone really: will gmultipath ever have an
'active/active' model? I'm happy that I have some type of redundancy for
my SAN, but if it were possible to aggregate the bandwidth of both
controllers, that would be pretty cool as well.
>> Oh, I almost forgot, here is the ZFS version of that gstripe array:
>>
>> write# zpool create test01 /dev/stripe/data
>> write# zpool scrub test01
>> write# zpool status
>> pool: test01
>> state: ONLINE
>> scrub: scrub completed after 0h0m with 0 errors on Tue Nov 9
>> 09:41:34 2010
>> config:
>>
>> NAME STATE READ WRITE CKSUM
>> test01 ONLINE 0 0 0
>> stripe/data ONLINE 0 0 0
> "scrub" verifies only written data, not the whole file system space
> (that's why it finishes so fast), so it isn't really doing any load on
> the array, but I agree that it looks more and more like there really is
> an issue in ZFS.
>
Yeah, I ran scrub when there was around 20GB of random data. In
8.1-RELEASE, that was the way I would trigger ZFS's acknowledgment that
the pool had a problem.
I also dug through my logs and saw these:
Nov 8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01
path=/dev/da5 offset=749207552 size=131072
Nov 8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01
path=/dev/da5 offset=749338624 size=131072
Nov 8 15:09:51 write root: ZFS: zpool I/O failure, zpool=test01
error=86
Nov 8 15:09:51 write root: ZFS: zpool I/O failure, zpool=test01
error=86
Nov 8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01
path=/dev/da3 offset=748421120 size=131072
Nov 8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01
path=/dev/da4 offset=746586112 size=131072
Nov 8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01
path=/dev/da4 offset=746455040 size=131072
Nov 8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01
path=/dev/da4 offset=746717184 size=131072
Nov 8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01
path=/dev/da3 offset=748290048 size=131072
Nov 8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01
path=/dev/da3 offset=748421120 size=131072
Nov 8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01
path=/dev/da4 offset=746586112 size=131072
Nov 8 15:09:52 write root: ZFS: zpool I/O failure, zpool=test01
error=86
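Grouping those kernel log messages by device makes the pattern easier to see; a small sketch (the log text is pasted verbatim from the messages above):

```python
# Count ZFS checksum mismatches per device path, and I/O failures,
# in the log excerpt above. Log lines are pasted from the messages.
import re
from collections import Counter

log = """\
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da5 offset=749207552 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da5 offset=749338624 size=131072
Nov  8 15:09:51 write root: ZFS: zpool I/O failure, zpool=test01 error=86
Nov  8 15:09:51 write root: ZFS: zpool I/O failure, zpool=test01 error=86
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da3 offset=748421120 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746586112 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746455040 size=131072
Nov  8 15:09:51 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746717184 size=131072
Nov  8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da3 offset=748290048 size=131072
Nov  8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da3 offset=748421120 size=131072
Nov  8 15:09:52 write root: ZFS: checksum mismatch, zpool=test01 path=/dev/da4 offset=746586112 size=131072
Nov  8 15:09:52 write root: ZFS: zpool I/O failure, zpool=test01 error=86
"""

mismatches = Counter(re.findall(r"checksum mismatch.*path=(\S+)", log))
failures = log.count("zpool I/O failure")

assert mismatches == {"/dev/da5": 2, "/dev/da3": 3, "/dev/da4": 4}
assert failures == 3
print(dict(mismatches), failures)
```

The mismatches land on three different member disks of the stripe, which fits errors being introduced somewhere above the individual drives rather than one bad disk.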
I'm inclined to believe it is an issue with ZFS.
More information about the freebsd-fs mailing list