STF ZFS test suite, part 2

Alan Somers asomers at freebsd.org
Fri Jun 21 16:41:32 UTC 2013


On Fri, Jun 21, 2013 at 9:22 AM, Steven Hartland
<killing at multiplay.co.uk> wrote:
> ----- Original Message ----- From: "Alan Somers" <asomers at freebsd.org>
>
>>>>> This appears to be an issue with using /tmp as the target dir, using
>>>>> another directory and I can run the import without issue it seems.
>>>>> Other tests also hang with the same issue:-
>>>>> zpool_upgrade_002_pos
>>>>> zpool_upgrade_003_pos
>>>>> zpool_upgrade_007_pos
>>>>> zpool_upgrade_008_pos
>>>>> zpool_upgrade_009_neg
>>>>>
>>>>> Would it be an issue to change the directory?
>>>>
>>>>
>>>> You should be able to use any output directory at all.  Also, I
>>>> usually put TMPDIR on its own UFS filesystem because it makes the
>>>> hotspare tests go faster.
>>>
>>>
>>>
>>> I've confirmed that changing the default TMPDIR fixes this issue. I believe
>>> the problem is that /tmp contains some FIFOs, so possibly there's a bad
>>> interaction with those; besides, I'm sure it's not a good idea to mount
>>> a pool over the main system tmp directory ;-)
>>
>>
>> The tests shouldn't be mounting a pool on TMPDIR.  If they are, then
>> that's a bug.  It's probably something like "mount foo
>> ${TMPDIR}/${BAR}", where BAR is undefined.
>> Can you tell which test is doing it?
>
>
> Pretty much all of the zpool_upgrade_* tests. I believe the function at
> fault is create_old_pool, which runs:
> log_must $ZPOOL import -d /$TMPDIR $POOL_NAME

That command shouldn't mount a pool over TMPDIR.  It attempts to
import a pool, looking in TMPDIR for vdev files.  I can run the
zpool_upgrade tests just fine with TMPDIR=/tmp, though it takes 5
times longer than using a dedicated UFS filesystem for TMPDIR.
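If a test really were mounting over TMPDIR, the likely mechanism is shell
variable expansion. A minimal sketch of the suspected bug class (BAR here is
an illustrative placeholder, not a variable from the suite):

```shell
# If BAR is unset, the target path "${TMPDIR}/${BAR}" degenerates to
# "${TMPDIR}/", so a mount aimed at a subdirectory would land on TMPDIR
# itself rather than on a directory underneath it.
TMPDIR=/tmp
unset BAR
target="${TMPDIR}/${BAR}"
echo "$target"
```

Running this prints "/tmp/": the parent directory, not a subdirectory of it.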

>
> Some new issues:-
> In the XML log I'm seeing a number of errors, which I believe are setup
> problems that could well be causing issues with the tests, e.g.
> <se>gpart: arg0 'md0': Invalid argument</se>
>
> The following fixes that:-
> --- ./include/libtest.kshlib.orig       2013-06-20 00:00:37.923125499 +0000
> +++ ./include/libtest.kshlib    2013-06-20 19:02:31.396756337 +0000
> @@ -635,7 +635,7 @@ function wipe_partition_table #<whole_di
> {
>        while [[ -n $* ]]; do
>                typeset diskname=$1
> -               $GPART destroy -F $diskname
> +               $GPART destroy -F $diskname > /dev/null 2>&1
>                shift
>        done
> }

All that does is suppress the error message from gpart.  It shouldn't
affect the running of any test, unless the test evaluates
wipe_partition_table's STDERR.  That function is normally used during
setup and cleanup, where it's unknown what partition table might be
present on the disk, so gpart complains when it's asked to destroy a
partition table that isn't there.

>
> A number of tests fail as the ports mkfile is broken and exits 0 when
> the file creation fails including:-
> refquota_00*_pos, refreserv_001_pos, zpool_add_006_pos,
> zpool_create_004_pos

Oops.  I forgot that we had already fixed that port, but not
upstreamed the patch.  We were pretty lazy about upstreaming in 2011
and 2012.  But your patch looks better than ours.
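For reference, the failure mode is easy to demonstrate in any file-creation
helper. A hedged sketch (make_file is an illustrative name, not the ports
mkfile) of propagating the write status instead of discarding it:

```shell
# A mkfile-style helper must check dd's exit status; swallowing it means
# ENOSPC and quota errors go unreported, which is exactly the bug that
# broke the refquota and refreserv tests.
make_file() {  # $1 = size in bytes, $2 = path (illustrative signature)
    if ! dd if=/dev/zero of="$2" bs="$1" count=1 2>/dev/null; then
        rm -f "$2"
        return 1
    fi
}
```

With this shape, `make_file 1024 /path/to/file` returns nonzero when the
write fails, so a test's `log_mustnot` check actually fires.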

>
> The patch attached fixes this, which I've submitted to the port maintainer.
>
> Here are my current results from stable/9.  Initially I was seeing 36
> unexpected test failures, but with some patches I'm now down to 19, of
> which I believe only 4 are actual failures; details below.
>
> == Failure Key ==
> ! = Broken test
> * = Test passed but expected failure
> - = Test was broken but now passes with local fixes
> ? = Possibly broken test
> # = Real failure
>
> == Failed test cases ==
> !cli_root/zfs_snapshot/zfs_snapshot:zfs_snapshot_001_neg
>  + broken test: multiple snaps can be created
> !cli_root/zpool_import/zpool_import:zpool_import_corrupt_001_pos
>  + broken test: dd oseek cant be negative

These both work for us.  Could you please show me your atf_results.xml file?

> !cli_root/zpool_scrub/zpool_scrub:zpool_scrub_003_pos
>  + broken test: resilver completes before the check for in-progress can run

Yeah, this one is racy as you discovered.  There are a few others like it.
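One common way to de-race such checks is to poll for the transient state
rather than sampling it once. A minimal sketch under that assumption;
retry_until is an illustrative helper, not part of the STF suite:

```shell
# Poll a command until its output matches a pattern or the retry budget
# runs out; a single sample would race with a fast resilver that finishes
# before the test gets to look at "zpool status".
retry_until() {  # $1 = pattern, $2 = retries, remaining args = command
    pattern=$1
    retries=$2
    shift 2
    while [ "$retries" -gt 0 ]; do
        "$@" | grep -q "$pattern" && return 0
        retries=$((retries - 1))
        sleep 1
    done
    return 1
}
# e.g.: retry_until "resilver in progress" 10 $ZPOOL status $TESTPOOL
```

This still can't catch a resilver that finishes before the first poll, but
it turns a hard race into a bounded wait for the slow case.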

> !mmap/mmap_write/mmap_write:mmap_write_001_pos
>  + broken test: runs out of space

Works for us.  I'd have to see your atf_results.xml file.

> *cache/cache:cache_009_pos
>  + passed: expected failure

A known bug in SpectraBSD

> *cli_root/zfs_copies/zfs_copies:zfs_copies_003_pos
>  + passed: expected failure
> *cli_root/zpool_destroy/zpool_destroy:zpool_destroy_001_pos
>  + passed: expected failure

In order to fix some deadlocks, we disabled creating pools on zvols.
Obviously that's not an ideal fix, and not one we would upstream.

> *cli_root/zfs_create/zfs_create:zfs_create_013_pos
>  + passed: expected failure
> *cli_root/zfs_get/zfs_get:zfs_get_009_pos
>  + passed: expected failure

These failures are due to FreeBSD's mount path length limitation.  If
it works for you, then the failure might be the result of our zvol
changes, which slightly lengthen the zvol device name.

> *hotspare/hotspare:hotspare_add_004_neg
>  + passed: expected failure
> ?inuse/inuse:inuse_005_pos
>  + passed: expected failure, could be a broken test as everything reports
> SUCCESS?

The "zpool add" operation failed for you, but not for the reason that
the test was designed to catch.  I think it failed because ZFS opens
geom providers in exclusive mode.  But in SpectraBSD, we disabled that
to fix a swath of other hotspare bugs.  Hotspare support in ZFS is
quite poorly designed.  Oracle doesn't even enable it in their storage
appliances.  We've fixed many hotspare bugs, but it was necessary to
open geoms nonexclusively.

> *pool_names/pool_names:pool_names_001_pos
>  + passed: expected failure

Surprising.  I didn't think that SpectraBSD had regressed in this
area.  We'll have to look into it.

> -cli_root/zfs_rename/zfs_rename:zfs_rename_007_pos
>  + failed: clone not creating zvol entry
> -cli_root/zfs_set/zfs_set:readonly_001_pos
>  + failed: zvol clone bug
> -zil/zil:zil_001_pos
>  + failed: broken freeze ioctl
> -zil/zil:zil_002_pos
>  + failed: broken freeze ioctl
> -zvol/zvol_misc/zvol_misc:zvol_misc_007_pos
>  + failed: recursive zvol rename fails
> -zvol/zvol_misc/zvol_misc:zvol_misc_008_pos
>  + failed: zvol rename in promote
> -zvol/zvol_misc/zvol_misc:zvol_misc_009_pos
>  + failed: recursive zvol rename fails
> ?zvol_thrash/zvol_thrash:zvol_thrash_001_pos
>  + failed: uses camcontrol which fails on md's
> #history/history:history_008_pos
>  + failed: missing history
> #history/history:history_009_pos
>  + failed: missing history
> #history/history:history_010_pos
>  + failed: missing history
> #snapshot/snapshot:snapshot_018_pos
>  + failed: unknown sysctl abbreviated_snapdir

What Will wrote.

> -cli_root/zfs_set/zfs_set:ro_props_001_pos
>  + failed: fixed by mkfile patch
> -cli_root/zpool_add/zpool_add:zpool_add_006_pos
>  + failed: fixed by mkfile patch
> -cli_root/zpool_create/zpool_create:zpool_create_004_pos
>  + failed: fixed by mkfile patch
> -refquota/refquota:refquota_001_pos
>  + failed: mkfile doesn't check for any errors on write
> -refquota/refquota:refquota_002_pos
>  + failed: mkfile doesn't check for any errors on write
> -refquota/refquota:refquota_003_pos
>  + failed: mkfile doesn't check for any errors on write
> -refquota/refquota:refquota_004_pos
>  + failed: mkfile doesn't check for any errors on write
> -refquota/refquota:refquota_005_pos
>  + failed: mkfile doesn't check for any errors on write
> -refreserv/refreserv:refreserv_001_pos
>  + failed: mkfile doesn't check for any errors on write
> -snapshot/snapshot:snapshot_014_pos
>  + failed: mkfile doesn't check for any errors on write

Sorry about not upstreaming that earlier.

> -cli_root/zpool_import/zpool_import:zpool_import_014_pos
>  + failed: can't import deleted pools with -D

Again, we've already fixed that but failed to upstream the patch.  I
see that you fixed it yourself with a near-identical patch.

> ?cli_root/zpool_create/zpool_create:zpool_create_005_pos
>  + failed: possibly broken test, everything reports SUCCESS?

There are some annoying tests which don't print error messages for
their failures.  This one passes for us; I'd have to see your
atf_results.xml to figure out why it fails for you.
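When chasing these, it helps to wrap commands so the failing output is
captured in the log. A hedged sketch of a log_must-style wrapper
(log_must_dbg is an illustrative name; the suite's real log_must may differ):

```shell
# Run a command; on failure, record its exit status and combined output
# so the log explains *why* it failed rather than printing a bare
# SUCCESS/FAILURE with no diagnostic.
log_must_dbg() {
    out=$("$@" 2>&1)
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "FAILURE ($status): $* :: $out" >&2
        return "$status"
    fi
    echo "SUCCESS: $*"
}
```

For example, `log_must_dbg $ZPOOL create $TESTPOOL $DISK` would leave the
zpool error text in the log instead of discarding it.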


>
> == Before fixes ===
> Summary for 102 test programs:
>    393 passed test cases.
>    36 failed test cases.
>    30 expected failed test cases.
>    173 skipped test cases.
>
> == After fixes ==
> Summary for 102 test programs:
>    411 passed test cases.
>    18 failed test cases.
>    30 expected failed test cases.
>    173 skipped test cases.
>
> == Time taken for test suite ==
> 1h38m32.62s real                7m6.98s user            36m3.05s sys
>
> I'll be committing the fixes I've found as soon as full head tests are
> complete.
>
>
>    Regards
>    Steve
>

