zfs on nvme: gnop breaks pool, zfs gets stuck

Chris Watson bsdunix44 at gmail.com
Wed Apr 27 18:00:54 UTC 2016


I think for most people the gnop hack is what is documented on the web, which is why people are using it instead of the ashift sysctl. If the sysctl for ashift is not documented in the ZFS section of the Handbook, it probably should be.
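
For anyone who finds this thread before the Handbook gets updated, the sysctl route looks roughly like this (a sketch; the pool and partition names are placeholders):

---
# require at least 4k sectors (ashift=12) for newly created top-level vdevs
# (optionally persist the setting in /etc/sysctl.conf)
sysctl vfs.zfs.min_auto_ashift=12

# then create the pool as usual, no gnop providers needed
zpool create tank raidz1 gpt/disk0 gpt/disk1 gpt/disk2
---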

Chris

Sent from my iPhone 5

> On Apr 27, 2016, at 11:59 AM, Steven Hartland <killing at multiplay.co.uk> wrote:
> 
> 
> 
>> On 27/04/2016 15:14, Gary Palmer wrote:
>>> On Wed, Apr 27, 2016 at 03:22:44PM +0200, Gerrit Kühn wrote:
>>> Hello all,
>>> 
>>> I have a set of three NVME-ssds on PCIe-converters:
>>> 
>>> ---
>>> root at storage:~ # nvmecontrol devlist
>>>  nvme0: SAMSUNG MZVPV512HDGL-00000
>>>     nvme0ns1 (488386MB)
>>>  nvme1: SAMSUNG MZVPV512HDGL-00000
>>>     nvme1ns1 (488386MB)
>>>  nvme2: SAMSUNG MZVPV512HDGL-00000
>>>     nvme2ns1 (488386MB)
>>> ---
>>> 
>>> 
>>> I want to use a z1 raid on these and created 1m-aligned partitions:
>>> 
>>> ---
>>> root at storage:~ # gpart show
>>> =>        34  1000215149  nvd0  GPT  (477G)
>>>           34        2014        - free -  (1.0M)
>>>         2048  1000212480     1  freebsd-zfs  (477G)
>>>   1000214528         655        - free -  (328K)
>>> 
>>> =>        34  1000215149  nvd1  GPT  (477G)
>>>           34        2014        - free -  (1.0M)
>>>         2048  1000212480     1  freebsd-zfs  (477G)
>>>   1000214528         655        - free -  (328K)
>>> 
>>> =>        34  1000215149  nvd2  GPT  (477G)
>>>           34        2014        - free -  (1.0M)
>>>         2048  1000212480     1  freebsd-zfs  (477G)
>>>   1000214528         655        - free -  (328K)
>>> ---
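>>> 
>>> A layout like this can be produced with something along the following
>>> lines (a sketch; the flash0..flash2 GPT labels are assumed from the
>>> pool output further down):
>>> 
>>> ---
>>> gpart create -s gpt nvd0
>>> gpart add -t freebsd-zfs -a 1m -l flash0 nvd0
>>> # and the same for nvd1/nvd2 with labels flash1/flash2
>>> ---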
>>> 
>>> 
>>> After creating a zpool I recognized that it was using ashift=9. I vaguely
>>> remembered that SSDs usually have 4k (or even larger) sectors, so I
>>> destroyed the pool and set up gnop-providers with -S 4k to get ashift=12.
>>> This worked as expected:
>>> 
>>> ---
>>>   pool: flash
>>>  state: ONLINE
>>>   scan: none requested
>>> config:
>>> 
>>>    NAME                STATE     READ WRITE CKSUM
>>>    flash               ONLINE       0     0     0
>>>      raidz1-0          ONLINE       0     0     0
>>>        gpt/flash0.nop  ONLINE       0     0     0
>>>        gpt/flash1.nop  ONLINE       0     0     0
>>>        gpt/flash2.nop  ONLINE       0     0     0
>>> 
>>> errors: No known data errors
>>> ---
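>>> 
>>> For reference, the gnop step and pool creation were along these lines
>>> (a sketch of the usual recipe; the zdb line just confirms the resulting
>>> ashift):
>>> 
>>> ---
>>> gnop create -S 4096 gpt/flash0 gpt/flash1 gpt/flash2
>>> zpool create flash raidz1 gpt/flash0.nop gpt/flash1.nop gpt/flash2.nop
>>> zdb -C flash | grep ashift    # should report ashift: 12
>>> ---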
>>> 
>>> 
>>> This pool can be used, exported and imported just fine as far as I can
>>> tell. Then I exported the pool and destroyed the gnop-providers. When
>>> starting with "advanced format" hdds some years ago, this was the way to
>>> make zfs recognize the disks with ashift=12. However, destroying the
>>> gnop-devices appears to have crashed the pool in this case:
>>> 
>>> ---
>>> root at storage:~ # zpool import
>>>    pool: flash
>>>      id: 4978839938025863522
>>>   state: ONLINE
>>>  status: One or more devices contains corrupted data.
>>>  action: The pool can be imported using its name or numeric identifier.
>>>    see: http://illumos.org/msg/ZFS-8000-4J
>>>  config:
>>> 
>>>    flash                                           ONLINE
>>>      raidz1-0                                      ONLINE
>>>        11456367280316708003                        UNAVAIL  corrupted data
>>>        gptid/55ae71aa-eb84-11e5-9298-0cc47a6c7484  ONLINE
>>>        6761786983139564172                         UNAVAIL  corrupted data
>>> ---
>>> 
>>> 
>>> How can the pool be ONLINE when two of three devices are unavailable? I
>>> tried to import the pool nevertheless, but the zpool command got stuck in
>>> state tx->tx. A "soft" reboot got stuck, too. I had to push the reset button
>>> to get my system back (still with a corrupt pool). I cleared the labels
>>> and re-did everything: the issue is perfectly reproducible.
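>>> 
>>> For the record, clearing the labels was along these lines, run against
>>> each of the three partitions (a sketch):
>>> 
>>> ---
>>> zpool labelclear -f gpt/flash0
>>> zpool labelclear -f gpt/flash1
>>> zpool labelclear -f gpt/flash2
>>> ---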
>>> 
>>> Am I doing something utterly wrong? Why does removing the gnop-nodes
>>> tamper with the devices (I think I did exactly this dozens of times on
>>> normal hdds in previous years, and it always worked just fine)?
>>> And finally, why does the zpool import fail without any error message and
>>> require me to reset the system?
>>> The system is 10.2-RELEASE-p9; an update is scheduled for later this week
>>> (just in case it would make sense to try this again with 10.3). Any other
>>> hints are most welcome.
>> Did you destroy the gnop devices with the pool online?  In the procedure
>> I remember, you export the pool, destroy the gnop devices, and then
>> reimport the pool.
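>> 
>> Roughly (a sketch of that procedure, using the names from your output):
>> 
>> zpool export flash
>> gnop destroy gpt/flash0.nop gpt/flash1.nop gpt/flash2.nop
>> zpool import flash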
>> 
>> Also, you only need to do the gnop trick for a single device in the pool
>> for the entire pool's ashift to be changed AFAIK.  There is a sysctl
>> now, too:
>> 
>> vfs.zfs.min_auto_ashift
>> 
>> which lets you manage the ashift on a new pool without having to try
>> the gnop trick.
> This applies to each top-level vdev that makes up a pool, so it's not limited to just new pool creation; there should never be a reason to use the gnop hack to set ashift.
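> 
> For example (a sketch; the extra device labels are made up), with the
> sysctl in place a later expansion of the pool also picks up the larger
> ashift:
> 
> sysctl vfs.zfs.min_auto_ashift=12
> zpool add flash raidz1 gpt/flash3 gpt/flash4 gpt/flash5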
> 
>    Regards
>    Steve
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"

