8.0-RELEASE: disk IO temporarily hangs up (ZFS or ATA related problem)
Benjamin Close
Benjamin.Close at clearchain.com
Thu Dec 17 13:37:41 UTC 2009
On 17/12/09 19:30, Alexander Zagrebin wrote:
> Hi!
>
> I use onboard ICH7 SATA controller with two disks attached:
>
> atapci1: <Intel ICH7 SATA300 controller> port
> 0x30c8-0x30cf,0x30ec-0x30ef,0x30c0-0x30c7,0x30e8-0x30eb,0x30a0-0x30af irq 19
> at device 31.2 on pci0
> atapci1: [ITHREAD]
> ata2: <ATA channel 0> on atapci1
> ata2: [ITHREAD]
> ata3: <ATA channel 1> on atapci1
> ata3: [ITHREAD]
> ad4: 1430799MB <Seagate ST31500541AS CC34> at ata2-master SATA150
> ad6: 1430799MB <WDC WD15EADS-00P8B0 01.00A01> at ata3-master SATA150
>
> The disks are used for a mirrored ZFS pool.
> I have noticed that the system periodically locks up on disk operations.
> After approx. 10 min of very slow disk i/o (several KB/s), the speed of
> disk operations returns to normal.
> gstat has shown that the problem is with ad6.
> For example, here is filtered output of iostat -x 1:
>
> extended device statistics
> device r/s w/s kr/s kw/s wait svc_t %b
> ad6 818.6 0.0 10840.2 0.0 0 0.4 34
> ad6 300.6 642.0 3518.5 24830.3 50 24.8 72
> ad6 1.0 639.3 63.7 17118.3 0 62.1 98
> ad6 404.5 4.0 6837.7 4.0 0 0.5 18
> ad6 504.5 0.0 13667.2 0.0 1 0.7 32
> ad6 633.3 0.0 13190.3 0.0 1 0.7 38
> ad6 416.3 384.5 8134.7 24606.2 0 16.3 57
> ad6 538.9 76.7 9772.8 2982.2 55 2.9 40
> ad6 31.9 929.5 801.0 37498.6 0 27.2 82
> ad6 635.5 0.0 13087.1 0.0 1 0.6 35
> ad6 579.6 0.0 16669.8 0.0 0 0.8 43
> ad6 603.6 0.0 11697.4 0.0 1 0.7 40
> ad6 538.0 0.0 10438.7 0.0 0 0.9 47
> ad6 30.9 898.4 868.6 40585.4 0 36.6 78
> ad6 653.3 86.6 8566.6 202.7 1 0.8 40
> ad6 737.1 0.0 6429.4 0.0 1 0.6 42
> ad6 717.1 0.0 3958.7 0.0 0 0.5 36
> ad6 1179.5 0.0 2058.9 0.0 0 0.1 15
> ad6 1191.2 0.0 1079.6 0.0 1 0.1 15
> ad6 985.1 0.0 5093.9 0.0 0 0.2 23
> ad6 761.8 0.0 9801.3 0.0 1 0.4 31
> ad6 698.7 0.0 9215.1 0.0 0 0.4 30
> ad6 434.2 513.9 5903.1 13658.3 48 10.2 55
> ad6 3.0 762.8 191.2 28732.3 0 57.6 99
> ad6 10.0 4.0 163.9 4.0 1 1.6 2
>
> Up to this line, operations are normal.
> Then the behaviour of ad6 changes (note the high average access times and
> "busy" percentages significantly greater than 100):
>
> ad6 0.0 0.0 0.0 0.0 1 0.0 0
> ad6 1.0 0.0 0.5 0.0 1 1798.3 179
> ad6 1.0 0.0 1.5 0.0 1 1775.4 177
> ad6 0.0 0.0 0.0 0.0 1 0.0 0
> ad6 10.0 0.0 75.2 0.0 1 180.3 180
> ad6 0.0 0.0 0.0 0.0 1 0.0 0
> ad6 83.7 0.0 862.9 0.0 1 21.4 179
> ad6 0.0 0.0 0.0 0.0 1 0.0 0
> ad6 1.0 0.0 63.7 0.0 1 1707.4 170
> ad6 1.0 0.0 9.0 0.0 0 1791.0 178
> ad6 10.9 0.0 172.2 0.0 2 0.2 0
> ad6 24.9 0.0 553.7 0.0 1 143.3 179
> ad6 0.0 0.0 0.0 0.0 7 0.0 0
> ad6 2.0 23.9 32.4 1529.9 1 336.3 177
> ad6 0.0 0.0 0.0 0.0 1 0.0 0
> ad6 68.7 0.0 1322.8 0.0 1 26.3 181
> ad6 0.0 0.0 0.0 0.0 1 0.0 0
> ad6 27.9 0.0 193.7 0.0 1 61.6 172
> ad6 1.0 0.0 2.5 0.0 1 1777.4 177
> ad6 0.0 0.0 0.0 0.0 1 0.0 0
> ad6 1.0 0.0 2.0 0.0 1 1786.9 178
> ad6 0.0 0.0 0.0 0.0 1 0.0 0
> ad6 2.0 0.0 6.5 0.0 1 899.4 179
> ad6 0.0 0.0 0.0 0.0 1 0.0 0
> ad6 1.0 0.0 2.0 0.0 1 1786.7 178
> ad6 0.0 0.0 0.0 0.0 1 0.0 0
>
> And so on for about 10 minutes.
> Then disk i/o returns to normal:
>
> ad6 139.4 0.0 8860.5 0.0 1 4.4 61
> ad6 167.3 0.0 10528.7 0.0 1 3.3 55
> ad6 60.8 411.5 3707.6 8574.8 1 19.6 87
> ad6 163.4 0.0 10334.9 0.0 1 4.4 72
> ad6 157.4 0.0 9770.7 0.0 1 5.0 78
> ad6 108.5 0.0 6886.8 0.0 0 3.9 43
> ad6 101.6 0.0 6381.4 0.0 0 2.6 27
> ad6 109.6 0.0 7013.9 0.0 0 2.0 22
> ad6 121.4 0.0 7769.7 0.0 0 2.4 29
> ad6 92.5 0.0 5922.6 0.0 1 3.4 31
> ad6 122.4 19.9 7833.0 1273.7 0 3.9 54
> ad6 83.6 0.0 5349.5 0.0 0 3.9 33
> ad6 5.0 0.0 318.4 0.0 0 8.1 4
>
> There are no ATA error messages, either in the system log or on the
> console.
> The manufacturer's diagnostic test passes on ad6 without any errors.
> ad6 also contains a swap partition.
> I have tried running several (10-20) instances of dd, reading and writing
> data to and from the swap partition simultaneously, but this did not
> trigger the lockup.
> So this problem is probably ZFS related.
>
> I have been forced to switch ad6 to the offline state... :(
>
> Any suggestions on this problem?
>
I have also been experiencing the same problem with a different
disk/controller (via mpt on a VMware machine). During the same period I
notice that system CPU usage hits 80+% and top shows the zfskern process
as the main culprit. At the same time I've seen the
kstat.zfs.misc.arcstats.memory_throttle_count sysctl rising. The ARC is
also normally close to the arc_max limit.
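
If it helps anyone watch for the same symptom, here is a small C sketch
that polls those counters via sysctlbyname(3). memory_throttle_count is
the sysctl named above; kstat.zfs.misc.arcstats.size and vfs.zfs.arc_max
are the names I believe FreeBSD exposes for the current ARC size and its
limit, and the one-second interval is an arbitrary choice:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	uint64_t throttles, arc_size, arc_max;
	size_t len;

	for (;;) {
		/* Times ZFS has throttled writes because memory looked low. */
		len = sizeof(throttles);
		if (sysctlbyname("kstat.zfs.misc.arcstats.memory_throttle_count",
		    &throttles, &len, NULL, 0) == -1) {
			perror("memory_throttle_count");
			return (1);
		}
		/* Current ARC size versus its configured ceiling. */
		len = sizeof(arc_size);
		if (sysctlbyname("kstat.zfs.misc.arcstats.size",
		    &arc_size, &len, NULL, 0) == -1) {
			perror("arcstats.size");
			return (1);
		}
		len = sizeof(arc_max);
		if (sysctlbyname("vfs.zfs.arc_max",
		    &arc_max, &len, NULL, 0) == -1) {
			perror("arc_max");
			return (1);
		}
		printf("throttle_count=%ju arc=%ju/%ju\n",
		    (uintmax_t)throttles, (uintmax_t)arc_size,
		    (uintmax_t)arc_max);
		sleep(1);
	}
}

In my case the throttle count ticks upward while the ARC size sits at
arc_max, which matches what I see in top during the stalls.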
Looking at the code, I'm wondering if EAGAIN really should have been
ERESTART in the case where a reclaim is needed:
contrib/opensolaris/uts/common/fs/zfs/arc.c, line 3314:
	/*
	 * If we are in pageout, we know that memory is already tight,
	 * the arc is already going to be evicting, so we just want to
	 * continue to let page writes occur as quickly as possible.
	 */
	if (curproc == pageproc) {
		if (page_load > available_memory / 4)
			return (ERESTART);
		/* Note: reserve is inflated, so we deflate */
		page_load += reserve / 8;
		return (0);
	} else if (page_load > 0 && arc_reclaim_needed()) {
		/* memory is low, delay before restarting */
		ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
		return (EAGAIN);
	}
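
Concretely, the one-line change I'm suggesting would be the following
(a sketch only; if I'm reading the dmu_tx_assign() path correctly,
ERESTART makes the caller back off via dmu_tx_wait() and retry, whereas
EAGAIN propagates back up as a write error):

	} else if (page_load > 0 && arc_reclaim_needed()) {
		/* memory is low, delay before restarting */
		ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
		/*
		 * Ask the caller to back off and retry, as the pageout
		 * path above does, instead of failing with EAGAIN.
		 */
		return (ERESTART);
	}

That would at least make the throttled path behave like the pageout path
above; I haven't yet tested whether it changes the 10-minute stalls.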