Strange ZFS deadlock today
Peter Maloney
peter.maloney at brockmann-consult.de
Wed Mar 28 11:13:33 UTC 2012
And the top command reported (I think this was before running the first
java program):
54 processes: 1 running, 52 sleeping, 1 stopped
CPU: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 46M Active, 5504K Inact, 34G Wired, 5808K Cache, 2592K Buf, 13G Free
Swap: 80G Total, 25M Used, 80G Free
And after running "zdb -S" I am fairly sure zfs-stats also showed arc
around 10-13 GB below the arc max I set. But I'm not sure I ran the
command when "zdb -S" was done, or just after a long time of it running.
And compression and dedup are not enabled on any dataset.
/boot/loader.conf snippet:
vm.kmem_size="40g"
vm.kmem_size_max="40g"
vfs.zfs.arc_min="80m"
vfs.zfs.arc_max="38g"
vfs.zfs.arc_meta_limit="24g"
vfs.zfs.vdev.cache.size="32m"
vfs.zfs.vdev.cache.max="256m"
On 03/28/2012 11:52 AM, Peter Maloney wrote:
> I believe it is caused by ZFS, since the state looks like the code in
> this page about ARC:
> [url]http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/[/url]
>
> To cause/reveal this problem, first, I tried to run something with
> OpenJDK 6.
>
> In one window:
> [CODE]# ./CopyFromHd.sh
> load: 0.00 cmd: java 51239 [ucond] 44.56r 0.18u 0.06s 0% 19588k
> load: 0.00 cmd: java 51239 [ucond] 46.22r 0.18u 0.06s 0% 19588k
> load: 0.00 cmd: java 51239 [ucond] 51.14r 0.18u 0.06s 0% 19588k
> load: 0.00 cmd: java 51239 [ucond] 52.90r 0.18u 0.06s 0% 19588k
> ^C
> load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock]
> 58.35r 0.18u 0.06s 0% 19872k
> load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock]
> 61.73r 0.18u 0.06s 0% 19872k
> ^C
> load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock]
> 89.67r 0.18u 0.06s 0% 19872k
> load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock]
> 141.07r 0.18u 0.06s 0% 19872k
> ^C[/CODE]
>
> Then since it was hung, I thought maybe it was zfs' fault rather than
> OpenJDK, so tried with du.
>
> Second window:
> [CODE]
> # du -shx /data/archive2/2011/09/11/x
> 3.1G /data/archive2/2011/09/11/x
>
> # du -shx /data/archive2/2011/09/11
> 24G /data/archive2/2011/09/11
>
> # du -shx /data/archive2/2011/
> load: 0.00 cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 13.75r
> 0.00u 0.00s 0% 1012k
> ^C^C^C^Z^C
> load: 0.00 cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 221.97r
> 0.00u 0.00s 0% 1012k
> [/CODE]
>
> Then I thought I could just run the simplest version of the program
> (which does pretty much no IO)
>
> Third window:
> [CODE]# ./CopyFromHd.sh --help
> ^C^C^C^C^C^C
> load: 0.00 cmd: java 52339 [suspended] 26.33r 0.15u 0.04s 0% 25644k
> load: 0.00 cmd: java 52339 [suspended] 27.38r 0.15u 0.04s 0% 25644k
> ^C
> load: 0.00 cmd: java 52339 [suspended] 285.23r 0.15u 0.04s 0% 25644k
> ^Z
> [1]+ Stopped ./CopyFromHd.sh
> # jobs -p
> 51988
> # kill -9 51988
> # jobs -p
> 51988
> [1]+ Killed: 9 ./CopyFromHd.sh
> # jobs -p
> [/CODE]
>
> Back to the first window:
> [CODE]load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock]
> 459.38r 0.18u 0.06s 0% 19872k
> ^C^Z
> [1]+ Stopped ./CopyFromHd.sh ...
> # jobs -p
> 51128
> # kill -9 51128
> # jobs -p
> 51128
> [1]+ Interrupt: 2 ./CopyFromHd.sh ...
> # jobs -p
> # ps axl | grep java
> 0 51239 1 0 44 0 1264940 19904 - T 0 0:00.25
> [java]
> 0 76933 77797 0 44 0 9124 1180 piperd S+ 0 0:00.00
> grep java
> 0 52339 1 0 44 0 1266988 25676 - T 1 0:00.20
> /usr/local/openjdk6 //bin/java -Xmx1024M -classpath ...
> [/CODE]
>
>
> The java above is the "--help" one from the 2nd window. So I guess the
> first one ended.
>
> But "dd" commands I ran are all stuck, and so is the 2nd java, and
> also "zdb" "zdb zroot" "zdb tank" and "zdb data" are all stuck. Also
> running "find" on either tank or data hangs.
>
> [CMD="#"]zpool iostat[/CMD] [CMD="#"]zpool status [-v][/CMD] and
> [CMD="#"]zfs list[/CMD] all run without hanging.
>
> Then I thought maybe I would disable primarycache to see what happens.
>
> [CODE]# zfs set primarycache=none data
> # zfs set primarycache=none tank
> load: 0.00 cmd: zfs 80750 [tx->tx_sync_done_cv)] 5.73r 0.00u 0.00s 0%
> 1636k
> (hang)
> ^Z^C
> load: 0.00 cmd: zfs 80750 [tx->tx_sync_done_cv)] 87.28r 0.00u 0.00s
> 0% 1636k
> [/CODE]
>
> Another window:
> [CODE]# zfs get primarycache data
> NAME PROPERTY VALUE SOURCE
> data primarycache none local
> # zfs get primarycache tank
> NAME PROPERTY VALUE SOURCE
> tank primarycache none local[/CODE]
>
>
> On a Linux server that nfs mounts /data on this FreeBSD server,
> [CMD="#"]df[/CMD] hangs at the point where the nfs mount should be
> listed. (So I have to reboot now rather than later)
>
> I have not run memtest on this machine.
>
> I have no l2arc.
>
> Uptime is 22 days.
>
> It is normally perfectly stable. I ran CopyFromHd to process over 20
> TB of data so far (2 weeks ago), so it is not normal for this to
> happen. Other similar servers don't hang this way. (most hangs I've
> caused in the past are not in this
> "buf_hash_table.ht_locks[i].ht_lock" state.
>
> The last thing I did before this was running "zdb -S tank" yesterday.
> (tank is not the same pool as data)
>
> FreeBSD version is 8.2-STABLE csupped on 2012-02-04.
>
>
> And then on restart, the console showed: "some processes would not
> die; ps axl advised" and "shutdown terminated abnormally"
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
--
--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.maloney at brockmann-consult.de
Internet: http://www.brockmann-consult.de
--------------------------------------------
More information about the freebsd-fs
mailing list