Strange lock/crash - 100% cpu with basic command line utils
Ivan Dimitrov
zlobber at gmail.com
Tue Nov 12 12:28:32 UTC 2013
Hello list,
This is my first time reporting a problem, so please excuse me if this
is not the right place or format. I also apologize for my poor English.
Last month we started experiencing strange locks on some of our servers.
On semi-random occasions, when typing `cd`, `ls` or `pwd`, the server would
crash and start behaving strangely. Sometimes the problem is reproducible,
sometimes all commands work as expected.
All servers are Intel or AMD CPUs with FreeBSD 9.2 that netboot the
latest kernel and load the OS in RAM.
All our servers use ZFS with an SSD for cache. We also tested with
kernels built both with and without preemption. Here is an example
server:
==========================================
[root@ph3storage5 ~]# zpool status -v
  pool: zstorage5p1
 state: ONLINE
  scan: scrub repaired 0 in 39h36m with 0 errors on Mon Nov 4 05:11:48 2013
config:

        NAME           STATE     READ WRITE CKSUM
        zstorage5p1    ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            ada0       ONLINE       0     0     0
            ada1       ONLINE       0     0     0
        cache
          ada4p1       ONLINE       0     0     0

errors: No known data errors

  pool: zstorage5p2
 state: ONLINE
  scan: scrub repaired 0 in 14h59m with 0 errors on Sun Nov 3 04:41:50 2013
config:

        NAME           STATE     READ WRITE CKSUM
        zstorage5p2    ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            ada2       ONLINE       0     0     0
            ada3       ONLINE       0     0     0
        cache
          ada4p2       ONLINE       0     0     0

errors: No known data errors
==========================================
The typical lock would look like the following:
cd ~userdir/ ; ls
At this point, the `ls` command "freezes" and cannot be interrupted with
Ctrl+C. We open another console and see that the `ls` command is using
100% CPU. Also, some disk operations randomly start taking 1 to 2 minutes
to complete. For example, we used `camcontrol` a few times, and it froze
at one point.
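For anyone trying to reproduce this, the following is a minimal sketch of how the spinning process could be picked out from a second console. It is an illustration only, not what was used to collect the data in this report: the helper name `hot_procs` and the 90% threshold are assumptions.

```shell
#!/bin/sh
# hot_procs (hypothetical helper): read "pid %cpu command" lines on stdin,
# skip the header line, and print the PID and command name of every
# process at or above the given CPU threshold (default 90, an assumption).
hot_procs() {
    threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 && $2 + 0 >= t { print $1, $3 }'
}

# Typical use from a second console while the hang is in progress:
#   ps -axo pid,pcpu,comm | hot_procs 90
# On FreeBSD, the kernel stack of each reported PID can then be inspected
# with procstat, for example:
#   procstat -kk <pid>
```

A kernel stack taken while `ls` sits at 100% CPU usually shows where in the kernel the process is looping, which tends to be more useful to the list than the process state alone.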
Also, while the server was in this state, we used `zpool` to remove the
SSD cache from the pool and then re-added it. When we then issued
`zpool status`, the command froze for a minute.
We managed to collect some data from two different incidents:
Incident 1: http://pastebin.com/EkCeSwY9
Incident 2: http://pastebin.com/5rj9BV68
Since the problem is reproducible, we welcome suggestions on how to do
further tests.
Thanks in advance
Best Regards
Ivan Dimitrov