Strange lock/crash - 100% cpu with basic command line utils

Steven Hartland killing at multiplay.co.uk
Tue Nov 12 13:28:06 UTC 2013


----- Original Message ----- 
From: "Ivan Dimitrov" <zlobber at gmail.com>
To: <freebsd-fs at freebsd.org>
Sent: Tuesday, November 12, 2013 12:28 PM
Subject: Strange lock/crash - 100% cpu with basic command line utils


> Hello list
> 
> This is my first time reporting a problem, so please excuse me if this 
> is not the right place or format. Apologies also for my poor English.
> 
> Last month we started experiencing strange locks on some of our servers. 
> On semi-random occasions, when typing `cd`, `ls` or `pwd`, the server would 
> crash and start behaving strangely. Sometimes the problem is reproducible, 
> sometimes all commands work as expected.
> All servers have Intel or AMD CPUs and run FreeBSD 9.2; they netboot the 
> latest kernel and load the OS into RAM.
> All our servers use ZFS with an SSD for cache. We also tested with both 
> preemptive and non-preemptive kernels. Here is an example server:
> 
> ==========================================
> 
> [root at ph3storage5 ~]# zpool status -v
>   pool: zstorage5p1
>  state: ONLINE
>   scan: scrub repaired 0 in 39h36m with 0 errors on Mon Nov  4 05:11:48 2013
> config:
> 
>     NAME         STATE     READ WRITE CKSUM
>     zstorage5p1  ONLINE       0     0     0
>       mirror-0   ONLINE       0     0     0
>         ada0     ONLINE       0     0     0
>         ada1     ONLINE       0     0     0
>     cache
>       ada4p1     ONLINE       0     0     0
> 
> errors: No known data errors
> 
>   pool: zstorage5p2
>  state: ONLINE
>   scan: scrub repaired 0 in 14h59m with 0 errors on Sun Nov  3 04:41:50 2013
> config:
> 
>     NAME         STATE     READ WRITE CKSUM
>     zstorage5p2  ONLINE       0     0     0
>       mirror-0   ONLINE       0     0     0
>         ada2     ONLINE       0     0     0
>         ada3     ONLINE       0     0     0
>     cache
>       ada4p2     ONLINE       0     0     0
> 
> errors: No known data errors
> 
> ==========================================
> A typical lock looks like the following:
> cd ~userdir/ ; ls
> At this point, the ls command "freezes" and cannot be interrupted with 
> Ctrl+C.
> We open up another console and see that the `ls` command is using 100% 
> CPU. Also, some disk operations randomly start taking 1 to 2 minutes to 
> complete. For example, we used `camcontrol` a few times, and it froze 
> at one point.
> Also (while crashed) we used zpool to remove the SSD cache from the 
> pool, then we re-added the cache to the pool, but when we issued 
> zpool status, the command froze for a minute.
> 
> We managed to collect some data from two different incidents
> 
> Incident 1: http://pastebin.com/EkCeSwY9
> Incident 2: http://pastebin.com/5rj9BV68
> 
> Since the problem is reproducible, we welcome suggestions on further 
> tests to run.

This may be off the mark, as I've not seen 100% CPU, but we have
seen random unexplained hangs when connecting to some new machines
here, and it turned out to be a simple lack of mbufs caused by the
fact that the machines have 6 Intel igb NICs. So the command wasn't
hanging at all; it was the output over ssh that was hanging due
to a lack of mbufs to send the output to the client.

If you run "netstat -m" you'll be able to check and confirm /
eliminate this as your problem.
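
As a rough sketch of that check: the helper below (the name
check_mbuf_denied is made up for illustration) filters "netstat -m"
output for the "denied" and "delayed" request counters and flags any
that are above zero. It assumes those lines start with slash-separated
counts, e.g. "0/0/0 requests for mbufs denied", so adjust it if your
output looks different.

```shell
# Read `netstat -m` output on stdin, print the "denied"/"delayed" lines,
# and return 1 if any of their leading slash-separated counters is non-zero.
# Assumption: the counters are the first field, e.g. "0/0/0 requests for ...".
check_mbuf_denied() {
    awk '
        /denied|delayed/ {
            print
            n = split($1, counts, "/")
            for (i = 1; i <= n; i++)
                if (counts[i] + 0 > 0) bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}

# On the affected FreeBSD host:
#   netstat -m | check_mbuf_denied || echo "possible mbuf shortage"
```

If any counter is non-zero, comparing the "mbuf clusters in use" line
against "sysctl kern.ipc.nmbclusters" should show how close to the
limit the machine is running.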


My next check would be for a failing disk, so throw smartctl at each of them.
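
Something along these lines would cover every disk in one go. This is
only a sketch: it assumes smartctl from smartmontools is installed, the
function name smart_health_all and the SMARTCTL override are made up
for illustration, and the device names are the ones from your zpool
status output.

```shell
# Run a SMART overall-health check (-H) on each named disk.
# SMARTCTL may be set to override the binary; it defaults to "smartctl".
smart_health_all() {
    for disk in "$@"; do
        echo "== /dev/$disk =="
        "${SMARTCTL:-smartctl}" -H "/dev/$disk"
    done
}

# On the affected host:
#   smart_health_all ada0 ada1 ada2 ada3 ada4
```

For a disk that looks suspect, "smartctl -a" on that device prints the
full attribute table and error log.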

Finally, check the memory with memtest86+ or something similar.

    Regards
    Steve



