9.2-RC1: LORs / Deadlock with SU+J on HAST in "memsync" mode

Mon Aug 26 14:05:17 UTC 2013

Hello :)

On Sun, 25 Aug 2013 20:56:17 +0300
Mikolaj Golub <trociny at FreeBSD.org> wrote:

> On Thu, Aug 22, 2013 at 12:13:41PM +0200, Yamagi Burmeister wrote:
> 
> > After having some systems upgraded to FreeBSD 9.2-RC1/RC2 and switched
> > HAST to the new "memsync" mode I've seen processes getting stuck when
> > accessing files on UFS filesystems with SU+J enabled. Testing showed
> > that this only seems to happen (I couldn't reproduce it in other
> > combinations, but I'm not quite sure that's really the case) when HAST
> > is running in "memsync" mode and the UFS filesystem on HAST has SU+J
> > enabled. It can be reproduced easily with the instructions below.
> 
> I think I found (and reproduced) a scenario in which the primary might
> leak a HAST IO request (hio), resulting in IO getting stuck.

- snip -

> I am attaching the patch aimed to fix this. In done_queue, it checks
> if the request is after disconnection, and if it is and is completed
> locally it is put to the done queue. To disambiguate requests I had to
> add a flag to hio, telling if memsync ack from the secondary is
> already received.
> 
> Yamagi, I don't know if this is your case, but could you try the
> patch?

I'm sorry but the patch doesn't change anything. Processes accessing
the UFS on top of HAST still deadlock within a couple of minutes.
trasz@ suggested that all "buf"s may be exhausted, which would result in
an IO deadlock, but increasing their number fourfold via "kern.nbuf"
doesn't change anything either.
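
For readers following the thread, the check described in the quoted patch
summary can be sketched roughly as below. This is only an illustration of
the description, not the actual hastd code or the attached patch: the names
hio_sketch, local_done, memsync_acked and hio_sketch_should_finish are
hypothetical, and the message does not say exactly how the patch consults
the new flag, so the sketch carries the flag without using it.

```c
#include <stdbool.h>

/*
 * Hypothetical, heavily simplified stand-in for a HAST IO request (hio).
 * The real structure in hastd has many more fields and different names.
 */
struct hio_sketch {
	bool local_done;     /* the local write has completed */
	bool memsync_acked;  /* memsync ack from the secondary was received;
	                      * mirrors the flag mentioned in the message, but
	                      * how the patch uses it is not described there */
};

/*
 * Returns true if the request should be put on the done queue under the
 * rule quoted above: the request is handled after the secondary has
 * disconnected and it has already completed locally. The normal
 * (connected) completion path is not modelled here.
 */
bool
hio_sketch_should_finish(const struct hio_sketch *hio, bool disconnected)
{
	if (!disconnected)
		return (false);
	return (hio->local_done);
}
```

With this rule, a request that finished locally is not left waiting on a
secondary that is no longer connected, which is the leak the quoted text
describes.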

> If it does not help, please, after the hang, get core images of the
> worker processes (both primary and secondary) using gcore(1) and
> provide them together with hastd binary and libraries it is linked
> with (from `ldd /sbin/hastd' list). Note, core files might expose
> secure information from your host, if this worries you, you can send
> them to me privately.

No problem, it's a test setup without any production data. You can find
a tar archive with the binary and libs (all with debug symbols) here:
http://deponie.yamagi.org/freebsd/debug/lor_hast/hast_cores.tar.xz

I have two HAST providers, therefore two core dumps for each host:
hast_deadlocked.core -> worker for the provider on which the processes
                        deadlocked.
hast_not_deadlocked.core -> worker for the other provider

While all processes accessing the UFS filesystem on top of the provider
deadlocked, HAST still seemed to transfer data to the secondary. At
least the process generated CPU load, the switch LEDs were blinking
and the hard drive LEDs showed activity on both sides.

> Also, grep hastd /var/log/all.log from both the primary and the
> secondary might be useful.

Nothing on the primary. The secondary aborted as soon as I reset the
primary. Of course.

Aug 26 13:47:14 helene hastd[4237]: [rechts] (secondary) Unable to receive request header: Operation timed out.
Aug 26 13:47:19 helene hastd[1123]: [rechts] (secondary) Worker process exited ungracefully (pid=4237, exitcode=75).
Aug 26 13:47:34 helene hastd[4236]: [links] (secondary) Unable to receive request header: Operation timed out.
Aug 26 13:47:39 helene hastd[1123]: [links] (secondary) Worker process exited ungracefully (pid=4236, exitcode=75)

Ciao,
Yamagi