9.2-RC1: LORs / Deadlock with SU+J on HAST in "memsync" mode

Mikolaj Golub trociny at FreeBSD.org
Sun Aug 25 17:56:23 UTC 2013


On Thu, Aug 22, 2013 at 12:13:41PM +0200, Yamagi Burmeister wrote:

> After having some systems upgraded to FreeBSD 9.2-RC1/RC2 and switched
> HAST to the new "memsync" mode I've seen processes getting stuck when
> accessing files on UFS filesystems with SU+J enabled. Testing showed
> that this only seems to happen (while I couldn't reproduce it in other
> combinations I'm not quite sure if its really the case) when HAST is
> running in "memsync" mode and the UFS filesystem on HAST has SU+J
> enabled. It can be reproduced easily with the instructions below. 

I think I found (and reproduced) a scenario in which the primary might
leak a HAST I/O request (hio), leaving the I/O stuck.

This may happen when the secondary disconnects while there are
pending WRITE requests in the primary's hio_recv list.

In the primary, remote_recv_thread():
  * hast_proto_recv_hdr() fails ("Unable to receive reply header");
  * continue (restart loop);
  * memsyncack = false;
  * take hio from hio_recv list;
  * goto done_queue;
  * hio has:
      hio_replication == MEMSYNC
      hio_countdown == 2 (local write complete, not in local_send
      queue)
    thus,
  * refcnt_release(&hio->hio_countdown) => hio_countdown == 1;
  * !memsyncack => continue.

As a result the hio is not put in any queue and the request is leaked.
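
To make the trace above easier to follow, here is a minimal,
self-contained C model of it. This is only a sketch, not the hastd
code: struct hio_model, model_release(), recv_after_disconnect() and
hio_leaked() are hypothetical names, and the queues are reduced to a
single "which list am I on" field.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lists an hio can sit on. */
enum hio_list { LIST_NONE, LIST_RECV, LIST_DONE };

/* Hypothetical, heavily reduced model of an hio. */
struct hio_model {
	bool		memsync;	/* hio_replication == MEMSYNC */
	unsigned int	countdown;	/* models hio_countdown */
	enum hio_list	list;		/* list the hio currently sits on */
};

/* Models the release step in the trace: decrement, return the new value. */
static unsigned int
model_release(unsigned int *count)
{
	assert(*count > 0);
	return (--*count);
}

/* Replays the steps of the trace for one hio after a failed header receive. */
static void
recv_after_disconnect(struct hio_model *hio)
{
	bool memsyncack = false;	/* reset when the loop restarts */

	hio->list = LIST_NONE;		/* take hio from hio_recv */
	/* goto done_queue */
	if (model_release(&hio->countdown) == 0) {
		hio->list = LIST_DONE;
		return;
	}
	if (hio->memsync && !memsyncack)
		return;			/* "continue": hio is on no list */
	/* Other paths are not part of the trace and are not modeled. */
}

/* In this model an hio is leaked if it is unfinished and on no list. */
static bool
hio_leaked(const struct hio_model *hio)
{
	return (hio->countdown > 0 && hio->list == LIST_NONE);
}
```

Starting from an hio with memsync set, countdown == 2 and list ==
LIST_RECV, a call to recv_after_disconnect() leaves countdown at 1 and
the hio on no list, which is the leaked state described above.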

I am attaching a patch that aims to fix this. In done_queue, it checks
whether the request is being handled after a disconnect; if so, and the
request has completed locally, it is put on the done queue. To
disambiguate requests I had to add a flag to hio that tells whether the
memsync ack from the secondary has already been received.
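
The idea can be sketched with a small C model. This is one possible
reading of the description, not the attached patch: the struct, the
memsyncacked field name and both functions are hypothetical, and the
model assumes that a memsync request which has not been acked yet
still holds two remote references in its countdown (the ack and the
final reply).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lists an hio can sit on. */
enum hio_list { LIST_NONE, LIST_RECV, LIST_DONE };

/* Hypothetical, heavily reduced model of an hio with the new flag. */
struct hio_model {
	bool		memsyncacked;	/* memsync ack already received? */
	unsigned int	countdown;	/* models hio_countdown */
	enum hio_list	list;		/* list the hio currently sits on */
};

/* Decrement the counter and return the new value. */
static unsigned int
model_release(unsigned int *count)
{
	assert(*count > 0);
	return (--*count);
}

/* done_queue handling for a memsync hio taken from hio_recv after a disconnect. */
static void
recv_after_disconnect_fixed(struct hio_model *hio)
{
	unsigned int left;

	hio->list = LIST_NONE;			/* taken from hio_recv */
	left = model_release(&hio->countdown);
	if (!hio->memsyncacked && left > 0) {
		/* Assumption: the ack will never arrive, drop its reference. */
		left = model_release(&hio->countdown);
	}
	if (left == 0)
		hio->list = LIST_DONE;		/* completed locally */
	/* Otherwise the local write is still pending and releases last. */
}
```

Without the flag, countdown == 1 after the first release is ambiguous
in this model: it could mean "acked, local write pending" or "not
acked, local write complete". Only the second case may go to the done
queue right away.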

Yamagi, I don't know if this is your case, but could you try the
patch?

If it does not help, please, after the hang, get core images of the
worker processes (both primary and secondary) using gcore(1) and
provide them together with the hastd binary and the libraries it is
linked with (as listed by `ldd /sbin/hastd'). Note that core files
might expose sensitive information from your host; if this worries
you, you can send them to me privately.

Also, the output of `grep hastd /var/log/all.log' from both the
primary and the secondary might be useful.

-- 
Mikolaj Golub
-------------- next part --------------
A non-text attachment was scrubbed...
Name: primary.c.memsync.hio_leack.1.patch
Type: text/x-diff
Size: 3871 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20130825/2ebb4d87/attachment.patch>

