ZFS corruption using zvols as backingstore for hvm VMs

Wed Nov 11 09:13:29 UTC 2020

	Hello.  Following up on my own message, I believe I've run into a
serious problem that exists on FreeBSD-xen with FreeBSD-12.1P10 and
Xen-4.14.0.  In case yesterday's post was describing an old bug, I
updated to xen-4.14.0 and Qemu-5.0.0.  The problem was still there,
i.e. when writing to a second virtual hard drive on an hvm domu, the drive
becomes corrupted.  Again, zpool scrub shows no errors.
	So, I decided it might be some sort of memory error.  I wrote a memory
test program, shown below, and ran it on my hvm domu.  It not only
crashed the domu itself, it crashed the entire xen server!  There are some
dmesg messages logged just before the xen server crashed, shown below, which
suggest a serious problem.  In my view, no matter how badly the domu hvm
host behaves, it shouldn't be able to crash the xen server itself!  The
domu is running NetBSD-5.2, an admittedly old version of the operating
system, but I'm running a fleet of these machines, both on real hardware
and on older versions of xen with no stability issues whatsoever!  And, as
I say, I shouldn't be able to wipe out the xen server from an hvm domu, no
matter what I do!

	The memory test program takes one argument, the amount of RAM, in
megabytes, you want it to test.  It then allocates that memory, and
sequentially walks through that memory over and over again, writing to it
and reading from it, checking to make sure the data read matches the data
written.  Because it pauses for five seconds at each megabyte boundary,
the resident set size of the program grows slowly over time, as it works.
It was originally written to test the paging efficiency of a system, but I
modified it to actually test the memory along the way.
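
The resident set size growth described above can be observed directly.
The following is only a sketch, not part of testmem.c: the helper names
peak_rss() and rss_growth_after_touch() are made up for illustration, and
it assumes getrusage() reports ru_maxrss the way the BSDs and Linux do
(in kilobytes; other platforms may use different units).

```c
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: peak resident set size of this process, in the
 * platform's ru_maxrss units (kilobytes on the BSDs and Linux).
 * Returns -1 if getrusage() fails. */
static long
peak_rss(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return ru.ru_maxrss;
}

/* Hypothetical helper: allocate mb megabytes, write one byte to every
 * page so each page has to be brought into core, and return how much the
 * peak RSS grew while doing so.  Returns -1 on any failure. */
static long
rss_growth_after_touch(size_t mb)
{
	size_t len = mb * 1024 * 1024;
	long pg = sysconf(_SC_PAGESIZE);
	volatile char *buf;	/* volatile so the writes are not optimized out */
	long before, after;
	size_t off;

	if (len == 0 || pg <= 0)
		return -1;
	buf = malloc(len);
	if (buf == NULL)
		return -1;

	before = peak_rss();
	for (off = 0; off < len; off += (size_t)pg)
		buf[off] = 1;
	after = peak_rss();

	free((void *)buf);
	if (before < 0 || after < 0)
		return -1;
	return after - before;
}
```

Calling rss_growth_after_touch(32) from a small main() and printing the
result should show growth on the order of the amount touched, since pages
from a large malloc() are normally not resident until first written.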
	To reproduce the issue, perform the following steps:

1.  Set up an hvm host.  I think FreeBSD as a domu hvm host will work fine.
Use zfs zvols as the backingstore for the virtual disk(s) for your host.

2.  Compile this program for that host and run it as follows:
./testmem 1000
This asks the program to allocate 1000 megabytes (roughly 1G) of memory
and then walk through and test it.  It reports each megabyte of memory it
has written and tested.  My test hvm had 4G of RAM, as it was a 32-bit OS
running on the domu.  Nothing else was running on either the xen server or
the domu host.
I'm not sure exactly how far the program got in its memory walk before
things went south, but I think it touched about 100 megabytes of its 1000
megabyte allocation.
My program was not running as root, so it had no special privileges, even
on the domu host.  
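
For reference, an argument of 1000 works out to 1000 * 1024 * 1024 =
1,048,576,000 bytes, which still fits in a signed 32-bit int; an argument
of 2048 or more would not.  A minimal sketch of doing that conversion with
an overflow check, using a hypothetical helper mb_to_bytes() that is not
part of testmem.c:

```c
#include <stddef.h>
#include <stdint.h>

#define BYTES_PER_MB ((size_t)1024 * 1024)

/* Hypothetical helper: convert a megabyte count to a byte count.
 * Stores the result in *out and returns 0 on success.  Returns -1,
 * leaving *out untouched, if mb is not positive or the byte count
 * would not fit in a size_t. */
static int
mb_to_bytes(long mb, size_t *out)
{
	if (mb <= 0)
		return -1;
	if ((unsigned long)mb > SIZE_MAX / BYTES_PER_MB)
		return -1;
	*out = (size_t)mb * BYTES_PER_MB;
	return 0;
}
```

Using size_t for the byte count, rather than int, keeps the arithmetic
well defined up to whatever the platform can actually address.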

	I'm not sure if the problem is with qemu, xen, or some combination of
the two.

	It would be great if someone could reproduce this issue and maybe shed
a bit more light on what's going on.

-thanks
-Brian

<error messages on xen server just before the crash!>

Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_ring2pkt:1534): Unknown extra info type 255.  Discarding packet
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref  =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =8
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id    =69
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size  =1000
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =1
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref  =255
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id    =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size  =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_rxpkt2rsp:2068): Got error -1 for hypervisor gnttab_copy status
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_ring2pkt:1534): Unknown extra info type 255.  Discarding packet
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref  =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =8
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id    =69
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size  =1000
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:304): netif_tx_request index =1
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:305): netif_tx_request.gref  =255
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:306): netif_tx_request.offset=0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:307): netif_tx_request.flags =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:308): netif_tx_request.id    =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_dump_txreq:309): netif_tx_request.size  =0
Nov 11 00:28:54 xen-lothlorien kernel: xnb(xnb_rxpkt2rsp:2068): Got error -1 for hypervisor gnttab_copy status
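
As an aid to reading the dump above, here is a small sketch that decodes
the netif_tx_request.flags field.  The bit values are an assumption taken
from the NETTXF_* definitions in Xen's public io/netif.h as I remember
them, so check them against the headers for your Xen version;
describe_tx_flags() is a made-up helper, not part of any driver.  Under
that assumption, flags =8 on the first request is the extra_info bit,
which would be why the backend tries to parse the following slot as an
extra info descriptor and then complains about type 255.

```c
#include <stdio.h>
#include <string.h>

/* Assumed to match Xen's public io/netif.h; verify against your headers. */
#define NETTXF_csum_blank	(1U << 0)
#define NETTXF_data_validated	(1U << 1)
#define NETTXF_more_data	(1U << 2)
#define NETTXF_extra_info	(1U << 3)

/* Hypothetical helper: write a "|"-separated list of the flag names set
 * in flags into out (outlen must be at least 1) and return out.
 * Writes "none" if no known bit is set. */
static const char *
describe_tx_flags(unsigned flags, char *out, size_t outlen)
{
	static const struct {
		unsigned bit;
		const char *name;
	} names[] = {
		{ NETTXF_csum_blank,     "csum_blank" },
		{ NETTXF_data_validated, "data_validated" },
		{ NETTXF_more_data,      "more_data" },
		{ NETTXF_extra_info,     "extra_info" },
	};
	size_t i;

	out[0] = '\0';
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if ((flags & names[i].bit) == 0)
			continue;
		if (out[0] != '\0')
			strncat(out, "|", outlen - strlen(out) - 1);
		strncat(out, names[i].name, outlen - strlen(out) - 1);
	}
	if (out[0] == '\0')
		snprintf(out, outlen, "none");
	return out;
}
```

With a 64-byte buffer, passing 8 yields the single name for the
extra_info bit, and passing 0 yields "none".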

<cut here for test program, testmem.c>

/**************************************************************************
NAME: Brian Buhrow
DATE: November 11, 2020

PURPOSE: This program allocates the indicated number of megabytes of RAM,
and then proceeds to touch each page to ensure that it gets brought into
core memory.  In this way, this program attempts to exercise the OS's VM
paging system.
It then checks to see if there is any memory corruption by re-reading each
segment that it writes.
Terminate the program by hitting control-c.
**************************************************************************/

static char rcsid[] = "$Id: testmem.c,v 1.2 2020/11/11 08:06:27 buhrow Exp $";

#include <stdio.h>
#include <stdlib.h>	/* malloc(), exit() */
#include <string.h>
#include <strings.h>	/* bzero(), bcopy() */
#include <unistd.h>

#define TESTSTR "This is a test\n"
int
main(int argc, char **argv)
{
	int i, pgsize, bufsiz, requested, testindex, testlen;
	char *buf, *ptr;
	char tmpbuf[1024];

	if (argc != 2) {
		printf("Usage: %s <size in megabytes>\n",argv[0]);
		return(0);
	}

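	/* reject non-numeric, zero, or negative input */
	if (sscanf(argv[1],"%d",&requested) != 1 || requested <= 0) {
		printf("%s: You must request more than 0 MB of RAM.\n",argv[0]);
		return(0);
	}
	/* bufsiz is an int; 2048 MB or more would overflow a 32-bit int */
	if (requested > 2047) {
		printf("%s: You cannot request more than 2047 MB of RAM.\n",argv[0]);
		return(0);
	}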

	bufsiz = requested * (1024 * 1024);
	printf("%s: Allocating %dMB of RAM (%d bytes)\n",argv[0],requested,bufsiz);

	buf = (char *)malloc(bufsiz);
	if (!buf) {
		snprintf(tmpbuf,sizeof(tmpbuf),"%s: Unable to allocate memory",argv[0]);
		perror(tmpbuf);
		exit(1);
	}

	printf("%s: Memory allocated, starting at address: %p\n",argv[0],(void *)buf);

	pgsize = getpagesize();
	testindex = 65;
	for(;;) {
		bzero(tmpbuf, 1024);
		sprintf(tmpbuf, "%s%c\n",TESTSTR,testindex);
		testindex += 1;
		if (testindex > 126) testindex = 65;
		testlen = strlen(tmpbuf);
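		/* write pass: stop before a copy would run past the end of buf */
		for (i = 0;i + testlen <= bufsiz;i += testlen) {
			ptr = &buf[i];
			bcopy(tmpbuf, ptr, testlen);
			/* true exactly once per megabyte boundary crossed */
			if ((i % (1024 * 1024)) < testlen) {
				printf("%d MB touched...\n",i / (1024 * 1024));
				sleep(5);
			}
		}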

		/* verify pass: re-read each segment written above */
		for (i = 0;i + testlen <= bufsiz;i += testlen) {
			ptr = &buf[i];
			if (memcmp(tmpbuf, ptr, testlen) != 0) {
				printf("Memory error near %p\n",(void *)ptr);
			}
			if ((i % (1024 * 1024)) < testlen) {
				printf("%d MB checked...\n",i / (1024 * 1024));
				sleep(5);
			}
		}
	}

	/*not reached*/
	exit(0);
}