ZFS (ARC) performance regression in r321610

Yamagi Burmeister lists at yamagi.org
Mon Aug 27 13:47:51 UTC 2018


Hi,
after upgrading our database server from FreeBSD 11.1 to 11.2 we're
seeing a major performance regression. On FreeBSD 11.1 everything was
fine, both the master and the replication slaves were able to cope
with the workload. On FreeBSD 11.2 the master is rather slow and the
single-threaded replication slaves are falling behind. I managed to
track this down to ZFS; the ARC in particular seems to be the culprit.

The hardware:
 * 2x Intel Xeon E5-2643v3 with latest microcode.
 * 128GB RAM
 * 8x Intel DC S3700 800GB SSD connected to one LSI / Avago / Broadcom
   SAS9300-8i HBA running the latest firmware (version 16.00.01.00)

The software:
 * FreeBSD/amd64 11.1 and 11.2 with GENERIC kernel
 * ZFS configured as RAIDZ1 over 6 SSDs, one SSD as spare.
 * The system itself is running on UFS.
 * MySQL 5.1 (don't ask, the application needs such an old version)

Configuration:
 * vfs.zfs.arc_max="16G" (The memory is better used by MySQL)
 * vfs.zfs.compressed_arc_enabled="0"
 * lz4 compression is enabled on the dataset holding the MySQL data.
 * Power management (C-states, powerd) is turned off.
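
For reference, this is roughly how that configuration is applied; the
tunables live in /boot/loader.conf, and "zroot/mysql" stands in for the
actual dataset name:

  # /boot/loader.conf (applied at boot):
  #   vfs.zfs.arc_max="16G"
  #   vfs.zfs.compressed_arc_enabled="0"

  # lz4 on the dataset holding the MySQL data:
  zfs set compression=lz4 zroot/mysql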

The workload is somewhat special. There are hundreds of tables in the
database, but >95% of the load hits only one ~18GB MyISAM table. As
long as _only_ this big table gets all the load, everything is okay. As
soon as MySQL iterates over another big table ("select *", for example)
the performance degrades and never recovers.

While analyzing the problem I observed that reading through another
table seems to "poison" the ARC. While the zfs-stats script insists on
a >99% hit rate, the ARC's performance decreases by about 30% as soon
as it has been "poisoned". And even stranger: a very small change to
the ARC's configuration recovers the performance. For example, it's
enough to lower or raise vfs.zfs.arc_max by one byte.
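
For example, this one-liner (a variation on the workaround script
further down) is enough to trigger the recovery:

  sysctl vfs.zfs.arc_max=$(( $(sysctl -n vfs.zfs.arc_max) - 1 ))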

I was able to come up with a test case. The C code can be found inline
or here: https://gist.github.com/Yamagi/992733b5eed8d1b7e209b72e42d78509

To run this program you'll need a box with a ZFS pool and two files on
it. Both files must be bigger than the maximum ARC size. For example, if
vfs.zfs.arc_max="2G" the files should be at least 2GB, better 4GB. All
tests should be done after a reboot, to be sure that the ARC is empty.
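
With vfs.zfs.arc_max="2G" a setup could look like this (the paths are
placeholders; /dev/random keeps the files incompressible, so lz4 on
the dataset can't shrink them below the ARC size):

  # Two incompressible 4GB test files on the pool:
  dd if=/dev/random of=/mnt/file_a bs=1m count=4096
  dd if=/dev/random of=/mnt/file_b bs=1m count=4096

  clang -g -Wall -O2 -o zfsarc zfsarc.c
  ./zfsarc /mnt/file_a /mnt/file_b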

The program does what's described above, in more detail:
 1. As a first step it does 1,000,000 random reads over file A.
    This should (and does) show normal cache behavior, the reads
    getting faster as the ARC fills up.
 2. In the second step file B is read from start to end. That
    "poisons" the ARC.
 3. The third step again does 1,000,000 random reads over file A.
    This should start slow (because data from file B is in the ARC)
    and get faster over time. But on FreeBSD 11.2 it doesn't get any
    faster, the performance stays degraded. As if the ARC is never
    hit.
 4. The fourth step lowers vfs.zfs.arc_max by one byte.
 5. As a final step it does another 1,000,000 random reads from
    file A. On FreeBSD 11.2 the performance is now good again, just
    like in step 1.

With this program I was able to bisect the source and identify commit
r321610 (MFV r318946: 8021 ARC buf data scatter-ization) as the culprit:
https://svnweb.freebsd.org/base?view=revision&revision=321610
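
Each bisection step was just the usual kernel rebuild cycle, roughly
like this (a sketch; the source path and job count are examples):

  svn update -r 321610 /usr/src
  cd /usr/src && make -j8 buildkernel KERNCONF=GENERIC
  make installkernel KERNCONF=GENERIC
  shutdown -r now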

Of course the big question is: Is this really a regression, or have I
just found the one workload where a generally good and useful change
turns into a pessimization? And why is a small change to
vfs.zfs.arc_max enough to restore performance?

Anyway, running the following every 15 minutes through cron just to
keep our database servers performing well seems very, very wrong:

#!/bin/sh

# Toggle vfs.zfs.arc_max between the configured value and one byte
# less. Either direction is enough to restore ARC performance.

CURRENT=$(sysctl -n vfs.zfs.arc_max)
NORMAL="17179869184"

if [ "$CURRENT" -eq "$NORMAL" ]; then
	sysctl vfs.zfs.arc_max=$(($NORMAL - 1)) >/dev/null 2>&1
else
	sysctl vfs.zfs.arc_max=$NORMAL >/dev/null 2>&1
fi
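
The matching /etc/crontab entry might look like this (the script path
is a placeholder):

  */15	*	*	*	*	root	/root/bin/arc-toggle.sh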

Regards,
Yamagi

--------

These measurements were taken on a cheap desktop box with an Intel
Core i5-4430, 24GB RAM and a Samsung 840 EVO 120GB SSD. The effect is
not as pronounced as on the much faster production servers, but it is
still easily observable.


r321609 (good): The performance hit in step 3 is about 12%. In step 5
the performance is fully restored.

1. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 1960 Reads per second
 - 200000: 2020 Reads per second
 - 300000: 2040 Reads per second
 - 400000: 2040 Reads per second
 - 500000: 2049 Reads per second
 - 600000: 2054 Reads per second
 - 700000: 2058 Reads per second
 - 800000: 2056 Reads per second
 - 900000: 2059 Reads per second
 - 1000000: 2061 Reads per second
2. Reading /mnt/file_b from start to end.
3. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 1724 Reads per second
 - 200000: 1739 Reads per second
 - 300000: 1754 Reads per second
 - 400000: 1754 Reads per second
 - 500000: 1760 Reads per second
 - 600000: 1764 Reads per second
 - 700000: 1767 Reads per second
 - 800000: 1769 Reads per second
 - 900000: 1768 Reads per second
 - 1000000: 1769 Reads per second
4. Shrinking the ARC by one byte.
5. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 2040 Reads per second
 - 200000: 2061 Reads per second
 - 300000: 2054 Reads per second
 - 400000: 2061 Reads per second
 - 500000: 2066 Reads per second
 - 600000: 2061 Reads per second
 - 700000: 2064 Reads per second
 - 800000: 2061 Reads per second
 - 900000: 2064 Reads per second
 - 1000000: 2066 Reads per second  


r321610 (bad): Reads are generally much slower than on r321609 (good).
The performance hit in step 3 is about 27%. In step 5 the performance
is fully restored.

1. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 1754 Reads per second
 - 200000: 1785 Reads per second
 - 300000: 1807 Reads per second
 - 400000: 1809 Reads per second
 - 500000: 1818 Reads per second
 - 600000: 1818 Reads per second
 - 700000: 1822 Reads per second
 - 800000: 1826 Reads per second
 - 900000: 1825 Reads per second
 - 1000000: 1828 Reads per second
2. Reading /mnt/file_b from start to end.
3. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 1265 Reads per second
 - 200000: 1290 Reads per second
 - 300000: 1293 Reads per second
 - 400000: 1298 Reads per second
 - 500000: 1298 Reads per second
 - 600000: 1298 Reads per second
 - 700000: 1301 Reads per second
 - 800000: 1300 Reads per second
 - 900000: 1302 Reads per second
 - 1000000: 1302 Reads per second
4. Shrinking the ARC by one byte.
5. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 1694 Reads per second
 - 200000: 1694 Reads per second
 - 300000: 1704 Reads per second
 - 400000: 1709 Reads per second
 - 500000: 1706 Reads per second
 - 600000: 1709 Reads per second
 - 700000: 1707 Reads per second
 - 800000: 1709 Reads per second
 - 900000: 1707 Reads per second
 - 1000000: 1709 Reads per second 

--------

// clang -g -Wall -O2 -o zfsarc zfsarc.c

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <sys/stat.h>
#include <sys/sysctl.h>
#include <sys/time.h>
#include <sys/types.h>

#define ITERATIONS 1000000


void randreads(const char *file, struct stat *sb)
{
	FILE *fd;
	char buffer[256 * 1024];
	struct timeval start;

	gettimeofday(&start, NULL);

	if ((fd = fopen(file, "r")) == NULL)
	{
		fprintf(stderr, "Couldn't open %s: %s\n", file, strerror(errno));
		exit(1);
	}

	for (int i = 1; i <= ITERATIONS; i++)
	{
		// Note: arc4random_uniform() takes a uint32_t, so for
		// files larger than 4GiB only the first 4GiB are sampled.
		uint32_t bytestoread = arc4random_uniform(sizeof(buffer));
		uint32_t position = arc4random_uniform(sb->st_size - sizeof(buffer));

		bytestoread = bytestoread ? bytestoread : 1;

		if (fseek(fd, position, SEEK_SET) != 0)
		{
			fprintf(stderr, "Could't seek to %i in %s: %s\n", position, file, strerror(errno));
			fclose(fd);
			exit(1);
		}

		if (fread(buffer, bytestoread, 1, fd) != 1)
		{
			fprintf(stderr, "Couldn't read %i bytes from %s: %s\n", bytestoread, file, strerror(errno));
			fclose(fd);
			exit(1);
		}

		if (!(i % 100000))
		{
			struct timeval now;

			gettimeofday(&now, NULL);

			// Guard against a division by zero on boxes that
			// manage 100,000 reads in under a second.
			time_t elapsed = now.tv_sec - start.tv_sec;
			printf(" - %i: %li Reads per second\n", i, i / (elapsed > 0 ? elapsed : 1));
		}
	}

	fclose(fd);
}

void fullread(const char *file, struct stat *sb)
{
	FILE *fd;
	char buffer[256 * 1024];
	const uint32_t iterations = sb->st_size / sizeof(buffer);

	if ((fd = fopen(file, "r")) == NULL)
	{
		fprintf(stderr, "Couldn't open %s: %s\n", file, strerror(errno));
		exit(1);
	}

	for (int i = 1; i <= iterations; i++)
	{
		uint32_t bytestoread = (i == iterations) ? sb->st_size % sizeof(buffer) : sizeof(buffer);

		if (bytestoread > 0) 
		{
			if (fread(buffer, bytestoread, 1, fd) != 1)
			{
				fprintf(stderr, "Couldn't read %i bytes from %s: %s\n", bytestoread, file, strerror(errno));
				fclose(fd);
				exit(1);
			}
		}
	}

	fclose(fd);
}

int main(int argc, char *argv[])
{
	// No stdout buffering.
	setbuf(stdout, NULL);

	// Get arguments.
	if (argc != 3)
	{
		fprintf(stderr, "Usage: ./test file_a file_b\n");
		exit(1);
	}

	const char *file_a = argv[1];
	const char *file_b = argv[2];

	
	// Check if the files are large enough.
	// We need at least the size of the ARC.
	uint64_t arcmax;
	size_t arglen = sizeof(arcmax);

	if (sysctlbyname("vfs.zfs.arc_max", &arcmax, &arglen, NULL, 0) != 0)
	{
		fprintf(stderr, "Couldn't get vfs.zfs.arc_max: %s\n", strerror(errno));
		exit(1);
	}
	

	struct stat sb_a;

	if (stat(file_a, &sb_a) != 0)
	{
		fprintf(stderr, "Couldn't stat %s: %s\n", file_a, strerror(errno));
		exit(1);
	}

	if (!S_ISREG(sb_a.st_mode))
	{
		fprintf(stderr, "%s is not a file.\n", file_a);
		exit(1);
	}

	if ((uint64_t)sb_a.st_size < arcmax)
	{
		fprintf(stderr, "%s is too small: Must be at least %lu bytes\n", file_a, arcmax);
		exit(1);
	}


	struct stat sb_b;

	if (stat(file_b, &sb_b) != 0)
	{
		fprintf(stderr, "Couldn't stat %s: %s\n", file_b, strerror(errno));
		exit(1);
	}

	if (!S_ISREG(sb_b.st_mode))
	{
		fprintf(stderr, "%s is not a file.\n", file_b);
		exit(1);
	}

	if ((uint64_t)sb_b.st_size < arcmax)
	{
		fprintf(stderr, "%s is too small: Must be at least %lu bytes\n", file_b, arcmax);
		exit(1);
	}


	// Do ITERATIONS random reads on file A. This will fill up the ARC.
	// In theory this should start slow and get faster as the ARC fills
	// up.
	printf("1. Doing %i random reads from %s:\n", ITERATIONS, file_a);
	randreads(file_a, &sb_a);

	// Poison the ARC by reading once through file B. This will write
	// contents from file B into the ARC, replacing contents from file
	// A.
	printf("2. Reading %s from start to end.\n", file_b);
	fullread(file_b, &sb_b);

	// Do ITERATIONS random reads on file A. This should start slow and
	// get faster as contents of file B is replaced by contents of file
	// A. This can be seen in 11.1-RELEASE, but not in 11.2-RELEASE.
	printf("3. Doing %i random reads from %s:\n", ITERATIONS, file_a);
	randreads(file_a, &sb_a);

	// Shrink the ARC by 1 byte to alter its configuration.
	printf("4. Shrinking the ARC by one byte.\n");

	arcmax--;

	if (sysctlbyname("vfs.zfs.arc_max", NULL, NULL, &arcmax, arglen) != 0)
	{
		fprintf(stderr, "Couldn't set vfs.zfs.arc_max: %s\n", strerror(errno));
		exit(1);
	}

	// Do ITERATIONS random reads on file A. In 11.2-RELEASE this will show
	// good performance, again.
	printf("5. Doing %i random reads from %s:\n", ITERATIONS, file_a);
	randreads(file_a, &sb_a);

	return 0;
}

-- 
Homepage: https://www.yamagi.org
Github:   https://github.com/yamagi
GPG:      0x1D502515