Regression when trying to replace poll() with kqueue()
Allan Jude
allanjude at freebsd.org
Wed Oct 3 05:01:04 UTC 2018
On 2018-10-01 22:24, Thomas Munro wrote:
> Hello FreeBSD hackers,
>
> (CCing mjg and a list of other FreeBSD hackers he suggested)
>
> In a fit of enthusiasm for FreeBSD, a couple of years ago I wrote a
> patch to teach PostgreSQL to use kqueue(2). That was after we
> switched over to epoll(2) on Linux for performance reasons. Our
> default is to use poll(2) unless we have something better. The most
> common usage pattern is simply waiting for read/write readiness on the
> socket that is connected to the client + a pipe connected to the
> parent supervisor process ("postmaster"), but we have plans for more
> interesting kinds of multiplexing involving many more descriptors, and
> in general this sits behind our very thin abstraction called
> WaitEventSet (see latch.c in the PostgreSQL source tree) that can be
> used for many things.
>
> We did some testing using "pgbench" (instructions below) on various
> platforms that have kqueue(2), and we got some conflicting results
> from FreeBSD. When the system is heavily overloaded (a scenario we
> want to work well, or at least not get worse under kqueue, even if
> it's not the ideal way to run your database server), mjg reported that
> with the kqueue patch performance was way better than unpatched when
> the pgbench test client was running on a different host. Huzzah!
>
> Unfortunately, another tester reported the performance was worse when
> running pgbench from the same host (originally he complained about
> NetBSD performance and then we realised FreeBSD was the same under
> those conditions), and I confirmed that was the case for both Unix
> sockets and TCP sockets. In one 96 (!) thread test, the TPS reported
> by pgbench dropped from 70k to 50k queries per second on an 8 CPU
> system. As crazy as those test conditions may seem, that is not a
> good result.
>
> Curiously, when truss'd, in the overloaded scenario that performs
> worse, we very rarely seem to actually reach kevent(2). It seems like
> there is some kind of scheduling difference producing the change.
> Each PostgreSQL server process looks like this over ~10 seconds:
>
> syscall                    seconds   calls  errors
> sendto                 0.396840146    3452       0
> recvfrom               0.415802029    3443       6
> kevent                 0.000626393       6       0
> gettimeofday           2.723923249   24053       0
>                      ------------- ------- -------
>                        3.537191817   30954       6
>
> (That was captured on a virtualised system which had gettimeofday as a
> syscall, but the effect has been reported on bare metal too and there
> no gettimeofday calls show up; I don't believe that is a factor).
>
> The pgbench client looks like this:
>
> syscall                    seconds   calls  errors
> ppoll                  0.002773195       1       0
> sendto                16.597880468    7217       0
> recvfrom              25.646406008    7238       0
>                      ------------- ------- -------
>                       42.247059671   14456       0
>
> (For whatever reason pgbench uses ppoll() instead, but I assume that's
> irrelevant here; it's also multi-threaded, unlike the server.) The
> truss -c results for the server are not much different when using
> poll(2) instead of kevent(2), although recvfrom in the pgbench client
> seems to show a few seconds less total time, which is curious. You
> can see that we're mostly able to do sendto() and recvfrom() without
> seeing EWOULDBLOCK. So it's not direct access to the kqueue that is
> affecting performance. It's something else, something caused by the
> mere existence of the kqueue object holding the descriptor.
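
The access pattern implied by those numbers (try the read first, and only
wait when it would block) can be sketched as follows. This is a hedged
illustration rather than PostgreSQL's actual code: read_or_wait and
struct read_stats are hypothetical names, MSG_DONTWAIT stands in for a socket
that was put into non-blocking mode, and poll(2) stands in for whichever wait
primitive is in use. The 6 recvfrom errors next to the 6 kevent calls in the
server's truss output are consistent with this shape.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical counters, to show how rarely the wait path is taken. */
struct read_stats {
    unsigned long recv_calls;
    unsigned long waits;
};

/*
 * Try a non-blocking recv() first; fall back to waiting for readiness
 * only if the socket has nothing for us yet.
 * Returns bytes read, 0 on EOF, or -1 on error.
 */
static ssize_t
read_or_wait(int fd, void *buf, size_t len, struct read_stats *stats)
{
    assert(buf != NULL && stats != NULL);

    for (;;) {
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);

        stats->recv_calls++;
        if (n >= 0)
            return n;           /* data (or EOF) without ever waiting */
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;

        /* Nothing available: only now do we enter the wait primitive. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };

        stats->waits++;
        if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
            return -1;
    }
}
```

When the peer has usually sent its next message before the reader gets back
to recv(), the waits counter stays near zero even though the descriptor
remains registered with the wait object the whole time.
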
>
> That led several people to speculate that there may be a difference in
> the wakeup logic, when one end of a descriptor is in a kqueue (mjg
> speculated wake-up-one vs broadcast could be a factor), and that may
> be leading to worse scheduling behaviour.
>
> To be clear, nobody thinks that 96 client threads talking to 96
> processes on a single 8 CPU box is a great way to run a system in real
> life! But it's still surprising that we lose performance when using
> kqueue, and it'd be great to understand why, and hopefully improve it.
>
> The complete discussion on pgsql-hackers is here:
>
> https://www.postgresql.org/message-id/flat/CAEepm%3D37oF84-iXDTQ9MrGjENwVGds%2B5zTr38ca73kWR7ez_tA%40mail.gmail.com
>
> Any ideas would be most welcome.
>
> Thanks for reading!
>
> ====
>
> Reproduction steps (assuming you have git, gmake, flex, bison,
> readline, curl, ccache):
>
> # grab postgres
> git clone https://github.com/postgres/postgres.git
> cd postgres
>
> # grab kqueue patch
> curl -O https://www.postgresql.org/message-id/attachment/65098/0001-Add-kqueue-2-support-for-WaitEventSet-v11.patch
> git checkout -b kqueue
> git am 0001-Add-kqueue-2-support-for-WaitEventSet-v11.patch
>
> # build
> ./configure --prefix=$HOME/install --with-includes=/usr/local/include \
>   --with-libs=/usr/local/lib CC="ccache cc"
> gmake -s -j8
> gmake -s install
> gmake -C contrib/pg_prewarm install
>
> # create a db cluster and set it to use 2GB of shmem so we can hold
> # the whole dataset
> ~/install/bin/initdb -D ~/pgdata
> echo "shared_buffers = '2GB'" >> ~/pgdata/postgresql.conf
>
> # you can either start (and later stop) postgres in the background with pg_ctl:
> ~/install/bin/pg_ctl start -D ~/pgdata
> # ... or just run it in the foreground and hit ^C to stop it:
> # ~/install/bin/postgres -D ~/pgdata
>
> # this should produce about 1.1GB of data under ~/pgdata
> ~/install/bin/pgbench -s 10 -i postgres
>
> # install the prewarm extension, so we can run the test without doing
> # any file IO
> ~/install/bin/psql postgres -c "create extension pg_prewarm"
>
> # after that, after any server restart, prewarm like so:
> ~/install/bin/psql postgres -c "select pg_prewarm(c.oid::regclass)
> from pg_class c where relkind in ('r', 'i')" | cat
>
> # then 60 second pgbench runs are simply:
> ~/install/bin/pgbench -c 96 -j 96 -M prepared -S -T 60 postgres
>
> # to make pgbench use TCP instead of Unix sockets, add -h localhost;
> # to allow connection from another host, update ~/pgdata/postgresql.conf's
> # listen_addresses
>
I have started to look into this a bit. I have not really gotten
anywhere yet, but I have produced a graph comparing the performance of
vanilla postgres vs your patch.
https://imgur.com/a/gKycGxW
They scale identically up to the 20 hardware threads on my test
machine; beyond that, the kqueue build falls off much more quickly.
Hopefully I'll have more useful findings tomorrow.
--
Allan Jude