Regression when trying to replace poll() with kqueue()

Wed Oct 3 05:01:04 UTC 2018

On 2018-10-01 22:24, Thomas Munro wrote:
> Hello FreeBSD hackers,
> 
> (CCing mjg and a list of other FreeBSD hackers he suggested)
> 
> In a fit of enthusiasm for FreeBSD, a couple of years ago I wrote a
> patch to teach PostgreSQL to use kqueue(2).  That was after we
> switched over to epoll(2) on Linux for performance reasons.  Our
> default is to use poll(2) unless we have something better.  The most
> common usage pattern is simply waiting for read/write readiness on the
> socket that is connected to the client + a pipe connected to the
> parent supervisor process ("postmaster"), but we have plans for more
> interesting kinds of multiplexing involving many more descriptors, and
> in general this sits behind our very thin abstraction called
> WaitEventSet (see latch.c in the PostgreSQL source tree) that can be
> used for many things.
> 
> We did some testing using "pgbench" (instructions below) on various
> platforms that have kqueue(2), and we got some conflicting results
> from FreeBSD.  When the system is heavily overloaded (a scenario we
> want to work well, or at least not get worse under kqueue, even if
> it's not the ideal way to run your database server), mjg reported that
> with the kqueue patch performance was way better than unpatched when
> the pgbench test client was running on a different host.  Huzzah!
> 
> Unfortunately, another tester reported the performance was worse when
> running pgbench from the same host (originally he complained about
> NetBSD performance and then we realised FreeBSD was the same under
> those conditions), and I confirmed that was the case for both Unix
> sockets and TCP sockets.  In one 96 (!) thread test, the TPS reported
> by pgbench dropped from 70k to 50k queries per second on an 8 CPU
> system.  As crazy as those test conditions may seem, that is not a
> good result.
> 
> Curiously, when truss'd, in the overloaded scenario that performs
> worse, we very rarely seem to actually reach kevent(2).  It seems like
> there is some kind of scheduling difference producing the change.
> Each PostgreSQL server process looks like this over ~10 seconds:
> 
> syscall                     seconds   calls  errors
> sendto                  0.396840146    3452       0
> recvfrom                0.415802029    3443       6
> kevent                  0.000626393       6       0
> gettimeofday            2.723923249   24053       0
>                       ------------- ------- -------
>                         3.537191817   30954       6
> 
> (That was captured on a virtualised system which had gettimeofday as a
> syscall, but the effect has been reported on bare metal too and there
> no gettimeofday calls show up; I don't believe that is a factor).
> 
> The pgbench client looks like this:
> 
> syscall                     seconds   calls  errors
> ppoll                   0.002773195       1       0
> sendto                 16.597880468    7217       0
> recvfrom               25.646406008    7238       0
>                       ------------- ------- -------
>                        42.247059671   14456       0
> 
> (For whatever reason pgbench uses ppoll() instead, but I assume that's
> irrelevant here; it's also multi-threaded, unlike the server.)  The
> truss -c results for the server are not much different when using
> poll(2) instead of kevent(2), although recvfrom in the pgbench client
> seems to show a few seconds less total time, which is curious.  You
> can see that we're mostly able to do sendto() and recvfrom() without
> seeing EWOULDBLOCK.  So it's not direct access to the kqueue that is
> affecting performance.  It's something else, something caused by the
> mere existence of the kqueue object holding the descriptor.
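
The pattern described above (attempt the non-blocking read first, and only fall back to the readiness wait when it would block) can be sketched as follows. This is a minimal illustration, not PostgreSQL's WaitEventSet code: it uses Python's portable select.poll rather than kqueue, and the function name serve_requests, the one-byte "queries", and the stats counters are all hypothetical. It only shows why the wait primitive is rarely reached when the peer's data is already queued.

```python
import select
import socket


def serve_requests(sock, n_requests):
    """Echo n_requests one-byte messages, counting how often we had to wait.

    Hypothetical stand-in for a server backend: try recv() first and call
    poll() only when the non-blocking read reports that it would block.
    """
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    stats = {"recv": 0, "would_block": 0, "poll": 0, "send": 0}
    handled = 0
    while handled < n_requests:
        stats["recv"] += 1
        try:
            data = sock.recv(1)          # one byte == one "query" in this sketch
        except BlockingIOError:          # EWOULDBLOCK / EAGAIN
            stats["would_block"] += 1
            stats["poll"] += 1
            poller.poll()                # sleep until the socket is readable
            continue
        sock.send(data)                  # reply to the client
        stats["send"] += 1
        handled += 1
    return stats


# Demo: the client has already queued five queries before the server looks,
# which is roughly the situation on an overloaded machine.
server, client = socket.socketpair()
server.setblocking(False)
for _ in range(5):
    client.send(b"q")
stats = serve_requests(server, 5)
print(stats)
```

With the requests queued in advance, every recv() succeeds immediately and the wait call is never made, which mirrors the truss output above where kevent is called only as often as recvfrom fails.
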
> 
> That led several people to speculate that there may be a difference in
> the wakeup logic, when one end of a descriptor is in a kqueue (mjg
> speculated wake-up-one vs broadcast could be a factor), and that may
> be leading to worse scheduling behaviour.
> 
> To be clear, nobody thinks that 96 client threads talking to 96
> processes on a single 8 CPU box is a great way to run a system in real
> life!  But it's still surprising that we lose performance when using
> kqueue, and it'd be great to understand why, and hopefully improve it.
> 
> The complete discussion on pgsql-hackers is here:
> 
> https://www.postgresql.org/message-id/flat/CAEepm%3D37oF84-iXDTQ9MrGjENwVGds%2B5zTr38ca73kWR7ez_tA%40mail.gmail.com
> 
> Any ideas would be most welcome.
> 
> Thanks for reading!
> 
> ====
> 
> Reproduction steps (assuming you have git, gmake, flex, bison,
> readline, curl, ccache):
> 
> # grab postgres
> git clone https://github.com/postgres/postgres.git
> cd postgres
> 
> # grab kqueue patch
> curl -O https://www.postgresql.org/message-id/attachment/65098/0001-Add-kqueue-2-support-for-WaitEventSet-v11.patch
> git checkout -b kqueue
> git am 0001-Add-kqueue-2-support-for-WaitEventSet-v11.patch
> 
> # build
> ./configure --prefix=$HOME/install --with-includes=/usr/local/include \
>     --with-libs=/usr/local/lib CC="ccache cc"
> gmake -s -j8
> gmake -s install
> gmake -C contrib/pg_prewarm install
> 
> # create a db cluster and set it to use 2GB of shmem so we can hold
> # the whole dataset
> ~/install/bin/initdb -D ~/pgdata
> echo "shared_buffers = '2GB'" >> ~/pgdata/postgresql.conf
> 
> # you can either start (and later stop) postgres in the background with pg_ctl:
> ~/install/bin/pg_ctl start -D ~/pgdata
> # ... or just run it in the foreground and hit ^C to stop it:
> # ~/install/bin/postgres -D ~/pgdata
> 
> # this should produce about 1.1GB of data under ~/pgdata
> ~/install/bin/pgbench -s 10 -i postgres
> 
> # install the prewarm extension, so we can run the test without doing
> # any file IO
> ~/install/bin/psql postgres -c "create extension pg_prewarm"
> 
> # after that, after any server restart, prewarm like so:
> ~/install/bin/psql postgres -c "select pg_prewarm(c.oid::regclass)
> from pg_class c where relkind in ('r', 'i')" | cat
> 
> # then 60 second pgbench runs are simply:
> ~/install/bin/pgbench -c 96 -j 96 -M prepared -S -T 60 postgres
> 
> # to make pgbench use TCP instead of Unix sockets, add -h localhost;
> # to allow connection from another host, update ~/pgdata/postgresql.conf's
> # listen_addresses
> _______________________________________________
> freebsd-hackers at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe at freebsd.org"
> 

I have started to look into this a bit. I have not really gotten
anywhere yet, but I have produced a graph comparing the performance of
vanilla postgres vs your patch.

https://imgur.com/a/gKycGxW

They scale identically up to the 20 hardware threads on my test
machine, and then kqueue falls off much more quickly.

Hopefully I'll have more useful findings tomorrow.

-- 
Allan Jude