[Bug 257788] databases/postgres{12|13|14}-{server|client}: severe Kernel TLS issues

From: <bugzilla-noreply_at_freebsd.org>
Date: Thu, 12 Aug 2021 14:46:52 +0000
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257788

            Bug ID: 257788
           Summary: databases/postgres{12|13|14}-{server|client}: severe
                    Kernel TLS issues
           Product: Ports & Packages
           Version: Latest
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Many People
          Priority: ---
         Component: Individual Port(s)
          Assignee: ports-bugs_at_FreeBSD.org
          Reporter: ohartmann_at_walstatt.org

Since May 2021 we have been facing severe DB issues on a couple of systems
running 14-CURRENT, at this time FreeBSD 14.0-CURRENT #11
main-n248668-aecd31a8a3b: Thu Aug 12 15:15:58 CEST 2021 amd64, in dual-stack
(IPv4/IPv6) configurations. The ports
databases/postgresql13-{server|client|contrib} have been recompiled via
"portmaster -df postgresql" several times now on two specific hosts, without
success so far. Before I describe the phenomenon, note that we use customized
kernel configurations, that kernel TLS is enabled in the kernel by default,
and that we also played with the kernel OID

kern.ipc.tls.enable=0|1

but I'll report on that later. For the tests described below,
kern.ipc.tls.enable is set to 0. Otherwise an error occurs, see below.
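
Since the results depend on this OID, it helps to record its value next to
every test run. The following is only a minimal Python sketch, assuming that
sysctl(8) from the base system is on PATH; the helper names parse_sysctl_flag
and ktls_enabled are hypothetical and not part of any FreeBSD or PostgreSQL
tooling.

```python
import subprocess

KTLS_OID = "kern.ipc.tls.enable"  # the OID named in this report


def parse_sysctl_flag(text):
    """Turn the output of `sysctl -n <oid>` ("0" or "1") into a bool."""
    value = text.strip()
    if value not in ("0", "1"):
        raise ValueError(f"unexpected sysctl value: {value!r}")
    return value == "1"


def ktls_enabled():
    """Return True/False for the OID, or None if it cannot be read
    (for example if sysctl is missing or the OID does not exist)."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", KTLS_OID],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    try:
        return parse_sysctl_flag(out)
    except ValueError:
        return None


if __name__ == "__main__":
    print(KTLS_OID, "->", ktls_enabled())
```

Returning None instead of raising keeps the helper usable on hosts where the
OID is absent, so the same script can be run on every machine involved.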

For the record: both systems in question are running on older Intel
Ivy Bridge hardware (Intel(R) Core(TM) i5-3470 CPU and Intel(R) Xeon(R) CPU
E3-1245 V2).

The Xeon host also acts as a poudriere package builder (see below); this seems
important to me to mention here.

The phenomenon is as follows. On the hosts running PostgreSQL 12, 13 or 14 as
server, login via "psql -U postgres -d postgres" is always possible via the
local socket, but "psql -U postgres -d postgres -h localhost" (or with
localhost replaced by 127.0.0.1 or ::1 to exclude any misunderstandings)
fails; after a while the client hits a timeout:

#: psql -U postgres -d postgres -h 192.168.0.223
psql: error: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

Checking via sockstat -4|-6 indicates that PostgreSQL is listening on its
default port 5432 on the machines in question, and IPFW is either set up
properly or disabled (simply set to "OPEN") for test purposes.
Setting the PostgreSQL server's logging to debug does not give anything
useful; with logging set to "info", the only thing one can see in the log is:

root_at_:~ # 2021-08-12 13:51:05.137 GMT [2132] LOG:  connection received:
host=host1.local.net port=41162

Then - silence! As if the server went deaf.
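
One way to narrow down where the conversation stalls is to speak only the
first step of the PostgreSQL wire protocol by hand: a client that wants TLS
sends an 8-byte SSLRequest over plain TCP, and the server answers with a
single byte, 'S' or 'N', before any TLS handshake starts. The following is a
minimal Python sketch of such a probe, not something that was run for this
report; the function names and the result labels are made up for illustration,
and the host list simply mirrors the addresses mentioned above.

```python
import socket
import struct

SSL_REQUEST_CODE = 80877103  # (1234 << 16) | 5679, per the PostgreSQL protocol


def build_ssl_request():
    """8-byte SSLRequest: int32 length (8) followed by the int32 request code."""
    return struct.pack("!II", 8, SSL_REQUEST_CODE)


def probe(host, port=5432, timeout=5.0):
    """Classify how far a plain TCP conversation with the server gets.

    Returns one of: 'connect-failed', 'timeout', 'closed',
    'ssl-accepted' (server sent 'S'), 'ssl-refused' (server sent 'N'),
    or 'unexpected:<hex byte>'.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "connect-failed"
    with sock:
        try:
            sock.sendall(build_ssl_request())
            reply = sock.recv(1)
        except socket.timeout:
            return "timeout"
        except OSError:
            return "closed"
    if reply == b"":
        return "closed"
    if reply == b"S":
        return "ssl-accepted"
    if reply == b"N":
        return "ssl-refused"
    return "unexpected:" + reply.hex()


if __name__ == "__main__":
    for host in ("localhost", "127.0.0.1", "::1"):
        print(host, "->", probe(host))
```

A result of 'ssl-accepted' followed by a failing psql would point at the TLS
layer, whereas 'timeout' or 'closed' would suggest the problem sits before
any TLS traffic is exchanged.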

To make sure that neither a corrupted DB nor a hidden misconfiguration in
pg_hba.conf and/or postgresql.conf causes the problems, we installed versions
12, 13 and even 14 of the software on both systems (compiled via classical
make). The problem is the same with all versions on those hosts.

To exclude any issues regarding self-compiling postgresql, we also fetched the
pkg tarball of postgresql13-server from an official FreeBSD mirror and
installed that one. The problem remains, which leaves us with either a broken
world or a broken kernel so far. Recompiling world and kernel with vanilla
settings did not change anything. Using GENERIC as the kernel does not
mitigate or resolve the problem either.

As initially mentioned, the Xeon box also acts as a poudriere package host:
with the very same make.conf as the host itself (that is, the non-working DB
host), it also builds packages for 13-STABLE.

From a client running a recent 13-STABLE and equipped with the packages built
on the host in question above, IT IS POSSIBLE to connect to the PostgreSQL 13
server, as long as

kern.ipc.tls.enable=0

is set. If one sets kern.ipc.tls.enable=1 instead, the client (running
psql 13.3) receives:

psql: error: SSL SYSCALL error: EOF detected

So the PostgreSQL 13.3 server on the failing host is itself serving as
expected; it seems to be the client that has severe problems.

The problems occurred on all affected systems at almost the same time, around
May 26th this year, when we did our weekly updates of the 14-CURRENT base
system and the portmaster jobs for ports. That might be a hint, since I do not
remember when LLVM 12 was introduced or KTLS was activated.

Also, to exclude any issue with iflib and the i350 NICs on the servers, we
disabled all hardware checksum offloading for vlan and RX/TX, so that in the
end a "naked" interface without any hardware support is used. But that didn't
resolve anything either.

Another test went really sideways. We moved the complete configuration (base
system, kernel, sysctl.conf, postgresql13 configs and databases) to another,
more modern platform (a Xeon-based system that is not remotely accessible, so
I can't report its hardware specs). On this box, based on 14-CURRENT with
postgresql13 in a jail, the server acts as expected, and local connections
as well as remote connections are possible. This is really weird and leaves me
with the preliminary conclusion that something is really wrong.

I'm out of ideas here and floating like a dead man in the water ...

-- 
You are receiving this mail because:
You are the assignee for the bug.
Received on Thu Aug 12 2021 - 14:46:52 UTC
