Network anomalies after update from 11.2 STABLE to 12.1 STABLE

Paul devgs at ukr.net
Fri Oct 18 12:57:22 UTC 2019


Our current version is:

    FreeBSD 11.2-STABLE #0 r340725

New version that we have problems with:

    FreeBSD 12.1-STABLE #5 r352893


After updating to the new version, we started to observe a very large number of
errors in HTTP requests between various services in our system. The problem
appeared on all servers that were upgraded and does not seem to be specific to a
particular network card: we use different models, and all are affected.

During various tests, we observed many spontaneous TCP connection aborts,
including at the establishment stage (SYN), in cases that were 100% issue-free
on 11.2-STABLE. Concrete test cases are shown below.

We also want to highlight that, on numerous occasions, we have observed random,
huge ACK numbers in the first response to a SYN packet, instead of 1, as expected.
This forces the client to abort the connection via RST.
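
To make the expectation concrete: packet analyzers usually display relative sequence numbers, so the reply to a SYN should carry ACK 1, which is the client's initial sequence number plus one, modulo 2^32. The following is only a minimal sketch of that check, not code we ran on the affected machines; the function names and the sample numbers are made up for illustration.

```python
MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def relative_ack(client_isn: int, ack: int) -> int:
    """Return the ACK number relative to the client's initial sequence number."""
    return (ack - client_isn) % MOD


def synack_is_valid(client_isn: int, ack: int) -> bool:
    """A SYN|ACK must acknowledge exactly ISN + 1, i.e. relative ACK 1."""
    return relative_ack(client_isn, ack) == 1


if __name__ == "__main__":
    isn = 4294967295  # hypothetical client ISN at the top of the 32-bit range
    print(relative_ack(isn, 0))          # wraps around to relative ACK 1
    print(synack_is_valid(isn, 0))       # True
    print(synack_is_valid(isn, 123456))  # False: the client would answer with RST
```

Any relative value other than 1 means the SYN|ACK does not acknowledge the SYN the client actually sent.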

At first glance it looks like a race in the kernel, because the problem disappears when:
   * we set `dev.ixl.0.iflib.override_nrxqs=1` and `dev.ixl.0.iflib.override_ntxqs=1`
   * we set `dev.ixl.0.iflib.override_nrxqs=0` and `dev.ixl.0.iflib.override_ntxqs=0`, but don't issue concurrent TCP streams

These are some debug log messages, emitted by 12.1-STABLE:

Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:16304 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:16326 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:16402 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:16652 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:16686 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:18562 to [10.10.10.92]:80 tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, no action
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:18918 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19331 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19340 to [10.10.10.92]:80 tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, no action
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19340 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19340 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19489 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19580 to [10.10.10.92]:80 tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, no action
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19580 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19580 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST without matching syncache entry (possibly syncookie only), segment ignored
Oct 18 14:59:02 test kernel: TCP: [10.10.10.39]:17705 to [10.10.10.92]:80; syncache_timer: Response timeout, retransmitting (1) SYN|ACK
Oct 18 14:59:02 test kernel: TCP: [10.10.10.39]:18066 to [10.10.10.92]:80; syncache_timer: Response timeout, retransmitting (1) SYN|ACK
Oct 18 14:59:02 test kernel: TCP: [10.10.10.39]:18066 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Our SYN|ACK was rejected, connection attempt aborted by remote endpoint
Oct 18 14:59:02 test kernel: TCP: [10.10.10.39]:17705 to [10.10.10.92]:80 tcpflags 0x4<RST>; syncache_chkrst: Our SYN|ACK was rejected, connection attempt aborted by remote endpoint

Here, 10.10.10.92 runs 12.1-STABLE, while 10.10.10.39 is a client that runs 11.2-STABLE.
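
To get an overview of a log like this, the messages can be tallied by the kernel function that emitted them. This is a minimal sketch, assuming the line format shown above; the regular expression and the function name `tally` are ours for illustration and are not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches e.g. "...:80 tcpflags 0x4<RST>; syncache_chkrst: Spurious RST ..."
# and       "...:80; syncache_timer: Response timeout, ..."
LINE_RE = re.compile(r"kernel: TCP: .*?; (\w+): (.*)$")


def tally(lines):
    """Count log lines per (emitting function, message) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line.strip())
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the log above.
    sample = [
        "Oct 18 14:59:01 test kernel: TCP: [10.10.10.39]:19340 to [10.10.10.92]:80 tcpflags 0x4<RST>; tcp_do_segment: Timestamp missing, no action",
        "Oct 18 14:59:02 test kernel: TCP: [10.10.10.39]:18066 to [10.10.10.92]:80; syncache_timer: Response timeout, retransmitting (1) SYN|ACK",
    ]
    for (func, msg), n in tally(sample).most_common():
        print(n, func, msg)
```

Lines that do not match the assumed format are skipped rather than counted.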


In our test case we use nginx and wrk with a minimal config, where nginx always returns
a 404 error page. nginx runs on the 12.1-STABLE machine, while wrk runs on the 11.2-STABLE one.

We run wrk like so:

    wrk -c 10 --header "Connection: close" -d 10 -t 1 --latency http://10.10.10.92:80/missing

and often see errors like these:

    Socket errors: connect 12, read 4, write 4, timeout 0

If we reverse the test by switching the two servers' roles, i.e. 12.1-STABLE becomes the client and
issues requests via wrk, we see no problems at all. The same is true between two
11.2-STABLE machines.


It seems that the issue appears only when the same local port is used for multiple connections
on 12.1-STABLE. Currently this is possible only when 12.1-STABLE is the server and accepts
connections on a single port, say 80, as in our case. To confirm this, we ran another test. We
configured nginx to listen on 10 different ports, 80 through 89, and then launched 10
separate wrk processes, each using only one concurrent connection, so that we have
only 10 TCP streams, each with its own unique port on the 12.1-STABLE side:

    for I in {0..9}; do wrk -c 1 --header "Connection: close" -d 10 -t 1 --latency http://10.10.10.92:8${I}/missing & done

Socket errors stopped appearing. We ran this test many times, and the errors never appeared.
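
The per-port test above can also be scripted. The sketch below builds the ten wrk command lines and offers a helper that starts them in parallel and waits for all of them. It assumes wrk is in PATH and that nginx listens on ports 80 through 89 as described; the helper names `wrk_commands` and `run_all` are hypothetical.

```python
import subprocess


def wrk_commands(host="10.10.10.92", ports=range(80, 90)):
    """Build one single-connection wrk command line per server port."""
    return [
        ["wrk", "-c", "1", "--header", "Connection: close",
         "-d", "10", "-t", "1", "--latency",
         f"http://{host}:{port}/missing"]
        for port in ports
    ]


def run_all(commands):
    """Start every command at once, then wait for all of them to finish."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]  # exit status of each wrk process
```

Calling `run_all(wrk_commands())` would start the same ten streams as the shell loop, one per server port.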

However, whenever we repeat the previous test, using a single port:

    wrk -c 10 --header "Connection: close" -d 10 -t 1 --latency http://10.10.10.92:80/missing

errors start appearing again and again:

    Socket errors: connect 8, read 14, write 9, timeout 0


We've tested different drivers with the same outcome:

em driver:
em0 at pci0:10:0:0:        class=0x020000 card=0x000015d9 chip=0x10d38086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82574L Gigabit Network Connection'

ixl driver:
ixl0 at pci0:4:0:0:        class=0x020000 card=0x00078086 chip=0x15728086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller X710 for 10GbE SFP+'

We even tried the driver from ports (/usr/ports/net/intel-ixl-kmod): ixl-1.11.9


Help with this matter would be really appreciated.

Best regards,
-Paul



More information about the freebsd-stable mailing list