Re: vnet jails loose network connectivity

From: Johan Hendriks <joh.hendriks_at_gmail.com>
Date: Mon, 07 Mar 2022 20:40:21 UTC
On 04/03/2022 15:36, Johan Hendriks wrote:
> Hello all, I use jails for some testing, but I cannot seem to make the
> setup stable.
> I use vnet jails with a bridge, but when I put some load on it, some
> jails lose their network connectivity.
>
> My setup is as follows: haproxy has the internal IP 10.233.185.20 and
> uses binat to make it publicly accessible.
> Then there is a varnish jail and two web servers, all in the
> 10.233.185.x range.
>
> If I give it a little load with hey (hey -h2 -n 10 -c 20 -z 60s
> https://wp.test.nl), then during the test the haproxy jail stops being
> reachable: it cannot be pinged from the host machine or from the other
> jails. Restarting the jails solves it. When I left the system alone for
> some time, I also saw the varnish jail become unresponsive.
>
> If I run tcpdump on the epair${name}a interface, I do see the packets
> from the host machine to the jail, but the jail itself is not reachable.
>
> There is nothing in the logs of the host or of the jail itself. I can
> ping the jail's IP address from within the jail itself.
>
>
> I do not think I have a special setup, but I could be doing something
> wrong.
> My jail.conf:
>
> # Global settings applied to all jails.
> $domain = "test.nl";
> $subdomain = "";
>
> exec.start = "/bin/sh /etc/rc";
> exec.stop = "/bin/sh /etc/rc.shutdown";
> exec.clean;
>
> mount.fstab = "/storage/jails/$name.fstab";
>
> exec.system_user  = "root";
> exec.jail_user    = "root";
> mount.devfs;
> sysvshm="new";
> sysvsem="new";
> allow.raw_sockets;
> allow.set_hostname = 0;
> allow.sysvipc;
> enforce_statfs = "2";
> devfs_ruleset     = "11";
>
> path = "/storage/jails/${name}";
> host.hostname = "${name}${subdomain}.${domain}";
>
> # Networking
> $uplinkdev        = "vtnet1";
> $epid             = "${ip}";
> $subnet           = "10.233.185.";
> $cidr             = "/24";
> $ipv4_addr        = "${subnet}${ip}${cidr}";
> vnet;
> vnet.interface    = "vnet0";
>
> $epair=epair${ip};
> vnet;
> #vnet.interface    = "${epair}b";  # default vnet interface
> exec.prestart     = "ifconfig bridge0 > /dev/null 2>&1 || ( ifconfig bridge0 create up && ifconfig bridge0 addm $uplinkdev )";
> exec.prestart    += "ifconfig ${epair} create up description jail_${name} || echo 'Skipped creating epair (exists?)'";
> exec.prestart    += "ifconfig bridge0 addm ${epair}a || echo 'Skipped adding bridge member (already member?)'";
> exec.created      = "ifconfig ${epair}b name vnet0";
> exec.start        = "/bin/sh /etc/rc";
> exec.consolelog   = "/var/log/jail/$name.test.nl";
> exec.stop         = "/bin/sh /etc/rc.shutdown";
> exec.poststop     = "ifconfig bridge0 deletem ${epair}a";
> exec.poststop    += "ifconfig ${epair}a destroy";
>
> varnish01 {
>     $ip = 16;
>     mount.fstab = "";
>     path = "/storage/jails/${name}";
> }
>
> web01 {
>     $ip = 18;
> }
>
> web02 {
>     $ip = 19;
> }
>
> haproxy {
>     $ip = 20;
>     mount.fstab = "";
>     path = "/storage/jails/${name}";
> }
>
> My ifconfig:
>
> bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>     ether 58:9c:fc:10:ff:82
>     inet 10.233.185.1 netmask 0xffffff00 broadcast 10.233.185.255
>     id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
>     maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
>     root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
>     member: epair20a flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>             ifmaxaddr 0 port 13 priority 128 path cost 2000
>     member: epair19a flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>             ifmaxaddr 0 port 53 priority 128 path cost 2000
>     member: epair18a flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>             ifmaxaddr 0 port 48 priority 128 path cost 2000
>     member: epair16a flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>             ifmaxaddr 0 port 28 priority 128 path cost 2000
>     groups: bridge
>     nd6 options=9<PERFORMNUD,IFDISABLED>
> epair16a: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>     description: jail_varnish01
>     options=8<VLAN_MTU>
>     ether 02:76:32:8e:0e:0a
>     groups: epair
>     media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
>     status: active
>     nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> epair18a: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>     description: jail_web01
>     options=8<VLAN_MTU>
>     ether 02:6d:be:b8:36:0a
>     groups: epair
>     media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
>     status: active
>     nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> epair19a: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>     description: jail_web02
>     options=8<VLAN_MTU>
>     ether 02:54:fd:77:9a:0a
>     groups: epair
>     media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
>     status: active
>     nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> epair20a: flags=8963<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
>     description: jail_haproxy
>     options=8<VLAN_MTU>
>     ether 02:f8:58:06:78:0a
>     groups: epair
>     media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
>     status: active
>     nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
>
> This is on both 13-STABLE and 14-HEAD.
>
>
For the sake of testing I tried it with FreeBSD 13.0-RELEASE-p7, and this
works fine. This is an exact copy of the setup I use on 14-CURRENT and
13-STABLE (I did a ZFS send and receive of the jails and copied
jail.conf, pf.conf and so on). I ran the hey command against the
13.0-RELEASE system multiple times:

hey -h2 -n 10 -c 30 -z 300s https://wp.test.nl

Summary:
   Total:    300.0045 secs
   Slowest:    0.1137 secs
   Fastest:    0.0006 secs
   Average:    0.0090 secs
   Requests/sec:    4627.4504


Response time histogram:
   0.001 [1]    |
   0.012 [977291]    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
   0.023 [21236]    |■
   0.035 [1125]    |
   0.046 [230]    |
   0.057 [12]    |
   0.068 [18]    |
   0.080 [9]    |
   0.091 [18]    |
   0.102 [30]    |
   0.114 [30]    |


Latency distribution:
   10% in 0.0037 secs
   25% in 0.0046 secs
   50% in 0.0061 secs
   75% in 0.0080 secs
   90% in 0.0096 secs
   95% in 0.0106 secs
   99% in 0.0133 secs

Details (average, fastest, slowest):
   DNS+dialup:    0.0000 secs, 0.0006 secs, 0.1137 secs
   DNS-lookup:    0.0000 secs, 0.0000 secs, 0.0028 secs
   req write:    0.0001 secs, 0.0000 secs, 0.1126 secs
   resp wait:    0.0192 secs, 0.0000 secs, 214.9645 secs
   resp read:    0.0018 secs, 0.0002 secs, 0.1076 secs

Status code distribution:
   [200]    1000000 responses
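
As a quick consistency check on the output above, the histogram bucket
counts should add up to the number of responses in the status code
distribution. A minimal sketch, assuming the histogram lines are fed in
on stdin; sum_hey_histogram is a hypothetical helper name, not part of
hey:

```shell
#!/bin/sh
# sum_hey_histogram: hypothetical helper (not part of hey).
# Reads "Response time histogram" lines from stdin, for example
#   0.012 [977291]    |
# and prints the total of the bracketed bucket counts.
sum_hey_histogram() {
    # Turn the brackets into spaces so the count becomes field 2,
    # then add up every line whose second field is a plain integer.
    sed 's/[][]/ /g' |
        awk 'NF >= 2 && $2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}

# Bucket counts copied from the run above (bars omitted).
total=$(sum_hey_histogram <<'EOF'
   0.001 [1]    |
   0.012 [977291]    |
   0.023 [21236]    |
   0.035 [1125]    |
   0.046 [230]    |
   0.057 [12]    |
   0.068 [18]    |
   0.080 [9]    |
   0.091 [18]    |
   0.102 [30]    |
   0.114 [30]    |
EOF
)
echo "$total"   # prints 1000000, matching the [200] response count
```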


All is fine on 13.0-RELEASE-p7, also with a higher concurrency.
However, if I run it against 14-CURRENT or 13-STABLE, even a 60-second
run kills the network connectivity of the jail (haproxy in my case).
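
To pin down the moment a jail drops off the bridge during a run, a small
watchdog on the host can help. This is only a minimal sketch, assuming
the 10.233.185. subnet and the $ip values from the jail.conf quoted
above. jail_addr, probe and check_jails are hypothetical helper names,
and the ping flags are the FreeBSD ones (-t is a timeout in seconds
there, while on Linux it sets the TTL):

```shell
#!/bin/sh
# Hypothetical watchdog helpers for the host; names are illustrative only.
SUBNET="10.233.185."    # assumption: same prefix as $subnet in jail.conf

# jail_addr ID: build the jail address the same way jail.conf does.
jail_addr() {
    printf '%s%s\n' "$SUBNET" "$1"
}

# probe ADDR: succeed if ADDR answers a single ping within one second.
# FreeBSD ping: -c count, -t overall timeout in seconds.
probe() {
    ping -c 1 -t 1 "$1" >/dev/null 2>&1
}

# check_jails ID...: print one "address up" or "address down" line per jail.
check_jails() {
    for id in "$@"; do
        addr=$(jail_addr "$id")
        if probe "$addr"; then
            echo "$addr up"
        else
            echo "$addr down"
        fi
    done
}

# Example loop to run on the host while hey is generating load:
#   while :; do date -u +%T; check_jails 16 18 19 20; sleep 1; done
```

The first timestamp followed by a "down" line can then be matched
against tcpdump output on the corresponding epair interface.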

regards,
Johan