sshd / tcp packet corruption ?
Martin Minkus
martin.minkus at punz.co.nz
Wed Jun 23 04:51:28 UTC 2010
So, definitely some kind of packet corruption.
Using netcat to send a single megabyte of binary data to a box with no
known issues (from kinetic -> steel):
kinetic:/tmp$ dd if=/dev/urandom of=random.testfile bs=1k count=1k
1024+0 records in
1024+0 records out
1048576 bytes transferred in 0.018347 secs (57152372 bytes/sec)
kinetic:/tmp$ md5 random.testfile
MD5 (random.testfile) = 9be700336ef81e8f89c60422fc795877
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$
Meanwhile on steel (a stable Linux box that kinetic is MEANT to be replacing):
ff8a336e2be0c5c645e9f8a2dea67eea random.testfile
fae5da747c7857d1d87870c05db1f152 random.testfile
a36c7166631ca10c460e323e39071094 random.testfile
50a8f005a772f9321243215d1ea1adb6 random.testfile
5da41b6f475f4655572df8c9bd81e181 random.testfile
3104dd30179bf870e8ec6ef91c34d78f random.testfile
274a16890cf39c3089d8f0eda253f5fd random.testfile
e8d0bae998340252c6c67529d520feb4 random.testfile
6d5377ca4545f98a55c017f518567092 random.testfile
6b464f810fe1c2902694a7817f881906 random.testfile
8912007161ececdb3e23a0018af36c36 random.testfile
3f4e17d5a939cd8dfd0941c898c5ac5f random.testfile
9db926ba5f5f39dddcc0607983ed96f0 random.testfile
835de68b981bf6cb871ebb2ce81404e1 random.testfile
a211a3260d9c8ae595782d254798cacf random.testfile
030e08f1d3d0fb761046f66c888fdea2 random.testfile
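The sixteen mismatching sums above can be checked mechanically. A minimal sketch of the same generate-and-compare loop (a local pipe stands in for the nc hop here, so the sums must match; on the corrupting link they would not — on FreeBSD substitute `md5 -q` for `md5sum`, and a hypothetical receiver on steel would run `nc -l 1234 | md5sum`):

```shell
# Generate a 1 MiB random test file, as in the transcript above.
dd if=/dev/urandom of=/tmp/random.testfile bs=1k count=1k 2>/dev/null

# Checksum on the sending side.
ref=$(md5sum < /tmp/random.testfile | awk '{print $1}')

# Checksum of what "arrives" -- here just through a local pipe,
# standing in for the nc transfer.
got=$(cat /tmp/random.testfile | md5sum | awk '{print $1}')

if [ "$ref" = "$got" ]; then
    echo "MATCH: $ref"
else
    echo "MISMATCH: sent $ref, got $got"
fi
```

Wrapping the real nc invocation in such a loop would make the intermittent corruption easy to log with timestamps.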
If I reboot kinetic and try one last time:
9be700336ef81e8f89c60422fc795877 random.testfile
Notice that steel now receives the CORRECT checksum.
Kinetic's samba, sshd, etc. will play nice for a day or so before
returning to corrupting packets.
So, any ideas? Why would my packets start getting corrupted after a
couple of days' use?
This box just runs isc-dhcpd, openldap-server, samba34, and ZFS (the
real reason it's replacing the Linux box).
Thanks,
Martin.
From: Martin Minkus
Sent: Wednesday, 23 June 2010 16:01
To: freebsd-questions at freebsd.org
Subject: sshd / tcp packet corruption ?
It seems this issue I reported below may actually be related to some
kind of TCP packet corruption?
Still the same box. I've noticed my SSH connections into the box will
die randomly, with errors.
sshd logs the following on the box itself:
Jun 18 11:15:32 kinetic sshd[1406]: Received disconnect from
10.64.10.251: 2: Invalid packet header. This probably indicates a
problem with key exchange or encryption.
Jun 18 11:15:41 kinetic sshd[15746]: Accepted publickey for martinm from
10.64.10.251 port 56469 ssh2
Jun 18 11:15:58 kinetic su: nss_ldap: could not get LDAP result - Can't
contact LDAP server
Jun 18 11:15:58 kinetic su: martinm to root on /dev/pts/0
Jun 18 11:16:06 kinetic su: martinm to root on /dev/pts/1
Jun 18 11:16:29 kinetic sshd[15748]: Received disconnect from
10.64.10.251: 2: Invalid packet header. This probably indicates a
problem with key exchange or encryption.
Jun 18 11:16:30 kinetic sshd[15746]: syslogin_perform_logout: logout()
returned an error
Jun 18 11:16:34 kinetic sshd[16511]: Accepted publickey for martinm from
10.64.10.251 port 56470 ssh2
Jun 18 11:16:41 kinetic sshd[16513]: Received disconnect from
10.64.10.251: 2: Invalid packet header. This probably indicates a
problem with key exchange or encryption.
Jun 18 11:16:41 kinetic sshd[16511]: syslogin_perform_logout: logout()
returned an error
Jun 23 15:52:59 kinetic sshd[56974]: Received disconnect from
10.64.10.209: 5: Message Authentication Code did not verify (packet
#75658). Data integrity has been compromised.
Jun 23 15:53:12 kinetic sshd[57109]: Accepted publickey for martinm from
10.64.10.209 port 9494 ssh2
Jun 23 15:53:38 kinetic su: martinm to root on /dev/pts/3
Jun 23 15:56:36 kinetic sshd[57111]: Received disconnect from
10.64.10.209: 2: Invalid packet header. This probably indicates a
problem with key exchange or encryption.
Jun 23 15:56:44 kinetic sshd[57151]: Accepted publickey for martinm from
10.64.10.209 port 9534 ssh2
My google-fu has failed me on this.
Any ideas what on earth this could be?
The Ethernet card?
em0: <Intel(R) PRO/1000 Legacy Network Connection 1.0.1> port 0xcc00-0xcc3f mem 0xfdfe0000-0xfdffffff,0xfdfc0000-0xfdfdffff irq 17 at device 7.0 on pci1
em0: [FILTER]
em0: Ethernet address: 00:0e:0c:6b:d6:d3
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
	ether 00:0e:0c:6b:d6:d3
	inet 10.64.10.10 netmask 0xffffff00 broadcast 10.64.10.255
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
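The options line above shows RXCSUM/TXCSUM hardware checksum offload enabled, and em(4) offload paths have been implicated in corruption like this before. One hypothetical diagnostic (a configuration sketch, not something from the original thread) is to turn offload off and re-run the nc/md5 test:

```shell
# Disable hardware checksum offload on em0 and re-test. If the md5
# sums then match consistently, the offload path (the NIC itself or
# the em(4) driver's use of it) is the likely suspect.
ifconfig em0 -rxcsum -txcsum

# To keep the setting across reboots, rc.conf could carry it, e.g.:
#   ifconfig_em0="inet 10.64.10.10 netmask 255.255.255.0 -rxcsum -txcsum"
```

With offload disabled the kernel computes TCP/IP checksums in software, so a flaky offload engine can no longer corrupt frames after the checksum is stamped.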
Thanks,
Martin.
From: Martin Minkus
Sent: Monday, 14 June 2010 11:21
To: freebsd-questions at freebsd.org
Subject: FreeBSD+ZFS+Samba: open_socket_in: Protocol not supported -
after a few days?
Samba 3.4 on FreeBSD 8-STABLE branch.
After a few days I start getting weird errors: Windows PCs can't
access the Samba share, have trouble accessing files, etc., and Samba
becomes totally unusable.
Restarting Samba doesn't fix it – only a reboot does.
Accessing files on the ZFS pool locally is fine. Other services on the
box (like dhcpd and the OpenLDAP server) continue to work fine. Only
Samba dies – and by "dies" I mean it can no longer service clients and
Windows brings up bizarre errors. Windows can access our other Samba
servers (on Linux, etc.) just fine.
Kernel:
FreeBSD kinetic.pulse.local 8.1-PRERELEASE FreeBSD 8.1-PRERELEASE #4: Wed May 26 18:09:14 NZST 2010     martinm at kinetic.pulse.local:/usr/obj/usr/src/sys/PULSE  amd64
Zpool status:
kinetic:~$ zpool status
pool: pulse
state: ONLINE
scrub: none requested
config:
	NAME                                          STATE     READ WRITE CKSUM
	pulse                                         ONLINE       0     0     0
	  raidz1                                      ONLINE       0     0     0
	    gptid/3baa4ef3-3ef8-0ac0-f110-f61ea23352  ONLINE       0     0     0
	    gptid/0eaa8131-828e-6449-b9ba-89ac63729d  ONLINE       0     0     0
	    gptid/77a8da7c-8e3c-184c-9893-e0b12b2c60  ONLINE       0     0     0
	    gptid/dddb2b48-a498-c1cd-82f2-a2d2feea01  ONLINE       0     0     0
errors: No known data errors
kinetic:~$
log.smb:
[2010/06/10 17:22:39, 0] lib/util_sock.c:902(open_socket_in)
open_socket_in(): socket() call failed: Protocol not supported
[2010/06/10 17:22:39, 0] smbd/server.c:457(smbd_open_one_socket)
smbd_open_once_socket: open_socket_in: Protocol not supported
[2010/06/10 17:22:39, 2] smbd/server.c:676(smbd_parent_loop)
waiting for connections
log.ANYPC:
[2010/06/08 19:55:55, 0] lib/util_sock.c:1491(get_peer_addr_internal)
getpeername failed. Error was Socket is not connected
read_fd_with_timeout: client 0.0.0.0 read error = Socket is not
connected.
The code in lib/util_sock.c, around line 902:
/****************************************************************************
 Open a socket of the specified type, port, and address for incoming data.
****************************************************************************/

int open_socket_in(int type,
		   uint16_t port,
		   int dlevel,
		   const struct sockaddr_storage *psock,
		   bool rebind)
{
	struct sockaddr_storage sock;
	int res;
	socklen_t slen = sizeof(struct sockaddr_in);

	sock = *psock;

#if defined(HAVE_IPV6)
	if (sock.ss_family == AF_INET6) {
		((struct sockaddr_in6 *)&sock)->sin6_port = htons(port);
		slen = sizeof(struct sockaddr_in6);
	}
#endif

	if (sock.ss_family == AF_INET) {
		((struct sockaddr_in *)&sock)->sin_port = htons(port);
	}

	res = socket(sock.ss_family, type, 0);
	if (res == -1) {
		if (DEBUGLVL(0)) {
			dbgtext("open_socket_in(): socket() call failed: ");
			dbgtext("%s\n", strerror(errno));
		}
	...
In other words, it looks like something in the kernel is being
exhausted (but what?). I don't know if tuning is required, or whether
this is some kind of bug.
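If a kernel resource really is being exhausted, these FreeBSD commands (a diagnostic sketch, not something run in the original report) would show the usual suspects for socket() failing – mbuf clusters and the open-socket ceiling:

```shell
# mbuf and mbuf-cluster usage; non-zero "denied" counters mean
# allocation requests are already failing.
netstat -m

# Open-socket count versus the kern.ipc.maxsockets ceiling.
sysctl kern.ipc.numopensockets kern.ipc.maxsockets

# Per-zone kernel allocator stats; a non-zero FAIL column marks
# zones that have run dry.
vmstat -z | grep -i -e socket -e mbuf
```

Comparing these numbers right after boot against the same numbers once Samba starts failing would show whether something is leaking toward a limit over those few days.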
/boot/loader.conf:
mvs_load="YES"
zfs_load="YES"
vm.kmem_size="20G"
#vfs.zfs.arc_min="512M"
#vfs.zfs.arc_max="1536M"
vfs.zfs.arc_min="512M"
vfs.zfs.arc_max="3072M"
I've played with a few sysctl settings (found these recommendations
online, but they make no difference):
/etc/sysctl.conf:
kern.ipc.maxsockbuf=2097152
net.inet.tcp.sendspace=262144
net.inet.tcp.recvspace=262144
net.inet.tcp.mssdflt=1452
net.inet.udp.recvspace=65535
net.inet.udp.maxdgram=65535
net.local.stream.recvspace=65535
net.local.stream.sendspace=65535
Any ideas on what could possibly be going wrong?
Any help would be greatly appreciated!
Thanks,
Martin