sshd / tcp packet corruption ?
Martin Minkus
martin.minkus at punz.co.nz
Wed Jun 23 04:51:28 UTC 2010
So, definitely some kind of packet corruption.
Using netcat to send a single megabyte of binary data to a box with no
known issues (from kinetic -> steel):
kinetic:/tmp$ dd if=/dev/urandom of=random.testfile bs=1k count=1k
1024+0 records in
1024+0 records out
1048576 bytes transferred in 0.018347 secs (57152372 bytes/sec)
kinetic:/tmp$ md5 random.testfile
MD5 (random.testfile) = 9be700336ef81e8f89c60422fc795877
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$ nc steel 1234 -v -O 4096 < random.testfile
Connection to steel 1234 port [tcp/*] succeeded!
kinetic:/tmp$
Meanwhile on steel (a stable Linux box that kinetic is MEANT to be replacing):
ff8a336e2be0c5c645e9f8a2dea67eea random.testfile
fae5da747c7857d1d87870c05db1f152 random.testfile
a36c7166631ca10c460e323e39071094 random.testfile
50a8f005a772f9321243215d1ea1adb6 random.testfile
5da41b6f475f4655572df8c9bd81e181 random.testfile
3104dd30179bf870e8ec6ef91c34d78f random.testfile
274a16890cf39c3089d8f0eda253f5fd random.testfile
e8d0bae998340252c6c67529d520feb4 random.testfile
6d5377ca4545f98a55c017f518567092 random.testfile
6b464f810fe1c2902694a7817f881906 random.testfile
8912007161ececdb3e23a0018af36c36 random.testfile
3f4e17d5a939cd8dfd0941c898c5ac5f random.testfile
9db926ba5f5f39dddcc0607983ed96f0 random.testfile
835de68b981bf6cb871ebb2ce81404e1 random.testfile
a211a3260d9c8ae595782d254798cacf random.testfile
030e08f1d3d0fb761046f66c888fdea2 random.testfile
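The sixteen mismatching sums above can be checked mechanically. A minimal sketch of the same generate-and-compare loop (a local pipe stands in for the nc hop here, so the sums must match; on the corrupting link they would not — on FreeBSD substitute `md5 -q` for `md5sum`, and a hypothetical receiver on steel would run `nc -l 1234 | md5sum`):

```shell
# Generate a 1 MiB random test file, as in the transcript above.
dd if=/dev/urandom of=/tmp/random.testfile bs=1k count=1k 2>/dev/null

# Checksum on the sending side.
ref=$(md5sum < /tmp/random.testfile | awk '{print $1}')

# Checksum of what "arrives" -- here just through a local pipe,
# standing in for the nc transfer.
got=$(cat /tmp/random.testfile | md5sum | awk '{print $1}')

if [ "$ref" = "$got" ]; then
    echo "MATCH: $ref"
else
    echo "MISMATCH: sent $ref, got $got"
fi
```

Wrapping the real nc invocation in such a loop would make the intermittent corruption easy to log with timestamps.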
If I reboot kinetic and try one last time:
9be700336ef81e8f89c60422fc795877 random.testfile
Notice that steel now receives the CORRECT checksum.
Kinetic's samba, sshd, etc. will play nice for a day or so before
returning to corrupting packets.
So, any ideas? Why would my packets start getting corrupted after a
couple of days' use?
This box just runs isc-dhcpd, openldap-server, samba34, and ZFS (the
real reason it's replacing the Linux box).
Thanks,
Martin.
From: Martin Minkus
Sent: Wednesday, 23 June 2010 16:01
To: freebsd-questions at freebsd.org
Subject: sshd / tcp packet corruption ?
It seems this issue I reported below may actually be related to some
kind of TCP packet corruption?
Still the same box. I've noticed my SSH connections into the box will
die randomly, with errors.
sshd logs the following on the box itself:
Jun 18 11:15:32 kinetic sshd[1406]: Received disconnect from
10.64.10.251: 2: Invalid packet header. This probably indicates a
problem with key exchange or encryption.
Jun 18 11:15:41 kinetic sshd[15746]: Accepted publickey for martinm from
10.64.10.251 port 56469 ssh2
Jun 18 11:15:58 kinetic su: nss_ldap: could not get LDAP result - Can't
contact LDAP server
Jun 18 11:15:58 kinetic su: martinm to root on /dev/pts/0
Jun 18 11:16:06 kinetic su: martinm to root on /dev/pts/1
Jun 18 11:16:29 kinetic sshd[15748]: Received disconnect from
10.64.10.251: 2: Invalid packet header. This probably indicates a
problem with key exchange or encryption.
Jun 18 11:16:30 kinetic sshd[15746]: syslogin_perform_logout: logout()
returned an error
Jun 18 11:16:34 kinetic sshd[16511]: Accepted publickey for martinm from
10.64.10.251 port 56470 ssh2
Jun 18 11:16:41 kinetic sshd[16513]: Received disconnect from
10.64.10.251: 2: Invalid packet header. This probably indicates a
problem with key exchange or encryption.
Jun 18 11:16:41 kinetic sshd[16511]: syslogin_perform_logout: logout()
returned an error
Jun 23 15:52:59 kinetic sshd[56974]: Received disconnect from
10.64.10.209: 5: Message Authentication Code did not verify (packet
#75658). Data integrity has been compromised.
Jun 23 15:53:12 kinetic sshd[57109]: Accepted publickey for martinm from
10.64.10.209 port 9494 ssh2
Jun 23 15:53:38 kinetic su: martinm to root on /dev/pts/3
Jun 23 15:56:36 kinetic sshd[57111]: Received disconnect from
10.64.10.209: 2: Invalid packet header. This probably indicates a
problem with key exchange or encryption.
Jun 23 15:56:44 kinetic sshd[57151]: Accepted publickey for martinm from
10.64.10.209 port 9534 ssh2
My google-fu has failed me on this.
Any ideas what on earth this could be?
The Ethernet card?
em0: <Intel(R) PRO/1000 Legacy Network Connection 1.0.1> port 0xcc00-0xcc3f mem 0xfdfe0000-0xfdffffff,0xfdfc0000-0xfdfdffff irq 17 at device 7.0 on pci1
em0: [FILTER]
em0: Ethernet address: 00:0e:0c:6b:d6:d3
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
	ether 00:0e:0c:6b:d6:d3
	inet 10.64.10.10 netmask 0xffffff00 broadcast 10.64.10.255
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
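The options line above shows RXCSUM/TXCSUM hardware checksum offload enabled, and em(4) offload paths have been implicated in corruption like this before. One hypothetical diagnostic (a configuration sketch, not something from the original thread) is to turn offload off and re-run the nc/md5 test:

```shell
# Disable hardware checksum offload on em0 and re-test. If the md5
# sums then match consistently, the offload path (the NIC itself or
# the em(4) driver's use of it) is the likely suspect.
ifconfig em0 -rxcsum -txcsum

# To keep the setting across reboots, rc.conf could carry it, e.g.:
#   ifconfig_em0="inet 10.64.10.10 netmask 255.255.255.0 -rxcsum -txcsum"
```

With offload disabled the kernel computes TCP/IP checksums in software, so a flaky offload engine can no longer corrupt frames after the checksum is stamped.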
Thanks,
Martin.
From: Martin Minkus
Sent: Monday, 14 June 2010 11:21
To: freebsd-questions at freebsd.org
Subject: FreeBSD+ZFS+Samba: open_socket_in: Protocol not supported -
after a few days?
Samba 3.4 on FreeBSD 8-STABLE branch.
After a few days I start getting weird errors: Windows PCs can't
access the Samba share, have trouble accessing files, etc., and Samba
becomes totally unusable.
Restarting Samba doesn't fix it – only a reboot does.
Accessing files on the ZFS pool locally is fine. Other services on the
box (like dhcpd and the OpenLDAP server) continue to work fine. Only
Samba dies – and by "dies" I mean it can no longer service clients and
Windows brings up bizarre errors. Windows can access our other Samba
servers (on Linux, etc.) just fine.
Kernel:
FreeBSD kinetic.pulse.local 8.1-PRERELEASE FreeBSD 8.1-PRERELEASE #4: Wed May 26 18:09:14 NZST 2010     martinm at kinetic.pulse.local:/usr/obj/usr/src/sys/PULSE  amd64
Zpool status:
kinetic:~$ zpool status
pool: pulse
state: ONLINE
scrub: none requested
config:
	NAME                                          STATE     READ WRITE CKSUM
	pulse                                         ONLINE       0     0     0
	  raidz1                                      ONLINE       0     0     0
	    gptid/3baa4ef3-3ef8-0ac0-f110-f61ea23352  ONLINE       0     0     0
	    gptid/0eaa8131-828e-6449-b9ba-89ac63729d  ONLINE       0     0     0
	    gptid/77a8da7c-8e3c-184c-9893-e0b12b2c60  ONLINE       0     0     0
	    gptid/dddb2b48-a498-c1cd-82f2-a2d2feea01  ONLINE       0     0     0
errors: No known data errors
kinetic:~$
log.smb:
[2010/06/10 17:22:39, 0] lib/util_sock.c:902(open_socket_in)
open_socket_in(): socket() call failed: Protocol not supported
[2010/06/10 17:22:39, 0] smbd/server.c:457(smbd_open_one_socket)
smbd_open_once_socket: open_socket_in: Protocol not supported
[2010/06/10 17:22:39, 2] smbd/server.c:676(smbd_parent_loop)
waiting for connections
log.ANYPC:
[2010/06/08 19:55:55, 0] lib/util_sock.c:1491(get_peer_addr_internal)
getpeername failed. Error was Socket is not connected
read_fd_with_timeout: client 0.0.0.0 read error = Socket is not
connected.
The code in lib/util_sock.c, around line 902:
/****************************************************************************
 Open a socket of the specified type, port, and address for incoming data.
****************************************************************************/

int open_socket_in(int type,
		   uint16_t port,
		   int dlevel,
		   const struct sockaddr_storage *psock,
		   bool rebind)
{
	struct sockaddr_storage sock;
	int res;
	socklen_t slen = sizeof(struct sockaddr_in);

	sock = *psock;

#if defined(HAVE_IPV6)
	if (sock.ss_family == AF_INET6) {
		((struct sockaddr_in6 *)&sock)->sin6_port = htons(port);
		slen = sizeof(struct sockaddr_in6);
	}
#endif

	if (sock.ss_family == AF_INET) {
		((struct sockaddr_in *)&sock)->sin_port = htons(port);
	}

	res = socket(sock.ss_family, type, 0);
	if (res == -1) {
		if (DEBUGLVL(0)) {
			dbgtext("open_socket_in(): socket() call failed: ");
			dbgtext("%s\n", strerror(errno));
		}
	...
In other words, it looks like something in the kernel is being
exhausted (but what?). I don't know if tuning is required, or whether
this is some kind of bug.
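If a kernel resource really is being exhausted, these FreeBSD commands (a diagnostic sketch, not something run in the original report) would show the usual suspects for socket() failing – mbuf clusters and the open-socket ceiling:

```shell
# mbuf and mbuf-cluster usage; non-zero "denied" counters mean
# allocation requests are already failing.
netstat -m

# Open-socket count versus the kern.ipc.maxsockets ceiling.
sysctl kern.ipc.numopensockets kern.ipc.maxsockets

# Per-zone kernel allocator stats; a non-zero FAIL column marks
# zones that have run dry.
vmstat -z | grep -i -e socket -e mbuf
```

Comparing these numbers right after boot against the same numbers once Samba starts failing would show whether something is leaking toward a limit over those few days.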
/boot/loader.conf:
mvs_load="YES"
zfs_load="YES"
vm.kmem_size="20G"
#vfs.zfs.arc_min="512M"
#vfs.zfs.arc_max="1536M"
vfs.zfs.arc_min="512M"
vfs.zfs.arc_max="3072M"
I've played with a few sysctl settings (found these recommendations
online, but they make no difference):
/etc/sysctl.conf:
kern.ipc.maxsockbuf=2097152
net.inet.tcp.sendspace=262144
net.inet.tcp.recvspace=262144
net.inet.tcp.mssdflt=1452
net.inet.udp.recvspace=65535
net.inet.udp.maxdgram=65535
net.local.stream.recvspace=65535
net.local.stream.sendspace=65535
Any ideas on what could possibly be going wrong?
Any help would be greatly appreciated!
Thanks,
Martin