lge fiber-optic loose connection for 1-6s

Wed Jun 6 06:13:55 UTC 2007

On Tue, Jun 05, 2007 at 06:03:20PM +0100, Paul Bielecki wrote:
 > Hello All
 > 
 > I have network connection problems with my small database/samba server.
 > The machine is a small Shuttle box with an lge fiber-optic 1000baseSX
 > interface on the LAN and rl0 for the VPN connection.
 > The server was set up by somebody else about 4 years ago and has not
 > been updated since.
 > I have 6x FreeBSD + 2x Linux + 4x M$ servers, but this is the only
 > server I have connection problems with.
 > 
 > It is FreeBSD 4.8 stable, Mysql 4.0.12, Samba 2.2.8
 > 
 > Network: 330 machines + network printers; 60 machines including this
 > server are on 10.0.0.0/24, printers are on 10.0.0.0/22 and the rest of
 > the LAN is 10.0.1.0/22, 10.0.2.0/22, 10.0.3.0/22.
 > The default gateway is set to a host in 10.0.0.0/24.
 > The rl link is connected to a second FreeBSD box which acts only as a
 > VPN, network 172.16.12.0/24.
 > There is one main switch which connects the servers and the uplinks
 > from all rooms and buildings.
 > Almost all Windows machines in the network are up to date and all have
 > antivirus software installed.
 > 
 > What happens is that occasionally, 6 to 20 times a day, all
 > machines seem to lose their connection to this server for 1-6 seconds.
 > 
 > When it happens:
 > - I can ping google.com or another host in the same network from the
 > server itself and I get replies (?)
 > - I lose my ssh connection to this server
 > - there are no errors or warnings in messages apart from smbd errors
 > - samba gives me lots of "smbd read_data: read failure for 4. Error =
 > Operation time out" or smbd_oplock/oplock break.
 > - tcpdump shows lots of ACK packets to and from the server on port 139
 > 
 > I think that having 10.0.0.0/24 and 10.0.0.0/22 as one big network
 > doesn't help; I believe it should be set up with VLANs, but I can't
 > change it just like that.
 > The second thing is that the M$ network is not configured properly:
 > there should be one WINS server or PDC, and no broadcasts.
 > 
 > I usually just blindly watch tcpdump -v -s 255 -i lge0 port not 22 and
 > port not 139 and not icmp
 > but I don't know what I should look for.
 > 
 > Let me know your thoughts and please give me some "tips" on how I can
 > diagnose what is causing my problems.
 > 
 > some help with tcpdump would be much appreciated too,
 > for instance:
 > 17:05:49.644256 0.00:01:e6:9d:07:16.452 >
 > 0.ff:ff:ff:ff:ff:ff.452:ipx-sap-resp 30c '0001E69D071680DDNPI9D0716'
 > addr 0.00:01:e6:9d:07:16
 > 17:33:04.521449 802.1d config 8000.00:05:5d:1f:00:80.8002 root
 > 8000.00:05:5d:1f:00:80 pathcost 0 age 0 max 20 hello 2 fdelay 15
 > 
 > # printers
 > 17:33:07.370377 10.0.0.225.svrloc > HP-DEVICE-DISC.MCAST.NET.svrloc:
 > [udp sum ok] udp 151 (ttl 4, id 51568, len 179)
 > 17:05:18.409507 10.0.0.237.netbios-dgm > 255.255.255.255.netbios-dgm:
 > [udp sum ok] NBT UDP PACKET(138) (ttl 60, id 14452, len 229)
 > 17:05:18.757053 10.0.0.218.netbios-dgm > 255.255.255.255.netbios-dgm:
 > [udp sum ok] NBT UDP PACKET(138) (ttl 60, id 20727, len 229)
 > 
 > # another samba server to bcast
 > 17:05:29.708120 10.0.0.127.33191 > 10.0.3.255.netbios-ns: [udp sum ok]
 > NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST (DF) (ttl 64, id 0, len
 > 78)
 > 
 > 
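
On the addressing in the quoted message: a quick way to see how those
prefixes relate is Python's standard ipaddress module. This is only a
sketch. The prefix list is copied from the message, and normalize() is a
helper name made up for this example.

```python
import ipaddress

# Prefixes exactly as written in the quoted message.
quoted = ["10.0.0.0/24", "10.0.0.0/22", "10.0.1.0/22",
          "10.0.2.0/22", "10.0.3.0/22"]


def normalize(prefix):
    """Return the network a prefix belongs to (hypothetical helper).

    strict=False masks off any host bits instead of raising ValueError.
    """
    return ipaddress.ip_network(prefix, strict=False)


networks = {p: normalize(p) for p in quoted}
for written, net in networks.items():
    print(f"{written:>12} -> {net}")

server_lan = networks["10.0.0.0/24"]
big_lan = networks["10.0.0.0/22"]
# subnet_of() is available in Python 3.7 and later.
print("10.0.0.0/24 inside 10.0.0.0/22:", server_lan.subnet_of(big_lan))
```

With a /22 mask, 10.0.1.0, 10.0.2.0 and 10.0.3.0 all normalize to
10.0.0.0/22, so as written these are one IP network rather than four, and
the server's /24 sits inside it.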

I'm not sure what caused this issue, but it seems that lge(4) lacks
protection against overly fragmented packets.
Did you see any "watchdog timeout" messages on the console?
I don't have any lge(4) hardware, so it's hard for me to fix.
It seems that lge(4) needs the following work:
 - endian cleanup
 - bus_dma(9) conversion
 - fragment handling, as the hardware can't handle more than 10 fragments.
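
To check for the watchdog messages asked about above, something along
these lines could scan a saved log. It is a rough sketch: the message
shape ("lge0: watchdog timeout") and the log location are assumptions,
and watchdog_lines() is a name made up for this example.

```python
import re


def watchdog_lines(lines, ifname="lge0"):
    """Return the log lines that report a watchdog timeout for ifname.

    Assumes the driver logs "<ifname>: watchdog timeout"; adjust the
    pattern if the message on your system looks different.
    """
    pattern = re.compile(r"\b" + re.escape(ifname) + r": watchdog timeout")
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


# Example use, assuming syslog writes kernel messages to /var/log/messages:
#
#     with open("/var/log/messages", errors="replace") as log:
#         for hit in watchdog_lines(log):
#             print(hit)
```

If the timestamps of any hits line up with the 1-6 second outages, that
would point at the driver rather than at the Samba or network setup.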

-- 
Regards,
Pyun YongHyeon