HELP DEBUG: FreeBSD 6.3-RELEASE-p3 TIMEOUT - WRITE_DMA + other strange behaviour!

Mon Sep 29 08:53:46 UTC 2008

On Mon, Sep 29, 2008 at 10:53:06AM +0300, Anton - Valqk wrote:
> Morning, and have a nice week!
> The problems continue.
> It appears that this has something to do with the xl, fxp and atapci cards.
> On Friday evening, when I had hardware access to the machine, I pulled
> the power from one of the disks
> and removed rl0. The situation was absolutely the same: timeouts and
> more timeouts...
> After seeing this wasn't helping, I patched my kernel with
> http://freenas.svn.sourceforge.net/viewvc/freenas/branches/0.69/build/kernel-patches/ata/files/patch-ata.diff?view=markup
> this patch and tried increasing the timeout to 10, then to 15, then to 25
> - nothing helped.
> I got fed up and pulled out my home movie station (PIII 1GHz, 384MB RAM,
> VIA chipset, I think - I'll post the new dmesg here).
> I moved the disks (only 4 of them, as I wasn't using the fifth) and
> plugged xl0, fxp0 and the Promise card into the new machine.
> It started just fine and seemed to work for a while... then I started to
> get the *bad* timeouts again!
> F***!!! I saw that the timeout in the tunable sysctl was 5 and increased
> it to 15, but the timeouts still occur under heavy load (read: mythtv
> scanning about 300G of media files and other stuff, plus 3-4 fetches
> pulling online radio streams) ... so I think it
> has something to do with the Promise card.

The patch you're trying for FreeNAS does not guarantee success (I
thought I outlined this in my Wiki, but maybe I should put it in bold).
It helps for some people, which indicates that their disks are taking
longer than 5 seconds to perform some internal operations.  Your problem
is of quite a different nature.  You're being plagued with a multitude
of problems.

> Oh yes, I've also got
> Sep 28 21:32:47 azimud kernel: xl0: transmission error: 90
> Sep 28 21:32:47 azimud kernel: xl0: tx underrun, increasing tx start
> threshold to 120 bytes

This is a completely different issue, and is commonly seen on xl(4)
cards.  I know because I used to deal with this caveat on my xl-based
servers many years ago.  You might see this message a few times as the
driver raises the TX start threshold to compensate.

> because these problems drive me mad! I simply want a working machine,

Your motherboard lacks an APIC for starters, which means lots of devices
are going to share a single IRQ.  There's no guarantee your motherboard
is at all stable/reliable either; so far the evidence points to it being
too overloaded, or flaky.

There's also the issue of bus mastering, although I'm not sure if this
could cause the problems seen with the Promise card.

Additionally, you're using rl(4) and xl(4).  The rl(4) man page is
indicative of engineering mistakes by Realtek, and xl(4) is older, but
should still work.  fxp(4) (e.g. Pro/100 S) is reliable and still
occasionally used, but has been superseded by em(4) -- Intel, AFAIK,
does not sell fxp-based chips any longer.  In fact, I think the Pro/1000
PT cards cost less than what you might find a Pro/100 S card for.

You might start by disabling features in your BIOS (specifically any
on-board chips which you don't use, e.g. USB, unused PATA ports, etc).
This can free up IRQs, and depending upon how your motherboard shares
IRQs with certain PCI levels (e.g. PCI-A through PCI-D), you might be
able to find a good configuration where each card is on a separate IRQ.
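
To see which devices currently share an IRQ, the `vmstat -i` output can be
scanned for lines that name more than one device.  The following is only a
rough Python sketch: the sample text is made up for illustration (it is not
taken from the vmstat file posted in this thread), and it assumes each line
has the form `irqN: dev1 dev2 ... total rate`.

```python
import re

# Hypothetical sample in the style of `vmstat -i` output; not real data.
SAMPLE = """\
interrupt                          total       rate
irq0: clk                        1000000        100
irq11: xl0 fxp0 atapci1           500000         50
irq14: ata0                        20000          2
Total                            1520000        152
"""


def shared_irqs(text):
    """Return {irq_number: [devices]} for IRQs used by more than one device."""
    shared = {}
    for line in text.splitlines():
        m = re.match(r"irq(\d+):\s+(.*)", line)
        if not m:
            continue  # header, "Total" line, or anything else
        fields = m.group(2).split()
        # Assumption: the last two fields are the total and rate counters.
        devices = fields[:-2]
        if len(devices) > 1:
            shared[int(m.group(1))] = devices
    return shared


if __name__ == "__main__":
    for irq, devs in sorted(shared_irqs(SAMPLE).items()):
        print("irq%d is shared by: %s" % (irq, ", ".join(devs)))
```

Running it again after each BIOS change shows whether the cards have ended up
on separate IRQs.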

> I'm going to pull the xl0 and Promise card from this machine (use
> integrated IDE controllers) and another fxp, because it seems that FreeBSD
> now has problems with xl cards too (which reminds me of 4.x - it used to
> work perfectly with xl, rl and fxp cards... but that's another topic anyway).

I'm sorry to tell you, but 4.x had the same issues with xl(4) as you're
reporting.  It's dependent upon the amount of network I/O you put
through the card.

> If that doesn't help I'll be forced to migrate to linux :(...

And what's wrong with that?  There is absolutely no shame in using a
different operating system.  I do not know why people seem to think that
FreeBSD "does everything and does it better than <other OS>".  This is
flat out untrue.  I often wonder what makes people think that in the
first place.

I remind people of this quite often: use whatever tool/OS gets the job
done for you.  If you run into problems with one, and the amount of
effort it's taking to work around or solve the problems isn't worth it,
go with something else.

> Because I'll have a free machine with the problematic Promise Ultra133
> controller and two disks in it, I can provide a serial console to this
> machine if anyone is interested in debugging this issue... anyone?

Regarding the ATA errors specifically (not your other problems) -- if
they are easily reproducible, set up remote serial console and
immediately contact Scott Long <scottl at samsco.org>.  He has offered to
help track these issues down, but absolutely requires *reliable* test
cases.
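
When putting together a test case, it can help to summarise the errors per
disk and per ATA command first.  Here is a minimal Python sketch, assuming
kernel log lines in the same shape as the ones quoted further down in this
mail (the FAILURE line is re-joined onto one line) and assuming 512-byte
sectors; the function names are illustrative only.

```python
import re
from collections import Counter

# Lines copied from the dmesg output quoted in this thread.
LOG = """\
ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5554848
ad7: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=374303456
ad7: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=374303456
ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50757760
ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=12032
"""

PATTERN = re.compile(r"^(ad\d+): (TIMEOUT|FAILURE) - (\w+) .*LBA=(\d+)")


def tally(text):
    """Count log entries per (device, kind, command)."""
    counts = Counter()
    for line in text.splitlines():
        m = PATTERN.match(line)
        if m:
            dev, kind, cmd, _lba = m.groups()
            counts[(dev, kind, cmd)] += 1
    return counts


def lba_to_offset(lba, sector_size=512):
    """Convert an LBA to a byte offset (assumes 512-byte sectors)."""
    return lba * sector_size


if __name__ == "__main__":
    for (dev, kind, cmd), n in sorted(tally(LOG).items()):
        print("%s %-7s %-11s %d" % (dev, kind, cmd, n))
    # Cross-check the failed LBA against the g_vfs_done() offset.
    print(lba_to_offset(374303456))
```

With 512-byte sectors, LBA 374303456 corresponds to byte offset 191643369472,
which is the offset in the quoted g_vfs_done() error, so the failed write and
the filesystem error refer to the same request.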

> here is the new info:
> http://valqk.ath.cx/tmp/dmesg.new
> http://valqk.ath.cx/tmp/vmstat.new
> 
> (oh I have no usbs connected to this machine [yet]).
> 
> cheers,
> valqk.
> 
> Anton - Valqk wrote:
> > Thanks Jeremy and Peter,
> > you are right that the machine has *lots* of hardware in it,
> > I was thinking of the power supply as a reason and measured the 5 and 12
> > volt rails - they seemed to be OK: 11.8 and 5.2 with all hardware in it.
> > The shared irq is the one I've thought of and that's why I've posted
> > vmstat -i to hear your opinion.
> > [forgot to mention that I've read the wiki and next step is to patch the
> > kernel with
> > http://freenas.svn.sourceforge.net/viewvc/freenas/branches/0.69/build/kernel-patches/ata/files/patch-ata.diff?view=markup
> > this patch (any bad words for this patch or could just run - nothing bad
> > can happen?)]
> >
> > Yes, I have 3 nics(2 on pci) + pci ide promise, I'll get a smart switch
> > with vlans and I'll leave just the integrated xl0 and fxp0 with both
> > external ips on it these days,
> > but first I'll patch the kernel if Jeremy says it won't hurt (as far as
> > I saw just a timeout is moved from hardcoded value to a sysctl?)...
> > I have another Promise card that is a RAID controller, but when I
> > started looking for one I asked here, and the answers recommended the
> > PROMISE ULTRA ATA133 as a good card for FreeBSD (
> > http://docs.freebsd.org/cgi/getmsg.cgi?fetch=290848+0+archive/2008/freebsd-stable/20080316.freebsd-stable
> > )
> > (hmm, just saw that Jeremy pointed out promise card:  'Their Ultra133
> > TX2 card works fine on 33MHz PCI bus machines; don't worry about the
> > card being 66MHz, it will downthrottle correctly.') so maybe the problem
> > will be solved if I leave just two nics and no rl0...
> > Actually I'm using 6.3 here because I didn't want this to happen and I
> > was aware of such problems happening on 7-current....
> >
> > So test must be done... pls just answer about the patch will it be
> > helpful or I should try:
> >
> > 1. remove rl0 and run only one isp for the test.
> > 2. replace the ultra 133 card with another one.
> > 3. try to replace the ATA100 cables (the ones with 80 wires) with
> > older ones with only 40 wires?
> > 4. ? anything else?
> >
> >
> > Anton - Valqk wrote:
> >   
> >> Hello,
> >> I have a VERY strangely behaving 6.3-p3 system with DMA timeouts and
> >> network cards 'dropping traffic'.
> >> Following is a description of the hardware and the things that are happening.
> >> The machine is a Dell OptiPlex PII 300MHz with 512MB RAM.
> >> It has 3 NICs:
> >> fxp0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
> >>         options=8<VLAN_MTU>
> >>         inet 7.8.9.10 netmask 0xfffff000 broadcast 7.8.9.255
> >>         ether 00:91:21:16:14:bf
> >>         media: Ethernet autoselect (100baseTX <full-duplex>)
> >>         status: active
> >> rl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
> >>         options=8<VLAN_MTU>
> >>         inet 8.9.10.11 netmask 0xffffffe0 broadcast 8.9.10.255
> >>         ether 00:02:44:73:2a:fa
> >>         media: Ethernet autoselect (100baseTX <full-duplex>)
> >>         status: active
> >> xl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
> >>         options=9<RXCSUM,VLAN_MTU>
> >>         inet 192.168.123.2 netmask 0xffffff00 broadcast 192.168.123.255
> >>         inet 192.168.123.5 netmask 0xffffff00 broadcast 192.168.123.255
> >>         inet 192.168.123.6 netmask 0xffffff00 broadcast 192.168.123.255
> >>         ether 00:c0:4f:20:66:a3
> >>         media: Ethernet autoselect (100baseTX <full-duplex>)
> >>         status: active
> >> fxp0 and rl0 are external links to the world and are plugged into pci slots
> >> xl0 is the internal interface and is integrated on motherboard.
> >> It also has 1 PROMISE ULTRA133 ATA pci IDE controller plugged into the
> >> pci slot.
> >> It has 5 disks in it - 4 connected to the PROMISE card and 1 to the
> >> motherboard ide.
> >>
> >> they are as follows:
> >> ad0 and ad6 are two identical hitachi disks in gmirror for the system
> >> and a partition that I keep backups on.
> >>
> >> ad4, ad5 and ad7 are storage disks - 500GB Seagates with 8MB cache that I
> >> keep ISOs and other files on, and they are the problematic ones (maybe
> >> because of high traffic operations compared to the other two?).
> >>
> >> What is the problem:
> >> Actually there are two problems:
> >> 1. I get a lot of DMA timeouts, mostly on ad5 and ad7, where I keep
> >> files over 4-5MB and read/write very often at 3-6-8MB/s from the
> >> disk. I don't use ad4, so I can't tell whether it would time out too, but I
> >> suppose it would (it currently has Linux partitions on it and is not
> >> mounted). I get these errors:
> >> dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5554848
> >> dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5914112
> >> dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=14924096
> >> dmesg.today:ad7: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=374303456
> >> dmesg.today:ad7: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR>
> >> error=10<NID_NOT_FOUND> LBA=374303456
> >> dmesg.today:g_vfs_done():ad7[WRITE(offset=191643369472,
> >> length=131072)]error = 5
> >> dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50757760
> >> dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50760192
> >> dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=12032
> >> dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50769792
> >>
> >> strange thing is that I'm seeing the g_vfs_done just recently and this
> >> problem is from the very start of this hardware setup of the machine.
> >> The machine used to work with two hitachi disks connected to the ad0 and
> >> ad1 (integrated ide) and only one - xl0 - nic perfectly.
> >> The problems started when I plugged in the PROMISE and other nic cards
> >> and started using it as router, fileserver and backup server (each in
> >> separate jail, except the pf firewall).
> >> 2. The other strange issue is that when (I guess) the timeouts start,
> >> *sometimes*, not every time, I lose the connection to xl0 or fxp0
> >> (sometimes rl0 works and accepts connections from the outside,
> >> sometimes not). When I go to the machine and plug in a monitor, there
> >> are no messages from the kernel and no logs in /var/log/messages or
> >> debug - nothing. The strange thing is that when I ping the host from
> >> the local net it times out, yet ifconfig shows that the interface is
> >> connected at full-duplex 100Mbit and everything seems OK. I've tried
> >> ifconfig xl0 down/up but it doesn't help; I tried unplugging the cable
> >> and it reconnected, but no packets passed - timeout again!
> >> I rebooted and the NIC came up. These 'drops' have become more and more
> >> common recently, and last night I wasn't able to log in for about an
> >> hour, and after that the machine came back up again by itself! That was
> >> on the LAN - but it wasn't accessible at all from the outside. The
> >> strange thing is that it replied to ping, but I wasn't even able to open
> >> a connection to the ssh port, and the NAT wasn't working?! After that I
> >> remembered that at that time I have a cron job that runs for about an
> >> hour, fetching an online radio broadcast into a file... weird! The
> >> machine also has rtorrent, apache22 and samba (in a jail) running.
> >>
> >> some output from it can be found here:
> >> http://valqk.ath.cx/tmp/dmesg
> >> http://valqk.ath.cx/tmp/vmstat
> >> http://valqk.ath.cx/tmp/smartctl
> >>
> >>
> >> please give any ideas/hints/solutions!
> >>
> >> thanks a lot to everyone!
> >> cheers,
> >> valqk.
> >> _______________________________________________
> >> freebsd-stable at freebsd.org mailing list
> >> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> >> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |