HELP DEBUG: FreeBSD 6.3-RELEASE-p3 TIMEOUT - WRITE_DMA + other strange behaviour!

Jeremy Chadwick koitsu at FreeBSD.org
Fri Sep 26 11:11:45 UTC 2008


On Fri, Sep 26, 2008 at 01:12:14PM +0300, Anton - Valqk wrote:
> Hello,
> I have a VERY strangely behaving 6-3p3 with DMA timeouts and network cards
> 'dropping traffic'.

The disk errors you see are well-known, but the reasons for them
happening differ per person.  Some people replace cables and the problem
goes away.  Others change controller cards.  Others found no solution
and went to Linux.

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

Here are some facts:

1) The LBAs reported to have problems are scattered, which indicates to
me there are probably no bad blocks on your disks.

2) You have two separate disks showing the above behaviour, decreasing
the probability of it being bad blocks/sectors.

3) Your dmesg.today doesn't include timestamps, so I have to assume the
problems all happen at once or within short moments of one another,
rather than at random moments throughout a 24 hour period.
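
For anyone who wants to check points 1 and 2 on their own logs, here is a
rough Python sketch that pulls the LBAs out of ATA timeout lines and
reports how far apart they are on each disk.  The sample lines, the disk
names (ad4, ad6) and the LBA values are made up for illustration and are
not taken from the dmesg in this thread; the regular expression assumes
the usual "adN: TIMEOUT - WRITE_DMA retrying (...) LBA=N" wording.

```python
import re
from collections import defaultdict

# Assumed log wording: "ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=1234"
TIMEOUT_RE = re.compile(r"^(ad\d+): TIMEOUT - (\w+).*?LBA=(\d+)")


def lbas_by_disk(log_text):
    """Return {disk: [LBA, ...]} for every ATA timeout line in log_text."""
    found = defaultdict(list)
    for line in log_text.splitlines():
        m = TIMEOUT_RE.match(line.strip())
        if m:
            found[m.group(1)].append(int(m.group(3)))
    return dict(found)


def lba_span(lbas):
    """Distance between the lowest and highest failing LBA."""
    return max(lbas) - min(lbas)


# Hypothetical sample data, NOT from the original poster's machine.
SAMPLE = """\
ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=1050431
ad6: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=287309887
ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=96468991
ad6: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=4194367
"""

if __name__ == "__main__":
    for disk, lbas in sorted(lbas_by_disk(SAMPLE).items()):
        print(disk, "timeouts:", len(lbas), "LBA span:", lba_span(lbas))
```

A large span on more than one disk is only a hint that the cause is
something other than a localised patch of bad sectors; it does not prove
it.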

> strange thing is that I'm seeing the g_vfs_done just recently and this
> problem is from the very start of this hardware setup of the machine.

I believe the g_vfs_done issues can either be attributed to the disk
errors you're seeing, or oddities with gmirror/GEOM.  I've seen people
report this before, and GEOM often spits back an error on an
index/offset which seems way too large for it to be realistic.

> The machine used to work with two hitachi disks connected to the ad0 and
> ad1 (integrated ide) and only one - xl0 - nic perfectly.
> The problems started when I plugged in the PROMISE and other nic cards
> and started using it as router, fileserver and backup server (each in
> separate jail, except the pf firewall).
> ...
>
> 2. The other strange issue is that when (I guess) the timeouts start,
> *sometimes*, not every time, I lose the connection to xl0 or fxp0
> (sometimes rl0 works and accepts connections from the outside,
> sometimes not). When I go to the machine and plug in a monitor, there
> are no messages from the kernel and no logs in /var/log/messages or
> debug: nothing. The strange thing is that when I ping the host from the
> local net it times out, while ifconfig shows the interface connected at
> 100mbit full duplex and everything seems OK. I've tried ifconfig xl0
> down/up but it doesn't help; I tried unplugging the cable and plugging
> it back in, and the link came up but no packets passed: timeout again!

I've looked at your dmesg and vmstat output, and I have a feeling the
problem is an obvious one.

Your system has no APIC (this is not a typo), so your system *must*
share IRQs.  You have ***four*** devices on IRQ 11: a USB controller,
your fxp0 card, your rl0 card, and your xl0 card.
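
If you want to see the sharing for yourself, the probe lines in dmesg
carry the IRQ of each device.  Below is a minimal Python sketch that
groups devices by IRQ and prints the ones that share.  The sample text is
invented to mirror the situation described above (the uhci0 name for the
USB controller, the port ranges, the device numbers and the atapci1 line
are all assumptions); it is not a copy of the real dmesg.

```python
import re
from collections import defaultdict

# Assumed probe-line shape: "xl0: <...> port 0x... irq 11 at device 9.0 on pci0"
IRQ_RE = re.compile(r"^([a-z]+\d+): .*\birq (\d+) at device")


def devices_by_irq(dmesg_text):
    """Return {irq: [device, ...]} built from PCI probe lines."""
    groups = defaultdict(list)
    for line in dmesg_text.splitlines():
        m = IRQ_RE.match(line)
        if m:
            groups[int(m.group(2))].append(m.group(1))
    return dict(groups)


def shared_irqs(groups):
    """Keep only the IRQs that more than one device is attached to."""
    return {irq: devs for irq, devs in groups.items() if len(devs) > 1}


# Hypothetical sample data, NOT the original poster's dmesg.
SAMPLE = """\
uhci0: <hypothetical USB controller> port 0xd400-0xd41f irq 11 at device 7.2 on pci0
fxp0: <hypothetical NIC> port 0xd800-0xd83f irq 11 at device 9.0 on pci0
rl0: <hypothetical NIC> port 0xdc00-0xdcff irq 11 at device 10.0 on pci0
xl0: <hypothetical NIC> port 0xe000-0xe07f irq 11 at device 11.0 on pci0
atapci1: <hypothetical ATA controller> port 0xe400-0xe407 irq 10 at device 12.0 on pci0
"""

if __name__ == "__main__":
    for irq, devs in sorted(shared_irqs(devices_by_irq(SAMPLE)).items()):
        print("irq", irq, "shared by:", ", ".join(devs))
```

On a real box you would feed it the contents of /var/run/dmesg.boot
instead of SAMPLE.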

> http://valqk.ath.cx/tmp/dmesg
> http://valqk.ath.cx/tmp/vmstat
> http://valqk.ath.cx/tmp/smartctl
> 
> please give any ideas/hints/solutions!

I would recommend you start yanking PCI cards out of the system, one at
a time, and see which removal solves the problem.  You did state that
the problems began once you added the Promise card (which makes your
system have FIVE PCI cards in it?!?  Sheesh).

I can't imagine you'll have a stable system with that many cards in the
box all sharing a single IRQ -- especially on a board that old.

I'd recommend decreasing the number of cards you have in that system, or
getting a motherboard that has an APIC and preferably some reliable
on-board networking (read: Intel chips).  Toss the rl0 card if possible,
and consider replacing the Promise controller with a different one.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



More information about the freebsd-stable mailing list