em0 watchdog timeout (and 3ware problems) 7-stable

Jack Vogel jfvogel at gmail.com
Mon Apr 27 17:21:17 UTC 2009


Greg,

I have another report of this problem, and I have a patch for you to try
out. I will be sending it out a bit later today.

Jack


On Sun, Apr 26, 2009 at 5:50 AM, Greg Byshenk <freebsd at byshenk.net> wrote:

> I have one machine that is seeing watchdog timeouts on em0, running
> 7-STABLE amd64 as of 2009.04.19, and also some other more perverse errors.
>
> Twice now in the last 48 hours, this machine has become unreachable via the
> network, and connecting to the console shows an endless string of
>
>   [...]
>   em0: watchdog timeout -- resetting
>   em0: watchdog timeout -- resetting
>   em0: watchdog timeout -- resetting
>
> messages. The machine is almost locked up.  That is, I can get a login
> prompt, but can go no further than typing in a username; after the
> username, no password prompt, and nothing further.  The only option is
> to hard reset the machine or to drop to debugger and reboot.
>
> Now the "perverse" part.  After restarting, the system partition is no
> more.
>
> Background detail:  the machine is a fileserver, with a 3Ware 9650SE-16ML
> SATA controller, connected to 16 1TB SATA drives, configured as
> a 14-drive RAID10 array (+ 2 hot spares), with a 50GB system partition
> and 6.5TB data partition.  The system partition is configured as da0,
> with one slice and more or less standard partitions for / /var /tmp, etc.
> (the data partition of the array is sliced with gpt).
>
> The issue here is that, upon restart, all partition information on da0
> seems to have disappeared, and restarting results in a "no operating
> system found" message, and a failure to boot (obviously).
>
> But all of the data is still present.  If I boot into rescue mode,
> recreate da0s1, mark it bootable, and restore the bsdlabel, then
> everything works again.  I can restart the machine, and it comes back
> up normally (it requires an fsck of everything on da0, but after that
> everything is back to normal).
>
> I don't know if this is two unrelated problems, or one problem with
> two symptoms, or something else.  I think that I can safely say that
> it is not a problem with the 3Ware controller itself, as I replaced
> the controller with a spare (identical model), and the problem
> recurred.  Additionally, I have an almost-identical configuration on
> four other machines, none of which are experiencing any problems.
> One thing that is different is that the other machines use
> Intel PRO/1000 PF (pci-e) NICs.
>
> Is there some known problem with the Intel 2572 fibre NIC?  Or some
> potential interaction of it with the 3ware RAID controller?
>
> For the moment, I've set hw.pci.enable_msi=0 (as discussed in the
> threads on 7.2/bge), and am building a new kernel/world from sources
> csup'd one hour ago, but I'd really like to hear any ideas about this
> -- particularly the wiping of the label.
>
> Some information about the system:
>
>
> # /dev/da0s1:
> 8 partitions:
> #        size   offset    fstype   [fsize bsize bps/cpg]
>  a:  2097152        0    4.2BSD        0     0     0
>  b:  8388608  2097152      swap
>  c: 104856192        0    unused        0     0         # "raw" part, don't edit
>  d:  8388608 10485760    4.2BSD        0     0     0
>  e:  2097152 18874368    4.2BSD        0     0     0
>  f: 41943040 20971520    4.2BSD        0     0     0
>  g: 41941632 62914560    4.2BSD        0     0     0
>
>
> em0 at pci0:4:1:0: class=0x020000 card=0x10038086 chip=0x10018086 rev=0x02 hdr=0x00
>    vendor     = 'Intel Corporation'
>    device     = '2572 10/100/1000 Ethernet Controller (Fiber)'
>    class      = network
>    subclass   = ethernet
>    bar   [10] = type Memory, range 32, base 0xda000000, size 131072, enabled
>    bar   [14] = type Memory, range 32, base 0xda020000, size 65536, enabled
>
> twa0 at pci0:9:0:0: class=0x010400 card=0x100413c1 chip=0x100413c1 rev=0x01 hdr=0x00
>    device     = '9650SE Series PCI-Express SATA2 Raid Controller'
>    class      = mass storage
>    subclass   = RAID
>    bar   [10] = type Prefetchable Memory, range 64, base 0xd8000000, size 33554432, enabled
>    bar   [18] = type Memory, range 64, base 0xda300000, size 4096, enabled
>    bar   [20] = type I/O Port, range 32, base 0x3000, size 256, enabled
>    cap 01[40] = powerspec 2  supports D0 D1 D2 D3  current D0
>    cap 05[50] = MSI supports 32 messages, 64 bit
>    cap 10[70] = PCI-Express 1 legacy endpoint
>
> --
> greg byshenk  -  gbyshenk at byshenk.net  -  Leiden, NL
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"
>
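
A note on the quoted bsdlabel: when a label has to be restored by hand, as
described above, it may help to check that the sizes and offsets are mutually
consistent before writing anything to disk. The following is only a minimal
sketch of such a check. The function name check_label and the 512-byte sector
size are assumptions made for illustration; the numbers are copied from the
quoted label. It verifies that partitions a, b, d, e, f and g are contiguous
and together fill the "raw" c partition.

```python
# Sizes and offsets (in sectors) copied from the quoted da0s1 label.
# Each entry is name -> (size, offset).
LABEL = {
    "a": (2097152, 0),
    "b": (8388608, 2097152),
    "c": (104856192, 0),  # "raw" partition spanning the whole slice
    "d": (8388608, 10485760),
    "e": (2097152, 18874368),
    "f": (41943040, 20971520),
    "g": (41941632, 62914560),
}

SECTOR_BYTES = 512  # assumption: 512-byte sectors


def check_label(label):
    """Return a list of problems; an empty list means the layout is consistent.

    Hypothetical helper: checks that every partition except 'c' starts where
    the previous one ends, and that the last one ends where 'c' ends.
    """
    problems = []
    raw_size, raw_offset = label["c"]
    # Order the real partitions by their starting offset.
    parts = sorted(
        (off, size, name) for name, (size, off) in label.items() if name != "c"
    )
    expected = raw_offset
    for off, size, name in parts:
        if off != expected:
            problems.append(f"{name}: starts at {off}, expected {expected}")
        expected = off + size
    raw_end = raw_offset + raw_size
    if expected != raw_end:
        problems.append(f"partitions end at {expected}, 'c' ends at {raw_end}")
    return problems


if __name__ == "__main__":
    for line in check_label(LABEL) or ["layout is contiguous and fills 'c'"]:
        print(line)
    # Slice size, for comparison with the stated 50GB system partition.
    print("slice size: %.2f GiB" % (LABEL["c"][0] * SECTOR_BYTES / 2**30))
```

A check like this says nothing about why the label disappeared; it only
confirms that a label about to be written back describes a gap-free layout.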

