sparc64/141918: [ehci] ehci_interrupt: unrecoverable error, controller halted (sparc64)

Tue Apr 24 12:10:15 UTC 2012

The following reply was made to PR sparc64/141918; it has been noted by GNATS.

From: Manuel Tobias Schiller <mala at hinterbergen.de>
To: Marius Strobl <marius at alchemy.franken.de>
Cc: bug-followup at FreeBSD.org
Subject: Re: sparc64/141918: [ehci] ehci_interrupt: unrecoverable error,
 controller halted (sparc64)
Date: Tue, 24 Apr 2012 14:05:47 +0200

 Hi Marius,

 I'm rather busy with work at the moment, so I'm not working quite as much
 on troubleshooting this issue right now... (See below for answers to your
 questions...)

 On Sun, 15 Apr 2012 14:51:05 +0200
 Marius Strobl <marius at alchemy.franken.de> wrote:
 > [...]
 > > > >
 > > > > Hi,
 > > > >
 > > > > the "VIA quirk fix" on its own gives the familiar message in dmesg
 > > > > (unrecoverable error, controller halted), so I'm compiling a
 > > > > kernel which
 > > >
 > > > Oof, this likely means there's a more basic problem with this
 > > > device. Have you already tried to re-seat the card in case there's
 > > > an electrical problem?
 > > > Please also provide the output of `pciconf -rb ehci0 at pci0:2:5:2
 > > > 0:255' from a booting kernel.
 > > > FYI, after some digging I've found the following card
 > > > ehci0 at pci0:2:5:2: class=0x0c0320 card=0x31041106 chip=0x31041106
 > > > rev=0x6h0 which is a newer revision of your device and works just
 > > > fine in a T1-200 including with the usb(4) fixes. The publicly
 > > > available datasheets for the VIA USB controllers are minimal and
 > > > exclude errata and Linux also doesn't seem to use any additional
 > > > work arounds, so I'm starting to run out of ideas what could be
 > > > wrong with your revision. The only remaining thing to give a try I
 > > > currently can think of is to test whether it chokes on the generic
 > > > initialization done by the sparc64 PCI code using the attached
 > > > patch.
 > > >
 > > > > combines this fix with your latest busdma fix to try them both
 > > > > together;
 > > >
 > > > This combination is unlikely to make a difference.
 > > >
 > > > Marius
 > > >
 > >
 > > Hi Marius,
 > >
 > > I've tried your new patch, both on its own and in conjunction with
 > > the latest busdma and Via quirk fixes, and I still get the same error
 > > message...
 > >
 > > Here's the output of pciconf you requested:
 > >
 > > mala at router:~> sudo pciconf -rb ehci0 at pci0:2:5:2 0:255
 > > Password:
 > > 06 11 04 31 06 00 10 22  65 20 03 0c 00 16 80 00
 > > 00 a0 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 > > 00 00 00 00 00 00 00 00  00 00 00 00 06 11 04 31
 > > 00 00 00 00 80 00 00 00  00 00 00 00 14 03 00 00
 > > 00 00 0b 00 00 00 00 00  a0 20 00 29 00 00 ff ff
 >
 > This is rather confusing; the 0x29 in the above line means that the
 > VIA workaround is applied. Didn't you say that with the fix to
 > actually apply it, the kernel panics as soon as attaching the
 > device?
 > Apart from this, the configuration space differs in 3 undocumented
 > bytes from mine. I'm not sure whether it's worth trying whether
 > these make a difference ...

 Yes, this was from a kernel with your patch and the VIA workaround
 applied; the kernel usually stops when I start using these devices
 heavily (i.e. the automatic checks done during a ZFS mount operation).
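
For reference, the header fields in the first five rows of the dump can be read off with a minimal Python sketch like the one below. The offsets follow the standard PCI type 0 configuration header (little-endian fields); the helper names `parse_dump`, `le` and `decode_header` are made up for illustration. The last check assumes that the workaround corresponds to bit 0x20 of the byte at offset 0x4b, which is a guess based on where the 0x29 sits in the dump and is not confirmed anywhere in this thread.

```python
# Rows 0x00-0x4f of the pciconf output quoted in this thread.
DUMP = """
06 11 04 31 06 00 10 22  65 20 03 0c 00 16 80 00
00 a0 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 06 11 04 31
00 00 00 00 80 00 00 00  00 00 00 00 14 03 00 00
00 00 0b 00 00 00 00 00  a0 20 00 29 00 00 ff ff
"""


def parse_dump(text):
    """Turn whitespace-separated hex byte pairs into a bytes object."""
    return bytes(int(tok, 16) for tok in text.split())


def le(cfg, off, size):
    """Read a little-endian field of `size` bytes at offset `off`."""
    return int.from_bytes(cfg[off:off + size], "little")


def decode_header(cfg):
    """Extract a few standard PCI type 0 header fields."""
    return {
        "vendor": le(cfg, 0x00, 2),
        "device": le(cfg, 0x02, 2),
        "revision": cfg[0x08],
        "class": le(cfg, 0x09, 3),      # prog-if, subclass, base class
        "subvendor": le(cfg, 0x2C, 2),
        "subdevice": le(cfg, 0x2E, 2),
        "reg_4b": cfg[0x4B],            # vendor-specific byte holding the 0x29
    }


cfg = parse_dump(DUMP)
header = decode_header(cfg)
for name, value in header.items():
    print(f"{name:10s} = 0x{value:x}")

# ASSUMPTION: bit 0x20 of offset 0x4b is the bit set by the workaround.
VIA_QUIRK_BIT = 0x20
print("quirk bit set:", bool(header["reg_4b"] & VIA_QUIRK_BIT))
```

The decoded vendor, device and class values line up with the `chip=`, `card=` and `class=` fields that pciconf reports for this device, which is a quick sanity check that the dump was read at the right offsets.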

 > > 00 5a 04 80 00 00 00 00  04 0b 88 88 33 00 00 00
 > > 20 20 01 00 00 00 00 00  01 00 00 00 00 00 00 c0
 > > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 > > 01 00 0a 7e 00 00 00 00  00 00 00 00 00 00 00 00
 > > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 > > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 > > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 > > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 > > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 > > 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 > > 00 00 00 00 00 00 00 03  00 00 00 00 00 00 00 00
 > >
 > > This was taken after the controller stopped, on a kernel with your
 > > latest patch, but I'd guess that doesn't matter - the EHCI driver
 > > should not be playing with the PCI settings after initialisation...
 > >
 > > I've also opened the machine, and the PCI card is seated properly. I
 > > even removed it and tried an even older VIA EHCI controller and one
 > > of the first USB 2.0 controllers by NEC - no luck, the VIA one had
 > > trouble recognizing devices, the NEC one did not recognize a single
 > > one I plugged in.
 > >
 >
 > This also is rather strange. Have you ever used any other type of
 > card in the slot, f.e. an NIC, so you can rule out it's broken
 > somehow?

 Some four or five years ago, the slot held a quad Fast Ethernet NIC, and
 that seemed to work fine... But a lot can happen in that time, so I
 ordered a new USB controller to test with, just in case...

 > How does using the on-board USB controller work out?

 As far as I know, the on-board controller is USB 1.1, so I have not really
 tried it because it's going to be a no-go option for disks (I'd get
 similar speed getting data from some server here at CERN over my DSL
 connection, and I probably wouldn't even have to administer the server
 myself - if I could get them to host my data ;). I can give the on-board
 USB 1.1 controller a try, though...

 I noticed something else when reconnecting everything to the server: The
 USB ground seems to have quite a high (voltage) potential with respect to
 the chassis of the server (and the protective ground of the wall outlet),
 about 80 volts. I've tried to locate a single faulty power supply among
 the hard disks (since the server chassis is at ground level), but when
 tested individually, none of them shows this behaviour. It only happens
 when I connect all eight USB disks to the USB hub which in turn connects
 to the server. Apparently, this is some collective effect. Obviously, when
 the USB cable from the hub is plugged into the server, this potential
 difference is no longer there, and the disks are recognised.

 I'm not sure what this observation means (except that I'd really prefer
 linear over switched-mode power supplies because of the galvanic
 separation between primary and secondary sides), but I thought I'd
 mention it anyway.

 Manuel

 > Marius
 >
 >

 -- 
 Homepage: http://www.hinterbergen.de/mala
 OpenPGP: 0xA330353E (DSA) or 0xD87D188C (RSA)
