LOR No 9 and strange other kernel messages

Mon Jun 7 09:17:32 GMT 2004

On Sun, 6 Jun 2004, Don Lewis wrote:
> On  7 Jun, Tai-hwa Liang wrote:
> > On Sun, 6 Jun 2004, Don Lewis wrote:
> > [...]
> >> >> Maybe your sound client software (or something else on the system) is
> >> >> querying for the number of vchans.  Even "sysctl -a" will invoke the
> >> >> handler.
> >> >
> >> > I'm using the stock mpg123 0.59r to play MP3 files. Since I've never read
> >> > the source before, I'm not sure whether it utilised vchan related sysctls
> >> > or not.
> >>
> >> The code that calls pcm_inprog() and prints the "x: 2" debug message
> >> appears to be an attempt at implementing a reader/writer lock.  I'm
> >> pretty sure the failure that triggers the debug message is harmless,
> >> other than causing the sysctl() call to return EWOULDBLOCK.
> >
> > I'm glad to know that the message is harmless. However, the "x: %d" is
> > a little too obscured to endusers IMHO. Shouldn't that be protected by
> > #ifdef DIAGNOSTIC or something like PCM_DEBUG?
>
> This diagnotic message should probably just go away.  Also the locking
> should be fixed to avoid the EWOULDBLOCK error.

Should I file a PR or just forward the thread to the pcm maintainer?
(to be honest, I have no idea about who he/she is, luigi@ / cg@ / ?)

>
> >> The most likely trigger for this event is if the client software is
> >> writing data to sound device and is blocked in the write() syscall when
> >> another thread calls sysctl().  You could find out which process is
> >> calling sysctl() by printing curproc->p_pid in place of or in addition
> >> to x.
> >
> > Ah, that explained everything. I've added the diagnostic code in
> > sys/dev/sound/pcm/sound.c, and can 100% trigger this while mpg123 is
> > playing and having a "sysctl -a" running in another xterm at the same time.
> >
> > x: 2, current pid: 39234 (sysctl)
> > x: 2, current pid: 39234 (sysctl)
> >
> > It also looks like gcc will use sysctl() to retrieve physical memory size
> > before compilation. I guess the "x: 2" I ran into was just a result of
> > running gcc too frequently(make buildworld buildkernel).
>
> Gcc only looks up hw.physmem and hw.usermem, so it shouldn't trigger
> this message.

Weird.... now it's fairly difficult to reproduce that without "sysctl -a"
(even buildworld/build mozilla/mpg123 at the same time).

>
[...]
> >> > Whoops! My first trial on booting without ACPI -- kernel panic in
> >> > usb_get_next_event(). Apparently I have to disconnect the USB mouse
> >> > prior to kernel booting. See final section for the backtrace.
> >>
> >> Sounds like bug #3.
> >
> > Any idea about how bug #3 can happen? I've tried to comment out the
> > #ifdef DIAGNOSTIC code around sys/dev/usb/usb.c:752 to see if the extra
> > checking for NULL ue helps. Unfortunately that didn't work for me: the
> > kernel always panic at usb_get_next_event+0x5e after starting usbd.
>
> I'd recommend posting a separate message about this panic.  That way you
> are more likely to catch the attention of a USB expert.

Understood. I just finished my posting. Hope that any USB guru would be
interested in it....

> >> > And, yes, the em0 didn't hang after booting to the non-ACPI environment.
> >> > However, there're still something weird happening at that moment:
> >> >
> >> > 	1. There're still a couple of "x: 2" dumped on console. Test case:
> >> > 	"make buildworld buildkernel -DNOCLEAN" in one xterm, mpg123 in
> >> > 	another xterm, and providing file downloading to another Windoze
> >> > 	box.
> >> >
> >> > 	2. While the remote box was downloading(Intel eepro 100/VE) files,
> >> > 	the number of interrupt on em0 was dramatically reduced to 300 ~ 500+
> >> > 	per second. It was 3000+ when ACPI was enabled(no device polling in
> >> > 	kernel).
> >> >
> >> > 	3. The system average load is surprising low(I'm using SCHED_4BSD),
> >> > 	about 0.06 ~ 0.30; however, the download was awfully slooowwwww.
> >> > 	It took my brother's Windoze XP about 25 minutes to complete the
> >> > 	whole downloading(3 ISO files, about 2GB in total). I'm pretty sure
> >> > 	that the mediaopt on em0 was 100baseTX + full-duplex.
> >>
> >> I wonder if the em0 interrupt is not getting enabled or is getting
> >> misrouted and the em0 interrupt service routing is getting triggered by
> >> another interrupt source that is happening at a lower rate.  Print out
> >> the irq mapping with ACPI disabled to see what em0 is sharing its irq
> >> with.  This is probably something that jhb and njl will need to look at.
> >
> > Oops, it looks like I accidentally enabled the DEVICE_POLLING in kernel
> > configuration file and forgot to turn kern.polling.enable on.
> >
> > After disabling DEVICE_POLLING and booting w/o ACPI, I can get rid of
> > the aforementioned em0 sluggish problem. The interrupt rate on em0 is
> > now 5700 ~ 7200 per second while downloading a couple of files from this
> > laptop, and the average speed is 10+ MBytes.
> >
> > Hmm... it seems that I have to disable USB mouse and ACPI to get em0
> > back to work at this moment.
> >
> > I'm not quite sure about where the "IRQ mapping" you're talking about to
> > retrieve from; therefore, I assume that came from dmesg:
> >
> > Found $PIR table, 15 entries at 0xc00fdea0
> > PCI-Only Interrupts: none

[...]

> > For the complete dmesg, please consult:
> >
> > 	http://www.mmlab.cse.yzu.edu.tw/~avatar/dmesg-noacpi.txt
> > 	http://www.mmlab.cse.yzu.edu.tw/~avatar/dmesg-acpi.txt
>
> Wierd ... in the ACPI case, em0 gets attached, gets detached, and then
> gets attached again.

[...]

> em0:  Speed:N/A  Duplex:N/A
> em0: Link is up 100 Mbps Full Duplex
>
> It looks like em0 shares irq 4 with the exact same set of devices in
> both cases.  It does seem strange to me that it is sharing irq 4 with
> sio0.  Typically the sio ports get exclusive use of irqs.

Perhaps this is related to the new interrupt routing algorithm that jhb@
is working on. Without backing sys/dev/acpica/acpi_pci_link.c to 1.14,
my T40 locked up after probing sio0. I'll try to tweak the IRQ settings
in BIOS to see whether it would help or not.

> The whole part of the dmesg starting with "em0: detached" and "pci0:
> driver added" is something that I haven't seen before, but I'm not
> running bleeding edge -CURRENT at the moment.

The detach/re-attach message was came from another test: "kldunload if_em"
and kldload it again to see whether it helps to the interrupt lockup
symptom or not. Sorry for the inconvenience, you can just simply ignore them.

>
> I'd recommend starting a separate thread to discuss your em0 problem.

I did it on May but haven't receive any reply, yet:

  http://lists.freebsd.org/pipermail/freebsd-current/2004-May/027225.html

Anyway, I'll start another thread later. Hope I get better luck this time....

Thanks a lot for your help, Mr. Lewis!