LOR No 9 and strange other kernel messages

Sun Jun 6 20:12:42 GMT 2004

On  6 Jun, Tai-hwa Liang wrote:
> On Sat, 5 Jun 2004, Don Lewis wrote:
>> On  5 Jun, Tai-hwa Liang wrote:
>> > On Fri, 4 Jun 2004, Don Lewis wrote:
>> >> On  4 Jun, Tai-hwa Liang wrote:
>> >> > On Fri, 4 Jun 2004, Daniel Lang wrote:
>> >> >> Hi,
>> >> >>
>> >> >> I just went through my syslog and stumbled across a LOR, which is
>> >> >> documented on the Zabbadoz LOR page (id 009).
>> >> >>
>> >> >> The reason for this mail is the following kernel messages:
>> >> >>
>> >> >> [..]
>> >> >> Jun  1 11:08:06 atrbg11 kernel: x: 2
>> >> >> Jun  1 11:08:06 atrbg11 kernel: x: 2
>> >> >> [..]
>> >> >>
>> >> >> Seems some leftovers from a developer who did not want to
>> >> >> bother others with the gory details. ;-))
>> >> >
>> >> > I guess that came from sys/dev/sound/pcm/sound.c:872. My Thinkpad T40 also
>> >> > ran into this when the disk/network(combination) load is high enough.
>> >>
>> >> Looks likely, but I'd expect this would only get triggered when sysctl
>> >> is used to set or query the number of vchans.  In any case, this marks
>> >> another spot where locking is broken in the sound code.
>> >
>> > I use the default setting without tweaking any vchan related sysctl
>> > such like hw.snd.pcm0.vchans(leave it to default 0) or hw.snd.maxautovchans
>> > (also left to 0 on my T40).
>>
>> Maybe your sound client software (or something else on the system) is
>> querying for the number of vchans.  Even "sysctl -a" will invoke the
>> handler.
> 
> I'm using the stock mpg123 0.59r to play MP3 files. Since I've never read
> the source before, I'm not sure whether it utilised vchan related sysctls
> or not.

The code that calls pcm_inprog() and prints the "x: 2" debug message
appears to be an attempt at implementing a reader/writer lock.  I'm
pretty sure the failure that triggers the debug message is harmless,
other than causing the sysctl() call to return EWOULDBLOCK.

The most likely trigger for this event is if the client software is
writing data to sound device and is blocked in the write() syscall when
another thread calls sysctl().  You could find out which process is
calling sysctl() by printing curproc->p_pid in place of or in addition
to x.

>> > I'm wondering about whether this "bug" is ACPI related(interrupt storm?)
>> > since recently when I run "make buildworld buildkernel" on one terminal and
>> > do large file transmission(FTP ISO images on using em0) at the same time,
>> > the sound tends to become "broken." Meanwhile, the "x: 2" message popped
>> > on my console and later on my em0 doesn't work anymore(I have to do a reboot
>> > to get it back; however, the other part of the system still works while
>> > em0 was dead: sound continue playing, can compile/edit programs).
>>
>> It is quite possible that locking collisions in the sound code might
>> happen more frequently when the system is busy.
>>
>> Do your sound hardware and em0 share the same irq?  If they do, that
> 
> Nope.
>
> interrupt                          total       rate
> irq0: clk                         356127         99
> irq1: atkbd0                        6006          1
> irq4: cbb0 em0++                 1378364        386
> irq6: cbb1 pcm0                   110211         30
> irq7: ppc0                             3          0
> irq8: rtc                         455852        127
> irq12: psm0                        16785          4
> irq13: npx0                            1          0
> irq14: ata0                       322853         90
> irq15: ata1                           55          0
> Total                            2646257        742

In that case, the em0 problem is likely to be something else.  Does em0
lock up even if you don't use the sound hardware?

>> might be a clue that the problems are related.  Can you boot with ACPI
>> disabled, and if so, does it change the symptoms?
> 
> Whoops! My first trial on booting without ACPI -- kernel panic in
> usb_get_next_event(). Apparently I have to disconnect the USB mouse
> prior to kernel booting. See final section for the backtrace.

Sounds like bug #3.

> And, yes, the em0 didn't hang after booting to the non-ACPI environment.
> However, there're still something weird happening at that moment:
> 
> 	1. There're still a couple of "x: 2" dumped on console. Test case:
> 	"make buildworld buildkernel -DNOCLEAN" in one xterm, mpg123 in
> 	another xterm, and providing file downloading to another Windoze
> 	box.
> 
> 	2. While the remote box was downloading(Intel eepro 100/VE) files,
> 	the number of interrupt on em0 was dramatically reduced to 300 ~ 500+
> 	per second. It was 3000+ when ACPI was enabled(no device polling in
> 	kernel).
> 
> 	3. The system average load is surprising low(I'm using SCHED_4BSD),
> 	about 0.06 ~ 0.30; however, the download was awfully slooowwwww.
> 	It took my brother's Windoze XP about 25 minutes to complete the
> 	whole downloading(3 ISO files, about 2GB in total). I'm pretty sure
> 	that the mediaopt on em0 was 100baseTX + full-duplex.

I wonder if the em0 interrupt is not getting enabled or is getting
misrouted and the em0 interrupt service routing is getting triggered by
another interrupt source that is happening at a lower rate.  Print out
the irq mapping with ACPI disabled to see what em0 is sharing its irq
with.  This is probably something that jhb and njl will need to look at.