serial console oddity
Don Lewis
truckman at FreeBSD.org
Sun Nov 9 02:30:25 PST 2003
On 9 Nov, Bruce Evans wrote:
> On Sat, 8 Nov 2003, Don Lewis wrote:
>
>> I've been seeing some wierd things for many months when using a serial
>> console on my -CURRENT box. I finally had a chance to take a closer
>> look today.
>>
>> It looks like the problem is some sort of interference between kernel
>> output to the console and userland writes to /dev/console. I typically
>> see syslogd output to the console get corrupted. Each message that
>> syslogd writes seems to get truncated or otherwise corrupted. The most
>> common thing I see is that each syslog message is reduced to a space and
>> the first character of the month, or sometimes just a space, or
>> sometimes nothing at all.
>
> This is (at least primarily) a longstanding bug in ttymsg(). It uses
> nonblocking mode so that it doesn't block in write() or close(). For
> the same reason, it doesn't wait for output to drain before close().
> If the close happens to be the last one on the device, this causes any
> data buffered in the tty and lower software layers to be discarded
> cleanly and any data in lower hardware layers to by discarded in a
> driver plus hardware-dependent way (usually not so cleanly, especially
> for the character being transmitted).
I didn't think of a flush on close problem because I thought syslogd
always kept the console open.
>> This is totally consistent until I "kill
>> -HUP" syslogd, which I believe causes syslogd to close and open
>> /dev/console, after which the syslog output appears correct on the
>> console. When the syslogd output is being corrupted, I can cat a file to
>> /dev/console and the output appears to be correct.
>
> When I debugged this, syslogd didn't seem to keep the console open,
> so the open()/close() in ttymsg() always caused the problem. I didn't
> notice killing syslogd makes a difference. Perhaps it helps due to a
> missing close. Holding the console open may be a workaround or even
> the correct fix. It's not clear where this should be done (should all
> clients of ttymsg() do it?). Running getty on the console or on the
> underlying tty device should do it accidentally.
It looks to me like syslogd keeps the console open in addition to the
open()/close() in ttymsg(). cfline() calls open() on anything that
begins with '/' and calls isatty() to figure out whether it should set
the type to F_CONSOLE, F_TTY, or F_FILE, and init() closes the file
descriptor for all of these when syslogd is HUPed.
I wonder if the console descriptor is getting revoked ...
>> I truss'ed syslogd, and it appears to be working normally, the writev()
>> call that writes the data to the console appears to be writing the
>> correct character count, so it would appear that the fault is in the
>> kernel.
>
> If there are any kernel bugs in this area, then they would be that
> last close of the console affects the underlying tty. The multiple
> console changes are quite likely to have broken this if getty is run
> on the underlying tty (they silently discarded the half-close of the
> underlying tty which was needed to avoided trashing some of its state
> when only the console is closed).
I'm not running getty on my serial console. It is running on ttyv*. I'm
only using the serial console to capture kernel stack traces, etc.
>> The problem doesn't appear to be specific to syslogd, because I have
>> seen the output from the shutdown scripts that goes to the console get
>> truncated as well.
>
> Yes, in theory it should affect anything that uses ttymsg() or does
> direct non-blocking writes without waiting for the output to drain.
> Here are some half-baked fixes. The part that clears O_NONBLOCK is
> wrong, and the usleep() part is obviously a hack. ttymsg() shouldn't
> block even in close(), since if the close is in the parent ttymsg()
> might block forever and if the close() is in a forked child then
> blocking could create zillions of blocked children.
>
> Another part of the patch is concerned with limiting forked children.
> If I were happy with that part then blocking would not be so bad. In
> practice, I don't have enough system activity for blocked children to
> be a problem. To see the problem with blocked children, do something
> like the following:
> - turn off clocal on the console so that the console can block better.
> For sio consoles this often requires turning it off in the lock-state
> device, since the driver defends against this foot shooting by locking
> it on.
> - hold the console open or otherwise avoid the original bug in this
> thread, else messages will just be discarded in close() faster than
> they can pile up.
> - turn off your external console device or otherwise drop carrier.
> - send lots of messages.
I ran into this years ago when I was using actual terminals for console
devices and someone typed a ^S and went away. The failure mode wasn't
pretty. It was probably in the days when I was managing Masscomp boxes,
but it might have been the first batch of Sun servers.
>
> For a non-half-baked fix, do somethng like:
> - never block in ttymsg(), but always wait for output to drain using
> tcdrain() in a single child process. It's probably acceptable for
> this to not report errors to ttymsg()'s caller.
> - limit children better. I think we now fork children iff writev()
> returns EWOULDBLOCK and this happens mainly when the tty buffers
> fill up due to clocal being off and the external console not
> listening. Handling this right seems to require handing off the
> messages to a single child process that can buffer the messages
> in userland and can block writing and draining them. Blocked write()s
> and tcdrain()s are easy enough to handle in a specialized process by
> sending signals to abort them.
I like this idea better. The userland process could be smart to drop
entire lines of output if things started getting too backlogged.
More information about the freebsd-current
mailing list