[Bug 205398] [regression] [tty] tty_drain() kernel function lacks timeout support it had before

Fri Dec 18 16:58:07 UTC 2015

18.12.2015 23:05, Bruce Evans wrote:
> On Fri, 18 Dec 2015 a bug system that doesn't want replies wrote:
>
>> Revision 181905 by ed at freebsd.org brought the new MPSAFE TTY layer and removed
>> "drainwait" timeout support. Now applications working with a serial port can hang
>> forever in the close() system call:

> It brought many other bugs.  About 20 more related to draining.
>
> Some of the other bugs accidentally ameliorate this one.  The tty layer
> never waits long enough for the last few characters to drain (though
> I finished fixing this for sio in 1996).  So it takes a large buffer
> to possibly give an endless wait.  Flow control must be on for the
> wait to be endless.  Flow control is also broken...
>
> There is a hack for last-close that is supposed to give a hard-coded timeout
> of 1 second.  Not sure why this doesn't work for you.  My quick fix that
> restores the timeout uses slightly different logic where this hack was.

I made a mistake (now corrected) while filing the PR: my system is 9.3-STABLE,
not 10.2-STABLE. It has no "leaving" case hack.

> The timeout is also a hack (breaks POSIX conformance), but at least the
> user can control it and it doesn't default to a too small value.  The
> old default of 300 seconds was a bit too large, but I kept it.  My systems
> have always changed this to 180 seconds in /etc/rc.d.  I set it to 1
> second per-device only transiently.
>
>> - an application opens /dev/cuau0 in non-blocking I/O mode and tries to detect
>> a GSM gateway there by writing commands such as ATZ, ATE1 etc. to the device;
>> - the device may be dead (lost power, broken, disconnected etc.) and does not
>> answer back;
>
> Old versions also had a hack by me that breaks waiting in last-close if
> the device is in non-blocking mode.
>
> If the device is really disconnected, then the tty should be in a zombie
> state and should not wait.  I think this still works.  CLOCAL or lack of
> modem signals may break detection of last-close.

The device does not get disconnected along the way; it was never connected,
from the moment of open() onward.

> Did you have CRTSCTS flow control enabled?  This is probably the main
> source of hangs.  The RTS and CTS signals are not ignored in CLOCAL mode,
> so flow control should be invoked when they go away as the device goes
> away.

It has both CRTSCTS flow control and CLOCAL enabled, and
I'd like to keep both of them enabled and working.

>> - the application times out waiting for an answer and closes the device with close();
>> - the tty layer tries to drain output "forever", until a signal arrives.
>
> Perhaps the hard-coded 1 second timeout only works for close() in exit().
> So it helps more for sloppy applications that exit without waiting for
> their data to go out.
>
> Applications that do the above are still sloppy.  POSIX specifies waiting
> "forever" again to drain in close().  A non-buggy application would do:
>
>       write();
>       // set up timeout for draining
>       tcdrain();
>       // when timeout expires, try to recover
>       // when recovery is impossible, clean up and exit
>       tcflush();        // this is a critical step in the cleanup
>       // set up timeout for closing, just in case there is a kernel bug
>       close();        // now it can't block unless there was a kernel bug
>
>> gnokii (comms/gnokii) suffers from this problem.
>>
>> Please re-implement the tunable timeout and the TIOCSDRAINWAIT ioctl the
>> kernel had before.
>
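
For reference, on a kernel that provides TIOCSDRAINWAIT an application could
set the per-device drain timeout roughly like this. It is a sketch under the
assumption that the ioctl takes a pointer to an int number of seconds, as it
did in the old tty layer; the wrapper name set_drain_wait() is made up, and
the #ifdef fallback only keeps the example compilable where the ioctl is
missing.

```c
#include <errno.h>
#include <sys/ioctl.h>

/*
 * Hypothetical wrapper: ask the kernel to give up draining fd after
 * `seconds`.  Returns 0 on success, -1 on error.  Where the ioctl is not
 * defined, it fails with ENOTSUP.
 */
int set_drain_wait(int fd, int seconds)
{
#ifdef TIOCSDRAINWAIT
    return ioctl(fd, TIOCSDRAINWAIT, &seconds);
#else
    (void)fd;
    (void)seconds;
    errno = ENOTSUP;
    return -1;
#endif
}
```
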
> This is mostly fixed in my version.  I started to cut out the patches,
> but they were too entwined with other fixes.  Here is the part that
> replaces the hard-coded 1 second timeout:
>
> X diff -c2 ./kern/tty.c~ ./kern/tty.c
> X *** ./kern/tty.c~    Thu Mar 19 18:23:08 2015
> X --- ./kern/tty.c    Sat Aug  8 11:40:23 2015
> X ***************
> X *** 133,155 ****
> X           return (0);
> X
> X !     while (ttyoutq_bytesused(&tp->t_outq) > 0) {
> X           ttydevsw_outwakeup(tp);
> X           /* Could be handled synchronously. */
> X           bytesused = ttyoutq_bytesused(&tp->t_outq);
> X !         if (bytesused == 0)
> X               return (0);
> X
> X           /* Wait for data to be drained. */
> X !         if (leaving) {
> X               revokecnt = tp->t_revokecnt;
> X !             error = tty_timedwait(tp, &tp->t_outwait, hz);
> X               switch (error) {
> X               case ERESTART:
> X                   if (revokecnt != tp->t_revokecnt)
> X                       error = 0;
> X                   break;
> X               case EWOULDBLOCK:
> X !                 if (ttyoutq_bytesused(&tp->t_outq) < bytesused)
> X                       error = 0;
> X                   break;
> X               }
> X --- 196,225 ----
> X           return (0);
> X
> X !     while (ttyoutq_bytesused(&tp->t_outq) != 0 || tp->t_flags & TS_BUSY) {

Strange diff format... Will patch(1) apply this with all those X's?

Thank you for the answer, anyway! I'll try to understand and test the patches next week.

> For a quick fix, try turning off flow control (both hardware and software)
> in last-close.  This should limit the wait.  Only large buffers or small
> speeds take very long to drain if draining is not blocked completely by
> flow control.  I use small speeds to test bugs in this area.  E.g., at 50
> bps, a 4K buffer takes 800 seconds to drain; at 1 bps, it takes 40960
> seconds to drain.  This shows how broken a hard-coded timeout of 1 or
> even 300 seconds is.  Also, how broken an application that doesn't do
> its own draining and error handling is.
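
The drain times quoted above can be checked with a little arithmetic. The
sketch below assumes 10 bits on the wire per character (start bit, 8 data
bits, stop bit), which is what the 40960-second figure implies; the function
name drain_seconds() is made up for the illustration.

```c
/*
 * Seconds needed to shift `bytes` characters out of the buffer at `bps`
 * bits per second, with `bits_per_char` bits on the wire per character.
 */
double drain_seconds(unsigned long bytes, unsigned long bps,
                     unsigned int bits_per_char)
{
    return (double)bytes * bits_per_char / (double)bps;
}
```

With these assumptions drain_seconds(4096, 50, 10) gives about 819 seconds
(the "800" above, rounded) and drain_seconds(4096, 1, 10) gives 40960
seconds.
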

Well, I just use the comms/gnokii port to talk to my GSM gateways over a serial
port, to send SMS with one-time security codes to my customers and, occasionally,
informational SMS. If one GSM gateway fails, I'd like gnokii not to hang, so that
my script can proceed with the backup gateway. In the meantime, I've ported timeout(1)
to 9.3-STABLE, and it kills gnokii if it hangs for too long. But that's ugly.