head behaviour

Mon Jun 7 00:06:09 UTC 2010

On Mon, 07 Jun 2010 00:13:28 +0200 Dag-Erling Smørgrav <des at des.no> wrote:
> 
> The reason why head(1) doesn't work as expected is that it uses buffered
> I/O with a fairly large buffer, so it consumes more than it needs.  The
> only way to make it behave as the OP expected is to use unbuffered I/O
> and never read more bytes than the number of lines left, since the worst
> case is input consisting entirely of empty lines.  We could add an
> option to do just that, but the same effect can be achieved more
> portably with read(1) loops:

Except read doesn't do it quite right:

$ ps | (read a; echo $a ; grep zsh)
PID  TT  STAT      TIME COMMAND
 1196  p0  Is     0:02.23 -zsh (zsh)
 1209  p1  Is     0:00.35 -zsh (zsh)

The column titles are misaligned because read strips the leading
whitespace. Using egrep we get the right alignment, but the egrep
process itself also shows up.

$ ps | egrep 'TIME|zsh'
  PID  TT  STAT      TIME COMMAND
 1196  p0  Is     0:02.23 -zsh (zsh)
 1209  p1  Is     0:00.35 -zsh (zsh)
71945  p2  DL+    0:00.01 egrep TIME|zsh

A small point but it is not trivial to get it exactly right.
head -n directly expresses what one wants.
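
For reference, a minimal sketch of a read-based version that keeps the
header alignment, assuming a POSIX shell. The helper name
header_then_grep is made up for illustration. Clearing IFS stops read
from stripping the leading blanks, and -r keeps backslashes literal.

```shell
# Hypothetical helper: print the first line of stdin untouched, then
# pass the remaining lines through grep with the given pattern.
header_then_grep() {
    # IFS= preserves leading/trailing blanks; -r disables backslash escapes.
    IFS= read -r header && printf '%s\n' "$header"
    # grep sees only what read left behind, i.e. everything after line 1.
    grep -- "$1"
}

# Usage (output depends on the running processes):
#   ps | header_then_grep zsh
```

It works, but it is a good deal less direct than head -n 1.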

But there is a deeper point.

Several people pointed out alternatives for the examples
given but in general you can't use a single command to
replace a sequence of commands where each operates on part of
the shared input in a different way.

The reason we can't do this is buffering for efficiency.
Usually there is no further use for the buffered but
unconsumed input and it can safely be thrown away.  So this is
almost always the right thing to do, but not when there *is*
further use for the unconsumed input.  Some programs already
do the right thing (dd, for instance, as you pointed out).
Some other commands give you this option in a limited way.
See "man grep" and you will find:

       -m NUM, --max-count=NUM
              Stop reading a file after NUM matching lines.  If the  input  is
              standard  input  from a regular file, and NUM matching lines are
>>>>          output, grep ensures that the standard input  is  positioned  to
>>>>          just  after the last matching line before exiting, regardless of
              the presence of trailing context lines.  This enables a  calling
              process  to resume a search.

So for instance

$ < /usr/share/dict/words (grep -m 1 ''; grep -m 1 '') 
A
a

But pipe the file in and see what you get:

$ cat /usr/share/dict/words | (grep -m 1 ''; grep -m 1 '') 
A
nterectasia

Grep does the right thing for files but not pipes!  Now I do
understand *why* this happens but still, it is annoying.  So
I believe there is value in providing an option to read *as
much as needed* but not more.  It will be slower but will
handle the cases we are discussing.  This will enhance
*composability* -- supposedly part of the unix philosophy.
The slow-but-read-just-as-much-as-needed option is meant to be
used when you need a certain kind of composability and there is
no other way.  And yes, I now do think this is useful not just
for head but also for any other program that quits before reading
to the end!
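
To make "read as much as needed but not more" concrete, here is a rough
sketch in POSIX shell. take_lines is a hypothetical helper, not an
existing command or option. It relies on the shell's read builtin,
which does not consume input past the newline when reading from a pipe,
so the rest of the input stays available to the next command.

```shell
# Hypothetical helper: copy exactly N lines from stdin to stdout and
# leave everything after them unread for whoever reads stdin next.
take_lines() {
    n=$1
    # Stop after N lines or at end of input, whichever comes first.
    while [ "$n" -gt 0 ] && IFS= read -r line; do
        printf '%s\n' "$line"
        n=$((n - 1))
    done
}

# Usage: each call picks up where the previous one stopped, even on a pipe.
#   cat /usr/share/dict/words | (take_lines 1; take_lines 1)
```

Under these assumptions the piped example should print the first two
lines of the file, the same as the redirected one.  It is slow, as
expected, since the shell reads the pipe a little at a time.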

[cc'ed Rob in case he wishes to chime in]