Kernel Support for System Call Performance Monitoring

Yaoping Ruan yruan at cs.princeton.edu
Mon Jun 23 08:55:25 PDT 2003


Hi,
Thank you so much for your constructive notes. We feel confident that some
parts can be merged into the mainline code; other parts may require further
discussion:

Bosko Milekic wrote:

>     1) User-visible DeBoxInfo structure has the magic number "5"
>        PerSleepInfo structs and the magic number "200" CallTrace
>        structs.  It seems that it would be somewhat less crude to turn
>        the struct arrays in DeBoxInfo into pointers in which case you
>        have several options.  You could provide a library to link
>        applications compiled for DeBox use with that would take care of
>        allocating the space in which to store maxSleeps and
>        maxTrace-worth of memory and hooking the data into resultBuf or
>        providing the addresses as separate arguments to the
>        DeBoxControl() system call.  For what concerns the kernel, you
>        could take a similar approach and dynamically pre-allocate the
>        PerSleepInfo and CallTrace structures, based on the requirements
>        given by the DeBoxControl system call.

This would be a better solution. We admit that the magic numbers we chose
were entirely for experimental purposes, and we agree that a better approach
should be taken if DeBox is to be adopted.
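To make the idea concrete, here is a userland sketch of the library side of
that approach. The struct fields are simplified stand-ins (the real
DeBoxInfo carries more state), and debox_alloc()/debox_free() are
hypothetical helpers, not part of our actual code:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the DeBox structures; the real kernel
 * definitions carry more fields than shown here. */
typedef struct { int blockedTime; } PerSleepInfo;
typedef struct { int funcAddr; }    CallTrace;

typedef struct {
    int           maxSleeps;     /* capacity of perSleepInfo[] */
    int           maxTrace;      /* capacity of callTrace[] */
    PerSleepInfo *perSleepInfo;  /* pointers instead of fixed "5" array */
    CallTrace    *callTrace;     /* pointers instead of fixed "200" array */
} DeBoxInfo;

/* Hypothetical library helper: allocates the variable-size arrays on
 * behalf of the application so that no magic constants are compiled in.
 * The kernel side of DeBoxControl() would pre-allocate matching space. */
DeBoxInfo *debox_alloc(int maxSleeps, int maxTrace)
{
    DeBoxInfo *info = malloc(sizeof(*info));
    if (info == NULL)
        return NULL;
    info->maxSleeps    = maxSleeps;
    info->maxTrace     = maxTrace;
    info->perSleepInfo = calloc(maxSleeps, sizeof(PerSleepInfo));
    info->callTrace    = calloc(maxTrace,  sizeof(CallTrace));
    if (info->perSleepInfo == NULL || info->callTrace == NULL) {
        free(info->perSleepInfo);
        free(info->callTrace);
        free(info);
        return NULL;
    }
    return info;  /* caller passes this as resultBuf to DeBoxControl() */
}

void debox_free(DeBoxInfo *info)
{
    if (info == NULL)
        return;
    free(info->perSleepInfo);
    free(info->callTrace);
    free(info);
}
```

The sizes then become per-application choices rather than kernel compile-time
constants.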

>     2) The problem of modifying entry-exit paths in function calls.
>        Admittedly, this is hard, but crudely modifying a select number
>        of functions to Do The Right Thing for what concerns call tracing
>        is hard to justify from a general perspective.  I don't mean to
>        spread FUD here; the change you made is totally OK from a
>        measurement perspective and serves great for the paper, it's just
>        tougher to integrate this stuff into the mainline code.

You are right about the problems of manual modification. We opted for manual
modification only as a short-term solution while we investigate other
approaches.
We started by trying to modify mcount, but did not succeed in controlling
it, namely in making it profile only the functions of interest. We then
switched to gcc's entry/exit hooks enabled by the "instrument functions"
option, but encountered unbearable overhead. Moreover, the common problem
with both approaches is avoiding the bottom-half invocations within a system
call, because these interrupt-handling functions do not belong to any system
call path. Automating this modification process might be possible with some
compiler assistance, or perhaps with mcount, but we have not yet found the
right way.

>   - On the Case Study.  I was most interested in the sendfile
>     modifications you talk about and would be interested in seeing
>     patches.  I know that some of the modifications you mention have
>     already been done in 5.x; Notably, if you have not already, you'll
>     want to glance at:
>
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/uipc_syscalls.c? \
>     rev=1.144&content-type=text/x-cvsweb-markup
>
>     (regarding your mapping caching in sf_bufs)
>
>     and this [gigantic] thread:
>
> http://www.freebsd.org/cgi/getmsg.cgi?fetch=12432+15802+ \
>     /usr/local/www/db/text/2003/freebsd-arch/20030601.freebsd-arch
>
>     (subject: sendfile(2) SF_NOPUSH flag proposal on freebsd-arch@, at
>      least).
>
>    You may want to contact Igor Sysoev or other concerned parties in
>    that thread to show them that you actually have performance results
>    resulting from such a change.

In terms of the sendfile optimization, we started working on it back in last
October, and were aware that some of these issues were discussed on this
list later. We also went a few steps further, specifically:

1. Cache the mapping between VM pages and the physical map, and never free
these cached mappings until the number of cache entries reaches "nsfbufs".
   This aggressive caching does consume more wired memory, but according to
our measurements the reduction in mapping/releasing overhead and address
space consumption outweighs the drawback. It may be necessary to free these
pages on some timer if they are no longer being used.
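A userland sketch of the caching idea (assumption: the real sf_buf code maps
VM pages to kernel virtual addresses via pmap calls; all names and the
fixed-size table below are illustrative, not our kernel patch):

```c
#include <stddef.h>

#define NSFBUFS 4   /* stand-in for the kernel's nsfbufs tunable */

struct sf_entry {
    void *page;     /* identity of the VM page */
    void *kva;      /* cached kernel virtual address for that page */
};

static struct sf_entry cache[NSFBUFS];
static int ncached;
static int nmap_calls;   /* counts the expensive "real" mapping ops */

/* Hypothetical expensive step (pmap_qenter() in the kernel). */
static void *map_page(void *page)
{
    nmap_calls++;
    return page;    /* identity mapping stands in for a fresh KVA */
}

/* Look up a page's mapping; entries are never released until the
 * cache holds NSFBUFS entries, trading wired memory for fewer
 * map/unmap operations on repeated transfers of hot files. */
void *sf_buf_alloc(void *page)
{
    for (int i = 0; i < ncached; i++)
        if (cache[i].page == page)
            return cache[i].kva;   /* hit: no mapping work at all */
    void *kva = map_page(page);
    if (ncached < NSFBUFS) {
        cache[ncached].page = page;
        cache[ncached].kva  = kva;
        ncached++;
    }
    return kva;
}
```

Repeated sends of the same pages then skip the mapping cost entirely, which
is where the measured win comes from.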

2. We made a variant of sendfile that avoids disk I/O by returning an error
if the file is set non-blocking and its data is not in memory.
   This optimization is very powerful for applications such as event-driven
servers, where any blocking I/O seriously hurts performance. We used to
maintain an mmap cache and call mincore() to avoid any disk I/O requests;
that machinery is no longer needed, which saves a lot of overhead. With this
change sendfile can be non-blocking on both the socket write and the disk
read if desired, but by default it keeps the traditional semantics.

3. Pack the header and tail into the body packets using mbuf cluster space.
    The current implementation, which issues a separate writev for the
header, causes more packets to be generated and really hurts the performance
of small transfers. The consequence is more of an issue for fast services on
WANs because of the needless latency.
    Compared to writev, sendfile used to show a performance loss on a
portion of our workload: the loss on small files was larger than the gain on
large files, leading to a net loss for our web server. Though it is possible
to use writev for small files while leaving large files to sendfile, as
Terry Lambert pointed out in the discussion at
http://www.freebsd.org/cgi/getmsg.cgi?fetch=24340+0+/usr/local/www/db/text/2003/freebsd-arch/20030601.freebsd-arch
this makes applications too complicated. We found that building the mbuf
chain and passing it down was a significant benefit, and is more
straightforward than the TCP_NOPUSH option proposed by Igor Sysoev.
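The packing step can be sketched in userland as follows (assumption: the
kernel version copies into real mbuf clusters; pack_header_and_body is an
illustrative name, and only the leading cluster is shown):

```c
#include <string.h>

#define MCLBYTES 2048   /* mbuf cluster size on FreeBSD */

/* Copy the header into the front of the same cluster that will hold
 * the first file bytes, so a small response goes out as one packet
 * instead of a header packet plus a body packet.  Returns the number
 * of bytes placed in `cluster` (header plus as much body as fits). */
size_t pack_header_and_body(char *cluster,
                            const char *hdr, size_t hdrlen,
                            const char *body, size_t bodylen)
{
    size_t room, take;

    if (hdrlen > MCLBYTES)
        return 0;                 /* header alone must fit */
    memcpy(cluster, hdr, hdrlen);
    room = MCLBYTES - hdrlen;
    take = bodylen < room ? bodylen : room;
    memcpy(cluster + hdrlen, body, take);
    return hdrlen + take;
}
```

For a typical small HTTP response the header and the whole file share one
cluster, which is exactly the case where the old writev-then-sendfile split
cost us the most.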

Regards

- Yaoping


More information about the freebsd-hackers mailing list