Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

Fri Mar 24 20:38:29 UTC 2006

Matthew Dillon wrote:
>   It is possible that the kernel believes the VM system to be too loaded
>   to issue read-aheads, as a consequence of your blowing out of the system
>   caches.

See the attached snapshot of `systat 1 -vm' -- it stays like that for most 
of the compression run, with only occasional flushes to the amrd0 device 
(the destination for the compressed output).

Bakul Shah followed up:

> May be the OS needs "reclaim-behind" for the sequential case?
> This way you can mmap many many pages and use a much smaller
> pool of physical pages to back them.  The idea is for the VM
> to reclaim pages N-k..N-1 when page N is accessed and allow
> the same process to reuse this page.

Although it may be hard for the kernel to guess which pages it can reclaim 
efficiently in the general case, my call to madvise with MADV_SEQUENTIAL 
should've given it a strong hint.
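
For concreteness, here is a minimal sketch of the kind of hint I mean. It is 
not the code from mzip.c; sum_file_mmap() is a made-up helper that maps a 
whole file, advises the kernel that access will be sequential, and walks it 
once. The advice is only a hint, so a failure of madvise is ignored.

```c
#define _DEFAULT_SOURCE	/* glibc only: expose madvise() under strict -std= */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from mzip.c): map a file read-only, hint that
 * it will be read front to back, and add up its bytes.
 * Returns 0 on success, -1 on error.
 */
int
sum_file_mmap(const char *path, uint64_t *sum)
{
	struct stat st;
	unsigned char *p;
	size_t len, i;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1) {
		close(fd);
		return -1;
	}
	*sum = 0;
	if (st.st_size == 0) {		/* mmap rejects a zero length */
		close(fd);
		return 0;
	}
	len = (size_t)st.st_size;
	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);			/* the mapping outlives the descriptor */
	if (p == MAP_FAILED)
		return -1;

	/* Advisory only: the kernel is free to ignore it. */
	(void)madvise(p, len, MADV_SEQUENTIAL);

	for (i = 0; i < len; i++)
		*sum += p[i];

	munmap(p, len);
	return 0;
}
```

A caller would do something like sum_file_mmap("input.dat", &sum), where 
input.dat is a placeholder name.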

It is for this reason that I very much prefer the mmap API to read/write 
(against Matt's repeated advice) -- there is a way to advise the kernel, 
which there is not with read. Read also requires fairly large buffers in 
user space to be efficient -- *in addition* to the buffers in the kernel. 
Managing such buffers properly makes the program far messier _and_ more 
OS-dependent than using the mmap interface has to be.
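
To show what I mean by the extra buffer, here is a rough sketch of the 
read-based equivalent. Again, sum_file_read() is a made-up helper and not 
taken from any real program; the caller picks the buffer size (32k would 
match what grep does), and every byte is copied from the kernel's buffers 
into this one before the program can look at it.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: add up a file's bytes using read() into a
 * user-space buffer of bufsize bytes.
 * Returns 0 on success, -1 on error.
 */
int
sum_file_read(const char *path, size_t bufsize, uint64_t *sum)
{
	unsigned char *buf;
	ssize_t n, i;
	int fd, rc = 0;

	if (bufsize == 0)
		return -1;
	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -1;
	buf = malloc(bufsize);		/* second copy of the data lives here */
	if (buf == NULL) {
		close(fd);
		return -1;
	}

	*sum = 0;
	for (;;) {
		n = read(fd, buf, bufsize);
		if (n == 0)		/* end of file */
			break;
		if (n == -1) {
			if (errno == EINTR)
				continue;	/* interrupted, just retry */
			rc = -1;
			break;
		}
		for (i = 0; i < n; i++)
			*sum += buf[i];
	}

	free(buf);
	close(fd);
	return rc;
}
```

Choosing bufsize well is exactly the tuning I would rather leave to the 
kernel.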

I totally agree with Matt that FreeBSD's (and probably DragonFly's too) mmap 
interface is better than others', but it seems to me there is plenty of 
room for improvement. Reading via mmap should never be slower than via read 
-- it should be just a notch faster, in fact...

I'm also quite certain that fulfilling my "demands" would add quite a bit of 
complexity to the mmap support in the kernel, but hey, that's what the 
kernel is there for :-)

Unlike grep, which seems to use only 32k buffers anyway (and does not use 
madvise -- see attachment), my program mmaps gigabytes of the input file at 
once, trusting the kernel to do a better job at reading the data in the most 
efficient manner :-)
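
With a mapping that large, the "reclaim-behind" Bakul describes can be 
roughly approximated from user space. The sketch below is only an 
illustration under assumptions: sum_file_release_behind() and its window 
parameter are made up, it is not what mzip.c does, and the exact effect of 
MADV_DONTNEED differs between systems (it is advice, and the kernel may keep 
the pages cached anyway). The data stays readable after the call, since the 
mapping is file-backed and read-only.

```c
#define _DEFAULT_SOURCE	/* glibc only: expose madvise() under strict -std= */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: walk a mapped file one window at a time and tell
 * the kernel that each finished window is no longer needed.
 * Returns 0 on success, -1 on error.
 */
int
sum_file_release_behind(const char *path, size_t window, uint64_t *sum)
{
	struct stat st;
	unsigned char *p;
	size_t len, off, n, i, pagesz;
	long pg;
	int fd;

	pg = sysconf(_SC_PAGESIZE);
	if (pg <= 0)
		return -1;
	pagesz = (size_t)pg;
	if (window < pagesz)
		window = pagesz;
	window -= window % pagesz;	/* keep window starts page-aligned */

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1) {
		close(fd);
		return -1;
	}
	*sum = 0;
	if (st.st_size == 0) {		/* mmap rejects a zero length */
		close(fd);
		return 0;
	}
	len = (size_t)st.st_size;
	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (p == MAP_FAILED)
		return -1;
	(void)madvise(p, len, MADV_SEQUENTIAL);

	for (off = 0; off < len; off += n) {
		n = len - off < window ? len - off : window;
		for (i = 0; i < n; i++)
			*sum += p[off + i];
		/* Done with this window; failure of the hint is harmless. */
		(void)madvise(p + off, n, MADV_DONTNEED);
	}

	munmap(p, len);
	return 0;
}
```

Ideally none of this would be needed, and MADV_SEQUENTIAL alone would lead 
the kernel to do the equivalent on its own.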

Peter Jeremy wrote:
> On an amd64 system running about 6-week old -stable, both ['grep' and 'grep 
> --mmap' -mi] behave pretty much identically.

Peter, I read grep's source -- it is not using madvise (because it hurts 
performance on SunOS-4.1!) and reads in chunks of 32k anyway. Would you care 
to look at my program instead? Thanks:

	http://aldan.algebra.com/mzip.c

(link with -lz and -lbz2).

Matthew Dillon wrote:
[...]
>    If the times for the mmap case do not blow up, we are back to square
>    one and I would start investigating the disk driver that Mikhail is
>    using.

On the machine where both mzip and the disk run at only 50%, the disk is a 
plain SATA drive (mzip's state goes from "RUN" to "vnread" and back).

Thanks, everyone!

	-mi
-------------- next part --------------
A non-text attachment was scrubbed...
Name: grep.diff
Type: text/x-diff
Size: 967 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20060324/9d5e3e15/grep.bin
-------------- next part --------------
   18 users    Load  0.46  0.53  0.60                  24 бер 15:15

Mem:KB    REAL            VIRTUAL                     VN PAGER  SWAP PAGER
        Tot   Share      Tot    Share    Free         in  out     in  out
Act 1833864    5880 27758552    45268   92216 count  240
All 1881188    5992 1432466k    52864         pages 3413
                                                                 Interrupts
Proc:r  p  d  s  w    Csw  Trp  Sys  Int  Sof  Flt        cow    2252 total
     1     2101      1605 2025  197  422    2 2018 251432 wire        irq1: atkb
                                                   506156 act         irq6: fdc0
 3.0%Sys   0.0%Intr 45.2%User  0.0%Nice 51.9%Idl  1038216 inact       irq15: ata
|    |    |    |    |    |    |    |    |    |      89252 cache       irq17: fwo
=>>>>>>>>>>>>>>>>>>>>>>>                             2964 free        irq20: nve
                                                          daefr       irq21: ohc
Namei         Name-cache    Dir-cache                     prcfr   241 irq22: ehc
    Calls     hits    %     hits    %                 951 react    11 irq25: em0
                                                          pdwak       irq29: amr
                                      618 zfod            pdpgs  2000 cpu0: time
Disks   ad4 amrd0                         ofod            intrn
KB/t  56.79  0.00                         %slo-z   200816 buf
tps     241     0                    5143 tfree         8 dirtybuf
MB/s  13.38  0.00                                  100000 desiredvnodes
% busy   47     0                                   34717 numvnodes
                                                    24991 freevnodes