Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
Mikhail Teterin
mi+mx at aldan.algebra.com
Fri Mar 24 20:38:29 UTC 2006
Matthew Dillon wrote:
> It is possible that the kernel believes the VM system to be too loaded
> to issue read-aheads, as a consequence of your blowing out of the system
> caches.
See the attachment for a snapshot of `systat 1 -vm' -- it stays like that for
most of the compression run, with only occasional flushes to the
amrd0 device (the destination for the compressed output).
Bakul Shah followed up:
> May be the OS needs "reclaim-behind" for the sequential case?
> This way you can mmap many many pages and use a much smaller
> pool of physical pages to back them. The idea is for the VM
> to reclaim pages N-k..N-1 when page N is accessed and allow
> the same process to reuse this page.
Although it may be hard for the kernel to guess which pages it can reclaim
efficiently in the general case, my call to madvise with MADV_SEQUENTIAL
should have given it a strong hint.
It is for this reason that I very much prefer the mmap API to read/write
(against Matt's repeated advice) -- there is a way to advise the kernel,
which there is not with read. Read also requires fairly large buffers in
user space to be efficient -- *in addition* to the buffers in the kernel.
Managing such buffers properly makes the program far messier _and_ more
OS-dependent than using the mmap interface has to be.
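
To make the comparison concrete, this is roughly what the read-based
equivalent looks like. It is a sketch under my own assumptions, not code from
any real program: sum_file_read and its bufsize parameter are made-up names,
and the byte sum again stands in for real work. Note the user-space buffer
that the program has to size, allocate and free itself.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: read `path` through a user-space buffer of
 * `bufsize` bytes and add up all bytes into *sum.
 * Returns 0 on success, -1 on error.
 */
int sum_file_read(const char *path, size_t bufsize, uint64_t *sum)
{
    if (bufsize == 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* The data is copied from the kernel's cache into this buffer. */
    unsigned char *buf = malloc(bufsize);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    *sum = 0;
    ssize_t n;
    while ((n = read(fd, buf, bufsize)) > 0) {
        for (ssize_t i = 0; i < n; i++)
            *sum += buf[i];         /* stand-in for the real processing */
    }
    if (n < 0)
        perror("read");             /* no EINTR retry in this sketch */

    free(buf);
    close(fd);
    return n < 0 ? -1 : 0;
}
```

Picking a good bufsize is exactly the kind of tuning I would rather leave to
the kernel.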
I totally agree with Matt that FreeBSD's (and probably DragonFly's too) mmap
interface is better than others', but it seems to me there is plenty of
room for improvement. Reading via mmap should never be slower than via read
-- it should be just a notch faster, in fact...
I'm also quite certain that fulfilling my "demands" would add quite a bit of
complexity to the mmap support in the kernel, but hey, that's what the kernel
is there for :-)
Unlike grep, which seems to use only 32k buffers anyway (and does not use
madvise -- see attachment), my program mmaps gigabytes of the input file at
once, trusting the kernel to do a better job at reading the data in the most
efficient manner :-)
Peter Jeremy wrote:
> On an amd64 system running about 6-week old -stable, both ['grep' and 'grep
> --mmap' -mi] behave pretty much identically.
Peter, I read grep's source -- it is not using madvise (because it hurts
performance on SunOS-4.1!) and reads in chunks of 32k anyway. Would you care
to look at my program instead? Thanks:
http://aldan.algebra.com/mzip.c
(link with -lz and -lbz2).
Matthew Dillon wrote:
[...]
> If the times for the mmap case do not blow up, we are back to square
> one and I would start investigating the disk driver that Mikhail is
> using.
On the machine where both mzip and the disk run at only 50%, the disk is a
plain SATA drive (mzip's state goes from "RUN" to "vnread" and back).
Thanks, everyone!
-mi
-------------- next part --------------
A non-text attachment was scrubbed...
Name: grep.diff
Type: text/x-diff
Size: 967 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20060324/9d5e3e15/grep.bin
-------------- next part --------------
18 users Load 0.46 0.53 0.60 Mar 24 15:15
Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER
Tot Share Tot Share Free in out in out
Act 1833864 5880 27758552 45268 92216 count 240
All 1881188 5992 1432466k 52864 pages 3413
Interrupts
Proc:r p d s w Csw Trp Sys Int Sof Flt cow 2252 total
1 2101 1605 2025 197 422 2 2018 251432 wire irq1: atkb
506156 act irq6: fdc0
3.0%Sys 0.0%Intr 45.2%User 0.0%Nice 51.9%Idl 1038216 inact irq15: ata
| | | | | | | | | | 89252 cache irq17: fwo
=>>>>>>>>>>>>>>>>>>>>>>> 2964 free irq20: nve
daefr irq21: ohc
Namei Name-cache Dir-cache prcfr 241 irq22: ehc
Calls hits % hits % 951 react 11 irq25: em0
pdwak irq29: amr
618 zfod pdpgs 2000 cpu0: time
Disks ad4 amrd0 ofod intrn
KB/t 56.79 0.00 %slo-z 200816 buf
tps 241 0 5143 tfree 8 dirtybuf
MB/s 13.38 0.00 100000 desiredvnodes
% busy 47 0 34717 numvnodes
24991 freevnodes