[Bug 199587] libc strncmp() performance
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Wed, 27 Nov 2024 13:27:30 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199587
Robert Clausecker <fuz@FreeBSD.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |FIXED
Assignee|bugs@FreeBSD.org |fuz@FreeBSD.org
Status|New |Closed
CC| |fuz@FreeBSD.org
--- Comment #2 from Robert Clausecker <fuz@FreeBSD.org> ---
strncmp has been rewritten in assembly for better performance in FreeBSD 14.1.
Do the performance issues still occur? If yes, please reopen.
Note that your benchmark favours your own implementation: you always use the
same inputs to strncmp(), allowing the branch predictor to predict the control
flow of your very branchy implementation perfectly. This is not reflective of
real-world input where each pair of strings passed to strncmp() is likely
different.
I have designed a more comprehensive benchmark here:
https://github.com/clausecker/strperf
Comparing your function with the SIMD-enhanced implementation I wrote, we get:
os: FreeBSD
arch: amd64
cpu: Intel(R) Core(TM) i7-4910MQ CPU @ 2.90GHz
│ strncmp.simple.out │ strncmp.scalar.out
│ strncmp.baseline.out │
│ sec/op │ sec/op vs base
│ sec/op vs base │
StrncmpShortAligned 144.11µ ± 0% 141.34µ ± 1% -1.92% (p=0.000
n=20) 63.05µ ± 0% -56.25% (p=0.000 n=20)
StrncmpMidAligned 83.93µ ± 0% 64.46µ ± 1% -23.20% (p=0.000
n=20) 21.51µ ± 1% -74.37% (p=0.000 n=20)
StrncmpLongAligned 59.528µ ± 1% 38.266µ ± 0% -35.72% (p=0.000
n=20) 6.154µ ± 1% -89.66% (p=0.000 n=20)
StrncmpShortUnaligned 143.07µ ± 0% 142.14µ ± 1% -0.65% (p=0.003
n=20) 69.63µ ± 0% -51.33% (p=0.000 n=20)
StrncmpMidUnaligned 83.61µ ± 1% 64.62µ ± 0% -22.71% (p=0.000
n=20) 26.60µ ± 1% -68.19% (p=0.000 n=20)
StrncmpLongUnaligned 59.416µ ± 0% 38.392µ ± 1% -35.39% (p=0.000
n=20) 6.356µ ± 1% -89.30% (p=0.000 n=20)
StrncmpShortQsort 1.000m ± 1% 1.053m ± 0% +5.28% (p=0.000
n=20) 1.221m ± 1% +22.11% (p=0.000 n=20)
StrncmpMidQsort 243.2µ ± 0% 243.4µ ± 0% ~ (p=0.968
n=20) 271.7µ ± 0% +11.71% (p=0.000 n=20)
geomean 137.1µ 115.4µ -15.78%
48.88µ -64.33%
│ strncmp.simple.out │ strncmp.scalar.out
│ strncmp.baseline.out │
│ B/s │ B/s vs base
│ B/s vs base │
StrncmpShortAligned 867.4Mi ± 0% 884.4Mi ± 1% +1.95% (p=0.000
n=20) 1982.6Mi ± 0% +128.57% (p=0.000 n=20)
StrncmpMidAligned 1.454Gi ± 0% 1.894Gi ± 1% +30.20% (p=0.000
n=20) 5.675Gi ± 1% +290.14% (p=0.000 n=20)
StrncmpLongAligned 2.051Gi ± 1% 3.190Gi ± 0% +55.56% (p=0.000
n=20) 19.835Gi ± 1% +867.25% (p=0.000 n=20)
StrncmpShortUnaligned 873.7Mi ± 0% 879.4Mi ± 1% +0.66% (p=0.003
n=20) 1795.2Mi ± 0% +105.47% (p=0.000 n=20)
StrncmpMidUnaligned 1.460Gi ± 1% 1.889Gi ± 0% +29.39% (p=0.000
n=20) 4.589Gi ± 1% +214.35% (p=0.000 n=20)
StrncmpLongUnaligned 2.054Gi ± 0% 3.180Gi ± 1% +54.76% (p=0.000
n=20) 19.207Gi ± 1% +834.86% (p=0.000 n=20)
StrncmpShortQsort 125.0Mi ± 1% 118.7Mi ± 0% -5.01% (p=0.000
n=20) 102.4Mi ± 1% -18.10% (p=0.000 n=20)
StrncmpMidQsort 514.0Mi ± 0% 513.6Mi ± 0% ~ (p=0.968
n=20) 460.1Mi ± 0% -10.49% (p=0.000 n=20)
geomean 912.1Mi 1.058Gi +18.74%
2.497Gi +180.37%
where "simple" is your implementation (compiled with -O3), "scalar" is our
non-SIMD implementation (see simd(7), somewhat similar to yours, but manually
unrolled) and "baseline" is our SIMD implementation. As you can see, the
simple implementation is only faster on the qsort benchmark (sorting an array
of random strings) and there only slightly.
--
You are receiving this mail because:
You are the assignee for the bug.