NFS server bottlenecks
Garrett Wollman
wollman at freebsd.org
Tue Oct 2 04:21:20 UTC 2012
I had an email conversation with Rick Macklem about six months ago
concerning NFS server bottlenecks. I'm now in a position to observe my
large-scale NFS server under an actual production load, so I thought I
would update folks on what it looks like. This is a 9.1 prerelease
kernel (I hope 9.1 will be released soon, as I have four more of these
servers to deploy!). When under nearly 100% load on an 8-core
(16-thread) Quanta QSSC-S99Q storage server, with a 10G network
interface, pmcstat tells me this:
PMC: [INST_RETIRED.ANY_P] Samples: 2727105 (100.0%) , 27 unresolved
%SAMP IMAGE FUNCTION CALLERS
29.3 kernel _mtx_lock_sleep nfsrvd_updatecache:10.0 nfsrvd_getcache:7.4 ...
9.5 kernel cpu_search_highest cpu_search_highest:8.1 sched_idletd:1.4
7.4 zfs.ko lzjb_decompress zio_decompress
4.3 kernel _mtx_lock_spin turnstile_trywait:2.2 pmclog_reserve:1.0 ...
4.0 zfs.ko fletcher_4_native zio_checksum_error:3.1 zio_checksum_compute:0.8
3.6 kernel cpu_search_lowest cpu_search_lowest
3.3 kernel nfsrc_trimcache nfsrvd_getcache:1.6 nfsrvd_updatecache:1.6
2.3 kernel ipfw_chk ipfw_check_hook
2.1 pmcstat _init
1.1 kernel _sx_xunlock
0.9 kernel _sx_xlock
0.9 kernel spinlock_exit
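For anyone who wants to gather a similar profile, the general shape of the
pmcstat(8) invocation is something like this (a sketch, not my exact
command line; the sampling count and event name are illustrative):

```shell
# Live "top" view of instruction-retired samples, as shown above:
pmcstat -S INST_RETIRED.ANY_P -T

# Or log samples to a file for offline gprof-style analysis:
pmcstat -S unhalted-core-cycles -O samples.out
```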
This does seem to confirm my original impression that the NFS replay
cache is quite expensive. Running a gprof(1) analysis on the same PMC
data reveals a bit more detail (I've removed some uninteresting parts
of the call graph):
called/total parents
index %time self descendents called+self name index
called/total children
4881.00 2004642.70 932627/932627 svc_run_internal [2]
[4] 45.1 4881.00 2004642.70 932627 nfssvc_program [4]
13199.00 504436.33 584319/584319 nfsrvd_updatecache [9]
23075.00 403396.18 468009/468009 nfsrvd_getcache [14]
1032.25 416249.44 2239/2284 svc_sendreply_mbuf [15]
6168.00 381770.44 11618/11618 nfsrvd_dorpc [24]
3526.87 86869.88 112478/112514 nfsrvd_sentcache [74]
890.00 50540.89 4252/4252 svc_getcred [101]
14876.60 32394.26 4177/24500 crfree <cycle 3> [263]
11550.11 25150.73 3243/24500 free <cycle 3> [102]
1348.88 15451.66 2716/16831 m_freem [59]
4066.61 216.81 1434/1456 svc_freereq [321]
2342.15 677.40 557/1459 malloc_type_freed [265]
59.14 1916.84 134/2941 crget [113]
1602.25 0.00 322/9682 bzero [105]
690.93 0.00 43/44 getmicrotime [571]
287.22 7.33 138/1205 prison_free [384]
233.61 0.00 60/798 PHYS_TO_VM_PAGE [358]
203.12 0.00 94/230 nfsrv_mallocmget_limit [632]
151.76 0.00 51/1723 pmap_kextract [309]
0.78 70.28 9/3281 _mtx_unlock_sleep [154]
19.22 16.88 38/400403 nfsrc_trimcache [26]
11.05 21.74 7/197 crsetgroups [532]
30.37 0.00 11/6592 critical_enter [190]
25.50 0.00 9/36 turnstile_chain_unlock [844]
24.86 0.00 3/7 nfsd_errmap [913]
12.36 8.57 8/2145 in_cksum_skip [298]
9.10 3.59 5/12455 mb_free_ext [140]
1.84 4.85 2/2202 VOP_UNLOCK_APV [269]
-----------------------------------------------
0.49 0.15 1/1129009 uhub_explore [1581]
0.49 0.15 1/1129009 tcp_output [10]
0.49 0.15 1/1129009 pmap_remove_all [1141]
0.49 0.15 1/1129009 vm_map_insert [236]
0.49 0.15 1/1129009 vnode_create_vobject [281]
0.49 0.15 1/1129009 biodone [351]
0.49 0.15 1/1129009 vm_object_madvise [670]
0.49 0.15 1/1129009 xpt_done [483]
0.49 0.15 1/1129009 vputx [80]
0.49 0.15 1/1129009 vm_map_delete <cycle 3> [49]
0.49 0.15 1/1129009 vm_object_deallocate <cycle 3> [356]
0.49 0.15 1/1129009 vm_page_unwire [338]
0.49 0.15 1/1129009 pmap_change_wiring [318]
0.98 0.31 2/1129009 getnewvnode [227]
0.98 0.31 2/1129009 pmap_clear_reference [1004]
0.98 0.31 2/1129009 usbd_do_request_flags [1282]
0.98 0.31 2/1129009 vm_object_collapse <cycle 3> [587]
0.98 0.31 2/1129009 vm_object_page_remove [122]
1.48 0.46 3/1129009 mpt_pci_intr [487]
1.48 0.46 3/1129009 pmap_extract [355]
1.48 0.46 3/1129009 vm_fault_unwire [171]
1.97 0.62 4/1129009 vgonel [270]
1.97 0.62 4/1129009 vm_object_shadow [926]
1.97 0.62 4/1129009 zone_alloc_item [434]
2.46 0.77 5/1129009 vnlru_free [235]
2.46 0.77 5/1129009 insmntque1 [737]
2.95 0.93 6/1129009 zone_free_item [409]
3.94 1.24 8/1129009 pmap_enter_object [805]
4.92 1.55 10/1129009 vfs_ref [722]
4.92 1.55 10/1129009 pmap_clear_modify [814]
5.41 1.70 11/1129009 vn_isdisk [391]
5.90 1.86 12/1129009 rtalloc1_fib [158]
5.90 1.86 12/1129009 soreceive_generic [99]
6.40 2.01 13/1129009 pmap_remove_pages [403]
6.89 2.16 14/1129009 _mtx_lock_flags [565]
6.89 2.16 14/1129009 random_yarrow_read [542]
6.89 2.16 14/1129009 nfsrv_checkgetattr [1133]
7.38 2.32 15/1129009 svc_vc_recv [55]
7.38 2.32 15/1129009 ixgbe_rxeof [12]
8.85 2.78 18/1129009 _cv_timedwait_sig [859]
9.35 2.94 19/1129009 vref [242]
10.82 3.40 22/1129009 pmap_enter_quick [690]
11.31 3.56 23/1129009 tcp_do_segment [35]
13.28 4.18 27/1129009 sosend_generic [25]
15.74 4.95 32/1129009 g_io_schedule_down [76]
19.19 6.03 39/1129009 vn_start_write [400]
20.66 6.49 42/1129009 ip_output [32]
23.12 7.27 47/1129009 pmap_copy [467]
23.12 7.27 47/1129009 __mnt_vnode_next_all [520]
29.52 9.28 60/1129009 vm_page_free_toq [149]
30.50 9.59 62/1129009 vm_fault_hold [268]
36.90 11.60 75/1129009 g_io_request [172]
38.37 12.06 78/1129009 prison_hold [330]
38.86 12.22 79/1129009 taskqueue_member [423]
39.36 12.37 80/1129009 vm_page_deactivate [812]
39.85 12.53 81/1129009 vm_page_activate [790]
40.83 12.83 83/1129009 in_matroute [197]
45.75 14.38 93/1129009 vm_page_alloc [128]
48.70 15.31 99/1129009 prison_free [384]
66.90 21.03 136/1129009 vholdl [275]
73.30 23.04 149/1129009 vfs_unbusy [470]
82.65 25.98 168/1129009 key_addref [233]
85.60 26.91 174/1129009 vn_finished_write [407]
91.50 28.76 186/1129009 _key_freesp [286]
92.48 29.07 188/1129009 vfs_busy [288]
102.32 32.16 208/1129009 _vm_object_allocate [738]
107.24 33.71 218/1129009 taskqueue_enqueue [68]
112.16 35.26 228/1129009 xprt_active [267]
125.94 39.59 256/1129009 vm_pageq_remove [650]
128.89 40.51 262/1129009 svc_freereq [321]
128.89 40.51 262/1129009 pmap_remove [106]
153.49 48.25 312/1129009 vdropl [251]
154.96 48.71 315/1129009 vfs_busyfs [239]
180.05 56.60 366/1129009 pmap_enter [245]
182.02 57.22 370/1129009 svc_run_internal [2]
213.50 67.11 434/1129009 ixgbe_msix_que [11]
248.43 78.09 505/1129009 key_allocsp [221]
255.32 80.26 519/1129009 camisr [107]
373.87 117.52 760/1129009 ipmi_poll [635]
404.87 127.27 823/1129009 g_io_deliver [162]
440.78 138.56 896/1129009 g_io_schedule_up [137]
462.91 145.51 941/1129009 taskqueue_run_locked [43]
744.30 233.97 1513/1129009 _sleep [150]
906.15 284.84 1842/1129009 dastrategy [116]
1170.32 367.88 2379/1129009 mps_intr_msi [194]
1329.22 417.83 2702/1129009 ixgbe_handle_que [56]
15078.43 4739.79 30651/1129009 nfsrc_unlock [174]
25869.62 8131.91 52587/1129009 _vm_map_lock [129]
44082.69 13857.05 89610/1129009 nfsrvd_sentcache [74]
113077.19 35544.94 229860/1129009 nfsrc_trimcache [26]
149533.34 47004.65 303967/1129009 nfsrvd_getcache [14]
198698.26 62459.27 403908/1129009 nfsrvd_updatecache [9]
[5] 16.4 555404.00 174586.97 1129009 _mtx_lock_sleep [5]
9074.71 139536.06 87523/87603 turnstile_trywait [45]
3085.07 11049.12 3870/3964 turnstile_wait [207]
166.20 10020.76 4390/4728 lockstat_nsecs [229]
505.86 276.39 322/324 turnstile_cancel [552]
516.23 123.24 384/341743 spinlock_exit <cycle 1> [136]
161.51 0.00 61/9744 spinlock_enter [156]
44.18 0.00 16/6592 critical_enter [190]
23.20 0.00 4/35 turnstile_setowner [767]
1.35 3.10 1/1531 propagate_priority [280]
-----------------------------------------------
13199.00 504436.33 584319/584319 nfssvc_program [4]
[9] 11.6 13199.00 504436.33 584319 nfsrvd_updatecache [9]
198698.26 62459.27 403908/1129009 _mtx_lock_sleep [5]
107453.19 94387.86 212423/400403 nfsrc_trimcache [26]
1974.56 16386.18 2411/3646 nfsrc_freecache [148]
120.15 10877.79 1393/3281 _mtx_unlock_sleep [154]
139.67 10006.86 2/12 lapic_handle_intr [93]
257.17 946.79 431/6821 m_copym [186]
539.54 0.00 124/26922 bcopy [53]
2.07 31.89 20/87603 turnstile_trywait [45]
10.68 23.27 3/24500 free <cycle 3> [102]
25.54 6.10 19/341743 spinlock_exit <cycle 1> [136]
28.33 0.00 10/36 turnstile_chain_unlock [844]
13.58 10.56 8/30514 uma_zalloc_arg [70]
18.35 0.00 7/37 nfsrc_wanted [851]
10.90 0.83 5/284 turnstile_lookup [582]
0.11 6.85 3/4728 lockstat_nsecs [229]
-----------------------------------------------
23075.00 403396.18 468009/468009 nfssvc_program [4]
[14] 9.6 23075.00 403396.18 468009 nfsrvd_getcache [14]
149533.34 47004.65 303967/1129009 _mtx_lock_sleep [5]
95069.59 83510.00 187942/400403 nfsrc_trimcache [26]
1104.58 13395.38 993/8846 malloc [47]
117.22 10612.28 1359/3281 _mtx_unlock_sleep [154]
1570.18 1088.69 1016/2145 in_cksum_skip [298]
134.35 0.00 27/9682 bzero [105]
127.38 0.00 46/1196 in_cksumdata [347]
28.23 6.74 21/341743 spinlock_exit <cycle 1> [136]
1.87 28.70 18/87603 turnstile_trywait [45]
25.50 0.00 9/36 turnstile_chain_unlock [844]
11.88 9.24 7/30514 uma_zalloc_arg [70]
13.08 0.99 6/284 turnstile_lookup [582]
0.04 2.28 1/4728 lockstat_nsecs [229]
-----------------------------------------------
19.22 16.88 38/400403 nfssvc_program [4]
95069.59 83510.00 187942/400403 nfsrvd_getcache [14]
107453.19 94387.86 212423/400403 nfsrvd_updatecache [9]
[26] 8.5 202542.00 177914.74 400403 nfsrc_trimcache [26]
113077.19 35544.94 229860/1129009 _mtx_lock_sleep [5]
657.00 19128.82 5754/5754 nfsrv_checksockseqnum [183]
1011.44 8393.58 1235/3646 nfsrc_freecache [148]
14.25 31.02 4/24500 free <cycle 3> [102]
2.48 28.45 5/16831 m_freem [59]
0.93 14.35 9/87603 turnstile_trywait [45]
7.86 0.00 3/37 nfsrc_wanted [851]
1.57 0.86 1/324 turnstile_cancel [552]
-----------------------------------------------
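To make the contention pattern in the call graph concrete: every RPC goes
through the replay cache twice, once on arrival to check for a
retransmission (nfsrvd_getcache) and once when the reply is recorded
(nfsrvd_updatecache), and both paths take the same mutex. A toy sketch of
that structure (names and language are illustrative, not the kernel's):

```python
import threading

class ReplayCache:
    """Toy NFS duplicate-request (replay) cache: a single global lock
    guarding all entries, taken once per request on lookup and again
    when the reply is recorded -- the double hit on one mutex that
    shows up as _mtx_lock_sleep in the profile above."""

    def __init__(self):
        self.lock = threading.Lock()   # the contended global mutex
        self.entries = {}              # xid -> cached reply

    def getcache(self, xid):
        # Every incoming RPC serializes on the global lock here...
        with self.lock:
            return self.entries.get(xid)

    def updatecache(self, xid, reply):
        # ...and again here when its reply is cached.
        with self.lock:
            self.entries[xid] = reply

cache = ReplayCache()
if cache.getcache(42) is None:         # first arrival: cache miss
    cache.updatecache(42, "reply-42")  # record the reply
# A retransmitted request with the same xid gets the cached reply
# instead of re-executing the operation:
assert cache.getcache(42) == "reply-42"
```

With 16 hardware threads all funneling through that one lock twice per
RPC, it is not surprising that the mutex itself dominates the profile.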
The full gprof output for the kernel is at
<http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.txt>.
An amazing visualization of this (thanks to gprof2dot.py and dot) is at
<http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.png>.
There are some functions missing from the visualization because the
gprof buckets overflowed.
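For the curious, the pipeline from pmc log to that picture is roughly as
follows (filenames are placeholders; check pmcstat(8) for the details):

```shell
# Convert the sample log into gprof(1)-compatible .gmon files,
# one per executable image:
pmcstat -R samples.out -g

# Run gprof over the kernel's profile, then render the call
# graph with gprof2dot.py and dot:
gprof /boot/kernel/kernel unhalted-core-cycles/kernel.gmon > kernel.gprof
gprof2dot.py < kernel.gprof | dot -Tpng -o nfs-server.png
```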
Is there a documented way to merge the kernel and ZFS gprof data
output by pmcstat(1) so that those functions will be properly
accounted for?
-GAWollman