4.7 vs 5.2.1 SMP/UP bridging performance
Kenneth Culver
culverk at sweetdreamsracing.biz
Tue May 4 13:12:57 PDT 2004
Quoting Gerrit Nagelhout <gnagelhout at sandvine.com>:
> Hi,
>
> For one of our applications in our testlab, we are running bridge(4)
> with several user land applications. I have found that the bridging
> performance (64 byte packets, 2-port bridge) on 5.2.1 is
> significantly lower than that of RELENG_4, especially when running in
> SMP. The platform is a dual 2.8GHz xeon with a dual port em (100MHz
> PCI-X). Invariants are disabled, and polling (with idle_polling
> enabled) is used.
Quick stupid question: did you turn off all the debugging stuff in the kernel?

options		DDB			# Enable the kernel debugger
options		INVARIANTS		# Enable calls of extra sanity checking
options		INVARIANT_SUPPORT	# Extra sanity checks of internal structures, required by INVARIANTS
options		WITNESS			# Enable checks to detect deadlocks and cycles
options		WITNESS_SKIPSPIN

If you didn't turn all of that off, you may want to try it.
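[Editor's note: concretely, that means commenting out those lines in the kernel config and rebuilding, the same convention Gerrit's config excerpt below already uses for INVARIANTS:]

```
#options 	DDB
#options 	INVARIANTS
#options 	INVARIANT_SUPPORT
#options 	WITNESS
#options 	WITNESS_SKIPSPIN
```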
Ken
>
> Here are the various test results (packets per second, full duplex)
> [traffic generator] <=> [FreeBSD bridge] <=> [traffic generator]
>
> 4.7 UP: 1.2Mpps
> 4.7 SMP : 1.2Mpps
> 5.2.1 UP: 850Kpps
> 5.2.1 SMP: 500Kpps
>
> I believe that for RELENG_4, the hardware is the bottleneck, which
> explains why there is no difference between UP and SMP.
> In order to get these numbers for 5.2.1, I had to make a small change
> to bridge.c (change ETHER_ADDR_EQ to BDG_MATCH in bridge_in to avoid
> calling bcmp). This change boosted performance by about 20%.
>
> I ran the kernel profiler for both UP and SMP (5.2.1), and included
> the results of the top functions below. In the past, I have run the
> profiler against RELENG_4 also, and the main difference with that
> (explaining reduced UP performance) is more overhead due to bus_dma &
> mbuf handling. When I compare the results of UP & SMP (5.2.1), all
> the functions using mutexes seem to get much more expensive, and
> critical_exit is taking more cycles. A quick count of mutexes in the
> bridge code path showed that there were 10-20 locks & unlocks for
> each packet. When, as a quick test, I added 10 more locks/unlocks to
> the code path, the SMP performance went down to 330Kpps. This
> indicates that mutexes are much more expensive in SMP than in UP.
>
> I would like to move to CURRENT for new hardware support, and the
> ability to properly use multi-threading in user-space, but can't do
> this until the performance bottlenecks are solved. I realize that
> 5.x is still a work in progress and hasn't been tuned as well as 4.7
> yet, but are there any plans for optimizations in this area? Does
> anyone have any suggestions on what else I can try?
>
> Thanks,
>
> Gerrit
>
> (wheel)# sysctl net.link.ether.bridge
> net.link.ether.bridge.version: $Revision: 1.72 $ $Date: 2003/10/31 18:32:08 $
> net.link.ether.bridge.debug: 0
> net.link.ether.bridge.ipf: 0
> net.link.ether.bridge.ipfw: 0
> net.link.ether.bridge.copy: 0
> net.link.ether.bridge.ipfw_drop: 0
> net.link.ether.bridge.ipfw_collisions: 0
> net.link.ether.bridge.packets: 1299855421
> net.link.ether.bridge.dropped: 0
> net.link.ether.bridge.predict: 0
> net.link.ether.bridge.enable: 1
> net.link.ether.bridge.config: em0:1,em1:1
>
> (wheel)# sysctl kern.polling
> kern.polling.burst: 19
> kern.polling.each_burst: 80
> kern.polling.burst_max: 1000
> kern.polling.idle_poll: 1
> kern.polling.poll_in_trap: 0
> kern.polling.user_frac: 5
> kern.polling.reg_frac: 120
> kern.polling.short_ticks: 0
> kern.polling.lost_polls: 4297586
> kern.polling.pending_polls: 0
> kern.polling.residual_burst: 0
> kern.polling.handlers: 3
> kern.polling.enable: 1
> kern.polling.phase: 0
> kern.polling.suspect: 1030517
> kern.polling.stalled: 40
> kern.polling.idlepoll_sleeping: 0
>
>
> Here are some of the interesting parts of the config file:
> options HZ=2500
> options NMBCLUSTERS=32768
> #options GDB_REMOTE_CHAT
> #options INVARIANTS
> #options INVARIANT_SUPPORT
> #options DIAGNOSTIC
>
> options DEVICE_POLLING
>
>
>
> The following profiles show only the top functions (more than 0.2%):
>
> UP:
>
> granularity: each sample hit covers 16 byte(s) for 0.01% of 10.01 seconds
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  20.3      2.03      2.03                             ether_input [1]
>  10.5      3.09      1.06                             mb_free [2]
>   5.8      3.67      0.58                             _bus_dmamap_load_buffer [3]
>   5.6      4.23      0.56                             m_getcl [4]
>   5.3      4.76      0.53                             em_encap [5]
>   5.1      5.27      0.51                             m_free [6]
>   5.1      5.78      0.51                             mb_alloc [7]
>   4.9      6.27      0.49                             bdg_forward [8]
>   4.9      6.76      0.49                             em_process_receive_interrupts [9]
>   4.1      7.17      0.41                             bridge_in [10]
>   3.6      7.53      0.36                             generic_bcopy [11]
>   3.6      7.89      0.36                             m_freem [12]
>   2.6      8.14      0.26                             em_get_buf [13]
>   2.2      8.37      0.22                             em_clean_transmit_interrupts [14]
>   2.2      8.59      0.22                             em_start_locked [15]
>   2.0      8.79      0.20                             bus_dmamap_load_mbuf [16]
>   1.9      8.99      0.19                             bus_dmamap_load [17]
>   1.3      9.11      0.13                             critical_exit [18]
>   1.1      9.23      0.11                             em_start [19]
>   1.0      9.32      0.10                             bus_dmamap_create [20]
>   0.8      9.40      0.08                             em_receive_checksum [21]
>   0.6      9.46      0.06                             em_tx_cb [22]
>   0.5      9.52      0.05                             __mcount [23]
>   0.5      9.57      0.05                             em_transmit_checksum_setup [24]
>   0.5      9.62      0.05                             m_tag_delete_chain [25]
>   0.5      9.66      0.05                             m_adj [26]
>   0.3      9.69      0.03                             mb_pop_cont [27]
>   0.2      9.71      0.02                             bus_dmamap_destroy [28]
>   0.2      9.73      0.02                             mb_reclaim [29]
>   0.2      9.75      0.02                             ether_ipfw_chk [30]
>   0.2      9.77      0.02                             em_dmamap_cb [31]
>
> SMP:
>
> granularity: each sample hit covers 16 byte(s) for 0.00% of 20.14 seconds
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  47.9      9.64      9.64                             cpu_idle_default [1]
>   4.9     10.63      0.99                             critical_exit [2]
>   4.6     11.56      0.93                             mb_free [3]
>   4.3     12.41      0.86                             bridge_in [4]
>   4.2     13.26      0.84                             bdg_forward [5]
>   4.1     14.08      0.82                             mb_alloc [6]
>   3.9     14.87      0.79                             em_process_receive_interrupts [7]
>   3.2     15.52      0.65                             em_start [8]
>   3.1     16.15      0.63                             m_free [9]
>   3.0     16.76      0.61                             _bus_dmamap_load_buffer [10]
>   2.5     17.27      0.51                             m_getcl [11]
>   2.1     17.69      0.42                             em_start_locked [12]
>   1.9     18.07      0.37                             ether_input [13]
>   1.5     18.38      0.31                             em_encap [14]
>   1.1     18.61      0.23                             bus_dmamap_load [15]
>   1.0     18.82      0.21                             generic_bcopy [16]
>   0.9     19.00      0.18                             bus_dmamap_load_mbuf [17]
>   0.8     19.16      0.17                             __mcount [18]
>   0.6     19.29      0.13                             em_get_buf [19]
>   0.6     19.41      0.12                             em_clean_transmit_interrupts [20]
>   0.5     19.52      0.11                             em_receive_checksum [21]
>   0.4     19.60      0.09                             m_gethdr_clrd [22]
>   0.4     19.69      0.08                             bus_dmamap_create [23]
>   0.3     19.75      0.06                             em_tx_cb [24]
>   0.2     19.80      0.05                             m_freem [25]
>   0.2     19.83      0.03                             m_adj [26]
>   0.1     19.85      0.02                             m_tag_delete_chain [27]
>   0.1     19.87      0.02                             bus_dmamap_destroy [28]
>   0.1     19.89      0.02                             mb_pop_cont [29]
>   0.1     19.91      0.02                             em_dmamap_cb [30]
>   0.1     19.92      0.02                             em_transmit_checksum_setup [31]
>   0.1     19.94      0.01                             mb_alloc_wait [32]
>   0.1     19.95      0.01                             em_poll [33]
>
>
> _______________________________________________
> freebsd-current at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe at freebsd.org"