Tor on FreeBSD Performance issues

Sun Feb 12 15:18:34 UTC 2012

Hi
> 
> On 11 Feb 2012, at 00:06, Steven Murdoch wrote:
> 
>> On 10 Feb 2012, at 22:22, Robert N. M. Watson wrote:
>>> I wonder if we're looking at some sort of difference in socket buffer tuning between Linux and FreeBSD that is leading to better link utilisation under this workload. Both FreeBSD and Linux auto-tune socket buffer sizes, but I'm not sure if their policies for enabling/etc auto-tuning differ. Do we know if Tor fixes socket buffer sizes in such a way that it might lead to FreeBSD disabling auto-tuning?
>> 
>> If ConstrainedSockets is set to 1 (it defaults to 0), then Tor will "setsockopt(sock, SOL_SOCKET, SO_SNDBUF"  and "setsockopt(sock, SOL_SOCKET, SO_RCVBUF" to ConstrainedSockSize (defaults 8192). Otherwise I don't see any fiddling with buffer size. So I'd first confirm that ConstrainedSockets is set to zero, and perhaps try experimenting with it on for different values of ConstrainedSockSize.
> In FreeBSD, I believe the current policy is that any TCP socket that doesn't have a socket option specifically set will be auto-tuning. So it's likely that, as long as ConstrainedSockSize isn't set, auto-tuning is enabled.

ConstrainedSockets is set to zero in our Tor configuration.
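
For reference, here is roughly what enabling ConstrainedSockets would amount to per socket. This is only a minimal sketch and not Tor's actual code: the helper name constrain_socket_buffers is made up for illustration. It sets both buffers with the two setsockopt calls Steven quoted, then reads back what the kernel really applied, which may differ from the requested value (Linux, for example, reports double the requested size).

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper, not taken from Tor: request `bytes` for both the send
 * and the receive buffer of `sock`, then read back the sizes the kernel
 * actually applied. Returns 0 on success, -1 on error (errno is set by the
 * failing call). */
static int constrain_socket_buffers(int sock, int bytes,
                                    int *snd_out, int *rcv_out)
{
    socklen_t len;

    /* Explicitly sizing the buffers is the step that, per the discussion
     * above, is expected to turn off auto-tuning for this socket. */
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;

    /* Read back the effective sizes; they need not equal `bytes`. */
    len = sizeof(*snd_out);
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, snd_out, &len) < 0)
        return -1;
    len = sizeof(*rcv_out);
    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, rcv_out, &len) < 0)
        return -1;
    return 0;
}
```

Calling it on a fresh TCP socket with 8192 (the ConstrainedSockSize default mentioned above) and printing the two returned sizes would be a quick way to compare how each OS treats the request.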
> 
>>> I'm a bit surprised by the out-of-order packet count -- is that typical of a Tor workload, and can we compare similar statistics on other nodes there? This could also be a symptom of TCP reassembly queue issues. Lawrence: did we get the fixes in place there to do with the bounded reassembly queue length, and/or are there any workarounds for that issue? Is it easy to tell if we're hitting it in practice?
>> 
>> I can't think of any inherent reason for excessive out-of-order packets, as the host TCP stack is used by all Tor nodes currently. It could be some network connections from users are bad (we have plenty of dial-up users).
> 
> I guess what I'm wondering about is relative percentages. Out-of-order packets can also arise as a result of network stack bugs, and might explain a lower aggregate bandwidth. The netstat -Q options I saw in the forwarded e-mail suggest that the scenarios that could lead to this aren't present, but since it stands out, it would be worth trying to explain just to convince ourselves it's not a stack bug.
As we have two boxes with identical configuration in the same datacenter, I can give some Linux output, too:
# netstat -s
Ip:
    1099780169 total packets received
    0 forwarded
    0 incoming packets discarded
    2062308427 incoming packets delivered
    2800933295 requests sent out
    694 outgoing packets dropped
    798042 fragments dropped after timeout
    143378847 reassemblies required
    45697700 packets reassembled ok
    18522117 packet reassembles failed
    1070 fragments received ok
    761 fragments failed
    28174 fragments created
Icmp:
    92792968 ICMP messages received
    18458681 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 73204262
        timeout in transit: 6996342
        source quenches: 813143
        redirects: 9100882
        echo requests: 1646656
        echo replies: 5
    2005869 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 359208
        echo request: 5
        echo replies: 1646656
IcmpMsg:
        InType0: 5
        InType3: 73204262
        InType4: 813143
        InType5: 9100882
        InType8: 1646656
        InType11: 6996342
        OutType0: 1646656
        OutType3: 359208
        OutType8: 5
Tcp:
    4134119965 active connections openings
    275823710 passive connection openings
    2002550589 failed connection attempts
    199749970 connection resets received
    31931 connections established
    1839369825 segments received
    3631158795 segments send out
    3353305069 segments retransmited
    2152248 bad segments received.
    237858281 resets sent
Udp:
    129942286 packets received
    203329 packets to unknown port received.
    0 packet receive errors
    109523321 packets sent
UdpLite:
TcpExt:
    7088 SYN cookies sent
    15275 SYN cookies received
    3196797 invalid SYN cookies received
    1093456 resets received for embryonic SYN_RECV sockets
    36073572 packets pruned from receive queue because of socket buffer overrun
    77060 packets pruned from receive queue
    232 packets dropped from out-of-order queue because of socket buffer overrun
    362884 ICMP packets dropped because they were out-of-window
    85 ICMP packets dropped because socket was locked
    673831896 TCP sockets finished time wait in fast timer
    48600 time wait sockets recycled by time stamp
    2013223394 delayed acks sent
    3477567 delayed acks further delayed because of locked socket
    Quick ack mode was activated 440274027 times
    35711291 times the listen queue of a socket overflowed
    35711291 SYNs to LISTEN sockets dropped
    457 packets directly queued to recvmsg prequeue.
    1460 bytes directly in process context from backlog
    48211 bytes directly received in process context from prequeue
    1494466591 packet headers predicted
    33 packets header predicted and directly queued to user
    4257229715 acknowledgments not containing data payload received
    740819251 predicted acknowledgments
    442309 times recovered from packet loss due to fast retransmit
    197193098 times recovered from packet loss by selective acknowledgements
    494378 bad SACK blocks received
    Detected reordering 221053 times using FACK
    Detected reordering 1053064 times using SACK
    Detected reordering 72059 times using reno fast retransmit
    Detected reordering 4265 times using time stamp
    336672 congestion windows fully recovered without slow start
    356482 congestion windows partially recovered using Hoe heuristic
    41059770 congestion windows recovered without slow start by DSACK
    54306977 congestion windows recovered without slow start after partial ack
    245685510 TCP data loss events
    TCPLostRetransmit: 7881258
    421631 timeouts after reno fast retransmit
    70726251 timeouts after SACK recovery
    26797894 timeouts in loss state
    349218987 fast retransmits
    19632788 forward retransmits
    224201891 retransmits in slow start
    2441482671 other TCP timeouts
    220051 classic Reno fast retransmits failed
    22663942 SACK retransmits failed
    160105897 packets collapsed in receive queue due to low socket buffer
    568326755 DSACKs sent for old packets
    12316261 DSACKs sent for out of order packets
    157800118 DSACKs received
    1008695 DSACKs for out of order packets received
    2043 connections reset due to unexpected SYN
    48512275 connections reset due to unexpected data
    15085625 connections reset due to early user close
    1702109944 connections aborted due to timeout
    TCPSACKDiscard: 231850
    TCPDSACKIgnoredOld: 99417376
    TCPDSACKIgnoredNoUndo: 33053947
    TCPSpuriousRTOs: 5163955
    TCPMD5Unexpected: 8
    TCPSackShifted: 290984575
    TCPSackMerged: 613203726
    TCPSackShiftFallback: 747049207
IpExt:
    InBcastPkts: 12617896
    OutBcastPkts: 1456356
    InOctets: -1096131435
    OutOctets: -1263483369
    InBcastOctets: -2144923256
    OutBcastOctets: 187483424
> 
>>> On the other hand, I think Steven had mentioned that Tor has changed how it does exit node load distribution to better take into account realised rather than advertised bandwidth. If that's the case, you might get larger systemic effects causing feedback: if you offer slightly less throughput then you get proportionally less traffic. This is something I can ask Steven about on Monday.
>> 
>> There is active probing of capacity, which then is used to adjust the weighting factors that clients use.
> 
> So there is a chance that the effect we're seeing has to do with clients not being directed to the host, perhaps due to larger systemic issues, or the FreeBSD box responding less well to probing and therefore being assigned less work by Tor as a whole. Are there any tools for diagnosing these sorts of interactions in Tor, or fixing elements of the algorithm to allow experiments with capacity to be done more easily? We can treat this as a FreeBSD stack problem in isolation, but in as much as we can control for effects like that, it would be useful.
> 
> There's a non-trivial possibility that we're simply missing a workaround for known-bad Broadcom hardware, as well, so it would be worth our taking a glance at the pciconf -lv output describing the card so we can compare Linux driver workarounds with FreeBSD driver workarounds, and make sure we have them all. If I recall correctly, that silicon is not known for its correctness, so failing to disable some hardware feature could have significant effect.

# pciconf -lv
bge0 at pci0:32:0:0:	class=0x020000 card=0x705d103c chip=0x165b14e4 rev=0x10 hdr=0x00
    vendor     = 'Broadcom Corporation'
    device     = 'NetXtreme BCM5723 Gigabit Ethernet PCIe'
    class      = network
    subclass   = ethernet
bge1 at pci0:34:0:0:	class=0x020000 card=0x705d103c chip=0x165b14e4 rev=0x10 hdr=0x00
    vendor     = 'Broadcom Corporation'
    device     = 'NetXtreme BCM5723 Gigabit Ethernet PCIe'
    class      = network
    subclass   = ethernet
> 
>>> Could someone remind me if Tor is multi-threaded these days, and if so, how socket I/O is distributed over threads?
>> 
>> I believe that Tor is single-threaded for the purposes of I/O. Some server operators with fat pipes have had good experiences of running several Tor instances in parallel on different ports to increase bandwidth utilisation.
> 
> It would be good to confirm the configuration in this particular case to make sure we understand it. It would also be good to know if the main I/O thread in Tor is saturating the core it's running on -- if so, we might be looking at some poor behaviour relating to, for example, frequent timestamp checking, which is currently more expensive on FreeBSD than Linux.
We have two Tor processes running. Tor still only uses multi-threading for crypto work, and not even for all of that (only onionskins). With polling enabled I actually got both Tor processes to nearly saturate the cores they were on, but now that I have disabled polling and gone back to 1000 Hz, they no longer get there. Currently one process is at 60% WCPU and the other at about 50%.

As it's been asked: yes, it is a FreeBSD 9 box, and no, there is no net.inet.tcp.inflight.enable sysctl.
Also, libevent is using kqueue, and I've tried patching both Tor and libevent to use CLOCK_MONOTONIC_FAST and CLOCK_REALTIME_FAST, as pointed out by Alexander.

If by flow cache you mean net.inet.flowtable, then I believe the sysctl won't show up unless I activate IP forwarding, which I have not (and net.inet.flowtable is indeed not available on this box).

Also some sysctls as requested:
kern.ipc.somaxconn=16384
kern.ipc.maxsockets=204800
kern.maxfiles=204800
kern.maxfilesperproc=200000
kern.maxvnodes=200000
net.inet.tcp.recvbuf_max=10485760
net.inet.tcp.recvbuf_inc=65535
net.inet.tcp.sendbuf_max=10485760
net.inet.tcp.sendbuf_inc=65535
net.inet.tcp.sendspace=10485760
net.inet.tcp.recvspace=10485760
net.inet.tcp.delayed_ack=0
net.inet.ip.portrange.first=1024
net.inet.ip.portrange.last=65535
net.inet.ip.rtexpire=2
net.inet.ip.rtminexpire=2
net.inet.ip.rtmaxcache=1024
net.inet.tcp.rfc1323=0
net.inet.tcp.maxtcptw=200000
net.inet.ip.intr_queue_maxlen=4096
net.inet.tcp.ecn.enable=1    (net.inet.ip.intr_queue_drops is zero)
net.inet.ip.portrange.reservedlow=0
net.inet.ip.portrange.reservedhigh=0
net.inet.ip.portrange.hifirst=1024
security.mac.portacl.enabled=1
security.mac.portacl.suser_exempt=1
security.mac.portacl.port_high=1023
security.mac.portacl.rules=uid:80:tcp:80
security.mac.portacl.rules=uid:256:tcp:443

Thanks for the replies and all of this information. 

Julian