From: Rick Macklem
Date: Fri, 2 Feb 2024 18:05:08 -0800
Subject: Re: Increasing TCP TSO size support
To: Drew Gallatin
Cc: Richard Scheffenegger, "freebsd-net@FreeBSD.org", FreeBSD Transport, rmacklem@freebsd.org, kp@freebsd.org
List-Id: Networking and TCP/IP with FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-net

On Fri, Feb 2, 2024 at 4:48 PM Drew Gallatin wrote:
>
> On Fri, Feb 2, 2024, at 6:13 PM, Rick Macklem wrote:
>
> A factor here is the if_hw_tsomaxsegcount limit.
> For example, a 1 MByte NFS write request
> or read reply will result in a 514-element mbuf chain. Each of these (mostly 2K mbuf clusters)
> is a non-contiguous data segment. (I suspect most NICs do not handle this many segments well,
> if at all.)
>
> Excellent point.
>
> The NFS code does know how to use M_EXTPG mbufs (for NFS over TLS, for ktls), but I do not
> know what it would take to make these work for non-KTLS TSO?
>
> Sendfile already uses M_EXTPG mbufs... When I was initially doing M_EXTPG stuff for kTLS, I added support for using M_EXTPG mbufs in sendfile regardless of whether or not kTLS was in use. That reduced CPU use marginally on 64-bit platforms (due to reducing socket buffer lengths, and hence reducing pointer chasing), and quite a bit more on 32-bit platforms (due to also not needing to map memory into the kernel map, and by reducing pointer chasing even more, as more pages fit into an M_EXTPG mbuf when a paddr_t is 32 bits).
>
> I do not know how the TSO loop in tcp_output handles M_EXTPG mbufs.
> Does it assume each M_EXTPG mbuf is one contiguous data segment?
>
> No, it's fully aware of how to handle M_EXTPG mbufs. Look at tcp_m_copym(). We added code in the segment counting part of that function to count the hdr/trailer parts of an M_EXTPG mbuf, and to deal with the start/end page being misaligned.
>
> I do see that ip_output() will call mb_unmapped_to_ext() when the NIC does not have IFCAP_MEXTPG set.
> (If IFCAP_MEXTPG is set, do the pages need to be contiguous so that it can become
> a single contiguous data segment for TSO or ???)
>
> No, it just means that a NIC driver has been verified not to call mtod() on an M_EXTPG mbuf and deref the resulting data pointer (which would make it go "boom").
>
> But the page size is only 4K on most platforms.
> So while an M_EXTPG mbuf can hold 5 pages (..from memory, too lazy to do the math right now) and reduces socket buffer mbuf chain lengths by a factor of 10 or so (2k vs 20k per mbuf), the S/G list that a NIC will need to consume would likely decrease only by a factor of 2. And even then only if the busdma code to map mbufs for DMA is not coalescing adjacent mbufs. I know busdma does some coalescing, but I can't recall if it coalesces physically adjacent mbufs.

I'm guessing the factor of 2 comes from the fact that each page is a
contiguous segment?

The NFS code could easily use 5 contiguous pages, so maybe it would be
worthwhile to try to make some NIC drivers capable of handling contiguous
pages as one segment for TSO output?
(It means that tcp_output() would need to know this case was possible.
Maybe a new if_hw_tsoXX that covers the max number of segments if pages
are contig?)

However, given your previous post, it might not matter much, since the
larger TSO segment might not make much difference?

>
> If TSO and the code beneath it (NIC and maybe mb_unmapped_to_ext() being called) were to
> all work ok for M_EXTPG mbufs, it would be easy to enable that for NFS (non-TLS case).
>
> It does. You should enable it for at least TCP.

Good work!! I will try it someday relatively soon. Even if it only reduces the
use of mbuf clusters, that sounds like it would be worthwhile.

rick

>
> Drew
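
The chain-length and S/G figures quoted in this thread can be checked with a few lines of arithmetic. The following is only a rough sketch in Python, not kernel code. It assumes a 1 MByte payload, 2K mbuf clusters, 4K pages, and 5 pages per M_EXTPG mbuf (the from-memory figure given above). It also assumes page-aligned data, no M_EXTPG header/trailer segments, and no busdma coalescing. All names in it (`chain_stats`, `PAGES_PER_EXTPG`, and so on) are made up for illustration and are not FreeBSD identifiers.

```python
# Back-of-the-envelope counts for a 1 MByte NFS request/reply.
# Every constant here is an assumption taken from the discussion,
# not a value read from a running FreeBSD kernel.

PAYLOAD = 1024 * 1024      # 1 MByte of NFS data
CLUSTER = 2048             # 2K mbuf cluster
PAGE = 4096                # 4K page
PAGES_PER_EXTPG = 5        # pages per M_EXTPG mbuf (from-memory figure)


def ceil_div(n, d):
    """Integer division rounded up."""
    return -(-n // d)


def chain_stats(payload=PAYLOAD):
    """Return mbuf and S/G segment counts for both mbuf layouts."""
    cluster_mbufs = ceil_div(payload, CLUSTER)
    extpg_mbufs = ceil_div(payload, PAGE * PAGES_PER_EXTPG)
    return {
        "cluster_mbufs": cluster_mbufs,
        # Each 2K cluster is its own non-contiguous segment.
        "cluster_segs": cluster_mbufs,
        "extpg_mbufs": extpg_mbufs,
        # Each page is one segment unless something coalesces them.
        "extpg_segs": ceil_div(payload, PAGE),
    }


if __name__ == "__main__":
    s = chain_stats()
    print(s)
    print("chain length ratio:", s["cluster_mbufs"] / s["extpg_mbufs"])
    print("S/G segment ratio:", s["cluster_segs"] / s["extpg_segs"])
```

Under these assumptions the data portion is 512 cluster mbufs versus 52 M_EXTPG mbufs (roughly the factor of 10), and 512 versus 256 S/G segments (the factor of 2). The 514-element chain mentioned above presumably includes a couple of non-data mbufs on top of the 512 clusters; that is a guess, since the thread does not break the number down.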