Siftr inflight byte question

Andre Oppermann andre at freebsd.org
Tue May 7 11:30:26 UTC 2013


On 06.05.2013 12:37, Lawrence Stewart wrote:
> [ccing freebsd-net@ so my problem description enters the collective
> subconscious in case I forget about this again]
>
> For everyone tuning in, Aris asked me the apt question of why siftr(4)'s
> "# inflight bytes" field doesn't take into account sacked bytes.
>
> On 05/06/13 18:45, Angelogiannopoulos, Aris wrote:
>> On 05/06/13 18:31, Lawrence Stewart wrote:
>>> On 05/06/13 18:18, Angelogiannopoulos, Aris wrote:
>>>> I noticed that the sent_inflight_bytes variable doesn’t take
>>>> into account the SACKd bytes in case SACK is used.
>>>>
>>>> Is there a specific reason for this?
>>>
>>> There is a very good reason - I was too lazy at the time ;)
>>>
>>> More specifically, calculating the number of bytes SACKed by the
>>> receiver is more difficult than pulling simple variables out of the
>>> TCP control block and I was new to the TCP stack when I wrote
>>> SIFTR. I've simply never bothered to go back and improve things in
>>> this regard.
>>
>> isn't it as simple as (tp->snd_nxt - tp->snd_fack) +
>> tp->sackhint.sack_bytes_rexmit ? This is the way the stack is doing
>> it in case of more than three duplicate acks.
>
> You would hope so given the comment accompanying that calculation in
> tcp_do_segment(), but I'm pretty sure that when I looked into this many
> moons ago, I found that calculation to be incorrect. I should obviously
> have gotten to the bottom of the problem back then, but from memory it
> was not crucial to what I was working on at the time and I subsequently
> forgot about it. It's good to come back to this now as it should be fixed.

It's not accurate but also not totally wrong.

>>>> Should we change it to better reflect the conditions on the
>>>> channel?
>>>
>>> Ideally, yes. Feel like having a go and submitting a patch? I can
>>> give you some hints on how to do it if you want a concrete starting
>>> point.
>>
>> Sure why not?
>
> My write up below is partially from memory and from a 2 min look at the
> code just now i.e. parts may be incorrect - please use it as a starting
> point only...
>
> So snd_fack effectively acts as a proxy for snd_una while SACK recovery
> is happening i.e. it tracks the highest sequence number we've been
> cumulatively SACKed for (left edge of the window).

snd_fack tracks the highest SEQ# that has been SACKed to us so we don't
retransmit beyond it during SACK recovery.

> sack_bytes_rexmit counts the number of bytes we've retransmitted from
> the known holes during the current recovery episode.

Yes, this is to avoid resending already resent data on the next incoming
(S)ACK and to find the next available hole to resend.

> snd_max is the highest sequence number we've sent (right edge of the
> window).

Yes.

> Another important bit of info is that the sender-side sack scoreboard
> stores holes (bytes missing at the receiver), whereas the sack blocks
> that come in on the wire refer to received byte ranges (bytes
> successfully received at the receiver).

Yes.  It's important to avoid confusion here.  We're talking about sender
side SACK (for retransmits we have to send).  There's also, but unrelated
to this discussion, the SACK blocks we send based on the data we received
and is waiting in the reassembly queue.

> Therefore from the senders perspective, I believe the "pseudo"
> calculation for bytes in flight during a SACK recovery episode should
> look something like:
>
> (snd_max - snd_fack) - total_hole_bytes + sack_bytes_rexmit

This isn't accurate either.  It likely over-estimates the amount of data
still in flight with multiple holes while "(snd_max - snd_fack) + sack_bytes_rexmit"
likely under-estimates it in the presence of re-ordering.

So instead of total_hole_bytes only those bytes from the holes that are assumed
lost should be counted per the hole retransmission rules.  It would require
some serious work to make our current SACK code track this contiguously. Also
for this reason SACK doesn't / shouldn't immediately retransmit any holes but
wait for more (S)ACKs to arrive (as specified in the applicable RFCs).
I haven't verified that our code does the right thing in all cases.

For reference the graphical representation:

           snd_una                                     snd_max
snd_seq .....|ooo#########ooo#######xxx#######***********|.....
                                     ^        ^
                                  snd_nxt  snd_fack
new_inflight                                  <--------->
hole_bytes    <->    +    <->   +   <->
sack_rexmit   <->    +    <->
total_inflt   <->    +    <->        +        <--------->

  x hole
  o hole retansmitted
  # SACKed
  * new inflight

To play devils advocate likely the best approximation compromise we can do at
the moment is:

inflight = (snd_max - snd_fack) + ((total_hole_bytes - sack_bytes_rexmit) / 2);

> 3 of those variables are already around and (hopefully) usable, but I
> don't believe we currently track a variable equivalent to total_hole_bytes.

Some of them only contain valid information when SACK is active.

> Hopefully someone will chime in and correct me if any of the above is
> misguided. Otherwise, your challenge, should you choose to accept it, is
> to patch SIFTR to perform the above calculation, and to verify that the
> reported result is correct - perhaps by running some controlled lab
> experiments where you can review the value reported by SIFTR against the
> known SACK blocks and sequence numbers on the wire during the recovery
> episode.

That would be an excellent verification.

> You could choose to make total_hole_bytes a sackhint like
> sack_bytes_rexmit, but I suggest you create a first version of the patch
> which manually walks the sack scoreboard each time through
> siftr_siftdata() to sum the # bytes in each hole to ensure you don't
> inadvertently introduce accounting bugs while developing the patch.
> Patch v2 can include total_hole_bytes as a sackhint.

Agreed.  There's a limit on the number of SACK holes so in certain situations
it may not be able to accurately reflect all available information.

-- 
Andre



More information about the freebsd-net mailing list