Re: ena(4) tx timeout messages in dmesg

From: Pete Wright <pete_at_nomadlogic.org>
Date: Tue, 13 May 2025 14:43:07 UTC

On 5/12/25 19:52, Kiyanovski, Arthur wrote:
>> ---------- Forwarded message ---------
>> From: Pete Wright <pete@nomadlogic.org>
>> Date: Mon, 12 May 2025 at 12:30
>> Subject: Re: ena(4) tx timeout messages in dmesg
>> To: Colin Percival <cperciva@tarsnap.com>, <freebsd-cloud@freebsd.org>
>> Cc: Arthur Kiyanovski <akiyano@freebsd.org>
>>
>>
>>
>>
>> On 5/12/25 11:56, Colin Percival wrote:
>>> On 5/12/25 11:25, Pete Wright wrote:
>>>> On 5/12/25 11:17, Colin Percival wrote:
>>>>> On 5/12/25 11:04, Pete Wright wrote:
>>>>>> hey there - i have an ec2 instance that i'm using as a nfs server
>>>>>> and have noticed the following messages in my dmesg buffer:
>>>>>> [...]
>>>>>> ena0: Found a Tx that wasn't completed on time, qid 3, index 998. 1 msecs have passed since last cleanup. Missing Tx timeout value 5000 msecs.
>>>>>>
>>>>> I've heard that this can be caused by a thread being starved for
>>>>> CPU, possibly due to FreeBSD kernel scheduler issues, but that was
>>>>> on a far more heavily loaded system.  What instance type are you
>>>>> running on?
>>>>
>>>> oh of course, forgot to provide useful info:
>>>>
>>>> # uname -ar
>>>> FreeBSD airflow-nfs.q0.ringdna.net 14.2-RELEASE-p1 FreeBSD 14.2-RELEASE-p1 GENERIC amd64
>>>>
>>>> Instance type:
>>>> t3a.xlarge
>>>>
>>>> I also verified I have plenty of "burstable credit" available since
>>>> this is a t class system (the current balance is steady at
>>>
>>> Ah, this won't necessarily help you -- T family instances are on
>>> shared hardware so even if you have burstable credits it's possible
>>> that you'll be unlucky with "noisy neighbours" and the sibling
>>> instances will all want CPU at the same time as you.  But I think
>>> there's probably something else going on as well.
>>>
>>
>>
>> oh that's a good point.  since this is a pre-prod system, that's less of a
>> concern, as we want to limit spend when possible.  i'll be spinning up
>> production systems in the next week or so that will be on a "c" class
>> system, and i'll keep an eye out to see if similar messages show up in that
>> environment.
>>
>> -pete
>>
>> --
>> Pete Wright
>> pete@nomadlogic.org
> 
> Hi Colin, Pete,
> 
> Your analysis regarding the CPU being occupied is the classic explanation for
> this kind of print.
> 
> The prints are consistent with the CPU not being available for the interrupt
> handler to run.
> Although you say you have burstable credits available, the fact that you are using
> T instance types does make you more susceptible to such issues.
> 
> Also, when you say you have 25% CPU usage, how did you check that?
> Are you using tools that give you an average over some time window? If so, you
> may have 0% CPU usage 75% of the time and 100% CPU usage 25% of the time.
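> For example, sampling at one-second granularity would show short bursts that a
> longer averaging window hides; on FreeBSD something along these lines should
> do it (just a suggestion, any fine-grained sampling tool is fine):
> 
>     top -P -s 1
>     vmstat 1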
> 
> As you already suggested, the first thing we would like to eliminate is the T
> instance type.
> If all works - great!
> 
> If not, you may want to look into spreading the interrupts over the different
> CPUs using https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#io-irq-affinity
> and also make sure that the CPU-heavy processes you have are run on different
> CPUs than the ones handling the interrupts.
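> 
> For example (a rough, untested sketch - the exact IRQ numbers must be taken
> from your own system, see the README linked above for the full details):
> 
>     # list the IRQ vectors assigned to the ena queues
>     vmstat -ia | grep ena
>     # pin a queue's IRQ (e.g. 264) to a specific CPU with cpuset(1)
>     cpuset -x 264 -l 0
>     # and keep a CPU-heavy process off that CPU
>     cpuset -l 1-3 /path/to/heavy_workload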
> 
> Hope this helps,
> Arthur
> 

thanks for the context Arthur, I'll take a look at that sysctl knob.  as 
i said, the box is only serving a python virtual environment to a pool of 
ec2 compute nodes, and the dataset resides in memory, so nothing too 
crazy.  the load does have spikes, but they are pretty brief and rarely 
over 70%.  i'm collecting metrics via Telegraf, and also observe load 
via the usual suspects like top, systat, etc.

it does sound like ena(4) is particularly sensitive to cpu spikes 
though - at least with this vm configuration.  if i continue to see 
these messages in dmesg i'll test out distributing IRQs, otherwise i 
think i can chalk this up to a noisy neighbor or something similar.
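
as a starting point (just the commands i plan to try, not verified on 
this box yet), i'll check how the queue interrupts are currently spread 
with something like:

    vmstat -ia | grep ena

and poke around the dev.ena.0 sysctl tree for per-queue stats (assuming 
the driver exposes them there), then compare that against which CPUs the 
busy processes land on in top -P.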

thanks!
-pete



-- 
Pete Wright
pete@nomadlogic.org