VIMAGE UDP memory leak fix

Fri Nov 21 11:22:30 UTC 2014

On 21 Nov 2014, at 11:02, Marko Zec <zec at fer.hr> wrote:

> Now that we've found ourselves in this discussion, I'm really
> becoming curious why exactly do we need UMA_ZONE_NOFREE for network
> stack zones at all?   Admittedly, I always thought that the primary
> purpose of UMA_ZONE_NOFREE was to prevent uma_reclaim() from paging out
> _used_ zone pages, but reviewing the uma code reveals that this might
> not be the case, i.e. that NOFREE only prevents _unused_ pages to be
> freed by uma_reclaim().
> 
> Moreover, all uma_zalloc() calls as far as I can see are flagged as
> M_NOWAIT and are followed by checks for allocation failures, so that
> part seems to be covered.
> 
> So, what's really the problem which UMA_ZONE_NOFREE flagging is supposed
> to solve these days? (you claim that we clearly need it for TCP - why)?

UMA_ZONE_NOFREE tells UMA that it can't reclaim the zone's unused slabs and return them to the VM system for reuse elsewhere under memory pressure. UMA memory isn't pageable, so there's no link to paging policy: although soft-TLB systems might experience TLB miss exceptions on UMA-allocated kernel memory, you should never experience a page fault against it (in the absence of a bug). Reclaim of unused slabs can happen, for example, if VM discovers it is low on free pages, in which case it notifies various kernel subsystems that it is feeling a bit cramped -- the same mechanism that, for example, triggers TCP to throw away reassembly buffers that haven't yet been ACK'd (although they might have been SACK'd). You might expect this to happen in situations where first a large load spike happens for a particular UMA type (e.g., a DDoS opens lots of TCP connections), and then the objects are freed, leaving lots of socket/inpcb slabs lying around unused, which VM will eventually ask to be returned. It is highly desirable for UMA_ZONE_NOFREE to be removed from zones wherever possible so that memory can be returned under such circumstances, and it is not a good feature that the flag is present anywhere.
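
To make the reclaim behaviour concrete, here is a toy userland model of a slab zone. It is only a sketch and not the UMA implementation or API: every name in it (toy_zone, toy_zalloc, toy_reclaim, TOY_ZONE_NOFREE, and so on) is hypothetical, and malloc()/free() stand in for taking pages from and returning pages to VM. It shows that a reclaim pass only ever touches slabs with no items in use, and that a zone flagged NOFREE keeps even those.

```c
/* Toy userland model of slab reclaim; NOT the FreeBSD UMA API. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_ZONE_NOFREE    0x1  /* hypothetical; mirrors the idea of UMA_ZONE_NOFREE */
#define TOY_MAX_SLABS      8
#define TOY_ITEMS_PER_SLAB 4

struct toy_slab {
    void *mem;       /* backing memory, NULL if the slab is not allocated */
    unsigned used;   /* bitmask of items currently handed out */
};

struct toy_zone {
    int flags;
    size_t item_size;
    struct toy_slab slabs[TOY_MAX_SLABS];
};

void toy_zone_init(struct toy_zone *z, size_t item_size, int flags)
{
    assert(item_size > 0);
    z->flags = flags;
    z->item_size = item_size;
    for (int i = 0; i < TOY_MAX_SLABS; i++) {
        z->slabs[i].mem = NULL;
        z->slabs[i].used = 0;
    }
}

/* Allocate one item; returns NULL on failure (think M_NOWAIT). */
void *toy_zalloc(struct toy_zone *z)
{
    struct toy_slab *empty = NULL;

    for (int i = 0; i < TOY_MAX_SLABS; i++) {
        struct toy_slab *s = &z->slabs[i];
        if (s->mem == NULL) {
            if (empty == NULL)
                empty = s;          /* remember a slot for a new slab */
            continue;
        }
        for (int j = 0; j < TOY_ITEMS_PER_SLAB; j++) {
            if ((s->used & (1u << j)) == 0) {
                s->used |= 1u << j;
                return (char *)s->mem + (size_t)j * z->item_size;
            }
        }
    }
    if (empty == NULL)
        return NULL;                /* zone is at its slab limit */
    empty->mem = malloc(z->item_size * TOY_ITEMS_PER_SLAB);
    if (empty->mem == NULL)
        return NULL;
    empty->used = 1u;               /* hand out item 0 of the new slab */
    return empty->mem;
}

/* Return an item to its slab; the slab itself stays cached in the zone. */
void toy_zfree(struct toy_zone *z, void *item)
{
    uintptr_t p = (uintptr_t)item;

    for (int i = 0; i < TOY_MAX_SLABS; i++) {
        struct toy_slab *s = &z->slabs[i];
        uintptr_t base = (uintptr_t)s->mem;
        if (s->mem == NULL || p < base ||
            p >= base + z->item_size * TOY_ITEMS_PER_SLAB)
            continue;
        s->used &= ~(1u << ((p - base) / z->item_size));
        return;
    }
}

/* Memory-pressure hook: release completely unused slabs, unless NOFREE. */
int toy_reclaim(struct toy_zone *z)
{
    int released = 0;

    if (z->flags & TOY_ZONE_NOFREE)
        return 0;                   /* unused slabs are pinned forever */
    for (int i = 0; i < TOY_MAX_SLABS; i++) {
        struct toy_slab *s = &z->slabs[i];
        if (s->mem != NULL && s->used == 0) {
            free(s->mem);           /* stands in for returning pages to VM */
            s->mem = NULL;
            released++;
        }
    }
    return released;
}

/* Number of slabs the zone is currently holding on to. */
int toy_slab_count(const struct toy_zone *z)
{
    int n = 0;

    for (int i = 0; i < TOY_MAX_SLABS; i++)
        if (z->slabs[i].mem != NULL)
            n++;
    return n;
}
```

In this model, a spike of eight allocations followed by eight frees leaves two idle slabs behind. toy_reclaim() hands both back for an ordinary zone and neither for a TOY_ZONE_NOFREE zone, which is the cost of the flag described above.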

Subsystems pick up a dependence on UMA_ZONE_NOFREE if freed objects might be referenced after free. My understanding is that this is pretty old inherited behaviour from prior kernel memory allocators that didn't know how to return memory to VM. Given that property, it was safe to write code that might, for the purposes of efficiency, assume that it could walk data structures of that type with fewer synchronisation overheads -- or where synchronisation isn't possible (e.g., for direct access to kernel memory via /dev/kmem). We have been attempting to expunge those assumptions wherever possible -- these days, netstat uses sysctl()s that acquire references to all live inpcbs, keeping them valid while they are copied out (you can't hold low-level locks during copyout() as sysctl might encounter a paging event writing to user memory). Convincing yourself that all such assumptions have been removed is a moderate amount of work, and if you get it wrong, you get use-after-free races that occur only in low-memory conditions, which are quite hard to figure out (read: almost impossible).
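
The reference-holding pattern can be sketched in userland as follows. This is an illustration under stated assumptions and not the actual inpcb or sysctl code: toy_pcb and all of the toy_* functions are hypothetical, locking is reduced to comments, and the freed_flag field exists only so that the moment of the real free can be observed.

```c
/* Toy model of "take references, drop the lock, copy out, release". */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_pcb {
    int refcount;     /* the "live" list owns one reference */
    int port;
    int *freed_flag;  /* illustration hook: set to 1 when really freed */
};

struct toy_pcb *toy_pcb_new(int port, int *freed_flag)
{
    struct toy_pcb *p = malloc(sizeof(*p));

    assert(p != NULL);
    p->refcount = 1;
    p->port = port;
    p->freed_flag = freed_flag;
    return p;
}

void toy_pcb_ref(struct toy_pcb *p)
{
    p->refcount++;
}

/* Drop one reference; the object is freed only when the last one goes. */
void toy_pcb_rele(struct toy_pcb *p)
{
    if (--p->refcount == 0) {
        if (p->freed_flag != NULL)
            *p->freed_flag = 1;
        free(p);
    }
}

/* Phase 1: would run with the list lock held. Pin every live object. */
size_t toy_snapshot(struct toy_pcb **live, size_t n, struct toy_pcb **snap)
{
    for (size_t i = 0; i < n; i++) {
        toy_pcb_ref(live[i]);
        snap[i] = live[i];
    }
    return n;
}

/*
 * Phase 2: lock dropped. In a kernel this is where copyout() could
 * sleep on a page fault; the references keep the objects valid.
 */
void toy_copyout(struct toy_pcb **snap, size_t n, int *ports)
{
    for (size_t i = 0; i < n; i++)
        ports[i] = snap[i]->port;
}

/* Phase 3: unpin; objects closed in the meantime are freed here. */
void toy_snapshot_release(struct toy_pcb **snap, size_t n)
{
    for (size_t i = 0; i < n; i++)
        toy_pcb_rele(snap[i]);
}
```

If the owner drops its reference between phase 1 and phase 2 (the connection closes while the copy is in progress), the object survives until toy_snapshot_release(), so the copy never reads freed memory and the code has no need for the allocator to keep freed slabs around.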

Bjoern can say more about what motivated his specific comment -- I had hoped that we'd quietly lost dependence on NOFREE over the last decade and could finally garbage collect it, but perhaps not!

Robert