ffmpeg & ULE

Wed Oct 19 01:20:17 UTC 2011

On Tuesday 18 October 2011 11:04:36 Urmas Lett wrote:
> Hello.
>
> Why is ffmpeg -threads massively slower with ULE than 4BSD?
>
> ffmpeg preset veryfast with sched_bsd:
> real    1m49.407s
> user    6m53.932s
> sys     0m1.700s
>
> ffmpeg preset veryfast with sched_ule:
> real    2m52.711s
> user    6m50.310s
> sys     0m1.582s
>
> #uname -a
> FreeBSD 9.0-RC1 FreeBSD 9.0-RC1 #0: Mon Oct 17 20:32:29 EEST

Since no one has offered any insight into the cause (yet), I'll explain what
I think is happening here.

SCHED_ULE tries to make threads "sticky" to a certain CPU. This has benefits
in the realm of cache utilization and memory locality for NUMA systems. This
works great when all running threads do useful work all the time. The
downside is that not all threads will receive equal amounts of CPU time.

SCHED_4BSD, on the other hand, reschedules running threads onto other CPUs
much more aggressively, with no regard for cache locality. This gives all
threads an equal share of CPU time.

I was unable to (quickly) find out the exact implementation details of
multithreaded ffmpeg, but I guess it simply splits each frame into N equal
parts and uses N threads to encode these parts. The master thread then
probably recombines them into the final frame. Because almost all encodings
compress video along the third dimension (time), it must wait for the current
frame to finish before it can start encoding the next one. Thus we end up
with a workload like this:

split1 -> N x encode1 -> recombine1 -> split2 -> N x encode2 -> recombine2
-> ...
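
One way to see why this pattern is sensitive to thread placement is a toy
model of a single frame. This is only a sketch under my guess above about how
ffmpeg works; it is neither ffmpeg nor scheduler code, and the names
frame_wall_time, cpu_time, spread and doubled are made up for illustration.
It assumes every slice costs the same and that a CPU runs the slices assigned
to it one after another.

```python
# Toy model of the per-frame split/encode/recombine barrier.
# Not ffmpeg or scheduler code; all names here are hypothetical.

def frame_wall_time(placement, slice_cost):
    """Wall time of one frame when each CPU runs its slices serially.

    placement  -- placement[i] is the CPU that encoder thread i runs on
    slice_cost -- CPU seconds needed to encode one slice (same for all)
    """
    load = {}
    for cpu in placement:
        load[cpu] = load.get(cpu, 0.0) + slice_cost
    # Recombining has to wait for the busiest CPU to finish.
    return max(load.values())

def cpu_time(placement, slice_cost):
    """Total CPU seconds consumed; does not depend on placement."""
    return len(placement) * slice_cost

spread = [0, 1, 2, 3]    # one thread per CPU
doubled = [0, 0, 1, 2]   # two threads left on CPU 0, CPU 3 idle

for name, placement in (("spread", spread), ("doubled", doubled)):
    print(name, frame_wall_time(placement, 1.0), cpu_time(placement, 1.0))
```

In this model the "doubled" placement takes twice the wall time of the
"spread" one for the same total CPU time, because the barrier at the end of
each frame waits for the slowest CPU.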

This bursty behavior is not a good match for ULE, because ffmpeg's master
thread assumes that all worker threads take equally long. ULE has no time to
properly load balance the threads before they die (or stop doing work). Your
timings clearly show that SCHED_ULE is just as fast in terms of CPU time
spent (user + sys). However, because it is not nearly as aggressive as
SCHED_4BSD in stealing threads from busy CPUs, some CPUs sit idle part of
the time, and that idle time accounts for the difference in wall-clock
(real) time. There are some tunable sysctls (kern.sched.*) that might help
in this scenario.
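
To put a number on that idle time, the quoted timings can be turned into an
average count of busy CPUs, (user + sys) / real. The sketch below only does
that arithmetic; the helper names are made up, and since the number of cores
in the machine was not stated, it says nothing about how many CPUs were
available.

```python
# Average number of busy CPUs derived from the time(1) output quoted above.
# Helper names are hypothetical; the figures are copied from the quoted post.

def seconds(minutes, secs):
    return minutes * 60 + secs

timings = {
    "4BSD": {"real": seconds(1, 49.407),
             "user": seconds(6, 53.932),
             "sys": seconds(0, 1.700)},
    "ULE": {"real": seconds(2, 52.711),
            "user": seconds(6, 50.310),
            "sys": seconds(0, 1.582)},
}

def avg_busy_cpus(t):
    # CPU seconds consumed per second of wall-clock time.
    return (t["user"] + t["sys"]) / t["real"]

for name, t in timings.items():
    print(f"{name}: {avg_busy_cpus(t):.2f} CPUs busy on average")
```

This works out to roughly 3.80 busy CPUs under 4BSD against roughly 2.38
under ULE, for nearly identical total CPU time.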

I bet that if you ran two ffmpeg processes in parallel, you'd get about the
same runtimes with both schedulers.

(Disclaimer: I have collected most of this information from the mailing
lists, not the actual code, so I could be completely wrong.)

-- 
Pieter de Goeje