SCHED_ULE should not be the default

Adrian Chadd adrian at freebsd.org
Thu Dec 22 09:08:00 UTC 2011


Are you able to go through the emails here and grab out Attilio's
example for generating KTR scheduler traces?


Adrian

On 21 December 2011 16:52, Steve Kargl <sgk at troutmask.apl.washington.edu> wrote:
> On Fri, Dec 16, 2011 at 12:14:24PM +0100, Attilio Rao wrote:
>> 2011/12/15 Steve Kargl <sgk at troutmask.apl.washington.edu>:
>> > On Thu, Dec 15, 2011 at 05:25:51PM +0100, Attilio Rao wrote:
>> >>
>> >> I basically went through all the e-mails you just sent and identified 4
>> >> real reports we could work on, summarized in the attached Excel file.
>> >> I'd like George, Steve, Doug, Andrey and Mike to review the data
>> >> there and add more, if they want, or make further clarifications, in
>> >> particular about the presence (or absence) of Xorg in their
>> >> workloads.
>> >
>> > Your summary of my observations appears correct.
>> >
>> > I have grabbed an up-to-date /usr/src, built and
>> > installed world, and built and installed a new
>> > kernel on one of the nodes in my cluster.  It
>> > has
>> >
>>
>> That seems a perfect environment; just please make sure you built a
>> debug-free userland (basically, setting MALLOC_PRODUCTION to disable
>> jemalloc's debugging).
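>>
>> (A minimal sketch of one way to do that, assuming the stock world
>> build: define the knob in /etc/make.conf and rebuild/reinstall
>> userland,
>>
>> # echo 'MALLOC_PRODUCTION=yes' >> /etc/make.conf
>> # cd /usr/src && make buildworld && make installworld
>>
>> where '#' is the root prompt.)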
>>
>> The first thing is: can you try reproducing your case? As far as I
>> understood it, for you it was enough to run N plus a small number of
>> CPU-bound threads to show the performance penalty, so I'd ask you to
>> start with dnetc or just your preferred CPU-bound workload and verify
>> you can reproduce the issue.
>> While it runs, please monitor the thread bouncing and CPU utilization
>> via 'top' (you don't need to be 100% precise, just get an idea, and
>> keep an eye on things like excessive thread migration, threads stuck
>> obsessively to a single CPU, or low CPU throughput).
>> One note: if your workloads need to do I/O, please use a tmpfs or
>> other memory-backed storage for it, in order to keep I/O effects out
>> of the picture as much as possible.
>> Also, verify this doesn't happen with the 4BSD scheduler, just in case.
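>>
>> (A rough sketch, in case it helps: if dnetc isn't handy, N CPU-bound
>> threads can be faked with plain shell busy loops, e.g. six of them:
>>
>> # for i in 1 2 3 4 5 6; do sh -c 'while :; do :; done' & done
>>
>> and for the monitoring, top's -H flag shows individual threads and -S
>> the system ones:
>>
>> # top -HS
>>
>> On an SMP kernel the 'C' column there shows which CPU each thread
>> last ran on, which is handy for spotting migrations; -P, if your top
>> has it, adds per-CPU statistics.)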
>>
>> Finally, if the problem persists, please rebuild your kernel after
>> adding:
>> options KTR
>> options KTR_ENTRIES=262144
>> options KTR_COMPILE=(KTR_SCHED)
>> options KTR_MASK=(KTR_SCHED)
>>
>> And reproduce the issue.
>> When you are in the middle of the scheduling problem, run:
>> # ktrdump -ctf > ktr-ule-problem-YOURNAME.out
>>
>> and send it to the mailing list along with your dmesg and the
>> information on the CPU utilization you gathered via top(1).
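>>
>> One extra trick, if memory serves: the compiled-in mask has a runtime
>> counterpart, debug.ktr.mask, so the trace buffer can be frozen right
>> before dumping it, to avoid overwriting the interesting window:
>>
>> # sysctl debug.ktr.mask=0
>> # ktrdump -ctf > ktr-ule-problem-YOURNAME.out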
>>
>> That should cover it all, but if you have further questions, please
>> just go ahead.
>
> Attilio,
>
> I have placed several files at
>
> http://troutmask.apl.washington.edu/~kargl/freebsd
>
> dmesg.txt      --> dmesg for ULE kernel
> summary        --> A summary that includes top(1) output of all runs.
> sysctl.ule.txt --> sysctl -a for the ULE kernel
> ktr-ule-problem-kargl.out.gz
>
> I performed a series of tests with both 4BSD and ULE kernels.
> The 4BSD and ULE kernels are identical except of course for the
> scheduler.  Both WITNESS and INVARIANTS are disabled, and malloc
> has been compiled without debugging.
>
> Here's what I did.  On the master node in my cluster, I ran an
> OpenMPI code that sends N jobs off to the node with the kernel
> of interest.  There is communication between the master and
> slaves to generate 16 independent chunks of data.  Note, there
> is no disk IO.  So, for example, N=4 will start 4 essentially
> identical, numerically intensive jobs.  At the start of a run,
> the master node instructs each slave job to create a chunk of
> data.  After the data is created, the slave sends it back to the
> master and the master sends instructions to create the next chunk
> of data.  This communication continues until the 16 chunks have
> been assigned, computed, and returned to the master.
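>
> As an aside, for anyone wanting to replicate the setup: an OpenMPI
> machinefile just names each host and its slot count, one per line.
> A hypothetical one-line example standing in for the real mf3 used
> below:
>
>   node2 slots=8
>
> where node2 would be the node running the kernel under test.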
>
> Here is a rough measurement of the problem with ULE under numerically
> intensive loads.  The following command is executed on the master
>
> time mpiexec -machinefile mf3 -np N sasmp sas.in
>
> Since time is executed on the master, only the 'real' time is of
> interest (the summary file includes user and sys times).  This
> command was run 5 times for each N value, and up to 10 times for
> some N values with the ULE kernel.  The following table records
> the average 'real' time in seconds; the number in (...) is the mean
> absolute deviation.
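>
> A sketch of the bookkeeping, for anyone who wants to replicate it:
> each configuration can be run in a loop on the master,
>
>   for i in 1 2 3 4 5; do      # N is set per configuration
>       time mpiexec -machinefile mf3 -np $N sasmp sas.in
>   done
>
> and the average and mean absolute deviation computed from the
> collected 'real' times (one value per line; times.txt is a
> hypothetical name):
>
>   awk '{ t[NR] = $1; s += $1 }
>        END { m = s / NR
>              for (i = 1; i <= NR; i++) d += (t[i] > m ? t[i] - m : m - t[i])
>              printf "%.2f (%.3f)\n", m, d / NR }' times.txt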
>
> #  N         ULE             4BSD
> # -------------------------------------
> #  4    223.27 (0.502)   221.76 (0.551)
> #  5    404.35 (73.82)   270.68 (0.866)
> #  6    627.56 (173.0)   247.23 (1.442)
> #  7    475.53 (84.07)   285.78 (1.421)
> #  8    429.45 (134.9)   223.64 (1.316)
>
> To me, these numbers demonstrate that ULE is not a good choice
> for an HPC workload.
>
> If you need more information, feel free to ask.  If you would
> like access to the node, I can probably arrange that.  But,
> we can discuss that off-line.
>
> --
> Steve

