Consistently "high" CPU load on 10.0-STABLE
Steven Hartland
killing at multiplay.co.uk
Sun Jul 20 19:09:58 UTC 2014
Hmm, does vmstat -s indicate that 10 is swapping much more than 9?
I know there were some issues in early 10 which resulted in it
swapping more, but I was under the impression this was fixed by:
http://svnweb.freebsd.org/base?view=revision&revision=265944
http://svnweb.freebsd.org/base?view=revision&revision=265945
Maybe there's still an issue there?
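
A quick way to compare might be something along these lines on both boxes
(rough sketch; the exact counter names may differ slightly between 9 and 10):

  vmstat -s | grep -i 'swap pager'   # lifetime pager counters since boot
  swapinfo                           # current swap usage

If the 10 box shows the pageout counters climbing while the 9 box stays at
or near zero, that would point back at the pager.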
Regards
Steve
----- Original Message -----
From: "Jeremy Chadwick" <jdc at koitsu.org>
To: "Steven Hartland" <killing at multiplay.co.uk>
Cc: <freebsd-stable at freebsd.org>
Sent: Sunday, July 20, 2014 6:35 PM
Subject: Re: Consistently "high" CPU load on 10.0-STABLE
> Yes and no, heh... :-)
>
> Using top -a -H -S -z -s 1 and watching very very closely, what I end up
> seeing is that occasionally syncer reaches WCPU percentages of around
> 0.50% (or maybe higher) -- but when that happens, the actual load average
> **does not** suddenly increase.
>
> The load just seems to go from like 0.01 or 0.02 to 0.12 or 0.20
> sporadically, with no real evidence of "why" in top. Possibly this is
> because I'm using -s 1 (one-second update intervals) and whatever
> happens is so brief (less than a second) that top doesn't catch it,
> but it still impacts the load?
>
> Anyway, the "top" processes per TIME in the above command are here:
>
>   PID USERNAME  PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
>    17 root       16    -     0K    16K syncer 2   1:31  0.49% syncer
>     0 root      -16    0     0K  4912K swapin 0   1:04  0.00% kernel{swapper}
>     0 root      -92    0     0K  4912K -      2   0:17  0.00% kernel{em0 que}
>  1767 root       20    0 57312K 10104K select 1   0:15  0.00% smbd
>    12 root      -60    -     0K   528K WAIT   0   0:13  0.00% intr{swi4: clock}
>   643 dhcpd      20    0 24524K 13044K select 1   0:07  0.00% dhcpd
>    12 root      -88    -     0K   528K WAIT   1   0:04  0.00% intr{irq259: ahci0:ch}
>    14 root      -16    -     0K    16K -      2   0:04  0.00% rand_harvestq
> 58515 jdc        20    0 37028K  6196K select 2   0:03  0.00% sshd
>   420 bind       20    0 70872K 36552K kqread 0   0:02  0.00% named{named}
>    19 root       20    -     0K    16K sdflus 2   0:02  0.00% softdepflush
>   420 bind       20    0 70872K 36552K uwait  3   0:02  0.00% named{named}
>
> This is with a system uptime of 14 hours.
>
> Comparatively, the RELENG_9 box I have (although it's on a VPS), which
> does a lot more in general and has an uptime of 39 days, shows something
> like this:
>
>   PID USERNAME  PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
>    12 root      -60    -     0K   224K WAIT   0 132:49  0.00% intr{swi4: clock}
>     0 root      -92    0     0K   160K -      0  89:01  0.00% kernel{em0 taskq}
>    17 root       16    -     0K    16K syncer 0  45:46  0.00% syncer
>    12 root      -88    -     0K   224K WAIT   1  16:31  0.00% intr{irq14: ata0}
>    12 root      -60    -     0K   224K WAIT   1  12:52  0.00% intr{swi4: clock}
>    13 root       -8    -     0K    48K -      1   8:03  0.00% geom{g_down}
> 15490 halbot     20    0 76172K 22532K select 0   6:34  0.00% perl
>   593 bind       20    0 92288K 23740K uwait  1   4:39  0.00% named{named}
>   593 bind       20    0 92288K 23740K uwait  0   4:36  0.00% named{named}
>
> So syncer looks like it might be about right for both systems, but
> "swapper" (still not sure what that is exactly) sure seems to be much
> busier on the RELENG_10 box, which does basically nothing, than on the
> RELENG_9 box. Here's the swapper line from the RELENG_9 box:
>
> $ top -a -H -S -b 9999 | grep swapper
> 0 root -16 0 0K 160K swapin 0 0:55 0.00% kernel{swapper}
>
> It's the only suspect I have at this point but it's not very good
> evidence of anything. :/
>
> Maybe I can use while : ; do top -a -S -H -z -b 99999 >> somelog.txt ;
> sleep 0.25 ; done and let that run for a while in hopes of catching
> something? But I also worry that such a test would actually impact the
> load itself.
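>
> If I do try that, this is probably what I'd run (same idea, just with a
> timestamp per sample so I can line things up with the load spikes
> afterwards):
>
>   # one full thread listing roughly every quarter second, each preceded
>   # by a timestamp
>   while : ; do
>     date '+%F %T'
>     top -a -S -H -z -b 99999
>     sleep 0.25
>   done >> somelog.txt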
>
> On Sun, Jul 20, 2014 at 03:26:02PM +0100, Steven Hartland wrote:
>> If you add -H -z to your top command, does anything stand out?
>>
>> Regards
>> Steve
>> ----- Original Message -----
>> From: "Jeremy Chadwick" <jdc at koitsu.org>
>> To: <freebsd-stable at freebsd.org>
>> Sent: Sunday, July 20, 2014 7:24 AM
>> Subject: Consistently "high" CPU load on 10.0-STABLE
>>
>>
>> > (Please keep me CC'd as I'm not subscribed to freebsd-stable@)
>> >
>> > Today I took the liberty of upgrading my main home server from
>> > 9.3-STABLE (r268785) to 10.0-STABLE (r268894). The upgrade consisted of
>> > doing a fresh install of 10.0-STABLE on a brand new, unused SSD. Most
>> > everything went as planned, barring a couple of ports-related anomalies,
>> > and I was fairly impressed that buildworld times had dropped to
>> > 27 minutes and buildkernel to 4 minutes with clang (something I'd been
>> > avoiding like the plague for a long while). Kudos.
>> >
>> > But after an hour or so, I noticed a consistent (i.e. reproducible)
>> > trend: the system load average tends to hang around 0.10 to 0.15 all the
>> > time. There are times when the load drops to 0.03 or 0.04, but then
>> > something kicks it back up to 0.15 or 0.20, it slowly levels out again
>> > (over the course of a few minutes), and the cycle repeats.
>> >
>> > Obviously this is normal behaviour for a system when something is going
>> > on periodically. So I figured it might have been a userland process
>> > behaving differently under 10.x than under 9.x. I let top -a -S -s 1 run
>> > and paid very close attention to it for several minutes. Nothing. It
>> > doesn't appear to be anything in userland -- it appears to be something
>> > kernel-level, but nothing in top -S shows up as taking up any CPU time
>> > other than "[idle]", so I have no idea what might be doing it.
>> >
>> > The box isn't doing anything like routing network traffic or NAT; it's
>> > pure IPv4 (IPv6 disabled in world and kernel, and my home network does
>> > basically no IPv6) and it sits idle most of the time, fetching mail. It
>> > does use ZFS, but not for /, swap, /var, /tmp, or /usr.
>> >
>> > vmstat -i doesn't particularly show anything awful. All the cpuX:timer
>> > entries tend to fluctuate in rate, usually 120-200 or so; I'd expect an
>> > interrupt storm to show something in the 1000+ range.
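>> >
>> > If the raw numbers would help anyone, a loop along these lines should
>> > capture the per-interrupt rates over time (untested sketch, nothing fancy):
>> >
>> >   while : ; do date ; vmstat -i | egrep 'interrupt|timer' ; sleep 5 ; done >> vmstat_i.log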
>> >
>> > The only thing I can think of is the fact that the SSD being used has no
>> > 4K quirk entry in the kernel (and its ATA IDENTIFY responds with 512
>> > logical, 512 physical, even though we know it's 4K). The partitions are
>> > all 1MB-aligned regardless.
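>> >
>> > For anyone who wants to double-check me on the alignment, something like
>> > this shows the partition offsets and what the drive reports (ada0 here is
>> > just a placeholder for whatever device the SSD shows up as):
>> >
>> >   gpart show -p ada0   # partition offsets, in 512-byte sectors
>> >   diskinfo -v ada0     # sectorsize / stripesize as reported by the drive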
>> >
>> > This is all bare-metal, by the way -- no virtualisation involved.
>> >
>> > I do have DTrace enabled/built on this box, but I have absolutely no clue
>> > how to go about profiling things. For example, maybe output of this sort
>> > would be helpful (but I've no idea how to get it):
>> >
>> > http://lists.freebsd.org/pipermail/freebsd-stable/2014-July/079276.html
>> >
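>> > If that sort of output comes from the usual DTrace profile provider, my
>> > best guess (untested, so please correct the invocation) would be something
>> > like this to sample kernel stacks for a minute and dump the aggregate:
>> >
>> >   # sample kernel stacks at ~997 Hz for 60s, print counts per unique stack
>> >   dtrace -x stackframes=100 \
>> >     -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }' \
>> >     -o kernel_stacks.txt
>> >
>> > That should at least name whatever keeps waking up, if it's kernel-side.
>> >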
>> > I'm certain I didn't see this behaviour in 9.x so I'd be happy to try
>> > and track it down if I had a little bit of hand-holding.
>> >
>> > I've put all the things I can think of that might be relevant to "system
>> > config/tuning bits" up here:
>> >
>> > http://jdc.koitsu.org/freebsd/releng10_perf_issue/
>> >
>> > I should note my kernel config is slightly inaccurate (I've removed some
>> > stuff from the file in an attempt to rebuild, but building world prior to
>> > the kernel failed due to r268896 breaking world; anyone subscribed here
>> > has already seen the Jenkins job for that ;-) ).
>> >
>> > Thanks.
>> >
>> > --
>> > | Jeremy Chadwick jdc at koitsu.org |
>> > | UNIX Systems Administrator http://jdc.koitsu.org/ |
>> > | Making life hard for others since 1977. PGP 4BD6C0CB |
>> >
>
> --
> | Jeremy Chadwick jdc at koitsu.org |
> | UNIX Systems Administrator http://jdc.koitsu.org/ |
> | Making life hard for others since 1977. PGP 4BD6C0CB |
>