[colug-432] conflicting CPU usage information

Jeff Frontz jeff.frontz at gmail.com
Thu Apr 5 16:43:33 EDT 2012


In the olden days of Unix, processes blocked on I/O would definitely
count in the load average.  At one point, so would processes that were
being "traced".

Assuming you've already compared "ps" outputs and no extra stuff is
running on the "busier" system, I would suspect some process handling
I/O that is constantly getting queued to run but that doesn't have
much work to do once it runs.  See if iostat sheds any light (e.g.,
one machine shows far more I/O than the other).  Also, check whether
there is a flaky USB device (keyboard, mouse, etc.) that might be
generating interrupts.  Compare mpstat output.  Compare netstat output
(individually with -s, -L, -m) to see if there is any difference
(e.g., some wayward network device is mistaking the "busier" system
for a syslog server).  If you're using NFS, compare nfsstat output.
procinfo might give some more coarse statistics on interrupts, context
switches, etc. (but will probably tell you what you already know).
Lastly, try a very simple shell script that runs "ps r -A" in a loop
to see if you can catch something on the "busier" system's run queue.
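
A rough sketch of that sampling idea, in Python rather than shell so
the results can be tallied.  It assumes a Linux /proc filesystem, where
the load average counts tasks that are runnable (state R) as well as
tasks in uninterruptible sleep (state D, usually waiting on disk or
NFS).  The function names, the sample count, and the interval are all
made up for illustration; adjust to taste.

```python
import os
import time
from collections import Counter


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on
    # whitespace.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def runnable_or_blocked(proc="/proc"):
    """List (pid, comm, state) for processes in state R or D right now.

    Only the main thread of each process is checked; per-thread state
    under /proc/<pid>/task/ is ignored to keep the sketch short.
    """
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # the process exited while we were looking
        if state in ("R", "D"):
            found.append((pid, comm, state))
    return found


def sample(rounds=60, interval=1.0):
    """Poll repeatedly and count how often each (comm, state) shows up."""
    seen = Counter()
    for _ in range(rounds):
        for _pid, comm, state in runnable_or_blocked():
            seen[(comm, state)] += 1
        time.sleep(interval)
    return seen
```

Printing sample().most_common(10) on each machine and comparing the
two lists should show which commands keep turning up in R or D.  The
sampler itself will always appear in state R, so ignore that entry.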


Jeff


On Thu, Apr 5, 2012 at 1:17 PM, Rick Hornsby <richardjhornsby at gmail.com> wrote:
> Something is keeping the load high on two systems which are pretty
> much identically configured, but I haven't been able to find any
> reason why there is any load.  uptime on both prod systems shows
> high load averages for machines that are not really doing much:
>
> 01: load average: 1.04, 1.02, 1.00
> 02: load average: 2.09, 2.19, 2.13
>
> I've watched the CPU% for several minutes at a time, and it stays idle
> (top says idle, not sleep/wait/etc), but the load average doesn't
> drop.  I recognize that CPU% and load average are not interchangeable
> values for the purposes of a technical discussion.  They're related,
> however.  From Linux Journal
> (http://www.linuxjournal.com/article/9001):
>
> ===
> ...load averages do not include any processes or threads waiting on
> I/O, networking, databases or anything else not demanding the CPU.  It
> narrowly focuses on what is actively demanding CPU time.  This differs
> greatly from the CPU percentage. The CPU percentage is the amount of a
> time interval (that is, the sampling interval) that the system's
> processes were found to be active on the CPU.
> ...
> The load averages differ from CPU percentage in two significant ways:
> 1) load averages measure the trend in CPU utilization not only an
> instantaneous snapshot, as does percentage, and 2) load averages
> include all demand for the CPU not only how much was active at the
> time of measurement.
> ===
>
> If I watch top on 01, I can see the CPU% jumping between 0.1% and 1.0%
> - the rest is idle time.  (top and other normal processing could
> easily account for this) The load average, however, remains.  The same
> issue is on 02, but from appearances is even worse:
>
> top - 14:55:58 up 481 days,  2:35,  1 user,  load average: 2.04, 2.13, 2.12
> Tasks: 359 total,  1 running, 357 sleeping,  0 stopped,  1 zombie
> Cpu(s):  0.0% us,  0.1% sy,  0.1% ni, 99.7% id,  0.2% wa,  0.0% hi,  0.0% si
>
> When I sort top's process list by CPU time, I would expect to see
> something that has been hogging it - or at least racked up some
> serious time.  However, on 01, PatrolAgent (a monitoring tool) is the
> highest in terms of CPU time (577:22), but is showing an active CPU%
> of 0.  The same thing is showing up on 02 (619:36).
>
> Here's what my basic question is: how can there be such a high load
> average, but nothing is obviously consuming nearly enough to justify
> these numbers?  Is there anything else I can do to track down the
> offending process(es)?  Could the load average simply be wrong in the
> case of these servers?  Am I missing something really obvious?
>
> I posed this question on an internal discussion group at work and
> basically got "Linux Journal is wrong, Wikipedia says that NFS I/O
> will cause high load."  I'm willing to be wrong, and maybe NFS really
> is causing the load average.  However, I'm having a really hard time
> buying Wikipedia as an authoritative source on much of anything.
>
> thanks!
> -rick


