[colug-432] conflicting CPU usage information

Rick Hornsby richardjhornsby at gmail.com
Thu Apr 5 13:17:44 EDT 2012


Something is keeping the load average high on two prod systems that
are configured pretty much identically, but I haven't been able to
find any reason for it.  uptime on both shows high load averages even
though neither system is really doing much:

01: load average: 1.04, 1.02, 1.00
02: load average: 2.09, 2.19, 2.13

I've watched the CPU% for several minutes at a time, and it stays idle
(top says idle, not sleep/wait/etc), but the load average doesn't
drop.  I recognize that CPU% and load average are not interchangeable
values in a technical discussion, but they are related.  From Linux
Journal
(http://www.linuxjournal.com/article/9001):

===
...load averages do not include any processes or threads waiting on
I/O, networking, databases or anything else not demanding the CPU.  It
narrowly focuses on what is actively demanding CPU time.  This differs
greatly from the CPU percentage. The CPU percentage is the amount of a
time interval (that is, the sampling interval) that the system's
processes were found to be active on the CPU.
...
The load averages differ from CPU percentage in two significant ways:
1) load averages measure the trend in CPU utilization not only an
instantaneous snapshot, as does percentage, and 2) load averages
include all demand for the CPU not only how much was active at the
time of measurement.
===

When I watch top on 01, the CPU% fluctuates between 0.1% and 1.0%, and
the rest is idle time (top itself and other normal processing could
easily account for that).  The load average, however, doesn't budge.
02 has the same issue, and by appearances it is even worse:

top - 14:55:58 up 481 days,  2:35,  1 user,  load average: 2.04, 2.13, 2.12
Tasks: 359 total,  1 running, 357 sleeping,  0 stopped,  1 zombie
Cpu(s):  0.0% us,  0.1% sy,  0.1% ni, 99.7% id,  0.2% wa,  0.0% hi,  0.0% si
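
The same numbers can be read without top from /proc/loadavg, whose
fourth field also reports how many scheduling entities are executing
right now versus how many exist.  A minimal sketch of parsing it
follows; the function name parse_loadavg is hypothetical, and the sample
string in the comment is illustrative rather than captured from 01 or
02:

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict.

    Expected layout (see proc(5)), e.g. "2.04 2.13 2.12 1/359 12345":
    three load averages, executing/total scheduling entities, last PID.
    """
    fields = text.split()
    load = tuple(float(x) for x in fields[:3])
    executing, total = (int(x) for x in fields[3].split("/"))
    return {
        "load": load,            # 1, 5 and 15 minute averages
        "executing": executing,  # entities on a CPU at this instant
        "total": total,          # entities that currently exist
        "last_pid": int(fields[4]),
    }


if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        print(parse_loadavg(f.read()))
```

If "executing" stays at 1 (the reader itself) while the averages stay
above 1, whatever is being counted is probably not sitting on a CPU.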

When I sort top's process list by CPU time, I would expect to see
something that has been hogging the CPU, or that has at least racked
up some serious time.  However, on 01 the process with the most CPU
time is PatrolAgent (a monitoring tool) at 577:22, and it is showing
an active CPU% of 0.  The same is true on 02 (619:36).

My basic question: how can the load average be this high when nothing
is obviously consuming anywhere near enough CPU to justify these
numbers?  Is there anything else I can do to track down the offending
process(es)?  Could the load average simply be wrong on these servers?
Am I missing something really obvious?
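
One check that might narrow this down: the proc(5) man page describes
the Linux load average as counting jobs in the run queue (state R) or
waiting for disk I/O (state D), so a task stuck in D would add 1 to the
load without using any CPU.  The sketch below walks /proc and lists
every thread in either state.  The function names are hypothetical, and
it assumes a Linux /proc with per-thread task directories:

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # ')', so find the last ')' instead of splitting on whitespace.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def tasks_in_states(states=("R", "D"), proc_root="/proc"):
    """Return (tid, comm, state) for each thread in one of `states`."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        task_dir = os.path.join(proc_root, entry, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    info = parse_stat_line(f.read())
            except (OSError, ValueError):
                continue  # thread vanished or line was unreadable
            if info[2] in states:
                found.append(info)
    return found


if __name__ == "__main__":
    for tid, comm, state in tasks_in_states():
        print("%7d %s %s" % (tid, state, comm))
```

Running it a few times in a row should show whether the same one or two
tasks sit in D on each pass.  The script itself will always appear in
state R.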

I posed this question on an internal discussion group at work and
basically got "Linux Journal is wrong, Wikipedia says that NFS I/O
will cause high load."  I'm willing to be wrong, and maybe NFS really
is causing the load average.  However, I'm having a really hard time
buying Wikipedia as an authoritative source on much of anything.

thanks!
-rick

