[colug-432] conflicting CPU usage information

Jon Miller jonebird at gmail.com
Thu Apr 5 14:09:49 EDT 2012


I would actually agree with the Linux Journal excerpts you've
provided. The load average is the average number of processes ready to
run on the CPU. If a process is blocked waiting on an I/O request, it
wouldn't be in that queue. I personally like Robert Love's book "Linux
Kernel Development"; it contains a similar process state diagram, a
variation of which I quickly found via Google:
http://wiki.kldp.org/pds/ProcessManagement/state_diagram.jpg (Here,
we're talking about the middle-left circle.)
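As a quick way to see the numbers behind that (a sketch of mine, using the /proc/loadavg layout documented in proc(5)): the kernel publishes the three load averages right alongside a runnable/total task count:

```shell
# /proc/loadavg holds the 1-, 5- and 15-minute load averages, then a
# "runnable/total" task count, then the most recently created PID.
awk '{ printf "load: %s %s %s  runnable/total: %s\n", $1, $2, $3, $4 }' /proc/loadavg
```

The fourth field is a direct peek at the queue the averages are smoothing.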

Aside from load averages (again, just the average count of processes
ready to execute but having to wait their turn), the CPU distribution
is just a ratio report of how the CPU is spending its time. For each
process, the kernel keeps track of how many clock ticks were spent
executing code in user space, in kernel space, idling without any work
to do, etc. The accounting is pure integer accumulation and the
reports are merely a percentage breakdown. You can see this accounting
data for each process under /proc/XXXX/ and/or use "ps" to report
some of those values. (E.g. try: "ps -eo
pid,user,pcpu,cputime,etime,args".) The trick with programs like "ps"
is that they are just looking at a single snapshot in time, so if
you're using "ps" you'll want to grab several snapshots and see which
processes are actually accumulating cputime. See below for a
sample shell function I've created in the past to help myself answer
these very questions.
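For instance (a sketch of mine, separate from the function below), the raw counters behind ps's "cputime" column live in fields 14 (utime) and 15 (stime) of /proc/<pid>/stat, counted in clock ticks:

```shell
# Print user and system CPU time, in seconds, for the current shell.
# Strip everything through the ')' that closes the comm field first,
# since comm may contain spaces that would throw off field counting;
# after that, overall fields 14 and 15 land at positions 12 and 13.
pid=$$
ticks=$(getconf CLK_TCK)                    # ticks per second, usually 100
rest=$(sed 's/^.*) //' "/proc/$pid/stat")   # fields 3 onward
set -- $rest                                # load fields as positional params
utime=${12}; stime=${13}
echo "utime: $((utime / ticks))s  stime: $((stime / ticks))s"
```

Dividing by CLK_TCK is what turns the kernel's integer tick counters into the seconds ps reports.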

I typically like to run "vmstat 10" on a system for my initial
analysis. The first "r" column (the run queue) is analogous to the
load average values you are seeing. Aside from the run queue,
you can also watch the breakdown of CPU time between user (us),
system (sy), idle (id) and I/O wait (wa). I'd also mention the value
of the memory- and swap-related columns, but it sounds like you're
purely focused on CPU load.
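If vmstat isn't handy, the same us/sy/id/wa breakdown can be derived by hand from the aggregate "cpu" line of /proc/stat. This is a rough sketch of the idea, not what vmstat literally prints:

```shell
# Percentage breakdown of CPU time over a 5-second interval, from the
# first line of /proc/stat: "cpu user nice system idle iowait ...".
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 5
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
total=$(( (u2 - u1) + (n2 - n1) + (s2 - s1) + (i2 - i1) + (w2 - w1) ))
printf 'us %d%%  sy %d%%  id %d%%  wa %d%%\n' \
    $(( 100 * (u2 - u1 + n2 - n1) / total )) \
    $(( 100 * (s2 - s1) / total )) \
    $(( 100 * (i2 - i1) / total )) \
    $(( 100 * (w2 - w1) / total ))
```

Same integer-accumulation idea as the per-process counters: two samples, a delta, and a ratio.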

Finally, I wouldn't say a run queue of 2.XX is high at all. I don't
usually worry about single-digit load averages. It would seem you have
plenty of CPU headroom available to you.
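One caveat I'd add: load averages scale with the number of CPUs, so a load of 2.XX means something different on a 2-core box than on a 16-core one. A quick way to normalize (my one-liner, assuming GNU coreutils' nproc is available):

```shell
# 1-minute load average per CPU; values well under 1.0 per CPU
# generally mean there is headroom.
awk -v n="$(nproc)" '{ printf "%.2f load per CPU (%d CPUs)\n", $1 / n, n }' /proc/loadavg
```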

Here is the shell function I was referring to. Copy and paste it into
your shell, then run "topprocesses" and see what you get. If you get
nothing, that means no process is standing out as a top consumer
(i.e. none used more than 125% of the average per-process CPU time
during the sample). It also requires root to be able to read all of
the /proc/*/stat entries. The comma-separated pid list it prints is
handy for feeding straight into ps, e.g.
ps -o pid,pcpu,cputime,args -p "$(topprocesses)".

topprocesses() {
    # Fields 14 and 15 of /proc/<pid>/stat are the process' utime and
    # stime: clock ticks spent in user and kernel mode. Field 1 is the
    # pid. (Caveat: a comm field containing spaces shifts awk's field
    # numbering for that one process.)
    SAMPLING_PERIOD=5 # seconds
    TMP="${TMPDIR:-/tmp}/"
    #------------------------------
    awk '{ print $1 " " $14+$15 }' /proc/*/stat > "${TMP}statlist1" 2>/dev/null
    sleep $SAMPLING_PERIOD
    awk '{ print $1 " " $14+$15 }' /proc/*/stat > "${TMP}statlist2" 2>/dev/null
    # Now, which processes are _currently_ eating the CPU?
    awk '{
        if ($1 in times) {
            diff = $2 - times[$1]
            if (diff < 0) diff = -diff
            print diff " " $1
        } else
            times[$1] = $2
    }' "${TMP}statlist1" "${TMP}statlist2" > "${TMP}statdiff"
    #sort -n ${TMP}statdiff
    average=$(awk '{ if ($1 != 0) { N++; total += $1 } }
                   END { if (N) print total / N }' "${TMP}statdiff")
    awk -v avg="${average:-0}" '{ if (avg > 0 && $1 > avg * 1.25) print $2 }' \
        "${TMP}statdiff" | xargs | sed 's/ /,/g'
}

-- Jon Miller

On Thu, Apr 5, 2012 at 1:17 PM, Rick Hornsby <richardjhornsby at gmail.com> wrote:
>
> Something is keeping the load high on two systems that are pretty
> much identically configured, but I haven't been able to find any
> reason for the load.  Uptime on both prod systems shows high load
> averages despite the machines not really doing much:
>
> 01: load average: 1.04, 1.02, 1.00
> 02: load average: 2.09, 2.19, 2.13
>
> I've watched the CPU% for several minutes at a time, and it stays idle
> (top says idle, not sleep/wait/etc), but the load average doesn't
> drop.  I recognize that CPU% and load average are not interchangeable
> values for the purposes of a technical discussion.  They're related,
> however.  From Linux Journal
> (http://www.linuxjournal.com/article/9001):
>
> ===
> ...load averages do not include any processes or threads waiting on
> I/O, networking, databases or anything else not demanding the CPU.  It
> narrowly focuses on what is actively demanding CPU time.  This differs
> greatly from the CPU percentage. The CPU percentage is the amount of a
> time interval (that is, the sampling interval) that the system's
> processes were found to be active on the CPU.
> ...
> The load averages differ from CPU percentage in two significant ways:
> 1) load averages measure the trend in CPU utilization not only an
> instantaneous snapshot, as does percentage, and 2) load averages
> include all demand for the CPU not only how much was active at the
> time of measurement.
> ===
>
> If I watch top on 01, I can see the CPU% jumping between 0.1% and 1.0%
> - the rest is idle time.  (top and other normal processing could
> easily account for this) The load average, however, remains.  The same
> issue is on 02, but from appearances is even worse:
>
> top - 14:55:58 up 481 days,  2:35,  1 user,  load average: 2.04, 2.13, 2.12
> Tasks: 359 total,  1 running, 357 sleeping,  0 stopped,  1 zombie
> Cpu(s):  0.0% us,  0.1% sy,  0.1% ni, 99.7% id,  0.2% wa,  0.0% hi,  0.0% si
>
> When I sort top's process list by CPU time, I would expect to see
> something that has been hogging it - or at least racked up some
> serious time.  However, on 01, PatrolAgent (a monitoring tool) is the
> highest in terms of CPU time (577:22), but is showing an active CPU%
> of 0.  The same thing is showing up on 02 (619:36).
>
> Here's what my basic question is: how can there be such a high load
> average, but nothing is obviously consuming nearly enough to justify
> these numbers?  Is there anything else I can do to track down the
> offending process(es)?  Could the load average simply be wrong in the
> case of these servers?  Am I missing something really obvious?
>
> I posed this question on an internal discussion group at work and
> basically got "Linux Journal is wrong, Wikipedia says that NFS I/O
> will cause high load."  I'm willing to be wrong, and maybe NFS really
> is causing the load average.  However, I'm having a really hard time
> buying Wikipedia as an authoritative source on much of anything.
>
> thanks!
> -rick
> _______________________________________________
> colug-432 mailing list
> colug-432 at colug.net
> http://lists.colug.net/mailman/listinfo/colug-432


