Newsgroups: comp.parallel.pvm
From: Clark Dorman <clark@s3i.com>
Subject: Re: Host load on PVM
Organization: PSINet
Date: 30 Jul 1998 12:05:07 -0400
Message-ID: <dogu78guk.fsf@elmo.s3i.com>


shgs@my-dejanews.com writes:
> 
>   I'm a PVM 3.3.11 user on a network with the following architectures: SUN4,
> SUN4SOL2, RS6K, SGI, and SGI64. I would like to develop a task scheduling
> algorithm based not only on the distribution of my own PVM tasks but also on
> the real machine load on the hosts. I am thinking of using the output of the
> "uptime" command on the hosts since I didn't find any PVM function for
> finding this kind of data. Am I mistaken or I have to develop special slaves
> to get the load info from my hosts?

Sort of.  Below are my thoughts, although I'm no expert.

There are a couple of ways to get the information.  The way
that I do it is as follows:

	o  main program adds all the machines to pvm
	o  main program starts a slave on each machine
	o  each slave figures out what the load is on its own machine
	o  each slave sends the information to the main program (parent)
	   every 30 seconds or so.
	o  main program receives information, which tells it two
	   things:  
		1)  that the machine is there and part of pvm
		2)  what the load is.
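
A minimal sketch of the bookkeeping the main program might do with
those reports.  Everything here is an assumption made for illustration
(the names host_table, record_report and pick_host, the table size,
and the 90 second staleness limit); the PVM message handling itself is
left out.

```c
/* Master-side bookkeeping for per-host load reports (illustrative only).
   All names are hypothetical; no PVM calls are made here. */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define MAX_HOSTS   32
#define NAME_LEN    64
#define STALE_SECS  90   /* assumed: three missed 30-second reports */

struct host_report {
        char    name[NAME_LEN];
        double  load;           /* most recent load sent by the slave */
        time_t  last_seen;      /* when that report arrived           */
};

struct host_table {
        struct host_report hosts[MAX_HOSTS];
        size_t             count;
};

/* Store a report from a slave.  Returns 0, or -1 if the table is full. */
int record_report(struct host_table *t, const char *name,
                  double load, time_t now)
{
        size_t i;

        assert(t != NULL && name != NULL);
        for (i = 0; i < t->count; i++) {
                if (strcmp(t->hosts[i].name, name) == 0)
                        break;
        }
        if (i == t->count) {            /* first report from this host */
                if (t->count == MAX_HOSTS)
                        return -1;
                strncpy(t->hosts[i].name, name, NAME_LEN - 1);
                t->hosts[i].name[NAME_LEN - 1] = '\0';
                t->count++;
        }
        t->hosts[i].load = load;
        t->hosts[i].last_seen = now;
        return 0;
}

/* Least-loaded host that has reported recently, or NULL if none has. */
const struct host_report *pick_host(const struct host_table *t, time_t now)
{
        const struct host_report *best = NULL;
        size_t i;

        assert(t != NULL);
        for (i = 0; i < t->count; i++) {
                const struct host_report *h = &t->hosts[i];

                if (difftime(now, h->last_seen) > STALE_SECS)
                        continue;       /* silent too long: treat as down */
                if (best == NULL || h->load < best->load)
                        best = h;
        }
        return best;
}
```

Start with a zeroed host_table, call record_report() each time a
report has been received and unpacked (for example after pvm_recv),
and call pick_host() before each pvm_spawn.  A host that stops
reporting simply drops out of consideration.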

Another option is to have the main program check the loads itself, but
then you still have to check that the machine is up and part of pvm.

How to get the load?  That's where it gets a little tricky.  'uptime'
is one of a variety of programs that get system information.  I
usually use 'rup', which checks multiple machines on the network, or
you can just do a 'rup machinename'.

However, you don't really want to have to run a command with system()
and then parse the output, especially since the output might be
different on different machines.  What you (probably) want is a
subroutine call, and that's why you probably want rstat, which is part
of the RPC services library (librpcsvc).  Take a while and read 'man
rstat' and 'man rpc'.  If you go with a system like the above, then
each slave (on a remote machine) is actually doing an RPC to its own
machine, but that's ok.  As it turns out, the calls are a little
different on different machines as well, but it seems to work.  Below
is a little program that I wrote that uses rpcsvc to determine the
load.  Compile it with 'gcc test_rstat.c -lrpcsvc'

An additional problem that you are going to have is that 'load' does
not measure how busy the cpu is right now; uptime and rup give
averages over 1, 5, and 15 minutes.  So, a problem that I have had is
that my slave sends over the information, I start a process, and the
next time that the slave sends over information (5 seconds later) the
load still does not show the cpu as being very busy.  To work around
this, I have built in delays between starting processes on the same
computer.
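
A minimal sketch of one way such a delay could be enforced in the
main program, assuming it remembers when it last started a task on
each host.  The names host_state, may_start and note_start and the 60
second hold-off are made up for illustration; the right hold-off
depends on how quickly the load average catches up on your machines.

```c
/* Hold-off between task starts on the same host (illustrative only). */
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define HOLDOFF_SECS 60   /* assumed: roughly one 1-minute averaging window */

struct host_state {
        int     started;        /* nonzero once a task has been started here */
        time_t  last_start;     /* when the most recent task was started     */
};

/* Returns 1 if enough time has passed since the last start on this host. */
int may_start(const struct host_state *h, time_t now)
{
        assert(h != NULL);
        if (!h->started)
                return 1;
        return difftime(now, h->last_start) >= HOLDOFF_SECS;
}

/* Call right after a task has been started on the host. */
void note_start(struct host_state *h, time_t now)
{
        assert(h != NULL);
        h->started = 1;
        h->last_start = now;
}
```

The scheduler would skip any host for which may_start() returns 0,
even if its reported load still looks low.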


QUESTION FOR EXPERTS: On SOLARIS, there is a tool called 'perfmeter'
that you can pop up to show how busy computers are (along with lots of
other information, such as disk, errors, etc.).  Load is an option,
and it uses the 1 minute average.  However, there is also something
called 'cpu' that shows how busy the computer is "right now".  How do
I get that information?  I cannot find anything that gives it to me,
and rstat does not seem to have any fields that correspond to that
number.
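
One possible lead, untested here: the cp_time counters that rstat
returns appear to be cumulative tick counts per cpu state, so the
change between two samples taken a few seconds apart should
approximate how busy the cpu is "right now".  A minimal sketch of that
calculation, assuming four states with idle in the last slot (the
names NSTATES, IDLE_SLOT and busy_fraction are made up for
illustration; check rstat.h on your system for the real layout):

```c
/* Estimate cpu utilization from two samples of cumulative tick counters.
   The number of states and the position of the idle counter are
   assumptions, not taken from any particular rstat.h. */
#include <assert.h>
#include <stddef.h>

#define NSTATES   4   /* assumed: user, nice, system, idle */
#define IDLE_SLOT 3   /* assumed index of the idle counter */

/* Fraction of elapsed ticks spent non-idle, in [0,1].
   Returns -1.0 if no ticks elapsed between the samples. */
double busy_fraction(const int prev[NSTATES], const int cur[NSTATES])
{
        long total = 0;
        long idle;
        int  i;

        assert(prev != NULL && cur != NULL);
        for (i = 0; i < NSTATES; i++)
                total += (long)cur[i] - (long)prev[i];
        if (total <= 0)
                return -1.0;    /* same sample twice, or counters wrapped */

        idle = (long)cur[IDLE_SLOT] - (long)prev[IDLE_SLOT];
        return (double)(total - idle) / (double)total;
}
```

A slave would call rstat twice, a few seconds apart, and pass the two
cp_time arrays to busy_fraction().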

Note that if you _really_ want to get into this sort of thing, GNU
make ships a file called 'getloadavg.c' that does lots of spiffy load
stuff, and has the ability to handle just about any machine.  It is
very complicated though and seems to go through and parse kernel
files.  And it still doesn't seem to give you the 'right now'
information.
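
Where the C library itself provides getloadavg() (glibc and the BSDs
declare it in <stdlib.h>; Solaris puts it in <sys/loadavg.h>), a slave
can read its own load averages without rpc.  A minimal sketch under
that assumption; format_local_load is a made-up helper name, and the
call will not exist on every system.

```c
/* Read the local load averages with getloadavg(), where available. */
#define _DEFAULT_SOURCE         /* glibc: expose getloadavg() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write up to three load averages into buf as "a b c".
   Returns the number of averages obtained, or -1 on failure. */
int format_local_load(char *buf, size_t len)
{
        double avg[3];
        size_t used = 0;
        int    n, i;

        assert(buf != NULL && len > 0);
        buf[0] = '\0';
        n = getloadavg(avg, 3);
        if (n < 0)
                return -1;      /* load average not obtainable */

        for (i = 0; i < n && used < len; i++) {
                int w = snprintf(buf + used, len - used, "%s%.2f",
                                 i ? " " : "", avg[i]);
                if (w < 0)
                        return -1;
                used += (size_t)w;
        }
        return n;
}
```

Like uptime and rup, this only gives the averaged figures, not the
"right now" number.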

Please let me know if you can find something better than the sort of
thing I have below.

Clark


/****** begin test_rstat.c ******************************/
/*
  Do a rup in C using rstat, for comparison with interactive rup.

  The computer's name is either elmo (Sun Solaris 5.5.1) or zook (Irix 5.3)
*/ 
#include <rpc/rpc.h>            /* for clnt_stat enum    */
#include <rpcsvc/rstat.h>       /* for typedef'ed struct statstime  */

#include <stdio.h>

int main(void)
{
        struct statstime        ex_stat;
        int                     rstat_status;
        int                     states;
        int                     drives;
        int                     i;
        double                  ave_stat[3];

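        rstat_status = rstat( "elmo", &ex_stat);
        if ( rstat_status != 0 ) {
                /* rstat returns 0 on success, otherwise an RPC status code */
                fprintf( stderr, "rstat failed: %s\n",
                         clnt_sperrno( (enum clnt_stat) rstat_status ) );
                return 1;
        }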

        /**********************************************************************/
        /* Comment out one of these, depending on which system you are        */
        /* on.                                                                */
        /**********************************************************************/
        /* Solaris */
        states = RSTAT_CPUSTATES;
        drives = RSTAT_DK_NDRIVE;

        /* IRIX    */
        /* states = CPUSTATES;    */
        /* drives = DK_NDRIVE;    */

        for ( i=0; i<states ; i++ ){
                printf(" Cp time %d:   %d\n", i, ex_stat.cp_time[ i ]);
        }

        for ( i=0; i< drives ; i++ ){
                printf(" Dk xfer %d:  %d\n", i, ex_stat.dk_xfer[i]);
        }
        
        printf( "ex_stat.v_pgpgin %d\n",       ex_stat.v_pgpgin     );
        printf( "ex_stat.v_pgpgout %d\n",      ex_stat.v_pgpgout    );
        printf( "ex_stat.v_pswpin %d\n",       ex_stat.v_pswpin     );
        printf( "ex_stat.v_pswpout %d\n",      ex_stat.v_pswpout    );
        printf( "ex_stat.v_intr %d\n",         ex_stat.v_intr       );
        printf( "ex_stat.if_ipackets %d\n",    ex_stat.if_ipackets  );
        printf( "ex_stat.if_ierrors %d\n",     ex_stat.if_ierrors   );
        printf( "ex_stat.if_oerrors %d\n",     ex_stat.if_oerrors   );
        printf( "ex_stat.if_collisions %d\n",  ex_stat.if_collisions );
        printf( "ex_stat.v_swtch %d\n",        ex_stat.v_swtch      );
        printf( "ex_stat.if_opackets %d\n",    ex_stat.if_opackets  );
        /* avenrun is int on some systems and long on others, so cast */
        for ( i=0; i<3; i++) {
                printf( "ex_stat.avenrun %d:  %ld \n", i, (long) ex_stat.avenrun[ i ] );
        }

        /* The 256 below (at least on SOLARIS) comes from FSCALE in
           param.h.  It seems to be the same on Irix 5.3 */
        printf( "\n Compare this with rup output " );
        for ( i=0; i<3; i++) {
                ave_stat[i] = (double)(ex_stat.avenrun[ i ])/ 256.;
                printf( " %f ", ave_stat[i]);
        }
        printf("\n");

        return 0;
}
/****** end test_rstat.c ******************************/

