From: Vincent Nikkelen <vincentn@stack.nl>
Newsgroups: comp.parallel.pvm
Subject: Re: system problems with pvm 3.4.0
Date: 29 Jun 1999 01:19:34 +0200
Organization: M.C.G.V. Stack - Eindhoven, the Netherlands
Sender: vincentn@turtle.stack.nl
Message-Id: <7l8vu6$mk6$1@turtle.stack.nl>
References: <3777BA46.ADAEC704@ornl.gov>
User-Agent: tin/pre-1.4-980730 (UNIX) (FreeBSD/2.2.8-STABLE (i386))
Xref: ukc comp.parallel.pvm:8522


Hi,

Maybe you already did, but check your message queue. I once had the same
problem, and this was the cause:

PVM is asynchronous, so a message buffer can fill up. Especially when one
task is a bit slower [for whatever reason, which is why it's difficult to
catch], its message queue can grow. PVM then spends a lot of time
allocating space for its message queue [or allocation is slow], which
slows down your task even more, and so on.
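
To make the effect concrete, here is a toy simulation. It is not PVM code
and calls no PVM functions; the name simulate_queue and the per-tick rates
are made up for illustration. It only assumes that a sender produces a
fixed number of messages per tick and that the receiver can consume a
fixed number per tick.

```python
def simulate_queue(send_per_tick, recv_per_tick, ticks):
    """Return the queue depth after each tick (hypothetical toy model)."""
    depth = 0
    history = []
    for _ in range(ticks):
        depth += send_per_tick              # messages arriving this tick
        depth -= min(depth, recv_per_tick)  # messages the receiver handles
        history.append(depth)
    return history


# Receiver keeps up: the queue is drained every tick.
balanced = simulate_queue(send_per_tick=5, recv_per_tick=5, ticks=10)

# Receiver is only slightly slower: the queue grows without bound.
lagging = simulate_queue(send_per_tick=5, recv_per_tick=4, ticks=10)

print("balanced:", balanced[-1], "lagging:", lagging[-1])
```

Even a small mismatch in rates makes the backlog grow steadily. The model
leaves out the extra slowdown from allocation, which would make the
lagging case worse still.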

Because PVM does not free the memory used by a message queue, a task with
a non-growing message queue is much more efficient than a task with a
growing message queue.

[I say 'task' throughout, but the same applies to the PVMD, etc.]

Vincent

John Galambos <jjdg@ornl.gov> wrote:
: Hi,

: We recently started doing parallel calculations on our linux cluster
: using pvm 3.4.0. Most of the time things work fine, but sometimes one
: CPU will suddenly start "running" slower, and bring the entire
: calculation to a grinding crawl. After stopping the parallel calculation
: the slow behavior of the offending CPU persists, even for serial
: calculations. For example, running a dead simple 4 line "loop" program
: under the time utility shows that the offending CPU is using excessive
: "system" time. But I can't see anything unusual running on it with
: either ps or top. Rebooting the offender clears it up. It's happened on
: several different CPUs.  Sometimes the cluster goes days without this
: happening.  I never noticed this problem before we started doing
: parallel calcs on the cluster, and it only seems to happen when we're
: running parallel.  Anyone seen anything like this ?

: Thanks - John Galambos

: System:

: alpha 533 MHz 21164 chips (Microwa computers)
: Linux, RH5.1, kernel = 2.0.36
: Netgear FS516 100 mbs switch (also tried 3Com 3300 Superstack II 100
: mbs)

