Newsgroups: comp.parallel.pvm,fj.comp.parallel,fj.lang.c,fj.lang.fortran
Path: galaxy.trc.rwcp.or.jp!coconuts.jaist!wnoc-tyo-news!etl.go.jp!e8710
From: e8710@etlrips.etl.go.jp (Asai Yoshihiro)
Subject: Re: Error with PVM 3.3.1
Content-Type: text/plain; charset=US-ASCII
Message-ID: <k7Eeo.e8710@etlrips.etl.go.jp>
Sender: news@etl.go.jp (News System)
Nntp-Posting-Host: etlcom3
Reply-To: e8710@etlrips.etl.go.jp (Asai Yoshihiro)
Organization: Electrotechnical Laboratory, Tsukuba Science City
References: <2vem3e$lhg@excepcion.cc.gatech.edu> 
	<esanchez.773598083@campus>
Mime-Version: 1.0 (generated by vin2.0)
Date: Fri, 8 Jul 1994 06:34:58 GMT
Lines: 85
Xref: galaxy.trc.rwcp.or.jp fj.comp.parallel:981 fj.lang.c:1459 fj.lang.fortran:149
X-originally-archived-at: http://galaxy.rwcp.or.jp/text/cgi-bin/newsarticle2?ng=fj.lang.c&nb=1459&hd=a
X-reformat-date: Mon, 18 Oct 2004 15:18:22 +0900
X-reformat-comment: Tabs were expanded into 4 column tabstops by the Galaxy's archiver. See http://katsu.watanabe.name/ancientfj/galaxy-format.html for more info.

On 07/08/94(01:21) esanchez@academ01.mty.itesm.mx (Enrique Sanchez Vela) wrote
in <esanchez.773598083@campus> (comp.parallel.pvm:2084/etlss2):
 |  Prince,
 |
 |   I don't think this is a PVM problem; I can see two different causes.
 |
 |         a) Network overloaded.
 |         b) Slave computer overloaded
 |
 |    To see if it is either of the above, you can monitor the network
 |traffic (there are several tools) or try again when traffic is lower.
 |You can also monitor the slave computer: check how loaded it is using
 |"rup hostname" or "uptime" (which you have to run locally).
 |
 | Hope this helps.
 |
 |Enrique Sanchez Vela.             e-mail: esanchez@mtecv2.mty.itesm.mx
 |ITESM - Campus Monterrey          phone (52-8) 328-4089
 |Tecnologia Computacional          fax   (52-8) 369-2004
 |AIX/ESA System Administrator (for living)
 |
 |
 |pkohli@cc.gatech.edu (Prince Kohli) writes:
 |
 |>I had posted this before as a problem that I had with 3.2.2. However,
 |>the same error recurs with pvm 3.3.1. Would any one know why this
 |>happens? Any hint at all would be appreciated.
 |
 |>Also, the following problem now occurs much later as compared to
 |>pvm 3.2.2, but it does occur.
 |
 |>----
 |
 |>I have an application that runs on top of pvm 3.2.2 (3.3.1 now). The problem
 |>is that at random times, i.e., sometimes very soon after the program
 |>starts and sometimes much later, the host console will give this error:
 |
 |>netoutput() timed out sending to <machine_name> after 23, 194.24140
 |>hd_dump() ref 1 t100000 n <machine_name> ar "SUN4" lo ""
 |>sa 130.207.114.58:3211 mtu 4096 f 0x0 e 0 txq 2
 |>tx 65537 rx 0 rtt 0.003648
 |
 |>The 130.207.114.58 is the address of <machine_name>.
 |
 |>After this, though the pvm daemon is still running there, the master host
 |>thinks it is dead and removes it from the config, and all later packets
 |>from it are marked bogus packets. And all this of course screws up my
 |>application.
 |
 |>----
 |
 |>Thanks in advance for any help,
 |
 |>cheers,
 |
 |>-Prince
 |

The trouble may be related to a failure to keep the load balanced
in a multiuser, multitasking environment. The recommendations we
have found are the pool-of-tasks paradigm for dynamic load balancing
and/or the use of pvm_addhosts() to recover from a host failure.
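
To make the pool-of-tasks idea concrete, here is a minimal sketch in
plain C. It only simulates the master's bookkeeping and makes no PVM
calls; the names run_pool, NTASKS, NWORKERS and the injected failure
(fail_worker, fail_round) are all hypothetical and chosen for
illustration. The point is that the master hands a task only to a
worker it believes is alive, and puts the task back on the queue if
that worker is lost.

```c
#include <assert.h>
#include <stdio.h>

#define NTASKS    8              /* hypothetical number of work units  */
#define NWORKERS  3              /* hypothetical number of slave hosts */
#define QUEUE_CAP (NTASKS + 1)   /* one injected failure requeues once */

/*
 * Simulated pool-of-tasks master (no PVM calls).
 * fail_worker/fail_round inject a single host failure; pass
 * fail_worker = -1 for a run without failure.
 * Returns the number of completed tasks; done_by[t] records which
 * worker finished task t.
 */
int run_pool(int fail_worker, int fail_round, int done_by[NTASKS])
{
    int queue[QUEUE_CAP];        /* pending task ids, FIFO */
    int alive[NWORKERS];
    int head = 0, tail = 0, completed = 0, round = 0;
    int nalive = NWORKERS;
    int t, w;

    for (t = 0; t < NTASKS; t++) {
        queue[tail++] = t;
        done_by[t] = -1;
    }
    for (w = 0; w < NWORKERS; w++)
        alive[w] = 1;

    /* Each round, every live worker takes the next pending task. */
    while (head < tail && nalive > 0) {
        for (w = 0; w < NWORKERS && head < tail; w++) {
            int task;

            if (!alive[w])
                continue;
            task = queue[head++];

            if (w == fail_worker && round == fail_round) {
                /* Worker lost: drop it and requeue its task. */
                alive[w] = 0;
                nalive--;
                assert(tail < QUEUE_CAP);
                queue[tail++] = task;
                printf("worker %d lost, task %d requeued\n", w, task);
                continue;
            }
            done_by[task] = w;
            completed++;
        }
        round++;
    }
    return completed;
}
```

Calling run_pool(1, 0, done_by) drops worker 1 in the first round, and
all NTASKS tasks are still completed by the remaining workers. In a
real PVM program the hand-out and collection would presumably go
through pvm_send() and pvm_recv(), the loss of a host could be
reported to the master via pvm_notify(), and a replacement host could
be requested with pvm_addhosts(); none of that is shown here.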

I still wonder whether there are key parameters that control
the tolerance for host failure and that could be changed when
installing pvm3.3?

I have similar trouble too, and any information related to
this symptom would be very much appreciated.

Yoshihiro Asai
Fundamental Physics Section
Electrotechnical Laboratory
Umezono 1-1-4, Tsukuba, Ibaraki 305
Japan

Email: e8710@etlrips.etl.go.jp



