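We have two nearly identical clusters. ORCA has always run fine on cluster A, but on cluster B it almost never works: as soon as I submit a parallel job, it dies with “ORCA finished by error termination in GTOInt Calling Command: mpirun -np 24 /dir1/orca_5_0_2_linux_x86-64_shared_openmpi411/orca_gtoint_mpi opt.int.tmp opt”, together with the following MPI error message: “ORTE was unable to reliably start one or more daemons. This usually is caused by: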
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
[file orca_tools/qcmsg.cpp, line 465]:
.... aborting the run
”
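This problem has puzzled me for several years, and I have never been able to track down the cause. The behavior is also strange:
(1) The two clusters are configured identically, yet one works and the other does not;
(2) The same job on the same (explicitly specified) nodes, run several times, will occasionally get past the GTOInt step and continue on to the SCF, so it does not fail every single time.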
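
As a rough way to check several of the causes the ORTE message lists (PATH, missing shared libraries, a non-writable /tmp), here is a minimal Python sketch that probes each node over non-interactive ssh, which is how Open MPI typically starts `orted` when no scheduler integration is used. Everything in it is an assumption for illustration: the helper names (`build_probe`, `parse_report`, `problems`, `check_host`) are hypothetical, the remote login shell is assumed to be POSIX-compatible, passwordless ssh is assumed to work, and the ORCA directory has to be supplied by hand because the path in the post is abbreviated.

```python
import shlex
import subprocess
import sys


def build_probe(orca_dir: str) -> str:
    """Build a shell snippet that prints key=value lines on the remote node."""
    binary = shlex.quote(orca_dir.rstrip("/") + "/orca_gtoint_mpi")
    return (
        # Is orted visible in the PATH of a non-interactive shell?
        "echo orted=$(command -v orted || echo MISSING); "
        # Is the ORCA helper binary present and executable on this node?
        f"echo binary=$(test -x {binary} && echo ok || echo MISSING); "
        # Can the current user write to /tmp?
        "echo tmp=$(test -w /tmp && echo ok || echo NOT_WRITABLE); "
        # How many shared libraries does ldd fail to resolve?
        f"echo unresolved=$(ldd {binary} 2>/dev/null | grep -c 'not found')"
    )


def parse_report(text: str) -> dict:
    """Turn the key=value lines printed by the probe into a dict."""
    report = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            report[key.strip()] = value.strip()
    return report


def problems(report: dict) -> list:
    """List the checks that did not pass for one node."""
    issues = []
    if report.get("orted", "MISSING") == "MISSING":
        issues.append("orted not on PATH")
    if report.get("binary") != "ok":
        issues.append("orca_gtoint_mpi not executable")
    if report.get("tmp") != "ok":
        issues.append("/tmp not writable")
    if report.get("unresolved", "0") != "0":
        issues.append("unresolved shared libraries")
    return issues


def check_host(host: str, orca_dir: str, timeout: int = 30) -> list:
    """Run the probe on one host via ssh and return the list of issues."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, build_probe(orca_dir)],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        return [f"ssh failed: {result.stderr.strip()}"]
    return problems(parse_report(result.stdout))


if __name__ == "__main__":
    if len(sys.argv) >= 3:
        orca_dir, hosts = sys.argv[1], sys.argv[2:]
        for host in hosts:
            issues = check_host(host, orca_dir)
            print(host, "OK" if not issues else "; ".join(issues))
    else:
        print("usage: probe.py ORCA_DIR HOST [HOST ...]")
```

Because the failure is intermittent, a single clean pass does not prove much; running the probe repeatedly against the nodes of cluster B, and once against cluster A for comparison, may show whether a particular node or check fails only some of the time. It does not cover the network-interface or firewall causes from the message.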