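This post was last edited by FUcreature on 2021-7-24 14:17
Background
I (FU) am normally an Ubuntu + Slurm user, but for various reasons I now have to work with a CentOS + Torque environment, so I spun one up in the cloud to study it. Because of licensing concerns I did not dare upload Gaussian, so I installed ORCA 5.0.0 and OpenMPI instead. Testing went fine until I reached parallel calculations, where I hit a problem I could not troubleshoot away, so I am here asking for help.
Environment
The virtual network is 192.168.1.0/24 and holds three servers with 2 cores and 4 GB of RAM each, all running CentOS 7 Minimal. .1 is the router, .2 is the management node (hereafter master), and the two .3x hosts are the compute nodes (hereafter worker01 and worker02).
On root@master I generated an SSH key pair and distributed the public key to root@worker01 and root@worker02; passwordless SSH login tests fine.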
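For readers reproducing this step, here is a minimal sketch of what the key distribution might look like. The key type, key size, and the `run` dry-run helper are my assumptions, not what was actually typed on master; by default the script only prints the commands it would execute.

```shell
#!/bin/bash
# Sketch only: host names come from the post, everything else is assumed.
WORKERS="worker01 worker02"

# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

distribute_key() {
  # Generate a key pair without a passphrase if none exists yet.
  if [ ! -f "$HOME/.ssh/id_rsa" ]; then
    run ssh-keygen -t rsa -b 4096 -N "" -f "$HOME/.ssh/id_rsa"
  fi
  for host in $WORKERS; do
    # Append the public key to the remote authorized_keys file.
    run ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "root@$host"
    # BatchMode makes ssh fail instead of prompting for a password.
    run ssh -o BatchMode=yes "root@$host" hostname
  done
}

distribute_key
```

Setting `DRY_RUN=0` would run the commands for real. Note that this only covers root; whether the job user also needs passwordless SSH between nodes depends on how mpirun launches remote processes.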
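I installed the ypserv package on master and the ypbind package on the workers and configured NIS. I also installed nfs-utils and configured NFS, exporting /home and /opt from master; the clients mount them at the same locations through /etc/fstab.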
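As an illustration of the NFS part, the sketch below prints the kind of lines that could go into /etc/exports on master and /etc/fstab on the workers. The master address and subnet come from the description above; the export options (`rw,sync,no_root_squash`) and mount options (`defaults,_netdev`) are assumptions, since the actual configuration is not shown.

```shell
#!/bin/bash
# Sketch only: prints candidate config lines, it does not modify any file.
MASTER_IP=192.168.1.2      # .2 is the master, per the post
SUBNET=192.168.1.0/24

# Lines for /etc/exports on master (options are assumed).
print_exports() {
  for dir in /home /opt; do
    printf '%s %s(rw,sync,no_root_squash)\n' "$dir" "$SUBNET"
  done
}

# Lines for /etc/fstab on each worker (options are assumed).
print_fstab() {
  for dir in /home /opt; do
    printf '%s:%s %s nfs defaults,_netdev 0 0\n' "$MASTER_IP" "$dir" "$dir"
  done
}

print_exports
print_fstab
```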
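After confirming that users and home directories are shared correctly, I compiled and installed Torque following the official Torque 6.1.1 documentation (the hwloc version on CentOS 7 is now 1.11, higher than the 1.9.1 the documentation requires, so I installed it directly with yum).
I installed pbs_server on master, then distributed the generated packages to the workers to install pbs_mom (since /usr/local is not shared, the clients also need hwloc and similar packages installed before the service will start).
After starting all the services, I followed the ORCA getting-started guide and the OpenMPI FAQ to build and install both under /opt, and distributed the scripts that set PATH and LD_LIBRARY_PATH to /etc/profile.d on every node.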
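A profile script of this kind might look roughly like the sketch below. The ORCA path matches the one used in the job scripts; the OpenMPI prefix `/opt/openmpi` and the idea of a single combined file are assumptions, since the post only says both were installed under /opt.

```shell
# Sketch of a file such as /etc/profile.d/orca.sh (file name is hypothetical).
ORCA_HOME=/opt/orca-5.0.0
OPENMPI_HOME=/opt/openmpi   # assumed prefix, adjust to the real install path

# Put the ORCA binaries and the OpenMPI tools (mpirun etc.) on PATH.
export PATH="$ORCA_HOME:$OPENMPI_HOME/bin:$PATH"

# Append the old LD_LIBRARY_PATH only if it was non-empty, to avoid a
# trailing colon (which the loader treats as the current directory).
export LD_LIBRARY_PATH="$ORCA_HOME:$OPENMPI_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

One thing to keep in mind is that /etc/profile.d is read by login shells, so a non-interactive remote shell may not pick these variables up.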
Problem scenario
First I ran a single-core calculation with the water molecule input file provided in the ORCA documentation. The qsub script:
- #!/bin/bash
- #PBS -N water
- #PBS -l nodes=1:ppn=1
- #PBS -o water.o
- #PBE -e water.e
- echo Running in $HOSTNAME
- /opt/orca-5.0.0/orca /home/poi/water/water.inp > /home/poi/water/water.out
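Up to this point everything works: water.out contains "****ORCA TERMINATED NORMALLY****" and water.o shows the hostname of the node that ran the job. But once the input file is swapped for the parallel testjob.inp (copied from http://sobereva.com/451 )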
- ! BLYP def2-SVP noautostart miniprint pal4
- * xyz 0 1
- C 0.00000000 0.00000000 -0.56221066
- H 0.00000000 -0.92444767 -1.10110537
- H -0.00000000 0.92444767 -1.10110537
- O 0.00000000 0.00000000 0.69618930
- *
with the corresponding qsub script:
- #!/bin/bash
- #PBS -N testjob
- #PBS -l nodes=2:ppn=2
- #PBS -o testjob.o
- #PBE -e testjob.e
- echo Running in $HOSTNAME
- /opt/orca-5.0.0/orca /home/poi/testjob/testjob.inp > /home/poi/testjob/testjob.out
everything blew up. Going by the output in the .e file, after fixing a missing librdmacm package on one of the workers, one problem remains that I cannot resolve:
- [file orca_tools/qcmat1.cpp, line 1677, Process 2]: Unable to open file /home/poi/testjob/testjob.VAUXJ.tmp in TMatrix<T>::Store(const char *fname)! <<< here
- [file orca_tools/qcmat1.cpp, line 1677, Process 3]: Unable to open file /home/poi/testjob/testjob.VAUXJ.tmp in TMatrix<T>::Store(const char *fname)! <<< here
- --------------------------------------------------------------------------
- Primary job terminated normally, but 1 process returned
- a non-zero exit code. Per user-direction, the job has been aborted.
- --------------------------------------------------------------------------
- --------------------------------------------------------------------------
- An MPI communication peer process has unexpectedly disconnected. This
- usually indicates a failure in the peer process (e.g., a crash or
- otherwise exiting without calling MPI_FINALIZE first).
- Although this local MPI process will likely now behave unpredictably
- (it may even hang or crash), the root cause of this problem is the
- failure of the peer -- that is what you need to investigate. For
- example, there may be a core file that you can examine. More
- generally: such peer hangups are frequently caused by application bugs
- or other external events.
- Local host: torque-centos-workers-4wmc
- Local PID: 20709
- Peer host: torque-centos-workers-mg7p
- --------------------------------------------------------------------------
- [torque-centos-workers-4wmc:20705] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
- [torque-centos-workers-4wmc:20705] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
- --------------------------------------------------------------------------
- mpirun detected that one or more processes exited with non-zero status, thus causing
- the job to be terminated. The first process to do so was:
- Process name: [[40750,1],2]
- Exit code: 64
- --------------------------------------------------------------------------
- [torque-centos-workers-4wmc:20705] 1 more process has sent help message help-mpi-btl-tcp.txt / peer hung up
- [torque-centos-workers-4wmc:20705] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
- [file orca_tools/qcmsg.cpp, line 458]:
It says "Unable to open file testjob.VAUXJ.tmp", but strangely the file does exist and is owned by the current user, so this is not a permissions problem:
- [poi@torque-centos-masters-pl0s testjob]$ ls
- testjob.dummy.tmp testjob.gbw testjob.int.tmp testjob.out testjob.sh testjob.S.tmp testjob.V.tmp
- testjob.e33 testjob.H.tmp testjob.K.tmp testjob.PDAT.tmp testjob.SHARKINP.tmp testjob.T.tmp
- testjob.EIJ.tmp testjob.inp testjob.o testjob_property.txt testjob.SHARK.K.tmp testjob.VAUXJ.tmp <<< here
I am stuck at this point and have not found a way forward. If anyone has ideas, I would appreciate the advice.
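Since the listing above was taken on master while the failing processes ran on the workers, one check that might narrow things down is to look at the same path from each worker. The helper below is only a diagnostic sketch with a made-up function name (`check_path`); it reports what the current host and user can see for a given file and does not touch ORCA or Torque.

```shell
#!/bin/bash
# Diagnostic sketch: report how a path looks from the host it runs on.
check_path() {
  local target="$1"
  local dir
  dir=$(dirname "$target")
  # Show which host and which user/uid is doing the check.
  echo "host=$(uname -n) user=$(id -un) uid=$(id -u)"
  if [ -d "$dir" ]; then echo "dir: exists"; else echo "dir: missing"; fi
  if [ -w "$dir" ]; then echo "dir: writable"; else echo "dir: not writable"; fi
  if [ -e "$target" ]; then echo "file: exists"; else echo "file: missing"; fi
  if [ -r "$target" ]; then echo "file: readable"; else echo "file: not readable"; fi
}

# Run the check only when a path is passed on the command line.
if [ "$#" -ge 1 ]; then
  check_path "$1"
fi
```

Saved somewhere on the shared /home (for example as a hypothetical /home/poi/check_path.sh), it could be run as the job user on each worker with `bash /home/poi/check_path.sh /home/poi/testjob/testjob.VAUXJ.tmp`, and the results compared with what master reports.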