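Hello everyone,
I am submitting jobs from the command line for a surface relaxation calculation on a non-magnetic metal. A run on 1 node with 56 cores (each node has 56 cores and 190 GB of memory) fails after a little more than one day, with memory usage at about 6%. In the .castep file, the first few attempts at the iterative solution all give wrong results, and on the fifth attempt the calculation reports an error and terminates (the .castep file is attached).
The end of the .castep file shows the following: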
Writing analysis data to Cu_001_9.castep_bin
Writing model to Cu_001_9.check
Last known process information:
===============================
Name: castepexe.exe
Umask: 0022
State: R (running)
Tgid: 122086
Ngid: 122086
Pid: 122086
PPid: 122082
TracerPid: 0
Uid: 200158 200158 200158 200158
Gid: 200008 200008 200008 200008
FDSize: 64
Groups: 200006 200008
VmPeak: 3953240 kB
VmSize: 2950428 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 3025656 kB
VmRSS: 1940256 kB
RssAnon: 1917920 kB
RssFile: 1560 kB
RssShmem: 20776 kB
VmData: 2066560 kB
VmStk: 1796 kB
VmExe: 150104 kB
VmLib: 28368 kB
VmPTE: 5080 kB
VmSwap: 100024 kB
Threads: 1
SigQ: 0/767479
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000100000
SigCgt: 00000001800046ae
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000001fffffffff
CapAmb: 0000000000000000
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000001
Cpus_allowed_list: 0
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 1229435
nonvoluntary_ctxt_switches: 2365629
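To put the Vm* values in the dump above into the same units as the node memory, here is a minimal Python sketch. It is only a rough estimate under stated assumptions: the dump describes a single process, multiplying by 56 ranks assumes every rank reaches the same peak at the same time (which the output does not confirm), and the 190G per node is treated as GiB. The names `parse_vm_fields` and `kb_to_gib` are hypothetical helpers, not part of CASTEP or Materials Studio.

```python
import re

def parse_vm_fields(status_text):
    """Collect the 'Vm*' and 'Rss*' lines of a /proc-style status dump as {name: kB}."""
    fields = {}
    for line in status_text.splitlines():
        m = re.match(r"\s*(Vm\w+|Rss\w+):\s+(\d+)\s*kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

def kb_to_gib(kb):
    """Convert kB (as printed in the dump, 1 kB = 1024 bytes) to GiB."""
    return kb / (1024 ** 2)

# Values copied from the dump in this post.
sample = """VmPeak: 3953240 kB
VmHWM: 3025656 kB
VmRSS: 1940256 kB
VmSwap: 100024 kB"""

fields = parse_vm_fields(sample)

ranks_per_node = 56     # from the post: 1 node, 56 cores
node_memory_gib = 190   # from the post: 190G per node (assumed to mean GiB)

# Assumption: every rank has the same peak resident size (VmHWM) as this one.
est_peak_gib = kb_to_gib(fields["VmHWM"]) * ranks_per_node
est_fraction = est_peak_gib / node_memory_gib

for name, kb in fields.items():
    print(f"{name}: {kb_to_gib(kb):.2f} GiB")
print(f"Estimated peak for {ranks_per_node} ranks: {est_peak_gib:.1f} GiB "
      f"({est_fraction:.0%} of the node)")
```

If the estimate comes out much higher than the 6% reported by the monitoring tool, it may be worth checking how that 6% was measured. The non-zero VmSwap value in the dump is another figure to compare against it.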
The castepext.log file shows the following:
not keeping all the check files
Run Castep: with seedname = Cu_001_9
error, Incomplete CASTEP output file
The error messages in std_out.txt are as follows:
Checked out license feature: MS_castep_site <v2018.09> [for msi] (2 copies)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 122137 RUNNING AT n07310
= EXIT CODE: 7
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
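The same error also occurs when the same job is run with other node and core layouts: with 2 nodes and 110 cores, and with 1 node and 56 cores, the error appears after about two days and a little more than one day respectively; with 5 nodes and 110 cores, the EXIT CODE: 7 error appears after about 35 SCF loop iterations in the first cycle.
For the 2-node, 110-core run above, scf_cycles was 150, charge was 0.1, and smearing was 0.01, and in every iteration the difference between the energy E and E-TS was less than 0.1. The error appeared after two days of computation.
This has been really frustrating. Suspecting a parallelization problem caused by running on multiple nodes, I switched to 1 node with 56 cores (the run described at the beginning), again with scf_cycles 150, charge 0.1, and smearing 0.01; the difference between E and E-TS in every iteration was again less than 0.1.
All of the jobs above had enough memory available. From searching online I found that EXIT CODE: 9 means insufficient memory, but I cannot find what EXIT CODE: 7 means. The supercomputing center staff told me that the parameter settings of my job are wrong. However, smaller surface relaxations with the same parameter settings run without any problem (that is, calculations with fewer layers are fine, and even the current job has only 81 atoms). To reduce the extra cost that comes with more layers, I also tried scf_cycles 150 with charge 0.15 and 0.1 and smearing 0.1, 0.15, and 0.2, but every run ended with the error above.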
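One possible way to narrow down the meaning of the exit code is sketched below. It assumes that the number printed by the MPI launcher is a signal number, which nothing in the output above confirms. Signal numbering also depends on the platform, so the lookup should be run on the compute node itself. The helper `signal_name` is a hypothetical name used only for illustration.

```python
import signal

def signal_name(code):
    """Return the local name of signal number `code`, or None if it is not a valid signal."""
    try:
        return signal.Signals(code).name
    except ValueError:
        return None

# Assumption: the launcher's "EXIT CODE" is a signal number on this machine.
for code in (7, 9):
    print(f"EXIT CODE {code} -> {signal_name(code)}")
```

If the name printed for 7 points to a memory access or I/O type of fault, that would at least suggest looking at the node and file system rather than only at the CASTEP parameters.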
Any advice would be greatly appreciated.
Attachment: castep.txt (CASTEP calculation on 1 node with 56 cores)