I generated a tpr file with gmx 2019.6 in a virtual machine and then ran it on the Guangzhou supercomputing platform with the GPU build of gmx 2019.6; a single node has 4 V100 GPUs plus a 28-core CPU. I tried several different submission commands and ran into the following errors:
1. yhrun -N 1 -n 4 ...... -ntomp 6 -gpu_id 0123 -nb gpu -pme gpu -npme 1 -bonded gpu
Error: Inconsistency in user input: Bonded interactions on the GPU were required, but not supported for these simulation settings. Change your settings, or do not require using GPUs.
2. yhrun -N 1 -n 4 ...... -ntomp 6 -gpu_id 0123 -nb gpu -pme gpu -npme 1 -bonded cpu
Error: Step 100: The total potential energy is nan, which is not finite. The LJ and electrostatic contributions to the energy are 0 and 0, respectively. A non-finite potential energy can be caused by overlapping interactions in bonded interactions or very large or Nan coordinate values. Usually this is caused by a badly- or non-equilibrated initial configuration, incorrect interactions or parameters in the topology.
3. yhrun -N 1 -n 4 ...... -ntomp 6 -gpu_id 0123
Error: 80 particles communicated to PME rank 1 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x. This usually means that your system is not well equilibrated.
4. yhrun -N 1 -n 2 ...... -ntomp 6 -gpu_id 0123
Error: step 1: One or more water molecules can not be settled. Check for bad contacts and/or reduce the timestep if appropriate.
Question 1: Why do different combinations of the -nb/-npme/-bonded options lead to these different errors? Is there some internal mechanism behind this?
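(In case it helps with answering: my plan for comparing the four cases is simply to pull the task-assignment lines out of each run's log file, the same lines quoted under run 5 below; md.log is just a placeholder name here:)

    # show which GPU each PP/PME task was mapped to in a given run
    grep -A 3 "Mapping of GPU IDs" md.log
    # show what each task type was assigned to do
    grep "tasks will do" md.log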
————————————————
If I change the -n option to 1, i.e. use only a single rank, the job runs normally, as in 5-7; but explicitly asking for -bonded gpu fails again, as in 8:
5. yhrun -N 1 -n 1 ...... -ntomp 16 -gpu_id 0123 (63 ns/day; the log file reports: 1 GPU selected for this run. Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node: PP:0,PME:0. PP tasks will do (non-perturbed) short-ranged interactions on the GPU, PME tasks will do all aspects on the GPU)
6. yhrun -N 1 -n 1 ...... -ntomp 24 -gpu_id 0123 (66 ns/day; same log message as 5)
7. yhrun -N 1 -n 1 ...... -ntomp 28 -gpu_id 0123 (61 ns/day; same log message as 5)
8. yhrun -N 1 -n 1 ...... -ntomp 24 -gpu_id 0123 -nb gpu -pme gpu -npme 1 -bonded gpu (fails with the same error as 1)
Question 2: Why does switching to a single rank make the run work, yet -bonded still cannot be set by hand?
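(For reference, the variant I would try next, still forcing nb and pme onto the GPU but leaving -bonded at its default of auto, would be the line below, with the same elided arguments as above; whether auto then puts the bonded work on the CPU or the GPU is exactly what I am unsure about:)

    yhrun -N 1 -n 1 ...... -ntomp 24 -gpu_id 0123 -nb gpu -pme gpu -bonded auto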
————————————————
And Question 3: for pure-CPU runs, -n (or -ntmpi) sets the number of ranks and -ntomp sets the number of threads per rank (http://bbs.keinsci.com/thread-13861-1-1.html), with their product being the total number of cores. In a mixed GPU/CPU run, how is the GPU divided up? For example, with only one GPU, -ntmpi set to 4 and -ntomp set to 6: if one PME rank computes on the GPU, what does its thread count refer to? And with the remaining 3 ranks computing on the CPU, do 24 or 18 CPU cores take part in the calculation in total?
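(To make the numbers in this question concrete, here is roughly how I understand the two cases, written as thread-MPI mdrun lines rather than yhrun submissions; md.tpr is just a placeholder name, and this is only a sketch of my understanding, not something I have benchmarked:)

    # pure CPU on a 28-core node: 4 thread-MPI ranks x 6 OpenMP threads = 24 cores doing work
    gmx mdrun -s md.tpr -ntmpi 4 -ntomp 6 -nb cpu -pme cpu
    # the mixed case from the question: same 4 x 6 layout, one GPU (id 0), one dedicated PME rank;
    # if that PME rank runs its PME on the GPU, what do its 6 threads refer to, and do 4x6 = 24
    # or only 3x6 = 18 CPU cores actually take part?
    gmx mdrun -s md.tpr -ntmpi 4 -ntomp 6 -npme 1 -gpu_id 0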
That's a lot of questions; any answers would be greatly appreciated!