[GROMACS] GPU run errors on a supercomputer: what could be the cause?

jimulation · Posted on 2020-12-31 20:41:20

Last edited by jimulation on 2020-12-31 20:41

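I generated a tpr file with gmx2019.6 in a virtual machine and ran it on the Guangzhou supercomputing platform with the GPU build of gmx2019.6. Each node has four V100 GPUs and a 28-core CPU. I tried several submission commands and hit the following errors: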
1. yhrun -N 1 -n 4 ...... -ntomp 6 -gpu_id 0123 -nb gpu -pme gpu -npme 1 -bonded gpu
Error: Inconsistency in user input: Bonded interactions on the GPU were required, but not supported for these simulation settings. Change your settings, or do not require using GPUs.

2. yhrun -N 1 -n 4 ...... -ntomp 6 -gpu_id 0123 -nb gpu -pme gpu -npme 1 -bonded cpu
Error: Step 100: The total potential energy is nan, which is not finite. The LJ and electrostatic contributions to the energy are 0 and 0, respectively. A non-finite potential energy can be caused by overlapping interactions in bonded interactions or very large or Nan coordinate values. Usually this is caused by a badly- or non-equilibrated initial configuration, incorrect interactions or parameters in the topology.

3. yhrun -N 1 -n 4 ...... -ntomp 6 -gpu_id 0123
Error: 80 particles communicated to PME rank 1 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x. This usually means that your system is not well equilibrated.

4. yhrun -N 1 -n 2 ...... -ntomp 6 -gpu_id 0123
Error: step 1: One or more water molecules can not be settled. Check for bad contacts and/or reduce the timestep if appropriate.

Question 1, based on the above: why do different nb/npme/bonded options produce different errors? Is there some internal mechanism behind this?
————————————————

Changing -n to 1, so that only one rank is used, lets the run proceed normally, as in 5 to 7. Specifying -bonded gpu still fails, as in 8:
5. yhrun -N 1 -n 1 ...... -ntomp 16 -gpu_id 0123 (63 ns/day). The log file reports: 1 GPU selected for this run. Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node: PP:0,PME:0. PP tasks will do (non-perturbed) short-ranged interactions on the GPU, PME tasks will do all aspects on the GPU

6. yhrun -N 1 -n 1 ...... -ntomp 24 -gpu_id 0123 (66 ns/day; same log message as 5)

7. yhrun -N 1 -n 1 ...... -ntomp 28 -gpu_id 0123 (61 ns/day; same log message as 5)

8. yhrun -N 1 -n 1 ...... -ntomp 24 -gpu_id 0123 -nb gpu -pme gpu -npme 1 -bonded gpu (fails with the same error as 1)

Question 2, based on the above: why does switching to a single rank work, while -bonded still cannot be set manually?
————————————————

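One more, question 3: in a pure-CPU run, -n or -ntmpi sets the number of ranks and -ntomp sets the number of threads per rank (http://bbs.keinsci.com/thread-13861-1-1.html); their product is the total number of cores. In a mixed GPU/CPU run, how is the GPU divided up? For example, with a single GPU, -ntmpi 4 and -ntomp 6, if one PME rank computes on the GPU, what does its thread count refer to? And if the remaining 3 ranks compute on the CPU, are 24 or 18 CPU cores taking part in total?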
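
The rank-times-thread arithmetic in question 3 can be sketched as follows. This is a minimal illustration that assumes the same product rule carries over to a mixed GPU/CPU run, i.e. that a rank offloading PME to the GPU still holds its own CPU threads; nothing in this thread confirms that assumption. The function name `cpu_threads_used` is hypothetical, and the 28-core node size comes from the node described at the top of the post.

```python
def cpu_threads_used(ranks: int, threads_per_rank: int) -> int:
    """Total CPU threads requested: ranks times OpenMP threads per rank."""
    if ranks < 1 or threads_per_rank < 1:
        raise ValueError("ranks and threads_per_rank must be positive")
    return ranks * threads_per_rank


NODE_CORES = 28  # cores per node, as described in the post

# ASSUMPTION: a PME rank that offloads to the GPU still counts its CPU threads,
# so -ntmpi 4 -ntomp 6 is treated here as 4 * 6 threads.
requested = cpu_threads_used(ranks=4, threads_per_rank=6)
print(f"requested {requested} of {NODE_CORES} cores")
print("fits on one node:", requested <= NODE_CORES)
```

Under that assumption the example asks for 24 threads, which fits within the 28 cores of one node.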

That is a lot of questions. Any answers would be much appreciated!

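对抗路达摩 · Posted on 2023-2-15 19:11:18

喵星大佬 posted on 2023-2-15 16:48
No, you can specify which GPU PME runs on

In my own experience, for most systems GROMACS rarely gains a meaningful speedup from multiple GPUs compared with a single GPU.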

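喵星大佬 · Posted on 2023-2-15 16:48:14

wuzhiyi posted on 2021-1-2 05:54
What I mean is that one GPU gives 100 ns a day.
Four together give maybe 120 ns a day, because as soon as gmx detects multiple GPUs it puts PME on the CPU ...

No, you can specify which GPU PME runs on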
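
A hedged sketch of one way to do this: the `-gputasks` option of `gmx mdrun` maps each GPU task on a node to a device ID, with the PP tasks listed before the PME task. The helper below only builds the command line. It assumes a thread-MPI build (`gmx` with `-ntmpi`), one dedicated PME rank, and single-digit device IDs; the name `mdrun_with_pme_gpu` is hypothetical. `-gputasks` cannot be combined with `-gpu_id`.

```python
from typing import List


def mdrun_with_pme_gpu(pp_ranks: int, pp_gpu: int, pme_gpu: int) -> List[str]:
    """Build an mdrun command that places the PME task on a chosen GPU."""
    for dev in (pp_gpu, pme_gpu):
        if not 0 <= dev <= 9:
            raise ValueError("this sketch only handles single-digit device IDs")
    # One digit per GPU task: all PP tasks first, then the single PME task.
    gputasks = str(pp_gpu) * pp_ranks + str(pme_gpu)
    return [
        "gmx", "mdrun",
        "-ntmpi", str(pp_ranks + 1),  # PP ranks plus one PME rank
        "-npme", "1",
        "-nb", "gpu",
        "-pme", "gpu",
        "-gputasks", gputasks,
    ]


cmd = mdrun_with_pme_gpu(pp_ranks=3, pp_gpu=0, pme_gpu=1)
print(" ".join(cmd))
```

With three PP ranks on GPU 0 and PME on GPU 1, the mapping string is `0001`.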

季伯醇 · Posted on 2023-2-14 20:23:04

https://gitlab.com/gromacs/gromacs/-/issues/3412
This issue discusses the problem; it was introduced by the GROMACS 2019/2020 version updates.
Possible workarounds:
1. Modify the mdp file:
constraints=no
gen_vel=yes
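2. If a single rank runs without errors, first run slowly with one rank for a short stretch, for example 100 steps, export a gro file, and then run with multiple ranks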
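
Workaround 2 can be sketched as a three-command sequence. This is only an illustration under stated assumptions: it uses a thread-MPI `gmx` binary rather than the `yhrun` launcher from the original post, and every file name (`topol.tpr`, `md.mdp`, `topol.top`, `relaxed.gro`, `multi.tpr`) as well as the helper name `staged_commands` is hypothetical. The script only prints the commands; it does not run GROMACS.

```python
from typing import List


def staged_commands(warmup_steps: int = 100, ranks: int = 4) -> List[List[str]]:
    """Commands for: short single-rank run, rebuild the tpr, multi-rank run."""
    return [
        # Stage 1: short run on one rank, writing the final structure.
        ["gmx", "mdrun", "-s", "topol.tpr", "-ntmpi", "1",
         "-nsteps", str(warmup_steps), "-c", "relaxed.gro"],
        # Stage 2: build a new tpr that starts from the exported gro file.
        ["gmx", "grompp", "-f", "md.mdp", "-c", "relaxed.gro",
         "-p", "topol.top", "-o", "multi.tpr"],
        # Stage 3: production run with several ranks.
        ["gmx", "mdrun", "-s", "multi.tpr", "-ntmpi", str(ranks)],
    ]


commands = staged_commands()
for command in commands:
    print(" ".join(command))
```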

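jimulation · Posted on 2021-1-3 22:27:40

wuzhiyi posted on 2021-1-2 05:54
What I mean is that one GPU gives 100 ns a day.
Four together give maybe 120 ns a day, because as soon as gmx detects multiple GPUs it puts PME on the CPU ...

I understand your point. You are talking about computational efficiency, but my problem is that the run fails, not that it is slow.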

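wuzhiyi · Posted on 2021-1-2 05:54:05

jimulation posted on 2021-1-1 21:12
The thing is, I used to run successfully with 4 GPUs, and now it no longer works

What I mean is that one GPU gives 100 ns a day.
Four together give maybe 120 ns a day, because as soon as gmx detects multiple GPUs it puts PME on the CPU.

Running four simulations gives 400 ns a day.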

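jimulation · Posted on 2021-1-1 21:12:58

wuzhiyi posted on 2021-1-1 19:26
gmx cannot run on multiple GPUs; just use one GPU.
With GPUs, the best setup is always one MPI rank with everything else given to OpenMP; with four GPUs ...

The thing is, I used to run successfully with 4 GPUs.

Now it no longer works.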

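wuzhiyi · Posted on 2021-1-1 19:26:10

gmx cannot run on multiple GPUs; just use one GPU.
With GPUs, the best setup is always one MPI rank with everything else given to OpenMP. With four GPUs it is best to run four simulations, each with 7 OpenMP threads.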
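
The four-jobs layout can be sketched as below: one mdrun per GPU, each with a single rank and 28 / 4 = 7 OpenMP threads. The thread pinning flags (`-pin on`, `-pinoffset`, `-pinstride`) are an addition not mentioned in this thread, included on the assumption that the four jobs should not share cores and that hyperthreading is off; the `run0` to `run3` file prefixes and the helper name `per_gpu_jobs` are hypothetical. The script only prints the commands.

```python
from typing import List


def per_gpu_jobs(gpus: int = 4, cores: int = 28) -> List[List[str]]:
    """One single-rank mdrun command per GPU, splitting the cores evenly."""
    threads = cores // gpus  # 28 // 4 = 7 OpenMP threads per job
    jobs = []
    for gpu in range(gpus):
        jobs.append([
            "gmx", "mdrun", "-deffnm", f"run{gpu}",
            "-ntmpi", "1",
            "-ntomp", str(threads),
            "-gpu_id", str(gpu),
            # ASSUMPTION: pin each job to its own block of cores.
            "-pin", "on",
            "-pinoffset", str(gpu * threads),
            "-pinstride", "1",
        ])
    return jobs


jobs = per_gpu_jobs()
for job in jobs:
    print(" ".join(job))
```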

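abin · Posted on 2020-12-31 23:38:56

If I'm not mistaken,
yhrun is just a variant of srun.

If that is the case,
you know what to do.
Go read the Slurm manual.

Or read the Tianhe manual.

It is all in the manuals.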
