GPU acceleration utilization is 0% on a supercomputer

不辣的麻辣烫 · Posted on 2023-3-1 17:41:55

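I have recently been running GROMACS protein-ligand molecular dynamics simulations at a supercomputing center. The node has NVIDIA® Tesla® V100 GPUs, with 96 cores, 8 cards, and 736 GB of memory per node. The system has about 140,000 atoms and the estimated run time is 24 h, but GPU utilization shows 0% and only memory shows 2.8% utilization. Could anyone tell me what is going on?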

.sh file

#!/bin/bash
#SBATCH --job-name=md_out
#SBATCH --error=%j.err

module add GROMACS/2023-dev # load the software

export GMX_GPU_DD_COMMS=true
export GMX_GPU_PME_PP_COMMS=true
export GMX_FORCE_UPDATE_DEFAULT_GPU=true

gmx_mpi mdrun -v -pin on -nb gpu -bonded gpu -pme gpu -update gpu -deffnm md

Job submission command

sbatch -N 1 -p g-v100-1 -n 10 -c 1 md.sh

Part of the log

Running on 1 node with total 48 cores, 96 logical cores, 8 compatible GPUs
Hardware detected on host g-v100-8-worker0001 (the node of MPI rank 0):
  CPU info:
Vendor: Intel
Brand:  Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Family: 6 Model: 85 Stepping: 4
Features: aes apic avx avx2 avx512f avx512cd avx512bw avx512vl avx512secondFMA clfsh cmov cx8 cx16 f16c fma hle htt intel lahf mmx msr nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 x2apic
Number of AVX-512 FMA units: 2

...
...
When checking whether update groups are usable:
  Domain decomposition is not active, so there is no need for update groups
  At least one moleculetype does not conform to the requirements for using update groups
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
PP task will update and constrain coordinates on the GPU
PME tasks will do all aspects on the GPU
Using 1 MPI process
Using 48 OpenMP threads

Pinning threads with an auto-selected logical core stride of 2
System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

There are: 145256 Atoms

md file

title                = Protein-ligand complex MD simulation
; Run parameters
integrator             = md       ; leap-frog integrator
nsteps                = 50000000 ; 0.002 * 50000000 = 100000 ps (100 ns)
dt                   = 0.002    ; 2 fs
; Output control
nstenergy             = 5000    ; save energies every 10.0 ps
nstlog                = 5000    ; update log file every 10.0 ps
nstxout-compressed    = 5000    ; save coordinates every 10.0 ps
; Bond parameters
continuation          = yes    ; continuing from NPT
constraint_algorithm = lincs    ; holonomic constraints
constraints          = h-bonds ; bonds to H are constrained
lincs_iter             = 1       ; accuracy of LINCS
lincs_order          = 4       ; also related to accuracy
; Neighbor searching and vdW
cutoff-scheme          = Verlet
ns_type                = grid    ; search neighboring grid cells
nstlist                = 20       ; largely irrelevant with Verlet
rlist                = 1.2
vdwtype                = cutoff
vdw-modifier          = force-switch
rvdw-switch          = 1.0
rvdw                   = 1.2    ; short-range van der Waals cutoff (in nm)
; Electrostatics
coulombtype          = PME    ; Particle Mesh Ewald for long-range electrostatics
rcoulomb             = 1.2
pme_order             = 4       ; cubic interpolation
fourierspacing       = 0.16    ; grid spacing for FFT
; Temperature coupling
tcoupl                = V-rescale                   ; modified Berendsen thermostat
tc-grps                = Protein_2xt Water_NA       ; two coupling groups - more accurate
tau_t                = 0.1 0.1                   ; time constant, in ps
ref_t                = 300 300                   ; reference temperature, one for each group, in K
; Pressure coupling
pcoupl                = Parrinello-Rahman          ; pressure coupling is on for NPT
pcoupltype             = isotropic                   ; uniform scaling of box vectors
tau_p                = 2.0                         ; time constant, in ps
ref_p                = 1.0                         ; reference pressure, in bar
compressibility       = 4.5e-5                      ; isothermal compressibility of water, bar^-1
; Periodic boundary conditions
pbc                   = xyz    ; 3-D PBC
; Dispersion correction is not used for proteins with the C36 additive FF
DispCorr             = no
; Velocity generation
gen_vel                = no       ; continuing from NPT equilibration

I would appreciate any pointers. (PS: the same system on my own computer with a 1660 Super runs at almost full GPU utilization and takes about 27 h.)

sobereva · Posted on 2023-3-1 17:50:42

Don't hand-write tags like [GROMACS] yourself. I have removed it for you; please keep this in mind in future.

Entropy.S.I · Posted on 2023-3-1 18:04:29

Without NVLink, or with NVLink but without NVLink support enabled at compile time, don't expect to run one job across multiple GPUs. Stick to one job per GPU.

不辣的麻辣烫 · Posted on 2023-3-1 18:12:10

sobereva posted on 2023-3-1 17:50
Don't hand-write tags like [GROMACS] yourself. I have removed it for you; please keep this in mind in future.

Understood, boss.

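不辣的麻辣烫 · Posted on 2023-3-1 18:15:18

Entropy.S.I posted on 2023-3-1 18:04
Without NVLink, or with NVLink but without NVLink support enabled at compile time, don't expect to run one job across multiple GPUs. Stick to one job per GPU.

I saw the NVLink issue mentioned in your earlier posts, so afterwards I ran it once on a single card with 12 cores. The estimated completion time was again close to 24 h, and GPU utilization was again 0%. Where does the problem usually lie in a case like this? The .sh file was the same.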

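Entropy.S.I · Posted on 2023-3-1 18:31:08

不辣的麻辣烫 posted on 2023-3-1 18:15
I saw the NVLink issue mentioned in your earlier posts, so afterwards I ran it once on a single card with 12 cores. The estimated completion time was again close to 24 h, and GPU utilization ...

I looked carefully at the information you gave. With these simulation parameters, about 100 ns/day on one V100 is normal, roughly the level of a 2080 Ti or 3070.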
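
The 100 ns/day figure can be checked against the posted numbers with a little arithmetic. The sketch below is only an illustration: the helper name ns_per_day is made up here, and it assumes the run length is nsteps × dt as given in the md file (50000000 steps × 0.002 ps = 100 ns) and that the wall time is the estimated 24 h.

```shell
#!/bin/bash
# ns_per_day NSTEPS DT_PS WALL_HOURS
# Hypothetical helper: converts a run length and wall time to ns/day.
ns_per_day() {
    # nsteps * dt gives ps; /1000 gives ns; /hours * 24 gives ns per day.
    LC_ALL=C awk -v n="$1" -v dt="$2" -v h="$3" \
        'BEGIN { printf "%.1f\n", n * dt / 1000 / h * 24 }'
}

# Values taken from the posted md file and the estimated 24 h run time.
ns_per_day 50000000 0.002 24   # prints 100.0
```

Under the same assumption about run length, `ns_per_day 50000000 0.002 27` gives 88.9 for the 27 h quoted for the 1660 Super.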

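I don't know how you checked GPU utilization. If you ran nvidia-smi directly on the login node, the output is not the compute node's GPU information at all. You can instead add nvidia-smi to the queue script.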
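
One possible way to do this is sketched below, as an illustration rather than a tested recipe for this cluster. It samples nvidia-smi in the background on the compute node while mdrun runs. The log file name and the 30 s interval are arbitrary choices made here, and the GMX_* exports from the original script are left out for brevity.

```shell
#!/bin/bash
#SBATCH --job-name=md_out
#SBATCH --error=%j.err

# Load GROMACS first, as in the original script (module add GROMACS/2023-dev).

# Hypothetical log file name; SLURM_JOB_ID is set by Slurm inside a job.
GPU_LOG="gpu_usage_${SLURM_JOB_ID:-manual}.csv"
MONITOR_PID=""

# Sample the compute node's GPUs every 30 s in the background.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used \
               --format=csv -l 30 > "$GPU_LOG" &
    MONITOR_PID=$!
else
    echo "nvidia-smi not found on $(hostname)" >&2
fi

# Same mdrun command as in the original script.
if command -v gmx_mpi >/dev/null 2>&1; then
    gmx_mpi mdrun -v -pin on -nb gpu -bonded gpu -pme gpu -update gpu -deffnm md
else
    echo "gmx_mpi not found; load the GROMACS module first" >&2
fi

# Stop the sampler once mdrun has finished.
if [ -n "$MONITOR_PID" ]; then
    kill "$MONITOR_PID" 2>/dev/null || true
fi
```

After the job ends, the CSV file in the working directory shows the utilization as seen from the compute node, which is the number that matters here.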

abin · Posted on 2023-3-1 19:51:04

NVLink is hardware.

If the hardware lacks it, running a single job across multiple GPUs performs very poorly.

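不辣的麻辣烫 · Posted on 2023-3-2 16:04:41

Entropy.S.I posted on 2023-3-1 18:31
I looked carefully at the information you gave. With these simulation parameters, about 100 ns/day on one V100 is normal, roughly the level of a 2080 Ti or 3070. ...

OK, thank you very much.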
