Help: slow MD simulation on a cluster with A100 GPUs (GROMACS 2019.1)

xingbb · Posted on 2022-4-29 17:23:20

GROMACS 2019.1 is installed on a cluster (Slurm), and I am running MD simulations on an A100 node.
Node specifications: 8 A100 GPUs; 96 CPU cores (Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz).

GROMACS build information:
GROMACS version: 2019.1
Precision:       single
Memory model:    64 bit
MPI library:       thread_mpi
OpenMP support:    enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:       CUDA
SIMD instructions:  AVX2_256
FFT library:       fftw-3.3.3-sse2
RDTSCP usage:    enabled
TNG support:       enabled
Hwloc support:    disabled
Tracing support: disabled
C compiler:       /bin/cc GNU 4.8.5
C compiler flags: -mavx2 -mfma    -O2 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler:    /bin/c++ GNU 4.8.5
C++ compiler flags:  -mavx2 -mfma -std=c++11 -O2 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler:    /usr/local/cuda-10.2/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2019 NVIDIA Corporation;Built on Wed_Oct_23_19:24:38_PDT_2019;Cuda compilation tools, release 10.2, V10.2.89
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;;; ;-mavx2;-mfma;-std=c++11;-O2;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:       11.60
CUDA runtime:    10.20

Job submission script. I used NPT equilibration as the test; the system is a protein-ligand system with 7167 atoms.
#!/bin/bash
#SBATCH --job-name=dutp_ligand
#SBATCH --partition=A100
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1
#SBATCH --output=test.out.%j
#SBATCH --error=test.err.%j
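gmx mdrun -deffnm npt -nb gpu    ### performance: 27.773 ns/day

gmx mdrun -deffnm npt -ntmpi 1 -ntomp 12 -pme gpu  ### performance: 39.526 ns/day

I also tried running in parallel across multiple GPUs, but that was even slower. Any advice would be appreciated.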

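Frozen-Penguin · Posted on 2022-4-29 19:39:21

GPU utilization is probably very low; if you can monitor it, I suggest checking.
Giving the run more CPU threads should speed it up further; try -ntmpi 1 -ntomp 40 and see whether it gets faster.
For how to optimize GROMACS performance, see this post:
http://bbs.keinsci.com/forum.php ... 1&fromuid=36081 (source: 计算化学公社)
It is very detailed, but it is a few years old and some of its conclusions may not apply to the A100, so you will have to find the fastest settings yourself.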
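
A minimal sketch of the kind of thread-count scan suggested here, assuming npt.tpr is in the working directory, gmx is on PATH, and the Slurm job has actually been allocated the cores being tested (for example with --cpus-per-task). The function names and the scan_ntomp* output prefix are hypothetical; -pme gpu is taken from the original poster's second command.

```shell
#!/bin/bash
# Extract the ns/day figure from an mdrun log. mdrun ends its log with a line
# such as "Performance:       39.526        0.607" (ns/day, then hour/ns).
perf_from_log() {
    awk '/^Performance:/ {print $2}' "$1"
}

# Run a short benchmark for each -ntomp value given as an argument.
# -nsteps limits the run length; -resethway resets the timers halfway
# so that start-up and load-balancing costs are excluded from the result.
scan_ntomp() {
    for n in "$@"; do
        gmx mdrun -s npt.tpr -deffnm "scan_ntomp${n}" \
            -ntmpi 1 -ntomp "$n" -nb gpu -pme gpu -pin on \
            -nsteps 20000 -resethway
        echo "ntomp=${n}: $(perf_from_log "scan_ntomp${n}.log") ns/day"
    done
}

# Usage, inside the job script:
#   scan_ntomp 4 8 12 16 24
```

For a system of only 7167 atoms, more threads will not necessarily help, which is why the scan reports each setting rather than assuming the largest value wins.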

skywalk · Posted on 2022-4-29 19:55:40

xingbb · Posted on 2022-4-29 20:25:04

skywalk posted on 2022-4-29 19:55
Run nvidia-smi and check the GPU memory usage; my guess is that the GPU is not being used.

Hello. While the GMX job is running, GPU Memory Usage shows 539MiB.
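
Memory usage only shows that a process holds a context on the GPU; it does not show how busy the GPU is. A minimal sketch for logging utilization while the job runs follows. The function names and the gpu.csv file name are hypothetical; the nvidia-smi query fields are standard ones.

```shell
#!/bin/bash
# Log "index, utilization %, memory MiB" once per second until killed.
log_gpu() {
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
               --format=csv,noheader,nounits -l 1 > "$1"
}

# Print the mean utilization per GPU index from a log written by log_gpu.
mean_util() {
    awk -F', ' '{sum[$1] += $2; n[$1]++}
                END {for (i in sum) printf "GPU %s: %.1f%%\n", i, sum[i]/n[i]}' "$1" | sort
}

# Usage, inside the job script:
#   log_gpu gpu.csv &
#   logger_pid=$!
#   gmx mdrun -deffnm npt -ntmpi 1 -ntomp 12 -pme gpu
#   kill "$logger_pid"
#   mean_util gpu.csv
```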

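xingbb · Posted on 2022-4-29 20:35:26

Frozen-Penguin posted on 2022-4-29 19:39
GPU utilization is probably very low; if you can monitor it, I suggest checking.
Giving the run more CPU threads should speed it up further; try -ntmpi 1 -n ...

gmx mdrun -deffnm npt -nb gpu ### performance: 27.773 ns/day. This command did not specify -ntomp; according to the log file it ran with 1 MPI rank and 48 OpenMP threads, and it was still slower than with -ntomp 12.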

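牧生 · Posted on 2022-4-29 20:46:20

Last edited by 牧生 on 2022-4-29 21:25

I think your mistake is that you did not restrict the job to a single GPU, so all the GPUs ended up running the same job, which made it slower.
Try the command I usually use:
gmx mdrun -deffnm md -v -ntomp 16 -ntmpi 1 -gpu_id 0 (optionally add -update gpu, depending on the case); see this post: http://bbs.keinsci.com/thread-27044-2-1.html#pid183389

For related benchmarks, see the conclusions in Part 4 of:
http://bbs.keinsci.com/thread-13861-1-1.html
I have never used a cluster, only my own machine, so this may not be entirely correct.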

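xingbb · Posted on 2022-4-29 21:14:20

牧生 posted on 2022-4-29 20:46
Looking at your command, gmx mdrun -deffnm npt -nb gpu, I think your mistake is that you did not restrict the job to a single ...

I have already tried the parameters from that article. With the parameters you gave, the result was 39.066 ns/day.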

abin · Posted on 2022-4-29 21:20:04

The A100 uses the Ampere architecture.

1.3. Verifying Ampere Compatibility for Existing Applications
The first step towards making a CUDA application compatible with the NVIDIA Ampere GPU architecture is to check if the application binary already contains compatible GPU code (at least the PTX). The following sections explain how to accomplish this for an already built CUDA application.

1.3.1. Applications Built Using CUDA Toolkit 10.2 or Earlier
CUDA applications built using CUDA Toolkit versions 2.1 through 10.2 are compatible with NVIDIA Ampere architecture based GPUs as long as they are built to include PTX versions of their kernels. This can be tested by forcing the PTX to JIT-compile at application load time with following the steps:
Download and install the latest driver from https://www.nvidia.com/drivers.
Set the environment variable CUDA_FORCE_PTX_JIT=1.
Launch the application.
With CUDA_FORCE_PTX_JIT=1, GPU binary code embedded in an application binary is ignored. Instead PTX code for each kernel is JIT-compiled to produce GPU binary code. An application fails to execute if it does not include PTX. This means the application is not compatible with the NVIDIA Ampere GPU architecture and needs to be rebuilt for compatibility. On the other hand, if the application works properly with this environment variable set, then the application is compatible with the NVIDIA Ampere GPU architecture.

Note: Be sure to unset the CUDA_FORCE_PTX_JIT environment variable after testing is done.

Ref:
https://docs.nvidia.com/cuda/amp ... mpere-compatibility

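Questions:
Which CUDA toolkit did you compile with?
The CUDA toolkit and the driver are two different things.
What is shown here is that the runtime is CUDA 10.2.
Try what NVIDIA describes above.

If your cluster can run Singularity, I can give you a build to try.

Otherwise, try compiling it yourself.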
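
A minimal sketch of the check described in the NVIDIA excerpt, applied to this build. The build information in the first post lists native code up to sm_70 plus compute_75 PTX and no sm_80 entry (CUDA 10.2 predates the A100), so the kernels are presumably being JIT-compiled from PTX. The helper names and the ptxjit_test output prefix are hypothetical, and it is an assumption that gmx --version prints the flags on one line.

```shell
#!/bin/bash
# Report whether a "CUDA compiler flags" string contains native code for SM $2.
has_native_sm() {
    case "$1" in
        *"code=sm_$2"*) echo yes ;;
        *)              echo no ;;
    esac
}

# List the PTX targets (code=compute_XX entries) embedded in the build.
ptx_targets() {
    printf '%s\n' "$1" | grep -o 'code=compute_[0-9]*' | cut -d_ -f2
}

# Short run with PTX JIT forced, as in the NVIDIA guide. Setting the variable
# for this one command only means it does not need to be unset afterwards.
ptx_jit_test() {
    CUDA_FORCE_PTX_JIT=1 gmx mdrun -s npt.tpr -deffnm ptxjit_test \
        -ntmpi 1 -ntomp 12 -nb gpu -nsteps 5000
}

# Usage:
#   flags=$(gmx --version | grep 'CUDA compiler flags')
#   has_native_sm "$flags" 80    # A100 is compute capability 8.0
#   ptx_targets "$flags"
#   ptx_jit_test
```

If the forced-JIT run works, the binary is compatible with the A100 but still not built with kernels compiled for it, which is the case for rebuilding with a newer toolkit.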

xingbb · Posted on 2022-4-29 21:40:35

abin posted on 2022-4-29 21:20
The A100 uses the Ampere architecture.

1.3. Verifying Ampere Compatibility for Existing Applications

Hello! I will check your questions with the administrator. What do you mean by giving me a build?

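abin · Posted on 2022-4-29 21:45:12

xingbb posted on 2022-4-29 21:40
Hello! I will check your questions with the administrator. What do you mean by giving me a build?

Please read carefully: "If your cluster can run Singularity, I can give you a build to try."
I mean, of course, the GPU build of GMX that I compiled and tested myself.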

plus · Posted on 2022-4-30 15:18:09

Using an A100 for MD is a waste.

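Entropy.S.I · Posted on 2022-4-30 17:51:14

Last edited by Entropy.S.I on 2022-4-30 17:53

Use GMX 2021.5 with the 510-series GPU driver and CUDA Toolkit 11.6. GMX 2021.5 is already optimized for the Ampere architecture. For single-GPU runs, specify -update gpu; for multi-GPU runs, first benchmark different combinations of ntmpi and ntomp by hand.

If this is the only GMX version available, the cluster has probably seldom been used for GMX, or the administrator is not experienced enough. You should talk to them about installing several more GMX versions, for example 2019.6, 2020.7, 2021.5, or even 2022.1.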
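
A rough sketch of what such a rebuild might look like, assuming an unpacked GROMACS 2021.5 source tree, CUDA Toolkit 11.6 on PATH, and a newer compiler than the GCC 4.8.5 shown in the build information (GROMACS 2021 requires C++17, which GCC 4.8.5 does not support). The function names, paths, and make job count are hypothetical; the CMake options themselves are standard GROMACS ones.

```shell
#!/bin/bash
# Print the CMake options for a CUDA build targeting the A100 (SM 8.0).
gmx_cmake_args() {
    printf '%s\n' \
        "-DGMX_GPU=CUDA" \
        "-DGMX_CUDA_TARGET_SM=80" \
        "-DGMX_BUILD_OWN_FFTW=ON" \
        "-DCMAKE_INSTALL_PREFIX=$1"
}

# Configure, build, and install. $1 = source directory, $2 = install prefix.
# The prefix must not contain spaces, because the options are word-split.
build_gromacs() {
    mkdir -p "$1/build" && cd "$1/build" || return 1
    cmake .. $(gmx_cmake_args "$2") && make -j 16 && make install
}

# Usage (paths are placeholders):
#   build_gromacs "$HOME/src/gromacs-2021.5" "$HOME/opt/gromacs-2021.5"
#   source "$HOME/opt/gromacs-2021.5/bin/GMXRC"
#   gmx mdrun -deffnm npt -ntmpi 1 -ntomp 12 -nb gpu -pme gpu -update gpu
```

Note that -update gpu is only available from GROMACS 2020 onward, so it cannot be used with the 2019.1 installation discussed in this thread.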
