Double the Performance? A Comprehensive Test of the RTX 4090 for Scientific Computing: Classical MD Simulation

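Entropy.S.I · Posted on 2024-1-19 17:40:48

gauss98 posted on 2024-1-19 09:32
Thanks for the testing.
Do you have any multi-GPU performance tests?
For example, configurations and tests with 6 or 8 RTX 4090 (D?) cards.

Building such a platform is too expensive, so I have not tested it, and there is no need to: in a multi-GPU machine, each card has its own independent CPU and communication resources.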

wangyueda · Posted on 2024-3-11 17:02:42

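A question for the OP: I ran system B from your benchmark set on our group's server, and the best I can get is 91.220 ns/day, far below your peak of over 300 ns/day. Why is that? Also, "-update gpu -bonded gpu" is faster for me than "-update gpu" alone (91.220 ns/day versus 76.527 ns/day), and -ntomp 1 is faster than any -ntomp > 1 (8, 12, 16, etc.). Where is my problem? Thanks.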

My machine:
CPU: AMD EPYC 7402 24-Core Processor
GPU: A100*8

gmx version information:

gmx -version
GROMACS version: 2022.6
Precision: mixed
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.8-sse2-avx
GPU FFT library: cuFFT
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 9.4.0
C compiler flags: -mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -O3 -DNDEBUG
C++ compiler: /usr/bin/c++ GNU 9.4.0
C++ compiler flags: -mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA compiler: /data/soft/cuda-sdk/12.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2023 NVIDIA Corporation;Built on Tue_Feb__7_19:32:13_PST_2023;Cuda compilation tools, release 12.1, V12.1.66;Build cuda_12.1.r12.1/compiler.32415258_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;--generate-code=arch=compute_89,code=sm_89;--generate-code=arch=compute_90,code=sm_90;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-D_FORCE_INLINES;-mavx2 -mfma -pthread -Wno-missing-field-initializers -fexcess-precision=fast -funroll-all-loops -fopenmp -O3 -DNDEBUG
CUDA driver: 12.20
CUDA runtime: 12.10


Submission script:

#!/bin/bash
#An example.
#SBATCH -J wyd-test
#SBATCH -p normal # use the specified partition
#SBATCH --qos=normalqos # QoS matching the normal partition
#SBATCH --gres=gpu:1 # number of GPUs to use
gmx mdrun -pin on -ntmpi 1 -ntomp 1 -notunepme -bonded gpu -update gpu -v -deffnm B
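
One way to compare settings such as the -ntomp values mentioned above is to run one short benchmark per thread count and read ns/day from each log. The following is a minimal sketch, not the poster's actual workflow: the function names ns_per_day and scan_ntomp, the step count, and the output file naming are illustrative assumptions. It also assumes the Slurm job has really been allocated as many cores as the largest -ntomp value (for example via --cpus-per-task); if the job only gets one core, larger thread counts would oversubscribe it, which could be one reason -ntomp 1 appears fastest.

```shell
#!/bin/bash
# ns_per_day LOGFILE
# Prints the ns/day figure from the "Performance:" line of a GROMACS log.
ns_per_day() {
    awk '/^Performance:/ { print $2 }' "$1"
}

# scan_ntomp DEFFNM N1 [N2 ...]
# Runs one short benchmark per OpenMP thread count and reports ns/day.
# Assumption: DEFFNM.tpr exists; 20000 steps is an arbitrary short length.
scan_ntomp() {
    local deffnm=$1; shift
    local n
    for n in "$@"; do
        gmx mdrun -pin on -ntmpi 1 -ntomp "$n" -notunepme \
            -bonded gpu -update gpu -nsteps 20000 -resethway \
            -s "${deffnm}.tpr" -deffnm "${deffnm}_ntomp${n}"
        echo "ntomp=${n} $(ns_per_day "${deffnm}_ntomp${n}.log") ns/day"
    done
}

# Example call (commented out so that sourcing this file runs nothing):
# scan_ntomp B 1 3 6 12
```

The -resethway option resets the timers halfway through the run, so start-up cost does not distort the reported ns/day of a short benchmark.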


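Entropy.S.I · Posted on 2024-3-12 10:11:36

Last edited by Entropy.S.I on 2024-3-12 10:34

wangyueda posted on 2024-3-11 17:02
A question for the OP: I ran system B from your benchmark set on our group's server, and the best I can get is 91.220 ns/day, far below your peak of 300 ...

The CPU is a complete mess, so it is no surprise that -bonded gpu is faster in this situation.

http://bbs.keinsci.com/thread-39266-1-1.html
Look at the last figure: a 4060 that costs less than 3,000 yuan is faster than your A100, which costs well over 100,000 yuan.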

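wangyueda · Posted on 2024-3-12 10:32:58

Entropy.S.I posted on 2024-3-12 10:11
The CPU is a complete mess, so it is no surprise that -bonded gpu is faster in this situation. It would be good enough if the A100 could match a 2080 Ti.

OK, thanks. So my speed (91.220 ns/day) is more or less the limit for the current configuration, right?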

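Entropy.S.I · Posted on 2024-3-12 10:39:48

wangyueda posted on 2024-3-12 10:32
OK, thanks. So my speed (91.220 ns/day) is more or less the limit for the current configuration, right?

Not necessarily. You need to tune carefully for the CPU architecture and minimize the impact of inter-core latency. Each CCX on the EPYC 7402 has only 3 cores; using just 3 cores / 6 threads and binding all OpenMP threads to the same CCX may be faster than using more cores. For a CPU with inter-core latency this poor, I do not recommend giving a single MPI rank many cores.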
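
The single-CCX suggestion above could be tried roughly as follows. This is only a sketch under stated assumptions: the helper name ccx_mdrun_cmd is made up for illustration, SMT is assumed to be enabled, and the hardware threads of one CCX are assumed to be contiguous in the logical core numbering that mdrun uses for -pinoffset (worth checking with lscpu or hwloc before trusting the offsets). It also assumes the job is allowed to use those cores, which under Slurm depends on the allocation. The function only prints the command so it can be inspected before running.

```shell
#!/bin/bash
CORES_PER_CCX=3      # EPYC 7402: 3 cores per CCX, as stated in the post above
THREADS_PER_CORE=2   # assumption: SMT is enabled

# ccx_mdrun_cmd CCX_INDEX DEFFNM
# Prints a gmx mdrun command whose OpenMP threads all sit in one CCX.
ccx_mdrun_cmd() {
    local ccx=$1 deffnm=$2
    # One thread per hardware thread of the CCX: 3 cores x 2 = 6 threads.
    local ntomp=$(( CORES_PER_CCX * THREADS_PER_CORE ))
    # First logical core of the chosen CCX (assumes contiguous numbering).
    local offset=$(( ccx * ntomp ))
    echo "gmx mdrun -pin on -pinoffset ${offset} -pinstride 1 -ntmpi 1 -ntomp ${ntomp} -notunepme -bonded gpu -update gpu -v -deffnm ${deffnm}"
}

# Print the command for the first CCX and system B.
ccx_mdrun_cmd 0 B
```

With -pinstride 1 the threads are packed onto consecutive logical cores starting at -pinoffset, so 6 threads cover exactly the 3 cores of one CCX under the numbering assumption above. Setting THREADS_PER_CORE=1 would not give a correct one-thread-per-core layout by itself, since -pinstride would also need to change.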

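wangyueda · Posted on 2024-3-12 11:01:58

Last edited by wangyueda on 2024-4-10 20:44

Entropy.S.I posted on 2024-3-12 10:39
Not necessarily. You need to tune carefully for the CPU architecture and minimize the impact of inter-core latency. Each CCX on the EPYC 7402 has only 3 cores; using just 3 cores / 6 threads, ...

OK, thanks. I will try again.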

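Entropy.S.I · Posted on 2024-3-12 12:05:27

wangyueda posted on 2024-3-12 11:01
OK, thanks. I will try again. (PS: I only just joined the group, and I do wonder what whoever bought this machine was thinking...)

Non-specialists may assume that "having a GPU is enough", not realizing that in a niche field like scientific computing, hardly any program manages to keep CPU performance from holding back the GPU, so the CPU often becomes the bottleneck of the whole calculation. Only the AI industry has enough manpower to do the low-level optimization.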

tienan0412 · Posted on 2024-4-3 20:48:19

How should I choose between the 4090D and the 4080? Is the 4090D worth its higher price?

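zdworld · Posted on 2024-10-6 15:02:37

Now that 13th- and 14th-gen Intel CPUs have the degradation ("缩缸") problem, do they still have a performance advantage over AMD?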
