GROMACS 2019.3 GPU acceleration performance comparison

mol · Posted on 2019-7-17 11:02:49

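Hello everyone,

Our group recently added a Platinum 8173M + RTX 2080 Ti server. I ran a quick speed comparison, shared here for reference:
For a 17,000-atom system:
E5-2686 v4 + GTX 1080 Ti machine: 156 ns/day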

GROMACS version: 2019.3
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/bin/cc GNU 4.8.5
C compiler flags: -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/bin/c++ GNU 4.8.5
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda-9.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2017 NVIDIA Corporation;Built on Fri_Nov__3_21:07:56_CDT_2017;Cuda compilation tools, release 9.1, V9.1.85
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;;; ;-mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 9.10
CUDA runtime: 9.10


Platinum 8173M + RTX 2080 Ti machine, compiled with GCC 5.5.0 and the AVX_512 instruction set: 154 ns/day

GROMACS version: 2019.3
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX_512
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/local/bin/gcc GNU 5.5.0
C compiler flags: -mavx512f -mfma -O2 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/local/bin/g++ GNU 5.5.0
C++ compiler flags: -mavx512f -mfma -std=c++11 -O2 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda-10.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2019 NVIDIA Corporation;Built on Wed_Apr_24_19:10:27_PDT_2019;Cuda compilation tools, release 10.1, V10.1.168
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;;; ;-mavx512f;-mfma;-std=c++11;-O2;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.10
CUDA runtime: 10.10


Same machine, compiled with GCC 5.5.0 and the AVX2_256 instruction set: 152 ns/day

GROMACS version: 2019.3
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support: CUDA
SIMD instructions: AVX2_256
FFT library: fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /usr/local/bin/gcc GNU 5.5.0
C compiler flags: -mavx2 -mfma -O2 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler: /usr/local/bin/g++ GNU 5.5.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O2 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler: /usr/local/cuda-10.1/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2019 NVIDIA Corporation;Built on Wed_Apr_24_19:10:27_PDT_2019;Cuda compilation tools, release 10.1, V10.1.168
CUDA compiler flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math;;; ;-mavx2;-mfma;-std=c++11;-O2;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver: 10.10
CUDA runtime: 10.10


It looks like the new machine is not even as fast as the old one...
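To put the gap in numbers, here is a minimal Python sketch that computes each build's throughput relative to the old machine. The ns/day values are the three figures reported above; the dictionary labels and the `relative_change` helper are illustrative names, not part of GROMACS.

```python
# Throughput figures reported in this post (ns/day), 17,000-atom system.
results = {
    "E5-2686 v4 + GTX 1080 Ti (AVX2_256)": 156.0,
    "Platinum 8173M + RTX 2080 Ti (AVX_512)": 154.0,
    "Platinum 8173M + RTX 2080 Ti (AVX2_256)": 152.0,
}


def relative_change(value, baseline):
    """Percent change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Use the old machine as the reference point.
baseline = results["E5-2686 v4 + GTX 1080 Ti (AVX2_256)"]

for name, ns_per_day in results.items():
    change = relative_change(ns_per_day, baseline)
    print(f"{name}: {ns_per_day:.0f} ns/day ({change:+.1f}% vs. old machine)")
```

Both builds on the new machine land within about 3% of the old one, so single-run differences this small may be within run-to-run noise.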

puzhongji · Posted on 2019-7-19 08:40:46

This result is quite unexpected.

bobosiji · Posted on 2019-8-8 08:30:02

puzhongji posted on 2019-7-19 08:40
This result is quite unexpected.

Maybe the system is too small for the new machine's advantage to show?

308866814 · Posted on 2019-8-10 10:41:18

Hello, could you attach the tpr file so that others can test the ns/day on different platform configurations?
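For anyone collecting numbers from several machines, a small sketch like the following could pull the ns/day figure out of each run's log. It assumes the usual `Performance:` summary line that mdrun writes near the end of md.log; the `sample` text is a made-up excerpt for illustration, not the original poster's log, and `parse_ns_per_day` is a hypothetical helper name.

```python
import re


def parse_ns_per_day(log_text):
    """Return the ns/day value from a 'Performance:' line, or None if absent."""
    # The first number on the line is ns/day, the second is hour/ns.
    match = re.search(
        r"^Performance:\s+([0-9.]+)\s+([0-9.]+)", log_text, re.MULTILINE
    )
    return float(match.group(1)) if match else None


# Illustrative excerpt in the layout of an md.log summary (assumed, not real data).
sample = """\
                 (ns/day)    (hour/ns)
Performance:      156.000        0.154
"""

print(parse_ns_per_day(sample))
```

In practice one would read each machine's md.log from disk and pass its contents to the function, then compare the values side by side.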

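StormSpirts · Posted on 2019-12-19 18:32:40

According to the benchmark results in https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.26011, the 2080 Ti clearly outperforms the 1080 Ti. Perhaps the original poster could rerun the test under the conditions used in that paper?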
