I compiled VASP 6.1.0 on a supercomputing-center cluster, but with GPU acceleration enabled the efficiency is actually lower than with plain CPU parallelization:
Test system:
- TiO2(110) rutile + O2, with LDAU, ISPIN=2, 36 k-points, symmetry off (see the attached input files; a minimal INCAR sketch follows this item)
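Since the attached input files are not reproduced in the post, the following is only a minimal sketch of the INCAR tags implied by the description above (spin polarization, symmetry off, DFT+U); the LDAU species assignments and U/J values are placeholders, not the actual attached input:

# hypothetical reconstruction of the key INCAR tags described above
cat > INCAR <<'EOF'
ISPIN    = 2          ! spin-polarized
ISYM     = 0          ! symmetry switched off
LDAU     = .TRUE.     ! DFT+U
LDAUTYPE = 2          ! placeholder; not taken from the attachment
LDAUL    = 2 -1       ! placeholder: +U on Ti d states, none on O
LDAUU    = 4.0 0.0    ! placeholder U values
LDAUJ    = 0.0 0.0
NCORE    = 1          ! required by the GPU port; NCORE = 6 was used for the CPU runs
EOF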
Test environment: nodes with 36 CPU cores and 2× V100 GPUs each; Intel 2018 compilers, Intel MPI, and CUDA 11.1 (module names are given in the submission script below).
Test results (time per electronic step):
- 1 node with no GPU: 20s, (i.e. 36 mpi tasks)
- 2 nodes with no GPU: 11s, (i.e. 72 mpi tasks)
- 1 node with V100x2: 83s, (1 mpi task)
- 1 node with V100x2: 88s, (36 mpi tasks)
- 1 node with V100x2: 44s, (2 mpi tasks)
- 1 node with V100x2: 43s, (4 mpi tasks)
Notes:
- The GPU port requires NCORE=1, whereas NCORE=6 was used for the pure-CPU runs; all other parameters were identical.
- While the GPU runs were in progress I ssh'd into the compute node and checked the GPU load with nvidia-smi; in all four GPU cases both cards showed 95%+ utilization and about 20% memory usage (a monitoring sketch follows these notes).
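For reference, the check described in the second note can be scripted; the node name below is a placeholder, and the query fields are standard nvidia-smi options:

# poll both cards every 5 s while the job is running ("gpu-node-01" is a placeholder hostname)
ssh gpu-node-01 \
    "nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5"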
Submission script (an alternative one-rank-per-GPU sketch follows the listing):
#!/bin/bash
#SBATCH --account some_account
#SBATCH -p analysis
#SBATCH -N 1
#SBATCH -n 6
#SBATCH -c 6
##SBATCH --ntasks-per-node=32
#SBATCH -t 02:00:00
#SBATCH --gres=gpu:volta:2
#SBATCH --gres-flags=disable-binding
source /etc/profile.d/modules.bash
module purge
ulimit -s unlimited
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module purge
module load intel/ips_18
module load impi/ips_18
module load cuda/11.1.0-intel
echo "============================================================"
module list
env | grep "MKLROOT="
echo "============================================================"
echo "Job ID: $SLURM_JOB_ID"
echo "Job name: $SLURM_JOB_NAME"
echo "Number of nodes: $SLURM_JOB_NUM_NODES"
echo "Number of processors: $SLURM_NTASKS"
echo "Task is running on the following nodes:"
echo $SLURM_JOB_NODELIST
echo "============================================================"
echo
srun ../bin/vasp_gpu
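Since the best GPU timings above came from running only a few MPI ranks (43-44 s with 2 or 4 tasks), here is a sketch of a submission along those lines; the account, partition, and module names are copied from the script above, and one rank per V100 is an assumption to test, not a verified optimum:

#!/bin/bash
#SBATCH --account some_account
#SBATCH -p analysis
#SBATCH -N 1
#SBATCH -n 2                       # one MPI rank per GPU (assumption to test)
#SBATCH -c 18                      # split the node's 36 cores between the two ranks (arbitrary here)
#SBATCH -t 02:00:00
#SBATCH --gres=gpu:volta:2
source /etc/profile.d/modules.bash
module purge
module load intel/ips_18 impi/ips_18 cuda/11.1.0-intel
ulimit -s unlimited
export OMP_NUM_THREADS=1           # keep OpenMP threading out of this comparison
srun ../bin/vasp_gpu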
Compilation settings (makefile.include; a quick toolchain cross-check follows the listing):
# Precompiler options
CPP_OPTIONS= -DHOST=\"LinuxIFC\"\
             -DMPI -DMPI_BLOCK=8000 -Duse_collective \
             -DscaLAPACK \
             -DCACHE_SIZE=4000 \
             -Davoidalloc \
             -Dvasp6 \
             -Duse_bse_te \
             -Dtbdyn \
             -Dfock_dblbuf

CPP        = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)
FC         = mpiifort
FCL        = mpiifort -mkl=sequential
FREE       = -free -names lowercase
FFLAGS     = -assume byterecl -w -xHOST
OFLAG      = -O2
OFLAG_IN   = $(OFLAG)
DEBUG      = -O0

MKL_PATH   = $(MKLROOT)/lib/intel64
BLAS       =
LAPACK     =
BLACS      = -lmkl_blacs_intelmpi_lp64
SCALAPACK  = $(MKL_PATH)/libmkl_scalapack_lp64.a $(BLACS)

OBJECTS    = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d.o
INCS       = -I$(MKLROOT)/include/fftw
LLIBS      = $(SCALAPACK) $(LAPACK) $(BLAS)

OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o

# For what used to be vasp.5.lib
CPP_LIB    = $(CPP)
FC_LIB     = $(FC)
CC_LIB     = icc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB   = $(FREE)
OBJECTS_LIB= linpack_double.o getshmem.o

# For the parser library
CXX_PARS   = icpc
LLIBS      += -lstdc++

# Normally no need to change this
SRCDIR     = ../../src
BINDIR     = ../../bin

#================================================
# GPU Stuff
CPP_GPU    = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DUSE_PINNED_MEMORY -DCUFFT_MIN=28 -UscaLAPACK -Ufock_dblbuf
OBJECTS_GPU= fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o
CC         = mpiicc
CXX        = mpiicpc
CFLAGS     = -fPIC -DADD_ -Wall -qopenmp -DMAGMA_WITH_MKL -DMAGMA_SETAFFINITY -DGPUSHMEM=300 -DHAVE_CUBLAS
CUDA_ROOT  ?= /usr/local/cuda/
NVCC       := $(CUDA_ROOT)/bin/nvcc -ccbin=mpiicc
CUDA_LIB   := -L$(CUDA_ROOT)/lib64 -lnvToolsExt -lcudart -lcuda -lcufft -lcublas
#GENCODE_ARCH := -gencode=arch=compute_30,code=\"sm_30,compute_30\" \
#                -gencode=arch=compute_35,code=\"sm_35,compute_35\" \
#                -gencode=arch=compute_60,code=\"sm_60,compute_60\" \
#                -gencode=arch=compute_70,code=\"sm_70,compute_70\" \
#                -gencode=arch=compute_72,code=\"sm_72,compute_72\"
GENCODE_ARCH := -gencode=arch=compute_35,code=\"sm_35,compute_35\" \
                -gencode=arch=compute_60,code=\"sm_60,compute_60\" \
                -gencode=arch=compute_70,code=\"sm_70,compute_70\" \
                -gencode=arch=compute_72,code=\"sm_72,compute_72\"
MPI_INC    = $(I_MPI_ROOT)/include64/
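As a quick cross-check that makefile.include matches the runtime environment (module names are taken from the submission script above; this only verifies the toolchain and GPU architecture, it does not by itself explain the slowdown):

# sanity check of the toolchain assumed by makefile.include
module purge
module load intel/ips_18 impi/ips_18 cuda/11.1.0-intel
which mpiifort mpiicc mpiicpc nvcc   # compilers referenced by FC, CC, CXX and NVCC
nvcc --version                       # should report CUDA 11.1, matching cuda/11.1.0-intel
echo "MKLROOT=$MKLROOT"              # MKL_PATH and INCS point under this directory
nvidia-smi -L                        # V100 = compute capability 7.0, covered by -gencode=arch=compute_70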
My questions are:
- Is there anything wrong with my compilation settings?
- Is there anything wrong with my submission script?
- Is there anything wrong with my test system or test parameters?
- I still suspect I am simply using the GPU build the wrong way; any advice would be appreciated.