This post was last edited by 乐平 on 2023-4-23 22:24
Something strange is going on. I used the same build environment (gcc-9.3, linking against the MKL from Intel oneAPI 2022.02)
and the same toolchain command:
- ./install_cp2k_toolchain.sh --math-mode=mkl --with-gcc=install --with-openmpi=install --with-ptscotch=install --with-superlu=install --with-pexsi=install --with-quip=install --with-plumed=install
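The only difference is that one cluster uses the PBS job scheduler (call it cluster 1) and the other uses the Slurm job scheduler (call it cluster 2). Admittedly, the Open MPI build differed slightly: on the Slurm cluster I skipped Slurm support, as discussed in the thread 在有 slurm 的集群中编译 CP2K 报错 - 第一性原理 (First Principle) - 计算化学公社 (keinsci.com). Even so, the build environments should in principle be the same, since the only change was commenting out the extra --enable-mpi1-compatibility compatibility option.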
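In the end I built only the cp2k.psmp version, i.e. make -j xx ARCH=local VERSION=psmp (xx is the number of CPU cores; the two clusters have different core counts, so xx stands in for the actual value).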
This is where it gets interesting!
Build output on cluster 1:
- total 1.2G
- -rwxrwxr-x 1 huan huan 2.1M Apr 23 12:32 graph.psmp
- -rwxrwxr-x 1 huan huan 112K Apr 23 12:32 memory_utilities_unittest.psmp
- -rwxrwxr-x 1 huan huan 3.5M Apr 23 12:32 parallel_rng_types_unittest.psmp
- -rwxrwxr-x 1 huan huan 1.1M Apr 23 12:32 dbm_miniapp.psmp
- -rwxrwxr-x 1 huan huan 8.9M Apr 23 12:32 dbt_tas_unittest.psmp
- -rwxrwxr-x 1 huan huan 11M Apr 23 12:33 dbt_unittest.psmp
- -rwxrwxr-x 1 huan huan 2.4M Apr 23 12:33 grid_miniapp.psmp
- -rwxrwxr-x 1 huan huan 2.4M Apr 23 12:33 grid_unittest.psmp
- -rwxrwxr-x 1 huan huan 19K Apr 23 12:38 nequip_unittest.psmp
- -rwxrwxr-x 1 huan huan 979K Apr 23 12:38 dumpdcd.psmp
- -rwxrwxr-x 1 huan huan 926K Apr 23 12:38 xyz2dcd.psmp
- -rwxrwxr-x 1 huan huan 548M Apr 23 12:39 libcp2k_unittest.psmp
- -rwxrwxr-x 1 huan huan 548M Apr 23 12:39 cp2k.psmp
- lrwxrwxrwx 1 huan huan 9 Apr 23 12:39 cp2k.popt -> cp2k.psmp
- lrwxrwxrwx 1 huan huan 9 Apr 23 12:39 cp2k_shell.psmp -> cp2k.psmp
Build output on cluster 2:
- total 991M
- -rwxr-xr-x 1 huan users 3.4M Apr 18 20:35 parallel_rng_types_unittest.psmp
- -rwxr-xr-x 1 huan users 106K Apr 18 20:35 memory_utilities_unittest.psmp
- -rwxr-xr-x 1 huan users 2.0M Apr 18 20:35 graph.psmp
- -rwxr-xr-x 1 huan users 1.1M Apr 18 20:35 dbm_miniapp.psmp
- -rwxr-xr-x 1 huan users 8.5M Apr 18 20:35 dbt_tas_unittest.psmp
- -rwxr-xr-x 1 huan users 11M Apr 18 20:36 dbt_unittest.psmp
- -rwxr-xr-x 1 huan users 2.1M Apr 18 20:36 grid_unittest.psmp
- -rwxr-xr-x 1 huan users 2.1M Apr 18 20:36 grid_miniapp.psmp
- -rwxr-xr-x 1 huan users 15K Apr 18 20:39 nequip_unittest.psmp
- -rwxr-xr-x 1 huan users 921K Apr 18 20:39 xyz2dcd.psmp
- -rwxr-xr-x 1 huan users 975K Apr 18 20:39 dumpdcd.psmp
- -rwxr-xr-x 1 huan users 480M Apr 18 20:40 libcp2k_unittest.psmp
- -rwxr-xr-x 1 huan users 480M Apr 18 20:40 cp2k.psmp
- lrwxr-xr-x 1 huan users 9 Apr 18 20:40 cp2k_shell.psmp -> cp2k.psmp
- lrwxr-xr-x 1 huan users 9 Apr 18 20:40 cp2k.popt -> cp2k.psmp
As the listings show, the cp2k.psmp executable built on cluster 1 is 548 MB, whereas the one built on cluster 2 is only 480 MB.
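To put an exact number on the size gap, a small shell sketch like the one below could be used once both binaries are reachable from one machine. `size_diff` is a hypothetical helper name (not part of CP2K or its toolchain), and the file paths in the usage line are assumptions; it relies only on `wc -c` and `awk`.

```sh
# Hypothetical helper: size_diff A B
# Prints the byte size of A, the byte size of B, and how much larger or
# smaller B is relative to A, as a percentage.
size_diff() {
    a=$(wc -c < "$1")   # size of the first file in bytes
    b=$(wc -c < "$2")   # size of the second file in bytes
    awk -v a="$a" -v b="$b" \
        'BEGIN { printf "%d %d %.1f%%\n", a, b, 100 * (b - a) / a }'
}

# Example call (paths are assumptions):
# size_diff cluster1/cp2k.psmp cluster2/cp2k.psmp
```

With the rounded figures from the listings, 480 MB against 548 MB means the cluster 2 binary is roughly 12% smaller. The file size alone does not say which part of the binary differs; standard tools such as `ldd` (shared-library dependencies) and `size` from binutils (section sizes) may help narrow that down.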
Besides the size difference, the regression test results also differ (the test command was the same on both clusters):
- make ARCH=local VERSION=psmp TESTOPTS+="--mpiranks 4 --ompthreads 4 --timeout 2000" test
Test results on cluster 1:
- ------------------------------- Summary --------------------------------
- Number of FAILED tests 3
- Number of WRONG tests 2
- Number of CORRECT tests 3928
- Total number of tests 3933
- Summary: correct: 3928 / 3933; wrong: 2; failed: 3; 298min
- Status: FAILED
- *************************** Testing ended ******************************
- make[3]: *** [test] Error 5
- make[2]: *** [test] Error 2
- make[1]: *** [psmp] Error 2
- make: *** [test] Error 2
Test results on cluster 2:
- ------------------------------- Timings --------------------------------
- Plot: name="timings", title="Timing Distribution", ylabel="time [s]"
- PlotPoint: name="100th_percentile", plot="timings", label="100th %ile", y=28.93, yerr=0.0
- PlotPoint: name="99th_percentile", plot="timings", label="99th %ile", y=9.96, yerr=0.0
- PlotPoint: name="98th_percentile", plot="timings", label="98th %ile", y=7.63, yerr=0.0
- PlotPoint: name="95th_percentile", plot="timings", label="95th %ile", y=5.27, yerr=0.0
- PlotPoint: name="90th_percentile", plot="timings", label="90th %ile", y=4.20, yerr=0.0
- PlotPoint: name="80th_percentile", plot="timings", label="80th %ile", y=2.81, yerr=0.0
- ------------------------------- Summary --------------------------------
- Number of FAILED tests 5
- Number of WRONG tests 2
- Number of CORRECT tests 3926
- Total number of tests 3933
- Summary: correct: 3926 / 3933; wrong: 2; failed: 5; 47min
- Status: FAILED
- *************************** Testing ended ******************************
- make[3]: *** [test] Error 7
- make[2]: *** [test] Error 2
- make[1]: *** [psmp] Error 2
- make: *** [test] Error 2
So cluster 1 (the larger cp2k.psmp, 548 MB) shows fewer problems: 3 (FAILED tests) + 2 (WRONG tests) = 5,
whereas cluster 2 (the smaller cp2k.psmp, 480 MB) shows more: 5 (FAILED tests) + 2 (WRONG tests) = 7.
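The same tally can be pulled straight from a saved test log instead of being added up by hand. The following is a minimal sketch, assuming the log contains a `Summary:` line in the format shown in the test output above; `count_bad` is a hypothetical helper name and the log file names in the usage lines are assumptions.

```sh
# Hypothetical helper: read a regtest log on stdin and print
# "<wrong + failed> <total>" taken from its "Summary:" line.
count_bad() {
    awk '/Summary: correct:/ {
        line = $0
        gsub(/[^0-9]+/, " ", line)   # keep only the numbers
        split(line, f, " ")
        # f[1]=correct  f[2]=total  f[3]=wrong  f[4]=failed  f[5]=minutes
        print f[3] + f[4], f[2]
    }'
}

# Example calls (log file names are assumptions):
# count_bad < cluster1_test.log
# count_bad < cluster2_test.log
```

For the two summaries quoted above this prints `5 3933` for cluster 1 and `7 3933` for cluster 2.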
Although CP2K 2023-01 installed very smoothly (far more smoothly than the earlier 8.1, 8.2, 9.1, and 2022 versions), the discrepancies above are quite puzzling...