计算化学公社

Title: SCF does not converge in a CP2K energy calculation / error with 2 2 1 k-points

Author: liheng    Time: 2022-5-4 15:40
Last edited by liheng on 2022-5-4 18:19

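Hello everyone,
       I recently started using CP2K 9.1 and found that the SCF does not converge in an energy calculation (300 steps, EPS_SCF 1E-5) with the &DIAGONALIZATION algorithm. From other posts here I gathered that this may be because only the gamma point is used (the model is a surface adsorption model, cell size 21.9*21.9*40.9 Å), so I tried a 2*2*1 k-point grid instead. I submitted the job from both a root and a non-root account, and both fail:
Error from the non-root account: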
SCF WAVEFUNCTION OPTIMIZATION

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2b87dcc14dfd in ???
#1  0x2b87dcc14013 in ???
#2  0x2b87ddd283ff in ???
#3  0x2b87ffd02280 in ???
#4  0x2b87d5b9d0c4 in ???
#5  0x2b87d5b9d33c in ???
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node node3 exited on signal 11 (Segmentation fault).
Error from the root account:
Operating system error: Cannot allocate memory
Allocation would exceed memory limit


Error termination. Backtrace:
#0  0x2abac60f0dfd in ???
#1  0x2abac60f1995 in ???
#2  0x2abac60f1ba5 in ???
#3  0x2b3ff60 in __cp_cfm_types_MOD_cp_cfm_create
        at /home/lh_cp2k/cp2k-9.1/src/fm/cp_cfm_types.F:140
#4  0x1ebca6d in __qs_scf_diagonalization_MOD_do_general_diag_kp
        at /home/lh_cp2k/cp2k-9.1/src/qs_scf_diagonalization.F:403
#5  0x12c0068 in __qs_scf_loop_utils_MOD_qs_scf_new_mos_kp
        at /home/lh_cp2k/cp2k-9.1/src/qs_scf_loop_utils.F:339
#6  0x12a2d20 in __qs_scf_MOD_scf_env_do_scf
        at /home/lh_cp2k/cp2k-9.1/src/qs_scf.F:491
#7  0x12af6af in __qs_scf_MOD_scf
        at /home/lh_cp2k/cp2k-9.1/src/qs_scf.F:243
#8  0x10a2794 in __qs_energy_MOD_qs_energies
        at /home/lh_cp2k/cp2k-9.1/src/qs_energy.F:93
#9  0x10c6c5f in __qs_force_MOD_qs_calc_energy_force
        at /home/lh_cp2k/cp2k-9.1/src/qs_force.F:120
#10  0xd97914 in __force_env_methods_MOD_force_env_calc_energy_force
        at /home/lh_cp2k/cp2k-9.1/src/force_env_methods.F:271
#11  0x7fd87e in cp2k_run
        at /home/lh_cp2k/cp2k-9.1/src/start/cp2k_runs.F:355
#12  0x80028f in __cp2k_runs_MOD_run_input
        at /home/lh_cp2k/cp2k-9.1/src/start/cp2k_runs.F:991
#13  0x7fa8db in cp2k
        at /home/lh_cp2k/cp2k-9.1/src/start/cp2k.F:387
#14  0x458e0c in main
        at /home/lh_cp2k/cp2k-9.1/src/start/cp2k.F:44



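The job used 36 cores and the server has 512 GB of available memory. Is the memory really insufficient? Has anyone else run into this problem?
Or does this system need more k-points at all? Could other parameters be involved? Thanks.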




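Author: sobereva    Time: 2022-5-5 08:40
The current k-point setting is sufficient.
Since you are already using two-dimensional periodicity, there is no need to also use SURFACE_DIPOLE_CORRECTION T.
The current problem is most likely caused by the runtime environment. With 512 GB of memory and 36 parallel processes, this system should not run out of memory. Use ulimit -a to check whether the amount of allocatable memory is restricted.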
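
A note on the ulimit check suggested above: limits reported in an interactive login shell can differ from those seen by processes started through mpirun or a batch scheduler, so it may help to print them from the same environment the job runs in. The following is only a minimal sketch; the function name show_mem_limits is hypothetical and not part of CP2K or any MPI distribution.

```shell
# Print the memory-related limits of the current shell
# (values are in kB, or "unlimited").
show_mem_limits() {
    echo "virtual memory (kB):  $(ulimit -v)"
    echo "max memory size (kB): $(ulimit -m)"
    echo "stack size (kB):      $(ulimit -s)"
}

show_mem_limits

# To see the limits inside an MPI-launched process, the same builtin can be
# run through the launcher, for example:
#   mpirun -np 1 bash -c 'ulimit -v; ulimit -m; ulimit -s'
```

If all three values print as unlimited in both places, a per-process limit is unlikely to be the cause.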
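Author: liheng    Time: 2022-5-5 13:52
Last edited by liheng on 2022-5-5 13:59
sobereva wrote on 2022-5-5 08:40:
The current k-point setting is sufficient.
Since you are already using two-dimensional periodicity, there is no need to also use SURFACE_DIPOLE_CORRECTION T.
The current problem is most likely caused by the runtime ...

By "the k-point setting is sufficient", do you mean 2*2*1, or is the gamma point alone enough?
The output of ulimit -a is: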
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2059828
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 999999
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2059828
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

After I submitted the job (36 cores, 2*2*1 k-points), memory usage shot up rapidly until all 500+ GB was in use,
and then the job failed with:
Operating system error: Cannot allocate memory
Allocation would exceed memory limit

Error termination. Backtrace:
#0  0x2ba96e094dfd in ???
#1  0x2ba96e095995 in ???
#2  0x2ba96e095ba5 in ???
#3  0x2d61891 in __dbcsr_ptr_util_MOD_mem_alloc_d
        at /home/lh_cp2k/cp2k-9.1/exts/dbcsr/src/data/dbcsr_ptr_util.F:343
#4  0x2d61891 in __dbcsr_ptr_util_MOD_mem_alloc_d
        at /home/lh_cp2k/cp2k-9.1/exts/dbcsr/src/data/dbcsr_ptr_util.F:322
#5  0x2d61d5b in __dbcsr_ptr_util_MOD_ensure_array_size_d
        at /home/lh_cp2k/cp2k-9.1/exts/dbcsr/src/data/dbcsr_ptr_util.F:264
#6  0x2d53940 in __dbcsr_data_methods_MOD_dbcsr_data_ensure_size
        at /home/lh_cp2k/cp2k-9.1/exts/dbcsr/src/data/dbcsr_data_methods.F:351
#7  0x2f1aaa2 in __dbcsr_block_access_MOD_dbcsr_reserve_blocks
        at /home/lh_cp2k/cp2k-9.1/exts/dbcsr/src/block/dbcsr_block_access.F:621
#8  0x2e5b869 in __dbcsr_api_MOD_dbcsr_reserve_blocks
        at /home/lh_cp2k/cp2k-9.1/exts/dbcsr/src/dbcsr_api.F:484
#9  0xd45c18 in __cp_dbcsr_cp2k_link_MOD_cp_dbcsr_alloc_block_from_nbl
        at /home/lh_cp2k/cp2k-9.1/src/cp_dbcsr_cp2k_link.F:565
#10  0x1cf37fe in __qs_core_hamiltonian_MOD_build_core_hamiltonian_matrix
        at /home/lh_cp2k/cp2k-9.1/src/qs_core_hamiltonian.F:350
#11  0x10a39f3 in qs_energies_init_hamiltonians
        at /home/lh_cp2k/cp2k-9.1/src/qs_energy_init.F:310
#12  0x10a39f3 in __qs_energy_init_MOD_qs_energies_init
        at /home/lh_cp2k/cp2k-9.1/src/qs_energy_init.F:108
#13  0x10a2603 in __qs_energy_MOD_qs_energies
        at /home/lh_cp2k/cp2k-9.1/src/qs_energy.F:84
#14  0x10c6c5f in __qs_force_MOD_qs_calc_energy_force
        at /home/lh_cp2k/cp2k-9.1/src/qs_force.F:120
#15  0xd97914 in __force_env_methods_MOD_force_env_calc_energy_force
        at /home/lh_cp2k/cp2k-9.1/src/force_env_methods.F:271
#16  0x7fd87e in cp2k_run
        at /home/lh_cp2k/cp2k-9.1/src/start/cp2k_runs.F:355
#17  0x80028f in __cp2k_runs_MOD_run_input
        at /home/lh_cp2k/cp2k-9.1/src/start/cp2k_runs.F:991
#18  0x7fa8db in cp2k
        at /home/lh_cp2k/cp2k-9.1/src/start/cp2k.F:387
#19  0x458e0c in main
        at /home/lh_cp2k/cp2k-9.1/src/start/cp2k.F:44








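Author: sobereva    Time: 2022-5-5 19:54
liheng wrote on 2022-5-5 13:52:
By "the k-point setting is sufficient", do you mean 2*2*1, or is the gamma point alone enough?
The output of ulimit -a is:
core file size          (blocks ...

2*2*1 is enough.

First run it on a single core and see what happens and how much memory it uses.
It could also be a compilation or parallelization problem; check how other jobs behave to judge.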
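
One way to measure the memory of a single-process run is GNU time with the -v option, which reports the peak resident set size in kB. The sketch below only parses such a log; the helper name peak_rss_gib and the file names job.inp, job.out and time.log are placeholders, not anything from this thread.

```shell
# Extract "Maximum resident set size (kbytes)" from a GNU `time -v` log
# and print it in GiB with one decimal place.
peak_rss_gib() {
    awk -F': ' '/Maximum resident set size/ { printf "%.1f\n", $2 / 1048576 }' "$1"
}

# Hypothetical usage with the serial binary:
#   /usr/bin/time -v cp2k.sopt -i job.inp -o job.out 2> time.log
#   peak_rss_gib time.log
```

Comparing this figure for the serial run against the total for the parallel run gives a rough idea of whether the parallel job is using far more memory than expected.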
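Author: liheng    Time: 2022-5-6 16:43
Last edited by liheng on 2022-5-9 16:21
sobereva wrote on 2022-5-5 19:54:
2*2*1 is enough.

First run it on a single core and see what happens and how much memory it uses.

I ran it with sopt and memory usage was a little over 70 GB. I then tested ssmp and psmp, and both behaved like the popt I had been using all along: all 500+ GB of memory was consumed. Is something seriously wrong with my setup? One of your earlier posts (http://sobereva.com/586) says that ssmp is OpenMP and psmp is MPI+OpenMP, and in theory OpenMP should save memory, yet memory is still insufficient here.
I also noticed the following when running psmp with the gamma point only:
1. With mpirun -np 36 cp2k.psmp **.inp (other VASP jobs were running at the time; the machine has 196 logical cores including hyperthreading, and about 46% of the threads were free apart from the VASP jobs), CPU usage went well beyond the 36 cores I requested and reached 99%. In other words, psmp seems to fill all the remaining cores.
2. Memory was also completely filled, unlike with popt. With popt, the number of cores used is exactly what I specify, and memory does not fill up with the gamma point.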
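
A possible explanation for the CPU behaviour in point 1, offered as an assumption rather than a confirmed diagnosis: psmp binaries use OpenMP threads inside each MPI rank, and when OMP_NUM_THREADS is unset the OpenMP runtime typically starts one thread per available core in every rank. The sketch below fixes the thread count explicitly; total_threads is a hypothetical helper and job.inp / job.out are placeholder names.

```shell
# Total number of threads = MPI ranks x OpenMP threads per rank.
total_threads() {
    echo $(( $1 * $2 ))
}

# One OpenMP thread per MPI rank, so 36 ranks occupy 36 cores.
export OMP_NUM_THREADS=1
echo "36 ranks x ${OMP_NUM_THREADS} thread(s) = $(total_threads 36 "$OMP_NUM_THREADS") threads"

# Hypothetical launch line:
#   mpirun -np 36 cp2k.psmp -i job.inp -o job.out
```

A hybrid layout such as 6 ranks with OMP_NUM_THREADS=6 also totals 36 threads. Which split works best depends on the machine and was not tested in this thread.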


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
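The earlier problem seems to be solved. When I first installed CP2K I built it with OpenMPI, but because using OpenMPI from the root account was troublesome, I recompiled, this time with the Intel MPI already present on the system. The old OpenMPI was never removed and its settings stayed in ~/.bashrc, so a single submission launched many copies of the job that repeated the same calculation, which is probably what caused the earlier memory blow-up. Reference: http://bbs.keinsci.com/thread-22032-1-1.html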
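
A mismatch of this kind (a launcher from one MPI, a binary linked against another) can usually be spotted by comparing the mpirun found on PATH with the MPI library the executable links to. The following is a minimal sketch under that assumption; check_mpi_env is a hypothetical helper and the path in the usage line is a placeholder.

```shell
# Report which mpirun is first on PATH and which MPI library a binary links to.
check_mpi_env() {
    bin="$1"
    echo "mpirun on PATH: $(command -v mpirun || echo 'not found')"
    if [ -x "$bin" ]; then
        # Show only the MPI-related shared libraries.
        ldd "$bin" | grep -i 'libmpi' || echo "no libmpi entry in ldd output"
    else
        echo "binary not found: $bin"
    fi
}

# Hypothetical usage:
#   check_mpi_env /path/to/cp2k.popt
```

If the mpirun path and the libmpi path point to different installations, each rank may start as an independent copy of the job, which matches the duplicated calculations described above.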

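A new problem has appeared: with OpenMPI (previously I used the Intel MPI already installed on the system), all tests pass and cp2k.sopt runs, but mpirun -np 36 cp2k.psmp, tested separately on two machines, fails with: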
Error termination. Backtrace:
Operating system error: Cannot allocate memory
Allocation would exceed memory limit

primary job terminated normally, but 1 process returned a non-zero exit code. per user-direction, the job has been aborted.






