|
LetsQu1t posted on 2025-8-16 09:41 These are good points to make. Indeed, such considerations can depend on one's specific computational workflows. Your experience with SMT and intensive workflows that use most of the CPU's computing power sounds typical, and is in line with my findings: the dual-socket EPYC 9654 system also saw a modest performance increase with SMT off in this scenario. However, I saw the opposite (and to a greater extent) across a wider range of workloads, so in general I'd advise people to leave SMT on unless they have verified that their specific workflow benefits from switching it off. (And I always encourage people to spend the time benchmarking their specific workflows, and, indeed, to share their experiences on this forum!)
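The "benchmark your specific workflow" advice above can be sketched as a minimal timing harness: run the same command a few times and keep the best wall time, once with SMT on and once with it off. This is my own sketch, not the poster's actual script; the `orca job.inp` target mentioned in the comment is a hypothetical stand-in for your real workload.

```shell
#!/bin/sh
# bench CMD...: run a command 3 times and print the best wall time in seconds.
# Taking the minimum of several runs reduces noise from caching and background
# activity; repeat the whole comparison with SMT on and with SMT off.
bench() {
    best=""
    for i in 1 2 3; do
        start=$(date +%s.%N)
        "$@" > /dev/null 2>&1
        end=$(date +%s.%N)
        t=$(awk -v a="$start" -v b="$end" 'BEGIN{printf "%.3f", b - a}')
        if [ -z "$best" ] || awk -v t="$t" -v b="$best" 'BEGIN{exit !(t < b)}'; then
            best=$t
        fi
    done
    echo "best wall time: ${best}s"
}

# Demo with a trivial stand-in workload; in practice this would be something
# like `bench orca job.inp` (hypothetical input file).
bench sleep 0.2
```

The same harness can be pointed at any single-command workload, so the SMT on/off comparison stays apples-to-apples.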
David_R posted on 2025-8-9 01:24 Many thanks for the follow-up. I absolutely agree that, when you have large compute nodes, it is fruitful to split the resource and run a number of quantum chemistry jobs simultaneously. Interestingly, I believe this is not quite the case for many other scenarios, such as computational fluid dynamics (CFD) or finite-element electromagnetic simulations, where a loss of performance can become apparent when running multiple calculations concurrently. On SMT: our HPC clusters have HT fully disabled (in line with standard industry practice). I also had a quick look at this on my home server (AMD EPYC Rome): the CPU ran at the same maximum turbo clock speed (in Cinebench R23, CPU-Z, and AIDA64) regardless of the SMT setting. However, I ran a computationally intensive (RI-)DLPNO-CCSD(T1)/def2-QZVP job across all physical cores using the same input file, and observed a 3% reduction in wall time with SMT off. The results were reproducible.
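For anyone wanting to reproduce this SMT-on/SMT-off comparison without a trip through the BIOS: on reasonably recent Linux kernels the SMT state can be inspected (and, as root, toggled) at runtime through sysfs. The sysfs path below is the standard kernel control interface, though whether it is exposed depends on your kernel and platform; this is a sketch, not part of the original poster's procedure.

```shell
#!/bin/sh
# Report the current SMT state via the kernel's sysfs control file.
# Typical values: on, off, forceoff, notsupported, notimplemented.
smt_state() {
    f=/sys/devices/system/cpu/smt/control
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown (no $f on this kernel)"
    fi
}

smt_state
# To disable SMT at runtime (root required; the sibling threads go offline
# and come back when you write "on" again):
#   echo off > /sys/devices/system/cpu/smt/control
```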
LetsQu1t posted on 2025-8-2 17:52 Thank you for your comment! These are good points to discuss. When running the comparisons with and without SMT enabled, utilisation of all logical cores (twice the number of physical cores) was achieved by running multiple simultaneous ORCA calculations, each parallelised with nprocs 4, so the Linux scheduler handles the workload without ORCA throwing any errors. My investigations took me down a path of extracting more performance from the hardware by committing more simultaneous calculations, each parallelised over fewer CPU cores, especially since, as you noted, individual calculations don't tend to parallelise efficiently beyond 16-32 cores (regardless of whether SMT is enabled). My two-fold hypothesis on why this strategy works is as follows:

1. Continued advancements in the efficiency of SMT-aware scheduling in Linux have ironed out a lot of the issues with committing tasks according to logical cores rather than physical cores. Still, in line with what Tian Lu notes in his blog post, the performance increase (if there is one) is fairly marginal.

2. CPU core clocks have increased considerably in these high-core-count chips, and they boost reliably under the all-core workloads relevant to computational chemistry. My suspicion is that, because the CPUs operate at higher frequencies, bottlenecks in parallelisation overhead are exposed, meaning that optimal parallelisation efficiency occurs at a lower number of CPU cores per calculation.

This is also supported by some of my more qualitative observations using older-generation AMD EPYC 7002 and 7003 hardware, as well as the newer EPYC Turin 9005 chips. Lastly, other bottlenecks associated with committing a large number of tasks according to the number of logical cores are mitigated here, as these systems have adequate RAM and robust cooling/power delivery. Nonetheless, these may be valid factors for others to take into consideration.
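The "many small jobs" strategy described above can be sketched as a simple job pool: launch independent calculations N at a time with `xargs -P`. The `%pal nprocs 4 end` line mentioned in the comment is standard ORCA input syntax for requesting 4 parallel processes per job; the job names, the `echo` stand-in, and the idea of sizing the pool as (physical cores)/4 are my assumptions for illustration, not the poster's exact setup.

```shell
#!/bin/sh
# Run many independent jobs concurrently, JOBS_AT_ONCE at a time.
# Each ORCA input would contain a line like:
#   %pal nprocs 4 end
# so that (pool size) x 4 roughly matches the number of physical cores.
JOBS_AT_ONCE=4

run_pool() {
    # Reads job names on stdin, one per line, and runs up to JOBS_AT_ONCE
    # of them in parallel. Here `echo` stands in for a real launch such as
    # `orca {}.inp > {}.out` (hypothetical file names); the fixed job names
    # below make the {} substitution into `sh -c` safe.
    xargs -P "$JOBS_AT_ONCE" -I{} sh -c 'echo "finished {}"'
}

printf '%s\n' job01 job02 job03 job04 job05 job06 | run_pool
```

GNU parallel or a batch scheduler's array jobs would do the same thing with more control; `xargs -P` is just the lowest-dependency way to get a fixed-width pool.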
|
This post was last edited by LetsQu1t on 2025-8-2 09:58. Hi David, Many thanks for sharing this work. I just have a few quick comments on hyperthreading and parallelisation efficiency. I'm curious that, when SMT was disabled, requesting more CPUs than the number of physical cores still resulted in normal termination of your calculation. To my knowledge, this should cause the ORCA job to abort immediately, as the system would never be able to locate the extra cores (those exceeding the number of physical cores, which with SMT off equals the number of logical threads). Meanwhile, you concluded that maximum efficiency was achieved by enabling SMT and specifying all logical threads in the input file; this, however, is widely regarded as poor practice. Tian Lu has posted a full thread about this (see http://sobereva.com/392): when the number of cores requested exceeds the number of physical cores (with HT/SMT on), performance plunged without exception, as you can tell from his numbers. In addition, disabling SMT can improve performance if it raises the all-core turbo clock speed (common on older Intel chips, and also when the motherboard's power delivery is not robust). The fact that you started to observe a pronounced efficiency loss when requesting more than 24 cores was also surprising. A 58-atom closed-shell system with RI-DFT and a double-zeta basis set constitutes a reasonably sized, not particularly small, job, and I was expecting the turning point at a number greater than 24. Sincerely
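On the physical-versus-logical core point raised here: before setting nprocs it is worth checking both counts explicitly, since with SMT on, tools like `nproc` report logical threads. A Linux-only sketch using the kernel's sysfs topology files (standard interface; the fallback branch is my own defensive assumption for environments that hide the topology):

```shell
#!/bin/sh
# Count logical CPUs vs unique physical cores on Linux.
logical=$(getconf _NPROCESSORS_ONLN)

if [ -d /sys/devices/system/cpu/cpu0/topology ]; then
    # A physical core is a unique (package_id, core_id) pair.
    physical=$(
        for t in /sys/devices/system/cpu/cpu[0-9]*/topology; do
            echo "$(cat "$t/physical_package_id"):$(cat "$t/core_id")"
        done | sort -u | wc -l
    )
else
    physical=$logical   # topology not exposed (e.g. some containers)
fi

echo "logical=$logical physical=$physical"
# With SMT off, logical == physical, which is why requesting more processes
# than physical cores would be expected to leave the job with no CPUs to use.
```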
含光君 posted on 2025-5-8 16:21 My pleasure! To make the charts, I use various combinations of Python, OriginPro, MS Excel, and Adobe Illustrator. Most of the 'custom' modifications are done in Adobe Illustrator, as it is powerful vector graphics software. I would expect any open-source vector graphics alternative to work equally well.
| Participants: 1 | eV | Reason |
|---|---|---|
| | +5 | Thanks |
|
Thanks for sharing! Very inspiring and helpful work! The charts in this report look quite nice; could you please share how you plotted them? Using Python or other applications?
计算化学公社 (Computational Chemistry Commune) | Beijing Kein Research Center for Natural Sciences (北京科音自然科学研究中心)