|
LetsQu1t posted on 2025-8-16 09:41 These are good points to make. Indeed, such considerations can depend on one's specific computational workflows. Your experience with SMT and intensive workflows that use most of the CPU's computing power sounds typical, and is in line with my findings: the dual-socket EPYC 9654 system also saw a modest performance increase with SMT off in this scenario. However, I saw the opposite (and to a greater extent) across a wider range of workloads, so in general I'd advise people to leave SMT on unless they have verified that their specific workflow benefits from switching it off. (And I always encourage people to spend the time benchmarking their specific workflows, and, indeed, to share their experiences on this forum!)
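The "benchmark your specific workflow" advice above can be sketched as a minimal timing harness: run the same command a few times and keep the best wall time, once with SMT on and once with it off. This is my own sketch, not the poster's actual script; the `orca job.inp` target mentioned in the comment is a hypothetical stand-in for your real workload.

```shell
#!/bin/sh
# bench CMD...: run a command 3 times and print the best wall time in seconds.
# Taking the minimum of several runs reduces noise from caching and background
# activity; repeat the whole comparison with SMT on and with SMT off.
bench() {
    best=""
    for i in 1 2 3; do
        start=$(date +%s.%N)
        "$@" > /dev/null 2>&1
        end=$(date +%s.%N)
        t=$(awk -v a="$start" -v b="$end" 'BEGIN{printf "%.3f", b - a}')
        if [ -z "$best" ] || awk -v t="$t" -v b="$best" 'BEGIN{exit !(t < b)}'; then
            best=$t
        fi
    done
    echo "best wall time: ${best}s"
}

# Demo with a trivial stand-in workload; in practice this would be something
# like `bench orca job.inp` (hypothetical input file).
bench sleep 0.2
```

The same harness can be pointed at any single-command workload, so the SMT on/off comparison stays apples-to-apples.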
David_R posted on 2025-8-9 01:24 Many thanks for the follow-up. I absolutely agree that, when you have large compute nodes, it is fruitful to split the resource and run a number of quantum chemistry jobs simultaneously. Interestingly, I believe this is not quite the case for many other scenarios, such as computational fluid dynamics (CFD) or finite-element electromagnetic simulations, where a loss of performance can become apparent when running multiple calculations concurrently. On SMT: our HPC clusters have HT fully disabled (in line with standard industry practice). I also had a quick look at this on my home server (AMD EPYC Rome): the CPU ran at the same maximum turbo clock speed (in Cinebench R23, CPU-Z, and AIDA64) regardless of the SMT setting. However, I ran a computationally intensive (RI-)DLPNO-CCSD(T1)/def2-QZVP job across all physical cores using the same input file, and observed a 3% reduction in wall time with SMT off. The results were reproducible.
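For anyone wanting to reproduce this SMT-on/SMT-off comparison without a trip through the BIOS: on reasonably recent Linux kernels the SMT state can be inspected (and, as root, toggled) at runtime through sysfs. The sysfs path below is the standard kernel control interface, though whether it is exposed depends on your kernel and platform; this is a sketch, not part of the original poster's procedure.

```shell
#!/bin/sh
# Report the current SMT state via the kernel's sysfs control file.
# Typical values: on, off, forceoff, notsupported, notimplemented.
smt_state() {
    f=/sys/devices/system/cpu/smt/control
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown (no $f on this kernel)"
    fi
}

smt_state
# To disable SMT at runtime (root required; the sibling threads go offline
# and come back when you write "on" again):
#   echo off > /sys/devices/system/cpu/smt/control
```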
LetsQu1t posted on 2025-8-2 17:52 Thank you for your comment! These are good points to discuss. When running the comparisons with and without SMT enabled, utilisation of all logical cores (twice the number of physical cores) was achieved by running multiple simultaneous ORCA calculations, each parallelised with nprocs 4, so the Linux scheduler handles the workload without ORCA throwing any errors. My investigations took me down a path of extracting more performance from the hardware by committing more simultaneous calculations, each parallelised over fewer CPU cores, especially since, as you noted, individual calculations don't tend to parallelise efficiently beyond 16-32 cores (regardless of whether SMT is enabled). My two-fold hypothesis on why this strategy works is as follows:

1. Continued advancements in the efficiency of SMT-aware scheduling in Linux have ironed out a lot of the issues with committing tasks according to logical cores rather than physical cores. Still, in line with what Tian Lu notes in his blog post, the performance increase (if there is one) is fairly marginal.

2. CPU core clocks have increased considerably in these high-core-count chips, and they boost reliably under the all-core workloads relevant to computational chemistry. My suspicion is that, because the CPUs operate at higher frequencies, bottlenecks in parallelisation overhead are exposed, meaning that optimal parallelisation efficiency occurs at a lower number of CPU cores per calculation.

This is also supported by some of my more qualitative observations using older-generation AMD EPYC 7002 and 7003 hardware, as well as the newer EPYC Turin 9005 chips. Lastly, other bottlenecks associated with committing a large number of tasks according to the number of logical cores are mitigated here, as these systems have adequate RAM and robust cooling/power delivery. Nonetheless, these may be valid factors for others to take into consideration.
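The "many small jobs" strategy described above can be sketched as a simple job pool: launch independent calculations N at a time with `xargs -P`. The `%pal nprocs 4 end` line mentioned in the comment is standard ORCA input syntax for requesting 4 parallel processes per job; the job names, the `echo` stand-in, and the idea of sizing the pool as (physical cores)/4 are my assumptions for illustration, not the poster's exact setup.

```shell
#!/bin/sh
# Run many independent jobs concurrently, JOBS_AT_ONCE at a time.
# Each ORCA input would contain a line like:
#   %pal nprocs 4 end
# so that (pool size) x 4 roughly matches the number of physical cores.
JOBS_AT_ONCE=4

run_pool() {
    # Reads job names on stdin, one per line, and runs up to JOBS_AT_ONCE
    # of them in parallel. Here `echo` stands in for a real launch such as
    # `orca {}.inp > {}.out` (hypothetical file names); the fixed job names
    # below make the {} substitution into `sh -c` safe.
    xargs -P "$JOBS_AT_ONCE" -I{} sh -c 'echo "finished {}"'
}

printf '%s\n' job01 job02 job03 job04 job05 job06 | run_pool
```

GNU parallel or a batch scheduler's array jobs would do the same thing with more control; `xargs -P` is just the lowest-dependency way to get a fixed-width pool.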
|
This post was last edited by LetsQu1t on 2025-8-2 09:58. Hi David, Many thanks for sharing this work. I just have a few quick comments on hyperthreading and parallelisation efficiency. I'm curious that, when SMT was disabled, requesting more CPUs than the number of physical cores still resulted in normal termination of your calculation. To my knowledge, this should cause the ORCA job to abort immediately, as the system would never be able to locate the extra cores (those exceeding the number of physical cores, which with SMT off equals the number of logical threads). Meanwhile, you concluded that maximum efficiency was achieved by enabling SMT and specifying all logical threads in the input file; this, however, is widely regarded as poor practice. Tian Lu has posted a full thread about this (see http://sobereva.com/392): when the number of cores requested exceeds the number of physical cores (with HT/SMT on), performance plunged without exception, as you can tell from his numbers. In addition, disabling SMT can improve performance if it raises the all-core turbo clock speed (common on older Intel chips, and also when the motherboard's power delivery is not robust). The fact that you started to observe a pronounced efficiency loss when requesting more than 24 cores was also surprising. A 58-atom closed-shell system with RI-DFT and a double-zeta basis set constitutes a reasonably sized, not particularly small, job, and I was expecting the turning point at a number greater than 24. Sincerely
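On the physical-versus-logical core point raised here: before setting nprocs it is worth checking both counts explicitly, since with SMT on, tools like `nproc` report logical threads. A Linux-only sketch using the kernel's sysfs topology files (standard interface; the fallback branch is my own defensive assumption for environments that hide the topology):

```shell
#!/bin/sh
# Count logical CPUs vs unique physical cores on Linux.
logical=$(getconf _NPROCESSORS_ONLN)

if [ -d /sys/devices/system/cpu/cpu0/topology ]; then
    # A physical core is a unique (package_id, core_id) pair.
    physical=$(
        for t in /sys/devices/system/cpu/cpu[0-9]*/topology; do
            echo "$(cat "$t/physical_package_id"):$(cat "$t/core_id")"
        done | sort -u | wc -l
    )
else
    physical=$logical   # topology not exposed (e.g. some containers)
fi

echo "logical=$logical physical=$physical"
# With SMT off, logical == physical, which is why requesting more processes
# than physical cores would be expected to leave the job with no CPUs to use.
```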
含光君 posted on 2025-5-8 16:21 My pleasure! To make the charts, I use various combinations of Python, OriginPro, MS Excel, and Adobe Illustrator. Most of the 'custom' modifications are done in Adobe Illustrator, as it is powerful vector graphics software. I would expect any open-source vector graphics alternative to work equally well.
| Participants: 1 | eV | Reason |
|---|---|---|
| | +5 | Thanks |
|
Thanks for sharing! Very inspiring and helpful work! The charts in this report look quite nice; could you please share how you plotted them? Using Python or other applications?
计算化学公社 (Computational Chemistry Commune) | Beijing Kein Research Center for Natural Sciences (北京科音自然科学研究中心)