计算化学公社 (Computational Chemistry Commune)

Title: ORCA 6 Benchmarking: 7980X vs. 2 × EPYC 9654QS

Author: David_R    Time: 2025-5-8 15:26
Title: ORCA 6 Benchmarking: 7980X vs. 2 × EPYC 9654QS
This post was last edited by David_R at 2025-5-9 11:50.

Hello all,

In this post I’d like to share my ORCA 6 benchmarks on the two computer systems I use to pursue my various scientific endeavours. This report was originally written in English. I strongly considered presenting it in Chinese with the help of translation software (admittedly, my Chinese is somewhat limited), but I was worried that some of the intended meaning would be garbled without my being able to verify it. I hope that written English is admissible here; if not, I am happy to edit the post to add a Chinese translation.

1. Introduction
These benchmarks were carried out for my own purposes, so they are somewhat specific to my particular use-case, but I hope that others will find them more generally enlightening, alongside the benchmarks already shared on this forum. The aim was to find optimal settings among a number of selected hardware- and software-based parameters, and to compare the server-grade EPYC and HEDT Threadripper platforms. It is certainly not a comprehensive exploration of all available parameters. Some parameters were selected for a more systematic and rigorous investigation, while others, which I tested in a more cursory or qualitative manner, are mentioned where appropriate.

I’ve tried my best to format and structure this report so that it can be easily digested. Nonetheless, if there are any ambiguities or mistakes I’ve overlooked, I shall be happy to attend to them. In some areas, a full interpretation of the results probably extends beyond my knowledge of computer hardware; in those cases, in the spirit of the phrase ‘抛砖引玉’ (throwing out a brick to attract jade), I hope that others might offer their expertise.

2. Hardware and Software Specifications

2.1. Threadripper 7980X Workstation
CPU: AMD Ryzen Threadripper 7980X: 64 cores/128 threads (running overclocked @ 4.9 GHz all-core)
RAM: 256 GB (4 × 64 GB) V-color DDR5 ECC RDIMM (running overclocked at 6000 MT/s)
GPU: NVIDIA RTX 4090
Motherboard: Asus Pro WS TRX50-SAGE
Storage: 3 × 2 TB Samsung 990 Pro
PSU: Seasonic Prime TX-1600
CPU cooler: Silverstone XE360-TR5
Case: Silverstone RM52

2.2. 2 × EPYC 9654 Server
CPU: 2 × EPYC 9654QS: each 96 cores/192 threads @ 3.5 GHz all-core
RAM: 1.15 TB (24 × 48 GB) SK Hynix DDR5 ECC RDIMM @ 4800 MT/s
Motherboard: Gigabyte MZ73-LM1 rev 3.2
Storage: 4 TB Samsung 990 Pro
PSU: Seasonic Prime TX-1600
CPU cooler: Silverstone XE360PDD
Case: Silverstone RM52

2.3. Software (same for both systems)
Operating System: Ubuntu Desktop 24.04.2 LTS
Hardware monitoring: htop, psensor
Additional details: ORCA 6.0.1 with OpenMPI 5.0.7. Bash scripts were used to automate running the calculations, and Python scripts to parse the output files for timings and evaluate the statistics thereof.
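
As an illustration of the parsing step, here is a minimal Python sketch (not the actual script used here) that assumes ORCA's standard 'TOTAL RUN TIME' line at the end of each output file, and a hypothetical results/ directory of .out files:

import re
import statistics
from pathlib import Path

# Matches ORCA's final timing line, e.g.
# "TOTAL RUN TIME: 0 days 0 hours 5 minutes 23 seconds 456 msec"
TIME_RE = re.compile(
    r"TOTAL RUN TIME:\s*(\d+) days (\d+) hours (\d+) minutes "
    r"(\d+) seconds (\d+) msec"
)

def total_seconds(output_file: Path) -> float:
    """Extract the total wall time, in seconds, from one ORCA output file."""
    match = TIME_RE.search(output_file.read_text())
    if match is None:
        raise ValueError(f"No timing found in {output_file}")
    d, h, m, s, ms = map(int, match.groups())
    return d * 86400 + h * 3600 + m * 60 + s + ms / 1000

# Summarise the timings collected from a batch of output files
# (statistics.stdev needs at least two values).
timings = [total_seconds(f) for f in Path("results").glob("*.out")]
print(f"mean = {statistics.mean(timings):.1f} s, "
      f"stdev = {statistics.stdev(timings):.2f} s")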

3. Notes on Hardware
I suspect that most here will be familiar with the popular EPYC 9654 system; there are numerous detailed forum posts on the topic of this CPU, and I reported on my experience with it in a previous forum post. This system offers excellent price-to-performance for quantum chemistry, more so with the considerably discounted QS variant.

AMD’s HEDT Threadripper platform is rather more niche. The idea behind this platform is to offer higher core counts, more PCIe connectivity, and greater memory bandwidth than consumer Ryzen CPUs, while retaining overclocking capabilities, higher core frequencies, and faster memory speeds than the equivalent Zen 4 server-grade EPYC chips.

I spent some time overclocking the 7980X system, tweaking the settings for performance and stability. The system has since run entirely stably at 4.8 – 4.9 GHz on all 64 CPU cores for several months. Moreover, because the programmed Fmax for one CCD is higher than that of the others, CPU frequencies boost to 5.5 GHz when fewer than 8 cores are active.

Here is a summary of the various overclocking settings applied:


The impressive overclocked performance of the 7980X system is achieved by supplying the CPU with an enormous amount of power, often exceeding 800 W. Thus, a powerful water cooler with a large radiator is essential.

Under full load, the EPYC 9654 system has no trouble keeping temperatures under control. The CPU temperature tends to stay around 65 – 70 °C. Meanwhile, the 7980X system runs considerably hotter, with CCD temperatures between 80 and 95 °C.

In the presented benchmarking, simultaneous multithreading (SMT) is enabled, unless otherwise specified (the effect of SMT is explored in the latter part of the post).

4. Overview of the Benchmark
The primary motivation for the chosen methodology is the need to run very large batches (10^4 – 10^6) of DFT calculations for training models, such as neural network potentials for MD, or predictive/generative models of empirical utility. My view is that the use of predictive models and data-driven approaches to development and discovery in chemical synthesis holds great promise for the field, and large-scale quantum chemical calculations will doubtless play an important role. Thus, the results are presented as ‘throughput’ in h⁻¹; that is, the number of calculations one could run per hour if successive calculations were run back-to-back continuously. Having used both of these computer systems for research projects already, I can attest that these benchmarks faithfully represent system performance under ‘real-world’ use.
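
To make the throughput figure concrete: it is the number of concurrent jobs multiplied by 3600 s, divided by the average wall time per calculation in seconds. A trivial sketch, with made-up numbers rather than measured values:

# Illustrative only: how 'throughput' (calculations per hour) is defined.
def throughput_per_hour(n_concurrent_jobs: int, avg_wall_time_s: float) -> float:
    # Calculations completed per hour when jobs run back-to-back continuously.
    return n_concurrent_jobs * 3600.0 / avg_wall_time_s

print(throughput_per_hour(12, 900.0))  # 12 jobs averaging 15 min each -> 48.0 h^-1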

A key question I wanted to answer was how best to utilise all available CPU cores, by balancing the number of CPU cores each calculation is parallelised over (using the nprocs keyword in ORCA’s %pal block) against running multiple calculations simultaneously as concurrent ‘jobs’.

One might also be interested in the most efficient approach to carrying out smaller-scale studies or even just individual calculations in the shortest amount of time, so the present systems are studied in this capacity too.

The benchmarking presented here comprises single-point DFT calculations on an optimized geometry of a simple organotransition metal compound: (dppbz)CuH (pictured below). Two functionals were tested; namely, the range-separated hybrid ωB97X, and the pure functional PBE. Furthermore, comparisons are made across def2-SVP and def2-TZVP basis sets.

[Figure: optimised structure of (dppbz)CuH]

I spent some time investigating the reproducibility of the results and found them to be highly consistent: standard deviations were less than one second, with total calculation timings typically between 1 and 45 minutes. All results presented are averages over the collected individual timings. Simultaneous jobs were initialised with a 0.1 second delay between them; increasing this delay to 30 seconds did not significantly change the timings.
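
The batch launching was done with Bash scripts, as noted above; the following is a minimal Python sketch of the same idea, with hypothetical paths (note that ORCA generally requires the full path to its binary for parallel runs):

import subprocess
import time
from pathlib import Path

ORCA = "/opt/orca/orca"                 # hypothetical install location
inputs = sorted(Path("batch").glob("*.inp"))

procs = []
for inp in inputs:
    fh = open(inp.with_suffix(".out"), "w")
    procs.append(subprocess.Popen([ORCA, str(inp)], stdout=fh))
    fh.close()                          # the child keeps its own file descriptor
    time.sleep(0.1)                     # 0.1 s stagger between job launches

for p in procs:                         # wait for every job to finish
    p.wait()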

Here is an example input script:
! RIJCOSX wB97X MINIPRINT   # wB97X functional with the RIJCOSX approximation; minimal printing

%pal nprocs 4 end           # number of parallel processes for this calculation

%basis
Basis "def2-TZVP"           # orbital basis set
Aux "def2/J"                # auxiliary basis for the RIJ approximation
end

* xyzfile 0 1 GEO.xyz       # charge 0, multiplicity 1, geometry read from GEO.xyz

...and here is the geometry in GEO.xyz:
58
Coordinates from ORCA-job
  Cu          2.46973002972452     -0.18601739044056      0.04232875648838
  C          -0.64363913555159     -0.97384561595684     -0.88211909464252
  C          -1.62045677492391     -1.57356613363719     -1.69479681390466
  C          -2.62861064458259     -2.36525177869295     -1.13939160073426
  C          -2.66742741767769     -2.57603031053721      0.24197794608297
  C          -1.69865533688148     -1.99334312422015      1.06276593858960
  C          -0.68411432605112     -1.18621532788858      0.52077293903823
  H          -1.58405488995610     -1.43143070510902     -2.77734581040532
  H          -3.37760021230677     -2.82694533364111     -1.78850655384471
  H          -3.44685247080927     -3.20413117176586      0.68185565008376
  H          -1.72340199901762     -2.18003195526747      2.13888713946339
  P           0.76331812454317      0.02511474857812     -1.57656893092477
  P           0.66808905554658     -0.44033697872267      1.55737850489893
  C           0.07027521207012      1.71758452155321     -1.82468507906642
  C          -1.29059788908241      2.04029583164644     -1.69532264239153
  C          -1.72741200360080      3.35678686265136     -1.88035380204103
  C          -0.81295444353836      4.36434909072420     -2.19891778216789
  C           0.54562680566646      4.05342157908296     -2.32559401896909
  C           0.98597747702343      2.74234787712552     -2.13261515084648
  H          -2.01689964984798      1.26284342751882     -1.44819727093619
  H          -2.78977562759455      3.59356913149549     -1.77514931583574
  H          -1.15657941853224      5.39225195613550     -2.34288559258041
  H           1.26840737214768      4.83740206958434     -2.56701562498220
  H           2.05162779696117      2.51026008695776     -2.21735508687712
  C           0.95914382935440     -0.62563790246834     -3.29429774766077
  C           1.86124565969842     -1.68859439323831     -3.47961989366908
  C           2.05998798780848     -2.23577711880120     -4.74985597418974
  C           1.37446780925027     -1.71626516451124     -5.85310150449239
  C           0.48978312792667     -0.64688294601580     -5.68126863172323
  C           0.28161973138268     -0.10314694683018     -4.40949201679897
  H           2.42349826175845     -2.07511997661121     -2.62448463361414
  H           2.76448555104436     -3.06165039647633     -4.87992812408805
  H           1.53770714659285     -2.13791858149570     -6.84865123830882
  H          -0.04052345493023     -0.23017556744675     -6.54208536696866
  H          -0.40592565735296      0.73751361732364     -4.28815381296754
  C          -0.06339768157169      1.10862534045831      2.24555004431524
  C          -1.42958448350428      1.43073751041558      2.18854526394961
  C          -1.89626130851997      2.63637046048158      2.72376030371323
  C          -1.00649286246590      3.53222548243676      3.32324484835761
  C           0.35675949219332      3.22219993254359      3.38083730417821
  C           0.82683750789704      2.02339485477321      2.83999469377563
  H          -2.13648757845930      0.73910254484218      1.72475680631803
  H          -2.96221256573466      2.87473897175397      2.67137636386630
  H          -1.37338453790757      4.47404072402123      3.74017453949608
  H           1.06032906914228      3.92120762995246      3.84087251542926
  H           1.89605417051905      1.79471673005399      2.87415056416288
  C           0.77494092775765     -1.56446325647194      3.01944140939291
  C           0.02333947380930     -1.39237832209352      4.19472766797217
  C           0.16778310954606     -2.28384900025092      5.26281948932466
  C           1.06228134615066     -3.35482173344512      5.16943203371828
  C           1.82137664628147     -3.52739334281143      4.00710725975150
  C           1.68596864095113     -2.63321654629609      2.94195260870045
  H          -0.67266188046333     -0.55447160762770      4.28166626618063
  H          -0.41993204979187     -2.13803927782552      6.17339490220428
  H           1.17568338680578     -4.04870790383006      6.00682102759451
  H           2.53376576021073     -4.35347390248122      3.93375406977450
  H           2.30502115678396     -2.75144925368267      2.04777616706077
  H           4.01076163410810     -0.33052301551929      0.07963809174976

5. Results
5.1. The effect of increasing the number of CPU cores for a single job
My initial investigation focused on the number of CPU cores requested (via the nprocs keyword) for a single calculation. Four different levels of theory were tested: ωB97X/def2-TZVP, ωB97X/def2-SVP, PBE/def2-TZVP, and PBE/def2-SVP. The results are presented in the four graphs below; throughout this post, data for the 7980X system are coloured red, and data for the 2 × 9654 system are coloured blue. To visualise the parallelisation efficiency, dashed lines of the corresponding colours indicate the theoretical linear increase in performance corresponding to an ‘embarrassingly parallel’ relationship.
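
For reference, the dashed lines follow the usual definitions of parallel speedup and efficiency; a minimal sketch of these definitions (my own formulation, not something taken from ORCA):

# speedup S(N) = t(1) / t(N); parallel efficiency E(N) = S(N) / N.
# An 'embarrassingly parallel' workload has S(N) = N, i.e. E(N) = 1.
def speedup(t_serial: float, t_parallel: float) -> float:
    return t_serial / t_parallel

def efficiency(t_serial: float, t_parallel: float, n_cores: int) -> float:
    return speedup(t_serial, t_parallel) / n_cores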

[Figure: single-job throughput vs. number of CPU cores for the four levels of theory; dashed lines indicate ideal linear scaling. 7980X in red, 2 × 9654 in blue.]

Some key observations from the obtained data:

The difference in timings for an individual calculation on the two systems is largely attributable to the difference in CPU clock frequencies (4.9 vs. 3.5 GHz for the 7980X and 9654QS chips, respectively, a ratio of 1.4:1). However, the 7980X system attains a greater speedup for the less computationally expensive methods, indicating that these are more strongly influenced by CPU frequency. A cursory test of the effect of RAM frequency and NVMe drive speed showed negligible differences in all cases, suggesting that internal CPU design may play a role in the observed trends.

5.2. The effect of increasing the number of simultaneous jobs
So, instead of running individual calculations and increasing the number of CPU cores, for which the performance increase deviates from linearity rather quickly, what if we run multiple calculations concurrently, each with nprocs = 4? For an initial investigation, the ωB97X/def2-TZVP level of theory was used; the results are presented in the graph below:

[Figure: throughput vs. number of simultaneous jobs (nprocs = 4), ωB97X/def2-TZVP]

Some key observations from this study:


5.3. Balancing number of CPU cores vs. number of simultaneous jobs
As demonstrated above, an approximately linear increase in throughput is attained when increasing the number of jobs (each with nprocs = 4), but what about applying the same strategy with different numbers of CPU cores per job?
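
As a quick illustration of the trade-off (assuming, hypothetically, that the jobs exactly fill the 7980X's 64 physical cores):

# Illustrative: (nprocs, simultaneous jobs) pairs that fill 64 cores.
TOTAL_CORES = 64
for nprocs in (4, 8, 16, 32):
    print(f"nprocs = {nprocs:2d} -> {TOTAL_CORES // nprocs} simultaneous jobs")
# nprocs =  4 -> 16 simultaneous jobs
# nprocs =  8 -> 8 simultaneous jobs
# nprocs = 16 -> 4 simultaneous jobs
# nprocs = 32 -> 2 simultaneous jobs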

I encountered a strange problem when attempting this with nprocs = 2: the calculations would not run simultaneously. I made a half-hearted attempt to troubleshoot this, but gave up and decided not to pursue it further. Thus, I present results with nprocs = 4, 8, 16, and 32, for both the ωB97X/def2-TZVP and PBE/def2-SVP levels of theory, in the graphs below:

[Figure: throughput vs. number of simultaneous jobs for nprocs = 4, 8, 16, and 32, at the ωB97X/def2-TZVP and PBE/def2-SVP levels]

Some prevailing observations:

5.4. The effect of simultaneous multithreading (SMT)
ORCA does not natively make use of SMT technology. I have heard some people advocate switching it off in the BIOS, but this advice dates from quite some time ago and is likely hardware-dependent, so I went about testing it myself.

The effect of enabling or disabling SMT on the timings of individual calculations (with nprocs = 4 and 32, at the ωB97X/def2-TZVP and PBE/def2-SVP levels of theory) is shown in the bar chart below. Note that, for clarity, I’ve plotted the data here as calculation time, so lower is better.

[Bar chart: individual calculation times with SMT enabled vs. disabled (nprocs = 4 and 32; ωB97X/def2-TZVP and PBE/def2-SVP); lower is better]

In almost all cases, there is no discernible difference whether SMT is enabled or not. A notable exception is the PBE/def2-SVP calculation using 32 CPU cores, for which turning off SMT results in a slight performance decrease for the 7980X system, and a slight increase for the 2 × 9654 system.

Next, we turn our attention to the effect of SMT on running multiple jobs. The obtained throughput data is plotted in the bar charts below (higher is better).

[Bar charts: throughput for multiple simultaneous jobs with SMT enabled vs. disabled; higher is better]

A few notable insights:

6. Conclusions
For these two Zen 4 AMD CPUs, the best general predictors of performance are CPU core count and clock frequency, with other potential performance bottlenecks apparently less influential. This results in a performance ratio of approximately 2:1 for the 2 × 9654 system over the 7980X. Given that the two systems cost approximately the same (ignoring the price of the GPU, which is not used here), the dual-socket EPYC configuration offers significantly better price-to-performance for ORCA calculations. However, in my case, I use the 7980X system for a wide range of other workflows, especially those which utilise GPUs or are lightly threaded, and thus benefit from the higher clock speeds.

A summary of the main findings:

Thank you all for reading!


EDIT: Fixed various formatting and typographical errors.
Author: 含光君    Time: 2025-5-8 16:21
Thanks for sharing! Very inspiring and helpful work!

The charts in this report look quite nice; could you please share how you plotted them? Using Python or other applications?
Author: David_R    Time: 2025-5-8 16:58
含光君 posted on 2025-5-8 16:21:
Thanks for sharing! Very inspiring and helpful work!

The charts in this report look quite nice, c ...

It is my pleasure!

To make charts, I use various combinations of Python, Origin Pro, MS Excel, and Adobe Illustrator. Most of the 'custom' modifications are done in Adobe Illustrator as it is powerful vector graphics software. I would expect that any open source alternative for vector graphics software should also work well.
Author: LetsQu1t    Time: 2025-8-2 17:52
This post was last edited by LetsQu1t at 2025-8-2 09:58.

Hi David,

  Many thanks for sharing this work. I just have a few quick comments on hyperthreading and parallelisation efficiency.
  I'm curious that, when SMT was disabled, requesting more CPUs than the number of physical cores still resulted in normal termination of your calculations. To my knowledge, this should lead to an instant abort of the ORCA job, as the system would never be able to locate the extra cores (those exceeding the number of physical cores, which then equals the number of logical threads).
  Meanwhile, you conclude that maximum efficiency was achieved by enabling SMT and specifying all threads in the input file. This, however, is widely regarded as poor practice. Tian Lu has posted a full thread about this (see http://sobereva.com/392). When the number of cores requested exceeds the number of physical cores (with HT/SMT), performance plunged without exception, as you can tell from his numbers. In addition, disabling SMT can result in a performance improvement if it boosts the all-core turbo clock speed (common for old Intel chips, and also where the motherboard's power delivery is not robust).
  The fact that you started to observe pronounced efficiency loss when requesting more than 24 cores was also mind-blowing. A 58-atom closed-shell system with RI-DFT and a double-zeta basis set constitutes a reasonably sized, not particularly small, job, and I was expecting a turning point at a number greater than 24.

Sincerely
Author: David_R    Time: 3 days ago
LetsQu1t posted on 2025-8-2 17:52:
Hi David,

  Many thanks for sharing this work. I just want to have a few quick comments on hypert ...

Thank you for your comment! These are good points to discuss.

When running the comparisons with and without SMT enabled, the utilisation of all logical cores (twice the number of physical cores) was achieved by running multiple simultaneous ORCA calculations, parallelised with 4 nprocs each, so the Linux scheduler handles the workload without ORCA throwing any errors.

My investigations led me down the path of extracting more performance from the hardware by running more simultaneous calculations, each parallelised with fewer CPU cores, especially since, as you noted, individual calculations don't tend to parallelise efficiently beyond 16-32 cores (regardless of whether SMT is enabled). My two-fold hypothesis on the relevance of this strategy is as follows:

1. Continued advancements in the efficiency of scheduling with SMT and Linux have ironed out a lot of issues with committing tasks according to logical cores rather than physical cores. Still, in line with what Tian Lu notes in his blog post, the performance increase (if there is one) is fairly marginal.

2. CPU core clocks have increased considerably in these high-core-count chips, and they boost reliably under the all-core workloads relevant to computational chemistry. My suspicion is that because the CPUs operate at higher frequencies, bottlenecks in parallelisation overhead are exposed, meaning that optimal parallelisation efficiency occurs at a lower number of CPU cores per calculation. This is also corroborated by some of my more qualitative observations using older-generation AMD EPYC 7002 and 7003 hardware, as well as the newer EPYC Turin 9005 chips.

Lastly, other bottlenecks associated with committing a large number of tasks according to the number of logical cores are mitigated here, as these systems have adequate RAM and robust cooling/power delivery. Nonetheless, these might be valid factors for others to take into consideration.




Welcome to 计算化学公社 (http://bbs.keinsci.com/) Powered by Discuz! X3.3