计算化学公社 (Computational Chemistry Commune)

Title: ORCA 6 Benchmarking: 7980X vs. 2 × EPYC 9654QS

Author: David_R    Time: 2025-5-8 15:26
Title: ORCA 6 Benchmarking: 7980X vs. 2 × EPYC 9654QS
This post was last edited by David_R at 2025-5-9 11:50.

Hello all,

In this post I’d like to share my ORCA 6 benchmarks on the two computer systems I use to pursue my various scientific endeavours. This report was originally written in English. I strongly considered presenting it in Chinese with the help of translation software (admittedly, my Chinese is somewhat limited), but I was worried that some of the intended meaning would be garbled without my being able to verify it. I hope that written English is admissible here; if not, I am happy to edit the post to add a Chinese translation.

1. Introduction
These benchmarks were carried out for my own purposes, so they are somewhat specific to my particular use-case, but I hope that others will find them more generally enlightening, alongside the benchmarks already shared on this forum. The aim was to find optimal settings among a number of selected hardware- and software-based parameters, and to compare the server-grade EPYC and HEDT Threadripper platforms. It is certainly not a comprehensive exploration of all available parameters. Some parameters were selected for a more systematic and rigorous investigation, while others, which I tested in a more cursory or qualitative manner, are mentioned where appropriate.

I’ve tried my best to format and structure this report so that it can be easily digested. Nonetheless, if there are any ambiguities or mistakes I’ve overlooked, I shall be happy to attend to them. In some areas, a full interpretation of the results probably extends beyond my knowledge of computer hardware; in those cases, in the spirit of the phrase ‘抛砖引玉’ (throwing out a brick to attract jade), I hope that others might offer their expertise.

2. Hardware and Software Specifications

2.1. Threadripper 7980X Workstation
CPU: AMD Ryzen Threadripper 7980X: 64 cores/128 threads (running overclocked @ 4.9 GHz all-core)
RAM: 256 GB (4 × 64 GB) V-color DDR5 ECC RDIMM (running overclocked at 6000 MT/s)
GPU: NVIDIA RTX 4090
Motherboard: Asus Pro WS TRX50-SAGE
Storage: 3 × 2 TB Samsung 990 Pro
PSU: Seasonic Prime TX-1600
CPU cooler: Silverstone XE360-TR5
Case: Silverstone RM52

2.2. 2 × EPYC 9654 Server
CPU: 2 × EPYC 9654QS: each 96 cores/192 threads @ 3.5 GHz all-core
RAM: 1.15 TB (24 × 48 GB) SK Hynix DDR5 ECC RDIMM @ 4800 MT/s
Motherboard: Gigabyte MZ73-LM1 rev 3.2
Storage: 4 TB Samsung 990 Pro
PSU: Seasonic Prime TX-1600
CPU cooler: Silverstone XE360PDD
Case: Silverstone RM52

2.3. Software (same for both systems)
Operating System: Ubuntu Desktop 24.04.2 LTS
Hardware monitoring: htop, psensor
Additional details: ORCA 6.0.1 with OpenMPI 5.0.7. Bash scripts were used to automate running the calculations, and Python scripts to parse the output files for timings and evaluate the statistics thereof.
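
As an illustration of the parsing step, here is a minimal Python sketch (not the actual script used here) that assumes ORCA's standard 'TOTAL RUN TIME' line at the end of each output file, and a hypothetical results/ directory of .out files:

import re
import statistics
from pathlib import Path

# Matches ORCA's final timing line, e.g.
# "TOTAL RUN TIME: 0 days 0 hours 5 minutes 23 seconds 456 msec"
TIME_RE = re.compile(
    r"TOTAL RUN TIME:\s*(\d+) days (\d+) hours (\d+) minutes "
    r"(\d+) seconds (\d+) msec"
)

def total_seconds(output_file: Path) -> float:
    """Extract the total wall time, in seconds, from one ORCA output file."""
    match = TIME_RE.search(output_file.read_text())
    if match is None:
        raise ValueError(f"No timing found in {output_file}")
    d, h, m, s, ms = map(int, match.groups())
    return d * 86400 + h * 3600 + m * 60 + s + ms / 1000

# Summarise the timings collected from a batch of output files
# (statistics.stdev needs at least two values).
timings = [total_seconds(f) for f in Path("results").glob("*.out")]
print(f"mean = {statistics.mean(timings):.1f} s, "
      f"stdev = {statistics.stdev(timings):.2f} s")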

3. Notes on Hardware
I suspect that most here will be familiar with the popular EPYC 9654 system; there are numerous detailed forum posts on the topic of this CPU, and I reported on my experience with it in a previous forum post. This system offers excellent price-to-performance for quantum chemistry, more so with the considerably discounted QS variant.

AMD’s HEDT Threadripper platform is rather more niche. The idea behind this platform is to offer higher core counts, more PCIe connectivity, and greater memory bandwidth than consumer Ryzen CPUs, while retaining overclocking capabilities, higher core frequencies, and faster memory speeds than the equivalent Zen 4 server-grade EPYC chips.

I spent some time overclocking the 7980X system, tweaking the settings for performance and stability. The system has since run entirely stably at 4.8 – 4.9 GHz on all 64 CPU cores for several months. Moreover, because the programmed Fmax for one CCD is higher than that of the others, CPU frequencies boost to 5.5 GHz when fewer than 8 cores are active.

Here is a summary of the various overclocking settings applied:


The impressive overclocked performance of the 7980X system is achieved by supplying the CPU with an enormous amount of power, often exceeding 800 W. Thus, a powerful water cooler with a large radiator is essential.

Under full load, the EPYC 9654 system has no trouble keeping temperatures under control. The CPU temperature tends to stay around 65 – 70 °C. Meanwhile, the 7980X system runs considerably hotter, with CCD temperatures between 80 and 95 °C.

In the presented benchmarking, simultaneous multithreading (SMT) is enabled, unless otherwise specified (the effect of SMT is explored in the latter part of the post).

4. Overview of the Benchmark
The primary motivation for the chosen methodology is the need to run very large batches (10^4 – 10^6) of DFT calculations for training models, such as neural network potentials for MD, or predictive/generative models of empirical utility. My view is that the use of predictive models and data-driven approaches to development and discovery in chemical synthesis holds great promise for the field, and large-scale quantum chemical calculations will doubtless play an important role. Thus, the results are presented as ‘throughput’ in h⁻¹; that is, the number of calculations one could run per hour if successive calculations were run back-to-back continuously. Having used both of these computer systems for research projects already, I can attest that these benchmarks faithfully represent system performance under ‘real-world’ use.
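
To make the throughput figure concrete: it is the number of concurrent jobs multiplied by 3600 s, divided by the average wall time per calculation in seconds. A trivial sketch, with made-up numbers rather than measured values:

# Illustrative only: how 'throughput' (calculations per hour) is defined.
def throughput_per_hour(n_concurrent_jobs: int, avg_wall_time_s: float) -> float:
    # Calculations completed per hour when jobs run back-to-back continuously.
    return n_concurrent_jobs * 3600.0 / avg_wall_time_s

print(throughput_per_hour(12, 900.0))  # 12 jobs averaging 15 min each -> 48.0 h^-1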

A key question I wanted to answer was how best to utilise all available CPU cores, by balancing the number of CPU cores each calculation is parallelised over (using the nprocs keyword in ORCA’s %pal block) against running multiple calculations simultaneously as concurrent ‘jobs’.

One might also be interested in the most efficient approach to carrying out smaller-scale studies or even just individual calculations in the shortest amount of time, so the present systems are studied in this capacity too.

The benchmarking presented here comprises single-point DFT calculations on an optimized geometry of a simple organotransition metal compound: (dppbz)CuH (pictured below). Two functionals were tested; namely, the range-separated hybrid ωB97X, and the pure functional PBE. Furthermore, comparisons are made across def2-SVP and def2-TZVP basis sets.

[Figure: optimised structure of (dppbz)CuH]

I spent some time investigating the reproducibility of the results and found them to be highly consistent: standard deviations were less than one second, with total calculation timings typically between 1 and 45 minutes. All results presented are averages over the collected individual timings. Simultaneous jobs were initialised with a 0.1 second delay between them; increasing this delay to 30 seconds did not significantly change the timings.
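
The batch launching was done with Bash scripts, as noted above; the following is a minimal Python sketch of the same idea, with hypothetical paths (note that ORCA generally requires the full path to its binary for parallel runs):

import subprocess
import time
from pathlib import Path

ORCA = "/opt/orca/orca"                 # hypothetical install location
inputs = sorted(Path("batch").glob("*.inp"))

procs = []
for inp in inputs:
    fh = open(inp.with_suffix(".out"), "w")
    procs.append(subprocess.Popen([ORCA, str(inp)], stdout=fh))
    fh.close()                          # the child keeps its own file descriptor
    time.sleep(0.1)                     # 0.1 s stagger between job launches

for p in procs:                         # wait for every job to finish
    p.wait()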

Here is an example input script:
! RIJCOSX wB97X MINIPRINT   # wB97X functional with the RIJCOSX approximation; minimal printing

%pal nprocs 4 end           # number of parallel processes for this calculation

%basis
Basis "def2-TZVP"           # orbital basis set
Aux "def2/J"                # auxiliary basis for the RIJ approximation
end

* xyzfile 0 1 GEO.xyz       # charge 0, multiplicity 1, geometry read from GEO.xyz

...and here is the geometry in GEO.xyz:
58
Coordinates from ORCA-job
  Cu          2.46973002972452     -0.18601739044056      0.04232875648838
  C          -0.64363913555159     -0.97384561595684     -0.88211909464252
  C          -1.62045677492391     -1.57356613363719     -1.69479681390466
  C          -2.62861064458259     -2.36525177869295     -1.13939160073426
  C          -2.66742741767769     -2.57603031053721      0.24197794608297
  C          -1.69865533688148     -1.99334312422015      1.06276593858960
  C          -0.68411432605112     -1.18621532788858      0.52077293903823
  H          -1.58405488995610     -1.43143070510902     -2.77734581040532
  H          -3.37760021230677     -2.82694533364111     -1.78850655384471
  H          -3.44685247080927     -3.20413117176586      0.68185565008376
  H          -1.72340199901762     -2.18003195526747      2.13888713946339
  P           0.76331812454317      0.02511474857812     -1.57656893092477
  P           0.66808905554658     -0.44033697872267      1.55737850489893
  C           0.07027521207012      1.71758452155321     -1.82468507906642
  C          -1.29059788908241      2.04029583164644     -1.69532264239153
  C          -1.72741200360080      3.35678686265136     -1.88035380204103
  C          -0.81295444353836      4.36434909072420     -2.19891778216789
  C           0.54562680566646      4.05342157908296     -2.32559401896909
  C           0.98597747702343      2.74234787712552     -2.13261515084648
  H          -2.01689964984798      1.26284342751882     -1.44819727093619
  H          -2.78977562759455      3.59356913149549     -1.77514931583574
  H          -1.15657941853224      5.39225195613550     -2.34288559258041
  H           1.26840737214768      4.83740206958434     -2.56701562498220
  H           2.05162779696117      2.51026008695776     -2.21735508687712
  C           0.95914382935440     -0.62563790246834     -3.29429774766077
  C           1.86124565969842     -1.68859439323831     -3.47961989366908
  C           2.05998798780848     -2.23577711880120     -4.74985597418974
  C           1.37446780925027     -1.71626516451124     -5.85310150449239
  C           0.48978312792667     -0.64688294601580     -5.68126863172323
  C           0.28161973138268     -0.10314694683018     -4.40949201679897
  H           2.42349826175845     -2.07511997661121     -2.62448463361414
  H           2.76448555104436     -3.06165039647633     -4.87992812408805
  H           1.53770714659285     -2.13791858149570     -6.84865123830882
  H          -0.04052345493023     -0.23017556744675     -6.54208536696866
  H          -0.40592565735296      0.73751361732364     -4.28815381296754
  C          -0.06339768157169      1.10862534045831      2.24555004431524
  C          -1.42958448350428      1.43073751041558      2.18854526394961
  C          -1.89626130851997      2.63637046048158      2.72376030371323
  C          -1.00649286246590      3.53222548243676      3.32324484835761
  C           0.35675949219332      3.22219993254359      3.38083730417821
  C           0.82683750789704      2.02339485477321      2.83999469377563
  H          -2.13648757845930      0.73910254484218      1.72475680631803
  H          -2.96221256573466      2.87473897175397      2.67137636386630
  H          -1.37338453790757      4.47404072402123      3.74017453949608
  H           1.06032906914228      3.92120762995246      3.84087251542926
  H           1.89605417051905      1.79471673005399      2.87415056416288
  C           0.77494092775765     -1.56446325647194      3.01944140939291
  C           0.02333947380930     -1.39237832209352      4.19472766797217
  C           0.16778310954606     -2.28384900025092      5.26281948932466
  C           1.06228134615066     -3.35482173344512      5.16943203371828
  C           1.82137664628147     -3.52739334281143      4.00710725975150
  C           1.68596864095113     -2.63321654629609      2.94195260870045
  H          -0.67266188046333     -0.55447160762770      4.28166626618063
  H          -0.41993204979187     -2.13803927782552      6.17339490220428
  H           1.17568338680578     -4.04870790383006      6.00682102759451
  H           2.53376576021073     -4.35347390248122      3.93375406977450
  H           2.30502115678396     -2.75144925368267      2.04777616706077
  H           4.01076163410810     -0.33052301551929      0.07963809174976

5. Results
5.1. The effect of increasing the number of CPU cores for a single job
My initial investigation focused on the number of CPU cores requested (via the nprocs keyword) for a single calculation. Four different levels of theory were tested: ωB97X/def2-TZVP, ωB97X/def2-SVP, PBE/def2-TZVP, and PBE/def2-SVP. The results are presented in the four graphs below; throughout this post, data for the 7980X system are coloured red, and data for the 2 × 9654 system are coloured blue. To visualise the parallelisation efficiency, dashed lines of the corresponding colours indicate the theoretical linear increase in performance corresponding to an ‘embarrassingly parallel’ relationship.
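
For reference, the dashed lines follow the usual definitions of parallel speedup and efficiency; a minimal sketch of these definitions (my own formulation, not something taken from ORCA):

# speedup S(N) = t(1) / t(N); parallel efficiency E(N) = S(N) / N.
# An 'embarrassingly parallel' workload has S(N) = N, i.e. E(N) = 1.
def speedup(t_serial: float, t_parallel: float) -> float:
    return t_serial / t_parallel

def efficiency(t_serial: float, t_parallel: float, n_cores: int) -> float:
    return speedup(t_serial, t_parallel) / n_cores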

[Figure: single-job throughput vs. number of CPU cores for the four levels of theory; dashed lines indicate ideal linear scaling. 7980X in red, 2 × 9654 in blue.]

Some key observations from the obtained data:

The difference in timings for an individual calculation on the two systems is largely attributable to the difference in CPU clock frequencies (4.9 vs. 3.5 GHz for the 7980X and 9654QS chips, respectively, a ratio of 1.4:1). However, the 7980X system attains a greater speedup for the less computationally expensive methods, indicating that these are more strongly influenced by CPU frequency. A cursory test of the effect of RAM frequency and NVMe drive speed showed negligible differences in all cases, suggesting that internal CPU design may play a role in the observed trends.

5.2. The effect of increasing the number of simultaneous jobs
So, instead of running individual calculations and increasing the number of CPU cores, for which the performance increase deviates from linearity rather quickly, what if we run multiple calculations concurrently, each with nprocs = 4? For an initial investigation, the ωB97X/def2-TZVP level of theory was used; the results are presented in the graph below:

[Figure: throughput vs. number of simultaneous jobs (nprocs = 4), ωB97X/def2-TZVP]

Some key observations from this study:


5.3. Balancing number of CPU cores vs. number of simultaneous jobs
As demonstrated above, an approximately linear increase in throughput is attained when increasing the number of jobs (each with nprocs = 4), but what about applying the same strategy with different numbers of CPU cores per job?
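
As a quick illustration of the trade-off (assuming, hypothetically, that the jobs exactly fill the 7980X's 64 physical cores):

# Illustrative: (nprocs, simultaneous jobs) pairs that fill 64 cores.
TOTAL_CORES = 64
for nprocs in (4, 8, 16, 32):
    print(f"nprocs = {nprocs:2d} -> {TOTAL_CORES // nprocs} simultaneous jobs")
# nprocs =  4 -> 16 simultaneous jobs
# nprocs =  8 -> 8 simultaneous jobs
# nprocs = 16 -> 4 simultaneous jobs
# nprocs = 32 -> 2 simultaneous jobs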

I encountered a strange problem when attempting this with nprocs = 2: the calculations would not run simultaneously. I made a half-hearted attempt to troubleshoot this, but gave up and decided not to pursue it further. Thus, I present results with nprocs = 4, 8, 16, and 32, for both the ωB97X/def2-TZVP and PBE/def2-SVP levels of theory, in the graphs below:

[Figure: throughput vs. number of simultaneous jobs for nprocs = 4, 8, 16, and 32, at the ωB97X/def2-TZVP and PBE/def2-SVP levels]

Some prevailing observations:

5.4. The effect of simultaneous multithreading (SMT)
ORCA does not natively make use of SMT technology. I have heard some people advocate switching it off in the BIOS, but this advice dates from quite some time ago and is likely hardware-dependent, so I went about testing it myself.

The effect of enabling or disabling SMT on the timings of individual calculations (with nprocs = 4 and 32, at the ωB97X/def2-TZVP and PBE/def2-SVP levels of theory) is shown in the bar chart below. Note that, for clarity, I’ve plotted the data here as calculation time, so lower is better.

[Bar chart: individual calculation times with SMT enabled vs. disabled (nprocs = 4 and 32; ωB97X/def2-TZVP and PBE/def2-SVP); lower is better]

In almost all cases, there is no discernible difference whether SMT is enabled or not. A notable exception is the PBE/def2-SVP calculation using 32 CPU cores, for which turning off SMT results in a slight performance decrease for the 7980X system, and a slight increase for the 2 × 9654 system.

Next, we turn our attention to the effect of SMT on running multiple jobs. The obtained throughput data is plotted in the bar charts below (higher is better).

[Bar charts: throughput for multiple simultaneous jobs with SMT enabled vs. disabled; higher is better]

A few notable insights:

6. Conclusions
For these two Zen 4 AMD CPUs, the best general predictors of performance are CPU core count and clock frequency, with other potential performance bottlenecks apparently less influential. This results in a performance ratio of approximately 2:1 for the 2 × 9654 system over the 7980X. Given that the two systems cost approximately the same (ignoring the price of the GPU, which is not used here), the dual-socket EPYC configuration offers significantly better price-to-performance for ORCA calculations. However, in my case, I use the 7980X system for a wide range of other workflows, especially those which utilise GPUs or are lightly threaded, and thus benefit from the higher clock speeds.

A summary of the main findings:

Thank you all for reading!


EDIT: Fixed various formatting and typographical errors.
Author: 含光君    Time: 2025-5-8 16:21
Thanks for sharing! Very inspiring and helpful work!

The charts in this report look quite nice; could you please share how you plotted them? Using Python or other applications?
Author: David_R    Time: 2025-5-8 16:58
含光君 posted on 2025-5-8 16:21:
Thanks for sharing! Very inspiring and helpful work!

The charts in this report look quite nice, c ...

It is my pleasure!

To make charts, I use various combinations of Python, Origin Pro, MS Excel, and Adobe Illustrator. Most of the 'custom' modifications are done in Adobe Illustrator as it is powerful vector graphics software. I would expect that any open source alternative for vector graphics software should also work well.
Author: LetsQu1t    Time: 2025-8-2 17:52
This post was last edited by LetsQu1t at 2025-8-2 09:58.

Hi David,

  Many thanks for sharing this work. I just have a few quick comments on hyperthreading and parallelisation efficiency.
  I'm curious that, when SMT was disabled, requesting more CPUs than the number of physical cores still resulted in normal termination of your calculations. To my knowledge, this should lead to an instant abort of the ORCA job, as the system would never be able to locate the extra cores (those exceeding the number of physical cores, which then equals the number of logical threads).
  Meanwhile, you conclude that maximum efficiency was achieved by enabling SMT and specifying all threads in the input file. This, however, is widely regarded as poor practice. Tian Lu has posted a full thread about this (see http://sobereva.com/392). When the number of cores requested exceeds the number of physical cores (with HT/SMT), performance plunged without exception, as you can tell from his numbers. In addition, disabling SMT can result in a performance improvement if it boosts the all-core turbo clock speed (common for old Intel chips, and also where the motherboard's power delivery is not robust).
  The fact that you started to observe pronounced efficiency loss when requesting more than 24 cores was also mind-blowing. A 58-atom closed-shell system with RI-DFT and a double-zeta basis set constitutes a reasonably sized, not particularly small, job, and I was expecting a turning point at a number greater than 24.

Sincerely
Author: David_R    Time: 3 days ago
LetsQu1t posted on 2025-8-2 17:52:
Hi David,

  Many thanks for sharing this work. I just want to have a few quick comments on hypert ...

Thank you for your comment! These are good points to discuss.

When running the comparisons with and without SMT enabled, the utilisation of all logical cores (twice the number of physical cores) was achieved by running multiple simultaneous ORCA calculations, parallelised with 4 nprocs each, so the Linux scheduler handles the workload without ORCA throwing any errors.

My investigations led me down the path of extracting more performance from the hardware by running more simultaneous calculations, each parallelised with fewer CPU cores, especially since, as you noted, individual calculations don't tend to parallelise efficiently beyond 16-32 cores (regardless of whether SMT is enabled). My two-fold hypothesis on the relevance of this strategy is as follows:

1. Continued advancements in the efficiency of scheduling with SMT and Linux have ironed out a lot of issues with committing tasks according to logical cores rather than physical cores. Still, in line with what Tian Lu notes in his blog post, the performance increase (if there is one) is fairly marginal.

2. CPU core clocks have increased considerably in these high-core-count chips, and they boost reliably under the all-core workloads relevant to computational chemistry. My suspicion is that because the CPUs operate at higher frequencies, bottlenecks in parallelisation overhead are exposed, meaning that optimal parallelisation efficiency occurs at a lower number of CPU cores per calculation. This is also corroborated by some of my more qualitative observations using older-generation AMD EPYC 7002 and 7003 hardware, as well as the newer EPYC Turin 9005 chips.

Lastly, other bottlenecks associated with committing a large number of tasks according to the number of logical cores are mitigated here, as these systems have adequate RAM and robust cooling/power delivery. Nonetheless, these might be valid factors for others to take into consideration.




Welcome to 计算化学公社 (http://bbs.keinsci.com/) Powered by Discuz! X3.3