Last edited by hlmkh on 2017-1-18 16:50
Gaussian 16 Rev. A.03 Release Notes
New Modeling Capabilities
TD-DFT analytic second derivatives for predicting vibrational frequencies/IR and Raman spectra and performing transition state optimizations and IRC calculations for excited states.
EOMCC analytic gradients for performing geometry optimizations.
Anharmonic vibrational analysis for VCD and ROA spectra: see Freq=Anharmonic.
Vibronic spectra and intensities: see Freq=FCHT and related options.
Resonance Raman spectra: see Freq=ReadFCHT.
New DFT functionals: M08 family, MN15, MN15L.
New double-hybrid methods: DSDPBEP86, PBE0DH and PBEQIDH.
PM7 semi-empirical method.
Adamo excited state charge transfer diagnostic: see Pop=DCT.
The EOMCC solvation interaction models of Caricato: see SCRF=PTED.
Generalized internal coordinates, a facility which allows arbitrary redundant internal coordinates to be defined and used for optimization constraints and other purposes. See Geom=GIC and GIC Info.
NVIDIA K40 and K80 GPUs are supported under Linux for Hartree-Fock and DFT calculations. See the Using GPUs tab for details.
Parallel performance on larger numbers of processors has been improved. See the Parallel Performance tab for information about how to get optimal performance on multiple CPUs and clusters.
Gaussian 16 uses an optimized memory algorithm to avoid I/O during CCSD iterations.
There are several enhancements to the GEDIIS optimization algorithm.
CASSCF improvements for active spaces ≥ (10,10) increase performance and make active spaces of up to 16 orbitals feasible (depending on the molecular system).
Significant speedup of the core correlation energies for the W1 compound model.
Gaussian 16 incorporates algorithmic improvements for significant speedup of the diagonal, second-order self-energy approximation (D2) component of composite electron propagator (CEP) methods as described in [DiazTinoco16]. See EPT.
Tools for interfacing Gaussian with other programs, both with compiled languages such as Fortran and C and with interpreted languages such as Python and Perl. Refer to the Interfacing to Gaussian 16 page for details.
Parameters specified in Link 0 (%) input lines and/or in a Default.Route file can now also be specified via either command-line arguments or environment variables. See the Link 0 Equivalences tab for details.
Compute the force constants every nth step of a geometry optimization: see Opt=Recalc.
Gaussian 16 can use NVIDIA K40 and K80 GPUs under Linux. Earlier GPUs do not have the computational capabilities or memory size to run the algorithms in Gaussian 16. Gaussian 16 does not yet support the Tesla-Pascal series.
Allocating sufficient amounts of memory to jobs is even more important when using GPUs than for CPUs, since larger batches of work must be done at the same time in order to use the GPUs efficiently. The K40 and K80 units can have up to 16 GB of memory. Typically, most of this should be made available to Gaussian. Giving Gaussian 8-9 GB works well when there is 12 GB total on each GPU; similarly, allocating Gaussian 11-12 GB is appropriate for a 16 GB GPU. In addition, at least an equal amount of memory must be available for each CPU thread which is controlling a GPU.
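The sizing guidance above can be turned into a quick check. The following is only a sketch: the table holds the two cases quoted in the text, the function name `controller_memory_gb` is hypothetical, and it assumes one controlling CPU thread per GPU, each needing at least as much host memory as is given to its GPU.

```python
# Guidance quoted in the text: total GPU memory (GB) -> suggested Gaussian allocation (GB).
SUGGESTED_ALLOCATION_GB = {12: (8, 9), 16: (11, 12)}


def controller_memory_gb(gpu_total_gb, n_gpus):
    """Return (per-GPU allocation range, host memory range for the controlling threads).

    Assumes one controlling CPU thread per GPU, each needing at least as much
    host memory as is allocated on its GPU.
    """
    if gpu_total_gb not in SUGGESTED_ALLOCATION_GB:
        raise ValueError("no guidance given in the text for this GPU memory size")
    low, high = SUGGESTED_ALLOCATION_GB[gpu_total_gb]
    return (low, high), (low * n_gpus, high * n_gpus)


per_gpu, host_total = controller_memory_gb(12, 8)
print("per-GPU allocation (GB):", per_gpu)
print("host memory for controlling threads (GB):", host_total)
```

Memory for the CPU threads that do ordinary computation comes on top of this figure.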
When using GPUs, it is essential to have the GPU controlled by a specific CPU. The controlling CPU should be as physically close as possible to the GPU it is controlling. The hardware arrangement on a system with GPUs can be checked using the nvidia-smi utility. For example, this output is for a machine with two 16-core Haswell CPU chips and four K80 boards, each of which has two GPUs:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity
GPU0  X     PIX   SOC   SOC   SOC   SOC   SOC   SOC   0-15   (cores on first chip)
GPU1  PIX   X     SOC   SOC   SOC   SOC   SOC   SOC   0-15
GPU2  SOC   SOC   X     PIX   PHB   PHB   PHB   PHB   16-31  (cores on second chip)
GPU3  SOC   SOC   PIX   X     PHB   PHB   PHB   PHB   16-31
GPU4  SOC   SOC   PHB   PHB   X     PIX   PXB   PXB   16-31
GPU5  SOC   SOC   PHB   PHB   PIX   X     PXB   PXB   16-31
GPU6  SOC   SOC   PHB   PHB   PXB   PXB   X     PIX   16-31
GPU7  SOC   SOC   PHB   PHB   PXB   PXB   PIX   X     16-31
The important part of this output is the CPU affinity. This example shows that GPUs 0 and 1 (on the first K80 card) are connected to the CPUs on chip 0 while GPUs 2-7 (on the other three K80 cards) are connected to the CPUs on chip 1.
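Reading the affinity column by eye gets tedious on larger machines. As a rough sketch, the snippet below pulls the GPU-to-CPU affinity out of a matrix in the layout shown above (the kind of text `nvidia-smi topo -m` prints). The sample text and the function name `parse_cpu_affinity` are illustrative assumptions, and the parser assumes each GPU row carries a single CPU range such as 0-15.

```python
import re

# Shortened sample in the layout shown above (illustrative, not real output).
SAMPLE = """\
      GPU0  GPU1  GPU2  GPU3  CPU Affinity
GPU0  X     PIX   SOC   SOC   0-15
GPU1  PIX   X     SOC   SOC   0-15
GPU2  SOC   SOC   X     PIX   16-31
GPU3  SOC   SOC   PIX   X     16-31
"""

CPU_RANGE = re.compile(r"(\d+)-(\d+)")


def parse_cpu_affinity(text):
    """Return {gpu_number: (first_cpu, last_cpu)} for each GPU row."""
    affinity = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not re.fullmatch(r"GPU\d+", fields[0]):
            continue
        # The header row also starts with a GPU label but has no CPU range,
        # so it falls through this loop without adding an entry.
        for field in fields[1:]:
            match = CPU_RANGE.fullmatch(field)
            if match:
                gpu = int(fields[0][3:])
                affinity[gpu] = (int(match.group(1)), int(match.group(2)))
                break
    return affinity


for gpu, (first, last) in sorted(parse_cpu_affinity(SAMPLE).items()):
    print(f"GPU{gpu}: pick a controlling CPU in {first}-{last}")
```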
The GPUs to use for a calculation and their controlling CPUs are specified with the %GPUCPU Link 0 command. This command takes one parameter:
%GPUCPU=gpu-list=controlling-cpus
where gpu-list is a comma-separated list of GPU numbers, possibly including numerical ranges (e.g., 0-4,6), and controlling-cpus is a similarly-formatted list of controlling CPU numbers. To continue with the same example, a job which uses all the CPUs (26 CPUs doing parts of the computation and 6 controlling GPUs) would use the following Link 0 commands:
%CPU=0-31
%GPUCPU=0,1,2,3,4,5=0,1,16,17,18,19
This pins threads 0-31 to CPUs 0-31 and then uses GPU0 controlled by CPU 0, GPU1 controlled by CPU 1, GPU2 controlled by CPU 16, and so on. Note that the controlling CPUs are included in %CPU. The GPU and CPU lists could be expressed more tersely as:
%CPU=0-31
%GPUCPU=0-5=0-1,16-19
Normally one uses consecutive numbering in the obvious way, but things can be associated differently in special cases. For example, suppose that the same machine already had one job using 6 CPUs, running with %CPU=16-21. Then to use the other 26 CPUs, with 6 of them controlling GPUs, one would specify:
%CPU=0-15,22-31
%GPUCPU=0-5=0-1,22-25
This would create 26 threads, with GPUs controlled by the threads on CPUs 0, 1, 22, 23, 24 and 25.
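A small script can confirm which CPU ends up controlling which GPU before a job is submitted. The sketch below expands the range notation and pairs the two lists in order. The function names are hypothetical, the example values are reconstructed from the description above, and the two checks (equal list lengths, controlling CPUs present in %CPU) are assumptions of this sketch rather than documented Gaussian behaviour.

```python
def expand(spec):
    """Expand a list such as '0-4,6' into [0, 1, 2, 3, 4, 6]."""
    numbers = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            numbers.extend(range(int(first), int(last) + 1))
        else:
            numbers.append(int(part))
    return numbers


def gpu_controllers(cpu_spec, gpucpu_spec):
    """Map each GPU number to its controlling CPU number.

    cpu_spec is the value of %CPU, gpucpu_spec the value of %GPUCPU
    (gpu-list=controlling-cpus).
    """
    gpu_part, cpu_part = gpucpu_spec.split("=")
    gpus, controllers = expand(gpu_part), expand(cpu_part)
    if len(gpus) != len(controllers):
        raise ValueError("GPU list and controlling-CPU list differ in length")
    missing = set(controllers) - set(expand(cpu_spec))
    if missing:
        raise ValueError(f"controlling CPUs not listed in %CPU: {sorted(missing)}")
    return dict(zip(gpus, controllers))


# Second example from the text: 26 threads, CPUs 16-21 left to another job.
mapping = gpu_controllers("0-15,22-31", "0-5=0-1,22-25")
for gpu, cpu in mapping.items():
    print(f"GPU{gpu} is controlled by the thread on CPU {cpu}")
```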
GPUs are not helpful for small jobs but are effective for larger molecules when doing DFT energies, gradients and frequencies (for both ground and excited states). They are not used effectively by post-SCF calculations such as MP2 or CCSD. Each GPU is several times faster than a CPU. However, modern machines typically have many more CPUs than GPUs, so it is important to use all the CPUs as well as the GPUs, and the speedup from GPUs is reduced because the many CPUs are also used effectively. For example, if a GPU is 5x faster than a CPU, then the speedup from going from 1 CPU to 1 CPU + 1 GPU would be 5x, but going from 32 CPUs to 32 CPUs + 8 GPUs (i.e., 24 CPUs + 8 GPUs, since 8 of the CPUs are controlling GPUs) would be equivalent to 24 + 5×8 = 64 CPUs, for a speedup of 64/32 or 2x.
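The estimate in this paragraph follows a simple rule: each GPU takes one CPU away from the computation to act as its controller and contributes the equivalent of several CPUs in return. A minimal sketch of that arithmetic, with the hypothetical function name `estimated_speedup` and the GPU-to-CPU speed ratio passed in as an assumed input:

```python
def estimated_speedup(n_cpus, n_gpus, gpu_factor):
    """Estimate the speedup of n_cpus + n_gpus over n_cpus alone.

    gpu_factor is how many times faster one GPU is than one CPU.
    One CPU per GPU is assumed to be busy controlling it.
    """
    compute_cpus = n_cpus - n_gpus
    cpu_equivalents = compute_cpus + gpu_factor * n_gpus
    return cpu_equivalents / n_cpus


print(estimated_speedup(1, 1, 5))    # 1 CPU -> 1 CPU + 1 GPU
print(estimated_speedup(32, 8, 5))   # 32 CPUs -> 32 CPUs + 8 GPUs
```

This is an idealized model: it ignores job size and assumes perfect scaling across CPUs and GPUs.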
GPUs on nodes in a cluster can be used. Since the %CPU and %GPUCPU specifications are applied to each node in the cluster, the nodes must have identical configurations (number of GPUs and their affinity to CPUs); since most clusters are collections of identical nodes, this is not usually a problem.