This section summarises the options DFTK offers to monitor and influence the performance of the code.
By default DFTK uses TimerOutputs.jl to record timings, memory allocations and the number of calls for selected routines inside the code. These numbers are accessible in the object DFTK.timer. Since timings are automatically accumulated inside this data structure, the timer should be reset before running the calculation of interest in order to obtain a clean measurement.
For example, to measure the timing of an SCF:
```julia
import TimerOutputs

TimerOutputs.reset_timer!(DFTK.timer)
scfres = self_consistent_field(basis, tol=1e-8)
DFTK.timer
```
```
 ──────────────────────────────────────────────────────────────────────────────
                                         Time                   Allocations
                                 ──────────────────────   ───────────────────────
        Tot / % measured:             756ms / 25.4%           127MiB / 38.6%

 Section                 ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 self_consistent_field        1    192ms   100%    192ms   48.9MiB   100%  48.9MiB
   LOBPCG                    14    122ms  63.5%   8.72ms   13.0MiB  26.5%   951KiB
     Hamiltonian mu...       42   85.4ms  44.5%   2.03ms   1.93MiB  3.93%  47.0KiB
     ortho                   99   6.59ms  3.43%   66.6μs   2.40MiB  4.89%  24.8KiB
     rayleigh_ritz           28   6.42ms  3.34%    229μs    956KiB  1.90%  34.1KiB
     block multipli...      111   3.31ms  1.72%   29.8μs   1.65MiB  3.35%  15.2KiB
   compute_density            7   33.3ms  17.3%   4.76ms   13.6MiB  27.6%  1.94MiB
   energy_hamiltonian        15   27.9ms  14.5%   1.86ms   14.6MiB  29.7%  0.97MiB
   QR orthonormaliz...       14    340μs  0.18%   24.3μs    159KiB  0.32%  11.4KiB
   SimpleMixing               7    174μs  0.09%   24.9μs    959KiB  1.91%   137KiB
 guess_density                1    689μs  0.36%    689μs    231KiB  0.46%   231KiB
 guess_spin_density           1    355ns  0.00%    355ns     0.00B  0.00%    0.00B
 ──────────────────────────────────────────────────────────────────────────────
```
The output produced when printing or displaying the DFTK.timer now shows a nice table summarising total time and allocations as well as a breakdown over individual routines.
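Beyond this printed summary, the recorded numbers can also be queried programmatically, since DFTK.timer is an ordinary TimerOutputs.TimerOutput. The following is a minimal sketch using the accessor functions of TimerOutputs.jl; it is not part of the original example, and the section name used for indexing is simply taken from the table above.

```julia
import TimerOutputs
using DFTK

to = DFTK.timer
println("Total measured time:  ", TimerOutputs.tottime(to) / 1e9, " s")
println("Total allocations:    ", TimerOutputs.totallocated(to) / 2^20, " MiB")

# Individual sections can be indexed by their name as shown in the table,
# e.g. the self_consistent_field block:
scf = to["self_consistent_field"]
println("SCF: ", TimerOutputs.ncalls(scf), " call(s), ",
        TimerOutputs.time(scf) / 1e9, " s")
```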
Timing measurements have the unfortunate disadvantage that they alter the way stack traces look, which sometimes makes it harder to find errors when debugging. For this reason timing measurements can be disabled completely (i.e. not even compiled into the code) by setting the environment variable DFTK_TIMING to "false". For this to take effect, all of DFTK (including the precompile cache) needs to be recompiled.
Unfortunately measuring timings in TimerOutputs is not yet thread-safe. Therefore timings of threaded parts of the code are disabled unless you set DFTK_TIMING to "all". In this case you must not use Julia threading (see the section below), otherwise undefined behaviour results.
At the moment DFTK employs shared-memory parallelism using multiple levels of threading, which distribute the workload over different $k$-points, over bands, or within an FFT or BLAS call between processors. At its current stage our approach to threading is quite simple and pragmatic, such that we do not yet achieve great scaling. This should be improved in future versions of the code.
Finding a good sweet spot between the number of threads to use and the extra performance gained by each additional working core is not always easy, since starting, terminating and synchronising threads takes time as well. Most importantly, the best threading settings depend on both the hardware and the problem (e.g. number of bands, $k$-points, FFT grid size).
For the moment DFTK does not offer an automated selection of threading and parallelisation parameters and just uses the Julia defaults. Since these are rarely good, users are advised to use the timing capabilities of DFTK to experiment with threading for their particular use case before running larger calculations, for example along the lines of the sketch below.
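A minimal sketch of such an experiment, assuming a plane-wave basis has already been set up; the thread counts tried here are arbitrary examples:

```julia
using DFTK
using FFTW
import TimerOutputs

# Run the same SCF with a few FFTW thread counts and compare the wall time
# accumulated in DFTK.timer (`basis` is assumed to exist already).
for nthreads in (1, 2, 4)
    FFTW.set_num_threads(nthreads)
    TimerOutputs.reset_timer!(DFTK.timer)
    self_consistent_field(basis; tol=1e-8)
    total = TimerOutputs.tottime(DFTK.timer) / 1e9   # seconds
    println("FFTW threads = $nthreads: measured time = $(round(total, digits=3)) s")
end
```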
For typical small to medium-size calculations in DFTK the largest part of the time is spent doing discrete Fourier transforms (about 80 to 90%). For this reason parallelising the FFTs can have a large effect on the runtime of larger calculations in DFTK. Unfortunately the scaling of FFT threading is not great for smaller problem sizes and large numbers of threads, such that threading in FFTW is disabled by default.
The recommended setting for FFT threading with DFTK is therefore to only use a moderate number of FFT threads, something like $2$ or $4$, and to disable FFT threading completely for smaller calculations. To enable parallelisation of FFTs (which is disabled by default), use
```julia
using FFTW
FFTW.set_num_threads(N)
```
where N is the number of threads you desire.
All BLAS calls in Julia go through a parallelised OpenBLAS or MKL (with MKL.jl). Generally threading in BLAS calls is far from optimal and the default settings can be pretty bad. For example, for CPUs with hyperthreading enabled, the default number of threads seems to equal the number of virtual cores. Still, BLAS calls typically take second place in terms of the share of runtime they make up (between 10% and 20%). Notably, many of these calls do not take place on matrices of the size of the full FFT grid, but rather only in a subspace (e.g. orthogonalisation, Rayleigh-Ritz, ...), such that parallelisation is either disabled by the BLAS library anyway or not very effective.
The recommendation is therefore to use the same number of threads as for the FFT threads. You can set the number of BLAS threads by
```julia
using LinearAlgebra
BLAS.set_num_threads(N)
```
where N is the number of threads you desire. To check the number of BLAS threads currently used, you can use
```julia
Int(ccall((BLAS.@blasfunc(openblas_get_num_threads), BLAS.libblas), Cint, ()))
```
or (from Julia 1.6 onwards) simply BLAS.get_num_threads().
On top of FFT and BLAS threading, DFTK uses Julia threads (Threads.@threads) in a couple of places to parallelise over $k$-points (density computation) or bands (Hamiltonian application). The number of threads used for these aspects is controlled by the environment variable JULIA_NUM_THREADS. To influence the number of Julia threads used, set this variable before starting the Julia process.
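Since the variable is read at startup, it cannot be changed from within a running session; the sketch below only illustrates how to inspect the setting (the thread count 4 mentioned in the comments is just an example).

```julia
# The number of Julia threads is fixed when the process is launched, for example
# by exporting JULIA_NUM_THREADS=4 in the shell or (on recent Julia versions)
# by starting Julia as `julia --threads 4`.
# Inside the session the setting can only be inspected, not changed:
@show Threads.nthreads()                      # Julia threads available to DFTK
@show get(ENV, "JULIA_NUM_THREADS", "unset")  # value of the variable, if set
```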
Notice that Julia threading is applied on top of FFTW and BLAS threading, in the sense that the regions parallelised by Julia threads again use parallelised FFT and BLAS calls, such that the effects are not orthogonal. Compared to FFT and BLAS threading, the parallelisation implied by using Julia threads tends to scale better, but its effectiveness is limited by the number of bands and the number of irreducible $k$-points used in the calculation. Therefore this good scaling quickly diminishes for small to medium systems.
The recommended setting is to stick to 2 Julia threads and to use 4 or more Julia threads only for large systems and/or many $k$-points. To check the number of Julia threads use Threads.nthreads(), as shown in the sketch above.
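Putting these recommendations together, here is a hedged sketch of how a session for a medium-sized calculation might be set up; the value 2 is simply the suggested starting point from above, not a tuned number, and `model`/`basis` stand for whatever system you are studying.

```julia
# Sketch: start Julia with two threads, e.g. `JULIA_NUM_THREADS=2 julia`,
# then match the FFTW and BLAS thread counts before running the calculation.
using DFTK
using FFTW
using LinearAlgebra

FFTW.set_num_threads(2)   # moderate FFT threading; disable for small systems
BLAS.set_num_threads(2)   # keep BLAS threading in line with the FFT setting

# ... set up `model` and `basis` as usual, then:
# scfres = self_consistent_field(basis, tol=1e-8)
```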
Note that this picture is likely to change in future versions of DFTK and Julia, as improvements to threading and parallelisation are made in the language or the code.