Timings and parallelization
This section summarizes the options DFTK offers to monitor and influence performance of the code.
Timing measurements
By default DFTK uses TimerOutputs.jl to record timings, memory allocations and the number of calls for selected routines inside the code. These numbers are accessible in the object DFTK.timer. Since the timings are automatically accumulated inside this data structure, you should reset the timer before running the calculation of interest.
For example, to measure the timing of an SCF:
DFTK.reset_timer!(DFTK.timer)
scfres = self_consistent_field(basis, tol=1e-8)
DFTK.timer
 ────────────────────────────────────────────────────────────────────────────────────
                                            Time                    Allocations
                                   ──────────────────────   ────────────────────────
          Tot / % measured:             776ms / 23.8%            116MiB / 33.5%

 Section                  ncalls     time   %tot      avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────────────────
 self_consistent_field         1    184ms  99.5%    184ms   38.6MiB  99.4%  38.6MiB
   LOBPCG                     14    112ms  60.4%   7.99ms   12.2MiB  31.3%   889KiB
     Hamiltonian mu...        44   75.6ms  40.9%   1.72ms   3.02MiB  7.79%  70.4KiB
       kinetic+local          44   70.8ms  38.3%   1.61ms    200KiB  0.50%  4.55KiB
       nonlocal               44   2.27ms  1.23%   51.6μs    852KiB  2.14%  19.4KiB
     rayleigh_ritz            30   6.59ms  3.56%    220μs   0.96MiB  2.46%  32.6KiB
     ortho                   112   5.15ms  2.78%   46.0μs    945KiB  2.38%  8.44KiB
     block multipli...       123   3.56ms  1.93%   29.0μs   1.73MiB  4.46%  14.4KiB
   compute_density             7   32.4ms  17.5%   4.63ms   5.44MiB  14.0%   796KiB
   energy_hamiltonian         15   30.9ms  16.7%   2.06ms   13.3MiB  34.2%   905KiB
     ene_ops                  15   25.5ms  13.8%   1.70ms   9.13MiB  23.5%   624KiB
       ene_ops: xc            15   13.6ms  7.35%    906μs   2.85MiB  7.35%   195KiB
       ene_ops: har...        15   4.97ms  2.69%    331μs   3.97MiB  10.2%   271KiB
       ene_ops: local         15   2.48ms  1.34%    166μs   1.61MiB  4.15%   110KiB
       ene_ops: non...        15   2.04ms  1.10%    136μs    159KiB  0.40%  10.6KiB
       ene_ops: kin...        15   1.46ms  0.79%   97.3μs    528KiB  1.33%  35.2KiB
   QR orthonormaliz...        14    394μs  0.21%   28.1μs    159KiB  0.40%  11.4KiB
   SimpleMixing                7    244μs  0.13%   34.9μs    959KiB  2.41%   137KiB
 guess_density                 1    933μs  0.50%    933μs    231KiB  0.58%   231KiB
 guess_spin_density            1    292ns  0.00%    292ns     0.00B  0.00%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────
The output produced when printing or displaying DFTK.timer now shows a nice table summarising total time and allocations as well as a breakdown over individual routines.
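Beyond the printed table, the accumulated data can also be queried programmatically via the accessor functions of TimerOutputs; a small sketch (assuming the shown accessors, TimerOutputs.tottime and TimerOutputs.totallocated, are available in your installed version of TimerOutputs.jl):
using TimerOutputs
# Total measured wall time (in nanoseconds) and total allocations (in bytes)
# accumulated inside DFTK.timer since the last reset
TimerOutputs.tottime(DFTK.timer)
TimerOutputs.totallocated(DFTK.timer)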
Timing measurements have the unfortunate disadvantage that they alter the way stack traces look, which sometimes makes it harder to find errors when debugging. For this reason timing measurements can be disabled completely (i.e. not even compiled into the code) by setting the environment variable DFTK_TIMING to "0" or "false". For this to take effect a full recompilation of DFTK (including the precompile cache) is needed.
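For example, one way to set the variable is from within Julia before DFTK is loaded; a sketch (assuming DFTK has not yet been loaded in this session):
# Disable timing instrumentation; note that a changed value of DFTK_TIMING only
# takes effect once DFTK (including its precompile cache) has been recompiled.
ENV["DFTK_TIMING"] = "0"
using DFTK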
Unfortunately measuring timings in TimerOutputs is not yet thread-safe. Therefore timings of threaded parts of the code are disabled unless you set DFTK_TIMING to "all". In this case you must not use Julia threading (see the section below), otherwise undefined behaviour results.
Threading
At the moment DFTK employs shared-memory parallelism using multiple levels of threading, which distribute the workload over different $k$-points and bands, or parallelize individual FFT and BLAS calls between processors. At its current stage our approach to threading is quite simple and pragmatic, such that we do not yet achieve good scaling. This should be improved in future versions of the code.
Finding a sweet spot between the number of threads and the extra performance gained by each additional working core is not always easy, since starting, terminating and synchronising threads takes time as well. Most importantly, the best threading settings depend on both the hardware and the problem (e.g. number of bands, $k$-points, FFT grid size).
For the moment DFTK does not offer an automated selection of threading and parallelization settings and just uses the Julia defaults. Since these are rarely good, users are advised to use the timing capabilities of DFTK to experiment with threading for their particular use case before running larger calculations.
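One way to experiment is to rerun a small but representative SCF with different thread counts and compare the recorded timings; a rough sketch (assuming basis has been set up as usual and following the FFTW and BLAS sections below):
using DFTK
using FFTW
using LinearAlgebra

for n in (1, 2, 4)
    # Adjust the FFTW and BLAS thread counts (see the sections below)
    FFTW.set_num_threads(n)
    BLAS.set_num_threads(n)

    # Rerun the same SCF and inspect the accumulated timings
    DFTK.reset_timer!(DFTK.timer)
    self_consistent_field(basis, tol=1e-8)
    println("Timings with $n FFTW/BLAS threads:")
    display(DFTK.timer)
end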
FFTW threads
For typical small to medium-sized calculations in DFTK the largest part of the time (about 80 to 90%) is spent doing discrete Fourier transforms. For this reason parallelising FFTs can have a large effect on the runtime of larger calculations in DFTK. Unfortunately the scaling of FFT threading is not great for smaller problem sizes and large numbers of threads, such that threading in FFTW is even disabled by default.
The recommended setting for FFT threading with DFTK is therefore to only use a moderate number of FFT threads, something like $2$ or $4$, and to disable FFT threading completely for smaller calculations. To enable parallelization of FFTs (which is disabled by default), use
using FFTW
FFTW.set_num_threads(N)
where N is the number of threads you desire.
BLAS threads
All BLAS calls in Julia go through a parallelized OpenBLAS or MKL (with MKL.jl). Generally threading in BLAS calls is far from optimal and the default settings can be pretty bad. For example, for CPUs with hyperthreading enabled, the default number of threads seems to equal the number of virtual cores. Still, BLAS calls typically take second place in terms of the share of runtime they make up (between 10% and 20%). Notably, many of these calls do not act on matrices of the size of the full FFT grid, but rather only on a subspace (e.g. orthogonalization, Rayleigh-Ritz, ...), such that parallelization is either disabled by the BLAS library anyway or not very effective.
The recommendation is therefore to use the same number of threads as for the FFT threads. You can set the number of BLAS threads by
using LinearAlgebra
BLAS.set_num_threads(N)
where N is the number of threads you desire. To check the number of BLAS threads currently used, run
Int(ccall((BLAS.@blasfunc(openblas_get_num_threads), BLAS.libblas), Cint, ()))
or (from Julia 1.6 onwards) simply BLAS.get_num_threads().
Julia threads
On top of FFT and BLAS threading DFTK uses Julia threads (Threads.@threads) in a couple of places to parallelize over $k$-points (density computation) or bands (Hamiltonian application). The number of threads used for these aspects is controlled by the environment variable JULIA_NUM_THREADS. To influence the number of Julia threads used, set this variable before starting the Julia process.
Notice that Julia threading is applied on top of FFTW and BLAS threading, in the sense that the regions parallelized by Julia threads again use parallelized FFT and BLAS calls, such that the effects are not orthogonal. Compared to FFT and BLAS threading, the parallelization implied by using Julia threads tends to scale better, but its effectiveness is limited by the number of bands and the number of irreducible $k$-points used in the calculation. Therefore this good scaling quickly diminishes for small to medium-sized systems.
The recommended setting is to stick to 2 Julia threads and to use 4 or more Julia threads only for large systems and/or many $k$-points. To check the number of Julia threads use Threads.nthreads().
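For a quick sanity check of the threading setup of a running session you can query the relevant thread counts directly, for example:
using LinearAlgebra

@show Threads.nthreads()      # Julia threads, controlled by JULIA_NUM_THREADS
@show BLAS.get_num_threads()  # BLAS threads (Julia 1.6 and later)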
Summary of recommended settings
Calculation size | JULIA_NUM_THREADS | FFTW.set_num_threads(N) | BLAS.set_num_threads(N) |
---|---|---|---|
tiny | 1 | 1 | 1 |
small | 2 | 1 | 1 |
medium | 2 | 2 | 2 |
large | 2 | 4 | 4 |
Note that this picture is likely to change in future versions of DFTK and Julia as improvements to threading and parallelization are made in the language or the code.
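As an illustration, a session for a medium-sized calculation following the table above could be set up as follows (a sketch; adapt the numbers to your hardware and problem):
# Start Julia with two Julia threads beforehand, e.g. by setting JULIA_NUM_THREADS=2
# in the environment (or, from Julia 1.5 onwards, via the -t/--threads flag).
using DFTK
using FFTW
using LinearAlgebra

FFTW.set_num_threads(2)   # moderate FFT threading (see FFTW threads above)
BLAS.set_num_threads(2)   # matching number of BLAS threads (see BLAS threads above)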
MPI
At the moment DFTK uses MPI only to distribute over $k$-points. This should be the most performant method of parallelization: if you have multiple $k$-points, start by disabling all threading and use MPI. Simply follow the setup instructions of MPI.jl and run DFTK under MPI:
mpiexecjl -np 16 julia myscript.jl
Notice that we use mpiexecjl (see the MPI.jl docs) to automatically select the mpiexec compatible with the MPI version used by Julia.
Issues and workarounds:
- Printing is garbled, as usual with MPI. You can use DFTK.mpi_master() || (redirect_stdout(); redirect_stderr()) at the top of your script to disable printing on all processes but one.
- This feature is still experimental and some routines (e.g. band structure and direct minimization) are not yet compatible with it.
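To illustrate, the top of an MPI-aware script might look like the following sketch; the model and basis setup as well as the SCF call stay exactly the same as in the serial case:
# myscript.jl -- run e.g. with: mpiexecjl -np 16 julia myscript.jl
using DFTK

# Disable printing on all processes but the master rank (see above)
DFTK.mpi_master() || (redirect_stdout(); redirect_stderr())

# Set up model and basis and run the SCF exactly as in the serial case;
# DFTK distributes the k-points over the MPI processes.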