This section summarizes the options DFTK offers to monitor and influence performance of the code.
By default DFTK uses TimerOutputs.jl to record timings, memory allocations and the number of calls for selected routines inside the code. These numbers are accessible in the object
DFTK.timer. Since the timings are automatically accumulated inside this datastructure, any timing measurement should first reset this timer before running the calculation of interest.
For example to measure the timing of an SCF:
DFTK.reset_timer!(DFTK.timer) scfres = self_consistent_field(basis, tol=1e-8) DFTK.timer
────────────────────────────────────────────────────────────────────────────── Time Allocations ────────────────────── ─────────────────────── Tot / % measured: 1.08s / 24.0% 94.2MiB / 47.2% Section ncalls time %tot avg alloc %tot avg ────────────────────────────────────────────────────────────────────────────── self_consistent_field 1 258ms 100% 258ms 44.2MiB 99.5% 44.2MiB compute_density 8 132ms 50.8% 16.4ms 7.47MiB 16.8% 956KiB LOBPCG 16 86.8ms 33.5% 5.42ms 12.6MiB 28.3% 805KiB Hamiltonian mu... 44 66.8ms 25.8% 1.52ms 3.67MiB 8.25% 85.4KiB kinetic+local 44 64.0ms 24.7% 1.45ms 797KiB 1.75% 18.1KiB nonlocal 44 1.62ms 0.62% 36.8μs 876KiB 1.92% 19.9KiB rayleigh_ritz 28 4.08ms 1.58% 146μs 897KiB 1.97% 32.0KiB ortho 99 4.01ms 1.55% 40.5μs 809KiB 1.78% 8.17KiB block multipli... 103 1.66ms 0.64% 16.1μs 1.47MiB 3.31% 14.6KiB energy_hamiltonian 17 32.9ms 12.7% 1.94ms 15.5MiB 34.9% 934KiB ene_ops 17 29.3ms 11.3% 1.72ms 10.7MiB 24.1% 645KiB ene_ops: xc 17 20.3ms 7.84% 1.19ms 3.31MiB 7.45% 200KiB ene_ops: har... 17 4.13ms 1.60% 243μs 4.55MiB 10.2% 274KiB ene_ops: local 17 1.89ms 0.73% 111μs 1.91MiB 4.29% 115KiB ene_ops: non... 17 1.24ms 0.48% 72.8μs 203KiB 0.44% 11.9KiB ene_ops: kin... 17 967μs 0.37% 56.9μs 705KiB 1.55% 41.5KiB QR orthonormaliz... 16 293μs 0.11% 18.3μs 160KiB 0.35% 10.0KiB SimpleMixing 8 144μs 0.06% 18.1μs 1.07MiB 2.41% 137KiB guess_density 1 446μs 0.17% 446μs 233KiB 0.51% 233KiB guess_spin_density 1 200ns 0.00% 200ns 0.00B 0.00% 0.00B ──────────────────────────────────────────────────────────────────────────────
The output produced when printing or displaying the
DFTK.timer now shows a nice table summarising total time and allocations as well as a breakdown over individual routines.
Timing measurements have the unfortunate disadvantage that they alter the way stack traces look making it sometimes harder to find errors when debugging. For this reason timing measurements can be disabled completely (i.e. not even compiled into the code) by setting the environment variable
"false". For this to take effect recompiling all DFTK (including the precompile cache) is needed.
Unfortunately measuring timings in
TimerOutputs is not yet thread-safe. Therefore taking timings of threaded parts of the code will be disabled unless you set
"all". In this case you must not use Julia threading (see section below) or otherwise undefined behaviour results.
At the moment DFTK offers two ways to parallelize a calculation, firstly shared-memory parallelism using threading and secondly multiprocessing using MPI (via the MPI.jl Julia interface). MPI-based parallelism is currently only over
k-Points, such that it cannot be used for calculations with only a single
k-Point. Otherwise combining both forms of parallelism is possible as well.
The scaling of both forms of parallelism for a number of test cases is demonstrated in the following figure. These values were obtained using DFTK version 0.1.17 and Julia 1.6 and the precise scalings will likely be different depending on architecture, DFTK or Julia version. The rough trends should, however, be similar.
The MPI-based parallelization strategy clearly shows a superior scaling and should be preferred if available.
Currently DFTK uses MPI to distribute on
k-Points only. This implies that calculations with only a single
k-Point cannot use make use of this. For details on setting up and configuring MPI with Julia see the MPI.jl documentation.
First disable all threading inside DFTK, by adding the following to your script running the DFTK calculation:
using DFTK disable_threading()
Run Julia in parallel using the
mpiexecjlwrapper script from MPI.jl:
mpiexecjl -np 16 julia myscript.jl
-np 16tells MPI to use 16 processes and
-t 1tells Julia to use one thread only. Notice that we use
mpiexecjlto automatically select the
mpiexeccompatible with the MPI version used by MPI.jl.
As usual with MPI printing will be garbled. You can use
DFTK.mpi_master() || (redirect_stdout(); redirect_stderr())
at the top of your script to disable printing on all processes but one.
Even though MPI-based parallelism shows the better scaling it is still experimental and some routines (e.g. band structure and direct minimization) are not compatible with it yet.
Threading in DFTK currently happens on multiple layers distributing the workload over different $k$-Points, bands or within an FFT or BLAS call between threads. At its current stage our scaling for thread-based parallelism is worse compared MPI-based and therefore the parallelism described here should only be used if no other option exists. To use thread-based parallelism proceed as follows:
Ensure that threading is properly setup inside DFTK by adding to the script running the DFTK calculation:
using DFTK setup_threading()
This disables FFT threading and sets the number of BLAS threads to the number of Julia threads.
Run Julia passing the desired number of threads using the flag
julia -t 8 myscript.jl
For some cases (e.g. a single
k-Point, fewish bands and a large FFT grid) it can be advantageous to add threading inside the FFTs as well. One example is the Caffeine calculation in the above scaling plot. In order to do so just call
setup_threading(n_fft=2), which will select two FFT threads. More than two FFT threads is rarely useful.
The default threading setup done by
setup_threading is to select one FFT thread and the same number of BLAS and Julia threads. This section provides some info in case you want to change these defaults.
All BLAS calls in Julia go through a parallelized OpenBlas or MKL (with MKL.jl. Generally threading in BLAS calls is far from optimal and the default settings can be pretty bad. For example for CPUs with hyper threading enabled, the default number of threads seems to equal the number of virtual cores. Still, BLAS calls typically take second place in terms of the share of runtime they make up (between 10% and 20%). Of note many of these do not take place on matrices of the size of the full FFT grid, but rather only in a subspace (e.g. orthogonalization, Rayleigh-Ritz, ...) such that parallelization is either anyway disabled by the BLAS library or not very effective. To set the number of BLAS threads use
using LinearAlgebra BLAS.set_num_threads(N)
N is the number of threads you desire. To check the number of BLAS threads currently used, you can use
Int(ccall((BLAS.@blasfunc(openblas_get_num_threads), BLAS.libblas), Cint, ()))
or (from Julia 1.6) simply
On top of BLAS threading DFTK uses Julia threads (
Thread.@threads) in a couple of places to parallelize over
k-Points (density computation) or bands (Hamiltonian application). The number of threads used for these aspects is controlled by the flag
-t passed to Julia or the environment variable
JULIA_NUM_THREADS. To check the number of Julia threads use
Since FFT threading is only used in DFTK inside the regions already parallelized by Julia threads, setting FFT threads to something larger than
1 is rarely useful if a sensible number of Julia threads has been chosen. Still, to explicitly set the FFT threads use
using FFTW FFTW.set_num_threads(N)
N is the number of threads you desire. By default no FFT threads are used, which is almost always the best choice.