.. _quantum_espresso_card:

QuantumESPRESSO
===============

The following guide describes how to load, configure and use QuantumESPRESSO on CINECA's clusters. QuantumESPRESSO is available on the :ref:`hpc/leonardo:Leonardo` and :ref:`hpc/galileo:Galileo100` clusters.

Relevant links
^^^^^^^^^^^^^^

- QE repository: https://gitlab.com/QEF/q-e.git
- MaX benchmarks: https://gitlab.com/max-centre/benchmarks-max3.git
- JUBE xmls: https://gitlab.com/max-centre/JUBE4MaX.git
- spack recipe: https://gitlab.com/spack/spack/-/blob/develop/var/spack/repos/builtin.mock/packages/quantum-espresso/package.py

Modules
^^^^^^^

CPU-based and GPU-based machines deploy QuantumESPRESSO with different software stacks, to fully exploit the underlying hardware. In particular:

- **Intel/oneAPI** compiler and MPI implementation on G100 and Leonardo DCGP, plus **MKL** for FFT, BLAS/LAPACK and ScaLAPACK;
- **NVHPC** compiler and **OpenMPI/HPCX-MPI** on Leonardo Booster, plus **OpenBLAS** and **FFTW** libraries.

Installations based on the GCC compiler are not optimized for performance and are provided only for the post-processing executables.

Alternative Installations
^^^^^^^^^^^^^^^^^^^^^^^^^

If you wish to install your own version of QuantumESPRESSO, we suggest using CMake with the options provided in the `Wiki of the official repository <https://gitlab.com/QEF/q-e/-/wikis/home>`_ for the CINECA cluster in use.

Parallelization strategies
^^^^^^^^^^^^^^^^^^^^^^^^^^

QuantumESPRESSO supports different parallelization strategies:

- R&G processes (the default level, used when no option is given) to distribute real/reciprocal space;
- pools (`-nk`) to distribute k-points;
- images (`-ni`) to distribute irreducible representations or q-points in a dispersion;
- band processes (`-nb`) to distribute the Kohn-Sham states;
- linear-algebra processes (set automatically) to distribute the diagonalization, via ScaLAPACK or a custom algorithm. For GPU installations, the diagonalization is done on a single GPU (ScaLAPACK is not used).

We suggest the following for optimal performance on Leonardo Booster:

- Prioritize pools over R&G, in particular for workloads with hundreds of planes or fewer in the z-direction, also for intra-node distribution.
- The minimum number of k-points per pool (`kunit`) is 1 in PWSCF (`kunit=1`), while in phonon it is usually `kunit=2`, except in the following cases: (i) `lgamma` and not `noncolin` or `domag`: `kunit=1`; (ii) not `lgamma` but `noncolin` and `domag`: `kunit=4`.
- Images implement independent calculations, but they might be affected by workload imbalance, so a mixture of images and pools usually provides the best performance.

More detailed information about parallelization strategies can be found in the `QuantumESPRESSO user guide <https://www.quantum-espresso.org/Doc/user_guide/>`_.

GPU performance considerations and troubleshooting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Mapping and binding
"""""""""""""""""""

If you are using the node in exclusive mode, distribute the resources among the MPI tasks (usually 1 MPI task per GPU) as follows:

1. Set the SLURM options to ask for 8 CPUs per task:

.. code-block:: bash

   #SBATCH --nodes=<number_of_nodes>
   #SBATCH --ntasks-per-node=4
   #SBATCH --cpus-per-task=8

2. Launch the MPI application by mapping the tasks over the full node with 8 CPUs per task. The following code snippets show the command line for mpirun and srun:

.. code-block:: bash

   mpirun --map-by node:PE=$SLURM_CPUS_PER_TASK --rank-by core pw.x

.. code-block:: bash

   srun --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores pw.x

3. The MPI-GPU binding is done in the source code, so an external binding with `CUDA_VISIBLE_DEVICES` is not needed.
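Putting these steps together, the following is a minimal jobscript sketch for a single exclusive Booster node. The account, partition and module names are placeholders to adapt to your environment, and `-nk 4` is only an illustration of the pool-first advice given above.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4            # 1 MPI task per GPU
   #SBATCH --cpus-per-task=8
   #SBATCH --gres=gpu:4
   #SBATCH --partition=<partition>        # placeholder
   #SBATCH --account=<account>            # placeholder

   module load <quantum-espresso-module>  # placeholder: check `module avail` for the exact name

   # 1 task per GPU, 8 cores per task, pools prioritized over R&G
   srun --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores \
        pw.x -nk 4 -i pw.in > pw.out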
Multi-node multi-GPU runs
"""""""""""""""""""""""""

If your workload requires **multi-node distribution** due to memory constraints on the GPUs, we suggest testing the following environment variables to improve performance.

**Memory issues during the SCF loop (OOM)**

Your code crashes after some iteration steps in the SCF loop. Instead of increasing the number of nodes, try adding the following environment variable to your jobscript:

.. code-block:: bash

   export UCX_TLS=^cuda_ipc

This error is due to handles automatically created by the MPI library when calling Isend+Irecv+Waitall.

.. note::
   Do not export this environment variable if using a number of R&G processes less than or equal to 4.

**Increase multi-node bandwidth**

If your code distributes FFTs across multiple nodes, the MPI installation might not use all the NICs available for inter-node communication. Try interposing this script between the MPI launcher and the executable:

.. code-block:: bash

   #!/bin/bash
   # Replace SLURM_LOCALID with OMPI_COMM_WORLD_LOCAL_RANK if using mpirun
   case ${SLURM_LOCALID} in
   0) export UCX_NET_DEVICES=mlx5_0:1 CUDA_VISIBLE_DEVICES=0 ;;
   1) export UCX_NET_DEVICES=mlx5_1:1 CUDA_VISIBLE_DEVICES=1 ;;
   2) export UCX_NET_DEVICES=mlx5_2:1 CUDA_VISIBLE_DEVICES=2 ;;
   3) export UCX_NET_DEVICES=mlx5_3:1 CUDA_VISIBLE_DEVICES=3 ;;
   esac
   echo "Launching on $UCX_NET_DEVICES"
   # Run the command passed by the MPI launcher
   exec "$@"

.. note::
   The local-rank environment variable depends on the MPI launcher: `SLURM_LOCALID` with srun, `OMPI_COMM_WORLD_LOCAL_RANK` with mpirun.

**Improve multi-node scaling with FFT distribution**

If you need to distribute FFTs over multiple nodes, the messages exchanged among processes become small and fall into the so-called 'eager' regime. Try reducing the threshold of the rendezvous protocol, which can be more efficient on GPUs:

.. code-block:: bash

   export UCX_RNDV_THRESH=8192
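As a final sketch, the knobs above might be combined in a multi-node jobscript as follows. The wrapper filename `bind_nics.sh` is a hypothetical name for the NIC-binding script shown above, and whether each variable actually helps depends on the workload, so test them one at a time.

.. code-block:: bash

   # Avoid the CUDA IPC handles created during Isend+Irecv+Waitall
   # (only with more than 4 R&G processes, see the note above)
   export UCX_TLS=^cuda_ipc

   # Use the rendezvous protocol for messages of 8 KiB and above
   export UCX_RNDV_THRESH=8192

   # bind_nics.sh is the NIC/GPU binding wrapper from this section (hypothetical filename)
   srun --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores \
        ./bind_nics.sh pw.x -i pw.in > pw.out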