Pitagora
Pitagora is the new EUROfusion supercomputer hosted by CINECA and currently being built at CINECA’s headquarters in Casalecchio di Reno, Bologna, Italy. The cluster is supplied by Lenovo and is composed of two partitions: a CPU-based general-purpose partition named DCGP and an accelerated partition, based on NVIDIA H100 accelerators, named Booster.
This guide covers only the information that is specific to the Pitagora cluster and deviates from the general behavior described in the HPC Clusters sections.
Access to the System
The machine is reachable via the SSH (Secure Shell) protocol at the hostname login.pitagora.cineca.it.
The connection is automatically established to one of the available login nodes. It is also possible to connect to a specific login node using one of the following hostnames:
login01-ext.pitagora.cineca.it
login02-ext.pitagora.cineca.it
login03-ext.pitagora.cineca.it
login04-ext.pitagora.cineca.it
login05-ext.pitagora.cineca.it
login06-ext.pitagora.cineca.it
Warning
Access to Pitagora requires two-factor authentication (2FA). More information is available in the section Access to the Systems.
Note
Even-numbered login nodes have the same architecture as the Booster partition’s compute nodes, while odd-numbered login nodes have the same architecture as the DCGP partition’s compute nodes.
login-boost.pitagora.cineca.it connects users to one of the even-numbered login nodes in a round-robin fashion.
login-dcgp.pitagora.cineca.it connects users to one of the odd-numbered login nodes in a round-robin fashion.
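As a minimal sketch of how these aliases might be used, the helper below maps a target partition to the matching round-robin alias. The function name `login_host` and the `<username>` placeholder are hypothetical and not part of any CINECA tooling; the 2FA setup is still required before connecting.

```shell
#!/bin/bash
# Hypothetical helper: pick the login alias matching the target partition.
login_host() {
    case "${1:-}" in
        boost) echo "login-boost.pitagora.cineca.it" ;;  # even-numbered login nodes
        dcgp)  echo "login-dcgp.pitagora.cineca.it" ;;   # odd-numbered login nodes
        *)     echo "login.pitagora.cineca.it" ;;        # any available login node
    esac
}

# Example usage (replace <username> with your own username):
#   ssh "<username>@$(login_host boost)"
login_host dcgp
```

Logging in to a node with the same architecture as the target compute nodes is mainly useful when compiling code for that partition.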
System Architecture
The system, supplied by Lenovo, is based on two new specifically designed compute blades, which are available through two distinct SLURM partitions on the cluster:
GPU blade based on NVIDIA H100 accelerators - Booster partition.
CPU-only blade based on AMD Turin 128c processors - Data Centric General Purpose (DCGP) partition.
The overall system architecture uses NVIDIA Mellanox InfiniBand High Data Rate (HDR) connectivity, with smart in-network computing acceleration engines that enable extremely low latency and high data throughput to provide the highest AI and HPC application performance and scalability.
Hardware Details
Booster partition

| Type | Specific |
|---|---|
| Models | Lenovo SD650-N V3 |
| Racks | 7 |
| Nodes | 168 |
| Processors/node | 2x Intel Emerald Rapids Xeon Gold 6548Y+ 32c 2.5 GHz |
| CPU/node | 64 |
| Accelerators/node | 4x NVIDIA H100 SXM 94GB HBM2e |
| Local Storage/node (tmpfs) | |
| RAM/node | 512 GiB DDR5 5600 MHz |
| Rmax | 27.27 PFlop/s (top500) |
| Internal Network | NVIDIA ConnectX-7 NDR200 |
| Storage (raw capacity) | 2 x 7.68 GiB SSDs (HW RAID 1) |
DCGP partition

| Type | Specific |
|---|---|
| Models | Lenovo SD665 V3 |
| Racks | 14 |
| Nodes | 1008 |
| Processors/node | 2x AMD Turin EPYC 9745 128c 2.4 GHz - Zen5 microarch |
| CPU/node | 256 |
| Accelerators/node | (none) |
| Local Storage/node (tmpfs) | |
| RAM/node | 768 GiB DDR5 6400 MHz |
| Rmax | 17 PFlop/s (top500) |
| Internal Network | NVIDIA ConnectX-7 NDR SharedIO 200Gbit/s |
| Storage (raw capacity) | Diskless nodes |
File Systems and Data Management
The storage organization conforms to the CINECA infrastructure. General information is reported in the File Systems and Data Management section. Only the differences from the general behavior are listed and explained below.
Warning
The backup service on the $HOME area is currently not active because our new automatic backup system is still being installed. The installation will be completed in the coming weeks.
Job Managing and SLURM Partitions
The following tables list the SLURM partitions available on the Booster and DCGP partitions of the production environment. Please note that the SLURM email service is not active yet.
See also
Further information about job submission is reported in the general section Scheduler and Job Submission.
Booster partition

| Partition | QOS | TRES Limits per job | Walltime | MaxTRES per User | Priority | Notes |
|---|---|---|---|---|---|---|
| ptgr_all_serial (default) | normal | Max = 4 cores | 04:00:00 | 4 cores | 40 | No GPUs; budget free; max 4 running jobs and 10 pending jobs per user |
| boost_fua_prod | normal | Max = 16 nodes | 24:00:00 | 32 nodes | 40 | |
| boost_fua_prod | boost_qos_fuabprod | Min = 17 full nodes, Max = 32 nodes | 24:00:00 | 32 nodes | 60 | runs on 96 nodes (GrpTRES) |
| boost_fua_dbg | normal | Max = 2 nodes | 00:30:00 | | 40 | runs on 2 nodes (GrpTRES) |
DCGP partition

| Partition | QOS | TRES Limits per job | Walltime | MaxTRES per User | Priority | Notes |
|---|---|---|---|---|---|---|
| ptgr_all_serial (default) | normal | Max = 4 cores | 04:00:00 | 4 cores | 40 | No GPUs; budget free; max 4 running jobs and 10 pending jobs per user |
| dcgp_fua_prod | normal | Max = 64 nodes | 24:00:00 | 64 nodes | 40 | |
| dcgp_fua_prod | dcgp_qos_fuabprod | Min = 65 full nodes, Max = 128 nodes | 24:00:00 | 128 nodes | 60 | runs on 640 nodes (GrpTRES) |
| dcgp_fua_prod | dcgp_qos_fualprod | Max = 3 nodes | 4-00:00:00 | 3 nodes | 40 | |
| dcgp_fua_dbg | normal | Max = 2 nodes | 00:30:00 | 2 nodes | 40 | runs on 8 nodes (GrpTRES) |
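To illustrate how the limits above translate into a job request, the sketch below writes a minimal batch script for the dcgp_fua_prod partition with the normal QOS. The file name `job_dcgp.sh`, the `<account>` placeholder, the executable `./my_app`, and the split of the 256 cores into 32 tasks of 8 cores each are assumptions made for the example, not site recommendations.

```shell
#!/bin/bash
# Write a minimal DCGP job script to job_dcgp.sh.
# <account> and ./my_app are placeholders to be replaced.
cat > job_dcgp.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=dcgp_fua_prod
#SBATCH --qos=normal
# The normal QOS allows at most 64 nodes and 24 hours per job.
#SBATCH --nodes=2
#SBATCH --time=24:00:00
# 32 tasks x 8 cores = 256 cores, i.e. one full DCGP node.
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=8
#SBATCH --account=<account>

srun --cpu-bind=cores ./my_app
EOF

# Submit from a login node with:
#   sbatch job_dcgp.sh
```

For a short test run, the same script could target dcgp_fua_dbg instead, keeping the request within 2 nodes and 30 minutes.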
Processes/Threads Binding/Affinity
Processes Binding
By default, srun (SLURM launcher) performs an automatic binding. For multi-threaded applications, request the proper --cpus-per-task and bind the processes to cores (srun --cpu-bind=cores).
By default, OpenMPI libraries (mpirun launcher) bind processes to cores. For multi-threaded applications this causes CPU oversubscription, because all the threads of a process share a single core. Ensure that processes are either not bound at all (by specifying --bind-to none) or bound to multiple cores, using an appropriate binding level or a specific number of processing elements per application process (--map-by socket:PE=$SLURM_CPUS_PER_TASK).
By default, IntelMPI libraries (mpirun launcher with the hydra process manager) perform a correct binding. If you opt for the IntelMPI mpirun as launcher, unset the I_MPI_PMI_LIBRARY variable (meant for using IntelMPI with srun), which is defined when loading the module, to avoid verbose warnings.
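The launcher options discussed above can be sketched as follows for a hybrid MPI+OpenMP executable. The executable name `./my_app` and the fallback value of 8 cores per task are placeholders, and setting PE to the number of cores per task is an assumption about the intended layout. The snippet only prints the commands, so it can be tried outside a job.

```shell
#!/bin/bash
# Cores per MPI task: taken from SLURM inside a job that requested
# --cpus-per-task, otherwise an arbitrary fallback for this dry run.
CPUS_PER_TASK="${SLURM_CPUS_PER_TASK:-8}"

# srun: request the cores for each task and bind every task to its cores.
SRUN_CMD="srun --cpus-per-task=${CPUS_PER_TASK} --cpu-bind=cores ./my_app"

# OpenMPI mpirun: assign CPUS_PER_TASK processing elements to each rank,
# so that the threads of a rank do not all share one core.
MPIRUN_CMD="mpirun --map-by socket:PE=${CPUS_PER_TASK} ./my_app"

# Dry run: print the two alternative launch lines.
echo "$SRUN_CMD"
echo "$MPIRUN_CMD"
```

Only one of the two launch lines is needed in a real job script, depending on the launcher in use.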
Threads Affinity
None of the available compilers (gcc, nvhpc, aocc, intel) binds threads to cores by default. You can control thread affinity with the standard OMP_PLACES and OMP_PROC_BIND environment variables.
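A minimal sketch of one common affinity setup is shown below. The specific values (`cores`, `close`) are an illustrative choice rather than a site recommendation, and the best setting depends on the application.

```shell
#!/bin/bash
# One OpenMP thread per core allocated to the task
# (falls back to 1 when SLURM_CPUS_PER_TASK is not set).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
# Define one place per physical core.
export OMP_PLACES=cores
# Bind threads to places close to the one running the primary thread.
export OMP_PROC_BIND=close
```

Using `spread` instead of `close` distributes the threads as evenly as possible over the available places, which may suit memory-bandwidth-bound codes better.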
MPI Advanced Configuration
Enable UCC for Collective Communications
UCC (Unified Collective Communication) is a library designed to efficiently handle collective communication operations (e.g., allreduce, broadcast) across multiple nodes and GPUs.
Developed to support deep learning workloads (such as PyTorch-based recommender models) running on systems with multiple network paths (“multi-rail”) and GPUs, it has also been designed and implemented for high-performance PGAS (Partitioned Global Address Space) applications and runtimes - HPC programs where memory is logically shared but physically distributed, making efficient communication critical.
As stated by NVIDIA, UCC serves as a drop-in replacement for HCOLL and is expected to gradually become the default collective library as it implements the full range of HCOLL hierarchical algorithms. Applications or MPI stacks currently using HCOLL can therefore switch to UCC without major code changes.
Important
In general, we recommend enabling UCC for collective communications, especially for GPU-based workloads, as it can provide significant performance improvements compared to the native OpenMPI collective component [Venkata MG et al., IEEE 2024].
To enable UCC for collective communications, load one of the following OpenMPI modules (all built with UCC support, version 1.6.0):
hpcx-mpi/2.25.1 # OpenMPI 4.1.6 with UCC 1.6.0 and UCX 1.20, shipped with HPC-X 2.25.1 (compiled by NVIDIA with GCC 8.5.0)
openmpi/4.1.6--gcc--12.3.0-ucx1.20 # OpenMPI 4.1.6 with UCC 1.6.0 and UCX 1.20, compiled by Cineca with GCC 12.3.0
openmpi/5.0.9--gcc--12.3.0-ucx1.20 # OpenMPI 5.0.9 with UCC 1.6.0 and UCX 1.20, compiled by Cineca with GCC 12.3.0
Then export the following environment variables to enable UCC as the collective communication library for MPI:
export OMPI_MCA_coll_ucc_enable=1
export OMPI_MCA_coll_ucc_priority=100
We recommend setting a high priority for UCC to ensure it is selected as the default collective component when available.
Warning
Asymmetric source/destination memory types between sender and receiver (for example, when the sender uses host memory and the receiver uses device memory) are not yet supported in UCC. If your application relies on this feature, you may encounter errors similar to the following:
ucc_schedule_pipelined.c:211 UCC ERROR failed to initialize fragment for pipeline
allreduce_sra_knomial.c:234 TL_UCP ERROR failed to init pipelined schedule
allreduce.c:38 TL_UCP ERROR asymmetric src/dst memory types are not supported yet
mpool.c:55 UCX WARN object 0x3077c00 was not returned to mpool tl_ucp_req_mp
IMPORTANT: This is not a blocking issue, as OpenMPI will automatically fall back to the next available collective component (the native one in these modules). Therefore, please note that the error messages may be misleading: they indicate a failure in initializing the collective operation in UCC, but in practice this simply triggers a fallback to the native implementation. As a result, comparing the performance of collective operations with and without UCC enabled may not show any difference in such cases, since both configurations may end up using the native OpenMPI collective component.
Note, however, that not enabling UCC may negatively impact the performance of all symmetric communications.
Known applications affected by this issue include:
- GENE
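To check whether a run hit this fallback, one option is to search the job output for the UCC message quoted above. The helper name `count_ucc_asymmetric` is hypothetical, and the sample log is built only from the error lines shown in this section.

```shell
#!/bin/bash
# Hypothetical helper: count UCC "asymmetric memory types" errors in a log file.
count_ucc_asymmetric() {
    grep -c "asymmetric src/dst memory types are not supported yet" "$1"
}

# Self-contained demo on a sample log made of the messages shown above.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
allreduce_sra_knomial.c:234 TL_UCP ERROR failed to init pipelined schedule
allreduce.c:38 TL_UCP ERROR asymmetric src/dst memory types are not supported yet
EOF

count_ucc_asymmetric "$sample_log"   # prints 1
```

On a real run the function would be pointed at the job's output file. A non-zero count suggests that at least some collectives fell back to the native component.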
For more information about UCC, please refer to the official wiki from OpenUCX and the NVIDIA HPC-X documentation.
Enable NVIDIA SHARP for Network Offloading of Collective Operations
NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) improves the performance of MPI and machine learning collective operations by offloading them from CPUs and GPUs to the network, reducing redundant data transfers between endpoints.
This approach decreases the amount of data traversing the network, leading to faster collective operations. Additionally, offloading these operations frees CPU and GPU resources for computation instead of communication overhead.
Important
As a general rule of thumb, we suggest enabling SHARP for collective operations, especially for GPU-based workloads with a large number of processes, as it can provide significant performance improvements [Ramesh B et al., IEEE 2020]. However, the actual performance gain may vary depending on the application, data size, and communication patterns. Therefore, we strongly recommend testing with and without SHARP enabled to determine the best configuration for your use case.
To enable SHARP for collective operations, load one of the following OpenMPI modules (all built with SHARP support):
hpcx-mpi/2.25.1 # OpenMPI 4.1.6 with UCC 1.6.0 and UCX 1.20, shipped with HPC-X 2.25.1 (compiled by NVIDIA with GCC 8.5.0)
openmpi/4.1.6--gcc--12.3.0-ucx1.20 # OpenMPI 4.1.6 with UCC 1.6.0 and UCX 1.20, compiled by Cineca with GCC 12.3.0
openmpi/5.0.9--gcc--12.3.0-ucx1.20 # OpenMPI 5.0.9 with UCC 1.6.0 and UCX 1.20, compiled by Cineca with GCC 12.3.0
Then export the following environment variables to enable SHARP for collective communications:
# Enable UCC (required)
export OMPI_MCA_coll_ucc_enable=1
export OMPI_MCA_coll_ucc_priority=100
# Enable SHARP (through UCC)
export OMPI_UCC_CL_BASIC_TLS=ucp,sharp
export SHARP_COLL_ENABLE_SAT=1
Note: SHARP support is provided through UCC; therefore, enabling UCC is a prerequisite for using SHARP.
For more information, refer to the official NVIDIA documentation.
Known Issues
This section collects currently known issues affecting Pitagora.
The list below is intended as a quick reference for users who may experience problems on the system. We strongly encourage all users to report any issues they encounter - whether listed here or not - to the user support team.