Pitagora


Pitagora is the new EUROfusion supercomputer hosted by CINECA and installed at CINECA's headquarters in Casalecchio di Reno, Bologna, Italy. The cluster is supplied by Lenovo and consists of two partitions: a CPU-based general-purpose partition named DCGP (Data Centric General Purpose) and an accelerated partition based on NVIDIA H100 GPUs named Booster.
This guide contains information specific to the Pitagora cluster that deviates from the general behavior described in the HPC Clusters sections.
Access to the System
The machine is reachable via the SSH (Secure Shell) protocol at the hostname login.pitagora.cineca.it.
The connection is automatically established to one of the available login nodes. It is also possible to connect to Pitagora using one of the specific login hostnames:
login01-ext.pitagora.cineca.it
login02-ext.pitagora.cineca.it
login03-ext.pitagora.cineca.it
login04-ext.pitagora.cineca.it
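For example, a connection to the generic login address can be opened as follows (<username> is a placeholder for your CINECA username):

ssh <username>@login.pitagora.cineca.it

To reach a specific login node, use one of the hostnames listed above instead, e.g. ssh <username>@login01-ext.pitagora.cineca.it.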
Warning
Access to Pitagora requires two-factor authentication (2FA). More information is available in the section Access to the Systems.
System Architecture
The system, supplied by Lenovo, is based on two new specifically designed compute blades, which are available through two distinct SLURM partitions on the cluster:
GPU blade based on NVIDIA H100 accelerators (Booster partition).
CPU-only blade based on AMD Turin 128-core processors (Data Centric General Purpose, DCGP, partition).
The overall system architecture uses NVIDIA Mellanox InfiniBand NDR connectivity, with smart in-network computing acceleration engines that enable extremely low latency and high data throughput to provide the highest AI and HPC application performance and scalability.
Hardware Details
Booster

| Type | Specific |
|---|---|
| Models | Lenovo SD650-N V3 |
| Racks | 7 |
| Nodes | 168 |
| Processors/node | 2x Intel Emerald Rapids 6548Y 32c 2.4 GHz |
| CPUs/node | 64 |
| Accelerators/node | 4x NVIDIA H100 SXM 80GB HBM2e |
| Local Storage/node (tmpfs) | |
| RAM/node | 512 GiB DDR5 5600 MHz |
| Rmax | 27.27 PFlop/s (Top500) |
| Internal Network | NVIDIA ConnectX-7 NDR200 |
| Storage (raw capacity) | 2x 7.68 TB SSDs (HW RAID 1) |
DCGP

| Type | Specific |
|---|---|
| Models | Lenovo SD665 V3 |
| Racks | 14 |
| Nodes | 1008 |
| Processors/node | 2x AMD Turin 128c (Zen5 microarchitecture) 2.4 GHz |
| CPUs/node | 256 |
| Accelerators/node | (none) |
| Local Storage/node (tmpfs) | |
| RAM/node | 768 GiB DDR5 6400 MHz |
| Rmax | 17 PFlop/s (Top500) |
| Internal Network | NVIDIA ConnectX-7 NDR SharedIO 200 Gbit/s |
| Storage (raw capacity) | Diskless nodes |
Job Management and SLURM Partitions
The following tables report information about the SLURM partitions of the production environment for Booster and DCGP. Please note that the SLURM email service is not active yet. A minimal example of a batch job script is shown after the tables.
See also
Further information about job submission is available in the general section Scheduler and Job Submission.
| Partition | QOS | # Nodes per job | Walltime | Max # Nodes per user | Priority | Notes |
|---|---|---|---|---|---|---|
| boost_fua_prod | normal | max = 16 | 24:00:00 | 32 | 40 | |
| | boost_qos_fuabprod | min = 17 (full nodes), max = 32 | 24:00:00 | 32 | 60 | runs on 96 nodes (GrpTRES) |
| boost_fua_dbg | normal | max = 2 | 00:30:00 | 2 | 40 | runs on 8 nodes (GrpTRES) |
| Partition | QOS | # Nodes per job | Walltime | Max # Nodes per user | Priority | Notes |
|---|---|---|---|---|---|---|
| dcgp_fua_prod | normal | max = 64 | 24:00:00 | 64 | 40 | |
| | dcgp_qos_fuabprod | min = 65 (full nodes), max = 128 | 24:00:00 | 128 | 60 | runs on 640 nodes (GrpTRES) |
| | dcgp_qos_fualprod | max = 3 | 4-00:00:00 | 3 | 40 | |
| dcgp_fua_dbg | normal | max = 2 | 00:30:00 | 2 | 40 | runs on 8 nodes (GrpTRES) |
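As a minimal sketch only, the script below targets the Booster production partition with the default QOS; the account name, resource amounts, module version and executable are placeholders to be adapted to your project, and the GPU request syntax is an assumption (see Scheduler and Job Submission):

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=boost_fua_prod
#SBATCH --qos=normal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4                # assumption: GPUs requested via gres
#SBATCH --time=24:00:00
#SBATCH --account=<your_account>    # placeholder: your EUROfusion account

module purge
module load openmpi/4.1.6--gcc--12.3.0   # version taken from the Known Issues section below

srun --cpu-bind=cores ./my_app           # placeholder executable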
Processes/Threads Binding/Affinity
Processes Binding
By default, srun (the SLURM launcher) performs an automatic binding. For multi-threaded applications, request the proper --cpus-per-task and bind the processes to cores (srun --cpu-bind=cores).
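For example (a sketch; task and thread counts are arbitrary placeholders, as is ./my_app):

export OMP_NUM_THREADS=8                                      # match --cpus-per-task
srun --ntasks=8 --cpus-per-task=8 --cpu-bind=cores ./my_app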
By default, the OpenMPI libraries (mpirun launcher) bind processes to cores. For multi-threaded applications this causes CPU over-allocation. Ensure that processes are either not bound at all (by specifying --bind-to none) or bound to multiple cores, using an appropriate binding level or a specific number of processing elements per application process (--map-by socket:PE=$SLURM_CPUS_PER_TASK).
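For example, for a hybrid MPI+OpenMP run with --cpus-per-task set in the job script (a sketch; ./my_app is a placeholder):

mpirun --map-by socket:PE=$SLURM_CPUS_PER_TASK --report-bindings ./my_app
# or disable binding entirely:
mpirun --bind-to none ./my_app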
By default, the IntelMPI libraries (mpirun launcher with the hydra process manager) perform a correct binding. If you opt for the IntelMPI mpirun as launcher, unset I_MPI_PMI_LIBRARY (meant for using IntelMPI with srun and defined when loading the module) to avoid verbose warnings.
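For example (a sketch; the module version matches the one shown in the Known Issues section, ./my_app is a placeholder):

module load intel-oneapi-mpi/2021.12.1
unset I_MPI_PMI_LIBRARY          # only when launching with mpirun instead of srun
mpirun -np 4 ./my_app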
Threads Affinity
None of the available compilers (gcc, nvhpc, aocc, intel) binds threads to cores by default. You can control thread affinity with the standard OMP_PLACES/OMP_PROC_BIND variables.
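For example, to pin each OpenMP thread to its own core (one common choice, not a requirement):

export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_NUM_THREADS=8         # placeholder: match the cores assigned per task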
Known Issues
This section collects currently known issues affecting Pitagora.
The list below is intended as a quick reference for users who may experience problems on the system. We strongly encourage all users to report any issues they encounter, whether listed here or not, to the user support team.
Internode GPUDirect Communication: UCX GPUDirect RDMA Error
Status: Open
Last Update: 2025-07-31
Impacted partition: Booster
DESCRIPTION:
The UCX GPUDirect RDMA feature is currently not functioning for point-to-point communications. This is likely due to an incompatibility between Intel UPI (intersocket connection) and UCX, as referenced in NVIDIA issue 2235234.
Example of the error output:
[r310c04s01:2918358:0:2918358] ib_mlx5_log.c:179 Local protection error on mlx5_0:1/IB (synd 0x4 vend 0x51 hw_synd 0/2)
[r310c04s01:2918358:0:2918358] ib_mlx5_log.c:179 RC QP 0xb232 wqe[20]: SEND s-e [inl len 10] [va 0x14986e800000 len 1 lkey 0x63a2] [rqpn 0xc526 dlid=2992 sl=0 port=1 src_path_bits=0]
[r310c03s04:1011253:0:1011253] ib_mlx5_log.c:179 Local protection error on mlx5_0:1/IB (synd 0x4 vend 0x51 hw_synd 0/2)
[r310c03s04:1011253:0:1011253] ib_mlx5_log.c:179 RC QP 0xc526 wqe[25]: SEND s-e [inl len 10] [va 0x14fa6a800000 len 1 lkey 0x688b] [rqpn 0xb232 dlid=5592 sl=0 port=1 src_path_bits=0]
==== backtrace (tid:2918358) ====
0 0x00000000000129b0 uct_ib_mlx5_completion_with_err() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/ib_mlx5_log.c:179
1 0x00000000000279ec uct_rc_mlx5_iface_handle_failure() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/rc/rc_mlx5_iface.c:235
2 0x00000000000279ec uct_rc_iface_arbiter_dispatch() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/rc/base/rc_iface.h:455
3 0x00000000000279ec uct_rc_mlx5_iface_handle_failure() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/rc/rc_mlx5_iface.c:238
4 0x0000000000013a25 uct_ib_mlx5_check_completion() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/ib_mlx5.c:477
5 0x0000000000028d97 uct_ib_mlx5_poll_cq() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/ib_mlx5.inl:148
6 0x0000000000028d97 uct_rc_mlx5_iface_poll_tx() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/rc/rc_mlx5.inl:1891
7 0x0000000000028d97 uct_rc_mlx5_iface_progress() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/rc/rc_mlx5_iface.c:127
8 0x0000000000028d97 uct_rc_mlx5_iface_progress_cyclic() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/rc/rc_mlx5_iface.c:132
9 0x000000000004dc4a ucs_callbackq_dispatch() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/ucs/datastruct/callbackq.h:215
10 0x000000000004dc4a uct_worker_progress() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/api/uct.h:2813
11 0x000000000004dc4a ucp_worker_progress() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/ucp/core/ucp_worker.c:3033
12 0x0000000000031f73 opal_progress() ???:0
13 0x0000000000054955 ompi_request_default_wait_all() ???:0
14 0x000000000009ce07 MPI_Waitall() ???:0
15 0x0000000000403e33 main() ???:0
16 0x0000000000029590 __libc_start_call_main() ???:0
17 0x0000000000029640 __libc_start_main_alias_2() :0
18 0x0000000000404cc5 _start() ???:0
=================================
TEMPORARY SOLUTION:
UCX GPUDirect RDMA has been disabled by default through the following environment variable, set automatically in all OpenMPI modules:
export UCX_IB_GPU_DIRECT_RDMA=no
This can be verified by inspecting OpenMPI modules. In the example below, unrelated lines have been omitted for clarity.
[<username>@<login_pitagora> ~]$ module show openmpi/4.1.6--gcc--12.3.0
-------------------------------------------------------------------
/pitagora/prod/opt/modulefiles/base/libraries/openmpi/4.1.6--gcc--12.3.0:
module-whatis {An open source Message Passing Interface implementation.}
[...]
setenv UCX_IB_GPU_DIRECT_RDMA no
[...]
append-path MANPATH {}
-------------------------------------------------------------------
Important
No action is required by the user to apply this workaround.
If users wish to test UCX GPUDirect RDMA manually, they can unset the variable after loading the module to re-enable the feature.
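For example (the module version matches the one shown above; the unset only affects the current shell or job):

module load openmpi/4.1.6--gcc--12.3.0
unset UCX_IB_GPU_DIRECT_RDMA     # re-enables UCX GPUDirect RDMA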
No significant performance degradation has been observed in synthetic benchmarks (e.g., OSU) or selected real-world applications.
IntelMPI Provider/Fabric Compatibility on AMD Processors
Status: Open
Last Update: 2025-07-29
Impacted partition: DCGP
DESCRIPTION:
The Mellanox (MLX) provider for IntelMPI does not work correctly with several applications and libraries on the AMD-based DCGP partition, resulting in job crashes or significantly reduced performance.
Known affected codes:
STARWALL
ASCOT5
PETSc-based software
In addition, MLX shows lower performance compared to the VERBS provider in most communication patterns, as observed in OSU benchmark tests.
TEMPORARY SOLUTION:
The VERBS provider has been set as the default for IntelMPI. This is done automatically when loading the IntelMPI module by exporting the following environment variables:
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs
This can be confirmed by inspecting the IntelMPI module. In the example below, unrelated lines have been omitted for clarity.
[<username>@<login_pitagora> ~]$ module show intel-oneapi-mpi/2021.12.1
-------------------------------------------------------------------
/pitagora/prod/opt/modulefiles/base/libraries/intel-oneapi-mpi/2021.12.1:
module-whatis {Intel MPI Library is a multifabric message-passing library that implements the open-source MPICH specification. Use the library to create, maintain, and test advanced, complex applications that perform better on high-performance computing (HPC) clusters based on Intel processors.}
conflict intel-oneapi-mpi
[...]
setenv I_MPI_FABRICS shm:ofi
setenv FI_PROVIDER verbs
[...]
setenv MPIFC mpiifx
-------------------------------------------------------------------
Important
No action is required from the user to enable VERBS as it is now the default provider.
The VERBS provider solves job crash issues for ASCOT5.
For some applications (e.g., STARWALL and PETSc), VERBS also fails. In these cases, we recommend switching to the TCP provider, which resolves the crashes and shows no significant performance loss so far. To use TCP:
export FI_PROVIDER=tcp
If switching providers manually, ensure that you export the variables after loading the IntelMPI module: loading the module overrides any previously set value of FI_PROVIDER.
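For example (a sketch; the module version matches the one shown above):

module load intel-oneapi-mpi/2021.12.1
export FI_PROVIDER=tcp           # must come after the module load, or it will be overridden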
NOTES:
The VERBS provider uses the same fabric as MLX and is fully capable of utilizing the InfiniBand (IB) network. Switching from MLX to VERBS simply means that IntelMPI will use the API provided by the VERBS library instead of the one from the MLX library, while the underlying communication fabric (i.e. the low-level communication layer that enables data transfer between processes across nodes) remains the same, namely OFI (OpenFabrics Interfaces).
The TCP provider does not support the InfiniBand network; it relies exclusively on the Ethernet network.
Preloaded Modules Prevent Job Submission
Status: Open
Last Update: 2025-07-29
Impacted partitions: All
DESCRIPTION:
Submitting jobs with modules preloaded in the shell environment can cause the following error:
sbatch: error: Batch job submission failed: Unexpected message received
TEMPORARY SOLUTION:
Before submitting your job, run:
module purge
Then load all necessary modules inside your job script.
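For example (job.sh is a placeholder name for your batch script, which should contain the module load commands it needs):

module purge     # clear any preloaded modules from the submission shell
sbatch job.sh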