Pitagora

Pitagora is the new EUROfusion supercomputer hosted by CINECA, currently being built at CINECA's headquarters in Casalecchio di Reno, Bologna, Italy. The cluster is supplied by Lenovo and is composed of two partitions: a CPU-based general-purpose partition named DCGP and an accelerated partition named Booster, based on NVIDIA H100 accelerators.

This guide covers only the information specific to the Pitagora cluster that deviates from the general behavior described in the HPC Clusters sections.

Access to the System

The machine is reachable via the SSH (Secure Shell) protocol at the hostname login.pitagora.cineca.it.

The connection is automatically established to one of the available login nodes. It is also possible to connect to a specific login node using one of the following hostnames:

login01-ext.pitagora.cineca.it

login02-ext.pitagora.cineca.it

login03-ext.pitagora.cineca.it

login04-ext.pitagora.cineca.it

login05-ext.pitagora.cineca.it

login06-ext.pitagora.cineca.it

Warning

Access to Pitagora requires two-factor authentication (2FA). More information is available in the section Access to the Systems.

Note

Even-numbered login nodes have the same architecture as the Booster partition's compute nodes, while odd-numbered login nodes have the same architecture as the DCGP partition's compute nodes.

login-boost.pitagora.cineca.it will allow users to log in to one of the even-numbered login nodes in a round-robin fashion.
login-dcgp.pitagora.cineca.it will allow users to log in to one of the odd-numbered login nodes in a round-robin fashion.
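
As a minimal sketch of how the aliases above might be used, the helper below maps a partition name to the matching round-robin alias. The function name pitagora_login_host is hypothetical and is not part of the CINECA environment.

```shell
# Hypothetical helper: map a partition name to the round-robin login alias
# documented above. Any other argument falls back to the generic hostname.
pitagora_login_host() {
  case "$1" in
    boost|booster) echo "login-boost.pitagora.cineca.it" ;;
    dcgp)          echo "login-dcgp.pitagora.cineca.it" ;;
    *)             echo "login.pitagora.cineca.it" ;;
  esac
}

# Print the host that an SSH session for DCGP work would target.
pitagora_login_host dcgp
```

A session could then be opened with ssh "<username>@$(pitagora_login_host dcgp)", where <username> is a placeholder for your own account name. Choosing the alias that matches the target partition means that code compiled on the login node is built on the same architecture as the compute nodes.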

System Architecture

The system, supplied by Lenovo, is based on two new, specifically designed compute blades, which are available through two distinct SLURM partitions on the cluster:

GPU blade based on NVIDIA H100 accelerators - Booster partition.
CPU-only blade based on AMD Turin 128c processors - Data Centric General Purpose (DCGP) partition.

The overall system architecture uses NVIDIA Mellanox InfiniBand High Data Rate (HDR) connectivity, with smart in-network computing acceleration engines that enable extremely low latency and high data throughput to provide the highest AI and HPC application performance and scalability.

Hardware Details

Booster

Type	Specific
Models	Lenovo SD650-N V3
Racks	7
Nodes	168
Processors/node	2x Intel Emerald Rapids 6548Y 32c 2.4 GHz
CPU/node	64
Accelerators/node	4x NVIDIA H100 SXM 80GB HBM2e
Local Storage/node (tmpfs)
RAM/node	512 GiB DDR5 5600 MHz
Rmax	27.27 PFlop/s (top500)
Internal Network	Nvidia ConnectX-7 NDR200
Storage (raw capacity)	2 x 7.68 GiB SSDs (HW RAID 1)

DCGP

Type	Specific
Models	Lenovo SD665 V3
Racks	14
Nodes	1008
Processors/node	2x AMD Turin 128c - Zen5 microarch 2.4 GHz
CPU/node	256
Accelerators/node	(none)
Local Storage/node (tmpfs)
RAM/node	768 GiB DDR5 6400 MHz
Rmax	17 PFlop/s (top500)
Internal Network	Nvidia ConnectX-7 NDR SharedIO 200Gbit/s
Storage (raw capacity)	Diskless nodes

Job Managing and SLURM Partitions

The tables at the end of this page summarize the SLURM partitions available in the production environment for the Booster and DCGP partitions. Please note that the SLURM email service is not active yet.

Processes/Threads Binding/Affinity

Processes Binding

By default, srun (the SLURM launcher) performs an automatic binding. For multi-threaded applications, request the proper --cpus-per-task and bind the processes to cores (srun --cpu-bind=cores).
By default, OpenMPI libraries (mpirun launcher) bind processes to cores. For multi-threaded applications this causes CPU overallocation, because all threads of a process share a single core. Ensure that processes are either not bound at all (by specifying --bind-to none) or bound to multiple cores, using an appropriate binding level or a specific number of processing elements per application process (--map-by socket:PE=$SLURM_CPUS_PER_TASK).
By default, IntelMPI libraries (mpirun launcher with the Hydra process manager) perform a correct binding. If you opt for the IntelMPI mpirun as launcher, unset the I_MPI_PMI_LIBRARY variable (meant for using IntelMPI with srun), which is defined when loading the module, to avoid verbose warnings.
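
The srun recommendation above can be sketched as a small hybrid MPI+OpenMP job script. This is only an illustration under stated assumptions: the file name hybrid_job.sh, the executable ./my_hybrid_app and the <account> placeholder are hypothetical, and the 16 tasks x 16 cores layout is just one way to fill the 256 cores of a DCGP node.

```shell
# Write a sample job script; the quoted 'EOF' keeps $VARIABLES unexpanded
# so that they are evaluated only when the job runs.
cat > hybrid_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=dcgp_fua_prod
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=16
#SBATCH --time=00:30:00
# Placeholder: add your own project with  #SBATCH --account=<account>

# One OpenMP thread per core reserved for each MPI task.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Pass --cpus-per-task to srun explicitly and bind each task to its cores.
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores ./my_hybrid_app
EOF

# Syntax check only; submit on the cluster with: sbatch hybrid_job.sh
bash -n hybrid_job.sh && echo "hybrid_job.sh written"
```

With OpenMPI's mpirun in place of srun, the equivalent step would be to add --bind-to none or a suitable --map-by option, as described above.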

Threads Affinity

None of the available compilers (gcc, nvhpc, aocc, intel) binds threads to cores by default. Thread affinity can be controlled with the standard OMP_PLACES and OMP_PROC_BIND environment variables.
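
A minimal sketch of one possible affinity setup follows. The values are illustrative rather than a recommendation for every application, and the fallback of 4 threads is an arbitrary assumption used only when the script runs outside a SLURM job. OMP_DISPLAY_AFFINITY is an OpenMP 5.0 feature and may not be honored by every runtime.

```shell
# Use the cores reserved by SLURM for this task (fallback value is arbitrary).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-4}"

# One place per physical core; threads may not migrate between places.
export OMP_PLACES=cores

# Pack threads onto neighbouring places; "spread" distributes them instead.
export OMP_PROC_BIND=close

# Where supported, print the resulting thread-to-place mapping at startup.
export OMP_DISPLAY_AFFINITY=true

echo "threads=$OMP_NUM_THREADS places=$OMP_PLACES bind=$OMP_PROC_BIND"
```

These exports belong in the job script before the launcher line, so that every process started by srun or mpirun inherits them.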

Known Issues

This section collects the currently known issues affecting Pitagora.

The list below is intended as a quick reference for users who may experience problems on the system. We strongly encourage all users to report any issues they encounter - whether listed here or not - to the user support team.

Internode GPUDirect Communication: UCX GPUDirect RDMA Error

Status: Open | Last Update: 2025-07-31 | Partition: Booster

Description

The UCX GPUDirect RDMA feature is currently not functioning for point-to-point communications. This is likely due to an incompatibility between Intel UPI (intersocket connection) and UCX, as referenced in NVIDIA issue 2235234.

Case 1: The MPI job fails with errors like the following:

[r310c04s01:2918358:0:2918358] ib_mlx5_log.c:179  Local protection error on mlx5_0:1/IB (synd 0x4 vend 0x51 hw_synd 0/2)
[r310c04s01:2918358:0:2918358] ib_mlx5_log.c:179  RC QP 0xb232 wqe[20]: SEND s-e [inl len 10] [va 0x14986e800000 len 1 lkey 0x63a2] [rqpn 0xc526 dlid=2992 sl=0 port=1 src_path_bits=0]
[r310c03s04:1011253:0:1011253] ib_mlx5_log.c:179  Local protection error on mlx5_0:1/IB (synd 0x4 vend 0x51 hw_synd 0/2)
[r310c03s04:1011253:0:1011253] ib_mlx5_log.c:179  RC QP 0xc526 wqe[25]: SEND s-e [inl len 10] [va 0x14fa6a800000 len 1 lkey 0x688b] [rqpn 0xb232 dlid=5592 sl=0 port=1 src_path_bits=0]
==== backtrace (tid:2918358) ====
0x00000000000129b0 uct_ib_mlx5_completion_with_err()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/ib_mlx5_log.c:179
0x00000000000279ec uct_rc_mlx5_iface_handle_failure()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/rc/rc_mlx5_iface.c:235
0x00000000000279ec uct_rc_iface_arbiter_dispatch()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/rc/base/rc_iface.h:455
0x00000000000279ec uct_rc_mlx5_iface_handle_failure()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/rc/rc_mlx5_iface.c:238
0x0000000000013a25 uct_ib_mlx5_check_completion()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/ib_mlx5.c:477
0x0000000000028d97 uct_ib_mlx5_poll_cq()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/ib_mlx5.inl:148
0x0000000000028d97 uct_rc_mlx5_iface_poll_tx()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/rc/rc_mlx5.inl:1891
0x0000000000028d97 uct_rc_mlx5_iface_progress()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/rc/rc_mlx5_iface.c:127
0x0000000000028d97 uct_rc_mlx5_iface_progress_cyclic()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/ib/mlx5/rc/rc_mlx5_iface.c:132
0x000000000004dc4a ucs_callbackq_dispatch()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/ucs/datastruct/callbackq.h:215
0x000000000004dc4a uct_worker_progress()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/uct/api/uct.h:2813
0x000000000004dc4a ucp_worker_progress()  /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-redhat9-cuda12-x86_64/ucx-b00d4c4b05219b73a7d764f03644b5296ba9dfdb/src/ucp/core/ucp_worker.c:3033
0x0000000000031f73 opal_progress()  ???:0
0x0000000000054955 ompi_request_default_wait_all()  ???:0
0x000000000009ce07 MPI_Waitall()  ???:0
0x0000000000403e33 main()  ???:0
0x0000000000029590 __libc_start_call_main()  ???:0
0x0000000000029640 __libc_start_main_alias_2()  :0
0x0000000000404cc5 _start()  ???:0
=================================

TEMPORARY SOLUTION:

UCX GPUDirect RDMA has been disabled by default through the following environment variable, set automatically in all OpenMPI modules:

export UCX_IB_GPU_DIRECT_RDMA=no

This can be verified by inspecting OpenMPI modules. In the example below, unrelated lines have been omitted for clarity.

[<username>@<login_pitagora> ~]$ module show openmpi/4.1.6--gcc--12.3.0
-------------------------------------------------------------------
/pitagora/prod/opt/modulefiles/base/libraries/openmpi/4.1.6--gcc--12.3.0:

module-whatis   {An open source Message Passing Interface implementation.}
[...]
setenv          UCX_IB_GPU_DIRECT_RDMA no
[...]
append-path     MANPATH {}
-------------------------------------------------------------------

Important

No action is required by the user to apply this workaround.
If users wish to test UCX GPUDirect RDMA manually, they can unset the variable after loading the module to re-enable the feature.
No significant performance degradation has been observed in synthetic benchmarks (e.g., OSU) or selected real-world applications.
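
For users who want to test the feature manually, the following sketch shows how the variable could be inspected and cleared in a job script. The helper name report_gdr is hypothetical; the module load line is left as a comment because it only works on the cluster.

```shell
# On Pitagora, load the module first; it sets UCX_IB_GPU_DIRECT_RDMA=no:
#   module load openmpi/4.1.6--gcc--12.3.0

# Hypothetical helper: show the current value, or "unset" if not defined.
report_gdr() {
  echo "UCX_IB_GPU_DIRECT_RDMA=${UCX_IB_GPU_DIRECT_RDMA:-unset}"
}

report_gdr                      # value as set by the module, if loaded
unset UCX_IB_GPU_DIRECT_RDMA    # hand the decision back to UCX defaults
report_gdr                      # now reports "unset"
```

The unset must come after the module load, since loading the module sets the variable again.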

Case 2: PyTorch jobs using the NCCL backend hang.

TEMPORARY SOLUTION:

Disable NCCL GPU Direct RDMA as follows:
export NCCL_NET_GDR_LEVEL=LOC
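
A short sketch of how this setting might be placed in a job script is shown below. Enabling NCCL_DEBUG=INFO is an optional addition, not part of the workaround; it makes NCCL log its transport choices so that the effect of the setting can be checked in the job output.

```shell
# Workaround from above: do not use GPU Direct RDMA for network transfers.
export NCCL_NET_GDR_LEVEL=LOC

# Optional: verbose NCCL logging, useful to inspect the selected transports.
export NCCL_DEBUG=INFO

echo "NCCL_NET_GDR_LEVEL=$NCCL_NET_GDR_LEVEL NCCL_DEBUG=$NCCL_DEBUG"
```

Both variables must be exported before the training processes are launched, because NCCL reads them at initialization.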

IntelMPI Provider/Fabric Compatibility on AMD Processors

Status: Open | Last Update: 2025-07-29 | Partition: DCGP

Description

The Mellanox (MLX) provider for IntelMPI does not work correctly with several applications and libraries on the AMD-based DCGP partition, resulting in job crashes or significantly reduced performance.

Known affected codes:

- STARWALL
- ASCOT5
- PETSc-based software

In addition, MLX shows lower performance compared to the VERBS provider in most communication patterns, as observed in OSU benchmark tests.

TEMPORARY SOLUTION:

The VERBS provider has been set as the default for IntelMPI. This is done automatically when loading the IntelMPI module by exporting the following environment variables:

export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs

This can be confirmed by inspecting the IntelMPI module. In the example below, unrelated lines have been omitted for clarity.

[<username>@<login_pitagora> ~]$ module show intel-oneapi-mpi/2021.12.1
-------------------------------------------------------------------
/pitagora/prod/opt/modulefiles/base/libraries/intel-oneapi-mpi/2021.12.1:

module-whatis   {Intel MPI Library is a multifabric message-passing library that implements the open-source MPICH specification. Use the library to create, maintain, and test advanced, complex applications that perform better on high-performance computing (HPC) clusters based on Intel processors.}
conflict        intel-oneapi-mpi
[...]
setenv          I_MPI_FABRICS shm:ofi
setenv          FI_PROVIDER verbs
[...]
setenv          MPIFC mpiifx
-------------------------------------------------------------------
Important

No action is required from the user to enable VERBS as it is now the default provider.

The VERBS provider solves job crash issues for ASCOT5.

For some applications (e.g., STARWALL and PETSc), VERBS also fails. In these cases, we recommend switching to the TCP provider, which resolves the crashes. However, a performance decrease is observed for subroutines involving heavy inter-node communication. To use TCP:

export FI_PROVIDER=tcp

If switching providers manually, ensure that you export the variables after loading the IntelMPI module. Loading the module will override any previously set values for FI_PROVIDER.

NOTES:

- The VERBS provider uses the same fabric as MLX and is fully capable of utilizing the InfiniBand (IB) network. Switching from MLX to VERBS simply means that IntelMPI will use the API provided by the VERBS library instead of the one from the MLX library, while the underlying communication fabric (i.e. the low-level communication layer that enables data transfer between processes across nodes) remains the same - namely, OFI (OpenFabrics Interfaces).
- The TCP provider does not support the InfiniBand network; it relies exclusively on the Ethernet network.
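
The ordering requirement described above (load the module first, override afterwards) can be sketched as follows. Setting I_MPI_DEBUG is an optional addition, not part of the workaround; it asks Intel MPI to print startup details, including the libfabric provider in use.

```shell
# On Pitagora, load the module first; it sets FI_PROVIDER=verbs:
#   module load intel-oneapi-mpi/2021.12.1

# Override the provider only after the module has been loaded.
export FI_PROVIDER=tcp

# Optional: have Intel MPI report the selected provider at startup.
export I_MPI_DEBUG=5

echo "FI_PROVIDER=$FI_PROVIDER I_MPI_FABRICS=${I_MPI_FABRICS:-unset}"
```

If the export were placed before the module load, the module would reset FI_PROVIDER to verbs and the job would silently run with the default provider.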

Partition	QOS	#Nodes per job	Walltime	Max nodes per user	Priority	Notes
boost_fua_prod	normal	max = 16	24:00:00	32	40
boost_fua_prod	boost_qos_fuabprod	min = 17 (full nodes) max = 32	24:00:00	32	60	runs on 96 nodes (GrpTRES)
boost_fua_dbg	normal	max = 2	00:30:00	2	40	runs on 8 nodes (GrpTRES)

Partition	QOS	#Nodes per job	Walltime	Max nodes per user	Priority	Notes
dcgp_fua_prod	normal	max = 64	24:00:00	64	40
dcgp_fua_prod	dcgp_qos_fuabprod	min = 65 (full nodes) max = 128	24:00:00	128	60	runs on 640 nodes (GrpTRES)
dcgp_fua_prod	dcgp_qos_fualprod	max = 3	4-00:00:00	3	40
dcgp_fua_dbg	normal	max = 2	00:30:00	2	40	runs on 8 nodes (GrpTRES)
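
As an illustration of how a row of these tables translates into a job request, the sketch below writes a job header for the boost_qos_fuabprod QOS, which requires at least 17 full nodes. The file name large_boost_job.sh, the executable ./my_gpu_app, the <account> placeholder, the use of --exclusive to request full nodes and the --gres=gpu:4 resource name are all assumptions and may need adapting to the actual site configuration.

```shell
# Write a sample job script; the quoted 'EOF' prevents expansion here.
cat > large_boost_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=boost_fua_prod
#SBATCH --qos=boost_qos_fuabprod
#SBATCH --nodes=17
#SBATCH --exclusive
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00
# Placeholder: add your own project with  #SBATCH --account=<account>

# One MPI task per GPU (assumed layout).
srun ./my_gpu_app
EOF

# Syntax check only; submit on the cluster with: sbatch large_boost_job.sh
bash -n large_boost_job.sh && echo "large_boost_job.sh written"
```

Jobs of up to 16 nodes on Booster do not need the --qos line, since they run under the normal QOS of the same partition.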