Pitagora

Pitagora is the new EUROfusion supercomputer, hosted by CINECA and currently being built at CINECA's headquarters in Casalecchio di Reno, Bologna, Italy. The cluster is supplied by Lenovo and is composed of two partitions: a CPU-based general purpose partition named DCGP, and an accelerated partition based on NVIDIA H100 accelerators named Booster.

This guide covers only the information specific to Pitagora that deviates from the general behavior described in the HPC Clusters sections.

Access to the System

The machine is reachable via the SSH (Secure Shell) protocol at the hostname login.pitagora.cineca.it.

The connection is automatically established to one of the available login nodes. It is also possible to connect to a specific login node using one of the following hostnames:

  • login01-ext.pitagora.cineca.it

  • login02-ext.pitagora.cineca.it

  • login03-ext.pitagora.cineca.it

  • login04-ext.pitagora.cineca.it

  • login05-ext.pitagora.cineca.it

  • login06-ext.pitagora.cineca.it

Warning

Two-factor authentication (2FA) is mandatory to access Pitagora. More information is available in the section Access to the Systems.

Note

Even-numbered login nodes have the same architecture as the compute nodes of the Booster partition, while odd-numbered login nodes have the same architecture as the compute nodes of the DCGP partition.

  • login-boost.pitagora.cineca.it will allow users to log on one of the even-numbered login nodes in a round robin fashion.

  • login-dcgp.pitagora.cineca.it will allow users to log on one of the odd-numbered login nodes in a round robin fashion.
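As a minimal sketch of how the two aliases above might be used, the snippet below builds the login hostname from the partition you intend to compile for. The `user` value is a placeholder and `target` is an illustrative variable; neither comes from the system documentation. The ssh command is printed rather than executed.

```shell
#!/bin/bash
# Choose the login alias that matches the architecture you will build for:
# "boost" -> even-numbered (Booster-like) login nodes,
# "dcgp"  -> odd-numbered (DCGP-like) login nodes.
target="boost"          # illustrative variable; set to "dcgp" for CPU-only work
user="username"         # placeholder, replace with your own username

host="login-${target}.pitagora.cineca.it"

# Print the command instead of running it, so the snippet is safe to try anywhere.
echo "ssh ${user}@${host}"
```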

System Architecture

The system, supplied by Lenovo, is based on two new, specifically designed compute blades, which are available through two distinct SLURM partitions on the cluster:

  • GPU blade based on NVIDIA H100 accelerators - Booster partition.

  • CPU-only blade based on AMD Turin 128c processors - Data Centric General Purpose (DCGP) partition.

The overall system architecture uses NVIDIA Mellanox InfiniBand High Data Rate (HDR) connectivity, with smart in-network computing acceleration engines that enable extremely low latency and high data throughput to provide the highest AI and HPC application performance and scalability.

Hardware Details

Booster partition

Type                        Specific
Models                      Lenovo SD650-N V3
Racks                       7
Nodes                       168
Processors/node             2x Intel Emerald Rapids Xeon Gold 6548Y+ 32c 2.5 GHz
CPU/node                    64
Accelerators/node           4x NVIDIA H100 SXM 94GB HBM2e
Local Storage/node (tmpfs)
RAM/node                    512 GiB DDR5 5600 MHz
Rmax                        27.27 PFlop/s (top500)
Internal Network            NVIDIA ConnectX-7 NDR200
Storage (raw capacity)      2 x 7.68 GiB SSDs (HW RAID 1)

DCGP partition

Type                        Specific
Models                      Lenovo SD665 V3
Racks                       14
Nodes                       1008
Processors/node             2x AMD Turin EPYC 9745 128c 2.4 GHz - Zen5 microarch
CPU/node                    256
Accelerators/node           (none)
Local Storage/node (tmpfs)
RAM/node                    768 GiB DDR5 6400 MHz
Rmax                        17 PFlop/s (top500)
Internal Network            NVIDIA ConnectX-7 NDR SharedIO 200Gbit/s
Storage (raw capacity)      Diskless nodes

File Systems and Data Management

The storage organization conforms to the CINECA infrastructure. General information is reported in the File Systems and Data Management section. In the following, only the differences with respect to the general behavior are listed and explained.

Warning

The backup service on the $HOME area is currently not active, due to the ongoing installation of our new automatic backup system, which will be completed in the coming weeks.

Job Managing and SLURM Partitions

The following tables list the SLURM partitions of the production environment for the Booster and DCGP partitions. Please note that the SLURM email service is not active yet.

See also

Further information about job submission is reported in the general section Scheduler and Job Submission.

Booster partition

Partition                  QOS                 TRES Limits per job                  Walltime    MaxTRES per User  Priority  Notes
ptgr_all_serial (default)  normal              Max = 4 cores                        04:00:00    4 cores           40        No GPUs, Budget Free; max 4 running jobs and 10 pending jobs per user
boost_fua_prod             normal              Max = 16 nodes                       24:00:00    32 nodes          40
                           boost_qos_fuabprod  Min = 17 full nodes, Max = 32 nodes  24:00:00    32 nodes          60        runs on 96 nodes (GrpTRES)
boost_fua_dbg              normal              Max = 2 nodes                        00:30:00                      40        runs on 2 nodes (GrpTRES)

DCGP partition

Partition                  QOS                 TRES Limits per job                   Walltime    MaxTRES per User  Priority  Notes
ptgr_all_serial (default)  normal              Max = 4 cores                         04:00:00    4 cores           40        No GPUs, Budget Free; max 4 running jobs and 10 pending jobs per user
dcgp_fua_prod              normal              Max = 64 nodes                        24:00:00    64 nodes          40
                           dcgp_qos_fuabprod   Min = 65 full nodes, Max = 128 nodes  24:00:00    128 nodes         60        runs on 640 nodes (GrpTRES)
                           dcgp_qos_fualprod   Max = 3 nodes                         4-00:00:00  3 nodes           40
dcgp_fua_dbg               normal              Max = 2 nodes                         00:30:00    2 nodes           40        runs on 8 nodes (GrpTRES)
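To show how the partition names and limits in the tables above map onto job directives, here is a sketch of a Booster batch script written through a here-document. The account name `my_project`, the executable `./my_app`, the task layout, and the `--gres=gpu:4` request syntax are assumptions for illustration and should be checked against the general Scheduler and Job Submission section. The node count and walltime stay within the boost_fua_prod limits listed above.

```shell
#!/bin/bash
# Write a sample job script to job.sh; nothing is submitted here.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
# Booster production partition, default QOS (up to 16 nodes per job).
# Jobs of 17 to 32 nodes would additionally need --qos=boost_qos_fuabprod.
#SBATCH --partition=boost_fua_prod
#SBATCH --nodes=2
# Assumed layout: one MPI task per GPU, 64 cores per node split over 4 tasks.
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=16
# Assumed GPU request syntax.
#SBATCH --gres=gpu:4
# Maximum walltime allowed on this partition.
#SBATCH --time=24:00:00
# Hypothetical account name.
#SBATCH --account=my_project

srun --cpus-per-task="$SLURM_CPUS_PER_TASK" --cpu-bind=cores ./my_app
EOF

echo "Submit with: sbatch job.sh"
```

For a DCGP job the same skeleton would apply with a dcgp_* partition name and without the GPU request.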

Processes/Threads Binding/Affinity

Processes Binding

  • By default, srun (the SLURM launcher) performs automatic binding. For multi-threaded applications, request the proper --cpus-per-task and bind the processes to cores (srun --cpu-bind=cores).

  • By default, OpenMPI libraries (mpirun launcher) bind processes to cores. For multi-threaded applications this causes CPU overallocation. Ensure that processes are either not bound at all (by specifying --bind-to none) or bound to multiple cores, using an appropriate binding level or a specific number of processing elements per application process (--map-by socket:PE=$SLURM_CPUS_PER_TASK).

  • By default, IntelMPI libraries (mpirun launcher with the Hydra process manager) perform a correct binding. If you opt for the IntelMPI mpirun as launcher, unset the I_MPI_PMI_LIBRARY variable (meant for using IntelMPI with srun), which is defined when loading the module, to avoid verbose warnings.
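The launcher options above could be combined as in the following sketch for a hybrid MPI+OpenMP run. The executable `./my_app`, the variable names, and the fallback of 8 cores per task are illustrative assumptions. The command lines are only printed, so the snippet also runs outside a job allocation.

```shell
#!/bin/bash
# Cores reserved for each MPI process: taken from Slurm inside a job,
# otherwise a hypothetical fallback of 8.
cpus="${SLURM_CPUS_PER_TASK:-8}"

# srun: request the cores explicitly and bind each process to them.
srun_cmd="srun --cpus-per-task=${cpus} --cpu-bind=cores ./my_app"

# OpenMPI mpirun: give each process ${cpus} processing elements ...
ompi_cmd="mpirun --map-by socket:PE=${cpus} ./my_app"
# ... or disable process binding altogether.
ompi_nobind_cmd="mpirun --bind-to none ./my_app"

printf '%s\n' "$srun_cmd" "$ompi_cmd" "$ompi_nobind_cmd"
```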

Threads Affinity

None of the available compilers (gcc, nvhpc, aocc, intel) binds threads to cores by default. You can control thread affinity with the standard OMP_PLACES and OMP_PROC_BIND variables.
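A possible thread-affinity setup using these standard OpenMP variables is sketched below. The choice of `cores` and `close` is one common combination, not a site recommendation, and the fallback of one thread applies only when the snippet runs outside a SLURM job.

```shell
#!/bin/bash
# One OpenMP thread per core reserved for the task (1 when not inside a job).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Each place is one core; threads are placed close to the parent thread.
export OMP_PLACES=cores
export OMP_PROC_BIND=close

echo "OMP_NUM_THREADS=${OMP_NUM_THREADS} OMP_PLACES=${OMP_PLACES} OMP_PROC_BIND=${OMP_PROC_BIND}"
```

Setting the standard variable `OMP_DISPLAY_ENV=true` as well makes the OpenMP runtime print the settings it actually applied when the program starts.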

MPI Advanced Configuration

Enable UCC for Collective Communications

UCC (Unified Collective Communication) is a library designed to efficiently handle collective communication operations (e.g., allreduce, broadcast) across multiple nodes and GPUs.

Developed to support deep learning workloads (such as PyTorch-based recommender models) running on systems with multiple network paths (“multi-rail”) and GPUs, it has also been designed and implemented for high-performance PGAS (Partitioned Global Address Space) applications and runtimes - HPC programs where memory is logically shared but physically distributed, making efficient communication critical.

As stated by NVIDIA, UCC serves as a drop-in replacement for HCOLL and is expected to gradually assume the role of the default collective library as it continues to implement the full range of HCOLL hierarchical algorithms. This means that applications or MPI stacks currently using HCOLL can switch to UCC without requiring major code changes. As UCC matures and reaches full feature parity, it is expected to become the default solution for collective communication.

Important

In general, we recommend enabling UCC for collective communications, especially for GPU-based workloads, as it can provide significant performance improvements over the native OpenMPI collective component [Venkata MG et al., IEEE 2024].

To enable UCC for collective communications, load one of the following OpenMPI modules (all built with UCC support, version 1.6.0):

hpcx-mpi/2.25.1                       # OpenMPI 4.1.6 with UCC 1.6.0 and UCX 1.20, shipped with HPC-X 2.25.1 (compiled by NVIDIA with GCC 8.5.0)
openmpi/4.1.6--gcc--12.3.0-ucx1.20    # OpenMPI 4.1.6 with UCC 1.6.0 and UCX 1.20, compiled by Cineca with GCC 12.3.0
openmpi/5.0.9--gcc--12.3.0-ucx1.20    # OpenMPI 5.0.9 with UCC 1.6.0 and UCX 1.20, compiled by Cineca with GCC 12.3.0

Then export the following environment variables to enable UCC as the collective communication library for MPI:

export OMPI_MCA_coll_ucc_enable=1
export OMPI_MCA_coll_ucc_priority=100

We recommend setting a high priority for UCC to ensure it is selected as the default collective component when available.
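Before running, it can be useful to confirm that the loaded OpenMPI build actually ships the UCC collective component. The sketch below relies on `ompi_info`, the OpenMPI introspection tool, and assumes its component listing contains a line of the form `MCA coll: ucc`. The variable `ucc_status` is illustrative, and the script falls back to a message when no OpenMPI installation is found in PATH.

```shell
#!/bin/bash
# Report whether the current OpenMPI installation lists the UCC coll component.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | grep -q "MCA coll: ucc"; then
        ucc_status="available"
    else
        ucc_status="missing"
    fi
else
    # No OpenMPI in PATH: load one of the modules listed above first.
    ucc_status="unknown"
fi

echo "UCC collective component: ${ucc_status}"
```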

Warning

Asymmetric source/destination memory types between sender and receiver (for example, when the sender uses host memory and the receiver uses device memory) are not yet supported in UCC. If your application relies on this feature, you may encounter errors similar to the following:

ucc_schedule_pipelined.c:211  UCC  ERROR failed to initialize fragment for pipeline
allreduce_sra_knomial.c:234  TL_UCP ERROR failed to init pipelined schedule
      allreduce.c:38   TL_UCP ERROR asymmetric src/dst memory types are not supported yet
          mpool.c:55   UCX  WARN  object 0x3077c00 was not returned to mpool tl_ucp_req_mp

IMPORTANT: This is not a blocking issue, as OpenMPI will automatically fall back to the next available collective component (the native one in these modules). Therefore, please note that the error messages may be misleading: they indicate a failure in initializing the collective operation in UCC, but in practice this simply triggers a fallback to the native implementation. As a result, comparing the performance of collective operations with and without UCC enabled may not show any difference in such cases, since both configurations may end up using the native OpenMPI collective component.

Note, however, that leaving UCC disabled may negatively impact the performance of all symmetric communications.

Known applications affected by this issue include:
  • GENE

For more information about UCC, please refer to the official wiki from OpenUCX and the NVIDIA HPC-X documentation.

Enable NVIDIA SHARP for Network Offloading of Collective Operations

NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) improves the performance of MPI and machine learning collective operations by offloading them from CPUs and GPUs to the network, reducing redundant data transfers between endpoints.

This approach decreases the amount of data traversing the network, leading to faster collective operations. Additionally, offloading these operations frees CPU and GPU resources for computation instead of communication overhead.

Important

As a general rule of thumb, we suggest enabling SHARP for collective operations, especially for GPU-based workloads with a large number of processes, as it can provide significant performance improvements [Ramesh B et al., IEEE 2020]. However, the actual performance gain may vary depending on the application, data size, and communication patterns. Therefore, we strongly recommend testing with and without SHARP enabled to determine the best configuration for your use case.

To enable SHARP for collective operations, load one of the following OpenMPI modules (all built with SHARP support):

hpcx-mpi/2.25.1                       # OpenMPI 4.1.6 with UCC 1.6.0 and UCX 1.20, shipped with HPC-X 2.25.1 (compiled by NVIDIA with GCC 8.5.0)
openmpi/4.1.6--gcc--12.3.0-ucx1.20    # OpenMPI 4.1.6 with UCC 1.6.0 and UCX 1.20, compiled by Cineca with GCC 12.3.0
openmpi/5.0.9--gcc--12.3.0-ucx1.20    # OpenMPI 5.0.9 with UCC 1.6.0 and UCX 1.20, compiled by Cineca with GCC 12.3.0

Then export the following environment variables to enable SHARP for collective communications:

# Enable UCC (required)
export OMPI_MCA_coll_ucc_enable=1
export OMPI_MCA_coll_ucc_priority=100

# Enable SHARP (through UCC)
export OMPI_UCC_CL_BASIC_TLS=ucp,sharp
export SHARP_COLL_ENABLE_SAT=1

Note: SHARP support is provided through UCC; therefore, enabling UCC is a prerequisite for using SHARP.
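Since the gain from SHARP depends on the workload, a small helper can make it easier to compare configurations. The function below is a sketch: `configure_collectives` and its mode names are hypothetical, and it only sets or clears the variables shown in this section. Setting `OMPI_MCA_coll_ucc_enable=0` for the native mode is an assumption about how to switch UCC off explicitly.

```shell
#!/bin/bash
# Select the collective backend for the next launch: native, ucc, or sharp.
configure_collectives() {
    case "$1" in
        native)
            # Explicitly disable UCC and clear the SHARP-related settings.
            export OMPI_MCA_coll_ucc_enable=0
            unset OMPI_MCA_coll_ucc_priority OMPI_UCC_CL_BASIC_TLS SHARP_COLL_ENABLE_SAT
            ;;
        ucc)
            # UCC only, without SHARP offloading.
            export OMPI_MCA_coll_ucc_enable=1
            export OMPI_MCA_coll_ucc_priority=100
            unset OMPI_UCC_CL_BASIC_TLS SHARP_COLL_ENABLE_SAT
            ;;
        sharp)
            # UCC is a prerequisite; SHARP is enabled through it.
            export OMPI_MCA_coll_ucc_enable=1
            export OMPI_MCA_coll_ucc_priority=100
            export OMPI_UCC_CL_BASIC_TLS=ucp,sharp
            export SHARP_COLL_ENABLE_SAT=1
            ;;
        *)
            echo "usage: configure_collectives native|ucc|sharp" >&2
            return 1
            ;;
    esac
}

configure_collectives sharp
echo "UCC enabled: ${OMPI_MCA_coll_ucc_enable}, TLS: ${OMPI_UCC_CL_BASIC_TLS}"
```

Calling `configure_collectives ucc` or `configure_collectives sharp` before the launch line in otherwise identical jobs gives comparable timings for the two setups.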

For more information, refer to the official NVIDIA documentation.

Known Issues

This section collects the currently known issues affecting Pitagora.

The list below is intended as a quick reference for users who may experience problems on the system. We strongly encourage all users to report any issues they encounter - whether listed here or not - to the user support team.

  • Intel-Oneapi-MPI Provider/Fabric Compatibility on AMD Processors

  • Internode GPUDirect Communication: UCX GPUDirect RDMA Error