Pitagora

Pitagora is the new EUROfusion supercomputer hosted by CINECA and currently installed at CINECA's headquarters in Casalecchio di Reno, Bologna, Italy. The cluster is supplied by Lenovo and is composed of two partitions: a CPU-based general purpose partition named DCGP and an accelerated partition based on NVIDIA H100 accelerators named Booster.

This specific guide for the Pitagora cluster contains information that deviates from the general behavior described in the HPC Clusters sections.

Access to the System

The machine is reachable via the SSH (Secure Shell) protocol at the hostname: login.pitagora.cineca.it.
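
For example, with <username> standing for your CINECA username:

    ssh <username>@login.pitagora.cineca.it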

The connection is automatically established to one of the available login nodes. It is also possible to connect to Pitagora using one of the specific login hostnames:

  • login01-ext.pitagora.cineca.it

  • login02-ext.pitagora.cineca.it

  • login03-ext.pitagora.cineca.it

  • login04-ext.pitagora.cineca.it

  • login05-ext.pitagora.cineca.it

  • login06-ext.pitagora.cineca.it

Warning

Access to Pitagora requires two-factor authentication (2FA). Get more information in the section Access to the Systems.

Note

Even-numbered login nodes have the same architecture as the Booster partition's compute nodes, while odd-numbered login nodes have the same architecture as the DCGP partition's compute nodes:

  • login-boost.pitagora.cineca.it allows users to log in to one of the even-numbered login nodes in a round-robin fashion.

  • login-dcgp.pitagora.cineca.it allows users to log in to one of the odd-numbered login nodes in a round-robin fashion.
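
For example, to land on a login node with the same architecture as the DCGP compute nodes (again with <username> as a placeholder):

    ssh <username>@login-dcgp.pitagora.cineca.it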

System Architecture

The system, supplied by Lenovo, is based on two purpose-built compute blades, which are available through two distinct SLURM partitions on the cluster:

  • a GPU blade based on NVIDIA H100 accelerators - the Booster partition;

  • a CPU-only blade based on AMD Turin 128c processors - the Data Centric General Purpose (DCGP) partition.

The overall system architecture uses NVIDIA Mellanox InfiniBand NDR connectivity, with smart in-network computing acceleration engines that enable extremely low latency and high data throughput, providing the highest AI and HPC application performance and scalability.

Hardware Details

Booster

Type                          Specific
--------------------------    --------------------------------------------
Models                        Lenovo SD650-N V3
Racks                         7
Nodes                         168
Processors/node               2x Intel Emerald Rapids 6548Y 32c, 2.4 GHz
CPUs/node                     64
Accelerators/node             4x NVIDIA H100 SXM 80GB HBM3
Local Storage/node (tmpfs)
RAM/node                      512 GiB DDR5 5600 MHz
Rmax                          27.27 PFlop/s (Top500)
Internal Network              NVIDIA ConnectX-7 NDR200
Storage (raw capacity)        2x 7.68 TB SSDs (HW RAID 1)

DCGP

Type                          Specific
--------------------------    --------------------------------------------
Models                        Lenovo SD665 V3
Racks                         14
Nodes                         1008
Processors/node               2x AMD Turin 128c (Zen 5 microarchitecture), 2.4 GHz
CPUs/node                     256
Accelerators/node             (none)
Local Storage/node (tmpfs)
RAM/node                      768 GiB DDR5 6400 MHz
Rmax                          17 PFlop/s (Top500)
Internal Network              NVIDIA ConnectX-7 NDR SharedIO 200 Gbit/s
Storage (raw capacity)        diskless nodes

Job Managing and SLURM Partitions

The following tables provide information about the SLURM partitions for the Booster and DCGP partitions of the production environment. Please note that the SLURM email service is not active yet.

See also

Further information about job submission is reported in the general section Scheduler and Job Submission.

Booster

Partition        QOS                 Nodes per job                     Walltime   Max nodes per user   Priority   Notes
---------------  ------------------  --------------------------------  ---------  -------------------  ---------  ----------------------------
boost_fua_prod   normal              max = 16                          24:00:00   32                   40
                 boost_qos_fuabprod  min = 17 (full nodes), max = 32   24:00:00   32                   60         runs on 96 nodes (GrpTRES)
boost_fua_dbg    normal              max = 2                           00:30:00   2                    40         runs on 8 nodes (GrpTRES)
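
As a reference, a minimal batch script sketch for the Booster partition; <account_name>, the GRES syntax, and ./my_gpu_app are placeholders and assumptions to adapt to your project:

    #!/bin/bash
    #SBATCH --job-name=boost_test
    #SBATCH --partition=boost_fua_prod
    #SBATCH --qos=normal
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=4        # one MPI rank per GPU
    #SBATCH --cpus-per-task=8
    #SBATCH --gres=gpu:4               # assumed GRES syntax; verify on the system
    #SBATCH --time=01:00:00
    #SBATCH --account=<account_name>   # your EUROfusion project account

    srun --cpu-bind=cores ./my_gpu_app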

DCGP

Partition        QOS                 Nodes per job                      Walltime     Max nodes per user   Priority   Notes
---------------  ------------------  ---------------------------------  -----------  -------------------  ---------  ----------------------------
dcgp_fua_prod    normal              max = 64                           24:00:00     64                   40
                 dcgp_qos_fuabprod   min = 65 (full nodes), max = 128   24:00:00     128                  60         runs on 640 nodes (GrpTRES)
                 dcgp_qos_fualprod   max = 3                            4-00:00:00   3                    40
dcgp_fua_dbg     normal              max = 2                            00:30:00     2                    40         runs on 8 nodes (GrpTRES)
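
Similarly, a minimal sketch for the DCGP partition, sized to fill the 256 CPUs of each node; <account_name> and ./my_app are placeholders:

    #!/bin/bash
    #SBATCH --job-name=dcgp_test
    #SBATCH --partition=dcgp_fua_prod
    #SBATCH --qos=normal
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=128      # 128 tasks x 2 CPUs = 256 CPUs/node
    #SBATCH --cpus-per-task=2
    #SBATCH --time=24:00:00
    #SBATCH --account=<account_name>   # your EUROfusion project account

    srun --cpu-bind=cores ./my_app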

Processes/Threads Binding/Affinity

Processes Binding

  • By default, srun (the SLURM launcher) performs an automatic binding. For multi-threaded applications, request the proper --cpus-per-task and bind the processes to cores (srun --cpu-bind=cores). See the combined sketch after this list.

  • By default, OpenMPI libraries (mpirun launcher) bind processes to cores. For multi-threaded applications this leads to core oversubscription. Ensure that you are either not bound at all (by specifying --bind-to none) or bound to multiple cores, using an appropriate binding level or a specific number of processing elements per application process (--map-by socket:PE=$SLURM_CPUS_PER_TASK).

  • By default, IntelMPI libraries (mpirun launcher with the hydra process manager) perform a correct binding. If you opt for IntelMPI mpirun as the launcher, unset the I_MPI_PMI_LIBRARY variable (meant for using IntelMPI with srun) defined when loading the module, to avoid verbose warnings.
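
A combined sketch of the three launch modes described above; ./my_app and the resource numbers are illustrative placeholders:

    # srun: request the proper CPUs per task and bind processes to cores
    srun --ntasks=8 --cpus-per-task=4 --cpu-bind=cores ./my_app

    # OpenMPI mpirun, multi-threaded app: bind each rank to as many cores as its threads...
    mpirun --map-by socket:PE=$SLURM_CPUS_PER_TASK ./my_app
    # ...or disable binding altogether
    mpirun --bind-to none ./my_app

    # IntelMPI mpirun: unset the PMI library variable set by the module, then launch
    unset I_MPI_PMI_LIBRARY
    mpirun ./my_app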

Threads Affinity

None of the available compilers (gcc, nvhpc, aocc, intel) binds threads to cores by default. You can control thread affinity with the standard OMP_PLACES/OMP_PROC_BIND environment variables, as in the sketch below.
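
For example, a hypothetical single-node run of an 8-thread OpenMP application (./my_omp_app is a placeholder):

    export OMP_NUM_THREADS=8
    export OMP_PLACES=cores      # pin each thread to a core
    export OMP_PROC_BIND=close   # keep threads close to the master thread
    srun --ntasks=1 --cpus-per-task=8 --cpu-bind=cores ./my_omp_app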

Known Issues

This section collects currently known issues affecting PITAGORA.

The list below is intended as a quick reference for users who may experience problems on the system. We strongly encourage all users to report any issues they encounter - whether listed here or not - to the user support team.

  • Internode GPUDirect Communication: UCX GPUDirect RDMA Error

  • IntelMPI Provider/Fabric Compatibility on AMD Processors