Leonardo

Leonardo is the pre-exascale Tier-0 supercomputer of the EuroHPC Joint Undertaking (JU), hosted by CINECA and currently located at the Bologna DAMA-Technopole in Italy. This guide provides specific information about the Leonardo cluster, including details that differ from the general behavior described in the broader HPC Clusters section.

Access to the System

The machine is reachable via the SSH (Secure Shell) protocol at the hostname: login.leonardo.cineca.it.

The connection is automatically established with one of the available login nodes. It is also possible to connect to Leonardo using one of the specific login hostnames:

  • login01-ext.leonardo.cineca.it

  • login02-ext.leonardo.cineca.it

  • login05-ext.leonardo.cineca.it

  • login07-ext.leonardo.cineca.it
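For example, a connection can be opened as follows (the <username> placeholder stands for your CINECA username):

    # Connect to one of the available login nodes (chosen automatically)
    ssh <username>@login.leonardo.cineca.it

    # Connect to a specific login node
    ssh <username>@login02-ext.leonardo.cineca.it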

Warning

Access to Leonardo requires two-factor authentication (2FA). More information is available in the Access to the Systems section.

System Architecture

The cluster, supplied by EVIDEN ATOS, is based on two new, specifically designed compute blades, which are available through two distinct Slurm partitions on the cluster:

  • X2135 GPU blade based on NVIDIA Ampere A100-64 accelerators - Booster partition.

  • X2140 CPU-only blade based on Intel Sapphire Rapids processors - Data Centric General Purpose (DCGP) partition.

The overall system architecture uses NVIDIA Mellanox InfiniBand High Data Rate (HDR) connectivity with in-network computing acceleration engines, providing low latency and high data throughput for AI and HPC application performance and scalability.

The Booster partition entered pre-production in May 2023 and moved to full production in July 2023. The DCGP partition followed, starting pre-production in January 2024 and reaching full production in February 2024.

Hardware Details

Booster

  • Model: Atos BullSequana X2135 "Da Vinci" single-node GPU blade
  • Racks: 116
  • Nodes: 3456
  • Processors/node: 1x Intel Xeon Platinum 8358 (Ice Lake)
  • Cores/node: 32
  • Accelerators/node: 4x NVIDIA custom Ampere (A100), 64 GiB HBM2e, NVLink 3.0 (200 GB/s)
  • Local storage/node (tmpfs): none
  • RAM/node: 512 GiB DDR4 3200 MHz
  • Rmax: 241.2 PFlop/s (Top500)
  • Internal network: 200 Gbps NVIDIA Mellanox HDR InfiniBand, Dragonfly+ topology
  • Storage (raw capacity): 106 PiB based on DDN ES7990X and hard disk drives (Capacity Tier); 5.7 PiB based on DDN ES400NVX2 and solid-state drives (Fast Tier)

DCGP

  • Model: Atos BullSequana X2140 three-node CPU blade
  • Racks: 22
  • Nodes: 1536
  • Processors/node: 2x Intel Xeon Platinum 8480+ (Sapphire Rapids)
  • Cores/node: 112
  • Accelerators/node: none
  • Local storage/node (tmpfs): 3 TiB
  • RAM/node: 512 (8x64) GiB DDR5 4800 MHz
  • Rmax: 7.84 PFlop/s (Top500)
  • Internal network: 200 Gbps NVIDIA Mellanox HDR InfiniBand, Dragonfly+ topology
  • Storage (raw capacity): 106 PiB based on DDN ES7990X and hard disk drives (Capacity Tier); 5.7 PiB based on DDN ES400NVX2 and solid-state drives (Fast Tier)

File Systems and Data Management

The storage organization conforms to the CINECA infrastructure. General information is reported in the File Systems and Data Management section. In the following, only the differences with respect to the general behavior are listed and explained.

$TMPDIR
  • on the local SSD disks of the login nodes (14 TB of capacity), mounted as /scratch_local (TMPDIR=/scratch_local). This is a shared area with no quota; remove your files as soon as they are no longer needed. A cleaning procedure will be enforced in case of improper use of the area.

  • on the local SSD disks of the serial node (lrd_all_serial, 14 TB of capacity), managed via the Slurm job_container/tmpfs plugin. The plugin provides a job-specific, private temporary file system, with private instances of /tmp and /dev/shm in the job's user space (TMPDIR=/tmp, visible via the command df -h), removed at the end of the serial job. You can request the resource via the sbatch directive or srun option --gres=tmpfs:XX (for instance: --gres=tmpfs:200G), with a maximum of 1 TB for serial jobs. If not explicitly requested, /tmp has a default size of 10 GB.

  • on the local SSD disks of the DCGP nodes (3 TB of capacity). As for the serial node, the local /tmp and /dev/shm areas are managed via the plugin, which at the start of the job mounts private instances of /tmp and /dev/shm in the job's user space (TMPDIR=/tmp, visible via the command df -h /tmp) and unmounts them at the end of the job (all data will be lost). You can request the resource via the sbatch directive or srun option --gres=tmpfs:XX, with a maximum of all the available 3 TB for DCGP nodes. As for the serial node, if not explicitly requested, /tmp has a default size of 10 GB. Please note: for DCGP jobs the requested amount of the gres/tmpfs resource contributes to the consumed budget, changing the number of accounted equivalent core hours; see the dedicated section on Accounting. A minimal request sketch is given after this list.

  • in RAM on the diskless Booster nodes (fixed size of 10 GB; no increase is allowed, and the gres/tmpfs resource is disabled).
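As a minimal sketch of a DCGP batch job requesting a larger private /tmp via the gres/tmpfs resource (the account name and executable are placeholders):

    #!/bin/bash
    #SBATCH --partition=dcgp_usr_prod
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=112
    #SBATCH --time=01:00:00
    #SBATCH --gres=tmpfs:200G            # private /tmp of 200 GB instead of the 10 GB default
    #SBATCH --account=<project_account>  # placeholder

    # The plugin mounts a private /tmp for this job and points TMPDIR to it
    df -h /tmp
    echo "TMPDIR is $TMPDIR"

Remember that on DCGP nodes the requested gres/tmpfs amount contributes to the accounted equivalent core hours.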

Job Management and Slurm Partitions

The following overview summarizes the Slurm partitions and QOS available for the Booster and DCGP partitions.

See also

Further information about job submission is reported in the general Scheduler and Job Submission section.

Booster

Partition: lrd_all_serial (default)

  • QOS normal: max 4 cores (8 logical cores) per job, walltime 04:00:00, max 1 node / 4 cores (30800 MB RAM) per user, priority 40. No GPUs; Hyperthreading x2; budget-free.

Partition: boost_usr_prod

  • QOS normal: max 64 nodes per job, walltime 24:00:00, priority 40.
  • QOS boost_qos_dbg: max 2 nodes per job, walltime 00:30:00, max 2 nodes / 64 cores / 8 GPUs per user, priority 80.
  • QOS boost_qos_bprod: min 65 nodes, max 256 nodes per job, walltime 24:00:00, max 256 nodes per user, priority 60.
  • QOS boost_qos_lprod: max 3 nodes per job, walltime 4-00:00:00, max 3 nodes / 12 GPUs per user, priority 40.

Partition: boost_fua_dbg

  • QOS normal: max 2 nodes per job, walltime 00:10:00, max 2 nodes / 64 cores / 8 GPUs per user, priority 40. Runs on 2 nodes.

Partition: boost_fua_prod

  • QOS normal: max 16 nodes per job, walltime 24:00:00, max 32 nodes / 3584 cores per user, 4 running jobs per user account, priority 40.
  • QOS boost_qos_fuabprod: min 17 nodes, max 32 nodes per job, walltime 24:00:00, max 32 nodes / 3584 cores per user, priority 60. Runs on 49 nodes; minimum is 17 FULL nodes.
  • QOS qos_fualowprio: max 16 nodes per job, walltime 08:00:00, priority 0.

Note

The boost_fua_dbg and boost_fua_prod partitions can be used exclusively by EUROfusion users. For more information, see the dedicated EUROfusion section.
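As an illustrative sketch only (the account name and executable are placeholders), a single-node job on boost_usr_prod using all 4 GPUs under the normal QOS could look like:

    #!/bin/bash
    #SBATCH --partition=boost_usr_prod
    #SBATCH --qos=normal
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=4
    #SBATCH --cpus-per-task=8            # 4 tasks x 8 cores = 32 cores of the node
    #SBATCH --gres=gpu:4                 # all 4 A100 GPUs of the node
    #SBATCH --time=04:00:00              # within the 24:00:00 limit of the normal QOS
    #SBATCH --account=<project_account>  # placeholder

    srun ./my_gpu_application            # placeholder executable

The script is then submitted with sbatch, and the chosen QOS determines the limits and priority listed above.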

DCGP

Partition: lrd_all_serial (default)

  • QOS normal: max 4 cores (8 logical cores) per job, walltime 04:00:00, max 1 node / 4 cores (30800 MB RAM) per user, priority 40. Hyperthreading x2; budget-free.

Partition: dcgp_usr_prod

  • QOS normal: max 16 nodes per job, walltime 24:00:00, max 512 nodes per project account, priority 40.
  • QOS dcgp_qos_dbg: max 2 nodes per job, walltime 00:30:00, max 2 nodes / 224 cores per user account, 512 nodes per project account, priority 80.
  • QOS dcgp_qos_bprod: min 17 nodes, max 128 nodes per job, walltime 24:00:00, max 128 nodes per user account, 512 nodes per project account, priority 60. GrpTRES = 1536 nodes; minimum is 17 FULL nodes.
  • QOS dcgp_qos_lprod: max 3 nodes per job, walltime 4-00:00:00, max 3 nodes / 336 cores per user account, 512 nodes per project account, priority 40.

Partition: dcgp_fua_dbg

  • QOS normal: max 2 nodes per job, walltime 00:10:00, max 2 nodes / 224 cores per user, priority 40. Runs on 2 nodes.

Partition: dcgp_fua_prod

  • QOS normal: max 16 nodes per job, walltime 24:00:00, priority 40.
  • QOS dcgp_qos_fuabprod: min 17 nodes, max 64 nodes per job, walltime 24:00:00, max 64 nodes / 7168 cores per user, priority 60. Runs on 130 nodes; minimum is 17 FULL nodes.
  • QOS qos_fualowprio: max 16 nodes per job, walltime 08:00:00, priority 0.

Note

The dcgp_fua_dbg and dcgp_fua_prod partitions can be used exclusively by EUROfusion users. For more information, see the dedicated EUROfusion section.
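Similarly, a minimal sketch of a two-node debug job on the DCGP partition (account name and executable are placeholders):

    #!/bin/bash
    #SBATCH --partition=dcgp_usr_prod
    #SBATCH --qos=dcgp_qos_dbg
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=112        # full 112-core DCGP nodes
    #SBATCH --time=00:30:00              # maximum walltime of the debug QOS
    #SBATCH --account=<project_account>  # placeholder

    srun ./my_cpu_application            # placeholder executable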

Network Architecture

Leonardo features a state-of-the-art interconnect system tailored for high-performance computing (HPC). It delivers low latency and high bandwidth by leveraging NVIDIA Mellanox InfiniBand HDR (High Data Rate) technology, powered by NVIDIA QUANTUM QM8700 Smart Switches, and a Dragonfly+ topology. Below is an overview of its architecture and key features:

  • Hierarchical Cell Structure: The system is structured into multiple cells, each comprising a group of interconnected compute nodes.
  • Inter-cell Connectivity: As illustrated in the figure below, cells are connected via an all-to-all topology. Each pair of distinct cells is linked by 18 independent connections, each passing through a dedicated Layer 2 (L2) switch. This design ensures high availability and reduces congestion.
  • Intra-cell Topology: Inside each cell, a non-blocking two-layer fat-tree topology is used, allowing scalable and efficient intra-cell communication.
  • System Composition:
    • 19 cells dedicated to the Booster partition.
    • 2 cells for the DCGP (Data-Centric General Purpose) partition.
    • 1 hybrid cell with both accelerated (36 Booster nodes) and conventional (288 DCGP nodes) compute resources.
    • 1 cell allocated for management, storage, and login services.
  • Adaptive Routing: The network employs adaptive routing, dynamically optimizing data paths to alleviate congestion and maintain performance under load.
[Figure: all-to-all inter-cell connectivity of the Dragonfly+ topology]
Cell Configuration and Intra-cell Connectivity

Each Booster cell is composed of:

  • 6 × Atos BullSequana XH2000 racks, each containing:
    • 3 × Level 2 (L2) switches
    • 3 × Level 1 (L1) switches
    • 30 compute nodes — each equipped with 4 GPUs, each connected via a dedicated 100 Gbps port

Total per Booster cell: 18 L2 switches, 18 L1 switches, and 180 compute nodes.

Connectivity Overview

Level 2 (L2) Switches:

  • UP: 22 × 200 Gbps ports connecting to L2 switches in other cells
  • DOWN: 18 × 200 Gbps ports connecting to L1 switches within the cell
  • Oversubscription: 0.8:1

Level 1 (L1) Switches:

  • UP: 18 × 200 Gbps ports connected to all L2 switches in the cell
  • DOWN: 40 × 100 Gbps ports connected to GPUs across 10 compute nodes
  • Oversubscription: 1.11:1
[Figure: Booster cell configuration and intra-cell connectivity]

Each DCGP cell is composed of:

  • 8 × Atos BullSequana XH2000 racks, each containing:
    • 3 × Level 2 (L2) switches (in 6 of the 8 racks; the remaining 2 racks host none)
    • 2 × Level 1 (L1) switches
    • 78 compute nodes — each connected via a dedicated 100 Gbps port

Total per DCGP cell: 18 L2 switches, 16 L1 switches, and 624 compute nodes.

Connectivity Overview

Level 2 (L2) Switches:

  • UP: 22 × 200 Gbps ports connecting to L2 switches in other cells
  • DOWN: 18 × 200 Gbps ports connecting to L1 switches within the same cell
  • Oversubscription ratio: 0.8:1

Level 1 (L1) Switches (divided into two groups):

  • 9 switches with 40 downlinks:
    • UP: 18 × 200 Gbps ports connected to all L2 switches in the cell
    • DOWN: 40 × 100 Gbps ports connected to compute nodes
    • Oversubscription ratio: 1.11:1
  • 9 switches with 38 downlinks:
    • UP: 18 × 200 Gbps ports connected to all L2 switches in the cell
    • DOWN: 38 × 100 Gbps ports connected to compute nodes
    • Oversubscription ratio: 1.05:1
[Figure: DCGP cell configuration and intra-cell connectivity]

Advanced Information

Network Topology - Map

The topology is presented in a table format, where each row corresponds to a compute node. For each node, the table specifies the associated L1 switch and cell, providing a clear overview of the physical and logical network layout within the cluster.

Network Topology - Map

Network Topology - Distance Matrix

The attached compressed CSV file contains the distance matrix of all compute nodes in the cluster. The matrix uses the following metric to represent the network distance between any two nodes:

  • 0 – Same node.

  • 1 – Same L1 switch, same cell.

  • 2 – Different L1 switch, same cell.

  • 3 – Different L1 switch and different cell.

This matrix can be used to analyze communication locality and optimize node selection for distributed workloads.

Distance Matrix
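As a sketch of how the matrix could be queried with standard tools (the CSV layout, with node names in the first row and first column, as well as the node and file names, are assumptions rather than details taken from the actual file):

    # Assumed layout: header row and first column contain node names,
    # remaining cells hold the 0-3 distance metric.
    node_a="lrdn0001"   # hypothetical node names
    node_b="lrdn0042"

    awk -F',' -v a="$node_a" -v b="$node_b" '
        NR == 1 { for (i = 2; i <= NF; i++) if ($i == b) col = i }
        $1 == a { print "distance(" a "," b ") = " $col }
    ' distance_matrix.csv   # placeholder file name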

Switch Naming Format
isw<RRrrSS>

where <RRrrSS> is a 5- or 6-digit number that varies based on the location and type of the switch.

Specifically:

  • RR = region number (1 or 2 digits)

  • rr = rack number (2 digits)

  • SS = switch id (2 digits)

Note

If SS is an even number, it refers to an L1 switch; if it is an odd number, it refers to an L2 switch.
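For illustration, the fields of a hypothetical switch name can be decoded with plain shell string slicing (isw120304 is an invented example, not an actual switch name):

    name="isw120304"                            # hypothetical switch name
    digits="${name#isw}"                        # -> 120304
    ss="${digits: -2}"                          # switch id: 04 (even -> L1 switch)
    rr="${digits: -4:2}"                        # rack number: 03
    region="${digits:0:$(( ${#digits} - 4 ))}"  # region number: 12
    echo "region=$region rack=$rr switch=$ss"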

Documents

  • Article on Leonardo architecture and the technologies adopted for its GPU-accelerated partition: CINECA Supercomputing Centre, SuperComputing Applications and Innovation Department. (2024). “LEONARDO: A Pan-European Pre-Exascale Supercomputer for HPC and AI applications.”, Journal of large-scale research facilities, 8, A186. https://doi.org/10.17815/jlsrf-8-186

  • Details about the new technologies included in the Whitley platform with Intel Xeon Ice Lake processors contained in the Leonardo pre-exascale system (link)

  • Additional documents (link)

Some tuning guides for dedicated environments (ML/DL or HPC clusters):