EFGW Gateway
============

EFGW is the new EUROfusion Gateway system hosted by CINECA at its headquarters in Casalecchio di Reno, Bologna, Italy. The cluster is supplied by Lenovo Corp. and is equipped with 15 AMD nodes, including 4 fat-memory nodes (1.5 TB) and 1 node with 4 H100 GPUs and local SSD storage.

How to get a User Account
-------------------------

Users of the old EUROfusion Gateway were migrated to the new system with the same username.

For new users, the following steps are required to get access to EFGW:

* Register on the `UserDB portal `_.
* Complete the registration by filling in your affiliation on the **Institution** page and uploading a valid identity document on the **Documents for HPC** page.
* Download the :download:`Gateway User Agreement <../files/eurofusion_gateway_user_agreement_26_10_2022.pdf>` (GUA).
* Fill in and sign the GUA, then send it via email to the EUROfusion Coordination Officer Denis Kalupin (Denis.Kalupin-at-euro-fusion.org).

Access to the System
--------------------

The machine is reachable via the ``ssh`` (Secure Shell) protocol at the hostname **login.eufus.eu**. The connection is automatically established to one of the available login nodes.

It is also possible to connect to **EFGW** using one of the specific login hostnames:

* **viz05-ext.efgw.cineca.it**
* **viz06-ext.efgw.cineca.it**
* **viz07-ext.efgw.cineca.it**
* **viz08-ext.efgw.cineca.it**

Each login node is equipped with two AMD EPYC 9254 24-Core Processors. An alias hostname pointing to all the login nodes in a round-robin fashion will be set up in the coming weeks.

.. warning::

   **Access to EFGW requires two-factor authentication (2FA) via the dedicated provisioner efgw.**
   More information is available in the section :ref:`general/access:Access to the Systems`.

**Please note**: EFGW users have to obtain the ssh certificate from the **efgw** provisioner.

* In the section :ref:`general/access:How to activate the **2FA** and the **OTP** generator`, use the `step-CA EFGW client `_ instead of the step-CA CINECA-HPC client.
* In the section :ref:`general/access:How to configure *smallstep* client`, **Step 3**, obtain the ssh certificate from the efgw provisioner:

  .. code-block:: bash

     step ssh login 'username' --provisioner efgw

* In the section :ref:`general/access:How to manage authentication *certificates*`, use the efgw provisioner in all the key *step* commands (certificate re-generation, certificate creation in file format).

How to access EFGW with NX
--------------------------

1. Get an ssh key with *step* and put it in your ``$HOME/.ssh`` folder:

   .. code-block:: bash

      step ssh certificate 'username' --provisioner efgw ~/.ssh/gw_key

2. Configure an NX session as follows:

   .. image:: ../img/nx1.png
      :width: 600px
      :align: center

   .. image:: ../img/nx2.png
      :width: 600px
      :align: center

   .. image:: ../img/nx3.png
      :width: 600px
      :align: center

   .. image:: ../img/nx4.png
      :width: 600px
      :align: center

You can freely install software on your NX desktop using the "flatpak" package. Please refer to the `official documentation `_ for instructions on how to use it. **Please keep in mind that you need to use it with the --user flag.**
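As a minimal sketch of a user-level installation from Flathub (the application ID ``org.gnome.gedit`` is only an illustrative example, and the Flathub remote may already be configured on your desktop):

.. code-block:: bash

   # Add the Flathub remote for your user only (skip if it is already configured)
   flatpak remote-add --user --if-not-exists flathub https://flathub.org/repo/flathub.flatpakrepo

   # Install an application in your user space, no root privileges required
   flatpak install --user flathub org.gnome.gedit

   # Run it from your NX desktop session
   flatpak run org.gnome.gedit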
System Architecture
-------------------

The cluster, supplied by Lenovo, is based on AMD processors:

* 10 nodes with two AMD EPYC 9745 128-Core Processors and 738 GB DDR5 RAM per node
* 4 nodes with two AMD EPYC 9745 128-Core Processors and 1511 GB DDR5 RAM per node
* 1 node with two AMD EPYC 9354 32-Core Processors, 4 H100 GPUs, and 738 GB DDR5 RAM

File Systems and Data Management
--------------------------------

The storage organization conforms to the **CINECA** infrastructure. General information is reported in the :ref:`hpc/hpc_data_storage:File Systems and Data Management` section. In the following, only the differences with respect to the general behavior are listed and explained.

The storage is organized as a replica of the previous Gateway cluster: the data of the **/afs** and **/pfs** areas were copied to the new Lustre storage system (AFS itself is not available, only the data were copied). Please note that the path **/gss_efgw_work**, linked to the /pfs areas on the old Gateway, does not exist on the new Gateway.

The TMPDIR is defined:

* on the local SSD disks of the login nodes (2.5 TB of capacity), mounted as ``/scratch_local`` (``TMPDIR=/scratch_local``). This is a shared area with no quota; remove all files once they are no longer needed. A cleaning procedure will be enforced in case of improper use of the area.
* on the local SSD disk of the GPU node (850 GB of capacity, default size 10 GB)
* in RAM on all the 14 CPU-only, diskless compute nodes (with a fixed size of 10 GB)

On the GPU node, a larger local TMPDIR area can be requested, if needed, with the slurm directive:

.. code-block:: bash

   #SBATCH --gres=tmpfs:XXG

up to a maximum of 212.5 GB.

Environment and Customization
-----------------------------

The main tools and compilers are available through the module command when logging into the cluster:

.. code-block:: bash

   $ module av

To make available all the modules of the aocc, gcc, and OneAPI stacks installed by the CINECA staff, you need to load the "cineca-modules" module and execute the "module av" command:

.. code-block:: bash

   $ module load cineca-modules
   $ module av

For information about "module" usage, compilers, and MPI libraries, you can consult :ref:`hpc/hpc_enviroment:The module command` and :ref:`hpc/hpc_enviroment:Compilers`.

You can install any additional software you may need with `flatpak `_ or :ref:`hpc/hpc_enviroment:SPACK`.

How to make your $HOME/public open to all users
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to configure your ``$HOME/public`` on the new EFGW as it was on the old EFGW AFS filesystem, please add the proper ACL to your ``$HOME`` directory as follows:

.. code-block:: bash

   $ setfacl -m g:g2:x $HOME

Job Managing and Slurm Partitions
---------------------------------

In the following table you can find information about the Slurm partitions on the EFGW cluster.

+----------------+--------------------+--------------------------------+--------------+------------------------------+-------------------------+--------------+--------------------------+
| **Partition**  | **QOS**            | **#Cores per job**             | **Walltime** | **Max jobs/res. per user**   | **Max memory per node** | **Priority** | **Notes**                |
+================+====================+================================+==============+==============================+=========================+==============+==========================+
| gw             | noQOS              | max=768 cores                  | 48:00:00     | 2000 submitted jobs          | 735 GB / 1511 GB        | 40           | Four fat memory nodes    |
+                +--------------------+--------------------------------+--------------+------------------------------+-------------------------+--------------+--------------------------+
|                | qos_dbg            | max=128 cores                  | 00:30:00     | Max 128 cores                | 735 GB / 1511 GB        | 80           | Can run on max 128 cores |
+                +--------------------+--------------------------------+--------------+------------------------------+-------------------------+--------------+--------------------------+
|                | qos_gwlong         | max=256 cores                  | 144:00:00    | Max 128 cores, 2 running jobs| 735 GB / 1511 GB        | 40           | Four fat memory nodes    |
+----------------+--------------------+--------------------------------+--------------+------------------------------+-------------------------+--------------+--------------------------+
| gwgpu          | noQOS              | max=16 cores/1 gpu / 188250 MB | 08:00:00     | 1 running job                | 735 GB                  | 40           |                          |
+----------------+--------------------+--------------------------------+--------------+------------------------------+-------------------------+--------------+--------------------------+

.. note::

   In the new Gateway the debug partition has been replaced by a QOS.
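As a starting point, the sketch below shows a minimal batch script for the **gw** partition; the job name, module selection, and executable are placeholders to adapt to your own workflow:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=my_job           # placeholder job name
   #SBATCH --partition=gw
   ##SBATCH --qos=qos_dbg              # uncomment for short test runs (max 128 cores, 30 minutes)
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=64        # stay within the per-job core limits of the chosen QOS
   #SBATCH --time=02:00:00             # stay within the partition/QOS walltime
   #SBATCH --output=%x_%j.out

   # Load your environment (module names are placeholders)
   module load cineca-modules

   srun ./my_app                       # placeholder executable

Save it to a file (e.g. ``job.sh``) and submit it with ``sbatch job.sh``, adjusting the requested resources to the limits reported in the table above.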
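To check the available partitions, the configured QOS limits, and the state of your own jobs directly on the cluster, the standard Slurm query commands can be used, for example:

.. code-block:: bash

   # Summary of the available partitions
   sinfo -s

   # QOS definitions and their limits as configured in Slurm
   sacctmgr show qos format=Name,Priority,MaxWall

   # Your queued and running jobs
   squeue -u $USER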
How to request support
----------------------

Please write an email to superc@cineca.it, specifying EFGW in the subject.