Unimore Aries Cluster User Guide

Scope

This guide explains how to start using the Unimore Aries cluster for this repository, starting from a clean workstation and a clean cluster shell. It covers SSH login, GitHub SSH setup, repository cloning, Conda setup, basic Slurm inspection, an interactive srun check, and a first sbatch submission.

The Aries-specific values below come from the local call notes in .temp/aries_call.txt. Re-check them with sinfo, showqos, and the current cluster welcome message before running large jobs.

Do not paste passwords into scripts, Markdown files, shell history notes, or Git-tracked files. When a password is required, type it only into the SSH prompt.

Aries Values To Know

Item                              Value
--------------------------------  --------------------------------------------
SSH host                          aries.hpc.unimo.it
Account                           xilab
Main GPU partition for this work  ice4hpc
GPU QoS                           gpus
GPU node mentioned in the call    gnode01
Recommended first GPU request     --gpus=1g.20gb:1
Recommended first mode            single GPU
Work directories                  /scratch1/<username> or /scratch2/<username>
Persistent home                   /unimore_home/<username> or ~

The call notes indicate that gnode01 exposes H100 GPU resources through MIG 20 GB slices. Treat nvidia-smi inside an allocated GPU job as the real runtime check.

Mental Model

Aries has a login node and compute nodes.

  • Use the login node for light setup: SSH, Git clone, editing files, Conda environment creation, and submitting jobs.

  • Use compute nodes for Python tests, CUDA checks, training, and any expensive execution.

  • Use srun for an interactive allocation. If the SSH session dies, the interactive work can die with it.

  • Use sbatch for non-interactive jobs. The scheduler keeps running the job after you disconnect.

Connect With SSH

From Windows PowerShell:

ssh dferrari@aries.hpc.unimo.it

From Linux or macOS:

ssh dferrari@aries.hpc.unimo.it

Replace dferrari with your Aries username. After a successful login you should land on the login node, for example with a prompt similar to:

[dferrari@fe01 ~]$

Choose A Working Directory

Use scratch for active repository work and job output:

cd /scratch1/$USER
pwd

If /scratch1/$USER is not available or is busy, try:

cd /scratch2/$USER
pwd

Use home for persistent shell configuration and small configuration files:

cd ~
pwd
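The scratch fallback described above can also be scripted. The following is a minimal sketch, assuming Bash; pick_workdir is a hypothetical helper name, not something provided by Aries. It prints the first candidate directory that exists and is writable.

```shell
# Hypothetical helper (not part of the cluster tooling): print the first
# candidate directory that exists and is writable by the current user.
pick_workdir() {
  local candidate
  for candidate in "$@"; do
    if [ -d "$candidate" ] && [ -w "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no writable work directory among: $*" >&2
  return 1
}

# On Aries the candidates would be the two scratch paths from this guide:
#   workdir="$(pick_workdir "/scratch1/$USER" "/scratch2/$USER")" && cd "$workdir"
```

The helper only checks existence and write permission. It cannot tell whether a scratch file system is busy, so that judgment stays manual.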

Load Cluster Modules

List available modules:

module av

Load Slurm if it is not already loaded:

module load slurm

Load CUDA and Anaconda:

module load cuda
module load Anaconda3

If Anaconda3 is not the exact module name on the current module tree, inspect the available names:

module av Anaconda
module av python

Run Conda initialization only if conda activate is not available:

conda init bash

Then disconnect and reconnect, or reload the shell:

exec bash -l

Add A GitHub SSH Key On Aries

First check whether an SSH key already exists:

ls -al ~/.ssh

Generate a new key if needed. Use the email associated with your GitHub account:

ssh-keygen -t ed25519 -C "your_email@example.com"

Accept the default path unless you intentionally manage multiple keys. Use a passphrase if you want the private key protected on the cluster.

Start the SSH agent and add the key:

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

Print the public key:

cat ~/.ssh/id_ed25519.pub

Open GitHub in the browser on your local workstation and add the printed public key under Settings -> SSH and GPG keys -> New SSH key. The official GitHub guide is:

https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account

Test the GitHub SSH connection from Aries:

ssh -T git@github.com

A successful test usually identifies your GitHub account and says GitHub does not provide shell access. That is expected.
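The "check before generating" step at the start of this section can be made explicit so that an existing key is never overwritten by accident. This is a small sketch, assuming Bash and the default ed25519 file names used above; ssh_key_status is a hypothetical helper name.

```shell
# Hypothetical helper: report whether an ed25519 key pair exists in a directory.
ssh_key_status() {
  local ssh_dir="$1"
  if [ -f "$ssh_dir/id_ed25519" ] && [ -f "$ssh_dir/id_ed25519.pub" ]; then
    echo "present"
  else
    echo "missing"
  fi
}

# Check the default location before running ssh-keygen.
if [ "$(ssh_key_status ~/.ssh)" = "missing" ]; then
  echo "No default ed25519 key pair found; generate one with ssh-keygen."
else
  echo "Key pair found; reuse it instead of generating a new one."
fi
```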

Clone The Repository

Move to scratch:

cd /scratch1/$USER

Clone with SSH:

git clone git@github.com:XiLab-Robotics/Physics-Informed-Neural-Networks.git
cd Physics-Informed-Neural-Networks

Check the remote:

git remote -v

Optional but useful Git identity setup on Aries:

git config --global user.name "Your Name"
git config --global user.email "your_email@example.com"

Create The Conda Environment

Load modules first:

module load cuda
module load Anaconda3

Create the project environment:

conda create -y -n pinns_env python=3.12
conda activate pinns_env
python -m pip install --upgrade pip

Install PyTorch using the official PyTorch selector for the CUDA build that matches the loaded CUDA module and cluster driver:

https://pytorch.org/get-started/locally/

For a CUDA 12.4 module, start by checking the current PyTorch selector and then use the generated command. A typical pip pattern is:

python -m pip install torch --index-url https://download.pytorch.org/whl/cu124

If the selector recommends a newer CUDA wheel, use the selector output instead of this example. Then install the repository dependencies:

python -m pip install -r requirements.txt

For the recovered original RCIM workflow only, there is also a nested historical requirements file:

python -m pip install -r scripts/paper_reimplementation/rcim_ml_compensation/recovered_original_workflow/requirements.txt

Use that nested file only when you are explicitly working in the recovered original workflow.
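The cu124 suffix in the PyTorch example follows the pattern "cu" plus the CUDA major and minor version. The sketch below, assuming Bash, derives that tag from a version string and prints the matching pip command without running it. cuda_wheel_tag is a hypothetical helper name, and not every CUDA version has a published wheel index, so the official selector remains the authority.

```shell
# Hypothetical helper: turn a CUDA version such as "12.4" into a tag such as "cu124".
cuda_wheel_tag() {
  local version="$1"
  case "$version" in
    *.*) ;;                                   # require MAJOR.MINOR[.PATCH]
    *) echo "expected MAJOR.MINOR, got: $version" >&2; return 1 ;;
  esac
  local major="${version%%.*}"                # text before the first dot
  local rest="${version#*.}"                  # text after the first dot
  local minor="${rest%%.*}"                   # drop an optional patch level
  printf 'cu%s%s\n' "$major" "$minor"
}

# Print (do not run) the install command for the CUDA 12.4 example in this guide.
tag="$(cuda_wheel_tag "12.4")"
echo "python -m pip install torch --index-url https://download.pytorch.org/whl/${tag}"
```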

Verify Python And CUDA

On the login node, verify imports only:

python -c "import torch, lightning, numpy, pandas, sklearn; print(torch.__version__); print(lightning.__version__)"

Do not assume CUDA is available on the login node. Check CUDA inside an allocated GPU job.

Inspect Slurm State

Show partitions and node states:

sinfo

Show queue state:

squeue

Show only your jobs:

squeue -u "$USER"

On Aries, squeue --me is also available and is the shortest reliable way to show only your jobs:

squeue --me

The login shell may also define an sq shorthand. Run type sq to see whether it exists and what it expands to.

Show QoS limits from the login node:

showqos

If showqos is missing, first confirm that you are on the login node and that Slurm is loaded:

hostname
module load slurm
showqos

showqos can be unavailable inside an allocated compute-node shell. Exit back to the login node before using it for quota checks.

Common Slurm states:

State  Meaning
-----  ------------------------------------
R      Job is running (squeue)
PD     Job is pending (squeue)
CD     Job completed (squeue)
CA     Job was cancelled (squeue)
idle   Node has available resources (sinfo)
mix    Node is partially allocated (sinfo)
alloc  Node is fully allocated (sinfo)
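When reading squeue or sinfo output in a script or a quick loop, the state codes can be turned into readable text. The sketch below assumes Bash; describe_state is a hypothetical helper. In squeue output CD means COMPLETED and CA means CANCELLED.

```shell
# Hypothetical helper: translate common Slurm state codes into short descriptions.
describe_state() {
  case "$1" in
    R)     echo "job running" ;;
    PD)    echo "job pending" ;;
    CD)    echo "job completed" ;;
    CA)    echo "job cancelled" ;;
    idle)  echo "node has available resources" ;;
    mix)   echo "node partially allocated" ;;
    alloc) echo "node fully allocated" ;;
    *)     echo "unknown state: $1" ;;
  esac
}

# Print a small legend for the codes seen most often.
for code in R PD CD CA idle mix alloc; do
  printf '%-5s %s\n' "$code" "$(describe_state "$code")"
done
```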

Cancel one of your jobs:

scancel <job_id>

Cancel all of your jobs only when you really intend to stop them:

scancel -u "$USER"

Show details for one job:

scontrol show job <job_id>

Replace <job_id> with the numeric job id printed by srun, sbatch, or squeue.

Interactive Compute-Node Sessions With srun

Use interactive srun --pty /bin/bash only when you want to land on a compute node and run commands manually from the terminal. This is useful for short checks, debugging, and confirming that modules, Conda, CUDA, and paths work. If the SSH session dies, the interactive work can die with it.

The interactive shell can start with a different Conda prompt, for example (base) even if the login-node shell was already in pinns_env. Always load modules and activate the intended Conda environment again inside the allocated node shell.

The prompt tells you where you are:

[dferrari@fe03 Physics-Informed-Neural-Networks]$   # login node
[dferrari@gnode01 Physics-Informed-Neural-Networks]$ # compute node
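Besides reading the prompt, a script can check whether it is running inside a Slurm allocation by looking at SLURM_JOB_ID, which Slurm sets for job steps. The sketch below assumes Bash and assumes that a plain login shell does not have this variable set; node_role is a hypothetical helper name.

```shell
# Hypothetical helper: describe whether the current shell belongs to a Slurm job.
node_role() {
  if [ -n "${SLURM_JOB_ID:-}" ]; then
    echo "inside Slurm job ${SLURM_JOB_ID}"
  else
    echo "no Slurm job (probably the login node)"
  fi
}

# Show the node name together with the detected role.
echo "$(hostname): $(node_role)"
```

A check like this can guard expensive commands so they refuse to start outside an allocation.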

Interactive GPU Node

Start with a modest request. This matches the Aries call notes and avoids asking for more CPU or memory than the first check needs:

srun \
  --ntasks=4 \
  --nodes=1 \
  --mem=40g \
  --partition=ice4hpc \
  --account=xilab \
  --qos=gpus \
  --gpus=1g.20gb:1 \
  --pty \
  --mpi=pmix \
  /bin/bash

Once the shell opens on the allocated node, load modules and activate the environment:

module load cuda
module load Anaconda3
conda activate pinns_env

Check the GPU:

nvidia-smi

Check PyTorch CUDA visibility:

python -c "import torch; print('cuda_available=', torch.cuda.is_available()); print('device_count=', torch.cuda.device_count()); print('device_name=', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"

Exit the interactive allocation when finished:

exit

Interactive CPU Node

For work without GPU acceleration, use a CPU partition and a CPU QoS. The call notes say that CPU work should use QoS normal, high, or low; if no QoS is specified, Aries can select the default QoS. The live sinfo output shown in the validated session exposed CPU-oriented partitions such as ulow, low, high, and user-debug.

Use user-debug only for short checks because its time limit is 30 minutes:

srun \
  --nodes=1 \
  --ntasks=1 \
  --cpus-per-task=4 \
  --time=00:30:00 \
  --mem=20g \
  --partition=user-debug \
  --account=xilab \
  --qos=normal \
  --pty \
  --mpi=pmix \
  /bin/bash

For a longer CPU session, use a CPU partition with a longer time limit, such as low, and keep the requested time within that limit:

srun \
  --nodes=1 \
  --ntasks=1 \
  --cpus-per-task=8 \
  --time=02:00:00 \
  --mem=20g \
  --partition=low \
  --account=xilab \
  --qos=normal \
  --pty \
  --mpi=pmix \
  /bin/bash

Once the shell opens on the allocated CPU node, bind common numerical-library thread counts to the CPU request before running Python:

module load Anaconda3
conda activate pinns_env

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export MKL_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export OPENBLAS_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export NUMEXPR_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

hostname
python -c "import os; print('cpu_count=', os.cpu_count()); print('threads=', os.environ.get('OMP_NUM_THREADS'))"
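The four thread-count exports recur in several scripts in this guide, so they can be wrapped in one function. This is a sketch, assuming Bash; set_thread_env is a hypothetical name. It uses the same variables and the same fallback of 1 as the commands above.

```shell
# Hypothetical helper: match numerical-library thread counts to the Slurm CPU request.
set_thread_env() {
  # SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested; fall back to 1.
  local threads="${SLURM_CPUS_PER_TASK:-1}"
  export OMP_NUM_THREADS="$threads"
  export MKL_NUM_THREADS="$threads"
  export OPENBLAS_NUM_THREADS="$threads"
  export NUMEXPR_NUM_THREADS="$threads"
}

set_thread_env
echo "threads=${OMP_NUM_THREADS}"
```

Requests that use --ntasks without --cpus-per-task leave SLURM_CPUS_PER_TASK unset, so the fallback of one thread applies there.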

CPU Work Inside A GPU Allocation

The ice4hpc partition is the GPU partition. Use it when you need the GPU node or when you are preparing GPU training. The call notes say the gpus QoS has a per-job limit of cpu=8, gres/gpu=4, and mem=100G, with a per-user limit of cpu=16, gres/gpu=4, mem=202G, and node=1.

For a CPU-heavy check that still allocates the GPU node, keep the MIG GPU request:

srun \
  --ntasks=8 \
  --nodes=1 \
  --time=02:00:00 \
  --mem=20g \
  --partition=ice4hpc \
  --account=xilab \
  --qos=gpus \
  --gpus=1g.20gb:1 \
  --pty \
  --mpi=pmix \
  /bin/bash

For this GPU-partition case, do not remove --gpus=1g.20gb:1: the notes say that omitting the GPU request can place the interactive shell on a non-GPU node. Use a CPU partition instead when you want a true CPU-only node.

Run A Script With srun

Use srun without --pty when you want Slurm to allocate resources and then execute one command or one shell script. This is useful for short tests and debug jobs. The terminal remains attached to the job; if the SSH session dies, the job can die too. Use sbatch for long unattended runs.

Important distinction:

  • srun resource requests go on the srun command line.

  • #SBATCH directives inside a script are for sbatch; srun bash script.sh treats them as shell comments.
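The second point can be checked without a cluster. The sketch below, assuming Bash, writes a throwaway script that contains an #SBATCH line and runs it with plain bash. The directive is read as an ordinary comment, so nothing sets SLURM_JOB_NAME. The file is a temporary demo and is not part of the repository.

```shell
# Write a throwaway script that contains an #SBATCH directive.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
echo "job_name=${SLURM_JOB_NAME:-unset}"
EOF

# Run it with plain bash and a clean SLURM_JOB_NAME: the directive is ignored.
result="$(env -u SLURM_JOB_NAME bash "$demo")"
echo "$result"
rm -f "$demo"
```

Only sbatch parses the #SBATCH lines. With srun, the same options have to be given on the command line.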

CPU Hello-World Script With srun

Create a small shell script:

nano aries_hello_srun.sh

Use this content:

#!/bin/bash -l
set -euo pipefail

echo "[INFO] Host: $(hostname)"
echo "[INFO] Workdir: $(pwd)"
echo "[INFO] SLURM job: ${SLURM_JOB_ID:-none}"
echo "[INFO] SLURM CPUs per task: ${SLURM_CPUS_PER_TASK:-unset}"

module load Anaconda3
conda activate pinns_env

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export MKL_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export OPENBLAS_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export NUMEXPR_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

python -c "import os, platform; print('hello_from=', platform.node()); print('threads=', os.environ.get('OMP_NUM_THREADS'))"

Make it executable:

chmod +x aries_hello_srun.sh

Run it on a short CPU allocation:

srun \
  --nodes=1 \
  --ntasks=1 \
  --cpus-per-task=2 \
  --time=00:05:00 \
  --mem=4g \
  --partition=user-debug \
  --account=xilab \
  --qos=normal \
  ./aries_hello_srun.sh

If user-debug is busy or not appropriate, inspect sinfo and use another CPU partition such as low, high, or ulow with a matching time limit and QoS.

GPU Check Script With srun

Create a GPU check script:

nano aries_gpu_check_srun.sh

Use this content:

#!/bin/bash -l
set -euo pipefail

module load cuda
module load Anaconda3
conda activate pinns_env

echo "[INFO] Host: $(hostname)"
echo "[INFO] SLURM job: ${SLURM_JOB_ID:-none}"

nvidia-smi
python -c "import torch; print('cuda_available=', torch.cuda.is_available()); print('device_count=', torch.cuda.device_count()); print('device_name=', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"

Run it on the GPU partition:

chmod +x aries_gpu_check_srun.sh
srun \
  --nodes=1 \
  --ntasks=4 \
  --time=00:10:00 \
  --mem=20g \
  --partition=ice4hpc \
  --account=xilab \
  --qos=gpus \
  --gpus=1g.20gb:1 \
  --mpi=pmix \
  ./aries_gpu_check_srun.sh

Repository Test With srun

Use the repository’s lightweight setup validation before launching training:

cd /scratch1/$USER/Physics-Informed-Neural-Networks

Run the validation script through Slurm:

srun \
  --nodes=1 \
  --ntasks=1 \
  --cpus-per-task=2 \
  --time=00:10:00 \
  --mem=8g \
  --partition=user-debug \
  --account=xilab \
  --qos=normal \
  bash -lc "module load Anaconda3 && conda activate pinns_env && python -B scripts/training/validate_training_setup.py --config-path config/training/feedforward/presets/baseline.yaml --platform linux"

Then run the minimal smoke test through Slurm:

srun \
  --nodes=1 \
  --ntasks=1 \
  --cpus-per-task=2 \
  --time=00:10:00 \
  --mem=8g \
  --partition=user-debug \
  --account=xilab \
  --qos=normal \
  bash -lc "module load Anaconda3 && conda activate pinns_env && python -B scripts/training/run_training_smoke_test.py --config-path config/training/feedforward/presets/baseline.yaml --output-suffix aries_first_smoke_test --fast-dev-run-batches 1 --platform linux"

These commands are still tests, not full campaigns. They create validation or smoke-test artifacts under the repository output/report structure.

First sbatch Script

Create a batch script in the repository root:

nano aries_first_smoke_test.sbatch

Use this template:

#!/bin/bash -l
#SBATCH -A xilab
#SBATCH -p ice4hpc
#SBATCH --qos=gpus
#SBATCH --time=01:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=40g
#SBATCH --gpus=1g.20gb:1
#SBATCH --job-name=aries_smoke
#SBATCH --output=aries_smoke_%j.out
#SBATCH --error=aries_smoke_%j.err

set -euo pipefail

module load cuda
module load Anaconda3

conda activate pinns_env

cd /scratch1/${USER}/Physics-Informed-Neural-Networks

echo "[INFO] Host: $(hostname)"
echo "[INFO] Workdir: $(pwd)"
echo "[INFO] Python: $(which python)"

nvidia-smi

python -c "import torch; print('cuda_available=', torch.cuda.is_available()); print('device_count=', torch.cuda.device_count()); print('device_name=', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"

python -B scripts/training/validate_training_setup.py \
  --config-path config/training/feedforward/presets/baseline.yaml \
  --platform linux

python -B scripts/training/run_training_smoke_test.py \
  --config-path config/training/feedforward/presets/baseline.yaml \
  --output-suffix aries_first_sbatch_smoke_test \
  --fast-dev-run-batches 1 \
  --platform linux

Submit it:

sbatch aries_first_smoke_test.sbatch

Watch the queue:

squeue -u "$USER"

Inspect output after the job starts or finishes:

tail -n 100 aries_smoke_<job_id>.out
tail -n 100 aries_smoke_<job_id>.err

Replace <job_id> with the numeric job id printed by sbatch.
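On success sbatch prints a line of the form "Submitted batch job <job_id>". The sketch below, assuming Bash, extracts the id from such a line and builds the two log file names used by the template above. parse_job_id is a hypothetical helper, and the sample line is illustrative text, not real cluster output.

```shell
# Hypothetical helper: extract the numeric id from "Submitted batch job <id>".
parse_job_id() {
  local line="$1"
  local id="${line##* }"            # keep the last space-separated word
  case "$id" in
    ''|*[!0-9]*) return 1 ;;        # reject anything that is not purely numeric
    *) printf '%s\n' "$id" ;;
  esac
}

# Sample line for illustration; on Aries use: submit_output="$(sbatch aries_first_smoke_test.sbatch)"
submit_output="Submitted batch job 12345"
job_id="$(parse_job_id "$submit_output")"
echo "aries_smoke_${job_id}.out"
echo "aries_smoke_${job_id}.err"
```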

CPU-Only sbatch Script

Use this template when the job does not need GPU acceleration. It uses a CPU partition, a CPU QoS, and no --gpus directive:

#!/bin/bash -l
#SBATCH -A xilab
#SBATCH -p low
#SBATCH --qos=normal
#SBATCH --time=24:00:00
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=20g
#SBATCH --job-name=test_cpu
#SBATCH --output=cpu_%j.out
#SBATCH --error=cpu_%j.err

set -euo pipefail

module load Anaconda3

conda activate pinns_env

cd /scratch1/${USER}/Physics-Informed-Neural-Networks

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export MKL_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export OPENBLAS_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export NUMEXPR_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "[INFO] Host: $(hostname)"
echo "[INFO] Workdir: $(pwd)"
echo "[INFO] Python: $(which python)"
echo "[INFO] SLURM job: ${SLURM_JOB_ID}"
echo "[INFO] SLURM CPUs per task: ${SLURM_CPUS_PER_TASK}"

python -c "import os; print('cpu_count=', os.cpu_count()); print('threads=', os.environ.get('OMP_NUM_THREADS'))"

python ./job01/miojob.py

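Save the template as test_cpu.sbatch, replace the final python ./job01/miojob.py line with your own entry point, and submit it: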

sbatch test_cpu.sbatch

For short tests, change #SBATCH -p low to #SBATCH -p user-debug and #SBATCH --time=24:00:00 to #SBATCH --time=00:30:00. The QoS stays normal.

Resource Request Rules Of Thumb

Start smaller than the maximum and scale only after a successful smoke test.

Use Case                                   Starting Request
-----------------------------------------  ----------------------------------------------------------------
Shell and import checks                    login node, no Slurm job
Interactive GPU check                      --ntasks=4 --mem=40g --gpus=1g.20gb:1
Interactive CPU-only short check           --partition=user-debug --qos=normal --cpus-per-task=4
Interactive CPU-only longer check          --partition=low --qos=normal --cpus-per-task=8
Interactive GPU-partition CPU-heavy check  --partition=ice4hpc --qos=gpus --ntasks=8 --gpus=1g.20gb:1
First smoke test                           --ntasks-per-node=4 --mem=40g --gpus=1g.20gb:1
First CPU-only batch job                   #SBATCH -p low, #SBATCH --qos=normal, #SBATCH --cpus-per-task=8
First GPU batch job                        #SBATCH -p ice4hpc, #SBATCH --qos=gpus, #SBATCH --gpus=1g.20gb:1
Heavier GPU training                       Increase time first, then memory or CPU only if needed

The call notes say the GPU QoS allows up to cpu=8, gres/gpu=4, and mem=100G per job, with a per-user GPU limit visible through showqos. Request less than the maximum unless the job actually needs it. Smaller jobs usually queue faster.
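A request can be compared against these per-job limits before it is submitted. The sketch below assumes Bash and hard-codes the cpu=8, mem=100G, and gres/gpu=4 values reported in the call notes. check_request is a hypothetical helper, and the limits should be confirmed with showqos because they can change.

```shell
# Per-job limits for the gpus QoS as reported in the call notes (confirm with showqos).
MAX_CPUS=8
MAX_MEM_GB=100
MAX_GPUS=4

# Hypothetical helper: check_request <cpus> <mem_gb> <gpus>
check_request() {
  local cpus="$1" mem_gb="$2" gpus="$3" status=0
  if [ "$cpus" -gt "$MAX_CPUS" ]; then
    echo "cpu=$cpus exceeds $MAX_CPUS"; status=1
  fi
  if [ "$mem_gb" -gt "$MAX_MEM_GB" ]; then
    echo "mem=${mem_gb}G exceeds ${MAX_MEM_GB}G"; status=1
  fi
  if [ "$gpus" -gt "$MAX_GPUS" ]; then
    echo "gpu=$gpus exceeds $MAX_GPUS"; status=1
  fi
  if [ "$status" -eq 0 ]; then
    echo "within per-job limits"
  fi
  return "$status"
}

# The first interactive GPU request from this guide: 4 tasks, 40 GB, 1 MIG slice.
check_request 4 40 1
```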

Troubleshooting

nvidia-smi Is Missing

On the login node this can be normal. Inside a GPU allocation, load CUDA:

module load cuda
nvidia-smi

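If it still fails, confirm that the job actually holds a GPU. The squeue line shows the partition and node; the GPU itself appears in the TRES fields printed by scontrol show job:

squeue -j <job_id> -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
scontrol show job <job_id> | grep -i -E "tres|gres"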

Job Stays Pending

Check the pending reason, which squeue prints in the last column, NODELIST(REASON):

squeue -u "$USER"

Common causes are resource pressure, priority, QoS limits, or asking for too much CPU, memory, walltime, or GPU.

Interactive Session Dies

Use sbatch for any job that should survive a dropped SSH connection. Keep srun --pty /bin/bash for short debugging only.

GitHub Clone Fails With Permission denied (publickey)

Check that the key exists and is loaded:

ls -al ~/.ssh
ssh-add -l
ssh -T git@github.com

Then verify that the public key in ~/.ssh/id_ed25519.pub is registered in GitHub.

Conda Command Not Found

Reload the Anaconda module:

module load Anaconda3
which conda

If needed, initialize Bash and reconnect:

conda init bash
exit

CUDA Is Not Available In PyTorch

Check all three layers:

module load cuda
nvidia-smi
python -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.cuda.is_available())"

If nvidia-smi works but torch.cuda.is_available() is false, reinstall PyTorch using the current command from the official PyTorch install selector.
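The layered check above can be wrapped so that each layer reports its own result and a failure in one layer does not hide the others. This is a sketch, assuming Bash; check_layer is a hypothetical helper, and the layer names are illustrative. Run it inside a GPU allocation after loading the modules and activating the environment.

```shell
# Hypothetical helper: run a command quietly and report whether it succeeded.
check_layer() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "[ok] $name"
  else
    echo "[fail] $name"
  fi
}

# Each layer is tested independently, from the driver tool up to PyTorch.
check_layer "nvidia-smi on PATH" command -v nvidia-smi
check_layer "driver sees a GPU" nvidia-smi
check_layer "python imports torch" python -c "import torch"
check_layer "torch sees CUDA" python -c "import sys, torch; sys.exit(0 if torch.cuda.is_available() else 1)"
```

The first failing layer is the place to start. A failure only on the last line points at the PyTorch build, not at the node.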

First Full Training Or Campaign

Do not turn the first sbatch smoke test into a full training campaign by editing only the Python command. For a real training campaign, prepare the campaign plan, YAML queue, launcher, active campaign state, and approval gate required by the repository workflow.

For existing approved Linux launchers, prefer the repository-owned scripts under scripts/campaigns/ instead of hand-written cluster commands.