Unimore Aries Cluster User Guide
Scope
This guide explains how to start using the Unimore Aries cluster for this
repository from a clean workstation and a clean cluster shell. It covers SSH
login, GitHub SSH setup, repository cloning, Conda setup, basic Slurm
inspection, one interactive srun check, and one first sbatch submission.
The Aries-specific values below come from the local call notes in
.temp/aries_call.txt. Re-check them with sinfo, showqos, and the current
cluster welcome message before running large jobs.
Do not paste passwords into scripts, Markdown files, shell history notes, or Git-tracked files. When a password is required, type it only into the SSH prompt.
Useful References
Aries public page: https://www.labcsai.unimore.it/aries/
GitHub SSH key generation for Linux: https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent?platform=linux
GitHub SSH key registration: https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account
Slurm documentation: https://slurm.schedmd.com/documentation.html
Slurm srun: https://slurm.schedmd.com/srun.html
Slurm sbatch: https://slurm.schedmd.com/sbatch.html
Harvard convenient Slurm commands: https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/
PyTorch install selector: https://pytorch.org/get-started/locally/
Aries Values To Know
| Item | Value |
|---|---|
| SSH host | |
| Account | |
| Main GPU partition for this work | |
| GPU QoS | |
| GPU node mentioned in the call | |
| Recommended first GPU request | |
| Recommended first mode | single GPU |
| Work directories | |
| Persistent home | |
The call notes indicate that gnode01 exposes H100 GPU resources through MIG
20 GB slices. Treat nvidia-smi inside an allocated GPU job as the real
runtime check.
Mental Model
Aries has a login node and compute nodes.
Use the login node for light setup: SSH, Git clone, editing files, Conda environment creation, and submitting jobs.
Use compute nodes for Python tests, CUDA checks, training, and any expensive execution.
Use srun for an interactive allocation. If the SSH session dies, the interactive work can die with it.
Use sbatch for non-interactive jobs. The scheduler keeps running the job after you disconnect.
Connect With SSH
From Windows PowerShell:
ssh dferrari@aries.hpc.unimo.it
From Linux or macOS:
ssh dferrari@aries.hpc.unimo.it
Replace dferrari with your Aries username. After a successful login you
should land on the login node, for example with a prompt similar to:
[dferrari@fe01 ~]$
Choose A Working Directory
Use scratch for active repository work and job output:
cd /scratch1/$USER
pwd
If /scratch1/$USER is not available or is busy, try:
cd /scratch2/$USER
pwd
Use home for persistent shell configuration and small configuration files:
cd ~
pwd
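The fallback from /scratch1 to /scratch2 can be scripted. The following is a minimal sketch, assuming that "available" means the directory exists and is writable; pick_workdir is a hypothetical helper name, not a command provided by Aries.

```shell
# Sketch: print the first usable directory from a list of candidates.
# pick_workdir is a hypothetical helper, not an Aries-provided command.
pick_workdir() {
    # A candidate is usable when it exists as a directory and is writable.
    for candidate in "$@"; do
        if [ -d "$candidate" ] && [ -w "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # No candidate was usable: fall back to the home directory and signal it.
    printf '%s\n' "$HOME"
    return 1
}

user_name="${USER:-$(id -un)}"
workdir="$(pick_workdir "/scratch1/$user_name" "/scratch2/$user_name")" \
    || echo "[WARN] No scratch directory is usable, falling back to home" >&2
echo "[INFO] Working directory: $workdir"
```

The function only prints a path, so the caller still decides whether to cd into it.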
Load Cluster Modules
List available modules:
module av
Load Slurm if it is not already loaded:
module load slurm
Load CUDA and Anaconda:
module load cuda
module load Anaconda3
If Anaconda3 is not the exact module name on the current module tree, inspect
the available names:
module av Anaconda
module av python
Run Conda initialization only if conda activate is not available:
conda init bash
Then disconnect and reconnect, or reload the shell:
exec bash -l
Add A GitHub SSH Key On Aries
First check whether an SSH key already exists:
ls -al ~/.ssh
Generate a new key if needed. Use the email associated with your GitHub account:
ssh-keygen -t ed25519 -C "your_email@example.com"
Accept the default path unless you intentionally manage multiple keys. Use a passphrase if you want the private key protected on the cluster.
Start the SSH agent and add the key:
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
Print the public key:
cat ~/.ssh/id_ed25519.pub
Open GitHub in the browser on your local workstation and add the printed public
key under Settings -> SSH and GPG keys -> New SSH key. The official GitHub
guide is:
https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account
Test the GitHub SSH connection from Aries:
ssh -T git@github.com
A successful test usually identifies your GitHub account and says GitHub does not provide shell access. That is expected.
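The "check first, generate only if needed" steps above can be combined so that an existing key is never overwritten. This is a sketch under the assumption that the key lives at the default ed25519 path and that ~/.ssh already exists; ensure_ssh_key is a hypothetical helper name.

```shell
# Sketch: create an ed25519 key only when no key file exists at the given path.
# ensure_ssh_key is a hypothetical helper name.
ensure_ssh_key() {
    key_path="$1"
    email="$2"
    if [ -f "$key_path" ]; then
        echo "[INFO] Key already exists, not overwriting: $key_path"
        return 0
    fi
    # ssh-keygen asks for the passphrase interactively.
    ssh-keygen -t ed25519 -C "$email" -f "$key_path"
}

# Usage on Aries (commented out so that pasting this block has no side effects):
# ensure_ssh_key "$HOME/.ssh/id_ed25519" "your_email@example.com"
# cat "$HOME/.ssh/id_ed25519.pub"
```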
Clone The Repository
Move to scratch:
cd /scratch1/$USER
Clone with SSH:
git clone git@github.com:XiLab-Robotics/Physics-Informed-Neural-Networks.git
cd Physics-Informed-Neural-Networks
Check the remote:
git remote -v
Optional but useful Git identity setup on Aries:
git config --global user.name "Your Name"
git config --global user.email "your_email@example.com"
Create The Conda Environment
Load modules first:
module load cuda
module load Anaconda3
Create the project environment:
conda create -y -n pinns_env python=3.12
conda activate pinns_env
python -m pip install --upgrade pip
Install PyTorch using the official PyTorch selector for the CUDA build that matches the loaded CUDA module and cluster driver:
https://pytorch.org/get-started/locally/
For a CUDA 12.4 module, start by checking the current PyTorch selector and
then use the generated command. A typical pip pattern is:
python -m pip install torch --index-url https://download.pytorch.org/whl/cu124
If the selector recommends a newer CUDA wheel, use the selector output instead of this example. Then install the repository dependencies:
python -m pip install -r requirements.txt
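The cu124 suffix in the PyTorch index URL above is the CUDA major and minor version with the dot removed. As a small sketch, the hypothetical helper cuda_wheel_url derives that URL from a version string of the form major.minor or major.minor.patch. PyTorch publishes wheels only for selected CUDA versions, so treat the result as a candidate to compare against the selector, not as a guaranteed download location.

```shell
# Sketch: build a PyTorch wheel index URL from a CUDA version string.
# cuda_wheel_url is a hypothetical helper; the URL pattern follows the cu124 example.
# Assumption: the input has the form major.minor or major.minor.patch.
cuda_wheel_url() {
    version="$1"
    major="${version%%.*}"     # "12.4.1" -> "12"
    rest="${version#*.}"       # "12.4.1" -> "4.1"
    minor="${rest%%.*}"        # "4.1"    -> "4"
    printf 'https://download.pytorch.org/whl/cu%s%s\n' "$major" "$minor"
}

cuda_wheel_url 12.4
# prints https://download.pytorch.org/whl/cu124
```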
For the recovered original RCIM workflow only, there is also a nested historical requirements file:
python -m pip install -r scripts/paper_reimplementation/rcim_ml_compensation/recovered_original_workflow/requirements.txt
Use that nested file only when you are explicitly working in the recovered original workflow.
Verify Python And CUDA
On the login node, verify imports only:
python -c "import torch, lightning, numpy, pandas, sklearn; print(torch.__version__); print(lightning.__version__)"
Do not assume CUDA is available on the login node. Check CUDA inside an allocated GPU job.
Inspect Slurm State
Show partitions and node states:
sinfo
Show queue state:
squeue
Show only your jobs:
squeue -u "$USER"
On Aries, squeue --me is also available and is the shortest reliable way to
show only your jobs:
squeue --me
The shell may also define an sq shorthand.
Show QoS limits from the login node:
showqos
If showqos is missing, first confirm that you are on the login node and that
Slurm is loaded:
hostname
module load slurm
showqos
showqos can be unavailable inside an allocated compute-node shell. Exit back
to the login node before using it for quota checks.
Common Slurm states:
| State | Meaning |
|---|---|
| | Running |
| | Pending |
| | Cancelled or completed, depending on the command context |
| | Node has available resources |
| | Node is partially allocated |
| | Node is fully allocated |
Cancel one of your jobs:
scancel <job_id>
Cancel all of your jobs only when you really intend to stop them:
scancel -u "$USER"
Show details for one job:
scontrol show job <job_id>
Replace <job_id> with the numeric job id printed by srun, sbatch, or
squeue.
Interactive Compute-Node Sessions With srun
Use interactive srun --pty /bin/bash only when you want to land on a compute
node and run commands manually from the terminal. This is useful for short
checks, debugging, and confirming that modules, Conda, CUDA, and paths work.
If the SSH session dies, the interactive work can die with it.
The interactive shell can start with a different Conda prompt, for example
(base) even if the login-node shell was already in pinns_env. Always load
modules and activate the intended Conda environment again inside the allocated
node shell.
The prompt tells you where you are:
[dferrari@fe03 Physics-Informed-Neural-Networks]$ # login node
[dferrari@gnode01 Physics-Informed-Neural-Networks]$ # compute node
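Besides reading the prompt, a script can check the environment: Slurm sets SLURM_JOB_ID inside job shells. The sketch below uses that variable; describe_shell is a hypothetical helper name, and the wording of its messages is illustrative.

```shell
# Sketch: report whether the current shell runs inside a Slurm allocation.
# Relies on SLURM_JOB_ID, which Slurm sets in the environment of job shells.
describe_shell() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside Slurm job ${SLURM_JOB_ID} on $(hostname)"
    else
        echo "no Slurm allocation on $(hostname)"
    fi
}

describe_shell
```

A guard like this is useful at the top of scripts that should never run expensive work on the login node.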
Interactive GPU Node
Start with a modest request. This matches the Aries call notes and avoids asking for more CPU or memory than the first check needs:
srun \
--ntasks=4 \
--nodes=1 \
--mem=40g \
--partition=ice4hpc \
--account=xilab \
--qos=gpus \
--gpus=1g.20gb:1 \
--pty \
--mpi=pmix \
/bin/bash
Once the shell opens on the allocated node, load modules and activate the environment:
module load cuda
module load Anaconda3
conda activate pinns_env
Check the GPU:
nvidia-smi
Check PyTorch CUDA visibility:
python -c "import torch; print('cuda_available=', torch.cuda.is_available()); print('device_count=', torch.cuda.device_count()); print('device_name=', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"
Exit the interactive allocation when finished:
exit
Interactive CPU Node
For work without GPU acceleration, use a CPU partition and a CPU QoS. The call
notes say that CPU work should use QoS normal, high, or low; if no QoS is
specified, Aries can select the default QoS. The live sinfo output shown in
the validated session exposed CPU-oriented partitions such as ulow, low,
high, and user-debug.
Use user-debug only for short checks because its time limit is 30 minutes:
srun \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=4 \
--time=00:30:00 \
--mem=20g \
--partition=user-debug \
--account=xilab \
--qos=normal \
--pty \
--mpi=pmix \
/bin/bash
For a longer CPU session, use a longer CPU partition such as low, then keep
the requested time within the partition time limit:
srun \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=8 \
--time=02:00:00 \
--mem=20g \
--partition=low \
--account=xilab \
--qos=normal \
--pty \
--mpi=pmix \
/bin/bash
Once the shell opens on the allocated CPU node, bind common numerical-library thread counts to the CPU request before running Python:
module load Anaconda3
conda activate pinns_env
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export MKL_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export OPENBLAS_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export NUMEXPR_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
hostname
python -c "import os; print('cpu_count=', os.cpu_count()); print('threads=', os.environ.get('OMP_NUM_THREADS'))"
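The same four exports recur in the srun and sbatch examples of this guide, so they can be wrapped in one function. This is a minimal sketch; set_thread_env is a hypothetical helper name. Note that Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is requested, which is why the fallback of 1 matters for requests that use only --ntasks.

```shell
# Sketch: bind common numerical-library thread counts to the Slurm CPU request.
# set_thread_env is a hypothetical helper name.
set_thread_env() {
    # Falls back to 1 thread when the job did not request --cpus-per-task.
    threads="${SLURM_CPUS_PER_TASK:-1}"
    for name in OMP_NUM_THREADS MKL_NUM_THREADS OPENBLAS_NUM_THREADS NUMEXPR_NUM_THREADS; do
        export "$name=$threads"
    done
}

set_thread_env
echo "[INFO] Threads per library: $OMP_NUM_THREADS"
```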
CPU Work Inside A GPU Allocation
The ice4hpc partition is the GPU partition. Use it when you need the GPU node
or when you are preparing GPU training. The call notes say the gpus QoS has a
per-job limit of cpu=8, gres/gpu=4, and mem=100G, with a per-user limit
of cpu=16, gres/gpu=4, mem=202G, and node=1.
For a CPU-heavy check that still allocates the GPU node, keep the MIG GPU request:
srun \
--ntasks=8 \
--nodes=1 \
--time=02:00:00 \
--mem=20g \
--partition=ice4hpc \
--account=xilab \
--qos=gpus \
--gpus=1g.20gb:1 \
--pty \
--mpi=pmix \
/bin/bash
For this GPU-partition case, do not remove --gpus=1g.20gb:1: the notes say
that omitting the GPU request can place the interactive shell on a non-GPU
node. Use a CPU partition instead when you want a true CPU-only node.
Run A Script With srun
Use srun without --pty when you want Slurm to allocate resources and then
execute one command or one shell script. This is useful for short tests and
debug jobs. The terminal remains attached to the job; if the SSH session dies,
the job can die too. Use sbatch for long unattended runs.
Important distinction:
srun resource requests go on the srun command line.
#SBATCH directives inside a script are for sbatch; srun bash script.sh treats them as shell comments.
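The distinction can be observed without a cluster, because to bash a #SBATCH line is an ordinary comment. The sketch below writes a throwaway script, runs it with plain bash, and then lists the directive lines that only sbatch would interpret. The function name and the directive values are illustrative.

```shell
# Sketch: show that bash ignores #SBATCH lines, so only sbatch acts on them.
show_sbatch_lines_are_comments() {
    demo_script="$(mktemp)"
    cat > "$demo_script" <<'EOF'
#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --time=00:05:00
echo "cpus_per_task=${SLURM_CPUS_PER_TASK:-unset}"
EOF
    # Plain bash execution: the directives are comments, so no request is applied.
    bash "$demo_script"
    # These are the lines that sbatch would parse as options.
    grep '^#SBATCH' "$demo_script"
    rm -f "$demo_script"
}

show_sbatch_lines_are_comments
```

Outside a Slurm allocation the first output line reports cpus_per_task=unset, even though the script asks for 8 CPUs in its directives.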
CPU Hello-World Script With srun
Create a small shell script:
nano aries_hello_srun.sh
Use this content:
#!/bin/bash -l
set -euo pipefail
echo "[INFO] Host: $(hostname)"
echo "[INFO] Workdir: $(pwd)"
echo "[INFO] SLURM job: ${SLURM_JOB_ID:-none}"
echo "[INFO] SLURM CPUs per task: ${SLURM_CPUS_PER_TASK:-unset}"
module load Anaconda3
conda activate pinns_env
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export MKL_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export OPENBLAS_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export NUMEXPR_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
python -c "import os, platform; print('hello_from=', platform.node()); print('threads=', os.environ.get('OMP_NUM_THREADS'))"
Make it executable:
chmod +x aries_hello_srun.sh
Run it on a short CPU allocation:
srun \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=2 \
--time=00:05:00 \
--mem=4g \
--partition=user-debug \
--account=xilab \
--qos=normal \
./aries_hello_srun.sh
If user-debug is busy or not appropriate, inspect sinfo and use another CPU
partition such as low, high, or ulow with a matching time limit and QoS.
GPU Check Script With srun
Create a GPU check script:
nano aries_gpu_check_srun.sh
Use this content:
#!/bin/bash -l
set -euo pipefail
module load cuda
module load Anaconda3
conda activate pinns_env
echo "[INFO] Host: $(hostname)"
echo "[INFO] SLURM job: ${SLURM_JOB_ID:-none}"
nvidia-smi
python -c "import torch; print('cuda_available=', torch.cuda.is_available()); print('device_count=', torch.cuda.device_count()); print('device_name=', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"
Make it executable, then run it on the GPU partition:
chmod +x aries_gpu_check_srun.sh
srun \
--nodes=1 \
--ntasks=4 \
--time=00:10:00 \
--mem=20g \
--partition=ice4hpc \
--account=xilab \
--qos=gpus \
--gpus=1g.20gb:1 \
--mpi=pmix \
./aries_gpu_check_srun.sh
Repository Test With srun
Use the repository's lightweight setup validation before launching training. First move to the repository root:
cd /scratch1/$USER/Physics-Informed-Neural-Networks
Run the validation script through Slurm:
srun \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=2 \
--time=00:10:00 \
--mem=8g \
--partition=user-debug \
--account=xilab \
--qos=normal \
bash -lc "module load Anaconda3; conda activate pinns_env; python -B scripts/training/validate_training_setup.py --config-path config/training/feedforward/presets/baseline.yaml --platform linux"
Then run the minimal smoke test through Slurm:
srun \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=2 \
--time=00:10:00 \
--mem=8g \
--partition=user-debug \
--account=xilab \
--qos=normal \
bash -lc "module load Anaconda3; conda activate pinns_env; python -B scripts/training/run_training_smoke_test.py --config-path config/training/feedforward/presets/baseline.yaml --output-suffix aries_first_smoke_test --fast-dev-run-batches 1 --platform linux"
These commands are still tests, not full campaigns. They create validation or smoke-test artifacts under the repository output/report structure.
First sbatch Script
Create a batch script in the repository root:
nano aries_first_smoke_test.sbatch
Use this template:
#!/bin/bash -l
#SBATCH -A xilab
#SBATCH -p ice4hpc
#SBATCH --qos=gpus
#SBATCH --time=01:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=40g
#SBATCH --gpus=1g.20gb:1
#SBATCH --job-name=aries_smoke
#SBATCH --output=aries_smoke_%j.out
#SBATCH --error=aries_smoke_%j.err
set -euo pipefail
module load cuda
module load Anaconda3
conda activate pinns_env
cd /scratch1/${USER}/Physics-Informed-Neural-Networks
echo "[INFO] Host: $(hostname)"
echo "[INFO] Workdir: $(pwd)"
echo "[INFO] Python: $(which python)"
nvidia-smi
python -c "import torch; print('cuda_available=', torch.cuda.is_available()); print('device_count=', torch.cuda.device_count()); print('device_name=', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"
python -B scripts/training/validate_training_setup.py \
--config-path config/training/feedforward/presets/baseline.yaml \
--platform linux
python -B scripts/training/run_training_smoke_test.py \
--config-path config/training/feedforward/presets/baseline.yaml \
--output-suffix aries_first_sbatch_smoke_test \
--fast-dev-run-batches 1 \
--platform linux
Submit it:
sbatch aries_first_smoke_test.sbatch
Watch the queue:
squeue -u "$USER"
Inspect output after the job starts or finishes:
tail -n 100 aries_smoke_<job_id>.out
tail -n 100 aries_smoke_<job_id>.err
Replace <job_id> with the numeric job id printed by sbatch.
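Instead of copying the job id by hand, a script can extract it from the sbatch output. The following is a sketch, assuming the default "Submitted batch job <id>" message or the output of sbatch --parsable, which is the id optionally followed by ";" and a cluster name. parse_job_id is a hypothetical helper name.

```shell
# Sketch: extract the numeric job id from sbatch output.
# Handles "Submitted batch job <id>" and the --parsable forms "<id>" and "<id>;<cluster>".
parse_job_id() {
    line="$1"
    line="${line##* }"             # keep the last space-separated word
    printf '%s\n' "${line%%;*}"    # drop an optional ";cluster" suffix
}

# Usage on Aries (commented out because it submits a real job):
# job_id="$(parse_job_id "$(sbatch aries_first_smoke_test.sbatch)")"
# tail -n 100 "aries_smoke_${job_id}.out"

parse_job_id "Submitted batch job 123456"
# prints 123456
```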
CPU-Only sbatch Script
Use this template when the job does not need GPU acceleration. It uses a CPU
partition, a CPU QoS, and no --gpus directive:
#!/bin/bash -l
#SBATCH -A xilab
#SBATCH -p low
#SBATCH --qos=normal
#SBATCH --time=24:00:00
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=20g
#SBATCH --job-name=test_cpu
#SBATCH --output=cpu_%j.out
#SBATCH --error=cpu_%j.err
set -euo pipefail
module load Anaconda3
conda activate pinns_env
cd /scratch1/${USER}/Physics-Informed-Neural-Networks
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export MKL_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export OPENBLAS_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export NUMEXPR_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "[INFO] Host: $(hostname)"
echo "[INFO] Workdir: $(pwd)"
echo "[INFO] Python: $(which python)"
echo "[INFO] SLURM job: ${SLURM_JOB_ID}"
echo "[INFO] SLURM CPUs per task: ${SLURM_CPUS_PER_TASK}"
python -c "import os; print('cpu_count=', os.cpu_count()); print('threads=', os.environ.get('OMP_NUM_THREADS'))"
python ./job01/miojob.py
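The final line, python ./job01/miojob.py, stands in for your own entry point. Save the template as test_cpu.sbatch and submit it with: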
sbatch test_cpu.sbatch
For short tests, change only the partition and the time limit: replace
#SBATCH -p low with #SBATCH -p user-debug and #SBATCH --time=24:00:00 with
#SBATCH --time=00:30:00. The QoS stays #SBATCH --qos=normal.
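The short-test variant can be generated from the long template rather than edited by hand. This is a sketch using sed, assuming the directive lines match the template exactly; make_debug_variant is a hypothetical helper name.

```shell
# Sketch: rewrite the partition and time directives of the CPU template for a short test.
# Reads the named file, or standard input when no file is given.
make_debug_variant() {
    sed -e 's/^#SBATCH -p low$/#SBATCH -p user-debug/' \
        -e 's/^#SBATCH --time=24:00:00$/#SBATCH --time=00:30:00/' \
        "$@"
}

# Demonstration on two directive lines:
printf '%s\n' '#SBATCH -p low' '#SBATCH --time=24:00:00' | make_debug_variant

# Usage with the template (commented out because it writes a file):
# make_debug_variant test_cpu.sbatch > test_cpu_debug.sbatch
```

As an alternative, options given on the sbatch command line take precedence over #SBATCH directives in the script, so sbatch -p user-debug --time=00:30:00 test_cpu.sbatch should have the same effect without a second file.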
Resource Request Rules Of Thumb
Start smaller than the maximum and scale only after a successful smoke test.
| Use Case | Starting Request |
|---|---|
| Shell and import checks | login node, no Slurm job |
| Interactive GPU check | |
| Interactive CPU-only short check | |
| Interactive CPU-only longer check | |
| Interactive GPU-partition CPU-heavy check | |
| First smoke test | |
| First CPU-only batch job | |
| First GPU batch job | |
| Heavier GPU training | Increase time first, then memory or CPU only if needed |
The call notes say the GPU QoS allows up to cpu=8, gres/gpu=4, and
mem=100G per job, with a per-user GPU limit visible through showqos.
Request less than the maximum unless the job actually needs it. Smaller jobs
usually queue faster.
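A request can be compared with the per-job limits before it is submitted. The sketch below hard-codes the gpus QoS values quoted from the call notes (cpu=8, gres/gpu=4, mem=100G); they are assumptions to re-check with showqos. check_gpu_request is a hypothetical helper name.

```shell
# Per-job limits of the gpus QoS as quoted from the call notes.
# Assumption: re-check these values with showqos before relying on them.
MAX_CPU=8
MAX_MEM_G=100
MAX_GPU=4

# Sketch: print every limit that a request exceeds; return 0 only when it fits.
# Arguments: CPU count, memory in GB, GPU count.
check_gpu_request() {
    cpu="$1"; mem_g="$2"; gpu="$3"
    status=0
    if [ "$cpu" -gt "$MAX_CPU" ]; then
        echo "cpu=$cpu exceeds the per-job limit of $MAX_CPU"; status=1
    fi
    if [ "$mem_g" -gt "$MAX_MEM_G" ]; then
        echo "mem=${mem_g}G exceeds the per-job limit of ${MAX_MEM_G}G"; status=1
    fi
    if [ "$gpu" -gt "$MAX_GPU" ]; then
        echo "gpu=$gpu exceeds the per-job limit of $MAX_GPU"; status=1
    fi
    return "$status"
}

# The first interactive GPU request of this guide: 4 tasks, 40 GB, 1 MIG slice.
check_gpu_request 4 40 1 && echo "request fits the per-job limits"
```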
Troubleshooting
nvidia-smi Is Missing
On the login node this can be normal. Inside a GPU allocation, load CUDA:
module load cuda
nvidia-smi
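If it still fails, confirm that the job is running on the GPU partition and on a GPU node. The format below prints the job id, partition, name, user, state, elapsed time, node count, and node list; scontrol show job <job_id> additionally lists the requested resources: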
squeue -j <job_id> -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
Job Stays Pending
Check the pending reason:
squeue -u "$USER"
Common causes are resource pressure, priority, QoS limits, or asking for too much CPU, memory, walltime, or GPU.
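The causes above map onto the reason codes that squeue reports for pending jobs. The sketch below translates a few of them into next steps; explain_reason is a hypothetical helper name, the hints are illustrative, and the list of reason codes is far from complete.

```shell
# Sketch: turn a Slurm pending-reason code into a short hint.
# explain_reason is a hypothetical helper; only a few reason codes are covered.
explain_reason() {
    case "$1" in
        Priority)           echo "other jobs are ahead in the queue; wait" ;;
        Resources)          echo "the requested resources are not free yet; wait or request less" ;;
        QOS*)               echo "a QoS limit is reached; compare the request with showqos" ;;
        PartitionTimeLimit) echo "the walltime exceeds the partition limit; lower --time" ;;
        *)                  echo "see the squeue documentation for reason $1" ;;
    esac
}

# List only pending jobs with their reason code in the last column
# (commented out because it needs Slurm):
# squeue -u "$USER" --states=PENDING --format="%.10i %.12P %.20j %r"

explain_reason Resources
```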
Interactive Session Dies
Use sbatch for any job that should survive a dropped SSH connection. Keep
srun --pty /bin/bash for short debugging only.
GitHub Clone Fails With Permission denied (publickey)
Check that the key exists and is loaded:
ls -al ~/.ssh
ssh-add -l
ssh -T git@github.com
Then verify that the public key in ~/.ssh/id_ed25519.pub is registered in
GitHub.
Conda Command Not Found
Reload the Anaconda module:
module load Anaconda3
which conda
If needed, initialize Bash and reconnect:
conda init bash
exit
CUDA Is Not Available In PyTorch
Check all three layers:
module load cuda
nvidia-smi
python -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.cuda.is_available())"
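If nvidia-smi works but torch.cuda.is_available() returns False, the installed
wheel is most likely a CPU-only build (torch.version.cuda then prints None) or
targets a CUDA version that the cluster driver does not support. Reinstall
PyTorch using the current command from the official PyTorch install selector.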
First Full Training Or Campaign
Do not turn the first sbatch smoke test into a full training campaign by
editing only the Python command. For a real training campaign, prepare the
campaign plan, YAML queue, launcher, active campaign state, and approval gate
required by the repository workflow.
For existing approved Linux launchers, prefer the repository-owned scripts
under scripts/campaigns/ instead of hand-written cluster commands.