Troubleshooting
💡 This Troubleshooting section can be complemented by consulting the IFB Community Forum
[SLURM] Invalid account or account/partition combination specified
Complete message:
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
Explanation 1
Your current default SLURM account is probably the demo
one (you may have seen a red notice about it at login). You can check this using:
$ sacctmgr list user $USER
User Def Acct Admin
---------- ---------- ---------
cnorris demo None
Solution
If you don't already have a project, you have to request one from the platform: https://my.cluster.france-bioinformatique.fr/manager2/project
Otherwise, if you already have a project/account, you can either:
- Specify your SLURM account for each job:
srun -A my_account command
#!/bin/bash
#SBATCH -A my_account
command
- Change your default account:
sacctmgr update user $USER set defaultaccount=my_account
⚠️ status_bar is updated hourly, so it may still display demo as your default account, but don't worry: the change should have worked.
[RStudio] Timeout or does not start
Try to clean session files and cache:
# Remove (rm) or move (mv) RStudio files
# mv ~/.rstudio ~/.rstudio.backup-2022-02-27
rm -rf ~/.rstudio
rm -rf ~/.local/share/rstudio
rm -f ~/.RData
Retry.
If it doesn't work, try removing your configuration (your settings will be lost):
rm -rf ~/.config/rstudio
Retry.
If it still doesn't work, contact the support team (IFB Community Forum).
[JupyterHUB] Timeout or does not start
Kill your job/session using the web interface (Menu "File" --> "Hub Control Panel" --> "Stop server") or on the command line:
# Cancel the running jupyter job
scancel -u $USER -n jupyter
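To check that the session is really gone (assuming the job is named jupyter, as in the scancel command above), you can list your running jobs:
# The jupyter job should no longer appear in the list
squeue -u $USER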
Clean the session files and cache:
# Remove (rm) or move (mv) JupyterHUB directories
# mv ~/.jupyter ~/.jupyter.backup-2022-02-27
rm -rf ~/.jupyter
rm -rf ~/.local/share/jupyter
[GPU] How to check the availability of GPU nodes
You can use the sinfo
command with the "Generic resources" (Gres) information.
For example:
sinfo -N -O nodelist,partition:15,Gres:30,GresUsed:50 -p gpu
NODELIST PARTITION GRES GRES_USED
gpu-node-01 gpu gpu:1g.5gb:14 gpu:1g.5gb:0(IDX:N/A)
gpu-node-02 gpu gpu:3g.20gb:2,gpu:7g.40gb:1 gpu:3g.20gb:1(IDX:0),gpu:7g.40gb:0(IDX:N/A)
gpu-node-03 gpu gpu:7g.40gb:2 gpu:7g.40gb:2(IDX:0-1)
In other words:
* gpu-node-01: 14 1g.5gb profiles, 0 used
* gpu-node-02: 2 3g.20gb profiles, 1 used
* gpu-node-02: 1 7g.40gb profile, 0 used
* gpu-node-03: 2 7g.40gb profiles, 2 used
So you can see which GPUs/profiles are immediately available.
More information about these "profiles" ("Multi-Instance GPU"):
* https://ifb-elixirfr.gitlab.io/cluster/doc/slurm/slurm_at/#gpu-nodes
* https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
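Once you have spotted a free profile, you can request it in your job. A minimal sketch (assuming the gpu partition and the 1g.5gb profile shown in the example above):
# Request one 1g.5gb MIG profile on the gpu partition and list the GPU seen by the job
srun -p gpu --gres=gpu:1g.5gb:1 nvidia-smi -L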
[SLURM] How to use resources wisely
Be vigilant about the proper use of resources.
Run tests on small datasets before launching your whole analysis.
And check the resource usage:
CPU / Memory
You can use:
* htop: on the node, during the job
* seff: once your job is finished
* sacct, ... (see the example after this list)
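For example, with the sacct command you can display the elapsed time, the consumed CPU time and the peak memory of a finished job (a sketch: the jobid 2435594 and the chosen fields are just examples):
# Show elapsed time, total CPU time and maximum memory used by job 2435594
sacct -j 2435594 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State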
With the seff command, you can check the CPU and memory usage once your job is finished:
# for the jobid `2435594`
$ seff 2435594
Job ID: 2435594
Cluster: core
User/Group: myuser/mygroup
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 50
CPU Utilized: 182-04:57:51
CPU Efficiency: 52.31% of 348-07:04:10 core-walltime
Job Wall-clock time: 6-23:10:53
Memory Utilized: 45.86 GB
Memory Efficiency: 18.34% of 250.00 GB
Here we requested 50 CPUs and 250 GB of memory for several days:
Only 52.31% of the CPU time was used (100% of the 50 CPUs for 52.31% of the time, 52.31% of the 50 CPUs for 100% of the time, or a mix). That is not very efficient. It can sometimes be explained by I/O operations (reading, writing or fetching data over the Internet, so the CPUs are just waiting for data), but it deserves further investigation.
Memory used is only 45.86 GB of 250.00 GB allocated (18.34%). So, next time, ask for less (something like 60 GB should be sufficient).
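As a sketch, the next submission could then request resources closer to what was actually used (the values below are only illustrative, derived from the seff output above):
#!/bin/bash
#SBATCH --cpus-per-task=50   # keep 50 CPUs while the low CPU efficiency is investigated
#SBATCH --mem=60G            # ~60 GB instead of 250 GB, since only 45.86 GB were used
command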
GPU
Check that your job is actually using the GPU. For example, you can use the nvidia-smi
command while the job is running.
Some libraries or parameters can be misused, so that in the end the GPU is not used at all.
For example, if your job runs on gpu-node-03:
ssh gpu-node-03 nvidia-smi
This way, you can check whether your software (process) is using the whole GPU or only a part of it (MIG).
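To follow the usage over time, a minimal sketch (assuming your job runs on gpu-node-03) is to refresh nvidia-smi periodically:
# Refresh the GPU usage view every 10 seconds (press Ctrl+C to stop)
ssh -t gpu-node-03 watch -n 10 nvidia-smi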