💡 The following troubleshooting tips can be complemented by consulting the IFB Community Forum
Invalid account or account/partition combination specified
```
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
```
Your current default SLURM account is probably the demo one (you may have seen a red notice about it at login). You can check it using:
```
$ sacctmgr list user $USER
      User   Def Acct     Admin
---------- ---------- ---------
   cnorris       demo      None
```
If you don't already have a project, you have to request one from the platform: https://my.cluster.france-bioinformatique.fr/manager2/project
If you already have a project/account, you can either:
- Specify your SLURM account for each job:
```
srun -A my_account command
```

or in a batch script:

```
#!/bin/bash
#SBATCH -A my_account

command
```
- Change your default account:

```
sacctmgr update user $USER set defaultaccount=my_account
```
⚠️ status_bar is updated hourly, so it may still display demo as your default account, but don't worry: the change should have worked.
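If you want to confirm the change right away (without waiting for status_bar to refresh), you can query sacctmgr again, for example:

```
# Show your user and its default account (should now be my_account)
sacctmgr show user $USER format=User,DefaultAccount
```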
[RStudio] Timeout or does not start
Try cleaning the session files and cache:
```
# Remove (rm) or move (mv) RStudio files
# mv ~/.rstudio ~/.rstudio.backup-2022-27-02
rm -rf ~/.rstudio
rm -rf ~/.local/share/rstudio
rm .RData
```
If that doesn't work, try removing your configuration (your settings will be lost):
```
rm -rf ~/.config/rstudio
```
If it still doesn't work, contact support on the IFB Community Forum.
[JupyterHub] Timeout or does not start
Kill your job/session using the web interface (menu "File" --> "Hub Control Panel" --> "Stop server") or on the command line:
```
# Remove running jupyter job
scancel -u $USER -n jupyter
```
Clean the session files and cache:
```
# Remove (rm) or move (mv) JupyterHub directories
# mv ~/.jupyter ~/.jupyter.backup-2022-27-02
rm -rf ~/.jupyter
rm -rf ~/.local/share/jupyter
```
[GPU] How to check the availability of GPU nodes
You can use the sinfo command with the "Generic resources (gres)" information:
```
$ sinfo -N -O nodelist,partition:15,Gres:30,GresUsed:50 -p gpu
NODELIST            PARTITION      GRES                          GRES_USED
gpu-node-01         gpu            gpu:1g.5gb:14                 gpu:1g.5gb:0(IDX:N/A)
gpu-node-02         gpu            gpu:3g.20gb:2,gpu:7g.40gb:1   gpu:3g.20gb:1(IDX:0),gpu:7g.40gb:0(IDX:N/A)
gpu-node-03         gpu            gpu:7g.40gb:2                 gpu:7g.40gb:2(IDX:0-1)
```
In other words:

* gpu-node-01: 14 profiles 1g.5gb, 0 used
* gpu-node-02: 2 profiles 3g.20gb, 1 used
* gpu-node-02: 1 profile 7g.40gb, 0 used
* gpu-node-03: 2 profiles 7g.40gb, 2 used
This shows which GPUs/profiles are immediately available.
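Once you have spotted a free profile, you can request it explicitly with the gres option. A minimal sketch (the profile name and count simply mirror the sinfo output above):

```
# Request one free 1g.5gb MIG instance on the gpu partition
srun -p gpu --gres=gpu:1g.5gb:1 nvidia-smi -L
```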
More information about this "profile" ("Multi-Instance GPU"):

* https://ifb-elixirfr.gitlab.io/cluster/doc/slurm/slurm_at/#gpu-nodes
* https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
[SLURM] How to use resources wisely
Be vigilant about the proper use of resources.
Run tests on small datasets before launching your whole analysis.
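For example, a minimal sketch of such a test run (my_command, small_subset.fastq and the resource values are hypothetical; adapt them to your own analysis):

```
# Quick test on a small subset with modest resources before the full run
srun --cpus-per-task=2 --mem=4GB my_command --input small_subset.fastq
```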
And check the resource usage:
CPU / Memory
You can use:

* htop: on the node, during the job (see the sketch after this list)
* seff: once your job is finished
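A minimal sketch of the htop approach (the node name cpu-node-12 below is a hypothetical example; use the node reported by squeue for your own job):

```
# Find out on which node your job is running
squeue -u $USER
# Connect to that node and watch your processes live
ssh cpu-node-12
htop -u $USER
```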
For example, with the seff command you can check the CPU and memory usage once your job has finished:
```
# for the jobid `2435594`
$ seff 2435594
Job ID: 2435594
Cluster: core
User/Group: myuser/mygroup
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 50
CPU Utilized: 182-04:57:51
CPU Efficiency: 52.31% of 348-07:04:10 core-walltime
Job Wall-clock time: 6-23:10:53
Memory Utilized: 45.86 GB
Memory Efficiency: 18.34% of 250.00 GB
```
Here we requested 50 CPUs and 250 GB of memory for several days:
Only 52.31% of the CPU time was actually used (100% of the 50 CPUs during 52.31% of the time, 52.31% of the 50 CPUs during 100% of the time, or a mix of both). This is not very efficient. It can sometimes be explained by I/O operations such as reading, writing or fetching data over the Internet (the CPUs just wait for data), but it deserves further investigation.
Only 45.86 GB of the 250.00 GB of allocated memory was used (18.34%). So, next time, ask for less (something like 60 GB should be sufficient).
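As an illustration, the adjusted request could look like this in a batch script (the values below are only an example derived from the seff output above, not a general recommendation):

```
#!/bin/bash
# Memory request lowered to match the ~46 GB actually used, with some margin
#SBATCH --mem=60GB
#SBATCH --cpus-per-task=50

command
```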
GPU
Check that your job is actually using the GPU, for example by running the nvidia-smi command during processing. It is easy to misuse some libraries or parameters and end up not using the GPU at all.
For example, if your job runs on gpu-node-03:

```
ssh gpu-node-03 nvidia-smi
```
This lets you check whether your software (process) is using the whole GPU or only part of it (a MIG instance).
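A small complementary sketch to keep an eye on it during the run (same example node as above; watch and nvidia-smi -L are standard tools, nothing cluster-specific):

```
# On the GPU node: list the physical GPUs and MIG instances visible there
nvidia-smi -L
# Refresh the GPU/MIG utilization every 5 seconds while your job is running
watch -n 5 nvidia-smi
```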