
Troubleshooting

💡 The troubleshooting tips below can be supplemented by consulting the IFB Community Forum

[SLURM] Invalid account or account/partition combination specified

Complete message:

srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

Explanation 1

Your current default SLURM account is probably the demo one (you may have seen a red notice at login). You can check it using:

$ sacctmgr list user $USER
      User   Def Acct     Admin
---------- ---------- ---------
   cnorris       demo      None
Solution

If you don't already have a project, you have to request one from the platform: https://my.cluster.france-bioinformatique.fr/manager2/project

Otherwise, if you already have a project/account, you can either:

  • Specify your SLURM account for each job:
srun -A my_account command
#!/bin/bash
#SBATCH -A my_account
command
  • Change your default account:
sacctmgr update user $USER set defaultaccount=my_account

⚠️ status_bar is updated hourly, so it may still display demo as your default account, but don't worry, the change should have worked.
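
To check right away that the change has been taken into account, you can query SLURM directly instead of relying on the status_bar (this simply reuses the sacctmgr command shown above):

# The Def Acct column should now show your new account
sacctmgr list user $USER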

[RStudio] Timeout or does not start

Try to clean session files and cache:

# Remove (rm) or move (mv) RStudio files
# mv ~/.rstudio ~/.rstudio.backup-2022-02-27
rm -rf ~/.rstudio
rm -rf ~/.local/share/rstudio
rm .RData

Retry.

If it doesn't work, try to remove your configuration (settings will be lost):

rm -rf ~/.config/rstudio

Retry.

If it still doesn't work, contact support (IFB Community Forum)

[JupyterHUB] Timeout or does not start

Kill your job/session using the web interface (menu "File" --> "Hub Control Panel" --> "Stop server") or on the command line:

# Cancel the running jupyter job
scancel -u $USER -n jupyter
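
Before restarting, you can make sure the job is really gone. A quick sketch using standard SLURM options (the job name jupyter is the same one targeted by scancel above):

# List your remaining jobs named "jupyter" (the output should be empty)
squeue -u $USER -n jupyter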

Clean the session files and cache:

# Remove (rm) or move (mv) JupyterHUB directories
# mv ~/.jupyter ~/.jupyter.backup-2022-02-27
rm -rf ~/.jupyter 
rm -rf ~/.local/share/jupyter

[GPU] How to check the availability of GPU nodes

You can use the sinfo command with the "Generic resources (gres)" information.

For example:

sinfo -N -O nodelist,partition:15,Gres:30,GresUsed:50 -p gpu
NODELIST            PARTITION      GRES                          GRES_USED                                         
gpu-node-01         gpu            gpu:1g.5gb:14                 gpu:1g.5gb:0(IDX:N/A)                             
gpu-node-02         gpu            gpu:3g.20gb:2,gpu:7g.40gb:1   gpu:3g.20gb:1(IDX:0),gpu:7g.40gb:0(IDX:N/A)       
gpu-node-03         gpu            gpu:7g.40gb:2                 gpu:7g.40gb:2(IDX:0-1)    

In other words:

  • gpu-node-01: 14 profiles 1g.5gb, 0 used
  • gpu-node-02: 2 profiles 3g.20gb, 1 used
  • gpu-node-02: 1 profile 7g.40gb, 0 used
  • gpu-node-03: 2 profiles 7g.40gb, 2 used

This way, you can see which GPUs/profiles are immediately available.
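
Once you have spotted a free profile, you can request it explicitly. A minimal sketch, assuming the gpu partition and the 1g.5gb MIG profile shown above (check the exact --gres syntax for your use case in the documentation linked below):

# Request one 1g.5gb MIG profile on the gpu partition and check it
srun -p gpu --gres=gpu:1g.5gb:1 nvidia-smi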

More information about this "profile" ("Multi-Instance GPU"):

  • https://ifb-elixirfr.gitlab.io/cluster/doc/slurm/slurm_at/#gpu-nodes
  • https://docs.nvidia.com/datacenter/tesla/mig-user-guide/

[SLURM] How to use resources wisely

Be vigilant about the proper use of resources.

Do tests on small datasets before launching your whole analysis.
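
For example, a short test run with deliberately small limits (a sketch only: my_command and small_subset.fastq are placeholders to replace with your own tool and data):

# Quick test on a small dataset with modest CPU, memory and time limits
srun --cpus-per-task=2 --mem=8G --time=00:30:00 my_command small_subset.fastq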

Then, check the resource usage:

CPU / Memory

You can use:

  • htop: on the node, during the job
  • seff: once your job is finished
  • sacct, ...
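
As a first quick check, a sacct sketch (the format fields below are standard SLURM accounting fields; replace 2435594 with your own job ID):

# Elapsed time, CPU time, peak memory and requested memory for a finished job
sacct -j 2435594 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,ReqMem,State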

For example with the seff command, you can check the CPU and memory usage (once your job is finished):

# for the jobid `2435594`
$ seff 2435594
Job ID: 2435594
Cluster: core
User/Group: myuser/mygroup
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 50
CPU Utilized: 182-04:57:51
CPU Efficiency: 52.31% of 348-07:04:10 core-walltime
Job Wall-clock time: 6-23:10:53
Memory Utilized: 45.86 GB
Memory Efficiency: 18.34% of 250.00 GB

Here we have requested 50 CPUs and 250 GB of memory for several days:

Only 52.31% of the CPU time was used (either 100% of the 50 CPUs during 52.31% of the time, 52.31% of the 50 CPUs during 100% of the time, or a mix). This is not very efficient. It can sometimes be explained by I/O operations such as reading, writing or fetching data over the Internet (the CPUs are simply waiting for data), but it deserves further investigation.

Only 45.86 GB of the 250.00 GB allocated were used (18.34%). So, next time, ask for less (something like 60 GB should be sufficient).
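
For the next run, the request could therefore look like the sketch below (based on the numbers above; also consider reducing the CPU count once the 52% CPU efficiency has been investigated):

#!/bin/bash
#SBATCH --mem=60G            # instead of 250 GB (only ~46 GB were actually used)
#SBATCH --cpus-per-task=50   # keep or reduce after investigating the CPU efficiency
command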

GPU

Check that your job is actually using the GPU: for example, you can run the nvidia-smi command while the job is processing. It is easy to misuse a library or a parameter and end up not using the GPU at all.

For example, if your job runs on gpu-node-03:

ssh gpu-node-03 nvidia-smi

This way, you can check whether your software (process) is using the whole GPU or only a part of it (a MIG instance).
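
If you don't know which node your job landed on, here is a small sketch using standard squeue and nvidia-smi options (%N prints the node list; -l 5 refreshes the display every 5 seconds, stop it with Ctrl+C):

# Find the node(s) running your jobs, then watch the GPU usage there
squeue -u $USER -o "%i %j %N"
ssh gpu-node-03 nvidia-smi -l 5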