Tips and Tricks

I just paste my weekly tips and tricks into this page

Nov 27 2009:

How to delay start of array jobs:

If you have a big array job, you may want to delay its start so that not all tasks read from the fileserver at the same time (bandwidth exhaustion). You can put your tasks on hold using the -h option:

> qsub -t <task_range> -h <other options> <job>

Now none of the tasks will start until you release the hold. You can then release the hold for only part of the tasks, by hand or in a script, like this:

> qrls <job_id> -t <task_range_to_release>
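
A scripted, staggered release could look like the following minimal sketch. The function name release_in_batches, its arguments and the QRLS override variable are made up for illustration; the qrls call simply follows the form shown above.

```shell
#!/bin/sh
# Sketch: release held array tasks in batches, pausing between batches.
# Hypothetical helper. QRLS can be overridden (e.g. QRLS=echo) for a dry run.
QRLS=${QRLS:-qrls}

release_in_batches() {
    job_id=$1   # job id as printed by qsub
    first=$2    # first task id of the array
    last=$3     # last task id of the array
    batch=$4    # number of tasks to release at once
    delay=$5    # seconds to wait between batches

    start=$first
    while [ "$start" -le "$last" ]; do
        end=$((start + batch - 1))
        if [ "$end" -gt "$last" ]; then
            end=$last
        fi
        $QRLS "$job_id" -t "$start-$end"
        start=$((end + 1))
        # no need to wait after the final batch
        if [ "$start" -le "$last" ]; then
            sleep "$delay"
        fi
    done
}
```

For example, release_in_batches <job_id> 1 1000 50 60 would release 50 tasks per minute. Setting QRLS=echo first only prints the commands, which is a cheap way to check the ranges before touching real jobs.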

To get an impression of the resource consumption, look at the Ganglia page.

Dec 4:

This week, some commands to get an overview of current cluster activity:

> qstat -g c

gives an overview of currently defined queues, slots Used/Available and so on.

> qstat -u <username>

shows info on running and pending jobs for a user (use -u "*" for all users' jobs)

> qstat -u <username> -s r

selects only running jobs (use 'p' for pending)

> qhost -h <node name>

gives some current load values for a node

> qhost -h <node name> -q

adds info on the different queue instances on this node, like used slots, "disabled" or "suspended"

> qselect -qs d

shows all "disabled" queue instances

For more info see the manpages of qstat, qhost and qselect. 

Feb 5 2010

If you have jobs that need to run longer than the standard 4 days, you have to request the RESOURCE "longrun", not the QUEUE "longrun". Your job will end up in that queue, but only if you request the resource. Some of my older tips or documentation may have been misleading on this point. You request the resource like this:

# qsub -l longrun <your job>

Feb 26

This week I noticed a pitfall regarding multi-process jobs and slots. If you have a construct like this in your job script
<prog1>|<prog2>|<prog3>
it runs all 3 programs in parallel. So if all 3 are computations, you need to reserve at least 3 slots.
Please keep that in mind.
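
A quick way to convince yourself that the stages of a pipeline run concurrently is to time one. This is only an illustration, with sleep standing in for real programs:

```shell
#!/bin/sh
# Three 2-second stages connected by pipes. If they ran one after another
# this would take about 6 seconds; because the shell starts all stages at
# once, it finishes in about 2.
start=$(date +%s)
sleep 2 | sleep 2 | sleep 2
end=$(date +%s)
elapsed=$((end - start))
echo "pipeline of 3 x 'sleep 2' took about ${elapsed}s"
```

With real computations in place of sleep, each stage can keep a CPU core busy at the same time, which is why the slot request has to cover all of them.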

Mar 5

You may not be aware of it, but there are quotas active for the usage of cluster resources. At the moment the only restricted resource is slots.
To get an idea of which quotas are defined, call

# qconf -srqsl

To get details of a quota use

# qconf -srqs <quota_name>

To get your current consumption:

# qquota

(This will return an empty line if you did not use any restricted resource)

Apr 16

This week I noticed a small pitfall: one user edited a job script and commented out (with a '#' sign) a line beginning with '$'. Unfortunately, qsub interprets lines beginning with '#$' as setting a GridEngine option. You either need to change that into '# $' (inserting a blank) or tell qsub that this is a "binary" by using:

 # qsub -b y <full path to script>
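
To check a script before submitting it, you can list all lines that start with '#$'. A minimal sketch, where show_sge_directives is a made-up helper name and '#$' is assumed to be the option prefix in use:

```shell
#!/bin/sh
# Sketch: list the lines of a job script that start with '#$', i.e. the
# lines qsub will treat as embedded options. Hypothetical helper name.
show_sge_directives() {
    # -n prints line numbers; '^#\$' matches a literal '#$' at line start
    grep -n '^#\$' "$1"
}
```

Calling show_sge_directives <your script> prints every such line with its line number, so an accidentally commented-out '$...' line stands out next to the intended options. If nothing matches, grep prints nothing and returns a non-zero status.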

Jun 6

Now that the new memory management is in place, you probably want to know whether you requested too much and could get away with a lower value next time (and so have a better chance for your job to start).
You can have a look at the output from

# qacct -j <JOB_ID>

and look at the line with the maxvmem value. This tells you the maximum virtual memory used by your job.
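
If you only want that one value, you can filter the output. A minimal sketch, where job_maxvmem is a made-up helper name and qacct is assumed to print one "name value" pair per line:

```shell
#!/bin/sh
# Sketch: print only the maxvmem value(s) from the accounting record of a job.
# Hypothetical helper; assumes qacct prints one "name value" pair per line.
job_maxvmem() {
    qacct -j "$1" | awk '$1 == "maxvmem" { print $2 }'
}
```

job_maxvmem <JOB_ID> then prints just the value. If the accounting output contains several records for the job, one line is printed per record.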

Jun 24

If you always have the same options on your qsub line, you can write them into a .sge_request file in your home directory. You can put several options on one line or have one option per line; every line starting with a '#' is regarded as a comment. Example:

# this is a .sge_request example
# the following line requests only 512MB memory by default (per slot)
-l h_vmem=512M
# I want to get an email when the job is finished
-m e

If you start your jobs with the option -cwd (run in current working dir), a .sge_request in that directory is also taken into account. The order of evaluation is as follows; settings read later override earlier ones:

  1. clusterwide default
  2. home-dir .sge_request
  3. cwd .sge_request (if -cwd option)
  4. command line options

If you want to clear the default settings, you can do so with the -clear command line option.

Jul 09

How to get information about failed jobs:
If you started your job with qsub -m a, you will get an email if the job is aborted. (With other arguments to the -m option you can also get emails when the job starts and when it ends; see man qsub.)

This email gives you some basic info on the job: where it ran, at what time, and how much memory it used. You can also get this from: qacct -j <job_id>
You will also see an exit_code. If you see a 137 here, it means the job was killed by signal 9 (KILL), because 128+9=137.
This usually happens when the job exceeded a resource limit, either memory or runtime.
To investigate this further, log in to the node where the job ran and grep for your job_id in /opt/sge/default/spool/<nodename>/messages
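
The exit code arithmetic above can be sketched as a small helper. The name explain_exit_code is made up for illustration; kill -l with a number prints the name of that signal.

```shell
#!/bin/sh
# Sketch: translate an exit code above 128 into the signal that ended the job.
# explain_exit_code is a hypothetical helper name.
explain_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        # kill -l <number> prints the name of that signal
        echo "exit code $code = 128 + $sig: killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exit code $code: regular exit status set by the program"
    fi
}
```

For the case described above, explain_exit_code 137 reports signal 9 (KILL).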

Aug 27

If you accidentally deleted a file, there is a good chance that it is in our daily snapshots. These are taken once a day and kept for a week.

Suppose the missing file is <filesystem_root>/<sub_path>, where <filesystem_root> is the root location of the filesystem (e.g. /data/bioinformatics, /data/proteomics, /home ...) and <sub_path> is the path therein.

Then you could find your file in <filesystem_root>/.zfs/snapshot/<day>/<sub_path>

Please note that the .zfs directory is not shown when you list <filesystem_root>. It will "magically" appear when you access it. More can be read on the Self-Restore page.
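
To see which snapshots still contain a file, you can loop over the snapshot directories. A minimal sketch, assuming the .zfs/snapshot layout described above; find_in_snapshots is a made-up helper name:

```shell
#!/bin/sh
# Sketch: list every snapshot copy of a file.
# Hypothetical helper; relies on the <filesystem_root>/.zfs/snapshot/<day>
# layout described above.
find_in_snapshots() {
    fs_root=$1    # root location of the filesystem, e.g. /home
    sub_path=$2   # path of the file below the filesystem root
    for snap in "$fs_root"/.zfs/snapshot/*; do
        if [ -e "$snap/$sub_path" ]; then
            echo "$snap/$sub_path"
        fi
    done
}
```

Each printed path is a read-only copy of the file as it was when that snapshot was taken; restoring is then an ordinary cp back to the original location.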

Jan 7 2011

In the cluster documentation and messages there is some confusion between interactive and immediate jobs. Today I'll try to clarify it a bit (or make it worse?).
Queues can have the "interactive" flag set (at the moment only our "interactive" and "high*" queues have it set). The opposite is the "batch" flag.
Despite the name, "interactive" means "immediate jobs", while "batch" means "non-immediate jobs". Immediate jobs fail if the requested resources cannot be satisfied right now, while non-immediate jobs are queued and wait until the resources become available.
You set the type of your job with the "-now no|yes" switch.
The interactive cluster programs qrsh and qlogin default to "-now yes", while qsub defaults to "-now no". You can change this behavior, for example by using "-now no" to have your qrsh wait until the resources are available.
