
Abuse

What could happen if you use the cluster the wrong way.

Results of misusing the cluster

If you do computation on the login-node(s)

The login node is a very limited resource: all cluster users rely on a stable login node to connect to the cluster services. If you break the login node (see below), the machine has to be restarted and all other users suffer from your misbehavior, because every existing shell (on login1 or interactive on the compute nodes) and every running application (e.g. an editor with unsaved changes) is terminated by the restart.

You misuse the login node if you run computations on it. This includes:

  • Any process that consumes a noticeable amount of main memory,
    i.e. more than a few tens of megabytes of the installed 16 GB of memory shared by all users.
    If the main memory is exhausted, the system first starts to swap (which slows the machine down by an order of magnitude) and then starts killing memory-intensive processes more or less at random. These may even be system services (such as NFS access to /home or user authentication). The system must then be restarted, and we can find out who misbehaved.
  • Any process that consumes a noticeable amount of CPU time,
    i.e. more than a few tens of seconds on the installed 8 cores that are shared by all users.
    If the load on the machine rises above 8, any response of the machine can be delayed by seconds or even minutes, making the machine unusable. If this lasts for a longer time, we usually have to restart the machine as well.
  • Any process that causes a noticeable amount of wait-I/O on one CPU,
    i.e. more than 10% for several minutes.
    This is mainly a waste of CPU time and is mostly relevant on the compute nodes (see below). If your processes cause wait-I/O (see NFS bottlenecks), please try to find a way to optimize your workflow. (This also applies to processes that are started via Grid Engine.)
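
The criteria above can be checked with a few lines of Python before you start something on a shared machine. This is only a minimal sketch, not a tool we provide: the function names are made up for illustration, the core count of 8 is the figure given above, and the wait-I/O helper assumes you pass in tick counts that you collected yourself (for example from the counters in /proc/stat on Linux).

```python
import os

# Core count of the login node as stated in the text above.
LOGIN_NODE_CORES = 8


def load_exceeds_cores(load_1min: float, cores: int = LOGIN_NODE_CORES) -> bool:
    """Return True if the 1-minute load average is above the number of cores."""
    return load_1min > cores


def iowait_percent(iowait_ticks: int, total_ticks: int) -> float:
    """Share of CPU time spent waiting for I/O, in percent.

    Both arguments are differences between two samples of the CPU time
    counters; how they are collected is left to the caller.
    """
    if total_ticks <= 0:
        return 0.0
    return 100.0 * iowait_ticks / total_ticks


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    load_1min, _, _ = os.getloadavg()
    cores = os.cpu_count() or LOGIN_NODE_CORES
    print(f"1-minute load: {load_1min:.2f} on {cores} cores")
    if load_exceeds_cores(load_1min, cores):
        print("Load is above the core count; do not start anything else here.")
```

A load above the core count, or a wait-I/O share above 10% over several minutes, matches the situations described in the list above.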

If we notice such a process, we will first kindly inform the user.
If the misuse continues, see Consequences.

There are a few exceptions under which processes like those listed above are permitted on the login node:

  • Tasks that need access to the Internet (as the compute nodes do not have Internet-access)
    e.g. rsync, scp/sftp or software (like "R") that updates its modules from a repository
  • ssh-sessions that forward information (e.g. X11-sessions) between a compute node and the outside world (Campus-Workstation or Internet)

If you run computation outside the grid engine

The criteria for abusive processes (listed above) also apply to the compute nodes: you may log in to the nodes directly via ssh to check your processes (or to find out why they died), but intensive computations must run under the control of the Grid Engine (i.e. use qlogin / qrsh or qsub).
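
As an illustration of handing a computation to the Grid Engine instead of starting it directly, the sketch below assembles a qsub command line. The script name my_analysis.sh and the helper build_qsub_command are hypothetical; -N (job name) and -cwd (run in the current working directory) are standard qsub options, but any resource requests your job needs depend on the local configuration and are not shown here.

```python
import shlex
from typing import List


def build_qsub_command(script: str, job_name: str) -> List[str]:
    """Assemble a qsub call for a batch script (hypothetical helper).

    -N sets the job name, -cwd runs the job in the current working directory.
    """
    return ["qsub", "-N", job_name, "-cwd", script]


if __name__ == "__main__":
    # my_analysis.sh is a placeholder for your own job script.
    cmd = build_qsub_command("my_analysis.sh", "my_analysis")
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(cmd))
```

To actually submit the job, the resulting list could be passed to subprocess.run on a machine where qsub is available.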

It is also not permitted to run computations on behalf of another user, as this would break any future accounting and could circumvent limits imposed on other users.

Once again, there is a very limited set of exceptions (e.g. for pipelines that make heavy use of memory-mapped files and thus cannot run within the Grid Engine's limits). Users who are allowed to do so receive special training.

Consequences

Everyone can make a mistake, so don't be upset if we kindly inform you about one.

If "mistakes" are repeated on purpose, there are further consequences beyond informative mails to the user:

  • Notification of the responsible group leader
  • Personal discussion of the problems that you have caused
  • Time-limited reduction of your access to the resources offered by Grid Engine
  • Time-limited blocking of your access to the cluster