
NFS bottlenecks

How to identify whether NFS server performance is the bottleneck for your parallel jobs, and some strategies to reduce this problem

Why could network speed be a problem for your job?

The problem described here arises when you start many jobs/tasks on the cluster that read or write relatively big input/output files at the same time. We will concentrate on the input case, since it is the more common one; for output problems you need to adjust the strategies accordingly.

You should try to reduce these problems because they delay the completion of your own jobs and can also block other people's jobs from running.

It's not hard to imagine that 100 nodes with gigabit ethernet accessing the same fileserver through 2 gigabit ethernet cards will easily saturate that link. Another bottleneck is the read and write speed of the disk arrays inside the fileservers. While both limits are determined by the hardware, there are strategies to mitigate the resulting problems.
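A back-of-envelope calculation shows how quickly this adds up. The sketch below assumes a 10 GB input per job (an illustrative number, not a cluster measurement) and that the 2 gigabit server link is the only bottleneck, shared fairly between jobs:

```python
# Rough estimate of NFS read time when the server link saturates.
# input_gb = 10 is an illustrative assumption, not a measured value.

def read_time_seconds(input_gb, n_jobs, server_gbit=2):
    """Time for n_jobs to each read input_gb gigabytes through a shared
    server link of server_gbit gigabit/s, assuming the link is the only
    bottleneck and its bandwidth is shared fairly."""
    total_bits = input_gb * n_jobs * 8e9
    return total_bits / (server_gbit * 1e9)

# One job alone: 10 GB over 2 Gbit/s takes about 40 seconds.
print(round(read_time_seconds(10, 1)))        # 40
# 100 simultaneous jobs: every job spends over an hour just reading input.
print(round(read_time_seconds(10, 100) / 60))  # 67 (minutes)
```

Real numbers will differ (protocol overhead, disk speed, other users), but the order of magnitude is the point: the same input that one job reads in under a minute keeps 100 simultaneous jobs waiting for over an hour.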

Symptoms of your jobs waiting for IO:

  • actual runtime is much longer than expected for this input size (you did test your job with smaller input first, right?)
  • on the ganglia page you see a high percentage of WAIT CPU on the CPU graph (Ganglia MDC internal only)
  • on the ganglia page you see high values for "your" nodes when switching to the cpu_wio view
  • when you ssh into one of the nodes running your job, "ps aux" shows your job in state "D" (uninterruptible sleep, almost always waiting for IO)
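The last check can be scripted. The sketch below parses `ps aux`-style output and picks out processes whose STAT column starts with "D"; the sample text is made up for illustration, and on a real node you would feed in the actual `ps aux` output as shown in the comment:

```python
# Sketch: find processes in state "D" (uninterruptible sleep, usually
# waiting for IO) in `ps aux` output. On a node you would call e.g.
#   dstate_processes(subprocess.check_output(["ps", "aux"], text=True))
import subprocess

def dstate_processes(ps_output):
    """Return (user, pid, command) for every line whose STAT column
    (8th field of `ps aux`) starts with 'D'."""
    hits = []
    for line in ps_output.splitlines()[1:]:    # skip the header line
        cols = line.split(None, 10)            # COMMAND may contain spaces
        if len(cols) == 11 and cols[7].startswith("D"):
            hits.append((cols[0], cols[1], cols[10]))
    return hits

# made-up sample output for demonstration
sample = (
    "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND\n"
    "alice 4242 3.0 1.2 9000 800 ? D 10:00 0:05 ./my_job input.dat\n"
    "bob 4300 99.0 0.5 9000 800 ? R 10:01 9:59 ./cpu_bound_job\n"
)
print(dstate_processes(sample))  # [('alice', '4242', './my_job input.dat')]
```

A few processes briefly in "D" is normal; jobs that sit in "D" for minutes at a time are IO-bound.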


Some possible strategies (adjust to your needs)

  • reduce file size if possible (e.g. if large chunks of input are uninteresting for your jobs)
  • if your individual jobs only work on small parts of a big input file: split up the input beforehand (e.g. for mapping jobs)
  • use mmap to load only the needed parts of the input file
  • delayed start of individual jobs (estimate how long a job needs to read the input and only start the next job after that period)
    This is far from perfect. Keep in mind that jobs by other people could load the same fileserver.
  • put all your jobs except one on hold at the beginning and release the next hold whenever a job finishes its "read input" phase
  • file staging: if you're pretty sure the same input is needed again and again by your jobs, you can preload this input to all nodes you plan to use
    • you can use a subdirectory in /tmp (all nodes) or /scratch (only bignodes with the resource scratch defined) for this. These directories are cleaned up regularly.
    • to make sure your preloaded data is there you could use "rsync" at the beginning of your script
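The "split up the input" idea can be sketched as follows for a line-oriented input file. File names and the chunk size are illustrative assumptions; the shell command `split -l` achieves the same thing:

```python
# Sketch: split one big line-oriented input into per-job chunks so each
# job only reads its own part. Chunk size and names are assumptions.
import itertools
import os
import tempfile

def split_input(path, lines_per_chunk, prefix):
    """Write consecutive groups of lines_per_chunk lines to
    prefix_0000, prefix_0001, ... and return the chunk file names."""
    names = []
    with open(path) as src:
        groups = iter(lambda: list(itertools.islice(src, lines_per_chunk)), [])
        for i, group in enumerate(groups):
            name = "%s_%04d" % (prefix, i)
            with open(name, "w") as out:
                out.writelines(group)
            names.append(name)
    return names

# tiny demonstration with made-up data
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, "input.txt")
with open(demo_src, "w") as f:
    f.write("".join("record %d\n" % i for i in range(10)))
chunks = split_input(demo_src, 4, os.path.join(demo_dir, "chunk"))
print(len(chunks))  # 3  (chunks of 4 + 4 + 2 records)
```

Splitting once on the submit host and pointing each job at its own chunk means no job ever touches the bytes it doesn't need.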
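The mmap strategy can be sketched like this: map the file and read only the byte range the job needs, so only those pages are requested from the NFS server. The offsets below are illustrative; a real job would compute them from an index. Note that how much traffic this actually saves depends on your access pattern and the kernel's read-ahead:

```python
# Sketch: map a big input file and touch only the needed byte range.
import mmap
import os
import tempfile

def read_slice(path, offset, length):
    """Return `length` bytes at `offset` without reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]

# demonstration on a small made-up file
demo = os.path.join(tempfile.mkdtemp(), "big_input.dat")
with open(demo, "wb") as f:
    f.write(b"0123456789" * 100)
part = read_slice(demo, 10, 5)
print(part)  # b'01234'
```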
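The delayed-start idea boils down to computing start offsets from an estimated read time. The sketch below uses the 2 gigabit server link mentioned above; the per-job input size is an assumption, and as the text notes, this ignores load from other people's jobs:

```python
# Sketch of the "delayed start" strategy: space out job starts by the
# estimated time one job needs to read its input alone. input_gb is an
# illustrative assumption.

def staggered_starts(n_jobs, input_gb, server_gbit=2):
    """Start offsets in seconds, one per job, spaced by the estimated
    read time of a single job that has the server link to itself."""
    read_seconds = input_gb * 8 / server_gbit
    return [i * read_seconds for i in range(n_jobs)]

# 4 jobs, 10 GB input each: start them 40 seconds apart.
print(staggered_starts(4, 10))  # [0.0, 40.0, 80.0, 120.0]
```

In a submit script, the offsets could become `sleep` prefixes or scheduler start-time parameters; the hold/release variant in the list above is more robust because it reacts to actual progress instead of an estimate.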
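The staging check at the start of a job script can be sketched as below. The page suggests "rsync"; this sketch mimics rsync's quick size-and-mtime check for a single file using the standard library. All paths here are made-up stand-ins for a shared filesystem path and a node-local `/tmp/<user>` directory:

```python
# Sketch of file staging: make sure the input exists in the node-local
# cache, copying from the shared filesystem only if missing or stale.
import os
import shutil
import tempfile

def stage_input(shared_path, cache_dir):
    """Return a node-local copy of shared_path inside cache_dir,
    copying only when the cached file is missing or out of date
    (size/mtime check, similar to rsync's default quick check)."""
    os.makedirs(cache_dir, exist_ok=True)
    local = os.path.join(cache_dir, os.path.basename(shared_path))
    if (not os.path.exists(local)
            or os.path.getsize(local) != os.path.getsize(shared_path)
            or os.path.getmtime(local) < os.path.getmtime(shared_path)):
        shutil.copy2(shared_path, local)   # copy2 preserves the mtime
    return local

# demonstration with made-up paths under a temp directory
shared = os.path.join(tempfile.mkdtemp(), "reference.dat")
with open(shared, "wb") as f:
    f.write(b"shared input data")
cache = tempfile.mkdtemp()   # stands in for /tmp/<user> on a node
local = stage_input(shared, cache)
print(open(local, "rb").read())  # b'shared input data'
```

Because the second and later jobs on the same node find the file already cached, they read from the local disk instead of the fileserver. For whole directory trees, calling `rsync -a` directly does the same job.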