Tutorial of good clustering practices
If you want to submit processes to the cluster, please read the following tutorial of good practices for submitting
jobs correctly; it will help you prevent common problems:
- Don't use the projects filesystem for work that makes very intensive use of the disk, such as creating and deleting thousands of small files or running pipelines over large files. Use the scratch filesystem instead; it is much more efficient for this kind of operation. If your files live in projects, copy them to scratch, work on them there, and when you have the result, copy it back into projects (see the first sketch after this list). If you use the projects filesystem for disk-intensive operations, commands such as ls, ll, rm and bash tab completion, as well as your jobs, can take a very long time.
- Test your jobs before launching them at full scale. A program that works on your computer or on another cluster can still fail here: maybe a library needs to be installed, or maybe there is a typo in an input file path. Submitting small test executions is fast and really useful (see the second sketch after this list).
- Have a rough idea of the resources your job needs: time, memory and disk space. Otherwise you increase the risk of a job failing due to hitting the walltime, excessive swapping or lack of disk space, which is a waste of time for you and for anyone else using the cluster.
- Request a reasonable amount of memory. Any extra memory you request is memory that other users (or even you) will not be able to use, and it makes everything slower: jobs end up waiting because, even though there are free CPU slots, there is not enough free memory.
- It doesn't make sense to submit jobs shorter than one minute. Even when a queue is empty it takes some seconds for a job to start, and when it finishes it again takes some time to leave the queue, inform the master, write accounting logs, and so on. If your job only needs a few seconds of computation, it spends most of its time getting in and out of the queue instead of computing.
- As an extension of the previous point: if you want to run an installed program and that program has a parallel version, let us know. Parallel versions tend to be much faster and more efficient.
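For the projects/scratch advice above, a job script could look roughly like the first sketch below. This is only a sketch: the paths /projects/myproject and /scratch/$USER, as well as the my_pipeline command, are placeholders, not the real layout of our filesystems.

    #!/bin/bash
    #SBATCH --job-name=scratch_example
    #SBATCH --time=02:00:00
    #SBATCH --mem=4G

    # Placeholder paths: adapt them to the real projects and scratch locations.
    PROJECT_DIR=/projects/myproject/run01
    WORK_DIR=/scratch/$USER/run01

    # 1. Copy the input data from projects to scratch.
    mkdir -p "$WORK_DIR"
    cp -r "$PROJECT_DIR/input" "$WORK_DIR/"

    # 2. Do all the disk-intensive work on scratch.
    cd "$WORK_DIR"
    my_pipeline input/ > result.txt    # hypothetical program

    # 3. Copy only the final result back to projects and clean up scratch.
    cp result.txt "$PROJECT_DIR/"
    rm -rf "$WORK_DIR"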
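For the points about testing and about knowing your time and memory needs, one simple approach is to submit a short run on a reduced input with modest, explicit limits before launching the real thing. The script name, input files and resource values below are only examples:

    # Quick test on a small input: fails fast if a library or a path is wrong.
    sbatch --time=00:10:00 --mem=1G --job-name=test_run job.sh input_small.txt

    # Once the test works, submit the full run with realistic requests.
    sbatch --time=20:00:00 --mem=8G --job-name=full_run job.sh input_full.txt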
Here is some more advice to help you when submitting processes to the cluster:
- Keep your home directory as clean as possible. An ideal home directory contains no loose files, only one directory for each of your projects. Once that is done, whenever you submit a job, use the -e/-o options to send the output and error files to the directory of the project they belong to (see the first sketch after this list).
- Use sbatch. When you log in to marvin you are on the login node; this is not the cluster itself, only the front end and queue manager. If you run heavy things there, the machine becomes slower not only for you, but for everyone else using it.
- Use sbatch for anything that is not editing, moving or zipping files. And if the files are pretty big, submit the zips through sbatch too (see the second sketch after this list).
- Use the "--mem=XXX" option. The scheduler needs it to know how much memory your job will require.
- We have more than 700 CPUs, which means we can run more than 700 jobs simultaneously. Yeah, it's not Marenostrum, but… use it! If you have a process that will last 20 hours and you manage to split it into 20 processes of 1 hour each, it will finish roughly 20 times faster. Sometimes it is not possible to break the work into pieces that way, but very often it is (see the job array sketch after this list).
- But keep in mind: with N cores, submit no more than 30xN simultaneous sbatch jobs. As mentioned, it is fine to split a 200h job into 20 x 10h jobs, or even into 100 x 2h jobs, but it doesn't make sense to try to split it into 20000 jobs.
- If you need to run several similar jobs, an easy way to do it is to write a script that submits jobs with different parameters (see the last sketch below), but avoid never-ending self-calling loops: they are pretty nasty to kill.
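For the point about keeping your home directory clean with -e/-o, a submission could look like the following sketch; /projects/myproject/logs is just an example path, and %j is expanded by the scheduler to the job ID:

    # Send stdout and stderr to the project the job belongs to,
    # instead of leaving slurm-*.out files in your home directory.
    sbatch -o /projects/myproject/logs/%j.out \
           -e /projects/myproject/logs/%j.err \
           job.sh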
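For the points about using sbatch for heavy tasks (including zips of big files) and always giving --mem, a one-liner like this keeps the work off the login node; the file path is an example:

    # Compress a big file on a compute node instead of on the login node.
    sbatch --mem=2G --time=04:00:00 --job-name=gzip_big --wrap="gzip /projects/myproject/big_file.dat"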
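For the advice about splitting long work into many shorter jobs while respecting the limit on simultaneous jobs, a job array with a concurrency cap is one option. The chunk layout and my_program are assumptions; the %30 part tells the scheduler to run at most 30 tasks at the same time:

    #!/bin/bash
    #SBATCH --array=1-100%30
    #SBATCH --time=02:00:00
    #SBATCH --mem=2G

    # Each array task processes its own chunk, selected by the task index.
    my_program --chunk "$SLURM_ARRAY_TASK_ID" \
               --input /scratch/$USER/chunks/chunk_${SLURM_ARRAY_TASK_ID}.dat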
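And for the last point, a submission script that sends one independent job per parameter, without any self-calling loop, could be as simple as this; params.txt and job.sh are hypothetical names:

    #!/bin/bash
    # Submit one job per line of a parameter file.
    # The loop runs once on the login node and then exits; it never resubmits itself.
    while read -r param; do
        sbatch --time=01:00:00 --mem=2G job.sh "$param"
    done < params.txt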