GNU Parallel Tutorial
1. Introduction to GNU Parallel
2. Examples
2.1 PARALLEL + GREP: Extract VCF positions from a list
2.2 PARALLEL + GAWK: Change read names in a fastq file
2.3 Parallelizing many commands
1. INTRODUCTION TO GNU PARALLEL
GNU Parallel is a tool for executing one or more commands in parallel on one or more nodes and/or CPUs. It can run several different commands at once, or split a big input file into blocks and process them in parallel.
In this tutorial we offer some useful examples of its use with real data in a research context. If you want more information, you can read the official tutorial or the manual (or type “man parallel” in the shell).
You can use it on your local machine or on the Marvin cluster, in which case you should load the corresponding module:
module load parallel
1.1 Interesting options
--progress
Show progress of computations.
--pipepart
Spread input to jobs on stdin (standard input). Read a block of data from a physical file (given by -a) and give one block of data as input to one job.
The block size is determined by --block.
--block
Size of the block, in bytes, to read at a time.
--keep-order / -k
Keep the sequence of output the same as the order of input. Normally the output of a job is printed as soon as the job completes.
--quote / -q
Quote command. This will quote the command line so special characters are not interpreted by the shell. See the section QUOTING. Most people will never need this. Quoting is disabled by default.
--jobs
Number of jobslots on each machine. Run up to N jobs in parallel. 0 means as many as possible. Default is 100% which will run one job per CPU core on each machine.
{}
Input line. This replacement string will be replaced by a full line read from the input source. The input source is normally stdin (standard input), but can also be given with -a, :::, or ::::.
The replacement string {} can be changed with -I.
{.}
Input line without extension. This replacement string will be replaced by the input with the extension removed. If the input line contains a . after the last /, everything from the last . to the end of the string is removed and {.} is replaced with the remainder.
{/}
Basename of input line. This replacement string will be replaced by the input with the directory part removed.
{//}
Dirname of input line. This replacement string will be replaced by the directory part of the input line.
{/.}
Basename of input line without extension. This replacement string will be replaced by the input with the directory and extension part removed. It is a combination of {/} and {.}.
Check the manual for other options or string replacements.
1.2 Useful advice
When using parallel to work on one file (splitting it into blocks), the tool does not accept compressed files, either as input or as output. Remember to compress/decompress the file (if needed) outside the parallel command.
2. EXAMPLES
2.1 PARALLEL + GREP: Extract VCF positions from a list
We have two files: a large VCF file (12 GB) with all the variants of one individual, and a list of ~50,000 positions of interest in which to find matching variants.
One way to do it would be:
time fgrep -f [pos_list] -w [vcf] > [vcf_out]

real    0m46.259s
user    0m42.645s
sys     0m3.573s
It takes 46 seconds; not bad. But if we parallelize over the input file:
srun --cpus-per-task=6 time parallel --progress --pipepart --block 1G -a [vcf] -k -q fgrep -f [pos_list] -w > [vcf_out]

real    0m13.220s
user    1m6.558s
sys     0m8.984s
It only takes 13 seconds! Note that there are several ways to run this command: in this example we used srun requesting 6 CPUs, but we would obtain the same result by putting the command in a script and running it with sbatch, or by executing it in an interactive session (with “-c 6”).
2.2 PARALLEL + GAWK: Change read names in a fastq file
We want to change all the read names of a fastq file, keeping only the string before the first “/”. We could do something like this:
time gawk '{if (NR%4 == 1) {split($0,xx,"/"); print xx[1]} else {print $0}}' [fastq_in] > [fastq_out]

real    0m33.172s
user    0m31.771s
sys     0m1.204s
And now with parallel (inside an interactive session with 8 CPUs):
time parallel --progress --pipepart --block 1G -a [fastq_in] -k -q gawk '{if (NR%4 == 1) {split($0,xx,"/"); print xx[1]} else {print $0}}' > [fastq_out]

real    0m16.991s
user    0m32.095s
sys     0m11.746s
Again, it takes considerably less time with parallelization.
2.3 Parallelizing many commands
We want to execute a program for all the fastq files (96 in total) inside a specific folder, where each file sits in its own subfolder. The program only works if it is executed inside the folder where the fastq file is located, and it takes approximately 10 minutes per sample. We could define a function that changes to the directory and runs the program, like this:
function run_fx() {
    cd "$2"
    [program] pipeline -a case-sample -s "$1"
    cd ../../
}
export -f run_fx

ls fastq_to_analyse/*/*fq | parallel --progress -k run_fx {/} {//}
It takes 45 minutes to analyze the 96 files on a machine with 4 CPUs. Note that the function must be exported so that parallel can see it.
Here, {/} and {//} are very useful to send the basename and dirname (respectively) to the function. For instance:
Input: fastq_to_analyse/sample_1/sample_1.fastq
{/} -> sample_1.fastq
{//} -> fastq_to_analyse/sample_1