Index

1. Introduction to GNU Parallel

1.1 Interesting options

1.2 Interesting advice

2. Examples

2.1 PARALLEL + GREP: Extract VCF positions from a list

2.2 PARALLEL + GAWK: Change read names in a fastq file

2.3 Parallelizing many commands

 

 1 INTRODUCTION TO GNU PARALLEL

 

GNU Parallel is a tool for executing one or more commands in parallel on one or more nodes and/or CPUs. It can run several different commands at once, or split a large input file into blocks and process them in parallel.

In this tutorial we offer some useful examples of its use with real data in a scientific research context. If you want more information, you can read the official GNU Parallel tutorial or manual (or type “man parallel” in the shell).

 

You can use it on your local machine or on the Marvin cluster, in which case you should load the corresponding module:

 

module load parallel

 


 

1.1 Interesting options

 

--progress

Show progress of computations.

 

--pipepart

Spread input to jobs on stdin (standard input). Read a block of data from a physical file (given by -a) and give one block of data as input to one job.

 

The block size is determined by --block.

 

--block

Size of the block, in bytes, to read at a time.

 

--keep-order / -k

Keep the output in the same order as the input. Normally the output of a job is printed as soon as the job completes.

 

--quote / -q

Quote command. This will quote the command line so special characters are not interpreted by the shell. See the section QUOTING. Most people will never need this. Quoting is disabled by default.

 

--jobs

Number of jobslots on each machine. Run up to N jobs in parallel. 0 means as many as possible. Default is 100% which will run one job per CPU core on each machine.

 

{}

Input line. This replacement string will be replaced by a full line read from the input source. The input source is normally stdin (standard input), but can also be given with -a, :::, or ::::.

The replacement string {} can be changed with -I.

 

{.}

Input line without extension. This replacement string will be replaced by the input with the extension removed. If the input line contains a . after the last /, everything from the last . to the end of the string will be removed and {.} will be replaced with the remainder.

 

{/}

Basename of input line. This replacement string will be replaced by the input with the directory part removed.

 

{//}

Dirname of input line. This replacement string will be replaced by the dir of the input line.

 

{/.}

Basename of input line without extension. This replacement string will be replaced by the input with the directory and extension part removed. It is a combination of {/} and {.}.

 

Check the manual for other options or string replacements.

 


 

1.2 Interesting advice

 

When using parallel to work on a single file (splitting it into blocks), the tool does not accept compressed files, either as input or as output. Remember to compress/decompress the file (if needed) outside the parallel command.

 


 

 

2 EXAMPLES

 

2.1 PARALLEL + GREP: Extract VCF positions from a list

 

We have two files: one large VCF file (12 GB) with all the variants of one individual, and a list of positions of interest (~50,000) in which to find coincident variants.

One way to do it would be:

 

time fgrep -f [pos_list] -w [vcf] > [vcf_out]

real    0m46.259s

user    0m42.645s

sys    0m3.573s

 

It takes 46 seconds, not bad. But if we parallelize the input file:

 

 

srun --cpus-per-task=6 time parallel --progress --pipepart --block 1G -a [vcf] -k -q fgrep -f [pos_list] -w > [vcf_out]

real    0m13.220s

user    1m6.558s

sys    0m8.984s

 

 

It only takes 13 seconds! Note that there are several ways to run this command: in this example we used srun requesting 6 CPUs, but we would obtain the same result by putting the command inside a script and running it with sbatch, or by executing it in an interactive session (with “-c 6”).
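As an illustration of the sbatch alternative, a minimal batch script might look like the sketch below. The #SBATCH directives shown are assumptions; adapt the CPU count, job name, and any partition settings to your cluster, and replace the bracketed placeholders with real paths.

```shell
#!/bin/bash
#SBATCH --cpus-per-task=6     # one job slot per CPU requested
#SBATCH --job-name=vcf_grep   # job name is arbitrary

module load parallel
# same command as above; [vcf], [pos_list] and [vcf_out] are placeholders
parallel --progress --pipepart --block 1G -a [vcf] -k -q fgrep -f [pos_list] -w > [vcf_out]
```

Submit it with `sbatch script.sh`.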

 


 

2.2 PARALLEL + GAWK: Change read names in a fastq file

 

We want to change all the read names of a fastq, keeping only the string before the first “/”. We could do something like this:

 

 

time gawk '{if (NR%4 == 1) {split($0,xx,"/"); print xx[1] ;}  else {print $0} }' [fastq_in] > [fastq_out]

real    0m33.172s

user    0m31.771s

sys    0m1.204s

 

 

And now with parallel (inside an interactive session with 8 CPUs):

 

 

time parallel --progress --pipepart --block 1G -a [fastq_in] -k -q gawk '{if (NR%4 == 1) {split($0,xx,"/"); print xx[1] ;}  else {print $0} }'  > [fastq_out]

real    0m16.991s

user    0m32.095s

sys    0m11.746s

 

 

Again, it takes considerably less time with parallelization.

 


 

2.3 Parallelizing many commands

 

We want to execute a program for all the fastq files (96 in total) within a specific folder, with a different folder for each file. The program only works if it is executed inside the folder where the fastq files are located, and it takes approximately 10 minutes per sample. We could run a function that changes directory and runs the program, like this:

 

 

function run_fx() {
    cd "$2" || return                            # $2 = sample folder ({//})
    [program] pipeline -a case-sample -s "$1"    # $1 = fastq file name ({/})
    cd ../../
}
export -f run_fx

ls fastq_to_analyse/*/*fq | parallel --progress -k run_fx {/} {//}

 

 

It takes 45 minutes to analyze the 96 files on a machine with 4 CPUs. Note that the function must be exported so that parallel can see it.

Here {/} and {//} are very useful to pass the basename and dirname (respectively) to the function. For instance:

 

Input: fastq_to_analyse/sample_1/sample_1.fastq

    {/} -> sample_1.fastq

    {//} -> fastq_to_analyse/sample_1

 
