Using GNU Parallel to speed up and simplify data analysis

GNU Parallel is one of the lesser-known Unix tools that can significantly speed up data analysis. Parallel runs commands in parallel, each in its own process. Its syntax is much easier to understand than that of xargs, and it lets you do many things that would otherwise require actual programming. Below are some examples that illustrate how parallel can be used for data analysis.

Fetching data from multiple domains

parallel wget ::: www.domain1.com/file1.zip www.domain1.com/file2.zip www.domain1.com/file3.zip 

This command downloads the files at the listed URLs in parallel. This is convenient if you want to download just a few files.

If you have a list of URLs to download in a file urls.txt, you can use the following command:

parallel -a urls.txt wget

Alternatively, you can supply the list of URLs through a pipe:

cat urls.txt | parallel wget

Running time-consuming analyses on multiple files

Let’s say you have a list of files that you want to run some analysis on. You have a fancy 2-CPU, 32-core machine, but your algorithm is not easily parallelizable (or you are just too lazy to make it work in multithreaded mode). You can often save significant time by processing several files at once in separate processes.

ls | parallel processing -i {} -o {.}.out

This command runs processing on every file in the current directory. {} is the input parameter, in our case the file name; {.} is the input parameter with the extension stripped. This way the output for each file ends up in filename.out.

You can also use Unix globbing to select files. The command below runs processing on all files with the .dat extension in the current directory.

parallel processing -i {} -o {.}.out ::: *.dat 

Split a huge file and process the smaller parts independently

If you have one very large file and you want to process every line, you can split it into multiple smaller files using the split command and then use parallel to process them in parallel.

#split file into chunks of 5000 lines each.
#output files will be named data_split_aa, data_split_ab, ...
split -l 5000 data data_split_

#process all files from split
parallel processing {} ::: data_split_*

Trying out multiple parameters

You can use parallel to iterate over multiple parameter sets, with no need to write nested loops!

[~/parallel]$ parallel processing {1} {2} {3} ::: A B C ::: 1 2 3 ::: X Y Z
A 1 Z
A 1 Y
...
C 3 Z

The command above runs processing on all combinations of {A, B, C}, {1, 2, 3} and {X, Y, Z}.

Use parallel to run multiple commands at the same time

One of the easiest use cases for parallel is running multiple different commands at the same time. Just put whatever you want to run in a single file and run parallel on it. It is similar to cat file_to_run | sh, but the commands run in parallel, so it can be much faster.

parallel < files_with_commands_to_run.sh

GNU Parallel is not part of most Linux distributions, but you can get it from the GNU Parallel website. More detailed documentation is available at http://www.gnu.org/software/parallel/man.html .
