Musings on R

Load Balanced Parallelization with snowfall

Published by Xavier on 2013-03-05

For some reason, a few months ago I failed to notice the best way to run a parallelized version of lapply with the snowfall package.

While developing our pipeline prototype for Exome Variant Analysis (https://launchpad.net/eva), we parallelized our lapply calls with the function sfLapply.

However, I've just read the nice tutorial from Knaus & Porzelius (2009), in which they show a diagram that clarifies why sfClusterApplyLB can be the better choice when you want a load-balanced version of your own code:

[Diagram from Knaus & Porzelius (2009); image not included]
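To make the idea behind that diagram concrete, here is a minimal sketch in plain R (no cluster involved) that simulates the two scheduling strategies. It assumes that sfLapply hands each worker one contiguous chunk of the index list up front, while sfClusterApplyLB gives the next task to whichever worker becomes free first. The function names and the task durations are made up for illustration.

```r
# Total wall time when the tasks are cut into contiguous, equal-sized chunks
# up front (one chunk per worker), regardless of how long each task takes.
static_makespan <- function(durations, n_workers) {
  chunk_id <- ceiling(seq_along(durations) * n_workers / length(durations))
  max(tapply(durations, chunk_id, sum))
}

# Total wall time when each task goes to whichever worker is free first.
balanced_makespan <- function(durations, n_workers) {
  busy_until <- numeric(n_workers)      # time at which each worker is free
  for (d in durations) {
    w <- which.min(busy_until)          # first worker to become available
    busy_until[w] <- busy_until[w] + d
  }
  max(busy_until)
}

# Hypothetical workload: two big samples followed by six small ones.
durations <- c(10, 9, 1, 1, 1, 1, 1, 1)

static_makespan(durations, 2)    # 21: both big samples land in the same chunk
balanced_makespan(durations, 2)  # 13: the small samples fill the gaps
```

When all tasks take about the same time, both strategies end up close to each other, which fits with seeing no difference on small, uniform test files.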

Therefore, we changed the critical line, easily, from:

  # ...
  start3 <- Sys.time()
  result2 <- sfLapply(seq_along(params$file_list), wrapper2.parallelizable.per.sample)
  duration <- Sys.time() - start3
  # ...

to:

  # ...
  start3 <- Sys.time()
  result2 <- sfClusterApplyLB(seq_along(params$file_list), wrapper2.parallelizable.per.sample)
  duration <- Sys.time() - start3
  # ...

(As you can see, we are parallelizing per sample here, not per process within each sample. One thing at a time, since we only have a few spare CPUs in our servers and we are not running the pipeline on a real cluster yet.)
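For readers who want to try the per-sample pattern outside of our pipeline, here is a minimal, self-contained sketch. The file names and the function process_sample are hypothetical stand-ins (the real wrapper2.parallelizable.per.sample is not shown in this post), and it assumes the snowfall package is installed and that two local worker processes can be started.

```r
library(snowfall)

# Made-up sample files; in the pipeline this would be params$file_list.
file_list <- c("sampleA.vcf", "sampleB.vcf", "sampleC.vcf")

# Hypothetical stand-in for the per-sample wrapper: it receives the index
# of a sample and returns a short message instead of doing real work.
process_sample <- function(i) {
  paste("processed", file_list[i])
}

sfInit(parallel = TRUE, cpus = 2)   # start two local worker processes
sfExport("file_list")               # make file_list visible on the workers

start <- Sys.time()
result_lb <- sfClusterApplyLB(seq_along(file_list), process_sample)
duration <- Sys.time() - start

sfStop()                            # shut the workers down again
```

The results come back as a list in the same order as the input indices, so swapping sfLapply for sfClusterApplyLB should not require changes to the code that consumes result_lb.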

With our test datasets (a couple of small files for debugging purposes), we cannot notice any great difference. But we'll be glad to check the potential improvement (let's hope so) soon with real-case scenarios, in which some samples are way bigger than others...

My todo list has a new entry for another interesting function, "sfClusterApplySR", which is also explained in the standard snowfall vignettes:

Another helpful function for long running clusters is sfClusterApplySR, which saves intermediate results after processing n-indices (where n is the amount of CPUs). If it is likely you have to interrupt your program (probably because of server maintenance) you can start using sfClusterApplySR and restart your program without the results produced up to the shutdown time.

We also hope to find some time in the coming months to test a similar parallelization process with the "parallel" package (even if I have no clue yet whether it offers an equivalent approach for load-balanced parallelization).
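As a starting point for that test: the parallel package exports clusterApplyLB and parLapplyLB, which look like the closest counterparts to sfClusterApplyLB. The sketch below is untested against our pipeline; the workload sizes and the function slow_square are invented for illustration, and it assumes two local worker processes can be started.

```r
library(parallel)

# Made-up workload sizes: one "big" sample and a few small ones.
sample_sizes <- c(5, 1, 1, 3)

# Hypothetical task: bigger inputs take longer to process.
slow_square <- function(n) {
  Sys.sleep(n / 100)
  n^2
}

cl <- makeCluster(2)   # two local worker processes

# One task at a time goes to whichever worker is free.
result_lb <- clusterApplyLB(cl, sample_sizes, slow_square)

# lapply-style interface with load balancing.
result_lb2 <- parLapplyLB(cl, sample_sizes, slow_square)

stopCluster(cl)        # release the workers
```

Both calls return a list in input order, so they could be dropped in where sfClusterApplyLB is used today, with the cluster object passed explicitly as the first argument.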

Some day...