Large-scale bioinformatics experiments are usually composed of a set of data flows generated by a chain of activities (programs or services) that may be modeled as scientific workflows. Scientific Workflow Management Systems (SWfMS) orchestrate these workflows, controlling and monitoring the whole execution. Bioinformatics experiments commonly process very large datasets, so data parallelism is a common approach to increase performance and reduce overall execution time. However, most current SWfMS still lack support for parallel execution in high performance computing (HPC) environments. Additionally, keeping track of provenance data in distributed environments is still an open, yet important, problem. Recently, the Hydra middleware was proposed to bridge the gap between the SWfMS and the HPC environment by providing a transparent way for scientists to parallelize workflow executions while capturing distributed provenance. This paper analy...
Fábio Coutinho, Eduardo S. Ogasawara, Danie