As bioinformatics datasets grow ever larger and analyses become increasingly complex, data-handling infrastructure must keep pace with developing technology. Large-scale bioinformatics analyses often require multiple existing software tools, each of which may be computationally intensive. In addition, the output of one tool often needs to be fed as input to other tools, thereby forming connected pipelines of data flow. Moreover, many iterations of these pipelines may need to be executed due to the large number of data items to be analysed.
Microbase makes use of available Grid or Cloud compute resources to enact distributed workflows consisting of multiple analysis software packages. These workflows can exploit the inherent parallelism of such environments. Using the Microbase framework, each software tool can obtain input files and store output result files in a straightforward manner, regardless of where the files physically reside. The system also detects and appropriately handles failures that occur while a workflow is executing.
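The pattern described above can be sketched as follows. Note that this is a minimal illustration, not the actual Microbase API: the names `FileStore`, `WorkflowStep` and `runWithRetry` are hypothetical stand-ins for location-transparent file access, a single pipeline stage, and bounded retry on failure.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the workflow-step pattern described above.
// FileStore, WorkflowStep and runWithRetry are illustrative names,
// not the actual Microbase API.
public class WorkflowSketch {

    /** Minimal location-transparent file store (stands in for real storage back-ends). */
    interface FileStore {
        String read(String path);
        void write(String path, String data);
    }

    /** A single analysis step: reads its input, produces its output. */
    interface WorkflowStep {
        void run(FileStore store) throws Exception;
    }

    /** Re-run a step a bounded number of times on failure (maxAttempts >= 1 assumed). */
    static void runWithRetry(WorkflowStep step, FileStore store, int maxAttempts)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                step.run(store);
                return; // success
            } catch (Exception e) {
                last = e; // transient failure: try again
            }
        }
        throw last;
    }

    /** In-memory store, used here so the sketch is self-contained. */
    static FileStore inMemoryStore() {
        Map<String, String> files = new ConcurrentHashMap<>();
        return new FileStore() {
            public String read(String path) { return files.get(path); }
            public void write(String path, String data) { files.put(path, data); }
        };
    }

    public static void main(String[] args) throws Exception {
        FileStore store = inMemoryStore();
        store.write("input/seq.fasta", ">seq1\nACGT");

        // Toy "analysis": record the input file's length at an output path.
        WorkflowStep step = s -> {
            String in = s.read("input/seq.fasta");
            s.write("output/length.txt", Integer.toString(in.length()));
        };
        runWithRetry(step, store, 3);
        System.out.println(store.read("output/length.txt")); // prints 10
    }
}
```

The design point being illustrated is that a step is written against an abstract store interface, so the same step code runs whether files live on a local disk or in remote Cloud storage.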
Microbase 2.0 is our current stable release. The architecture is almost entirely distributed (see Figure) and therefore does not require any centralised components. System status information, such as job queues, is stored in distributed shared memory provided by a Hazelcast data grid. File storage for workflow components is provided by the Microbase Filesystem (MBFS) API. A number of MBFS plugins are currently available that allow file storage and retrieval from a local filesystem or Amazon S3 Cloud storage. An experimental BitTorrent transfer mechanism is also provided for bandwidth-efficient transfers of large files to a large number of nodes simultaneously.
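The distributed job-queue pattern can be sketched against the standard `java.util.concurrent.BlockingQueue` interface, which Hazelcast's distributed `IQueue` implements: a worker written this way runs unchanged whether the queue is local, as in this self-contained sketch, or backed by the Hazelcast data grid (e.g. obtained via `hazelcastInstance.getQueue("jobs")`). The `Job` type and the queue name are illustrative, not the actual Microbase types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the distributed job-queue pattern. Hazelcast's IQueue
// implements BlockingQueue, so a worker coded against BlockingQueue
// works with either a local queue (used here) or a grid-backed one.
// "Job" is an illustrative name, not an actual Microbase type.
public class JobQueueSketch {

    /** A job description: which tool to run on which input file. */
    record Job(String tool, String inputPath) {}

    /** Drain jobs from a shared queue until a poll times out. */
    static List<String> work(BlockingQueue<Job> queue) throws InterruptedException {
        List<String> results = new ArrayList<>();
        Job job;
        while ((job = queue.poll(100, TimeUnit.MILLISECONDS)) != null) {
            // A real worker would invoke the analysis tool here.
            results.add(job.tool() + " processed " + job.inputPath());
        }
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        // Local stand-in; in Microbase this would be a Hazelcast-backed queue.
        BlockingQueue<Job> jobs = new LinkedBlockingQueue<>();
        jobs.put(new Job("blast", "genomes/a.fasta"));
        jobs.put(new Job("blast", "genomes/b.fasta"));
        for (String r : work(jobs)) System.out.println(r);
    }
}
```

Because the shared memory holds the queue, any node that joins the grid can take jobs from it, which is what removes the need for a centralised scheduler.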