Automatically Building Systems for Dedicated Data Analysis Utilizing Virtual Clouds

Optimized, dedicated data analysis platforms are needed to fulfill the diverse requirements of big data applications; however, building such a platform is not easy. We propose the Virtual Cloud Provider (VCP), an infrastructure that automatically builds application-customized data analysis platforms over multiple clouds. We have also developed an ecosystem for genome analysis workflows that utilizes workflow execution records.


Virtual Cloud Provider (VCP)

VCP is middleware for building an application environment over multiple cloud providers and SINET6, the Japanese academic backbone network. VCP uses overlay-cloud and overlay-network architectures to treat multiple real clouds ('On-Premise provider' and 'Real Cloud provider A-C' in the figure below) as a single virtual cloud ('Virtual Cloud Provider').

[Figure: VCP overlay cloud architecture]
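
As a rough illustration of the overlay-cloud idea, the sketch below shows a single interface that hides provider-specific details so that nodes on different clouds can be managed uniformly. This is a hypothetical sketch, not the actual VCP API; all class and method names are assumptions.

```python
from abc import ABC, abstractmethod

class CloudProvider(ABC):
    """Common interface that hides provider-specific APIs (hypothetical)."""

    @abstractmethod
    def create_node(self, spec: dict) -> str:
        """Create a compute node and return its overlay-network address."""

class RealCloudProvider(CloudProvider):
    def create_node(self, spec: dict) -> str:
        # A real implementation would call the cloud provider's API here
        # and attach the new node to the overlay network.
        return "10.0.0.11"

class OnPremiseProvider(CloudProvider):
    def create_node(self, spec: dict) -> str:
        # A real implementation would boot a BM or VM node on premises.
        return "10.0.0.12"

class VirtualCloudProvider:
    """Treats all registered real clouds as a single virtual cloud."""

    def __init__(self, providers: dict):
        self.providers = providers

    def create_node(self, name: str, spec: dict) -> str:
        return self.providers[name].create_node(spec)

vcp = VirtualCloudProvider({
    "on-premise": OnPremiseProvider(),
    "cloud-a": RealCloudProvider(),
})
print(vcp.create_node("cloud-a", {"vcpus": 8, "memory_gb": 32}))
```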

VCP is used as the core software of NII's GakuNin cloud on-demand configuration service (OCS).

[Figure: GakuNin cloud on-demand configuration service (OCS)]

Dynamic Reconfiguration Framework

We propose a framework that adds and removes computing resources, either bare-metal (BM) or virtual machines (VM), during runtime. The main idea is that the requirements for the computing resources to be reconfigured can be represented as constraints on the specifications of those resources.

[Figure: Dynamic reconfiguration framework]
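
A minimal sketch of this idea follows. The field names are assumptions, not the framework's actual interface: requirements are expressed as predicates over resource specifications, and only the specifications satisfying every constraint are candidates for reconfiguration.

```python
from dataclasses import dataclass

@dataclass
class ResourceSpec:
    provider: str        # e.g. "cloud-a" or "on-premise" (hypothetical)
    kind: str            # "VM" or "BM"
    vcpus: int
    memory_gb: int
    price_per_hour: float

# Requirements are plain predicates over a specification.
constraints = [
    lambda s: s.vcpus >= 8,
    lambda s: s.memory_gb >= 32,
    lambda s: s.price_per_hour <= 1.0,
]

def eligible(specs, constraints):
    """Return the specifications that satisfy every constraint."""
    return [s for s in specs if all(c(s) for c in constraints)]

candidates = [
    ResourceSpec("cloud-a", "VM", 16, 64, 0.8),
    ResourceSpec("on-premise", "BM", 32, 256, 0.0),
    ResourceSpec("cloud-a", "VM", 4, 16, 0.2),
]

for spec in eligible(candidates, constraints):
    print(spec)
```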

The framework consists of two subsystems: an application scheduler, which plans when and what to reconfigure, and a resource allocator, which adds and removes the computing resources accordingly.

We have developed a prototype for genome analysis workflows, as described below.

Ecosystem to Utilize Execution Records of Genome Analysis Workflows

Collecting workflow execution records, such as the execution time of each step, is important for selecting appropriate computing resources. We have developed an ecosystem, which includes our reconfiguration prototype, to collect and utilize workflow execution records.

[Figure: Ecosystem for utilizing workflow execution records]
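
As an illustration, a workflow execution record might look like the following. The layout and field names are hypothetical, not the ecosystem's actual schema; they capture the per-step timing information the text says is needed for resource selection.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class StepRecord:
    """One record per workflow step (hypothetical layout)."""
    workflow_id: str
    step_name: str
    started_at: str          # ISO 8601 timestamp
    execution_time_sec: float
    instance_type: str       # the resource the step ran on

record = StepRecord(
    workflow_id="wf-0001",
    step_name="bwa-mem",
    started_at="2020-01-15T09:30:00Z",
    execution_time_sec=812.4,
    instance_type="m5.2xlarge",
)

# Records serialized as JSON can be stored in a search engine and
# queried later, e.g. by a visualizer such as DrillHawk.
print(json.dumps(asdict(record), indent=2))
```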

A Prototype for Genome Analysis Workflows

We developed a prototype for genome analysis workflows based on Galaxy. It consists of several modules: Galaxy, an application scheduler module (the AS module in the figure), a resource allocator (RA in the figure), a metrics server, and VCP. Galaxy and the AS module together behave as the application scheduler.

[Figure: Prototype architecture for genome analysis workflows]

Our prototype introduces two types of virtual workflow steps for dynamic reconfiguration: prepare jobs and reconf jobs.

These virtual steps interact with the AS module to make a reconfiguration plan and to allocate computing resources. The AS core, the reconfiguration algorithm inside the AS module, is designed as an external program so that it can easily be replaced with other reconfiguration algorithms. We integrated our prototype with the reconfiguration algorithm developed by the Hokkaido University group. The AS module and RA can also be integrated with ep3, a workflow engine for the Common Workflow Language.
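
Because the AS core is an external program, swapping algorithms amounts to pointing the AS module at a different executable. The sketch below assumes a hypothetical JSON-on-stdin/stdout contract and an executable named `./as_core`; the prototype's actual interface may differ.

```python
import json
import subprocess

def run_as_core(executable: str, state: dict) -> dict:
    """Invoke a reconfiguration algorithm packaged as an external program."""
    proc = subprocess.run(
        [executable],
        input=json.dumps(state),   # current state and requirements on stdin
        capture_output=True,
        text=True,
        check=True,
    )
    # The plan format is an assumption, e.g. {"add": [...], "remove": [...]}.
    return json.loads(proc.stdout)

plan = run_as_core("./as_core", {
    "running_nodes": [{"type": "m5.xlarge", "count": 2}],
    "next_step": {"name": "variant-call", "min_memory_gb": 64},
})
print(plan)
```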

Metrics Collection Scheme in Our Prototype

Our prototype collects two types of metrics: container metrics and workflow metrics.

[Figure: Metrics collection scheme]
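
For the container side, one simple collection scheme is to snapshot per-container CPU and memory usage at an interval. The sketch below uses the standard `docker stats` CLI; the collector actually used in our prototype may differ.

```python
import subprocess

def sample_container_metrics():
    """Take one snapshot of per-container CPU and memory usage."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format",
         "{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        name, cpu, mem = line.split("\t")
        yield {"container": name, "cpu": cpu, "mem": mem}

for metric in sample_container_metrics():
    print(metric)
```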

We designed the format of our workflow metrics to be compatible with the metrics obtained by CWL-metrics, a metrics collector for the Common Workflow Language (CWL) developed by the National Institute of Genetics group.

DrillHawk: A Visualizer of Workflow Metrics

DrillHawk enables a drill-down approach: we first check the list of collected workflow execution records, then compare several records using their workflow metrics, and finally analyze specific execution records in detail using Kibana.
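
Assuming the records are stored in an Elasticsearch index (the store that Kibana also reads), the first two drill-down steps can be reproduced programmatically. The endpoint, index name, and field names below are hypothetical.

```python
import requests

ES_URL = "http://localhost:9200"   # assumed Elasticsearch endpoint
INDEX = "workflow-metrics"         # hypothetical index name

# Fetch every step record of one workflow run, oldest first.
query = {
    "query": {"match": {"workflow_id": "wf-0001"}},
    "sort": [{"started_at": {"order": "asc"}}],
}
resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    rec = hit["_source"]
    print(rec["step_name"], rec["execution_time_sec"])
```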