Application-Centric Overlay Cloud Utilizing Inter-Cloud

Application-Centric Overlay Cloud Utilizing Inter-Cloud

Optimized and dedicated data analysis platforms are needed to fulfill the diverse requirements of big data applications. However, building an optimized platform is not easy. We propose the application-centric overlay cloud, which is an infrastructure to automatically build data analysis platforms customized for applications over multiple clouds.

Research Goal

Develop new infrastructure technology to automatically build data analysis platforms that are optimized for applications. We call this infrastructure the application-centric overlay cloud.
Build an inter-cloud testbed configured by production-ready computing and networking resources.
Apply the overlay cloud technology to genome sequencing (distributed data processing) and fluid acoustic analysis (coupled simulation).

Overview

NII group developed the following middleware.

Virtual Cloud Provider (VCP)

VCP is a middleware to build an application environment over multiple cloud providers and the Japanese academic backbone network SINET5. VCP uses overlay cloud and overlay network architectures to treats multiple real clouds (‘On-Premise provider and Real Cloud provider A-C’ in the figure below) as a single virtual cloud (‘Virtual Cloud Provider’).

VCP supports from a simple application environment on a single cloud provider (Application-3), to a complex application environment using multiple cloud providers (Application 1 and 2).
VCP leverages Docker containers for quick application deployment and application environment reproducibility (right hand figure shows the a compute instance architecture).
VCP supports both common and GPU-enabled instances. (blue boxes)
VCP supports the following cloud providers;
- Public Cloud: Amazon Web Services, Microsoft Azure, Oracle Cloud Infrastructure, Sakura Cloud
- Academic Cloud: Hokkaido-Univ. Server Service
- On-premise: VMware

VCP

VCP is used as the core software for NII’s GakuNin cloud-on-demand configuration service.

User of the service can build an application environment with the following steps.
- Log-in Jupyter Notebook Server
- Select a building template of the target application environment (a Jupyter Notebook file)
- Run
A Python-based development kit called VCP SDK and a monitoring tool are provided.
Jupyter Notebook-based templates for typical academic applications, such as moodle, Galaxy, and OpenHPC, are provided.

OCS

Dynamic Reconfiguration Framework

We propose a framework that adds and removes computing resources (BM or VM) during runtime. The main idea is that we can represent requirements of computing resources to be reconfigured as constraints on specifications of computing resources.

Reconf

It consists of two subsystems:

The application scheduler monitors the application state and sends the constraints for new computing resources (e.g., #CPU cores, memory size, and location of resources to handle privacy-sensitive data) to the resource allocator.
The resource allocator finds the resources that satisfy the constraints from the application scheduler and allocate them by using VCP.

We have developed a prototype for genome analysis workflows as described below.

An Ecosystem to Utilize Execution Records of Genome Analyses Workflows

Collecting workflow execution records such as execution time for each step is important for selecting appropriate computing resources. We have developed an ecosystem to collect and utilize workflow execution records including our reconfiguarion prototype.

ecosystem

We developed a prototype of our reconfiguration system based on Galaxy and ep3, a CWL-based workflow engine developed by our project. It collects and utilizes workflow metrics and container metrics.
We also developed DrillHawk, which is a visualizer of workflow metrics.
Our reconfiguration system and DrillHawk utilize container metrics and workflow metrics obtained by CWL-metrics as well as the metrics obtained by our prototype.

A Prototype for Genome Analysis Workflows

We developed a prototype based on Galaxy for genome analysis workflows. It consists of several modules: Galaxy, an application scheduler module (AS module in the figure), a resource allocator (RA in the figure), metrics server and VCP. Galaxy and an AS module behave as an application scheduler.

Reconf-workflow

Our prototype introduces two types of virtual workflow step: prepare job and reconf job for dynamic reconfiguration:

The prepare job step is invoked before executing a workflow and makes an allocation plan.
The reconf job step is invoked before executing each step and allocates computing resources according to the allocation plan.

These virtual steps interact with an AS module to make a plan and to allocate computing resources. An AS core is a reconfiguration algorithm in the AS module and is designed as an external program to easily replace with other reconfiguration algorithm. We integrated our prototype with the reconfiguration algorithm by Hokkaido University group. An AS module and RA can be integrated with ep3.

Metrics Collection Scheme in our Prototype

Our prototype collects two types of metrics: container metrics and workflow metrics.

Container metrics: Telegraf on the resource periodically collects container metrics such as CPU usage and memory usage.
Workflow metrics: After finishing workflow execution, an application scheduler generates workflow metrics such as execution time for each step.

metrics collection scheme

We design the metrics format of workflow metrics to be compatible with workflow metrics obtained by CWL-metrics by National Institute of Genetics group, which is a metrics collector for Common Workflow Language (CWL).

DrillHawk: A Visualizer of Workflow Metrics

DrillHawk enables us to take a drill-down approach in which we first check the list of collected workflow execution records, compare several execution records using workflow metrics, and analyze the specific execution records by using Kibana.

It first shows collected workflow records in the metrics server.
We can choose several workflow records and compare their execution times and monetary costs. In the figure, each bar represents an execution time for each workflow execution and each color represents a execution time for each step.
We can investigate more details with Kibana from the above comparing graph.