Status Report

Pre-feasibility Study of Astronomical Data Archive Systems Powered by Public Cloud Computing and Hadoop Hive

By SpaceRef Editor
November 21, 2016

Satoshi Eguchi
(Submitted on 18 Nov 2016)

The size of astronomical observational data is increasing yearly. For example, while the Atacama Large Millimeter/submillimeter Array (ALMA) is expected to generate 200 TB of raw data every year, the Large Synoptic Survey Telescope (LSST) is estimated to produce 15 TB of raw data every night. Since computing power is growing much more slowly than the volume of astronomical data, providing high performance computing (HPC) resources together with scientific data will become common in the next decade. However, the installation and maintenance costs of an HPC system can be burdensome for the provider. I consider public cloud computing as an alternative way to obtain sufficient computing resources inexpensively. I build Hadoop and Hive clusters on a virtual private server (VPS) service and on Amazon Elastic MapReduce (EMR), and measure their performance. The VPS cluster behaves differently from day to day, while the EMR clusters are relatively stable. Since partitioning is essential for Hive, several partitioning algorithms are evaluated. In this paper, I report the results of the benchmarks and the performance optimizations in a cloud computing environment.
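The abstract notes that partitioning is essential for Hive. The intuition can be sketched in a few lines of Python (a hypothetical illustration, not code from the paper): Hive stores each partition of a table in its own directory, so a query filtered on the partition key reads only the matching partition, whereas an unpartitioned table must be scanned in full.

```python
# Hypothetical illustration of Hive partition pruning: the records,
# field names, and partition key (observation date) are invented for
# this sketch, not taken from the paper.
from collections import defaultdict

# Toy observation records: (observation_date, object_id, flux)
records = [
    ("2016-11-01", "obj1", 1.2),
    ("2016-11-01", "obj2", 0.8),
    ("2016-11-02", "obj3", 2.1),
    ("2016-11-03", "obj4", 0.5),
]

def scan_full(rows, date):
    """Unpartitioned table: every query scans all rows."""
    scanned = 0
    hits = []
    for row in rows:
        scanned += 1
        if row[0] == date:
            hits.append(row)
    return hits, scanned

# Partitioned table: rows grouped by date, analogous to Hive keeping
# one directory per partition value.
partitions = defaultdict(list)
for row in records:
    partitions[row[0]].append(row)

def scan_partitioned(parts, date):
    """Partitioned table: only the matching partition is read."""
    rows = parts.get(date, [])
    return rows, len(rows)

hits_full, n_full = scan_full(records, "2016-11-01")
hits_part, n_part = scan_partitioned(partitions, "2016-11-01")
assert hits_full == hits_part  # same query result either way
assert n_part < n_full         # but far fewer rows scanned
```

The benefit grows with table size: with billions of rows spread over thousands of partitions, pruning can cut the scanned volume by orders of magnitude, which is why the choice of partitioning algorithm matters for the benchmarks reported in the paper.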

Comments: 4 pages, 2 figures, proceedings of the Astronomical Data Analysis Software and Systems Conference XXVI
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM)
Cite as: arXiv:1611.06039 [astro-ph.IM] (or arXiv:1611.06039v1 [astro-ph.IM] for this version)
Submission history
From: Satoshi Eguchi
[v1] Fri, 18 Nov 2016 10:50:43 GMT (52kb)
https://arxiv.org/abs/1611.06039 

SpaceRef staff editor.