Big Data Drives New Approaches to Doing Science

Joshua Peek, jegpeek@stsci.edu &
Marc Postman, postman@stsci.edu

Big data is everywhere, and astronomy is no exception. Our ability as a society to measure ever more about consumers, take ever more pictures, and send ever more messages has been mirrored in our ability to acquire ever more digital information about the cosmos. Experiments of the next decade, like the Large Synoptic Survey Telescope and the Square Kilometer Array, are slated to ingest an unprecedented volume of astronomical data.

The data we handle at the Institute are also arriving faster and in larger volumes: Webb, TESS, and WFIRST will collect much more data than our current observatories, and the Mikulski Archive for Space Telescopes at STScI is already home to the massive 2000-terabyte PanSTARRS database, recently brought into the Institute with a forklift.

Figure 1: STScI staff members (from left to right) Jeff Valenti, Andrew Fruchter, Rick White, and Armin Rest celebrate the safe arrival of the PanSTARRS storage hardware at the Institute.

Figure 2: This is what 2000 terabytes of data, loaded onto the PanSTARRS storage hardware, looked like as it arrived at the Institute.

A team of scientists, engineers, and IT experts at the Institute recently completed a study1 of the technological and conceptual impacts of Big Data in astronomy, focusing on our current and future data holdings and addressing how astronomers can realize the maximum scientific potential of this wealth of data. A key theme of the report is that we should see the raw volume and velocity of the data not as a limitation, but as a great opportunity for scientific discovery.

While astronomical data volumes have increased, the cost of storing, processing, and transmitting this information has decreased by much larger margins. It is thus up to us not only to build sophisticated computational systems that take advantage of faster processing capabilities, but also to implement clever methodologies and tools that provide us with a deeper understanding of the physics of the cosmos from these huge data sets, rather than simply shrinking the error bars on our existing measurements.

Figure 3: The PanSTARRS two-petabyte (2000-terabyte) storage array and its servers installed in the Institute's computing center.

The Institute’s big-data study centered on a number of science cases that push up against the limits of our current science computing capabilities. Some of these cases require extracting scientific knowledge from millions or billions of images of objects in the universe. These images hold huge amounts of untapped information about dark matter (via gravitational lensing), galaxy evolution, and star formation. It is difficult to gain detailed scientific understanding from these images, both because the data volume is so large and because we fundamentally do not know how best to classify and measure such images. One way forward is to harness Machine Learning and Deep Learning methodologies that use data-driven approaches to find the most information-rich aspects of the images.
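To make this concrete, here is a minimal sketch, in Python using TensorFlow's Keras interface, of the kind of deep-learning image classifier such data-driven approaches employ. The cutout images, morphological labels, and network architecture below are illustrative placeholders, not the study's actual pipeline or Institute data.

```python
# A minimal sketch of a deep-learning classifier for small galaxy cutouts.
# The data here are simulated placeholders; real cutouts would come from MAST.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical training set: 64x64-pixel single-band cutouts and labels
# (e.g., 0 = spiral, 1 = elliptical).
cutouts = np.random.rand(1000, 64, 64, 1).astype("float32")
labels = np.random.randint(0, 2, size=1000)

# A small convolutional network that learns its own image features,
# rather than relying on hand-crafted shape measurements.
model = tf.keras.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),  # two morphological classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(cutouts, labels, epochs=5, batch_size=32)
```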

Figure 4: The science cases discussed in the report. Each case requires some amount of disk space (x-axis), number of CPUs (y-axis), new methodologies (fill colors), and bandwidth (outline color). The arrows indicate possibilities for growth. Our existing capabilities and the integrated requirements to perform all these projects are also shown.

Data-intensive imaging tasks are not limited to objects with observed shapes. Future missions like WFIRST will sample tens of millions of stars in a single day, allowing us to understand the detailed history of star formation across many galaxies. Processing these resolved stellar populations will require both increased computational infrastructure and more deeply integrated databases.

Future missions will also survey billions of galaxies, whose redshifts we need to measure, or whose low-resolution spectra we need to characterize. The process of computing redshifts from imaging (a.k.a. photometric redshifts) for so many galaxies is computationally intensive, but also may require interaction from users, who will want to tune the redshift-determination procedure to meet specific scientific needs. Spectral characterization will also need to be optimized and automated. Developing a “science as a service” approach, in which users within and beyond the Institute can run specific computationally intensive tasks through application programming interfaces, became a central proposal of the report.
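As an illustration of the “science as a service” idea, the sketch below shows what submitting a photometric-redshift job through a programmatic interface might look like. The endpoint URL, job parameters, and response fields are hypothetical illustrations, not an existing Institute API.

```python
# Hypothetical "science as a service" call: submitting a photometric-redshift
# job to a remote service over HTTP. The endpoint, job parameters, and
# response fields below are illustrative only, not an existing STScI API.
import requests

SERVICE_URL = "https://example.stsci.edu/api/v1/photoz"  # hypothetical endpoint

job = {
    "catalog": "my_galaxy_catalog.fits",   # user-supplied photometric catalog
    "bands": ["g", "r", "i", "z", "y"],    # filters used in the fit
    "template_set": "default",             # user-tunable choice of templates
    "prior": "magnitude",                  # user-tunable redshift prior
}

# Submit the job; a real service would authenticate the user and queue the
# computation on shared Institute infrastructure.
response = requests.post(SERVICE_URL, json=job, timeout=30)
response.raise_for_status()
job_id = response.json()["job_id"]

# Retrieve per-galaxy redshift estimates and uncertainties once the job is done.
result = requests.get(f"{SERVICE_URL}/jobs/{job_id}", timeout=30)
print(result.json())
```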

Time-domain observations may also require significant computational resources, as well as new methodologies. For example, the GALEX gPhoton database, which holds over 1 trillion records of time-tagged photon arrival events, could allow us to find all sorts of new variable phenomena, but doing so requires immense processor power in a distributed and integrated computing environment to perform many searches simultaneously. Light echoes (reflections of supernova explosions bouncing through our Galaxy) require not only access to petabyte-scale image databases, advanced image-classification algorithms, and light-ray tracing over the time domain, but may also require serving imagery to citizen scientists for inspection, placing higher demands on our network bandwidth capacity.
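To give a flavor of such a time-domain search, here is a minimal sketch that bins time-tagged photon arrival events into a light curve and flags significant excursions. The event list is simulated; a real search would draw events from the gPhoton database and run many such analyses in parallel across distributed hardware.

```python
# A minimal sketch of a time-domain variability search: bin time-tagged
# photon arrival events into a light curve and flag significant excursions.
# The event list here is simulated, not drawn from the gPhoton database.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical photon arrival times (seconds) for one source: a steady
# rate plus a brief flare injected near t = 600 s.
steady = rng.uniform(0, 1200, size=6000)
flare = rng.normal(600, 5, size=500)
arrival_times = np.sort(np.concatenate([steady, flare]))

# Bin the events into a light curve with 10-second resolution.
bin_width = 10.0
bins = np.arange(0, 1200 + bin_width, bin_width)
counts, _ = np.histogram(arrival_times, bins=bins)

# Flag bins that exceed the mean rate by more than 5 sigma,
# assuming Poisson counting statistics.
mean_rate = counts.mean()
significance = (counts - mean_rate) / np.sqrt(mean_rate)
flagged = bins[:-1][significance > 5]
print(f"Candidate variability in bins starting at t = {flagged} s")
```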

All of these advanced scientific investigations will require not only silicon and copper, but also a rethinking of data itself as an Institute priority. One of the key recommendations of the report is that the Institute should establish a Data Science Mission Office (DSMO), elevating our astronomical archives (MAST) and computational infrastructure to the same level as Webb and Hubble. The Institute has moved to establish the DSMO this year, and with it, to affirm our commitment to exploring the universe as much with our computers, algorithms, and data as with our telescopes.


1 The full report is available at http://archive.stsci.edu/reports/BigDataSDTReport_Final.pdf