Making data democratic

This story originally appeared on the website of the College of Engineering.

The National Science Foundation (NSF) has awarded $5.6 million to a team of researchers led by School of Computing professor Valerio Pascucci, who also directs the Center for Extreme Data Management in the College of Engineering. The project will build the critical infrastructure needed to connect large-scale experimental and computational facilities and to bring more researchers into the data-driven sciences.

With the pilot project, called the National Science Data Fabric (NSDF), the team will deploy the first infrastructure capable of bridging the gap between massive scientific data sources, including state-of-the-art physics laboratories generating mountains of data, and the computing resources that can process their results. Those resources include Internet2 network connectivity and an extensive range of high-performance computing facilities and commercial cloud resources around the nation.

The figure shows the sites involved in the proposed NSDF pilot testbed: the five main development sites and three Minority Serving Institutions (MSIs), with the computing environments at each campus, as well as the Texas Advanced Computing Center (TACC) and the Massachusetts Green High Performance Computing Center (MGHPCC). Data sources include the Cornell High Energy Synchrotron Source (CHESS), the IceCube facility and the XENON dark matter experiment. The sites are connected by a high-speed network backbone provided by Internet2 and interoperate with OSG StashCaches and other resources.

“By democratizing data access for large-scale scientific investigations, NSDF will lower the cost of entry for scientists who want to participate in cutting-edge research,” Pascucci said. “Piloting this innovative cyberinfrastructure component will connect a much broader community of researchers with massive data sources that only a selected few can manage today.”

Pascucci says progress toward technological advances that benefit society will require a cyberinfrastructure that enables high-speed access to the data generated by large experimental facilities around the world. He points to several centers generating data that will benefit from the project: the IceCube neutrino detector at the South Pole, the XENONnT dark matter detector in Italy and the Cornell High Energy Synchrotron Source (CHESS).

Students will be able to work with data streamed directly from high-energy synchrotron sources, he added, and institutions will be able to innovate in education, workforce training and research. With the infrastructure in place, Pascucci predicts, scientific discovery will accelerate.

Pascucci’s team includes four co-principal investigators (co-PIs): Frank Würthwein, interim director of the San Diego Supercomputer Center; Michela Taufer, a professor at the University of Tennessee, Knoxville; Alex Szalay, a professor of physics and astronomy and of computer science at the Johns Hopkins University; and John Allison, a professor of materials science and engineering at the University of Michigan at Ann Arbor. The team will partner with NSF-funded efforts such as FABRIC and the Open Science Grid (OSG), and with industry partners including IBM Cloud and Google Cloud.

“The National Science Data Fabric is an effort that aims to transform the end-to-end data management lifecycle with advances in storage and network connections; deployment of scalable tools for data processing, analytics and visualization; and support for security and trustworthiness,” said Würthwein.