The data in a single human genome includes approximately 20,000 genes, which, if stored on a traditional platform, would represent several hundred gigabytes. To better understand those genes, CASI stores molecular data from a variety of sources, such as the Cancer Genome Atlas Project. Each of those datasets represents tens to hundreds of terabytes. Characterizing one million individually variable DNA locations across those roughly 20,000 genes produces about 20 billion gene-variant combinations (20,000 genes × 1,000,000 variant sites). CASI’s Hadoop cluster holds data on thousands of individuals.
Now the CASI team uses Hortonworks Data Platform (HDP™) as the distributed infrastructure for computing those 20 billion rows, which reflect the output of CASI’s high-performance computing. Once the calculations are generated, the HDP environment lets the team seamlessly query and assemble the resulting information.
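As a rough illustration, and not CASI’s actual schema, the sketch below assumes a hypothetical Hive table named gene_variants (one row per gene-variant combination) on an HDP cluster and shows how such a table could be queried interactively with Spark SQL:

```python
# Minimal sketch: querying a hypothetical gene_variants table on an HDP
# cluster with Spark SQL. Table and column names are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gene-variant-query")
         .enableHiveSupport()      # read Hive-managed tables on the cluster
         .getOrCreate())

# Count variants per gene on one chromosome and return the top 20.
top_genes = spark.sql("""
    SELECT gene_id, COUNT(*) AS variant_count
    FROM gene_variants
    WHERE chromosome = '17'
    GROUP BY gene_id
    ORDER BY variant_count DESC
    LIMIT 20
""")
top_genes.show()
```

Because Spark distributes the scan across the cluster, the same query pattern applies whether the table holds millions or billions of rows.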
The improvement over its previous architecture astounded the ASU team. “Your average database of 20 billion rows is simply unapproachable with traditional, standard technology,” said Dr. Buetow. “We firmly believe that this data-intensive compute environment has the capacity to transform biomedicine. With our Hadoop infrastructure, we can run data-intensive queries of these large-scale resources, and they return results in seconds. This is transformational.”
The HDP cluster at Arizona State University has accumulated more than a petabyte of genomic data from multiple studies, each involving over 500 individuals. Researchers on five different teams access this genomic data lake to investigate urgent cancer research questions.
Access to such a huge, rich dataset, combined with highly efficient computational power, has transformed the kinds of questions that ASU researchers can ask.
“One could estimate that we have a thousand-fold more capacity to approach problems, but to be honest that would be on a low estimate. I think we have almost infinite capacity now to ask and answer the questions that we couldn’t approach before,” says Dr. Buetow.
Over the last five to six years, researchers have focused on the interplay between 20,000 individual genes and the millions of variants in our DNA. Before the Hadoop infrastructure, this kind of complex investigation was impossible for scientists to undertake.
Now ASU researchers rapidly comb terabytes of cancer data to perform efficient analyses. One of their analytical approaches uses Cytoscape, an open-source software platform for visualizing complex interaction networks and biological pathways. HDP works hand in hand with Cytoscape, providing the raw output necessary to visualize a cancer network and to integrate it with gene expression profiles.
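A minimal sketch of what that hand-off could look like, assuming hypothetical Hive tables gene_interactions (gene_a, interaction_type, gene_b) and expression_profiles (gene_id, expression); neither the table names nor the pipeline below are CASI’s actual implementation. It writes an edge list in Cytoscape’s simple interaction format (SIF) plus a node-attribute table that can be loaded alongside it:

```python
# Sketch: export a gene-interaction network and per-gene expression
# attributes from Hive tables for import into Cytoscape. All table and
# column names here are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("cytoscape-export")
         .enableHiveSupport()
         .getOrCreate())

# SIF is plain text: "source <tab> interaction-type <tab> target".
edges = spark.table("gene_interactions").select("gene_a", "interaction_type", "gene_b")
edges.coalesce(1).write.option("sep", "\t").csv("/tmp/cancer_network_sif", mode="overwrite")

# Mean expression per gene, used as a node attribute for visual styling.
node_attrs = (spark.table("expression_profiles")
              .groupBy("gene_id")
              .agg(F.avg("expression").alias("mean_expression")))
node_attrs.coalesce(1).write \
    .option("sep", "\t").option("header", True) \
    .csv("/tmp/gene_expression_attrs", mode="overwrite")
```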
When ASU’s Research Computing department embarked on building a data-intensive environment, it teamed up with the university’s biomedical researchers to design the system around their well-defined needs.
Through HDP, the team avoided building complicated machine-to-machine interconnections by hand; the distributed framework wired those interconnections in from the very beginning.
Jay Etchings is ASU’s Director of Operations for Research Computing. He partners closely with Dr. Buetow to define and deliver the IT backbone that the team needs. The Next Generation Cyber Capability (NGCC) project combines Apache Hadoop with high-performance computing.
Here’s how Mr. Etchings describes their IT strategy:
“One of the features of the way we’ve set up our data-intensive environment is to have it be on the same fabric as utility computing and on the same fabric as traditional high performance computing. A user in our environment seamlessly goes between their sandbox (where they may be developing code) to the Hadoop space. Or, if they need to be running something in a more traditional high performance computing space, they can actually output that traditional HPC job into data frameworks that we could then process in the Hadoop environment.”
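The sketch below illustrates one way that hand-off could work, under the assumption that an HPC job has written tab-separated result files to a shared scratch path; the paths, file layout, and column handling are illustrative, not the NGCC’s actual configuration:

```python
# Sketch: ingest flat-file output from a traditional HPC job into the
# Hadoop environment as Parquet so it can be queried alongside other data.
# Paths and schema details are assumptions for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hpc-output-ingest").getOrCreate()

# Read the HPC job's tab-separated result files from shared storage...
hpc_results = (spark.read
               .option("sep", "\t")
               .option("header", True)
               .csv("file:///scratch/hpc_job_output/*.tsv"))

# ...and land them in HDFS as Parquet for downstream Hadoop processing.
hpc_results.write.parquet("hdfs:///data/ngcc/hpc_job_output", mode="overwrite")
```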
The Research Computing at ASU initiative represents a leading academic supercomputing center, providing a high-performance computing environment (Big Iron HPC), a high-end data-intensive ecosystem (Big Data), a highly available 100-gigabit Internet2 connection, a software-defined Science DMZ, and the in-memory computation required for advanced data analysis and machine learning with Apache Spark. It is situated in an enterprise datacenter on campus, within a 5,000-square-foot secured facility. The initiative’s support staff consists of computational scientists and programmers with expertise in many areas of scientific and parallel computing: in-memory big data analytics, custom software development, database engineering, and scientific visualization.
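To make the in-memory piece concrete, here is a hedged sketch of the kind of Spark analysis such an environment supports: caching a hypothetical feature table in cluster memory and fitting a simple classifier with Spark MLlib. The table, columns, and label are assumptions for illustration, not an ASU workload:

```python
# Sketch: in-memory analysis with Apache Spark. A hypothetical per-sample
# feature table is cached in cluster memory, then used to fit a simple
# logistic-regression model with MLlib. Names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = (SparkSession.builder
         .appName("in-memory-ml")
         .enableHiveSupport()
         .getOrCreate())

cohort = spark.table("variant_features").cache()   # keep the working set in memory
cohort.count()                                      # materialize the cache

# Assemble illustrative numeric columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["variant_burden", "expression_score"],
    outputCol="features")

model = LogisticRegression(labelCol="tumor_status", featuresCol="features") \
    .fit(assembler.transform(cohort))

print(model.coefficients)
```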