Hortonworks Customer
Arizona State University

Arizona State University (ASU) is the largest public university by enrollment in the United States, with more than 83,000 students and 3,300 faculty members.

ASU's charter, approved by the board of regents in 2014, is based on the "New American University" model created by ASU President Michael M. Crow. It defines ASU as "a comprehensive public research university, measured not by whom it excludes, but rather by whom it includes and how they succeed; advancing research and discovery of public value; and assuming fundamental responsibility for the economic, social, cultural and overall health of the communities it serves."

This innovative model is one of the reasons that US News & World Report named ASU as “the 2016 most innovative school in America” (just ahead of #2 Stanford and #3 MIT).

The Complex Adaptive Systems Initiative Sought a New Way to Integrate Genomics Data with Cancer Research

The Complex Adaptive Systems Initiative (CASI) is one of ASU’s flagship programs aligned with the New American University model. CASI’s research mission is to develop and promote a new type of science that embraces the complexity of natural systems.

In 2012, Dr. Kenneth Buetow joined ASU as CASI’s director of Computational Sciences and Informatics. Previously, Dr. Buetow served as the founding Director of the Center for Biomedical Informatics and Information Technology within the National Cancer Institute (NCI) at the National Institutes of Health (NIH). At the NCI, Dr. Buetow served a dual role as the director of the NCI Center for Bioinformatics and chief of the Laboratory of Population Genetics. His research focused on the genetic basis of cancer.

Dr. Buetow and his team had started exploratory work with Apache™ Hadoop® at the NCI, in recognition of the emerging Big Data trends. Buetow made the decision to transition to ASU because of his belief in the New American University model and in its trans-disciplinary approach to research that has a real-world impact on individuals, communities and the greater world.

When Dr. Buetow came to ASU, he focused on continuing the same line of research on precision medicine that he conducted at the NCI. “We start either with data that exist or data that we generate by simulation,” explained Buetow, “and then we use high-performance computation to try to find either novel patterns that may predispose one to developing cancer or to develop a better understanding of cancer’s architecture so we have better ways of intervening.”

Complex Systems Require Complex Analysis of the Relationships Between the Parts

Cancer is very complicated. It is not a single disease, but rather the result of the interplay between many moving parts.

Not only is cancer complicated, it is also complex. The whole of the system is more than the sum of its parts. For example, think of the pointillist painters who put individual dots of color on canvas. Only when you look at the dots from a distance do you understand what the painting portrays. ASU sought out Hadoop and Big Data approaches to both store more “dots” and to view the combination of dots in different ways to understand the complete picture.

For example, one area of focus is liver cancer. Obesity and diabetes are increasingly common causes of liver cancer, but when one looks into why populations become obese, it’s about much more than simple caloric dynamics. Obesity also has to do with differences in lifestyles, housing options and access to healthy food.

ASU’s CASI seeks to investigate how those parts come together with genomics in order to better understand and solve the complex problem of liver cancer. Solutions to such complex problems require storage of massive amounts of data and also powerful data processing tools.

Challenges Storing and Processing Genomics Data Constrained the Set of Possible Research Questions

Prior platforms for genomics cancer research limited both storage and processing, thus limiting the complexity of questions that investigators could ask and answer. Commonly referred to as “lamp-posting,” this scenario meant that cancer researchers could only “look where the light was”—even though the most interesting research questions had to do with discovering answers still “in the dark”.

As a result, researchers were trimming their research agendas based on the data processing that they thought they could perform with the existing data storage and compute architectures.

According to Dr. Buetow, “What we’ve been able to do already is find relationships that previously had not been described. The reason this was not practical to do in the past is that there’s literally millions of variants and tens of thousands of individual genes. It was just not computationally and/or storage practical to do all those possible combinations.”

Proprietary Platforms Limited Scholarly Collaboration and Slowed Research

ASU wanted next-generation cyber capabilities for its collaborative community. Ideally, each participating investigator would bring their own data, tools and expertise and share those with colleagues within ASU and beyond.

With an earlier methodology, a researcher would get a grant, set up an isolated server, and then encourage graduate students to work with their proprietary, closely guarded data. Even as the “New American University” model changed academic habits to encourage a culture of collaboration, that IT legacy of isolated data siloes hindered future collaboration.

Dr. Buetow recalled prior points in his career where information sharing within a department or with colleagues at other institutions required “heroics of complex interconnections” and massive amounts of technical work to filter these transfers through institutional firewalls. “It would involve days to weeks of work. You would have to learn different languages and command structures. It was pretty hard.”

This slowed sharing and collaboration, which in turn slowed time to insight.

HDP Stores Genomic Data, Making It Broadly Accessible At An Affordable Cost

When ASU turned to Hortonworks for a genomics “data lake”, CASI team members needed a connected platform to:
  • store and process huge amounts of data,

  • make that data and tools accessible to others within and outside of the university, and

  • do it all at a cost that wouldn’t escalate as genomics data grew to petabytes in their cluster.

Storing More Than Four Petabytes of Data, Processed at Interactive Speeds

The data in a single human genome includes approximately 20,000 genes, which if stored in a traditional platform would represent several hundred gigabytes. To better understand those genes, CASI stores molecular data from a variety of sources such as the Cancer Genome Atlas Project. Each one of those datasets represents tens to hundreds of terabytes. Pairing each of the roughly 20,000 genes with a specialized genomic characterization of one million individually variable DNA locations produces the equivalent of about 20 billion rows (20,000 × 1,000,000) of gene-variant combinations. CASI’s Hadoop cluster holds data on thousands of individuals.
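
The 20 billion figure is consistent with pairing each of the roughly 20,000 genes with each of the one million variable DNA locations. The following is a minimal sketch of that arithmetic, assuming every gene is paired with every location. The function names and the toy gene and variant labels are hypothetical, chosen for illustration only; they are not CASI's schema or code.

```python
from itertools import product

# Figures quoted in the case study.
GENES = 20_000
VARIANT_LOCATIONS = 1_000_000


def pair_count(n_genes: int, n_variants: int) -> int:
    """Rows produced if every gene is paired with every variant location."""
    return n_genes * n_variants


def toy_pairs(genes, variants):
    """Enumerate gene-variant pairs for a tiny, made-up example.

    At full scale this list would be far too large to hold in memory,
    which is why the case study relies on a distributed platform.
    """
    return list(product(genes, variants))


if __name__ == "__main__":
    # 20,000 genes x 1,000,000 locations = 20,000,000,000 rows
    print(f"{pair_count(GENES, VARIANT_LOCATIONS):,} rows")

    # Hypothetical labels: 2 genes x 3 variants = 6 pairs
    for gene, variant in toy_pairs(["geneA", "geneB"], ["v1", "v2", "v3"]):
        print(gene, variant)
```

Even at a few dozen bytes per row, a table of this size runs to hundreds of gigabytes or more before any measurements are attached, which helps explain why the text describes it as unapproachable with traditional technology.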

Now the CASI team uses Hortonworks Data Platform (HDP™) as a distributed infrastructure to calculate those 20 billion rows that reflect the output of CASI’s high-performance computing. Once they’ve generated the calculations, the HDP environment lets the team seamlessly query and assemble the resulting information.

The improvement over its previous architecture astounded the ASU team. “Your average database of 20 billion rows is simply unapproachable with traditional, standard technology,” said Dr. Buetow. “We firmly believe that this data-intensive compute environment has the capacity to transform biomedicine. With our Hadoop infrastructure, we can run data-intensive queries of these large-scale resources, and they return results in seconds. This is transformational.”

The HDP cluster at Arizona State University has accumulated more than a petabyte of genomic data from multiple studies involving over 500 individuals in each study. Researchers in five different teams access this genomic data lake to investigate urgent cancer research questions such as:

  • Why do some people develop cancer and other people don’t?

  • Why do some people respond to particular therapies while others do not?

  • How can we predict who should get particular therapies?

  • How do we develop next-generation therapies for those who don’t respond to the existing ones?

Access to such a huge, rich dataset, combined with highly efficient computational power, has transformed the kinds of questions that ASU researchers can ask.

“One could estimate that we have a thousand-fold more capacity to approach problems, but to be honest that would be on a low estimate. I think we have almost infinite capacity now to ask and answer the questions that we couldn’t approach before,” says Dr. Buetow.

Understanding the Architecture of Cancer with Visualization Tools Like Cytoscape

Over the last 5-6 years, researchers have focused on the interplay between 20,000 individual genes and the millions of variants in our DNA. Before the Hadoop infrastructure, it was impossible for scientists to undertake this kind of complex investigation.

Now ASU researchers rapidly comb the terabytes of cancer data to perform efficient analysis. One of the analytical approaches uses Cytoscape, an open-source software platform for visualizing complex interaction networks and biological pathways. HDP works hand-in-hand with Cytoscape to provide the raw output necessary to visualize a cancer network and to integrate it with gene expression profiles.

Seamlessly Interconnecting Within a Lab, Across ASU and Around the World

When ASU’s Research Computing department embarked on building a data-intensive environment, it teamed up with the university’s biomedical researchers to design the system according to their well-defined needs.

With HDP, the team avoided building complicated machine-to-machine interconnections after the fact. Instead, it wired those interconnections into the distributed framework from the very beginning.

Jay Etchings is ASU’s Director of Operations for Research Computing. He partners closely with Dr. Buetow to define and deliver the IT backbone that the team needs. The Next Generation Cyber Capability (NGCC) project combines Apache Hadoop with high-performance computing.

Here’s how Mr. Etchings describes their IT strategy:

“One of the features of the way we’ve set up our data-intensive environment is to have it be on the same fabric as utility computing and on the same fabric as traditional high performance computing. A user in our environment seamlessly goes between their sandbox (where they may be developing code) to the Hadoop space. Or, if they need to be running something in a more traditional high performance computing space, they can actually output that traditional HPC job into data frameworks that we could then process in the Hadoop environment.”

The Results: Faster Speed to Insight and Greater Scholarly Collaboration

Dr. Buetow quickly understood the power of CASI’s approach.

He told us, “My epiphany came when I ran a relatively complex SQL query against a table that had 20 billion rows in it. The query returned results in a minute or two. I was dumbfounded. Prior to using this environment, one never could comprehensively construct the networks. Basically, you couldn’t or didn’t ask those comprehensive questions. Now [with HDP] we have both the availability of data and the technical capability to analyze it. We are able to explore spaces where we simply couldn’t go before. It just wasn’t possible before having this technology. This has sped our time to insight infinitely in some cases. Some questions were not possible before, and now they return results in a day.”

Moreover, CASI’s strategy for Hadoop adoption aligns with President Obama’s “National Cancer Moonshot” policy that encourages “sharing data to generate new ideas and new breakthroughs”. Dr. Buetow describes the synergy between Hortonworks’ open-source approach and that spirit of data sharing: “Because of the open-source framework of the Hadoop that we’ve instantiated, it allows us to create a federated framework where others who wanted to be running comparable problems can set this up in their own institutions.”

Next Steps: Innovating Connected Data Platforms for the Benefit of Biomedicine

The CASI team intends to contribute code and tools that it is developing for cancer research back to the open-source community so that other researchers can take advantage of their groundbreaking advances.

Again, Dr. Buetow: “We’re hoping that Hortonworks’ focus and leadership in the open-source community permits the rapid dissemination to a much broader community of new features and new capabilities critical to the type of work that we do.”

The CASI team believes that their use of the 100% open-source HDP platform gives them a voice, through Hortonworks committers, to communicate genomics requirements to the Apache Software Foundation and the broader open-source community.

“The 100% open-source framework for Hortonworks permits us to leverage the much larger open-source community,” said Dr. Buetow.

About Research Computing at ASU

The Research Computing at ASU initiative represents a leading academic supercomputing center. It provides a high-performance computing environment (Big Iron HPC), a high-end data-intensive ecosystem (Big Data), a highly available 100-gigabit Internet2 connection, a software-defined Science DMZ, and the in-memory computation required for advanced data analysis and machine learning with Apache Spark. It is situated in an enterprise datacenter on campus within a 5,000-square-foot secured facility. The initiative’s support staff consists of computational scientists and programmers with expertise in many areas of scientific and parallel computing: big data analytics (in memory), custom software development, database engineering, and scientific visualization.