Big Data Systems Are Making A Difference In The Fight Against Cancer

Jan 17 2014, 5:21pm CST


By Ben Lorica

As open source big data tools begin to mature, data engineers and data scientists will have many opportunities to use them to “work on stuff that matters.” Along those lines, computational biology and medicine are areas where skilled data professionals are already beginning to make an impact. I recently came across a compelling open source project from UC Berkeley’s AMPLab: ADAM, a processing engine and set of formats for genomics data.

Second-generation sequencing machines produce more detailed and thus much larger files for analysis (a 250+ GB file for each person). Existing data formats and tools are optimized for single-server processing and do not easily scale out. ADAM uses distributed computing tools and techniques to speed up key stages of the variant processing pipeline (including sorting and deduping).
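To make the sorting and deduping stages concrete, here is a minimal single-machine sketch in Python. It is not ADAM's code: the `Read` fields, the function names, and the duplicate rule (same contig and start position, keep the highest quality) are simplifying assumptions made for illustration. The point is only that reads can be partitioned by contig so each partition is sorted and deduplicated independently, which is the property that lets the work be spread across many nodes.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, heavily simplified aligned read (not ADAM's schema)."""
    name: str
    contig: str   # reference sequence the read aligned to
    start: int    # alignment start position on that contig
    quality: int  # a single summary quality score


def partition_by_contig(reads):
    """Group reads by contig; each group could go to a different worker."""
    parts = defaultdict(list)
    for read in reads:
        parts[read.contig].append(read)
    return parts


def sort_and_dedupe(partition):
    """Keep the best-quality read per start position, sorted by position.

    Assumption: two reads are duplicates if they share a start position.
    Real pipelines apply more detailed rules than this.
    """
    best = {}
    for read in partition:
        current = best.get(read.start)
        if current is None or read.quality > current.quality:
            best[read.start] = read
    return sorted(best.values(), key=lambda r: r.start)


def process(reads):
    """Run every partition independently, then concatenate in contig order."""
    parts = partition_by_contig(reads)
    result = []
    for contig in sorted(parts):
        result.extend(sort_and_dedupe(parts[contig]))
    return result
```

Because `sort_and_dedupe` only ever looks at one partition, the loop in `process` could in principle be replaced by a parallel map over partitions on a cluster.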

Very early on, the designers of ADAM realized that a well-designed data schema (that specifies the representation of data when it is accessed) was key to having a system that could leverage existing big data tools. The ADAM format uses the Apache Avro data serialization system and comes with a human-readable schema that can be accessed using many programming languages (including C/C++/C#, Java/Scala, PHP, Python, Ruby). ADAM also includes a data format/access API implemented on top of Apache Avro and Parquet, and a data transformation API implemented on top of Apache Spark. Because it’s built with widely adopted tools, ADAM users can leverage components of the Hadoop (Impala, Hive, MapReduce) and BDAS (Shark, Spark, GraphX, MLbase) stacks for interactive and advanced analytics.
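As a rough illustration of what a human-readable, language-neutral schema looks like, the sketch below writes a record schema in the JSON style that Avro schemas use and checks Python dictionaries against it. The record name and fields are hypothetical and are not taken from ADAM's actual schema, and the small hand-written checker stands in for a real Avro library, which is not used here.

```python
import json

# Hypothetical schema in Avro's JSON style; field names are illustrative only.
READ_SCHEMA_JSON = """
{
  "type": "record",
  "name": "Read",
  "fields": [
    {"name": "contig",   "type": "string"},
    {"name": "start",    "type": "long"},
    {"name": "mapq",     "type": "int"},
    {"name": "sequence", "type": "string"}
  ]
}
"""

# Mapping from a few Avro primitive type names to Python types.
PRIMITIVES = {
    "string": str,
    "int": int,
    "long": int,
    "boolean": bool,
    "double": float,
}


def load_schema(text):
    """Parse the JSON schema text; this sketch handles record schemas only."""
    schema = json.loads(text)
    if schema.get("type") != "record":
        raise ValueError("only record schemas are handled in this sketch")
    return schema


def conforms(record, schema):
    """Return True if the dict has exactly the schema's fields and types."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    for field in fields:
        expected = PRIMITIVES[field["type"]]
        value = record[field["name"]]
        # bool is a subclass of int in Python, so reject it for int/long.
        if expected is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
    return True
```

Since the schema is plain JSON, the same text could be read by a program in any of the languages listed above, which is what makes a shared schema useful across tools.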

Although active development only started in September 2013, early results indicate that distributed computing tools and techniques lead to substantial speedups. Two recent tests ran on different computing clusters: Amazon EC2 and a cluster at the Icahn School of Medicine at Mt. Sinai(1). The combination of sorting and deduping took 38 hours using existing tools, but runs in less than an hour on a 32-node ADAM cluster.

Computational results like the ones above are drawing the attention of the science community: the AMPLab recently joined an Oregon Health & Science University (OHSU) research initiative to BeatAML (acute myeloid leukemia).

How to help

ADAM is a new project with a small codebase (11,000 lines of code). If you’re a big data hacker looking for a high-impact project to work on, consider contributing to the development of ADAM. Components are developed under an Apache License, so your contributions benefit the open source community. For details on how to contribute, contact Matt Massie, lead developer of ADAM.


(0) This post is based on an extended conversation with Matt Massie. For more on ADAM, see this recent technical report.
(1) The cancer research program at the Icahn School of Medicine at Mt. Sinai was the subject of a moving feature in Esquire magazine.

This post originally appeared on O’Reilly Strata (“Big Data systems are making a difference in the fight against cancer”). It’s republished with permission.

Source: Forbes
