



TECH SPACE
Scientists review new ways to process and analyze Big Data
by Staff Writers
Princeton NJ (SPX) Sep 11, 2014


Illustration of spurious correlation: the distribution of the maximum absolute sample correlation coefficient between the first variable and four of the rest, for 800 (in red) and 6,400 (in blue) independently drawn standard Gaussian random variables with sample size n = 60. The maximum spurious correlation coefficient can be very high. Image courtesy Science China Press.

Big Data presents scientists with unfolding opportunities, including, for instance, the possibility of discovering heterogeneous characteristics in the population, leading to the development of personalized treatments and highly individualized services. But ever-expanding data sets introduce new challenges in terms of statistical analysis, sampling bias, computational costs, noise accumulation, spurious correlations, and measurement errors.

The era of Big Data - marked by a Big Bang-like explosion of information about everything from patterns of use of the World Wide Web to individual genomes - is being propelled by massive amounts of very high-dimensional or unstructured data, continuously produced and stored at a decreasing cost.

"In genomics we have seen a dramatic drop in price for whole genome sequencing," state Jianqing Fan and Han Liu, scientists at Princeton University, and Fang Han at Johns Hopkins.

"This is also true in other areas such as social media analysis, biomedical imaging, high-frequency finance, analysis of surveillance videos and retail sales," they point out in a paper titled "Challenges of Big Data analysis" published in the Beijing-based journal National Science Review.

With the quickening pace of data collection and analysis, they add, "scientific advances are becoming more and more data-driven and researchers will more and more think of themselves as consumers of data."

Increasingly complex data sets are emerging across the sciences. In the field of genomics, more than 500 000 microarrays are now publicly available, with each array containing tens of thousands of expression values of molecules; in biomedical engineering, tens of thousands of terabytes of functional magnetic resonance images have been produced, with each image containing more than 50 000 voxel values. Massive and high-dimensional data is also being gathered from social media, e-commerce, and surveillance videos.

Expanding streams of social network data are being channeled and collected by Twitter, Facebook, LinkedIn and YouTube. This data, in turn, is being used to predict influenza epidemics, stock market trends, and box-office revenues for particular movies.

Social media and the Internet contain burgeoning information on consumer preferences, leading economic indicators, business cycles, and the economic and social states of a society.

"It is anticipated that social network data will continue to explode and be exploited for many new applications," predict the co-authors of the study. New applications include ultra-individualized services.

And in the area of Internet security, they add, "When a network-based attack takes place, historical data on network traffic may allow us to efficiently identify the source and targets of the attack."

With Big Data emerging from many frontiers of scientific research and technological advances, researchers have focused on the development of new computational infrastructure and data-storage methods, and of fast algorithms that are scalable to massive data with high dimensionality.

"This forges cross-fertilization among different fields including statistics, optimization and applied mathematics," the scientists add.

The massive sample sizes giving rise to Big Data fundamentally challenge the traditional computing infrastructure.

"In many applications, we need to analyze Internet-scale data containing billions or even trillions of data points, which makes even a linear pass of the whole dataset unaffordable," the researchers point out.

The basic approach to storing and processing such data is to divide and conquer. The idea is to partition a large problem into more tractable and independent sub-problems. Each sub-problem is tackled in parallel by different processing units. On a small scale, this divide-and-conquer strategy can be implemented either by multi-core computing or grid computing.
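The divide-and-conquer idea can be sketched in a few lines of Python. This is an illustrative example only, not code from the paper: the function names (`split`, `partial_stats`, `parallel_mean`) are hypothetical, and a thread pool stands in for the separate processing units. On real multi-core or grid hardware, a process pool or cluster scheduler would take its place.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    """Partition data into n_parts contiguous, roughly equal chunks."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_stats(chunk):
    """Solve one independent sub-problem: the count and sum of a chunk."""
    return len(chunk), sum(chunk)


def parallel_mean(data, n_workers=4):
    """Compute the mean of non-empty data by combining per-chunk results."""
    chunks = [c for c in split(data, n_workers) if c]
    # Each chunk is handed to a separate worker (threads here, for simplicity).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(partial_stats, chunks))
    # Combine step: merge the partial counts and sums into one answer.
    total_count = sum(n for n, _ in results)
    total_sum = sum(s for _, s in results)
    return total_sum / total_count


values = list(range(1, 1001))
mean = parallel_mean(values)
print(mean)
```

The key design point is that each sub-problem returns a small summary (a count and a sum) that can be merged cheaply, so no worker ever needs to see the whole dataset.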

On a larger scale, handling enormous arrays of data requires a new computing infrastructure that supports massively parallel data storage and processing.

The researchers present Hadoop as an example of a basic software and programming infrastructure for Big Data processing. Alongside Hadoop's distributed file system, they review MapReduce (a programming model for processing large datasets in parallel), cloud computing, convex optimization, and random projection algorithms, which are specifically designed to meet Big Data's computational challenges.
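To give a flavor of random projection, here is a minimal Python sketch. It is not taken from the paper: it assumes one common variant, a Gaussian random matrix scaled by 1/sqrt(k), and the names `random_projection` and `distance` as well as the dimensions (1,000 reduced to 200) are chosen purely for illustration. The point it shows is that distances between points are roughly preserved after the dimension is reduced.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a Gaussian matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)  # keeps squared lengths unchanged on average
    projected = [[] for _ in points]
    for _ in range(k):
        # One row of the random matrix, generated on the fly to save memory.
        row = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for out, p in zip(projected, points):
            out.append(scale * sum(r * x for r, x in zip(row, p)))
    return projected


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(1)
a = [rng.gauss(0.0, 1.0) for _ in range(1000)]
b = [rng.gauss(0.0, 1.0) for _ in range(1000)]
pa, pb = random_projection([a, b], k=200)

# A ratio close to 1 means the distance survived the dimension reduction.
ratio = distance(pa, pb) / distance(a, b)
print(round(ratio, 3))
```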

Hadoop is a Java-based software framework for distributed data management and processing. It contains a set of open source libraries for distributed computing using the MapReduce programming model and its own distributed file system called HDFS. Hadoop automatically facilitates scalability and takes care of detecting and handling failures.
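The MapReduce programming model can be illustrated with the classic word-count example. The sketch below runs in a single Python process and does not use Hadoop or its API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) are hypothetical. It only mimics the three stages that a framework such as Hadoop would distribute across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def map_reduce(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = map_reduce(["big data big challenges", "data storage"])
print(counts)
```

Because each call to `map_phase` touches one document and each call to `reduce_phase` touches one key, both stages could in principle run in parallel on different machines.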

HDFS is designed to host and provide high-throughput access to large datasets that are redundantly stored across multiple machines. It ensures Big Data's survivability and high availability for parallel applications.

In terms of statistical analysis, Big Data presents another set of new challenges. Researchers tend to collect as many features of the samples as possible; as a result, these samples are commonly heterogeneous and high dimensional.

High dimensionality brings new problems, including noise accumulation, spurious correlation, and incidental endogeneity. Take spurious correlation: in studying the association between cancers and certain genomic and clinical factors, prostate cancer might appear to be highly correlated with an unrelated gene.

However, such a high correlation could be explained by high dimensionality: In studies that include so many features, ranging from genomic information to height, weight and gender to favorite foods and sports, some high correlations emerge merely by chance.
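A small simulation makes the effect concrete. The Python sketch below is a simplified illustration, not the authors' code: it draws independent standard Gaussian variables with sample size n = 60 and records the largest absolute sample correlation between the first variable and any single other variable, for 800 and 6,400 variables. The function names, the seed, and the choice of three repetitions are assumptions made for the example.

```python
import math
import random


def sample_corr(x, y):
    """Pearson sample correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_corr(n, d, rng):
    """Largest |correlation| between variable 1 and d - 1 independent ones."""
    first = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = 0.0
    for _ in range(d - 1):
        # Every other variable is drawn independently of the first one.
        other = [rng.gauss(0.0, 1.0) for _ in range(n)]
        best = max(best, abs(sample_corr(first, other)))
    return best


rng = random.Random(2014)
results = {}
for d in (800, 6400):
    trials = [max_spurious_corr(60, d, rng) for _ in range(3)]
    results[d] = sum(trials) / len(trials)
    print(d, round(results[d], 2))
```

Although every variable is independent of the first by construction, the largest observed correlation is far from zero, and it tends to grow as more variables are added.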

Jianqing Fan, Fang Han, and Han Liu. "Challenges of Big Data analysis." Natl Sci Rev (June 2014) 1 (2): 293-314.
















