Big-data algorithms could cut analysis times from months to days
by Staff Writers
Boston MA (SPX) Oct 25, 2016

Last year, MIT researchers presented a system that automated a crucial step in big-data analysis: the selection of a "feature set," or aspects of the data that are useful for making predictions. The researchers entered the system in several data science contests, where it outperformed most of the human competitors and took only hours instead of months to perform its analyses.

This week, in a pair of papers at the IEEE International Conference on Data Science and Advanced Analytics, the team described an approach to automating most of the rest of the process of big-data analysis - the preparation of the data for analysis and even the specification of problems that the analysis might be able to solve. The researchers believe that, again, their systems could perform in days tasks that used to take data scientists months.

"The goal of all this is to present the interesting stuff to the data scientists so that they can more quickly address all these new data sets that are coming in," says Max Kanter MEng '15, who is first author on last year's paper and one of this year's papers. "[Data scientists want to know], 'Why don't you show me the top 10 things that I can do the best, and then I'll dig down into those?' So [these methods are] shrinking the time between getting a data set and actually producing value out of it."

Both papers focus on time-varying data, which reflects observations made over time, and they assume that the goal of analysis is to produce a probabilistic model that will predict future events on the basis of current observations.

Real-world problems
The first paper describes a general framework for analyzing time-varying data. It splits the analytic process into three stages: labeling the data, or categorizing salient data points so they can be fed to a machine-learning system; segmenting the data, or determining which time sequences of data points are relevant to which problems; and "featurizing" the data, the step performed by the system the researchers presented last year.

The second paper describes a new language for describing data-analysis problems and a set of algorithms that automatically recombine data in different ways, to determine what types of prediction problems the data might be useful for solving.

According to Kalyan Veeramachaneni, a principal research scientist at MIT's Laboratory for Information and Decision Systems and senior author on all three papers, the work grew out of his team's experience with real data-analysis problems brought to it by industry researchers.

"Our experience was, when we got the data, the domain experts and data scientists sat around the table for a couple months to define a prediction problem," he says. "The reason I think that people did that is they knew that the label-segment-featurize process takes six to eight months. So we better define a good prediction problem to even start that process."

In 2015, after completing his master's, Kanter joined Veeramachaneni's group as a researcher. Then, in the fall of 2015, Kanter and Veeramachaneni founded a company called Feature Labs to commercialize their data-analysis technology. Kanter is now the company's CEO, and Benjamin Schreck, another master's student in Veeramachaneni's group, joined the company as chief data scientist after receiving his master's in 2016.

Data preparation
Developed by Schreck and Veeramachaneni, the new language, dubbed Trane, should reduce the time it takes data scientists to define good prediction problems, from months to days. Kanter, Veeramachaneni, and another Feature Labs employee, Owen Gillespie, have also devised a method that should do the same for the label-segment-featurize (LSF) process.

To get a sense of what labeling and segmentation entail, suppose that a data scientist is presented with electroencephalogram (EEG) data for several patients with epilepsy and asked to identify patterns in the data that might signal the onset of seizures.

The first step is to identify the EEG spikes that indicate seizures. The next is to extract a segment of the EEG signal that precedes each seizure. For purposes of comparison, "normal" segments of the signal - segments of similar length but far removed from seizures - should also be extracted. The segments are then labeled as either preceding a seizure or not, information that a machine-learning algorithm can use to identify patterns that indicate seizure onset.
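
The EEG example above can be sketched in a few lines of Python. This is an illustration only, not the researchers' code: the function name `label_segments`, the parameters `window` and `gap`, and the assumption that seizure onsets have already been located as sample indices are all hypothetical.

```python
def label_segments(signal, seizure_onsets, window, gap):
    """Return (segment, label) pairs from a list of samples.

    Label 1: the `window` samples immediately preceding a seizure onset.
    Label 0: a `window`-length stretch at least `gap` samples away from
    every onset (a "normal" segment used for comparison).
    """
    examples = []

    # Segments that precede a seizure.
    for onset in seizure_onsets:
        start = onset - window
        if start >= 0:
            examples.append((signal[start:onset], 1))

    # "Normal" segments of the same length, far removed from any seizure.
    for start in range(0, len(signal) - window + 1, window):
        end = start + window
        far_from_all = all(
            end + gap <= onset or start >= onset + gap
            for onset in seizure_onsets
        )
        if far_from_all:
            examples.append((signal[start:end], 0))

    return examples
```

The labeled pairs are what a machine-learning algorithm would then consume; how the spikes themselves are detected is left outside this sketch.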

In their LSF paper, Kanter, Veeramachaneni, and Gillespie define a general mathematical framework for describing such labeling and segmentation problems. Rather than EEG readings, for instance, the data might be the purchases by customers of a particular company, and the problem might be to determine from a customer's buying history whether he or she is likely to buy a new product.

There, the pertinent data, for predictive purposes, may be not a customer's behavior over some time span, but information about his or her three most recent purchases, whenever they occurred. The framework is flexible enough to accommodate such different specifications. But once those specifications are made, the researchers' algorithm performs the corresponding segmentation and labeling automatically.
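
The two kinds of specification described above, a fixed time span versus a fixed number of recent events, might look like this in Python. The function names and the `(date, item)` tuple layout are assumptions made for illustration; the article does not describe the framework's actual interface.

```python
from datetime import date, timedelta


def window_by_time(events, cutoff, days):
    """Events that fall within the `days` days before `cutoff`."""
    start = cutoff - timedelta(days=days)
    return [e for e in events if start <= e[0] < cutoff]


def window_by_count(events, cutoff, n=3):
    """The `n` most recent events before `cutoff`, whenever they occurred."""
    earlier = sorted((e for e in events if e[0] < cutoff), key=lambda e: e[0])
    return earlier[-n:]


# Hypothetical purchase history: (purchase date, item) pairs.
purchases = [
    (date(2016, 1, 5), "a"),
    (date(2016, 3, 1), "b"),
    (date(2016, 6, 10), "c"),
    (date(2016, 9, 20), "d"),
]
recent_three = window_by_count(purchases, date(2016, 10, 1), n=3)
last_month = window_by_time(purchases, date(2016, 10, 1), days=30)
```

Here `recent_three` reaches back to March, while `last_month` contains only the September purchase, which is the difference the paragraph above is pointing at.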

Finding problems
With Trane, time-series data is represented in tables, where the columns contain measurements and the times at which they were made. Schreck and Veeramachaneni defined a small set of operations that can be performed on either columns or rows.

A row operation is something like determining whether a measurement in one row is greater than some threshold number, or raising it to a particular power. A column operation is something like taking the differences between successive measurements in a column, or summing all the measurements, or taking just the first or last one.
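
As a rough illustration of the distinction, the operations named above can be written as plain Python functions over a list of measurements. The names are hypothetical and are not taken from Trane itself.

```python
# Row operations: applied to one measurement at a time.
def greater_than(threshold):
    return lambda x: x > threshold


def power(exponent):
    return lambda x: x ** exponent


def apply_row_op(op, column):
    """Apply a row operation to every measurement in a column."""
    return [op(x) for x in column]


# Column operations: applied to the whole column at once.
def diff(column):
    """Differences between successive measurements."""
    return [b - a for a, b in zip(column, column[1:])]


def total(column):
    return sum(column)


def first(column):
    return column[0]


def last(column):
    return column[-1]
```

A row operation never looks beyond a single measurement, whereas a column operation needs the whole sequence, and therefore the order in which measurements were taken.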

Fed a table of data, Trane exhaustively iterates through combinations of such operations, enumerating a huge number of potential questions that can be asked of the data - whether, for instance, the differences between measurements in successive rows ever exceed a particular value, or whether there are any rows for which the square of the measurement equals a particular number.
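
The exhaustive search can be pictured with a toy enumerator. This is a sketch under stated assumptions, not Trane's algorithm: the five operations in `OPS`, the fixed threshold of 4, and the convention that every operation takes a list and returns a list are all invented here to keep the example small.

```python
from itertools import product

# Hypothetical operation set; each maps a list of values to a new list.
OPS = {
    "diff": lambda c: [b - a for a, b in zip(c, c[1:])],
    "sum": lambda c: [sum(c)],
    "last": lambda c: c[-1:],
    "square": lambda c: [x ** 2 for x in c],
    "gt_4": lambda c: [x > 4 for x in c],
}


def enumerate_questions(column, max_len):
    """Yield (operation names, result) for every sequence of operations
    of length 1 up to max_len, applied left to right."""
    for length in range(1, max_len + 1):
        for names in product(OPS, repeat=length):
            values = column
            for name in names:
                values = OPS[name](values)
            yield names, values


questions = dict(enumerate_questions([1, 4, 9, 7], max_len=2))
```

With five operations and sequences of at most two steps, this already produces 30 candidate questions; `questions[("diff", "gt_4")]` asks whether successive differences exceed 4. The count grows quickly as the operation set and sequence length increase.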

To test Trane's utility, the researchers considered a suite of questions that data scientists had posed about roughly 60 real data sets. They limited the number of sequential operations that Trane could perform on the data to five, and those operations were drawn from a set of only six row operations and 11 column operations. Remarkably, that comparatively limited set was enough to reproduce every question that researchers had in fact posed - in addition to hundreds of others that they hadn't.

