Data mining without prejudice

By By Larry Hardesty, MIT News Office | 17 Dec 2011

The information age is also the age of information overload. Companies, governments, researchers and private citizens are accumulating digital data at an unprecedented rate, and amid all those quintillions of bytes could be the answers to questions of vital human interest: What environmental conditions contribute most to disease outbreaks? What sociopolitical factors contribute most to educational success? What player statistics best predict a baseball team's win-loss record?

This graphic depicts the top 0.25 percent of the relationships that the researchers' techniques found in data on the concentration of microbes in the human gut. Image courtesy of David Reshef

There are a host of mathematical tools for finding possible relationships among data, but most of them require some prior knowledge about what those relationships might be. The problem becomes much harder if you start with a blank slate, and harder still if the datasets are large. But that's exactly the problem that researchers at MIT, Harvard University and the Broad Institute tackle in a paper appearing this week in the journal Science.

David Reshef, a joint MD-PhD student in the Harvard-MIT Division of Health Sciences and Technology (HST) - who, along with his brother, Yakir, is lead author on the new paper - says his team's approach to data mining tries to maximise two properties that are often in conflict; he calls these generality and equitability.

The graph of the relationship between two variables in a dataset could take any shape: for a company's hourly employees, the graph of hours worked to wages would approximate a straight line. A graph of flu incidence versus time, however, might undulate up and down, representing familiar seasonal outbreaks, whereas adoption of a new technology versus time might follow a convex curve, starting off slowly and ramping up as the technology proves itself. An algorithm for mining large datasets needs to be able to recognise any such relationship; that's what Reshef means by generality.

Equitability is a little more subtle. If you actually tried to graph workers' hours against wages, you probably wouldn't get a perfectly straight line. There might be some overtime hours at higher wages that throw things off slightly, or Christmas bonuses, or reimbursement for expenses. In engineers' parlance, there could be noise in the signal.

Most data-mining algorithms score possible relationships between variables according to their noisiness; the noisier the relationship, the less likely that it represents a real-world dependency. But linear relationships, undulating relationships or curved relationships with the same amount of noise should all score equally well. That's equitability.

As it happens, most previous attempts to create general data-mining algorithms have tended to privilege some relationships over others. So, for instance, a very noisy linear relationship might receive the same score as a nearly noiseless undulating relationship, making it difficult to interpret the algorithm's output.

Reshef holds both a bachelor's and a master's from MIT, but while both degrees are in computer science, for his master's thesis he chose to work with Pardis Sabeti, an assistant professor of biology at Harvard and a member of Harvard and MIT's joint Broad Institute.

''I had started trying to think about some large epidemiological datasets, and since I wasn't an epidemiologist, I didn't really know what to look for,'' Reshef says. ''I just kind of wanted to know, 'What are the variables in these datasets that are most associated?' Being as naïve as I was, I hadn't quite realised how difficult of a question that was to answer.''

Once he realised the scale of the problem, ''I roped my brother'' - then an undergraduate math major at Harvard - ''in to help me.''

Reshef's eight co-authors on the Science paper include not only his brother, Sabeti, and Michael Mitzenmacher, a Harvard professor of computer science, but also colleagues from Oxford University in the UK (where Reshef was a Marshall Scholar), the Broad Institute and the Weizmann Institute in Israel, where Yakir is now a Fulbright Scholar.

The procedure that their algorithm follows can be interpreted visually. Effectively, the algorithm considers every pair of variables in a dataset and plots them against each other. It then overlays each graph with a series of denser and denser grids and identifies the grid cells that contain data points. Using principles borrowed from information theory, the algorithm assesses how orderly the patterns produced by the data-containing cells are. The score for each pair of variables is based on the score of its most orderly pattern.

''The fundamental idea behind this approach is that if a pattern exists in the data, there will be some gridding that can capture it,'' Reshef says. And because the cells in a grid can track a curve as easily as they can a straight line, the method isn't tied to any particular type of relationship.

The problem that Reshef and his colleagues have tackled ''is a very important problem,'' says Eli Upfal, a professor of computer science at Brown University, and the researchers' algorithm ''looks like a very novel and a very interesting approach.''

''Most of the analysis [of relationships between data] assumes some model, and a big chunk of the work assumes linear models,'' Upfal says. ''What we have here now is a technique that is independent of any assumptions about the data.''

Upfal cautions that ''the proof will be to what extent scientists really adopt this approach, and that will take some time to see.'' But, he says, ''for the initial paper presenting a new technique, it definitely looks great.''

Server CPU Shortages Grip China as AI Boom Strains Intel and AMD Supply Chains

By Cygnus | 06 Feb 2026

Intel and AMD server CPU shortages are hitting China as AI data center demand surges, pushing lead times to six months and driving prices higher.

Budget 2026-27 Seeks Fiscal Balance Amid Rupee Volatility and Industrial Stagnation

By Cygnus | 02 Feb 2026

India's Budget 2026-27 targets fiscal discipline with record capex as markets tumble, the rupee weakens and manufacturing struggles to regain momentum.

The Thirsty Cloud: Why 2026 Is the Year AI Bottlenecks Shift From Chips to Water

By Axel Miller | 28 Jan 2026

As AI server density surges in 2026, data centers face a new bottleneck deeper than chips — the massive water demand required for cooling next-generation infrastructure.

The New Airspace Economy: How Geopolitics Is Rewriting Aviation Costs in 2026

By Axel Miller | 22 Jan 2026

Airspace bans, sanctions and corridor risk are forcing airlines into costly detours in 2026, raising fuel burn, reducing aircraft utilisation and pushing airfares higher worldwide.

India’s Data Center Arms Race: The Battle for Power, Cooling, and AI Real Estate

By Cygnus | 22 Jan 2026

India’s data centre boom is turning into an AI arms race where power contracts, liquid cooling and fast commissioning decide the winners across Mumbai, Chennai and Hyderabad.

India’s Oil Balancing Act: Refiners Rebuild Middle East Supply Lines as Russia Flows Disrupt

By Axel Miller | 21 Jan 2026

India’s refiners are rebalancing crude sourcing as Russian imports fell to a two-year low in December 2025, lifting OPEC’s share and raising geopolitical risk concerns.

Arctic Fever: How ‘Greenland Tariff’ Politics Sparked a Global Flight to Safety

By Axel Miller | 20 Jan 2026

Greenland-linked tariff threats have injected fresh uncertainty into transatlantic trade, triggering a risk-off shift in markets and reshaping global supply chain planning.

The New Oil (Part 5): Friend-Shoring, Supply Chain Fragmentation and the Cost of Resilience

By Cygnus | 19 Jan 2026

Friend-shoring is reshaping lithium, rare earth and graphite supply chains, creating a resilience premium and new winners and losers in clean tech.

The New Oil (Part 4): Can Technology Break the Dependency?

By Cygnus | 16 Jan 2026

Can magnet recycling and rare-earth-free motors reduce global dependence on strategic minerals? Part 4 explores breakthroughs, limits and timelines.

Data mining without prejudice

By By Larry Hardesty, MIT News Office | 17 Dec 2011

Latest articles

Global Chip Sales Expected to Hit $1 Trillion This Year, Industry Group Says

Citi to Match Government Seed Funding for Children’s ‘Trump Accounts’

Huawei-Backed Aito Partners With UAE Dealer to Enter Middle East Market

AI is No Bubble: Nvidia Supplier Wistron Sees Order Surge Through 2027

Tech Selloff Weighs on Asian Markets; Indonesia Slides After Moody’s Outlook Cut

Amazon Plans $200 Billion AI Spending Surge; Shares Slide on Investor Jitters

Server CPU Shortages Grip China as AI Boom Strains Intel and AMD Supply Chains

OpenAI launches ‘Frontier’ AI agent platform in enterprise push

Toyota set for third straight quarterly profit drop as costs and tariffs weigh

Featured articles

Server CPU Shortages Grip China as AI Boom Strains Intel and AMD Supply Chains

By Cygnus | 06 Feb 2026

Budget 2026-27 Seeks Fiscal Balance Amid Rupee Volatility and Industrial Stagnation

By Cygnus | 02 Feb 2026

The Thirsty Cloud: Why 2026 Is the Year AI Bottlenecks Shift From Chips to Water

By Axel Miller | 28 Jan 2026

The New Airspace Economy: How Geopolitics Is Rewriting Aviation Costs in 2026

By Axel Miller | 22 Jan 2026

India’s Data Center Arms Race: The Battle for Power, Cooling, and AI Real Estate

By Cygnus | 22 Jan 2026

India’s Oil Balancing Act: Refiners Rebuild Middle East Supply Lines as Russia Flows Disrupt

By Axel Miller | 21 Jan 2026

Arctic Fever: How ‘Greenland Tariff’ Politics Sparked a Global Flight to Safety

By Axel Miller | 20 Jan 2026

The New Oil (Part 5): Friend-Shoring, Supply Chain Fragmentation and the Cost of Resilience

By Cygnus | 19 Jan 2026

The New Oil (Part 4): Can Technology Break the Dependency?

By Cygnus | 16 Jan 2026

Latest articles

Global Chip Sales Expected to Hit $1 Trillion This Year, Industry Group Says

Citi to Match Government Seed Funding for Children’s ‘Trump Accounts’

Huawei-Backed Aito Partners With UAE Dealer to Enter Middle East Market

AI is No Bubble: Nvidia Supplier Wistron Sees Order Surge Through 2027

Tech Selloff Weighs on Asian Markets; Indonesia Slides After Moody’s Outlook Cut

Amazon Plans $200 Billion AI Spending Surge; Shares Slide on Investor Jitters

Server CPU Shortages Grip China as AI Boom Strains Intel and AMD Supply Chains

OpenAI launches ‘Frontier’ AI agent platform in enterprise push

Toyota set for third straight quarterly profit drop as costs and tariffs weigh