Speeding algorithms by shrinking data

23 Nov 2012

In computer science, the buzzword of the day is "big data." The proliferation of cheap, internet-connected sensors - such as the GPS receivers, accelerometers and cameras in smartphones - has meant an explosion of information whose potential uses have barely begun to be explored. In large part, that's because processing all that data can be prohibitively time-consuming.

Most computer scientists try to make better sense of big data by developing ever-more-efficient algorithms. But in a paper presented this month at the Association for Computing Machinery's International Conference on Advances in Geographic Information Systems, MIT researchers take the opposite approach, describing a novel way to represent data so that it takes up much less space in memory but can still be processed in conventional ways.

Besides promising significant computational speedups, the approach could be more broadly applicable than other big-data techniques, because it works with existing algorithms rather than replacing them.

In the new paper, the researchers apply their technique to two-dimensional location data generated by GPS receivers, a natural application that also clearly demonstrates how the technique works. As Daniela Rus, a professor of computer science and engineering and director of MIT's Computer Science and Artificial Intelligence Laboratory, explains, GPS receivers take position readings every 10 seconds, which adds up to a gigabyte of data each day. A computer system trying to infer traffic patterns from the GPS data of tens of thousands of cars - tens of terabytes of readings a day, at that rate - could quickly be overwhelmed.

But in analyzing the route traversed by a car, it's generally not necessary to consider the precise coordinates of every point along the route. The essential information is the points at which the car turns; the path between such points can be approximated by a straight line. That, in essence, is what the new algorithm does: it keeps the turning points and replaces everything between them with line segments.
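The paper's actual construction is a coreset with provable error bounds, but the core intuition - keep the turns, approximate everything between them with straight lines - is shared by classic line simplification. As a rough, hypothetical illustration (not the researchers' algorithm), here is the Douglas-Peucker method in Python: it drops a point whenever doing so would move the path by less than a chosen tolerance.

```python
import math

def perpendicular_distance(point, start, end):
    """Distance from `point` to the line through `start` and `end`."""
    (x, y), (x1, y1), (x2, y2) = point, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:  # degenerate segment: measure to the endpoint
        return math.hypot(x - x1, y - y1)
    # Parallelogram area divided by the base length.
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def simplify(points, tolerance):
    """Douglas-Peucker line simplification: straight stretches collapse
    to their endpoints, so the surviving points are the turns."""
    if len(points) < 3:
        return list(points)
    # Find the point farthest from the straight line joining the endpoints.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= tolerance:
        # The whole stretch is nearly straight: one segment suffices.
        return [points[0], points[-1]]
    # The farthest point marks a genuine turn; recurse on both halves.
    left = simplify(points[:index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right  # drop the duplicated split point

# A straight run followed by a sharp left turn: six readings shrink to three.
route = [(0, 0), (1, 0.02), (2, -0.01), (3, 0), (3.1, 1), (3.2, 2)]
print(simplify(route, tolerance=0.1))  # [(0, 0), (3, 0), (3.2, 2)]
```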

A key aspect of the algorithm, explains Dan Feldman, a postdoc in Rus' group and lead author on the new paper, is that it can compress data on the fly. For instance, it could compress the first megabyte of data it receives from a car, then wait until another megabyte builds up and compress again, then wait for another megabyte, and so on - and yet the final representation of the data would preserve almost as much information as if the algorithm had waited for all the data to arrive before compressing.
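The researchers' streaming construction is what preserves accuracy as compressed chunks pile up; the sketch below, which reuses the simplify routine above, shows only the general pattern - compress each chunk as it arrives, merge it into the running summary, and re-compress. The chunk_size and tolerance parameters are illustrative, and this naive re-simplification, unlike the paper's coreset, lets error accumulate.

```python
class StreamingSimplifier:
    """Compress a GPS stream chunk by chunk: a minimal sketch of the
    compress-on-the-fly pattern, not the paper's coreset construction."""

    def __init__(self, chunk_size, tolerance):
        self.chunk_size = chunk_size  # readings to buffer before compressing
        self.tolerance = tolerance
        self.buffer = []   # raw readings not yet compressed
        self.summary = []  # compressed trajectory so far

    def add(self, point):
        self.buffer.append(point)
        if len(self.buffer) == self.chunk_size:
            self._compress()

    def _compress(self):
        # Merge the running summary with the new chunk and re-simplify,
        # so memory use stays proportional to the compressed size
        # rather than to the length of the stream.
        self.summary = simplify(self.summary + self.buffer, self.tolerance)
        self.buffer = []

    def result(self):
        if self.buffer:  # compress any leftover partial chunk
            self._compress()
        return self.summary
```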