Massive Graphs Pose Big Problems

SIAM News, April 22, 1999

By Barry Cipra

Faced with data sets that strain the limits of computation---long-distance billing data for one phone company amounts to an estimated 20 terabytes a year---telecommunications researchers need new theories for handling massive graphs.

There are certain kinds of mathematical problems where too much information can be detrimental to the mathematician, confusing him more than helping him. ---Henry Sutton (a.k.a. David Slavitt), The Exhibitionist

Psychologists have shown that the human brain is ill-equipped to handle large amounts of information. We have trouble caching more than seven numbers at a time, and our main memory is notoriously ponderous. Even the things we know "by heart" are stored in peculiar ways: Just try timing yourself as you say the alphabet backward.

Computers, of course, do better. The evolutionary forces that produced them favored the fast and accurate retention of data and the ability to zip information around. The small numbers we use to comprehend computer capabilities are exponents, staking out orders of magnitude. In prefix form, they've progressed from kilo- through mega- to giga-, with tera- now coming into play (and peta- just over the horizon).

But past a point, even computers choke on too much information. Joan Feigenbaum, a computer scientist at AT&T Labs-Research, has seen it happen. The telecommunications industry, whose raison d'être is the transfer of information, generates huge amounts of data in tracking its own operations. Efforts to analyze this data set are stymied by its sheer size. Forget about NP-hardness: Even algorithms that run in polynomial time are brought to their knees when the input won't all fit in RAM.

In an invited presentation, "Massive Graphs: Algorithms, Applications, and Open Problems," at the 1999 SIAM Annual Meeting in Atlanta next month, Feigenbaum will explain some of the challenges posed by data sets that strain the limits of computation. In addition to reporting recent progress made by, among others, her AT&T Labs colleagues James Abello, Adam Buchsbaum, Raffaele Giancarlo, Panos Pardalos, Mauricio Resende, and Jeff Westbrook, she will consider the many open problems that remain.

Dial "M" for "Massive"

He had long ago concluded that he possessed only one small and finite brain, and he had fixed a habit of determining most carefully with what he would fill it. ---Annie Dillard, The Living

Every day, AT&T handles approximately 300 million long-distance calls. The phone company's billing records typically display four numbers: the originating phone number, the terminating phone number, the time of the call, and its duration. The billing data associated with all these phone calls amount to an estimated 20 terabytes a year.

Most of us could keep our business records for a thousand years on the 2-gigabyte hard drive in our PCs. AT&T needs the equivalent of 10,000 hard drives for a single year.

If all a phone company wanted to do was to send out bills and collect its money, it would be a relatively simple matter of distributing the data to bill-collecting computers and then finding enough magnetic tape to store the records for a few years. But the forces of economic competition, in symbiosis with scientific curiosity, have produced the desire to do more. Telecommunications researchers would like to understand exactly how the networks are being used. Such knowledge could enable the long-distance carriers to optimize their operations; with the resulting additional profits, they might then offer lower rates in order to attract more customers.

There's quite a lot of "folklore" about network usage, Feigenbaum points out, but scientifically "very little is known." For example, it's often observed that, compared with other carriers, AT&T has lots of noncommercial customers who make very few phone calls. Other guesstimates concern the prevalence of pockets of phone numbers that traffic extensively among themselves, and patterns in the 50 x 50 matrix of phone calls from, say, Maine to Montana.

"The folklore has never been tested, because the data sets are huge, the computations are hard, and the data are company-proprietary," Feigenbaum says.

Conceptually, the phone-record database can be viewed as a gigantic, directed multigraph. The nodes are telephone numbers, and the directed edges, drawn as arrows connecting nodes, represent phone calls. The graph is a multigraph because one number might call another several times in the given time period. The edges are weighted by the time and duration of the corresponding calls.
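
The multigraph view described above can be sketched in a few lines of code. This is only an illustration, not AT&T's actual data model; the phone numbers and record fields are made up.

```python
from collections import defaultdict

# Hypothetical call records: (origin, destination, start time, duration in seconds).
calls = [
    ("207-555-0101", "406-555-0199", "1999-03-01T09:15", 120),
    ("207-555-0101", "406-555-0199", "1999-03-01T17:40", 45),
    ("406-555-0199", "207-555-0182", "1999-03-02T08:05", 300),
]

# A directed multigraph: each ordered pair of numbers maps to a list of
# (time, duration) weights, so repeated calls between the same pair of
# numbers become parallel edges rather than overwriting one another.
graph = defaultdict(list)
for origin, dest, start, duration in calls:
    graph[(origin, dest)].append((start, duration))

print(len(graph[("207-555-0101", "406-555-0199")]))  # 2 parallel edges
```

At 300 million calls a day, of course, no such in-memory dictionary would fit, which is precisely the problem the article describes.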

One immediate problem is simply the display of such massive graphs. The graceful spiderweb you see in a phone-company advertisement becomes a dense, black blur if you try to draw the entire graph at once. It stands to reason: To flash a terabyte of data on a 1000 x 1000 screen, you need to cram a megabyte of data into each pixel. Something's got to give, and researchers haven't yet figured out what's essential and what's superfluous.
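
The back-of-the-envelope arithmetic behind that claim is easy to verify:

```python
# One terabyte of data spread over a 1000 x 1000 pixel screen
# leaves a megabyte of data behind every pixel.
terabyte = 10**12          # bytes
pixels = 1000 * 1000
bytes_per_pixel = terabyte // pixels
print(bytes_per_pixel)     # 1000000, i.e., one megabyte per pixel
```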

It gets even worse when you want the computer to tell you something about a massive graph, such as the size of quasi-cliques (subgraphs in which a fixed, minimum fraction of all possible edges actually occur). Algorithms for analyzing graphs are well developed in classical computer science, Feigenbaum says. But whether they're polynomial-time or not, the existing algorithms assume that all the input data fit in main memory, and for massive graphs, that assumption isn't valid. (Today's massive graphs may fit in tomorrow's RAM, of course, but by then the data sets of interest will presumably be even larger.) And when you don't have the entire graph at your fingertips, she says, "this whole algorithmic paradigm breaks down."
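
To make the quasi-clique definition concrete, here is a small sketch that tests whether a set of nodes forms a quasi-clique at a given density threshold. The edge set is a toy example, and on a graph of billions of edges even this trivial check runs into the memory problem just described.

```python
from itertools import combinations

def is_quasi_clique(nodes, edges, gamma):
    """True if at least a fraction gamma of all possible undirected
    edges among `nodes` actually occur in `edges`."""
    possible = len(nodes) * (len(nodes) - 1) // 2
    present = sum(1 for u, v in combinations(sorted(nodes), 2)
                  if (u, v) in edges or (v, u) in edges)
    return present >= gamma * possible

# Toy edge set: 4 nodes with 5 of the 6 possible edges present.
edges = {(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)}
print(is_quasi_clique({1, 2, 3, 4}, edges, 0.8))  # 5/6 >= 0.8, so True
print(is_quasi_clique({1, 2, 3, 4}, edges, 1.0))  # not a full clique: False
```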

Thinking Big

I must create a system, or be enslaved by another man's.
---William Blake, Jerusalem

What's needed are new theories for handling these massive graphs, Feigenbaum says. Among the ingredients called for are "external memory" algorithms, whose complexity is measured in terms of the number of disk accesses, and "streaming" algorithms, which hang onto the database baby while tossing out the bath-water. Probabilistic models that simulate the typicality and evolution of telephone or Internet traffic are also needed; the current theory of random graphs, bifurcated as it is between complete amorphousness and rigid latticity, is inappropriate for the nebulously structured nature of telecommunications. Finally, there's the delicate question of providing researchers with real data they can use to test their theories and algorithms while protecting the privacy of the phone companies' customers. Cryptographic protocols may help.
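
One flavor of streaming algorithm can be sketched briefly. The example below is the classic Misra-Gries frequent-items sketch, chosen here purely as an illustration (the article does not name a specific algorithm): it scans a stream of call origins once, keeps only k counters no matter how long the stream is, and retains every number that accounts for more than a 1/(k+1) fraction of the traffic.

```python
def heavy_hitters(stream, k):
    """Misra-Gries sketch: one pass, at most k counters in memory.
    Every item occurring more than len(stream)/(k+1) times survives."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Stream item and all k counters "cancel": decrement each
            # counter and drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Toy stream of originating phone numbers: "A" dominates the traffic.
stream = ["A"] * 6 + ["B"] * 4 + ["C", "D", "E", "A"]
print(sorted(heavy_hitters(stream, k=2)))
```

The point of such sketches is exactly the trade-off Feigenbaum describes: they hang onto the baby (the heavy callers) while the bath-water (the long tail of rare numbers) flows past without ever being stored.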

The scope of the challenge has widened further with the arrival of the latest wrinkle in telecommunications: Internet traffic. In fact, Feigenbaum says, the massive graph metaphor may not be the right mathematical abstraction for the way Internet communications are conducted. It's a little like the segue from classical physics to quantum mechanics: Old-fashioned intuitions can be costly.

The packet-switching approach of Internet traffic is distinctly different from the dedicated-line paradigm of long-distance phone calls. In particular, no one keeps records of Internet traffic---mainly because there's no billing! Even if there were, an IP address is not the same thing as a phone number, Feigenbaum points out, so it's not clear what the nodes of an Internet graph would be. Finally, keeping track of Internet traffic would make the phone companies' current record-keeping tasks look puny by comparison: A single gateway router can generate more than 10 gigabytes of summary data per day.

One niche where graph theory still holds water, as it were, is the hyperlinked landscape of the Web. The Web "digraph" currently has about half a billion nodes (i.e., pages), with approximately 20 million being added every month. The number of directed edges emerging from each node varies widely, but an average of 10 links per page is a good guess. One advantage for researchers is that raw data regarding the Web are, almost by definition, publicly available. The distinct drawback is that there's no single site where it's all stored; if you want to know how Web sites are interconnected, you've pretty much got to go surfing.

The coming years will likely witness a contest between advances in theory and algorithms and the continued growth of massive data sets. For an industry as fiercely competitive as telecommunications, that's only fitting.