


SIAM News
Mining Multilingual Documents
September 24, 2008
By Dianne P. O'Leary and John M. Conroy
Figure 1. Interface for the prototype system QCS.

Answering questions from information found on the Web is hard in any case, but what if the relevant information is in a variety of different languages? Suppose, for example, that we want to determine what citizens of Kenya think about their new cabinet. We will miss essential information if we consider only English-language documents. But challenges arise when we include non-English documents: It is not realistic to expect human translations of these documents, and existing machine-translation systems are far from perfect. Keywords can be translated incorrectly (e.g., "Kenyan box" for "Kenyan cabinet"), and translations are often ungrammatical. When punctuation is missing in source documents, context or referential information almost certainly gets lost. In spite of these challenges, we must rely on machine translations.

Even if we retrieve the relevant documents, we have not necessarily answered the question. Some questions (for example, Who is Michael Jordan?) yield such overwhelming amounts of information on one possible answer (in this case, the retired basketball player) that it is difficult to find other possibilities (the jazz musician, the mathematician, and the many other people with this name). Therefore, it is important that documents be clustered. And the clusters will be most useful if each is accompanied by a summary.

To answer a question, then, users need a tool that can retrieve a set of relevant documents, cluster the documents by subtopic, and summarize each cluster (in English). Our research has been directed toward the creation of a tool with these capabilities, and we have developed a system that can function without human intervention. What might be surprising, even to the SIAM community, is that retrieval, clustering, and summarization can all be handled by purely mathematical techniques. Document retrieval can be formulated as a problem in linear algebra called "latent semantic analysis."
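As a concrete toy illustration of latent-semantic retrieval, the sketch below builds a small term–document matrix, forms a rank-2 approximation with the SVD, and ranks documents against a query by cosine similarity. The terms, documents, and counts are invented for the example; a real system would use a large weighted matrix built from the corpus.

```python
# Toy sketch of latent semantic analysis for retrieval.
# Terms, documents, and counts are hypothetical.
import numpy as np

terms = ["kenya", "cabinet", "minister", "jazz", "basketball"]
# Columns are documents; entry (i, j) weights term i in document j.
A = np.array([
    [2, 1, 0, 0],   # kenya
    [1, 2, 0, 0],   # cabinet
    [0, 1, 0, 0],   # minister
    [0, 0, 3, 0],   # jazz
    [0, 0, 0, 2],   # basketball
], dtype=float)

# Rank-k truncated SVD gives a low-rank approximation A_k = U_k S_k V_k^T.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Score each document by the cosine between the query vector and the
# corresponding column of the low-rank approximation.
query = np.array([1, 1, 0, 0, 0], dtype=float)   # query: "kenya cabinet"
scores = (query @ A_k) / (np.linalg.norm(query) * np.linalg.norm(A_k, axis=0) + 1e-12)
ranking = np.argsort(-scores)   # best-matching documents first
```

With these made-up counts, the two cabinet-related documents score highest, while the jazz and basketball documents score near zero.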
Each document is represented by a column in a matrix, in which each row corresponds to a term in the dictionary. The magnitude of the entry in a given row and column indicates the importance of that term in that document. We use the singular value decomposition to form a low-rank approximation to the matrix; we can then rank documents for relevance to a question by forming the inner product between the vector representing the question and the matrix. Approximations other than the SVD (the rank-revealing QR, the semidiscrete decomposition, or a nonnegative decomposition) might improve retrieval.

Document clustering is an optimization problem: Find cluster centers that minimize some measure of total distance from each document vector to its cluster center.

Once we have formed a document cluster, only two mathematical decisions remain: Choose sentences from the documents to include in the summary, and order the sentences in the summary. We score each sentence by an algorithm that estimates the probability that a human would choose that sentence, based on the estimated worth of its individual words. We then frame the problem of choosing from the collection of high-scoring sentences as a subset-selection problem, which we solve with latent semantic analysis and a nonnegative variant of the QR factorization. As to the ordering of the sentences in the summary, one approach is to solve a Traveling Salesperson Problem that forms a "tour" of all the high-scoring sentences, with the "distances" between sentences inversely proportional to the cosine of the angle between the term–sentence vectors.

With Daniel Dunlavy (of Sandia National Laboratories, Albuquerque) and Judith Schlesinger (of the IDA Center for Computing Sciences, in Bowie, Maryland), we have developed and implemented a prototype system called QCS (Query, Cluster, Summarize).
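The sentence-ordering idea can be sketched with a greedy nearest-neighbor tour, a cheap stand-in for solving the Traveling Salesperson Problem exactly: from the current sentence, always hop to the unvisited sentence with the largest cosine similarity (i.e., the smallest "distance"). The term vectors for the four sentences below are invented for illustration.

```python
# Greedy-tour sketch of TSP-style sentence ordering.
# The sentence term vectors are hypothetical.
import math

def cosine(u, v):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_tour(vectors, start=0):
    """Order sentences by repeatedly hopping to the nearest unvisited one;
    'distance' shrinks as cosine similarity grows, so we take the max cosine."""
    unvisited = set(range(len(vectors))) - {start}
    tour = [start]
    while unvisited:
        here = vectors[tour[-1]]
        nxt = max(unvisited, key=lambda j: cosine(here, vectors[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Hypothetical term vectors for four high-scoring sentences.
sents = [
    [1, 1, 0, 0],   # s0: about the new cabinet
    [1, 0, 1, 0],   # s1: cabinet and ministers
    [0, 0, 1, 1],   # s2: ministers and reactions
    [0, 0, 0, 1],   # s3: reactions only
]
order = greedy_tour(sents)
```

On this toy input the tour visits the sentences in topical order, s0 through s3, keeping related sentences adjacent; an exact TSP solver would optimize the same objective globally.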
The interface, shown in Figure 1, displays a query, summaries for the top-scoring clusters, and pointers to the documents in these clusters.

Arabic, with more than 30 varieties (Egyptian, Iraqi, Gulf Arabic, ...), is a particularly challenging language for machine translation; even standardized newswire articles typically omit vowels, punctuation, and sentence delimiters. The Multilingual Summarization Evaluation (MSE), a competition devised by computational linguists, focuses on summarizing collections of Arabic and English newswire articles that have already been retrieved and clustered. In this evaluation, we found that our mathematics-based algorithms, coupled with simple linguistic rules to trim unnecessary words from the sentences, outperformed all the other competing automatic summarizing systems, each of which was based much more deeply in linguistics. In fact, our system even outscored three of four human summarizers in a measure called ROUGE-2, which scores common two-word pairs, as well as in an evaluation done by human judges.

Although machine-generated summaries are far from perfect, mathematics, particularly linear algebra and optimization, forms the basis for highly effective algorithmic approaches.

References

[2] D.M. Dunlavy, D.P. O'Leary, J.M. Conroy, and J.D. Schlesinger, QCS: A system for querying, clustering, and summarizing documents, Inform. Process. Manage., 43:6 (2007), 1588–1605; DOI: 10.1016/j.ipm.2007.01.003; based on a 2006 technical report.

[3] J.D. Schlesinger, D.P. O'Leary, and J.M. Conroy, Arabic/English multi-document summarization with CLASSY: The past and the future, CICLing Conference on Intelligent Text Processing and Computational Linguistics, Haifa, Israel, February 17–23, 2008; also in Comput. Linguist. Intel. Text Process., Lecture Notes in Computer Science, Springer, Berlin, 4919 (2008), 568–581.
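The essence of the ROUGE-2 measure mentioned above is bigram recall: what fraction of a reference summary's adjacent word pairs also appear in the candidate summary. The sketch below is a simplified single-reference version (real ROUGE-2 averages over multiple references and applies further normalization); both sentences are invented for the example.

```python
# Simplified single-reference sketch of ROUGE-2 (bigram recall).
# The reference and candidate sentences are hypothetical.
from collections import Counter

def bigrams(words):
    """All adjacent word pairs in a token list."""
    return [tuple(words[i:i + 2]) for i in range(len(words) - 1)]

def rouge2_recall(candidate, reference):
    """Fraction of the reference's bigrams matched (with clipping) by the candidate."""
    cand_counts = Counter(bigrams(candidate.lower().split()))
    ref_counts = Counter(bigrams(reference.lower().split()))
    if not ref_counts:
        return 0.0
    overlap = sum(min(c, ref_counts[b]) for b, c in cand_counts.items() if b in ref_counts)
    return overlap / sum(ref_counts.values())

reference = "the new kenyan cabinet drew mixed reactions"
candidate = "the new kenyan cabinet was announced today"
score = rouge2_recall(candidate, reference)
```

Here the candidate matches three of the reference's six bigrams, giving a recall of 0.5.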
Dianne O'Leary is a professor in the Department of Computer Science and the Institute for Advanced Computer Studies at the University of Maryland, College Park. John Conroy is a research scientist at the IDA Center for Computing Sciences, in Bowie, Maryland.
Dianne O'Leary of the University of Maryland (right), coauthor of the accompanying article on the mining of multilingual documents (based on her invited talk at this year's SIAM Conference on Data Mining), is pictured here with AWM president Cathy Kessel and SIAM president Cleve Moler on the occasion of an altogether different talk: the 2008 AWM–SIAM Sonia Kovalevsky Lecture. A joint honor of the Association for Women in Mathematics and SIAM, the Kovalevsky Lecture is intended to highlight significant contributions of women to applied and computational mathematics. In her lecture, "A Noisy Adiabatic Theorem: Wilkinson Meets Schrödinger's Cat," given in San Diego at the 2008 SIAM Annual Meeting, O'Leary departed from her usual linear algebra focus in favor of a PDE/physics theme in honor of Sonia Kovalevsky. Schrödinger's cat, the central character in a thought experiment regarding quantum superposition, and James Wilkinson, famous for his work on the effects of perturbations on floating-point computations, were used to set the stage for a discussion of the effects of perturbation on quantum systems, and the resulting implications for adiabatic quantum computing. Readers who missed what Moler described as "an absolutely great lecture" will be happy to know that it (and most of the other invited talks at the meeting) was taped and will be accessible as soon as possible at www.siam.org.
