Speaker: Ian H. Witten Time: 2:00pm, Sep. 10, 2012 (Monday)Venue: Room 305, Computer Science BuildingContact: Xipeng Qiu xpqiu@fudan.edu.cn
Abstract:
Wikipedia is a goldmine of information. Each article describes a single concept, and together they constitute a vast investment of manual effort and judgment.
“Wikification” is the process of automatically augmenting a plain-text document with hyperlinks to Wikipedia articles. This involves associating phrases in the document with concepts, disambiguating them, and selecting the most pertinent. All three processes can be addressed by exploiting Wikipedia as a source of data. For the first, link anchor text illustrates how concepts are described in running text. For the second and third, Wikipedia provides millions of examples that can be used to prime machine-learned algorithms for disambiguation and selection respectivelyWikification produces a semantic representation of any document in terms of concepts. We apply this to (a) select index terms for scientific documents, and (b) determine the similarity of two documents, in both cases outperforming humans in terms of agreement with human judgment. We also apply it to document clustering and classification algorithms, and to produce back of the book indexes, improving on the state of the art in each case.The Wikipedia Miner toolkit is open source and runs quickly (using Hadoop to parallelize where appropriate).Biography:Ian H. Witten is a professor of computer science at the University of Waikato in New Zealand. He directs the New Zealand Digital Library research project. His research interests include information retrieval, machine learning, text compression, and programming by demonstration. He received an MA in Mathematics from Cambridge University, England; an MSc in Computer Science from the University of Calgary, Canada; and a PhD in Electrical Engineering from Essex University, England. He is a fellow of the ACM and of the Royal Society of New Zealand. He has published widely on digital libraries, machine learning, text compression, hypertext, speech synthesis and signal processing, and computer typography.