CIKM 2009 papers online
August 20, 2009 16:48 Filed in: Papers
Three CIKM 2009 papers are online now. The first,
The Impact of Document Structure on Keyphrase
Extraction by Katja Hofmann, Manos Tsagkias,
Edgar Meij and Maarten de Rijke, can be downloaded
here. Keyphrases are short
phrases that reflect the main topic of a
document. Because manually annotating documents
with keyphrases is a time-consuming process,
several automatic approaches have been
developed. Typically, candidate phrases are
extracted using features such as position or
frequency in the document text. Document
structure may contain useful information about
which parts or phrases of a document are
important, but has rarely been considered as a
source of information for keyphrase extraction.
We address this issue in the context of
keyphrase extraction from scientific literature.
We introduce a new, large corpus that consists
of full-text journal articles, where the rich
collection and document structure available at
the publishing stage is explicitly annotated. We
explore features based on the XML tags contained
in the documents, and based on generic section
types derived using position and cue words in
section titles. For XML tags we find sections,
abstract, and title to perform best, but many
smaller elements may be beneficial in
combination with other features. Of the generic
section types, the discussion section is found
to be the most useful for keyphrase extraction.
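As a rough illustration of the position and frequency features mentioned above, here is a minimal sketch. It is not the feature extraction used in the paper. The function name `candidate_features`, the n-gram candidate definition, and the `max_len` parameter are assumptions made for this example only.

```python
import re
from collections import Counter


def candidate_features(text, max_len=3):
    """Return {phrase: (frequency, relative first position)} for word n-grams.

    Hypothetical helper: candidates are simply all n-grams of up to
    max_len words, which is an assumption, not the paper's definition.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    freq = Counter()
    first_pos = {}
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            freq[phrase] += 1
            # Keep only the first occurrence, as a fraction of document length.
            first_pos.setdefault(phrase, i / len(tokens))
    return {p: (freq[p], first_pos[p]) for p in freq}


feats = candidate_features(
    "Keyphrase extraction helps search. Keyphrase extraction is hard"
)
print(feats["keyphrase extraction"])
```

A structure-aware variant would compute the same features per section or per XML element rather than over the whole text.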
The second paper, A Query Model Based on Normalized Log-Likelihood, by Edgar Meij, Wouter Weerkamp and Maarten de Rijke, is available here. Leveraging information from relevance assessments has been proposed as an effective means for improving retrieval. We introduce a novel language modeling method which uses information from each assessed document and from their aggregate. While most previous approaches focus either on features of the entire set or on features of the individual relevant documents, our model exploits features of both the documents and the set as a whole. Our evaluation shows that the model significantly improves over state-of-the-art feedback methods.
The third paper, Predicting the Volume of Comments on Online News Stories, by Manos Tsagkias, Wouter Weerkamp and Maarten de Rijke, is available here. Online news agents provide commenting facilities for readers to express their views with regard to news stories. The number of user-supplied comments on a news article may be indicative of its importance or impact. We report on exploratory work that predicts the comment volume of news articles prior to publication using five feature sets. We address the prediction task as a two-stage classification task: a first binary classification identifies articles with the potential to receive comments, and a second binary classification receives the output from the first step to label articles as having "low" or "high" comment volume. The results show solid performance for the former task, while performance degrades for the latter.
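The two-stage setup described above can be sketched as a simple cascade. This is only an illustration of the control flow, not the classifiers or features from the paper. The names `predict_comment_volume`, `stage_one` and `stage_two`, the article fields `entities` and `hour`, and the toy rules are all hypothetical stand-ins for trained models.

```python
def predict_comment_volume(article, will_get_comments, is_high_volume):
    """Two-stage cascade: stage two only sees articles that pass stage one."""
    if not will_get_comments(article):
        return "none"
    return "high" if is_high_volume(article) else "low"


# Hypothetical stand-ins for two trained binary classifiers.
def stage_one(article):
    # Assumed toy rule: articles mentioning several entities attract comments.
    return article["entities"] >= 2


def stage_two(article):
    # Assumed toy rule: articles published in the morning get more comments.
    return article["hour"] < 12


article = {"entities": 3, "hour": 9}
print(predict_comment_volume(article, stage_one, stage_two))
```

One consequence of this design is that errors made in the first stage carry over to the second, since an article rejected in stage one is never labeled "low" or "high".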



