TREC 2007 Working Notes papers published

Two more papers have just appeared in the TREC 2007 Working Notes: The University of Amsterdam at the TREC 2007 Blog Track, by Breyten Ernsting, Wouter Weerkamp, and Maarten de Rijke; and The University of Amsterdam at the TREC 2007 Enterprise Track, by Krisztian Balog, Katja Hofmann, Wouter Weerkamp, and Maarten de Rijke.

In the first paper, we describe our participation in the TREC 2007 Blog track. In the opinion task we looked at the difference in performance between Indri and our mixture model, and at the influence of external expansion and document priors on opinion finding. Results show that an out-of-the-box Indri implementation outperforms our mixture model, and that external expansion on a news corpus is very beneficial. Opinion finding can be improved using either lexicons or the number of comments as document priors. In our approach to the feed distillation task we integrated time-based and frequency-based aspects into the retrieval model; we find that time-based retrieval improves results slightly, while frequency-based retrieval results in substantial improvements under the right circumstances.

In the second paper, we describe our participation in the TREC 2007 Enterprise track and detail our language modeling-based approaches. For document search, our focus was on estimating a mixture model using a standard web collection, and on constructing query models by employing blind relevance feedback and using the example documents provided with the topics. We found that settings that perform well on a web collection do not carry over to the CSIRO collection, but that the use of advanced query models resulted in significant improvements. In expert search, our experiments concerned document representation, identification of candidate experts, and combinations of expert search strategies. We found no significant difference in average precision but observed small overall positive effects of the advanced models, with large differences between individual topics.


CIKM 2007 paper published

"More Like These": Growing Entity Classes from Seeds, by Luis Sarmento, Valentin Jijkoun, Maarten de Rijke, and Eugenio Oliveira, has now been published in the proceedings of CIKM 2007. One of the important lexical acquisition tasks is creating sets of entities of a specific class from a handful of seed examples. In this paper we present a corpus-based approach to the class expansion task. Given a text collection and a set of seed entities, we use co-occurrence statistics to define a class membership function that is used to rank candidate entities for inclusion in the class. We describe a novel evaluation framework for this class expansion problem, using data from Wikipedia. Analysis of the results indicates that the method improves as the size of the collection increases, which makes it very appropriate given the constant growth of available text data. The paper is available here.
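To give a rough feel for the idea of ranking candidates by co-occurrence with the seeds, here is a minimal Python sketch. It is not the membership function from the paper: the scoring rule (a raw count of sentences shared with any seed), the function names `cooccurrence_counts` and `rank_candidates`, and the toy data are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how often each unordered pair of entities shares a sentence."""
    pair_counts = Counter()
    for entities in sentences:
        # Deduplicate and sort so each pair is counted once per sentence
        for a, b in combinations(sorted(set(entities)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def rank_candidates(sentences, seeds):
    """Rank non-seed entities by total co-occurrence with the seed set.

    Illustrative scoring only: the paper's class membership function
    is not reproduced here.
    """
    seeds = set(seeds)
    scores = Counter()
    for (a, b), n in cooccurrence_counts(sentences).items():
        if a in seeds and b not in seeds:
            scores[b] += n
        elif b in seeds and a not in seeds:
            scores[a] += n
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy collection: each inner list holds the entities
# mentioned together in one sentence.
toy = [
    ["Paris", "London", "Berlin"],
    ["Paris", "Madrid"],
    ["London", "Madrid", "Berlin"],
    ["Berlin", "Mozart"],
]
ranking = rank_candidates(toy, seeds=["Paris", "London"])
print(ranking)
```

In this toy setting the entities that appear alongside the seeds are ranked, while an entity that never shares a sentence with a seed receives no score at all. With a larger collection the counts become less sparse, which is in line with the observation that the method improves as the collection grows.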


WI 2007 paper published

Fact Discovery in Wikipedia, by Sisay Fissaha Adafre, Valentin Jijkoun, and myself, has now been published in the proceedings of Web Intelligence 2007. In it, we address the task of extracting focused salient information items, relevant and important for a given topic, from a large encyclopedic resource. Specifically, for a given topic (a Wikipedia article) we identify snippets from other articles in Wikipedia that contain important information for the topic of the original article, without duplicates. We compare several methods for addressing the task, and find that a mixture of content-based, link-based, and layout-based features outperforms other methods, especially in combination with the use of so-called reference corpora that capture the key properties of entities of a common type. These reference corpora will also play a big role in Sisay's forthcoming PhD thesis. A PDF version of the paper is available here.
