SIGIR 2008 workshop paper online (4)

Using Term Clouds to Represent Segment-Level Semantic Content of Podcasts by Marguerite Fuller, Manos Tsagkias, Eamonn Newman, Jana Besser, Martha Larson, Gareth Jones and Maarten de Rijke is available online now. Spoken audio, like any time-continuous medium, is notoriously difficult to browse or skim without the support of an interface providing semantically annotated jump points to signal the user where to listen in. Creation of time-aligned metadata by human annotators is prohibitively expensive, motivating the investigation of representations of segment-level semantic content based on transcripts generated by automatic speech recognition (ASR). This paper examines the feasibility of using term clouds to provide users with a structured representation of the semantic content of podcast episodes. Podcast episodes are visualized as a series of sub-episode segments, each represented by a term cloud derived from an ASR transcript. The quality of segment-level term clouds is measured quantitatively, and their utility is investigated in a small-scale user study based on human-labeled segment boundaries. Since the segment-level clouds generated from ASR transcripts prove useful, we examine an adaptation of text tiling techniques to speech in order to generate segments as part of a completely automated indexing and structuring system for browsing spoken audio. Results demonstrate that the segments generated are comparable with human-selected segment boundaries.
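To make the idea of a segment-level term cloud concrete, here is a minimal sketch in Python. It is not the method from the paper: the abstract does not say how terms are weighted, so the tf-idf style weighting, the tiny stopword list, and the names `tokenize` and `term_clouds` are all assumptions made for illustration. The input is assumed to be a list of transcript strings, one per segment.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "we", "on", "for"}


def tokenize(text):
    """Lowercase the text and return its word tokens, minus stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]


def term_clouds(segments, top_k=5):
    """Return one term cloud per segment: the top_k (term, weight) pairs.

    The weight is term frequency times log(1 + N / df), where N is the number
    of segments and df the number of segments containing the term. This
    weighting is a hypothetical choice for the sketch.
    """
    counts = [Counter(tokenize(segment)) for segment in segments]
    n_segments = len(segments)

    # Document frequency: in how many segments does each term occur?
    df = Counter()
    for segment_counts in counts:
        df.update(segment_counts.keys())

    clouds = []
    for segment_counts in counts:
        weighted = [
            (term, tf * math.log(1 + n_segments / df[term]))
            for term, tf in segment_counts.items()
        ]
        # Highest weight first; ties broken alphabetically for determinism.
        weighted.sort(key=lambda pair: (-pair[1], pair[0]))
        clouds.append(weighted[:top_k])
    return clouds
```

Calling `term_clouds(list_of_segment_transcripts)` yields one ranked term list per segment, which an interface could render with font size proportional to weight. Terms that occur in every segment get a low weight, so each cloud emphasizes what distinguishes its segment from the rest of the episode.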

Listening to ''Gathering Dust'', by The Durutti Column (Play Count: 70)

SIGIR 2008 workshop paper online (3)

Integrating Contextual Factors into Topic-centric Retrieval Models for Finding Similar Experts by Katja Hofmann, Krisztian Balog, Toine Bogers, and Maarten de Rijke is available online now. Expert finding has been addressed from multiple viewpoints, including expertise seeking and expert retrieval. The focus of expertise seeking has mostly been on descriptive or predictive models, for example to identify what factors affect human decisions on locating and selecting experts. In expert retrieval the focus has been on algorithms similar to document search, which identify topical matches based on the content of documents associated with experts.

We report on a pilot study on an expert finding task in which we explore how contextual factors identified by expertise seeking models can be integrated with topic-centric retrieval algorithms and examine whether they can improve retrieval performance for this task. We focus on the task of similar expert finding: given a small number of example experts, find similar experts. Our main finding is that, while topical knowledge is the most important factor, human subjects also consider other factors, such as reliability, up-to-dateness, and organizational structure. We find that integrating these factors into topical retrieval models can significantly improve retrieval performance.

Listening to ''BWV 0826 Partita #2 in c-moll - 5. Rondeaux'', by Pieter-Jan Belder, harpsichord (Play Count: 6)

SIGIR 2008 workshop paper online (2)

Blogger, Stick to your Story: Modeling Topical Noise in Blogs with Coherence Measures by Jiyin He, Wouter Weerkamp, Martha Larson, and Maarten de Rijke is available now. Topical noise in blogs arises when bloggers digress from the central topical thrust of their blogs. We introduce a method to explicitly incorporate a model of topical noise into a language modeling approach to the task of blog distillation. Topical noise is integrated into the model using a coherence score, which reflects the tightness of the topical structure of a blog. Tests performed on the TRECBlog06 corpus show that a naive integration of the coherence score as a blog prior fails to achieve performance improvements. Instead, we develop a set of more sophisticated models in which the coherence score is weighted by a function of the blog retrieval score. The proposed models help improve the effectiveness of our language modeling approach to the blog distillation task.

Listening to ''Run to Yuki'', by Yuji Nomi (Play Count: 3)

SIGIR 2008 workshop paper online

Named Entity Normalization in User Generated Content by Valentin Jijkoun, Mahboob Khalid, Maarten Marx and Maarten de Rijke is available online now. Named entity recognition is important for semantically oriented retrieval tasks, such as question answering, entity retrieval, biomedical retrieval, trend detection, and event and entity tracking. In many of these tasks it is important to be able to accurately normalize the recognized entities, i.e., to map surface forms to unambiguous references to real-world entities. Within the context of structured databases, this task (known as record linkage and data de-duplication) has been a topic of active research for more than five decades. For edited content, such as news articles, the named entity normalization (NEN) task has recently attracted considerable attention. We consider the task in the challenging context of user-generated content (UGC), where it forms a key ingredient of tracking and media-analysis systems.

A baseline NEN system from the literature (one that normalizes surface forms to Wikipedia pages) performs considerably worse on UGC than on edited news: accuracy drops from 80% to 65% for a Dutch-language data set and from 94% to 77% for English. We identify several sources of errors: entity recognition errors, multiple ways of referring to the same entity, and ambiguous references.

To address these issues we propose five improvements to the baseline NEN algorithm, arriving at a language-independent NEN system that achieves overall accuracy scores of 90% on the English data set and 89% on the Dutch data set. We show that each of the improvements contributes to the overall score of our improved NEN algorithm, and conclude with an error analysis on both Dutch- and English-language UGC. The NEN system is efficient and runs with very modest computational requirements.

Web lecture fragments

The University's ICTO program (ICT for education) has decided to fund a project led by Martha Larson aimed at generating reusable snippets from video lectures. The proposal (and planned work) is based in part on ISLA-tv.

Listening to ''Sketch For Summer'', by The Durutti Column (Play Count: 68)