SIGIR 2008 workshop paper online (4)
June 28, 2008 17:03 Filed in: Papers
Using Term Clouds to Represent Segment-Level
Semantic Content of Podcasts by Marguerite
Fuller, Manos Tsagkias, Eamonn Newman, Jana Besser,
Martha Larson, Gareth Jones and Maarten de Rijke is
available online now. Spoken
audio, like any time-continuous medium, is
notoriously difficult to browse or skim without
support of an interface providing semantically
annotated jump points to signal the user where
to listen in. Creation of time-aligned metadata
by human annotators is prohibitively expensive,
motivating the investigation of representations
of segment-level semantic content based on
transcripts generated by automatic speech
recognition (ASR). This paper examines the
feasibility of using term clouds to provide
users with a structured representation of the
semantic content of podcast episodes. Podcast
episodes are visualized as a series of
sub-episode segments, each represented by a term
cloud derived from a transcript generated by
automatic speech recognition (ASR). Quality of
segment-level term clouds is measured
quantitatively and their utility is investigated
using a small-scale user study based on
human-labeled segment boundaries. Since the
segment-level clouds generated from
ASR-transcripts prove useful, we examine an
adaptation of text tiling techniques to speech
in order to be able to generate segments as part
of a completely automated indexing and
structuring system for browsing of spoken audio.
Results demonstrate that the segments generated
are comparable with human-selected segment
boundaries.
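To give a rough idea of what a segment-level term cloud involves, here is a minimal sketch in Python. It is not the method from the paper: it assumes whitespace tokenization, treats each segment as its own document, and ranks terms with a smoothed tf-idf weight. The function name `segment_term_clouds` and the `top_k` parameter are made up for this illustration.

```python
import math
from collections import Counter


def segment_term_clouds(segments, top_k=5):
    """Return the top_k (term, weight) pairs per segment.

    Assumption: a plain smoothed tf-idf weighting, with every segment
    treated as one document. This is an illustrative stand-in, not the
    weighting scheme used in the paper.
    """
    tokenized = [seg.lower().split() for seg in segments]
    n = len(tokenized)

    # Document frequency: in how many segments does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    clouds = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Terms that occur in every segment get weight 0.
        weights = {
            term: count * math.log((1 + n) / (1 + df[term]))
            for term, count in tf.items()
        }
        # Sort by descending weight, break ties alphabetically.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        clouds.append(ranked[:top_k])
    return clouds


if __name__ == "__main__":
    # Hypothetical ASR output for two segments of one episode.
    episode = [
        "the podcast talks about music and music history",
        "the podcast talks about football and football scores",
    ]
    for i, cloud in enumerate(segment_term_clouds(episode, top_k=3)):
        print(i, cloud)
```

In a real interface the weights would drive font size in the cloud, and the segments would come from human labeling or from an automatic segmenter such as a text tiling variant.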
Listening to ''Gathering Dust'', by The Durutti Column (Play Count: 70)
SIGIR 2008 workshop paper online (3)
June 27, 2008 06:34 Filed in: Papers
Integrating Contextual Factors into Topic-centric
Retrieval Models for Finding Similar Experts by
Katja Hofmann, Krisztian Balog, Toine Bogers, and
Maarten de Rijke is available online now. Expert
finding has been addressed from multiple
viewpoints, including expertise seeking and
expert retrieval. The focus of expertise seeking
has mostly been on descriptive or predictive
models, for example to identify what factors
affect human decisions on locating and selecting
experts. In expert retrieval the focus has been
on algorithms similar to document search, which
identify topical matches based on the content of
documents associated with experts.
We report on a pilot study on an expert finding task in which we explore how contextual factors identified by expertise seeking models can be integrated with topic-centric retrieval algorithms and examine whether they can improve retrieval performance for this task. We focus on the task of similar expert finding: given a small number of example experts, find similar experts. Our main finding is that, while topical knowledge is the most important factor, human subjects also consider other factors, such as reliability, up-to-dateness, and organizational structure. We find that integrating these factors into topical retrieval models can significantly improve retrieval performance.
Listening to ''BWV 0826 Partita #2 in c-moll - 5. Rondeaux'', by Pieter-Jan Belder, harpsichord (Play Count: 6)
SIGIR 2008 Workshop paper online (2)
June 21, 2008 08:11 Filed in: Papers
Blogger, Stick to your Story: Modeling Topical
Noise in Blogs with Coherence Measures by Jiyin
He, Wouter Weerkamp, Martha Larson, and Maarten de
Rijke is available now. Topical noise in
blogs arises when bloggers digress from the
central topical thrust of their blogs. We
introduce a method to explicitly incorporate a
model of topical noise into a language modeling
approach to the task of blog distillation.
Topical noise is integrated into the model using
a coherence score, which reflects the tightness
of the topical structure of a blog. Tests
performed on the TRECBlog06 corpus show that a
naive integration of the coherence score as blog
prior fails to achieve performance improvements.
Instead, we develop a set of more sophisticated
models in which the coherence score is weighted
by a function of the blog retrieval score. The
proposed models help improve effectiveness of
our language modeling approach to the blog
distillation task.
Listening to ''Run to Yuki'', by Yuji Nomi (Play Count: 3)
SIGIR 2008 Workshop paper online
June 20, 2008 15:52 Filed in: Papers
Named Entity Normalization in User Generated
Content by Valentin Jijkoun, Mahboob Khalid,
Maarten Marx and Maarten de Rijke is available online now. Named
entity recognition is important for semantically
oriented retrieval tasks, such as question
answering, entity retrieval, biomedical
retrieval, trend detection, and event and entity
tracking. In many of these tasks it is important
to be able to accurately normalize the
recognized entities, i.e., to map surface forms
to unambiguous references to real world
entities. Within the context of structured
databases, this task (known as record linkage
and data de-duplication) has been a topic of
active research for more than five decades. For
edited content, such as news articles, the named
entity normalization (NEN) task is one that has
recently attracted considerable attention. We
consider the task in the challenging context of
user generated content (UGC), where it forms a
key ingredient of tracking and media-analysis
systems.
A baseline NEN system from the literature (that normalizes surface forms to Wikipedia pages) performs considerably worse on UGC than on edited news: accuracy drops from 80% to 65% for a Dutch language data set and from 94% to 77% for English. We identify several sources of errors: entity recognition errors, multiple ways of referring to the same entity and ambiguous references.
To address these issues we propose five improvements to the baseline NEN algorithm, to arrive at a language independent NEN system that achieves overall accuracy scores of 90% on the English data set and 89% on the Dutch data set. We show that each of the improvements contributes to the overall score of our improved NEN algorithm, and conclude with an error analysis on both Dutch and English language UGC. The NEN system is computationally efficient and runs with very modest computational requirements.
iTunes is not playing.
Web lecture fragments
June 18, 2008 22:09 Filed in: Work
The University's ICTO program (ICT for education) has
decided to fund a project led by Martha Larson aimed
at generating re-usable snippets from video lectures.
The proposal (and planned work) is based in part on
ISLA-tv.
Listening to ''Sketch For Summer'', by The Durutti Column (Play Count: 68)



