Pseudo Test Collections for Training and Tuning Microblog Rankers

Our paper “Pseudo Test Collections for Training and Tuning Microblog Rankers” by Richard Berendsen, Manos Tsagkias, Maarten de Rijke, and Wouter Weerkamp has been accepted at SIGIR 2013, in Dublin, Ireland, 28 July–1 August. The abstract follows.

Recent years have witnessed a persistent interest in generating pseudo test collections, both for training and evaluation purposes. We describe a method for generating queries and relevance judgments for microblog search in an unsupervised way. Our starting point is this intuition: tweets with a hashtag are relevant to the topic covered by the hashtag and hence to a suitable query derived from the hashtag. Our baseline method selects all commonly used hashtags, and all associated tweets as relevance judgments; we then generate a query from these tweets. Next, we generate a timestamp for each query, allowing us to use temporal information in the training process. We then enrich the generation process with knowledge derived from an editorial test collection for microblog search.
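To make the baseline intuition concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function name, the `min_tweets` threshold standing in for "commonly used", and the way query terms are scored are all assumptions made for illustration, and timestamp generation is left out.

```python
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")
TOKEN = re.compile(r"[a-z0-9]+")


def build_pseudo_collection(tweets, min_tweets=2, query_len=3):
    """Build {hashtag: {"query": [...], "relevant": [...]}} from (id, text) pairs.

    Assumption: a hashtag counts as "commonly used" when at least
    `min_tweets` tweets carry it. The real selection criteria may differ.
    """
    by_tag = defaultdict(list)
    background = Counter()
    for tweet_id, text in tweets:
        lowered = text.lower()
        tags = set(HASHTAG.findall(lowered))
        # Drop the hashtags themselves so they cannot leak into the query.
        terms = TOKEN.findall(HASHTAG.sub(" ", lowered))
        background.update(terms)
        for tag in tags:
            by_tag[tag].append((tweet_id, terms))

    collection = {}
    for tag, tagged in by_tag.items():
        if len(tagged) < min_tweets:
            continue
        local = Counter(term for _, terms in tagged for term in terms)
        # Hypothetical scoring: frequent within the hashtag, rare elsewhere.
        ranked = sorted(
            local, key=lambda t: (-local[t] * local[t] / background[t], t)
        )
        collection[tag] = {
            "query": ranked[:query_len],
            "relevant": [tweet_id for tweet_id, _ in tagged],
        }
    return collection


if __name__ == "__main__":
    toy_tweets = [  # made-up tweets, for illustration only
        (1, "Great keynote on ranking at #sigir"),
        (2, "Ranking talk at #sigir was great"),
        (3, "Coffee time #monday"),
    ]
    print(build_pseudo_collection(toy_tweets))
```

Every tweet carrying a selected hashtag becomes a relevance judgment for the query generated from that hashtag, which is the intuition stated above.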

We use our pseudo test collections in two ways. First, we tune parameters of a variety of well-known retrieval methods on them. Correlations with parameter sweeps on an editorial test collection are high on average, with a large variance over retrieval algorithms. Second, we use the pseudo test collections as training sets in a learning to rank scenario. Performance close to the training error on the editorial collection is achieved in all cases. Our results demonstrate the utility of tuning and training microblog search retrieval algorithms on automatically generated training material.
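One way to read "correlations with parameter sweeps" is sketched below: score the same parameter settings on both collections and check whether the two collections order the settings alike. The abstract does not say which correlation measure is used, so Kendall's tau is an assumption here, and the scores in the demo are made up.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equally long lists of scores.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical effectiveness scores for four settings of one parameter.
    on_pseudo_collection = [0.21, 0.25, 0.27, 0.26]
    on_editorial_collection = [0.30, 0.33, 0.36, 0.34]
    print(kendall_tau(on_pseudo_collection, on_editorial_collection))
```

A value near 1 would mean the pseudo test collection picks out the same good parameter settings as the editorial one, even if the absolute scores differ.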

We are working towards releasing a pre-print soon.

Posted in Publications

Periscope: Tweet highlights before they happen

Are you fed up with all the clutter on Twitter? I am too. I decided to cut through the crap, and develop a system that shows the most noteworthy information at any point in time: Periscope.

Periscope crawls public tweets written in English, and monitors, in real-time, all the words in these tweets. It spots topics that are about to flare, and shows you the most representative tweet for this topic as of the past minute. The only thing it doesn’t do is to reload the page for you :)
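For the curious, here is a toy sketch of what spotting a word that is about to flare could look like. This is not Periscope's actual algorithm, which is not described in this post: the class name, the per-minute windows, and the `ratio` and `min_count` thresholds are all assumptions chosen for illustration.

```python
from collections import Counter, deque


class BurstDetector:
    """Flag words whose count this minute far exceeds their recent average."""

    def __init__(self, history=5, ratio=3.0, min_count=3):
        self.windows = deque(maxlen=history)  # word counts of past minutes
        self.ratio = ratio
        self.min_count = min_count

    def observe(self, minute_tokens):
        """Feed the words seen in the latest minute; return bursting words."""
        current = Counter(minute_tokens)
        bursting = []
        if self.windows:  # nothing to compare against in the first minute
            for word, count in current.items():
                baseline = sum(w[word] for w in self.windows) / len(self.windows)
                # Adding 1 to the baseline avoids dividing by zero for new words.
                if count >= self.min_count and count / (baseline + 1.0) >= self.ratio:
                    bursting.append(word)
        self.windows.append(current)
        return sorted(bursting)


if __name__ == "__main__":
    detector = BurstDetector(history=3)
    detector.observe(["coffee", "monday"])
    print(detector.observe(["goal"] * 6 + ["coffee"]))
```

A real system also needs to pick a representative tweet per topic and deal with spam and topic threading, which are the hard parts discussed below.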

While developing Periscope I realized that combating spam on Twitter is non-trivial. I have implemented anti-spam mechanisms, but they are still immature, and many of the topics shown are spam. Later versions will become more spam-resistant.

An extremely important issue for me is topic threading, i.e., how topics evolve, merge, or split apart. I have built an alpha version of topic threading into Periscope. It works sometimes, but not always. If you bump into a topic with lots of past, unrelated summaries, then you have probably run into one of these cases.

I invite you to see what the world will be talking about, right now.

Posted in Uncategorized

Mining Social Media: Tracking Content and Predicting Behavior

Mining Social Media: Tracking Content and Predicting Behavior, Manos Tsagkias. Ph.D. thesis, University of Amsterdam, 2012.

The advent of social media has established a symbiotic relationship between social media and online news. This relationship can be leveraged for tracking news content, and predicting behavior with tangible real-world applications, e.g., online reputation management, ad pricing, news ranking, and media analysis. In this thesis we focus on tracking news content in social media, and predicting user behavior.

In the first part, we develop methods for tracking content which build upon and extend practices in Information Retrieval. We begin with discovering social media posts that discuss a news article yet do not provide a hyperlink to it. Our methods model news articles using several channels of information, either endogenous or exogenous to the article. These models are then used to query an index of social media posts. During this process we found that the query models are close in size to the documents to be retrieved, violating a standard assumption of language modeling. We correct for this discrepancy by introducing two hypergeometric language models for modeling both queries and documents to be retrieved.

In the second part, we focus on predicting behavior. First we look at predicting listeners’ preference in spoken user generated content, namely, podcasts. Then, we predict the popularity of news articles from several news agents in terms of the volume of comments they receive. We develop models for predicting the popularity of an article both before and after it is published. Finally, we look at a different aspect of news impact: how reading a news article affects future user browsing behavior. In each setting, we find patterns that characterize the underlying behavior and extract features that we then use to establish models for predicting online behavior.

I will defend my Ph.D. thesis on Wednesday, 5 December 2012, at 14:00 (GMT+1), at Agnietenkapel, Amsterdam. You are most welcome to join. In the meantime, please feel free to grab a copy, and cite the book:

@phdthesis{tsagkias2012-thesis,
Title = {Mining Social Media: Tracking Content and Predicting Behavior},
Author = {Manos Tsagkias},
Year = {2012},
School = {University of Amsterdam}
}

Posted in Uncategorized

PhD thesis submitted

Tuesday, 16 October 2012: Success! The committee members approved the thesis without comments. I’m finishing up a few things before sending the book to the printer, and then we’re on the final straight before the defense. Mark your calendars: Wednesday, 5 December at 14:00 at Agnietenkapel, in Amsterdam.

Yes, it is a fact, and it happened on Monday, 3 Sep 2012, at noon, to be exact. I submitted my PhD thesis!

My book “Mining Social Media: Tracking Content and Predicting Behavior” is now in the hands of the committee members, and waits to be read. The committee has 6 weeks to go through all of the 187 pages of content before they send their feedback and announce their final decision.

I will defend my work in Agnietenkapel, in Amsterdam, on 5 December 2012 at 14:00. If you happen to be in Amsterdam, you are very welcome to drop by!

Posted in Uncategorized

Language Intent Models for Inferring User Browsing Behavior

I am very happy that our paper “Language Intent Models for Inferring User Browsing Behavior” by Manos Tsagkias and Roi Blanco has been accepted at SIGIR 2012, which will be held in Portland, Oregon, 12–16 August 2012. The paper was realized during my three-month internship at Yahoo! Research Barcelona during September–December 2011. The abstract follows:

Modeling user browsing behavior is an active research area with
tangible real-world applications, e.g., organizations can adapt
their online presence to their visitors’ browsing behavior with
positive effects on user engagement and revenue. We concentrate on
online news agents, and present a semi-supervised method for
predicting news articles that a user will visit after reading an
initial article. Our method tackles the problem using language
intent models trained on historical data which can cope
with unseen articles. We evaluate our method on a large set of
articles and in several experimental settings. Our results
demonstrate the utility of language intent models for predicting
user browsing behavior within online news sites.

Download the PDF, or the BibTeX.

Posted in Publications

Generating Pseudo Test Collections for Learning to Rank Scientific Articles

Our paper “Generating Pseudo Test Collections for Learning to Rank Scientific Articles” by Richard Berendsen, Manos Tsagkias, Maarten de Rijke, and Edgar Meij has been accepted at CLEF 2012 in Rome, Italy. The abstract follows:

Pseudo test collections are automatically generated to provide
training material for learning to rank methods. We propose a
method for generating pseudo test collections in the domain of
digital libraries, where data is relatively sparse, but comes with
rich annotations. Our intuition is that documents are annotated
to make them better findable for certain information needs. We
use these annotations and the associated documents as a source for
pairs of queries and relevant documents.

We investigate how learning to rank performance varies when we use
different methods for sampling annotations, and show how our
pseudo test collection ranks systems compared to editorial topics
with editorial judgements.

Our results demonstrate that it is possible to train a learning to
rank algorithm on generated pseudo judgments. In some cases,
performance is on par with learning on manually obtained ground
truth.

Download the PDF, or the BibTeX.

Posted in Publications

Our poster won the Best Poster Award at ECIR 2012

I am very excited that our paper “Predicting IMDB Movie Ratings Using Social Media” by Andrei Oghina, Mathias Breuss, Manos Tsagkias, and Maarten de Rijke won the Best Poster Award at the 34th European Conference on Information Retrieval (ECIR 2012)!

Andrei and Mathias are currently nearing the end of their master's degrees at the University of Amsterdam. The idea for this poster came up while we were brainstorming a project for an IR course. Great job, guys!

Special thanks go to Yahoo Labs Barcelona for organizing this year’s ECIR, and for supporting this award.

Posted in Publications

Predicting IMDb Movie Ratings Using Social Media

UPDATE Apr 5, 2012: Our poster paper won the Best Poster Award at ECIR 2012.

It’s great news that our poster paper “Predicting IMDB Movie Ratings Using Social Media” by Andrei Oghina, Mathias Breuss, Manos Tsagkias, and Maarten de Rijke has been accepted at ECIR 2012, in Barcelona, Spain, 1–5 April 2012.

We predict IMDb movie ratings and consider two sets of features: surface and textual features. For the latter, we assume that no social media signal is isolated and use data from multiple channels that are linked to a particular movie, such as tweets from Twitter and comments from YouTube. We extract textual features from each channel to use in our prediction model and we explore whether data from either of these channels can help to extract a better set of textual features for prediction. Our best performing model is able to rate movies very close to the observed values.

I found it exciting that combining textual features from Twitter with the ratio of likes to dislikes on YouTube movie trailers led to very good predictions, with a mean average error of 0.35.

Posted in Publications

Yahoo! Barcelona

I moved to Barcelona in the first week of September for a three-month internship at Yahoo! Barcelona. I’m very excited about it, and looking forward to getting to know the people here and the problems they are working on. I think it will be an extremely valuable experience for my current and future work!

Posted in Uncategorized

Hypergeometric Language Models for Republished Article Finding

Update (13 June 2011): A pre-print version of the paper is now available: Hypergeometric Language Models for Republished Article Finding, BibTeX.

I’m very happy that our paper “Hypergeometric Language Models for Republished Article Finding” by me, Maarten de Rijke, and Wouter Weerkamp has been accepted as a full paper at SIGIR 2011, in Beijing, China, 24–28 July. The abstract follows:

Republished article finding is the task of identifying instances
of articles that have been published in one source and republished
more or less verbatim in another source, which is often a social
media source. We address this task as an ad hoc retrieval
problem, using the source article as a query. Our approach is
based on language modeling. We revisit the assumptions underlying
the unigram language model taking into account the fact that in
our setup queries are as long as complete news articles. We argue
that in this case, the underlying generative assumption of
sampling words from a document with replacement, i.e., the
multinomial modeling of documents, produces less accurate query
likelihood estimates.

To make up for this discrepancy, we consider distributions that
emerge from sampling without replacement: the central and
non-central hypergeometric distributions. We present two
retrieval models that build on top of these distributions: a log
odds model and a Bayesian model where document parameters are
estimated using the Dirichlet compound multinomial distribution.

We analyse the behavior of our new models using a corpus of news
articles and blog posts and find that for the task of republished
article finding, where we deal with queries whose length
approaches the length of the documents to be retrieved, models
based on distributions associated with sampling without
replacement outperform traditional models based on multinomial
distributions.
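To illustrate the difference between sampling with and without replacement, here is a small sketch in Python. It compares a plain multinomial query likelihood with a multivariate central hypergeometric one on bags of words. It is only an unsmoothed illustration under my own function names, not the log odds or Bayesian retrieval models from the paper, which also handle query terms missing from the document.

```python
from collections import Counter
from math import comb, exp, factorial, log

NEG_INF = float("-inf")


def multinomial_log_likelihood(query, doc):
    """log P(query | doc) when terms are drawn from doc WITH replacement."""
    n_doc = sum(doc.values())
    n_query = sum(query.values())
    log_p = log(factorial(n_query))
    for term, q in query.items():
        d = doc.get(term, 0)
        if d == 0:
            return NEG_INF  # unsmoothed: unseen term has zero probability
        log_p += q * log(d / n_doc) - log(factorial(q))
    return log_p


def hypergeometric_log_likelihood(query, doc):
    """log P(query | doc) when terms are drawn from doc WITHOUT replacement."""
    n_doc = sum(doc.values())
    n_query = sum(query.values())
    if n_query > n_doc:
        return NEG_INF  # cannot draw more terms than the document holds
    log_p = -log(comb(n_doc, n_query))
    for term, q in query.items():
        ways = comb(doc.get(term, 0), q)  # 0 if the query overdraws a term
        if ways == 0:
            return NEG_INF
        log_p += log(ways)
    return log_p


if __name__ == "__main__":
    doc = Counter("a a b".split())
    for text in ("a b", "a a b", "a a a"):
        query = Counter(text.split())
        print(
            text,
            exp(multinomial_log_likelihood(query, doc)),
            exp(hypergeometric_log_likelihood(query, doc)),
        )
```

When the query is the whole document, drawing without replacement reproduces it with certainty, while the multinomial spreads probability over draws that the document could never produce, such as three copies of a term it contains only twice. That gap grows as queries approach document length, which is the setting of republished article finding.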

Posted in Publications