Predicting IMDb Movie Ratings Using Social Media

It’s great news that our poster paper “Predicting IMDb Movie Ratings Using Social Media” with Andrei Oghina, Mathias Breuss, Manos Tsagkias, and Maarten de Rijke has been accepted at ECIR 2012, in Barcelona, Spain, 1–5 April 2012.

We predict IMDb movie ratings and consider two sets of features: surface and textual features. For the latter, we assume that no social media signal is isolated and use data from multiple channels that are linked to a particular movie, such as tweets from Twitter and comments from YouTube. We extract textual features from each channel to use in our prediction model and we explore whether data from either of these channels can help to extract a better set of textual features for prediction. Our best performing model is able to rate movies very close to the observed values.

I found it exciting that textual features from Twitter, together with the ratio of likes to dislikes on YouTube movie trailers, led to very good predictions, with a mean average error of 0.35.


Yahoo! Barcelona

I moved to Barcelona the first week of September for a three-month internship at Yahoo! Barcelona. I’m very excited about it, and looking forward to getting to know the people here and the problems they are working on. I think it will be an extremely valuable experience for my current and future work!


Hypergeometric Language Models for Republished Article Finding

Update (13 June 2011): A pre-print version of the paper is now available: Hypergeometric Language Models for Republished Article Finding, BibTex.

I’m very happy that our paper “Hypergeometric Language Models for Republished Article Finding” by me, Maarten de Rijke, and Wouter Weerkamp has been accepted as a full paper at SIGIR 2011, in Beijing, China, 24-28 July. The abstract follows:

Republished article finding is the task of identifying instances
of articles that have been published in one source and republished
more or less verbatim in another source, which is often a social
media source. We address this task as an ad hoc retrieval
problem, using the source article as a query. Our approach is
based on language modeling. We revisit the assumptions underlying
the unigram language model taking into account the fact that in
our setup queries are as long as complete news articles. We argue
that in this case, the underlying generative assumption of
sampling words from a document with replacement, i.e., the
multinomial modeling of documents, produces less accurate query
likelihood estimates.

To make up for this discrepancy, we consider distributions that
emerge from sampling without replacement: the central and
non-central hypergeometric distributions. We present two
retrieval models that build on top of these distributions: a log
odds model and a Bayesian model where document parameters are
estimated using the Dirichlet compound multinomial distribution.

We analyse the behavior of our new models using a corpus of news
articles and blog posts and find that for the task of republished
article finding, where we deal with queries whose length
approaches the length of the documents to be retrieved, models
based on distributions associated with sampling without
replacement outperform traditional models based on multinomial
distributions.


Twitter hashtags: Joint Translation and Clustering

Our poster paper “Twitter hashtags: Joint Translation and Clustering” by Simon Carter, me, and Wouter Weerkamp has been accepted at Web Science 2011, held in Koblenz, Germany, June 14-17.

We look at the potential problem of hashtag translation in Twitter. People tweet in different languages, which potentially also affects the hashtags they assign to their tweets. Hashtags are one of the main sources for mining trending topics in Twitter. To this end, being able to identify mappings of hashtags between languages can be beneficial for correcting trending topic statistics. As a proof of concept we provide an example use case: the hashtag #33mineros became popular in Spanish-speaking countries after 33 miners were trapped in a mine in Chile. However, English-speaking people were tweeting the hashtag #33miners. Can we translate hashtags? And how?


How People use Twitter in Different Languages

Our poster paper “How People use Twitter in Different Languages” by Wouter Weerkamp, Simon Carter, and me has been accepted at Web Science 2011, held in Koblenz, Germany, June 14-17.

We look at how people from different countries use language differently in Twitter. In particular, we report on statistics from hashtags, links, mentions, and replies derived from a small set of five languages: Dutch, English, German, French, and Spanish.


Incorporating Query Expansion and Quality Indicators in Searching Microblog Posts

Our short paper “Incorporating Query Expansion and Quality Indicators in Searching Microblog Posts” by Kamran Massoudi, me, Maarten de Rijke, and Wouter Weerkamp has been accepted for oral presentation at ECIR 2011, 19-21 April, Dublin. The abstract follows:

We propose a retrieval model for searching microblog posts for a given
topic of interest.  We develop a language modeling approach tailored
to microblogging characteristics, where redundancy-based IR methods
cannot be used in a straightforward manner.  We enhance this model
with two groups of quality indicators: textual and microblog specific.
Additionally, we propose a dynamic query expansion model for microblog
post retrieval.  Experimental results on Twitter data reveal the
usefulness of boolean search, and demonstrate the utility of quality
indicators and query expansion in microblog search.

Download preprint PDF, and BibTex.


Semi-Supervised Priors for Microblog Language Identification

Our paper “Semi-Supervised Priors for Microblog Language Identification” by Simon Carter, me, and Wouter Weerkamp has been accepted as a poster presentation at the DIR 2011 workshop in Amsterdam. The presentation will be on 4 February. The abstract follows:

Offering access to information in microblog posts requires successful language identification. Language identification on sparse and noisy data can be challenging. In this paper we explore the performance of a state-of-the-art n-gram-based language identifier, and we introduce two semi-supervised priors to enhance performance at microblog post level: (i) a blogger-based prior, using previous posts by the same blogger, and (ii) a link-based prior, using the pages linked to from the post. We test our models on five languages (Dutch, English, French, German, and Spanish), and a set of 1,000 tweets per language. Results show that our priors improve accuracy, but that there is still room for improvement.

Download the PDF.

BibTex:
@misc{dir2011-carter,
Title = {Semi-Supervised Priors for Microblog Language Identification},
Author = {Simon Carter and Manos Tsagkias and Wouter Weerkamp},
Year = {2011},
Month = {February},
}


Linking Online News and Social Media

Update (24 Nov 2010): A pre-print version of the paper is now available: Linking Online News and Social Media (PDF), BibTex. The ground truth we used for our experiments is also available.

Our paper “Linking Online News and Social Media” with Maarten de Rijke and Wouter Weerkamp has been accepted at WSDM 2011. I’m working towards the camera-ready version so there’s no PDF link for now, but here’s the abstract:

Much of what is discussed in social media is inspired by events in the news and, vice versa, social media provide us with a handle on the impact of news events. We address the following linking social media utterances task: given a news article, find social media utterances that implicitly reference it.

We follow a three-step approach: we derive multiple query models from a given source news article, which are then used to retrieve utterances from a target social media index, resulting in multiple ranked lists that we then merge into a single result list using data fusion techniques.

Query models are created by exploiting the structure of the source news article and by using explicitly linked social media utterances that are known to discuss the source article.

To combat query drift resulting from the large volume of text, either in the source news article itself or in social media utterances explicitly linked to it, we introduce a graph-based method for selecting discriminative terms.

For our experimental evaluation, we use data from Twitter, Digg, Delicious, the New York Times Community, Wikipedia, and the blogosphere to generate query models. We show that different query models, based on different data sources, provide complementary information and manage to retrieve different social media utterances from our target index. As a consequence, (article dependent) data fusion methods manage to significantly boost retrieval performance over individual approaches. Our graph-based term selection method is shown to help improve both effectiveness and efficiency.


KL-divergence of two documents

Let P and Q be two probability distributions of a discrete random variable. If the following two conditions hold:

  1. P and Q both sum to 1
  2. P(i) > 0 and Q(i) > 0 for every i

then we can define their KL-divergence as:

D_{KL}(P||Q) = \sum_{i}P(i)\log\frac{P(i)}{Q(i)}

and it has three properties:

  1. D_{KL}(P||Q) \neq D_{KL}(Q||P) (asymmetry)
  2. it is additive for independent distributions
  3. D_{KL} \geq 0 with D_{KL} = 0 iff P=Q
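The three properties can be checked numerically. The following is a minimal sketch, assuming distributions are given as equal-length lists of strictly positive probabilities; the function name kl_divergence and the example values are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(P||Q) for two equal-length lists of strictly positive probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Two toy distributions over three outcomes (illustrative values).
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]

d_pq = kl_divergence(P, Q)
d_qp = kl_divergence(Q, P)
print("D_KL(P||Q) = %.4f" % d_pq)   # differs from the line below (asymmetry)
print("D_KL(Q||P) = %.4f" % d_qp)
print("D_KL(P||P) = %.4f" % kl_divergence(P, P))  # zero when both are equal

# Additivity: for independent pairs, the divergence of the joint
# distributions equals the sum of the individual divergences.
P2 = [0.6, 0.4]
Q2 = [0.5, 0.5]
joint_p = [a * b for a in P for b in P2]
joint_q = [a * b for a in Q for b in Q2]
d_joint = kl_divergence(joint_p, joint_q)
d_parts = d_pq + kl_divergence(P2, Q2)
print("joint: %.6f, sum of parts: %.6f" % (d_joint, d_parts))
```

The natural logarithm is used here; any other base only rescales the values.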

Working with documents

We regard a document d as a discrete distribution of |d| random variables, where |d| is the number of words in the document. Now, let d_{1} and d_{2} be two documents whose KL-divergence we want to calculate. We run into two problems:

  1. we need to compute the KL-divergence twice due to asymmetry: D_{KL}(d_{1}||d_{2}) and D_{KL}(d_{2}||d_{1}).
  2. also, due to the 2nd constraint for defining KL-divergence, our calculations should only consider words occurring in both d_{1} and d_{2}.

Symmetric KL-divergence

To work around the asymmetry, we add up the divergences in both directions:

\begin{array}{rcl} D_{KL}(P||Q) + D_{KL}(Q||P) & = & \sum_{i}P(i)\log\frac{P(i)}{Q(i)} + \sum_{i}Q(i)\log\frac{Q(i)}{P(i)} \\ & = & \sum_{i}\left[P(i)\log\frac{P(i)}{Q(i)}+Q(i)\log\frac{Q(i)}{P(i)}\right]\\ & = & \sum_{i}\left[P(i)\log\frac{P(i)}{Q(i)}-Q(i)\log\frac{P(i)}{Q(i)}\right]\\ & = & \sum_{i}(P(i)-Q(i))\log\frac{P(i)}{Q(i)}\end{array}

Ok! It looks good! Now we need to compute KL-divergence only once for every pair of documents.
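As a quick sanity check of the derivation, here is a small sketch that compares the one-pass symmetric form with the sum of the two directed divergences. The function names and the toy distributions are assumptions made for illustration only.

```python
import math

def kl_divergence(p, q):
    """Directed divergence D_KL(P||Q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """One-pass symmetric form: sum_i (P(i) - Q(i)) * log(P(i) / Q(i))."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]

two_pass = kl_divergence(P, Q) + kl_divergence(Q, P)
one_pass = symmetric_kl(P, Q)
print("two passes: %.6f, one pass: %.6f" % (two_pass, one_pass))
print("swapped arguments: %.6f" % symmetric_kl(Q, P))
```

Both numbers agree up to floating-point error, and swapping the arguments of symmetric_kl does not change the result.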

Over which random variables?

Now let’s turn to how to handle documents with little or no overlap in their vocabularies. To illustrate the problem, consider the following documents:

d1:

This is a document

d2:

This is a sentence

For each document we remove stopwords (‘this’, ‘is’, ‘a’) so they become:

d1:

document

d2:

sentence

According to constraint 2, we need to operate on the intersection of the documents’ vocabularies: d_{1}\cap d_{2}=\emptyset. We end up with the empty set and therefore cannot compute the KL-divergence directly. In this case we can assign it a large number such as 1e33.

Let’s see what happens when we have larger documents.

d1:

Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source
file is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file.

d2:

In this case you must supply both a \left and a \right because the
delimiter height are made to match whatever is contained between the
two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it.

After stopword removal, lowercasing, and discarding words shorter than 2 characters, the documents become:

d1:

many research publications want you use bibtex better organizes
whole process suppose concreteness your source file tex basically
you create file bib containing bibliography run bibtex file

d2:

case you must supply both left right because delimiter height made
match whatever contained between two commands left doesn have actual
left delimiter you use left some reason

The vocabulary intersection of the documents consists of two terms: “use” and “you”. In d_{1} “use” occurs 1 time and “you” occurs 2 times. Surprisingly, in d_{2} “use” also occurs 1 time and “you” occurs 2 times too. The distributions D_{1} and D_{2} are equal, and therefore D_{KLsym}(D_{1}||D_{2}) = 0. So these documents are deemed equal! A better stopword list could have removed “use” and “you” and in that case the documents would have an infinite KL-divergence as in the first example. However it is easy to think of similar examples where stopword lists wouldn’t have been of much help.
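The pitfall can be reproduced in a few lines. This is a sketch that takes the two preprocessed documents above as given and assumes that term probabilities are renormalized over the shared vocabulary; the helper restricted_distribution is a hypothetical name.

```python
import collections
import math

# The two documents as they look after preprocessing (copied from above).
d1_tokens = ("many research publications want you use bibtex better organizes "
             "whole process suppose concreteness your source file tex basically "
             "you create file bib containing bibliography run bibtex file").split()
d2_tokens = ("case you must supply both left right because delimiter height made "
             "match whatever contained between two commands left doesn have actual "
             "left delimiter you use left some reason").split()

def restricted_distribution(counts, vocabulary):
    """Term probabilities renormalized over the given vocabulary only."""
    total = float(sum(counts[t] for t in vocabulary))
    return {t: counts[t] / total for t in vocabulary}

c1 = collections.Counter(d1_tokens)
c2 = collections.Counter(d2_tokens)
shared = sorted(set(c1) & set(c2))

p = restricted_distribution(c1, shared)
q = restricted_distribution(c2, shared)
sym_kl = sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in shared)

print("shared vocabulary:", shared)
print("symmetric KL on the intersection:", sym_kl)
```

Two documents about entirely different topics end up with a divergence of zero, which is exactly the behaviour we want to avoid.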

So, how can we overcome this problem?

Simple back-off

Since operating on the vocabulary intersection is not an option, we need to find a trick that allows us to consider the entire vocabulary of the documents. Smoothing comes to mind. Dirichlet and Laplacian smoothing are amongst the most popular smoothing techniques but after smoothing the probability distribution doesn’t sum up to 1 and violates the first constraint for defining KL-divergence.

Brigitte Bigi suggested a back-off smoothing method which keeps the probability distributions summing to 1 and also allows operating on the entire vocabulary. According to the proposed back-off method, the smoothed probability P'(t, d) of a term t in a document d is:

P'(t_{i},d) = \left\{ \begin{array}{ll} \gamma P(t_{i}|d) & \quad \text{if } t_{i} \text{ occurs in } d\\ \epsilon & \quad \text{otherwise}\\ \end{array} \right.

where

P(t_{i}|d) = \frac{tf(t_{i}, d)}{\sum_{x\in d}tf(t_{x},d)}

The interesting part is how \gamma and \epsilon are calculated. Suppose we smooth d_{2} so that it also covers the terms of d_{1}. To keep the smoothed term probabilities summing to 1, the following constraint should be met:

\sum_{i \in d_{2}}\gamma P(t_{i}|d_{2}) + \sum_{i \in d_{1}, i \notin d_{2}}\epsilon = 1

where \gamma is a normalization coefficient:

\gamma = 1 - \sum_{i \in d_{1}, i \notin d_{2}}\epsilon

and \epsilon is a positive number smaller than the minimum term probability occurring in either d_{1} or d_{2}.
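To make the role of \gamma and \epsilon concrete, here is a minimal sketch of the back-off step on two tiny documents. The function name backoff_distribution, the toy documents, and the 0.001 scaling factor for \epsilon are assumptions chosen for illustration; any positive value below the smallest term probability would satisfy the definition.

```python
import collections

def backoff_distribution(doc_counts, other_counts, epsilon):
    """Smooth doc_counts so that it also covers the terms of other_counts.

    Terms seen in the document get gamma * P(t|d); terms seen only in the
    other document get epsilon.
    """
    total = float(sum(doc_counts.values()))
    missing = set(other_counts) - set(doc_counts)
    gamma = 1.0 - len(missing) * epsilon  # normalization coefficient
    smoothed = {t: gamma * c / total for t, c in doc_counts.items()}
    for t in missing:
        smoothed[t] = epsilon
    return smoothed

# Two toy documents (hypothetical, already tokenized).
d1 = collections.Counter("you create file bib run bibtex file".split())
d2 = collections.Counter("you use left delimiter left".split())

# epsilon must be smaller than the smallest term probability in either document.
min_prob = min(min(d1.values()) / float(sum(d1.values())),
               min(d2.values()) / float(sum(d2.values())))
epsilon = 0.001 * min_prob

smoothed_d2 = backoff_distribution(d2, d1, epsilon)
print("terms covered:", sorted(smoothed_d2))
print("sum of smoothed probabilities: %.12f" % sum(smoothed_d2.values()))
```

The smoothed distribution covers the union of both vocabularies and still sums to 1, so both constraints for defining the KL-divergence are met.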

The code

To illustrate the above, I wrote a small Python script.

import re, math, collections, sys

def tokenize(_str):
    stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is', 'are', 'will', 'in', 'it', 'to', 'that']
    tokens = collections.defaultdict(lambda: 0.)
    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2: continue
        if m in stopwords: continue
        tokens[m] += 1

    return tokens
#end of tokenize

def kldiv(_s, _t):
    if (len(_s) == 0):
        return 1e33

    if (len(_t) == 0):
        return 1e33

    ssum = 0. + sum(_s.values())
    slen = len(_s)

    tsum = 0. + sum(_t.values())
    tlen = len(_t)

    vocabdiff = set(_s.keys()).difference(set(_t.keys()))
    lenvocabdiff = len(vocabdiff)

    # epsilon: back-off probability for terms of _s that are missing from _t,
    # set well below the smallest observed term probability
    epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001

    # gamma: normalization coefficient that keeps the smoothed _t summing to 1
    gamma = 1 - lenvocabdiff * epsilon

    # print "_s: %s" % _s
    # print "_t: %s" % _t

    # Check that the unsmoothed term probabilities of each document sum to 1
    sc = sum([v/ssum for v in _s.values()])
    st = sum([v/tsum for v in _t.values()])

    if abs(sc - 1.) > 9e-6:
        print("Sum P: %e, Sum Q: %e" % (sc, st))
        print("*** ERROR: sc does not sum up to 1. Bailing out ..")
        sys.exit(2)
    if abs(st - 1.) > 9e-6:
        print("Sum P: %e, Sum Q: %e" % (sc, st))
        print("*** ERROR: st does not sum up to 1. Bailing out ..")
        sys.exit(2)

    div = 0.
    for t, v in _s.items():
        pts = v / ssum

        ptt = epsilon
        if t in _t:
            ptt = gamma * (_t[t] / tsum)

        ckl = (pts - ptt) * math.log(pts / ptt)

        div +=  ckl

    return div
#end of kldiv

d1 = """Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source
file is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file."""
d2 = r"""In this case you must supply both a \left and a \right because the
delimiter height are made to match whatever is contained between the
two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it."""

print("KL-divergence between d1 and d2: %s" % kldiv(tokenize(d1), tokenize(d2)))
print("KL-divergence between d2 and d1: %s" % kldiv(tokenize(d2), tokenize(d1)))

The output looks like this:

KL-divergence between d1 and d2: 6.52185430964
KL-divergence between d2 and d1: 6.51142363095

Now the KL-divergence is greater than zero, so the documents are no longer deemed identical. Good job. As for symmetry, the values for the two directions are not identical, since the script sums only over the terms of its first argument and smooths only the second, but they are sufficiently close.

Acknowledgments

The above is a compilation of knowledge found in Wikipedia and from the paper “Reducing the Plagiarism Detection Search Space on the basis of Kullback-Leibler Distance” by Alberto Barron-Cedeno, Paolo Rosso, and Jose-Miguel Benedi. Examples and code are mine and you are free to use them.


Computational science?

I was reading a tutorial on neuron simulation and I was happily surprised when I found 10 slides discussing the reproducibility of results in the computational sciences. The authors focus on computational neuroscience, but I can see similarities in the issues they raise in Information Retrieval too. First, they take a step back and revisit what makes science science:

Refutable hypotheses
Hypotheses must be stated with sufficient detail and precision so that one can devise meaningful tests or counterexamples.
Reproducible experiments
Experiments must be described and performed so carefully, that others can reproduce them. Genuine failure to reproduce results invalidates original findings.
Accumulation of knowledge
Accumulation of knowledge through exchange, evolution and (sometimes) revolution of ideas.

Then they turn to computational science, quoting the paper by Donoho et al. (2009) on Reproducible Research in Computational Harmonic Analysis:

The vast majority of results being generated by current computational science practice suffer a large and growing credibility gap: it is impossible to verify most of the computational results shown in conferences and papers.
… [C]urrent computational science practice does not generate routinely verifiable knowledge.
… Almost no time is devoted to explaining to the audience why one should believe that errors have been found and eliminated. The core of the presentation is not about the struggle to root out error–as it would be in mature fields–it is rather a sales pitch[.]
… How dare we imagine that computational science, as routinely practiced, is reliable! Many researchers using scientific computing are not even trying to follow a systematic, rigorous discipline that would in principle allow others to verify the claims they make.

Computing in Science & Engineering 11:8-18 (2009), doi: 10.1109/MCSE.2009.15

The authors give tips on how to better organize the work- and data-flow of the programs used in computational experiments, so that the results can be reproduced in the future, and on how to keep a log of what happened in the course of the experimentation.

I can see the difficulties of verifying results (or bugs in the implementation!) in computational science, especially given the sheer volume of submissions targeted at conferences and journals. Nevertheless, when the systems under investigation are complex, there are so many parameters one can change apart from the core model. For example, in Information Retrieval, one can perform tokenization in multiple ways, smoothing can be involved in more than one system component and might need component-dependent values, etc. No matter how small the impact of individual parameters or models on the entire system, their effects accumulate non-deterministically and can lead to results on either side of the performance spectrum.

Can we go through all possible values for each system parameter and all their combinations with different models? Unfortunately, no, not even if we considered only one model; it would take infinite time to study all possible states of the system: some parameters have unbounded values, some are sensitive to small perturbations while others are not, etc. Most authors choose the parameters of their system either arbitrarily, or based on prior work (which is the best in my opinion), or based on small-scale experiments using held-out data (second best). However, when the dataset changes we are no longer sure whether these values are still good, and we need to start from scratch.

It would be beneficial if experimentation testbeds existed[0], but as they are programs themselves, every new version would cancel results based on the previous version. We would then need “glue” papers to report on the performance change between versions and also give some sort of “tool” to port/translate results from previous versions to the current version. It sounds taxing indeed, but it may give us the opportunity to focus on the little details which we mostly leave out from our papers (mainly due to space constraints) and start building a well understood system built from prior research for future research.

[0] In IR we are fortunate to have a couple of big ones: Lucene and Indri.
