Update (13 June 2011): A pre-print version of the paper is now available: Hypergeometric Language Models for Republished Article Finding, BibTex.
I’m very happy that our paper “Hypergeometric Language Models for Republished Article Finding” by me, Maarten de Rijke, and Wouter Weerkamp has been accepted as a full paper at SIGIR 2011, in Beijing, China, 24-28 July. The abstract follows:
Republished article finding is the task of identifying instances
of articles that have been published in one source and republished
more or less verbatim in another source, which is often a social
media source. We address this task as an ad hoc retrieval
problem, using the source article as a query. Our approach is
based on language modeling. We revisit the assumptions underlying
the unigram language model taking into account the fact that in
our setup queries are as long as complete news articles. We argue
that in this case, the underlying generative assumption of
sampling words from a document with replacement, i.e., the
multinomial modeling of documents, produces less accurate query
likelihood estimates.

To make up for this discrepancy, we consider distributions that
emerge from sampling without replacement: the central and
non-central hypergeometric distributions. We present two
retrieval models that build on top of these distributions: a log
odds model and a Bayesian model where document parameters are
estimated using the Dirichlet compound multinomial distribution.

We analyse the behavior of our new models using a corpus of news
articles and blog posts and find that for the task of republished
article finding, where we deal with queries whose length
approaches the length of the documents to be retrieved, models
based on distributions associated with sampling without
replacement outperform traditional models based on multinomial
distributions.

Interesting work! I hope you’ve read our SPIRE’10 paper http://dx.doi.org/10.1007/978-3-642-16321-0_32, where, to the best of our knowledge, the idea of a hypergeometric language model was first introduced and used to solve an information retrieval problem similar to yours.
Regards,
Felipe Bravo-Marquez
Thank you, Felipe. I have read your work. It is indeed the first, to my knowledge, that uses the extended (also called Fisher’s non-central) hypergeometric distribution for IR. I found that sampling from the extended hypergeometric distribution becomes exponentially expensive as the sample size increases, and that there is ongoing research in statistics on making such sampling computationally tractable.
While researching the use of the hypergeometric distribution for IR, I found a paper by Wilbur dating back to 1993! Wilbur models the vocabulary intersection between the query and a set of relevant documents using the central hypergeometric distribution.
Little has been done since then, probably because the multinomial distribution is a good approximation to the hypergeometric for most IR scenarios, i.e., when the sample size (query) is considerably smaller than the population size (document). However, as we show in the paper, in the case of document-long queries, the multinomial approximation no longer holds, and the use of the “vanilla” hypergeometric distribution is required.
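To make the last point concrete, here is a minimal sketch (not the retrieval model from the paper) that compares sampling with and without replacement for a single term. It is a simplification: the paper works with multivariate distributions over the whole vocabulary, while this looks only at the univariate case, where the multinomial reduces to a binomial. The document length, term count, and query lengths are made-up illustrative values, and the function names are my own.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(term occurs k times in a sample of n words drawn WITHOUT
    replacement from a document of N words containing the term K times)."""
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so values of k outside the support get probability 0.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k, N, K, n):
    """Same event, but sampling WITH replacement (the multinomial
    assumption, restricted to a single term)."""
    p = K / N
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def total_variation(N, K, n):
    """Total variation distance between the two distributions over k = 0..n."""
    return 0.5 * sum(
        abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, N, K, n))
        for k in range(n + 1)
    )

# Hypothetical document: 200 words, 20 of which are the term of interest.
N, K = 200, 20
tv_short_query = total_variation(N, K, n=5)    # query much shorter than document
tv_long_query = total_variation(N, K, n=180)   # query almost as long as document

print(f"short query: TV = {tv_short_query:.4f}")
print(f"long query:  TV = {tv_long_query:.4f}")
```

With a short query the two distributions are nearly indistinguishable, which is why the multinomial works well in standard ad hoc retrieval. As the query length approaches the document length, the distance between them grows substantially. The numbers are kept small so that the naive binomial computation stays within floating-point range.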