KL-divergence of two documents

Let P and Q be two probability distributions of a discrete random variable. If the following two properties hold:

  1. when P and Q both sum to 1
  2. and for any i such that P(i) > 0 and Q(i) > 0

then, we can define their KL-divergence as:

D_{KL}(P||Q) = \sum_{i}P(i)log\frac{P(i)}{Q(i)}

and it has three properties:

  1. D_{KL}(P||Q) \neq D_{KL}(Q||P) (asymmetry)
  2. it is additive for independent distributions
  3. D_{KL} \geq 0 with D_{KL} = 0 iff P=Q

Working with documents

We regard a document d as discrete distribution of |d| random variables, where |d| is the number of words in the document. Now, let d_{1} and d_{2} be two documents for which we want to calculate their KL-divergence. We run into two problems:

  1. we need to compute the KL-divergence twice due to asymmetry: D_{KL}(d_{1}||d_{2}) and D_{KL}(d_{2}||d_{1}).
  2. also, due to the 2nd constraint for defining KL-divergence, our calculations should only consider words occurring in both d_{1} and d_{2}.

Symmetric KL-divergence

We start from the 2nd property of KL-divergence:

\begin{array}{rcl} D_{KL}(P||Q) + D_{KL}(Q||P) & = & \sum_{i}P(i)log\frac{P(i)}{Q(i)} + \sum_{i}Q(i)log\frac{Q(i)}{P(i)} \\& = & \sum_{i}P(i)log\frac{P(i)}{Q(i)}+Q(i)log\frac{Q(i)}{P(i)}\\ & = & \sum_{i}P(i)log\frac{P(i)}{Q(i)}-Q(i)log\frac{P(i)}{Q(i)}\\ & = & \sum_{i}(P(i)-Q(i))log\frac{P(i)}{Q(i)}\end{array}

Ok! It looks good! Now we need to compute KL-divergence only once for every pair of documents.

Over which random variables?

Now let’s turn into how to handle documents with no or little overlapping vocabularies. To illustrate the problem, consider the following documents:

d1:

This is a document

d2:

This is a sentence

For each document we remove stopwords (‘this’, ‘is’, ‘a’) so they become:

d1:

document

d2:

sentence

According to constraint 2, we need to operate on the intersection of the documents’ vocabularies: d_{1}\cap d_{2}=\emptyset. We end up with the empty set and therefore we cannot compute directly the KL-divergence. In this case we can assign it a large number like 1e33.

Let’s see what happens when we have larger documents.

d1:

Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source
file is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file.

d2:

In this case you must supply both a \left and a \right because the
delimiter height are made to match whatever is contained between the
two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it.

After stopword removal, lowercasing and discarding words less than 2 characters, the documents become:

d1:

many research publications want you use bibtex better organizes
whole process suppose concreteness your source file tex basically
you create file bib containing bibliography run bibtex file

d2:

case you must supply both left right because delimiter height made
match whatever contained between two commands left doesn have actual
left delimiter you use left some reason

The vocabulary intersection of the documents consists of two terms: “use” and “you”. In d_{1} “use” occurs 1 time and “you” occurs 2 times. Surprisingly, in d_{2} “use” also occurs 1 time and “you” occurs 2 times too. The distributions D_{1} and D_{2} are equal, and therefore D_{KLsym}(D_{1}||D_{2}) = 0. So these documents are deemed equal! A better stopword list could have removed “use” and “you” and in that case the documents would have an infinite KL-divergence as in the first example. However it is easy to think of similar examples where stopword lists wouldn’t have been of much help.

So, how can we overcome this problem?

Simple back-off

Since operating on the vocabulary intersection is not an option, we need to find a trick that allows us to consider the entire vocabulary of the documents. Smoothing comes to mind. Dirichlet and Laplacian smoothing are amongst the most popular smoothing techniques but after smoothing the probability distribution doesn’t sum up to 1 and violates the first constraint for defining KL-divergence.

Brigette Biggi suggested a back off smoothing method which keeps the probability distributions sums to 1 and also allows operating on the entire vocabulary. According to their proposed back-off method, the smoothed probability P'(t, d) of a term t in a document d is:

P'(t_{i},d) = \left\{ \begin{array}{ll} \gamma P(t_{i}|d) & \quad \text{if ti occurs in d}\\ \epsilon & \quad \text{otherwise}\\ \end{array} \right.

where

P(t_{i}|d) = \frac{tf(t_{i}, d)}{\sum_{x\in d}tf(t_{x},d)}

the interesting part is on how \gamma and \epsilon are calculated. In order to keep the term probabilities for d_{1} and d_{2} summing up to 1, the following constraint should be met:

\sum_{i \in d}\gamma P(t_{i}|d) + \sum_{i \in d_{1}, i \notin d_{2}}\epsilon = 1

where \gamma is a normalization coefficient:

\gamma = 1 - \sum_{i \in d_{1}, i \notin d_{2}}\epsilon

and \epsilon is a positive number smaller than the minimum term probability occurring in either d_{1} or d_{2}.

The code

To illustrate the above, I wrote a small Python script.

import re, math, collections

def tokenize(_str):
    stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is', 'are', 'will', 'in', 'it', 'to', 'that']
    tokens = collections.defaultdict(lambda: 0.)
    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2: continue
        if m in stopwords: continue
        tokens[m] += 1

    return tokens
#end of tokenize

def kldiv(_s, _t):
    if (len(_s) == 0):
        return 1e33

    if (len(_t) == 0):
        return 1e33

    ssum = 0. + sum(_s.values())
    slen = len(_s)

    tsum = 0. + sum(_t.values())
    tlen = len(_t)

    vocabdiff = set(_s.keys()).difference(set(_t.keys()))
    lenvocabdiff = len(vocabdiff)

    """ epsilon """
    epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001

    """ gamma """
    gamma = 1 - lenvocabdiff * epsilon

    # print "_s: %s" % _s
    # print "_t: %s" % _t

    """ Check if distribution probabilities sum to 1"""
    sc = sum([v/ssum for v in _s.itervalues()])
    st = sum([v/tsum for v in _t.itervalues()])

    if sc < 9e-6:
        print "Sum P: %e, Sum Q: %e" % (sc, st)
        print "*** ERROR: sc does not sum up to 1. Bailing out .."
        sys.exit(2)
    if st < 9e-6:
        print "Sum P: %e, Sum Q: %e" % (sc, st)
        print "*** ERROR: st does not sum up to 1. Bailing out .."
        sys.exit(2)

    div = 0.
    for t, v in _s.iteritems():
        pts = v / ssum

        ptt = epsilon
        if t in _t:
            ptt = gamma * (_t[t] / tsum)

        ckl = (pts - ptt) * math.log(pts / ptt)

        div +=  ckl

    return div
#end of kldiv

d1 = """Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source
file is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file."""
d2 = """In this case you must supply both a \left and a \right because the
delimiter height are made to match whatever is contained between the
two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it."""

print "KL-divergence between d1 and d2:", kldiv(tokenize(d1), tokenize(d2))
print "KL-divergence between d2 and d1:", kldiv(tokenize(d2), tokenize(d1))

The output looks like this:

KL-divergence between d1 and d2: 6.52185430964
KL-divergence between d2 and d1: 6.51142363095

Now, KL-divergence is greater than zero, so the documents are not thought to be the same as before! Good job. Looking at KL symmetry, although the divergence of the two pairs is not identical, it is sufficiently close.

Acknowledgments

The above is a compilation of knowledge found in Wikipedia and from the paper “Reducing the Plagiarism Detection Search Space on the basis of Kullback-Leibler Distance” by Alberto Barron-Cedeno, Paolo Rosso, and Jose-Miguel Benedi. Examples and code are mine and you are free to use them.

This entry was posted in Uncategorized. Bookmark the permalink. Follow any comments here with the RSS feed for this post.

5 Responses to KL-divergence of two documents

  1. JS says:

    Thanks for posting this. Very useful. Code samples showing how to compute the K-L divergence seem to be rare.

  2. priya says:

    Thanks for posting. Could understand KLD better.

  3. Pingback: Nick Diakopoulos » Comment Readers Want Relevance!

  4. Lars says:

    A bit of NumPy can greatly speed up your KL-d calculations; see my gist (nowhere near as full-featured as this version).

Leave a Reply

Your email is never published nor shared. Required fields are marked *

*

You may use these HTML tags and attributes:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>