Let
and
be two probability distributions of a discrete random variable. If the following two properties hold:
- when
and
both sum to 1
- and for any
such that
and 
then, we can define their KL-divergence as:

and it has three properties:
(asymmetry)
- it is additive for independent distributions
with
iff 
Working with documents
We regard a document
as discrete distribution of
random variables, where
is the number of words in the document. Now, let
and
be two documents for which we want to calculate their KL-divergence. We run into two problems:
- we need to compute the KL-divergence twice due to asymmetry:
and
.
- also, due to the 2nd constraint for defining KL-divergence, our calculations should only consider words occurring in both
and
.
Symmetric KL-divergence
We start from the 2nd property of KL-divergence:

Ok! It looks good! Now we need to compute KL-divergence only once for every pair of documents.
Over which random variables?
Now let’s turn into how to handle documents with no or little overlapping vocabularies. To illustrate the problem, consider the following documents:
d1:
This is a document
d2:
This is a sentence
For each document we remove stopwords (‘this’, ‘is’, ‘a’) so they become:
d1:
document
d2:
sentence
According to constraint 2, we need to operate on the intersection of the documents’ vocabularies:
. We end up with the empty set and therefore we cannot compute directly the KL-divergence. In this case we can assign it a large number like 1e33.
Let’s see what happens when we have larger documents.
d1:
Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source
file is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file.
d2:
In this case you must supply both a \left and a \right because the
delimiter height are made to match whatever is contained between the
two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it.
After stopword removal, lowercasing and discarding words less than 2 characters, the documents become:
d1:
many research publications want you use bibtex better organizes
whole process suppose concreteness your source file tex basically
you create file bib containing bibliography run bibtex file
d2:
case you must supply both left right because delimiter height made
match whatever contained between two commands left doesn have actual
left delimiter you use left some reason
The vocabulary intersection of the documents consists of two terms: “use” and “you”. In
“use” occurs 1 time and “you” occurs 2 times. Surprisingly, in
“use” also occurs 1 time and “you” occurs 2 times too. The distributions
and
are equal, and therefore
. So these documents are deemed equal! A better stopword list could have removed “use” and “you” and in that case the documents would have an infinite KL-divergence as in the first example. However it is easy to think of similar examples where stopword lists wouldn’t have been of much help.
So, how can we overcome this problem?
Simple back-off
Since operating on the vocabulary intersection is not an option, we need to find a trick that allows us to consider the entire vocabulary of the documents. Smoothing comes to mind. Dirichlet and Laplacian smoothing are amongst the most popular smoothing techniques but after smoothing the probability distribution doesn’t sum up to 1 and violates the first constraint for defining KL-divergence.
Brigette Biggi suggested a back off smoothing method which keeps the probability distributions sums to 1 and also allows operating on the entire vocabulary. According to their proposed back-off method, the smoothed probability
of a term
in a document
is:

where

the interesting part is on how
and
are calculated. In order to keep the term probabilities for
and
summing up to 1, the following constraint should be met:

where
is a normalization coefficient:

and
is a positive number smaller than the minimum term probability occurring in either
or
.
The code
To illustrate the above, I wrote a small Python script.
import re, math, collections
def tokenize(_str):
stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is', 'are', 'will', 'in', 'it', 'to', 'that']
tokens = collections.defaultdict(lambda: 0.)
for m in re.finditer(r"(\w+)", _str, re.UNICODE):
m = m.group(1).lower()
if len(m) < 2: continue
if m in stopwords: continue
tokens[m] += 1
return tokens
#end of tokenize
def kldiv(_s, _t):
if (len(_s) == 0):
return 1e33
if (len(_t) == 0):
return 1e33
ssum = 0. + sum(_s.values())
slen = len(_s)
tsum = 0. + sum(_t.values())
tlen = len(_t)
vocabdiff = set(_s.keys()).difference(set(_t.keys()))
lenvocabdiff = len(vocabdiff)
""" epsilon """
epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001
""" gamma """
gamma = 1 - lenvocabdiff * epsilon
# print "_s: %s" % _s
# print "_t: %s" % _t
""" Check if distribution probabilities sum to 1"""
sc = sum([v/ssum for v in _s.itervalues()])
st = sum([v/tsum for v in _t.itervalues()])
if sc < 9e-6:
print "Sum P: %e, Sum Q: %e" % (sc, st)
print "*** ERROR: sc does not sum up to 1. Bailing out .."
sys.exit(2)
if st < 9e-6:
print "Sum P: %e, Sum Q: %e" % (sc, st)
print "*** ERROR: st does not sum up to 1. Bailing out .."
sys.exit(2)
div = 0.
for t, v in _s.iteritems():
pts = v / ssum
ptt = epsilon
if t in _t:
ptt = gamma * (_t[t] / tsum)
ckl = (pts - ptt) * math.log(pts / ptt)
div += ckl
return div
#end of kldiv
d1 = """Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source
file is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file."""
d2 = """In this case you must supply both a \left and a \right because the
delimiter height are made to match whatever is contained between the
two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it."""
print "KL-divergence between d1 and d2:", kldiv(tokenize(d1), tokenize(d2))
print "KL-divergence between d2 and d1:", kldiv(tokenize(d2), tokenize(d1))
The output looks like this:
KL-divergence between d1 and d2: 6.52185430964
KL-divergence between d2 and d1: 6.51142363095
Now, KL-divergence is greater than zero, so the documents are not thought to be the same as before! Good job. Looking at KL symmetry, although the divergence of the two pairs is not identical, it is sufficiently close.
Acknowledgments
The above is a compilation of knowledge found in Wikipedia and from the paper “Reducing the Plagiarism Detection Search Space on the basis of Kullback-Leibler Distance” by Alberto Barron-Cedeno, Paolo Rosso, and Jose-Miguel Benedi. Examples and code are mine and you are free to use them.