Let P and Q be two probability distributions of a discrete random variable X. If the following two properties hold:

- P and Q both sum to 1: Σ_x P(x) = Σ_x Q(x) = 1
- P(x) > 0 and Q(x) > 0 for any x we sum over

then we can define their KL-divergence as:

D(P || Q) = Σ_x P(x) log( P(x) / Q(x) )

and it has three properties:

- it is not symmetric: in general D(P || Q) ≠ D(Q || P)
- it is additive for independent distributions
- D(P || Q) ≥ 0, with D(P || Q) = 0 iff P = Q
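As a quick sanity check of the definition and the properties above, here is a minimal Python sketch (the function name and the toy distributions are mine):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) * log(P(x) / Q(x)).
    Assumes both lists sum to 1 and q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q))  # > 0, and different from kl_divergence(q, p)
```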
Working with documents
We regard a document d as a discrete distribution over n random variables, where n is the number of words in the document: each term t gets probability P(t | d), its relative frequency in d. Now, let d1 and d2 be two documents for which we want to calculate their KL-divergence. We run into two problems:

- we need to compute the KL-divergence twice due to asymmetry: D(d1 || d2) and D(d2 || d1).
- also, due to the 2nd constraint for defining KL-divergence, our calculations should only consider words occurring in both d1 and d2.
We tackle the first problem by folding the two directions into a single, symmetric sum:

KL(d1, d2) = D(d1 || d2) + D(d2 || d1) = Σ_t ( P(t | d1) − P(t | d2) ) log( P(t | d1) / P(t | d2) )

Ok! It looks good! Now we need to compute the KL-divergence only once for every pair of documents.
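As a sketch of why this helps, the symmetric form can be computed in a single pass over the terms (the names are mine, and both distributions are assumed strictly positive everywhere):

```python
import math

def sym_kl(p, q):
    """D(P || Q) + D(Q || P) = sum_x (P(x) - Q(x)) * log(P(x) / Q(x))."""
    return sum((px - qx) * math.log(px / qx) for px, qx in zip(p, q))

p = [0.5, 0.5]
q = [0.9, 0.1]

# One pass, and the order of the arguments no longer matters (up to rounding).
print(sym_kl(p, q))
print(sym_kl(q, p))
```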
Over which random variables?
Now let’s turn to how to handle documents with no or little overlapping vocabulary. To illustrate the problem, consider the following documents:
This is a document
This is a sentence
For each document we remove the stopwords (‘this’, ‘is’, ‘a’) so they become:

document

sentence
According to constraint 2, we need to operate on the intersection of the documents’ vocabularies: {document} ∩ {sentence} = ∅. We end up with the empty set, and therefore we cannot compute the KL-divergence directly. In this case we can assign it a large number like 1e33.
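This guard can be sketched with a sentinel standing in for an infinite divergence (the variable names are mine):

```python
BIG = 1e33  # sentinel standing in for an "infinite" divergence

d1_vocab = {"document"}   # "This is a document" after stopword removal
d2_vocab = {"sentence"}   # "This is a sentence" after stopword removal

overlap = d1_vocab & d2_vocab             # set(): nothing to sum over
divergence = BIG if not overlap else 0.0  # the real computation would go here
print(divergence)  # 1e+33
```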
Let’s see what happens when we have larger documents.
Many research publications want you to use BibTeX, which better organizes the whole process. Suppose for concreteness your source file is x.tex. Basically, you create a file x.bib containing the bibliography, and run bibtex on that file.
In this case you must supply both a \left and a \right because the delimiter height are made to match whatever is contained between the two commands. But, the \left doesn't have to be an actual 'left delimiter', that is you can use '\left)' if there were some reason to do it.
After stopword removal, lowercasing, and discarding words shorter than 2 characters, the documents become:
many research publications want you use bibtex better organizes whole process suppose concreteness your source file tex basically you create file bib containing bibliography run bibtex file
case you must supply both left right because delimiter height made match whatever contained between two commands left doesn have actual left delimiter you use left some reason
The vocabulary intersection of the documents consists of two terms: “use” and “you”. In d1, “use” occurs 1 time and “you” occurs 2 times. Surprisingly, in d2, “use” also occurs 1 time and “you” occurs 2 times too. The distributions P(t | d1) and P(t | d2), restricted to the intersection, are equal, and therefore KL(d1, d2) = 0. So these documents are deemed equal! A better stopword list could have removed “use” and “you”, and in that case the documents would have an infinite KL-divergence as in the first example. However, it is easy to think of similar examples where stopword lists wouldn’t have been of much help.
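Restricting both documents to the shared terms makes the failure concrete; here is a sketch using the counts quoted above:

```python
import math

# term counts over the vocabulary intersection {"you", "use"}
d1_counts = {"you": 2, "use": 1}
d2_counts = {"you": 2, "use": 1}

def normalize(counts):
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

p, q = normalize(d1_counts), normalize(d2_counts)

# symmetric KL over the intersection
kl = sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in p)
print(kl)  # 0.0 -- the two documents look identical on the intersection
```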
So, how can we overcome this problem?
Since operating on the vocabulary intersection is not an option, we need to find a trick that allows us to consider the entire vocabulary of the documents. Smoothing comes to mind. Dirichlet and Laplace smoothing are amongst the most popular smoothing techniques, but after smoothing the probability distributions no longer sum up to 1, violating the first constraint for defining KL-divergence.
Brigitte Bigi suggested a back-off smoothing method which keeps the probability distributions summing to 1 and also allows operating on the entire vocabulary. According to her proposed back-off method, the smoothed probability of a term t in a document d is:

P(t | d) = γ · P_ML(t | d)   if t occurs in d
P(t | d) = ε                 otherwise

where P_ML(t | d) is the maximum-likelihood estimate, i.e. the relative frequency of t in d.
The interesting part is how γ and ε are calculated. In order to keep the term probabilities for d1 and d2 summing up to 1, the following constraint should be met:

Σ_{t ∈ d} γ · P_ML(t | d) + Σ_{t ∉ d} ε = 1

where γ is a normalization coefficient:

γ = 1 − |V \ V(d)| · ε

(V being the combined vocabulary of the two documents and V(d) the vocabulary of document d), and ε is a positive number smaller than the minimum term probability occurring in either d1 or d2.
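Here is a small numeric sketch of how γ and ε interact (the term probabilities and the number of missing terms are made-up values):

```python
# Term probabilities of one document, and the number of terms that occur
# only in the other document (each of those receives probability epsilon).
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
missing = 4

epsilon = min(probs.values()) * 0.001  # kept below the smallest probability
gamma = 1 - missing * epsilon          # normalization coefficient

smoothed_total = sum(gamma * p for p in probs.values()) + missing * epsilon
print(round(smoothed_total, 10))  # 1.0 -- the smoothed distribution still sums to 1
```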
To illustrate the above, I wrote a small Python script.
```python
import collections
import math
import re
import sys


def tokenize(_str):
    stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is',
                 'are', 'will', 'in', 'it', 'to', 'that']
    tokens = collections.defaultdict(float)
    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2:
            continue
        if m in stopwords:
            continue
        tokens[m] += 1
    return tokens


def kldiv(_s, _t):
    # A document with an empty vocabulary gets the "infinite" sentinel.
    if len(_s) == 0 or len(_t) == 0:
        return 1e33

    ssum = float(sum(_s.values()))
    tsum = float(sum(_t.values()))

    # Terms of _s that are missing from _t; these receive the epsilon mass.
    vocabdiff = set(_s.keys()).difference(set(_t.keys()))
    lenvocabdiff = len(vocabdiff)

    # epsilon: kept below the smallest term probability in either document
    epsilon = min(min(_s.values()) / ssum, min(_t.values()) / tsum) * 0.001

    # gamma: normalization coefficient so the smoothed distribution sums to 1
    gamma = 1 - lenvocabdiff * epsilon

    # Sanity check: both distributions must sum to 1.
    sc = sum(v / ssum for v in _s.values())
    st = sum(v / tsum for v in _t.values())
    if abs(sc - 1.0) > 9e-6 or abs(st - 1.0) > 9e-6:
        print("Sum P: %e, Sum Q: %e" % (sc, st))
        print("*** ERROR: distribution does not sum up to 1. Bailing out ..")
        sys.exit(2)

    # Symmetric KL over the terms of _s, with back-off smoothing for _t.
    div = 0.
    for t, v in _s.items():
        pts = v / ssum
        ptt = epsilon
        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        div += (pts - ptt) * math.log(pts / ptt)
    return div


d1 = """Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source file
is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file."""

d2 = r"""In this case you must supply both a \left and a \right because
the delimiter height are made to match whatever is contained between
the two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it."""

print("KL-divergence between d1 and d2:", kldiv(tokenize(d1), tokenize(d2)))
print("KL-divergence between d2 and d1:", kldiv(tokenize(d2), tokenize(d1)))
```
The output looks like this:
KL-divergence between d1 and d2: 6.52185430964
KL-divergence between d2 and d1: 6.51142363095
Now the KL-divergence is greater than zero, so the documents are no longer deemed identical as before. Good job. As for symmetry: the two directions do not give exactly the same value, because the smoothing terms depend on the direction, but they are sufficiently close.
The above is a compilation of knowledge found in Wikipedia and in the paper “Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance” by Alberto Barrón-Cedeño, Paolo Rosso, and José-Miguel Benedí. The examples and code are mine and you are free to use them.