Contextual Document Clustering

Vladimir Dobrynin; David Patterson; Niall Rooney

doi:10.1007/978-3-540-24752-4_13

Contextual Document Clustering

Vladimir Dobrynin ${ }^{1}$ , David Patterson ${ }^{2}$ , Niall Rooney ${ }^{2}$
${ }^{1}$ Faculty of Applied Mathematics and Control Processes, St Petersburg State
University, 198904 Petrodvoretz, St. Petersburg, Russia
{vdobr}@oasis.apmath.spbu.ru
${ }^{2}$ Nikel, Faculty of Informatics, University of Ulster
{wd.patterson, nf.rooney}@ulster.ac.uk

Abstract

In this paper we present a novel algorithm for document clustering. This approach is based on distributional clustering where subject related words, which have a narrow context, are identified to form metatags for that subject. These contextual words form the basis for creating thematic clusters of documents. We believe that this approach will be invaluable in creating an information retrieval system that will allow for a more personalized and intelligent search and retrieval mechanism. In a similar fashion to other research papers on document clustering, we analyze the quality of this approach with respect to document categorization problems and show it to outperform the information theoretic method of sequential information bottleneck.

1 Introduction

Document clustering is an important feature of information retrieval systems and can be used to solve many varied tasks, from the search for duplicate documents to the visualization of retrieval results and web mining [22]. In this paper we present a novel approach to clustering documents based on word distributions, so that each cluster is identified by a context word and the documents contained in the cluster have a high level of specificity and relevancy to each other. We propose this technique as means of indexing a document store and show its efficacy through a series of document classification experiments.
Generally clustering algorithms can be categorized as hierarchical (agglomerative and divisive) or partitional in nature [9]. Partitional clustering such as the well known k-means tends to be a much more efficient approach to clustering, although it in general requires apriori knowledge of the number of clusters. Most clustering algorithms determine the similarity between points based on a distance measure. This forms the basis of standard clustering approaches such as k-means, k-medoids and hierarchical methods such as single-link, complete link and group average clustering [9]. Recent extensions to these classical approaches have included clustering by committees [15] which uses the mutual information between documents based on their respective feature vectors to determine similarity. Committees of clusters are then formed where each member covers a tight number of closely related clusters and each

committee member is defined to be dissimilar to others. This approach is shown to outperform many of the classical approaches. Another approach proposed by Liu et al [14] uses a richer feature set to represent each document. Clustering is then performed using a Gaussian mixture model together with an Expectation-Maximization algorithm which iteratively refines the clusters based on an evaluation of which features are discriminatory.
In terms of clustering a corpus of documents, a more natural measure of the similarity between documents is based on the word distributions of the documents. An information theoretic framework can then be applied to cluster documents or words/features in the documents. Such approaches belong to the area of distributional clustering $[2,16]$ where the similarity measure is based on information theoretical divergence crtieria [13]. The goal of word/feature clustering is to provide a mechanism of feature reduction, where the original feature space is transformed into the space of new features represented by the word clusters. Feature reduction is a desirable way of addressing the problem of high feature dimensionality within document corpora, and as such feature clustering is a more effective mechanism than feature selection $[2,6]$ .
The most recent focus on the research in distributional clustering has focussed primarily on the Information Bottleneck (IB) clustering framework [20]. The IB method is based on the following abstract concepts. Given the empirical joint probability distribution of two variables, one variable is “compressed” so that the maximum mutual information is maintained between both. In the context of clustering, the variables represent the set of documents and the set of words. Using this method, it is possible to form either word clusters or document clusters. In terms of document clustering, the original IB algorithm was agglomerative and sub-optimal in nature (i.e. does not necessarily form the “best” clusters). A number of new algorithms have been presented based on IB, with the intent of either trying to improve its performance in terms of text categorization problems or its efficiency or both $[3,4,7,18,19]$ . For example the sequential informational Bottleneck (siB) [19] method is partitional which improves its efficiency and ensures that the most optimal partition of documents into clusters is formed given the mutual information.
While the approach we present in this paper is based on distributional clustering, our research goals differ from other distributional document clustering such as IB. The later tries to provide a more compact representation of the data, whereas we are primarily interested in identifying documents which belong to highly specific contexts, for example a document collection may contain the word “computer” and “motherboard”, but it is likely that the former word will occur in a diverse range of documents with no subject relation whereas the latter word will occur in documents that are highly related. Such words we consider as having a “narrow” context. In essence these contextual words provide a mechanism of grouping together semantically related documents. This Contextual Document Clustering (CDC) approach of identifying words that form clusters of narrow scope will lend itself well to organize a collection of documents in a more meaningful fashion. It is worth stressing also that the contextual words are not identified by any pre-defined categories that

the documents belong to, but are determined automatically in a unsupervised fashion. It is important that a document clustering algorithm can operate in a unsupervised manner because for many document corpora, such predefined categories will not exist or will be unknown. CDC is more generic than other approaches in that it does not require such prior knowledge. The motivation behind CDC is our belief that an information retrieval system can be built in a more effective manner by grouping documents into clusters using CDC rather than standard indexing approaches, such as Latent semantic indexing which due to the transformation of the data space are inevitably going to result in the loss of valuable information. The efficacy of CDC is measured by how many relevant documents are retrieved as a result of a query (the precision of the mechanism), and how many relevant documents in the corpus are actually returned (the recall). CDC should allow the retrieval process for a given query to be performed efficiently as it only has to identify initially the small number of clusters relevant to the query. These clusters will contain on average a small number of documents relative to the corpus size as a whole. In early research in Information retrieval, document clustering was considered as a desirable method of indexing documents [21] but the approach was not pursued due to the high computational demands required at the time. As such recent research on document clustering has tended to focus on its applicability to clustering the results of a query in an IR system. Document categorization has been used frequently in the literature as a independent measure of determining the quality of the clustering process. Typically document categorization is a classification problem where a document is represented using the bag of words model, some form of feature reduction is performed and a multi-class, multi-labelled categorization is carried out using a standard classifier such as SVM or Naive Bayes. A review of such techniques is presented in [17]. Although this approach provides the best known results for well known data-sets such as the Reuters-21578, it suffers from the practical issue that the bag-of-words may have high dimensionality even after feature reduction. In our evaluation section we compare the results of document categorization performed on the Reuters-21578 and 20Newsgroups for the CDC technique and discuss them in comparison to the recent document clustering approach, (sIB) [19].

2 Contextual Clustering

In this Section we provide a theoretical description of CDC. In addition, we provide an analysis of the algorithm’s time complexity, and summarize the main points concerning sIB, as we compare our results to this technique. Contextual Clustering consists of two steps, identifying the narrow contexts and clustering the documents where contexts act as cluster centroids.

2.1 Context Word Identification

The term “word context” is used to describe the probability distribution of a set of words which co-occur with the given word in a document. In

other words the word context is described by a conditional probability distribution

p(Y \mid z)

where $Y$ is a random variable with values in the collection dictionary and $z$ is the given word which acts as a descriptor of the context.
As described in the introduction, it is obvious that not all word contexts are useful only those that have a narrow context. Let $\mathcal{X}$ be the set of all documents in the document collection and $\mathcal{Y}$ be set of all words occurring in documents from $\mathcal{X}$ . The context “narrowness” is measured by a determination of the entropy of its probability distribution and the document frequency for a context word $z$ .
Let $t f(x, y)$ denote the number of occurrence of the word $y \in \mathcal{Y}$ in the document $x \in \mathcal{X}$ . Then the empirical joint probability distribution $p(X, Y)$ of random variables $X$ and $Y$ with values from $\mathcal{X}$ and $\mathcal{Y}$ is calculated by

p(x, y)=\frac{t f(x, y)}{\sum_{x^{\prime}, y^{\prime}} t f\left(x^{\prime}, y^{\prime}\right)}

where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ and the independence of words occurrences in the collection documents is assumed, so that

p\left(y_{1}, y_{2} \mid x\right)=p\left(y_{1} \mid x\right) p\left(y_{2} \mid x\right)

where $y_{1}, y_{2} \in \mathcal{Y}$ and $p\left(y_{1}, y_{2} \mid x\right)$ denote the probability of co-occurrence of words $y_{1}$ and $y_{2}$ in the document $x$ .
Words with a narrow word contexts are selected by calculating the conditional probability distributions $p(Y \mid z)$ for all word $z \in \mathcal{Y}$ :

\begin{gathered} p(y \mid z)=\frac{p(y, z)}{p(z)} \\ p(y, z)=\sum_{x} p(x) p(y \mid x) p(z \mid x) \\ p(z)=\sum_{x} p(x, z) \\ p(x)=\sum_{y} p(x, y) \end{gathered}

and measuring the entropy:

H(Y \mid z)=H[p(Y \mid z)]=-\sum_{y} p(y \mid z) \log p(y \mid z)

The above approach to calculate $p(y \mid z)$ suffers from the disadvantage that the function $p(x, y)$ must be re-calculated any time a new document is added to the document corpus. A more amenable approach to incremental learning could be determined by assuming that $p(x)$ is the same for all documents and by calculating the empirical conditional distribution $p(y \mid z)$ directly:

p(y \mid z)=\frac{\sum_{x \in D_{z}} t f(x, y)}{\sum_{x \in D_{x}, y^{\prime}} t f\left(x, y^{\prime}\right)}

where $D_{z}$ is the set of all documents that contain the word $z$ , and not making the assumption that the word occurrences in a document are independent.
Let (1a) denote narrow word selection criterion when the independence of word occurrences in a document is assumed and (1b) otherwise. Both (1a) and (1b) are unsupervised techniques.

Only those word contexts are selected as narrow based on a consideration of the entropy $H(Y \mid z)$ and the document frequency $d f(z)=|\{x$ : $\left.t f(x, z)>0\}\right\}$ . The maximum entropy $H_{\max }(Y \mid z)$ occurs for a uniform distribution and equals the $\log \left|T\left(D_{z}\right)\right|$ where $T\left(D_{z}\right)$ is the set of words belonging to documents where $z$ occurs. Heaps Law [1] states that the dictionary size of a document collection is of the order $O\left(n^{\beta}\right)$ , where $n$ is the total text size and $\beta<1$ . As the text size for documents where $z$ occurs is $d f(z)$ times a constant $k$ denoting the average size for these documents, then $\left|T\left(D_{z}\right)\right|=O\left((k \cdot d f(z))^{\beta}\right)$ and $H(Y \mid z)=O(\log d f(z))$ To take into account the dependence between $H(Y \mid z)$ and $d f(z)$ the set of words $\mathcal{Y}$ is divided into a set of subsets such that all words from a given subset have document frequencies from the same interval:

\begin{gathered} \mathcal{Y}=\cup_{i} \mathcal{Y}_{i} \\ \mathcal{Y}_{i}=\left\{z: z \in \mathcal{Y}, d f_{i} \leq d f(z)<d f_{i+1}\right\} \\ i=1, \ldots, r \end{gathered}

If

d f_{i+1}=\alpha \cdot d f_{i}

where $\alpha>1$ is a constant, then $H_{\max }(Y \mid z)$ is bounded from above by a linear function of the interval $\left[d f_{i}, d f_{i+1}\right)$ index $i$ . So as a consequence a threshold $H_{\max }(i)$ is set for every interval $\left[d f_{i}, d f_{i+1}\right)$ and select word set $\mathcal{Z} \subset \mathcal{Y}$ such that

\begin{gathered} \mathcal{Z}=\cup_{i}\left\{z: z \in \mathcal{Y}, d f_{i} \leq d f(z)<d f_{i+1}\right. \\ \left.H(Y \mid z) \leq H_{\max }(i)\right\} \end{gathered}

The entropy threshold values are selected empirically based on the entropy distribution over all possible contexts for given a document corpus. As this entropy distribution is dependent on the approach used to calculate the conditional probabilities $p(y \mid z)$ , different threshold values may be selected for criterion (1a) than those for criterion (1b) for a given document corpus. An additional constraint is included in Criterion (1b) to limit the maximum number of words $W_{c}$ contained within the contexts where naturally only those words $y$ with the maximum conditional probability $p(y \mid z)$ are kept. This was added also to improve the efficiency of (1b) over (1a) in the situation where a context or contexts contain a large number of words $>W_{c}$ . It also may improve the accuracy of the technique by filtering out irrelevant words.
An alternative supervised means of choosing the narrow word contexts is to assume that in total there are $N$ word contexts and $r$ document frequency intervals as before. If $\mathcal{Y}_{i}$ denotes the set of words from the

corpus dictionary which contains all words $z$ such that $d f_{i} \leq d f(z)<$ $d f_{i+1}$ . For every $i=1, \ldots, r$ a set $\mathcal{Z}_{i} \subset \mathcal{Y}_{i}$ is selected such that :

\left|\mathcal{Z}_{i}\right|=\frac{N \cdot\left|\mathcal{Y}_{i}\right|}{\sum_{j=1, \ldots, r^{\prime}}\left|\mathcal{Y}_{j}\right|}

and

z_{1} \in \mathcal{Z}_{i}, z_{2} \in \mathcal{Y}_{i}-\mathcal{Z}_{i} \rightarrow H\left(Y \mid z_{1}\right) \leq H\left(Y \mid z_{2}\right)

Then $\mathcal{Z}=\cup_{i} \mathcal{Z}_{i}$ , where $\mathcal{Z}$ is the set of selected word contexts. This second approach has the advantage that the entropy threshold bounds do not need to be determined empirically for a particular data-set.
Criterion (2) is similar to criterion (1b) in that the distribution $p(y \mid z)$ is calculated directly and there is a limit also on the maximum number of words a context may contain.

2.2 Document Clustering

Selected narrow word contexts act as cluster centroids and the clustering process works by assigning every document from the document collection to the cluster with the closest centroid, as measured by the JS-divergence. Let $x$ be a document. Then conditional probability distribution of its words $p(Y \mid x)$ is given by

p(y \mid x)=\frac{p(x, y)}{p(x)}

and the distance between document $x$ and word $z$ context is measured using the Jensen-Shannon divergence [13] of $p(Y \mid z)$ and $p(Y \mid x)$ distributions.
Let $p_{1}(Y)$ and $p_{2}(Y)$ be two probability distributions of a random variable $Y$ . Then the Jensen-Shannon divergence of $p_{1}$ and $p_{2}$

J S_{\left\{\pi_{1}, \pi_{2}\right\}}\left[p_{1}, p_{2}\right]=H[\bar{p}]-\pi_{1} H\left[p_{1}\right]-\pi_{2} H\left[p_{2}\right]

where $\pi_{1} \geq 0, \pi_{2} \geq 0, \pi_{1}+\pi_{2}=1, \bar{p}=\pi_{1} p_{1}+\pi_{2} p_{2}$ , is a nonnegative bounded function of $p_{1}$ and $p_{2}$ which is equal to zero iff $p_{1}=p_{2}$ . Also this function is concave with respect to $\pi_{1}$ and $\pi_{2}$ with unique maximum value in the point $\{0.5,0.5\}$ .
So the document $x$ will be assigned to the cluster $C_{z}$ with centroid for word $z$ if

z=\arg \min _{z} J S_{\{0.5,0.5\}}\left[p\left(Y \mid z^{\prime}\right), p(Y \mid x)\right]

2.3 Complexity Analysis of CDC

Let $S$ be a set of documents used to generate $N$ contexts, $d f_{\text {inv }}(S)$ the average document frequency of a word in $S$ , length $h_{\text {inv }}(S)$ average document length in $S$ , and $T(S)$ the set of all distinct words in $S$ , then the time complexity for context generation is

O\left(|T(S)| \cdot d f_{\text {inv }}(S) \cdot \text { length }_{\text {inv }}(S)\right)

If length $_{\text {avr }}(S)$ is bounded from above by a constant value not dependent on $|S|$ then time complexity for context generation is

O(|S|)

and time complexity for clustering is

O(N \cdot|S|)

From this it clear that the process of the context word discovery and document clustering is only linear dependent on the size of the document collection, so it is much more efficient mechanism than standard document clustering approaches which are usually quadratic in their time complexity [5].

2.4 Sequential Information Bottleneck

In this section we review the sIB technique. As stated given a joint distribution $p(X \mid Y)$ , the Information Bottleneck method looks for a compact representation of $X$ which preserves as much information as possible about the relevant variable $Y$ . The mutual information, $I(X ; Y)$ , between the random variables $X$ and $Y$ is given by.

I(X ; Y)=-\sum_{x \in X, y \in Y} p(x) p(y \mid x) \log \frac{p(y \mid x)}{p(y)}

A compressed representation $T$ of $X$ is defined by $P(T \mid X)$ . The compactness of the representation is determined by $I(T ; X)$ while the quality of the clusters is measured by the fraction of information they capture about $Y$ , namely $I(T ; Y) / I(X ; Y)$ Intuitively in this procedure the information contained in $X$ about $Y$ is “squeezed” through a compact “bottleneck” of clusters T that is forced to represent the relevant part of $X$ with respect to $Y$ . The sequential clustering works by finding a partition of $T(X)$ which maximizes a score function given by $I(T ; Y)$ .The algorithm starts with an initial random partition $T=\left\{t_{1}, t_{2}, . ., t_{K}\right\}$ of $X$ . Similar to the k-means algorithm $K$ is a pre-determined value. At each step in the method some $x \in X$ is drawn out of its current cluster $t(x)$ and is represented as a new singleton cluster $\{x\} . x$ is then merged into $t^{\text {new }}$ such that $t^{\text {new }}=\operatorname{argmin}_{t \in T} d(\{x\}, t)$ where $d$ is

d(x, t)=(p(x)+p(t)) J S(p(y \mid x), p(y \mid t))

Slonim [19] shows that this iterative process is guaranteed to converge to a local maximum of the score function, where no more re-assignments can improve upon the result.

3 Experimental Setup

The experimental evaluation of the contextual clustering technique for document categorization used two standard data-sets.

The first dataset was the training and test parts of the ModApte split of the Reuters-21578 collection where the documents selected fall into 10 most frequent categories (“earn”, “acq”, “money-fx”, “crude”, “grain”, “trade”, “interest”, “wheat”, “ship”, “corn”). Note that because documents can be assigned to more than one category, the total number of categories documents can belong to is 102 . Note that these categories are not used to cluster documents contextually. The cluster context’s are determined as previously described in section 2.1 in a totally unsupervised fashion. However this domain knowledge about the document categories is used to assess the quality of the CDC clustering.
Document preprocessing includes

removing all file headers except title and dateline
lowering upper case characters
removing all digits and all non alpha-numeric characters
removing stop words ( 470 words) and words which occur in the collection only once
removing the body of short documents (all information such document contains is stored in its title)

As a result there were 9035 documents in this collection and 16720 unique words.
To select a word $(z)$ with narrow word context the bounds for criteria (1a) were empirically set for this data-set to:

\begin{aligned} & 5 \leq d f(z)<12 \quad A N D \quad H(Y \mid z) \leq 5.00 \quad O R \\ & 12 \leq d f(z)<25 \quad A N D \quad H(Y \mid z) \leq 5.25 \quad O R \\ & 25 \leq d f(z)<50 \quad A N D \quad H(Y \mid z) \leq 5.50 \quad O R \\ & 50 \leq d f(z)<100 \quad A N D \quad H(Y \mid z) \leq 5.75 \quad O R \\ & 100 \leq d f(z) \quad A N D \quad H(Y \mid z) \leq 6.0 \end{aligned}

This resulted in 907 narrow word contexts.
To select a word $(z)$ with narrow word context the bounds for criterion (1b) were empirically set for this data-set to:

\begin{aligned} & 5 \leq d f(z)<12 \quad A N D \quad H(Y \mid z) \leq 5.25 \quad O R \\ & 12 \leq d f(z)<25 \quad A N D \quad H(Y \mid z) \leq 5.50 \quad O R \\ & 25 \leq d f(z)<50 \quad A N D \quad H(Y \mid z) \leq 5.75 \quad O R \\ & 50 \leq d f(z)<100 \quad A N D \quad H(Y \mid z) \leq 6.0 \quad O R \\ & 100 \leq d f(z) \quad A N D \quad H(Y \mid z) \leq 6.25 \end{aligned}

and $W_{c}$ was set to 1000 . This resulted in 854 narrow word contexts. Using criteria (2), the document frequency intervals were set to $d f_{1}=$ $5, d f_{2}=12, d f_{3}=25, d f_{4}=50, d f_{5}=100, d f_{6}=$ [corpus] and the number of contexts was set $N=1000$ , ( as criteria (1) had resulted in 907 word contexts). Naturally this results in a 1000 narrow word contexts. The second data-set was the 20 Usenet Newsgroups (20NG) as collected by [12]. This corpus contains roughly 20000 documents evenly distributed over 20 newsgroups, some of which are of very similar topics. There is roughly $4 \%$ cross-posting in this corpus. This data-set was filtered to remove duplicate documents Additional preprocessing included

removing all headers except subject
lowering upper case characters
removing all non alpha-numeric characters and tokens without alphabetic symbols
removing stop words (470 words) and words which occur in the collection only once

As a result there were 18938 documents in this collection and 58879 unique words.
To select the set $\mathcal{Z}$ of words which mark narrow contexts for this collection we used only criterion (1b) and criterion (2) ( Criterion (1a) proved too inefficient for this data-set).
Criterion (1b) was set with empirical bounds

\begin{aligned} & 5 \leq d f(z)<12 \quad A N D \quad H(Y \mid z) \leq 5.50 \quad O R \\ & 12 \leq d f(z)<25 \quad A N D \quad H(Y \mid z) \leq 6.00 \quad O R \\ & 25 \leq d f(z)<50 \quad A N D \quad H(Y \mid z) \leq 6.50 \quad O R \\ & 50 \leq d f(z)<100 \quad A N D \quad H(Y \mid z) \leq 7.0 \quad O R \\ & 100 \leq d f(z) \quad A N D \quad H(Y \mid z) \leq 7.5 \end{aligned}

and $W_{z}$ was set to 1000 as before. This resulted in 1422 contexts being selected. The 20NGs were investigated with the same settings as the Reuters data for criterion(2).

4 Precision and Recall

In this section, we describe how precision and recall are measured in the experiments. Let document $x_{z}$ be the nearest document to the centroid (context of the word $z$ ) of the cluster $C_{z}$ and $T_{e}\left(x_{z}\right)$ be then category set of document $x_{z}$ assigned by an expert ( the predefined cateogories). Then the category set $T\left(C_{z}\right)$ is assigned to the cluster $C_{z}$ and has value

T\left(C_{z}\right)=T_{e}\left(x_{z}\right)

if

x_{z}=\arg \min _{x} J S_{\{0.5,0.5\}}\left[p(Y \mid z), p\left(Y \mid x^{\prime}\right)\right]

All documents, assigned to the cluster $C_{z}$ , are classified as category set $T\left(C_{z}\right)$ , whereas the true classification of a document $x$ is $T_{e}(x)$ . A document $x \in$ corpus is considered to be classified correctly if the intersect of $T_{e}(x)$ and the category set it is assigned to $T(C(x))$ is non-empty:

T(C(x)) \cap T_{e}(x) \neq \emptyset

Then the precision $p_{1}$ of the classification is calculated as

p_{1}=\frac{|\{x: T(C(x)) \cap T_{e}(x) \neq \emptyset\}|}{|\text { corpus }|}

The quality of the CDC method can be calculated in a more stringent fashion. Let $T$ be the set of all categories, $t \in T$ and given the following definitions:

Number of “true positives” for the category $t$

T P_{t}=\left|\left\{x: x \in C_{z} \rightarrow t \in T\left(C_{z}\right) \cap T_{e}(x)\right\}\right|

Number of “false positives” for the category $t$

F P_{t}=\left|\left\{x: x \in C_{z} \rightarrow t \in T\left(C_{z}\right)-T_{e}(x)\right\}\right|

Number of “false negatives” for the category $t$

F N_{t}=\left|\left\{x: x \in C_{z} \rightarrow t \in T_{e}(x)-T\left(C_{z}\right)\right\}\right|

Then the microaverage precision $p^{\mu}$ is calculated as

p^{\mu}=\frac{\sum_{t \in T} T P_{t}}{\sum_{t \in T}\left(T P_{t}+F P_{t}\right)}

and microaverage recall $r^{\mu}$ as

r^{\mu}=\frac{\sum_{t \in T} T P_{t}}{\sum_{t \in T}\left(T P_{t}+F N_{t}\right)}

In order to compare CDC to the sIB technique described in 19, we also present results based on the latter paper’s measurements of precision and recall. In the sIB evaluation approach, all documents are assigned to the most dominant label in that cluster. A document is considered correctly classified if its true label set contains the dominant label. These dominant labels form a category set $C$ and for each $c \in C$ is measured $\alpha(c)$ , the number of documents correctly assigned to $c ; \beta(c)$ , the number of documents incorrectly classified to $c$ and $\gamma(c)$ , the number of documents incorrectly not assigned to $c$ . Then the precision and recall are defined as:

\begin{aligned} p_{s I B} & =\frac{\sum_{c \in C} \alpha(c)}{\sum_{c \in C}(\alpha(c)+\beta(c))} \\ r_{s I B} & =\frac{\sum_{c \in C} \alpha(c)}{\sum_{c \in C}(\alpha(c)+\gamma(c))} \end{aligned}

In general this tends to give a more optimistic measure of precision and recall than that defined by $p^{\mu}$ and $r^{\mu}$ .

5 Experimental Results

5.1 Reuters dataset

The precision and recall results for the Reuters data-set are presented in Table 1 for the three narrow word selection criteria. Results are shown both for the case where documents closest to the context centroids are kept and when they are removed shown in brackets in Table 1. The reason for their exclusion it that they may be considered to bias optimistically the results as their category sets are used to measure precision.

Criteria	$1(a)$	$1(b)$	2
number of contexts	$907(907)$	$854(854)$	$1000(1000)$
number of non-empty	$863(848)$	$790(781)$	$893(879)$
clusters
average cluster size	$10.5(10.0)$	$11.4(10.9)$	$10.1(9.7)$
max cluster size	$448(447)$	$458(458)$	$459(456)$
$p_{s I B}$	0.886	0.886	0.888
$p_{1}$	$0.832(0.822)$	$0.845(0.836)$	$0.845(0.836)$
$p^{\mu}$	$0.742(0.725)$	$0.697(0.681)$	$0.69(0.673)$
$r^{\mu}$	$0.76(0.744)$	$0.765(0.75)$	$0.773(0.758)$

Table 1. Classification Results for the Reuters data-set using both criteria

It should be observed that each criterion resulted in a high number of non-empty clusters which contain on average a small number of documents. There was not much variation in the number of clusters, average cluster size and maximum cluster size among the criteria. $p_{s I B}$ gave the highest value as a measure of precision as expected (0.89). This outperformed sIB which for this data-set had a precision value of 0.86 [19] (a $3 \%$ increase). It also significantly outperformed sIB in terms of recall. sIB is a hard clustering technique which tries to constrain the Reuters data-set to the same number of clusters as there are most frequent categories i.e. 10. However as this data-set is multi-labelled, their recall was low ( at best 0.56 [19]), which is $20 \%$ lower than CDC.
The values of precision ( regardless of the method of measurement) and recall are very similar to each other whichever criteria we used, showing that each criterion was equally valid in identifying narrow word contexts. However criterion (1a) was a much slower technique than the other two. As would be expected, removing the closest documents to the centroid results in a slight reduction in precision and recall which is at worse $1.7 \%$ .

5.2 20Newsgroups dataset

The results for precision and recall for the 20Newsgroups data-set are presented in Table 2. Results are shown both for the case where documents closest to the context centroids are kept and when they are removed in a similar fashion to Table 1.
Note that this data-set is uni-labelled. As a result the micro-average precision $p^{\mu}$ , micro-average recall $r^{\mu}$ and $p_{1}$ have the same value. Similarly $p_{s I B}$ is the same as $r_{s I B}$ So only the value for micro-average precision is presented in the Table 2. Again as was the case for the Reuters data, there was a high number of non-empty clusters and a small overall average cluster size.
Table 2 shows that both criteria resulted in a high number of non-empty clusters with a small average cluster size. The precision $p_{s I B}$ was the same for both criterion (0.71) and significantly outperformed sIB which

Criteria	$1(b)$	2
number of contexts	$1422(1422)$	$1000(1000)$
number of non-empty	$1247(1246)$	$907(904)$
clusters
average cluster size	$15.2(14.5)$	$20.9(20.3)$
max cluster size	$443(442)$	$309(309)$
$p_{s I B}=r_{s I B}$	0.711	0.711
$p_{1}=p^{\mu}=r^{\mu}$	$0.674(0.658)$	$0.649(0.639)$

Table 2. Results for 20NGs data-set
had a value of precision and recall, of $0.58[19], 13 \%$ less than CDC’s $p_{s I B}$ . Again $p_{1}=p^{\mu}=r^{\mu}$ were very similar for both criteria, indicating the validity of both criteria for contextual clustering. More significance should be put on the results for the 20NGS than for the Reuters dataset. The reason for this is that the Reuters data-set is quite simplistic. In other words, it is a relatively easy for a categorization technique to achieve high values of precision [17]. The 20NGs data-set is a more challenging data-set and therefore it is more difficult to obtain such high levels of precision. As such the results gives strength to our belief that CDC is competent technique for document clustering.
In summary, CDC provided a more competent technique for document categorization than the sIB approach. Criteria $1(\mathrm{a}-\mathrm{b})$ and 2 were equally accurate in terms of precision and recall, however criterion 1(b) and 2 provided a more efficient means of clustering the documents. The results demonstrates emphatically the benefit of forming a relatively high number of small contextual clusters. This should also prove desirable in finding the documents relevant to a query as it should allow for only a small number of clusters and hence documents in the whole data-set, to be considered for relevancy.

6 Conclusion Remarks and Future Work

The contextual clustering approach based on narrow context word selection partitions a document collection into a large number of relatively small thematic homogeneous clusters, regardless of the clustering criteria used. The number of clusters generated by CDC is a reflection of the real complex thematic structure of a document corpus, which can not be adequately expressed by classifying documents into a small number of categories or topics. This was borne out in the experimental evaluation of CDC. It showed a high precision and recall, when applied to the document categorization of the ModApte split Reuters data and the 20NG data-sets, which outperformed another information theoretic document clustering approach, sIB which relies on the user-defined categories.
It is envisaged that CBC could provide the foundation for an IR system designed to allow the efficient retrieval of semantically related documents

relevant to user queries. Its time complexity means that it can cluster a whole document corpus in an acceptable running time, something which could be easily improved upon further if the process was parallelised. It is proposed that within the context of an IR system, CDC also provides the user the opportunity of interactively browsing the query results and/or the whole document collection. This will help him/her to determine and/or adapt query key-words appropriate to their information needs. In addition the information retrieval system could also allow for personalization techniques whereby the search mechanisms could be dependent on the user’s familiarity with the document content. The focus of our future work will concentrate on these areas of research.

References

Baeza-Yates and Ribeiro-Neto: Modern Information Retrieval, ACM Press, 1999.
Baker, L.D., McCallum, A.K.: Distributional clustering of words for text classification. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, pp. 96-103, 1998.
Bekkerman, R., El-Yaniv, R., Tishby, N., Winter, Y.: On feature distributional clustering for text categorization. In Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, pp. 146-153,2001.
Bekkerman, R., El-Yaniv, R., Tishby, N., Winter,Y.: Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research, Vol 1:1-48, 2002.
Cutting, D.,Pedersen, J., Karger, D., Tukey, J.: Scatter/Gather: Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318-329, 1992.
Dhillon, Y.,Manella, S., Kumar, R.: Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification, Journal of Machine Learning Research Vol 3:1265-1287, 2003.
El-Yaniv R., Souroujon O.: Iterative double clustering for unsupervised and semi-supervised learning. In Proceedings of ECML-01, 12th European Conference on Machine Learning. pp. 121 - 132,2001.
Hofmann, T.: Probabilistic latent semantic indexing. In Proceedings of the 22nd ACM-SIGIR Intemational Conference on Research and Development in Information Retrieval, pp. 50-57, 1999.
Jain, A. K., Murty, M. N. and Flynn, P. J.: Data Clustering: A Review. ACM Computing Surveys 31(3):26423,1999.
Joachims, T.: A statistical learning model for Support Vector Machines. SIGIR’01, New Orleans, USA, 2001.
Karipis, G., Han, E.H.: Concept indexing: a fast dimensionality reduction algorithm with applications to document retrieval and categorisation, University of Minnesota, Technical Report TR-00-0016, 2000 .

Lang, K.: Learning to Filter netnews In Proceedings of 12th International Conference on Machine Learning, pp 331-339, 1995.
Lin, J: Divergence Measures Based on the Shannon Entropy, IEEE Transactions on Information Theory, 37(1), pp145-151, 1991.
Liu, X., Gong, Y., Xu, W., Zhu, S: Document clustering with cluster refinement and model selection capabilities. In Proceedings of SIGIR02, 25th ACM International Conference on Research and Development in Information Retrieval, pp. 191-198,2002.
Pantel, P. ,Lin, D.: Document clustering with committees. In the 25th Annual International Conference on Research and Development in Information Retrieval (SIGIR), 2002.
Pereira, F., Tishby, N., Lee L.: Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics, Columbus. Ohio, pp. 183-190, 1993.
Sebastiani, F.: Machine learning in automated text categorization, ACM Computer Surveys, Vol.34, No.1, March 2002, pp. 1-47, 2002.
Slonim, N.,Tishby N: Document Clustering using word clusters via the Information Bottleneck method. In the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2000.
Slonim, N.,Friedman, N., Tishby N.: Unsupervised document classification using sequential information maximization. In the 25th Annual International Conference on Research and Development in Information Retrieval (SIGIR),2002.
Tishby N., Pereira F., Bialek W.: The Information bottleneck method. Invited paper to The 37th annual Allerton Conference on Communication, Control, and Computing, 1999.
Van Rijsbergen, C. J.: Information retrieval, ButterworthHeinemann, 1979.
Zamir, O. and Etzioni, O.: Web document Clustering, A feasibility demonstration in ACM SIGIR 98, pp 46-54, 1998.

Contextual Document Clustering

Sign up for access to the world's latest research

Abstract

Related papers

Contextual Document Clustering

Abstract

1 Introduction

2 Contextual Clustering

2.1 Context Word Identification

2.2 Document Clustering

2.3 Complexity Analysis of CDC

2.4 Sequential Information Bottleneck

3 Experimental Setup

4 Precision and Recall

5 Experimental Results

5.1 Reuters dataset

5.2 20Newsgroups dataset

6 Conclusion Remarks and Future Work

References

References (22)

Related papers

Related topics