diff --git a/pydf/data/blei2003.txt b/pydf/data/blei2003.txt
index 87ddc6d..8c6c2f3 100644
--- a/pydf/data/blei2003.txt
+++ b/pydf/data/blei2003.txt
@@ -271,468 +271,4 @@ p(θ, z | w, α, β)). (5)

Thus the optimizing values of the variational parameters are found by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior p(θ, z | w, α, β). This minimization can be achieved via an iterative fixed-point method. In particular, we show in Appendix A.3 that by computing the derivatives of the KL divergence and setting them equal to zero, we obtain the following pair of update equations:

$\phi_{ni} \propto \beta_{i w_n} \exp\{ E_q[\log(\theta_i) \mid \gamma] \}$   (6)

$\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}.$   (7)

As we show in Appendix A.1, the expectation in the multinomial update can be computed as follows:

$E_q[\log(\theta_i) \mid \gamma] = \Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k} \gamma_j\big),$   (8)

where Ψ is the first derivative of the log Γ function which is computable via Taylor approximations (Abramowitz and Stegun, 1970).

Eqs. (6) and (7) have an appealing intuitive interpretation. The Dirichlet update is a posterior Dirichlet given expected observations taken under the variational distribution, E[zn | φn]. The multinomial update is akin to using Bayes' theorem, p(zn | wn) ∝ p(wn | zn) p(zn), where p(zn) is approximated by the exponential of the expected value of its logarithm under the variational distribution.

It is important to note that the variational distribution is actually a conditional distribution, varying as a function of w. This occurs because the optimization problem in Eq. (5) is conducted for fixed w, and thus yields optimizing parameters (γ∗, φ∗) that are a function of w. We can write the resulting variational distribution as q(θ, z | γ∗(w), φ∗(w)), where we have made the dependence on w explicit. Thus the variational distribution can be viewed as an approximation to the posterior distribution p(θ, z | w, α, β). In the language of text, the optimizing parameters (γ∗(w), φ∗(w)) are document-specific. In particular, we view the Dirichlet parameters γ∗(w) as providing a representation of a document in the topic simplex.

initialize $\phi^0_{ni} := 1/k$ for all i and n
initialize $\gamma_i := \alpha_i + N/k$ for all i
repeat
  for n = 1 to N
    for i = 1 to k
      $\phi^{t+1}_{ni} := \beta_{i w_n} \exp(\Psi(\gamma^t_i))$
    normalize $\phi^{t+1}_n$ to sum to 1
  $\gamma^{t+1} := \alpha + \sum_{n=1}^{N} \phi^{t+1}_n$
until convergence

Figure 6: A variational inference algorithm for LDA.

We summarize the variational inference procedure in Figure 6, with appropriate starting points for γ and φn. From the pseudocode it is clear that each iteration of variational inference for LDA requires O((N + 1)k) operations. Empirically, we find that the number of iterations required for a single document is on the order of the number of words in the document. This yields a total number of operations roughly on the order of N^2 k.

5.3 Parameter estimation

In this section we present an empirical Bayes method for parameter estimation in the LDA model (see Section 5.4 for a fuller Bayesian approach). In particular, given a corpus of documents D = {w1, w2, ..., wM}, we wish to find parameters α and β that maximize the (marginal) log likelihood of the data:

$\ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \alpha, \beta).$

As we have described above, the quantity p(w | α, β) cannot be computed tractably. However, variational inference provides us with a tractable lower bound on the log likelihood, a bound which we can maximize with respect to α and β.
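The fixed-point updates of Eqs. (6)–(8) and the loop of Figure 6 translate almost directly into code. The following is a minimal sketch, not the authors' implementation; the array conventions (a k × V matrix `beta`, a document given as an array of word ids) and all names are illustrative assumptions.

```python
import numpy as np
from scipy.special import digamma

def variational_inference(doc, alpha, beta, tol=1e-6, max_iter=100):
    """Coordinate ascent on the per-document variational parameters
    (gamma, phi) of LDA, following the updates of Eqs. (6)-(7).

    doc   : 1-D array of word ids (length N)
    alpha : (k,) Dirichlet hyperparameter
    beta  : (k, V) topic-word probabilities
    """
    k, N = len(alpha), len(doc)
    phi = np.full((N, k), 1.0 / k)            # initialize phi_ni = 1/k
    gamma = alpha + N / k                     # initialize gamma_i = alpha_i + N/k
    for _ in range(max_iter):
        gamma_old = gamma.copy()
        # multinomial update, Eq. (6): phi_ni proportional to beta_{i,w_n} exp(Psi(gamma_i))
        phi = beta[:, doc].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        # Dirichlet update, Eq. (7)
        gamma = alpha + phi.sum(axis=0)
        if np.abs(gamma - gamma_old).sum() < tol:
            break
    return gamma, phi
```

Note that the per-word normalization of Eq. (6) absorbs the constant −Ψ(Σj γj) term of Eq. (8), which is why only Ψ(γi) appears inside the exponential in Figure 6.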
We can thus find approximate empirical Bayes estimates for the LDA model via an alternating variational EM procedure that maximizes a lower bound with respect to the variational parameters γ and φ, and then, for fixed values of the variational parameters, maximizes the lower bound with respect to the model parameters α and β. We provide a detailed derivation of the variational EM algorithm for LDA in Appendix A.4. The derivation yields the following iterative algorithm:

1. (E-step) For each document, find the optimizing values of the variational parameters {γ∗_d, φ∗_d : d ∈ D}. This is done as described in the previous section.

2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β. This corresponds to finding maximum likelihood estimates with expected sufficient statistics for each document under the approximate posterior which is computed in the E-step.

[Figure 7: the smoothed LDA graphical model, with a node η governing the k rows of β, and the usual α, θ, z, w nodes inside plates of sizes N and M.]
Figure 7: Graphical model representation of the smoothed LDA model.

These two steps are repeated until the lower bound on the log likelihood converges. In Appendix A.4, we show that the M-step update for the conditional multinomial parameter β can be written out analytically:

$\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^*_{dni}\, w^j_{dn}.$   (9)

We further show that the M-step update for Dirichlet parameter α can be implemented using an efficient Newton-Raphson method in which the Hessian is inverted in linear time.

5.4 Smoothing

The large vocabulary size that is characteristic of many document corpora creates serious problems of sparsity. A new document is very likely to contain words that did not appear in any of the documents in a training corpus. Maximum likelihood estimates of the multinomial parameters assign zero probability to such words, and thus zero probability to new documents. The standard approach to coping with this problem is to "smooth" the multinomial parameters, assigning positive probability to all vocabulary items whether or not they are observed in the training set (Jelinek, 1997). Laplace smoothing is commonly used; this essentially yields the mean of the posterior distribution under a uniform Dirichlet prior on the multinomial parameters.

Unfortunately, in the mixture model setting, simple Laplace smoothing is no longer justified as a maximum a posteriori method (although it is often implemented in practice; cf. Nigam et al., 1999). In fact, by placing a Dirichlet prior on the multinomial parameter we obtain an intractable posterior in the mixture model setting, for much the same reason that one obtains an intractable posterior in the basic LDA model.

Our proposed solution to this problem is to simply apply variational inference methods to the extended model that includes Dirichlet smoothing on the multinomial parameter. In the LDA setting, we obtain the extended graphical model shown in Figure 7. We treat β as a k × V random matrix (one row for each mixture component), where we assume that each row is independently drawn from an exchangeable Dirichlet distribution.2 We now extend our inference procedures to treat the βi as random variables that are endowed with a posterior distribution, conditioned on the data.

2. An exchangeable Dirichlet is simply a Dirichlet distribution with a single scalar parameter η. The density is the same as a Dirichlet (Eq. 1) where αi = η for each component.
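As a companion to the per-document inference sketch above, the alternating procedure of Section 5.3 can be sketched as follows. This is a hedged illustration, not the paper's code: it reuses the hypothetical `variational_inference` helper from the earlier sketch, holds α fixed for brevity (the paper also updates α in the M-step via the linear-time Newton-Raphson method of Appendix A.2), and all names are assumptions.

```python
import numpy as np

def variational_em(docs, k, V, alpha0=0.1, em_iters=50, seed=0):
    """Sketch of the alternating variational EM of Section 5.3.

    docs : list of 1-D integer arrays of word ids
    k, V : number of topics / vocabulary size
    """
    rng = np.random.default_rng(seed)
    alpha = np.full(k, alpha0)
    beta = rng.dirichlet(np.ones(V), size=k)          # k x V topic-word matrix
    for _ in range(em_iters):
        counts = np.zeros((k, V))
        for doc in docs:
            # E-step: per-document variational parameters (previous sketch)
            gamma, phi = variational_inference(doc, alpha, beta)
            # expected sufficient statistics for Eq. (9): counts[i, w] += phi[n, i]
            np.add.at(counts.T, doc, phi)
        # M-step for beta, Eq. (9); a tiny constant guards against empty topics
        beta = (counts + 1e-12) / (counts + 1e-12).sum(axis=1, keepdims=True)
    return alpha, beta
```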
Thus we move beyond the empirical Bayes procedure of Section 5.3 and consider a fuller Bayesian approach to LDA. We consider a variational approach to Bayesian inference that places a separable distribution on the random variables β, θ, and z (Attias, 2000):

$q(\beta_{1:k}, z_{1:M}, \theta_{1:M} \mid \lambda, \phi, \gamma) = \prod_{i=1}^{k} \mathrm{Dir}(\beta_i \mid \lambda_i) \prod_{d=1}^{M} q_d(\theta_d, z_d \mid \phi_d, \gamma_d),$

where $q_d(\theta, z \mid \phi, \gamma)$ is the variational distribution defined for LDA in Eq. (4). As is easily verified, the resulting variational inference procedure again yields Eqs. (6) and (7) as the update equations for the variational parameters φ and γ, respectively, as well as an additional update for the new variational parameter λ:

$\lambda_{ij} = \eta + \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^*_{dni}\, w^j_{dn}.$

Iterating these equations to convergence yields an approximate posterior distribution on β, θ, and z.

We are now left with the hyperparameter η on the exchangeable Dirichlet, as well as the hyperparameter α from before. Our approach to setting these hyperparameters is again (approximate) empirical Bayes—we use variational EM to find maximum likelihood estimates of these parameters based on the marginal likelihood. These procedures are described in Appendix A.4.

6. Example

In this section, we provide an illustrative example of the use of an LDA model on real data. Our data are 16,000 documents from a subset of the TREC AP corpus (Harman, 1992). After removing a standard list of stop words, we used the EM algorithm described in Section 5.3 to find the Dirichlet and conditional multinomial parameters for a 100-topic LDA model. The top words from some of the resulting multinomial distributions p(w | z) are illustrated in Figure 8 (top). As we had hoped, these distributions seem to capture some of the underlying topics in the corpus (and we have named them according to these topics).

As we emphasized in Section 4, one of the advantages of LDA over related latent variable models is that it provides well-defined inference procedures for previously unseen documents. Indeed, we can illustrate how LDA works by performing inference on a held-out document and examining the resulting variational posterior parameters.

Figure 8 (bottom) is a document from the TREC AP corpus which was not used for parameter estimation. Using the algorithm in Section 5.1, we computed the variational posterior Dirichlet parameters γ for the article and variational posterior multinomial parameters φn for each word in the article.

Recall that the ith posterior Dirichlet parameter γi is approximately the ith prior Dirichlet parameter αi plus the expected number of words which were generated by the ith topic (see Eq. 7). Therefore, the prior Dirichlet parameters subtracted from the posterior Dirichlet parameters indicate the expected number of words which were allocated to each topic for a particular document. For the example article in Figure 8 (bottom), most of the γi are close to αi. Four topics, however, are significantly larger (by this, we mean γi − αi ≥ 1). Looking at the corresponding distributions over words identifies the topics which mixed to form this document (Figure 8, top).

Further insight comes from examining the φn parameters. These distributions approximate p(zn | w) and tend to peak towards one of the k possible topic values. In the article text in Figure 8, the words are color coded according to these values (i.e., the ith color is used if $q_n(z^i_n = 1) > 0.9$).
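The post-hoc analysis just described (topics for which γi − αi ≥ 1, and words attributed to the topic on which φn peaks above 0.9) is easy to script. A minimal sketch, assuming the gamma/phi arrays produced by the earlier inference sketch; all names are illustrative.

```python
import numpy as np

def analyze_document(gamma, alpha, phi, count_threshold=1.0, peak=0.9):
    """gamma - alpha approximates the expected number of words allocated to
    each topic (Eq. 7); each phi_n approximates p(z_n | w) for one word."""
    dominant_topics = np.flatnonzero(gamma - alpha >= count_threshold)
    word_topics = [int(p.argmax()) if p.max() > peak else None for p in phi]
    return dominant_topics, word_topics
```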
With this illustration, one can identify how the different topics mixed in the document text.

While demonstrating the power of LDA, the posterior analysis also highlights some of its limitations. In particular, the bag-of-words assumption allows words that should be generated by the same topic (e.g., "William Randolph Hearst Foundation") to be allocated to several different topics. Overcoming this limitation would require some form of extension of the basic LDA model; in particular, we might relax the bag-of-words assumption by assuming partial exchangeability or Markovianity of word sequences.

7. Applications and Empirical Results

In this section, we discuss our empirical evaluation of LDA in several problem domains—document modeling, document classification, and collaborative filtering.

In all of the mixture models, the expected complete log likelihood of the data has local maxima at the points where all or some of the mixture components are equal to each other. To avoid these local maxima, it is important to initialize the EM algorithm appropriately. In our experiments, we initialize EM by seeding each conditional multinomial distribution with five documents, reducing their effective total length to two words, and smoothing across the whole vocabulary. This is essentially an approximation to the scheme described in Heckerman and Meila (2001).

7.1 Document modeling

We trained a number of latent variable models, including LDA, on two text corpora to compare the generalization performance of these models. The documents in the corpora are treated as unlabeled; thus, our goal is density estimation—we wish to achieve high likelihood on a held-out test set. In particular, we computed the perplexity of a held-out test set to evaluate the models. The perplexity, used by convention in language modeling, is monotonically decreasing in the likelihood of the test data, and is algebraically equivalent to the inverse of the geometric mean per-word likelihood. A lower perplexity score indicates better generalization performance.3 More formally, for a test set of M documents, the perplexity is:

$\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\Big\{ -\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \Big\}.$

In our experiments, we used a corpus of scientific abstracts from the C. Elegans community (Avery, 2002) containing 5,225 abstracts with 28,414 unique terms, and a subset of the TREC AP corpus containing 16,333 newswire articles with 23,075 unique terms. In both cases, we held out 10% of the data for test purposes and trained the models on the remaining 90%.

3. Note that we simply use perplexity as a figure of merit for comparing models. The models that we compare are all unigram ("bag-of-words") models, which—as we have discussed in the Introduction—are of interest in the information retrieval context. We are not attempting to do language modeling in this paper—an enterprise that would require us to examine trigram or other higher-order models. We note in passing, however, that extensions of LDA could be considered that involve Dirichlet-multinomial over trigrams instead of unigrams. We leave the exploration of such extensions to language modeling to future work.
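For concreteness, the perplexity defined above reduces to a couple of lines once per-document log likelihoods are available from whichever model is being evaluated. A minimal sketch with illustrative names:

```python
import numpy as np

def perplexity(log_probs, doc_lengths):
    """Perplexity as defined above: exp of the negative total held-out
    log likelihood divided by the total number of test words.

    log_probs   : per-document values of log p(w_d)
    doc_lengths : per-document word counts N_d
    """
    return float(np.exp(-np.sum(log_probs) / np.sum(doc_lengths)))

# e.g. perplexity([-1500.0, -2300.0], [300, 450]) = exp(3800/750), roughly 158.6
```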
[Figure 8 (top): lists of the most probable words from four of the 100 learned topics, labeled "Arts", "Budgets", "Children", and "Education"; the word lists themselves did not survive text extraction.]

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. "Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants an act every bit as important as our traditional areas of support in health, medical research, education and the social services," Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants. Lincoln Center's share will be $200,000 for its new building, which will house young artists and provide new public facilities. The Metropolitan Opera Co. and New York Philharmonic will receive $400,000 each. The Juilliard School, where music and the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000 donation, too.

Figure 8: An example article from the AP corpus. Each color codes a different factor from which the word is putatively generated.

[Figure 9: two panels plotting perplexity (y-axis) against number of topics (x-axis) for the smoothed unigram, smoothed mixture of unigrams, LDA, and fold-in pLSI models.]

Figure 9: Perplexity results on the nematode (Top) and AP (Bottom) corpora for LDA, the unigram model, mixture of unigrams, and pLSI.

Num. topics (k)   Perplexity (Mult. Mixt.)   Perplexity (pLSI)
2                 22,266                     7,052
5                 2.20 × 10^8                17,588
10                1.93 × 10^17               63,800
20                1.20 × 10^22               2.52 × 10^5
50                4.19 × 10^106              5.04 × 10^6
100               2.39 × 10^150              1.72 × 10^7
200               3.51 × 10^264              1.31 × 10^7

Table 1: Overfitting in the mixture of unigrams and pLSI models for the AP corpus. Similar behavior is observed in the nematode corpus (not reported).

In preprocessing the data, we removed a standard list of 50 stop words from each corpus. From the AP data, we further removed words that occurred only once. We compared LDA with the unigram, mixture of unigrams, and pLSI models described in Section 4. We trained all the hidden variable models using EM with exactly the same stopping criterion: the average change in expected log likelihood is less than 0.001%.

Both the pLSI model and the mixture of unigrams suffer from serious overfitting issues, though for different reasons. This phenomenon is illustrated in Table 1. In the mixture of unigrams model, overfitting is a result of peaked posteriors in the training set, a phenomenon familiar in the supervised setting, where this model is known as the naive Bayes model (Rennie, 2001). This leads to a nearly deterministic clustering of the training documents (in the E-step) which is used to determine the word probabilities in each mixture component (in the M-step).
A previously unseen document may best fit one of the resulting mixture components, but will probably contain at least one word which did not occur in the training documents that were assigned to that component. Such words will have a very small probability, which causes the perplexity of the new document to explode. As k increases, the documents of the training corpus are partitioned into finer collections and thus induce more words with small probabilities.

In the mixture of unigrams, we can alleviate overfitting through the variational Bayesian smoothing scheme presented in Section 5.4. This ensures that all words will have some probability under every mixture component.

In the pLSI case, the hard clustering problem is alleviated by the fact that each document is allowed to exhibit a different proportion of topics. However, pLSI only refers to the training documents and a different overfitting problem arises that is due to the dimensionality of the p(z|d) parameter. One reasonable approach to assigning probability to a previously unseen document is by marginalizing over d:

$p(\mathbf{w}) = \sum_{d} p(d) \prod_{n=1}^{N} \sum_{z} p(w_n \mid z)\, p(z \mid d).$

Essentially, we are integrating over the empirical distribution on the topic simplex (see Figure 4). This method of inference, though theoretically sound, causes the model to overfit. The document-specific topic distribution has some components which are close to zero for those topics that do not appear in the document. Thus, certain words will have very small probability in the estimates of each mixture component. When determining the probability of a new document through marginalization, only those training documents which exhibit a similar proportion of topics will contribute to the likelihood. For a given training document's topic proportions, any word which has small probability in all the constituent topics will cause the perplexity to explode. As k gets larger, the chance that a training document will exhibit topics that cover all the words in the new document decreases and thus the perplexity grows. Note that pLSI does not overfit as quickly (with respect to k) as the mixture of unigrams.

This overfitting problem essentially stems from the restriction that each future document exhibit the same topic proportions as were seen in one or more of the training documents. Given this constraint, we are not free to choose the most likely proportions of topics for the new document. An alternative approach is the "folding-in" heuristic suggested by Hofmann (1999), where one ignores the p(z|d) parameters and refits p(z|dnew). Note that this gives the pLSI model an unfair advantage by allowing it to refit k − 1 parameters to the test data.

LDA suffers from neither of these problems. As in pLSI, each document can exhibit a different proportion of underlying topics. However, LDA can easily assign probability to a new document; no heuristics are needed for a new document to be endowed with a different set of topic proportions than were associated with documents in the training corpus.

Figure 9 presents the perplexity for each model on both corpora for different values of k. The pLSI model and mixture of unigrams are suitably corrected for overfitting. The latent variable models perform better than the simple unigram model. LDA consistently performs better than the other models.

7.2 Document classification

In the text classification problem, we wish to classify a document into two or more mutually exclusive classes.
As in any classification problem, we may wish to consider generative approaches or discriminative approaches. In particular, by using one LDA module for each class, we obtain a generative model for classification. It is also of interest to use LDA in the discriminative framework, and this is our focus in this section.

A challenging aspect of the document classification problem is the choice of features. Treating individual words as features yields a rich but very large feature set (Joachims, 1999). One way to reduce this feature set is to use an LDA model for dimensionality reduction. In particular, LDA reduces any document to a fixed set of real-valued features—the posterior Dirichlet parameters γ∗(w) associated with the document. It is of interest to see how much discriminatory information we lose in reducing the document description to these parameters.

We conducted two binary classification experiments using the Reuters-21578 dataset. The dataset contains 8000 documents and 15,818 words. In these experiments, we estimated the parameters of an LDA model on all the documents, without reference to their true class label. We then trained a support vector machine (SVM) on the low-dimensional representations provided by LDA and compared this SVM to an SVM trained on all the word features. Using the SVMLight software package (Joachims, 1999), we compared an SVM trained on all the word features with those trained on features induced by a 50-topic LDA model. Note that we reduce the feature space by 99.6 percent in this case.

[Figure 10: two panels plotting classification accuracy (y-axis) against the proportion of data used for training (x-axis) for LDA features versus word features.]

Figure 10: Classification results on two binary classification problems from the Reuters-21578 dataset for different proportions of training data. Graph (a) is EARN vs. NOT EARN. Graph (b) is GRAIN vs. NOT GRAIN.

[Figure 11: predictive perplexity (y-axis) against number of topics (x-axis) for LDA, fold-in pLSI, and the smoothed mixture of unigrams.]

Figure 11: Results for collaborative filtering on the EachMovie data.

Figure 10 shows our results. We see that there is little reduction in classification performance in using the LDA-based features; indeed, in almost all cases the performance is improved with the LDA features. Although these results need further substantiation, they suggest that the topic-based representation provided by LDA may be useful as a fast filtering algorithm for feature selection in text classification.

7.3 Collaborative filtering

Our final experiment uses the EachMovie collaborative filtering data. In this data set, a collection of users indicates their preferred movie choices. A user and the movies chosen are analogous to a document and the words in the document (respectively).

The collaborative filtering task is as follows. We train a model on a fully observed set of users. Then, for each unobserved user, we are shown all but one of the movies preferred by that user and are asked to predict what the held-out movie is. The different algorithms are evaluated according to the likelihood they assign to the held-out movie. More precisely, define the predictive perplexity on M test users to be:

$\mathrm{predictive\text{-}perplexity}(D_{\mathrm{test}}) = \exp\Big\{ -\frac{\sum_{d=1}^{M} \log p(w_{d,N_d} \mid \mathbf{w}_{d,1:N_d-1})}{M} \Big\}.$
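Evaluating this metric is a one-liner once the per-user held-out log probabilities have been computed under a given model. A small sketch with illustrative names, not the paper's code:

```python
import numpy as np

def predictive_perplexity(heldout_log_probs):
    """Predictive perplexity as defined above: the exponential of the negative
    average, over the M test users, of log p(held-out movie | observed movies)."""
    heldout_log_probs = np.asarray(heldout_log_probs, dtype=float)
    return float(np.exp(-heldout_log_probs.mean()))
```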
We restricted the EachMovie dataset to users that positively rated at least 100 movies (a positive rating is at least four out of five stars). We divided this set of users into 3300 training users and 390 testing users.

Under the mixture of unigrams model, the probability of a movie given a set of observed movies is obtained from the posterior distribution over topics:

$p(w \mid \mathbf{w}_{\mathrm{obs}}) = \sum_{z} p(w \mid z)\, p(z \mid \mathbf{w}_{\mathrm{obs}}).$

In the pLSI model, the probability of a held-out movie is given by the same equation except that p(z | w_obs) is computed by folding in the previously seen movies. Finally, in the LDA model, the probability of a held-out movie is given by integrating over the posterior Dirichlet:

$p(w \mid \mathbf{w}_{\mathrm{obs}}) = \int \sum_{z} p(w \mid z)\, p(z \mid \theta)\, p(\theta \mid \mathbf{w}_{\mathrm{obs}})\, d\theta,$

where p(θ | w_obs) is given by the variational inference method described in Section 5.2. Note that this quantity is efficient to compute. We can interchange the sum and integral sign, and compute a linear combination of k Dirichlet expectations.

With a vocabulary of 1600 movies, we find the predictive perplexities illustrated in Figure 11. Again, the mixture of unigrams model and pLSI are corrected for overfitting, but the best predictive perplexities are obtained by the LDA model.

8. Discussion

We have described latent Dirichlet allocation, a flexible generative probabilistic model for collections of discrete data. LDA is based on a simple exchangeability assumption for the words and topics in a document; it is therefore realized by a straightforward application of de Finetti's representation theorem. We can view LDA as a dimensionality reduction technique, in the spirit of LSI, but with proper underlying generative probabilistic semantics that make sense for the type of data that it models.

Exact inference is intractable for LDA, but any of a large suite of approximate inference algorithms can be used for inference and parameter estimation within the LDA framework. We have presented a simple convexity-based variational approach for inference, showing that it yields a fast algorithm resulting in reasonable comparative performance in terms of test set likelihood. Other approaches that might be considered include Laplace approximation, higher-order variational techniques, and Monte Carlo methods. In particular, Leisink and Kappen (2002) have presented a general methodology for converting low-order variational lower bounds into higher-order variational bounds. It is also possible to achieve higher accuracy by dispensing with the requirement of maintaining a bound, and indeed Minka and Lafferty (2002) have shown that improved inferential accuracy can be obtained for the LDA model via a higher-order variational technique known as expectation propagation. Finally, Griffiths and Steyvers (2002) have presented a Markov chain Monte Carlo algorithm for LDA.

LDA is a simple model, and although we view it as a competitor to methods such as LSI and pLSI in the setting of dimensionality reduction for document collections and other discrete corpora, it is also intended to be illustrative of the way in which probabilistic models can be scaled up to provide useful inferential machinery in domains involving multiple levels of structure. Indeed, the principal advantages of generative models such as LDA include their modularity and their extensibility. As a probabilistic module, LDA can be readily embedded in a more complex model—a property that is not possessed by LSI.
In recent work we have used pairs of LDA modules to model relationships between images and their corresponding descriptive captions (Blei and Jordan, 2002). Moreover, there are numerous possible extensions of LDA. For example, LDA is readily extended to continuous data or other non-multinomial data. As is the case for other mixture models, including finite mixture models and hidden Markov models, the “emission” probability p(wn | zn ) contributes only a likelihood value to the inference procedures for LDA, and other likelihoods are readily substituted in its place. In particular, it is straightforward to develop a continuous variant of LDA in which Gaussian observables are used in place of multinomials. Another simple extension of LDA comes from allowing mixtures of Dirichlet distributions in the place of the single Dirichlet of LDA. This allows a richer structure in the latent topic space and in particular allows a form of document clustering that is different from the clustering that is achieved via shared topics. Finally, a variety of extensions of LDA can be considered in which the distributions on the topic variables are elaborated. For example, we could arrange the topics in a time series, essentially relaxing the full exchangeability assumption to one of partial exchangeability. We could also consider partially exchangeable models in which we condition on exogenous variables; thus, for example, the topic distribution could be conditioned on features such as “paragraph” or “sentence,” providing a more powerful text model that makes use of information obtained from a parser. - -Acknowledgements -This work was supported by the National Science Foundation (NSF grant IIS-9988642) and the Multidisciplinary Research Program of the Department of Defense (MURI N00014-00-1-0637). Andrew Y. Ng and David M. Blei were additionally supported by fellowships from the Microsoft Corporation. - -References -M. Abramowitz and I. Stegun, editors. Handbook of Mathematical Functions. Dover, New York, 1970. -1015 - -B LEI , N G , AND J ORDAN - -´ D. Aldous. Exchangeability and related topics. In Ecole d’´ et´ e de probabilit´ es de Saint-Flour, XIII— 1983, pages 1–198. Springer, Berlin, 1985. H. Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems 12, 2000. L. Avery. Caenorrhabditis genetic center http://elegans.swmed.edu/wli/cgcbib. bibliography. 2002. URL - -R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999. D. Blei and M. Jordan. Modeling annotated data. Technical Report UCB//CSD-02-1202, U.C. Berkeley Computer Science Division, 2002. B. de Finetti. Theory of probability. Vol. 1-2. John Wiley & Sons Ltd., Chichester, 1990. Reprint of the 1975 translation. S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990. P. Diaconis. Recent progress on de Finetti’s notions of exchangeability. In Bayesian statistics, 3 (Valencia, 1987), pages 111–125. Oxford Univ. Press, New York, 1988. J. Dickey. Multiple hypergeometric functions: Probabilistic interpretations and statistical uses. Journal of the American Statistical Association, 78:628–637, 1983. J. Dickey, J. Jiang, and J. Kadane. Bayesian methods for censored categorical data. Journal of the American Statistical Association, 82:773–781, 1987. A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian data analysis. 
Chapman & Hall, London, 1995. T. Griffiths and M. Steyvers. A probabilistic approach to semantic representation. In Proceedings of the 24th Annual Conference of the Cognitive Science Society, 2002. D. Harman. Overview of the first text retrieval conference (TREC-1). In Proceedings of the First Text Retrieval Conference (TREC-1), pages 1–20, 1992. D. Heckerman and M. Meila. An experimental comparison of several clustering and initialization methods. Machine Learning, 42:9–29, 2001. T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999. F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1997. T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. M.I.T. Press, 1999. M. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, MA, 1999. -1016 - -L ATENT D IRICHLET A LLOCATION - -M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999. R. Kass and D. Steffey. Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). Journal of the American Statistical Association, 84 (407):717–726, 1989. M. Leisink and H. Kappen. General lower bounds based on computer generated higher order expansions. In Uncertainty in Artificial Intelligence, Proceedings of the Eighteenth Conference, 2002. T. Minka. Estimating a Dirichlet distribution. Technical report, M.I.T., 2000. T. P. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Uncertainty in Artificial Intelligence (UAI), 2002. C. Morris. Parametric empirical Bayes inference: Theory and applications. Journal of the American Statistical Association, 78(381):47–65, 1983. With discussion. K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, 1999. K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000. C. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent semantic indexing: A probabilistic analysis. pages 159–168, 1998. A. Popescul, L. Ungar, D. Pennock, and S. Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Uncertainty in Artificial Intelligence, Proceedings of the Seventeenth Conference, 2001. J. Rennie. Improving multi-class text classification with naive Bayes. Technical Report AITR-2001004, M.I.T., 2001. G. Ronning. Maximum likelihood estimation of Dirichlet distributions. Journal of Statistcal Computation and Simulation, 34(4):215–221, 1989. G. Salton and M. McGill, editors. Introduction to Modern Information Retrieval. McGraw-Hill, 1983. - -Appendix A. Inference and parameter estimation -In this appendix, we derive the variational inference procedure (Eqs. 6 and 7) and the parameter maximization procedure for the conditional multinomial (Eq. 9) and for the Dirichlet. We begin by deriving a useful property of the Dirichlet distribution. -1017 - -B LEI , N G , AND J ORDAN - -A.1 Computing E[log(θi | α)] The need to compute the expected value of the log of a single probability component under the Dirichlet arises repeatedly in deriving the inference and parameter estimation procedures for LDA. 
This value can be easily computed from the natural parameterization of the exponential family representation of the Dirichlet distribution.

Recall that a distribution is in the exponential family if it can be written in the form:

$p(x \mid \eta) = h(x) \exp\{ \eta^{\top} T(x) - A(\eta) \},$

where η is the natural parameter, T(x) is the sufficient statistic, and A(η) is the log of the normalization factor.

We can write the Dirichlet in this form by exponentiating the log of Eq. (1):

$p(\theta \mid \alpha) = \exp\Big\{ \sum_{i=1}^{k} (\alpha_i - 1) \log\theta_i + \log\Gamma\big(\sum_{i=1}^{k} \alpha_i\big) - \sum_{i=1}^{k} \log\Gamma(\alpha_i) \Big\}.$

From this form, we immediately see that the natural parameter of the Dirichlet is ηi = αi − 1 and the sufficient statistic is T(θi) = log θi. Furthermore, using the general fact that the derivative of the log normalization factor with respect to the natural parameter is equal to the expectation of the sufficient statistic, we obtain:

$E[\log\theta_i \mid \alpha] = \Psi(\alpha_i) - \Psi\big(\sum_{j=1}^{k} \alpha_j\big),$

where Ψ is the digamma function, the first derivative of the log Gamma function.

A.2 Newton-Raphson methods for a Hessian with special structure

In this section we describe a linear algorithm for the usually cubic Newton-Raphson optimization method. This method is used for maximum likelihood estimation of the Dirichlet distribution (Ronning, 1989; Minka, 2000).

The Newton-Raphson optimization technique finds a stationary point of a function by iterating:

$\alpha_{\mathrm{new}} = \alpha_{\mathrm{old}} - H(\alpha_{\mathrm{old}})^{-1} g(\alpha_{\mathrm{old}}),$

where H(α) and g(α) are the Hessian matrix and gradient respectively at the point α. In general, this algorithm scales as O(N^3) due to the matrix inversion.

If the Hessian matrix is of the form:

$H = \mathrm{diag}(h) + \mathbf{1} z \mathbf{1}^{\top},$   (10)

where diag(h) is defined to be a diagonal matrix with the elements of the vector h along the diagonal, then we can apply the matrix inversion lemma and obtain:

$H^{-1} = \mathrm{diag}(h)^{-1} - \frac{\mathrm{diag}(h)^{-1}\, \mathbf{1}\mathbf{1}^{\top}\, \mathrm{diag}(h)^{-1}}{z^{-1} + \sum_{j=1}^{k} h_j^{-1}}.$

Multiplying by the gradient, we obtain the ith component:

$(H^{-1} g)_i = \frac{g_i - c}{h_i},$ where $c = \frac{\sum_{j=1}^{k} g_j / h_j}{z^{-1} + \sum_{j=1}^{k} h_j^{-1}}.$

Observe that this expression depends only on the 2k values hi and gi and thus yields a Newton-Raphson algorithm that has linear time complexity.
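The linear-time inverse just derived fits in a few lines of code. The following is a minimal sketch, assuming the caller supplies the gradient g and the Hessian components h and z of Eq. (10) (for the Dirichlet objective these quantities are given explicitly in Appendix A.4.2); names are illustrative.

```python
import numpy as np

def newton_step_special_hessian(alpha, g, h, z):
    """One Newton-Raphson step alpha_new = alpha - H^{-1} g, where
    H = diag(h) + 1 z 1^T (Eq. 10). By the matrix inversion lemma,
    (H^{-1} g)_i = (g_i - c) / h_i, so the step costs O(k) rather than O(k^3)."""
    c = np.sum(g / h) / (1.0 / z + np.sum(1.0 / h))
    return alpha - (g - c) / h
```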
A.3 Variational inference

In this section we derive the variational inference algorithm described in Section 5.1. Recall that this involves using the following variational distribution:

$q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$   (11)

as a surrogate for the posterior distribution p(θ, z, w | α, β), where the variational parameters γ and φ are set via an optimization procedure that we now describe.

Following Jordan et al. (1999), we begin by bounding the log likelihood of a document using Jensen's inequality. Omitting the parameters γ and φ for simplicity, we have:

$\log p(\mathbf{w} \mid \alpha, \beta) = \log \int \sum_{z} p(\theta, z, \mathbf{w} \mid \alpha, \beta)\, d\theta$
$= \log \int \sum_{z} \frac{p(\theta, z, \mathbf{w} \mid \alpha, \beta)\, q(\theta, z)}{q(\theta, z)}\, d\theta$
$\geq \int \sum_{z} q(\theta, z) \log p(\theta, z, \mathbf{w} \mid \alpha, \beta)\, d\theta - \int \sum_{z} q(\theta, z) \log q(\theta, z)\, d\theta$
$= E_q[\log p(\theta, z, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, z)].$   (12)

Thus we see that Jensen's inequality provides us with a lower bound on the log likelihood for an arbitrary variational distribution q(θ, z | γ, φ).

It can be easily verified that the difference between the left-hand side and the right-hand side of Eq. (12) is the KL divergence between the variational posterior probability and the true posterior probability. That is, letting L(γ, φ; α, β) denote the right-hand side of Eq. (12) (where we have restored the dependence on the variational parameters γ and φ in our notation), we have:

$\log p(\mathbf{w} \mid \alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + D(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid \mathbf{w}, \alpha, \beta)).$   (13)

This shows that maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence between the variational posterior probability and the true posterior probability, the optimization problem presented earlier in Eq. (5).

We now expand the lower bound by using the factorizations of p and q:

$L(\gamma, \phi; \alpha, \beta) = E_q[\log p(\theta \mid \alpha)] + E_q[\log p(z \mid \theta)] + E_q[\log p(\mathbf{w} \mid z, \beta)] - E_q[\log q(\theta)] - E_q[\log q(z)].$   (14)

Finally, we expand Eq. (14) in terms of the model parameters (α, β) and the variational parameters (γ, φ). Each of the five lines below expands one of the five terms in the bound:

$L(\gamma, \phi; \alpha, \beta) = \log\Gamma\big(\sum_{j=1}^{k}\alpha_j\big) - \sum_{i=1}^{k}\log\Gamma(\alpha_i) + \sum_{i=1}^{k}(\alpha_i - 1)\big(\Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k}\gamma_j\big)\big)$
$\quad + \sum_{n=1}^{N}\sum_{i=1}^{k}\phi_{ni}\big(\Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k}\gamma_j\big)\big)$
$\quad + \sum_{n=1}^{N}\sum_{i=1}^{k}\sum_{j=1}^{V}\phi_{ni}\, w_n^j \log\beta_{ij}$   (15)
$\quad - \log\Gamma\big(\sum_{j=1}^{k}\gamma_j\big) + \sum_{i=1}^{k}\log\Gamma(\gamma_i) - \sum_{i=1}^{k}(\gamma_i - 1)\big(\Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k}\gamma_j\big)\big)$
$\quad - \sum_{n=1}^{N}\sum_{i=1}^{k}\phi_{ni}\log\phi_{ni},$

where we have made use of Eq. (8).

In the following two sections, we show how to maximize this lower bound with respect to the variational parameters φ and γ.

A.3.1 Variational multinomial

We first maximize Eq. (15) with respect to φni, the probability that the nth word is generated by latent topic i. Observe that this is a constrained maximization since $\sum_{i=1}^{k}\phi_{ni} = 1$.

We form the Lagrangian by isolating the terms which contain φni and adding the appropriate Lagrange multipliers. Let $\beta_{iv}$ be $p(w_n^v = 1 \mid z^i = 1)$ for the appropriate v. (Recall that each wn is a vector of size V with exactly one component equal to one; we can select the unique v such that $w_n^v = 1$):

$L_{[\phi_{ni}]} = \phi_{ni}\big(\Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k}\gamma_j\big)\big) + \phi_{ni}\log\beta_{iv} - \phi_{ni}\log\phi_{ni} + \lambda_n\big(\sum_{j=1}^{k}\phi_{nj} - 1\big),$

where we have dropped the arguments of L for simplicity, and where the subscript φni denotes that we have retained only those terms in L that are a function of φni. Taking derivatives with respect to φni, we obtain:

$\frac{\partial L}{\partial\phi_{ni}} = \Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k}\gamma_j\big) + \log\beta_{iv} - \log\phi_{ni} - 1 + \lambda.$

Setting this derivative to zero yields the maximizing value of the variational parameter φni (cf. Eq. 6):

$\phi_{ni} \propto \beta_{iv} \exp\big(\Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k}\gamma_j\big)\big).$   (16)

A.3.2 Variational Dirichlet

Next, we maximize Eq. (15) with respect to γi, the ith component of the posterior Dirichlet parameter. The terms containing γi are:

$L_{[\gamma]} = \sum_{i=1}^{k}(\alpha_i - 1)\big(\Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k}\gamma_j\big)\big) + \sum_{n=1}^{N}\sum_{i=1}^{k}\phi_{ni}\big(\Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k}\gamma_j\big)\big) - \log\Gamma\big(\sum_{j=1}^{k}\gamma_j\big) + \sum_{i=1}^{k}\log\Gamma(\gamma_i) - \sum_{i=1}^{k}(\gamma_i - 1)\big(\Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k}\gamma_j\big)\big).$

This simplifies to:

$L_{[\gamma]} = \sum_{i=1}^{k}\big(\Psi(\gamma_i) - \Psi\big(\sum_{j=1}^{k}\gamma_j\big)\big)\big(\alpha_i + \sum_{n=1}^{N}\phi_{ni} - \gamma_i\big) - \log\Gamma\big(\sum_{j=1}^{k}\gamma_j\big) + \sum_{i=1}^{k}\log\Gamma(\gamma_i).$

We take the derivative with respect to γi:

$\frac{\partial L}{\partial\gamma_i} = \Psi'(\gamma_i)\big(\alpha_i + \sum_{n=1}^{N}\phi_{ni} - \gamma_i\big) - \Psi'\big(\sum_{j=1}^{k}\gamma_j\big)\sum_{j=1}^{k}\big(\alpha_j + \sum_{n=1}^{N}\phi_{nj} - \gamma_j\big).$

Setting this equation to zero yields a maximum at:

$\gamma_i = \alpha_i + \sum_{n=1}^{N}\phi_{ni}.$   (17)

Since Eq. (17) depends on the variational multinomial φ, full variational inference requires alternating between Eqs. (16) and (17) until the bound converges.
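The bound of Eq. (15) is the quantity one can monitor for that convergence test. Below is a minimal per-document sketch, not the authors' code; it assumes the same array conventions as the earlier inference sketch (a k × V `beta`, a document as an array of word ids) and uses scipy for Ψ and log Γ.

```python
import numpy as np
from scipy.special import digamma, gammaln

def elbo(doc, alpha, beta, gamma, phi, eps=1e-12):
    """Variational lower bound L(gamma, phi; alpha, beta) for one document,
    computed term by term as in Eq. (15); useful as a convergence check."""
    dig = digamma(gamma) - digamma(gamma.sum())           # Eq. (8), one value per topic
    bound = gammaln(alpha.sum()) - gammaln(alpha).sum() + ((alpha - 1) * dig).sum()
    bound += (phi * dig).sum()                            # E_q[log p(z | theta)]
    bound += (phi * np.log(beta[:, doc].T + eps)).sum()   # E_q[log p(w | z, beta)]
    bound -= gammaln(gamma.sum()) - gammaln(gamma).sum() + ((gamma - 1) * dig).sum()
    bound -= (phi * np.log(phi + eps)).sum()              # minus E_q[log q(z)]
    return float(bound)
```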
A.4 Parameter estimation

In this final section, we consider the problem of obtaining empirical Bayes estimates of the model parameters α and β. We solve this problem by using the variational lower bound as a surrogate for the (intractable) marginal log likelihood, with the variational parameters φ and γ fixed to the values found by variational inference. We then obtain (approximate) empirical Bayes estimates by maximizing this lower bound with respect to the model parameters.

We have thus far considered the log likelihood for a single document. Given our assumption of exchangeability for the documents, the overall log likelihood of a corpus D = {w1, w2, ..., wM} is the sum of the log likelihoods for individual documents; moreover, the overall variational lower bound is the sum of the individual variational bounds. In the remainder of this section, we abuse notation by using L for the total variational bound, indexing the document-specific terms in the individual bounds by d, and summing over all the documents.

Recall from Section 5.3 that our overall approach to finding empirical Bayes estimates is based on a variational EM procedure. In the variational E-step, discussed in Appendix A.3, we maximize the bound L(γ, φ; α, β) with respect to the variational parameters γ and φ. In the M-step, which we describe in this section, we maximize the bound with respect to the model parameters α and β. The overall procedure can thus be viewed as coordinate ascent in L.

A.4.1 Conditional multinomials

To maximize with respect to β, we isolate terms and add Lagrange multipliers:

$L_{[\beta]} = \sum_{d=1}^{M}\sum_{n=1}^{N_d}\sum_{i=1}^{k}\sum_{j=1}^{V}\phi_{dni}\, w_{dn}^j \log\beta_{ij} + \sum_{i=1}^{k}\lambda_i\big(\sum_{j=1}^{V}\beta_{ij} - 1\big).$

We take the derivative with respect to βij, set it to zero, and find:

$\beta_{ij} \propto \sum_{d=1}^{M}\sum_{n=1}^{N_d}\phi_{dni}\, w_{dn}^j.$

A.4.2 Dirichlet

The terms which contain α are:

$L_{[\alpha]} = \sum_{d=1}^{M}\Big(\log\Gamma\big(\sum_{j=1}^{k}\alpha_j\big) - \sum_{i=1}^{k}\log\Gamma(\alpha_i) + \sum_{i=1}^{k}(\alpha_i - 1)\big(\Psi(\gamma_{di}) - \Psi\big(\sum_{j=1}^{k}\gamma_{dj}\big)\big)\Big).$

Taking the derivative with respect to αi gives:

$\frac{\partial L}{\partial\alpha_i} = M\big(\Psi\big(\sum_{j=1}^{k}\alpha_j\big) - \Psi(\alpha_i)\big) + \sum_{d=1}^{M}\big(\Psi(\gamma_{di}) - \Psi\big(\sum_{j=1}^{k}\gamma_{dj}\big)\big).$

This derivative depends on αj, where j ≠ i, and we therefore must use an iterative method to find the maximal α. In particular, the Hessian is in the form found in Eq. (10):

$\frac{\partial^2 L}{\partial\alpha_i\,\partial\alpha_j} = \delta(i, j)\, M\, \Psi'(\alpha_i) - M\, \Psi'\big(\sum_{j=1}^{k}\alpha_j\big),$

and thus we can invoke the linear-time Newton-Raphson algorithm described in Appendix A.2. Finally, note that we can use the same algorithm to find an empirical Bayes point estimate of η, the scalar parameter for the exchangeable Dirichlet in the smoothed LDA model in Section 5.4.
+Thus the optimizing values of the variational parameters are found by minimizing the KullbackLeibler (KL) divergence between the variational distribution and the true posterior p(θ, z | w, α, β). This minimization can be achieved via an iterative fixed-point method. In particular, we show in Appendix A.3 that by computing
\ No newline at end of file
diff --git a/pydf/data/goodwyn2013.txt b/pydf/data/goodwyn2013.txt
index babeceb..80fe2f8 100644
--- a/pydf/data/goodwyn2013.txt
+++ b/pydf/data/goodwyn2013.txt
@@ -130,105 +130,4 @@ Recurrent motifs and archetypes
the narrative field described above. Narratives that audiences like to experience repeatedly, like folktales or myths, are defined as more resonant than those that do not have such staying power, like jokes, which spread quickly but 'fizzle out' equally rapidly. What is not included in this definition is why a given expression is resonant—that is, where various theoretical and/or philosophical models can suggest testable predictions.
For example, a model based on cognitive anthropology and folklore analysis might propose that high PR expressions should: 1. Have a high proportion of elements that align with the ‘intuitive thinking patterns’ described by cognitive and developmental scientists: image schemata, agency detection, intuitive physics, folk biology, force dynamics, essentialism, causal reasoning and shape (see Gazzaniga et al. 2002, pp. 499–611; Karmiloff-Smith 1996, passim). In other words, high PR expressions will be full of personifications and concrete, easily imagined objects or people moving through space in mostly typical ways. They will contain characters doing mostly expected things in mostly expected ways, and will be devoid of a lot of violations in the expectations of folk biology, force dynamics, cause and effect expectations, and so on. 2. However, they will also be ‘minimally counterintuitive’ (a concept developed by Barrett 2007); that is, a story or image that aligns too well with intuitive expectations will be dismissed as prosaic and unremarkable. Therefore, in addition to aligning well with non-reflective beliefs described above, a highly resonant expression will be likely to have one or a small number of counterintuitive elements. This is because, as Barrett argues, expressions having all intuitive elements will be ignored as commonplace, whereas too many counterintuitive elements will be confusing and difficult to visualize and remember. Expressions that have a few counterintuitive elements will grab attention due to their novelty and will be more easily recalled and visualized. Barrett gives an example: Concepts that too greatly violate intuitive expectations generated by mental tools would be difficult to understand, remember, and communicate at a later time. For instance, a dog that experiences time backwards, is born of a rhino mated with a bullfrog, that sustains itself on graphite, speaks Latin, and changes into cheese on Thursdays would be a difficult concept to transmit faithfully. Such a concept’s primary limitation is that it so greatly violates the expectations (non-reflective beliefs) about dogs generated by human mental tools, that the conceptual structure of dog is undermined. (2007, p. 187) -The above dog violates, in order, the non-reflective beliefs governing cause and effect (backwards time), folk biology (such an animal mating is wrong, what the dog eats is wrong, and animals don’t speak), and finally, essentialism (cheese has a different kind of ‘essence’ than animals). One could make the example even more - -398 - -Erik Goodwyn - -non-resonant by allowing the dog to violate force dynamics through telekinesis, or agency, by describing it as without any intentions, and intuitive physics by allowing it to walk through walls. Such a dog would be as poorly resonant as an utterly commonplace dog that violates none of the above. A cat, however, that experiences time normally, is born of cat parents, eats cat food, always stays a cat, but speaks Latin, is an example of a ‘minimally counterintuitive’ expression that is likely to be much more resonant. Thus, high PR expressions are predicted to have a low, but non-zero number of intuitive violations in them. Personification of story elements will be the rule rather than the exception, as non-human things with human mentality and intentionality are easily comprehended and are minimally counterintuitive. 3. 
Resonant expressions will be emotionally evocative, perhaps stimulating and satisfying the convergent affective systems involving fear, rage, panic, play, care, lust, seeking, and hunger (Goodwyn 2012, pp. 23–27, pp. 28–59, pp. 202–05; Panksepp 1998, passim, 2005). Note that Jung requires archetypal images to be affectively charged and/or ‘numinous’ (Jung 1964, p. 87; von Franz 1996, p. 10). The most resonant expressions will be likely to trigger strong affects, such as the shocking eroticism and shame of Godiva’s naked ride, the voyeurism of Peeping Tom and the vengeful satisfaction at his blindness, the piety and inspiring loyalty of the citizens toward their Lady (an element which overtook the vague ‘it’s a miracle’ explanation), and so forth. High PR expressions should have vivid, emotionally satisfying content and low PR expressions are predicted to have a cool, detached quality or will be emotionally frustrating. 4. Resonant expressions should be sensually vivid and clearly defined. Expressions containing creatures moving through clearly defined and vivid environments that are simple and not overburdened with descriptors, yet not overly abstract either, are likely to be more resonant. Middle level categories that are easy to visualize like ‘shimmering sword’ will therefore be preferable to abstract categories like ‘weapon’ or dense expressions like ‘quillioned pattern-weld blade with Brighthampton scabbard and cross’. High PR expressions should, therefore, contain a lot of sensually vivid, simple, middle-level category descriptions, rather than very specific or very general and abstract descriptions. Objects and settings will tend to be described in vivid, static terms such as gold, crystal, metal. Descriptions will use basic colours like black, white, red, or blue, rather than ambiguous, complicated or changing terms. Characters will be simple types that act with very little reference to internal dynamics or psychological conflict, and they will often be extremely distinct in some way, for example in their social position (king, prince, god), or in their beauty, power, evilness or ugliness. Internally complex characters with indeterminate or prosaic qualities are less likely to be resonant. 5. Narratives with a sense of timelessness, such as stories set ‘long ago’ or in the eternal ‘dream time’ of the gods (Eliade 1954, pp. 17–73; Sproul 1979, - -Recurrent motifs and archetypes - -399 - -pp. 1–31; von Franz 1996, pp. 37–46, 2001, pp. 63–145) are likely to be more resonant. Folklorists suggest they evoke a satisfying ‘oceanic’ feeling. Cognitively, such generality in time and space allows a narrative to be applicable to many settings, easily visualized and adapted. Thus, high PR expressions are less likely to contain specific identifiers of time and space. 6. Rhythmic and prosodic utterances will be likely to have a special resonance for their ‘musical’ quality, which enhances recall and emotional engagement. Consider the statement of folklorists Robert Stockwell and Donka Minkova: -the primary role of meter [is] its mnemonic value. . .Beyond this obvious primary function, the rhythm that results from the performance of metrical regularity is intrinsically pleasurable. . .[and these derive] from a set of intrinsic properties of language. . .either onset identity (‘alliteration’), as in Old English, or coda identity (‘rime’), as later. These intrinsic properties are universal in language, within a small range of variation. (1977, p. 
59) - -Many of the world’s oldest myths were originally in verse form, which adds strength to the validity of this criterion. Thus, high PR expressions are more likely to contain rhythmic or prosodic elements (such as ‘Magic mirror, on the wall. . .’) than low PR expressions. 7. Resonant stories will have simple plots such as a task to fulfil, or a progression from lack to gain. They will have a sense of urgency and conflict, but contain some kind of dramatic reversal or unforeseen event. Common themes include: help from mysterious or unknown powers, the weak prevailing over the powerful, self—destructive actions that turn out beneficial in the end, victory or catastrophe that is suddenly reversed, evil that thwarts itself, good intentions leading to harm, inaction leading to weal. Resonant narratives also often show that appearances are deceptive (this is identified by Lüthi as the most common folktale theme). Thus they typically contain irony. Non-resonant expressions will have no repetition, overly complex plots or no plot, and no unexpected reversals of events or irony. 8. Resonant stories will have an ‘interconnected’ quality, for example things may occur ‘in the nick of time’, or events may recur in doubles or triples. Both characters and events are likely to come in contrasting pairs and to interact in just such a fashion as to give the impression that though objects and characters are sharply drawn and distinct, they are nevertheless part of a teleologically organized process. Non-resonant expressions will have no repetition and will appear disjointed, purposeless and randomly juxtaposed.4 -4 - -The quality of interconnection shares many similarities with synchronicity. If the quality of interconnection proves to correctly predict resonance, this may help explain why events that appear to have a high level of synchronicity are particularly resonant and emotionally powerful. Moreover, with further investigation, we may be able to better characterize perceptions that ‘trigger’ a sense of acausal connection. - -400 - -Erik Goodwyn - -It is important to note that these predicted features are hypothecated without reference to evidence about whether a narrative has lasted across generations, or has been found to be independently invented cross-culturally. Rather, it is predicted that resonant narratives should be more likely to contain some or all of the above elements, and narratives that are not resonant (that do not survive transmission easily), should be less likely to. I invite clinicians and researchers to test these predictions against known recurrent motifs, and against clinical data. The most resonant dreams, patient self-narratives, psychotic fantasies and ritualistic obsessions (the most memorable, durable, and ‘sticky’) should also have the above qualities. If they do not, the criteria will need to be refined. In summary, it appears humans have a tendency to selectively transmit folktales, myths and legends in such a way that over multiple retellings they will drift into recurrent, resonant ‘attractor positions’ that share the same characteristics and themes regardless of particular origin. This can occur quickly, such as when a story ‘spreads like wildfire’ (and subsequently persists) or gradually builds over generations. Such attractor states will be associated with various universal and reliably emergent learnability and emotional factors that tend to preserve some narratives over others. 
The above criteria are hypothetical characteristics based on cognitive science and folklore studies, but it is up to empirical study to verify what they are. This puts us in the position to propose a testable model for the archetype, since in the present work we are proposing that archetypes generate motifs through the mechanism of psychological resonance. Those criteria which most reliably produce resonant narratives are therefore defined to be archetypal. The attractor state model therefore defines the archetype as the collection of psychological constraints and biases, of whatever origin, that work in concert to create one or more resonant attractor states in the narrative field. It is a group of processes that can increase the probability of generating and retaining, say, a dragon slayer story anywhere. Specifically, these constraints include the particular reliably emergent cognitive, semantic, memory, emotional, and visuospatial biases on human thought—whatever their origin may be—that are applicable to a given typical situation. Naturally at this point they are not fully characterized, but are put forward as a tentative, testable model inspired by cognitive anthropology and folklore study. The model predicts that these constraints and biases should include, but need not be limited to: 1) 2) 3) 4) 5) 6) 7) 8) 9) minimal counter-intuitiveness emotionality sensual vividness indeterminacy in time and space biasing toward middle-level categories low complexity containing rhythmic and prosodic elements having simple plots with reversals and/or irony apparent interconnection of events. - -Recurrent motifs and archetypes - -401 - -Non-resonant expressions will be overly counter-intuitive or overly mundane, emotionally detached or frustrating, sensually vague or abstract, specific in time and space, contain over-specific or over-general categories, be internally complex or ambiguous, will lack any rhythmic or poetic qualities, will lack a clear plot, will lack reversals and interconnection. As will be seen, one advantage of this definition is that it sets aside the question of origins and focuses on the constraints themselves and links them tightly with motifs. - -Discussion: comparing the attractor state model to Jung’s archetypes In the above, I have proposed a model of archetype working from the recurrent motif. If the recurrent motif is central to our definition of the archetype, this model also provides a method for refining the definition empirically using the features of recurrent motifs and clinical data. We may be departing somewhat from Jung, however, in such a model. For example, Jung writes that the archetype -Can be conceived as a mnemic deposit. . .which has arisen through the condensation of countless processes of a similar kind. In this respect it is a precipitate and, therefore, a typical basic form, of certain ever recurring psychic experiences. (1971/2005, para. 748) - -And he states further that the archetype -Can only be explained assuming them to be deposits of the constantly repeated experiences of humanity. . . [they] are a kind of readiness to produce over and over again the same or similar mythical ideas. Hence it seems as though what is impressed upon the unconscious were exclusively the subjective fantasy-ideas aroused by the physical process. We may therefore assume that the archetypes are recurrent impressions made by subjective reactions. (1953, para. 
109) - -Jung appears to understand the archetypes in terms of biological perceptual predispositions, ‘deposited’ via an undefined evolutionary process, which produce recurrent motifs. Thus, he does not work backward from the motif as I have done, but starts more foundationally with biological assumptions in an attempt to show how it could possibly produce motifs, but no specific mechanism is given. Because of this choice, any attempt to construct a testable model from Jung’s speculations forces us to diverge from Jung. For example, it appears that Jung is saying that repeated encounters throughout the history of our species have left some sort of mark on the psyche that somehow generates ‘recurrent impressions made by subjective reactions’. But what mythic ideas are related to what recurrent species challenges? Why do we tell stories of dragon slayers more often than aardvark slayers? Which ‘certain ever-recurring psychic experiences’ leave such an imprint? Which physical processes? And exactly how does ‘a kind of readiness’ behave? - -402 - -Erik Goodwyn - -These are naturally criticisms from hindsight. Jung’s choice to work from early 20th century biological notions was a natural one, but resulted in a lack of specificity and the detail needed for a straightforward empirical approach. The attractor state model is advantageous empirically because it provides these details, but necessarily focuses out of some ideas (as discussed by Knox 2003) such as the archetype as eternal metaphysical entity, which–though possibly valid–is outside the realm of empirical inquiry at present. The attractor state model is only concerned with which potential criteria produce resonant themes and which do not, irrespective of origin. It asks only ‘what are the criteria for resonant expressions?’ not ‘what creates the criteria?’—that must come later, empirically, after we have firmly established what they are. It is therefore ‘closer’ to the data from motifs than the more comprehensive, but more speculative theories of Stevens, Knox, Haule, Hogenson and McDowell. It does, however, have the ability to generate data to eventually judge these more general theories of archetype. Thus, the attractor state model proposes a method of analysing clinical narratives and dreams, and differentiating them into those which are archetypal and those which are not, which is something the other models do not aspire to do. The present proposed criteria for resonance are inspired by the findings of cognitive anthropologists and folklorists, and my own explorations (Goodwyn 2012, passim). But whether these criteria are valid can—using this method—be determined empirically. This method may therefore compel us to abandon Jung’s hypotheses about ‘ancestral deposits’—or it may not—we are free to let the data reign.5 But before such a task is attempted, the present model allows us to clearly define what constraints and characteristics produce the most psychological resonance. Once accomplished, we will be far better equipped to determine the origins of such constraints and characteristics, whether biological, developmental, mathematical, interactional, or some combination. The reviewed theories of the archetype give well-reasoned proposals for how motif-making constraints might arise based on the findings of other disciplines, but sparse detail is provided concerning what those constraints actually are beyond abstract principles. 
They do not characterize features of recurrent motifs, they cannot differentiate between resonant and non-resonant expressions, and they cannot make any predictions. They do, however, give us solid theoretical reasons for believing such constraints exist in themselves, whether or not they fully agree with each other on why they exist. In any case, by postponing the theory of the origin of the constraints for now and focusing on the empirical search for what the constraints actually are, archetype theory, as proposed here, becomes an empirical endeavour that remains within clinical and experimental analytical psychology. -5 - -The reference to emotional content among the criteria remains an avenue in which motifs may have evolutionary or biological ‘deposit’ content, as affective neuroscientists (Panksepp 1998, 2005) continue to advance our knowledge of the way evolution has shaped our emotional responses. - -Recurrent motifs and archetypes - -403 - -Any expression that has been independently identified as a resonant attractor state via cross-cultural studies (such as the ARAS, the folklore index or by historians of symbolism) or longitudinal study (such as the Lady Godiva, King Arthur, or Grimm’s fairytale analysis) can be examined by clinical and experimental analytical psychologists to see if they have the above qualities. Such qualities are inspired by cognitive science and folklore studies, but any set of criteria could be proposed or tested against the recurrent motif. Predictive validity will allow us to refine these criteria with further data. - -Conclusion In summary, the attractor state model provides us with more than a hypothesis of archetype in terms of where the (generically defined) constraints may come from. Instead it postpones questions of origins and focuses tightly on what the constraints are, how resonant motifs arise, and why some motifs are resonant where others are not, allowing for easy recognition in clinical practice and straightforward testability in research.6 Using the presently proposed criteria, a minimally counter-intuitive, rhythmic, emotionally sharp narrative with middlelevel objects and characters, with an indistinct setting in time and place, with a simple conflict-oriented plot with dramatic reversals and a sense of interconnectedness should be more resonant (that is, more memorable and ‘sticky’) than a story without these qualities. This prediction should be tested empirically in both research and clinical settings. Firmly linking archetype to motif in this manner may diverge somewhat from Jung’s original intention. We may not ever be sure, given Jung’s multiple definitions and ambiguous language. But it provides us for the first time with a method of refining the definition of archetype (as offered here) with direct empirical study rather than theoretical speculation, and frees us from dependency upon any particular framework outside analytical psychology, apart from drawing inspiration from whatever field to propose further criteria. Such proposed criteria can subsequently be tested against known recurrent motifs and clinical material—for example, themes found to resonate strongly with patients. From this methodology, we will be able to more clearly differentiate which clinical narratives and dreams are more resonant because we will have an empirically derived set of criteria. As a thought experiment, try to recall the story about the hippo. 
This should be difficult, even though it has the same number of words (20) as the dragon slayer story, because I intentionally designed it to be non-resonant. Clinically, some life stories, fantasies, obsessions and dreams will be more resonant than others. Jungian psychology attends to resonant motifs and resonant characters like gods, spirits -6 - -Research programmes could be devised which present subjects with narratives and images to measure resonance directly via ability to recall stories with and without the proposed criteria for resonance, not unlike the ASI study of Rosen et al. (1991). - -404 - -Erik Goodwyn - -and dream or fantasy beings, but has not always provided clear criteria by which to judge. As models of the archetype become more rigorous and testable, we should in the future be able to more easily judge that question. Continued investigation of recurrent motifs can give us more sophisticated tools to understand the clinical narratives that contain them. I will conclude with some observations of relevance to clinical work. I proposed that non-resonant narratives will be confusingly counter-intuitive or overly mundane, emotionally detached or frustrating, sensually vague or abstract, specific in time and space, will contain over-specific or over-general categories, will be internally complex or ambiguous, and will lack any rhythmic or poetic qualities, a clear plot, reversals or irony and interconnection. Note that there can often be a similarity between these qualities and our patients’ self-narratives when they come into therapy for the first time. It is possible that psychotherapy works partially because patients and therapists continually examine a patient’s story, and slowly co-create a more resonant narrative over time. Research in Prolonged Exposure therapy (Foa et al. 2007, pp. 6–16) provides some support here. More resonant self-narratives will, significantly, be more emotionally satisfying and contain a strong sense of inter-connection and purposefulness. The tendency to gravitate toward greater resonance of self-narrative (facilitated by therapy) may, therefore, contribute significantly to endogenous psychological healing and integration. This can be tested clinically by having patients describe a self-narrative before and after therapy and correlating resonance of self-narrative with any clinical improvement. TRANSLATIONS OF ABSTRACT -Au niveau le plus élémentaire, les archétypes représentent une tentative de Jung pour expliquer le phénomène des motifs récurrents dans les mythes et les contes populaires (Jung 1956, 1959, para. 99). Mais l’archétype demeure controversé en tant qu’explication de motifs récurrents, car l’existence de motifs récurrents n’est pas une preuve que les archétypes existent. Ainsi, l’enjeu pour la théorie contemporaine de l’archétype n’est pas simplement de démontrer que les motifs récurrents existent, puisque cela n’est pas contesté, mais de démontrer que les archétypes existent et induisent les motifs récurrents. Cet article propose un nouveau modèle différent des autres car il explique comment l’archétype crée des motifs en résonance. Forcément, ce modèle clarifie, modifie et adapte les idées fondatrices de Jung sur l’archétype, de façon à donner un cadre de travail basé sur la pratique et les méthodologies contemporaines. Pour la première fois, un modèle d’archétype est proposé, qui peut être validé dans un champ empirique plutôt que théorique. 
On y parvient en reliant l’archétype aux données concrètes des motifs récurrents plutôt que dans des orientations académiques d’autres champs. - -Auf ganz grundlegender Ebene repräsentierten die Archetypen Jungs Versuch, das Phänomen der regelmäßig wiederkehrenden Mythen und Motive in den Volkssagen zu erklären (Jung 1956, 1959 } 99). Aber der Archetyp bleibt als Erklärung von häufig vorkommenden Motiven kontrovers, da die Existenz von wiederkehrenden Motiven die - -Recurrent motifs and archetypes - -405 - -Existenz der Archetypen nicht beweist. So besteht die an die gegenwärtige Archetypentheorie gestellte Herausforderung nicht darin zu zeigen, daß es die stetig wiederkehrenden Motive gibt, da dies nicht in Frage gestellt wird, sondern zu zeigen, daß Archetypen existieren und das Wiederkehren von Motiven bewirken. Der vorliegende Beitrag entwirft ein neues Modell, welches sich von anderen dadurch unterscheidet, daß es erklärt, wie der Archetyp resonante Motive erzeugt. Notwendigerweise klärt, modifiziert und adaptiert dieses Modell einige von Jungs grundlegenden Ideen über Archetypen. Dies geschieht in der Absicht, einen Arbeitsrahmen bereitzustellen, der auf heutiger Praxis und Methodologie fußt. Zum ersten Mal wird ein Archetypenmodell vorgestellt, welches eher auf empirischer als auf theoretischer Basis validiert werden kann. Dies wird durch die Verbindung des Archetyps mit den harten Daten wiederkehrender Motive erreicht, was einen Unterschied macht zu akademischen Trends auf anderen Gebieten. - -Al livello più basico gli archetipi rappresentano il tentativo di Jung di spiegare il fenomeno della ricorrenza dei miti e dei motivi popolari (Jung 1956, 1959, para. 99). Ma l’archetipo resta controverso come spiegazione dei motivi ricorrenti, poiché l’esistenza di motivi ricorrenti non prova che l’archetipo esista. Quindi la sfida per l’attuale teoria degli archetipi non è semplicemente dimostrare che esistono motivi ricorrenti, poiché ciò non è in discussione, ma dimostrare che gli archetipi esistono e causano motivi ricorrenti. Il presente lavoro propone un nuovo modello che differisce dagli altri nel fatto che spiega in che modo gli archetipi creano motivi risonanti. Tale modello necessariamente chiarisce, modifica e adatta alcune delle idee seminali di Jung sugli archetipi in modo da fornire una struttura di lavoro basata sulla pratica e sulle metodologie attuali. Per la prima volta viene proposto un archetipo che può essere convalidato su basi empiriche piuttosto che teoriche. Ciò si ottiene legando l’archetipo ai dati forti di motivi ricorrenti piuttosto che a tendenze accademiche in altri campi. - -На самом базовом уровне архетипы представляют собой попытку Юнга объяснить феномен повторяющихся мифологических и фольклорных мотивов (Юнг 1956, 1959, пар 99). Но архетип представляется противоречивым объяснением повторяющихся мотивов, поскольку существование повторяющихся мотивов не является доказательством существования архетипа. Таким образом, испытанием для современной теории архетипов оказывается не просто необходимость продемонстрировать существование повторяющихся мотивов – ведь это не оспаривается, – но продемонстрировать то, что архетипы существуют и являются причиной повторяющихся мотивов. Настоящая статья предлагает новую модель, непохожую на другие; она объясняет, как архетипы создают резонансные мотивы. 
Эта модель разъясняет, поправляет и адаптирует некоторые из основополагающих идей Юнга об архетипе, предоставляя рабочую рамку, опирающуюся на современную практику и методологию. Впервые предлагается такая модель архетипа, которую можно оценить на эмпирической, а не на теоретической основе. Это достигается тем, что архетип связывается с достоверной информацией повторяющихся мотивов как таковых, а не с академическими тенденциями из других областей. - -406 - -Erik Goodwyn - -En el nivel más básico, los arquetipos representaron la tentativa de Jung para explicar el fenómeno de motivos recurrentes de mitos y cuentos popular es (Jung 1956, 1959, el párr. 99). Pero el arquetipo escontroversial para la explicación de motivos recurrentes, y la existencia de motivos recurrentes no demuestra que los arquetipos existen. Así, el desafío para la teoría contemporánea de los arquetipos es, no solamente demostrar que los motivos recurrentes existen, puesto que ello no está en discussion, sino demostrar que los arquetipos existen y causan los motivos recurrentes. El presente trabajo propone un nuevo modelo que, a diferencia de otros, explica cómo el arquetipo crea motivos resonantes. Este modelo necesariamente clarifica, modifica y adapta algunas de ideas originales de Jung en relación al arquetipo para proporcionar una marco de trabajo basado en la práctica y metodologías contemporáneas. Por primera vez se propone un modelo de arquetipo que puede ser validado empíricamente antes que teóricamente. Esto se logra ligando el arquetipo a los datos sólidos de los motivos recurrentes antes que a las tendencias académicas en otros campos. - -References -ARAS (Archive for Research in Archetypal Symbolism). (2010). The Book of Symbols. New York: Taschen. Ashe, G. (2003). The Discovery of King Arthur. Gloucestershire: Sutton. Ashliman, D. L. (1987). A guide to folktales in the English language. London: Greenwood. Atran, S. C. (2002). In Gods We Trust. Oxford: Oxford University. Barrett, J. (2007). ‘Gods’. In Religion, Anthropology, and Cognitive Science, eds. H. Whitehouse & J. Laidlaw. Carolina: Carolina Academic Press. Boyer, P. (2001). Religion Explained. New York: Basic Books. Cirlot, J. E. ([1971] 2002). A Dictionary of Symbols. New York: Dover. Deacon, T. W. (2010). ‘A role for relaxed selection in the evolution of the language capacity’. PNAS, 107, 2, 9000–06. Davidson, H. E. (1978). Patterns of Folklore. Ipswich: D.S. Brewer. Eliade, M. (1954). The Myth of the Eternal Return. New Jersey: Mythos. ——— ([1958] 1990). Patterns in Comparative Religion. Nebraska: Bison Books. Foa, E. B., Hembree, E. A. & Rothbaum, B. O. (2007). Prolonged Exposure Therapy for PTSD. Oxford: Oxford. Fulk, R. D., Bjork, R. E. & Niles, J. D. (Eds.) (2009). Klaeber’s Beowulf: Fourth Edition. Toronto: University of Toronto. Gazzaniga, M. S., Ivry, R. B. & Mangun, G. R. (2002). Cognitive Neuroscience. New York: W.W. Norton. Garmonsway, G. N., Simpson, J. & Davidson, H. E. (1971). Beowulf and its analogues. New York: E. P. Dutton. Goodwyn, E. (2010). ‘Approaching archetypes: reconsidering innateness’. The Journal of Analytical Psychology, 55, 4, 502–21. ——— (2012). The Neurobiology of the Gods. London: Routledge. Grimm, J. (2003). Grimm’s Fairy Tales, 3rd Edition. New York: Bantam. Haule, J. (2012). Jung in the 21st Century (2 vols.). London: Routledge. Hogenson, G. B. (2001). ‘The Baldwin effect: a neglected influence on C. G. Jung’s evolutionary thinking’. Journal of Analytical Psychology, 46, 4, 591–611. ——— (2004). 
‘Archetypes: emergence and the psyche’s deep structure’. In Analytical Psychology: Contemporary Perspectives in Jungian Analysis, eds. J. Cambray & L. Carter. New York: Brunner-Routledge. ——— (2009). ‘Archetypes as action patterns’. Journal of Analytical Psychology, 54, 3, 325–37. - -Recurrent motifs and archetypes - -407 - -——— (2010). ‘Responses to Erik Goodwyn’s “Approaching archeytpes: reconsidering innateness.’” Journal of Analytical Psychology, 55, 4, 543–49. Jung, C. G. (1919). ‘The structure and dynamics of the psyche’. CW 8. ——— ([1953] 1977). ‘Two essays on analytical psychology’. CW 7. ——— ([1956] 1990). ‘Symbols of transformation’. CW 5. ——— ([1959] 2006). ‘The archetypes and the collective unconscious’. CW 9i. ——— (1964). ‘Approaching the Unconscious’. In Man and His Symbols, eds. C. G. Jung & M. L. von Franz. New York: Dell. ——— (1971/2005). ‘Psychological types’. CW 6. Jung, L. & Meyer-Grass, M. (Eds.) (2008). Children’s Dreams: Notes from the Seminar Given in 1936–1940. Princeton: Princeton University Press. Karmiloff-Smith, A. (1996). Beyond Modularity. Cambridge: MIT Press. Knox, J. (2003). Archetype, Attachment, Analysis. London: Routledge. ——— (2010). ‘Responses to Erik Goodwyn’s “Approaching archetypes: reconsidering innateness”. Journal of Analytical Psychology, 55, 4, 522–33. Laidlaw, J. & Whitehouse, H. (2007). Introduction to H. Whitehouse and J Laidlaw. Leach, M. & Fried, J. (Eds.) (1984). Funk and Wagnalls Standard Dictionary of Folklore, Mythology and Legend. San Francisco: Harper & Row. Lehner, E. (1956). The Picture Book of Symbols. New York: William Penn. Lüthi, M. (1982). The European Folktale: Form and Nature. Bloomington: Indiana University Press. ——— (1984). The Fairytale as Art Form and Portrait of Man. Bloomington: Indiana University Press. McCauley, R. N. & Lawson, E. T. (2002). Bringing Ritual to Mind. Cambridge: Cambridge University Press. McDowell, M. J. (2001). ‘Principle of organization: a dynamic-systems view of the archetype-as-such’. Journal of Analytical Psychology, 46, 637–54. Merchant, J. (2010). ‘Responses to Erik Goodwyn’s “Approaching archetypes: reconsidering innateness”’. Journal of Analytical Psychology, 55, 4, 534–42. Panksepp, J. (1998). Affective Neuroscience. Oxford: Oxford University Press. ——— (2005). ‘Affective consciousness: core emotional feelings in animals and humans’. Cognition and Consciousness, 14, 30–80. Pietikainen, P. (1998). ‘Archetypes as symbolic forms’. Journal of Analytical Psychology, 43, 325–43. Propp, V. (1968). Morphology of the Folktale. Austin: University of Texas Press. Puhvel, J. (1987). Comparative Mythology. Baltimore: Johns Hopkins University Press. Pyysiäinen, I. (2009). Supernatural Agents. Oxford: Oxford University Press. Rosen, D. H., Smith, S. M., Huston, H. L. & Gonzales, G. (1991). ‘Empirical study of associations between symbols and their meanings: evidence of collective unconscious (archetypal) memory’. Journal of Analytical Psychology, 36, 211–28. Sørensen, J. (2007). A Cognitive Theory of Magic. New York: Alta Mira Press. Sperber, D. (1996). Explaining Culture. Oxford: Blackwell Press. Sproul, B. (1979). Primal Myths. New York: HarperOne. Stevens, A. (1998). Ariadne’s Clue. Princeton: Princeton University Press. ——— (2003). Archetype Revisited. London: Routledge. Stockwell, R. P. & Minkova, D. (1977). ‘Prosody’. In A Beowulf Handbook, eds. R. E. Bjork & J. D. Niles. Lincoln: University of Nebraska Press. Thompson, S. (1960). Motif-Index of Folk Literature. 
Bloomington: Indiana University Press. Tresidder, J. (ed.) (2005). The Complete Dictionary of Symbols. San Francisco: Chronicle. Üther, H. J. (2004). The Types of International Folktales. Helsinki: Finnish Academy of Science and Letters. - -408 - -Erik Goodwyn - -von Franz, M. L. (1996). The Interpretation of Fairy Tales. Boston: Shambhala. ——— (2001). Psyche and Matter. Boston: Shambhala. Whitehouse, H. (2004). ‘Toward a comparative anthropology of religion’. In Ritual and Memory, eds. H. Whitehouse & J. Laidlaw. New York: Alta Mira. ——— (2007). ‘Integrating ethnography, history, cognitive science of religion’. In Religion, Anthropology, and Cognitive Science, eds. H. Whitehouse & J. Laidlaw. Durham: Carolina Academic Press. Whitehouse, H. & Laidlaw, J. (eds.) (2007). Religion, Anthropology, and Cognitive Science. Durham, North Carolina: Carolina: Carolina Academic Press. Zipes, J. (2003). Introduction to the Complete Fairy Tales of the Brothers Grimm, 3rd edn. New York: Bantam Books. - -[MS first submitted July 2012; final version March 2013] - +The above dog violates, in order, the non-reflective beliefs governing cause and effect (backwards time), folk biology (such an animal mating is wrong, what the dog eats is wrong, and animals don’t speak), and \ No newline at end of file diff --git a/pydf/data/muellner2011.txt b/pydf/data/muellner2011.txt index c04ecc2..9f29075 100644 --- a/pydf/data/muellner2011.txt +++ b/pydf/data/muellner2011.txt @@ -135,209 +135,4 @@ nearest neighbor of a node x is one of the clusters a, b which are joined, then 10 -valid index here; however, for the single linkage method, it makes sense to choose b: if the nearest neighbor of a node was at index a, it is now at b, which represents the join a ∪ b. The remaining code in the main loop ensures that the array n _nghbr still contains lower bounds on the distances to the nearest neighbors. If the distance from the new cluster x to a cluster b d(I, J ) ≤ min{d(I, K ), d(J, K )} - -⇒ - -min{d(I, K ), d(J, K )} ≤ d(I ∪ J, K ) - -(1) - -for all disjoint nodes I, J, K at any stage of the clustering (Murtagh, 1984, § 3), (Murtagh, 1985, § 3.5). This is false.1 We give a correct proof which also shows the limitations of the algorithm. In Murtagh’s papers (Murtagh, 1984, 1985), it is not taken into account that the dissimilarity between clusters may depend on the order of clustering steps; on the other hand, it is explicitly said that the algorithm works for the “weighted” scheme, in which dissimilarities depend on the order of the steps. Since there is no published proof for the NN-chain algorithm but claims which go beyond what the algorithm can truly do, it is necessary to establish the correctness by a strict proof: Theorem 1. Fix a distance update formula. For any sequence of merging steps and any four disjoint clusters I, J, K, L resulting from these steps, require two properties from the distance update formula: • It fulfills the reducibility property (1). • The distance d(I ∪ J, K ∪ L) is independent of whether (I, J ) are merged first and then (K, L) or the other way round. Then the algorithm NN-chain-linkage produces valid stepwise dendrograms for the given method. Proposition 2. The “single”, “complete”, “average”, “weighted” and “Ward” distance update formulas fulfill the requirements of Theorem 1. Proof of Theorem 1. We prove the theorem by induction in the size of the input set S . The induction start is trivial since a dendrogram for a one-point set is empty. 
We call two nodes a, b ∈ S reciprocal nearest neighbors (“pairwise nearest neighbors” in the terminology of Murtagh, 1985) if the distance d[a, b] is minimal among all distances from a to points in S , and also minimal among all distances from b: d[a, b] = min d[a, x] = min d[b, x]. -x∈ S x=a x∈ S x=b - -Every finite set S with at least two elements has at least one pair of reciprocal nearest neighbors, namely a pair which realizes the global minimum distance. The list chain is in the algorithm constructed in a way such that every element is a nearest neighbor of its predecessor. If chain ends in [. . . , b, a, b], we know that a and b are reciprocal nearest neighbors. The main idea behind the algorithm is that reciprocal nearest neighbors a, b always contribute a row (a, b, d[a, b]) to the stepwise dendrogram, even if they are not discovered in ascending order of dissimilarities. -1 For - -example, consider the distance update formula d(I ∪ J, K ) := d(I, K ) + d(J, K ) + d(I, J ). This formula fulfills the reducibility condition. Consider the following distance matrix between five points in the first column below. The Primitive_clustering algorithm produces the correct stepwise dendrogram in the middle column. However, if the point A is chosen first in line 7 of NN-chain-core, the algorithm outputs the incorrect dendrogram in the right column. A B C D B 3 C 4 5 D 6 7 1 E 15 12 13 14 (C, (A, (AB, (ABCD, D, B, CD, E, 1) 3) 27) 85) (C, (A, (CD, (AB, D, B, E, CDE, 1) 3) 28) 87) - -12 - -Figure 4 The nearest-neighbor clustering algorithm. -1: 2: 3: 4: 5: 6: 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: - -procedure NN-chain-linkage(S, d) S : node labels, d: pairwise dissimilarities L ← NN-chain-core(N, d) Stably sort L with respect to the third column. L ← Label(L) Find node labels from cluster representatives. return L end procedure procedure NN-chain-core(S, d) S : node labels, d: pairwise dissimilarities S ← (0, . . . , N − 1) chain = [ ] size [x] ← 1 for all x ∈ S while |S | > 1 do if length(chain ) ≤ 3 then a ← (any element of S ) E.g. S [0] chain ← [a] b ← (any element of S \ {a}) E.g. S [1] else a ← chain [−4] b ← chain [−3] Remove chain [−1], chain [−2] and chain [−3] Cut the tail (x, y, x). end if repeat c ← argminx=a d[x, a] with preference for b a, b ← c, a Append a to chain until length(chain ) ≥ 3 and a = chain [−3] a, b are reciprocal Append (a, b, d[a, b]) to L nearest neighbors. Remove a, b from S n ← (new node label) size [n] ← size [a] + size [b] Update d with the information d[n, x] = d[x, n] = Formula(d[a, x], d[b, x], d[a, b], size [a], size [b], size [x]) - -for all x ∈ S . S ← S ∪ { n} end while return L an unsorted dendrogram end procedure (We use the Python index notation: chain [−2] is the second-to-last element in the list chain .) -25: 26: 27: 28: - -13 - -Figure 5 A union-find data structure suited for the output conversion. -1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: - -procedure Label(L) L ← [] N ← (number of rows in L) + 1 Number of initial nodes. U ← new Union-Find(N ) for (a, b, δ ) in L do Append (U.Efficient-Find(a), U.Efficient-Find(b), δ ) to L U.Union(a, b) end for return L end procedure - -class Union-Find method Constructor(N ) parent ← new int[2N − 1] parent [0, . . . , 2N − 2] ← None 15: nextlabel ← N 16: end method -17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: - -N is the number of data points. 
- -SciPy convention: new labels start at N - -method Union(m, n) parent [m] = nextlabel parent [n] = nextlabel nextlabel ← nextlabel + 1 end method - -SciPy convention: number new labels consecutively - -method Find(n) This works but the search process is not efficient. while parent [n] is not None do n ← parent [n] end while return n end method method Efficient-Find(n) p←n while parent [n] is not None do n ← parent [n] end while while parent [p] = n do p, parent [p] ← parent [p], n end while return n end method end class This speeds up repeated calls. - -14 - -Lines 15 to 19 in NN-chain-core clearly find reciprocal nearest neighbors (a, b) in S . One important detail is that the index b is preferred in the argmin search in line 16, if the minimum is attained at several indices and b realizes the minimum. This can be respected in an implementation with no effort, and it ensures that reciprocal nearest neighbors are indeed found. That is, the list chain never contains a cycle of length > 2, and a chain = [. . . , b, a] with reciprocal nearest neighbors at the end will always be extended by b, never with an element c = b which coincidentally has the same distance to a. After line 19, the chain ends in (b, a, b). The nodes a and b are then joined, and the internal variables are updated as usual. We now show that the remaining iterations produce the same output as if the algorithm had started with the set S := (S \ {a, b}) ∪ {n}, where n is the new node label and the distance array d and the size array are updated accordingly. The only data which could potentially be corrupted is that the list chain could not contain successive nearest neighbors any more, since the new node n could have become the nearest neighbor of a node in the list. At the beginning of the next iteration, the last elements (b, a, b) are removed from chain . The list chain then clearly does not contain a or b at any place any more, since any occurrence of a or b in the list would have led to an earlier pair of reciprocal nearest neighbors, before (b, a, b) was appended to the list. Hence, chain contains only nodes which really are in S . Let e, f be two successive entries in chain , i.e. f is a nearest neighbor of e. Then we know d[e, f ] ≤ d[e, a] d[e, f ] ≤ d[e, b] d[a, b] ≤ d[a, e] d[a, b] ≤ d[b, e] - -Together with the reducibility property (1) (for I = a, J = b, K = e), this implies d[e, f ] ≤ d[e, n]. Hence, f is still the nearest neighbor of e, which proves our assertion. We can therefore be sure that the remaining iterations of NN-chain-core produce the same output as if the algorithm would be run freshly on S . By the inductive assumption, this produces a valid stepwise dendrogram for the set S with N − 1 nodes. Proposition 3 carries out the remainder of the proof, as it shows that the first line (a, b, d[a, b]) of the unsorted dendrogram, when it is sorted into the right place in the dendrogram for the nodes in S , is a valid stepwise dendrogram for the original set S with N nodes. Proposition 3. Let (S, d) be a set with dissimilarities (|S | > 1). Fix a distance update formula which fulfills the requirements in Theorem 1. Let a, b be two distinct nodes in S which are reciprocal nearest neighbors. Define S as (S \ {a, b}) ∪ {n}, where the label n represents the union a ∪ b. Let d be the updated dissimilarity matrix for S , according to the chosen formula. Let L = ((ai , bi , δi )i=0,...,m ) be a stepwise dendrogram for S . Let j be the index such that all δi is a stepwise dendrogram for (S, d). Proof. 
Since a and b are reciprocal nearest neighbors at the beginning, the reducibility property (1) guarantees that they stay nearest neighbors after any number of merging steps between other reciprocal nearest neighbors. Hence, the first j steps in a dendrogram for S cannot contain a or b, since these steps all happen at merging dissimilarities smaller than d[a, b]. This is the point where we must require that the sorting in line 3 of NN-chain-linkage is stable. Moreover, the first j rows of L cannot contain a reference to n: Again by the reducibility property, dissimilarities between n and any other node are at least as big as d[a, b]. Therefore, the first j rows of L are correct for a dendrogram for S. After j steps, we know that no inter-cluster distances in S \ {a, b} are smaller than d[a, b]. Also, d[a, b] is minimal among all distances from a and b, so the row (a, b, d[a, b]) is a valid next row in L.

After this step, we claim that the situation is the same in both settings: The sets S after j steps and the set S after j + 1 steps, including the last one merging a and b into a new cluster n, are clearly equal as partitions of the original set. It is required to check that also the dissimilarities are the same in both settings. This is where we need the second condition in Theorem 1: The row (a, b, d[a, b]) on top of the array L differs from the dendrogram L by j transpositions, where (a, b, d[a, b]) is moved one step downwards. Each transposition happens between two pairs (a, b) and (ai, bi), where all four nodes are distinct, as shown above. The dissimilarity from a distinct fifth node x to the join a ∪ b does not depend on the merging of ai and bi since there is no way in which dissimilarities to ai and bi enter the distance update formula Formula(d[a, x], d[b, x], d[a, b], size[a], size[b], size[x]). The symmetric statement holds for the dissimilarity d[x, ai ∪ bi]. The nodes a, b, ai, bi are deleted after the two steps, so dissimilarities like d[a, ai ∪ bi] can be neglected. The only dissimilarity between active nodes which could be altered by the transposition is d[a ∪ b, ai ∪ bi]. It is exactly the second condition in Theorem 1 that this dissimilarity is independent of the order of merging steps. This finishes the proof of Theorem 1.

We still have to prove that the requirements of Theorem 1 are fulfilled by the “single”, “complete”, “average”, “weighted” and “Ward” schemes:

Proof of Proposition 2. It is easy and straightforward to check from the table in Figure 2 that the distance update schemes in question fulfill the reducibility property. Moreover, the table also conveys that the dissimilarities between clusters in the “single”, “complete” and “average” schemes do not depend on the order of the merging steps. For Ward’s scheme, the global dissimilarity expression in the third column in Figure 2 applies only if the dissimilarity matrix consists of Euclidean distances between vectors (which is the prevalent setting for Ward’s method). For a general argument, note that the global cluster dissimilarity for Ward’s method can also be expressed by a slightly more complicated expression:

d(A, B) = \sqrt{ \frac{1}{|A| + |B|} \left( 2 \sum_{a \in A} \sum_{b \in B} d(a, b)^2 - \frac{|B|}{|A|} \sum_{a \in A} \sum_{a' \in A} d(a, a')^2 - \frac{|A|}{|B|} \sum_{b \in B} \sum_{b' \in B} d(b, b')^2 \right) }

This formula can be proved inductively from the recursive distance update formula for Ward’s method, hence it holds independently of whether the data is Euclidean or not.
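As a quick numerical sanity check of the expression above (a sketch only, assuming NumPy; the helper sum_sq and the variable names are illustrative and not part of the paper), one can confirm that it agrees with the centroid-based Ward formula of Figure 2 in the Euclidean special case:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # cluster A: 5 points in R^3
B = rng.normal(size=(8, 3))   # cluster B: 8 points in R^3
nA, nB = len(A), len(B)

# Ward dissimilarity from cluster centers (Euclidean setting, cf. Figure 2):
# d(A, B) = sqrt(2|A||B| / (|A| + |B|)) * ||c_A - c_B||
cA, cB = A.mean(axis=0), B.mean(axis=0)
d_centroid = np.sqrt(2 * nA * nB / (nA + nB)) * np.linalg.norm(cA - cB)

def sum_sq(X, Y):
    # Sum of squared Euclidean distances over all ordered pairs (x, y).
    return sum(np.linalg.norm(x - y) ** 2 for x in X for y in Y)

# Ward dissimilarity computed from pairwise distances only (the expression above).
d_pairwise = np.sqrt((2 * sum_sq(A, B)
                      - (nB / nA) * sum_sq(A, A)
                      - (nA / nB) * sum_sq(B, B)) / (nA + nB))

assert np.isclose(d_centroid, d_pairwise)

This check only covers the Euclidean case; it is the inductive argument from the recursive update formula, not this computation, that establishes the identity for general dissimilarity data.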
This proves that the dissimilarities in Ward’s scheme are also independent of the order of merging steps. Dissimilarities in the “weighted” scheme, however, do in general depend on the order of merging steps. However, the dissimilarity between joined nodes I ∪ J and K ∪ L is always the mean dissimilarity \frac{1}{4}\,(d[I, K] + d[I, L] + d[J, K] + d[J, L]), independent of the order of steps, and this is all that is required for Proposition 2.

Figure 6 The single linkage algorithm.

 1: procedure MST-linkage(S, d)            S: node labels, d: pairwise dissimilarities
 2:   L ← MST-linkage-core(S, d)
 3:   Stably sort L with respect to the third column.
 4:   L ← Label(L)                          Find node labels from cluster representatives.
 5:   return L
 6: end procedure

 1: procedure MST-linkage-core(S0, d)       S0: node labels, d: pairwise dissimilarities
 2:   L ← [ ]
 3:   c ← (any element of S0)               c: current node
 4:   D0[s] ← ∞ for s ∈ S0 \ {c}
 5:   for i in (1, . . . , |S0| − 1) do
 6:     Si ← Si−1 \ {c}
 7:     for s in Si do
 8:       Di[s] ← min{Di−1[s], d[s, c]}
 9:     end for
10:     n ← argmin_{s ∈ Si} Di[s]           n: new node
11:     Append (c, n, Di[n]) to L
12:     c ← n
13:   end for
14:   return L                              an unsorted dendrogram
15: end procedure

3.3 The single linkage algorithm

In this section, we present and prove correctness of a fast algorithm for single linkage clustering. Gower and Ross (1969) observed that a single linkage dendrogram can be obtained from a minimum spanning tree (MST) of the weighted graph which is given by the complete graph on the singleton set S with the dissimilarities as edge weights. The algorithm here was originally described by Rohlf (1973) and is based on Prim’s algorithm for the MST (see Cormen et al., 2009, § 23.2). The single linkage algorithm MST-linkage is given in Figure 6. In the same way as the NN-chain algorithm, it consists of a core algorithm MST-linkage-core and two postprocessing steps. The output structure of the core algorithm is again an unsorted list of clustering steps with node representatives instead of unique labels. As will be proved, exactly the same postprocessing steps can be used as for the NN-chain algorithm. Rohlf’s algorithm in its original version is a full Prim’s algorithm and maintains enough data to generate the MST. He also mentions a possible simplification which does not do enough bookkeeping to generate an MST but enough for single linkage clustering. It is this simplification that is discussed in this paper. We prove the correctness of this algorithm for two reasons:

• Since the algorithm MST-linkage-core does not generate enough information to reconstruct a minimum spanning tree, one cannot refer to the short proof of Prim’s algorithm in any easy way to establish the correctness of MST-linkage.

• Like for the NN-chain algorithm in the last section, it is not clear a priori that the algorithm resolves ties correctly.

A third algorithm can serve as a warning here (see Section 5 for more details): There is another fast algorithm for single linkage clustering, Sibson’s SLINK algorithm (Sibson, 1973). More or less by coincidence, all three algorithms NN-chain-core, MST-linkage-core and SLINK generate output which can be processed by exactly the same two steps: sorting followed by Label. In case of the SLINK algorithm this works fine if all dissimilarities are distinct but produces wrong stepwise dendrograms in situations when two merging dissimilarities are equal. There is nothing wrong with the SLINK algorithm, however.
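To make the shared postprocessing concrete, here is a minimal Python sketch of the stable sort followed by Label (cf. Figure 5), using a dictionary-based union-find; the names label and find are illustrative and this is not the paper’s reference implementation.

def find(parent, x):
    # Follow parent pointers to the current cluster label of x, then compress
    # the path so that repeated calls are fast (cf. Efficient-Find in Figure 5).
    root = x
    while root in parent:
        root = parent[root]
    while x in parent:
        nxt = parent[x]
        parent[x] = root
        x = nxt
    return root

def label(rows, n_points):
    # Convert the unsorted output of a core algorithm, rows of the form
    # (representative a, representative b, dissimilarity), into a stepwise
    # dendrogram with unique labels; new labels start at n_points, as in SciPy.
    parent = {}
    next_label = n_points
    out = []
    for a, b, delta in sorted(rows, key=lambda row: row[2]):  # Python's sort is stable
        ra, rb = find(parent, a), find(parent, b)
        out.append((ra, rb, delta))
        parent[ra] = parent[rb] = next_label   # merge the two current clusters
        next_label += 1
    return out

Applied to the output of MST-linkage-core or NN-chain-core, this reproduces the two postprocessing steps; whether such a simple pipeline remains valid in the presence of tied dissimilarities is exactly the question the correctness proofs address.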
Sibson supplied a proof for the SLINK algorithm in his paper (Sibson, 1973), but it is written for a (non-stepwise) dendrogram as the output structure, not for a stepwise dendrogram. Hence, the additional information which is contained in a stepwise dendrogram in the case of ties is not provided by all, otherwise correct algorithms. This should be taken as a warning that ties demand more from an algorithm and must be explicitly taken into account when we prove the correctness of the MST-linkage algorithm below. Theorem 4. The algorithm MST-linkage yields an output which can also be generated by Primitive_clustering. We do not explicitly refer to Prim’s algorithm in the following, and we make the proof selfcontained, since the algorithm does not collect enough information to construct a minimum spanning tree. There are unmistakable similarities, of course, and the author got most of the ideas for this proof from Prim’s algorithm (see Cormen et al., 2009, § 23.2). Let us first make two observations about the algorithm MST-linkage. (a) Starting with the full initial set S0 , the algorithm MST-linkage-core chooses a “current node” c in each step and removes it from the current set Si in every iteration. Let c := Si S0 \ Si be the complement of the current set Si . Then Di [s] (s ∈ Si ) is the c distance from Si to s, i.e. Di [s] = min d[s, t]. c -t∈Si - -(b) Let L be the output of MST-linkage-core(S, d). The 2i entries in the first two columns and the first i rows contain only i + 1 distinct elements of S , since the second entry in one row is the first entry in the next row. We prove Theorem 4 by the following stronger variant: Theorem 5. Let L be the output of MST-linkage-core(S, d). For all n Let s(0), . . . , s(n) be the stably sorted indices (i.e. δs(i) ≤ δs(i+1) for all i and s(i) contains an ) additionally contains the point bn . In a diagram: a0 a1 a2 a3 a4 a5 ... . . . am . . . an bn - -The inter-cluster distances in S + from the cluster with bn to the other clusters might be smaller than without the point bn in S − . We show, however, that this does not affect the remaining clustering steps: In each step r > k , we have the following situation for some x ≤ y ≤ s(k ). The point bn might or might not be in the same cluster as bs(r) . . . . ax . . . ay . . . as(r) bs(r) = as(r)+1 ... . . . bn−1 - -Let the distance δs(r) be realized as the distance from bs(r) to ay . From Observation (a) and line 10 in MST-linkage-core, we know that this distance is minimal among the distances from X := {a0 , . . . , as(r) } to all other points in S0 \ X . In particular, the distance from X to bn ∈ S0 \ X is not smaller than δs(r) . This proves that the addition of bn in step k does not change the single linkage clustering in any later step r > k . This completes the inductive proof of Theorem 5 - -4 Performance -In this section, we compare the performance of the algorithms and give recommendations on which algorithm to choose for which clustering method. We compare both the theoretical, asymptotic worst-case performance, and the use-case performance on a range of synthetic random data sets. - -4.1 Asymptotic worst-case performance -Let N denote the problem size, which is in this case the number of input data points. The 2 input size is N 2 ∈ Θ(N ). The asymptotic run-time complexity of MST-linkage-core is obviously Θ(N 2 ), since there are two nested levels of loops in the algorithm (line 8 and implicitly line 10). 
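To make the loop structure explicit, here is a straightforward Python rendering of MST-linkage-core (Figure 6); it is a sketch that assumes the dissimilarities are given as an N × N array-like d indexable as d[c][s] and that the node labels are 0, . . . , N − 1.

def mst_linkage_core(d, N):
    # Single linkage core (Figure 6): returns an unsorted list of rows
    # (current node, new node, dissimilarity).
    L = []
    D = [float("inf")] * N      # D[s]: smallest dissimilarity from the removed nodes to s
    active = [True] * N         # membership in the shrinking set S_i
    c = 0                       # current node (any element of S_0)
    active[c] = False
    for _ in range(N - 1):
        for s in range(N):      # maintain the lower bounds D (the update of line 8)
            if active[s]:
                D[s] = min(D[s], d[c][s])
        n = min((s for s in range(N) if active[s]), key=lambda s: D[s])  # argmin of line 10
        L.append((c, n, D[n]))
        c = n
        active[c] = False
    return L

Each of the N − 1 outer iterations does Θ(N) work in the update sweep and in the argmin, which is where the Θ(N²) total comes from; the rows are then postprocessed by the stable sort and the Label procedure, as in Figure 6.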
The run-time complexity of the NN-chain-core algorithm is also Θ(N 2 ) (Murtagh, 1985, page 86). Postprocessing is the same for both algorithms and is less complex, namely O(N log N ) for sorting and Θ(N ) for Label, so the overall complexity is Θ(N 2 ). This is optimal (in the asymptotic sense): the lower bound is also quadratic since all Θ(N 2 ) input values must be processed. The NN-chain algorithm needs a writable working copy of the input array to store intermediate dissimilarities and otherwise only Θ(N ) additional memory. The generic algorithm has a best-case time complexity of Θ(N 2 ), but without deeper analysis, the worst-case complexity is O(N 3 ). The bottleneck is line 15 in Generic_linkage: In O(N ) iterations, this line might be executed up to O(N ) times and does a minimum search over O(N ) elements, which gives a total upper bound of O(N 3 ). This applies for all clustering schemes except single linkage, where the loop starting at line 14 is never executed and thus the worst-case performance is Θ(N 2 ). The memory requirements for the generic algorithm are similar to the NN-chain algorithm: a working copy of the dissimilarity array and additionally only Θ(N ) temporary memory. In contrast, the MST algorithm does not write to the input array d. All other temporary variables are of size O(N ). Hence, MST-linkage requires no working copy of the input - -20 - -array and hence only half as much memory as Generic_linkage and NN-chain-linkage asymptotically. Anderberg’s algorithm (Anderberg, 1973, pages 135–136) has the same asymptotic bounds as our generic algorithm. The performance bottleneck are again the repeated minimum searches among the updated dissimilarities. Since the generic algorithm defers minimum searches to a later point in the algorithm (if they need to be performed at all, by then), there are at least as many minimum searches among at least as many elements in Anderberg’s algorithm as in the generic algorithm. The only point where the generic algorithm could be slower is the maintenance of the priority queue with nearest neighbor candidates, since this does not exist in Anderberg’s algorithm. The bottleneck here are potentially up to O(N 2 ) updates of a queue in line 36 of Generic_linkage. In the implementation in the next section, the queue is realized by a binary heap, so an update takes O(log N ) time. This could potentially amount to O(N 2 log N ) operations for maintenance of the priority queue. However, a reasonable estimate is that the saved minimum searches in most cases save more time than the maintenance of the queue with O(N ) elements costs, and hence there is a good reason to believe that the generic algorithm is at least as fast as Anderberg’s algorithm. Note that the maintenance effort of the priority queue can be easily reduced to O(N 2 ) instead of O(N 2 log N ) worst case: • A different priority queue structure can be chosen, where the “decrease-key” operation takes only O(1) time. (Note that the bottleneck operation in line 36 of Generic_linkage never increases the nearest-neighbor distance, only decreases it.) The author did not test a different structure since a binary heap convinces by its simple implementation. • Changed keys (minimal distances) need not be updated in the priority queue immediately. Instead, the entire queue might be resorted/regenerated at the beginning of every iteration. This takes N − 1 times O(N ) time with a binary heap. 
Although this lowers the theoretical complexity for the maintenance of the binary queue, it effectively slowed down the algorithms in practice by a small margin. The reason is, of course, that the number and complexity of updates of the priority queue fell far short of their theoretical upper bound in our test data sets (see below). Altogether, the maintenance of the priority queue, as proposed in Figure 3, seems quite optimal from the practical perspective.

4.2 Use-case performance

In addition to the theoretical, asymptotic and worst-case considerations, we also measured the practical performance of the algorithms. Figure 7 shows the run-time of the algorithms for a number of synthetic test data sets (for details see below). The solid lines are the average over the data sets. (The graphs labeled “Day-Edelsbrunner” are discussed in Section 5.) The lightly colored bands show the range from minimum to maximum time over all data sets for a given number of points.

[Figure 7: Performance of several SAHN clustering algorithms. Three log-log panels plot CPU time in s against the number of points, for the methods “single”, “average” (“complete”, “weighted” and “Ward” look very similar) and “centroid” (“median” looks very similar). Legend: Generic algorithm (Figure 3), Anderberg (1973, pages 135–136), NN-chain algorithm (Figure 4), MST-algorithm (Figure 6), Day and Edelsbrunner (1984, Table 5).]

The following observations can be made:

• For single linkage clustering, the MST-algorithm is clearly the fastest one. Together with the fact that it has only half the memory requirements of the other algorithms (if the input array is to be preserved), and thus allows the processing of larger data sets, the MST-algorithm is clearly the best choice for single linkage clustering.

• For the clustering schemes without inversions (all except “centroid” and “median”), the generic algorithm, the NN-chain algorithm and Anderberg’s algorithm have very similar performance. The NN-chain algorithm is the only one with guaranteed O(N²) performance here. We can conclude that the good worst-case performance can be had here without any cutbacks to the use-case performance.

• For the “centroid” and “median” methods, we see a very clear disadvantage to Anderberg’s algorithm. Here, the worst-case cubic time complexity occurs already in the random test data sets. This happens with great regularity, over the full range of input sizes. Our Generic_linkage algorithm, on the other hand, does not suffer from this weakness: even though the theoretical worst-case bounds are the same, the complexity does not rise above quadratic behavior in our range of test data sets. Hence, we have grounds to assume that Generic_linkage is much faster in practice.

4.3 Conclusions

Based on the theoretical considerations and use-case tests, we can therefore recommend algorithms for the various distance update schemes as follows:

• “single” linkage clustering: The MST-algorithm is the best, with respect to worst-case complexity, use-case performance and memory requirements.

• “complete”, “average”, “weighted”, “ward”: The NN-chain algorithm is preferred, since it guarantees O(N²) worst case complexity without any disadvantage to practical performance and memory requirements.
• “centroid”, “median”: The generic clustering algorithm is the best choice, since it can handle inversions in the dendrogram and the performance exhibits quadratic complexity in all observed cases. Of course, the timings in the use-case tests depend on implementation, compiler optimizations, machine architecture and the choice of data sets. Nevertheless, the differences between the algorithms are very clear here, and the comparison was performed with careful implementations in the identical environment. The test setup was as follows: All algorithms were implemented in C++ with an interface to Python (van Rossum et al.) and the scientific computing package NumPy (Num) to handle the input and output of arrays. The test data sets are samples from mixtures of multivariate Gaussian distributions with unity covariance matrix in various dimensions (2, 3, 10, 200) with √ various numbers of modes (1, 5, [ N ]), ranging from N = 10 upwards until memory was exhausted (N = 20000 except for single linkage). The centers of the Gaussian distributions are also distributed by a Gaussian distribution. Moreover, for the methods for which it makes sense (single, complete, average, weighted: the “combinatorial” methods), we also generated 10 test sets per number of input points with a uniform distribution of dissimilarities. The timings were obtained on a PC with an Intel dual-core CPU T7500 with 2.2 GHz clock speed and 4GB of RAM and no swap space. The operating system was Ubuntu 11.04 64-bit, Python version: 2.7.1, NumPy version: 1.5.1, compiler: GNU C++ compiler, version 4.5.2. Only one core of the two available CPU cores was used in all computations. - -5 Alternative algorithms -The MST algorithm has the key features that it (1) needs no working copy of the Θ(N 2 ) input array and only Θ(N ) working memory, (2) is fast since it reads every input dissimilarity only once and otherwise deals only with Θ(N ) memory. There is a second algorithm with these - -23 - -characteristics, Sibson’s SLINK algorithm (Sibson, 1973). It is based on the insight that a single linkage dendrogram for N + 1 points can be computed from the dendrogram of the first N points plus a single row of distances (d[N, 0], . . . , d[N, N − 1]). In this fashion, the SLINK algorithm even reads the input dissimilarities in a fixed order, which can be an advantage over the MST algorithm if the favorable input order can be realized in an application, or if dissimilarities do not fit into random-access memory and are read from disk. However, there is one important difference: even though the output data format looks deceptively similar to the MST algorithm (the output can be converted to a stepwise dendrogram by exactly the same process: sorting with respect to dissimilarities and a union-find procedure to generate node labels from cluster representatives), the SLINK algorithm cannot handle ties. This is definite, since e.g. the output in the example situation on page 5 is the same in all three cases, and hence no postprocessing can recover the different stepwise dendrograms. There is an easy way out by specifying a secondary order d(i, j ) ≺ d(k, l) :⇐⇒ d(i, j ) analysis, but based on his short remarks, the author cannot see how Křivánek’s algorithm achieves O(N 2 ) worst-case performance for SAHN clustering. - -6 Extension to vector data -If the input to a SAHN clustering algorithm is not the array of pairwise dissimilarities but N points in a D-dimensional real vector space, the lower bound Ω(N 2 ) on time complexity does not hold any more. 
Since much of the time in an SAHN clustering scheme is spent on nearest-neighbor searches, algorithms and data structures for fast nearest-neighbor searches can potentially be useful. The situation is not trivial, however, since (1) in the “combinatorial” methods (e.g. single, complete, average, weighted linkage) the inter-cluster distances are not simply defined as distances between special points like cluster centers, and (2) even in the “geometric” methods (the Ward, centroid and median schemes), points are removed and new centers added with the same frequency as pairs of closest points are searched, so a dynamic nearest-neighbor algorithm is needed, which handles the removal and insertion of points efficiently. Moreover, all known fast nearest-neighbor algorithms lose their advantage over exhaustive search with increasing dimensionality. Additionally, algorithms will likely work for one metric in RD but not universally. Since this paper is concerned with the general situation, we do not go further into the analysis of the “stored data approach” (Anderberg, 1973, § 6.3). We only list at this point what can be achieved with the algorithms from this paper. This will likely be the best solution for high-dimensional data or general-purpose algorithms, but there are better solutions for low-dimensional data outside the scope of this paper. The suggestions below are at least helpful to process large data sets since memory requirements are of class Θ(N D), but they do not overcome their Ω(N 2 ) lower bound on time complexity. • The MST algorithm for single linkage can compute distances on-the-fly. Since every pairwise dissimilarity is read in only once, there is no performance penalty compared to first computing the whole dissimilarity matrix and then applying the MST algorithm. Quite the contrary, computing pairwise distances in-process can result in faster execution since much less memory must be reserved and accessed. The MST algorithm is suitable for any dissimilarity measure which can be computed from vector representations (that is, all scale types are possible, e.g. R-valued measurements, binary sequences and categorical data). • The NN-chain algorithm is suitable for the “Ward” scheme, since inter-cluster distances can be defined by means of centroids as in Figure 2. The initial inter-point dissimilarities must be Euclidean distances (which is anyway the only setting in which Ward linkage describes a meaningful procedure). • The generic algorithm is suitable for the “Ward”, “centroid” and “median” scheme on Euclidean data. There is a simpler variant of the Generic_linkage algorithm in Section 6, which works even faster in this setting. The principle of the algorithm Generic_ linkage_variant is the same: each array entry mindist [x] maintains a lower bound on all dissimilarities d[x, y ] for nodes with label y > x. The Generic_linkage algorithm is designed to work efficiently with a large array of pairwise dissimilarities. For this purpose, the join of two nodes a and b re-uses the label b, which facilitates in-place updating of the dissimilarity array in an implementation. The Generic_linkage_variant algorithm, in contrast, generates a unique new label for each new node, which is smaller than all existing labels. Since the new label is at the beginning of the (ordered) list of - -25 - -Figure 8 The generic clustering algorithm (variant). -1: - -procedure Generic_linkage_variant(N, d) d is either an array or a function which computes dissimilarities from cluster centers. . . . 
(Lines 2 to 13 are the same as in Generic_linkage.) for x in S \ {N − 1} do . . . while b ∈ / S do . . . end while Remove a and b from Q. Append (a, b, δ ) to L. Create a new label n ← −i size [n] ← size [a] + size [b] S ← (S \ {a, b}) ∪ {n} for x in S \ {n} do Extend the distance information. d[x, n] ← d[n, x] ← Formula(d[a, x], d[b, x], d[a, b], size [a], size [b], size [x]) end for or Compute the cluster center for n as in Figure 2. n _nghbr [n] ← argminx>n d[n, x] Insert (n, d[n, n _nghbr [n]]) into mindist and Q end for return L end procedure Recalculation of nearest neighbors, if necessary. - -5: - -14: - -21: 22: 23: 24: 25: 26: 27: 28: 29: 27: 30: 31: 32: 33: 34: - -nodes and not somewhere in the middle, the bookkeeping of nearest neighbor candidates and minimal distances is simpler in Generic_linkage_variant: in particular, the two loops in lines 28–38 of Generic_linkage can be disposed of entirely. Moreover, experiments show that Generic_linkage_variant needs much less recalculations of nearest neighbors in some data sets. However, both algorithms are similar, and which one is faster in an implementation seems to depend strongly on the actual data structures and their memory layout. - -Another issue which is not in the focus of this paper is that of parallel algorithms. For the “stored matrix approach”, this has a good reason since the balance of memory requirements versus computational complexity does not make it seem worthwhile to attempt parallelization with current hardware. This changes for vector data, when the available memory is not the limiting factor and the run-time is pushed up by bigger data sets. In high-dimensional vector spaces, the advanced clustering algorithms in this paper require little time compared to the computation of inter-cluster distances. Hence, parallelizing the nearest-neighbor searches with their inherent distance computations appears a fruitful and easy way of sharing the workload. The situation becomes less clear for low-dimensional data, however. - -26 - -7 Conclusion -Among the algorithms for sequential, agglomerative, hierarchic, nonoverlapping (SAHN) clustering on data with a dissimilarity index, three current algorithms are most efficient: Rohlf’s algorithm MST-linkage for single linkage clustering, Murtagh’s algorithm NNchain-linkage for the “complete”, “average”, “weighted” and “Ward” schemes, and the author’s Generic_linkage algorithm for the “centroid” and “median” schemes and the “flexible” family. The last algorithm can also be used for an arbitrary distance update formula. There is even a simpler variant Generic_linkage_variant, which seems to require less internal calculations, while the original algorithm is optimized for in-place updating of a dissimilarity array as input. The Generic_linkage algorithm and its variant are new; the other two algorithms were described before, but for the first time they are proved to be correct. - -Acknowledgments -This work was funded by the National Science Foundation grant DMS-0905823 and the Air Force Office of Scientific Research grant FA9550-09-1-0643. - -References -NumPy: Scientific computing tools for Python. Available at http://numpy.scipy.org/. Michael R. Anderberg. Cluster analysis for applications. Academic Press, New York, 1973. ISBN 0120576503. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009. William H. E. Day. Complexity theory: an introduction for practitioners of classification. 
Another issue which is not the focus of this paper is that of parallel algorithms. For the “stored matrix approach” this has a good reason: the balance of memory requirements versus computational complexity does not make it seem worthwhile to attempt parallelization on current hardware. This changes for vector data, when the available memory is not the limiting factor and the run time is pushed up by bigger data sets. In high-dimensional vector spaces, the advanced clustering algorithms in this paper require little time compared to the computation of inter-cluster distances. Hence, parallelizing the nearest-neighbor searches, with their inherent distance computations, appears to be a fruitful and easy way of sharing the workload. The situation is less clear for low-dimensional data, however.

7 Conclusion

Among the algorithms for sequential, agglomerative, hierarchic, nonoverlapping (SAHN) clustering on data with a dissimilarity index, three current algorithms are the most efficient: Rohlf's algorithm MST-linkage for single linkage clustering, Murtagh's algorithm NN-chain-linkage for the “complete”, “average”, “weighted” and “Ward” schemes, and the author's Generic_linkage algorithm for the “centroid” and “median” schemes and the “flexible” family. The last algorithm can also be used with an arbitrary distance update formula. There is also a simpler variant, Generic_linkage_variant, which seems to require fewer internal calculations, while the original algorithm is optimized for in-place updating of a dissimilarity array given as input. The Generic_linkage algorithm and its variant are new; the other two algorithms have been described before, but here they are proved correct for the first time.

Acknowledgments

This work was funded by the National Science Foundation grant DMS-0905823 and the Air Force Office of Scientific Research grant FA9550-09-1-0643.

References

NumPy: Scientific computing tools for Python. Available at http://numpy.scipy.org/.
Michael R. Anderberg. Cluster analysis for applications. Academic Press, New York, 1973. ISBN 0120576503.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.
William H. E. Day. Complexity theory: an introduction for practitioners of classification. In Clustering and Classification, pages 199–233. World Scientific Publishing, River Edge, NJ, 1996.
William H. E. Day and Herbert Edelsbrunner. Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification, 1(1):7–24, 1984. doi: 10.1007/BF01890115.
Daniel Defays. An efficient algorithm for a complete link method. The Computer Journal, 20(4):364–366, 1977. doi: 10.1093/comjnl/20.4.364.
Damian Eads. Hierarchical clustering (scipy.cluster.hierarchy), 2007. Package for SciPy version 0.9.0. Available at http://www.scipy.org.
Brian S. Everitt, Sabine Landau, Morven Leese, and Daniel Stahl. Cluster Analysis. John Wiley & Sons, 5th edition, 2011. doi: 10.1002/9780470977811.
Allan D. Gordon. A review of hierarchical classification. Journal of the Royal Statistical Society, Series A (General), 150(2):119–137, 1987. doi: 10.2307/2981629.
John C. Gower and G. J. S. Ross. Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society, Series C (Applied Statistics), 18(1):54–64, 1969. doi: 10.2307/2346439.
Pierre Hansen and Brigitte Jaumard. Cluster analysis and mathematical programming. Mathematical Programming, 79(1–3):191–215, 1997. doi: 10.1007/BF02614317.
Scott Huddleston and Kurt Mehlhorn. A new data structure for representing sorted lists. Acta Informatica, 17(2):157–184, 1982. doi: 10.1007/BF00288968.
Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.
Stephen C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967. doi: 10.1007/BF02289588.
Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001. Available at http://www.scipy.org.
Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York, 1990. doi: 10.1002/9780470316801.
Mirko Křivánek. Connected admissible hierarchical clustering. KAM series (90-189), Department of Applied Mathematics, Charles University, Prague (CZ), 1990. Available at http://kam.mff.cuni.cz/~kamserie/serie/clanky/1990/s189.pdf.
G. N. Lance and W. T. Williams. A general theory of classificatory sorting strategies. The Computer Journal, 9(4):373–380, 1967. doi: 10.1093/comjnl/9.4.373.
Kurt Mehlhorn and Athanasios Tsakalidis. Data structures. In Handbook of Theoretical Computer Science, Vol. A, pages 301–341. Elsevier, Amsterdam, 1990. Available at http://www.mpi-sb.mpg.de/~mehlhorn/ftp/DataStructures.pdf.
Fionn Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354–359, 1983. doi: 10.1093/comjnl/26.4.354.
Fionn Murtagh. Complexities of hierarchic clustering algorithms: state of the art. Computational Statistics Quarterly, 1(2):101–113, 1984. Available at http://thames.cs.rhul.ac.uk/~fionn/old-articles/complexities/.
Fionn Murtagh. Multidimensional Clustering Algorithms, volume 4 of Compstat Lectures. Physica-Verlag, Würzburg/Wien, 1985. ISBN 3-7051-0008-4. Available at http://www.classification-society.org/csna/mda-sw/.
Daniel Müllner. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Preprint, 2011. Will be available at http://math.stanford.edu/~muellner.
R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. http://www.R-project.org.
F. James Rohlf. Hierarchical clustering using the minimum spanning tree. The Computer Journal, 16:93–95, 1973. Available at http://life.bio.sunysb.edu/ee/rohlf/reprints.html.
F. James Rohlf. Single-link clustering algorithms. In P. R. Krishnaiah and L. N. Kanal, editors, Classification, Pattern Recognition and Reduction of Dimensionality, volume 2 of Handbook of Statistics, pages 267–284. Elsevier, 1982. doi: 10.1016/S0169-7161(82)02015-X.
R. Sibson. SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16:30–34, 1973. doi: 10.1093/comjnl/16.1.30.
Peter H. A. Sneath and Robert R. Sokal. Numerical Taxonomy. W. H. Freeman, San Francisco, 1973.
The MathWorks, Inc. MATLAB, 2011. http://www.mathworks.com.
Guido van Rossum et al. Python programming language. Available at http://www.python.org.
Wolfram Research, Inc. Mathematica, 2010. http://www.wolfram.com.

Daniel Müllner
Stanford University
Department of Mathematics
450 Serra Mall, Building 380
Stanford, CA 94305
E-mail: muellner@math.stanford.edu
http://math.stanford.edu/~muellner

+valid index here; however, for the single linkage method, it makes sense to choose b: if the nearest neighbor of a node was at index a, it is now at b, which represents the join a ∪ b. The remaining code in the main loop ensures that the array n_nghbr still contains lower bounds on the distances to the nearest neighbors. If the distance from the new cluster x to a cluster b
\ No newline at end of file