# LDA vs Document Clustering

I was asked at an interview what the difference is between LDA and document clustering. I tried to answer by contrasting the generative models assumed by the two approaches. In hindsight, a much simpler example would have been far more effective.
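The contrast between the two generative assumptions can be sketched in code. The following is a hypothetical illustration using scikit-learn (the toy corpus and all names are invented, not from the original post): clustering assigns each document to exactly one cluster, while LDA infers a per-document mixture over topics.

```python
# Sketch: hard clustering vs LDA topic mixtures on a toy corpus.
# The corpus, cluster/topic counts, and random seeds are assumptions
# made purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "plain bread flour water",
    "plain bread flour yeast",
    "seed bread sunflower seeds",
    "seed bread pumpkin seeds flour",
]
X = CountVectorizer().fit_transform(docs)

# Clustering: each document receives exactly one cluster label.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# LDA: each document receives a distribution over topics (rows sum to 1).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)

print(labels)           # one hard label per document
print(theta.round(2))   # a topic mixture per document
```

The key difference is visible in the outputs: `labels` is a single integer per document, whereas `theta` lets a document be, say, mostly one topic with a small share of another.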

Imagine you have a dataset of objects that you can broadly classify as "plain bread" and "bread with seeds". For this example, it is important that these objects share some similarity, but also have important differences:

# Topic Keywords Case Study

In this post, I present a case study on a corpus of 10,000 news articles. We will investigate the topic structure of the corpus by gradually "freezing" topics through specified keywords and observing which other topics emerge. The process shows how to extract useful topics from a corpus so that they provide a meaningful basis for topic detection in future articles.

Limitations: 10,000 news articles represent only about two days of news from 400 top world newspapers and blogs, so the topic structure will be heavily biased towards the events reported during this period. Also, I will use only 250 Gibbs sampling iterations after burn-in to infer the topics.
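The post does not restate how the library "freezes" a topic internally, but one common way to seed a topic with keywords is to give those words a larger prior weight in that topic's word distribution. A minimal sketch, with an invented vocabulary, keyword sets, and boost factor (all assumptions, not the library's actual mechanism):

```python
# Sketch: seeding ("freezing") topics via asymmetric per-topic word priors.
# A Gibbs sampler would then use beta[k, w] in place of a single symmetric
# beta when sampling topic assignments.
import numpy as np

vocab = ["election", "vote", "match", "goal", "market", "stocks"]
word_id = {w: i for i, w in enumerate(vocab)}

# Hypothetical seed keywords for two of the topics.
seed_keywords = {
    0: ["election", "vote"],   # topic 0: politics
    1: ["match", "goal"],      # topic 1: sport
}
num_topics, base_beta, boost = 3, 0.01, 5.0

# Start from a symmetric prior, then boost seed words in their topics;
# the remaining topics stay free to pick up whatever else is in the corpus.
beta = np.full((num_topics, len(vocab)), base_beta)
for k, words in seed_keywords.items():
    for w in words:
        beta[k, word_id[w]] += boost

print(beta)
```

With seeded topics pinned down like this, re-running inference shows which additional topics the sampler finds in the unseeded slots.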

# Release v.0.0.2

Released version 0.0.2 of the Java source code. This release allows specifying LDA topic keywords. Please see the software page for downloads.

# Release v.0.0.1

The first version of the Java source code for running the heavily optimised LDA Gibbs sampling has been released. Please see the software page for details.

Update: Fix 1 has now been released. ThreadCount was incorrectly used instead of TopicCount in the $$\alpha$$ and $$\beta$$ configuration objects.

Please click the link below to see the 20-topic output produced from 10,000 news articles after running just 200 iterations. The first 100 iterations are burn-in, with the temperature annealed from 1.0 to 0.1, followed by 100 sampling iterations at temperature 0.1. The run completed in 15 seconds on a 4-core machine.

# Simulated Annealing for Dirichlet Priors in LDA

When estimating the parameters of the LDA (Latent Dirichlet Allocation) model using Gibbs sampling, fixing the Dirichlet priors at their small target values from the very first iteration hurts the mixing of the sampler before it has found a good approximation of the target distribution.

An alternative is to initialise the Dirichlet priors with relatively high values of alpha and then gradually decrease them during the burn-in period. This allows the sampler to locate the approximate region of interest faster in the initial stages, while still sampling at the target prior values after burn-in.
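The schedule described above can be sketched as follows. The geometric decay shape and the constants (start 1.0, target 0.1, 100 burn-in iterations, matching the temperatures mentioned elsewhere on this blog) are assumptions for illustration; the paper linked below gives the actual scheme.

```python
# Sketch of an annealing schedule for the Dirichlet prior alpha:
# start high, decay geometrically to the target value during burn-in,
# then hold the target value for the sampling iterations.

def annealed_alpha(iteration, burn_in, alpha_start=1.0, alpha_target=0.1):
    """Geometric interpolation from alpha_start down to alpha_target."""
    if iteration >= burn_in:
        return alpha_target
    frac = iteration / burn_in
    return alpha_start * (alpha_target / alpha_start) ** frac

# Each Gibbs iteration would use the current value of alpha:
schedule = [annealed_alpha(t, burn_in=100) for t in range(0, 201, 50)]
print([round(a, 3) for a in schedule])  # decays 1.0 -> 0.1, then holds 0.1
```

The same schedule can be applied to beta, since both priors shape how concentrated the sampled multinomials are.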

This article describes the application of the simulated annealing technique to MCMC inference of multinomial distributions with Dirichlet priors in LDA. It is implemented in my NLP library for optimised LDA Gibbs sampling (see the software page). The full article can be found here (PDF).